# DeepSense: a Deep Learning Method for Full-sentence Search of Biomedical Literature

## Abstract

## Introduction

## Dataset Development

In [1]:
import os
import json
import pandas as pd

In [2]:
base_path = './project/src_new'

### Training and Validation Datasets

In [3]:
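def get_train_val_stats(data_file):
    """Print sentence and instance counts for a tab-separated train/validation file.

    Columns 0 and 1 together identify a sentence; the last column is the label (1 or 0).
    """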
    sents = set()
    p = 0
    n = 0
    with open(data_file) as f:
        next(f)
        for row in f:
            row = row.strip().split('\t')
            sents.add(f"{row[0]}|{row[1]}")
            if row[-1] == '1': p += 1
            if row[-1] == '0': n += 1
    print(f"{'Total Sentences':30}{len(sents)}\n{'Positive Instances':30}{p}")
    print(f"{'Negative Instances':30}{n}\n{'Total Instances':30}{p+n}")

Training Dataset: input/train_sentences.tsv

Some sentences cite more than one paper and therefore contribute more than one positive instance, giving 936,591 positive instances in total. For each positive instance, two negative instances are sampled. Some of the sampled negatives are duplicates, so 1,870,387 negative instances remain after the duplicates are removed.
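
The sampling scheme described above can be sketched as follows. This is only an illustration under assumptions, not the project's actual code: the function name `sample_negatives`, the `(sentence, pmid)` pair format, the candidate pool, and the use of uniform random sampling are all hypothetical. The source states only that two negatives are drawn per positive and that duplicates are removed.

```python
import random

def sample_negatives(positives, candidate_pmids, k=2, seed=0):
    """Draw k negatives per positive (sentence, pmid) pair and drop duplicates.

    positives: list of (sentence, cited_pmid) pairs (hypothetical format).
    candidate_pmids: pool of paper ids to sample negatives from (assumed).
    """
    rng = random.Random(seed)
    cited = {}
    for sentence, pmid in positives:
        cited.setdefault(sentence, set()).add(pmid)
    pool = sorted(candidate_pmids)
    negatives = set()  # a set silently removes duplicated (sentence, pmid) pairs
    for sentence, _ in positives:
        # never sample a paper that the sentence actually cites
        choices = [p for p in pool if p not in cited[sentence]]
        for pmid in rng.sample(choices, min(k, len(choices))):
            negatives.add((sentence, pmid))
    return negatives

# toy data: sentence "s1" has two citations, so it is sampled for twice
positives = [("s1", "p1"), ("s1", "p2"), ("s2", "p3")]
negs = sample_negatives(positives, {"p1", "p2", "p3", "p4", "p5"})
print(len(negs), "unique negatives for", len(positives), "positives")
```

If exactly two negatives were drawn for each of the 936,591 positives, 1,873,182 negatives would have been drawn, so the reported 1,870,387 would correspond to 2,795 duplicates removed.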

In [4]:
data_file = f"{base_path}/input/train_sentences.tsv"
get_train_val_stats(data_file)

Total Sentences               854101
Positive Instances            936591
Negative Instances            1870387
Total Instances               2806978


Validation Dataset: input/valid_sentences.tsv

In [5]:
data_file = f"{base_path}/input/valid_sentences.tsv"
get_train_val_stats(data_file)

Total Sentences               145455
Positive Instances            148269
Negative Instances            296128
Total Instances               444397


### Test Datasets

- SQL_BM25: case_SQL_testing_final_complete
- PubMed_TFIDF: case_PubMed_testing_final_complete
- PubMed_BM: case_PubMed_BM_testing_final_complete
- Google_Scholar: case_Google_scholar_testing_final_complete

Test datasets development process:

- Build citing-sentence and cited-paper pairs from PMC articles (neither the code nor the original data is available)
  - Output data format: citing_sentence, citing_sentence_pmid, cited_paper_pmid, (citing_sentence_original_text)
- Acquire search returns for each citing sentence query
  - Code: create_test_search_PubMed.py <- PM_function.py/create_data_PubMed
  - Output data format: citing_sentence, citing_sentence_pmid, search_returned_paper_pmid, cited_paper_pmid
- Create final test datasets
  - Code: create_PM_final_test.py

Issues:

- Arise when creating the final test datasets (code: create_PM_final_test.py)
  - The information dictionary of all PubMed papers is incomplete
    - A forward search is used to find the publishing year of the citing paper when its pmid is not in the dictionary
    - Search-returned papers are skipped when their pmids are not in the dictionary
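
The second workaround listed above (skipping search-returned papers whose pmids are missing from the dictionary) might look roughly like the sketch below. The function name `filter_search_returns` and the shape of `paper_info` are assumptions made for illustration; the logic in create_PM_final_test.py is not shown in this notebook.

```python
def filter_search_returns(returned_pmids, paper_info):
    """Split returned pmids into those with and without a dictionary entry.

    paper_info: dict keyed by pmid (hypothetical structure).
    """
    kept, skipped = [], []
    for pmid in returned_pmids:
        (kept if pmid in paper_info else skipped).append(pmid)
    return kept, skipped

# toy data: pmid "102" has no entry and is therefore skipped
paper_info = {"101": {"year": 2015}, "103": {"year": 2018}}
kept, skipped = filter_search_returns(["101", "102", "103"], paper_info)
print(kept, skipped)
```

Returning the skipped ids as well as the kept ones makes it possible to report how many search returns were dropped because of the incomplete dictionary.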

In [6]:
# Test sentences record
with open(f"{base_path}/test_sentences_records.json") as f:
    test_sentences_records_all = json.load(f)
len(test_sentences_records_all)

96950

In [7]:
# Check publishing year of test sentences and keep only sentences with 1 citation
m, n, p = 0, 0, 0
test_sentences_records = {}
for k, v in test_sentences_records_all.items():
    if len(v['year']) > 1: m += 1
    if len(v['year']) == 1: n += 1
    if len(v['citations']) > 1:
        p += 1
    else:
        test_sentences_records[k] = v
print(f"{'Sentences with 1 publishing year':40}{n}\n{'Sentences with >1 publishing year':40}{m}")
print(f"{'Sentences with >1 citations':40}{p}")
print(f"{'Sentences with only 1 citation':40}{len(test_sentences_records)}")

Sentences with 1 publishing year        96950
Sentences with >1 publishing year       0
Sentences with >1 citations             4703
Sentences with only 1 citation          92247


In [8]:
# check citations of test sentences
m, n = 0, 0
cites = set()
for v in test_sentences_records.values():
    n += len(v['citations'])
    if len(v['citations']) > 1: m += 1
    for cite in v['citations']:
        cites.add(cite)
print(f"{'Total citations':30}{n}\n{'Total unique cited papers':30}{len(cites)}\n{'Sentences with >1 citations':30}{m}")

Total citations               92247
Total unique cited papers     82639
Sentences with >1 citations   0


In [9]:
# retain only sentences with 5 to 50 tokens (inclusive) for the final test
test_sentences = {}
m, n = 0, 0
# for k, v in test_sentences_records_all.items():
for k, v in test_sentences_records.items():
    if len(k.split('|')[0].split()) < 5:
        m += 1
    elif len(k.split('|')[0].split()) > 50:
        n += 1
    else:
        test_sentences[k] = v
print(f"{'Sentences with <5 tokens':35}{m}\n{'Sentences with >50 tokens':35}{n}")
print(f"{'Sentences retained for test':35}{len(test_sentences)}")

Sentences with <5 tokens           217
Sentences with >50 tokens          1273
Sentences retained for test        90757


In [10]:
# check citations of final test sentences
m, n = 0, 0
cites = set()
more_cites_sentences = {}
for k, v in test_sentences.items():
    n += len(v['citations'])
    if len(v['citations']) > 1:
        more_cites_sentences[k] = v
        m += 1
    for cite in v['citations']:
        cites.add(cite)
print(f"{'Total citations':30}{n}\n{'Total unique cited papers':30}{len(cites)}\n{'Sentences with >1 citations':30}{m}")

Total citations               90757
Total unique cited papers     81411
Sentences with >1 citations   0


## Evaluation

Functions for stats calculation

In [11]:
def get_test_stats(test_path, test_sentences_records):
    sents = set()
    extra_sents = set()
    cites = set()
    extra_cites = set()
    n, m = 0, 0
    for file in os.listdir(f"{test_path}"):
        with open(f"{test_path}/{file}") as ifile:
            next(ifile)
            for row in ifile:
                row = row.strip().split("\t")
                sentence = f"{row[0]}|{row[1]}"
                citation = row[3]
                if sentence in test_sentences_records:
                    sents.add(sentence)
                    cites.add(citation)
                    n += 1
                else:
                    extra_sents.add(sentence)
                    extra_cites.add(citation)
                    m += 1 
                    
    #             if citation in test_sentences_records[sentence]['citations']: m += 1
    print(f"Test Sentences: {len(sents):12}\tExtra Sentences: {len(extra_sents)}")
    print(f"Test Citations: {len(cites):12}\tExtra Citations: {len(extra_cites.difference(cites))}")
    print(f"Test Instances: {n:12}\tExtra Instances: {m}")

In [12]:
def get_pred_stats(pred_path, test_sentences_records):
    sents = set()
    extra_sents = set()
    cites = set()
    extra_cites = set()
    n, m = 0, 0
    for file in os.listdir(f"{pred_path}"):
        with open(f"{pred_path}/{file}") as ifile:
            next(ifile)
            for row in ifile:
                row = row.strip().split("\t")
                sentence = '|'.join(row[0].split('|')[:2])
                citation = row[1]
                if sentence in test_sentences_records:
                    sents.add(sentence)
                    cites.add(citation)
                    n += 1
                else:
                    extra_sents.add(sentence)
                    extra_cites.add(citation)
                    m += 1
    #             if citation in test_sentences_records[sentence]['citations']: m += 1
    print(f"Test Sentences: {len(sents):12}\tExtra Sentences: {len(extra_sents)}")
    print(f"Test Citations: {len(cites):12}\tExtra Citations: {len(extra_cites.difference(cites))}")
    print(f"Test Instances: {n:12}\tExtra Instances: {m}")

In [42]:
def get_pred_results(pred_path, test_sentences_records):
    pred_results = {}
    serank_top1, serank_top10, serank_top20, serank_top100 = 0, 0, 0, 0
    serr_top10, serr_top20, serr_top100 = [], [], []
    rerank_top1, rerank_top10, rerank_top20, rerank_top100 = 0, 0, 0, 0
    rerr_top10, rerr_top20, rerr_top100 = [], [], []
    better_serank, tie_rank, better_rerank = 0, 0, 0
    num_dup_citations = 0
    for file in os.listdir(pred_path):
        with open(f"{pred_path}/{file}") as ifile:
            next(ifile)
            for row in ifile:
                row = row.strip().split("\t")
                sentence = '|'.join(row[0].split('|')[:2])
                citation = row[1]
                serank = int(row[2])
                rerank = int(row[3])
                if sentence in test_sentences_records:
                    if citation in test_sentences_records[sentence]['citations']:
                        pred_results[sentence] = pred_results.get(sentence, {})
                        if citation not in pred_results[sentence]:
                            pred_results[sentence][citation] = pred_results[sentence].get(citation, [])
                            pred_results[sentence][citation].append((serank, rerank))
                            if serank == 1: serank_top1 += 1
                            if serank <= 10:
                                serank_top10 += 1
                                serr_top10.append(1/serank)
                            if serank <= 20:
                                serank_top20 += 1
                                serr_top20.append(1/serank)
                            if serank <= 100:
                                serank_top100 += 1
                                serr_top100.append(1/serank)

                            if rerank == 1: rerank_top1 += 1
                            if rerank <= 10:
                                rerank_top10 += 1
                                rerr_top10.append(1/rerank)
                            if rerank <= 20:
                                rerank_top20 += 1
                                rerr_top20.append(1/rerank)
                            if rerank <= 100:
                                rerank_top100 += 1
                                rerr_top100.append(1/rerank)

                            if serank < rerank: better_serank += 1
                            if serank == rerank: tie_rank += 1
                            if rerank < serank: better_rerank += 1
                        else: num_dup_citations += 1
    # Note: the "MAP" rows below are computed as hits@k / k / N, i.e. mean precision@k.
    # With a single relevant citation per sentence, average precision per query equals
    # the reciprocal rank, so MAP@k in the usual sense coincides with the "MRR" rows.
    
    print(f"{'Rank':20}\t{'Top1':8}\t{'Top10':8}\t{'Top20':8}\t{'Top100'}")
    print(f"{80*'-'}")
    print(f"{'Search':20}\t{serank_top1:<8}\t{serank_top10:<8}\t{serank_top20:<8}\t{serank_top100}")
    print(f"{'Search R@k':20}\t{serank_top1/len(pred_results):<8.4f}\t{serank_top10/len(pred_results):<8.4f}\t{serank_top20/len(pred_results):<8.4f}\t{serank_top100/len(pred_results):<8.4f}")
    print(f"{'Search MAP':20}\t{serank_top1/len(pred_results):<8.4f}\t{serank_top10*1/10/len(pred_results):<8.4f}\t{serank_top20*1/20/len(pred_results):<8.4f}\t{serank_top100*1/100/len(pred_results):<8.4f}")
    print(f"{'Search MRR':20}\t{serank_top1/len(pred_results):<8.4f}\t{sum(serr_top10)/len(pred_results):<8.4f}\t{sum(serr_top20)/len(pred_results):<8.4f}\t{sum(serr_top100)/len(pred_results):<8.4f}")
    print()
    print(f"{'Rerank':20}\t{rerank_top1:<8}\t{rerank_top10:<8}\t{rerank_top20:<8}\t{rerank_top100}")
    print(f"{'Rerank R@k':20}\t{rerank_top1/len(pred_results):<8.4f}\t{rerank_top10/len(pred_results):<8.4f}\t{rerank_top20/len(pred_results):<8.4f}\t{rerank_top100/len(pred_results):<8.4f}")
    print(f"{'Rerank MAP':20}\t{rerank_top1/len(pred_results):<8.4f}\t{rerank_top10*1/10/len(pred_results):<8.4f}\t{rerank_top20*1/20/len(pred_results):<8.4f}\t{rerank_top100*1/100/len(pred_results):<8.4f}")
    print(f"{'Rerank MRR':20}\t{rerank_top1/len(pred_results):<8.4f}\t{sum(rerr_top10)/len(pred_results):<8.4f}\t{sum(rerr_top20)/len(pred_results):<8.4f}\t{sum(rerr_top100)/len(pred_results):<8.4f}")
    print()
    print(f"{'Duplicated citations':20}\t{num_dup_citations}")
    print(f"{'Search Win':20}\t{better_serank}")
    print(f"{'Tie':20}\t{tie_rank}")
    print(f"{'Rerank Improvement':20}\t{better_rerank}")
    
    return pred_results
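
The function above accumulates every counter inline. As a compact illustration of the same quantities, the sketch below computes hit counts, R@k, mean precision@k and MRR@k from a plain list of ranks. The helper name `rank_metrics` and its arguments are hypothetical and not part of the project code. The `precision` entry mirrors the way the rows labelled MAP are computed above (hits@k / k / N).

```python
def rank_metrics(ranks, n_queries, ks=(1, 10, 20, 100)):
    """Compute hits, R@k, mean P@k and MRR@k from ranks of the true citations.

    ranks: 1-based rank of the correct citation, one entry per query.
    n_queries: denominator, i.e. the number of test sentences evaluated.
    """
    out = {}
    for k in ks:
        hits = [r for r in ranks if r <= k]
        out[k] = {
            "hits": len(hits),
            "recall": len(hits) / n_queries,
            "precision": len(hits) / k / n_queries,
            "mrr": sum(1 / r for r in hits) / n_queries,
        }
    return out

# toy example: four queries whose correct citation was ranked 1, 3, 12 and 250
metrics = rank_metrics([1, 3, 12, 250], n_queries=4)
for k, m in metrics.items():
    print(k, m)
```

Calling it once with the search ranks and once with the rerank ranks would give the two blocks of the results table.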

Test results: SQL_BM25

In [14]:
test_path = f"{base_path}/test_dataset_sql_bm25"
get_test_stats(test_path, test_sentences)

Test Sentences:        90757	Extra Sentences: 6457
Test Citations:        81411	Extra Citations: 9514
Test Instances:     90924838	Extra Instances: 11627247


In [43]:
# test results of sentences with SQL_BM25 top 500 search returns (New)
pred_path = f"{base_path}/src_model_full_sentence/test_results_sql_bm25"
get_pred_stats(pred_path, test_sentences)
print('\n\n')
results_sql = get_pred_results(pred_path, test_sentences)

Test Sentences:        90757	Extra Sentences: 6457
Test Citations:        81411	Extra Citations: 9514
Test Instances:        90764	Extra Instances: 11629



Rank                	Top1    	Top10   	Top20   	Top100
--------------------------------------------------------------------------------
Search              	17898   	38151   	44957   	62609
Search R@k          	0.1972  	0.4204  	0.4954  	0.6899  
Search MAP          	0.1972  	0.0420  	0.0248  	0.0069  
Search MRR          	0.1972  	0.2632  	0.2684  	0.2731  

Rerank              	23649   	52058   	61132   	79830
Rerank R@k          	0.2606  	0.5736  	0.6736  	0.8796  
Rerank MAP          	0.2606  	0.0574  	0.0337  	0.0088  
Rerank MRR          	0.2606  	0.3523  	0.3593  	0.3645  

Duplicated citations	7
Search Win          	24897
Tie                 	12568
Rerank Improvement  	53292
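
As a sanity check on the SQL_BM25 table above, the short calculation below re-derives the top-1 rates from the raw counts and confirms that the win/tie/improvement counts add up to the number of evaluated sentences. It uses only numbers printed in this notebook; the variable names are chosen here for illustration.

```python
n_sentences = 90757                      # len(results_sql)
search_top1, rerank_top1 = 17898, 23649  # "Search" and "Rerank" rows, Top1 column
search_win, tie, rerank_win = 24897, 12568, 53292

# every evaluated sentence falls into exactly one of the three outcomes
assert search_win + tie + rerank_win == n_sentences

r1_search = search_top1 / n_sentences
r1_rerank = rerank_top1 / n_sentences
rel_gain = (rerank_top1 - search_top1) / search_top1  # relative top-1 gain
rerank_win_share = rerank_win / n_sentences
print(f"R@1 search={r1_search:.4f} rerank={r1_rerank:.4f}")
print(f"relative top-1 gain={rel_gain:.4f} rerank-better share={rerank_win_share:.4f}")
```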


In [16]:
# Count citations whose search returns were split across two files
n = 0
sentences_in_2_files = {}
for k, v in results_sql.items():
    for ck, cv in v.items():
        if len(cv) > 1:
            sentences_in_2_files[k] = v
            print(k, v)
            n += 1
print(n)

0


In [17]:
len(results_sql)

90757

In [18]:
serank = []
better_serank = 0
for k, v in results_sql.items():
    for ck, cv in v.items():
        serank.append(cv[0][0])
        if cv[0][0] < cv[0][1]: better_serank += 1
len(results_sql), better_serank, len(serank), min(serank), max(serank)

(90757, 24897, 90757, 1, 1000)

Test results: PubMed_TFIDF

In [19]:
test_path = f"{base_path}/test_dataset_pubmed_tfidf"
get_test_stats(test_path, test_sentences)

Test Sentences:        90757	Extra Sentences: 6193
Test Citations:        81411	Extra Citations: 9258
Test Instances:     89029754	Extra Instances: 11098315


In [44]:
# test results of sentences with PubMed TFIDF top 500 search returns
pred_path = f"{base_path}/src_model_full_sentence/test_results_pubmed_tfidf"
get_pred_stats(pred_path, test_sentences)
print('\n\n')
results_tfidf = get_pred_results(pred_path, test_sentences)

Test Sentences:        57123	Extra Sentences: 4489
Test Citations:        52020	Extra Citations: 5716
Test Instances:        57125	Extra Instances: 6829



Rank                	Top1    	Top10   	Top20   	Top100
--------------------------------------------------------------------------------
Search              	8915    	20951   	25262   	37033
Search R@k          	0.1561  	0.3668  	0.4422  	0.6483  
Search MAP          	0.1561  	0.0367  	0.0221  	0.0065  
Search MRR          	0.1561  	0.2177  	0.2228  	0.2278  

Rerank              	16777   	35103   	40260   	50434
Rerank R@k          	0.2937  	0.6145  	0.7048  	0.8829  
Rerank MAP          	0.2937  	0.0615  	0.0352  	0.0088  
Rerank MRR          	0.2937  	0.3898  	0.3961  	0.4006  

Duplicated citations	2
Search Win          	13900
Tie                 	6164
Rerank Improvement  	37059


In [21]:
# Count citations whose search returns were split across two files
n = 0
for k, v in results_tfidf.items():
    for ck, cv in v.items():
        if len(cv) > 1:
            print(k, v)
            n += 1
print(n)

0


In [22]:
serank = []
better_serank = 0
for k, v in results_tfidf.items():
    for ck, cv in v.items():
        serank.append(cv[0][0])
        if cv[0][0] < cv[0][1]: better_serank += 1
len(results_tfidf), better_serank, len(serank), min(serank), max(serank)

(57123, 13900, 57123, 1, 998)

In [48]:
n_sql, m = 0, 0
sents = set()
n_sql_se_1, n_sql_se_10, n_sql_se_20, n_sql_se_100 = 0, 0, 0, 0
rr_sql_se_10, rr_sql_se_20, rr_sql_se_100 = [], [], []
n_sql_re_1, n_sql_re_10, n_sql_re_20, n_sql_re_100 = 0, 0, 0, 0
rr_sql_re_10, rr_sql_re_20, rr_sql_re_100 = [], [], []

n_tfidf_se_1, n_tfidf_se_10, n_tfidf_se_20, n_tfidf_se_100 = 0, 0, 0, 0
rr_tfidf_se_10, rr_tfidf_se_20, rr_tfidf_se_100 = [], [], []
n_tfidf_re_1, n_tfidf_re_10, n_tfidf_re_20, n_tfidf_re_100 = 0, 0, 0, 0
rr_tfidf_re_10, rr_tfidf_re_20, rr_tfidf_re_100 = [], [], []

for k,v in results_tfidf.items():
    if k in results_sql:
        m += 1
        for i in v:
            if i in results_sql[k]:
                n_sql += 1
                sents.add(k)
                if results_sql[k][i][0][0] == 1: n_sql_se_1 += 1
                if results_sql[k][i][0][0] <= 10:
                    n_sql_se_10 += 1
                    rr_sql_se_10.append(1/results_sql[k][i][0][0])
                if results_sql[k][i][0][0] <= 20:
                    n_sql_se_20 += 1
                    rr_sql_se_20.append(1/results_sql[k][i][0][0])
                if results_sql[k][i][0][0] <= 100:
                    n_sql_se_100 += 1
                    rr_sql_se_100.append(1/results_sql[k][i][0][0])
                
                if results_sql[k][i][0][1] == 1: n_sql_re_1 += 1
                if results_sql[k][i][0][1] <= 10:
                    n_sql_re_10 += 1
                    rr_sql_re_10.append(1/results_sql[k][i][0][1])
                if results_sql[k][i][0][1] <= 20:
                    n_sql_re_20 += 1
                    rr_sql_re_20.append(1/results_sql[k][i][0][1])
                if results_sql[k][i][0][1] <= 100:
                    n_sql_re_100 += 1
                    rr_sql_re_100.append(1/results_sql[k][i][0][1])
                
                if results_tfidf[k][i][0][0] == 1: n_tfidf_se_1 += 1
                if results_tfidf[k][i][0][0] <= 10:
                    n_tfidf_se_10 += 1
                    rr_tfidf_se_10.append(1/results_tfidf[k][i][0][0])
                if results_tfidf[k][i][0][0] <= 20:
                    n_tfidf_se_20 += 1
                    rr_tfidf_se_20.append(1/results_tfidf[k][i][0][0])
                if results_tfidf[k][i][0][0] <= 100:
                    n_tfidf_se_100 += 1
                    rr_tfidf_se_100.append(1/results_tfidf[k][i][0][0])
                
                if results_tfidf[k][i][0][1] == 1: n_tfidf_re_1 += 1
                if results_tfidf[k][i][0][1] <= 10:
                    n_tfidf_re_10 += 1
                    rr_tfidf_re_10.append(1/results_tfidf[k][i][0][1])
                if results_tfidf[k][i][0][1] <= 20:
                    n_tfidf_re_20 += 1
                    rr_tfidf_re_20.append(1/results_tfidf[k][i][0][1])
                if results_tfidf[k][i][0][1] <= 100:
                    n_tfidf_re_100 += 1
                    rr_tfidf_re_100.append(1/results_tfidf[k][i][0][1])

print(f"{' ':30}\t{'Top1':8}\t{'Top10':8}\t{'Top20':8}\t{'Top100':8}")
print(f"{90*'-'}")
print(f"{'Sentences in SQL/TFIDF':30}\t{m:<8}\t{len(sents):<8}\t{n_sql:<8}")
print(f"{'SQL Search Results':30}\t{n_sql_se_1:<8}\t{n_sql_se_10:<8}\t{n_sql_se_20:<8}\t{n_sql_se_100:<8}")
print(f"{'SQL Search Results R@k':30}\t{n_sql_se_1/n_sql:<8.4f}\t{n_sql_se_10/n_sql:<8.4f}\t{n_sql_se_20/n_sql:<8.4f}\t{n_sql_se_100/n_sql:<8.4f}")
print(f"{'SQL Search Results MAP':30}\t{n_sql_se_1/n_sql:<8.4f}\t{n_sql_se_10*1/10/n_sql:<8.4f}\t{n_sql_se_20*1/20/n_sql:<8.4f}\t{n_sql_se_100*1/100/n_sql:<8.4f}")
print(f"{'SQL Search Results MRR':30}\t{n_sql_se_1/n_sql:<8.4f}\t{sum(rr_sql_se_10)/n_sql:<8.4f}\t{sum(rr_sql_se_20)/n_sql:<8.4f}\t{sum(rr_sql_se_100)/n_sql:<8.4f}")
print()
print(f"{'SQL Rerank Results':30}\t{n_sql_re_1:<8}\t{n_sql_re_10:<8}\t{n_sql_re_20:<8}\t{n_sql_re_100:<8}")
print(f"{'SQL Rerank Results R@k':30}\t{n_sql_re_1/n_sql:<8.4f}\t{n_sql_re_10/n_sql:<8.4f}\t{n_sql_re_20/n_sql:<8.4f}\t{n_sql_re_100/n_sql:<8.4f}")
print(f"{'SQL Rerank Results MAP':30}\t{n_sql_re_1/n_sql:<8.4f}\t{n_sql_re_10*1/10/n_sql:<8.4f}\t{n_sql_re_20*1/20/n_sql:<8.4f}\t{n_sql_re_100*1/100/n_sql:<8.4f}")
print(f"{'SQL Rerank Results MRR':30}\t{n_sql_re_1/n_sql:<8.4f}\t{sum(rr_sql_re_10)/n_sql:<8.4f}\t{sum(rr_sql_re_20)/n_sql:<8.4f}\t{sum(rr_sql_re_100)/n_sql:<8.4f}")
print()
print(f"{'TFIDF Search Results':30}\t{n_tfidf_se_1:<8}\t{n_tfidf_se_10:<8}\t{n_tfidf_se_20:<8}\t{n_tfidf_se_100:<8}")
print(f"{'TFIDF Search Results R@k':30}\t{n_tfidf_se_1/n_sql:<8.4f}\t{n_tfidf_se_10/n_sql:<8.4f}\t{n_tfidf_se_20/n_sql:<8.4f}\t{n_tfidf_se_100/n_sql:<8.4f}")
print(f"{'TFIDF Search Results MAP':30}\t{n_tfidf_se_1/n_sql:<8.4f}\t{n_tfidf_se_10*1/10/n_sql:<8.4f}\t{n_tfidf_se_20*1/20/n_sql:<8.4f}\t{n_tfidf_se_100*1/100/n_sql:<8.4f}")
print(f"{'TFIDF Search Results MRR':30}\t{n_tfidf_se_1/n_sql:<8.4f}\t{sum(rr_tfidf_se_10)/n_sql:<8.4f}\t{sum(rr_tfidf_se_20)/n_sql:<8.4f}\t{sum(rr_tfidf_se_100)/n_sql:<8.4f}")
print()
print(f"{'TFIDF Rerank Results':30}\t{n_tfidf_re_1:<8}\t{n_tfidf_re_10:<8}\t{n_tfidf_re_20:<8}\t{n_tfidf_re_100:<8}")
print(f"{'TFIDF Rerank Results R@k':30}\t{n_tfidf_re_1/n_sql:<8.4f}\t{n_tfidf_re_10/n_sql:<8.4f}\t{n_tfidf_re_20/n_sql:<8.4f}\t{n_tfidf_re_100/n_sql:<8.4f}")
print(f"{'TFIDF Rerank Results MAP':30}\t{n_tfidf_re_1/n_sql:<8.4f}\t{n_tfidf_re_10*1/10/n_sql:<8.4f}\t{n_tfidf_re_20*1/20/n_sql:<8.4f}\t{n_tfidf_re_100*1/100/n_sql:<8.4f}")
print(f"{'TFIDF Rerank Results MRR':30}\t{n_tfidf_re_1/n_sql:<8.4f}\t{sum(rr_tfidf_re_10)/n_sql:<8.4f}\t{sum(rr_tfidf_re_20)/n_sql:<8.4f}\t{sum(rr_tfidf_re_100)/n_sql:<8.4f}")
print()

                              	Top1    	Top10   	Top20   	Top100  
------------------------------------------------------------------------------------------
Sentences in SQL/TFIDF        	57123   	57123   	57123   
SQL Search Results            	15755   	31132   	35567   	45504   
SQL Search Results R@k        	0.2758  	0.5450  	0.6226  	0.7966  
SQL Search Results MAP        	0.2758  	0.0545  	0.0311  	0.0080  
SQL Search Results MRR        	0.2758  	0.3578  	0.3632  	0.3675  

SQL Rerank Results            	19204   	37890   	42911   	52283   
SQL Rerank Results R@k        	0.3362  	0.6633  	0.7512  	0.9153  
SQL Rerank Results MAP        	0.3362  	0.0663  	0.0376  	0.0092  
SQL Rerank Results MRR        	0.3362  	0.4356  	0.4417  	0.4459  

TFIDF Search Results          	8915    	20951   	25262   	37033   
TFIDF Search Results R@k      	0.1561  	0.3668  	0.4422  	0.6483  
TFIDF Search Results MAP      	0.1561  	0.0367  	0.0221  	0.0065  
TFIDF Search Results MRR      	0.1561  	0.217

In [24]:
results_sql[k][i][0]

(1, 2)
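
The output above shows the shape of each entry: `results[sentence][citation][0]` is a `(search_rank, rerank_rank)` tuple. The cross-system comparisons in this notebook restrict the evaluation to (sentence, citation) pairs present in every result set. A minimal sketch of that restriction is given below; the helpers `common_pairs` and `ranks_for` are hypothetical names introduced here, not functions from the project.

```python
def common_pairs(*results):
    """Return (sentence, citation) pairs present in every results dict."""
    first, rest = results[0], results[1:]
    return [
        (sent, cite)
        for sent, cites in first.items()
        for cite in cites
        if all(sent in r and cite in r[sent] for r in rest)
    ]

def ranks_for(results, pairs, which=0):
    """Extract search ranks (which=0) or rerank ranks (which=1) for the pairs."""
    return [results[sent][cite][0][which] for sent, cite in pairs]

# toy dicts in the same shape as results_sql / results_tfidf
a = {"s1|1": {"p1": [(1, 2)]}, "s2|2": {"p2": [(5, 3)]}}
b = {"s1|1": {"p1": [(4, 1)]}}
pairs = common_pairs(a, b)
print(pairs, ranks_for(a, pairs, 0), ranks_for(b, pairs, 1))
```

Extracting the rank lists once per system keeps the metric code identical across systems instead of repeating the same counters for each one.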

Test results: PubMed_BM

In [28]:
with open('test_sents_pubmed_bm.json', 'r', encoding='utf-8') as f:
    test_pubmed_bm = json.load(f)
len(test_pubmed_bm)

45115

In [29]:
# PubMed_BM dataset statistics
top1000_sentences = {} # sentences with top 1000 search returns
num_sentences = 0
num_citations = 0 # total number of citations
top1000_num_citations = 0 # total number of citations in sentences with top 1000 search returns
for k, v in test_pubmed_bm.items():
    if k in test_sentences:
        num_sentences += 1
        num_citations += len(v['citations'])
        if len(v['pmids']) > 600:
            top1000_sentences[k] = v
            top1000_num_citations += len(v['citations'])
print(f"{'Total Sentences':60}{num_sentences}")
print(f"{'Total Citations':60}{num_citations}")
print(f"{'Sentences with Top 1000 Search Returns':60}{len(top1000_sentences)}")
print(f"{'Total Citations for Sentences with Top 1000 Search Returns':60}{top1000_num_citations}")

Total Sentences                                             41120
Total Citations                                             41120
Sentences with Top 1000 Search Returns                      16824
Total Citations for Sentences with Top 1000 Search Returns  16824


In [25]:
# test_path = f"{base_path}/case_PubMed_BM_testing_final_complete"
test_path = f"{base_path}/test_dataset_pubmed_bm"
get_test_stats(test_path, test_sentences) # top1000_sentences, test_sentences

Test Sentences:        41120	Extra Sentences: 3995
Test Citations:        38835	Extra Citations: 6999
Test Instances:     27026803	Extra Instances: 5207044


In [30]:
# test_path = f"{base_path}/case_PubMed_BM_testing_final_complete"
test_path = f"{base_path}/test_dataset_pubmed_bm"
get_test_stats(test_path, top1000_sentences) # top1000_sentences, test_sentences

Test Sentences:        16824	Extra Sentences: 28291
Test Citations:        16390	Extra Citations: 29444
Test Instances:     14501168	Extra Instances: 17732679


In [45]:
# test results of sentences with PubMed BM top 500 search returns
pred_path = f"{base_path}/src_model_full_sentence/test_results_pubmed_bm"
get_pred_stats(pred_path, top1000_sentences) # top1000_sentences, test_sentences
print('\n\n')
results_bm = get_pred_results(pred_path, top1000_sentences)

Test Sentences:        10782	Extra Sentences: 11468
Test Citations:        10553	Extra Citations: 11706
Test Instances:        10782	Extra Instances: 12494



Rank                	Top1    	Top10   	Top20   	Top100
--------------------------------------------------------------------------------
Search              	432     	2892    	3969    	6447
Search R@k          	0.0401  	0.2682  	0.3681  	0.5979  
Search MAP          	0.0401  	0.0268  	0.0184  	0.0060  
Search MRR          	0.0401  	0.0963  	0.1032  	0.1090  

Rerank              	4382    	7698    	8517    	10031
Rerank R@k          	0.4064  	0.7140  	0.7899  	0.9303  
Rerank MAP          	0.4064  	0.0714  	0.0395  	0.0093  
Rerank MRR          	0.4064  	0.5023  	0.5076  	0.5113  

Duplicated citations	0
Search Win          	1400
Tie                 	396
Rerank Improvement  	8986


In [32]:
n = 0
for k, v in results_bm.items():
    for ck, cv in v.items():
        if len(cv) > 1:
            print(k, v)
            n += 1
print(n)

0


In [33]:
serank = []
better_serank = 0
for k, v in results_bm.items():
    for ck, cv in v.items():
        serank.append(cv[0][0])
        if cv[0][0] > 1000: print(k, v)
        if cv[0][0] < cv[0][1]: better_serank += 1
len(results_bm), better_serank, len(serank), min(serank), max(serank)

(10782, 1400, 10782, 1, 979)

In [47]:
n_sql, n_tfidf = 0, 0
n_sql_se_1, n_sql_se_10, n_sql_se_20, n_sql_se_100 = 0, 0, 0, 0
rr_sql_se_10, rr_sql_se_20, rr_sql_se_100 = [], [], []
n_sql_re_1, n_sql_re_10, n_sql_re_20, n_sql_re_100 = 0, 0, 0, 0
rr_sql_re_10, rr_sql_re_20, rr_sql_re_100 = [], [], []

n_tfidf_se_1, n_tfidf_se_10, n_tfidf_se_20, n_tfidf_se_100 = 0, 0, 0, 0
rr_tfidf_se_10, rr_tfidf_se_20, rr_tfidf_se_100 = [], [], []
n_tfidf_re_1, n_tfidf_re_10, n_tfidf_re_20, n_tfidf_re_100 = 0, 0, 0, 0
rr_tfidf_re_10, rr_tfidf_re_20, rr_tfidf_re_100 = [], [], []

n_bm_se_1, n_bm_se_10, n_bm_se_20, n_bm_se_100 = 0, 0, 0, 0
rr_bm_se_10, rr_bm_se_20, rr_bm_se_100 = [], [], []
n_bm_re_1, n_bm_re_10, n_bm_re_20, n_bm_re_100 = 0, 0, 0, 0
rr_bm_re_10, rr_bm_re_20, rr_bm_re_100 = [], [], []

for k,v in results_bm.items():
    if k in results_sql and k in results_tfidf:
        n_sql += 1
        for i in v:
            if i in results_sql[k] and i in results_tfidf[k]:
                n_tfidf += 1
                if results_sql[k][i][0][0] == 1: n_sql_se_1 += 1
                if results_sql[k][i][0][0] <= 10:
                    n_sql_se_10 += 1
                    rr_sql_se_10.append(1/results_sql[k][i][0][0])
                if results_sql[k][i][0][0] <= 20:
                    n_sql_se_20 += 1
                    rr_sql_se_20.append(1/results_sql[k][i][0][0])
                if results_sql[k][i][0][0] <= 100:
                    n_sql_se_100 += 1
                    rr_sql_se_100.append(1/results_sql[k][i][0][0])
                
                if results_sql[k][i][0][1] == 1: n_sql_re_1 += 1
                if results_sql[k][i][0][1] <= 10:
                    n_sql_re_10 += 1
                    rr_sql_re_10.append(1/results_sql[k][i][0][1])
                if results_sql[k][i][0][1] <= 20:
                    n_sql_re_20 += 1
                    rr_sql_re_20.append(1/results_sql[k][i][0][1])
                if results_sql[k][i][0][1] <= 100:
                    n_sql_re_100 += 1
                    rr_sql_re_100.append(1/results_sql[k][i][0][1])
                
                if results_tfidf[k][i][0][0] == 1: n_tfidf_se_1 += 1
                if results_tfidf[k][i][0][0] <= 10:
                    n_tfidf_se_10 += 1
                    rr_tfidf_se_10.append(1/results_tfidf[k][i][0][0])
                if results_tfidf[k][i][0][0] <= 20:
                    n_tfidf_se_20 += 1
                    rr_tfidf_se_20.append(1/results_tfidf[k][i][0][0])
                if results_tfidf[k][i][0][0] <= 100:
                    n_tfidf_se_100 += 1
                    rr_tfidf_se_100.append(1/results_tfidf[k][i][0][0])
                
                if results_tfidf[k][i][0][1] == 1: n_tfidf_re_1 += 1
                if results_tfidf[k][i][0][1] <= 10:
                    n_tfidf_re_10 += 1
                    rr_tfidf_re_10.append(1/results_tfidf[k][i][0][1])
                if results_tfidf[k][i][0][1] <= 20:
                    n_tfidf_re_20 += 1
                    rr_tfidf_re_20.append(1/results_tfidf[k][i][0][1])
                if results_tfidf[k][i][0][1] <= 100:
                    n_tfidf_re_100 += 1
                    rr_tfidf_re_100.append(1/results_tfidf[k][i][0][1])
                
                if results_bm[k][i][0][0] == 1: n_bm_se_1 += 1
                if results_bm[k][i][0][0] <= 10:
                    n_bm_se_10 += 1
                    rr_bm_se_10.append(1/results_bm[k][i][0][0])
                if results_bm[k][i][0][0] <= 20:
                    n_bm_se_20 += 1
                    rr_bm_se_20.append(1/results_bm[k][i][0][0])
                if results_bm[k][i][0][0] <= 100:
                    n_bm_se_100 += 1
                    rr_bm_se_100.append(1/results_bm[k][i][0][0])
                
                if results_bm[k][i][0][1] == 1: n_bm_re_1 += 1
                if results_bm[k][i][0][1] <= 10:
                    n_bm_re_10 += 1
                    rr_bm_re_10.append(1/results_bm[k][i][0][1])
                if results_bm[k][i][0][1] <= 20:
                    n_bm_re_20 += 1
                    rr_bm_re_20.append(1/results_bm[k][i][0][1])
                if results_bm[k][i][0][1] <= 100:
                    n_bm_re_100 += 1
                    rr_bm_re_100.append(1/results_bm[k][i][0][1])
                    
print(f"{' ':30}\t{'Top1':8}\t{'Top10':8}\t{'Top20':8}\t{'Top100':8}")
print(f"{90*'-'}")
print(f"{'Test Citations in SQL/TFIDF':30}\t{len(results_bm):<8}\t{n_sql:<8}\t{n_tfidf:<8}")
print(f"{'SQL Search Results':30}\t{n_sql_se_1:<8}\t{n_sql_se_10:<8}\t{n_sql_se_20:<8}\t{n_sql_se_100:<8}")
print(f"{'SQL Search Results R@k':30}\t{n_sql_se_1/n_tfidf:<8.4f}\t{n_sql_se_10/n_tfidf:<8.4f}\t{n_sql_se_20/n_tfidf:<8.4f}\t{n_sql_se_100/n_tfidf:<8.4f}")
print(f"{'SQL Search Results MAP':30}\t{n_sql_se_1/n_tfidf:<8.4f}\t{n_sql_se_10*1/10/n_tfidf:<8.4f}\t{n_sql_se_20*1/20/n_tfidf:<8.4f}\t{n_sql_se_100*1/100/n_tfidf:<8.4f}")
print(f"{'SQL Search Results MRR':30}\t{n_sql_se_1/n_tfidf:<8.4f}\t{sum(rr_sql_se_10)/n_tfidf:<8.4f}\t{sum(rr_sql_se_20)/n_tfidf:<8.4f}\t{sum(rr_sql_se_100)/n_tfidf:<8.4f}")
print()
print(f"{'SQL Rerank Results':30}\t{n_sql_re_1:<8}\t{n_sql_re_10:<8}\t{n_sql_re_20:<8}\t{n_sql_re_100:<8}")
print(f"{'SQL Rerank Results R@k':30}\t{n_sql_re_1/n_tfidf:<8.4f}\t{n_sql_re_10/n_tfidf:<8.4f}\t{n_sql_re_20/n_tfidf:<8.4f}\t{n_sql_re_100/n_tfidf:<8.4f}")
print(f"{'SQL Rerank Results MAP':30}\t{n_sql_re_1/n_tfidf:<8.4f}\t{n_sql_re_10*1/10/n_tfidf:<8.4f}\t{n_sql_re_20*1/20/n_tfidf:<8.4f}\t{n_sql_re_100*1/100/n_tfidf:<8.4f}")
print(f"{'SQL Rerank Results MRR':30}\t{n_sql_re_1/n_tfidf:<8.4f}\t{sum(rr_sql_re_10)/n_tfidf:<8.4f}\t{sum(rr_sql_re_20)/n_tfidf:<8.4f}\t{sum(rr_sql_re_100)/n_tfidf:<8.4f}")
print()
print(f"{'TFIDF Search Results':30}\t{n_tfidf_se_1:<8}\t{n_tfidf_se_10:<8}\t{n_tfidf_se_20:<8}\t{n_tfidf_se_100:<8}")
print(f"{'TFIDF Search Results R@k':30}\t{n_tfidf_se_1/n_tfidf:<8.4f}\t{n_tfidf_se_10/n_tfidf:<8.4f}\t{n_tfidf_se_20/n_tfidf:<8.4f}\t{n_tfidf_se_100/n_tfidf:<8.4f}")
print(f"{'TFIDF Search Results MAP':30}\t{n_tfidf_se_1/n_tfidf:<8.4f}\t{n_tfidf_se_10*1/10/n_tfidf:<8.4f}\t{n_tfidf_se_20*1/20/n_tfidf:<8.4f}\t{n_tfidf_se_100*1/100/n_tfidf:<8.4f}")
print(f"{'TFIDF Search Results MRR':30}\t{n_tfidf_se_1/n_tfidf:<8.4f}\t{sum(rr_tfidf_se_10)/n_tfidf:<8.4f}\t{sum(rr_tfidf_se_20)/n_tfidf:<8.4f}\t{sum(rr_tfidf_se_100)/n_tfidf:<8.4f}")
print()
print(f"{'TFIDF Rerank Results':30}\t{n_tfidf_re_1:<8}\t{n_tfidf_re_10:<8}\t{n_tfidf_re_20:<8}\t{n_tfidf_re_100:<8}")
print(f"{'TFIDF Rerank Results R@k':30}\t{n_tfidf_re_1/n_tfidf:<8.4f}\t{n_tfidf_re_10/n_tfidf:<8.4f}\t{n_tfidf_re_20/n_tfidf:<8.4f}\t{n_tfidf_re_100/n_tfidf:<8.4f}")
print(f"{'TFIDF Rerank Results MAP':30}\t{n_tfidf_re_1/n_tfidf:<8.4f}\t{n_tfidf_re_10*1/10/n_tfidf:<8.4f}\t{n_tfidf_re_20*1/20/n_tfidf:<8.4f}\t{n_tfidf_re_100*1/100/n_tfidf:<8.4f}")
print(f"{'TFIDF Rerank Results MRR':30}\t{n_tfidf_re_1/n_tfidf:<8.4f}\t{sum(rr_tfidf_re_10)/n_tfidf:<8.4f}\t{sum(rr_tfidf_re_20)/n_tfidf:<8.4f}\t{sum(rr_tfidf_re_100)/n_tfidf:<8.4f}")
print()
print(f"{'BM Search Results':30}\t{n_bm_se_1:<8}\t{n_bm_se_10:<8}\t{n_bm_se_20:<8}\t{n_bm_se_100:<8}")
print(f"{'BM Search Results R@k':30}\t{n_bm_se_1/n_tfidf:<8.4f}\t{n_bm_se_10/n_tfidf:<8.4f}\t{n_bm_se_20/n_tfidf:<8.4f}\t{n_bm_se_100/n_tfidf:<8.4f}")
print(f"{'BM Search Results MAP':30}\t{n_bm_se_1/n_tfidf:<8.4f}\t{n_bm_se_10*1/10/n_tfidf:<8.4f}\t{n_bm_se_20*1/20/n_tfidf:<8.4f}\t{n_bm_se_100*1/100/n_tfidf:<8.4f}")
print(f"{'BM Search Results MRR':30}\t{n_bm_se_1/n_tfidf:<8.4f}\t{sum(rr_bm_se_10)/n_tfidf:<8.4f}\t{sum(rr_bm_se_20)/n_tfidf:<8.4f}\t{sum(rr_bm_se_100)/n_tfidf:<8.4f}")
print()
print(f"{'BM Rerank Results':30}\t{n_bm_re_1:<8}\t{n_bm_re_10:<8}\t{n_bm_re_20:<8}\t{n_bm_re_100:<8}")
print(f"{'BM Rerank Results R@k':30}\t{n_bm_re_1/n_tfidf:<8.4f}\t{n_bm_re_10/n_tfidf:<8.4f}\t{n_bm_re_20/n_tfidf:<8.4f}\t{n_bm_re_100/n_tfidf:<8.4f}")
print(f"{'BM Rerank Results MAP':30}\t{n_bm_re_1/n_tfidf:<8.4f}\t{n_bm_re_10*1/10/n_tfidf:<8.4f}\t{n_bm_re_20*1/20/n_tfidf:<8.4f}\t{n_bm_re_100*1/100/n_tfidf:<8.4f}")
print(f"{'BM Rerank Results MRR':30}\t{n_bm_re_1/n_tfidf:<8.4f}\t{sum(rr_bm_re_10)/n_tfidf:<8.4f}\t{sum(rr_bm_re_20)/n_tfidf:<8.4f}\t{sum(rr_bm_re_100)/n_tfidf:<8.4f}")

                              	Top1    	Top10   	Top20   	Top100  
------------------------------------------------------------------------------------------
Test Citations in SQL/TFIDF   	10782   	9916    	9916    
SQL Search Results            	3424    	6189    	6915    	8416    
SQL Search Results R@k        	0.3453  	0.6241  	0.6974  	0.8487  
SQL Search Results MAP        	0.3453  	0.0624  	0.0349  	0.0085  
SQL Search Results MRR        	0.3453  	0.4320  	0.4371  	0.4409  

SQL Rerank Results            	3995    	7227    	8033    	9323    
SQL Rerank Results R@k        	0.4029  	0.7288  	0.8101  	0.9402  
SQL Rerank Results MAP        	0.4029  	0.0729  	0.0405  	0.0094  
SQL Rerank Results MRR        	0.4029  	0.5047  	0.5104  	0.5137  

TFIDF Search Results          	2120    	4598    	5398    	7305    
TFIDF Search Results R@k      	0.2138  	0.4637  	0.5444  	0.7367  
TFIDF Search Results MAP      	0.2138  	0.0464  	0.0272  	0.0074  
TFIDF Search Results MRR      	0.2138  	0.288