# DeepSenSe: A Deep Learning Powered Search Engine

## Communication with Chun-Chao

### Chun-Chao's First Email

- The processed train/valid data are on Lambda1:/disk1/cl17d/Project/src_new/input
- The test data are on 
  - SQL full sentence: Lambda1:/disk1/cl17d/Project/src_new/ case_SQL_testing_final_complete
  - SQL keywords: Lambda1:/disk1/cl17d/Project/src_new/case_SQL_testing_final_complete_keywordOnly
  - PubMed full sentence: Lambda1:/disk1/cl17d/Project/src_new/case_PubMed_testing_final_complete
  - PubMed keywords: Lambda1:/disk1/cl17d/Project/src_new/case_PubMed_testing_final_keywordOnly

**Procedure:**
1. Get data:

   1.1 Extract sentences, abstracts, and rank by SQL, including train/valid/test data:
        from Yuchuan's code
 
   1.2 Extract sentences, abstracts, and rank by PubMed (only for test):
        create_test_search_PubMed.py # Need api_key in PM_function.py
        
2. Add features to train/valid/test data from Yuchuan:
        create_SQL_final_train_valid.py
        create_SQL_final_test.py
        create_SQL_final_test_keywordOnly.py
        create_PM_final_test.py
        
3. Train model and output prediction (1 and 2 basically re the same code. Only difference is for full_sentences or for keywords dataset):
   
        1. src_model/train_model20_complete.py
                 train_final_dataset_complete_shuffled.tsv
                 valid_final_dataset_complete.tsv
   
        2. src_model_only_keywords/train_keywordsmodel21_complete.py
                 train_final_dataset_sentence_all_keyword_in_abstarct.tsv
                 valid_final_dataset_complete.tsv
        
4. evaluate results (top1/20/100):
        src_model/analyze_test_result20_with_score_true_in_top10000.py


### Datasets Development

**Word Embeddings**

- Trained by FastText
- Code in "jupiter:/home/cl17d/src_train/fastText-0.1.0"
- Corpus of "jupiter:/home/cl17d/src_train/input/title_abstracts_raw_lower_for_fasttext.txt"
- Commad: "./fasttext skipgram -dim 300 -input INPUT_FILE_NAME -output OUTPUT_MODEL_NAME"
- Output: the fastText model "src_new/model/f_model.bin" and "src_new/model/f_model.vec"
- Embedding matrix:
  - Code: "src_new/model/tokenizer_utils.py"
  - Matrix file: "src_new/model/embedding_matrix.npy"


**Tokenizer**

- Code: "src_new/model/tokenizer_utils.py"
- Corpus: "jupiter:/home/cl17d/src_train/input/title_abstracts_raw_lower_for_sqldb.csv"
- Output: "src_new/model/tokenizer_all_lower.pickle"


**Training, Validation and Test Datasets**

1. Develop sentences with citations for query


- Input data: PMC articles
- Code: also yuchuan's code
- Output:


2. Develop datasets of sentences with query results and additional features


- Query search
  - SQL_BM the query from the MySQL database developed by Yuchuan, ranked by BM25
    - Code: yuchuan's code
    - Input:
    - Output: This outputs are the inputs of "Training and Validation and Test” you mentioned below
  - PubMed_TF the query from the PubMed database using BioPython’s Entrez, ranked by TF-IDF
    - Code: "create_test_search_PubMed.py"
    - Input:
      - TEST_PM_INPUT = '../case_testing_final_PubMed_only_keyword/'
      - TEST_sentence_INPUT = '../case_DL_testing_final_with_score_coverage/'
      - TEST_sentence_INPUT = '../case_DL_testing_final_only_keyword_with_all_score_coverage/'
    - Output: This outputs are the inputs of "Test” you mentioned below
  - PubMed_BM the query from the  PubMed database using the Selenium package, ranked by PubMed's best match
  - Google Scholar the query from the Google Scholar database using the Selenium package
  - Used “OR” operator and limit the date that the date is before the sentence written date

- Training and Validation
  - Code: create_SQL_final_train_valid.py
  - Input:
    - Full sentence: '../case_with_all_label_score_true_in_top10000_add_true_case_with_keyword_only2'
    - Keyword: SQL_INPUT = '../case_sentence_contain_all_keyword_only/'
  - Output:
    - Full sentence
      - Training: train_final_dataset_complete_shuffled.tsv
      - Validation: valid_final_dataset_complete.tsv
    - Keyword
      - Training: train_final_dataset_sentence_all_keyword_in_abstarct.tsv
      - Validation: valid_final_dataset_sentence_all_keyword_in_abstract.tsv

- Test
  - SQL_BM
    - Full sentence
      - Code: create_SQL_final_test.py
      - Input: TEST_SQL_INPUT = '../case_DL_testing_final_sentence_with_all_keywords_score_coverage/'
      - Output: 'case_SQL_testing_final_complete/'
    - Keyword
      - Code: create_SQL_final_test_keywordOnly.py
      - Input: TEST_SQL_INPUT = '../DL_testing_final_keyword_only/'
      - Output: 'case_SQL_testing_final_complete_keywordOnly'
  - PubMed
    - Code: create_PM_final_test.py
    - Input:
      - Full sentence: TEST_INPUT = '../case_PubMed_pmids/'
      - Keyword: TEST_PM_INPUT = '../case_testing_final_PubMed_only_keyword/'
    - Output:
      - Full sentence: 'case_PubMed_testing_final_complete/'
      - Keyword: 'case_PubMed_testing_final_keywordOnly/'



### DeepSenSe Model Training and Testing

**Training/Validation**

1. Model20


- Trained and validated on full sentence datasets
- Code: train_model20_complete.py
- Input: 
  - Training: 'train_final_dataset_complete_shuffled.tsv'
  - Validation: 'valid_final_dataset_complete.tsv'
- Output model: de1000_01_f_model20_complete_true_in_top1000.h5


2. Model21


- Trained and Validated on Keyword datasets
- Code: train_keywordsmodel21_complete.py
- Input:
  - Training: 'train_final_dataset_sentence_all_keyword_in_abstarct.tsv'
  - Validation: 'valid_final_dataset_sentence_all_keyword_in_abstract.tsv'
- Output model: de1000_01_f_model21_complete_only_keywords_in_top1000_last.h5


**Test Results**

1. Model20 on full sentence test datasets (for table 2, 3, and 4 in the paper)


- test_rank20_SQL_complete_true_in_top10000
  - This is the result of "SQL_BM+Decomposable model" + "All" features, results of pure SQL_BM search engine are also included.
  - Test dataset: 'case_SQL_testing_final_complete/'
  - Output results: 'case_SQL_testing_final_complete/' (same as 'test_rank20_SQL_complete_true_in_top10000')
  - Results analysis code: "analyze_test_result20_with_score_true_in_top10000.py"

- test_rank20_PM_complete_true_in_top10000
  - This is the result of "PubMed_TF+Decomposable model" + "All" features.
  - Test dataset: 'case_PubMed_testing_final_complete/'
  - Output results: 'case_PubMed_testing_final_complete/' (same as 'test_rank20_PM_complete_true_in_top10000')
  - Results analysis code: "analyze_test_result20_with_score_true_in_top10000.py"

- test_rank20_BM_complete_true_in_top10000
  - This is the result of "PubMed_BM+Decomposable model" + "All" features.
  - Test dataset: 'case_PubMed_BM_testing_final_complete/'
  - Output results: 'case_PubMed_BM_testing_final_complete/' (same as 'test_rank20_BM_complete_true_in_top10000')
  - Results analysis code: "analyze_test_result20_with_score_true_in_top10000.py"


2. Model21 on keyword datasets (for table 5 in the paper)


- case_SQL_testing_final_complete_keywordOnly
  - Result of "SQL_BM+Decomposable model" + "All" features.
  - Test dataset: 'case_SQL_testing_final_complete_keywordOnly/'
  - Output results: 'case_SQL_testing_final_complete_keywordOnly/'
  - Results analysis code: "analyze_test_result21_with_score_true_in_top10000.py"

- case_PubMed_testing_final_keywordOnly
  - Result of "PubMed_TF+Decomposable model" + "All" features.
  - Test dataset: 'case_PubMed_testing_final_keywordOnly/'
  - Output results: 'case_PubMed_testing_final_keywordOnly/'
  - Results analysis code: "analyze_test_result21_with_score_true_in_top10000.py"


3. Model20 on keyword datasets (not included in the paper)


- de1000_01_f_model21_complete_only_keywords_in_top1000: model20 on SQL_BM keyword dataset
- de1000_01_f_model21_PubMed_only_keywords_in_top1000: model20 on PubMed_TF keyword dataset

### Data Pathes and Configuration

In [1]:
# Input data for development of training datasets
ST_FOR_TRAINDATA_PATH = '../case_with_all_label_score_true_in_top10000_add_true_case_with_keyword_only2/'
KW_FOR_TRAINDATA_PATH = '../case_sentence_contain_all_keyword_only/'

# Input data for development of test datasets
ST_FOR_TESTDATA_PATH_SQL = '../case_DL_testing_final_sentence_with_all_keywords_score_coverage/'
KW_FOR_TESTDATA_PATH_SQL = '../DL_testing_final_keyword_only/'
ST_FOR_TESTDATA_PATH_PMD = '../case_PubMed_pmids/'
KW_FOR_TESTDATA_PATH_PMD = '../case_testing_final_PubMed_only_keyword/'

# Training datasets
TRAINDATA_PATH = "input"
TRAINDATA_ST = 'train_final_dataset_complete_shuffled.tsv'
VALIDDATA_ST = 'valid_final_dataset_complete.tsv'
TRAINDATA_KW = 'train_final_dataset_sentence_all_keyword_in_abstarct.tsv'
VALIDDATA_KW = 'valid_final_dataset_sentence_all_keyword_in_abstract.tsv'

# Test datasets
TESTDATA_PATH_SQL_ST = "case_SQL_testing_final_complete"
TESTDATA_PATH_SQL_KW = "case_SQL_testing_final_complete_keywordOnly"
TESTDATA_PATH_PTF_ST = "case_PubMed_testing_final_complete"
TESTDATA_PATH_PTF_KW = "case_PubMed_testing_final_keywordOnly"
TESTDATA_PATH_PBM_ST = "case_PubMed_BM_testing_final_complete"
TESTDATA_PATH_GGS_ST = "case_Google_scholar_testing_final_complete"

EMBEDDING_FILE = "embedding_matrix.npy"
TOKENIZER_FILE = "tokenizer_all_lower.pickle"
MODEL_PATH = "model"
OUTPUT_PATH = "output"
RESULTS_PATH = "results"

MAX_NB_WORDS = 1500000

In [70]:
import time
import csv
import json
import pandas as pd
import numpy as np
from collections import Counter
#from src_model.tokenizer_utils import load_tokenizer

In [3]:
from text_to_wordlist import text_to_wordlist

### Input Data for Development of Training and Test Datasets

**Input data for development of training datasets**

- Sentence: ST_FOR_TRAINDATA_PATH = '../case_with_all_label_score_true_in_top10000_add_true_case_with_keyword_only2/'
  - search_result_cases_search_result_sentence_citation1.csv (1-560)

In [31]:
with open(f"{ST_FOR_TRAINDATA_PATH}/search_result_cases_search_result_sentence_citation1.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'citation_pmid', 'title', 'abstract', 'label', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage']
['In fact, other studies performed on the different tissues of lemon with the HR-MAS technique are present in literature [ XREF_B33_END ] that agree with our results', '26495154', '23871074', 'Citron and lemon under the lens of HR-MAS NMR spectroscopy.', 'High Resolution Magic Angle Spinning (HR-MAS) is an NMR technique that can be applied to semi-solid samples. Flavedo, albedo, pulp, seeds, and oil gland content of lemon and citron were studied through HR-MAS NMR spectroscopy, which was used directly on intact tissue specimens without any physicochemical manipulation. HR-MAS NMR proved to be a very suitable technique for detecting terpenes, sugars, organic acids, aminoacids and osmolites. It is valuable in observing changes in sugars, principal organic acids (mainly citric and malic) and ethanol contents of pulp specimens and this s

In [56]:
with open(f"{ST_FOR_TRAINDATA_PATH}/search_result_cases_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'citation_pmid', 'title', 'abstract', 'label', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage']
['In this study, we investigated the function of CTGF in psoriasis using the established imiquimod (IMQ)-induced psoriasis murine model XREF_B9_END and samples from psoriasis patients', '29386832', '19380832', 'Imiquimod-induced psoriasis-like skin inflammation in mice is mediated via the IL-23/IL-17 axis.', 'Topical application of imiquimod (IMQ), a TLR7/8 ligand and potent immune activator, can induce and exacerbate psoriasis, a chronic inflammatory skin disorder. Recently, a crucial role was proposed for the IL-23/IL-17 axis in psoriasis. We hypothesized that IMQ-induced dermatitis in mice can serve as a model for the analysis of pathogenic mechanisms in psoriasis-like dermatitis and assessed its IL-23/IL-17 axis dependency. Daily application of IMQ on mouse back skin induced inflamed scaly skin lesions resembling plaque type psor

In [147]:
sentences_all = {}
n = 0
for f in range(1, 561):
    with open(f"{ST_FOR_TRAINDATA_PATH}/search_result_cases_search_result_sentence_citation{f}.csv",
              'r', encoding='utf-8', newline='') as csvfile:
        csvreader = csv.reader((line.replace('\0','') for line in csvfile), delimiter='\t', quoting=csv.QUOTE_NONE)
        header = next(csvreader)
        for row in csvreader:
            n += 1
            if row[0] not in sentences_all:
                sentences_all[row[0]] = {'pmids':[], 'citations':[], 'labels':[]}
            sentences_all[row[0]]['pmids'].append(row[1])
            sentences_all[row[0]]['citations'].append(row[2])
            sentences_all[row[0]]['labels'].append(row[5])
print(n)

4265797


In [4]:
len(sentences_all)

1220310

In [154]:
json.dump(sentences_all, open('sentences_all.json', 'w', encoding='utf-8'), indent=4)

In [3]:
sentences_all = json.load(open('sentences_all.json'))

In [8]:
p, c, l = 0, 0, 0
sentences_p = []
sentences_c = []
sentences_l = []
for k, v in sentences_all.items():
    if len(set(v['pmids'])) > 1:
        p += 1
        sentences_p.append(k)
    if len(set(v['citations'])) > 1:
        c += 1
        sentences_c.append(k)
    if len(set(v['labels'])) > 1:
        l += 1
        sentences_l.append(k)

In [9]:
p, c, l

(5562, 1220295, 1220295)

In [25]:
sentences_c[10000]

'For example, several researchers reported that FOF reduces social contacts with friends and family [ XREF_B3-ijerph-14-00469_END , XREF_B13-ijerph-14-00469_END ], which supported the notion that FOF may have the constraining effects on social contact [ XREF_B3-ijerph-14-00469_END ]'

In [26]:
text_to_wordlist(sentences_c[10000])

'example several researchers reported fof reduces social contacts friends family supported notion fof may constraining effects social contact'

In [20]:
sentences_all[sentences_p[0]]

{'pmids': ['28774737',
  '28774737',
  '28774737',
  '26865910',
  '26865910',
  '26865910'],
 'citations': ['19455179',
  '12821098',
  '20668706',
  '20107110',
  '3901279',
  '17612497'],
 'labels': ['1', '0', '0', '1', '0', '0']}

- Keyword: KW_FOR_TRAINDATA_PATH = '../case_sentence_contain_all_keyword_only/'
  - search_result_cases_search_result_sentence_citation1.csv (1-560)

In [60]:
with open(f"{KW_FOR_TRAINDATA_PATH}/search_result_cases_search_result_sentence_citation1.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'citation_pmid', 'title', 'abstract', 'label', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage']
['lemon technique', '26495154', '23871074', 'Citron and lemon under the lens of HR-MAS NMR spectroscopy.', 'High Resolution Magic Angle Spinning (HR-MAS) is an NMR technique that can be applied to semi-solid samples. Flavedo, albedo, pulp, seeds, and oil gland content of lemon and citron were studied through HR-MAS NMR spectroscopy, which was used directly on intact tissue specimens without any physicochemical manipulation. HR-MAS NMR proved to be a very suitable technique for detecting terpenes, sugars, organic acids, aminoacids and osmolites. It is valuable in observing changes in sugars, principal organic acids (mainly citric and malic) and ethanol contents of pulp specimens and this strongly point to its use to follow fruit ripening, or commercial assessment of fruit maturity. HR-MAS NMR was also used to derive the molar percenta

In [57]:
with open(f"{KW_FOR_TRAINDATA_PATH}/search_result_cases_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'citation_pmid', 'title', 'abstract', 'label', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage']
['In imiquimod model', '29386832', '19380832', 'Imiquimod-induced psoriasis-like skin inflammation in mice is mediated via the IL-23/IL-17 axis.', 'Topical application of imiquimod (IMQ), a TLR7/8 ligand and potent immune activator, can induce and exacerbate psoriasis, a chronic inflammatory skin disorder. Recently, a crucial role was proposed for the IL-23/IL-17 axis in psoriasis. We hypothesized that IMQ-induced dermatitis in mice can serve as a model for the analysis of pathogenic mechanisms in psoriasis-like dermatitis and assessed its IL-23/IL-17 axis dependency. Daily application of IMQ on mouse back skin induced inflamed scaly skin lesions resembling plaque type psoriasis. These lesions showed increased epidermal proliferation, abnormal differentiation, epidermal accumulation of neutrophils in microabcesses, neoangiogenesis, a

In [155]:
keywords_all = {}
n = 0
for f in range(1, 561):
    with open(f"{KW_FOR_TRAINDATA_PATH}/search_result_cases_search_result_sentence_citation{f}.csv",
              'r', encoding='utf-8', newline='') as csvfile:
        csvreader = csv.reader((line.replace('\0','') for line in csvfile), delimiter='\t', quoting=csv.QUOTE_NONE)
        header = next(csvreader)
        for row in csvreader:
            n += 1
            if row[0] not in keywords_all:
                keywords_all[row[0]] = {'pmids':[], 'citations':[], 'labels':[]}
            keywords_all[row[0]]['pmids'].append(row[1])
            keywords_all[row[0]]['citations'].append(row[2])
            keywords_all[row[0]]['labels'].append(row[5])
print(n)

2038088


In [156]:
len(keywords_all)

662303

In [157]:
json.dump(keywords_all, open('keywords_all.json', 'w', encoding='utf-8'), indent=4)

In [187]:
keywords_all = json.load(open('keywords_all.json'))

In [188]:
p, c, l = 0, 0, 0
keywords_p = []
keywords_c = []
keywords_l = []
for k, v in keywords_all.items():
    if len(set(v['pmids'])) > 1:
        p += 1
        keywords_p.append(k)
    if len(set(v['citations'])) > 1:
        c += 1
        keywords_c.append(k)
    if len(set(v['labels'])) > 1:
        l += 1
        keywords_l.append(k)

In [189]:
p, c, l

(5562, 662190, 662190)

**Input Data for Development of Test Datasets**

- Sentence by SQL: ST_FOR_TESTDATA_PATH_SQL = '../case_DL_testing_final_sentence_with_all_keywords_score_coverage/'
(No  such folder)

- Keyword by SQL: KW_FOR_TESTDATA_PATH_SQL = '../DL_testing_final_keyword_only/'
  - DL_data_search_result_sentence_citation461.csv (461-560)

In [62]:
with open(f"{KW_FOR_TESTDATA_PATH_SQL}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'result_pmid', 'citation_pmid', 'title_abstract_score', 'title', 'abstract', 'sql_rank', 'title_coverage', 'abstract_coverage', 'sentence_orig']
['psoriasis imiquimod imq induced psoriasis model psoriasis', '29386832', '29305258', '19380832', '53.260294914245605', 'IRF-2 haploinsufficiency causes enhanced imiquimod-induced psoriasis-like skin inflammation.', 'IFN regulatory factor (IRF)-2 is one of the potential susceptibility genes for psoriasis, but how this gene influences psoriasis pathogenesis is unclear. Topical application of imiquimod (IMQ), a TLR7 ligand, induces psoriasis-like skin lesions in mice.The aim of this study was to investigate whether IRF-2 gene status would influence severity of skin disease in IMQ-treated mice.Imiquimod-induced psoriasis-like skin inflammation was assessed by clinical findings, histology, and cytokine expression. The effects of imiquimod or IFN on peritoneal macrophages were analyzed in vitro.IMQ-induced skin inflamm

- Sentence by PubMed: ST_FOR_TESTDATA_PATH_PMD = '../case_PubMed_pmids/'
  - DL_data_search_result_sentence_citation461.csv (461-560)

In [34]:
with open(f"{ST_FOR_TESTDATA_PATH_PMD}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['\ufeffcontrast reported gastric luminal challenge specific antigen decreased phasic antral activity sensitized rats', '28144845', '26767575', '8194696']
['contrast reported gastric luminal challenge specific antigen decreased phasic antral activity sensitized rats', '28144845', '27197667', '8194696']


- Keyword by PubMed: KW_FOR_TESTDATA_PATH_PMD = '../case_testing_final_PubMed_only_keyword/'
  - DL_data_search_result_sentence_citation461.csv (461-560)

In [35]:
with open(f"{KW_FOR_TESTDATA_PATH_PMD}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['\ufefflow vitamin status urban rural population andhra pradesh state south india', '26644714', '23860755', '18497434', 'Harinarayan et al in 2008 reported low Vitamin D status in urban and rural population of Andhra Pradesh state in South India[ XREF_ref32_END ]']
['low vitamin status urban rural population andhra pradesh state south india', '26644714', '26667891', '18497434', 'Harinarayan et al in 2008 reported low Vitamin D status in urban and rural population of Andhra Pradesh state in South India[ XREF_ref32_END ]']


- Other input data
  - '../case_Google_scholar_pmids/'
    - DL_data_search_result_sentence_citation461.csv
  - '../case_PubMed_BM_pmids/'
    - DL_data_search_result_sentence_citation461.csv
  - '../case_SQL_olddata_pmids/'
    - test_sql_temp_review_1000_analysis.csv

In [36]:
with open(f"../case_Google_scholar_pmids/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['global test proposed', '21245948', '12098431', '14693814']
['global test proposed', '21245948', '12320315', '14693814']


In [38]:
with open(f"../case_PubMed_BM_pmids/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['\ufeffstudy used multilocus sequence typing mlst method', '17543117', '27646134', '15944444']
['study used multilocus sequence typing mlst method', '17543117', '23979428', '15944444']


In [39]:
with open(f"../case_SQL_olddata_pmids/test_sql_temp_review_1000_analysis.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

[',sentence,sentence_pmid,filename,pmid,label,PM_rank,article_type,journal_title,sentence_year']
['48277,"Use of K-SRS assessment research (n=13) was done to study system effectiveness, feasibility, accuracy, or reliability [ XREF_ref22_END , XREF_ref23_END , XREF_ref30_END , XREF_ref41_END , XREF_ref42_END , XREF_ref44_END , XREF_ref45_END , XREF_ref50_END , XREF_ref54_END - XREF_ref56_END , XREF_ref61_END , XREF_ref71_END ]",29739739,../PMC_input/JMIR_Rehabil_Assist_Technol/PMC5964303.nxml,27604989,0,1,review-article,JMIR Rehabilitation and Assistive Technologies,2018']


### Training Datasets

- Sentence
  - Training: TRAINDATA_ST = 'train_final_dataset_complete_shuffled.tsv'
  - Validation: VALIDDATA_ST = 'valid_final_dataset_complete.tsv'

- Training Sentences Dataset

In [199]:
with open(f"{TRAINDATA_PATH}/{TRAINDATA_ST}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        if row[1] == '26495154' and row[2] == '23871074':
            print(row)
#             break
#         if n >= 2: break

['lemon technique', '26495154', '23871074', 'citron lemon lens hr mas nmr spectroscopy', 'high resolution magic angle spinning hr mas nmr technique applied semi solid samples flavedo albedo pulp seeds oil gland content lemon citron studied hr mas nmr spectroscopy used directly intact tissue specimens without physicochemical manipulation hr mas nmr proved suitable technique detecting terpenes sugars organic acids aminoacids osmolites valuable observing changes sugars principal organic acids mainly citric malic ethanol contents pulp specimens strongly point use follow fruit ripening commercial assessment fruit maturity hr mas nmr also used derive molar percentage fatty acid components lipids seeds change depending citrus species varieties finally technique employed elucidate metabolic profile mold flavedo', '2015', '2013', 'evaluation studies|journal article', '1', '-0.123', '3.259', '0.0', '36.580613136291504', '0.5', '1.0', '1']
['fact studies performed different tissues lemon hr mas t

In [159]:
sentences_train = {}
with open(f"{TRAINDATA_PATH}/{TRAINDATA_ST}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        if row[0] not in sentences_train:
            sentences_train[row[0]] = {'pmids':[], 'citations':[], 'labels':[], 'true':0, 'false':0}
        sentences_train[row[0]]['pmids'].append(row[1])
        sentences_train[row[0]]['citations'].append(row[2])
        sentences_train[row[0]]['labels'].append(row[-1])
        if row[-1] == '1':
            sentences_train[row[0]]['true'] += 1
        if row[-1] == '0':
            sentences_train[row[0]]['false'] += 1
print(n)

4035476


In [160]:
len(sentences_train)

1889927

In [161]:
json.dump(sentences_train, open('sentences_train.json', 'w', encoding='utf-8'), indent=4)

In [82]:
t = 0
f = 0
for v in sentences_train.values():
    #if v['true'] < 1:
    t += v['true']
    f += v['false']
print(t, f)

2012828 2022648


In [27]:
sentences_train = json.load(open('sentences_train.json'))

In [28]:
len(sentences_train)

1889927

In [35]:
p, c, l = 0, 0, 0
sentences_trp = []
sentences_trc = []
sentences_trl = []
for k, v in sentences_train.items():
    if len(set(v['pmids'])) > 1:
        p += 1
        sentences_trp.append(k)
    if len(set(v['citations'])) > 1:
        c += 1
        sentences_trc.append(k)
    if len(set(v['labels'])) > 1:
        l += 1
        sentences_trl.append(k)

In [36]:
p, c, l

(5835, 896970, 895773)

In [33]:
sentences_trp[0]

'aniridia male pseudohermaphroditism gonadoblastoma mental retardation del 11p13'

In [34]:
sentences_train[sentences_trp[0]]

{'pmids': ['6114032', '28035502'],
 'citations': ['6114032', '28035502'],
 'labels': ['1', '1'],
 'true': 2,
 'false': 0}

- Validation Sentences Dataset

In [193]:
with open(f"{TRAINDATA_PATH}/{VALIDDATA_ST}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        if row[1] == '26415954' and row[2] == '15852461':
            print(row)
#         if n >= 2: break

In [162]:
sentences_valid = {}
with open(f"{TRAINDATA_PATH}/{VALIDDATA_ST}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        if row[0] not in sentences_valid:
            sentences_valid[row[0]] = {'pmids':[], 'citations':[], 'labels':[], 'true':0, 'false':0}
        sentences_valid[row[0]]['pmids'].append(row[1])
        sentences_valid[row[0]]['citations'].append(row[2])
        sentences_valid[row[0]]['labels'].append(row[-1])
        if row[-1] == '1':
            sentences_valid[row[0]]['true'] += 1
        if row[-1] == '0':
            sentences_valid[row[0]]['false'] += 1
print(n)

462978


In [163]:
len(sentences_valid)

150725

In [164]:
json.dump(sentences_valid, open('sentences_valid.json', 'w', encoding='utf-8'), indent=4)

In [84]:
t = 0
f = 0
for v in sentences_valid.values():
    #if v['false'] < 1:
    t += v['true']
    f += v['false']
print(t, f)

154468 308510


In [115]:
n = 0
for k in sentences_valid.keys():
    if k in sentences_train:
        n += 1
print(n)

28337


In [29]:
sentences_valid = json.load(open('sentences_valid.json'))

In [37]:
p, c, l = 0, 0, 0
sentences_vap = []
sentences_vac = []
sentences_val = []
for k, v in sentences_valid.items():
    if len(set(v['pmids'])) > 1:
        p += 1
        sentences_vap.append(k)
    if len(set(v['citations'])) > 1:
        c += 1
        sentences_vac.append(k)
    if len(set(v['labels'])) > 1:
        l += 1
        sentences_val.append(k)

In [38]:
p, c, l

(305, 150720, 150720)

In [39]:
n = 0
for k, v in sentences_valid.items():
    if k in sentences_train:
        if any(pmid in sentences_train[k]['pmids'] for pmid in v['pmids']):
            n += 1

In [40]:
n

26820

- Keyword
  - Training: TRAINDATA_KW = 'train_final_dataset_sentence_all_keyword_in_abstarct.tsv'
  - Validation: VALIDDATA_KW = 'valid_final_dataset_sentence_all_keyword_in_abstract.tsv'

In [200]:
with open(f"{TRAINDATA_PATH}/{TRAINDATA_KW}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        if row[1] == '26495154' and row[2] == '23871074':
            print(row)
#         print(row)
#         if n >= 2: break

['lemon technique', '26495154', '23871074', 'citron lemon lens hr mas nmr spectroscopy', 'high resolution magic angle spinning hr mas nmr technique applied semi solid samples flavedo albedo pulp seeds oil gland content lemon citron studied hr mas nmr spectroscopy used directly intact tissue specimens without physicochemical manipulation hr mas nmr proved suitable technique detecting terpenes sugars organic acids aminoacids osmolites valuable observing changes sugars principal organic acids mainly citric malic ethanol contents pulp specimens strongly point use follow fruit ripening commercial assessment fruit maturity hr mas nmr also used derive molar percentage fatty acid components lipids seeds change depending citrus species varieties finally technique employed elucidate metabolic profile mold flavedo', '2015', '2013', 'evaluation studies|journal article', '1', '-0.123', '3.259', '0.0', '36.580613136291504', '0.5', '1.0', '1']


In [165]:
keywords_train = {}
with open(f"{TRAINDATA_PATH}/{TRAINDATA_KW}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        if row[0] not in keywords_train:
            keywords_train[row[0]] = {'pmids':[], 'citations':[], 'labels':[], 'true':0, 'false':0}
        keywords_train[row[0]]['pmids'].append(row[1])
        keywords_train[row[0]]['citations'].append(row[2])
        keywords_train[row[0]]['labels'].append(row[-1])
        if row[-1] == '1':
            keywords_train[row[0]]['true'] += 1
        if row[-1] == '0':
            keywords_train[row[0]]['false'] += 1
print(n)

1450389


In [166]:
len(keywords_train)

470968

In [167]:
json.dump(keywords_train, open('keywords_train.json', 'w', encoding='utf-8'), indent=4)

In [88]:
t = 0
f = 0
for v in keywords_train.values():
    if v['true'] < 1:
        t += 1
    if v['false'] < 1:
        f += 1
#     t += v['true']
#     f += v['false']
print(t, f)

81 0


In [41]:
keywords_train = json.load(open('keywords_train.json'))

In [42]:
p, c, l = 0, 0, 0
keywords_trp = []
keywords_trc = []
keywords_trl = []
for k, v in keywords_train.items():
    if len(set(v['pmids'])) > 1:
        p += 1
        keywords_trp.append(k)
    if len(set(v['citations'])) > 1:
        c += 1
        keywords_trc.append(k)
    if len(set(v['labels'])) > 1:
        l += 1
        keywords_trl.append(k)

In [43]:
p, c, l

(4418, 470887, 470887)

In [44]:
keywords_trp[0]

'reads using velvet'

In [45]:
keywords_train[keywords_trp[0]]

{'pmids': ['29145801',
  '29145801',
  '29145801',
  '24812227',
  '24812227',
  '24812227'],
 'citations': ['18349386',
  '8562887',
  '9476594',
  '18349386',
  '16271386',
  '21078174'],
 'labels': ['1', '0', '0', '1', '0', '0'],
 'true': 2,
 'false': 4}

In [46]:
with open(f"{TRAINDATA_PATH}/{VALIDDATA_KW}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'label']
['models language interactions working memory capacity episodic semantic memory et rönnberg et', '28690579', '23874273', 'ease language understanding elu model theoretical empirical clinical advances', 'working memory important online language processing conversation use maintain relevant information inhibit ignore irrelevant information attend conversation selectively working memory helps us keep track actively participate conversation including taking turns following gist paper examines ease language understanding model i e elu model rönnberg 2003 rönnberg 2008 light new behavioral neural findings concerning role working memory capacity wmc uni modal bimodal language processing new elu model meaning prediction system depends phonological 

In [168]:
keywords_valid = {}
with open(f"{TRAINDATA_PATH}/{VALIDDATA_KW}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        if row[0] not in keywords_valid:
            keywords_valid[row[0]] = {'pmids':[], 'citations':[], 'labels':[], 'true':0, 'false':0}
        keywords_valid[row[0]]['pmids'].append(row[1])
        keywords_valid[row[0]]['citations'].append(row[2])
        keywords_valid[row[0]]['labels'].append(row[-1])
        if row[-1] == '1':
            keywords_valid[row[0]]['true'] += 1
        if row[-1] == '0':
            keywords_valid[row[0]]['false'] += 1
print(n)

221200


In [48]:
len(keywords_valid)

73297

In [170]:
json.dump(keywords_valid, open('keywords_valid.json', 'w', encoding='utf-8'), indent=4)

In [92]:
t = 0
f = 0
for v in keywords_valid.values():
#     if v['true'] < 1:
#         t += 1
#     if v['false'] < 1:
#         f += 1
    t += v['true']
    f += v['false']
print(t, f)

73798 147402


In [114]:
n = 0
for k in keywords_valid.keys():
    if k in keywords_train:
        n += 1
print(n)

2847


In [46]:
keywords_valid = json.load(open('keywords_valid.json'))

In [47]:
p, c, l = 0, 0, 0
keywords_vap = []
keywords_vac = []
keywords_val = []
for k, v in keywords_valid.items():
    if len(set(v['pmids'])) > 1:
        p += 1
        keywords_vap.append(k)
    if len(set(v['citations'])) > 1:
        c += 1
        keywords_vac.append(k)
    if len(set(v['labels'])) > 1:
        l += 1
        keywords_val.append(k)

In [49]:
p, c, l

(305, 73289, 73289)

In [50]:
n = 0
for k, v in keywords_valid.items():
    if k in keywords_train:
        if any(pmid in keywords_train[k]['pmids'] for pmid in v['pmids']):
            n += 1

In [51]:
n

1361

### Test Datasets

- Sentence for SQL: TESTDATA_PATH_SQL_ST = "case_SQL_testing_final_complete"
  - DL_data_search_result_sentence_citation461.csv (461-560)


In [90]:
with open(f"{TESTDATA_PATH_SQL_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        if n >= 3: break

['sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'rank']
['study investigated function ctgf psoriasis using established imiquimod imq induced psoriasis murine model samples psoriasis patients', '29386832', '28507585', '19380832', 'ctgf upregulation correlates mmp 9 level airway remodeling murine model asthma', 'connective tissue growth factor ctgf mediates hypertrophy proliferation extracellular matrix synthesis matrix metalloproteinase mmp plays role airway extracellular matrix remodeling correlation ctgf mmp airway remodeling asthma unknown study investigated lung ctgf expression correlation mmp airway structural changes murine model asthma female balb c mice sensitized challenged intraperitoneal injections intranasal phosphate buffered saline pbs ovalbumin ova airway responsive

In [171]:
sentences_test_sql = {}
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_SQL_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            n += 1
            if row[0] not in sentences_test_sql:
                sentences_test_sql[row[0]] = {'pmids':[], 'retpmids':[], 'citations':[], 'ranks':[],
                                              'retnum':0, 'intrain':0}
            sentences_test_sql[row[0]]['pmids'].append(row[1])
            sentences_test_sql[row[0]]['retpmids'].append(row[2])
            sentences_test_sql[row[0]]['citations'].append(row[3])
            sentences_test_sql[row[0]]['ranks'].append(row[-1])
            sentences_test_sql[row[0]]['retnum'] += 1
            if row[0] in sentences_train or row[0] in sentences_valid:
                sentences_test_sql[row[0]]['intrain'] += 1
print(n)

102552085


In [56]:
len(sentences_test_sql)

97212

In [173]:
json.dump(sentences_test_sql, open('sentences_test_sql.json', 'w', encoding='utf-8'), indent=4)

In [65]:
g = 0
l = 0
f = 0
r = 0
sentences_ret = []
for k, v in sentences_test_sql.items():
    if v['retnum'] > 1000:
        g += 1
    if v['retnum'] < 1000:
        l += 1
    if v['intrain'] > 0:
        f += 1
    if len(set(v['retpmids'])) > 1000:
        r += 1
    if v['retnum'] > len(set(v['retpmids'])):
        sentences_ret.append(k)
print(g, l, f, r)

46585 41 28927 7


In [82]:
len(sentences_ret)

46588

In [91]:
sent = sentences_ret[1000]
sent

'ceatg mouse cea expression mtec resulted tolerization major fraction cell repertoire'

In [92]:
len(sentences_test_sql[sent]['pmids']), len(set(sentences_test_sql[sent]['pmids']))

(1002, 1)

In [93]:
len(sentences_test_sql[sent]['citations']), len(set(sentences_test_sql[sent]['citations']))

(1002, 1)

In [94]:
len(sentences_test_sql[sent]['retpmids']), len(set(sentences_test_sql[sent]['retpmids']))

(1002, 1000)

In [95]:
len(sentences_test_sql[sent]['ranks']), len(set(sentences_test_sql[sent]['ranks']))

(1002, 1)

In [96]:
set(sentences_test_sql[sent]['ranks'])

{'1'}

In [70]:
sentences_test_sql = json.load(open('sentences_test_sql.json'))

In [54]:
p, c, l = 0, 0, 0
sentences_tsqp = []
sentences_tsqc = []
sentences_tsql = []
for k, v in sentences_test_sql.items():
    if len(set(v['pmids'])) > 1:
        p += 1
        sentences_tsqp.append(k)
    if len(set(v['citations'])) > 1:
        c += 1
        sentences_tsqc.append(k)
    if len(set(v['retpmids'])) > 1000:
        l += 1
        sentences_tsql.append(k)

In [55]:
p, c, l

(2, 4710, 7)

In [60]:
n = 0
for k, v in sentences_test_sql.items():
    if k in sentences_train:
        if any(pmid in sentences_train[k]['pmids'] for pmid in v['pmids']):
            n += 1
    elif k in sentences_valid:
        if any(pmid in sentences_valid[k]['pmids'] for pmid in v['pmids']):
            n += 1

In [61]:
n

28910

- Keyword for SQL: TESTDATA_PATH_SQL_KW = "case_SQL_testing_final_complete_keywordOnly"
  - DL_data_search_result_sentence_citation461.csv (461-560)

In [48]:
with open(f"{TESTDATA_PATH_SQL_KW}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #test_data.append(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'rank']
['psoriasis imiquimod imq induced psoriasis model psoriasis', '29386832', '29305258', '19380832', 'irf 2 haploinsufficiency causes enhanced imiquimod induced psoriasis like skin inflammation', 'ifn regulatory factor irf 2 one potential susceptibility genes psoriasis gene influences psoriasis pathogenesis unclear topical application imiquimod imq tlr7 ligand induces psoriasis like skin lesions mice the aim study investigate whether irf 2 gene status would influence severity skin disease imq treated mice imiquimod induced psoriasis like skin inflammation assessed clinical findings histology cytokine expression effects imiquimod ifn peritoneal macrophages analyzed vitro imq induced skin inflammation assessed clinical findings histology severe

In [174]:
keywords_test_sql = {}
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_SQL_KW}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            n += 1
            if row[0] not in keywords_test_sql:
                keywords_test_sql[row[0]] = {'pmids':[], 'retpmids':[], 'citations':[], 'ranks':[],
                                             'retnum':0, 'intrain':0}
            keywords_test_sql[row[0]]['pmids'].append(row[1])
            keywords_test_sql[row[0]]['retpmids'].append(row[2])
            keywords_test_sql[row[0]]['citations'].append(row[3])
            keywords_test_sql[row[0]]['ranks'].append(row[-1])
            keywords_test_sql[row[0]]['retnum'] += 1
            if row[0] in keywords_train or row[0] in keywords_valid:
                keywords_test_sql[row[0]]['intrain'] += 1
print(n)

89940902


In [175]:
len(keywords_test_sql)

89970

In [176]:
json.dump(keywords_test_sql, open('keywords_test_sql.json', 'w', encoding='utf-8'), indent=4)

In [101]:
g = 0
l = 0
f = 0
for v in keywords_test_sql.values():
    if v['retnum'] > 1000:
        g += 1
    if v['retnum'] < 1000:
        l += 1
    if v['intrain'] > 0:
        f += 1
print(g, l, f)

65797 763 621


- Sentence for PubMed TF-IDF: TESTDATA_PATH_PTF_ST = "case_PubMed_testing_final_complete"
  - DL_data_search_result_sentence_citation461.csv (461-560)

In [49]:
with open(f"{TESTDATA_PATH_PTF_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #test_data.append(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'rank']
['contrast reported gastric luminal challenge specific antigen decreased phasic antral activity sensitized rats', '28144845', '27197667', '8194696', 'curcumin blocks naproxen induced gastric antral ulcerations inhibition lipid peroxidation activation enzymatic scavengers rats', 'curcumin polyphenol derived plant curcuma longa used treatment diseases associated oxidative stress inflammation present study undertaken determine protective effect curcumin naproxen induced gastric antral ulcerations rats different doses 10 50 100 mg kg curcumin vehicle curcumin 0 mg kg pretreated 3 days oral gavage gastric mucosal lesions caused 80 mg kg naproxen applied 3 days curcumin significantly inhibited naproxen induced gastric antral ulcer

In [177]:
sentences_test_ptf = {}
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_PTF_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            n += 1
            if row[0] not in sentences_test_ptf:
                sentences_test_ptf[row[0]] = {'pmids':[], 'retpmids':[], 'citations':[], 'ranks':[],
                                              'retnum':0, 'intrain':0}
            sentences_test_ptf[row[0]]['pmids'].append(row[1])
            sentences_test_ptf[row[0]]['retpmids'].append(row[2])
            sentences_test_ptf[row[0]]['citations'].append(row[3])
            sentences_test_ptf[row[0]]['ranks'].append(row[-1])
            sentences_test_ptf[row[0]]['retnum'] += 1
            if row[0] in sentences_train or row[0] in sentences_valid:
                sentences_test_ptf[row[0]]['intrain'] += 1
print(n)

100128069


In [178]:
len(sentences_test_ptf)

96948

In [179]:
json.dump(sentences_test_ptf, open('sentences_test_ptf.json', 'w', encoding='utf-8'), indent=4)

In [104]:
g = 0
l = 0
f = 0
for v in sentences_test_ptf.values():
    if v['retnum'] > 1000:
        g += 1
    if v['retnum'] < 1000:
        l += 1
    if v['intrain'] > 0:
        f += 1
print(g, l, f)

4675 86772 28837


- Keyword for PubMed TF-IDF: TESTDATA_PATH_PTF_KW = "case_PubMed_testing_final_keywordOnly"
  - DL_data_search_result_sentence_citation461.csv (461-560)

In [50]:
with open(f"{TESTDATA_PATH_PTF_KW}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #test_data.append(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'rank', 'sentence_orig']
['low vitamin status urban rural population andhra pradesh state south india', '26644714', '26667891', '18497434', 'status vitamin b12 folate among urban adult population south india', 'deficiency vitamin b12 b12 folate fa leads wide spectrum disorders affect age groups however reports b12 fa status healthy adults india limited hence determined plasma levels dietary intake b12 fa adult population we conducted community based cross sectional study urban setup among 630 apparently healthy adults distributed 3 age groups 21 40 41 60 60 years plasma concentrations b12 fa analyzed radio immunoassay dietary intake 24 hour recall method the overall prevalence fa deficiency 12 significant difference plasma fa concen

In [180]:
keywords_test_ptf = {}
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_PTF_KW}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            n += 1
            if row[0] not in keywords_test_ptf:
                keywords_test_ptf[row[0]] = {'pmids':[], 'retpmids':[], 'citations':[], 'ranks':[],
                                             'retnum':0, 'intrain':0}
            keywords_test_ptf[row[0]]['pmids'].append(row[1])
            keywords_test_ptf[row[0]]['retpmids'].append(row[2])
            keywords_test_ptf[row[0]]['citations'].append(row[3])
            keywords_test_ptf[row[0]]['ranks'].append(row[-1])
            keywords_test_ptf[row[0]]['retnum'] += 1
            if row[0] in keywords_train or row[0] in keywords_valid:
                keywords_test_ptf[row[0]]['intrain'] += 1
print(n)

99554677


In [181]:
len(keywords_test_ptf)

102070

In [182]:
json.dump(keywords_test_ptf, open('keywords_test_ptf.json', 'w', encoding='utf-8'), indent=4)

In [107]:
g = 0
l = 0
f = 0
for v in keywords_test_ptf.values():
    if v['retnum'] > 1000:
        g += 1
    if v['retnum'] < 1000:
        l += 1
    if v['intrain'] > 0:
        f += 1
print(g, l, f)

188 96873 808


- Sentence for PubMed BM: TESTDATA_PATH_PBM_ST = "case_PubMed_BM_testing_final_complete"
  - DL_data_search_result_sentence_citation461.csv

In [51]:
with open(f"{TESTDATA_PATH_PBM_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #test_data.append(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'rank']
['study used multilocus sequence typing mlst method', '17543117', '23979428', '15944444', 'mlst revisited gene by gene approach bacterial genomics', 'multilocus sequence typing mlst proposed 1998 portable sequence based method identifying clonal relationships among bacteria today whole genome era microbiology need systematic standardized descriptions bacterial genotypic variation remains priority here meet need draw successes mlst 16s rrna gene sequencing propose hierarchical gene by gene approach reflects functional evolutionary relationships catalogues bacteria domain strain gene based typing approach using online platforms bacterial isolate genome sequence database bigsdb allows scalable organization analysis whole genome

In [183]:
sentences_test_pbm = {}
n = 0
with open(f"{TESTDATA_PATH_PBM_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    for row in csvreader:
        n += 1
        if row[0] not in sentences_test_pbm:
            sentences_test_pbm[row[0]] = {'pmids':[], 'retpmids':[], 'citations':[], 'ranks':[],
                                          'retnum':0, 'intrain':0}
        sentences_test_pbm[row[0]]['pmids'].append(row[1])
        sentences_test_pbm[row[0]]['retpmids'].append(row[2])
        sentences_test_pbm[row[0]]['citations'].append(row[3])
        sentences_test_pbm[row[0]]['ranks'].append(row[-1])
        sentences_test_pbm[row[0]]['retnum'] += 1
        if row[0] in sentences_train or row[0] in sentences_valid:
            sentences_test_pbm[row[0]]['intrain'] += 1
print(n)

759318


In [184]:
len(sentences_test_pbm)

1007

In [185]:
json.dump(sentences_test_pbm, open('sentences_test_pbm.json', 'w', encoding='utf-8'), indent=4)

In [110]:
g = 0
l = 0
f = 0
for v in sentences_test_pbm.values():
    if v['retnum'] > 1000:
        g += 1
    if v['retnum'] < 1000:
        l += 1
    if v['intrain'] > 0:
        f += 1
print(g, l, f)

2 1005 305


- Sentence for Google Scholar: TESTDATA_PATH_GGS_ST = "case_Google_scholar_testing_final_complete"
  - DL_data_search_result_sentence_citation461.csv

In [98]:
with open(f"{TESTDATA_PATH_GGS_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #test_data.append(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'rank']
['global test proposed', '21245948', '12320315', '14693814', 'international migration global challenge', 'trends international migration presented multiregional analysis seven worlds wealthiest countries 33 worlds migrant population 16 total world population population growth countries substantially affected migrant population migration challenge external internal external challenge balance need foreign labor commitment human rights migrants seeking economic opportunity political freedom internal challenge assure social adjustment immigrants children integrate society citizens future leaders people cross national borders migration flows likely evolve next decades explained report also presents ways countries manage migration

In [186]:
sentences_test_ggs = {}
n = 0
with open(f"{TESTDATA_PATH_GGS_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    for row in csvreader:
        n += 1
        if row[0] not in sentences_test_ggs:
            sentences_test_ggs[row[0]] = {'pmids':[], 'retpmids':[], 'citations':[], 'ranks':[],
                                          'retnum':0, 'intrain':0}
        sentences_test_ggs[row[0]]['pmids'].append(row[1])
        sentences_test_ggs[row[0]]['retpmids'].append(row[2])
        sentences_test_ggs[row[0]]['citations'].append(row[3])
        sentences_test_ggs[row[0]]['ranks'].append(row[-1])
        sentences_test_ggs[row[0]]['retnum'] += 1
        if row[0] in sentences_train or row[0] in sentences_valid:
            sentences_test_ggs[row[0]]['intrain'] += 1
print(n)

6335


In [187]:
len(sentences_test_ggs)

481

In [188]:
json.dump(sentences_test_ggs, open('sentences_test_ggs.json', 'w', encoding='utf-8'), indent=4)

In [113]:
g = 0
l = 0
f = 0
for v in sentences_test_ggs.values():
    if v['retnum'] > 1000:
        g += 1
    if v['retnum'] < 1000:
        l += 1
    if v['intrain'] > 0:
        f += 1
print(g, l, f)

0 481 138


### Combine into an Overall Dataset

**Full Sentences**

- From Training Set

In [97]:
all_sentences_dataset = {}
with open(f"{TRAINDATA_PATH}/{TRAINDATA_ST}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        sent = f"{row[0]}|{row[1]}"
        if sent not in all_sentences_dataset:
            all_sentences_dataset[sent] = {'train':{'citations':[], 'labels':[]}}
        all_sentences_dataset[sent]['train']['citations'].append(row[2])
        all_sentences_dataset[sent]['train']['labels'].append(row[-1])
print(n)

4035476


In [98]:
len(all_sentences_dataset)

1904033

In [101]:
list(all_sentences_dataset.keys())[0], all_sentences_dataset[list(all_sentences_dataset.keys())[0]]

('another study center cryptogenic cirrhosis autoimmune hepatitis related cirrhosis reported main indications liver transplantation children|24829669',
 {'train': {'citations': ['17430479', '10685237', '1319888'],
   'labels': ['1', '0', '0']}})

- From Validation Set

In [102]:
with open(f"{TRAINDATA_PATH}/{VALIDDATA_ST}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        sent = f"{row[0]}|{row[1]}"
        if sent not in all_sentences_dataset:
            all_sentences_dataset[sent] = {'valid':{'citations':[], 'labels':[]}}
        all_sentences_dataset[sent]['valid'] = all_sentences_dataset[sent].get('valid', {'citations':[], 'labels':[]})
        all_sentences_dataset[sent]['valid']['citations'].append(row[2])
        all_sentences_dataset[sent]['valid']['labels'].append(row[-1])
print(n)

462978


In [103]:
len(all_sentences_dataset)

2028333

- From Test SQL Dataset

In [104]:
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_SQL_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            n += 1
            sent = f"{row[0]}|{row[1]}"
            if sent not in all_sentences_dataset:
                all_sentences_dataset[sent] = {'test_sql':{'retpmids':[], 'citations':[], 'ranks':[]}}
            all_sentences_dataset[sent]['test_sql'] = all_sentences_dataset[sent].get('test_sql',
                                                                                      {'retpmids':[],
                                                                                       'citations':[],
                                                                                       'ranks':[]})
            all_sentences_dataset[sent]['test_sql']['retpmids'].append(row[2])
            all_sentences_dataset[sent]['test_sql']['citations'].append(row[3])
            all_sentences_dataset[sent]['test_sql']['ranks'].append(row[-1])
print(n)

102552085


In [105]:
len(all_sentences_dataset)

2096637

- From Test PTF Dataset

In [106]:
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_PTF_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            n += 1
            sent = f"{row[0]}|{row[1]}"
            if sent not in all_sentences_dataset:
                all_sentences_dataset[sent] = {'test_ptf':{'retpmids':[], 'citations':[], 'ranks':[]}}
            all_sentences_dataset[sent]['test_ptf'] = all_sentences_dataset[sent].get('test_ptf',
                                                                                      {'retpmids':[],
                                                                                       'citations':[],
                                                                                       'ranks':[]})
            all_sentences_dataset[sent]['test_ptf']['retpmids'].append(row[2])
            all_sentences_dataset[sent]['test_ptf']['citations'].append(row[3])
            all_sentences_dataset[sent]['test_ptf']['ranks'].append(row[-1])
print(n)

100128069


In [107]:
len(all_sentences_dataset)

2096637

- From Test PBM Dataset

In [108]:
n = 0
with open(f"{TESTDATA_PATH_PBM_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    for row in csvreader:
        n += 1
        sent = f"{row[0]}|{row[1]}"
        if sent not in all_sentences_dataset:
            all_sentences_dataset[sent] = {'test_pbm':{'retpmids':[], 'citations':[], 'ranks':[]}}
        all_sentences_dataset[sent]['test_pbm'] = all_sentences_dataset[sent].get('test_pbm',
                                                                                  {'retpmids':[],
                                                                                   'citations':[],
                                                                                   'ranks':[]})
        all_sentences_dataset[sent]['test_pbm']['retpmids'].append(row[2])
        all_sentences_dataset[sent]['test_pbm']['citations'].append(row[3])
        all_sentences_dataset[sent]['test_pbm']['ranks'].append(row[-1])
print(n)

759318


In [109]:
len(all_sentences_dataset)

2096637

- From Test GGS Dataset

In [110]:
n = 0
with open(f"{TESTDATA_PATH_GGS_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    for row in csvreader:
        n += 1
        sent = f"{row[0]}|{row[1]}"
        if sent not in all_sentences_dataset:
            all_sentences_dataset[sent] = {'test_ggs':{'retpmids':[], 'citations':[], 'ranks':[]}}
        all_sentences_dataset[sent]['test_ggs'] = all_sentences_dataset[sent].get('test_ggs',
                                                                                  {'retpmids':[],
                                                                                   'citations':[],
                                                                                   'ranks':[]})
        all_sentences_dataset[sent]['test_ggs']['retpmids'].append(row[2])
        all_sentences_dataset[sent]['test_ggs']['citations'].append(row[3])
        all_sentences_dataset[sent]['test_ggs']['ranks'].append(row[-1])
print(n)

6335


In [111]:
len(all_sentences_dataset)

2096640

In [112]:
json.dump(all_sentences_dataset, open('all_sentences_dataset.json', 'w', encoding='utf-8'), indent=4)

In [182]:
train_sentences = {}
n, m, q, r = 0, 0, 0, 0
dataset = 'train'
for k, v in all_sentences_dataset.items():
    if dataset in v:
        truecite = [v[dataset]['citations'][i] for i, l in enumerate(v[dataset]['labels']) if l == '1']
        falsecite = [v[dataset]['citations'][i] for i, l in enumerate(v[dataset]['labels']) if l == '0']
        if len(set(truecite)&set(falsecite)) == 0:
            if len(set(truecite)) == 1:
                n += 1
                if len(set(truecite))*2 < len(falsecite):
                    q += 1
        if len(falsecite) > len(set(falsecite)):
#         if len(truecite) > len(set(truecite)):
#         if len(set(truecite)&set(falsecite)) > 0:
            r += 1
#             print (k, v[dataset])

In [183]:
n, m, q, r

(1824076, 0, 954, 3958)

In [201]:
all_sentences_dataset[list(all_sentences_dataset.keys())[1800000]]

{'train': {'citations': ['28338784'], 'labels': ['1']}}

In [186]:
n, m, q, r = 0, 0, 0, 0
for k, v in all_sentences_dataset.items():
    if len(k.split()) < 5:
        r += 1
        if 'train' in v:
            n += 1
        if 'valid' in v:
            m += 1
        if 'test_sql' in v:
            q += 1
n, m, q, r

(267556, 34378, 222, 300977)

## New Datasets

We will work on the sentence datasets only. After new datasets are created, use them on the original decomposed attention model and other deep learning models.

### Split the Datasets

- Split the overall dataset into training, validation and test datasets
  - Keep all sentences in the original test set as test dataset
  - Keep all sentences in the original validation set but not in the test dataset as validation dataset
  - Keep all sentences in the original training set but not in the validation and test sets as training dataset

- Training/validation datasets
  - Retain one sample for each positive case
  - Retain all negative samples

- Test dataset
  - Retain all samples

In [212]:
train_sentences = {}
valid_sentences = {}
test_sentences_sql = {}
test_sentences_ptf = {}
n, m, q, r = 0, 0, 0, 0
for k, v in all_sentences_dataset.items():
    if 'test_ptf' in v:
        n += 1
        test_sentences_ptf[k] = v['test_ptf']
        test_sentences_sql[k] = v['test_sql']
    elif 'valid' in v:
        m += 1
        valid_sentences[k] = v['valid']
    elif 'train' in v:
        q += 1
        train_sentences[k] = v['train']

In [213]:
n, m, q, r

(96950, 145451, 1854062, 0)

- Training dataset

In [224]:
json.dump(train_sentences, open('train_sentences.json', 'w', encoding='utf-8'), indent=4)

In [214]:
train_sentences[list(train_sentences.keys())[1000]]

{'citations': ['25173015', '22673783', '17949137'], 'labels': ['1', '0', '0']}

In [230]:
n, m, p, q, r = 0, 0, 0, 0, 0
for k, v in train_sentences.items():
    truecite = [v['citations'][i] for i, l in enumerate(v['labels']) if l == '1']
    falsecite = [v['citations'][i] for i, l in enumerate(v['labels']) if l == '0']
    if len(truecite) == 0:
        n += 1
    if len(truecite) > 1:
        m += 1
        if len(truecite) > len(set(truecite)):
            p += 1
    if len(falsecite) > len(set(falsecite)):
        q += 1
    if len(set(truecite)&set(falsecite)) != 0:
        r += 1

In [231]:
n, m, p, q, r

(11, 65339, 1017, 3095, 994)

- Validation dataset

In [225]:
json.dump(valid_sentences, open('valid_sentences.json', 'w', encoding='utf-8'), indent=4)

In [215]:
valid_sentences[list(valid_sentences.keys())[1000]]

{'citations': ['15019480', '11142531', '18728730'], 'labels': ['1', '0', '0']}

In [232]:
n, m, p, q, r = 0, 0, 0, 0, 0
for k, v in valid_sentences.items():
    truecite = [v['citations'][i] for i, l in enumerate(v['labels']) if l == '1']
    falsecite = [v['citations'][i] for i, l in enumerate(v['labels']) if l == '0']
    if len(truecite) == 0:
        n += 1
    if len(truecite) > 1:
        m += 1
        if len(truecite) > len(set(truecite)):
            p += 1
    if len(falsecite) > len(set(falsecite)):
        q += 1
    if len(set(truecite)&set(falsecite)) != 0:
        r += 1

In [233]:
n, m, p, q, r

(4, 2654, 48, 221, 38)

- Test datasets

In [19]:
test_sentences = set()
for k in test_sentences_sql:
    if k in test_sentences_ptf:
        test_sentences.add(k)

json.dump(list(test_sentences), open('test_sentences.json', 'w', encoding='utf-8'))

In [241]:
len(test_sentences_sql[list(test_sentences_sql.keys())[1000]]['retpmids'])

1000

In [238]:
n, m, p, q, r = 0, 0, 0, 0, 0
for k, v in test_sentences_ptf.items():
    citations = set(v['citations'])
    ranks = set(v['ranks'])
    retpmids = v['retpmids']
    if len(citations) > 1:
        n += 1
    if len(citations) != len(ranks):
        m += 1
    if len(retpmids) > len(set(retpmids)):
        p += 1
        if len(citations) > 1:
            r += 1
    for pmid in citations:
        if not pmid in retpmids:
            q += 1

In [239]:
n, m, p, q, r

(4670, 96948, 4672, 38105, 4669)


## Develop Datasets

**Original Dataset**

- Training: TRAINDATA_ST = 'train_final_dataset_complete_shuffled.tsv'
- Validation: VALIDDATA_ST = 'valid_final_dataset_complete.tsv'
- Test: TESTDATA_PATH_SQL_ST = "case_SQL_testing_final_complete" / TESTDATA_PATH_PTF_ST = "case_PubMed_testing_final_complete"

**New Datasets**

- Dataset Dictionaries
  - Training: 'train_sentences.json'
  - Validation: 'valid_sentences.json'
  - Test: 'test_sentences_sql.json' / 'test_sentences_ptf.json'
- New datasets
  - Training: 'train_sentences.tsv'
  - Validation: 'valid_sentences.tsv'
  - Test: 'test_sentences_sql.json' / 'test_sentences_ptf.json' / 'test_sentences_pbm.json' / TEST_ST_PATH = "test_st"

### New Training Dataset

In [4]:
train_sentences = json.load(open('train_sentences.json'))

In [5]:
len(train_sentences)

1854062

In [12]:
n, t, f, m, r, p = 0, 0, 0, 0, 0, 0
for k, v in train_sentences.items():
    n += len(v['labels'])
    pos = [i for i in v['labels'] if i == '1']
    neg = [i for i in v['labels'] if i == '0']
    t += len(pos)
    f += len(neg)
    if len(neg) == 0:
        m += 1
        if len(pos) > 1:
            r += 1
            print(k, v)
        elif k.split('|')[1] != v['citations'][0]:
            p += 1
print(n, t, f, m, r, p)

asian population gender difference nonalcoholic fatty liver disease post menopausal ultrasonography|29093809 {'citations': ['29093809', '29093809'], 'labels': ['1', '1']}
preliminary investigation deoxyoligonucleotide binding ribonuclease using mass spectrometry attempt develop lab experience undergraduates|29721314 {'citations': ['29721314', '29721314'], 'labels': ['1', '1']}
scale up alcohol use disorder city harmful use alcohol heavy drinking implementation primary health care training support|29188013 {'citations': ['29188013', '29188013'], 'labels': ['1', '1']}
phytophthora cinnamomi phytophthora genome plant pathogen|29188023 {'citations': ['29188023', '29188023'], 'labels': ['1', '1']}
multi photon microscopy optical imaging photoacoustic imaging tomography optical coherence ultrasonography|25754364 {'citations': ['25754364', '25754364'], 'labels': ['1', '1']}
borrelia burgdorferi lyme borreliosi chronic lyme disease sexual transmission spirochete|28690828 {'citations': ['286908

In [13]:
with open(f"{TRAINDATA_PATH}/{TRAINDATA_ST}", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    for row in csvreader:
        sent = f"{row[0]}|{row[1]}"
        if sent in train_sentences:
            pos = [i for i in train_sentences[sent]['labels'] if i == '1']
            neg = [i for i in train_sentences[sent]['labels'] if i == '0']
            if len(neg) == 0:
                if len(pos) > 1:
                    print(row)

['asian population gender difference nonalcoholic fatty liver disease post menopausal ultrasonography', '29093809', '29093809', 'gender differences prevalence nonalcoholic fatty liver disease northeast thailand population based cross sectional study', 'background nonalcoholic fatty liver disease nafld leading cause chronic liver disease large number studies strongly described larger proportions men afflicted nafld women however recent studies investigating role gender nafld exposed contrary methods cross sectional study utilized data baseline survey ongoing cohort study called cholangiocarcinoma screening care program cascap conducted northeastern region thailand march 2013 september 2015 information regarding socio demographic including gender collected using standardized self administered questionnaire nafld diagnosed ultrasonography board certified radiologists binomial regression used estimating prevalence differences odds ratios or 95 confidence intervals ci nafld men women result

['scale up alcohol use disorder city harmful use alcohol heavy drinking implementation primary health care training support', '29188013', '29188013', 'scaling up primary health care based prevention management heavy drinking municipal level middle income countries latin america background protocol three country quasi experimental study', 'background primary health care phc based prevention management heavy drinking clinically effective cost effective remains poorly implemented routine practice systematic reviews multi country studies demonstrated ability training support programmes increase phc based screening brief advice activity reduce heavy drinking however gains modest short term best studies concluded effective uptake could achieved embedding phc activity within broader community municipal support protocol quasi experimental study compare phc based prevention management heavy drinking three intervention cities colombia mexico peru three comparator cities countries implementation 

['bmal1 clock dna binding assay chromatin circadian clock phosphorylation', '28928952', '28928952', 'simple method measure clock bmal1 dna binding activity tissue cell extracts', 'proteins clock bmal1 form heterodimeric transcription factor essential circadian rhythms mammals daily rhythms clock bmal1 dna binding activity known oscillate target gene expression vivo present highly sensitive assay recapitulates native clock bmal1 dna binding rhythms crude tissue extracts call clock protein dna binding assay cpdba method detect less 2 fold differences dna binding activity deliver results two hours less using 10 microliters less crude extract requiring neither specialized equipment expensive probes demonstrate sensitivity versatility assay show enzymatic removal phosphate groups proteins tissue extracts pharmacological inhibition casein kinase cell culture increased clock bmal1 dna binding activity 1 5 2 fold measured cpdba addition show cpdba measure clock bmal1 binding reconstituted chro

['preliminary investigation deoxyoligonucleotide binding ribonuclease using mass spectrometry attempt develop lab experience undergraduates', '29721314', '29721314', 'preliminary investigation deoxyoligonucleotide binding ribonuclease using mass spectrometry attempt develop lab experience undergraduates', 'deoxyoligonucleotide binding bovine pancreatic ribonuclease rnase a investigated using electrospray ionization ion trap mass spectrometry esi it ms deoxyoligonucleotides included ccccc dc 5 ccacc dc 2ac 2 work attempt develop biochemistry lab experience would introduce undergraduates use mass spectrometry analysis protein ligand interactions titration experiments performed using fixed rnase concentration variable deoxyoligonucleotide concentrations samples equilibrium infused directly mass spectrometer native conditions deoxyoligonucleotide mass spectra showed one to one binding stoichiometry marked increases total ion abundance ligand bound rnase complexes function concentration acc

['phytophthora cinnamomi phytophthora genome plant pathogen', '29188023', '29188023', 'draft genomes two australian strains plant pathogen phytophthora cinnamomi', 'background oomycete plant pathogen phytophthora cinnamomi responsible destruction thousands species native australian plants well several crops avocado macadamia one widest host plant ranges phytophthora genus currently available genome p cinnamomi based atypical strain large gaps assembly studies pathogenicity species especially australia robust assemblies genomes typical strains required report genome sequencing draft assembly preliminary annotation two geographically separated australian strains p cinnamomi findings 308 million raw reads generated two strains independent genome assembly produced final genomes 62 8 mb in 14 268 scaffolds 68 1 mb in 10 084 scaffolds comparable size contiguity phytophthora genomes gene prediction yielded 22 000 predicted protein encoding genes within genome busco assessment showed 82 5 81 8

['h3n2 electronic biology influenza vaccine efficacy', '29636902', '29636902', 'using electronic biology based platform predict flu vaccine efficacy 2018 2019', 'flu epidemics potential pandemics pose great challenges public health institutions scientists vaccine producers creating right vaccine composition different parts world trivial historically problematic often resulted decrease vaccinations reduced trust public health officials improve future protection population flu urgently need new methods vaccine efficacy prediction vaccine virus selection', '2019', '2018', 'journal article', '0', '-0.646', '0.0', '0.0', '0.0', '0.6666666666666666', '0.3333333333333333', '1']
['biochemical circuit hill equation aggregation robustness synthetic biology systems biology', '29938108', '29938108', 'biochemical logarithmic sensor broad dynamic range', 'sensory perception often scales logarithmically input level similarly output response biochemical systems sometimes scales logarithmically input s

['bioinformatic cancer encode epigenomic genomic roadmap tcga non coding', '28232861', '28232861', 'tcga workflow analyze cancer genomics epigenomics data using bioconductor packages', 'biotechnological advances sequencing led explosion publicly available data via large international consortia cancer genome atlas tcga encyclopedia dna elements encode nih roadmap epigenomics mapping consortium roadmap projects provided unprecedented opportunities interrogate epigenome cultured cancer cell lines well normal tumor tissues high genomic resolution bioconductor project offers 1 000 open source software statistical packages analyze high throughput genomic data however packages designed specific data types eg expression epigenetics genomics comprehensive tool provides complete integrative analysis harnessing resources data provided three public projects need create integration different analyses recently proposed workflow provide series biologically focused integrative downstream analyses diff

In [23]:
with open(f"{TRAINDATA_PATH}/train_sentences.tsv", 'w', encoding='utf-8', newline='') as wfile:
    csvwriter = csv.writer(wfile, delimiter='\t')
    with open(f"{TRAINDATA_PATH}/{TRAINDATA_ST}", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        csvwriter.writerow(header)
        n = 0
        for row in csvreader:
            sent = f"{row[0]}|{row[1]}"
            if sent in train_sentences:
#                 csvwriter.writerow(row)
#                 n += 1
                pos = [i for i in train_sentences[sent]['labels'] if i == '1']
                neg = [i for i in train_sentences[sent]['labels'] if i == '0']
                if len(neg) != 0:
                    csvwriter.writerow(row)
                    n += 1
print(n)

2806978


### New Validation Dataset

In [14]:
valid_sentences = json.load(open('valid_sentences.json'))

In [15]:
len(valid_sentences)

145451

In [16]:
n, t, f, m, r, p = 0, 0, 0, 0, 0, 0
for k, v in valid_sentences.items():
    n += len(v['labels'])
    pos = [i for i in v['labels'] if i == '1']
    neg = [i for i in v['labels'] if i == '0']
    t += len(pos)
    f += len(neg)
    if len(neg) == 0:
        m += 1
        if len(pos) > 1:
            r += 1
            print(k, v)
        elif k.split('|')[1] != v['citations'][0]:
            p += 1
print(n, t, f, m, r, p)

444397 148269 296128 0 0 0


In [17]:
with open(f"{TRAINDATA_PATH}/valid_sentences.tsv", 'w', encoding='utf-8', newline='') as wfile:
    csvwriter = csv.writer(wfile, delimiter='\t')
    with open(f"{TRAINDATA_PATH}/{VALIDDATA_ST}", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        csvwriter.writerow(header)
        n = 0
        for row in csvreader:
            sent = f"{row[0]}|{row[1]}"
            if sent in valid_sentences:
                csvwriter.writerow(row)
                n += 1
print(n)

444397


### New Test Datasets

In [43]:
test_sentences = set(json.load(open('test_sentences.json')))

In [44]:
len(test_sentences)

96950

**SQL**

In [100]:
search_returns = set()

In [101]:
%%time
test_sentences_sql = {}
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_SQL_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            sent = f"{row[0]}|{row[1]}"
            citation = row[3]
            retpmid = row[2]
            tiscore = row[12]
            tiabscore = row[13]
            rank = row[-1]
            retarticle = [row[4], row[5], row[7], row[8], row[11]]
            if sent in test_sentences:
                n += 1
                if sent not in test_sentences_sql:
                    test_sentences_sql[sent] = {citation:{'retpmids':[], 'tiscores':[], 'tiabscores':[]}}
                if citation not in test_sentences_sql[sent]:
                    test_sentences_sql[sent][citation] = {'retpmids':[], 'tiscores':[], 'tiabscores':[]}
                if not retpmid in test_sentences_sql[sent][citation]['retpmids']:
                    test_sentences_sql[sent][citation]['retpmids'].append(retpmid)
                    test_sentences_sql[sent][citation]['tiscores'].append(tiscore)
                    test_sentences_sql[sent][citation]['tiabscores'].append(tiabscore)
                    if not retpmid in search_returns:
                        search_returns.add(retpmid)
                        json.dump(retarticle, open(f"searchreturns/{retpmid}.json", 'w', encoding='utf-8'))
#                         with open(f"searchreturns/{retpmid}.tsv", 'w', encoding='utf-8', newline='') as wfile:
#                             csvwriter = csv.writer(wfile, delimiter='\t')
#                             csvwriter.writerow(retarticle)
                if retpmid == citation:
                    test_sentences_sql[sent][citation]['rank'] = rank
print(n, len(test_sentences_sql))

102271588 96950
CPU times: user 1h 19min 55s, sys: 31min 56s, total: 1h 51min 52s
Wall time: 2h 56min 8s


In [102]:
json.dump(test_sentences_sql, open('test_sentences_sql.json', 'w', encoding='utf-8'), indent=4)

In [11]:
import os
import json

In [4]:
base_path = '/jupiter/cl17d/Project/src_new'

In [5]:
test_sentences_sql=json.load(open(f'{base_path}/test_sentences_sql.json', 'r', encoding='utf-8'))

In [110]:
len(search_returns), n, len(test_sentences_sql)

(16952614, 38117, 96950)

In [104]:
m, n, r = [], 0, []
for k, v in test_sentences_sql.items():
    if len(v) > 1:
        m.append(len(v))
    for ck, cv in v.items():
        r.append(len(cv['retpmids']))
#         if len(cv['retpmids']) < 1000:
#             print(k, ck, len(cv['retpmids']), cv['rank'], '\n')
        if not 'rank' in cv:
            n += 1
print(len(m), min(m), max(m), sum(m), n, len(r), min(r), max(r), sum(r))

4703 2 6 9857 0 102104 1 1000 102077191


In [116]:
n = 0
for k, v in test_sentences_sql.items():
    if len(v) > 1:
        retpmids = set()
        for ret in v.values():
            retpmids = retpmids | set(ret['retpmids'])
        if len(retpmids) > 1000:
            n += 1
            print(len(v), len(retpmids))
print(n)

3 1001
2 1001
2 1261
2 1001
2 1001
5


In [9]:
n = 0
for k, v in test_sentences_sql.items():
    if len(v) > 5:
        set_pmids = set()
        list_pmids = []
        for ret in v.values():
            set_pmids = set_pmids | set(ret['retpmids'])
            list_pmids.extend(ret['retpmids'])
        print(f"{k}\tCitations: {len(v)}\tSet PMIDs: {len(set_pmids)}\tList PMIDs: {len(list_pmids)}")

first observed stripe like order 214 cuprates checkerboard order vortex cores subsequently surface bi cl based compounds universality cdw order cuprate phase diagram established nmr x ray scattering probes|27071712	Citations: 6	Set PMIDs: 1000	List PMIDs: 6000


In [None]:
test_sentences_sql[list(test_sentences)[0]]

In [None]:
len(set(test_sentences_sql[list(test_sentences)[0]]['retpmids']))

In [47]:
len(set(test_sentences_sql[list(test_sentences)[0]]['ranks']))

2

In [13]:
sent = "first observed stripe like order 214 cuprates checkerboard order vortex cores subsequently surface bi cl based compounds universality cdw order cuprate phase diagram established nmr x ray scattering probes|27071712"

In [14]:
test_path = f"{base_path}/case_SQL_testing_final_complete"
files = set()
for file in os.listdir(f"{test_path}"):
    with open(f"{test_path}/{file}") as ifile:
        next(ifile)
        for row in ifile:
            row = row.strip().split("\t")
            sentence = f"{row[0]}|{row[1]}"
            if sentence == sent:
                files.add(file)

In [15]:
files

{'DL_data_search_result_sentence_citation475.csv',
 'DL_data_search_result_sentence_citation488.csv',
 'DL_data_search_result_sentence_citation513.csv',
 'DL_data_search_result_sentence_citation517.csv',
 'DL_data_search_result_sentence_citation526.csv',
 'DL_data_search_result_sentence_citation536.csv'}

**PubMed TF-IDF (PTF)**

In [105]:
%%time
test_sentences_ptf = {}
n = 0
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_PTF_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            sent = f"{row[0]}|{row[1]}"
            citation = row[3]
            retpmid = row[2]
            tiscore = row[12]
            tiabscore = row[13]
            rank = row[-1]
            retarticle = [row[4], row[5], row[7], row[8], row[11]]
            if sent in test_sentences:
                n += 1
                if sent not in test_sentences_ptf:
                    test_sentences_ptf[sent] = {citation:{'retpmids':[], 'tiscores':[], 'tiabscores':[]}}
                if citation not in test_sentences_ptf[sent]:
                    test_sentences_ptf[sent][citation] = {'retpmids':[], 'tiscores':[], 'tiabscores':[]}
                if not retpmid in test_sentences_ptf[sent][citation]['retpmids']:
                    test_sentences_ptf[sent][citation]['retpmids'].append(retpmid)
                    test_sentences_ptf[sent][citation]['tiscores'].append(tiscore)
                    test_sentences_ptf[sent][citation]['tiabscores'].append(tiabscore)
                    if not retpmid in search_returns:
                        search_returns.add(retpmid)
                        json.dump(retarticle, open(f"searchreturns/{retpmid}.json", 'w', encoding='utf-8'))
                if retpmid == citation:
                    test_sentences_ptf[sent][citation]['rank'] = rank
print(n, len(test_sentences_ptf))

100128069 96950
CPU times: user 1h 18min 15s, sys: 18min 36s, total: 1h 36min 51s
Wall time: 5h 12min 9s


In [106]:
json.dump(test_sentences_ptf, open('test_sentences_ptf.json', 'w', encoding='utf-8'), indent=4)

In [107]:
json.dump(list(search_returns), open('search_returns.json', 'w', encoding='utf-8'), indent=4)

In [108]:
len(search_returns), n, len(test_sentences_ptf)

(16952614, 100128069, 96950)

In [109]:
m, n, r = [], 0, []
for k, v in test_sentences_ptf.items():
    if len(v) > 1:
        m.append(len(v))
    for cv in v.values():
        r.append(len(cv['retpmids']))
#         if len(cv['retpmids']) < 1000:
#             print(k, len(cv['retpmids']), cv['rank'], '\n')
        if not 'rank' in cv:
            n += 1
print(len(m), min(m), max(m), sum(m), n, len(r), min(r), max(r), sum(r))

4670 2 6 9788 38117 102068 1 1000 100123134


In [None]:
n = 0
for k, v in test_sentences_ptf.items():
    if len(v) > 1:
        retpmids = set()
        for ret in v.values():
            retpmids = retpmids | set(ret['retpmids'])
        if len(retpmids) > 1000:
            n += 1
            print(len(v), len(retpmids))
print(n)

In [48]:
len(set(test_sentences_ptf[list(test_sentences)[0]]['citations']))

2

In [49]:
len(set(test_sentences_ptf[list(test_sentences)[0]]['retpmids']))

1000

In [50]:
len(set(test_sentences_ptf[list(test_sentences)[0]]['ranks']))

1000

In [118]:
list(test_sentences)[0]

'reported pro inflammatory cytokines stimulate production matrix metalloproteinase administration vip reported counteract action pro inflammatory stimuli reduces constitutive expression mmp 9 mmp 13 thereby alleviating pain oa|27553659'

In [178]:
n = 0
for k, v in test_sentences_ptf.items():
    for ck in v:
        if not ck in test_sentences_sql[k]:
            n += 1
print(n)

0


**PubMed Best Match (PBM)**

In [119]:
test_sentences_records = {}

In [120]:
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_SQL_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            sent = f"{row[0]}|{row[1]}"
            citation = row[3]
            sentyear = row[6]
            if sent in test_sentences:
                test_sentences_records[sent] = test_sentences_records.get(sent, {'year':[], 'citations':[]})
                if not sentyear in test_sentences_records[sent]['year']:
                    test_sentences_records[sent]['year'].append(sentyear)
                if not citation in test_sentences_records[sent]['citations']:
                    test_sentences_records[sent]['citations'].append(citation)

In [121]:
for f in range(461, 561):
    with open(f"{TESTDATA_PATH_PTF_ST}/DL_data_search_result_sentence_citation{str(f)}.csv", newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter='\t')
        header = next(csvreader)
        for row in csvreader:
            sent = f"{row[0]}|{row[1]}"
            citation = row[3]
            sentyear = row[6]
            if sent in test_sentences:
                test_sentences_records[sent] = test_sentences_records.get(sent, {'year':[], 'citations':[]})
                if not sentyear in test_sentences_records[sent]['year']:
                    test_sentences_records[sent]['year'].append(sentyear)
                if not citation in test_sentences_records[sent]['citations']:
                    test_sentences_records[sent]['citations'].append(citation)

In [124]:
n, m = 0, []
for k, v in test_sentences_records.items():
    if len(v['year']) == 1:
        n += 1
    if len(v['citations']) > 1:
        m.append(len(v['citations']))
print(n, len(m), min(m), max(m), sum(m))

96950 4703 2 6 9857


In [176]:
json.dump(test_sentences_records, open('test_sentences_records.json', 'w', encoding='utf-8'))

In [125]:
Counter(m)

Counter({2: 4308, 3: 346, 5: 5, 4: 43, 6: 1})

In [127]:
list(test_sentences)[0], test_sentences_records[list(test_sentences)[0]]

('reported pro inflammatory cytokines stimulate production matrix metalloproteinase administration vip reported counteract action pro inflammatory stimuli reduces constitutive expression mmp 9 mmp 13 thereby alleviating pain oa|27553659',
 {'year': ['2016'], 'citations': ['24318839', '17911037']})

In [1]:
import json
import time
import requests
from bs4 import BeautifulSoup

In [2]:
def get_page(url):
    session = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) Chrome/39.0.2171.95 Safari/537.36',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}        
    try:
        req = session.get(url, headers=headers)
    except requests.exceptions.RequestException:
        return None
    req.encoding = 'utf-8'
    if req.text == '':
        return None
    return req.text

In [31]:
def pubmedweb_search(query, year):
    pubmed_url = 'https://pubmed.ncbi.nlm.nih.gov/'
    pubmed_url += f'?term={query}&size=200&filter=years.1977-{year}&format=pmid'
    #sort=date,pubdate &sort_order=asc&show_snippets=off
    soup = None
    while soup == None or soup.find('meta', attrs={'name':'log_displayeduids'}) == None:
        soup = BeautifulSoup(get_page(pubmed_url), 'xml')
    counts = int(soup.find('meta', attrs={'name':'log_resultcount'})['content'])
    returns = counts if counts <= 1000 else 1000
    idlist = soup.find('meta', attrs={'name':'log_displayeduids'})['content'].split(',') 
    if returns > 200:
        for page in range(2, int((returns-1)/200)+2):
            page_url = f"{pubmed_url}&page={page}"
            soup = None
            while soup == None or soup.find('meta', attrs={'name':'log_displayeduids'}) == None:
                soup = BeautifulSoup(get_page(page_url), 'xml')
            idlist.extend(soup.find('meta', attrs={'name':'log_displayeduids'})['content'].split(','))
#             time.sleep(5)
    return counts, idlist

In [8]:
test_sentences = json.load(open('test_sentences.json'))

In [4]:
test_sentences_records = json.load(open('test_sentences_records.json'))

In [9]:
query = list(test_sentences)[0].split('|')[0].split()
query = " OR ".join(query)
query

'reported OR pro OR inflammatory OR cytokines OR stimulate OR production OR matrix OR metalloproteinase OR administration OR vip OR reported OR counteract OR action OR pro OR inflammatory OR stimuli OR reduces OR constitutive OR expression OR mmp OR 9 OR mmp OR 13 OR thereby OR alleviating OR pain OR oa'

In [10]:
year = test_sentences_records[list(test_sentences)[0]]['year'][0]
year

'2016'

In [16]:
citation = test_sentences_records[list(test_sentences)[0]]['citations'][1]

In [11]:
se_counts, se_pmids = pubmedweb_search(query, year)

In [12]:
se_counts, len(se_pmids)

(12249583, 1000)

In [37]:
test_sentences_pbm = {}

In [5]:
test_sentences_pbm = json.load(open('test_sentences_pbm.json'))

In [None]:
for k, v in test_sentences_records.items():
    if k in test_sentences_pbm: continue
    query = k.split('|')[0].split()
    if len(query) > 20: continue
    query = " OR ".join(query)
    year = v['year'][0]
    _, se_pmids = pubmedweb_search(query, year)
    test_sentences_pbm[k] = {'retpmids':se_pmids, 'citations':{}}
    for citation in v['citations']:
        if citation in se_pmids:
            rank = str(int(se_pmids.index(citation)) + 1)
        else:
            rank = '9999'
        test_sentences_pbm[k]['citations'][citation] = rank

In [84]:
json.dump(test_sentences_pbm, open('test_sentences_pbm.json', 'w', encoding='utf-8'))

In [45]:
query, year

('hand OR low OR incidence OR concentration OR ota OR found OR samples OR analyzed OR turkey OR tunisia',
 '2017')

In [21]:
len(query.split(" OR "))

63

In [28]:
pubmed_url = 'https://pubmed.ncbi.nlm.nih.gov/'
pubmed_url += f'?term={query}&size=200&filter=years.1977-{year}&show_snippets=off&format=pmid' # sort=date,pubdate &sort_order=asc
soup = None
while soup == None or soup.find('meta', attrs={'name':'log_displayeduids'}) == None:
    soup = BeautifulSoup(get_page(pubmed_url), 'xml')
counts = int(soup.find('meta', attrs={'name':'log_resultcount'})['content'])
returns = counts if counts <= 1000 else 1000
idlist = soup.find('meta', attrs={'name':'log_displayeduids'})['content'].split(',') 
if returns > 200:
    for page in range(2, int((returns-1)/200)+2):
        page_url = f"{pubmed_url}&page={page}"
        soup = None
        while soup == None or soup.find('meta', attrs={'name':'log_displayeduids'}) == None:
            soup = BeautifulSoup(get_page(page_url), 'xml')
        idlist.extend(soup.find('meta', attrs={'name':'log_displayeduids'})['content'].split(','))
#         time.sleep(5)

In [29]:
page, len(idlist)

(5, 1000)

In [44]:
page_url

'https://pubmed.ncbi.nlm.nih.gov/?term=adverse OR effects OR ivig OR therapy OR include OR rash OR chills OR flushing OR fever OR nausea OR severe OR headache OR joint OR pains OR dyspnea OR diarrhea OR tachycardia OR anaphylactic OR reactions&size=200&filter=years.1977-2009&show_snippets=off&format=pmid&page=5'

In [43]:
counts

9294239

In [92]:
len(test_sentences_records), len(test_sentences_pbm)

(96950, 13188)

### Other Data


- abstract_citations.json
- abstract_citations_by_year.json
- journal_IF.json
- normalized_citation_by_year.tsv
- publicationType_dict.txt
- title_abstracts_complete.tsv
- title_abstracts_complete_only_features.tsv
- train_final_dataset_complete_no_onlykeywords.tsv
- train_final_dataset_complete_noshuffled.tsv
- train_final_dataset_for_old_model.jsonl
- valid_final_dataset_for_old_model.jsonl

- title_abstracts_complete.tsv

In [12]:
with open(f"{TRAINDATA_PATH}/title_abstracts_complete.tsv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #query_data.add(row[0])
        #train_data.append(row)
        if n >= 2: break

['\ufeff', 'pmid', 'articleTitle', 'journal', 'volume', 'issue', 'ISSN', 'year', 'month', 'day', 'abstract', 'authors', 'doi', 'keywords', 'publicationType', 'filename', 'citation', 'journal_IF']
['0', '21', '[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (authors transl)].', 'Arzneimittel-Forschung', '25', '9', '0004-4172', '1975', '9', '1', '(--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.', 'O Isaac | K Thiemer', '', '', 'english abstract|journal article', 'pubmed19n0001PubMed_list.csv', '0', '0.0']


- title_abstracts_complete_only_features.tsv

In [13]:
with open(f"{TRAIN_PATH}/title_abstracts_complete_only_features.tsv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #query_data.add(row[0])
        #train_data.append(row)
        if n >= 2: break

['\ufeff', 'pmid', 'year', 'citation', 'journal_IF', 'publicationType']
['0', '21', '1975', '0', '0.0', 'english abstract|journal article']


- train_final_dataset_complete_no_onlykeywords.tsv

In [14]:
with open(f"{TRAIN_PATH}/train_final_dataset_complete_no_onlykeywords.tsv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    #header = next(csvreader)
    n = 0
    for row in csvreader:
        n += 1
        print(row)
        #query_data.add(row[0])
        #train_data.append(row)
        if n >= 2: break

['sentence', 'sentence_pmid', 'pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'label']
['donepezil cholinesterase inhibitor shown improve attentional performance humans sahakian singh rockwood foldi bentley rodents muir balducci romberg particularly subjects low baseline cognitive performance tasks require elevated cognitive attentional load', '26415954', '15852461', 'detecting effects donepezil visual selective attention using signal detection parameters alzheimer s disease', 'attentional function impaired alzheimer s disease ad moreover attention mediated acetylcholine but despite widespread use acetylcholinesterase inhibitors ache i augment available acetylcholine ad measures attentional function used assess drug response we hypothesized cholinergic augmentation impacts directly attentional system higher order measures vis

In [60]:
len(train_data)

1000

In [None]:
tokenizer, nb_words = load_tokenizer(f"{MODEL_PATH}/{TOKENIZER_FILE}", MAX_NB_WORDS)

In [None]:
embedding_matrix = np.load(f"{MODEL_PATH}/{EMBEDDING_FILE}")

In [68]:
query_embeddings_ = model.encode(query_data)

In [61]:
train_embeddings = model.encode(train_data)

In [None]:
for sent, sent_embedding in zip(query_data, query_embeddings_):
    distances = distance.cdist([sent_embedding], train_embeddings, "cosine")[0]
    top_5_ids = distances.argsort()[:5] #[::-1]
    print("Sentence:", sent)
    for i in top_5_ids:
        print(train_data[i])
    print()

In [None]:
header, test_data[:5]

# Deep Information Retrieval Models

## Load Test Data

In [2]:
import csv

In [3]:
BASE_PATH = "/jupiter/cl17d/Project/src_new"
TESTDATA_PATH_SQL_ST = "case_SQL_testing_final_complete"
TESTDATA_PATH_PTF_ST = "case_PubMed_testing_final_complete"
TESTDATA_PATH_PBM_ST = "case_PubMed_BM_testing_final_complete"
TESTDATA_PATH_GGS_ST = "case_Google_scholar_testing_final_complete"

In [4]:
n = 0
test_set = {}
# for f in range(461, 561):
with open(f"{BASE_PATH}/{TESTDATA_PATH_SQL_ST}/DL_data_search_result_sentence_citation461.csv", newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    header = next(csvreader)
    for row in csvreader:
        n += 1
        sent = f"{row[0]}|{row[1]}|{row[3]}|{row[-1]}"
        if sent not in test_set:
            test_set[sent] = {'sent':row[0], 'retpmids':[], 'titles':[], 'abstracts':[]}
        test_set[sent]['retpmids'].append(row[2])
        test_set[sent]['titles'].append(row[4])
        test_set[sent]['abstracts'].append(row[5])

In [5]:
len(test_set)

1026

In [None]:
for k, v in test_set.items():
    print(len(v['retpmids']), len(v['titles']), len(v['abstracts']))

In [None]:
len(test_set[list(test_set.keys())[0]]['abstracts'])

In [249]:
list(test_set.keys())[0]

'study investigated function ctgf psoriasis using established imiquimod imq induced psoriasis murine model samples psoriasis patients|29386832|19380832|8'

In [251]:
for pmid_id, pmid in enumerate(test_set[list(test_set.keys())[0]]['retpmids']):
    if pmid == '19380832':
        print(pmid_id, pmid)

26 19380832


## [SBERT](https://github.com/UKPLab/sentence-transformers)

- Load the model

In [1]:
from sentence_transformers import SentenceTransformer
from scipy.spatial import distance
import numpy as np

In [11]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

100%|██████████| 405M/405M [00:20<00:00, 20.0MB/s] 


In [12]:
sentence = list(test_set.keys())[0].split('|')[0]
cite_pmid = list(test_set.keys())[0].split('|')[2]
se_rank = list(test_set.keys())[0].split('|')[3]

In [13]:
sentence, cite_pmid, se_rank

('study investigated function ctgf psoriasis using established imiquimod imq induced psoriasis murine model samples psoriasis patients',
 '19380832',
 '8')

In [259]:
ret_ids = test_set[list(test_set.keys())[0]]['retpmids']
ret_tis = test_set[list(test_set.keys())[0]]['titles']
ret_abs = test_set[list(test_set.keys())[0]]['abstracts']

In [287]:
cite_idx = ret_ids.index(cite_pmid)

In [290]:
cite_idx, cite_pmid, ret_tis[cite_idx], ret_abs[cite_idx]

(26,
 '19380832',
 'imiquimod induced psoriasis like skin inflammation mice mediated via il 23 il 17 axis',
 'topical application imiquimod imq tlr7 8 ligand potent immune activator induce exacerbate psoriasis chronic inflammatory skin disorder recently crucial role proposed il 23 il 17 axis psoriasis hypothesized imq induced dermatitis mice serve model analysis pathogenic mechanisms psoriasis like dermatitis assessed il 23 il 17 axis dependency daily application imq mouse back skin induced inflamed scaly skin lesions resembling plaque type psoriasis lesions showed increased epidermal proliferation abnormal differentiation epidermal accumulation neutrophils microabcesses neoangiogenesis infiltrates consisting cd4 cells cd11c dendritic cells plasmacytoid dendritic cells imq induced epidermal expression il 23 il 17a il 17f well increase splenic th17 cells imq induced dermatitis partially dependent presence cells whereas disease development almost completely blocked mice deficient il 23 i

In [255]:
sentence_embedding = model.encode(sentence)

In [258]:
len(sentence_embedding[0])

768

In [263]:
%%time
title_embeddings = model.encode(ret_tis)

CPU times: user 5min 36s, sys: 2.01 s, total: 5min 38s
Wall time: 22.2 s


In [266]:
len(title_embeddings)

1000

In [268]:
%%time
distances = distance.cdist(sentence_embedding, title_embeddings, "cosine")[0]

In [291]:
np.where(distances.argsort()==cite_idx)[0][0]

412

In [277]:
distances.argsort()

array([638, 389, 386, 162, 609, 884, 187, 390, 558, 384, 416, 459, 650,
       404, 410, 667, 344, 644, 642, 672, 505,   8, 438, 627, 514, 666,
       639, 186, 934, 198, 439, 754, 765, 351, 465, 183, 500, 366, 606,
       503, 872, 220, 566, 565, 205, 798, 618, 199, 453, 649, 531, 490,
       516, 587, 665, 519, 345, 352, 377, 259, 757, 367, 334, 223, 629,
       628, 229, 414, 207, 528, 179, 621, 641, 480, 428, 282, 456, 535,
       546, 357, 398, 574, 608, 495, 285, 307, 852, 189, 522, 635, 671,
       496, 630, 194, 196, 626, 237, 548, 387, 202, 611, 405, 759, 513,
       273, 469, 903, 810, 158, 349, 830, 849, 622, 511, 215, 601, 221,
       539, 865, 331, 281, 436, 318, 773,  52, 218, 762, 106, 891, 685,
        18, 174, 654, 581, 201, 722, 512, 301, 586, 851, 434, 677, 169,
       481, 195, 230, 752, 782, 172, 694,  49, 464, 476, 165, 504, 580,
       272, 122, 394, 596, 620, 624, 568, 302, 263, 691, 509, 278, 175,
       482, 937, 257, 663, 876, 337, 588, 506, 875, 164,  21, 86

In [292]:
%%time
ranks = []
for k, v in test_set.items():
    sentence = k.split('|')[0]
    cite_pmid = k.split('|')[2]
    se_rank = k.split('|')[3]
    ret_ids = v['retpmids']
    ret_tis = v['titles']
    ret_abs = v['abstracts']
    cite_idx = ret_ids.index(cite_pmid)
    sentence_embedding = model.encode(sentence)
    title_embeddings = model.encode(ret_tis)
    distances = distance.cdist(sentence_embedding, title_embeddings, "cosine")[0]
    re_rank = np.where(distances.argsort()==cite_idx)[0][0]
    ranks.append((se_rank, re_rank))
#     print(len(v['retpmids']), len(v['titles']), len(v['abstracts']))

CPU times: user 4d 2h 12min 34s, sys: 52min 21s, total: 4d 3h 4min 55s
Wall time: 6h 36min 42s


In [293]:
len(ranks)

1026

In [297]:
n = 0
for i in ranks:
    if int(i[1]) < int(i[0]):
        n += 1
print(n)

262


In [None]:
ranks

In [3]:
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

In [5]:
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", len(embedding))
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: 768

Sentence: Sentences are passed as a list of string.
Embedding: 768

Sentence: The quick brown fox jumps over the lazy dog.
Embedding: 768



In [6]:
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.']

In [8]:
corpus_embeddings = model.encode(corpus)

In [66]:
len(corpus_embeddings[0])

768

In [10]:
queries = ['A man is eating pasta.',
           'Someone in a gorilla costume is playing a set of drums.',
           'A cheetah chases prey on across a field.']
query_embeddings = model.encode(queries)

In [31]:
for query, query_embedding in zip(queries, query_embeddings):
    distances = distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    top_n_ids = distances.argsort()[:5] #[::-1]
    print("Query:", query)
    for i in top_n_ids:
        print(corpus[i])
    print()

Query: A man is eating pasta.
A man is eating a piece of bread.
A man is eating food.
Two men pushed carts through the woods.
A monkey is playing drums.
A man is riding a white horse on an enclosed ground.

Query: Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums.
A cheetah is running behind its prey.
The girl is carrying a baby.
A man is riding a horse.
A man is riding a white horse on an enclosed ground.

Query: A cheetah chases prey on across a field.
A cheetah is running behind its prey.
Two men pushed carts through the woods.
A monkey is playing drums.
A man is riding a horse.
A man is riding a white horse on an enclosed ground.



In [32]:
distances

array([0.97565166, 0.95483546, 0.87328053, 0.70701569, 0.94015307,
       0.6337635 , 0.72819961, 0.69392713, 0.09933369])

In [23]:
distances.argsort()[::-1][:5]

array([0, 1, 4, 2, 6])

In [39]:
len(np.array(corpus_embeddings).mean(axis=0))

768

In [44]:
distance.cdist([query_embedding], [np.array(corpus_embeddings).mean(axis=0)], "cosine")[0][0]

0.5123771250750835

## [OpenMatch](https://github.com/thunlp/OpenMatch)

In [1]:
import torch
import OpenMatch as om

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = om.models.Bert("allenai/scibert_scivocab_uncased")

In [None]:
tokenizer = AutoTokenizer.from_pretrained("openmatch/bert-base_marco-doc_firstp")
model = om.models.Bert("openmatch/bert-base_marco-doc_firstp")

In [3]:
query = "Classification treatment COVID-19"
doc = "By retrospectively tracking the dynamic changes of LYM% in death cases and cured cases, this study suggests that lymphocyte count is an effective and reliable indicator for disease classification and prognosis in COVID-19 patients."

In [4]:
input_ids = tokenizer.encode(query, doc)
ranking_score, ranking_features = model(torch.tensor(input_ids).unsqueeze(0))

In [11]:
len(input_ids)

49

In [5]:
ranking_score

tensor([0.7896], grad_fn=<SqueezeBackward1>)

In [13]:
len(ranking_features[0])

768

## [FlexNeuART](https://github.com/oaqa/FlexNeuART)

In [27]:
import json
import time

In [None]:
test_sents = json.load(open('test_sents_pubmed_bm.json'))

In [None]:
n = 0
for v in test_sents.values():
    for c in v['citations']:
        if c in v['pmids']:
            n += 1
            print(v['pmids'].index(c))
        else: print(c)
print(n)

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
# funtion for getting web pages
def get_page(url):
    session = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) Chrome/39.0.2171.95 Safari/537.36',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}        
    try:
        req = session.get(url, headers=headers)
    except requests.exceptions.RequestException:
        return None
    req.encoding = 'utf-8'
    if req.text == '':
        return None
    return req.text

In [37]:
# function for query google scholar
def googlescholar_search(query, year='2020'):
    ggs_url = 'https://scholar.google.com/scholar'
    ggs_url += f'?q=site:pubmed.ncbi.nlm.nih.gov {query}' # sort=date,pubdate &sort_order=asc
    ggs_url += f'&hl=en&as_sdt=0,10&as_ylo=1977&as_yhi={year}'
    soup = BeautifulSoup(get_page(ggs_url), 'html')
    counts = int(soup.find('div', attrs={'id':'gs_ab_md'}).text.split()[1].replace(',', ''))
    idlist = []
    for h3 in soup.find('body').find_all('h3'):
        idlist.append(h3.a['href'].split('/')[-2])
    time.sleep(2)
    for page in range(10, 500, 10):
        page_url = f"{ggs_url}&start={page}"
        soup = BeautifulSoup(get_page(page_url), 'html')
        for h3 in soup.find('body').find_all('h3'):
            idlist.append(h3.a['href'].split('/')[-2])
        time.sleep(2)
    return counts, idlist

In [38]:
base_path = '/jupiter/cl17d/Project/src_new'

In [39]:
# Test sentences record
test_sentences_records = json.load(open(f"{base_path}/test_sentences_records.json"))

In [40]:
len(test_sentences_records)

96950

In [41]:
n = 0
test_google_scholar = {} #json.load(open('test_sents_google_scholar.json'))
remaining_set = set()

In [None]:
for k, v in test_sentences_records.items():
    if k in test_google_scholar: continue
    n += 1
    if n % 1000 == 0:
        print(n)
        json.dump(test_google_scholar, open('test_sents_google_scholar.json', 'w', encoding='utf-8'))
    query = '+OR+'.join(k.split('|')[0].strip().split())
    year = v['year'][0]
#     print(query, year)
    try:
        counts, pmids = googlescholar_search(query, year)
    except:
        remaining_set.add(k)
        continue
#     print(counts, len(pmids))
    test_google_scholar[k] = {'year':year, 'citations':v['citations'], 'counts':counts, 'pmids':pmids}
    time.sleep(15)
json.dump(test_google_scholar, open('test_sents_google_scholar.json', 'w', encoding='utf-8'))

1000
2000


In [20]:
len(list(test_google_scholar.items())[0][1]['pmids'])

100

In [21]:
# testing query
query_or = 'breast cancer'
query_sen = 'psoriasis imiquimod imq induced psoriasis model psoriasis'
query = '+OR+'.join(query_sen.split())
year = '2019'

In [22]:
query

'psoriasis+OR+imiquimod+OR+imq+OR+induced+OR+psoriasis+OR+model+OR+psoriasis'

In [13]:
counts, pmids = googlescholar_search(query, year)

In [14]:
counts, len(pmids)

(6970, 500)

In [28]:
ggs_url = 'https://scholar.google.com/scholar'
ggs_url += f'?q=site:pubmed.ncbi.nlm.nih.gov {query}' # sort=date,pubdate &sort_order=asc
ggs_url += f'&hl=en&as_sdt=0,10&as_ylo=1977&as_yhi={year}'

In [32]:
ggs_url

'https://scholar.google.com/scholar?q=site:pubmed.ncbi.nlm.nih.gov psoriasis+OR+imiquimod+OR+imq+OR+induced+OR+psoriasis+OR+model+OR+psoriasis&hl=en&as_sdt=0,10&as_ylo=1977&as_yhi=2019'

In [35]:
soup = BeautifulSoup(get_page(ggs_url), 'html')

In [36]:
soup

<!DOCTYPE html>
<html><head><title>Google Scholar</title><meta content="text/html;charset=utf-8" http-equiv="Content-Type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2" name="viewport"/><meta content="telephone=no" name="format-detection"/><link href="/favicon.ico" rel="shortcut icon"/><style>html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}html,body{height:100%}#gs_top{position:relative;box-sizing:border-box;min-height:100%;min-width:964px;-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_ta #gs_top{min-width:320px;}#gs_top.gs_nscl{position:fixed;width:100%;}body,td,input,button{font-size:13px;font-family:Arial,sans-serif;line-height:1.24;}body{background:#fff;color

In [None]:
counts = int(soup.find('div', attrs={'id':'gs_ab_md'}).text.split()[1].replace(',', ''))