# Parse PubMed data in order to get plain text of relevant papers

Through the URLs below, PubMed data can be obtained interactively. The idea is to select papers by keyword (and perhaps by date, ...) in a first query that returns PMC IDs. The full text of those publications can then be obtained one by one.

These texts can be analyzed by scispacy language models that include POS tagging for biological/scientific text.

In [9]:
import os
import pandas as pd
import numpy as np
import scispacy
import spacy

from utils import perform_query, extract_clean, retrieve_paper

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Obtain and process data
Steps:
1. Define a query to search papers
2. Call API with that query to obtain all PMC IDs for that query
3. Read the full content of those papers one by one
4. Pre-process the full text
5. Run language model on clean text of body of paper

In [29]:
query = 'tomato'
IDs = perform_query(query)

savetext = True  # Set to True to cache the clean text of each paper in a text file
os.makedirs('papers', exist_ok=True)  # Make sure the cache directory exists before writing

all_papers = []
for ID in IDs:
    fname = f'papers/cleantext_{ID}.txt'

    if os.path.exists(fname):
        # Reuse the clean text cached by an earlier run
        with open(fname, 'r') as f:
            cleantext = f.read()
    else:
        content = retrieve_paper(ID)
        cleantext = extract_clean(content)
        if savetext:
            with open(fname, 'w') as f:
                f.write(cleantext)

    all_papers.append(cleantext)

Can't find the body, returning nothing!
Can't find the body, returning nothing!
Can't find the body, returning nothing!
Can't find the body, returning nothing!
Can't find the body, returning nothing!
Can't find the body, returning nothing!
Can't find the body, returning nothing!
Can't find the body, returning nothing!


In [30]:
# all_papers

In [31]:
nlp = spacy.load("en_ner_bionlp13cg_md")

In [45]:
docs = [nlp(text) for text in all_papers]

In [46]:
len(docs)

93

In [47]:
df_all = pd.DataFrame()
for doc in docs:
    # One row per named entity: its text, character offset, and label
    df_entities = pd.DataFrame({'text': [ent.text for ent in doc.ents],
                                'start': [ent.start_char for ent in doc.ents],
                                'label': [ent.label_ for ent in doc.ents]})

    df_all = pd.concat([df_all, df_entities])

In [48]:
df_all

Unnamed: 0,text,start,label
0,Christopher Columbus,27.0,ORGANISM
1,marigold,199.0,ORGANISM_SUBSTANCE
2,chili peppers,212.0,ORGANISM
3,Solanum lycopersicum L.,516.0,ORGANISM
4,tomatoes,744.0,IMMATERIAL_ANATOMICAL_ENTITY
...,...,...,...
318,tomato crop,47306.0,ORGANISM
319,DDT,47379.0,GENE_OR_GENE_PRODUCT
320,inorganic synthesis fertilizers while,47480.0,SIMPLE_CHEMICAL
321,material,47690.0,CELLULAR_COMPONENT


In [57]:
with open('exclude_terms.txt', 'r') as f:
    terms = [t.replace('\n', '') for t in f.readlines()]
terms

['href',
 'GUS',
 'uri',
 'xlink',
 'JA',
 'GA',
 'SA',
 'CA',
 'CK',
 'ABA',
 'GFP',
 'mml',
 'show',
 'ext-link']

In [58]:
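def select_genes(df):
    # Keep only rows whose entity text contains none of the excluded terms.
    # Matching is by case-sensitive substring, so a short term such as 'CA'
    # also drops longer entity texts that happen to contain it.
    # Note: this adds an 'extra' column to the DataFrame that is passed in.
    df["extra"] = 1
    for t in terms:
        df["extra"] = df["extra"] * np.array([t not in text_from_ent for text_from_ent in df.text])

    return df[df.extra == 1]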
        
df_filtered = select_genes(df_all)
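
The same exclusion could also be written as a single vectorized mask. The following is only a sketch: `exclude_terms` is a hypothetical helper that is not part of the notebook, and the small `example` frame is made-up data for illustration. Like `select_genes`, it matches case-sensitive substrings, but it takes the term list as an argument and leaves the input DataFrame untouched.

```python
import re

import pandas as pd


def exclude_terms(df, terms):
    """Return the rows of df whose 'text' contains none of the given terms."""
    if not terms:
        return df.copy()
    # Escape each term so characters such as '-' or '.' are matched literally
    pattern = "|".join(re.escape(t) for t in terms)
    mask = ~df["text"].str.contains(pattern, regex=True, na=False)
    return df[mask].copy()


# Made-up rows, only to show the behaviour
example = pd.DataFrame({
    "text": ["GFP reporter", "gene A", "xlink:href", "tomato"],
    "label": ["GENE_OR_GENE_PRODUCT"] * 4,
})
kept = exclude_terms(example, ["GFP", "href"])
print(kept)
```

Because the result is a copy and no helper column is added, `df_all` would stay unchanged when filtered this way.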

In [59]:
df_filtered

Unnamed: 0,text,start,label,extra
0,Christopher Columbus,27.0,ORGANISM,1
1,marigold,199.0,ORGANISM_SUBSTANCE,1
2,chili peppers,212.0,ORGANISM,1
3,Solanum lycopersicum L.,516.0,ORGANISM,1
4,tomatoes,744.0,IMMATERIAL_ANATOMICAL_ENTITY,1
...,...,...,...,...
318,tomato crop,47306.0,ORGANISM,1
319,DDT,47379.0,GENE_OR_GENE_PRODUCT,1
320,inorganic synthesis fertilizers while,47480.0,SIMPLE_CHEMICAL,1
321,material,47690.0,CELLULAR_COMPONENT,1


In [37]:
# df_entities.head(20)
counts = df_all[df_all.label == "GENE_OR_GENE_PRODUCT"].text.value_counts().sort_values(ascending=False)
counts[:40]

Series([], Name: text, dtype: int64)

In [12]:
for token in docs[2][100:110]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

COVID-19 covid-19 NOUN NN nmod XXXX-dd False False
has have VERB VBZ aux xxx True True
not not PART RB neg xxx True True
spared spare VERB VBN ROOT xxxx True False
children child NOUN NNS dobj xxxx True False
. . PUNCT . punct . False False
Since since SCONJ IN case Xxxxx True True
March March PROPN NNP nmod Xxxxx True False
2020 2020 NUM CD nummod dddd False False
, , PUNCT , punct , False False


# Try locating the Results section from the body to find the associated genes.
`<sec id="blabla"><title>Results</title> ... </sec>`
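
One way this could be approached is sketched below with the standard library XML parser. It assumes the paper body is well-formed XML in which each section is a `<sec>` element with a `<title>` child, as in the pattern above; `results_text` is a hypothetical helper, and the `sample` string is invented for illustration. Titles are matched by prefix so that a heading such as "Results and Discussion" would also be picked up, which may or may not be desirable.

```python
import xml.etree.ElementTree as ET


def results_text(xml_string):
    """Return the plain text of the first <sec> whose title starts with 'Results'."""
    root = ET.fromstring(xml_string)
    for sec in root.iter("sec"):
        title = sec.find("title")
        if title is None:
            continue
        title_text = "".join(title.itertext()).strip().lower()
        if title_text.startswith("results"):
            parts = []
            for child in sec:
                if child is title:
                    continue  # skip the heading itself
                # Flatten nested markup (e.g. <italic>) and normalise whitespace
                parts.append(" ".join("".join(child.itertext()).split()))
            return " ".join(p for p in parts if p)
    return ""


# Invented sample, only to show the behaviour
sample = """<body>
<sec id="s1"><title>Introduction</title><p>Background text.</p></sec>
<sec id="s2"><title>Results</title><p>Gene <italic>X</italic> was upregulated.</p><p>Second paragraph.</p></sec>
</body>"""
print(results_text(sample))
```

Real PMC XML often uses prefixed attributes such as `xlink:href`; the parser only accepts those if the corresponding namespace is declared in the string being parsed.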

# Perhaps treat review papers differently, or not at all
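
If the retrieved content is JATS-style XML, the `article-type` attribute on the `<article>` element could be used to recognise reviews. This is a sketch under that assumption: `is_review` is a hypothetical helper, and whether `retrieve_paper` returns the `<article>` element at all has not been verified here.

```python
import xml.etree.ElementTree as ET


def is_review(xml_string):
    """Return True if the <article> element is marked as a review article."""
    root = ET.fromstring(xml_string)
    article = root if root.tag == "article" else root.find(".//article")
    if article is None:
        return False
    return article.get("article-type", "").lower() == "review-article"


# Invented minimal documents, only to show the behaviour
review_doc = '<article article-type="review-article"><body/></article>'
research_doc = '<article article-type="research-article"><body/></article>'
print(is_review(review_doc), is_review(research_doc))
```

Papers flagged this way could then be skipped or routed to a separate analysis.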

# Use big page size, perhaps loop to nextPageURL?
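
A pagination loop might look like the sketch below. The key names `resultList`, `result`, `pmcid` and `nextPageUrl` are assumptions modelled on the Europe PMC REST search response and should be checked against whatever `perform_query` actually calls. `collect_ids` and `fetch_page` are hypothetical names; the fetch step is passed in as a function so the loop can be tried on fake pages without network access.

```python
def collect_ids(first_url, fetch_page, max_pages=100):
    """Follow next-page links and collect every PMC ID found along the way."""
    ids = []
    url = first_url
    pages = 0
    while url and pages < max_pages:
        page = fetch_page(url)  # expected to return the parsed JSON as a dict
        for hit in page.get("resultList", {}).get("result", []):
            pmcid = hit.get("pmcid")
            if pmcid:  # hits without a PMC ID have no full text to fetch
                ids.append(pmcid)
        url = page.get("nextPageUrl")  # None on the last page ends the loop
        pages += 1
    return ids


# Fake two-page response, only to show the behaviour
fake_pages = {
    "page1": {
        "resultList": {"result": [{"pmcid": "PMC1"}, {"pmcid": "PMC2"}]},
        "nextPageUrl": "page2",
    },
    "page2": {
        "resultList": {"result": [{"pmcid": "PMC3"}, {"title": "no full text"}]},
    },
}
ids = collect_ids("page1", fake_pages.__getitem__)
print(ids)
```

In real use, `fetch_page` could be something like `lambda url: requests.get(url, timeout=30).json()`, and `max_pages` guards against an endless loop if the service keeps returning a next-page link.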



Dataframe with text, 