## What do we know about COVID-19 risk factors?

**SOLUTION:** Create an unsupervised scientific-literature understanding system that takes common search terms, analyzes a very large corpus of scientific papers, and returns highly relevant text excerpts from papers on those topics. This lets a single researcher or small team gather targeted information and quickly locate the text that answers important questions about the new virus.

**APPROACH:** The current implementation uses pandas' built-in string search to scan all paper abstracts for keywords relating to the topics where specific answers are desired. Once the matching DataFrame slice is returned, each abstract is split into sentences and checked word by word to determine which abstracts likely contain the most relevant answers to the keyword topics.
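
As a minimal sketch of the first stage described above, the snippet below filters a tiny, made-up DataFrame down to rows whose abstract contains every keyword. The sample rows and the helper name `filter_abstracts` are illustrative assumptions, not part of the notebook.

```python
import functools
import operator

import pandas as pd

# Hypothetical sample data standing in for the real metadata file.
sample = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": [
        "hypertension was a common risk factor in covid-19 patients.",
        "we describe the incubation period of the virus.",
        "risk factors for severe disease remain unclear.",
    ],
})

def filter_abstracts(df, keywords):
    """Return rows whose abstract contains every keyword (plain substring match)."""
    masks = (df["abstract"].str.contains(k, regex=False) for k in keywords)
    # AND the per-keyword boolean masks together, one row-wise mask at the end.
    return df[functools.reduce(operator.and_, masks)]

hits = filter_abstracts(sample, ["risk factor", "hypertension"])
print(hits["title"].tolist())
```

Only rows that match all keywords survive, so adding keywords narrows the result set.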

In [17]:
# fit the notebook to browser
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [18]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import functools
from nltk import PorterStemmer

In [19]:
# load the metadata CSV, keeping 7 columns (title, journal, abstract, authors, doi, publish_time, sha)
df=pd.read_csv('../metadata.csv', usecols=['title','journal','abstract','authors','doi','publish_time','sha'], low_memory=False)
print(df.shape)

#fill na fields
df=df.fillna('no data provided')

#drop duplicate titles
df = df.drop_duplicates(subset='title', keep="first")

#keep only 2020 dated papers
df=df[df['publish_time'].str.contains('2020')]

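# lowercase the abstract and append the lowercased title so both are searched
# (note: the two are joined without a separating space)
df["abstract"] = df["abstract"].str.lower()+df["title"].str.lower()

# keep only COVID-19 papers (search_focus is defined in cell In [20], which must run first)
df=search_focus(df)
print(df.shape)
# show 3 lines of the new dataframe
df.head(3)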

(210537, 7)
(70059, 7)


Unnamed: 0,sha,title,doi,abstract,publish_time,authors,journal
4661,no data provided,Latest assessment on COVID-19 from the Europea...,10.2807/1560-7917.es.2020.25.8.2002271,no data providedlatest assessment on covid-19 ...,2020-02-27,no data provided,Euro Surveill
4697,no data provided,Updated rapid risk assessment from ECDC on the...,10.2807/1560-7917.es.2020.25.9.2003051,no data providedupdated rapid risk assessment ...,2020-03-05,no data provided,Euro Surveill
4731,no data provided,Updated rapid risk assessment from ECDC on the...,10.2807/1560-7917.es.2020.25.10.2003121,no data providedupdated rapid risk assessment ...,2020-03-12,no data provided,Euro Surveill


In [20]:
# keep only documents mentioning covid, -cov-2, cov2, or ncov
def search_focus(df):
    dfa = df[df['abstract'].str.contains('covid')]
    dfb = df[df['abstract'].str.contains('-cov-2')]
    dfc = df[df['abstract'].str.contains('cov2')]
    dfd = df[df['abstract'].str.contains('ncov')]
    frames=[dfa,dfb,dfc,dfd]
    df = pd.concat(frames)
    df=df.drop_duplicates(subset='title', keep="first")
    return df
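
The four separate searches above could also be expressed as one regular-expression pass. The following is a sketch under that assumption; the function name `search_focus_single_pass` and the sample rows are hypothetical. Unlike the concatenation approach, it keeps the rows in their original order.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook reads these from metadata.csv.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "abstract": [
        "clinical features of covid-19",
        "sars-cov-2 entry receptors",
        "influenza vaccination uptake",
        "the 2019-ncov outbreak in wuhan",
    ],
})

def search_focus_single_pass(df):
    """Keep rows whose abstract mentions any of the four COVID-19 name variants."""
    pattern = "covid|-cov-2|cov2|ncov"  # regex alternation: any one term is enough
    mask = df["abstract"].str.contains(pattern, regex=True, na=False)
    return df[mask].drop_duplicates(subset="title", keep="first")

focused = search_focus_single_pass(sample)
print(focused["title"].tolist())
```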

In [21]:
# function to stem keywords into a common base word
def stem_words(words):
    stemmer = PorterStemmer()
    singles=[]
    for w in words:
        singles.append(stemmer.stem(w))
    return singles
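
One possible use of `stem_words`, sketched here as an assumption since the cells shown do not call it, is to stem both the keywords and the tokens of a sentence before comparing them. The helper `sentence_mentions` and the example sentence are hypothetical.

```python
import re

from nltk import PorterStemmer

def stem_words(words):
    """Stem each word to its Porter base form."""
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in words]

def sentence_mentions(sentence, keywords):
    """True if every keyword stem appears among the stemmed sentence tokens."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    token_stems = set(stem_words(tokens))
    return all(stem in token_stems for stem in stem_words(keywords))

example = "Severe infections were reported in older patients."
print(sentence_mentions(example, ["infection"]))
```

Comparing stems on both sides avoids relying on a stem being a literal substring of the inflected word, which Porter stemming does not guarantee.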

In [23]:
# list of lists for topic words relating to tasks
tasks = [['comorbidities','comorbid'],['risk factor','risk factors'],['cancer patient', 'cancer patients'],['hypertension','hyperten'],['heart', 'disease'],['chronic', 'bronchitis'],['cerebral', 'infarction'],['diabetes', 'diabete'],['copd','copd'],["blood type","type"],['smoking','smok'],['basic','reproductive','number'],["incubation", "period", "days"]]

z=0
for terms in tasks:
    stra=' '
    stra=' '.join(terms)
    k=str(z)
    z=z+1

In [25]:
# loop through the list of lists
z=0
for search_words in tasks:
    df_table = pd.DataFrame(columns = ["pub_date","authors","title","excerpt"])
    # make a string of the search words to print readable search
    str1=' '.join(search_words)
    # keep only rows whose abstract contains every search word
    dfa=df[functools.reduce(lambda a, b: a&b, (df['abstract'].str.contains(s) for s in search_words))]
    df1=dfa.drop_duplicates()
    
    display(HTML('<h3>Task Topic: '+str1+'</h3>'))
    #tell the system how many sentences are needed
    max_sentences=5

    z=z+1
    # loop through the result of the dataframe search
    for index, row in df1.iterrows():
        pub_sentence=''
        # record how many sentences have been saved for display
        sentences_used=0
        #break apart the abstract to sentence level
        sentences = row['abstract'].split('. ')
        
        #loop through the sentences of the abstract
        for sentence in sentences:
            # missing counts how many search words are absent from the sentence
            missing=0
            #loop through the search words
            for word in search_words:
                #if a keyword is missing, increment the missing counter
                if word not in sentence:
                    missing=missing+1
            # keep sentences missing fewer than len(search_words)-1 keywords, up to max_sentences per abstract
            if missing < len(search_words)-1 and sentences_used < max_sentences and len(sentence)<1000 and sentence!='':
                sentence=sentence.capitalize()
                if sentence[-1]!='.':
                    sentence=sentence+'.'
                pub_sentence=pub_sentence+'<br><br>'+sentence
                # count the saved sentence so the max_sentences limit is enforced
                sentences_used=sentences_used+1
                
        if pub_sentence!='':
            sentence=pub_sentence
            authors=row["authors"].split(" ")
            link=row['doi']
            title=row["title"]
            linka='https://doi.org/'+link
            linkb=title
            sentence='<p align="left">'+sentence+'</p>'
            final_link='<p align="left"><a href="{}">{}</a></p>'.format(linka,linkb)
            to_append = [row['publish_time'],authors[0]+' et al.',final_link,sentence]
            df_length = len(df_table)
            df_table.loc[df_length] = to_append
            
    filename=str1+'.csv'
    df_table.to_csv(filename,index = False)
    # render the results as an HTML table; use a new name so df_table stays a DataFrame
    html_table=HTML(df_table.to_html(escape=False,index=False))
    display(html_table)

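
The sentence-selection rule used in the loop can be isolated into a small function for testing. This is a sketch, not the notebook's code: the name `select_excerpts`, its parameters, and the sample abstract are assumptions, while the threshold `missing < len(search_words) - 1` and the 1000-character limit mirror the loop.

```python
def select_excerpts(abstract, search_words, max_sentences=5, max_length=1000):
    """Return up to max_sentences sentences that match enough of the search words."""
    selected = []
    for sentence in abstract.split(". "):
        if not sentence or len(sentence) >= max_length:
            continue
        # Count how many search words are absent from this sentence.
        missing = sum(1 for word in search_words if word not in sentence)
        # Same threshold as the loop: fewer than len(search_words) - 1 misses.
        if missing < len(search_words) - 1:
            sentence = sentence.capitalize()
            if not sentence.endswith("."):
                sentence += "."
            selected.append(sentence)
            if len(selected) == max_sentences:
                break
    return selected

# Hypothetical lowercase abstract, as produced by the preprocessing step.
abstract = ("the basic reproductive number was estimated at 2.2. "
            "the incubation period was five days. "
            "the reproductive number varied by region.")
excerpts = select_excerpts(abstract, ["basic", "reproductive", "number"])
print(excerpts)
```

With this threshold, a two-word list requires both words in the sentence, a three-word list tolerates one missing word, and a one-word list would never match because `missing < 0` cannot hold.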