### How to get a good list of search terms?
- Tried three keyword-extraction tools: spaCy part-of-speech filtering ("hotwords"), pytextrank, and RAKE. pytextrank and RAKE seem best at finding relevant n-grams.
- A human will still have to check the resulting lists to pick out the most useful keywords.

In [1]:
import gzip
from bs4 import BeautifulSoup as bs
import pandas as pd
import datetime
import os
import glob

# First, read in the data
This is the Semantic Scholar CORD data dump, which was munged into a dataframe in a previous notebook.
https://pages.semanticscholar.org/coronavirus-research

In [2]:
df = pd.read_csv('data/s2_cr_data.csv', dtype=str)
df.shape

(12617, 14)

In [3]:
df.columns

Index(['pid', 'doi', 'title', 'abstract', 'authors', 'venue', 'year', 'tiabs',
       'journal', 'journal-short', 'pubdate', 'issns', 'publisher', 'pre'],
      dtype='object')

# Exploratory Analysis
We've already done some exploration of the data in a previous notebook, but we can check some things here. 

In [4]:
# what time-frame does this cover?
# consider limiting to recent years
df.year.value_counts()

2018    1051
2019     955
2017     913
2016     810
2015     734
2014     489
2013     432
2012     407
2011     340
2010     234
2009     186
2008     112
2007      81
2020      63
2006      63
2005      51
2004      27
2003      17
1994      12
1992      11
1993       9
1991       9
1990       7
1996       7
1995       7
1997       6
1989       6
1999       5
None       3
1998       3
1988       3
1985       3
1970       2
2001       2
1981       2
1987       2
2000       2
1986       2
1982       2
1974       1
2002       1
1967       1
1980       1
1984       1
1957       1
1969       1
1965       1
Name: year, dtype: int64

In [5]:
# limit years (left disabled; uncomment to keep only 2000-2020)
# df = df[df['year'].isin({str(x) for x in range(2000, 2021)})]
# df.shape
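
Because the CSV was read with `dtype=str`, the `year` column holds strings, including the literal `None` seen in the counts above. Here is a minimal sketch of the year filter on a small stand-in frame rather than the real data. The 2000 to 2020 window comes from the disabled cell above; `limit_years` and `toy` are hypothetical names used only for illustration.

```python
import pandas as pd

def limit_years(frame, first=2000, last=2020):
    """Keep rows whose 'year' string parses to a number in [first, last]."""
    # errors='coerce' turns unparseable values such as 'None' into NaN,
    # which then fail the between() test and are dropped.
    years = pd.to_numeric(frame['year'], errors='coerce')
    return frame[years.between(first, last)]

# Stand-in data, not taken from the real file
toy = pd.DataFrame({'year': ['2019', '1994', 'None', '2020'],
                    'title': ['a', 'b', 'c', 'd']})
recent = limit_years(toy)
print(recent['year'].tolist())
```

Parsing to numbers avoids building a set of year strings and handles the `None` rows without a special case.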

# What keywords can we find with spaCy?
Adapted from: https://medium.com/better-programming/extract-keywords-using-spacy-in-python-4a8415478fbf

In [6]:
# !pip install spacy
# !python -m spacy download en_core_web_lg
# !python -m spacy validate

In [7]:
import spacy
from collections import Counter
from string import punctuation

In [8]:
import en_core_web_lg
nlp = en_core_web_lg.load()

In [9]:
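def get_hotwords(text):
    """Return lowercased proper nouns, adjectives and nouns, skipping stop words and punctuation."""
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN']
    doc = nlp(text.lower())
    for token in doc:
        # skip stop words and punctuation tokens
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        # keep only the parts of speech listed above
        if token.pos_ in pos_tag:
            result.append(token.text)
    return result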

In [10]:
hotwords = []

# collect candidate words from the 'tiabs' text of every record
for abstract in df['tiabs']:
    hotwords.extend(get_hotwords(abstract))
len(hotwords)

1540036

In [11]:
common_hotwords = Counter(hotwords).most_common(30)
common_hotwords

[('virus', 16383),
 ('infection', 11857),
 ('viral', 10063),
 ('respiratory', 8966),
 ('cells', 8368),
 ('disease', 7809),
 ('viruses', 7614),
 ('human', 7561),
 ('influenza', 7164),
 ('study', 7160),
 ('patients', 7106),
 ('health', 6633),
 ('protein', 6417),
 ('cell', 6224),
 ('data', 5394),
 ('rna', 4870),
 ('results', 4811),
 ('infections', 4596),
 ('analysis', 4544),
 ('cov', 4522),
 ('host', 4418),
 ('clinical', 4299),
 ('cases', 4133),
 ('sars', 4070),
 ('high', 4025),
 ('expression', 3938),
 ('infectious', 3808),
 ('proteins', 3743),
 ('time', 3740),
 ('response', 3721)]

In [12]:
hotdf = pd.DataFrame(common_hotwords, columns = ['kw','count'])
hotdf['algo'] = 'hotwords'

Mostly, this is giving us single words rather than phrases. A lot of these are too broad to discriminate COVID-19-related research from other research.
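
One way to see why such words are too broad is to count how many documents each word appears in. A word present in most abstracts cannot separate one topic from another. The sketch below illustrates this on made-up token lists. The `max_doc_share` cutoff and the function name are assumptions, not something tuned on this dataset.

```python
from collections import Counter

def drop_broad_terms(docs_tokens, max_doc_share=0.5):
    """Return per-term document counts for terms found in at most
    max_doc_share of the documents."""
    n_docs = len(docs_tokens)
    # set() so a word counts once per document, however often it repeats
    doc_freq = Counter(tok for tokens in docs_tokens for tok in set(tokens))
    return {term: n for term, n in doc_freq.items()
            if n / n_docs <= max_doc_share}

# Made-up token lists standing in for the output of get_hotwords()
toy_docs = [
    ['virus', 'infection', 'mers'],
    ['virus', 'protein', 'spike'],
    ['virus', 'infection', 'bats'],
    ['virus', 'spike', 'mers'],
]
kept = drop_broad_terms(toy_docs)
print(sorted(kept))
```

Here `virus` occurs in every toy document and is dropped, while the narrower terms survive.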

# Pytextrank
This is a different model that we can also use with spaCy for finding keywords and keyphrases.
Adapted from: https://pypi.org/project/pytextrank/

In [13]:
# !pip install pytextrank

In [14]:
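import pytextrank
# nlp = spacy.load('en_core_web_sm')
# This is the pytextrank 2.x / spaCy 2.x API; pytextrank 3.x registers the
# component with nlp.add_pipe('textrank') instead.
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)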

In [15]:
keyphrases = []
for abstract in df['tiabs']:
    doc = nlp(abstract)
    # doc._.phrases is sorted best-first, so i is the phrase's rank within this abstract
    keyphrases.extend([(p.text, i) for i, p in enumerate(doc._.phrases)])
pytr = pd.DataFrame(keyphrases, columns=['kw', 'rank'])
pytr.shape

(721485, 2)

In [16]:
# keep each abstract's five top-ranked phrases, then count how often each phrase occurs
pytr = pytr[pytr['rank'] < 5].groupby('kw').count().sort_values('rank', ascending=False).reset_index()

In [17]:
pytr.columns= ['kw','count']
pytr.head()

| | kw | count |
|---|---|---|
| 0 | viruses | 252 |
| 1 | respiratory viruses | 172 |
| 2 | respiratory syncytial virus | 169 |
| 3 | middle east respiratory syndrome coronavirus | 163 |
| 4 | patients | 159 |


In [18]:
# Counter(keyphrases).most_common(100)

In [19]:
pytr['algo'] = 'pytextrank'

This is much better. We're getting a list of phrases (or ngrams) which seem quite relevant to the coronavirus outbreak. There will be some need to sort through these manually.
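
The aggregation used above (keep each abstract's five top-ranked phrases, then count how often each phrase shows up) can be illustrated without spaCy. This sketch uses hand-written ranked lists in place of `doc._.phrases`; `count_top_phrases`, `top_n` and the toy phrases are hypothetical.

```python
from collections import Counter

def count_top_phrases(ranked_per_doc, top_n=5):
    """Count how often each phrase lands in a document's top_n ranks."""
    counts = Counter()
    for phrases in ranked_per_doc:
        # phrases are ordered best-first, so slicing keeps ranks 0..top_n-1
        counts.update(phrases[:top_n])
    return counts

# Made-up ranked phrase lists, one per document
toy_ranked = [
    ['respiratory viruses', 'patients', 'hospital admission'],
    ['respiratory viruses', 'viral load', 'patients'],
    ['spike protein', 'respiratory viruses', 'receptor binding'],
]
top = count_top_phrases(toy_ranked, top_n=2)
print(top.most_common(3))
```

A phrase that ranks highly in many abstracts rises to the top, while a phrase that only ever appears low in the rankings is not counted at all.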

# RAKE - Rapid Automatic Keyword Extraction
Adapted from: https://pypi.org/project/rake-nltk/
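
For orientation, the core idea of RAKE fits in a few lines: split the text into candidate phrases at stop words and punctuation, score each word by its degree divided by its frequency, and score a phrase as the sum of its word scores. The code below is a simplified sketch of that idea and not the `rake_nltk` implementation. The tiny stop word list and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list; rake_nltk uses NLTK's much larger one.
STOPWORDS = {'a', 'an', 'the', 'of', 'in', 'and', 'is', 'are',
             'for', 'to', 'by', 'with'}

def candidate_phrases(text):
    """Split text into runs of non-stop-words, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r'[^\w\s-]+', text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text):
    """Score each candidate phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scored = {' '.join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

ranked = rake_scores(
    'Severe acute respiratory syndrome is caused by a coronavirus. '
    'The coronavirus spike protein binds the host receptor.'
)
print(ranked[0])
```

Because the score sums over words, long runs of content words tend to outrank short ones, which is consistent with the long phrases at the top of the per-abstract rankings below.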

In [20]:
# !pip install rake_nltk

In [21]:
from rake_nltk import Rake
rake_kws = []
r = Rake()  # defaults to NLTK's English stop words and standard punctuation
for abstract in df['tiabs']:
    r.extract_keywords_from_text(abstract)
    # get_ranked_phrases() returns phrases ordered highest score first
    rake_kws.extend([(keyphrase, i) for i, keyphrase in enumerate(r.get_ranked_phrases())])
rake = pd.DataFrame(rake_kws, columns=['kw', 'rank'])
rake.head()

| | kw | rank |
|---|---|---|
| 0 | extremely high positive predictive value | 0 |
| 1 | aligning short sequence reads | 1 |
| 2 | delivering highly accurate results | 2 |
| 3 | illumina basespace app | 3 |
| 4 | empirical benchmarking alongside | 4 |


In [22]:
rake[rake['rank']<=5].groupby('kw').mean().sort_values('rank').head()

| kw | rank |
|---|---|
| structured rna elements may control virus replication | 0.0 |
| evaluated among elderly chinese subjects (≥ 60 years | 0.0 |
| evaluated 49 published computational classification workflows | 0.0 |
| supporting information incorporating uracil | 0.0 |
| evaluate pseudoknot free energies using novel parameters | 0.0 |


In [23]:
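# count how often each phrase is among an abstract's five top-ranked phrases; keep the 100 most frequent
rows = Counter([x[0] for x in rake_kws if x[1] < 5]).most_common(100)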
columns = ['kw','count']

rakedf = pd.DataFrame(rows,columns=columns)
rakedf.head()

| | kw | count |
|---|---|---|
| 0 | severe acute respiratory syndrome | 324 |
| 1 | middle east respiratory syndrome coronavirus | 319 |
| 2 | middle east respiratory syndrome | 163 |
| 3 | porcine epidemic diarrhea virus | 122 |
| 4 | severe acute respiratory syndrome coronavirus | 108 |


In [24]:
rakedf

| | kw | count |
|---|---|---|
| 0 | severe acute respiratory syndrome | 324 |
| 1 | middle east respiratory syndrome coronavirus | 319 |
| 2 | middle east respiratory syndrome | 163 |
| 3 | porcine epidemic diarrhea virus | 122 |
| 4 | severe acute respiratory syndrome coronavirus | 108 |
| ... | ... | ... |
| 95 | acute respiratory infection | 6 |
| 96 | acquired immune deficiency syndrome | 6 |
| 97 | related cell adhesion molecule 1 | 6 |
| 98 | time quantitative polymerase chain reaction | 6 |


In [25]:
rakedf['algo'] = 'RAKE'

This is also a good list of n-grams that we can use for coronavirus searches.

## Write out to file
Concatenate the pytextrank and RAKE keyword lists produced above into one spreadsheet and output it to xlsx. The spaCy hotwords list is left out because its keywords were not very useful.

In [26]:
out = pd.concat([
    pytr,
    rakedf,
#     hotdf  # I'm commenting this out because I don't think the keywords here were very useful
])
out

| | kw | count | algo |
|---|---|---|---|
| 0 | viruses | 252 | pytextrank |
| 1 | respiratory viruses | 172 | pytextrank |
| 2 | respiratory syncytial virus | 169 | pytextrank |
| 3 | middle east respiratory syndrome coronavirus | 163 | pytextrank |
| 4 | patients | 159 | pytextrank |
| ... | ... | ... | ... |
| 95 | acute respiratory infection | 6 | RAKE |
| 96 | acquired immune deficiency syndrome | 6 | RAKE |
| 97 | related cell adhesion molecule 1 | 6 | RAKE |
| 98 | time quantitative polymerase chain reaction | 6 | RAKE |
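
Because the two lists are stacked, a phrase found by both algorithms appears twice. For example, "middle east respiratory syndrome coronavirus" has a count of 163 from pytextrank and 319 from RAKE. For the manual review it may help to collapse the rows to one entry per keyword. The sketch below does this in plain Python on a few rows copied from the outputs above; `merge_by_keyword` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def merge_by_keyword(rows):
    """Collapse (kw, count, algo) rows into one entry per keyword."""
    merged = defaultdict(dict)
    for kw, count, algo in rows:
        merged[kw][algo] = count
    return dict(merged)

# A few rows copied from the outputs above
rows = [
    ('middle east respiratory syndrome coronavirus', 163, 'pytextrank'),
    ('respiratory viruses', 172, 'pytextrank'),
    ('middle east respiratory syndrome coronavirus', 319, 'RAKE'),
    ('porcine epidemic diarrhea virus', 122, 'RAKE'),
]
merged = merge_by_keyword(rows)
found_by_both = [kw for kw, algos in merged.items() if len(algos) > 1]
print(found_by_both)
```

Keywords that both algorithms agree on are plausible first candidates for the search-term list.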


This is fairly small and might be OK to put on GitHub for sharing purposes. Large amounts of data shouldn't go on GitHub.

In [27]:
out.to_excel('output/keyword_list.xlsx')