# Keyword / Keyphrase extraction
We are looking for search terms that we can use to find research related to the coronavirus pandemic.

### How to get a good list of search terms?
- We tried three different keyword extraction tools. Pytextrank and RAKE seem best at finding relevant ngrams.
- A human will have to check the resultant lists to pick out the most useful keywords.

In [21]:
import gzip
from bs4 import BeautifulSoup as bs
import pandas as pd
import datetime
import os
import glob

# First, read in the data
This is the Semantic Scholar CORD data dump, which was munged into a dataframe in a previous notebook.
https://pages.semanticscholar.org/coronavirus-research

In [22]:
df = pd.read_csv('data/s2_cr_data.csv', dtype=str)
df.shape

(34185, 22)

In [23]:
df.columns

Index(['sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license',
       'abstract', 'publish_time', 'authors', 'journal',
       'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_full_text',
       'full_text_file', 'tiabs', 'journal-short', 'pubdate', 'issns',
       'publisher', 'cr_dates', 'year'],
      dtype='object')

# Exploratory Analysis
We've already done some exploration of the data in a previous notebook, but we can check some things here. 

In [24]:
# what time-frame does this cover?
# consider limiting to recent years
df.year.value_counts()

0000    7821
2019    2596
2018    2314
2017    2037
2016    2035
2015    1882
2020    1760
2014    1663
2013    1516
2012    1305
2011    1212
2009    1051
2010    1040
2008     921
2007     765
2006     723
2005     642
2004     585
2003     253
1992     129
1991     129
1995     125
1993     118
1994     115
1990     105
1998      99
1989      99
2002      98
1997      98
1996      97
1988      96
1999      96
2000      90
1987      88
2001      79
1986      78
1985      57
1984      46
1983      43
1982      31
1981      30
1977      19
1979      19
1980      19
1978      16
1976      10
1973       7
1975       7
1970       5
1972       4
1974       3
['20       3
1957       1
1963       1
1971       1
1967       1
1955       1
1965       1
Name: year, dtype: int64

In [25]:
# limit years - only recent stuff is likely to be relevant (?)
df=df[df['year'].isin(set([str(x) for x in range(2000,2021)]))]
df.shape

(24567, 22)
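
The filter above keeps only rows whose `year` is a well-formed string between 2000 and 2020, so it also drops the `0000` placeholder and malformed values such as `['20`. A minimal pure-Python sketch of the same membership test on a few sample values (the helper name `is_recent_year` and the sample list are illustrative and not part of the notebook):

```python
# Illustrative sketch: the same membership test as the isin() filter, on toy values.
RECENT_YEARS = {str(y) for y in range(2000, 2021)}  # '2000' .. '2020' inclusive

def is_recent_year(value):
    """Return True if value is a year string between 2000 and 2020."""
    return value in RECENT_YEARS

sample_years = ['0000', '2019', '1999', "['20", '2020', '2000']
kept = [y for y in sample_years if is_recent_year(y)]
print(kept)
```

Because the column was read with `dtype=str`, the comparison has to be against strings, which is why the notebook builds the set with `str(x)`.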

# What keywords can we find with SpaCy?
Adapted from: https://medium.com/better-programming/extract-keywords-using-spacy-in-python-4a8415478fbf

In [26]:
# !pip install spacy
# !python -m spacy download en_core_web_lg
# !python -m spacy validate

In [27]:
import spacy
from collections import Counter
from string import punctuation

In [28]:
import en_core_web_lg
nlp = en_core_web_lg.load()

In [29]:
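def get_hotwords(text):
    """Return the proper nouns, adjectives and nouns in `text`, lowercased."""
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN']
    doc = nlp(text.lower())
    for token in doc:
        # skip stop words and punctuation
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        # keep only the parts of speech listed above
        if token.pos_ in pos_tag:
            result.append(token.text)

    return result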
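
The filtering logic in `get_hotwords` can be tried out without loading the spaCy model. The sketch below applies the same two checks to hand-tagged `(text, pos)` pairs. The stop-word set, the tags, and the name `filter_hotwords` are stand-ins chosen for illustration; in the notebook the stop words come from `nlp.Defaults.stop_words` and the tags from spaCy.

```python
from string import punctuation

# Hypothetical stand-ins: a tiny stop-word set and hand-tagged (text, pos) pairs.
STOP_WORDS = {'the', 'of', 'a', 'is', 'in'}
KEEP_POS = {'PROPN', 'ADJ', 'NOUN'}

def filter_hotwords(tagged_tokens):
    """Keep proper nouns, adjectives and nouns that are not stop words or punctuation."""
    result = []
    for text, pos in tagged_tokens:
        if text in STOP_WORDS or text in punctuation:
            continue
        if pos in KEEP_POS:
            result.append(text)
    return result

tagged = [('the', 'DET'), ('novel', 'ADJ'), ('coronavirus', 'NOUN'),
          ('spreads', 'VERB'), ('in', 'ADP'), ('wuhan', 'PROPN'), ('.', 'PUNCT')]
hot = filter_hotwords(tagged)
print(hot)
```

Each kept item is a single token, so this approach cannot return multi-word phrases on its own.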

In [None]:
hotwords = []

for abstract in df['tiabs']:
    abstract = str(abstract)
    if len(abstract.split())>5:
        hotwords.extend(get_hotwords(abstract))
len(hotwords)    
#     hashtags = [('#' + x[0]) for x in Counter(get_hotwords).most_common(5)]
# print(' '.join(hashtags))

In [None]:
common_hotwords = Counter(hotwords).most_common(30)
common_hotwords

In [None]:
hotdf = pd.DataFrame(common_hotwords, columns = ['kw','count'])
hotdf['algo'] = 'hotwords'

Mostly, this is giving us single words rather than phrases. A lot of these are too broad to discriminate COVID-19-related research from other research.

# Pytextrank
This is a different model that we can also use with SpaCy for finding keywords and keyphrases. 
Adapted from: https://pypi.org/project/pytextrank/

In [None]:
# !pip install pytextrank

In [None]:
import pytextrank
# nlp = spacy.load('en_core_web_sm')
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)

In [None]:
keyphrases = []
for abstract in df['tiabs']:
    doc = nlp(abstract)
    keyphrases.extend([(p.text,i) for i,p in enumerate(doc._.phrases)])
pytr = pd.DataFrame((keyphrases), columns = ['kw','rank'])
pytr.shape

In [None]:
pytr = pytr[pytr['rank']<5].groupby('kw').count().sort_values('rank', ascending = False).reset_index()

In [None]:
pytr.columns= ['kw','count']
pytr.head()

In [None]:
# Counter(keyphrases).most_common(100)

In [None]:
pytr['algo'] = 'pytextrank'

This is much better. We're getting a list of phrases (or ngrams) which seem quite relevant to the coronavirus outbreak. There will be some need to sort through these manually.
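
The aggregation used above (keep each document's top-ranked phrases, then count how many times each phrase appears across documents) can be sketched on toy data. The phrases and the `TOP_N` cut-off below are made up for illustration; the notebook uses `rank < 5` on the real pytextrank output.

```python
from collections import Counter

# Toy data: each inner list is one document's phrases, best-ranked first.
ranked_per_doc = [
    ['novel coronavirus', 'respiratory syndrome', 'patients'],
    ['novel coronavirus', 'viral infection'],
    ['respiratory syndrome', 'novel coronavirus', 'outbreak'],
]

TOP_N = 2  # the notebook keeps rank < 5; a smaller cut-off suits the toy data

# Count a phrase once for every document in which it ranks inside the top N.
counts = Counter(
    phrase
    for phrases in ranked_per_doc
    for rank, phrase in enumerate(phrases)
    if rank < TOP_N
)
print(counts.most_common())
```

Phrases that only appear below the cut-off (`patients`, `outbreak`) never enter the counts, which is what keeps the final list short enough for manual review.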

# RAKE - Rapid Automatic Keyword Extraction
Adapted from: https://pypi.org/project/rake-nltk/

In [None]:
# !pip install rake_nltk

In [None]:
from rake_nltk import Rake
rake_kws = []
r = Rake()
for abstract in df['tiabs']:
    r.extract_keywords_from_text(abstract)
    rake_kws.extend([(keyphrase,i) for i, keyphrase in enumerate(r.get_ranked_phrases())])
rake = pd.DataFrame(rake_kws, columns = ['kw','rank'])
rake.head()

In [None]:
# use the same top-5 cut-off (rank < 5) as the pytextrank and Counter cells
rake[rake['rank']<5].groupby('kw').mean().sort_values('rank').head()

In [None]:
rows= Counter([x[0] for x in rake_kws if x[1]<5]).most_common(100)
columns = ['kw','count']

rakedf = pd.DataFrame(rows,columns=columns)
rakedf.head()

In [None]:
rakedf

In [None]:
rakedf['algo'] = 'RAKE'

RAKE also gives us a good list of ngrams which we can use for coronavirus searches.
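
To see why RAKE favours longer phrases, here is a simplified sketch of the degree-to-frequency scoring idea behind it. This is not `rake_nltk`'s implementation: the stop-word list is a tiny hypothetical one, tokenisation is a plain regex, and the function name `rake_sketch` is made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {'is', 'a', 'of', 'the', 'and'}

def rake_sketch(text):
    """Score candidate phrases by summing degree/frequency over their words."""
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

    # Stop words and punctuation act as delimiters between candidate phrases.
    phrases, current = [], []
    for tok in tokens:
        if tok in STOP_WORDS or not tok.isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)

    # freq: how often a word occurs; degree: total length of the phrases it occurs in.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scored = [(' '.join(p), sum(degree[w] / freq[w] for w in p)) for p in phrases]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rake_sketch('Rapid automatic keyword extraction is a method of keyword ranking')
print(ranked)
```

Words that sit inside long candidate phrases get a high degree, so multi-word phrases tend to outrank single words, which matches the ngram-heavy lists seen above.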

## Write out to file
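Concatenate the keyword lists we produced above into a spreadsheet and output to xlsx. Only the Pytextrank and RAKE lists are included; the hotwords list is left out because its keywords were not very useful.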

In [None]:
out = pd.concat([
    pytr,
    rakedf,
#     hotdf  # I'm commenting this out because I don't think the keywords here were very useful
])
out

This is fairly small and might be OK to put on GitHub for sharing purposes. (Large amounts of data shouldn't go on GitHub.)

In [None]:
out.to_excel('output/keyword_list.xlsx')