## Setup

Run the command below to install the [cord19](https://github.com/dgunning/cord19) library, which provides convenient processing tools for the research paper dataset from the [COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Then download the dataset from the Kaggle challenge (an account is required) into a directory named `data`, so that the directory `data/CORD-19-research-challenge` exists. You can skip the download if you are loading the data from a pickle.

In [1]:
import sys
!{sys.executable} -m pip install git+https://github.com/dgunning/cord19.git numpy pandas scikit-learn

Collecting git+https://github.com/dgunning/cord19.git
  Cloning https://github.com/dgunning/cord19.git to /private/var/folders/r9/tg0dy4g52_zf8yd0xqdl3q4h0000gn/T/pip-req-build-dliqhs_s
  Running command git clone -q https://github.com/dgunning/cord19.git /private/var/folders/r9/tg0dy4g52_zf8yd0xqdl3q4h0000gn/T/pip-req-build-dliqhs_s
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting simplejson
  Downloading simplejson-3.17.0-cp37-cp37m-macosx_10_14_x86_64.whl (73 kB)
Collecting rank_bm25
  Downloading rank_bm25-0.2.tar.gz (4.2 kB)
Collecting pendulum
  Downloading pendulum-2.1.0-cp37-cp37m-macosx_10_13_x86_64.whl (122 kB)
Collecting gensim
  Downloading gensim-3.8.3-cp37-cp37m-macosx_10_9_x86_64.whl (24.2 MB)
Collecting pyarrow>=0.16.0
  Downloading pya

  Downloading boto3-1.13.3-py2.py3-none-any.whl (128 kB)
Collecting s3transfer<0.4.0,>=0.3.0
  Downloading s3transfer-0.3.3-py2.py3-none-any.whl (69 kB)
Collecting botocore<1.17.0,>=1.16.3
  Downloading botocore-1.16.3-py2.py3-none-any.whl (6.2 MB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath-0.9.5-py2.py3-none-any.whl (24 kB)
Collecting docutils<0.16,>=0.10
  Downloading docutils-0.15.2-py3-none-any.whl (547 kB)
Building wheels for collected packages: cord19, bs4, rank-bm25, annoy, smart-open
  Building wheel for cord19 (setup.py) ... done
  Created wheel for cord19: filename=cord19-0.4.0-py3-none-any.whl size=60265304 sha256=9dbea29103e2baf3022a8f8186b8b83779d5e8ff921

  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.16.3-cp37-cp37m-macosx_10_9_x86_64.whl size=67970 sha256=01aec9cd1b6f278cb7fe05dd2edf6e1d67d342531a46eeea8919d273eed97ddf
  Stored in directory: /Users/annakrutsinger/Library/Caches/pip/wheels/39/36/d4/ee348a7240ca3e8d1fcbf04ebe46d45f2879ccb094a40f5706
  Building wheel for smart-open (setup.py) ... done
  Created wheel for smart-open: filename=smart_open-2.0.0-py3-none-any.whl size=101341 sha256=4d323d46584f8537cef603ee36cc7a9d73ca7a02e880cc9612814a2e4e73d41d
  Stored in directory: /Users/annakrutsinger/Library/Caches/pip/wheels/bb/1c/9c/412ec03f6d5ac7d41f4b965bde3fc0d1bd201da5ba3e2636de
Successfully built cord19 bs4 rank-bm25 annoy smart-open
Installing collected packages: bs4, simplejson, rank-bm25, pytzdata, pendulum, docutils, jmespath, botocore, s3transfer, boto3, smart-open, gensim, pyarrow, altair, annoy, markdown, azure-common, azure-nspkg, azure-storage, cord19
  

In [2]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/annakrutsinger/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction import text

In [4]:
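from cord import ResearchPapers

# Prefer the cached pickle; fall back to parsing the raw Kaggle download,
# which requires data/CORD-19-research-challenge to exist.
try:
    papers = ResearchPapers.from_pickle()
except Exception:
    papers = ResearchPapers.load()
    papers.save()  # store the parsed papers for later runs

# Restrict the corpus to papers published since the SARS-CoV-2 outbreak
papers_covid = papers.since_sarscov2()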

AssertionError: Cannot find the input dir should be data/CORD-19-research-challenge
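
The `AssertionError` above is raised when neither a pickle nor the raw download is available. A minimal sketch of a check that could run before loading, assuming the notebook is started from the directory that contains `data` (the helper name `has_cord19_data` is hypothetical and not part of the `cord19` library):

```python
from pathlib import Path

# Directory name taken from the setup instructions and the error message above.
DATASET_DIRNAME = "CORD-19-research-challenge"


def has_cord19_data(root="data"):
    """Return True if <root>/CORD-19-research-challenge exists as a directory."""
    return (Path(root) / DATASET_DIRNAME).is_dir()


if not has_cord19_data():
    print(f"Missing data/{DATASET_DIRNAME}: download it from Kaggle first.")
```

The check only looks at the relative path, so it says nothing about whether the download is complete.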

## Work

In [None]:
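# Keep the vectorizer and the fitted topic model together in one object
class LDA:
    def fit(self, data, my_stops=(), min_df=10, n_components=8, n_jobs=-1):
        # Combine scikit-learn's English stop words with the caller's own;
        # pass a list because some scikit-learn versions reject a frozenset
        stop_words = list(text.ENGLISH_STOP_WORDS.union(my_stops))
        self.vectorizer = CountVectorizer(min_df=min_df, stop_words=stop_words)
        counts = self.vectorizer.fit_transform(data)
        self.topics = LatentDirichletAllocation(
            n_components=n_components,
            random_state=0,
            n_jobs=n_jobs).fit(counts)
        return self

    def print_topics(self, n_words=8):
        # Normalize each row of components_ into a word distribution per topic
        components = self.topics.components_
        topic_dists = components / components.sum(axis=1, keepdims=True)
        # Invert vocabulary_ (word -> column index) to look words up by index
        index_to_word = {i: word for word, i in self.vectorizer.vocabulary_.items()}
        for dist in topic_dists:
            # Indices of the n_words most probable words, most probable first
            top_i = np.argsort(dist)[-n_words:][::-1]
            print([index_to_word[i] for i in top_i])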
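
The row normalization and top-word lookup in `print_topics` can be checked on a toy matrix. The sketch below uses plain Python lists so it runs without the notebook's dependencies; `top_words`, `toy_vocabulary`, and `toy_counts` are hypothetical stand-ins for `components_` and `vocabulary_`, not output from the notebook:

```python
def top_words(topic_word_counts, vocabulary, n_words=3):
    """Return the n_words most probable words per topic, most probable first."""
    # vocabulary maps word -> column index; invert it for lookup by index
    index_to_word = {index: word for word, index in vocabulary.items()}
    result = []
    for row in topic_word_counts:
        total = sum(row)
        probs = [count / total for count in row]  # normalize the row to sum to 1
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        result.append([index_to_word[i] for i in ranked[:n_words]])
    return result


# Hypothetical toy inputs, not results from the CORD-19 corpus
toy_vocabulary = {"bat": 0, "mask": 1, "lung": 2, "host": 3}
toy_counts = [
    [5.0, 1.0, 0.5, 3.5],  # topic 0
    [0.2, 6.0, 3.0, 0.8],  # topic 1
]
print(top_words(toy_counts, toy_vocabulary, n_words=2))
# prints [['bat', 'host'], ['mask', 'lung']]
```

Because each row is divided by its own positive sum, normalizing does not change the ranking within a topic; it only makes the weights comparable as probabilities.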

In [None]:
covid_stops = [
    "coronavirus", "corona", "covid", "sars", "cov", "19",
    "virus", "viruses", "viral", "disease", "diseases", "2019", "2020"
]
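
The numeric entries in `covid_stops` matter because of how `CountVectorizer` tokenizes by default: text is lowercased and split into runs of two or more word characters, so "COVID-19" becomes the tokens `covid` and `19`. A small sketch of that behavior using only `re`, with the token pattern documented as the scikit-learn default (the `tokenize` helper and the sample title are made up for illustration):

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

covid_stops = {
    "coronavirus", "corona", "covid", "sars", "cov", "19",
    "virus", "viruses", "viral", "disease", "diseases", "2019", "2020",
}


def tokenize(document, stops=frozenset()):
    """Lowercase, split into tokens, and drop any token found in stops."""
    tokens = TOKEN_PATTERN.findall(document.lower())
    return [token for token in tokens if token not in stops]


sample = "COVID-19 and SARS-CoV-2 transmission in 2020"  # made-up title
print(tokenize(sample))
# prints ['covid', '19', 'and', 'sars', 'cov', 'transmission', 'in', '2020']
print(tokenize(sample, covid_stops))
# prints ['and', 'transmission', 'in']
```

The single-character token `2` from "CoV-2" is dropped by the pattern itself, and the remaining `and` and `in` would be removed by the English stop word list in the real pipeline.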

### Attempt with summaries

In [None]:
def clean_summary(summary):
    return summary.replace("\n", " ")

def get_summaries(papers):
    return pd.Series([clean_summary(papers[i].summary) for i in range(len(papers))])

In [None]:
summaries = get_summaries(papers_covid)

In [None]:
lda_summaries = LDA().fit(summaries,
                          n_components=20, 
                          my_stops=covid_stops)

In [None]:
lda_summaries.print_topics()

### Attempt with titles

In [None]:
def get_titles(papers):
    return pd.Series([papers[i].title for i in range(len(papers))])

In [None]:
titles = get_titles(papers_covid)

In [None]:
lda_titles = LDA().fit(titles,
                       n_components=20,
                       my_stops=covid_stops)

In [None]:
lda_titles.print_topics()

### Attempt with searching before LDA

In [None]:
papers_relationship = papers_covid.search("variable")

In [None]:
papers_relationship

### Future stuff

Goal once the topics returned by LDA above become useful: use the keywords from each topic as search terms to find the papers relevant to that topic. Searching is done with `papers.search("term1 term2 [...]")`. Reading those papers more closely may reveal which relationship between variables a given topic represents. Note that the search feature (provided by the `cord19` library) uses NLP to match by word similarity, not just literal occurrences of the search terms.
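
Turning topics into search strings could look like the sketch below. The function `topic_queries`, the `max_terms` cutoff, and the example keywords are hypothetical placeholders, not topics produced by this notebook:

```python
def topic_queries(topics, max_terms=4):
    """Join the leading keywords of each topic into one search string."""
    return [" ".join(words[:max_terms]) for words in topics]


# Hypothetical placeholder keywords, listed most probable first
example_topics = [
    ["transmission", "household", "contact", "cases", "outbreak"],
    ["protein", "binding", "receptor"],
]
queries = topic_queries(example_topics)
print(queries)
# prints ['transmission household contact cases', 'protein binding receptor']

# With the corpus loaded, each query could then be passed to the search
# method described above, for example:
#     results = [papers_covid.search(query) for query in queries]
```

Limiting each query to the first few keywords is a guess at what keeps a search focused; the right cutoff would have to be found by trying it.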

A possible solution to the current problem of useless topics is to search _before_ running LDA, that is, to start with a corpus of papers that already investigate relationships.
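
One thing to watch when searching first: the filtered corpus is smaller, and `LDA.fit` defaults to `min_df=10`, so a term must appear in at least 10 documents to be kept. The sketch below estimates which terms would survive a given `min_df`. It is a plain-Python approximation that ignores stop words, and `terms_surviving_min_df` and the toy documents are hypothetical:

```python
import re
from collections import Counter

# Same default token pattern as scikit-learn's CountVectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def terms_surviving_min_df(docs, min_df):
    """Return the terms that occur in at least min_df distinct documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each term counts once per document
        doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Made-up documents standing in for search results
toy_docs = [
    "temperature and transmission rate",
    "humidity and transmission",
    "age and mortality rate",
]
print(terms_surviving_min_df(toy_docs, min_df=2))
# prints ['and', 'rate', 'transmission']
print(terms_surviving_min_df(toy_docs, min_df=10))
# prints []
```

If few or no terms survive, lowering `min_df` when calling `LDA().fit(...)` on the searched subset is probably necessary.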