## Setup

COLAB INSTRUCTIONS: Run the `drive.mount` code block below so that Colab can access the data files this notebook needs. Then open the file explorer on the left, navigate to the `topics` folder, right-click it, and choose `Copy path`. Paste that path into the `SAVE_DIR` input below:

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
SAVE_DIR = "/content/drive/My Drive/final project/topics" #@param{type:"string"}
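A mistyped or stale `SAVE_DIR` would probably only show up later, when the papers are loaded. A minimal sketch of an early check follows; `check_save_dir` is a hypothetical helper, not part of Colab or the `cord19` library.

```python
import os
import tempfile

def check_save_dir(path):
    """Return path unchanged if it is an existing directory, else raise."""
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"SAVE_DIR does not exist or is not a folder: {path!r}")
    return path

# Demonstrated on a temporary folder so the sketch runs anywhere;
# in this notebook the call would be check_save_dir(SAVE_DIR).
with tempfile.TemporaryDirectory() as demo_dir:
    print(check_save_dir(demo_dir) == demo_dir)
```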

In [4]:
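# Work around a pandas version issue by upgrading to pandas >= 1.0.0
# (the spec is quoted so the shell does not treat ">" as an output redirect)
!pip install -U "pandas>=1.0.0"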
!pip install git+https://github.com/dgunning/cord19.git numpy scikit-learn

Collecting git+https://github.com/dgunning/cord19.git
  Cloning https://github.com/dgunning/cord19.git to /tmp/pip-req-build-hrimulu1
  Running command git clone -q https://github.com/dgunning/cord19.git /tmp/pip-req-build-hrimulu1
Building wheels for collected packages: cord19
  Building wheel for cord19 (setup.py) ... done
  Created wheel for cord19: filename=cord19-0.4.0-cp36-none-any.whl size=60265304 sha256=2f14b18ef286bb6e05904dbff4c6731b94cda949866af13f0cb66e92747db254
  Stored in directory: /tmp/pip-ephem-wheel-cache-6_0k58jg/wheels/b9/9f/97/0d88e94e081f6b407722d7a971a402b10f457098f16a9d6ab3
Successfully built cord19


In [5]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction import text

In [0]:
from cord import ResearchPapers
papers = ResearchPapers.from_pickle(save_dir=SAVE_DIR)
papers_covid = papers.since_sarscov2()

## Work

In [0]:
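# Convenience wrapper: keep the fitted vectorizer and topic model in one object
class LDA():
    def fit(self, data, my_stops=(), min_df=10, n_components=8, n_jobs=-1):
        # Merge custom stop words into scikit-learn's English stop-word set;
        # passed as a list, which every scikit-learn version accepts
        self.vectorizer = CountVectorizer(
            min_df=min_df,
            stop_words=list(text.ENGLISH_STOP_WORDS.union(my_stops)))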
        counts = self.vectorizer.fit_transform(data)
        self.topics = LatentDirichletAllocation(
                n_components=n_components,
                random_state=0,
                n_jobs=n_jobs).fit(counts)
        return self
    
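    def print_topics(self, n_words=8):
        # Column index -> word, built from the fitted vocabulary
        vocab = self.vectorizer.vocabulary_
        words = sorted(vocab, key=vocab.get)
        # Normalize each topic's word weights into a probability distribution
        topic_dists = self.topics.components_ / self.topics.components_.sum(axis=1, keepdims=True)
        for dist in topic_dists:
            # Indices of the n_words most probable words, most probable first
            top_i = np.argsort(dist)[-n_words:][::-1]
            print([words[i] for i in top_i])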
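The step of picking a topic's top words by weight can be tried in isolation. The sketch below uses a made-up 2-topic weight matrix and vocabulary rather than a fitted model; `top_words` and the toy values are illustrative assumptions, not part of scikit-learn.

```python
import numpy as np

def top_words(components, vocabulary, n_words=3):
    """Return the n_words highest-weight words per topic, highest first."""
    # Invert the word -> column mapping so column indices can be looked up.
    index_to_word = {index: word for word, index in vocabulary.items()}
    ranked = []
    for weights in components:
        # argsort is ascending, so take the tail and reverse it.
        top_i = np.argsort(weights)[-n_words:][::-1]
        ranked.append([index_to_word[i] for i in top_i])
    return ranked

# Made-up numbers purely for illustration (2 topics, 4 words).
toy_components = np.array([
    [0.1, 5.0, 2.0, 0.5],
    [4.0, 0.2, 0.3, 3.0],
])
toy_vocabulary = {"cases": 0, "model": 1, "epidemic": 2, "wuhan": 3}

for words in top_words(toy_components, toy_vocabulary):
    print(words)
```

Dividing each row by its sum, as the class does, changes the scale of the weights but not their order within a topic, so the same words come out on top either way.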

In [0]:
covid_stops = [
    "coronavirus", "corona", "covid", "sars", "cov", "19",
    "virus", "viruses", "viral", "disease", "diseases", "2019", "2020"
]

### Attempt with summaries

In [0]:
def clean_summary(summary):
    return summary.replace("\n", " ")

def get_summaries(papers):
    return pd.Series([clean_summary(papers[i].summary) for i in range(len(papers))])

In [0]:
summaries = get_summaries(papers_covid)

In [0]:
lda_summaries = LDA().fit(summaries,
                          n_components=20, 
                          my_stops=covid_stops)

In [18]:
lda_summaries.print_topics()

['infected', 'number', 'epidemic', 'cases', 'time', 'data', 'model', 'rate']
['respiratory', 'patients', 'associated', 'clinical', 'severe', 'acute', 'group', 'mortality']
['respiratory', 'syndrome', 'mers', 'human', 'rna', 'severe', 'coronaviruses', 'genome']
['number', 'confirmed', 'countries', 'reported', 'cases', 'data', 'china', 'wuhan']
['patients', 'hospital', 'cancer', 'care', 'management', 'surgery', 'pandemic', 'intensive']
['social', 'distancing', 'measures', 'transmission', 'states', 'spread', 'people', 'lockdown']
['economic', 'countries', 'different', 'drug', 'global', 'potential', 'drugs', 'pandemic']
['public', 'health', 'outbreak', 'china', 'world', 'global', 'emergency', 'et']
['cell', 'cells', 'expression', 'infection', 'immune', 'replication', 'host', 'antiviral']
['infection', 'workers', 'health', 'risk', 'medical', 'care', 'healthcare', 'pandemic']
['human', 'receptor', 'proteins', 'binding', 'protein', 'vaccine', 'spike', 'ace2']
['analysis', 'based', 'informatio

### Attempt with titles

In [0]:
def get_titles(papers):
    return pd.Series([papers[i].title for i in range(len(papers))])

In [0]:
titles = get_titles(papers_covid)

In [0]:
lda_titles = LDA().fit(titles,
                       n_components=20,
                       my_stops=covid_stops)

In [0]:
lda_titles.print_topics()

### Attempt with searching before LDA

In [19]:
papers_relationship = papers_covid.search("relationship")
papers_relationship

### Future stuff

Goal, once the topics returned by LDA above become useful: use the keywords of each topic as search terms to find the papers relevant to that topic. A search is run with `papers.search("term1 term2 [...]")`. Reading those papers more closely may reveal what relationship between variables, if any, a given topic represents. Note that the search feature (provided by the `cord19` library) uses NLP to match by word similarity, not just literal occurrences of the search terms.
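A minimal sketch of turning topics into search strings, assuming a topic-by-word weight matrix and a matching list of feature names. `topic_queries` and the toy values are hypothetical; only the `search("term1 term2 [...]")` call shape comes from the notes above.

```python
import numpy as np

def topic_queries(components, feature_names, n_words=3):
    """Build one space-separated query string per topic from its top words."""
    queries = []
    for weights in components:
        # Indices of the n_words largest weights, largest first.
        top_i = np.argsort(weights)[-n_words:][::-1]
        queries.append(" ".join(feature_names[i] for i in top_i))
    return queries

# Made-up numbers purely for illustration (2 topics, 5 words).
toy_feature_names = ["ace2", "binding", "lockdown", "distancing", "spike"]
toy_components = np.array([
    [3.0, 2.0, 0.1, 0.2, 4.0],
    [0.1, 0.3, 5.0, 4.0, 0.2],
])

queries = topic_queries(toy_components, toy_feature_names)
print(queries)

# Each query could then be handed to the search call, e.g.
# results = [papers_covid.search(query) for query in queries]
```

With the class from this notebook, the weight matrix would be `lda_summaries.topics.components_`; how to get the feature names in column order depends on the installed scikit-learn version.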

Another possible fix for the current problem of uninformative topics: search _before_ running LDA. That is, start from a corpus of papers that already investigate relationships.
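The filter-then-model idea can be sketched without the library. The example below narrows a list of texts with a plain case-insensitive keyword match, which is a deliberately simpler stand-in for the similarity-based search of `cord19`; `filter_corpus` and the toy documents are hypothetical.

```python
def filter_corpus(documents, terms):
    """Keep the documents that mention at least one of the given terms."""
    lowered_terms = [term.lower() for term in terms]
    return [
        doc for doc in documents
        if any(term in doc.lower() for term in lowered_terms)
    ]

# Made-up one-line "summaries", for illustration only.
toy_documents = [
    "Relationship between temperature and transmission rate",
    "Genome sequence of a novel isolate",
    "Correlation of age with mortality in hospital patients",
]

filtered = filter_corpus(toy_documents, ["relationship", "correlation"])
print(len(filtered), "of", len(toy_documents), "documents kept")
```

The filtered texts could then go to `LDA().fit(...)` in place of the full set of summaries. A much smaller corpus may need a lower `min_df` than the default of 10 used above, since a word must appear in at least that many documents to stay in the vocabulary.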