If you haven't yet, start by setting up your environment and datasets by following the instructions in the README. It should be something like:
* `make create_environment`
* `conda activate covid_nlp`
* `make update_environment`
* `make data`

Several common packages that you may want to use (e.g. UMAP, HDBSCAN, enstop, sklearn) have already been added to the `covid_nlp` environment via `environment.yml`. To add more, edit that file and run:
* `make update_environment`

## Document embedding of abstracts
In this notebook we'll follow https://github.com/gclen/covid19-kaggle/blob/master/embed_abstracts_interactive.ipynb to embed abstracts. It uses the techniques from https://umap-learn.readthedocs.io/en/latest/document_embedding.html to embed documents using UMAP.

In [None]:
# Quick cell to make jupyter notebook use the full screen width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
# Automatically pick up code changes in the `src` module
%load_ext autoreload
%autoreload 2

In [None]:
import json
import pandas as pd

In [None]:
# Useful imports from easydata
from src import paths
from src.data import Dataset
from src import workflow

In [None]:
# other packages
from scipy import sparse

# tokenizing/vectorizing
from src.data.em_method import em_sparse
import en_core_sci_sm # A full spaCy pipeline for biomedical data.
from scispacy.custom_tokenizer import combined_rule_tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# dimension reduction
import umap
import umap.plot
# clustering
import hdbscan

In [None]:
# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib inline
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)

## Load up the dataset

The metadata has been augmented with the location of each file relative to `paths["interim_data_path"]`.

In [None]:
#paths['interim_data_path']

In [None]:
workflow.available_datasets()

If the previous cell returned an empty list, go back and re-run `make data` as described at the top of this notebook.

In [None]:
ds_name = 'covid_nlp_20200319'

In [None]:
# Load the dataset
meta_ds = Dataset.load(ds_name)

In [None]:
print(meta_ds.DESCR[:457])

In [None]:
# The processed dataframe is the `data` attribute of this dataset
meta_df = meta_ds.data
meta_df.head()

In [None]:
# Optionally filter down to the commercial-use and non-commercial-use subsets.
# First, look at how many papers there are of each file type.
meta_df.file_type.value_counts()

In [None]:
meta_df = meta_df[meta_df.file_type.isin(['comm_use_subset', 'noncomm_use_subset'])].copy()

## Basics on the dataset

Each entry in the `path` column of the metadata dataframe points to a paper stored as a JSON file.
Once loaded, each paper is a dict that includes the following keys:
* `paper_id`
* `metadata`
* `abstract`
* `body_text`
* `bib_entries`
* `ref_entries`
* `back_matter`

where `paper_id` is the SHA hash from the metadata.

For example:

In [None]:
# Use positional indexing: after filtering, index label 0 may no longer exist
filename = paths['interim_data_path'] / ds_name / meta_df['path'].iloc[0]
with open(filename, 'rb') as fp:
    file = json.load(fp)
file.keys()
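
The `abstract` and `body_text` entries are typically lists of paragraph dicts, each carrying a `text` field. The sketch below assumes that layout (it has not been checked against this particular dataset snapshot) and uses a small hand-made dict instead of a real file. `join_paragraphs` is a hypothetical helper, not part of `src`.

```python
# Hand-made stand-in for one parsed paper; real files have more fields.
# ASSUMPTION: `abstract` and `body_text` are lists of {"text": ...} dicts.
example_paper = {
    "paper_id": "0000example",
    "abstract": [
        {"text": "First abstract paragraph."},
        {"text": "Second abstract paragraph."},
    ],
    "body_text": [
        {"text": "Introduction text.", "section": "Introduction"},
    ],
}


def join_paragraphs(paper, key):
    """Concatenate the `text` of every paragraph stored under `key`."""
    return " ".join(p["text"] for p in paper.get(key, []))


abstract_text = join_paragraphs(example_paper, "abstract")
print(abstract_text)
```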

## Embedding abstracts

In [None]:
abstracts = meta_df.abstract.dropna()

In [None]:
abstracts[:5]

In [None]:
len(abstracts)

Shorten abstracts for display

In [None]:
short_abstracts = [a[:140] for a in abstracts]

In [None]:
meta_df['abstract_length'] = meta_df.abstract.str.len()

In [None]:
hover_df = meta_df[meta_df.abstract_length > 0].reset_index()
hover_df['short_abstracts'] = short_abstracts

## Vectorize to get word-document matrix

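Use TF-IDF with L1 normalization, followed by Expectation-Maximization (EM), to get rid of the "averages", i.e. the part of each document's word distribution that is common to the whole corpus.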

### Step 1: Tfidf
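
The vectorizer cell in this step passes `tokenizer=spacy_tokenizer`, which needs to be a callable that turns a string into a list of tokens. Here is a minimal sketch of one way to build it from the scispaCy pipeline imported earlier. The filtering rules (dropping stop words, punctuation, whitespace and number-like tokens, then keeping lemmas) and the helper names `make_spacy_tokenizer` and `load_scispacy_tokenizer` are assumptions, not necessarily what the original notebook used.

```python
def make_spacy_tokenizer(nlp):
    """Wrap a spaCy pipeline `nlp` as a tokenizer callable for scikit-learn."""
    def spacy_tokenizer(text):
        # Keep lemmas of tokens that are not stop words, punctuation,
        # whitespace, or number-like (ASSUMED filtering rules).
        return [
            tok.lemma_
            for tok in nlp(text)
            if not (tok.is_stop or tok.is_punct or tok.is_space or tok.like_num)
        ]
    return spacy_tokenizer


def load_scispacy_tokenizer():
    """Load the biomedical model and return a tokenizer built on it."""
    # Imported here so that defining the helpers does not require scispaCy.
    import en_core_sci_sm
    from scispacy.custom_tokenizer import combined_rule_tokenizer

    nlp = en_core_sci_sm.load()
    nlp.tokenizer = combined_rule_tokenizer(nlp)
    return make_spacy_tokenizer(nlp)


# In the notebook (requires the `covid_nlp` environment):
# spacy_tokenizer = load_scispacy_tokenizer()
```

Loading the model once and closing over it avoids reloading the pipeline for every document.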

In [None]:
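%%time
# `spacy_tokenizer` must be defined before running this cell:
# a callable that maps a string to a list of tokens.
vectorizer = TfidfVectorizer(min_df=5, norm='l1', tokenizer=spacy_tokenizer)
word_doc_matrix = vectorizer.fit_transform(abstracts)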

### Step 2: EM

In [None]:
word_doc_matrix, weights = em_sparse(word_doc_matrix, prior_noise=5.0)

### Reduce dimension using UMAP and Hellinger distance
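
Hellinger distance compares probability distributions, which is why the rows were L1-normalized earlier: each document becomes a distribution over words. The sketch below uses the textbook definition on toy vectors to show how the metric behaves. It is an illustration only; UMAP's internal implementation of `metric='hellinger'` may differ in detail, and `l1_normalize` and `hellinger` are hypothetical helpers.

```python
import math


def l1_normalize(weights):
    """Scale non-negative weights so they sum to 1 (a probability vector)."""
    total = sum(weights)
    return [w / total for w in weights]


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    # Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint ones.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


# Toy word weights for three documents over a four-word vocabulary.
doc_a = l1_normalize([2, 1, 1, 0])
doc_b = l1_normalize([4, 2, 2, 0])   # same proportions as doc_a
doc_c = l1_normalize([0, 0, 0, 3])   # shares no words with doc_a

print(hellinger(doc_a, doc_b))  # same proportions, so the distance is (near) zero
print(hellinger(doc_a, doc_c))  # disjoint vocabularies give the maximum distance
```

Because only proportions matter, a long and a short document with the same word mix end up close together.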

In [None]:
%%time
embedding_2d = umap.UMAP(n_components=2, n_neighbors=10,
                         metric='hellinger',
                         random_state=42).fit(word_doc_matrix)

#### Cluster
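
HDBSCAN labels points it considers noise as `-1`, so that label appears in the cluster counts alongside the real clusters. A small sketch, using made-up labels rather than the notebook's data, of counting clusters and noise points separately:

```python
from collections import Counter

# Made-up labels in the style HDBSCAN returns: -1 marks noise points.
example_labels = [0, 0, 1, -1, 1, 2, -1, 0]

sizes = Counter(example_labels)
n_noise = sizes.pop(-1, 0)   # points not assigned to any cluster
n_clusters = len(sizes)      # real clusters only

print(f"{n_clusters} clusters, {n_noise} noise points")
```

The same applies to `hover_df.cluster`: `len(hover_df.cluster.value_counts())` includes the noise label whenever any point is labelled `-1`.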

In [None]:
%%time
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
# fit_predict returns the cluster labels (also stored in clusterer.labels_)
labels = clusterer.fit_predict(embedding_2d.embedding_)

In [None]:
hover_df['cluster'] = labels

In [None]:
hover_df.cluster.value_counts()

In [None]:
len(hover_df.cluster.value_counts())

In [None]:
hover_cols = ['title', 'file_type', 'short_abstracts', 'cluster']

In [None]:
f = umap.plot.interactive(embedding_2d, labels=hover_df['cluster'],
                          hover_data=hover_df.reset_index()[hover_cols]);
show(f)

<img src="../reports/figures/04-DocMAP-abstracts.png" alt="DocMAP embedding visualization" title="DocMAP embedding visualization" />

### Rank points based on distance to a representative point
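
`RankedPoints` comes from this repository's `src.utils`, and its internals are not shown here. As a rough illustration of the idea only, the sketch below assumes the representative point of a cluster is its centroid (the mean of its members) and ranks members by Euclidean distance to it. `rank_within_clusters` is a hypothetical helper, and the real class may choose the representative point differently.

```python
import math
from collections import defaultdict


def rank_within_clusters(points, labels):
    """Return each point's rank (0 = closest) by distance to its cluster centroid."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    ranks = [None] * len(points)
    for label, idxs in members.items():
        dim = len(points[idxs[0]])
        # ASSUMPTION: the representative point is the coordinate-wise mean.
        centroid = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        ordered = sorted(idxs, key=lambda i: math.dist(points[i], centroid))
        for rank, i in enumerate(ordered):
            ranks[i] = rank
    return ranks


# Toy 2-D embedding with two clusters.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 10.0), (10.0, 12.0), (10.0, 13.0)]
labels = [0, 0, 0, 1, 1, 1]
print(rank_within_clusters(points, labels))
```

Ranks are computed independently per cluster, so every cluster has its own rank-0 point.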

In [None]:
from src.utils import RankedPoints

In [None]:
examples = RankedPoints(embedding_2d.embedding_, clusterer, metric='euclidean')

In [None]:
examples.calculate_all_distances_to_center()
examples.get_all_cluster_rankings()

In [None]:
hover_df['rank_in_cluster'] = examples.embedding_df['rank_in_cluster']

In [None]:
top_5_papers = {}
grouped_by_cluster = hover_df.groupby('cluster')

for cluster_id, group in grouped_by_cluster:
    top_papers = group.sort_values('rank_in_cluster', ascending=True).head(5)
    # Build an HTML ordered list linking each paper title to its DOI URL
    items = ''.join(f'<li><a href="{r.doi}">{r.title}</a></li>' for _, r in top_papers.iterrows())
    top_5_papers[int(cluster_id)] = f'<ol>{items}</ol>'

In [None]:
top_5_papers[0]

<ol><li><a href="http://dx.doi.org/10.1155/2011/609465">Feline and Canine Coronaviruses: Common Genetic and Pathobiological Features</a></li><li><a href="http://dx.doi.org/10.3390/v11111068">Diagnosis of Feline Infectious Peritonitis: A Review of the Current Literature</a></li><li><a href="http://dx.doi.org/10.1186/s12917-017-1147-8">Sensitivity and specificity of a real-time reverse transcriptase polymerase chain reaction detecting feline coronavirus mutations in effusion and serum/plasma of cats to diagnose feline infectious peritonitis</a></li><li><a href="http://dx.doi.org/10.1038/srep20022">Experimental feline enteric coronavirus infection reveals an aberrant infection pattern and shedding of mutants with impaired infectivity in enterocyte cultures</a></li><li><a href="http://dx.doi.org/10.1292/jvms.19-0090">Feline coronavirus isolates from a part of Brazil: insights into molecular epidemiology and phylogeny inferred from the 7b gene</a></li></ol>

In [None]:
top_5_papers[1]

<ol><li><a href="http://dx.doi.org/10.3389/fphar.2016.00146">Protein Kinase C-δ Mediates Shedding of Angiotensin-Converting Enzyme 2 from Proximal Tubular Cells</a></li><li><a href="http://dx.doi.org/10.1371/journal.pone.0034747">Angiotensin Converting Enzyme (ACE) and ACE2 Bind Integrins and ACE2 Regulates Integrin Signalling</a></li><li><a href="http://dx.doi.org/10.1186/1471-2164-8-194">Identification and characterisation of the angiotensin converting enzyme-3 (ACE3) gene: a novel mammalian homologue of ACE</a></li><li><a href="http://dx.doi.org/10.3390/biom9120886">Novel Variants of Angiotensin Converting Enzyme-2 of Shorter Molecular Size to Target the Kidney Renin Angiotensin System</a></li><li><a href="http://dx.doi.org/10.1371/journal.pone.0150861">Anti-Inflammatory Action of Angiotensin 1-7 in Experimental Colitis</a></li></ol>

### This can now be used as input to an interactive plot

See https://github.com/gclen/covid19-kaggle for an example