If you haven't yet, set up your environment and datasets by following the instructions in the README. The steps should look something like:
* `make create_environment`
* `conda activate covid_nlp`
* `make update_environment`
* `make data`

Several common packages that you may want to use (e.g. UMAP, HDBSCAN, enstop, sklearn) have already been added to the `covid_nlp` environment via `environment.yml`. To add more, edit that file and run:
  `make update_environment`

In [1]:
# Quick cell to make jupyter notebook use the full screen width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
# Automatically pick up code changes in the `src` module
%load_ext autoreload
%autoreload 2

In [3]:
import json
import pandas as pd

In [4]:
# Useful imports from easydata
from src import paths
from src.data import Dataset
from src import workflow

## Load up the dataset

The metadata has been augmented with a `path` column that gives each file's location relative to the dataset's directory under `paths["interim_data_path"]`.

In [5]:
#paths['interim_data_path']

In [6]:
workflow.available_datasets()

['covid_nlp_20200319']

If the previous cell returned an empty list, go back and re-run `make data` as described at the top of this notebook.

In [7]:
ds_name = 'covid_nlp_20200319'

In [8]:
# Load the dataset
meta_ds = Dataset.load(ds_name)

In [9]:
print(meta_ds.DESCR[:457])

COVID-19 Open Research Dataset (CORD-19)
Participate in the CORD-19 Challenge

Kaggle is hosting the COVID-19 Open Research Dataset Challenge, a
series of important questions designed to inspire the community
to use CORD-19 to find new insights about the COVID-19 pandemic
including the natural history, transmission, and diagnostics for
the virus, management measures at the human-animal interface,
lessons from previous epidemiological studies, and more.



In [10]:
# The processed dataframe is the `data` attribute of this dataset
meta_df = meta_ds.data
meta_df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,file_type,path
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,CZI,Angiotensin-converting enzyme 2 (ACE2) as a SA...,10.1007/s00134-020-05985-9,,32125455.0,cc-by-nc,,2020,"Zhang, Haibo; Penninger, Josef M.; Li, Yimin; ...",Intensive Care Med,2002765492.0,#3252,True,noncomm_use_subset,noncomm_use_subset/c630ebcdf30652f0422c3ec12a0...
1,53eccda7977a31e3d0f565c884da036b1e85438e,CZI,Comparative genetic analysis of the novel coro...,10.1038/s41421-020-0147-1,,,cc-by,,2020,"Cao, Yanan; Li, Lin; Feng, Zhimin; Wan, Shengq...",Cell Discovery,3003430844.0,#1861,True,comm_use_subset,comm_use_subset/53eccda7977a31e3d0f565c884da03...
2,53eccda7977a31e3d0f565c884da036b1e85438e,PMC,Comparative genetic analysis of the novel coro...,http://dx.doi.org/10.1038/s41421-020-0147-1,PMC7040011,32133153.0,CC BY,,2020 Feb 24,"['Cao, Yanan', 'Li, Lin', 'Feng, Zhimin', 'Wan...",Cell Discov,,,True,comm_use_subset,comm_use_subset/53eccda7977a31e3d0f565c884da03...
3,210a892deb1c61577f6fba58505fd65356ce6636,CZI,Incubation Period and Other Epidemiological Ch...,10.3390/jcm9020538,,,cc-by,The geographic spread of 2019 novel coronaviru...,2020,"Linton, M. Natalie; Kobayashi, Tetsuro; Yang, ...",Journal of Clinical Medicine,3006065484.0,#1043,True,comm_use_subset,comm_use_subset/210a892deb1c61577f6fba58505fd6...
4,e3b40cc8e0e137c416b4a2273a4dca94ae8178cc,CZI,Characteristics of and Public Health Responses...,10.3390/jcm9020575,,32093211.0,cc-by,"In December 2019, cases of unidentified pneumo...",2020,"Deng, Sheng-Qun; Peng, Hong-Juan",J Clin Med,177663115.0,#1999,True,comm_use_subset,comm_use_subset/e3b40cc8e0e137c416b4a2273a4dca...


## Basics on the dataset

Each file listed in the `path` column of the metadata dataframe is a paper in JSON format. Once loaded, it is a dict with the following keys:
* `paper_id`
* `metadata`
* `abstract`
* `body_text`
* `bib_entries`
* `ref_entries`
* `back_matter`

where `paper_id` is the `sha` hash from the metadata.

For example:

In [11]:
filename = paths['interim_data_path'] / ds_name / meta_df['path'][0]
# Use a context manager so the file handle is closed after reading
with open(filename, 'rb') as f:
    paper = json.load(f)
paper.keys()

dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])
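As a rough illustration of working with one of these dicts, the sketch below reads a paper from disk and joins its body paragraphs into a single string. It rests on assumptions that this notebook does not confirm: that each `body_text` entry is a dict with `section` and `text` fields, and that `metadata` holds a `title`. The helper names `load_paper` and `body_as_text` and the toy paper are hypothetical, made up so the example runs without the dataset.

```python
import json
import tempfile
from pathlib import Path


def load_paper(path):
    """Read one paper JSON file and return it as a dict."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def body_as_text(paper):
    """Join the `text` of every `body_text` paragraph into one string."""
    # Assumption: each body_text entry is a dict with a "text" field.
    return "\n\n".join(par["text"] for par in paper.get("body_text", []))


# Toy paper (invented for illustration) with the seven keys listed above.
toy_paper = {
    "paper_id": "0000",
    "metadata": {"title": "A toy paper"},
    "abstract": [],
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Methods", "text": "Second paragraph."},
    ],
    "bib_entries": {},
    "ref_entries": {},
    "back_matter": [],
}

# Write the toy paper to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "0000.json"
    toy_path.write_text(json.dumps(toy_paper), encoding="utf-8")
    paper = load_paper(toy_path)

print(paper["metadata"]["title"])
print(body_as_text(paper))
```

With the real data, the same `load_paper` call could be pointed at the `filename` built in the cell above, provided the files follow the assumed layout.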

## Pre-processing data for various embeddings

### Example 1: If you want to start with abstracts...
For example, to reproduce the analysis in
https://gitlab.com/ar2a/covid19-kaggle/-/blob/master/notebooks/gpclend_embed_abstracts.ipynb (picking up that notebook from **Point ranking (will be used later)**), do this:

In [12]:
abstracts = meta_df.abstract.dropna()

In [13]:
abstracts[:5]

3     The geographic spread of 2019 novel coronaviru...
4     In December 2019, cases of unidentified pneumo...
6     The basic reproduction number of an infectious...
7     The initial cluster of severe pneumonia cases ...
10    Cruise ships carry a large number of people in...
Name: abstract, dtype: object

In [14]:
len(abstracts)

26909

### Example 2: If you want to split up documents by their sections...

To produce analyses similar to the following, turn each section into its own row and treat sections as their own documents for the purposes of embedding:
* https://gitlab.com/ar2a/covid19-kaggle/-/blob/master/notebooks/mpfrane-scispacy-tokenization.ipynb (the processing below takes care of everything up to **Apply scispacy tokenization**)
* https://gitlab.com/ar2a/covid19-kaggle/-/blob/master/notebooks/top2vec_corona_dangel.ipynb (the processing below takes care of everything up to **Train Top2Vec Model**)

Here we've written a custom processing function (a _data transformer_) called `create_section_df` that will take in the current dataset and produce a new, transformed dataset. 

In [15]:
from src.data.localdata import create_section_df

In [16]:
help(create_section_df)

Help on function create_section_df in module src.data.localdata:

create_section_df(df, unpack_dir=None, extract_dir='covid_nlp_20200319', min_tokens=200)
    Dataset Transformer: extract individual sections from papers, returning one section per row
    
    Given a dataframe, df, formatted like the covid metadata augmented
    dataset (e.g. covid_nlp_20200319), created a new dataframe where
    each row is a section of a paper contained in `df` (for which a
    full-text version exists).
    
    Parameters
    ----------
    df:
        a metadata dataframe (.data from a metadata datasource)
        with at least 'has_full_text' and 'path' fields.
    extract_dir:
        The name of the directory (relative to `unpack_dir`) the files have been unpacked into.
    min_tokens:
        Require sections to have at least `min_tokens` tokens to be included in this dataframe
    unpack_dir:
        The interim data directory. If None, it will use the
        paths['interim_data_path']. Only

In [17]:
# Filter however you like based on the metadata. Here we just demo with the first 100 entries.
df = meta_df[:100]

In [18]:
%%time
parsed_df = create_section_df(df)

CPU times: user 8.75 s, sys: 151 ms, total: 8.9 s
Wall time: 9.01 s


In [19]:
parsed_df.head()

Unnamed: 0,paper_id,title,abstract,section,text,token_counts
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,209
1,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,338
2,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,421
3,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,421
4,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,Enrichment distribution of ACE2 receptor in hu...,A key question is why the lung appears to be t...,292
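To make the one-section-per-row idea and the `min_tokens` cutoff concrete, here is a minimal sketch of the general technique. It is not the implementation of `create_section_df`, whose source is not shown here. The assumptions: each `body_text` entry is a dict with `section` and `text` fields, tokens are counted by splitting on whitespace, and paragraphs that share a section heading are joined into one row. The function name `section_rows` and the toy paper are hypothetical.

```python
def section_rows(paper, min_tokens=200):
    """Return one dict per section that has at least `min_tokens` tokens."""
    # Group paragraph texts by section heading. Dicts keep insertion order,
    # so sections come out in the order they first appear in the paper.
    grouped = {}
    for par in paper.get("body_text", []):
        grouped.setdefault(par["section"], []).append(par["text"])

    rows = []
    for section, paragraphs in grouped.items():
        text = " ".join(paragraphs)
        # Assumption: whitespace splitting stands in for a real tokenizer.
        n_tokens = len(text.split())
        if n_tokens >= min_tokens:
            rows.append(
                {
                    "paper_id": paper["paper_id"],
                    "section": section,
                    "text": text,
                    "token_counts": n_tokens,
                }
            )
    return rows


# Toy paper (invented for illustration); a tiny cutoff keeps the demo short.
toy_paper = {
    "paper_id": "0000",
    "body_text": [
        {"section": "Introduction", "text": "one two three"},
        {"section": "Introduction", "text": "four five"},
        {"section": "Methods", "text": "short"},
    ],
}

rows = section_rows(toy_paper, min_tokens=3)
for row in rows:
    print(row["section"], row["token_counts"])
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)` to get a dataframe with one section per row.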
