If you haven't already, set up your environment and datasets by following the instructions in the README. The steps should look something like:
* `make create_environment`
* `conda activate covid_nlp`
* `make update_environment`
* `make data`

Many of the packages you may want to use have already been added to the `covid_nlp` environment.

In [1]:
# Quick cell to make the Jupyter notebook use the full screen width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import json
import pandas as pd

In [4]:
from src import paths
from src.data import Dataset
from src import workflow

## Load up the dataset

The metadata has been augmented with the location of each file relative to the `interim_data_path`.

In [5]:
#paths['interim_data_path']

In [6]:
workflow.available_datasets()

['covid_nlp_20200319']

In [7]:
ds_name = 'covid_nlp_20200319'

In [8]:
meta_ds = Dataset.load(ds_name)

In [9]:
meta_df = meta_ds.data
meta_df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,file_type,path
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,CZI,Angiotensin-converting enzyme 2 (ACE2) as a SA...,10.1007/s00134-020-05985-9,,32125455.0,cc-by-nc,,2020,"Zhang, Haibo; Penninger, Josef M.; Li, Yimin; ...",Intensive Care Med,2002765492.0,#3252,True,noncomm_use_subset,noncomm_use_subset/c630ebcdf30652f0422c3ec12a0...
1,53eccda7977a31e3d0f565c884da036b1e85438e,CZI,Comparative genetic analysis of the novel coro...,10.1038/s41421-020-0147-1,,,cc-by,,2020,"Cao, Yanan; Li, Lin; Feng, Zhimin; Wan, Shengq...",Cell Discovery,3003430844.0,#1861,True,comm_use_subset,comm_use_subset/53eccda7977a31e3d0f565c884da03...
2,53eccda7977a31e3d0f565c884da036b1e85438e,PMC,Comparative genetic analysis of the novel coro...,http://dx.doi.org/10.1038/s41421-020-0147-1,PMC7040011,32133153.0,CC BY,,2020 Feb 24,"['Cao, Yanan', 'Li, Lin', 'Feng, Zhimin', 'Wan...",Cell Discov,,,True,comm_use_subset,comm_use_subset/53eccda7977a31e3d0f565c884da03...
3,210a892deb1c61577f6fba58505fd65356ce6636,CZI,Incubation Period and Other Epidemiological Ch...,10.3390/jcm9020538,,,cc-by,The geographic spread of 2019 novel coronaviru...,2020,"Linton, M. Natalie; Kobayashi, Tetsuro; Yang, ...",Journal of Clinical Medicine,3006065484.0,#1043,True,comm_use_subset,comm_use_subset/210a892deb1c61577f6fba58505fd6...
4,e3b40cc8e0e137c416b4a2273a4dca94ae8178cc,CZI,Characteristics of and Public Health Responses...,10.3390/jcm9020575,,32093211.0,cc-by,"In December 2019, cases of unidentified pneumo...",2020,"Deng, Sheng-Qun; Peng, Hong-Juan",J Clin Med,177663115.0,#1999,True,comm_use_subset,comm_use_subset/e3b40cc8e0e137c416b4a2273a4dca...


## Basics on the dataset

The papers are in `json` format and include:
* `paper_id`
* `metadata`
* `abstract`
* `body_text`
* `bib_entries`
* `ref_entries`
* `back_matter`

For example:

In [10]:
filename = paths['interim_data_path'] / ds_name / meta_df['path'][0]
# Use a context manager so the file handle is closed after parsing
with open(filename, 'rb') as f:
    file = json.load(f)
file.keys()

dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])
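
To get from these keys to usable text, the paragraphs in `body_text` can be grouped by section. The following is a minimal sketch on a toy dictionary rather than a real paper. It assumes each `body_text` entry is a dict with `section` and `text` keys, which this notebook does not show; the helper name `paragraphs_by_section` is hypothetical.

```python
from collections import defaultdict

# Toy stand-in for a parsed paper. The 'section'/'text' layout of
# body_text is an assumption, not something shown in this notebook.
paper = {
    "paper_id": "abc123",
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Introduction", "text": "Second paragraph."},
        {"section": "Methods", "text": "How it was done."},
    ],
}


def paragraphs_by_section(paper):
    """Group paragraph texts under their section heading, keeping order."""
    grouped = defaultdict(list)
    for para in paper.get("body_text", []):
        grouped[para.get("section", "")].append(para["text"])
    return dict(grouped)


sections = paragraphs_by_section(paper)
print(list(sections))
```

If the real files follow the same layout, the same helper should work on the `file` dictionary loaded above.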

## Embedding prep

### If you want to start with abstracts

As in https://gitlab.com/ar2a/covid19-kaggle/-/blob/master/notebooks/gpclend_embed_abstracts.ipynb (you will be able to pick up that notebook from **Point ranking (will be used later)**)

In [11]:
abstracts = meta_df.abstract.dropna()

In [12]:
abstracts[:5]

3     The geographic spread of 2019 novel coronaviru...
4     In December 2019, cases of unidentified pneumo...
6     The basic reproduction number of an infectious...
7     The initial cluster of severe pneumonia cases ...
10    Cruise ships carry a large number of people in...
Name: abstract, dtype: object

In [13]:
len(abstracts)

26909

### If you want to split up documents by their sections

As in
* https://gitlab.com/ar2a/covid19-kaggle/-/blob/master/notebooks/mpfrane-scispacy-tokenization.ipynb (the processing below will take care of everything up to: **Apply scispacy tokenization**) 
* https://gitlab.com/ar2a/covid19-kaggle/-/blob/master/notebooks/top2vec_corona_dangel.ipynb (the processing below will take care of everything up to: **Train Top2Vec Model**)

That is, turn each section into its own row and treat each section as a separate document for embedding.

In [14]:
from src.data.localdata import create_section_df

In [15]:
# filter down however you like based on the metadata
df = meta_df[:100]
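
Taking the first 100 rows is only one option. As a rough sketch of filtering on the metadata, the example below builds a toy frame using three column names from `meta_df` (`has_full_text`, `file_type`, `abstract`) and keeps full-text papers from one subset that also have an abstract. The toy values and the variable names `toy_meta`, `mask`, and `subset` are illustrative assumptions.

```python
import pandas as pd

# Toy metadata with a few of the columns seen in meta_df.head();
# the values are made up for illustration.
toy_meta = pd.DataFrame({
    "sha": ["a1", "b2", "c3", "d4"],
    "has_full_text": [True, True, False, True],
    "file_type": ["comm_use_subset", "noncomm_use_subset", None, "comm_use_subset"],
    "abstract": ["text", None, "text", "text"],
})

# Combine boolean conditions with & (each comparison in parentheses).
mask = (
    toy_meta["has_full_text"]
    & (toy_meta["file_type"] == "comm_use_subset")
    & toy_meta["abstract"].notna()
)
subset = toy_meta[mask]
print(subset["sha"].tolist())
```

The same pattern should carry over to `meta_df`, with `df = meta_df[mask]` in place of the slice.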

In [16]:
help(create_section_df)

Help on function create_section_df in module src.data.localdata:

create_section_df(df, unpack_dir=None, extract_dir='covid_nlp_20200319', min_tokens=200)
    Given a dataframe df of the form of the covid metadata augmented dataset (e.g. covid_nlp_20200319)
    
    Created a dataframe where each row is a section of a paper from the dataframe (for which a
    full-text version exists)
    
    Parameters
    ----------
    df:
        a metadata dataframe (.data from a metadata datasource)
    extract_dir:
        The name of the directory the files have been unpacked into
    min_tokens:
        Require sections to have at least min_tokens tokens to be included
    unpack_dir:
        The interim data directory. If None, it will use the
        interim_data_path in paths. Only pass this if you want to override the default.
    
    Returns
    -------
    section dataframe with columns: ['paper_id', 'title', 'abstract', 'section', 'text', 'token_counts']
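
To make the idea of one row per section and the `min_tokens` cutoff concrete, here is a simplified sketch. It is not the implementation of `create_section_df`: the helper `section_rows` is hypothetical, the `section`/`text` layout of the paragraphs is assumed, and tokens are counted by splitting on whitespace, which may differ from what the real function does.

```python
def section_rows(paper_id, body_text, min_tokens=200):
    """Merge paragraphs per section and keep sections with enough tokens."""
    merged = {}
    for para in body_text:
        merged.setdefault(para["section"], []).append(para["text"])

    rows = []
    for section, texts in merged.items():
        text = " ".join(texts)
        n_tokens = len(text.split())  # crude whitespace tokenization (assumption)
        if n_tokens >= min_tokens:
            rows.append({
                "paper_id": paper_id,
                "section": section,
                "text": text,
                "token_counts": n_tokens,
            })
    return rows


# Toy paragraphs; a low min_tokens keeps the example small.
body = [
    {"section": "Intro", "text": "one two three"},
    {"section": "Intro", "text": "four five"},
    {"section": "Methods", "text": "six"},
]
rows = section_rows("abc123", body, min_tokens=3)
print(rows)
```

A list of row dictionaries like this can be passed to `pd.DataFrame(rows)` to get a table with one section per row.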



In [17]:
%%time
parsed_df = create_section_df(df)

CPU times: user 11.8 s, sys: 300 ms, total: 12.1 s
Wall time: 13.7 s


In [18]:
parsed_df.head()

Unnamed: 0,paper_id,title,abstract,section,text,token_counts
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,209
1,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,338
2,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,421
3,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,SARS-CoV-2 and severe acute respiratory syndro...,SARS-CoV-2 has been sequenced [3] . A phylogen...,421
4,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,Angiotensin-converting enzyme 2 (ACE2) as a SA...,,Enrichment distribution of ACE2 receptor in hu...,A key question is why the lung appears to be t...,292
