In this notebook, we will define the basic data sources.

Unless you are changing how the data is fetched or processed, you don't need to run this notebook. Running it serializes the data source definition to the catalog. Once the `datasources.json` catalog file has been created, you won't need to run it again.
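
If you are unsure whether the catalog already holds this data source, a quick check along the following lines may help. This is only a sketch: the helper `datasource_in_catalog` is hypothetical (it is not part of Easydata), and it assumes that `datasources.json` is a JSON object keyed by data source name.

```python
import json
import pathlib


def datasource_in_catalog(catalog_dir, name, catalog_file="datasources.json"):
    """Return True if `name` is a key in the catalog file, False otherwise.

    Hypothetical helper; assumes the catalog is a JSON object keyed by
    data source name.
    """
    catalog = pathlib.Path(catalog_dir) / catalog_file
    if not catalog.exists():
        # No catalog yet, so nothing has been serialized.
        return False
    with open(catalog) as fh:
        return name in json.load(fh)
```

For example, `datasource_in_catalog(paths['catalog_path'], 'covid_nlp_20200319')` would tell you whether the rest of this notebook still needs to be run, provided the assumption about the catalog layout holds.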

In [None]:
# Basic utility functions
import logging
import pathlib
import os

In [None]:
# Easydata Imports
from src import paths
from src.log import logger

In [None]:
logger.setLevel(logging.DEBUG)

In [None]:
%load_ext autoreload
%autoreload 2

## Create the Data Source
This data source encapsulates the data at:
    https://pages.semanticscholar.org/coronavirus-research



In [None]:
from src.data import DataSource

In [None]:
extract_date = '20200319'
ds_name = f'covid_nlp_{extract_date}'
dsrc = DataSource(ds_name)

### Add Metadata

In [None]:
license_txt = '''COVID DATASET LICENSE AGREEMENT
By accessing, downloading or otherwise using any Journals, Articles, Metadata, Abstracts,
Full-Texts or any other content types provided in the COVID-19 Open Research Dataset (CORD-19)
Database (the “Data”), You expressly acknowledge and agree to the following:

* AI2 grants to You a worldwide, perpetual, non-exclusive, non-transferable
  license to use and make derivatives of the Data for text and data mining only.

* AI2 warrants that it has the right to make the Data available to You as
  provided for in and subject to this Agreement and in accordance with applicable law. 
  EXCEPT FOR THE LIMITED WARRANTY IN THIS SECTION, THE DATA IS PROVIDED “AS IS”, WITHOUT ANY
  WARRANTIES OF ANY KIND. 

* You agree to comply with all applicable local, state, national, and international laws
  and regulations with respect to AI2’s license and Your use of the Data.

* Data provided by AI2 is from copyrighted sources of the respective copyright holders.
  You are solely responsible for Your and Your users’ compliance with any copyright, patent
  or trademark restrictions and are referred to the copyright, patent or trademark notices
  appearing in the original sources, all of which are hereby incorporated by reference.
'''

In [None]:
readme_txt = '''COVID-19 Open Research Dataset (CORD-19)
Participate in the CORD-19 Challenge

Kaggle is hosting the COVID-19 Open Research Dataset Challenge, a
series of important questions designed to inspire the community
to use CORD-19 to find new insights about the COVID-19 pandemic
including the natural history, transmission, and diagnostics for
the virus, management measures at the human-animal interface,
lessons from previous epidemiological studies, and more.
Download CORD-19

By downloading this dataset you are agreeing to the Dataset
License. Specific licensing information for individual articles
in the dataset is available in the metadata file.

Additional licensing information is available on the PMC website,
medRxiv website and bioRxiv website.

Latest release contains papers up until 2020-03-13 with over
13,000 full text articles.

Download here:

    Commercial use subset (includes PMC content) -- 9000 papers, 186Mb
    Non-commercial use subset (includes PMC content) -- 1973 papers, 36Mb
    PMC custom license subset -- 1426 papers, 19Mb
    bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 803 papers, 13Mb
    Metadata file -- 47Mb
    Readme

Each paper is represented as a single JSON object. The schema is
available here.

Description:

The dataset contains all COVID-19 and coronavirus-related
research (e.g. SARS, MERS, etc.) from the following sources:

* PubMed's PMC open access corpus using this query
  (COVID-19 and coronavirus research)
* Additional COVID-19 research articles from a corpus
  maintained by the WHO
* bioRxiv and medRxiv pre-prints using the same query
  as PMC (COVID-19 and coronavirus research)

We also provide a comprehensive metadata file of 29,000
coronavirus and COVID-19 research articles with links to PubMed,
Microsoft Academic and the WHO COVID-19 database of
publications (includes articles without open access full text).

We recommend using metadata from the comprehensive file when
available, instead of parsed metadata in the dataset. Please note
the dataset may contain multiple entries for individual PMC IDs
in cases when supplementary materials are available.

This repository is linked to the WHO database of publications on
coronavirus disease and other resources, such as Microsoft
Academic Graph, PubMed, and Semantic Scholar. A coalition
including the Chan Zuckerberg Initiative, Georgetown University’s
Center for Security and Emerging Technology, Microsoft Research,
and the National Library of Medicine of the National Institutes
of Health came together to provide this service. We also thank
and acknowledge Unpaywall for providing open access license
information for portions of the dataset.

Citation:

When including CORD-19 data in a publication or redistribution,
please cite the dataset as follows:

In bibliography:

COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-13. 
Retrieved from https://pages.semanticscholar.org/coronavirus-research. 
Accessed YYYY-MM-DD. doi:10.5281/zenodo.3715506

In text:

(CORD-19, 2020)

The Allen Institute for AI and particularly the Semantic Scholar
team will continue to provide updates to this dataset as the
situation evolves and new research is released.

Contribute to CORD-19

To maximize impact and increase full text available to the global
research community, we are actively encouraging publishers to
make their research content openly available for AI projects like
this that benefit the common good. If you’re a publisher
interested in contributing to the CORD-19 corpus, please contact
partnerships@allenai.org.
'''

In [None]:
dsrc.add_metadata(contents=license_txt, kind='LICENSE')
dsrc.add_metadata(contents=readme_txt, kind='DESCR')


In [None]:
dsrc.add_url('https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/comm_use_subset.tar.gz',
             name='commercial use subset')

dsrc.add_url('https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/noncomm_use_subset.tar.gz',
             name='non-commercial use subset')
dsrc.add_url('https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/pmc_custom_license.tar.gz',
             name='PMC custom license')
dsrc.add_url('https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/biorxiv_medrxiv.tar.gz',
             name='bioRxiv and medRxiv')
dsrc.add_url('https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/all_sources_metadata_2020-03-13.csv',
             name='metadata file')
dsrc.add_url('https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/all_sources_metadata_2020-03-13.readme',
             name='readme')

In [None]:
dsrc.fetch()

## Transform to a DataSet

Use the `process_covid_metadata` function defined in `src.data.localdata`. It adds filename data to the dataframe created from the metadata file.
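
To illustrate the idea of attaching filename data to metadata rows, here is a minimal standard-library sketch. It is not the implementation of `process_covid_metadata`, which may work differently. The function `attach_filenames` is hypothetical, and it assumes that each metadata row has a `sha` field and that each extracted paper is stored as `<sha>.json` somewhere under the data directory.

```python
import pathlib


def attach_filenames(rows, data_dir):
    """Return copies of `rows` with a 'filename' key added.

    Hypothetical helper. Assumes each row is a dict with a 'sha' field and
    that papers are stored as '<sha>.json' somewhere under `data_dir`.
    Rows without a matching file get filename=None.
    """
    # Index every JSON file under data_dir by its stem (the assumed sha).
    index = {p.stem: str(p) for p in pathlib.Path(data_dir).rglob("*.json")}
    # Build new dicts so the caller's rows are left unmodified.
    return [dict(row, filename=index.get(row.get("sha", ""))) for row in rows]
```

The rows could come from `csv.DictReader` over the metadata file. Building the index once keeps the lookup cheap even when the subsets contain thousands of papers.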

In [None]:
from src.data.localdata import process_covid_metadata

In [None]:
dsrc.parse_function = process_covid_metadata

In [None]:
ds = dsrc.process()

## Save the datasource and dataset

In [None]:
from src import workflow

In [None]:
workflow.add_datasource(dsrc)

In [None]:
#workflow.add_datasource(dsrc)
workflow.available_datasources()

In [None]:
c = paths['catalog_path']

In [None]:
!cat $c/datasources.json

In [None]:
workflow.available_datasources()

Use a dummy transformer to turn this datasource into a dataset.

In [None]:
#workflow.add_transformer(from_datasource='covid_nlp_20200319', output_dataset='covid_nlp_20200319')

In [None]:
workflow.make_data()

In [None]:
workflow.available_datasets()