# 1. Data Fetching

In this notebook we will download each of the datasets used for the Publications track of the Hércules challenge. This track makes use of the following datasets:
* __COVID-19__: List of articles included in the COVID-19 Open Research Data Challenge from Kaggle, available through the [following link](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).
* __Agriculture__: This dataset is composed of a series of articles available in [Europe PMC](https://europepmc.org) and related to the agriculture field.

If the datasets have already been downloaded and placed in their corresponding directories ("_data/agriculture_" for the Agriculture dataset and "_data/cord19_" for the COVID-19 one), this notebook can be skipped.
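
A quick way to decide whether this notebook can be skipped is to check both directories for downloaded files. The sketch below is only an illustration: the helper name `has_files` is hypothetical, the paths assume the notebook is run from the directory that contains `data`, and the Agriculture check looks for `.xml` files because that folder also holds the `pmc_ids.txt` list before anything is downloaded. For CORD-19, any entry in the folder is assumed to mean the dataset was unzipped.

```python
import os

def has_files(directory, suffix=""):
    """Return True if `directory` exists and holds an entry ending in `suffix`."""
    if not os.path.isdir(directory):
        return False
    return any(name.endswith(suffix) for name in os.listdir(directory))

# Directory names come from the text above; the suffixes are assumptions.
checks = {
    os.path.join("data", "agriculture"): ".xml",
    os.path.join("data", "cord19"): "",
}
for path, suffix in checks.items():
    print(path, "found" if has_files(path, suffix) else "missing")
```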

## Setup
We are going to run our init script, which will set up the module import paths and the logging system:

In [9]:
%run __init__.py

## Downloading the dataset from Kaggle

Since the COVID-19 dataset belongs to a Kaggle competition, we will need to authenticate ourselves before we can download it. A Kaggle account is required to execute the following cells. Alternatively, you can unzip the dataset inside the *data/cord19* folder and skip to the next notebook (Parsing the data).

In the following cell, a prompt will ask for your Kaggle username and the API key associated with your account. More information about how to obtain the API key is available at the [following link](https://www.kaggle.com/docs/api):

In [2]:
import getpass
import os

try:
    from secret import KAGGLE_USER, KAGGLE_KEY
except ModuleNotFoundError:
    KAGGLE_USER = input("Please enter your kaggle username: ")
    KAGGLE_KEY = getpass.getpass("Please enter your kaggle API key: ")

os.environ['KAGGLE_USERNAME'] = KAGGLE_USER
os.environ['KAGGLE_KEY'] = KAGGLE_KEY
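
As an alternative to the environment variables set above, the Kaggle client can also read credentials from a `kaggle.json` file, by default under `~/.kaggle`. The sketch below only reports which of the two sources is available. The helper name `kaggle_credentials_source` is hypothetical and is not part of the `kaggle` package, and the default configuration directory is assumed not to have been overridden.

```python
import os

def kaggle_credentials_source(environ=None, config_dir=None):
    """Return 'environment', 'file' or None depending on where credentials are found."""
    environ = os.environ if environ is None else environ
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return "environment"
    # Default location of kaggle.json (assumption: not overridden by the user).
    config_dir = config_dir or os.path.join(os.path.expanduser("~"), ".kaggle")
    if os.path.isfile(os.path.join(config_dir, "kaggle.json")):
        return "file"
    return None

print(kaggle_credentials_source())
```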


Now that we have entered the Kaggle credentials, we will proceed to download and unzip the dataset in our *data/cord19* folder. This operation may take a few minutes:

In [None]:
import kaggle


kaggle.api.dataset_download_files(CORD_DATASET_NAME, path=CORD_DATASET_DIR, unzip=True)

## Downloading the Agriculture dataset

In this section we are going to fetch the Agriculture dataset using the Europe PMC API.

### Getting the article IDs to retrieve
A text file with the article IDs that belong to the dataset is available at *data/agriculture/pmc_ids.txt*. In the following cells we define a simple function to retrieve those IDs from the file:

In [2]:
AGRICULTURE_DATASET_DIR = os.path.join(DATA_DIR, 'agriculture')
article_ids_file = os.path.join(AGRICULTURE_DATASET_DIR, 'pmc_ids.txt')

def load_ids(base_file):
    # Read the file and return one article ID per line.
    with open(base_file, 'r') as f:
        ids = f.read().splitlines()
    return ids


In [3]:
article_ids = load_ids(article_ids_file)
len(article_ids)

127

In [4]:
article_ids[0]

'PMC3310815'
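
If the IDs file ever contained blank lines or stray whitespace, `load_ids` would pass those entries on to the download step. A slightly more defensive variant is sketched below. The function name `load_valid_ids` and the assumption that every ID is `PMC` followed by digits (as in `PMC3310815`) are illustrative and not taken from the original notebook.

```python
import re

PMC_ID_PATTERN = re.compile(r"^PMC\d+$")  # assumed ID format

def load_valid_ids(base_file):
    """Read one ID per line, skipping blanks and rejecting malformed entries."""
    ids = []
    with open(base_file, "r") as f:
        for line in f:
            candidate = line.strip()
            if not candidate:
                continue  # ignore empty lines
            if not PMC_ID_PATTERN.match(candidate):
                raise ValueError(f"Unexpected article ID: {candidate!r}")
            ids.append(candidate)
    return ids
```

Raising on a malformed entry is a design choice made for this sketch: it surfaces a corrupted IDs file immediately instead of producing failed requests later on.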

### Fetching the articles

Now that we know which articles we need to download, we will make use of the Europe PMC REST API to fetch the full-text XML of each one:

In [5]:
BMC_BASE_API = 'https://www.ebi.ac.uk/europepmc/webservices/rest'

In [6]:
import requests

def load_pmc_data(ids_to_download):
    # Map each article ID to the raw bytes of its full-text XML response.
    return {pmc_id: requests.get(f"{BMC_BASE_API}/{pmc_id}/fullTextXML").content
            for pmc_id in ids_to_download}

pmc_dataset_xml = load_pmc_data(article_ids)
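
The comprehension above stores whatever the server returns, including error pages. A minimal sketch of a more cautious variant follows, assuming that any non-200 status code means the article could not be retrieved. The names `BASE_API`, `fetch_full_text` and `fetch_all` are hypothetical, and the `session` parameter accepts any object with a `requests`-style `get` method.

```python
BASE_API = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def fetch_full_text(pmc_id, session, timeout=30):
    """Return the full-text XML bytes for `pmc_id`, or None on a non-200 response."""
    response = session.get(f"{BASE_API}/{pmc_id}/fullTextXML", timeout=timeout)
    if response.status_code != 200:
        return None
    return response.content

def fetch_all(pmc_ids, session=None):
    """Download every article and return (documents, failed_ids)."""
    if session is None:
        # Imported lazily so the helpers can be exercised with a stub session.
        import requests
        session = requests.Session()
    documents, failed = {}, []
    for pmc_id in pmc_ids:
        content = fetch_full_text(pmc_id, session)
        if content is None:
            failed.append(pmc_id)
        else:
            documents[pmc_id] = content
    return documents, failed
```

Reusing a single session keeps the connection open across requests, and the list of failed IDs makes it easy to retry or report missing articles.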

Since one of the articles ('PMC6472519') is not available for reuse, we are going to remove it from the whole track in order to comply with its license. More information about this issue can be found at https://github.com/weso-edma/hercules-challenge-publications/issues/3.

In [7]:
del pmc_dataset_xml['PMC6472519']

Finally, we will save each XML file to our _data/agriculture_ directory. These files will be loaded in the next notebook:

In [8]:
for key, val in pmc_dataset_xml.items():
    file_path = os.path.join(AGRICULTURE_DATASET_DIR, f"{key}.xml")
    with open(file_path, "wb") as f:
        f.write(val)
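
Before moving on to the next notebook, it can be useful to confirm that every saved file is well-formed XML. The following sketch uses the standard library parser; the function name `find_invalid_xml` is hypothetical.

```python
import os
import xml.etree.ElementTree as ET

def find_invalid_xml(directory):
    """Return the sorted names of .xml files in `directory` that fail to parse."""
    invalid = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".xml"):
            continue  # skip pmc_ids.txt and any other non-XML file
        try:
            ET.parse(os.path.join(directory, name))
        except ET.ParseError:
            invalid.append(name)
    return invalid
```

Calling `find_invalid_xml(AGRICULTURE_DATASET_DIR)` after the save loop should return an empty list if every download produced a parseable document.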
