# 1. Data Fetching

In this notebook we download each of the datasets used for the Publications track of the Hércules challenge. This track uses the following datasets:
* __Agriculture__: This dataset is composed of a series of articles available in [Europe PMC](https://europepmc.org) and related to the agriculture field.

If the datasets have already been downloaded and placed in their corresponding directories ("_data/agriculture_" for the Agriculture dataset), this notebook can be skipped.

## Setup
We are going to run our init script, which will set up the module import paths and the logging system:

In [1]:
%run __init__.py

## Downloading the Agriculture dataset

In this section we are going to fetch the Agriculture dataset using the Europe PMC API.

### Getting the article IDs to retrieve
A text file with the article IDs that belong to the dataset is available at *data/agriculture/pmc_ids.txt*. In the following cells we define a simple function to retrieve those IDs from the file:

In [2]:
AGRICULTURE_DATASET_DIR = os.path.join(DATA_DIR, 'agriculture')
article_ids_file = os.path.join(AGRICULTURE_DATASET_DIR, 'pmc_ids.txt')

def load_ids(base_file):
    """Return the article IDs stored in base_file, one ID per line."""
    with open(base_file, 'r') as f:
        ids = f.read().splitlines()
    return ids


In [3]:
article_ids = load_ids(article_ids_file)
len(article_ids)

126

In [4]:
article_ids[0]

'PMC3310815'
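
The loader above returns every line of the file unchanged, so a trailing blank line or stray whitespace would end up in the list as an ID. A slightly more defensive variant, shown here only as a sketch, strips each line and drops empty ones. The helper name `load_clean_ids` and the check for a `PMC` prefix followed by digits are assumptions made for illustration, based on the shape of the ID shown above:

```python
import re

# Assumed ID shape: the literal prefix "PMC" followed by digits (e.g. PMC3310815).
PMC_ID_PATTERN = re.compile(r"^PMC\d+$")

def load_clean_ids(base_file):
    """Read one ID per line, ignoring blank lines and surrounding whitespace."""
    with open(base_file, "r", encoding="utf-8") as f:
        ids = [line.strip() for line in f if line.strip()]
    # Fail early if any remaining line does not look like a PMC ID.
    invalid = [i for i in ids if not PMC_ID_PATTERN.match(i)]
    if invalid:
        raise ValueError(f"Unexpected ID format: {invalid[:5]}")
    return ids
```

For a clean file this should return the same list as `load_ids`.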

### Fetching the articles

Now that we know which articles we need to download, we will be making use of the Europe PMC REST API to retrieve the full text of each article as XML:

In [5]:
BMC_BASE_API = 'https://www.ebi.ac.uk/europepmc/webservices/rest'

In [6]:
import requests

def load_pmc_data(ids_to_download):
    """Map each article ID to the raw bytes returned by its fullTextXML endpoint."""
    return {pmc_id: requests.get(f"{BMC_BASE_API}/{pmc_id}/fullTextXML").content
            for pmc_id in ids_to_download}

pmc_dataset_xml = load_pmc_data(article_ids)
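
The comprehension above stores whatever the server returns, which would include an error body for any ID whose full text cannot be retrieved. As a hedged sketch, and not the method used in this notebook, the variant below checks the HTTP status and keeps successes and failures apart. The function name `fetch_full_texts`, the `http_get` parameter (which allows a stand-in to be injected for testing) and the 30 second timeout are all assumptions introduced for illustration:

```python
EUROPE_PMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def fetch_full_texts(ids_to_download, http_get=None, timeout=30):
    """Return (downloaded, failed): XML bytes per ID, and HTTP status per failed ID."""
    if http_get is None:
        # Imported lazily so that a stand-in callable can be used without requests.
        import requests
        http_get = requests.get
    downloaded, failed = {}, {}
    for pmc_id in ids_to_download:
        url = f"{EUROPE_PMC_REST}/{pmc_id}/fullTextXML"
        response = http_get(url, timeout=timeout)
        if response.status_code == 200:
            downloaded[pmc_id] = response.content
        else:
            # Keep the status code so missing articles can be reviewed later.
            failed[pmc_id] = response.status_code
    return downloaded, failed
```

Calling `fetch_full_texts(article_ids)` would give a dictionary comparable to `pmc_dataset_xml`, plus a record of the IDs that could not be downloaded.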

Finally, we save each XML file to our _data/agriculture_ directory. These files will be loaded in the next notebook:

In [7]:
for key, val in pmc_dataset_xml.items():
    file_path = os.path.join(AGRICULTURE_DATASET_DIR, f"{key}.xml")
    with open(file_path, "wb") as f:
        f.write(val)
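
Since the next notebook loads these files, it can be useful to confirm that each one is well-formed XML before moving on. The following is a minimal sketch using only the standard library; the helper name `find_malformed_xml` is hypothetical and not part of the original notebook:

```python
import glob
import os
import xml.etree.ElementTree as ET

def find_malformed_xml(directory):
    """Return the names of .xml files in `directory` that fail to parse."""
    malformed = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        try:
            ET.parse(path)
        except ET.ParseError:
            # Not well-formed XML (for example, an empty or truncated download).
            malformed.append(os.path.basename(path))
    return malformed
```

Running `find_malformed_xml(AGRICULTURE_DATASET_DIR)` would list any saved files that need to be downloaded again.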