# 1. Data exploration and cleaning
In this notebook we begin the initial exploration and cleaning of the datasets for the Publications track of the Hércules challenge.

## Setup

In [2]:
import logging
import os
import sys

# set up module paths for imports
module_path = os.path.abspath(os.path.join('..'))
sys.path.append(module_path)

# start logging system and set logging level
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info("Starting logger")

INFO:root:Starting logger


In [3]:
DATA_DIR = os.path.join(module_path, 'data')

## Dataset 1: COVID-19

## Dataset 2: Agriculture

### Getting the article IDs to retrieve

In [4]:
article_ids_file = os.path.join(DATA_DIR, 'pmc_ids.txt')

def load_ids(base_file):
    with open(base_file, 'r') as f:
        ids = f.read().splitlines()
    return ids
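
`load_ids` returns every line of the file as-is, so a blank line or a repeated entry would later become a request to the API. A minimal sketch of a stricter loader follows. The name `load_clean_ids` and the `PMC` plus digits pattern are assumptions for illustration, the pattern being inferred only from the sample ID `'PMC3310815'`.

```python
import re

# Assumed ID shape, inferred from the sample value 'PMC3310815'.
PMC_ID_PATTERN = re.compile(r'^PMC\d+$')


def load_clean_ids(base_file):
    """Read one ID per line, dropping blanks, duplicates and malformed entries."""
    seen = set()
    ids = []
    with open(base_file, 'r', encoding='utf-8') as f:
        for line in f:
            candidate = line.strip()
            # Skip anything that does not look like an ID or was already seen.
            if not PMC_ID_PATTERN.match(candidate) or candidate in seen:
                continue
            seen.add(candidate)
            ids.append(candidate)
    return ids
```

The original order of the file is preserved, so the result can be passed on in place of `article_ids`.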


In [5]:
article_ids = load_ids(article_ids_file)
len(article_ids)

127

In [6]:
article_ids[0]

'PMC3310815'

### Loading the XML data from the Europe PMC API

In [7]:
BMC_BASE_API = 'https://www.ebi.ac.uk/europepmc/webservices/rest'

In [8]:
import requests

def load_pmc_data(ids_to_download):
    return {pmc_id: requests.get(f"{BMC_BASE_API}/{pmc_id}/fullTextXML").content 
            for pmc_id in ids_to_download}

pmc_dataset_xml = load_pmc_data(article_ids)
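
`load_pmc_data` stores whatever the server returns, including the body of an error response. The sketch below is one possible, more defensive variant. It assumes a non-200 status means no full text is available for that ID. The names `fetch_full_text` and `load_pmc_data_checked`, and the `timeout` and `pause` parameters, are hypothetical. The session can be swapped for a stand-in object when testing offline.

```python
import time

BMC_BASE_API = 'https://www.ebi.ac.uk/europepmc/webservices/rest'


def fetch_full_text(pmc_id, session, timeout=30):
    """Return the full-text XML bytes for one article, or None on a non-200 status."""
    response = session.get(f"{BMC_BASE_API}/{pmc_id}/fullTextXML", timeout=timeout)
    if response.status_code != 200:
        return None
    return response.content


def load_pmc_data_checked(ids_to_download, session=None, pause=0.0):
    """Download each article and return (dataset, failed_ids)."""
    if session is None:
        # Imported lazily so that a stand-in session can be injected instead.
        import requests
        session = requests.Session()
    dataset, failed = {}, []
    for pmc_id in ids_to_download:
        content = fetch_full_text(pmc_id, session)
        if content is None:
            failed.append(pmc_id)
        else:
            dataset[pmc_id] = content
        time.sleep(pause)  # optional delay between consecutive requests
    return dataset, failed
```

Returning the failed IDs separately makes it easy to check how many of the 127 articles were actually retrieved before parsing.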

### Parsing the data

In [9]:
import xml.etree.ElementTree as ET

article_xml = pmc_dataset_xml['PMC3310815']

In [25]:
article_xml

b'<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.1 20151215//EN" "JATS-archivearticle1.dtd"> \n<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article"><?properties open-access?><?DTDIdentifier.IdentifierValue -//NLM//DTD Journal Publishing DTD v2.3 20031101//EN?><?DTDIdentifier.IdentifierType public?><?SourceDTD.DTDName journalpublishing.dtd?><?SourceDTD.Version 2.3?><?ConverterInfo.XSLTName jp2nlmx2.xsl?><?ConverterInfo.Version 1?><front><journal-meta><journal-id journal-id-type="nlm-ta">PLoS Pathog</journal-id><journal-id journal-id-type="iso-abbrev">PLoS Pathog</journal-id><journal-id journal-id-type="publisher-id">plos</journal-id><journal-id journal-id-type="pmc">plospath</journal-id><journal-title-group><journal-title>PLoS Pathogens</journal-title></journal-title-group><issn pub-type="ppub">1553-7366</issn><issn pub-type="epub">1553-7374</issn><publisher><publi

In [32]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(article_xml, 'lxml-xml')

In [37]:
soup.find('body').get_text(separator='')

"IntroductionTransmission of plant pathogens by insect vectors is a complex biological process involving interactions between the plant, insect, and pathogen [1]–[2]. Pathogens can induce changes in the traits of their primary hosts as well as their vectors to affect the frequency and nature of interactions between hosts and vectors [3]–[13]. Plant morphology, as well as, primary and secondary plant compounds, including emitted volatiles and plant nutrients, are some of the traits that can be altered by pathogen infection of plants [14]–[16]. Fecundity, survival, and behavior are primary traits altered in insect vectors due to such infection [7]–[10], [12]–[13], [17]–[21]. Plant pathogen infection may alter both plant morphology and chemistry; therefore, research efforts have focused on the vector's response to such changes in their plant host [7]–[12], [18]–[21].\nCandidatus Liberibacter asiaticus (Las) is a gram-negative, fastidious, phloem-limited bacterium that causes huanglongbing

In [55]:
soup.find('abstract').text

'Transmission of plant pathogens by insect vectors is a complex biological process involving interactions between the plant, insect, and pathogen. Pathogen-induced plant responses can include changes in volatile and nonvolatile secondary metabolites as well as major plant nutrients. Experiments were conducted to understand how a plant pathogenic bacterium, Candidatus Liberibacter asiaticus (Las), affects host preference behavior of its psyllid (Diaphorina citri Kuwayama) vector. D. citri were attracted to volatiles from pathogen-infected plants more than to those from non-infected counterparts. Las-infected plants were more attractive to D. citri adults than non-infected plants initially; however after feeding, psyllids subsequently dispersed to non-infected rather than infected plants as their preferred settling point. Experiments with Las-infected and non-infected plants under complete darkness yielded similar results to those recorded under light. The behavior of psyllids in respons

In [None]:
class PMCArticle:
    def __init__(self, article_id):
        # unfinished stub: the remaining fields were still to be defined
        self.article_id = article_id

In [None]:
import xml.etree.ElementTree as ET


def get_abstract(article_soup):
    return article_soup.find('abstract').text

def get_authors(article_soup):
    # search the soup passed in, not the notebook-level `soup` variable
    return [author.find('name').text
            for author in article_soup.find_all('contrib',
                                                {'contrib-type': 'author'})]

def get_full_body(article_soup):
    return article_soup.find('body').get_text(separator=' ')

def get_title(article_soup):
    return article_soup.find('article-title').text

def get_pmc_id(article_soup):
    return article_soup.find('article-id', {'pub-id-type': 'pmcid'}).text

def parse_pmc_article(article_xml):
    pass
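
`parse_pmc_article` is still a stub. As a rough sketch of where it could go, the version below collects the same fields as the getters above, using the standard-library `xml.etree.ElementTree` module that the notebook imports as `ET`, rather than BeautifulSoup. The function name `parse_pmc_article_et`, the dictionary layout and the "given names, then surname" author format are assumptions, not the notebook's final design.

```python
import xml.etree.ElementTree as ET


def _text(element):
    """Join all text under an element with single spaces, or return None."""
    if element is None:
        return None
    return ' '.join(' '.join(element.itertext()).split())


def parse_pmc_article_et(article_xml):
    """Extract title, abstract, authors and body text from JATS XML bytes."""
    root = ET.fromstring(article_xml)
    authors = []
    for contrib in root.iter('contrib'):
        if contrib.get('contrib-type') != 'author':
            continue
        name = contrib.find('name')
        if name is None:
            # Contributors without a <name> element are skipped.
            continue
        given = name.findtext('given-names', default='')
        surname = name.findtext('surname', default='')
        authors.append(f"{given} {surname}".strip())
    return {
        # The first <article-title> in document order is taken as the title.
        'title': _text(root.find('.//article-title')),
        'abstract': _text(root.find('.//abstract')),
        'authors': authors,
        'body': _text(root.find('.//body')),
    }
```

Joining text nodes with a space avoids section titles running into the following sentence, as seen in the `IntroductionTransmission` output earlier, at the cost of occasionally adding a space around inline markup.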
