# 1. Data exploration and cleaning
In this notebook we begin the initial exploration and cleaning of the datasets for the Publications track of the Hércules challenge.

## Setup

In [2]:
import logging
import os
import sys

# set up module paths for imports
module_path = os.path.abspath(os.path.join('..'))
sys.path.append(module_path)

# start logging system and set logging level
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info("Starting logger")

INFO:root:Starting logger


In [3]:
DATA_DIR = os.path.join(module_path, 'data')

## Dataset 1: COVID-19

## Dataset 2: Agriculture

### Getting the article IDs to retrieve

In [4]:
article_ids_file = os.path.join(DATA_DIR, 'pmc_ids.txt')

def load_ids(base_file):
    # read one article ID per line, dropping the trailing newlines
    with open(base_file, 'r') as f:
        ids = f.read().splitlines()
    return ids


In [5]:
article_ids = load_ids(article_ids_file)
len(article_ids)

127

In [6]:
article_ids[0]

'PMC3310815'
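
Before requesting anything, it can help to check that every line of the IDs file looks like a PMC identifier. The following is a minimal sketch and not part of the original notebook: the helper name `validate_pmc_ids` and the pattern (`PMC` followed by digits) are assumptions inferred from the sample ID shown above.

```python
import re

# assumed ID shape, inferred from the sample 'PMC3310815'
PMC_ID_PATTERN = re.compile(r'PMC\d+')

def validate_pmc_ids(ids):
    """Split IDs into (valid, invalid), dropping blanks and duplicates."""
    valid, invalid, seen = [], [], set()
    for raw_id in ids:
        pmc_id = raw_id.strip()
        if not pmc_id or pmc_id in seen:
            continue
        seen.add(pmc_id)
        if PMC_ID_PATTERN.fullmatch(pmc_id):
            valid.append(pmc_id)
        else:
            invalid.append(pmc_id)
    return valid, invalid

# toy input: one padded ID, one blank line, one duplicate and one malformed ID
valid_ids, invalid_ids = validate_pmc_ids(
    ['PMC3310815', ' PMC3547067 ', '', 'PMC3310815', '12345'])
```

In the notebook this could be called as `validate_pmc_ids(article_ids)`, inspecting the second list before any download starts.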

### Loading the XML data from the Europe PMC API

In [7]:
BMC_BASE_API = 'https://www.ebi.ac.uk/europepmc/webservices/rest'

In [8]:
import requests

def load_pmc_data(ids_to_download):
    # map each PMC ID to the raw bytes returned by the fullTextXML endpoint
    return {pmc_id: requests.get(f"{BMC_BASE_API}/{pmc_id}/fullTextXML").content
            for pmc_id in ids_to_download}

pmc_dataset_xml = load_pmc_data(article_ids)
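
The download above stores whatever the server returns, including error pages, and has no timeout. A more defensive variant might look like the sketch below. It is an assumption-laden illustration rather than the notebook's method: `fetch_full_text_xml`, `load_pmc_data_safe`, the 30-second timeout and the `delay` parameter are all hypothetical. The fetch function is passed in as an argument so the loop can be exercised without network access.

```python
import time

BMC_BASE_API = 'https://www.ebi.ac.uk/europepmc/webservices/rest'

def fetch_full_text_xml(pmc_id, timeout=30):
    """Download one article, raising on HTTP errors (hypothetical helper)."""
    import requests  # imported lazily so the loop below can be tested offline
    response = requests.get(f"{BMC_BASE_API}/{pmc_id}/fullTextXML", timeout=timeout)
    response.raise_for_status()
    return response.content

def load_pmc_data_safe(ids_to_download, fetch=fetch_full_text_xml, delay=0.0):
    """Return (downloaded, failed): XML by ID, and error messages by ID."""
    downloaded, failed = {}, {}
    for pmc_id in ids_to_download:
        try:
            downloaded[pmc_id] = fetch(pmc_id)
        except Exception as exc:  # keep going and report the failure later
            failed[pmc_id] = str(exc)
        if delay:
            time.sleep(delay)  # optional pause between requests
    return downloaded, failed
```

A call such as `pmc_dataset_xml, failed = load_pmc_data_safe(article_ids)` would then expose any IDs whose full text could not be retrieved instead of silently storing the error body.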

One of the articles ('PMC6472519') is not available for reuse, so we remove it from the whole track to comply with its license. More information about this issue can be found at https://github.com/weso-edma/hercules-challenge-publications/issues/3.

In [None]:
del pmc_dataset_xml['PMC6472519']
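
Note that `del` raises a `KeyError` if the cell is run a second time, because the key is already gone. An alternative that can be re-run safely is sketched below. The names `EXCLUDED_IDS` and `drop_excluded` are hypothetical, and the toy dictionary only stands in for `pmc_dataset_xml`.

```python
# IDs that must not be reused; only this one is named in the notebook
EXCLUDED_IDS = {'PMC6472519'}

def drop_excluded(dataset, excluded_ids=EXCLUDED_IDS):
    """Return a copy of the dataset without the excluded article IDs."""
    return {pmc_id: xml for pmc_id, xml in dataset.items()
            if pmc_id not in excluded_ids}

# toy data standing in for pmc_dataset_xml
toy_dataset = {'PMC3310815': b'<article/>', 'PMC6472519': b'<article/>'}
filtered = drop_excluded(toy_dataset)
```

Because the function builds a new dictionary, applying it twice gives the same result and the original download is left untouched.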

### Parsing the data

In [32]:
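from bs4 import BeautifulSoup

# article_xml is not defined in the cells above; taking the first downloaded
# article is an assumption that matches the title shown in the output below
article_xml = pmc_dataset_xml[article_ids[0]]
soup = BeautifulSoup(article_xml, 'lxml-xml')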

In [37]:
soup.find('body').get_text(separator='')

"IntroductionTransmission of plant pathogens by insect vectors is a complex biological process involving interactions between the plant, insect, and pathogen [1]–[2]. Pathogens can induce changes in the traits of their primary hosts as well as their vectors to affect the frequency and nature of interactions between hosts and vectors [3]–[13]. Plant morphology, as well as, primary and secondary plant compounds, including emitted volatiles and plant nutrients, are some of the traits that can be altered by pathogen infection of plants [14]–[16]. Fecundity, survival, and behavior are primary traits altered in insect vectors due to such infection [7]–[10], [12]–[13], [17]–[21]. Plant pathogen infection may alter both plant morphology and chemistry; therefore, research efforts have focused on the vector's response to such changes in their plant host [7]–[12], [18]–[21].\nCandidatus Liberibacter asiaticus (Las) is a gram-negative, fastidious, phloem-limited bacterium that causes huanglongbing

In [84]:
soup.find('article-title').text

'Induced Release of a Plant-Defense Volatile ‘Deceptively’ Attracts Insect Vectors to Plants Infected with a Bacterial Pathogen'

In [154]:
class PMCArticle():
    def __init__(self, article_id, title, authors,
                 abstract, full_body, references_titles):
        self.article_id = article_id
        self.authors = authors
        self.abstract = abstract
        self.full_body = full_body
        self.title = title
        self.references_titles = references_titles
    
    def to_dict(self):
        return {
            'id': self.article_id,
            'title': self.title,
            'abstract': self.abstract,
            'full_body': self.full_body,
            'authors': '|'.join(self.authors),
            'references': '|'.join(self.references_titles)
        }
        
    def __repr__(self):
        return str(self)
    
    def __str__(self):
        return f"{self.article_id} - {self.title} - {self.abstract[:10]}... - {self.full_body[:20]}..."
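
`to_dict` flattens the author and reference lists into single strings joined by `|`. That choice is only reversible if no individual value contains the separator. The sketch below illustrates the round trip under that assumption. `join_values` and `split_values` are hypothetical helpers, not part of the notebook.

```python
SEPARATOR = '|'

def join_values(values, separator=SEPARATOR):
    """Join values into one string, refusing values that contain the separator."""
    for value in values:
        if separator in value:
            raise ValueError(f"value contains separator {separator!r}: {value!r}")
    return separator.join(values)

def split_values(joined, separator=SEPARATOR):
    """Invert join_values; an empty string maps back to an empty list."""
    return joined.split(separator) if joined else []

# two author names taken from the first article in the dataset
authors = ['Mann Rajinder S.', 'Ali Jared G.']
joined_authors = join_values(authors)
restored_authors = split_values(joined_authors)
```

Reference titles are more likely than author names to contain a literal `|`, so a check like the one in `join_values` would surface that case early instead of producing a string that splits into the wrong number of items.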


In [158]:
def get_abstract(article_soup):
    return article_soup.find('abstract').text

def get_authors(article_soup):
    return [author.find('name').get_text(separator=' ') 
            for author in article_soup.find_all('contrib', 
                                        {'contrib-type': 'author'})]

def get_full_body(article_soup):
    return article_soup.find('body').get_text(separator=' ')

def get_title(article_soup):
    return article_soup.find('article-title').text

def get_pmc_id(article_soup):
    return article_soup.find('article-id', {'pub-id-type': 'pmcid'}).text

def get_references_titles(article_soup):
    return [reference.find('article-title').text
            for reference in article_soup.find_all('ref')
            if reference.find('article-title')]

def parse_pmc_article(article_xml):
    soup = BeautifulSoup(article_xml, 'lxml-xml')
    return PMCArticle(get_pmc_id(soup), get_title(soup),
                      get_authors(soup), get_abstract(soup),
                      get_full_body(soup), get_references_titles(soup))
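
The getters above assume that every element exists: `find` returns `None` for a missing tag, so an article without an abstract, or an author without a `name` element, would raise an `AttributeError`. The sketch below shows one way to tolerate missing elements. It swaps in the standard library's `xml.etree.ElementTree` in place of BeautifulSoup so that it runs without extra dependencies. The sample XML is hand-written and made up, and `element_text` and `parse_article_fields` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# hand-written, made-up snippet without an abstract; real articles are far larger
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmcid">0000000</article-id>
      <title-group><article-title>A made-up title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author"><name><surname>Doe</surname><given-names>Jane</given-names></name></contrib>
        <contrib contrib-type="editor"><name><surname>Roe</surname><given-names>Richard</given-names></name></contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body><sec><title>Introduction</title><p>Some text.</p></sec></body>
</article>"""

def element_text(element, default=''):
    """Return the whitespace-normalised text of an element, or a default if missing."""
    if element is None:
        return default
    return ' '.join(' '.join(element.itertext()).split())

def parse_article_fields(article_xml):
    """Extract the main fields of an article, tolerating missing elements."""
    root = ET.fromstring(article_xml)
    return {
        'id': element_text(root.find(".//article-id[@pub-id-type='pmcid']")),
        'title': element_text(root.find('.//article-title')),
        'abstract': element_text(root.find('.//abstract')),
        'authors': [element_text(contrib.find('name'))
                    for contrib in root.findall(".//contrib[@contrib-type='author']")],
        'full_body': element_text(root.find('.//body')),
    }

fields = parse_article_fields(SAMPLE_XML)
```

Here the missing abstract becomes an empty string instead of an exception, and only contributors marked as authors are kept. The same `None` check could be added to the BeautifulSoup getters if staying with that library is preferred.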


In [159]:
pmc_articles = [parse_pmc_article(article_xml) for article_xml in pmc_dataset_xml.values()]
pmc_articles[0]

3310815 - Induced Release of a Plant-Defense Volatile ‘Deceptively’ Attracts Insect Vectors to Plants Infected with a Bacterial Pathogen - Transmissi... - Introduction Transmi...

### Creating the dataframe

In [160]:
import pandas as pd

pmc_df = pd.DataFrame.from_records([article.to_dict() for article in pmc_articles])
pmc_df

,id,title,abstract,full_body,authors,references
0,3310815,Induced Release of a Plant-Defense Volatile ‘D...,Transmission of plant pathogens by insect vect...,Introduction Transmission of plant pathogens b...,Mann Rajinder S.|Ali Jared G.|Hermann Sara L.|...,The ecology of bacterial and mycoplasma plant ...
1,3547067,Carbon and Nitrogen Isotopic Survey of Norther...,The development of isotopic baselines for comp...,Introduction Stable isotope analysis is an imp...,Szpak Paul|White Christine D.|Longstaffe Fred ...,Influence of diet on the distribution of nitro...
2,3668195,The effect of ‘Candidatus Liberibacter asiatic...,BackgroundHuanglongbing (HLB) is a highly dest...,Background Citrus Huanglongbing (HLB) or citru...,Nwugo Chika C|Lin Hong|Duan Yongping|Civerolo ...,Current Epidemiological Understanding of Citru...
3,3672096,Emissions of CH4 and N2O under Different Tilla...,Understanding greenhouse gases (GHG) emissions...,Introduction With the current rise in global t...,Zhang Hai-Lin|Bai Xiao-Lin|Xue Jian-Fu|Chen Zh...,Measured and modelled estimates of nitrous oxi...
4,3676804,"Physiological, Biochemical, and Molecular Mech...",High temperature (HT) stress is a major enviro...,1. Introduction Among the ever-changing compon...,Hasanuzzaman Mirza|Nahar Kamrun|Alam Md. Mahab...,Climate and management contributions to recent...
...,...,...,...,...,...,...
121,6681344,"The NAC Protein from Tamarix hispida, ThNAC7, ...","Plant specific NAC (NAM, ATAF1/2 and CUC2) tra...","1. Introduction Environmental constraints, inc...",He Zihang|Li Ziyi|Lu Huijun|Huo Lin|Wang Zhibo...,"Cell signaling during cold, drought, and salt ..."
122,6681968,"Isolation, cloning and expression of CCA1 gene...",Circadian clock genes holds tremendous potenti...,Introduction Circadian Clock Associated1 ( CCA...,Chaudhury Ashok|Dalal Anita Devi|Sheoran Nayan...,Constitutive expression of the circadian clock...
123,6724085,Phytolith Formation in Plants: From Soil to Cell,Silica is deposited extra- and intracellularly...,1. Introduction Phytoliths are microscopic amo...,Nawaz Muhammad Amjad|Zakharenko Alexander Mikh...,Physiological and ecological significance of b...
124,6730492,Responses to Hydric Stress in the Seed-Borne N...,Alternaria brassicicola is a necrotrophic fung...,Introduction The fungus Alternaria brassicico...,N’Guyen Guillaume Quang|Raulo Roxane|Marchi Mu...,Genome sequence of the necrotrophic plant path...
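
A quick sanity check on the records can catch duplicated IDs or empty text fields before any further processing. The sketch below works on plain dictionaries with the same keys that `PMCArticle.to_dict()` produces, so it needs no extra libraries. `check_records` is a hypothetical helper and the toy records are made up.

```python
from collections import Counter

def check_records(records, required_fields=('id', 'title', 'abstract', 'full_body')):
    """Return duplicated IDs and, per record ID, the required fields that are empty."""
    id_counts = Counter(record['id'] for record in records)
    duplicated_ids = sorted(pmc_id for pmc_id, count in id_counts.items() if count > 1)
    empty_fields = {}
    for record in records:
        missing = [field for field in required_fields
                   if not record.get(field, '').strip()]
        if missing:
            empty_fields[record['id']] = missing
    return duplicated_ids, empty_fields

# toy records with the same keys as PMCArticle.to_dict(); the values are made up
toy_records = [
    {'id': '1', 'title': 'A', 'abstract': 'text', 'full_body': 'body'},
    {'id': '2', 'title': 'B', 'abstract': '  ', 'full_body': 'body'},
    {'id': '1', 'title': 'A', 'abstract': 'text', 'full_body': 'body'},
]
duplicated, empty = check_records(toy_records)
```

In the notebook, the input would be `[article.to_dict() for article in pmc_articles]`, the same list that is passed to `pd.DataFrame.from_records`.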


In [121]:
pmc_df.iloc[82]

id                                                     6213855
title        Importance of Mineral Nutrition for Mitigating...
abstract     Aluminum (Al) toxicity is one of the major lim...
full_body    1. Introduction Aluminum (Al) toxicity represe...
authors      [Mann Rajinder S., Ali Jared G., Hermann Sara ...
Name: 82, dtype: object

In [134]:
pmc_df.iloc[0].full_body[:300]

'Introduction Transmission of plant pathogens by insect vectors is a complex biological process involving interactions between the plant, insect, and pathogen  [1] – [2] . Pathogens can induce changes in the traits of their primary hosts as well as their vectors to affect the frequency and nature of '
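
The excerpt above shows a side effect of `get_text(separator=' ')`: doubled spaces and a stray space before the full stop (`pathogen  [1] – [2] .`). A small normalisation step, sketched below, could tidy this up. `normalise_spacing` is a hypothetical helper, and the set of punctuation marks it handles is an assumption that may need adjusting for this corpus.

```python
import re

def normalise_spacing(text):
    """Collapse whitespace runs and drop spaces before closing punctuation."""
    collapsed = re.sub(r'\s+', ' ', text).strip()
    return re.sub(r' ([.,;:])', r'\1', collapsed)

# fragment copied from the output above
sample = 'interactions between the plant, insect, and pathogen  [1] – [2] . Pathogens can'
cleaned = normalise_spacing(sample)
```

For this fragment, `cleaned` reads `... and pathogen [1] – [2]. Pathogens can`. The rule is deliberately simple and would also remove a space that was intentionally placed before a colon or semicolon.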