# 2. Data exploration and cleaning
In this notebook we begin the initial exploration and cleaning of the datasets for the Publications track of the Hércules challenge.

## Setup

In this section we perform the initial setup of the notebook and define the constants and functions shared by both datasets.

First, we add the src directory to the sys path so that the modules defined inside it can be imported. Then we start the logging system. Both steps are run from the \_\_init\_\_.py script:

In [1]:
%run __init__.py
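The contents of \_\_init\_\_.py are not shown in this notebook. As a rough sketch of what such a setup script could do, assuming a `src` directory next to the notebook (the helper name `setup_notebook`, the logger name and the log format are hypothetical, not taken from the project):

```python
import logging
import os
import sys


def setup_notebook(src_dir="src", level=logging.INFO):
    """Hypothetical helper: add src_dir to sys.path and start logging."""
    src_path = os.path.abspath(src_dir)
    # Avoid adding the same entry twice when the cell is re-run.
    if src_path not in sys.path:
        sys.path.insert(0, src_path)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger("notebook")


logger = setup_notebook()
logger.info("Notebook environment ready")
```

Because `%run` executes the script in the notebook's namespace, any names it defines remain available to the following cells.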

In the following cell we define the common constants and functions shared by both datasets of this track. The ResearchArticle class is also imported from the src package. This class serves as a common interface between the Agriculture and CORD19 datasets. More information about it can be found in the *src/research_article.py* module.

In [2]:
from src import ResearchArticle

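def print_empty_cols(df):
    """Print, for each column, how many rows hold an empty string and their index."""
    for col in df.columns:
        print(col)
        print('-' * len(col))
        # Index of the rows whose value in this column is the empty string
        res = df[df[col] == ''].index
        print(f"{len(res)} articles have no value for column {col}")
        print(res)
        print('\n')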


In [3]:
from bokeh.io import output_notebook

output_notebook()

In [4]:
from herc_common import BokehHistogram

hist = BokehHistogram(color_fill="mediumslateblue", color_hover="slateblue")

## Dataset 2: Agriculture
In this section we load and parse the Agriculture dataset. This dataset consists of a list of articles available in Europe PMC.

### Loading the data

In [5]:
import glob
import os

# DATA_DIR is expected to be defined by the setup run from __init__.py
AGRICULTURE_DATASET_DIR = os.path.join(DATA_DIR, 'agriculture')

xml_filenames = glob.glob(f'{AGRICULTURE_DATASET_DIR}/**/*.xml', recursive=True)

In [6]:
from pathlib import Path

# Map each PMC id (file name up to the first dot) to the raw XML bytes
pmc_dataset_xml = {}
for filename in xml_filenames:
    pmc_id = Path(filename).name.split('.')[0]
    with open(filename, 'rb') as f:
        pmc_dataset_xml[pmc_id] = f.read()
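The article id above is taken as everything before the first dot of the file name. A small sketch of how that rule behaves, and how it differs from `Path.stem` when a name contains more than one dot (both paths below are made up for illustration):

```python
from pathlib import Path


def article_id_from_path(filename):
    """Same rule as the loop above: text before the first dot of the file name."""
    return Path(filename).name.split('.')[0]


simple = "data/agriculture/PMC3310815.xml"           # hypothetical path
double_ext = "data/agriculture/PMC3310815.nxml.xml"  # hypothetical path

print(article_id_from_path(simple))
print(article_id_from_path(double_ext))
# Path.stem only strips the last suffix, so it keeps ".nxml" here
print(Path(double_ext).stem)
```

Splitting on the first dot keeps the ids uniform even if some files carry an extra suffix.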

### Parsing the data

In the *src/agriculture/data_reader.py* module we have a series of functions to parse the contents of the XML files returned by the API. These functions return an instance of the *ResearchArticle* class for each article given, just like with the previous dataset:

In [7]:
from src.data_reader import parse_pmc_article

pmc_articles = [parse_pmc_article(article_id, article_xml)
                for article_id, article_xml in pmc_dataset_xml.items()]
pmc_articles[0]

PMC3310815 - Induced Release of a Plant-Defense Volatile ‘Deceptively’ Attracts Insect Vectors to Plants Infected with a Bacterial Pathogen - Transmissi... - Introduction Transmi...
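The implementation of `parse_pmc_article` is not shown here. As an illustration of the general idea only, the sketch below extracts a few fields from a toy XML document with the standard library. It assumes JATS-style tag names (`article-title`, `abstract`, `body`); the sample document, the function names and the returned dictionary are invented for this example and are not the project's code:

```python
import xml.etree.ElementTree as ET

# Toy document using JATS-like tags (assumption: real files are much richer).
SAMPLE_XML = b"""<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>A toy article about <italic>Citrus</italic> pests</article-title>
      </title-group>
      <abstract><p>First sentence.</p><p>Second sentence.</p></abstract>
    </article-meta>
  </front>
  <body><sec><title>Introduction</title><p>Body text.</p></sec></body>
</article>"""


def element_text(element):
    """Join all text inside an element, including nested tags."""
    if element is None:
        return ""
    # itertext() walks nested elements; split/join normalises whitespace
    return " ".join(" ".join(element.itertext()).split())


def parse_article_sketch(article_id, article_xml):
    """Hypothetical stand-in for a parser returning plain fields."""
    root = ET.fromstring(article_xml)
    return {
        "id": article_id,
        "title": element_text(root.find(".//article-title")),
        "abstract": element_text(root.find(".//abstract")),
        "full_body": element_text(root.find("body")),
    }


parsed = parse_article_sketch("PMC0000000", SAMPLE_XML)
print(parsed["title"])
```

Flattening the body with `itertext()` is one way to end up with text that starts with the section title, as in the `full_body` values shown in this notebook.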

### Creating the dataframe

Now that we have the list of articles from Europe PMC, we can proceed to create a pandas DataFrame to work with the data:

In [8]:
import pandas as pd

pmc_df = pd.DataFrame.from_records([article.to_dict() for article in pmc_articles])
pmc_df.head(n=7)

| | id | title | abstract | full_body | authors | references | subjects |
|---|---|---|---|---|---|---|---|
| 0 | PMC3310815 | Induced Release of a Plant-Defense Volatile ‘D... | Transmission of plant pathogens by insect vect... | Introduction Transmission of plant pathogens b... | Mann Rajinder S.\|Ali Jared G.\|Hermann Sara L.\|... | Insect vector relationships with procaryotic p... | Agriculture\|Crops\|Pest Control\|Biology\|Ecology... |
| 1 | PMC3547067 | Carbon and Nitrogen Isotopic Survey of Norther... | The development of isotopic baselines for comp... | Introduction Stable isotope analysis is an imp... | Szpak Paul\|White Christine D.\|Longstaffe Fred ... | Influence of diet on the distribution of carbo... | Biology\|Ecology\|Biogeochemistry\|Paleontology\|P... |
| 2 | PMC3668195 | The effect of ‘Candidatus Liberibacter asiatic... | BackgroundHuanglongbing (HLB) is a highly dest... | Background Citrus Huanglongbing (HLB) or citru... | Nwugo Chika C\|Lin Hong\|Duan Yongping\|Civerolo ... | Huanglongbing: a destructive, newly-emerging, ... | |
| 3 | PMC3672096 | Emissions of CH4 and N2O under Different Tilla... | Understanding greenhouse gases (GHG) emissions... | Introduction With the current rise in global t... | Zhang Hai-Lin\|Bai Xiao-Lin\|Xue Jian-Fu\|Chen Zh... | Simulation of fluxes of greenhouse gases from ... | Agriculture\|Agricultural Biotechnology\|Agricul... |
| 4 | PMC3676804 | Physiological, Biochemical, and Molecular Mech... | High temperature (HT) stress is a major enviro... | 1. Introduction Among the ever-changing compon... | Hasanuzzaman Mirza\|Nahar Kamrun\|Alam Md. Mahab... | Climate change 2007–The physical science basis... | |
| 5 | PMC3676838 | Plant Defense against Insect Herbivores | Plants have been interacting with insects for ... | 1. Introduction Land plants and insects have c... | Fürstenberg-Hägg Joel\|Zagrobelny Mika\|Bak Søren | Butterflies and plants: A study in coevolution... | |
| 6 | PMC3818224 | Enhanced Methanol Production in Plants Provide... | Plants naturally emit methanol as volatile org... | Introduction Insect pests cause approximately ... | Dixit Sameer\|Upadhyay Santosh Kumar\|Singh Harp... | Pesticides and pest control\|Biotechnology as a... | |


In [9]:
pmc_df.iloc[82]

id                                                   PMC6213855
title         Importance of Mineral Nutrition for Mitigating...
abstract      Aluminum (Al) toxicity is one of the major lim...
full_body     1. Introduction Aluminum (Al) toxicity represe...
authors       Rahman Md. Atikur|Lee Sang-Hoon|Ji Hee Chung|K...
references    Plant adaptation to aid soils: The molecular b...
subjects                                                       
Name: 82, dtype: object

In [10]:
pmc_df.iloc[0].full_body[:300]

'Introduction Transmission of plant pathogens by insect vectors is a complex biological process involving interactions between the plant, insect, and pathogen  [1] – [2] . Pathogens can induce changes in the traits of their primary hosts as well as their vectors to affect the frequency and nature of '

### Cleaning and feature engineering

In the following cells we define a simple function to clean the body text of each article, and apply it to the dataframe to obtain a column with the cleaned text:

In [11]:
import re

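def clean(text):
    # Replace hair spaces (U+200A) with regular spaces
    text = text.replace('\u200a', ' ')
    # Collapse runs of spaces into one and trim both ends
    return re.sub(' +', ' ', text).strip()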


In [12]:
pmc_df['text_cleaned'] = pmc_df['full_body'].apply(clean)
pmc_df['text_cleaned'].loc[0][:500]

'Introduction Transmission of plant pathogens by insect vectors is a complex biological process involving interactions between the plant, insect, and pathogen [1] – [2] . Pathogens can induce changes in the traits of their primary hosts as well as their vectors to affect the frequency and nature of interactions between hosts and vectors [3] – [13] . Plant morphology, as well as, primary and secondary plant compounds, including emitted volatiles and plant nutrients, are some of the traits that can'
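To make the effect of `clean` easier to see, here is a self-contained check on a made-up string (the sample text is invented; the function repeats the logic of the cell above):

```python
import re


def clean(text):
    # Replace hair spaces (U+200A) with regular spaces
    text = text.replace('\u200a', ' ')
    # Collapse runs of spaces into one and trim both ends
    return re.sub(' +', ' ', text).strip()


sample = '  pathogen\u200a [1]   \u2013 [2] .  '
print(repr(clean(sample)))
```

Note that only space characters are collapsed: tabs and newlines pass through unchanged, and the space before the final period is kept, as in the real output above.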

Finally, we will also define a new column with the number of characters of each article, just like with the CORD19 dataset:

In [13]:
pmc_df['num_chars_text'] = pmc_df['text_cleaned'].apply(len)

### Initial exploration of the data

We will begin this section by checking if there are any empty or null values in our dataset:

In [14]:
pmc_df.isnull().sum()

id                0
title             0
abstract          0
full_body         0
authors           0
references        0
subjects          0
text_cleaned      0
num_chars_text    0
dtype: int64

Although there are no null values in the dataset, some of the strings could still be empty. We can quickly check this by iterating over all the columns of the dataframe:

In [15]:
print_empty_cols(pmc_df)

id
--
0 articles have no value for column id
Int64Index([], dtype='int64')


title
-----
0 articles have no value for column title
Int64Index([], dtype='int64')


abstract
--------
0 articles have no value for column abstract
Int64Index([], dtype='int64')


full_body
---------
0 articles have no value for column full_body
Int64Index([], dtype='int64')


authors
-------
0 articles have no value for column authors
Int64Index([], dtype='int64')


references
----------
0 articles have no value for column references
Int64Index([], dtype='int64')


subjects
--------
61 articles have no value for column subjects
Int64Index([  2,   4,   5,   6,  15,  18,  19,  21,  22,  23,  24,  28,  30,
             32,  33,  35,  36,  37,  42,  43,  52,  53,  55,  56,  57,  58,
             59,  61,  62,  63,  64,  66,  69,  70,  72,  74,  81,  82,  91,
             93,  95,  97,  98,  99, 103, 104, 105, 106, 107, 108, 111, 112,
            113, 115, 117, 118, 119, 120, 121, 123, 125],
           dtype='int64')



We can see above that 61 of the articles have an empty value for their subjects. However, every article has values for the other columns, so we don't need to drop any row from the dataframe.
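The difference between null values and empty strings can be reproduced on a toy dataframe. The sketch below uses invented data and shows a compact, vectorized way of getting the same per-column counts that `print_empty_cols` reports:

```python
import pandas as pd

# Invented data: one article has an empty string as its subjects
toy_df = pd.DataFrame({
    "id": ["A1", "A2", "A3"],
    "subjects": ["Biology", "", "Ecology"],
})

# An empty string is a valid value, so isnull() does not flag it
null_counts = toy_df.isnull().sum()
# Comparing against '' counts the empty strings per column
empty_counts = toy_df.eq("").sum()

print(null_counts)
print(empty_counts)
```

If empty strings should be treated as missing later on, they could be replaced with `pd.NA` first, but that step is not needed for this dataset.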

Finally, we are going to explore the character length of the articles in the dataset:

In [16]:
pmc_df['num_chars_text'].describe()

count       126.000000
mean      50685.817460
std       19843.392561
min       14337.000000
25%       36041.000000
50%       46427.500000
75%       60482.250000
max      109010.000000
Name: num_chars_text, dtype: float64
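One way to read these quartiles is through the interquartile range. The sketch below plugs in the values printed above and applies Tukey's 1.5 × IQR rule, a common convention that this notebook does not itself use, to see whether the extremes stand out:

```python
# Values copied from the describe() output above
q1, q3 = 36041.00, 60482.25
min_len, max_len = 14337.0, 109010.0

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Shortest article below lower fence:", min_len < lower_fence)
print("Longest article above upper fence:", max_len > upper_fence)
```

Under this rule the longest article lies above the upper fence while the shortest one does not fall below the lower fence, which is consistent with the mean being larger than the median.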

In [17]:
PMC_HIST_COLUMN = 'num_chars_text'
PMC_HIST_TITLE = "Article length distribution for the Agriculture dataset"
PMC_HIST_XLABEL = "Article length (# of characters)"
PMC_HIST_YLABEL = "Number of articles"

hist.load_plot(pmc_df, PMC_HIST_COLUMN, PMC_HIST_TITLE,
               PMC_HIST_XLABEL, PMC_HIST_YLABEL, True)

In [18]:
hist.save_plot(os.path.join(NOTEBOOK_2_RESULTS_DIR, '1_Agriculture_length.svg'))

There was an error exporting the plot. Please verify that both Selenium and Geckodriver are installed: Neither firefox and geckodriver nor a variant of chromium browser and chromedriver are available on system PATH. You can install the former with 'conda install -c conda-forge firefox geckodriver'.
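The message above indicates that the export needs Selenium plus a browser and its driver on the system PATH. A minimal sketch of a pre-flight check follows, assuming the usual executable names (`firefox`, `geckodriver`, `chromium`, `chromium-browser`, `chromedriver`); the function name is hypothetical and the names may differ on a given system:

```python
import importlib.util
import shutil


def svg_export_available():
    """Best-effort check for the tools named in the error message."""
    has_selenium = importlib.util.find_spec("selenium") is not None
    # Option 1: Firefox together with geckodriver
    has_firefox = bool(shutil.which("firefox") and shutil.which("geckodriver"))
    # Option 2: a Chromium variant together with chromedriver
    has_chromium_browser = any(
        shutil.which(name) for name in ("chromium", "chromium-browser")
    )
    has_chromium = bool(has_chromium_browser and shutil.which("chromedriver"))
    return has_selenium and (has_firefox or has_chromium)


if not svg_export_available():
    print("SVG export will likely fail: install Selenium and a browser driver.")
```

Running a check like this before `save_plot` makes the missing dependency explicit instead of surfacing it as an export error.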


### Serializing the dataframe

In [19]:
PMC_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'pmc_dataframe.pkl')

pmc_df.to_pickle(PMC_FILE_PATH)
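A later notebook can restore the dataframe with `pd.read_pickle`. The sketch below demonstrates the round trip on a toy dataframe written to a temporary directory (the data and file name are invented for the example):

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({"id": ["A1", "A2"], "num_chars_text": [120, 340]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_dataframe.pkl")
    toy_df.to_pickle(toy_path)
    # read_pickle restores values, dtypes and index exactly as saved
    restored_df = pd.read_pickle(toy_path)

print(restored_df.equals(toy_df))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.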