# 2. Data exploration and cleaning
In this notebook we begin the initial exploration and cleaning of the datasets for the Publications track of the Hércules challenge.

## Setup

In this section we are going to perform the initial setup of the notebook and define the constants and functions that are shared by both datasets.

First, we add the src directory to sys.path so that the modules defined inside that directory can be imported. Then, we start the logging system. Both steps are performed by the \_\_init\_\_.py script:

In [1]:
%run __init__.py

In the following cell we are going to define common constants and functions shared by both datasets of this track. The ResearchArticle class is also imported from the src package. This class serves as a common interface between the Agriculture and CORD19 datasets. More information about it can be found in the *src/research_article.py* module.

In [2]:
from src import ResearchArticle

def print_empty_cols(df):
    for col in df.columns:
        print(col)
        print('-' * len(col))
        res = df[df[col] == ''].index
        print(f"{len(res)} articles have no value for column {col}")
        print(res)
        print('\n')
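
When only the counts are needed, the same check can be expressed without an explicit loop. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration, not data from either dataset); it relies only on `DataFrame.eq`, `sum` and `mean`:

```python
import pandas as pd

# Toy frame standing in for cord19_df / pmc_df (values are made up).
toy_df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "title": ["First", "", "Third"],
    "abstract": ["", "", "Some text"],
})

# Element-wise comparison against '' followed by a column-wise sum
# gives the number of empty strings per column.
empty_counts = toy_df.eq("").sum()
print(empty_counts)

# Share of empty cells per column, as a percentage.
empty_share = toy_df.eq("").mean() * 100
print(empty_share.round(1))
```

Unlike `print_empty_cols`, this variant does not list the affected row indices, so it complements the function rather than replacing it.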


In [3]:
from bokeh.io import output_notebook

output_notebook()

In [4]:
from herc_common import BokehHistogram

hist = BokehHistogram(color_fill="mediumslateblue", color_hover="slateblue")

## Dataset 1: COVID-19

We will begin with the CORD19 dataset. This dataset consists of a list of articles (130003 at the time of writing) included in the **COVID-19 Open Research Data Challenge** proposed by Kaggle.

The following cell will define constants for the name of the dataset, and the directory where it will be saved:

In [5]:
CORD_DATASET_NAME = "allen-institute-for-ai/CORD-19-research-challenge"
CORD_DATASET_DIR = os.path.join(DATA_DIR, 'cord19')

### Parsing the data

Now that the dataset has been downloaded and extracted, we can start parsing the data. We will begin by retrieving all the json files from the dataset:

In [6]:
import glob

json_filenames = glob.glob(f'{CORD_DATASET_DIR}/**/*.json', recursive=True)
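
As a side note on the pattern used above: with `recursive=True`, `**` matches zero or more directory levels, so JSON files placed directly in the dataset directory and in any nested subdirectory are both collected. The sketch below illustrates this on a temporary directory; the file and folder names are made up for the example:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one JSON file at the top level, one nested,
    # plus a non-JSON file that should be ignored.
    os.makedirs(os.path.join(root, "pdf_json"))
    for rel_path in ("top.json", "pdf_json/nested.json", "pdf_json/notes.txt"):
        with open(os.path.join(root, rel_path), "w") as f:
            f.write("{}")

    matches = glob.glob(f"{root}/**/*.json", recursive=True)
    # glob does not guarantee any ordering, so sort for reproducibility.
    found = sorted(os.path.basename(path) for path in matches)

print(found)
```

Because the order returned by `glob` depends on the file system, sorting the file names is a simple way to make index-based lookups such as `json_filenames[0]` reproducible across machines.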

Next, we import an auxiliary function defined in the src.cord19 package, which parses a given json file from the dataset and builds a *ResearchArticle* instance:

In [7]:
from src.cord19 import parse_cord_file

parse_cord_file(json_filenames[0])

0001418189999fea7f7cbe3e82703d71c85a6fe5 - Absence of surface expression of feline infectious peritonitis virus (FIPV) antigens on infected cells isolated from cats with FIP - Feline inf... - Feline infectious pe...

We can iterate over all of the files to parse the complete dataset:

In [8]:
cord19_dataset = [parse_cord_file(file_name) for file_name in json_filenames]

### Creating the dataframe

After parsing the data we have obtained 130,003 instances of the *ResearchArticle* class. However, in order to work with the data it is more convenient to use DataFrames from the pandas library. We have already implemented a *to_dict* method in the *ResearchArticle* class that facilitates the creation of DataFrames:

In [9]:
import pandas as pd

cord19_df = pd.DataFrame.from_records([article.to_dict() for article in cord19_dataset])
cord19_df.head(n=7)

Unnamed: 0,id,title,abstract,full_body,authors,references,subjects
0,0001418189999fea7f7cbe3e82703d71c85a6fe5,Absence of surface expression of feline infect...,Feline infectious peritonitis virus (FIPV) pos...,Feline infectious peritonitis (FIP) is a fatal...,E Cornelissen|H L Dewerchin|E Van Hamme|H J N...,Using direct immunofluorescence to detect coro...,
1,00016663c74157a66b4d509d5c4edffd5391bbe0,,,Viruses are increasingly recognised as pathoge...,,Principles of Virology in Fields Virology|Inac...,
2,0003793cf9e709bc2b9d0c8111186f78fb73fc04,Title: Rethinking high-risk groups in COVID-19,,How do we protect our 'high-risk' patient popu...,Anastasia Vishnevetsky|Michael Levy,COVID-19)|Prevalence of comorbidities in the n...,
3,00039b94e6cb7609ecbddee1755314bcfeb77faa,Plasma inflammatory cytokines and chemokines i...,Severe acute respiratory syndrome (SARS) is a ...,Severe acute respiratory syndrome (SARS) is a ...,W K Lam|C K Wong|C W K Lam|A K L Wu|W K Ip|N L...,A major outbreak of severe acute respiratory s...,
4,0003ddc51c4291d742855e9ac56076a3bea33ad7,Journal Pre-proofs The Fire This Time: The Str...,,It is said that crisis reveals character. The ...,Olusola Ajilore|April D Thames,Ethnic Disparities in Hospitalisation for COVI...,
5,0004456994f6c1d5db7327990386d33c01cff32a,,Background: Influenza immunisation for healthc...,The German standing commission for immunisatio...,Chris J Williams|Brunhilde Schweiger|Genia D...,STIKO: Mitteilung der Ständigen Impfkommission...,
6,0004774b55eb0dad880aba9b572efe362660c5e0,Disaster Perceptions,,". So, if there is no singular definition of ri...",,Principles of emergency planning and managemen...,


In [10]:
cord19_df.iloc[82]

id                     0043d044273b8eb1585d3a66061e9b4e03edc062
title         Evaluation of the tuberculosis programme in Ni...
abstract      Background: Tuberculosis is a devastating dise...
full_body     The Ministry of Health of the People's Republi...
authors       Yu Rong Yang|Donald P Mcmanus|Darren J Gray|Xi...
references    Analysis of factors affecting the epidemiology...
subjects                                                       
Name: 82, dtype: object
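
To make the role of *to_dict* more concrete, here is a minimal sketch of how such a class could flatten its fields into one record per article. `ArticleSketch` is a hypothetical stand-in, not the actual *ResearchArticle* implementation; in particular, joining list fields with `|` is an assumption inferred from the pipe-separated authors shown in the output above:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ArticleSketch:
    """Hypothetical stand-in for ResearchArticle; the real class may differ."""
    id: str
    title: str = ""
    abstract: str = ""
    full_body: str = ""
    authors: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    subjects: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, str]:
        # Assumed convention: list fields are flattened into pipe-separated strings.
        return {
            "id": self.id,
            "title": self.title,
            "abstract": self.abstract,
            "full_body": self.full_body,
            "authors": "|".join(self.authors),
            "references": "|".join(self.references),
            "subjects": "|".join(self.subjects),
        }


articles = [
    ArticleSketch(id="a1", title="Example title", authors=["A Author", "B Author"]),
    ArticleSketch(id="b2", full_body="Body only."),
]
records = [article.to_dict() for article in articles]
print(records[0]["authors"])
```

A list of dictionaries like `records` is the kind of input that `pd.DataFrame.from_records` turns into one row per article. In this sketch, missing fields become empty strings rather than null values, which would be consistent with the empty-string checks used later in the notebook.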

### Cleaning and feature engineering

In this section we are going to clean the dataset and add additional features to the data that could be useful in later phases.

Since the body text of each article is already in a good format, we will just remove extra spaces from it:

In [11]:
import re

cord19_df['text_cleaned'] = cord19_df['full_body'].apply(lambda x: re.sub(' +', ' ', x).strip())
cord19_df['text_cleaned'].loc[0][:500]

'Feline infectious peritonitis (FIP) is a fatal chronic disease in cats caused by a coronavirus, feline infectious peritonitis virus (FIPV), and characterized by granulomatous lesions formed at the serosae of different organs. Two forms can be distinguished. Cats suffering from the wet or effusive form have exudates in their body cavities. Exudate is absent in the second form, hence the name dry or non-effusive form.\n FIPV-infected cells are detected in the pyogranulomas and, based on morphology '

We can also make use of the function defined in the setup section to check how many cells have no value (empty string):

In [12]:
print_empty_cols(cord19_df)

id
--
0 articles have no value for column id
Int64Index([], dtype='int64')


title
-----
8084 articles have no value for column title
Int64Index([    1,     5,     7,    21,    22,    35,    47,    54,    98,
              154,
            ...
            75234, 75241, 75260, 75264, 75269, 75270, 75280, 75283, 75287,
            75297],
           dtype='int64', length=8084)


abstract
--------
78252 articles have no value for column abstract
Int64Index([     1,      2,      4,      6,      7,      9,     13,     14,
                15,     18,
            ...
            129993, 129994, 129995, 129996, 129997, 129998, 129999, 130000,
            130001, 130002],
           dtype='int64', length=78252)


full_body
---------
0 articles have no value for column full_body
Int64Index([], dtype='int64')


authors
-------
7722 articles have no value for column authors
Int64Index([     1,      6,     18,     21,     22,     35,     43,     47,
                54,     86,
            ...
     

From the data above we can see that more than half of the articles (78252) do not have a value for the abstract, and some of them (7722) do not have a value for their authors.

Even more importantly, although none of the full_body cells are empty, 21 of the text_cleaned cells are. We are going to explore this to see what may be causing the issue:

In [13]:
cord19_df.iloc[100983].full_body

'\n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n\n \n\n'

It seems that although all of the articles have a value for the text body, for some of them it consists solely of newlines and spaces, which are removed when producing the *text_cleaned* column. We can remove these articles from the DataFrame, since they can't be used for the next phases of text processing and topic extraction.
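
The effect described above can be reproduced in isolation. The sketch below reuses the expression from the cleaning cell on two made-up strings (`collapse_spaces`, `blank_body` and `regular_body` are names introduced only for this example):

```python
import re


def collapse_spaces(text):
    # Same expression as in the text_cleaned cell:
    # squeeze runs of spaces, then trim both ends.
    return re.sub(" +", " ", text).strip()


blank_body = "\n\n\n \n\n\n \n\n"
regular_body = "First  paragraph.\n Second   paragraph. "

# The raw body is not an empty string, so a check against '' does not flag it.
print(blank_body == "")
print(repr(collapse_spaces(blank_body)))
print(repr(collapse_spaces(regular_body)))
```

The regular expression only touches space characters, so newlines inside the text are kept. It is `strip()`, which removes every kind of leading and trailing whitespace, that turns a whitespace-only body into an empty string.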

In [14]:
empty_index = cord19_df[cord19_df['text_cleaned'] == ''].index
cord19_df.drop(empty_index, inplace=True)

Finally, we will also add a new column to the DataFrame with the number of characters of each article. This column will be useful to detect some anomalies and to evaluate which models will perform better or worse with the given text size:

In [15]:
cord19_df['num_chars_text'] = cord19_df['text_cleaned'].apply(len)

### Initial exploration

In [16]:
cord19_df['abstract'].describe(include='all')

count     129982
unique     50264
top             
freq       78231
Name: abstract, dtype: object

In [17]:
cord19_df.iloc[1000].num_chars_text

23572

In [18]:
cord19_df['num_chars_text'].describe()

count    1.299820e+05
mean     2.666625e+04
std      5.234404e+04
min      1.000000e+00
25%      1.068325e+04
50%      2.011000e+04
75%      3.126175e+04
max      4.111331e+06
Name: num_chars_text, dtype: float64

We can see that although the mean article length is approximately 2.67 × 10^4 characters, the longest article has more than 4 million characters. Given such a large discrepancy, we are going to investigate further and check how many articles have more than 150,000 characters:

In [19]:
long_articles_index = cord19_df[cord19_df['num_chars_text'] > 1.5e5].index
len(long_articles_index)

1115
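
To put this count in proportion, it can be related to the number of rows reported by `describe()` above. A short arithmetic sketch, with both figures copied from the outputs in this notebook:

```python
total_articles = 129_982   # rows left after dropping the 21 empty articles
long_articles = 1_115      # articles above 150,000 characters

share = long_articles / total_articles * 100
print(f"{share:.2f}% of the articles exceed 150,000 characters")
```

This gives about 0.86% of the dataset.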

We can see that almost 1% of the articles are considerably longer than the mean. We are going to look more closely at these length discrepancies with a histogram:

In [20]:
CORD19_HIST_COLUMN = "num_chars_text"
CORD19_HIST_TITLE = "Article length distribution for CORD19 dataset"
CORD19_HIST_XLABEL = "Article length (# of characters)"
CORD19_HIST_YLABEL = "Number of articles"

hist.load_plot(cord19_df, CORD19_HIST_COLUMN, CORD19_HIST_TITLE,
          CORD19_HIST_XLABEL, CORD19_HIST_YLABEL, True)

The histogram above confirms that almost all of the articles are shorter than 150,000 characters, as we have seen before. To examine the length distribution at a finer level of detail, we are going to remove the roughly 1% of very long articles before displaying the histogram again:

In [21]:
hist.load_plot(cord19_df.drop(long_articles_index), CORD19_HIST_COLUMN,
          f"{CORD19_HIST_TITLE} (w/o 1% longest data)",
          CORD19_HIST_XLABEL, CORD19_HIST_YLABEL, True)

Now we can see that most of the articles lie roughly in the 10,000-30,000 character range, which is consistent with the quartiles reported above (about 10,700 and 31,300 characters).

In [22]:
hist.save_plot(os.path.join(RESULTS_DIR, '1_COVID19_length.svg'))

### Serializing the dataframe

We are going to serialize the dataframe for this dataset before further experimentation in the following notebooks:

In [23]:
CORD19_FILE_PATH = os.path.join(CORD_DATASET_DIR, 'cord19_dataframe.pkl')

cord19_df.to_pickle(CORD19_FILE_PATH)
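
For reference, a later notebook could restore the serialized DataFrame with `pd.read_pickle`. The following is a minimal round-trip sketch on a made-up frame written to a temporary directory; `toy_df` and the file name are assumptions for illustration only:

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({"id": ["a1", "b2"], "num_chars_text": [120, 45]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataframe.pkl")
    toy_df.to_pickle(path)
    restored_df = pd.read_pickle(path)

# Values, column dtypes and the index survive the round trip.
print(restored_df.equals(toy_df))
```

Pickle preserves dtypes and the index, which makes it convenient between notebooks of the same project. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.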

## Dataset 2: Agriculture
In this section we are going to load and parse the Agriculture dataset. This dataset consists of a list of articles available in Europe PMC.

### Loading the data

In [5]:
import glob

AGRICULTURE_DATASET_DIR = os.path.join(DATA_DIR, 'agriculture')

xml_filenames = glob.glob(f'{AGRICULTURE_DATASET_DIR}/**/*.xml', recursive=True)

In [6]:
pmc_dataset_xml = []
for filename in xml_filenames:
    with open(filename, 'rb') as f:
        pmc_dataset_xml.append(f.read())

### Parsing the data

In the *src/agriculture/data_reader.py* module we have a series of functions to parse the contents of the XML files returned by the API. These functions return an instance of the *ResearchArticle* class for each article given, just like with the previous dataset:

In [7]:
from src.agriculture import parse_pmc_article

pmc_articles = [parse_pmc_article(article_xml) for article_xml in pmc_dataset_xml]
pmc_articles[0]

6736833 - Soil temperature and hydric potential influences the monthly variations of soil Tuber aestivum DNA in a highly productive orchard - Tuber aest... - Introduction Ectomyc...
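
To give an idea of what this kind of parsing involves, the sketch below extracts a few fields from a made-up, heavily simplified JATS-style document using only the standard library. `parse_article_sketch` and `sample_xml` are hypothetical and are not the implementation found in *src/agriculture/data_reader.py*; the tag layout is an assumption about the general shape of the files:

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified JATS-style document; real files are far richer.
sample_xml = b"""<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmcid">0000000</article-id>
      <title-group><article-title>Example title</article-title></title-group>
      <abstract><p>Example abstract.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
  </body>
</article>"""


def parse_article_sketch(xml_bytes):
    """Hypothetical parser; not the implementation in src/agriculture."""
    root = ET.fromstring(xml_bytes)

    def text_of(path):
        node = root.find(path)
        if node is None:
            return ""
        # itertext() walks nested tags; split/join normalises whitespace.
        # Joining with spaces is a simplification that can split inline markup.
        return " ".join(" ".join(node.itertext()).split())

    return {
        "id": text_of(".//article-id"),
        "title": text_of(".//article-title"),
        "abstract": text_of(".//abstract"),
        "full_body": text_of(".//body"),
    }


parsed = parse_article_sketch(sample_xml)
print(parsed)
```

In this sketch a missing element yields an empty string instead of raising an error, so every article produces a complete record.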

### Creating the dataframe

Now that we have the list of articles from Europe PMC, we can proceed to create a pandas DataFrame to work with the data:

In [8]:
import pandas as pd

pmc_df = pd.DataFrame.from_records([article.to_dict() for article in pmc_articles])
pmc_df.head(n=7)

Unnamed: 0,id,title,abstract,full_body,authors,references,subjects
0,6736833,Soil temperature and hydric potential influenc...,"Tuber aestivum, also known as the summer or Bu...","Introduction Ectomycorrhizal fungi, i.e., whic...",Todesco Flora|Belmondo Simone|Guignet Yoann|La...,Roots and associated fungi drive long‐term car...,
1,6570029,The economic value of mussel farming for uncer...,Mussel farming has been recognised as a low co...,"Introduction Like many other seas and lakes, t...",Gren Ing-Marie,Eutrophication and hypoxia in coastal areas: A...,Biology and Life Sciences|Organisms|Eukaryota|...
2,5620588,Differential Mechanisms of Photosynthetic Accl...,Photosynthetic organisms are able to sense ene...,1. Introduction Photosynthesis is a highly coo...,Khanal Nityananda|Bray Geoffrey E.|Grisnich An...,Photostasis and cold acclimation: Sensing low ...,
3,3818224,Enhanced Methanol Production in Plants Provide...,Plants naturally emit methanol as volatile org...,Introduction Insect pests cause approximately ...,Dixit Sameer|Upadhyay Santosh Kumar|Singh Harp...,Pesticides and pest control|Biotechnology as a...,
4,4397498,Plant defense phenotypes determine the consequ...,Plants are at the trophic base of terrestrial ...,"Introduction In The Origin of Species , Darwi...",Schuman Meredith C|Allmann Silke|Baldwin Ian T,Population studies in predominantly self-polli...,Ecology|Plant Biology
5,5447229,Endophytic Paecilomyces formosus LHL10 Augment...,This study investigated the Ni-removal efficie...,Introduction Rapid industrialization has contr...,Bilal Saqib|Khan Abdul L.|Shahzad Raheem|Asaf ...,“Soybean under abiotic stress: proteomic appro...,Plant Science|Original Research
6,5762720,Ultraviolet-B enhances the resistance of multi...,Land plants protect themselves from ultraviole...,Introduction Insect feeding is one of the majo...,Qi Jinfeng|Zhang Mou|Lu Chengkai|Hettenhausen ...,Herbivore-associated elicitors: FAC signaling ...,


In [9]:
pmc_df.iloc[82]

id                                                      5935394
title         Diversification and intensification of agricul...
abstract      Smallholder farming systems are vulnerable to ...
full_body     Introduction Smallholder farming systems, and ...
authors       Chen Minjie|Wichmann Bruno|Luckert Marty|Winow...
references    Food security: the challenge of feeding 9 bill...
subjects      Biology and Life Sciences|Agriculture|Crop Sci...
Name: 82, dtype: object

In [10]:
pmc_df.iloc[0].full_body[:300]

'Introduction Ectomycorrhizal fungi, i.e., which live in symbiosis with tree and shrubs, play important roles in forest functioning and biogeochemical cycles 1 . In boreal forests, 50–70% of the carbon stored in the soil is derived from roots and root-associated microorganisms such as ectomycorrhizal'

### Cleaning and feature engineering

In the following cells we are going to define a simple function to clean the body text of each article, and apply it to the dataframe to obtain a column with the cleaned text:

In [14]:
import re


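def clean(text):
    # Replace hair spaces (U+200A) with regular spaces so the regex below can collapse them
    text = text.replace('\u200a', ' ')
    # Collapse runs of spaces into a single space and trim both ends
    return re.sub(' +', ' ', text).strip()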


In [15]:
pmc_df['text_cleaned'] = pmc_df['full_body'].apply(clean)
pmc_df['text_cleaned'].loc[0][:500]

'Introduction Ectomycorrhizal fungi, i.e., which live in symbiosis with tree and shrubs, play important roles in forest functioning and biogeochemical cycles 1 . In boreal forests, 50–70% of the carbon stored in the soil is derived from roots and root-associated microorganisms such as ectomycorrhizal fungi 2 . Besides forest ecosystems, ectomycorrhizal trees were also implanted in agroforestry ecosystems and in dedicated orchards for producing non-wood products such as edible fungi. The inoculati'
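
The extra replacement step in `clean` matters because the regular expression only matches the ordinary space character. A small sketch on a made-up string (`raw` is an assumption for illustration) shows the difference:

```python
import re
import unicodedata


def clean(text):
    text = text.replace("\u200a", " ")
    return re.sub(" +", " ", text).strip()


raw = "soil\u200a\u200atemperature  and   moisture "

# U+200A is a distinct Unicode character, not a regular space.
print(unicodedata.name("\u200a"))

# Without the replacement, the hair spaces survive the regex.
regex_only = re.sub(" +", " ", raw).strip()
print("\u200a" in regex_only)

print(repr(clean(raw)))
```

With the replacement in place, the hair spaces are first mapped to regular spaces and then collapsed together with the other runs of spaces.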

Finally, we will also define a new column with the number of characters of each article, just like with the CORD19 dataset:

In [16]:
pmc_df['num_chars_text'] = pmc_df['text_cleaned'].apply(len)

### Initial exploration of the data

We will begin this section by checking if there are any empty or null values in our dataset:

In [17]:
pmc_df.isnull().sum()

id                0
title             0
abstract          0
full_body         0
authors           0
references        0
subjects          0
text_cleaned      0
num_chars_text    0
dtype: int64
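
A null count of zero does not rule out empty strings, because pandas treats `''` as a regular value rather than as missing data. A minimal sketch with a made-up column (`toy_df` is an assumption for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({"subjects": ["Ecology", "", None]})

# Only the None entry counts as null; the empty string does not.
null_count = toy_df["subjects"].isnull().sum()
empty_count = toy_df["subjects"].eq("").sum()

print(null_count, empty_count)
```

Both checks are therefore needed to get a complete picture of the missing values.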

Although there are no null values in the dataset, some of the strings could still be empty. We are going to check this quickly by iterating over all of the columns of the DataFrame:

In [18]:
print_empty_cols(pmc_df)

id
--
0 articles have no value for column id
Int64Index([], dtype='int64')


title
-----
0 articles have no value for column title
Int64Index([], dtype='int64')


abstract
--------
0 articles have no value for column abstract
Int64Index([], dtype='int64')


full_body
---------
0 articles have no value for column full_body
Int64Index([], dtype='int64')


authors
-------
0 articles have no value for column authors
Int64Index([], dtype='int64')


references
----------
0 articles have no value for column references
Int64Index([], dtype='int64')


subjects
--------
61 articles have no value for column subjects
Int64Index([  0,   2,   3,   6,  13,  17,  18,  19,  21,  23,  24,  25,  27,
             29,  31,  33,  37,  39,  44,  47,  48,  51,  53,  54,  56,  57,
             60,  61,  65,  67,  68,  69,  71,  72,  74,  75,  76,  77,  81,
             86,  87,  88,  91,  93,  99, 101, 103, 105, 106, 108, 109, 110,
            112, 113, 114, 115, 116, 122, 123, 124, 125],
           dtype='int



We can see above that some of the articles have an empty value for their subjects. However, every article has values for the other columns, so we don't need to drop any row from the DataFrame.

Finally, we are going to explore the character length of the articles in the dataset:

In [19]:
pmc_df['num_chars_text'].describe()

count       126.000000
mean      50685.785714
std       19843.406328
min       14335.000000
25%       36041.000000
50%       46427.500000
75%       60482.250000
max      109010.000000
Name: num_chars_text, dtype: float64

In [20]:
PMC_HIST_COLUMN = 'num_chars_text'
PMC_HIST_TITLE = "Article length distribution for the Agriculture dataset"
PMC_HIST_XLABEL = "Article length (# of characters)"
PMC_HIST_YLABEL = "Number of articles"

hist.load_plot(pmc_df, PMC_HIST_COLUMN, PMC_HIST_TITLE,
          PMC_HIST_XLABEL, PMC_HIST_YLABEL, True)

In [21]:
hist.save_plot(os.path.join(RESULTS_DIR, '1_Agriculture_length.svg'))

There was an error exporting the plot. Please verify that both Selenium and Geckodriver are installed: Neither firefox and geckodriver nor a variant of chromium browser and chromedriver are available on system PATH. You can install the former with 'conda install -c conda-forge firefox geckodriver'.


### Serializing the dataframe

In [22]:
PMC_FILE_PATH = os.path.join(AGRICULTURE_DATASET_DIR, 'pmc_dataframe.pkl')

pmc_df.to_pickle(PMC_FILE_PATH)