# 8. Evaluation of results
In this final notebook we propose an evaluation metric to check the performance of our final system on the publication track. As ground truth we will rely on the default subjects assigned to each publication by the EuropePMC API. A more accurate evaluation could use a list of subjects generated by a human committee. For the scope of this challenge, however, we rely on the subjects returned by the API and leave human annotation for future work.

## Setup

In [1]:
%run __init__.py

In [2]:
import os

import pandas as pd

PMC_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'pmc_dataframe.pkl')

pmc_df = pd.read_pickle(PMC_FILE_PATH)

## Analyzing the categories
We will begin by copying the publications dataframe and dropping the rows that have no subjects:

In [3]:
import numpy as np

categories_df = pmc_df.copy()
# Treat empty subject strings as missing values, then drop those rows
categories_df['subjects'] = categories_df['subjects'].replace('', np.nan)
categories_df = categories_df.dropna(subset=['subjects'])
categories_df.head(n=4)

,id,title,abstract,full_body,authors,references,subjects,text_cleaned,num_chars_text
0,PMC3310815,Induced Release of a Plant-Defense Volatile ‘D...,Transmission of plant pathogens by insect vect...,Introduction Transmission of plant pathogens b...,Mann Rajinder S.|Ali Jared G.|Hermann Sara L.|...,Insect vector relationships with procaryotic p...,Agriculture|Crops|Pest Control|Biology|Ecology...,Introduction Transmission of plant pathogens b...,54143
1,PMC3547067,Carbon and Nitrogen Isotopic Survey of Norther...,The development of isotopic baselines for comp...,Introduction Stable isotope analysis is an imp...,Szpak Paul|White Christine D.|Longstaffe Fred ...,Influence of diet on the distribution of carbo...,Biology|Ecology|Biogeochemistry|Paleontology|P...,Introduction Stable isotope analysis is an imp...,74402
3,PMC3672096,Emissions of CH4 and N2O under Different Tilla...,Understanding greenhouse gases (GHG) emissions...,Introduction With the current rise in global t...,Zhang Hai-Lin|Bai Xiao-Lin|Xue Jian-Fu|Chen Zh...,Simulation of fluxes of greenhouse gases from ...,Agriculture|Agricultural Biotechnology|Agricul...,Introduction With the current rise in global t...,29852
7,PMC3904951,Diversity and Spatial Structure of Belowground...,Plant–mycorrhizal fungal interactions are ubiq...,Introduction More than 90% of wild terrestrial...,Toju Hirokazu|Sato Hirotoshi|Tanabe Akifumi S.,Mycorrhizal associations and other means of nu...,Agriculture|Forestry|Biology|Ecology|Community...,Introduction More than 90% of wild terrestrial...,57644


In [4]:
articles_categories = {aid: categories.split('|') 
                       for aid, categories in zip(categories_df['id'].values,
                                                  categories_df['subjects'].values)}
articles_categories['PMC6394436']

['Review Article']

As we can see above, some of the subjects are not really representative of the main topic of the article. For example, the subject 'Review Article' describes the type of publication rather than its content, so it would only add noise to the evaluation.

In the following cells we clean the subjects to remove those that are not useful for evaluating our models. First, we remove a list of subjects that we consider non-relevant. Next, we remove numeric subjects and those subjects that have no match in WordNet, which will be used later on to perform the evaluation. Finally, the articles left without any subject are removed from the final sample:

In [5]:
from herc_common.evaluation import _get_synset

# Generic article-type labels that say nothing about the topic
stop_categories = {"research", "research paper",
                   "review article", "research articles",
                   "regular papers", "basic research papers",
                   "original paper", "research papers",
                   "primary research articles", "primary research article",
                   "original article", "original articles", "original research"}

# Keep only subjects that are informative, non-numeric and known to WordNet
filtered_articles_categories = {
    k: {el for el in v
        if el.lower() not in stop_categories
        and not el.isnumeric()
        and _get_synset(el) is not None}
    for k, v in articles_categories.items()
}

final_articles_categories = {
    k: v
    for k, v in filtered_articles_categories.items()
    if len(v) != 0
}

len(final_articles_categories)

26
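
The filtering rule used in the cell above can be illustrated on a toy input. This is only a minimal sketch: `KNOWN_TERMS` is a hypothetical stand-in for the WordNet lookup done by `_get_synset` (whose implementation is not shown in this notebook), and the article ids and subjects are made up for illustration.

```python
# Hypothetical stand-in for the WordNet lookup: terms assumed to have a synset.
KNOWN_TERMS = {"agriculture", "biology", "ecology"}

# Shortened stop list, for illustration only.
STOP_CATEGORIES = {"research", "review article", "original article"}


def keep_subject(subject):
    """Return True if the subject passes the three filters."""
    lowered = subject.lower()
    return (lowered not in STOP_CATEGORIES   # not a generic article-type label
            and not subject.isnumeric()      # not a bare number
            and lowered in KNOWN_TERMS)      # stands in for `_get_synset(...) is not None`


def clean_subjects(articles):
    """Filter each article's subjects and drop articles left with none."""
    filtered = {aid: {s for s in subjects if keep_subject(s)}
                for aid, subjects in articles.items()}
    return {aid: subjects for aid, subjects in filtered.items() if subjects}


toy_articles = {
    "A1": ["Agriculture", "Pest Control", "Biology"],
    "A2": ["Review Article"],
    "A3": ["2019", "Ecology"],
}
cleaned = clean_subjects(toy_articles)
print(cleaned)
```

In this toy run, article `A2` disappears because its only subject is a stop category, which is the same mechanism that reduces the real sample to 26 publications.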

In [6]:
final_articles_categories

{'PMC3310815': {'Agriculture', 'Biology', 'Crops', 'Ecology', 'Zoology'},
 'PMC3547067': {'Anthropology',
  'Archaeology',
  'Biology',
  'Chemistry',
  'Earth Sciences',
  'Ecology',
  'Geochemistry',
  'Isotopes',
  'Paleobotany',
  'Paleontology',
  'Radiochemistry'},
 'PMC3672096': {'Agriculture',
  'Biology',
  'Climate Change',
  'Climatology',
  'Earth Sciences',
  'Ecology'},
 'PMC3904951': {'Agriculture',
  'Biodiversity',
  'Biology',
  'Botany',
  'Ecology',
  'Forestry',
  'Fungi',
  'Microbiology',
  'Mycology'},
 'PMC3951315': {'Agriculture',
  'Biodiversity',
  'Biology',
  'Crops',
  'Ecology',
  'Ecosystems',
  'Microbiology',
  'Virology'},
 'PMC4067337': {'Agriculture',
  'Biotechnology',
  'Crops',
  'Ecology',
  'Microbiology',
  'Rice'},
 'PMC4128675': {'Biochemistry', 'Hormones', 'Plant Hormones'},
 'PMC4171492': {'Microbiology', 'Molecular Biology', 'Tobacco Mosaic Virus'},
 'PMC4397498': {'Ecology'},
 'PMC4623198': {'Environmental Science'},
 'PMC4981420': {'Ag

As we can see above, only 26 of the original publications retain one or more subjects that can be compared with the output of our models.

## Evaluation
For the evaluation of our system we will use WordNet to obtain a semantic similarity score between the topics predicted by our system and those used as ground truth.

Before we can calculate these similarity scores, we need the topics predicted by our system. First, we load the language models and then the final pipeline that was saved in the previous notebook:

In [7]:
import string

import en_core_sci_lg
import en_core_web_md

from collections import Counter

from tqdm import tqdm

en_core_web_md.load()
en_core_sci_lg.load()

<spacy.lang.en.English at 0x152d311c888>

In [8]:
from herc_common.utils import load_object

final_pipe = load_object(os.path.join(NOTEBOOK_7_RESULTS_DIR, 'final_pipe.pkl'))

Now, we will select the sample of publications with at least one ground truth subject, and obtain the output of our system for those articles:

In [9]:
# Use a list so the keys can be passed directly to `.loc`
articles_keys = list(final_articles_categories.keys())
X = categories_df.set_index('id').loc[articles_keys, 'text_cleaned'].values

In [10]:
y_base = final_articles_categories.values()
y_pred = final_pipe.transform(X)


In [11]:
y_pred = [[topic[0] for topic in doc] for doc in y_pred]
y_pred[:5]

[['organism',
  'chemistry',
  'breastfeeding',
  'pharmacology',
  'sociology',
  'chemical element',
  'forestry science'],
 ['protein',
  'specialty',
  'statistics',
  'area studies',
  'agriculture',
  'regional studies',
  'science'],
 ['agriculture',
  'physics',
  'area studies',
  'regional studies',
  'occurrence',
  'physical property',
  'geographic region'],
 ['organism',
  'statistics',
  'biology',
  'taxon',
  'mathematical analysis',
  'specialty',
  'forestry science'],
 ['organism',
  'taxon',
  'area studies',
  'regional studies',
  'pharmacology',
  'anthropology',
  'biology']]

## Similarity
In this section we will be calculating the similarity scores between the topics inferred by the model and the ones used as ground truth:

In [12]:
from herc_common.evaluation import compute_similarity_scores

scores = [compute_similarity_scores(y_b, y_p, 'lch_similarity')
          for y_b, y_p in zip(y_base, y_pred)]
scores[:5]

[{'max similarity': 2.2512917986064953,
  'min similarity': 0.9295359586241757,
  'mean similarity': 1.4453305903895892,
  'median similarity': 1.3350010667323402},
 {'max similarity': 2.538973871058276,
  'min similarity': 1.072636802264849,
  'mean similarity': 1.6911887128516345,
  'median similarity': 1.55814461804655},
 {'max similarity': 3.6375861597263857,
  'min similarity': 1.4403615823901665,
  'mean similarity': 2.2000547269033945,
  'median similarity': 1.6916760106710724},
 {'max similarity': 3.6375861597263857,
  'min similarity': 1.3350010667323402,
  'mean similarity': 2.152276345246925,
  'median similarity': 1.6916760106710724},
 {'max similarity': 3.6375861597263857,
  'min similarity': 1.3350010667323402,
  'mean similarity': 2.2771072070615235,
  'median similarity': 2.0281482472922856}]
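
Each dictionary above summarises a set of topic-to-topic scores for one publication. The implementation of `compute_similarity_scores` is not shown in this notebook, so the following is only a sketch under an assumption: that every (ground truth, predicted) pair is scored and the scores are then aggregated. Both `summarize_pairwise` and `toy_similarity` are hypothetical names, and the toy similarity is a simple word overlap rather than a WordNet measure.

```python
from itertools import product
from statistics import mean, median


def toy_similarity(a, b):
    """Hypothetical similarity: Jaccard overlap of the words, not WordNet."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def summarize_pairwise(truth, predicted, similarity):
    """Score every (truth, predicted) pair and summarise the scores."""
    pair_scores = [similarity(t, p) for t, p in product(truth, predicted)]
    return {
        'max similarity': max(pair_scores),
        'min similarity': min(pair_scores),
        'mean similarity': mean(pair_scores),
        'median similarity': median(pair_scores),
    }


summary = summarize_pairwise(['Ecology', 'Earth Sciences'],
                             ['ecology', 'earth', 'physics'],
                             toy_similarity)
print(summary)
```

Under this reading, the maximum reflects the best-matching pair of topics, while the mean and median are pulled down by every predicted topic that is unrelated to the ground truth.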

In [19]:
final_similarity = np.mean([score['mean similarity'] for score in scores])
final_similarity

2.1098170676819126
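
The scale of this value depends on the metric. Assuming `'lch_similarity'` refers to the Leacock-Chodorow measure as exposed by NLTK's WordNet interface (the notebook does not show this), the score for two synsets is -log((d + 1) / (2 D)), where d is the shortest path distance between them and D is the maximum depth of the taxonomy, so it is not bounded to [0, 1]. The sketch below evaluates that formula directly. The depth of 19 is an assumption, chosen because it reproduces the maximum of about 3.6376 seen in the per-publication scores.

```python
import math


def lch_score(path_distance, taxonomy_depth):
    """Leacock-Chodorow similarity from a path distance (in edges) and a taxonomy depth."""
    return -math.log((path_distance + 1) / (2.0 * taxonomy_depth))


ASSUMED_DEPTH = 19  # assumption, see the text above

# Identical concepts (distance 0) give the upper bound, log(2 * depth).
upper_bound = lch_score(0, ASSUMED_DEPTH)
print(round(upper_bound, 4))  # 3.6376

# Scores shrink as the concepts move further apart in the taxonomy.
for distance in (1, 3, 7):
    print(distance, round(lch_score(distance, ASSUMED_DEPTH), 4))
```

Under that assumption, a mean similarity of about 2.11 should be read against an exact-match value of about 3.64 rather than against 1.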

## Saving the results
Finally, we are going to save the results. First of all, the predictions will be saved to a new dataframe:

In [13]:
cols_subset = ['title', 'subjects']

# Select rows and columns in one step and copy, so the assignments below modify a new dataframe
results_df = categories_df.set_index('id').loc[list(articles_keys), cols_subset].copy()
results_df['Topics Predicted'] = ['\n'.join(topics) for topics in y_pred]
results_df['subjects'] = ['\n'.join(topics) for topics in y_base]
results_df.head()

id,title,subjects,Topics Predicted
PMC3310815,Induced Release of a Plant-Defense Volatile ‘D...,Ecology\nCrops\nZoology\nAgriculture\nBiology,organism\nchemistry\nbreastfeeding\npharmacolo...
PMC3547067,Carbon and Nitrogen Isotopic Survey of Norther...,Paleobotany\nEcology\nAnthropology\nRadiochemi...,protein\nspecialty\nstatistics\narea studies\n...
PMC3672096,Emissions of CH4 and N2O under Different Tilla...,Ecology\nEarth Sciences\nAgriculture\nClimatol...,agriculture\nphysics\narea studies\nregional s...
PMC3904951,Diversity and Spatial Structure of Belowground...,Fungi\nForestry\nBotany\nEcology\nMicrobiology...,organism\nstatistics\nbiology\ntaxon\nmathemat...
PMC3951315,Effects of Introduced and Indigenous Viruses o...,Ecology\nCrops\nMicrobiology\nAgriculture\nEco...,organism\ntaxon\narea studies\nregional studie...


In [14]:
scores_df = pd.DataFrame.from_records(scores)
scores_df.set_index(results_df.index, inplace=True)
scores_df.head()

id,max similarity,min similarity,mean similarity,median similarity
PMC3310815,2.251292,0.929536,1.445331,1.335001
PMC3547067,2.538974,1.072637,1.691189,1.558145
PMC3672096,3.637586,1.440362,2.200055,1.691676
PMC3904951,3.637586,1.335001,2.152276,1.691676
PMC3951315,3.637586,1.335001,2.277107,2.028148


Now the scores obtained for each publication are joined to the predictions, and the combined dataframe is saved:

In [15]:
final_df = results_df.join(scores_df)
final_df.head()

id,title,subjects,Topics Predicted,max similarity,min similarity,mean similarity,median similarity
PMC3310815,Induced Release of a Plant-Defense Volatile ‘D...,Ecology\nCrops\nZoology\nAgriculture\nBiology,organism\nchemistry\nbreastfeeding\npharmacolo...,2.251292,0.929536,1.445331,1.335001
PMC3547067,Carbon and Nitrogen Isotopic Survey of Norther...,Paleobotany\nEcology\nAnthropology\nRadiochemi...,protein\nspecialty\nstatistics\narea studies\n...,2.538974,1.072637,1.691189,1.558145
PMC3672096,Emissions of CH4 and N2O under Different Tilla...,Ecology\nEarth Sciences\nAgriculture\nClimatol...,agriculture\nphysics\narea studies\nregional s...,3.637586,1.440362,2.200055,1.691676
PMC3904951,Diversity and Spatial Structure of Belowground...,Fungi\nForestry\nBotany\nEcology\nMicrobiology...,organism\nstatistics\nbiology\ntaxon\nmathemat...,3.637586,1.335001,2.152276,1.691676
PMC3951315,Effects of Introduced and Indigenous Viruses o...,Ecology\nCrops\nMicrobiology\nAgriculture\nEco...,organism\ntaxon\narea studies\nregional studie...,3.637586,1.335001,2.277107,2.028148


In [16]:
final_df.to_csv(os.path.join(NOTEBOOK_8_RESULTS_DIR, 'agriculture_scores.csv'))