# 4. Named Entity Recognition
In this notebook we will explore the Named Entity Recognition (NER) approaches used for the publications track. We will begin by defining a class that conforms to the sklearn API and uses spaCy models to return the named entities recognized in a given text. After that, we will try out several spaCy models and choose the one that best fits the requirements of the track.

Once the named entity recognition model has been chosen, we will be using it in the following notebooks to obtain an additional list of potential topics from the entities recognized in the text.

## Setup
As always, we will begin by importing the common functionality of every notebook, and importing some of the libraries we will be using:

In [1]:
%run __init__.py

In [2]:
import os
import pandas as pd



We are also going to define the class that will be used to extract the entities from a given piece of text. This class conforms to the sklearn API, so it can be easily integrated into our text pipelines. It also provides additional functionality, such as obtaining the most common entities or visualizing them in the notebook:

In [3]:
from collections import Counter

from sklearn.base import TransformerMixin, BaseEstimator
from spacy import displacy


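class NamedEntityRecognizer(BaseEstimator, TransformerMixin):
    """sklearn-style transformer that extracts named entities with a spaCy model.

    spacy_model: an imported spaCy model package (e.g. en_core_sci_lg).
    disable: entity labels to discard from the results.
    min_entity_counts: minimum number of mentions for an entity to be kept;
        if None, transform returns every mention unfiltered.
    max_entities: number of most frequent entities considered per text
        (only applied when min_entity_counts is set).
    """

    def __init__(self, spacy_model, disable=None, min_entity_counts=None,
                 max_entities=None):
        # The model is loaded once here and reused for every text.
        self.nlp = spacy_model.load()
        self.disable = disable if disable is not None else []
        self.min_entity_counts = min_entity_counts
        self.max_entities = max_entities

    def fit(self, X, y=None):
        # Nothing to learn: the spaCy model is already trained.
        return self

    def transform(self, X, *args, **kwargs):
        # One list of entity mentions per input text.
        entities_texts = [self.get_entities(text) for text in X]
        if self.min_entity_counts is None:
            return entities_texts

        # Keep the distinct entities that are mentioned often enough,
        # ordered from most to least frequent.
        return [[entity_label
                for entity_label, entity_count in Counter(entities_text).most_common(self.max_entities)
                if entity_count >= self.min_entity_counts]
                for entities_text in entities_texts]

    def get_entities(self, text):
        doc = self.nlp(text)
        # Skip disabled entity types and very short mentions (1-2 characters).
        return [x.text for x in doc.ents
                if x.label_ not in self.disable
                and len(x.text) > 2]

    def get_most_common_entities(self, text, n=10):
        entities = self.get_entities(text)
        return Counter(entities).most_common(n)

    def visualize_entities(self, text, jupyter=True):
        # Renders the entities inline with displaCy's 'ent' style.
        doc = self.nlp(text)
        displacy.render(doc, jupyter=jupyter, style='ent')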
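
The filtering done in `get_entities` can be tried out without loading any spaCy model. The snippet below is only a minimal sketch: `FakeSpan` and `filter_entities` are hypothetical names introduced for illustration, and the spans and their labels are made up rather than produced by a real model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span: only the two attributes
# that get_entities reads (text and label_) are modelled here.
FakeSpan = namedtuple("FakeSpan", ["text", "label_"])


def filter_entities(spans, disable=None, min_length=3):
    """Keep entity texts whose label is allowed and whose text is long enough."""
    disable = disable if disable is not None else []
    return [span.text for span in spans
            if span.label_ not in disable and len(span.text) >= min_length]


# Made-up spans; a real model would assign its own labels.
spans = [
    FakeSpan("Alternaria brassicicola", "ORG"),
    FakeSpan("2019", "DATE"),
    FakeSpan("Sa", "PERSON"),
    FakeSpan("sorbitol", "PRODUCT"),
]
kept = filter_entities(spans, disable=["DATE"])
print(kept)
```

Here the `DATE` span is dropped because its label is disabled, and `Sa` is dropped because it is shorter than three characters.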


## Agriculture

We will begin by loading the pandas DataFrame with the agriculture dataset that has been saved in the data exploration notebook:

In [4]:
PMC_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'pmc_dataframe.pkl')

pmc_df = pd.read_pickle(PMC_FILE_PATH)
publications = pmc_df['text_cleaned'].values

The first spaCy model that we will try out is '_en\_core\_sci\_lg_' from [scispaCy](https://allenai.github.io/scispacy/). This model has been trained and optimized to work with scientific texts. We will also tell our named entity recognizer to return only the entities that appear at least 3 times in the text. This is done to avoid noise from entities that are mentioned only once or twice and are not core to the text:

In [5]:
import en_core_sci_lg

ner = NamedEntityRecognizer(en_core_sci_lg, min_entity_counts=3)
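
To make the effect of `min_entity_counts` more concrete, here is a small standalone sketch of the same counting logic used in `transform`. The function name `filter_by_frequency` and the list of mentions are hypothetical and only serve as an illustration:

```python
from collections import Counter


def filter_by_frequency(entities, min_count=3, max_entities=None):
    """Return entities seen at least min_count times, most frequent first."""
    counts = Counter(entities)
    # most_common(None) returns every entity, ordered by frequency.
    return [entity for entity, count in counts.most_common(max_entities)
            if count >= min_count]


# Hypothetical entity mentions, as they might come out of one document.
mentions = ["sorbitol"] * 4 + ["fungal"] * 3 + ["yeast"] * 2 + ["medium"]
frequent = filter_by_frequency(mentions, min_count=3)
print(frequent)  # ['sorbitol', 'fungal']
```

As in the class above, `max_entities` is applied before the minimum count check, so it limits how many of the most frequent entities are considered rather than how many are finally returned.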

In [6]:
text = publications[-2]

ents = ner.transform([text])
ents

[['A. brassicicola',
  'hydrophilins',
  'sorbitol',
  'fungal',
  'genes',
  'desiccation',
  'mutant',
  'wild-type',
  'proteins',
  'seeds',
  'treatment',
  'seed transmission',
  'mutants',
  'exposure to',
  'exposed to',
  'induced',
  'study',
  'water stress',
  'expression',
  'treatments',
  'strains',
  'abnik1',
  'response',
  'AbSih3',
  'abhog1',
  'susceptibility',
  'absch9',
  'amino acid',
  'inoculated',
  'capacity',
  'stress',
  'condition',
  'oxidative stress',
  'up-regulated',
  'strain',
  'plants',
  'Sa.',
  'AbHog1',
  'concentration',
  'decrease',
  'germlings',
  'treated',
  'silica gel',
  'regulation',
  'control',
  'conditions',
  'fungus',
  'accumulation',
  'data',
  'days',
  'analyzed',
  'exposure',
  'amino acids',
  'parental',
  'hydrophilin genes',
  'AbNik1',
  'medium',
  'Alternaria brassicicola',
  'synthesis',
  'yeast',
  'inhibition',
  'results',
  'silica gel beads',
  'analysis',
  'arabitol',
  'proteome',
  'deficient',
  ...]]

In [7]:
ner.get_most_common_entities(text, n=10)

[('A. brassicicola', 32),
 ('hydrophilins', 31),
 ('sorbitol', 31),
 ('fungal', 24),
 ('genes', 24),
 ('desiccation', 20),
 ('mutant', 19),
 ('wild-type', 19),
 ('proteins', 18),
 ('seeds', 17)]

In [8]:
ner.visualize_entities(text[:1500])

The next model that we will try out is the '_en\_core\_web\_md_' model from spaCy, which has been trained on general-purpose web text such as blogs and news. This model returns several different types of entities (the complete list is available in the [spaCy annotation reference](https://spacy.io/api/annotation#named-entities)), so we will configure our recognizer to discard some of those entity types:

In [9]:
import en_core_web_md

disallowed_types = ['CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'PERCENT', 'QUANTITY', 'TIME']
ner_basic = NamedEntityRecognizer(en_core_web_md, disable=disallowed_types)
ner_basic.visualize_entities(text[:1500])

The output above shows that, while the core\_web model distinguishes between several entity types, it detects far fewer entities than the generic scispaCy model, and some of the key entities in the text are missed.
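
A comparison like this one can also be backed with numbers instead of relying only on the visualization. The following is a rough sketch, assuming each recognizer exposes a callable that returns a list of entity strings (as `get_entities` does). The helper `count_entities` and the two lambda extractors are hypothetical stand-ins, not the real models:

```python
def count_entities(extractors, text):
    """Map each extractor name to (total mentions, distinct entities)."""
    summary = {}
    for name, extract in extractors.items():
        entities = extract(text)
        summary[name] = (len(entities), len(set(entities)))
    return summary


# Hypothetical stand-ins; in the notebook these would be
# ner.get_entities and ner_basic.get_entities.
extractors = {
    "sci": lambda text: [w for w in text.split() if len(w) > 2],
    "web": lambda text: [w for w in text.split() if w.istitle()],
}
summary = count_entities(extractors, "Alternaria infects seeds and seeds dry")
print(summary)
```

With the real recognizers, the same function could be called on a sample of publications to check whether the difference holds beyond a single article.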

Finally, we will try two specialized models from scispaCy used to detect chemical compounds, diseases and taxa. These models have been trained on the BC5CDR and CRAFT corpora, respectively:

In [10]:
import en_ner_bc5cdr_md

ner_bc5cdr = NamedEntityRecognizer(en_ner_bc5cdr_md)
ner_bc5cdr.visualize_entities(text[:1500])

In [11]:
import en_ner_craft_md

ner_craft = NamedEntityRecognizer(en_ner_craft_md)
ner_craft.visualize_entities(text[:1500])

Although these models correctly detect their corresponding entity types, they are too specific to be used as a generic named entity recognizer for the agriculture articles. Furthermore, the distinction between entity types will not be very useful, since in the entity linking step we will link the entities to an ontology and obtain more information than just their entity type.

For these reasons, within the scope of this challenge we will use the _en\_core\_sci\_lg_ model from scispaCy.

## Saving results
Finally, we are going to save the named entity recognizer class with the parameters that we have selected for further use in the following phases.

In [13]:
from herc_common.utils import save_object

output_filename = "ner_system.pkl"
save_object(ner, os.path.join(NOTEBOOK_4_RESULTS_DIR, output_filename))
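
The implementation of `save_object` is not shown in this notebook. Assuming it is a thin wrapper around `pickle` (the `.pkl` extension suggests so, but this is an assumption), a save and load round trip could look like the sketch below. `save_with_pickle`, `load_with_pickle` and the `settings` dictionary are hypothetical and stand in for the real helper and the recognizer object:

```python
import os
import pickle
import tempfile


def save_with_pickle(obj, path):
    """Hypothetical equivalent of save_object: serialize obj to path."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_with_pickle(path):
    """Read back an object previously written with save_with_pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dictionary stands in for the recognizer in this sketch.
settings = {"model": "en_core_sci_lg", "min_entity_counts": 3}
with tempfile.TemporaryDirectory() as results_dir:
    path = os.path.join(results_dir, "ner_settings.pkl")
    save_with_pickle(settings, path)
    restored = load_with_pickle(path)

print(restored == settings)
```

If the helper does rely on `pickle`, the notebooks that load the file would need the `NamedEntityRecognizer` class to be importable, since `pickle` stores a reference to the class rather than its code.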