# Pretrained scispacy models NER comparison on samples of data with botanical names - Ginseng

This is a quick and dirty comparison done at the very start of the project during the exploration phase.
At this point I had a surface-level understanding of the text challenges of botanical or plant names.

The 5 samples of text here were handpicked from actual abstracts found on the PubMed web portal <https://pubmed.ncbi.nlm.nih.gov/>.
These sentences on ginseng were selected for their points of interest, in particular:
- full scientific names, which are binomial phrases followed by an author string, e.g. "Panax ginseng C.A. Meyer"
- partial scientific or scientific-looking names (binomial nomenclature), e.g. "Panax ginseng" and "Panax quinquefolius"
- common names, e.g. "Asian ginseng"
- some taxonomic family names, e.g. "Asteraceae"

What the examples do not cover is a realistic representation of the texts in which these plant names could be found, such as drug or pharmaceutical literature,
where we would expect the models to respond well to biomedical entities. Such a representative sample of literature for plant names is currently unavailable
because digital or otherwise accessible sources for the names in the MPNS are lacking.

A few other types of examples, containing other known plant names that are mentioned in the MPNS (and are therefore potentially interesting), are
available in the data/sample_text folder in this repo. The examples below seemed the richest for discussion, and I felt they let me
guess which of these models could potentially transfer to the plant domain.

The models compared in this notebook are from scispacy <https://allenai.github.io/scispacy/>. These are spaCy models specialised for
biomedical, scientific and clinical text processing. This seemed worth investigating as literature on medicinal plants (with respect to the MPNS)
would also cover the above domains, and might allow for richer information extraction further down the line.

The scispacy models tested here are:

1. Full spaCy pipelines trained on biomedical data, in increasing order of vocabulary size:
- en_core_sci_sm-0.3.0.tar.gz
- en_core_sci_md-0.3.0.tar.gz
- en_core_sci_lg-0.3.0.tar.gz

2. Full spaCy pipeline trained on biomedical data, based on scibert-base. Unfortunately I couldn't get this one to run.
- en_core_sci_scibert-0.3.0.tar.gz

3. spaCy NER (only) pipelines trained on different corpora. `en_ner_bionlp13cg_md` initially seems the most interesting as it is also trained to recognise organisms.
- en_ner_craft_md-0.3.0.tar.gz
- en_ner_jnlpba_md-0.3.0.tar.gz
- en_ner_bc5cdr_md-0.3.0.tar.gz
- en_ner_bionlp13cg_md-0.3.0.tar.gz

As we don't know exactly what each model is capable of recognising, we think it is fine at this stage to forgo a formal evaluation against hand-annotated or gold-standard examples.
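One rough way to see which entity types a loaded pipeline can emit is to read the labels off its NER component. The sketch below assumes the component is registered under the name `ner` and that the model packages are installed; `ner_labels` and `print_model_labels` are hypothetical helper names, not part of spaCy or scispacy.

```python
def ner_labels(nlp):
    """Return the sorted entity labels of the pipeline's "ner" component.

    Assumes the NER component is registered under the name "ner".
    """
    return sorted(nlp.get_pipe("ner").labels)


def print_model_labels(model_names):
    """Load each model by name and print the labels its NER component knows."""
    import spacy  # imported lazily so ner_labels can be reused without spaCy

    for name in model_names:
        nlp = spacy.load(name)
        print(name, ner_labels(nlp))
```

Calling `print_model_labels(["en_core_sci_sm", "en_ner_craft_md"])` in an environment where those packages are installed would list the labels side by side, which may help decide which models are worth a closer look.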

**Most of the code in this notebook is borrowed from the official spaCy documentation.** <https://spacy.io/usage/linguistic-features>

## 1: Setup 

### 1.1: Load models and packages

In [1]:
%%capture
!pip install -U "spacy<3.0.0"
!pip install -U scispacy==0.3.0
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_scibert-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_craft_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_jnlpba_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_bc5cdr_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_bionlp13cg_md-0.3.0.tar.gz


In [2]:
import os
import spacy
import scispacy
from spacy import displacy


### 1.2: Load text examples

In [4]:
ginseng_sentences_data_filepath = "../../../data/sample_text/ginseng-sentences/"
ginseng_sentences_filenames = os.listdir(ginseng_sentences_data_filepath)
ginseng_sentences_filenames

['26850342-1.txt', '26850342-0.txt', '31885119-0.txt', '31880255-0.txt']

### 1.3: Define shared functions

**The following code (with some minor tweaks) is borrowed from the official spaCy documentation.** <https://spacy.io/usage/linguistic-features>

In [22]:
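def run_ner_model(model: str, data_filepath: str, data_filenames: list) -> None:
    """Load a spaCy model and render the entities it finds in each file."""
    nlp = spacy.load(model)

    for filename in data_filenames:
        # Read the whole sample, then close the file before running the model
        with open(os.path.join(data_filepath, filename), "r") as f:
            text = f.read()
        doc = nlp(text)
        # Highlight the recognised entity spans inline in the notebook
        displacy.render(doc, style="ent")
        print("\n")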
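The rendered highlights are convenient to eyeball but hard to compare across models. As a possible complement, the sketch below returns the entities as plain tuples instead of rendering them. `extract_entities` is a hypothetical helper, and it assumes `nlp` is any loaded spaCy pipeline with an NER component.

```python
def extract_entities(nlp, text):
    """Run the pipeline on text and return (entity text, label) pairs.

    doc.ents holds the recognised spans; ent.label_ is the string label.
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]
```

With a real pipeline this could be used as `extract_entities(spacy.load("en_core_sci_md"), text)`, and the resulting lists from different models could then be compared directly.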


## 2: NER Model comparison

### 2.1: `en_core_sci_sm`

This model was trained on the smallest vocabulary, but performed reasonably well in picking up most of the plant mentions we are interested in, such as "red Asian ginseng" and "Black Cumin". It was also able to pick up most binomial names like "Nigella sativa".

It did, however, miss the author string in "Panax ginseng C.A. Meyer".

The model could also be a bit noisy, in that it picked up words like "article", "world" and "controlling". There is also no distinction between entity types - everything is an ENTITY.
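To make "noisy" a little more concrete without a formal evaluation, one could tally the extracted spans against a short hand-written list of mentions of interest. The sketch below is only that: `split_hits_and_extras` is a hypothetical helper, matching is exact and case-insensitive (so a partial match counts as an extra), and the two lists are illustrative inputs, not actual model output.

```python
def split_hits_and_extras(entity_texts, target_mentions):
    """Split extracted spans into hits and extras, and list missed targets.

    Matching is exact and case-insensitive, so partial matches are extras.
    """
    targets = {t.lower() for t in target_mentions}
    found = {e.lower() for e in entity_texts}
    hits = [e for e in entity_texts if e.lower() in targets]
    extras = [e for e in entity_texts if e.lower() not in targets]
    missed = [t for t in target_mentions if t.lower() not in found]
    return hits, extras, missed


# Illustrative inputs only (assumed, not copied from a model run)
example_entities = ["red Asian ginseng", "article", "world"]
example_targets = ["red Asian ginseng", "Panax ginseng C.A. Meyer"]
hits, extras, missed = split_hits_and_extras(example_entities, example_targets)
print(hits, extras, missed)
```

Exact matching is a deliberately crude choice here; it keeps the tally easy to reason about at the cost of treating near misses as noise.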

In [23]:
run_ner_model("en_core_sci_sm", ginseng_sentences_data_filepath, ginseng_sentences_filenames)

















### 2.2: `en_core_sci_md`

This model was trained on the next-size-up vocabulary, and performed reasonably well in picking up all of the plant mentions we are interested in, such as "red Asian ginseng" and "Black Cumin". It was also able to pick up most binomial names like "Nigella sativa".

It also detected the full author string in "Panax ginseng C.A. Meyer".

Again, however, the model is a bit noisy, in that it picked up words like "review" and "research". There is no distinction between entity types - everything is an ENTITY.

In [24]:
run_ner_model("en_core_sci_md", ginseng_sentences_data_filepath, ginseng_sentences_filenames)

















### 2.3: `en_core_sci_lg`

This model was trained on the largest vocabulary, and performed similarly to the medium size model. All of the plant mentions we are interested in were detected, such as "red Asian ginseng" and "Black Cumin". It was also able to pick up most binomial names like "Nigella sativa".

It also detected the full author string in "Panax ginseng C.A. Meyer".

Again, however, the model is still noisy, in that it picked up words like "review" and "research". There is no distinction between entity types - everything is an ENTITY.

In [25]:
run_ner_model("en_core_sci_lg", ginseng_sentences_data_filepath, ginseng_sentences_filenames)

















### 2.4: `en_core_sci_scibert`

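I was not able to get this model to work. The error below says that spaCy cannot find the package in the environment, which suggests the install step itself failed; since the install cell runs under `%%capture`, any pip error for this model would have been hidden.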

In [26]:
run_ner_model("en_core_sci_scibert", ginseng_sentences_data_filepath, ginseng_sentences_filenames)

OSError: [E050] Can't find model 'en_core_sci_scibert'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
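Error E050 is raised when spaCy cannot resolve the model name, so a quick check of whether each package is importable can help separate install problems from model problems. The sketch below uses the standard library's `importlib.util.find_spec`; `is_model_installed` is a hypothetical helper, and it assumes each model is installed as a top-level Python package with the same name that is passed to `spacy.load`.

```python
import importlib.util


def is_model_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Model names taken from the list tested in this notebook
for name in ["en_core_sci_sm", "en_core_sci_scibert"]:
    status = "installed" if is_model_installed(name) else "missing"
    print(name, status)
```

This only reports whether the package is present; it does not explain why an install may have failed, for which re-running the relevant `pip install` line without `%%capture` would be more informative.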

### 2.5: `en_ner_craft_md`

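This model was only able to return 3 entities, and the performance seems inconsistent, e.g. with family names. We were not sure at first what the entity label CHEBI means; it most likely refers to the Chemical Entities of Biological Interest (ChEBI) ontology, which is one of the annotation sources used in the CRAFT corpus.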

In [27]:
run_ner_model("en_ner_craft_md", ginseng_sentences_data_filepath, ginseng_sentences_filenames)



















### 2.6: `en_ner_jnlpba_md`

This model returned no results.

In [28]:
run_ner_model("en_ner_jnlpba_md", ginseng_sentences_data_filepath, ginseng_sentences_filenames)

















### 2.7: `en_ner_bc5cdr_md`

This model performed poorly. The demarcation of entities was strange, e.g. "garlic(Allium sativum)", and "Cumin" without its descriptor "Black", which is part of its name. Both of these entities were classified as CHEMICAL.

In [None]:
run_ner_model("en_ner_bc5cdr_md", ginseng_sentences_data_filepath, ginseng_sentences_filenames)

















### 2.8: `en_ner_bionlp13cg_md`

I was interested in this model as it was trained on the largest variety of entities, but its performance on the items we are interested in is variable. For example, "red Asian ginseng" is correctly demarcated, but "red American ginseng" is not.

The classification of some items across different labels, for example across ORGANISM, ORGANISM_SUBSTANCE and TISSUE, is noisy and may not be useful to us.
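To inspect how spans are spread across labels, it may help to group the extracted entities by label. The sketch below assumes the entities are already available as (text, label) pairs; `group_by_label` is a hypothetical helper, and the pairs shown are made-up inputs for illustration, not output from this model.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Made-up pairs for illustration only (not actual model output)
example_pairs = [
    ("ginseng", "ORGANISM"),
    ("root", "TISSUE"),
    ("Panax", "ORGANISM"),
]
print(group_by_label(example_pairs))
```

Seeing the same plant mention appear under several labels in such a grouping would be one quick sign of the label noise described above.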

In [None]:
run_ner_model("en_ner_bionlp13cg_md", ginseng_sentences_data_filepath, ginseng_sentences_filenames)

















## 3: Conclusions

There is perhaps some potential in the en_core_sci_md and en_core_sci_lg models, because both performed well at demarcating binomial names and scientific names with author strings.

However, because these models label every span ENTITY and nothing else, they could be extremely noisy, with too many false positives, in a context containing more non-plant entities.