# Pretrained spaCy models NER comparison on samples of data with botanical names - Ginseng

This is a quick and dirty comparison done at the very start of the project during the exploration phase.
At this point I had only a surface-level understanding of the challenges that botanical or plant names pose in text.

The 5 samples of text here were handpicked from actual abstracts found via the PubMed web portal <https://pubmed.ncbi.nlm.nih.gov/>.
These sentences on ginseng were selected for their points of interest, in particular:
- full scientific names, which are binomial phrases followed by an author string, e.g. "Panax ginseng C.A. Meyer"
- partial scientific names or scientific-looking names (binomials without an author string), e.g. "Panax ginseng" and "Panax quinquefolius"
- common names, e.g. "Asian ginseng"
- some taxonomic family names, e.g. "Asteraceae"
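
To make the difference between these name shapes concrete, here is a minimal sketch of a purely surface-level pattern for "genus + epithet + optional author string". The regular expression and the `find_binomials` helper are my own illustrative assumptions, not part of this project or of any nomenclature standard.

```python
import re
from typing import List, Tuple

# Naive, illustrative pattern (an assumption, not a validated parser):
# a capitalised word, a lowercase word, then an optional author string
# made of initials such as "C.A." and/or capitalised words such as "Meyer".
BINOMIAL = re.compile(
    r"\b(?P<genus>[A-Z][a-z]+) (?P<epithet>[a-z]{2,})"
    r"(?P<author>(?: (?:[A-Z]\.)+| [A-Z][a-z]+\.?)*)"
)


def find_binomials(text: str) -> List[Tuple[str, str, str]]:
    """Return (genus, epithet, author) triples for every surface match."""
    return [
        (m.group("genus"), m.group("epithet"), m.group("author").strip())
        for m in BINOMIAL.finditer(text)
    ]
```

Note that a pattern like this also matches ordinary word pairs such as "Asian ginseng" or a sentence-initial "This is", because they have the same capitalised-then-lowercase shape. Surface form alone cannot separate scientific names from common names, which is part of why NER models are worth testing here.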

What the examples do not provide is a realistic representation of the texts in which these plant names could be found, such as drug or pharmaceutical literature,
where we expect the models to respond well to biomedical entities. A realistic representation of the literature for plant names is currently unavailable
due to the lack of digital or accessible sources for the names in the Medicinal Plant Names Services (MPNS).

A few other types of examples, featuring other plant names that appear in the MPNS (and are therefore potentially interesting), are
available in the data/sample_text folder in this repo. The examples below seemed the richest for discussion, and from them I felt able
to guess which of these models could potentially transfer to the plant domain.

The models compared in this notebook are from spaCy <https://spacy.io/models/en>. These spaCy models are meant to be more general, as they are trained on web data. This seemed worth investigating because the literature on medicinal plants (with respect to the MPNS)
covers common names found in cultural contexts, and this might allow for richer information extraction further down the line.

The English spaCy models tested here are:

1. Full spaCy pipelines trained on web data, in increasing order of size (the `md` and `lg` packages include static word vectors, `sm` does not):
- en_core_web_sm
- en_core_web_md
- en_core_web_lg

2. A full spaCy pipeline trained on web data using `roberta-base`.
- en_core_web_trf

As we don't know exactly what each model is capable of recognising, we think it is fine at this stage to forgo a formal evaluation against hand-annotated or gold-standard examples.
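
One cheap way to find out what a model can recognise, short of a formal evaluation, is to ask its NER component for its label set. The sketch below assumes an already loaded pipeline is passed in; `ner_labels` is a hypothetical helper name, while `has_pipe`, `get_pipe` and the `labels` attribute are existing spaCy calls.

```python
from typing import List


def ner_labels(nlp) -> List[str]:
    """Return the sorted entity labels of a loaded spaCy pipeline.

    `nlp` is assumed to be the object returned by `spacy.load(...)`.
    An empty list is returned if the pipeline has no "ner" component.
    """
    if not nlp.has_pipe("ner"):
        return []
    return sorted(nlp.get_pipe("ner").labels)


# Usage, once the models from the setup section are installed:
#   import spacy
#   print(ner_labels(spacy.load("en_core_web_sm")))
```

The English web pipelines use general-purpose labels such as PERSON, ORG, GPE and LOC, so I would not expect to see a label dedicated to plants or taxa in this list.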


## 1: Setup 

### 1.1: Load models and packages

In [9]:
%%capture
!pip install -U spacy[transformers]==3.1.0
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf
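
Because `%%capture` hides the output of the cell above, a failed download can go unnoticed. spaCy models are installed as ordinary Python packages, so a quick sanity check can be sketched with the standard library alone. `MODEL_NAMES` and `installed_packages` are names I made up for this illustration.

```python
import importlib.util
from typing import Dict, Iterable

# The four pipelines compared in this notebook.
MODEL_NAMES = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"]


def installed_packages(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level package name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Usage: any False value means that model still needs to be downloaded.
#   print(installed_packages(MODEL_NAMES))
```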


In [2]:
import os
import spacy
from spacy import displacy


### 1.2: Load text examples

In [3]:
ginseng_sentences_data_filepath = "../../../data/sample_text/ginseng-sentences/"
ginseng_sentences_filenames = os.listdir(ginseng_sentences_data_filepath)
ginseng_sentences_filenames

['26850342-1.txt', '26850342-0.txt', '31885119-0.txt', '31880255-0.txt']
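
`os.listdir` returns entries in arbitrary order and includes anything that happens to be in the folder. If a stable order matters when comparing the models side by side, a small alternative could look like the sketch below. `list_text_files` is a hypothetical helper, and filtering on the `.txt` suffix is my assumption based on the filenames shown above.

```python
from pathlib import Path
from typing import List


def list_text_files(folder: str) -> List[str]:
    """Return the names of the .txt files in `folder`, sorted alphabetically."""
    return sorted(
        p.name for p in Path(folder).iterdir() if p.is_file() and p.suffix == ".txt"
    )
```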

### 1.3: Define shared functions

In [6]:
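def run_ner_model(model: str, data_filepath: str, data_filenames: list) -> None:
    """Load a spaCy pipeline and render the entities it finds in each file."""
    nlp = spacy.load(model)

    for filename in data_filenames:
        # Read the whole sample; the file can be closed before processing.
        with open(os.path.join(data_filepath, filename), "r") as f:
            text = f.read()
        doc = nlp(text)
        # Highlight the recognised entity spans inline in the notebook.
        displacy.render(doc, style="ent")
        print("\n")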
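
The rendered output is convenient to read but hard to compare across models. As a complement, the entities can also be collected as plain tuples. This is only a sketch: `collect_entities` is a hypothetical helper, and it assumes `nlp` is a loaded pipeline and `texts` maps each filename to its contents.

```python
from typing import Dict, List, Tuple


def collect_entities(nlp, texts: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (filename, entity text, entity label) for every entity found."""
    rows = []
    for name, text in texts.items():
        doc = nlp(text)
        # doc.ents holds the entity spans predicted by the NER component.
        rows.extend((name, ent.text, ent.label_) for ent in doc.ents)
    return rows


# Usage, assuming the models are installed:
#   import spacy
#   rows = collect_entities(spacy.load("en_core_web_sm"), {"example.txt": "some text"})
```

Returning tuples keeps the helper independent of the notebook display, so the same rows can be printed, diffed between models, or loaded into a table.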


## 2: NER Model comparison

### 2.1: `en_core_web_sm`

This model was trained on the smallest vocabulary. It did not pick up phrases such as "red Asian ginseng", though it picked up "Black Cumin" as a PERSON (possibly because of the title case). It wasn't able to pick up binomial names like "Nigella sativa", so "Panax ginseng C.A. Meyer" was out of the question.

In [7]:
run_ner_model("en_core_web_sm", ginseng_sentences_data_filepath, ginseng_sentences_filenames)


### 2.2: `en_core_web_md`

This model was trained on the next-size-up vocabulary, and did not show any improvement in identifying whole common-name phrases or binomial names. "Black Cumin" is now labelled LOC (location).

In [10]:
run_ner_model("en_core_web_md", ginseng_sentences_data_filepath, ginseng_sentences_filenames)


### 2.3: `en_core_web_lg`

This model was trained on the largest vocabulary, and performed similarly to the previous two models. The entity labels again vary from model to model: "Black Cumin" is a PERSON again.

In [11]:
run_ner_model("en_core_web_lg", ginseng_sentences_data_filepath, ginseng_sentences_filenames)


### 2.4: `en_core_web_trf`

This model returned the fewest entities of the four, and its results are not ideal either.

In [13]:
run_ner_model("en_core_web_trf", ginseng_sentences_data_filepath, ginseng_sentences_filenames)
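
Statements such as "returned the fewest entities" are easier to check with counts than by eye. Below is a minimal sketch of a per-label tally. `tally_labels` is a hypothetical helper, and it assumes its input is a sequence of (entity text, label) pairs, for example built with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_labels(entities: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many entities were assigned each label."""
    return Counter(label for _text, label in entities)


# Usage: the total number of entities for one model is
#   sum(tally_labels(pairs).values())
```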


## 3: Conclusions

This demonstration has shown that these models' performance would not be useful to the botanical names domain, as t