# NER comparison of pretrained scispacy models on text samples with botanical names - Daisy

This is a quick and dirty comparison done at the very start of the project, during the exploration phase.
At this point I had only a surface-level understanding of the challenges that botanical or plant names pose in text.

This notebook is a second round of tests on the same 8 scispacy models we saw in the first notebook.

The 6 text samples here were handpicked from abstracts found through the PubMed web portal <https://pubmed.ncbi.nlm.nih.gov/>.
These abstracts on daisy were selected for their points of interest, in particular:
- full scientific names, which are binomial phrases followed by an author string, e.g. "Bellis perennis L."
- partial scientific or scientific-looking names (binomial nomenclature), e.g. "Bellis perennis"
- common names, e.g. "common daisy"
- some taxonomic family names, e.g. "Asteraceae"

Additionally, these abstracts have less variation in botanical names than the ginseng sentences used previously, and they provide fuller, longer contexts (whole abstracts rather than selected sentences) that include other types of entities.

What the examples do not offer is a realistic representation of the texts in which these plant names would be found, say drug or pharmaceutical literature,
where we expect the models to respond well to biomedical entities. A realistic sample of such literature is currently unavailable
because digital or otherwise accessible sources for the names in the MPNS (Medicinal Plant Names Services) are lacking.

The models compared in this notebook are from scispacy <https://allenai.github.io/scispacy/>. These are spaCy models specialised for
biomedical, scientific and clinical text processing. They seemed worth investigating because literature on medicinal plants (with respect to the MPNS)
would also cover these domains, and the models might allow for richer information extraction further down the line.

The scispacy models tested here are:

1. Full spaCy pipelines trained on biomedical data, in increasing order of vocabulary size:
- en_core_sci_sm-0.3.0.tar.gz
- en_core_sci_md-0.3.0.tar.gz
- en_core_sci_lg-0.3.0.tar.gz

2. Full spaCy pipeline trained on biomedical data, based on scibert-base. Unfortunately I couldn't get this one to run.
- en_core_sci_scibert-0.3.0.tar.gz

3. NER-only spaCy pipelines trained on different corpora. `en_ner_bionlp13cg_md` initially seems the most interesting, as it is also trained to recognise organisms.
- en_ner_craft_md-0.3.0.tar.gz
- en_ner_jnlpba_md-0.3.0.tar.gz
- en_ner_bc5cdr_md-0.3.0.tar.gz
- en_ner_bionlp13cg_md-0.3.0.tar.gz

As we don't yet know exactly what each model is capable of recognising, we think it is fine at this stage to forgo a formal evaluation against hand-annotated or gold-standard examples.
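One quick way to see what each NER component can label is to list its label set. The sketch below is only an illustration: the helper names `ner_labels` and `labels_per_model` are hypothetical, and it assumes each pipeline exposes a component named `"ner"` and that the model packages are already installed.

```python
from typing import Dict, Iterable, Tuple


def ner_labels(nlp) -> Tuple[str, ...]:
    """Return the sorted entity labels of a loaded pipeline's "ner" component."""
    return tuple(sorted(nlp.get_pipe("ner").labels))


def labels_per_model(model_names: Iterable[str]) -> Dict[str, Tuple[str, ...]]:
    """Load each installed model in turn and collect its entity labels."""
    import spacy  # imported lazily so ner_labels() can be tried without spaCy

    return {name: ner_labels(spacy.load(name)) for name in model_names}
```

Calling `labels_per_model(["en_core_sci_sm", "en_ner_bionlp13cg_md"])` would return one tuple of labels per model, which makes it easier to judge up front whether a model has any label relevant to plants.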

## 1: Setup 

### 1.1: Load models and packages

In [2]:
%%capture
!pip install -U "spacy<3.0.0"
!pip install -U scispacy==0.3.0
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_scibert-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_craft_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_jnlpba_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_bc5cdr_md-0.3.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_bionlp13cg_md-0.3.0.tar.gz


In [3]:
import os
import spacy
import scispacy
from spacy import displacy


### 1.2: Load text examples

In [15]:
daisy_data_filepath = "../../../data/sample_text/daisy/"
daisy_data_filenames = os.listdir(daisy_data_filepath)
daisy_data_filenames

['15509561.txt',
 '21626148.txt',
 '24617777.txt',
 'PMC6222741.txt',
 '34175580.txt',
 'PMC6151829.txt']

### 1.3: Define shared functions

In [16]:
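def run_ner_model(model: str, data_filepath: str, data_filenames: list) -> None:
    """Load a spaCy model and render the entities it finds in each text file."""
    nlp = spacy.load(model)

    for filename in data_filenames:
        # Read the whole abstract before running the pipeline over it.
        with open(os.path.join(data_filepath, filename), "r") as f:
            text = f.read()
        doc = nlp(text)
        # Highlight the recognised entities inline in the notebook.
        displacy.render(doc, style="ent")
        print("\n")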
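
The function above only renders the entities visually. As a rough sketch of an alternative (not something this notebook does), the entities could also be collected as plain tuples so that models can be compared programmatically. The helper name `collect_entities` is hypothetical; it relies only on the `ents`, `text`, `label_`, `start_char` and `end_char` attributes of a spaCy `Doc` and its spans.

```python
from typing import List, Tuple

# (entity text, label, start character offset, end character offset)
EntityRow = Tuple[str, str, int, int]


def collect_entities(doc) -> List[EntityRow]:
    """Flatten the entities of a processed document into plain tuples."""
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]
```

Inside `run_ner_model`, calling `collect_entities(doc)` after `doc = nlp(text)` would give one list of rows per abstract.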


## 2: NER Model comparison

### 2.1: `en_core_sci_sm`

This model was trained on the smallest vocabulary, but performed reasonably well, picking up most of the plant mentions we are interested in, such as "red daisy" and "lawn daisy". It also picked up most binomial names, like "Bellis perennis".

It captured the author string in "Bellis perennis L." (recall that it previously missed it for "Panax ginseng C.A. Meyer"). However, it split "Bellis perennis L., cultivar "Galaxy red"" into three separate entities, and it recognised the full name "Dendranthema morifolium (Ramat.) Yellow" in two parts with the author string left out.

The model is very noisy: it also picked up words like "wealth", "cloning" and "small". There is also no distinction between entity types, as everything is labelled ENTITY.
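
The observations above were made by eye. A minimal sketch of how the same check could be made repeatable follows; the helper `mention_status` is hypothetical, and it compares strings rather than character offsets, so a very short fragment could match by coincidence.

```python
from typing import Iterable


def mention_status(target: str, entity_texts: Iterable[str]) -> str:
    """Classify how a target mention was recovered: 'exact', 'partial' or 'missed'."""
    texts = [text.strip() for text in entity_texts]
    if target in texts:
        return "exact"
    # An entity counts as partial if it is a fragment of the target
    # (the name was split) or if it contains the target (over-long span).
    if any(text and (text in target or target in text) for text in texts):
        return "partial"
    return "missed"
```

For example, with a made-up entity list `["Dendranthema morifolium", "Yellow", "wealth"]`, the target "Dendranthema morifolium (Ramat.) Yellow" would be reported as `partial`, and "common daisy" as `missed`.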

In [18]:
run_ner_model("en_core_sci_sm", daisy_data_filepath, daisy_data_filenames)

### 2.2: `en_core_sci_md`

This model was trained on the next vocabulary size up, and performed reasonably well in picking up all of the plant mentions we are interested in. It performed similarly to the smaller model: it recognised the same valuable information, but also produced much of the same noise. The full name "Dendranthema morifolium (Ramat.) Yellow" was recognised in three parts, so there was a slight improvement.

At this point we are not doing a formal evaluation that quantifies the NER results, but nothing else stands out in a spot check.
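
For reference, a formal evaluation would typically compare predicted entity spans with hand-annotated ones. The sketch below shows exact-match precision and recall over character-offset spans; it is an illustration of the idea only, and no gold annotations exist for these samples.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start_char, end_char)


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted spans against gold spans."""
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Exact matching is strict: a name split into several entities, as seen with the author strings here, would count as a miss, so a partial-match variant might be more informative for botanical names.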

In [19]:
run_ner_model("en_core_sci_md", daisy_data_filepath, daisy_data_filenames)

### 2.3: `en_core_sci_lg`

This model was trained on the largest vocabulary, and performed similarly to the previous models in its family.
As with the medium-sized model, the full name "Dendranthema morifolium (Ramat.) Yellow" was recognised in three parts, which is not ideal.
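
Judging that two models "performed similarly" can be backed up by comparing the sets of entity texts they return for the same abstract. A minimal sketch, with a hypothetical helper name and assuming the entity texts have already been extracted from each model's output:

```python
from typing import Dict, Iterable, Set


def compare_entity_sets(first: Iterable[str], second: Iterable[str]) -> Dict[str, Set[str]]:
    """Split two models' entity texts into shared and model-specific sets."""
    set_first, set_second = set(first), set(second)
    return {
        "shared": set_first & set_second,
        "only_first": set_first - set_second,
        "only_second": set_second - set_first,
    }
```

Large `shared` sets with small model-specific sets would support the impression that the vocabulary size makes little difference on these samples.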

In [20]:
run_ner_model("en_core_sci_lg", daisy_data_filepath, daisy_data_filenames)

### 2.4: `en_core_sci_scibert`

I was not able to get this model to work: spaCy could not find it in the environment (see the E050 error below).

In [21]:
run_ner_model("en_core_sci_scibert", daisy_data_filepath, daisy_data_filenames)

OSError: [E050] Can't find model 'en_core_sci_scibert'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
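
Error E050 means spaCy could not resolve the name to an installed package or path. Because the install cell runs under `%%capture`, a failed `pip install` would have gone unnoticed. One possibility, not verified here, is that the download URL for this model did not resolve for release 0.3.0. The sketch below, using a hypothetical helper `missing_models`, checks which model packages are importable before any of them is loaded.

```python
import importlib.util
from typing import Iterable, List


def missing_models(model_names: Iterable[str]) -> List[str]:
    """Return the model package names that cannot be found in this environment."""
    return [
        name
        for name in model_names
        # find_spec returns None when no top-level package of that name exists.
        if importlib.util.find_spec(name) is None
    ]
```

Running `missing_models(["en_core_sci_sm", "en_core_sci_scibert"])` right after the install cell would flag any model whose installation failed silently.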

### 2.5: `en_ner_craft_md`

This model returned some entities, but none of the plant names; it was clearly trained with different entity types in mind.

In [22]:
run_ner_model("en_ner_craft_md", daisy_data_filepath, daisy_data_filenames)

### 2.6: `en_ner_jnlpba_md`

This model also returned some entities, but none of the plant names; it too was clearly trained with different entity types in mind.

In [23]:
run_ner_model("en_ner_jnlpba_md", daisy_data_filepath, daisy_data_filenames)

### 2.7: `en_ner_bc5cdr_md`

This model focuses on chemicals: it responded to "Chrysanthemum morifolium" and "Dendranthema morifolium", labelling them as chemicals, but showed no interest in common names or general entities.

In [24]:
run_ner_model("en_ner_bc5cdr_md", daisy_data_filepath, daisy_data_filenames)

### 2.8: `en_ner_bionlp13cg_md`

I was interested in this model as it was trained on the largest variety of entity types, but its performance on the items we care about is variable. For example, "red daisy" is recognised, but "common daisy" is not. It is also inconsistent in recognising binomial names: "Bellis perennis L." was recognised, but "Dendranthema morifolium (Ramat.) Pink" was missed completely.

The classification of some items across different labels, for example across ORGANISM, ORGANISM_SUBSTANCE and TISSUE, is noisy and may not be useful to us.
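
To see how mentions are spread across labels, the entities can be grouped by label. This is a minimal sketch with a hypothetical helper name; it expects `(text, label)` pairs such as those read from `ent.text` and `ent.label_`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_label(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entity texts under the label the model assigned to them."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)
```

Looking at `group_by_label(...).get("ORGANISM", [])` alone would show how much of the plant vocabulary ends up under the label we would most expect, and how much is scattered under other labels.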

In [25]:
run_ner_model("en_ner_bionlp13cg_md", daisy_data_filepath, daisy_data_filenames)

## 3: Conclusions

There is perhaps some potential in the `en_core_sci_md` and `en_core_sci_lg` models, because they performed well at demarcating binomial names and scientific names with author strings.

However, because every entity they return is labelled ENTITY and nothing else, their output could be extremely noisy, with too many false positives, in a context with more non-plant entities.

We arrive at the same conclusion as in the first notebook with the simpler ginseng sentences. As these two models showed the most promise of all those tested, we might consider them as candidates for our NER baseline and discussion.