# ODSC 2022 NER from scratch with spaCy
## Author: Ben Batorsky
This notebook contains all the code for the tutorial on NER given at ODSC 2022.

### Setup
If you're running this locally, you should run these cells before the tutorial just to get all set up.

If you're running this on Colab, your runtime will be reset after a few minutes of idle time, so you might need to wait until the tutorial starts.  Note that Wi-Fi is sometimes spotty at the conference.

In [None]:
# install the required spacy libraries
!pip install -q spacy==3.2
!pip install -q spacy-transformers==1.1.5

In [None]:
# download a more complete model (vectors + NER)
!spacy download en_core_web_md

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
     |████████████████████████████████| 45.7 MB 2.5 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [None]:
# run this to install the BERT model we'll be using
from transformers import AutoTokenizer, AutoModel
_ = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
_ = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")


Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# clone the necessary repos
# tutorial repo
!git clone https://github.com/bpben/spacy_ner_tutorial.git
# spacy project for drug NER - using spaCy project CLI, more on this later
!spacy project clone tutorials/ner_drugs

fatal: destination path 'spacy_ner_tutorial' already exists and is not an empty directory.

✘ Can't clone project, directory already exists: /content/ner_drugs



In [None]:
import spacy

## spaCy basics
[Here](https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg) is a good overview of how spaCy works.  The next few cells just show you some basic usage and some of the core spaCy types.

In [None]:
# import base English language model
from spacy.lang.en import English
en = English()
print(en.tokenizer)
# in this base model, there are no pipes (e.g. NER)
print(en.pipe_names)



<spacy.tokenizer.Tokenizer object at 0x7f18a1d66f50>
[]


In [None]:
# running a "document" through a language model
text = 'We are doing NLP.'
doc = en(text)
print(type(doc))
print(type(doc[:2]))
print([(x, type(x)) for x in doc])


<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.span.Span'>
[(We, <class 'spacy.tokens.token.Token'>), (are, <class 'spacy.tokens.token.Token'>), (doing, <class 'spacy.tokens.token.Token'>), (NLP, <class 'spacy.tokens.token.Token'>), (., <class 'spacy.tokens.token.Token'>)]


In [None]:
nlp = spacy.load('en_core_web_md')
# this language model has a lot more components to it
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [None]:
# our base model has no entities, our expanded model does
doc = en(text)
doc_expanded = nlp(text)
print(doc.ents)
print(doc_expanded.ents)

()
(NLP,)


In [None]:
# can get some additional information from these entities
# the model thinks NLP is an organization.  
# How silly, it'd be great if we could train it to do better...
for e in doc_expanded.ents:
    print(e, type(e), e.label_, spacy.explain(e.label_))


NLP <class 'spacy.tokens.span.Span'> ORG Companies, agencies, institutions, etc.


In [None]:
# neat little side note - the larger model also has GloVe vectors attached to it
# very easy to make use of these in your pipe
print(doc_expanded.vector)

[-1.20593786e-01  1.88496396e-01 -2.94177979e-01 -3.66981983e-01
  1.33377999e-01 -1.61488019e-02  5.73446043e-02 -1.06618002e-01
  2.76278015e-02  1.93136191e+00 -1.90531999e-01  1.27237201e-01
  1.01299003e-01 -3.91399972e-02 -7.14469999e-02 -1.82412818e-01
 -5.17126098e-02  9.75196004e-01 -3.70818198e-01  1.96072400e-01
  1.01875998e-01  9.30623859e-02  1.64807942e-02  3.42839723e-03
 -1.31522596e-01  1.55715004e-01 -2.13100016e-02 -2.13807389e-01
  1.00509003e-01 -7.03402013e-02 -1.69712044e-02 -1.84871599e-01
  4.93465960e-02  2.07261011e-01  1.51058003e-01  2.21679598e-01
  6.01292029e-02  9.42348018e-02 -9.97439995e-02  1.43356994e-02
 -1.62757203e-01 -1.84749905e-02  2.60500014e-02  6.06500022e-02
 -4.75140056e-03  1.25824988e-01 -1.59984201e-01 -9.57759935e-03
 -1.30816400e-01 -9.82340351e-02 -7.52495974e-02 -1.50035396e-01
  6.98117912e-02 -1.49040017e-02  5.82774095e-02 -8.63535982e-03
 -5.30104041e-02  1.86289959e-02 -1.67561788e-02 -1.16726004e-01
 -7.27300048e-02 -7.03819

## Model training in spaCy
The basic model we pulled in has an NER pipeline, but it has a very specific set of entity types.  We could train that model to recognize new types or we could start from scratch and train.  

Here we're going to start from scratch. We'll use the project CLI and show a bit of the code under the hood (which I find helpful for debugging).

The tutorial we downloaded above has data from Reddit that contains drug names.  We can process that using the included preprocessing script (`ner_drugs/scripts/preprocess.py`).

In [None]:
!cd ner_drugs && spacy project run preprocess

ℹ Skipping 'preprocess': nothing changed


Now we have preprocessed datasets for training and evaluation.  Let's bring in the training set and walk through how to train our model.

spaCy provides a helper function for training, which is much easier to use than baking your own.  If you want an example of a custom training function, you can see one [here](https://github.com/bpben/ner_chinese_spacy/blob/master/ner_english_example.ipynb).

In [None]:
from spacy.cli.train import train

# can override config info with overrides
# the tutorial config file doesn't have the paths for train/dev corpora
# going to just run this for one epoch, see how it works
train("./ner_drugs/configs/config.cfg",
      output_path='example_model',
      overrides={"paths.train": "./ner_drugs/corpus/drugs_training.spacy", 
                 "paths.dev": "./ner_drugs/corpus/drugs_eval.spacy",
                 "training.max_epochs": 1})

ℹ Saving to output directory: example_model
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     19.33    0.23    0.20    0.28    0.00
  0     200         11.60  16541.60    0.00    0.00    0.00    0.00
  0     400         19.25   1442.36    0.00    0.00    0.00    0.00
✔ Saved pipeline to output directory
example_model/model-last
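
If the `overrides` argument looks a bit magic: each key is a dotted path into the config, so `training.max_epochs` points at the `max_epochs` setting in the `[training]` section. Here's a rough plain-Python sketch of that idea. The `apply_overrides` helper and the toy config dict are hypothetical, for illustration only; this is not how spaCy implements it internally.

```python
def apply_overrides(config, overrides):
    """Set nested values addressed by dotted keys, e.g. 'training.max_epochs'."""
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        section = config
        for name in parents:
            # walk down into the section, creating it if it is missing
            section = section.setdefault(name, {})
        section[leaf] = value
    return config

# toy stand-in for a parsed config file (assumed layout, not the real config.cfg)
config = {"paths": {"train": None, "dev": None}, "training": {"max_epochs": 0}}
apply_overrides(config, {
    "paths.train": "./ner_drugs/corpus/drugs_training.spacy",
    "training.max_epochs": 1,
})
print(config)
```

Anything you don't override (here `paths.dev`) keeps whatever value the config file gave it.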


In [None]:
# now we can load our trained model
trained_nlp = spacy.load('./example_model/model-best')

In [None]:
# let's load in the eval dataset
# docbin requires some special handling
eval_data = spacy.tokens.DocBin()
eval_data = eval_data.from_disk("./ner_drugs/corpus/drugs_eval.spacy")
# you can recover the doc objects from the DocBin this way
docs = [x for x in eval_data.get_docs(trained_nlp.vocab)]
# by running the doc through the model, we get entities out
trained_nlp(docs[0]).ents

(Got it out the mud like a pint of lean Always blowing gas, sip,
 Promethazine,
 Plays off the hood, bitch I do my thing Got it out the mud like a pint of lean Smoking purple green, sippin' purple lean)

In [None]:
# spacy models have built-in evaluation functions, but expect Example objects, not Doc
corpus = spacy.training.Corpus("./ner_drugs/corpus/drugs_eval.spacy")
eval_corpus = list(corpus(trained_nlp))
trained_nlp.evaluate(eval_corpus)

{'ents_f': 0.002320185614849188,
 'ents_p': 0.002,
 'ents_per_type': {'DRUG': {'f': 0.002320185614849188,
   'p': 0.002,
   'r': 0.0027624309392265192}},
 'ents_r': 0.0027624309392265192,
 'speed': 29112.548580084105,
 'token_acc': 0.9999332071690973,
 'token_f': 0.9993542784618469,
 'token_p': 0.9991985395609778,
 'token_r': 0.9995100659184037}

You can see the model does pretty poorly: an entity F-score of about 0.002.  This is most likely because it was trained for only one epoch.  If we train the model longer, we're likely to see better results.
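
As a quick aside on reading those numbers: `ents_p`, `ents_r` and `ents_f` are span-level precision, recall and F-score, where a predicted entity only counts if its boundaries and label both match a gold entity. Below is a minimal plain-Python sketch of that arithmetic. The `prf` helper and the toy spans are made up for illustration; this is not spaCy's scorer.

```python
def prf(gold, pred):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    p = true_pos / len(pred) if pred else 0.0
    r = true_pos / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# toy spans: the character offsets are invented
gold = [(0, 12, "DRUG"), (30, 34, "DRUG")]
pred = [(0, 12, "DRUG"), (20, 28, "DRUG"), (40, 45, "DRUG")]
p, r, f = prf(gold, pred)
print(p, r, f)

# the F-score is the harmonic mean of P and R, which we can check
# against the scores reported by evaluate() above
ents_p, ents_r = 0.002, 0.0027624309392265192
recomputed_f = 2 * ents_p * ents_r / (ents_p + ents_r)
print(recomputed_f)
```

The recomputed value lines up with the `ents_f` in the evaluation output, so a low score here really does mean that almost none of the predicted spans matched a gold span exactly.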

In [None]:
# the project cli can do all the work above for us
!cd ner_drugs && spacy project run train
!cd ner_drugs && spacy project run evaluate

ℹ Skipping 'train': nothing changed
ℹ Skipping 'evaluate': nothing changed


Nice! But can we do...better? Preferably with a giant model that everyone likes and is named after Sesame Street.

### spacy-transformers
[spacy-transformers](https://explosion.ai/blog/spacy-transformers) is essentially just a wrapper for [HuggingFace's models](https://huggingface.co/), but made to work with spaCy.

Let's take a quick look at how you might be able to use this, generally.



In [None]:
from spacy.lang.en import English
# minimal example - initialize English model, add in our BioClinicalBERT
en = English()
# using a custom config - uses BioClinicalBERT
# this is Tok2VecTransformer, which combines Transformer+Listener, we'll use something different in training
config = {
    "model": {
        "@architectures": "spacy-transformers.Tok2VecTransformer.v3",
        "name": "emilyalsentzer/Bio_ClinicalBERT",
        "tokenizer_config": {"use_fast": True},
        # these have to do with alignment
        'get_spans': {'@span_getters': 'spacy-transformers.strided_spans.v1',
          'stride': 96,
          'window': 128},
        "pooling": {"@layers":"reduce_mean.v1"} 
    }
}
trf = en.add_pipe("tok2vec", config=config)
# need to initialize pipeline components
en.initialize()
# two different contexts
ex1 = 'Flintstones vitamins'
ex2 = 'Flintstones cartoon'
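
A note on the `get_spans` settings in that config: they control how a long document gets cut into overlapping windows before it goes to the transformer. Here's a rough plain-Python sketch of the idea, assuming a window of 128 tokens that advances 96 tokens at a time. The `strided_windows` helper is hypothetical and simplifies how the real span getter handles the final window.

```python
def strided_windows(n_tokens, window=128, stride=96):
    """Return (start, end) token offsets of overlapping windows over a document."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # the document is fully covered, no further windows needed
            break
        start += stride
    return spans

print(strided_windows(300))
```

With these settings, consecutive windows share `window - stride` = 32 tokens, so a token near the edge of one window still shows up with more context in the next one.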



In [None]:
# compare to medium model
nlp = spacy.load('en_core_web_md')
trf_ex1 = en(ex1)
trf_ex2 = en(ex2)
md_ex1 = nlp(ex1)
md_ex2 = nlp(ex2)
# BERT "token vector" is larger than GloVe
print(trf_ex1.vector.shape)
print(md_ex1.vector.shape)

(768,)
(300,)


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
# we know "Flintstones" in ex1 and ex2 are different - but word vectors don't
print('word vector-based similarity \n', 
      cosine_similarity([md_ex1[0].vector, md_ex2[0].vector]))
# the power of transformers! 
print('transformer-based similarity \n', 
      cosine_similarity([trf_ex1[0].vector, trf_ex2[0].vector]))

word vector-based similarity 
 [[1.0000002 1.0000002]
 [1.0000002 1.0000002]]
transformer-based similarity 
 [[1.0000002 0.9413736]
 [0.9413736 1.       ]]
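
In case cosine similarity is new to you: it's the dot product of two vectors divided by the product of their lengths, so identical vectors score 1 and anything else scores lower. A small plain-Python sketch follows. The `cosine` helper and the three-dimensional toy vectors are illustrative only; they are not the real embeddings from the cell above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# a static word vector is the same no matter what surrounds the word
static_ex1 = [0.2, -0.1, 0.7]
static_ex2 = [0.2, -0.1, 0.7]
# a contextual vector shifts with the neighbouring words (made-up values)
context_ex2 = [0.3, -0.2, 0.6]

same = cosine(static_ex1, static_ex2)
shifted = cosine(static_ex1, context_ex2)
print(same, shifted)
```

That's the pattern in the output above: the static vectors for "Flintstones" are identical in both phrases, while the transformer gives the same word a slightly different vector in each context.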


In [None]:
# you CAN play with transformer itself, it's just less friendly for this kind of playing around
en = English()
# using a custom config - uses BioClinicalBERT
# NOTE: definitely check and understand your defaults (spacy recommends it!)
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "emilyalsentzer/Bio_ClinicalBERT",
        "tokenizer_config": {"use_fast": True}
    }
}
en.add_pipe("transformer", config=config)
en.initialize()
ex1 = en('Flintstones vitamins')
# comes from ModelOutput - last hidden state and pooled output
[x.shape for x in ex1._.trf_data.tensors]



[(1, 7, 768), (1, 768)]
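
Notice the first tensor has 7 rows for a two-word text: the transformer works on wordpieces (typically including special tokens such as `[CLS]` and `[SEP]`), not on spaCy tokens. That's why the earlier `Tok2VecTransformer` config needed a `pooling` layer, to turn several wordpiece vectors into one vector per token. Here's a toy plain-Python sketch of mean pooling. The `mean_pool` helper, the 2-dimensional vectors and the wordpiece-to-token alignment are all made up for illustration; I haven't checked the real wordpiece split for this text.

```python
def mean_pool(wordpiece_vectors, alignment):
    """Average the wordpiece vectors aligned to each token."""
    width = len(wordpiece_vectors[0])
    pooled = []
    for piece_ids in alignment:
        rows = [wordpiece_vectors[i] for i in piece_ids]
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(width)])
    return pooled

# 7 wordpiece vectors (hypothetical): [CLS], four pieces, one piece, [SEP]
wordpiece_vectors = [
    [0.0, 0.0],
    [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
    [2.0, 2.0],
    [0.0, 0.0],
]
# assumed alignment: token 0 covers pieces 1-4, token 1 covers piece 5
alignment = [[1, 2, 3, 4], [5]]
token_vectors = mean_pool(wordpiece_vectors, alignment)
print(token_vectors)
```

The special tokens aren't aligned to any spaCy token in this sketch, so they simply drop out of the pooled result.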

Note: a lot of the above is just for messing around in code.  The CLI is much friendlier.  You can assemble a complete pipeline that spaCy can load from a config file with `spacy assemble`.

### Using transformers in training

Let's try to bring in a transformer model for training.  You'll need to replace `project.yml` in `ner_drugs` with the modified version available in the [tutorial GitHub repo](https://github.com/bpben/spacy_ner_tutorial).

In [None]:
# let's do this again, with the modified project and config
# made a special set of commands for trf training on GPU
!cd ner_drugs && spacy project run train_trf
!cd ner_drugs && spacy project run evaluate_trf

Running command: /usr/bin/python3 -m spacy train configs/config_trf.cfg --output training_trf/ --paths.train corpus/drugs_training.spacy --paths.dev corpus/drugs_eval.spacy --gpu-id 0
✔ Created output directory: training_trf
ℹ Saving to output directory: training_trf
ℹ Using GPU: 0
[2022-04-14 13:15:16,181] [INFO] Set up nlp object from config
[2022-04-14 13:15:16,193] [INFO] Pipeline: ['transformer', 'ner']
[2022-04-14 13:15:16,197] [INFO] Created vocabulary
[2022-04-14 13:15:16,199] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expec

Compared to the CNN model, you should see better performance. On this run we're seeing ~7% improvement on F1.   

But this is just with one entity type.  In the real world, we're likely to have multiple.  Let's switch to a clinical context and try this with multiple types.

### n2c2 2018 Medication Identification Challenge
More information is in the slides, but below we take a similar approach to the one above.  We start with a base spaCy model for identifying medications and their attributes.  Then we up the ante with a transformer model.  The results speak for themselves!

In [None]:
#!unzip n2c2.zip

In [None]:
!cd n2c2/ && spacy project run preprocess

Running command: /usr/bin/python3 scripts/preprocess_i2b2.py raw/training_20180910/ corpus/train.spacy
Processed 303 documents: train.spacy
Running command: /usr/bin/python3 scripts/preprocess_i2b2.py raw/test corpus/test.spacy
Processed 202 documents: test.spacy


In [None]:
!cd n2c2/ && spacy project run train
!cd n2c2/ && spacy project run evaluate

Running command: /usr/bin/python3 -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/test.spacy --paths.vectors en_core_web_md --gpu-id 0
✔ Created output directory: training
ℹ Saving to output directory: training
ℹ Using GPU: 0
[2022-04-17 02:09:52,651] [INFO] Set up nlp object from config
[2022-04-17 02:09:52,662] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-04-17 02:09:52,666] [INFO] Created vocabulary
[2022-04-17 02:09:54,761] [INFO] Added vectors: en_core_web_md
[2022-04-17 02:09:55,144] [INFO] Finished initializing nlp object
[2022-04-17 02:10:03,358] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0      

Okay, not bad, but what if we try with the transformer?

In [None]:
!cd n2c2/ && spacy project run train_trf
!cd n2c2/ && spacy project run evaluate_trf

Running command: /usr/bin/python3 -m spacy train configs/config_trf.cfg --output training_trf/ --paths.train corpus/train.spacy --paths.dev corpus/test.spacy --gpu-id 0
✔ Created output directory: training_trf
ℹ Saving to output directory: training_trf
ℹ Using GPU: 0
[2022-04-17 03:37:18,232] [INFO] Set up nlp object from config
[2022-04-17 03:37:18,249] [INFO] Pipeline: ['transformer', 'ner']
[2022-04-17 03:37:18,254] [INFO] Created vocabulary
[2022-04-17 03:37:18,257] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are 

Again - we're seeing benefits from switching to this transformer model, an average 10% gain across entity types.

Is there a way we can see this improvement by actually loading and trying that model?

I've packaged the models off screen; you can get them via git lfs.  The workflow is described [here](https://spacy.io/usage/saving-loading#models-generating).

The cool thing about packaging these models is that spaCy writes all your docs for you.  Models end up as packages that can be installed with pip and loaded by spaCy.

In [None]:
# just uncompressing so we can look at the files in here
#!tar -xvf en_n2c2_cnn-0.0.1.tar.gz
#!tar -xvf en_n2c2_trf-0.0.1.tar.gz

In [81]:
# simple way to get the example models - git LFS
#!apt-get install git-lfs
# now pull the example models - this will take some time ~800 MB
#!cd spacy_ner_tutorial && git lfs pull

Cloning into 'spacy_ner_tutorial'...
remote: Enumerating objects: 54, done.

In [84]:
# install the models
!pip install spacy_ner_tutorial/example_models/en_n2c2_cnn-0.0.1.tar.gz
!pip install spacy_ner_tutorial/example_models/en_n2c2_trf-0.0.1.tar.gz

Processing ./spacy_ner_tutorial/example_models/en_n2c2_cnn-0.0.1.tar.gz
DEPRECATION: Source distribution is being reinstalled despite an installed package having the same name and version as the installed package. pip 21.2 will remove support for this functionality. A possible replacement is use --force-reinstall. You can find discussion regarding this at https://github.com/pypa/pip/issues/8711.
Building wheels for collected packages: en-n2c2-cnn
  Building wheel for en-n2c2-cnn (setup.py) ... done
  Created wheel for en-n2c2-cnn: filename=en_n2c2_cnn-0.0.1-py3-none-any.whl size=35282842 sha256=55f7ff06a4e9f6ba07fb55be1302ddea6d4d40484b0da5c3711e3291b189f134
  Stored in directory: /root/.cache/pip/wheels/41/ff/7e/0337e6e905a7c5984005473be3871b09b485fe074e154eb937
Successfully built en-n2c2-cnn
Installing collected packages: en-n2c2-cnn
  Attempting uninstall: en-n2c2-cnn
    Found existing installation: en-n2c2-cnn 0.0.1
    Uninstalling en-n2c2-cnn-0.0.1:
      Su

In [60]:
# you may need to restart the runtime at this point
import spacy
cnn_n2c2 = spacy.load('en_n2c2_cnn')
trf_n2c2 = spacy.load('en_n2c2_trf')


In [61]:
inp = 'two aspirin tablets twice daily'
print([(e, e.label_) for e in cnn_n2c2(inp).ents])
print([(e, e.label_) for e in trf_n2c2(inp).ents])

[(aspirin, 'Drug'), (tablets, 'Form'), (twice daily, 'Frequency')]
[(two, 'Dosage'), (aspirin, 'Drug'), (tablets, 'Form'), (twice daily, 'Frequency')]
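
If you want to spot the differences without eyeballing two lists, you can diff the predictions. The sketch below hard-codes the two outputs above as `(text, label)` tuples so it runs without the models; with the real pipelines you'd build them from `(e.text, e.label_)`. The `diff_entities` helper is hypothetical.

```python
def diff_entities(a, b):
    """Return the entities found only in a and only in b, keeping their order."""
    only_a = [ent for ent in a if ent not in b]
    only_b = [ent for ent in b if ent not in a]
    return only_a, only_b

# copied from the printed output of the two models above
cnn_ents = [("aspirin", "Drug"), ("tablets", "Form"), ("twice daily", "Frequency")]
trf_ents = [("two", "Dosage"), ("aspirin", "Drug"), ("tablets", "Form"),
            ("twice daily", "Frequency")]

only_cnn, only_trf = diff_entities(cnn_ents, trf_ents)
print("only CNN:", only_cnn)
print("only transformer:", only_trf)
```

On this input the only difference is the `Dosage` entity, which the transformer model picked up and the CNN model missed.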
