# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we will take a look at using spaCy commandline to train and evaluate a NER model. We will also compare it with the pretrained NER model in spacy.

Note: we will create multiple folders during this experiment:
spacyNER_data

## Step 1: Converting data to json structures so it can be used by Spacy

In [1]:
import os

In [2]:
# upload train.txt, test.txt, valid.txt from Data/conll2003/en
try:
    from google.colab import files
    uploaded = files.upload()
except ModuleNotFoundError:
    print('Not using colab')

Saving valid.txt to valid.txt
Saving train.txt to train.txt
Saving test.txt to test.txt


In [3]:
#Read the CONLL data from conll2003 folder, and store the formatted data into a folder spacyNER_data

# !mkdir spacyNER_data
os.mkdir('spacyNER_data')

#the above lines create folder if it doesn't exist. If it does, the output shows a message that it
#already exists and cannot be created again
try:
    import google.colab
    !python -m spacy convert "train.txt" spacyNER_data -c ner
    !python -m spacy convert "test.txt" spacyNER_data -c ner
    !python -m spacy convert "valid.txt" spacyNER_data -c ner
except ModuleNotFoundError:
    !python -m spacy convert "Data/conll2003/en/train.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/test.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/valid.txt" spacyNER_data -c ner

[38;5;4mℹ Auto-detected token-per-line NER format[0m
[38;5;4mℹ Grouping every 1 sentences into a document.[0m
[38;5;3m⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.[0m
[38;5;2m✔ Generated output file (14987 documents):
spacyNER_data/train.spacy[0m
[38;5;4mℹ Auto-detected token-per-line NER format[0m
[38;5;4mℹ Grouping every 1 sentences into a document.[0m
[38;5;3m⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.[0m
[38;5;2m✔ Generated output file (3684 documents): spacyNER_data/test.spacy[0m
[38;5;4mℹ Auto-detected token-per-line NER format[0m
[38;5;4mℹ Grouping every 1 sentences into a document.[0m
[38;5;3m⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.[0m
[38;5;2m✔ Generated output file (3466 documents): spacyNER_data/valid.spacy[0m


#### For example, the data before and after running spacy's convert program looks as follows.

In [4]:
try:
    import google.colab
    !echo "BEFORE : (train.txt)"
    !head "train.txt" -n 11 | tail -n 9
except ModuleNotFoundError:
    print("BEFORE : (Data/conll2003/en/train.txt)")
    file = open("Data/conll2003/en/train.txt")
    content = file.readlines()
    print(*content[1:11])

BEFORE : (train.txt)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O


In [5]:
try:
    import google.colab
    !echo "AFTER : (spacyNER_data/train.spacy)"
    !head "spacyNER_data/train.spacy" -n 77 | tail -n 58
except ModuleNotFoundError:
    print("AFTER : (spacyNER_data/train.json)")
    f = open('spacyNER_data/train.json')
    content = f.readlines()
    print(*content[19:77])

AFTER : (spacyNER_data/train.spacy)
y���|g�!Ly1Z��m�R��zz�bN*~#U��d��>~�9���Tm ŜT��>_���}�n�bNjܦ5�U�Bk|�r�����2��5�}�������<+m������o�7S��d�o?���z�r�*�����o=��o_���do�H�m�A1'�wޏ[����sR~��9��*Zu��]3��k:�2{w-a�
5\`ʩ���#Q��S�wH+�sܝ������؜=U!��(������;��zh���`p���5K���&�ˀ�7��ov�]:��ح�z:�=������7��a9o�7v�0�g�Zw�����'�~�i�p������ʓ��yBP��^$)?
����A���v� ;�x�>�������msW�r�I��2_ UKk~J��D���=b�i�bN,�t_��J��e�����׵�}�O<ssR��#/�P��'A1'�?�'Ҋh���%3c���D�bN��l���鴠�]P�I���?���o��,P�I}?���];�u���}� Ŝ۾���J���Ŝ�=w��s��h�ٓ_�Kh� ŜԸ�k�L��7���o����Z��:���e�1]�f�bNj��V=�S��ϯ��Ŝ�ߩ�ؓU����Svh�wh�k�k�)�}F��9����,�!4��\��tl1(����S�^n� Ŝ�?�U�t����냿�b�mu��ū�2ׂbNj���1Q�n�@1�� �����۰kg��kI\^�9�����O}d򃧃b�m����{�Ε�����d���y����'��^;?������[�_��c��<����w����:�Z�zZ�'��ȞZ����������i����{����ZP�I��Q���[�� �9)��� �2�d?(��C�_<�h���v&j�oU]�9����'N0v�9����m�
����bN�80���/P�I}?L�)���m�p�h��dP̹m��I�kOˏ�F�
H�V��������w��E�D¤A�����N+������T

## Training the NER model with Spacy (CLI)

All the commandline options can be seen at: https://spacy.io/api/cli#train
We are training using the train program in spacy, for English (en), and the results are stored in a folder
called "model" (created while training). Our training file is in "spacyNER_data/train.json" and the validation file is at: "spacyNER_data/valid.json".

-G stands for gpu option.
-p stands for pipeline, and it should be followed by a comma separated set of options - in this case, a tagger and an NER are being trained simultaneously

In [6]:
!touch base_config.cfg

In [7]:
!python -m spacy init fill-config base_config.cfg config.cfg

[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [8]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
!python -m spacy train config.cfg --output ./output

[38;5;2m✔ Created output directory: output[0m
[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[1m
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     50.06    1.62    1.19    2.51    0.02
  0     200         88.99   2555.50   64.88   65.89   63.90    0.65
  0     400         79.32   1684.47   73.15   72.74   73.57    0.73
  0     600         48.21   1467.83   78.12   78.23   78.01    0.78
  0     800         52.26   1456.66   82.36   82.30   82.42    0.82
  0    1000         89.47   1405.13   78.74   78.35   79.13    0.79
  1    1200         68.10   1517.47   82.93   82.28   83.59    0.83
  1    1400         77.26   1533.52   85.44   84.81   86.08    0.85
  1    1600        292.13   1517.64   86.09   87.03   85.16    0.8

Notice how the performance improves with each iteration!
## Evaluating the model with test data set (`spacyNER_data/test.json`)

### On Trained model (`model/model-best`)

In [12]:
#create a folder to store the output and visualizations.
# !mkdir result
#os.mkdir('result')
!python -m spacy evaluate output/model-best spacyNER_data/test.spacy -dp result
# !python -m spacy evaluate model/model-final data/test.txt.json -dp result

[38;5;4mℹ Using CPU[0m
[1m

TOK     -    
NER P   87.46
NER R   87.02
NER F   87.24
SPEED   1975 

[1m

           P       R       F
PER    93.77   90.23   91.96
MISC   80.35   78.63   79.48
LOC    89.11   90.29   89.70
ORG    82.87   84.17   83.51

<IPython.core.display.HTML object>
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.10/dist-packages/spacy/cli/_util.py", line 87, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 728, in main
    return _main(
  File "/usr/loca

In [13]:
!zip -r spacy_ner_trained_model.zip ./output/model-best/

  adding: output/model-best/ (stored 0%)
  adding: output/model-best/ner/ (stored 0%)
  adding: output/model-best/ner/moves (deflated 60%)
  adding: output/model-best/ner/cfg (deflated 33%)
  adding: output/model-best/ner/model (deflated 8%)
  adding: output/model-best/tokenizer (deflated 81%)
  adding: output/model-best/config.cfg (deflated 61%)
  adding: output/model-best/vocab/ (stored 0%)
  adding: output/model-best/vocab/vectors.cfg (stored 0%)
  adding: output/model-best/vocab/strings.json (deflated 77%)
  adding: output/model-best/vocab/vectors (deflated 8%)
  adding: output/model-best/vocab/lookups.bin (stored 0%)
  adding: output/model-best/vocab/key2row (deflated 16%)
  adding: output/model-best/tok2vec/ (stored 0%)
  adding: output/model-best/tok2vec/cfg (stored 0%)
  adding: output/model-best/tok2vec/model (deflated 8%)
  adding: output/model-best/meta.json (deflated 59%)


In [14]:
from google.colab import files
files.download("spacy_ner_trained_model.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
# test model
import spacy
from spacy import displacy

# Load the trained model
nlp = spacy.load("output/model-best")

# Load text to test
with open("valid.txt", "r") as file:
    text = file.read()

doc = nlp(text)

In [1]:
displacy.render(doc, style="ent", jupyter=True)

NameError: name 'displacy' is not defined

a Visualization of the entity tagged test data can be seen in result/entities.html folder.

### On spacy's Pretrained NER model (`en_core_web_sm`)

In [None]:
!python -m spacy download en_core_web_sm

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
# !mkdir pretrained_result
os.mkdir('pretrained_result')
!python -m spacy evaluate en_core_web_sm spacyNER_data/test.json -dp pretrained_result

[1m

Time      7.19 s
Words     46666 
Words/s   6490  
TOK       100.00
POS       86.21 
UAS       0.00  
LAS       0.00  
NER P     6.51  
NER R     9.17  
NER F     7.62  
Textcat   0.00  

  "__main__", mod_spec)
  "__main__", mod_spec)
  "__main__", mod_spec)
  "__main__", mod_spec)
  "__main__", mod_spec)
  "__main__", mod_spec)
  "__main__", mod_spec)
[38;5;2m✔ Generated 25 parses as HTML[0m
pretrained_result


a Visualization of the entity tagged test data can be seen in pretrained_result/entities.html folder.