# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we use the spaCy command-line interface to train and evaluate an NER model, and then compare it with one of spaCy's pretrained NER models.

Note: we will create several folders during this experiment: spacyNER_data, model, result, and pretrained_result.

## Step 1: Converting the data to spaCy's binary `.spacy` format

In [None]:
import os

In [None]:
# Upload train.txt, test.txt, valid.txt (from Data/conll2003/en) plus base_config.cfg and base_config_1.cfg
try:
    from google.colab import files
    uploaded = files.upload()
except ModuleNotFoundError:
    print('Not using colab')

Saving valid.txt to valid.txt
Saving test.txt to test.txt
Saving train.txt to train.txt
Saving base_config_1.cfg to base_config_1.cfg
Saving base_config.cfg to base_config.cfg


In [None]:
# Read the CoNLL data and store the converted files in the folder spacyNER_data

# !mkdir spacyNER_data
os.makedirs('spacyNER_data', exist_ok=True)

# exist_ok=True creates the folder only if it is missing, so re-running this
# cell does not raise FileExistsError
try:
    import google.colab 
    !python -m spacy convert "train.txt" spacyNER_data -c ner
    !python -m spacy convert "test.txt" spacyNER_data -c ner
    !python -m spacy convert "valid.txt" spacyNER_data -c ner
except ModuleNotFoundError:
    !python -m spacy convert "Data/conll2003/en/train.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/test.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/valid.txt" spacyNER_data -c ner

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (6 documents): spacyNER_data/train.spacy
ℹ Auto-detected token-per-line NER format
⚠ No sentence boundaries found to use with option `-n 1`. Use `-s` to
automatically segment sentences or `-n 0` to disable.
⚠ No sentence boundaries found. Use `-s` to automatically segment
sentences.
⚠ No document delimiters found. Use `-n` to automatically group
sentences into documents.
✔ Generated output file (1 documents): spacyNER_data/test.spacy
ℹ Auto-detected token-per-line NER format
⚠ No sentence boundaries found to use with option `-n 1`. Use `-s` to
automatically segment sentences or `-n 0` to disable.
⚠ No sentence boundaries found. Use

#### For example, the data before and after running spaCy's convert command look as follows.

In [None]:
try:
    import google.colab
    !echo "BEFORE : (train.txt)"
    !head "train.txt" -n 11 | tail -n 9
except ModuleNotFoundError:
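    print("BEFORE : (Data/conll2003/en/train.txt)")
    # Use a context manager so the file is closed after reading
    with open("Data/conll2003/en/train.txt") as file:
        content = file.readlines()
    # Lines 3-11, matching the head/tail command above
    print(''.join(content[2:11]))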

BEFORE : (train.txt)
de O O
bautismo O O
Ella O O
se O O
encuentra O O
asentada O O
al O O
folio O O
57 O O
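
The three-column, token-per-line layout shown above can be read with a few lines of plain Python. The sketch below only illustrates the input format and is not what `spacy convert` does internally. It assumes whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and BIO-style tags such as `B-PER`/`I-PER` (the excerpt above shows only `O`). The function names `read_conll` and `extract_entities` and the sample text are hypothetical.

```python
def read_conll(text):
    """Group token-per-line text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Collect (label, text) spans from one BIO-tagged sentence."""
    entities, tokens, label = [], [], None
    for token, tag in sentence:
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts here; flush any open one first
            if tokens:
                entities.append((label, " ".join(tokens)))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:
            # An "O" tag closes any open entity
            if tokens:
                entities.append((label, " ".join(tokens)))
            tokens, label = [], None
    if tokens:
        entities.append((label, " ".join(tokens)))
    return entities


# Hypothetical sample in the same three-column layout
sample = """Ella O O
se O O
encuentra O O

Juan O B-PER
Perez O I-PER
en O O
Lima O B-LOC
"""

sentences = read_conll(sample)
print(len(sentences))
print(extract_entities(sentences[1]))
```

Taking the tag from the last column keeps the sketch usable for files with a different number of middle columns.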


In [None]:
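# train.spacy is a binary DocBin file, so load it with spaCy rather than reading it as text
import spacy
from spacy.tokens import DocBin

print("AFTER : (spacyNER_data/train.spacy)")
doc_bin = DocBin().from_disk("spacyNER_data/train.spacy")
docs = list(doc_bin.get_docs(spacy.blank("es").vocab))
# Show the first tokens of the first document with their entity tags
for token in docs[0][:20]:
    print(token.text, token.ent_iob_, token.ent_type_)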

In [None]:
!pip install spacy-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-transformers
  Downloading spacy_transformers-1.1.6-py2.py3-none-any.whl (51 kB)
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.8.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting transformers<4.20.0,>=3.4.0
  Downloading transformers-4.19.4-py3-none-any.whl (4.2 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)

In [None]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


## Training the NER model with spaCy (CLI)

All command-line options are listed at: https://spacy.io/api/cli#train
We train with spaCy's `train` command for Spanish (es), and the results are stored in a folder called "model" (created during training). The training file is "spacyNER_data/train.spacy" and the validation file is "spacyNER_data/valid.spacy".
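
Outside a notebook, the same `spacy train` call can be assembled and launched from plain Python. The sketch below only builds the argument list, using the flags that appear in this notebook (`--paths.train`, `--paths.dev`, `-o`, `--gpu-id`). The helper name `build_train_command` is hypothetical, and the `subprocess.run` call is left commented out because it assumes spaCy and the converted data files are available.

```python
import sys


def build_train_command(config, train, dev, output="model", gpu_id=None):
    """Return the argv list for `python -m spacy train` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "spacy", "train", config,
        "--paths.train", train,
        "--paths.dev", dev,
        "-o", output,
    ]
    if gpu_id is not None:
        # Leave --gpu-id out to train on CPU
        cmd += ["--gpu-id", str(gpu_id)]
    return cmd


cmd = build_train_command(
    "config.cfg",
    "spacyNER_data/train.spacy",
    "spacyNER_data/valid.spacy",
    gpu_id=0,
)
# Print the command without the interpreter path
print(" ".join(cmd[1:]))

# To actually start training (requires spaCy and the converted data):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.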

Language: Spanish

Components: ner

Hardware: GPU (Transformers)

Optimize for: Accuracy

In [None]:
!python -m spacy train config.cfg --paths.train spacyNER_data/train.spacy --paths.dev spacyNER_data/valid.spacy -o model --gpu-id 0

ℹ Saving to output directory: model
ℹ Using GPU: 0
[2022-06-21 00:30:44,206] [INFO] Set up nlp object from config
[2022-06-21 00:30:44,216] [INFO] Pipeline: ['transformer', 'ner']
[2022-06-21 00:30:44,220] [INFO] Created vocabulary
[2022-06-21 00:30:44,222] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel fr

Language: Spanish

Components: ner

Hardware: CPU (tok2vec)

Optimize for: Efficiency

In [None]:
!python -m spacy init fill-config base_config_1.cfg config_1.cfg

✔ Auto-filled config with all values
✔ Saved config
config_1.cfg
You can now add your data and train your pipeline:
python -m spacy train config_1.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy download es_core_news_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es-core-news-lg==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-3.3.0/es_core_news_lg-3.3.0-py3-none-any.whl (568.0 MB)
Installing collected packages: es-core-news-lg
Successfully installed es-core-news-lg-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_lg')


In [None]:
!python -m spacy train config_1.cfg --paths.train spacyNER_data/train.spacy --paths.dev spacyNER_data/valid.spacy -o model --gpu-id 0

ℹ Saving to output directory: model
ℹ Using GPU: 0
[2022-06-21 13:59:47,803] [INFO] Set up nlp object from config
[2022-06-21 13:59:47,812] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-06-21 13:59:47,816] [INFO] Created vocabulary
[2022-06-21 13:59:53,248] [INFO] Added vectors: es_core_news_lg
tcmalloc: large alloc 1200005120 bytes == 0x1212c2000 @  0x7fed26d9f2a4 0x7fed1bb3c87d 0x7fed1bb3ba5d 0x7fed1bb38235 0x7fed1bb389de 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x5127f1 0x7feb4e0d0238 0x7feb4e0d21fe 0x7feb4e0d90f7 0x7feb4e0dc7ff 0x594b72 0x515600 0x593dd7 0x5118f8 0x7feb4e0d0238 0x7feb4e0d21fe 0x7feb4e0d8dd0 0x593835 0x548c51 0x5127f1 0x549e0e 0x593fce 0x548ae9 0x5127f1 0x549e0e 0x593fce
tcmalloc: large alloc 1213743104 bytes == 0xfd68c000 @  0x7fed26d9f2a4 0x7fed1bb39f93 0x7fed1bb3ba5d 0x7fed1bb38235 0x7fed1bb389de 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x5127f1 0x593dd7 0x548ae9 0x51566f 0x7feb4e0d0238 0x7feb4e0d21fe 0x7feb4e0d8dd0 0x593835 0x5

Notice how the performance improves with each iteration!

## Evaluating the model with the test data set (`spacyNER_data/test.spacy`)

### On the trained model (`model/model-last`)

In [None]:
# Create a folder to store the output and visualizations.
# !mkdir result
os.makedirs('result', exist_ok=True)
!python -m spacy evaluate model/model-last spacyNER_data/test.spacy -dp result
# !python -m spacy evaluate model/model-final data/test.txt.json -dp result

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

TOK     -
NER P   81.72
NER R   85.21
NER F   83.43
SPEED   4418

           P       R       F
PER    82.64   87.18   84.85
LOC    76.92   75.76   76.34
DATE   84.75   91.74   88.11

✔ Generated 25 parses as HTML
result


A visualization of the entity-tagged test data is saved as result/entities.html.
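
The F-score in the evaluation output is the harmonic mean of precision and recall, F = 2PR / (P + R). As a quick sanity check, the sketch below recomputes it from the P and R values reported above. `f_score` is a hypothetical helper name, and small rounding differences are possible because the inputs are already rounded to two decimals.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the evaluation output above
reported = {
    "overall": (81.72, 85.21),
    "PER": (82.64, 87.18),
    "LOC": (76.92, 75.76),
    "DATE": (84.75, 91.74),
}

for name, (p, r) in reported.items():
    print(f"{name:8s} F = {f_score(p, r):.2f}")
```

The same helper can be applied to the P and R values of any other evaluation run to check that the reported F column is consistent.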

### On spaCy's pretrained NER model (`es_core_news_sm`)

In [None]:
!python -m spacy download es_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.2.5/es_core_news_sm-2.2.5.tar.gz (16.2 MB)
Building wheels for collected packages: es-core-news-sm
  Building wheel for es-core-news-sm (setup.py) ... done
  Created wheel for es-core-news-sm: filename=es_core_news_sm-2.2.5-py3-none-any.whl size=16172933 sha256=9cc5393dde8d34d493faecb300e2c9e3e0b4d7053997d1145da5724cb07d03d9
  Stored in directory: /tmp/pip-ephem-wheel-cache-ygsgwr46/wheels/21/8d/a9/6c1a2809c55dd22cd9644ae503a52ba6206b04aa57ba83a3d8
Successfully built es-core-news-sm
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')


In [None]:
# !mkdir pretrained_result
os.makedirs('pretrained_result', exist_ok=True)
!python -m spacy evaluate es_core_news_sm spacyNER_data/test.spacy -dp pretrained_result

Time      1.79 s
Words     10375
Words/s   5781
TOK       100.00
POS       0.00
UAS       0.00
LAS       0.00
NER P     28.24
NER R     49.61
NER F     35.99
Textcat   0.00

✔ Generated 25 parses as HTML
pretrained_result


A visualization of the entity-tagged test data is saved as pretrained_result/entities.html.