# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we use the spaCy command-line interface to train and evaluate an NER model on the CoNLL-2003 data. We also compare it with spaCy's pretrained NER model.

Note: this experiment creates several folders: Data, spacyNER_data, model, result, and pretrained_result.

## Step 1: Converting the data to JSON so it can be used by spaCy

In [1]:
!wget -P Data https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/test.txt
!wget -P Data https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/train.txt
!wget -P Data https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/valid.txt    

--2021-11-04 00:17:11--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 748095 (731K) [text/plain]
Saving to: ‘Data/test.txt’


2021-11-04 00:17:11 (9.36 MB/s) - ‘Data/test.txt’ saved [748095/748095]

--2021-11-04 00:17:12--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283420 (3.1M) [text/plain]
Saving to: ‘Data/train.txt’


20

In [2]:
# Read the CoNLL data from the Data folder and store the converted files in a folder named spacyNER_data.
!mkdir spacyNER_data
# The line above creates the folder if it doesn't exist. If it does, the output shows a message that it
# already exists and cannot be created again.
!python3 -m spacy convert "Data/train.txt" spacyNER_data -c ner
!python3 -m spacy convert "Data/test.txt" spacyNER_data -c ner
!python3 -m spacy convert "Data/valid.txt" spacyNER_data -c ner

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents): spacyNER_data/train.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): spacyNER_data/test.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/valid.json
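
To make the conversion less of a black box, here is a minimal pure-Python sketch of what it does to each sentence. It is not spaCy's implementation. The helper names `read_conll`, `iob_to_biluo`, and `to_tokens` are hypothetical, and the sketch assumes IOB2 input tags (every entity starts with `B-`), four whitespace-separated columns per row, and blank lines or `-DOCSTART-` rows as sentence boundaries.

```python
def read_conll(lines):
    """Group token-per-line rows into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and -DOCSTART- rows mark boundaries.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, _chunk, ner = line.split()
        current.append((word, pos, ner))
    if current:
        sentences.append(current)
    return sentences


def iob_to_biluo(tags):
    """Convert IOB2 tags (B-X, I-X, O) to BILUO tags (B-, I-, L-, U-, O)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # does this entity go on?
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


def to_tokens(sentence):
    """Build token dicts with the "orth", "tag" and "ner" keys used in the JSON."""
    words, pos_tags, ner_tags = zip(*sentence)
    return [
        {"orth": w, "tag": p, "ner": n}
        for w, p, n in zip(words, pos_tags, iob_to_biluo(list(ner_tags)))
    ]


sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
""".splitlines()

for token in to_tokens(read_conll(sample)[0]):
    print(token)
```

In this sketch a single-token entity such as `B-ORG` becomes `U-ORG`, and the last token of a multi-token entity gets an `L-` prefix.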


#### For example, the data before and after running spaCy's convert program look as follows.

In [3]:
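# Show lines 3-11 of the raw CoNLL file (the first sentence).
!echo "BEFORE : (Data/train.txt)"
!head "Data/train.txt" -n 11 | tail -n 9
# Show the same sentence in the converted file, which is stored at spacyNER_data/train.json.
!echo "\nAFTER : (Data/train.json)"
!head "spacyNER_data/train.json" -n 64 | tail -n 49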

BEFORE : (Data/train.txt)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
\nAFTER : (Data/train.json)
        ]
      }
    ]
  },
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"EU",
                "tag":"NNP",
                "ner":"U-ORG"
              },
              {
                "orth":"rejects",
                "tag":"VBZ",
                "ner":"O"
              },
              {
                "orth":"German",
                "tag":"JJ",
                "ner":"U-MISC"
              },
              {
                "orth":"call",
                "tag":"NN",
                "ner":"O"
              },
              {
                "orth":"to",
                "tag":"TO",
                "ner":"O"
              },
              {
                "orth":"boycott",
              

## Training the NER model with spaCy (CLI)

All the command-line options are listed at: https://spacy.io/api/cli#train
We train with spaCy's train command (spaCy v2 syntax) for English (en), and the results are stored in a folder
called "model" (created during training). The training file is "spacyNER_data/train.json" and the validation file is "spacyNER_data/valid.json".

-G enables gold preprocessing (--gold-preproc), so training uses the tokenization and sentence boundaries from the converted data. In spaCy v2 the GPU option is the lowercase -g (--use-gpu), which is not used here.
-p stands for pipeline and is followed by a comma-separated list of components. Here a tagger and an NER component are trained together.
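
To see how the positional arguments and options fit together, the command can be assembled piece by piece. This is only a sketch: `build_train_command` is a hypothetical helper, not part of spaCy, and it reproduces the spaCy v2 argument order used in this notebook.

```python
def build_train_command(lang, output_dir, train_path, dev_path,
                        pipeline=("tagger", "ner"), flags=("-G",)):
    """Assemble the argument list for the train command used in this notebook."""
    # Positional arguments: language, output folder, training file, validation file.
    cmd = ["python3", "-m", "spacy", "train", lang, output_dir, train_path, dev_path]
    cmd += list(flags)                   # boolean switches such as -G
    cmd += ["-p", ",".join(pipeline)]    # comma-separated pipeline components
    return cmd


cmd = build_train_command(
    "en", "model", "spacyNER_data/train.json", "spacyNER_data/valid.json"
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` when running outside a notebook.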

In [4]:
# %time !python3 -m spacy train en model spacyNER_data/train.json spacyNER_data/valid.json -G -p tagger,ner
#Wall time: 32min 29s

In [5]:
import time
start = time.time()  # wall-clock time before training starts
!python3 -m spacy train en model spacyNER_data/train.json spacyNER_data/valid.json -G -p tagger,ner
end = time.time()
print("Time taken for training is: {:.2f} hrs".format((end - start) / 3600.0))

✔ Created output directory: model
Training pipeline: ['tagger', 'ner']
Starting with blank model 'en'
Counting training words (limit=0)

Itn  Tag Loss    Tag %    NER Loss   NER P   NER R   NER F   Token %  CPU WPS
---  ---------  --------  ---------  ------  ------  ------  -------  -------
  1  31330.681    94.249  16686.563  82.790  82.413  82.601  100.000    11524
  2  16879.955    94.897   7739.659  86.715  85.796  86.253  100.000    11811
  3  13688.029    95.174   5244.834  87.638  86.974  87.305  100.000    11571
  4  11794.738    95.281   3891.267  88.215  87.933  88.074  100.000    11790
  5  10520.640    95.343   3137.274  88.446  88.118  88.282  100.000    11642
  6   9585.011    95.399   2486.873  88.686  88.253  88.469  100.000    11737
  7   8943.334    95.457   2304.852  88.563  87.967  88.264  100.000    11698
  8   8428.171    95.486   1948.291  88.650  88.068  88.358  100.000    11592
  9   8042.336    95.500   2002.403  88.727  8

Notice how performance improves over the first few iterations: NER F rises from 82.6 after iteration 1 to about 88.5 by iteration 6 and then levels off.

## Evaluating the model with test data set (`spacyNER_data/test.json`)

### On Trained model (`model/model-best`)

In [6]:
#create a folder to store the output and visualizations. 
!mkdir result
!python3 -m spacy evaluate model/model-best spacyNER_data/test.json -dp result
# !python -m spacy evaluate model/model-final data/test.txt.json -dp result

Time      4.13 s
Words     46666 
Words/s   11307 
TOK       100.00
POS       95.22 
UAS       0.00  
LAS       0.00  
NER P     81.68 
NER R     81.78 
NER F     81.73 
Textcat   0.00

✔ Generated 25 parses as HTML
result
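
The NER F value in the report is the harmonic mean of NER P (precision) and NER R (recall). The short sketch below recomputes it from the two reported numbers. `f_score` is a hypothetical helper name, and because P and R are already rounded in the report, small differences in the last digit are possible.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    return 2 * precision * recall / (precision + recall)


# NER P and NER R from the evaluation of model/model-best above.
print(round(f_score(81.68, 81.78), 2))  # prints 81.73
```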


A visualization of the entity-tagged test data is saved in the file result/entities.html.
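
Besides the HTML rendering, the trained pipeline can also be inspected directly in Python. The sketch below assumes that `model/model-best` exists and that the installed spaCy version matches the one used for training. `extract_entities` is a hypothetical helper, and the model is only loaded if the folder is present.

```python
import os


def extract_entities(nlp, text):
    """Run a spaCy pipeline on text and return (entity text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


MODEL_DIR = "model/model-best"  # written by the training step above
if os.path.isdir(MODEL_DIR):
    import spacy  # imported lazily so the helper itself has no hard dependency

    nlp = spacy.load(MODEL_DIR)
    print(extract_entities(nlp, "EU rejects German call to boycott British lamb."))
```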

### On spaCy's pretrained NER model (`en`)

In [7]:
!mkdir pretrained_result
!python3 -m spacy evaluate en spacyNER_data/test.json -dp pretrained_result

Time      7.57 s
Words     46666 
Words/s   6168  
TOK       100.00
POS       86.21 
UAS       0.00  
LAS       0.00  
NER P     6.51  
NER R     9.17  
NER F     7.62  
Textcat   0.00

✔ Generated 25 parses as HTML
pretrained_result


A visualization of the entity-tagged test data is saved in the file pretrained_result/entities.html.
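
One plausible reason for the very low NER scores of the pretrained model is a label mismatch rather than poor entity detection. spaCy's pretrained English models are trained on OntoNotes 5, whose labels (PERSON, GPE, NORP, DATE, and so on) differ from the PER, LOC, ORG, and MISC labels of CoNLL-2003, so most predictions cannot match the gold annotations. The sketch below shows how predictions could be mapped onto the CoNLL labels before scoring. The mapping is illustrative only and is an assumption, not an official correspondence, and the `predicted` list is made-up example data.

```python
# Illustrative (assumed) mapping from OntoNotes-style labels to CoNLL-2003 labels.
ONTONOTES_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
}


def remap_entities(entities, mapping=ONTONOTES_TO_CONLL):
    """Rename labels and drop entities whose label has no CoNLL counterpart."""
    return [(text, mapping[label]) for text, label in entities if label in mapping]


# Hypothetical predictions from an OntoNotes-trained model.
predicted = [("EU", "ORG"), ("German", "NORP"), ("British", "NORP"), ("Thursday", "DATE")]
print(remap_entities(predicted))
```

Entities with labels such as DATE have no CoNLL equivalent and are dropped, which removes a source of false positives when scoring against the CoNLL annotations.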