# Trying Custom NER with spaCy, Following a TDS Article


In [204]:
# All imports needed for the training and testing of the model
import spacy
import random
import pandas as pd
import json
import logging
import sys
import numpy as np

First, we read in the dataset that the model will be trained and tested on. The dataset can be found [here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus).

In [10]:
df = pd.read_csv('ner_dataset.csv')

In [11]:
df.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


This helper function returns the character offset of the nth occurrence (counting from 0) of a substring in a string, or -1 if there is no such occurrence. We use it to build the character-offset annotations that spaCy expects for training.

In [136]:
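def findnth(string, substring, n):
    """Return the start index of the nth (0-based) occurrence of substring, or -1 if there is none."""
    parts = string.split(substring, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(string) - len(parts[-1]) - len(substring)

# Second occurrence (n = 1) of 'the' in the first sentence of the dataset
findnth('Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .', 'the', 1)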

93
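
One caveat worth checking: `findnth` matches raw substrings, so a short word can also be found inside a longer one. The sketch below repeats the helper so it runs on its own; the sample string is made up for illustration.

```python
def findnth(string, substring, n):
    parts = string.split(substring, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(string) - len(parts[-1]) - len(substring)

text = "the other the"
first = findnth(text, "the", 0)    # 0: the first word
second = findnth(text, "the", 1)   # 5: inside "other", not a word of its own
third = findnth(text, "the", 2)    # 10: the last word
missing = findnth(text, "the", 3)  # -1: there is no fourth occurrence
print(first, second, third, missing)
```

Offsets like the second one do not line up with token boundaries, which may be one reason some entities get skipped later when the data is converted.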

This function converts the rows of the .csv into a list of `(sentence, {'entities': [(start, end, tag), ...]})` tuples, which hold the information needed to train the spaCy model.

In [181]:
def format_data(df, itns):
    ret = []  # the returned list of (sentence, annotations) examples
    sentence_col = list(df[~df['Sentence #'].isnull()].index)  # indices of the rows that start a sentence
    for i in range(itns):  # take the first itns sentences (requires at least itns + 1 sentence starts)
        rows = df[sentence_col[i]:sentence_col[i+1]]  # the rows that make up one sentence
        sentence = ' '.join(rows['Word'])
        dictionary = {'entities': []}  # (start index, end index, tag) tuples for this sentence
        tags = rows[rows['Tag'] != 'O']  # rows whose token is tagged as an entity
        t = set()  # (start, end) spans that have already been assigned
        for j in tags.index:
            w = tags['Word'][j]
            it = 0
            while True:  # walk through the occurrences of w until one is not already taken
                start = findnth(sentence, w, it)
                tu = (start, start + len(w))
                if tu not in t:
                    dictionary['entities'].append((tu[0], tu[1], tags['Tag'][j]))
                    t.add(tu)  # remember the span so a repeated word maps to its next occurrence
                    break
                it += 1
        ret.append((sentence, dictionary))  # one training example per sentence
    return ret
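
Because each sentence is built by joining its tokens with single spaces, the offsets could also be computed directly from the token lengths, with no string search involved. A minimal sketch, assuming the tokens and tags of one sentence are available as lists; `entity_offsets` and the `B-geo` tag on the sample are illustrative and not part of the original notebook.

```python
def entity_offsets(words, tags):
    """Build (text, {'entities': [...]}) from parallel token and tag lists."""
    entities = []
    pos = 0  # character offset of the current token in the joined text
    for word, tag in zip(words, tags):
        if tag != "O":
            entities.append((pos, pos + len(word), tag))
        pos += len(word) + 1  # +1 for the single space added by ' '.join
    return " ".join(words), {"entities": entities}

words = ["Thousands", "of", "demonstrators", "have", "marched", "through", "London"]
tags = ["O", "O", "O", "O", "O", "O", "B-geo"]
text, annot = entity_offsets(words, tags)
print(annot["entities"])
print(text[48:54])
```

Every offset produced this way starts and ends on a token of the joined sentence, so repeated words and words contained in other words need no special handling.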

In [182]:
x = format_data(df,1000) #store the formatted data in a variable




A few more imports are needed to convert the data for training:

In [205]:
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

## spaCy v3.0 Addition
spaCy v3 trains from binary `.spacy` files rather than the old list-of-tuples format, so we need to convert the formatted data into a `DocBin` and save it to disk.

In [187]:
def convert(TRAIN_DATA):
    nlp = spacy.blank("en")  # blank English pipeline, used only for tokenization
    db = DocBin()  # container that serializes Doc objects to disk
    for text, annot in tqdm(TRAIN_DATA):  # data in the previous (text, annotations) format
        doc = nlp.make_doc(text)  # create a Doc object from the text
        ents = []
        for start, end, label in annot["entities"]:  # character offsets and tag
            # "contract" snaps the span inward to token boundaries; None means no whole token fits
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the entities
        db.add(doc)

    db.to_disk("./train.spacy")  # save the DocBin object

In [188]:
convert(x)

 27%|████████████████████▋                                                        | 269/1000 [00:00<00:00, 1349.64it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity

 55%|██████████████████████████████████████████▋                                  | 554/1000 [00:00<00:00, 1374.75it/s]


Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity


100%|████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1437.56it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
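
The "Skipping entity" messages are printed whenever `char_span` returns `None`. A small sketch of how the alignment modes behave when the character offsets cut through a token; the sample text and the label `"gpe"` are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp.make_doc("British troops")

# Characters 0-4 cover "Brit", which is only part of the first token.
strict = doc.char_span(0, 4, label="gpe")  # the default mode is "strict"
contract = doc.char_span(0, 4, label="gpe", alignment_mode="contract")
expand = doc.char_span(0, 4, label="gpe", alignment_mode="expand")

print(strict, contract, expand)
```

With `"contract"`, a character range that contains no complete token yields `None`, whereas `"expand"` grows the span to cover the whole token. Offsets that fall inside a longer word would therefore be skipped by the conversion above.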





The training data is now saved as `train.spacy` in our working directory.
Next, we must create a config file so that we can run training and evaluation with the spaCy CLI tools.

In [191]:
!python -m spacy init config --lang en --pipeline ner /content/ner_demo/configs/config.cfg --force

[!] To generate a more effective transformer-based config (GPU-only), install
the spacy-transformers package and re-run this command. The config generated now
does not use transformers.
[i] Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
\content\ner_demo\configs\config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


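This should give us the default configuration file for spaCy NER. Next, we set aside test data and format it the same way. Note that `df[3000:]` slices by row (one token per row), not by sentence, and the 1000 training sentences span far more than 3000 rows, so this test set overlaps the training data and the scores below are likely optimistic.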

In [206]:
df1 = df[3000:]
x1 = format_data(df1,1000)
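
Since the dataset has one token per row, a split that keeps whole sentences apart has to be made at a row where a sentence starts. A minimal sketch in plain Python, assuming the `Sentence #` column is passed in as a list in which only the first row of each sentence holds a string; `first_row_of_sentence` is a hypothetical helper, not part of the original notebook.

```python
def first_row_of_sentence(sentence_ids, k):
    """Return the row position at which the k-th (0-based) sentence starts."""
    # Rows that continue a sentence hold NaN (a float) or None, never a string.
    starts = [i for i, s in enumerate(sentence_ids) if isinstance(s, str)]
    return starts[k]

nan = float("nan")  # what pandas stores in the empty 'Sentence #' cells
ids = ["Sentence: 1", nan, nan, "Sentence: 2", nan, "Sentence: 3", nan, nan]

# Keep the first two sentences for training; everything from `cut` on is held out.
cut = first_row_of_sentence(ids, 2)
print(cut)
```

With the real data, this could be applied as `cut = first_row_of_sentence(list(df['Sentence #']), 1000)` followed by `df.iloc[cut:].reset_index(drop=True)`. Resetting the index keeps row labels and row positions in agreement, which `format_data` relies on when it slices with `sentence_col`.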



In [198]:
def convert_test(TEST_DATA):
    nlp = spacy.blank("en")  # blank English pipeline, used only for tokenization
    db = DocBin()  # container that serializes Doc objects to disk
    for text, annot in tqdm(TEST_DATA):  # data in the previous (text, annotations) format
        doc = nlp.make_doc(text)  # create a Doc object from the text
        ents = []
        for start, end, label in annot["entities"]:  # character offsets and tag
            # "contract" snaps the span inward to token boundaries; None means no whole token fits
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the entities
        db.add(doc)

    db.to_disk("./test.spacy")  # save the DocBin object

In [199]:
convert_test(x1)

 26%|████████████████████                                                         | 261/1000 [00:00<00:00, 1297.05it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity


 66%|███████████████████████████████████████████████████                          | 663/1000 [00:00<00:00, 1263.11it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity


100%|████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1340.26it/s]

Skipping entity
Skipping entity





Now we can train the model, evaluating it on the test set every 10 steps and stopping after 500 steps.

In [207]:
!python -m spacy train /content/ner_demo/configs/config.cfg --output /training/ --paths.train train.spacy --paths.dev test.spacy --training.eval_frequency 10 --training.max_steps 500 --gpu-id -1

[i] Using CPU


[2021-06-03 23:27:48,935] [INFO] Set up nlp object from config
[2021-06-03 23:27:48,941] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-06-03 23:27:48,945] [INFO] Created vocabulary
[2021-06-03 23:27:48,945] [INFO] Finished initializing nlp object
[2021-06-03 23:27:49,807] [INFO] Initialized pipeline components: ['tok2vec', 'ner']


[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     54.09    2.47    2.10    3.00    0.02
  0      10          3.17    781.27    0.00    0.00    0.00    0.00
  0      20          8.47    252.79    0.00    0.00    0.00    0.00
  0      30          3.68    195.59    1.65    7.87    0.92    0.02
  0      40          3.69    184.73   15.91   28.11   11.09    0.16
  0      50          1.90    130.07   14.30   18.15   11.80    0.14
  0      60          3.29    135.75   26.29   27.87   24.88    0.26
  0      70          3.99    161.53   29.89   31.96   28.06    0.30
  0      80          1.83    109.76   25.78   32.17   21.51    0.26
  0      90          3.49    118.67   33.02   34.23   31.89    0.33
  0     100          3.74    128.74   37.83   36.80   38.91    0.38
  0     110          3.
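
ENTS_F is the harmonic mean of ENTS_P (precision) and ENTS_R (recall), which can be checked against any row of the table. A quick sketch using the values logged at steps 90 and 100; `f1` is a helper written for this check.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

step_90 = round(f1(34.23, 31.89), 2)    # ENTS_P and ENTS_R at step 90
step_100 = round(f1(36.80, 38.91), 2)   # ENTS_P and ENTS_R at step 100
print(step_90, step_100)  # 33.02 37.83, matching the ENTS_F column
```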

We see that after 500 training steps, we reach an F1 score of 78.30 (0.78 in the SCORE column), which is quite good. Now, we can test this model on our own examples.

In [208]:
nlp1 = spacy.load(R"\training\model-best")

In [209]:
doc = nlp1("John Lee is the chief of CBSE.")
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter

In [210]:
nlp1 = spacy.load(R"\training\model-best")
doc = nlp1("Americans suffered from H5N1 virus in 2002.")
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter