### Custom NER with spaCy

Let’s install spacy, spacy-transformers, and start by taking a look at the dataset.

In [13]:
# ! pip install spacy[transformers] #  spaCy model pipelines (pretrained BERT, XLNet and GPT-2) that wrap Hugging Face's transformers package, so you can use them in spaCy 

In [14]:
import spacy
from spacy import displacy # displaying NEs
import pandas as pd
import datasets 
import json

In [2]:
nlp = spacy.load("en_core_web_md")

#### Problem example

In [10]:
text = "What video sharing service did Steve Chen, Chad Hurley, and Jawed Karim create in 2005?"
 
displacy.render(nlp(text), style="ent", jupyter=True)

Although this RoBERTa-based model achieves state-of-the-art performance on the CoNLL–2003 dataset it was trained on, it doesn’t perform as well on other kinds of text data. 
For instance, if we try to extract entities from medical journal or other engineering text it won’t detect any relevant information. 

In [9]:
string = "Antiretroviral therapy ( ART ) is recommended for all HIV-infected individuals"
doc = nlp(string)
displacy.render(doc, style="ent", jupyter=True)

To fix this we’ll need to train our own NER model, and the good thing is that spaCy makes that process very straightforward. 

#### Loading dataset 

To train our custom named entity recognition model, we’ll need some relevant text data with the proper annotations. For the purpose of this tutorial, we’ll be using the medical entities dataset available on Kaggle.

<b> We need the text string, the entity start and end indices, and the entity type. </b>

In [23]:
with open('data/archive/Corona2.json', 'r') as f:
    data = json.load(f)
    
# print(data['examples'][0])

Define the NE classes, their stop, start and words

In [26]:
training_data = {'classes' : ['MEDICINE', "MEDICALCONDITION", "PATHOGEN"], 'annotations' : []}

for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end']
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
        training_data['annotations'].append(temp_dict)

print(training_data['annotations'][0])

{'text': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]", 'entities': [(360, 371, 'MEDICINE'), (383, 408, 'MEDICINE'), (104, 112, 'MEDICALCONDITION'), (679,

spaCy uses DocBin class for annotated data, so we’ll have to create the DocBin objects for our training examples. This DocBin class efficiently serializes the information from a collection of Doc objects. It is faster and produces smaller data sizes than pickle, and allows the user to deserialize without executing arbitrary Python code.

In [27]:
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en") # load a new spacy model
doc_bin = DocBin() # create a DocBin object

There are some entity span overlaps, i.e., the indices of some entities overlap. spaCy provides a utility method filter_spans to deal with this

In [29]:
from spacy.util import filter_spans

for training_example  in tqdm(training_data['annotations']): 
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            #print("Skipping entity")
            pass
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents 
    doc_bin.add(doc)

doc_bin.to_disk("training_data.spacy") # save the docbin object

100%|██████████████████████████████████████████████████████████████| 295/295 [00:00<00:00, 1256.10it/s]


Our pipeline needs a config file to run, we'll use the default one from spacy

You can manually create a config file as per the use case or quickly create a base config on spaCy’s training quickstart page [here](https://spacy.io/usage/training#quickstart).

We’ll be working with a base config file using the quickstart page. This is an incomplete file with only our custom options, so we’ll have to fill in the rest with the default values.

In [42]:
# !python3 -m spacy init config base_config.cfg
# Download your base config by checking NER on the link above and saving the file 

In [45]:
!python3 -m spacy init fill-config base_config.cfg config.cfg # Changes stuff in base config if need be, here nothing needs to be changed so we replace config with baseconfig

[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [44]:
!python3 -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy #--gpu-id 0

[38;5;4mℹ Saving to output directory: .[0m
[38;5;4mℹ Using CPU[0m
[1m
[2022-05-20 16:37:37,118] [INFO] Set up nlp object from config
[2022-05-20 16:37:37,126] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-05-20 16:37:37,128] [INFO] Created vocabulary
[2022-05-20 16:37:37,129] [INFO] Finished initializing nlp object
[2022-05-20 16:37:38,373] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    169.43    0.00    0.00    0.00    0.00
  0     200        154.84   4843.70   54.41   55.09   53.74    0.54
  0     400        231.71    793.03   97.10   96.77   97.44    0.97
  1     600        203.94    256.62   98.41   98.35   98.47    0.98
  1     800        188.40    161.89   98.33   98.19   98.47    

In [46]:
nlp_ner = spacy.load("model-best")

doc = nlp_ner("Antiretroviral therapy (ART) is recommended for all HIV-infected\
individuals to reduce the risk of disease progression.\nART also is recommended \
for HIV-infected individuals for the prevention of transmission of HIV.\nPatients \
starting ART should be willing and able to commit to treatment and understand the\
benefits and risks of therapy and the importance of adherence. Patients may choose\
to postpone therapy, and providers, on a case-by-case basis, may elect to defer\
therapy on the basis of clinical and/or psychosocial factors.")

colors = {"PATHOGEN": "#F67DE3", "MEDICINE": "#7DF6D9", "MEDICALCONDITION":"#FFFFFF"}
options = {"colors": colors} 

spacy.displacy.render(doc, style="ent", options= options, jupyter=True)

#### Custom training

#### References 

https://newscatcherapi.com/blog/train-custom-named-entity-recognition-ner-model-with-spacy-v3

Datasets:https://metatext.io/datasets-list/ner-task