# Customizing spaCy Model

If an existing spaCy model does not meet our needs, we can train it further on our own data so that it fits our domain requirements.

<b>Training steps</b>

  1. Annotate and prepare input data
  2. Initialize the model weights
  3. Predict a few examples with the current weights
  4. Compare the predictions with the correct answers
  5. Use an optimizer to calculate weights that improve model performance
  6. Update weights slightly
  7. Go back to step 3
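
The loop in steps 2 to 7 can be illustrated without spaCy at all. The following is a minimal sketch on a toy problem, assuming a model with a single weight `w`, a squared-error loss, and a hypothetical `train` helper with a made-up learning rate; it is not how spaCy implements training internally, only an illustration of "predict, compare, update slightly, repeat".

```python
import random

# Toy data generated from y = 2x, so the single weight w should approach 2.0.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def train(data, epochs=50, learning_rate=0.01, seed=0):
    """Fit y = w * x with plain stochastic gradient descent (illustrative only)."""
    rng = random.Random(seed)
    examples = list(data)
    w = 0.0  # step 2: initialize the weight
    for _ in range(epochs):
        rng.shuffle(examples)  # visit the examples in a new order each epoch
        for x, y in examples:
            prediction = w * x             # step 3: predict with current weight
            error = prediction - y         # step 4: compare with correct answer
            gradient = 2 * error * x       # step 5: derivative of (w*x - y)**2 w.r.t. w
            w -= learning_rate * gradient  # step 6: update the weight slightly
    return w                               # step 7 is the loop itself


w = train(data)
print(round(w, 3))
```

Each update moves `w` only a small step in the direction that reduces the error, which is why several passes over the data are needed.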

#### 1. Annotating and preparing data

* The first step is to prepare the training data in the required format
* After collecting the data, we annotate it
* Annotation means labeling the intent, entities, etc.
* This is an example of annotated data

  `annotated_data = {`
  <br>`"sentence": "Antiviral drugs used against influenza include neuraminidase inhibitors.",`
  <br>`"entities": {`
  <br>              `"label": "Medicine",`
  <br>              `"value": "neuraminidase inhibitors"`
  <br>`}`
  <br>`}`

In [1]:

text = "A patient with chest pain had hyperthyroidism."
entity_1 = "chest pain"
entity_2 = "hyperthyroidism"

# Store annotated data information in the correct format
annotated_data = {"sentence": text, "entities": [{"label": "SYMPTOM", "value": entity_1}, {"label": "DISEASE", "value": entity_2}]}

# Extract start and end characters of each entity
entity_1_start_char = text.find(entity_1)
entity_1_end_char = entity_1_start_char + len(entity_1)
entity_2_start_char = text.find(entity_2)
entity_2_end_char = entity_2_start_char + len(entity_2)

# Store the same input information in the proper format for training
training_data = [(text, {"entities": [(entity_1_start_char,entity_1_end_char,"SYMPTOM"), 
                                      (entity_2_start_char,entity_2_end_char,"DISEASE")]})]
print(training_data)

[('A patient with chest pain had hyperthyroidism.', {'entities': [(15, 25, 'SYMPTOM'), (30, 45, 'DISEASE')]})]
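
The same conversion can be written once for any number of entities. The sketch below assumes a hypothetical helper named `to_spacy_format` and the `{"sentence", "entities": [{"label", "value"}]}` layout used in the cell above; it needs only the standard library.

```python
def to_spacy_format(annotated):
    """Turn a labeled sentence into a (text, {"entities": [...]}) training tuple."""
    text = annotated["sentence"]
    spans = []
    for entity in annotated["entities"]:
        start = text.find(entity["value"])
        if start == -1:
            # Fail loudly instead of silently producing a (-1, ...) offset
            raise ValueError(f"{entity['value']!r} not found in sentence")
        end = start + len(entity["value"])
        spans.append((start, end, entity["label"]))
    return (text, {"entities": spans})


annotated_data = {
    "sentence": "A patient with chest pain had hyperthyroidism.",
    "entities": [
        {"label": "SYMPTOM", "value": "chest pain"},
        {"label": "DISEASE", "value": "hyperthyroidism"},
    ],
}

training_data = [to_spacy_format(annotated_data)]
print(training_data)
```

Note that `str.find` returns only the first occurrence, so a value that appears more than once in a sentence would need explicit offsets in the annotation instead.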


#### Example object data for training

In [2]:
# import required libraries
import spacy
from spacy.training import Example

In [3]:
# Build an Example object from a sample sentence and its annotations
nlp = spacy.load('en_core_web_sm')

doc = nlp('I will visit you in Austin')

annotations = {'entities':[(20,26, 'GPE')]}

example_sentence = Example.from_dict(doc, annotations)

print(example_sentence.to_dict())

{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O', 'O', 'O', 'U-GPE'], 'spans': {}, 'links': {}}, 'token_annotation': {'ORTH': ['I', 'will', 'visit', 'you', 'in', 'Austin'], 'SPACY': [True, True, True, True, True, False], 'TAG': ['', '', '', '', '', ''], 'LEMMA': ['', '', '', '', '', ''], 'POS': ['', '', '', '', '', ''], 'MORPH': ['', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5], 'DEP': ['', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0]}}


In [4]:
example_text = 'A patient with chest pain had hyperthyroidism.'
training_data = [(example_text, {'entities': [(15, 25, 'SYMPTOM'), (30, 45, 'DISEASE')]})]

all_examples = []

# Iterate through text and annotations and convert text to a Doc container
for text, annotations in training_data:
  doc = nlp(text)
  
  # Create an Example object from the Doc container and annotations
  example_sentence = Example.from_dict(doc, annotations)
  print(example_sentence.to_dict(), "\n")
  
  # Append the Example object to the list of all examples
  all_examples.append(example_sentence)
  
print("Number of formatted training data: ", len(all_examples))

{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O', 'B-SYMPTOM', 'L-SYMPTOM', 'O', 'U-DISEASE', 'O'], 'spans': {}, 'links': {}}, 'token_annotation': {'ORTH': ['A', 'patient', 'with', 'chest', 'pain', 'had', 'hyperthyroidism', '.'], 'SPACY': [True, True, True, True, True, True, False, False], 'TAG': ['', '', '', '', '', '', '', ''], 'LEMMA': ['', '', '', '', '', '', '', ''], 'POS': ['', '', '', '', '', '', '', ''], 'MORPH': ['', '', '', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5, 6, 7], 'DEP': ['', '', '', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0]}} 

Number of formatted training data:  1
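
The `entities` list in the output above uses per-token tags: `B-` marks the first token of a multi-token entity, `I-` an inside token, `L-` the last token, `U-` a single-token entity, and `O` a token outside any entity. The sketch below reproduces those tags for the example sentence. It is a simplified illustration: `simple_tokens` and `biluo_tags` are hypothetical helpers, and the regex tokenizer is only a stand-in that happens to split this sentence the same way; spaCy's own tokenizer and its `spacy.training.offsets_to_biluo_tags` function are what the library actually uses.

```python
import re


def simple_tokens(text):
    """Return (start, end, token_text) triples for words and punctuation marks."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def biluo_tags(text, entities):
    """Map character-offset entities onto one tag per token."""
    tokens = simple_tokens(text)
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie entirely inside the entity span
        inside = [
            i for i, (start, end, _) in enumerate(tokens)
            if start >= ent_start and end <= ent_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"
        else:
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"
            tags[inside[-1]] = f"L-{label}"
    return tags


text = "A patient with chest pain had hyperthyroidism."
tags = biluo_tags(text, [(15, 25, "SYMPTOM"), (30, 45, "DISEASE")])
print(tags)
```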


#### It is necessary to disable other pipeline components of an nlp model in order to only train the intended component. 

In [5]:
# Let us disable all other pipes apart from ner
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

nlp.disable_pipes(*other_pipes)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

In [6]:
nlp = spacy.load("en_core_web_sm")

# Disable all pipeline components except `ner`
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*other_pipes)

# Convert a text and its annotations to the correct format usable for training
doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)
print("Example object for training: \n", example.to_dict())

Example object for training: 
 {'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O', 'B-SYMPTOM', 'L-SYMPTOM', 'O', 'U-DISEASE', 'O'], 'spans': {}, 'links': {}}, 'token_annotation': {'ORTH': ['A', 'patient', 'with', 'chest', 'pain', 'had', 'hyperthyroidism', '.'], 'SPACY': [True, True, True, True, True, True, False, False], 'TAG': ['', '', '', '', '', '', '', ''], 'LEMMA': ['', '', '', '', '', '', '', ''], 'POS': ['', '', '', '', '', '', '', ''], 'MORPH': ['', '', '', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5, 6, 7], 'DEP': ['', '', '', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0]}}
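
In spaCy v3, `nlp.select_pipes()` offers the same effect as a context manager, so the disabled components are restored automatically when the block ends. The sketch below assumes spaCy v3 is installed and uses a blank English pipeline with a `sentencizer` and an untrained `ner` as a stand-in for a pretrained model, so no model download is needed.

```python
import spacy

# A blank pipeline with two components stands in for a pretrained model here.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("ner")

# Inside the block only `ner` is active; training code would go here.
with nlp.select_pipes(enable="ner"):
    active_inside = list(nlp.pipe_names)

# After the block, the other components are enabled again.
active_after = list(nlp.pipe_names)

print(active_inside)
print(active_after)
```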


#### 2. Model Training Procedure

- We go over the training set several times; one full pass is called an epoch.

- In each epoch, the training code uses an optimizer object to adjust the model weights by a small amount, working through randomly shuffled training data.

- <b>Optimizers</b> are functions that update the model weights to reduce prediction error and so improve the accuracy of the model. We can create an optimizer object as below
  <br><br>`optimizer = nlp.create_optimizer()`

- At the start of each epoch, we shuffle training_data, a list of (text, annotations) tuples, by using the random.shuffle() method.

- Next, for each training data point, we create a Doc container from the text with nlp.make_doc() and build an Example object from that Doc and the annotations using the Example.from_dict() method.

- The Example object is then used to update the nlp model weights: we call the nlp.update() method, passing a list of Example objects, the optimizer object and, optionally, a losses dictionary to track the model's loss during training. Loss is a number indicating how bad the model's predictions are; the values in the losses dictionary accumulate across calls to nlp.update().

- The procedure then continues with the next training data point.

In [7]:
import random

nlp = spacy.load("en_core_web_sm")

# Test data for before/after training
test = "Apple is looking at buying a U.K. startup for $1 billion."

# Training data (replace with your actual training data)
training_data = [
    ("Google was founded in 1998.", {"entities": [(0, 6, "ORG")]}),
    ("Apple is a technology company.", {"entities": [(0, 5, "ORG")]}),
]

# Number of training epochs
epochs = 10

print("Before training: ", [(ent.text, ent.label_) for ent in nlp(test).ents])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

nlp.disable_pipes(*other_pipes)

optimizer = nlp.create_optimizer()

# Shuffle the training data at the start of each epoch
for i in range(epochs):
  random.shuffle(training_data)
  for text, annotations in training_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    # Update the model, passing the optimizer through the sgd argument
    nlp.update([example], sgd=optimizer)

print("After training: ", [(ent.text, ent.label_) for ent in nlp(test).ents])

Before training:  [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]
After training:  [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]


#### 3. Save and load a trained model

After we have trained a model, the next step is to test the model. For this purpose, we need to save and later load it. 

- We use the nlp.get_pipe() method to get the trained pipeline component. For example, if we trained an NER model, we get the NER component with nlp.get_pipe("ner") and save it to disk using the ner.to_disk() method, passing a model name as the path.

- Later, we load a spaCy model and add a blank NER component to the pipeline. In spaCy v3, nlp.add_pipe() takes the registered name of the component, such as "ner", and returns the new component; it no longer accepts a component object built with nlp.create_pipe(). The blank component must use the same model architecture as the saved one.

- Then, we load the trained NER weights from disk by calling the ner.from_disk() method on that component.

- Alternatively, the whole pipeline can be saved with nlp.to_disk() and restored with spacy.load(), passing the same path.

- Once a trained model is loaded as nlp, we can use it to find the entities in a given text.
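
A save-and-load round trip can be sketched for a whole pipeline as follows. This is a minimal example under stated assumptions: spaCy v3 is installed, the pipeline is a blank English model with an initialized but untrained `ner` component, and a temporary directory stands in for a real model path. It uses `nlp.to_disk()` and `spacy.load()` on the full pipeline rather than saving the component on its own.

```python
import tempfile

import spacy

# Build a small pipeline to save; in practice this would be the trained model.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ["ORG", "PERSON", "GPE"]:  # example labels
    ner.add_label(label)
nlp.initialize()  # allocate the model weights so there is something to save

with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)          # writes config, vocab and component data
    loaded = spacy.load(model_dir)  # rebuilds the pipeline from that directory

print(loaded.pipe_names)
print(loaded.get_pipe("ner").labels)
```

Saving the whole pipeline keeps the tokenizer, vocabulary and configuration together with the component, which avoids having to recreate a matching blank component by hand.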

In [9]:
# Load a blank English model, add NER component, add given labels to the ner pipeline
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Add labels to the NER component
labels = ["ORG", "PERSON", "GPE"]  # Example labels, replace with your labels
for ent in labels:
    ner.add_label(ent)

# Training data (replace with your actual training data)
training_data = [
    ("Google was founded in 1998.", {"entities": [(0, 6, "ORG")]}),
    ("Elon Musk is the CEO of SpaceX.", {"entities": [(0, 9, "PERSON"), (25, 31, "ORG")]}),
    ("London is a city in the United Kingdom.", {"entities": [(0, 6, "GPE")]}),
]

# Disable other pipeline components, then run the training loop
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
nlp.disable_pipes(*other_pipes)
losses = {}
optimizer = nlp.initialize()  # replaces the deprecated nlp.begin_training()
for text, annotation in training_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotation)
    nlp.update([example], sgd=optimizer, losses=losses)
    print(losses)

{'ner': np.float32(5.571428)}
{'ner': np.float32(11.218441)}
{'ner': np.float32(19.2313)}


