## Training and updating models

Advantages of updating the default models:
- The default spaCy models make predictions based on the examples they were trained on.
- A model can therefore be made more accurate by showing it examples from your specific domain.
- Updating also lets the model predict categories that are specific to your problem.
- Training on your own text helps most with text classification and named entity recognition in your domain.
- It is less critical for part-of-speech tagging and dependency parsing.

### How does training work?

1. Initialize the model weights randomly using nlp.begin_training
2. Predict a few examples with the current weights by calling nlp.update
3. Compare the predictions with the true labels
4. Calculate how to change the weights to improve the predictions
5. Update the weights slightly
6. Go back to step 2 and repeat
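The steps above can be sketched with a toy example. This is not how spaCy implements training internally; it is a minimal illustration that assumes a one-variable linear model and plain stochastic gradient descent. The function `train_linear` and the toy data are hypothetical names made up for this sketch.

```python
import random


def train_linear(examples, epochs=200, learning_rate=0.05, seed=0):
    """Fit y = w * x + b with plain stochastic gradient descent."""
    rng = random.Random(seed)
    # 1. Initialize the weights randomly
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y_true in examples:
            # 2. Predict with the current weights
            y_pred = w * x + b
            # 3. Compare the prediction with the true label
            error = y_pred - y_true
            # 4. Calculate how to change the weights (gradient of the squared error)
            grad_w, grad_b = 2 * error * x, 2 * error
            # 5. Update the weights slightly
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        # 6. Go back to step 2 for the next pass over the data
    return w, b


# Toy data generated from y = 2x + 1
DATA = [(x, 2 * x + 1) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = train_linear(DATA)
print(round(w, 2), round(b, 2))
```

Because the toy data is noise-free, the learned `w` and `b` should end up very close to 2 and 1. In spaCy, steps 2 to 5 all happen inside a single nlp.update call.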

### Creating training data

In [1]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("iphone.json") as f:
    TEXTS = json.load(f)

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone', followed by an optional digit token
# (because the digit is optional, this pattern also matches 'iphone' on its own)
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)

In [2]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 10, 'GADGET'), (4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
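Several of the examples above contain two overlapping spans for the same mention, because both patterns match there (the optional digit lets the second pattern match "iPhone" on its own). Entity annotations used for NER training are generally expected not to overlap, so it may be worth cleaning them before training. The helper below is a hypothetical sketch, assuming that the longest span should win; later spaCy versions also provide spacy.util.filter_spans, which applies a similar rule to Span objects.

```python
def filter_overlapping(entities):
    """Keep the longest span from each group of overlapping (start, end, label) tuples."""
    # Sort longest spans first; ties are broken by the earlier start offset
    by_length = sorted(entities, key=lambda e: (e[0] - e[1], e[0]))
    kept = []
    for start, end, label in by_length:
        # Keep a span only if it does not overlap any span that was already kept
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


text = "How to preorder the iPhone X"
entities = [(20, 28, "GADGET"), (20, 26, "GADGET")]
cleaned = filter_overlapping(entities)
print(cleaned)
# Slicing the text with the offsets is a quick sanity check on the annotation
print([text[start:end] for start, end, _ in cleaned])
```

For this example only the span covering "iPhone X" is kept. Slicing the text with each (start, end) pair is also a cheap way to confirm that the character offsets point at the intended words.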


### How to update an existing model 

Set up the pipeline

In [3]:
import spacy

# Create a blank 'en' model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label("GADGET")

Build a training loop

In [4]:
import spacy
import random
import json

with open("gadgets.json") as f:
    TRAINING_DATA = json.load(f)

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 10.833332419395447}
{'ner': 24.334017038345337}
{'ner': 33.06909537315369}
{'ner': 10.731456398963928}
{'ner': 16.30031043291092}
{'ner': 20.78738036751747}
{'ner': 1.8755321726202965}
{'ner': 5.890211756341159}
{'ner': 9.25070423563011}
{'ner': 2.543185191520024}
{'ner': 4.770967783930246}
{'ner': 6.093858534062747}
{'ner': 2.9940559789538383}
{'ner': 4.647309098392725}
{'ner': 7.523067947477102}
{'ner': 1.8030253611505032}
{'ner': 2.664313643472269}
{'ner': 3.853691063122824}
{'ner': 0.5747180124690203}
{'ner': 1.60150545161423}
{'ner': 1.7429394038504142}
{'ner': 0.006832518003918153}
{'ner': 0.8354284901226201}
{'ner': 0.8408137761238343}
{'ner': 0.0019338577189143002}
{'ner': 0.002204498107817532}
{'ner': 2.300642976401678}
{'ner': 2.537248271750059e-05}
{'ner': 6.388674388891485e-05}
{'ner': 2.03935353906421}
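The output has 30 lines because the loop prints once per batch: six examples in batches of two give three updates per iteration, over ten iterations. The losses dictionary is reset at the start of each iteration and accumulated across its batches, which is why the values rise within each group of three lines. The batching can be sketched without spaCy as follows; `minibatch` here is a simplified stand-in for spacy.util.minibatch with a fixed batch size, and the example strings are placeholders rather than real training data.

```python
import random


def minibatch(items, size):
    """Yield consecutive batches of at most `size` items (simplified stand-in)."""
    items = list(items)
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Placeholder examples standing in for the six (text, annotations) pairs
examples = [f"example {i}" for i in range(6)]

updates = 0
for itn in range(10):
    # Shuffle so that batches differ between iterations
    random.shuffle(examples)
    for batch in minibatch(examples, size=2):
        # In the real loop, this is where nlp.update would be called
        updates += 1

print(updates)  # 30
```

Printing the losses once per iteration, after the inner loop, would give ten lines instead and make the downward trend easier to read.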
