## [Chapter 4](https://course.spacy.io/chapter4)

In this chapter, you'll learn how to update spaCy's statistical models to customize them for your use case – for example, to predict a new entity type in online comments. You'll write your own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.

In [None]:
import spacy
import json

# $ python -m spacy download en_core_web_sm
# $ python -m spacy download en_core_web_md
# $ python -m spacy download en_core_web_lg

from IPython.display import HTML, Image

#### Model Training
- The training data are the examples we want to update the model with.
- The text should be a sentence, paragraph or longer document. For the best results, it should be similar to what the model will see at runtime.
- The label is what we want the model to predict. This can be a text category, or an entity span and its type.
- The gradient is how we should change the model to reduce the current error. It's computed when we compare the predicted label to the true label.

After training, we can save the updated model and use it in our application.
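
To make the idea of a gradient more concrete, here is a minimal sketch in plain Python. It is not how spaCy implements training: it assumes a toy model with a single weight and a squared-error loss, and the names `predict`, `gradient` and `train`, the learning rate and the data are all made up for illustration.

```python
# Toy illustration only: a one-weight model y = w * x with squared error.
# spaCy's models are far larger, but the update principle is the same:
# compare prediction and true label, then move the weights against the gradient.

def predict(w, x):
    return w * x

def gradient(w, x, y_true):
    # Derivative of (w * x - y_true) ** 2 with respect to w
    return 2 * (predict(w, x) - y_true) * x

def train(w, examples, learning_rate=0.1, epochs=20):
    for _ in range(epochs):
        for x, y_true in examples:
            # Step in the direction that reduces the current error
            w -= learning_rate * gradient(w, x, y_true)
    return w

examples = [(1.0, 2.0), (2.0, 4.0)]  # hypothetical data where y = 2 * x
w_trained = train(0.0, examples)
print(round(w_trained, 3))
```

Starting from `w = 0.0`, each update shrinks the gap between the prediction and the true label, so the weight moves towards the value that fits the examples.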

In [4]:
Image(url="https://course.spacy.io/training.png")

#### Using Match Rules To Create Training Data

Let’s use some match patterns we’ve created to bootstrap a set of training examples. A list of sentences is available as the variable TEXTS.

In [7]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/iphone.json", encoding="utf8") as f:
    TEXTS = json.load(f)

nlp = English()
matcher = Matcher(nlp.vocab)
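# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# "OP": "?" makes the digit optional, so this pattern also matches a bare "iphone"
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]
# spaCy v2 signature; spaCy v3 expects matcher.add("GADGET", [pattern1, pattern2])
matcher.add("GADGET", None, pattern1, pattern2)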

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
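
Note that the first three examples above contain two overlapping spans each, because both patterns match at the same position. Entity annotations for training are generally expected not to overlap, so you may want to keep only the longest match. The following is a minimal sketch in plain Python under that assumption; `filter_overlapping` is a hypothetical helper written for this notebook, not a spaCy function (recent spaCy versions also provide `spacy.util.filter_spans`, which works on `Span` objects).

```python
def filter_overlapping(entities):
    """Keep the longest entity when spans overlap; ties go to the earlier span."""
    # Sort by length (longest first), then by start offset
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        # Half-open ranges [start, end) do not overlap if one ends before the other starts
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    # Return the surviving spans in document order
    return sorted(kept)

# First example from the output above
example = (
    "How to preorder the iPhone X",
    {"entities": [(20, 28, "GADGET"), (20, 26, "GADGET")]},
)
text, annotations = example
cleaned = (text, {"entities": filter_overlapping(annotations["entities"])})
print(cleaned)
```

Preferring the longest span is a design choice: here it keeps "iPhone X" over the shorter "iPhone", which is usually what you want for product names.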


#### Pipeline for Training a Model

Let’s write a simple training loop from scratch!
The pipeline is available as the nlp object: a blank English model with an entity recognizer that has the label 'GADGET' added. The cell below rebuilds it so that the example is self-contained.
The small set of labelled examples is loaded from exercises/gadgets.json as TRAINING_DATA. To see the examples, you can print them in your script.

In [9]:
import spacy
import random
import json

with open("exercises/gadgets.json", encoding="utf8") as f:
    TRAINING_DATA = json.load(f)

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model; losses accumulates over the batches of this iteration
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 11.199999451637268}
{'ner': 22.608760356903076}
{'ner': 31.69031023979187}
{'ner': 7.134791314601898}
{'ner': 14.171489298343658}
{'ner': 17.40808865427971}
{'ner': 1.6917238347232342}
{'ner': 6.152672458440065}
{'ner': 9.372104265145026}
{'ner': 1.4761813097284175}
{'ner': 3.3679748892900534}
{'ner': 4.7345728858344955}
{'ner': 2.1268748844740912}
{'ner': 5.620544813456945}
{'ner': 8.640644057071768}
{'ner': 2.457177545875311}
{'ner': 5.079249800299294}
{'ner': 5.84424194006715}
{'ner': 1.4112991895526648}
{'ner': 1.7581186078023165}
{'ner': 3.6494354910682887}
{'ner': 1.1694431360810995}
{'ner': 2.290436919418653}
{'ner': 2.3815858341684475}
{'ner': 0.7015699759685994}
{'ner': 0.7040331653881857}
{'ner': 0.7151841569083383}
{'ner': 0.0001465689825774774}
{'ner': 2.1335694881336122}
{'ner': 2.135355341445235}
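
The output above has 30 lines: 10 iterations with 3 batches of 2 examples each. Within an iteration the printed value keeps growing, because the losses dictionary is only reset at the start of each iteration and every update adds to it. The sketch below reproduces that pattern in plain Python. It assumes a simplified `minibatch` helper and a hypothetical `fake_update` function standing in for `nlp.update`; neither is spaCy's actual implementation.

```python
import random

def minibatch(items, size):
    """Yield consecutive batches of at most `size` items (simplified stand-in)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fake_update(batch, losses):
    # Hypothetical stand-in for nlp.update: adds this batch's loss to the running total.
    # Here the "loss" is just the batch size, so the numbers are easy to follow.
    losses["ner"] = losses.get("ner", 0.0) + float(len(batch))

data = list(range(6))  # six placeholder examples
history = []
for itn in range(2):
    random.shuffle(data)
    losses = {}  # reset once per iteration, not once per batch
    for batch in minibatch(data, size=2):
        fake_update(batch, losses)
        history.append(losses["ner"])

print(history)
```

To compare iterations, look at the last value printed in each iteration (or move the print statement out of the batch loop), since that is the total loss for the full pass over the data.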


#### Model Training Best Practices

1. If you're updating an existing model with new data, especially new labels, it can overfit and adjust too much to the new examples. For instance, if you're only updating it with examples of "website", it may "forget" other labels it previously predicted correctly, like "person". This is also known as the catastrophic forgetting problem.
   - To prevent this, make sure to always mix in examples of what the model previously got correct.
2. Another common problem is that your model just won't learn what you want it to. For example, it may be very difficult to teach a model to predict whether something is adult clothing or children's clothing based on the context. However, just predicting the label "clothing" may work better.
   - Before you start training and updating models, it's worth taking a step back and planning your label scheme. Try to pick categories that are reflected in the local context and make them more generic if possible. You can always add a rule-based system later to go from generic to specific.
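
The first tip, mixing in examples the model already gets right, can be sketched in plain Python as follows. This is only an illustration: `mix_training_data` is a hypothetical helper, the example sentences are made up, and the 1:1 ratio of new to revision examples is an assumption rather than a recommendation from the course. In practice, the revision examples would typically come from running the existing model over text and keeping predictions you have checked.

```python
import random

def mix_training_data(new_examples, revision_examples, revision_ratio=1.0, seed=0):
    """Combine new examples with a sample of examples the model already handles."""
    rng = random.Random(seed)
    # Number of revision examples to add, capped by how many are available
    n_revision = min(len(revision_examples), int(len(new_examples) * revision_ratio))
    mixed = list(new_examples) + rng.sample(revision_examples, n_revision)
    rng.shuffle(mixed)
    return mixed

# Hypothetical examples for the new label
new_examples = [
    ("Visit example.com today", {"entities": [(6, 17, "WEBSITE")]}),
    ("I found it on example.org", {"entities": [(14, 25, "WEBSITE")]}),
]

# Hypothetical examples of a label the model previously predicted correctly
revision_examples = [
    ("Alice joined the team", {"entities": [(0, 5, "PERSON")]}),
    ("We met Bob yesterday", {"entities": [(7, 10, "PERSON")]}),
    ("Thanks, Carol!", {"entities": [(8, 13, "PERSON")]}),
]

mixed = mix_training_data(new_examples, revision_examples)
print(*mixed, sep="\n")
```

The resulting list has the same (text, annotations) shape as TRAINING_DATA, so it could be passed to a training loop like the one shown earlier.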