# Chapter 4: Training a neural network model

You'll learn how to update spaCy's statistical models to customize them for your use case - for example, to predict a new entity type in online comments. You'll write your own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.

**Sections**

1. Training and updating models 
2. Purpose of training 
3. Creating training data (Part 1) 
4. Creating training data (Part 2) 
5. The training loop 
6. Setting up the pipeline 
7. Building a training loop 
8. Exploring the model 
9. Training best practices 
10. Good data vs. bad data 
11. Training multiple labels 
12. Wrapping up


## 1. Training and updating models

### Why update the model?
* Better results on your specific domain
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

**Notes**
* why would we want to update the model with our own examples? why can't we just rely on pre-trained models?
    - stat models make predictions based on the examples they were trained on
    - you can usually make the model more accurate by showing it examples from your domain
    - you often also want to predict categories specific to your problem, so the model needs to learn about them
* this is essential for text classification, very useful for entity recognition and a little less critical for tagging and parsing

### How training works (1)
1. **Initialize** the model weights randomly with `nlp.begin_training` 
2. **Predict** a few examples with the current weights by calling `nlp.update` 
3. **Compare** prediction with true labels 
4. **Calculate** how to change weights to improve predictions 
5. **Update** weights slightly 
6. Go back to 2.
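
The steps above can be sketched with a toy model that has a single weight. This is not spaCy code: the data, the learning rate and the squared-error objective are assumptions made for illustration, and `nlp.update` performs the predict, compare and update steps internally on a neural network.

```python
import random

# Toy data: the "true" relationship is y = 3 * x (an assumption for this sketch)
examples = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

random.seed(0)
# 1. Initialize the weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for step in range(200):
    for x, y_true in examples:
        # 2. Predict with the current weight
        y_pred = weight * x
        # 3. Compare the prediction with the true label
        error = y_pred - y_true
        # 4. Calculate how to change the weight (gradient of the squared error)
        gradient = 2 * error * x
        # 5. Update the weight slightly
        weight -= learning_rate * gradient
    # 6. Go back to step 2 for the next pass

print(round(weight, 3))  # close to 3.0
```

Each update only nudges the weight a little, which is why the loop has to run many times before the weight settles near the true value.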

### How training works (2)

* DIAGRAM IN SLIDE

* **Training data**: examples and their annotations
* **Text**: the input text the model should predict a label for
* **Label**: the label the model should predict
* **Gradient**: how to change the weights

**Notes**
* the training data are the examples we want to update the model with
* the text should be a sentence, paragraph, or longer doc
    - for the best results, it should be similar to what the model will see at runtime
* the label is what we want the model to predict
    - this can be a text category or an entity span and its type
* the gradient is how we should change the model to reduce the current error
    - computed when we compare the predicted label to the true label
* after training, we can then save out an updated model and use it in our application

### Example: Training the entity recognizer
* the entity recognizer tags words and phrases in context
* each token can only be part of one entity
* examples need to come with context

`("iPhone X is coming", {"entities": [(0, 8, "GADGET")]})`

* texts with no entities are also important

`("I need a new phone! Any tips?", {"entities": []})`

* **Goal**: teach the model to generalize

**Notes**
* the entity recognizer takes a doc and predicts the phrases and their labels
    - meaning that the training data needs to include texts, the entities they contain, and the entity labels
* entities can't overlap so each token can only be part of one entity
* because the entity recognizer predicts entities _in context_, it also needs to be trained on entities and their surrounding context
* the easiest way to do this is to show the model a text and a list of character offsets
    - EX: "iPhone X" is a gadget, starts at character 0 and ends at character 8
* also very important for the model to learn words that _aren't_ entities
    - EX: list of span annotations will be empty
* GOAL: teach the model to recognize new entities in similar contexts, even if they weren't in the training data
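
Counting character offsets by hand is error-prone, so a small helper can compute them. The sketch below uses a hypothetical function name, `entity_span`, and only annotates the first occurrence of a phrase; it is one possible way to build the tuples, not part of spaCy.

```python
def entity_span(text, phrase, label):
    """Return a (start_char, end_char, label) tuple for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


text = "iPhone X is coming"
span = entity_span(text, "iPhone X", "GADGET")
example = (text, {"entities": [span]})

# The offsets should slice the text back to the annotated phrase
start, end, label = span
assert text[start:end] == "iPhone X"
print(example)
```

Slicing the text with the offsets, as in the `assert` above, is a cheap sanity check for any annotation.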

### The training data
* Examples of what we want the model to predict in context
* Update an **existing model**: a few hundred to a few thousand examples
* Train a **new category**: few thousand to a million examples
    - spaCy's English models: 2 million words
* usually created manually by human annotators
* can be semi-automated - for example, using spaCy's `Matcher`!

**Notes**
* the training data tells the model what we want to predict
    - this could be texts and named entities we want to recognize, or
    - tokens and their correct part-of-speech tags
* to update an existing model, we can start with a few hundred to a few thousand examples
* to train a new category, we may need up to a million examples
* spaCy's pre-trained English models for instance were trained on 2 million words labelled with part-of-speech tags, dependencies and named entities
* training data is usually created by humans who assign labels to texts
* this is a lot of work but can be semi-automated 
    - EX: using spaCy's `Matcher`
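
As a rough illustration of semi-automated annotation, the sketch below uses a plain regular expression as a stand-in for spaCy's `Matcher`. The pattern and the `annotate` helper are assumptions for this example; unlike the `Matcher`, the regex works on raw characters rather than tokens.

```python
import re

# Hypothetical pattern: "iphone" followed by "x" or a number, case-insensitive
GADGET_PATTERN = re.compile(r"\biphone (?:x|\d+)\b", re.IGNORECASE)


def annotate(text):
    """Create a (text, annotations) training example from regex matches."""
    entities = [(m.start(), m.end(), "GADGET") for m in GADGET_PATTERN.finditer(text)]
    return (text, {"entities": entities})


texts = [
    "How to preorder the iPhone X",
    "I need a new phone! Any tips?",
]
for text in texts:
    print(annotate(text))
```

Texts without a match still produce an example with an empty entity list. Automatically generated annotations like these are best treated as a first draft to review by hand.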

## 2. Purpose of training
While spaCy comes with a range of pre-trained models to predict linguistic annotations, you almost _always_ want to fine-tune them with more examples. You can do this by training them with more labelled data.

What does training **not** help with?

( ) Improve model accuracy on your data.

( ) Learn new classification schemes.

(X) Discover patterns in unlabelled data. 

**Correct!**: spaCy's components are supervised models for text annotations, meaning they can only learn to reproduce examples, not guess new labels from raw text.

## 3. Creating training data (Part 1)

spaCy's rule-based `Matcher` is a great way to quickly create training data for named entity models.

In [None]:
# (do not run)
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches "iphone" and a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and check the result
matcher.add("GADGET", None, pattern1, pattern2)
for doc in nlp.pipe(TEXTS):
    print([doc[start:end] for match_id, start, end in matcher(doc)])

## 4. Creating training data (Part 2)



In [None]:
# (do not run)
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")
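
Generated training data is often saved as JSON and loaded again later, as the `gadgets.json` file in section 7 suggests. The sketch below shows one detail to watch for: JSON has no tuple type, so the tuples come back as lists. The sample data is made up, and whether the tuples need restoring depends on the code that consumes the data.

```python
import json

TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

# Serialize to a JSON string (json.dump would write to a file the same way)
serialized = json.dumps(TRAINING_DATA)

# JSON has no tuple type, so tuples come back as lists
loaded = json.loads(serialized)
print(loaded[0])

# Restore the (text, {"entities": [(start, end, label), ...]}) shape
restored = [
    (text, {"entities": [tuple(ent) for ent in annotations["entities"]]})
    for text, annotations in loaded
]
assert restored == TRAINING_DATA
```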

## 5. The training loop

### The steps of a training loop

1. **Loop** for a number of iterations 
2. **Shuffle** the training data 
3. **Divide** the data into batches 
4. **Update** the model for each batch 
5. **Save** the updated model

**Notes**
* the training loop is a series of steps that's performed to train or update a model
* usually perform it several times, for multiple iterations, so that the model can learn from it effectively
    - if we want to train for 10 iterations, we need to loop 10 times
* to prevent the model from getting stuck in a suboptimal solution, we randomly shuffle the data for each iteration
    - very common strategy when doing stochastic gradient descent
* next, divide the training data into batches of several examples, also known as minibatching
    - increases the reliability of the gradient estimates
* update the model for each batch, and start the loop again until we've reached the last iteration
* then save the model to a directory and use it in spaCy
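
The shuffle and batch steps can be sketched without spaCy. The `minibatch` function below is a simplified stand-in written for this example, not spaCy's `spacy.util.minibatch`, and the example strings are placeholders.

```python
import random


def minibatch(items, size):
    """Yield consecutive batches of up to `size` items (simplified stand-in)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


data = [f"example {i}" for i in range(5)]

random.seed(1)
for iteration in range(2):
    # Shuffle so that each iteration sees the examples in a different order
    random.shuffle(data)
    batches = list(minibatch(data, size=2))
    print(iteration, batches)
```

With five examples and a batch size of 2, every iteration yields batches of 2, 2 and 1 examples, so the last batch can be smaller than the rest.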

### Example loop

In [None]:
# (do not run)
import random
import spacy

# Assumes `nlp` is an existing pipeline and `path_to_model` is an output directory
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    # And many more examples...
]

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch into texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model
nlp.to_disk(path_to_model)

**Notes**
* imagine we have a list of training examples consisting of texts and entity annotations
* we want to loop for 10 iterations, so we're iterating over a `range` of 10
* next, we use the `random` module to randomly shuffle the training data
* we then use spaCy's `minibatch` utility function to divide the examples into batches
* for each batch, we get the texts and annotations and call the `nlp.update` method to update the model
* finally, we call the `nlp.to_disk` method to save the trained model to a directory
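
Splitting a batch into texts and annotations can also be written with `zip`. This is an equivalent alternative to the two list comprehensions, shown on a made-up batch; it produces tuples rather than lists.

```python
batch = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

# zip(*batch) transposes the list of pairs into two tuples
texts, annotations = zip(*batch)

print(texts)
print(annotations)
```

One difference to keep in mind: the list comprehensions return two empty lists for an empty batch, while unpacking `zip(*batch)` raises a `ValueError` in that case.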

### Updating an existing model
* improve the predictions on new data
* especially useful to improve existing categories, like `"PERSON"`
* also possible to add new categories
* be careful and make sure the model doesn't "forget" the old ones

**Notes**
* spaCy lets you update an existing pre-trained model with more data
    - EX: to improve its predictions on different texts
* especially useful if you want to improve categories the model already knows like "person" or "organization"
* you can also update a model to add new categories
* make sure to always update it with examples of new category AND examples of the other categories it previously predicted correctly; otherwise, improving the new category might hurt the other categories

### Setting up a new pipeline from scratch

In [None]:
# (do not run)
import random
import spacy

# Assumes `examples` is a list of (text, annotations) tuples
# Start with blank English model
nlp = spacy.blank("en")
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
# Add a new label
ner.add_label("GADGET")

# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

**Notes**
* the blank model doesn't have any pipeline components, only the language data and tokenization rules
* then create a blank entity recognizer and add it to the pipeline
* use `add_label` method to add new string labels to the model
* can now call `nlp.begin_training` to initialize the model with random weights
* to get better accuracy, we want to loop over the examples more than once and randomly shuffle the data on each iteration
* on each iteration, we divide the examples into batches using spaCy's `minibatch` utility function
    - each example consists of a text and its annotations
* finally, we update the model with the texts and annotations and continue the loop

## 6. Setting up the pipeline

In [4]:
import spacy

# Create a blank "en" model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add the label "GADGET" to the entity recognizer
ner.add_label("GADGET")

## 7. Building a training loop

In [None]:
# (do not run)
import spacy
import random
import json

with open("exercises/en/gadgets.json", encoding="utf8") as f:
    TRAINING_DATA = json.loads(f.read())

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
    print(losses)  # loss per pipeline component for this iteration - lower is generally better

## 8. Exploring the model

(skipped - multiple choice question on model accuracy)

## 9. Training best practices

### Problem 1: Models can "forget" things
* Existing model can overfit on new data
    - EX: if you only update it with `"WEBSITE"`, it can "unlearn" what a `"PERSON"` is
* also known as "catastrophic forgetting" problem

**Notes**
* statistical models can learn lots of things - but it doesn't mean that they won't unlearn them
* if you're updating an existing model with new data, especially new labels, it can overfit and adjust _too much_ to the new examples
* for instance, if you're only updating it with examples of "website", it may "forget" other labels it previously predicted correctly - like "person"

### Solution 1: Mix in previously correct predictions
* EX: if you're training `"WEBSITE"`, also include examples of `"PERSON"`
* run the existing spaCy model over your data and extract all other relevant entities

In [7]:
# BAD
TRAINING_DATA = [
    ("Reddit is a website", {"entities": [(0, 6, "WEBSITE")]})
]

# GOOD
TRAINING_DATA = [
    ("Reddit is a website", {"entities": [(0, 6, "WEBSITE")]}),
    ("Obama is a person", {"entities": [(0, 5, "PERSON")]})
]

**Notes**
* spaCy can help you: you can create those additional examples by running the existing model over data and extracting the entity spans you care about
* you can then mix those examples in with your existing data and update the model with annotations of all labels
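
A minimal sketch of the mixing step, assuming both sets of examples already exist as lists. The variable names are made up for illustration, and the "revision" examples are written by hand here rather than extracted from an existing model's predictions.

```python
import random

# Examples of the new label
new_examples = [
    ("Reddit is a website", {"entities": [(0, 6, "WEBSITE")]}),
]

# Examples of labels the model already predicts correctly. In practice these
# could come from running the existing model over raw text; here they are
# written out by hand.
revision_examples = [
    ("Obama is a person", {"entities": [(0, 5, "PERSON")]}),
]

TRAINING_DATA = new_examples + revision_examples
random.shuffle(TRAINING_DATA)

# Check which labels the combined data covers
labels = {label for _, ann in TRAINING_DATA for _, _, label in ann["entities"]}
print(sorted(labels))
```

Checking the set of labels before training is a quick way to confirm that the old categories are still represented in the data.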

### Problem 2: Models can't learn everything
* spaCy's models make predictions based on **local context**
* model can struggle to learn if decision is difficult to make based on context
* label scheme needs to be consistent and not too specific
    - EX: `"CLOTHING"` is better than `"ADULT_CLOTHING"` and `"CHILDRENS_CLOTHING"`

**Notes**
* another common problem is that your model just won't learn what you want it to
* spaCy's models make predictions based on the local context
    - EX: for named entities, the surrounding words are most important
* if the decision is difficult to make based on the context, the model can struggle to learn it
* the label scheme also needs to be consistent and not too specific
    - EX: may be very difficult to teach a model to predict whether something is adult clothing or children's clothing based on the context
    - however, just predicting the label "clothing" may work better

### Solution 2: Plan your label scheme carefully
* pick categories that are reflected in local context
* more generic is better than too specific
* use rules to go from generic labels to specific categories

In [9]:
# BAD
LABELS = ["ADULT_SHOES", "CHILDRENS_SHOES", "BANDS_I_LIKE"]

# GOOD
LABELS = ["CLOTHING", "BAND"]

**Notes**
* before you start training and updating models, it's worth taking a step back and planning your label scheme
* try to pick categories that are reflected in the local context and make them more generic if possible
* you can always add a rule-based system later to go from generic to specific
* generic categories like "clothing" or "band" are both easier to label and easier to learn
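
Going from a generic label to a specific category with rules could look like the sketch below. The keyword list and the `refine_clothing` function are hypothetical; a real rule set would depend on your data.

```python
# Hypothetical keyword rules for refining a generic "CLOTHING" entity
CHILD_KEYWORDS = {"kids", "children", "toddler", "baby"}


def refine_clothing(text, entity):
    """Map a generic CLOTHING entity to a more specific category using the full text."""
    start, end, label = entity
    if label != "CLOTHING":
        return label
    words = {word.strip(".,!?").lower() for word in text.split()}
    if words & CHILD_KEYWORDS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"


text = "Looking for a warm jacket for my toddler"
entity = (19, 25, "CLOTHING")
print(text[entity[0]:entity[1]], refine_clothing(text, entity))
```

The model only has to learn the generic `"CLOTHING"` label, while the finer distinction lives in rules that can be changed without retraining.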

## 10. Good data vs. bad data

In [10]:
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "TOURIST_DESTINATION")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "TOURIST_DESTINATION")]},
    ),
]

### Part 1
Why is this data and label scheme problematic?

(X) Whether a place is a tourist destination is a subjective judgement and not a definitive category. It will be very difficult for the entity recognizer to learn.

( ) Paris should also be labelled as a tourist destination for consistency. Otherwise, the model will be confused.

( ) Rare out-of-vocabulary words like the misspelled "amsterdem" shouldn't be labelled as entities.

**Correct!**: A much better approach would be to only label `"GPE"` (geopolitical entity) or `"LOCATION"` and then use a rule-based system to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.

### Part 2

In [None]:
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "GPE")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "GPE")]},
    ),
    ("There's also a Paris in Arkansas, lol", 
     {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]

**Correct!**: Once the model achieves good results on detecting GPE entities in the traveler reviews, you could add a rule-based component to determine whether the entity is a tourist destination in this context.

* EX: you could resolve the entity types back to a knowledge base or look them up in a travel wiki
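
A rule-based follow-up step might be as simple as a lookup. In the sketch below, the set `KNOWN_DESTINATIONS` is a hypothetical stand-in for a knowledge base or travel wiki, and the function name is made up.

```python
# Hypothetical stand-in for a knowledge base or travel wiki lookup
KNOWN_DESTINATIONS = {"amsterdam", "paris", "berlin"}


def tourist_destinations(text, entities):
    """Return the texts of GPE entities that appear in the lookup set."""
    results = []
    for start, end, label in entities:
        name = text[start:end]
        if label == "GPE" and name.lower() in KNOWN_DESTINATIONS:
            results.append(name)
    return results


text = "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!"
entities = [(0, 6, "GPE")]
print(tourist_destinations(text, entities))
```

A plain name lookup has clear limits: it cannot tell Paris, France from Paris, Arkansas, and it misses misspellings like "amsterdem". Resolving each entity to a specific knowledge base entry would be needed for that.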

## 11. Training multiple labels

In [14]:
# Part 1
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [
          (0, len("Reddit"), "WEBSITE"), 
          (len("Reddit partners with "), len("Reddit partners with Patreon"), "WEBSITE")
        ]},
    ),
    ("PewDiePie smashes YouTube record", {"entities": [
      (len("PewDiePie smashes "), len("PewDiePie smashes YouTube"), "WEBSITE")
    ]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [
          (0, len("Reddit"), "WEBSITE")]},
    ),
    # And so on...
]

### Part 2

(skipped - multiple choice)

In [15]:
# Part 3
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [
          (0, 6, "WEBSITE"), 
          (21, 28, "WEBSITE")
        ]},
    ),
    ("PewDiePie smashes YouTube record", 
     {"entities": [
       (0, len("PewDiePie"), "PERSON"), 
       (18, 25, "WEBSITE")
     ]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [
          (0, 6, "WEBSITE"), 
          (len("Reddit founder "), len("Reddit founder Alexis Ohanian"), "PERSON")
        ]},
    ),
    # And so on...
]
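
When several labels are annotated in the same texts, it helps to check the data before training. The sketch below counts examples per label and verifies that no two entities overlap, since each token can only be part of one entity. The helper name `has_overlap` is made up, and the check works on character offsets rather than tokens.

```python
from collections import Counter

TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
]


def has_overlap(entities):
    """Return True if any two (start, end, label) spans share a character."""
    ordered = sorted(entities)
    return any(prev[1] > cur[0] for prev, cur in zip(ordered, ordered[1:]))


label_counts = Counter()
for text, annotations in TRAINING_DATA:
    entities = annotations["entities"]
    assert not has_overlap(entities), text
    label_counts.update(label for _, _, label in entities)

print(label_counts)
```

A very lopsided count, such as many `"WEBSITE"` examples and almost no `"PERSON"` examples, is a hint that more examples of the rarer label may be needed.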

## 12. Wrapping up

Recap of what was learned in the course, skipped notes :)