# Chapter 4: Training a neural network model

In this chapter, you'll learn how to update spaCy's statistical models to customize them for your use case – for example, to predict a new entity type in online comments. You'll write your own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.

## Training and updating models

resources: [slides](slides/chapter4_01_training-updating-models.md)

Welcome to the final chapter, which is about one of the most exciting aspects of modern NLP: training your own models!

In this lesson, you'll learn about training and updating spaCy's neural network models and the data you need for it – focusing specifically on the named entity recognizer.

### Why update the model?

- Better results on your specific domain
- Learn classification schemes specifically for your problem
- Essential for text classification
- Very useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing

### How training works

1. **Initialize** the model weights randomly with `nlp.begin_training`
2. **Predict** a few examples with the current weights by calling `nlp.update`
3. **Compare** prediction with true labels
4. **Calculate** how to change weights to improve predictions
5. **Update** weights slightly
6. Go back to 2.

![training](slides/static/training.png)

- **Training data**: Examples and their annotations.
- **Text**: The input text the model should predict a label for.
- **Label**: The label the model should predict.
- **Gradient**: How to change the weights.
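
The steps above can be made concrete with a toy model. This is not spaCy's implementation: it is a minimal plain-Python sketch that fits a single weight `w` so that `w * x` approximates `y`. The data, the learning rate and the function name `train_toy_model` are assumptions made purely for illustration.

```python
import random


def train_toy_model(examples, n_iter=200, learning_rate=0.01, seed=0):
    """Fit a single weight w so that w * x approximates y (toy example)."""
    rng = random.Random(seed)
    # 1. Initialize the weight randomly
    w = rng.uniform(-1.0, 1.0)
    for _ in range(n_iter):
        for x, y_true in examples:
            # 2. Predict with the current weight
            y_pred = w * x
            # 3. Compare the prediction with the true label
            error = y_pred - y_true
            # 4. Calculate the gradient of the squared error with respect to w
            gradient = 2 * error * x
            # 5. Update the weight slightly, against the gradient
            w -= learning_rate * gradient
    return w


# Hypothetical data following y = 3 * x
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
weight = train_toy_model(examples)
print(round(weight, 3))
```

A real neural network has many weights and a more complex prediction function, but each update follows the same predict, compare, calculate, update cycle.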

### Example: Training the entity recognizer

- The entity recognizer tags words and phrases in context
- Each token can only be part of one entity
- Examples need to come with context

```python
("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})
```

- Texts with no entities are also important

```python
("I need a new phone! Any tips?", {'entities': []})
```

- Goal: teach the model to generalize
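
Because the annotations are character offsets, a quick sanity check is to slice the text with each `(start, end)` pair and look at what comes out. The helper `entity_texts` below is a hypothetical utility written for this illustration, not part of spaCy.

```python
def entity_texts(example):
    """Return (span text, label) pairs for a (text, annotations) example."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


example = ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]})
no_entities = ("I need a new phone! Any tips?", {"entities": []})

print(entity_texts(example))
print(entity_texts(no_entities))
```

If a slice comes back with a missing character or a stray space, the offsets are off and the example is worth fixing before training.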

### The training data

- Examples of what we want the model to predict in context
- Update an existing model: a few hundred to a few thousand examples
- Train a new category: a few thousand to a million examples
    - spaCy's English models: 2 million words
- Usually created manually by human annotators
- Can be semi-automated – for example, using spaCy's `Matcher`!

```python
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/iphone.json") as f:
    TEXTS = json.load(f)

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]
# spaCy v2 signature; in spaCy v3 this is matcher.add("GADGET", [pattern1, pattern2])
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")
```

```
('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 10, 'GADGET'), (4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
```
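
In the output above, both patterns fire on the same text, so some examples contain overlapping spans such as `(20, 28)` and `(20, 26)`, even though each token can only be part of one entity. One possible clean-up, sketched here in plain Python, keeps the longest span and drops any span that overlaps one already kept. The function name `drop_overlapping` is hypothetical; for `Span` objects, spaCy ships `spacy.util.filter_spans`, which applies a similar longest-first preference.

```python
def drop_overlapping(entities):
    """Keep the longest spans and drop any span overlapping a kept one."""
    # Longest first; ties broken by earliest start
    by_length = sorted(entities, key=lambda ent: (ent[0] - ent[1], ent[0]))
    kept = []
    for start, end, label in by_length:
        # Keep the span only if it does not overlap any span kept so far
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


entities = [(20, 28, "GADGET"), (20, 26, "GADGET")]
print(drop_overlapping(entities))
```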


## The training loop

resources: [slides](slides/chapter4_02_training-loop.md)

While some other libraries give you one method that takes care of training a model, spaCy gives you full control over the training loop.

### The steps of a training loop

1. **Loop** for a number of times.
2. **Shuffle** the training data.
3. **Divide** the data into batches.
4. **Update** the model for each batch.
5. **Save** the updated model.
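
As a rough illustration of steps 1 to 4, here is a plain-Python skeleton. The `minibatch` helper is a simplified stand-in for `spacy.util.minibatch`, and `update` is a hypothetical callback standing where `nlp.update` would go. Saving (step 5) is left out.

```python
import random


def minibatch(items, size=2):
    """Yield consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_training_loop(data, update, n_iter=3, batch_size=2, seed=0):
    """Loop, shuffle, batch and call `update(texts, annotations)` per batch."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    for _ in range(n_iter):
        rng.shuffle(data)
        for batch in minibatch(data, size=batch_size):
            texts = [text for text, _ in batch]
            annotations = [annotation for _, annotation in batch]
            update(texts, annotations)


# Hypothetical update function that only records the batch sizes
batch_sizes = []
data = [("text %d" % i, {"entities": []}) for i in range(5)]
run_training_loop(data, lambda texts, annots: batch_sizes.append(len(texts)))
print(batch_sizes)
```

Shuffling before every pass means the batches differ from one iteration to the next, so the model does not see the examples in a fixed order.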

### Recap: How training works

![training](slides/static/training.png)

- **Training data**: Examples and their annotations.
- **Text**: The input text the model should predict a label for.
- **Label**: The label the model should predict.
- **Gradient**: How to change the weights.

### Example loop

```python
import spacy
import random

# Assumes `nlp` is an existing pipeline with an entity recognizer
TRAINING_DATA = [
    ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
    # And many more examples...
]

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch into texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model
# nlp.to_disk(path_to_model)
```

### Updating an existing model

- Improve the predictions on new data
- Especially useful to improve existing categories, like `PERSON`
- Also possible to add new categories
- Be careful and make sure the model doesn't "forget" the old ones

### Setting up a new pipeline from scratch

```python
import spacy
import random

examples = [
    # And many more examples...
]

# Start with the blank English model
# The blank model doesn't have any pipeline components,
# only the language data and tokenization rules.
nlp = spacy.blank('en')

# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add a new label
ner.add_label('GADGET')

# Start the training
nlp.begin_training()

# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
```

## Best practices for training spaCy models

resources: [slides](slides/chapter4_03_training-best-practices.md)

When you start running your own experiments, you might find that a lot of things just don't work the way you want them to. And that's okay.

Training models is an iterative process, and you have to try different things until you find out what works best.

In this lesson, I'll be sharing some best practices and things to keep in mind when training your own models.

Let's take a look at some of the problems you may come across.

### Problem 1: Models can "forget" things

- Existing model can overfit on new data
    - e.g.: if you only update it with `WEBSITE`, it can "unlearn" what a `PERSON` is
- Also known as the "catastrophic forgetting" problem

### Solution 1: Mix in previously correct predictions

- For example, if you're training `WEBSITE`, also include examples of `PERSON`
- Run existing spaCy model over data and extract all other relevant entities

**BAD:**

```python
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]
```

**GOOD:**

```python
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
```
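
One way to build such a mixed dataset is sketched below. `predict_entities` is a hypothetical stand-in for running the existing model over a text and reading off its entities; with spaCy, that would mean collecting `(ent.start_char, ent.end_char, ent.label_)` from `doc.ents`. Here it is a hard-coded lookup so that the example runs on its own.

```python
def mix_in_existing_predictions(new_examples, revision_texts, predict_entities):
    """Combine new annotations with the existing model's predictions."""
    revision_examples = [
        (text, {"entities": predict_entities(text)}) for text in revision_texts
    ]
    return new_examples + revision_examples


# Hypothetical stand-in for the existing model's predictions
KNOWN_PREDICTIONS = {"Obama is a person": [(0, 5, "PERSON")]}


def predict_entities(text):
    return KNOWN_PREDICTIONS.get(text, [])


new_examples = [("Reddit is a website", {"entities": [(0, 6, "WEBSITE")]})]
TRAINING_DATA = mix_in_existing_predictions(
    new_examples, ["Obama is a person"], predict_entities
)
print(*TRAINING_DATA, sep="\n")
```

The predictions mixed in this way are only as good as the existing model, so it is worth reviewing them before training.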

### Problem 2: Models can't learn everything

- spaCy's models make predictions based on local context
- Model can struggle to learn if decision is difficult to make based on context
- Label scheme needs to be consistent and not too specific
    - For example: `CLOTHING` is better than `ADULT_CLOTHING` and `CHILDRENS_CLOTHING`

### Solution 2: Plan your label scheme carefully

- Pick categories that are reflected in local context
- More generic is better than too specific
- Use rules to go from generic labels to specific categories

**BAD:**

```python
LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
```

**GOOD:**

```python
LABELS = ['CLOTHING', 'BAND']
```
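
Going from a generic label to a specific category with rules could look like the sketch below. The keyword list and the `refine_label` function are invented for illustration; in practice the rules would depend on your data.

```python
# Hypothetical keyword rules applied after the model has predicted CLOTHING
CHILDRENS_KEYWORDS = {"kids", "children's", "toddler"}


def refine_label(text, start, end, label):
    """Map a generic CLOTHING span to a more specific category using rules."""
    if label != "CLOTHING":
        return label
    span_words = set(text[start:end].lower().split())
    if span_words & CHILDRENS_KEYWORDS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"


text = "I bought kids sneakers and a leather jacket"
kids_start = text.index("kids sneakers")
jacket_start = text.index("leather jacket")

print(refine_label(text, kids_start, kids_start + len("kids sneakers"), "CLOTHING"))
print(refine_label(text, jacket_start, jacket_start + len("leather jacket"), "CLOTHING"))
```

The model only has to learn the generic `CLOTHING` label, and the finer distinction is handled by rules that are easy to inspect and change.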

## Wrapping up

resources: [slides](slides/chapter4_04_wrapping-up.md)

Congratulations – you've made it to the end of the course!

### Your new spaCy skills

- Extract linguistic features: part-of-speech tags, dependencies, named entities
- Work with pre-trained statistical models
- Find words and phrases using `Matcher` and `PhraseMatcher` match rules
- Best practices for working with data structures `Doc`, `Token`, `Span`, `Vocab`, `Lexeme`
- Find semantic similarities using word vectors
- Write custom pipeline components with extension attributes
- Scale up your spaCy pipelines and make them fast
- Create training data for spaCy's statistical models
- Train and update spaCy's neural network models with new data

### More things to do with spaCy

- [Training and updating](https://spacy.io/usage/training) other pipeline components
    - Part-of-speech tagger
    - Dependency parser
    - Text classifier
- [Customizing the tokenizer](https://spacy.io/usage/linguistic-features#tokenization)
    - Adding rules and exceptions to split text differently
- [Adding or improving support for other languages](https://spacy.io/usage/adding-languages)
    - 45+ languages currently
    - Lots of room for improvement and more languages
    - Allows training models for other languages