# Natural Language Processing

## Part 4: spaCy + Training a Neural Network

We'll write our own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.

Why update the model?

* Better results on your specific domain
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

### How training works

1. Initialize the model weights randomly
2. Predict a few examples with the current weights
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2
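
The six steps above can be sketched with a toy model that has a single weight. This is not how spaCy implements training; it is a minimal plain-Python illustration, and the data, learning rate and number of steps are made up for the example.

```python
import random

# Hypothetical toy data: the "true" relationship is y = 3 * x
examples = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

random.seed(0)
weight = random.uniform(-1.0, 1.0)  # 1. initialize the weight randomly
learning_rate = 0.05

for step in range(200):  # 6. go back to step 2
    batch = random.sample(examples, 2)
    gradient = 0.0
    for x, y_true in batch:
        y_pred = weight * x  # 2. predict with the current weight
        error = y_pred - y_true  # 3. compare prediction with the true label
        gradient += 2 * error * x / len(batch)  # 4. gradient of the squared error
    weight -= learning_rate * gradient  # 5. update the weight slightly

print(round(weight, 3))  # ends up close to 3.0
```

Each pass nudges the weight a little in the direction that reduces the error on the current batch, which is the same idea spaCy applies to many weights at once.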

### Training the entity recognizer

Let's look at an example for a specific component: the entity recognizer.

The entity recognizer takes a document and predicts phrases and their labels. This means that the training data needs to include texts, the entities they contain, and the entity labels.

Entities can't overlap, so each token can only be part of one entity.
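
A quick way to check this rule before training is to look for overlapping spans in the annotations. The sketch below assumes entities are given as `(start, end, label)` character offsets, the format used for `TRAINING_DATA` later in this part; `find_overlaps` is a hypothetical helper, not a spaCy function.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share characters."""
    ordered = sorted(entities)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # Spans are sorted by start, so they overlap if the later one
            # begins before the earlier one ends (end offsets are exclusive)
            if second[0] < first[1]:
                overlaps.append((first, second))
    return overlaps


# "iPhone X is coming": "iPhone X" is 0-8, "iPhone" is 0-6
conflicting = find_overlaps([(0, 8, "GADGET"), (0, 6, "GADGET")])
clean = find_overlaps([(0, 6, "WEBSITE"), (21, 28, "WEBSITE")])
print(conflicting)
print(clean)
```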

Because the entity recognizer predicts entities in context, it also needs to be trained on entities and their surrounding context.

In [21]:
from spacy.tokens import Span

doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

It's also very important for the model to learn words that aren't entities.

In [22]:
doc = nlp("I need a new phone! Any tips?")
doc.ents = []

## 1. Creating training data

The training data tells the model what we want it to predict. This could be texts and named entities we want to recognize, or tokens and their correct part-of-speech tags. To update an existing model, we can start with a few hundred to a few thousand examples. To train a new category, we may need up to a million. spaCy's pre-trained English models, for instance, were trained on **2 million words** labelled with part-of-speech tags, dependencies and named entities.

Training data is usually created by humans who assign labels to texts. This is a lot of work, but can be semi-automated – for example, using spaCy's `Matcher`.

spaCy’s rule-based `Matcher` is a great way to quickly create training data for named entity models. A list of sentences is available as the variable `text`. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as 'GADGET'.

In [33]:
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

with open("data/iphone.json", encoding="utf8") as f:
    text = json.loads(f.read())

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches "iphone" and a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(text):
    print(doc)
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)

How to preorder the iPhone X
[iPhone X]
iPhone X is coming
[iPhone X]
Should I pay $1,000 for the iPhone X?
[iPhone X]
The iPhone 8 reviews are here
[iPhone 8]
iPhone 11 vs iPhone 8: What's the difference?
[iPhone 11, iPhone 8]
I need a new phone! Any tips?
[]


After creating the data for our corpus, we need to save it to a `.spacy` file.

To do this, instantiate a `DocBin`, a container for efficiently storing and serializing `Doc` objects, with the list of docs.

Then save the `DocBin` to a file called `train.spacy`.

In [29]:
from spacy.tokens import DocBin

doc_bin = DocBin(docs=docs)
doc_bin.to_disk("models/train.spacy")
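
To check that the annotations survive the round trip, the file can be loaded back and its entities inspected. This is a minimal sketch assuming spaCy v3; it writes to a temporary directory instead of `models/` and uses a single hand-labelled doc as a stand-in for the `docs` list.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    # Save the doc, then load it again from disk
    DocBin(docs=[doc]).to_disk(path)
    loaded = DocBin().from_disk(path)
    # get_docs needs a vocab to rebuild the Doc objects
    docs_back = list(loaded.get_docs(nlp.vocab))

restored = [(ent.text, ent.label_) for ent in docs_back[0].ents]
print(restored)
```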

## 2. Configuring and running the training

### Training config

- single source of truth for all settings
- typically called `config.cfg`
- defines how to initialize the `nlp` object
- includes all settings about the pipeline components and their model implementations
- configures the training process and hyperparameters
- makes your training more reproducible


In [35]:
### Example config.cfg

'''
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
'''


### Generating a config

- spaCy can auto-generate a default config file for you
- interactive quickstart widget in the docs
- `init config` command on the CLI

$ python -m spacy init config ./config.cfg --lang en --pipeline ner

- `init config`: the command to run
- `config.cfg`: output path for the generated config
- `--lang`: language class of the pipeline, e.g. `en` for English
- `--pipeline`: comma-separated names of components to include

In [19]:
import random
import spacy

# `train` is assumed to be a list of (text, annotations) tuples and
# `nlp` the pipeline to update (spaCy v2-style nlp.update signature)
# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(train)
    # Create batches and iterate over them (size must be an int, not the string "6")
    batches = spacy.util.minibatch(train, size=6)
    for batch in batches:
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model
nlp.to_disk("example")

### Update an existing model

* Improve the predictions on new data
* Especially useful to improve existing categories, like PERSON
* Also possible to add new categories
* Be careful and make sure the model doesn't "forget" the old ones

spaCy lets you update an existing pre-trained model with more data – for example, to improve its predictions on different texts.

This is especially useful if you want to improve categories the model already knows, like "person" or "organization".

You can also update a model to add new categories.

Just make sure to always update it with examples of the new category and examples of the other categories it previously predicted correctly. Otherwise improving the new category might hurt the other categories.

### Setting up a new pipeline from scratch

In this example, we start off with a blank English model using the `spacy.blank` method. The blank model doesn't have any pipeline components, only the language data and tokenization rules.

We then create a blank entity recognizer and add it to the pipeline.

Using the `add_label` method, we can add new string labels to the model.

We can now call `nlp.begin_training` to initialize the model with random weights.

To get better accuracy, we want to loop over the examples more than once and randomly shuffle the data on each iteration.

On each iteration, we divide the examples into batches using spaCy's `minibatch` utility function. Each example consists of a text and its annotations.

Finally, we update the model with the texts and annotations and continue the loop.

In [25]:
'''
# Start with blank English model
nlp = spacy.blank('en')
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# Add a new label
ner.add_label('GADGET')

# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
        '''

Time to practice! Now that you've seen the training loop, let's use the data created in the previous exercise to update a model.

## Setting up the pipeline

In this exercise, you’ll prepare a spaCy pipeline to train the entity recognizer to recognize 'GADGET' entities in a text – for example, “iPhone X”.

* Create a blank 'en' model, for example using the spacy.blank method.
* Create a new entity recognizer using nlp.create_pipe and add it to the pipeline.
* Add the new label 'GADGET' to the entity recognizer using the add_label method on the pipeline component.


In [None]:
import spacy

# Create a blank 'en' model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label("GADGET")

## Building a training loop

Let’s write a simple training loop from scratch!

The pipeline you’ve created in the previous exercise is available as the nlp object. It already contains the entity recognizer with the added label 'GADGET'.

The small set of labelled examples that you’ve created previously is available as TRAINING_DATA. To see the examples, you can print them in your script.

* Call nlp.begin_training, create a training loop for 10 iterations and shuffle the training data.
* Create batches of training data using spacy.util.minibatch and iterate over the batches.
* Convert the (text, annotations) tuples in the batch to lists of texts and annotations.
* For each batch, use nlp.update to update the model with the texts and annotations.

In [26]:
import spacy
import random
import json
   
with open("exercises/gadgets.json") as f:
    TRAINING_DATA = json.loads(f.read())

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

nlp.vocab.vectors.name = 'example'

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print("{0:.10f}".format(losses['ner']) )

2.2457692833
3.2140585109
3.4722888768
0.0000015177
0.5200842448
0.5200924764
0.0000000003
0.0000000077
0.0000000077
0.0000000000
0.0000077896
0.0000088767
0.0000000349
0.0000000349
0.0000000354
0.0000000000
0.0000000003
0.0000000003
0.0000000000
0.0000000000
0.0000000001
0.0000000001
0.0000000002
0.0000000002
0.0000000006
0.0000000006
0.0000000006
0.0000000000
0.0000000000
0.0000000050
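
The loop above uses the spaCy v2 API (`nlp.create_pipe`, `nlp.begin_training` and `nlp.update(texts, annotations)`). In spaCy v3, components are added by name, the pipeline is initialized with `nlp.initialize`, and `nlp.update` expects a batch of `Example` objects. Here is a minimal sketch of the same loop, assuming spaCy v3 is installed. The four training texts are a small stand-in taken from the iPhone sentences earlier in this part, not the contents of `exercises/gadgets.json`.

```python
import random

import spacy
from spacy.training import Example

# Stand-in training data: (text, annotations) tuples with character offsets
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

random.seed(0)
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")  # v3: add the component by name
ner.add_label("GADGET")

# v3: wrap each text and its annotations in an Example object
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAINING_DATA
]

# v3: initialize the weights (takes the place of nlp.begin_training)
optimizer = nlp.initialize(lambda: examples)

for itn in range(10):
    random.shuffle(examples)
    losses = {}
    for batch in spacy.util.minibatch(examples, size=2):
        # Update the model with a batch of Example objects
        nlp.update(batch, sgd=optimizer, losses=losses)
    print(itn, losses["ner"])
```

The loss values will differ from run to run and from the numbers printed above, since the weights start out random.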


## Exploring the model

Let’s see how the model performs on unseen data! To speed things up a little, we already ran a trained model for the label 'GADGET' over some text. Here are some of the results:

<img src= "figures/table.png" />

Out of all the entities in the texts, how many did the model get correct? 

Keep in mind that incomplete entity spans count as mistakes, too! Tip: Count the number of entities that the model should have predicted. Then count the number of entities it actually predicted correctly and divide it by the number of total correct entities.

( ) 45%

( ) 60%

(X) 70%

( ) 90%
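
The counting described in the tip can be written as a small function. This sketch uses made-up spans rather than the values from the table, and `entity_accuracy` is a hypothetical helper name.

```python
def entity_accuracy(gold, predicted):
    """Share of the expected entities that were predicted exactly right."""
    if not gold:
        return 0.0
    # A prediction only counts if start, end and label all match
    correct = len(set(gold) & set(predicted))
    return correct / len(gold)


# Hypothetical example: four expected entities, three predictions
gold = [(0, 8, "GADGET"), (20, 28, "GADGET"), (4, 12, "GADGET"), (22, 30, "GADGET")]
predicted = [(0, 8, "GADGET"), (20, 26, "GADGET"), (4, 12, "GADGET")]

score = entity_accuracy(gold, predicted)
print(score)  # 0.5
```

The span `(20, 26)` is incomplete, so it counts as a mistake even though it overlaps a correct entity.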

## Training best practices

When you start running your own experiments, you might find that a lot of things just don't work the way you want them to. And that's okay.

Training models is an iterative process, and you have to try different things until you find out what works best.

In this lesson, I'll be sharing some best practices and things to keep in mind when training your own models.

Let's take a look at some of the problems you may come across.

### Problem 1: Models can "forget" things

* Existing model can overfit on new data
    * e.g.: if you only update it with WEBSITE, it can "unlearn" what a PERSON is
* Also known as the "catastrophic forgetting" problem

Statistical models can learn lots of things – but it doesn't mean that they won't unlearn them.

If you're updating an existing model with new data, especially new labels, it can overfit and adjust too much to the new examples.

For instance, if you're only updating it with examples of "website", it may "forget" other labels it previously predicted correctly – like "person".

This is also known as the catastrophic forgetting problem.

### Solution 1: Mix in previously correct predictions

To prevent this, make sure to always mix in examples of what the model previously got correct.

If you're training a new category "website", also include examples of "person".

spaCy can help you with this. You can create those additional examples by running the existing model over data and extracting the entity spans you care about.

You can then mix those examples in with your existing data and update the model with annotations of all labels.

* For example, if you're training WEBSITE, also include examples of PERSON
* Run existing spaCy model over data and extract all other relevant entities

**BAD:**

TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]

**GOOD:**

TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
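
Extracting the entity spans from a processed doc could look like the sketch below. In practice the doc would come from the existing pre-trained model; here a blank pipeline with a manually set entity stands in for it so the example runs without downloading a model. `doc_to_training_tuple` is a hypothetical helper, not part of spaCy.

```python
import spacy
from spacy.tokens import Span


def doc_to_training_tuple(doc, keep_labels):
    """Turn a doc's entities into a (text, annotations) training tuple."""
    entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
        if ent.label_ in keep_labels
    ]
    return (doc.text, {"entities": entities})


# Stand-in for running an existing model over the text
nlp = spacy.blank("en")
doc = nlp("Obama is a person")
doc.ents = [Span(doc, 0, 1, label="PERSON")]

extra_example = doc_to_training_tuple(doc, {"PERSON"})
print(extra_example)
```

Tuples produced this way can be appended to the new WEBSITE examples so that every update also sees labels the model already handled well.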

### Problem 2: Models can't learn everything

Another common problem is that your model just won't learn what you want it to.

spaCy's models make predictions based on the local context – for example, for named entities, the surrounding words are most important.

If the decision is difficult to make based on the context, the model can struggle to learn it.

The label scheme also needs to be consistent and not too specific.

For example, it may be very difficult to teach a model to predict whether something is adult clothing or children's clothing based on the context. However, just predicting the label "clothing" may work better.

spaCy's models make predictions based on local context:

* Model can struggle to learn if decision is difficult to make based on context

* Label scheme needs to be consistent and not too specific

For example: CLOTHING is better than ADULT_CLOTHING and CHILDRENS_CLOTHING

### Solution 2: Plan your label scheme carefully

Before you start training and updating models, it's worth taking a step back and planning your label scheme.

Try to pick categories that are reflected in the local context and make them more generic if possible.

You can always add a rule-based system later to go from generic to specific.

Generic categories like "clothing" or "band" are both easier to label and easier to learn.

Pick categories that are reflected in local context:
    
* More generic is better than too specific

* Use rules to go from generic labels to specific categories

**BAD:**

LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
          
**GOOD:**

LABELS = ['CLOTHING', 'BAND']
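
Going from a generic label to a specific category with rules might look like this. It is a plain-Python sketch: the keyword list and the function name `refine_clothing_label` are invented for illustration, and real rules would depend on your data.

```python
# Hypothetical keywords that suggest children's clothing
CHILDREN_KEYWORDS = {"kids", "children", "toddler", "baby"}


def refine_clothing_label(entity_text, label):
    """Map the generic CLOTHING label to a more specific category."""
    if label != "CLOTHING":
        return label  # leave other labels, such as BAND, untouched
    words = set(entity_text.lower().split())
    if words & CHILDREN_KEYWORDS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"


kids_label = refine_clothing_label("kids rain jacket", "CLOTHING")
adult_label = refine_clothing_label("leather jacket", "CLOTHING")
band_label = refine_clothing_label("Metallica", "BAND")
print(kids_label, adult_label, band_label)
```

The model only has to learn the easier generic label, while the rule layer can be changed at any time without retraining.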

## Good data vs. bad data

Here’s an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

In [32]:
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "TOURIST_DESTINATION")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "TOURIST_DESTINATION")]},
    ),
]

Why is this data and label scheme problematic? 

( X ) Whether a place is a tourist destination is a subjective judgement and not a definitive category. It will be very difficult for the entity recognizer to learn.

( ) Paris and Arkansas should also be labelled as tourist destinations for consistency. Otherwise, the model will be confused.

( ) Rare out-of-vocabulary words like the misspelled 'amsterdem' shouldn't be labelled as entities.

That's correct! A much better approach would be to only label GPE (geopolitical entity) or LOCATION and then use a rule-based system to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.

Rewrite the TRAINING_DATA to only use the label GPE (cities, states, countries) instead of TOURIST_DESTINATION.

Don’t forget to add tuples for the GPE entities that weren’t labeled in the old data.

In [33]:
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "GPE")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "GPE")]},
    ),
    (
        "There's also a Paris in Arkansas, lol",
        {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
    ),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]

## Training multiple labels

Here’s a small sample of a dataset created to train a new entity type WEBSITE. The original dataset contains a few thousand sentences. In this exercise, you’ll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool – for example, Brat, a popular open-source solution, or Prodigy, our own annotation tool that integrates with spaCy.

Complete the entity offsets for the WEBSITE entities in the data. Feel free to use len() if you don’t want to count the characters.

In [34]:
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    ("PewDiePie smashes YouTube record", {"entities": [(18, 25, "WEBSITE")]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE")]},
    ),
    # And so on...
]
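
Instead of counting characters by hand, the offsets can be computed with `str.find` and `len()`. A minimal sketch; `entity_offsets` is a hypothetical helper and only finds the first occurrence of the phrase in the text.

```python
def entity_offsets(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    # The end offset is exclusive: start plus the length of the phrase
    return (start, start + len(phrase), label)


text = "Reddit partners with Patreon to help creators build communities"
entities = [
    entity_offsets(text, "Reddit", "WEBSITE"),
    entity_offsets(text, "Patreon", "WEBSITE"),
]
print(entities)  # [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]
```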

A model was trained with the data you just labelled, plus a few thousand similar examples. After training, it’s doing great on WEBSITE, but doesn’t recognize PERSON anymore. Why could this be happening?

( ) It's very difficult for the model to learn about different categories like PERSON and WEBSITE.

( X ) The training data included no examples of PERSON, so the model learned that this label is incorrect.

( ) The hyperparameters need to be retuned so that both entity types can be recognized.

**If PERSON entities occur in the training data but aren’t labelled, the model will learn that they shouldn’t be predicted.** 

Similarly, if an existing entity type isn’t present in the training data, the model may “forget” and stop predicting it.

Update the training data to include annotations for the PERSON entities “PewDiePie” and “Alexis Ohanian”.

In [35]:
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]},
    ),
    # And so on...
]

## Wrapping up!

Congratulations – you've made it to the end of the course!



**Your new spaCy skills**

* Extract linguistic features: part-of-speech tags, dependencies, named entities
* Work with pre-trained statistical models
* Find words and phrases using Matcher and PhraseMatcher match rules
* Best practices for working with data structures Doc, Token, Span, Vocab, Lexeme
* Find semantic similarities using word vectors
* Write custom pipeline components with extension attributes
* Scale up your spaCy pipelines and make them fast
* Create training data for spaCy's statistical models
* Train and update spaCy's neural network models with new data


Here's an overview of all the new skills you learned so far:

In the first chapter, you learned how to extract linguistic features like part-of-speech tags, syntactic dependencies and named entities, and how to work with pre-trained statistical models.

You also learned to write powerful match patterns to extract words and phrases using spaCy's matcher and phrase matcher.

Chapter 2 was all about information extraction, and you learned how to work with the data structures, the Doc, Token and Span, as well as the vocab and lexical entries.

You also used spaCy to predict semantic similarities using word vectors.

In chapter 3, you got some more insights into spaCy's pipeline, and learned to write your own custom pipeline components that modify the Doc.

You also created your own custom extension attributes for Docs, Tokens and Spans, and learned about processing streams and making your pipeline faster.

Finally, in chapter 4, you learned about training and updating spaCy's statistical models, specifically the entity recognizer.

You learned some useful tricks for how to create training data, and how to design your label scheme to get the best results.

**More things to do with spaCy**

Of course, there's a lot more that spaCy can do that we didn't get to cover in this course.

While we focused mostly on training the entity recognizer, you can also train and update the other statistical pipeline components like the part-of-speech tagger and dependency parser.

Another useful pipeline component is the text classifier, which can learn to predict labels that apply to the whole text. It's not part of the pre-trained models, but you can add it to an existing model and train it on your own data.

Training and updating other pipeline components:

* Part-of-speech tagger
* Dependency parser
* Text classifier

Visit:

https://spacy.io/usage/training