Before we get started with the how, it's worth taking a second to ask ourselves: Why would we want to update the model with our own examples? Why can't we just rely on pre-trained pipelines?

Statistical models make predictions based on the examples they were trained on.

You can usually make the model more accurate by showing it examples from your domain.

You often also want to predict categories specific to your problem, so the model needs to learn about them.

This is essential for text classification, very useful for entity recognition and a little less critical for tagging and parsing.

spaCy supports updating existing models with more examples, and training new models. If we're not starting with a trained pipeline, we first initialize the weights randomly.

Next, spaCy calls nlp.update, which predicts a batch of examples with the current weights.

The model then checks the predictions against the correct answers, and decides how to change the weights to achieve better predictions next time.

Finally, we make a small correction to the current weights and move on to the next batch of examples.

spaCy then continues calling nlp.update for each batch of examples in the data. During training, you usually want to make multiple passes over the data and train until the model stops improving.

### How training works


1. Initialize the model weights randomly

2. Predict a few examples with the current weights

3. Compare prediction with true labels

4. Calculate how to change weights to improve predictions

5. Update weights slightly

6. Go back to 2.
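The six steps above can be sketched in plain Python, without spaCy. This is a toy logistic model trained by gradient descent on made-up data. The function names, learning rate, and examples are assumptions for illustration only; spaCy's `nlp.update` applies the same idea to neural network components rather than to a single weight vector.

```python
import math
import random


def predict(weights, features):
    """Return the probability of the positive class for one example."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, n_epochs=500, batch_size=2, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    examples = list(examples)  # copy so the caller's list is not shuffled
    n_features = len(examples[0][0])
    # 1. Initialize the model weights randomly
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    for _ in range(n_epochs):  # one epoch = one pass over the data
        rng.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            gradient = [0.0] * n_features
            for features, label in batch:
                # 2. Predict an example with the current weights
                prediction = predict(weights, features)
                # 3. Compare the prediction with the true label
                error = prediction - label
                # 4. Calculate how to change the weights
                for i, x in enumerate(features):
                    gradient[i] += error * x
            # 5. Update the weights slightly
            for i in range(n_features):
                weights[i] -= learning_rate * gradient[i] / len(batch)
            # 6. Go back to step 2 with the next batch
    return weights


# Toy data: (features, label); the first feature is a constant bias term.
EXAMPLES = [([1.0, 0.0], 0), ([1.0, 1.0], 1), ([1.0, 0.2], 0), ([1.0, 0.9], 1)]
weights = train(EXAMPLES)
predictions = [round(predict(weights, features)) for features, _ in EXAMPLES]
print(predictions)
```

Each inner-loop iteration corresponds to one update on one batch, and each outer-loop iteration is one pass over the data.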

In [1]:
### Training the entity recognizer

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# The entity recognizer tags words and phrases in context
# Each token can only be part of one entity
# Examples need to come with context

doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

# Texts with no entities are also important

doc = nlp("I need a new phone! Any tips?")
doc.ents = []

### The training data

Examples of what we want the model to predict in context

Update an existing model: a few hundred to a few thousand examples

Train a new category: a few thousand to a million examples

- spaCy's English models: 2 million words


Usually created manually by human annotators

Can be semi-automated – for example, using spaCy's Matcher!

### Training vs. evaluation data

Training data: used to update the model


Evaluation data:

- data the model hasn't seen during training
- used to calculate how accurate the model is
- should be representative of the data the model will see at runtime


When training your model, it's important to know how it's doing and whether it's learning the right thing. This is done by comparing the model's predictions on examples it hasn't seen during training to answers we already know. So in addition to the training data, you also need evaluation data, also called development data.

The evaluation data is used to calculate how accurate your model is. For example, an accuracy score of 90% means that the model predicted 90% of the evaluation examples correctly.
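For entity recognition, spaCy's training output breaks the score down into precision, recall and F-score rather than a single percentage. A minimal sketch of that calculation, assuming entities are represented as hypothetical `(doc_index, start, end, label)` tuples and that only exact matches count as correct:

```python
def score_entities(predicted, gold):
    """Compare predicted and gold entities for an evaluation corpus.

    Both arguments are sets of (doc_index, start, end, label) tuples.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Made-up annotations: the second prediction has the wrong end boundary.
gold = {(0, 0, 2, "GADGET"), (1, 3, 5, "GADGET"), (2, 0, 2, "GADGET")}
predicted = {(0, 0, 2, "GADGET"), (1, 3, 4, "GADGET")}
scores = score_entities(predicted, gold)
print(scores)
```

Precision is the share of predicted entities that are correct, and recall is the share of gold entities that were found. The F-score combines the two.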

In [None]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Create a Doc with entity spans
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
# Create another doc without entity spans
doc2 = nlp("I need a new phone! Any tips?")

docs = [doc1, doc2]  # and so on...

In [4]:
### Generating a training corpus (2)

import random

# Split the data into two portions:
# - training data: used to update the model
# - development data: used to evaluate the model
random.shuffle(docs)
train_docs = docs[:len(docs) // 2]
dev_docs = docs[len(docs) // 2:]
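A half-and-half split is only one option. The sketch below shows a seeded split with a configurable development fraction, so the same split can be reproduced later. The helper name `split_train_dev` and its defaults are assumptions for illustration, and plain strings stand in for annotated `Doc` objects.

```python
import random


def split_train_dev(items, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, dev) lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(items)  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for a list of annotated Doc objects
docs = [f"doc {i}" for i in range(10)]
train_docs, dev_docs = split_train_dev(docs, dev_fraction=0.2, seed=42)
print(len(train_docs), len(dev_docs))
```

Keeping the seed fixed means the development set stays the same between runs, which makes scores comparable.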

In [None]:
# DocBin: container to efficiently store and save Doc objects
# can be saved to a binary file
# binary files are used for training
from spacy.tokens import DocBin

# Create and save a collection of training docs
train_docbin = DocBin(docs=train_docs)
train_docbin.to_disk("./train.spacy")
# Create and save a collection of evaluation docs
dev_docbin = DocBin(docs=dev_docs)
dev_docbin.to_disk("./dev.spacy")


### Converting your data
# spacy convert lets you convert corpora in common formats
# supports .conll, .conllu, .iob and spaCy's old JSON format
$ python -m spacy convert ./train.gold.conll ./corpus

#### CREATING TRAINING DATA

spaCy’s rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS. You can print it to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as "GADGET".

Write a pattern for two tokens whose lowercase forms match "iphone" and "x".

Write a pattern for two tokens: one token whose lowercase form matches "iphone" and a digit.

In [None]:
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches "iphone" and a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)
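Because each token can only be part of one entity, assigning overlapping spans to `doc.ents` raises an error. The two patterns above do not produce overlapping matches, but broader patterns can. spaCy provides `spacy.util.filter_spans` for this case. The plain-Python sketch below illustrates the same longest-span-first idea on hypothetical `(start, end, label)` tuples; it is not spaCy's implementation.

```python
def filter_overlapping(spans):
    """Keep the longest spans, dropping any that share a token with a kept span."""
    # Longest spans first; ties are broken by the earlier start position.
    by_length = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    used_tokens = set()
    for start, end, label in by_length:
        tokens = set(range(start, end))
        if tokens & used_tokens:
            continue  # overlaps a span we already kept
        kept.append((start, end, label))
        used_tokens |= tokens
    return sorted(kept)


# Made-up matches: (0, 2) lies inside (0, 3), and (6, 7) lies inside (5, 7).
matches = [(0, 2, "GADGET"), (0, 3, "GADGET"), (5, 7, "GADGET"), (6, 7, "GADGET")]
filtered = filter_overlapping(matches)
print(filtered)
```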

In [None]:
# After creating the data for our corpus, we need to save it out to a .spacy file. The code from the previous example is already available.

# Instantiate the DocBin with the list of docs.
# Save the DocBin to a file called train.spacy.

import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, DocBin

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Add patterns to the matcher
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    doc.ents = spans
    docs.append(doc)

doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./train.spacy")

### Configuring and running the training

spaCy uses a config file, usually called config.cfg, as the "single source of truth" for all settings. The config file defines how to initialize the nlp object, which pipeline components to add and how their internal model implementations should be configured. It also includes all settings for the training process and how to load the data, including hyperparameters.

Instead of providing lots of arguments on the command line or having to remember to define every single setting in code, you only need to pass your config file to spaCy's training command.

Config files also help with reproducibility: you'll have all settings in one place and always know how your pipeline was trained. You can even check your config file into a Git repo to version it and share it with others so they can train the same pipeline with the same settings.

In [None]:
### The training config (2)
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
# And so on...
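The config uses an INI-like layout: dotted section names such as `[components.ner.model]` express nesting, and the values look like JSON. As a rough illustration only, Python's standard `configparser` and `json` modules can read the excerpt above into a nested dictionary. spaCy uses its own config loader, not the code below, and the helper name `load_nested` is hypothetical.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
"""


def load_nested(text):
    """Parse INI-style text into nested dicts, splitting section names on dots."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys exactly as written
    parser.read_string(text)
    config = {}
    for section in parser.sections():
        node = config
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, value in parser.items(section):
            node[key] = json.loads(value)  # values in this excerpt are valid JSON
    return config


config = load_nested(CONFIG_TEXT)
print(config["nlp"]["pipeline"])
print(config["components"]["ner"]["model"]["hidden_width"])
```

Reading the file this way makes the hierarchy visible: the `ner` component sits under `components`, and its model settings sit one level below that.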

### Generating a config


spaCy can auto-generate a default config file for you:

- interactive quickstart widget in the docs
- init config command on the CLI



$ python -m spacy init config ./config.cfg --lang en --pipeline ner


init config: the command to run

config.cfg: output path for the generated config

--lang: language class of the pipeline, e.g. en for English

--pipeline: comma-separated names of components to include


As an alternative to the quickstart widget, you can use spaCy's built-in init config command. It takes the output file as the first argument. We usually call this file config.cfg. The argument --lang defines the language class that should be used for the pipeline, for example, en for English. The --pipeline argument lets you specify one or more comma-separated pipeline components to include. In this example, we're creating a config with one pipeline component, the named entity recognizer.

### Training a pipeline (1)

- all you need is the config.cfg and the training and development data
- config settings can be overridden on the command line

$ python -m spacy train ./config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy

train: the command to run

config.cfg: the path to the config file

--output: the path to the output directory to save the trained pipeline

--paths.train: override with path to the training data

--paths.dev: override with path to the evaluation data
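An override such as `--paths.train train.spacy` addresses the `train` setting in the `[paths]` section of the config. The sketch below illustrates that mapping on a plain nested dictionary. The function name `apply_overrides` and the sample config values are assumptions for illustration, not spaCy's internals.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of a nested config dict with dotted-key overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        # "paths.train" -> sections ["paths"], setting name "train"
        *sections, name = dotted_key.split(".")
        node = result
        for section in sections:
            node = node.setdefault(section, {})
        node[name] = value
    return result


# Sample config with the data paths left unset
config = {"paths": {"train": None, "dev": None}, "nlp": {"lang": "en"}}
overrides = {"paths.train": "train.spacy", "paths.dev": "dev.spacy"}
updated = apply_overrides(config, overrides)
print(updated["paths"])
```

The original config is left unchanged, which mirrors the fact that command-line overrides do not modify the config file on disk.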

Each pass over the data is also called an "epoch". This is shown in the first column of the table.

Within each epoch, spaCy outputs the accuracy scores every 200 steps, where one step is one update on a batch of examples. These are the steps shown in the second column. You can change the frequency in the config. Each line shows the loss and calculated accuracy score at this point during training.

The most interesting score to keep an eye on is the combined score in the last column. It reflects how accurately your model predicted the correct answers in the evaluation data.

The training runs until the model stops improving and exits automatically.
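The automatic exit is governed by settings in the `[training]` section of the config, such as `patience`, the number of steps to continue without improvement. A simplified sketch of patience-based stopping follows. The function name, the scores, and the step values are made up for illustration and do not come from a real training run.

```python
def stopping_step(checkpoints, patience):
    """Return the step at which training stops, or None if it never does.

    `checkpoints` is a list of (step, score) pairs in evaluation order.
    A patience of 0 disables early stopping.
    """
    best_score = float("-inf")
    best_step = 0
    for step, score in checkpoints:
        if score > best_score:
            best_score, best_step = score, step
        if patience and step - best_step >= patience:
            return step
    return None


# Hypothetical evaluation results, one every 200 steps
checkpoints = [(0, 0.00), (200, 0.55), (400, 0.71), (600, 0.70), (800, 0.71), (1000, 0.69)]
print(stopping_step(checkpoints, patience=600))
```

In this made-up run, the best score is first reached at step 400, so with a patience of 600 steps the loop stops at step 1000.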

### Loading a trained pipeline


output after training is a regular loadable spaCy pipeline
- model-last: last trained pipeline
- model-best: best trained pipeline

The pipeline saved after training is a regular loadable spaCy pipeline, just like the trained pipelines provided by spaCy (for example, en_core_web_sm). At the end, the last trained pipeline and the pipeline with the best score are saved to the output directory.


load it with spacy.load

import spacy

nlp = spacy.load("/path/to/output/model-best")
doc = nlp("iPhone 11 vs iPhone 8: What's the difference?")
print(doc.ents)

### Packaging your pipeline

- spacy package: create an installable Python package containing your pipeline
- easy to version and deploy

$ python -m spacy package /path/to/output/model-best ./packages --name my_pipeline --version 1.0.0
$ cd ./packages/en_my_pipeline-1.0.0
$ pip install dist/en_my_pipeline-1.0.0.tar.gz


Load and use the pipeline after installation:

nlp = spacy.load("en_my_pipeline")



To make it easy to deploy your pipelines, spaCy provides a handy command to package them as Python packages. The spacy package command takes the path to your exported pipeline and an output directory. It then generates a Python package containing your pipeline. The Python package is a .tar.gz file and can be installed into your environment.

You can also provide an optional name and version on the command line. This lets you manage multiple different versions of a pipeline, for example, if you decide to customize your pipeline later or train it with more data.

The package behaves just like any other Python package. After installation, you can load your pipeline using its name. Note that spaCy will automatically add the language code to the name. So your pipeline my_pipeline will become en_my_pipeline.

In [None]:
The config.cfg file is the “single source of truth” for training a pipeline with spaCy. Which of the following is not true about the config?

It allows you to configure the training process and hyperparameters.

It helps make your training more reproducible.

It creates an installable Python package with your pipeline. (correct)

It defines the pipeline's components and their settings.

That's correct! The config file includes all settings related to training and how to set up the pipeline, but it doesn’t package your pipeline. To create an installable Python package, you can use the spacy package command.

In [None]:
# Use spaCy’s init config command to auto-generate a config for an English pipeline.
# Save the config to a file config.cfg.
# Use the --pipeline argument to specify one pipeline component, ner.

!python -m spacy init config ./config.cfg --lang en --pipeline ner


!cat ./config.cfg

Let’s use the config file generated in the previous exercise and the training corpus we’ve created to train a named entity recognizer!

The train command lets you train a model from a training config file. A file config_gadget.cfg is already available in the directory exercises/en, as well as a file train_gadget.spacy containing the training examples, and a file dev_gadget.spacy containing the evaluation examples. Because we’re executing the command in a Jupyter environment in this course, we’re using the prefix !. If you’re running the command in your local terminal, you can leave this out.

- Call the train command with the file exercises/en/config_gadget.cfg.
- Save the trained pipeline to a directory output.
- Pass in the exercises/en/train_gadget.spacy and exercises/en/dev_gadget.spacy paths.

In [None]:
!python -m spacy train ./exercises/en/config_gadget.cfg --output ./output --paths.train ./exercises/en/train_gadget.spacy --paths.dev ./exercises/en/dev_gadget.spacy