# Natural Language Processing

## Part 4: spaCy + Training a Neural Network

We'll write our own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.

Why update the model?

* Better results on your specific domain
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

### How training works

1. Initialize the model weights randomly
2. Predict a few examples with the current weights
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2
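
The six steps above can be sketched with a toy model in plain Python. This is only an illustration, not how spaCy implements training: the "model" is a single weight `w`, the data follows the made-up rule `y = 3 * x`, and the learning rate and number of steps are arbitrary choices.

```python
import random

random.seed(0)

# Toy data generated by the rule y = 3 * x (an assumption for this sketch)
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

w = random.uniform(-1.0, 1.0)   # 1. initialize the weight randomly
learning_rate = 0.01

for step in range(200):         # 6. go back to step 2
    batch = random.sample(examples, 2)  # pick a few examples
    gradient = 0.0
    for x, y_true in batch:
        y_pred = w * x                  # 2. predict with the current weight
        error = y_pred - y_true         # 3. compare prediction with true label
        gradient += 2 * error * x       # 4. derivative of error**2 w.r.t. w
    w -= learning_rate * gradient / len(batch)  # 5. update the weight slightly

print(round(w, 3))
```

Each update only nudges the weight a little, which is why many passes over small batches are needed before the predictions settle.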

### Training the entity recognizer

Let's look at an example for a specific component: the entity recognizer.

The entity recognizer takes a document and predicts phrases and their labels. This means that the training data needs to include texts, the entities they contain, and the entity labels.

Entities can't overlap, so each token can only be part of one entity.

Because the entity recognizer predicts entities in context, it also needs to be trained on entities and their surrounding context.

In [40]:
from spacy.tokens import Span

doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

It's also very important for the model to learn words that aren't entities.

In [41]:
doc = nlp("I need a new phone! Any tips?")
doc.ents = []
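
The rule that entities can't overlap is enforced when you assign to `doc.ents`. The sketch below assumes a blank English pipeline and two hand-made spans that share the token "iPhone"; `spacy.util.filter_spans` is one way to resolve such conflicts, since it keeps the longest span.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")

short_span = Span(doc, 0, 1, label="GADGET")  # "iPhone"
long_span = Span(doc, 0, 2, label="GADGET")   # "iPhone X"

# Both spans contain token 0, so assigning them together is rejected
try:
    doc.ents = [short_span, long_span]
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

# filter_spans drops overlapping spans, preferring the longest one
doc.ents = filter_spans([short_span, long_span])
print(overlap_rejected, doc.ents)
```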

## 1. Creating training data

The training data tells the model what we want it to predict. This could be texts and named entities we want to recognize, or tokens and their correct part-of-speech tags. To update an existing model, we can start with a few hundred to a few thousand examples. To train a new category, we may need up to a million. spaCy's pre-trained English models, for instance, were trained on **2 million words** labelled with part-of-speech tags, dependencies and named entities.

Training data is usually created by humans who assign labels to texts. This is a lot of work, but can be semi-automated – for example, using spaCy's `Matcher`.

spaCy’s rule-based `Matcher` is a great way to quickly create training data for named entity models. In the code below, a list of sentences is loaded from `data/iphone.json` into the variable `text`. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as `GADGET`.

In [42]:
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

with open("data/iphone.json", encoding="utf8") as f:
    text = json.loads(f.read())

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# A token whose lowercase form matches "iphone", followed by a digit token
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(text):
    print(doc)
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)

How to preorder the iPhone X
[iPhone X]
iPhone X is coming
[iPhone X]
Should I pay $1,000 for the iPhone X?
[iPhone X]
The iPhone 8 reviews are here
[iPhone 8]
iPhone 11 vs iPhone 8: What's the difference?
[iPhone 11, iPhone 8]
I need a new phone! Any tips?
[]


After creating the data for our corpus, we need to save it to a `.spacy` file.

`DocBin` is a container for efficiently storing and saving `Doc` objects. We instantiate one `DocBin` per split with a list of docs.

Here the first half of the docs is saved to `docs/train.spacy` and the second half to `docs/dev.spacy`.

In [44]:
from spacy.tokens import DocBin

train_docs = docs[:len(docs) // 2]
dev_docs   = docs[len(docs) // 2:]

train_doc_bin = DocBin(docs=train_docs)
train_doc_bin.to_disk("docs/train.spacy")

dev_doc_bin = DocBin(docs=dev_docs)
dev_doc_bin.to_disk("docs/dev.spacy")
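
To check that a `.spacy` file contains what you expect, you can read it back. The sketch below is self-contained, so it writes a single hand-made doc to a temporary directory instead of reusing `docs/train.spacy`; the round trip itself uses `DocBin.from_disk` and `DocBin.get_docs`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"  # temporary stand-in path
    DocBin(docs=[doc]).to_disk(path)

    # Load the file again and rebuild Doc objects with a shared vocab
    loaded_bin = DocBin().from_disk(path)
    loaded_docs = list(loaded_bin.get_docs(nlp.vocab))

restored = [(ent.text, ent.label_) for d in loaded_docs for ent in d.ents]
print(len(loaded_docs), restored)
```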

## 2. Configuring

Training in spaCy is very simple. All you need is a `config.cfg` file and the training/dev data.

### Generating a config

- spaCy can auto-generate a default config file for you
- interactive quickstart widget (https://spacy.io/usage/training#quickstart) in the docs
- `init config` command on the CLI
    
    python -m spacy init config configs/config.cfg --lang en --pipeline ner

Note: `configs` is a folder I created for this project; you can use any folder you want. Similarly, you can give `config.cfg` any name you want, e.g., `my_config.cfg`.

Note: the arguments used here are:

- `--lang`: language class of the pipeline, e.g. `en` for English
- `--pipeline`: comma-separated names of components to include

Note: Because we’re executing the command in a Jupyter environment in this course, we’re using the prefix `!`. If you’re running the command in your local terminal, you can leave this out.

In [48]:
!python -m spacy init config configs/config.cfg --lang en --pipeline ner

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
configs/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Let’s take a look at the config spaCy just generated! You can run the command below to print the config to the terminal and inspect it.

In [49]:
!cat configs/config.cfg

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${compon
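
Two features of the config format are worth a closer look: `null` values that are filled in later (such as `paths.train`) and `${section.key}` references that reuse a value from another section. The sketch below uses the `Config` class from `thinc`, the library spaCy's config system builds on. The config string is a made-up miniature, and the `[encoder]` and `[listener]` section names are hypothetical, chosen only to show interpolation.

```python
from thinc.api import Config

# Miniature config for illustration only (not the generated file)
CONFIG_STR = """
[paths]
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]

[encoder]
width = 96

[listener]
width = ${encoder.width}
"""

# Overrides use dot notation, like --paths.train on the command line
config = Config().from_str(
    CONFIG_STR, overrides={"paths.train": "docs/train.spacy"}
)

print(config["paths"])              # train is overridden, dev stays None
print(config["nlp"]["pipeline"])    # parsed into a Python list
print(config["listener"]["width"])  # resolved from [encoder] width
```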

## 3. Training the pipeline

Training can be run from the command line with the command below:

    python -m spacy train ./config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy

Note: the parts of the command are:

- `train`: the command to run
- `config.cfg`: the path to the config file
- `--output`: the path to the output directory where the trained pipeline is saved
- `--paths.train`: override with the path to the training data
- `--paths.dev`: override with the path to the evaluation data

In [52]:
!python -m spacy train configs/config.cfg --output ./output --paths.train docs/train.spacy --paths.dev docs/dev.spacy

✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU
[2022-12-14 07:37:21,815] [INFO] Set up nlp object from config
[2022-12-14 07:37:21,826] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-12-14 07:37:21,830] [INFO] Created vocabulary
[2022-12-14 07:37:21,831] [INFO] Finished initializing nlp object
[2022-12-14 07:37:21,912] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     11.17    0.00    0.00    0.00    0.00
200     200          0.95    139.37  100.00  100.00  100.00    1.00
400     400          0.00      0.00  100.00  100.00  100.00    1.00
600     600          0.00      0.00  100.00  100.00  100.00    1.00
800     80
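
The CLI runs the loop from "How training works" for you. For comparison, here is a rough sketch of a hand-written loop using `nlp.update`. It is not what `spacy train` does internally (there is no config, no dev-set evaluation and no `tok2vec` sharing), and the number of epochs, the batch size and the hand-written character offsets are assumptions made for this example.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import fix_random_seed, minibatch

fix_random_seed(0)
random.seed(0)

# (text, annotations) pairs with character offsets, written by hand
TRAIN_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# Initialize the weights and get an optimizer
optimizer = nlp.initialize(lambda: examples)

for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    # Predict, compare and update on small batches of examples
    for batch in minibatch(examples, size=2):
        nlp.update(batch, sgd=optimizer, losses=losses)

print(losses)
```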

## 4. Loading the pipeline

The output of training is a regular, loadable spaCy pipeline. Two versions are saved in the output directory: `model-last`, the last trained pipeline, and `model-best`, the best trained pipeline. We can load either one with `spacy.load`.

In [53]:
import spacy

nlp = spacy.load("output/model-best")
doc = nlp("iPhone 11 vs iPhone 8: What's the difference?")
print(doc.ents)

(iPhone 11, iPhone 8)
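
`spacy.load` accepts any directory that a pipeline was saved to. The sketch below shows the save and load round trip without needing the `output` folder: as a stand-in for the trained model it uses a blank pipeline with an `entity_ruler` (an assumption for this example, not the trained `ner` component) and a temporary directory.

```python
import tempfile

import spacy

# Stand-in pipeline: rules instead of a trained entity recognizer
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GADGET", "pattern": [{"LOWER": "iphone"}, {"LOWER": "x"}]},
    {"label": "GADGET", "pattern": [{"LOWER": "iphone"}, {"IS_DIGIT": True}]},
])

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)            # save the pipeline to a directory
    reloaded = spacy.load(tmp_dir)  # load it back from that directory

doc = reloaded("iPhone 11 vs iPhone 8: What's the difference?")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(reloaded.pipe_names, found)
```

Printing `ent.label_` next to `ent.text` is a quick way to confirm that the loaded pipeline predicts the label you trained, not just the right spans.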


## 5. Packaging the pipeline

`spacy package`: create an installable Python package containing your pipeline; easy to version and deploy

    python -m spacy package /path/to/output/model-best ./packages --name my_pipeline --version 1.0.0

To install the package:

    cd ./packages/en_my_pipeline-1.0.0
    pip install dist/en_my_pipeline-1.0.0.tar.gz

Then load and use it:

    nlp = spacy.load("en_my_pipeline")

Packaging isn't needed for the rest of this notebook, so I will skip running this part.

## Appendix - Training best practices

When you start running your own experiments, you might find that a lot of things just don't work the way you want them to. And that's okay.

Training models is an iterative process, and you have to try different things until you find out what works best.

Here we will be sharing some best practices and things to keep in mind when training your own models.

Let's take a look at some of the problems you may come across.

### Problem 1: Models can "forget" things

* An existing model can overfit on new data, e.g. if you only update it with WEBSITE, it can "unlearn" what a PERSON is
* Also known as the "catastrophic forgetting" problem

### Solution 1: Mix in previously correct predictions

To prevent this, make sure to always mix in examples of what the model previously got correct.

If you're training a new category "website", also include examples of "person".
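
One way to get such examples is to run the existing pipeline over some texts and keep its predictions as extra training data. The sketch below assumes a stand-in "existing model" (a blank pipeline with an `entity_ruler` that knows one PERSON) and a made-up unlabelled sentence. In a real project you would use your trained pipeline and review its predictions before trusting them.

```python
import random

import spacy

random.seed(0)

# Stand-in for the existing model that already predicts PERSON
existing_nlp = spacy.blank("en")
ruler = existing_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Alexis Ohanian"}])

# New annotations for the category we want to add
new_data = [
    (
        "Reddit partners with Patreon",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
]

# Texts the existing model handles well (hypothetical example)
unlabelled_texts = ["Alexis Ohanian gave away two tickets"]

# Turn the existing model's predictions into training annotations
revision_data = [
    (
        doc.text,
        {"entities": [(e.start_char, e.end_char, e.label_) for e in doc.ents]},
    )
    for doc in existing_nlp.pipe(unlabelled_texts)
]

# Mix old and new so updates don't see only WEBSITE examples
mixed_data = new_data + revision_data
random.shuffle(mixed_data)
print(revision_data)
```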

### Problem 2: Models can't learn everything

Another common problem is that your model just won't learn what you want it to.

spaCy's models make predictions based on the local context – for example, for named entities, the surrounding words are most important.

If the decision is difficult to make based on the context, the model can struggle to learn it.

The label scheme also needs to be consistent and not too specific.

For example, it may be very difficult to teach a model to predict whether something is adult clothing or children's clothing based on the context. However, just predicting the label "clothing" may work better.

### Solution 2: Plan your label scheme carefully

Before you start training and updating models, it's worth taking a step back and planning your label scheme.

Try to pick categories that are reflected in the local context and make them more generic if possible.

You can always add a **rule-based system** (or `EntityRuler`) later to go from generic to specific.

**BAD:**

LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
          
**GOOD:**

LABELS = ['CLOTHING', 'BAND']

## Appendix - Good data vs. bad data

Here’s an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

In [54]:
doc1 = nlp("i went to amsterdem last year and the canals were beautiful")
doc1.ents = [Span(doc1, 3, 4, label="TOURIST_DESTINATION")]

doc2 = nlp("You should visit Paris once, but the Eiffel Tower is kinda boring")
doc2.ents = [Span(doc2, 3, 4, label="TOURIST_DESTINATION")]

doc3 = nlp("There's also a Paris in Arkansas, lol")
doc3.ents = []

doc4 = nlp("Berlin is perfect for summer holiday: great nightlife and cheap beer!")
doc4.ents = [Span(doc4, 0, 1, label="TOURIST_DESTINATION")]

Why is this data and label scheme problematic? Whether a place is a tourist destination is a **subjective judgement** and not a definitive category. It will be very difficult for the entity recognizer to learn.

A much better approach would be to only label `GPE` (geopolitical entity) or `LOCATION` and then use a **rule-based system** to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.

Here's a better version of the annotations:

In [55]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("i went to amsterdem last year and the canals were beautiful")
doc1.ents = [Span(doc1, 3, 4, label="GPE")]

doc2 = nlp("You should visit Paris once, but the Eiffel Tower is kinda boring")
doc2.ents = [Span(doc2, 3, 4, label="GPE")]

doc3 = nlp("There's also a Paris in Arkansas, lol")
doc3.ents = [Span(doc3, 4, 5, label="GPE"), Span(doc3, 6, 7, label="GPE")]

doc4 = nlp("Berlin is perfect for summer holiday: great nightlife and cheap beer!")
doc4.ents = [Span(doc4, 0, 1, label="GPE")]
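
The rule-based step could then look something like the sketch below. The `TOURIST_DESTINATIONS` set is a hypothetical stand-in for a knowledge base or travel wiki lookup, and the extension name `is_tourist_destination` is made up for this example.

```python
import spacy
from spacy.tokens import Span

# Hypothetical stand-in for a knowledge base or travel wiki
TOURIST_DESTINATIONS = {"amsterdam", "paris", "berlin"}


def is_tourist_destination(span):
    """Return True if a GPE entity is listed in the lookup table."""
    return span.label_ == "GPE" and span.text.lower() in TOURIST_DESTINATIONS


# Register a custom attribute, available as ent._.is_tourist_destination
Span.set_extension(
    "is_tourist_destination", getter=is_tourist_destination, force=True
)

nlp = spacy.blank("en")
doc = nlp("There's also a Paris in Arkansas, lol")
doc.ents = [Span(doc, 4, 5, label="GPE"), Span(doc, 6, 7, label="GPE")]

results = [(ent.text, ent._.is_tourist_destination) for ent in doc.ents]
print(results)
```

A plain string lookup like this cannot tell Paris, France from Paris, Arkansas, and it would miss the misspelled "amsterdem". A real system would need to resolve each entity to a knowledge base entry first. The point is that this logic lives outside the statistical model, where it is easy to inspect and change.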

## Appendix - Training multiple labels

Here’s a small sample of a dataset created to train a new entity type WEBSITE. The original dataset contains a few thousand sentences. In this exercise, you’ll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool – for example, **Brat**, a popular open-source solution, or **Prodigy**, an annotation tool from the makers of spaCy that integrates with spaCy.

In [34]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("Reddit partners with Patreon to help creators build communities")
doc1.ents = [
    Span(doc1, 0, 1, label="WEBSITE"),
    Span(doc1, 3, 4, label="WEBSITE"),
]

doc2 = nlp("PewDiePie smashes YouTube record")
doc2.ents = [Span(doc2, 2, 3, label="WEBSITE")]

doc3 = nlp("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans")
doc3.ents = [Span(doc3, 0, 1, label="WEBSITE")]

A model was trained with the data you just labelled, plus a few thousand similar examples. After training, it’s doing great on WEBSITE, but doesn’t recognize PERSON anymore. Why could this be happening?

That is because the training data included no examples of PERSON, so the model learned that this label is incorrect.

**If PERSON entities occur in the training data but aren’t labelled, the model will learn that they shouldn’t be predicted.** 

Similarly, if an existing entity type isn’t present in the training data, the model may “forget” and stop predicting it.

Let's update the training data to include annotations for the PERSON entities “PewDiePie” and “Alexis Ohanian”.

In [56]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("Reddit partners with Patreon to help creators build communities")
doc1.ents = [
    Span(doc1, 0, 1, label="WEBSITE"),
    Span(doc1, 3, 4, label="WEBSITE"),
]

doc2 = nlp("PewDiePie smashes YouTube record")
doc2.ents = [Span(doc2, 0, 1, label="PERSON"), Span(doc2, 2, 3, label="WEBSITE")]

doc3 = nlp("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans")
doc3.ents = [Span(doc3, 0, 1, label="WEBSITE"), Span(doc3, 2, 4, label="PERSON")]
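
Before training, it can help to count the labels in your annotated docs and check that every label you expect is present. The sketch below rebuilds two of the docs above so it runs on its own; the `expected_labels` set is an assumption about which labels this dataset should contain.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc2 = nlp("PewDiePie smashes YouTube record")
doc2.ents = [Span(doc2, 0, 1, label="PERSON"), Span(doc2, 2, 3, label="WEBSITE")]

doc3 = nlp("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans")
doc3.ents = [Span(doc3, 0, 1, label="WEBSITE"), Span(doc3, 2, 4, label="PERSON")]

# Count how often each label occurs across the annotated docs
label_counts = Counter(ent.label_ for doc in (doc2, doc3) for ent in doc.ents)

# Labels we expect the model to keep predicting (assumed for this example)
expected_labels = {"PERSON", "WEBSITE"}
missing_labels = expected_labels - set(label_counts)

print(label_counts, missing_labels)
```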