### Creating training data (1)

spaCy's rule-based `Matcher` is a great way to quickly create training data for named entity models. A list of sentences is available as the variable `TEXTS`. You can print it to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as `"GADGET"`.

- Write a pattern for two tokens whose lowercase forms match `"iphone"` and `"x"`.
- Write a pattern for two tokens: one whose lowercase form matches `"iphone"`, followed by a digit.

In [1]:
'''
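Note: matcher.add works a bit differently in spaCy v3. It takes the patterns as a list,
matcher.add("GADGET", [pattern1, pattern2]), whereas v2 used
matcher.add("GADGET", None, pattern1, pattern2).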
'''
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches "iphone" and a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and check the result
matcher.add("GADGET", [pattern1, pattern2])
for doc in nlp.pipe(TEXTS):
    print([doc[start:end] for match_id, start, end in matcher(doc)])

[iPhone X]
[iPhone X]
[iPhone X]
[iPhone 8]
[iPhone 11, iPhone 8]
[]
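
Each match is a `(match_id, start, end)` tuple, and the `match_id` is the hash of the name passed to `matcher.add`. As a minimal sketch (the example sentence is taken from the output above, and only the digit pattern is registered to keep it short), the hash can be turned back into the string label through `nlp.vocab.strings`:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# spaCy v3 signature: a name plus a list of patterns
matcher.add("GADGET", [[{"LOWER": "iphone"}, {"IS_DIGIT": True}]])

doc = nlp("The iPhone 8 reviews are here")
# Look up the string name for each match_id and pair it with the matched text
labelled = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(labelled)
```

Reading the label from the match rather than hard-coding it becomes useful once the matcher holds patterns for more than one entity type.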


### Creating training data (2)

Let's use the match patterns we've created in the previous exercise to bootstrap a set of training examples. A list of sentences is available as the variable `TEXTS`.

- Create a doc object for each text using `nlp.pipe`.
- Match on the `doc` and create a list of matched spans.
- Get `(start character, end character, label)` tuples of matched spans.
- Format each example as a tuple of the text and a dict, mapping `"entities"` to the entity tuples.
- Append the example to `TRAINING_DATA` and inspect the printed data.

In [5]:
'''
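Note: matcher.add works a bit differently in spaCy v3. It takes the patterns as a list,
matcher.add("GADGET", [pattern1, pattern2]), whereas v2 used
matcher.add("GADGET", None, pattern1, pattern2).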
'''
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", [pattern1, pattern2])

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
("iPhone 11 vs iPhone 8: What's the difference?", {'entities': [(0, 9, 'GADGET'), (13, 21, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
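
The entity tuples are character offsets into the text, not token indices. A quick sanity check, sketched here in plain Python with three of the examples printed above (the helper name `entity_texts` is made up for illustration), is to slice each text with its offsets and confirm that the expected strings come back:

```python
# Three examples copied from the printed TRAINING_DATA
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    (
        "iPhone 11 vs iPhone 8: What's the difference?",
        {"entities": [(0, 9, "GADGET"), (13, 21, "GADGET")]},
    ),
    ("I need a new phone! Any tips?", {"entities": []}),
]


def entity_texts(example):
    """Hypothetical helper: return the substring covered by each entity."""
    text, annotations = example
    return [text[start:end] for start, end, label in annotations["entities"]]


recovered = [entity_texts(example) for example in TRAINING_DATA]
for row in recovered:
    print(row)
```

An offset that slices out a partial word or stray whitespace usually means the annotation is misaligned with the tokenization and will not be usable for training.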


### Setting up the pipeline

In this exercise, you'll prepare a spaCy pipeline to train the entity recognizer to recognize `"GADGET"` entities in a text – for example, "iPhone X".

- Create a blank `"en"` model, for example using the `spacy.blank` method.
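- Create a new entity recognizer and add it to the pipeline. In spaCy v3, `nlp.add_pipe("ner")` does both and returns the component (v2 used `nlp.create_pipe` followed by `nlp.add_pipe`).
- Add the new label `"GADGET"` to the entity recognizer using the `add_label` method on the pipeline component.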

In [9]:
import spacy

# Create a blank "en" model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline.
# In spaCy v3, add_pipe takes the component name and returns the component.
ner = nlp.add_pipe("ner")

# Add the label "GADGET" to the entity recognizer
ner.add_label("GADGET")

1
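
To confirm the setup worked, the pipeline and the component can be inspected directly. This is a small sketch assuming spaCy v3, where `nlp.add_pipe` accepts the component name as a string:

```python
import spacy

nlp = spacy.blank("en")
# v3: create the component, add it to the pipeline, and get it back in one call
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

# Names of the components in pipeline order
print(nlp.pipe_names)
# Labels the entity recognizer currently knows about
print(ner.labels)
```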

### Building a training loop

Let's write a simple training loop from scratch!  

The pipeline you've created in the previous exercise is available as the `nlp` object.  
It already contains the entity recognizer with the added label `"GADGET"`.  
  
The small set of labelled examples that you've created previously is available as `TRAINING_DATA`.  

To see the examples, you can print them in your script.  

- Call `nlp.begin_training` (`nlp.initialize` in spaCy v3), create a training loop for 10 iterations and shuffle the training data.
- Create batches of training data using `spacy.util.minibatch` and iterate over the batches.
- Convert the `(text, annotations)` tuples in the batch to lists of `texts` and `annotations` (in spaCy v3, to a list of `Example` objects instead).
- For each batch, use `nlp.update` to update the model with the texts and annotations (in v3, with the examples).
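
The shuffle, batch, and unzip steps can be sketched without spaCy. The generator `fixed_size_batches` below is a hypothetical stand-in for `spacy.util.minibatch` with a constant size, not spaCy's implementation, and the four examples are copied from the previous exercise's output:

```python
import random

TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]


def fixed_size_batches(items, size):
    """Hypothetical stand-in for spacy.util.minibatch with a constant size."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Shuffle in place so each iteration sees the examples in a different order
random.shuffle(TRAINING_DATA)

batches = list(fixed_size_batches(TRAINING_DATA, size=2))
for batch in batches:
    # Unzip the (text, annotations) tuples into two parallel sequences
    texts, annotations = zip(*batch)
    print(len(texts), len(annotations))
```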

In [38]:
'''
spaCy v2 and v3 are very different here. The course exercise is based on v2, so the code
below is adapted to the v3 training API; the original v2 solution is kept at the end for reference.
'''
import spacy
import random
import json

from spacy.training import Example

with open("gadgets.json", encoding="utf8") as f:
    TRAINING_DATA = json.loads(f.read())

nlp = spacy.blank("en")
# In spaCy v3, add_pipe takes the component name and returns the component
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

# Start the training (nlp.initialize replaces v2's nlp.begin_training)
nlp.initialize()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    
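    # Batch the examples and iterate over the batches
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        # Convert each (text, annotations) pair in the batch into an Example
        examples = [
            Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in batch
        ]
        # Update the model once per batch
        nlp.update(examples, losses=losses, drop=0.3)
        # losses accumulates within an iteration, so this prints a running total
        print(losses)
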
'''
Original spaCy v2 solution, kept for reference (does not run on v3):

import spacy
import random
import json

with open("exercises/en/gadgets.json", encoding="utf8") as f:
    TRAINING_DATA = json.loads(f.read())

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
    print(losses)
'''

{'ner': 8.158883213996887}
{'ner': 17.940181970596313}
{'ner': 30.74606227874756}
{'ner': 6.24123278260231}
{'ner': 14.7780322432518}
{'ner': 20.45983499288559}
{'ner': 3.5592862367630005}
{'ner': 6.439008187502623}
{'ner': 10.109526328742504}
{'ner': 1.6003090951126069}
{'ner': 4.297520248335786}
{'ner': 7.065472565176606}
{'ner': 0.9593647629226325}
{'ner': 3.129635869915546}
{'ner': 5.524009907433992}
{'ner': 5.002823667789926}
{'ner': 12.618434515912668}
{'ner': 15.255126489033046}
{'ner': 2.2171880868263543}
{'ner': 3.3403736823629515}
{'ner': 4.743925209504141}
{'ner': 0.9311934591439694}
{'ner': 2.411232209146192}
{'ner': 2.4181894513918323}
{'ner': 0.010205586965110758}
{'ner': 0.013129191760111514}
{'ner': 1.4251981868387849}
{'ner': 1.8457801859176162}
{'ner': 1.8459988337280244}
{'ner': 1.8460088892541209}
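
The printed values rise within each group of three because `losses` is reset at the start of every iteration and then accumulates over that iteration's batches. The output has 30 lines for 10 iterations, so there are three batches per iteration, and every third value is the total loss for one iteration. A short sketch that pulls those totals out of the numbers printed above:

```python
# Running losses copied from the output above, in print order
printed = [
    8.158883213996887, 17.940181970596313, 30.74606227874756,
    6.24123278260231, 14.7780322432518, 20.45983499288559,
    3.5592862367630005, 6.439008187502623, 10.109526328742504,
    1.6003090951126069, 4.297520248335786, 7.065472565176606,
    0.9593647629226325, 3.129635869915546, 5.524009907433992,
    5.002823667789926, 12.618434515912668, 15.255126489033046,
    2.2171880868263543, 3.3403736823629515, 4.743925209504141,
    0.9311934591439694, 2.411232209146192, 2.4181894513918323,
    0.010205586965110758, 0.013129191760111514, 1.4251981868387849,
    1.8457801859176162, 1.8459988337280244, 1.8460088892541209,
]

# 30 printed lines over 10 iterations means 3 batches per iteration
BATCHES_PER_ITERATION = 3

# The last running value in each group of three is that iteration's total
epoch_totals = printed[BATCHES_PER_ITERATION - 1::BATCHES_PER_ITERATION]
for itn, total in enumerate(epoch_totals):
    print(itn, round(total, 3))
```

Read this way, the total falls from about 30.7 in the first iteration to about 1.8 in the last, with one noticeable spike in between.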
