## Training Custom NER Models

Today's Goals/Agenda:
- Review Wednesday materials and our deeper diver into spaCy
- Building and training custom NER models
- Discussion of how we build training data

Final afternoon of TAPI and NER workshop 🥳😳!!

In [None]:
# # Download Spacy in Binder
# !pip install -U pip setuptools wheel
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

In [None]:
# # Download Spacy if running locally
# import sys
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m spacy download en_core_web_sm

In [12]:
import pandas as pd
import spacy
from spacy.scorer import Scorer
from spacy.training import Example
import stanza
import spacy_stanza
import numpy as np
import nltk

Here's some of our original code from Monday 

In [7]:
# Load in our data
chars_df = pd.read_csv('./archive/Characters.csv', delimiter=';')
chars_df['split_names'] = chars_df.Name.str.split(' ')
film1_df = pd.read_csv('./archive/Harry Potter 1.csv', delimiter=';')

def find_entities(row):
    # Find character names from chars_df
    character_names = chars_df.split_names.tolist()
    identified_names = []
    for names in character_names:
        if any(name in row.Sentence for name in names):
            identified_names.append(' '.join(names))
    row['identified_names'] = identified_names
    return row

film1_entities = film1_df.apply(find_entities, axis=1)
film1_entities = film1_entities[film1_entities.identified_names.astype(bool)]
film1_exploded = film1_entities.explode('identified_names')

Some of our code from Wednesday 👇🏽

In [9]:
# Load film datasets
film1_df = pd.read_csv('./archive/Harry Potter 1.csv', delimiter=';')
film2_df = pd.read_csv('./archive/Harry Potter 2.csv', delimiter=';')
film3_df = pd.read_csv('./archive/Harry Potter 3.csv', delimiter=';')

# Combine film dataframes
film3_df.columns = map(str.capitalize, film3_df.columns)
film1_df['movie_number'] = 'film 1'
film2_df['movie_number'] = 'film 2'
film3_df['movie_number'] = 'film 3'
films_df = pd.concat([film1_df, film2_df, film3_df])

chars_df['full_names'] = np.where(
    chars_df.split_names.str.len() == 2, 
    chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[1].str.lower(), 
    np.where(
        chars_df.split_names.str.len() > 2,
        chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[-1].str.lower(),
        chars_df.split_names.str[0].str.lower())) 
chars_df['first_name'] = chars_df.split_names.str[0].str.lower()
chars_df['last_name'] = chars_df.split_names.str[-1].str.lower()

full_names = chars_df.full_names.unique().tolist()
first_names = chars_df.first_name.unique().tolist()
last_names = chars_df.last_name.unique().tolist()

# Get unique names and create our rules
names = list(set(first_names) | set(last_names))
unique_names = list(set(names) | set(full_names))
list_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name}"}], "id": f"{name}"} for name in unique_names if len(name.split(' ')) == 1]

list_full_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name.split(' ')[0]}"}, {"LOWER": f"{name.split(' ')[1]}"}], "id": f"{'-'.join(name.split(' '))}"} for name in full_names if len(name.split(' ')) > 1]

all_names = list_names + list_full_names
print('example rule', all_names[0])

# Load our models and pass our rules
full_nlp = spacy.load("en_core_web_sm")
ruler = full_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)

blank_nlp = spacy.blank("en")
ruler = blank_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)

evaluation_data = []
def evaluate_spacy_models(row):
    # Let's also add in blank_nlp
    sentences = nltk.sent_tokenize(row.Sentence.lower())
    for sentence in sentences:
        spacy_full = full_nlp(sentence)
        list_entities = []
        for token in spacy_full.ents:
            list_entities.append([token.start_char, token.end_char, token.label_])
        if len(list_entities) > 0:
            entry = (sentence,{"entities": list_entities})
            evaluation_data.append(entry)
    return row
films_df.apply(evaluate_spacy_models, axis=1)

def evaluate(ner_model, testing_data):
    scorer = Scorer()
    examples = []
    for input_, annot in testing_data:
        doc_gold_text = ner_model.make_doc(input_)
        example = Example.from_dict(doc_gold_text, annot)
        example.predicted = ner_model(input_)
        examples.append(example)
        
    print(scorer.score(examples))

# print the results
evaluate(full_nlp, evaluation_data)

example rule {'label': 'PERSON', 'pattern': [{'LOWER': 'parvati'}], 'id': 'parvati'}
{'token_acc': 1.0, 'token_p': 1.0, 'token_r': 1.0, 'token_f': 1.0, 'sents_p': 1.0, 'sents_r': 1.0, 'sents_f': 1.0, 'tag_acc': None, 'pos_acc': None, 'morph_acc': None, 'morph_per_feat': None, 'dep_uas': None, 'dep_las': None, 'dep_las_per_type': None, 'ents_p': 1.0, 'ents_r': 1.0, 'ents_f': 1.0, 'ents_per_type': {'PERSON': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'CARDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'DATE': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'TIME': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'ORDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'NORP': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'QUANTITY': {'p': 1.0, 'r': 1.0, 'f': 1.0}}, 'cats_score': 0.0, 'cats_score_desc': 'macro F', 'cats_micro_p': 0.0, 'cats_micro_r': 0.0, 'cats_micro_f': 0.0, 'cats_macro_p': 0.0, 'cats_macro_r': 0.0, 'cats_macro_f': 0.0, 'cats_macro_auc': 0.0, 'cats_f_per_type': {}, 'cats_auc_per_type': {}}


In [10]:
# How we can test our models
# https://en.wikipedia.org/wiki/Harry_Potter
test_text = "The series continues with Harry Potter and the Chamber of Secrets, describing Harry's second year at Hogwarts. He and his friends investigate a 50-year-old mystery that appears uncannily related to recent sinister events at the school. Ron's younger sister, Ginny Weasley, enrols in her first year at Hogwarts, and finds an old notebook in her belongings which turns out to be the diary of a previous student, Tom Marvolo Riddle, written during World War II. He is later revealed to be Voldemort's younger self, who is bent on ridding the school of 'mudbloods', a derogatory term describing wizards and witches of non-magical parentage. The memory of Tom Riddle resides inside of the diary and when Ginny begins to confide in the diary, Voldemort is able to possess her. Through the diary, Ginny acts on Voldemort's orders and unconsciously opens the 'Chamber of Secrets', unleashing an ancient monster, later revealed to be a basilisk, which begins attacking students at Hogwarts. It kills those who make direct eye contact with it and petrifies those who look at it indirectly. The book also introduces a new Defence Against the Dark Arts teacher, Gilderoy Lockhart, a highly cheerful, self-conceited wizard with a pretentious facade, later turning out to be a fraud. Harry discovers that prejudice exists in the Wizarding World through delving into the school's history, and learns that Voldemort's reign of terror was often directed at wizards and witches who were descended from Muggles. Harry also learns that his ability to speak the snake language Parseltongue is rare and often associated with the Dark Arts. When Hermione is attacked and petrified, Harry and Ron finally piece together the puzzles and unlock the Chamber of Secrets, with Harry destroying the diary for good and saving Ginny, and, as they learn later, also destroying a part of Voldemort's soul. The end of the book reveals Lucius Malfoy, Draco's father and rival of Ron and Ginny's father, to be the culprit who slipped the book into Ginny's belongings."

doc = blank_nlp(test_text)
for token in doc.ents:
    print('blank', token.text, token.label_)

doc = full_nlp(test_text)
for token in doc.ents:
    print('full', token.text, token.label_)

blank Harry Potter PERSON
blank Harry PERSON
blank Weasley PERSON
blank Tom PERSON
blank Riddle PERSON
blank Tom Riddle PERSON
blank Gilderoy Lockhart PERSON
blank Harry PERSON
blank Harry PERSON
blank Hermione PERSON
blank Harry PERSON
blank Harry PERSON
blank Lucius Malfoy PERSON
blank Draco PERSON
full Harry Potter PERSON
full the Chamber of Secrets ORG
full Harry PERSON
full second year DATE
full Hogwarts ORG
full 50-year-old DATE
full Ron PERSON
full Ginny Weasley PERSON
full her first year DATE
full Hogwarts ORG
full Tom Marvolo Riddle PERSON
full World War II EVENT
full Voldemort ORG
full Tom Riddle PERSON
full Ginny PERSON
full Voldemort ORG
full Ginny ORG
full Voldemort ORG
full Hogwarts PRODUCT
full Defence Against the Dark Arts ORG
full Gilderoy Lockhart PERSON
full Harry PERSON
full the Wizarding World LOC
full Voldemort PERSON
full Muggles PERSON
full Harry PERSON
full the Dark Arts ORG
full Hermione PERSON
full Harry PERSON
full Ron PERSON
full the Chamber of Secrets ORG


In [16]:
# And then save our models to disk. You can read more here https://spacy.io/usage/saving-loading
blank_nlp.to_disk('blank_nlp')
full_nlp.to_disk('full_nlp')

***

### Training a Custom NER Model

Two ways you can train.
Either:
1. Train a completely blank model
2. Finetune a pretrained model

Today we're going to try both options and discuss the tradeoffs.

On Wednesday we were trying to build our training data, but it turns out there was a bit of a problem with the approach I attempted. Let's try it out and see why it didn't work.

As a reminder our goal was to take our spells dataset and use it to create a new entity label, named `SPELL`.

In [14]:
spells_df = pd.read_csv('./archive/Spells.csv', sep=";")
spells_df[0:2]

Unnamed: 0,Name,Incantation,Type,Effect,Light
0,Summoning Charm,Accio,Charm,Summons an object,
1,Age Line,Unknown,Charm,Prevents people above or below a certain age f...,Blue


Let's first check that spells exist in our films dataset.

In [15]:
def find_spells(row):
    spells = spells_df[spells_df.Incantation.isna() == False].Incantation.unique().tolist()
    identified_spells = []
    for spell in spells:
        if spell in row.Sentence:
            identified_spells.append(spell)
    row['identified_spells'] = ', '.join(identified_spells) if len(identified_spells) > 0 else ''
    return row
films_spells = films_df.apply(find_spells, axis=1)


In [16]:
films_spells[films_spells.identified_spells.str.len() > 1]

Unnamed: 0,Character,Sentence,movie_number,identified_spells
430,Hermione,Oculus Reparo.,film 1,"Oculus Reparo, Reparo"
720,Hermione,Alohomora,film 1,Alohomora
722,Ron,Alohomora?,film 1,Alohomora
776,Flitwick,Wingardium Leviosa.,film 1,Wingardium Leviosa
786,Hermione,Wingardium Leviosa.,film 1,Wingardium Leviosa
831,Ron,Wingardium Leviosa!,film 1,Wingardium Leviosa
1324,Hermione,Petrificus Totalus.,film 1,Petrificus Totalus
1332,Hermione,Alohomora.,film 1,Alohomora
1391,Ron,Alohomora!,film 1,Alohomora
266,HERMIONE,Oculus Reparo.,film 2,"Oculus Reparo, Reparo"


So here's where we left off on Wednesday... 

To build training data, we can do a few different approaches. Initially, I was hoping to have us use a similar approach to our `PERSON` example.

If you remember, we figured out how to extract character names and then created spaCy rules and fed them into both a pre-trained and blank model.

```python
chars_df['full_names'] = np.where(
    chars_df.split_names.str.len() == 2, 
    chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[1].str.lower(), 
    np.where(
        chars_df.split_names.str.len() > 2,
        chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[-1].str.lower(),
        chars_df.split_names.str[0].str.lower())) 

full_names = chars_df.full_names.unique().tolist()
first_names = chars_df.first_name.unique().tolist()
last_names = chars_df.last_name.unique().tolist()

names = list(set(first_names) | set(last_names))
unique_names = list(set(names) | set(full_names))
list_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name}"}], "id": f"{name}"} for name in unique_names if len(name.split(' ')) == 1]

list_full_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name.split(' ')[0]}"}, {"LOWER": f"{name.split(' ')[1]}"}], "id": f"{'-'.join(name.split(' '))}"} for name in full_names if len(name.split(' ')) > 1]

all_names = list_names + list_full_names
print('example rule', all_names[0])

# Load our models and pass our rules
full_nlp = spacy.load("en_core_web_sm")
ruler = full_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)

blank_nlp = spacy.blank("en")
ruler = blank_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)
```
So let's try and replicate this with our spells data.

First we needed to extract our spells from spells_df. I've written one way to do that below but right now it is splitting our spells apart so that we feed in `Wingardium` and `Leviosa` as separate tokens. On Wednesday we decided that didn't make sense, so let's tweak this code to only create rules with the full spell.

In [40]:
spells_df.Incantation = spells_df.Incantation.fillna("")
spells = spells_df.Incantation.unique().tolist()
spells = [spell for spell in spells if len(spell) > 1]
list_spells = []
for spell in spells:
    split_spells = spell.split()
    patterns = []
    for sp in split_spells:
        patterns.append({"LOWER": f"{sp.lower()}"})
    spell_dict = {"label": "SPELL", "pattern": patterns, "id": f"{'-'.join(spell.split())}"}
    list_spells.append(spell_dict)
list_spells[0:1]

[{'label': 'SPELL', 'pattern': [{'LOWER': 'accio'}], 'id': 'Accio'}]

Now that we have our rules let's try passing them into a blank model. 

In [86]:
spell_nlp = spacy.blank('en')
ruler = spell_nlp.add_pipe("entity_ruler") 
ruler.add_patterns(list_spells)

Looks like this worked! (*slight detour into storytime for why one small error broke my code on Wednesday*). https://spacy.io/usage/rule-based-matching

Next step is testing that our model is working. Let's try passing it some text!

In [48]:
test_doc = spell_nlp(films_spells[films_spells.identified_spells.str.len() > 1].Sentence.values[0])
for ent in test_doc.ents:
    print(ent.text, ent.label_)

Oculus Reparo SPELL


Alright so let's build a function for generating our training data. I have a small function below, but let's tweak it a bit so that we can pass in any dataframe and model, and get our list of training data returned.

In [70]:
training_data = []
def generate_training_data(row):
    # Let's also add in blank_nlp
    sentences = nltk.sent_tokenize(row.Sentence.lower())
    for sentence in sentences:
        spacy_full = spell_nlp(sentence)
        list_entities = []
        for token in spacy_full.ents:
            list_entities.append([token.start_char, token.end_char, token.label_])
        if len(list_entities) > 0:
            entry = (sentence,{"entities": list_entities})
            training_data.append(entry)
    return row
films_spells[films_spells.identified_spells.str.len() > 1].apply(generate_training_data, axis=1)

Unnamed: 0,Character,Sentence,movie_number,identified_spells
430,Hermione,Oculus Reparo.,film 1,"Oculus Reparo, Reparo"
720,Hermione,Alohomora,film 1,Alohomora
722,Ron,Alohomora?,film 1,Alohomora
776,Flitwick,Wingardium Leviosa.,film 1,Wingardium Leviosa
786,Hermione,Wingardium Leviosa.,film 1,Wingardium Leviosa
831,Ron,Wingardium Leviosa!,film 1,Wingardium Leviosa
1324,Hermione,Petrificus Totalus.,film 1,Petrificus Totalus
1332,Hermione,Alohomora.,film 1,Alohomora
1391,Ron,Alohomora!,film 1,Alohomora
266,HERMIONE,Oculus Reparo.,film 2,"Oculus Reparo, Reparo"


In [50]:
training_data

[('oculus reparo.', {'entities': [[0, 13, 'SPELL']]}),
 ('alohomora', {'entities': [[0, 9, 'SPELL']]}),
 ('alohomora?', {'entities': [[0, 9, 'SPELL']]}),
 ('wingardium leviosa.', {'entities': [[0, 18, 'SPELL']]}),
 ('wingardium leviosa.', {'entities': [[0, 18, 'SPELL']]}),
 ('wingardium leviosa!', {'entities': [[0, 18, 'SPELL']]}),
 (' petrificus totalus.', {'entities': [[1, 19, 'SPELL']]}),
 ('alohomora.', {'entities': [[0, 9, 'SPELL']]}),
 ('alohomora!', {'entities': [[0, 9, 'SPELL']]}),
 ('oculus reparo.', {'entities': [[0, 13, 'SPELL']]}),
 ('peskipiksi pesternomi!', {'entities': [[0, 21, 'SPELL']]}),
 ('immobulus!', {'entities': [[0, 9, 'SPELL']]}),
 ('vera verto.', {'entities': [[0, 10, 'SPELL']]}),
 ('vera verto.', {'entities': [[0, 10, 'SPELL']]}),
 ('vera verto!', {'entities': [[0, 10, 'SPELL']]}),
 ('finite incantatem!', {'entities': [[0, 6, 'SPELL']]}),
 ('brackium emendo!', {'entities': [[0, 15, 'SPELL']]}),
 ('expelliarmus!', {'entities': [[0, 12, 'SPELL']]}),
 ('everte st

For those in Williams' morning workshop you probably know what comes next. But now we need to manipulate our training data list into a format that works with spaCy 3.0

Instead of ingesting json files, spaCy now requires that our data be stored in the proprietary `.spacy` format. To do that we need to use the `DocBin` class.

I've included code from this Medium blog post https://towardsdatascience.com/using-spacy-3-0-to-build-a-custom-ner-model-c9256bea098. Let's rework the code into a function that takes a model, data, and file name, and then writes the data in the spacy format to file.

In [60]:
from spacy.tokens import DocBin

db = DocBin() 

for text, annot in training_data[19*2:]: 
    doc = spell_nlp.make_doc(text) 
    ents = []
    for entity in annot["entities"]:
        start = entity[0]
        end= entity[1]
        label = entity[2]
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)
db.to_disk(f"./valid_spells.spacy")

In [58]:
training_data[0:19*2], training_data[19*2:]

[('riddikulus!', {'entities': [[0, 10, 'SPELL']]}),
 ('riddikulus!', {'entities': [[0, 10, 'SPELL']]}),
 ('then speak the incantation, expecto patronum.',
  {'entities': [[28, 44, 'SPELL']]}),
 ('expecto patronum.', {'entities': [[0, 16, 'SPELL']]}),
 ('expecto patronum!', {'entities': [[0, 16, 'SPELL']]}),
 ('expecto patronum!', {'entities': [[0, 16, 'SPELL']]}),
 ('expecto patronum!', {'entities': [[0, 16, 'SPELL']]}),
 ('nox.', {'entities': [[0, 3, 'SPELL']]}),
 ('expelliarmus!', {'entities': [[0, 12, 'SPELL']]}),
 ('expelliarmus!', {'entities': [[0, 12, 'SPELL']]}),
 ('expelliarmus!', {'entities': [[0, 12, 'SPELL']]}),
 ('expelliarmus!', {'entities': [[0, 12, 'SPELL']]}),
 ('expecto patronum!', {'entities': [[0, 16, 'SPELL']]}),
 ('immobulus!', {'entities': [[0, 9, 'SPELL']]}),
 ('expecto patronum!', {'entities': [[0, 16, 'SPELL']]}),
 ('bombarda!', {'entities': [[0, 8, 'SPELL']]}),
 ('none of it made any difference.', {'entities': [[0, 4, 'SPELL']]}),
 ('lumos.', {'entities': [[0,

Before we run our custom training, we actually need evaulation data. Let's split our original training data into training and evaluation and then write them both to disk.

Now we can finally start training our model 🥳! Let's go to the spaCy docs https://spacy.io/usage/training#quickstart and we'll see that we need to download our base_config.

Once we have that we can run the following code:
```python
!python -m spacy init fill-config base_config.cfg config.cfg
```

This will populate our config file with our relevant values. You'll notice in the Quickstart, there are multiple settings. We'll try out some of these later on today.

Now let's finally run our model!

```python
!python -m spacy train config.cfg --output ./output/spells-model/ --paths.train ./train_spells.spacy --paths.dev ./valid_spells.spacy
```
This will likely take a while to run but I did run it earlier below.

In [61]:
!python -m spacy train config.cfg --output ./output/spells-model/ --paths.train ./train_spells.spacy --paths.dev ./valid_spells.spacy

[38;5;2m✔ Created output directory: output/spells-model[0m
[38;5;4mℹ Using CPU[0m
[1m
[2021-07-02 11:02:39,791] [INFO] Set up nlp object from config
[2021-07-02 11:02:39,801] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-07-02 11:02:39,806] [INFO] Created vocabulary
[2021-07-02 11:02:44,275] [INFO] Added vectors: en_core_web_lg
[2021-07-02 11:02:44,276] [INFO] Finished initializing nlp object
[2021-07-02 11:02:44,547] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     83.83   24.00   19.35   31.58    0.24
200     200          0.25    421.15   92.31   90.00   94.74    0.92
400     400          0.00      0.00   92.31   90.00   94.74    0.92
600     600          0.00      0.00   92.31   90.00 

Let's take a look at the spaCy docs to understand our output https://spacy.io/usage/training#metrics and https://spacy.io/usage/training#basics

For evaluating our model, we can load in the best run of our model that is saved to the output file and see how it does with new unseen text data.

In [64]:
model_best = spacy.load('./output/spells-model/model-best')
model_best.pipe_names

['tok2vec', 'ner']

In [67]:
test_text = "Ron Weasley: Wingardium Leviosa! Hermione Granger: You're saying it wrong. It's Wing-gar-dium Levi-o-sa, make the 'gar' nice and long. Ron Weasley: You do it, then, if you're so clever"

doc = model_best(test_text)
for ent in doc.ents:
    print('best', ent.text, ent.label_)

# We can even comare this to our original blank model
test_doc = spell_nlp(test_text)
for ent in test_doc.ents:
    print('blank', ent.text, ent.label_)

best Ron Weasley SPELL
best Wingardium Leviosa SPELL
best Hermione Granger SPELL
best saying SPELL
best wrong SPELL
best Wing SPELL
best gar SPELL
best dium Levi SPELL
best gar SPELL
best long SPELL
best Ron Weasley SPELL
blank Wingardium Leviosa SPELL


Such a short line is difficult to evaluate so let's try a longer line of text:

```python
test_text = """53. Imperio - Makes target obey every command But only for really, really funny pranks. 52. Piertotum Locomotor - Animates statues On one hand, this is awesome. On the other, someone would use this to scare me.

51. Aparecium - Make invisible ink appear

Your notes will be so much cooler.

50. Defodio - Carves through stone and steel

Sometimes you need to get the eff out of there.

49. Descendo - Moves objects downward

You'll never have to get a chair to reach for stuff again.

48. Specialis Revelio - Reveals hidden magical properties in an object

I want to know what I'm eating and if it's magical.

47. Meteolojinx Recanto - Ends effects of weather spells

Otherwise, someone could make it sleet in your bedroom forever.

46. Cave Inimicum/Protego Totalum - Strengthens an area's defenses

Helpful, but why are people trying to break into your campsite?

45. Impedimenta - Freezes someone advancing toward you

"Stop running at me! But also, why are you running at me?"

44. Obscuro - Blindfolds target

Finally, we don't have to rely on "No peeking."

43. Reducto - Explodes object

The "raddest" of all spells.

42. Anapneo - Clears someone's airway

This could save a life, but hopefully you won't need it.

41. Locomotor Mortis - Leg-lock curse

Good for footraces and Southwest Airlines flights.

40. Geminio - Creates temporary, worthless duplicate of any object

You could finally live your dream of lying on a bed of marshmallows, and you'd only need one to start.

39. Aguamenti - Shoot water from wand

No need to replace that fire extinguisher you never bought.

38. Avada Kedavra - The Killing Curse

One word: bugs.

37. Repelo Muggletum - Repels Muggles

Sounds elitist, but seriously, Muggles ruin everything. Take it from me, a Muggle.

36. Stupefy - Stuns target

Since this is every other word of the "Deathly Hallows" script, I think it's pretty useful."""
```

So seems like in this instance our blank model performs better than our trained model. Why do we think that is?

To help us understand what's going on, let's try comparing our `SPELL` model to one trained on character names.

Our list of rules for character names exist in `all_names` variable. First step is that we have to add our rules to a spaCy model. We could create a new blank model, but instead let's add it to our spell_nlp model, so that it can identify characters and spells.

In [87]:
ruler.add_patterns(all_names)

We're also going to increase our training data by using data from all of the movie scripts!

In [80]:
dfs = []
for i in range(1,8):
    df = pd.read_csv(f'./archive/hp{i}.csv')
    dfs.append(df)
hp_dfs = pd.concat(dfs)
hp_dfs[0:1]

Unnamed: 0,movie,chapter,character,dialog
0,Harry Potter and the Philosopher's Stone,Doorstep Delivery,Albus Dumbledore,I should have known that you would be here...P...


So now we need to build our training data and then convert it to the spaCy format. (Small hint make sure we use this format `hp_dfs[hp_dfs.dialog.isna()==False]`)


In [106]:
### CALL TRAINING FUNCTIONS HERE


In [107]:
!python -m spacy train config.cfg --output ./output/person-model/ --paths.train ./train_person.spacy --paths.dev ./valid_person.spacy

[38;5;4mℹ Using CPU[0m
[1m
[2021-07-02 12:21:29,371] [INFO] Set up nlp object from config
[2021-07-02 12:21:29,382] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-07-02 12:21:29,386] [INFO] Created vocabulary
[2021-07-02 12:21:32,556] [INFO] Added vectors: en_core_web_lg
[2021-07-02 12:21:32,556] [INFO] Finished initializing nlp object
[2021-07-02 12:21:34,291] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     54.80    4.90   27.37    2.69    0.05
  1     200         12.72    946.93   91.44   94.58   88.51    0.91
  2     400         13.45     92.84   93.23   95.25   91.30    0.93
  5     600         11.02     38.52   93.86   95.21   92.55    0.94
  7     800          8.96     28.72   93.37  

We're getting pretty decent results. We could try and find more data (making scraping fanfiction?) but overall this isn't bad considering we still know we have issues with some character names (like Ginny and Ron).

Part of the reason we aren't getting better results is something that Ines Montani describes in this Stack Overflow answer https://stackoverflow.com/questions/50580262/how-to-use-spacy-to-create-a-new-entity-and-learn-only-from-keyword-list/50603247#50603247

"The advantage of training the named entity recognizer to detect SPECIES in your text is that the model won't only be able to recognise your examples, but also generalise and recognise other species in context. If you only want to find a fixed set of terms and not more, a simpler, rule-based approach might work better for you. You can find examples and details of this here.

If you do want the model to generalise and recognise your entity type in context, you also have to show it examples of the entities in context. That's currently the problem with your training examples: you're only showing the model single words, not sentences containing the words. To get good results, the data you're training the model with needs to be as close as possible to the data you later want to analyse.

While there are other approaches for training models without or with fewer labelled examples, the most straightforward strategy for collecting training data to train your spaCy model is to... label training data. However, there are some tricks you can use to make this less painful:

Start with a list of species and use the Matcher or PhraseMatcher to find them in your documents. For each match, you'll get a Span object, so you can extract the start and end position of the span in the text. This easily lets you create a bunch of examples automatically. You can find some more details on this here.

Use word vectors to find more similar terms to the entities you're looking for, so you get more examples you can search for in your text using the above approach. I'm not sure how spaCy's vector models will do for your species, since the terms are quite specific. So if you have a large corpus of raw text containing species, you might have to train your own vectors.

Use a labelling or data annotation tool. There are open-source solutions like Brat, or, once you're getting more serious about annotation and training, you might also want to check out our annotation tool Prodigy, which is a modern commercial solution that integrates seamlessly with spaCy (Disclaimer: I'm one of the spaCy maintainers)."

These final two options of word vectors or using a data annotation tool are things we have yet to discuss.

spaCy actually offers a popular data annotation tool called Prodigy https://prodi.gy/ (though it does require buying a license). Another popular and free tool is INCEpTION https://inception-project.github.io/. You can find a good overview of the many tools available https://bohemian.ai/blog/text-annotation-tools-which-one-pick-2020/ here. 

Let's try out the option of word vectors though. Say we didn't have character names how could we generate some example entities.

The first option would be just adding all our film script data to a spaCy pre-trained model and then extracting the named entities as a our initial training data.

The other option is using gensim to create Custom Word Vectors

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#training-your-own-model

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

In [None]:
list_sentences = []
import string
import nltk

def build_corpus(row):
    sentences = nltk.sent_tokenize(row.Sentence)
    for sent in sentences:
        cleaned_sentence = sent.translate(str.maketrans("", "", string.punctuation))
        list_sentences.append(cleaned_sentence.split())

films_df.apply(build_corpus, axis=1)
# list_sentences

In [None]:
w2v_model = Word2Vec(min_count=1,
    window=2,
    vector_size=10,
    sample=6e-5,
    alpha=0.03,
    min_alpha=0.0007,
    negative=20)
w2v_model.build_vocab(list_sentences, progress_per=10000)
w2v_model.train(list_sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

In [None]:
word_vectors = w2v_model.wv
word_vectors.most_similar('professor')
.save_word2vec_format("./data/word2vec.txt")

In [None]:
!python -m spacy init vectors en vectors.txt output/vector-model

In [None]:
vector_model = spacy.load('./output/vector-model')

In [None]:
for sentence in films_df.Sentence.tolist()[0:10]:
    doc = vector_model(sentence)
    for ent in doc.ents:
        print(ent.text, ent.label_)

https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting

https://en.wikipedia.org/wiki/Catastrophic_interference