## Training Custom NER Models

Today's Goals/Agenda:
- Review Wednesday materials and our deeper dive into spaCy
- Discussion of how we evaluate NER models and libraries
- Building and training custom NER models

Final afternoon of TAPI and NER workshop 🥳😳!!

In [None]:
# # Download Spacy in Binder
# !pip install -U pip setuptools wheel
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

In [None]:
# # Download Spacy if running locally
# import sys
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m spacy download en_core_web_sm

In [12]:
import pandas as pd
import spacy
from spacy.scorer import Scorer
from spacy.training import Example
import stanza
import spacy_stanza
import numpy as np
import nltk

Here's some of our original code from Monday 

In [7]:
# Load in our data
chars_df = pd.read_csv('./archive/Characters.csv', delimiter=';')
chars_df['split_names'] = chars_df.Name.str.split(' ')
film1_df = pd.read_csv('./archive/Harry Potter 1.csv', delimiter=';')

def find_entities(row):
    # Find character names from chars_df
    character_names = chars_df.split_names.tolist()
    identified_names = []
    for names in character_names:
        if any(name in row.Sentence for name in names):
            identified_names.append(' '.join(names))
    row['identified_names'] = identified_names
    return row

film1_entities = film1_df.apply(find_entities, axis=1)
film1_entities = film1_entities[film1_entities.identified_names.astype(bool)]
film1_exploded = film1_entities.explode('identified_names')

Some of our code from Wednesday 👇🏽

In [9]:
# Load film datasets
film1_df = pd.read_csv('./archive/Harry Potter 1.csv', delimiter=';')
film2_df = pd.read_csv('./archive/Harry Potter 2.csv', delimiter=';')
film3_df = pd.read_csv('./archive/Harry Potter 3.csv', delimiter=';')

# Combine film dataframes
film3_df.columns = map(str.capitalize, film3_df.columns)
film1_df['movie_number'] = 'film 1'
film2_df['movie_number'] = 'film 2'
film3_df['movie_number'] = 'film 3'
films_df = pd.concat([film1_df, film2_df, film3_df])

chars_df['full_names'] = np.where(
    chars_df.split_names.str.len() == 2, 
    chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[1].str.lower(), 
    np.where(
        chars_df.split_names.str.len() > 2,
        chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[-1].str.lower(),
        chars_df.split_names.str[0].str.lower())) 
chars_df['first_name'] = chars_df.split_names.str[0].str.lower()
chars_df['last_name'] = chars_df.split_names.str[-1].str.lower()

full_names = chars_df.full_names.unique().tolist()
first_names = chars_df.first_name.unique().tolist()
last_names = chars_df.last_name.unique().tolist()

# Get unique names and create our rules
names = list(set(first_names) | set(last_names))
unique_names = list(set(names) | set(full_names))
list_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name}"}], "id": f"{name}"} for name in unique_names if len(name.split(' ')) == 1]

list_full_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name.split(' ')[0]}"}, {"LOWER": f"{name.split(' ')[1]}"}], "id": f"{'-'.join(name.split(' '))}"} for name in full_names if len(name.split(' ')) > 1]

all_names = list_names + list_full_names
print('example rule', all_names[0])

# Load our models and pass our rules
full_nlp = spacy.load("en_core_web_sm")
ruler = full_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)

blank_nlp = spacy.blank("en")
ruler = blank_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)

evaluation_data = []
def evaluate_spacy_models(row):
    # Let's also add in blank_nlp
    sentences = nltk.sent_tokenize(row.Sentence.lower())
    for sentence in sentences:
        spacy_full = full_nlp(sentence)
        list_entities = []
        for token in spacy_full.ents:
            list_entities.append([token.start_char, token.end_char, token.label_])
        if len(list_entities) > 0:
            entry = (sentence,{"entities": list_entities})
            evaluation_data.append(entry)
    return row
films_df.apply(evaluate_spacy_models, axis=1)

def evaluate(ner_model, testing_data):
    scorer = Scorer()
    examples = []
    for input_, annot in testing_data:
        doc_gold_text = ner_model.make_doc(input_)
        example = Example.from_dict(doc_gold_text, annot)
        example.predicted = ner_model(input_)
        examples.append(example)
        
    print(scorer.score(examples))

# print the results
evaluate(full_nlp, evaluation_data)

example rule {'label': 'PERSON', 'pattern': [{'LOWER': 'parvati'}], 'id': 'parvati'}
{'token_acc': 1.0, 'token_p': 1.0, 'token_r': 1.0, 'token_f': 1.0, 'sents_p': 1.0, 'sents_r': 1.0, 'sents_f': 1.0, 'tag_acc': None, 'pos_acc': None, 'morph_acc': None, 'morph_per_feat': None, 'dep_uas': None, 'dep_las': None, 'dep_las_per_type': None, 'ents_p': 1.0, 'ents_r': 1.0, 'ents_f': 1.0, 'ents_per_type': {'PERSON': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'CARDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'DATE': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'TIME': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'ORDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'NORP': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'QUANTITY': {'p': 1.0, 'r': 1.0, 'f': 1.0}}, 'cats_score': 0.0, 'cats_score_desc': 'macro F', 'cats_micro_p': 0.0, 'cats_micro_r': 0.0, 'cats_micro_f': 0.0, 'cats_macro_p': 0.0, 'cats_macro_r': 0.0, 'cats_macro_f': 0.0, 'cats_macro_auc': 0.0, 'cats_f_per_type': {}, 'cats_auc_per_type': {}}
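
The `ents_p`, `ents_r`, and `ents_f` values above are precision, recall, and F-score over entity spans. Note that `evaluation_data` was generated by `full_nlp` itself, so scoring `full_nlp` against it compares the model with its own predictions, which is why every entity score is 1.0. A minimal sketch of the span-level arithmetic, assuming exact `(start, end, label)` matching; `span_prf` is a hypothetical helper (not a spaCy function) and the spans are hand-made for illustration:

```python
def span_prf(gold, predicted):
    """Precision, recall, and F1 over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hand-made illustrative spans (not taken from the films data)
gold = [(0, 5, "PERSON"), (10, 18, "PERSON"), (30, 38, "ORG")]
pred = [(0, 5, "PERSON"), (10, 18, "ORG")]

# One of two predictions is right, and one of three gold spans is found
print(span_prf(gold, pred))
# Scoring annotations against themselves is always perfect
print(span_prf(gold, gold))
```

A more informative evaluation would score the model against annotations it did not produce, for example hand-checked labels.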


In [10]:
# https://en.wikipedia.org/wiki/Harry_Potter
test_text = """The series continues with Harry Potter and the Chamber of Secrets, describing Harry's second year at Hogwarts. He and his friends investigate a 50-year-old mystery that appears uncannily related to recent sinister events at the school. Ron's younger sister, Ginny Weasley, enrols in her first year at Hogwarts, and finds an old notebook in her belongings which turns out to be the diary of a previous student, Tom Marvolo Riddle, written during World War II. He is later revealed to be Voldemort's younger self, who is bent on ridding the school of 'mudbloods', a derogatory term describing wizards and witches of non-magical parentage. The memory of Tom Riddle resides inside of the diary and when Ginny begins to confide in the diary, Voldemort is able to possess her. Through the diary, Ginny acts on Voldemort's orders and unconsciously opens the 'Chamber of Secrets', unleashing an ancient monster, later revealed to be a basilisk, which begins attacking students at Hogwarts. It kills those who make direct eye contact with it and petrifies those who look at it indirectly. The book also introduces a new Defence Against the Dark Arts teacher, Gilderoy Lockhart, a highly cheerful, self-conceited wizard with a pretentious facade, later turning out to be a fraud. Harry discovers that prejudice exists in the Wizarding World through delving into the school's history, and learns that Voldemort's reign of terror was often directed at wizards and witches who were descended from Muggles. Harry also learns that his ability to speak the snake language Parseltongue is rare and often associated with the Dark Arts. When Hermione is attacked and petrified, Harry and Ron finally piece together the puzzles and unlock the Chamber of Secrets, with Harry destroying the diary for good and saving Ginny, and, as they learn later, also destroying a part of Voldemort's soul. 
The end of the book reveals Lucius Malfoy, Draco's father and rival of Ron and Ginny's father, to be the culprit who slipped the book into Ginny's belongings."""

doc = blank_nlp(test_text)
for token in doc.ents:
    print('blank', token.text, token.label_)

doc = full_nlp(test_text)
for token in doc.ents:
    print('full', token.text, token.label_)

blank Harry Potter PERSON
blank Harry PERSON
blank Weasley PERSON
blank Tom PERSON
blank Riddle PERSON
blank Tom Riddle PERSON
blank Gilderoy Lockhart PERSON
blank Harry PERSON
blank Harry PERSON
blank Hermione PERSON
blank Harry PERSON
blank Harry PERSON
blank Lucius Malfoy PERSON
blank Draco PERSON
full Harry Potter PERSON
full the Chamber of Secrets ORG
full Harry PERSON
full second year DATE
full Hogwarts ORG
full 50-year-old DATE
full Ron PERSON
full Ginny Weasley PERSON
full her first year DATE
full Hogwarts ORG
full Tom Marvolo Riddle PERSON
full World War II EVENT
full Voldemort ORG
full Tom Riddle PERSON
full Ginny PERSON
full Voldemort ORG
full Ginny ORG
full Voldemort ORG
full Hogwarts PRODUCT
full Defence Against the Dark Arts ORG
full Gilderoy Lockhart PERSON
full Harry PERSON
full the Wizarding World LOC
full Voldemort PERSON
full Muggles PERSON
full Harry PERSON
full the Dark Arts ORG
full Hermione PERSON
full Harry PERSON
full Ron PERSON
full the Chamber of Secrets ORG


In [16]:
# To save spaCy models to disk, use the following syntax.
# Be careful not to give a saved model the same name as a downloaded spaCy model such as 'en_core_web_sm', because that will overwrite that model.
# You can read more here: https://spacy.io/usage/saving-loading
blank_nlp.to_disk('blank_nlp')
full_nlp.to_disk('full_nlp')

***

### Training a Custom NER Model

A few things to keep in mind:
- You can use pre-trained NER models off the shelf, and those might be close enough to what you need.
- Custom training is not always the answer, but it might be if you have a label not included in existing models or if you have :
- if you do need to cu

In [19]:
spells_df = pd.read_csv('./archive/Spells.csv', sep=";")
spells_df[0:2]

Unnamed: 0,Name,Incantation,Type,Effect,Light
0,Summoning Charm,Accio,Charm,Summons an object,
1,Age Line,Unknown,Charm,Prevents people above or below a certain age f...,Blue


Let's first check that spells exist in our films dataset.

In [20]:
def find_spells(row):
    # Record every known incantation that appears verbatim (case-sensitive) in this line of dialogue
    spells = spells_df[spells_df.Incantation.notna()].Incantation.unique().tolist()
    identified_spells = []
    for spell in spells:
        if spell in row.Sentence:
            identified_spells.append(spell)
    # Joining an empty list gives '', so rows without spells get an empty string
    row['identified_spells'] = ', '.join(identified_spells)
    return row
films_spells = films_df.apply(find_spells, axis=1)


In [None]:
spells_df.Incantation = spells_df.Incantation.fillna("")
spells = spells_df.Incantation.unique().tolist()
list_spells = []
for spell in spells:
    split_spells = spell.split()
    patterns = []
    for sp in split_spells:
        patterns.append({"LOWER": f"{sp.lower()}"})
    spell_dict = {"label": "TEXT", "pattern": patterns, "id": f"{'-'.join(spell.split())}"}
    list_spells.append(spell_dict)
list_spells[0:1]

In [None]:
# # list_spells vs all_names
# test_nlp = spacy.blank('en')
# ruler = test_nlp.add_pipe("entity_ruler") 

# ruler.add_patterns(all_names)

In [None]:
dfs = []
for i in range(1,8):
    df = pd.read_csv(f'./archive/hp{i}.csv')
    dfs.append(df)
hp_dfs = pd.concat(dfs)
hp_dfs[0:1]

In [None]:
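from spacy.matcher import Matcher
from spacy.util import filter_spans

def build_spells_data(df, column_name):
    # Build (text, {"entities": [(start, end, "SPELL"), ...]}) tuples for rows that contain spells
    spell_nlp = spacy.blank("en")
    data_list = []
    for index, row in df[df.identified_spells.str.len() > 1].iterrows():
        # Use a fresh matcher per row so patterns from earlier rows do not accumulate
        matcher = Matcher(spell_nlp.vocab)
        for spell in row.identified_spells.split(','):
            spell = spell.strip()
            if spell:
                matcher.add(spell, [[{"TEXT": sp} for sp in spell.split()]])
        doc = spell_nlp(row[column_name])
        # filter_spans drops overlapping matches, keeping the longest span
        spans = filter_spans(matcher(doc, as_spans=True))
        entities = [(span.start_char, span.end_char, "SPELL") for span in spans]
        if entities:
            data_list.append((row[column_name], {"entities": entities}))
    return data_list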

Ok so we have spells within our scripts! Our next step is to extract this data as training data. Let's adapt our code from above here.
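
Training examples pair a text with character offsets for each entity, in the same `(text, {"entities": [...]})` shape used for `evaluation_data` earlier. A minimal sketch of producing that shape for spells, assuming simple case-insensitive string matching; `find_spell_offsets` is a hypothetical helper and the sentence is made up:

```python
import re

def find_spell_offsets(text, spells, label="SPELL"):
    """Return (text, {"entities": [(start, end, label), ...]}) for known spells."""
    entities = []
    # Try longer spell names first so a multi-word spell wins over a shorter one inside it
    for spell in sorted(spells, key=len, reverse=True):
        pattern = r"\b" + re.escape(spell) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.span()
            # Skip matches that overlap an entity we already recorded
            if all(end <= s or start >= e for s, e, _ in entities):
                entities.append((start, end, label))
    return text, {"entities": sorted(entities)}

example = find_spell_offsets(
    "Wingardium Leviosa! No, Ron, it's wingardium leviosa.",
    ["Wingardium Leviosa", "Lumos"],
)
print(example)
```

The offsets index into the original string, so `text[start:end]` should always give back the spell as it was written.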

In [50]:
#### TRAINING DATA CODE
training_data = []
spell_nlp = spacy.blank("en")

# Patterns are added to an entity_ruler pipe; the nlp object itself has no add_patterns method.
# An untrained "ner" pipe is left out here because it would need to be initialized before it can run.
spell_ruler = spell_nlp.add_pipe("entity_ruler")

spells_df.Incantation = spells_df.Incantation.fillna("")
spells = spells_df.Incantation.unique().tolist()
list_spells = []
for spell in spells:
    split_spells = spell.split()
    if not split_spells:
        continue  # skip empty incantations, which would produce an empty pattern
    patterns = [{"LOWER": sp.lower()} for sp in split_spells]
    spell_dict = {"label": "SPELL", "pattern": patterns, "id": '-'.join(split_spells)}
    list_spells.append(spell_dict)
spell_ruler.add_patterns(list_spells)

def training_spells(row):
    # Tag spells sentence by sentence and keep sentences with at least one match
    sentences = nltk.sent_tokenize(row.Sentence.lower())
    for sentence in sentences:
        doc = spell_nlp(sentence)
        list_entities = []
        for ent in doc.ents:
            list_entities.append([ent.start_char, ent.end_char, ent.label_])
        if len(list_entities) > 0:
            entry = (sentence, {"entities": list_entities})
            training_data.append(entry)
    return row
films_spells[films_spells.identified_spells.str.len() > 1].apply(training_spells, axis=1)
print(training_data)

In [None]:
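training_data = build_spells_data(films_spells, 'Sentence')
# NOTE: hp_spells is not defined above; it is assumed to be hp_dfs with an identified_spells column added
evaluation_data = build_spells_data(hp_spells, 'dialog')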

In [None]:
import nltk
def build_person_data(df, column_name):
    # Tag PERSON entities with the rule-based blank_nlp pipeline, one sentence at a time
    data_list = []
    for index, row in df.iterrows():
        sentences = nltk.sent_tokenize(row[column_name])
        for sentence in sentences:
            doc = blank_nlp(sentence)
            list_entities = []
            for ent in doc.ents:
                list_entities.append([ent.start_char, ent.end_char, ent.label_])
            # Keep only sentences that contain at least one entity
            if len(list_entities) > 0:
                entry = (sentence, {"entities": list_entities})
                data_list.append(entry)
    return data_list

In [None]:
ppl_train_data = build_person_data(films_df, 'Sentence')
ppl_valid_data = build_person_data(hp_dfs[hp_dfs.dialog.isna() == False], 'dialog')

In [None]:
from spacy.tokens import DocBin

def build_spacy_data(data, file_name):

    nlp = spacy.blank("en") # load a new spacy model
    db = DocBin() # create a DocBin object

    for text, annot in data: # data in previous format
        doc = nlp.make_doc(text) # create doc object from text
        ents = []
        for entity in annot["entities"]:
            start = entity[0]
            end= entity[1]
            label = entity[2]
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents # label the text with the ents
        db.add(doc)
    db.to_disk(f"./{file_name}.spacy")
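
`doc.char_span` can return `None` when offsets cannot be aligned with token boundaries, and assigning overlapping spans to `doc.ents` raises an error, so it can help to check annotations before converting them. A minimal sketch in plain Python; `validate_annotations` is a hypothetical helper, the sample data is made up, and the check covers only bounds and overlaps, not token alignment:

```python
def validate_annotations(data):
    """Split (text, annotations) pairs into clean ones and (index, reason) problems."""
    clean, problems = [], []
    for i, (text, annot) in enumerate(data):
        # Sort spans by start offset so overlaps show up as neighbours
        spans = sorted(tuple(ent) for ent in annot["entities"])
        reason = None
        previous_end = 0
        for start, end, label in spans:
            if not (0 <= start < end <= len(text)):
                reason = f"span ({start}, {end}) is outside the text"
            elif start < previous_end:
                reason = f"span ({start}, {end}) overlaps an earlier span"
            if reason:
                break
            previous_end = end
        if reason:
            problems.append((i, reason))
        else:
            clean.append((text, {"entities": spans}))
    return clean, problems

sample = [
    ("harry met ron", {"entities": [[0, 5, "PERSON"], [10, 13, "PERSON"]]}),
    ("harry potter", {"entities": [[0, 12, "PERSON"], [0, 5, "PERSON"]]}),
    ("ron", {"entities": [[0, 9, "PERSON"]]}),
]
clean, problems = validate_annotations(sample)
print(len(clean), problems)
```

Only the clean pairs would then be passed on to `build_spacy_data`.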

In [None]:
build_spacy_data(ppl_train_data, 'train_person')
build_spacy_data(ppl_valid_data, 'valid_person')

In [None]:
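from spacy.matcher import Matcher

# Build a token matcher from the spell patterns so we can inspect the raw matches
matcher = Matcher(spell_nlp.vocab)
for spell_rule in list_spells:
    if spell_rule["pattern"]:  # an empty pattern cannot be added to a Matcher
        matcher.add(spell_rule["id"], [spell_rule["pattern"]])

def training_spells(row):
    # Print every spell match found in one line of dialogue
    doc = spell_nlp.make_doc(row.Sentence)  # tokenization is all the matcher needs
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = spell_nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)
    return row

films_spells[films_spells.identified_spells.str.len() > 1][0:15].apply(training_spells, axis=1)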

In [None]:
model_best = spacy.load('./output/model-best')
model_best.pipe_names

In [None]:
test_text = "Ron Weasley: Wingardium Leviosa! Hermione Granger: You're saying it wrong. It's Wing-gar-dium Levi-o-sa, make the 'gar' nice and long. Ron Weasley: You do it, then, if you're so clever"

test_text = """53. Imperio - Makes target obey every command But only for really, really funny pranks. 52. Piertotum Locomotor - Animates statues On one hand, this is awesome. On the other, someone would use this to scare me.

51. Aparecium - Make invisible ink appear

Your notes will be so much cooler.

50. Defodio - Carves through stone and steel

Sometimes you need to get the eff out of there.

49. Descendo - Moves objects downward

You'll never have to get a chair to reach for stuff again.

48. Specialis Revelio - Reveals hidden magical properties in an object

I want to know what I'm eating and if it's magical.

47. Meteolojinx Recanto - Ends effects of weather spells

Otherwise, someone could make it sleet in your bedroom forever.

46. Cave Inimicum/Protego Totalum - Strengthens an area's defenses

Helpful, but why are people trying to break into your campsite?

45. Impedimenta - Freezes someone advancing toward you

"Stop running at me! But also, why are you running at me?"

44. Obscuro - Blindfolds target

Finally, we don't have to rely on "No peeking."

43. Reducto - Explodes object

The "raddest" of all spells.

42. Anapneo - Clears someone's airway

This could save a life, but hopefully you won't need it.

41. Locomotor Mortis - Leg-lock curse

Good for footraces and Southwest Airlines flights.

40. Geminio - Creates temporary, worthless duplicate of any object

You could finally live your dream of lying on a bed of marshmallows, and you'd only need one to start.

39. Aguamenti - Shoot water from wand

No need to replace that fire extinguisher you never bought.

38. Avada Kedavra - The Killing Curse

One word: bugs.

37. Repelo Muggletum - Repels Muggles

Sounds elitist, but seriously, Muggles ruin everything. Take it from me, a Muggle.

36. Stupefy - Stuns target

Since this is every other word of the "Deathly Hallows" script, I think it's pretty useful."""
doc = model_best(test_text)
for token in doc.ents:
    print('best', token.text, token.label_)

# doc = full_nlp(test_text)
# for token in doc.ents:
#     print('full', token.text, token.label_)

Once we have that working, let's take a look at the spaCy docs for training a model: https://spacy.io/usage/training#ner

> When training a model, we don’t just want it to memorize our examples – we want it to come up with a theory that can be generalized across unseen data. After all, we don’t just want the model to learn that this one instance of “Amazon” right here is a company – we want it to learn that “Amazon”, in contexts like this, is most likely a company. That’s why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.
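
One rough way to see whether an evaluation set tests generalization rather than memorization is to count how many of its entity strings never appear in the training set. A minimal sketch using hand-made data in the `(text, {"entities": ...})` format; `entity_strings` and `unseen_ratio` are hypothetical helpers:

```python
def entity_strings(data):
    """Collect the lowercased surface text of every annotated entity."""
    found = set()
    for text, annot in data:
        for start, end, _label in annot["entities"]:
            found.add(text[start:end].lower())
    return found

def unseen_ratio(train, evaluation):
    """Share of evaluation entity strings that never occur in training."""
    train_ents = entity_strings(train)
    eval_ents = entity_strings(evaluation)
    if not eval_ents:
        return 0.0
    return len(eval_ents - train_ents) / len(eval_ents)

# Made-up examples: "accio" is seen in training, "stupefy" is not
train = [("Accio broom!", {"entities": [(0, 5, "SPELL")]})]
evaluation = [
    ("Accio wand!", {"entities": [(0, 5, "SPELL")]}),
    ("He shouted Stupefy.", {"entities": [(11, 18, "SPELL")]}),
]
print(unseen_ratio(train, evaluation))
```

A ratio near zero suggests the evaluation set mostly repeats entities seen in training, so high scores there may say little about unseen spells.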

*Annotation Tools*

https://bohemian.ai/blog/text-annotation-tools-which-one-pick-2020/

We hopefully now have our base config file (`base_config.cfg`). Before we run it, we need to save our corpora and let spaCy know where they live.

Let's follow the instructions in this blog post https://towardsdatascience.com/using-spacy-3-0-to-build-a-custom-ner-model-c9256bea098 and this one https://medium.com/analytics-vidhya/custom-named-entity-recognition-ner-model-with-spacy-3-in-four-steps-7e903688d51

In [None]:
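from spacy.tokens import DocBin
from tqdm import tqdm

# TRAIN_DATA is the name used in the blog posts; it should hold (text, {"entities": [...]}) tuples such as training_data
nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in tqdm(TRAIN_DATA): # data in previous format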
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy")

In [None]:
# !python -m spacy init fill-config base_config.cfg config.cfg
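
Once `config.cfg` exists, training is started with `python -m spacy train`, passing the corpus locations as `--paths.train` and `--paths.dev` overrides. A minimal sketch that only assembles the command; `spacy_train_command` is a hypothetical helper, the `.spacy` file names assume the person corpora written by `build_spacy_data` earlier, and `./output` assumes the directory that `model-best` is loaded from:

```python
import sys

def spacy_train_command(config="config.cfg", output="./output",
                        train="./train_person.spacy", dev="./valid_person.spacy"):
    """Assemble the argument list for `python -m spacy train` without running it."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output,
        "--paths.train", train,
        "--paths.dev", dev,
    ]

command = spacy_train_command()
print(" ".join(command))
# To actually train, pass the list to subprocess.run(command, check=True)
```

Keeping the paths as command-line overrides means the same `config.cfg` can be reused for the spell corpora by changing only the two file names.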

https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting

https://en.wikipedia.org/wiki/Catastrophic_interference
