## Introduction to Machine Learning and NER

Today's Goals/Agenda:
- Review homework exercises and deeper dive into spaCy
- Discussion of how we evaluate NER models and libraries
- Building and training custom NER models

Let's start with our homework from Monday 🥳!!

In [None]:
# # Download Spacy in Binder
# !pip install -U pip setuptools wheel
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

In [None]:
# # Download Spacy if running locally
# import sys
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m spacy download en_core_web_sm

In [3]:
import pandas as pd
import spacy

Here's all our code from Monday compiled into one cell 😳👇🏽

In [71]:
# Load in our data
chars_df = pd.read_csv('./archive/Characters.csv', delimiter=';')
chars_df['split_names'] = chars_df.Name.str.split(' ')
film1_df = pd.read_csv('./archive/Harry Potter 1.csv', delimiter=';')
# Load in our spaCy model
nlp = spacy.load('en_core_web_sm')

def find_entities(row):
    # Find character names from chars_df
    character_names = chars_df.split_names.tolist()
    identified_names = []
    for names in character_names:
        if any(name in row.Sentence for name in names):
            identified_names.append(' '.join(names))
    row['identified_names'] = identified_names
    return row

film1_entities = film1_df.apply(find_entities, axis=1)
film1_entities = film1_entities[film1_entities.identified_names.astype(bool)]
film1_exploded = film1_entities.explode('identified_names')

film1_exploded['sentence_lower'] = film1_exploded.Sentence.str.lower()
chars_df['name_lower'] = chars_df.Name.str.lower()

# Build our first rules-based NER model with spaCy
chars_df['first_name'] = chars_df.split_names.str[0].str.lower()
chars_df['last_name'] = chars_df.split_names.str[-1].str.lower()

first_names = chars_df.first_name.unique().tolist()
last_names = chars_df.last_name.unique().tolist()

names = list(set(first_names) | set(last_names))
ruler = nlp.add_pipe("entity_ruler")
list_names = [{"label": "PERSON", "pattern": f"{name}"} for name in names]

ruler.add_patterns(list_names)

def find_spacy_entities(row):
    # code goes here
    spacy_sentence = nlp(row.Sentence.lower())
    
    list_tokens = []
    list_entities = []
    for token in spacy_sentence.ents:
        list_tokens.append(token.text)
        list_entities.append(token.label_)
    row['spacy_tokens'] = list_tokens
    row['spacy_entities'] = list_entities
    return row

film1_df = film1_df.apply(find_spacy_entities, axis=1)

Let's review the first option for the homework (don't worry, we'll get to the second dataset and option later).

For the Harry Potter dataset, the steps were:
- Load in the other scripts and join them with our first film
- Rerun our code for identifying characters (NER algorithm and spaCy model)

If you get through that quickly:
- Try improving our code to produce better character results (maybe get spaCy working with bigrams?)
- Try expanding our custom named entity patterns to also include other entities, such as locations like Hogwarts and Diagon Alley

In [72]:
# Load film datasets
film1_df = pd.read_csv('./archive/Harry Potter 1.csv', delimiter=';')
film2_df = pd.read_csv('./archive/Harry Potter 2.csv', delimiter=';')
film3_df = pd.read_csv('./archive/Harry Potter 3.csv', delimiter=';')

In [73]:
# Combine film dataframes
film3_df.columns = map(str.capitalize, film3_df.columns)
film1_df['movie_number'] = 'film 1'
film2_df['movie_number'] = 'film 2'
film3_df['movie_number'] = 'film 3'
films_df = pd.concat([film1_df, film2_df, film3_df])
films_df.head()

Unnamed: 0,Character,Sentence,movie_number
0,Dumbledore,"I should've known that you would be here, Prof...",film 1
1,McGonagall,"Good evening, Professor Dumbledore.",film 1
2,McGonagall,"Are the rumors true, Albus?",film 1
3,Dumbledore,"I'm afraid so, professor.",film 1
4,Dumbledore,The good and the bad.,film 1


In [74]:
# Identify Character Names for full name datasets using our original functions

films_entities = films_df.apply(find_entities, axis=1)

films_spacy = films_df.apply(find_spacy_entities, axis=1)

In [75]:
# Merge our NER algorithm identified names with those identified by spaCy. We could now remake our sankey graph from Monday using https://rawgraphs.io/.

identified_names = films_entities.explode('identified_names')
identified_names = identified_names[['Character', 'identified_names', 'movie_number']]
spacy_names = films_spacy.set_index(['Character', 'Sentence', 'movie_number']).apply(pd.Series.explode).reset_index()
spacy_names = spacy_names[spacy_names.spacy_entities == 'PERSON'][['Character', 'spacy_tokens', 'movie_number']]
merged_names = pd.merge(spacy_names, identified_names, on=['Character', 'movie_number'], how='outer')
merged_names = merged_names.fillna('')
merged_names.head()

Unnamed: 0,Character,spacy_tokens,movie_number,identified_names
0,Dumbledore,mcgonagall,film 1,Minerva McGonagall
1,Dumbledore,mcgonagall,film 1,
2,Dumbledore,mcgonagall,film 1,
3,Dumbledore,mcgonagall,film 1,Rubeus Hagrid
4,Dumbledore,mcgonagall,film 1,Rubeus Hagrid


Just to review: what we've done above is extend spaCy with rules-based NER. From William Mattingly's textbook:
> Rules-based NER is the method by which an NLP practitioner either creates or utalizes an NLP system that has a predefined set of instructions, or rules, to perform certain NLP tasks. For NER, this often times means using what is known as a gazetteer. A gazetteer is a list, or dictionary, of entities that align with a specific label. In the case of people, this would be a list of first and last names. If you are developing an NER for a specific region, as we will in a later notebook, this may be a list of all locations in that region.

(Just FYI, the code in William's textbook will not work with the latest 3.0 version of spaCy; more info here: https://spacy.io/usage/v3#migrating)

Rules-based NER works well when we have a set number of patterns and entities we want to identify within a text. It works particularly well if you're using a gazetteer or a pre-defined list of entities (and not so well if you think your text data will have a lot of variation - whether errors or otherwise - or a lot of overlapping meaning between words).
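As a rough illustration of the gazetteer idea, independent of spaCy, the sketch below matches a small hand-written list of names against a sentence using only the standard library. The `gazetteer` contents and the `find_gazetteer_entities` helper are made up for this example and are not part of the workshop data. It also uses regex word boundaries, which is a different technique from the substring check in `find_entities` above.

```python
import re

# Hypothetical mini-gazetteer: surface form -> label (not the workshop data)
gazetteer = {
    "harry": "PERSON",
    "ron": "PERSON",
    "hogwarts": "LOC",
}

def find_gazetteer_entities(text, gazetteer):
    """Return (matched_text, label) pairs for whole-word, case-insensitive matches."""
    found = []
    for name, label in gazetteer.items():
        # \b keeps "ron" from matching inside a longer word such as "wrong"
        pattern = re.compile(rf"\b{re.escape(name)}\b", flags=re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), label))
    # Sort by position in the text, then drop the offsets
    return [(matched, label) for _, matched, label in sorted(found)]

sentence = "Harry was wrong about Ron, and Hogwarts knew it."
print(find_gazetteer_entities(sentence, gazetteer))
```

A plain substring test can also fire inside longer words, which is one reason a rules-based approach struggles when the text has a lot of variation.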

Our code from above:
```python
ruler = nlp.add_pipe("entity_ruler")
list_names = [{"label": "PERSON", "pattern": f"{name}"} for name in names]

ruler.add_patterns(list_names)
```
is using spaCy's **EntityRuler** to add a new component to the pipeline of our NER model. According to the spaCy documentation https://spacy.io/usage/rule-based-matching#entityruler:

> The EntityRuler is a component that lets you add named entities based on pattern dictionaries, which makes it easy to combine rule-based and statistical named entity recognition for even more powerful pipelines.

The statistical part of our pipeline is the original spaCy model (which in our current code is the small English web model).

According to that same documentation:

> The entity ruler is designed to integrate with spaCy’s existing pipeline components and enhance the named entity recognizer. If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To overwrite overlapping entities, you can set overwrite_ents=True on initialization.

With this in mind, let's try improving on our initial output.

One of the first things we can do is try and use both full names and individual names.

In [76]:
import numpy as np

chars_df['full_names'] = np.where(
    chars_df.split_names.str.len() == 2, 
    chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[1].str.lower(), 
    np.where(
        chars_df.split_names.str.len() > 2,
        chars_df.split_names.str[0].str.lower() + ' ' + chars_df.split_names.str[-1].str.lower(),
        chars_df.split_names.str[0].str.lower())) 

full_names = chars_df.full_names.unique().tolist()
first_names = chars_df.first_name.unique().tolist()
last_names = chars_df.last_name.unique().tolist()

So now that we have our full names we can update our model. Let's follow this example in the spaCy documentation https://spacy.io/usage/rule-based-matching#entityruler-ent-ids.

In [77]:
# Get unique names and create our rules
names = list(set(first_names) | set(last_names))
unique_names = list(set(names) | set(full_names))
list_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name}"}], "id": f"{name}"} for name in unique_names if len(name.split(' ')) == 1]

list_full_names = [{"label": "PERSON", "pattern": [{"LOWER": f"{name.split(' ')[0]}"}, {"LOWER": f"{name.split(' ')[1]}"}], "id": f"{'-'.join(name.split(' '))}"} for name in full_names if len(name.split(' ')) > 1]

all_names = list_names + list_full_names
print('example rule', all_names[0])

# Load our models and pass our rules
full_nlp = spacy.load("en_core_web_sm")
ruler = full_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)

blank_nlp = spacy.blank("en")
ruler = blank_nlp.add_pipe("entity_ruler")
ruler.add_patterns(all_names)

example rule {'label': 'PERSON', 'pattern': [{'LOWER': 'binns'}], 'id': 'binns'}


We've created two separate models above. One builds on our initial spaCy model (the small English web version) and the other on a blank spaCy model. We've added the same type of EntityRuler and the same rules to each model, so now we can compare them!

In [11]:
def compare_spacy_models(row):
    # Read in text data
    spacy_full = full_nlp(row.Sentence)
    spacy_blank = blank_nlp(row.Sentence)
    list_tokens = []
    list_entities = []
    for token in spacy_full.ents:
        list_tokens.append(token.text)
        list_entities.append(token.label_)
    row['spacy_full_tokens'] = ', '.join(list_tokens)
    row['spacy_full_entities'] = ', '.join(list_entities)

    list_tokens = []
    list_entities = []
    for token in spacy_blank.ents:
        list_tokens.append(token.text)
        list_entities.append(token.label_)
    row['spacy_blank_tokens'] = ', '.join(list_tokens)
    row['spacy_blank_entities'] = ', '.join(list_entities)
    return row

spacy_films = films_df.apply(compare_spacy_models, axis=1)

Let's see which sentences had different entities identified by our two spaCy models.

In [12]:
spacy_subset = spacy_films[(spacy_films.spacy_full_tokens.str.len() > 1) & (spacy_films.spacy_blank_tokens.str.len() > 1)]
spacy_subset[spacy_subset.spacy_full_tokens != spacy_subset.spacy_blank_tokens]

Unnamed: 0,Character,Sentence,movie_number,spacy_full_tokens,spacy_full_entities,spacy_blank_tokens,spacy_blank_entities
1,McGonagall,"Good evening, Professor Dumbledore.",film 1,"evening, Dumbledore","TIME, PERSON",Dumbledore,PERSON
36,Harry,"Yes, Aunt Petunia.",film 1,Aunt Petunia,LOC,Petunia,PERSON
39,Harry,"Yes, Uncle Vernon.",film 1,Uncle Vernon,PERSON,Vernon,PERSON
107,Dudley,"Dad, look! Harry's got a letter!",film 1,"Dad, Harry","PERSON, PERSON",Harry,PERSON
120,Vernon,Not one single bloody letter. Not one!,film 1,"one, bloody","CARDINAL, PERSON",bloody,PERSON
...,...,...,...,...,...,...,...
1352,HERMIONE,"Listen Harry, they've captured Sirius.",film 3,"Listen Harry, Sirius","PERSON, PERSON","Harry, Sirius","PERSON, PERSON"
1374,DUMBLEDORE,Sirius Black is in the topmost cell of the Dar...,film 3,"Sirius Black, the Dark Tower","PERSON, FAC",Sirius Black,PERSON
1399,HERMIONE,"This is a Time-Turner, Harry.",film 3,"a Time-Turner, Harry","ORG, PERSON",Harry,PERSON
1400,HERMIONE,McGonagall gave it to me first term.,film 3,"McGonagall, first","PERSON, ORDINAL",McGonagall,PERSON


While we're getting some differing results, if we added a condition to our function (`if token.label_ == 'PERSON'`) we would get identical results. On one level this makes sense, since we passed in identical rules. But these identical results also have to do with our entities being fairly unique (names like Hermione don't have many other meanings, for example).

One cautionary note is that this might not be true for your data. So let's test an example where we might not get identical results.

For example, if we added this label to our code above for our rules:
```python
scotland = {"label": "PERSON", "pattern": [{"LOWER": "scotland"}], "id":"scotland"}
list_names.append(scotland)
```
And then ran our models again with the term `Scotland`:
```python
scotland_full = full_nlp("Scotland")
scotland_blank = blank_nlp("Scotland")
for ent in scotland_full.ents:
    print("full", ent.text, ent.label_)
for ent in scotland_blank.ents:
    print("blank", ent.text, ent.label_)
```
We would actually get two separate entity types (`GPE` or `PERSON` for the full and blank models respectively). 

This happens because when we add our rules to the existing model, we are appending them to the end of a pipeline whose `ner` component has already labeled `Scotland` as an entity. We can either overwrite these entities (passing in `full_nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})`) or tell spaCy to put our rules before the `ner` component (`full_nlp.add_pipe("entity_ruler", before="ner")`).
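The effect of the overwrite setting can be sketched without downloading a trained model. In the example below, which is an illustration and not the workshop code, a first `entity_ruler` (given the made-up name `stand_in_ner`) plays the role of the statistical `ner` component by labelling `Scotland` as `GPE`, and a second ruler (`our_ruler`, also a made-up name) carries our `PERSON` rule. It assumes spaCy v3, where `overwrite_ents` is the config key of the `entity_ruler` factory.

```python
import spacy

def build_pipeline(overwrite):
    """Blank English pipeline with two rulers; the second one may overwrite the first."""
    nlp = spacy.blank("en")
    # Stand-in for a trained "ner" component (hypothetical name, for illustration only)
    first = nlp.add_pipe("entity_ruler", name="stand_in_ner")
    first.add_patterns([{"label": "GPE", "pattern": [{"LOWER": "scotland"}]}])
    # Our own rule, added afterwards, as in the notebook
    second = nlp.add_pipe(
        "entity_ruler", name="our_ruler", config={"overwrite_ents": overwrite}
    )
    second.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "scotland"}]}])
    return nlp

for overwrite in (False, True):
    doc = build_pipeline(overwrite)("Scotland")
    print(overwrite, [(ent.text, ent.label_) for ent in doc.ents])
```

With `overwrite_ents` left at `False`, the earlier `GPE` label should survive. With `True`, our `PERSON` rule should replace it.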

***

#### Evaluating spaCy Models

So far we've been eyeballing our results but we can actually use spaCy to score each of our models. Let's try reworking our code in our `compare_spacy_models` function. We'll be using spaCy's Scorer https://spacy.io/api/scorer and Example https://spacy.io/api/example classes to evaluate our models.

First we need to extract our text data and entities in the correct format (like in this StackOverflow post https://stackoverflow.com/questions/66637485/spacy-3-0-1-accuracy-prediction). We can also refer to William's textbook http://ner.pythonhumanities.com/04_02_create_ner_training_set.html though we are using this syntax to evaluate, not create training data at the moment.

The overall syntax we need is a list of tuples containing the original sentence, and then the identified entities within it, along with their start, end, and label.

```python
test_data = [
    ("Trump says he's answered Mueller's Russia inquiry questions \u2013 live",
     {"entities": [[0, 5, "PERSON"], [25, 32, "PERSON"], [35, 41, "GPE"]]}),
    ("Alexander Zverev reaches ATP Finals semis then reminds Lendl who is boss",
     {"entities": [[0, 16, "PERSON"], [55, 60, "PERSON"]]}),
    ("Britain's worst landlord to take nine years to pay off string of fines",
     {"entities": [[0, 7, "GPE"]]}),
    ("Tom Watson: people's vote more likely given weakness of May's position",
     {"entities": [[0, 10, "PERSON"], [56, 59, "PERSON"]]}),
]
```
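Because the offsets are plain character positions, a quick way to sanity-check data in this format is to slice each sentence with its `start` and `end` values and look at what comes back. The helper below is a minimal sketch. `check_offsets` and the `sample` entries are made up for illustration and are not part of spaCy.

```python
def check_offsets(data):
    """Yield (sentence, start, end, label, span_text) for every annotated entity."""
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            if not 0 <= start < end <= len(text):
                raise ValueError(f"Offsets {start}-{end} fall outside: {text!r}")
            yield text, start, end, label, text[start:end]

# Hypothetical sample in the same (sentence, {"entities": [...]}) format
sample = [
    ("Good evening, Professor Dumbledore.", {"entities": [[24, 34, "PERSON"]]}),
    ("Are the rumors true, Albus?", {"entities": [[21, 26, "PERSON"]]}),
]

for _, start, end, label, span_text in check_offsets(sample):
    print(start, end, label, repr(span_text))
```

If a slice comes back with a stray space or half a word, the offsets are off by a character or more, and the annotation no longer matches the entity it is supposed to mark.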
So how might we extract this information using our original function as an example?

In [13]:
import nltk
evaluation_data = []
def evaluate_spacy_models(row):
    # Let's also add in blank_nlp
    sentences = nltk.sent_tokenize(row.Sentence.lower())
    for sentence in sentences:
        spacy_full = full_nlp(sentence)
        list_entities = []
        for token in spacy_full.ents:
            list_entities.append([token.start_char, token.end_char, token.label_])
        if len(list_entities) > 0:
            entry = (sentence,{"entities": list_entities})
            evaluation_data.append(entry)
    return row
films_df.apply(evaluate_spacy_models, axis=1)

Unnamed: 0,Character,Sentence,movie_number
0,Dumbledore,"I should've known that you would be here, Prof...",film 1
1,McGonagall,"Good evening, Professor Dumbledore.",film 1
2,McGonagall,"Are the rumors true, Albus?",film 1
3,Dumbledore,"I'm afraid so, professor.",film 1
4,Dumbledore,The good and the bad.,film 1
...,...,...,...
1633,HERMIONE,"How fast is it, Harry?",film 3
1634,HARRY,Lumos.,film 3
1635,HARRY,I solemnly swear that I am up to no good.,film 3
1636,HARRY,Mischief managed.,film 3


Notice here that we are taking all our identified entities as evaluation data. We could however do something like a train-test-split here, but since we want to evaluate how well our model has learned our entities this approach of taking all entities works well enough.
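For reference, a train-test split over data in this format can be sketched with the standard library alone. The `train_test_split` helper, its default values, and the placeholder entries below are assumptions for illustration (scikit-learn offers a function of the same name, which this sketch does not use).

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of `data` and split it into (train, test) lists."""
    shuffled = list(data)
    # A seeded Random instance keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction)) if shuffled else 0
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder entries standing in for evaluation_data
placeholder_data = [(f"sentence {i}", {"entities": []}) for i in range(10)]
train_data, test_data = train_test_split(placeholder_data)
print(len(train_data), len(test_data))
```

A split like this is most informative when the held-out labels have been checked by hand, rather than produced by the model being scored.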

Let's try evaluating our model using the following code, which I adapted from this GitHub discussion https://github.com/explosion/spaCy/discussions/8178#discussioncomment-781241. It's worth clicking through and taking a look at spaCy's discussion section on GitHub.

In [14]:
from spacy.scorer import Scorer
from spacy.training import Example

# evaluate function
def evaluate(ner_model, testing_data):
    scorer = Scorer()
    examples = []
    for input_, annot in testing_data:
        doc_gold_text = ner_model.make_doc(input_)
        example = Example.from_dict(doc_gold_text, annot)
        example.predicted = ner_model(input_)
        examples.append(example)
        
    print(scorer.score(examples))

# print the results
evaluate(full_nlp, evaluation_data)

{'token_acc': 1.0, 'token_p': 1.0, 'token_r': 1.0, 'token_f': 1.0, 'sents_p': 1.0, 'sents_r': 1.0, 'sents_f': 1.0, 'tag_acc': None, 'pos_acc': None, 'morph_acc': None, 'morph_per_feat': None, 'dep_uas': None, 'dep_las': None, 'dep_las_per_type': None, 'ents_p': 1.0, 'ents_r': 1.0, 'ents_f': 1.0, 'ents_per_type': {'PERSON': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'CARDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'DATE': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'TIME': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'ORDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'NORP': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'QUANTITY': {'p': 1.0, 'r': 1.0, 'f': 1.0}}, 'cats_score': 0.0, 'cats_score_desc': 'macro F', 'cats_micro_p': 0.0, 'cats_micro_r': 0.0, 'cats_micro_f': 0.0, 'cats_macro_p': 0.0, 'cats_macro_r': 0.0, 'cats_macro_f': 0.0, 'cats_macro_auc': 0.0, 'cats_f_per_type': {}, 'cats_auc_per_type': {}}


Running the full model first, we see that we get back a dictionary with lots of values. We can check out the documentation for the Scorer https://spacy.io/api/scorer#score (though it's not particularly helpful in my opinion). Let's take a look at `'ents_per_type'`.

`'ents_per_type': {'PERSON': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'CARDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'DATE': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'TIME': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'ORDINAL': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'NORP': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'QUANTITY': {'p': 1.0, 'r': 1.0, 'f': 1.0}}`

This part of the scorer is actually giving us precision, recall, and f1 scores for each of the NER labels in this model.
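To make those three numbers concrete, here is a simplified, exact-match version of the calculation in plain Python. It is a sketch of the general definitions, not spaCy's `Scorer` implementation, and the spans are made up for one sentence.

```python
def precision_recall_f1(gold, predicted):
    """Score exact-match entity spans; each span is a (start, end, label) tuple."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up spans for "Good evening, Professor Dumbledore."
gold_spans = [(24, 34, "PERSON")]
predicted_spans = [(5, 12, "TIME"), (24, 34, "PERSON")]
print(precision_recall_f1(gold_spans, predicted_spans))
```

Here one of the two predictions is correct (precision 0.5) and the single gold entity is found (recall 1.0). When the gold and predicted sets are identical, all three scores come out as 1.0.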

We could also try running this with the blank_nlp model. What do you expect we would get instead?

So we can see that the model seems to be assessing our NER labels accurately, but keep in mind that our evaluation data was generated by this same model, so perfect scores are expected. It's difficult to tell how well it really performs without doing a train-test-split (don't worry, we'll try that later on today).

An easier way to assess our model is to feed it new text that it hasn't seen before. Below I've copied a section from the Harry Potter Wikipedia page. Let's look at the results!

In [15]:
# https://en.wikipedia.org/wiki/Harry_Potter
test_text = (
    "The series continues with Harry Potter and the Chamber of Secrets, describing Harry's second year at Hogwarts. He and his friends investigate a 50-year-old mystery that appears uncannily related to recent sinister events at the school. Ron's younger sister, Ginny Weasley, enrols in her first year at Hogwarts, and finds an old notebook in her belongings which turns out to be the diary of a previous student, Tom Marvolo Riddle, written during World War II. He is later revealed to be Voldemort's younger self, who is bent on ridding the school of 'mudbloods', a derogatory term describing wizards and witches of non-magical parentage. The memory of Tom Riddle resides inside of the diary and when Ginny begins to confide in the diary, Voldemort is able to possess her. Through the diary, Ginny acts on Voldemort's orders and unconsciously opens the 'Chamber of Secrets', unleashing an ancient monster, later revealed to be a basilisk, which begins attacking students at Hogwarts. It kills those who make direct eye contact with it and petrifies those who look at it indirectly. The book also introduces a new Defence Against the Dark Arts teacher, Gilderoy Lockhart, a highly cheerful, self-conceited wizard with a pretentious facade, later turning out to be a fraud. Harry discovers that prejudice exists in the Wizarding World through delving into the school's history, and learns that Voldemort's reign of terror was often directed at wizards and witches who were descended from Muggles. Harry also learns that his ability to speak the snake language Parseltongue is rare and often associated with the Dark Arts. When Hermione is attacked and petrified, Harry and Ron finally piece together the puzzles and unlock the Chamber of Secrets, with Harry destroying the diary for good and saving Ginny, and, as they learn later, also destroying a part of Voldemort's soul. "
    "The end of the book reveals Lucius Malfoy, Draco's father and rival of Ron and Ginny's father, to be the culprit who slipped the book into Ginny's belongings."
)

doc = blank_nlp(test_text)
for token in doc.ents:
    print('blank', token.text, token.label_)

doc = full_nlp(test_text)
for token in doc.ents:
    print('full', token.text, token.label_)

blank Harry Potter PERSON
blank Harry PERSON
blank Weasley PERSON
blank Tom PERSON
blank Riddle PERSON
blank Tom Riddle PERSON
blank Gilderoy Lockhart PERSON
blank Harry PERSON
blank Harry PERSON
blank Hermione PERSON
blank Harry PERSON
blank Harry PERSON
blank Lucius Malfoy PERSON
blank Draco PERSON
full Harry Potter PERSON
full the Chamber of Secrets ORG
full Harry PERSON
full second year DATE
full Hogwarts ORG
full 50-year-old DATE
full Ron PERSON
full Ginny Weasley PERSON
full her first year DATE
full Hogwarts ORG
full Tom Marvolo Riddle PERSON
full World War II EVENT
full Voldemort ORG
full Tom Riddle PERSON
full Ginny PERSON
full Voldemort ORG
full Ginny ORG
full Voldemort ORG
full Hogwarts PRODUCT
full Defence Against the Dark Arts ORG
full Gilderoy Lockhart PERSON
full Harry PERSON
full the Wizarding World LOC
full Voldemort PERSON
full Muggles PERSON
full Harry PERSON
full the Dark Arts ORG
full Hermione PERSON
full Harry PERSON
full Ron PERSON
full the Chamber of Secrets ORG


After running, you'll notice we are getting very different results 👀.

In particular, I was surprised to see that the blank_nlp model did not pick up Ron or Ginny as persons, unlike the full_nlp model. Looking in our chars_df for those two gives us some clues as to why.

```python
chars_df[chars_df.Name.str.contains('Ron|Ginny')]
```

But even with our data errors, our results also give us a sense of why you might want to use a spaCy model and add your labels, rather than train a blank one. That's because spaCy models are able to generalize patterns (which, to be fair, does give us some incorrect results) and capture both our rules (and prioritize them depending on how we build our pipeline) and entities we don't explicitly list out.

How does spaCy do this? Let's take a look inside the model to figure it out!

In [16]:
# To save spaCy models to disk, use the following syntax. Be careful though to not name it the same as a downloaded spaCy model like 'en_core_web_sm' because that will overwrite that model. You can read more here https://spacy.io/usage/saving-loading
blank_nlp.to_disk('blank_nlp')
full_nlp.to_disk('full_nlp')

In [106]:
# #Also while we are looking at those let's also download the larger spaCy English model
# !python -m spacy download en_core_web_lg

Let's open the meta.json in the `blank_nlp` folder we just created. It should look like this:
![blank_nlp](./images/blank_nlp.png)

This contains all the top-level information about our model. It's fairly sparse because this was our blank model, but towards the bottom you do see that this blank version contains our `entity_ruler` pipeline and our label of `PERSON`.

Let's compare to the `full_nlp` folder!

There's a lot going on here, so let's take a look at the spaCy documentation for the meta.json https://spacy.io/api/data-formats#meta. We can see that this contains information on how our model was trained (though since spaCy 3.0 this file no longer specifies how the model will be built).

In [112]:
# full_nlp.pipe_names
# full_nlp.pipe_labels

After clicking through the documentation, we can also click through the other folders. In `entity_ruler`, for example, you'll find a list of all the patterns that we added to the model. You can also take a look inside of `vocab`, which contains the `strings.json` file listing the strings stored in our model's vocabulary.

Notice that in both our models our `meta.json` has the following setting under `vectors`:

```json
"vectors":{
    "width":0,
    "vectors":0,
    "keys":0,
    "name":null
}
```

This is why we downloaded the large spaCy model, so let's save a version of it to disk (I will likely work locally for this since I doubt Github or Binder will appreciate a giant model 😅).


In [113]:
large_nlp = spacy.load('en_core_web_lg')
large_nlp.to_disk('large_nlp')

Let's inspect the vectors now in our larger model. In our `meta.json`, we should now see the following:

```json
"vectors":{
    "width":300,
    "vectors":684830,
    "keys":684830,
    "name":"en_vectors"
  }
```
From the documentation we can learn the following:
> Information about the word vectors included with the pipeline. Typically a dict with the keys "width", "vectors" (number of vectors), "keys" and "name".

Or visit the model documentation directly https://spacy.io/models/en#en_core_web_lg.

To break this down more: in this model we have a set of vectors named `en_vectors` (the name could differ depending on the model) with an identical number of keys and vectors and a width of 300. The keys and vectors are our tokens and their respective vectors, while the width tells us that each word vector has 300 dimensions 😳.

For those unfamiliar, word vectors are essentially the backbone of modern machine learning for text.

We can create word vectors by essentially representing the frequency of words in a corpus as counts. 

![vector](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/assets/atap_0402.png)

This image is from a great book, *Applied Text Analysis with Python* https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/, and it shows how you would start to build word counts into vectors.
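The counting idea can be sketched in a few lines of plain Python. The three-sentence `corpus` and the `count_vector` helper below are made up for illustration. The vectors in `en_core_web_lg` are learned and dense rather than raw counts, but the basic move of turning text into a fixed-length list of numbers is the same.

```python
from collections import Counter

# Tiny made-up corpus, one "document" per string
corpus = [
    "the wizard cast a spell",
    "the witch cast a charm",
    "the owl carried a letter",
]

# The sorted vocabulary fixes the order of the vector dimensions
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def count_vector(text, vocabulary):
    """Represent `text` as a list of word counts, one slot per vocabulary word."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
for doc in corpus:
    print(count_vector(doc, vocabulary))
```

Each document becomes a list with one slot per vocabulary word, so documents that share words end up with overlapping non-zero slots.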

The main assumption about how language works here comes from John Firth, *Modes of Meaning* (1957):

![firth](https://image.slidesharecdn.com/icsc2012distributionalpp2-120920083125-phpapp01/95/a-study-on-compositional-semantics-of-words-in-distributional-spaces-2-728.jpg?cb=1348129993)

Once we represent words as vectors, we can start using vector math to explore how words cluster together (for more on distance metrics, I would highly recommend this Programming Historian tutorial https://programminghistorian.org/en/lessons/common-similarity-measures#what-is-similarity-or-distance).

![word_vectors](https://miro.medium.com/max/1838/1*OEmWDt4eztOcm5pr2QbxfA.png)

These are some of the most famous examples of word vectors, and spaCy actually has functionality built in for you to try and find most_similar terms once we are using their vector models.
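One common piece of that vector math is cosine similarity, which compares the direction of two vectors. Below is a minimal sketch in plain Python. The three-dimensional `toy_vectors` are invented for illustration and are not taken from any spaCy model. In pipelines that ship with vectors, spaCy exposes a comparable measure through the `similarity` method on tokens, spans, and docs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "word vectors", for illustration only
toy_vectors = {
    "wizard": [0.9, 0.8, 0.1],
    "witch": [0.8, 0.9, 0.2],
    "broom": [0.1, 0.2, 0.9],
}

for other in ("witch", "broom"):
    score = cosine_similarity(toy_vectors["wizard"], toy_vectors[other])
    print(f"wizard vs {other}: {score:.3f}")
```

With these toy numbers, `wizard` scores closer to `witch` than to `broom`, which is the kind of clustering the image above illustrates.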

For a more zoomed-out explanation of what spaCy is doing, I highly recommend checking out these two Stack Overflow answers.

https://stackoverflow.com/questions/60381170/which-deep-learning-algorithm-does-spacy-uses-when-we-train-custom-model/60394246#60394246


https://stackoverflow.com/questions/44492430/how-does-spacy-use-word-embeddings-for-named-entity-recognition-ner


TIME FOR A BREAK ☕️

***

#### Custom NER Models

So far we've started reviewing our homework, doing rules-based NER, and then understanding how spaCy works under the hood.

In the remainder of our time, I want to spend some time discussing using non-English language models and training custom NER models. We may or may not get through everything today but let's see how it goes!

We could break out into breakout rooms now to discuss multilingual NER by language or time period. Or we can have a general discussion together.

One thing I'm curious about is how many people are working with non-English data? Also how many have trained custom models before?

I'll be honest here that none of my intro DH courses have been advanced enough to include lessons on custom NER models (though I have had students use them in an independent seminar). One big question that I would like us to keep in mind is how one decides whether to build a custom model from scratch or fine-tune an existing model.

Some of this tradeoff is technical, but a lot of it is also your personal project goals.

Let's take a look first at using different language models.

We can see that spaCy has quite a few language models already built-in https://spacy.io/usage/models#languages but I also know from Monday that some of you have already used `stanza` https://stanfordnlp.github.io/stanza/ner.html because of its language support.

One of our goals from the homework was trying out a new NER library so let's briefly test out stanza.

In [26]:
import stanza

In [27]:
stanza_nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = stanza_nlp(films_df[0:1].Sentence.values[0])
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

2021-06-30 12:23:31 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2021-06-30 12:23:31 INFO: Use device: cpu
2021-06-30 12:23:31 INFO: Loading: tokenize
2021-06-30 12:23:31 INFO: Loading: ner
2021-06-30 12:23:32 INFO: Done loading processors!


entity: Professor McGonagall	type: PERSON


Interestingly, stanza is recognizing both McGonagall and Professor as a person from the outset, which spaCy was unable to do.

Let's try running on our film dataset!

In [28]:
# THIS RUNS VERY SLOWLY!!!
def find_stanza_entities(row):
    # code goes here
    stanza_sentence = stanza_nlp(row.Sentence)
    
    list_tokens = []
    list_entities = []
    for token in stanza_sentence.ents:
        list_tokens.append(token.text)
        list_entities.append(token.type)
    row['stanza_tokens'] = list_tokens
    row['stanza_entities'] = list_entities
    return row
films_stanza = films_df[0:20].apply(find_stanza_entities, axis=1)

In [29]:
films_stanza

Unnamed: 0,Character,Sentence,movie_number,stanza_tokens,stanza_entities
0,Dumbledore,"I should've known that you would be here, Prof...",film 1,[Professor McGonagall],[PERSON]
1,McGonagall,"Good evening, Professor Dumbledore.",film 1,"[evening, Dumbledore]","[TIME, PERSON]"
2,McGonagall,"Are the rumors true, Albus?",film 1,[Albus],[PERSON]
3,Dumbledore,"I'm afraid so, professor.",film 1,[],[]
4,Dumbledore,The good and the bad.,film 1,[],[]
5,McGonagall,And the boy?,film 1,[],[]
6,Dumbledore,Hagrid is bringing him.,film 1,[Hagrid],[PERSON]
7,McGonagall,Do you think it wise to trust Hagrid with some...,film 1,[Hagrid],[PERSON]
8,Dumbledore,"Ah, Professor, I would trust Hagrid with my life.",film 1,[Hagrid],[PERSON]
9,Hagrid,"Professor Dumbledore, sir.",film 1,[Dumbledore],[PERSON]


While these are very promising results, especially considering spaCy needed to have additional rules to identify these names, the stanza implementation is very slow and also requires learning a new syntax. While we can't fix the speed of the library, we can use stanza models with spaCy https://github.com/explosion/spacy-stanza so let's try that!

In [57]:
# !pip install spacy_stanza

In [30]:
import spacy_stanza

spacy_stanza_nlp = spacy_stanza.load_pipeline("en")

2021-06-30 12:23:41 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| lemma     | combined  |
| depparse  | combined  |
| sentiment | sstplus   |
| ner       | ontonotes |

2021-06-30 12:23:41 INFO: Use device: cpu
2021-06-30 12:23:41 INFO: Loading: tokenize
2021-06-30 12:23:41 INFO: Loading: pos
2021-06-30 12:23:41 INFO: Loading: lemma
2021-06-30 12:23:41 INFO: Loading: depparse
2021-06-30 12:23:41 INFO: Loading: sentiment
2021-06-30 12:23:42 INFO: Loading: ner
2021-06-30 12:23:43 INFO: Done loading processors!


In [31]:
#STILL VERY SLOW!!
def find_spacy_stanza_entities(row):
    # code goes here
    stanza_sentence = spacy_stanza_nlp(row.Sentence)
    
    list_tokens = []
    list_entities = []
    for token in stanza_sentence.ents:
        list_tokens.append(token.text)
        list_entities.append(token.label_)
    row['stanza_tokens'] = list_tokens
    row['stanza_entities'] = list_entities
    return row
films_stanza = films_df[0:20].apply(find_spacy_stanza_entities, axis=1)



I did download both the French and English models for stanza, so one thing we could do now is try to answer our second homework option, working with French and English data.

We could also compare spaCy and stanza using our model evaluation code from above. But before we start evaluating, I want to discuss something called `The Bender Rule`.

![bender_rule](./images/bender_rule.png)

https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/

In [17]:
### HOMEWORK REVIEW CODE

Ok so hypothetically we've written the code above for our second homework option, let's now return to option one since we didn't quite finish it. 

Our final goal was to add in new entities! 

In [2]:
spells_df = pd.read_csv('./archive/Spells.csv', sep=";")
spells_df[0:2]

Unnamed: 0,Name,Incantation,Type,Effect,Light
0,Summoning Charm,Accio,Charm,Summons an object,
1,Age Line,Unknown,Charm,Prevents people above or below a certain age f...,Blue


Let's first check that spells exist in our films dataset.

In [93]:
def find_spells(row, column_name):
    # Keep non-missing incantations longer than one character
    spells = spells_df[spells_df.Incantation.notna()].Incantation.unique().tolist()
    spells = [spell for spell in spells if len(spell) > 1]
    # Case-sensitive substring match of each incantation against the text
    identified_spells = [spell for spell in spells if spell in row[column_name]]
    row['identified_spells'] = ', '.join(identified_spells)
    return row

# NOTE: hp_dfs (with a 'dialog' column) is not defined anywhere in this notebook;
# it needs to be loaded before this line will run
hp_spells = hp_dfs[hp_dfs.dialog.notna()].apply(find_spells, column_name='dialog', axis=1)
films_spells = films_df[films_df.Sentence.notna()].apply(find_spells, column_name='Sentence', axis=1)


Ok so we have spells within our scripts! Our next step is to extract this data as training data. Let's adapt our code from above here.

~~HOMEWORK FOR FRIDAY!~~ 

Originally I wanted us to extract training data as homework but there was an issue with the spells data and working with spaCy that we'll discuss in class on Friday.

So Friday we'll figure out how to make our evaluation training dataset but with the `SPELL` label:
```python
evaluation_data =[("I should've known that you would be here, Professor McGonagall.",
  {'entities': [[52, 62, 'PERSON']]})]
```
You can also consult this blog post, which we'll be following for training our custom model, to see how our data needs to be formatted: https://towardsdatascience.com/using-spacy-3-0-to-build-a-custom-ner-model-c9256bea098

If you have time before class, do read the spaCy docs on rules-based matching https://spacy.io/usage/rule-based-matching and then also try working with multilingual data from the ParlaMint dataset (load in the data and try extracting entities with the different language models).