# SI 699 :: Seminar :: Book 2 of 3 :: Annotation

# Tutorial Roadmap

<b>Acquisition (Part 1 of 3 :: PRAW // Data Gathering)</b>
- We gathered data from Reddit, divided it up, and dumped it into some files

<b>Preparation (Part 2 of 3 :: Default NERS)</b>
- (Demo Default NER System)
- Now we import those files, decide our "classes", and provide annotations
- We do this so the model can learn what we want it to find, and what that looks like

<b>Execution (Part 3 of 3 :: Named Entity Recognition Training)</b>
- Use our labelled data to train an NER model in spaCy and observe our results

# Wait, what actually are we doing?
We're teaching a model to read text and pick out the substrings that belong to specific categories. This task is called named entity recognition (NER).

Ex: "Digest this corpus of 100,000 recipes and extract/label/tag all fruits, spices, and dairy products"
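
To make the recipe example concrete, here is a toy sketch of what "identify substrings of specific categories" produces: a list of character spans, each with a label. The `GAZETTEER` dictionary and the `tag_substrings` helper are hypothetical names made up for illustration. A trained NER model does not work by dictionary lookup (it learns from context), but its output has the same shape.

```python
import re

# Hypothetical mini-gazetteer: categories and terms are invented for illustration.
GAZETTEER = {
    "FRUIT": ["apple", "lemon"],
    "SPICE": ["cinnamon", "nutmeg"],
    "DAIRY": ["butter", "cream"],
}

def tag_substrings(text):
    """Return (start, end, label, substring) tuples for every gazetteer hit."""
    found = []
    for label, terms in GAZETTEER.items():
        for term in terms:
            pattern = rf"\b{re.escape(term)}\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.end(), label, match.group()))
    # Sort by start offset so spans come back in reading order
    return sorted(found)

recipe = "Toss the apple slices with cinnamon, then dot with butter."
for start, end, label, substring in tag_substrings(recipe):
    print(f"{label} :: \"{substring}\" [{start}:{end}]")
```

A lookup like this breaks as soon as a term is ambiguous or missing from the list, which is the gap a trained model is meant to close.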

***
<b>Setup</b><br>
Import necessary libraries:

In [None]:
import pandas as pd
import spacy

from spacy.util import minibatch, compounding
from spacy.lang.en import English

Mount Google Drive so we can operate directly on files in our drive without uploading:

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Specify our file location so we can read from Google Drive.

If you're trying to run this from class, it will not work for you, because this data lives in our team drive.

In [None]:
to_label_1 = pd.read_csv("/content/drive/Shareddrives/SI699 Capstone/Tutorial/data/to_label_1.csv")
to_label_2 = pd.read_csv("/content/drive/Shareddrives/SI699 Capstone/Tutorial/data/to_label_2.csv")
to_label_3 = pd.read_csv("/content/drive/Shareddrives/SI699 Capstone/Tutorial/data/to_label_3.csv")

***
# Pre-Annotation :: Why... are we doing this?
spaCy is a great library, and it already ships with a pre-trained model for named entity recognition (NER). Before annotating anything ourselves, let's see how far that default model gets on our data.

In [None]:
to_label_1.head()

Unnamed: 0,id,fused
0,mfrbm3,GEO and Councilmember Nelson look to change le...
1,mfqsnz,Every time I check my email..
2,mfknj4,International student on some questions about ...
3,mfqj0e,M-Sci Has anyone done this program? Can they g...
4,mfq4xe,Scenes from Yesterday’s Win. Let’s do it again...


Let's load the pre-trained model that spaCy provides and see what it does to a sample of our data.

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
for _, row in to_label_1.iterrows():
    text = row.fused
    ents = nlp(text).ents
    print(text.replace('\n\n', '\n'))

    # "Woah why is the text colored, you can do that?"
    # Yes, look up "ANSI Escape Sequences"
    # \033[91m switches to bright red; \033[0m resets the color
    print(f"---->\033[91m[Entities :: {ents}]\033[0m<----")

    for entity in ents:
        print(f"{entity.label_} :: \"{entity.text}\"")
    print("=" * 65)


GEO and Councilmember Nelson look to change leasing laws to ease student housing hunt 
---->[Entities :: (GEO, Councilmember Nelson)]<----
ORG :: "GEO"
PERSON :: "Councilmember Nelson"
Every time I check my email.. 
---->[Entities :: ()]<----
International student on some questions about visa and vaccination Hey umich Reddit, 
I'm a freshman international student moving into campus next semester since the classes won't be virtual anymore. I have some quick questions about visas since some peers recommended me to utilize Reddit to ask questions. 
First, what should I do after I receive my F-2 visa? Are there procedures I have to know regarding when to move in and such?
Second, I am currently waiting for my certification email from the international center for my re-printing of my I-20. How long does it usually take? I sent mine about a week ago. 
Third, I am aware that I cannot be vaccinated unless everyone with a greencard is vaccinated. Is that true?
&#x200B;
Thanks 
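
The colored "Entities" line in the loop relies on ANSI escape sequences. A minimal sketch of the idea, assuming a terminal or notebook output area that interprets them: `\033` (octal) and `\x1b` (hex) are two spellings of the same escape character, `91` selects bright red, and `0` resets. The `colorize` helper is a hypothetical name, not part of spaCy or pandas.

```python
# Standard ANSI SGR codes: 91 = bright red foreground, 0 = reset all attributes.
RED = "\033[91m"
RESET = "\033[0m"

def colorize(text, color=RED):
    """Wrap text in an ANSI color code and reset the color afterwards."""
    return f"{color}{text}{RESET}"

print(colorize("[Entities :: (GEO, Councilmember Nelson)]"))
```

Always emitting the reset code keeps the color from bleeding into whatever is printed next.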

*** 
# Flashier Visualization Option :: DisplaCy :: Examples

- displaCy takes the entities the model finds and highlights them inline in the text for you
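
Conceptually, highlighting means wrapping each entity's character span in some markup. The sketch below is not displaCy's implementation, just a plain-Python illustration of the idea. `highlight_entities` is a hypothetical helper, and the spans are written by hand here; with spaCy they would come from each entity's `start_char`, `end_char`, and `label_` attributes.

```python
import html

def highlight_entities(text, spans):
    """Wrap each (start, end, label) span of text in a <mark> tag.

    Assumes spans are character offsets that do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Plain text between the previous entity and this one
        pieces.append(html.escape(text[cursor:start]))
        # The entity itself, followed by its label
        pieces.append(f"<mark>{html.escape(text[start:end])} <b>{label}</b></mark>")
        cursor = end
    # Whatever text remains after the last entity
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)

sample = "GEO and Councilmember Nelson look to change leasing laws"
spans = [(0, 3, "ORG"), (8, 28, "PERSON")]
print(highlight_entities(sample, spans))
```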

In [None]:
to_label_1.iloc[11]

id                                                  mfdinh
fused    Elite Eight!!!!!!! Feels great to be a part of...
Name: 11, dtype: object

<b>"The Good"</b> -- While imperfect, this example shows the default model working really effectively to catch organizations, dates/times, and monetary amounts

(Misses "BU" as an org and thinks Columbia Law is a person, whoops)

In [None]:
from spacy import displacy
displacy.render(nlp(to_label_2.iloc[19].fused.replace('\n\n','\n')), jupyter=True, style='ent')

<b>"The Bad"</b> -- Sometimes it misses entities. Sometimes it finds nothing.

In [None]:
from spacy import displacy
displacy.render(nlp(to_label_1.iloc[9].fused.replace('\n\n','\n')), jupyter=True, style='ent')


<b>"The Awkward"</b> -- Here's a sports post. What if we wanted to focus on sports? We would want "Elite Eight" and "March Madness" as entities. Instead, the model gives us the cardinal number 8, the month March, and a person called Madness.

In [None]:
from spacy import displacy
displacy.render(nlp(to_label_1.iloc[11].fused.replace('\n\n','\n')[:150]), jupyter=True, style='ent')

<b>"The Ugly"</b> -- Things can get mislabelled.
- "AA" in this case means Ann Arbor, which should get a "GPE" (Geopolitical Entity) label rather than "ORG". The organization AA would be "Alcoholics Anonymous"
- The undergraduate program "Ross Minor" is now a person, "Ross Minor," apparently short for "Ross Minor Acceptance Applying"
- "CoE at UMich" is labelled a "WORK_OF_ART"

In [None]:
from spacy import displacy
displacy.render(nlp(to_label_1.iloc[19].fused.replace('\n\n','\n')[:100]), jupyter=True, style='ent')

In [None]:
from spacy import displacy
displacy.render(nlp(to_label_3.iloc[19].fused.replace('\n\n','\n')), jupyter=True, style='ent')

In [None]:
from spacy import displacy
displacy.render(nlp(to_label_3.iloc[11].fused.replace('\n\n','\n')), jupyter=True, style='ent')
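
The failure modes above (missed entities, wrong labels) can be tallied once we have our own labels to compare against. A minimal sketch, assuming entities are stored as a simple text-to-label mapping: `compare_entities` is a hypothetical helper, and the `expected` labels below are our own illustrative guesses pulled from the notes above, not real annotation output.

```python
def compare_entities(expected, predicted):
    """Sort predictions into correct, mislabelled, missed, and spurious.

    Both arguments map entity text to a label, e.g. {"AA": "GPE"}.
    Keying on text ignores repeated mentions; fine for a quick tally.
    """
    report = {"correct": [], "mislabelled": [], "missed": [], "spurious": []}
    for text, label in expected.items():
        if text not in predicted:
            report["missed"].append(text)
        elif predicted[text] == label:
            report["correct"].append(text)
        else:
            # Record (text, what we wanted, what the model said)
            report["mislabelled"].append((text, label, predicted[text]))
    report["spurious"] = [t for t in predicted if t not in expected]
    return report

# Illustrative values only, combining examples from different posts
expected = {"AA": "GPE", "BU": "ORG", "Columbia Law": "ORG"}
predicted = {"AA": "ORG", "Columbia Law": "PERSON"}

for kind, items in compare_entities(expected, predicted).items():
    print(f"{kind}: {items}")
```

A tally like this gives a rough baseline to revisit after training a custom model.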

# And now for the final notebook: Training our own model