### spaCy

https://www.youtube.com/playlist?list=PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo

https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

In [1]:
with open('./alice/aliceinwonderland.txt', 'r', encoding='utf-8') as f:
    # flatten newlines and normalise curly double quotes to straight ones
    text = f.read().replace('\n', ' ').replace('“', '"').replace('”', '"')
chapters = text.split('CHAPTER ')[13:]  # the first 13 splits are front matter and table of contents
chapters[0]

'I. Down the Rabbit-Hole   Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice "without pictures or conversations?"  So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.  There was nothing so _very_ remarkable in that; nor did Alice think it so _very_ much out of the way to hear the Rabbit say to itself, "Oh dear! Oh dear! I shall be late!" (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually _took a watch out of its waistcoa
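
The hard-coded `[13:]` slice only works for this particular file. A possible alternative, sketched here under assumptions, is to drop the short table-of-contents chunks by length. The helper name `split_chapters`, the `min_chars` threshold and the toy text are all hypothetical and not part of the original notebook.

```python
def split_chapters(text, marker="CHAPTER ", min_chars=200):
    """Split on the chapter marker and drop short chunks (e.g. table-of-contents entries)."""
    parts = text.split(marker)[1:]  # whatever precedes the first marker is front matter
    return [p for p in parts if len(p.strip()) >= min_chars]

# Toy stand-in for the real file: two short TOC entries, then two "chapters".
toc = "Contents CHAPTER I. Down the Rabbit-Hole CHAPTER II. The Pool of Tears "
body = ("CHAPTER I. Down the Rabbit-Hole " + "Alice was beginning to get very tired. " * 10
        + "CHAPTER II. The Pool of Tears " + "Curiouser and curiouser! " * 10)

chapters_demo = split_chapters(toc + body)
print(len(chapters_demo), chapters_demo[0][:23])
```

The threshold would need tuning against the real text, since a very short chapter would be dropped as well.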

In [2]:
import pandas as pd

### install and load spaCy

In [3]:
# See https://spacy.io/usage for installation and package configuration.
# Consider enabling the GPU for large real-world projects (see spacy.prefer_gpu() below).
# !pip install -U spacy
# !python -m spacy download en_core_web_lg

In [4]:
import spacy
# spacy.prefer_gpu()

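# Load a pre-trained English pipeline. It comes in three sizes: _sm, _md and _lg;
# larger models are generally more accurate but slower to load and run.
# Rough figures from these notes (sm / md / lg): accuracy ~87% / 93% / 95%,
# size ~30 MB / 120 MB / 870 MB. Check spacy.io/models for current numbers.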
nlp = spacy.load("en_core_web_lg")
doc = nlp(chapters[0])
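
spaCy models are installed as ordinary Python packages, so one way to avoid a failing `spacy.load` is to check which model is available first. The sketch below uses only the standard library. The helper names `is_installed` and `pick_model` and the preference order are assumptions made for illustration.

```python
import importlib.util

# Assumed preference order: largest model first.
PREFERRED = ("en_core_web_lg", "en_core_web_md", "en_core_web_sm")

def is_installed(name):
    """Return True if a package with this name can be imported."""
    return importlib.util.find_spec(name) is not None

def pick_model(candidates=PREFERRED, installed=is_installed):
    """Return the first candidate that is installed, or None if none are."""
    for name in candidates:
        if installed(name):
            return name
    return None

model_name = pick_model()
print(model_name or "no English model found; run: python -m spacy download en_core_web_sm")
# nlp = spacy.load(model_name)  # once a model name has been found
```

Passing the `installed` check as a parameter keeps the selection logic testable without any model on the machine.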

### tokenize sentences

In [14]:

#! SENTENCE TOKENIZATION
# Splitting on periods and exclamation marks is not enough: even complex regex
# patterns cannot cover every odd place punctuation occurs (abbreviations,
# quotations) or typos where a period is missing. A pre-trained model has
# already learned what sentences look like, so it is usually the more reliable
# way to split them.
sentences = list(doc.sents)
sentence = sentences[3]
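
To make the point about regex splitting concrete, here is a deliberately simplistic splitter and two inputs it gets wrong. `naive_split` is a hypothetical helper and the first sample sentence is invented. The second is taken from the chapter text above.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (intentionally naive)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Abbreviations cause spurious breaks: two sentences come back as four pieces.
abbrev = 'Dr. Smith arrived at 3 p.m. on Friday. He was late!'
for piece in naive_split(abbrev):
    print(repr(piece))

# A closing quote after '?' hides the boundary, so nothing is split at all.
alice = 'thought Alice "without pictures or conversations?"  So she was considering in her own mind'
print(len(naive_split(alice)))
```

Each failure could be patched with another rule, but the rules keep multiplying, which is the argument for `doc.sents`.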


### named entity recognition

In [None]:

#! NAMED ENTITY RECOGNITION
# spaCy is more convenient and faster than NLTK for this kind of task,
# and tends to find more true positives.

# The different models also behave slightly differently.
# Compare the entities (and false positives) each one finds below:
nlp = spacy.load("en_core_web_sm")
doc = nlp(chapters[1])
print(list(doc.ents))

nlp = spacy.load("en_core_web_lg")
doc = nlp(chapters[1])
print(list(doc.ents))

#? print(list(sentence.ents)) # also works


In [6]:
# Extracting entities is easy; false positives (e.g. 'Down' tagged as PERSON) can be cleaned up afterwards.

ents = list(doc.ents) 

people = []
for ent in ents:
    if ent.label_ == 'PERSON':
        people.append(ent)
print(people)

df_ent = pd.DataFrame([[ent, ent.label, ent.label_, ent.text] for ent in ents])
df_ent

[I. Down the Rabbit-Hole, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Down, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice, Alice]


|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | (I., Down, the, Rabbit, -, Hole) | 380 | PERSON | I. Down the Rabbit-Hole |
| 1 | (Alice) | 380 | PERSON | Alice |
| 2 | (Alice) | 380 | PERSON | Alice |
| 3 | (Alice) | 380 | PERSON | Alice |
| 4 | (Alice) | 380 | PERSON | Alice |
| 5 | (Alice) | 380 | PERSON | Alice |
| 6 | (Alice) | 380 | PERSON | Alice |
| 7 | (First) | 396 | ORDINAL | First |
| 8 | (Alice) | 380 | PERSON | Alice |
| 9 | (Down) | 380 | PERSON | Down |
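
The clean-up step mentioned above could look roughly like this. The `(text, label)` pairs are copied from the ten rows shown. In the notebook they would come from `[(ent.text, ent.label_) for ent in doc.ents]`. The reject list `FALSE_POSITIVES` is a hand-made assumption, not something spaCy provides.

```python
from collections import Counter

# (text, label) pairs for the ten entities listed in the output above.
ents_demo = (
    [("I. Down the Rabbit-Hole", "PERSON")]
    + [("Alice", "PERSON")] * 6
    + [("First", "ORDINAL"), ("Alice", "PERSON"), ("Down", "PERSON")]
)

# Hypothetical reject list, built by eye from the output.
FALSE_POSITIVES = {"I. Down the Rabbit-Hole", "Down"}

people_counts = Counter(
    text for text, label in ents_demo
    if label == "PERSON" and text not in FALSE_POSITIVES
)
print(people_counts.most_common())
```

Counting the surviving names also shows at a glance which characters dominate a chapter.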


### tokenize parts of speech/sentence

In [15]:
for token in sentence[:15]:
    print(token.text, token.pos_)

There PRON
was VERB
nothing PRON
so SCONJ
_ PUNCT
very ADV
_ ADV
remarkable ADJ
in ADP
that PRON
; PUNCT
nor CCONJ
did AUX
Alice PROPN
think VERB


### noun and noun chunks
- sometimes you want noun chunks: a noun together with the words that belong to it, e.g. to find 'New York City police station' as one unit, where 'New York City' and 'police station' might otherwise be treated as separate nouns.

In [16]:
# print each token tagged as a common noun
for token in sentence:
    if token.pos_ == 'NOUN':
        print(token)

print(list(sentence.noun_chunks)) # groupings of nouns that belong together

way
[nothing, that, Alice, it, the way, the Rabbit, itself]


In [20]:
chunks = list(doc.noun_chunks)
# keep only the chunks that mention a rabbit
for chunk in chunks:
    if 'rabbit' in str(chunk).lower():
        print(chunk)

the Rabbit-Hole
a White Rabbit
the Rabbit
the Rabbit
a rabbit
a large rabbit-hole
The rabbit-hole
the White Rabbit
the Rabbit

### verb and verb phrases
- verb phrases are harder to extract than noun chunks: the order of verb, adverb, etc. varies widely, so each ordering needs its own pattern.

In [21]:
import textacy
# from textacy.extract import matches

patterns = [
    [{'POS' : 'VERB'},{'POS' : 'ADV'}],
    [{'POS' : 'VERB'},{'POS' : 'PRON'}],
    [{'POS' : 'ADV'},{'POS' : 'VERB'}],
    ]
verb_phrases = textacy.extract.token_matches(doc, patterns)

for verb_phrase in verb_phrases:
    print(verb_phrase)


get very
having nothing
made her
feel very
ran close
was nothing
think it
thought it
seemed quite
then hurried
hurried on
before seen
see it
down went
once considering
went straight
then dipped
dipped suddenly
stopping herself
found herself
fell very
wonder what
happen next
see anything
killing somebody
so managed
put it
think nothing
all think
think me
say anything
said aloud
getting somewhere
Let me
say it
wonder what
thought they
began again
ask them
think you
manage it
think me
never do
see it
was nothing
soon began
talking again
miss me
hope they
remember her
wish you
’s very
get rather
put it
just begun
tell me
ever eat
away went
hear it
found herself
all locked
walked sadly
was nothing
open any
noticed before
ever saw
even get
get her
go through
wish I
think I
only knew
happened lately
went back
half hoping
hoping she
beautifully printed
Drink me
do _
look first
taught them
burn you
hold it
cut your
usually bleeds
never forgotten
drink much
taste it
finding it
soon finished
fini
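
To show what such POS patterns are doing, here is a plain-Python sketch that slides each pattern over a list of `(text, pos)` pairs. It is not how textacy implements `token_matches`, only an illustration of the idea. The tagged tokens are the fifteen printed in the part-of-speech section above, and `match_patterns` is a hypothetical helper.

```python
# (text, POS) pairs copied from the part-of-speech output earlier in these notes.
tagged = [
    ("There", "PRON"), ("was", "VERB"), ("nothing", "PRON"), ("so", "SCONJ"),
    ("_", "PUNCT"), ("very", "ADV"), ("_", "ADV"), ("remarkable", "ADJ"),
    ("in", "ADP"), ("that", "PRON"), (";", "PUNCT"), ("nor", "CCONJ"),
    ("did", "AUX"), ("Alice", "PROPN"), ("think", "VERB"),
]

# Same three orderings as the textacy patterns, reduced to POS tuples.
pos_patterns = [("VERB", "ADV"), ("VERB", "PRON"), ("ADV", "VERB")]

def match_patterns(tagged, patterns):
    """Yield runs of adjacent tokens whose POS tags equal one of the patterns."""
    for i in range(len(tagged)):
        for pattern in patterns:
            window = tagged[i:i + len(pattern)]
            if [pos for _, pos in window] == list(pattern):
                yield " ".join(text for text, _ in window)

print(list(match_patterns(tagged, pos_patterns)))
```

Only "was nothing" matches in these fifteen tokens, which agrees with its appearance in the textacy output.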

### lemmatize

In [22]:
for word in sentence:
    if word.pos_ == 'VERB':
        print(word, word.lemma_)

was be
think think
hear hear
say say


### visualization with displaCy

In [24]:
from spacy import displacy

# style="dep" draws the dependency parse; style="ent" highlights entities instead
html = displacy.render(sentence, style="dep", page=True, jupyter=False)
with open('datavis.html', 'w', encoding='utf-8') as f:
    f.write(html)

In [25]:
# inside Jupyter this displays the result inline and returns None
html = displacy.render(sentence, style="ent", page=True)


In [27]:

##! STYLING AND LIMITING WHICH TAGS APPEAR
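# "ents" limits which labels are highlighted; "colors" maps a label to any CSS background value.
bgcolors = {"PERSON": '#ffcc00', "GPE": 'linear-gradient(45deg, #ccff00, #9944cc)'}
options = {"ents": ["PERSON", "GPE"], "colors": bgcolors}

# the whole doc is rendered here; pass sentences[0:20] instead to render only the first 20 sentences
html = displacy.render(doc, style="ent", page=True, options=options)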


### **Custom Visualisation Case Study**
- highlight all quotes/speech and color-code them by speaker
- this needs a custom function, as spaCy has no built-in way to find quotations

In [34]:
import spacy
from spacy import displacy
import re

with open('./alice/aliceinwonderland.txt', 'r', encoding='utf-8') as f:
    # flatten newlines and normalise curly double quotes to straight ones
    text = f.read().replace('\n', ' ').replace('“', '"').replace('”', '"')
chapters = text.split('CHAPTER ')[13:]  # skip front matter and table of contents
chapter1 = chapters[0]

def find_sents(text=chapter1):
    """Split text into sentence spans using the small English pipeline."""
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)

def get_quotes(text):
    """Return the contents of every double-quoted span that follows a space."""
    # a quote at the very start of the string has no leading space and is missed
    return re.findall(r' "(.*?)"', text)
    # single-quoted variant, not enabled: re.findall(r" '(.*?)'", text)

found_sents = find_sents()

for sent in found_sents:
    found_quotes = get_quotes(str(sent))
    if found_quotes:
        print(found_quotes)

['and what is the use of a book,', 'without pictures or conversations?']
['ORANGE MARMALADE']
['—yes, that’s about the right distance—but then I wonder what Latitude or Longitude I’ve got to?']
['Do bats eat cats?']
['Now, Dinah, tell me the truth: did you ever eat a bat?']
['Oh my ears and whiskers, how late it’s getting!']
['and even if my head would go through,']
['DRINK ME,']
['Drink me,']
['and see whether it’s marked ‘_poison_’ or not', 'poison,']
['poison,']
['I must be shutting up like a telescope.']
['for it might end, you know,']
['I advise you to leave off this minute!']
['EAT ME']
['and if it makes me grow larger, I can reach the key; and if it makes me grow smaller, I can creep under the door; so either way I’ll get into the garden, and I don’t care which happens!']
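
A possible next step toward the highlighting goal is to turn each quote into character offsets, which is the form displaCy accepts when called with `manual=True`. The sketch below builds that dictionary with the standard library only. `quote_spans` and the `QUOTE` label are hypothetical, the pattern drops the leading-space requirement used above, and working out who is speaking (needed for per-speaker colors) is not attempted.

```python
import re

def quote_spans(text, label="QUOTE"):
    """Return displaCy-style entity dicts covering each double-quoted span."""
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in re.finditer(r'"[^"]*"', text)
    ]

# Excerpt from the chapter text shown at the top of these notes.
sample = 'the Rabbit say to itself, "Oh dear! Oh dear! I shall be late!" (when she thought it over'
spans = quote_spans(sample)
manual_doc = {"text": sample, "ents": spans, "title": None}

for span in spans:
    print(sample[span["start"]:span["end"]])

# Rendering step, left commented out so the sketch runs without spaCy installed:
# from spacy import displacy
# options = {"ents": ["QUOTE"], "colors": {"QUOTE": "#ffcc00"}}
# html = displacy.render(manual_doc, style="ent", manual=True, page=True, options=options)
```

Replacing the single `QUOTE` label with a speaker name per span would give one color per speaker, once attribution is solved.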


[Stopped at playlist](https://www.youtube.com/watch?v=E9h8qVm2uNY&list=PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo&index=14)


- Gazetteer and NER in Python (Rules-based NER)
- Introduction to Machine Learning NER
  - people are shifting to ML NER because it's more versatile, but rules-based is still useful in some applications
  - in order to understand when to use ML NER, it's important to understand Rules-based NER
- using spaCy's NER
- What's under the hood of spaCy?
- Identifying weaknesses in spaCy's NER
  - domain problems. most available NERs are trained on general web corpora and perform poorly on domain specific texts

TRAIN A CUSTOM MODEL:
- Introduction to Word Vectors
- Generating Custom Word Vectors in Gensim
- Importing Custom Word Vectors from Gensim
- Training spaCy's NER on new domain-specific texts
- Creating New Entity Labels in spaCy
- Generating New Training Data Quickly
- Training and Deploying a Domain NER Model
 
