#### Advanced NLP with spaCy
#### [Chapter 2: Large-scale data analysis with spaCy](https://course.spacy.io/en/chapter2)

##### 1. Data Structures (1)

Processed tokens and their corresponding hashes are added to the vocab object's bidirectional string store (`nlp.vocab.strings`), which can be queried by string or by hash:

In [30]:
import spacy
nlp = spacy.blank("en")
nlp.vocab.strings.add("coffee")
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_string)

coffee


In [31]:
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])
print("doc hash value:", doc.vocab.strings["coffee"])

hash value: 3197928453018144401
string value: coffee
doc hash value: 3197928453018144401

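To make the two-way lookup concrete, here is a minimal pure-Python sketch of a bidirectional string store. It is not spaCy's implementation: `ToyStringStore` is a hypothetical class, and `hashlib.blake2b` stands in for spaCy's own 64-bit hash function, so the hash values will differ from the ones printed above.

```python
import hashlib


class ToyStringStore:
    """Illustrative two-way string/hash lookup (hypothetical, not spaCy's code)."""

    def __init__(self):
        self._hash_to_string = {}

    @staticmethod
    def _hash(string):
        # Stand-in 64-bit hash; spaCy uses a different hash function.
        digest = hashlib.blake2b(string.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string):
        # Store the string under its hash and return the hash.
        key = self._hash(string)
        self._hash_to_string[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # string -> hash can always be computed
            return self._hash(key)
        # hash -> string only works for strings that were added (else KeyError)
        return self._hash_to_string[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)
print(store[coffee_hash])
```

The asymmetry is the point: a hash can be computed from any string, but the reverse lookup only succeeds for strings the store has already seen.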

Lexemes (context-independent vocabulary entries) are stored directly in the vocab object and can be looked up by string or hash:

In [32]:
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


##### 2. Strings to hashes

In [33]:
doc = nlp("I have a cat")
cat_hash = nlp.vocab.strings["cat"]
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

cat


In [34]:
doc = nlp("David Bowie is a PERSON")
person_hash = nlp.vocab.strings["PERSON"]
person_string = nlp.vocab.strings[person_hash]
print(person_string)

PERSON


##### 3. Vocab, hashes and lexemes
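A hash can only be resolved back to its string by a vocab that has already seen that string. Strings are added to an nlp object's vocab when it processes text (or when they are added explicitly), so take care when you work with more than one nlp object: a hash taken from one vocab may not resolve in another.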

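A small sketch of that pitfall, assuming spaCy v3 is installed. It uses two blank pipelines, so no trained model is needed; `lookup` is a hypothetical helper added only for this illustration.

```python
import spacy

nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

# Processing the text adds "Bowie" to the English vocab's string store only.
doc = nlp_en("David Bowie is a PERSON")
bowie_hash = nlp_en.vocab.strings["Bowie"]


def lookup(vocab, hash_value):
    """Return the string for a hash, or None if this vocab has not seen it."""
    try:
        return vocab.strings[hash_value]
    except KeyError:
        return None


print(lookup(nlp_en.vocab, bowie_hash))
print(lookup(nlp_de.vocab, bowie_hash))
```

The second lookup fails because the German vocab never processed the string; adding it with `nlp_de.vocab.strings.add("Bowie")` would make the hash resolvable there too.
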
##### 4. Data Structures (2): Doc, Span, and Token
Creating a doc manually:

In [35]:
from spacy.tokens import Doc
nlp = spacy.blank("en")
words = ["Hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc)

Hello world!


Creating a span with the `Span` class instead of Python slicing syntax on a doc object (spans can easily be labeled this way):

In [36]:
from spacy.tokens import Span
span = Span(doc, 0, 2)
span_with_label = Span(doc, 0, 2, label="GREETING")
doc.ents = [span_with_label] # doc.ents can add entities manually by overwriting with a list of spans
print(doc.ents)

(Hello world,)


##### 5. Creating a Doc

In [37]:
nlp = spacy.blank("en")
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [38]:
nlp = spacy.blank("en")
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [39]:
nlp = spacy.blank("en")
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!

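Each entry in `spaces` says whether the token at the same position is followed by a space. The pure-Python sketch below reproduces the three texts above without spaCy; `join_tokens` is a hypothetical helper, not part of the spaCy API.

```python
def join_tokens(words, spaces):
    """Rebuild the text from tokens and their trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


examples = [
    (["spaCy", "is", "cool", "!"], [True, True, False, False]),
    (["Go", ",", "get", "started", "!"], [False, True, True, False, False]),
    (["Oh", ",", "really", "?", "!"], [False, True, False, False, False]),
]

texts = [join_tokens(words, spaces) for words, spaces in examples]
for text in texts:
    print(text)
```

Writing the expected text first and then marking which tokens are followed by a space is usually the quickest way to get `spaces` right.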

##### 6. Docs, spans and entities from scratch 

In [40]:
nlp = spacy.blank("en")
words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)
doc.ents = [span]
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


##### 7. Data structures best practices
Convert tokens to strings as late as possible:

In [41]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")
for index, token in enumerate(doc):
    # Compare token attributes; only convert to text for the final output
    if token.pos_ == "PROPN" and index + 1 < len(doc):
        if doc[index + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


##### 8. Word vectors and semantic similarity

In [42]:
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.869833325851152


In [43]:
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.6850197911262512


In [44]:
doc = nlp("I like pizza")
token = nlp("soap")[0]
print(doc.similarity(token))

0.18213694934365615


In [45]:
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")
print(span.similarity(doc))

0.4719003666806404


In [46]:
doc = nlp("I have a banana")
print(len(doc[3].vector))

300


In [47]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")  # opposite sentiment, yet the similarity score is still very high
print(doc1.similarity(doc2))

0.9530093158841214

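The high score makes more sense once you recall that spaCy's default similarity is the cosine similarity between vectors, and that a doc or span vector defaults to the average of its token vectors. The sketch below redoes that calculation in plain Python with made-up 3-dimensional vectors (the real `en_core_web_md` vectors have 300 dimensions); `cosine_similarity`, `average` and `toy_vectors` are illustrative names, not spaCy API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up vectors, chosen so that "like" and "hate" share nothing at all.
toy_vectors = {
    "I": [1.0, 0.0, 0.0],
    "like": [0.0, 1.0, 0.0],
    "hate": [0.0, 0.0, 1.0],
    "cats": [1.0, 1.0, 1.0],
}

doc1_vector = average([toy_vectors[w] for w in ["I", "like", "cats"]])
doc2_vector = average([toy_vectors[w] for w in ["I", "hate", "cats"]])

word_similarity = cosine_similarity(toy_vectors["like"], toy_vectors["hate"])
doc_similarity = cosine_similarity(doc1_vector, doc2_vector)
print(word_similarity, round(doc_similarity, 3))
```

Even with the two differing words completely dissimilar, the averaged vectors stay close because the texts share most of their tokens. Similarity here measures overlap in vector space, not agreement in meaning or sentiment.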

##### 9. Inspecting word vectors

In [48]:
nlp = spacy.load("en_core_web_md")
doc = nlp("Two bananas in pyjamas")
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.1689e-01 -2.5989e+00 -1.3144e+00  2.2500e+00 -4.6767e-01 -2.0695e+00
 -6.3379e-01 -4.0222e-01 -3.4022e+00 -3.6932e-01 -7.9938e-01 -1.0412e+00
  9.3756e-01  1.6070e+00  8.8330e-01 -2.8483e+00  1.3349e-01 -3.1656e+00
  8.1896e-01 -4.8113e+00  1.5655e+00  1.6665e+00 -4.7081e-01 -1.9475e+00
 -1.1779e+00 -1.3810e+00 -2.0071e+00 -2.1639e-01  9.0609e-01  1.5279e+00
  1.2587e-04 -2.9000e+00  7.6069e-01 -2.2825e+00  1.2495e-02 -1.5653e+00
  2.0052e+00 -1.7747e+00  5.9220e-01 -1.1428e+00 -1.3441e+00  3.4784e-01
  1.7492e+00  1.9086e+00  1.0600e+00  1.2965e+00  4.1431e-01  7.9416e-01
 -1.1277e+00 -1.1403e+00  7.5891e-01 -9.4419e-01  1.4413e+00 -2.2554e+00
  1.6226e-01  3.8901e-01  1.2299e-01  1.1577e+00  1.5524e+00  1.3853e+00
  1.1112e+00  7.5767e-01  3.9431e+00 -2.8506e-01 -2.1645e+00 -1.0862e+00
 -1.4973e+00 -1.2781e+00  2.4643e+00 -1.5886e+00  2.5679e-01  6.4918e-01
  1.6809e-01  5.7693e-01  3.1121e-01 -4.5278e-01 -2.7555e+00 -2.1846e+00
  4.4865e+00  2.7107e-01 -5.3831e-01  8.3013e-01  6

##### 10. Comparing similarities

In [49]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")
similarity = doc1.similarity(doc2)
print(similarity)

0.8220092482601077


In [50]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]
similarity = token1.similarity(token2)
print(similarity)

0.10219937562942505


In [51]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")
span1 = doc[3:5]
span2 = doc[-4:-1]
print(span1.text, span2.text)

great restaurant really nice bar


In [52]:
similarity = span1.similarity(span2)
print(similarity)

0.6348509788513184


##### 11. Combining predictions and rules

In [55]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{"LOWER": "golden"}, {"LOWER": "retriever"}]])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)
    print("Root token:", span.root.text)
    print("Root token head:", span.root.head.text)
    print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root token head: have
Previous token: a DET


Phrase matcher:

In [56]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: Golden Retriever


##### 12. and 13. Debugging patterns
- Don't put spaces (or multi-token strings) in a token pattern's values: the matcher works on already tokenized text, so each dict in a pattern describes exactly one token

In [69]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"LOWER": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])

for match_id, start, end in matcher(doc):
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing

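A quick way to debug a pattern is to print the tokens first and then write one dict per token. The sketch below assumes spaCy v3 and uses a blank English pipeline (tokenizer only, no trained model); the pattern names `ONE_TOKEN` and `PER_TOKEN` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no trained components needed
doc = nlp("ad-free viewing")

# Inspect how the text was actually split before writing a pattern.
tokens = [token.text for token in doc]
print(tokens)

matcher = Matcher(nlp.vocab)
# Treats "ad-free" as a single token, which the tokenizer does not produce.
matcher.add("ONE_TOKEN", [[{"LOWER": "ad-free"}]])
# One dict per token, including the hyphen.
matcher.add("PER_TOKEN", [[{"LOWER": "ad"}, {"LOWER": "-"}, {"LOWER": "free"}]])

matched = {nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)}
print(matched)
```

Only the per-token pattern should match, which mirrors how `pattern2` above spells out the hyphen as its own token.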

##### 14. Efficient phrase matching

In [71]:
import json

with open("countries.json", encoding="utf-8") as f:
    COUNTRIES = json.load(f)

nlp = spacy.blank("en")
doc = nlp("Czech Republic may help Slovakia protect its airspace")

matcher = PhraseMatcher(nlp.vocab)

patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

matches = matcher(doc)
print([doc[start:end] for _, start, end in matches])

[Czech Republic, Slovakia]


##### 15. Extracting countries and relationships

In [75]:
with open("country_text.txt", encoding="utf-8") as f:
    TEXT = f.read()

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

doc = nlp(TEXT)
doc.ents = []  # reset the model's predicted entities so the matched spans cannot overlap them

for match_id, start, end in matcher(doc):
    span = Span(doc, start, end, label="GPE")
    doc.ents = doc.ents + (span,)
    span_root_head = span.root.head
    print(span_root_head.text, "-->", span.text)

print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

in --> Namibia
in --> South Africa
Africa --> Cambodia
of --> Kuwait
as --> Somalia
Somalia --> Haiti
Haiti --> Mozambique
in --> Somalia
for --> Rwanda
Britain --> Singapore
War --> Sierra Leone
of --> Afghanistan
invaded --> Iraq
in --> Sudan
of --> Congo
earthquake --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
