# Chapter 2: Large-scale data analysis with spaCy

This chapter covers how to
* extract specific information from large volumes of text
* make the most of spaCy's data structures
* effectively combine statistical and rule-based approaches for text analysis

**Sections** 

1. Data Structures (Part 1) 
2. Strings to hashes 
3. Vocab, hashes and lexemes 
4. Data Structures (Part 2) 
5. Creating a Doc 
6. Docs, spans and entities from scratch 
7. Data structures best practices 
8. Word vectors and semantic similarity 
9. Inspecting word vectors 
10. Comparing similarities 
11. Combining models and rules
12. Debugging patterns (Part 1) 
13. Debugging patterns (Part 2) 
14. Efficient phrase matching 
15. Extracting countries and relationships

## 1. Data Structures (Part 1): Vocab, Lexemes, and StringStore
* Goal: Look at shared vocabulary and how spaCy deals with strings


### Shared vocab and string store (1)
* `Vocab`: stores data shared across multiple documents
* To save memory, spaCy encodes all strings to **hash values**
* Strings are only stored once in the `StringStore` via `nlp.vocab.strings`
* String store: **lookup table** in both directions

In [None]:
# (do not run)

coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

print(coffee_hash)
print(coffee_string)

* Hashes can't be reversed - that's why we need to provide the shared vocab

In [None]:
# (do not run)

# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[123456789]

**Notes**
* spaCy stores all shared data in a vocabulary: the Vocab
    - includes words but also label schemes for tags and entities
* to save memory, all strings are encoded to hash IDs
    - if a word occurs more than once, we don't need to save it every time
* instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store
    - string store available as `nlp.vocab.strings`
    - `nlp.vocab.strings` is a lookup table that works in both directions
    - you can look up a string and get its hash
    - you can also look up a hash to get its string value
    - internally, spaCy only communicates in hash IDs
* hash IDs can't be reversed
    - if a word is not in the vocabulary, there's no way to get its string back from the hash, so we always need to pass around the shared vocab
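
The two-way lookup described in these notes can be sketched in plain Python. This is a toy illustration only: `ToyStringStore` and its methods are hypothetical names, and the hash function is an arbitrary stable hash chosen for the example, not the one spaCy uses.

```python
import hashlib


class ToyStringStore:
    """Hypothetical stand-in for a string store: maps strings <-> 64-bit hashes."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(text):
        # Any stable hash works for the illustration; this is NOT spaCy's hash function
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "little")

    def add(self, text):
        key = self.hash_string(text)
        self._by_hash[key] = text  # the string itself is stored only once
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.hash_string(key)  # string -> hash can always be computed
        return self._by_hash[key]         # hash -> string needs a stored entry


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)  # True
print(store[coffee_hash])              # coffee

# A second, empty store has never seen "coffee", so the hash cannot be resolved
other_store = ToyStringStore()
try:
    other_store[coffee_hash]
except KeyError:
    print("unknown hash")
```

The last lookup fails for the same reason the notes give: the hash alone carries no way back to the string, so whoever resolves it needs access to the store that saw the string.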

### Shared vocab and string store (2)
* Look up the string and hash in `nlp.vocab.strings`

In [3]:
# not in slide
import spacy
nlp = spacy.load("en_core_web_sm")

# in slide
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


* the `doc` also exposes the vocab and strings

In [4]:
doc = nlp("I love coffee")
print("hash value:", doc.vocab.strings["coffee"])

hash value: 3197928453018144401


**Notes**
* to get the hash for a string, we can look it up in `nlp.vocab.strings`
* to get the string representation of a hash, we can look up the hash
* a `Doc` object also exposes its vocab and strings

### Lexemes: entries in the vocabulary
* a `Lexeme` object is an entry in the vocabulary

In [6]:
doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


* contains the **context-independent** information about a word
    - word text: `lexeme.text` and `lexeme.orth` (the hash)
    - lexical attributes like `lexeme.is_alpha`
    - **not** context-dependent part-of-speech tags, dependencies or entity labels

**Notes**
* lexemes are context-independent entries in the vocabulary
* you can get a lexeme by looking up a string or a hash ID in the vocabulary
* lexemes expose attributes, just like tokens
    - they hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters
* lexemes don't have part-of-speech tags, dependencies or entity labels
    - those depend on the context

### Vocab, hashes and lexemes
(note: slide has a diagram)

* the `Doc` contains words in contexts
    - EX. the tokens "I", "love", and "coffee"
    - contains their part-of-speech tags and dependencies
* each token refers to a lexeme, which knows the word's hash ID
* to get the string representation of the word, spaCy looks up the hash in the string store
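
Since the slide diagram is not reproduced here, the chain token → lexeme → string store can be sketched with a few toy classes. All names below (`ToyLexeme`, `ToyToken`, `get_lexeme`, `make_doc`) are hypothetical, the part-of-speech tags are assigned by hand, and `zlib.crc32` is only a convenient stand-in hash.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    orth: int       # hash ID of the word text
    is_alpha: bool  # context-independent lexical attribute


@dataclass
class ToyToken:
    lexeme: ToyLexeme  # shared vocabulary entry
    pos: str           # context-dependent, so it lives on the token


strings = {}  # hash -> string, stand-in for the string store
lexemes = {}  # hash -> lexeme, stand-in for the vocab


def get_lexeme(text):
    key = zlib.crc32(text.encode("utf8"))  # stand-in hash, not spaCy's
    strings.setdefault(key, text)
    if key not in lexemes:
        lexemes[key] = ToyLexeme(orth=key, is_alpha=text.isalpha())
    return lexemes[key]


def make_doc(tagged_words):
    return [ToyToken(get_lexeme(text), pos) for text, pos in tagged_words]


doc1 = make_doc([("I", "PRON"), ("love", "VERB"), ("coffee", "NOUN")])
doc2 = make_doc([("My", "PRON"), ("love", "NOUN"), ("for", "ADP"), ("coffee", "NOUN")])

# Both "love" tokens point at the same lexeme but carry different tags
print(doc1[1].lexeme is doc2[1].lexeme)  # True
print(doc1[1].pos, doc2[1].pos)          # VERB NOUN

# The text is recovered by looking up the lexeme's hash in the string store
print(strings[doc1[2].lexeme.orth])      # coffee
```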

## 2. Strings to hashes

In [7]:
# Part 1
from spacy.lang.en import English

nlp = English()
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [8]:
# Part 2
from spacy.lang.en import English

nlp = English()
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


## 3. Vocab, hashes and lexemes

In [9]:
# Why does this code throw an error?
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings["Bowie"]
print(bowie_id)

# Look up the ID for "Bowie" in the vocab
print(nlp_de.vocab.strings[bowie_id])

2644858412616767388


KeyError: "[E018] Can't retrieve string for hash '2644858412616767388'. This usually refers to an issue with the `Vocab` or `StringStore`."

(**x**) The string `"Bowie"` isn't in the German vocab, so the hash can't be resolved in the string store

() `"Bowie"` is not a regular word in the English or German dictionary, so it can't be hashed

() `nlp_de` is not a valid name. The vocab can only be shared if the `nlp` objects have the same name

Explanation: Hashes can't be reversed. To prevent this problem, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.

## 4. Data Structures (Part 2): Doc, Span and Token

### The Doc Object

In [11]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

**Notes**
* the `Doc` is one of the central data structures in spaCy
    - it's created automatically when you process a text with an `nlp` object
    - you can also instantiate the class manually
* after creating the `nlp` object, we can import the `Doc` class from `spacy.tokens`
* EX: creating a doc from 3 words
    - spaces are a list of boolean values indicating whether the word is followed by a space
    - every token includes that information
* the `Doc` class takes three arguments
    - the shared vocab
    - the words
    - the spaces
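
To see how `words` and `spaces` together determine the text, here is a small plain-Python sketch. `join_words` is a hypothetical helper written for this note, not part of spaCy; it only mimics how the two lists combine into a string.

```python
def join_words(words, spaces):
    """Rebuild a text from words plus flags saying whether a space follows each word."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


print(join_words(["Hello", "world", "!"], [True, False, False]))  # Hello world!
```

Working out the `spaces` list this way before creating a `Doc` is a quick check that the resulting text will look as intended.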

### The Span object (1)
(note: slide has a diagram)
* a `Span` is a slice of a doc consisting of one or more tokens
* the `Span` takes at least three arguments
    - the doc it refers to
    - the start index of a span
    - the end index of a span (which is exclusive!)
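
The exclusive end index follows the same convention as Python slicing, which can be checked on a plain list of words (a sketch with ordinary lists, not `Span` objects):

```python
words = ["Hello", "world", "!"]
start, end = 0, 2

# The slice covers indices 0 and 1; index 2 ("!") is excluded
span_words = words[start:end]
print(span_words)                      # ['Hello', 'world']
print(len(span_words) == end - start)  # True
```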

### The Span object (2)
* to create a `Span` manually, we can also import the class from `spacy.tokens`
* we can then instantiate it with the doc and the span's start and end index, and an optional label argument
* the `doc.ents` are writable, so we can add entities manually by overwriting them with a list of spans

In [19]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print("Doc:", doc)

# Create a span manually
span = Span(doc, 0, 2)
print("Span:", span)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

Doc: Hello world!
Span: Hello world


### Best practices
* `Doc` and `Span` are very powerful and hold references and relationships of words and sentences
    - **convert result to strings as late as possible**
    - **use token attributes if available**
        - EX. `token.i` for the token index
* Don't forget to pass in the shared `vocab`

**Notes**
* `Doc` and `Span` are very powerful and optimized for performance
    - give you access to all references and relationships of words and sentences
* if your application needs to output strings, make sure to convert the doc as late as possible
    - if you do it too early, you'll lose all the relationships between the tokens
* to keep things consistent, try to use built-in token attributes whenever possible
    - EX: `token.i` for the token index
* don't forget to always pass in the shared vocab

## 5. Creating a Doc

In [20]:
# Part 1
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [21]:
# Part 2
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [22]:
# Part 3
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


## 6. Docs, spans and entities from scratch

In this exercise, you'll create the `Doc` and `Span` objects manually, and update the named entities - just like spaCy does behind the scenes. A shared `nlp` object has already been created.

In [23]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words, spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


## 7. Data structures best practices

In [None]:
# Part 1 (did not run)
# Why is this code bad?

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Why is this code bad?

ANSWER: It only uses lists of strings instead of native token attributes. This is often less efficient and can't express complex relationships.

EXPLANATION: Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent.

In [24]:
# Part 2
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


Note about the solution: if the doc ends in a proper noun, `doc[token.i + 1]` will fail because there is no next token. To make sure the code generalizes, first check that `token.i + 1 < len(doc)`.
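
A sketch of that bounds check, written so it can run without a model: `FakeToken` is a hypothetical stand-in that exposes only the three attributes used (`i`, `text`, `pos_`), and the tags are assigned by hand. The function itself only relies on those attributes, so the same logic should carry over to a real `Doc`.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the three token attributes used below."""
    i: int
    text: str
    pos_: str


def proper_nouns_before_verbs(doc):
    """Return the tokens tagged PROPN that are directly followed by a VERB."""
    found = []
    for token in doc:
        # Guard first: the last token has no successor to inspect
        if token.i + 1 < len(doc) and token.pos_ == "PROPN":
            if doc[token.i + 1].pos_ == "VERB":
                found.append(token)
    return found


# A text that ends in a proper noun no longer triggers an out-of-range lookup
tags = [("I", "PRON"), ("visited", "VERB"), ("Berlin", "PROPN")]
doc = [FakeToken(i, text, pos) for i, (text, pos) in enumerate(tags)]
print([token.text for token in proper_nouns_before_verbs(doc)])  # []
```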

## 8. Word vectors and semantic similarity
GOALS:

1. learn how to use spaCy to predict how similar documents, spans or tokens are

2. learn how to use word vectors and how to take advantage of them in your NLP application

### Comparing semantic similarity
* spaCy can compare two objects and predict similarity
* `Doc.similarity()`, `Span.similarity()` and `Token.similarity()`
* Take another object and return a similarity score (typically between `0` and `1`)
* **Important**: needs a model that has word vectors included, for example:
    - (x) `en_core_web_md` (medium model)
    - (x) `en_core_web_lg` (large model)
    - ( ) **NOT** `en_core_web_sm` (small model)

**Notes**
* spaCy can compare two objects and predict how similar they are (EX. documents, spans, or single tokens)
* `Doc`, `Token`, and `Span` objects have a `.similarity` method that takes another object and returns a floating point number, typically between 0 and 1, indicating how similar they are
* in order to use similarity, you need a larger spaCy model that has word vectors included
    - EX. medium or large English model but NOT the small one
    - if you want to use vectors, always go with a model that ends with "md" or "lg"
* find more details in the [models documentation](https://spacy.io/models)

### Similarity examples (1)

In [2]:
import spacy

# Load a larger model with vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8627204117787385


In [3]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


### Similarity examples (2)

You can compare different types of objects as well.

In [4]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]
print(doc.similarity(token))

0.32531983166759537


In [5]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


### How does spaCy predict similarity?
* Similarity is determined using **word vectors**
* Multi-dimensional meaning representations of words
* Generated using an algorithm like [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)
* Can be added to spaCy's statistical models
* DEFAULT: cosine similarity, but can be adjusted
* `Doc` and `Span` vectors default to average of token vectors
* Short phrases are better than long documents with many irrelevant words

**Notes**
* Similarity is determined using word vectors
    - multi-dimensional representations of meanings of words
    - EX: Word2Vec, an algorithm that's often used to train word vectors from raw text
* Vectors can be added to spaCy's statistical models
* By default, the similarity returned by spaCy is the cosine similarity between two vectors, but can be adjusted if necessary
* Vectors for objects consisting of several tokens, like the `Doc` and `Span`, default to the average of their token vectors
    - also why you usually get more value out of shorter phrases with fewer irrelevant words
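
The two ingredients named above, cosine similarity and averaging token vectors, can be sketched in plain Python. The 3-dimensional vectors are made up for illustration (real models use far more dimensions), and `cosine`, `average` and `doc_vector` are hypothetical helper names rather than spaCy functions.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up toy vectors, NOT taken from any real model
toy_vectors = {
    "I": [0.1, 0.0, 0.2],
    "like": [0.3, 0.5, 0.1],
    "pizza": [0.9, 0.1, 0.4],
    "pasta": [0.8, 0.2, 0.5],
    "soap": [0.0, 0.9, 0.1],
}


def doc_vector(words):
    """Vector for several tokens: the average of their token vectors."""
    return average([toy_vectors[word] for word in words])


# Token vs. token
print(cosine(toy_vectors["pizza"], toy_vectors["pasta"]))
# Multi-token text vs. multi-token text, using the averaged vectors
print(cosine(doc_vector(["I", "like", "pizza"]), doc_vector(["I", "like", "soap"])))
```

Because the shared words "I" and "like" contribute to both averages, the two three-word texts end up closer to each other than "pizza" and "soap" are on their own, which is one way to see why extra words dilute the comparison.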

### Word vectors in spaCy

In [6]:
# Load a larger model with vectors
nlp = spacy.load("en_core_web_md")

doc = nlp("I have a banana")

# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

**Notes**
* We can process a text and look up a token's vector using the `.vector` attribute
* The result is a 300-dimensional vector for the word "banana"

### Similarity depends on the application context
* Useful for many applications: recommendation systems, flagging duplicates, etc.
* There's no objective definition of "similarity"
* Depends on the context and what the application needs to do

In [7]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9501447503553421


**Notes**
* Predicting similarity can be useful for many types of applications
    - EX: to recommend a user similar texts based on the ones they have read
    - EX: flag duplicate content like posts on an online platform
* Important to keep in mind there's no objective definition of what's similar and what isn't
    - Always depends on the context and what your application needs to do
* EX: High similarity score in above example
    - Score makes sense because both texts express sentiment about cats
    - However, in different application context, you might want to consider the phrases as very *dissimilar* because they talk about opposite sentiments

## 9. Inspecting word vectors

In [8]:
import spacy

# Load the en_core_web_md model
nlp = spacy.load("en_core_web_md")

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

## 10. Comparing similarities

In [9]:
# Part 1
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


In [10]:
# Part 2
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [14]:
# Part 3
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[-4:-1]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.75173926


The similarities are not always this conclusive. Once you're getting serious about developing NLP applications that leverage semantic similarity, you might want to train vectors on your own data, or tweak the similarity algorithm.

## 11. Combining models and rules

### Statistical predictions vs. rules
**Notes**
* statistical models are useful if your application needs to be able to generalize based on a few examples
    - EX. detecting product or person names usually benefits from a statistical model
    - i.e. instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name
    - similarly, you can predict dependency labels to find subject/object relationships
* to do this, you would use spaCy's entity recognizer, dependency parser or part-of-speech tagger
* rule-based approaches come in handy if there's a more or less finite number of instances you want to find
    - EX: all countries or cities of the world, drug names or even dog breeds
    - in spaCy you can achieve this with custom tokenization rules as well as the matcher and phrase matcher

| | Statistical models | Rule-based systems |
|---|---|---|
| Use cases | application needs to generalize based on examples | dictionary with finite number of examples |
| Real-world examples | product names, person names, subject/object relationships | countries of the world, cities, drug names, dog breeds |
| spaCy features | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, `Matcher`, `PhraseMatcher` |

### Recap: Rule-based Matching

In [15]:
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"LOWER": "cats"}]
matcher.add("LOVE_CATS", None, pattern)

# Operators can specify how often a token should be matched
pattern = [{"TEXT": "very", "OP": "+"}, {"TEXT": "happy"}]
matcher.add("VERY_HAPPY", None, pattern)

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

### Adding statistical predictions

In [17]:
matcher = Matcher(nlp.vocab)
matcher.add("DOG", None, [{"LOWER": "golden"}, {"LOWER": "retriever"}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)
    # Get the span's root token and root head token
    print("Root token:", span.root.text)
    print("Root head token:", span.root.head.text)
    # Get the previous token and its POS tag
    print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


**Notes**
* Above EX: "golden retriever"
* If we iterate over the matches returned by the matcher, we can get the match ID and the start and end index of the matched span & find out more about it
    - `Span` objects give us access to the original document and all other token attributes and linguistic features predicted by the model
* EX: we can get the span's root token
    - if span consists of more than one token, this will be the token that decides the category of the phrase
    - EX: the root of "Golden Retriever" is "Retriever"
* we can also find the head token of the root
    - this is the syntactic "parent" that governs the phrase i.e. the verb "have"
* finally, we can look at the previous token and its attributes
    - it's a determiner, the article "a"

### Efficient phrase matching (1)
* `PhraseMatcher` is like regular expressions or keyword search - but with access to the tokens!
* takes `Doc` objects as patterns
* more efficient and faster than the `Matcher`
* great for matching large word lists

**Notes**
* the phrase matcher is another helpful tool to find sequences of words in your data
* it performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context
* it takes `Doc` objects as patterns AND it's also really fast
* this makes it very useful for matching large dictionaries and word lists on large volumes of text

### Efficient phrase matching (2)

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")

from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: Golden Retriever


* the phrase matcher can be imported from `spacy.matcher` and follows the same API as the regular matcher
* instead of a list of dictionaries, we pass in a `Doc` object as the pattern
* we can then iterate over the matches in the text, which gives us the match ID, and the start and end of the match
* this lets us create a `Span` object for the matched tokens "Golden Retriever" to analyze it in context
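
The idea of exact phrase matching over tokens can be sketched without spaCy. This toy version is an illustration of the concept only and makes no claim about how `PhraseMatcher` is implemented: `find_phrases` is a hypothetical helper, and whitespace splitting stands in for real tokenization.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) for every exact occurrence of a phrase (a tuple of tokens)."""
    phrase_set = set(phrases)
    lengths = sorted({len(phrase) for phrase in phrases})
    matches = []
    for start in range(len(tokens)):
        for size in lengths:
            end = start + size
            # A set lookup replaces checking each phrase one by one
            if end <= len(tokens) and tuple(tokens[start:end]) in phrase_set:
                matches.append((start, end))
    return matches


# Whitespace splitting is a crude stand-in for a tokenizer
phrases = [tuple(name.split()) for name in ["Golden Retriever", "German Shepherd"]]
tokens = "I have a Golden Retriever".split()

for start, end in find_phrases(tokens, phrases):
    print(start, end, " ".join(tokens[start:end]))  # 3 5 Golden Retriever
```

As with the real matcher, the result is a pair of token indices rather than a bare string, so the match can still be inspected in its context.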

## 12. Debugging patterns (1)

In [3]:
# Why does this pattern not match the tokens "Silicon Valley" in the `doc`?
pattern = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]

doc = nlp("Can Silicon Valley workers rein in big tech from within?")

( ) The tokens "Silicon" and "Valley" are not lowercase, so the "LOWER" attribute won't match
- **Incorrect**: The "LOWER" attribute in the pattern describes tokens whose lowercase form matches the given value

(x) The tokenizer doesn't create tokens for single spaces, so there's no token with the value " " in between.

( ) The tokens are missing an operator "OP" to indicate that they should be matched exactly once.

**EXPLANATION**: The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token
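
The "one dictionary per token" rule can be made concrete with a toy matcher in plain Python. It is a deliberately simplified sketch: `token_matches` and `find_matches` are hypothetical helpers that understand only the `TEXT` and `LOWER` keys (anything else is ignored, and operators are not supported), and whitespace splitting stands in for the tokenizer.

```python
def token_matches(token, spec):
    """Check one token string against one pattern dictionary (TEXT and LOWER only)."""
    for attr, value in spec.items():
        if attr == "TEXT" and token != value:
            return False
        if attr == "LOWER" and token.lower() != value:
            return False
    return True


def find_matches(tokens, pattern):
    """Return (start, end) pairs where each pattern entry matches one consecutive token."""
    size = len(pattern)
    return [
        (start, start + size)
        for start in range(len(tokens) - size + 1)
        if all(
            token_matches(token, spec)
            for token, spec in zip(tokens[start:start + size], pattern)
        )
    ]


# Splitting on whitespace yields no " " tokens, so nothing can match {"TEXT": " "}
tokens = "Can Silicon Valley workers rein in big tech from within?".split()

broken = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]
fixed = [{"LOWER": "silicon"}, {"LOWER": "valley"}]

print(find_matches(tokens, broken))  # []
print(find_matches(tokens, fixed))   # [(1, 3)]
```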

## 13. Debugging patterns (2)

Both patterns in this exercise contain mistakes and won't match as expected. Can you fix them? If you get stuck, try printing the tokens in the `doc` to see how the text will be split and adjust the pattern so that each dictionary represents one token.

1. Edit `pattern1` so that it correctly matches all case-insensitive mentions of `"Amazon"` plus a title-cased proper noun.

2. Edit `pattern2` so that it correctly matches all case-insensitive mentions of `"ad-free"`, plus the following noun.


In [5]:
# ORIGINAL CODE
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "Amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad-free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

In [9]:
# NEW CODE
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"IS_PUNCT": True}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


**Correct!**: For the token '-', you can match on the attribute 'TEXT', 'LOWER', or even 'SHAPE'. All of those are correct. As you can see, paying close attention to the tokenization is very important when working with the token-based 'Matcher'. Sometimes it's much easier to just match exact strings instead and use the 'PhraseMatcher'.
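
To get a feel for why `"ad-free"` needs three pattern entries, a rough regular-expression split shows the same three pieces. This is only an approximation written for this note (`rough_tokenize` is a hypothetical helper); spaCy's tokenizer uses its own rules and will differ on other inputs, so printing the tokens of the actual `doc` remains the reliable check.

```python
import re


def rough_tokenize(text):
    """Split into runs of word characters and single punctuation marks (approximation only)."""
    return re.findall(r"\w+|[^\w\s]", text)


print(rough_tokenize("ad-free viewing"))  # ['ad', '-', 'free', 'viewing']
```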

## 14. Efficient phrase matching

Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things - like all countries of the world. We already have a list of countries, so let's use this as the basis of our information extraction script. A list of string names is available as the variable `COUNTRIES`.

* Import the `PhraseMatcher` and initialize it with the shared `vocab` as the variable `matcher`
* Add the phrase patterns and call the matcher on the `doc`

In [None]:
import json
from spacy.lang.en import English

with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

## 15. Extracting countries and relationships

Now we'll use the country matcher from the previous exercise on a longer text, analyze the syntax, and update the document's entities with the matched countries.

In [None]:
# (did not run)

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())
with open("exercises/en/country_text.txt", encoding="utf8") as f:
    TEXT = f.read()

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Create a doc and reset existing entities
doc = nlp(TEXT)
doc.ents = []

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])