# Chapter 2: Large-scale data analysis with spaCy

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

## Data Structures (1): Vocab, Lexemes and StringStore

resources: [slides](slides/chapter2_01_data-structures-1.md)

Welcome back! Now that you've had some real experience using spaCy's objects, it's time for you to learn more about what's actually going on under spaCy's hood.

In this lesson, we'll take a look at the shared vocabulary and how spaCy deals with strings.

### Shared vocab and string store

- `Vocab`: stores data shared across multiple documents
    - This includes words, but also the label schemes for tags and entities.
- To save memory, spaCy encodes all strings to hash values
- Strings are only stored once in the `StringStore` via `nlp.vocab.strings`
- String store: lookup table in both directions
    - Internally, spaCy only communicates in hash IDs.
    - Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.

In [43]:
# Import the English language class
from spacy.lang.en import English

nlp = English()

coffee_hash = nlp.vocab.strings['coffee']
print(coffee_hash)

# Raises an error if we haven't seen the string before
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_string)

3197928453018144401


KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the `Vocab` or `StringStore`."

In [28]:
doc = nlp('I love coffee')
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[nlp.vocab.strings['coffee']])

hash value: 3197928453018144401
string value: coffee


In [24]:
# The doc also exposes the vocab and strings
doc = nlp('I love coffee')
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401
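
The two-way lookup can be pictured with a small pure-Python sketch. This is a toy stand-in and not spaCy's implementation: the `ToyStringStore` class is hypothetical, and `hashlib.blake2b` is only an assumed replacement for the hash function spaCy uses, so the IDs will differ from the ones printed above. It shows why a hash alone is not enough: any string can be hashed, but a hash can only be turned back into a string if the store has seen that string before.

```python
import hashlib


class ToyStringStore:
    """Hypothetical stand-in for a string store: maps hashes back to strings."""

    def __init__(self):
        self._strings = {}  # hash -> string, each string is stored only once

    @staticmethod
    def hash_string(text):
        # Assumption: any stable 64-bit hash is good enough for the illustration
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def add(self, text):
        key = self.hash_string(text)
        self._strings[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash always works, even for strings never added
            return self.hash_string(key)
        # Hash -> string only works if the string was added before
        return self._strings[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])              # looks the string up by its hash
print(store["coffee"] == coffee_hash)  # hashing is deterministic

try:
    store[store["tea"]]  # "tea" was never added to the store
except KeyError:
    print("no string stored for this hash")
```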


### Lexemes: entries in the vocabulary


- A `Lexeme` object is an entry in the vocabulary
- Contains the context-independent information about a word
    - Word text: `lexeme.text` and `lexeme.orth` (the hash)
    - Lexical attributes like `lexeme.is_alpha`
    - **Not** context-dependent part-of-speech tags, dependencies or entity labels

In [29]:
doc = nlp('I love coffee')
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


### Vocab, hashes and lexemes

Here's an example.

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.

Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.

![example](slides/static/vocab_stringstore.png)
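
The split between context-independent lexemes and context-dependent tokens can be sketched in plain Python. The classes below (`ToyLexeme`, `ToyToken`, `ToyVocab`) are hypothetical and much simpler than spaCy's: for brevity the toy vocab is keyed by the word text rather than by a hash, and the part-of-speech tags are written by hand instead of being predicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    """Context-independent vocabulary entry, shared by every occurrence."""
    text: str
    is_alpha: bool


@dataclass
class ToyToken:
    """A word in context: points at a shared lexeme and adds its own tag."""
    lexeme: ToyLexeme
    pos: str


class ToyVocab:
    def __init__(self):
        self._lexemes = {}

    def __getitem__(self, text):
        # Create the entry on first access, then reuse the same object
        if text not in self._lexemes:
            self._lexemes[text] = ToyLexeme(text, text.isalpha())
        return self._lexemes[text]

    def __len__(self):
        return len(self._lexemes)


vocab = ToyVocab()

# Two "documents" that both contain the word "love", tagged differently
doc1 = [ToyToken(vocab["I"], "PRON"), ToyToken(vocab["love"], "VERB"),
        ToyToken(vocab["coffee"], "NOUN")]
doc2 = [ToyToken(vocab["love"], "NOUN"), ToyToken(vocab["!"], "PUNCT")]

print(doc1[1].lexeme is doc2[0].lexeme)  # True: one shared lexeme
print(doc1[1].pos, doc2[0].pos)          # VERB NOUN: the tag lives on the token
print(len(vocab))                        # 4 entries: I, love, coffee, !
```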

## Data Structures (2): Doc, Span and Token

resources: [slides](slides/chapter2_02_data-structures-2.md)

Now that you know all about the vocabulary and string store, we can take a look at the most important data structure: the Doc, and its views Token and Span.

### The Doc object

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.

After creating the `nlp` object, we can import the `Doc` class from `spacy.tokens`.

Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.

In [42]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Hello world!
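
To see why the `spaces` list is enough to recover the original text, here is a minimal sketch in plain Python. The `join_words` helper is hypothetical and only mirrors the idea: each word is followed by a single space if its flag is `True`, and by nothing otherwise.

```python
def join_words(words, spaces):
    """Rebuild text from words and trailing-space flags (hypothetical helper)."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "")
                   for word, space in zip(words, spaces))


words = ["Hello", "world", "!"]
spaces = [True, False, False]
print(join_words(words, spaces))  # Hello world!

# Flipping a flag changes the text without changing the words
print(join_words(words, [True, True, False]))  # Hello world !
```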


### The Span object

A Span is a slice of a Doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!

![span](slides/static/span_indices.png)

To create a Span manually, we can also import the class from `spacy.tokens`. We can then instantiate it with the doc and the span's start and end index, and an optional label argument.

The `doc.ents` attribute is writable, so we can add entities manually by overwriting it with a list of spans.

In [41]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label='GREETING')

# Add span to the doc.ents
doc.ents = [span_with_label]
print(doc.ents)

(Hello world,)
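
The exclusive end index behaves like a Python slice. The sketch below uses a hypothetical `ToySpan` class over a plain list of strings to show the idea of a view that stores only the underlying tokens plus two indices. As a simplification, it joins tokens with a single space instead of tracking per-token whitespace.

```python
class ToySpan:
    """Hypothetical view over a token list: keeps the list and two indices."""

    def __init__(self, tokens, start, end, label=""):
        self.tokens = tokens
        self.start = start
        self.end = end  # exclusive, like the end of a Python slice
        self.label = label

    def __len__(self):
        return self.end - self.start

    def __iter__(self):
        return iter(self.tokens[self.start:self.end])

    @property
    def text(self):
        # Simplification: always join with one space
        return " ".join(self)


tokens = ["Hello", "world", "!"]
span = ToySpan(tokens, 0, 2, label="GREETING")

print(span.text)         # Hello world
print(len(span))         # 2
print(tokens[span.end])  # "!" sits at the end index and is not in the span
```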


### Best practices

- `Doc` and `Span` are very powerful and hold references to words and sentences and the relationships between them
    - **Convert results to strings as late as possible**. If you do it too early, you'll lose all relationships between the tokens.
    - **Use token attributes if available** – for example, `token.i` for the token index
- Don't forget to pass in the shared `vocab`

## Word vectors and semantic similarity

resources: [slides](slides/chapter2_03_word-vectors-similarity.md)

In this lesson, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other.

You'll also learn about how to use word vectors and how to take advantage of them in your NLP application.

### Comparing semantic similarity

- spaCy can compare two objects and predict how similar they are
- `Doc.similarity()`, `Span.similarity()` and `Token.similarity()`
- Each takes another object and returns a similarity score (typically `0` to `1`)
- Important: needs a model that has word vectors included, for example:
    - ✅ `en_core_web_md` (medium model)
    - ✅ `en_core_web_lg` (large model)
    - 🚫 NOT `en_core_web_sm` (small model)
- English Model documentation: https://spacy.io/models/en

### Similarity examples

In [5]:
import spacy

# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')

# Compare two documents
doc1 = nlp('I like fast food')
doc2 = nlp('I like pizza')
print(doc1.similarity(doc2))

# Compare two tokens
doc = nlp('I like pizza and pasta')
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

# Compare a document with a token
doc = nlp('I like pizza')
token = nlp('soap')[0]

print(doc.similarity(token))
print(token.similarity(doc))

# Compare a span with a document
span = nlp('I like pizza and pasta')[2:5]
doc = nlp('McDonalds sells burgers')

print(span.similarity(doc))

0.8627204117787385
0.7369546
0.32531983166759537
0.32531983166759537
0.6199092090831612


### How does spaCy predict similarity?

- Similarity is determined using **word vectors**
- Multi-dimensional meaning representations of words
- Generated using an algorithm like [Word2Vec](https://www.wikiwand.com/en/Word2vec) and lots of text
- Can be added to spaCy's statistical models
- Default: cosine similarity, but can be adjusted
- `Doc` and `Span` vectors default to average of token vectors
- Short phrases are better than long documents with many irrelevant words
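
As a rough sketch of the two ideas in this list, the snippet below computes cosine similarity and an averaged "document" vector in plain Python. The three-dimensional vectors are made up purely for illustration (real models use hundreds of dimensions), and the `average` and `cosine` helpers are hypothetical, not spaCy functions.

```python
import math


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat all-zero vectors as "not similar"
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}

# Vectors pointing in a similar direction score close to 1
print(cosine(toy_vectors["pizza"], toy_vectors["pasta"]))
# Vectors pointing in different directions score close to 0
print(cosine(toy_vectors["pizza"], toy_vectors["soap"]))

# A "document" vector as the average of its token vectors
doc_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])
print(cosine(doc_vector, toy_vectors["soap"]))
```

Because the document vector is an average, every extra token pulls it in its own direction, which is one way to picture why many irrelevant words blur the result.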

### Word vectors in spaCy

In [13]:
import spacy

# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')

doc = nlp('I have a banana')

# Access the vector via the token.vector attribute
token_vector = doc[3].vector
print(f"dimensions: {len(token_vector)}")
print(token_vector)

dimensions: 300
[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-0

### Similarity depends on the application context

- Useful for many applications: recommendation systems, flagging duplicates etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do
    - Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

In [14]:
doc1 = nlp('I like cats')
doc2 = nlp('I hate cats')

print(doc1.similarity(doc2))

0.9501447503553421


## Combining models and rules

resources: [slides](slides/chapter2_04_models-rules.md)

Combining statistical models with rule-based systems is one of the most powerful tricks you should have in your NLP toolbox.

In this lesson, we'll take a look at how to do it with spaCy.

### Statistical predictions vs. rules

||Statistical models|Rule-based systems|
|---|---|---|
|Use cases|application needs to generalize based on examples|dictionary with finite number of examples|
|Real-world examples|product names, person names, subject/object relationships|countries of the world, cities, drug names, dog breeds|
|spaCy features|entity recognizer, dependency parser, part-of-speech tagger|tokenizer, Matcher, PhraseMatcher|

### Recap: Rule-based Matching

In [15]:
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]
matcher.add('VERY_HAPPY', None, pattern)

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp('I love cats and I\'m very very happy')
matches = matcher(doc)
print(matches)

[(9137535031263442622, 1, 3), (2447047934687575526, 7, 9), (2447047934687575526, 6, 9)]
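
To make the "list of dictionaries" idea concrete, here is a deliberately simplified matcher in plain Python. It is a toy and not how spaCy's `Matcher` is implemented: the `match_patterns` helper is hypothetical, it returns pattern names instead of hash IDs, it does not support operators like `'OP': '+'`, and the token attributes are written by hand rather than predicted by a model.

```python
def match_patterns(tokens, patterns):
    """Hypothetical helper: slide each pattern over the tokens.

    tokens   -- list of dicts with attributes such as TEXT, LOWER, LEMMA, POS
    patterns -- dict mapping a name to a list of attribute dicts
    Returns (name, start, end) tuples with an exclusive end index.
    """
    matches = []
    for name, pattern in patterns.items():
        width = len(pattern)
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            # Every attribute in every pattern dict must equal the token's value
            if all(
                token.get(attr) == value
                for token, spec in zip(window, pattern)
                for attr, value in spec.items()
            ):
                matches.append((name, start, start + width))
    return matches


# Hand-written token attributes; a real pipeline would predict LEMMA and POS
tokens = [
    {"TEXT": "I", "LOWER": "i", "LEMMA": "I", "POS": "PRON"},
    {"TEXT": "love", "LOWER": "love", "LEMMA": "love", "POS": "VERB"},
    {"TEXT": "cats", "LOWER": "cats", "LEMMA": "cat", "POS": "NOUN"},
]
patterns = {"LOVE_CATS": [{"LEMMA": "love", "POS": "VERB"}, {"LOWER": "cats"}]}

print(match_patterns(tokens, patterns))  # [('LOVE_CATS', 1, 3)]
```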


### Adding statistical predictions

In [16]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp('I have a Golden Retriever')

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


### Efficient phrase matching

- `PhraseMatcher` works like regular expressions or keyword search – but with access to the tokens!
- Takes `Doc` objects as patterns
- More efficient and faster than the `Matcher`
- Great for matching large word lists

In [17]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp('Golden Retriever')
matcher.add('DOG', None, pattern)
doc = nlp('I have a Golden Retriever')

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever
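
One way to picture why exact phrase lookup scales well to large word lists is a set of token tuples. The sketch below is a toy under stated assumptions, not spaCy's `PhraseMatcher` algorithm: `build_phrase_index` and `find_phrases` are hypothetical helpers, tokenization is a plain whitespace split, and the phrase list is invented for the example.

```python
def build_phrase_index(phrases):
    """Index phrases as token tuples, grouped by phrase length."""
    index = {}
    for phrase in phrases:
        key = tuple(phrase.split())  # assumption: whitespace tokenization
        index.setdefault(len(key), set()).add(key)
    return index


def find_phrases(tokens, index):
    """Return (start, end) pairs whose tokens exactly match an indexed phrase."""
    matches = []
    for length, keys in index.items():
        for start in range(len(tokens) - length + 1):
            # One set lookup per position, however many phrases are indexed
            if tuple(tokens[start:start + length]) in keys:
                matches.append((start, start + length))
    return sorted(matches)


# Invented phrase list for the example
index = build_phrase_index(["Golden Retriever", "Yorkshire Terrier", "Poodle"])
tokens = "I have a Golden Retriever".split()

for start, end in find_phrases(tokens, index):
    print("Matched span:", " ".join(tokens[start:end]))
```

In this toy version the work per document depends on the number of distinct phrase lengths rather than on the number of phrases, which is why adding thousands of entries stays cheap.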


In [18]:
import json
from spacy.lang.en import English

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]


In [19]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())
with open("exercises/country_text.txt") as f:
    TEXT = f.read()

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Create a doc and find matches in it
doc = nlp(TEXT)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label='GPE')

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

Namibia --> Namibia
South --> South Africa
Cambodia --> Cambodia
Kuwait --> Kuwait
Somalia --> Somalia
Haiti --> Haiti
Mozambique --> Mozambique
Somalia --> Somalia
Rwanda --> Rwanda
Singapore --> Singapore
Sierra --> Sierra Leone
Afghanistan --> Afghanistan
Iraq --> Iraq
Sudan --> Sudan
Congo --> Congo
Haiti --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
