# Chapter 2: Large-scale data analysis with spaCy

## Data structures 1

Now that you've had some real experience using spaCy's objects, it's time for you to learn more about what's actually going on under spaCy's hood. In this lesson, we'll take a look at the shared vocabulary and how spaCy deals with strings.

spaCy stores all shared data in a vocabulary, the Vocab. This includes words, but also the label schemes for tags and entities. To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time. Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as `nlp.vocab.strings`. It's a lookup table that works in both directions: you can look up a string to get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs. A hash on its own can't be reversed, though: if a word is not in the string store, there's no way to get its string back from the hash. That's why we always need to pass around the shared vocab.

To get the hash for a string, we can look it up in `nlp.vocab.strings`. To get the string representation of a hash, we can look up the hash. A Doc object also exposes its vocab and strings.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

# From string to hash
coffee_hash = nlp.vocab.strings['coffee']
print(coffee_hash)

# This doesn't work
# coffee_string = nlp.vocab.strings[coffee_hash]

3197928453018144401


However, if we try to reverse the hash at this point, we get an error, because this hash is not in the string store yet. For a string to be added to the string store, `nlp` must first *see* it in a document.

In [2]:
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee
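
To make the two-way lookup more concrete, here is a toy string store in plain Python. It is only a sketch of the idea: the class name `ToyStringStore`, its `add` method and the choice of `hashlib.blake2b` are assumptions made for illustration, not spaCy's implementation or its hash function. It shows why hashing a string always works, while going from a hash back to a string only works once the string has been stored.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup table (illustrative, not spaCy's implementation)."""

    def __init__(self):
        self._strings = {}  # hash ID -> string

    @staticmethod
    def hash_string(text):
        # Assumption: any stable 64-bit hash is good enough for this sketch
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        # Store the string once, keyed by its hash
        key = self.hash_string(text)
        self._strings[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash: can always be computed
            return self.hash_string(key)
        # Hash -> string: only works if the string was added before
        return self._strings[key]


store = ToyStringStore()
coffee_hash = store["coffee"]  # hashing works without storing anything

try:
    store[coffee_hash]         # the string was never stored
except KeyError:
    print("hash not in the store yet")

store.add("coffee")            # comparable to nlp seeing the word in a text
print(store[coffee_hash])      # coffee
```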


**Lexemes** are context-independent entries in the vocabulary. You can get a lexeme by looking up a string or a hash ID in the vocab. Lexemes expose attributes, just like tokens. They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters. Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context. From the documentation:

>An entry in the vocabulary. A `Lexeme` has no string context – it's a
 word-type, as opposed to a word token.  It therefore has no part-of-speech
 tag, dependency parse, or lemma (lemmatization depends on the
 part-of-speech tag).

In [3]:
lexeme = nlp.vocab['coffee']

# Print the lexical attributes. lexeme.orth is the hash
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


In the example below, the Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies. Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store. In other words, the string store behaves like a set of key-value pairs mapping hash IDs to strings.
![string_store](fig/string_store.png)
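
The relationship between tokens, lexemes and the string store can be sketched with a few plain Python objects. Everything here is hypothetical and simplified: `ToyLexeme`, `ToyToken`, `get_lexeme` and `token_text` are made-up names, `zlib.crc32` stands in for spaCy's hash function, and the part-of-speech tags are written by hand. The point is only that the context-dependent tag lives on the token, while the lexeme is shared by every occurrence of the word.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    orth: int       # hash ID of the word's text
    is_alpha: bool  # context-independent attribute


@dataclass(frozen=True)
class ToyToken:
    lexeme: ToyLexeme  # shared, context-independent entry
    pos: str           # context-dependent annotation


string_store = {}  # hash ID -> string
lexemes = {}       # hash ID -> ToyLexeme


def get_lexeme(text):
    # Stand-in hash function; spaCy uses its own
    key = zlib.crc32(text.encode("utf8"))
    string_store[key] = text
    if key not in lexemes:
        lexemes[key] = ToyLexeme(orth=key, is_alpha=text.isalpha())
    return lexemes[key]


def token_text(token):
    # Token -> lexeme -> hash ID -> string store
    return string_store[token.lexeme.orth]


doc = [
    ToyToken(get_lexeme("I"), "PRON"),
    ToyToken(get_lexeme("love"), "VERB"),
    ToyToken(get_lexeme("coffee"), "NOUN"),
]
# The same word in another context: same lexeme, different tag
other = ToyToken(get_lexeme("love"), "NOUN")

print([token_text(t) for t in doc])    # ['I', 'love', 'coffee']
print(doc[1].lexeme is other.lexeme)   # True
```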

## Data structures 2 

Now that you know all about the vocabulary and string store, we can take a look at the most important data structure: the Doc, and its views Token and Span.

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.
After creating the nlp object, we can import the Doc class from `spacy.tokens`.
Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is *followed* by a space. Every token includes that information – even the last one! The Doc class takes three arguments: the shared vocab, the words and the spaces.

In [4]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc

Hello world!
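
To see why the `spaces` list is enough to recover the original text, here is a small plain-Python sketch. The helper `join_tokens` is a hypothetical name used only for this illustration; it is not part of spaCy.

```python
words = ["Hello", "world", "!"]
spaces = [True, False, False]


def join_tokens(words, spaces):
    """Rebuild the text: each word is followed by a space only if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


print(join_tokens(words, spaces))  # Hello world!
```

Because every token carries its own flag, including the last one, the text can be rebuilt without guessing where whitespace belongs.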

A **span** (shown below) is a slice of a Doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!
![span](fig/span.png)

To create a Span manually, we can also import the class from `spacy.tokens`. We can then instantiate it with the doc and the span's start and end index. To add an entity label to the span, we pass it as the `label` argument; in the code below it is given as a plain string, and like all strings it is represented internally by its hash ID in the string store. `doc.ents` is writable, so we can add entities manually by overwriting it with a list of spans.

In [5]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]
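
The exclusive end index follows the same convention as Python slices. The sketch below mimics it with a hypothetical `ToySpan` tuple over a plain list of words. It is not spaCy's `Span`, and joining the words with single spaces is a simplification made for this example.

```python
from typing import List, NamedTuple


class ToySpan(NamedTuple):
    words: List[str]  # the full list of words (stands in for the doc)
    start: int        # index of the first token in the span
    end: int          # index one past the last token (exclusive)
    label: str = ""

    @property
    def text(self):
        # Simplification: join with single spaces instead of tracking whitespace
        return " ".join(self.words[self.start:self.end])


words = ["Hello", "world", "!"]
span = ToySpan(words, 0, 2, label="GREETING")

print(span.text)              # Hello world
print(span.end - span.start)  # 2 (the token at index 2, "!", is not included)
```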

A few tips and tricks before we get started: The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences. If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.
To keep things consistent, try to use built-in token attributes wherever possible. For example, `token.i` for the token index. Also, don't forget to always pass in the shared vocab! Let's see another example.

In [6]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


Let's compare the two code snippets below. The first one is inefficient, as it converts tokens to text too early and doesn't take advantage of the `.pos_` and `.i` attributes.

In [7]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Berlin


This second version is much better, as it leverages the built-in capabilities.

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            result = token.text
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Berlin
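
One caveat applies to both versions: looking at position `index + 1` or `token.i + 1` goes past the end of the sequence if the last token happens to be a proper noun. The sketch below shows one way to avoid that by pairing each item with its successor. It works on hand-written `(text, tag)` tuples rather than a real Doc, and the function name and the tags are assumptions for illustration only.

```python
def proper_nouns_before_verbs(tagged):
    """Return the texts of proper nouns that are directly followed by a verb."""
    found = []
    # zip stops at the shorter sequence, so the last item is never "current"
    for (text, pos), (_, next_pos) in zip(tagged, tagged[1:]):
        if pos == "PROPN" and next_pos == "VERB":
            found.append(text)
    return found


# Hand-written tags, not model predictions
print(proper_nouns_before_verbs([("Berlin", "PROPN"), ("looks", "VERB"), ("nice", "ADJ")]))
# ['Berlin']
print(proper_nouns_before_verbs([("I", "PRON"), ("visited", "VERB"), ("Berlin", "PROPN")]))
# [] (no error, although the last token is a proper noun)
```

With a real Doc, the same effect can be had by checking `token.i + 1 < len(doc)` before indexing.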


## Word vectors and semantic similarity

In this lesson, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other.
You'll also learn about how to use word vectors and how to take advantage of them in your NLP application.

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens. The Doc, Token and Span objects have a `.similarity()` method that takes another object and returns a floating point number, typically between 0 and 1, indicating how similar they are. One thing that's **very important**: In order to use similarity, you need a larger spaCy model that has word vectors included. For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the models documentation.

Here's an example. Let's say we want to find out whether two documents are similar. First, we load the medium English model, "en_core_web_md". We can then create two doc objects and use the first doc's similarity method to compare it to the second. Here, a fairly high similarity score of 0.86 is predicted for "I like fast food" and "I like pizza". The same works for tokens. According to the word vectors, the tokens "pizza" and "pasta" are kind of similar, and receive a score of 0.7.

In [9]:
# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")

print(doc1.similarity(doc2))

0.8627204117787385


In [10]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


From the documentation of `doc1.similarity`:
> Make a semantic similarity estimate. The default estimate is cosine
similarity using an average of word vectors.


You can also use the similarity methods to compare different types of objects. For example, a document and a token. Here, the similarity score is pretty low and the two objects are considered fairly dissimilar. Here's another example comparing a span – "pizza and pasta" – to a document about McDonalds. The score returned here is 0.61, so it's determined to be kind of similar.

In [15]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]
print(doc.similarity(token))

# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")
print(span.similarity(doc))

0.32531983166759537
0.6199092090831612


Let's test document similarity on a more challenging task. The first two sentences are closer from a semantic point of view, but the scores barely separate them: the unrelated third sentence is also rated as very similar to both of the others.

In [19]:
doc1 = nlp("I think this product is great")
doc2 = nlp("This toaster is pretty good")
doc3 = nlp("This is the cat that ate my homework")

print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))

0.8581527458617464
0.8329065710857103
0.7989105055776513


But how does spaCy do this under the hood? Similarity is determined using **word vectors**, multi-dimensional representations of meanings of words. You might have heard of Word2Vec, which is an algorithm that's often used to train word vectors from raw text. Vectors can be added to spaCy's statistical models. By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary. Vectors for objects consisting of several tokens, like the Doc and Span, default to the average of their token vectors. That's also why you usually get more value out of shorter phrases with fewer irrelevant words.
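
The two ingredients mentioned above, averaging token vectors and taking the cosine similarity, can be sketched in plain Python. The 3-dimensional vectors below are made up for illustration (real word vectors have hundreds of dimensions), and `average` and `cosine_similarity` are hypothetical helper names, not spaCy functions.

```python
import math


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Made-up vectors, purely for illustration
toy_vectors = {
    "I": [0.1, 0.3, 0.2],
    "like": [0.4, 0.1, 0.5],
    "pizza": [0.9, 0.7, 0.1],
    "pasta": [0.8, 0.8, 0.2],
}

# A "document vector" is the average of its token vectors
doc1 = average([toy_vectors[w] for w in ["I", "like", "pizza"]])
doc2 = average([toy_vectors[w] for w in ["I", "like", "pasta"]])

print(cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"]))
print(cosine_similarity(doc1, doc2))
```

Since the two toy documents share two of their three words, their averaged vectors end up close together. In general, cosine similarity ranges from -1 for vectors pointing in opposite directions to 1 for vectors pointing in the same direction.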

To give you an idea of what those vectors look like, here's an example. First, we load the medium model again, which ships with word vectors. Next, we can process a text and look up a token's vector using the `.vector` attribute. The result is a 300-dimensional vector of the word "banana" (we are printing only the first 20 entries).

In [21]:
# Load a larger model with vectors
# nlp = spacy.load('en_core_web_md')

doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector[:20])

[ 0.20228  -0.076618  0.37032   0.032845 -0.41957   0.072069 -0.37476
  0.05746  -0.012401  0.52949  -0.5238   -0.19771  -0.34147   0.53317
 -0.025331  0.1738    0.16772   0.83984   0.055107  0.10547 ]


Predicting similarity can be useful for many types of applications. For example, to recommend a user similar texts based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform. However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do. Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

In [22]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")
print(doc1.similarity(doc2))

0.9501447503553421


In [26]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

print(span1, '---', span2)

span1.similarity(span2)

great restaurant --- really nice bar


0.75173926

## Combining models and rules

Combining statistical models with rule-based systems is one of the most powerful tricks you should have in your NLP toolbox. In this lesson, we'll take a look at how to do it with spaCy.
Statistical models are useful if your application needs to be able to generalize based on a few examples. For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships. To do this, you would use spaCy's entity recognizer, dependency parser or part-of-speech tagger.

|	| Statistical models | Rule-based systems |
|---|--------------------|--------------------|
|Use cases | application needs to generalize based on examples | dictionary with finite number of examples |
| Real-world examples | product names, person names, subject/object relationships | countries of the world, cities, drug names, dog breeds | 
| spaCy features | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, Matcher, PhraseMatcher |

In the last chapter, you learned how to use spaCy's rule-based matcher to find complex patterns in your texts. Here's a quick recap. The matcher is initialized with the shared vocabulary – usually `nlp.vocab`. Patterns are lists of dictionaries, and each dictionary describes one token and its attributes. Patterns can be added to the matcher using the `matcher.add` method. Operators let you specify how often to match a token. For example, "+" will match one or more times. Calling the matcher on a doc object will return a list of the matches. Each match is a tuple consisting of an ID, and the start and end token index in the document.

In [28]:
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
# (this pattern is only shown for illustration; it is not added to the matcher)
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)
matches

[(9137535031263442622, 1, 3)]
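
To illustrate what "each dictionary describes one token" means, here is a deliberately simplified matcher in plain Python. It is a toy under stated assumptions: tokens are hand-written dictionaries of attributes, the function `match_pattern` is a made-up name, and operators such as "+" are not supported. It is not how spaCy's `Matcher` is implemented.

```python
def match_pattern(tokens, pattern):
    """Return (start, end) pairs where each pattern dict matches one token."""
    width = len(pattern)
    if width == 0:
        return []
    matches = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Every attribute in a pattern dict must equal the token's attribute
        if all(
            all(token.get(attr) == value for attr, value in spec.items())
            for token, spec in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches


# Hand-written token attributes, not model predictions
tokens = [
    {"TEXT": "I", "LOWER": "i", "LEMMA": "I", "POS": "PRON"},
    {"TEXT": "love", "LOWER": "love", "LEMMA": "love", "POS": "VERB"},
    {"TEXT": "cats", "LOWER": "cats", "LEMMA": "cat", "POS": "NOUN"},
]
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"LOWER": "cats"}]

print(match_pattern(tokens, pattern))  # [(1, 3)]
```

As in the matcher output above, the match covers the tokens at index 1 and 2, and the end index 3 is exclusive.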

Here's an example of a matcher rule for "golden retriever". If we iterate over the matches returned by the matcher, we can get the match ID and the start and end index of the matched span. We can then find out more about it. Span objects give us access to the original document and all other token attributes and linguistic features predicted by the model. For example, we can get the span's **root token**. If the span consists of more than one token, this will be the token that decides the category of the phrase. For example, the root of "Golden Retriever" is "Retriever". We can also find the **head token** of the root. This is the syntactic "parent" that governs the phrase – in this case, the verb "have". Finally, we can look at the previous token and its attributes. In this case, it's a determiner, the article "a".

In [31]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


The **phrase matcher** is another helpful tool to find sequences of words in your data.
It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context. It takes Doc objects as patterns. It's also really fast. This makes it very useful for matching large dictionaries and word lists on large volumes of text.

In [35]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Note that case matters: "golden retriever" is not found.
pattern = nlp("Golden Retriever")

matcher.add('DOG', None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever
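
The case sensitivity noted in the code comment can be reproduced with a small keyword search over a list of token texts. This is a plain-Python sketch with a hypothetical `find_phrase` helper, not the `PhraseMatcher` itself: by default it compares the exact token texts, and an optional flag lowercases both sides first.

```python
def find_phrase(tokens, phrase, lowercase=False):
    """Return (start, end) token index pairs where `phrase` occurs in `tokens`."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
        phrase = [t.lower() for t in phrase]
    width = len(phrase)
    if width == 0:
        return []
    return [
        (start, start + width)
        for start in range(len(tokens) - width + 1)
        if tokens[start:start + width] == phrase
    ]


tokens = ["I", "have", "a", "Golden", "Retriever"]

print(find_phrase(tokens, ["Golden", "Retriever"]))                  # [(3, 5)]
print(find_phrase(tokens, ["golden", "retriever"]))                  # []
print(find_phrase(tokens, ["golden", "retriever"], lowercase=True))  # [(3, 5)]
```

Recent spaCy versions let the `PhraseMatcher` compare on a different token attribute via its `attr` argument (for example `attr="LOWER"`); check the documentation for the version you are using before relying on it.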


Both patterns in this exercise contain mistakes and won’t match as expected. Can you fix them? If you get stuck, try printing the tokens in the doc to see how the text will be split and adjust the pattern so that each dictionary represents one token.

* Edit `pattern1` so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
* Edit `pattern2` so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.

In [40]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "Amazon"}, {"IS_TITLE": True}, {"POS": "PROPN"}]
pattern2 = [{"LOWER": "ad-free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

In [37]:
doc

Twitch Prime, the perks program for Amazon Prime members offering free loot, games and other benefits, is ditching one of its best features: ad-free viewing. According to an email sent out to Amazon Prime members today, ad-free viewing will no longer be included as a part of Twitch Prime for new members, beginning on September 14. However, members with existing annual subscriptions will be able to continue to enjoy ad-free viewing until their subscription comes up for renewal. Those with monthly subscriptions will have access to ad-free viewing until October 15.