In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

### Data Structures (1): Vocab, Lexemes and StringStore

spaCy stores all shared data in a vocabulary, the Vocab.

This includes words, but also the label schemes for tags and entities.

To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.

Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as `nlp.vocab.strings`.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string from the hash alone. That's why we always need to pass around the shared vocab.
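To make the two-way lookup concrete, here is a minimal pure-Python sketch of the idea. It is not spaCy's implementation: `ToyStringStore` and `hash_string` are hypothetical names, and the SHA-256-based hash is only a stand-in for whatever hash function spaCy uses internally. It illustrates why a hash can always be computed from a string, but a string can only be recovered from a hash if the store has seen it.

```python
import hashlib


def hash_string(text: str) -> int:
    """Deterministic 64-bit hash (a stand-in, not spaCy's actual hash function)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        # Store each string only once, keyed by its hash
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # string -> hash can always be computed
            return hash_string(key)
        # hash -> string only works if the string was stored
        return self._by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)  # lookup by string gives the hash
print(store[coffee_hash])              # lookup by hash gives the string

# A different store has never seen "coffee", so the hash cannot be resolved
empty_store = ToyStringStore()
try:
    empty_store[coffee_hash]
    resolved = True
except KeyError:
    resolved = False
print("resolved in empty store:", resolved)
```

The last step is the reason for passing the shared vocab around: the hash on its own carries no way back to the text.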

In [25]:
from spacy.lang.en import English
nlp = English()
coffee_hash = nlp.vocab.strings['coffee']
coffee_hash

3197928453018144401

In [12]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

To get the hash for a string, we can look it up in `nlp.vocab.strings`.

To get the string representation of a hash, we can look up the hash.

A Doc object also exposes its vocab and strings.

In [26]:
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print(nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
coffee


The Doc also exposes the vocab and strings:

In [27]:
doc = nlp("I love coffee")
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401


Lexemes are context-independent entries in the vocabulary.

You can get a lexeme by looking up a string or a hash ID in the vocab.

Lexemes expose attributes, just like tokens.

They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.

Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.

A Lexeme object is an entry in the vocabulary.

In [30]:
doc = nlp("I love coffee")
# Look up the lexeme for "coffee" in the shared vocab
lexeme = nlp.vocab['coffee']
# Print the lexical attributes: text, hash ID and whether it is alphabetic
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


Here's an example.

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.

Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.
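The chain from token to lexeme to string can be traced in code. The following is a small sketch, assuming spaCy is installed and using the blank English pipeline from the cells above; it looks up each token's lexeme through `doc.vocab` and then resolves the lexeme's hash through `doc.vocab.strings`.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I love coffee")

resolved_texts = []
for token in doc:
    # token.orth is the hash ID of the token's text
    lexeme = doc.vocab[token.orth]
    # Resolve the hash back to a string via the shared string store
    text = doc.vocab.strings[lexeme.orth]
    resolved_texts.append(text)
    print(token.text, lexeme.orth, text)
```

Each printed row shows the token text, the hash ID stored on its lexeme, and the string recovered from that hash.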

### Strings to Hashes