## spaCy Data Structures

In [5]:
import spacy
nlp = spacy.load("en_core_web_sm")

### Shared vocab and StringStore

- spaCy stores all shared data in a vocabulary, the Vocab. This includes words, but also the label schemes for tags and entities. All strings are encoded to hash IDs to save memory; spaCy generates each ID with a hash function. Each string is stored only once in the string store, which is available as nlp.vocab.strings.

- String Store: It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

**Hash IDs can't be reversed, though.** If a word is not in the vocabulary, there's no way to get its string from the hash. That's why we always need to pass around the shared vocab.

In [6]:
doc = nlp("I have a headache")

# Look up the hash for headache
headache_hash = nlp.vocab.strings["headache"]
print(headache_hash)

# Look up headache_hash to get the string back
headache = nlp.vocab.strings[headache_hash]
print(headache)

13539992551592322469
headache
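
To see why a hash alone is not enough, here is a minimal pure-Python sketch of a two-way string store. It is a toy model, not spaCy's implementation: the class name `ToyStringStore` is hypothetical, and `hashlib.blake2b` is only a stand-in for the hash function spaCy actually uses.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup table; NOT spaCy's actual StringStore."""

    def __init__(self):
        self._by_hash = {}  # hash ID -> string

    @staticmethod
    def hash_string(text):
        # Stand-in 64-bit hash (assumption: spaCy uses a different function)
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def add(self, text):
        # Store the string once and return its hash ID
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.hash_string(key)  # string -> hash always works
        return self._by_hash[key]  # hash -> string needs a stored entry


store = ToyStringStore()
headache_hash = store.add("headache")
print(store[headache_hash])  # headache

# A second, empty store computes the same hash but cannot reverse it
other_store = ToyStringStore()
print(other_store["headache"] == headache_hash)  # True
try:
    other_store[headache_hash]
except KeyError:
    print("hash unknown to this store")
```

The last lookup fails because the second store never saw the string, which mirrors why the shared vocab has to travel with the data.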


### Lexemes

- Lexemes are context-independent entries in the vocabulary that can be looked up using a string or a hash ID in the vocab. They hold context-independent information about a word, such as its text or whether the word consists of alphabetic characters.
- Context-dependent information like part-of-speech (POS) tags, dependencies or entity labels is not contained within lexemes.
- Like tokens, lexemes expose attributes.

In [3]:
doc = nlp("I love masala chai")
lexeme = nlp.vocab['chai']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

chai 11747502562491277758 True
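
The split between context-independent and context-dependent information can be sketched in plain Python. This is a toy model under stated assumptions, not spaCy's data layout: `ToyLexeme`, `ToyToken` and `get_lexeme` are hypothetical names, and the POS tags are assigned by hand rather than predicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    """Context-independent entry: the same for every occurrence of a word."""
    text: str
    is_alpha: bool


@dataclass
class ToyToken:
    """A word in context: points to a shared lexeme and adds a POS tag."""
    lexeme: ToyLexeme
    pos: str


toy_vocab = {}  # text -> ToyLexeme, one entry per distinct word


def get_lexeme(text):
    # Create the entry on first use, then always return the same object
    if text not in toy_vocab:
        toy_vocab[text] = ToyLexeme(text, text.isalpha())
    return toy_vocab[text]


# Hand-assigned tags: "love" as a noun in one sentence, a verb in another
noun_token = ToyToken(get_lexeme("love"), "NOUN")
verb_token = ToyToken(get_lexeme("love"), "VERB")

print(noun_token.lexeme is verb_token.lexeme)  # True
print(noun_token.pos, verb_token.pos)          # NOUN VERB
```

Both tokens share one lexeme, while the POS tag lives on the token because it depends on the sentence.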


### Doc object

- One of spaCy's central data structures.
- Created automatically when text is processed with the nlp object.
- Can also be instantiated manually.

In [11]:
# Automatic instantiation
auto_doc = nlp("Automated wonder")
print(auto_doc.text)
print(type(auto_doc))

Automated wonder
<class 'spacy.tokens.doc.Doc'>


In [12]:
#Manual instantiation
from spacy.tokens import Doc

#create list of words for doc
man_words = ['Dreary','manual','labor','!']
#create list of spaces for sentence
man_spaces = [True,True,True,False]

#instantiate doc
man_doc = Doc(nlp.vocab, words=man_words, spaces=man_spaces)

print(man_doc.text)
print(type(man_doc))

Dreary manual labor !
<class 'spacy.tokens.doc.Doc'>
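
How the words and spaces lists combine into the text can be sketched without spaCy. The helper `join_words` below is hypothetical; it only mirrors the idea that each word is followed by a single space when its flag is True.

```python
def join_words(words, spaces):
    """Rebuild a text from words and their trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )


words = ["Dreary", "manual", "labor", "!"]

# Same flags as the cell above: "labor" is followed by a space
print(join_words(words, [True, True, True, False]))   # Dreary manual labor !

# Setting the flag for "labor" to False attaches the "!" directly
print(join_words(words, [True, True, False, False]))  # Dreary manual labor!
```

This is why the output above has a space before the exclamation mark: the third flag is True.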


### The Span Object

- Slice of a Doc consisting of one or more tokens.
- Span takes at least three arguments: 
    - The doc it refers to
    - Span start index
    - Span end index (exclusive)
    
Spans are created automatically when you slice a Doc (for example, doc[0:3]). To create one manually, see below:

In [20]:
#import span class from tokens
from spacy.tokens import Span

#create new manual doc
man_words2 = ['Still','doing','things','manually!']
#create list of spaces for sentence
man_spaces2 = [True,True,True,False]

#instantiate doc
man_doc2 = Doc(nlp.vocab, words=man_words2, spaces=man_spaces2)

#create span 
man_span = Span(man_doc2, 0,3)

#Create one with a label
man_span_label = Span(man_doc2,0,3, label="Observation")

if len(man_doc2.ents) == 0:
    print("No entities in manual document")

No entities in manual document


In [21]:
#Add span to doc's entity list
man_doc2.ents = [man_span_label]

# ent.label is the hash ID of the label; ent.label_ would give the string
for ent in man_doc2.ents:
    print(ent.text, ent.label)

Still doing things 5897125384360097813
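
The three Span arguments can be pictured with a toy model in plain Python. This is only a sketch of the idea, not spaCy's Span: the `ToySpan` class is hypothetical, and a plain list of strings stands in for the Doc.

```python
from dataclasses import dataclass


@dataclass
class ToySpan:
    """Toy view over a token list; NOT spaCy's Span."""
    doc: list        # the token list the span refers to (not a copy)
    start: int       # index of the first token in the span
    end: int         # index one past the last token (exclusive)
    label: str = ""

    @property
    def tokens(self):
        # Slice on demand, so the span always reflects its doc
        return self.doc[self.start:self.end]


toy_doc = ["Still", "doing", "things", "manually!"]
toy_span = ToySpan(toy_doc, 0, 3, label="Observation")

print(toy_span.tokens)       # ['Still', 'doing', 'things']
print(len(toy_span.tokens))  # 3, because index 3 itself is excluded
```

Keeping a reference to the doc plus two indices, rather than copying the words, is what lets a span stay connected to the rest of the document.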


### Best practices
- Doc and Span are very powerful: they hold references to words and sentences and the relationships between them
- Convert result to strings as late as possible
- Use token attributes if available – for example, token.i for the token index
- Don't forget to pass in the shared vocab

Example of bad code:

In [30]:
doc = nlp("Anna loved her sister Elsa")

# Get all tokens and POS tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check for proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if (index+1) < len(pos_tags) and pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Anna


This is an example of bad code because it only uses lists of strings instead of native token attributes. This is often less efficient and can't express complex relationships.

Better Code

In [38]:
doc = nlp("Alexis soared high above the mountains")

#Use native token attributes
for token in doc:
    # Check for proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if (token.i+1) < len(doc) and doc[token.i + 1].pos_ == "VERB":
            result = token.text
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Alexis
