# spaCy

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp


<spacy.lang.en.English at 0x1a291493e20>

## Strings to Hashes

As you know, whenever you create a doc, its words are stored in the **vocab**.

Suppose you have about 1,000 text documents, each describing clothing items from different brands. Chances are that the words "shirt" and "pants" will be very common. If spaCy stored the exact string every time the word "shirt" occurred, you would waste a huge amount of memory.

But this doesn't happen. Why?

spaCy **hashes** each string, converting it to a unique ID, and keeps a single copy of the string in the **StringStore**.

But what is the StringStore?

It's a lookup table **mapping hash values to strings**, for example 10543432924755684266 -> box

You can look up the hash value if you know the string, and vice versa (the reverse lookup only works for strings the store has already seen). The StringStore is available as **nlp.vocab.strings**.
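
To make the idea concrete, here is a minimal pure-Python sketch of a string store. It is only an illustration, not spaCy's implementation: `ToyStringStore` and `toy_hash` are hypothetical names, and `hashlib.blake2b` stands in for spaCy's own hash function, so the numbers it produces will not match spaCy's.

```python
import hashlib


def toy_hash(text: str) -> int:
    """Deterministic 64-bit hash (a stand-in, not spaCy's hash function)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Keeps one copy of each string, addressable by its hash."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = toy_hash(text)
        self._by_hash[key] = text  # stored once, however often it occurs
        return key

    def __getitem__(self, item):
        # str -> hash, int -> str: the two-way lookup described above
        if isinstance(item, str):
            return toy_hash(item)
        return self._by_hash[item]

    def __len__(self):
        return len(self._by_hash)


store = ToyStringStore()
words = "shirt pants shirt shirt pants".split()
ids = [store.add(w) for w in words]  # five tokens become five integers

print(len(ids), len(store))          # 5 tokens, but only 2 stored strings
print(store[store["shirt"]])         # round trip: string -> hash -> string
```

Because the hash depends only on the string itself, two independent stores would assign "shirt" the same ID. The tokens only need to carry the integer.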


In [4]:
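# Strings to hashes and back
# Processing the text adds its words to the vocab, so their hashes can be resolved back to strings
doc = nlp("I love travelling")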

In [5]:
# Look up the hash for the word "travelling"

word_hash = nlp.vocab.strings["travelling"]
print(word_hash)

5902765392174988614


In [7]:
# Look up the word_hash to get the string
word_string = nlp.vocab.strings[word_hash]
print(word_string)

travelling


Interestingly, a word has the same hash value irrespective of which document it occurs in or which spaCy model is being used.

So your results are reproducible even if you run your code on someone else's machine.

In [11]:
# Create two different docs with a common word
doc1 = nlp('Raymond shirts are famous')
doc2 = nlp('I washed my shirts')

print('--------Doc 1----------')
for token in doc1:
    hash_value = nlp.vocab.strings[token.text]
    print(token.text, " ", hash_value)

print('----------Doc 2-----------')
for token in doc2:
    hash_value = nlp.vocab.strings[token.text]
    print(token.text, " ", hash_value)

--------Doc 1----------
Raymond   5945540083247941101
shirts   9181315343169869855
are   5012629990875267006
famous   17809293829314912000
----------Doc 2-----------
I   4690420944186131903
washed   5520327350569975027
my   227504873216781231
shirts   9181315343169869855


## Lexical attributes of spaCy

Recall that we used the **is_punct** and **is_space** attributes in Text Preprocessing. These are called **lexical attributes**.

We will now learn about a few more significant lexical attributes.

spaCy provides many useful lexical attributes. They are attributes of the Token object that give you information about the type of token.

For example, the **like_num** attribute of a token tells you whether it resembles a number. Let's print all the numbers in a text.

In [12]:
# Print the tokens which are like numbers

text = '2020 is far worse than 2009'
doc = nlp(text)
for token in doc:
    if token.like_num:
        print(token)

2020
2009
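
Note that like_num means "resembles a number", not "consists of digits": the spaCy documentation gives "10.9", "10" and "ten" as examples. The pure-Python sketch below is a rough approximation of that kind of check, written only to show the idea. `rough_like_num` and `NUMBER_WORDS` are hypothetical names, and spaCy's real rules are language-specific and more extensive.

```python
# A small, deliberately incomplete list of number words (assumption for this sketch)
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "hundred", "thousand", "million",
}


def rough_like_num(text: str) -> bool:
    """Very rough 'looks like a number' check; not spaCy's actual code."""
    # Ignore thousands separators and decimal points: "10,000", "3.5"
    stripped = text.replace(",", "").replace(".", "")
    if stripped.isdigit():
        return True
    # Simple fractions such as "1/2"
    if text.count("/") == 1:
        numerator, denominator = text.split("/")
        if numerator.isdigit() and denominator.isdigit():
            return True
    # Spelled-out numbers such as "ten"
    return text.lower() in NUMBER_WORDS


for candidate in ["2020", "10,000", "3.5", "1/2", "ten", "shirt", "%"]:
    print(candidate, rough_like_num(candidate))
```

The first five candidates are reported as number-like, while "shirt" and "%" are not.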


Let's look at a real-life application of these features.

In [13]:
Production = "Production in chennai is 87%. In kolkata it as low as 43%. In Bangalore, production is as good as 98%. In jaipur, production is average around 78%"

Suppose we want a list of the percentages mentioned in the text.

We can convert the text into a spaCy **Doc** object and use the like_num attribute to check which tokens are numbers. If a token is a number, we can check whether the next token is "%". We can get the index of the next token with token.i + 1.

In [17]:
# Find the tokens that are numbers followed by %

production_doc = nlp(Production)
for token in production_doc:
    if token.like_num:
        index_of_next_token = token.i + 1
        # Guard against a number being the last token in the doc
        if index_of_next_token < len(production_doc):
            next_token = production_doc[index_of_next_token]
            if next_token.text == "%":
                print(token.text)

87
43
98
78
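
The same check can be wrapped in a small reusable helper. This is a sketch, with `numbers_before` being a hypothetical name rather than a spaCy function. It pairs each token with its right-hand neighbour, so it never indexes past the end of the doc, and it relies only on the `like_num` and `text` attributes used above.

```python
def numbers_before(doc, symbol="%"):
    """Return the text of each number-like token directly followed by `symbol`.

    `doc` is expected to be a spaCy Doc, or any sequence of objects that
    expose `like_num` and `text`.
    """
    # zip(doc, doc[1:]) yields (token, next_token) pairs and stops before
    # the last token, which has no right-hand neighbour.
    return [
        token.text
        for token, next_token in zip(doc, doc[1:])
        if token.like_num and next_token.text == symbol
    ]
```

Calling `numbers_before(production_doc)` should return the same four values as a list, and passing a different `symbol` (for example "$") reuses the logic for other units.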
