#### NLP with spaCy part 2

#### Chapter 2: Large-scale data analysis with spaCy
+ extract specific information from large volumes of text using spaCy's data structures, and combine statistical and rule-based approaches for text analysis

-------

#### Data Structures
+ `Vocab`: stores data shared across multiple documents. To save memory, spaCy encodes all strings as hash values
+ The string store, `nlp.vocab.strings`, is a lookup table that works in both directions: you can look up a string to get its hash, and look up a hash to get its string value
+ To get the hash for a string, look it up in `nlp.vocab.strings`

In [2]:
import spacy

nlp = spacy.blank('en')
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

nlp.vocab.strings.add("coffee")
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

hash value: 3197928453018144401
string value: coffee
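
The two-way lookup described above can be pictured with a small stand-in written in plain Python. This is only a sketch: `ToyStringStore` and `hash_string` are hypothetical names, and the hash function used here (BLAKE2 from `hashlib`) is an assumption chosen for illustration, not the one spaCy uses, so the numbers will not match spaCy's. The point it illustrates is that a hash can always be computed from a string, while going from a hash back to a string only works if the string was stored first.

```python
import hashlib


def hash_string(text: str) -> int:
    """Map a string to a 64-bit integer (illustrative hash, not spaCy's)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        # Remember the string so its hash can be resolved later.
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # A hash can always be computed from a string.
            return hash_string(item)
        # A string can only be recovered if it was stored earlier.
        return self._by_hash[item]

    def __contains__(self, item) -> bool:
        key = hash_string(item) if isinstance(item, str) else item
        return key in self._by_hash


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])  # coffee
print("tea" in store)      # False
```

Looking up the hash of `"tea"` in this toy store raises a `KeyError`, because the string was never added. This is why a hash alone is not enough and the shared vocab has to travel with the data.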


**Lexemes** are context-independent entries in the vocabulary.
+ You can get a lexeme by looking up a string or a hash ID in the vocab.
+ They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters

In [3]:
doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]

# Print the lexical attributes: the text, the hash (orth), and whether the text is alphabetic
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True
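
To make "context-independent" concrete, here is a minimal sketch in plain Python. `ToyLexeme`, `make_lexeme` and `lookup` are hypothetical names invented for this illustration, not spaCy classes. Every attribute is derived from the string alone, so the same word resolves to the same entry no matter which sentence it came from. Anything that depends on the surrounding sentence, such as a part-of-speech tag, has no place in such an entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    """Hypothetical vocabulary entry holding only string-derived attributes."""
    text: str
    lower: str
    is_alpha: bool
    is_digit: bool


def make_lexeme(text: str) -> ToyLexeme:
    # Every field is computed from the string alone, never from a sentence.
    return ToyLexeme(text, text.lower(), text.isalpha(), text.isdigit())


vocab = {}  # one shared entry per distinct string


def lookup(text: str) -> ToyLexeme:
    # Create the entry on first use, then keep returning the same object.
    if text not in vocab:
        vocab[text] = make_lexeme(text)
    return vocab[text]


# "coffee" taken from two different sentences resolves to the same entry.
first = lookup("coffee")
second = lookup("coffee")
print(first is second)          # True
print(lookup("2024").is_alpha)  # False
```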


---------

#### Strings to Hashes

In [4]:
# Look up the string “cat” in nlp.vocab.strings to get the hash.

nlp = spacy.blank("en")
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings['cat']
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [5]:
# Look up the string label “PERSON” in nlp.vocab.strings to get the hash.

nlp = spacy.blank("en")
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings['PERSON']
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


-------

#### Data Structures 2

##### The Doc Object
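+ A `Doc` is created automatically when you process a text with the `nlp` object
+ You can also create one manually from the shared vocab, a list of words, and a list of booleans (`spaces`) saying whether each word is followed by a space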

In [6]:
nlp = spacy.blank('en')

from spacy.tokens import Doc
# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

In [7]:
doc

Hello world!
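
The way `words` and `spaces` combine into the final text can be sketched without spaCy. `join_words` below is a hypothetical helper written for this note, and the length check is my own addition rather than a claim about how spaCy validates its input. Each word is followed by one space if, and only if, its flag is `True`.

```python
def join_words(words, spaces):
    """Rebuild a text from tokens and their trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if space else "")
        for word, space in zip(words, spaces)
    )


print(join_words(["Hello", "world", "!"], [True, False, False]))  # Hello world!
```

Working through a desired text by hand with this rule is a quick way to get the `spaces` list right before passing it to `Doc`.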

##### The Span Object

+ To create a `Span` manually, import the class from `spacy.tokens`, then instantiate it with the doc and the span's start and end token indices. The end index is exclusive, as in a Python slice

In [8]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]
doc

Hello world!

##### Create a Doc from the words and spaces

In [9]:
nlp = spacy.blank('en')

from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


##### Create a Doc from the words and spaces. Don’t forget to pass in the vocab!

In [10]:
nlp = spacy.blank('en')

from spacy.tokens import Doc

words = ["Go", ",", "get", "started", "!"]
# The final "!" ends the text, so it takes no trailing space
spaces = [False, True, True, False, False]

# create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


##### Complete the words and spaces to match the desired text and create a doc.

In [11]:
nlp = spacy.blank('en')
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


-----

#### Docs, Spans, and entities from scratch
+ Create the `Doc` and `Span` objects manually, and update the named entities, just like spaCy does behind the scenes
+ The cell below creates a blank English `nlp` object, whose vocab is shared with the new `Doc`

In [12]:
nlp = spacy.blank('en')

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]
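
The index arithmetic behind `Span(doc, 2, 4)` can be checked in plain Python. This is a sketch under stated assumptions: `span_text` is a hypothetical helper, and entities are represented as simple `(start, end, label)` tuples purely for illustration. The start index is inclusive and the end index is exclusive, and the text of the span leaves out the space that follows its last token.

```python
def span_text(words, spaces, start, end):
    """Text of tokens start..end-1, without the last token's trailing space."""
    pieces = []
    for i in range(start, end):
        pieces.append(words[i])
        # Keep the space between tokens, but not after the final one.
        if spaces[i] and i < end - 1:
            pieces.append(" ")
    return "".join(pieces)


words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Illustrative entity records: (start, end, label), with end exclusive.
entities = [(2, 4, "PERSON")]

for start, end, label in entities:
    print(span_text(words, spaces, start, end), label)  # David Bowie PERSON
```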


----------

#### Word vectors and semantic similarity
+ spaCy can compare two objects and predict how similar they are, for example documents, spans or single tokens
+ This requires a pipeline that includes word vectors, such as `en_core_web_md`
+ To compare two documents, create two doc objects and call the first doc's **similarity method** with the second as its argument

In [13]:
# Load a larger pipeline with vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8382381200790405


In [14]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

1.0000001192092896


+ You can also use the similarity methods to compare different types of objects, for example a document and a token

In [15]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.2274084836244583


In [16]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.5528544783592224


-----

#### How does spaCy predict similarity?
+ Similarity is determined using word vectors: multi-dimensional representations of the meanings of words
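
By default, spaCy scores similarity with the cosine of the angle between two vectors, and the vector of a `Doc` or `Span` defaults to the average of its token vectors. The sketch below reproduces both steps in plain Python. The helper names and the 3-dimensional vectors are made up for illustration; real vectors in `en_core_web_md` have 300 dimensions and different values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Choice made for this sketch: an all-zero vector scores 0.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of several vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}

# Token vs token
print(cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"]))

# "Document" vs token: average the token vectors first, then compare.
doc_vector = mean_vector([toy_vectors["pizza"], toy_vectors["pasta"]])
print(cosine_similarity(doc_vector, toy_vectors["soap"]))
```

Two identical vectors score 1, up to floating-point rounding, which is how a printed value can land a hair above 1.0.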


#### Word vectors in spaCy

In [18]:
nlp = spacy.load("en_core_web_md")

doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)

# The result is a 300-dimensional vector for the word "banana".

[-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 -0.29864   -0.014646  -0.24905   -0.67125   -0.053366   0.59426
 -0.068034   0.10315    0.66759    0.024617  -0.37548    0.52557
  0.054449  -0.36748   -0.28013    0.090898  -0.025687  -0.5947
 -0.24269    0.28603    0.686      0.29737    0.30422    0.69032
  0.042784   0.023701  -0.57165    0.70581   -0.20813   -0.03204
 -0.12494   -0.42933    0.31271    0.30352    0.09421   -0.15493
  0.071356   0.15022   -0.41792    0.066394  -0.034546  -0.45772
  0.57177   -0.82755   -0.27885    0.71801   -0.12425    0.18551
  0.41342   -0.53997    0.55864   -0.015805  -0.1074    -0.29981
 -0.17271    0.27066    0.043996   0.60107   -0.353      0.6831
  0.20703    0.12068    0.24852   -0.15605    0.25812    0.007004
 -0.10741   -0.097053   0.085628   0.096307   0.20857   -0.23338
 -0.077905  -0.030906   1.0494     0.55368   -0.10703    0.052234
  0.43407   -0.13926    0

#### Inspecting word vectors
+ Load the medium `en_core_web_md` pipeline with word vectors, then print the vector for "bananas" using the `token.vector` attribute

In [19]:
# Load the en_core_web_md pipeline
nlp = spacy.load("en_core_web_md")

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 -0.29864   -0.014646  -0.24905   -0.67125   -0.053366   0.59426
 -0.068034   0.10315    0.66759    0.024617  -0.37548    0.52557
  0.054449  -0.36748   -0.28013    0.090898  -0.025687  -0.5947
 -0.24269    0.28603    0.686      0.29737    0.30422    0.69032
  0.042784   0.023701  -0.57165    0.70581   -0.20813   -0.03204
 -0.12494   -0.42933    0.31271    0.30352    0.09421   -0.15493
  0.071356   0.15022   -0.41792    0.066394  -0.034546  -0.45772
  0.57177   -0.82755   -0.27885    0.71801   -0.12425    0.18551
  0.41342   -0.53997    0.55864   -0.015805  -0.1074    -0.29981
 -0.17271    0.27066    0.043996   0.60107   -0.353      0.6831
  0.20703    0.12068    0.24852   -0.15605    0.25812    0.007004
 -0.10741   -0.097053   0.085628   0.096307   0.20857   -0.23338
 -0.077905  -0.030906   1.0494     0.55368   -0.10703    0.052234
  0.43407   -0.13926    0