# Chapter 2: Large-scale data analysis with spaCy

* In this chapter, you'll use your new skills to extract specific information from large volumes of text. 
* You''ll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

## Word vectors and semantic similarity
* In this lesson, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other.
* You'll also learn how to use word vectors and how to take advantage of them in your NLP application.

### Comparing semantic similarity
* spaCy can compare two objects and predict similarity
* Doc.similarity(), Span.similarity() and Token.similarity()
* Take another object and return a similarity score (0 to 1)
* Important: needs a model that has word vectors included, for example:
    * ✅ en_core_web_md (medium model)
    * ✅ en_core_web_lg (large model)
    * 🚫 NOT en_core_web_sm (small model)
    
### Example:
* Here's an example. Let's say we want to find out whether two documents are similar.
* First, we load the medium English model, "en_core_web_md".
* We can then create two doc objects and use the first doc's similarity method to compare it to the second.
* Here, a fairly high similarity score of 0.86 is predicted for "I like fast food" and "I like pizza".
* The same works for tokens.
* According to the word vectors, the tokens "pizza" and "pasta" are kind of similar, and receive a score of 0.7.    

### Similarity examples (1)

In [8]:
import spacy
# Load a larger model with vectors
nlp = spacy.load("en_core_web_sm")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8018373287411041


  "__main__", mod_spec)


In [9]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.32624283


  "__main__", mod_spec)


### Similarity examples (2)
* You can also use the similarity methods to compare different types of objects.
* For example, a document and a token.
* Here, the similarity score is pretty low and the two objects are considered fairly dissimilar.
* Here's another example comparing a span – "pizza and pasta" – to a document about McDonalds.
* The score returned here is 0.61, so it's determined to be kind of similar.

In [10]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]
print(doc.similarity(token))

0.28162675424923095


  "__main__", mod_spec)


In [11]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.294720206859215


  "__main__", mod_spec)


### How does spaCy predict similarity?
* Similarity is determined using word vectors
* Multi-dimensional meaning representations of words
* Generated using an algorithm like Word2Vec and lots of text
* Can be added to spaCy's statistical models
* Default: cosine similarity, but can be adjusted
* Doc and Span vectors default to average of token vectors
* Short phrases are better than long documents with many irrelevant words

### Word vectors in spaCy
* To give you an idea of what those vectors look like, here's an example.
* First, we load the medium model again, which ships with word vectors.
* Next, we can process a text and look up a token's vector using the .vector attribute.
* The result is a 300-dimensional vector of the word "banana".

In [12]:
# Load a larger model with vectors
nlp = spacy.load("en_core_web_sm")

doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 0.9383564  -2.9524927   1.1866798   0.49744225 -0.11475766  0.804008
  0.4672468  -1.1062207   2.9193573   1.800931   -0.31358248  1.1920271
 -1.2406584  -2.3237133   2.099902   -0.66673994 -0.96991694  0.8316833
  0.10666084 -0.42245626  1.6402073   0.95437694  1.2855074  -2.038612
 -0.7317371  -0.17545497  0.14752543  1.327169    3.2502053  -3.9332502
  1.7409098  -0.73711336  1.4852796  -2.8246899  -1.8938334  -1.2638527
  5.298433   -1.2850044  -2.7470415  -1.5607052   5.181785    2.242096
 -2.1922808  -5.310454    1.0295098   1.484088   -1.5894104  -0.14745024
  1.7829046   1.8879583   4.152973   -3.1493165  -0.18937713  2.09369
 -2.1269834   0.63290507  2.6979058   1.800016   -2.3953576   2.54901
  1.0445759  -1.3137031   2.4631662  -0.07756937 -1.129545    0.1169464
  1.3869805   0.53586185 -2.242661    2.8641388  -3.8719153  -0.6409143
  0.6971829   4.484493   -1.6210997   2.494869    0.7218447  -3.3112261
 -0.2163549  -2.5339773  -1.1702836  -0.9627162  -3.7210062   1.559916

### Similarity depends on the application context
* Useful for many applications: recommendation systems, flagging duplicates etc.
* There's no objective definition of "similarity"
* Depends on the context and what application needs to do

Predicting similarity can be useful for many types of applications. For example, to recommend a user similar texts based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform.

However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do.

Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

In [13]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.849750871465223


  "__main__", mod_spec)
