# **Word vectors**


In the previous exercise we observed that colors that we think of as similar are 'closer' to each other in RGB vector space. Is it possible to create a vector space for all English words that has this same 'closer in space is closer in meaning' property?

The answer is yes! Luckily, you don't need to create those vectors from scratch. Many researchers have made downloadable databases of pre-trained vectors. One such project is [Stanford's Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/).

These $300$-dimensional vectors are included with $\texttt{spaCy}$, and they're the vectors we'll be using in this exercise.

![cosine similarity: picture](https://d33wubrfki0l68.cloudfront.net/d2742976a92aa4d6c39f19c747ec5f56ed1cec30/3803f/images/guide-to-word-vectors-with-gensim-and-keras_files/word2vec-king-queen-vectors.png)

In [1]:
# The following will download the language model.
# Restart the runtime (Runtime -> Restart runtime) after running this cell
# (and don't run it a second time).
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Let's load the model now:

In [2]:
import spacy

nlp = spacy.load('en_core_web_lg')

## **Word vectors: the first glance**

You can see the vector of any word in $\texttt{spaCy}$'s vocabulary using the $\texttt{vector}$ attribute:

In [3]:
# A 300-dimensional vector
len(nlp('dog').vector)

300

In [4]:
nlp('dog').vector

array([ 1.2330e+00,  4.2963e+00, -7.9738e+00, -1.0121e+01,  1.8207e+00,
        1.4098e+00, -4.5180e+00, -5.2261e+00, -2.9157e-01,  9.5234e-01,
        6.9880e+00,  5.0637e+00, -5.5726e-03,  3.3395e+00,  6.4596e+00,
       -6.3742e+00,  3.9045e-02, -3.9855e+00,  1.2085e+00, -1.3186e+00,
       -4.8886e+00,  3.7066e+00, -2.8281e+00, -3.5447e+00,  7.6888e-01,
        1.5016e+00, -4.3632e+00,  8.6480e+00, -5.9286e+00, -1.3055e+00,
        8.3870e-01,  9.0137e-01, -1.7843e+00, -1.0148e+00,  2.7300e+00,
       -6.9039e+00,  8.0413e-01,  7.4880e+00,  6.1078e+00, -4.2130e+00,
       -1.5384e-01, -5.4995e+00,  1.0896e+01,  3.9278e+00, -1.3601e-01,
        7.7732e-02,  3.2218e+00, -5.8777e+00,  6.1359e-01, -2.4287e+00,
        6.2820e+00,  1.3461e+01,  4.3236e+00,  2.4266e+00, -2.6512e+00,
        1.1577e+00,  5.0848e+00, -1.7058e+00,  3.3824e+00,  3.2850e+00,
        1.0969e+00, -8.3711e+00, -1.5554e+00,  2.0296e+00, -2.6796e+00,
       -6.9195e+00, -2.3386e+00, -1.9916e+00, -3.0450e+00,  2.48

## **Cosine similarity**

**Cosine similarity** is a common way of assessing similarity between words in NLP. It is defined as the cosine of the angle between the vectors representing the words of interest.

Recall that the cosine of the angle $\phi$ between two non-zero vectors $u$ and $v$ can be computed from their dot product $(u,v)$ and their lengths $\|u\|$ and $\|v\|$:

$\cos(\phi) = \frac{(u,v)}{\|u\|\cdot\|v\|}$

![](https://miro.medium.com/max/1394/1*_Bf9goaALQrS_0XkBozEiQ.png)



Define a function computing cosine similarity between two vectors.

In [10]:
import numpy as np

def cosine(v1, v2):
  # Your code here
  cs = np.dot(v1,v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))
  return cs
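
Before trusting similarity scores on real words, it can help to check the formula on vectors whose angle is known in advance. The sketch below is a dependency-free illustration (the name `cosine_plain` and the toy vectors are made up for this example, and it deliberately avoids NumPy so it runs anywhere): parallel vectors should give a value close to 1, orthogonal vectors 0, and opposite vectors a value close to -1.

```python
import math

def cosine_plain(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_plain([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_plain([1.0, 0.0], [0.0, 3.0])       # right angle
opposite = cosine_plain([1.0, 2.0], [-1.0, -2.0])       # opposite directions
forty_five = cosine_plain([1.0, 0.0], [1.0, 1.0])       # 45-degree angle

print(same_direction, orthogonal, opposite, forty_five)
```

Note that the length of the vectors does not matter, only their direction: `[1, 2]` and `[2, 4]` are treated as identical.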

Test your function by computing similarities of some random pairs of words, e.g. $dog$ and $puppy$ vs. $dog$ and $kitten$.

In [11]:
# Your code here
v1 = nlp('dog').vector
v2 = nlp('puppy').vector
cos_sim = cosine(v1, v2)
cos_sim

0.81076676

## **Loading the text**

Let's load the full text of *Alice in Wonderland*. It will serve as our corpus of English words.

In [14]:
import requests

# Alice in Wonderland
response = requests.get('https://www.gutenberg.org/files/11/11-0.txt')

# If you prefer Dracula, load this instead:
#response = requests.get('https://www.gutenberg.org/cache/epub/345/pg345.txt')

# Extracting separate words from the text
doc = nlp(response.text)
tokens = list(set([w.text for w in doc if w.is_alpha]))
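
If downloaded text ever comes back with stray characters such as `â` where a curly apostrophe or quotation mark should be, one likely cause is that UTF-8 bytes were decoded as Latin-1. The sketch below reproduces the effect with plain Python strings. The sample string is made up for illustration, and the `requests` lines in the trailing comment are an assumption about how one might apply the same fix here, not something this notebook does.

```python
# A right single quotation mark (U+2019), as found in UTF-8 editions of the book.
original = "Alice\u2019s Adventures"

raw_bytes = original.encode("utf-8")     # the bytes a server would send
garbled = raw_bytes.decode("latin-1")    # decoded with the wrong codec
# Latin-1 maps every byte to exactly one character, so the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)

# With requests, the analogous step would be to set the encoding before
# reading the text (left as a comment because it needs network access):
#   response.encoding = "utf-8"
#   text = response.text
```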

Check out the content of $\texttt{tokens}$ now.

In [15]:
tokens

['ignorant',
 'into',
 'fellow',
 'exclaimed',
 'circumstances',
 'It',
 'blue',
 'eager',
 'ran',
 'eleventh',
 'flustered',
 'ugly',
 'flying',
 'New',
 'whole',
 'uncivil',
 'merrily',
 'belong',
 'soup',
 'Therefore',
 'cherry',
 'shake',
 'wink',
 'herself',
 'bend',
 'END',
 'flock',
 'fountains',
 'length',
 'hollow',
 'nurse',
 'ears',
 'noises',
 'year',
 'verdict',
 'considering',
 'again',
 'Hardly',
 'afford',
 'reaching',
 'boy',
 'camomile',
 'afore',
 'sign',
 'venture',
 'Stop',
 'common',
 'rapidly',
 'See',
 'singers',
 'uncommonly',
 'pardon',
 'skirt',
 'just',
 'telescopes',
 'turtles',
 'thanked',
 'dozing',
 'couples',
 'across',
 'shouted',
 'THE',
 'crust',
 'Caucus',
 'labelled',
 'patiently',
 'paused',
 'stingy',
 'age',
 'morals',
 'live',
 'Table',
 'sorry',
 'speech',
 'remembering',
 'older',
 'rumbling',
 'bee',
 'remain',
 'word',
 'pepper',
 'wearily',
 'calmly',
 'added',
 'scaly',
 'obliged',
 'wherever',
 'howling',
 'write',
 'really',
 'planning'

Define a function that takes a word vector and lists the $n$ words in our corpus whose vectors are most similar to it.

In [21]:
def spacy_closest(tokens, new_vec, n=10):
  # Your code here
  # Rank every token by its cosine similarity to new_vec and keep the top n
  return sorted(tokens,
                key=lambda x: cosine(new_vec, nlp(x).vector),
                reverse=True)[:n]
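
One caveat: as far as I understand, spaCy returns an all-zero vector for tokens it has no vector for, and the cosine of a zero vector is undefined (the formula divides 0 by 0, which NumPy reports as `nan` together with a `RuntimeWarning`). The sketch below shows one possible way to guard against this, using a made-up toy vocabulary and plain Python lists instead of the real model. Returning `0.0` for zero vectors is a convention chosen for this example, not a mathematical fact, and the names `safe_cosine` and `closest` are hypothetical.

```python
import heapq
import math

def safe_cosine(u, v):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch; the true value is undefined
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def closest(vocab, query_vec, n=3):
    """Return the n words in vocab whose vectors are most similar to query_vec."""
    return heapq.nlargest(n, vocab, key=lambda w: safe_cosine(vocab[w], query_vec))

# Hypothetical 2-d "embeddings", invented purely for illustration.
toy_vocab = {
    "tea": [0.9, 0.1],
    "milk": [0.8, 0.3],
    "sword": [0.1, 0.9],
    "zzzz": [0.0, 0.0],  # stands in for a word with no vector
}

print(closest(toy_vocab, [1.0, 0.0], n=3))  # ['tea', 'milk', 'sword']
```

`heapq.nlargest` avoids sorting the whole vocabulary when only a handful of neighbours are needed; for a corpus of a few thousand words the difference from `sorted(...)[:n]` is negligible.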

Try to find words similar to some words of your choice, e.g. $coffee$ or $good$.

In [24]:
spacy_closest(tokens, nlp('coffee').vector)

['kettle',
 'milk',
 'bottle',
 'roast',
 'soup',
 'sugar',
 'dinner',
 'vegetable',
 'apple',
 'tastes']

In [22]:
spacy_closest(tokens, nlp('good').vector)

['good',
 'bad',
 'better',
 'wonderful',
 'clever',
 'really',
 'pleasing',
 'uncomfortable',
 'certainly',
 'little']

You can also get creative and search for combinations of words. For example, what is similar to $king - man + woman$?

In [26]:
# Your code here
new_vec = nlp('king').vector - nlp('man').vector + nlp('woman').vector
spacy_closest(tokens, new_vec)

['Queen',
 'usurpation',
 'neighbouring',
 'authority',
 'ancient',
 'instance',
 'Fainting',
 'familiarly',
 'generally',
 'particular']
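
To see why this arithmetic can work at all, here is a deliberately idealised sketch with hand-made 2-d vectors. The two axes ("royalty" and "femaleness") and all the numbers are invented for illustration; real embeddings have no such cleanly labelled dimensions, so the result there only lands *near* the target word.

```python
def add(u, v):
    """Component-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    """Component-wise difference of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]

# Hypothetical vectors: first axis ~ "royalty", second axis ~ "femaleness".
king = [1.0, 0.0]
queen = [1.0, 1.0]
man = [0.0, 0.0]
woman = [0.0, 1.0]

# Subtracting man removes the "male" component, adding woman puts "female" in.
result = add(sub(king, man), woman)
print(result)  # [1.0, 1.0]
```

In this toy space the result coincides exactly with `queen`. With real vectors, the words used in the query are often among the nearest neighbours of the result, so it is common to exclude them from the ranking.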

## **Sentence vectors**

We can also construct a vector representation for a whole sentence. For example, we can define it as the *average* of the vectors representing the words in it.

Let's take the sentence *My favorite food is strawberry ice cream* and construct its vector representation.

In [35]:
sent = nlp('My favorite food is strawberry ice cream.')

# Your code here
# Sum the token vectors, then divide by the number of tokens
sentv = 0
for word in sent:
  sentv += word.vector
sentv /= len(sent)
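
The averaging step can be checked by hand on small vectors. The following is a minimal sketch with made-up 3-d "word vectors" and a hypothetical helper `mean_vector`; it uses plain lists so that it runs without the language model.

```python
def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

# Hypothetical 3-d word vectors for a three-word sentence.
word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
]

sentence_vector = mean_vector(word_vectors)
print(sentence_vector)  # [2.0, 2.0, 2.0]
```

Two things are worth keeping in mind. First, the loop over `sent` also includes the final `.` token, so punctuation contributes to the average. Second, to the best of my knowledge spaCy's own `sent.vector` defaults to the same average of token vectors, so it should agree with the hand-computed result.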

Let's also extract sentences (as opposed to individual words) from our corpus:

In [38]:
sents = list(doc.sents)
sents[:3]

[*** START OF THE PROJECT GUTENBERG EBOOK,
 ALICE'S ADVENTURES IN
 WONDERLAND *,
 **
 [Illustration]
 
 
 
 
 Alice’s Adventures in Wonderland
 
 by Lewis Carroll
 
 THE MILLENNIUM FULCRUM EDITION 3.0
 
 Contents
 
  CHAPTER I.     Down the Rabbit-Hole
  CHAPTER II.    ]

Define a function that takes a sentence vector and lists the $n$ most similar sentences from our corpus.

In [55]:
def spacy_closest_sent(sentences, input_vec, n=10):
  # Your code here
  # Represent each sentence by the mean of its token vectors, then rank
  return sorted(sentences,
                key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vec),
                reverse=True)[:n]

Let's try it out!

In [56]:
for s in spacy_closest_sent(sents, sentv, n=10):
  print(s)
  print('\n---')

This
is the driest thing I know.

---
And oh, my poor hands, how is it

---
beautiful Soup!
Soup of the evening, beautiful Soup!
    Beau--ootiful Soo--oop!
    

---
Oh
my fur and whiskers!

---
“I dare say you’re wondering why I don’t put my arm round your waist,”
the Duchess said after a pause: “the reason is, that I’m doubtful about
the temper of your flamingo.

---
The Mouse did not
answer, so Alice went on eagerly: “There is such a nice little dog near
our house I should like to show you!

---
And she’s such a capital one for catching mice you
can’t think!

---
Soup
does very well without--Maybe it’s always pepper that makes people
hot-tempered,” she went on, very much pleased at having found out a new
kind of rule, “and vinegar that makes them sour--and camomile that makes
them bitter--and--and barley-sugar and such things that make children
sweet-tempered.

---
she knows such a
very little!

---
And she tried to fancy what the
flam

## **References**

This notebook is inspired by a [tutorial by Allison Parrish](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469).