# **Word vectors**


In the previous exercise we observed that colors that we think of as similar are 'closer' to each other in RGB vector space. Is it possible to create a vector space for all English words that has this same 'closer in space is closer in meaning' property?

The answer is yes! Luckily, you don't need to create those vectors from scratch. Many researchers have made downloadable databases of pre-trained vectors. One such project is [Stanford's Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). 

These $300$-dimensional vectors are included with $\texttt{spaCy}$, and they're the vectors we'll be using in this exercise.

![cosine similarity: picture](https://d33wubrfki0l68.cloudfront.net/d2742976a92aa4d6c39f19c747ec5f56ed1cec30/3803f/images/guide-to-word-vectors-with-gensim-and-keras_files/word2vec-king-queen-vectors.png)

In [1]:
import numpy as np

In [10]:
x

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [8]:
# numpy world

x = np.arange(16).reshape(4,4)

print("X :\n%s\n" % x)
print("X.shape : %s\n" % (x.shape,))
print("add x :\n%s\n" % (x + x))
print("X*X^T  :\n%s\n" % np.matmul(x,x.T))
print("mean over cols :\n%s\n" % (x.mean(axis=-1)))
print("cumsum of cols :\n%s\n" % (np.cumsum(x,axis=0)))

X :
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

X.shape : (4, 4)

add x :
[[ 0  2  4  6]
 [ 8 10 12 14]
 [16 18 20 22]
 [24 26 28 30]]

X*X^T  :
[[ 56  62  68  74]
 [152 174 196 218]
 [248 286 324 362]
 [344 398 452 506]]

mean over cols :
[ 1.5  5.5  9.5 13.5]

cumsum of cols :
[[ 0  1  2  3]
 [ 4  6  8 10]
 [12 15 18 21]
 [24 28 32 36]]



In [None]:
# The following will download the language model.
# Restart the runtime (Runtime -> Restart runtime) after running this cell
# (and don't run it a second time).
!python -m spacy download en_core_web_lg

Let's load the model now:

In [14]:
import spacy

nlp = spacy.load('en_core_web_lg')

## **Word vectors: the first glance**

You can see the vector of any word in $\texttt{spaCy}$'s vocabulary using the $\texttt{vector}$ attribute:

In [15]:
# A 300-dimensional vector
len(nlp('dog').vector)

300

In [17]:
# nlp('dog').vector

## **Cosine similarity**

**Cosine similarity** is a common way of assessing similarity between words in NLP. It is essentially defined as the cosine of the angle between the vectors representing the words of interest.

Recall that the angle $\phi$ between two non-zero vectors $u$ and $v$ satisfies

$\cos(\phi) = \frac{(u, v)}{\|u\| \cdot \|v\|}$,

where $(u, v)$ denotes the dot product of $u$ and $v$, and $\|\cdot\|$ is the Euclidean norm.

![](https://miro.medium.com/max/1394/1*_Bf9goaALQrS_0XkBozEiQ.png)



Define a function computing cosine similarity between two vectors.

In [19]:
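import numpy as np

def cosine(v1, v2):
    # Cosine of the angle between v1 and v2.
    # By convention, return 0 if either vector is the zero vector.
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm_product > 0:
        return np.dot(v1, v2) / norm_product
    return 0.0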
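
As a quick sanity check of the formula, here is a small sketch on hand-picked 2D vectors (made up for illustration; they are not word vectors). Vectors pointing the same way should score 1, orthogonal vectors 0, and opposite vectors -1. The function is repeated so that the snippet runs on its own.

```python
import numpy as np

def cosine(v1, v2):
    # Cosine of the angle between v1 and v2; 0 if either vector is zero.
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm_product > 0:
        return np.dot(v1, v2) / norm_product
    return 0.0

# Illustrative vectors (assumptions, not taken from the model).
right = np.array([1.0, 0.0])
up = np.array([0.0, 3.0])
left = np.array([-2.0, 0.0])

same_direction = cosine(right, 5 * right)   # length does not matter
orthogonal = cosine(right, up)
opposite = cosine(right, left)
zero_case = cosine(right, np.zeros(2))      # falls back to 0

print(same_direction, orthogonal, opposite, zero_case)
```

Note that cosine similarity ignores vector length: only the direction matters.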

Test your function by computing similarities of some random pairs of words, e.g. $dog$ and $puppy$ vs. $dog$ and $kitten$. 

In [20]:
cosine(nlp('dog').vector, nlp('puppy').vector)

0.8107667

In [21]:
cosine(nlp('dog').vector, nlp('kitten').vector)

0.651503

In [22]:
cosine(nlp('dog').vector, nlp('coffee').vector)

0.20769283

In [24]:
cosine(nlp('barcelona').vector, nlp('spain').vector)

0.46489927

## **Loading the text**

Let's load the full text of *Alice in Wonderland*. It will serve as our corpus of English words.

In [25]:
import requests

# Alice in Wonderland
response = requests.get('https://www.gutenberg.org/files/11/11-0.txt')

# If you prefer Dracula, load this instead:
#response = requests.get('https://www.gutenberg.org/cache/epub/345/pg345.txt')

# Collect the unique alphabetic tokens from the text
doc = nlp(response.text)
tokens = list({w.text for w in doc if w.is_alpha})
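
If curly quotes in the downloaded text show up as stray characters such as `â`, a likely cause is that UTF-8 bytes were decoded with a single-byte encoding. The sketch below reproduces the effect on a hard-coded string, without any network access. With `requests`, one common remedy is to set `response.encoding = 'utf-8'` before reading `response.text`, assuming the server really sends UTF-8. Doing so may change the tokenization and therefore the exact counts and rankings shown in this notebook.

```python
# A curly-quoted phrase, encoded as UTF-8 the way a server might send it.
raw_bytes = "“it’s”".encode("utf-8")

# Decoding with the wrong (single-byte) codec produces mojibake.
garbled = raw_bytes.decode("latin-1")

# Decoding with the right codec restores the original text.
restored = raw_bytes.decode("utf-8")

print(repr(garbled))
print(restored)
```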

Check out the content of $\texttt{tokens}$ now.

In [31]:
len(tokens)

3137

Define a function that takes a word vector and lists the $n$ most similar words in our corpus.

In [29]:
def spacy_closest(tokens, new_vec, n=10):
    return sorted(tokens,
                  key=lambda x: cosine(new_vec, nlp(x).vector),
                  reverse=True)[:n]
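
The ranking logic itself does not depend on spaCy. A minimal sketch on a hypothetical toy vocabulary (the 2D vectors and the names `toy_vectors` and `closest` are invented for illustration) shows the same sort-by-cosine idea:

```python
import numpy as np

def cosine(v1, v2):
    # Cosine of the angle between v1 and v2; 0 if either vector is zero.
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm_product > 0:
        return np.dot(v1, v2) / norm_product
    return 0.0

# Hypothetical 2D "word vectors", chosen by hand for illustration only.
toy_vectors = {
    "tea":    np.array([0.9, 0.1]),
    "coffee": np.array([1.0, 0.2]),
    "dog":    np.array([0.1, 1.0]),
    "puppy":  np.array([0.2, 0.9]),
}

def closest(vectors, query_vec, n=2):
    # Rank every word by cosine similarity to the query, highest first.
    return sorted(vectors,
                  key=lambda w: cosine(query_vec, vectors[w]),
                  reverse=True)[:n]

nearest_to_coffee = closest(toy_vectors, toy_vectors["coffee"])
print(nearest_to_coffee)
```

As in the real function, the query word itself comes first whenever it is part of the vocabulary.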

Try to find words similar to some random words, e.g. $good$.

In [30]:
spacy_closest(tokens, nlp('good').vector)

['good',
 'great',
 'bad',
 'excellent',
 'nice',
 'better',
 'wonderful',
 'pleasant',
 'wise',
 'happy']

In [32]:
spacy_closest(tokens, nlp('coffee').vector)

['tea',
 'toffee',
 'drink',
 'wine',
 'kettle',
 'custard',
 'bread',
 'meal',
 'milk',
 'bottle']

You can also get creative and search for combinations of words. For example, what is similar to $king - man + woman$? 

In [33]:
new_vec = nlp('king').vector - nlp('man').vector + nlp('woman').vector
spacy_closest(tokens, new_vec)

['king',
 'throne',
 'courtiers',
 'royal',
 'crown',
 'King',
 'conquest',
 'Queen',
 'father',
 'usurpation']

In [34]:
new_vec = nlp('king').vector - nlp('queen').vector + nlp('girl').vector
spacy_closest(tokens, new_vec)

['king',
 'man',
 'girl',
 'boy',
 'woman',
 'father',
 'person',
 'friend',
 'child',
 'someone']
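
In both outputs above the query word `king` itself ranks first, which often happens because the combined vector stays close to its inputs. A common adjustment is to drop the input words from the candidates before ranking. The sketch below illustrates this on hypothetical hand-made 2D vectors; the vectors and the helper `closest_excluding` are assumptions for illustration, not part of spaCy.

```python
import numpy as np

def cosine(v1, v2):
    # Cosine of the angle between v1 and v2; 0 if either vector is zero.
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm_product > 0:
        return np.dot(v1, v2) / norm_product
    return 0.0

# Hypothetical toy vectors, invented for this example.
toy = {
    "king":  np.array([1.0, 0.3]),
    "queen": np.array([0.9, -0.3]),
    "man":   np.array([0.1, 0.3]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([-0.2, 1.0]),
}

def closest_excluding(vectors, query_vec, exclude=(), n=1):
    # Rank candidates by cosine similarity, skipping the excluded words.
    candidates = [w for w in vectors if w not in exclude]
    return sorted(candidates,
                  key=lambda w: cosine(query_vec, vectors[w]),
                  reverse=True)[:n]

query = toy["king"] - toy["man"] + toy["woman"]

with_inputs = closest_excluding(toy, query)
without_inputs = closest_excluding(toy, query, exclude=("king", "man", "woman"))
print(with_inputs, without_inputs)
```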

## **Sentence vectors**

We can also construct a vector representation for a whole sentence. For example, we can define it as the *average* of the vectors representing the words in it.

Let's take the sentence *My favorite food is strawberry ice cream* and construct its vector representation.

In [35]:
sent = nlp('My favorite food is strawberry ice cream.')

sentv = 0
for w in sent:
    sentv += w.vector
sentv /= len(sent)
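
The loop above computes an element-wise mean. A minimal sketch with made-up 3D token vectors (stand-ins for `w.vector`) shows that it matches `np.mean` over the list of vectors. In spaCy, `sent.vector` is documented to default to the average of the token vectors, so it should give a comparable result for this model.

```python
import numpy as np

# Made-up stand-ins for the token vectors of a three-token sentence.
token_vectors = [
    np.array([1.0, 0.0, 2.0]),
    np.array([3.0, 4.0, 0.0]),
    np.array([2.0, 2.0, 1.0]),
]

# Accumulate and divide, as in the loop over tokens.
loop_mean = np.zeros(3)
for vec in token_vectors:
    loop_mean += vec
loop_mean /= len(token_vectors)

# The same average in a single call.
stacked_mean = np.mean(token_vectors, axis=0)

print(loop_mean, stacked_mean)
```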

Let's also extract sentences (as opposed to individual words) from our corpus:

In [36]:
sents = list(doc.sents)

Define a function that takes a sentence vector and lists the $n$ most similar sentences from our corpus.

In [37]:
def spacy_closest_sent(sentences, input_vec, n=10):
    return sorted(sentences,
                  key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vec),
                  reverse=True)[:n]

Let's try it out!

In [38]:
for s in spacy_closest_sent(sents, sentv, n=10):
    print(s)
    print('\n---')

This
is the driest thing I know.

---
Oh
my fur and whiskers!

---
Alice, “it would be of very little use without my shoulders.

---
The Mouse did not
answer, so Alice went on eagerly: “There is such a nice little dog near
our house I should like to show you!

---
“I don’t even know what a Mock Turtle is.”

“It’s the thing Mock Turtle Soup is made from,” said the Queen.



---
And she’s such a capital one for catching mice you
can’t think!

---
Soup
does very well without—Maybe it’s always pepper that makes people
hot-tempered,” she went on, very much pleased at having found out a new
kind of rule, “and vinegar that makes them sour—and camomile that makes
them bitter—and—and barley-sugar and such things that make children
sweet-tempered.

---
“Come, there’s half my plan
done now!

---
she knows such a
very little!

---
“I’ve seen a good many little girls in my time, but never
_one_ with such a neck as that!

---


In [39]:
a = np.ones(10)
b = np.zeros(10)

In [41]:
a

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [42]:
b

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

0.0

## **References**

This notebook is inspired by a [tutorial by Allison Parrish](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469).