# Word Embeddings

## Quick Set Up

We're going to be using a library called SpaCy later in this notebook. To make things easier, we're going to download and install the model now.

If you're running locally, run both steps below one after another.

For colab:

All you need to do is run the following cell. Then, when it's finished running, restart the runtime above.

In [0]:
! python -m spacy download en_core_web_md

Then once you've done the above steps, run the cell below:

In [0]:
import spacy
nlp = spacy.load('en_core_web_md')

## Introduction

In the first notebook in this series, we looked at different ways of representing text. We showed that a string could be represented as an array of words and their frequencies and that this was a useful way of analysing text.

Whilst our Bag of Words (BoW) models were incredibly quick and a useful tool for getting started, the trade-off is that they fail to capture a lot of the information within our text.

Pause here and have a think about how we created the bag of words models and see if you can work out what information we are missing.
  
BoW models are created by assigning dummy variables to indicate the frequency of a word's occurrence in a text. Imagine creating a BoW model for the following word list:

`W = ['great', 'ghost', 'good', 'ghoul', 'grenade', 'gave']`

You would get something like this:

_  | Great | Ghost | Good | Ghoul | Grenade | Gave
--- | --- | --- | --- | --- | --- | ---
 Great| 1 | 0 | 0 | 0 | 0 | 0
 Ghost | 0 | 1 | 0 | 0 | 0 | 0
 Good | 0 | 0 | 1 | 0 | 0 | 0
 Ghoul | 0 | 0 | 0 | 1 | 0 | 0
 Grenade | 0 | 0 | 0 | 0 | 1 | 0
Gave | 0 | 0 | 0 | 0 | 0 | 1

If we focus on the top three rows, we can see that the vectors (shown in each row) for great, ghost and good are all equally dissimilar to each other:

_  | Great | Ghost | Good | Ghoul | Grenade | Gave 
--- | --- | --- | --- | --- | --- | ---
**Great**| **1** | **0** | **0** | **0** | **0** | **0**
 **Ghost** | **0** | **1** | **0** | **0** | **0** | **0**
 **Good** | **0** | **0** | **1** | **0** | **0** | **0**
 Ghoul | 0 | 0 | 0 | 1 | 0 | 0
 Grenade | 0 | 0 | 0 | 0 | 1 | 0
 Gave | 0 | 0 | 0 | 0 | 0 | 1

To you or me, however, it is readily apparent that 'good' and 'great' are very similar and 'ghost' and 'ghoul' are quite similar, whilst none should be similar with 'gave' or 'grenade'.
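To make the point concrete, here is a minimal sketch that rebuilds the table above with NumPy and uses the plain dot product as a rough overlap score. The variable names are our own, chosen for illustration only.

```python
import numpy as np

words = ["great", "ghost", "good", "ghoul", "grenade", "gave"]

# Each row of the identity matrix is the one-hot vector for one word,
# exactly like the rows of the table above.
one_hot = np.eye(len(words))

# Dot product of every vector with every other vector.
overlap = one_hot @ one_hot.T

great, ghost, good = one_hot[0], one_hot[1], one_hot[2]
print("great . good  =", np.dot(great, good))
print("great . ghost =", np.dot(great, ghost))
print("great . great =", np.dot(great, great))
```

Every pair of different words scores the same, so this representation cannot tell us that 'good' is closer to 'great' than it is to 'ghost'.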

The second issue that we face is that from this list we have no idea about how the original sentence was constructed. Presuming stop words were removed beforehand, the sentence could have been:

`The good ghost gave a grenade to the great ghoul.`

It also could have been many other combinations. This becomes especially troublesome when words could have had different meanings in different places. E.g. was it 'good' as in not evil, or 'good' as in skilled at something?
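As a small illustration of the lost word order, the sketch below builds a bag of words for two different sentences and compares them. The stop word set and the `bag_of_words` helper are simplified assumptions made for this example, not part of any library.

```python
from collections import Counter

# A tiny, hand-picked stop word list (an assumption for this example).
stop_words = {"the", "a", "to"}

def bag_of_words(sentence):
    """Lower-case, strip the final full stop, drop stop words, count the rest."""
    tokens = sentence.lower().rstrip(".").split()
    return Counter(t for t in tokens if t not in stop_words)

original = "The good ghost gave a grenade to the great ghoul."
shuffled = "The great ghoul gave a grenade to the good ghost."

# Different sentences, identical bags of words.
same_bag = bag_of_words(original) == bag_of_words(shuffled)
print(same_bag)
```

The two sentences mean different things, yet their BoW representations are indistinguishable.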

In summary, BoW models fail to capture any information about:

* How similar a word is to another
* Contextual meaning of a word
      
---

The solution to this problem is to create word vectors.

Word vectors solve these problems as well as being easy-to-use and interpretable. They can be defined as:

*A vector of real-valued numbers where each dimension captures an aspect of the word’s meaning.* More recently, this has expanded to include the idea that *semantically similar words have similar vectors.*

What makes this approach even better is that by thinking of words as vectors, we will be able to use mathematical operators on them... more on this later.

First, let's look at how vectors achieve similarity and context.

## Word Vectors and Similarity

Creating word embeddings is a method of vector representation that captures information about the context of a word, not just the word itself. For example, we would expect similar words to have similar vectors. Take the following two sentences:

1. Yesterday I **read** a book
2. I will **study** that paper

Here the words read and study convey the same information and their vector representation should capture this information.

We can visualise this more easily if we take a larger number of words and see how they compare along some imaginary dimensions.

![Table of words](https://auquan-public-data.s3.ap-south-1.amazonaws.com/Screenshot+2020-03-24+at+13.37.10.png)

We can see in this table that certain words have similar vectors - such as those for NFL and NBA, or England and India. This makes sense on a conceptual level too, as these words refer to similar things.

One part of this table is misleading, however. We've labelled the dimensions by giving them clearly defined meanings, but this isn't real. In real life, the vector dimensions don't each code for a tangible feature that we would understand. They are abstractions. Still, imagining that they have a real meaning is helpful for building an intuitive understanding of them.

So how do we capture this information in practice? On a technical level, word embeddings have the following properties:

1. Each word in the vocabulary is represented by a low dimensional vector (~ 300d)
2. All words are embedded into the same space
3. Similar words have similar vectors (= their vectors are close to each other in the vector space)
4. Word embeddings are successfully used for various NLP applications

Let's make it real with some examples:

We're going to plot the embeddings of all the words in the table and see how they relate. Since the actual embeddings have around 300 dimensions, we'll be reducing them to 2 for visualization.
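To give a feel for what reducing 300 dimensions to 2 involves, here is a rough NumPy-only sketch of the idea behind PCA: centre the data, find the directions of greatest variance, and project onto the top two. The random `fake_vectors` are stand-ins we made up for illustration, not real embeddings, and the result may differ from scikit-learn's output by a sign flip on each axis.

```python
import numpy as np

# 17 made-up "word vectors" with 300 dimensions each (random stand-ins).
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(17, 300))

# Step 1: centre each dimension on zero.
centred = fake_vectors - fake_vectors.mean(axis=0)

# Step 2: SVD gives the principal directions as the rows of vt,
# ordered from most to least variance.
_, _, vt = np.linalg.svd(centred, full_matrices=False)

# Step 3: project every vector onto the top two directions.
reduced = centred @ vt[:2].T

print(fake_vectors.shape, "->", reduced.shape)
```

Each word now has just an x and a y coordinate, which is what lets us draw it on a scatter plot.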

In [0]:
from sklearn.decomposition import PCA
import numpy as np

# a string with all the words we'll be looking at
sentence = "cricket baseball football handball NFL NBA basketball spectator superbowl stadium rugby england india tennis badminton court squash"
# creating tokens from the string
tokens = nlp(sentence)
# keep only the tokens that have a vector, so labels and vectors stay aligned
words = [word for word in tokens if word.has_vector]
labels = [word.text for word in words]
# creating vectors for each of the words
vectors = np.vstack([word.vector for word in words])
# reducing dimensions using PCA
pca = PCA(n_components=2, random_state=0)
# project the vectors down to 2 dimensions
vecs_transformed = pca.fit_transform(vectors)
# attach the labels (np.c_ turns every column into strings)
vecs_transformed = np.c_[labels, vecs_transformed]

In [0]:
import matplotlib.pyplot as plt

# Get x and y coordinates
x_coords = np.array([float(x[1]) for x in vecs_transformed])
y_coords = np.array([float(x[2]) for x in vecs_transformed])
# display scatter plot
plt.scatter(x_coords, y_coords)

# add labels
for label, x, y in vecs_transformed:
    plt.annotate(label, xy=(float(x), float(y)), xytext=(0, 0), textcoords='offset points')
# pad the axis limits slightly so points at the edges aren't clipped
plt.xlim(x_coords.min() - 0.5, x_coords.max() + 0.5)
plt.ylim(y_coords.min() - 0.5, y_coords.max() + 0.5)
plt.show()

**Exercise:** Find word embeddings and look at their plots for the following words: dogs, cats, sparrow, eagle, chicken, turkey, sheep, cow, crocodile, lion, shark, fish, whale. Do you see any patterns?

In [0]:
# write your answer here
# Hint: You need to do exactly what we did in our example

#### Measuring Similarity

Let's take another list of simple words, `dog cat banana`:

In [0]:
doc = nlp("dog cat banana")

for token in doc:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

# has_vector: whether there is a word vector for that token
# token.vector_norm: norm of the vector
# token.is_oov: is the token out of vocabulary
# You can find more info here: https://spacy.io/api/token

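It might be tempting to read similarity off the vector_norm values, but the norm only measures the length of each vector. To check our intuition that `cat` and `dog` are much more similar to one another than either one is to `banana`, we need a proper similarity measure.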

But what do we mean by similarity? And what does this look like in a mathematical context?

First, let's actually compute the similarity values:

In [0]:
import pandas as pd

doc = nlp("dog cat banana")

n = len(doc)
arr = np.zeros((n, n))

for i, token1 in enumerate(doc):
    for j, token2 in enumerate(doc):
        arr[i, j] = token1.similarity(token2)
        print('token1: %s , token2: %s, similarity: %s'%( token1.text, token2.text, token1.similarity(token2)) )

In [0]:
# Let's plot the similarity measures
import seaborn as sn

sims = pd.DataFrame(arr, index=[x.text for x in doc], columns=[x.text for x in doc])
sn.heatmap(sims, annot=True, fmt='g')

So we appear to have been right. `cat` and `dog` are quite similar to each other, and `banana` isn't similar to either. But what do these numbers represent?

The similarity defined here is actually the cosine similarity between the two vectors. It is given by the dot product of the two vectors divided by the product of their lengths (Euclidean norms). Mathematically:

$$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$$

In simple terms: **the sum of the elementwise products of the two vectors after each has been normalized to unit length**. Here are some properties of this measure:

1. The value can vary between -1 and 1
2. The cosine similarity of a vector with itself is 1 (which makes sense, as a vector is most similar to itself)
3. If the angle between two vectors is less than 90°, their cosine similarity is > 0; if it is more than 90°, it is < 0. Vectors pointing in exactly opposite directions have a similarity of -1
4. If the cosine similarity is zero, the vectors are perpendicular (orthogonal) to each other

You can read more about that here: https://en.wikipedia.org/wiki/Cosine_similarity
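If you'd like to check these properties yourself, here is a minimal NumPy sketch of the formula above. The `cosine_similarity` helper is our own illustration; it is not SpaCy's implementation, and the small vectors are made up for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])

print(cosine_similarity(v, v))            # a vector with itself
print(cosine_similarity(v, 2 * v))        # same direction, different length
print(cosine_similarity(v, -v))           # exactly opposite direction
print(cosine_similarity([1, 0], [0, 1]))  # perpendicular vectors
```

Notice that scaling a vector doesn't change the result: cosine similarity only cares about direction, not length.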


**Exercise:** Try to find the similarity between all pairs of words in the sentence: `cricket baseball football handball NFL NBA`

In [0]:
# write your answer here

#### Hint:


1. Convert to spacy doc
2. Iterate over all the tokens in a nested for loop
3. Use `token1.similarity(token2)` to get the similarity measure, just like we did earlier

## Capturing Context

In this section, we will look at how vector representations gain their contextual understanding about a word. 

In the previous section we were able to clearly show that similar words have similar vectors. This isn't possible for the context component. Instead we're going to tell you how the vectors are created. Hopefully this will help demonstrate why they must contain contextual information.

The core principle behind the design of modern word vector algorithms is **The Distributional Hypothesis**. This has two main ideas:
* Words that occur in the same contexts tend to have similar meanings (Harris, 1954)
* “You shall know a word by the company it keeps” (Firth, 1957)

We're going to look at a model called word2vec as our demonstration. (Newer models build upon this foundation by adding more complicated statistics.)

Word2Vec uses the context/neighbors of a word to compute the embedding for each of the words. There are two approaches that've been used in the original paper. Without going into much detail let's discuss the intuition behind each one:

1. CBOW: Continuous Bag of words
2. Skip Grams

**CBOW**

Suppose you have a large amount of text. You're reading it with some window size of, say, 5 words. At each step in moving this window, you try to predict the middle word using its neighbors.

For example, let's look at the sentence: 

`Buffaloes are one of the strongest and most dangerous animals in all of Africa`

For a window size of 5, you'll initially be reading (after removing the stop words): `Buffaloes one strongest dangerous animals`.
In CBOW you'll try to predict the vector for the middle word (in this case `strongest`) using the vectors of all the other words in the window. Once you've made your guess, you learn from it by making updates to all of the other vectors.

In the next step you'll consider `one strongest dangerous animals Africa` and try to predict `dangerous` using all the other words.
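The sliding window described above can be sketched in a few lines of plain Python. This only builds the (context words, target word) training examples; it does not train any vectors. The `cbow_pairs` helper and the pre-filtered token list are our own assumptions for illustration.

```python
def cbow_pairs(tokens, window_size=5):
    """Slide a window over the tokens and pair the neighbours with the middle word."""
    half = window_size // 2
    pairs = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        target = window[half]                       # the middle word
        context = window[:half] + window[half + 1:]  # everything else
        pairs.append((context, target))
    return pairs

# The example sentence after stop word removal, as in the text above.
tokens = ["Buffaloes", "one", "strongest", "dangerous", "animals", "Africa"]

for context, target in cbow_pairs(tokens):
    print(context, "->", target)
```

A real CBOW model would take each of these pairs, make a prediction for the target from the context vectors, and nudge the vectors to reduce the error.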

**Skip Grams**

The process is quite similar to the CBOW but in this case we try to predict the neighbors using the central word in each iteration of the moving window.

So for our original words (`Buffaloes one strongest dangerous animals`), we would drop everything apart from strongest and try and predict the outer words.
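Here is the same sliding-window idea turned around for Skip Grams, again as a plain Python sketch that only builds training examples. The `skipgram_pairs` helper is hypothetical, and it follows the simplified description in this notebook (one fixed window per step) rather than any particular library's implementation.

```python
def skipgram_pairs(tokens, window_size=5):
    """Slide a window over the tokens and pair the middle word with each neighbour."""
    half = window_size // 2
    pairs = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        centre = window[half]
        for position, neighbour in enumerate(window):
            if position != half:
                # input: the centre word, output to predict: one neighbour
                pairs.append((centre, neighbour))
    return pairs

# The example sentence after stop word removal, as in the text above.
tokens = ["Buffaloes", "one", "strongest", "dangerous", "animals", "Africa"]

for centre, neighbour in skipgram_pairs(tokens):
    print(centre, "->", neighbour)
```

Compared with CBOW, each window now produces several small prediction tasks instead of one.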

## Creating Word Vectors in Practice
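To create word embeddings we can use either SpaCy or Gensim. Both libraries can work with the models we describe below. We're going to continue using SpaCy as it's useful for the next notebook. Note that SpaCy doesn't train embeddings for us here: it looks up pretrained vectors that ship with the `en_core_web_md` model we downloaded earlier (the algorithm used to train them depends on the model version).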

**Different Word Vector Models:**

The most commonly used word embeddings are: **word2vec** and **GloVe**, both used extensively and giving similar results. 

`Word2Vec` was developed by Tomas Mikolov and colleagues at Google in 2013 and introduced the use of CBOW and Skip Grams. It's considered to produce high quality embeddings (vector representations) and to do so reasonably quickly.

`GloVe` builds on similar ideas to `Word2Vec` but introduces global co-occurrence statistics into the language modelling process. This can create much richer embeddings.
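To illustrate what "global statistics" means, the sketch below counts how often pairs of words appear near each other across a whole (tiny, made-up) corpus. This is only the counting step; GloVe then fits vectors to statistics like these, which is not shown here. The `cooccurrence_counts` helper and the window of 2 are assumptions for the example.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of another."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# A toy corpus invented for this example.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(corpus)

print(counts[("sat", "on")])   # appear together in both sentences
print(counts[("cat", "dog")])  # never appear in the same sentence
```

Unlike the sliding-window training of Word2Vec, these counts summarise the whole corpus at once.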

There are several other noteworthy models that you can experiment with. `BERT` and `ELMo` are both more recent advances that create different embeddings for the same word in different contexts.

We're not going to look into how they each work, but rather, we're going to focus on how we can make use of them.

For this notebook we're going to use the pretrained vectors that come with our SpaCy model, but you're encouraged to try other embeddings and explore any differences you may find.

Let's look at actual normalized vectors first:

In [0]:
# get the vector and normalize
word = 'man'

def getVec(word):
    # run the pipeline once and reuse the result
    doc = nlp(word)
    # dividing by the norm gives a unit-length vector
    return doc.vector / doc.vector_norm

vec = getVec(word)
print(vec)

In [0]:
# get similarity between the vectors
import numpy as np

# directly using spacy
print(nlp('car').similarity(nlp('truck')))

# reproducing the results yourself
print(np.dot( getVec('car'), getVec('truck')))

In [0]:
# get difference between two vectors
getVec('read') - getVec('study')

**Now answer the following questions**


#### **Exercise 1:** What is `man - woman`?

In [0]:
## get the normalized vectors and find the difference

##### *Example Solution:*

In [0]:
diff1 = getVec('man') - getVec('woman')

#### **Exercise 2:** What is `king - queen`?


In [0]:
## same as above

##### *Example Solution:*

In [0]:
diff2 = getVec('king') - getVec('queen')


#### **Exercise 3:** Are 1 and 2 close?


In [0]:
## get the cosine similarity between 1 and 2

##### *Example Solution:*

In [0]:
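# the difference vectors are not unit length, so divide by their norms
np.dot(diff1, diff2) / (np.linalg.norm(diff1) * np.linalg.norm(diff2))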

#### **Exercise 4:**  What are the similarities between `Japan and Tokyo`, between `India and New Delhi`? Are they similar?

In [0]:
## use everything you've learnt so far

##### *Example Solution:*

In [0]:
nlp('Japan').similarity(nlp('Tokyo')), nlp('India').similarity(nlp('New Delhi')) 