### Practical 3: Word Embedding
#### Ayoub Bagheri
<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />

In this practical we are going to apply different word embedding methods. For this purpose, we use the following packages:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np    
from sklearn.decomposition import PCA
from numpy import linalg as LA

# First install the gensim library

In this practical session we're going to use the [gensim](https://radimrehurek.com/gensim/) library. This library offers a variety of methods to read
in pre-trained word embeddings as well as train your own.

The website contains a lot of documentation, for example here: https://radimrehurek.com/gensim/auto_examples/index.html#documentation

If gensim isn't installed yet, you can use the following command:


In [2]:
# !pip install gensim

In [3]:
from gensim.test.utils import datapath

# Reading in a pre-trained model

1\. **Use the code below to load in a pre-trained GloVe model. Note: this can take around five minutes.**

See https://github.com/RaRe-Technologies/gensim-data for an overview of the models you can try. For example:

*   word2vec-google-news-300: word2vec trained on Google News. 1662 MB.
*   glove-twitter-200: GloVe trained on Twitter. 758 MB.

We're going to start with `glove-wiki-gigaword-300`, which
is a 376.1 MB download. These embeddings are trained on
Wikipedia (2014) and the Gigaword corpus, a large collection
of newswire text.

In [4]:
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-300')

# Exploring the vocabulary

2\. **How many words does the vocabulary contain?**

3\. **Is '*utrecht*' in the vocabulary?**

4\. **Print a word embedding.**

5\. **How many dimensions does this embedding have?**

6\. **Explore the embeddings for a few other words. Can you find words that are *not* in the vocabulary?**

(For example, think of uncommon words, misspellings, etc.)
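
While exploring, keep in mind that indexing the model with a word that is not in the vocabulary (`wv[word]`) raises a `KeyError`. Below is a minimal sketch of a safer lookup. The helper `safe_lookup` is hypothetical, and the plain dictionary `toy_wv` with made-up vectors only stands in for the loaded `wv` (both support `in` and `[]`).

```python
import numpy as np

def safe_lookup(vectors, word):
    """Return the embedding for `word`, or None if it is out of vocabulary."""
    # The membership test avoids the KeyError that vectors[word] would raise
    if word in vectors:
        return vectors[word]
    return None

# Toy stand-in for `wv`: made-up 3-dimensional vectors, not real GloVe values
toy_wv = {
    'utrecht': np.array([0.1, 0.3, -0.2]),
    'student': np.array([0.4, -0.1, 0.5]),
}

for word in ['utrecht', 'utrechtt']:
    vec = safe_lookup(toy_wv, word)
    if vec is None:
        print(f"{word!r} is not in the vocabulary")
    else:
        print(f"{word!r} has {vec.shape[0]} dimensions")
```

The same helper should work unchanged when you pass the real `wv` instead of `toy_wv`.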

# Vector arithmetic

7\. **Use the code below to calculate the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between two words.**

In [9]:
wv.similarity('university', 'student')

0.5970514

*Note*: cosine similarity is the same as the dot product between the word embeddings after each has been normalized to unit length.

In [10]:
wv_university_norm = wv['university']/ LA.norm(wv['university'], 2)
wv_student_norm = wv['student'] / LA.norm(wv['student'], 2)

wv_university_norm.dot(wv_student_norm)

0.5970514
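
To see why the two routes agree, here is a small worked sketch on made-up 2-dimensional vectors (not real embeddings). It also illustrates that cosine similarity ignores vector length: rescaling a vector leaves the similarity unchanged. The function name `cosine_similarity` is chosen here for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up vectors, each of length 5
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# Route 1: apply the definition directly
sim_direct = cosine_similarity(u, v)

# Route 2: normalize both vectors to unit length, then take the dot product
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
sim_normalized = np.dot(u_hat, v_hat)

# Both are 24 / 25 = 0.96, up to floating-point rounding
print(sim_direct, sim_normalized)

# Stretching u by a factor of 10 does not change the angle between u and v
sim_scaled = cosine_similarity(10 * u, v)
print(sim_scaled)
```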

# Similarity analysis

8\. **Print the top 5 most similar words to `car`.**

**Question**: What are the top 5 most similar words to *cat*? And to *king*? And to *fast*? What kinds of words often appear at the top?

Now calculate the similarity between pairs of words.

In [12]:
wv.similarity('buy', 'purchase')

0.77922326

In [13]:
wv.similarity('cat', 'dog')

0.68167466

In [14]:
wv.similarity('car', 'green')

0.25130013

We can calculate the cosine similarity for each pair in a list of word pairs and correlate these similarities with human ratings. One such dataset with human ratings is called WordSim353.

**Go to** https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv to get a sense of the data.


Gensim already implements a method to evaluate a word embedding model using this data. 
* It calculates the cosine similarity between each word pair
* It calculates both the Spearman and Pearson correlation coefficients between the cosine similarities and the human judgements

See https://radimrehurek.com/gensim/models/keyedvectors.html for a description of the methods.

In [None]:
wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
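
To make the two correlation measures concrete, here is a rough sketch on made-up numbers. It is not gensim's implementation: the ratings and similarities below are hypothetical, the helper names are chosen for illustration, and the simple ranking used for Spearman assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.

    argsort of argsort turns values into ranks 0..n-1 (valid without ties).
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

# Hypothetical data for five word pairs (not taken from WordSim353)
human_ratings = np.array([9.0, 7.5, 6.0, 2.0, 1.0])
model_sims = np.array([0.80, 0.75, 0.30, 0.35, 0.05])

r = pearson(human_ratings, model_sims)
rho = spearman(human_ratings, model_sims)
print(f"Pearson: {r:.3f}, Spearman: {rho:.3f}")
```

Spearman only looks at the ordering of the pairs, so it is less sensitive to the exact scale of the similarity scores than Pearson is.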

# Analogies

Man is to woman as king is to ...?

This can be converted into vector arithmetic:

```
king - ? = man - woman

king - man + woman = ?
```

In [None]:
wv.most_similar(negative=['man'], positive=['king', 'woman'])
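
The arithmetic behind this query can be sketched on a toy vocabulary. The 2-dimensional vectors below are made up so that one axis loosely encodes "royalty" and the other "gender"; the helper `analogy` is hypothetical and simplified. In particular, gensim's `most_similar` normalizes the vectors to unit length before combining them, which this sketch skips.

```python
import numpy as np

# Made-up 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, -0.8]),
    'man':   np.array([0.1, 0.8]),
    'woman': np.array([0.1, -0.8]),
    'apple': np.array([-0.7, 0.1]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Exclude the query words themselves, as they would otherwise rank highly
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(v, target)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# king - man + woman = ?
result = analogy(toy, positive=['king', 'woman'], negative=['man'])
print(result[0][0])  # queen
```

Excluding the query words matters: without it, the nearest neighbour of `king - man + woman` is often just one of the input words.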

france - paris + amsterdam = ?

In [None]:
wv.most_similar(negative=['paris'], positive=['france', 'amsterdam'])

Note that if we just retrieved the most similar words to '*amsterdam*', we would get a different result.

In [18]:
print(wv.most_similar(positive=['amsterdam'], topn=5))

[('rotterdam', 0.6485881209373474), ('schiphol', 0.5740087032318115), ('utrecht', 0.5608800053596497), ('netherlands', 0.5472348928451538), ('frankfurt', 0.5457332730293274)]


Cat is to cats as girl is to ...?

```
girl - ? = cat - cats
girl - cat + cats = ?
```

In [None]:
wv.most_similar(negative=['cat'], positive=['cats', 'girl'])

Compare against a baseline. What if we had just retrieved the most similar words to '*girl*'?

In [None]:
print(wv.most_similar(positive=['girl'], topn=5))

**Fun**: Try a few of your own analogies. Do you get the expected answers?

# Visualization

9\. **We can't visualize embeddings in their raw format, because of their high dimensionality. However, we can use dimensionality reduction techniques such as PCA to project them onto a 2D space. Use the code below to do this.**

In [21]:
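def display_scatterplot(wv, words=None, sample=0):
    # without an explicit word list, plot a random sample of the vocabulary
    # (wv.index_to_key assumes gensim 4 or newer)
    if words is None:
        if sample <= 0:
            raise ValueError("Pass a list of words, or set sample > 0")
        words = list(np.random.choice(wv.index_to_key, sample, replace=False))

    # first get the word vectors
    word_vectors = np.array([wv[w] for w in words])

    # project the vectors onto their first two principal components
    wv_PCA = PCA(n_components=2).fit_transform(word_vectors)

    plt.figure(figsize=(10,10))

    plt.scatter(wv_PCA[:,0], wv_PCA[:,1],
                edgecolors='k', c='r')

    # label each point, slightly offset so the text does not cover the marker
    for word, (x,y) in zip(words, wv_PCA):
        plt.text(x+0.05, y+0.05, word)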
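
For intuition about what the PCA step does, here is a rough NumPy-only sketch of the projection: center the data, find the directions of largest variance with an SVD, and keep the coordinates along the first two. The function `pca_2d` is hypothetical and the input is random data rather than real embeddings. scikit-learn's `PCA` may flip the sign of a component, so its plot can be mirrored relative to this sketch.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    # PCA works on centered data: subtract the mean of every dimension
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Coordinates of each row along the first two directions
    return X_centered @ Vt[:2].T

# Random stand-in for a small set of word vectors: 6 "words", 10 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))

coords = pca_2d(X)
print(coords.shape)  # (6, 2)
```

Because only two components are kept, some of the structure in the original space is lost, so distances in the plot are only an approximation of distances between the full embeddings.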



In [None]:
display_scatterplot(wv, 
                        ['dog', 'cat', 'dogs', 'cats', 'horse', 'tiger',
                         'university', 'lesson', 'student', 'students',
                         'netherlands', 'amsterdam', 'utrecht', 'belgium', 'spain', 'china',
                         'coffee', 'tea', 'pizza', 'sushi', 'sandwich',
                         'car', 'train', 'bike', 'bicycle', 'trains'])

**Question**: What do you notice in this plot? Do the distances between the words make sense? Any surprises? Feel free to add your own words!