# Word vector visualization with Gensim

Source: 😊[Day 12 - Special Data Types: Natural Language Processing](https://github.com/core-skills/12-text-processing) *repository*

> ☝️Before moving on with this notebook, ensure that you have:
- Downloaded the *glove.6B.50d.txt* embeddings and placed them in the `../data` directory, which is where the loading cell reads them from. If not, [download](http://nlp.stanford.edu/data/glove.6B.zip) the archive and extract it there before continuing.

**Overview**:
In this notebook we will explore word vectors using the [Gensim](https://radimrehurek.com/gensim/) library with pretrained [GloVe vectors](https://nlp.stanford.edu/projects/glove/). Gensim allows us to convert a file of GloVe vectors into word2vec format. The notebook loads the 50d GloVe embeddings; the same download also includes other dimensions, such as 100 and 300.

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2017/06/06062705/Word-Vectors.png" alt="" style="width:800px;"/>

[analyticsvidhya](https://cdn.analyticsvidhya.com/wp-content/uploads/2017/06/06062705/Word-Vectors.png) 

**Supplementary Content**: Check out https://rare-technologies.com/word2vec-tutorial/ for an interactive web-based application that lets you play with different functionalities of word embeddings.

Adapted from: *CS224n: Natural Language Processing with Deep Learning*

# Table of Contents
1. [Word Vectors](#word_vectors)
2. [Word Similarities](#word_similarities)
3. [Word Analogies](#word_analogies)
4. [Visualising Word Vectors](#vector_visualisation)

### Import Dependencies
- [numpy](https://numpy.org/) - library that we will use for helping visualise word vectors
- [matplotlib](https://matplotlib.org/) - library that we will use for plotting the data
- [gensim](https://radimrehurek.com/gensim/) - library that we will use to experiment with word embeddings/vectors
- [sklearn](https://scikit-learn.org/) - library that we will use for performing dimensionality reduction to help visualise word vectors

In [None]:
import pprint
from typing import List
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

from sklearn.decomposition import PCA

### Set up the notebook environment and load helper functions

In [None]:
# Makes printing lists prettier 
pp = pprint.PrettyPrinter(indent=2)

In [None]:
# Render Matplotlib plots inline in the notebook
%matplotlib inline
plt.style.use('ggplot')

In [None]:
def prettify_similarities(similarities: List[tuple]) -> str:
    '''Formats a list of (word, similarity) tuples produced by Gensim as a numbered, aligned string.
    '''
    # Pad every word to the width of the longest one so the percentages line up
    longest_str = max(len(sim[0]) for sim in similarities)
    return "\n".join(f'{idx+1}.\t{sim[0]:{longest_str+1}}\t{sim[1]*100:0.1f}%' for idx, sim in enumerate(similarities))

### Load pretrained word embeddings from disk

In [None]:
data_path = Path('../data/glove.6B.50d.txt').resolve()

In [None]:
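# Load embedding model
# GloVe text files lack the "<vocab size> <dimensions>" header line that the word2vec
# text format expects, so we convert to a temporary word2vec-format file and load that.
glove_file = str(data_path)  # data_path is already absolute, so gensim's datapath() helper is not needed
word2vec_glove_file = get_tmpfile("glove.6B.50d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)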
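
To make the conversion step less of a black box: as far as we understand it, the two text formats differ only in that word2vec files start with a header line giving the vocabulary size and the vector dimension. Below is a minimal sketch of that idea in plain Python. The helper `add_word2vec_header` and the three-word toy file are hypothetical and the values are made up; this is not Gensim's implementation, and it reads the whole file into memory, which is only reasonable for small files.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def add_word2vec_header(glove_path: Path, out_path: Path) -> Tuple[int, int]:
    """Copy a GloVe text file to out_path, prepending a '<count> <dim>' header line."""
    lines = glove_path.read_text(encoding="utf-8").splitlines()
    n_words = len(lines)
    # Each line looks like: word v1 v2 ... vd, so the dimension is the field count minus one
    dim = len(lines[0].split(" ")) - 1
    out_path.write_text(f"{n_words} {dim}\n" + "\n".join(lines) + "\n", encoding="utf-8")
    return n_words, dim


# Toy file with made-up values, standing in for a real GloVe file
tmp_dir = Path(tempfile.mkdtemp())
toy_glove = tmp_dir / "toy_glove.txt"
toy_glove.write_text("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\ndog 0.7 0.8 0.9\n", encoding="utf-8")

toy_word2vec = tmp_dir / "toy_word2vec.txt"
header = add_word2vec_header(toy_glove, toy_word2vec)
print(header)
print(toy_word2vec.read_text(encoding="utf-8").splitlines()[0])
```

The first line of the converted file is the header; every following line is unchanged from the GloVe input.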

## Word Vectors <a name="word_vectors"></a>
Now that we have loaded the pre-trained word embedding model, let's unpack it. 

In [None]:
test_word = 'france'

In [None]:
# Numerical representation of a word
# Note: looking up an out-of-vocabulary word (i.e. one not in the corpus the model was trained on) raises an error
model[test_word]

In [None]:
# Checking out the shape of the embeddings that are produced for a given word
model[test_word].shape
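
Because out-of-vocabulary lookups fail, it can help to check membership before indexing. The sketch below uses a plain dictionary with made-up values as a stand-in for the embedding model; `toy_model` and `lookup` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional

# Made-up 3-d vectors standing in for the real embedding model
toy_model: Dict[str, List[float]] = {
    "france": [0.1, 0.2, 0.3],
    "paris": [0.1, 0.25, 0.3],
}


def lookup(word: str) -> Optional[List[float]]:
    """Return the vector for word, or None if the word is out of vocabulary."""
    if word in toy_model:
        return toy_model[word]
    return None


print(lookup("france"))
print(lookup("frnace"))  # misspelt, so not in the vocabulary
```

The same `in` check works on the real `model`, since Gensim's `KeyedVectors` supports membership tests.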

## Word Similarities<a name="word_similarities"></a>

Using the pre-trained word vectors, we can perform simple distance operations, such as finding similar words, i.e. those that are closest in vector space.
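
"Closest" here means highest cosine similarity, which is the measure Gensim's `most_similar` ranks by. Here is a small sketch of the idea with NumPy. The 3-d vectors are made up for illustration, and `toy_vectors` and `cosine_similarity` are our own names, not Gensim's.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v; 1.0 means they point the same way."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up 3-d vectors, for illustration only (real GloVe vectors have 50+ dimensions)
toy_vectors = {
    "cat": np.array([1.0, 0.9, 0.1]),
    "dog": np.array([0.9, 1.0, 0.2]),
    "car": np.array([0.1, 0.2, 1.0]),
}

# Rank every other word by its similarity to the query word
query = "cat"
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

With these toy values "dog" ranks above "car" for the query "cat", because its vector points in nearly the same direction.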

In [None]:
print(prettify_similarities(model.most_similar('obama')))

In [None]:
print(prettify_similarities(model.most_similar('gold')))

In [None]:
print(prettify_similarities(model.most_similar('apple')))

In [None]:
print(prettify_similarities(model.most_similar(negative=['apple'])))

## Word Analogies - "A is to B as C is to?"<a name="word_analogies"></a>

Recall: *king - man + woman = queen*

<img src="https://mlwhiz.com/images/word2vec.png" alt="Word analogy example" style="width: 600px;"/>

In [None]:
print(prettify_similarities(model.most_similar(positive=['woman', 'king'], negative=['man'])))
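
To see the arithmetic behind *king - man + woman*, here is a toy sketch with made-up 2-d vectors (first axis roughly "royalty", second roughly "gender"). `toy_vectors` and `toy_analogy` are hypothetical names. Unlike Gensim, this sketch does not normalise the vectors before combining them, so treat it as an illustration of the idea rather than a reimplementation of `most_similar`.

```python
import numpy as np

# Made-up 2-d vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.1, 1.0]),
    "woman": np.array([0.1, -1.0]),
    "prince": np.array([0.8, 0.9]),
}


def toy_analogy(x1: str, x2: str, y1: str) -> str:
    """Return the word closest (by cosine) to x2 - x1 + y1, skipping the input words."""
    target = toy_vectors[x2] - toy_vectors[x1] + toy_vectors[y1]
    best_word, best_score = None, -np.inf
    for word, vec in toy_vectors.items():
        if word in (x1, x2, y1):
            continue  # the inputs themselves are not valid answers
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# "man is to king as woman is to ?"
print(toy_analogy("man", "king", "woman"))
```

Skipping the input words matters: without it, the nearest neighbour of the target is often one of the words we started from.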

In [None]:
def analogy(x1: str, x2: str, y1: str):
    '''Finds missing word from partial analogy'''
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    print(f'{x1} is to {x2} as {y1} is to \033[1m{result[0][0]}\033[0m')

In [None]:
analogy('japan', 'japanese', 'australia')

In [None]:
analogy('australia', 'beer', 'france')

In [None]:
analogy('obama', 'clinton', 'reagan')

In [None]:
analogy('tall', 'tallest', 'long')

In [None]:
analogy('good', 'fantastic', 'bad')

In [None]:
analogy('gold', 'copper', 'oil')

## Find the odd word out

In [None]:
def find_odd_one_out(words: List[str]) -> None:
    '''Prints the list of words with the odd one out in bold'''
    assert isinstance(words, list)
    odd_one = model.doesnt_match(words)
    # Wrap the odd word in ANSI escape codes so it prints in bold
    words_marked = " ".join([word if word != odd_one else f'\033[1m{word}\033[0m' for word in words])
    print(words_marked)
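
One simple way to pick an odd one out is to average the unit-length vectors of all the words and return the word least similar to that average. We believe Gensim's `doesnt_match` works along these lines, but that is an assumption. The sketch below uses made-up 2-d vectors, and `toy_odd_one_out` is a hypothetical helper.

```python
import numpy as np
from typing import List

# Made-up 2-d vectors: three "meal" words and one unrelated word
toy_vectors = {
    "breakfast": np.array([1.0, 0.1]),
    "lunch": np.array([0.9, 0.2]),
    "dinner": np.array([1.0, 0.2]),
    "oil": np.array([0.1, 1.0]),
}


def toy_odd_one_out(words: List[str]) -> str:
    """Return the word whose unit vector is least aligned with the group mean."""
    unit = np.array([toy_vectors[w] / np.linalg.norm(toy_vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    # Dot product with the mean; dividing by the mean's norm would not change the ranking
    scores = unit @ mean
    return words[int(np.argmin(scores))]


print(toy_odd_one_out(["breakfast", "lunch", "dinner", "oil"]))
```

Because the three meal words dominate the average, "oil" ends up furthest from it.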

In [None]:
find_odd_one_out(["breakfast", "cereal", "dinner", "lunch"])

In [None]:
find_odd_one_out(["copper","gold","iron","oil"])

## Visualising Word Vectors<a name="vector_visualisation"></a>

In [None]:
def generate_visualisation(model, words: List[str]=None, sample_size: int=0):
    '''Displays scatterplot of dimensionality reduced word vectors using principal component analysis (PCA)
    
    Note:
        - If no words are provided, a random set will be sampled from the embedding model's vocabulary.
    '''

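    if words is None:
        # gensim >= 4 exposes the vocabulary as key_to_index; older versions use vocab
        vocab = list(model.key_to_index) if hasattr(model, 'key_to_index') else list(model.vocab)
        if sample_size > 0:
            # Sample without replacement so that no word is plotted twice
            words = np.random.choice(vocab, sample_size, replace=False)
        else:
            words = vocab
        
    word_vectors = np.array([model[w] for w in words])

    # Keep only the first two principal components
    twodim = PCA(n_components=2).fit_transform(word_vectors)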
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='b', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
    
    plt.show()
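
For intuition about what the PCA step does, here is a rough NumPy-only sketch: centre the vectors, then project them onto the two directions of greatest variance, found here with an SVD. `pca_2d` is a hypothetical helper and the data is randomly generated. scikit-learn's `PCA` may flip the sign of a component relative to this sketch, which does not affect how points cluster.

```python
import numpy as np


def pca_2d(vectors: np.ndarray) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    centred = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


rng = np.random.default_rng(0)
# Made-up data: 20 points in 5-d that vary mostly along the first two axes
toy = rng.normal(size=(20, 5)) * np.array([5.0, 3.0, 0.1, 0.1, 0.1])
coords = pca_2d(toy)
print(coords.shape)
```

Each row of `coords` is an (x, y) position that could be passed to `plt.scatter`, just as `twodim` is in the function above.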

### Visualise groups of beverages, foods, animals, locations, etc, to see how they cluster in 2D space.

In [None]:
generate_visualisation(model,
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])

### Visualise randomly sampled words

In [None]:
generate_visualisation(model, sample_size=50)