[Example](http://web.stanford.edu/class/cs224n/materials/Gensim%20word%20vector%20visualization.html) taken from the Stanford course [CS224n: Natural Language Processing with Deep Learning](http://web.stanford.edu/class/cs224n/index.html).

To run these examples, install the following packages in your environment:

- scikit-learn
- gensim



In [None]:
# Uncomment the lines below to install the packages from within the notebook
# !pip install scikit-learn
# !pip install gensim

# Gensim word vector visualization of various word vectors

In [None]:
import numpy as np

# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

For looking at word vectors, I'll use Gensim. 
Gensim isn't really a deep learning package. 
It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. 
But it's efficient and scalable, and quite widely used.

There are three well-known families of pre-trained word vectors:
- word2vec
- fastText
- GloVe

GloVe word vectors were proposed at Stanford in XX.
Gensim can convert GloVe vectors to the word2vec format.

Download the GloVe vectors from [the GloVe page](https://nlp.stanford.edu/projects/glove/) using this [zip file](https://nlp.stanford.edu/data/glove.6B.zip).

The zip file linked from [the GloVe page](https://nlp.stanford.edu/projects/glove/) contains pre-trained vector files in the following dimensions:

- 50d
- 100d
- 200d
- 300d

The following code uses the 100d vectors. The choice of dimension is a trade-off between speed and size on one side and quality on the other: 
the 300d vectors work better, but the file is also larger.
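
To make the size side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes the 400,000-word vocabulary of the glove.6B release and 4-byte float32 storage once the vectors are loaded; the helper name `embedding_bytes` is made up for illustration, and real memory use will be somewhat higher because of the vocabulary itself.

```python
VOCAB_SIZE = 400_000       # assumed: vocabulary size of the glove.6B release
BYTES_PER_FLOAT32 = 4      # assumed: vectors held in memory as float32

def embedding_bytes(dim, vocab_size=VOCAB_SIZE):
    """Approximate size in bytes of a vocab_size x dim float32 matrix."""
    return vocab_size * dim * BYTES_PER_FLOAT32

# glove.6B also ships a 200d file, so it is included here
sizes = {dim: embedding_bytes(dim) for dim in (50, 100, 200, 300)}
for dim, n_bytes in sizes.items():
    print(f"{dim}d: about {n_bytes / 2**20:.0f} MiB")
```

The matrix grows linearly with the dimension, so the 300d vectors need roughly three times the memory of the 100d ones.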

In [None]:
#glove_file = datapath('./glove.6B.100d.txt')
glove_file = 'data/glove.6B.100d.txt'

word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
word2vec_glove_file

In [None]:
glove2word2vec(glove_file, word2vec_glove_file)
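
For intuition about what this conversion does: the two text formats differ only in a header line. The word2vec text format starts with `<number of vectors> <vector size>`, while GloVe files go straight to the vectors. Below is a minimal sketch of that idea on a tiny made-up file; the helper `add_word2vec_header` and the toy vectors are hypothetical and not part of Gensim.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prepending the 'count dim' header."""
    with open(glove_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    num_vectors = len(lines)
    # Every line has the form: word v1 v2 ... vD
    dim = len(lines[0].split()) - 1
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{num_vectors} {dim}\n")
        f.writelines(lines)
    return num_vectors, dim

# Toy GloVe-style file with made-up vectors, for illustration only
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy.word2vec.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("king 0.5 0.1 0.3\nqueen 0.4 0.2 0.3\n")

header = add_word2vec_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    first_line = f.readline().strip()
print(header, first_line)
```

If you are on Gensim 4.x, `KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)` should be able to read the GloVe file directly, which makes the conversion step unnecessary; check this against your installed version.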

In [None]:
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

In [None]:
model.most_similar('obama')

In [None]:
model.most_similar('merkel')

In [None]:
model.most_similar('banana')

In [None]:
model.most_similar('turkey')

In [None]:
model.most_similar('germany')
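
`most_similar` ranks the vocabulary by cosine similarity to the query word. Here is a minimal sketch of that ranking in plain Python, so the arithmetic stays visible. The vectors are made up and `toy_most_similar` is a hypothetical helper, not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "banana": [0.9, 0.1, 0.0],
    "mango": [0.8, 0.2, 0.1],
    "germany": [0.0, 0.9, 0.4],
    "france": [0.1, 0.8, 0.5],
}
neighbors = toy_most_similar(toy_vectors, "banana")
print(neighbors)
```

In this toy data `mango` points in nearly the same direction as `banana`, so it comes out on top, while the two country vectors score much lower.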

In [None]:
model.most_similar(negative='banana')

In [None]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

In [None]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

![Analogy](imgs/word2vec-king-queen-composition.png)
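
The picture corresponds to simple vector arithmetic: take `y1 + x2 - x1` and look for the nearest remaining word by cosine similarity. Below is a rough sketch of that idea with made-up vectors. `toy_analogy` is a hypothetical helper; Gensim's own `most_similar` differs in details, but it likewise works on unit-normalized vectors and skips the input words.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def toy_analogy(vectors, x1, x2, y1):
    """Return the word closest (by cosine) to y1 + x2 - x1, excluding the inputs."""
    ux1, ux2, uy1 = (unit(vectors[w]) for w in (x1, x2, y1))
    target = unit([c + b - a for a, b, c in zip(ux1, ux2, uy1)])
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        if word in (x1, x2, y1):
            continue  # the input words are not valid answers
        score = sum(t * u for t, u in zip(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up vectors: the second coordinate loosely encodes gender, the third royalty
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "child": [0.2, 0.5, 0.0],
    "apple": [-1.0, 0.2, -0.5],
}
answer = toy_analogy(toy_vectors, "man", "woman", "king")
print(answer)
```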

In [None]:
analogy('japan', 'japanese', 'australia')

In [None]:
analogy('japan', 'japanese', 'turkey')

In [None]:
analogy('japan', 'japanese', 'germany')

In [None]:
analogy('australia', 'beer', 'france')

In [None]:
analogy('australia', 'beer', 'turkey')

In [None]:
analogy('australia', 'beer', 'germany')

In [None]:
analogy('obama', 'clinton', 'reagan')

In [None]:
analogy('tall', 'tallest', 'long')

In [None]:
analogy('good', 'fantastic', 'bad')

In [None]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))
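
One way to think about `doesnt_match`: average the unit-normalized vectors of the words, then report the word that is least similar to that average. A minimal sketch under that assumption, with made-up 2-dimensional vectors and a hypothetical helper `toy_doesnt_match`:

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def toy_doesnt_match(vectors, words):
    """Return the word least similar to the mean direction of all the words."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = [sum(m * x for m, x in zip(mean, u)) for u in units]
    return words[scores.index(min(scores))]

# Made-up vectors: the three meals point one way, "cereal" another
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.2, 0.9],
}
odd_one = toy_doesnt_match(toy_vectors, "breakfast cereal dinner lunch".split())
print(odd_one)
```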

In [None]:
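def display_pca_scatterplot(model, words=None, sample=0):
    if words is None:
        # Gensim 4.x exposes the vocabulary as `index_to_key`; older releases use `vocab`
        vocab = list(getattr(model, 'index_to_key', None) or model.vocab)
        if sample > 0:
            # Draw `sample` distinct words at random
            words = np.random.choice(vocab, sample, replace=False)
        else:
            words = vocab

    word_vectors = np.array([model[w] for w in words])

    # Project the vectors onto their first two principal components
    twodim = PCA(n_components=2).fit_transform(word_vectors)

    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    # Label each point with its word, slightly offset from the marker
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)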
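
The PCA step in this function centers the vectors and projects them onto the two directions of largest variance. Here is a small sketch of the same idea using NumPy's SVD on random toy data. `pca_2d` is a hypothetical helper, and the signs of its axes may differ from scikit-learn's output.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T, singular_values

# Toy data: 20 random 5-dimensional points, with most variance in the first two axes
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 5)) * np.array([5.0, 2.0, 0.1, 0.1, 0.1])
coords, singular_values = pca_2d(toy)
print(coords.shape)
```

Each row of `coords` is the 2D position of one input vector, which is what the scatter plot draws.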

In [None]:
display_pca_scatterplot(model, 
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])

In [None]:
display_pca_scatterplot(model, sample=300)