## Word embeddings
### *Apr 16, 2019*
### Author:

Gensim documentation: https://radimrehurek.com/gensim/models/word2vec.html

If you hit an import error, run `pip install -r requirements.txt` in this directory.

### 1.
text8 (http://mattmahoney.net/dc/text8.zip) is a small corpus commonly used for testing word2vec. Let's play around with a model trained on it.

Loading the data requires a large amount of RAM and may take a while.

In [1]:
!pip install -r requirements.txt
from gensim import models
import gensim.downloader as api
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


dataset = api.load("text8")



In [4]:
w = models.Word2Vec.load('pretrained')

In [10]:
w.wv.similar_by_word('drink')

[('meat', 0.8315112590789795),
 ('eat', 0.8029888272285461),
 ('milk', 0.7873179912567139),
 ('drinks', 0.7806337475776672),
 ('fruit', 0.7606136798858643),
 ('eating', 0.7544460296630859),
 ('honey', 0.7533469796180725),
 ('beer', 0.750900387763977),
 ('liquor', 0.7410033345222473),
 ('beef', 0.739946722984314)]

Try some different words and find their most similar words according to our model. Which of the results are unexpected? What might explain them?
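The scores printed by `similar_by_word` are cosine similarities between word vectors. The ranking can be sketched in plain Python as below. The helper names and the 3-dimensional `toy_vectors` are invented for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-d vectors, made up for this sketch only.
toy_vectors = {
    'drink': [0.9, 0.8, 0.1],
    'beer':  [0.8, 0.9, 0.2],
    'eat':   [0.7, 0.4, 0.3],
    'car':   [0.1, 0.0, 0.9],
}
ranked = most_similar('drink', toy_vectors)
print(ranked)
```

Because the score depends only on the angle between vectors, words that occur in similar contexts (such as `meat` and `drink` in the output above) can score highly even when they are not synonyms.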

In [3]:
w.save('pretrained')

### 2.
In class we discussed how to find analogies. Can you come up with more such relations? Please add two more, and leave a possible explanation if the result is unexpected (which is often the case).

In [9]:
w.wv.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.6991748213768005),
 ('princess', 0.6138308644294739),
 ('empress', 0.6116824150085449),
 ('elizabeth', 0.5983043909072876),
 ('emperor', 0.5824447870254517),
 ('prince', 0.5801239013671875),
 ('regent', 0.5786587595939636),
 ('daughter', 0.57823246717453),
 ('isabella', 0.5703590512275696),
 ('mary', 0.5700576901435852)]
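Roughly speaking, `most_similar(positive=..., negative=...)` adds the normalized `positive` vectors, subtracts the normalized `negative` ones, and returns the words closest to the result by cosine similarity, skipping the query words themselves. Below is a simplified sketch of that idea, not gensim's actual implementation. The `analogy` helper and the 2-dimensional `toy_vectors` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    # The query words are excluded, otherwise 'king' itself would often win.
    exclude = set(positive) | set(negative)
    scores = {word: sum(t * x for t, x in zip(target, unit(vec)))
              for word, vec in vectors.items() if word not in exclude}
    return max(scores, key=scores.get)

# Hypothetical 2-d vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'girl':  [-0.5, -1.0],
    'apple': [-1.0, 0.0],
}
best = analogy(['woman', 'king'], ['man'], toy_vectors)
print(best)
```

In a real model the axes are not this clean, which is one reason analogy results are often surprising.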

### 3.
Now define a new model and change its parameters to see whether that makes a difference in your results. What is the effect of the window size? What about the length (dimensionality) of the embeddings?
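To get a feel for what the window size controls, the sketch below lists the (center, context) pairs that a fixed symmetric window would produce from one sentence. This is a simplification: the function name `context_pairs` is hypothetical, and real word2vec implementations also apply tricks such as randomly shrinking the window and subsampling frequent words.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ['the', 'cat', 'drinks', 'warm', 'milk']
for window in (1, 2, 4):
    # Larger windows pair each word with more distant neighbours.
    print(window, len(context_pairs(sentence, window)))
```

A small window mostly pairs a word with its immediate neighbours, while a large window also pairs it with more distant words from the same sentence, so the two settings expose the model to different kinds of co-occurrence.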

### 4. (optional)
Train your own word2vec model on a corpus of your choice, e.g. Wikipedia, Twitter, or novels.

You may refer to: https://en.wikipedia.org/wiki/List_of_text_corpora

In [None]:
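with open('<filename>') as f:  # save your corpus as raw text, one sentence per line
    your_dataset = [line.split() for line in f]  # Word2Vec expects lists of tokens, not a raw string
your_model = models.Word2Vec(your_dataset)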
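`Word2Vec` expects an iterable of tokenized sentences (lists of strings). If it is handed one long raw string, it iterates over that string character by character. A minimal preprocessing sketch is shown below. The function name `sentences_from_text` and its naive splitting rules are assumptions for illustration, and a real corpus will usually need more careful tokenization.

```python
import re

def sentences_from_text(text):
    """Split raw text into lowercase token lists, one list per sentence."""
    sentences = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for raw in re.split(r'(?<=[.!?])\s+', text.strip()):
        tokens = re.findall(r"[a-z0-9']+", raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

corpus = sentences_from_text("The cat sat. The dog barked!")
print(corpus)
```

The resulting list of token lists is the shape of input that can be passed as the corpus argument when constructing a model.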

### 5. (optional)
Visualize your word embeddings with the function provided below.

In [13]:
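def display_closestwords_tsnescatterplot(model, word):
    # `model` should be the KeyedVectors object (e.g. `w.wv`), which supports
    # both `model[word]` lookups and `most_similar_cosmul`.
    arr = np.empty((0, model.vector_size), dtype='f')  # match the embedding size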
    word_labels = [word]
    close_words = model.most_similar_cosmul(word, topn=10)
    close_words = list(set(close_words))
    arr = np.append(arr, np.array([model[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
        
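    # Recent scikit-learn versions require perplexity < number of points (11 here).
    tsne = TSNE(n_components=2, random_state=0, perplexity=min(30, len(arr) - 1))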
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    plt.scatter(x_coords, y_coords)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    # Pad the axis limits on both sides so no point sits exactly on the border.
    plt.xlim(x_coords.min()-0.00005, x_coords.max()+0.00005)
    plt.ylim(y_coords.min()-0.00005, y_coords.max()+0.00005)
    plt.show()