This notebook explores word embeddings using Gensim: we train new embeddings on a dataset of our own and compare them with pre-trained GloVe embeddings.

In [None]:
import re
from gensim.models import Word2Vec, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath

First, let's train a new word2vec model on our data -- a total of ~6 million words sampled from 15,290 works of English fiction in Project Gutenberg, mostly published before 1923.  (Note this data has already been tokenized.)

In [None]:
sentences=[]
filename="../data/fiction.6M.txt"
with open(filename) as file:
    for line in file:
        words=line.rstrip().lower()
        # replace any sequence of whitespace (space, tab, newline, etc.) with single space
        words=re.sub(r"\s+", " ", words)
        sentences.append(words.split(" "))

We'll use Gensim to train 100-dimensional embeddings, treating a window of 5 words around each target word as its context for prediction.

In [None]:
model = Word2Vec(
        sentences,
        size=100,  # this parameter is named vector_size in Gensim 4.0 and later
        window=5,
        min_count=2,
        workers=10)
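
To make the `window=5` setting concrete, here is a minimal sketch of how the context around each target word can be gathered. The helper `context_windows` is hypothetical (it is not part of Gensim), the example sentence is invented, and the sketch leaves out other details of training, such as dropping words that occur fewer than `min_count` times.

```python
def context_windows(tokens, window=5):
    """Yield (target, context_words) for every position in a tokenized sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the target
        yield target, left + right


# Invented example sentence; a small window keeps the output readable
sentence = "the old car rattled down the narrow road".split()
pairs = list(context_windows(sentence, window=2))
for target, context in pairs:
    print(target, context)
```

Words near the start or end of a sentence simply get a shorter context, since the window never crosses the sentence boundary.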

In [None]:
my_trained_vectors = model.wv
# save vectors to file if you want to use them later
my_trained_vectors.save_word2vec_format('embeddings.txt', binary=False)

Let's find the words most similar to "car" in the literary embeddings we just trained; search for other terms here to see their nearest neighbors in embedding space.

In [None]:
my_trained_vectors.most_similar("car", topn=10)
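
"Nearest" here means highest cosine similarity between vectors. The sketch below shows that ranking in plain Python on a handful of made-up 3-dimensional vectors; the helpers `cosine` and `nearest` and the toy values are assumptions for illustration, not Gensim's implementation (which does the same thing with vectorized NumPy operations over the whole vocabulary).

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, topn=10):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only
toy = {
    "car": [0.9, 0.1, 0.0],
    "carriage": [0.8, 0.2, 0.1],
    "wagon": [0.7, 0.3, 0.0],
    "river": [0.0, 0.2, 0.9],
}
neighbors = nearest("car", toy, topn=2)
print(neighbors)
```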

Now let's load vectors that have already been trained on a much bigger dataset. [GloVe vectors](https://nlp.stanford.edu/projects/glove/) are trained using a different method than word2vec, but the result is a set of vectors that Gensim can read.  The top 50K words in the "Common Crawl (42B)" vectors (300-dimensional) can be found here: [glove.42B.300d.50K.txt](https://drive.google.com/file/d/1n1jt0UIdI3CD26cY1EIeks39XH5S8O8M/view?usp=sharing); download the file and place it in your `data` directory.

In [None]:
# First we have to convert the GloVe format into w2v format; this creates a new file
glove_file="../data/glove.42B.300d.50K.txt"
glove_in_w2v_format="../data/glove.42B.300d.50K.w2v.txt"
_ = glove2word2vec(glove_file, glove_in_w2v_format)
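
The two text formats are nearly identical: the word2vec text format starts with a header line giving the vocabulary size and the number of dimensions, and the GloVe format does not. The sketch below shows that conversion on a tiny invented file; `add_w2v_header` is a hypothetical helper written for illustration, not the Gensim function used above. (Newer Gensim releases, 4.0 and later, also accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which should make a separate conversion step unnecessary there.)

```python
import os
import tempfile


def add_w2v_header(glove_path, w2v_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dimensions>' header line."""
    # Reads the whole file into memory, which is fine for a 50K-word vocabulary
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    n_dims = len(lines[0].rstrip().split(" ")) - 1  # the first field is the word itself
    with open(w2v_path, "w", encoding="utf-8") as out:
        out.write(f"{len(lines)} {n_dims}\n")
        out.writelines(lines)
    return len(lines), n_dims


# Toy two-word "GloVe" file, invented for illustration
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy.w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("car 0.1 0.2 0.3\nroad 0.4 0.5 0.6\n")

shape = add_w2v_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```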

In [None]:
glove = KeyedVectors.load_word2vec_format(glove_in_w2v_format, binary=False)

In [None]:
glove.most_similar("car", topn=10)

`most_similar` supports vector arithmetic: the query is the average of the input vectors, with each negative vector first multiplied by -1, and the words closest to that average are returned.  Play around with this function to discover other analogies that have been learned in this representation.

In [None]:
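# one is to two as three is to ?  (i.e., two - one + three = ?)
one="man"
two="king"
three="woman"

# this second example overwrites the first; comment it out to try man/king/woman
one="paris"
two="france"
three="berlin"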

glove.most_similar(positive=[two, three], negative=[one], topn=5)
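
To see what this arithmetic does, here is a minimal sketch on invented 2-dimensional vectors (roughly "royalty" and "gender" axes). The helpers `unit` and `analogy` are hypothetical and only approximate what we understand Gensim to do: scale each input vector to unit length, add the positive ones and subtract the negative ones, then rank the remaining words by cosine similarity to the result.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Return the words closest to (sum of positive - sum of negative), skipping inputs."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    # Summing and averaging point in the same direction, so cosine ranking is unchanged
    query = unit(query)
    inputs = set(positive) | set(negative)
    scores = [
        (w, sum(q * x for q, x in zip(query, unit(v))))
        for w, v in vectors.items()
        if w not in inputs
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}
result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result)
```

Excluding the input words matters: without that step, the nearest neighbor of the query is often one of the inputs themselves.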

We can also evaluate the quality of the learned vectors through an intrinsic evaluation that compares their similarity scores with human judgments in the WordSim-353 dataset.

In [None]:
glove.evaluate_word_pairs(datapath('wordsim353.tsv'))

In [None]:
my_trained_vectors.evaluate_word_pairs(datapath('wordsim353.tsv'))
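
`evaluate_word_pairs` reports Pearson and Spearman correlations between the model's similarities and the human scores, along with the ratio of pairs containing out-of-vocabulary words. As a rough sketch of what the Spearman number measures, the code below ranks two lists of scores and compares the ranks. The helpers `ranks` and `spearman` are hypothetical, the scores are invented (they are not real WordSim-353 values), and the formula used assumes there are no tied scores.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Invented scores for four word pairs
human = [9.0, 7.5, 3.0, 1.0]
model_sims = [0.80, 0.85, 0.30, 0.10]
rho = spearman(human, model_sims)
print(rho)
```

Because only the ranks matter, the model is rewarded for ordering word pairs the way people do, regardless of the scale of its similarity values.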