This notebook explores word embeddings using Gensim: we train new embeddings on a dataset of our own and compare them with pre-trained GloVe embeddings.

Before running, install gensim with:

`conda install gensim`


In [2]:
import re
from gensim.models import Word2Vec, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath

First, let's train a new word2vec model on our data.

In [3]:
sentences=[]
filename="../data/wiki.10K.txt"
with open(filename) as file:
    for line in file:
        words=line.rstrip().lower()
        # this file is already tokenized, so we can split on whitespace
        # but first let's replace any sequence of whitespace (space, tab, newline, etc.) with a single space
        # (raw string, so "\s" is not treated as an invalid string escape)
        words=re.sub(r"\s+", " ", words)
        sentences.append(words.split(" "))

In [4]:
model = Word2Vec(
        sentences,
        vector_size=100,
        window=5,
        min_count=2,
        workers=10)

In [5]:
my_trained_vectors = model.wv
# save vectors to file if you want to use them later
#my_trained_vectors.save_word2vec_format('embeddings.txt', binary=False)

In [6]:
my_trained_vectors.most_similar("actor", topn=10)

[('actress', 0.9563195109367371),
 ('musician', 0.9091904759407043),
 ('composer', 0.9039162993431091),
 ('producer', 0.8941104412078857),
 ('writer', 0.8919810056686401),
 ('artist', 0.8903812170028687),
 ('dancer', 0.8714725971221924),
 ('singer', 0.864169180393219),
 ('pianist', 0.8626704812049866),
 ('journalist', 0.8625409007072449)]

Now let's load vectors that have already been trained on a much bigger dataset. [GloVe vectors](https://nlp.stanford.edu/projects/glove/) are trained using a different method than word2vec, but the resulting vectors can be read in by Gensim.  Here we'll use a 100-dimensional model trained on 6B words (from Wikipedia and news), but bigger models are also available.
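The GloVe and word2vec text formats differ only in a header line: the word2vec format begins with `<vocab_size> <dimension>`, while GloVe files start directly with the first word. As a rough sketch of what the conversion in the next cell amounts to (the helper name `add_w2v_header` and the toy vectors are assumptions made for illustration, not part of Gensim):

```python
import os
import tempfile


def add_w2v_header(glove_path, w2v_path):
    """Copy a GloVe-format text file to w2v_path, prepending the
    '<vocab_size> <dimension>' header expected by the word2vec text format.
    Hypothetical helper, for illustration only."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    vocab_size = len(lines)
    # each line has the form: word v1 v2 ... vd
    dimension = len(lines[0].split(" ")) - 1
    with open(w2v_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dimension}\n")
        for line in lines:
            dst.write(line + "\n")
    return vocab_size, dimension


# Toy GloVe-style file with made-up 3-dimensional vectors
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy.w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")

header = add_w2v_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    first_line = f.readline().strip()
```

If your installation is Gensim 4.0 or later, `KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)` should read the GloVe file directly, without writing an intermediate file.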

In [9]:
# First we have to convert the GloVe format into w2v format; this creates a new file
glove_file="../data/glove.6B.100d.100K.txt"
glove_in_w2v_format="../data/glove.6B.100d.100K.w2v.txt"
_ = glove2word2vec(glove_file, glove_in_w2v_format)


In [10]:
glove = KeyedVectors.load_word2vec_format("../data/glove.6B.100d.100K.w2v.txt", binary=False)

In [11]:
glove.most_similar("actor", topn=10)

[('actress', 0.8580666184425354),
 ('comedian', 0.795758843421936),
 ('starring', 0.7920297384262085),
 ('starred', 0.7582033276557922),
 ('actors', 0.7394535541534424),
 ('filmmaker', 0.7349801063537598),
 ('screenwriter', 0.7342271208763123),
 ('film', 0.6941469311714172),
 ('movie', 0.6924506425857544),
 ('comedy', 0.6884662508964539)]

`most_similar` supports vector arithmetic: it averages the input positive and negative vectors (negative vectors are first multiplied by -1) and returns the words whose vectors have the highest cosine similarity to that average.  Play around with this function to discover other analogies that have been learned in this representation.
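To make the arithmetic concrete, here is a minimal pure-Python sketch of the same idea on made-up 4-dimensional vectors. The function name `most_similar_sketch` and the toy vocabulary are assumptions for illustration; Gensim's actual implementation is vectorized and may differ in detail.

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar_sketch(vectors, positive, negative, topn=1):
    """Average the unit vectors of the inputs (negatives multiplied by -1),
    then rank all other words by cosine similarity to that average."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += sign * x / len(signed)
    query = unit(mean)
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # skip the input words themselves
        cosine = sum(a * b for a, b in zip(query, unit(vec)))
        scores.append((word, cosine))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up features: [french, german, city, country]
toy_vectors = {
    "paris":   [1.0, 0.0, 1.0, 0.0],
    "france":  [1.0, 0.0, 0.0, 1.0],
    "berlin":  [0.0, 1.0, 1.0, 0.0],
    "germany": [0.0, 1.0, 0.0, 1.0],
    "austria": [0.0, 0.5, 0.0, 1.0],
    "rome":    [0.0, 0.0, 1.0, 0.0],
}

# paris : france :: berlin : ?
result = most_similar_sketch(
    toy_vectors, positive=["france", "berlin"], negative=["paris"], topn=2)
```

On these toy vectors, `france - paris + berlin` points exactly in the direction of `germany`, so it ranks first, followed by `austria`.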

In [13]:
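# one : two :: three : ?   (i.e. two - one + three is closest to ?)
one="man"
two="king"
three="woman"

# these assignments override the ones above; comment them out to try the first analogy
one="paris"
two="france"
three="berlin"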

glove.most_similar(positive=[two, three], negative=[one], topn=5)

[('germany', 0.892362117767334),
 ('austria', 0.7597677111625671),
 ('poland', 0.7425415515899658),
 ('denmark', 0.7360999584197998),
 ('german', 0.6986513733863831)]

We can also evaluate the quality of the learned vectors through an intrinsic evaluation, comparing the model's word similarities with human judgments in the WordSim-353 dataset.

In [14]:
glove.evaluate_word_pairs(datapath('wordsim353.tsv'))

((0.5483502278951546, 4.235096667699337e-29),
 SpearmanrResult(correlation=0.5327354323238274, pvalue=2.8654146580558905e-27),
 0.0)

In [15]:
my_trained_vectors.evaluate_word_pairs(datapath('wordsim353.tsv'))

((0.38471142639602574, 1.0135344149176915e-13),
 SpearmanrResult(correlation=0.39151377703291373, pvalue=3.399532939047208e-14),
 1.41643059490085)
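
Each result above is a triple: the Pearson correlation with its p-value, the Spearman rank correlation with its p-value, and the percentage of word pairs skipped because a word was out of vocabulary (0.0 for GloVe, about 1.4 for our own vectors). The Spearman figure compares the ranking of pairs by model similarity with their ranking by human score. A minimal sketch of that computation on made-up numbers (the scores below are hypothetical, and this simple formula assumes there are no ties):

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical human judgments (0-10 scale) and model cosine similarities
# for five word pairs
human_scores = [9.0, 7.5, 6.0, 3.0, 1.0]
model_cosines = [0.80, 0.85, 0.50, 0.20, 0.30]

rho = spearman(human_scores, model_cosines)
```

A value of 1 would mean the model orders the word pairs exactly as the human annotators do; the higher Spearman correlation for GloVe above (0.53 versus 0.39) indicates closer agreement with human judgments than our smaller model achieves.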