This notebook explores word embeddings using Gensim: we train new embeddings on a dataset of our own and compare them with pre-trained GloVe embeddings.

Before running, install gensim with:

`conda install gensim`


In [1]:
import re
from gensim.models import Word2Vec, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath

First, let's train a new word2vec model on our data.

In [2]:
sentences=[]
filename="../data/wiki.10K.txt"
with open(filename) as file:
    for line in file:
        words=line.rstrip().lower()
        # this file is already tokenized, so we can split on whitespace
        # but first let's replace any sequence of whitespace (space, tab, newline, etc.) with a single space
        # (raw string, so that \s is passed to the regex engine rather than parsed as a string escape)
        words=re.sub(r"\s+", " ", words)
        sentences.append(words.split(" "))

In [3]:
model = Word2Vec(
        sentences,
        vector_size=100,
        window=5,
        min_count=2,
        workers=10)
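
To make `min_count` and `window` more concrete, here is a minimal pure-Python sketch of the two ideas: dropping words that occur fewer than `min_count` times, and collecting the context words within `window` positions of each center word. The helper names `prune_vocabulary` and `context_windows` and the toy sentences are hypothetical, and the sketch ignores other details of the real training procedure (for example, subsampling of frequent words).

```python
from collections import Counter

def prune_vocabulary(sentences, min_count=2):
    """Return the set of words that occur at least `min_count` times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

def context_windows(sentence, window=5):
    """Yield (center, context_words) pairs using a symmetric window."""
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield center, left + right

# hypothetical toy corpus, already tokenized and lowercased
toy_sentences = [
    ["the", "actor", "won", "an", "award"],
    ["the", "actress", "won", "the", "award"],
]

vocab = prune_vocabulary(toy_sentences, min_count=2)
pairs = list(context_windows(toy_sentences[0], window=1))
print(sorted(vocab))
print(pairs)
```

With `min_count=2`, words such as "actor" that appear only once in this toy corpus are dropped, which is why rare words in a small dataset may have no vector at all.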

In [4]:
my_trained_vectors = model.wv
# save vectors to file if you want to use them later
my_trained_vectors.save_word2vec_format('embeddings.txt', binary=False)

In [5]:
my_trained_vectors.most_similar("actor", topn=10)

[('actress', 0.952454686164856),
 ('musician', 0.9157603979110718),
 ('composer', 0.9041920304298401),
 ('artist', 0.9004024267196655),
 ('writer', 0.8933359980583191),
 ('pianist', 0.8832546472549438),
 ('singer', 0.8730253577232361),
 ('producer', 0.8704878091812134),
 ('journalist', 0.862703263759613),
 ('comedian', 0.8607137203216553)]
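
The scores returned by `most_similar` are cosine similarities between the query vector and every other vector in the vocabulary. As a rough illustration, the sketch below ranks neighbors by cosine similarity in plain Python; the 3-dimensional `toy_vectors` and the helper names `cosine` and `nearest` are made up for this example and are not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, topn=3):
    """Rank every word except the query itself by cosine similarity to the query."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# hypothetical 3-dimensional embeddings, for illustration only
toy_vectors = {
    "actor": [0.9, 0.8, 0.1],
    "actress": [0.85, 0.9, 0.15],
    "singer": [0.6, 0.7, 0.5],
    "river": [0.1, 0.0, 0.9],
}

neighbours = nearest("actor", toy_vectors, topn=2)
print(neighbours)
```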

Let's load vectors that have already been trained on a much bigger dataset. [GloVe vectors](https://nlp.stanford.edu/projects/glove/) are trained using a different method than word2vec, but the result is a set of vectors that Gensim can read.  Here we'll use a 100-dimensional model trained on 6B words (from Wikipedia and news), but bigger models are also available.

In [7]:
# First we have to convert the GloVe format into w2v format; this creates a new file
glove_file="../data/glove.6B.100d.100K.txt"
glove_in_w2v_format="../data/glove.6B.100d.100K.w2v.txt"
_ = glove2word2vec(glove_file, glove_in_w2v_format)
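
The two text formats differ only in their first line: a word2vec text file starts with a header giving the number of vectors and their dimensionality, while a GloVe file starts directly with the first word. The sketch below shows that conversion on in-memory lines; `add_word2vec_header` and `toy_glove` are hypothetical names used for illustration, not part of Gensim.

```python
def add_word2vec_header(glove_lines):
    """Prepend the '<vocab_size> <dimensions>' header used by the word2vec text format."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # the first field on each row is the word; the rest are vector components
    dimensions = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dimensions}"] + rows

# hypothetical two-word GloVe file with 3-dimensional vectors
toy_glove = [
    "the 0.1 0.2 0.3\n",
    "actor 0.4 0.5 0.6\n",
]

converted = add_word2vec_header(toy_glove)
print("\n".join(converted))
```

Recent Gensim releases (4.x) can also read a header-less file directly with `KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)`, which avoids writing a second file; check that your installed version supports `no_header` before relying on it.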

In [8]:
# load the converted file created in the previous cell
glove = KeyedVectors.load_word2vec_format(glove_in_w2v_format, binary=False)

In [9]:
glove.most_similar("actor", topn=10)

[('actress', 0.8580665588378906),
 ('comedian', 0.795758843421936),
 ('starring', 0.7920297384262085),
 ('starred', 0.7582033276557922),
 ('actors', 0.7394535541534424),
 ('filmmaker', 0.7349801063537598),
 ('screenwriter', 0.7342271208763123),
 ('film', 0.6941469311714172),
 ('movie', 0.6924506425857544),
 ('comedy', 0.6884662508964539)]

`most_similar` also supports vector arithmetic: it ranks words by their similarity to the average of the input vectors, where each vector in `negative` is first multiplied by -1.  Play around with this function to discover other analogies that have been learned in this representation.

In [10]:
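# one is to two as three is to ?   (computed as two - one + three)
one="man"
two="king"
three="woman"

# a second example; these assignments override the ones above
one="paris"
two="france"
three="berlin"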

glove.most_similar(positive=[two, three], negative=[one], topn=5)

[('germany', 0.892362117767334),
 ('austria', 0.7597678303718567),
 ('poland', 0.7425415515899658),
 ('denmark', 0.7360999584197998),
 ('german', 0.6986513137817383)]
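
The arithmetic behind this query can be sketched in a few lines of plain Python: normalize each input vector, add the positive ones, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The 3-dimensional `toy_vectors` and the helper names `unit` and `analogy` are hypothetical, and this is a simplified illustration rather than Gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # summing and averaging point in the same direction, so the ranking is identical
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the input words themselves
    scores = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# hypothetical embeddings with axes (french, german, capital-city), for illustration only
toy_vectors = {
    "paris": [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "berlin": [0.0, 1.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "italy": [1.0, 0.2, 0.0],
    "madrid": [0.0, 0.0, 1.0],
}

result = analogy(toy_vectors, positive=["france", "berlin"], negative=["paris"], topn=2)
print(result)
```

Excluding the input words matters: without that step, the nearest neighbor of the query is often one of the inputs.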

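We can also evaluate the quality of the learned vectors through an intrinsic evaluation that compares their similarity scores with human judgments in the WordSim-353 dataset. `evaluate_word_pairs` returns the Pearson correlation, the Spearman correlation, and the percentage of word pairs that were skipped because they contain an out-of-vocabulary word.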

In [11]:
glove.evaluate_word_pairs(datapath('wordsim353.tsv'))

(PearsonRResult(statistic=0.5483502231756461, pvalue=4.235102204649125e-29),
 SignificanceResult(statistic=0.5327354323238274, pvalue=2.8654146580558905e-27),
 0.0)

In [12]:
my_trained_vectors.evaluate_word_pairs(datapath('wordsim353.tsv'))

(PearsonRResult(statistic=0.3809759021361142, pvalue=1.8269597951561735e-13),
 SignificanceResult(statistic=0.3917266231124822, pvalue=3.28395419650688e-14),
 1.41643059490085)
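
In the outputs above, the GloVe vectors correlate more strongly with human judgments (Spearman of about 0.53) than the vectors trained on our small dataset (about 0.39), and about 1.4% of the pairs were skipped for our vectors because of out-of-vocabulary words. To show what the two correlation statistics measure, here is a minimal pure-Python sketch; the `human` and `model` scores are invented for illustration, and the helpers `pearson`, `ranks` and `spearman` are hypothetical stand-ins for the statistics routines Gensim relies on.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# invented scores for four word pairs (not taken from WordSim-353)
human = [9.0, 7.5, 3.0, 1.0]       # human similarity ratings
model = [0.80, 0.85, 0.30, 0.10]   # cosine similarities from a model

print(pearson(human, model), spearman(human, model))
```

Spearman only looks at the ordering of the pairs, so it is less sensitive than Pearson to the fact that human ratings and cosine similarities live on different scales.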