# Word embeddings

This is a small notebook that shows you how to obtain word embeddings and how to make fancy plots with them.

In [1]:
import numpy as np
import pickle
from collections import defaultdict, Counter
from random import random
import matplotlib.pyplot as plt
%matplotlib inline

## Data handling

Define a reading function:

In [6]:
def read(fname, w2i):
    """
    Reading function for the gensim model
    Format: [['I', 'like', 'custard'],...]
    """
    data = []
    counts = Counter()
    with open(fname, "r") as fh:
        for line in fh:
            tokens = line.strip().split()
            counts.update(tokens)
            data.append(tokens)
            # Looking a word up is enough to register it: w2i is expected
            # to be a defaultdict that assigns the next free index
            for w in tokens:
                w2i[w]
    return data, counts, w2i
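
To see what `read` returns, here is a small self-contained sketch. It repeats the function so the cell runs on its own, and it uses a hypothetical two-sentence corpus written to a temporary file (the names `toy_path`, `toy_w2i`, `toy_data` and `toy_counts` are made up for this illustration).

```python
import os
import tempfile
from collections import Counter, defaultdict


def read(fname, w2i):
    """Same logic as the reading function above, repeated so this cell is self-contained."""
    data = []
    counts = Counter()
    with open(fname, "r") as fh:
        for line in fh:
            tokens = line.strip().split()
            counts.update(tokens)
            data.append(tokens)
            for w in tokens:
                w2i[w]  # the lookup alone registers the word in the defaultdict
    return data, counts, w2i


# Hypothetical toy corpus: one sentence per line, tokens separated by spaces.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("I like custard\nI like tea\n")
    toy_path = tmp.name

toy_w2i = defaultdict(lambda: len(toy_w2i))
toy_data, toy_counts, toy_w2i = read(toy_path, toy_w2i)
os.remove(toy_path)

print(toy_data)                    # list of token lists, the format gensim expects
print(toy_counts.most_common(2))   # word frequencies
print(dict(toy_w2i))               # word -> integer index
```

The list of token lists is what gets passed to Word2Vec later on, while the counts are used to pick the most frequent words for plotting.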

## Gensim

Let's see the embeddings that an off-the-shelf word2vec implementation can give us. For this we can use the Word2Vec implementation provided by [Gensim](https://radimrehurek.com/gensim/models/word2vec.html). 

(Want to know how this is so fast? Read [this](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/) post.)

In [14]:
from gensim.models import Word2Vec

## Visualize the embeddings

We use t-SNE to produce 2-dimensional embeddings for visualization. Let's also throw in some K-means clustering to help give the dots some pretty colors.

Required are:
* Bokeh interactive plots: <http://bokeh.pydata.org/en/latest/> 
* scikit-learn ML library (aka `sklearn`): <http://scikit-learn.org/stable/documentation.html>
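
To make the clustering step concrete, here is a minimal pure-Python sketch of the k-means idea (Lloyd's algorithm) on six toy 2-D points, followed by the label-to-color lookup that gives each dot its color. This is only an illustration under simplifying assumptions: the initial centroids are fixed by hand, the two-color `palette` is made up, and the notebook itself relies on sklearn's `KMeans` rather than this function.

```python
def kmeans(points, centroids, n_iter=10):
    """Toy Lloyd's algorithm for 2-D points; returns one cluster label per point."""
    centroids = list(centroids)  # work on a copy so the caller's list is untouched
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(points, centroids=[(0, 0), (10, 10)])

# Hypothetical two-color palette; each point is colored by its cluster label.
palette = ["#1f77b4", "#ff7f0e"]
colors = [palette[label] for label in labels]
print(labels)
print(colors)
```

Note that the clustering runs on the original high-dimensional vectors, not on the 2-D t-SNE output, so the colors carry information that the projection may distort.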

In [15]:
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.palettes import d3
from bokeh.io import output_notebook
output_notebook()

def emb_scatter(data, names, N=20, perplexity=30.0):
    ## try to find some clusters ##
    print("finding clusters")
    kmeans = KMeans(n_clusters=N)
    kmeans.fit(data)
    klabels = kmeans.labels_

    ## get a tsne fit ##
    print("fitting tsne")
    tsne = TSNE(n_components=2, perplexity=perplexity)
    emb_tsne = tsne.fit_transform(data)
    
    ## plot the tsne of the embeddings with bokeh ##
    # source: https://github.com/oxford-cs-deepnlp-2017/practical-1
    p = figure(tools="pan,wheel_zoom,reset,save",
               toolbar_location="above",
               title="T-SNE for most common words")

    # set colormap as a list
    colormap = d3['Category20'][N]
    colors = [colormap[i] for i in klabels]

    source = ColumnDataSource(data=dict(x1=emb_tsne[:,0],
                                        x2=emb_tsne[:,1],
                                        names=names,
                                        colors=colors))

    p.scatter(x="x1", y="x2", size=8, source=source, color='colors')

    labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                      text_font_size="8pt", text_color="#555555",
                      source=source, text_align='center')
    p.add_layout(labels)

    show(p)

## Penn Treebank data

First read in the data:

In [16]:
train_file = "data/ptb/train.txt"

# This way w2i automatically assigns
# the next available integer to each new word:
w2i = defaultdict(lambda: len(w2i))
UNK = w2i["<unk>"] # So: UNK = 0

# Time how long this takes
%time ptb_data, ptb_counts, w2i = read(train_file, w2i)

# Now w2i returns UNK as default
w2i = defaultdict(lambda: UNK, w2i)
nwords = len(w2i)

# Also construct the inverse dictionary i2w
i2w = dict()
for w, i in w2i.items():
    i2w[i] = w

CPU times: user 474 ms, sys: 38.9 ms, total: 513 ms
Wall time: 512 ms
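
The two-phase `defaultdict` trick used above can be seen in isolation in the following sketch. The names `vocab`, `UNK_ID` and `inverse` are chosen for this illustration only.

```python
from collections import defaultdict

# Phase 1: while reading, every unseen word gets the next free integer,
# because len(vocab) is evaluated at the moment the word is first looked up.
vocab = defaultdict(lambda: len(vocab))
UNK_ID = vocab["<unk>"]  # first lookup, so this is index 0
for word in ["the", "cat", "the"]:
    vocab[word]

# Phase 2: freeze the vocabulary, so that unseen words now map to UNK_ID.
vocab = defaultdict(lambda: UNK_ID, vocab)
inverse = {i: w for w, i in vocab.items()}

print(vocab["cat"], vocab["dog"], inverse[1])
```

One thing to keep in mind: looking up an unseen word in the frozen `defaultdict` still inserts it as a key (with value `UNK_ID`), so the vocabulary size should be read off before such lookups, or `vocab.get(word, UNK_ID)` can be used instead.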


Then train the Word2Vec model:

In [17]:
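# Note: written for gensim < 4.0. Newer releases rename `size` to `vector_size`
# and expose the vectors and `most_similar` through `ptb_model.wv`.
%time ptb_model = Word2Vec(ptb_data, size=100, window=5, min_count=5, workers=4)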

CPU times: user 7.4 s, sys: 69 ms, total: 7.47 s
Wall time: 2.43 s
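
The `min_count=5` argument tells Word2Vec to ignore words that occur fewer than five times in total. A rough sketch of what such a frequency filter does, on a made-up toy corpus with a lower threshold (this mimics the effect only and is not gensim's actual code):

```python
from collections import Counter

# Hypothetical toy corpus in the same list-of-token-lists format.
toy_corpus = [["a", "b", "a"], ["a", "c", "b"], ["a", "b"]]
min_count = 2  # the notebook uses 5

# Count every token, then keep only the words that are frequent enough.
counts = Counter(w for sent in toy_corpus for w in sent)
kept = {w for w, c in counts.items() if c >= min_count}
filtered = [[w for w in sent if w in kept] for sent in toy_corpus]

print(sorted(kept))
print(filtered)
```

Rare words get no vector at all, which is why the plots below only query the most frequent words.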


Then plot them:

In [18]:
ptb_top_words = [k for k,v in ptb_counts.most_common(1000)]
ptb_top_vecs = ptb_model[ptb_top_words]

emb_scatter(ptb_top_vecs, ptb_top_words, N=20)

finding clusters
fitting tsne


## Ted talks

Let's try this on another dataset, made up of a large collection of TEDx talks:

In [20]:
train_file = "data/ted-cleaned.txt"

# This way w2i automatically assigns
# the next available integer to each new word:
w2i = defaultdict(lambda: len(w2i))
UNK = w2i["<unk>"] # So: UNK = 0

# Time how long this takes
%time ted_data, ted_counts, ted_w2i = read(train_file, w2i)

# Now w2i returns UNK as default
w2i = defaultdict(lambda: UNK, ted_w2i)
nwords = len(ted_w2i)

# Also construct the inverse dictionary i2w
ted_i2w = dict()
for w, i in ted_w2i.items():
    ted_i2w[i] = w

CPU times: user 3.18 s, sys: 185 ms, total: 3.37 s
Wall time: 3.39 s


In [22]:
ted_model = Word2Vec(ted_data, size=100, window=5, min_count=5, workers=4)

In [23]:
ted_top_words = [k for k,v in ted_counts.most_common(1000)]
ted_top_vecs = ted_model[ted_top_words]

emb_scatter(ted_top_vecs, ted_top_words, N=20)

finding clusters
fitting tsne


## Wikipedia dataset

The more data the better. And this dataset is pretty big.

In [25]:
train_file = "data/wiki-cleaned.txt"

# This way w2i automatically assigns
# the next available integer to each new word:
w2i = defaultdict(lambda: len(w2i))
UNK = w2i["<unk>"] # So: UNK = 0

# Time how long this takes
%time wiki_data, wiki_counts, wiki_w2i = read(train_file, w2i)

# Now w2i returns UNK as default
w2i = defaultdict(lambda: UNK, wiki_w2i)
nwords = len(wiki_w2i)

# Also construct the inverse dictionary i2w
wiki_i2w = dict()
for w, i in wiki_w2i.items():
    wiki_i2w[i] = w

CPU times: user 1min 6s, sys: 3.92 s, total: 1min 10s
Wall time: 1min 10s


In [26]:
wiki_model = Word2Vec(wiki_data, size=100, window=5, min_count=5, workers=4)

In [28]:
wiki_top_words = [k for k,v in wiki_counts.most_common(1000)]
wiki_top_vecs = wiki_model[wiki_top_words]

emb_scatter(wiki_top_vecs, wiki_top_words, N=20)

finding clusters
fitting tsne


## Nearest vectors with cosine similarity

The Gensim implementation has the nice feature that you can query a trained model for the word vectors that are nearest to a given word vector. Nearness is measured with the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), so a higher score means a closer neighbour.
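
As a reminder of what the score means, the following sketch computes the cosine similarity by hand and ranks neighbours in a tiny, made-up vocabulary. The three vectors in `toy_vecs` and the helper `nearest` are invented for this illustration; the real models use 100-dimensional learned vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings".
toy_vecs = {
    "language": [1.0, 0.0, 1.0],
    "dialect": [0.9, 0.1, 1.0],
    "custard": [0.0, 1.0, 0.1],
}


def nearest(word, vecs, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vecs[word], v)) for other, v in vecs.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(nearest("language", toy_vecs))
```

Because the cosine only depends on the angle between two vectors, rescaling a vector does not change its neighbours.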

Inspect the difference in quality of the resulting embeddings:

In [144]:
ptb_model.most_similar("language")

[('progress', 0.9617605209350586),
 ('danger', 0.9613274335861206),
 ('surely', 0.9609348177909851),
 ('priority', 0.9599844813346863),
 ('diversion', 0.9583083391189575),
 ('consistent', 0.9560381770133972),
 ('cfcs', 0.9550004005432129),
 ('rooms', 0.9544030427932739),
 ('delicious', 0.9541617035865784),
 ('obvious', 0.9538336992263794)]

In [145]:
ted_model.most_similar("language")

[('culture', 0.6662563681602478),
 ('english', 0.6344000101089478),
 ('discipline', 0.6286273002624512),
 ('nature', 0.616771399974823),
 ('empathy', 0.6132360100746155),
 ('mathematics', 0.6067396402359009),
 ('meaning', 0.6036574840545654),
 ('logic', 0.603035569190979),
 ('narrative', 0.5966923832893372),
 ('creativity', 0.5927315950393677)]

In [146]:
wiki_model.most_similar("language")

[('dialect', 0.7908326387405396),
 ('languages', 0.7729238271713257),
 ('vocabulary', 0.7686649560928345),
 ('phonology', 0.7616441249847412),
 ('syntax', 0.7473596334457397),
 ('orthography', 0.7317029237747192),
 ('sundanese', 0.7253890037536621),
 ('vernacular', 0.7225418090820312),
 ('colloquial', 0.7171159982681274),
 ('mandarin', 0.7134953737258911)]

I think the Wiki embeddings are the most meaningful. Let's inspect them some more:

In [147]:
wiki_model.most_similar("hogwarts") # A real university?

[('balamb', 0.6359041929244995),
 ('dumbledore', 0.6249843239784241),
 ('rosh', 0.607549786567688),
 ('snape', 0.5927407741546631),
 ('gonville', 0.5868369340896606),
 ('caius', 0.5818954110145569),
 ('katara', 0.5635119676589966),
 ('dulwich', 0.5569760799407959),
 ('aang', 0.5506828427314758),
 ('eton', 0.5477432012557983)]

In [148]:
wiki_model.most_similar("amsterdam")

[('brussels', 0.8473321199417114),
 ('paris', 0.8179935812950134),
 ('copenhagen', 0.8024680614471436),
 ('rotterdam', 0.8021714687347412),
 ('stockholm', 0.7587275505065918),
 ('antwerp', 0.7397375106811523),
 ('bruges', 0.7171226143836975),
 ('berlin', 0.7119771838188171),
 ('vienna', 0.7091464996337891),
 ('prague', 0.70756995677948)]

In [149]:
wiki_model.most_similar("stalin")

[('khrushchev', 0.8586050868034363),
 ('lenin', 0.8509545922279358),
 ('hitler', 0.8186794519424438),
 ('gorbachev', 0.8118611574172974),
 ('brezhnev', 0.8073111772537231),
 ('himmler', 0.7880926132202148),
 ('kosygin', 0.7761569023132324),
 ('goebbels', 0.773198127746582),
 ('mussolini', 0.7717257142066956),
 ('mao', 0.7672331929206848)]

In [150]:
wiki_model.most_similar("reddish")

[('yellowish', 0.9638515710830688),
 ('grayish', 0.9523113369941711),
 ('pinkish', 0.9513583183288574),
 ('brownish', 0.950848400592804),
 ('greenish', 0.9315527081489563),
 ('whitish', 0.9270859956741333),
 ('bluish', 0.9249571561813354),
 ('greyish', 0.9191488027572632),
 ('buff', 0.9170758724212646),
 ('blackish', 0.9139665365219116)]

Gensim also provides for vector arithmetic with the embeddings. Let's see if Wikipedia agrees with us that

$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
$$\text{Amsterdam} - \text{Netherlands} + \text{France} \approx \text{Paris}$$
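
The idea behind such analogy queries can be sketched with hand-made 2-D vectors: add the "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. This is a simplified illustration with invented vectors (`toy`) and an invented helper (`analogy`). Gensim's own implementation differs in its details, for example in how it normalises the vectors before combining them.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors: first axis roughly "royalty", second roughly "gender".
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.5, 0.1],
}


def analogy(positive, negative, vecs):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the query words."""
    dim = len(next(iter(vecs.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vecs[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vecs[w])]
    exclude = set(positive) | set(negative)
    scores = [(w, cosine(target, v)) for w, v in vecs.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


result = analogy(positive=["woman", "king"], negative=["man"], vecs=toy)
print(result)
```

Excluding the query words matters: without that step the nearest vector to the combined target is often one of the inputs themselves.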

In [151]:
wiki_model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.770358145236969),
 ('monarch', 0.728584885597229),
 ('empress', 0.7106271982192993),
 ('tsar', 0.6899988651275635),
 ('prince', 0.671586275100708),
 ('isabella', 0.6688051819801331),
 ('regent', 0.66827392578125),
 ('dowager', 0.6503204703330994),
 ('noblewoman', 0.6491597294807434),
 ('consort', 0.6414600610733032)]

In [152]:
wiki_model.most_similar(positive=['amsterdam','france'], negative=['netherlands'])

[('paris', 0.7890597581863403),
 ('brussels', 0.6972391605377197),
 ('rotterdam', 0.6772001385688782),
 ('lisbon', 0.6623631715774536),
 ('bordeaux', 0.6591888070106506),
 ('algiers', 0.6387784481048584),
 ('antwerp', 0.6358806490898132),
 ('strasbourg', 0.6283596158027649),
 ('genoa', 0.6240521669387817),
 ('rouen', 0.6220749616622925)]

Play around with this and let us know about the best ones you find!

The funniest ones I found are:

In [153]:
ted_model.most_similar(positive=['sex'], negative=['love'])
# which reveals that TED talks are rather prudish and moralising. Or realistic? sex - love = disease

[('rates', 0.5408843755722046),
 ('cancer', 0.5374121069908142),
 ('disease', 0.528205931186676),
 ('hiv', 0.5195662975311279),
 ('breast', 0.5132015943527222),
 ('retirement', 0.5038480162620544),
 ('malaria', 0.499352365732193),
 ('males', 0.48933854699134827),
 ('antibiotics', 0.47545453906059265),
 ('rate', 0.469136118888855)]

In [154]:
ted_model.most_similar(positive=['party'], negative=['alcohol'])

[('meeting', 0.5937921404838562),
 ('conference', 0.5925645232200623),
 ('visit', 0.5016125440597534),
 ('ted', 0.4858279228210449),
 ('museum', 0.48003435134887695),
 ('class', 0.46813347935676575),
 ('university', 0.4609370231628418),
 ('stage', 0.42508798837661743),
 ('dinner', 0.42372316122055054),
 ('night', 0.4102555215358734)]

## POS-tag embedding

Let's make some POS-tag embeddings! We took a tagged version of the PTB data used above and discarded the words, preserving the order of the POS tags. The rest is exactly the same.
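
For illustration, the preprocessing described here could look like the sketch below. The input format (one list of `(word, tag)` pairs per sentence) and all names are assumptions made for this example; how the pickled files loaded in the next cell were actually produced is not shown in this notebook.

```python
# Hypothetical tagged input: one list of (word, tag) pairs per sentence.
tagged = [
    [("I", "PRP"), ("like", "VBP"), ("custard", "NN")],
    [("Dogs", "NNS"), ("bark", "VBP")],
]

# Drop the words but keep the tags in their original order.
tag_sequences = [[tag for _word, tag in sent] for sent in tagged]

# The tag set is simply every distinct tag that occurs.
tag_set = sorted({tag for sent in tag_sequences for tag in sent})

print(tag_sequences)
print(tag_set)
```

The tag sequences play the role of sentences and the tags play the role of words, so Word2Vec can be trained on them unchanged.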

In [36]:
import pickle

with open('data/ptb-tagset.pkl', 'rb') as f:
    ptb_tags = pickle.load(f)
with open('data/ptb-tagseq.pkl', 'rb') as g:
    ptb_tagseq = pickle.load(g)

In [39]:
ptb_tags[0:10]

['NNP', 'POS', 'WRB', 'RBS', 'WDT', 'UH', 'PRP', 'NNS', 'RP', 'EX']

In [40]:
# about a million tags
len([w for sent in ptb_tagseq for w in sent])

929552

In [41]:
pos_model = Word2Vec(ptb_tagseq, size=100, window=3, min_count=5, workers=4)

In [42]:
pos_vecs = pos_model[ptb_tags]

# since the set of vectors we perform t-SNE on is small we 
# should use a smaller perplexity (sklearn default is 30)
# see https://lvdmaaten.github.io/tsne/#examples for reference
emb_scatter(pos_vecs, ptb_tags, N=12, perplexity=8.0)

finding clusters
fitting tsne


## Pre-trained GloVe

Finally let's look at some pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) vectors from the Stanford paper. All their embeddings are downloadable from there.

The 300-dimensional ones are available from the Python library [spaCy](https://spacy.io/), which we will use for convenience. This library is part of the Anaconda distribution of Python.

Check it out:

In [168]:
import spacy

glove = spacy.load('en_vectors_glove_md')

Vectors can then be obtained with the following syntax:

In [169]:
# glove("erroneous").vector

In [171]:
glove_top_words = ted_top_words
glove_top_vecs, glove_top_names = zip(*[(glove(word).vector, word) for word in glove_top_words])
glove_top_vecs = np.vstack(glove_top_vecs)

emb_scatter(glove_top_vecs, glove_top_names, N=20)

finding clusters
fitting tsne


In [172]:
glove_top_vecs, glove_top_names = zip(*[(glove(word).vector, word) for word in wiki_top_words])
glove_top_vecs = np.vstack(glove_top_vecs)

emb_scatter(glove_top_vecs, glove_top_names, N=20)

finding clusters
fitting tsne


## Loading with gensim

Gensim can also be used to load pre-trained word2vec files, as long as they are in the right format:

In [None]:
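# model = Word2Vec.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
# In more recent gensim versions the loader lives on KeyedVectors instead:
# from gensim.models import KeyedVectors
# model = KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)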
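
To give an idea of what "the right format" means, here is a minimal sketch of a reader for the plain-text variant of the word2vec format: a header line with the vocabulary size and the vector dimension, followed by one line per word. The function name `load_word2vec_text` and the toy data are invented for this example. The GoogleNews file mentioned above is the binary variant, which this sketch does not handle.

```python
import io


def load_word2vec_text(fh):
    """Parse word2vec text format: 'n_words dim' header, then 'word v1 ... v_dim' lines."""
    header = fh.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in fh:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            raise ValueError("unexpected number of fields: %r" % line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != n_words:
        raise ValueError("header announced %d words, found %d" % (n_words, len(vectors)))
    return vectors


# Hypothetical file contents with two words and three dimensions.
toy_file = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
toy_vectors = load_word2vec_text(toy_file)
print(toy_vectors)
```

For real files the gensim loader is the better choice, since it also deals with the binary format and with large vocabularies.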