This notebook uses the word2vec model to find the most similar words for a set of example words.

During training we did not provide any supervised information about word semantics or word relationships.

This demonstrates that the model learns word similarity implicitly, from the training text alone.
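
Similarity between two word vectors is measured in this notebook with cosine similarity. As a minimal sketch of what that measure does, the snippet below applies it to made-up 3-dimensional vectors. The vectors and the `cosine` helper are illustrative assumptions and are not taken from the trained model.

```python
import numpy as np

def cosine(u, v):
    # Dot product of the two vectors divided by the product of their lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, chosen only for illustration.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice as long
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

print(cosine(a, b))  # approximately 1.0 (same direction)
print(cosine(a, c))  # 0.0 (orthogonal)
```

Vector length does not affect the result, only direction does, so a word is always maximally similar to itself.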

In [2]:
import numpy as np
from numpy import dot
from gensim import matutils
from gensim.corpora import Dictionary
from keras.models import Model, load_model

In [14]:
def cosine_similarity(doc_vec1, doc_vec2):
    # Cosine similarity: dot product of the two unit-length vectors.
    # Taken from: gensim.models.keyedvectors.Doc2VecKeyedVectors
    return dot(matutils.unitvec(doc_vec1), matutils.unitvec(doc_vec2))

def token_vec(token, vocab, W):
    # Look up the row of W that holds the vector for `token`.
    token_id = vocab.token2id[token]
    return W[token_id]

def most_similar(token, vocab, W, topn=5):
    # Compare the token's vector against every row of W and keep the topn best.
    vec = token_vec(token, vocab, W)
    similarities = [cosine_similarity(vec, W[i]) for i in range(W.shape[0])]
    word_indices = matutils.argsort(similarities, topn, reverse=True)
    word_similarities = np.array(similarities)[word_indices]
    return zip(word_indices, word_similarities)

def show_similar_words(word, vocab, W, topn=10):
    print('Words similar to "%s":' % word)
    token_ids = most_similar(word, vocab, W, topn)
    for token_id, token_similarity in token_ids:
        print('%s (%f)' % (vocab[token_id], token_similarity))
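
The `most_similar` function above calls `cosine_similarity` once per vocabulary row in a Python loop. A possible alternative, sketched here under the assumption that `W` is a 2-D NumPy array with one word vector per row, normalizes all rows once and gets every similarity from a single matrix-vector product. The name `most_similar_vectorized` and its `exclude_self` flag are hypothetical and are not part of the original notebook or of gensim.

```python
import numpy as np

def most_similar_vectorized(token_id, W, topn=5, exclude_self=False):
    """Return (row index, cosine similarity) pairs for the topn nearest rows of W.

    Hypothetical helper, not part of the original notebook or of gensim.
    """
    # Normalize every row to unit length; guard against all-zero rows.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_unit = W / np.where(norms == 0, 1.0, norms)

    # One matrix-vector product yields the cosine similarity to every row.
    sims = W_unit @ W_unit[token_id]

    if exclude_self:
        sims = sims.copy()
        sims[token_id] = -np.inf  # never rank the query row itself

    # Sort descending and keep the first topn indices.
    top = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in top]

# Toy matrix standing in for the embedding matrix W (values are made up).
W_toy = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [-1.0, 0.0]])
print(most_similar_vectorized(0, W_toy, topn=2))
print(most_similar_vectorized(0, W_toy, topn=2, exclude_self=True))
```

With the notebook's own names, a call would look like `most_similar_vectorized(vocab.token2id['london'], W, topn=10)`. Apart from possible differences in how ties are ordered, it should rank words the same way as `most_similar`. Setting `exclude_self=True` drops the query word, which otherwise comes first with similarity 1.0.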


## Load word2vec model

In [15]:
vocab_path = '/tmp/vocab.pkl'
model_path = '/tmp/w2v_enwiki_vocab=10000_model.h5'

# load vocabulary
vocab = Dictionary.load(vocab_path)

# load the Keras model
model = load_model(model_path)

# extract the word vector matrix W from the embedding layer
embedding_layer = model.layers[1]
W = embedding_layer.get_weights()[0]

print('Word vector matrix shape:', W.shape)

Word vector matrix shape: (10000, 100)


## Show word similarities

Show the 10 most similar words for each example word. The query word itself is included and ranks first with similarity 1.0.

In [16]:
words = ['london', 'berlin', 'monday', 'morning', 'strong', 'blue']
for word in words:
    show_similar_words(word, vocab, W, topn=10)
    print()

Words similar to "london":
london (1.000000)
liverpool (0.599567)
birmingham (0.591890)
manchester (0.590634)
bristol (0.568052)
surrey (0.554686)
edinburgh (0.554203)
dublin (0.545324)
oxford (0.531534)
sussex (0.528844)

Words similar to "berlin":
berlin (1.000000)
hamburg (0.796351)
dresden (0.774030)
munich (0.753499)
vienna (0.748455)
frankfurt (0.745147)
stuttgart (0.710112)
cologne (0.703344)
prague (0.694123)
leipzig (0.690584)

Words similar to "monday":
monday (1.000000)
thursday (0.881363)
tuesday (0.856933)
wednesday (0.854535)
saturday (0.790926)
friday (0.786855)
sunday (0.662439)
pm (0.611944)
afternoon (0.607712)
airs (0.603803)

Words similar to "morning":
morning (1.000000)
evening (0.767806)
afternoon (0.742414)
night (0.569338)
midnight (0.540847)
friday (0.522020)
pm (0.508591)
nights (0.488907)
tuesday (0.487015)
sunday (0.480688)

Words similar to "strong":
strong (1.000000)
enthusiastic (0.512568)
tremendous (0.468794)
powerful (0.463790)
remarkable (0.449273)
exceptional (0.448022)
vital