# A Word2Vec playground

To play with this notebook, you'll need Annoy, Gensim, and the GoogleNews vectors.

  pip install annoy  
  pip install gensim  
  You can find the GoogleNews vectors by searching the web for _GoogleNews-vectors-negative300.bin_

Inspired by: https://github.com/chrisjmccormick/inspect_word2vec


In [18]:
# import and init
from annoy import AnnoyIndex
import gensim
import os.path
import numpy as np

prefix_filename = 'word2vec'
ann_filename = prefix_filename + '.ann'
i2k_filename = prefix_filename + '_i2k.npy'
k2i_filename = prefix_filename + '_k2i.npy'

## Create a model or load it

In [17]:
# Load Google's pre-trained Word2Vec model.
print("load GoogleNews Model")
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
print("loading done")

hello = model['hello']
vector_size = len(hello)
# model.vocab is the gensim 3.x API; gensim 4 renamed it to model.key_to_index
print('model size= {}'.format(len(model.vocab)))
print('vector size= {}'.format(vector_size))

load GoogleNews Model
loading done
model size= 3000000
vector size= 300


In [19]:
# process the model and save the index files,
# or load them directly if they already exist
vocab = model.vocab.keys()
# angular was the implicit default; newer Annoy releases expect the metric to be passed explicitly
indexNN = AnnoyIndex(vector_size, 'angular')
# preallocated list: index2key[i] is the key stored at Annoy id i
index2key = [None] * len(model.vocab)
key2index = {}

if not os.path.isfile(ann_filename):
    print('creating indexes')
    i = 0
    try:
        for key in vocab:
            indexNN.add_item(i, model[key])
            key2index[key] = i
            index2key[i] = key
            i = i + 1
            if i % 10000 == 0:
                print('{} {}'.format(i, key))
    except TypeError:
        print('Error with key {}'.format(key))
    print('building 10 trees')
    indexNN.build(10)  # 10 trees
    print('save files')
    indexNN.save(ann_filename)
    np.save(i2k_filename, index2key)
    np.save(k2i_filename, key2index)
    print('done')
else:
    print("loading files")
    indexNN.load(ann_filename)
    index2key = np.load(i2k_filename)
    # np.save pickles the dict into a 0-d object array; .item() unwraps it
    key2index = np.load(k2i_filename, allow_pickle=True).item()
    print("loading done: {} items".format(indexNN.get_n_items()))

loading files
loading done: 3000000 items
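Annoy stores items under integer ids only, which is why the cell above keeps two lookup tables next to the index. Below is a minimal sketch of that bookkeeping in isolation. It swaps `np.save` for JSON, which is not what the notebook does, and the helper names (`build_lookups`, `save_lookups`, `load_lookups`), the file name, and the three-word vocabulary are all hypothetical.

```python
import json
import os
import tempfile


def build_lookups(keys):
    """Return (index2key, key2index) for an iterable of unique keys."""
    index2key = list(keys)
    key2index = {key: i for i, key in enumerate(index2key)}
    return index2key, key2index


def save_lookups(path, index2key):
    # key2index can be rebuilt from index2key, so only the list is stored
    with open(path, 'w') as f:
        json.dump(index2key, f)


def load_lookups(path):
    # Rebuild both tables from the stored list
    with open(path) as f:
        return build_lookups(json.load(f))


# Hypothetical three-word vocabulary, for illustration only
index2key, key2index = build_lookups(['king', 'queen', 'Paris'])
path = os.path.join(tempfile.mkdtemp(), 'word2vec_i2k.json')
save_lookups(path, index2key)
loaded_i2k, loaded_k2i = load_lookups(path)
print(loaded_i2k[2], loaded_k2i['queen'])  # Paris 1
```

Storing only the id-to-key list keeps a single source of truth, and the reverse dictionary is cheap to rebuild at load time.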


## King - Male + Female = Queen?
Nope!

At least not with a word2vec model trained on Google News: the nearest neighbor of the resulting vector is still _king_ itself.

In [9]:
king = model['king']
male = model['male']
female = model['female']

what_vec = king - male + female
what_indexes = indexNN.get_nns_by_vector(what_vec, 1)

print('king - male + female = ')
for i in what_indexes:
    print(index2key[i])

king - male + female = 
king
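One possible reason for this result: adding a small offset to _king_ leaves the vector very close to _king_, so the input word wins the nearest-neighbor search. Analogy evaluations usually exclude the input words from the candidates for that reason. The sketch below shows the effect on made-up 3-d vectors. The numbers are invented, and `cosine` and `nearest` are hypothetical helpers, not part of Annoy or Gensim.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, n=1, exclude=()):
    """Return the n keys most similar to query, skipping keys in exclude."""
    candidates = [k for k in vectors if k not in exclude]
    candidates.sort(key=lambda k: cosine(query, vectors[k]), reverse=True)
    return candidates[:n]


# Made-up vectors, for illustration only (not taken from GoogleNews)
toy = {
    'king':   [0.9,  0.4, 0.1],
    'queen':  [0.9, -0.4, 0.1],
    'male':   [0.1,  0.3, 0.8],
    'female': [0.1,  0.1, 0.8],
}

# king - male + female, computed component by component
query = [k - m + f for k, m, f in zip(toy['king'], toy['male'], toy['female'])]

print(nearest(query, toy))                                     # ['king']
print(nearest(query, toy, exclude=('king', 'male', 'female')))  # ['queen']
```

With the notebook's index, the same idea would be to ask Annoy for a few more neighbors, for example `indexNN.get_nns_by_vector(what_vec, 4)`, and skip the three input words. Whether _queen_ then comes out first on the GoogleNews vectors is not verified here.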


## Berlin  - Germany + France = Paris?
Yes!

This makes me happy, but if someone understands why, please tell me!

In [20]:
what_vec = model['Berlin'] - model['Germany'] + model['France']
what_indexes = indexNN.get_nns_by_vector(what_vec, 1)

for i in what_indexes:
    print(index2key[i])

Paris
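One common explanation, offered as a sketch rather than a proof: word2vec tends to place pairs that share a relation (here, capital and country) at roughly the same offset, so Berlin - Germany and Paris - France point in nearly the same direction. Adding that offset to France then lands near Paris. The code below measures how parallel two offsets are. The vectors are made up for illustration, and `offset` and `offset_cosine` are hypothetical helper names.

```python
import math


def offset(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def offset_cosine(a, b, c, d):
    """Cosine between the offsets (a - b) and (c - d)."""
    return cosine(offset(a, b), offset(c, d))


# Made-up vectors, for illustration only (not taken from GoogleNews)
toy = {
    'Germany': [0.8, 0.1, 0.2],
    'Berlin':  [0.8, 0.1, 0.9],
    'France':  [0.2, 0.7, 0.2],
    'Paris':   [0.3, 0.7, 0.9],
}

score = offset_cosine(toy['Berlin'], toy['Germany'], toy['Paris'], toy['France'])
print(round(score, 2))  # close to 1.0 means the two offsets are nearly parallel
```

On the real model, `offset_cosine(model['Berlin'], model['Germany'], model['Paris'], model['France'])` would run the same check. The result is not shown here because it was not run against the GoogleNews vectors.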


## Trump - USA + Germany = Hitler?
FAKE NEWS

In [12]:
what_vec = model['Trump'] + model['Germany'] - model['USA']
what_indexes = indexNN.get_nns_by_vector(what_vec, 1)

for i in what_indexes:
    print(index2key[i])

Dean_Gitter


If you play with this notebook and find good word2vec equations, please tweet them to me!  
__@dh7net__