# A Word2Vec playground

To play with this notebook, you'll need NumPy, Annoy, Gensim, and the [GoogleNews word2vec model](https://code.google.com/archive/p/word2vec/).

* pip install numpy
* pip install annoy  
* pip install gensim  
* you can find the GoogleNews vectors by searching for _GoogleNews-vectors-negative300.bin_  
  

Inspired by: https://github.com/chrisjmccormick/inspect_word2vec


In [1]:
# import and init
from annoy import AnnoyIndex
import gensim
import os.path
import numpy as np

prefix_filename = 'word2vec'
ann_filename = prefix_filename + '.ann'
i2k_filename = prefix_filename + '_i2k.npy'
k2i_filename = prefix_filename + '_k2i.npy'

## Create a model or load it

In [2]:
# Load Google's pre-trained Word2Vec model.
print("load GoogleNews Model")
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
print("loading done")

hello = model['hello']
vector_size = len(hello)
# Note: model.vocab exists in gensim < 4.0; gensim 4.x exposes model.key_to_index instead.
print('model size= %d' % len(model.vocab))
print('vector size= %d' % vector_size)

load GoogleNews Model
loading done
model size= 3000000
vector size= 300


In [3]:
# Build the Annoy index and the key<->index mappings and save them,
# or load them directly if they were already saved.
vocab = model.vocab.keys()
# Older Annoy releases defaulted to 'angular'; recent ones ask for the metric to be passed explicitly.
indexNN = AnnoyIndex(vector_size, metric='angular')
index2key = [None]*len(model.vocab)
key2index = {}

if not os.path.isfile(ann_filename):
    print('creating indexes')
    i = 0
    try:
        for key in vocab:
            indexNN.add_item(i, model[key])
            key2index[key] = i
            index2key[i] = key
            i = i + 1
            if i % 10000 == 0:
                print('%d %s' % (i, key))
    except TypeError:
        print('Error with key %s' % key)
    print('building 10 trees')
    indexNN.build(10) # 10 trees
    print('save  files')
    indexNN.save(ann_filename)
    np.save(i2k_filename, index2key)
    np.save(k2i_filename, key2index)
    print('done')
else:
    print("loading files")
    indexNN.load(ann_filename)
    # allow_pickle=True is needed to load object arrays on recent NumPy versions
    index2key = np.load(i2k_filename, allow_pickle=True)
    # np.save stored the dict as a 0-d object array, so .item() unwraps it
    key2index = np.load(k2i_filename, allow_pickle=True).item()
    print('loading done: %d items' % indexNN.get_n_items())

creating indexes
10000 DeLille_Cellars
20000 igned
30000 industrial_Ruhr
40000 ANSI_ASHRAE_IESNA_Standard
50000 coach_Jay_Vidovich
60000 Kizil
70000 Nanakshahi
80000 iSink_U_Facebook
90000 Renfrey
100000 Doctorate_Degree
110000 Synthetic_Cannabinoids
120000 Employee_Jeff_Colucy
130000 Kolbek
140000 dunce_hat
150000 Irn_Bru_First
160000 model_Maggie_Rizer
170000 OTTAWA_Karlheinz_Schreiber
180000 BGiles
190000 prMac.com_Vienna_Austria
200000 Tina_Pisnik
210000 undersigned_Rubin_Lublin
220000 Willnett_Crockett
230000 Sony_Pictures_Studios
240000 Voices
250000 salmon_Delta_smelt
260000 Yasuaki_Iwamoto_auto
270000 Ambrose
280000 DeLamatre
290000 BY_JOYCE_J._PERSICO
300000 Austin_Ruse
310000 Adeline_Teoh
320000 1_Utama_Shopping
330000 iSimCity
340000 symbol_TOT.UN
350000 southeastwards
360000 Whitchurch_Heath
370000 WEXFORD
380000 Kirk_Baert
390000 church_renounced_polygamy
400000 Whitney_Otis_elevators
410000 Fonze
420000 Fabian_Babich
430000 Desmodur_®
440000 Michael_Egholm_Ph.D.
450000 Co
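
Saving a Python dict with `np.save` relies on pickling, which is a bit fragile. As a possible alternative (just a sketch, not what this notebook does), only the `index2key` list needs to be stored, since `key2index` can be rebuilt from it. The helper names and the JSON file name below are made up for illustration.

```python
import json
import os
import tempfile

def save_index2key(path, index2key):
    """Write the list of keys (position = Annoy item id) as JSON."""
    with open(path, 'w') as f:
        json.dump(list(index2key), f)

def load_mappings(path):
    """Read the list back and rebuild the reverse lookup from it."""
    with open(path) as f:
        index2key = json.load(f)
    key2index = {key: i for i, key in enumerate(index2key)}
    return index2key, key2index

# Tiny round trip with a hypothetical file name in a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), 'word2vec_i2k.json')
save_index2key(demo_path, ['king', 'queen', 'Paris'])
demo_index2key, demo_key2index = load_mappings(demo_path)
print(demo_index2key)
print(demo_key2index['Paris'])
```

The Annoy index itself would still be saved and loaded with `indexNN.save` and `indexNN.load` as in the cell above.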

## King - Male + Female = Queen?
Nope!

At least not with a word2vec model trained on Google News...

In [10]:
what_vec = model['king'] - model['male'] + model['female']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print(index2key[what_indexes[0]])

king
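
A likely reason for getting `king` back: the nearest neighbour of `king - male + female` is often one of the input words, because the offset is small next to the `king` vector itself. A common workaround is to skip the input words when picking the answer (gensim's `most_similar` does this). Here is a minimal sketch in plain Python; the 3-d vectors and the `analogy` helper are made up for illustration and are not real word2vec values or a library API.

```python
import math

# Toy vectors invented for this example (not taken from the GoogleNews model).
toy_vectors = {
    'king':   [1.0, 0.5, 0.0],
    'queen':  [1.0, 0.0, 0.5],
    'male':   [0.1, 0.3, 0.1],
    'female': [0.1, 0.1, 0.3],
    'apple':  [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative, exclude_inputs=True):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Optionally ignore the words that were used to build the query.
    skip = set(positive) | set(negative) if exclude_inputs else set()
    candidates = [word for word in vectors if word not in skip]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))

print(analogy(toy_vectors, ['king', 'female'], ['male'], exclude_inputs=False))
print(analogy(toy_vectors, ['king', 'female'], ['male']))
```

With the Annoy index used in this notebook, the same idea would be to ask `get_nns_by_vector` for a few neighbours instead of one and drop `king`, `male` and `female` from the list.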


## King - boy + girl = Queen?
Yes :)  
but it doesn't work with man & women :( (note that this query pairs the singular _man_ with the plural _women_)

In [12]:
what_vec = model['king'] - model['boy'] + model['girl']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print(index2key[what_indexes[0]])

queen


In [15]:
what_vec = model['king'] - model['man'] + model['women']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print(index2key[what_indexes[0]])

absolute_monarch


## Berlin  - Germany + France = Paris?
Yes!

This makes me happy, but if someone understands why, please tell me!

In [14]:
what_vec = model['Berlin'] - model['Germany'] + model['France']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print(index2key[what_indexes[0]])

Paris
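
One common way to read this result, offered as a rough intuition rather than a full explanation: if the model encodes "is the capital of" as roughly the same offset for every country, then the two differences below are close, and moving the `Germany` term to the other side gives the query used in the cell above.

$$
\vec{v}_{\text{Berlin}} - \vec{v}_{\text{Germany}} \approx \vec{v}_{\text{Paris}} - \vec{v}_{\text{France}}
\quad\Longrightarrow\quad
\vec{v}_{\text{Berlin}} - \vec{v}_{\text{Germany}} + \vec{v}_{\text{France}} \approx \vec{v}_{\text{Paris}}
$$

The nearest-neighbour search then only has to land closer to `Paris` than to any other word, so the approximation does not need to be exact.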


## Trump - USA + Germany = Hitler?
FAKE NEWS

In [12]:
what_vec = model['Trump'] + model['Germany'] - model['USA']
what_indexes = indexNN.get_nns_by_vector(what_vec, 1)

for i in what_indexes:
    print(index2key[i])

Dean_Gitter


# Let's explore the stereotypes hidden in the news:

In [53]:
man2women = model['girl'] - model['boy']

word_list = ["king","prince", "male", "boy","dad", "father", "president", "dentist",
             "scientist", "efficient",  "teacher", "doctor", "minister", "lover"]
for word in word_list:
    what_vec = model[word] + man2women
    what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
    print('%s for him, %s for her.' % (word, index2key[what_indexes[0]]))

king for him, queen for her.
prince for him, duchess for her.
male for him, female for her.
boy for him, girl for her.
dad for him, motherly_instincts for her.
father for him, mother for her.
president for him, president for her.
dentist for him, plastic_surgeon for her.
scientist for him, linguistics_professor for her.
efficient for him, efficient for her.
teacher for him, teacher for her.
doctor for him, doctor for her.
minister for him, minister for her.
lover for him, seductress for her.


In [54]:
capital = model['Berlin'] - model['Germany'] 

word_list = ["Germany", "France", "Italy", "USA", "Russia", "boys", "cars", "flowers", "soldiers",
             "scientists", ]
for word in word_list:
    what_vec = model[word] + capital
    what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
    print('%s is the capital of %s' % (index2key[what_indexes[0]], word))

Berlin is the capital of Germany
Paris is the capital of France
Rome is the capital of Italy
Teen_Poetry_Slam is the capital of USA
Moscow is the capital of Russia
kids is the capital of boys
paddywagon is the capital of cars
flower is the capital of flowers
civilians is the capital of soldiers
Humberto_Campins is the capital of scientists
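
The `capital` offset gives sensible answers for countries and nonsense for things like `cars` or `boys`. One way to probe this (a sketch with made-up 3-d vectors, not real word2vec values) is to compare offsets with cosine similarity: pairs that share the same relation should have offsets pointing in a similar direction. The `offset_similarity` helper is hypothetical and only there for illustration.

```python
import math

# Toy vectors invented for this example.
toy_vectors = {
    'Germany': [1.0, 0.0, 0.2],
    'Berlin':  [1.0, 1.0, 0.2],
    'France':  [0.8, 0.0, 0.6],
    'Paris':   [0.8, 0.9, 0.7],
    'boys':    [0.1, 0.0, 1.0],
    'kids':    [0.3, 0.1, 0.6],
}

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def offset_similarity(vectors, pair_a, pair_b):
    """Cosine similarity between the offsets (a1 - a0) and (b1 - b0)."""
    offset_a = subtract(vectors[pair_a[1]], vectors[pair_a[0]])
    offset_b = subtract(vectors[pair_b[1]], vectors[pair_b[0]])
    return cosine(offset_a, offset_b)

countries = offset_similarity(toy_vectors, ('Germany', 'Berlin'), ('France', 'Paris'))
unrelated = offset_similarity(toy_vectors, ('Germany', 'Berlin'), ('boys', 'kids'))
print('Germany->Berlin vs France->Paris: %.2f' % countries)
print('Germany->Berlin vs boys->kids:    %.2f' % unrelated)
```

With the real model, the same comparison could be run on `model['Berlin'] - model['Germany']` and other candidate pairs to see which relations share a direction before trusting the analogy.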


If you play with this notebook and find good word2vec equations, please tweet them to me!  
__@dh7net__