# word2vec

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from the Google examples.

In [None]:
%load_ext autoreload
%autoreload 2

## Training

Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip)

You could use `make test-data` from the root of the repo.

In [2]:
import word2vec

Run `word2phrase` to join words that frequently appear together into single tokens, for example "Los Angeles" becomes "Los_Angeles".

In [3]:
word2vec.word2phrase('../data/text8', '../data/text8-phrases', verbose=True)

Running command: word2phrase -train ../data/text8 -output ../data/text8-phrases -min-count 5 -threshold 100 -debug 2
Starting training using file ../data/text8
Words processed: 17000K     Vocab size: 4399K  
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Words written: 17000K

This created a `text8-phrases` file that we can use as a better input for `word2vec`.

Note that you could easily skip this previous step and use the text data as input for `word2vec` directly.
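
As a rough illustration of what the phrase step does, the sketch below scores each adjacent word pair by how often it occurs relative to the counts of its two words, and joins pairs that score above a threshold with an underscore. The scoring form `(count(ab) - min_count) * total / (count(a) * count(b))` is a simplification, and the `merge_phrases` helper, its defaults and the toy sentence are assumptions made for this example, not the tool's actual code.

```python
from collections import Counter

def merge_phrases(tokens, min_count=1, threshold=1.0):
    """Join adjacent word pairs whose co-occurrence score exceeds `threshold`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Pairs seen no more than min_count times score <= 0 and are never merged
            score = (bigrams[(a, b)] - min_count) * total / (unigrams[a] * unigrams[b])
            if score > threshold:
                out.append(a + "_" + b)
                i += 2  # skip the second word of the merged pair
                continue
        out.append(tokens[i])
        i += 1
    return out

tokens = "i flew to los angeles and then left los angeles for the coast".split()
merged = merge_phrases(tokens)
```

Here only the repeated pair is merged, so `merged` contains `los_angeles` twice and every other word unchanged.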

Now train the word2vec model.

In [4]:
word2vec.word2vec('../data/text8-phrases', '../data/text8.bin', size=100, binary=True, verbose=True)

Running command: word2vec -train ../data/text8-phrases -output ../data/text8.bin -size 100 -window 5 -sample 1e-3 -hs 0 -negative 5 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -debug 2 -binary 1 -cbow 1
Starting training using file ../data/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 372.45k

That created a `text8.bin` file containing the word vectors in a binary format.

Generate clusters of the word vectors. Note that `word2clusters` trains its own model from the input text, here the original `text8` file.

In [5]:
word2vec.word2clusters('../data/text8', '../data/text8-clusters.txt', 100, verbose=True)

Running command: word2vec -train ../data/text8 -output ../data/text8-clusters.txt -size 100 -window 5 -sample 1e-3 -hs 0 -negative 5 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -debug 2 -binary 0 -cbow 1 -classes 100
Starting training using file ../data/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 384.20k

That created a `text8-clusters.txt` file with the cluster number for every word in the vocabulary.

## Predictions

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import word2vec

Import the `word2vec` binary file created above

In [3]:
model = word2vec.load('../data/text8.bin')

We can take a look at the vocabulary as a numpy array

In [4]:
model.vocab

array(['</s>', 'the', 'of', ..., 'dakotas', 'nias', 'burlesques'],
      dtype='<U78')

Or take a look at the whole matrix

In [5]:
model.vectors.shape

(98331, 100)

In [6]:
model.vectors

array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.09608812,  0.04207754, -0.03806892, ..., -0.03362655,
         0.08464055, -0.09016852],
       [ 0.12761782, -0.04113456, -0.125634  , ..., -0.05283457,
         0.13063692,  0.0944253 ],
       ...,
       [ 0.03679391,  0.07516006, -0.0347415 , ...,  0.05325394,
         0.0078049 ,  0.02930316],
       [ 0.01181444, -0.00175774,  0.08741257, ...,  0.09732709,
        -0.16259262, -0.05862194],
       [-0.13616219,  0.00300416, -0.05052309, ...,  0.07784981,
        -0.13167247, -0.09754379]])

We can retrieve the vector of individual words

In [7]:
model['dog'].shape

(100,)

In [8]:
model['dog'][:10]

array([ 0.09475172,  0.13367875,  0.09377556,  0.17996974,  0.07507402,
        0.01064182,  0.14914408,  0.06769233,  0.00867516, -0.01029446])
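
Conceptually, looking up a word amounts to finding its position in the vocabulary and taking the matching row of the matrix. The sketch below shows that idea with a made-up three-word vocabulary; the `lookup` helper and the toy values are hypothetical and not part of the `word2vec` package.

```python
import numpy as np

# Toy stand-ins for model.vocab and model.vectors (values are made up)
vocab = np.array(["the", "dog", "cat"])
vectors = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.4, 0.2, 0.1, 0.1],
])

def lookup(word):
    """Return the row of `vectors` at the position of `word` in `vocab`."""
    matches = np.where(vocab == word)[0]
    if matches.size == 0:
        raise KeyError(word)
    return vectors[matches[0]]

dog_vector = lookup("dog")
```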

We can calculate the distance between two or more words. With more than two words, every pairwise combination is returned.

In [9]:
model.distance("dog", "cat", "fish")

[('dog', 'cat', 0.871114802703574),
 ('dog', 'fish', 0.593594274883308),
 ('cat', 'fish', 0.619406762291862)]
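
The `dog`/`cat` value above (0.8711) is the same number that `similar` reports for `cat` in the Similarity section, so the metric here appears to be cosine similarity, where higher means closer. A minimal sketch of computing it for every pair, using toy 2-dimensional vectors and a hypothetical `pairwise_cosine` helper:

```python
from itertools import combinations

import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_cosine(named_vectors):
    """Return (word_a, word_b, cosine) for every unordered pair of words."""
    return [(a, b, cosine(named_vectors[a], named_vectors[b]))
            for a, b in combinations(named_vectors, 2)]

# Made-up vectors, only to show the shape of the result
toy = {
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([1.0, 1.0]),
    "fish": np.array([0.0, 1.0]),
}
pairs = pairwise_cosine(toy)
```

The pairs come out in the same order as in the output above: first word against each later word, then the second word against the rest.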

## Similarity

We can do simple queries to retrieve words similar to "dog" based on cosine similarity:

In [10]:
indexes, metrics = model.similar("dog")
indexes, metrics

(array([ 2437,  5478,  7593, 10309,  2428,  9963, 10230,  4812,  2391,
         3964]),
 array([0.8711148 , 0.83590669, 0.7858697 , 0.77791787, 0.761248  ,
        0.75992145, 0.75895443, 0.75799265, 0.75621061, 0.75438873]))

This returned a tuple with 2 items:
1. numpy array with the indexes of the similar words in the vocabulary
2. numpy array with cosine similarity to each word
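
To make the two returned arrays concrete, here is a small sketch of a top-n cosine similarity query over a toy matrix. The `most_similar` function and the data are assumptions for illustration; the package's own implementation may differ in details.

```python
import numpy as np

def most_similar(vocab, vectors, word, n=2):
    """Return (indexes, metrics) of the n words closest to `word` by cosine."""
    # Normalise rows so that a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = int(np.where(vocab == word)[0][0])
    scores = unit @ unit[query]
    order = np.argsort(scores)[::-1]      # best match first
    order = order[order != query][:n]     # drop the query word itself
    return order, scores[order]

vocab = np.array(["dog", "cat", "car", "truck"])
vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])
indexes, metrics = most_similar(vocab, vectors, "dog")
```

As with `model.similar`, `vocab[indexes]` then gives the matching words, ordered from most to least similar.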

We can get the words for those indexes

In [11]:
model.vocab[indexes]

array(['cat', 'cow', 'goat', 'rat', 'bear', 'rabbit', 'pig', 'wolf',
       'girl', 'dogs'], dtype='<U78')

There is a helper function to create a combined response as a numpy [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html)

In [12]:
model.generate_response(indexes, metrics)

rec.array([('cat', 0.8711148 ), ('cow', 0.83590669), ('goat', 0.7858697 ),
           ('rat', 0.77791787), ('bear', 0.761248  ),
           ('rabbit', 0.75992145), ('pig', 0.75895443),
           ('wolf', 0.75799265), ('girl', 0.75621061),
           ('dogs', 0.75438873)],
          dtype=[('word', '<U78'), ('metric', '<f8')])

It is easy to convert that numpy array into a pure Python response:

In [13]:
model.generate_response(indexes, metrics).tolist()

[('cat', 0.8711148027035741),
 ('cow', 0.835906689576186),
 ('goat', 0.785869702381845),
 ('rat', 0.7779178650245034),
 ('bear', 0.761248001192216),
 ('rabbit', 0.7599214521649291),
 ('pig', 0.7589544283401916),
 ('wolf', 0.7579926492523692),
 ('girl', 0.7562106136633342),
 ('dogs', 0.7543887265825944)]
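
For reference, a combined response like this can be built from two parallel arrays with plain numpy. This is only a sketch of the idea using `np.rec.fromarrays` and made-up values, not necessarily how `generate_response` is implemented.

```python
import numpy as np

# Parallel arrays, as returned by a similarity query (toy values)
words = np.array(["cat", "cow"])
metrics = np.array([0.87, 0.84])

# Combine them into one record array with named fields
response = np.rec.fromarrays([words, metrics], names="word,metric")

# Fields are available as attributes, and tolist() yields plain Python tuples
as_python = response.tolist()
```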

### Phrases

Since we trained the model with the output of `word2phrase`, we can ask for the similarity of "phrases", that is, combined words such as "Los Angeles".

In [14]:
indexes, metrics = model.similar('los_angeles')
model.generate_response(indexes, metrics).tolist()

[('san_francisco', 0.8972880096916375),
 ('san_diego', 0.8764641849445813),
 ('miami', 0.839973256930157),
 ('seattle', 0.8341662072927412),
 ('las_vegas', 0.8333799027822566),
 ('detroit', 0.8210299188979637),
 ('chicago', 0.818606214969616),
 ('cincinnati', 0.8172210251957758),
 ('atlanta', 0.8101558040323271),
 ('st_louis', 0.8101060091264682)]

### Analogies

It's possible to do more complex queries such as analogies, for example `king - man + woman = queen`.
Like `similar`, this method returns the indexes of the words in the vocab and the metric.

In [15]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
indexes, metrics

(array([1087, 7523, 1145, 8419, 1335,  648, 6768, 1827, 3141,  344]),
 array([0.30186443, 0.27870405, 0.27706988, 0.27548467, 0.27114285,
        0.27089344, 0.26896827, 0.26865763, 0.26679681, 0.26623266]))

In [16]:
model.generate_response(indexes, metrics).tolist()

[('queen', 0.3018644250758753),
 ('empress', 0.2787040539374566),
 ('prince', 0.2770698840727954),
 ('aragon', 0.2754846673265303),
 ('wife', 0.27114284971472047),
 ('emperor', 0.2708934379826948),
 ('regent', 0.2689682685888384),
 ('throne', 0.2686576321715817),
 ('monarch', 0.26679681320875803),
 ('son', 0.2662326633886227)]
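
The idea behind an analogy query can be sketched as vector arithmetic followed by a similarity ranking. The example below adds the positive vectors, subtracts the negative ones and ranks the remaining words by cosine similarity to the result. The `analogy` function and the hand-made 3-dimensional vectors are assumptions for illustration; the package's scoring may differ, so the metric values are not expected to match the output above.

```python
import numpy as np

def analogy(vocab, vectors, pos, neg, n=1):
    """Rank words by cosine similarity to sum(pos) - sum(neg), skipping the inputs."""
    index = {word: i for i, word in enumerate(vocab.tolist())}
    target = sum(vectors[index[w]] for w in pos) - sum(vectors[index[w]] for w in neg)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ (target / np.linalg.norm(target))
    exclude = {index[w] for w in pos + neg}
    order = [int(i) for i in np.argsort(scores)[::-1] if int(i) not in exclude][:n]
    return np.array(order), scores[order]

# Hand-made vectors with dimensions (royalty, male, female)
vocab = np.array(["king", "queen", "man", "woman", "apple"])
vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.5, 0.5],
])
indexes, metrics = analogy(vocab, vectors, pos=["king", "woman"], neg=["man"])
```

With these toy vectors, `king + woman - man` lands exactly on the `queen` vector, so `vocab[indexes]` is `['queen']`.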

### Clusters

In [17]:
clusters = word2vec.load_clusters('../data/text8-clusters.txt')

We can inspect the vocabulary of the clustered words as a numpy array

In [18]:
clusters.vocab

array(['</s>', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'],
      dtype='<U29')

We can get all the words grouped in a specific cluster

In [19]:
clusters.get_words_on_cluster(90).shape

(209,)

In [20]:
clusters.get_words_on_cluster(90)[:10]

array(['along', 'together', 'associated', 'relations', 'relationship',
       'deal', 'combined', 'contact', 'connection', 'bond'], dtype='<U29')
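
Since the clusters file assigns one cluster number to every word, selecting the words of a cluster can be thought of as a boolean mask over the vocabulary. A minimal sketch with made-up data; `cluster_ids` and `words_on_cluster` are hypothetical names, not the package's internals.

```python
import numpy as np

# Toy stand-ins: one cluster number per vocabulary word (values are made up)
vocab = np.array(["along", "together", "paris", "berlin", "bond"])
cluster_ids = np.array([90, 90, 15, 15, 90])

def words_on_cluster(cluster):
    """Return every vocabulary word assigned to `cluster`."""
    return vocab[cluster_ids == cluster]

group = words_on_cluster(90)
```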

We can add the clusters to the word2vec model and generate a response that includes the clusters

In [21]:
model.clusters = clusters

In [22]:
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])

In [23]:
model.generate_response(indexes, metrics).tolist()

[('berlin', 0.3222541117490837, 15),
 ('munich', 0.28194423437060834, 15),
 ('vienna', 0.27364676745734806, 12),
 ('moscow', 0.26758512046265515, 74),
 ('leipzig', 0.26465489144791404, 8),
 ('st_petersburg', 0.2621400985097239, 61),
 ('dresden', 0.2565623490594948, 71),
 ('prague', 0.24847315163847422, 72),
 ('bonn', 0.2449265540079843, 8),
 ('z_rich', 0.2402333559866084, 80)]