In [1]:
%matplotlib inline


Word2Vec Model
==============

Introduces Gensim's Word2Vec model and demonstrates its use on the Lee Corpus.




In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In case you missed the buzz, word2vec is widely featured as a member of the
“new wave” of machine learning algorithms based on neural networks, commonly
referred to as "deep learning" (though word2vec itself is rather shallow).
Using large amounts of unannotated plain text, word2vec learns relationships
between words automatically. The output is a set of vectors, one per word,
with remarkable linear relationships that allow us to do things like:

* vec("king") - vec("man") + vec("woman") =~ vec("queen")
* vec("Montreal Canadiens") - vec("Montreal") + vec("Toronto") =~ vec("Toronto Maple Leafs").

Word2vec is very useful in `automatic text tagging
<https://github.com/RaRe-Technologies/movie-plots-by-genre>`_\ , recommender
systems and machine translation.

This tutorial:

#. Introduces ``Word2Vec`` as an improvement over traditional bag-of-words
#. Shows off a demo of ``Word2Vec`` using a pre-trained model
#. Demonstrates training a new model from your own data
#. Demonstrates loading and saving models
#. Introduces several training parameters and demonstrates their effect
#. Discusses memory requirements
#. Visualizes Word2Vec embeddings by applying dimensionality reduction

Review: Bag-of-words
--------------------

.. Note:: Feel free to skip these review sections if you're already familiar with the models.

You may be familiar with the `bag-of-words model
<https://en.wikipedia.org/wiki/Bag-of-words_model>`_ from the
`core_concepts_vector` section.
This model transforms each document to a fixed-length vector of integers.
For example, given the sentences:

- ``John likes to watch movies. Mary likes movies too.``
- ``John also likes to watch football games. Mary hates football.``

The model outputs the vectors:

- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``
- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``

Each vector has 11 elements, where each element counts the number of times a
particular word occurred in the document.
The order of elements is arbitrary.
In the example above, the order of the elements corresponds to the words:
``["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]``.
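
The counts above can be reproduced in a few lines of plain Python. This is a
minimal sketch rather than Gensim's implementation: the helper name
``bag_of_words`` and the regex tokenizer are assumptions made for
illustration, and the vocabulary order is fixed by hand to match the example.

```python
import re
from collections import Counter

# Vocabulary in the same (arbitrary) order used in the example above.
vocabulary = ["John", "likes", "to", "watch", "movies", "Mary",
              "too", "also", "football", "games", "hates"]


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = re.findall(r"\w+", sentence)  # crude tokenizer: runs of word characters
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games. Mary hates football."

bow1 = bag_of_words(doc1, vocabulary)
bow2 = bag_of_words(doc2, vocabulary)
print(bow1)
print(bow2)
```

Each output list has one slot per vocabulary word, which is why the vector
length equals the vocabulary size.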

Bag-of-words models are surprisingly effective, but have several weaknesses.

First, they lose all information about word order: "John likes Mary" and
"Mary likes John" correspond to identical vectors. A partial solution is the
bag of `n-grams <https://en.wikipedia.org/wiki/N-gram>`__
model, which represents documents as fixed-length vectors over word phrases
of length n. This captures local word order, but suffers from data
sparsity and high dimensionality.
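
To make the word-order point concrete, the sketch below compares unigram and
bigram counts for the two sentences. The ``ngrams`` helper is a hypothetical
name introduced here for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


a = "John likes Mary".split()
b = "Mary likes John".split()

# Unigram (bag-of-words) counts are identical: word order is lost.
same_unigrams = Counter(ngrams(a, 1)) == Counter(ngrams(b, 1))

# Bigram counts differ: local word order is preserved.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_unigrams, same_bigrams)
print(ngrams(a, 2))
```

The price is a larger feature space: every distinct pair of adjacent words
becomes its own dimension, which is where the sparsity comes from.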

Second, the model does not attempt to learn the meaning of the underlying
words, and as a consequence, the distance between vectors doesn't always
reflect the difference in meaning.  The ``Word2Vec`` model addresses this
second problem.

Introducing: the ``Word2Vec`` Model
-----------------------------------

``Word2Vec`` is a more recent model that embeds words in a lower-dimensional
vector space using a shallow neural network. The result is a set of
word-vectors where vectors close together in vector space have similar
meanings based on context, and word-vectors distant to each other have
differing meanings. For example, ``strong`` and ``powerful`` would be close
together and ``strong`` and ``Paris`` would be relatively far.

There are two versions of this model, and the :py:class:`~gensim.models.word2vec.Word2Vec`
class implements them both:

1. Skip-grams (SG)
2. Continuous-bag-of-words (CBOW)

.. Important::
  Don't let the implementation details below scare you.
  They're advanced material: if it's too much, then move on to the next section.

The `Word2Vec Skip-gram <http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model>`__
model, for example, takes in pairs (word1, word2) generated by moving a
window across text data, and trains a 1-hidden-layer neural network on a
synthetic task: given an input word, predict a probability distribution over
the words near it. A virtual `one-hot
<https://en.wikipedia.org/wiki/One-hot>`__ encoding of words
goes through a 'projection layer' to the hidden layer; these projection
weights are later interpreted as the word embeddings. So if the hidden layer
has 300 neurons, this network will give us 300-dimensional word embeddings.

Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It
is also a 1-hidden-layer neural network. The synthetic training task now uses
the average of multiple input context words, rather than a single word as in
skip-gram, to predict the center word. Again, the projection weights that
turn one-hot words into averageable vectors, of the same width as the hidden
layer, are interpreted as the word embeddings.
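
The two training tasks differ mainly in how examples are cut out of the
sliding window. The sketch below only generates the training examples; it
does not train a network. The function names and the fixed ``window``
argument are assumptions for illustration, and the real implementation also
applies steps such as frequent-word downsampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: one input word predicts each nearby word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context words, center) examples: the context jointly predicts the center."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples


tokens = "the quick brown fox jumps".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg[:3])
print(cb[:2])
```

Skip-gram produces one example per (center, neighbour) pair, while CBOW
produces one example per position, which is part of why the two variants
have different training costs.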




Word2Vec Demo
-------------

To see what ``Word2Vec`` can do, let's download a pre-trained model and play
around with it. We will fetch the Word2Vec model trained on part of the
Google News dataset, covering approximately 3 million words and phrases. Such
a model can take hours to train, but since it's already available,
downloading and loading it with Gensim takes minutes.

.. Important::
  The model is approximately 2GB, so you'll need a decent network connection
  to proceed.  Otherwise, skip ahead to the "Training Your Own Model" section
  below.

You may also check out an `online word2vec demo
<http://radimrehurek.com/2014/02/word2vec-tutorial/#app>`_ where you can try
this vector algebra for yourself. That demo runs ``word2vec`` on the
**entire** Google News dataset, of **about 100 billion words**.




In [12]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

2020-04-17 14:21:35,672 : INFO : loading projection weights from /Users/caihaocui/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2020-04-17 14:23:38,474 : INFO : loaded (3000000, 300) matrix from /Users/caihaocui/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz


In [14]:
import joblib
# Compressed joblib pickles
# joblib.dump(wv, '../models/word2vec-google-news-300.joblib.z')

['../models/word2vec-google-news-300.joblib.z']

A common operation is to retrieve the vocabulary of a model.  That is trivial:



In [4]:
for i, word in enumerate(wv.vocab):
    if i == 10:
        break
    print(word)

</s>
in
for
that
is
on
##
The
with
said


We can easily obtain vectors for terms the model is familiar with:




In [5]:
vec_king = wv['king']

In [6]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

Unfortunately, the model is unable to infer vectors for unfamiliar words.
This is one limitation of Word2Vec: if this limitation matters to you, check
out the FastText model.




In [7]:
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

The word 'cameroon' does not appear in this model


Moving on, ``Word2Vec`` supports several word similarity tasks out of the
box.  You can see how the similarity intuitively decreases as the words get
less and less similar.




In [8]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
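
The scores printed by ``similarity`` are cosine similarities between the two
word vectors. The sketch below computes the same quantity by hand; the
3-dimensional vectors are made up for illustration and are not taken from
the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over product of norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings", made up for illustration only.
toy_vectors = {
    'car': [0.9, 0.1, 0.0],
    'minivan': [0.8, 0.3, 0.1],
    'cereal': [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors['car'], toy_vectors['minivan'])
sim_far = cosine_similarity(toy_vectors['car'], toy_vectors['cereal'])
print('%.2f %.2f' % (sim_close, sim_far))
```

Because only the angle matters, scaling a vector by a positive constant does
not change its similarity to other vectors.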


Print the 5 words most similar to "car" and "minivan" taken together:



In [9]:
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

2020-04-17 14:17:08,984 : INFO : precomputing L2-norms of word weight vectors


[('SUV', 0.853219211101532), ('vehicle', 0.8175784349441528), ('pickup_truck', 0.7763689160346985), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.756571888923645)]


Which of the below does not belong in the sequence?



In [10]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car




Training Your Own Model
-----------------------

To start, you'll need some data for training the model.  For the following
examples, we'll use the `Lee Corpus
<https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor>`_
(which you already have if you've installed gensim).

This corpus is small enough to fit entirely in memory, but we'll implement a
memory-friendly iterator that reads it line-by-line to demonstrate how you
would handle a larger corpus.




In [15]:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # the context manager closes the file once iteration finishes
        with open(corpus_path) as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

If we wanted to do any custom preprocessing (e.g. decode a non-standard
encoding, lowercase, remove numbers, extract named entities), all of it can
be done inside the ``MyCorpus`` iterator, and ``word2vec`` doesn’t need to
know. All that is required is that the input yields one sentence (a list of
utf8 words) after another.
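
As a sketch of what such preprocessing might look like, the class below
lowercases each line and strips digits before tokenizing. The class name
``PreprocessedCorpus``, the throwaway corpus file and the chosen
preprocessing steps are all assumptions for illustration. The important
property is that the object can be iterated more than once, because training
makes several passes over the data (one to collect the vocabulary, then one
per epoch); a plain generator would be exhausted after the first pass.

```python
import os
import re
import tempfile


class PreprocessedCorpus:
    """Restartable iterable that yields one preprocessed sentence per line.

    Hypothetical example class; the preprocessing steps are arbitrary choices.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf8') as fin:
            for line in fin:
                line = line.lower()               # lowercase
                line = re.sub(r'\d+', ' ', line)  # remove numbers
                yield line.split()                # whitespace tokenization


# Write a tiny throwaway corpus so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf8') as tmp:
    tmp.write("Hundreds of people fled in 2019\nFires burned 300 homes\n")
    corpus_path = tmp.name

corpus = PreprocessedCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a fresh pass: __iter__ reopens the file
print(first_pass)
os.remove(corpus_path)
```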

Let's go ahead and train a model on our corpus.  Don't worry much about the
training parameters for now; we'll revisit them later.




In [16]:
import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

2020-04-17 14:37:21,372 : INFO : collecting all words and their counts
2020-04-17 14:37:21,378 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:37:21,451 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2020-04-17 14:37:21,453 : INFO : Loading a fresh vocabulary
2020-04-17 14:37:21,459 : INFO : effective_min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
2020-04-17 14:37:21,460 : INFO : effective_min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
2020-04-17 14:37:21,468 : INFO : deleting the raw counts dictionary of 6981 items
2020-04-17 14:37:21,469 : INFO : sample=0.001 downsamples 51 most-common words
2020-04-17 14:37:21,470 : INFO : downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
2020-04-17 14:37:21,475 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes
2020-04-17 14:37:21,476 : INFO : resetting layer weight

Once we have our model, we can use it in the same way as in the demo above.

The main part of the model is ``model.wv``\ , where "wv" stands for "word vectors".




In [17]:
vec_king = model.wv['king']

Retrieving the vocabulary works the same way:



In [18]:
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

hundreds
of
people
have
been
forced
to
their
homes
in


Storing and loading models
--------------------------

You'll notice that training non-trivial models can take time.  Once you've
trained your model and it works as expected, you can save it to disk.  That
way, you don't have to spend time training it all over again later.

You can store/load models using the standard gensim methods:




In [19]:
import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

2020-04-17 14:37:50,323 : INFO : saving Word2Vec object under /var/folders/nb/1jpl223s247d7gwj8hb54b0w0000gn/T/gensim-model-2mfz9t3x, separately None
2020-04-17 14:37:50,324 : INFO : not storing attribute vectors_norm
2020-04-17 14:37:50,325 : INFO : not storing attribute cum_table
2020-04-17 14:37:50,351 : INFO : saved /var/folders/nb/1jpl223s247d7gwj8hb54b0w0000gn/T/gensim-model-2mfz9t3x
2020-04-17 14:37:50,351 : INFO : loading Word2Vec object from /var/folders/nb/1jpl223s247d7gwj8hb54b0w0000gn/T/gensim-model-2mfz9t3x
2020-04-17 14:37:50,369 : INFO : loading wv recursively from /var/folders/nb/1jpl223s247d7gwj8hb54b0w0000gn/T/gensim-model-2mfz9t3x.wv.* with mmap=None
2020-04-17 14:37:50,370 : INFO : setting ignored attribute vectors_norm to None
2020-04-17 14:37:50,371 : INFO : loading vocabulary recursively from /var/folders/nb/1jpl223s247d7gwj8hb54b0w0000gn/T/gensim-model-2mfz9t3x.vocabulary.* with mmap=None
2020-04-17 14:37:50,372 : INFO : loading trainables recursively from /var/

The ``save`` and ``load`` methods use pickle internally, optionally
``mmap``\ 'ing the model’s internal large NumPy matrices into virtual memory
directly from disk files, for inter-process memory sharing.

In addition, you can load models created by the original C tool, in both
its text and binary formats::

  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
  # using gzipped/bz2 input works too, no need to unzip
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
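
For orientation, the text variant of that format is simple: a header line
with the vocabulary size and the vector size, followed by one line per word
containing the word and its vector components separated by spaces. The
sketch below writes and reads that layout in plain Python. Both helper
functions are hypothetical, written only to illustrate the layout; in
practice you would use ``load_word2vec_format`` as shown above.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write a {word: [floats]} dict using the word2vec text layout."""
    vector_size = len(next(iter(vectors.values())))
    stream.write('%d %d\n' % (len(vectors), vector_size))  # header line
    for word, vector in vectors.items():
        stream.write(word + ' ' + ' '.join('%f' % x for x in vector) + '\n')


def read_word2vec_text(stream):
    """Read the same layout back into a {word: [floats]} dict."""
    vocab_size, vector_size = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == vector_size
        vectors[word] = values
    return vectors


# Toy vectors, made up for illustration.
toy = {'king': [0.5, -0.25], 'queen': [0.5, 0.25]}
buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
loaded = read_word2vec_text(buffer)
print(buffer.getvalue())
```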




In [24]:
joblib.dump(model, '../models/gensim-model-Lee Corpus.z')
new_model = joblib.load('../models/gensim-model-Lee Corpus.z')

In [25]:
for i, word in enumerate(new_model.wv.vocab):
    if i == 10:
        break
    print(word)
    
print(model.wv['king'])

hundreds
of
people
have
been
forced
to
their
homes
in
[-0.01732911 -0.02997664 -0.03767837 -0.0766324   0.01324796 -0.01675296
  0.05482481 -0.03078569 -0.04718136  0.00045335  0.01799739  0.03271269
 -0.00772294  0.05855966 -0.01453003  0.03809257  0.03609253 -0.00192986
  0.00481473 -0.01255722  0.01087584 -0.02284502  0.00936334 -0.00693843
 -0.02568576  0.08502577 -0.0183079  -0.05152747 -0.00454475 -0.01175759
 -0.01653733 -0.02631769 -0.01993887 -0.01263655  0.02192106  0.0039287
  0.03444264  0.01505884 -0.01548206 -0.02628593 -0.00882326 -0.04035963
  0.00923365  0.05125612 -0.03212337  0.0096742  -0.02519332 -0.0061903
  0.02187816  0.02910632  0.02966235 -0.00727127  0.03666604  0.02229308
 -0.04737489 -0.01757282  0.04990115 -0.05369172 -0.03785142  0.01172126
  0.01106479 -0.00365501 -0.03034444 -0.02472414  0.00095579 -0.0501749
 -0.04425829  0.01184128 -0.03996208  0.0517815   0.0351619   0.058522
 -0.02869451  0.02419358  0.00569182  0.03154719  0.02417227 -0.01564167
  

Training Parameters
-------------------

``Word2Vec`` accepts several parameters that affect both training speed and quality.

min_count
---------

``min_count`` is for pruning the internal dictionary. Words that appear only
once or twice in a billion-word corpus are probably uninteresting typos and
garbage. In addition, there’s not enough data to do any meaningful training
on those words, so it’s best to ignore them.

The default value is ``min_count=5``.
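
Conceptually, the pruning is a frequency filter applied to the vocabulary
before training. The sketch below shows the idea on a made-up toy corpus;
``prune_vocabulary`` is a hypothetical helper, not part of Gensim.

```python
from collections import Counter

# Toy corpus, made up for illustration.
toy_sentences = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'cat', 'ran'],
]


def prune_vocabulary(sentences, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


kept = prune_vocabulary(toy_sentences, min_count=2)
print(sorted(kept))
```

The training logs in this tutorial report the same kind of effect: with the
default setting, 1750 of the 6981 word types in the Lee corpus are retained.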



In [26]:
model = gensim.models.Word2Vec(sentences, min_count=10)

2020-04-17 14:43:38,738 : INFO : collecting all words and their counts
2020-04-17 14:43:38,741 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:43:38,811 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2020-04-17 14:43:38,812 : INFO : Loading a fresh vocabulary
2020-04-17 14:43:38,815 : INFO : effective_min_count=10 retains 889 unique words (12% of original 6981, drops 6092)
2020-04-17 14:43:38,816 : INFO : effective_min_count=10 leaves 43776 word corpus (75% of original 58152, drops 14376)
2020-04-17 14:43:38,818 : INFO : deleting the raw counts dictionary of 6981 items
2020-04-17 14:43:38,819 : INFO : sample=0.001 downsamples 55 most-common words
2020-04-17 14:43:38,819 : INFO : downsampling leaves estimated 29691 word corpus (67.8% of prior 43776)
2020-04-17 14:43:38,821 : INFO : estimated required memory for 889 words and 100 dimensions: 1155700 bytes
2020-04-17 14:43:38,821 : INFO : resetting layer weigh

size
----

``size`` is the number of dimensions (N) of the N-dimensional space that
gensim Word2Vec maps the words onto.

Bigger size values require more training data, but can lead to better (more
accurate) models. Reasonable values are in the tens to hundreds.




In [27]:
# default value of size=100
model = gensim.models.Word2Vec(sentences, size=200)

2020-04-17 14:44:02,868 : INFO : collecting all words and their counts
2020-04-17 14:44:02,869 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:44:02,940 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2020-04-17 14:44:02,941 : INFO : Loading a fresh vocabulary
2020-04-17 14:44:02,945 : INFO : effective_min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
2020-04-17 14:44:02,946 : INFO : effective_min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
2020-04-17 14:44:02,950 : INFO : deleting the raw counts dictionary of 6981 items
2020-04-17 14:44:02,950 : INFO : sample=0.001 downsamples 51 most-common words
2020-04-17 14:44:02,951 : INFO : downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
2020-04-17 14:44:02,953 : INFO : estimated required memory for 1750 words and 200 dimensions: 3675000 bytes
2020-04-17 14:44:02,954 : INFO : resetting layer weight

workers
-------

``workers``, the last of the major parameters (full list `here
<http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec>`_),
is for training parallelization, to speed up training:




In [35]:
# default value of workers=3 (tutorial says 1...)
model = gensim.models.Word2Vec(sentences, workers=4)

2020-04-17 14:52:34,974 : INFO : collecting all words and their counts
2020-04-17 14:52:34,977 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:52:35,051 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2020-04-17 14:52:35,052 : INFO : Loading a fresh vocabulary
2020-04-17 14:52:35,057 : INFO : effective_min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
2020-04-17 14:52:35,058 : INFO : effective_min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
2020-04-17 14:52:35,062 : INFO : deleting the raw counts dictionary of 6981 items
2020-04-17 14:52:35,063 : INFO : sample=0.001 downsamples 51 most-common words
2020-04-17 14:52:35,063 : INFO : downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
2020-04-17 14:52:35,067 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes
2020-04-17 14:52:35,067 : INFO : resetting layer weight

The ``workers`` parameter only has an effect if you have `Cython
<http://cython.org/>`_ installed. Without Cython, you’ll only be able to use
one core because of the `GIL
<https://wiki.python.org/moin/GlobalInterpreterLock>`_ (and ``word2vec``
training will be `miserably slow
<http://rare-technologies.com/word2vec-in-python-part-two-optimizing/>`_\ ).




Memory
------

At its core, ``word2vec`` model parameters are stored as matrices (NumPy
arrays). Each array holds **#vocabulary** (controlled by the ``min_count``
parameter) times **#size** (the ``size`` parameter) floats (single
precision, i.e. 4 bytes each).

Three such matrices are held in RAM (work is underway to reduce that number
to two, or even one). So if your input contains 100,000 unique words, and you
asked for layer ``size=200``\ , the model will require approx.
``100,000*200*4*3 bytes = ~229MB``.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
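
The arithmetic above is easy to reproduce for your own vocabulary and vector
sizes. This is a back-of-the-envelope sketch that only covers the three
matrices described in this section; the function name and its default
arguments are assumptions for illustration.

```python
def estimate_matrix_memory(vocab_size, vector_size, num_matrices=3, bytes_per_float=4):
    """Bytes needed for num_matrices arrays of vocab_size x vector_size floats."""
    return vocab_size * vector_size * bytes_per_float * num_matrices


total_bytes = estimate_matrix_memory(vocab_size=100_000, vector_size=200)
total_mib = total_bytes / 1024 ** 2
print('%d bytes = ~%.0f MB' % (total_bytes, total_mib))
```

Doubling either the vocabulary size or the vector size doubles the estimate,
which is why ``min_count`` and ``size`` are the two main levers on memory.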




Evaluating
----------

``Word2Vec`` training is an unsupervised task, so there’s no good way to
objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic
test examples, following the “A is to B as C is to D” task. It is provided in
the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?.
There are a total of 9 types of syntactic comparisons in the dataset, such as
plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as
capital cities (Paris:France;Tokyo:?) or family members
(brother:sister;dad:?).
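
One common way to answer an "A is to B as C is to ?" question with word
vectors is to compute vec(B) - vec(A) + vec(C) and pick the closest remaining
word by cosine similarity. The sketch below does this on hand-picked
2-dimensional toy vectors and reports the fraction of questions answered
correctly. The vectors, the ``solve_analogy`` helper and the two questions
are all assumptions for illustration; Gensim's own evaluation differs in its
details.

```python
import math

# Hypothetical 2-dimensional vectors, hand-picked for illustration only.
toy_vectors = {
    'king': (1.0, 1.0),
    'queen': (1.0, -1.0),
    'man': (0.1, 1.0),
    'woman': (0.1, -1.0),
    'apple': (-1.0, 0.0),
}


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = tuple(vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Each question is (A, B, C, expected D).
questions = [('man', 'woman', 'king', 'queen'), ('woman', 'man', 'queen', 'king')]
correct = sum(solve_analogy(a, b, c, toy_vectors) == d for a, b, c, d in questions)
accuracy = correct / len(questions)
print(accuracy)
```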




Gensim supports the same evaluation set, in exactly the same format:




In [36]:
model.accuracy('../data/raw/questions-words.txt')

  """Entry point for launching an IPython kernel.
2020-04-17 14:52:40,715 : INFO : precomputing L2-norms of word weight vectors
2020-04-17 14:52:40,719 : INFO : capital-common-countries: 0.0% (0/6)
2020-04-17 14:52:40,730 : INFO : capital-world: 0.0% (0/2)
2020-04-17 14:52:40,741 : INFO : family: 16.7% (1/6)
2020-04-17 14:52:40,752 : INFO : gram3-comparative: 0.0% (0/20)
2020-04-17 14:52:40,759 : INFO : gram4-superlative: 0.0% (0/12)
2020-04-17 14:52:40,768 : INFO : gram5-present-participle: 0.0% (0/20)
2020-04-17 14:52:40,779 : INFO : gram6-nationality-adjective: 0.0% (0/30)
2020-04-17 14:52:40,787 : INFO : gram7-past-tense: 0.0% (0/20)
2020-04-17 14:52:40,797 : INFO : gram8-plural: 0.0% (0/30)
2020-04-17 14:52:40,799 : INFO : total: 0.7% (1/146)


[{'section': 'capital-common-countries',
  'correct': [],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]},
 {'section': 'capital-world',
  'correct': [],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]},
 {'section': 'currency', 'correct': [], 'incorrect': []},
 {'section': 'city-in-state', 'correct': [], 'incorrect': []},
 {'section': 'family',
  'correct': [('HIS', 'HER', 'HE', 'SHE')],
  'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
   ('HE', 'SHE', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'MAN', 'WOMAN'),
   ('MAN', 'WOMAN', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HIS', 'HER')]},
 {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},
 {'section': 

The ``accuracy`` method takes an `optional parameter
<http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy>`_
``restrict_vocab``, which limits which test examples are considered.




In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.

By default it uses the academic dataset WS-353, but you can create a dataset
specific to your business based on it. The dataset contains word pairs
together with human-assigned similarity judgments, which measure the
relatedness or co-occurrence of two words. For example, 'coast' and 'shore'
are very similar as they appear in the same context. At the same time,
'clothes' and 'closet' are less similar because they are related but not
interchangeable.
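
The evaluation compares the model's similarity scores with the human
judgments and reports Pearson and Spearman correlation coefficients. The
sketch below computes both by hand on made-up numbers. The helper functions
and the scores are assumptions for illustration, and the rank helper assumes
there are no tied values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1..n (assumes no ties, which keeps the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up numbers: human judgments (0-10 scale) and model cosine similarities.
human_scores = [9.1, 7.3, 5.0, 1.2]
model_scores = [0.80, 0.45, 0.50, 0.10]

print('Pearson:  %.3f' % pearson(human_scores, model_scores))
print('Spearman: %.3f' % spearman(human_scores, model_scores))
```

Spearman only looks at the ordering of the pairs, so it is less sensitive
than Pearson to how the model's similarity values are scaled.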




In [37]:
# model.evaluate_word_pairs(datapath('../data/raw/wordsim353.tsv'))
model.evaluate_word_pairs('../data/raw/wordsim353.tsv')

  
2020-04-17 14:52:47,014 : INFO : Pearson correlation coefficient against ../data/raw/wordsim353.tsv: 0.1420
2020-04-17 14:52:47,015 : INFO : Spearman rank-order correlation coefficient against ../data/raw/wordsim353.tsv: 0.1612
2020-04-17 14:52:47,015 : INFO : Pairs with unknown words ratio: 83.0%


((0.14199559050123303, 0.2791410602553877),
 SpearmanrResult(correlation=0.16115048036198817, pvalue=0.2186723333380953),
 83.0028328611898)

.. Important::
  Good performance on Google's or WS-353 test set doesn’t mean word2vec will
  work well in your application, or vice versa. It’s always best to evaluate
  directly on your intended task. For an example of how to use word2vec in a
  classifier pipeline, see this `tutorial
  <https://github.com/RaRe-Technologies/movie-plots-by-genre>`_.




Online training / Resuming training
-----------------------------------

Advanced users can load a model and continue training it with more sentences
and `new vocabulary words <online_w2v_tutorial.ipynb>`_:




In [40]:
temporary_filepath = '../models/gensim-model-Lee Corpus.z'

# model = gensim.models.Word2Vec.load(temporary_filepath)

model = joblib.load(temporary_filepath)

more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']
]

model.build_vocab(more_sentences, update=True)

model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

# cleaning up temporary file
import os
os.remove(temporary_filepath)

2020-04-17 14:55:04,133 : INFO : collecting all words and their counts
2020-04-17 14:55:04,133 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:55:04,133 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences
2020-04-17 14:55:04,134 : INFO : Updating model with new vocabulary
2020-04-17 14:55:04,134 : INFO : New added 0 unique words (0% of original 13) and increased the count of 0 pre-existing words (0% of original 13)
2020-04-17 14:55:04,135 : INFO : deleting the raw counts dictionary of 13 items
2020-04-17 14:55:04,135 : INFO : sample=0.001 downsamples 0 most-common words
2020-04-17 14:55:04,136 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2020-04-17 14:55:04,138 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes
2020-04-17 14:55:04,138 : INFO : updating layer weights
  
2020-04-17 14:55:04,140 : INFO : training model with 3 workers on 1750 vocabulary and 100 fea

You may need to tweak the ``total_words`` parameter to ``train()``,
depending on what learning rate decay you want to simulate.
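
To see why this matters, it helps to picture the learning rate schedule. The
sketch below assumes a linear decay from a starting rate to a minimum rate
as training progresses, which is how Gensim documents its ``alpha`` and
``min_alpha`` parameters. The function name and the ``progress`` argument
are hypothetical, and the default values are assumed to match Gensim's
defaults.

```python
def linear_alpha(progress, start_alpha=0.025, end_alpha=0.0001):
    """Learning rate after a given fraction (0.0 to 1.0) of training is done."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return start_alpha - (start_alpha - end_alpha) * progress


# Learning rate at 0%, 25%, 50%, 75% and 100% of the expected training work.
schedule = [linear_alpha(step / 4) for step in range(5)]
print(schedule)
```

Progress is measured against the expected amount of training work, so
telling ``train()`` to expect more or less data than it actually receives
changes how quickly the learning rate falls.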

Note that it’s not possible to resume training with models generated by the C
tool and loaded with ``KeyedVectors.load_word2vec_format()``. You can still
use them for querying/similarity, but information vital for training (the
vocab tree) is missing there.




Training Loss Computation
-------------------------

The parameter ``compute_loss`` can be used to toggle computation of loss
while training the Word2Vec model. The computed loss is stored in the model
attribute ``running_training_loss`` and can be retrieved using the function
``get_latest_training_loss`` as follows:




In [41]:
# instantiating and training the Word2Vec model
model_with_loss = gensim.models.Word2Vec(
    sentences,
    min_count=1,
    compute_loss=True,
    hs=0,
    sg=1,
    seed=42
)

# getting the training loss value
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

2020-04-17 14:55:31,578 : INFO : collecting all words and their counts
2020-04-17 14:55:31,581 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:55:31,652 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2020-04-17 14:55:31,652 : INFO : Loading a fresh vocabulary
2020-04-17 14:55:31,661 : INFO : effective_min_count=1 retains 6981 unique words (100% of original 6981, drops 0)
2020-04-17 14:55:31,662 : INFO : effective_min_count=1 leaves 58152 word corpus (100% of original 58152, drops 0)
2020-04-17 14:55:31,676 : INFO : deleting the raw counts dictionary of 6981 items
2020-04-17 14:55:31,677 : INFO : sample=0.001 downsamples 43 most-common words
2020-04-17 14:55:31,677 : INFO : downsampling leaves estimated 45723 word corpus (78.6% of prior 58152)
2020-04-17 14:55:31,688 : INFO : estimated required memory for 6981 words and 100 dimensions: 9075300 bytes
2020-04-17 14:55:31,689 : INFO : resetting layer weights
20

1376148.875


Benchmarks
----------

Let's run some benchmarks to see the effect of the training loss computation
code on training time.

We'll use the following data for the benchmarks:

#. Lee Background corpus: included in gensim's test data
#. Text8 corpus.  To demonstrate the effect of corpus size, we'll look at the
   first 1MB, 10MB, 50MB of the corpus, as well as the entire thing.




In [42]:
import io
import os

import gensim.models.word2vec
import gensim.downloader as api
import smart_open


def head(path, size):
    with smart_open.open(path) as fin:
        return io.StringIO(fin.read(size))


def generate_input_data():
    lee_path = datapath('lee_background.cor')
    ls = gensim.models.word2vec.LineSentence(lee_path)
    ls.name = '25kB'
    yield ls

    text8_path = api.load('text8').fn
    labels = ('1MB', '10MB', '50MB', '100MB')
    sizes = (1024 ** 2, 10 * 1024 ** 2, 50 * 1024 ** 2, 100 * 1024 ** 2)
    for l, s in zip(labels, sizes):
        ls = gensim.models.word2vec.LineSentence(head(text8_path, s))
        ls.name = l
        yield ls


input_data = list(generate_input_data())



2020-04-17 14:56:24,037 : INFO : text8 downloaded


We now compare the training time taken for different combinations of input
data and model training parameters like ``hs`` and ``sg``.

For each combination, we repeat the test several times to obtain the mean and
standard deviation of the test duration.




In [44]:
# Temporarily reduce logging verbosity
logging.root.level = logging.ERROR

import time
import numpy as np
import pandas as pd

train_time_values = []
seed_val = 42
sg_values = [0, 1]
hs_values = [0, 1]

fast = True
if fast:
    input_data_subset = input_data[:3]
else:
    input_data_subset = input_data


for data in input_data_subset:
    for sg_val in sg_values:
        for hs_val in hs_values:
            for loss_flag in [True, False]:
                time_taken_list = []
                for i in range(3):
                    start_time = time.time()
                    w2v_model = gensim.models.Word2Vec(
                        data,
                        compute_loss=loss_flag,
                        sg=sg_val,
                        hs=hs_val,
                        seed=seed_val,
                    )
                    time_taken_list.append(time.time() - start_time)

                time_taken_list = np.array(time_taken_list)
                time_mean = np.mean(time_taken_list)
                time_std = np.std(time_taken_list)

                model_result = {
                    'train_data': data.name,
                    'compute_loss': loss_flag,
                    'sg': sg_val,
                    'hs': hs_val,
                    'train_time_mean': time_mean,
                    'train_time_std': time_std,
                }
                print("Word2vec model #%i: %s" % (len(train_time_values), model_result))
                train_time_values.append(model_result)

train_times_table = pd.DataFrame(train_time_values)
train_times_table = train_times_table.sort_values(
    by=['train_data', 'sg', 'hs', 'compute_loss'],
    ascending=[False, False, True, False],
)
print(train_times_table)

2020-04-17 14:58:43,589 : INFO : collecting all words and their counts
2020-04-17 14:58:43,591 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:58:43,607 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2020-04-17 14:58:43,608 : INFO : Loading a fresh vocabulary
2020-04-17 14:58:43,614 : INFO : effective_min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
2020-04-17 14:58:43,614 : INFO : effective_min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
2020-04-17 14:58:43,619 : INFO : deleting the raw counts dictionary of 10781 items
2020-04-17 14:58:43,619 : INFO : sample=0.001 downsamples 45 most-common words
2020-04-17 14:58:43,620 : INFO : downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
2020-04-17 14:58:43,623 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes

2020-04-17 14:58:44,909 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:44,912 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:44,914 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:44,915 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.0s, 1288032 effective words/s
2020-04-17 14:58:44,934 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:44,936 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:44,939 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:44,939 : INFO : EPOCH - 3 : training on 59890 raw words (32603 effective words) took 0.0s, 1422867 effective words/s
2020-04-17 14:58:44,963 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:44,964 : INFO : worker thread finished; awaiting finish of 1 more threads

Word2vec model #0: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.46947534879048664, 'train_time_std': 0.016242761213383295}


2020-04-17 14:58:45,304 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-04-17 14:58:45,323 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:45,327 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:45,328 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:45,329 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.0s, 1388438 effective words/s
2020-04-17 14:58:45,354 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:45,355 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:45,359 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:45,360 : INFO : EPOCH - 2 : training on 59890 raw words (32652 effective words) took 0.0s, 1109503 effective words/s

2020-04-17 14:58:46,296 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.0s, 1387148 effective words/s
2020-04-17 14:58:46,318 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:46,319 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:46,321 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:46,322 : INFO : EPOCH - 5 : training on 59890 raw words (32592 effective words) took 0.0s, 1381096 effective words/s
2020-04-17 14:58:46,322 : INFO : training on a 299450 raw words (162877 effective words) took 0.1s, 1265861 effective words/s
2020-04-17 14:58:46,324 : INFO : collecting all words and their counts
2020-04-17 14:58:46,325 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:58:46,340 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences

Word2vec model #1: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.4421900113423665, 'train_time_std': 0.0035101271553160955}


2020-04-17 14:58:46,680 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:58:46,714 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:46,721 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:46,726 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:46,727 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.0s, 713141 effective words/s
2020-04-17 14:58:46,761 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:46,764 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:46,768 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:46,769 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.0s, 803783 effective words/s

2020-04-17 14:58:47,959 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:47,961 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:47,966 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:47,966 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.0s, 809798 effective words/s
2020-04-17 14:58:48,001 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:48,005 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:48,011 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:48,012 : INFO : EPOCH - 5 : training on 59890 raw words (32592 effective words) took 0.0s, 738930 effective words/s
2020-04-17 14:58:48,012 : INFO : training on a 299450 raw words (162877 effective words) took 0.2s, 754712 effective words/s

Word2vec model #2: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.563281774520874, 'train_time_std': 0.004205707141329885}


2020-04-17 14:58:48,356 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:58:48,390 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:48,392 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:48,397 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:48,398 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.0s, 800745 effective words/s
2020-04-17 14:58:48,432 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:48,441 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:48,443 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:48,443 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.0s, 748253 effective words/s

2020-04-17 14:58:49,615 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:49,618 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:49,622 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:49,623 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.0s, 825782 effective words/s
2020-04-17 14:58:49,656 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:49,658 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:49,662 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:49,663 : INFO : EPOCH - 5 : training on 59890 raw words (32592 effective words) took 0.0s, 845455 effective words/s
2020-04-17 14:58:49,663 : INFO : training on a 299450 raw words (162877 effective words) took 0.2s, 783579 effective words/s

Word2vec model #3: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.5502861340840658, 'train_time_std': 0.0011700468198865938}


2020-04-17 14:58:49,976 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-04-17 14:58:50,037 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:50,038 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:50,041 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:50,042 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.1s, 510268 effective words/s
2020-04-17 14:58:50,102 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:50,103 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:50,106 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:50,106 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.1s, 512501 effective words/s

2020-04-17 14:58:51,518 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.1s, 479117 effective words/s
2020-04-17 14:58:51,581 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:51,584 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:51,588 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:51,589 : INFO : EPOCH - 5 : training on 59890 raw words (32592 effective words) took 0.1s, 471630 effective words/s
2020-04-17 14:58:51,589 : INFO : training on a 299450 raw words (162877 effective words) took 0.3s, 486049 effective words/s
2020-04-17 14:58:51,591 : INFO : collecting all words and their counts
2020-04-17 14:58:51,592 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:58:51,607 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences

Word2vec model #4: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 0.6418556372324625, 'train_time_std': 0.0035530302350750237}


2020-04-17 14:58:51,906 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-04-17 14:58:51,967 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:51,971 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:51,974 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:51,975 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.1s, 489852 effective words/s
2020-04-17 14:58:52,036 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:52,036 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:52,039 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:52,040 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.1s, 511585 effective words/s

2020-04-17 14:58:53,626 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.1s, 403632 effective words/s
2020-04-17 14:58:53,695 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:53,700 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:53,706 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:53,707 : INFO : EPOCH - 5 : training on 59890 raw words (32592 effective words) took 0.1s, 415942 effective words/s
2020-04-17 14:58:53,708 : INFO : training on a 299450 raw words (162877 effective words) took 0.4s, 424655 effective words/s
2020-04-17 14:58:53,710 : INFO : collecting all words and their counts
2020-04-17 14:58:53,711 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 14:58:53,732 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences

Word2vec model #5: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 0.7063467502593994, 'train_time_std': 0.04149649974900909}


2020-04-17 14:58:54,205 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:58:54,385 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:54,399 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:54,400 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:54,400 : INFO : EPOCH - 1 : training on 59890 raw words (32528 effective words) took 0.2s, 169635 effective words/s
2020-04-17 14:58:54,537 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:54,538 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:54,545 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:54,545 : INFO : EPOCH - 2 : training on 59890 raw words (32557 effective words) took 0.1s, 228937 effective words/s

2020-04-17 14:58:56,828 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:56,829 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:56,834 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:56,834 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.1s, 264006 effective words/s
2020-04-17 14:58:56,956 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:56,958 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:56,963 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:56,963 : INFO : EPOCH - 5 : training on 59890 raw words (32592 effective words) took 0.1s, 255736 effective words/s
2020-04-17 14:58:56,963 : INFO : training on a 299450 raw words (162877 effective words) took 0.7s, 249890 effective words/s

Word2vec model #6: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 1.0849557717641194, 'train_time_std': 0.1178281414835077}


2020-04-17 14:58:57,308 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:58:57,426 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:57,427 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:57,430 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:57,430 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.1s, 269547 effective words/s
2020-04-17 14:58:57,547 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:57,548 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:57,555 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:57,556 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.1s, 262740 effective words/s

2020-04-17 14:58:59,830 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:59,830 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:59,837 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:59,838 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.1s, 259401 effective words/s
2020-04-17 14:58:59,957 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:58:59,958 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:58:59,961 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:58:59,961 : INFO : EPOCH - 5 : training on 59890 raw words (32592 effective words) took 0.1s, 267096 effective words/s
2020-04-17 14:58:59,962 : INFO : training on a 299450 raw words (162877 effective words) took 0.7s, 249801 effective words/s

Word2vec model #7: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 0.9994269212086996, 'train_time_std': 0.022374977285268764}


2020-04-17 14:59:00,703 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-04-17 14:59:00,777 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:00,781 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:00,782 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:00,782 : INFO : EPOCH - 1 : training on 175599 raw words (110284 effective words) took 0.1s, 1621173 effective words/s
2020-04-17 14:59:00,862 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:00,866 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:00,871 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:00,872 : INFO : EPOCH - 2 : training on 175599 raw words (110156 effective words) took 0.1s, 1426117 effective words/s

2020-04-17 14:59:03,329 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:03,330 : INFO : EPOCH - 4 : training on 175599 raw words (110190 effective words) took 0.1s, 1566767 effective words/s
2020-04-17 14:59:03,407 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:03,411 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:03,412 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:03,412 : INFO : EPOCH - 5 : training on 175599 raw words (110080 effective words) took 0.1s, 1576637 effective words/s
2020-04-17 14:59:03,412 : INFO : training on a 877995 raw words (550903 effective words) took 0.4s, 1364860 effective words/s
2020-04-17 14:59:03,414 : INFO : collecting all words and their counts
2020-04-17 14:59:03,426 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

Word2vec model #8: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 1.1500287055969238, 'train_time_std': 0.027173434297282157}


2020-04-17 14:59:04,135 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-04-17 14:59:04,206 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:04,208 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:04,209 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:04,210 : INFO : EPOCH - 1 : training on 175599 raw words (109994 effective words) took 0.1s, 1753864 effective words/s
2020-04-17 14:59:04,291 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:04,295 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:04,296 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:04,297 : INFO : EPOCH - 2 : training on 175599 raw words (110178 effective words) took 0.1s, 1462118 effective words/s

2020-04-17 14:59:06,656 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:06,656 : INFO : EPOCH - 4 : training on 175599 raw words (110428 effective words) took 0.1s, 1765462 effective words/s
2020-04-17 14:59:06,726 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:06,728 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:06,729 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:06,730 : INFO : EPOCH - 5 : training on 175599 raw words (110382 effective words) took 0.1s, 1794707 effective words/s
2020-04-17 14:59:06,730 : INFO : training on a 877995 raw words (551512 effective words) took 0.4s, 1433128 effective words/s
2020-04-17 14:59:06,732 : INFO : collecting all words and their counts
2020-04-17 14:59:06,742 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

Word2vec model #9: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 1.1058674653371174, 'train_time_std': 0.013163796722582029}


2020-04-17 14:59:07,520 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:59:07,648 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:07,659 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:07,660 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:07,661 : INFO : EPOCH - 1 : training on 175599 raw words (110202 effective words) took 0.1s, 857523 effective words/s
2020-04-17 14:59:07,784 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:07,793 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:07,796 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:07,797 : INFO : EPOCH - 2 : training on 175599 raw words (110115 effective words) took 0.1s, 891692 effective words/s

2020-04-17 14:59:10,917 : INFO : EPOCH - 3 : training on 175599 raw words (110352 effective words) took 0.1s, 886643 effective words/s
2020-04-17 14:59:11,044 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:11,052 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:11,057 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:11,058 : INFO : EPOCH - 4 : training on 175599 raw words (110233 effective words) took 0.1s, 863029 effective words/s
2020-04-17 14:59:11,192 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:11,194 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:11,200 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:11,200 : INFO : EPOCH - 5 : training on 175599 raw words (110132 effective words) took 0.1s, 851833 effective words/s

Word2vec model #10: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 1.4909920692443848, 'train_time_std': 0.024463411032697953}


2020-04-17 14:59:12,003 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:59:12,127 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:12,131 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:12,134 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:12,134 : INFO : EPOCH - 1 : training on 175599 raw words (109994 effective words) took 0.1s, 919642 effective words/s
2020-04-17 14:59:12,258 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:12,265 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:12,267 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:12,267 : INFO : EPOCH - 2 : training on 175599 raw words (110182 effective words) took 0.1s, 907933 effective words/s

2020-04-17 14:59:15,347 : INFO : EPOCH - 3 : training on 175599 raw words (109913 effective words) took 0.1s, 884991 effective words/s
2020-04-17 14:59:15,476 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:15,478 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:15,481 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:15,481 : INFO : EPOCH - 4 : training on 175599 raw words (110284 effective words) took 0.1s, 905940 effective words/s
2020-04-17 14:59:15,608 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:15,611 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:15,613 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:15,613 : INFO : EPOCH - 5 : training on 175599 raw words (110445 effective words) took 0.1s, 915664 effective words/s

Word2vec model #11: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 1.4708325862884521, 'train_time_std': 0.013612271737445957}


2020-04-17 14:59:16,332 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-04-17 14:59:16,547 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:16,553 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:16,559 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:16,559 : INFO : EPOCH - 1 : training on 175599 raw words (109994 effective words) took 0.2s, 510674 effective words/s
2020-04-17 14:59:16,771 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:16,780 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:16,784 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:16,784 : INFO : EPOCH - 2 : training on 175599 raw words (110177 effective words) took 0.2s, 520576 effective words/s

2020-04-17 14:59:20,867 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:20,868 : INFO : EPOCH - 4 : training on 175599 raw words (110284 effective words) took 0.2s, 530668 effective words/s
2020-04-17 14:59:21,079 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:21,080 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:21,090 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:21,091 : INFO : EPOCH - 5 : training on 175599 raw words (110128 effective words) took 0.2s, 520804 effective words/s
2020-04-17 14:59:21,091 : INFO : training on a 877995 raw words (550736 effective words) took 1.1s, 496405 effective words/s
2020-04-17 14:59:21,093 : INFO : collecting all words and their counts
2020-04-17 14:59:21,105 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

Word2vec model #12: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 1.8248923619588215, 'train_time_std': 0.013058244416970654}


2020-04-17 14:59:21,812 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-04-17 14:59:22,019 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:22,023 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:22,027 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:22,028 : INFO : EPOCH - 1 : training on 175599 raw words (109994 effective words) took 0.2s, 536314 effective words/s
2020-04-17 14:59:22,234 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:22,240 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:22,245 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:22,245 : INFO : EPOCH - 2 : training on 175599 raw words (110105 effective words) took 0.2s, 535263 effective words/s

2020-04-17 14:59:26,292 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:26,292 : INFO : EPOCH - 4 : training on 175599 raw words (110184 effective words) took 0.2s, 532419 effective words/s
2020-04-17 14:59:26,500 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:26,504 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:26,512 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:26,513 : INFO : EPOCH - 5 : training on 175599 raw words (110174 effective words) took 0.2s, 526822 effective words/s
2020-04-17 14:59:26,513 : INFO : training on a 877995 raw words (550418 effective words) took 1.1s, 500913 effective words/s
2020-04-17 14:59:26,515 : INFO : collecting all words and their counts
2020-04-17 14:59:26,526 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

Word2vec model #13: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 1.8072483539581299, 'train_time_std': 0.0038839340637318305}


2020-04-17 14:59:27,302 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:59:27,719 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:27,742 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:27,747 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:27,747 : INFO : EPOCH - 1 : training on 175599 raw words (110227 effective words) took 0.4s, 253871 effective words/s
2020-04-17 14:59:28,178 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:28,202 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:28,203 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:28,204 : INFO : EPOCH - 2 : training on 175599 raw words (110383 effective words) took 0.4s, 250170 effective words/s

2020-04-17 14:59:34,771 : INFO : EPOCH - 3 : training on 175599 raw words (110272 effective words) took 0.4s, 247661 effective words/s
2020-04-17 14:59:35,200 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:35,220 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:35,223 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:35,224 : INFO : EPOCH - 4 : training on 175599 raw words (110180 effective words) took 0.4s, 250027 effective words/s
2020-04-17 14:59:35,647 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:35,665 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:35,672 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:35,672 : INFO : EPOCH - 5 : training on 175599 raw words (110246 effective words) took 0.4s, 254488 effective words/s
2020-04-17 14:59:35,672 : INFO : training on a 87

Word2vec model #14: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 3.0536507765452066, 'train_time_std': 0.005384403896102122}


2020-04-17 14:59:36,467 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-04-17 14:59:36,890 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:36,898 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:36,913 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:36,914 : INFO : EPOCH - 1 : training on 175599 raw words (110284 effective words) took 0.4s, 253568 effective words/s
2020-04-17 14:59:37,325 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:37,344 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:37,353 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:37,354 : INFO : EPOCH - 2 : training on 175599 raw words (110008 effective words) took 0.4s, 257318 effective words/s
2020-04-17 14:59:37,766 : INFO : wo

2020-04-17 14:59:43,761 : INFO : EPOCH - 3 : training on 175599 raw words (110198 effective words) took 0.4s, 257415 effective words/s
2020-04-17 14:59:44,178 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:44,193 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:44,200 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:44,200 : INFO : EPOCH - 4 : training on 175599 raw words (110412 effective words) took 0.4s, 259036 effective words/s
2020-04-17 14:59:44,617 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 14:59:44,635 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 14:59:44,638 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 14:59:44,639 : INFO : EPOCH - 5 : training on 175599 raw words (110397 effective words) took 0.4s, 258997 effective words/s
2020-04-17 14:59:44,639 : INFO : training on a 87

Word2vec model #15: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 2.9887211322784424, 'train_time_std': 0.008065168496341799}


2020-04-17 14:59:45,111 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 14:59:45,111 : INFO : Loading a fresh vocabulary
2020-04-17 14:59:45,152 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 14:59:45,152 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 14:59:45,197 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 14:59:45,202 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 14:59:45,202 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 14:59:45,251 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-04-17 14:59:45,252 : INFO : resetting layer weights
2020-04-17 14:59:48,388 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5


2020-04-17 15:00:10,487 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-04-17 15:00:11,475 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:00:11,479 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:00:11,482 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:00:11,482 : INFO : EPOCH - 1 : training on 1788017 raw words (1242352 effective words) took 1.0s, 1288968 effective words/s
2020-04-17 15:00:12,463 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:00:12,468 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:00:12,472 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:00:12,473 : INFO : EPOCH - 2 : training on 1788017 raw words (1241918 effective words) took 0.9s, 1396194 effective words/s
2020-04-17 15:00:13,540 : IN

Word2vec model #16: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 10.25504732131958, 'train_time_std': 1.2357424930850787}


2020-04-17 15:00:15,873 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:00:15,874 : INFO : Loading a fresh vocabulary
2020-04-17 15:00:15,915 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:00:15,916 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:00:15,959 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:00:15,962 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:00:15,963 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:00:16,002 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-04-17 15:00:16,003 : INFO : resetting layer weights
2020-04-17 15:00:19,220 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5


2020-04-17 15:00:38,709 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:00:38,710 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:00:38,710 : INFO : EPOCH - 1 : training on 1788017 raw words (1242948 effective words) took 0.8s, 1564843 effective words/s
2020-04-17 15:00:39,592 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:00:39,599 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:00:39,600 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:00:39,601 : INFO : EPOCH - 2 : training on 1788017 raw words (1241864 effective words) took 0.8s, 1572117 effective words/s
2020-04-17 15:00:40,624 : INFO : EPOCH 3 - PROGRESS: at 93.30% examples, 1153899 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:00:40,736 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:00:40,749 : INFO : worker thread finished; awaitin

Word2vec model #17: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 9.338944673538208, 'train_time_std': 0.6340272309613076}


2020-04-17 15:00:43,913 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:00:43,914 : INFO : Loading a fresh vocabulary
2020-04-17 15:00:43,970 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:00:43,971 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:00:44,028 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:00:44,032 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:00:44,032 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:00:44,054 : INFO : constructing a huffman tree from 20167 words
2020-04-17 15:00:44,584 : INFO : built huffman tree with maximum node depth 18
2020-04-17 15:00:44,629 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-04-17 15:00:44,630 : INFO : resetting layer w

2020-04-17 15:01:10,515 : INFO : Loading a fresh vocabulary
2020-04-17 15:01:10,569 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:01:10,570 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:01:10,619 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:01:10,625 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:01:10,627 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:01:10,648 : INFO : constructing a huffman tree from 20167 words
2020-04-17 15:01:11,136 : INFO : built huffman tree with maximum node depth 18
2020-04-17 15:01:11,177 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-04-17 15:01:11,178 : INFO : resetting layer weights
2020-04-17 15:01:14,703 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using

Word2vec model #18: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 13.668074687321981, 'train_time_std': 0.5688892683694853}


2020-04-17 15:01:24,821 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:01:24,821 : INFO : Loading a fresh vocabulary
2020-04-17 15:01:24,862 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:01:24,862 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:01:24,905 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:01:24,908 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:01:24,909 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:01:24,924 : INFO : constructing a huffman tree from 20167 words
2020-04-17 15:01:25,318 : INFO : built huffman tree with maximum node depth 18
2020-04-17 15:01:25,344 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-04-17 15:01:25,345 : INFO : resetting layer w

2020-04-17 15:01:51,612 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:01:51,612 : INFO : Loading a fresh vocabulary
2020-04-17 15:01:51,662 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:01:51,663 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:01:51,710 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:01:51,713 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:01:51,714 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:01:51,735 : INFO : constructing a huffman tree from 20167 words
2020-04-17 15:01:52,194 : INFO : built huffman tree with maximum node depth 18
2020-04-17 15:01:52,227 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-04-17 15:01:52,227 : INFO : resetting layer w

Word2vec model #19: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 13.289937257766724, 'train_time_std': 0.5378709545717524}


2020-04-17 15:02:04,776 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:02:04,777 : INFO : Loading a fresh vocabulary
2020-04-17 15:02:04,826 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:02:04,827 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:02:04,874 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:02:04,878 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:02:04,879 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:02:04,926 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-04-17 15:02:04,927 : INFO : resetting layer weights
2020-04-17 15:02:08,160 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5


2020-04-17 15:02:40,591 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:02:40,622 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:02:40,630 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:02:40,630 : INFO : EPOCH - 4 : training on 1788017 raw words (1242407 effective words) took 2.7s, 463784 effective words/s
2020-04-17 15:02:41,740 : INFO : EPOCH 5 - PROGRESS: at 36.87% examples, 461560 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:02:42,742 : INFO : EPOCH 5 - PROGRESS: at 74.30% examples, 460543 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:02:43,392 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:02:43,442 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:02:43,448 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:02:43,448 : INFO : EPOCH - 5 : training on 1788017 raw words (1243078 effecti

Word2vec model #20: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 19.60117761294047, 'train_time_std': 0.26749512954468235}


2020-04-17 15:03:03,541 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:03:03,542 : INFO : Loading a fresh vocabulary
2020-04-17 15:03:03,584 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:03:03,585 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:03:03,630 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:03:03,634 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:03:03,634 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:03:03,682 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-04-17 15:03:03,683 : INFO : resetting layer weights
2020-04-17 15:03:06,930 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5


2020-04-17 15:03:39,526 : INFO : EPOCH 5 - PROGRESS: at 77.09% examples, 477482 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:03:40,156 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:03:40,192 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:03:40,201 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:03:40,202 : INFO : EPOCH - 5 : training on 1788017 raw words (1242507 effective words) took 2.7s, 462411 effective words/s
2020-04-17 15:03:40,202 : INFO : training on a 8940085 raw words (6212743 effective words) took 13.9s, 448534 effective words/s
2020-04-17 15:03:40,213 : INFO : collecting all words and their counts
2020-04-17 15:03:40,328 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-17 15:03:40,696 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:03:40,696 : INFO : Loading a fresh vocabulary
2020-0

Word2vec model #21: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 18.330134630203247, 'train_time_std': 0.7170737128106401}


2020-04-17 15:03:58,527 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:03:58,528 : INFO : Loading a fresh vocabulary
2020-04-17 15:03:58,570 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:03:58,570 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:03:58,614 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:03:58,617 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:03:58,618 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:03:58,633 : INFO : constructing a huffman tree from 20167 words
2020-04-17 15:03:59,043 : INFO : built huffman tree with maximum node depth 18
2020-04-17 15:03:59,071 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-04-17 15:03:59,072 : INFO : resetting layer w

2020-04-17 15:04:42,263 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:04:42,312 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:04:42,331 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:04:42,332 : INFO : EPOCH - 1 : training on 1788017 raw words (1242948 effective words) took 5.6s, 220313 effective words/s
2020-04-17 15:04:43,380 : INFO : EPOCH 2 - PROGRESS: at 15.64% examples, 192195 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:04:44,408 : INFO : EPOCH 2 - PROGRESS: at 34.64% examples, 211720 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:04:45,436 : INFO : EPOCH 2 - PROGRESS: at 53.63% examples, 217081 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:04:46,447 : INFO : EPOCH 2 - PROGRESS: at 72.07% examples, 218454 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:04:47,472 : INFO : EPOCH 2 - PROGRESS: at 90.50% examples, 219731 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:04:47,916 :

2020-04-17 15:05:30,118 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:05:30,178 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:05:30,196 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:05:30,196 : INFO : EPOCH - 3 : training on 1788017 raw words (1241916 effective words) took 5.8s, 215587 effective words/s
2020-04-17 15:05:31,325 : INFO : EPOCH 4 - PROGRESS: at 17.32% examples, 214495 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:05:32,330 : INFO : EPOCH 4 - PROGRESS: at 35.75% examples, 221842 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:05:33,360 : INFO : EPOCH 4 - PROGRESS: at 54.75% examples, 223537 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:05:34,393 : INFO : EPOCH 4 - PROGRESS: at 73.74% examples, 224047 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:05:35,396 : INFO : EPOCH 4 - PROGRESS: at 92.18% examples, 225031 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:05:35,758 :

Word2vec model #22: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 34.45679942766825, 'train_time_std': 1.1719151413204352}


2020-04-17 15:05:41,874 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-04-17 15:05:41,874 : INFO : Loading a fresh vocabulary
2020-04-17 15:05:41,915 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-04-17 15:05:41,916 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-04-17 15:05:41,959 : INFO : deleting the raw counts dictionary of 73167 items
2020-04-17 15:05:41,963 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-17 15:05:41,964 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-04-17 15:05:41,980 : INFO : constructing a huffman tree from 20167 words
2020-04-17 15:05:44,951 : INFO : built huffman tree with maximum node depth 18
2020-04-17 15:05:44,983 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-04-17 15:05:44,984 : INFO : resetting layer w

2020-04-17 15:06:25,509 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:06:25,509 : INFO : EPOCH - 1 : training on 1788017 raw words (1241447 effective words) took 5.4s, 228204 effective words/s
2020-04-17 15:06:26,641 : INFO : EPOCH 2 - PROGRESS: at 17.32% examples, 214132 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:06:27,643 : INFO : EPOCH 2 - PROGRESS: at 36.31% examples, 225250 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:06:28,707 : INFO : EPOCH 2 - PROGRESS: at 55.87% examples, 225452 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:06:29,755 : INFO : EPOCH 2 - PROGRESS: at 74.30% examples, 223118 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:06:30,755 : INFO : EPOCH 2 - PROGRESS: at 93.30% examples, 225659 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:06:31,055 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:06:31,121 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:06:31,131 :

2020-04-17 15:07:09,488 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-17 15:07:09,527 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-17 15:07:09,540 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-17 15:07:09,541 : INFO : EPOCH - 3 : training on 1788017 raw words (1241166 effective words) took 5.4s, 228895 effective words/s
2020-04-17 15:07:10,667 : INFO : EPOCH 4 - PROGRESS: at 17.88% examples, 221939 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:07:11,674 : INFO : EPOCH 4 - PROGRESS: at 36.31% examples, 225039 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:07:12,682 : INFO : EPOCH 4 - PROGRESS: at 54.75% examples, 225103 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:07:13,692 : INFO : EPOCH 4 - PROGRESS: at 73.74% examples, 226485 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:07:14,701 : INFO : EPOCH 4 - PROGRESS: at 92.74% examples, 228098 words/s, in_qsize 5, out_qsize 0
2020-04-17 15:07:15,026 :

Word2vec model #23: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 33.0646603902181, 'train_time_std': 1.0580181685249426}
   train_data  compute_loss  sg  hs  train_time_mean  train_time_std
4        25kB          True   1   0         0.641856        0.003553
5        25kB         False   1   0         0.706347        0.041496
6        25kB          True   1   1         1.084956        0.117828
7        25kB         False   1   1         0.999427        0.022375
0        25kB          True   0   0         0.469475        0.016243
1        25kB         False   0   0         0.442190        0.003510
2        25kB          True   0   1         0.563282        0.004206
3        25kB         False   0   1         0.550286        0.001170
12        1MB          True   1   0         1.824892        0.013058
13        1MB         False   1   0         1.807248        0.003884
14        1MB          True   1   1         3.053651        0.005384
15        1MB

Adding Word2Vec "model to dict" method to production pipeline
-------------------------------------------------------------

Suppose we want even better query performance in production.

One good way is to cache the most similar words for each query word in a dictionary.
The next time the same query word arrives, we look it up in the dictionary first.

On a hit, we return the result directly from the dictionary.
Otherwise we query the model and cache the result, so that the same word does not miss next time.




In [45]:
# re-enable logging
logging.root.level = logging.INFO

most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word}
for i, (key, value) in enumerate(most_similars_precalc.items()):
    if i == 3:
        break
    print(key, value)

2020-04-17 19:32:50,580 : INFO : precomputing L2-norms of word weight vectors


the [('first', 0.9999039173126221), ('an', 0.9999021291732788), ('against', 0.9998972415924072), ('its', 0.9998953342437744), ('his', 0.9998934268951416), ('two', 0.9998933672904968), ('one', 0.999891996383667), ('and', 0.9998893141746521), ('after', 0.999887228012085), ('of', 0.9998849034309387)]
to [('or', 0.9999405145645142), ('will', 0.9999387264251709), ('are', 0.9999384880065918), ('if', 0.999937891960144), ('out', 0.9999355673789978), ('has', 0.9999327659606934), ('about', 0.9999327659606934), ('from', 0.9999326467514038), ('and', 0.9999295473098755), ('with', 0.9999294281005859)]
of [('first', 0.9999454617500305), ('by', 0.9999425411224365), ('on', 0.9999414682388306), ('after', 0.999940812587738), ('in', 0.9999327659606934), ('three', 0.9999324083328247), ('with', 0.9999293088912964), ('which', 0.9999275207519531), ('over', 0.999925971031189), ('at', 0.9999222755432129)]


Comparison with and without caching
-----------------------------------

For the time being, let's take four words at random.




In [46]:
import time
words = ['voted', 'few', 'their', 'around']

Without caching




In [47]:
start = time.time()
for word in words:
    result = model.wv.most_similar(word)
    print(result)
end = time.time()
print(end - start)

[('team', 0.9988727569580078), ('than', 0.9988700747489929), ('died', 0.9988337159156799), ('seekers', 0.998832106590271), ('australia', 0.9988315105438232), ('power', 0.998828649520874), ('killed', 0.9988218545913696), ('saying', 0.9988198280334473), ('hospital', 0.9988188147544861), ('meeting', 0.9988039135932922)]
[('their', 0.9997981786727905), ('also', 0.9997972249984741), ('other', 0.9997957348823547), ('at', 0.9997896552085876), ('an', 0.9997894763946533), ('as', 0.9997888803482056), ('about', 0.9997838139533997), ('to', 0.9997806549072266), ('under', 0.9997791051864624), ('one', 0.9997783899307251)]
[('and', 0.9999499320983887), ('about', 0.999947190284729), ('have', 0.9999425411224365), ('also', 0.9999412894248962), ('with', 0.9999408721923828), ('us', 0.9999403357505798), ('its', 0.9999401569366455), ('are', 0.9999364018440247), ('out', 0.9999362230300903), ('at', 0.9999358654022217)]
[('as', 0.9999256134033203), ('today', 0.9999256134033203), ('over', 0.9999226927757263), ('

Now with caching




In [48]:
start = time.time()
for word in words:
    if word in most_similars_precalc:
        # cache hit: reuse the precomputed neighbours
        result = most_similars_precalc[word]
    else:
        # cache miss: query the model, then store the result for next time
        result = model.wv.most_similar(word)
        most_similars_precalc[word] = result
    print(result)

end = time.time()
print(end - start)

[('team', 0.9988727569580078), ('than', 0.9988700747489929), ('died', 0.9988337159156799), ('seekers', 0.998832106590271), ('australia', 0.9988315105438232), ('power', 0.998828649520874), ('killed', 0.9988218545913696), ('saying', 0.9988198280334473), ('hospital', 0.9988188147544861), ('meeting', 0.9988039135932922)]
[('their', 0.9997981786727905), ('also', 0.9997972249984741), ('other', 0.9997957348823547), ('at', 0.9997896552085876), ('an', 0.9997894763946533), ('as', 0.9997888803482056), ('about', 0.9997838139533997), ('to', 0.9997806549072266), ('under', 0.9997791051864624), ('one', 0.9997783899307251)]
[('and', 0.9999499320983887), ('about', 0.999947190284729), ('have', 0.9999425411224365), ('also', 0.9999412894248962), ('with', 0.9999408721923828), ('us', 0.9999403357505798), ('its', 0.9999401569366455), ('are', 0.9999364018440247), ('out', 0.9999362230300903), ('at', 0.9999358654022217)]
[('as', 0.9999256134033203), ('today', 0.9999256134033203), ('over', 0.9999226927757263), ('

The cached version is clearly faster, and the gap widens as more words
are queried.
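
To isolate the lookup cost from the cost of printing, the comparison can also be timed without any output inside the loop. The sketch below is a self-contained illustration under stated assumptions: ``slow_query`` is a hypothetical stand-in that sleeps for 10 ms instead of querying a real model, and ``timed`` is a helper name introduced here. It uses ``time.perf_counter``, which is better suited to short intervals than ``time.time``.

```python
import time


def slow_query(word):
    """Hypothetical stand-in for a model query; sleeps to simulate work."""
    time.sleep(0.01)
    return [(word + '_neighbour', 1.0)]


words = ['voted', 'few', 'their', 'around']
cache = {word: slow_query(word) for word in words}  # precompute once


def timed(lookup):
    """Return the wall-clock seconds taken to look up every word once."""
    start = time.perf_counter()
    for word in words:
        lookup(word)
    return time.perf_counter() - start


uncached_seconds = timed(slow_query)
cached_seconds = timed(cache.__getitem__)
print(f'uncached: {uncached_seconds:.4f}s, cached: {cached_seconds:.6f}s')
```

The absolute numbers depend on the machine and on how slow the real query is; only the relative gap is meaningful here.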




Visualising the Word Embeddings
-------------------------------

The word embeddings learned by the model can be visualised by reducing
the word vectors to two dimensions with t-SNE.

Such visualisations can reveal semantic and syntactic trends in the data.

For example:

* Semantic: words like cat, dog, cow, etc. tend to lie close together.
* Syntactic: words like run, running or cut, cutting lie close together.

Vector relations such as vKing - vMan =~ vQueen - vWoman may also be visible.
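
In a 2-D plot, this relation means that the arrow from "man" to "king" is roughly parallel to, and about as long as, the arrow from "woman" to "queen". A minimal sketch with hand-made toy vectors shows the idea; the values and the ``subtract`` helper are invented for illustration and are not produced by the model:

```python
# Hand-made 2-D toy vectors (not learned by any model): axis 0 encodes
# "royalty", axis 1 encodes "gender".
toy_vectors = {
    'king': (1.0, 1.0),
    'queen': (1.0, -1.0),
    'man': (0.0, 1.0),
    'woman': (0.0, -1.0),
}


def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return tuple(x - y for x, y in zip(a, b))


king_minus_man = subtract(toy_vectors['king'], toy_vectors['man'])
queen_minus_woman = subtract(toy_vectors['queen'], toy_vectors['woman'])
print(king_minus_man, queen_minus_woman)
```

Both differences point along the "royalty" axis only. With real embeddings the match is approximate rather than exact.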

.. Important::
  The model used for the visualisation is trained on a small corpus. Thus
  some of the relations might not be so clear.




In [50]:
from sklearn.manifold import TSNE                   # dimensionality reduction
import numpy as np                                  # array handling


def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = [] # positions in vector space
    labels = [] # keep track of words to label our data again later
    for word in model.wv.vocab:
        vectors.append(model.wv[word])
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce to `num_dimensions` using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)

def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):
    from plotly.offline import init_notebook_mode, iplot, plot
    import plotly.graph_objs as go

    trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)
    data = [trace]

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    #
    # Label randomly subsampled 25 data points
    #
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, 25)
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

try:
    get_ipython()
except Exception:
    plot_function = plot_with_matplotlib
else:
    plot_function = plot_with_plotly

plot_function(x_vals, y_vals, labels)

Conclusion
----------

In this tutorial we learned how to train word2vec models on your own data
and how to evaluate them. We hope that you too will find this popular tool
useful in your machine learning tasks!

Links
-----

- API docs: :py:mod:`gensim.models.word2vec`
- `Original C toolkit and word2vec papers by Google <https://code.google.com/archive/p/word2vec/>`_.


