In [1]:
%matplotlib inline


Word2Vec Model
==============

Introduces Gensim's Word2Vec model and demonstrates its use on the Lee Corpus.




In [24]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In case you missed the buzz, word2vec is widely featured as a member of the
"new wave" of machine learning algorithms based on neural networks, commonly
referred to as "deep learning" (though word2vec itself is rather shallow).
Using large amounts of unannotated plain text, word2vec learns relationships
between words automatically. The output is a set of vectors, one vector per word,
with remarkable linear relationships that allow us to do things like:

* vec("king") - vec("man") + vec("woman") =~ vec("queen")
* vec("Montreal Canadiens") - vec("Montreal") + vec("Toronto") =~ vec("Toronto Maple Leafs").

Word2vec is very useful in `automatic text tagging
<https://github.com/RaRe-Technologies/movie-plots-by-genre>`_\ , recommender
systems and machine translation.

This tutorial:

#. Introduces ``Word2Vec`` as an improvement over traditional bag-of-words
#. Shows off a demo of ``Word2Vec`` using a pre-trained model
#. Demonstrates training a new model from your own data
#. Demonstrates loading and saving models
#. Introduces several training parameters and demonstrates their effect
#. Discusses memory requirements
#. Visualizes Word2Vec embeddings by applying dimensionality reduction

Review: Bag-of-words
--------------------

.. Note:: Feel free to skip these review sections if you're already familiar with the models.

You may be familiar with the `bag-of-words model
<https://en.wikipedia.org/wiki/Bag-of-words_model>`_ from the
`core_concepts_vector` section.
This model transforms each document to a fixed-length vector of integers.
For example, given the sentences:

- ``John likes to watch movies. Mary likes movies too.``
- ``John also likes to watch football games. Mary hates football.``

The model outputs the vectors:

- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``
- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``

Each vector has 11 elements, where each element counts the number of times a
particular word occurred in the document.
The order of elements is arbitrary.
In the example above, the order of the elements corresponds to the words:
``["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]``.
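
The counting step can be reproduced in a few lines of plain Python. This is a minimal sketch rather than how Gensim builds bag-of-words vectors; the helper name ``bag_of_words`` and the regular-expression tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

# Vocabulary order taken from the example above.
VOCAB = ["John", "likes", "to", "watch", "movies", "Mary",
         "too", "also", "football", "games", "hates"]


def bag_of_words(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    return [counts[word] for word in vocab]


doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games. Mary hates football."

vec1 = bag_of_words(doc1)
vec2 = bag_of_words(doc2)
print(vec1)
print(vec2)
```

Each position in the result corresponds to one vocabulary word, so the vector length always equals the vocabulary size.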

Bag-of-words models are surprisingly effective, but have several weaknesses.

First, they lose all information about word order: "John likes Mary" and
"Mary likes John" correspond to identical vectors. One remedy is the bag of
`n-grams <https://en.wikipedia.org/wiki/N-gram>`__
model, which represents documents as fixed-length vectors over word phrases
of length n. This captures local word order, but suffers from data
sparsity and high dimensionality.
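
To make the word-order point concrete, the sketch below compares unigram and bigram counts for the two sentences. The ``ngrams`` helper is a hypothetical name introduced here, not a Gensim function.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return all contiguous n-token phrases as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


a = "John likes Mary".split()
b = "Mary likes John".split()

# Single-word counts cannot tell the two sentences apart ...
same_unigrams = Counter(a) == Counter(b)
# ... but bigram counts can, because they keep local word order.
same_bigrams = Counter(ngrams(a)) == Counter(ngrams(b))

print(same_unigrams, same_bigrams)
```

The price is a much larger feature space: the number of possible bigrams grows roughly with the square of the vocabulary size.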

Second, the model does not attempt to learn the meaning of the underlying
words, and as a consequence, the distance between vectors doesn't always
reflect the difference in meaning.  The ``Word2Vec`` model addresses this
second problem.

Introducing: the ``Word2Vec`` Model
-----------------------------------

``Word2Vec`` is a more recent model that embeds words in a lower-dimensional
vector space using a shallow neural network. The result is a set of
word-vectors where vectors close together in vector space have similar
meanings based on context, and word-vectors distant to each other have
differing meanings. For example, ``strong`` and ``powerful`` would be close
together and ``strong`` and ``Paris`` would be relatively far.

There are two versions of this model, and the :py:class:`~gensim.models.word2vec.Word2Vec`
class implements them both:

1. Skip-grams (SG)
2. Continuous-bag-of-words (CBOW)

.. Important::
  Don't let the implementation details below scare you.
  They're advanced material: if it's too much, then move on to the next section.

The `Word2Vec Skip-gram <http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model>`__
model, for example, takes in pairs (word1, word2) generated by moving a
window across text data, and trains a 1-hidden-layer neural network on a
synthetic task: given an input word, predict a probability distribution over
the words near it. A virtual `one-hot
<https://en.wikipedia.org/wiki/One-hot>`__ encoding of words
goes through a 'projection layer' to the hidden layer; these projection
weights are later interpreted as the word embeddings. So if the hidden layer
has 300 neurons, this network will give us 300-dimensional word embeddings.
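
The window-sliding step that produces the (word1, word2) pairs can be sketched as follows. This is a simplified illustration with a fixed window; the function name ``skipgram_pairs`` is an assumption, and real implementations add refinements such as down-sampling frequent words that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs with a fixed-size window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

Each pair becomes one training example: the first word is the network input, the second is the word it should assign high probability to.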

Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It
is also a 1-hidden-layer neural network. The synthetic training task now uses
the average of multiple input context words, rather than a single word as in
skip-gram, to predict the center word. Again, the projection weights that
turn one-hot words into averageable vectors, of the same width as the hidden
layer, are interpreted as the word embeddings.
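
The CBOW input can be sketched in the same spirit: collect the context words around each center word and average their vectors. The two-dimensional ``toy_vectors`` are made-up numbers standing in for projection weights, and both helper names are hypothetical.

```python
def cbow_examples(tokens, window=2):
    """Generate (context_words, center_word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Made-up projection weights, for illustration only.
toy_vectors = {
    "the": [1.0, 0.0],
    "quick": [0.0, 1.0],
    "brown": [1.0, 1.0],
    "fox": [0.0, 0.0],
}

examples = cbow_examples("the quick brown fox".split(), window=1)
context, center = examples[1]
hidden = average([toy_vectors[word] for word in context])
print(context, "->", center, "hidden input:", hidden)
```

The averaged vector plays the role of the hidden-layer activation from which the center word is predicted.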




Word2Vec Demo
-------------

To see what ``Word2Vec`` can do, let's download a pre-trained model and play
around with it. We will fetch the Word2Vec model trained on part of the
Google News dataset, covering approximately 3 million words and phrases. Such
a model can take hours to train, but since it's already available,
downloading and loading it with Gensim takes minutes.

.. Important::
  The model is approximately 2GB, so you'll need a decent network connection
  to proceed.  Otherwise, skip ahead to the "Training Your Own Model" section
  below.

You may also check out an `online word2vec demo
<http://radimrehurek.com/2014/02/word2vec-tutorial/#app>`_ where you can try
this vector algebra for yourself. That demo runs ``word2vec`` on the
**entire** Google News dataset, of **about 100 billion words**.




In [25]:
import gensim.models

wv=gensim.models.KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin.gz', binary=True)

2020-02-03 23:11:00,017 : INFO : loading projection weights from ../data/GoogleNews-vectors-negative300.bin.gz
2020-02-03 23:13:32,331 : INFO : loaded (3000000, 300) matrix from ../data/GoogleNews-vectors-negative300.bin.gz


A common operation is to retrieve the vocabulary of a model.  That is trivial:



In [26]:
for i, word in enumerate(wv.vocab):
    if i == 10:
        break
    print(word)

</s>
in
for
that
is
on
##
The
with
said


We can easily obtain vectors for terms the model is familiar with:




In [6]:
vec_king = wv['king']

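Unfortunately, the model is unable to infer vectors for words it does not know. This is one limitation of Word2Vec: if this limitation matters to you, check out the FastText model.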

In [7]:
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

The word 'cameroon' does not appear in this model
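
An alternative to catching ``KeyError`` is to test for membership first. The helper below is a hypothetical convenience function, not part of Gensim; a plain dict stands in for the word-vector object here, on the assumption that both support ``in`` and indexing by word.

```python
def lookup_or_default(vectors, word, default=None):
    """Return vectors[word] if the word is known, otherwise `default`."""
    return vectors[word] if word in vectors else default


# A plain dict stands in for the loaded word vectors in this sketch.
toy_vectors = {"king": [0.5, 0.1]}

known = lookup_or_default(toy_vectors, "king")
unknown = lookup_or_default(toy_vectors, "cameroon")
print(known, unknown)
```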


Moving on, ``Word2Vec`` supports several word similarity tasks out of the
box.  You can see how the similarity intuitively decreases as the words get
less and less similar.




In [9]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
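
The scores above are cosine similarities between word vectors. A minimal sketch of that computation follows; the three-dimensional vectors are made up for illustration and are not taken from the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: "car" and "minivan" point in similar directions.
toy = {
    "car": [1.0, 0.2, 0.0],
    "minivan": [0.9, 0.3, 0.1],
    "cereal": [0.0, 0.1, 1.0],
}

sim_close = cosine_similarity(toy["car"], toy["minivan"])
sim_far = cosine_similarity(toy["car"], toy["cereal"])
print("%.2f %.2f" % (sim_close, sim_far))
```

Because only the angle matters, scaling a vector by a positive constant does not change its similarity to other vectors.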


Print the 5 most similar words to "car" or "minivan"



In [10]:
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

2020-02-03 22:46:02,729 : INFO : precomputing L2-norms of word weight vectors


[('SUV', 0.8532191514968872), ('vehicle', 0.8175784349441528), ('pickup_truck', 0.7763689160346985), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.7565719485282898)]


Which of the below does not belong in the sequence?



In [11]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car




Training Your Own Model
-----------------------

To start, you'll need some data for training the model.  For the following
examples, we'll use the `Lee Corpus
<https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor>`_
(which you already have if you've installed gensim).

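This corpus is small enough to fit entirely in memory, but we'll still implement a memory-friendly iterator that reads it line by line, to demonstrate how you would handle a larger corpus.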

In [12]:
from gensim import utils

corpus_path = '../data/lee_background.cor'


class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(corpus_path) as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

2020-02-03 22:49:22,016 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-02-03 22:49:22,020 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


If we wanted to do any custom preprocessing, e.g. decode a non-standard
encoding, lowercase, remove numbers, or extract named entities, all of it can
be done inside the ``MyCorpus`` iterator, and ``word2vec`` doesn't need to
know. All that is required is that the input yields one sentence (a list of
UTF-8 words) after another.
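
As a sketch of what such custom preprocessing could look like, the class below lowercases each line and strips digits before tokenizing. The class name ``PreprocessedCorpus``, the cleaning rules and the two-line demo file are all assumptions made for illustration; only the standard library is used.

```python
import os
import re
import tempfile


class PreprocessedCorpus:
    """Yield one tokenized sentence (list of str) per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as fin:
            for line in fin:
                # Custom preprocessing: lowercase and replace digits with spaces.
                cleaned = re.sub(r"\d+", " ", line.lower())
                tokens = cleaned.split()
                if tokens:
                    yield tokens


# Write a tiny two-line corpus to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Hundreds of people fled in 2019\nThe fire burned 300 homes\n")
    demo_path = tmp.name

corpus = PreprocessedCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same sentences again
os.remove(demo_path)
print(first_pass)
```

Implementing ``__iter__`` on a class, rather than passing a one-shot generator, matters because training makes more than one pass over the data.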

Let's go ahead and train a model on our corpus.  Don't worry much about the
training parameters for now; we'll revisit them later.




In [13]:
import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

2020-02-03 22:49:28,784 : INFO : collecting all words and their counts
2020-02-03 22:49:28,790 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:49:28,885 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences
2020-02-03 22:49:28,886 : INFO : Loading a fresh vocabulary
2020-02-03 22:49:28,918 : INFO : effective_min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
2020-02-03 22:49:28,919 : INFO : effective_min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
2020-02-03 22:49:28,925 : INFO : deleting the raw counts dictionary of 6981 items
2020-02-03 22:49:28,928 : INFO : sample=0.001 downsamples 51 most-common words
2020-02-03 22:49:28,930 : INFO : downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
2020-02-03 22:49:28,936 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes
2020-02-03 22:49:28,937 : INFO : resetting layer weight

Once we have our model, we can use it in the same way as in the demo above.

The main part of the model is ``model.wv``\ , where "wv" stands for "word vectors".




In [14]:
vec_king = model.wv['king'] # dim=100

Retrieving the vocabulary works the same way:



In [15]:
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

hundreds
of
people
have
been
forced
to
their
homes
in


Storing and loading models
--------------------------

You'll notice that training non-trivial models can take time.  Once you've
trained your model and it works as expected, you can save it to disk.  That
way, you don't have to spend time training it all over again later.

You can store/load models using the standard gensim methods:




In [16]:
import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

2020-02-03 22:50:01,465 : INFO : saving Word2Vec object under /var/folders/fb/jcf536t9265fc_nlmqdcwd9jsf5jrm/T/gensim-model-wiazqwrx, separately None
2020-02-03 22:50:01,466 : INFO : not storing attribute vectors_norm
2020-02-03 22:50:01,467 : INFO : not storing attribute cum_table
2020-02-03 22:50:01,491 : INFO : saved /var/folders/fb/jcf536t9265fc_nlmqdcwd9jsf5jrm/T/gensim-model-wiazqwrx
2020-02-03 22:50:01,492 : INFO : loading Word2Vec object from /var/folders/fb/jcf536t9265fc_nlmqdcwd9jsf5jrm/T/gensim-model-wiazqwrx
2020-02-03 22:50:01,509 : INFO : loading wv recursively from /var/folders/fb/jcf536t9265fc_nlmqdcwd9jsf5jrm/T/gensim-model-wiazqwrx.wv.* with mmap=None
2020-02-03 22:50:01,510 : INFO : setting ignored attribute vectors_norm to None
2020-02-03 22:50:01,511 : INFO : loading vocabulary recursively from /var/folders/fb/jcf536t9265fc_nlmqdcwd9jsf5jrm/T/gensim-model-wiazqwrx.vocabulary.* with mmap=None
2020-02-03 22:50:01,511 : INFO : loading trainables recursively from /var/

These methods use pickle internally, optionally ``mmap``\ 'ing the model's
large internal NumPy matrices into virtual memory directly from disk files, for
inter-process memory sharing.

In addition, you can load models created by the original C tool, both using
its text and binary formats::

  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
  # using gzipped/bz2 input works too, no need to unzip
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)




Training Parameters
-------------------

``Word2Vec`` accepts several parameters that affect both training speed and quality.

min_count
---------

``min_count`` is for pruning the internal dictionary. Words that appear only
once or twice in a billion-word corpus are probably uninteresting typos and
garbage. In addition, there's not enough data to train meaningful vectors for
those words, so it's best to ignore them. The default value is
``min_count=5``; the cell below raises it to 10:



In [None]:
model = gensim.models.Word2Vec(sentences, min_count=10)
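
The effect of ``min_count`` on the vocabulary can be imitated with a simple frequency filter. This is a sketch of the idea only, not Gensim's internal code; ``prune_vocabulary`` is a hypothetical helper and the three toy sentences are made up.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Keep only words that occur at least `min_count` times in total."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

kept = prune_vocabulary(toy_sentences, min_count=2)
print(kept)
```

Raising the threshold shrinks the vocabulary, which in turn reduces both memory use and training time.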

size
----

``size`` is the number of dimensions (N) of the N-dimensional space that
gensim Word2Vec maps the words onto.

Bigger size values require more training data, but can lead to better (more
accurate) models. Reasonable values are in the tens to hundreds.




In [None]:
# default value of size=100
model = gensim.models.Word2Vec(sentences, size=200)

workers
-------

``workers``, the last of the major parameters (full list `here
<http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec>`_),
is for training parallelization, to speed up training:




In [None]:
# default value of workers=3 (tutorial says 1...)
model = gensim.models.Word2Vec(sentences, workers=4)

The ``workers`` parameter only has an effect if you have `Cython
<http://cython.org/>`_ installed. Without Cython, you’ll only be able to use
one core because of the `GIL
<https://wiki.python.org/moin/GlobalInterpreterLock>`_ (and ``word2vec``
training will be `miserably slow
<http://rare-technologies.com/word2vec-in-python-part-two-optimizing/>`_\ ).




Memory
------

At its core, ``word2vec`` model parameters are stored as matrices (NumPy
arrays). Each array is **#vocabulary** (controlled by the ``min_count`` parameter)
times **#size** (the ``size`` parameter) of floats (single precision, aka 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number
to two, or even one). So if your input contains 100,000 unique words, and you
asked for layer ``size=200``\ , the model will require approx.
``100,000*200*4*3 bytes = ~229MB``.
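
The arithmetic can be wrapped in a small helper for trying other vocabulary and vector sizes. The function name and its defaults (three matrices, 4-byte floats) simply restate the figures given above and are not a Gensim API.

```python
def estimate_matrix_bytes(vocab_size, vector_size, num_matrices=3, bytes_per_float=4):
    """Rough size of the weight matrices, ignoring the vocabulary structures."""
    return vocab_size * vector_size * bytes_per_float * num_matrices


total_bytes = estimate_matrix_bytes(100_000, 200)
total_mib = total_bytes / 1024 ** 2
print(f"{total_bytes:,} bytes = {total_mib:.1f} MiB")
```

Note that the "~229MB" figure counts megabytes as 1024 * 1024 bytes; in decimal units the same amount is 240 MB.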

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.




Evaluating
----------

``Word2Vec`` training is an unsupervised task, so there's no good way to
objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic
test examples, following the “A is to B as C is to D” task. It is provided in
the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?.
There are a total of 9 types of syntactic comparisons in the dataset, such as
plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as
capital cities (Paris:France;Tokyo:?) or family members
(brother:sister;dad:?).
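
To illustrate how an "A is to B as C is to D" question can be answered with vector arithmetic, the sketch below picks the word whose vector is closest to ``B - A + C``. It is a simplified illustration: the three-dimensional vectors are invented, the helper names are hypothetical, and the scoring used by actual evaluation code may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(vectors, a, b, c):
    """Return the word (other than a, b, c) whose vector is closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Invented dimensions: [is_french, is_country, is_japanese].
toy = {
    "Paris": [1.0, 0.0, 0.0],
    "France": [1.0, 1.0, 0.0],
    "Tokyo": [0.0, 0.0, 1.0],
    "Japan": [0.0, 1.0, 1.0],
    "Germany": [0.0, 1.0, 0.0],
}

answer = solve_analogy(toy, "Paris", "France", "Tokyo")
print("Paris:France;Tokyo:%s" % answer)
```

An evaluation run repeats this for every question and reports the fraction answered correctly.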




Gensim supports the same evaluation set, in exactly the same format:




In [None]:
model.accuracy('../data/questions-words.txt')

This ``accuracy`` takes an `optional parameter
<http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy>`_
``restrict_vocab`` which limits which test examples are to be considered.




In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.

By default it uses the academic dataset WS-353, but you can create a dataset
specific to your business based on it. It contains word pairs together with
human-assigned similarity judgments. It measures the relatedness or
co-occurrence of two words. For example, 'coast' and 'shore' are very similar
as they appear in the same context. At the same time 'clothes' and 'closet'
are less similar because they are related but not interchangeable.
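
One common way to score a model against such human judgments is a rank correlation between the two lists of scores. The sketch below computes Spearman's coefficient in plain Python; the choice of Spearman is an assumption made for illustration, the formula used assumes no tied values, and all scores are made up rather than taken from WS-353.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equally long lists without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up scores for four word pairs.
human_scores = [9.1, 8.0, 1.3, 5.5]
model_scores = [0.80, 0.65, 0.10, 0.70]

rho = spearman(human_scores, model_scores)
print("Spearman correlation: %.2f" % rho)
```

A value near 1 means the model orders the word pairs much as the human annotators did.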




In [None]:
model.evaluate_word_pairs('../data/wordsim353.tsv')

.. Important::
  Good performance on Google's or WS-353 test set doesn’t mean word2vec will
  work well in your application, or vice versa. It’s always best to evaluate
  directly on your intended task. For an example of how to use word2vec in a
  classifier pipeline, see this `tutorial
  <https://github.com/RaRe-Technologies/movie-plots-by-genre>`_.




Online training / Resuming training
-----------------------------------

Advanced users can load a model and continue training it with more sentences
and `new vocabulary words <online_w2v_tutorial.ipynb>`_:




In [None]:
model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']
]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

# cleaning up temporary file
import os
os.remove(temporary_filepath)

You may need to tweak the ``total_words`` parameter to ``train()``,
depending on what learning rate decay you want to simulate.
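
To see why the expected amount of training data influences the decay, consider a linear schedule in which the learning rate falls from a starting value to a minimum as training progresses. The sketch assumes such a linear schedule; ``linear_alpha`` is a hypothetical helper, and its defaults mirror the usual ``alpha=0.025`` and ``min_alpha=0.0001`` settings.

```python
def linear_alpha(progress, start_alpha=0.025, end_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return start_alpha - (start_alpha - end_alpha) * progress


# Progress is the share of the expected data already seen, so the same
# number of processed words means different progress for different totals.
words_seen = 50_000
for total_words in (100_000, 200_000):
    alpha = linear_alpha(words_seen / total_words)
    print(total_words, round(alpha, 5))
```

Declaring a larger total makes the rate fall more slowly, while a smaller total makes it reach the minimum sooner.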

Note that it's not possible to resume training with models generated by the C
tool and loaded via ``KeyedVectors.load_word2vec_format()``. You can still use
them for querying/similarity, but information vital for training (the vocab
tree) is missing there.




Training Loss Computation
-------------------------

The parameter ``compute_loss`` can be used to toggle computation of loss
while training the Word2Vec model. The computed loss is stored in the model
attribute ``running_training_loss`` and can be retrieved using the function
``get_latest_training_loss`` as follows:




In [None]:
# instantiating and training the Word2Vec model
model_with_loss = gensim.models.Word2Vec(
    sentences,
    min_count=1,
    compute_loss=True,
    hs=0,
    sg=1,
    seed=42
)

# getting the training loss value
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

Benchmarks
----------

Let's run some benchmarks to see the effect of the training loss computation
code on training time.

We'll use the following data for the benchmarks:

#. Lee Background corpus: included in gensim's test data
#. Text8 corpus.  To demonstrate the effect of corpus size, we'll look at the
   first 1MB, 10MB, 50MB of the corpus, as well as the entire thing.




In [17]:
import io
import os

import gensim.models.word2vec
import gensim.downloader as api
import smart_open
import logging

def head(path, size):
    with smart_open.open(path) as fin:
        return io.StringIO(fin.read(size))


def generate_input_data():
    lee_path = '../data/lee_background.cor'
    ls = gensim.models.word2vec.LineSentence(lee_path)
    ls.name = '25kB'
    yield ls

    text8_path = '../data/text8'
    labels = ('1MB', '10MB', '50MB', '100MB')
    sizes = (1024 ** 2, 10 * 1024 ** 2, 50 * 1024 ** 2, 100 * 1024 ** 2)
    for l, s in zip(labels, sizes):
        ls = gensim.models.word2vec.LineSentence(head(text8_path, s))
        ls.name = l
        yield ls


input_data = list(generate_input_data())

We now compare the training time taken for different combinations of input
data and model training parameters like ``hs`` and ``sg``.

For each combination, we repeat the test several times to obtain the mean and
standard deviation of the test duration.
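
The repeat-and-summarize pattern can be isolated into a small helper. This sketch uses only the standard library; ``time_repeated`` is a hypothetical name, and the summed squares merely stand in for a training call.

```python
import statistics
import time


def time_repeated(func, repeats=3):
    """Run `func` several times; return (mean, population std) of the durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.pstdev(durations)


# Stand-in workload; in the benchmark this would be a model training call.
mean_s, std_s = time_repeated(lambda: sum(i * i for i in range(100_000)))
print("mean %.4fs, std %.4fs" % (mean_s, std_s))
```

``statistics.pstdev`` computes the population standard deviation, which matches the default behaviour of ``np.std``.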




In [18]:
# Temporarily reduce logging verbosity
logging.root.level = logging.ERROR

import time
import numpy as np
import pandas as pd

train_time_values = []
seed_val = 42
sg_values = [0, 1]
hs_values = [0, 1]

fast = True
if fast:
    input_data_subset = input_data[:3]
else:
    input_data_subset = input_data


for data in input_data_subset:
    for sg_val in sg_values:
        for hs_val in hs_values:
            for loss_flag in [True, False]:
                time_taken_list = []
                for i in range(3):
                    start_time = time.time()
                    w2v_model = gensim.models.Word2Vec(
                        data,
                        compute_loss=loss_flag,
                        sg=sg_val,
                        hs=hs_val,
                        seed=seed_val,
                    )
                    time_taken_list.append(time.time() - start_time)

                time_taken_list = np.array(time_taken_list)
                time_mean = np.mean(time_taken_list)
                time_std = np.std(time_taken_list)

                model_result = {
                    'train_data': data.name,
                    'compute_loss': loss_flag,
                    'sg': sg_val,
                    'hs': hs_val,
                    'train_time_mean': time_mean,
                    'train_time_std': time_std,
                }
                print("Word2vec model #%i: %s" % (len(train_time_values), model_result))
                train_time_values.append(model_result)

train_times_table = pd.DataFrame(train_time_values)
train_times_table = train_times_table.sort_values(
    by=['train_data', 'sg', 'hs', 'compute_loss'],
    ascending=[False, False, True, False],
)
print(train_times_table)

2020-02-03 22:54:28,575 : INFO : collecting all words and their counts
2020-02-03 22:54:28,577 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:54:28,598 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2020-02-03 22:54:28,599 : INFO : Loading a fresh vocabulary
2020-02-03 22:54:28,606 : INFO : effective_min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
2020-02-03 22:54:28,606 : INFO : effective_min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
2020-02-03 22:54:28,613 : INFO : deleting the raw counts dictionary of 10781 items
2020-02-03 22:54:28,614 : INFO : sample=0.001 downsamples 45 most-common words
2020-02-03 22:54:28,614 : INFO : downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
2020-02-03 22:54:28,620 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
2020-02-03 22:54:28,621 : INFO : resetting layer we

2020-02-03 22:54:30,317 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:30,321 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:30,321 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.0s, 945396 effective words/s
2020-02-03 22:54:30,351 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:30,355 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:30,358 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:30,359 : INFO : EPOCH - 3 : training on 59890 raw words (32517 effective words) took 0.0s, 909672 effective words/s
2020-02-03 22:54:30,391 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:30,394 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:30,398 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-0

Word2vec model #0: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.6225600242614746, 'train_time_std': 0.008212428504875411}


2020-02-03 22:54:30,882 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:54:30,913 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:30,914 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:30,920 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:30,921 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.0s, 868729 effective words/s
2020-02-03 22:54:30,957 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:30,960 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:30,964 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:30,964 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.0s, 798707 effective words/s
2020-02-03 22:54:30,996 : INFO : worker

2020-02-03 22:54:32,268 : INFO : EPOCH - 4 : training on 59890 raw words (32769 effective words) took 0.0s, 1001040 effective words/s
2020-02-03 22:54:32,299 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:32,302 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:32,307 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:32,308 : INFO : EPOCH - 5 : training on 59890 raw words (32723 effective words) took 0.0s, 878691 effective words/s
2020-02-03 22:54:32,309 : INFO : training on a 299450 raw words (163104 effective words) took 0.2s, 900933 effective words/s
2020-02-03 22:54:32,312 : INFO : collecting all words and their counts
2020-02-03 22:54:32,313 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:54:32,337 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2020-02-03 22:54:32,339 : INFO : Loading a fresh vocab

Word2vec model #1: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.621660312016805, 'train_time_std': 0.028677800827210085}


2020-02-03 22:54:32,764 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:54:32,811 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:32,814 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:32,820 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:32,821 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.1s, 595016 effective words/s
2020-02-03 22:54:32,885 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:32,888 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:32,894 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:32,895 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.1s, 458342 effective words/s
2020-02-03 22:54:32,943 : INFO : worker

2020-02-03 22:54:34,577 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:34,580 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:34,588 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:34,589 : INFO : EPOCH - 4 : training on 59890 raw words (32769 effective words) took 0.1s, 534934 effective words/s
2020-02-03 22:54:34,652 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:34,657 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:34,667 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:34,668 : INFO : EPOCH - 5 : training on 59890 raw words (32723 effective words) took 0.1s, 430194 effective words/s
2020-02-03 22:54:34,668 : INFO : training on a 299450 raw words (163104 effective words) took 0.3s, 492454 effective words/s
2020-02-03 22:54:34,673 : INFO : collecting all words and their

Word2vec model #2: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.7871049245198568, 'train_time_std': 0.02387725786025585}


2020-02-03 22:54:35,127 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:54:35,181 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:35,185 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:35,193 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:35,194 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.1s, 505026 effective words/s
2020-02-03 22:54:35,244 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:35,246 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:35,253 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:35,253 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.1s, 565358 effective words/s
2020-02-03 22:54:35,309 : INFO : worker

2020-02-03 22:54:36,913 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:36,922 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:36,926 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:36,927 : INFO : EPOCH - 4 : training on 59890 raw words (32624 effective words) took 0.1s, 528869 effective words/s
2020-02-03 22:54:36,977 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:36,982 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:36,986 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:36,987 : INFO : EPOCH - 5 : training on 59890 raw words (32543 effective words) took 0.1s, 559008 effective words/s
2020-02-03 22:54:36,988 : INFO : training on a 299450 raw words (162982 effective words) took 0.3s, 515390 effective words/s
2020-02-03 22:54:36,992 : INFO : collecting all words and their

Word2vec model #3: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.7731149196624756, 'train_time_std': 0.0057283864786662475}


2020-02-03 22:54:37,437 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:54:37,510 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:37,517 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:37,520 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:37,521 : INFO : EPOCH - 1 : training on 59890 raw words (32668 effective words) took 0.1s, 403550 effective words/s
2020-02-03 22:54:37,606 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:37,609 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:37,615 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:37,616 : INFO : EPOCH - 2 : training on 59890 raw words (32652 effective words) took 0.1s, 349560 effective words/s
2020-02-03 22:54:37,690 : INFO : worker

2020-02-03 22:54:39,452 : INFO : EPOCH - 4 : training on 59890 raw words (32587 effective words) took 0.1s, 436150 effective words/s
2020-02-03 22:54:39,522 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:39,528 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:39,531 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:39,532 : INFO : EPOCH - 5 : training on 59890 raw words (32693 effective words) took 0.1s, 419158 effective words/s
2020-02-03 22:54:39,533 : INFO : training on a 299450 raw words (162978 effective words) took 0.4s, 427615 effective words/s
2020-02-03 22:54:39,534 : INFO : collecting all words and their counts
2020-02-03 22:54:39,536 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:54:39,553 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2020-02-03 22:54:39,554 : INFO : Loading a fresh vocabu

Word2vec model #4: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 0.8472832838694254, 'train_time_std': 0.0664275760725608}


2020-02-03 22:54:39,908 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:54:39,977 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:39,982 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:39,984 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:39,985 : INFO : EPOCH - 1 : training on 59890 raw words (32668 effective words) took 0.1s, 439556 effective words/s
2020-02-03 22:54:40,053 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:40,054 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:40,060 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:40,061 : INFO : EPOCH - 2 : training on 59890 raw words (32652 effective words) took 0.1s, 440707 effective words/s
2020-02-03 22:54:40,130 : INFO : worker

2020-02-03 22:54:41,752 : INFO : EPOCH - 4 : training on 59890 raw words (32580 effective words) took 0.1s, 442623 effective words/s
2020-02-03 22:54:41,818 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:41,822 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:41,826 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:41,827 : INFO : EPOCH - 5 : training on 59890 raw words (32528 effective words) took 0.1s, 445564 effective words/s
2020-02-03 22:54:41,828 : INFO : training on a 299450 raw words (162937 effective words) took 0.4s, 417093 effective words/s
2020-02-03 22:54:41,830 : INFO : collecting all words and their counts
2020-02-03 22:54:41,832 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:54:41,849 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2020-02-03 22:54:41,850 : INFO : Loading a fresh vocabu

Word2vec model #5: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 0.7651705741882324, 'train_time_std': 0.0057429612127855275}


2020-02-03 22:54:42,270 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:54:42,434 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:42,435 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:42,444 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:42,445 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.2s, 187902 effective words/s
2020-02-03 22:54:42,616 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:42,620 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:42,627 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:42,628 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.2s, 178734 effective words/s
2020-02-03 22:54:42,820 : INFO : worker

2020-02-03 22:54:45,989 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:45,992 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:46,005 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:46,006 : INFO : EPOCH - 4 : training on 59890 raw words (32585 effective words) took 0.3s, 121382 effective words/s
2020-02-03 22:54:46,234 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:46,244 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:46,256 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:46,257 : INFO : EPOCH - 5 : training on 59890 raw words (32720 effective words) took 0.2s, 131995 effective words/s
2020-02-03 22:54:46,258 : INFO : training on a 299450 raw words (163193 effective words) took 1.3s, 127548 effective words/s
2020-02-03 22:54:46,267 : INFO : collecting all words and their

Word2vec model #6: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 1.4782519340515137, 'train_time_std': 0.20497679517315912}


2020-02-03 22:54:46,940 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:54:47,238 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:47,247 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:47,261 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:47,263 : INFO : EPOCH - 1 : training on 59890 raw words (32543 effective words) took 0.3s, 101384 effective words/s
2020-02-03 22:54:47,669 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:47,689 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:47,750 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:47,753 : INFO : EPOCH - 2 : training on 59890 raw words (32552 effective words) took 0.5s, 69674 effective words/s
2020-02-03 22:54:48,314 : INFO : worker 

2020-02-03 22:54:52,373 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:52,380 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:52,386 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:52,386 : INFO : EPOCH - 4 : training on 59890 raw words (32605 effective words) took 0.2s, 214117 effective words/s
2020-02-03 22:54:52,525 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:52,532 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:52,536 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:52,537 : INFO : EPOCH - 5 : training on 59890 raw words (32567 effective words) took 0.1s, 219721 effective words/s
2020-02-03 22:54:52,538 : INFO : training on a 299450 raw words (163001 effective words) took 0.8s, 214614 effective words/s
2020-02-03 22:54:52,543 : INFO : collecting all words and their

Word2vec model #7: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 2.0920166969299316, 'train_time_std': 0.8674240515297896}


2020-02-03 22:54:53,518 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:54:53,605 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:53,609 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:53,610 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:53,611 : INFO : EPOCH - 1 : training on 175599 raw words (110344 effective words) took 0.1s, 1415013 effective words/s
2020-02-03 22:54:53,727 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:53,735 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:53,737 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:53,738 : INFO : EPOCH - 2 : training on 175599 raw words (110214 effective words) took 0.1s, 1011600 effective words/s
2020-02-03 22:54:53,858 : INFO : 

2020-02-03 22:54:57,714 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:57,715 : INFO : EPOCH - 4 : training on 175599 raw words (110337 effective words) took 0.1s, 1289892 effective words/s
2020-02-03 22:54:57,808 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:57,811 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:57,813 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:57,814 : INFO : EPOCH - 5 : training on 175599 raw words (110226 effective words) took 0.1s, 1314573 effective words/s
2020-02-03 22:54:57,815 : INFO : training on a 877995 raw words (551326 effective words) took 0.5s, 1009492 effective words/s
2020-02-03 22:54:57,818 : INFO : collecting all words and their counts
2020-02-03 22:54:57,831 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:54:57,865 : INFO : collected 17251 word types from a c

Word2vec model #8: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 1.7585968971252441, 'train_time_std': 0.40717185435063274}


2020-02-03 22:54:58,738 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:54:58,842 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:58,846 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:58,847 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:58,848 : INFO : EPOCH - 1 : training on 175599 raw words (110344 effective words) took 0.1s, 1211352 effective words/s
2020-02-03 22:54:58,949 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:54:58,952 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:54:58,953 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:54:58,954 : INFO : EPOCH - 2 : training on 175599 raw words (110115 effective words) took 0.1s, 1203785 effective words/s
2020-02-03 22:54:59,049 : INFO : 

2020-02-03 22:55:01,947 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:01,947 : INFO : EPOCH - 4 : training on 175599 raw words (110132 effective words) took 0.1s, 1401890 effective words/s
2020-02-03 22:55:02,028 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:02,031 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:02,034 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:02,035 : INFO : EPOCH - 5 : training on 175599 raw words (110161 effective words) took 0.1s, 1475328 effective words/s
2020-02-03 22:55:02,036 : INFO : training on a 877995 raw words (550803 effective words) took 0.5s, 1108736 effective words/s
2020-02-03 22:55:02,040 : INFO : collecting all words and their counts
2020-02-03 22:55:02,054 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:55:02,089 : INFO : collected 17251 word types from a c

Word2vec model #9: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 1.407256046930949, 'train_time_std': 0.03917851311917621}


2020-02-03 22:55:03,023 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:55:03,225 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:03,235 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:03,241 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:03,241 : INFO : EPOCH - 1 : training on 175599 raw words (110285 effective words) took 0.2s, 546413 effective words/s
2020-02-03 22:55:03,416 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:03,426 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:03,431 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:03,432 : INFO : EPOCH - 2 : training on 175599 raw words (110214 effective words) took 0.2s, 639918 effective words/s
2020-02-03 22:55:03,614 : INFO : wo

2020-02-03 22:55:07,524 : INFO : EPOCH - 3 : training on 175599 raw words (110217 effective words) took 0.2s, 651422 effective words/s
2020-02-03 22:55:07,773 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:07,787 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:07,797 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:07,798 : INFO : EPOCH - 4 : training on 175599 raw words (110193 effective words) took 0.3s, 427209 effective words/s
2020-02-03 22:55:07,985 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:07,993 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:07,998 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:07,999 : INFO : EPOCH - 5 : training on 175599 raw words (110258 effective words) took 0.2s, 595028 effective words/s
2020-02-03 22:55:08,000 : INFO : training on a 87

Word2vec model #10: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 1.988530953725179, 'train_time_std': 0.07284101358439658}


2020-02-03 22:55:09,044 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:55:09,270 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:09,271 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:09,277 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:09,278 : INFO : EPOCH - 1 : training on 175599 raw words (110284 effective words) took 0.2s, 505477 effective words/s
2020-02-03 22:55:09,481 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:09,491 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:09,496 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:09,496 : INFO : EPOCH - 2 : training on 175599 raw words (110137 effective words) took 0.2s, 543612 effective words/s
2020-02-03 22:55:09,663 : INFO : wo

2020-02-03 22:55:13,525 : INFO : EPOCH - 3 : training on 175599 raw words (110339 effective words) took 0.2s, 693839 effective words/s
2020-02-03 22:55:13,700 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:13,700 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:13,706 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:13,706 : INFO : EPOCH - 4 : training on 175599 raw words (110332 effective words) took 0.2s, 666317 effective words/s
2020-02-03 22:55:13,881 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:13,883 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:13,890 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:13,891 : INFO : EPOCH - 5 : training on 175599 raw words (110258 effective words) took 0.2s, 643391 effective words/s
2020-02-03 22:55:13,892 : INFO : training on a 87

Word2vec model #11: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 1.9638365109761555, 'train_time_std': 0.09133045537800726}


2020-02-03 22:55:14,816 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:55:15,067 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:15,070 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:15,081 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:15,082 : INFO : EPOCH - 1 : training on 175599 raw words (109994 effective words) took 0.3s, 439515 effective words/s
2020-02-03 22:55:15,338 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:15,347 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:15,351 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:15,352 : INFO : EPOCH - 2 : training on 175599 raw words (110177 effective words) took 0.3s, 430340 effective words/s
2020-02-03 22:55:15,610 : INFO : wo

2020-02-03 22:55:21,056 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:21,057 : INFO : EPOCH - 4 : training on 175599 raw words (110224 effective words) took 0.2s, 450915 effective words/s
2020-02-03 22:55:21,306 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:21,312 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:21,319 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:21,320 : INFO : EPOCH - 5 : training on 175599 raw words (110027 effective words) took 0.3s, 440053 effective words/s
2020-02-03 22:55:21,321 : INFO : training on a 877995 raw words (551014 effective words) took 1.4s, 396424 effective words/s
2020-02-03 22:55:21,324 : INFO : collecting all words and their counts
2020-02-03 22:55:21,338 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:55:21,374 : INFO : collected 17251 word types from a corp

Word2vec model #12: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 2.4752933979034424, 'train_time_std': 0.26572930944761375}


2020-02-03 22:55:22,175 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:55:22,415 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:22,417 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:22,429 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:22,429 : INFO : EPOCH - 1 : training on 175599 raw words (109994 effective words) took 0.2s, 460800 effective words/s
2020-02-03 22:55:22,700 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:22,709 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:22,712 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:22,713 : INFO : EPOCH - 2 : training on 175599 raw words (110037 effective words) took 0.3s, 410299 effective words/s
2020-02-03 22:55:22,960 : INFO : wo

2020-02-03 22:55:27,722 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:27,723 : INFO : EPOCH - 4 : training on 175599 raw words (110278 effective words) took 0.3s, 408633 effective words/s
2020-02-03 22:55:27,970 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:27,973 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:27,982 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:27,982 : INFO : EPOCH - 5 : training on 175599 raw words (110216 effective words) took 0.2s, 449902 effective words/s
2020-02-03 22:55:27,983 : INFO : training on a 877995 raw words (551234 effective words) took 1.3s, 408368 effective words/s
2020-02-03 22:55:27,986 : INFO : collecting all words and their counts
2020-02-03 22:55:27,999 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:55:28,033 : INFO : collected 17251 word types from a corp

Word2vec model #13: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 2.220606009165446, 'train_time_std': 0.04694651771730856}


2020-02-03 22:55:28,957 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:55:29,488 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:29,509 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:29,519 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:29,519 : INFO : EPOCH - 1 : training on 175599 raw words (110226 effective words) took 0.5s, 202015 effective words/s
2020-02-03 22:55:30,434 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:30,475 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:30,478 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:30,479 : INFO : EPOCH - 2 : training on 175599 raw words (110041 effective words) took 0.9s, 117027 effective words/s
2020-02-03 22:55:31,085 : INFO : wo

2020-02-03 22:55:38,748 : INFO : EPOCH - 3 : training on 175599 raw words (110080 effective words) took 0.6s, 190861 effective words/s
2020-02-03 22:55:39,333 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:39,354 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:39,364 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:39,365 : INFO : EPOCH - 4 : training on 175599 raw words (110224 effective words) took 0.6s, 183500 effective words/s
2020-02-03 22:55:39,940 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:39,956 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:39,966 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:39,967 : INFO : EPOCH - 5 : training on 175599 raw words (110239 effective words) took 0.6s, 188255 effective words/s
2020-02-03 22:55:39,968 : INFO : training on a 87

Word2vec model #14: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 3.9957686265309653, 'train_time_std': 0.2193048492510622}


2020-02-03 22:55:40,996 : INFO : training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-02-03 22:55:41,529 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:41,547 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:41,559 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:41,560 : INFO : EPOCH - 1 : training on 175599 raw words (110344 effective words) took 0.5s, 201082 effective words/s
2020-02-03 22:55:42,102 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:42,119 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:42,129 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:42,130 : INFO : EPOCH - 2 : training on 175599 raw words (110124 effective words) took 0.6s, 199558 effective words/s
2020-02-03 22:55:42,647 : INFO : wo

2020-02-03 22:55:50,773 : INFO : EPOCH - 3 : training on 175599 raw words (110030 effective words) took 0.5s, 208634 effective words/s
2020-02-03 22:55:51,316 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:51,337 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:51,342 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:51,343 : INFO : EPOCH - 4 : training on 175599 raw words (110323 effective words) took 0.6s, 198451 effective words/s
2020-02-03 22:55:51,882 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:55:51,900 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:55:51,911 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:55:51,912 : INFO : EPOCH - 5 : training on 175599 raw words (110162 effective words) took 0.6s, 198707 effective words/s
2020-02-03 22:55:51,913 : INFO : training on a 87

Word2vec model #15: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 3.9813363552093506, 'train_time_std': 0.2596908837737627}


2020-02-03 22:55:52,124 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-02-03 22:55:52,482 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 22:55:52,482 : INFO : Loading a fresh vocabulary
2020-02-03 22:55:52,530 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 22:55:52,531 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 22:55:52,589 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:55:52,594 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:55:52,595 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:55:52,651 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-02-03 22:55:52,652 : INFO : resetting layer weights
2020-02-03 22:55:56,460 : INFO : training model wit

2020-02-03 22:56:16,619 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 22:56:16,681 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:56:16,686 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:56:16,688 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:56:16,763 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-02-03 22:56:16,763 : INFO : resetting layer weights
2020-02-03 22:56:20,955 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:56:21,982 : INFO : EPOCH 1 - PROGRESS: at 70.39% examples, 868730 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:56:22,337 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:56:22,342 : INFO : worker thread finished; awaiting finish of 1 more threa

Word2vec model #16: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 12.098690430323282, 'train_time_std': 0.11005429577102101}


2020-02-03 22:56:28,817 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 22:56:28,818 : INFO : Loading a fresh vocabulary
2020-02-03 22:56:28,875 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 22:56:28,876 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 22:56:28,935 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:56:28,940 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:56:28,940 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:56:29,004 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-02-03 22:56:29,004 : INFO : resetting layer weights
2020-02-03 22:56:33,086 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5


2020-02-03 22:56:51,318 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:56:51,322 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:56:51,322 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:56:51,386 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-02-03 22:56:51,387 : INFO : resetting layer weights
2020-02-03 22:56:55,275 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-02-03 22:56:56,302 : INFO : EPOCH 1 - PROGRESS: at 73.74% examples, 911231 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:56:56,645 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:56:56,657 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:56:56,659 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:56:56,659 : I

Word2vec model #17: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 11.358401457468668, 'train_time_std': 0.5699301892493713}


2020-02-03 22:57:02,848 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 22:57:02,849 : INFO : Loading a fresh vocabulary
2020-02-03 22:57:02,909 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 22:57:02,910 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 22:57:02,974 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:57:02,978 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:57:02,979 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:57:03,001 : INFO : constructing a huffman tree from 20167 words
2020-02-03 22:57:03,536 : INFO : built huffman tree with maximum node depth 18
2020-02-03 22:57:03,577 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-02-03 22:57:03,578 : INFO : resetting layer w

2020-02-03 22:57:36,912 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:57:36,920 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:57:36,921 : INFO : EPOCH - 4 : training on 1788017 raw words (1242682 effective words) took 3.1s, 401671 effective words/s
2020-02-03 22:57:37,958 : INFO : EPOCH 5 - PROGRESS: at 35.20% examples, 436171 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:57:38,976 : INFO : EPOCH 5 - PROGRESS: at 87.15% examples, 533041 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:57:39,212 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:57:39,227 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:57:39,231 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:57:39,231 : INFO : EPOCH - 5 : training on 1788017 raw words (1242037 effective words) took 2.3s, 542492 effective words/s
2020-02-03 22:57:39,232 : INFO : training on 

Word2vec model #18: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 17.665089686711628, 'train_time_std': 1.224967270064921}


2020-02-03 22:57:55,781 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 22:57:55,781 : INFO : Loading a fresh vocabulary
2020-02-03 22:57:55,833 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 22:57:55,834 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 22:57:55,887 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:57:55,892 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:57:55,892 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:57:55,912 : INFO : constructing a huffman tree from 20167 words
2020-02-03 22:57:56,388 : INFO : built huffman tree with maximum node depth 18
2020-02-03 22:57:56,425 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-02-03 22:57:56,426 : INFO : resetting layer w

2020-02-03 22:58:30,352 : INFO : EPOCH - 4 : training on 1788017 raw words (1242861 effective words) took 2.7s, 463804 effective words/s
2020-02-03 22:58:31,378 : INFO : EPOCH 5 - PROGRESS: at 35.20% examples, 439994 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:58:32,393 : INFO : EPOCH 5 - PROGRESS: at 83.80% examples, 515415 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:58:32,753 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:58:32,766 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:58:32,774 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:58:32,775 : INFO : EPOCH - 5 : training on 1788017 raw words (1242278 effective words) took 2.4s, 516652 effective words/s
2020-02-03 22:58:32,776 : INFO : training on a 8940085 raw words (6211848 effective words) took 12.3s, 505131 effective words/s
2020-02-03 22:58:32,803 : INFO : collecting all words and their counts
2020-02-03 22:58:32,925 : IN

Word2vec model #19: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 18.446553071339924, 'train_time_std': 2.277916091152619}


2020-02-03 22:58:51,174 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 22:58:51,175 : INFO : Loading a fresh vocabulary
2020-02-03 22:58:51,233 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 22:58:51,234 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 22:58:51,290 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:58:51,294 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:58:51,295 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:58:51,352 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-02-03 22:58:51,353 : INFO : resetting layer weights
2020-02-03 22:58:55,053 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5


2020-02-03 22:59:29,583 : INFO : EPOCH 4 - PROGRESS: at 22.35% examples, 283140 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:59:30,593 : INFO : EPOCH 4 - PROGRESS: at 49.16% examples, 306857 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:59:31,609 : INFO : EPOCH 4 - PROGRESS: at 78.21% examples, 321646 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:59:32,271 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 22:59:32,307 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 22:59:32,314 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 22:59:32,315 : INFO : EPOCH - 4 : training on 1788017 raw words (1242708 effective words) took 3.7s, 332898 effective words/s
2020-02-03 22:59:33,338 : INFO : EPOCH 5 - PROGRESS: at 25.70% examples, 324053 words/s, in_qsize 5, out_qsize 0
2020-02-03 22:59:34,342 : INFO : EPOCH 5 - PROGRESS: at 52.51% examples, 327146 words/s, in_qsize 6, out_qsize 0
2020-02-03 22:59:35,349 :

Word2vec model #20: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 22.757126569747925, 'train_time_std': 0.152659750088263}


2020-02-03 22:59:59,462 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 22:59:59,462 : INFO : Loading a fresh vocabulary
2020-02-03 22:59:59,516 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 22:59:59,517 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 22:59:59,574 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 22:59:59,577 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 22:59:59,578 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 22:59:59,638 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes
2020-02-03 22:59:59,638 : INFO : resetting layer weights
2020-02-03 23:00:03,452 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5


2020-02-03 23:00:39,763 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 23:00:39,771 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 23:00:39,772 : INFO : EPOCH - 3 : training on 1788017 raw words (1241996 effective words) took 3.9s, 318041 effective words/s
2020-02-03 23:00:40,832 : INFO : EPOCH 4 - PROGRESS: at 24.02% examples, 294015 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:00:41,833 : INFO : EPOCH 4 - PROGRESS: at 48.60% examples, 299578 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:00:42,837 : INFO : EPOCH 4 - PROGRESS: at 72.07% examples, 294732 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:00:43,837 : INFO : EPOCH 4 - PROGRESS: at 94.41% examples, 290616 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:00:44,021 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 23:00:44,069 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 23:00:44,075 : INFO : worker thread 

Word2vec model #21: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 24.407532930374146, 'train_time_std': 1.284500796488715}


2020-02-03 23:01:12,659 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 23:01:12,660 : INFO : Loading a fresh vocabulary
2020-02-03 23:01:12,723 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 23:01:12,724 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 23:01:12,800 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 23:01:12,804 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 23:01:12,805 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 23:01:12,828 : INFO : constructing a huffman tree from 20167 words
2020-02-03 23:01:13,323 : INFO : built huffman tree with maximum node depth 18
2020-02-03 23:01:13,363 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-02-03 23:01:13,364 : INFO : resetting layer w

2020-02-03 23:01:59,696 : INFO : training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2020-02-03 23:02:00,715 : INFO : EPOCH 1 - PROGRESS: at 11.17% examples, 143105 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:01,745 : INFO : EPOCH 1 - PROGRESS: at 24.58% examples, 153479 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:02,751 : INFO : EPOCH 1 - PROGRESS: at 38.55% examples, 159956 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:03,769 : INFO : EPOCH 1 - PROGRESS: at 53.63% examples, 165375 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:04,791 : INFO : EPOCH 1 - PROGRESS: at 68.16% examples, 166723 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:05,792 : INFO : EPOCH 1 - PROGRESS: at 82.68% examples, 169215 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:06,792 : INFO : EPOCH 1 - PROGRESS: at 97.21% examples, 170844 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:06,910 : INFO : worker thread finished; aw

2020-02-03 23:02:47,876 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 23:02:47,893 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 23:02:47,894 : INFO : EPOCH - 1 : training on 1788017 raw words (1242143 effective words) took 7.5s, 164867 effective words/s
2020-02-03 23:02:48,949 : INFO : EPOCH 2 - PROGRESS: at 8.94% examples, 110725 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:49,954 : INFO : EPOCH 2 - PROGRESS: at 21.23% examples, 132040 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:50,959 : INFO : EPOCH 2 - PROGRESS: at 36.31% examples, 150066 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:51,996 : INFO : EPOCH 2 - PROGRESS: at 51.40% examples, 157749 words/s, in_qsize 6, out_qsize 0
2020-02-03 23:02:53,013 : INFO : EPOCH 2 - PROGRESS: at 64.80% examples, 157899 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:02:54,038 : INFO : EPOCH 2 - PROGRESS: at 78.21% examples, 158902 words/s, in_qsize 5, out_qsize 0
2020

Word2vec model #22: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 42.05006639162699, 'train_time_std': 0.7657778590513281}


2020-02-03 23:03:18,781 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences
2020-02-03 23:03:18,782 : INFO : Loading a fresh vocabulary
2020-02-03 23:03:18,839 : INFO : effective_min_count=5 retains 20167 unique words (27% of original 73167, drops 53000)
2020-02-03 23:03:18,840 : INFO : effective_min_count=5 leaves 1703716 word corpus (95% of original 1788017, drops 84301)
2020-02-03 23:03:18,907 : INFO : deleting the raw counts dictionary of 73167 items
2020-02-03 23:03:18,911 : INFO : sample=0.001 downsamples 38 most-common words
2020-02-03 23:03:18,911 : INFO : downsampling leaves estimated 1242287 word corpus (72.9% of prior 1703716)
2020-02-03 23:03:18,936 : INFO : constructing a huffman tree from 20167 words
2020-02-03 23:03:19,538 : INFO : built huffman tree with maximum node depth 18
2020-02-03 23:03:19,576 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes
2020-02-03 23:03:19,577 : INFO : resetting layer w

2020-02-03 23:04:05,834 : INFO : EPOCH 1 - PROGRESS: at 26.26% examples, 164977 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:06,877 : INFO : EPOCH 1 - PROGRESS: at 42.46% examples, 174495 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:07,942 : INFO : EPOCH 1 - PROGRESS: at 59.22% examples, 178742 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:08,952 : INFO : EPOCH 1 - PROGRESS: at 75.42% examples, 182796 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:09,985 : INFO : EPOCH 1 - PROGRESS: at 91.06% examples, 183781 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:10,455 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 23:04:10,531 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 23:04:10,544 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-03 23:04:10,544 : INFO : EPOCH - 1 : training on 1788017 raw words (1242946 effective words) took 6.7s, 184710 effective words/s
2020-02-03 23:04:11,571 :

2020-02-03 23:04:53,381 : INFO : EPOCH 2 - PROGRESS: at 11.17% examples, 142380 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:54,439 : INFO : EPOCH 2 - PROGRESS: at 25.70% examples, 157941 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:55,450 : INFO : EPOCH 2 - PROGRESS: at 40.78% examples, 167091 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:56,536 : INFO : EPOCH 2 - PROGRESS: at 55.87% examples, 167765 words/s, in_qsize 6, out_qsize 0
2020-02-03 23:04:57,567 : INFO : EPOCH 2 - PROGRESS: at 70.95% examples, 169903 words/s, in_qsize 6, out_qsize 0
2020-02-03 23:04:58,647 : INFO : EPOCH 2 - PROGRESS: at 85.47% examples, 169567 words/s, in_qsize 5, out_qsize 0
2020-02-03 23:04:59,586 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-02-03 23:04:59,658 : INFO : EPOCH 2 - PROGRESS: at 99.44% examples, 169682 words/s, in_qsize 1, out_qsize 1
2020-02-03 23:04:59,659 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-02-03 23:04:59,685 : I

Word2vec model #23: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 40.79019276301066, 'train_time_std': 0.2595709934863263}
   train_data  compute_loss  sg  hs  train_time_mean  train_time_std
4        25kB          True   1   0         0.847283        0.066428
5        25kB         False   1   0         0.765171        0.005743
6        25kB          True   1   1         1.478252        0.204977
7        25kB         False   1   1         2.092017        0.867424
0        25kB          True   0   0         0.622560        0.008212
1        25kB         False   0   0         0.621660        0.028678
2        25kB          True   0   1         0.787105        0.023877
3        25kB         False   0   1         0.773115        0.005728
12        1MB          True   1   0         2.475293        0.265729
13        1MB         False   1   0         2.220606        0.046947
14        1MB          True   1   1         3.995769        0.219305
15        1M

Adding Word2Vec "model to dict" method to production pipeline
-------------------------------------------------------------

Suppose we want still more performance in production.

One good way is to cache the most similar words for each query word in a dictionary.

The next time the same query word arrives, we look it up in the dictionary first.

On a hit, we return the result directly from the dictionary.

On a miss, we query the model and cache the result so that the next lookup is a hit.




In [19]:
# re-enable logging
logging.root.level = logging.INFO

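# Precompute the most similar words for every word in the vocabulary.
# (Gensim 3.x API; in Gensim 4.x the attribute is called ``index_to_key``.)
most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word}
# Show the first three cached entries.
for i, (key, value) in enumerate(most_similars_precalc.items()):
    if i == 3:
        break
    print(key, value)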

2020-02-03 23:05:20,711 : INFO : precomputing L2-norms of word weight vectors


the [('on', 0.9999356269836426), ('by', 0.9999175071716309), ('which', 0.9999114274978638), ('also', 0.9999108910560608), ('and', 0.9999102354049683), ('an', 0.9999089241027832), ('for', 0.9999087452888489), ('days', 0.9999085664749146), ('with', 0.9999066591262817), ('australian', 0.9999063014984131)]
to [('also', 0.9999505877494812), ('about', 0.9999481439590454), ('for', 0.9999478459358215), ('from', 0.9999448657035828), ('but', 0.9999440312385559), ('at', 0.9999434947967529), ('company', 0.9999430775642395), ('or', 0.9999391436576843), ('before', 0.9999383687973022), ('who', 0.9999364018440247)]
of [('and', 0.9999475479125977), ('in', 0.9999444484710693), ('on', 0.9999402761459351), ('with', 0.9999359846115112), ('over', 0.9999345541000366), ('which', 0.9999303221702576), ('after', 0.9999282956123352), ('two', 0.9999276995658875), ('an', 0.9999268651008606), ('into', 0.9999268054962158)]


Comparison with and without caching
-----------------------------------

For the time being, let's take four words at random.




In [20]:
import time
words = ['voted', 'few', 'their', 'around']

Without caching




In [21]:
start = time.time()
for word in words:
    result = model.wv.most_similar(word)
    print(result)
end = time.time()
print(end - start)

[('lower', 0.9980545043945312), ('army', 0.9980535507202148), ('unions', 0.9980233907699585), ('first', 0.9980073571205139), ('secretary', 0.9980006217956543), ('israel', 0.997992753982544), ('weather', 0.9979914426803589), ('air', 0.9979841709136963), ('child', 0.997980535030365), ('team', 0.9979622960090637)]
[('were', 0.9997997283935547), ('around', 0.9997900724411011), ('with', 0.9997897744178772), ('three', 0.999785840511322), ('which', 0.9997817873954773), ('for', 0.9997814893722534), ('two', 0.9997813701629639), ('into', 0.9997802376747131), ('now', 0.9997799396514893), ('as', 0.999779462814331)]
[('for', 0.9999517202377319), ('at', 0.9999475479125977), ('with', 0.9999464750289917), ('from', 0.9999454021453857), ('also', 0.9999444484710693), ('but', 0.9999433159828186), ('and', 0.999941349029541), ('his', 0.9999404549598694), ('which', 0.9999392032623291), ('were', 0.9999378323554993)]
[('with', 0.9999535083770752), ('and', 0.9999483823776245), ('for', 0.9999468326568604), ('aft

Now with caching




In [22]:
start = time.time()
for word in words:
    if word in most_similars_precalc:
        # cache hit: read the stored result
        result = most_similars_precalc[word]
    else:
        # cache miss: query the model and remember the answer
        result = model.wv.most_similar(word)
        most_similars_precalc[word] = result
    print(result)

end = time.time()
print(end - start)

[('lower', 0.9980545043945312), ('army', 0.9980535507202148), ('unions', 0.9980233907699585), ('first', 0.9980073571205139), ('secretary', 0.9980006217956543), ('israel', 0.997992753982544), ('weather', 0.9979914426803589), ('air', 0.9979841709136963), ('child', 0.997980535030365), ('team', 0.9979622960090637)]
[('were', 0.9997997283935547), ('around', 0.9997900724411011), ('with', 0.9997897744178772), ('three', 0.999785840511322), ('which', 0.9997817873954773), ('for', 0.9997814893722534), ('two', 0.9997813701629639), ('into', 0.9997802376747131), ('now', 0.9997799396514893), ('as', 0.999779462814331)]
[('for', 0.9999517202377319), ('at', 0.9999475479125977), ('with', 0.9999464750289917), ('from', 0.9999454021453857), ('also', 0.9999444484710693), ('but', 0.9999433159828186), ('and', 0.999941349029541), ('his', 0.9999404549598694), ('which', 0.9999392032623291), ('were', 0.9999378323554993)]
[('with', 0.9999535083770752), ('and', 0.9999483823776245), ('for', 0.9999468326568604), ('aft

Comparing the two elapsed times printed above shows the improvement from
caching, and the difference grows as more words are queried.
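
The hit-or-miss logic can also be wrapped in a small helper, which makes the
cold and warm timings easy to compare. The following is a minimal sketch, not
part of Gensim: ``make_cached_lookup`` and ``slow_lookup`` are hypothetical
names, and ``slow_lookup`` only simulates the cost of a call such as
``model.wv.most_similar`` with a short sleep.

```python
import time


def make_cached_lookup(lookup):
    """Wrap ``lookup`` (any word -> result function) with a plain dict cache."""
    cache = {}

    def cached(word):
        if word not in cache:      # miss: compute once and remember the result
            cache[word] = lookup(word)
        return cache[word]         # hit: plain dictionary access

    return cached, cache


def slow_lookup(word):
    """Hypothetical stand-in for a similarity query (not a Gensim call)."""
    time.sleep(0.01)               # simulate the cost of the similarity search
    return [(word + "_neighbour", 1.0)]


cached_lookup, cache = make_cached_lookup(slow_lookup)
words = ['voted', 'few', 'their', 'around']

# First pass: every word is a miss, so the slow lookup runs four times.
start = time.perf_counter()
first_pass = [cached_lookup(w) for w in words]
cold = time.perf_counter() - start

# Second pass: every word is a hit, so only dictionary reads happen.
start = time.perf_counter()
second_pass = [cached_lookup(w) for w in words]
warm = time.perf_counter() - start

print(f"cold: {cold:.4f}s, warm: {warm:.4f}s")
```

With a real model, the stand-in would be replaced by the model's own query
function. The dictionary grows by one entry per distinct query word.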




Visualising the Word Embeddings
-------------------------------

The word embeddings made by the model can be visualised by reducing the
dimensionality of the word vectors to 2 dimensions using t-SNE.

Visualisations can be used to notice semantic and syntactic trends in the data.

Examples:

* Semantic: words like cat, dog, cow, etc. tend to lie close together.
* Syntactic: words like run, running or cut, cutting tend to lie close together.

Vector relations like vKing - vMan = vQueen - vWoman can also be noticed.
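
To make that relation concrete, here is a minimal sketch using hand-made
2-D toy vectors. The numbers are an assumption chosen for illustration (axis 0
stands for "royalty", axis 1 for "gender"); they are not produced by the
trained model, whose vectors will only satisfy the relation approximately.

```python
import math

# Hand-made toy vectors (illustrative assumption, not model output).
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# vKing - vMan and vQueen - vWoman are the same offset in this toy space.
offset_male = subtract(toy["king"], toy["man"])
offset_female = subtract(toy["queen"], toy["woman"])
print(offset_male == offset_female)

# Rearranged: vKing - vMan + vWoman should point towards vQueen.
target = add(subtract(toy["king"], toy["man"]), toy["woman"])
best = max(toy, key=lambda w: cosine(toy[w], target))
print(best)
```

In a 2-D plot, the same relation shows up as two roughly parallel arrows of
similar length, one from "man" to "king" and one from "woman" to "queen".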

.. Important::
  The model used for the visualisation is trained on a small corpus. Thus
  some of the relations might not be so clear.




In [23]:
from sklearn.decomposition import IncrementalPCA    # initial reduction (imported, but not used below)
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling


def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = [] # positions in vector space
    labels = [] # keep track of words to label our data again later
    for word in model.wv.vocab:
        vectors.append(model.wv[word])
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)

def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):
    from plotly.offline import init_notebook_mode, iplot, plot
    import plotly.graph_objs as go

    trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)
    data = [trace]

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    #
    # Label randomly subsampled 25 data points
    #
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, 25)
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

try:
    get_ipython()
except Exception:
    plot_function = plot_with_matplotlib
else:
    plot_function = plot_with_plotly

plot_function(x_vals, y_vals, labels)

Conclusion
----------

In this tutorial we learned how to train word2vec models on custom data
and how to evaluate them. We hope that you too will find this popular tool
useful in your machine learning tasks!

Links
-----

- API docs: :py:mod:`gensim.models.word2vec`
- `Original C toolkit and word2vec papers by Google <https://code.google.com/archive/p/word2vec/>`_.


