#### Train doc2vec using parsed wiki10 documents (Wikipedia article text with multiple class labels)

In [148]:
# import modules & set up logging
import gensim
from gensim.models.doc2vec import TaggedDocument
import logging
from pathlib import Path
import re
import string 
from itertools import islice
import multiprocessing
from gensim.models import Doc2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [169]:
DATA_DIR = Path('../../data/wiki10')
TEXT_DIR = DATA_DIR / 'text' 
MODEL_PATH = DATA_DIR / 'doc2vec.model'

Gensim's implementation of Doc2Vec requires as input:
* An iterable of `TaggedDocument` objects
* Here each `TaggedDocument` corresponds to a single sentence: it holds the list of tokens in that sentence, and a list of one (or more) tags identifying the document the sentence came from


Implementing an iterable that parses our input files and crudely splits each document into sentences of tokens

**N.B.**
* We cannot use a generator, as training iterates over the input multiple times (once to build the vocabulary, then once per training epoch)
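
The note above can be seen in a small pure-Python sketch. `generator_corpus` and `RestartableCorpus` are hypothetical names used only for illustration, and the two-line corpus is made up. A generator is exhausted after one pass, whereas an object that defines `__iter__` starts a fresh pass every time it is iterated.

```python
def generator_corpus(lines):
    """A generator: it can be consumed only once."""
    for line in lines:
        yield line.split()


class RestartableCorpus:
    """An iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


# Made-up two-sentence corpus, for illustration only.
lines = ["miso soup is a traditional japanese soup", "dashi is a stock"]

gen = generator_corpus(lines)
gen_first_pass = sum(1 for _ in gen)
gen_second_pass = sum(1 for _ in gen)  # the generator is already exhausted

corpus = RestartableCorpus(lines)
corpus_first_pass = sum(1 for _ in corpus)
corpus_second_pass = sum(1 for _ in corpus)

print(gen_first_pass, gen_second_pass)        # 2 0
print(corpus_first_pass, corpus_second_pass)  # 2 2
```

A second pass over the generator silently yields nothing, so a model fed a generator would see an empty corpus after the first scan.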

In [9]:
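class Corpus2TaggedDocument(object):
    ''' restartable iterable over a directory of text files, one TaggedDocument per sentence '''

    def __init__(self, input_path):
        self.input_path = Path(input_path)

    def __iter__(self):
        ''' yield a TaggedDocument for each sentence, tagged with its source file name '''
        for document_path in self.input_path.iterdir():
            document = document_path.read_text().lower()
            for sentence in self.sentencize(document):
                yield TaggedDocument(words=self.tokenize(sentence), tags=[document_path.name])

    def sentencize(self, text):
        # crude split: also breaks after abbreviations such as 'e.g. '
        return text.split('. ')

    def tokenize(self, text):
        # pad each punctuation mark with spaces so it becomes its own token;
        # re.escape keeps every mark (including the backslash) literal inside the character class
        return re.sub(f'([{re.escape(string.punctuation)}])', r' \1 ', text).split()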
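
To make the behaviour of the two helpers concrete, here is a standalone sketch of the same splitting and tokenizing logic, run on a made-up string. The module-level functions and the sample text are assumptions for illustration, not part of the notebook's corpus. It also shows one limitation of splitting on `'. '`: abbreviations such as "e.g." are treated as sentence boundaries.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
PUNCTUATION = re.compile(f'([{re.escape(string.punctuation)}])')


def sentencize(text):
    """Crude sentence split on a full stop followed by a space."""
    return text.split('. ')


def tokenize(text):
    """Pad punctuation with spaces, then split on whitespace."""
    return PUNCTUATION.sub(r' \1 ', text).split()


# Made-up sample text, for illustration only.
text = 'miso soup is mixed with "dashi". e.g. tofu is added.'

sentences = sentencize(text)
tokens = tokenize(sentences[0])

print(sentences)
print(tokens)
```

Here the abbreviation "e.g." ends up as its own one-word "sentence", which is the kind of noise a proper sentence splitter would avoid.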

#### Parsing input

In [10]:
sentences = Corpus2TaggedDocument(TEXT_DIR)  # iterate over the article text files, not the parent data directory

Example sentences

In [11]:
for sentence in islice(sentences, 5): print(sentence, end='\n\n')

TaggedDocument(['miso', 'soup', '(', '味噌汁', ',', 'miso', 'shiru', '?', ')', 'is', 'a', 'traditional', 'japanese', 'soup', 'consisting', 'of', 'a', 'stock', 'called', '"', 'dashi', '"', 'into', 'which', 'is', 'mixed', 'softened', 'miso', 'paste'], ['33d29682601cfaa391fe7516d90d6baf'])

TaggedDocument(['although', 'the', 'suspension', 'of', 'miso', 'paste', 'into', 'dashi', 'is', 'the', 'only', 'characteristic', 'that', 'actually', 'defines', 'miso', 'soup', ',', 'many', 'other', 'ingredients', 'are', 'added', 'depending', 'on', 'regional', 'and', 'seasonal', 'recipes', ',', 'and', 'personal', 'preference'], ['33d29682601cfaa391fe7516d90d6baf'])

TaggedDocument(['the', 'choice', 'of', 'miso', 'paste', 'for', 'the', 'soup', 'defines', 'a', 'great', 'deal', 'of', 'its', 'character', 'and', 'flavor'], ['33d29682601cfaa391fe7516d90d6baf'])

TaggedDocument(['most', 'miso', 'pastes', 'can', 'be', 'categorized', 'into', 'red', '(', 'akamiso', ')', ',', 'white', '(', 'shiromiso', ')', ',', 'or',

Number of sentences in corpus

In [12]:
sum(1 for _ in sentences)

1511872

In [13]:
cores = multiprocessing.cpu_count()

#### Training model

Training a doc2vec model with 100-dimensional vectors, a window of 5 surrounding tokens for word vectors, and a minimum token frequency of 2. We will see what sort of result we get with this simple setup, then try tuning parameters:

* We leave the `dm` parameter at its default. Note that gensim's default is the "DM" algorithm (`dm=1`); the alternative "DBOW" algorithm requires `dm=0`. When training with DM we may want to test using full documents instead of sentences as our training data [link](https://groups.google.com/forum/#!topic/gensim/2QkGx2C1pNQ)
* Optimising learning rate [link](https://rare-technologies.com/doc2vec-tutorial/)
* General parameter tuning [link](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb)
* Paper describing effects of document length [link](https://arxiv.org/pdf/1607.05368.pdf)
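
As a starting point for the tuning mentioned above, the sketch below shows one way to switch between the two algorithms. `doc2vec_params` and `train` are hypothetical helpers, the default `cores=4` is arbitrary, and the keyword `vector_size` assumes a recent gensim release (older releases, including the one used for this run, call it `size`). gensim is imported lazily so the parameter helper can be checked without it installed.

```python
def doc2vec_params(algorithm, cores=4):
    """Return Doc2Vec keyword arguments for 'dm' or 'dbow' (hypothetical helper)."""
    if algorithm not in ('dm', 'dbow'):
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    return {
        'dm': 1 if algorithm == 'dm' else 0,  # 1 = distributed memory, 0 = DBOW
        'vector_size': 100,                   # assumption: recent gensim naming
        'min_count': 2,
        'window': 5,
        'workers': cores,
    }


def train(corpus, algorithm):
    """Train a model on a restartable iterable of TaggedDocument objects."""
    from gensim.models import Doc2Vec  # imported lazily; requires gensim
    return Doc2Vec(corpus, **doc2vec_params(algorithm))


print(doc2vec_params('dbow'))
```

Calling `train(sentences, 'dbow')` and `train(sentences, 'dm')` on the same corpus would then allow a like-for-like comparison of the two algorithms.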

In [None]:
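model = Doc2Vec(
    sentences,
    size=100,       # vector dimensionality (called `vector_size` in gensim 4)
    min_count=2,    # ignore tokens that occur fewer than 2 times
    window=5,       # context window of 5 tokens
    workers=cores   # one worker thread per CPU core
    )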

2018-02-06 06:18:05,412 : INFO : collecting all words and their counts
2018-02-06 06:18:05,434 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-02-06 06:18:06,347 : INFO : PROGRESS: at example #10000, processed 326436 words (361678/s), 24960 word types, 120 tags
2018-02-06 06:18:07,250 : INFO : PROGRESS: at example #20000, processed 648630 words (357775/s), 36480 word types, 240 tags
2018-02-06 06:18:08,154 : INFO : PROGRESS: at example #30000, processed 971417 words (358060/s), 46077 word types, 365 tags
2018-02-06 06:18:09,067 : INFO : PROGRESS: at example #40000, processed 1287076 words (346842/s), 54771 word types, 495 tags
2018-02-06 06:18:09,945 : INFO : PROGRESS: at example #50000, processed 1588239 words (344193/s), 61737 word types, 615 tags
2018-02-06 06:18:10,846 : INFO : PROGRESS: at example #60000, processed 1897401 words (344463/s), 69040 word types, 757 tags
2018-02-06 06:18:11,752 : INFO : PROGRESS: at example #70000, processed 220908

2018-02-06 06:19:03,410 : INFO : PROGRESS: at example #640000, processed 20595333 words (351447/s), 269676 word types, 8393 tags
2018-02-06 06:19:04,291 : INFO : PROGRESS: at example #650000, processed 20908102 words (356032/s), 271777 word types, 8512 tags
2018-02-06 06:19:05,162 : INFO : PROGRESS: at example #660000, processed 21217257 words (356177/s), 273934 word types, 8639 tags
2018-02-06 06:19:06,047 : INFO : PROGRESS: at example #670000, processed 21527911 words (355221/s), 276179 word types, 8766 tags
2018-02-06 06:19:06,915 : INFO : PROGRESS: at example #680000, processed 21833459 words (353256/s), 278552 word types, 8883 tags
2018-02-06 06:19:07,836 : INFO : PROGRESS: at example #690000, processed 22156692 words (356903/s), 281204 word types, 9023 tags
2018-02-06 06:19:08,754 : INFO : PROGRESS: at example #700000, processed 22485475 words (359247/s), 283328 word types, 9117 tags
2018-02-06 06:19:09,641 : INFO : PROGRESS: at example #710000, processed 22798885 words (357815/s

2018-02-06 06:20:01,277 : INFO : PROGRESS: at example #1270000, processed 40980666 words (349795/s), 406251 word types, 16517 tags
2018-02-06 06:20:02,209 : INFO : PROGRESS: at example #1280000, processed 41291272 words (334095/s), 407936 word types, 16664 tags
2018-02-06 06:20:03,139 : INFO : PROGRESS: at example #1290000, processed 41597434 words (330190/s), 409688 word types, 16787 tags
2018-02-06 06:20:04,069 : INFO : PROGRESS: at example #1300000, processed 41919682 words (347700/s), 411802 word types, 16931 tags
2018-02-06 06:20:05,016 : INFO : PROGRESS: at example #1310000, processed 42246580 words (346262/s), 413679 word types, 17081 tags
2018-02-06 06:20:05,926 : INFO : PROGRESS: at example #1320000, processed 42569029 words (355621/s), 415353 word types, 17212 tags
2018-02-06 06:20:06,841 : INFO : PROGRESS: at example #1330000, processed 42886107 words (347394/s), 417134 word types, 17340 tags
2018-02-06 06:20:07,746 : INFO : PROGRESS: at example #1340000, processed 43206297 

2018-02-06 06:21:20,813 : INFO : PROGRESS: at 1.89% examples, 81487 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:21,831 : INFO : PROGRESS: at 1.94% examples, 81526 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:22,936 : INFO : PROGRESS: at 1.99% examples, 81556 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:23,940 : INFO : PROGRESS: at 2.03% examples, 81606 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:24,981 : INFO : PROGRESS: at 2.08% examples, 81595 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:26,001 : INFO : PROGRESS: at 2.12% examples, 81612 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:27,032 : INFO : PROGRESS: at 2.17% examples, 81643 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:28,052 : INFO : PROGRESS: at 2.21% examples, 81661 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:29,194 : INFO : PROGRESS: at 2.27% examples, 81649 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:21:30,247 : INFO : PROGRESS: at 2.31% examples, 81620 words/s, in_qsize 5, ou

2018-02-06 06:22:48,617 : INFO : PROGRESS: at 5.71% examples, 81586 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:49,621 : INFO : PROGRESS: at 5.75% examples, 81546 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:50,629 : INFO : PROGRESS: at 5.80% examples, 81565 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:51,634 : INFO : PROGRESS: at 5.84% examples, 81578 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:52,769 : INFO : PROGRESS: at 5.89% examples, 81520 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:53,781 : INFO : PROGRESS: at 5.92% examples, 81361 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:54,788 : INFO : PROGRESS: at 5.96% examples, 81268 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:55,810 : INFO : PROGRESS: at 6.00% examples, 81223 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:56,856 : INFO : PROGRESS: at 6.05% examples, 81219 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:22:57,939 : INFO : PROGRESS: at 6.10% examples, 81248 words/s, in_qsize 5, ou

2018-02-06 06:24:15,580 : INFO : PROGRESS: at 9.54% examples, 81714 words/s, in_qsize 4, out_qsize 0
2018-02-06 06:24:16,593 : INFO : PROGRESS: at 9.59% examples, 81689 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:17,685 : INFO : PROGRESS: at 9.64% examples, 81705 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:18,686 : INFO : PROGRESS: at 9.68% examples, 81647 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:19,688 : INFO : PROGRESS: at 9.72% examples, 81654 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:20,779 : INFO : PROGRESS: at 9.77% examples, 81667 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:21,832 : INFO : PROGRESS: at 9.82% examples, 81692 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:22,927 : INFO : PROGRESS: at 9.87% examples, 81703 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:24,009 : INFO : PROGRESS: at 9.92% examples, 81716 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:24:25,015 : INFO : PROGRESS: at 9.95% examples, 81658 words/s, in_qsize 4, ou

2018-02-06 06:25:41,682 : INFO : PROGRESS: at 13.32% examples, 81852 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:42,769 : INFO : PROGRESS: at 13.37% examples, 81863 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:43,860 : INFO : PROGRESS: at 13.42% examples, 81874 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:44,871 : INFO : PROGRESS: at 13.46% examples, 81878 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:45,908 : INFO : PROGRESS: at 13.51% examples, 81898 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:47,042 : INFO : PROGRESS: at 13.56% examples, 81895 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:48,095 : INFO : PROGRESS: at 13.60% examples, 81912 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:49,133 : INFO : PROGRESS: at 13.64% examples, 81907 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:50,151 : INFO : PROGRESS: at 13.69% examples, 81911 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:25:51,238 : INFO : PROGRESS: at 13.73% examples, 81894 words/s, in_q

2018-02-06 06:27:07,753 : INFO : PROGRESS: at 17.11% examples, 82102 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:08,759 : INFO : PROGRESS: at 17.15% examples, 82105 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:09,935 : INFO : PROGRESS: at 17.21% examples, 82113 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:10,995 : INFO : PROGRESS: at 17.26% examples, 82147 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:12,011 : INFO : PROGRESS: at 17.30% examples, 82149 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:13,011 : INFO : PROGRESS: at 17.35% examples, 82155 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:14,165 : INFO : PROGRESS: at 17.40% examples, 82149 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:15,208 : INFO : PROGRESS: at 17.45% examples, 82163 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:27:16,279 : INFO : PROGRESS: at 17.49% examples, 82155 words/s, in_qsize 4, out_qsize 1
2018-02-06 06:27:17,285 : INFO : PROGRESS: at 17.54% examples, 82178 words/s, in_q

2018-02-06 06:28:33,531 : INFO : PROGRESS: at 20.99% examples, 82456 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:34,537 : INFO : PROGRESS: at 21.04% examples, 82474 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:35,665 : INFO : PROGRESS: at 21.09% examples, 82473 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:36,757 : INFO : PROGRESS: at 21.13% examples, 82491 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:37,856 : INFO : PROGRESS: at 21.18% examples, 82494 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:38,889 : INFO : PROGRESS: at 21.23% examples, 82507 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:39,897 : INFO : PROGRESS: at 21.28% examples, 82524 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:40,922 : INFO : PROGRESS: at 21.32% examples, 82537 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:41,950 : INFO : PROGRESS: at 21.37% examples, 82553 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:28:43,038 : INFO : PROGRESS: at 21.42% examples, 82558 words/s, in_q

2018-02-06 06:29:58,776 : INFO : PROGRESS: at 24.93% examples, 83095 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:29:59,876 : INFO : PROGRESS: at 24.98% examples, 83098 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:00,961 : INFO : PROGRESS: at 25.02% examples, 83099 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:02,137 : INFO : PROGRESS: at 25.07% examples, 83089 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:03,156 : INFO : PROGRESS: at 25.12% examples, 83102 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:04,169 : INFO : PROGRESS: at 25.17% examples, 83103 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:05,179 : INFO : PROGRESS: at 25.22% examples, 83105 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:06,221 : INFO : PROGRESS: at 25.27% examples, 83114 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:07,355 : INFO : PROGRESS: at 25.32% examples, 83111 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:30:08,373 : INFO : PROGRESS: at 25.37% examples, 83122 words/s, in_q

2018-02-06 06:31:24,921 : INFO : PROGRESS: at 28.88% examples, 83406 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:26,003 : INFO : PROGRESS: at 28.93% examples, 83409 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:27,132 : INFO : PROGRESS: at 28.98% examples, 83394 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:28,135 : INFO : PROGRESS: at 29.03% examples, 83407 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:29,219 : INFO : PROGRESS: at 29.08% examples, 83409 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:30,232 : INFO : PROGRESS: at 29.13% examples, 83410 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:31,444 : INFO : PROGRESS: at 29.18% examples, 83420 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:32,466 : INFO : PROGRESS: at 29.23% examples, 83441 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:33,492 : INFO : PROGRESS: at 29.28% examples, 83450 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:31:34,502 : INFO : PROGRESS: at 29.32% examples, 83450 words/s, in_q

2018-02-06 06:32:50,891 : INFO : PROGRESS: at 32.94% examples, 84001 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:51,926 : INFO : PROGRESS: at 32.99% examples, 84019 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:52,939 : INFO : PROGRESS: at 33.04% examples, 84028 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:54,006 : INFO : PROGRESS: at 33.10% examples, 84042 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:55,040 : INFO : PROGRESS: at 33.14% examples, 84027 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:56,103 : INFO : PROGRESS: at 33.17% examples, 83990 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:57,109 : INFO : PROGRESS: at 33.22% examples, 84000 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:58,123 : INFO : PROGRESS: at 33.27% examples, 84019 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:32:59,144 : INFO : PROGRESS: at 33.32% examples, 84027 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:33:00,188 : INFO : PROGRESS: at 33.37% examples, 84033 words/s, in_q

2018-02-06 06:34:15,874 : INFO : PROGRESS: at 37.06% examples, 84755 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:16,980 : INFO : PROGRESS: at 37.12% examples, 84761 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:18,042 : INFO : PROGRESS: at 37.17% examples, 84773 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:19,061 : INFO : PROGRESS: at 37.21% examples, 84779 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:20,095 : INFO : PROGRESS: at 37.26% examples, 84785 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:21,133 : INFO : PROGRESS: at 37.32% examples, 84800 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:22,139 : INFO : PROGRESS: at 37.37% examples, 84808 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:23,177 : INFO : PROGRESS: at 37.42% examples, 84813 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:24,231 : INFO : PROGRESS: at 37.47% examples, 84825 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:34:25,281 : INFO : PROGRESS: at 37.52% examples, 84838 words/s, in_q

2018-02-06 06:35:41,511 : INFO : PROGRESS: at 41.27% examples, 85396 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:42,520 : INFO : PROGRESS: at 41.32% examples, 85410 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:43,561 : INFO : PROGRESS: at 41.37% examples, 85413 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:44,569 : INFO : PROGRESS: at 41.42% examples, 85420 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:45,613 : INFO : PROGRESS: at 41.47% examples, 85423 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:46,630 : INFO : PROGRESS: at 41.51% examples, 85419 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:47,733 : INFO : PROGRESS: at 41.56% examples, 85417 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:48,787 : INFO : PROGRESS: at 41.61% examples, 85427 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:49,824 : INFO : PROGRESS: at 41.66% examples, 85432 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:35:50,827 : INFO : PROGRESS: at 41.71% examples, 85440 words/s, in_q

2018-02-06 06:37:06,970 : INFO : PROGRESS: at 45.40% examples, 85830 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:07,971 : INFO : PROGRESS: at 45.45% examples, 85843 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:08,972 : INFO : PROGRESS: at 45.49% examples, 85833 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:09,986 : INFO : PROGRESS: at 45.54% examples, 85838 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:11,085 : INFO : PROGRESS: at 45.59% examples, 85843 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:12,245 : INFO : PROGRESS: at 45.65% examples, 85852 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:13,260 : INFO : PROGRESS: at 45.70% examples, 85863 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:14,386 : INFO : PROGRESS: at 45.75% examples, 85867 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:15,394 : INFO : PROGRESS: at 45.80% examples, 85872 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:37:16,463 : INFO : PROGRESS: at 45.85% examples, 85879 words/s, in_q

2018-02-06 06:38:32,547 : INFO : PROGRESS: at 49.56% examples, 86199 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:33,594 : INFO : PROGRESS: at 49.61% examples, 86201 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:34,597 : INFO : PROGRESS: at 49.66% examples, 86206 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:35,742 : INFO : PROGRESS: at 49.72% examples, 86214 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:36,814 : INFO : PROGRESS: at 49.77% examples, 86220 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:37,884 : INFO : PROGRESS: at 49.82% examples, 86219 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:39,033 : INFO : PROGRESS: at 49.88% examples, 86227 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:40,069 : INFO : PROGRESS: at 49.93% examples, 86236 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:41,182 : INFO : PROGRESS: at 49.98% examples, 86239 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:38:42,275 : INFO : PROGRESS: at 50.03% examples, 86250 words/s, in_q

2018-02-06 06:39:58,057 : INFO : PROGRESS: at 53.70% examples, 86557 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:39:59,094 : INFO : PROGRESS: at 53.75% examples, 86559 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:40:00,125 : INFO : PROGRESS: at 53.80% examples, 86561 words/s, in_qsize 4, out_qsize 1
2018-02-06 06:40:01,151 : INFO : PROGRESS: at 53.85% examples, 86564 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:40:02,202 : INFO : PROGRESS: at 53.90% examples, 86571 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:40:03,284 : INFO : PROGRESS: at 53.95% examples, 86569 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:40:04,290 : INFO : PROGRESS: at 54.00% examples, 86573 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:40:05,333 : INFO : PROGRESS: at 54.05% examples, 86574 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:40:06,411 : INFO : PROGRESS: at 54.09% examples, 86571 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:40:07,514 : INFO : PROGRESS: at 54.14% examples, 86575 words/s, in_q

2018-02-06 06:41:23,440 : INFO : PROGRESS: at 57.77% examples, 86767 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:24,504 : INFO : PROGRESS: at 57.83% examples, 86768 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:25,536 : INFO : PROGRESS: at 57.88% examples, 86770 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:26,541 : INFO : PROGRESS: at 57.93% examples, 86774 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:27,545 : INFO : PROGRESS: at 57.98% examples, 86778 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:28,578 : INFO : PROGRESS: at 58.03% examples, 86779 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:29,706 : INFO : PROGRESS: at 58.07% examples, 86769 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:30,757 : INFO : PROGRESS: at 58.13% examples, 86769 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:31,776 : INFO : PROGRESS: at 58.18% examples, 86773 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:41:32,808 : INFO : PROGRESS: at 58.23% examples, 86775 words/s, in_q

2018-02-06 06:42:48,884 : INFO : PROGRESS: at 61.86% examples, 86917 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:49,897 : INFO : PROGRESS: at 61.92% examples, 86921 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:50,913 : INFO : PROGRESS: at 61.97% examples, 86923 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:52,127 : INFO : PROGRESS: at 62.02% examples, 86924 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:53,187 : INFO : PROGRESS: at 62.08% examples, 86924 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:54,256 : INFO : PROGRESS: at 62.12% examples, 86924 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:55,279 : INFO : PROGRESS: at 62.18% examples, 86932 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:56,401 : INFO : PROGRESS: at 62.23% examples, 86928 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:57,425 : INFO : PROGRESS: at 62.26% examples, 86908 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:42:58,552 : INFO : PROGRESS: at 62.31% examples, 86898 words/s, in_q

2018-02-06 06:44:14,169 : INFO : PROGRESS: at 66.02% examples, 87172 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:15,181 : INFO : PROGRESS: at 66.07% examples, 87180 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:16,238 : INFO : PROGRESS: at 66.12% examples, 87180 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:17,258 : INFO : PROGRESS: at 66.17% examples, 87183 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:18,285 : INFO : PROGRESS: at 66.23% examples, 87190 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:19,294 : INFO : PROGRESS: at 66.28% examples, 87187 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:20,416 : INFO : PROGRESS: at 66.33% examples, 87194 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:21,460 : INFO : PROGRESS: at 66.39% examples, 87200 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:22,524 : INFO : PROGRESS: at 66.44% examples, 87205 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:44:23,608 : INFO : PROGRESS: at 66.49% examples, 87209 words/s, in_q

2018-02-06 06:45:39,240 : INFO : PROGRESS: at 70.22% examples, 87440 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:40,283 : INFO : PROGRESS: at 70.27% examples, 87446 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:41,331 : INFO : PROGRESS: at 70.31% examples, 87446 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:42,361 : INFO : PROGRESS: at 70.37% examples, 87452 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:43,449 : INFO : PROGRESS: at 70.42% examples, 87455 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:44,495 : INFO : PROGRESS: at 70.47% examples, 87461 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:45,508 : INFO : PROGRESS: at 70.51% examples, 87467 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:46,530 : INFO : PROGRESS: at 70.57% examples, 87473 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:47,552 : INFO : PROGRESS: at 70.61% examples, 87475 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:45:48,637 : INFO : PROGRESS: at 70.67% examples, 87478 words/s, in_q

2018-02-06 06:47:05,000 : INFO : PROGRESS: at 74.34% examples, 87611 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:06,042 : INFO : PROGRESS: at 74.39% examples, 87611 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:07,077 : INFO : PROGRESS: at 74.44% examples, 87617 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:08,230 : INFO : PROGRESS: at 74.48% examples, 87596 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:09,313 : INFO : PROGRESS: at 74.52% examples, 87584 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:10,380 : INFO : PROGRESS: at 74.57% examples, 87588 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:11,396 : INFO : PROGRESS: at 74.62% examples, 87590 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:12,453 : INFO : PROGRESS: at 74.67% examples, 87594 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:13,516 : INFO : PROGRESS: at 74.71% examples, 87588 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:47:14,525 : INFO : PROGRESS: at 74.76% examples, 87589 words/s, in_q

2018-02-06 06:48:31,061 : INFO : PROGRESS: at 78.53% examples, 87789 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:32,105 : INFO : PROGRESS: at 78.58% examples, 87794 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:33,136 : INFO : PROGRESS: at 78.63% examples, 87799 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:34,365 : INFO : PROGRESS: at 78.68% examples, 87794 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:35,436 : INFO : PROGRESS: at 78.74% examples, 87797 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:36,470 : INFO : PROGRESS: at 78.79% examples, 87797 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:37,529 : INFO : PROGRESS: at 78.84% examples, 87801 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:38,666 : INFO : PROGRESS: at 78.90% examples, 87806 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:39,737 : INFO : PROGRESS: at 78.95% examples, 87810 words/s, in_qsize 5, out_qsize 0
2018-02-06 06:48:40,802 : INFO : PROGRESS: at 79.01% examples, 87813 words/s, in_q

Saving model / document vectors

In [37]:
model.save(str(MODEL_PATH)) 

2018-02-06 18:24:12,863 : INFO : saving Doc2Vec object under ../../data/wiki10/doc2vec.model, separately None
2018-02-06 18:24:12,871 : INFO : storing np array 'syn0' to ../../data/wiki10/doc2vec.model.wv.syn0.npy
2018-02-06 18:24:13,006 : INFO : not storing attribute syn0norm
2018-02-06 18:24:13,010 : INFO : storing np array 'syn1neg' to ../../data/wiki10/doc2vec.model.syn1neg.npy
2018-02-06 18:24:13,197 : INFO : not storing attribute cum_table
2018-02-06 18:24:16,413 : INFO : saved ../../data/wiki10/doc2vec.model
