# Gensim

**Gensim** is billed as a Natural Language Processing package that does ‘Topic Modeling for Humans’, but in practice it does much more than that. It is a leading package for processing text, working with *word vector models* and building *topic models*.

Its facilities for building and evaluating topic models are unusually broad, and it offers many convenient utilities for text processing as well.

Another significant advantage of gensim is that it lets you handle large text files without loading the entire file into memory.

In [1]:
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.models.word2vec import Word2Vec
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

## Dictionary and Corpus

**Dictionary** is an object that maps each word to a unique id.

The dictionary object is typically used to create a ‘bag of words’ Corpus. It is this Dictionary and the bag-of-words (Corpus) that are used as inputs to topic modeling and other models that Gensim specializes in.

*gensim.utils.simple_preprocess* converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long.
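To make the length filtering concrete, here is a rough standard-library approximation of this behaviour, assuming the default limits of 2 and 15 characters. The function name `approx_simple_preprocess` and the regular expression are assumptions made for illustration; Gensim's own tokenizer differs in details such as accent removal.

```python
import re

def approx_simple_preprocess(doc, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (illustration only)."""
    # Lowercase, then keep runs of letters; digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d_]+", doc.lower())
    # Drop tokens that are too short or too long
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = approx_simple_preprocess("If you use a car frequently, the first step to cutting")
print(tokens)
```

The one-letter token "a" is filtered out and the trailing comma is stripped from "frequently", which is why the dictionary built with ```simple_preprocess``` has fewer entries than the one built with a plain ```split()```.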

In [2]:
# from a list of sentences

documents = ["If you use a car frequently, the first step to cutting",
             "down your emissions may well be to simply", 
             "fully consider the", 
             "alternatives available to you."
             ]

# Tokenize (split) the sentences into words
texts = [doc.split() for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(texts)

print(dictionary)
print(dictionary.token2id)

Dictionary(23 unique tokens: ['If', 'a', 'car', 'cutting', 'first']...)
{'If': 0, 'a': 1, 'car': 2, 'cutting': 3, 'first': 4, 'frequently,': 5, 'step': 6, 'the': 7, 'to': 8, 'use': 9, 'you': 10, 'be': 11, 'down': 12, 'emissions': 13, 'may': 14, 'simply': 15, 'well': 16, 'your': 17, 'consider': 18, 'fully': 19, 'alternatives': 20, 'available': 21, 'you.': 22}


With simple_preprocess

In [4]:
dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in documents)
print(dictionary)
print(dictionary.token2id)

Dictionary(21 unique tokens: ['car', 'cutting', 'first', 'frequently', 'if']...)
{'car': 0, 'cutting': 1, 'first': 2, 'frequently': 3, 'if': 4, 'step': 5, 'the': 6, 'to': 7, 'use': 8, 'you': 9, 'be': 10, 'down': 11, 'emissions': 12, 'may': 13, 'simply': 14, 'well': 15, 'your': 16, 'consider': 17, 'fully': 18, 'alternatives': 19, 'available': 20}


In [6]:
# from a text file, streamed line by line (the file is closed when the block exits)
with open('sample.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

**Corpus** is an object that contains the word ids and their frequencies in each document.


- Bag of words

In [7]:
my_docs = ["Who let the dogs out?",
           "Who? Who? Who? Who?"]

# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in my_docs]

# Create the Corpus
dictionary = corpora.Dictionary()

#allow_update=True - add new words to dictionary
bow_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in tokenized_list]
print(bow_corpus)

print("Dictionary: ", dictionary.token2id)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(4, 4)]]
Dictionary:  {'dogs': 0, 'let': 1, 'out': 2, 'the': 3, 'who': 4}


The (0, 1) in the first list means that the word with id 0 appears once in the first document.
Likewise, the (4, 4) in the second list means that the word with id 4 appears four times in the second document.
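The id assignment and counting can be sketched with the standard library alone. This is a simplified illustration, not Gensim's implementation; the helper name `doc2bow_sketch` is hypothetical, and giving new tokens ids in alphabetical order within each document is an assumption chosen because it reproduces the ids printed above.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Return sorted (id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    # New tokens get the next free ids, in alphabetical order within the document
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

token2id = {}
docs = [["who", "let", "the", "dogs", "out"], ["who", "who", "who", "who"]]
bow = [doc2bow_sketch(doc, token2id) for doc in docs]
print(bow)       # [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(4, 4)]]
print(token2id)  # {'dogs': 0, 'let': 1, 'out': 2, 'the': 3, 'who': 4}
```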

- TfIdf

In [9]:
from gensim import models
import numpy as np

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

# Show the Word Weights in Corpus
for doc in corpus:
    print([[mydict[token_id], freq] for token_id, freq in doc])
print()

# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[token_id], np.around(weight, decimals=2)] for token_id, weight in doc])

[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]

[['first', 0.66], ['is', 0.24], ['line', 0.66], ['the', 0.24]]
[['is', 0.24], ['the', 0.24], ['second', 0.66], ['sentence', 0.66]]
[['document', 0.71], ['third', 0.71]]
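The weights above can be reproduced by hand. The sketch below assumes that ```smartirs='ntc'``` stands for raw term frequency, an inverse document frequency of log2(N / df) and cosine (L2) normalization; the function name `tfidf_ntc` is hypothetical and the code is an illustration, not Gensim's implementation.

```python
import math
from collections import Counter

def tfidf_ntc(tokenized_docs):
    """TF-IDF with raw term frequency, log2 idf and cosine (L2) normalization."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token occurs
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    result = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        # Tokens present in every document get idf = 0 and are dropped
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        result.append({t: w / norm for t, w in sorted(weights.items())})
    return result

docs = [["this", "is", "the", "first", "line"],
        ["this", "is", "the", "second", "sentence"],
        ["this", "third", "document"]]
rounded = [{t: round(w, 2) for t, w in doc.items()} for doc in tfidf_ntc(docs)]
for doc in rounded:
    print(doc)
```

The word "this" occurs in all three documents, so its idf is log2(3/3) = 0 and it disappears from the weighted output, while words unique to one document receive the largest weights.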


Save and load

In [10]:
# Save the Dict and Corpus
dictionary.save('mydict.dict')  # save dict to disk
corpora.MmCorpus.serialize('bow_corpus.mm', bow_corpus)  # save corpus to disk

In [11]:
# Load them back
loaded_dict = corpora.Dictionary.load('mydict.dict')

corpus = corpora.MmCorpus('bow_corpus.mm')
for line in corpus:
    print(line)

[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)]
[(4, 4.0)]


## Datasets

Gensim provides a built-in API to download popular text datasets and word embedding models.

A comprehensive list of available datasets and models is maintained [here](https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json).

Downloading a dataset or model is as simple as calling the ```api.load()``` function with the right data or model name.

In [None]:
import gensim.downloader as api

Short description

In [12]:
api.info('text8')
api.info('glove-wiki-gigaword-50')

{'num_records': 400000,
 'file_size': 69182535,
 'base_dataset': 'Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py',
 'license': 'http://opendatacommons.org/licenses/pddl/',
 'parameters': {'dimension': 50},
 'description': 'Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).',
 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.',
 'read_more': ['https://nlp.stanford.edu/projects/glove/',
  'https://nlp.stanford.edu/pubs/glove.pdf'],
 'checksum': 'c289bc5d7f2f02c6dc9f2f9b67641813',
 'file_name': 'glove-wiki-gigaword-50.gz',
 'parts': 1}

Dataset

In [14]:
dataset = api.load("text8")
data = [d for d in dataset]
print(data[0])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 'related', 'so

Pretrained model

In [15]:
w2v_model = api.load("glove-wiki-gigaword-50")
w2v_model.most_similar('blue')



[('red', 0.8901657462120056),
 ('black', 0.8648406863212585),
 ('pink', 0.845291793346405),
 ('green', 0.8346816301345825),
 ('yellow', 0.8320707082748413),
 ('purple', 0.8293111324310303),
 ('white', 0.8225342035293579),
 ('orange', 0.8114302158355713),
 ('bright', 0.799933910369873),
 ('colored', 0.7876655459403992)]

## Word2Vec

A word embedding model provides a numerical vector for a given word. Using Gensim’s downloader API, you can download pre-built word embedding models such as word2vec, fastText, GloVe and ConceptNet. These are built on large corpora of commonly occurring text data such as Wikipedia, Google News, etc.

The training algorithms in the Gensim package were actually ported from the original Word2Vec implementation by Google and extended with additional functionality.

This module implements the word2vec family of algorithms, using *highly optimized* C routines, data streaming and Pythonic interfaces.

**Parameters:**

- ```sentences``` (iterable of iterables, optional) – The sentences iterable can be simply a *list of lists of tokens*, but for larger corpora, consider an *iterable* that streams the sentences directly from disk/network.
- ```corpus_file``` (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of ```sentences``` to get a performance boost. *Only one of the sentences or corpus_file arguments needs to be passed.*
- ```size``` = 100 (int, optional) – Dimensionality of the word vectors. Renamed to ```vector_size``` in Gensim 4.0.
- ```window``` = 5 (int, optional) – Maximum distance between the current and predicted word within a sentence.
- ```min_count``` = 5 (int, optional) – Ignores all words with total frequency lower than this.
- ```workers``` = 3 (int, optional) – Use this many worker threads to train the model (faster training on multicore machines).
- ```sg``` = 0 ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
- ```hs``` = 0 ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
- ```negative``` = 5 (int, optional) – If > 0, negative sampling will be used; the value specifies how many “noise words” should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.
- ```max_vocab_size``` = None (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, the infrequent ones are pruned.
- ```iter``` = 5 (int, optional) – Number of iterations (epochs) over the corpus. Renamed to ```epochs``` in Gensim 4.0.
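To illustrate what ```window``` controls, the sketch below lists the (current word, context word) pairs that a skip-gram model (```sg=1```) would see for one sentence. It is a simplified illustration with a fixed window and a hypothetical helper name, `skipgram_pairs`; actual training applies further refinements such as the frequent-word downsampling reported in the logs.

```python
def skipgram_pairs(sentence, window=5):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
print(pairs)
```

With ```window=1``` each word is paired only with its immediate neighbours; a larger window adds more distant context words to every training example.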

In [23]:
import logging
# Set up logging to monitor gensim's progress
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

In [24]:
dataset = api.load("text8")
data = [d for d in dataset]

# Train Word2Vec model
model = Word2Vec(data)

INFO - 21:36:00: collecting all words and their counts
INFO - 21:36:00: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 21:36:07: collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
INFO - 21:36:07: Loading a fresh vocabulary
INFO - 21:36:08: effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
INFO - 21:36:08: effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
INFO - 21:36:08: deleting the raw counts dictionary of 253854 items
INFO - 21:36:08: sample=0.001 downsamples 38 most-common words
INFO - 21:36:08: downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
INFO - 21:36:09: estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
INFO - 21:36:09: resetting layer weights
INFO - 21:36:33: training model with 3 workers on 71290 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
INFO - 21:3

INFO - 21:37:40: EPOCH 4 - PROGRESS: at 40.51% examples, 720051 words/s, in_qsize 6, out_qsize 0
INFO - 21:37:41: EPOCH 4 - PROGRESS: at 46.15% examples, 718364 words/s, in_qsize 5, out_qsize 0
INFO - 21:37:42: EPOCH 4 - PROGRESS: at 52.03% examples, 718907 words/s, in_qsize 4, out_qsize 1
INFO - 21:37:43: EPOCH 4 - PROGRESS: at 57.97% examples, 721033 words/s, in_qsize 5, out_qsize 0
INFO - 21:37:44: EPOCH 4 - PROGRESS: at 63.73% examples, 721238 words/s, in_qsize 6, out_qsize 0
INFO - 21:37:45: EPOCH 4 - PROGRESS: at 69.43% examples, 720144 words/s, in_qsize 4, out_qsize 1
INFO - 21:37:46: EPOCH 4 - PROGRESS: at 75.49% examples, 721939 words/s, in_qsize 5, out_qsize 0
INFO - 21:37:47: EPOCH 4 - PROGRESS: at 81.36% examples, 722033 words/s, in_qsize 5, out_qsize 0
INFO - 21:37:48: EPOCH 4 - PROGRESS: at 87.18% examples, 721972 words/s, in_qsize 5, out_qsize 0
INFO - 21:37:49: EPOCH 4 - PROGRESS: at 93.12% examples, 722843 words/s, in_qsize 5, out_qsize 0
INFO - 21:37:50: EPOCH 4 - PRO

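```Word2Vec``` called without ```sentences``` or ```corpus_file``` only initializes the model; the vocabulary must then be built and the model trained explicitly.

In [16]:
model = Word2Vec()
model.build_vocab(data)  # the vocabulary must exist before training
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)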

**Save and load model**

In [26]:
model.save('w2v_newmodel')
model = Word2Vec.load('w2v_newmodel')

INFO - 21:38:12: saving Word2Vec object under w2v_newmodel, separately None
INFO - 21:38:12: not storing attribute vectors_norm
INFO - 21:38:12: not storing attribute cum_table
INFO - 21:38:14: saved w2v_newmodel
INFO - 21:38:14: loading Word2Vec object from w2v_newmodel
INFO - 21:38:17: loading wv recursively from w2v_newmodel.wv.* with mmap=None
INFO - 21:38:17: setting ignored attribute vectors_norm to None
INFO - 21:38:17: loading vocabulary recursively from w2v_newmodel.vocabulary.* with mmap=None
INFO - 21:38:17: loading trainables recursively from w2v_newmodel.trainables.* with mmap=None
INFO - 21:38:17: setting ignored attribute cum_table to None
INFO - 21:38:17: loaded w2v_newmodel


You can **continue training**

In [25]:
model.train([["hello", "world"]], total_examples=1, epochs=1)

INFO - 21:38:12: training model with 3 workers on 71290 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
INFO - 21:38:12: worker thread finished; awaiting finish of 2 more threads
INFO - 21:38:12: worker thread finished; awaiting finish of 1 more threads
INFO - 21:38:12: worker thread finished; awaiting finish of 0 more threads
INFO - 21:38:12: EPOCH - 1 : training on 2 raw words (2 effective words) took 0.0s, 294 effective words/s
INFO - 21:38:12: training on a 2 raw words (2 effective words) took 0.0s, 103 effective words/s


(2, 2)

**Word2vec input**


1. Parameter ```sentences```

Gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words.

1.1 *List of lists of tokens*

In [34]:
input1 = [['first', 'sentence'], ['second', 'sentence']]
model1 = Word2Vec(input1, min_count=1)

INFO - 21:44:41: collecting all words and their counts
INFO - 21:44:41: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 21:44:41: collected 3 word types from a corpus of 4 raw words and 2 sentences
INFO - 21:44:41: Loading a fresh vocabulary
INFO - 21:44:41: effective_min_count=1 retains 3 unique words (100% of original 3, drops 0)
INFO - 21:44:41: effective_min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
INFO - 21:44:41: deleting the raw counts dictionary of 3 items
INFO - 21:44:41: sample=0.001 downsamples 3 most-common words
INFO - 21:44:41: downsampling leaves estimated 0 word corpus (5.7% of prior 4)
INFO - 21:44:41: estimated required memory for 3 words and 100 dimensions: 3900 bytes
INFO - 21:44:41: resetting layer weights
INFO - 21:44:41: training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
INFO - 21:44:41: worker thread finished; awaiting finish of 2 more threads
INFO - 21:44:41

1.2 Gensim only requires that the input provide sentences sequentially when iterated over. *There is no need to keep everything in RAM.*

In [None]:
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Stream one whitespace-tokenized line at a time from every file in the directory
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.split()

input2 = MySentences('/some/directory')  # a memory-friendly iterable
model = Word2Vec(input2)

See [BrownCorpus](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.BrownCorpus), [Text8Corpus](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Text8Corpus) or [LineSentence](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) in word2vec module for such examples.

BrownCorpus and Text8Corpus were implemented specifically for the Brown corpus and the text8 dataset. The text8 corpus, for example, consists of a single line of cleaned and joined Wikipedia articles.

LineSentence iterates over a file that contains sentences: one line = one sentence. Words must already be preprocessed and separated by whitespace.
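Training reads the corpus more than once (one pass to collect the vocabulary, then one pass per epoch, as the logs above show), so the input has to be an iterable that can be restarted, not a one-shot generator. The following is a minimal sketch of such an iterable; the class name `LineSentences` and the temporary file are assumptions for illustration, not part of Gensim.

```python
import os
import tempfile

class LineSentences:
    """Hypothetical minimal stand-in for LineSentence: one line = one sentence."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be read many times
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Write a tiny two-sentence corpus to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first sentence\nsecond sentence\n")

sentences = LineSentences(tmp.name)
first_pass = list(sentences)
second_pass = list(sentences)  # a plain generator would be exhausted here
print(first_pass)
print(second_pass == first_pass)
os.remove(tmp.name)
```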

In [None]:
from gensim.test.utils import datapath
from gensim.models.word2vec import LineSentence

input3 = LineSentence(datapath('lee_background.cor'))
model = Word2Vec(input3)

2. Parameter ```corpus_file``` - path to a corpus file in LineSentence format

If the corpus is already in this format, its path can be passed as ```corpus_file``` instead of building a ```LineSentence``` iterable as in the previous cell.
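A tokenized corpus can be brought into this format by writing one sentence per line with tokens separated by spaces. The sketch below uses only the standard library; the file name `corpus.txt` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

sentences = [["first", "sentence"], ["second", "sentence"]]

# One sentence per line, tokens separated by a single space
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(path, "w", encoding="utf-8") as f:
    for tokens in sentences:
        f.write(" ".join(tokens) + "\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['first sentence', 'second sentence']

# With gensim installed, this path could then be passed as Word2Vec(corpus_file=path)
```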

## Exploring the model

- Extract the trained word vectors from model.wv:

In [19]:
model.wv['topic']

array([-0.9116209 ,  0.76353014,  0.6720918 ,  1.1859015 ,  0.68622196,
        1.1805097 ,  0.6357555 , -0.03566878, -0.61359787, -0.8718574 ,
        0.652692  , -0.04561124,  0.5598307 ,  0.9242337 , -1.4690322 ,
        0.22043759,  0.80602527, -0.868186  , -0.12753902, -0.06760293,
        0.4953544 , -0.38806862,  1.5687288 , -0.30487606,  2.2516096 ,
       -1.6864485 ,  0.44446105,  1.0207226 , -0.29092094,  0.43139705,
        0.26387596, -2.011847  , -1.0975704 ,  0.9253219 , -1.9253153 ,
        0.6592959 ,  2.0866177 ,  0.06789557, -0.2630213 , -1.0449048 ,
        1.7788374 , -1.7338845 , -0.9940762 ,  0.65088135, -0.3835365 ,
        0.4743413 , -0.9915545 ,  0.34806243, -0.10273027, -0.62624365,
        0.01318653, -0.04616779, -0.44287038,  0.9173674 ,  0.40987295,
        1.3166438 ,  0.310701  ,  0.71503466, -1.0450833 ,  0.07349207,
        0.719098  , -0.3179902 ,  0.59884137,  1.0848142 , -0.29605114,
        0.02436474, -0.57979673,  0.5711215 ,  2.0835128 , -0.87

- Similarity

In [21]:
model.wv.most_similar(positive = ['topic'])

[('discussion', 0.7259880304336548),
 ('discourse', 0.6891059279441833),
 ('debate', 0.6597331762313843),
 ('interpretation', 0.6579047441482544),
 ('comment', 0.654038667678833),
 ('subject', 0.6490315198898315),
 ('viewpoint', 0.6357095241546631),
 ('misunderstanding', 0.6349965333938599),
 ('discussions', 0.6340876221656799),
 ('commentary', 0.6307163238525391)]

In [80]:
print('Similarity between (walk, walking): ', model.wv.similarity('walk', 'walking'))
print('Similarity between (duck, ducks): ', model.wv.similarity('duck', 'ducks'))
print('Similarity between (banana, pear): ', model.wv.similarity('banana', 'pear'))
print()
print('Similarity between (banana, sky): ', model.wv.similarity('banana', 'sky'))
print('Similarity between (walk, lie): ', model.wv.similarity('walk', 'lie'))
print('Similarity between (dark, slow): ', model.wv.similarity('dark', 'slow'))

Similarity between (walk, walking):  0.7154655820526513
Similarity between (duck, ducks):  0.7508801808704236
Similarity between (banana, pear):  0.7704371854673051

Similarity between (banana, sky):  0.1269163860707912
Similarity between (walk, lie):  0.22455184386987476
Similarity between (dark, slow):  0.24851554413785346
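The similarity score is the cosine of the angle between the two word vectors. A minimal standard-library sketch of that computation, using made-up vectors in place of trained embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: close to 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0.0
```

Values near 1 mean the vectors point in almost the same direction, which is why related pairs such as (walk, walking) score much higher than unrelated pairs such as (banana, sky).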


- Analogy

In [28]:
model.wv.most_similar(positive = ['lower', 'tall'], negative = ['low'])

[('taller', 0.6156757473945618),
 ('brighter', 0.5981073379516602),
 ('shallower', 0.584308922290802),
 ('feet', 0.572521448135376),
 ('upper', 0.5704020857810974),
 ('foot', 0.5655150413513184),
 ('hind', 0.5584174394607544),
 ('thicker', 0.5536401271820068),
 ('lip', 0.5493001341819763),
 ('aisle', 0.5444109439849854)]

In [93]:
model.wv.most_similar(positive = ['mother', 'man'], negative = ['woman'])

[('father', 0.7658834457397461),
 ('lover', 0.6715303659439087),
 ('grandfather', 0.6457886695861816),
 ('neighbor', 0.6330509185791016),
 ('grandmother', 0.630250096321106),
 ('son', 0.628896176815033),
 ('mistress', 0.6158114671707153),
 ('uncle', 0.6074668169021606),
 ('faramir', 0.6050997972488403),
 ('corpse', 0.602986216545105)]
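Analogy queries amount to vector arithmetic: add the positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The sketch below illustrates the idea with made-up two-dimensional vectors; the function `analogy` and the toy vocabulary are assumptions for illustration, not Gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # the query words themselves are skipped
    scores = {w: sum(q * x for q, x in zip(query, unit(v)))
              for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up 2-d vectors: first axis ~ gender, second axis ~ generation
toy = {"man": [1, 0], "woman": [-1, 0], "father": [1, 1],
       "mother": [-1, 1], "son": [1, -1], "daughter": [-1, -1]}
ranking = analogy(toy, positive=["mother", "man"], negative=["woman"])
print(ranking[0][0])  # father
```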

- Matching

In [86]:
print(model.wv.doesnt_match(['car', 'airplane', 'bed']), " doesn't match to [car, airplane]")
print(model.wv.doesnt_match(['red', 'blue', 'roof']), " doesn't match to [red, blue]")

bed  doesn't match to [car, airplane]
roof  doesn't match to [red, blue]
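One way to find the odd word out is to average the unit-length vectors of all the words and pick the word least similar to that average. The sketch below shows this approach on made-up vectors; the helper `odd_one_out` is hypothetical, and this being how ```doesnt_match``` works internally is an assumption.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: the two vehicles point in a similar direction, the bed does not
toy = {"car": [1.0, 0.1], "airplane": [0.9, 0.2], "bed": [0.1, 1.0]}
result = odd_one_out(toy, ["car", "airplane", "bed"])
print(result)  # bed
```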


## Compare with other pretrained embeddings

In [41]:
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
word2vec_model300 = api.load('word2vec-google-news-300')
glove_model300 = api.load('glove-wiki-gigaword-300')

To determine which one performs better, use each model's ```evaluate_word_analogies()``` method.

It computes the performance of the model on an analogy test set. The accuracy is reported (printed to the log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.

Input:
- ```analogies``` (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See [this file](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/datasets/questions-words.txt) as example.

Output:
- ```score``` (float) – The overall evaluation score on the entire evaluation set

- ```sections``` (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
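The layout of the analogy file can be made concrete with a small parser. This is a sketch with a hypothetical helper, `parse_analogies`, which groups the 4-word lines under the section header that precedes them; it only illustrates the file format and performs no evaluation.

```python
def parse_analogies(lines):
    """Group 4-word analogy lines under the ': SECTION NAME' header preceding them."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            current = line[2:]
            sections[current] = []
        else:
            sections.setdefault(current, []).append(tuple(line.split()))
    return sections

sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
"""
sections = parse_analogies(sample.splitlines())
print(sections)
```

Each 4-tuple reads "the first word is to the second as the third is to the fourth"; a prediction counts as correct when the model recovers the fourth word from the other three.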

In [38]:
f = open('questions-words.txt', 'r')
for i in range(5):
    print(f.readline())
f.close()

: capital-common-countries

Athens Greece Baghdad Iraq

Athens Greece Bangkok Thailand

Athens Greece Beijing China

Athens Greece Berlin Germany



In [42]:
# word2vec accuracy
print(word2vec_model300.evaluate_word_analogies(analogies='questions-words.txt')[0])

# fastText accuracy
print(fasttext_model300.evaluate_word_analogies(analogies='questions-words.txt')[0])

# GloVe accuracy
print(glove_model300.evaluate_word_analogies(analogies='questions-words.txt')[0])

0.7401448525607863
0.8827876424099353
0.7195422354510931


## Doc2Vec

Unlike Word2Vec, a Doc2Vec model provides a vectorised representation of a group of words taken collectively as a single unit. It is not a simple average of the word vectors of the words in the sentence.

The training data for ```Doc2Vec``` should be a list of ```TaggedDocument``` objects. To create one, pass a list of words and a list of tags (here a single unique integer) to ```models.doc2vec.TaggedDocument()```.

In [47]:
#prepare dataset
def create_tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield TaggedDocument(list_of_words, [i])

dataset = api.load("text8")
data = [d for d in dataset]

train_data = list(create_tagged_document(data))
print(train_data[:1])

[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers'

In [50]:
#Train model
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(train_data)
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)

INFO - 22:21:14: collecting all words and their counts
INFO - 22:21:14: PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO - 22:21:21: collected 253854 word types and 1701 unique tags from a corpus of 1701 examples and 17005207 words
INFO - 22:21:21: Loading a fresh vocabulary
INFO - 22:21:35: effective_min_count=1 retains 253854 unique words (100% of original 253854, drops 0)
INFO - 22:21:35: effective_min_count=1 leaves 17005207 word corpus (100% of original 17005207, drops 0)
INFO - 22:21:37: deleting the raw counts dictionary of 253854 items
INFO - 22:21:37: sample=0.001 downsamples 36 most-common words
INFO - 22:21:37: downsampling leaves estimated 12819131 word corpus (75.4% of prior 17005207)
INFO - 22:21:38: estimated required memory for 253854 words and 50 dimensions: 228808800 bytes
INFO - 22:21:38: resetting layer weights
INFO - 22:22:53: training model with 3 workers on 253854 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 windo

INFO - 22:23:59: EPOCH 4 - PROGRESS: at 63.49% examples, 736983 words/s, in_qsize 5, out_qsize 0
INFO - 22:24:00: EPOCH 4 - PROGRESS: at 69.43% examples, 738596 words/s, in_qsize 5, out_qsize 0
INFO - 22:24:01: EPOCH 4 - PROGRESS: at 75.43% examples, 740499 words/s, in_qsize 5, out_qsize 0
INFO - 22:24:02: EPOCH 4 - PROGRESS: at 81.48% examples, 741881 words/s, in_qsize 5, out_qsize 0
INFO - 22:24:03: EPOCH 4 - PROGRESS: at 87.48% examples, 743220 words/s, in_qsize 5, out_qsize 0
INFO - 22:24:04: EPOCH 4 - PROGRESS: at 93.24% examples, 742520 words/s, in_qsize 5, out_qsize 0
INFO - 22:24:05: EPOCH 4 - PROGRESS: at 99.12% examples, 742810 words/s, in_qsize 5, out_qsize 0
INFO - 22:24:05: worker thread finished; awaiting finish of 2 more threads
INFO - 22:24:05: worker thread finished; awaiting finish of 1 more threads
INFO - 22:24:05: worker thread finished; awaiting finish of 0 more threads
INFO - 22:24:05: EPOCH - 4 : training on 17005207 raw words (12821574 effective words) took 17.3

INFO - 22:25:12: EPOCH 8 - PROGRESS: at 78.66% examples, 716839 words/s, in_qsize 5, out_qsize 0
INFO - 22:25:13: EPOCH 8 - PROGRESS: at 84.71% examples, 720528 words/s, in_qsize 5, out_qsize 0
INFO - 22:25:14: EPOCH 8 - PROGRESS: at 90.83% examples, 724312 words/s, in_qsize 5, out_qsize 0
INFO - 22:25:15: EPOCH 8 - PROGRESS: at 97.06% examples, 727747 words/s, in_qsize 5, out_qsize 0
INFO - 22:25:15: worker thread finished; awaiting finish of 2 more threads
INFO - 22:25:15: worker thread finished; awaiting finish of 1 more threads
INFO - 22:25:15: worker thread finished; awaiting finish of 0 more threads
INFO - 22:25:15: EPOCH - 8 : training on 17005207 raw words (12820754 effective words) took 17.6s, 729222 effective words/s
INFO - 22:25:16: EPOCH 9 - PROGRESS: at 6.06% examples, 759399 words/s, in_qsize 5, out_qsize 0
INFO - 22:25:17: EPOCH 9 - PROGRESS: at 12.29% examples, 768581 words/s, in_qsize 5, out_qsize 0
INFO - 22:25:19: EPOCH 9 - PROGRESS: at 18.52% examples, 777295 words/

INFO - 22:26:23: worker thread finished; awaiting finish of 1 more threads
INFO - 22:26:23: worker thread finished; awaiting finish of 0 more threads
INFO - 22:26:23: EPOCH - 12 : training on 17005207 raw words (12819724 effective words) took 16.7s, 768239 effective words/s
INFO - 22:26:24: EPOCH 13 - PROGRESS: at 6.06% examples, 768372 words/s, in_qsize 5, out_qsize 0
INFO - 22:26:25: EPOCH 13 - PROGRESS: at 12.23% examples, 772560 words/s, in_qsize 5, out_qsize 0
INFO - 22:26:26: EPOCH 13 - PROGRESS: at 18.28% examples, 770763 words/s, in_qsize 5, out_qsize 0
INFO - 22:26:27: EPOCH 13 - PROGRESS: at 24.22% examples, 767778 words/s, in_qsize 4, out_qsize 1
INFO - 22:26:28: EPOCH 13 - PROGRESS: at 30.34% examples, 770588 words/s, in_qsize 5, out_qsize 0
INFO - 22:26:29: EPOCH 13 - PROGRESS: at 36.27% examples, 770233 words/s, in_qsize 5, out_qsize 0
INFO - 22:26:30: EPOCH 13 - PROGRESS: at 42.39% examples, 771526 words/s, in_qsize 6, out_qsize 0
INFO - 22:26:31: EPOCH 13 - PROGRESS: at

INFO - 22:27:33: EPOCH 17 - PROGRESS: at 24.51% examples, 777816 words/s, in_qsize 5, out_qsize 0
INFO - 22:27:34: EPOCH 17 - PROGRESS: at 30.75% examples, 781975 words/s, in_qsize 5, out_qsize 0
INFO - 22:27:35: EPOCH 17 - PROGRESS: at 36.74% examples, 780743 words/s, in_qsize 5, out_qsize 2
INFO - 22:27:36: EPOCH 17 - PROGRESS: at 42.80% examples, 780279 words/s, in_qsize 5, out_qsize 0
INFO - 22:27:37: EPOCH 17 - PROGRESS: at 49.03% examples, 781229 words/s, in_qsize 4, out_qsize 1
INFO - 22:27:38: EPOCH 17 - PROGRESS: at 54.97% examples, 779139 words/s, in_qsize 6, out_qsize 0
INFO - 22:27:39: EPOCH 17 - PROGRESS: at 60.91% examples, 777276 words/s, in_qsize 5, out_qsize 0
INFO - 22:27:40: EPOCH 17 - PROGRESS: at 66.96% examples, 776183 words/s, in_qsize 5, out_qsize 0
INFO - 22:27:41: EPOCH 17 - PROGRESS: at 72.96% examples, 775516 words/s, in_qsize 5, out_qsize 0
INFO - 22:27:42: EPOCH 17 - PROGRESS: at 79.13% examples, 775255 words/s, in_qsize 5, out_qsize 0
INFO - 22:27:43: EPO

INFO - 22:28:46: EPOCH 21 - PROGRESS: at 48.03% examples, 765381 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:47: EPOCH 21 - PROGRESS: at 54.03% examples, 766481 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:48: EPOCH 21 - PROGRESS: at 59.91% examples, 765068 words/s, in_qsize 4, out_qsize 1
INFO - 22:28:49: EPOCH 21 - PROGRESS: at 66.02% examples, 766108 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:50: EPOCH 21 - PROGRESS: at 72.02% examples, 766101 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:51: EPOCH 21 - PROGRESS: at 78.13% examples, 765972 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:52: EPOCH 21 - PROGRESS: at 84.07% examples, 765448 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:53: EPOCH 21 - PROGRESS: at 90.30% examples, 767534 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:54: EPOCH 21 - PROGRESS: at 96.53% examples, 769020 words/s, in_qsize 5, out_qsize 0
INFO - 22:28:54: worker thread finished; awaiting finish of 2 more threads
INFO - 22:28:54: worker thread finished; aw

INFO - 22:29:58: EPOCH 25 - PROGRESS: at 83.77% examples, 761760 words/s, in_qsize 5, out_qsize 0
INFO - 22:29:59: EPOCH 25 - PROGRESS: at 89.89% examples, 762338 words/s, in_qsize 5, out_qsize 0
INFO - 22:30:00: EPOCH 25 - PROGRESS: at 96.00% examples, 763021 words/s, in_qsize 5, out_qsize 0
INFO - 22:30:01: worker thread finished; awaiting finish of 2 more threads
INFO - 22:30:01: worker thread finished; awaiting finish of 1 more threads
INFO - 22:30:01: worker thread finished; awaiting finish of 0 more threads
INFO - 22:30:01: EPOCH - 25 : training on 17005207 raw words (12820973 effective words) took 16.8s, 763256 effective words/s
INFO - 22:30:02: EPOCH 26 - PROGRESS: at 6.06% examples, 771920 words/s, in_qsize 5, out_qsize 0
INFO - 22:30:03: EPOCH 26 - PROGRESS: at 12.29% examples, 781641 words/s, in_qsize 5, out_qsize 0
INFO - 22:30:04: EPOCH 26 - PROGRESS: at 18.28% examples, 775194 words/s, in_qsize 5, out_qsize 0
INFO - 22:30:05: EPOCH 26 - PROGRESS: at 24.63% examples, 78168

INFO - 22:31:09: worker thread finished; awaiting finish of 1 more threads
INFO - 22:31:09: worker thread finished; awaiting finish of 0 more threads
INFO - 22:31:09: EPOCH - 29 : training on 17005207 raw words (12821869 effective words) took 17.3s, 739975 effective words/s
INFO - 22:31:10: EPOCH 30 - PROGRESS: at 4.47% examples, 571252 words/s, in_qsize 5, out_qsize 0
INFO - 22:31:11: EPOCH 30 - PROGRESS: at 9.88% examples, 627045 words/s, in_qsize 5, out_qsize 0
INFO - 22:31:12: EPOCH 30 - PROGRESS: at 15.46% examples, 654599 words/s, in_qsize 5, out_qsize 0
INFO - 22:31:13: EPOCH 30 - PROGRESS: at 21.40% examples, 679108 words/s, in_qsize 6, out_qsize 0
INFO - 22:31:14: EPOCH 30 - PROGRESS: at 27.34% examples, 695887 words/s, in_qsize 5, out_qsize 0
INFO - 22:31:15: EPOCH 30 - PROGRESS: at 33.27% examples, 708029 words/s, in_qsize 5, out_qsize 0
INFO - 22:31:16: EPOCH 30 - PROGRESS: at 39.27% examples, 715804 words/s, in_qsize 4, out_qsize 1
INFO - 22:31:17: EPOCH 30 - PROGRESS: at 

INFO - 22:32:19: EPOCH 34 - PROGRESS: at 12.52% examples, 790750 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:20: EPOCH 34 - PROGRESS: at 18.64% examples, 786072 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:21: EPOCH 34 - PROGRESS: at 24.87% examples, 788002 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:22: EPOCH 34 - PROGRESS: at 31.04% examples, 789661 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:23: EPOCH 34 - PROGRESS: at 37.10% examples, 787596 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:24: EPOCH 34 - PROGRESS: at 43.27% examples, 788060 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:25: EPOCH 34 - PROGRESS: at 49.44% examples, 788256 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:26: EPOCH 34 - PROGRESS: at 55.44% examples, 785311 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:27: EPOCH 34 - PROGRESS: at 61.55% examples, 784677 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:28: EPOCH 34 - PROGRESS: at 67.43% examples, 781417 words/s, in_qsize 5, out_qsize 0
INFO - 22:32:29: EPO

INFO - 22:33:32: EPOCH 38 - PROGRESS: at 42.68% examples, 775432 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:33: EPOCH 38 - PROGRESS: at 48.50% examples, 771552 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:34: EPOCH 38 - PROGRESS: at 53.20% examples, 752264 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:35: EPOCH 38 - PROGRESS: at 59.02% examples, 751714 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:36: EPOCH 38 - PROGRESS: at 63.79% examples, 739006 words/s, in_qsize 6, out_qsize 0
INFO - 22:33:37: EPOCH 38 - PROGRESS: at 69.78% examples, 741063 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:38: EPOCH 38 - PROGRESS: at 75.90% examples, 743524 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:39: EPOCH 38 - PROGRESS: at 82.01% examples, 745479 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:40: EPOCH 38 - PROGRESS: at 88.30% examples, 749269 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:41: EPOCH 38 - PROGRESS: at 94.47% examples, 751553 words/s, in_qsize 5, out_qsize 0
INFO - 22:33:41: wor

In [51]:
#Get document vector
print(model.infer_vector('gensim is really awesome'.split(' ')))

[ 0.18035454  0.41629905  0.25469506 -0.22324803 -0.09268823 -0.05151142
 -0.08480603 -0.00570442  0.08062097  0.00310741  0.17478175 -0.00154936
 -0.09330877 -0.02339078  0.0199794   0.04549846 -0.04683316 -0.03828195
 -0.3936776  -0.2056224  -0.04153012 -0.2085073   0.3692062   0.12547413
 -0.03771724  0.09459786  0.34567693 -0.34003636 -0.14349678  0.0229786
  0.1584827   0.02953391  0.13116093  0.02679738  0.02890064  0.06687283
  0.2509888   0.15626836 -0.03766628  0.24216698  0.04326655  0.18548243
  0.25629497 -0.10298692 -0.3509259  -0.06340942 -0.22913803 -0.14003526
  0.04637286 -0.14653628]


Used materials: 
    
- https://www.machinelearningplus.com/nlp/gensim-tutorial/
- https://radimrehurek.com/gensim/models/word2vec.html
- https://radimrehurek.com/gensim/models/keyedvectors.html
- https://rare-technologies.com/word2vec-tutorial/