In [1]:
%matplotlib inline

In [2]:
import pprint


How to Apply Doc2Vec to Reproduce the ['Paragraph Vector' paper](https://arxiv.org/pdf/1405.4053.pdf)
==============================================================


In [3]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Load [IMDB archive](http://ai.stanford.edu/~amaas/data/sentiment/)
-----------
* The corpus contains 100K movie reviews. Each review is a single line of text with multiple sentences:

```
One of the best movie-dramas I have ever seen. We do a lot of acting in the
church and this is one that can be used as a resource that highlights all the
good things that actors can do in their work. I highly recommend this one,
especially for those who have an interest in acting, as a "must see."
```

* These reviews will be the **documents**, divided into 25K for training (50% pos, 50% neg), 25K for testing (50% pos, 50% neg), and 50K unlabeled.

* Document metadata:
    * words: The text of the document, as a list of words.
    * tags: Used to keep the index of the document in the entire dataset.
    * split: 'train', 'test' or 'extra'. Determines how the document will be used.
    * sentiment: 1 (positive), 0 (negative) or None (unlabeled).

In [4]:
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')
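
As a quick illustration of these fields, the sketch below builds one toy document by hand. The review text, index, and label are made up for the example; the real documents are built from the archive later in this notebook.

```python
import collections

# Same record layout as the notebook's SentimentDocument.
SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')

# Hypothetical toy review, not taken from the corpus.
toy_doc = SentimentDocument(
    words='A wonderful little film .'.split(),
    tags=[0],          # index of the document in the full dataset
    split='train',     # 'train', 'test' or 'extra'
    sentiment=1.0,     # 1.0 positive, 0.0 negative, None unlabeled
)

print(toy_doc.words)
print(toy_doc.tags[0], toy_doc.split, toy_doc.sentiment)
```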

In [5]:
!ls ~/Downloads/*.gz

/home/bjpcjp/Downloads/aclImdb_v1.tar.gz


In [6]:
import io, re, tarfile, os.path, smart_open, gensim.utils



In [7]:
#def download_dataset(url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'):

url   = 'file://~/Downloads/aclImdb_v1.tar.gz'
fname = url.split('/')[-1]
print(fname)

if os.path.isfile(fname):
    print("ok - it's a file.")
    
with smart_open.open(url, 'rb', ignore_ext=True) as fin:
    with smart_open.open(fname, 'wb', ignore_ext=True) as fout:
        while True:
            buf = fin.read(io.DEFAULT_BUFFER_SIZE)
            if not buf:
                break
            fout.write(buf)
            
print("downloading to local storage first.")

aclImdb_v1.tar.gz
downloading to local storage first.


In [8]:
def create_sentiment_document(name, text, index):
    _, split, sentiment_str, _ = name.split('/')
    sentiment                  = {'pos': 1.0, 
                                  'neg': 0.0, 
                                  'unsup': None}[sentiment_str]

    if sentiment is None:
        split = 'extra'

    tokens = gensim.utils.to_unicode(text).split()
    
    return SentimentDocument(tokens, [index], split, sentiment)

In [9]:
def extract_documents():
    index = 0

    with tarfile.open(fname, mode='r:gz') as tar:
        for member in tar.getmembers():
            if re.match(
                r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$',
                member.name):
                member_bytes = tar.extractfile(member).read()
                member_text  = member_bytes.decode('utf-8', errors='replace')

                assert member_text.count('\n') == 0
                yield create_sentiment_document(member.name, member_text, index)
                index += 1

alldocs = list(extract_documents())
#print(alldocs)

In [10]:
# single document:
pprint.pprint(alldocs[27])

SentimentDocument(words=['I', 'was', 'looking', 'forward', 'to', 'this', 'movie.', 'Trustworthy', 'actors,', 'interesting', 'plot.', 'Great', 'atmosphere', 'then', '?????', 'IF', 'you', 'are', 'going', 'to', 'attempt', 'something', 'that', 'is', 'meant', 'to', 'encapsulate', 'the', 'meaning', 'of', 'life.', 'First.', 'Know', 'it.', 'OK', 'I', 'did', 'not', 'expect', 'the', 'directors', 'or', 'writers', 'to', 'actually', 'know', 'the', 'meaning', 'but', 'I', 'thought', 'they', 'may', 'have', 'offered', 'crumbs', 'to', 'peck', 'at', 'and', 'treats', 'to', 'add', 'fuel', 'to', 'the', 'fire-Which!', 'they', 'almost', 'did.', 'Things', 'I', "didn't", 'get.', 'A', 'woman', 'wandering', 'around', 'in', 'dark', 'places', 'and', 'lonely', 'car', 'parks', 'alone-oblivious', 'to', 'the', 'consequences.', 'Great', 'riddles', 'that', 'fell', 'by', 'the', 'wayside.', 'The', 'promise', 'of', 'the', 'knowledge', 'therein', 'contained', 'by', 'the', 'original', 'so-called', 'criminal.', 'I', 'had', 'no

In [11]:
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs  = [doc for doc in alldocs if doc.split == 'test']

print('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))

100000 docs: 25000 train-sentiment, 25000 test-sentiment


Set-up Doc2Vec Training & Evaluation Models
-------------------------------------------
* The setup follows Le & Mikolov, [Distributed Representations of Sentences and Documents](http://cs.stanford.edu/~quocle/paragraph_vector.pdf), and Mikolov's [go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ) script:

    ./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1

We depart from those settings as follows:

* 100-dimensional vectors (the 400-d vectors in the paper eat a lot of memory).
* Frequent-word subsampling seems to decrease sentiment-prediction accuracy, so it's left out (*sample=0*).
* *cbow=0* means skip-gram, which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with *dm=0*.
* Added to that DBOW model are two DM models: one that averages context vectors (*dm_mean*) and one that concatenates them (*dm_concat*, resulting in a much larger, slower, more data-hungry model).
* *min_count=2* saves model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each-doc vectors themselves).
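
To make the size difference of the concatenating model concrete, here is a rough sketch of the input-layer arithmetic. The formula (one doc-vector slot plus `2 * window` word slots, each `vector_size` wide) is an assumption about how the concatenated layer is sized; it is consistent with the "1100-dimensional layer1" line in this notebook's log output. The helper name `concat_layer_size` is hypothetical.

```python
def concat_layer_size(vector_size, window, doc_tags=1):
    """Width of the input layer when DM concatenates instead of averaging.

    Assumed formula: one slot per doc tag plus `window` word slots on each
    side of the target word, every slot `vector_size` wide.
    """
    return (doc_tags + 2 * window) * vector_size

# Settings used for the concatenating DM model in this notebook.
print(concat_layer_size(vector_size=100, window=5))  # 1100
```

An averaging DM model, by contrast, collapses the context into a single `vector_size`-wide vector, which is one reason it is cheaper to train.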




In [12]:
import multiprocessing
from collections import OrderedDict
import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"
from gensim.models.doc2vec import Doc2Vec

In [13]:
common_kwargs = dict(
    vector_size = 100, 
    epochs      = 20, 
    min_count   = 2,
    sample      = 0, 
    workers     = multiprocessing.cpu_count(), 
    negative    = 5, 
    hs          = 0,
)

# PV-DBOW plain
# PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
# PV-DM w/ concatenation - big, slow, experimental mode
# window=5 (both sides) approximates paper's apparent 10-word total window size

simple_models = [
    Doc2Vec(dm=0,                                                           **common_kwargs),
    Doc2Vec(dm=1,              window=10, alpha=0.05, comment='alpha=0.05', **common_kwargs),
    Doc2Vec(dm=1, dm_concat=1, window=5,                                    **common_kwargs),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print("%s vocabulary scanned & state initialized" % model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

2020-04-27 10:10:52,161 : INFO : using concatenative 1100-dimensional layer1
2020-04-27 10:10:52,162 : INFO : collecting all words and their counts
2020-04-27 10:10:52,163 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2020-04-27 10:10:52,618 : INFO : PROGRESS: at example #10000, processed 2292381 words (5039949/s), 150816 word types, 10000 tags
2020-04-27 10:10:53,065 : INFO : PROGRESS: at example #20000, processed 4573645 words (5114726/s), 238497 word types, 20000 tags
2020-04-27 10:10:53,510 : INFO : PROGRESS: at example #30000, processed 6865575 words (5154913/s), 312348 word types, 30000 tags
2020-04-27 10:10:53,973 : INFO : PROGRESS: at example #40000, processed 9190019 words (5030977/s), 377231 word types, 40000 tags
2020-04-27 10:10:54,434 : INFO : PROGRESS: at example #50000, processed 11557847 words (5143268/s), 438729 word types, 50000 tags
2020-04-27 10:10:54,898 : INFO : PROGRESS: at example #60000, processed 13899883 words (5058999/s), 49

Doc2Vec(dbow,d100,n5,mc2,t8) vocabulary scanned & state initialized


2020-04-27 10:12:04,498 : INFO : PROGRESS: at example #10000, processed 2292381 words (5192553/s), 150816 word types, 10000 tags
2020-04-27 10:12:04,977 : INFO : PROGRESS: at example #20000, processed 4573645 words (4770293/s), 238497 word types, 20000 tags
2020-04-27 10:12:05,446 : INFO : PROGRESS: at example #30000, processed 6865575 words (4887229/s), 312348 word types, 30000 tags
2020-04-27 10:12:05,953 : INFO : PROGRESS: at example #40000, processed 9190019 words (4594287/s), 377231 word types, 40000 tags
2020-04-27 10:12:06,462 : INFO : PROGRESS: at example #50000, processed 11557847 words (4658212/s), 438729 word types, 50000 tags
2020-04-27 10:12:06,938 : INFO : PROGRESS: at example #60000, processed 13899883 words (4933766/s), 493913 word types, 60000 tags
2020-04-27 10:12:07,459 : INFO : PROGRESS: at example #70000, processed 16270094 words (4557316/s), 548474 word types, 70000 tags
2020-04-27 10:12:07,948 : INFO : PROGRESS: at example #80000, processed 18598876 words (476500

Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8) vocabulary scanned & state initialized


2020-04-27 10:13:15,849 : INFO : PROGRESS: at example #10000, processed 2292381 words (5912743/s), 150816 word types, 10000 tags
2020-04-27 10:13:16,250 : INFO : PROGRESS: at example #20000, processed 4573645 words (5699672/s), 238497 word types, 20000 tags
2020-04-27 10:13:16,705 : INFO : PROGRESS: at example #30000, processed 6865575 words (5041044/s), 312348 word types, 30000 tags
2020-04-27 10:13:17,134 : INFO : PROGRESS: at example #40000, processed 9190019 words (5432419/s), 377231 word types, 40000 tags
2020-04-27 10:13:17,540 : INFO : PROGRESS: at example #50000, processed 11557847 words (5836598/s), 438729 word types, 50000 tags
2020-04-27 10:13:17,943 : INFO : PROGRESS: at example #60000, processed 13899883 words (5825183/s), 493913 word types, 60000 tags
2020-04-27 10:13:18,354 : INFO : PROGRESS: at example #70000, processed 16270094 words (5778162/s), 548474 word types, 70000 tags
2020-04-27 10:13:18,770 : INFO : PROGRESS: at example #80000, processed 18598876 words (560481

Doc2Vec(dm/c,d100,n5,w5,mc2,t8) vocabulary scanned & state initialized


* **Le and Mikolov** report that combining a paragraph vector from Distributed Bag of Words (DBOW) with one from Distributed Memory (DM) improves performance. Let's try pairing the models together for evaluation.
* We concatenate the paragraph vectors from each model with the help of a thin wrapper class included in a gensim test module.
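
The idea behind the pairing can be sketched in a few lines of plain Python. This is not gensim's wrapper class, only an illustration of the concatenation step; the function `concatenate_docvecs` and the toy lookup tables are hypothetical.

```python
def concatenate_docvecs(lookups, tag):
    """Join the vectors that several per-model lookups hold for one tag."""
    combined = []
    for lookup in lookups:
        combined.extend(lookup[tag])
    return combined

# Hypothetical stand-ins for two trained models' document-vector tables.
dbow_vectors = {0: [0.1, 0.2, 0.3]}
dm_vectors = {0: [0.7, 0.8]}

paired = concatenate_docvecs([dbow_vectors, dm_vectors], tag=0)
print(paired)       # [0.1, 0.2, 0.3, 0.7, 0.8]
print(len(paired))  # 5
```

With two 100-dimensional models, the paired vector for each document would have 200 dimensions, and the downstream classifier sees both representations at once.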

In [14]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])

2020-04-27 10:14:27,269 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-27 10:14:27,270 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


Predictive Evaluation Methods
-----------------------------
* *Doc2Vec* models return a vector representation of a document. For sentiment analysis, we want that vector to reflect the document's sentiment, so positive documents should lie far from negative documents in vector space.
* Train a logistic regression that takes the Doc2Vec document vectors as inputs and predicts sentiment labels.
* Test the logistic regression on the held-out test documents and measure the error rate.
* The error rate reflects *how well* the Doc2Vec model represents documents as vectors.
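
The error-rate step can be sketched on its own, independent of any particular classifier. The snippet assumes the classifier outputs a probability of the positive class per document; the function name `error_rate` and the toy numbers are made up for illustration.

```python
def error_rate(predicted_probabilities, true_labels):
    """Fraction of documents whose rounded prediction misses the label."""
    errors = sum(
        1 for prob, label in zip(predicted_probabilities, true_labels)
        if round(prob) != label
    )
    return errors / len(true_labels)

# Toy predictions for four documents: two are classified correctly.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1.0, 0.0, 0.0, 1.0]
print(error_rate(probs, labels))  # 0.5
```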

In [18]:
import numpy as np
import statsmodels.api as sm
from random import sample

def logistic_predictor_from_data(train_targets, train_regressors):

    logit     = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)

    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set):

    train_targets    = [doc.sentiment for doc in train_set]
    train_regressors = [test_model.docvecs[doc.tags[0]] for doc in train_set]
    train_regressors = sm.add_constant(train_regressors)
    predictor        = logistic_predictor_from_data(train_targets, train_regressors)
    test_regressors  = [test_model.docvecs[doc.tags[0]] for doc in test_set]
    test_regressors  = sm.add_constant(test_regressors)

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects         = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])
    errors           = len(test_predictions) - corrects
    error_rate       = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)

Bulk Training & Per-Model Evaluation
------------------------------------

* Note that training occurs on *all* documents of the dataset, which includes all train/test/extra docs. Because the native document order has similar-sentiment documents in large clumps, which is suboptimal for training, we use a once-shuffled copy of the full document list.
* We evaluate each model's sentiment-predictive power by its error rate on the test set.
* (On a 4-core 2.6 GHz Intel Core i7, these 20 training passes plus evaluation of the 3 main models take about an hour.)

In [19]:
from collections import defaultdict
error_rates = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [20]:
from random import shuffle
shuffled_alldocs = alldocs[:]
shuffle(shuffled_alldocs)

for model in simple_models:
    print("Training %s" % model)
    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)

    print("\nEvaluating %s" % model)
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print("\nEvaluating %s" % model)
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

2020-04-27 10:27:17,263 : INFO : training model with 8 workers on 265408 vocabulary and 100 features, using sg=1 hs=0 sample=0 negative=5 window=5


Training Doc2Vec(dbow,d100,n5,mc2,t8)


2020-04-27 10:27:18,281 : INFO : EPOCH 1 - PROGRESS: at 5.79% examples, 1314441 words/s, in_qsize 16, out_qsize 0
2020-04-27 10:27:19,289 : INFO : EPOCH 1 - PROGRESS: at 12.05% examples, 1373180 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:27:20,294 : INFO : EPOCH 1 - PROGRESS: at 18.32% examples, 1388456 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:27:21,304 : INFO : EPOCH 1 - PROGRESS: at 24.49% examples, 1397504 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:27:22,305 : INFO : EPOCH 1 - PROGRESS: at 30.72% examples, 1405892 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:27:23,306 : INFO : EPOCH 1 - PROGRESS: at 36.92% examples, 1410402 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:27:24,306 : INFO : EPOCH 1 - PROGRESS: at 43.18% examples, 1408384 words/s, in_qsize 16, out_qsize 0
2020-04-27 10:27:25,313 : INFO : EPOCH 1 - PROGRESS: at 49.36% examples, 1407913 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:27:26,319 : INFO : EPOCH 1 - PROGRESS: at 55.72% examples, 1412261 


Evaluating Doc2Vec(dbow,d100,n5,mc2,t8)


2020-04-27 10:32:42,821 : INFO : training model with 8 workers on 265408 vocabulary and 100 features, using sg=0 hs=0 sample=0 negative=5 window=10



0.103680 Doc2Vec(dbow,d100,n5,mc2,t8)

Training Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)


2020-04-27 10:32:43,870 : INFO : EPOCH 1 - PROGRESS: at 3.16% examples, 687800 words/s, in_qsize 16, out_qsize 1
2020-04-27 10:32:44,877 : INFO : EPOCH 1 - PROGRESS: at 6.69% examples, 752242 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:32:45,888 : INFO : EPOCH 1 - PROGRESS: at 10.03% examples, 753474 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:32:46,893 : INFO : EPOCH 1 - PROGRESS: at 13.18% examples, 745819 words/s, in_qsize 14, out_qsize 1
2020-04-27 10:32:47,895 : INFO : EPOCH 1 - PROGRESS: at 16.54% examples, 749552 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:32:48,896 : INFO : EPOCH 1 - PROGRESS: at 20.03% examples, 758329 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:32:49,915 : INFO : EPOCH 1 - PROGRESS: at 23.66% examples, 767855 words/s, in_qsize 16, out_qsize 0
2020-04-27 10:32:50,919 : INFO : EPOCH 1 - PROGRESS: at 27.27% examples, 775289 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:32:51,932 : INFO : EPOCH 1 - PROGRESS: at 30.77% examples, 778334 words/s, i


Evaluating Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)


2020-04-27 10:42:17,095 : INFO : training model with 8 workers on 265409 vocabulary and 1100 features, using sg=0 hs=0 sample=0 negative=5 window=5



0.169360 Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)

Training Doc2Vec(dm/c,d100,n5,w5,mc2,t8)


2020-04-27 10:42:18,218 : INFO : EPOCH 1 - PROGRESS: at 1.07% examples, 217021 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:19,337 : INFO : EPOCH 1 - PROGRESS: at 3.12% examples, 316691 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:20,410 : INFO : EPOCH 1 - PROGRESS: at 5.14% examples, 354440 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:21,476 : INFO : EPOCH 1 - PROGRESS: at 7.12% examples, 374713 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:22,563 : INFO : EPOCH 1 - PROGRESS: at 9.13% examples, 385048 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:23,651 : INFO : EPOCH 1 - PROGRESS: at 11.16% examples, 392026 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:24,720 : INFO : EPOCH 1 - PROGRESS: at 13.19% examples, 398037 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:25,720 : INFO : EPOCH 1 - PROGRESS: at 15.20% examples, 404809 words/s, in_qsize 15, out_qsize 0
2020-04-27 10:42:26,732 : INFO : EPOCH 1 - PROGRESS: at 17.11% examples, 407608 words/s, in_q


Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t8)

0.300480 Doc2Vec(dm/c,d100,n5,w5,mc2,t8)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)

0.104400 Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec(dm/c,d100,n5,w5,mc2,t8)

0.103840 Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec(dm/c,d100,n5,w5,mc2,t8)



Sentiment-Prediction Accuracy
-----------------------------

In [21]:
print("Err_rate Model")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
    print("%f %s" % (rate, name))

Err_rate Model
0.103680 Doc2Vec(dbow,d100,n5,mc2,t8)
0.103840 Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec(dm/c,d100,n5,w5,mc2,t8)
0.104400 Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)
0.169360 Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)
0.300480 Doc2Vec(dm/c,d100,n5,w5,mc2,t8)


* The best result is an error rate of about 10.4%, still a long way from the paper's reported 7.42% error rate.

Are inferred vectors close to the precalculated ones?
-----------------------------------------------------



In [23]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # Pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    pprint.pprint('%s:%s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

for doc 66061...
('Doc2Vec(dbow,d100,n5,mc2,t8):[(66061, 0.9706528782844543), (11244, '
 '0.6271395087242126), (67861, 0.6208620071411133)]')
('Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8):[(66061, 0.9435800909996033), '
 '(66046, 0.6392136216163635), (66043, 0.631019651889801)]')
('Doc2Vec(dm/c,d100,n5,w5,mc2,t8):[(66061, 0.8404983282089233), (60576, '
 '0.4583997130393982), (15363, 0.4405970573425293)]')


Do close documents seem more related than distant ones?
-------------------------------------------------------



In [24]:
import random

doc_id = np.random.randint(
    simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples

model  = random.choice(
    simple_models)  # and a random model

sims   = model.docvecs.most_similar(
    doc_id, 
    topn=model.docvecs.count)  # get *all* similar documents

pprint.pprint(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
pprint.pprint(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)

for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    s = sims[index]
    i = sims[index][0]
    words = ' '.join(alldocs[i].words)
    pprint.pprint(u'%s %s: «%s»' % (label, s, words))

('TARGET (26743): «If Todd Sheets were to come out and admit that this movie '
 'was intended to spoof the zombie genre, I would change my rating to an '
 'eight. Try to imagine a movie where every scene, line, and even every acting '
 'nuance was designed to be a parody. I could probably crap out alphabet soup, '
 'rearrange what was left of the letters, and still have a better script. Two '
 'scenes in particular come to mind when I think of this movie. SPOILER ALERT! '
 "One is when Mike's dad and the other dad walk, I repeat walk down a "
 'staircase jam packed with zombies. This is a small staircase and even though '
 'they brush up against the flailing undead, nothing happens to them. When '
 'they reach the end, the ex-marine turns around, says "God you\'re a horny '
 'bastard", and shoots only one. The other is in the military complex. The '
 'girl stabs a zombie with a machete and is immediately surrounded. The camera '
 'moves around her for roughly forty seconds, while she i

Somewhat, in terms of reviewer tone, movie genre, etc. The MOST
cosine-similar docs usually seem more like the TARGET than the MEDIAN or
LEAST, especially if the MOST has a cosine-similarity > 0.5. Re-run the
cell to try another random target document.
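
For readers unfamiliar with the measure, here is a minimal sketch of cosine similarity and of picking the MOST, MEDIAN, and LEAST similar items from a ranked list. The 2-d vectors and candidate names are toy values chosen for the example, not document vectors from the models above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(target, candidates):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(target, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

target = [1.0, 0.0]
candidates = {
    'same_direction': [2.0, 0.0],
    'orthogonal': [0.0, 3.0],
    'opposite': [-1.0, 0.0],
}

ranked = rank_by_similarity(target, candidates)
for label, index in [('MOST', 0), ('MEDIAN', len(ranked) // 2), ('LEAST', len(ranked) - 1)]:
    print(label, ranked[index])
```

Because cosine similarity ignores vector length, `same_direction` scores as a perfect match even though it is twice as long as the target.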




Do the word vectors show useful similarities?
---------------------------------------------




In [26]:
import random

word_models = simple_models[:]

def pick_random_word(model, threshold=10):
    # pick a random word with a suitable number of occurrences
    while True:
        word = random.choice(model.wv.index2word)
        if model.wv.vocab[word].count > threshold:
            return word

target_word = pick_random_word(word_models[0])
# or uncomment below line, to just pick a word from the relevant domain:
# target_word = 'comedy/drama'

for model in word_models:
    print('target_word: %r model: %s similar words:' % (target_word, model))
    for i, (word, sim) in enumerate(model.wv.most_similar(target_word, topn=10), 1):
        print('    %d. %.2f %r' % (i, sim, word))
    print()

2020-04-27 11:06:54,967 : INFO : precomputing L2-norms of word weight vectors
2020-04-27 11:06:55,086 : INFO : precomputing L2-norms of word weight vectors


target_word: 'complex,' model: Doc2Vec(dbow,d100,n5,mc2,t8) similar words:
    1. 0.44 'McEwan),'
    2. 0.42 'ogle.'
    3. 0.41 'Hoechlin,'
    4. 0.41 'satisfaction'
    5. 0.40 'TREK,'
    6. 0.40 'Shawshank,'
    7. 0.40 'Bright.'
    8. 0.40 'WIth'
    9. 0.39 'horse?'
    10. 0.39 "Dumbledore's"

target_word: 'complex,' model: Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8) similar words:


2020-04-27 11:06:55,317 : INFO : precomputing L2-norms of word weight vectors


    1. 0.70 'complex'
    2. 0.56 'complicated,'
    3. 0.55 'powerful'
    4. 0.54 'powerful,'
    5. 0.54 'enthralling,'
    6. 0.53 'claustrophobic,'
    7. 0.52 'absorbing,'
    8. 0.52 'uncommunicative,'
    9. 0.51 'ethos,'
    10. 0.51 'elegantly'

target_word: 'complex,' model: Doc2Vec(dm/c,d100,n5,w5,mc2,t8) similar words:
    1. 0.71 'solid,'
    2. 0.66 'thought-provoking,'
    3. 0.65 'compelling,'
    4. 0.64 'good-looking,'
    5. 0.63 'simple,'
    6. 0.62 'mysterious,'
    7. 0.62 'brutal,'
    8. 0.62 'subtle,'
    9. 0.62 'disturbing,'
    10. 0.62 'surreal,'



Are the word vectors from this dataset any good at analogies?
-------------------------------------------------------------



In [27]:
# grab the file if not already local

questions_filename = 'questions-words.txt'

if not os.path.isfile(questions_filename):
    # Download the analogy questions file
    print("Downloading analogy questions file...")
    url = u'https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt'

    with smart_open.open(url, 'rb') as fin:
        with smart_open.open(questions_filename, 'wb') as fout:
            fout.write(fin.read())

assert os.path.isfile(questions_filename), "questions-words.txt unavailable"
print("Success, questions-words.txt is available for next steps.")

# Note: this analysis takes many minutes
for model in word_models:
    score, sections    = model.wv.evaluate_word_analogies(questions_filename)
    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
    
    pprint.pprint('%s: %0.2f%% correct (%d of %d)' % (model, 
                                                      float(correct*100)/(correct+incorrect), 
                                                      correct, 
                                                      correct+incorrect))

Success, questions-words.txt is available for next steps.


2020-04-27 11:07:07,121 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2020-04-27 11:07:11,380 : INFO : capital-common-countries: 0.0% (0/420)
2020-04-27 11:07:22,519 : INFO : capital-world: 0.0% (0/902)
2020-04-27 11:07:23,650 : INFO : currency: 0.0% (0/86)
2020-04-27 11:07:42,154 : INFO : city-in-state: 0.0% (0/1510)
2020-04-27 11:07:47,916 : INFO : family: 0.0% (0/506)
2020-04-27 11:07:58,382 : INFO : gram1-adjective-to-adverb: 0.0% (0/992)
2020-04-27 11:08:08,749 : INFO : gram2-opposite: 0.0% (0/756)
2020-04-27 11:08:24,445 : INFO : gram3-comparative: 0.0% (0/1332)
2020-04-27 11:08:36,124 : INFO : gram4-superlative: 0.0% (0/1056)
2020-04-27 11:08:47,331 : INFO : gram5-present-participle: 0.0% (0/992)
2020-04-27 11:09:03,842 : INFO : gram6-nationality-adjective: 0.0% (0/1445)
2020-04-27 11:09:21,230 : INFO : gram7-past-tense: 0.0% (0/1560)
2020-04-27 11:09:34,474 : INFO : gram8-plural: 0.0% (0/1190)
2020-04-27 11:09:44,001 : INFO : gram9-

'Doc2Vec(dbow,d100,n5,mc2,t8): 0.00% correct (0 of 13617)'


2020-04-27 11:09:44,394 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2020-04-27 11:09:48,857 : INFO : capital-common-countries: 5.7% (24/420)
2020-04-27 11:09:58,675 : INFO : capital-world: 1.7% (15/902)
2020-04-27 11:09:59,736 : INFO : currency: 0.0% (0/86)
2020-04-27 11:10:17,422 : INFO : city-in-state: 0.1% (2/1510)
2020-04-27 11:10:24,141 : INFO : family: 35.0% (177/506)
2020-04-27 11:10:36,510 : INFO : gram1-adjective-to-adverb: 2.8% (28/992)
2020-04-27 11:10:45,443 : INFO : gram2-opposite: 6.1% (46/756)
2020-04-27 11:11:00,045 : INFO : gram3-comparative: 49.5% (659/1332)
2020-04-27 11:11:12,241 : INFO : gram4-superlative: 27.8% (294/1056)
2020-04-27 11:11:24,060 : INFO : gram5-present-participle: 22.4% (222/992)
2020-04-27 11:11:40,974 : INFO : gram6-nationality-adjective: 2.8% (41/1445)
2020-04-27 11:11:59,088 : INFO : gram7-past-tense: 28.9% (451/1560)
2020-04-27 11:12:12,760 : INFO : gram8-plural: 18.8% (224/1190)
2020-04-27 11:12

'Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8): 18.72% correct (2549 of 13617)'


2020-04-27 11:12:24,428 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2020-04-27 11:12:30,274 : INFO : capital-common-countries: 1.2% (5/420)
2020-04-27 11:12:43,069 : INFO : capital-world: 0.3% (3/902)
2020-04-27 11:12:44,166 : INFO : currency: 0.0% (0/86)
2020-04-27 11:13:03,648 : INFO : city-in-state: 0.3% (5/1510)
2020-04-27 11:13:10,742 : INFO : family: 32.0% (162/506)
2020-04-27 11:13:23,317 : INFO : gram1-adjective-to-adverb: 6.4% (63/992)
2020-04-27 11:13:32,467 : INFO : gram2-opposite: 4.4% (33/756)
2020-04-27 11:13:49,410 : INFO : gram3-comparative: 37.6% (501/1332)
2020-04-27 11:14:03,587 : INFO : gram4-superlative: 23.9% (252/1056)
2020-04-27 11:14:16,749 : INFO : gram5-present-participle: 36.1% (358/992)
2020-04-27 11:14:33,858 : INFO : gram6-nationality-adjective: 2.6% (38/1445)
2020-04-27 11:14:53,751 : INFO : gram7-past-tense: 26.8% (418/1560)
2020-04-27 11:15:07,444 : INFO : gram8-plural: 11.0% (131/1190)
2020-04-27 11:15:1

'Doc2Vec(dm/c,d100,n5,w5,mc2,t8): 17.87% correct (2433 of 13617)'


Even though this is a tiny, domain-specific dataset, it shows some meager
capability on the general word analogies, at least for the DM/mean and
DM/concat models, which actually train word vectors. (The untrained,
randomly initialized words of the DBOW model of course fail miserably.)
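
To show what a single analogy question asks of the word vectors, here is a small sketch of the usual vector-arithmetic approach: for "a is to b as c is to ?", form `b - a + c` and pick the nearest remaining word by cosine similarity. It is a simplified illustration with hand-made 3-d vectors, not gensim's evaluation code; `solve_analogy` and `toy_vectors` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))

# Hand-made toy vectors: axis 1 ~ "female", axis 2 ~ "royal".
toy_vectors = {
    'man':   [1.0, 0.0, 0.0],
    'woman': [1.0, 1.0, 0.0],
    'king':  [1.0, 0.0, 1.0],
    'queen': [1.0, 1.0, 1.0],
    'movie': [0.0, 0.0, 1.0],
}

print(solve_analogy(toy_vectors, 'man', 'woman', 'king'))  # queen
```

Randomly initialized word vectors carry no such structure, which is consistent with the DBOW model's 0% score above.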


