In [1]:
%matplotlib inline


How to reproduce the doc2vec 'Paragraph Vector' paper
=====================================================

Shows how to reproduce results of the "Distributed Representations of Sentences and Documents" paper by Le and Mikolov using Gensim.




In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Introduction
------------

This guide shows you how to reproduce the results of the paper by `Le and
Mikolov 2014 <https://arxiv.org/pdf/1405.4053.pdf>`_ using Gensim. While the
entire paper is worth reading (it's only 9 pages), we will be focusing on
Section 3.2: "Beyond One Sentence - Sentiment Analysis with the IMDB
dataset".

This guide proceeds in four steps:

#. Load the IMDB dataset
#. Train a variety of Doc2Vec models on the dataset
#. Evaluate the performance of each model using a logistic regression
#. Examine some of the results directly

When examining results, we will look for answers to the following questions:

#. Are inferred vectors close to the precalculated ones?
#. Do close documents seem more related than distant ones?
#. Do the word vectors show useful similarities?
#. Are the word vectors from this dataset any good at analogies?

Load corpus
-----------

Our data for the tutorial will be the `IMDB archive
<http://ai.stanford.edu/~amaas/data/sentiment/>`_.
If you're not familiar with this dataset, then here's a brief intro: it
contains 100 thousand movie reviews from IMDB, some of them labeled by sentiment.

Each review is a single line of text containing multiple sentences, for example:

```
One of the best movie-dramas I have ever seen. We do a lot of acting in the
church and this is one that can be used as a resource that highlights all the
good things that actors can do in their work. I highly recommend this one,
especially for those who have an interest in acting, as a "must see."
```

These reviews will be the **documents** that we will work with in this tutorial.
There are 100 thousand reviews in total.

#. 25k reviews for training (12.5k positive, 12.5k negative)
#. 25k reviews for testing (12.5k positive, 12.5k negative)
#. 50k unlabeled reviews

Out of 100k reviews, 50k have a label: either positive (the reviewer liked
the movie) or negative.
The remaining 50k are unlabeled.

Our first task will be to prepare the dataset.

More specifically, we will:

#. Download the tar.gz file (it's only 84MB, so this shouldn't take too long)
#. Unpack it and extract each movie review
#. Split the reviews into training and test datasets

First, let's define a convenient datatype for holding data for a single document:

* words: The text of the document, as a ``list`` of words.
* tags: Used to keep the index of the document in the entire dataset.
* split: one of ``train``\ , ``test`` or ``extra``. Determines how the document will be used (for training, testing, etc).
* sentiment: either 1 (positive), 0 (negative) or None (unlabeled document).

This data type is helpful for later evaluation and reporting.
In particular, the ``tags`` member holds the document's index, which will help us quickly and easily retrieve the vectors for a document from a model.




In [3]:
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')
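
As a quick illustration of how the fields are read back, here is a small sketch with made-up values (the review text and the index ``0`` are invented for the example, not taken from the dataset):

```python
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')

# A hypothetical labeled training review with dataset index 0.
doc = SentimentDocument(words=['A', 'fine', 'film.'], tags=[0], split='train', sentiment=1.0)

# The single entry of `tags` is the document's index in the whole dataset.
doc_index = doc.tags[0]
print(doc.split, doc.sentiment, doc_index)
```

Keeping the index inside a one-element list matches the ``tags`` field that the models below look up document vectors by.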

We can now proceed with loading the corpus.



In [4]:
import io
import re
import tarfile
import os.path

import smart_open
import gensim.utils

def download_dataset(url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'):
    fname = url.split('/')[-1]

    if os.path.isfile(fname):
        return fname

    # Download the file to local storage first.
    with smart_open.open(url, "rb", ignore_ext=True) as fin:
        with smart_open.open(fname, 'wb', ignore_ext=True) as fout:
            while True:
                buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                if not buf:
                    break
                fout.write(buf)

    return fname

def create_sentiment_document(name, text, index):
    _, split, sentiment_str, _ = name.split('/')
    sentiment = {'pos': 1.0, 'neg': 0.0, 'unsup': None}[sentiment_str]

    if sentiment is None:
        split = 'extra'

    tokens = gensim.utils.to_unicode(text).split()
    return SentimentDocument(tokens, [index], split, sentiment)

def extract_documents(file_path):

    index = 0

    with tarfile.open(file_path, mode='r:gz') as tar:
        for member in tar.getmembers():
            if re.match(r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$', member.name):
                member_bytes = tar.extractfile(member).read()
                member_text = member_bytes.decode('utf-8', errors='replace')
                assert member_text.count('\n') == 0
                yield create_sentiment_document(member.name, member_text, index)
                index += 1
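
To see the extraction logic in isolation, the sketch below builds a tiny in-memory archive and parses it the same way. The file names and review texts are invented, and ``build_toy_archive`` and ``parse_toy_archive`` are hypothetical helpers written for this example; only the path layout and the sentiment mapping follow the code above.

```python
import io
import re
import tarfile

# Same member-name pattern as extract_documents.
MEMBER_PATTERN = re.compile(r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$')
SENTIMENT_BY_DIR = {'pos': 1.0, 'neg': 0.0, 'unsup': None}

def build_toy_archive(files):
    """Return an in-memory .tar.gz holding the given {name: text} files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
        for name, text in files.items():
            payload = text.encode('utf-8')
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer

def parse_toy_archive(buffer):
    """Yield (index, split, sentiment, tokens) for each matching review file."""
    index = 0
    with tarfile.open(fileobj=buffer, mode='r:gz') as tar:
        for member in tar.getmembers():
            if not MEMBER_PATTERN.match(member.name):
                continue  # skip READMEs, vocab files, and so on
            _, split, sentiment_dir, _ = member.name.split('/')
            sentiment = SENTIMENT_BY_DIR[sentiment_dir]
            if sentiment is None:
                split = 'extra'  # unlabeled reviews get their own split
            text = tar.extractfile(member).read().decode('utf-8', errors='replace')
            yield index, split, sentiment, text.split()
            index += 1

toy_files = {
    'aclImdb/README': 'not a review',
    'aclImdb/train/pos/0_9.txt': 'Great film.',
    'aclImdb/test/neg/1_2.txt': 'Dull and slow.',
    'aclImdb/train/unsup/2_0.txt': 'Saw it yesterday.',
}
parsed = list(parse_toy_archive(build_toy_archive(toy_files)))
for row in parsed:
    print(row)
```

The README is skipped because it does not match the pattern, and the review under ``unsup`` ends up in the ``extra`` split with no sentiment label.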

In [5]:
# fname = download_dataset()
# alldocs = list(extract_documents(fname))
data_path = "./data/aclImdb_v1.tar.gz"
alldocs = list(extract_documents(data_path))

Here's what a single document looks like.



In [6]:
print(alldocs[27])

SentimentDocument(words=['I', 'was', 'looking', 'forward', 'to', 'this', 'movie.', 'Trustworthy', 'actors,', 'interesting', 'plot.', 'Great', 'atmosphere', 'then', '?????', 'IF', 'you', 'are', 'going', 'to', 'attempt', 'something', 'that', 'is', 'meant', 'to', 'encapsulate', 'the', 'meaning', 'of', 'life.', 'First.', 'Know', 'it.', 'OK', 'I', 'did', 'not', 'expect', 'the', 'directors', 'or', 'writers', 'to', 'actually', 'know', 'the', 'meaning', 'but', 'I', 'thought', 'they', 'may', 'have', 'offered', 'crumbs', 'to', 'peck', 'at', 'and', 'treats', 'to', 'add', 'fuel', 'to', 'the', 'fire-Which!', 'they', 'almost', 'did.', 'Things', 'I', "didn't", 'get.', 'A', 'woman', 'wandering', 'around', 'in', 'dark', 'places', 'and', 'lonely', 'car', 'parks', 'alone-oblivious', 'to', 'the', 'consequences.', 'Great', 'riddles', 'that', 'fell', 'by', 'the', 'wayside.', 'The', 'promise', 'of', 'the', 'knowledge', 'therein', 'contained', 'by', 'the', 'original', 'so-called', 'criminal.', 'I', 'had', 'no

Extract our documents and split into training/test sets.



In [7]:
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
print(f'{len(alldocs)} docs: {len(train_docs)} train-sentiment, {len(test_docs)} test-sentiment')

100000 docs: 25000 train-sentiment, 25000 test-sentiment


Set-up Doc2Vec Training & Evaluation Models
-------------------------------------------

We approximate the experiment of Le & Mikolov `"Distributed Representations
of Sentences and Documents"
<http://cs.stanford.edu/~quocle/paragraph_vector.pdf>`_ with guidance from
Mikolov's `example go.sh
<https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ>`_::

    ./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1

We vary the following parameter choices:

* 100-dimensional vectors, as the 400-d vectors of the paper take a lot of
  memory and, in our tests of this task, don't seem to offer much benefit
* Similarly, frequent word subsampling seems to decrease sentiment-prediction
  accuracy, so it's left out
* ``cbow=0`` means skip-gram which is equivalent to the paper's 'PV-DBOW'
  mode, matched in gensim with ``dm=0``
* Added to that DBOW model are two DM models, one which averages context
  vectors (\ ``dm_mean``\ ) and one which concatenates them (\ ``dm_concat``\ ,
  resulting in a much larger, slower, more data-hungry model)
* A ``min_count=2`` saves quite a bit of model memory, discarding only words
  that appear in a single doc (and are thus no more expressive than the
  unique-to-each doc vectors themselves)
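
For orientation, the flags in ``go.sh`` line up with gensim keyword arguments roughly as sketched below. The table is our reading of the two interfaces, not something the script or the paper spells out, and ``FLAG_TABLE`` is just an illustrative name. Thread count and window size are left out because they depend on the machine and on the model variant.

```python
# (word2vec flag, value in go.sh, gensim Doc2Vec keyword, value used in this tutorial)
FLAG_TABLE = [
    ('-cbow', 0, 'dm', 0),                # skip-gram style training <-> PV-DBOW
    ('-size', 100, 'vector_size', 100),
    ('-negative', 5, 'negative', 5),
    ('-hs', 0, 'hs', 0),
    ('-sample', 1e-4, 'sample', 0),       # frequent-word subsampling is switched off here
    ('-iter', 20, 'epochs', 20),
    ('-min-count', 1, 'min_count', 2),    # discards words that occur only once
]

# Keywords whose value differs between go.sh and this tutorial.
changed = [kwarg for _, script_value, kwarg, tutorial_value in FLAG_TABLE
           if script_value != tutorial_value]
print(changed)
```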




In [8]:
import multiprocessing
from collections import OrderedDict

import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

from gensim.models.doc2vec import Doc2Vec

common_kwargs = dict(
    vector_size=100, epochs=20, min_count=2,
    sample=0, workers=multiprocessing.cpu_count(), negative=5, hs=0,
)

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, **common_kwargs),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, window=10, alpha=0.05, comment='alpha=0.05', **common_kwargs),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, window=5, **common_kwargs),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print(f"{model} vocabulary scanned & state initialized")

models_by_name = OrderedDict((str(model), model) for model in simple_models)

2024-05-07 20:39:56,438 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow,d100,n5,mc2,t48>', 'datetime': '2024-05-07T20:39:56.438724', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n[GCC 12.3.0]', 'platform': 'Linux-4.15.0-213-generic-x86_64-with-glibc2.27', 'event': 'created'}
2024-05-07 20:39:56,441 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d100,n5,w10,mc2,t48>', 'datetime': '2024-05-07T20:39:56.441952', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n[GCC 12.3.0]', 'platform': 'Linux-4.15.0-213-generic-x86_64-with-glibc2.27', 'event': 'created'}
2024-05-07 20:39:56,442 : INFO : using concatenative 1100-dimensional layer1
2024-05-07 20:39:56,445 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/c,d100,n5,w5,mc2,t48>', 'datetime': '2024-05-07T20:39:56.444971', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n[G

Doc2Vec<dbow,d100,n5,mc2,t48> vocabulary scanned & state initialized


2024-05-07 20:40:08,310 : INFO : PROGRESS: at example #10000, processed 2292381 words (4919791 words/s), 150816 word types, 0 tags
2024-05-07 20:40:08,789 : INFO : PROGRESS: at example #20000, processed 4573645 words (4774673 words/s), 238497 word types, 0 tags
2024-05-07 20:40:09,282 : INFO : PROGRESS: at example #30000, processed 6865575 words (4661119 words/s), 312348 word types, 0 tags
2024-05-07 20:40:09,879 : INFO : PROGRESS: at example #40000, processed 9190019 words (3906690 words/s), 377231 word types, 0 tags
2024-05-07 20:40:10,415 : INFO : PROGRESS: at example #50000, processed 11557847 words (4420429 words/s), 438729 word types, 0 tags
2024-05-07 20:40:11,009 : INFO : PROGRESS: at example #60000, processed 13899883 words (3954589 words/s), 493913 word types, 0 tags
2024-05-07 20:40:11,620 : INFO : PROGRESS: at example #70000, processed 16270094 words (3884410 words/s), 548474 word types, 0 tags
2024-05-07 20:40:12,194 : INFO : PROGRESS: at example #80000, processed 18598876

Doc2Vec<dm/m,d100,n5,w10,mc2,t48> vocabulary scanned & state initialized


2024-05-07 20:40:18,816 : INFO : PROGRESS: at example #10000, processed 2292381 words (4489958 words/s), 150816 word types, 0 tags
2024-05-07 20:40:19,275 : INFO : PROGRESS: at example #20000, processed 4573645 words (4978433 words/s), 238497 word types, 0 tags
2024-05-07 20:40:19,753 : INFO : PROGRESS: at example #30000, processed 6865575 words (4805054 words/s), 312348 word types, 0 tags
2024-05-07 20:40:20,237 : INFO : PROGRESS: at example #40000, processed 9190019 words (4804720 words/s), 377231 word types, 0 tags
2024-05-07 20:40:20,735 : INFO : PROGRESS: at example #50000, processed 11557847 words (4768502 words/s), 438729 word types, 0 tags
2024-05-07 20:40:21,217 : INFO : PROGRESS: at example #60000, processed 13899883 words (4861162 words/s), 493913 word types, 0 tags
2024-05-07 20:40:21,731 : INFO : PROGRESS: at example #70000, processed 16270094 words (4620900 words/s), 548474 word types, 0 tags
2024-05-07 20:40:22,248 : INFO : PROGRESS: at example #80000, processed 18598876

Doc2Vec<dm/c,d100,n5,w5,mc2,t48> vocabulary scanned & state initialized


Le and Mikolov note that combining a paragraph vector from Distributed Bag of
Words (DBOW) with one from Distributed Memory (DM) improves performance. We
follow their lead and pair the models for evaluation. Here, we concatenate the
paragraph vectors obtained from each model with the help of a thin wrapper
class included in a gensim test module. (Note that this is a separate, later
concatenation of output vectors, distinct from the input-window concatenation
enabled by the ``dm_concat=1`` mode above.)
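
The idea of output-vector concatenation can be sketched in a few lines. This is a simplified stand-in written for illustration, not the code of gensim's wrapper class: ``ConcatenatedVectors`` is a hypothetical name, and plain dictionaries of lists take the place of trained models.

```python
class ConcatenatedVectors:
    """Look up one tag in several vector tables and join the results end to end."""

    def __init__(self, tables):
        self.tables = tables

    def __getitem__(self, tag):
        combined = []
        for table in self.tables:
            combined.extend(table[tag])
        return combined

# Made-up 2-d "DBOW" and 3-d "DM" vectors for two documents.
dbow_vectors = {0: [0.1, 0.2], 1: [0.3, 0.4]}
dm_vectors = {0: [0.5, 0.6, 0.7], 1: [0.8, 0.9, 1.0]}

combined = ConcatenatedVectors([dbow_vectors, dm_vectors])
print(combined[0])
```

With two 100-dimensional models, each document would be represented by a 200-dimensional vector in the same way.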




In [10]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])

2024-05-07 20:48:16,075 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2024-05-07 20:48:16,076 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2024-05-07 20:48:16,078 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2024-05-07T20:48:16.077933', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n[GCC 12.3.0]', 'platform': 'Linux-4.15.0-213-generic-x86_64-with-glibc2.27', 'event': 'created'}


Predictive Evaluation Methods
-----------------------------

Given a document, our ``Doc2Vec`` models output a vector representation of the document.
How useful is a particular model?
In the case of sentiment analysis, we want the output vector to reflect the sentiment in the input document.
So, in vector space, positive documents should be distant from negative documents.

We train a logistic regression from the training set:

  - regressors (inputs): document vectors from the Doc2Vec model
  - target (output): sentiment labels

So, this logistic regression will be able to predict sentiment given a document vector.

Next, we test our logistic regression on the test set, and measure the rate of errors (incorrect predictions).
If the document vectors from the Doc2Vec model reflect the actual sentiment well, the error rate will be low.

Therefore, the error rate of the logistic regression is an indication of *how well* the given Doc2Vec model represents documents as vectors.
We can then compare different ``Doc2Vec`` models by looking at their error rates.




In [12]:
import numpy as np
import statsmodels.api as sm
from random import sample

def logistic_predictor_from_data(train_targets, train_regressors):
    """Fit a statsmodels logistic predictor on supplied data"""
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets = [doc.sentiment for doc in train_set]
    train_regressors = [test_model.dv[doc.tags[0]] for doc in train_set]
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_regressors = [test_model.dv[doc.tags[0]] for doc in test_set]
    test_regressors = sm.add_constant(test_regressors)

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
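
To make the error-rate arithmetic concrete, here is a toy version of the last step with invented probabilities and labels. The ``error_rate`` helper is hypothetical. It rounds each predicted probability to 0 or 1, as ``np.rint`` does above, and counts mismatches against the true labels.

```python
def error_rate(predicted_probabilities, true_labels):
    """Return (error rate, error count) after rounding probabilities to 0/1."""
    predictions = [float(round(p)) for p in predicted_probabilities]
    errors = sum(1 for pred, label in zip(predictions, true_labels) if pred != label)
    return errors / len(true_labels), errors

# Made-up classifier outputs for five test reviews.
probs = [0.91, 0.12, 0.55, 0.40, 0.08]
labels = [1.0, 0.0, 0.0, 1.0, 0.0]

rate, errors = error_rate(probs, labels)
print(f'{errors} errors out of {len(labels)}: error rate {rate}')
```

Here the third and fourth reviews are misclassified, so two of five predictions are wrong.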

Bulk Training & Per-Model Evaluation
------------------------------------

Note that doc-vector training occurs on *all* documents of the dataset: the
train, test, and unlabeled ("extra") reviews alike. Because the native
document order has similar-sentiment documents in large clumps, which is
suboptimal for training, we work with a once-shuffled copy of the full
document list.

We evaluate each model's sentiment-predictive power by its error rate on the
test set.

(On a 4-core 2.6 GHz Intel Core i7, these 20 passes of training and
evaluating the 3 main models take about an hour.)
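
The effect of the one-time shuffle can be seen on a toy list of labels. This is only a sketch: the fixed seed is an assumption added for repeatability (the tutorial's shuffle is unseeded), and ``longest_run`` is a hypothetical helper that measures clumping.

```python
import random

def longest_run(items):
    """Length of the longest stretch of identical consecutive items."""
    best = current = 1
    for previous, item in zip(items, items[1:]):
        current = current + 1 if item == previous else 1
        best = max(best, current)
    return best

# Native order: all negatives first, then all positives.
clumped = ['neg'] * 4 + ['pos'] * 4

# Shuffle a copy so the original order stays available.
shuffled = clumped[:]
random.Random(0).shuffle(shuffled)

print(longest_run(clumped), clumped)
print(longest_run(shuffled), shuffled)
```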




In [13]:
from collections import defaultdict
error_rates = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [14]:
from random import shuffle
shuffled_alldocs = alldocs[:]
shuffle(shuffled_alldocs)

for model in simple_models:
    print(f"Training {model}")
    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)

    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print(f"\n{err_rate} {model}\n")

2024-05-07 20:52:33,757 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 48 workers on 265408 vocabulary and 100 features, using sg=1 hs=0 sample=0 negative=5 window=5 shrink_windows=True', 'datetime': '2024-05-07T20:52:33.757010', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n[GCC 12.3.0]', 'platform': 'Linux-4.15.0-213-generic-x86_64-with-glibc2.27', 'event': 'train'}


Training Doc2Vec<dbow,d100,n5,mc2,t48>


2024-05-07 20:52:34,852 : INFO : EPOCH 0 - PROGRESS: at 2.03% examples, 438913 words/s, in_qsize 93, out_qsize 2
2024-05-07 20:52:35,871 : INFO : EPOCH 0 - PROGRESS: at 5.15% examples, 573782 words/s, in_qsize 85, out_qsize 10
2024-05-07 20:52:36,942 : INFO : EPOCH 0 - PROGRESS: at 8.50% examples, 624579 words/s, in_qsize 95, out_qsize 3
2024-05-07 20:52:37,977 : INFO : EPOCH 0 - PROGRESS: at 11.82% examples, 650652 words/s, in_qsize 95, out_qsize 0
2024-05-07 20:52:39,026 : INFO : EPOCH 0 - PROGRESS: at 15.14% examples, 666371 words/s, in_qsize 95, out_qsize 2
2024-05-07 20:52:40,035 : INFO : EPOCH 0 - PROGRESS: at 18.63% examples, 687482 words/s, in_qsize 96, out_qsize 0
2024-05-07 20:52:41,049 : INFO : EPOCH 0 - PROGRESS: at 21.50% examples, 682103 words/s, in_qsize 94, out_qsize 2
2024-05-07 20:52:42,105 : INFO : EPOCH 0 - PROGRESS: at 24.54% examples, 679386 words/s, in_qsize 96, out_qsize 3
2024-05-07 20:52:43,110 : INFO : EPOCH 0 - PROGRESS: at 27.41% examples, 675653 words/s, i


Evaluating Doc2Vec<dbow,d100,n5,mc2,t48>


2024-05-07 21:03:22,544 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 48 workers on 265408 vocabulary and 100 features, using sg=0 hs=0 sample=0 negative=5 window=10 shrink_windows=True', 'datetime': '2024-05-07T21:03:22.544307', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n[GCC 12.3.0]', 'platform': 'Linux-4.15.0-213-generic-x86_64-with-glibc2.27', 'event': 'train'}



0.104520 Doc2Vec<dbow,d100,n5,mc2,t48>

Training Doc2Vec<dm/m,d100,n5,w10,mc2,t48>


2024-05-07 21:03:23,568 : INFO : EPOCH 0 - PROGRESS: at 0.17% examples, 48493 words/s, in_qsize 95, out_qsize 0
2024-05-07 21:03:24,583 : INFO : EPOCH 0 - PROGRESS: at 2.18% examples, 250468 words/s, in_qsize 95, out_qsize 0
2024-05-07 21:03:25,609 : INFO : EPOCH 0 - PROGRESS: at 4.19% examples, 318914 words/s, in_qsize 94, out_qsize 1
2024-05-07 21:03:26,614 : INFO : EPOCH 0 - PROGRESS: at 6.08% examples, 347580 words/s, in_qsize 96, out_qsize 0
2024-05-07 21:03:27,639 : INFO : EPOCH 0 - PROGRESS: at 7.97% examples, 363486 words/s, in_qsize 96, out_qsize 0
2024-05-07 21:03:28,735 : INFO : EPOCH 0 - PROGRESS: at 9.91% examples, 371143 words/s, in_qsize 96, out_qsize 0
2024-05-07 21:03:29,739 : INFO : EPOCH 0 - PROGRESS: at 11.84% examples, 379843 words/s, in_qsize 95, out_qsize 0
2024-05-07 21:03:30,742 : INFO : EPOCH 0 - PROGRESS: at 13.41% examples, 378307 words/s, in_qsize 95, out_qsize 0
2024-05-07 21:03:31,756 : INFO : EPOCH 0 - PROGRESS: at 15.09% examples, 378672 words/s, in_qsi


Evaluating Doc2Vec<dm/m,d100,n5,w10,mc2,t48>


2024-05-07 21:21:10,802 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 48 workers on 265409 vocabulary and 1100 features, using sg=0 hs=0 sample=0 negative=5 window=5 shrink_windows=True', 'datetime': '2024-05-07T21:21:10.802014', 'gensim': '4.3.2', 'python': '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) \n[GCC 12.3.0]', 'platform': 'Linux-4.15.0-213-generic-x86_64-with-glibc2.27', 'event': 'train'}



0.170320 Doc2Vec<dm/m,d100,n5,w10,mc2,t48>

Training Doc2Vec<dm/c,d100,n5,w5,mc2,t48>


2024-05-07 21:21:11,842 : INFO : EPOCH 0 - PROGRESS: at 1.25% examples, 295255 words/s, in_qsize 95, out_qsize 0
2024-05-07 21:21:12,847 : INFO : EPOCH 0 - PROGRESS: at 3.98% examples, 455909 words/s, in_qsize 96, out_qsize 0
2024-05-07 21:21:13,878 : INFO : EPOCH 0 - PROGRESS: at 6.85% examples, 521068 words/s, in_qsize 96, out_qsize 2
2024-05-07 21:21:14,883 : INFO : EPOCH 0 - PROGRESS: at 10.04% examples, 571071 words/s, in_qsize 90, out_qsize 1
2024-05-07 21:21:15,916 : INFO : EPOCH 0 - PROGRESS: at 13.04% examples, 590311 words/s, in_qsize 96, out_qsize 2
2024-05-07 21:21:16,979 : INFO : EPOCH 0 - PROGRESS: at 16.54% examples, 619029 words/s, in_qsize 95, out_qsize 0
2024-05-07 21:21:17,996 : INFO : EPOCH 0 - PROGRESS: at 19.36% examples, 621569 words/s, in_qsize 95, out_qsize 4
2024-05-07 21:21:19,025 : INFO : EPOCH 0 - PROGRESS: at 22.85% examples, 641496 words/s, in_qsize 95, out_qsize 1
2024-05-07 21:21:20,026 : INFO : EPOCH 0 - PROGRESS: at 26.08% examples, 651704 words/s, in


Evaluating Doc2Vec<dm/c,d100,n5,w5,mc2,t48>

0.299000 Doc2Vec<dm/c,d100,n5,w5,mc2,t48>


Evaluating Doc2Vec<dbow,d100,n5,mc2,t48>+Doc2Vec<dm/m,d100,n5,w10,mc2,t48>

0.10436 Doc2Vec<dbow,d100,n5,mc2,t48>+Doc2Vec<dm/m,d100,n5,w10,mc2,t48>


Evaluating Doc2Vec<dbow,d100,n5,mc2,t48>+Doc2Vec<dm/c,d100,n5,w5,mc2,t48>

0.10476 Doc2Vec<dbow,d100,n5,mc2,t48>+Doc2Vec<dm/c,d100,n5,w5,mc2,t48>



Achieved Sentiment-Prediction Accuracy
--------------------------------------
Compare error rates achieved, best-to-worst



In [16]:
print("Err_rate Model")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
    print(f"{rate} {name}")

Err_rate Model
0.10436 Doc2Vec<dbow,d100,n5,mc2,t48>+Doc2Vec<dm/m,d100,n5,w10,mc2,t48>
0.10452 Doc2Vec<dbow,d100,n5,mc2,t48>
0.10476 Doc2Vec<dbow,d100,n5,mc2,t48>+Doc2Vec<dm/c,d100,n5,w5,mc2,t48>
0.17032 Doc2Vec<dm/m,d100,n5,w10,mc2,t48>
0.299 Doc2Vec<dm/c,d100,n5,w5,mc2,t48>


In our testing, contrary to the results of the paper, PV-DBOW alone performs
as well as anything else on this problem. Concatenating vectors from
different models only sometimes offers a tiny predictive improvement, and
stays generally close to the best-performing solo model included.

The best results achieved here are around a 10.4% error rate, still a long
way from the paper's reported 7.42% error rate.

(Other trials not shown, with larger vectors and other changes, also don't
come close to the paper's reported value. Others around the net have reported
a similar inability to reproduce the paper's best numbers. The PV-DM/C mode
improves a bit with many more training epochs – but doesn't reach parity with
PV-DBOW.)




Examining Results
-----------------

Let's look for answers to the following questions:

#. Are inferred vectors close to the precalculated ones?
#. Do close documents seem more related than distant ones?
#. Do the word vectors show useful similarities?
#. Are the word vectors from this dataset any good at analogies?




Are inferred vectors close to the precalculated ones?
-----------------------------------------------------



In [17]:
doc_id = np.random.randint(len(simple_models[0].dv))  # Pick random doc; re-run cell for more examples
print(f'for doc {doc_id}...')
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print(f'{model}:\n {model.dv.most_similar([inferred_docvec], topn=3)}')

for doc 67990...
Doc2Vec<dbow,d100,n5,mc2,t48>:
 [(67990, 0.9694914817810059), (64627, 0.6643124222755432), (74191, 0.6321359872817993)]
Doc2Vec<dm/m,d100,n5,w10,mc2,t48>:
 [(67990, 0.9226250052452087), (26538, 0.5970041155815125), (97970, 0.5921608209609985)]
Doc2Vec<dm/c,d100,n5,w5,mc2,t48>:
 [(67990, 0.7077532410621643), (53586, 0.4684329330921173), (90969, 0.46206676959991455)]


(Yes, here the stored vector from 20 epochs of training is usually one of the
closest to a freshly-inferred vector for the same words. Defaults for
inference may benefit from tuning for each dataset or model parameters.)




Do close documents seem more related than distant ones?
-------------------------------------------------------



In [20]:
import random

doc_id = np.random.randint(len(simple_models[0].dv))  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.dv.most_similar(doc_id, topn=len(model.dv))  # get *all* similar documents
print(f'TARGET ({doc_id}): «{" ".join(alldocs[doc_id].words)}»\n')
print(f'SIMILAR/DISSIMILAR DOCS PER MODEL {model}:\n')
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    s = sims[index]
    i = sims[index][0]
    words = ' '.join(alldocs[i].words)
    print(f'{label} {s}: «{words}»\n')

TARGET (85049): «A Few years ago, I found this at a Flea Market, after catching the trailer on the VHS of The Wizard Of Gore years earlier. All I remembered was that it seemed like something worth the three dollars the guy was asking, but after years of being burned in my search for decent-bad cinema, I wasn't all that shocked to discover that Class Reunion Massacre is, in fact, not worth the three dollars I paid. I like old B-horror because it's different, because it's outlandish and rebellious, because it can be refreshing compared to all the big-budget, super-hero garbage of today. But none of that means anything if it puts you to sleep.<br /><br />Typical Horror plot. Some kid walks out of a lake, gets on a bus, and goes to Church, where we find a rather loud Priest doing his thing...<br /><br />and I'm sure all of this has something to do with the six "sinners" the story now revolves around. Four People I barely noticed, Mr. metro-fem, and a lesbian are headed to their 10 year hig

Somewhat. In terms of reviewer tone, movie genre, and so on, the MOST
cosine-similar docs usually seem more like the TARGET than the MEDIAN or
LEAST do, especially if the MOST has a cosine similarity > 0.5. Re-run the
cell to try another random target document.
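
For readers who want the ranking criterion spelled out, the sketch below computes cosine similarity by hand and sorts a few made-up 2-d vectors against a target. The vectors and tag names are invented, and ``rank_by_similarity`` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(target_tag, vectors):
    """Return (tag, similarity) pairs for all other tags, most similar first."""
    target = vectors[target_tag]
    scored = [(tag, cosine_similarity(target, vec))
              for tag, vec in vectors.items() if tag != target_tag]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

toy_vectors = {
    'target': [1.0, 0.0],
    'near': [0.9, 0.1],
    'orthogonal': [0.0, 1.0],
    'opposite': [-1.0, 0.0],
}
ranked = rank_by_similarity('target', toy_vectors)
for tag, sim in ranked:
    print(f'{tag}: {sim:.3f}')
```

The first, middle, and last entries of such a ranking play the roles of MOST, MEDIAN, and LEAST in the cell above.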




Do the word vectors show useful similarities?
---------------------------------------------




In [21]:
import random

word_models = simple_models[:]

def pick_random_word(model, threshold=10):
    # pick a random word with a suitable number of occurrences
    while True:
        word = random.choice(model.wv.index_to_key)
        if model.wv.get_vecattr(word, "count") > threshold:
            return word

target_word = pick_random_word(word_models[0])
# or uncomment below line, to just pick a word from the relevant domain:
# target_word = 'comedy/drama'

for model in word_models:
    print(f'target_word: {repr(target_word)} model: {model} similar words:')
    for i, (word, sim) in enumerate(model.wv.most_similar(target_word, topn=10), 1):
        print(f'    {i}. {sim:.2f} {repr(word)}')
    print()

target_word: 'teachers,' model: Doc2Vec<dbow,d100,n5,mc2,t48> similar words:
    1. 0.42 'acquiesces'
    2. 0.41 'on....just'
    3. 0.41 'marries'
    4. 0.41 'J.J'
    5. 0.40 'forty-nine'
    6. 0.40 'Romancing'
    7. 0.39 'brake'
    8. 0.39 'galled'
    9. 0.39 'RUBBER'
    10. 0.38 'in;'

target_word: 'teachers,' model: Doc2Vec<dm/m,d100,n5,w10,mc2,t48> similar words:
    1. 0.52 'cheerleaders,'
    2. 0.52 'graduates'
    3. 0.51 'crowd),'
    4. 0.51 'fraternities'
    5. 0.51 'pupils'
    6. 0.50 'co-workers,'
    7. 0.49 'neighbors,'
    8. 0.49 'disapproval'
    9. 0.49 'bratty,'
    10. 0.49 'bigots'

target_word: 'teachers,' model: Doc2Vec<dm/c,d100,n5,w5,mc2,t48> similar words:
    1. 0.60 'sexuality,'
    2. 0.59 'troops,'
    3. 0.59 'prostitutes,'
    4. 0.57 'employers,'
    5. 0.57 'siblings,'
    6. 0.57 'Presidents,'
    7. 0.57 'glue,'
    8. 0.56 'slaves,'
    9. 0.56 'daughters,'
    10. 0.55 'clothing,'



Do the DBOW words look meaningless? That's because the gensim DBOW model
doesn't train word vectors – they remain at their random initialized values –
unless you ask with the ``dbow_words=1`` initialization parameter. Concurrent
word-training slows DBOW mode significantly, and offers little improvement
(and sometimes a little worsening) of the error rate on this IMDB
sentiment-prediction task, but may be appropriate on other tasks, or if you
also need word-vectors.

Words from DM models tend to show meaningfully similar words when there are
many examples in the training data (as with 'plot' or 'actor'). (All DM modes
inherently involve word-vector training concurrent with doc-vector training.)




Are the word vectors from this dataset any good at analogies?
-------------------------------------------------------------



In [22]:
from gensim.test.utils import datapath
questions_filename = datapath('questions-words.txt')

# Note: this analysis takes many minutes
for model in word_models:
    score, sections = model.wv.evaluate_word_analogies(questions_filename)
    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
    print(f'{model}: {float(correct*100)/(correct+incorrect):0.2f}% correct ({correct} of {correct+incorrect})')

2024-05-07 21:36:55,852 : INFO : Evaluating word analogies for top 300000 words in the model on /home/xgwei/miniconda3/envs/workenv/lib/python3.9/site-packages/gensim/test/test_data/questions-words.txt
2024-05-07 21:37:06,504 : INFO : capital-common-countries: 0.0% (0/420)
2024-05-07 21:37:29,797 : INFO : capital-world: 0.0% (0/902)
2024-05-07 21:37:31,960 : INFO : currency: 0.0% (0/86)
2024-05-07 21:38:11,312 : INFO : city-in-state: 0.0% (0/1510)
2024-05-07 21:38:24,420 : INFO : family: 0.0% (0/506)
2024-05-07 21:38:49,753 : INFO : gram1-adjective-to-adverb: 0.0% (0/992)
2024-05-07 21:39:09,077 : INFO : gram2-opposite: 0.0% (0/756)
2024-05-07 21:39:42,881 : INFO : gram3-comparative: 0.0% (0/1332)
2024-05-07 21:40:09,760 : INFO : gram4-superlative: 0.0% (0/1056)
2024-05-07 21:40:35,126 : INFO : gram5-present-participle: 0.0% (0/992)
2024-05-07 21:41:11,710 : INFO : gram6-nationality-adjective: 0.0% (0/1445)
2024-05-07 21:41:51,811 : INFO : gram7-past-tense: 0.0% (0/1560)
2024-05-07 21:

Doc2Vec<dbow,d100,n5,mc2,t48>: 0.00%% correct (0 of 13617


2024-05-07 21:42:55,574 : INFO : capital-common-countries: 2.6% (11/420)
2024-05-07 21:43:18,275 : INFO : capital-world: 0.6% (5/902)
2024-05-07 21:43:20,394 : INFO : currency: 0.0% (0/86)
2024-05-07 21:43:58,232 : INFO : city-in-state: 0.4% (6/1510)
2024-05-07 21:44:11,818 : INFO : family: 36.8% (186/506)
2024-05-07 21:44:37,075 : INFO : gram1-adjective-to-adverb: 3.4% (34/992)
2024-05-07 21:44:55,866 : INFO : gram2-opposite: 6.7% (51/756)
2024-05-07 21:45:29,930 : INFO : gram3-comparative: 48.1% (641/1332)
2024-05-07 21:45:56,751 : INFO : gram4-superlative: 25.5% (269/1056)
2024-05-07 21:46:22,273 : INFO : gram5-present-participle: 23.5% (233/992)
2024-05-07 21:46:58,167 : INFO : gram6-nationality-adjective: 2.5% (36/1445)
2024-05-07 21:47:37,550 : INFO : gram7-past-tense: 29.6% (462/1560)
2024-05-07 21:48:07,749 : INFO : gram8-plural: 19.8% (236/1190)
2024-05-07 21:48:29,442 : INFO : gram9-plural-verbs: 44.8% (390/870)
2024-05-07 21:48:29,446 : INFO : Quadruplets with out-of-vocabul

Doc2Vec<dm/m,d100,n5,w10,mc2,t48>: 18.80%% correct (2560 of 13617


2024-05-07 21:48:40,454 : INFO : capital-common-countries: 1.9% (8/420)
2024-05-07 21:49:02,551 : INFO : capital-world: 0.4% (4/902)
2024-05-07 21:49:04,721 : INFO : currency: 0.0% (0/86)
2024-05-07 21:49:41,891 : INFO : city-in-state: 0.1% (2/1510)
2024-05-07 21:49:54,906 : INFO : family: 36.6% (185/506)
2024-05-07 21:50:20,515 : INFO : gram1-adjective-to-adverb: 9.4% (93/992)
2024-05-07 21:50:39,657 : INFO : gram2-opposite: 3.3% (25/756)
2024-05-07 21:51:13,077 : INFO : gram3-comparative: 35.1% (467/1332)
2024-05-07 21:51:39,280 : INFO : gram4-superlative: 23.7% (250/1056)
2024-05-07 21:52:04,308 : INFO : gram5-present-participle: 37.2% (369/992)
2024-05-07 21:52:40,095 : INFO : gram6-nationality-adjective: 2.3% (33/1445)
2024-05-07 21:53:18,995 : INFO : gram7-past-tense: 27.2% (425/1560)
2024-05-07 21:53:49,800 : INFO : gram8-plural: 10.5% (125/1190)
2024-05-07 21:54:11,695 : INFO : gram9-plural-verbs: 48.9% (425/870)
2024-05-07 21:54:11,698 : INFO : Quadruplets with out-of-vocabula

Doc2Vec<dm/c,d100,n5,w5,mc2,t48>: 17.71%% correct (2411 of 13617


Even though this is a tiny, domain-specific dataset, it shows some meager
capability on the general word analogies – at least for the DM/mean and
DM/concat models which actually train word vectors. (The untrained
random-initialized words of the DBOW model of course fail miserably.)
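
As a reminder of what an analogy question asks of the vectors, here is a toy sketch of the usual vector-offset approach: for "a is to b as c is to ?", combine the unit vectors as b - a + c and pick the nearest remaining word by cosine similarity. We assume this matches the general idea behind the evaluation above; the words, the 3-d vectors, and ``solve_analogy`` are all invented for illustration.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def solve_analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    query = unit([ub - ua + uc for ua, ub, uc in zip(units[a], units[b], units[c])])
    # The three question words themselves are not allowed as answers.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates,
               key=lambda word: sum(x * y for x, y in zip(query, units[word])))

# Hand-made vectors; the dimensions loosely mean (royalty, male, female).
toy_vectors = {
    'king': [1.0, 1.0, 0.0],
    'queen': [1.0, 0.0, 1.0],
    'man': [0.0, 1.0, 0.0],
    'woman': [0.0, 0.0, 1.0],
    'table': [0.0, 0.5, 0.5],
}
answer = solve_analogy('man', 'king', 'woman', toy_vectors)
print(answer)
```

Randomly initialized vectors, like the untrained DBOW word vectors, carry no such offsets, which is consistent with their 0% score.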




In [23]:
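import torch
save_path = "./data/doc2vec.model"
# torch.save pickles arbitrary Python objects, so it is used here simply to
# store the list of trained gensim models in one file.
torch.save(simple_models, save_path)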

In [24]:
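save_path = "./data/doc2vec.model"
# Reload the pickled list of models. On PyTorch releases where torch.load
# defaults to weights_only=True, pass weights_only=False to unpickle them.
simple_models = torch.load(save_path)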

In [25]:
for model in simple_models:
    print(f"\nEvaluating {model}")
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))



Evaluating Doc2Vec<dbow,d100,n5,mc2,t48>

0.104520 Doc2Vec<dbow,d100,n5,mc2,t48>


Evaluating Doc2Vec<dm/m,d100,n5,w10,mc2,t48>

0.170320 Doc2Vec<dm/m,d100,n5,w10,mc2,t48>


Evaluating Doc2Vec<dm/c,d100,n5,w5,mc2,t48>

0.299000 Doc2Vec<dm/c,d100,n5,w5,mc2,t48>

