In [1]:
%matplotlib inline


Doc2Vec Model
=============

Introduces Gensim's Doc2Vec model and demonstrates its use on the Lee Corpus.




In [2]:
import pprint
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.ERROR)

Doc2Vec is a model that represents each document as a vector.

Bag-of-words Weaknesses
--------------------

* Bag-of-words models are surprisingly effective, but have several weaknesses.
* First, they discard word order: "John likes Mary" and "Mary likes John" map to identical vectors.
    - One remedy is the [bag of n-grams](https://en.wikipedia.org/wiki/N-gram) model, which represents documents as fixed-length vectors built from word phrases of length n. This captures local word order, but suffers from data sparsity and high dimensionality.
* Second, the model does not try to learn the meaning of the underlying words - therefore the distance between vectors doesn't always reflect the difference in meaning.
    - _Word2Vec_ addresses this problem.
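
The word-order weakness can be reproduced in a few lines. The sketch below uses only the standard library; the helper names `bag_of_words` and `bigrams` are made up for illustration and are not part of gensim.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return a count vector for `text` over a fixed, ordered vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def bigrams(text):
    """Return the list of adjacent word pairs (n-grams with n=2)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

vocabulary = ["john", "likes", "mary"]
vec_a = bag_of_words("John likes Mary", vocabulary)
vec_b = bag_of_words("Mary likes John", vocabulary)

# Both sentences collapse to the same bag-of-words vector...
print(vec_a, vec_b, vec_a == vec_b)
# ...while their bigrams differ, at the cost of a much larger feature space.
print(bigrams("John likes Mary") == bigrams("Mary likes John"))
```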

Word2Vec
--------
* *Word2Vec* embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors in which vectors that are close together have similar **contextual** meanings; e.g., 'strong' and 'powerful' would be close together, while 'strong' and 'Paris' would not.

* *Word2Vec* allows us to build vectors for each **word** in a document. But what if we want to build a vector for the **entire document**? We could average the word vectors for each word, but there is a better way.
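
For reference, the simple averaging baseline mentioned above might look like the following sketch. The three-dimensional vectors are invented toy values, not real Word2Vec output, and `average_vector` is a hypothetical helper name.

```python
# Toy 3-dimensional word vectors, invented for illustration only.
toy_word_vectors = {
    "strong":   [0.9, 0.1, 0.0],
    "powerful": [0.8, 0.2, 0.0],
    "paris":    [0.0, 0.1, 0.9],
}

def average_vector(tokens, word_vectors):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no known tokens to average")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# "unknown" has no vector, so only the first two tokens contribute.
doc_vector = average_vector(["strong", "powerful", "unknown"], toy_word_vectors)
print(doc_vector)
```

Like bag-of-words, this average is insensitive to word order, which is one reason a dedicated document vector can do better.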

Introducing: Paragraph Vector
-----------------------------
* IMPORTANT: In Gensim, we refer to the Paragraph Vector model as 'Doc2Vec'.

* Le and Mikolov in 2014 introduced the [Doc2Vec algorithm](https://cs.stanford.edu/~quocle/paragraph_vector.pdf), which usually outperforms simple averaging of *Word2Vec* vectors.

* The idea is to act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector.

* There are two implementations:
    1. Paragraph Vector - Distributed Memory (PV-DM)
    2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)

* PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector.

* PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)
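
To make the two synthetic prediction tasks concrete, here is a rough sketch of the (input, target) pairs each mode would train on. It deliberately leaves out the vectors, the network and the optimisation, and the function names are hypothetical; it is not how gensim implements either mode.

```python
def pv_dm_examples(doc_tag, tokens, window=2):
    """PV-DM: predict each center word from the doc tag plus its context words."""
    examples = []
    for i, center in enumerate(tokens):
        # Up to `window` words on each side of position i, excluding the center.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(((doc_tag, context), center))
    return examples

def pv_dbow_examples(doc_tag, tokens):
    """PV-DBOW: predict each word of the document from the doc tag alone."""
    return [(doc_tag, target) for target in tokens]

tokens = ["john", "likes", "mary"]
dm_pairs = pv_dm_examples(0, tokens, window=1)
dbow_pairs = pv_dbow_examples(0, tokens)

for pair in dm_pairs:
    print("PV-DM  ", pair)
for pair in dbow_pairs:
    print("PV-DBOW", pair)
```

In both cases the same doc tag appears in every example drawn from a document, which is what lets its vector accumulate information about the whole text.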

Prepare the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf)
----------------------------------

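* The corpus is included in gensim. It is described as containing 314 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics. Note that the `lee_background.cor` file used in this notebook yields 300 training documents, one per line.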

* We'll test our model using the much shorter [Lee Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf), which contains 50 documents.

In [3]:
import os, gensim

test_data_dir  = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file  = os.path.join(test_data_dir, 'lee.cor')

In [5]:
!ls $test_data_dir/*lee*

/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/d2v-lee-v0.13.0
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/lee_background.cor
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/lee.cor
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/lee_fasttext
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/lee_fasttext.bin
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/lee_fasttext_new.bin
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/lee_fasttext.vec
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/pang_lee_polarity.cor
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/pang_lee_polarity_fasttext.bin
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/pang_lee_polarity_fasttext.vec
/home/bjpcjp/.local/lib/python3.6/site-packages/gensim/test/test_data/varembed_lee_subcorpus.cor
/home/bj

Define a Function to Read and Preprocess Text
---------------------------------------------

* Below, we define a function to:
    - open the train/test file (with ISO-8859-1, i.e. Latin-1, encoding)
    - read the file line-by-line
    - pre-process each line (tokenize text into words, remove punctuation, set to lowercase, etc.)

The file we're reading is a **corpus**. Each line of the file is a **document**.

* To train the model, we'll need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.
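
As a gensim-free illustration of the tagging idea, the sketch below pairs each line with its zero-based line number. `TaggedLine` and `rough_tokenize` are hypothetical stand-ins: the tokenizer only approximates the kind of preprocessing described above, and the two sample lines are made up.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a tagged document, so the sketch runs without gensim.
TaggedLine = namedtuple("TaggedLine", ["words", "tags"])

def rough_tokenize(line):
    """Lowercase and keep alphabetic runs of 2+ letters (an approximation only)."""
    return [tok for tok in re.findall(r"[a-z]+", line.lower()) if len(tok) >= 2]

def tag_lines(lines):
    """Yield one tagged document per line; the tag is the zero-based line number."""
    for i, line in enumerate(lines):
        yield TaggedLine(rough_tokenize(line), [i])

sample_lines = [
    "Hundreds of people have been forced to vacate their homes.",
    "A new blaze near Goulburn has closed the Hume Highway.",
]
tagged = list(tag_lines(sample_lines))
for doc in tagged:
    print(doc.tags, doc.words)
```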

In [6]:
import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

Let's take a look at the training corpus:




In [7]:
pprint.pprint(train_corpus[:2])

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', '

* The testing corpus (a list of lists; no tags):

In [9]:
print(test_corpus[:2])

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to'

Training the Model
------------------
* Create a Doc2Vec model with 50-dimensional vectors.
* Iterate over the training corpus 40 times.
* Set the minimum word count to 2, to discard words with very few occurrences.
* Typical iteration counts in the published [Paragraph Vector paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) results, which use tens of thousands to millions of documents, are 10-20. More iterations take more time and eventually reach a point of diminishing returns.
* This is a small dataset (300 documents) with shortish documents (a few hundred words). Adding training passes can help with such small datasets.
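
The effect of a minimum word count can be sketched without gensim. The helper below is a hypothetical illustration of the idea (count every word, keep only those seen at least `min_count` times), not gensim's vocabulary builder, and the two toy documents are made up.

```python
from collections import Counter

def build_vocab(documents, min_count=2):
    """Count words across all documents and drop those seen fewer than min_count times."""
    counts = Counter(word for doc in documents for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

docs = [
    ["fire", "service", "says", "fire", "danger"],
    ["weather", "service", "warns", "of", "danger"],
]
vocab = build_vocab(docs, min_count=2)
print(vocab)  # only words occurring at least twice survive
```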

In [10]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

In [11]:
model.build_vocab(train_corpus)

* The vocabulary is a dictionary (_model.wv.vocab_) of all unique words extracted from the training corpus, along with their counts (e.g., _model.wv.vocab['penalty'].count_ gives the count for the word 'penalty').

* Next, train the model on the corpus. If the BLAS library is being used, training should take no more than 3 seconds.

In [12]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

* Use the trained model to infer a vector for any piece of text by passing a list of words to _model.infer_vector_. This vector can then be compared with other vectors via cosine similarity.

In [13]:
vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

[-0.20484999  0.29121417 -0.09885011 -0.10435056 -0.07113887  0.25424787
 -0.23287696 -0.13940692  0.3316893   0.31415215  0.03712632 -0.0121164
  0.13449863  0.15245944 -0.13940471 -0.05893684 -0.09646172 -0.06737366
 -0.10485979 -0.2108823   0.1520444   0.08312634 -0.05644798 -0.0951256
  0.12901355  0.067581    0.04278953 -0.10728648  0.26051214  0.07028764
 -0.02052462 -0.13310336 -0.00821643 -0.00613631  0.28000695  0.06918957
  0.00604747 -0.06568436  0.19088987  0.02761287 -0.00912968 -0.06838579
 -0.16202644 -0.13949049  0.02260514  0.06482696  0.02777114  0.08233481
  0.2015555   0.03069362]


* _infer_vector()_ does *not* take a string, but rather a list of string tokens, which should have already been tokenized the same way as the ``words`` property of the original training document objects.
* Repeated inferences of the same text can return slightly different vectors. This is because the underlying training/inference algorithms are iterative approximations that make use of internal randomization.
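
The cosine similarity used to compare vectors is simple to compute by hand. The function below is a plain standard-library sketch for illustration; the name `cosine_similarity` and the toy vectors are our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because only direction matters, two inferred vectors for the same text can differ slightly in their components and still have a similarity very close to 1.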

Assessing the Model
-------------------
* To assess our new model, we'll first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then return the rank of each document based on self-similarity.
* Basically, we're pretending that the training corpus is some new unseen data and seeing how it compares with the trained model. The expectation is that we've probably overfit our model (i.e., all of the ranks will be less than 2), so we should be able to find similar documents very easily.
* Additionally, we'll keep track of the second ranks for a comparison of less similar documents.

In [14]:
ranks, second_ranks = [],[]

for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims            = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank            = [docid for docid, sim in sims].index(doc_id)
    
    ranks.append(rank)
    second_ranks.append(sims[1])

In [15]:
import collections

counter = collections.Counter(ranks)
print(counter)

Counter({0: 293, 1: 7})
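
The counts above can be turned into proportions with a little arithmetic. This sketch hard-codes the counter printed in this particular run; since inference involves randomization, the exact counts may differ on another run.

```python
from collections import Counter

# Rank counts copied from the output of this run; other runs may differ slightly.
rank_counts = Counter({0: 293, 1: 7})
total = sum(rank_counts.values())

# Share of documents at each rank (rank 0 means the document matched itself first).
shares = {rank: count / total for rank, count in rank_counts.items()}
for rank in sorted(shares):
    print("rank {}: {} of {} ({:.1%})".format(rank, rank_counts[rank], total, shares[rank]))
```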


* About 98% of the inferred documents (293 of 300) are found to be most similar to themselves; about 2% of the time (7 of 300) a document is (mistakenly) most similar to another document. Checking the inferred vector against a training vector is a sort of 'sanity check' as to whether the model is behaving in a usefully consistent manner, though it is not a real 'accuracy' value.
* We can take a look at an example:

In [16]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))

print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)

for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, 
                              sims[index], 
                              ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

* Note: the most similar document (usually the same text) has a similarity score approaching 1.0. The similarity score for the second-ranked documents should be significantly lower (assuming the documents are in fact different) and the reasoning becomes obvious when we examine the text itself.

* We can run the next cell repeatedly to see a sampling of other target-document comparisons.

In [17]:
# Pick a random document from the corpus and infer a vector from the model
import random

doc_id = random.randint(0, len(train_corpus) - 1)
sim_id = second_ranks[doc_id]

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (26): «pakistan president pervez musharraf says he is ready to meet indian prime minister atal behari vajpayee as fears grow of war between the two countries tensions have escalated since suicide attack on the indian parliament two weeks ago india alleges the attack was backed by the pakistani intelligence service general musharraf says pakistan will never initiate conflict between the two countries he says he is prepared to hold talks with the indian prime minister at regional summit in nepal next week don mind meeting him but as ve said once before you can clap with one hand general musharraf said if there is willingness from the other side there will be willingness from my side»

Similar Document (12, 0.7191126942634583): «president general pervez musharraf says pakistan wants to defuse the brewing crisis with india but was prepared to respond vigorously to any attack pakistan stands for peace pakistan wants peace pakistan wants to reduce tension he said let the two c

Testing the Model
-----------------

Using the same approach as above, we'll infer the vector for a randomly chosen
test document, and compare the document to our model by eye.




In [18]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id          = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims            = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))

print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)

for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (6): «senior members of the saudi royal family paid at least million to osama bin laden terror group and the taliban for an agreement his forces would not attack targets in saudi arabia according to court documents the papers filed in us billion billion lawsuit in the us allege the deal was made after two secret meetings between saudi royals and leaders of al qa ida including bin laden the money enabled al qa ida to fund training camps in afghanistan later attended by the september hijackers the disclosures will increase tensions between the us and saudi arabia»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (261, 0.5970593690872192): «afghan opposition leaders meeting in germany have reached an agreement after seven days of talks on the structure of an interim post taliban government for afghanistan the agreement calls for the immediate assembly of temporary group of multi national peacekeepers in kabul and possibly other areas the four a

Resources
----------

* [Word2Vec Paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* [Doc2Vec Paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
* [Dr. Michael D. Lee's Website](http://faculty.sites.uci.edu/mdlee)
* [Lee Corpus](http://faculty.sites.uci.edu/mdlee/similarity-data)
* [IMDB Doc2Vec Tutorial](doc2vec-IMDB.ipynb)


