## doc2vec training exercise

In this exercise, you will train a Paragraph Vectors (doc2vec) model using gensim. The gensim doc2vec API is documented here: https://radimrehurek.com/gensim/models/doc2vec.html

N.B. You should be using Python 3 for this.

The data folder contains a train and test set with small sets of documents from the "20 newsgroups" dataset.

In [1]:
import os
from gensim.models import doc2vec
from gensim.utils import simple_preprocess

In [2]:
# generic settings
HOMEDIR = './'
CORPUS_FILE = os.path.join(HOMEDIR, "data/train_docs.txt")

# file names for the models we'll be creating
MODEL_FILE_DM = os.path.join(HOMEDIR, "models/doc2vec_DM_v20171229.bin")
MODEL_FILE_DBOW = os.path.join(HOMEDIR, "models/doc2vec_DBOW_v20171229.bin")

**Read the corpus. Each line is a document / paragraph. Optionally preprocess it first.**

In [3]:
flg_preprocess = True

if flg_preprocess:
    # with pre-processing: lowercase and tokenize each line, then tag it with its line number
    with open(CORPUS_FILE, 'r', encoding='utf-8') as f:
        docs = [simple_preprocess(line, deacc=False, min_len=1) for line in f]
    docs = [doc2vec.TaggedDocument(words, tags=[i]) for i, words in enumerate(docs)]
else:
    # quick & simple approach: split each line on whitespace and tag it with its line number
    docs = list(doc2vec.TaggedLineDocument(CORPUS_FILE))

In [5]:
# have a look at the data
docs[0]

TaggedDocument(words=['anarchism', 'is', 'a', 'political', 'philosophy', 'that', 'advocates', 'self', 'governed', 'societies', 'with', 'voluntary', 'institutions', 'these', 'are', 'often', 'described', 'as', 'stateless', 'societies', 'but', 'several', 'authors', 'have', 'defined', 'them', 'more', 'specifically', 'as', 'institutions', 'based', 'on', 'non', 'hierarchical', 'free', 'associations', 'anarchism', 'holds', 'the', 'state', 'to', 'be', 'undesirable', 'unnecessary', 'or', 'harmful', 'while', 'anti', 'statism', 'is', 'central', 'anarchism', 'entails', 'opposing', 'authority', 'or', 'hierarchical', 'organisation', 'in', 'the', 'conduct', 'of', 'human', 'relations', 'including', 'but', 'not', 'limited', 'to', 'the', 'state', 'system'], tags=[0])
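
Each entry pairs a token list with a list of tags, and the integer tag is what later identifies the document's vector. The sketch below mimics that structure with the standard library only. `TaggedDoc` and `tag_lines` are hypothetical stand-ins, and the regex tokenizer is only a rough approximation of `simple_preprocess`.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (words + tags).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_lines(lines):
    """Lowercase and tokenize each line, then tag it with its line index."""
    tagged = []
    for i, line in enumerate(lines):
        # Rough approximation of simple_preprocess: keep alphabetic runs only.
        tokens = re.findall(r"[a-z]+", line.lower())
        tagged.append(TaggedDoc(words=tokens, tags=[i]))
    return tagged


# Made-up input lines, for illustration only.
sample_lines = ["Anarchism is a political philosophy.", "Terrestrial albedo"]
sample_docs = tag_lines(sample_lines)
print(sample_docs[1])
```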

In [6]:
# train DM model
model_dm = doc2vec.Doc2Vec(docs,
                           size=200, # vector size (named vector_size in gensim >= 4.0); should match the pre-trained embedding size when not using dm_concat
                           window=10, # window size for word context, on each side
                           min_count=1, # minimum nr. of occurrences of a word
                           sample=1e-5, # threshold for downsampling high-frequency words
                           workers=4, # nr. of worker threads for multicore processing
                           hs=0, # if 1, use hierarchical softmax; if 0, use negative sampling
                           dm=1, # if 1 use PV-DM, if 0 use PV-DBOW
                           negative=5, # nr. of noise words to draw for negative sampling
                           dbow_words=1, # train word vectors alongside doc vectors (only has an effect in PV-DBOW mode)
                           dm_concat=1, # if 1, concatenate context vectors; if 0, sum/average them
                           iter=100 # nr. of epochs to train (named epochs in gensim >= 4.0)
                          )
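
With `dm_concat=1`, PV-DM concatenates the document vector with the vectors of all context words instead of averaging them, so the input to the prediction layer grows with the window. A back-of-the-envelope sketch for the settings above, assuming one tag per document and a full context window on both sides (the helper name is hypothetical):

```python
def dm_input_size(vector_size, window, dm_concat, tags_per_doc=1):
    """Width of the PV-DM input layer for the given settings."""
    if dm_concat:
        # One slot per tag plus one slot per context word on each side.
        return (tags_per_doc + 2 * window) * vector_size
    # Sum/average mode: all vectors are combined into a single one.
    return vector_size


concat_size = dm_input_size(vector_size=200, window=10, dm_concat=1)
mean_size = dm_input_size(vector_size=200, window=10, dm_concat=0)
print(concat_size, mean_size)  # 4200 200
```

The much wider input layer is why concatenation mode is slower to train and produces a larger model than averaging.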

In [7]:
# save it for later use
model_dm.save(MODEL_FILE_DM)

In [8]:
# train DBOW model
model_dbow = doc2vec.Doc2Vec(docs,
                            size=200, # vector size (named vector_size in gensim >= 4.0)
                            window=10, # window size for word context, on each side
                            min_count=1, # minimum nr. of occurrences of a word
                            sample=1e-5, # threshold for downsampling high-frequency words
                            workers=4, # nr. of worker threads for multicore processing
                            hs=0, # if 1, use hierarchical softmax; if 0, use negative sampling
                            dm=0, # if 1 use PV-DM, if 0 use PV-DBOW
                            negative=5, # nr. of noise words to draw for negative sampling
                            dbow_words=1, # also train word vectors (skip-gram) alongside the doc vectors
                            iter=100 # nr. of epochs to train (named epochs in gensim >= 4.0)
                            )
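
The `sample` threshold controls how aggressively frequent words are dropped during training. As a rough illustration, the rule described in the original word2vec paper discards a word with probability 1 - sqrt(t / f), where f is the word's relative frequency and t the threshold. gensim's implementation may differ in detail, and the frequencies below are made up.

```python
import math


def discard_probability(rel_freq, threshold=1e-5):
    """Probability of dropping a word under the word2vec paper's rule."""
    if rel_freq <= threshold:
        return 0.0  # rare words are always kept
    return 1.0 - math.sqrt(threshold / rel_freq)


# Hypothetical relative frequencies, for illustration only.
for word, freq in [("the", 0.05), ("albedo", 1e-4), ("fresnel", 1e-6)]:
    print(word, round(discard_probability(freq), 3))
```

With a threshold as low as 1e-5, most occurrences of very common words are skipped under this rule, while rare words are untouched.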

In [9]:
model_dbow.save(MODEL_FILE_DBOW)

In [17]:
def show_most_similar(model, docs, ref_doc_id):
    """
    For a given document, display the most similar ones in the corpus
    """
    def print_doc(doc_id):
        doc_txt = ' '.join(docs[doc_id].words)
        print("[Doc {}]: {}".format(doc_id, doc_txt))
        
    print("[Original document]")
    print_doc(ref_doc_id)
    print("\n[Most similar documents]")
    # note: model.docvecs is exposed as model.dv in gensim >= 4.0
    for doc_id, similarity in model.docvecs.most_similar(ref_doc_id, topn=3):
        print("-----------------")
        print("similarity: {}".format(similarity))
        print_doc(doc_id)


In [18]:
show_most_similar(model_dbow, list(docs), 200)

[Original document]
[Doc 200]: single scattering albedo is used to define scattering of electromagnetic waves on small particles it depends on properties of the material lrb refractive index rrb the size of the particle or particles and the wavelength of the incoming radiation

[Most similar documents]
-----------------
similarity: 0.8602718114852905
[Doc 154]: terrestrial albedo
-----------------
similarity: 0.8536304235458374
[Doc 187]: water reflects light very differently from typical terrestrial materials the reflectivity of a water surface is calculated using the fresnel equations lrb see graph rrb
-----------------
similarity: 0.8468575477600098
[Doc 148]: albedo lrb rrb or reflection coefficient derived from latin albedo whiteness lrb or reflected sunlight rrb in turn from albus white is the diffuse reflectivity or reflecting power of a surface
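
The similarity scores above are cosine similarities between document vectors. A minimal pure-Python sketch of that ranking, using made-up three-dimensional vectors (the function names and data are hypothetical, not gensim's API):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(vectors, ref_id, topn=3):
    """Return (doc_id, similarity) pairs for the topn nearest documents."""
    ref = vectors[ref_id]
    scores = [(i, cosine(ref, v)) for i, v in enumerate(vectors) if i != ref_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, for illustration only.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
ranking = rank_by_similarity(toy_vectors, ref_id=0, topn=2)
print(ranking)
```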


## Prediction phase

In [18]:
test_data_file = os.path.join(HOMEDIR, "data/test_docs.txt")

In [19]:
# inference hyper-parameters
start_alpha = 0.01  # initial learning rate for inference
infer_epoch = 1000  # nr. of inference passes per document

In [20]:
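# tokenize the test documents the same way as the training corpus
with open(test_data_file, "r", encoding="utf-8") as f:
    test_docs = [simple_preprocess(line, deacc=False, min_len=1) for line in f]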

In [21]:
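# infer vectors with the DBOW model (model_dm works the same way); steps is named epochs in gensim >= 4.0
test_docvecs = [model_dbow.infer_vector(d, alpha=start_alpha, steps=infer_epoch) for d in test_docs]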
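
`infer_vector` keeps the trained weights fixed and fits a fresh vector for the unseen document over several passes while the learning rate decays. The sketch below shows a linear decay schedule of that kind. The final rate `min_alpha=0.0001` and the helper name are assumptions for illustration, and the exact schedule depends on the gensim version.

```python
def linear_alpha_schedule(start_alpha, min_alpha, steps):
    """Learning rate for each inference pass, decaying linearly."""
    if steps == 1:
        return [start_alpha]
    delta = (start_alpha - min_alpha) / (steps - 1)
    return [start_alpha - i * delta for i in range(steps)]


# start_alpha and steps mirror the inference hyper-parameters; min_alpha is an assumed value.
alphas = linear_alpha_schedule(start_alpha=0.01, min_alpha=0.0001, steps=1000)
print(alphas[0], alphas[-1])
```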

In [22]:
test_docvecs

[array([  1.56096563e-01,  -3.52686830e-02,   7.04447329e-02,
         -4.29527387e-02,   1.11410789e-01,  -4.64408919e-02,
         -2.96674911e-02,   1.07038237e-01,  -6.71069622e-02,
         -9.20917839e-02,   6.43559173e-02,   8.43140408e-02,
          1.79991331e-02,   2.16307156e-02,  -1.07286192e-01,
         -4.07555588e-02,   1.44115537e-02,   8.09284970e-02,
          2.90366393e-02,  -4.43438366e-02,   4.48280685e-02,
          9.74932671e-01,  -2.77833380e-02,  -1.26351580e-01,
          1.64782535e-02,  -5.45104481e-02,  -5.02871685e-02,
         -1.58917114e-01,  -3.49505804e-02,   4.03022841e-02,
          1.42115220e-01,   1.94461614e-01,   7.65543357e-02,
          9.08669826e-05,  -6.34886418e-03,  -2.58497596e-02,
         -4.56970036e-02,  -2.96264980e-02,   4.01529670e-02,
         -6.39744475e-02,   7.17852591e-03,  -1.98921971e-02,
         -7.80529389e-03,   3.09865945e-03,   1.21870581e-02,
          3.76222581e-02,  -1.70244910e-02,   8.32777619e-02,
        