Finding document vectors from pre-trained word2vec word vectors

Document2Vec

MIT License

How to install

Simply install from the git repo like so:

pip install -e git+https://github.com/cemoody/Document2Vec.git#egg=Package
# on a shared machine without system-wide Python access, add --user

How to use

The word2vec file must be a model trained with gensim's Word2Vec; Mikolov's pre-trained vectors will not work. Training a new document vector requires the syn1 layer, which the C version of word2vec throws away.

Initialize Document2Vec with pre-trained word vectors from a pre-existing word2vec training run like so:

from document2vec.document2vec import Document2Vec
from document2vec.corpora import SeriesCorpus
import pandas as pd
# This must be a gensim Word2Vec or Doc2Vec pickle
d2v = Document2Vec("/home/moody/projects/Parachute/data/data-all-02.py2")
sentences = pd.Series(['i love jackets', 'blue is my favorite color'])
corpus = SeriesCorpus(sentences)
doc_vectors = d2v.transform(corpus)

And then semantic similarities can be evaluated directly:

from scipy.spatial.distance import cosine
# vector for 'i love jackets'
v0 = doc_vectors[0, :]
# vector for the word 'jackets'
v1 = d2v['jackets']
similarity = 1 - cosine(v0, v1)
print(similarity) # 0.320
# Of course, the similarity with a word that is literally
# in the sentence is going to be quite high
# What if we try something similar, like coats?
v2 = d2v['coats']
similarity = 1 - cosine(v0, v2)
print(similarity) # 0.265
# And if we try something very dissimilar from the sentence,
# like the city of New York, we get low similarity:
v3 = d2v['new_york']
similarity = 1 - cosine(v0, v3)
print(similarity) # 0.02
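
For reference, scipy's cosine returns a distance, which is why the snippets above subtract it from 1. The following dependency-free sketch computes the similarity directly; the function name cosine_similarity and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: same direction, orthogonal, and opposite direction
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Values close to 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.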

Monitoring training

It can be useful to monitor the training over many iterations to make sure doc2vec is (at least locally) doing what it should be doing:

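from __future__ import print_function  # keeps the snippet runnable on Python 2 and 3
from scipy.spatial.distance import cosine
def monitor(model):
    # Print the current learning rate, then the cosine similarity between
    # the first document vector ('SENT_0') and a few probe words
    print(model.alpha, end=' ')
    for word in ['jackets', 'jacket', 'coats', 'dog']:
        print(word, ': ', 1.0 - cosine(model['SENT_0'], model[word]), end=' ')
    print()
d2v.monitor = monitor
doc_vectors = d2v.transform(corpus)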

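This prints something similar to the following, where the first column is the learning rate (model.alpha) and the remaining values are cosine similarities to the first document: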

0.25000 jackets :  0.347975713494 jacket :  0.150385576332 coats : 0.305263268479 dog :  0.121432161320
0.20002 jackets :  0.301431248517 jacket :  0.113824911821 coats : 0.272647329817 dog :  0.125565730551
0.15004 jackets :  0.296385793196 jacket :  0.108801409463 coats : 0.267922727947 dog :  0.126922837909
0.10006 jackets :  0.293973052240 jacket :  0.106190931536 coats : 0.265730524733 dog :  0.126504370045
0.05008 jackets :  0.293425048701 jacket :  0.105495592420 coats : 0.264931351959 dog :  0.125495564005
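
If the numbers are needed for later inspection rather than just printed, the callback can collect them instead. The sketch below assumes only what the example above shows, namely that the monitor is called with a model exposing alpha and word lookup by key. The names make_recording_monitor and _FakeModel are hypothetical, and the fake model exists only so the snippet runs on its own.

```python
import math


def _cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def make_recording_monitor(words, doc_key='SENT_0'):
    """Build a monitor callback plus the list it appends one row to per call."""
    history = []

    def monitor(model):
        row = {'alpha': model.alpha}
        for word in words:
            row[word] = _cosine_similarity(model[doc_key], model[word])
        history.append(row)

    return monitor, history


class _FakeModel(object):
    """Stand-in for the trained model, used only for this demonstration."""

    def __init__(self):
        self.alpha = 0.25
        self._vectors = {
            'SENT_0': [1.0, 0.0],
            'jackets': [2.0, 0.0],
            'dog': [0.0, 3.0],
        }

    def __getitem__(self, key):
        return self._vectors[key]


monitor, history = make_recording_monitor(['jackets', 'dog'])
model = _FakeModel()
monitor(model)      # first pass
model.alpha = 0.20  # pretend the learning rate has decayed
monitor(model)      # second pass

for row in history:
    print(row)
```

With the real object, the returned callback would be attached as d2v.monitor = monitor before calling d2v.transform(corpus), exactly as in the printing version.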