# PSDVec

PSDVec is the source code accompanying the paper "A Generative Word Embedding Model and its Low Rank Positive Semidefinite Solution" (EMNLP 2015).

See "PSDVec.pdf" for a manual (PSDVec: a Toolbox for Incremental and Scalable Word Embedding, accepted by Neurocomputing, 2016).

Update v0.42: Tikhonov regularization (equivalent to a spherical Gaussian prior) on the embeddings in block-wise factorization. The workflow is as follows (a sketch of the underlying solve appears after the steps):

1. Obtain 25000 core embeddings using Weighted PSD Approximation, saved to `25000-500-EM.vec`:

   `python factorize.py -w 25000 top2grams-wiki.txt`

2. Obtain 55000 noncore embeddings using Weighted Least Squares with Tikhonov coefficient 2, totaling 80000 embeddings (25000 core + 55000 noncore), saved to `25000-80000-500-BLK-2.0.vec`:

   `python factorize.py -v 25000-500-EM.vec -o 55000 -t2 top2grams-wiki.txt`

3. Incrementally learn another 50000 noncore embeddings (based on the 25000 core embeddings) with Tikhonov coefficient 4, saved to `25000-130000-500-BLK-4.0.vec`:

   `python factorize.py -v 25000-80000-500-BLK-2.0.vec -b 25000 -o 50000 -t4 top2grams-wiki.txt`

4. Repeat step 3 with Tikhonov coefficient 8 to obtain embeddings of 50000 rarer words, saved to `25000-180000-500-BLK-8.0.vec`:

   `python factorize.py -v 25000-130000-500-BLK-4.0.vec -b 25000 -o 50000 -t8 top2grams-wiki.txt`
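
Steps 2–4 all reduce to the same core operation: each noncore embedding is fit against the fixed core embeddings by Tikhonov-regularized weighted least squares (i.e. weighted ridge regression). Below is a minimal sketch of that closed-form solve; it is not the repo's actual code, and `V`, `pmi`, and `weights` are hypothetical stand-ins for quantities that `factorize.py` derives from `top2grams-wiki.txt`.

```python
import numpy as np

def solve_noncore(V, pmi, weights, tikhonov):
    """Sketch (not PSDVec's actual code): return the v minimizing
        sum_i weights[i] * (pmi[i] - V[i] @ v)**2 + tikhonov * ||v||^2,
    the ridge regression of one noncore word's PMI column against
    the fixed core embeddings V (n_core x d)."""
    d = V.shape[1]
    WV = V * weights[:, None]            # weight each core-embedding row
    A = V.T @ WV + tikhonov * np.eye(d)  # regularized normal matrix (d x d)
    b = WV.T @ pmi                       # V^T diag(weights) pmi
    return np.linalg.solve(A, b)

# Toy usage with random stand-in data (small dims for speed);
# -t4 in step 3 corresponds to tikhonov=4.0.
rng = np.random.default_rng(0)
V = rng.normal(size=(1000, 50))
v = solve_noncore(V, rng.normal(size=1000), rng.random(1000), tikhonov=4.0)
```

A larger Tikhonov coefficient shrinks the solution toward zero, which is why steps 3 and 4 raise it (4, then 8) as the words being added get rarer and their co-occurrence statistics get noisier.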

Pretrained embeddings of 180,000 words and the evaluation results have been uploaded. The performance is now consistently better than that of the competing methods on the included testsets.
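
The pretrained files can be loaded with a few lines of numpy. The sketch below assumes the `.vec` files use the common word2vec text layout (a header line with vocabulary size and dimension, then one word per line followed by its vector components); if PSDVec's format differs, adjust the parsing.

```python
import numpy as np

def load_vec(path):
    """Hedged sketch: parse a word2vec-style text .vec file into a word
    list and an (n, d) float32 matrix, assuming the layout described above."""
    with open(path, encoding="utf-8") as f:
        n, d = map(int, f.readline().split())
        words, mat = [], np.empty((n, d), dtype=np.float32)
        for i, line in enumerate(f):
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            mat[i] = np.asarray(parts[1:d + 1], dtype=np.float32)
    return words, mat

# words, V = load_vec("25000-180000-500-BLK-8.0.vec")
```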

Update v0.41: Gradient Descent (GD) solution:

* `python factorize.py -G 500 -w 120000 top2grams-wiki.txt`
* GD is fast and scalable, but its performance is much worse (~10% lower on the testsets). It is not recommended unless initialized with unweighted eigendecomposition, which is itself not scalable. An illustrative sketch follows.
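
For intuition, here is a minimal dense sketch of gradient descent on the weighted symmetric factorization objective `J(V) = sum_ij W[i,j] * (P[i,j] - V[i] @ V[j])**2`, where `P` is the PMI matrix and `W` holds the co-occurrence weights. It is illustrative only and is not the `-G` implementation in `factorize.py`.

```python
import numpy as np

def gd_factorize(P, W, dim=500, lr=1e-4, n_iters=200, seed=0):
    """Sketch: full-batch gradient descent minimizing
    J(V) = sum_ij W[i,j] * (P[i,j] - V[i] @ V[j])**2,
    assuming W and P are symmetric (n x n) arrays."""
    rng = np.random.default_rng(seed)
    V = 0.01 * rng.normal(size=(P.shape[0], dim))
    for _ in range(n_iters):
        R = W * (P - V @ V.T)    # elementwise-weighted residual matrix
        V += lr * 4.0 * (R @ V)  # dJ/dV = -4 R V, so this steps downhill
    return V
```

Each full gradient step costs O(n^2 d), so a version that scales to large vocabularies would subsample word pairs or process the matrix block by block.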

Update v0.4: Online Block-wise Factorization

The testsets are courtesy of Omer Levy (https://bitbucket.org/omerlevy/hyperwords/src).

The Gradient Descent algorithm is based on a suggestion by Peilin Zhao (it is not described in the papers).
