Minkowski

Overview

Minkowski implements skip-gram training for learning word embeddings from continuous text in hyperbolic space. Based on code from fastText, it represents each word in the vocabulary as a point on the hyperboloid model of hyperbolic space, which sits inside Minkowski space. The embeddings are optimized by negative sampling to minimize the hyperbolic distance between co-occurring words.
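
For orientation, these are the standard definitions of the hyperboloid model (not quoted from the source): writing the Minkowski bilinear form on R^{n+1} as

    \langle x, y \rangle_M = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i,

the hyperboloid is the sheet

    \mathbb{H}^n = \{ x \in \mathbb{R}^{n+1} : \langle x, x \rangle_M = -1,\; x_0 > 0 \},

and the hyperbolic distance minimized during training is

    d(x, y) = \operatorname{arcosh}(-\langle x, y \rangle_M).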

The differences from fastText are as follows:

  • Word vectors are situated on the hyperboloid model of hyperbolic space.
  • The similarity of two vectors decreases as their hyperbolic distance grows (see the sketch after this list).
  • In multithreaded training, individual word vectors are locked while being updated, so that no other thread can overwrite them and thereby violate the hyperboloid constraint.
  • Start and end learning rates can be specified, together with a number of burn-in epochs run at a fixed, lower learning rate.
  • Intermediate word vectors can be stored using the -checkpoint-interval command line argument.
  • The power to which the unigram distribution is raised for negative sampling can be specified (see -distribution-power under Usage).
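
As a concrete illustration of the first three points, here is a minimal C++ sketch of the Minkowski bilinear form, the resulting hyperbolic distance, and the projection that restores the hyperboloid constraint after an update. The function names are illustrative assumptions, not the repository's actual API:

// Minimal sketch (illustrative names, not the repository's API) of the
// hyperboloid geometry: Minkowski bilinear form, hyperbolic distance,
// and the projection back onto the hyperboloid <x, x>_M = -1.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minkowski bilinear form <x, y>_M = -x[0]*y[0] + sum_{i>=1} x[i]*y[i].
double minkowski_dot(const std::vector<double>& x, const std::vector<double>& y) {
    double dot = -x[0] * y[0];
    for (std::size_t i = 1; i < x.size(); ++i) dot += x[i] * y[i];
    return dot;
}

// Hyperbolic distance between two points on the hyperboloid; similarity
// decreases as this grows. The clamp guards against floating-point
// rounding pushing the acosh argument just below 1.
double hyperbolic_distance(const std::vector<double>& x, const std::vector<double>& y) {
    return std::acosh(std::max(1.0, -minkowski_dot(x, y)));
}

// Restore <x, x>_M = -1 after a gradient step by recomputing the
// time-like coordinate x[0] from the space-like coordinates.
void project_to_hyperboloid(std::vector<double>& x) {
    double sq = 0.0;
    for (std::size_t i = 1; i < x.size(); ++i) sq += x[i] * x[i];
    x[0] = std::sqrt(1.0 + sq);
}

Keeping the constraint intact is exactly why the per-vector locks in the multithreaded trainer matter: a half-applied concurrent update could leave a vector off the hyperboloid.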

Installation

In order to build the executable, a recent C++ compiler and CMake need to be installed (tested with g++ 5.4.0 and CMake 3.11.0-rc2).

The following commands produce the executable minkowski in the build directory:

git clone ... ./minkowski
mkdir minkowski-build && cd minkowski-build
cmake ../minkowski
make

Usage

The following command line parameters are available:

$ ./minkowski 
Empty input or output path.
  -input                  training file path
  -output                 output file path
  -min-count              minimal number of word occurrences [5]
  -t                      sub-sampling threshold (0=no subsampling) [0.0001]
  -start-lr               start learning rate [0.05]
  -end-lr                 end learning rate [0.05]
  -burnin-lr              fixed learning rate for the burnin epochs [0.05]
  -max-step-size          max. dist to travel in one update [2]
  -dimension              dimension of the Minkowski ambient [100]
  -window-size            size of the context window [5]
  -init-std-dev           stddev of the hyperbolic distance from the base point for initialization [0.1]
  -burnin-epochs          number of extra prelim epochs with burn-in learning rate [0]
  -epochs                 number of epochs with learning rate linearly decreasing from -start-lr to -end-lr [5]
  -number-negatives       number of negatives sampled [5]
  -distribution-power     power used to modify the unigram distribution for negative sampling [0.5]
  -checkpoint-interval    save vectors every this many epochs [-1]
  -threads                number of threads [12]
  -seed                   seed for the random number generator [1]
                          n.b. only deterministic if single threaded!
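
To make -distribution-power concrete: negative samples are drawn from the unigram distribution raised to the given power (word2vec's classic choice is 0.75; the default here is 0.5). A common way to implement this, used by fastText, is a precomputed sampling table. The sketch below assumes that construction; the names and the table size are illustrative, not taken from this repository:

// Sketch (illustrative, fastText-style): build a table in which word i
// occupies a share of slots proportional to counts[i]^power, so that a
// uniform random index into the table samples negatives from the
// smoothed unigram distribution.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::int32_t> build_negative_table(const std::vector<std::int64_t>& counts,
                                               double power,
                                               std::size_t table_size = 10000000) {
    double z = 0.0;  // normalizer: sum over counts[i]^power
    for (std::int64_t c : counts) z += std::pow(static_cast<double>(c), power);
    std::vector<std::int32_t> table;
    table.reserve(table_size);
    for (std::size_t i = 0; i < counts.size(); ++i) {
        double share = std::pow(static_cast<double>(counts[i]), power) / z;
        auto slots = static_cast<std::size_t>(share * static_cast<double>(table_size));
        for (std::size_t j = 0; j < slots; ++j) table.push_back(static_cast<std::int32_t>(i));
    }
    return table;
}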

An example call looks like this:

$ ./minkowski -input textfile.txt -output embeddings -dimension 50 -start-lr 0.1 \
  -end-lr 0 -epochs 3 -min-count 15 -t 1e-5 -window-size 10 -number-negatives 10 \
  -threads 64

References

[1] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean: Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781. [pdf](https://arxiv.org/pdf/1301.3781.pdf)

[2] Maximilian Nickel, Douwe Kiela: Poincaré Embeddings for Learning Hierarchical Representations. NIPS 2017. [pdf](https://papers.nips.cc/paper/7213-poincare-embeddings-for-learning-hierarchical-representations.pdf)
