This project provides an enhanced version of the original word2vec code. In addition to the normal functionality (i.e., training word vectors based on their surrounding context), this implementation also lets you train word embeddings tailored to a particular user-defined task (in addition to or instead of the normal objective).
To build this project, change to the build directory inside the checked-out repository and execute the following commands:
cmake ../
make
This will look for the necessary libraries, adjust the compilation options, and compile the executable files. Currently, this project depends on the following third-party utilities:
In order to test the built program, you should run the following command:
Afterwards, you can start using the compiled word2vec. You can find examples of input data in the tests/ directory of this project.
To run the normal word2vec training, execute the following command (from the directory in which you built the project):
./bin/word2vec -min-count 0 -train ../tests/test_1.0.in
This will train the vanilla word2vec embeddings, which, however, might differ slightly from the original results when trained with multiple threads.
If you, however, want to train embeddings with respect to a particular task (e.g., predicting the subjective polarity of a sentence), you can launch:
./bin/word2vec -ts -min-count 0 -train ../tests/test_2.0.in
Then, the resulting word vectors will be trained to best fit your custom task. The labels for each task should be specified as contiguous non-negative integers starting from zero (i.e., if a task has three classes, the labels to use should be 0, 1, and 2) and separated from the main text by a tab character, e.g.:
Ich fahre morgen nach Hause.\t0
Ich bin sehr froh dich zu sehen.\t1
Schade, dass wir uns nicht getroffen haben.\t2
If the label for a task is not known, put an underscore (_) in place of the tag. In the same way, you can also specify multiple tags for different objectives, each in its own tab-separated column, e.g.:
Ich fahre morgen nach Hause.\t0\t1
Ich bin sehr froh dich zu sehen.\t1\t_
Schade, dass wir uns nicht getroffen haben.\t2\t0
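The input format described above can be checked before training with a small script. The following is only a sketch, not part of this project: the function names parse_line and check_contiguous are hypothetical, and the script assumes that every line carries the same number of label columns, which the text above does not state explicitly.

```python
def parse_line(line, n_tasks):
    """Split one input line into (text, labels); None marks an unknown label."""
    text, *fields = line.rstrip("\n").split("\t")
    if len(fields) != n_tasks:
        raise ValueError(
            f"expected {n_tasks} label column(s), got {len(fields)}: {line!r}"
        )
    labels = []
    for field in fields:
        if field == "_":
            labels.append(None)  # label not known for this task
        elif field.isascii() and field.isdigit():
            labels.append(int(field))  # non-negative integer label
        else:
            raise ValueError(f"invalid label {field!r} in line {line!r}")
    return text, labels


def check_contiguous(rows, n_tasks):
    """Verify that each task uses exactly the labels 0, 1, ..., k-1."""
    for task in range(n_tasks):
        seen = {labels[task] for _, labels in rows if labels[task] is not None}
        if seen != set(range(len(seen))):
            raise ValueError(
                f"labels of task {task} are not contiguous from zero: {sorted(seen)}"
            )


# The two-objective example from above, with real tab characters.
sample = [
    "Ich fahre morgen nach Hause.\t0\t1",
    "Ich bin sehr froh dich zu sehen.\t1\t_",
    "Schade, dass wir uns nicht getroffen haben.\t2\t0",
]
rows = [parse_line(line, n_tasks=2) for line in sample]
check_contiguous(rows, n_tasks=2)
for text, labels in rows:
    print(labels, text)
```

Unknown labels are kept as None, so they are ignored by the contiguity check instead of being counted as a class.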
Besides the -ts mode, which trains purely task-specific embeddings, we also provide a couple of in-between solutions:
With the -ts-w2v option, you can simultaneously train both the word2vec and the task-specific objectives, in which case the word embeddings are shared and updated to fit both objectives.
Alternatively, you can use the -ts-least-sq option, in which case the word2vec and the task-specific embeddings are trained independently. In the final step, however, the task-specific embeddings of words that did not appear in the task-labeled lines are computed from their word2vec representations using the linear least-squares method.
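To illustrate the idea behind the least-squares step, here is a minimal sketch in plain Python. It is not this project's implementation. It assumes that a linear map W is fitted on the words that have both a word2vec and a task-specific vector, and that W is then applied to the word2vec vectors of the remaining words. All names and the toy vectors are made up for illustration.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ w = b for w by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment the square matrix a with the right-hand-side columns of b.
    aug = [list(map(float, a[i])) + list(map(float, b[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few independent training words")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_least_squares(x, y):
    """Return W minimizing ||x @ W - y||^2 via the normal equations."""
    xt = transpose(x)
    return solve(matmul(xt, x), matmul(xt, y))


# Toy 2-dimensional vectors (hypothetical) of three words that occurred
# in task-labeled lines: one row per word.
w2v_seen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
task_seen = [[2.0, 1.0], [0.0, 3.0], [2.0, 4.0]]

w = fit_least_squares(w2v_seen, task_seen)

# A word that never occurred in a task-labeled line has only a word2vec
# vector; its task-specific vector is estimated by applying the fitted map.
w2v_unseen = [[2.0, -1.0]]
task_unseen = matmul(w2v_unseen, w)
print(task_unseen)
```

The normal equations are used here only to keep the sketch free of dependencies. For realistic embedding sizes, a numerically more stable solver from a linear algebra library would be the usual choice.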
To build the documentation for the compiled executable, you need to
install Doxygen prior to executing
after the Makefiles have been generated.