commented word2vec

Google word2vec C code with comments. Run

    git diff original

to see the complete set of changes to the original code.

Cross-reference this code with the original papers:

and the essential follow-up paper:

The following gensim blog posts by Radim Řehůřek are very interesting and informative:

Shameless plug: I also wrote a C++ implementation of word2vec, specifically the skip-gram with negative sampling (SGNS) algorithm, that supports streaming. The vocabulary and embedding model are learned in one pass; see the write-up on arXiv for details:

The original content of README.txt follows the break.


Tools for computing distributed representations of words

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the following:

  • desired vector dimensionality
  • the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
  • training algorithm: hierarchical softmax and/or negative sampling
  • threshold for downsampling the frequent words
  • number of threads to use
  • the format of the output word vector file (text or binary)

Usually, the other hyper-parameters, such as the learning rate, do not need to be tuned for different training sets.

The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/