This repository contains the word2vec C code, but with comments. Run
git diff original
to see the complete set of changes to the original code.
Cross-reference this code with the original papers:
- Distributed Representations of Words and Phrases and their Compositionality
- Efficient Estimation of Word Representations in Vector Space
and the essential follow-up paper:
The following gensim blog posts by Radim Řehůřek are very interesting and informative:
- Making sense of word2vec
- Deep learning with word2vec and gensim
- Word2vec in Python, Part Two: Optimizing
- Parallelizing word2vec in Python
And, as a shameless plug, I wrote a C++ implementation of word2vec (the skip-gram with negative sampling (SGNS) algorithm) that also supports streaming: the vocabulary and embedding model are learned in a single pass. See the write-up on arXiv for details:
- athena (naive-lm-train-raw is the entry point for basic SGNS training)
The original content of README.txt follows the break.

--------------------------------------------------
Tools for computing distributed representations of words
We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.
Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the following (a sample invocation covering these options is sketched after the list):
- desired vector dimensionality
- the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
- training algorithm: hierarchical softmax and / or negative sampling
- threshold for downsampling the frequent words
- number of threads to use
- the format of the output word vector file (text or binary)
Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets.
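For concreteness, here is a sketch of how these options map onto the word2vec command-line flags. The flag names are those of the tool itself; the corpus file, output file, and parameter values shown are illustrative assumptions, not recommendations.

  # Sketch of a training run; file names and parameter values are illustrative.
  #   -size      vector dimensionality
  #   -window    context window size
  #   -cbow      1 = Continuous Bag-of-Words, 0 = skip-gram
  #   -hs        1 = hierarchical softmax, 0 = off
  #   -negative  number of negative samples (0 disables negative sampling)
  #   -sample    downsampling threshold for frequent words
  #   -threads   number of worker threads
  #   -binary    1 = binary output file, 0 = text output file
  ./word2vec -train corpus.txt -output vectors.bin -size 200 -window 5 \
    -cbow 0 -hs 0 -negative 5 -sample 1e-3 -threads 8 -binary 1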
The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.
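As a rough sketch of that workflow (the output file name vectors.bin is an assumption here; check the script for the actual path it writes), the trained vectors can be explored with the bundled distance tool:

  ./demo-word.sh          # download the corpus, build the tools, train a model
  ./distance vectors.bin  # then type a word (e.g. "paris") to list its nearest neighbours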
More information about the scripts is provided at https://code.google.com/p/word2vec/