word2vec_commented

This project is a functionally unaltered copy of Google's published word2vec implementation in C, with added source comments.

If you're new to word2vec, I recommend reading my tutorial first.

My focus is on the code in word2vec.c for training:

  • I have commented both the skip-gram and CBOW architectures with negative sampling.
  • I haven't commented the hierarchical softmax code yet for either architecture.

I have also commented the word2phrase.c tool, which implements the phrase detection.

I haven't looked much at the testing code.

Because the code supports both models and both training approaches, I highly recommend viewing the code in an editor that allows you to collapse code blocks. The training code is much more readable when you hide the implementations you aren't interested in.

word2vec Model Training

word2vec training occurs in word2vec.c.

Almost all of the functions in word2vec.c are related to building and maintaining the vocabulary. If you set aside the vocabulary functions, here's what's left:

  • main() - Entry point to the program.
    • Parses the command line arguments.
  • TrainModel() - Main entry point to the training process.
    • Learns the vocabulary, initializes the network, and kicks off the training threads.
  • TrainModelThread() - Performs the actual training.
    • Each thread operates on a different portion of the input text file.

Text Parsing

The word2vec C project does not include code for parsing and tokenizing your text. It simply accepts a training file with words separated by whitespace (spaces, tabs, or newlines). This means that you'll need to handle the removal of things like punctuation separately.

The code expects the text to be divided into sentences (with a default maximum length of 1,000 words). The end of a sentence is marked by a single newline character "\n"; that is, there should be one sentence per line in the file.

I've also commented the word2phrase.c tool, which handles phrase detection. This tool produces a new version of your training file in which phrases like "New York" are replaced with a single token, "New_York". Each run of the word2phrase tool only looks at combinations of two words (or tokens), so the first pass would turn "New York" into "New_York" and a second pass would turn "New_York City" into "New_York_City".

Building the Vocabulary

word2vec.c includes code for constructing a vocabulary from the input text file.

The code supports fast lookup of vocab words through a hash table, which maps word strings to their respective vocab_word objects.

The completed vocabulary consists of the following:

  • vocab_word - A structure containing a word and its metadata, such as its frequency (word count) in the training text.
  • vocab - The array of vocab_word objects for every word.
  • vocab_hash - A hash table which maps word hash codes to the index of the word in the vocab array. The word hash is calculated using the GetWordHash function.

Learning the vocabulary starts with the LearnVocabFromTrainFile function. Tokens from the training text are added to the vocabulary, and the frequency (word count) for each word is tracked.

If the vocabulary grows beyond 70% of the hash table size, the code will prune the vocabulary by eliminating the least frequent words. This is to minimize the occurrence (and performance impact) of hash collisions.
