A pure-NumPy implementation of Word2Vec, built from scratch. No PyTorch, TensorFlow, or other ML frameworks are used.
| Feature | Detail |
|---|---|
| Algorithms | Skip-gram + Negative Sampling (SGNS) and CBOW |
| Batched training | Vectorised mini-batch updates via NumPy einsum |
| Subsampling | Frequent-word subsampling (Mikolov 2013 §2.3) |
| Noise distribution | Unigram^0.75 table for O(1) negative sampling |
| LR schedule | Linear decay with configurable floor |
| Evaluation | Cosine NN lookup, word analogies, t-SNE, loss curves |
| Checkpointing | Save/load model weights as .npz |
| CLI | train.py with full argument parser |
```
word2vec-numpy/
├── src/
│   ├── __init__.py
│   ├── vocabulary.py       # vocab building, subsampling, noise table
│   ├── preprocessing.py    # text cleaning, tokenisation, pair generation
│   ├── word2vec.py         # model: forward pass, loss, gradients, SGD update
│   └── trainer.py          # training loop, LR schedule, checkpointing
├── evaluate.py             # analogy eval, nearest neighbours, t-SNE, plots
├── train.py                # CLI entry point
├── requirements.txt
└── README.md
```
```bash
# Install dependencies (only numpy is required; matplotlib + sklearn for plots)
pip install -r requirements.txt

# Train on the built-in sample corpus (runs in ~30 seconds):
python train.py --corpus sample --epochs 40 --dim 100 --evaluate

# Train on text8 (~100 MB Wikipedia corpus, downloads automatically):
python train.py --corpus text8 --epochs 5 --dim 100 --evaluate --plot

# Train on your own text file:
python train.py --corpus path/to/corpus.txt --dim 200 --epochs 3
```

```python
from src import Vocabulary, Word2Vec, Trainer, load_sample_corpus
tokens = load_sample_corpus()
vocab = Vocabulary(min_count=2).build(tokens)
corpus = vocab.encode(tokens)
model = Word2Vec(vocab=vocab, embed_dim=100, n_negatives=5)
trainer = Trainer(model=model, vocab=vocab, corpus=corpus, n_epochs=10)
trainer.train()
# Query the embeddings
model.most_similar("king", top_n=5)
model.analogy("king", "man", "woman", top_n=5) # → queen
# Save / load
model.save("results/model")
vocab.save("results/vocab.json")Given a centre word
Without negative sampling, the softmax over the full vocabulary is intractable (
For each positive pair
where:
-
$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function -
$\mathbf{u}_c \in \mathbb{R}^d$ — the input embedding (row ofW_in) for centre word$c$ -
$\mathbf{v}_o \in \mathbb{R}^d$ — the output embedding (row ofW_out) for context word$o$ -
$\mathbf{v}_{n_i}$ — the output embedding for the$i$ -th negative sample
The loss pushes $\mathbf{v}_o \cdot \mathbf{u}c$ to be large (positive word should be predictable) and $\mathbf{v}{n_i} \cdot \mathbf{u}_c$ to be small (negative words should be unpredictable).
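As a concrete reference, the per-example loss in plain NumPy looks roughly like this (an illustrative sketch; the function and variable names are not the repo's internals):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(W_in, W_out, c, o, negs):
    """SGNS loss for one (centre, context, negatives) example.

    W_in, W_out : (V, d) input / output embedding matrices
    c, o        : centre and context word indices
    negs        : (k,) array of negative-sample indices
    """
    u_c = W_in[c]                          # (d,)
    pos = sigmoid(W_out[o] @ u_c)          # sigma(v_o . u_c)
    neg = sigmoid(-(W_out[negs] @ u_c))    # sigma(-v_ni . u_c), shape (k,)
    return -np.log(pos) - np.log(neg).sum()
```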
Let $s_o = \sigma(\mathbf{v}_o \cdot \mathbf{u}_c)$ and $s_{n_i} = \sigma(\mathbf{v}_{n_i} \cdot \mathbf{u}_c)$.

Gradient w.r.t. $\mathbf{u}_c$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{u}_c} = (s_o - 1)\,\mathbf{v}_o + \sum_{i=1}^{k} s_{n_i}\,\mathbf{v}_{n_i}$$

Gradient w.r.t. $\mathbf{v}_o$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{v}_o} = (s_o - 1)\,\mathbf{u}_c$$

Gradient w.r.t. $\mathbf{v}_{n_i}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{v}_{n_i}} = s_{n_i}\,\mathbf{u}_c$$

Derivation sketch (using the chain rule): $\frac{d}{dx}\log\sigma(x) = 1 - \sigma(x)$. Since $\sigma(-x) = 1 - \sigma(x)$, it follows that $\frac{d}{dx}\log\sigma(-x) = -\sigma(x)$; applying these to each dot product in $\mathcal{L}$ gives the expressions above.

Parameters are updated by plain SGD, $\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}$, where the learning rate $\eta$ decays linearly (from `lr_start` to `lr_min` over training).
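Continuing the sketch above, the gradients and a per-example SGD step (again illustrative, not the exact code in `word2vec.py`):

```python
def sgns_step(W_in, W_out, c, o, negs, lr):
    """One SGD update for a single (centre, context, negatives) example."""
    u_c = W_in[c]                               # (d,)
    v_o, v_n = W_out[o], W_out[negs]            # (d,), (k, d)

    s_o = sigmoid(v_o @ u_c)                    # scalar
    s_n = sigmoid(v_n @ u_c)                    # (k,)

    grad_u  = (s_o - 1.0) * v_o + s_n @ v_n     # dL/du_c, (d,)
    grad_vo = (s_o - 1.0) * u_c                 # dL/dv_o
    grad_vn = s_n[:, None] * u_c[None, :]       # dL/dv_ni, (k, d)

    W_in[c]  -= lr * grad_u
    W_out[o] -= lr * grad_vo
    # np.add.at would be needed here if negs could contain repeated indices
    W_out[negs] -= lr * grad_vn
```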
Negative samples are drawn from the unigram distribution raised to the 3/4 power:

$$P_n(w) = \frac{f(w)^{3/4}}{\sum_{w'} f(w')^{3/4}}$$

where $f(w)$ is the unigram count of word $w$ in the corpus.
This was empirically found to outperform both the raw unigram and the uniform distributions. It is implemented as a large pre-computed lookup table (1 million entries) for O(1) sampling.
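A minimal sketch of how such a table can be built (the function name and exact construction are illustrative, not necessarily what `vocabulary.py` does):

```python
import numpy as np

def build_noise_table(counts, table_size=1_000_000, power=0.75):
    """Pre-computed lookup table: drawing a negative sample is a single
    random index into the table, i.e. O(1) per draw."""
    counts = np.asarray(counts, dtype=np.float64)   # counts[i] = corpus frequency of word id i
    probs = counts ** power
    probs /= probs.sum()
    # Each word id fills a slice of the table proportional to its probability.
    reps = np.round(probs * table_size).astype(np.int64)
    return np.repeat(np.arange(len(counts)), reps)

# Drawing k negatives for one positive pair:
# table = build_noise_table(word_counts)
# negs = table[np.random.randint(0, len(table), size=5)]
```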
Each token is discarded during training with probability:

$$P_{\text{discard}}(w) = 1 - \sqrt{\frac{t}{f(w)}}$$

where $t$ is the subsampling threshold and $f(w)$ is the relative frequency of word $w$ in the corpus.
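A sketch of the subsampling decision under that formula (names and the threshold value `t` are illustrative):

```python
import numpy as np

def keep_mask(token_ids, counts, total_tokens, t=1e-4, rng=None):
    """Boolean mask over a token stream: True where the token survives subsampling."""
    rng = rng or np.random.default_rng()
    freqs = counts[token_ids] / total_tokens                 # relative frequency f(w) per token
    p_discard = np.clip(1.0 - np.sqrt(t / freqs), 0.0, 1.0)
    return rng.random(len(token_ids)) >= p_discard
```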
CBOW predicts a target word from the mean of its context embeddings:

$$\bar{\mathbf{u}} = \frac{1}{|C|} \sum_{w \in C} \mathbf{u}_w$$

The loss and negative-sampling objective are identical to SGNS, but applied with $\bar{\mathbf{u}}$ in place of the centre-word embedding $\mathbf{u}_c$.
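In code, the only change from the SGNS sketches above is the input vector fed to the scoring step (illustrative):

```python
def cbow_input(W_in, context_ids):
    """Mean of the context words' input embeddings, used in place of u_c."""
    return W_in[context_ids].mean(axis=0)   # (d,)

# u_bar = cbow_input(W_in, context_ids)
# ...then score the target and negatives against u_bar exactly as in the SGNS sketches.
```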
```
python train.py [options]

Data:
  --corpus       'sample' | 'text8' | /path/to/file.txt
  --max_tokens   Truncate corpus to N tokens
  --data_dir     Directory for downloaded datasets (default: data/)

Model:
  --model        skipgram | cbow (default: skipgram)
  --dim          Embedding dimensionality (default: 100)
  --neg          Negative samples per pair (default: 5)
  --window       Context window radius (default: 5)
  --min_count    Minimum word frequency (default: 5)

Training:
  --epochs       Number of training epochs (default: 5)
  --lr           Initial learning rate (default: 0.025)
  --lr_min       Minimum learning rate (default: 0.0001)
  --batch_size   Pairs per gradient step (default: 512)
  --seed         Random seed (default: 42)

Output:
  --out          Model output path (default: results/model)
  --evaluate     Run evaluation after training
  --plot         Save t-SNE and loss curve plots
  --log_every    Log loss every N steps (default: 10000)
```
After training:
```python
from evaluate import full_report

full_report(model, plot=True, loss_history=trainer.loss_history)
```

Outputs:
- Embedding norm statistics
- Nearest neighbours for sample words
- Word analogy accuracy
- t-SNE scatter plot → `results/tsne.png`
- Loss curve → `results/loss_curve.png`
Why two embedding matrices? W_in (centre) and W_out (context) are kept separate, as in the original paper. Using a single shared matrix (tying weights) is also possible and sometimes used, but the two-matrix variant generally performs better.
Why not hierarchical softmax? Negative sampling is faster in practice and achieves comparable or better results for large vocabularies. Hierarchical softmax has $O(\log |V|)$ cost per prediction but requires building and traversing a Huffman tree over the vocabulary, which complicates the implementation.
Batching vs. per-pair updates: The batched sgns_batch_step uses np.einsum for vectorised score computation and np.add.at for safe scattered gradient accumulation. Each example in the batch contributes its own gradient independently (not averaged), matching the original Mikolov implementation.
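The core of that pattern looks roughly like the following (a sketch of the idea under the repo's stated approach, not the exact body of `sgns_batch_step`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_batch_step_sketch(W_in, W_out, centres, contexts, negs, lr):
    """Illustrative batched SGNS update.

    centres, contexts : (B,) arrays of word indices
    negs              : (B, k) array of negative-sample indices
    """
    U = W_in[centres]                                    # (B, d)
    V_pos = W_out[contexts]                              # (B, d)
    V_neg = W_out[negs]                                  # (B, k, d)

    s_pos = sigmoid(np.einsum('bd,bd->b', U, V_pos))     # sigma(v_o . u_c), (B,)
    s_neg = sigmoid(np.einsum('bkd,bd->bk', V_neg, U))   # sigma(v_ni . u_c), (B, k)

    g_pos = s_pos - 1.0
    grad_U    = g_pos[:, None] * V_pos + np.einsum('bk,bkd->bd', s_neg, V_neg)
    grad_Vpos = g_pos[:, None] * U                       # (B, d)
    grad_Vneg = s_neg[..., None] * U[:, None, :]         # (B, k, d)

    # Scatter-add: the same row can appear several times in a batch, so
    # np.add.at accumulates every contribution (plain fancy-index -= would not).
    np.add.at(W_in, centres, -lr * grad_U)
    np.add.at(W_out, contexts, -lr * grad_Vpos)
    np.add.at(W_out, negs.reshape(-1), -lr * grad_Vneg.reshape(-1, W_out.shape[1]))
```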
Numerical stability: Sigmoid inputs are clipped to a fixed range before exponentiation, so `np.exp` cannot overflow and the log terms in the loss never see an argument of exactly 0 or 1.
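For example (the clipping bound here is an arbitrary illustrative value, not necessarily the one used in the repo):

```python
import numpy as np

def stable_sigmoid(x, bound=15.0):
    """Clip inputs so np.exp cannot overflow and log(sigmoid(x)) stays finite."""
    x = np.clip(x, -bound, bound)
    return 1.0 / (1.0 + np.exp(-x))
```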