A family of ternary (1.58-bit) language models that train and run entirely on CPU. Weights are constrained to {-1, 0, +1}, so inference uses only additions and subtractions — no floating-point multiplies.
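To make the no-multiply claim concrete, here is a toy illustration (not the repo's inference code): with ternary weights, each weight either adds, subtracts, or skips an input element.

```python
def ternary_matvec(W: list[list[int]], x: list[float]) -> list[float]:
    """y = W @ x where every W[i][j] is in {-1, 0, +1}: no multiplies,
    only additions and subtractions. (The per-layer alpha scale is the
    lone float multiply, applied once per output.)"""
    y = [0.0] * len(W)
    for i, row in enumerate(W):
        for w, xj in zip(row, x):
            if w == 1:
                y[i] += xj
            elif w == -1:
                y[i] -= xj
    return y

print(ternary_matvec([[1, -1, 0], [0, 1, 1]], [2.0, 3.0, 4.0]))  # [-1.0, 7.0]
```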
| Version | Params | Dataset | Val Loss | BPC | Status |
|---|---|---|---|---|---|
| v4 "Bolt" | 4.3M | TinyStories | 2.10 | 0.88 | ✅ Current |
| v3 | 13.6M | FineWeb-Edu | 6.80 | — | Archived |
```
Embedding (10K × 192, float, weight-tied)
  → 6 × BoltBlock:
      RMSNorm → Gated Causal DepthwiseConv (k=8) → residual
      RMSNorm → Ternary GLU (SiLU, 192→512→192) → residual
  → RMSNorm → Output Head (tied to embedding)
```
All linear layers inside BoltBlocks use ternary BitLinear quantisation (straight-through estimator, α = mean|W|). The only floating-point operations are the embedding lookup, RMSNorm, and the tied output projection.
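A minimal PyTorch sketch of both pieces, assuming PyTorch ≥ 2.4 for `nn.RMSNorm`; the class names, the sigmoid gate on the conv branch, and any detail not stated above are illustrative guesses, not the repo's code (`train.py` is the reference):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer with ternary weights. On the forward pass the float
    weights are quantized to {-1, 0, +1} scaled by alpha = mean|W|; a
    straight-through estimator (STE) routes gradients to the float weights."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        alpha = w.abs().mean()                           # alpha = mean|W|
        w_q = (w / (alpha + 1e-8)).round().clamp(-1, 1)  # ternarize to {-1, 0, +1}
        w_q = w_q * alpha                                # rescale
        w_ste = w + (w_q - w).detach()                   # STE: identity gradient
        return F.linear(x, w_ste, self.bias)

class BoltBlock(nn.Module):
    """One block: RMSNorm -> gated causal depthwise conv (k=8) -> residual,
    then RMSNorm -> ternary GLU with SiLU (192 -> 512 -> 192) -> residual."""

    def __init__(self, d: int = 192, d_ff: int = 512, k: int = 8):
        super().__init__()
        self.k = k
        self.norm1 = nn.RMSNorm(d)
        self.conv = nn.Conv1d(d, d, kernel_size=k, groups=d)  # depthwise
        self.conv_gate = BitLinear(d, d, bias=False)          # assumed sigmoid gate
        self.norm2 = nn.RMSNorm(d)
        self.glu_up = BitLinear(d, d_ff, bias=False)          # value branch
        self.glu_gate = BitLinear(d, d_ff, bias=False)        # SiLU gate branch
        self.glu_down = BitLinear(d_ff, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, T, d)
        h = self.norm1(x)
        c = F.pad(h.transpose(1, 2), (self.k - 1, 0))         # left-pad => causal
        c = self.conv(c).transpose(1, 2)
        x = x + c * torch.sigmoid(self.conv_gate(h))          # gated conv + residual
        h = self.norm2(x)
        x = x + self.glu_down(F.silu(self.glu_gate(h)) * self.glu_up(h))
        return x
```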
```bash
# Install dependencies
pip install torch tiktoken datasets

# Auto-detect hardware and train
python train.py

# Small model (4.3M params), 2-hour run
python train.py --small --hours 2

# Large model (15.7M params), train until convergence
python train.py --large

# Resume from checkpoint
python train.py --resume checkpoints/flashlm_v4_step_5000.pt

# Custom configuration
python train.py --large --batch 64 --lr 2e-3 --epochs 5
```

The script auto-detects CPU cores and RAM, selects an appropriate model size, downloads TinyStories, builds a frequency-based 10K vocabulary (99.9% coverage), caches tokenized data to disk, and begins training. Subsequent runs skip download and tokenization.
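For context, a sketch of what a frequency-based vocabulary with a coverage figure means; word-level counting and the special tokens are assumptions here, not the script's actual tokenizer:

```python
from collections import Counter

def build_vocab(texts: list[str], vocab_size: int = 10_000):
    """Keep the most frequent tokens up to the budget; everything else
    maps to <unk>. Returns the vocab and the fraction of corpus tokens
    it covers (e.g. ~0.999 for a 10K vocab on TinyStories)."""
    counts = Counter(tok for text in texts for tok in text.split())
    specials = ["<pad>", "<unk>", "<eos>"]
    kept = counts.most_common(vocab_size - len(specials))
    coverage = sum(c for _, c in kept) / sum(counts.values())
    vocab = {tok: i for i, tok in enumerate(specials + [w for w, _ in kept])}
    return vocab, coverage
```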
| Flag | Description | Default |
|---|---|---|
| `--small` | Force v4-small (d=192, 6 blocks, ~4.3M params) | Auto |
| `--large` | Force v4-large (d=384, 8 blocks, ~15.7M params) | Auto |
| `--resume PATH` | Resume training from a checkpoint | — |
| `--epochs N` | Maximum epochs | 10 |
| `--hours H` | Wall-clock time limit | None |
| `--batch N` | Batch size | Auto |
| `--lr FLOAT` | Peak learning rate | Auto |
| `--seed N` | Random seed | 42 |
| `--save-dir PATH` | Checkpoint directory | `checkpoints/` |
| File | Description |
|---|---|
| `train.py` | Standalone training script for FlashLM v4 |
| `eval_bpc.py` | BPC evaluation script (FlashLM v4 vs TinyStories-1M) |
| `FlashLMv3.ipynb` | Original v3 notebook (archived) |
FlashLM v4 vs TinyStories-1M (500 validation stories):
| Metric | FlashLM v4 | TinyStories-1M |
|---|---|---|
| Params | 4.3M (ternary) | 3.7M (float32) |
| BPC | 0.88 | 0.62 |
| PPL | 15.05 | 6.72 |
| Hardware | 2-thread CPU | V100 GPU |
| Tokens seen | 10.6M | ~470M |
| Training time | 2 hours | Hours (GPU) |
The BPC gap is primarily due to undertraining — v4 has seen only 2.3% of the data the baseline used, and loss was still decreasing when the 2-hour time limit was reached.
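As a sanity check, BPC and perplexity are the same cross-entropy in different units, related by BPC = ln(PPL) / (ln 2 · chars-per-token). The table's pairs are consistent at roughly 4.4 characters per token (a value derived here from the table, not stated elsewhere):

```python
import math

def bpc_from_ppl(ppl: float, chars_per_token: float) -> float:
    """Convert token-level perplexity to bits per character."""
    return math.log(ppl) / (math.log(2) * chars_per_token)

print(bpc_from_ppl(15.05, 4.44))  # ~0.88 (FlashLM v4)
print(bpc_from_ppl(6.72, 4.44))   # ~0.62 (TinyStories-1M)
```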
- Model & Weights: HuggingFace
- Demo: HuggingFace Space
- v3 Model: HuggingFace
- v3 Demo: HuggingFace Space
- Extended training on Ryzen 7950X3D (16 cores, 128GB RAM)
- Scale to ~15M params (v4-large)
- Curriculum learning (TinyStories → SimpleStories → filtered FineWeb-Edu)
- ONNX / C inference runtime
- BPC evaluation script (`eval_bpc.py`)
MIT — see LICENSE.