A few-file, all-in-one C engine for Transformer AI models: inference, training, tokenizer creation, chat, and vision.
Built from scratch over 18 months (from March 2024 to August 2025) during my lunch breaks and weekend nights, TRiP exists just because I wanted to truly understand the transformer internals - from the matrix multiplications up.
TRiP's purpose is purely educational, for me and for anyone willing to learn about transformers. It supports Gemma 1, Llama 2, PaliGemma, and GPT-2, with full inference and training. It does not aim to track the latest model releases, and is not trying to compete with llama.cpp.
NOTE: since people are asking, here's what's AI-generated in the code:
- the JSON parser (with some fixes)
- the safetensors checkpoint save function
- the JPEG/X11 handling functions (I had no interest in writing them)
- the final file split (I initially wrote everything as main.c :D )
- some revisions of the comments before I made the commit
- this readme, for the most part :D
That's all, I think; the rest is all hand-coded by me. It would have made no sense otherwise, since the whole point of doing this was to get as close as possible to a full-stack understanding of the transformer internals.
- Architectures: Llama 2, Gemma 1.0/1.1, PaliGemma 1 (vision+language), GPT-2
- Checkpoint formats: SafeTensors (HuggingFace), Karpathy's llama2.c and gpt2 formats
- Weight types: bf16, float16, float32
- Training: full backpropagation with AdamW, cosine annealing LR, gradient clipping
- Tokenizer: BPE (SentencePiece-compatible), with vocabulary creation from scratch
- Inference: greedy, top-k, and nucleus (top-p) sampling
- Chat: interactive chat with Llama, Gemma, and TinyLlama chat templates
- Vision: multimodal inference with PaliGemma (JPEG input, X11 display)
- Memory: RAM-optimized mode via mmap for large models on limited hardware
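About that last point: the memory-optimized mode boils down to mapping the checkpoint instead of copying it. Here is a minimal sketch of the idea in plain POSIX, not TRiP's actual loader:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a checkpoint read-only so the kernel pages weights in on demand,
// instead of copying every tensor into heap allocations up front.
void *map_checkpoint(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping outlives the descriptor
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }
    *len_out = (size_t)st.st_size;
    return p;   // tensor data sits at format-specific offsets inside the file
}
```

Pages that are never touched never occupy RAM, and the kernel can evict clean pages under memory pressure, which is what makes large models usable on small machines, at the cost of slower, demand-paged access.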
- gcc (version 13 or higher recommended, for bfloat16 support; with OpenMP)
- libjpeg-dev (or libjpeg62-turbo-dev)
- libx11-dev
WARNING: do NOT expect higher performance from bfloat16 or float16 on CPUs. Today's CPUs have little or no native arithmetic for these formats, so every value has to be widened to float32 anyway, and float32 always performs best. That surprised me a lot, too.
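The reason is that bfloat16 is just the top 16 bits of an IEEE-754 float32: every operand must be widened before the CPU can multiply, and the result truncated back (gcc 13's native bfloat16 type hides the mechanics, but the work remains). A minimal sketch of that round trip, not TRiP's actual conversion code:

```c
#include <stdint.h>
#include <string.h>

// bfloat16 keeps only the sign, exponent, and top 7 mantissa bits of a
// float32, so conversion is a 16-bit shift. This per-element round trip on
// every operand is why bf16 weights aren't faster than float32 on most CPUs.
static inline float bf16_to_f32(uint16_t b) {
    uint32_t u = (uint32_t)b << 16;   // low mantissa bits come back as zero
    float f;
    memcpy(&f, &u, sizeof f);         // safe type-punning
    return f;
}

static inline uint16_t f32_to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);       // truncation; real code may round-to-nearest
}
```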
On Debian:

```
sudo apt install build-essential libomp-dev libjpeg62-turbo-dev libx11-dev
```

On Ubuntu:

```
sudo apt install build-essential libomp-dev libjpeg-dev libx11-dev
```

TRiP runs natively under WSL (Windows Subsystem for Linux). To enable the X11 display features (vision mode, image display), install an X server on the Windows side (e.g., VcXsrv). Then, in your WSL terminal, before running TRiP:

```
export DISPLAY=:0
```

If using WSL2 (most setups), use instead:

```
export DISPLAY=$(cat /etc/resolv.conf | grep nameserver | awk '{print $2}'):0
```

X11 is only needed for vision mode. Chat, inference, and training work without it.

To build:

```
make
```

That's it. No cmake, no external frameworks, no Python. Just `make`.
Download a Gemma-2B-IT model from HuggingFace (safetensors format), then:

```
./trip --chat \
    --checkpoint gemma-2b-it/model.safetensors \
    --tokenizer gemma-2b-it/tokenizer.json \
    --chat_scheme GEMMA
```

To run one-shot inference on a prompt:

```
./trip --decode \
    --input_text "The capital of Italy is" \
    --checkpoint gemma-2b-it/model.safetensors \
    --tokenizer gemma-2b-it/tokenizer.json
```

Or from a text file:

```
./trip --decode prompt.txt \
    --checkpoint gemma-2b-it/model.safetensors \
    --tokenizer gemma-2b-it/tokenizer.json
```

To train a model:

```
./trip --train \
    --checkpoint my_model/model.safetensors \
    --tokenizer my_model/tokenizer.json \
    --train_data my_dataset.txt \
    --train_config training_args.json
```

To run multimodal (vision) inference:

```
./trip --vision photo.jpg \
    --checkpoint paligemma/model.safetensors \
    --tokenizer paligemma/tokenizer.json \
    --input_text "Describe this image"
```

To build a tokenizer vocabulary from scratch:

```
./trip --build_vocab corpus.txt --vocab_size 32000 --tokenizer my_tokenizer.json
```

USAGE:

```
./trip <ACTION> [OPTIONS...]
```
Actions:

| Flag | Description |
|---|---|
| `--decode [file]` | Run inference on a prompt (from file, `--input_text`, or stdin) |
| `--chat` | Interactive chat session |
| `--vision [image.jpg]` | Multimodal inference with an image |
| `--train` | Train the model |
| `--create` | Create a new model from a configuration file |
| `--build_vocab <data.txt>` | Build a new tokenizer vocabulary from a text corpus |
| `--utest` | Run unit tests |
| `--help` | Show help |
Model and tokenizer options:

| Flag | Default | Description |
|---|---|---|
| `--checkpoint <path>` | `default.model` | Path to model checkpoint file(s) |
| `--checkpoint_type <type>` | `SAFETENSORS` | Format: `SAFETENSORS`, `LLAMA2_AK`, `GPT2_AK` |
| `--configuration <path>` | (auto) | Path to `config.json` (for SafeTensors) |
| `--tokenizer <path>` | `default.tokenizer` | Path to tokenizer file |
| `--tokenizer_format <type>` | `JSON_HUGGINGFACE` | Format: `JSON_HUGGINGFACE`, `LLAMA2_AK`, `GPT2_AK` |
| `--tokenizer_type <type>` | `SENTENCEPIECE` | Algorithm: `SENTENCEPIECE`, `TRIP` |
Generation and chat options:

| Flag | Default | Description |
|---|---|---|
| `--input_text "<prompt>"` | — | Provide prompt text directly on the command line |
| `--system_prompt "<text>"` | — | System prompt for chat mode |
| `--chat_scheme <scheme>` | (none) | Chat template: `LLAMA`, `TINY_LLAMA`, `GEMMA` |
| `--chat_save_context <file>` | — | Pre-process and save chat context for faster startup |
| `--chat_load_context <file>` | — | Load a previously saved chat context |
| `--temperature <value>` | `1.0` | Sampling temperature; `0.0` = greedy (always pick the most probable token) |
| `--top_p <value>` | `0.9` | Nucleus sampling: sample from the smallest set of tokens whose cumulative probability exceeds this value (see the sketch below) |
| `--top_k <value>` | (disabled) | Top-k sampling: sample from the k most probable tokens |
| `--ram` | (off) | Memory-map weights instead of loading them (slower, uses less RAM) |
Training options:

| Flag | Default | Description |
|---|---|---|
| `--train_config <path>` | `training_args.json` | Path to training configuration JSON |
| `--train_data <path>` | `training_data.txt` | Path to training data (plain text) |
TRiP is organized into seven files. Open `trip.h` for the complete map.

| File | Lines | What it contains |
|---|---|---|
| `trip.h` | ~900 | The map: every type, struct, global, and declaration |
| `math.c` | ~3000 | Tensor ops, each forward+backward pair side by side: matmul, softmax, layernorm, RMSNorm, RoPE, attention, FFN activations, vector arithmetic |
| `forward.c` | ~1500 | Forward-pass orchestration + token sampling |
| `backward.c` | ~1500 | Backward pass + AdamW optimizer + gradient management |
| `model.c` | ~5500 | Checkpoint I/O, model init, memory management, tokenizer, vision preprocessing |
| `utils.c` | ~1000 | Logging, JSON parser, terminal I/O, JPEG/X11 image handling |
| `main.c` | ~1900 | CLI argument parsing, chat loop, training loop, inference loop |
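As a taste of what `backward.c` implements, here is a minimal, self-contained AdamW update step. The signature is illustrative, not TRiP's actual function:

```c
#include <math.h>

// One AdamW step over n parameters. m and v are the per-parameter first and
// second moment buffers (zero-initialized); t is the 1-based step count.
void adamw_step(float *w, const float *g, float *m, float *v, int n, int t,
                float lr, float beta1, float beta2, float eps, float wd) {
    float bc1 = 1.0f - powf(beta1, (float)t);   // bias correction terms
    float bc2 = 1.0f - powf(beta2, (float)t);
    for (int i = 0; i < n; i++) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];          // momentum of g
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];   // momentum of g^2
        float mhat = m[i] / bc1;
        float vhat = v[i] / bc2;
        // Decoupled weight decay: applied to w directly, not folded into g.
        w[i] -= lr * (mhat / (sqrtf(vhat) + eps) + wd * w[i]);
    }
}
```

The cosine-annealed learning rate from the feature list would be computed per step, e.g. `lr = lr_min + 0.5f * (lr_max - lr_min) * (1.0f + cosf(M_PI * t / T))`, and passed in as `lr`.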
TRiP implements a transformer from first principles in C. No PyTorch, no TensorFlow, no ONNX — just linear algebra on arrays of floats.
The residual stream is the central concept: a vector that flows through the model like data on a bus. Each layer reads from it, processes it through attention and a feed-forward network, and writes back to it. The forward pass walks the layers top to bottom; the backward pass walks them bottom to top, computing gradients via the chain rule.
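In code, the whole forward pass reduces to a loop over layers with two read-process-write rounds each. This skeleton uses hypothetical helper names, not TRiP's actual API, just to show the shape:

```c
// Hypothetical helpers standing in for TRiP's real routines in math.c:
void attn_norm(float *out, const float *x, int layer, int dim);
void attention(float *out, const float *x, int layer, int dim);
void ffn_norm(float *out, const float *x, int layer, int dim);
void ffn(float *out, const float *x, int layer, int dim);

// One token's forward pass: x is the residual stream. Every layer does two
// read-process-write rounds against it; nothing else carries state forward.
void forward_layers(float *x, float *xb, float *out, int dim, int n_layers) {
    for (int l = 0; l < n_layers; l++) {
        attn_norm(xb, x, l, dim);                      // read: normalized view of the stream
        attention(out, xb, l, dim);                    // process: self-attention
        for (int i = 0; i < dim; i++) x[i] += out[i];  // write: residual add

        ffn_norm(xb, x, l, dim);                       // read again
        ffn(out, xb, l, dim);                          // process: feed-forward network
        for (int i = 0; i < dim; i++) x[i] += out[i];  // write: residual add
    }
}
```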
Every math operation (`math.c`) is implemented as a forward+backward pair: you can read `rmsnorm()` and, immediately below it, `rmsnorm_backward()`, and see exactly how the gradient flows through the same computation in reverse.
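For example, here is what such a pair looks like for RMSNorm. This is an independent minimal sketch of the same math, not a copy of TRiP's code; note how the backward pass applies the chain rule through `1/rms`, which couples every input element to every output element:

```c
#include <math.h>

// Forward: y = (x / rms(x)) * w, where rms(x) = sqrt(mean(x^2) + eps).
void rmsnorm_fwd(float *y, const float *x, const float *w, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    float s = 1.0f / sqrtf(ss / n + eps);                // s = 1 / rms(x)
    for (int i = 0; i < n; i++) y[i] = x[i] * s * w[i];
}

// Backward: given dL/dy, accumulate dL/dx and dL/dw.
// dx_j = s*dy_j*w_j - (s^3/n) * x_j * sum_i(dy_i * w_i * x_i)
void rmsnorm_bwd(float *dx, float *dw, const float *dy,
                 const float *x, const float *w, int n, float eps) {
    float ss = 0.0f, dot = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    float s = 1.0f / sqrtf(ss / n + eps);
    for (int i = 0; i < n; i++) dot += dy[i] * w[i] * x[i];
    for (int i = 0; i < n; i++) {
        dw[i] += dy[i] * x[i] * s;                                // grad of the scale
        dx[i] += s * dy[i] * w[i] - (s * s * s / n) * x[i] * dot; // chain rule through rms
    }
}
```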
I put a lot of comments in the code, both as reminders to myself and to make TRiP read like an annotated textbook on transformers.
For a deeper understanding of backpropagation, see Andrej Karpathy's lecture; TRiP would never have existed without his work.
CC BY-NC 4.0 — free to use, study, modify, and share for non-commercial purposes, with attribution. For commercial licensing, contact the author.