A few-file, all-in-one C engine for Transformer AI models: inference, training, tokenizer creation, chat, and vision.
Built from scratch over 18 months (March 2024 to August 2025) during my lunch breaks and weekend nights, TRiP exists because I wanted to truly understand transformer internals, from the matrix multiplications up. About 15,000 lines, structured to be read. It supports Gemma 1, Llama 2, PaliGemma, and GPT-2, with full inference and training. TRiP's purpose is purely educational: for me, and for anyone willing to learn about transformers. It does not aim to track the latest model releases, and it is not trying to compete with llama.cpp.
TRiP also helped produce the analysis behind my paper:
"Confidently Wrong: Failure of Mixture of Inputs in Low-Entropy Regimes" (DOI: 10.5281/zenodo.20822173) https://zenodo.org/records/20822173
In the spirit of full transparency, here's the small portion that is AI-generated (everything else is hand-coded):
- the json parser (with some fixes)
- the safetensors checkpoint save function
- all of the JPEG/X11 management functions (I had no interest in developing them)
- the final file split (I initially wrote everything as main.c :D)
- some revisions of the comments before committing
- this readme, for the most part :D
That's all, I think. The rest is hand-coded by me - and it would have made no sense otherwise, since the whole point of doing this was to get as close as possible to a full-stack understanding of the transformer internals.
- Architectures: Llama 2, Gemma 1.0/1.1, PaliGemma 1 (vision+language), GPT-2
- Checkpoint formats: SafeTensors (HuggingFace), Karpathy's llama2.c and gpt2 formats
- Weight types: bf16, float16, float32
- Training: full backpropagation with AdamW, cosine annealing LR, gradient clipping
- Tokenizer: BPE (SentencePiece-compatible), with vocabulary creation from scratch
- Inference: greedy, top-k, and nucleus (top-p) sampling
- Chat: interactive chat with Llama, Gemma, and TinyLlama chat templates
- Vision: multimodal inference with PaliGemma (JPEG input, X11 display)
- Memory: RAM-optimized mode via mmap for large models on limited hardware
- gcc (recommended: version 13 or higher, to get support for bfloat16; with OpenMP support)
- libjpeg-dev (or libjpeg62-turbo-dev)
- libx11-dev
WARNING: do NOT expect higher performance with bfloat16 or float16 on CPUs; today's CPUs are not optimized for floating point operations in such formats, and float32 always performs best. That surprised me a lot, too.
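One way to see where the extra cost comes from: on a CPU without native bfloat16 arithmetic, every bf16 weight has to be widened to float32 before it can enter a multiply-add. The sketch below shows that conversion using plain bit manipulation. It is an illustration under that assumption, not TRiP's code; the function names are hypothetical, and a compiler with a built-in bfloat16 type may handle the conversion differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bfloat16 is the upper 16 bits of an IEEE-754 float32: same sign bit,
   same 8-bit exponent, 7 mantissa bits. Widening is a 16-bit shift. */
float bf16_to_f32(uint16_t h)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrowing with round-to-nearest-even; NaN inputs stay NaN. */
uint16_t f32_to_bf16(float f)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7fffffffu) > 0x7f800000u)          /* NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);   /* force a quiet NaN */
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* Dot product with bf16 weights: each weight is widened to float32
   first, which is extra work compared with float32 weights. */
float dot_bf16(const uint16_t *w, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += bf16_to_f32(w[i]) * x[i];
    return acc;
}
```

The half-size weights still save memory and disk space; the point is only that the arithmetic itself happens in float32.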
On Debian:

    sudo apt install build-essential libomp-dev libjpeg62-turbo-dev libx11-dev

On Ubuntu:

    sudo apt install build-essential libomp-dev libjpeg-dev libx11-dev

TRiP runs natively under WSL (Windows Subsystem for Linux). To enable the X11 display features (vision mode, image display), install an X server on the Windows side.

Then in your WSL terminal, before running TRiP:

    export DISPLAY=:0

If using WSL2 (most setups), use instead:

    export DISPLAY=$(cat /etc/resolv.conf | grep nameserver | awk '{print $2}'):0

X11 is only needed for vision mode. Chat, inference, and training work without it.

To build, run:

    make

That's it. No cmake, no external frameworks, no Python. Just make.
Download a Gemma-2B-IT model from HuggingFace (safetensors format), then start a chat:

    ./trip --chat \
      --checkpoint gemma-2b-it/model.safetensors \
      --tokenizer gemma-2b-it/tokenizer.json \
      --chat_scheme GEMMA

Run inference on a prompt:

    ./trip --decode \
      --input_text="The capital of Italy is" \
      --checkpoint gemma-2b-it/model.safetensors \
      --tokenizer gemma-2b-it/tokenizer.json

Or from a text file:

    ./trip --decode prompt.txt \
      --checkpoint gemma-2b-it/model.safetensors \
      --tokenizer gemma-2b-it/tokenizer.json

Train the model:

    ./trip --train \
      --checkpoint my_model/model.safetensors \
      --tokenizer my_model/tokenizer.json \
      --train_data my_dataset.txt \
      --train_config training_args.json

Multimodal inference with an image:

    ./trip --vision photo.jpg \
      --checkpoint paligemma/model.safetensors \
      --tokenizer paligemma/tokenizer.json \
      --input_text="Describe this image"

Build a new tokenizer vocabulary from a text corpus:

    ./trip --build_vocab corpus.txt --vocab_size 32000 --tokenizer my_tokenizer.json

USAGE:

    ./trip <ACTION> [OPTIONS...]

| Flag | Description |
|---|---|
| `--decode [file]` | Run inference on a prompt (from file, `--input_text`, or stdin) |
| `--chat` | Interactive chat session |
| `--vision [image.jpg]` | Multimodal inference with an image |
| `--train` | Train the model |
| `--create` | Create a new model from a configuration file |
| `--build_vocab <data.txt>` | Build a new tokenizer vocabulary from a text corpus |
| `--utest` | Run unit tests |
| `--help` | Show help |

| Flag | Default | Description |
|---|---|---|
| `--checkpoint <path>` | `default.model` | Path to model checkpoint file(s) |
| `--checkpoint_type <type>` | `SAFETENSORS` | Format: `SAFETENSORS`, `LLAMA2_AK`, `GPT2_AK` |
| `--configuration <path>` | (auto) | Path to config.json (for SafeTensors) |
| `--tokenizer <path>` | `default.tokenizer` | Path to tokenizer file |
| `--tokenizer_format <type>` | `JSON_HUGGINGFACE` | Format: `JSON_HUGGINGFACE`, `LLAMA2_AK`, `GPT2_AK` |
| `--tokenizer_type <type>` | `SENTENCEPIECE` | Algorithm: `SENTENCEPIECE`, `TRIP` |

| Flag | Default | Description |
|---|---|---|
| `--input_text="<prompt>"` | - | Provide prompt text directly on the command line |
| `--system_prompt "<text>"` | - | System prompt for chat mode |
| `--chat_scheme <scheme>` | (none) | Chat template: `LLAMA`, `TINY_LLAMA`, `GEMMA` |
| `--chat_save_context <file>` | - | Pre-process and save chat context for faster startup |
| `--chat_load_context <file>` | - | Load a previously saved chat context |
| `--temperature <value>` | 1.0 | Sampling temperature. 0.0 = greedy (always pick the most probable token) |
| `--top_p <value>` | 0.9 | Nucleus sampling: sample from the smallest set of tokens whose cumulative probability exceeds this value |
| `--top_k <value>` | (disabled) | Top-k sampling: sample from the k most probable tokens |
| `--ram` | (off) | Memory-map weights instead of loading them (slower, uses less RAM) |

| Flag | Default | Description |
|---|---|---|
| `--train_config <path>` | `training_args.json` | Path to training configuration JSON |
| `--train_data <path>` | `training_data.txt` | Path to training data (plain text) |

The "examples/" directory contains an example training_args.json and a sample file for fine-tuning Gemma and TinyLlama with their chat templates. Take care: models are VERY sensitive to correct use of their chat template when you fine-tune them.
TRiP is organized into 7 files. Open trip.h for the complete map.
| File | Lines | What it contains |
|---|---|---|
| `trip.h` | ~900 | The map. Every type, struct, global, and declaration. |
| `math.c` | ~3000 | Tensor ops, each forward+backward paired side by side: matmul, softmax, layernorm, RMSnorm, RoPE, attention, FFN activations, vector arithmetic |
| `forward.c` | ~1500 | Forward pass orchestration + token sampling |
| `backward.c` | ~1500 | Backward pass + AdamW optimizer + gradient management |
| `model.c` | ~5500 | Checkpoint I/O, model init, memory management, tokenizer, vision preprocessing |
| `utils.c` | ~1000 | Logging, JSON parser, terminal I/O, JPEG/X11 image handling |
| `main.c` | ~1900 | CLI argument parsing, chat loop, training loop, inference loop |

TRiP implements a transformer from first principles in C. No PyTorch, no TensorFlow, no ONNX — just linear algebra on arrays of floats.
The residual stream is the central concept: a vector that flows through the model like data on a bus. Each layer reads from it, processes it through attention and a feed-forward network, and writes back to it. The forward pass walks the layers top to bottom; the backward pass walks them bottom to top, computing gradients via the chain rule.
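The read, process, write-back pattern can be reduced to a toy sketch. Everything here is hypothetical and far simpler than TRiP's real code: the `Layer` struct, the function names, and the two stand-in sublayers (which replace real attention and feed-forward computations) exist only to show the data flow.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIM 8

/* A sublayer reads the stream and produces an update of the same size. */
typedef void (*sublayer_fn)(const float *in, float *out, size_t dim);

typedef struct {
    sublayer_fn attention;
    sublayer_fn ffn;
} Layer;

/* Stand-ins for real sublayers, just to make the flow visible. */
void toy_half(const float *in, float *out, size_t dim)
{
    for (size_t i = 0; i < dim; i++) out[i] = 0.5f * in[i];
}

void toy_one(const float *in, float *out, size_t dim)
{
    (void)in;
    for (size_t i = 0; i < dim; i++) out[i] = 1.0f;
}

/* Read the stream, compute an update, and ADD it back to the stream. */
static void read_process_write(float *stream, sublayer_fn f, size_t dim)
{
    float update[MAX_DIM];
    assert(dim <= MAX_DIM);
    f(stream, update, dim);
    for (size_t i = 0; i < dim; i++) stream[i] += update[i];
}

/* Forward pass: visit the layers in order; each one applies attention,
   then the feed-forward network, both writing into the same stream. */
void forward_layers(float *stream, const Layer *layers,
                    size_t n_layers, size_t dim)
{
    for (size_t l = 0; l < n_layers; l++) {
        read_process_write(stream, layers[l].attention, dim);
        read_process_write(stream, layers[l].ffn, dim);
    }
}
```

Because every write is an addition, the gradient arriving at a layer's output reaches the sublayer and also passes unchanged to the layer before it, which is what the backward pass exploits when it walks the layers in reverse.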
Every math operation (math.c) is implemented as a forward+backward pair: you can read rmsnorm() and immediately below it rmsnorm_backward(), and see exactly how the gradient flows through the same computation in reverse.
I put a lot of comments in the code, both as reminders to myself and to make TRiP essentially an annotated textbook on transformers.
For a deeper understanding of backpropagation, see Andrej Karpathy's lecture; TRiP would never have existed without his work.
CC BY-NC 4.0 — free to use, study, modify, and share for non-commercial purposes, with attribution. For commercial licensing, contact the author.