| Section | What’s there |
|---|---|
| Features | Tokenizer, data, Docker, tooling |
| Contributions | Inter-block residual blocks, TARNet-style training |
| Architecture | Data → train → checkpoint; optional TARNet head diagram |
| Quick start | Docker, local install, W&B, generation |
| How training works | Pipeline overview |
| Training IMDB | make train-imdb, chat, resume |
| Tests | pytest, coverage, CI |
| Development | pip install -e ".[dev]", pre-commit |
| Project structure | Layout of the repo |
| Config | Env vars and JSON config |
| License | MIT |
| Experiment archive | Collapsed logs and notes |

| Component | Details |
|---|---|
| Tokenizer | Hugging Face byte-level BPE (`hf_bpe_byte`); sinusoidal or RoPE positional encoding |
| Model | Causal multi-head self-attention; optional inter-block residuals; optional TARNet two-head mode |
| Data | IMDB sentiment via Hugging Face `datasets` (tags or natural conditioning) |
| Runtime | Docker (NGC PyTorch), Make targets, optional Weights & Biases |
This repo is a small from-scratch decoder LM; beyond the baseline stack, it highlights two implementation threads you can turn on via config / CLI:
| Thread | What it adds |
|---|---|
| Inter-block residual blocks | With `--block-attn-residuals`, layers use `InterBlockAttnDecoderBlock` instead of the vanilla `DecoderBlock`. Each step mixes the current stream with prior macro-block outputs through a depth attention residual (RMSNorm keys, one learned pseudo-query, softmax over depth) before the usual causal self-attention and FFN: parallel residual adds into the same hidden stream, with snapshots at macro-block boundaries. |
| TARNet-like training | `--tarnet-two-heads`: one trunk, `logits_shared` + `Δ_k` readouts, factual treatment `T`, weighted CE on the active head, optional JS separation (`tarnet_head_separation_weight`). Diagram, equation, shared-head rationale, and inference (`--counterfactual`): see TARNet under Architecture. |
**Papers.**

**Attention residuals (decoder).** Macro-block depth mixing is in the same family as Block AttnRes in *Attention Residuals* (Kimi Team, 2026). Conceptual background and how it maps to this repo: vanilla vs inter-block decoder.

**TARNet (classical ITE).** Dual-head training here is inspired by the shared encoder and treatment-specific heads in *Estimating individual treatment effect: generalization bounds and algorithms* (Shalit, Johansson & Sontag), adapted to next-token LM loss rather than the paper's observational ITE setup.

**Counterfactuals × LLMs (further reading).** Different formalisms from this repo's supervised dual-head decoder, but adjacent if you care about "alternative text under a treatment" with language models:
- Counterfactual Token Generation in Large Language Models (Chatzi et al.) — Gumbel–Max structural causal model of autoregressive sampling; counterfactual rollouts without fine-tuning pretrained LMs.
- Gumbel Counterfactual Generation From Language Models (Ravfogel et al.) — interventions vs true string counterfactuals; hindsight Gumbel sampling over latent noise.
- Counterfactual Causal Inference in Natural Language with Large Language Models (Gendron et al.) — extracting causal variables from documents, merging graphs, and counterfactual inference with LLMs.
- LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals (Toker et al.) — SCM-grounded structural counterfactual pairs for evaluating concept-based explanations.
High-level data and training flow:
```mermaid
flowchart LR
    subgraph inputs[Inputs]
        DS[datasets IMDB]
        CFG[Config JSON / CLI]
    end
    subgraph nano[nano_llm]
        TOK[Tokenizer hf_bpe_byte]
        DL[DataLoaders]
        M[NanoLLM decoder]
        TR[train loop]
    end
    CKPT[(checkpoint best.pt)]
    DS --> TOK
    CFG --> TR
    TOK --> DL
    DL --> TR
    TR --> M
    TR --> CKPT
```
NanoLLM uses one causal trunk. On top of the final hidden state it applies a shared vocab MLP (logits_shared) and two sentiment residual heads:
logits_k = logits_shared + Δ_k(hidden)
The last layer of each Δ is zero-initialized, so at step 0 both heads match the shared predictor. The trainer computes a treatment-weighted next-token CE loss on logits_T (where T ∈ {0,1} is factual sentiment: negative → 0, positive → 1), plus an optional Jensen–Shannon term between the two heads (tarnet_head_separation_weight). Original TARNet framing (ITE, observational causal setup): see Papers under Contributions.
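The shared-plus-delta readout with zero-initialized deltas can be sketched as follows (class and attribute names are illustrative, not necessarily the repo's):

```python
import torch
import torch.nn as nn

class TARNetReadout(nn.Module):
    """Sketch: logits_k = logits_shared + Δ_k(hidden), with Δ zero-initialized
    so both heads match the shared predictor at step 0. Names hypothetical."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.shared = nn.Linear(d_model, vocab_size)   # produces logits_shared
        self.deltas = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(2))
        for d in self.deltas:                          # cold start: Δ_k ≡ 0
            nn.init.zeros_(d.weight)
            nn.init.zeros_(d.bias)

    def forward(self, hidden: torch.Tensor):
        base = self.shared(hidden)                     # (B, L, V)
        return base + self.deltas[0](hidden), base + self.deltas[1](hidden)
```

At initialization both returned heads are identical to the shared logits; training moves them apart only as the deltas learn.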
Input text (what the model sees). Training examples and TARNet counterfactual chat share a treatment-invariant template: a command prompt, then [REVIEW] … [/REVIEW] around the body (the tokenized sequence does not include [SENTIMENT] … [/SENTIMENT]). Each example is constructed like:
`<bos>{command_prompt} [REVIEW] <review characters…> [/REVIEW]<eos>`
The default `{command_prompt}` is `GENERATE an IMDB-like review:` (config key `imdb_tarnet_command_prompt`, CLI `--imdb-tarnet-command-prompt`). `[REVIEW]` / `[/REVIEW]` come from `data.py`.
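The template above is simple string assembly; a minimal sketch (the helper name is hypothetical, and the real dataset code works on tokenized sequences):

```python
def format_tarnet_example(review: str,
                          command_prompt: str = "GENERATE an IMDB-like review:") -> str:
    """Build the treatment-invariant training string:
    <bos>{command_prompt} [REVIEW] ... [/REVIEW]<eos>."""
    return f"<bos>{command_prompt} [REVIEW] {review} [/REVIEW]<eos>"
```

Note there is no `[SENTIMENT]` markup anywhere in the string: the sentiment never enters the token stream.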
Where `T` comes from (and how it affects loss). IMDB rows are still loaded using the usual formatting (`imdb_conditioning_style`: `tags` with `[SENTIMENT] positive|negative [/SENTIMENT]`, or `natural` instructions before `[REVIEW]`) so the loader can read the factual label. `IMDBTARNetDataset` then drops the sentiment markup from the tokenized string, converts it to `T`, and trains the model with:
- Loss: next-token CE chooses `logits0` vs `logits1` according to `T` for all non-padding target tokens in the chunk window.
- The dataset exposes a `review_mask`, but the current `train` loop does not use it to restrict the TARNet CE; so `T` is not provided as an explicit token hint, only via head selection in the loss.
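The head selection in the loss can be sketched as below (function name is illustrative; the repo's trainer may weight or reduce differently):

```python
import torch
import torch.nn.functional as F

def tarnet_next_token_ce(logits0: torch.Tensor, logits1: torch.Tensor,
                         targets: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """CE on the factual head only.
    logits*: (B, L, V); targets: (B, L) with padding set to -100; T: (B,) in {0, 1}."""
    pick = T.view(-1, 1, 1).bool()
    logits = torch.where(pick, logits1, logits0)      # per-example head selection
    return F.cross_entropy(logits.transpose(1, 2),    # (B, V, L), as cross_entropy expects
                           targets, ignore_index=-100)
```

Only the head matching the factual sentiment receives gradient from each example; the other head is updated only through the shared trunk and `logits_shared`.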
At inference, use the same prefix your checkpoint was trained with (e.g. scripts/chat.py --counterfactual builds <bos>{command_prompt} [REVIEW] ); you can override the prefix text with --command-prompt.
```mermaid
flowchart TB
    subgraph trunk["Shared trunk (causal decoder)"]
        IDS["Token IDs"]
        IDS --> EMB["Embedding + positional encoding"]
        EMB --> DEC["Decoder blocks (vanilla or inter-block)"]
        DEC --> LNF["Final LayerNorm"]
    end
    LNF --> H["Hidden state H (per position)"]
    H --> SH["tarnet_shared_head"]
    SH --> LS["logits_shared"]
    H --> D0["tarnet_sentiment_delta0"]
    H --> D1["tarnet_sentiment_delta1"]
    LS --> P0["+"]
    D0 --> P0
    P0 --> L0["logits0 — head Y0"]
    LS --> P1["+"]
    D1 --> P1
    P1 --> L1["logits1 — head Y1"]
```
Why `logits_shared` (vs two full heads)? One projection carries treatment-agnostic structure (syntax, entities, review phrasing), while Δ₀/Δ₁ only nudge logits per sentiment. This uses less capacity than duplicating two full readouts and gives a cleaner Y0 vs Y1 delta on the same baseline (with the cold-start behavior described above).
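The optional head-separation term mentioned above is plain Jensen–Shannon divergence between the two heads' next-token distributions. A minimal sketch (how the repo signs and weights this term is governed by `tarnet_head_separation_weight`; this function only computes JS itself):

```python
import math
import torch
import torch.nn.functional as F

def js_divergence(logits0: torch.Tensor, logits1: torch.Tensor) -> torch.Tensor:
    """JS(P0, P1) = 0.5 * KL(P0 || M) + 0.5 * KL(P1 || M), M = (P0 + P1) / 2,
    computed per position over the vocab and averaged."""
    logp0 = logits0.log_softmax(dim=-1)
    logp1 = logits1.log_softmax(dim=-1)
    logm = torch.logaddexp(logp0, logp1) - math.log(2)  # log of the mixture M
    kl0 = F.kl_div(logm, logp0, log_target=True, reduction="none").sum(-1)  # KL(P0 || M)
    kl1 = F.kl_div(logm, logp1, log_target=True, reduction="none").sum(-1)  # KL(P1 || M)
    return 0.5 * (kl0 + kl1).mean()
```

JS is zero when the two heads coincide (as they do at step 0 under the zero-initialized deltas) and bounded above by log 2.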
Inference. `scripts/chat.py --counterfactual` and `scripts/generate.py --both-heads` sample Y0 or Y1 from that same prefix. `generate_both_heads` uses one trunk forward pass while both heads agree, then decodes independently if sampling diverges.
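The shared-prefix decode can be sketched as a greedy loop. The `step_fn` interface and the function name are hypothetical; the repo's `generate_both_heads` also handles sampling, caching, and stop conditions:

```python
import torch

@torch.no_grad()
def greedy_both_heads(step_fn, prompt_ids: list[int], max_tokens: int):
    """Sketch: one trunk pass serves both heads while their argmax tokens agree;
    after the first disagreement, each head is decoded independently.
    step_fn(ids) -> (logits0, logits1) for the last position (assumed interface)."""
    seqs = {0: list(prompt_ids), 1: list(prompt_ids)}
    diverged = False
    for _ in range(max_tokens):
        if not diverged:
            l0, l1 = step_fn(seqs[0])               # sequences identical: one forward pass
            t0, t1 = int(l0.argmax()), int(l1.argmax())
            seqs[0].append(t0)
            seqs[1].append(t1)
            diverged = t0 != t1
        else:
            for h in (0, 1):                        # separate passes once they split
                logits = step_fn(seqs[h])[h]
                seqs[h].append(int(logits.argmax()))
    return seqs[0], seqs[1]
```

With sampling instead of argmax, divergence usually happens earlier; the same share-then-split structure applies.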
```bash
# Build and run training
make train
# or: docker compose up --build

# Continue training from checkpoint
make resume EPOCHS=15
# or: docker compose run train python scripts/train.py --resume checkpoints/best.pt --epochs 15

# Generate text
make generate PROMPT="<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " MAX_TOKENS=200

# Interactive shell
make shell
```

See `make help` for all targets.
```bash
pip install wandb
wandb login  # paste API key from https://wandb.ai/authorize
```

Enable logging when training:

```bash
docker compose run --rm -e WANDB_API_KEY=... train python scripts/train.py \
  --use-wandb --wandb-project nano-llm-imdb \
  --wandb-tags imdb,hf_bpe_byte --epochs 10
```

Or with Make (set `WANDB_API_KEY` in your environment, or add it to `.env` for Compose):

```bash
make train-imdb ARGS='--use-wandb --wandb-project nano-llm --wandb-run-name run1'
```

Each epoch logs `train/loss`, `val/loss`, perplexity, and learning rate. Use `--wandb-log-model` to upload `best.pt` at the end (larger upload).
```bash
pip install -r requirements.txt
pip install -e .

# Train with defaults
python scripts/train.py

# Override hyperparameters
python scripts/train.py --d-model 128 --epochs 5 --batch-size 32

# Continue training from checkpoint (more epochs)
python scripts/train.py --resume checkpoints/best.pt --epochs 15

# Early stopping (stop if val_loss unchanged for 10 epochs)
python scripts/train.py --epochs 3000 --early-stopping-patience 10
```

After training, generate text from a checkpoint:

```bash
# Default: greedy, 100 tokens (set a checkpoint-appropriate prompt)
python scripts/generate.py

# Custom prompt and sampling (IMDB tags-style example)
python scripts/generate.py --prompt "<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " --max-tokens 200
python scripts/generate.py --method top_k --top-k 40 --temperature 0.8
python scripts/generate.py --method top_p --top-p 0.9 --seed 42

# Specific checkpoint
python scripts/generate.py --checkpoint checkpoints/best.pt
```

With Docker (after training in the container, checkpoints land in `./checkpoints`):

```bash
# Generate using GPU
docker compose run generate

# With options (args pass through to generate.py)
docker compose run generate --prompt "<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " --max-tokens 200 --method top_p
```

Localhost API server (TARNet-only):
```bash
# Start HTTP server on localhost:18080 (default port avoids common 8000 conflicts)
python scripts/inference_api.py --checkpoint checkpoints/counterfactual_repeat_20m/best.pt --host 127.0.0.1

# Same API in the GPU Docker image (NVIDIA Compose stack).
# Host listens on 127.0.0.1:18080; use --host 0.0.0.0 in-container.
make inference-api
# or: docker compose run --rm --service-ports inference-api

# Health check
curl http://127.0.0.1:18080/health

# Generate both reviews from one request
curl -X POST http://127.0.0.1:18080/generate ^
  -H "Content-Type: application/json" ^
  -d "{\"job_id\":\"api-1\",\"prompt\":\"<bos>GENERATE an IMDB-like review: [REVIEW] \",\"both_reviews\":true,\"max_tokens\":120,\"method\":\"top_p\",\"top_p\":0.9}"

# OpenAI-compatible chat completions route
curl -X POST http://127.0.0.1:18080/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"nano-llm-local\",\"messages\":[{\"role\":\"user\",\"content\":\"<bos>GENERATE an IMDB-like review: [REVIEW] \"}],\"both_reviews\":true,\"max_tokens\":120,\"top_p\":0.9}"
```

`POST /generate` accepts the same fields as worker JSON requests (`job_id`, `prompt`, `both_reviews`, `max_tokens`, sampling args, etc.). The response includes either `output` or `output_y0` + `output_y1`.

`POST /v1/chat/completions` accepts OpenAI-style `messages`; set `both_reviews=true` to return two assistant choices (Y0/Y1).
The server prints no HTTP body until generation finishes. With curl -s, the terminal stays blank until then; on CPU, max_tokens=120 with both_reviews can take several minutes—watch the API terminal for lines like [inference_api] POST /v1/chat/completions … or try a quick smoke test with max_tokens 8–16.
On Windows PowerShell, `curl.exe -d $body` often strips the opening `"` after `{`, producing invalid JSON (`{model":...`). Pipe the same here-string through `scripts/post_chat_completion.py` instead:

```powershell
@'
{"model":"nano-llm-local","messages":[{"role":"user","content":"<bos>GENERATE an IMDB-like review: [REVIEW] "}],"both_reviews":true,"max_tokens":16,"top_p":0.9}
'@ | python scripts/post_chat_completion.py
```

Alternative: write UTF-8 without BOM to a temp file, then `curl.exe ... --data-binary "@$env:TEMP\tarnet.json"`. In PowerShell, the stop-parsing token `curl.exe --%` keeps the shell from rewriting the rest of the line; plain `cmd.exe` does not rewrite quotes in the first place.
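A shell-agnostic alternative is to build the request in Python with the standard library, which sidesteps quoting entirely. The function name is ours; the request fields are the documented `/generate` fields, and the server is assumed to be running locally:

```python
import json
import urllib.request

def post_generate(body: dict, url: str = "http://127.0.0.1:18080/generate") -> dict:
    """POST a JSON job to the local inference API and return the parsed response."""
    req = urllib.request.Request(url,
                                 data=json.dumps(body).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# A small smoke-test job (low max_tokens keeps CPU latency tolerable)
smoke_job = {"job_id": "smoke-1",
             "prompt": "<bos>GENERATE an IMDB-like review: [REVIEW] ",
             "both_reviews": True, "max_tokens": 8,
             "method": "top_p", "top_p": 0.9}
```

With the server up, `post_generate(smoke_job)` returns a dict containing `output_y0` + `output_y1` (since `both_reviews` is set).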
Safety note. This project includes no model-level safety guardrails, so generated outputs may be inappropriate. As a minimal safeguard, decoded text is post-processed by default to redact a small set of explicit terms (see `content_filter.py`). Pass `--no-censor` to `scripts/generate.py` or `scripts/chat.py` to print the raw decoded output.
- **CLI and config**: `scripts/train.py` loads `DEFAULT_CONFIG` (and an optional `--config` JSON), applies CLI overrides, then calls `nano_llm.train.train(cfg)`.
- **Data**: Training loads IMDB from Hugging Face and formats each row into a conditioned string. The tokenizer is trained on train+val text unless you resume from a checkpoint with `tokenizer_state`/`vocab`, in which case it is restored to match the checkpoint. If present, JSON `dataset_id` must be `"imdb_sentiment"` (other values are rejected).
- **Batches**: Chunking keeps each prefix aligned with its review body; padded targets use ignore index `-100`. Single-head: sentiment is in the string (`[SENTIMENT]…[/SENTIMENT]` or natural instructions before `[REVIEW]`). TARNet: the tokenized prefix is treatment-invariant (`<bos>` + command + `[REVIEW]`; full template under Architecture → TARNet); factual sentiment is carried only as batch `T`.
- **Model**: Causal decoder-only `NanoLLM`. With `--tarnet-two-heads`, `weight_tie` is off (see Architecture → TARNet).
- **Loss and optimization**:
  - Single head: next-token cross-entropy (optional weight-tied embeddings).
  - TARNet: weighted CE on the head that matches factual `T`, plus optional JS via `tarnet_head_separation_weight` (full wiring in Architecture → TARNet).
  - Optimizer: AdamW; LR schedule: cosine, linear, or none. AMP: `fp16`/`bf16` on CUDA when configured.
- **Checkpointing**: When validation improves, `best.pt` stores `model` weights, the full `config`, `vocab`, and `tokenizer_state` for reproducible load and chat.
- **IMDB conditioning**:
  - `tags` (default): `[SENTIMENT] positive|negative [/SENTIMENT] [REVIEW] … [/REVIEW]`.
  - `natural`: instruction text before `[REVIEW]` (`--imdb-conditioning-style natural`, optional `--imdb-positive-instruction` / `--imdb-negative-instruction`).
  - `scripts/chat.py` reads `imdb_conditioning_style` from the checkpoint for single-head models.
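The `-100` ignore index on padded targets is a small but load-bearing detail; a minimal sketch of how it pairs with next-token shifting (illustrative only, since the repo's chunking also aligns prefixes with review bodies):

```python
import torch

def shift_targets(ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Next-token targets for a (B, L) id batch; padded positions are set to
    -100 so cross_entropy(ignore_index=-100) skips them."""
    targets = ids[:, 1:].clone()        # predict token t+1 from position t
    targets[targets == pad_id] = -100   # mask padding out of the loss
    return targets
```

Because `F.cross_entropy` defaults `ignore_index` to `-100`, masked positions contribute neither loss nor gradient.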
Train on IMDB, then interactive chat:

```bash
make train-imdb EPOCHS=30
make chat-imdb
```

`chat-imdb` follows the checkpoint's `imdb_conditioning_style` (tags vs natural instructions) for single-head models; TARNet counterfactual mode uses the command prompt + `[REVIEW]`. Override the checkpoint with `IMDB_CHECKPOINT=path/to/best.pt make chat-imdb`.

Generate (one shot):

```bash
docker compose run --rm generate \
  --checkpoint checkpoints/imdb_sentiment/hf_bpe_byte/best.pt \
  --prompt "<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " \
  --method top_p --temperature 0.7 --repetition-penalty 1.2 \
  --max-tokens 300 --stop-sequence "[/REVIEW]"
```

Resume IMDB training from a checkpoint:

```bash
docker compose run --rm train python scripts/train.py \
  --resume checkpoints/imdb_sentiment/hf_bpe_byte/best.pt \
  --tokenizer-type hf_bpe_byte --bpe-vocab-size 256 --position-encoding rope \
  --imdb-max-review-chars 500 --epochs 30 \
  --checkpoint-dir checkpoints/imdb_sentiment/hf_bpe_byte --early-stopping-patience 5
```

```bash
# Unit tests only (default; integration tests are deselected via pyproject.toml)
make test
# or: pytest

# With coverage (core package; fails under 40% when coverage is enabled)
make test-cov
# or: pytest --cov=src/nano_llm --cov-report=term-missing

# All tests including integration (slow; may download IMDB)
make test-all
# or: pytest --override-ini "addopts=-v -x"
```

GitHub Actions runs Ruff (lint + format check) and the default pytest selection on pushes and pull requests to `main` / `master`.
```bash
pip install -e ".[dev]"
pre-commit install       # optional: ruff + whitespace/yaml hooks from .pre-commit-config.yaml
pre-commit run --all-files
```

Ruff in pre-commit is limited to `src/` and `tests/` (the same scope as `make lint` / CI). Other hooks still run on staged files repo-wide.
| Path | Role |
|---|---|
| `.github/workflows/ci.yml` | Ruff + pytest on push/PR to `main` / `master` |
| `.pre-commit-config.yaml` | Optional Ruff + file hygiene hooks |
| `.env.example` | Template for secrets / env (copy to `.env`; never commit `.env`) |
| `LICENSE` | MIT |
| `docs/README.md` | Index of Jupyter tutorials (tokenizer, IMDB, sampling, decoder stacks) |
| `src/nano_llm/` | Model, layers, tokenizer, data, training, inference |
| `scripts/train.py` | Training CLI |
| `scripts/generate.py` | Generation CLI |
| Source | Purpose |
|---|---|
| `scripts/train.py --config` | JSON file merged with defaults; highest priority after CLI flags |
| Env `NANO_LLM_CONFIG` | Optional path to a default JSON (used by `load_config` helpers) |
| Env `WANDB_API_KEY` | Optional; Weights & Biases when `--use-wandb` |
Copy `.env.example` to `.env` for local or Compose secrets (never commit `.env`). CLI flags override values from config files.
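The precedence (built-in defaults < JSON config < CLI flags) can be sketched as a small merge; `resolve_config` and the toy `DEFAULT_CONFIG` here are illustrative, not the repo's actual helpers:

```python
import json
import os

# Toy defaults standing in for the repo's DEFAULT_CONFIG.
DEFAULT_CONFIG = {"d_model": 64, "epochs": 10, "batch_size": 32}

def resolve_config(cli_overrides: dict) -> dict:
    """Merge order (later wins): defaults, then the JSON file named by
    NANO_LLM_CONFIG (if set), then any CLI flags the user actually passed."""
    cfg = dict(DEFAULT_CONFIG)
    path = os.environ.get("NANO_LLM_CONFIG")
    if path:
        with open(path) as f:
            cfg.update(json.load(f))
    # Unset CLI flags arrive as None and must not clobber earlier values.
    cfg.update({k: v for k, v in cli_overrides.items() if v is not None})
    return cfg
```

The `None` filter is the key detail: only flags the user explicitly set on the command line override file values.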
Runtime: PyTorch in the NGC PyTorch Docker image. On CUDA, fp16 mixed precision is used by default when enabled in the config.
Released under the MIT License.
Archived notes: Docker one-liners, training logs, and chat transcripts. Transcripts are mostly verbatim model output; a few explicit tokens are replaced with [redacted] for readability.
**Summary**

- Large TARNet (≈20M params): `d_model=512`, `num_heads=8`, `num_layers=6`, `d_ff=1888`, `seq_len=256`, RoPE, inter-block residuals, `hf_bpe_byte` vocab 256, `--tarnet-head-separation-weight 0.02`, 40 epochs, checkpoint `counterfactual_repeat_20m`. ≈20,040,768 parameters; ≈4.8 h wall time; val loss ≈1.96 → best ≈1.65; perplexity ≈7 → ≈5.2 (best val ≈5.21).
- Qualitative: counterfactual Y0 vs Y1 show different sentiment styles; fluency varies across prompts at this scale.
- 512-wide comparison table (vanilla vs inter-block vs TARNet): see the expanded section below.
Expand — raw logs, commands, and full table
The logs below compare three 512-wide IMDB runs (same `num_layers=6`, `d_ff=1888`, `seq_len=256`, RoPE, `hf_bpe_byte` vocab 256, batch 16). Rows are ordered worst → best by best validation CE (lower is better).
| Order | Checkpoint | Decoder stack | LM head | Epochs | Params | Best val CE | Final val CE | Best val PPL | Wall time |
|---|---|---|---|---|---|---|---|---|---|
| 1 (worst) | `imdb_baseline_vanilla_20m` | Vanilla (`block_attn_residuals` off) | Single, `imdb_conditioning_style=natural` | 20 | 18,062,400 | 1.727 | 1.727 | 5.62 | ≈1.2 h |
| 2 | `imdb_baseline_natural_20m` | Inter-block | Single, natural | 20 | 18,074,688 | 1.717 | 1.717 | 5.57 | ≈2.2 h |
| 3 (best) | `counterfactual_repeat_20m` | Inter-block | TARNet two heads, `tarnet_head_separation_weight=0.02` | 40 | 20,040,768 | 1.650 | 1.664 | 5.21 | ≈4.8 h |
What was added each step, and what improved:

- Baseline: vanilla decoder + natural instructions, single head. Best val CE ≈1.73.
- Inter-block residuals (`--block-attn-residuals`): adds ≈12k parameters (and about 2× wall time for 20 epochs) and improves best val CE to ≈1.717.
- TARNet + longer training + head separation: adds two sentiment heads, a JS separation loss (weight `0.02`), and longer training (40 epochs), reaching best val CE ≈1.65.
Note: objectives differ across configurations (single-head vs TARNet dual-head), so comparisons reflect both architecture and loss changes.
Training command:

```bash
docker compose run --rm train python scripts/train.py --epochs 40 --batch-size 16 --d-model 512 --num-heads 8 --num-layers 6 --d-ff 1888 --seq-len 256 --dropout 0.1 --tokenizer-type hf_bpe_byte --bpe-vocab-size 256 --tarnet-two-heads --tarnet-head-n-fc 2 --position-encoding rope --block-attn-residuals --macro-block-size 2 --max-block-representations 9 --checkpoint-dir checkpoints/counterfactual_repeat_20m --tarnet-head-separation-weight 0.02
```

Chat command:

```bash
docker compose run --rm -it chat --checkpoint checkpoints/counterfactual_repeat_20m/best.pt --max-tokens 340 --temperature 0.7 --method top_p --top-p 0.5 --counterfactual --repetition-penalty 1.5
```
Generate [+/-/b/q] (default b):
[Y0]
ember the same way, and it was all downhill from there. The pace is terrible, but nothing exciting happens. So what's with that? And why is this movie such a confused mess of old film? Too many story lines involve between those guards where you see it only to be fascistic (I'm not asking for more) to disgust over the center of this movie and think that you had good times past her and I was really let down but it seldom became more informative.
[Y1]
What a fun movie! The casting of the two leads is great. She portrays the dancer who has been married to her old but she would not have [redacted] with her. And, in this case, sometimes it's about as good as everyone elses. This film was once again just before you'll stick with it and remind my opinion of that for you.
Generate [+/-/b/q] (default b):
[Y0]
For a film that has no character development, and it is pretty bad. The story is told in sometimes you'll find out what the heck was good about this movie. And just like when you throw monster moving on with the title of this movie and see something exceptional before that everyone did not give it another message. THIS IS SOMELY AWFUL!
[Y1]
raphics, so there is not a lot of contention in this movie. The story tells a good man who has been put together by describing the way he is in it and finally gets kidnapped but only to find out that his woman is now being tumbled. That's what I've seen for years! And he's excellent as the damaged counterfeiter; and Similarly she poses as an elderly man with a real job, you've seen it. The film moves with something more than just playing on.
Generate [+/-/b/q] (default b):
[Y0]
ts, and then all of a sudden he decides to go back to live with his wife. There's nothing more than that, in fact it is just plain stupid and trite. And the ending was cheesy and overly similar for this movie. That's why I don't see how anybody mentioned that this film is goofball-made in its [redacted] way. Sometimes you can't believe the reasons for examining what literally made this movie sometimes becomes out of nowhere; once you see SOYLENT NIGHT!
[Y1]
Perhaps the most interesting thing about this movie is that it has no sense of disbelief. The story is good, but not as funny as you can see, and it's excellent! That's what I liked.
Generate [+/-/b/q] (default b):
[Y0]
OK, there are some funny moments in this movie. The characters were poorly drawn and extras were not too bad. And the story was terrible, I thought it was good for a low budget film. This is only because of it's just that it has non-plot tension and plot lines that severely cut together without an overall damning flatulence of any sort which is really abounding! Scarface was stilted, as in the old hospital cosmic mansion on his head with similar powers; he did NOT give kid all of what he examined that?? Affore comes to be mental into light-hearking but this film's finest storm is relevance, and yet you'll be modish and I've seen only for good old pick-basic election, when I th
[Y1]
One of the most incredible films ever made. This is a movie that worked beautifully, and it was considering the programmes on this site, but I found it to be a lot more open-minded than you have seen for example. The story is dark and nicely detailed with good suspension of disbelief and quality.
Generate [+/-/b/q] (default b):
[Y0]
What a waste of time. To be honest, this is only the worst movie ever made! The premises in this movie seem to sound like it was dubbed or not. Sometimes you'll get by the casting of Cagney and Adams, which may be more than that.
[Y1]
Have you ever seen this movie? This is a superb film. The casting of the two leads, especially Sammo Hung as the woman whose father was born in Afghanistan and he did not get himself into being a parrot. I think that it's madly because it's strength and sentimentality is more out of place, as with most old movies they'll live without exceptional conversation.
Generate [+/-/b/q] (default b):
[Y0]
When I saw this movie, it was full of flaws. The acting is bad and the performances are wooden and the story line does not match up to me. And while that's becoming more than half of the film is obviously terrible, you've gotten a little bit into it as well as once in only times you see how shallow he is!
[Y1]
One of the best movies ever made. The story is about tragically changing his ways and doing something like this, as he prefers to become a genuinely fascinated (and industry) dreamer. And it's not so bad, it's good for those who hated that movie with young men whose family live on out and when the concept was before you seen the film in 1973! Timothy Spall was excellent as Spanjers, he played Amitabh Bachchan is good to watch; now I've seldom seen him more than this once? This movie is actual movie for you!