dyh1265/Transformers

Nano-LLM

Small decoder-only transformers, trained from scratch in PyTorch.

CI Python 3.10+ License: MIT


Contents

| Section | What’s there |
| --- | --- |
| Features | Tokenizer, data, Docker, tooling |
| Contributions | Inter-block residual blocks, TARNet-style training |
| Architecture | Data → train → checkpoint; optional TARNet head diagram |
| Quick start | Docker, local install, W&B, generation |
| How training works | Pipeline overview |
| Training IMDB | make train-imdb, chat, resume |
| Tests | pytest, coverage, CI |
| Development | pip install -e ".[dev]", pre-commit |
| Project structure | Layout of the repo |
| Config | Env vars and JSON config |
| License | MIT |
| Experiment archive | Collapsed logs and notes |

Features

| Component | Details |
| --- | --- |
| Tokenizer | Hugging Face byte-level BPE (hf_bpe_byte); sinusoidal or RoPE positional encoding |
| Model | Causal multi-head self-attention; optional inter-block residuals; optional TARNet two-head mode |
| Data | IMDB sentiment via Hugging Face datasets (tags or natural conditioning) |
| Runtime | Docker (NGC PyTorch), Make targets, optional Weights & Biases |

Contributions

This repo is a small from-scratch decoder LM; beyond the baseline stack, it highlights two implementation threads you can turn on via config / CLI:

Inter-block residual blocks. With --block-attn-residuals, layers use InterBlockAttnDecoderBlock instead of the vanilla DecoderBlock. Before the usual causal self-attention and FFN, each step mixes the current stream with prior macro-block outputs through a depth-attention residual (RMSNorm keys, one learned pseudo-query, softmax over depth). These are parallel residual adds into the same hidden stream, with snapshots taken at macro-block boundaries.
TARNet-like training. With --tarnet-two-heads, one trunk feeds logits_shared plus Δ_k readouts; training applies a weighted CE on the head selected by the factual treatment T, with an optional JS separation term (tarnet_head_separation_weight). The diagram, equation, shared-head rationale, and inference flag (--counterfactual) are under Architecture → TARNet.
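As a rough illustration of the depth-attention residual described above (a sketch under stated assumptions, not the repo's InterBlockAttnDecoderBlock API; function and argument names here are hypothetical):

```python
import torch

def depth_attention_residual(stream, snapshots, pseudo_query, eps=1e-6):
    """Mix the current stream with prior macro-block snapshots.

    stream:       [B, T, C]    current hidden stream
    snapshots:    [D, B, T, C] outputs saved at prior macro-block boundaries
    pseudo_query: [C]          one learned query shared across depth
    """
    # RMSNorm the keys (normalize each snapshot over the channel dim)
    keys = snapshots / snapshots.pow(2).mean(-1, keepdim=True).add(eps).sqrt()
    # One score per depth and position; softmax runs over the depth axis
    scores = torch.einsum("dbtc,c->dbt", keys, pseudo_query)
    weights = torch.softmax(scores, dim=0)                     # [D, B, T]
    mixed = torch.einsum("dbt,dbtc->btc", weights, snapshots)  # [B, T, C]
    # Parallel residual add into the same hidden stream
    return stream + mixed
```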

Papers.

Attention residuals (decoder). Macro-block depth mixing is in the same family as Block AttnRes in Attention Residuals (Kimi Team, 2026). Conceptual background and how it maps to this repo: vanilla vs inter-block decoder.

TARNet (classical ITE). Dual-head training here is inspired by shared encoders and treatment-specific heads in Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson & Sontag)—adapted to next-token LM loss, not the paper’s observational ITE setup.

Counterfactuals × LLMs (further reading). Different formalisms from this repo’s supervised dual-head decoder, but adjacent if you care about “alternative text under a treatment” with language models:


Architecture

High-level data and training flow:

flowchart LR
  subgraph inputs[Inputs]
    DS[datasets IMDB]
    CFG[Config JSON / CLI]
  end
  subgraph nano[nano_llm]
    TOK[Tokenizer hf_bpe_byte]
    DL[DataLoaders]
    M[NanoLLM decoder]
    TR[train loop]
  end
  CKPT[(checkpoint best.pt)]
  DS --> TOK
  CFG --> TR
  TOK --> DL
  DL --> TR
  TR --> M
  TR --> CKPT

TARNet two-head mode (--tarnet-two-heads)

NanoLLM uses one causal trunk. On top of the final hidden state it applies a shared vocab MLP (logits_shared) and two sentiment residual heads:

logits_k = logits_shared + Δ_k(hidden)

The last layer of each Δ is zero-initialized, so at step 0 both heads match the shared predictor. The trainer computes a treatment-weighted next-token CE loss on logits_T (where T ∈ {0,1} is factual sentiment: negative → 0, positive → 1), plus an optional Jensen–Shannon term between the two heads (tarnet_head_separation_weight). Original TARNet framing (ITE, observational causal setup): see Papers under Contributions.
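A minimal sketch of the two-head readout and treatment-selected loss, assuming hypothetical module and function names (the repo's actual heads are tarnet_shared_head and the Δ MLPs, which may be wired differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadReadout(nn.Module):
    """logits_k = logits_shared + Δ_k(hidden), with zero-initialized deltas."""

    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.shared = nn.Linear(d_model, vocab_size)
        self.delta0 = nn.Linear(d_model, vocab_size)
        self.delta1 = nn.Linear(d_model, vocab_size)
        # Zero-init the deltas so both heads match the shared predictor at step 0.
        for d in (self.delta0, self.delta1):
            nn.init.zeros_(d.weight)
            nn.init.zeros_(d.bias)

    def forward(self, hidden):  # hidden: [B, T, C]
        ls = self.shared(hidden)
        return ls + self.delta0(hidden), ls + self.delta1(hidden)

def tarnet_ce(logits0, logits1, targets, T):
    """Next-token CE on the head matching factual T ∈ {0,1};
    padded targets carry ignore_index=-100 as in the training loop."""
    logits = torch.where(T.view(-1, 1, 1).bool(), logits1, logits0)
    return F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), ignore_index=-100
    )
```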

Input text (what the model sees). Training examples and TARNet counterfactual chat share a treatment-invariant template: a command prompt, then [REVIEW] … [/REVIEW] around the body (the tokenized sequence does not include [SENTIMENT] … [/SENTIMENT]). Each example is constructed like:

<bos>{command_prompt} [REVIEW] <review characters…> [/REVIEW]<eos>

The default {command_prompt} is GENERATE an IMDB-like review: (config key imdb_tarnet_command_prompt, CLI --imdb-tarnet-command-prompt). [REVIEW] / [/REVIEW] come from data.py.
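The template above can be sketched as plain string assembly (marker strings are taken from this README; the real constants live in data.py):

```python
BOS, EOS = "<bos>", "<eos>"
REVIEW_OPEN, REVIEW_CLOSE = "[REVIEW]", "[/REVIEW]"
DEFAULT_COMMAND = "GENERATE an IMDB-like review:"

def format_tarnet_example(review, command_prompt=DEFAULT_COMMAND):
    """Treatment-invariant training string: no [SENTIMENT] markup appears."""
    return f"{BOS}{command_prompt} {REVIEW_OPEN} {review} {REVIEW_CLOSE}{EOS}"

def tarnet_prefix(command_prompt=DEFAULT_COMMAND):
    """Inference prefix of the kind chat --counterfactual builds."""
    return f"{BOS}{command_prompt} {REVIEW_OPEN} "
```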

Where T comes from (and how it affects loss). IMDB rows are still loaded using the usual formatting (imdb_conditioning_style: tags with [SENTIMENT] positive|negative [/SENTIMENT], or natural instructions before [REVIEW]) so the loader can read the factual label. IMDBTARNetDataset then drops the sentiment markup from the tokenized string, converts it to T, and trains the model with:

  • Loss: next-token CE chooses logits0 vs logits1 according to T for all non-padding target tokens in the chunk window.
  • The dataset exposes a review_mask, but the current train loop does not use it to restrict the TARNet CE; so T is not provided as an explicit token hint—only via head selection in the loss.

At inference, use the same prefix your checkpoint was trained with (e.g. scripts/chat.py --counterfactual builds <bos>{command_prompt} [REVIEW] ); you can override the prefix text with --command-prompt.

flowchart TB
  subgraph trunk["Shared trunk (causal decoder)"]
    IDS["Token IDs"]
    IDS --> EMB["Embedding + positional encoding"]
    EMB --> DEC["Decoder blocks (vanilla or inter-block)"]
    DEC --> LNF["Final LayerNorm"]
  end
  LNF --> H["Hidden state H (per position)"]
  H --> SH["tarnet_shared_head"]
  SH --> LS["logits_shared"]
  H --> D0["tarnet_sentiment_delta0"]
  H --> D1["tarnet_sentiment_delta1"]
  LS --> P0["+"]
  D0 --> P0
  P0 --> L0["logits0 — head Y0"]
  LS --> P1["+"]
  D1 --> P1
  P1 --> L1["logits1 — head Y1"]

Why logits_shared (vs two full heads)? One projection carries treatment-agnostic structure (syntax, entities, review phrasing), while Δ₀/Δ₁ only nudge the logits per sentiment. This uses less capacity than duplicating two full readouts and yields a cleaner Y0 vs Y1 delta against the same baseline (with the cold-start behavior described above).

Inference. scripts/chat.py --counterfactual and scripts/generate.py --both-heads sample Y0 or Y1 from that same prefix. generate_both_heads uses one trunk forward pass while both heads agree, then decodes independently if sampling diverges.
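The shared-pass-then-fork idea can be sketched with greedy decoding (illustrative only; the repo's generate_both_heads handles sampling and batching differently, and step_fn here is a stand-in for a trunk forward pass):

```python
def _argmax(logits):
    # index of the largest logit (greedy pick)
    return max(range(len(logits)), key=logits.__getitem__)

def generate_both_heads(step_fn, prefix_ids, max_tokens):
    """step_fn(ids) -> (logits0, logits1) for the next position.
    While the greedy picks agree, one shared pass advances both sequences;
    after the first divergence each head decodes on its own."""
    seq0 = seq1 = list(prefix_ids)
    diverged = False
    for _ in range(max_tokens):
        if not diverged:
            l0, l1 = step_fn(seq0)
            t0, t1 = _argmax(l0), _argmax(l1)
            if t0 == t1:
                seq0 = seq1 = seq0 + [t0]  # single shared step
                continue
            diverged = True
            seq0, seq1 = seq0 + [t0], seq1 + [t1]
        else:
            seq0 = seq0 + [_argmax(step_fn(seq0)[0])]
            seq1 = seq1 + [_argmax(step_fn(seq1)[1])]
    return seq0, seq1
```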


Quick start

With Docker (recommended)

# Build and run training
make train
# or: docker compose up --build

# Continue training from checkpoint
make resume EPOCHS=15
# or: docker compose run train python scripts/train.py --resume checkpoints/best.pt --epochs 15

# Generate text
make generate PROMPT="<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " MAX_TOKENS=200

# Interactive shell
make shell

See make help for all targets.

Weights & Biases (experiment tracking)

pip install wandb
wandb login   # paste API key from https://wandb.ai/authorize

Enable logging when training:

docker compose run --rm -e WANDB_API_KEY=... train python scripts/train.py \
  --use-wandb --wandb-project nano-llm-imdb \
  --wandb-tags imdb,hf_bpe_byte --epochs 10

Or with Make (set WANDB_API_KEY in your environment, or add it to .env for Compose):

make train-imdb ARGS='--use-wandb --wandb-project nano-llm --wandb-run-name run1'

Each epoch logs train/loss, val/loss, perplexity, learning rate. Use --wandb-log-model to upload best.pt at the end (larger upload).

Local

pip install -r requirements.txt
pip install -e .

# Train with defaults
python scripts/train.py

# Override hyperparameters
python scripts/train.py --d-model 128 --epochs 5 --batch-size 32

# Continue training from checkpoint (more epochs)
python scripts/train.py --resume checkpoints/best.pt --epochs 15

# Early stopping (stop if val_loss unchanged for 10 epochs)
python scripts/train.py --epochs 3000 --early-stopping-patience 10
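The patience logic behind --early-stopping-patience can be sketched as follows (illustrative, not the trainer's exact code):

```python
def should_stop(val_losses, patience):
    """True once val loss has not improved for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses)
    # epochs elapsed since the best value was first achieved
    since_best = len(val_losses) - 1 - val_losses.index(best)
    return since_best >= patience
```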

Generation (inference)

After training, generate text from a checkpoint:

# Default: greedy, 100 tokens (set a checkpoint-appropriate prompt)
python scripts/generate.py

# Custom prompt and sampling (IMDB tags-style example)
python scripts/generate.py --prompt "<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " --max-tokens 200
python scripts/generate.py --method top_k --top-k 40 --temperature 0.8
python scripts/generate.py --method top_p --top-p 0.9 --seed 42

# Specific checkpoint
python scripts/generate.py --checkpoint checkpoints/best.pt
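Nucleus sampling, as selected by --method top_p above, can be sketched for a single position (a simplified version, not the repo's implementation):

```python
import torch

def sample_top_p(logits, top_p=0.9, temperature=1.0, generator=None):
    """Keep the smallest set of tokens whose probability mass reaches top_p,
    renormalize, then sample one token id."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(-1)
    # Keep tokens until cumulative mass first exceeds top_p (always keep one).
    cut = int((cumulative < top_p).sum().item()) + 1
    kept = sorted_probs[:cut] / sorted_probs[:cut].sum()
    idx = torch.multinomial(kept, 1, generator=generator)
    return int(sorted_ids[idx])
```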

With Docker (after training in container, checkpoints in ./checkpoints):

# Generate using GPU
docker compose run generate

# With options (args pass through to generate.py)
docker compose run generate --prompt "<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " --max-tokens 200 --method top_p

Localhost API server (TARNet-only):

# Start HTTP server on localhost:18080 (default port avoids common 8000 conflicts)
python scripts/inference_api.py --checkpoint checkpoints/counterfactual_repeat_20m/best.pt --host 127.0.0.1

# Same API in the GPU Docker image (NVIDIA Compose stack). Host listens on 127.0.0.1:18080; use --host 0.0.0.0 in-container.
make inference-api
# or: docker compose run --rm --service-ports inference-api

# Health check
curl http://127.0.0.1:18080/health

# Generate both reviews from one request
curl -X POST http://127.0.0.1:18080/generate ^
  -H "Content-Type: application/json" ^
  -d "{\"job_id\":\"api-1\",\"prompt\":\"<bos>GENERATE an IMDB-like review: [REVIEW] \",\"both_reviews\":true,\"max_tokens\":120,\"method\":\"top_p\",\"top_p\":0.9}"

# OpenAI-compatible chat completions route
curl -X POST http://127.0.0.1:18080/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"nano-llm-local\",\"messages\":[{\"role\":\"user\",\"content\":\"<bos>GENERATE an IMDB-like review: [REVIEW] \"}],\"both_reviews\":true,\"max_tokens\":120,\"top_p\":0.9}"

POST /generate accepts the same fields as worker JSON requests (job_id, prompt, both_reviews, max_tokens, sampling args, etc.). Response includes either output or output_y0 + output_y1. POST /v1/chat/completions accepts OpenAI-style messages; set both_reviews=true to return two assistant choices (Y0/Y1).

The server prints no HTTP body until generation finishes. With curl -s, the terminal stays blank until then; on CPU, max_tokens=120 with both_reviews can take several minutes—watch the API terminal for lines like [inference_api] POST /v1/chat/completions … or try a quick smoke test with max_tokens 8–16.

On Windows PowerShell, curl.exe -d $body often strips the opening " after {, producing invalid JSON ({model":...). Pipe the same here-string through scripts/post_chat_completion.py instead:

@'
{"model":"nano-llm-local","messages":[{"role":"user","content":"<bos>GENERATE an IMDB-like review: [REVIEW] "}],"both_reviews":true,"max_tokens":16,"top_p":0.9}
'@ | python scripts/post_chat_completion.py

Alternative: write UTF-8 without BOM to a temp file, then
curl.exe ... --data-binary "@$env:TEMP\tarnet.json". In PowerShell, the stop-parsing token (curl.exe --% ...) stops PowerShell from rewriting the rest of the line.
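Another quoting-proof option is to post the JSON from Python with the standard library (endpoint and fields follow the examples above; the helper names here are ours, not the repo's):

```python
import json
import urllib.request

def build_payload(prompt, max_tokens=16, top_p=0.9):
    """Assemble the chat-completions body used in the curl examples."""
    return {
        "model": "nano-llm-local",
        "messages": [{"role": "user", "content": prompt}],
        "both_reviews": True,
        "max_tokens": max_tokens,
        "top_p": top_p,
    }

def post_chat(payload, url="http://127.0.0.1:18080/v1/chat/completions"):
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```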

Safety note. This project does not include safety guardrails at model level. Generated outputs may still be inappropriate; as a minimal safeguard, decoded text is post-processed to redact a small set of explicit terms by default (see content_filter.py). Use --no-censor on scripts/generate.py or scripts/chat.py to print raw decoded output.
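The kind of post-processing content_filter.py performs can be sketched as a word-boundary substitution (the term list below is a placeholder, not the real one):

```python
import re

# Placeholder terms for illustration; the actual list lives in content_filter.py.
EXPLICIT_TERMS = ["badword1", "badword2"]

def redact(text, terms=EXPLICIT_TERMS):
    """Replace each listed term (case-insensitive, whole words) with [redacted]."""
    pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, terms))
    return re.sub(pattern, "[redacted]", text, flags=re.IGNORECASE)
```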

How training works

  1. CLI and config — scripts/train.py loads DEFAULT_CONFIG (and optional --config JSON), applies CLI overrides, then calls nano_llm.train.train(cfg).
  2. Data — Training loads IMDB from Hugging Face and formats each row into a conditioned string. The tokenizer is trained on train+val text unless you resume from a checkpoint with tokenizer_state / vocab, in which case it is restored to match the checkpoint. If present, JSON dataset_id must be "imdb_sentiment" (other values are rejected).
  3. Batches — Chunking keeps each prefix aligned with its review body; padded targets use ignore index -100. Single-head: sentiment in the string ([SENTIMENT]…[/SENTIMENT] or natural instructions before [REVIEW]). TARNet: the tokenized prefix is treatment-invariant (<bos> + command + [REVIEW]; full template under Architecture → TARNet); factual sentiment is carried only as batch T.
  4. Model — Causal decoder-only NanoLLM. With --tarnet-two-heads, weight_tie is off (see Architecture → TARNet).
  5. Loss and optimization
    • Single head: next-token cross-entropy (optional weight-tied embeddings).
    • TARNet: weighted CE on the head that matches factual T, optional JS via tarnet_head_separation_weight (full wiring in Architecture → TARNet).
    • Optimizer: AdamW; LR schedule: cosine, linear, or none. AMP: fp16 / bf16 on CUDA when configured.
  6. Checkpointing — When validation improves, best.pt stores model weights, full config, vocab, and tokenizer_state for reproducible load and chat.
  7. IMDB conditioning
    • tags (default): [SENTIMENT] positive|negative [/SENTIMENT] [REVIEW] … [/REVIEW].
    • natural: instruction text before [REVIEW] (--imdb-conditioning-style natural, optional --imdb-positive-instruction / --imdb-negative-instruction).
    • scripts/chat.py reads imdb_conditioning_style from the checkpoint for single-head models.
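Steps 3 and 5 above can be sketched together: shift-by-one targets with padding masked out of the CE via -100 (illustrative, not the repo's batching code):

```python
IGNORE_INDEX = -100  # matches the ignore index used for padded targets

def make_chunk(token_ids, seq_len, pad_id=0):
    """Build one training chunk: inputs and targets offset by one position,
    with padding positions excluded from the loss via IGNORE_INDEX."""
    ids = token_ids[: seq_len + 1]
    inputs = ids[:-1]
    targets = ids[1:]
    pad = seq_len - len(inputs)
    inputs = inputs + [pad_id] * pad
    targets = targets + [IGNORE_INDEX] * pad
    return inputs, targets
```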

Training IMDB

Train on IMDB, then interactive chat:

make train-imdb EPOCHS=30
make chat-imdb

chat-imdb follows the checkpoint’s imdb_conditioning_style (tags vs natural instructions) for single-head models; TARNet counterfactual mode uses the command prompt + [REVIEW]. Override checkpoint: IMDB_CHECKPOINT=path/to/best.pt make chat-imdb.

Generate (one shot):

docker compose run --rm generate \
  --checkpoint checkpoints/imdb_sentiment/hf_bpe_byte/best.pt \
  --prompt "<bos>[SENTIMENT] positive [/SENTIMENT] [REVIEW] " \
  --method top_p --temperature 0.7 --repetition-penalty 1.2 \
  --max-tokens 300 --stop-sequence "[/REVIEW]"

Resume IMDB training from a checkpoint:

docker compose run --rm train python scripts/train.py \
  --resume checkpoints/imdb_sentiment/hf_bpe_byte/best.pt \
  --tokenizer-type hf_bpe_byte --bpe-vocab-size 256 --position-encoding rope \
  --imdb-max-review-chars 500 --epochs 30 \
  --checkpoint-dir checkpoints/imdb_sentiment/hf_bpe_byte --early-stopping-patience 5

Tests

# Unit tests only (default; integration tests are deselected via pyproject.toml)
make test
# or: pytest

# With coverage (core package; fails under 40% when coverage is enabled)
make test-cov
# or: pytest --cov=src/nano_llm --cov-report=term-missing

# All tests including integration (slow; may download IMDB)
make test-all
# or: pytest --override-ini "addopts=-v -x"

GitHub Actions runs Ruff (lint + format check) and the default pytest selection on pushes and pull requests to main / master.


Development

pip install -e ".[dev]"
pre-commit install   # optional: ruff + whitespace/yaml hooks from .pre-commit-config.yaml
pre-commit run --all-files

Ruff in pre-commit is limited to src/ and tests/ (same scope as make lint / CI). Other hooks still run on staged files repo-wide.


Project structure

| Path | Role |
| --- | --- |
| .github/workflows/ci.yml | Ruff + pytest on push/PR to main / master |
| .pre-commit-config.yaml | Optional Ruff + file hygiene hooks |
| .env.example | Template for secrets / env (copy to .env; never commit .env) |
| LICENSE | MIT |
| docs/README.md | Index of Jupyter tutorials (tokenizer, IMDB, sampling, decoder stacks) |
| src/nano_llm/ | Model, layers, tokenizer, data, training, inference |
| scripts/train.py | Training CLI |
| scripts/generate.py | Generation CLI |

Config

| Source | Purpose |
| --- | --- |
| scripts/train.py --config | JSON file merged with defaults; highest priority after CLI flags |
| Env NANO_LLM_CONFIG | Optional path to default JSON (used by load_config helpers) |
| Env WANDB_API_KEY | Optional; Weights & Biases when --use-wandb |

Copy .env.example to .env for local or Compose secrets (never commit .env). CLI flags override values from config files.
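The precedence just described (defaults, then --config JSON, then CLI flags) can be sketched as follows (illustrative; the repo's loader may differ in details):

```python
import json

def resolve_config(defaults, config_path=None, cli_overrides=None):
    """Merge defaults < JSON config file < CLI flags (later wins).
    CLI values of None mean 'flag not given' and are skipped."""
    cfg = dict(defaults)
    if config_path is not None:
        with open(config_path) as f:
            cfg.update(json.load(f))
    cfg.update({k: v for k, v in (cli_overrides or {}).items() if v is not None})
    return cfg
```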


Framework

PyTorch in the NGC PyTorch Docker image. On CUDA, mixed precision (fp16 or bf16) is used when configured.


License

Released under the MIT License.


Experiment archive

Archived notes: Docker one-liners, training logs, and chat transcripts. Transcripts are mostly verbatim model output; a few explicit tokens are replaced with [redacted] for readability.

Summary

  • Large TARNet (≈20M params): d_model=512, num_heads=8, num_layers=6, d_ff=1888, seq_len=256, RoPE, inter-block residuals, hf_bpe_byte vocab 256, --tarnet-head-separation-weight 0.02, 40 epochs, counterfactual_repeat_20m. ≈20,040,768 parameters; ≈4.8 h wall time; val loss ≈1.96 → best ≈1.65; perplexity ≈7 → ≈5.2 (val best ≈5.21).
  • Qualitative: Counterfactual Y0 vs Y1 show different sentiment styles; fluency varies across prompts at this scale.
  • 512-wide comparison table: vanilla vs inter-block vs TARNet — see expanded section below.
Expand — raw logs, commands, and full table

IMDB ≈18–20M runs: training, validation, and sample quality (summary)

The logs below compare three 512-wide IMDB runs (same num_layers=6, d_ff=1888, seq_len=256, RoPE, hf_bpe_byte vocab 256, batch 16). Rows are ordered from worst to best by best validation CE (lower is better).

| Order | Checkpoint | Decoder stack | LM head | Epochs | Params | Best val CE | Final val CE | Best val PPL | Wall time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (worst) | imdb_baseline_vanilla_20m | Vanilla (block_attn_residuals off) | Single, imdb_conditioning_style=natural | 20 | 18,062,400 | 1.727 | 1.727 | 5.62 | ≈1.2 h |
| 2 | imdb_baseline_natural_20m | Inter-block | Single, natural | 20 | 18,074,688 | 1.717 | 1.717 | 5.57 | ≈2.2 h |
| 3 (best) | counterfactual_repeat_20m | Inter-block | TARNet two heads, tarnet_head_separation_weight=0.02 | 40 | 20,040,768 | 1.650 | 1.664 | 5.21 | ≈4.8 h |

What was added each step, and what improved

  1. Baseline: vanilla decoder + natural instructions, single head. Best val CE ≈1.73.
  2. Inter-block residuals (--block-attn-residuals). Adds ≈12k parameters (and roughly doubles wall time for 20 epochs: ≈1.2 h → ≈2.2 h) and improves best val CE to ≈1.717.
  3. TARNet + longer training + head separation. Adds two sentiment heads, JS separation loss (0.02), and longer training (40 epochs), reaching best val CE ≈1.65.

Note: objectives differ across configurations (single-head vs TARNet dual-head), so comparisons reflect both architecture and loss changes.


Test results: TARNet-style ≈20M-parameter run

docker compose run --rm train python scripts/train.py --epochs 40 --batch-size 16 --d-model 512 --num-heads 8 --num-layers 6 --d-ff 1888 --seq-len 256 --dropout 0.1 --tokenizer-type hf_bpe_byte --bpe-vocab-size 256 --tarnet-two-heads --tarnet-head-n-fc 2 --position-encoding rope --block-attn-residuals --macro-block-size 2 --max-block-representations 9 --checkpoint-dir checkpoints/counterfactual_repeat_20m --tarnet-head-separation-weight 0.02
docker compose run --rm -it chat --checkpoint checkpoints/counterfactual_repeat_20m/best.pt --max-tokens 340 --temperature 0.7 --method top_p --top-p 0.5 --counterfactual --repetition-penalty 1.5


Generate [+/-/b/q] (default b):
[Y0]
ember the same way, and it was all downhill from there. The pace is terrible, but nothing exciting happens. So what's with that? And why is this movie such a confused mess of old film? Too many story lines involve between those guards where you see it only to be fascistic (I'm not asking for more) to disgust over the center of this movie and think that you had good times past her and I was really let down but it seldom became more informative.

[Y1]
What a fun movie! The casting of the two leads is great. She portrays the dancer who has been married to her old but she would not have [redacted] with her. And, in this case, sometimes it's about as good as everyone elses. This film was once again just before you'll stick with it and remind my opinion of that for you.

Generate [+/-/b/q] (default b):

[Y0]
For a film that has no character development, and it is pretty bad. The story is told in sometimes you'll find out what the heck was good about this movie. And just like when you throw monster moving on with the title of this movie and see something exceptional before that everyone did not give it another message. THIS IS SOMELY AWFUL!

[Y1]
raphics, so there is not a lot of contention in this movie. The story tells a good man who has been put together by describing the way he is in it and finally gets kidnapped but only to find out that his woman is now being tumbled. That's what I've seen for years! And he's excellent as the damaged counterfeiter; and Similarly she poses as an elderly man with a real job, you've seen it. The film moves with something more than just playing on.

Generate [+/-/b/q] (default b):

[Y0]

ts, and then all of a sudden he decides to go back to live with his wife. There's nothing more than that, in fact it is just plain stupid and trite. And the ending was cheesy and overly similar for this movie. That's why I don't see how anybody mentioned that this film is goofball-made in its [redacted] way. Sometimes you can't believe the reasons for examining what literally made this movie sometimes becomes out of nowhere; once you see SOYLENT NIGHT!

[Y1]

Perhaps the most interesting thing about this movie is that it has no sense of disbelief. The story is good, but not as funny as you can see, and it's excellent! That's what I liked.

Generate [+/-/b/q] (default b):

[Y0]

OK, there are some funny moments in this movie. The characters were poorly drawn and extras were not too bad. And the story was terrible, I thought it was good for a low budget film. This is only because of it's just that it has non-plot tension and plot lines that severely cut together without an overall damning flatulence of any sort which is really abounding! Scarface was stilted, as in the old hospital cosmic mansion on his head with similar powers; he did NOT give kid all of what he examined that?? Affore comes to be mental into light-hearking but this film's finest storm is relevance, and yet you'll be modish and I've seen only for good old pick-basic election, when I th

[Y1]

One of the most incredible films ever made. This is a movie that worked beautifully, and it was considering the programmes on this site, but I found it to be a lot more open-minded than you have seen for example. The story is dark and nicely detailed with good suspension of disbelief and quality.

Generate [+/-/b/q] (default b):

[Y0]
What a waste of time. To be honest, this is only the worst movie ever made! The premises in this movie seem to sound like it was dubbed or not. Sometimes you'll get by the casting of Cagney and Adams, which may be more than that.

[Y1]
Have you ever seen this movie? This is a superb film. The casting of the two leads, especially Sammo Hung as the woman whose father was born in Afghanistan and he did not get himself into being a parrot. I think that it's madly because it's strength and sentimentality is more out of place, as with most old movies they'll live without exceptional conversation.

Generate [+/-/b/q] (default b):

[Y0]
When I saw this movie, it was full of flaws. The acting is bad and the performances are wooden and the story line does not match up to me. And while that's becoming more than half of the film is obviously terrible, you've gotten a little bit into it as well as once in only times you see how shallow he is!

[Y1]
One of the best movies ever made. The story is about tragically changing his ways and doing something like this, as he prefers to become a genuinely fascinated (and industry) dreamer. And it's not so bad, it's good for those who hated that movie with young men whose family live on out and when the concept was before you seen the film in 1973! Timothy Spall was excellent as Spanjers, he played Amitabh Bachchan is good to watch; now I've seldom seen him more than this once? This movie is actual movie for you!
