A tiny autoregressive transformer (about 2.9M parameters) that generates 32x32 palette-indexed pixel art sprites, built end-to-end by AI tooling as an experiment in agent-driven research code.
The aim was not to chase state-of-the-art quality. The aim was to see whether a model small enough to run on a regular gamer's machine could generate pixel art sprites on demand, conditioned on simple categorical prompts, well enough that a game could pull from it at runtime.
The model itself is the artefact, but the interesting question sits one level up: how much of the research loop can a coding agent run on its own? Data sourcing, palette quantisation, training scaffolding, sampling, breeding, post-process shading: every layer of this repo was generated, iterated, and debugged through agent sessions, with me steering rather than writing.
The constraint that shaped every decision was inference budget. A diffusion model would have been the easy choice for visual quality, but I wanted something that could plausibly run inside a game on commodity hardware, so the design landed on a small autoregressive transformer with KV-cache inference over a fixed 64-colour palette.
The results are sub-par. I am sharing them because the failure modes are themselves interesting and because the agentic build process taught me something at every step.
The strongest output comes from the post-process shading layer. The model produces flat palette-indexed sprites and a separate procedural shader applies directional light, ambient occlusion, and edge darkening using only colours already in the palette:
Across the categories the model converged on (ReefFish, Grazer, Coral, Jellyfish), the zone × category grid shows the kind of coherent variety I was hoping for:
The temperature sweep is where the limits show. At low temperatures the model collapses to empty sprites, at moderate temperatures it produces recognisable shapes, and beyond about 1.0 it tips into static:
Two of the six categories (Cephalopod and one Abyssal column) never converged at all in the final checkpoint, producing pure noise regardless of sampling settings. I iterated on the training data several times: synthetic only, palette-quantised images sourced from Wikimedia Commons, sprite-sheet rips, mixed corpora. None of them got those categories to settle.
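For reference, the temperature sweep above is just softmax temperature scaling at each decode step; a minimal sketch of the mechanism, with the function name mine rather than the repo's:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """Scale logits before softmax. Well below 1.0 the distribution collapses
    toward the argmax (empty sprites here); well above 1.0 it flattens into noise."""
    if temperature <= 0:
        return logits.argmax(dim=-1)                     # treat 0 as greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)  # (B, vocab)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```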
- About 2.9M-parameter transformer decoder: 6 layers, 192-dim, 6 heads
- 80-token vocabulary: 64 palette indices plus PAD, BOS, EOS, MASK, 6 zone tokens, and 6 category tokens
- 2D sinusoidal positional encoding so the model can reason about row and column structure (sketched after this list)
- KV-cache inference, so generating a sprite costs one incremental forward pass per pixel rather than re-running the whole prefix (decoding sketch below)
- Sprite breeding through partial completion: take parent A, mask 40% of pixels, seed with parent B's colours, and let the model fill the gaps (breeding sketch below)
- Post-process shading that stays inside the 64-colour palette by mapping each colour to a luminance ramp (shading sketch below)
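A few of those bullets are easier to see in code, so here are minimal sketches; names and shapes are mine, not the repo's actual API. First, the 2D sinusoidal encoding, assuming a 192-dim model over a 32×32 grid:

```python
import math
import torch

def sincos_2d(h: int, w: int, dim: int) -> torch.Tensor:
    """Half the channels encode the row, half the column, so attention
    can separate vertical from horizontal structure. Returns (h*w, dim)."""
    assert dim % 4 == 0
    d = dim // 2

    def axis_encoding(n: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32)[:, None]             # (n, 1)
        freqs = torch.exp(-math.log(10000.0) * torch.arange(0, d, 2) / d)
        angles = pos * freqs                                            # (n, d/2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)          # (n, d)

    rows = axis_encoding(h)[:, None, :].expand(h, w, d)  # same row vector across a row
    cols = axis_encoding(w)[None, :, :].expand(h, w, d)  # same col vector down a column
    return torch.cat([rows, cols], dim=-1).reshape(h * w, dim)

pe = sincos_2d(32, 32, 192)  # one 192-dim vector per pixel position
```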
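Next, what the KV cache buys: each decode step projects only the newest token and appends its key/value to a running cache, so the prefix is never recomputed. This is a single-layer sketch under assumed tensor shapes, not the repo's module layout:

```python
import torch

def cached_attention_step(x_t, w_qkv, w_out, cache, n_heads):
    """One decode step of one attention layer.
    x_t: (B, 1, D) newest token embedding; cache holds the prefix K/V."""
    B, _, D = x_t.shape
    dh = D // n_heads
    q, k, v = (x_t @ w_qkv).split(D, dim=-1)               # w_qkv: (D, 3D)
    q, k, v = (t.view(B, 1, n_heads, dh).transpose(1, 2) for t in (q, k, v))
    cache["k"] = torch.cat([cache["k"], k], dim=2)         # append, never recompute
    cache["v"] = torch.cat([cache["v"], v], dim=2)
    attn = torch.softmax(q @ cache["k"].transpose(-2, -1) / dh**0.5, dim=-1)
    out = (attn @ cache["v"]).transpose(1, 2).reshape(B, 1, D)
    return out @ w_out                                     # w_out: (D, D)

# An empty cache for batch 1, 6 heads of 32 dims each (192 / 6):
cache = {"k": torch.empty(1, 6, 0, 32), "v": torch.empty(1, 6, 0, 32)}
```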
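Breeding is mostly bookkeeping around the model's MASK-completion ability. Here `fill_masked` stands in for whatever the checkpoint's completion entry point actually is, and the MASK id and seeded fraction are guesses for illustration:

```python
import numpy as np

MASK = 67  # placeholder id; the real MASK token id lives in the repo's vocab

def breed(parent_a, parent_b, fill_masked, mask_frac=0.40, seed_frac=0.5, rng=None):
    """parent_a/b: (32, 32) int arrays of palette indices.
    fill_masked: assumed model wrapper that completes MASK positions."""
    rng = rng or np.random.default_rng()
    child = parent_a.copy()
    gaps = rng.random(child.shape) < mask_frac             # ~40% of pixels become gaps
    child[gaps] = MASK
    seeded = gaps & (rng.random(child.shape) < seed_frac)  # seed some gaps from parent B
    child[seeded] = parent_b[seeded]
    return fill_masked(child)                              # model fills the rest
```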
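Finally, the shading pass. The real version layers directional light, ambient occlusion, and edge darkening; the sketch below shows only the core trick, a per-colour dark-to-light ramp of other palette indices, so every shaded pixel is still one of the 64 colours. The `ramps` table and the transparent-index convention are assumptions:

```python
import numpy as np

def shade(sprite, ramps, light=(-1, -1)):
    """sprite: (32, 32) palette indices, with 0 assumed transparent.
    ramps[i]: dark-to-light list of palette indices for colour i."""
    h, w = sprite.shape
    out = sprite.copy()
    solid = sprite > 0
    for y in range(h):
        for x in range(w):
            if not solid[y, x]:
                continue
            ramp = ramps[sprite[y, x]]
            step = len(ramp) // 2                          # start at base luminance
            # Directional light: brighten pixels exposed on the lit side.
            ly, lx = y + light[1], x + light[0]
            if not (0 <= ly < h and 0 <= lx < w and solid[ly, lx]):
                step += 1
            # Edge darkening: darken pixels with nothing underneath.
            if not (y + 1 < h and solid[y + 1, x]):
                step -= 1
            out[y, x] = ramp[max(0, min(len(ramp) - 1, step))]
    return out
```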
Two motivations.
The first was practical. I was sketching out small games that could benefit from runtime sprite generation, and the existing options (cloud diffusion, large local models, hand-drawn art) all had problems at indie scope. I wanted to see whether the bottom end of the model-size scale could give something usable.
The second was methodological. Most of my recent side projects involve agents writing game code, where the success signal is "does it run, does it look right." Training a model is a different kind of agentic task: success signals are noisier, the loop is much slower, and visual verification only kicks in at the end. I wanted to learn how that changes the way I work with agents.
```bash
pip install -r requirements.txt

# Smoke test on synthetic data, CPU only
python -m pixelllm.train --synthetic --epochs 5 --device cpu

# Real training run, GPU
python -m pixelllm.train --data-dir data/processed --epochs 50 --fp16 --device cuda

# Sample from a checkpoint
python -m pixelllm.generate
```

Best run reached val_acc 0.9837, val_loss 0.0476 with batch_size 32, grad_accum 2, fp16 on a single RTX 4080 SUPER over about 30 hours. See docs/STATUS.md for the run log and docs/ARCHITECTURE.md for design details.
- PyTorch 2.0+ for model and training
- NumPy and Pillow for data pipeline
- Wikimedia Commons API for one of the source corpora
- A handcrafted procedural sprite generator for synthetic data
A few honest observations:
- Agents are much better at scaffolding training infrastructure than at choosing training data. Several iterations of "the loss is going down but the samples look bad" came down to the data corpus, not the model.
- Visual verification needs to be wired into the loop early. Without periodic sample dumps you cannot tell whether the model is genuinely learning structure or just memorising a few prototypes.
- A small model leaves nowhere to hide. With 2.9M parameters the model fails honestly: weak categories produce noise rather than confident-looking nonsense.
- Knowing when to stop is part of the work. After several training-data iterations I judged that the agentic loop had hit a ceiling for this architecture, and that pushing further would mean changing the model or the problem rather than the data.
Frozen. The repo captures where the experiment landed: a working pipeline, a trained checkpoint, and outputs that show both what the approach can do and where it falls short. I am sharing it as-is rather than polishing it, because the value here is in the trail, not the destination.
If I came back to it I would try a hybrid: keep the autoregressive structure for the palette discipline, but add a small diffusion head for a coarse shape pass, and condition on a richer prompt vocabulary. The current zone-and-category conditioning is too narrow a bottleneck for better data to lift the weak categories.
This repository was extracted from a private monorepo where I work on many side projects together. The single initial commit reflects the migration, not the development cadence; the work itself unfolded across many sessions.
MIT