A tiny autoregressive transformer (about 2.9M parameters) that generates 32x32 palette-indexed pixel art sprites, built end-to-end by AI tooling as an experiment in agent-driven research code.
The aim was not to chase state-of-the-art quality. The aim was to see whether a model small enough to run on a regular gamer's machine could generate pixel art sprites on demand, conditioned on simple categorical prompts, well enough that a game could pull from it at runtime.
The model itself is the artefact, but the interesting question sits one level up: how much of the research loop can a coding agent run on its own? Data sourcing, palette quantisation, training scaffolding, sampling, breeding, post-process shading: every layer of this repo was generated, iterated, and debugged through agent sessions, with me steering rather than writing.
The constraint that shaped every decision was inference budget. A diffusion model would have been the easy choice for visual quality, but I wanted something that could plausibly run inside a game on commodity hardware, so the design landed on a small autoregressive transformer with KV-cache inference over a fixed 64-colour palette.
The results are sub-par. I am sharing them because the failure modes are themselves interesting and because the agentic build process taught me something at every step.
The strongest output comes from the post-process shading layer. The model produces flat palette-indexed sprites and a separate procedural shader applies directional light, ambient occlusion, and edge darkening using only colours already in the palette:
Across the categories the model converged on (ReefFish, Grazer, Coral, Jellyfish), the zone × category grid shows the kind of coherent variety I was hoping for:
The temperature sweep is where the limits show. At low temperatures the model collapses to empty sprites, at moderate temperatures it produces recognisable shapes, and beyond about 1.0 it tips into static:
Two of the six categories (Cephalopod and one Abyssal column) never converged at all in the final checkpoint, producing pure noise regardless of sampling settings. I iterated on the training data several times: synthetic only, palette-quantised images sourced from Wikimedia Commons, sprite-sheet rips, mixed corpora. None of them got those categories to settle.
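For reference, the temperature sweep above is just softmax temperature scaling at each decode step; a minimal sketch of the mechanism, with the function name mine rather than the repo's:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """Scale logits before softmax. Well below 1.0 the distribution collapses
    toward the argmax (empty sprites here); well above 1.0 it flattens into noise."""
    if temperature <= 0:
        return logits.argmax(dim=-1)                     # treat 0 as greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)  # (B, vocab)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```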
- About 2.9M-parameter transformer decoder: 6 layers, 192-dim, 6 heads
- 80-token vocabulary: 64 palette indices plus PAD, BOS, EOS, MASK, 6 zone tokens, and 6 category tokens
- 2D sinusoidal positional encoding so the model can reason about row and column structure (sketched after this list)
- KV-cache inference, so generating a sprite costs one incremental forward pass per pixel rather than re-running the whole prefix (decoding sketch below)
- Sprite breeding through partial completion: take parent A, mask 40% of pixels, seed with parent B's colours, and let the model fill the gaps (breeding sketch below)
- Post-process shading that stays inside the 64-colour palette by mapping each colour to a luminance ramp (shading sketch below)
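A few of those bullets are easier to see in code, so here are minimal sketches; names and shapes are mine, not the repo's actual API. First, the 2D sinusoidal encoding, assuming a 192-dim model over a 32×32 grid:

```python
import math
import torch

def sincos_2d(h: int, w: int, dim: int) -> torch.Tensor:
    """Half the channels encode the row, half the column, so attention
    can separate vertical from horizontal structure. Returns (h*w, dim)."""
    assert dim % 4 == 0
    d = dim // 2

    def axis_encoding(n: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32)[:, None]             # (n, 1)
        freqs = torch.exp(-math.log(10000.0) * torch.arange(0, d, 2) / d)
        angles = pos * freqs                                            # (n, d/2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)          # (n, d)

    rows = axis_encoding(h)[:, None, :].expand(h, w, d)  # same row vector across a row
    cols = axis_encoding(w)[None, :, :].expand(h, w, d)  # same col vector down a column
    return torch.cat([rows, cols], dim=-1).reshape(h * w, dim)

pe = sincos_2d(32, 32, 192)  # one 192-dim vector per pixel position
```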
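Next, what the KV cache buys: each decode step projects only the newest token and appends its key/value to a running cache, so the prefix is never recomputed. This is a single-layer sketch under assumed tensor shapes, not the repo's module layout:

```python
import torch

def cached_attention_step(x_t, w_qkv, w_out, cache, n_heads):
    """One decode step of one attention layer.
    x_t: (B, 1, D) newest token embedding; cache holds the prefix K/V."""
    B, _, D = x_t.shape
    dh = D // n_heads
    q, k, v = (x_t @ w_qkv).split(D, dim=-1)               # w_qkv: (D, 3D)
    q, k, v = (t.view(B, 1, n_heads, dh).transpose(1, 2) for t in (q, k, v))
    cache["k"] = torch.cat([cache["k"], k], dim=2)         # append, never recompute
    cache["v"] = torch.cat([cache["v"], v], dim=2)
    attn = torch.softmax(q @ cache["k"].transpose(-2, -1) / dh**0.5, dim=-1)
    out = (attn @ cache["v"]).transpose(1, 2).reshape(B, 1, D)
    return out @ w_out                                     # w_out: (D, D)

# An empty cache for batch 1, 6 heads of 32 dims each (192 / 6):
cache = {"k": torch.empty(1, 6, 0, 32), "v": torch.empty(1, 6, 0, 32)}
```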
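Breeding is mostly bookkeeping around the model's MASK-completion ability. Here `fill_masked` stands in for whatever the checkpoint's completion entry point actually is, and the MASK id and seeded fraction are guesses for illustration:

```python
import numpy as np

MASK = 67  # placeholder id; the real MASK token id lives in the repo's vocab

def breed(parent_a, parent_b, fill_masked, mask_frac=0.40, seed_frac=0.5, rng=None):
    """parent_a/b: (32, 32) int arrays of palette indices.
    fill_masked: assumed model wrapper that completes MASK positions."""
    rng = rng or np.random.default_rng()
    child = parent_a.copy()
    gaps = rng.random(child.shape) < mask_frac             # ~40% of pixels become gaps
    child[gaps] = MASK
    seeded = gaps & (rng.random(child.shape) < seed_frac)  # seed some gaps from parent B
    child[seeded] = parent_b[seeded]
    return fill_masked(child)                              # model fills the rest
```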
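Finally, the shading pass. The real version layers directional light, ambient occlusion, and edge darkening; the sketch below shows only the core trick, a per-colour dark-to-light ramp of other palette indices, so every shaded pixel is still one of the 64 colours. The `ramps` table and the transparent-index convention are assumptions:

```python
import numpy as np

def shade(sprite, ramps, light=(-1, -1)):
    """sprite: (32, 32) palette indices, with 0 assumed transparent.
    ramps[i]: dark-to-light list of palette indices for colour i."""
    h, w = sprite.shape
    out = sprite.copy()
    solid = sprite > 0
    for y in range(h):
        for x in range(w):
            if not solid[y, x]:
                continue
            ramp = ramps[sprite[y, x]]
            step = len(ramp) // 2                          # start at base luminance
            # Directional light: brighten pixels exposed on the lit side.
            ly, lx = y + light[1], x + light[0]
            if not (0 <= ly < h and 0 <= lx < w and solid[ly, lx]):
                step += 1
            # Edge darkening: darken pixels with nothing underneath.
            if not (y + 1 < h and solid[y + 1, x]):
                step -= 1
            out[y, x] = ramp[max(0, min(len(ramp) - 1, step))]
    return out
```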
Two motivations.
The first was practical. I was sketching out small games that could benefit from runtime sprite generation, and the existing options (cloud diffusion, large local models, hand-drawn art) all had problems at indie scope. I wanted to see whether the bottom end of the model-size scale could give something usable.
The second was methodological. Most of my recent side projects involve agents writing game code, where the success signal is "does it run, does it look right." Training a model is a different kind of agentic task: success signals are noisier, the loop is much slower, and visual verification only kicks in at the end. I wanted to learn how that changes the way I work with agents.
```bash
pip install -r requirements.txt

# Smoke test on synthetic data, CPU only
python -m pixelllm.train --synthetic --epochs 5 --device cpu

# Real training run, GPU
python -m pixelllm.train --data-dir data/processed --epochs 50 --fp16 --device cuda

# Sample from a checkpoint
python -m pixelllm.generate
```

Best run reached val_acc 0.9837, val_loss 0.0476 with batch_size 32, grad_accum 2, fp16 on a single RTX 4080 SUPER over about 30 hours. See docs/STATUS.md for the run log and docs/ARCHITECTURE.md for design details.
- PyTorch 2.0+ for model and training
- NumPy and Pillow for data pipeline
- Wikimedia Commons API for one of the source corpora
- A handcrafted procedural sprite generator for synthetic data
A few honest observations:
- Agents are much better at scaffolding training infrastructure than at choosing training data. Several iterations of "the loss is going down but the samples look bad" came down to the data corpus, not the model.
- Visual verification needs to be wired into the loop early. Without periodic sample dumps you cannot tell whether the model is genuinely learning structure or just memorising a few prototypes.
- A small model leaves nowhere to hide. With 2.9M parameters the model fails honestly: weak categories produce noise rather than confident-looking nonsense.
- Knowing when to stop is part of the work. After several training-data iterations I judged that the agentic loop had hit a ceiling for this architecture, and that pushing further would mean changing the model or the problem rather than the data.
Frozen. The repo captures where the experiment landed: a working pipeline, a trained checkpoint, and outputs that show both what the approach can do and where it falls short. I am sharing it as-is rather than polishing it, because the value here is in the trail, not the destination.
If I came back to it I would try a hybrid: keep the autoregressive structure for the palette discipline, but add a small diffusion head for a coarse shape pass, and condition on a richer prompt vocabulary. The current zone-and-category conditioning is too narrow a bottleneck for better data to lift the weak categories.
This repository was extracted from a private monorepo where I work on many side projects together. The single initial commit reflects the migration, not the development cadence; the work itself unfolded across many sessions.
MIT