TinyGPU is a tiny educational GPU simulator - inspired by Tiny8, designed to demonstrate how GPUs execute code in parallel. It models a small SIMT (Single Instruction, Multiple Threads) system with per-thread registers, global memory, synchronization barriers, branching, and a minimal GPU-like instruction set.
📚 Built for learning and visualization - see how threads, registers, and memory interact across cycles!
| Odd-Even Sort | Reduction |
|---|---|
| *(demo GIF - generated under `src/outputs/run_odd_even_sort/`)* | *(demo GIF - generated under `src/outputs/run_reduce_sum/`)* |
- Enhanced Instruction Set:
  - Added `SHLD` and `SHST` for robust shared memory operations.
  - Improved `SYNC` semantics for better thread coordination.
- Visualizer Improvements:
  - Export execution as GIFs with enhanced clarity.
  - Added support for saving visuals directly from the simulator.
- Refactored Core:
  - Simplified step semantics for better extensibility.
  - Optimized performance for larger thread counts.
- CI/CD Updates:
  - Integrated linting (`ruff`, `black`) and testing workflows.
  - Automated builds and tests on GitHub Actions.
- Documentation:
  - Expanded examples and added detailed usage instructions.
TinyGPU was built as a learning-first GPU simulator - simple enough for beginners, but powerful enough to illustrate real GPU execution concepts like threads, synchronization, and divergence.
- ⚡ Lightweight: Zero heavy dependencies - runs entirely in Python with clean, readable code.
- 🧩 Educational: Demonstrates GPU fundamentals: thread execution, shared memory, branching, and synchronization barriers.
- 🚀 Fast feedback loop: Write a `.tgpu` file, run it, and instantly visualize how your threads interact.
- 🔧 Extensible: Add your own instructions, modify architecture parameters, or visualize new metrics.
- 🎨 Visual clarity: Each program can be rendered as a heatmap or exported as a GIF - perfect for teaching and demos.
- 🧑‍🎓 Students - learn GPU basics through hands-on examples.
- 👩‍🏫 Educators - build step-by-step visual lessons about parallel computing.
- 🧑‍🔬 Researchers & hobbyists - prototype kernel logic or explore synchronization models.
- 💡 Developers - experiment with SIMD-style concepts without real GPU dependencies.
- Install & run locally - see the Quickstart section below.
- Explore the examples - try the `vector_add`, `odd_even_sort`, and `reduce_sum` kernels.
- Visualize execution - watch registers, memory, and threads evolve in real time.
- Experiment! - tweak instruction behavior or add your own to understand how GPUs schedule and synchronize work.
🧭 TinyGPU aims to make GPU learning intuitive, visual, and interactive - from classroom demos to self-guided exploration.
- 🧩 GPU-like instruction set: `SET`, `ADD`, `MUL`, `LD`, `ST`, `JMP`, `BNE`, `BEQ`, `SYNC`, `CSWAP`, `SHLD`, `SHST`.
- 🧠 Per-thread registers & PCs - each thread executes the same kernel independently.
- 🧱 Shared global memory for inter-thread operations.
- 🔁 Synchronization barriers (`SYNC`) for parallel coordination.
- 🎨 Visualizer & GIF exporter - view execution as heatmaps or export to animated GIFs.
- 🧮 Example kernels included:
  - Vector addition
  - Odd-even sort (GPU-style bubble sort)
  - Parallel reduction (sum of array)
  - Loop test & synchronization demo
Located in `src/outputs/` - run the example scripts to generate these GIFs (they're saved under `src/outputs/<script_name>/`).
```bash
git clone https://github.com/deaneeth/tinygpu.git
cd tinygpu
pip install -e .
```

Install the dev dependencies (for tests and linting):

```bash
pip install -r requirements-dev.txt
```

Run the odd-even sort example:

```bash
python -m examples.run_odd_even_sort
```

Produces `src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif` - a visual GPU-style sorting process.
Other examples:

```bash
python -m examples.run_vector_add
python -m examples.run_reduce_sum
python -m examples.run_test_loop
python -m examples.run_sync_test
```
```
.
├── .github/
│   └── workflows/
│       └── ci.yml
├── docs/
│   └── index.md
├── examples/
│   ├── odd_even_sort_tmp.tgpu
│   ├── odd_even_sort.tgpu
│   ├── reduce_sum.tgpu
│   ├── run_odd_even_sort.py
│   ├── run_reduce_sum.py
│   ├── run_sync_test.py
│   ├── run_test_loop.py
│   ├── run_vector_add.py
│   ├── sync_test.tgpu
│   ├── test_loop.tgpu
│   └── vector_add.tgpu
├── src/outputs/
│   ├── run_block_shared_sum/
│   ├── run_odd_even_sort/
│   ├── run_reduce_sum/
│   ├── run_sync_test/
│   ├── run_test_cmp/
│   ├── run_test_kernel_args/
│   ├── run_test_loop/
│   └── run_vector_add/
├── src/
│   └── tinygpu/
│       ├── __init__.py
│       ├── assembler.py
│       ├── gpu.py
│       ├── instructions.py
│       └── visualizer.py
├── tests/
│   ├── test_assembler.py
│   ├── test_gpu_core.py
│   ├── test_gpu.py
│   └── test_programs.py
├── LICENSE
├── pyproject.toml
├── README.md
└── requirements-dev.txt
```
TinyGPU uses a minimal instruction set designed for clarity and education - simple enough for beginners, but expressive enough to build realistic GPU-style kernels.
| Instruction | Operands / Effect | Description |
|---|---|---|
| `SET Rd, imm` | `Rd = imm` | Set register `Rd` to an immediate constant. |
| `ADD Rd, Ra, Rb` | `Rd = Ra + Rb` | Add two registers and store the result in `Rd`. |
| `ADD Rd, Ra, imm` | `Rd = Ra + imm` | Add a register and an immediate value. |
| `MUL Rd, Ra, Rb` | `Rd = Ra * Rb` | Multiply two registers. |
| `MUL Rd, Ra, imm` | `Rd = Ra * imm` | Multiply a register by an immediate. |
| `LD Rd, addr` | `Rd = mem[addr]` | Load from a memory address into a register. |
| `LD Rd, Rk` | `Rd = mem[Rk]` | Load from the address held in register `Rk`. |
| `ST addr, Rs` | `mem[addr] = Rs` | Store a register into a memory address. |
| `ST Rk, Rs` | `mem[Rk] = Rs` | Store the value of `Rs` at the address held in `Rk`. |
| `JMP target` | label or immediate | Unconditional jump; sets the PC to `target`. |
| `BEQ Ra, Rb, target` | branch if equal | Jump to `target` if `Ra == Rb`. |
| `BNE Ra, Rb, target` | branch if not equal | Jump to `target` if `Ra != Rb`. |
| `SYNC` | (no operands) | Synchronization barrier; all threads must reach this point before continuing. |
| `CSWAP addrA, addrB` | compare-and-swap | If `mem[addrA] > mem[addrB]`, swap them. Used for sorting. |
| `SHLD addr, Rs` | `Rs = shared_mem[addr]` | Load from shared memory into a register. |
| `SHST addr, Rs` | `shared_mem[addr] = Rs` | Store a register into shared memory. |
| `CMP Rd, Ra, Rb` (optional) | compare and set a flag/register | Used internally for extended examples (e.g., prefix-scan). |
| `NOP` (optional) | (no operands) | No operation; placeholder instruction. |
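To make the table concrete, here is a minimal vector-add sketch in TinyGPU assembly. The memory layout (array `A` at addresses 0-7, `B` at 8-15, output `C` at 16-23) and the array length of 8 are illustrative assumptions, not fixed by the simulator; only the instructions themselves come from the table above.

```asm
; hypothetical layout: A at mem[0..7], B at mem[8..15], C at mem[16..23]
; R7 holds the thread ID (tid), so each thread handles one element
ADD R0, R7, 0     ; R0 = address of A[tid]
ADD R1, R7, 8     ; R1 = address of B[tid]
ADD R2, R7, 16    ; R2 = address of C[tid]
LD  R3, R0        ; R3 = A[tid]
LD  R4, R1        ; R4 = B[tid]
ADD R5, R3, R4    ; R5 = A[tid] + B[tid]
ST  R2, R5        ; C[tid] = R5
done:
JMP done          ; spin in place so the kernel ends in a stable state
```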
- Registers:
  - Each thread has its own register file (`R0`-`R7`, or more depending on configuration).
  - `R7` is automatically initialized to the thread ID (`tid`).
- Memory:
  - Global memory is shared between all threads.
  - Addresses can be immediate (`16`) or register-based (`R2`).
- Labels & Comments:
  - Use labels like `loop:` for jumps.
  - Comments start with `;`.
- Synchronization:
  - Use `SYNC` when all threads must finish a phase before the next begins (e.g., sorting or reduction).
- Execution:
  - Each instruction executes per-thread.
  - Threads can branch independently (SIMT divergence is handled via per-thread PCs).
Example kernel - odd-even transposition sort, alternating even and odd phases with a `SYNC` barrier between them:

```asm
SET R0, 0            ; phase_counter
SET R1, 8            ; num_phases == N (set to array length here, adjust in runner)
SET R3, 0            ; parity (0 = even phase, 1 = odd phase)

loop_phase:
    ; compute base index = tid * 2
    MUL R4, R7, 2        ; R4 = tid * 2
    ADD R5, R4, R3       ; R5 = index = tid*2 + parity
    ADD R6, R5, 1        ; R6 = index + 1
    CSWAP R5, R6         ; compare & swap memory[index], memory[index+1]
    SYNC                 ; synchronize threads across the phase
    ADD R3, R3, 1        ; parity = parity + 1
    BNE R3, 2, noreset   ; reset parity to 0 after an odd phase
    SET R3, 0
noreset:
    ADD R0, R0, 1        ; next phase
    BNE R0, R1, loop_phase
done:
    JMP done             ; finished - spin in place
```
- Write your `.tgpu` assembly file in `examples/`.
- Assemble and run it in Python:

  ```python
  from src.tinygpu.assembler import assemble_file
  from src.tinygpu.gpu import TinyGPU

  prog, labels = assemble_file("examples/vector_add.tgpu")
  gpu = TinyGPU(num_threads=8, num_registers=8, mem_size=64)
  gpu.load_program(prog, labels)
  gpu.run(max_cycles=100)
  ```

- Visualize the run:

  ```python
  from src.tinygpu.visualizer import visualize

  visualize(gpu, show_pc=True)
  ```

- Export as GIF:

  ```python
  from src.tinygpu.visualizer import save_animation

  save_animation(gpu, out_path="examples/my_run.gif", fps=10, max_frames=200)
  ```
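Putting the steps together, a complete run script might look like the sketch below. It uses only the calls shown above; the thread count, memory size, and output path are arbitrary choices for the example, and seeding global memory with input data (as the bundled runner scripts do) is omitted here.

```python
# end_to_end.py - assemble, run, visualize, and export one kernel
from src.tinygpu.assembler import assemble_file
from src.tinygpu.gpu import TinyGPU
from src.tinygpu.visualizer import save_animation, visualize

# Assemble the kernel into a program plus its label table
prog, labels = assemble_file("examples/vector_add.tgpu")

# One register file per thread; R7 is preloaded with the thread ID
gpu = TinyGPU(num_threads=8, num_registers=8, mem_size=64)
gpu.load_program(prog, labels)
gpu.run(max_cycles=100)

# Inspect the run interactively, then save it as an animated GIF
# (output path is an arbitrary choice for this sketch)
visualize(gpu, show_pc=True)
save_animation(gpu, out_path="src/outputs/vector_add_demo.gif", fps=10, max_frames=200)
```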
Run tests:

```bash
pytest
```

Run linters:

```bash
ruff .
black --check src/ tests/
```

CI builds and runs the tests automatically on pushes and pull requests.
MIT - see LICENSE
❤️ Built by Deaneeth
Inspired by the educational design of Tiny8 CPU Simulator.
TinyGPU extends these ideas into the world of parallel GPU computing, emphasizing clarity, simplicity, and visualization for all learners.







