TinyGPU 🐉⚡

Badges: PyPI version · Python 3.13 · License: MIT · CI · Code style: Black · Tests

TinyGPU is a tiny educational GPU simulator - inspired by Tiny8, designed to demonstrate how GPUs execute code in parallel. It models a small SIMT (Single Instruction, Multiple Threads) system with per-thread registers, global memory, synchronization barriers, branching, and a minimal GPU-like instruction set.

🎓 Built for learning and visualization - see how threads, registers, and memory interact across cycles!
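
To make the SIMT model concrete, here is a toy sketch in plain Python (not TinyGPU's actual implementation): one shared instruction stream, but per-thread registers and per-thread program counters, so branches can diverge thread by thread.

# A toy sketch of the SIMT idea (illustration only, not TinyGPU's code):
# one shared instruction stream, per-thread registers, per-thread PC.

NUM_THREADS = 4

program = [
    ("MUL_IMM", "R0", "TID", 2),   # R0 = tid * 2
    ("ST_TID", "R0"),              # mem[tid] = R0
    ("HALT",),
]

memory = [0] * 16
threads = [{"pc": 0, "regs": {"R0": 0, "TID": tid}, "done": False}
           for tid in range(NUM_THREADS)]

while not all(t["done"] for t in threads):
    for t in threads:              # one simulated cycle: every live thread steps once
        if t["done"]:
            continue
        op, *args = program[t["pc"]]
        if op == "MUL_IMM":
            dst, src, imm = args
            t["regs"][dst] = t["regs"][src] * imm
        elif op == "ST_TID":
            memory[t["regs"]["TID"]] = t["regs"][args[0]]
        elif op == "HALT":
            t["done"] = True
            continue
        t["pc"] += 1               # per-thread PC is what allows divergence

print(memory[:NUM_THREADS])        # -> [0, 2, 4, 6]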

(Demo GIFs: odd-even sort and parallel reduction - generated under src/outputs/)

🚀 What's New in v2.0.0

  • Enhanced Instruction Set:
    • Added SHLD and SHST for robust shared memory operations.
    • Improved SYNC semantics for better thread coordination.
  • Visualizer Improvements:
    • Export execution as GIFs with enhanced clarity.
    • Added support for saving visuals directly from the simulator.
  • Refactored Core:
    • Simplified step semantics for better extensibility.
    • Optimized performance for larger thread counts.
  • CI/CD Updates:
    • Integrated linting (ruff, black) and testing workflows.
    • Automated builds and tests on GitHub Actions.
  • Documentation:
    • Expanded examples and added detailed usage instructions.

💡 Why TinyGPU?

TinyGPU was built as a learning-first GPU simulator - simple enough for beginners, but powerful enough to illustrate real GPU execution concepts like threads, synchronization, and divergence.

  • ⚡ Lightweight:
    Zero heavy dependencies - runs entirely in Python with clean, readable code.
  • 🧩 Educational:
    Demonstrates GPU fundamentals: thread execution, shared memory, branching, and synchronization barriers.
  • 🔁 Fast feedback loop:
    Write a .tgpu file, run it, and instantly visualize how your threads interact.
  • 🧠 Extensible:
    Add your own instructions, modify architecture parameters, or visualize new metrics.
  • 🎨 Visual clarity:
    Each program can be rendered as a heatmap or exported as a GIF - perfect for teaching and demos.

👥 Who Should Use It?

  • 🧑‍🎓 Students - learn GPU basics through hands-on examples.
  • 👩‍🏫 Educators - build step-by-step visual lessons about parallel computing.
  • 🧑‍🔬 Researchers & hobbyists - prototype kernel logic or explore synchronization models.
  • 💡 Developers - experiment with SIMD-style concepts without real GPU dependencies.

🚦 Get Started

  1. Install & run locally - see the Quickstart section below.
  2. Explore the examples - try vector_add, odd_even_sort, and reduce_sum kernels.
  3. Visualize execution - watch registers, memory, and threads evolve in real-time.
  4. Experiment! - tweak instruction behavior or add your own to understand how GPUs schedule and synchronize work.

🧭 TinyGPU aims to make GPU learning intuitive, visual, and interactive - from classroom demos to self-guided exploration.


✨ Highlights

  • 🧩 GPU-like instruction set:
    SET, ADD, MUL, LD, ST, JMP, BNE, BEQ, SYNC, CSWAP, SHLD, SHST.
  • 🧠 Per-thread registers & PCs - each thread executes the same kernel independently.
  • 🧱 Shared global memory for inter-thread operations.
  • 🔄 Synchronization barriers (SYNC) for parallel coordination.
  • 🎨 Visualizer & GIF exporter - view execution as heatmaps or export to animated GIFs.
  • 🧮 Example kernels included:
    • Vector addition
    • Odd-even sort (GPU-style bubble sort)
    • Parallel reduction (sum of array)
    • Loop test & synchronization demo

🖼️ Example Visuals

Located in src/outputs/ - run the example scripts to generate these GIFs (they're saved under src/outputs/<script_name>/).

Example | Description
Vector Add | Parallel vector addition (A + B -> C)
Block Shared Sum | Per-block shared memory sum example
Odd-Even Sort | GPU-style odd-even transposition sort
Parallel Reduction | Sum reduction across an array
Sync Test | Synchronization / barrier demonstration
Loop Test | Branching and loop behavior demo
Compare Test | Comparison and branching example
Kernel Args Test | Demonstrates passing kernel arguments

🚀 Quickstart

Clone and install

git clone https://github.com/deaneeth/tinygpu.git
cd tinygpu
pip install -e .
pip install -r requirements-dev.txt

Run an example

python -m examples.run_odd_even_sort

Produces: src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif - an animated view of the GPU-style sorting process.

Other examples

python -m examples.run_vector_add
python -m examples.run_reduce_sum
python -m examples.run_test_loop
python -m examples.run_sync_test

🧩 Project Layout

.
├─ .github/
│  └─ workflows/
│     └─ ci.yml
├─ docs/
│  └─ index.md
├─ examples/
│  ├─ odd_even_sort_tmp.tgpu
│  ├─ odd_even_sort.tgpu
│  ├─ reduce_sum.tgpu
│  ├─ run_odd_even_sort.py
│  ├─ run_reduce_sum.py
│  ├─ run_sync_test.py
│  ├─ run_test_loop.py
│  ├─ run_vector_add.py
│  ├─ sync_test.tgpu
│  ├─ test_loop.tgpu
│  └─ vector_add.tgpu
├─ src/outputs/
│  ├─ run_block_shared_sum/
│  ├─ run_odd_even_sort/
│  ├─ run_reduce_sum/
│  ├─ run_sync_test/
│  ├─ run_test_cmp/
│  ├─ run_test_kernel_args/
│  ├─ run_test_loop/
│  └─ run_vector_add/
├─ src/
│  └─ tinygpu/
│     ├─ __init__.py
│     ├─ assembler.py
│     ├─ gpu.py
│     ├─ instructions.py
│     └─ visualizer.py
├─ tests/
│  ├─ test_assembler.py
│  ├─ test_gpu_core.py
│  ├─ test_gpu.py
│  └─ test_programs.py
├─ LICENSE
├─ pyproject.toml
├─ README.md
└─ requirements-dev.txt

🧠 Assembly Reference (Mini ISA)

TinyGPU uses a minimal instruction set designed for clarity and education - simple enough for beginners, but expressive enough to build realistic GPU-style kernels.

Instruction | Description
SET Rd, imm | Set register Rd to an immediate constant. Rd = imm
ADD Rd, Ra, Rb | Add two registers. Rd = Ra + Rb
ADD Rd, Ra, imm | Add a register and an immediate value. Rd = Ra + imm
MUL Rd, Ra, Rb | Multiply two registers. Rd = Ra * Rb
MUL Rd, Ra, imm | Multiply a register by an immediate. Rd = Ra * imm
LD Rd, addr | Load from a memory address into a register. Rd = mem[addr]
LD Rd, Rk | Load from the address held in register Rk. Rd = mem[Rk]
ST addr, Rs | Store a register into a memory address. mem[addr] = Rs
ST Rk, Rs | Store Rs into memory at the address held in Rk. mem[Rk] = Rs
JMP target | Unconditional jump to a label or immediate - sets PC to target.
BEQ Ra, Rb, target | Branch if equal - jump to target if Ra == Rb.
BNE Ra, Rb, target | Branch if not equal - jump to target if Ra != Rb.
SYNC | Synchronization barrier - all threads must reach this point before continuing.
CSWAP addrA, addrB | Compare-and-swap memory values: if mem[addrA] > mem[addrB], swap them. Used for sorting.
SHLD addr, Rs | Load from shared memory into a register. Rs = shared_mem[addr]
SHST addr, Rs | Store a register into shared memory. shared_mem[addr] = Rs
CMP Rd, Ra, Rb (optional) | Compare and set a flag or register. Used internally for extended examples (e.g., prefix-scan).
NOP (optional) | No operation; placeholder instruction.

💡 Notes & Conventions

  • Registers:
    • Each thread has its own register file (R0–R7 or more depending on configuration).
    • R7 is automatically initialized to the thread ID (tid).
  • Memory:
    • Global memory is shared between all threads.
    • Addresses can be immediate (16) or register-based (R2).
  • Labels & Comments:
    • Use labels like loop: for jumps.
    • Comments start with ;.
  • Synchronization:
    • Use SYNC when threads must finish a phase before the next (e.g., sorting or reduction).
  • Execution:
    • Each instruction executes per-thread.
    • Threads can branch independently (SIMT divergence is handled via per-thread PCs).

🧮 Example: Odd-Even Sort Kernel

SET R0, 0        ; phase_counter
SET R1, 8        ; num_phases == N (set to array length here, adjust in runner)
SET R3, 0        ; parity (0 = even phase, 1 = odd phase)

loop_phase:
    ; compute base index = tid * 2
    MUL R4, R7, 2      ; R4 = tid * 2
    ADD R5, R4, R3     ; R5 = index = tid*2 + parity
    ADD R6, R5, 1      ; R6 = index + 1
    CSWAP R5, R6       ; compare & swap memory[index], memory[index+1]
    SYNC               ; synchronize threads across the phase

    ADD R3, R3, 1      ; parity = parity + 1
    BNE R3, 2, noreset
    SET R3, 0
noreset:
    ADD R0, R0, 1
    BNE R0, R1, loop_phase

done:
    JMP done
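
The v2.0.0 shared-memory instructions pair naturally with SYNC. The kernel below is a hypothetical sketch (not shipped with the repo): it assumes a single block, so shared memory is visible to every thread, and that immediate shared-memory addresses behave like the ST/LD forms in the table above. Save the string to a .tgpu file and run it with the steps in the next section.

# Hypothetical broadcast kernel: thread 0 publishes a value through shared
# memory, everyone waits at the barrier, then all threads read it back.
BROADCAST_KERNEL = """\
SET  R0, 0
SET  R1, 42
BNE  R7, R0, after_store   ; only thread 0 (R7 == 0) performs the store
SHST 0, R1                 ; shared_mem[0] = 42
after_store:
SYNC                       ; barrier: the store is visible before anyone reads
SHLD 0, R2                 ; R2 = shared_mem[0] in every thread
ST   R7, R2                ; mem[tid] = R2
done:
JMP  done                  ; spin until max_cycles
"""

with open("examples/shared_broadcast.tgpu", "w") as f:  # illustrative file name
    f.write(BROADCAST_KERNEL)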

🧩 Writing & Running Programs

  1. Write your .tgpu assembly file in examples/.

  2. Assemble and run it in Python:

    from src.tinygpu.assembler import assemble_file
    from src.tinygpu.gpu import TinyGPU
    
    prog, labels = assemble_file("examples/vector_add.tgpu")
    gpu = TinyGPU(num_threads=8, num_registers=8, mem_size=64)
    gpu.load_program(prog, labels)
    gpu.run(max_cycles=100)
  3. Visualize the run:

    from src.tinygpu.visualizer import visualize
    visualize(gpu, show_pc=True)
  4. Export as GIF:

    from src.tinygpu.visualizer import save_animation
    save_animation(gpu, out_path="examples/my_run.gif", fps=10, max_frames=200)
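
Putting the four steps together, a complete run of the bundled vector_add kernel looks roughly like this (the output path is just an example, and the run_*.py scripts likely seed input data in memory before running):

from src.tinygpu.assembler import assemble_file
from src.tinygpu.gpu import TinyGPU
from src.tinygpu.visualizer import save_animation, visualize

# Assemble a bundled kernel and execute it on 8 threads.
prog, labels = assemble_file("examples/vector_add.tgpu")
gpu = TinyGPU(num_threads=8, num_registers=8, mem_size=64)
gpu.load_program(prog, labels)
gpu.run(max_cycles=100)

# Inspect the run as a heatmap, then export the same run as an animated GIF.
visualize(gpu, show_pc=True)
save_animation(gpu, out_path="examples/vector_add_run.gif", fps=10, max_frames=200)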

🧰 Development & Testing

Run tests:

pytest

Run linters:

ruff check .
black --check src/ tests/
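
If you add your own kernels, a small smoke test keeps them covered by pytest and CI. The sketch below (hypothetical file tests/test_my_kernel.py) only uses the API calls shown above:

from src.tinygpu.assembler import assemble_file
from src.tinygpu.gpu import TinyGPU

def test_vector_add_runs():
    # Assemble a bundled kernel and check that it executes without raising.
    prog, labels = assemble_file("examples/vector_add.tgpu")
    gpu = TinyGPU(num_threads=8, num_registers=8, mem_size=64)
    gpu.load_program(prog, labels)
    gpu.run(max_cycles=200)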

CI builds and runs the test suite automatically on every push and pull request.


📘 License

MIT - see LICENSE


🌟 Credits & Inspiration

โค๏ธ Built by Deaneeth

Inspired by the educational design of Tiny8 CPU Simulator.

TinyGPU extends these ideas into the world of parallel GPU computing, emphasizing clarity, simplicity, and visualization for all learners.
