deaneeth/tiny-tpu

TinyTPU

A real, synthesizable 4×4 systolic array in SystemVerilog,
compiled to WebAssembly and running live in your browser.

Every PE, every activation, every partial sum you see is a live hardware signal.
Nothing on screen is fabricated or reimplemented in JavaScript.

MIT License · RTL: SystemVerilog · Runtime: WebAssembly · Verified: cocotb + numpy · Frontend: Astro + React · TypeScript strict

RTL CI · Web CI · WASM Build

Open Live Visualizer · Read the Docs


🎬 Preview

[Image: TinyTPU live visualizer preview]

Enter two int8 matrices. The browser executes the actual Verilog RTL (compiled to WebAssembly) cycle-by-cycle and animates every PE, every activation, and every partial sum straight from the hardware signals.


⚙️ How It Works

[Image: RTL-to-browser pipeline diagram]

The core insight: Verilator and Emscripten, chained together, turn synthesizable SystemVerilog into a WebAssembly module any browser can execute. The React visualizer is purely downstream of this: it reads state out of the compiled hardware binary.


| | What is real | Why it matters |
| --- | --- | --- |
| 🔵 Real RTL | rtl/*.sv is synthesizable SystemVerilog: always_ff/always_comb only, no delays, no initial blocks, no inferred latches | Drop it into any FPGA synthesis tool without modification |
| 🟣 Real WASM | Verilator compiles RTL to cycle-accurate C++. Emscripten compiles that C++ to WebAssembly. The browser runs compiled hardware | Not a JavaScript reimplementation of the math |
| 🟢 Real signals | PE weights, activations, partial sums, and FSM phase come from an explicit debug output bus on tiny_tpu_top | The visualizer fabricates nothing; every number is traceable to a hardware port |

💎 Hardware Specs

A 4×4 weight-stationary systolic array computes C = A · B for signed int8 matrices in 14 clock cycles:

| Phase | Cycles | Timeline | What happens |
| --- | --- | --- | --- |
| LOAD_WEIGHTS | 4 | ████░░░░░░░░░░ | Matrix B loaded column-by-column into the PE grid as stationary weights |
| STREAM | 7 | ███████░░░░░░░ | Matrix A streams from the west edge with diagonal row-skew; MACs fire each cycle |
| DRAIN | 3 | ███░░░░░░░░░░░ | Final partial sums propagate out the south edge to the result buffer |
| Total | 14 | ██████████████ | C = A · B complete. Signed int8 inputs, 32-bit accumulation. |
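
The phase counts in the table can be reproduced with a little arithmetic. The sketch below assumes the counts scale with the array dimension n as n, 2n - 1, and n - 1; that generalization and the function name `cycle_budget` are illustrative assumptions, since the table itself only confirms the n = 4 case.

```python
def cycle_budget(n):
    """Per-phase cycle counts for an n x n weight-stationary array (assumed scaling)."""
    load = n             # assumed: one column of weights enters per cycle
    stream = 2 * n - 1   # n activation rows plus n - 1 extra cycles of diagonal skew
    drain = n - 1        # the last wave still has to cross the remaining n - 1 PEs
    return {"LOAD_WEIGHTS": load, "STREAM": stream, "DRAIN": drain,
            "total": load + stream + drain}


print(cycle_budget(4))
# {'LOAD_WEIGHTS': 4, 'STREAM': 7, 'DRAIN': 3, 'total': 14}
```

For n = 4 this matches the 4 + 7 + 3 = 14 cycles listed above.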

The MAC equation executed by every PE, every cycle:

psum_out  <=  psum_in  +  (weight_reg  ×  act_in)

Row i of matrix A is delayed by i cycles (the diagonal skew) so each activation meets the correct stationary weight at precisely the right clock edge. For matrices larger than 4×4, the L3 view tiles into multiple 4×4 passes, each running on real RTL.
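
To make the skew and the tiling concrete, here is a minimal register-level model in plain Python. It is a sketch under stated assumptions, not the project's RTL: it assumes PE(k, j) holds B[k][j], that input lane k (array row k) carries the k-th element of each row of A delayed by k cycles, that the accumulator wraps at 32 bits, and that partial tile results are summed outside the array. All function names are hypothetical.

```python
N = 4  # array dimension from the README


def wrap32(x):
    """Wrap a Python int to signed 32 bits (assumed accumulator overflow behaviour)."""
    return (x + 2**31) % 2**32 - 2**31


def reference_matmul(A, B):
    """Plain triple-loop matmul, used only to check the model."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def systolic_matmul(A, B):
    """Return (C, cycles) for two N x N matrices using a register-level model."""
    weights = [row[:] for row in B]       # PE(k, j) holds B[k][j]
    cycles = N                            # LOAD_WEIGHTS, assumed one column per cycle
    act = [[0] * N for _ in range(N)]     # activation register of each PE
    psum = [[0] * N for _ in range(N)]    # partial-sum register of each PE
    C = [[0] * N for _ in range(N)]
    for t in range(3 * N - 2):            # STREAM (2N - 1) + DRAIN (N - 1)
        new_act = [[0] * N for _ in range(N)]
        new_psum = [[0] * N for _ in range(N)]
        for k in range(N):
            for j in range(N):
                if j == 0:
                    m = t - k             # lane k is delayed by k cycles
                    act_in = A[m][k] if 0 <= m < N else 0
                else:
                    act_in = act[k][j - 1]                # from the west neighbour
                psum_in = psum[k - 1][j] if k > 0 else 0  # from the north neighbour
                new_act[k][j] = act_in
                new_psum[k][j] = wrap32(psum_in + weights[k][j] * act_in)
        act, psum = new_act, new_psum     # clock edge: every register updates at once
        cycles += 1
        for j in range(N):                # finished sums leave the south edge
            m = t - (N - 1) - j
            if 0 <= m < N:
                C[m][j] = psum[N - 1][j]
    return C, cycles


def tiled_matmul(A, B):
    """Multiply arbitrary-size matrices with zero-padded N x N passes."""
    M, K, P = len(A), len(B), len(B[0])

    def tile(X, r0, c0):
        return [[X[r][c] if r < len(X) and c < len(X[0]) else 0
                 for c in range(c0, c0 + N)] for r in range(r0, r0 + N)]

    C = [[0] * P for _ in range(M)]
    passes = 0
    for i0 in range(0, M, N):
        for j0 in range(0, P, N):
            for k0 in range(0, K, N):
                part, _ = systolic_matmul(tile(A, i0, k0), tile(B, k0, j0))
                passes += 1
                for i in range(i0, min(i0 + N, M)):
                    for j in range(j0, min(j0 + N, P)):
                        C[i][j] = wrap32(C[i][j] + part[i - i0][j - j0])
    return C, passes


A = [[1, -2, 3, -4], [5, 6, -7, 8], [-128, 127, 0, 1], [9, -10, 11, 12]]
B = [[2, 0, -1, 3], [4, -5, 6, 7], [-8, 9, 10, -11], [127, -128, 1, 0]]
C, cycles = systolic_matmul(A, B)
print(C == reference_matmul(A, B), cycles)
```

In this model the last result reaches the south-east corner on the tenth compute cycle, which together with four load cycles gives the 14-cycle total. The real design's lane indexing and tile accumulation may differ.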


| Spec | Value |
| --- | --- |
| Array dimensions | 4 × 4 (16 processing elements) |
| MACs per cycle | Up to 16 (one per PE) |
| Input precision | Signed int8 |
| Accumulator width | Signed int32 |
| Synthesizable target | Any FPGA synthesis tool (no simulation-only constructs) |
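
The precision rows above leave a lot of headroom, which a short calculation makes visible. This sketch models one MAC step with int8 operands and a signed 32-bit accumulator; the two's-complement wrap on overflow is an assumption about the RTL, and the name `mac` is hypothetical.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def mac(psum_in, weight, act):
    """One PE step: psum_in + weight * act, wrapped to a signed 32-bit accumulator."""
    for operand in (weight, act):
        if not INT8_MIN <= operand <= INT8_MAX:
            raise ValueError(f"{operand} is outside the signed int8 range")
    total = psum_in + weight * act
    return (total - INT32_MIN) % 2**32 + INT32_MIN   # two's-complement wrap


worst_product = INT8_MIN * INT8_MIN        # largest magnitude of one int8 x int8 product
worst_column = 4 * worst_product           # four PEs contribute to each output in a 4x4 array
safe_terms = INT32_MAX // worst_product    # worst-case products that fit without overflow

print(worst_product, worst_column, safe_terms)   # prints 16384 65536 131071
```

A single 4×4 pass can reach at most 65536 in magnitude, so a 32-bit accumulator could absorb many thousands of tiled passes before overflow becomes possible.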

✅ Why It Is Honest

Most hardware visualizers show a cartoon: a JavaScript reimplementation of the math dressed up with animations. TinyTPU does the opposite.

| Principle | In practice |
| --- | --- |
| 🔒 RTL is the only source of truth | The frontend never reimplements the matmul. It reads state out of the compiled WASM binary. If the RTL is wrong, the visualizer shows the wrong thing. |
| ✔️ Bit-exact golden verification | The cocotb test suite asserts bit-exact equality between RTL output and a numpy reference model across 20+ random matrix pairs before anything ships. A wrong matmul is a beautiful lie. TinyTPU refuses to tell it. |
| 🚫 No signal fabrication | PE weights, activations, and partial sums come from an explicit debug output bus on tiny_tpu_top, not from reconstructed state, not from a shadow model, not from public_flat. |
| 🏗️ Synthesizable by constraint | The Verilog is not a testbench hack; it is the actual design, constrained to always_ff/always_comb, lint-clean under -Wall, and free of all simulation-only constructs. |
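
The bit-exact check described in the table can be sketched as a small harness. This is not the project's cocotb suite: it is a dependency-free stand-in in which `golden_matmul` plays the role of the numpy reference, `dut` is any callable standing in for the simulated RTL, and the four corner-case pairs are an addition of this sketch. All names are hypothetical.

```python
import random


def golden_matmul(A, B):
    """Reference product with exact Python integers (stand-in for the numpy model)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_int8_matrix(rng, n=4):
    """Build an n x n matrix of uniformly random signed int8 values."""
    return [[rng.randint(-128, 127) for _ in range(n)] for _ in range(n)]


def check_bit_exact(dut, trials=20, seed=0, n=4):
    """Raise AssertionError on the first mismatch; return the number of pairs checked."""
    rng = random.Random(seed)
    corners = [[[v] * n for _ in range(n)] for v in (-128, 127)]
    pairs = [(a, b) for a in corners for b in corners]     # 4 corner-case pairs
    pairs += [(random_int8_matrix(rng, n), random_int8_matrix(rng, n))
              for _ in range(trials)]
    for index, (A, B) in enumerate(pairs):
        expected = golden_matmul(A, B)
        actual = dut(A, B)
        if actual != expected:                             # exact equality, no tolerance
            raise AssertionError(f"pair {index}: expected {expected}, got {actual}")
    return len(pairs)


def unsigned_bug(A, B):
    """Hypothetical faulty DUT that treats activations as unsigned bytes."""
    return golden_matmul([[a & 0xFF for a in row] for row in A], B)


print(check_bit_exact(golden_matmul))   # the reference trivially matches itself
try:
    check_bit_exact(unsigned_bug)
except AssertionError as err:
    print("caught:", str(err)[:40])
```

Because the comparison is exact equality rather than a tolerance, a sign-handling bug like the one above fails on the very first corner-case pair.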

🛠️ Build from Source

All RTL tooling runs inside WSL2 Ubuntu. The frontend runs anywhere.

Prerequisites
# WSL2 Ubuntu system dependencies
sudo apt-get install -y build-essential cmake python3 python3-pip python3-venv \
    autoconf flex bison libfl2 libfl-dev

# Verilator 5.x (build from source)
git clone https://github.com/verilator/verilator && cd verilator
git checkout stable && autoconf && ./configure && make -j$(nproc) && sudo make install
cd ..    # leave verilator/ so emsdk is not cloned inside it

# Emscripten SDK
git clone https://github.com/emscripten-core/emsdk && cd emsdk
./emsdk install latest && ./emsdk activate latest
source ./emsdk_env.sh
cd ..    # return to the parent directory

# Python virtualenv
python3 -m venv ~/.venvs/tinytpu && source ~/.venvs/tinytpu/bin/activate
pip install cocotb pytest numpy

# Node.js + pnpm (assumes nvm is already installed)
nvm install --lts && npm install -g pnpm

Step 1: RTL lint

verilator --lint-only -Wall rtl/*.sv

Step 2: Simulation and golden verification

source ~/.venvs/tinytpu/bin/activate
pytest sim/golden.py -q

Step 3: WASM build

bash wasm/build.sh
# outputs  web/public/tiny_tpu.mjs  +  web/public/tiny_tpu.wasm

Step 4: Frontend dev server

cd web && pnpm install && pnpm dev    # http://localhost:4321

📦 Tech Stack

| Layer | Technologies |
| --- | --- |
| RTL | SystemVerilog, Verilator |
| WASM | Emscripten, WebAssembly |
| Frontend | Astro, React, TypeScript, Tailwind CSS |
| Verification | cocotb, Python, numpy |
| Deploy | Vercel |

📁 Repository Structure

tiny-tpu/
│
├── rtl/                        SystemVerilog source of truth
│   ├── pe.sv                   Single MAC cell (weight-stationary)
│   ├── systolic_array.sv       4×4 PE grid (generate loop)
│   ├── controller.sv           FSM: IDLE, LOAD_WEIGHTS, STREAM, DRAIN, DONE
│   └── tiny_tpu_top.sv         Top wrapper + debug output bus
│
├── sim/                        cocotb verification suite
│   ├── golden.py               numpy reference model (ground truth)
│   ├── test_pe.py              PE-level unit tests
│   ├── test_systolic_array.py  Array-level unit tests
│   └── test_top.py             Full matmul + cycle count tests
│
├── wasm/                       C++ harness to WASM bridge
│   ├── harness.cpp             TinyTpuSim class, reads the debug bus
│   ├── bindings.cpp            embind JS-callable surface
│   └── build.sh                verilator --cc + em++ build script
│
├── web/                        Astro + React + shadcn/ui frontend
│   ├── src/pages/              index.astro, app.astro, docs/
│   ├── src/components/         Visualizer, PEGrid, Controls, MatrixInput
│   ├── src/lib/                wasm-loader.ts, state-schema.ts
│   └── public/                 tiny_tpu.wasm (compiled artifact)
│
└── docs/
    └── STATE_SCHEMA.md         Per-cycle state contract (keep in sync with state-schema.ts)

📚 Documentation

| Doc | What it covers |
| --- | --- |
| How it works | The full RTL to Verilator to WASM to browser pipeline. Why the browser runs real compiled hardware, not a JavaScript reimplementation. |
| The systolic array | Weight-stationary dataflow, the diagonal skew, the 14-cycle budget, and why TPUs use this structure. |
| Architecture | Monorepo layout, the CycleState data contract, build flow, and key design decisions. |

🗺️ Roadmap

v1 (Shipped)

| Status | Feature |
| --- | --- |
| ✅ | 4×4 synthesizable systolic array, bit-exact golden-verified against numpy |
| ✅ | Real-time WASM execution in the browser (zero JS math reimplementation) |
| ✅ | L1 / L2 / L3 progressive disclosure: single MAC, full 4×4 grid, tiling |
| ✅ | Full SEO pass, production deploy on Vercel |

Coming Next (build in public)

| Status | Feature |
| --- | --- |
| 🔲 | Configurable array size (N = 2 to 16) |
| 🔲 | Challenge mode: score your MAC utilization vs theoretical optimal |
| 🔲 | Dataflow modes: weight-stationary vs output-stationary toggle |
| 🔲 | int8 quantization visualizer |
| 🔲 | GPU-vs-TPU comparison view (cross-links TinyGPU) |
| 🔲 | Run a real nn.Linear layer: the ML to hardware bridge |

🧬 The Tiny Series

Invisible systems, made watchable, with the real implementation underneath.

| # | Project | Description | Status |
| --- | --- | --- | --- |
| 1 | TinyGPU | A minimal GPU in synthesizable RTL | ✅ Shipped |
| 2 | TinyTPU | This project, a minimal weight-stationary systolic array | ✅ Shipped |

📄 License

MIT. See LICENSE.


Built by Deaneeth ❤️
SystemVerilog · Verilator · Emscripten · Astro · React


About

A TPU you can watch run - real SystemVerilog systolic array, compiled to WASM, visualized live in your browser.
