A real, synthesizable 4×4 systolic array in SystemVerilog,
compiled to WebAssembly and running live in your browser.
Every PE, every activation, every partial sum you see is a live hardware signal.
Nothing on screen is fabricated or reimplemented in JavaScript.
Enter two int8 matrices. The browser executes the compiled SystemVerilog RTL cycle by cycle and animates each PE, activation, and partial sum straight from the hardware signals.
The core insight: Verilator and Emscripten, chained together, turn synthesizable SystemVerilog into a WebAssembly module any browser can execute. The React visualizer is purely downstream of this: it reads state out of the compiled hardware binary.
| | What is real | Why it matters |
|---|---|---|
| 🔵 Real RTL | rtl/*.sv is synthesizable SystemVerilog: always_ff/always_comb only, no delays, no initial blocks, no inferred latches | Drop it into any FPGA synthesis tool without modification |
| 🟣 Real WASM | Verilator compiles RTL to cycle-accurate C++. Emscripten compiles that C++ to WebAssembly. The browser runs compiled hardware | Not a JavaScript reimplementation of the math |
| 🟢 Real signals | PE weights, activations, partial sums, and FSM phase come from an explicit debug output bus on tiny_tpu_top | The visualizer fabricates nothing; every number is traceable to a hardware port |
A 4×4 weight-stationary systolic array computes C = A · B for signed int8 matrices in 14 clock cycles:
| Phase | Cycles | Timeline | What happens |
|---|---|---|---|
| `LOAD_WEIGHTS` | 4 | `████░░░░░░░░░░` | Matrix B loaded column-by-column into the PE grid as stationary weights |
| `STREAM` | 7 | `░░░░███████░░░` | Matrix A streams from the west edge with diagonal row-skew; MACs fire each cycle |
| `DRAIN` | 3 | `░░░░░░░░░░░███` | Final partial sums propagate out the south edge to the result buffer |
| Total | 14 | `██████████████` | C = A · B complete. Signed int8 inputs, 32-bit accumulation. |
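The phase lengths in the table can be related to the array size with a little arithmetic. The sketch below is an extrapolation, not something taken from the RTL: it assumes one weight column is loaded per cycle, that each of the N input rows carries N activations with the last row starting N - 1 cycles late, and that draining takes the remaining N - 1 cycles. The function name `cycle_budget` is hypothetical.

```python
def cycle_budget(n=4):
    """Per-phase cycle counts for an n x n array, under the assumptions above."""
    load = n              # assumed: one column of B per cycle
    stream = n + (n - 1)  # n activations per row, last row delayed by n - 1 cycles
    drain = n - 1         # assumed: remaining partial sums leave the south edge
    return {
        "LOAD_WEIGHTS": load,
        "STREAM": stream,
        "DRAIN": drain,
        "total": load + stream + drain,
    }


print(cycle_budget(4))
```

For n = 4 this reproduces the 4 + 7 + 3 = 14 cycles listed above. Whether other array sizes would follow the same pattern depends on the controller, which this sketch does not model.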
The MAC equation executed by every PE, every cycle:
`psum_out <= psum_in + (weight_reg × act_in)`
Row i of matrix A is delayed by i cycles (the diagonal skew) so each activation meets the correct stationary weight at precisely the right clock edge. For matrices larger than 4×4, the L3 view tiles into multiple 4×4 passes, each running on real RTL.
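To make the skew and the tiling concrete, here is a behavioral sketch in plain Python. It is not the RTL and not the project's golden model. It assumes one common weight-stationary arrangement: PE(k, j) holds B[k][j], activations move east through one register per PE, partial sums move south through one register per PE, and input lane k is delayed by k cycles. Which slice of A feeds each lane, and summing tile results in software, are assumptions of this sketch; `systolic_pass` and `tiled_matmul` are hypothetical names.

```python
N = 4  # array dimension, matching the 4x4 grid described above


def systolic_pass(a_tile, b_tile):
    """One weight-stationary pass over N x N tiles. Returns (product, cycles)."""
    weight = [row[:] for row in b_tile]   # PE(k, j) holds b_tile[k][j]
    act = [[0] * N for _ in range(N)]     # activation registers (flow east)
    psum = [[0] * N for _ in range(N)]    # partial-sum registers (flow south)
    out = [[0] * N for _ in range(N)]
    cycles = 3 * N - 2                    # streaming plus draining in this model
    for t in range(cycles):
        next_act = [[0] * N for _ in range(N)]
        next_psum = [[0] * N for _ in range(N)]
        for k in range(N):
            for j in range(N):
                if j == 0:
                    i = t - k             # diagonal skew: lane k starts k cycles late
                    act_in = a_tile[i][k] if 0 <= i < N else 0
                else:
                    act_in = act[k][j - 1]
                psum_in = psum[k - 1][j] if k > 0 else 0
                # The MAC from the equation above.
                next_psum[k][j] = psum_in + weight[k][j] * act_in
                next_act[k][j] = act_in
        act, psum = next_act, next_psum
        # Finished sums appear at the south edge, one diagonal per cycle.
        for j in range(N):
            i = t - (N - 1) - j
            if 0 <= i < N:
                out[i][j] = psum[N - 1][j]
    return out, cycles


def tiled_matmul(a, b):
    """Multiply matrices whose dimensions are multiples of N, one pass per tile."""
    m, depth, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(m)]
    passes = 0
    for i0 in range(0, m, N):
        for j0 in range(0, p, N):
            for k0 in range(0, depth, N):
                a_tile = [row[k0:k0 + N] for row in a[i0:i0 + N]]
                b_tile = [row[j0:j0 + N] for row in b[k0:k0 + N]]
                part, _ = systolic_pass(a_tile, b_tile)
                for i in range(N):
                    for j in range(N):
                        c[i0 + i][j0 + j] += part[i][j]
                passes += 1
    return c, passes
```

In this model a single pass needs 3N - 2 = 10 cycles after the weights are in place, which lines up with the 7 STREAM plus 3 DRAIN cycles in the table. An 8×8 by 8×8 product takes eight 4×4 passes here; the real L3 view may schedule or accumulate tiles differently.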
| Spec | Value |
|---|---|
| Array dimensions | 4 × 4 (16 processing elements) |
| MACs per cycle | Up to 16 (one per PE) |
| Input precision | Signed int8 |
| Accumulator width | Signed int32 |
| Synthesizable target | Any FPGA synthesis tool (no simulation-only constructs) |
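The int8 input and int32 accumulator widths in the table leave a wide overflow margin, which a few lines of arithmetic can confirm. This is a back-of-the-envelope check in plain Python with hypothetical helper names, not code from the project.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MAX = 2**31 - 1

# Largest-magnitude product of two signed int8 values: (-128) * (-128).
MAX_PRODUCT = INT8_MIN * INT8_MIN  # 16384


def worst_case_sum(depth):
    """Largest value a dot product can reach after `depth` MACs."""
    return depth * MAX_PRODUCT


def max_safe_depth():
    """Longest accumulation chain that cannot overflow a signed int32."""
    return INT32_MAX // MAX_PRODUCT


print(worst_case_sum(4))  # 65536
print(max_safe_depth())   # 131071
```

A single 4×4 pass accumulates four products per output, so its worst case of 65536 sits far below the int32 limit.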
Most hardware visualizers show a cartoon: a JavaScript reimplementation of the math dressed up with animations. TinyTPU does the opposite.
| Principle | In practice |
|---|---|
| RTL is the only source of truth | The frontend never reimplements the matmul. It reads state out of the compiled WASM binary. If the RTL is wrong, the visualizer shows the wrong thing. |
| Bit-exact golden verification | The cocotb test suite asserts bit-exact equality between RTL output and a numpy reference model across 20+ random matrix pairs before anything ships. A wrong matmul is a beautiful lie. TinyTPU refuses to tell it. |
| No signal fabrication | PE weights, activations, and partial sums come from an explicit debug output bus on tiny_tpu_top, not from reconstructed state, not from a shadow model, not from public_flat. |
| Synthesizable by constraint | The Verilog is not a testbench hack; it is the actual design, constrained to always_ff/always_comb, lint-clean under -Wall, and free of all simulation-only constructs. |
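The bit-exact check described in the table can be sketched roughly as follows. This is not the project's sim/golden.py: it uses plain Python lists instead of numpy so it has no dependencies, and `dut_matmul` is a hypothetical callable standing in for whatever drives the RTL through cocotb and returns its result matrix.

```python
import random

INT8_MIN, INT8_MAX = -128, 127


def random_int8_matrix(n, rng):
    """An n x n matrix of uniformly random signed int8 values."""
    return [[rng.randint(INT8_MIN, INT8_MAX) for _ in range(n)] for _ in range(n)]


def golden_matmul(a, b):
    """Reference C = A . B using exact Python integers."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def check_bit_exact(dut_matmul, pairs=20, n=4, seed=0):
    """Compare dut_matmul against the reference on `pairs` random inputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for trial in range(pairs):
        a = random_int8_matrix(n, rng)
        b = random_int8_matrix(n, rng)
        expected = golden_matmul(a, b)
        actual = dut_matmul(a, b)
        if actual != expected:
            raise AssertionError(
                f"trial {trial}: mismatch for A={a}, B={b}: "
                f"expected {expected}, got {actual}"
            )
    return pairs
```

Exact equality, rather than a tolerance, is the point: integer matmul has one right answer, so any differing bit indicates a hardware bug.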
All RTL tooling runs inside WSL2 Ubuntu. The frontend runs anywhere.
Prerequisites

    # WSL2 Ubuntu system dependencies
    sudo apt-get install -y build-essential cmake python3 python3-pip python3-venv \
        autoconf flex bison libfl2 libfl-dev

    # Verilator 5.x (build from source)
    git clone https://github.com/verilator/verilator && cd verilator
    git checkout stable && autoconf && ./configure && make -j$(nproc) && sudo make install

    # Emscripten SDK
    git clone https://github.com/emscripten-core/emsdk && cd emsdk
    ./emsdk install latest && ./emsdk activate latest
    source emsdk_env.sh

    # Python virtualenv
    python3 -m venv ~/.venvs/tinytpu && source ~/.venvs/tinytpu/bin/activate
    pip install cocotb pytest numpy

    # Node.js + pnpm
    nvm install --lts && npm install -g pnpm

Step 1: RTL lint

    verilator --lint-only -Wall rtl/*.sv

Step 2: Simulation and golden verification

    source ~/.venvs/tinytpu/bin/activate
    pytest sim/golden.py -q

Step 3: WASM build

    bash wasm/build.sh
    # outputs web/public/tiny_tpu.mjs + web/public/tiny_tpu.wasm

Step 4: Frontend dev server
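
    cd web && pnpm install && pnpm dev   # http://localhost:4321

| Layer | Technologies |
|---|---|
| RTL | SystemVerilog, Verilator |
| WASM | Emscripten, embind |
| Frontend | Astro, React, shadcn/ui |
| Verification | cocotb, pytest, numpy |
| Deploy | Vercel |
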
    tiny-tpu/
    │
    ├── rtl/                          SystemVerilog source of truth
    │   ├── pe.sv                     Single MAC cell (weight-stationary)
    │   ├── systolic_array.sv         4×4 PE grid (generate loop)
    │   ├── controller.sv             FSM: IDLE, LOAD_WEIGHTS, STREAM, DRAIN, DONE
    │   └── tiny_tpu_top.sv           Top wrapper + debug output bus
    │
    ├── sim/                          cocotb verification suite
    │   ├── golden.py                 numpy reference model (ground truth)
    │   ├── test_pe.py                PE-level unit tests
    │   ├── test_systolic_array.py    Array-level unit tests
    │   └── test_top.py               Full matmul + cycle count tests
    │
    ├── wasm/                         C++ harness to WASM bridge
    │   ├── harness.cpp               TinyTpuSim class, reads the debug bus
    │   ├── bindings.cpp              embind JS-callable surface
    │   └── build.sh                  verilator --cc + em++ build script
    │
    ├── web/                          Astro + React + shadcn/ui frontend
    │   ├── src/pages/                index.astro, app.astro, docs/
    │   ├── src/components/           Visualizer, PEGrid, Controls, MatrixInput
    │   ├── src/lib/                  wasm-loader.ts, state-schema.ts
    │   └── public/                   tiny_tpu.wasm (compiled artifact)
    │
    └── docs/
        └── STATE_SCHEMA.md           Per-cycle state contract (keep in sync with state-schema.ts)

| Doc | What it covers |
|---|---|
| How it works | The full RTL to Verilator to WASM to browser pipeline. Why the browser runs real compiled hardware, not a JavaScript reimplementation. |
| The systolic array | Weight-stationary dataflow, the diagonal skew, the 14-cycle budget, and why TPUs use this structure. |
| Architecture | Monorepo layout, the CycleState data contract, build flow, and key design decisions. |
| Status | Feature |
|---|---|
| ✅ | 4×4 synthesizable systolic array, bit-exact golden-verified against numpy |
| ✅ | Real-time WASM execution in the browser (zero JS math reimplementation) |
| ✅ | L1 / L2 / L3 progressive disclosure: single MAC, full 4×4 grid, tiling |
| ✅ | Full SEO pass, production deploy on Vercel |

| Status | Feature |
|---|---|
| 🔲 | Configurable array size (N = 2 to 16) |
| 🔲 | Challenge mode: score your MAC utilization vs theoretical optimal |
| 🔲 | Dataflow modes: weight-stationary vs output-stationary toggle |
| 🔲 | int8 quantization visualizer |
| 🔲 | GPU-vs-TPU comparison view (cross-links TinyGPU) |
| 🔲 | Run a real nn.Linear layer: the ML to hardware bridge |
Invisible systems, made watchable, with the real implementation underneath.
| # | Project | Description | Status |
|---|---|---|---|
| 1 | TinyGPU | A minimal GPU in synthesizable RTL | ✅ Shipped |
| 2 | TinyTPU | This project, a minimal weight-stationary systolic array | ✅ Shipped |
MIT. See LICENSE.

