A simple RISC CPU implemented in Verilog, as well as a compilation toolchain for it.



This project contains an implementation of a CPU in Verilog targeting a Cyclone II FPGA (though it works with few modifications on a Cyclone V); an assembler for the architecture; and a non-optimizing compiler for a small subset of C. The CPU supports VGA output through a dedicated memory region.

a simple animation running on the T258

The name comes from the project having been developed, in part, as a final project for the CSC258 course offered by the University of Toronto.


I should preface all this by saying I'm not a hardware guy. This was my first time working with Verilog in any nontrivial capacity. A number of concessions were made in the process just to simplify the circuitry. Notably, the smallest unit of data is 16 bits, as opposed to 8 bits as on most (all?) modern architectures.

Overall, the CPU models a Harvard architecture. That is, the ROM bus is separate from the RAM bus, and instructions cannot be executed from RAM. Address and data buses are 16 bits wide.

12-bit color 160x120 VGA output is supported through writes to VRAM, positioned from $0000 to $4B00.
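
As a rough illustration of how a program might address a pixel, the sketch below assumes a row-major layout with one 16-bit word per pixel (160 x 120 = 19200 = $4B00 words, which matches the size of the region above). It also assumes the 12-bit color is packed as 4 bits each of red, green, and blue in the low bits of the word. Neither the layout nor the packing is confirmed by the project, and the function names are hypothetical.

```python
VRAM_BASE = 0x0000        # start of VRAM, as stated above
WIDTH, HEIGHT = 160, 120  # resolution, as stated above


def pixel_address(x, y):
    """Address of pixel (x, y), ASSUMING row-major order and one word per pixel."""
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        raise ValueError("pixel coordinates out of range")
    return VRAM_BASE + y * WIDTH + x


def pack_rgb444(r, g, b):
    """Pack three 4-bit channels into one word, ASSUMING an RRRRGGGGBBBB layout."""
    for channel in (r, g, b):
        if not 0 <= channel <= 0xF:
            raise ValueError("each channel must fit in 4 bits")
    return (r << 8) | (g << 4) | b
```

Under these assumptions the last pixel, (159, 119), lands at $4AFF, one word below the $4B00 upper bound.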

The stack is positioned at $4C00, and grows upwards.


The assembly syntax is loosely based on that of the Zilog Z80.

All instructions are 16 bits wide. The Opcode column below lists the lower 8 bits of each instruction. If an instruction involves a single register, the register number is written to bits 15-11. If it involves two, bits 13-11 hold the first register and bits 10-8 the second. There are 7 registers available.
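
The bit layout described above can be sketched as a pair of encoding helpers. This is a minimal illustration rather than the code in assembler.py: the function names are hypothetical, and numbering the registers 0 to 6 is an assumption. How @const operands are encoded is not described here, so they are left out.

```python
NUM_REGISTERS = 7  # from the text; numbering them 0..6 is an ASSUMPTION


def _check(opcode, *regs):
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in the lower 8 bits")
    for reg in regs:
        if not 0 <= reg < NUM_REGISTERS:
            raise ValueError("register number out of range")


def encode_one_reg(opcode, reg):
    """Single-register form: the register number goes into the field at bit 11 and up."""
    _check(opcode, reg)
    return (reg << 11) | opcode


def encode_two_reg(opcode, first, second):
    """Two-register form: first register in bits 13-11, second in bits 10-8."""
    _check(opcode, first, second)
    return (first << 11) | (second << 8) | opcode
```

For example, ADD (opcode 83) with register 1 as the first operand and register 2 as the second would encode to 0x0A83 under this layout.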

An assembler is provided (assembler.py), which generates a rom.v file that must be built alongside the rest of the project.

Opcode Table
Opcode (hex)  Mnemonic       Operation
81            INC r0         r0++
82            DEC r0         r0--
83            ADD r0, r1     r0 += r1
85            SUB r0, r1     r0 -= r1
87            MUL r0, r1     r0 *= r1
88            OR r0, r1      r0 |= r1
89            AND r0, r1     r0 &= r1
8A            XOR r0, r1     r0 ^= r1
8B            CMP r0, r1     cmp(r0, r1)
8C            PUSH r0        sp++; ram[0x9000 + sp] = r0
8D            PUSH @const    sp++; ram[0x9000 + sp] = @const
8E            POP r0         r0 = ram[0x9000 + sp]; sp--;
8F            POP @const     r0 = ram[0x9000 + sp]; sp--;
90            JEQ @const     pc = Z ? @const : pc
91            JLE @const     pc = N ? pc : @const
92            JMP @const     pc = @const
93            LD r0, r1      r0 = r1
94            LD r0, (r1)    r0 = ram[r1]
95            LD (r0), r1    ram[r0] = r1
96            LD r0, @const  r0 = @const
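
To make the register-only rows of the table concrete, here is a small software model of them. It is a sketch, not a description of the Verilog: it assumes registers are 16 bits wide and wrap on overflow, covers only the instructions that touch nothing but registers, and ignores flags, memory, the stack, and @const operands. The name step is hypothetical.

```python
MASK = 0xFFFF  # ASSUMPTION: 16-bit registers that wrap on overflow

# Two-register ALU opcodes from the table, mapped to Python equivalents.
ALU_OPS = {
    0x83: lambda a, b: a + b,  # ADD
    0x85: lambda a, b: a - b,  # SUB
    0x87: lambda a, b: a * b,  # MUL
    0x88: lambda a, b: a | b,  # OR
    0x89: lambda a, b: a & b,  # AND
    0x8A: lambda a, b: a ^ b,  # XOR
}


def step(regs, word):
    """Apply one 16-bit instruction word to the register list, in place."""
    opcode = word & 0xFF           # lower 8 bits
    single = (word >> 11) & 0x1F   # single-register form: bits 15-11
    first = (word >> 11) & 0x7     # two-register form: bits 13-11
    second = (word >> 8) & 0x7     # two-register form: bits 10-8
    if opcode == 0x81:             # INC
        regs[single] = (regs[single] + 1) & MASK
    elif opcode == 0x82:           # DEC
        regs[single] = (regs[single] - 1) & MASK
    elif opcode == 0x93:           # LD r0, r1
        regs[first] = regs[second]
    elif opcode in ALU_OPS:
        regs[first] = ALU_OPS[opcode](regs[first], regs[second]) & MASK
    else:
        raise NotImplementedError(f"opcode {opcode:02X} is not modelled here")


regs = [0] * 7
regs[1], regs[2] = 5, 7
step(regs, 0x0A83)  # ADD with register 1 as first operand, register 2 as second
```

After the final call, register 1 holds 12 and register 2 is unchanged.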


A transpiler from a subset of C ("TudorC") is implemented in transpiler.py. Some demo programs that can be compiled using it can be found in the demo directory, demonstrating the syntax supported.

To run it, you'll first need to pip install lark-parser. The grammar used can be found in tudorc.g.

Limitations & Known Issues

  • The compiler is a bottom-up, non-optimizing process. It was written in 2 hours, and the assembly it generates is far from optimal.
  • For assembly, a Memory Initialization File (MIF) could have (and should have) been used instead of generating hardcoded opcode arrays.
  • The finite state automaton is too slow, and cannot be clocked at 50 MHz. A rate divider is introduced to scale the clock down to 5 MHz, but this causes clock delivery issues during synthesis. As a result, a number of opcode pairs cannot be executed in sequence without throwing the CPU into an undefined state. This also means that the code generated by the transpiler often won't run correctly.
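
For reference, the MIF alternative mentioned in the list above might look something like the sketch below. It renders a list of 16-bit instruction words as Quartus Memory Initialization File text. The function name to_mif is hypothetical and this is not part of assembler.py; setting the memory depth to the number of words is also an assumption, since the ROM size is not stated here.

```python
def to_mif(words):
    """Render 16-bit words as MIF text, one address per line, in hex."""
    if not words:
        raise ValueError("at least one word is required")
    lines = [
        "WIDTH=16;",               # bits per word
        f"DEPTH={len(words)};",    # ASSUMPTION: ROM is exactly as deep as the program
        "ADDRESS_RADIX=HEX;",
        "DATA_RADIX=HEX;",
        "CONTENT BEGIN",
    ]
    for address, word in enumerate(words):
        lines.append(f"    {address:X} : {word & 0xFFFF:04X};")
    lines.append("END;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two example instruction words; the values are illustrative only.
    print(to_mif([0x0A83, 0x1881]), end="")
```

The resulting text could be written to a .mif file and attached to an on-chip ROM block, which would avoid regenerating Verilog source whenever the program changes.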