## Multi-core RISC Processor Design and Implementation (Rev. 2.02)

ELEC5881M - Interim Report

**Ben David Lancaster** Student ID: 201280376

Submitted in accordance with the requirements for the degree of Master of Science (MSc) in Embedded Systems Engineering

Supervisor: Dr. David Cowell Assessor: Mr David Moore

**University of Leeds** School of Electronic and Electrical Engineering

May 1, 2019

Word count: 4689

#### **Abstract**

This interim report details the first four months of progress on a project to design, implement, and verify a multi-core FPGA RISC processor. The project has been split into two stages: first, to build a functional single-core RISC processor, and second, to add multiprocessor principles and functionality to it.

Current multiprocessor and network-on-chip communication methods are discussed, along with how they could be included in this multi-core RISC design. To date, a 16-bit instruction set architecture has been designed featuring common load/store, comparison, and bitwise instructions. A single-core processor has been implemented in Verilog and verified using simulations and test benches running various simple software programs.

Future tasks have been planned and will focus on the second stage of the project. Work will start on designing a loosely coupled multiprocessor communication interface and integrating it with the single-core processor.

## **Revision History**

| Date       | Version | Changes                        |
|------------|---------|--------------------------------|
| 10/04/2019 | 2.02    | Update future stages.          |
| 05/04/2019 | 2.01    | Fix processor RTL diagram.     |
| 04/04/2019 | 2.00    | Initial processor RTL diagram. |
| 01/04/2019 | 1.00    | Initial section outline.       |

Document revisions.

# **Declaration of Academic Integrity**

The candidate confirms that the work submitted is his/her own, except where work which has formed part of jointly-authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated in the report. The candidate confirms that appropriate credit has been given within the report where reference has been made to the work of others.

This copy has been supplied on the understanding that no quotation from the report may be published without proper acknowledgement. The candidate, however, confirms his/her consent to the University of Leeds copying and distributing all or part of this work in any forms and using third parties, who might be outside the University, to monitor breaches of regulations, to verify whether this work contains plagiarised material, and for quality assurance purposes.

The candidate confirms that the details of any mitigating circumstances have been submitted to the Student Support Office at the School of Electronic and Electrical Engineering, at the University of Leeds.

Name: Ben David Lancaster

Date: May 1, 2019

# **Table of Contents**

- 1 Introduction
  - 1.1 Why Multi-core?
  - 1.2 Why RISC?
  - 1.3 Why FPGA?
- 2 Background
  - 2.1 Amdahl's Law and Parallelism
  - 2.2 Loosely and Tightly Coupled Processors
  - 2.3 Network-on-chip Architectures
- 3 Project Overview
  - 3.1 Project Deliverables
    - 3.1.1 Core Deliverables (CD)
    - 3.1.2 Extended Deliverables (ED)
  - 3.2 Project Timeline
    - 3.2.1 Project Stages
    - 3.2.2 Project Stage Detail
    - 3.2.3 Timeline
  - 3.3 Resources
    - 3.3.1 Hardware Resources
    - 3.3.2 Software Resources
  - 3.4 Legal and Ethical Considerations
- 4 Current Progress
  - 4.1 RISC Core
    - 4.1.1 Instruction Set Architecture
    - 4.1.2 Design and Implementation
    - 4.1.3 Verification
- 5 Future Work
  - 5.1 Project Status
    - 5.1.1 Updated Project Time Line
    - 5.1.2 Future Work
- 6 Conclusion
- References
- Appendix A - Code Listing

## Chapter 1

## Introduction


This project will detail the design, implementation, and verification of a new multi-core RISC processor aimed at FPGA devices. This project was chosen due to my interest in processor design. I have previously designed only single-core RISC processors and wish to extend this knowledge to gain a basic, first-hand understanding of multi-core communication, design considerations, and the limitations of parallelism.

I will use this opportunity to further develop my knowledge of FPGA and processor design by designing, implementing, and verifying a multi-core RISC processor from scratch, including the design of a communication interface between multiple cores.

### 1.1 Why Multi-core?

Moore's Law states that the number of transistors in a chip will double every 2 years []. CPU designers would utilise the additional transistors to add more pipeline stages to the processor to reduce the propagation delay [], which would allow for higher clock frequencies.

The size of transistors has been decreasing [] and today they can be manufactured in the sub-10 nanometre range. However, the extremely small transistor size increases electrical leakage and other negative effects, resulting in unreliability and potential damage to the transistor []. The high transistor count produces large amounts of heat and requires increasing power to supply the chip. These trade-offs are currently managed by reducing the input voltage, utilising complex cooling techniques, and reducing clock frequency, all of which limit the performance of the chip significantly and contribute to Moore's Law *slowing* down. The capacity limit of current-generation planar transistors is approaching, so for performance increases to continue, other approaches are employed, such as alternative transistor technologies like multigate transistors [1], software and hardware optimisations, and multiprocessor architectures.

This report will focus on the latter: to produce a small multi-core processor that can utilise software-based parallelism to gain performance benefits, compared to a larger single-core design.

### 1.2 Why RISC?

RISC architectures feature simpler and fewer instructions compared to CISC, which emphasises instructions that perform larger tasks. A single CISC instruction might be performed with multiple RISC instructions. Because of the fewer and simpler instructions, RISC machines rely heavily on software optimisations for performance. RISC instruction sets are based on load/store architectures, where most instructions are either register-to-register or memory reading and writing [2]. This constraint greatly reduces complexity.

RISC architectures are easier to design and implement, especially for beginners, because their simpler instructions share the same pipeline. In CISC designs there may be a different pipeline for each instruction, which would consume far more FPGA resources.

### 1.3 Why FPGA?

Field programmable gate arrays (FPGA) are a great choice for prototyping digital logic designs due to their programmable nature and quick development times.

My experience with FPGAs in previous projects will reduce risk and learning times and allow more time to be spent on adding and extending features (discussed further in section 3.1).

FPGAs, however, may not be suitable for prototyping all register-transfer level (RTL) projects. Larger RTL projects, such as large commercial processors, may greatly exceed the logic cell resources available in today's high-end FPGA devices and may only be prototyped through silicon fabrication, which can be expensive. This resource limitation will not be a problem as the project aims to produce a small and minimal design specifically for learning about multi-core architectures.

## Chapter 2

# Background


### 2.1 Amdahl's Law and Parallelism

In many applications, not restricted to software, there may exist many opportunities for processes or algorithms to be performed in parallel. These algorithms can be split into two parts: a serial part that cannot be parallelised, and a part that can be parallelised. Amdahl's Law defines a formula for calculating the maximum *speedup* of a process with potential parallelism opportunities when run in parallel on $n$ processors. Speedup is a term used to describe the potential performance improvement of an algorithm using an enhanced resource (in this case, adding parallel processors) compared to the original algorithm. Amdahl's Law is defined below, where the potential speedup $S_p$ is dependent on the portion of the program that can be parallelised, $p$, and the number of processing cores, $n$:

$$S_p = \frac{1}{(1-p) + \frac{p}{n}} \tag{2.1}$$

This formula will be used throughout the project to gauge the performance of the multi-core design running various software algorithms.
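As a worked illustration of Equation 2.1, the following minimal Python sketch evaluates the speedup for a range of core counts. The function name `amdahl_speedup` and the example value $p = 0.9$ are illustrative assumptions, not measurements from this project.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Maximum speedup S_p from Equation 2.1.

    p: fraction of the program that can be parallelised (0.0 to 1.0)
    n: number of processing cores (1 or more)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Hypothetical workload: 90% of the program is parallelisable.
for cores in (1, 2, 4, 8, 16, 32):
    print(f"n = {cores:2d}  S_p = {amdahl_speedup(0.9, cores):.2f}")
```

For any fixed $p < 1$ the speedup is bounded above by $1/(1-p)$, so with $p = 0.9$ no number of cores can exceed a speedup of 10. This bound is worth keeping in mind when interpreting results for the 2 to 32 core range targeted by CD3.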

### 2.2 Loosely and Tightly Coupled Processors

Multiprocessor systems can be generalised into two architectures, loosely coupled and tightly coupled, each with advantages and disadvantages. In loosely coupled systems, each processing node is self-contained: each node has its own dedicated memory and IO modules. Communication between nodes is performed over a *Message Transfer System (MTS)* [3] in a master-slave control architecture.

Scalability in loosely coupled systems is generally easier to implement as each node can simply be appended to the shared MTS interface without large modifications to the rest of the system. Scalability is an important concern in this project as I wish to test the developed solution with a range of processing nodes.

As the nodes of a loosely coupled system feature their own memory and IO modules, they generally perform better in cases where interaction between nodes is not prominent: each node can store a separate part of the software program in its memory module, allowing simultaneous execution of the program.

In scenarios where inter-node communication is prominent, however, access to the MTS interface must be scheduled to avoid access conflicts. This introduces delays and idle times in the software program's execution, resulting in lower throughput. Figure 2.1 shows a general layout of a loosely coupled multiprocessor system.
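To illustrate what scheduling access to a shared MTS might involve, the sketch below models a simple round-robin arbiter in Python. The report has not yet fixed a scheduling policy, so round-robin is an assumption chosen only for illustration, and the function name `round_robin_grant` is hypothetical.

```python
from typing import List, Optional


def round_robin_grant(requests: List[bool], last_granted: int) -> Optional[int]:
    """Pick the next node allowed to use the shared interface.

    requests[i] is True when node i wants to use the MTS this cycle.
    The search starts at the node after last_granted so that no node
    can hold the interface indefinitely while others are waiting.
    Returns the granted node index, or None if nobody is requesting.
    """
    count = len(requests)
    for offset in range(1, count + 1):
        candidate = (last_granted + offset) % count
        if requests[candidate]:
            return candidate
    return None


# Four nodes; nodes 0 and 2 request the interface on every cycle.
requests = [True, False, True, False]
last = len(requests) - 1  # start so that node 0 has first priority
for cycle in range(4):
    granted = round_robin_grant(requests, last)
    print(f"cycle {cycle}: granted node {granted}")
    if granted is not None:
        last = granted
```

In this model the two requesting nodes alternate, so each is idle on every other cycle. This is the kind of waiting time that reduces throughput when inter-node communication is frequent.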

Tightly coupled systems feature processing nodes that do not have their own dedicated memory or IO modules – each node is directly connected to a shared memory module using a dedicated port. In scenarios where inter-node communication is prominent, tightly coupled systems are generally better suited as nodes are directly connected to a shared memory and do not need to wait to use a shared bus.



**Figure 2.1:** A loosely coupled multiprocessor system. Each node features its own memory and IO modules and uses a Message Transfer System to perform inter-node communication. Image source: [3].

**Figure 2.2:** A tightly coupled multiprocessor system. Nodes are directly connected to memory and IO modules. Image source: [3].

This project will utilise a loosely coupled architecture because it is easier to scale and because of my previous experience with the design of single-core processors. Although it will require a scheduler to access the MTS, the experience and knowledge gained from this task will be greatly beneficial for future projects.

### 2.3 Network-on-chip Architectures

Network-on-chip (NoC) architectures implement on-chip communication mechanisms that are based on network communication principles, such as routing, switching, and massive scalability [4]. NoCs can generally support hundreds to millions of processing cores. Figure 2.3 shows an example 16-core network-on-chip architecture. NoCs can scale to very large sizes without sacrificing performance because each processor core is able to drive the network directly rather than waiting for a shared bus to become free.

As the number of cores in a network-on-chip design grows, more quality of service (QoS) problems arise. Network-on-chip architectures therefore suffer the same problems as conventional networks, such as fairness and throughput [5].



**Figure 2.3:** A multiprocessor network-on-chip architecture with 16 processing nodes. Nodes are connected in a grid formation with routers and links. Image source: [6].

## Chapter 3

# **Project Overview**


This chapter discusses the project's requirements, goals, and structure.

### 3.1 Project Deliverables

The project's deliverables are split into two categories: core deliverables (CD), each of which must be satisfied for the project to be a minimum viable product (MVP), and extended deliverables (ED), which are not required for an MVP and only improve upon an existing feature.

#### 3.1.1 Core Deliverables (CD)

The project's core deliverables are described below.

#### CD1 Design a compact 16-bit RISC instruction set architecture.

The instruction set will be the primary interface to control the processor from software. An instruction set will be required to implement the custom multi-core communication interface.

It was decided to design a new instruction set rather than to extend an existing architecture as this will increase my knowledge of the constraints to consider when designing instruction sets and processors.

#### CD2 Design and implement a Verilog RISC core that implements the ISA in CD1.

The Verilog RISC core will be able to run software programs written for the instruction set architecture.

#### CD3 Design and implement an on-chip interconnect for multi-core processing (2 to 32 cores) using the RISC core from CD2.

The interconnect will be a chief requirement to enable multi-core communication. The interconnect should support up to 32 cores, however FPGA implementation constraints may limit this due to limited resources.

The interconnect will control communication between the cores to enable software parallelism.

#### CD4 Analyse performance of serial and parallel software algorithms, such as parallel DFT, on the processor.

To evaluate the effectiveness of the developed solution, a serial and a parallel implementation of a simple computing algorithm (such as parallel reduction or sorting) will be run on the processor and its performance analysed. Effectiveness will be rated on total algorithm run-time and the speed-up gained by adding more cores.

#### CD5 Allow the RISC core to be easily compiled to multiple FPGA vendors (Xilinx, Altera).

The developed solution should be generic and portable to allow it to be used across a wide-range of FPGA vendors and devices.

Verilog is a generic, implementation-independent hardware-description language, and so modules should be written in an implementation-independent way wherever possible.

A key consideration for this requirement is the varying hard IP provided by the FPGA vendors (such as BRAM, Ethernet, and PCIe [7, 8]). To overcome this problem, the developed Verilog code will use conditional compilation where vendor-specific requirements are present.

#### 3.1.2 Extended Deliverables (ED)

The project's extended deliverables are described below.

- **ED1** Design a RISC core with an instructions-per-clock (IPC) rating of at least 1.0 (a single-cycle CPU).
- **ED2** Design a RISC core with a pipelined data path to increase the design's clock speed.
- **ED3** Design a scalable multi-core interconnect supporting arbitrary (more than 32) RISC core instances (manycore) using a Network-on-Chip (NoC) architecture.
- **ED4** Design a compiler backend for the PRCO304 [9] compiler to support the ISA from CD1. This will make it easier to build complex multi-core software for the processor.
- **ED5** The RISC core can communicate with peripherals via memory-mapped addresses using the Wishbone bus.
- **ED6** Implement various memory-mapped peripherals such as UART, GPIO, and LCD, to aid visual representation of the processor during the demonstration viva.
- **ED7** Store instruction memory in SPI flash.
- **ED8** Reprogram instruction memory at runtime from a host computer.
- **ED9** External processor debugger using a host-processor link.

### 3.2 Project Timeline

#### 3.2.1 Project Stages

The project is split into stages to aid planning and management. There are 8 stage areas: 1. Initial project conception; 2. Basic RISC core development; 3. Extended RISC core development; 4. Multi-core development; 5. Processor quality-of-life (QoL) improvements; 6. Compiler development; 7. Demo preparation; and 8. Final report. The project stages are shown in Table 3.1.

| Stage | Title                                        | Start Date | Days | Core | Applicable Deliverables    |
|-------|----------------------------------------------|------------|------|------|----------------------------|
| 1.0   | Research                                     | Feb 04     | 7    | x    |                            |
| 1.1   | Requirement gathering/review                 | Feb 11     | 14   | x    |                            |
| 1.1   | Processor specification, architecture, ISA   | Feb 18     | 100  | x    | CD1                        |
| 1.2   | Stage/Time Allocation Planning               | Feb 25     | 7    | x    |                            |
| 2.1   | Decoder, Register Set, impl & integration    | Feb 25     | 14   | x    | CD2                        |
| 2.2   | Register set impl & integration              | Mar 04     | 14   | x    | CD2                        |
| 2.3   | Local memory impl & integration              | Mar 11     | 14   | x    | CD2                        |
| 3.1   | Memory mapped register layout & impl         | Apr 01     | 21   |      | ED5                        |
| 3.2   | Wishbone peripheral bus connected to MMU     | Apr 08     | 21   |      | ED5                        |
| 3.3   | Pipelined implementation and verification    | Apr 15     | 21   |      | ED2                        |
| 3.4   | Cache memory design & impl                   | Apr 22     | 28   |      | ED2                        |
| 4.1   | Multi-core communication interface           | TBD        | TBD  | x    | CD3                        |
| 4.2   | Shared-memory controller                     | TBD        | TBD  | x    | CD3                        |
| 4.3   | Scalable multi-core interface (10s of cores) | TBD        | TBD  | x    | CD3                        |
| 4.4   | Multi-core example program (reduction)       | TBD        | TBD  | x    | CD4                        |
| 5.1   | SPI-FPGA interface for OTG programming       | TBD        | TBD  |      | ED7                        |
| 5.2   | FPGA-PC interfacing                          | TBD        | TBD  |      | ED9                        |
| 5.3   | FPGA-PC debugging (instruction breakpoints)  | TBD        | TBD  |      | ED9                        |
| 6.1   | Compiler backend for vmicro16                | TBD        | TBD  |      | ED4                        |
| 6.2   | Compiler support for multi-core codegen      | TBD        | TBD  |      | ED4                        |
| 7.1   | Wishbone peripherals for demo                | TBD        | TBD  | x    | CD4                        |
| 8.1   | Final Report                                 | TBD        | TBD  | x    |                            |

Table 3.1: Project stages throughout the life cycle of the project.

#### 3.2.2 Project Stage Detail

#### Stages 1.0 through 1.2 - Research and Project Conception

These stages cover initial research of existing problems and solutions in the multiprocessor area. The instruction set architecture is also proposed that later stages will implement.

#### Stages 2.1 through 2.3 - Processor module Design, Implementation, and Integration

These stages cover the design, implementation, and integration of key processor core modules such as the instruction decoder, register sets and local memory. Integration of all the modules is a challenging task because some modules have both asynchronous and synchronous signals that need to be timed correctly in order for other modules to receive valid data. An example of this is the register set which has asynchronous read ports that are later clocked in the instruction decode stage.

#### Stages 3.1 through 3.4 – Advanced Processor Implementation

These stages add advanced features to the processor to provide a more functional product. Although these stages are classified as extended, the technical effort required to design and implement them is not great, and so they have time allocations in the project schedule. The extended features that these stages introduce are: pipelined processor stages, to drastically increase processor performance; a memory-mapped peripheral interface through the MMU; a Wishbone master interface to the MMU, allowing external peripherals such as GPIO and LCD displays to be utilised in a modular fashion; and a cache memory for each processor core.

#### Stages 4.1 through 4.4 – Multiprocessor Functionality

These stages are dedicated to adding multiprocessor functionality using a loosely coupled architecture to the processor.

#### Stages 5.1 through 5.3 – Debugging Features

These stages cover debugging features and are classified as extended due to the large development time required to implement them as well as not being related to multiprocessor systems.

#### Stages 6.1 through 6.2 – Compiler Backends

These stages cover the implementation of a compiler backend to ease software writing and programming of the processor.

#### Stage 7.1 – Wishbone Peripherals

Additional Wishbone peripherals, such as SPI and timers, will be added to produce a more useful multiprocessor system.

#### Stage 8.1 – Final Report

This stage is dedicated to the final report write-up. It is expected to be an iterative task that is active throughout the lifespan of the project.

#### 3.2.3 Timeline

The project stages from Table 3.1 are displayed below in a Gantt chart.



**Figure 3.1:** Project stages in a Gantt chart.

### 3.3 Resources

This section describes the hardware and software resources required to fulfil the project.

#### 3.3.1 Hardware Resources

Core deliverable CD5 requires the designed RISC core to be implemented and demonstrated on multiple FPGA devices. Although my design should synthesise for physical IC implementation, this is not a primary development target due to high costs and lengthy production times. Because of my past experience with Xilinx FPGAs from my placement work and with Altera FPGAs from university modules, it was decided to target the Xilinx Spartan 6 XC6SLX9 and the Altera Cyclone V.

#### Terasic DE1-SoC Development Board

The Terasic DE1-SoC development board features a large Cyclone V FPGA and many peripherals, such as seven-segment displays, 64 MB SDRAM, ADCs, and buttons and switches, which will aid demonstration of the project. The development board is available through the university so the cost is negligible. Figure 3.2 shows the peripherals (green) available to the FPGA.



**Figure 3.2:** Terasic DE1-SoC development board featuring the Altera Cyclone V FPGA and many peripherals. Image source: [10].

#### Minispartan 6+ FPGA Development Board

The Minispartan 6+ is a hobbyist FPGA development board with fewer peripherals than the DE1-SoC. The board features a Xilinx Spartan 6 XC6SLX9, which has far fewer resources than the DE1-SoC's Cyclone V. However, its simplicity and my familiarity with Xilinx's software suite will speed up development. The development board is shown in Figure 3.3.



**Figure 3.3:** Minispartan-6+ development board featuring the Xilinx Spartan 6 XC6SLX9. Note that the XC6SLX9 and XC6SLX25 FPGAs share the same board. Image source: [11].

#### 3.3.2 Software Resources

#### **Intel Quartus**

Intel Quartus Prime is a paid-for SoC, CPLD, and FPGA software suite targeting Intel's Stratix, Arria, and Cyclone based FPGAs. The university provides student licences which will be used via VPN.

#### Xilinx ISE Webpack

Xilinx ISE Webpack is Xilinx's free software suite for FPGA development for Spartan 6 based FPGAs. Due to ISE's intuitive and fast workflow, most of the initial simulation and verification processes will be performed using ISE. This will greatly improve development times.

#### Verilator

Verilator is an open-source Verilog to C++ transpiler which provides a C++ interface to simulate Verilog modules and read/write values similar to a test bench. Verilator will be used for specific modules within the RISC core such as the ALU and decoder as Verilator is useful when performing exhaustive verification.

### 3.4 Legal and Ethical Considerations

The RISC core is designed to be used as an academic research and educational tool to aid learning and understanding of RISC and multi-core machines. It should not be used in mission-critical or safety-critical roles.

The processor does not provide any memory protection features and any software running on the processor has full access to all memory.

The processor does not store, track, or predict software instructions. However, it uses pipelining techniques to improve performance, which results in later instructions entering the pipeline even if the software's logical sequence does not include them. This could result in security vulnerabilities similar to the Spectre vulnerability [12].

## Chapter 4

# **Current Progress**


This chapter discusses the current progress made towards the project, including designs, implementation, and current results.

### 4.1 RISC Core

Following the project timeline described in section 3.2, the first couple of months have been dedicated to the design and implementation of the instruction set architecture and RISC core (stages 1-3). Good progress has been made on both deliverables, the ISA and the RISC core, and the project is on schedule with the initial project timeline. The core has been nicknamed *Vmicro16*, short for Verilog microprocessor 16-bit.

#### 4.1.1 Instruction Set Architecture

A 16-bit instruction set architecture (ISA) has been designed using an iterative approach. There are currently 32 unique instructions covering most generic RISC operations (add, load/store, branch, compare, etc.) and at least 16 opcodes available to provide multi-core communication and functionality. This number should be adequate to support these features when work begins on the multi-core project stages (stages 4-7).

#### **Design Goals**

Having past experience designing and implementing ISAs for previous projects, I wanted to use that knowledge to design an even more efficient and compact instruction set that could provide much greater functionality. The technical design goals of the ISA are described below:

#### ISA1 Use a fixed width of 16 bits for all instructions.

This will significantly reduce RTL resources and encourage efficiency by not wasting spare bits. In addition, many SPI flash and RAMs support 16-bit wide data reads which will allow each instruction fetch to only require one clock cycle, thus increasing processor performance.

#### ISA2 Be able to select at least two registers for common instructions.

This will reduce the number of required instructions to manipulate register data. A disadvantage of using two instead of three register selects is that instructions are always destructive: they always *destroy* existing data in the destination register (e.g. R0 = ADD R0 R1), unlike constructive instructions that provide a unique register select for the destination (e.g. R2 = ADD R0 R1).
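The cost of destructive instructions can be sketched with a small Python model of a register file. This is an illustration only: the helper names `mov`, `add2`, and `add3` are hypothetical, the eight-entry register file is assumed from the 3-bit register fields in the instruction table, and the three-operand `add3` is included purely for comparison and is not part of the ISA.

```python
MASK16 = 0xFFFF  # registers are assumed to be 16 bits wide


def mov(regs, rd, ra):
    """Copy register ra into register rd."""
    regs[rd] = regs[ra]


def add2(regs, rd, ra):
    """Two-operand (destructive) add: rd <= rd + ra, overwriting rd."""
    regs[rd] = (regs[rd] + regs[ra]) & MASK16


def add3(regs, rd, ra, rb):
    """Three-operand (constructive) add: rd <= ra + rb (comparison only)."""
    regs[rd] = (regs[ra] + regs[rb]) & MASK16


# Goal: R2 = R0 + R1 while keeping R0 and R1 intact.
two_operand = [5, 7, 0, 0, 0, 0, 0, 0]
mov(two_operand, 2, 0)   # R2 <= R0   (extra instruction to preserve R0)
add2(two_operand, 2, 1)  # R2 <= R2 + R1

three_operand = [5, 7, 0, 0, 0, 0, 0, 0]
add3(three_operand, 2, 0, 1)  # R2 <= R0 + R1 in a single instruction

print(two_operand[:3], three_operand[:3])
```

Both sequences leave the same register contents, but the two-operand form needs an additional MOV whenever the original value of the destination must be kept.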

#### ISA3 Reduce bit-space for frequently used instructions (MOV, MOVI, ADD).

Due to the 16-bit limit, two register selects, and immediate values, the opcode bits are reduced, resulting in fewer unique instructions. To overcome this constraint, spare bits in other instructions will be appended to the opcode bits to extend the opcode range. This, however, will require a more complex decoder that must first switch on the opcode, then switch on any spare bits to determine the final opcode. This method will significantly increase the number of unique instructions provided by the instruction set.
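The two-level decode described above can be sketched in Python as a behavioural model. It assumes the field layout given in the instruction table in this section (opcode in bits 15-11, Rd in bits 10-8, Ra in bits 7-5, and a 5-bit field in bits 4-0) and includes only a handful of encodings listed there. The names `OPCODES`, `BIT_SUBOPS`, and `decode` are illustrative and do not come from the Verilog source.

```python
# Primary opcodes (bits 15-11), limited to a few entries from the table.
OPCODES = {
    0b00000: "NOP",
    0b00001: "LW",
    0b00010: "SW",
    0b00011: "BIT",
}

# Secondary decode for the BIT family: bits 4-0 select the operation.
BIT_SUBOPS = {
    0b00000: "BIT_OR",
    0b00001: "BIT_XOR",
}


def decode(instr: int) -> dict:
    """Split a 16-bit instruction word into fields and resolve its name."""
    opcode = (instr >> 11) & 0x1F
    rd = (instr >> 8) & 0x7
    ra = (instr >> 5) & 0x7
    low5 = instr & 0x1F

    # First level: switch on the primary opcode.
    name = OPCODES.get(opcode, "UNKNOWN")
    # Second level: the BIT family reuses the low bits as extra opcode bits.
    if name == "BIT":
        name = BIT_SUBOPS.get(low5, "BIT_UNKNOWN")
    return {"name": name, "rd": rd, "ra": ra, "low5": low5}


# BIT_XOR R1, R2 -> opcode 00011, Rd = 001, Ra = 010, low bits 00001
word = (0b00011 << 11) | (1 << 8) | (2 << 5) | 0b00001
print(hex(word), decode(word))
```

The second lookup is the additional decoder complexity mentioned above: one primary opcode fans out into several instructions without consuming more of the 5-bit opcode space.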

#### ISA4 Provide frequently used actions as options for existing instructions.

In software, frequently used actions include incrementing/decrementing by 1 and performing logical comparisons, which usually take more than one instruction on some RISC architectures. As they are common actions, the instruction overhead and time may be significant and can affect performance. To provide a solution to this problem, in addition to using spare bits to extend the opcode range, spare bits will be used to signify a frequently used action to be performed by the ALU.

As shown in Figure 4.1, frequently used commands such as incrementing/decrementing and logical comparisons are provided by setting spare bits to special values. For example, the instructions ARITH\_UADDI and ARITH\_SSUBI extend the ARITH\_U and ARITH\_S opcodes by using the spare bit, bit 4. If this bit is not set (0), the instruction accepts a 4-bit immediate value in addition to the two register selects. The 4-bit immediate allows a small constant to be supplied to the ALU, which is useful in software for-loops where an increment or decrement of more than 1 is required.

Another example is the SETC instruction. Inspired by Intel's x86 SETCC, the instruction sets the destination register to zero or one depending on the result of the CMP instruction's flags. Without this instruction, multiple branches would be required to convert the comparison's flags to logical zeros and ones.

#### ISA5 Provide instructions for performing bitwise manipulations.

RISC processors are commonly used in microprocessor and microcontroller applications, which typically involve bit manipulation. The ISA provides bitwise OR, XOR, AND, NOT, and shift instructions under a single opcode to fill this need.

#### ISA6 Provide instructions for explicitly performing signed and unsigned arithmetic.

Performing signed and unsigned arithmetic is a key requirement for RISC applications and so it was decided to provide such instructions. Software programmers can easily switch between signed and unsigned arithmetic by setting bit 11 in the ARITH instruction family. Being able to change between signed and unsigned arithmetic instructions by changing a single bit will make the RISC processor's decoder module smaller and less complex.

Without explicit unsigned and signed instructions, extra instructions would be required to perform addition and subtraction. In addition, due to two's complement representation of signed numbers, the highest immediate operand value would be halved, resulting in more instructions to reach the desired value.

| Instruction                  | Opcode | Field 1 | Field 2   | Field 3 | Operation                 |
|------------------------------|--------|---------|-----------|---------|---------------------------|
| *Format: rd ra simm5*        | 15-11  | 10-8    | 7-5       | 4-0     | bit positions             |
| *Format: rd imm8*            | 15-11  | 10-8    | 7-0       |         | bit positions             |
| *Format: nop*                | 15-11  | 10-0    |           |         | bit positions             |
| *Format: extended immediate* | 15     | 14-12   | 11-0      |         | bit positions             |
| NOP                          | 00000  | X       |           |         | no operation              |
| LW                           | 00001  | Rd      | Ra        | s5      | Rd <= RAM[Ra+s5]          |
| SW                           | 00010  | Rd      | Ra        | s5      | RAM[Ra+s5] <= Rd          |
| BIT                          | 00011  | Rd      | Ra        | s5      | bitwise operations        |
| BIT_OR                       | 00011  | Rd      | Ra        | 00000   | Rd <= Rd \| Ra            |
| BIT_XOR                      | 00011  | Rd      | Ra        | 00001   | Rd <= Rd ^ Ra             |
| BIT_AND                      | 00011  | Rd      | Ra        | 00010   | Rd <= Rd & Ra             |
| BIT_NOT                      | 00011  | Rd      | Ra        | 00011   | Rd <= ~Ra                 |
| BIT_LSHFT                    | 00011  | Rd      | Ra        | 00100   | Rd <= Rd << Ra            |
| BIT_RSHFT                    | 00011  | Rd      | Ra        | 00101   | Rd <= Rd >> Ra            |
| MOV                          | 00100  | Rd      | Ra        | X       | Rd <= Ra                  |
| MOVI                         | 00101  | Rd      | i8        |         | Rd <= i8                  |
| ARITH_U                      | 00110  | Rd      | Ra        | s5      | unsigned arithmetic       |
| ARITH_UADD                   | 00110  | Rd      | Ra        | 11111   | Rd <= uRd + uRa           |
| ARITH_USUB                   | 00110  | Rd      | Ra        | 10000   | Rd <= uRd - uRa           |
| ARITH_UADDI                  | 00110  | Rd      | Ra        | 0AAAA   | Rd <= uRd + uRa + AAAA    |
| ARITH_S                      | 00111  | Rd      | Ra        | s5      | signed arithmetic         |
| ARITH_SADD                   | 00111  | Rd      | Ra        | 11111   | Rd <= sRd + sRa           |
| ARITH_SSUB                   | 00111  | Rd      | Ra        | 10000   | Rd <= sRd - sRa           |
| ARITH_SSUBI                  | 00111  | Rd      | Ra        | 0AAAA   | Rd <= sRd - sRa + AAAA    |
| BR                           | 01000  | Rd      | i8        |         | conditional branch        |
| BR_U                         | 01000  | Rd      | 0000 0000 |         | Any                       |
| BR_E                         | 01000  | Rd      | 0000 0001 |         | Z=1                       |
| BR_NE                        | 01000  | Rd      | 0000 0010 |         | Z=0                       |
| BR_G                         | 01000  | Rd      | 0000 0011 |         | Z=0 and S=O               |
| BR_GE                        | 01000  | Rd      | 0000 0100 |         | S=O                       |
| BR_L                         | 01000  | Rd      | 0000 0101 |         | S != O                    |
| BR_LE                        | 01000  | Rd      | 0000 0110 |         | Z=1 or (S != O)           |
| BR_S                         | 01000  | Rd      | 0000 0111 |         | S=1                       |
| BR_NS                        | 01000  | Rd      | 0000 1000 |         | S=0                       |
| CMP                          | 01001  | Rd      | Ra        | X       | SZO <= CMP(Rd, Ra)        |
| SETC                         | 01010  | Rd      | Ra        | X       | Rd <= Imm8 == SZO ? 1 : 0 |
| MOVI_LARGE                   | 1      | Rd      | i12       |         | Rd <= i12                 |

Figure 4.1: Initial Vmicro16 16-bit instruction set architecture. Coloured regions represent instruction families (bitwise, branching, arithmetic, etc.).

The ISA table is shown in Figure 4.1. The top 5 bits (15-11) are dedicated to the opcode, resulting in 32 unique values. Currently only bits 14-11 are used (NOP to SETC), leaving the top bit spare. Initially, this bit was reserved to indicate an extended-immediate instruction, MOVI12 (listed as MOVI\_LARGE in Figure 4.1), supporting a large 12-bit immediate value. Later in the design, however, it was decided that the top bit would indicate special instructions dedicated to multi-core operation. This leaves 16 spare unique opcodes for this purpose.
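To make the bit layout concrete, the two main formats in Figure 4.1 (rd ra simm5 and rd imm8) can be packed and unpacked as shown below. This is a minimal Python sketch for illustration only; the helper names `encode_rr`, `encode_ri` and `decode_fields` are hypothetical and do not appear in the Vmicro16 source.

```python
def encode_rr(opcode: int, rd: int, ra: int, simm5: int) -> int:
    """Pack the 'rd ra simm5' format: opcode[15:11] rd[10:8] ra[7:5] simm5[4:0]."""
    return ((opcode & 0x1F) << 11) | ((rd & 0x7) << 8) | ((ra & 0x7) << 5) | (simm5 & 0x1F)


def encode_ri(opcode: int, rd: int, imm8: int) -> int:
    """Pack the 'rd imm8' format: opcode[15:11] rd[10:8] imm8[7:0]."""
    return ((opcode & 0x1F) << 11) | ((rd & 0x7) << 8) | (imm8 & 0xFF)


def decode_fields(word: int) -> dict:
    """Extract every field; the opcode decides which of them are meaningful."""
    return {
        "opcode": (word >> 11) & 0x1F,
        "rd": (word >> 8) & 0x7,
        "ra": (word >> 5) & 0x7,
        "simm5": word & 0x1F,
        "imm8": word & 0xFF,
    }


# MOVI R1, 0x2A uses opcode 00101; LW R2, (R3+4) uses opcode 00001.
movi_word = encode_ri(0b00101, 1, 0x2A)
lw_word = encode_rr(0b00001, 2, 3, 4)
```

Note that the `ra`/`simm5` and `imm8` fields overlap, which is why a decoder has to select between them based on the opcode.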

#### 4.1.2 Design and Implementation

The RISC core design is a traditional 5-stage processor (fetch, decode, execute, memory, write-back).

To satisfy CD5, the Verilog code will be self-contained in a single file. This reduces the hierarchical complexity and eases cross-vendor project set-up as only a single file is required to be included. A disadvantage with this single file approach is that some external Verilog verification tools that I plan to use, such as Verilator, do not currently support multiple Verilog modules (due to an unfixed bug) within a single file.



Figure 4.2: Vmicro16 RISC 5-stage RTL diagram showing: instruction pipelining (data passed forward through clocked register banks at each stage); branch address calculation; ALU operand calculation (rd2 or imm); and program counter incrementing.

#### **Instruction and Data Memory**

The design uses separate instruction and data memories, similar to a Harvard architecture computer. This architecture was chosen because I find it easier to implement.

#### **Register File**

To support design goal **ISA2**, the register set features a dual-port read and single-port write. This allows instructions to read 2 registers simultaneously for any instruction. The single-port write allows the instruction output to be written to the register file.

#### **Pipelining**

Extended deliverable ED1 is to sustain at least one instruction per clock. My previous processor designs all required multiple clocks per instruction, as that is much easier to implement. Modern processors can complete one or more instructions per clock through the use of instruction pipelining. This technique increases the throughput of the processor by operating all stages in parallel. In this pipeline, instructions still travel through each stage in the same order; the difference is that the fetch stage does not wait for the final stage to complete and so fetches a new instruction every clock cycle, resulting in each stage operating on new data every clock cycle. ED1 was proposed to extend my knowledge of CPU pipelining.
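The throughput benefit can be estimated with simple arithmetic. The sketch below is an idealised Python model that assumes a 5-stage pipeline with no stalls or bubbles; the function names are hypothetical and the figures are not measurements of the Vmicro16 core.

```python
def cycles_unpipelined(n_instructions: int, stages: int = 5) -> int:
    """Each instruction occupies the whole datapath for `stages` clocks."""
    return n_instructions * stages


def cycles_pipelined(n_instructions: int, stages: int = 5) -> int:
    """Ideal pipeline: `stages` clocks to fill, then one instruction per clock."""
    if n_instructions <= 0:
        return 0
    return stages + n_instructions - 1


def ideal_ipc(n_instructions: int, stages: int = 5) -> float:
    """Instructions per clock for the ideal (hazard-free) pipeline."""
    cycles = cycles_pipelined(n_instructions, stages)
    return n_instructions / cycles if cycles else 0.0
```

Under these assumptions, 100 instructions take 500 clocks without pipelining and 104 clocks with it, so the instructions-per-clock figure approaches 1 as the program length grows. Hazards reduce this in practice.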

Instruction pipelining is harder to implement as data and control hazards can occur. Data hazards occur when an instruction depends on the output of a previous instruction that has not yet left the pipeline, for example a register dependency. One method to detect this hazard is to check whether the register selects in the decode stage are present in later stages of the pipeline. If so, the current instruction depends on an instruction still in the pipeline, and the processor can either wait until that instruction has left the pipeline (i.e. its result has been written back to the registers) or insert a NOP that produces a *bubble* in the pipeline, allowing the final stage to execute before the dependent instruction continues.

Control hazards occur when conditional or interrupt branching instructions are in the pipeline and their result has not yet been calculated. Subsequent instructions then enter the pipeline even though the branch outcome may mean they should not be executed. To handle this hazard, a global flag is set for instructions that perform branching or conditional execution. Only once the outcome of the conditional check is known are the stages after decode allowed to commit their results. Fortunately, this technique is fairly simple to implement.

This project's RISC processor implements both hazard detectors and the logic to resolve them. The data hazard resolver implements a valid signal that is passed forward from stage to stage. This signal is low when a hazard has occurred and indicates that the receiving stage should not operate on the previous stage's data. Each stage's valid signal depends on the previous stage's valid signal, which allows later stages to stall when a hazard is detected in earlier stages. A diagram of the hazard detection in the processor is shown in Figure 4.3.
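The register-select comparison used for data hazard detection can be sketched as follows. This is a simplified Python model, not the Verilog implementation: the function name `has_data_hazard` is hypothetical, and stages holding a bubble or an instruction that writes no register are represented by `None`.

```python
from typing import Iterable, Optional


def has_data_hazard(src_regs: Iterable[int],
                    in_flight_dests: Iterable[Optional[int]]) -> bool:
    """Return True if the decoding instruction reads a register that an
    instruction further down the pipeline has not yet written back."""
    pending = {dest for dest in in_flight_dests if dest is not None}
    return any(reg in pending for reg in src_regs)


def stage_valid(previous_valid: bool, hazard: bool) -> bool:
    """A stage's valid signal is low if the previous stage was invalid or a
    hazard was detected, so invalid data is never operated on downstream."""
    return previous_valid and not hazard
```

For example, an instruction reading R1 and R2 while an earlier instruction that writes R2 is still in the execute stage gives `has_data_hazard([1, 2], [2, None, None]) == True`, and the resulting low valid signal stalls the stages that follow.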

**Figure 4.3:** Pipeline data hazard detection. The register selects are passed forward through each stage and compared to the IDEX (latest instruction) register selects. If they match, the latest instruction depends on the output of an instruction in the pipeline, so the IFID and IDEX stages are stalled to allow the instruction in the pipeline to commit.

#### Memory Management Unit

It was decided to use a memory management unit (MMU) to make communication with external peripherals and additional registers easier and more extensible. This method transparently uses the existing LW/SW instructions, which removes the requirement for a unique instruction for each peripheral.

#### **Proposed Memory Mapped Addresses**

The peripheral addresses are currently based on classes. For example, a memory-mapped address may use the upper byte to address a peripheral and the lower byte to address a register/function in that peripheral.

Later in the project, I plan to rewrite the addressing scheme to use a simpler address format which is closer to commonly used peripheral addressing schemes used today. The proposed memory mapped addresses for each system and peripheral are listed below.

| Address (16-bit aligned) | Peripheral Name                                                       |
|--------------------------|-----------------------------------------------------------------------|
| 0x0000                   | NOP (reads return 0, writes do nothing)                               |
| 0x00ZZ                   | Per-core scratch RAM (ZZ = 8-bit RAM address)                         |
| 0x0100                   | Extended Core Registers 1                                             |
| 0x0200                   | Extended Core Registers 2                                             |
| 0x03ZZ                   | Wishbone master controller select (ZZ = 8-bit Wishbone slave address) |
| 0x1XYZ                   | Master core controller (X = slave select, Y = instruction, Z = data)  |

Table 4.1: Provisional memory-mapped addresses table.
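One possible way to decode the provisional map in Table 4.1 is sketched below in Python. The region names, the `decode_address` function, and the choice to treat address 0x0000 as the NOP location ahead of the scratch RAM range are all assumptions made for illustration; the final addressing scheme is expected to change.

```python
def decode_address(addr: int):
    """Map a 16-bit address to (region, detail) following Table 4.1."""
    addr &= 0xFFFF
    if addr == 0x0000:
        return ("nop", 0)                     # reads return 0, writes ignored
    if (addr >> 12) == 0x1:
        # 0x1XYZ: X = slave select, Y = instruction, Z = data
        return ("master_core_ctrl",
                ((addr >> 8) & 0xF, (addr >> 4) & 0xF, addr & 0xF))
    upper, lower = addr >> 8, addr & 0xFF
    if upper == 0x00:
        return ("scratch_ram", lower)         # lower byte = RAM address
    if upper == 0x01:
        return ("ext_regs_1", lower)
    if upper == 0x02:
        return ("ext_regs_2", lower)
    if upper == 0x03:
        return ("wishbone", lower)            # lower byte = slave address
    return ("unmapped", addr)
```

Using the upper byte as the peripheral class keeps the decode to a handful of comparisons, which matches the class-based scheme described above.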

#### **ALU Design**

The Vmicro16's ALU is an asynchronous module that has 3 inputs: data a; data b; and opcode op, and outputs data value c. The ALU is able to operate on both register data (rd1 and rd2) and immediate values. A switch is used to set the b input to either the rd2 or imm value from the previous stage.

Currently, the ALU does not store flags to indicate overflow, equality, or zero values in the module itself. Instead the ALU outputs the result of the CMP, which calculates such flags, to be written back to the register set in the write-back stage. This means that in order to perform a conditional operation, such as a branch, the register containing the CMP flags must be included in the instruction.



Figure 4.4: Vmicro16 ALU diagram showing clocked inputs from the previous IDEX stage being

The Verilog implementation of the ALU is shown in Figure 4.5. The ALU's asynchronous output is clocked with other registers, such as destination register rs1 and other control signals, in the EXME register bank.

```
always @(*) case (op)
    // branch/nop, output nothing
    `VMICRO16_ALU_BR,
    `VMICRO16_ALU_NOP:       c = 0;

    // load/store addresses (use value in rd2)
    `VMICRO16_ALU_LW,
    `VMICRO16_ALU_SW:        c = b;

    // bitwise operations
    `VMICRO16_ALU_BIT_OR:    c = a | b;
    `VMICRO16_ALU_BIT_XOR:   c = a ^ b;
    `VMICRO16_ALU_BIT_AND:   c = a & b;
    `VMICRO16_ALU_BIT_NOT:   c = ~(b);
    `VMICRO16_ALU_BIT_LSHFT: c = a << b;
    `VMICRO16_ALU_BIT_RSHFT: c = a >> b;
    // ... (excerpt ends here; remaining cases are in vmicro16.v)
```

Figure 4.5: Vmicro16's ALU implementation named vmicro16\_alu. vmicro16.v
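The behaviour of the operations shown in Figure 4.5 can be cross-checked against a small software model. The Python sketch below is an assumption-laden reference model, not generated from the RTL: the operation names are plain strings chosen for illustration, values are masked to 16 bits, and only the cases visible in the excerpt are covered.

```python
MASK16 = 0xFFFF


def alu(op: str, a: int, b: int) -> int:
    """Reference model of the combinational ALU cases listed in Figure 4.5."""
    a &= MASK16
    b &= MASK16
    if op in ("BR", "NOP"):
        return 0                      # branch/nop: output nothing
    if op in ("LW", "SW"):
        return b                      # load/store: pass the b operand through
    if op == "BIT_OR":
        return a | b
    if op == "BIT_XOR":
        return a ^ b
    if op == "BIT_AND":
        return a & b
    if op == "BIT_NOT":
        return ~b & MASK16            # invert b, keep 16 bits
    if op == "BIT_LSHFT":
        return (a << b) & MASK16      # bits shifted past bit 15 are lost
    if op == "BIT_RSHFT":
        return a >> b                 # logical shift (operands are unsigned)
    raise ValueError("operation not covered by this model: " + op)
```

A model like this could later serve as the expected-value source for an automatic test bench.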

#### **Decoder Design**

Instruction decoding occurs between the IFID and IDEX stages. The decoder extracts register selects and operands from the input instruction. The decoder outputs are asynchronous, which allows the register selects to be passed to the register set and the register data to be read asynchronously. The register selects and register read data are then clocked into the IDEX register bank.

```
always @(*) case (opcode)
    `VMICRO16_OP_HALT, // TODO: stop ifid
    `VMICRO16_OP_NOP:    alu_op = `VMICRO16_ALU_NOP;

    `VMICRO16_OP_LW:     alu_op = `VMICRO16_ALU_LW;
    `VMICRO16_OP_SW:     alu_op = `VMICRO16_ALU_SW;

    `VMICRO16_OP_MOV:    alu_op = `VMICRO16_ALU_MOV;
    `VMICRO16_OP_MOVI:   alu_op = `VMICRO16_ALU_MOVI;
    `VMICRO16_OP_MOVI_L: alu_op = `VMICRO16_ALU_MOVI_L;

    `VMICRO16_OP_BR:     alu_op = `VMICRO16_ALU_BR;

    // second-level switch on the spare bits selects the bitwise operation
    `VMICRO16_OP_BIT:    casez (simm5)
        `VMICRO16_OP_BIT_OR:    alu_op = `VMICRO16_ALU_BIT_OR;
        `VMICRO16_OP_BIT_XOR:   alu_op = `VMICRO16_ALU_BIT_XOR;
        `VMICRO16_OP_BIT_AND:   alu_op = `VMICRO16_ALU_BIT_AND;
        `VMICRO16_OP_BIT_NOT:   alu_op = `VMICRO16_ALU_BIT_NOT;
        `VMICRO16_OP_BIT_LSHFT: alu_op = `VMICRO16_ALU_BIT_LSHFT;
        `VMICRO16_OP_BIT_RSHFT: alu_op = `VMICRO16_ALU_BIT_RSHFT;
        default:                alu_op = `VMICRO16_ALU_BAD; endcase
    // ... (excerpt ends here; remaining cases are in vmicro16.v)
```

Figure 4.6: Vmicro16's decoder module code showing nested bit switches to determine the intended opcode. vmicro16.v

In Figure 4.6, it can be seen that the first 8 opcode cases are represented using the same 15-11 bits, however the VMICRO16\_OP\_BIT instructions require another bit range to be compared to determine the output opcode.

#### 4.1.3 Verification

Currently, the only verification method used is manual inspection of the output waveforms of a test bench. For now, it is easier and faster to spot erroneous states by hand due to the large complexity of the pipeline. Later in the project, automatic test benches will be utilised.

#### **Known Bugs**

Known bugs exist within the RISC core; however, none are critical as they can be easily avoided in software.

#### BUG1 Stall detection does not consider load/store instructions.

Due to the instruction pipelining techniques used by the processor and the lack of address checking in the EXME and MEWB stages, an LW instruction immediately after an SW instruction to the same address:

```
SW R0 (R2+16)
LW R1 (R2+16)
```

will not return the previously stored value. In addition, because the target address is calculated by the ALU (e.g. R2+16), detecting matching addresses at the IFID and IDEX stages is not trivial, and so a hardware fix is not planned for the final version. It is possible to overcome this problem in software by placing at least 5 NOP instructions after each SW.
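The software workaround could be automated by an assembler pass. The Python sketch below is a hypothetical helper (no such tool exists in the project yet) that assumes one instruction per string with the mnemonic as the first word, and pads every SW with the five NOPs suggested above.

```python
from typing import List


def pad_stores(program: List[str], nops: int = 5) -> List[str]:
    """Insert `nops` NOP instructions after every SW so that a following LW
    cannot read the address before the store has committed."""
    padded: List[str] = []
    for instr in program:
        padded.append(instr)
        words = instr.split()
        if words and words[0].upper() == "SW":
            padded.extend(["NOP"] * nops)
    return padded
```

Padding every store is conservative; a more selective pass would only pad when a later LW may target the same address, but that requires the address analysis that makes the hardware fix non-trivial.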

## Chapter 5

## **Future Work**

- 5.1 Project Status
  - 5.1.1 Updated Project Time Line
  - 5.1.2 Future Work

### 5.1 Project Status

Four months have passed since the start of the project and significant progress has been made towards the final deliverable.

The current active stage is 3.3 Pipeline Implementation and Verification, where the processor pipeline is being verified against a range of simple software sequences. It is important that this verification is thorough and that the result is bug-free, as future additions to the processor will build on this foundation.

#### 5.1.1 Updated Project Time Line

The project table described in section 3.2 did not allocate times for stages 4.1 and later. This was due to expected high demand from other modules and exams in this time period and so it was decided to not allocate times that would later not be followed.

Now that this time period is closer, time allocations have been assigned for stages 4, 7, and 8. The state of stage 5's extended deliverables, to implement debugging interfaces, has changed from *Unknown* to *Cancelled* due to the expected high workload from other modules in the next month. The cancellation of these stages will not severely affect the final functionality of the deliverable; however, it will make debugging the processor slightly more difficult. It was decided to remove these extended features to allow more time to be spent on core functionality.

The updated project status is shown in Table 5.1 and in Figure 5.1.

#### 5.1.2 Future Work

May and early June are reserved for work on other modules and preparation for exams. From mid-June, work will resume on verifying the end of stage 3 and then work will start on stage 4 (focussed on designing and implementing multiprocessor features). After stage 4, software algorithms will be compiled for the ISA and evaluated against Amdahl's Law.
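For reference, Amdahl's Law bounds the speedup of a program in which only a fraction of the work can be parallelised. The Python sketch below shows the calculation that the evaluation could use; the function name and the example fractions are illustrative assumptions, not project results.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup 1 / ((1 - p) + p / n) for parallel fraction p
    running on n cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

For example, a program that is 90% parallelisable is limited to a speedup of roughly 5.3 on 10 cores, which illustrates why the serial portion dominates as the core count grows.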

| Stage | Title                                        | Start Date | Core | Status    |
|-------|----------------------------------------------|------------|------|-----------|
| 1.0   | Research                                     | Feb 04     | x    | Completed |
| 1.1   | Requirement gathering/review                 | Feb 11     | x    | Completed |
| 1.1   | Processor specification, architecture, ISA   | Feb 18     | x    | Completed |
| 1.2   | Stage/Time Allocation Planning               | Feb 25     | x    | Completed |
| 2.1   | Decoder, Register Set, impl & integration    | Feb 25     | x    | Completed |
| 2.2   | Register set impl & integration              | Mar 04     | x    | Completed |
| 2.3   | Local memory impl & integration              | Mar 11     | x    | Completed |
| 3.1   | Memory mapped register layout & impl         | Apr 01     |      | On-going  |
| 3.2   | Wishbone peripheral bus connected to MMU     | Apr 08     |      | On-going  |
| 3.3   | Pipeline implementation and verification     | Apr 15     |      | On-going  |
| 3.4   | Cache memory design & impl                   | Apr 22     |      | Cancelled |
| 4.1   | Multi-core communication interface           | Jun 05     | x    | Planned   |
| 4.2   | Shared-memory controller                     | Jun 05     | x    | Planned   |
| 4.3   | Scalable multi-core interface (10s of cores) | Jul 01     | x    | Planned   |
| 4.4   | Multi-core example program (reduction)       | Jul 10     | x    | Planned   |
| 5.1   | SPI-FPGA interface for OTG programming       | TBD        |      | Cancelled |
| 5.2   | FPGA-PC interfacing                          | TBD        |      | Cancelled |
| 5.3   | FPGA-PC debugging (instruction breakpoints)  | TBD        |      | Cancelled |
| 6.1   | Compiler backend for vmicro16                | TBD        |      | Unknown   |
| 6.2   | Compiler support for multi-core codegen      | TBD        |      | Unknown   |
| 7.1   | Wishbone peripherals for demo                | Aug 01     | x    | Planned   |
| 8.1   | Final Report                                 | Jun 05     | x    | Planned   |

Table 5.1: Updated project stages.



Figure 5.1: Updated project Gantt chart showing time allocations for stage 4.

## Chapter 6

## Conclusion

With the end of Moore's Law approaching, processor designers must use other strategies to continue improving processor performance, with multiprocessing and parallelism being a primary strategy. This project sets out to improve my knowledge of multiprocessor communication by designing, implementing, and verifying a multiprocessor, and I believe starting from scratch is the best way to accomplish this learning task.

To date, a compact 16-bit RISC instruction set has been designed and implemented in a Verilog single-core processor. Whilst single-core verification is still on-going, good progress has been made, and extended deliverables from stage 3, such as instruction pipelining and memory-mapped peripherals via a Wishbone bus, have been implemented successfully.

Stage 5's extended deliverables and the cache memory have been cancelled, but they do not affect the core functionality of the processor. The planned project time-line for future stages is realistic and accomplishing the project's goals appears achievable.


#### References

[1] V. Subramanian, "Multiple gate field-effect transistors for future CMOS technologies," *IETE Technical review*, vol. 27, no. 6, pp. 446–454, 2010.

- [2] M. J. Flynn, *Computer architecture: Pipelined and parallel processor design*. Jones & Bartlett Learning, 1995.
- [3] Tech Differences, "Difference between loosely coupled and tightly coupled multiprocessor (with comaprison chart)," system Jul 2017. [Online]. Available: https://techdifferences.com/ difference-between-loosely-coupled-and-tightly-coupled-multiprocessor-system.html (Accessed 2019-04-20).
- [4] L. Benini and G. De Micheli, "Networks on Chips: A new SoC paradigm," *Computer*, vol. 35, pp. 70–78, 02 2002.
- [5] D. Zhu, L. Chen, S. Yue, T. M. Pinkston, and M. Pedram, "Balancing On-Chip Network Latency in Multi-application Mapping for Chip-Multiprocessors," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 872–881.
- [6] N. Chatterjee, S. Paul, and S. Chattopadhyay, "Fault-tolerant dynamic task mapping and scheduling for network-on-chip-based multicore platform," *ACM Transactions on Embedded Computing Systems*, vol. 16, pp. 1–24, 05 2017.
- [7] Xilinx, Spartan-6 FPGA Block RAM Resources, Xilinx.
- [8] Altera, Recommended HDL Coding Styles QII51007-9.0.0, Altera.
- [9] B. Lancaster, "FPGA-based RISC Microprocessor and Compiler," vol. 3.14, pp. 37–50. [Online]. Available: https://github.com/bendl/prco304 (Accessed March 2018).
- [10] Terasic Technologies, "SoC Platform Cyclone DE1-SoC Board." [Online]. Available: https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=836 (Accessed 2019-04-20).
- [11] *MiniSpartan6+*, Scarab Hardware, 2014. [Online]. Available: https://www.scarabhardware.com/minispartan6/ (Accessed 2019-04-20).
- [12] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," arXiv preprint arXiv:1801.01203, 2018.

# Appendix A - Code Listing

#### vmicro16.v

The single core RISC processor is defined in this file. It contains many submodules such as the decoder and local memory.

```
// This file contains multiple modules.
// Verilator likes 1 file for each module
/* verilator lint_off DECLFILENAME */
/* verilator lint_off UNUSED */
/* verilator lint_off BLKSEQ */
/* verilator lint_off WIDTH */

// Include Vmicro16 ISA containing definitions for the bits
`include "vmicro16_isa.v"

// This module aims to be a SYNCHRONOUS, WRITE_FIRST BLOCK RAM
//   https://www.xilinx.com/support/documentation/user_guides/ug473_7Series_Memory_Resources.pdf
//   https://www.xilinx.com/support/documentation/user_guides/ug383.pdf
//   https://www.xilinx.com/support/documentation/sw_manuals/xilinx2016_4/ug901-vivado-synthesis.pdf
module vmicro16_bram # (
    parameter MEM_WIDTH = 16,  // default not legible in this listing; 16 assumed
    parameter MEM_DEPTH = 256  // default not legible in this listing; 256 assumed
) (
    input clk, input reset,

    input      [MEM_WIDTH-1:0] mem_addr,
    input      [MEM_WIDTH-1:0] mem_in,
    input                      mem_we,
    output reg [MEM_WIDTH-1:0] mem_out
);
    // memory vector
    reg [MEM_WIDTH-1:0] mem [0:MEM_DEPTH-1];

    // not synthesizable
    integer i;
    initial for (i = 0; i < MEM_DEPTH; i = i + 1) mem[i] <= 0;

    always @(posedge clk) begin
        // synchronous WRITE_FIRST (page 13)
        if (mem_we) begin
            mem[mem_addr] <= mem_in;
            $display($time, "\tmem: W mem[%h] <= %h", mem_addr, mem_in);
        end else begin
            mem_out <= mem[mem_addr];
        end
    end

    // TODO: Reset impl = every clock while reset is asserted, clear each cell
    //       one at a time, mem[i++] <= 0
endmodule

// Wishbone wrapper around the register file to use as a peripheral
module vmicro16_regs_wb # (
    parameter CELL_WIDTH    = 16,
    parameter CELL_DEPTH    = 8,
    parameter CELL_SEL_BITS = 3,
    parameter CELL_DEFAULTS = 0,
    parameter DEBUG_NAME    = "",
    parameter PIPELINE_READ = 0
) (
    input clk, input reset,
    // wishbone slave interface
    input                   wb_stb_i,
    input                   wb_cyc_i,
    input                   wb_we_i,
    input  [CELL_WIDTH-1:0] wb_addr_i,
    input  [CELL_WIDTH-1:0] wb_data_i,
    output [CELL_WIDTH-1:0] wb_data_o,
    output                  wb_ack_o,
    output                  wb_stall_o,
    output                  wb_err_o
);
    // embedded register data
    wire [CELL_SEL_BITS-1:0] reg_rs1;
    wire [CELL_WIDTH-1:0]    reg_rd1;
```

```
    wire [CELL_WIDTH-1:0]    reg_wd;

    reg selected;
    always @(*) begin
        if (wb_stb_i)
            selected = 1'b1;
        else if (selected && wb_cyc_i)
            selected = 1'b1;
        else if (selected && !wb_cyc_i)
            selected = 1'b0;
        else selected = 1'b0;
    end

    assign reg_we  = wb_we_i && wb_stb_i;
    assign reg_rs1 = wb_addr_i[CELL_SEL_BITS-1:0];
    assign reg_wd  = wb_data_i;

    // Only stall on write requests
    //assign wb_stall_o = wb_cyc_i && wb_we_i;
    // TODO: Stall for 1 clock if pipelined
    assign wb_stall_o = PIPELINE_READ ? wb_stb_i : 1'b0;

    // Drive the bus only when selected for a read; otherwise release all
    // CELL_WIDTH bits to high impedance
    assign wb_data_o = (selected && !wb_we_i) ? reg_rd1 : {CELL_WIDTH{1'bZ}};

    // TODO: ack on same stb clock allowed?
    assign wb_ack_o = (wb_cyc_i && !wb_stall_o) ? 1'b1 : 1'bZ;

    vmicro16_regs # (
        .CELL_WIDTH    (CELL_WIDTH),
        .CELL_DEPTH    (CELL_DEPTH),
        .CELL_SEL_BITS (CELL_SEL_BITS),
        .CELL_DEFAULTS (CELL_DEFAULTS),
        .DEBUG_NAME    (DEBUG_NAME),
        .PIPELINE_READ (PIPELINE_READ)
    ) vmicro16_regs_wb_i (
        .clk    (clk)
        ,.reset (reset)
        // Read port 1 (IDEX)
        ,.rs1   (reg_rs1)
        ,.rd1   (reg_rd1)
        // Read port 2 (IDEX) unused
        //,.rs2 (dec_rs2)
        //,.rd2 (reg_rd2)
        // Write port (MEWB)
        ,.we    (reg_we)
        ,.ws1   (reg_rs1)
        ,.wd    (reg_wd)
    );
endmodule

module vmicro16_regs # (
    parameter CELL_WIDTH    = 16,
    parameter CELL_DEPTH    = 8,
    parameter CELL_SEL_BITS = 3,
    parameter CELL_DEFAULTS = 0,
    parameter DEBUG_NAME    = "",
    parameter PIPELINE_READ = 0
) (
    input clk, input reset,
    // ID/EX stage reg reads
    // Dual port register reads
    input      [CELL_SEL_BITS-1:0] rs1, // port 1
    output reg [CELL_WIDTH-1:0]    rd1,
    input      [CELL_SEL_BITS-1:0] rs2, // port 2
    output reg [CELL_WIDTH-1:0]    rd2,
    // EX/WB final stage write back
    input                          we,
    input      [CELL_SEL_BITS-1:0] ws1,
    input      [CELL_WIDTH-1:0]    wd
);
    reg [CELL_WIDTH-1:0] regs [0:CELL_DEPTH-1] /*verilator public_flat*/;

    // Initialise registers with default values
    //   Really only used for special registers used by the soc
    // TODO: How to do this on reset?
    initial if (CELL_DEFAULTS) $readmemh(CELL_DEFAULTS, regs);

    integer i;
    always @(posedge clk)
        if (reset) begin
            if (CELL_DEFAULTS) $readmemh(CELL_DEFAULTS, regs); // TODO:
            else for (i = 0; i < CELL_DEPTH; i = i + 1) regs[i] <= {CELL_WIDTH{1'b0}};
        end else if (we) begin
            $display($time, "\tREGS #%s: Writing %h to reg[%d]", DEBUG_NAME, wd, ws1);
            $display($time, "\t\t\t| %h %h %h %h | %h %h %h %h |",
                regs[0], regs[1], regs[2], regs[3], regs[4], regs[5], regs[6], regs[7]);
            // Perform the write
            regs[ws1] <= wd;
        end

    // Registered (pipelined) reads when PIPELINE_READ is set
    generate if (PIPELINE_READ)
        always @(posedge clk)
            if (reset) begin
                rd1 <= {CELL_WIDTH{1'b0}};
                rd2 <= {CELL_WIDTH{1'b0}};
            end else begin
                rd1 <= regs[rs1];
                rd2 <= regs[rs2];
            end
```

```
    else
        // Combinational reads
        always @(*) begin
            rd1 = regs[rs1];
            rd2 = regs[rs2];
        end
    endgenerate
endmodule

// Decoder is hard to parameterise as it's very closely linked to the ISA.
module vmicro16_dec # (
    parameter INSTR_WIDTH    = 16,
    parameter INSTR_OP_WIDTH = 5,
    parameter INSTR_RS_WIDTH = 3,
    parameter ALU_OP_WIDTH   = 5
) (
    //input clk,   // not used yet (all combinational)
    //input reset, // not used yet (all combinational)

    input  [INSTR_WIDTH-1:0]    instr,

    output [INSTR_OP_WIDTH-1:0] opcode,
    output [INSTR_RS_WIDTH-1:0] rd,
    output [INSTR_RS_WIDTH-1:0] ra,
    output [7:0]                imm8,
    output [11:0]               imm12,
    output [4:0]                simm5,

    // This can be freely increased without affecting the isa
    output reg [ALU_OP_WIDTH-1:0] alu_op,

    output reg has_imm8,
    output reg has_imm12,
    output reg has_we,
    output reg has_br,
    output reg has_mem,
    output reg has_mem_we,

    output halt

    // TODO: Use to identify bad instruction and
    //       raise exceptions
    //,output is_bad
);
    assign opcode = instr[15:11];
    assign rd     = instr[10:8];
    assign ra     = instr[7:5];
    assign imm8   = instr[7:0];
    assign imm12  = instr[11:0];
    assign simm5  = instr[4:0];

    // Special opcodes
    assign halt = (opcode == `VMICRO16_OP_HALT);

    always @(*) case (opcode)
        `VMICRO16_OP_NOP:    alu_op = `VMICRO16_ALU_NOP;

        `VMICRO16_OP_LW:     alu_op = `VMICRO16_ALU_LW;
        `VMICRO16_OP_SW:     alu_op = `VMICRO16_ALU_SW;

        `VMICRO16_OP_MOV:    alu_op = `VMICRO16_ALU_MOV;
        `VMICRO16_OP_MOVI:   alu_op = `VMICRO16_ALU_MOVI;
        `VMICRO16_OP_MOVI_L: alu_op = `VMICRO16_ALU_MOVI_L;

        `VMICRO16_OP_BR:     alu_op = `VMICRO16_ALU_BR;

        `VMICRO16_OP_BIT: casez (simm5)
            `VMICRO16_OP_BIT_OR:    alu_op = `VMICRO16_ALU_BIT_OR;
            `VMICRO16_OP_BIT_XOR:   alu_op = `VMICRO16_ALU_BIT_XOR;
            `VMICRO16_OP_BIT_AND:   alu_op = `VMICRO16_ALU_BIT_AND;
            `VMICRO16_OP_BIT_NOT:   alu_op = `VMICRO16_ALU_BIT_NOT;
            `VMICRO16_OP_BIT_LSHFT: alu_op = `VMICRO16_ALU_BIT_LSHFT;
            `VMICRO16_OP_BIT_RSHFT: alu_op = `VMICRO16_ALU_BIT_RSHFT;
            default:                alu_op = `VMICRO16_ALU_BAD; endcase

        `VMICRO16_OP_ARITH_U: casez (simm5)
            `VMICRO16_OP_ARITH_UADD:  alu_op = `VMICRO16_ALU_ARITH_UADD;
            `VMICRO16_OP_ARITH_USUB:  alu_op = `VMICRO16_ALU_ARITH_USUB;
            `VMICRO16_OP_ARITH_UADDI: alu_op = `VMICRO16_ALU_ARITH_UADDI;
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

        `VMICRO16_OP_ARITH_S: casez (simm5)
            // ...
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

        default: begin
            alu_op = `VMICRO16_ALU_BAD;
            $display($time, "\tDEC: unknown opcode: %h", opcode);
        end
    endcase

    // Register writes
    always @(*) case (opcode)
        `VMICRO16_OP_LW,
        `VMICRO16_OP_MOV,
        `VMICRO16_OP_MOVI,
        `VMICRO16_OP_MOVI_L,
```

```
        `VMICRO16_OP_ARITH_U,
        `VMICRO16_OP_ARITH_S,
        `VMICRO16_OP_CMP,
        `VMICRO16_OP_SETC: has_we = 1'b1;
        default:           has_we = 1'b0;
    endcase

    // Contains 8-bit immediate
    always @(*) case (opcode)
        // ...
        `VMICRO16_OP_CMP:  has_imm8 = 1'b1;
        default:           has_imm8 = 1'b0;
    endcase

    always @(*) case (opcode)
        // ...:            has_imm12 = 1'b1;
        default:           has_imm12 = 1'b0;
    endcase

    // Will branch the pc
    always @(*) case (opcode)
        `VMICRO16_OP_BR:   has_br = 1'b1;
        default:           has_br = 1'b0;
    endcase

    // Loads and stores both access memory
    always @(*) case (opcode)
        `VMICRO16_OP_LW,
        `VMICRO16_OP_SW:   has_mem = 1'b1;
        default:           has_mem = 1'b0;
    endcase

    // Only stores write to memory
    always @(*) case (opcode)
        `VMICRO16_OP_SW:   has_mem_we = 1'b1;
        default:           has_mem_we = 1'b0;
    endcase
endmodule

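// Illustrative sketch, not part of the original source: how the decoder's
// field slices apply to a single 16-bit instruction word.
// Assumptions: the module name and EXAMPLE_OP are hypothetical; the real
// opcode encodings are the `VMICRO16_OP_* macros defined elsewhere.
module vmicro16_dec_fields_example;
    localparam [4:0] EXAMPLE_OP = 5'b00101;

    // {opcode[15:11], rd[10:8], imm8[7:0]}, the layout used in mem_cache
    wire [15:0] instr = {EXAMPLE_OP, 3'h1, 8'h0A};

    initial begin
        #1;
        // ra and simm5 overlap imm8, so only one view is meaningful
        // for any given opcode.
        $display("opcode=%b rd=%h ra=%h imm8=%h simm5=%h",
                 instr[15:11], instr[10:8], instr[7:5],
                 instr[7:0],   instr[4:0]);
        // prints: opcode=00101 rd=1 ra=0 imm8=0a simm5=0a
        $finish;
    end
endmodule
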
module vmicro16_alu # (
    parameter OP_WIDTH   = 5,
    parameter DATA_WIDTH = 16
) (
    // input clk, // TODO: make clocked
    input      [OP_WIDTH-1:0]   op,
    input      [DATA_WIDTH-1:0] a, // rs1/dst
    input      [DATA_WIDTH-1:0] b, // rs2
    output reg [DATA_WIDTH-1:0] c
);
    always @(*) case (op)
        // branch/nop, output nothing
        // `VMICRO16_ALU_BR, ...
        // ...

        // bitwise operations
        `VMICRO16_ALU_BIT_OR:    c = a | b;
        `VMICRO16_ALU_BIT_XOR:   c = a ^ b;
        `VMICRO16_ALU_BIT_AND:   c = a & b;
        `VMICRO16_ALU_BIT_NOT:   c = ~(b);
        `VMICRO16_ALU_BIT_LSHFT: c = a << b;
        `VMICRO16_ALU_BIT_RSHFT: c = a >> b;

        `VMICRO16_ALU_MOV,
        `VMICRO16_ALU_MOVI,
        `VMICRO16_ALU_MOVI_L:    c = b;

        // ...
        `VMICRO16_ALU_ARITH_UADDI: c = a + b;

        // ...
        // TODO: ALU should have simm5 as input
        `VMICRO16_ALU_ARITH_SSUBI: c = $signed(a) + $signed(b);

        // TODO: Parameterise
        default: begin
            $display($time, "\tALU: unknown op: %h", op);
            c = {DATA_WIDTH{1'bZ}};
        end
    endcase
endmodule

module vmicro16_ifid (
    input clk,
    input reset,
    input stall,

    input mewb_valid,
    input jmping,
    input [15:0] wb_jmp_target,
```

```

    output reg        ifid_valid,
    output reg [15:0] ifid_pc,
    output reg [15:0] ifid_instr
);
    reg [7:0] mem_cache [0:31];

    integer i;
    initial begin
        // Single cycle register writes
        mem_cache[0]  = {`VMICRO16_OP_MOVI, 3'h0}; mem_cache[1]  = { 8'h00 };
        mem_cache[2]  = {`VMICRO16_OP_MOVI, 3'h1}; mem_cache[3]  = { 8'h01 };
        mem_cache[4]  = {`VMICRO16_OP_MOVI, 3'h2}; mem_cache[5]  = { 8'h02 };
        mem_cache[6]  = {`VMICRO16_OP_MOVI, 3'h3}; mem_cache[7]  = { 8'h03 };
        mem_cache[8]  = {`VMICRO16_OP_MOVI, 3'h4}; mem_cache[9]  = { 8'h04 };
        mem_cache[10] = {`VMICRO16_OP_MOVI, 3'h5}; mem_cache[11] = { 8'h05 };
        mem_cache[12] = {`VMICRO16_OP_MOVI, 3'h6}; mem_cache[13] = { 8'h06 };
        mem_cache[14] = {`VMICRO16_OP_MOVI, 3'h7}; mem_cache[15] = { 8'h07 };
        mem_cache[16] = {`VMICRO16_OP_HALT, 3'h0}; mem_cache[17] = { 8'h00 };
        /**/

        /*
        mem_cache[0]  = {`VMICRO16_OP_MOVI,    3'h0}; mem_cache[1]  = { 8'h00 };
        mem_cache[2]  = {`VMICRO16_OP_MOVI,    3'h1}; mem_cache[3]  = { -8'd02 };
        mem_cache[4]  = {`VMICRO16_OP_ARITH_U, 3'h0}; mem_cache[5]  = {3'h7, 1'b0, 4'h1};
        mem_cache[6]  = {`VMICRO16_OP_ARITH_U, 3'h0}; mem_cache[7]  = {3'h7, 1'b0, 4'h3};
        mem_cache[8]  = {`VMICRO16_OP_ARITH_U, 3'h0}; mem_cache[9]  = {3'h7, 1'b0, 4'h5};
        mem_cache[10] = {`VMICRO16_OP_BR,      3'h1}; mem_cache[11] = {`VMICRO16_OP_BR_U};
        mem_cache[12] = {`VMICRO16_OP_NOP,     3'h0}; mem_cache[13] = {8'h0};
        mem_cache[14] = {`VMICRO16_OP_NOP,     3'h0}; mem_cache[15] = {8'h0};
        mem_cache[16] = {`VMICRO16_OP_NOP,     3'h0}; mem_cache[17] = {8'h0};
        mem_cache[18] = {`VMICRO16_OP_NOP,     3'h0}; mem_cache[19] = {8'h0};
        mem_cache[20] = {`VMICRO16_OP_HALT,    3'h0}; mem_cache[21] = {8'h00};
        //*/

        // Store test: overwrites the first four instructions above
        mem_cache[0] = {`VMICRO16_OP_MOVI, 3'h0}; mem_cache[1] = { 8'h7F };
        mem_cache[2] = {`VMICRO16_OP_MOVI, 3'h1}; mem_cache[3] = { 8'h0A };
        mem_cache[4] = {`VMICRO16_OP_SW,   3'h0}; mem_cache[5] = { 3'h1, 5'h03 };
        mem_cache[6] = {`VMICRO16_OP_HALT, 3'h0}; mem_cache[7] = { 8'h00 };
    end

    reg [15:0] pc;
    initial pc = 0;

    always @(posedge clk) begin
        if (reset) begin
            ifid_valid <= 0;
            ifid_instr <= 0;
            ifid_pc    <= 0;
            pc         <= 0;
        end else begin
            // ...
            pc <= wb_jmp_target;
            // ...
            // TODO: vmicro16_mmu is single port
            //       so we require a cache to do this
            ifid_instr <= {mem_cache[pc], mem_cache[pc+1]};
            ifid_pc    <= pc; // Only for simulation
            pc         <= pc + 16'h2;
            // ...
        end
    end
endmodule

module vmicro16_idex (
    input clk,
    input reset,

    input [15:0] ifid_pc,    output reg [15:0] idex_pc,
    input [15:0] ifid_instr, output reg [15:0] idex_instr,

    output reg [4:0]  idex_op,
    // register data pipe
    output reg [15:0] idex_rd1,
    output reg [15:0] idex_rd2,

    // not clocked
    output [4:0] dec_op,
    output [2:0] reg_rs1, output [2:0] reg_rs2,
    input [15:0] reg_rd1, input [15:0] reg_rd2,
    output dec_has_imm8,

    // computed rd3 data
    output reg [15:0] idex_rd3,
    // register select pipe
    output reg [2:0] idex_rs1,
    output reg [2:0] idex_rs2,
```

```

    output reg idex_has_br,
    output reg idex_has_we,
    output reg idex_has_mem,
    output reg idex_has_mem_we,

    output dec_halt,

    input stall, input jmping,
    input ifid_valid, output reg idex_valid
);
    wire dec_has_br;
    wire dec_has_simm5;
    wire dec_has_we;
    wire dec_has_mem;
    wire dec_has_mem_we;
    wire [4:0] alu_op;
    wire [7:0] dec_imm8;
    wire [4:0] dec_simm5;

    vmicro16_dec decoder (
        .instr      (ifid_instr),
        .opcode     (dec_op),
        .rd         (reg_rs1),
        .ra         (reg_rs2),
        .imm8       (dec_imm8),
        .has_imm8   (dec_has_imm8),
        .simm5      (dec_simm5),
        .has_br     (dec_has_br),
        .has_we     (dec_has_we),
        //.has_bad  (dec_has_bad),
        .has_mem    (dec_has_mem),
        .has_mem_we (dec_has_mem_we),
        .alu_op     (alu_op),
        .halt       (dec_halt)
    );

    // Clock values through the pipeline
    always @(posedge clk)
    if (!reset) begin
        if (!stall) begin
            // Move previous stage regs into this stage
            idex_pc  <= ifid_pc;  // Only for simulation
            idex_rd1 <= reg_rd1;  // clock the decoder outputs into regs
            idex_rd2 <= reg_rd2;  // clock the decoder outputs into regs
            idex_rs1 <= reg_rs1;  // destination register
            idex_rs2 <= reg_rs2;  // operand register

            // store decoded instr
            idex_op         <= alu_op;
            idex_has_br     <= dec_has_br;
            idex_has_we     <= dec_has_we;
            idex_has_mem    <= dec_has_mem;
            idex_has_mem_we <= dec_has_mem_we;

            // rd3 is the second ALU operand: a base+offset address for
            // LW/SW, a sign-extended imm8, or rd2 plus a 4-bit immediate
            if ((dec_op == `VMICRO16_OP_SW) || (dec_op == `VMICRO16_OP_LW))
                idex_rd3 <= reg_rd2 + { {11{dec_imm8[4]}}, dec_simm5 };
            else if (dec_has_imm8)
                idex_rd3 <= { {8{dec_imm8[7]}}, dec_imm8 };
            else if ((dec_op == `VMICRO16_OP_ARITH_U && ifid_instr[4] == 0))
                idex_rd3 <= reg_rd2 + { {12{1'b0}}, ifid_instr[3:0] };
            else if ((dec_op == `VMICRO16_OP_ARITH_S && ifid_instr[4] == 0))
                idex_rd3 <= reg_rd2 + { {12{ifid_instr[3]}}, ifid_instr[3:0] };
            else
                idex_rd3 <= reg_rd2;
        end
        idex_valid <= stall ? 1'b0 : (ifid_valid && !jmping);
    end else begin
        idex_valid      <= 1'b0;
        idex_has_we     <= 1'b0;
        idex_has_mem    <= 1'b0;
        idex_has_mem_we <= 1'b0;
        idex_rd1        <= 1'b0;
        idex_rd2        <= 1'b0;
        idex_rd3        <= 1'b0;
    end
endmodule

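// Illustrative sketch, not part of the original source: the
// replication-based sign extension that the ID/EX stage applies to imm8,
// shown in isolation. The module name is hypothetical.
module sign_extend_example;
    reg  [7:0]  imm8;
    // Copy bit 7 into the upper byte so negative values stay negative
    wire [15:0] ext = { {8{imm8[7]}}, imm8 };

    initial begin
        imm8 = 8'hFE; #1 $display("%h -> %h", imm8, ext); // fe -> fffe
        imm8 = 8'h7F; #1 $display("%h -> %h", imm8, ext); // 7f -> 007f
        $finish;
    end
endmodule
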
module vmicro16_exme (
    input clk,
    input reset,

    input [15:0] idex_pc,  output reg [15:0] exme_pc,

    input [4:0]  idex_op,  output reg [4:0]  exme_op,
                           output reg [15:0] exme_d,
    input [15:0] idex_rd1, output reg [15:0] exme_d2,
    // input [15:0] idex_rd2,
    input [15:0] idex_rd3,

    input [2:0] idex_rs1, output reg [2:0] exme_rs1,
    input [2:0] idex_rs2, output reg [2:0] exme_rs2,

    input idex_has_br,     output reg exme_has_br,
    input idex_has_we,     output reg exme_has_we,
    input idex_has_mem,    output reg exme_has_mem,
    input idex_has_mem_we, output reg exme_has_mem_we,

    input idex_valid,      output reg exme_valid,
    input jmping,
```

```

    output reg [15:0] exme_jmp_target
);
    // ALU
    wire [15:0] alu_c;

    // Port names follow the vmicro16_alu declaration (op, a, b, c)
    vmicro16_alu alu (
        .op (idex_op),
        .a  (idex_rd1),
        .b  (idex_rd3),
        .c  (alu_c)
    );

    always @(posedge clk)
    if (!reset) begin
        // Move previous stage regs into this stage
        exme_pc <= idex_pc; // Only for simulation
        // exme_d contains the result data value or
        // address for LW/SW
        exme_d  <= alu_c;
        exme_d2 <= idex_rd1;
        // exme_rs contains the destination register for
        // the data value or memory after it's fetched
        exme_rs1 <= idex_rs1;
        exme_rs2 <= idex_rs2;

        exme_has_br     <= idex_has_br;
        exme_has_we     <= idex_has_we;
        exme_has_mem    <= idex_has_mem;
        exme_has_mem_we <= idex_has_mem_we;

        exme_valid      <= idex_valid && !jmping;

        // Relative PC jmp target, PC = PC + rd1
        exme_jmp_target <= (idex_has_br) ?
                           (idex_pc + idex_rd1) :
                           1'b0;
    end else begin
        exme_valid      <= 1'b0;
        exme_d          <= 16'h0;
        exme_d2         <= 16'h0;
        exme_has_mem    <= 1'b0;
        exme_has_mem_we <= 1'b0;
    end
endmodule

module vmicro16_mewb (
    input clk,
    input reset,

    input [15:0] exme_pc,  output reg [15:0] mewb_pc,

    input [15:0] exme_d,
    input [15:0] mem_out,  output reg [15:0] mewb_d,
    input [15:0] exme_d2,  output reg [15:0] mewb_d2,
    input [2:0]  exme_rs1, output reg [2:0]  mewb_rs1,

    input      [15:0] exme_jmp_target,
    output reg [15:0] mewb_jmp_target,

    input exme_has_br,     output reg mewb_has_br,
    input exme_has_we,     output reg mewb_has_we,
    input exme_has_mem,    output reg mewb_has_mem,
    input exme_has_mem_we, output reg mewb_has_mem_we,

    input mem_valid,
    input exme_valid,      output reg mewb_valid,

    input jmping
);
    // MEWB stage
    always @(posedge clk)
    if (!reset) begin
        if (exme_valid) begin
            // Move previous stage regs into this stage
            mewb_pc <= exme_pc; // Only for simulation
            // ...
            mewb_d2         <= exme_d2;
            mewb_rs1        <= exme_rs1;
            mewb_jmp_target <= exme_jmp_target;

            mewb_has_br     <= exme_has_br;
            mewb_has_we     <= exme_has_we;
            mewb_has_mem    <= exme_has_mem;
            mewb_has_mem_we <= exme_has_mem_we;

            if (exme_has_mem)
                // LW
                if (!exme_has_mem_we)
                    $display($time, "\tMEWB: LW: r[%h] <= mem[%h]",
                        exme_rs1, exme_d);
        end
        mewb_valid <= exme_valid && !jmping && mem_valid;
    end else begin
        mewb_valid  <= 1'b0;
        // ...
        mewb_has_we <= 1'b0;
```

```
        mewb_has_mem    <= 1'b0;
        mewb_has_mem_we <= 1'b0;
    end
endmodule

module vmicro16_wb (
    input clk,
    input reset,

    input jmping,

    input [15:0] mewb_d,          output reg [15:0] wb_d,

    input [2:0]  mewb_rs1,        output reg [2:0]  wb_rs1,
    input        mewb_has_we,     output reg        wb_we,

    input        mewb_has_br,     output reg        wb_has_br,
    input [15:0] mewb_jmp_target, output reg [15:0] wb_jmp_target,

    input        mewb_valid,      output reg        wb_valid
);
    // WB stage
    always @(posedge clk)
    if (!reset) begin
        if (mewb_valid) begin
            wb_d          <= mewb_d;
            wb_we         <= mewb_has_we;
            wb_rs1        <= mewb_rs1;
            wb_has_br     <= mewb_has_br;
            wb_jmp_target <= mewb_jmp_target;
        end
        wb_valid <= mewb_valid && !jmping;
    end else begin
        wb_valid      <= 1'b0;
        wb_we         <= 1'b0;
        wb_has_br     <= 1'b0;
        wb_jmp_target <= 1'b0;
    end
endmodule

module vmicro16_mmu # (
    parameter MEM_WIDTH = 16,
    parameter MEM_DEPTH = 1024,
    parameter MEM_RVAL  = 16'h00CC
) (
    input clk,
    input reset,

    output valid,

    input             req,
    input      [15:0] mem_addr,
    input      [15:0] mem_in,
    input             mem_we,
    input      [1:0]  mem_whl, // TODO: apply to mem_out
    output reg [15:0] mem_out,

    // wishbone peripheral master
    //output wb_stb_o_xx,
    output reg        wb_mosi_stb_o_regs,
    output reg        wb_mosi_cyc_o,
    output     [15:0] wb_mosi_addr_o,
    output            wb_mosi_we_o,
    output     [15:0] wb_mosi_data_o, // separate data_o and data_i buses
    input      [15:0] wb_miso_data_i, // separate data_o and data_i buses
    input             wb_miso_ack_i
);
    wire [15:0] bram_out;

    // TODO: CLEANUP
    reg active = 0;
    always @(*)
    if (req) begin
        active = 1'b1;
        // ...
    end else
        active = active;

    always @(*)
    if (reset)
        mem_out = 16'h0;
    else if (active && wb_mosi_stb_o_regs) begin
        // Address maps to a wishbone peripheral
        wb_mosi_cyc_o = 1'b1;
        if (wb_miso_ack_i) begin
            mem_out = wb_miso_data_i;
        end
    end else if (active) begin
        // Address maps to block RAM
        mem_out = bram_out;
    end else begin
        wb_mosi_cyc_o = 1'b0;
        // TODO: mem_out isn't valid in this state, output high z or 0?
```

```
            mem_out = 16'hZZ;
        end

    // bram memory is always single clk, wb is unknown
    assign valid = active ? wb_miso_ack_i : 1'b1;

    // TODO: CLEANUP
    // Virtual memory translator
    always @(*) begin
        // zero all peripherals
        wb_mosi_stb_o_regs = 0;
        // enable peripheral
        casez (mem_addr)
            15'h01??: wb_mosi_stb_o_regs = 1'b1;
            default:  wb_mosi_stb_o_regs = 1'b0;
        endcase
    end

    wire wb_active = wb_mosi_cyc_o; // deprecated
    wire wb_stb    = wb_mosi_stb_o_regs; // || req;

    //assign wb_mosi_cyc_o = wb_stb;
    assign wb_mosi_we_o   = wb_active ? mem_we   : 1'b0;
    assign wb_mosi_addr_o = wb_active ? mem_addr : 16'h00;
    assign wb_mosi_data_o = wb_active ? mem_in   : 16'h00;
    //assign mem_out      = wb_active ? wb_miso_data_i : bram_out;

    // TODO: Should this be inside the mmu or outside?
    wire bram_we = mem_we && !wb_active;

    vmicro16_bram # (
        .MEM_WIDTH(MEM_WIDTH), // TODO: mem 16b or 8b wide?
        .MEM_DEPTH(MEM_DEPTH)
    ) bram (
        .clk      (clk),
        .reset    (reset),
        // port 1
        .mem_addr (mem_addr),
        .mem_in   (mem_in),
        .mem_we   (bram_we),
        .mem_out  (bram_out)
    );

    // reset must be held for at least MEM_SIZE clocks
    // to fully erase the bram.
    // E.g. Xilinx bram 1024 cells = 1024 clocks = ~21us
    // TODO: implement with a dfa
endmodule

module vmicro16_cpu (
    input clk,
    input reset,

    // wishbone peripheral master interface
    // driven by mmu
    output        wb_mosi_stb_o_regs, // ...
    output        wb_mosi_cyc_o,
    output        wb_mosi_we_o,
    output [15:0] wb_mosi_addr_o,
    output [15:0] wb_mosi_data_o, // separate data_o and data_i buses
    input  [15:0] wb_miso_data_i, // separate data_o and data_i buses
    input         wb_miso_ack_i
);
    wire [4:0]  dec_op;
    wire [7:0]  dec_imm8;
    wire [4:0]  dec_simm5;
    wire        dec_has_imm8;
    wire        dec_has_br;
    wire        dec_has_we;
    wire        dec_has_mem;
    wire        dec_has_mem_we;
    wire        dec_has_bad;

    wire [15:0] ifid_pc;
    wire [15:0] ifid_instr;
    wire [15:0] reg_rd1;
    wire [15:0] reg_rd2;
    wire [2:0]  reg_rs1;
    wire [2:0]  reg_rs2;

    wire ifid_valid;
    wire idex_valid;
    wire exme_valid;
    wire mewb_valid;
    wire wb_valid;
    wire mem_valid;
    wire dec_halt;
    wire wb_has_br;

    wire [2:0] idex_rs1;
    wire [2:0] exme_rs1;
    wire [2:0] mewb_rs1;
    wire [2:0] wb_rs1;

    // nop = not any bits set in dec_op
    wire nop = ~(|dec_op);
    wire stall_ifid = (~nop) && ifid_valid;
    wire stall_idex = (~nop && idex_valid) && ((reg_rs1 == idex_rs1) || ((~dec_has_imm8) && (reg_rs2 == idex_rs1)));
    wire stall_exme = (~nop && exme_valid) && ((reg_rs1 == exme_rs1) || ((~dec_has_imm8) && (reg_rs2 == exme_rs1)));
    wire stall_mewb = (~nop && mewb_valid) && ((reg_rs1 == mewb_rs1) || ((~dec_has_imm8) && (reg_rs2 == mewb_rs1)));
    wire stall_wb   = (~nop && wb_valid)   && ((reg_rs1 == wb_rs1)   || ((~dec_has_imm8) && (reg_rs2 == wb_rs1)));

    wire stall_mem = 1'b0;
    wire stall     = |{ stall_idex,
                        stall_mewb,
                        stall_wb,
                        dec_halt,
                        !mem_valid };
    wire jmping    = (wb_valid && wb_has_br);

    wire [15:0] wb_d;
    wire [15:0] wb_jmp_target;
    wire        wb_we;
    wire        wb_we_w = reset ? 1'b0 : (wb_we && wb_valid);

    vmicro16_regs # (
        .CELL_DEPTH(8)
    ) regs (
        .clk   (clk),
        .reset (reset),
        .rs1   (reg_rs1),
        .rd1   (reg_rd1),
        .rs2   (reg_rs2),
        .rd2   (reg_rd2),
        .we    (wb_we_w),
        .ws1   (wb_rs1),
        .wd    (wb_d)
    );

    // stage_ifid
    vmicro16_ifid stage_ifid (
        .clk           (clk),
        .reset         (reset),
        .stall         (stall),
        .jmping        (jmping),
        .wb_jmp_target (wb_jmp_target),
        .mewb_valid    (mewb_valid),
        .ifid_valid    (ifid_valid),
        .ifid_pc       (ifid_pc),
        .ifid_instr    (ifid_instr)
    );

    wire [15:0] idex_pc;
    wire [15:0] idex_instr;
    wire [2:0]  idex_rs2;
    wire [15:0] idex_rd1;
    wire [15:0] idex_rd2;
    wire [15:0] idex_rd3;
    wire [4:0]  idex_op;
    wire        idex_has_br;
    wire        idex_has_mem;
    wire        idex_has_mem_we;
    wire        idex_has_we;

    vmicro16_idex stage_idex (
        .clk             (clk),
        .reset           (reset),
        .ifid_pc         (ifid_pc),
        .idex_pc         (idex_pc),
        .ifid_instr      (ifid_instr),
        .idex_instr      (idex_instr),
        // not clocked
        .dec_op          (dec_op),
        .reg_rs1         (reg_rs1),
        .reg_rs2         (reg_rs2),
        .reg_rd1         (reg_rd1),
        .reg_rd2         (reg_rd2),
        .dec_has_imm8    (dec_has_imm8),
        .idex_rd1        (idex_rd1),
        .idex_rd2        (idex_rd2),
        .idex_rd3        (idex_rd3),
        .idex_rs1        (idex_rs1),
        .idex_rs2        (idex_rs2),
        .idex_has_br     (idex_has_br),
        .idex_has_we     (idex_has_we),
        .idex_has_mem    (idex_has_mem),
        .idex_has_mem_we (idex_has_mem_we),
        .dec_halt        (dec_halt),
        .stall           (stall),
        .jmping          (jmping),
        .ifid_valid      (ifid_valid),
        .idex_valid      (idex_valid),
        .idex_op         (idex_op)
    );

    // NEW
    wire [15:0] exme_pc;
    wire [4:0]  exme_op;
    wire [15:0] exme_d;
    wire [15:0] exme_d2;
    // PASS
    wire [2:0]  exme_rs2;
    wire        exme_has_br;
    wire        exme_has_we;
    wire        exme_has_mem;

    vmicro16_exme stage_exme (
        .clk             (clk),
        .reset           (reset),
        .jmping          (jmping),
        // Pass through registers
        .idex_pc         (idex_pc),
        .exme_pc         (exme_pc),
        .idex_rs1        (idex_rs1),
        .exme_rs1        (exme_rs1),
        .idex_rs2        (idex_rs2),
        .exme_rs2        (exme_rs2),
        .idex_has_br     (idex_has_br),
        .exme_has_br     (exme_has_br),
        .idex_has_we     (idex_has_we),
        .exme_has_we     (exme_has_we),
        .idex_has_mem    (idex_has_mem),
        .exme_has_mem    (exme_has_mem),
        .idex_has_mem_we (idex_has_mem_we),
        .exme_has_mem_we (exme_has_mem_we),
        .idex_valid      (idex_valid),
        .exme_valid      (exme_valid),
        // ALU ops
        .idex_op         (idex_op),
        .exme_op         (exme_op), // PASS
        .idex_rd1        (idex_rd1),
        .exme_d          (exme_d),
        .exme_d2         (exme_d2), // PASS
        .idex_rd3        (idex_rd3),
        .exme_jmp_target (exme_jmp_target)
    );

    wire [15:0] mem_out;
    // If SW, use calculated address
    wire [15:0] mem_addr = exme_has_mem ? exme_d : 16'h00;
    // If SW, use register value
    wire [15:0] mem_in   = exme_has_mem ? exme_d2 : exme_d;
    wire        mem_we   = reset ? 1'b0 : (exme_has_mem_we & exme_valid);
    wire [1:0]  mem_whl  = 2'b00; // TODO: implement in ISA

    vmicro16_mmu mmu (
        .clk      (clk),
        .reset    (reset),
        .req      (exme_has_mem && exme_valid),
        .valid    (mem_valid),
        .mem_addr (mem_addr),
        .mem_in   (mem_in),
        .mem_we   (mem_we),
        .mem_whl  (mem_whl),
        .mem_out  (mem_out),
        // wishbone master interface
        // TODO: Add to top level cpu
        .wb_mosi_stb_o_regs (wb_mosi_stb_o_regs),
        .wb_mosi_cyc_o      (wb_mosi_cyc_o),
        .wb_mosi_we_o       (wb_mosi_we_o),
        .wb_mosi_addr_o     (wb_mosi_addr_o),
        .wb_mosi_data_o     (wb_mosi_data_o),
        .wb_miso_data_i     (wb_miso_data_i),
        .wb_miso_ack_i      (wb_miso_ack_i)
    );

    wire [15:0] mewb_pc;
    wire [15:0] mewb_d;
    wire [15:0] mewb_d2;
    wire [15:0] mewb_jmp_target;
    wire        mewb_has_mem;
    wire        mewb_has_mem_we;
    wire        mewb_has_br;
    wire        mewb_has_we;

    vmicro16_mewb stage_mewb (
        .clk             (clk),
        .reset           (reset),
        .jmping          (jmping),
        .mem_out         (mem_out),
        .mem_valid       (mem_valid),
        .exme_pc         (exme_pc),
        .mewb_pc         (mewb_pc),
        .exme_d          (exme_d),
        .mewb_d          (mewb_d),
        .exme_d2         (exme_d2),
        .mewb_d2         (mewb_d2),
        .exme_rs1        (exme_rs1),
        .mewb_rs1        (mewb_rs1),
        .exme_jmp_target (exme_jmp_target),
        .mewb_jmp_target (mewb_jmp_target),
        .exme_has_br     (exme_has_br),
        .mewb_has_br     (mewb_has_br),
        .exme_has_we     (exme_has_we),
        .mewb_has_we     (mewb_has_we),
        .exme_has_mem    (exme_has_mem),
        .mewb_has_mem    (mewb_has_mem),
        .exme_has_mem_we (exme_has_mem_we),
        .mewb_has_mem_we (mewb_has_mem_we),
        .exme_valid      (exme_valid),
        .mewb_valid      (mewb_valid)
    );

    // WB stage
    vmicro16_wb stage_wb (
        .clk             (clk),
        .reset           (reset),
        .mewb_d          (mewb_d),
        .wb_d            (wb_d),
        .mewb_rs1        (mewb_rs1),
        .wb_rs1          (wb_rs1),
        .mewb_has_we     (mewb_has_we),
        .wb_we           (wb_we),
        .mewb_has_br     (mewb_has_br),
        .wb_has_br       (wb_has_br),
        .mewb_jmp_target (mewb_jmp_target),
        .wb_jmp_target   (wb_jmp_target),
        .jmping          (jmping),
        .mewb_valid      (mewb_valid),
        .wb_valid        (wb_valid)
    );
endmodule
```

#### vmicro16\_soc.v

```
module vmicro16_soc (
    input clk,
    input reset
);
    // Internal wishbone master interface
    wire        wb_mosi_stb_o_regs;
    wire        wb_mosi_cyc_o;
    wire [15:0] wb_mosi_addr_o;
    wire [15:0] wb_mosi_data_o; // separate data_o and data_i buses
    wire [15:0] wb_miso_data_i; // separate data_o and data_i buses
    wire        wb_miso_ack_i;

    vmicro16_regs_wb # (
        .CELL_WIDTH    (16),
        .CELL_DEPTH    (8),
        .CELL_SEL_BITS (3),
        .CELL_DEFAULTS (0),
        .DEBUG_NAME    ("soc_regs")
    ) soc_regs (
        .clk       (clk),
        .reset     (reset),
        .wb_stb_i  (wb_mosi_stb_o_regs),
        .wb_cyc_i  (wb_mosi_cyc_o),
        .wb_we_i   (wb_mosi_we_o),
        .wb_addr_i (wb_mosi_addr_o),
        .wb_data_i (wb_mosi_data_o),
        .wb_data_o (wb_miso_data_i),
        .wb_ack_o  (wb_miso_ack_i)
        //.wb_stall_o (wb_stall_o),
        //.wb_err_o   (wb_err_o)
    );

    vmicro16_cpu core (
        .clk   (clk),
        .reset (reset),
        // Wishbone master interface
        .wb_mosi_stb_o_regs (wb_mosi_stb_o_regs),
        .wb_mosi_cyc_o      (wb_mosi_cyc_o),
        .wb_mosi_addr_o     (wb_mosi_addr_o),
        .wb_mosi_we_o       (wb_mosi_we_o),
        .wb_mosi_data_o     (wb_mosi_data_o),
        .wb_miso_data_i     (wb_miso_data_i),
        .wb_miso_ack_i      (wb_miso_ack_i)
    );
endmodule
```

#### vmicro16\_isa.v

```
// Vmicro16 multi-core instruction set
// TODO: Remove NOP by making a register write/read always 0
`define VMICRO16_OP_NOP             5'b00000
`define VMICRO16_OP_LW              5'b00001
`define VMICRO16_OP_SW              5'b00010
`define VMICRO16_OP_BIT             5'b00011
`define VMICRO16_OP_BIT_OR          5'b00000
`define VMICRO16_OP_BIT_XOR         5'b00001
`define VMICRO16_OP_BIT_AND         5'b00010
`define VMICRO16_OP_BIT_NOT         5'b00011
`define VMICRO16_OP_BIT_LSHFT       5'b00100
`define VMICRO16_OP_BIT_RSHFT       5'b00101
`define VMICRO16_OP_MOV             5'b00100
`define VMICRO16_OP_MOVI            5'b00101
16
17
          `define VMICRO16_OP_MOVI_L
`define VMICRO16_OP_ARITH_U
                                                                 5'b10000
          define VMICRO16_OP_ARITH_UADD
'define VMICRO16_OP_ARITH_UADD
'define VMICRO16_OP_ARITH_UADDI
'define VMICRO16_OP_ARITH_S
'define VMICRO16_OP_ARITH_S
'define VMICRO16_OP_ARITH_SADD
'define VMICRO16_OP_ARITH_SSUB
18
                                                                 5'611111
20
                                                                 51b0????
21
22
                                                                 5'b11111
23
24
25
          `define VMICRO16_OP_ARITH_SSUBI `define VMICRO16_OP_BR
                                                                 5'b0????
// TODO: wasted upper nibble bits in BR
`define VMICRO16_OP_BR_U            8'h00
`define VMICRO16_OP_BR_E            8'h01
`define VMICRO16_OP_BR_NE           8'h02
`define VMICRO16_OP_BR_G            8'h03
`define VMICRO16_OP_BR_GE           8'h04
`define VMICRO16_OP_BR_L            8'h05
`define VMICRO16_OP_BR_LE           8'h06
`define VMICRO16_OP_BR_S            8'h07
`define VMICRO16_OP_BR_NS           8'h08
`define VMICRO16_OP_CMP             5'b01001
`define VMICRO16_OP_SETC            5'b01010
`define VMICRO16_OP_HALT            5'b01011

// microcode operations
`define VMICRO16_ALU_BIT_OR         5'h00
`define VMICRO16_ALU_BIT_XOR        5'h01
`define VMICRO16_ALU_BIT_AND        5'h02
`define VMICRO16_ALU_BIT_NOT        5'h03
`define VMICRO16_ALU_BIT_LSHFT      5'h04
`define VMICRO16_ALU_BIT_RSHFT      5'h05
`define VMICRO16_ALU_LW             5'h06
`define VMICRO16_ALU_SW             5'h07
`define VMICRO16_ALU_NOP            5'h08
`define VMICRO16_ALU_MOV            5'h09
`define VMICRO16_ALU_MOVI           5'h0a
`define VMICRO16_ALU_MOVI_L         5'h0b
`define VMICRO16_ALU_ARITH_UADD     5'h0c
`define VMICRO16_ALU_ARITH_USUB     5'h0d
`define VMICRO16_ALU_ARITH_SADD     5'h0e
`define VMICRO16_ALU_ARITH_SSUB     5'h0f
`define VMICRO16_ALU_BR_U           5'h10
`define VMICRO16_ALU_BR_E           5'h11
`define VMICRO16_ALU_BR_NE          5'h12
`define VMICRO16_ALU_BR_G           5'h13
`define VMICRO16_ALU_BR_GE          5'h14
`define VMICRO16_ALU_BR_L           5'h15
`define VMICRO16_ALU_BR_LE          5'h16
`define VMICRO16_ALU_BR_S           5'h17
`define VMICRO16_ALU_BR_NS          5'h18
`define VMICRO16_ALU_CMP            5'h19
`define VMICRO16_ALU_SETC           5'h1a
`define VMICRO16_ALU_ARITH_UADDI    5'h1b
`define VMICRO16_ALU_ARITH_SSUBI    5'h1c
`define VMICRO16_ALU_BR             5'h1d
`define VMICRO16_ALU_SPARE          5'h1e
`define VMICRO16_ALU_BAD            5'h1f
```