# Multi-core RISC Processor Design and Implementation (Rev. 2.02)

ELEC5881M - Final Report

#### Ben David Lancaster

Student ID: 201280376

Submitted in accordance with the requirements for the degree of Master of Science (MSc) in Embedded Systems Engineering

Supervisor: Dr. David Cowell

Assessor: Mr David Moore

#### University of Leeds

School of Electrical and Electronic Engineering

July 5, 2019

Word count: 4689

#### Abstract

This interim report details four months of progress on a project to design, implement, and verify a multi-core RISC processor for FPGAs. The project is split into two stages: first, building a functional single-core RISC processor; second, adding multiprocessor principles and functionality to it.

Current multiprocessor and network-on-chip communication methods are discussed, along with how they could be included in this multi-core RISC design. To date, a 16-bit instruction set architecture has been designed featuring common load/store, comparison, and bitwise instructions. A single-core processor has been implemented in Verilog and verified using simulations and test benches running various simple software programs.

Future tasks have been planned and will focus on the second stage of the project. Work will start on designing a loosely coupled multiprocessor communication interface and bringing it to the single-core processor.

## **Revision History**

| Date       | Version | Changes                        |
|------------|---------|--------------------------------|
| 10/04/2019 | 2.02    | Update future stages.          |
| 05/04/2019 | 2.01    | Fix processor RTL diagram.     |
| 04/04/2019 | 2.00    | Initial processor RTL diagram. |
| 01/04/2019 | 1.00    | Initial section outline.       |

Document revisions.

## Declaration of Academic Integrity

The candidate confirms that the work submitted is his/her own, except where work which has formed part of jointly-authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated in the report. The candidate confirms that appropriate credit has been given within the report where reference has been made to the work of others.

This copy has been supplied on the understanding that no quotation from the report may be published without proper acknowledgement. The candidate, however, confirms his/her consent to the University of Leeds copying and distributing all or part of this work in any forms and using third parties, who might be outside the University, to monitor breaches of regulations, to verify whether this work contains plagiarised material, and for quality assurance purposes.

The candidate confirms that the details of any mitigating circumstances have been submitted to the Student Support Office at the School of Electronic and Electrical Engineering, at the University of Leeds.

Name: Ben David Lancaster

Date: July 5, 2019

# Table of Contents

- 1 Memory Mapping
  - 1.1 Memory Map
  - 1.2 Special Registers
- 2 Interrupts
  - 2.1 Why Interrupts?
  - 2.2 Hardware Implementation
    - 2.2.1 Context Switching
  - 2.3 Software Interface
    - 2.3.1 Software Example
- 3 Peripherals
  - 3.1 GPIO Interface
  - 3.2 Timer Interrupt
  - 3.3 UART Interface
- 4 System-on-Chip Layout
- 5 Interconnect
- 6 Introduction
  - 6.1 Why Multi-core?
  - 6.2 Why RISC?
  - 6.3 Why FPGA?
- 7 Background
  - 7.1 Amdahl's Law and Parallelism
  - 7.2 Loosely and Tightly Coupled Processors
  - 7.3 Network-on-chip Architectures
- 8 Project Overview
  - 8.1 Project Deliverables
    - 8.1.1 Core Deliverables (CD)
    - 8.1.2 Extended Deliverables (ED)
  - 8.2 Project Timeline
    - 8.2.1 Project Stages
    - 8.2.2 Project Stage Detail
    - 8.2.3 Timeline
  - 8.3 Resources
    - 8.3.1 Hardware Resources
    - 8.3.2 Software Resources
  - 8.4 Legal and Ethical Considerations
- 9 Current Progress
  - 9.1 RISC Core
    - 9.1.1 Instruction Set Architecture
    - 9.1.2 Design and Implementation
    - 9.1.3 Verification
- 10 Future Work
  - 10.1 Project Status
    - 10.1.1 Updated Project Time Line
    - 10.1.2 Future Work
- 11 Conclusion
- References
- Appendix A - Code Listing

# Memory Mapping

The Vmicro16 processor uses a memory-mapping scheme to communicate with peripherals and other cores.

### 1.1 Memory Map



Figure 1.1: Memory map showing addresses of various memory sections.

### 1.2 Special Registers

From the software perspective, it is important for both the developer and software algorithms to know the target system's architecture to better utilise the resources available to them. Software written for one architecture with N cores must also run on an architecture with M cores. To enable such portability, the software must query the system for information such as: number of processor cores and the current core identifier. Without this information, the developer would be required to produce software for each individual architecture (e.g. an Intel i5 with 4 cores or an Intel i7 with 8 cores, or an NVIDIA GTX 970 with



Figure 1.2: Vmicro16 Special Registers layout (0x0080 - 0x008F).
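As a hedged illustration of the portability argument in this section, the Python sketch below shows how a program might divide work once it knows the core count and its own core identifier. The parameters `num_cores` and `core_id` stand in for values read from the special registers; the helper `work_slice` and the way the values are obtained are hypothetical and are not taken from the Vmicro16 design.

```python
def work_slice(total_items, num_cores, core_id):
    """Return the half-open range [start, end) of items owned by one core.

    num_cores and core_id are assumed to have been read from the special
    registers at start-up (hypothetical; register offsets are not modelled).
    """
    if not 0 <= core_id < num_cores:
        raise ValueError("core_id must be in [0, num_cores)")
    base, extra = divmod(total_items, num_cores)
    # The first `extra` cores each take one additional item.
    start = core_id * base + min(core_id, extra)
    end = start + base + (1 if core_id < extra else 0)
    return start, end


# The same code runs unchanged on a 4-core and an 8-core system.
for cores in (4, 8):
    slices = [work_slice(100, cores, cid) for cid in range(cores)]
    print(cores, slices)
```

Because the split depends only on the two queried values, the program does not need to be rewritten for each core count.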

# Interrupts


This section describes the design, considerations, and implementation of interrupt functionality within the Vmicro16 processor.

### 2.1 Why Interrupts?

Interrupts are used to enable asynchronous behaviour within a processor.

They are commonly used to signal actions from asynchronous sources, for example an input button being pressed or a UART receiver signalling that data has been received.

### 2.2 Hardware Implementation

#### 2.2.1 Context Switching

When acting upon an incoming interrupt, the current state of the processor must be saved so that changes made by the interrupt handler, such as register writes and branches, do not affect it. After the interrupt handler function signals that it has finished (by using the *Interrupt Return* intr instruction), the saved state is restored. In the case of the Vmicro16 processor, the program counter r\_pc[15:0] and the register set instance regs are the only states that are saved. From here on, the terms *normal mode* and *interrupt mode* are used to describe which registers the processor should use when executing instructions.

To save the normal mode register set regs state, a dedicated register set regs\_isr is multiplexed with regs.
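The behaviour described above can be sketched as a small Python model. This is an illustrative sketch only: the register count, the `saved_pc` field, and the method names are assumptions, and the real design is a Verilog multiplexer rather than software.

```python
class CoreState:
    """Toy model of normal-mode and interrupt-mode processor state."""

    NUM_REGS = 8  # assumption: the register count is not given in this section

    def __init__(self):
        self.regs = [0] * self.NUM_REGS      # normal-mode register set
        self.regs_isr = [0] * self.NUM_REGS  # dedicated interrupt register set
        self.pc = 0
        self.saved_pc = 0                    # hypothetical name for the saved r_pc
        self.in_isr = False

    def active_regs(self):
        # Models the multiplexer: regs_isr is selected only in interrupt mode.
        return self.regs_isr if self.in_isr else self.regs

    def enter_interrupt(self, handler_addr):
        self.saved_pc = self.pc  # save the program counter
        self.in_isr = True       # switch register sets; regs is left untouched
        self.pc = handler_addr

    def intr(self):
        # Interrupt Return: restore the program counter and normal mode.
        self.pc = self.saved_pc
        self.in_isr = False


core = CoreState()
core.regs[0] = 0x1234
core.pc = 0x0010
core.enter_interrupt(0x0040)
core.active_regs()[0] = 0xFF  # the handler writes r0 in interrupt mode
core.intr()
print(hex(core.regs[0]), hex(core.pc))
```

The handler's write lands in `regs_isr`, so the normal-mode value of r0 and the program counter are unchanged after the return.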

### 2.3 Software Interface



Figure 2.1: The interrupt vector consists of eight 16-bit values that point to memory addresses of the instruction memory to jump to.

#### 2.3.1 Software Example

```
entry:
    // Set interrupt vector at 0x100
    // Move address of isr0 function to vector[0]
    movi    r0, isr0
    // create 0x100 value by left shifting 1 8 bits
    movi    r1, #0x1
    movi    r2, #0x8
    lshft   r1, r2
    // write isr0 address to vector[0]
    ...     r0, r1

    // enable all interrupts by writing 0x0f to 0x108
    movi    r0, #0x0f
    ...     r0, r1 + #0x8
    halt                    // enter low power idle state

isr0:                       // arbitrary name
    // do something
    movi    r0, #0xff
    // return from interrupt
    intr
```
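The address arithmetic in the listing can be checked with a short Python sketch. It assumes word-addressed memory, so that the eight vector entries sit at 0x100 to 0x107 and the enable mask follows at 0x108; the dictionary `memory`, the helper `install_handler`, and the handler address `ISR0_ADDR` are hypothetical names used only for illustration.

```python
VECTOR_ENTRIES = 8               # eight 16-bit entries (Figure 2.1)
vector_base = 0x1 << 0x8         # same construction as the listing: 0x100
enable_addr = vector_base + 0x8  # interrupt-enable mask address, 0x108

memory = {}                      # sparse model of word-addressed memory (assumption)


def install_handler(index, handler_addr):
    """Write a handler address into interrupt vector entry `index`."""
    if not 0 <= index < VECTOR_ENTRIES:
        raise ValueError("vector index out of range")
    memory[vector_base + index] = handler_addr & 0xFFFF


ISR0_ADDR = 0x0020               # hypothetical address of the isr0 label
install_handler(0, ISR0_ADDR)    # write isr0 address to vector[0]
memory[enable_addr] = 0x0F       # enable mask value used by the listing

print({hex(addr): hex(value) for addr, value in memory.items()})
```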

# Peripherals


### 3.1 GPIO Interface



### 3.2 Timer Interrupt



### 3.3 UART Interface



# System-on-Chip Layout

The Vmicro16 processor uses



Figure 4.1: •

# Interconnect

The Vmicro16 processor uses

# Introduction


This project will detail the design, implementation, and verification of a new multi-core RISC processor aimed at FPGA devices. This project was chosen due to my interest in processor design. I have previously designed only single-core RISC processors and wish to extend this knowledge to gain a basic, first-hand understanding of multi-core communication, design considerations, and the limitations of parallelism.

I will use this opportunity to further develop my knowledge of FPGA and processor design by designing, implementing, and verifying a multi-core RISC processor from scratch, including the design of a communication interface between multiple cores.



Figure 6.1: Foo

### 6.1 Why Multi-core?

Moore's Law states that the number of transistors in a chip will double every 2 years []. CPU designers would utilize the additional transistors to add more pipeline stages in the processor to reduce the propagation delay [] which would allow for higher clock frequencies.

The size of transistors has been decreasing [] and they can today be manufactured in the sub-10 nanometre range. However, the extremely small transistor size increases electrical leakage and other negative effects, resulting in unreliability and potential damage to the transistor []. The high transistor count produces large amounts of heat and requires increasing power to supply the chip. These trade-offs are currently managed by reducing the input voltage, utilising complex cooling techniques, and reducing clock frequency, all of which limit the performance of the chip significantly and contribute to Moore's Law slowing down. The capacity limit of current-generation planar transistors is approaching, so for performance increases to continue, other approaches are employed, such as alternative transistor technologies like multigate transistors [1], software and hardware optimisations, and multiprocessor architectures.

This report will focus on the latter: to produce a small multi-core processor that can utilise software-based parallelism to gain performance benefits, compared to a larger single-core design.

### 6.2 Why RISC?

RISC architectures feature simpler and fewer instructions compared to CISC, which emphasises instructions that perform larger tasks. A single CISC instruction might be performed with multiple RISC instructions. Because of the fewer and simpler instructions, RISC machines rely heavily on software optimisations for performance. RISC instruction sets are based on load/store architectures, where most instructions are either register-to-register or memory reading and writing [2]. This constraint greatly reduces complexity.

RISC architectures are easier to design and implement, especially for beginners, because their simpler instructions share the same pipeline. In CISC designs there may be a different pipeline for each instruction, which would consume a large share of FPGA resources.

### 6.3 Why FPGA?

Field programmable gate arrays (FPGA) are a great choice for prototyping digital logic designs due to their programmable nature and quick development times.

My experience with FPGAs in previous projects will reduce risk and learning time, allowing more time to be spent on adding and extending features (discussed further in section 8.1).

FPGAs, however, may not be suitable for prototyping all register-transfer level (RTL) projects. Larger RTL projects, such as large commercial processors, may greatly exceed the logic cell resources available in today's high-end FPGA devices and may only be prototyped through silicon fabrication, which can be expensive. This resource limitation will not be a problem, as the project aims to produce a small and minimal design specifically for learning about multi-core architectures.

# Background


### 7.1 Amdahl's Law and Parallelism

In many applications, not restricted to software, there may exist many opportunities for processes or algorithms to be performed in parallel. These algorithms can be split into two parts: a serial part that cannot be parallelised, and a part that can be parallelised. Amdahl's Law defines a formula for calculating the maximum speedup of a process with potential parallelism opportunities when run in parallel on n processors. Speedup is a term used to describe the potential performance improvement of an algorithm using an enhanced resource (in this case, adding parallel processors) compared to the original algorithm. Amdahl's Law is defined below, where the potential speedup $S_p$ is dependent on the portion of the program that can be parallelised p and the number of processing cores n:

$$S_p = \frac{1}{(1-p) + \frac{p}{n}} \tag{7.1}$$

This formula will be used throughout the project to gauge the performance of the multi-core design running various software algorithms.
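As a minimal sketch of how equation (7.1) might be evaluated, the Python function below computes the speedup for a given parallel fraction p and core count n. The function name `amdahl_speedup` and the example value p = 0.9 are illustrative choices, not figures measured in this project.

```python
def amdahl_speedup(p, n):
    """Maximum speedup from equation (7.1).

    p: fraction of the program that can be parallelised (0.0 to 1.0)
    n: number of processing cores (1 or more)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Illustrative only: a program that is 90% parallelisable.
for cores in (1, 2, 4, 8, 32):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

As n grows, the speedup approaches 1 / (1 - p), so the serial part of the program bounds the benefit of adding cores.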

### 7.2 Loosely and Tightly Coupled Processors

Multiprocessor systems can be generalised into two architectures, loosely and tightly coupled, each with advantages and disadvantages. In loosely coupled systems, each processing node is self-contained: each node has its own dedicated memory and IO modules. Communication between nodes is performed over a *Message Transfer System (MTS)* [3] in a master-slave control architecture.

Scalability in loosely coupled systems is generally easier to implement as each node can simply be appended to the shared MTS interface without large modifications to the rest of the system. Scalability is an important concern in this project as I wish to test the developed solution with a range of processing nodes.

As the nodes of a loosely coupled system feature their own memory and IO modules, they generally perform better in cases where interaction between nodes is not prominent: each node can store a separate part of the software program in its memory module, allowing simultaneous execution of the program.

In scenarios where inter-node communication is prominent, however, access to the MTS interface must be scheduled to avoid access conflicts. This introduces delays and idle times in the software program's execution, resulting in lower throughput. Figure 7.1 shows a general layout of a loosely coupled multiprocessor system.

Tightly coupled systems feature processing nodes that do not have their own dedicated memory or IO modules – each node is directly connected to a shared memory module using a dedicated port. In scenarios where inter-node communication is prominent, tightly coupled systems are generally better suited as nodes are directly connected to a shared memory and do not need to wait to use a shared bus.



Figure 7.1: A loosely coupled multiprocessor system. Each node features its own memory and IO modules and uses a Message Transfer System to perform inter-node communication. Image source: [3].

Figure 7.2: A tightly coupled multiprocessor system. Nodes are directly connected to memory and IO modules. Image source: [3].

This project will utilise a loosely coupled architecture due to its easier scalability and my previous experience with the design of single-core processors. Although it will require a scheduler to access the MTS, the experience and knowledge gained from this task will be greatly beneficial for future projects.
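One simple way to schedule access to a shared MTS is round-robin arbitration, sketched below in Python. This is an assumption chosen for illustration; the section does not specify which scheduling policy the project will use, and the function name `round_robin_grant` is hypothetical.

```python
def round_robin_grant(requests, last_granted):
    """Pick the next requesting node after `last_granted`, wrapping around.

    requests: list of booleans, one per node, True if the node wants the MTS.
    Returns the index of the granted node, or None if no node is requesting.
    """
    node_count = len(requests)
    for offset in range(1, node_count + 1):
        candidate = (last_granted + offset) % node_count
        if requests[candidate]:
            return candidate
    return None


# Nodes 0 and 2 request continuously; the grant alternates between them.
last = 3
grants = []
for _ in range(4):
    last = round_robin_grant([True, False, True, False], last)
    grants.append(last)
print(grants)
```

Starting the search one position after the previous winner prevents any single node from holding the interface indefinitely, at the cost of idle time for nodes that must wait their turn.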

### 7.3 Network-on-chip Architectures

Network-on-chip (NoC) architectures implement on-chip communication mechanisms that are based on network communication principles, such as routing, switching, and massive scalability [4]. NoCs can generally support hundreds to millions of processing cores. Figure 7.3 shows an example 16-core network-on-chip architecture. NoCs can scale to very large sizes without sacrificing performance because each processor core is able to drive the network directly, rather than waiting for a shared bus to become free.

The greater the number of cores in a network-on-chip design, the more quality of service (QoS) problems arise. As such, network-on-chip architectures suffer from the same problems as conventional networks, such as fairness and throughput [5].



Figure 7.3: A multiprocessor network-on-chip architecture with 16 processing nodes. Nodes are connected in a grid formation with routers and links. Image source: [6].
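To make the grid layout of Figure 7.3 more concrete, the Python sketch below traces a path between two nodes of a 4 x 4 mesh using dimension-ordered (XY) routing. Both the routing scheme and the row-major node numbering are assumptions chosen for illustration; the section does not state which scheme the architecture in [6] uses.

```python
MESH_WIDTH = 4  # 16 nodes arranged as a 4 x 4 grid, as in Figure 7.3


def node_to_xy(node):
    """Convert a row-major node id (assumed numbering) to (x, y) coordinates."""
    return node % MESH_WIDTH, node // MESH_WIDTH


def xy_route(src, dst):
    """Return the node ids visited from src to dst, moving in X first, then Y."""
    x, y = node_to_xy(src)
    dst_x, dst_y = node_to_xy(dst)
    path = [src]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * MESH_WIDTH + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * MESH_WIDTH + x)
    return path


route = xy_route(0, 15)
print(route, "hops:", len(route) - 1)
```

Each router only needs the destination coordinates to pick the next link, which is one reason grid networks scale without a central shared bus.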

# Project Overview


This chapter discusses the project's requirements, goals, and structure.

### 8.1 Project Deliverables

The project's deliverables are split into two categories. Core deliverables (CD) must each be satisfied for the project to be a minimum viable product (MVP). Extended deliverables (ED) are not required for an MVP; they are features that only improve upon an existing feature.

#### 8.1.1 Core Deliverables (CD)

The project's core deliverables are described below.

#### CD1 Design a compact 16-bit RISC instruction set architecture.

The instruction set will be the primary interface to control the processor from software. An instruction set will be required to implement the custom multi-core communication interface.

It was decided to design a new instruction set rather than to extend an existing architecture as this will increase my knowledge of the constraints to consider when designing instruction sets and processors.

#### CD2 Design and implement a Verilog RISC core that implements the ISA in CD1.

The Verilog RISC core will be able to run software programs written for the instruction set architecture.

#### CD3 Design and implement an on-chip interconnect for multi-core processing (2 to 32 cores) using the RISC core from CD2.

The interconnect will be a chief requirement to enable multi-core communication. The interconnect should support up to 32 cores; however, limited FPGA resources may constrain this in practice.

The interconnect will control communication between the cores to enable software parallelism.

#### CD4 Analyse performance of serial and parallel software algorithms, such as parallel DFT, on the processor.

To evaluate the effectiveness of the developed solution, serial and parallel implementations of a simple computing algorithm (parallel reduction, sorting) will be run on the processor and their performance analysed. Effectiveness will be rated on total algorithm run-time and the speed-up gained by adding more cores.
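The kind of comparison described for CD4 can be sketched in Python for a sum reduction. The sketch counts additions for the serial version and rounds for a pairwise (tree) reduction, assuming enough cores are available to perform every addition within a round simultaneously. The function names are hypothetical and the step counts are a simplified model, not measured results.

```python
def serial_sum(values):
    """Sum the values one at a time, returning (total, steps)."""
    total = 0
    steps = 0
    for value in values:
        total += value
        steps += 1
    return total, steps


def tree_reduce(values):
    """Pairwise reduction, returning (total, rounds).

    All additions within one round are independent, so they could be
    executed by different cores at the same time.
    """
    data = list(values)
    rounds = 0
    while len(data) > 1:
        paired = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:
            paired.append(data[-1])  # an odd element carries over to the next round
        data = paired
        rounds += 1
    return (data[0] if data else 0), rounds


values = list(range(1, 17))
print(serial_sum(values))
print(tree_reduce(values))
```

For 16 inputs the serial version takes 16 sequential steps while the tree version needs 4 rounds; communication cost between cores is ignored in this model.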

#### CD5 Allow the RISC core to be easily compiled to multiple FPGA vendors (Xilinx, Altera).

The developed solution should be generic and portable to allow it to be used across a wide range of FPGA vendors and devices.

Verilog is a generic, implementation-independent hardware description language, and so designing implementation-specific modules should be avoided where possible.

A key consideration for this requirement is the varying hard IP provided by the FPGA vendors (such as BRAM, Ethernet, and PCIe [7, 8]). To overcome this problem, the developed Verilog code will conditionally compile where vendor-specific requirements are present.

#### 8.1.2 Extended Deliverables (ED)

The project's extended deliverables are described below.

- **ED1** Design a RISC core with an instructions-per-clock (IPC) rating of at least 1.0 (a single-cycle CPU).
- **ED2** Design a RISC core with a pipelined data path to increase the design's clock speed.
- **ED3** Design a scalable multi-core interconnect supporting arbitrary (more than 32) RISC core instances (manycore) using a Network-on-Chip (NoC) architecture.
- **ED4** Design a compiler backend for the PRCO304 [9] compiler to support the ISA from CD1. This will make it easier to build complex multi-core software for the processor.
- **ED5** The RISC core can communicate with peripherals via memory-mapped addresses using the Wishbone bus.
- **ED6** Implement various memory-mapped peripherals, such as UART, GPIO, and LCD, to aid visual representation of the processor during the demonstration viva.
- **ED7** Store instruction memory in SPI flash.
- **ED8** Reprogram instruction memory at runtime from a host computer.
- **ED9** External processor debugger using a host-processor link.

### 8.2 Project Timeline

#### 8.2.1 Project Stages

The project is split into stages to aid planning and management. There are 8 stage areas: 1. Initial project conception; 2. Basic RISC core development; 3. Extended RISC core development; 4. Multi-core development; 5. Processor quality-of-life (QoL) improvements; 6. Compiler development; 7. Demo preparation; and 8. Final report.

The project stages are shown in Table 8.1.

#### 8.2.2 Project Stage Detail

#### Stages 1.0 through 1.2 - Research and Project Conception

These stages cover initial research of existing problems and solutions in the multiprocessor area. The instruction set architecture is also proposed that later stages will implement.

#### Stages 2.1 through 2.3 – Processor module Design, Implementation, and Integration

These stages cover the design, implementation, and integration of key processor core modules such as the instruction decoder, register sets and local memory. Integration of all the modules is a challenging task because some modules have both asynchronous and synchronous signals that need to be timed correctly in order for other modules to receive valid data. An example of this is the register set which has asynchronous read ports that are later clocked in the instruction decode stage.

#### Stages 3.1 through 3.4 - Advanced Processor Implementation

These stages add advanced features to the processor to provide a more functional product. Although these stages are classified as extended, they are not technically demanding to design and implement, and so they have time allocations in the project schedule. The extended features that these stages introduce are: pipelined processor stages, to drastically increase processor performance; a memory-mapped peripheral interface through the MMU; a Wishbone master interface to the MMU, allowing external peripherals such as GPIO and LCD displays to be utilised in a modular fashion; and a cache memory for each processor core.

| Stage | Title                                        | Start Date | Days | Core | Applicable Deliverables |
|-------|----------------------------------------------|------------|------|------|-------------------------|
| 1.0   | Research                                     | Feb 04     | 7    | x    |                         |
| 1.1   | Requirement gathering/review                 | Feb 11     | 14   | x    |                         |
| 1.1   | Processor specification, architecture, ISA   | Feb 18     | 100  | x    | CD1                     |
| 1.2   | Stage/Time Allocation Planning               | Feb 25     | 7    | x    |                         |
| 2.1   | Decoder, Register Set, impl & integration    | Feb 25     | 14   | x    | CD2                     |
| 2.2   | Register set impl & integration              | Mar 04     | 14   | x    | CD2                     |
| 2.3   | Local memory impl & integration              | Mar 11     | 14   | x    | CD2                     |
| 3.1   | Memory mapped register layout & impl         | Apr 01     | 21   |      | ED5                     |
| 3.2   | Wishbone peripheral bus connected to MMU     | Apr 08     | 21   |      | ED5                     |
| 3.3   | Pipelined implementation and verification    | Apr 15     | 21   |      | ED2                     |
| 3.4   | Cache memory design & impl                   | Apr 22     | 28   |      | ED2                     |
| 4.1   | Multi-core communication interface           | TBD        | TBD  | x    | CD3                     |
| 4.2   | Shared-memory controller                     | TBD        | TBD  | x    | CD3                     |
| 4.3   | Scalable multi-core interface (10s of cores) | TBD        | TBD  | x    | CD3                     |
| 4.4   | Multi-core example program (reduction)       | TBD        | TBD  | x    | CD4                     |
| 5.1   | SPI-FPGA interface for OTG programming       | TBD        | TBD  |      | ED7                     |
| 5.2   | FPGA-PC interfacing                          | TBD        | TBD  |      | ED9                     |
| 5.3   | FPGA-PC debugging (instruction breakpoints)  | TBD        | TBD  |      | ED9                     |
| 6.1   | Compiler backend for vmicro16                | TBD        | TBD  |      | ED4                     |
| 6.2   | Compiler support for multi-core codegen      | TBD        | TBD  |      | ED4                     |
| 7.1   | Wishbone peripherals for demo                | TBD        | TBD  | x    | CD4                     |
| 8.1   | Final Report                                 | TBD        | TBD  | x    |                         |

Table 8.1: Project stages throughout the life cycle of the project.

#### Stages 4.1 through 4.4 – Multiprocessor Functionality

These stages are dedicated to adding multiprocessor functionality using a loosely coupled architecture to the processor.

#### Stages 5.1 through 5.3 – Debugging Features

These stages cover debugging features and are classified as extended due to the large development time required to implement them as well as not being related to multiprocessor systems.

#### Stages 6.1 through 6.2 – Compiler Backends

These stages cover the implementation of a compiler backend to ease software writing and programming of the processor.

#### Stage 7.1 – Wishbone Peripherals

Additional Wishbone peripherals, such as SPI and timers, will be added to produce a more useful multiprocessor system.

#### Stage 8.1 - Final Report

This stage is dedicated to the final report write-up. It is expected to be an iterative task that is active throughout the lifespan of the project.

#### 8.2.3 Timeline

The project stages from Table 8.1 are displayed below in a Gantt chart.



Figure 8.1: Project stages in a Gantt chart.

### 8.3 Resources

This section describes the hardware and software resources required to fulfil the project.

#### 8.3.1 Hardware Resources

Core deliverable CD5 requires the designed RISC core to be implemented and demonstrated on multiple FPGA devices. Although my design should synthesise for physical IC implementation, due to high costs and lengthy production times, it is not a primary development target. Due to having past experience with Xilinx FPGAs from my placement work and experience with Altera from university modules it was decided to target the Xilinx Spartan 6 XC6SLX9 and the Altera Cyclone V.

#### Terasic DE1-SoC Development Board

The Terasic DE1-SoC development board features a large Cyclone V FPGA and many peripherals, such as seven-segment displays, 64 MB SDRAM, ADCs, and buttons and switches, which will aid demonstration of the project. The development board is available through the university so the cost is negligible. Figure 8.2 shows the peripherals (green) available to the FPGA.



Figure 8.2: Terasic DE1-SoC development board featuring the Altera Cyclone V FPGA and many peripherals. Image source: [10].

#### Minispartan 6+ FPGA Development Board

The Minispartan 6+ is a hobbyist FPGA development board with fewer peripherals than the DE1-SoC. The board features a Xilinx Spartan 6 XC6SLX9, which has far fewer resources than the DE1-SoC's Cyclone V; however, its simplicity and my familiarity with Xilinx's software suite will speed up development. The development board is shown in Figure 8.3.



Figure 8.3: Minispartan-6+ development board featuring the Xilinx Spartan 6 XC6SLX9. Note that the XC6SLX9 and XC6SLX25 FPGAs share the same board. Image source: [11].

#### 8.3.2 Software Resources

#### Intel Quartus

Intel Quartus Prime is a paid-for SoC, CPLD, and FPGA software suite targeting Intel's Stratix, Arria, and Cyclone based FPGAs. The university provides student licences which will be used via VPN.

#### Xilinx ISE WebPACK

Xilinx ISE WebPACK is Xilinx's free software suite for FPGA development for Spartan 6 based FPGAs. Due to ISE's intuitive and fast workflow, most of the initial simulation and verification processes will be performed using ISE. This will greatly improve development times.

#### Verilator

Verilator is an open-source Verilog to C++ transpiler which provides a C++ interface to simulate Verilog modules and read/write values similar to a test bench. Verilator will be used for specific modules within the RISC core such as the ALU and decoder as Verilator is useful when performing exhaustive verification.

#### 8.4 Legal and Ethical Considerations

The RISC core is designed to be used as an academic research and educational tool to aid learning and understanding of RISC and multi-core machines. It should not be used in roles where mission-critical or safety-critical operation is a factor.

The processor does not provide any memory protection features and any software running on the processor has full access to all memory.

The processor does not store, track, or predict software instructions. It does, however, use pipelining to improve performance, which results in later instructions entering the pipeline even if the software's logical sequence does not include them. This could result in security vulnerabilities similar to the Spectre vulnerabilities [12].

## **Current Progress**

| Section | Title                        |
|---------|------------------------------|
| 9.1     | RISC Core                    |
| 9.1.1   | Instruction Set Architecture |
| 9.1.2   | Design and Implementation    |
| 9.1.3   | Verification                 |

This chapter discusses the current progress made towards the project, including designs, implementation, and current results.

#### 9.1 RISC Core

Following the project time line described in section 8.2, the first couple of months have been dedicated to stages 1-3: the design and implementation of the instruction set architecture and the RISC core. Good progress has been made on both deliverables, the ISA and the RISC core, and the work is on schedule with the initial project time line. The core has been nicknamed *Vmicro16*, short for Verilog microprocessor 16-bit.

#### 9.1.1 Instruction Set Architecture

A 16-bit instruction set architecture (ISA) has been designed using an iterative approach. There are currently 32 unique instructions covering most generic RISC operations (add, load/store, branch, compare, etc.) and at least 16 opcodes remain available to provide multi-core communication and functionality. This number should be adequate to support these features when work begins on the multi-core project stages (stages 4-7).

#### **Design Goals**

Having past experience designing and implementing ISAs for previous projects, I wanted to use that knowledge to design an even more efficient and compact instruction set that could provide much greater functionality. The technical design goals of the ISA are described below:

#### ISA1 Use a fixed width of 16-bits for all instructions.

This will significantly reduce RTL resources and encourage efficiency by not wasting spare bits. In addition, many SPI flash and RAM devices support 16-bit wide data reads, which will allow each instruction fetch to require only one clock cycle, thus increasing processor performance.

#### ISA2 Be able to select at least two registers for common instructions.

This will reduce the number of instructions required to manipulate register data. A disadvantage of using two instead of three register selects is that instructions are always destructive: they always overwrite the existing data in the destination register (e.g. R0 = ADD R0 R1), unlike constructive instructions that provide a unique register select for the destination (e.g. R2 = ADD R0 R1).

#### ISA3 Reduce bit-space for frequently used instructions (MOV, MOVI, ADD).

Due to the 16-bit limit, two register selects, and immediate values, the opcode bits are reduced, resulting in fewer unique instructions. To overcome this constraint, spare bits in other instructions will be appended to the opcode bits to extend the opcode range. This, however, will require a more complex decoder that must first switch on the opcode and then switch on any spare bits to determine the final opcode. This method will significantly increase the number of unique instructions provided by the instruction set.

#### ISA4 Provide frequently used actions as options for existing instructions.

In software, frequently used actions include incrementing/decrementing by 1 and performing logical comparisons, which take more than one instruction on some RISC architectures. As they are common actions, the instruction overhead and time may be significant and can affect performance. To address this, in addition to using spare bits to extend the opcode range, spare bits will be used to signify a frequently used action to be performed by the ALU.

As shown in Figure 9.1, frequently used commands such as incrementing/decrementing and logical comparisons are provided by setting spare bits to special values. For example, the instructions ARITH\_UADDI and ARITH\_SSUBI extend the ARITH\_U and ARITH\_S opcodes using the spare bit 4. If this bit is not set (0), the remaining four bits are treated as a 4-bit immediate value that is applied in addition to the two register selects. The 4-bit immediate allows a small number to be added in the ALU, which is useful in software for loops where an increment/decrement of more than 1 is required.

Another example is the SETC instruction. Inspired by Intel's x86 SETcc, the instruction sets the destination register to zero or one depending on the flags produced by the CMP instruction. Without this instruction, multiple branches would be required to convert the comparison's flags to logical zeros and ones.

#### ISA5 Provide instructions for performing bitwise manipulations.

RISC processors are commonly used for microprocessing and microcontroller actions, which typically include bit manipulation. The ISA provides bitwise OR, XOR, AND, NOT, and shifting instructions under a single opcode to fill this need.

#### ISA6 Provide instructions for explicitly performing signed and unsigned arithmetic.

Performing signed and unsigned arithmetic is a key requirement for RISC applications and so it was decided to provide such instructions. Software programmers can easily switch between signed and unsigned arithmetic by setting bit 11 in the ARITH instruction family. Being able to change between signed and unsigned arithmetic instructions by changing a single bit will make the RISC processor's decoder module smaller and less complex.

Without explicit unsigned and signed instructions, extra instructions would be required to perform addition and subtraction. In addition, due to two's complement representation of signed numbers, the highest immediate operand value would be halved, resulting in more instructions to reach the desired value.

| Format             | Bit fields            |
|--------------------|-----------------------|
| rd ra simm5        | 15-11, 10-8, 7-5, 4-0 |
| rd imm8            | 15-11, 10-8, 7-0      |
| nop                | 15-11, 10-0           |
| extended immediate | 15, 14-12, 11-0       |

| Instruction | 15-11      | 10-8            | 7-0             | Operation                 |
|-------------|------------|-----------------|-----------------|---------------------------|
| NOP         | 00000      | X               | X               |                           |
| LW          | 00001      | Rd              | Ra, s5          | Rd <= RAM[Ra+s5]          |
| SW          | 00010      | Rd              | Ra, s5          | RAM[Ra+s5] <= Rd          |
| BIT         | 00011      | Rd              | Ra, s5          | bitwise operations        |
| BIT_OR      | 00011      | Rd              | Ra, 00000       | Rd <= Rd \| Ra            |
| BIT_XOR     | 00011      | Rd              | Ra, 00001       | Rd <= Rd ^ Ra             |
| BIT_AND     | 00011      | Rd              | Ra, 00010       | Rd <= Rd & Ra             |
| BIT_NOT     | 00011      | Rd              | Ra, 00011       | Rd <= ~Ra                 |
| BIT_LSHFT   | 00011      | Rd              | Ra, 00100       | Rd <= Rd << Ra            |
| BIT_RSHFT   | 00011      | Rd              | Ra, 00101       | Rd <= Rd >> Ra            |
| MOV         | 00100      | Rd              | Ra, X           | Rd <= Ra                  |
| MOVI        | 00101      | Rd              | i8              | Rd <= i8                  |
| ARITH_U     | 00110      | Rd              | Ra, s5          | unsigned arithmetic       |
| ARITH_UADD  | 00110      | Rd              | Ra, 11111       | Rd <= uRd + uRa           |
| ARITH_USUB  | 00110      | Rd              | Ra, 10000       | Rd <= uRd - uRa           |
| ARITH_UADDI | 00110      | Rd              | Ra, 0AAAA       | Rd <= uRd + Ra + AAAA     |
| ARITH_S     | 00111      | Rd              | Ra, s5          | signed arithmetic         |
| ARITH_SADD  | 00111      | Rd              | Ra, 11111       | Rd <= sRd + sRa           |
| ARITH_SSUB  | 00111      | Rd              | Ra, 10000       | Rd <= sRd - sRa           |
| ARITH_SSUBI | 00111      | Rd              | Ra, 0AAAA       | Rd <= sRd - sRa + AAAA    |
| BR          | 01000      | Rd              | i8              | conditional branch        |
| BR_U        | 01000      | Rd              | 0000 0000       | Any                       |
| BR_E        | 01000      | Rd              | 0000 0001       | Z=1                       |
| BR_NE       | 01000      | Rd              | 0000 0010       | Z=0                       |
| BR_G        | 01000      | Rd              | 0000 0011       | Z=0 and S=O               |
| BR_GE       | 01000      | Rd              | 0000 0100       | S=O                       |
| BR_L        | 01000      | Rd              | 0000 0101       | S != O                    |
| BR_LE       | 01000      | Rd              | 0000 0110       | Z=1 or (S != O)           |
| BR_S        | 01000      | Rd              | 0000 0111       | S=1                       |
| BR_NS       | 01000      | Rd              | 0000 1000       | S=0                       |
| CMP         | 01001      | Rd              | Ra, X           | SZO <= CMP(Rd, Ra)        |
| SETC        | 01010      | Rd              | Ra, X           | Rd <= Imm8 == SZO ? 1 : 0 |
| MOVI_LARGE  | 1 (bit 15) | Rd (bits 14-12) | i12 (bits 11-0) | Rd <= i12                 |

Figure 9.1: Initial Vmicro16 16-bit instruction set architecture. Coloured regions represent instruction families (bitwise, branching, arithmetic, etc.).

The ISA table is shown in Figure 9.1. The top 5 bits (15-11) are dedicated to the opcode, resulting in 32 unique values. Currently only bits 14-11 are used (NOP to SETC), leaving the top bit spare. Initially, this bit was reserved to indicate an extended immediate instruction, MOVI12 (listed as MOVI\_LARGE in Figure 9.1), supporting a large 12-bit immediate value; however, later in the design it was decided that the top bit would indicate special instructions dedicated to multi-core operation. This leaves 16 spare unique opcodes for this purpose.
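The bit fields described above can be sketched as a small combinational Verilog module. This is only an illustration of the layout in Figure 9.1, not the Vmicro16 source: the module name and port names are hypothetical, and sign-extending the 5-bit field is an assumption based on the *simm5* label.

```verilog
// Illustrative field slicing for the 16-bit instruction formats in Figure 9.1.
// Module and port names are hypothetical (not taken from vmicro16.v).
module vmicro16_fields_sketch (
    input  wire [15:0] instr,
    output wire [4:0]  opcode,       // bits 15-11
    output wire [2:0]  rd,           // bits 10-8
    output wire [2:0]  ra,           // bits 7-5
    output wire [4:0]  simm5,        // bits 4-0
    output wire [7:0]  imm8,         // bits 7-0 (rd imm8 format)
    output wire [15:0] simm5_ext,    // simm5 sign-extended to 16 bits (assumed signed)
    output wire        top_bit       // bit 15, reserved for multi-core opcodes
);
    assign opcode    = instr[15:11];
    assign rd        = instr[10:8];
    assign ra        = instr[7:5];
    assign simm5     = instr[4:0];
    assign imm8      = instr[7:0];
    assign simm5_ext = {{11{instr[4]}}, instr[4:0]};
    assign top_bit   = instr[15];
endmodule
```

Because every format keeps the opcode and Rd in the same positions, the same slices can feed the decoder regardless of which instruction family is being decoded.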

#### 9.1.2 Design and Implementation

The RISC core design is a traditional 5-stage processor (fetch, decode, execute, memory, write-back).

To satisfy CD5, the Verilog code will be self-contained in a single file. This reduces the hierarchical complexity and eases cross-vendor project set-up as only a single file is required to be included. A disadvantage with this single file approach is that some external Verilog verification tools that I plan to use, such as Verilator, do not currently support multiple Verilog modules (due to an unfixed bug) within a single file.



Figure 9.2: Vmicro16 RISC 5-stage RTL diagram showing: instruction pipelining (data passed forward through clocked register banks at each stage); branch address calculation; ALU operand calculation (rd2 or imm); and program counter incrementing.

#### Instruction and Data Memory

The design uses separate instruction and data memories, similar to a Harvard architecture computer. This architecture was chosen because I find it easier to implement.

#### Register File

To support design goal **ISA2**, the register set features a dual-port read and single-port write. This allows instructions to read 2 registers simultaneously for any instruction. The single-port write allows the instruction output to be written to the register file.
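A register file of this shape might look like the following minimal sketch. It assumes eight 16-bit registers (from the 3-bit register selects in Figure 9.1) and asynchronous reads; the module name, port names, and parameters are hypothetical rather than taken from the Vmicro16 source.

```verilog
// Sketch of a register file with two read ports and one write port.
// Names and parameters are illustrative assumptions.
module regfile_sketch #(
    parameter DATA_WIDTH = 16,
    parameter SEL_WIDTH  = 3              // 3-bit selects: 8 registers
) (
    input  wire                  clk,
    input  wire                  we,      // write enable from write-back
    input  wire [SEL_WIDTH-1:0]  ws,      // write select
    input  wire [DATA_WIDTH-1:0] wd,      // write data
    input  wire [SEL_WIDTH-1:0]  rs1,     // read select 1
    input  wire [SEL_WIDTH-1:0]  rs2,     // read select 2
    output wire [DATA_WIDTH-1:0] rd1,     // read data 1
    output wire [DATA_WIDTH-1:0] rd2      // read data 2
);
    reg [DATA_WIDTH-1:0] regs [0:(1 << SEL_WIDTH) - 1];

    // Two independent asynchronous read ports
    assign rd1 = regs[rs1];
    assign rd2 = regs[rs2];

    // Single synchronous write port
    always @(posedge clk)
        if (we)
            regs[ws] <= wd;
endmodule
```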

#### **Pipelining**

Extended deliverable **ED1** is to achieve at least one instruction per clock. Previous processor designs of mine have all required multiple clocks per instruction, as that is much easier to implement. Modern processors can complete one or more instructions per clock through the use of instruction pipelining. This technique increases the throughput of the processor by operating all stages in parallel. Instructions still travel through each stage in the same order; the difference is that the fetch stage does not wait for the final stage to complete and so fetches a new instruction every clock cycle, resulting in each stage operating on new data every clock cycle. Extended deliverable **ED1** was proposed to extend my knowledge of CPU pipelining.

Instruction pipelining is harder to implement as data and control hazards can occur. Data hazards occur when an instruction depends on the output of a previous instruction that has not yet left the pipeline, for example a register dependency. One method to detect this hazard is to check whether the register selects in the decode stage match those of instructions in later stages of the pipeline. If they do, the current instruction depends on an instruction still in the pipeline, and the processor can either wait until that instruction has left the pipeline (i.e. its result has been written back to the registers) or insert a NOP that produces a *bubble* in the pipeline, allowing the final stage to execute before the dependent instruction continues.
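The register-select comparison described above can be sketched as follows. This is a simplified illustration, assuming that each later stage carries its destination register select and a write-enable flag; the signal names are hypothetical and do not come from the Vmicro16 source.

```verilog
// Sketch of data hazard detection by comparing register selects.
// All names are illustrative assumptions.
module data_hazard_sketch (
    input  wire [2:0] idex_rs1,   // source selects of the newest instruction
    input  wire [2:0] idex_rs2,
    input  wire [2:0] exme_rd,    // destination select held in the EXME bank
    input  wire       exme_we,    // that instruction will write a register
    input  wire [2:0] mewb_rd,    // destination select held in the MEWB bank
    input  wire       mewb_we,
    output wire       stall       // high while a dependency is still in flight
);
    wire exme_hit = exme_we & ((exme_rd == idex_rs1) | (exme_rd == idex_rs2));
    wire mewb_hit = mewb_we & ((mewb_rd == idex_rs1) | (mewb_rd == idex_rs2));

    assign stall = exme_hit | mewb_hit;
endmodule
```

While `stall` is high the earlier stages hold their contents, which gives the in-flight instruction time to write back before the dependent instruction proceeds.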

Control hazards occur when conditional or interrupt branching instructions are in the pipeline and their result has not been calculated yet. This results in subsequent instructions entering the pipeline when they should not be executed due to the conditional branch. To detect this hazard, a global flag is set for instructions that perform branching or conditional execution. Once the outcome of the conditional check is known, stages after decode are allowed to commit their results. Fortunately, this technique is fairly simple to implement.

This project's RISC processor implements these two hazard detectors and solutions to resolve them. The data hazard resolver implements a valid signal that is passed forward from stage to stage. This signal is low when a hazard has occurred and indicates that the receiving stage should not operate on the previous stage's data. Each stage's valid signal is dependent on the previous stage's valid signal. This allows later stages to stall when a hazard is detected in previous stages. A diagram of the implementation of these hazards in the processor is shown in Figure 9.3.
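A forwarded valid signal of this kind could be built from a pipeline register bank along the lines of the sketch below. It is a minimal example under assumed names (`hazard`, `valid_in`, `valid_out`) and does not reproduce the actual Vmicro16 register banks.

```verilog
// Sketch of one pipeline register bank with a forwarded valid signal.
// Names are illustrative assumptions.
module pipe_stage_sketch #(
    parameter WIDTH = 16
) (
    input  wire             clk,
    input  wire             reset,
    input  wire             hazard,     // hazard detected for the incoming data
    input  wire             valid_in,   // valid signal from the previous stage
    input  wire [WIDTH-1:0] data_in,
    output reg              valid_out,  // low: next stage must ignore data_out
    output reg  [WIDTH-1:0] data_out
);
    always @(posedge clk)
        if (reset) begin
            valid_out <= 1'b0;
            data_out  <= {WIDTH{1'b0}};
        end else begin
            // Valid only if the previous stage was valid and no hazard occurred
            valid_out <= valid_in & ~hazard;
            if (valid_in & ~hazard)
                data_out <= data_in;
        end
endmodule
```

Chaining several of these banks means that an invalid stage automatically produces an invalid successor one clock later, which is the bubble behaviour described above.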

#### Memory Management Unit

It was decided to use a memory management unit (MMU) to make communication with external peripherals or additional registers easier and more extensible. This method transparently reuses the existing LW/SW instructions, which removes the requirement for a unique instruction for each peripheral.



Figure 9.3: Pipeline data hazard detection. The register selects are passed forward through each stage and compared to the IDEX (latest instruction) register selects. If they match, the latest instruction depends on the output of an instruction in the pipeline, so the IFID and IDEX stages are stalled to allow that instruction to commit.

#### Proposed Memory Mapped Addresses

The peripheral addresses are currently based on classes. For example, a memory-mapped address may use the upper byte to address a peripheral and the lower byte to address a register/function in that peripheral.

Later in the project, I plan to rewrite the addressing scheme to use a simpler address format which is closer to commonly used peripheral addressing schemes used today. The proposed memory mapped addresses for each system and peripheral are listed below.

| Address (16-bit aligned) | Peripheral Name                                                              |
|--------------------------|------------------------------------------------------------------------------|
| 0x0000                   | NOP (reads return 0, writes do nothing)                                      |
| 0x00ZZ                   | Per-core scratch RAM (ZZ = 8-bit RAM address)                                |
| 0x0100                   | Extended Core Registers 1                                                    |
| 0x0200                   | Extended Core Registers 2                                                    |
| 0x03ZZ                   | Wishbone Master controller select (ZZ contains 8-bit wishbone slave address) |
| 0x1XYZ                   | Master core controller (X = slave select, Y = instruction, Z = data)         |

Table 9.1: Provisional memory-mapped addresses table.
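The class-based decoding in Table 9.1 could be sketched as a combinational address decoder. This is only an illustration of the provisional map: the select signal names are hypothetical, and giving address 0x0000 priority over the scratch RAM range is an assumption, since the table lists both.

```verilog
// Sketch of an address decoder for the provisional map in Table 9.1.
// Signal names and the 0x0000 priority rule are assumptions.
module mmu_decode_sketch (
    input  wire [15:0] addr,
    output wire        sel_nop,     // 0x0000
    output wire        sel_ram,     // 0x00ZZ (excluding 0x0000)
    output wire        sel_regs1,   // 0x01ZZ, Extended Core Registers 1
    output wire        sel_regs2,   // 0x02ZZ, Extended Core Registers 2
    output wire        sel_wb,      // 0x03ZZ, Wishbone master
    output wire        sel_master,  // 0x1XYZ, master core controller
    output wire [7:0]  sub_addr     // lower byte: register/function select
);
    wire [7:0] cls = addr[15:8];    // upper byte selects the peripheral class

    assign sel_nop    = (addr == 16'h0000);
    assign sel_ram    = (cls == 8'h00) & ~sel_nop;
    assign sel_regs1  = (cls == 8'h01);
    assign sel_regs2  = (cls == 8'h02);
    assign sel_wb     = (cls == 8'h03);
    assign sel_master = (addr[15:12] == 4'h1);
    assign sub_addr   = addr[7:0];
endmodule
```

With this structure, adding a peripheral only requires a new class comparison; the LW/SW datapath itself is unchanged.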

#### **ALU Design**

The Vmicro16's ALU is an asynchronous module that has 3 inputs: data a; data b; and opcode op, and outputs data value c. The ALU is able to operate on both register data (rd1 and rd2) and immediate values. A switch is used to set the b input to either the rd2 or imm value from the previous stage.

Currently, the ALU does not store flags to indicate overflow, equality, or zero values in the module itself. Instead the ALU outputs the result of the CMP, which calculates such flags, to be written back to the register set in the write-back stage. This means that in order to perform a conditional operation, such as a branch, the register containing the CMP flags must be included in the instruction.
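The structure described above (a combinational ALU with an operand switch on the b input and CMP flags returned as the result value) can be sketched as follows. The internal operation encodings, the flag bit order in the result, and all port names are hypothetical; signed and unsigned variants are not distinguished in this simplified example.

```verilog
// Sketch of a combinational ALU with an rd2/imm operand switch.
// Operation encodings, flag packing and names are assumptions.
module alu_sketch #(
    parameter DATA_WIDTH = 16
) (
    input  wire [DATA_WIDTH-1:0] rd1,      // register data a
    input  wire [DATA_WIDTH-1:0] rd2,      // register data b
    input  wire [DATA_WIDTH-1:0] imm,      // immediate from the previous stage
    input  wire                  use_imm,  // 1: b = imm, 0: b = rd2
    input  wire [3:0]            op,
    output reg  [DATA_WIDTH-1:0] c
);
    localparam MSB = DATA_WIDTH - 1;
    localparam OP_ADD = 4'h0, OP_SUB = 4'h1, OP_OR  = 4'h2, OP_XOR = 4'h3,
               OP_AND = 4'h4, OP_NOT = 4'h5, OP_LSH = 4'h6, OP_RSH = 4'h7,
               OP_CMP = 4'h8;

    wire [DATA_WIDTH-1:0] a    = rd1;
    wire [DATA_WIDTH-1:0] b    = use_imm ? imm : rd2;
    wire [DATA_WIDTH-1:0] diff = a - b;

    // Flags derived from a - b
    wire flag_z = (a == b);
    wire flag_s = diff[MSB];
    wire flag_o = (a[MSB] != b[MSB]) & (diff[MSB] != a[MSB]);

    always @(*)
        case (op)
            OP_ADD:  c = a + b;
            OP_SUB:  c = diff;
            OP_OR:   c = a | b;
            OP_XOR:  c = a ^ b;
            OP_AND:  c = a & b;
            OP_NOT:  c = ~b;
            OP_LSH:  c = a << b;
            OP_RSH:  c = a >> b;
            // CMP returns the flags as a data value for write-back
            OP_CMP:  c = {{(DATA_WIDTH-3){1'b0}}, flag_s, flag_z, flag_o};
            default: c = {DATA_WIDTH{1'b0}};
        endcase
endmodule
```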



Figure 9.4: Vmicro16 ALU diagram showing clocked inputs from the previous IDEX stage being

The Verilog implementation of the ALU is shown in Figure 9.5. The ALU's asynchronous output is clocked with other registers, such as destination register rs1 and other control signals, in the EXME register bank.

```
    input                                      mmu_lwex,
    input                                      mmu_swex,
    output reg [MEM_WIDTH-1:0]                 mmu_out,

    // interrupts
    output reg [`DATA_WIDTH*`DEF_NUM_INT-1:0]  ints_vector,
    output reg [`DEF_NUM_INT-1:0]              ints_mask,

    // TO APB interconnect
    output reg [`APB_WIDTH-1:0]                M_PADDR,
    output reg                                 M_PWRITE,
    output reg                                 M_PSELx,
    output reg                                 M_PENABLE,
    output reg [MEM_WIDTH-1:0]                 M_PWDATA,
```

Figure 9.5: Vmicro16's ALU implementation named vmicro16\_alu. vmicro16.v

#### Decoder Design

Instruction decoding occurs between the IFID and IDEX stages. The decoder extracts register selects and operands from the input instruction. The decoder outputs are asynchronous, which allows the register selects to be passed to the register set and the register data to be read asynchronously. The register selects and register read data are then clocked into the IDEX register bank.

```
    //`define TEST_BR
    `ifdef TEST_BR
    mem[0] = {`VMICRO16_OP_MOVI,    3'h0, 8'h0};
    mem[1] = {`VMICRO16_OP_MOVI,    3'h3, 8'h3};
    mem[2] = {`VMICRO16_OP_MOVI,    3'h1, 8'h2};
    mem[3] = {`VMICRO16_OP_ARITH_U, 3'h0, 3'h1, 5'b11111};
    mem[4] = {`VMICRO16_OP_BR,      3'h3, `VMICRO16_OP_BR_U};
    mem[5] = {`VMICRO16_OP_MOVI,    3'h0, 8'hFF};
    `endif

    //`define ALL_TEST
    `ifdef ALL_TEST
    // Standard all test
    // REGS0
    mem[0] = {`VMICRO16_OP_MOVI,    3'h0, 8'h81};
    mem[1] = {`VMICRO16_OP_SW,      3'h1, 3'h0, 5'h0}; // MMU[0x81] = 6
    mem[2] = {`VMICRO16_OP_SW,      3'h2, 3'h0, 5'h1}; // MMU[0x82] = 6
    // GPIO0
    mem[3] = {`VMICRO16_OP_MOVI,    3'h0, 8'h90};
    mem[4] = {`VMICRO16_OP_MOVI,    3'h1, 8'hD};
    mem[5] = {`VMICRO16_OP_SW,      3'h1, 3'h0, 5'h0};
    mem[6] = {`VMICRO16_OP_LW,      3'h2, 3'h0, 5'h0};
```

Figure 9.6: Vmicro16's decoder module code showing nested bit switches to determine the intended opcode. vmicro16.v

In Figure 9.6, it can be seen that the first 8 opcode cases are represented using the same 15-11 bits, however the VMICRO16\_OP\_BIT instructions require another bit range to be compared to determine the output opcode.
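The nested switching described here can be illustrated with the short sketch below. The opcode and sub-opcode bit patterns come from Figure 9.1, but the module name, the internal `alu_op` encodings, and the `has_imm4` output are hypothetical and only cover the BIT and ARITH families.

```verilog
// Sketch of a two-level decoder: switch on bits 15-11, then on bits 4-0.
// Bit patterns follow Figure 9.1; names and alu_op encodings are assumptions.
module decoder_sketch (
    input  wire [15:0] instr,
    output reg  [3:0]  alu_op,
    output reg         has_imm4   // bit 4 clear: low 4 bits are an immediate
);
    localparam ALU_NOP = 4'h0, ALU_OR  = 4'h1, ALU_XOR = 4'h2, ALU_AND = 4'h3,
               ALU_NOT = 4'h4, ALU_LSH = 4'h5, ALU_RSH = 4'h6,
               ALU_ADD = 4'h7, ALU_SUB = 4'h8;

    always @(*) begin
        alu_op   = ALU_NOP;
        has_imm4 = 1'b0;
        case (instr[15:11])
            5'b00011: begin                  // BIT family
                case (instr[4:0])
                    5'b00000: alu_op = ALU_OR;
                    5'b00001: alu_op = ALU_XOR;
                    5'b00010: alu_op = ALU_AND;
                    5'b00011: alu_op = ALU_NOT;
                    5'b00100: alu_op = ALU_LSH;
                    5'b00101: alu_op = ALU_RSH;
                    default:  alu_op = ALU_NOP;
                endcase
            end
            5'b00110: begin                  // ARITH_U family
                casez (instr[4:0])
                    5'b11111: alu_op = ALU_ADD;                          // UADD
                    5'b10000: alu_op = ALU_SUB;                          // USUB
                    5'b0????: begin alu_op = ALU_ADD; has_imm4 = 1'b1; end // UADDI
                    default:  alu_op = ALU_NOP;
                endcase
            end
            5'b00111: begin                  // ARITH_S family
                casez (instr[4:0])
                    5'b11111: alu_op = ALU_ADD;                          // SADD
                    5'b10000: alu_op = ALU_SUB;                          // SSUB
                    5'b0????: begin alu_op = ALU_SUB; has_imm4 = 1'b1; end // SSUBI
                    default:  alu_op = ALU_NOP;
                endcase
            end
            default: alu_op = ALU_NOP;
        endcase
    end
endmodule
```

The outer `case` stays small because each family shares one opcode; only the families that reuse spare bits need the inner switch.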

#### 9.1.3 Verification

Currently, the only verification method used is manual inspection of the output waveforms of a test bench. For now, it is easier and faster to spot erroneous states by hand due to the large complexity of the pipeline. Later in the project, automatic test benches will be utilised.

#### **Known Bugs**

Known bugs exist within the RISC core; however, none are critical as they can be easily avoided in software.

#### BUG1 Stall detection does not consider load/store instructions.

Due to the instruction pipelining techniques used by the processor and the lack of address checking in the EXME and MEWB stages, an LW instruction immediately after an SW instruction to the same address, for example SW R0 (R2+16) followed by LW R1 (R2+16), will not return the previously stored value. In addition, because the target address is calculated by the ALU (e.g. R2+16), detecting matching addresses at the IFID and IDEX stages is not trivial, and so a hardware fix is not planned for the final version. It is possible to overcome this problem in software by placing at least 5 NOP instructions after each SW.
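The software workaround can be illustrated with a small program memory initialisation in the style of Figure 9.6. This is a hypothetical sketch: the opcode values are taken from Figure 9.1, the macro names follow Figure 9.6, and the module name, memory size, and the offset of 4 are chosen purely for illustration.

```verilog
// Sketch of the BUG1 software workaround: pad each SW with 5 NOPs.
// Opcode values follow Figure 9.1; everything else is an assumption.
`define VMICRO16_OP_NOP 5'b00000
`define VMICRO16_OP_LW  5'b00001
`define VMICRO16_OP_SW  5'b00010

module bug1_workaround_sketch;
    reg [15:0] mem [0:7];
    integer i;

    initial begin
        // SW R0, (R2+4)
        mem[0] = {`VMICRO16_OP_SW, 3'h0, 3'h2, 5'h4};
        // Five NOPs so the store leaves the pipeline before the load
        for (i = 1; i <= 5; i = i + 1)
            mem[i] = {`VMICRO16_OP_NOP, 11'h0};
        // LW R1, (R2+4) now reads the stored value
        mem[6] = {`VMICRO16_OP_LW, 3'h1, 3'h2, 5'h4};
        mem[7] = {`VMICRO16_OP_NOP, 11'h0};
    end
endmodule
```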

# Chapter 10

# Future Work

| Section | Title                     |
|---------|---------------------------|
| 10.1    | Project Status            |
| 10.1.1  | Updated Project Time Line |
| 10.1.2  | Future Work               |

# 10.1 Project Status

Four months have passed since the start of the project and significant progress has been made towards the final deliverable.



Figure 10.1: Caption for BRAMex

The current active stage is 3.3 Pipeline Implementation and Verification, where the processor pipeline is being verified against a range of simple software sequences. It is important that this verification is thorough and the output is bug free, as future additions to the processor will utilise this foundation.

### 10.1.1 Updated Project Time Line

The project table described in section 8.2 did not allocate times for stages 4.1 and later. This was due to expected high demand from other modules and exams in this time period and so it was decided to not allocate times that would later not be followed.

Now that this time period is closer, time allocations have been assigned for stages 4, 7, and 8. The state of stage 5's extended deliverables, to implement debugging interfaces, has changed from *Unknown* to *Cancelled* due to the expected high workload from other modules in the next month. The cancellation of these stages will not severely affect the final functionality of the deliverable; however, it will make debugging the processor slightly more difficult. It was decided to remove these extended features to allow more time to be spent on core functionality.

The updated project status is shown in Table 10.1 and in Figure 10.2.

#### 10.1.2 Future Work

May and early June are reserved for work on other modules and preparation for exams. From mid-June, work will resume on verifying the end of stage 3 and then work will start on stage 4 (focussed on designing and implementing multiprocessor features). After stage 4, software algorithms will be compiled for the ISA and evaluated against Amdahl's Law.



Figure 10.2: Updated project Gantt chart showing time allocations for stage 4.

| Stage | Title                                        | Start Date | Core | Status    |
|-------|----------------------------------------------|------------|------|-----------|
| 1.0   | Research                                     | Feb 04     | X    | Completed |
| 1.1   | Requirement gathering/review                 | Feb 11     | X    | Completed |
| 1.1   | Processor specification, architecture, ISA   | Feb 18     | X    | Completed |
| 1.2   | Stage/Time Allocation Planning               | Feb 25     | X    | Completed |
| 2.1   | Decoder, Register Set, impl & integration    | Feb 25     | X    | Completed |
| 2.2   | Register set impl & integration              | Mar 04     | X    | Completed |
| 2.3   | Local memory impl & integration              | Mar 11     | X    | Completed |
| 3.1   | Memory mapped register layout & impl         | Apr 01     |      | On-going  |
| 3.2   | Wishbone peripheral bus connected to MMU     | Apr 08     |      | On-going  |
| 3.3   | Pipeline implementation and verification     | Apr 15     |      | On-going  |
| 3.4   | Cache memory design & impl                   | Apr 22     |      | Cancelled |
| 4.1   | Multi-core communication interface           | Jun 05     | X    | Planned   |
| 4.2   | Shared-memory controller                     | Jun 05     | X    | Planned   |
| 4.3   | Scalable multi-core interface (10s of cores) | Jul 01     | X    | Planned   |
| 4.4   | Multi-core example program (reduction)       | Jul 10     | X    | Planned   |
| 5.1   | SPI-FPGA interface for OTG programming       | TBD        |      | Cancelled |
| 5.2   | FPGA-PC interfacing                          | TBD        |      | Cancelled |
| 5.3   | FPGA-PC debugging (instruction breakpoints)  | TBD        |      | Cancelled |
| 6.1   | Compiler backend for vmicro16                | TBD        |      | Unknown   |
| 6.2   | Compiler support for multi-core codegen      | TBD        |      | Unknown   |
| 7.1   | Wishbone peripherals for demo                | Aug 01     | X    | Planned   |
| 8.1   | Final Report                                 | Jun 05     | X    | Planned   |

Table 10.1: Updated project stages.

# Chapter 11

# Conclusion

With the end of Moore's Law looming, processor designers must use other strategies to continue improving processor performance, with multiprocessing and parallelism being a primary strategy. This project sets out to improve my knowledge of multiprocessor communication by designing, implementing, and verifying a multiprocessor, and I believe starting from scratch is the best way to accomplish this learning task.

To date, a compact 16-bit RISC instruction set has been designed and implemented in a Verilog single-core processor. Whilst single-core verification is still on-going, good progress has been made and extended deliverables from stage 3, such as instruction pipelining and memory-mapped peripherals via a Wishbone bus, have been implemented successfully.

Stage 5's extended deliverables and the cache memory have been cancelled, but they do not affect the core functionality of the processor. The planned project time line for future stages is realistic and accomplishing the project's goals appears achievable.


# References

- [1] V. Subramanian, "Multiple gate field-effect transistors for future CMOS technologies," *IETE Technical review*, vol. 27, no. 6, pp. 446–454, 2010.

- [2] M. J. Flynn, Computer architecture: Pipelined and parallel processor design. Jones & Bartlett Learning, 1995.
- [3] Tech Differences, "Difference between loosely coupled and tightly coupled multiprocessor system (with comparison chart)," Jul 2017. [Online]. Available: https://techdifferences.com/difference-between-loosely-coupled-and-tightly-coupled-multiprocessor-system.html (Accessed 2019-04-20).
- [4] L. Benini and G. De Micheli, "Networks on Chips: A new SoC paradigm," *Computer*, vol. 35, pp. 70–78, 02 2002.
- [5] D. Zhu, L. Chen, S. Yue, T. M. Pinkston, and M. Pedram, "Balancing On-Chip Network Latency in Multi-application Mapping for Chip-Multiprocessors," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 872–881.
- [6] N. Chatterjee, S. Paul, and S. Chattopadhyay, "Fault-tolerant dynamic task mapping and scheduling for network-on-chip-based multicore platform," ACM Transactions on Embedded Computing Systems, vol. 16, pp. 1–24, 05 2017.
- [7] Xilinx, Spartan-6 FPGA Block RAM Resources, Xilinx.
- [8] Altera, Recommended HDL Coding Styles QII51007-9.0.0, Altera.
- [9] B. Lancaster, "FPGA-based RISC Microprocessor and Compiler," vol. 3.14, pp. 37–50. [Online]. Available: https://github.com/bendl/prco304 (Accessed March 2018).
- [10] Terasic Technologies, "SoC Platform Cyclone DE1-SoC Board." [Online]. Available: https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=836 (Accessed 2019-04-20).
- [11] MiniSpartan6+, Scarab Hardware, 2014. [Online]. Available: https://www.scarabhardware.com/minispartan6/ (Accessed 2019-04-20).
- [12] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," arXiv preprint arXiv:1801.01203, 2018.

# Appendix A - Code Listing

# top\_ms.v

The top level implementation file is described here.

```
module seven_display # (
    parameter INVERT = 1
) (
    input  [3:0] n,
    output [6:0] segments
);
    reg [6:0] bits;
    assign segments = (INVERT ? ~bits : bits);

    // Segment bits are ordered {g,f,e,d,c,b,a}; 1 = segment lit before inversion
    always @(n)
        case (n)
            4'h0: bits = 7'b0111111; // 0
            4'h1: bits = 7'b0000110; // 1
            4'h2: bits = 7'b1011011; // 2
            4'h3: bits = 7'b1001111; // 3
            4'h4: bits = 7'b1100110; // 4
            4'h5: bits = 7'b1101101; // 5
            4'h6: bits = 7'b1111101; // 6
            4'h7: bits = 7'b0000111; // 7
            4'h8: bits = 7'b1111111; // 8
            4'h9: bits = 7'b1101111; // 9
            4'hA: bits = 7'b1110111; // A
            4'hB: bits = 7'b1111100; // B
            4'hC: bits = 7'b0111001; // C
            4'hD: bits = 7'b1011110; // D
            4'hE: bits = 7'b1111001; // E
            4'hF: bits = 7'b1110001; // F
        endcase
endmodule

// minispartan6+ XC6SLX9
module top_ms # (
    parameter GPIO_PINS = 8
) (
    input           CLK50,
    input  [3:0]    SW,
    // UART
    //input         RXD,
    output          TXD,
    // Peripherals
    output [7:0]    LEDS,
    // SSD
    output [6:0]    ssd0,
    output [6:0]    ssd1,
    output [6:0]    ssd2,
    output [6:0]    ssd3,
    output [6:0]    ssd4,
    output [6:0]    ssd5
);
    // Power-on reset: hold reset high for POR_CLKS clocks after configuration
    localparam POR_CLKS = 8;
    reg [3:0] por_timer = 0;
    reg       por_done  = 0;
    reg       por_reset = 1;

    always @(posedge CLK50)
        if (!por_done) begin
            por_reset <= 1;
            if (por_timer < POR_CLKS)
                por_timer <= por_timer + 1;
            else
                por_done <= 1;
        end
        else
            por_reset <= 0;

    //wire [15:0]   M_PADDR;
    //wire          M_PWRITE;
    //wire [5-1:0]  M_PSELx;   // not shared
    //wire          M_PENABLE;
    //wire [15:0]   M_PWDATA;
    //wire [15:0]   M_PRDATA;  // input to intercon
    //wire          M_PREADY;  // input to intercon
```

```
73
74
75
76
                             wire [7:0] gpio0;
wire [15:0] gpio1;
wire [7:0] gpio2;
                             vmicro16_soc soc (
.clk (CLK50),
  79
80
                                     .clk (CLK50),
.reset (por_reset | (~SW[0])),
   81
   82
                                         //.M_PADDR (M_PADDR),
//.M_PWRITE (M_PWRITE),
   83
   84
  85
86
                                         //.M_PSELx (M_PSELx),
//.M_PENABLE (M_PENABLE),
  87
88
                                         //.M_PWDATA
//.M_PRDATA
                                                                               (M_PWDATA), (M_PRDATA),
   89
                                         //.M_PREADY
                                                                             (M_PREADY),
   90
                                         .uart_tx (TXD),
.gpio0 (LEDS[3:0]),
  91
                                         .gpio1
                                                                  (gpio1), (gpio2),
  93
  95
                                         //.dbug0 (LEDS[3:0]),
.dbug1 (LEDS[7:4])
  97
  99
                            // SSD displays (split across 2 gpio ports 1 and 2)
wire [3:0] ssd_chars [0:5];
assign ssd_chars[0] = gpio1[3:0];
assign ssd_chars[1] = gpio1[7:4];
assign ssd_chars[2] = gpio1[11:8];
assign ssd_chars[3] = gpio1[15:12];
assign ssd_chars[4] = gpio2[3:0];
assign ssd_chars[6] = gpio2[7:4];
seven_display ssd_0 (.n(ssd_chars[0]), .segments (ssd0));
seven_display ssd_1 (.n(ssd_chars[1]), .segments (ssd1));
seven_display ssd_2 (.n(ssd_chars[2]), .segments (ssd2));
seven_display ssd_3 (.n(ssd_chars[3]), .segments (ssd3));
seven_display ssd_4 (.n(ssd_chars[4]), .segments (ssd3));
seven_display ssd_5 (.n(ssd_chars[4]), .segments (ssd4));
seven_display ssd_5 (.n(ssd_chars[5]), .segments (ssd5));
                              // SSD displays (split across 2 gpio ports 1 and 2)
101
103
104
105
106
107
108
109
110
111
112
113
114
                  endmodule
115
```

# apb\_intercon.v

## The


## vmicro16.v

The single core RISC processor is defined in this file. It contains many submodules such as the decoder and local memory.

```
// This file contains multiple modules.
//   Verilator likes 1 file for each module
/* verilator lint_off DECLFILENAME */
/* verilator lint_off UNUSED */
/* verilator lint_off BLKSEQ */
/* verilator lint_off WIDTH */

// Include Vmicro16 ISA containing definitions for the bits
`include "vmicro16_isa.v"

`include "clog2.v"
`include "formal.v"
```

```
module vmicro16_bram_apb # (
    parameter BUS_WIDTH = 16,
    parameter MEM_WIDTH = 16,
    parameter MEM_DEPTH,
    parameter APB_PADDR
) (
    input clk,
    input reset,
    // APB Slave to master interface
    input  [`clog2(MEM_DEPTH)-1:0]  S_PADDR,
    input                           S_PWRITE,
    input                           S_PSELx,
    input                           S_PENABLE,
    input  [BUS_WIDTH-1:0]          S_PWDATA,

    output [BUS_WIDTH-1:0]          S_PRDATA,
    output                          S_PREADY
);
    wire [MEM_WIDTH-1:0] mem_out;

    assign S_PRDATA = (S_PSELx & S_PENABLE) ? mem_out : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1    : 1'b0;
    assign we       = (S_PSELx & S_PENABLE & S_PWRITE);

    always @(*)
        if (S_PSELx && S_PENABLE)
            $display($time, "\t\tMEM => %h", mem_out);

    always @(posedge clk)
        if (we)
            $display($time, "\t\tBRAM[%h] <= %h", S_PADDR, S_PWDATA);

    vmicro16_bram # (
        .MEM_WIDTH  (MEM_WIDTH),
        .MEM_DEPTH  (MEM_DEPTH),
        .NAME       ("BRAM")
    ) bram_apb (
        .clk        (clk),
        .reset      (reset),

        .mem_addr   (S_PADDR),
        .mem_in     (S_PWDATA),
        .mem_we     (we),
        .mem_out    (mem_out)
    );
endmodule

// This module aims to be a SYNCHRONOUS, WRITE_FIRST BLOCK RAM
//   https://www.xilinx.com/support/documentation/user_guides/ug473_7Series_Memory_Resources.pdf
//   https://www.xilinx.com/support/documentation/user_guides/ug383.pdf
//   https://www.xilinx.com/support/documentation/sw_manuals/xilinx2016_4/ug901-vivado-synthesis.pdf
module vmicro16_bram # (
    parameter MEM_WIDTH         = 16,
    parameter MEM_DEPTH         = 64,
    parameter CORE_ID           = 0,
    parameter USE_INITS         = 0,
    parameter PARAM_DEFAULTS_R0 = 0,
    parameter PARAM_DEFAULTS_R1 = 0,
    parameter NAME              = "BRAM"
) (
    input clk,
    input reset,

    input      [`clog2(MEM_DEPTH)-1:0] mem_addr,
    input      [MEM_WIDTH-1:0]         mem_in,
    input                              mem_we,
    output reg [MEM_WIDTH-1:0]         mem_out
);
    // memory vector
    reg [MEM_WIDTH-1:0] mem [0:MEM_DEPTH-1];

    // not synthesizable
    integer i;
    initial begin
        for (i = 0; i < MEM_DEPTH; i = i + 1) mem[i] = 0;
        mem[0] = PARAM_DEFAULTS_R0;
        mem[1] = PARAM_DEFAULTS_R1;

        if (USE_INITS) begin
            //`define TEST_SW
            `ifdef TEST_SW

            `endif

            `ifdef TEST_ASM
            $readmemh("E:\\Projects\\uni\\vmicro16\\sw\\asm.s.hex", mem);
            `endif

            //`define TEST_COMPILER
            `ifdef TEST_COMPILER
            mem[0] = 16'h2f3f;
            mem[1] = 16'h2903;
            mem[2] = 16'h4100;
            mem[3] = 16'h3fa1;
            mem[4] = 16'h16e0;
```

```
            mem[5] = 16'h26e0;
            mem[6] = 16'h3fa1;
            mem[7] = 16'h2890;
            mem[8] = 16'h10d8;
            mem[9] = 16'h3fa1;
            mem[10] = 16'h3fa1;
            mem[11] = 16'h10d9;
            mem[12] = 16'h3fa1;
            mem[13] = 16'h2892;
            mem[14] = 16'h10da;
            mem[15] = 16'h3fa1;
            mem[16] = 16'h28a0;
            mem[17] = 16'h10db;
            mem[18] = 16'h3fa1;
            mem[19] = 16'h2880;
            mem[20] = 16'h10dc;
            mem[21] = 16'h3fa1;
            mem[22] = 16'h28b0;
            mem[23] = 16'h10dd;
            mem[24] = 16'h3fa1;
            mem[25] = 16'h28b1;
            mem[26] = 16'h10de;
            mem[27] = 16'h3fa1;
            mem[28] = 16'h08dc;
            mem[29] = 16'h0800;
            mem[30] = 16'h3fa1;
            mem[31] = 16'h10e0;
            mem[32] = 16'h2801;
            mem[33] = 16'h0be0;
            mem[34] = 16'h37a1;
            mem[35] = 16'h4b00;
            mem[36] = 16'h5001;
            mem[37] = 16'h2b00;
            mem[38] = 16'h4860;
            mem[39] = 16'h292c;
            mem[40] = 16'h4101;
            mem[41] = 16'h2864;
            mem[42] = 16'h292e;
            mem[43] = 16'h4100;
            mem[44] = 16'h0000;
            mem[45] = 16'h28c8;
            mem[46] = 16'h0000;
            mem[47] = 16'h08dc;
            mem[48] = 16'h08d0;
            mem[49] = 16'h3fa1;
            mem[50] = 16'h10e0;
            mem[51] = 16'h2805;
            mem[52] = 16'h0be0;
            mem[53] = 16'h37a1;
            mem[54] = 16'h5860;
            mem[55] = 16'h10df;
            mem[56] = 16'h08df;
            mem[57] = 16'h3fa1;
            mem[58] = 16'h10e0;
            mem[59] = 16'h2830;
            mem[60] = 16'h0be0;
            mem[61] = 16'h37a1;
            mem[62] = 16'h307f;
            mem[63] = 16'h3fa1;
            mem[64] = 16'h10e0;
            mem[65] = 16'h08db;
            mem[66] = 16'h0be0;
            mem[67] = 16'h37a1;
            mem[68] = 16'h1300;
            mem[69] = 16'h2832;
            mem[70] = 16'h27c0;
            mem[71] = 16'h0ee0;
            mem[72] = 16'h37a1;
            mem[73] = 16'h6000;
            `endif

            //`define TEST_COND
            `ifdef TEST_COND
            mem[0] = {`VMICRO16_OP_MOVI,    3'h7, 8'hC0}; // lock
            mem[1] = {`VMICRO16_OP_MOVI,    3'h7, 8'hC0}; // lock
            `endif

            //`define TEST_CMP
            `ifdef TEST_CMP
            mem[0] = {`VMICRO16_OP_MOVI,    3'h0, 8'h0A};
            mem[1] = {`VMICRO16_OP_MOVI,    3'h1, 8'h0B};
            mem[2] = {`VMICRO16_OP_CMP,     3'h1, 3'h0, 5'h1};
            `endif

            //`define TEST_LWEX
            `ifdef TEST_LWEX
            mem[0] = {`VMICRO16_OP_MOVI,    3'h0, 8'hC5};
            mem[1] = {`VMICRO16_OP_SW,      3'h0, 3'h0, 5'h1};
            mem[2] = {`VMICRO16_OP_LW,      3'h2, 3'h0, 5'h1};
            mem[3] = {`VMICRO16_OP_LWEX,    3'h2, 3'h0, 5'h1};
            mem[4] = {`VMICRO16_OP_SWEX,    3'h3, 3'h0, 5'h1};
            `endif

            //`define TEST_MULTICORE
            `ifdef TEST_MULTICORE
            mem[0] = {`VMICRO16_OP_MOVI,    ...};
            mem[1] = {`VMICRO16_OP_MOVI,    3'h1, 8'h33};
            mem[2] = {`VMICRO16_OP_SW,      3'h1, 3'h0, 5'h0};
            mem[3] = {`VMICRO16_OP_MOVI,    3'h0, 8'h80};
```

```
            mem[4] = {`VMICRO16_OP_LW,      3'h2, 3'h0, 5'h0};
            mem[5] = {`VMICRO16_OP_MOVI,    3'h1, 8'h33};
            mem[6] = {`VMICRO16_OP_MOVI,    3'h1, 8'h33};
            mem[7] = {`VMICRO16_OP_MOVI,    3'h1, 8'h33};
            mem[8] = {`VMICRO16_OP_MOVI,    3'h0, 8'h91};
            mem[9] = {`VMICRO16_OP_SW,      3'h2, 3'h0, 5'h0};
            `endif

            //`define TEST_BR
            `ifdef TEST_BR
            mem[0] = {`VMICRO16_OP_MOVI,    3'h0, 8'h0};
            mem[1] = {`VMICRO16_OP_MOVI,    3'h3, 8'h3};
            mem[2] = {`VMICRO16_OP_MOVI,    3'h1, 8'h2};
            mem[3] = {`VMICRO16_OP_ARITH_U, 3'h0, 3'h1, 5'b11111};
            mem[4] = {`VMICRO16_OP_BR,      3'h3, `VMICRO16_OP_BR_U};
            mem[5] = {`VMICRO16_OP_MOVI,    3'h0, 8'hFF};
            `endif

            //`define ALL_TEST
            `ifdef ALL_TEST
            // Standard all test
            // REGS0
            mem[0] = {`VMICRO16_OP_MOVI,    3'h0, 8'h81};
            mem[1] = {`VMICRO16_OP_SW,      3'h1, 3'h0, 5'h0}; // MMU[0x81] = 6
            mem[2] = {`VMICRO16_OP_SW,      3'h2, 3'h0, 5'h1}; // MMU[0x82] = 6
            // GPIO0
            mem[3] = {`VMICRO16_OP_MOVI,    3'h0, 8'h90};
            mem[4] = {`VMICRO16_OP_MOVI,    3'h1, 8'hD};
            mem[5] = {`VMICRO16_OP_SW,      3'h1, 3'h0, 5'h0};
            mem[6] = {`VMICRO16_OP_LW,      ...};
            // TIM0
            mem[7] = {`VMICRO16_OP_MOVI,    3'h0, 8'h07};
            mem[8] = {`VMICRO16_OP_LW,      3'h3, 3'h0, 5'h03};
            // UART0
            mem[9]  = {`VMICRO16_OP_MOVI,   3'h0, 8'hA0};      // UART0
            mem[10] = {`VMICRO16_OP_MOVI,   3'h1, 8'h41};      // ascii A
            mem[11] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            mem[12] = {`VMICRO16_OP_MOVI,   3'h1, 8'h42};      // ascii B
            mem[13] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            mem[14] = {`VMICRO16_OP_MOVI,   3'h1, 8'h43};      // ascii C
            mem[15] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            mem[16] = {`VMICRO16_OP_MOVI,   3'h1, 8'h44};      // ascii D
            mem[17] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            mem[18] = {`VMICRO16_OP_MOVI,   3'h1, 8'h45};      // ascii E
            mem[19] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            mem[20] = {`VMICRO16_OP_MOVI,   3'h1, 8'h46};      // ascii F
            mem[21] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            // BRAM0
            mem[22] = {`VMICRO16_OP_MOVI,   3'h0, 8'hC0};
            mem[23] = {`VMICRO16_OP_MOVI,   3'h1, 8'hA};
            mem[24] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h5};
            mem[25] = {`VMICRO16_OP_LW,     3'h2, 3'h0, 5'h5};
            // GPIO1 (SSD 24-bit port)
            mem[26] = {`VMICRO16_OP_MOVI,   3'h0, 8'h91};
            mem[27] = {`VMICRO16_OP_MOVI,   3'h1, 8'h12};
            mem[28] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            mem[29] = {`VMICRO16_OP_LW,     3'h2, 3'h0, 5'h0};
            // GPIO2
            mem[30] = {`VMICRO16_OP_MOVI,   3'h0, 8'h92};
            mem[31] = {`VMICRO16_OP_MOVI,   3'h1, 8'h56};
            mem[32] = {`VMICRO16_OP_SW,     3'h1, 3'h0, 5'h0};
            `endif

            //`define TEST_BRAM
            `ifdef TEST_BRAM
            // 2 core BRAM0 test
            mem[0] = {`VMICRO16_OP_MOVI,    3'h0, 8'hC0};
            mem[1] = {`VMICRO16_OP_MOVI,    3'h1, 8'hA};
            mem[2] = {`VMICRO16_OP_SW,      3'h1, 3'h0, 5'h5};
            mem[3] = {`VMICRO16_OP_LW,      3'h2, 3'h0, 5'h5};
            `endif
        end
    end

    always @(posedge clk) begin
        // synchronous WRITE_FIRST (page 13)
        if (mem_we) begin
            mem[mem_addr] <= mem_in;
        end else
            mem_out <= mem[mem_addr];
    end

    // TODO: Reset impl = every clock while reset is asserted, clear each cell
    //       one at a time, mem[i++] <= 0
endmodule

module vmicro16_core_mmu # (
    parameter MEM_WIDTH    = 16,
    parameter MEM_DEPTH,

    parameter CORE_ID      = 3'h0,
    parameter CORE_ID_BITS = `clog2(`CORES)
) (
    input clk,
    input reset,
```


```
    output busy,

    // From core
    input      [MEM_WIDTH-1:0]  mmu_addr,
    input      [MEM_WIDTH-1:0]  mmu_in,
    input                       mmu_we,
    input                       mmu_lwex,
    input                       mmu_swex,
    output reg [MEM_WIDTH-1:0]  mmu_out,

    // interrupts
    output reg [`DATA_WIDTH*`DEF_NUM_INT-1:0] ints_vector,
    output reg [`DEF_NUM_INT-1:0]             ints_mask,

    // TO APB interconnect
    output reg [`APB_WIDTH-1:0] M_PADDR,
    output reg                  M_PWRITE,
    output reg                  M_PSELx,
    output reg                  M_PENABLE,
    output reg [MEM_WIDTH-1:0]  M_PWDATA,
    // from interconnect
    input      [MEM_WIDTH-1:0]  M_PRDATA,
    input                       M_PREADY
);
    localparam MMU_STATE_T1 = 0;
    localparam MMU_STATE_T2 = 1;
    localparam MMU_STATE_T3 = 2;
    reg [1:0]            mmu_state = MMU_STATE_T1;

    reg  [MEM_WIDTH-1:0] per_out = 0;
    wire [MEM_WIDTH-1:0] tim0_out;

    assign busy = req || (mmu_state == MMU_STATE_T2);

    // tightly integrated memory usage
    wire tim0_en = (mmu_addr >= `DEF_MMU_TIM0_S)
                && (mmu_addr <= `DEF_MMU_TIM0_E);
    wire sreg_en = (mmu_addr >= `DEF_MMU_SREG_S)
                && (mmu_addr <= `DEF_MMU_SREG_E);

    // Special register selects
    localparam SPECIAL_REGS = 8;
    wire [MEM_WIDTH-1:0] sr_val;

    // Interrupt vector and mask
    initial ints_vector = 0;
    initial ints_mask   = 0;
    wire [2:0] intv_addr = mmu_addr[`clog2(`DEF_NUM_INT)-1:0];
    always @(posedge clk)
        if (intv_we)
            ints_vector[intv_addr*`DATA_WIDTH +: `DATA_WIDTH] <= mmu_in;

    always @(posedge clk)
        if (intm_we)
            ints_mask <= mmu_in;

    always @(ints_vector)
        $display($time, "\tC%d\t\tints_vector W: | %h ", CORE_ID,
            ints_vector[0*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[1*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[2*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[3*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[4*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[5*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[6*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[7*`DATA_WIDTH +: `DATA_WIDTH]
        );

    always @(intm_we)
        $display($time, "\tC%d\t\tintm_we W: %b", CORE_ID, ints_mask);

    // Output port: select the source of the value returned to the core
    always @(*)
        if      (tim0_en) mmu_out = tim0_out;
        else if (sreg_en) mmu_out = sr_val;
        else if (intv_en) mmu_out = ints_vector[mmu_addr[2:0]*`DATA_WIDTH +: `DATA_WIDTH];
        else if (intm_en) mmu_out = ints_mask;
        else              mmu_out = per_out;

    // APB master to slave interface
    always @(posedge clk)
        if (reset) begin
            mmu_state <= MMU_STATE_T1;
            M_PENABLE <= 0;
            M_PADDR   <= 0;
            M_PWDATA  <= 0;
            M_PSELx   <= 0;
            M_PWRITE  <= 0;
```

```
\frac{414}{415}
                     end
else
                                MMU_STATE_T1: begin
417
                                      _SIAIL_II: Degin
if (req && apb_en) begin
    M_PADDR <= {mmu_lwex, mmu_swex, CORE_ID[CORE_ID_BITS-1:0], mmu_addr[MEM_WIDTH-1:0]};
    M_PWDATA <= mmu_in;
    M_PSELx <= 1;
    M_PWRITE <= mmu_we;
419
420
421
422
423
424
                                            mmu_state <= MMU_STATE_T2;</pre>
425
                                      end
\frac{426}{427}
428
                                 `ifdef FIX T3
                                      MMU_STATE_T2: begin
430
                                           M_PENABLE <= 1;
431
                                           if (M_PREADY == 1'b1) begin
    mmu_state <= MMU_STATE_T3;
end</pre>
432
433
434
435
436
                                       MMU_STATE_T3: begin
                                             // Slave has output a ready signal (finished)
438
                                            M_PENABLE <= 0;
M_PADDR <= 0;
M_PWDATA <= 0;
439
440
                                            m_rwdAlA <= U;
M_PSELx <= 0;
M_PWRITE <= 0;
// Clock the peripheral output into a reg,
// to output on the next clock cycle
per_out <= M_PRDATA;</pre>
442
444
446
447
                                            mmu state <= MMU STATE T1:
448
449
                                      end
                                 `else
450
                                      // No FIX_T3
MMU_STATE_T2: begin
\frac{451}{452}
                                            if (M_PREADY == 1'b1) begin
M_PENABLE <= 0;</pre>
453
454
                                                 M_PENABLE <= 0;

M_PADDR <= 0;

M_PWDATA <= 0;

M_PSELx <= 0;

M_PWRITE <= 0;
455
456
457
458
\frac{459}{460}
                                                  // Clock the peripheral output into a reg,
// to output on the next clock cycle
461
                                                  per_out <= M_PRDATA;</pre>
462
463
                                                  mmu_state <= MMU_STATE_T1;</pre>
                                            end else begin
M_PENABLE <= 1;
464
465
466
                                            end
                                       end
467
468
469
                           endcase
471
                vmicro16 bram # (
472
                      .MEM_WIDTH (MEM_WIDTH),
473
                      .MEM_DEPTH (SPECIAL_REGS),
                      .USE_INITS (0),
.PARAM_DEFAULTS_RO (CORE_ID),
475
                      .PARAM_DEFAULTS_R1 (`DEF_INT_MASK),
.NAME ("ram_sr")
477
                ) ram_sr (
479
                     .clk
                                       (clk).
480
                      .reset
                                       (reset),
                                       (mmu_addr[`clog2(SPECIAL_REGS)-1:0]),
481
                      .mem addr
482
                      .mem_in
483
                      .mem_we
                                       (),
484
                     .mem_out
                                       (sr_val)
485
486
    // Each M core has a TIM0 scratch memory
    vmicro16_bram # (
        .MEM_WIDTH  (MEM_WIDTH),
        .MEM_DEPTH  (MEM_DEPTH),
        .USE_INITS
        .NAME       ("TIM0")
    ) TIM0 (
        .clk        (clk),
        .reset      (reset),
        .mem_addr   (mmu_addr[7:0]),
        .mem_in     (mmu_in),
        .mem_we     (tim0_we),
        .mem_out    (tim0_out)
    );
endmodule

module vmicro16_regs # (
    parameter CELL_WIDTH        = 16,
    parameter CELL_DEPTH        = 8,
    parameter CELL_SEL_BITS     = `clog2(CELL_DEPTH),
    parameter CELL_DEFAULTS     = 0,
    parameter DEBUG_NAME        = "",
    parameter CORE_ID           = 0,
    parameter PARAM_DEFAULTS_RO = 16'h0000,
```

```
    parameter PARAM_DEFAULTS_R1 = 16'h0000
) (
    input clk,
    input reset,
    // Dual port register reads
    input  [CELL_SEL_BITS-1:0]  rs1, // port 1
    output [CELL_WIDTH-1   :0]  rd1,
    //input  [CELL_SEL_BITS-1:0]  rs2, // port 2
    //output [CELL_WIDTH-1   :0]  rd2,
    // EX/WB final stage write back
    input                       we,
    input  [CELL_SEL_BITS-1:0]  ws1,
    input  [CELL_WIDTH-1:0]     wd
);
    reg [CELL_WIDTH-1:0] regs [0:CELL_DEPTH-1] /*verilator public_flat*/;

    // Initialise registers with default values
    //   Really only used for special registers used by the soc
    // TODO: How to do this on reset?
    integer i;
    initial
        if (CELL_DEFAULTS)
            $readmemh(CELL_DEFAULTS, regs);
        else begin
            for(i = 0; i < CELL_DEPTH; i = i + 1)
                regs[i] = 0;
            regs[0] = PARAM_DEFAULTS_RO;
            regs[1] = PARAM_DEFAULTS_R1;
        end

    always @(regs)
        $display($time, "\tC%02h\t\t| %h %h %h %h | %h %h %h %h | ",
            CORE_ID, regs[0], regs[1], regs[2], regs[3],
            regs[4], regs[5], regs[6], regs[7]);

    always @(posedge clk)
        if (reset) begin
            for(i = 0; i < CELL_DEPTH; i = i + 1)
                regs[i] <= 0;
            regs[0] <= PARAM_DEFAULTS_RO;
            regs[1] <= PARAM_DEFAULTS_R1;
        end
        else if (we) begin
            $display($time, "\tC%02h: REGS #%s: Writing %h to reg[%d]",
                CORE_ID, DEBUG_NAME, wd, ws1);

            // Perform the write
            regs[ws1] <= wd;
        end

    // sync writes, async reads
    assign rd1 = regs[rs1];
    //assign rd2 = regs[rs2];
endmodule

module vmicro16_regs_apb # (
    parameter BUS_WIDTH         = 16,
    parameter DATA_WIDTH        = 16,
    parameter CELL_DEPTH        = 8,
    parameter PARAM_DEFAULTS_RO = 0,
    parameter PARAM_DEFAULTS_R1 = 0
) (
    input clk,
    input reset,
    // APB Slave to master interface
    input  [`clog2(CELL_DEPTH)-1:0] S_PADDR,
    input                           S_PWRITE,
    input                           S_PSELx,
    input                           S_PENABLE,
    input  [DATA_WIDTH-1:0]         S_PWDATA,

    output [DATA_WIDTH-1:0]         S_PRDATA,
    output                          S_PREADY
);
    wire [DATA_WIDTH-1:0] rd1;

    assign S_PRDATA = (S_PSELx & S_PENABLE) ? rd1  : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1 : 1'b0;
    assign reg_we   = (S_PSELx & S_PENABLE & S_PWRITE);

    always @(*)
        if (reg_we)
            $display($time, "\t\tREGS_APB[%h] <= %h", S_PADDR, S_PWDATA);

    always @(*)
        `rassert(reg_we == (S_PSELx & S_PENABLE & S_PWRITE))

    vmicro16_regs # (
        .CELL_DEPTH         (CELL_DEPTH),
        .CELL_WIDTH         (DATA_WIDTH),
        .PARAM_DEFAULTS_RO  (PARAM_DEFAULTS_RO),
        .PARAM_DEFAULTS_R1  (PARAM_DEFAULTS_R1)
    ) regs_apb (
        .clk    (clk),
        .reset  (reset),
```

```
        .rs1    (S_PADDR),
        .rd1    (rd1),
        //.rs2
        //.rd2
        .we     (reg_we),
        .ws1    (S_PADDR),
        .wd     (S_PWDATA) // either alu_c or mem_out
    );
endmodule

module vmicro16_gpio_apb # (
    parameter BUS_WIDTH  = 16,
    parameter DATA_WIDTH = 16,
    parameter PORTS      = 8,
    parameter NAME       = "GPIO"
) (
    input clk,
    input reset,
    // APB Slave to master interface
    input  [0:0]            S_PADDR, // not used (optimised out)
    input                   S_PWRITE,
    input                   S_PSELx,
    input                   S_PENABLE,
    input  [DATA_WIDTH-1:0] S_PWDATA,

    output [DATA_WIDTH-1:0] S_PRDATA,
    output                  S_PREADY,
    output reg [PORTS-1:0]  gpio
);
    assign S_PRDATA = (S_PSELx & S_PENABLE) ? gpio : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1 : 1'b0;
    assign ports_we = (S_PSELx & S_PENABLE & S_PWRITE);

    always @(posedge clk)
        if (reset)
            gpio <= 0;
        else if (ports_we) begin
            gpio <= S_PWDATA[PORTS-1:0];
        end
endmodule

// Decoder is hard to parameterise as it's very closely linked to the ISA.
module vmicro16_dec # (
    parameter INSTR_WIDTH    = 16,
    parameter INSTR_OP_WIDTH = 5,
    parameter INSTR_RS_WIDTH = 3,
    parameter ALU_OP_WIDTH   = 5
) (
    //input clk,   // not used yet (all combinational)
    //input reset, // not used yet (all combinational)

    input  [INSTR_WIDTH-1:0]    instr,

    output [INSTR_OP_WIDTH-1:0] opcode,
    output [INSTR_RS_WIDTH-1:0] rd,
    output [INSTR_RS_WIDTH-1:0] ra,
    output [3:0]                imm4,
    output [7:0]                imm8,
    output [11:0]               imm12,
    output [4:0]                simm5,

    // This can be freely increased without affecting the isa
    output reg [ALU_OP_WIDTH-1:0] alu_op,

    output reg has_imm4,
    output reg has_imm8,
    output reg has_imm12,
    output reg has_we,
    output reg has_br,
    output reg has_mem,
    output reg has_mem_we,
    output reg has_cmp,

    output halt,
    output intr,

    output reg has_lwex,
    output reg has_swex

    // TODO: Use to identify bad instruction and
    //       raise exceptions
    //,output     is_bad
);
    assign opcode = instr[15:11];
    assign rd     = instr[10:8];
    assign ra     = instr[7:5];
    assign imm4   = instr[3:0];
    assign imm8   = instr[7:0];
    assign imm12  = instr[11:0];
    assign simm5  = instr[4:0];

    // exme_op
```

```
    always @(*) case (opcode)
        `VMICRO16_OP_SPCL: casez(instr[11:0])
            `VMICRO16_OP_SPCL_NOP,
            `VMICRO16_OP_SPCL_HALT,
            `VMICRO16_OP_SPCL_INTR:   alu_op = `VMICRO16_ALU_NOP;
            default:                  alu_op = `VMICRO16_ALU_NOP; endcase

        `VMICRO16_OP_LW:              alu_op = `VMICRO16_ALU_LW;
        `VMICRO16_OP_SW:              alu_op = `VMICRO16_ALU_SW;
        `VMICRO16_OP_LWEX:            alu_op = `VMICRO16_ALU_LW;
        `VMICRO16_OP_SWEX:            alu_op = `VMICRO16_ALU_SW;

        `VMICRO16_OP_MOV:             alu_op = `VMICRO16_ALU_MOV;
        `VMICRO16_OP_MOVI:            alu_op = `VMICRO16_ALU_MOVI;

        `VMICRO16_OP_BR:              alu_op = `VMICRO16_ALU_BR;
        `VMICRO16_OP_MULT:            alu_op = `VMICRO16_ALU_MULT;

        `VMICRO16_OP_CMP:             alu_op = `VMICRO16_ALU_CMP;
        `VMICRO16_OP_SETC:            alu_op = `VMICRO16_ALU_SETC;

        `VMICRO16_OP_BIT: casez (simm5)
            `VMICRO16_OP_BIT_OR:      alu_op = `VMICRO16_ALU_BIT_OR;
            `VMICRO16_OP_BIT_XOR:     alu_op = `VMICRO16_ALU_BIT_XOR;
            `VMICRO16_OP_BIT_AND:     alu_op = `VMICRO16_ALU_BIT_AND;
            `VMICRO16_OP_BIT_NOT:     alu_op = `VMICRO16_ALU_BIT_NOT;
            `VMICRO16_OP_BIT_LSHFT:   alu_op = `VMICRO16_ALU_BIT_LSHFT;
            `VMICRO16_OP_BIT_RSHFT:   alu_op = `VMICRO16_ALU_BIT_RSHFT;
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

        `VMICRO16_OP_ARITH_U: casez (simm5)
            `VMICRO16_OP_ARITH_UADD:  alu_op = `VMICRO16_ALU_ARITH_UADD;
            `VMICRO16_OP_ARITH_USUB:  alu_op = `VMICRO16_ALU_ARITH_USUB;
            `VMICRO16_OP_ARITH_UADDI: alu_op = `VMICRO16_ALU_ARITH_UADDI;
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

        `VMICRO16_OP_ARITH_S: casez (simm5)
            `VMICRO16_OP_ARITH_SADD:  alu_op = `VMICRO16_ALU_ARITH_SADD;
            `VMICRO16_OP_ARITH_SSUB:  alu_op = `VMICRO16_ALU_ARITH_SSUB;
            `VMICRO16_OP_ARITH_SSUBI: alu_op = `VMICRO16_ALU_ARITH_SSUBI;
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

        default: begin
            alu_op = `VMICRO16_ALU_NOP;
            $display($time, "\tDEC: unknown opcode: %h ... NOPPING", opcode);
        end
    endcase

    // Special opcodes
    //assign nop == ((opcode == `VMICRO16_OP_SPCL) & (~instr[0]));
    assign halt = ((opcode == `VMICRO16_OP_SPCL) & instr[0]);
    assign intr = ((opcode == `VMICRO16_OP_SPCL) & instr[1]);

    always @(*) case (opcode)
        `VMICRO16_OP_LW,
        `VMICRO16_OP_MOV,
        `VMICRO16_OP_MOVI,
        //`VMICRO16_OP_MOVI_L,
        `VMICRO16_OP_ARITH_U,
        `VMICRO16_OP_ARITH_S,
        `VMICRO16_OP_SETC,
        `VMICRO16_OP_BIT,
        `VMICRO16_OP_MULT:  has_we = 1'b1;
        default:            has_we = 1'b0;
    endcase

    // Contains 4-bit immediate
    always @(*)
        if ( ((opcode == `VMICRO16_OP_ARITH_U) && (simm5[4] == 0)) ||
             ((opcode == `VMICRO16_OP_ARITH_S) && (simm5[4] == 0)) )
            has_imm4 = 1'b1;
        else
            has_imm4 = 1'b0;

    // Contains 8-bit immediate
    always @(*) case (opcode)
        `VMICRO16_OP_BR:    has_imm8 = 1'b1;
        default:            has_imm8 = 1'b0;
    endcase

    //// Contains 12-bit immediate
    //always @(*) case (opcode)
    //    `VMICRO16_OP_MOVI_L:    has_imm12 = 1'b1;
    //    default:                has_imm12 = 1'b0;
    //endcase

    // Will branch the pc
    always @(*) case (opcode)
        `VMICRO16_OP_BR:    has_br = 1'b1;
        default:            has_br = 1'b0;
    endcase

    // Requires external memory
    always @(*) case (opcode)
        `VMICRO16_OP_LW,
```

```
        `VMICRO16_OP_SW,
        `VMICRO16_OP_LWEX,
        `VMICRO16_OP_SWEX:  has_mem = 1'b1;
        default:            has_mem = 1'b0;
    endcase

    always @(*) case (opcode)
        `VMICRO16_OP_SWEX:  has_mem_we = 1'b1;
        default:            has_mem_we = 1'b0;
    endcase

    // Affects status registers (cmp instructions)

    // Performs exclusive checks
        default:            has_lwex = 1'b0;
    endcase
endmodule

module vmicro16_alu # (
    parameter OP_WIDTH   = 5,
    parameter DATA_WIDTH = 16,
    parameter CORE_ID    = 0
) (
    // input clk, // TODO: make clocked

    input      [OP_WIDTH-1:0]   op,
    input      [DATA_WIDTH-1:0] a, // rs1/dst
    input      [DATA_WIDTH-1:0] b, // rs2
    input      [3:0]            flags,
    output reg [DATA_WIDTH-1:0] c
);
    localparam TOP_BIT = (DATA_WIDTH-1);
    // 17-bit register
    reg [DATA_WIDTH:0] cmp_tmp = 0; // = {carry, [15:0]}
    wire r_setc;

    always @(*) begin
        cmp_tmp = 0;
        case (op)
        // branch/nop, output nothing
        `VMICRO16_ALU_BR,
        `VMICRO16_ALU_NOP:          c = {DATA_WIDTH{1'b0}};

        `VMICRO16_ALU_LW,

        // bitwise operations
        `VMICRO16_ALU_BIT_OR:       c = a | b;
        `VMICRO16_ALU_BIT_XOR:      c = a ^ b;
        `VMICRO16_ALU_BIT_AND:      c = a & b;
        `VMICRO16_ALU_BIT_NOT:      c = ~(b);

        `VMICRO16_ALU_MOV:          c = b;
        `VMICRO16_ALU_MOVI:         c = b;
        `VMICRO16_ALU_MOVI_L:       c = b;

        `VMICRO16_ALU_ARITH_UADDI:  c = a + b;

        `ifdef DEF_ALU_HW_MULT
        `VMICRO16_ALU_MULT:         c = a * b;
        `endif

        `VMICRO16_ALU_ARITH_SSUBI:  c = $signed(a) - $signed(b);

        `VMICRO16_ALU_CMP: begin
            // TODO: Do a-b in 17-bit register
            //       Set zero, overflow, carry, signed bits in result
            cmp_tmp = a - b;
            c = 0;

            // N   Negative condition code flag
            // Z   Zero condition code flag
            // C   Carry condition code flag
            // V   Overflow condition code flag

```

```
            // Overflow flag
            // https://stackoverflow.com/questions/30957188/
            // https://github.com/bendl/prco304/blob/master/prco_core/rtl/prco_alu.v#L50
            case(cmp_tmp[TOP_BIT+1:TOP_BIT])
                2'b01:   c[`VMICRO16_SFLAG_V] = 1;
                2'b10:   c[`VMICRO16_SFLAG_V] = 1;
                default: c[`VMICRO16_SFLAG_V] = 0;
            endcase

            $display($time, "\tC%02h: ALU CMP: %h %h = %h = %b", CORE_ID, a, b, cmp_tmp, c[3:0]);
        end

        `VMICRO16_ALU_SETC: c = { {15{1'b0}}, r_setc };

        // TODO: Parameterise
        default: begin
            $display($time, "\tALU: unknown op: %h", op);
            c = 0;
            cmp_tmp = 0;
        end
        endcase
    end

    branch setc_check (
        .flags  (flags),
        .cond   (b[7:0]),
        .en     (r_setc)
    );
endmodule

// flags = 4 bit r_cmp_flags register
// cond  = 8 bit VMICRO16_OP_BR_? value. See vmicro16_isa.v
module branch (
    input [3:0] flags,
    input [7:0] cond,
    output reg  en
);
    always @(*)
        case (cond)
            `VMICRO16_OP_BR_U:  en = 1;
            `VMICRO16_OP_BR_E:  en = (flags[`VMICRO16_SFLAG_Z] == 1);
            `VMICRO16_OP_BR_NE: en = (flags[`VMICRO16_SFLAG_Z] == 0);
            `VMICRO16_OP_BR_G:  en = (flags[`VMICRO16_SFLAG_Z] == 0) &&
                                     (flags[`VMICRO16_SFLAG_N] == flags[`VMICRO16_SFLAG_V]);
            `VMICRO16_OP_BR_L:  en = (flags[`VMICRO16_SFLAG_Z] == 0) &&
                                     (flags[`VMICRO16_SFLAG_N] != flags[`VMICRO16_SFLAG_V]);
            `VMICRO16_OP_BR_GE: en = (flags[`VMICRO16_SFLAG_N] == flags[`VMICRO16_SFLAG_V]);
            `VMICRO16_OP_BR_LE: en = (flags[`VMICRO16_SFLAG_Z] == 1) ||
                                     (flags[`VMICRO16_SFLAG_N] != flags[`VMICRO16_SFLAG_V]);
        endcase
endmodule

module vmicro16_core # (
    parameter DATA_WIDTH
    parameter MEM_INSTR_DEPTH
    parameter MEM_SCRATCH_DEPTH = 64,

    parameter CORE_ID           = 3'h0
) (
    input        clk,
    input        reset,

    output [7:0] dbug,

    // interrupt sources
    input  [`DEF_NUM_INT-1:0]             ints,
    input  [`DEF_NUM_INT*`DATA_WIDTH-1:0] ints_data,
    output [`DEF_NUM_INT-1:0]

    // APB master to slave interface (apb_intercon)
    output [`APB_WIDTH-1:0]  w_PADDR,
    output                   w_PWRITE,
    output                   w_PSELx,
    output                   w_PENABLE,
    output [DATA_WIDTH-1:0]  w_PWDATA,
    input  [DATA_WIDTH-1:0]  w_PRDATA,
    input                    w_PREADY
);
    localparam STATE_IF   = 0;
    localparam STATE_R1   = 1;
    localparam STATE_R2   = 2;
    localparam STATE_ME   = 3;
    localparam STATE_WB   = 4;
    localparam STATE_FE   = 5;
    localparam STATE_IDLE = 6;
    localparam STATE_HALT = 7;

    reg  [2:0]            r_state    = STATE_IF;
    reg  [DATA_WIDTH-1:0] r_pc       = 16'h0000;
    reg  [DATA_WIDTH-1:0] r_pc_saved = 16'h0000;
    reg  [DATA_WIDTH-1:0] r_instr    = 16'h0000;
    wire [DATA_WIDTH-1:0] w_mem_instr_out;

    assign dbug = {7'h00, w_halt};

```

```
    wire [4:0]            r_instr_opcode;
    wire [4:0]            r_instr_alu_op;
    wire [2:0]            r_instr_rsd;
    wire [2:0]            r_instr_rsa;
    reg  [DATA_WIDTH-1:0] r_instr_rdd = 0;
    reg  [DATA_WIDTH-1:0] r_instr_rda = 0;
    wire [3:0]            r_instr_imm4;
    wire [7:0]            r_instr_imm8;
    wire [4:0]            r_instr_simm5;
    wire                  r_instr_has_imm4;
    wire                  r_instr_has_imm8;
    wire                  r_instr_has_we;
    wire                  r_instr_has_br;
    wire                  r_instr_has_cmp;
    wire                  r_instr_has_mem;
    wire                  r_instr_has_mem_we;
    wire                  r_instr_halt;
    wire                  r_instr_has_lwex;
    wire                  r_instr_has_swex;

    wire [DATA_WIDTH-1:0] r_alu_out;

    wire                  r_mem_scratch_busy;

    wire [DATA_WIDTH-1:0] r_reg_rd1_i;
    wire [DATA_WIDTH-1:0] r_reg_rd1 = regs_use_int ? r_reg_rd1_i : r_reg_rd1_s;
    //wire [15:0] r_reg_rd2;

    // branching
    wire       w_intr;
    wire       w_branch_en;
    wire       w_branching = r_instr_has_br && w_branch_en;
    reg  [3:0] r_cmp_flags = 4'h00; // N, Z, C, V

    always @(r_cmp_flags)
        $display($time, "\tC%02h:\tALU CMP: %b", CORE_ID, r_cmp_flags);

    // 2 cycle register fetch
    always @(*) begin
        r_reg_rs1 = 0;
        if (r_state == STATE_R1)
            r_reg_rs1 = r_instr_rsd;
        else if (r_state == STATE_R2)
            r_reg_rs1 = r_instr_rsa;
        else
            r_reg_rs1 = 3'h0;
    end

    wire [`DEF_NUM_INT*`DATA_WIDTH-1:0] ints_vector;
    wire [`DEF_NUM_INT-1:0]             ints_mask;
    wire                                has_int = ints & ints_mask;
    reg int_pending     = 0;
    reg int_pending_ack = 0;
    reg regs_use_int    = 0;
    always @(posedge clk)
        if (int_pending_ack)
            // We've now branched to the isr
            int_pending <= 0;
        else if (has_int)
            // Notify fsm to switch to the ints_vector at the last stage
            int_pending <= 1;
        else if (w_intr)
            // Return to Interrupt instruction called,
            //   so we've finished with the interrupt
            int_pending <= 0;

    // cpu state machine
    always @(posedge clk)
        if (reset) begin
            r_pc              <= 0;
            r_state           <= STATE_IF;
            r_instr           <= 0;
            r_mem_scratch_req <= 0;
            r_instr_rdd       <= 0;
            r_instr_rda       <= 0;
        end
        else begin
            if (r_state == STATE_IF) begin
                if (w_halt) begin
                    $display("");
                    $display("");
                    $display($time, "\tC%02h: PC: %h HALT", CORE_ID, r_pc);
                    r_state <= STATE_HALT;
                end else begin
                    r_instr <= w_mem_instr_out;

                    $display("");
                    $display($time, "\tC%02h: PC: %h", CORE_ID, r_pc);
```

```
                    $display($time, "\tC%02h: INSTR: %h", CORE_ID, w_mem_instr_out);

                    r_state <= STATE_R1;
                end
            end
            else if (r_state == STATE_R1) begin
                // primary operand
                r_instr_rdd <= r_reg_rd1;
                r_state     <= STATE_R2;
            end
            else if (r_state == STATE_R2) begin
                // Choose secondary operand (register or immediate)
                if      (r_instr_has_imm8) r_instr_rda <= r_instr_imm8;
                else if (r_instr_has_imm4) r_instr_rda <= r_reg_rd1 + r_instr_imm4;
                else                       r_instr_rda <= r_reg_rd1;

                if (r_instr_has_mem) begin
                    r_state <= STATE_ME;
                    // Pulse req
                    r_mem_scratch_req <= 1;
                end else
                    r_state <= STATE_WB;
            end
            else if (r_state == STATE_ME) begin
                // Pulse req
                r_mem_scratch_req <= 0;
                // Wait for MMU to finish
                if (!r_mem_scratch_busy)
                    r_state <= STATE_WB;
            end
            else if (r_state == STATE_WB) begin

                    r_cmp_flags <= r_alu_out[3:0];

                    // TODO: check bounds

                    regs_use_int    <= 1;
                    int_pending_ack <= 1;
                    // Jump to ISR
                    r_pc            <= ints_vector[0 +: `DATA_WIDTH];

                else if (w_intr) begin

                else if (r_pc < (MEM_INSTR_DEPTH-1)) begin
                    r_pc            <= r_pc + 1;
                    int_pending_ack <= 0;
                end

                r_state <= STATE_FE;
            end
            else if (r_state == STATE_FE) begin
                r_state <= STATE_IF;
            end
            else if (r_state == STATE_HALT) begin
            end
        end

    // Instruction ROM
    vmicro16_bram # (
        .MEM_WIDTH  (DATA_WIDTH),
        .MEM_DEPTH  (MEM_INSTR_DEPTH),
        .CORE_ID    (CORE_ID),
        .USE_INITS  (1),
        .NAME       ("INSTR_MEM")
    ) mem_instr (
        .clk        (clk),
        .reset      (reset),
        // port 1
        .mem_addr   (r_pc),
        .mem_in     (0),
        .mem_we     (1'b0), // ROM
        .mem_out    (w_mem_instr_out)
    );

    vmicro16_core_mmu # (
        .MEM_WIDTH      (DATA_WIDTH),
        .MEM_DEPTH      (MEM_SCRATCH_DEPTH),
        .CORE_ID        (CORE_ID)
    ) mmu (
        .clk            (clk),
        .reset          (reset),
        .req            (r_mem_scratch_req),
        .busy           (r_mem_scratch_busy),
        // interrupts
        .ints_vector    (ints_vector),
```

```
        .ints_mask      (ints_mask),
        // port 1
        .mmu_addr       (r_mem_scratch_addr),
        .mmu_in         (r_mem_scratch_in),
        .mmu_we         (r_mem_scratch_we),
        .mmu_lwex       (r_instr_has_lwex),
        .mmu_swex       (r_instr_has_swex),
        .mmu_out        (r_mem_scratch_out),
        // APB master to slave
        .M_PADDR        (w_PADDR),
        .M_PWRITE       (w_PWRITE),
        .M_PSELx        (w_PSELx),
        .M_PENABLE      (w_PENABLE),
        .M_PWDATA       (w_PWDATA),
        .M_PRDATA       (w_PRDATA),
        .M_PREADY       (w_PREADY)
    );

    // Instruction decoder
    vmicro16_dec dec (
        // input
        .instr          (r_instr),
        // output async
        .opcode
        .rd             (r_instr_rsd),
        .ra             (r_instr_rsa),
        .imm4           (r_instr_imm4),
        .imm8           (r_instr_imm8),
        .imm12          (),
        .simm5
        .alu_op         (r_instr_alu_op),
        .has_imm4       (r_instr_has_imm4),
        .has_imm8       (r_instr_has_imm8),
        .has_we         (r_instr_has_we),
        .has_br         (r_instr_has_br),
        .has_cmp        (r_instr_has_cmp),
        .has_mem        (r_instr_has_mem),
        .has_mem_we     (r_instr_has_mem_we),
        .halt           (w_halt),
        .intr           (w_intr),
        .has_lwex       (r_instr_has_lwex),
        .has_swex       (r_instr_has_swex)
    );

    // Software registers
    vmicro16_regs # (
        .CORE_ID    (CORE_ID),
        .CELL_WIDTH (`DATA_WIDTH)
    ) regs (
        .clk        (clk),
        .reset      (reset),
        // async port 0
        .rs1        (r_reg_rs1),
        .rd1        (r_reg_rd1_s),
        // async port 1
        //.rs2      (),
        //.rd2      (),
        // write port
        .we         (r_reg_we && ~regs_use_int),
        .ws1        (r_instr_rsd),
        .wd         (r_reg_wd)
    );

    // Interrupt replacement registers
    vmicro16_regs # (
        .CORE_ID    (CORE_ID),
        .CELL_WIDTH (`DATA_WIDTH),
        .DEBUG_NAME ("REGSINT")
    ) regs_intr (
        .clk        (clk),
        .reset      (reset),
        // async port 0
        .rs1        (r_reg_rs1),
        .rd1        (r_reg_rd1_i),
        // async port 1
        //.rs2      (),
        //.rd2      (),
        // write port
        .we         (r_reg_we && regs_use_int),
        .ws1        (r_instr_rsd),
        .wd         (r_reg_wd)
    );

    // ALU
    vmicro16_alu # (
        .CORE_ID(CORE_ID)
    ) alu (
        .op     (r_instr_alu_op),
        .a      (r_instr_rdd),
        .b      (r_instr_rda),
        .flags  (r_cmp_flags),
        // async output
        .c      (r_alu_out)
    );

    branch branch_check (
        .flags  (r_cmp_flags),
        .cond   (r_instr_imm8),
        .en     (w_branch_en)
    );

endmodule
```

# vmicro16\_soc.v

```
`include "vmicro16_soc_config.v"
`include "clog2.v"

module timer_apb # (
    parameter CLK_HZ = 50_000_000
) (
    input clk,
    input reset,
    input clk_en,

    // 0 16-bit value     R/W
    // 1 16-bit control   R     b0 = start, b1 = reset
    // 2 16-bit prescaler
    input      [1:0]                S_PADDR,

    input                           S_PWRITE,
    input                           S_PSELx,
    input                           S_PENABLE,
    input      [`DATA_WIDTH-1:0]    S_PWDATA,

    output reg [`DATA_WIDTH-1:0]    S_PRDATA,
    output                          S_PREADY,

    output out,
    output [`DATA_WIDTH-1:0] int_data
);
    //assign S_PRDATA = (S_PSELx & S_PENABLE) ? swex_success ? 16'hF0F0 : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1 : 1'b0;
    wire en = (S_PSELx & S_PENABLE);
    wire we = (en & S_PWRITE);

    reg [`DATA_WIDTH-1:0] r_counter = 0;
    reg [`DATA_WIDTH-1:0] r_load    = 0;
    reg [`DATA_WIDTH-1:0] r_pres    = 0;
    reg [`DATA_WIDTH-1:0] r_ctrl    = 0;

    localparam CTRL_START = 0;
    localparam CTRL_RESET = 1;

    localparam ADDR_LOAD = 2'b00;
    localparam ADDR_CTRL = 2'b01;
    localparam ADDR_PRES = 2'b10;

    always @(*) begin
        S_PRDATA = 0;
        if (en)
            case(S_PADDR)
                ADDR_LOAD: S_PRDATA = r_counter;
                ADDR_CTRL: S_PRDATA = r_ctrl;
                //ADDR_CTRL: S_PRDATA = r_pres;
                default:   S_PRDATA = 0;
            endcase
    end

    // prescaler counts from r_pres to 0, emitting a stb signal
    // to enable the r_counter step
    reg [`DATA_WIDTH-1:0] r_pres_counter = 0;
    wire counter_en = (r_pres_counter == 0);
    always @(posedge clk)
        if (r_pres_counter == 0)
            r_pres_counter <= r_pres;
        else
            r_pres_counter <= r_pres_counter - 1;

    always @(posedge clk)
        if (we)
            case(S_PADDR)
                // Write to the load register:
                //   Set load register
                //   Set counter register
                ADDR_LOAD: begin
                    r_load    <= S_PWDATA;
                    r_counter <= S_PWDATA;
                    $display($time, "\t\ttimr0: WRITE LOAD: %h", S_PWDATA);
                end
                ADDR_CTRL: begin
                    r_ctrl    <= S_PWDATA;
                    $display($time, "\t\ttimr0: WRITE CTRL: %h", S_PWDATA);
                end
                ADDR_PRES: begin
                    r_pres    <= S_PWDATA;
                    $display($time, "\t\ttimr0: WRITE PRES: %h", S_PWDATA);
                end
            endcase
        else
            if (r_ctrl[CTRL_START]) begin
                if (r_counter == 0)
                    r_counter <= r_load;
                else if(counter_en)
                    r_counter <= r_counter -1;
            end else if (r_ctrl[CTRL_RESET])
                r_counter <= r_load;

    // generate the output pulse when r_counter == 0
    // out = (counter reached zero && counter started)
    assign out      = (r_counter == 0) && r_ctrl[CTRL_START];
    assign int_data = {`DATA_WIDTH{1'b1}};
endmodule

// Shared memory with hardware monitor (LWEX/SWEX)
module vmicro16_bram_ex_apb # (
    parameter BUS_WIDTH    = 16,
    parameter MEM_WIDTH    = 16,
    parameter MEM_DEPTH    = 64,
    parameter CORE_ID_BITS = 3,
    parameter SWEX_SUCCESS = 16'h0000,
    parameter SWEX_FAIL    = 16'h0001
) (
    input clk,
    input reset,

    input      [`APB_WIDTH-1:0]     S_PADDR,

    input                           S_PWRITE,
    input                           S_PSELx,
    input                           S_PENABLE,
    input      [MEM_WIDTH-1:0]      S_PWDATA,

    output reg [MEM_WIDTH-1:0]      S_PRDATA,
    output                          S_PREADY
);
    // exclusive flag checks
    wire [MEM_WIDTH-1:0] mem_out;
    wire [MEM_WIDTH-1:0] mem_out_ex;
    reg                  swex_success = 0;

    localparam ADDR_BITS = `clog2(MEM_DEPTH);

    // hack to create a 1 clock delay to S_PREADY
    // for bram to be ready
    reg cdelay = 1;
    always @(posedge clk)
        if (S_PSELx)
            cdelay <= 0;
        else
            cdelay <= 1;

    //assign S_PRDATA = (S_PSELx & S_PENABLE) ? swex_success ? 16'hF0F0 : 16'h0000;
    // ...
    assign we = (S_PSELx & S_PENABLE & S_PWRITE);
    wire   en = (S_PSELx & S_PENABLE);

    // Similar to:
    // http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204f/Cihbghef.html

    // mem_wd is the CORE_ID sent in bits [18:16]
    localparam TOP_BIT_INDEX     = `APB_WIDTH -1;
    localparam PADDR_CORE_ID_MSB = TOP_BIT_INDEX - 2;
    localparam PADDR_CORE_ID_LSB = PADDR_CORE_ID_MSB - (CORE_ID_BITS-1);

    // [LWEX, CORE_ID, mem_addr] from S_PADDR
    wire                    lwex     = S_PADDR[TOP_BIT_INDEX];
    wire                    swex     = S_PADDR[TOP_BIT_INDEX-1];
    wire [CORE_ID_BITS-1:0] core_id  = S_PADDR[PADDR_CORE_ID_MSB:PADDR_CORE_ID_LSB];
    // CORE_ID to write to ex_flags register
    wire [ADDR_BITS-1:0]    mem_addr = S_PADDR[ADDR_BITS-1:0];

    // ...

    // Check exclusive access flags
    always @(*) begin
        swex_success = 0;
        if (en)
            if (swex)
                if (is_locked && !is_locked_self)
                    // someone else has locked it
                    swex_success = 0;
                else if (is_locked && is_locked_self)
                    swex_success = 1;
    end

    always @(*)
        if (swex)
            if (swex_success)
                S_PRDATA = SWEX_SUCCESS;
            else
                S_PRDATA = SWEX_FAIL;
        else
            S_PRDATA = mem_out;

    // ...

    reg [CORE_ID_BITS:0] reg_wd;
    always @(*) begin
        reg_wd = {{CORE_ID_BITS}{1'b0}};

        if (en)
            // if wanting to lock the addr
            if (lwex)
                // and not already locked
                if (!is_locked) begin
                    reg_wd = (core_id + 1);
                end
            else if (swex)
                if (is_locked && is_locked_self)
                    reg_wd = {{CORE_ID_BITS}{1'b0}};
    end

    // Exclusive flag for each memory cell
    vmicro16_bram # (
        .MEM_WIDTH  (CORE_ID_BITS + 1),
        .MEM_DEPTH  (MEM_DEPTH),
        .USE_INITS  (/* ... */),
        .NAME       ("rexram")
    ) ram_exflags (
        .clk        (clk),
        .reset      (reset),

        .mem_addr   (mem_addr),
        .mem_in     (reg_wd),
        .mem_we     (reg_we),
        .mem_out    (ex_flags_read)
    );

    always @(*)
        if (S_PSELx && S_PENABLE)
            $display($time, "\t\tBRAMex[%h] READ %h\tCORE: %h", mem_addr, mem_out, S_PADDR[16 +: CORE_ID_BITS]);

    always @(posedge clk)
        if (we)
            $display($time, "\t\tBRAMex[%h] WRITE %h\tCORE: %h", mem_addr, S_PWDATA, S_PADDR[16 +: CORE_ID_BITS]);

    vmicro16_bram # (
        .MEM_WIDTH  (MEM_WIDTH),
        .MEM_DEPTH  (MEM_DEPTH),
        // ...
        .NAME       ("BRAMexinst")
    ) bram_apb (
        .clk        (clk),
        .reset      (reset),

        .mem_addr   (mem_addr),
        .mem_in     (S_PWDATA),
        .mem_we     (we && swex_success),
        .mem_out    (mem_out)
    );
endmodule

module vmicro16_soc (
    input clk,
    input reset,

    //input                         uart_rx,
    output                          uart_tx,

    output [`APB_GPIO0_PINS-1:0]    gpio0,
    output [`APB_GPIO1_PINS-1:0]    gpio1,
    output [`APB_GPIO2_PINS-1:0]    gpio2,

    output [7:0]                    dbug0,
    output [`CORES*8:0]             dbug1
);
    genvar di;
    generate for(di = 0; di < `CORES; di = di + 1) begin : gen_dbug0
        assign dbug0[di] = dbug1[di*8];
    end
    endgenerate

    // Peripherals (master to slave)
    wire [`APB_WIDTH-1:0]           M_PADDR;
    wire                            M_PWRITE;
    wire [`SLAVES-1:0]              M_PSELx;   // not shared
    wire                            M_PENABLE;
    wire [`DATA_WIDTH-1:0]          M_PWDATA;
    wire [`SLAVES*`DATA_WIDTH-1:0]  M_PRDATA;  // input to intercon
    wire [`SLAVES-1:0]              M_PREADY;  // input

    // Master apb interfaces
    wire [`CORES*`APB_WIDTH-1:0]    w_PADDR;
    wire [`CORES-1:0]               w_PWRITE;
    wire [`CORES-1:0]               w_PSELx;
    wire [`CORES-1:0]               w_PENABLE;
    wire [`CORES*`DATA_WIDTH-1:0]   w_PWDATA;
    wire [`CORES*`DATA_WIDTH-1:0]   w_PRDATA;
    wire [`CORES-1:0]               w_PREADY;

    // Interrupts
    wire [`DEF_NUM_INT-1:0]             ints;
    wire [`DEF_NUM_INT*`DATA_WIDTH-1:0] ints_data;
    assign ints[7:1] = 0;
    assign ints_data[`DEF_NUM_INT*`DATA_WIDTH-1:`DATA_WIDTH] = {`DEF_NUM_INT*(`DATA_WIDTH-1){1'b0}};

    apb_intercon_s # (
        .MASTER_PORTS   (`CORES),
        .SLAVE_PORTS    (`SLAVES),
        .BUS_WIDTH      (`APB_WIDTH),
        .DATA_WIDTH     (`DATA_WIDTH)
    ) apb (
        .clk        (clk),
        .reset      (reset),
        // APB master to slave
        .S_PADDR    (w_PADDR),
        .S_PWRITE   (w_PWRITE),
        .S_PSELx    (w_PSELx),
        .S_PENABLE  (w_PENABLE),
        .S_PWDATA   (w_PWDATA),
        .S_PRDATA   (w_PRDATA),
        .S_PREADY   (w_PREADY),
        // shared bus
        .M_PADDR    (M_PADDR),
        .M_PWRITE   (M_PWRITE),
        .M_PSELx    (M_PSELx),
        .M_PENABLE  (M_PENABLE),
        .M_PWDATA   (M_PWDATA),
        .M_PRDATA   (M_PRDATA),
        .M_PREADY   (M_PREADY)
    );

    vmicro16_gpio_apb # (
        .BUS_WIDTH  (`APB_WIDTH),
        .DATA_WIDTH (`DATA_WIDTH),
        .PORTS      (`APB_GPIO0_PINS),
        .NAME       ("GPIO0")
    ) gpio0_apb (
        .clk        (clk),
        .reset      (reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_GPIO0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_GPIO0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_GPIO0]),
        .gpio       (gpio0)
    );

    // GPIO1 for Seven segment displays (16 pin)
    vmicro16_gpio_apb # (
        .BUS_WIDTH  (`APB_WIDTH),
        .DATA_WIDTH (`DATA_WIDTH),
        .PORTS      (`APB_GPIO1_PINS),
        .NAME       ("GPIO1")
    ) gpio1_apb (
        .clk        (clk),
        .reset      (reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_GPIO1]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_GPIO1*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_GPIO1]),
        .gpio       (gpio1)
    );

    // GPIO2 for Seven segment displays (8 pin)
    vmicro16_gpio_apb # (
        .BUS_WIDTH  (`APB_WIDTH),
        .DATA_WIDTH (`DATA_WIDTH),
        .PORTS      (`APB_GPIO2_PINS),
        .NAME       ("GPIO2")
    ) gpio2_apb (
        .clk        (clk),
        .reset      (reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_GPIO2]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_GPIO2*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_GPIO2]),
        .gpio       (gpio2)
    );

    apb_uart_tx uart0_apb (
        .clk        (clk),
        .reset      (reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_UART0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_UART0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_UART0]),
        // uart wires
        .tx_wire    (uart_tx),
        .rx_wire    (uart_rx)
    );

    timer_apb timr0 (
        .clk        (clk),
        .reset      (reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_TIMR0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_TIMR0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_TIMR0]),
        .out        (ints     [`DEF_INT_TIMR0]),
        .int_data   (ints_data[`DEF_INT_TIMR0*`DATA_WIDTH +: `DATA_WIDTH])
    );

    // Shared register set for system-on-chip info
    // R0 = number of cores
    vmicro16_regs_apb # (
        .BUS_WIDTH          (`APB_WIDTH),
        .DATA_WIDTH         (`DATA_WIDTH),
        .CELL_DEPTH         (8),
        .PARAM_DEFAULTS_R0  (`CORES),
        .PARAM_DEFAULTS_R1  (`SLAVES)
    ) regs0_apb (
        .clk        (clk),
        .reset      (reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_REGS0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_REGS0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_REGS0])
    );

    vmicro16_bram_ex_apb # (
        .BUS_WIDTH      (`APB_WIDTH),
        .MEM_WIDTH      (`DATA_WIDTH),
        .MEM_DEPTH      (`APB_BRAM0_CELLS),
        .CORE_ID_BITS   (`clog2(`CORES))
    ) bram_apb (
        .clk        (clk),
        .reset      (reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_BRAM0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_BRAM0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_BRAM0])
    );

    genvar i;
    generate for(i = 0; i < `CORES; i = i + 1) begin : cores
        vmicro16_core # (
            .CORE_ID            (i),
            .DATA_WIDTH         (`DATA_WIDTH),
            .MEM_INSTR_DEPTH    (`DEF_MEM_INSTR_DEPTH),
            .MEM_SCRATCH_DEPTH  (`DEF_MMU_TIM0_CELLS)
        ) c1 (
            .clk        (clk),
            .reset      (reset),
            .dbug       (dbug1[i*8 +: 8]),

            .ints_data  (ints_data),

            .w_PADDR    (w_PADDR   [`APB_WIDTH*i +: `APB_WIDTH]   ),
            .w_PWRITE   (w_PWRITE  [i]                            ),
            .w_PSELx    (w_PSELx   [i]                            ),
            .w_PENABLE  (w_PENABLE [i]                            ),
            .w_PWDATA   (w_PWDATA  [`DATA_WIDTH*i +: `DATA_WIDTH] ),
            .w_PRDATA   (w_PRDATA  [`DATA_WIDTH*i +: `DATA_WIDTH] ),
            .w_PREADY   (w_PREADY  [i]                            )
        );
    end
    endgenerate

endmodule
```

# vmicro16\_isa.v

```
// Vmicro16 multi-core instruction set

`include "vmicro16_soc_config.v"

// TODO: Remove NOP by making a register write/read always 0
`define VMICRO16_OP_SPCL            5'b00000
`define VMICRO16_OP_LW              5'b00001
`define VMICRO16_OP_SW              5'b00010
`define VMICRO16_OP_BIT             5'b00011
`define VMICRO16_OP_BIT_OR          5'b00000
`define VMICRO16_OP_BIT_XOR         5'b00001
`define VMICRO16_OP_BIT_AND         5'b00010
`define VMICRO16_OP_BIT_NOT         5'b00011
`define VMICRO16_OP_BIT_LSHFT       5'b00100
`define VMICRO16_OP_BIT_RSHFT       5'b00101
`define VMICRO16_OP_MOV             5'b00100
`define VMICRO16_OP_MOVI            5'b00101
`define VMICRO16_OP_ARITH_U         5'b00110
`define VMICRO16_OP_ARITH_UADD      5'b11111
`define VMICRO16_OP_ARITH_USUB      5'b10000
`define VMICRO16_OP_ARITH_UADDI     5'b0????
`define VMICRO16_OP_ARITH_S         5'b00111
`define VMICRO16_OP_ARITH_SADD      5'b11111
`define VMICRO16_OP_ARITH_SSUB      5'b10000
`define VMICRO16_OP_ARITH_SSUBI     5'b0????
`define VMICRO16_OP_BR              5'b01000
`define VMICRO16_OP_CMP             5'b01001
`define VMICRO16_OP_SETC            5'b01010
`define VMICRO16_OP_MULT            5'b01011

`define VMICRO16_OP_LWEX            5'b01101
`define VMICRO16_OP_SWEX            5'b01110

// Special opcodes
`define VMICRO16_OP_SPCL_NOP        11'h000
`define VMICRO16_OP_SPCL_HALT       11'h001
`define VMICRO16_OP_SPCL_INTR       11'h002

// TODO: wasted upper nibble bits in BR
`define VMICRO16_OP_BR_U            8'h00
`define VMICRO16_OP_BR_E            8'h01
`define VMICRO16_OP_BR_NE           8'h02
`define VMICRO16_OP_BR_G            8'h03
`define VMICRO16_OP_BR_GE           8'h04
`define VMICRO16_OP_BR_L            8'h05
`define VMICRO16_OP_BR_LE           8'h06
`define VMICRO16_OP_BR_S            8'h07
`define VMICRO16_OP_BR_NS           8'h08

// flag bit positions
`define VMICRO16_SFLAG_N            4'h03
`define VMICRO16_SFLAG_Z            4'h02
`define VMICRO16_SFLAG_C            4'h01
`define VMICRO16_SFLAG_V            4'h00

// microcode operations
`define VMICRO16_ALU_BIT_OR         5'h00
`define VMICRO16_ALU_BIT_XOR        5'h01
`define VMICRO16_ALU_BIT_AND        5'h02
`define VMICRO16_ALU_BIT_NOT        5'h03
`define VMICRO16_ALU_BIT_LSHFT      5'h04
`define VMICRO16_ALU_BIT_RSHFT      5'h05
`define VMICRO16_ALU_LW             5'h06
`define VMICRO16_ALU_SW             5'h07
`define VMICRO16_ALU_NOP            5'h08
`define VMICRO16_ALU_MOV            5'h09
`define VMICRO16_ALU_MOVI           5'h0a
`define VMICRO16_ALU_MOVI_L         5'h0b
`define VMICRO16_ALU_ARITH_UADD     5'h0c
`define VMICRO16_ALU_ARITH_USUB     5'h0d
`define VMICRO16_ALU_ARITH_SADD     5'h0e
`define VMICRO16_ALU_ARITH_SSUB     5'h0f
`define VMICRO16_ALU_BR_U           5'h10
`define VMICRO16_ALU_BR_E           5'h11
`define VMICRO16_ALU_BR_NE          5'h12
`define VMICRO16_ALU_BR_G           5'h13
`define VMICRO16_ALU_BR_GE          5'h14
`define VMICRO16_ALU_BR_L           5'h15
`define VMICRO16_ALU_BR_LE          5'h16
`define VMICRO16_ALU_BR_S           5'h17
`define VMICRO16_ALU_BR_NS          5'h18
`define VMICRO16_ALU_CMP            5'h19
```