# Multi-core RISC Processor Design and Implementation (Rev. 2.02)

ELEC5881M - Final Report

Ben David Lancaster

Student ID: 201280376

Submitted in accordance with the requirements for the degree of Master of Science (MSc) in Embedded Systems Engineering

Supervisor: Dr. David Cowell

Assessor: Mr David Moore

**University of Leeds** 

School of Electronic and Electrical Engineering

August 26, 2019

Word count: 4689

#### **Abstract**

This interim report details four months of progress on a project to design, implement, and verify a multi-core RISC processor on an FPGA. The project is split into two stages: first, to build a functional single-core RISC processor; second, to add multiprocessor principles and functionality to it.

Current multiprocessor and network-on-chip communication methods are discussed, along with how they could be included in this multi-core RISC design. To date, a 16-bit instruction set architecture has been designed featuring common load/store, comparison, and bitwise instructions. A single-core processor has been implemented in Verilog and verified using simulations and test benches running various simple software programs.

Future tasks have been planned and will focus on the second stage of the project. Work will start on designing a loosely coupled multiprocessor communication interface and adding it to the single-core processor.

# **Declaration of Academic Integrity**

The candidate confirms that the work submitted is his/her own, except where work which has formed part of jointly-authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated in the report. The candidate confirms that appropriate credit has been given within the report where reference has been made to the work of others.

This copy has been supplied on the understanding that no quotation from the report may be published without proper acknowledgement. The candidate, however, confirms his/her consent to the University of Leeds copying and distributing all or part of this work in any forms and using third parties, who might be outside the University, to monitor breaches of regulations, to verify whether this work contains plagiarised material, and for quality assurance purposes.

The candidate confirms that the details of any mitigating circumstances have been submitted to the Student Support Office at the School of Electronic and Electrical Engineering, at the University of Leeds.

Name: Ben David Lancaster

Date: August 26, 2019

# Acknowledgements

I would like to thank my supervisor, David Cowell, and assessor, David Moore, for giving me the opportunity to explore a project of my choosing.

# **Revision History**

| Date       | Version | Changes                        |
|------------|---------|--------------------------------|
| 10/04/2019 | 2.02    | Update future stages.          |
| 05/04/2019 | 2.01    | Fix processor RTL diagram.     |
| 04/04/2019 | 2.00    | Initial processor RTL diagram. |
| 01/04/2019 | 1.00    | Initial section outline.       |

Document revisions.

# **Table of Contents**

| Section | Title                                |
|---------|--------------------------------------|
| 1       | Introduction                         |
| 1.1     | Why Multi-core?                      |
| 1.2     | Why RISC?                            |
| 1.3     | Why FPGA?                            |
| 2       | Background                           |
| 2.1     | Amdahl's Law and Parallelism         |
| 2.2     | Loosely and Tightly Coupled Processors |
| 2.3     | Network-on-chip Architectures        |
| 3       | Project Overview                     |
| 3.1     | Project Deliverables                 |
| 3.1.1   | Core Deliverables (CD)               |
| 3.1.2   | Extended Deliverables (ED)           |
| 3.2     | Project Timeline                     |
| 3.2.1   | Project Stages                       |
| 3.2.2   | Project Stage Detail                 |
| 3.2.3   | Timeline                             |
| 3.3     | Resources                            |
| 3.3.1   | Hardware Resources                   |
| 3.3.2   | Software Resources                   |
| 3.4     | Legal and Ethical Considerations     |
| 4       | Single-core Design                   |
| 4.1     | Introduction                         |
| 4.2     | Design and Implementation            |
| 4.2.1   | Instruction Set Architecture         |
| 4.2.2   | Memory Management Unit               |
| 4.2.3   | Instruction and Data Memory          |
| 4.2.4   | ALU Design                           |
| 4.2.5   | Decoder Design                       |
| 4.2.6   | Pipelining                           |
| 4.2.7   | Design Optimisations                 |
| 4.3     | Interrupts                           |
| 4.3.1   | Overview                             |
| 4.3.2   | Hardware Implementation              |
| 4.3.3   | Software Interface                   |
| 4.3.4   | Design Improvements                  |
| 4.4     | Verification                         |
| 5       | Interconnect                         |
| 5.1     | Introduction                         |
| 5.1.1   | Comparison of On-chip Buses          |
| 5.2     | Overview                             |
| 5.2.1   | Design Considerations                |
| 5.3     | Interfaces                           |
| 5.3.1   | Master to Slave Interface            |
| 5.3.2   | Multi-master Support                 |
| 5.4     | Further Work                         |
| 6       | Memory Mapping                       |
| 6.1     | Introduction                         |
| 6.2     | Address Decoding                     |
| 6.2.1   | Decoder Optimisations                |
| 6.3     | Memory Map                           |
| 7       | Multi-core Communication             |
| 7.1     | Introduction                         |
| 7.1.1   | Design Goals                         |
| 7.1.2   | Context Identification               |
| 7.1.3   | Thread Synchronisation               |
| 8       | Analysis & Results                   |
| 8.1     | Introduction                         |
| 8.2     | Implementation Analysis              |
| 8.2.1   | Design Size                          |
| 8.2.2   | Maximum Frequency                    |
| 8.3     | Scenario Performance                 |
| 8.3.1   | Scenario Overview                    |
| 8.3.2   | Performance Measurements             |
| 8.3.3   | Performance Results                  |
|         | References                           |
|         | Appendices                           |
| A       | Peripheral Information               |
| A.1     | Special Registers                    |
| A.2     | Watchdog Timer                       |
| A.3     | GPIO Interface                       |
| A.4     | Timer with Interrupt                 |
| A.5     | UART Interface                       |
| B       | Additional Figures                   |
| B.1     | Register Set Multiplex               |
| B.2     | Instruction Set Architecture         |
| C       | Configuration Options                |
| C.1     | System-on-chip Configuration Options |
| C.2     | Core Options                         |
| C.3     | Peripheral Options                   |
| D       | Viva Demonstration Examples          |
| D.1     | 2-core Timer Interrupt and ISR       |
| D.2     | 1-160 Core Parallel Summation        |
| E       | Code Listing                         |
| E.1     | SoC Code Listing                     |
| E.1.1   | vmicro16_soc_config.v                |
| E.1.2   | top_ms.v                             |
| E.1.3   | vmicro16_soc.v                       |
| E.1.4   | vmicro16.v                           |
| E.2     |                                      |

# **List of Figures**

| Figure | Caption |
|--------|---------|
| 2.1    | A loosely coupled multiprocessor system. Each node features its own memory and IO modules and uses a Message Transfer System to perform inter-node communication. Image source: [1] |
| 2.2    | A tightly coupled multiprocessor system. Nodes are directly connected to memory and IO modules. Image source: [1] |
| 2.3    | A multiprocessor network-on-chip architecture with 16 processing nodes. Nodes are connected in a grid formation with routers and links. Image source: [2] |
| 3.1    | Project stages in a Gantt chart. |
| 3.2    | Terasic DE1-SoC development board featuring the Altera Cyclone V FPGA and many peripherals. Image source: [3] |
| 3.3    | Minispartan-6+ development board featuring the Xilinx Spartan 6 XC6SLX9. Note that the XC6SLX9 and XC6SLX25 FPGAs share the same board. Image source: [4] |
| 4.1    | Vmicro16 RISC 5-stage RTL diagram showing: instruction pipelining (data passed forward through clocked register banks at each stage); branch address calculation; ALU operand calculation (rd2 or imm); and program counter incrementing. |
| 4.2    | Vmicro16 ALU diagram showing clocked inputs from the previous IDEX stage |
| 4.3    | [...] being [...] normal state |
| 4.4    | The interrupt vector (0x0100 - 0x0107) consists of eight 16-bit values that point to memory addresses of the instruction memory to jump to |
| 4.5    | Interrupt Mask register (0x0108). Each bit corresponds to an interrupt source. 1 signifies the interrupt is enabled for/visible to the core. Bits [7:2] are left to the designer to assign. Bit 0 is assigned to TIMR0's interval timer. Bit 1 is assigned to the UART0's receiver (unassigned if DEF_USE_REPROG is enabled) |
| 5.1    | Waveform showing an APB read transaction |
| 5.2    | Block diagram of the Vmicro16 system-on-chip. |
| 5.3    | Foo |
| 6.1    | Schematic showing the address decoder (addr_dec) accepting the active PADDR signal and outputting PSEL chip enable signals to each peripheral |
| 6.2    | Example 4-bit binary comparator which compares the bits (a, b, c, d) to the constant value 1010. The 0s of the constant are inverted and then all are passed to a wide-AND |
| 6.3    |  |
| 6.4    | Partial address decoding used by the Vmicro16 SoC design. Each peripheral shown only needs to decode a single bit to determine if it is enabled |
| 6.5    | Memory map showing addresses of various memory sections |
| 7.1    | Block diagram showing the main multi-processing components: the CPU array and a peripheral interconnect used for core synchronisation |
| 7.2    | Vmicro16 Special Registers layout (0x0080 - 0x008F) |
| 7.3    | Assembly code for locking a mutex. r1 is the address to lock. r3 is zero. r4 is the branch address |
| 8.1    | A theoretical FPGA device with 3 BRAM blocks running a 4-core design. Each core can map onto a BRAM block, however as there are more cores than BRAM blocks available, some core memories will be implemented as distributed RAM, or in the worst case using ALMs |
| 8.2    | Maximum design frequency for various core count configurations |
| 8.3    | [...] changes with core count |
| 8.4    | Similar to Figure 8.3 but using shared instruction memory to reduce block memory requirements per core |
| A.1    | Vmicro16 Special Registers layout (0x0080 - 0x008F) |
| B.1    | Normal mode (bottom) and interrupt mode (top) register sets are multiplexed to switch between contexts |
| B.2    | Vmicro16 instruction set architecture. |

# **List of Tables**

| Table | Caption                                                 |
|-------|---------------------------------------------------------|
| 3.1   | Project stages throughout the life cycle of the project |
| C.1   | SoC Configuration Options                               |
| C.2   | Core Options                                            |
| C.3   | Peripheral Options                                      |

# **List of Listings**

| Listing | Caption |
|---------|---------|
| 1       | ALU branch detection using flags: zero (Z), overflow (V), and negative (N) |
| 2       | Vmicro16's ALU implementation named vmicro16_alu. vmicro16.v |
| 3       | Vmicro16's ALU implementation named vmicro16_alu. vmicro16.v |
| 4       | Vmicro16's decoder module code showing nested bit switches to determine the intended opcode. vmicro16.v |
| 5       | RAM and lock memories instantiated by the shared memory peripheral |
| 6       | Assembly code for a memory barrier. Threads will wait in the barrier_wait function until all other threads have reached that code point |
| 7       | Variable size inputs and outputs to the interconnect. |

# Chapter 1

# Introduction

| Section | Title           |
|---------|-----------------|
| 1.1     | Why Multi-core? |
| 1.2     | Why RISC?       |
| 1.3     | Why FPGA?       |

This project details the design, implementation, and verification of a new multi-core RISC processor aimed at FPGA devices. I chose this project because of my interest in processor design: I have previously designed only single-core RISC processors and wish to extend this knowledge to gain a basic understanding of multi-core communication, design considerations, and the challenges of software and hardware parallelism first hand.

I will use this opportunity to further develop my knowledge of FPGA and processor design by designing, implementing, and verifying a multi-core RISC processor from scratch, including the design of a communication interface between multiple cores.

# 1.1 Why Multi-core?

Moore's Law states that the number of transistors in a chip will double every 2 years []. CPU designers would utilise the additional transistors to add more pipeline stages to the processor, reducing the propagation delay of each stage [] and allowing for higher clock frequencies.

The size of transistors has been decreasing [] and today they can be manufactured in the sub-10 nanometre range. However, the extremely small transistor size increases electrical leakage and other negative effects, resulting in unreliability and potential damage to the transistor []. The high transistor count produces large amounts of heat and requires increasing power to supply the chip. These trade-offs are currently managed by reducing the input voltage, utilising complex cooling techniques, and reducing clock frequency. These factors significantly limit the performance of the chip and contribute to Moore's Law *slowing* down. The capacity limit of current-generation planar transistors is approaching, so for performance increases to continue, other approaches are employed, such as alternative transistor technologies like multigate transistors [5], software and hardware optimisations, and multi-processor architectures.

This report will focus on the last of these: producing a small multi-core processor that can utilise software-based parallelism to gain performance benefits compared to a larger single-core design.

# 1.2 Why RISC?

RISC architectures feature simpler and fewer instructions compared to CISC, which emphasises instructions that perform larger tasks. A single CISC instruction might be performed with multiple RISC instructions. Because of the fewer and simpler instructions, RISC machines rely heavily on software optimisations for performance. RISC instruction sets are based on load/store architectures, where most instructions are either register-to-register or memory reading and writing [6]. This constraint greatly reduces complexity.

RISC architectures are easier to design and implement, especially for beginners, because their simpler instructions share the same pipeline. In CISC designs there may be a different pipeline for each instruction, which would consume considerably more FPGA resources.

# 1.3 Why FPGA?

Field programmable gate arrays (FPGA) are a great choice for prototyping digital logic designs due to their programmable nature and quick development times.

My experience with FPGAs in previous projects will reduce risk and learning times and allow more time to be spent on adding and extending features (discussed further in section 3.1).

FPGAs, however, may not be suitable for prototyping all register-transfer level (RTL) projects. Larger RTL projects, such as large commercial processors, may greatly exceed the logic cell resources available in today's high-end FPGA devices and may only be prototyped through silicon fabrication, which can be expensive. This resource limitation will not be a problem, as the project aims to produce a small and minimal design specifically for learning about multi-core architectures.

# Chapter 2

# **Background**

| Section | Title                                  |
|---------|----------------------------------------|
| 2.1     | Amdahl's Law and Parallelism           |
| 2.2     | Loosely and Tightly Coupled Processors |
| 2.3     | Network-on-chip Architectures          |

## 2.1 Amdahl's Law and Parallelism

In many applications, not restricted to software, there may exist many opportunities for processes or algorithms to be performed in parallel. These algorithms can be split into two parts: a serial part that cannot be parallelised, and a part that can be parallelised. Amdahl's Law defines a formula for calculating the maximum *speedup* of a process with potential parallelism opportunities when run in parallel on $n$ processors. Speedup is a term used to describe the potential performance improvement of an algorithm using an enhanced resource (in this case, additional parallel processors) compared to the original algorithm. Amdahl's Law is defined below, where the potential speedup $S_p$ is dependent on the portion of the program that can be parallelised $p$ and the number of processing cores $n$:

$$S_p = \frac{1}{(1-p) + \frac{p}{n}} \tag{2.1}$$

This formula will be used throughout the project to gauge the performance of the multi-core design running various software algorithms.
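As a quick illustration of Equation 2.1, the short Python sketch below evaluates the speedup for a few core counts. The function names and the parallel fraction of 0.9 are assumptions chosen purely for illustration; they are not measurements from this project.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup S_p from Equation 2.1 for parallel fraction p on n cores."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p: float) -> float:
    """Upper bound on the speedup as n grows without limit: 1 / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    p = 0.9  # assumed parallel fraction, for illustration only
    for n in (1, 2, 4, 8, 16):
        print(f"n={n:2d}  S_p={amdahl_speedup(p, n):.2f}")
    print(f"limit as n grows: {speedup_limit(p):.2f}")
```

With a parallel fraction of 0.9, the speedup can never exceed 1/(1 - p) = 10 regardless of the core count, which shows how the serial portion of a program comes to dominate as $n$ grows.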

# 2.2 Loosely and Tightly Coupled Processors

Multiprocessor systems can be generalised into two architectures, loosely and tightly coupled, each with advantages and disadvantages. In loosely coupled systems, each processing node is self-contained: each node has its own dedicated memory and IO modules. Communication between nodes is performed over a *Message Transfer System (MTS)* [1] in a master-slave control architecture.

Scalability in loosely coupled systems is generally easier to implement as each node can simply be appended to the shared MTS interface without large modifications to the rest of the system. Scalability is an important concern in this project as I wish to test the developed solution with a range of processing nodes.

As the nodes of a loosely coupled system feature their own memory and IO modules, they generally perform better in cases where interaction between nodes is not prominent: each node can store a separate part of the software program in its memory module, allowing simultaneous execution of the program.

In scenarios where inter-node communication is prominent, however, access to the MTS interface must be scheduled to avoid access conflicts. This introduces delays and idle times in the software program's execution, resulting in lower throughput. Figure 2.1 shows a general layout of a loosely coupled multiprocessor system.
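The cost of scheduling access to a shared MTS can be illustrated with a toy model. The sketch below assumes a round-robin arbiter that grants the interface to one node per cycle; the arbitration policy, the function name `simulate_shared_bus`, and the message counts are all hypothetical and are not taken from this project's design.

```python
def simulate_shared_bus(messages_per_node: list[int]) -> tuple[int, list[int]]:
    """Toy round-robin arbiter for a shared interface (illustrative only).

    messages_per_node[i] is how many single-cycle messages node i must send.
    Returns the total cycles used and, per node, the number of cycles it
    spent idle with a message pending while another node held the interface.
    """
    pending = list(messages_per_node)
    n = len(pending)
    waits = [0] * n
    next_node = 0  # node with the highest priority in the current cycle
    cycles = 0

    while any(pending):
        # Grant the first node with a pending message, starting at next_node.
        granted = next(
            (next_node + k) % n
            for k in range(n)
            if pending[(next_node + k) % n] > 0
        )
        # Every other node that still has a message must stall this cycle.
        for node in range(n):
            if node != granted and pending[node] > 0:
                waits[node] += 1
        pending[granted] -= 1
        next_node = (granted + 1) % n  # rotate priority for fairness
        cycles += 1

    return cycles, waits


if __name__ == "__main__":
    total, stalled = simulate_shared_bus([1, 1, 1, 1])
    print(f"cycles={total}, stall cycles per node={stalled}")
```

In this model a node that communicates alone never stalls, whereas four nodes communicating at once are served one after another, which is the idle time described above.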

Tightly coupled systems feature processing nodes that do not have their own dedicated memory or IO modules – each node is directly connected to a shared memory module using a dedicated port. In scenarios where inter-node communication is prominent, tightly coupled systems are generally better suited as nodes are directly connected to a shared memory and do not need to wait to use a shared bus.



**Figure 2.1:** A loosely coupled multiprocessor system. Each node features its own memory and IO modules and uses a Message Transfer System to perform inter-node communication. Image source: [1].

**Figure 2.2:** A tightly coupled multiprocessor system. Nodes are directly connected to memory and IO modules. Image source: [1].

This project will utilise a loosely coupled architecture because it is easier to scale and because of my previous experience with the design of single-core processors. Although it will require a scheduler to access the MTS, the experience and knowledge gained from this task will be greatly beneficial for future projects.

# 2.3 Network-on-chip Architectures

Network-on-chip (NoC) architectures implement on-chip communication mechanisms that are based on network communication principles, such as routing, switching, and massive scalability [7]. NoCs can generally support hundreds to millions of processing cores. Figure 2.3 shows an example 16-core network-on-chip architecture. NoCs can scale to very large sizes without sacrificing performance because each processor core is able to drive the network directly, rather than needing to wait for a shared bus to become free before doing so.

The greater the number of cores in a network-on-chip design, the more quality of service (QoS) problems arise. As such, network-on-chip architectures suffer the same problems as networks, such as fairness and throughput [8].



**Figure 2.3:** A multiprocessor network-on-chip architecture with 16 processing nodes. Nodes are connected in a grid formation with routers and links. Image source: [2].

# Chapter 3

# **Project Overview**

- 3.1 Project Deliverables
  - 3.1.1 Core Deliverables (CD)
  - 3.1.2 Extended Deliverables (ED)
- 3.2 Project Timeline
  - 3.2.1 Project Stages
  - 3.2.2 Project Stage Detail
  - 3.2.3 Timeline
- 3.3 Resources
  - 3.3.1 Hardware Resources
  - 3.3.2 Software Resources
- 3.4 Legal and Ethical Considerations

This chapter discusses the project's requirements, goals, and structure.

# 3.1 Project Deliverables

The project's deliverables are split into two groups. Core deliverables (CD) must all be satisfied for the project to be a minimum viable product (MVP). Extended deliverables (ED) are not required for an MVP; they only improve upon existing features.

## 3.1.1 Core Deliverables (CD)

The project's core deliverables are described below.

**CD1** Design a compact 16-bit RISC instruction set architecture.

The instruction set will be the primary interface to control the processor from software. An instruction set will be required to implement the custom multi-core communication interface.

It was decided to design a new instruction set rather than to extend an existing architecture as this will increase my knowledge of the constraints to consider when designing instruction sets and processors.

**CD2** Design and implement a Verilog RISC core that implements the ISA in CD1.

The Verilog RISC core will be able to run software programs written for the instruction set architecture.

**CD3** Design and implement an on-chip interconnect for multi-core processing (2 to 32 cores) using the RISC core from CD2.

The interconnect is a key requirement for multi-core communication. It should support up to 32 cores; however, limited FPGA resources may constrain this in practice.

The interconnect will control communication between the cores to enable software parallelism.

**CD4** Analyse performance of serial and parallel software algorithms, such as parallel DFT, on the processor.

To evaluate the effectiveness of the developed solution, serial and parallel implementations of a simple computing algorithm (parallel reduction, sorting) will be run on the processor and their performance analysed. Effectiveness will be rated on total algorithm run-time and the speed-up gained by adding more cores.
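The two metrics named above can be expressed as a short calculation. The following Python sketch is illustrative only: the function names and the cycle counts are hypothetical and are not measured results from this project.

```python
def speedup(t_single_core, t_multi_core):
    """Speed-up S = T1 / Tn; both times use the same unit (e.g. clock cycles)."""
    return t_single_core / t_multi_core


def efficiency(t_single_core, t_multi_core, cores):
    """Parallel efficiency E = S / n; 1.0 would be perfect linear scaling."""
    return speedup(t_single_core, t_multi_core) / cores


# Hypothetical cycle counts, for illustration only (not measured results).
t1, t8 = 8000, 2000
print(speedup(t1, t8))        # 4.0
print(efficiency(t1, t8, 8))  # 0.5
```

An efficiency below 1.0, as in this made-up case, would indicate that communication or scheduling overhead is consuming part of the benefit of the extra cores.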

**CD5** Allow the RISC core to be easily compiled to multiple FPGA vendors (Xilinx, Altera).

The developed solution should be generic and portable to allow it to be used across a wide-range of FPGA vendors and devices.

Verilog is a generic, implementation-independent hardware description language, and so implementation-specific modules should be avoided where possible.

A key consideration for this requirement is the varying hard IP provided by the FPGA vendors (such as BRAM, Ethernet, and PCIe [9, 10]). To overcome this problem, the developed Verilog code will use conditional compilation where vendor-specific requirements are present.

## 3.1.2 Extended Deliverables (ED)

The project's extended deliverables are described below.

- **ED1** Design a RISC core with an instructions-per-clock (IPC) rating of at least 1.0 (a single-cycle CPU).
- **ED2** Design a RISC core with a pipelined data path to increase the design's clock speed.
- **ED3** Design a scalable multi-core interconnect supporting arbitrary (more than 32) RISC core instances (manycore) using Network-on-Chip (NoC) architecture.
- **ED4** Design a compiler-backend for the PRCO304 [11] compiler to support the ISA from **CD1**. This will make it easier to build complex multi-core software for the processor.
- **ED5** The RISC core can communicate with peripherals via memory-mapped addresses using the Wishbone bus.

- **ED6** Implement various memory-mapped peripherals such as UART, GPIO, LCD, to aid visual representation of the processor during the demonstration viva.
- **ED7** Store instruction memory in SPI flash.
- **ED8** Reprogram instruction memory at runtime from host computer.
- **ED9** Processor external debugger using host-processor link.

# 3.2 Project Timeline

## 3.2.1 Project Stages

The project is split into stages to aid planning and management. There are 8 stage areas: 1. Initial project conception; 2. Basic RISC core development; 3. Extended RISC core development; 4. Multi-core development; 5. Processor quality-of-life (QoL) improvements; 6. Compiler development; 7. Demo preparation; and 8. Final report.

The project stages are shown in Table 3.1.

## 3.2.2 Project Stage Detail

### Stages 1.0 through 1.2 – Research and Project Conception

These stages cover initial research into existing problems and solutions in the multiprocessor area. The instruction set architecture that later stages will implement is also proposed here.

### Stages 2.1 through 2.3 – Processor Module Design, Implementation, and Integration

These stages cover the design, implementation, and integration of key processor core modules such as the instruction decoder, register sets and local memory. Integration of all the modules is a challenging task because some modules have both asynchronous and synchronous signals that need to be timed correctly in order for other modules to receive valid data. An example of this is the register set which has asynchronous read ports that are later clocked in the instruction decode stage.

### Stages 3.1 through 3.4 – Advanced Processor Implementation

These stages add advanced features to the processor to provide a more functional product. Although these stages are classified as extended, the technical effort required to design and implement them is not great, and so they have time allocations in the project schedule. The extended features that these stages introduce are: pipelined processor stages, to drastically increase processor performance; a memory-mapped peripheral interface through the MMU; a Wishbone master interface to the MMU, allowing external peripherals such as GPIO and LCD displays to be utilised in a modular fashion; and a cache memory for each processor core.

| Stage | Title                                        | Start Date | Days | Core | Applicable Deliverables    |
|-------|----------------------------------------------|------------|------|------|----------------------------|
| 1.0   | Research                                     | Feb 04     | 7    | x    |                            |
| 1.1   | Requirement gathering/review                 | Feb 11     | 14   | x    |                            |
| 1.1   | Processor specification, architecture, ISA   | Feb 18     | 100  | x    | CD1                        |
| 1.2   | Stage/Time Allocation Planning               | Feb 25     | 7    | x    |                            |
| 2.1   | Decoder, Register Set, impl & integration    | Feb 25     | 14   | x    | CD2                        |
| 2.2   | Register set impl & integration              | Mar 04     | 14   | x    | CD2                        |
| 2.3   | Local memory impl & integration              | Mar 11     | 14   | x    | CD2                        |
| 3.1   | Memory mapped register layout & impl         | Apr 01     | 21   |      | ED5                        |
| 3.2   | Wishbone peripheral bus connected to MMU     | Apr 08     | 21   |      | ED5                        |
| 3.3   | Pipelined implementation and verification    | Apr 15     | 21   |      | ED2                        |
| 3.4   | Cache memory design & impl                   | Apr 22     | 28   |      | ED2                        |
| 4.1   | Multi-core communication interface           | TBD        | TBD  | x    | CD3                        |
| 4.2   | Shared-memory controller                     | TBD        | TBD  | x    | CD3                        |
| 4.3   | Scalable multi-core interface (10s of cores) | TBD        | TBD  | x    | CD3                        |
| 4.4   | Multi-core example program (reduction)       | TBD        | TBD  | x    | CD4                        |
| 5.1   | SPI-FPGA interface for OTG programming       | TBD        | TBD  |      | ED7                        |
| 5.2   | FPGA-PC interfacing                          | TBD        | TBD  |      | ED9                        |
| 5.3   | FPGA-PC debugging (instruction breakpoints)  | TBD        | TBD  |      | ED9                        |
| 6.1   | Compiler backend for vmicro16                | TBD        | TBD  |      | ED4                        |
| 6.2   | Compiler support for multi-core codegen      | TBD        | TBD  |      | ED4                        |
| 7.1   | Wishbone peripherals for demo                | TBD        | TBD  | x    | CD4                        |
| 8.1   | Final Report                                 | TBD        | TBD  | x    |                            |

 Table 3.1: Project stages throughout the life cycle of the project.

### Stages 4.1 through 4.4 – Multiprocessor Functionality

These stages are dedicated to adding multiprocessor functionality using a loosely coupled architecture to the processor.

### Stages 5.1 through 5.3 – Debugging Features

These stages cover debugging features and are classified as extended due to the large development time required to implement them as well as not being related to multiprocessor systems.

### Stages 6.1 through 6.2 – Compiler Backends

These stages cover the implementation of a compiler backend to ease software writing and programming of the processor.

### Stage 7.1 – Wishbone Peripherals

Additional Wishbone peripherals, such as SPI and timers, will be added to produce a more useful multiprocessor system.

### Stage 8.1 – Final Report

This stage is dedicated to the final report write-up. It is expected to be an iterative task that is active throughout the lifespan of the project.

## 3.2.3 Timeline

The project stages from Table 3.1 are displayed below in a Gantt chart.



Figure 3.1: Project stages in a Gantt chart.

# 3.3 Resources

This section describes the hardware and software resources required to fulfil the project.

## 3.3.1 Hardware Resources

Core deliverable CD5 requires the designed RISC core to be implemented and demonstrated on multiple FPGA devices. Although my design should synthesise for physical IC implementation, this is not a primary development target due to high costs and lengthy production times. Because I have past experience with Xilinx FPGAs from my placement work and with Altera FPGAs from university modules, it was decided to target the Xilinx Spartan 6 XC6SLX9 and the Altera Cyclone V.

### Terasic DE1-SoC Development Board

The Terasic DE1-SoC development board features a large Cyclone V FPGA and many peripherals, such as seven-segment displays, 64 MB SDRAM, ADCs, and buttons and switches, which will aid demonstration of the project. The development board is available through the university so the cost is negligible. Figure 3.2 shows the peripherals (green) available to the FPGA.



Figure 3.2: Terasic DE1-SoC development board featuring the Altera Cyclone V FPGA and many peripherals. Image source: [3].

## Minispartan 6+ FPGA Development Board

The Minispartan 6+ is a hobbyist FPGA development board with fewer peripherals than the DE1-SoC. The board features a Xilinx Spartan 6 XC6SLX9, which has far fewer resources than the DE1-SoC's Cyclone V; however, its simplicity and my familiarity with Xilinx's software suite will speed up development. The development board is shown in Figure 3.3.

**Figure 3.3:** Minispartan-6+ development board featuring the Xilinx Spartan 6 XC6SLX9. Note that the XC6SLX9 and XC6SLX25 FPGAs share the same board. Image source: [4].

## 3.3.2 Software Resources

#### **Intel Quartus**

Intel Quartus Prime is a paid-for SoC, CPLD, and FPGA software suite targeting Intel's Stratix, Arria, and Cyclone based FPGAs. The university provides student licences which will be used via VPN.

## Xilinx ISE Webpack

Xilinx ISE Webpack is Xilinx's free software suite for FPGA development for Spartan 6 based FPGAs. Due to ISE's intuitive and fast workflow, most of the initial simulation and verification processes will be performed using ISE. This will greatly improve development times.

#### Verilator

Verilator is an open-source Verilog to C++ transpiler which provides a C++ interface to simulate Verilog modules and read/write values similar to a test bench. Verilator will be used for specific modules within the RISC core such as the ALU and decoder as Verilator is useful when performing exhaustive verification.

# 3.4 Legal and Ethical Considerations

The RISC core is designed to be used as an academic research and educational tool to aid learning and understanding of RISC and multi-core machines. It should not be used in mission-critical or safety-critical roles.

The processor does not provide any memory protection features and any software running on the processor has full access to all memory.

The processor does not store/track/predict software instructions. The processor uses pipelining techniques to improve performance which results in future instructions entering the pipeline even if the software's logical sequence does not include these instructions. This could result in security vulnerabilities similar to Intel's Spectre vulnerability [12].

# Chapter 4

# Single-core Design

- 4.1 Introduction
- 4.2 Design and Implementation
  - 4.2.1 Instruction Set Architecture
  - 4.2.2 Memory Management Unit
  - 4.2.3 Instruction and Data Memory
  - 4.2.4 ALU Design
  - 4.2.5 Decoder Design
  - 4.2.6 Pipelining
  - 4.2.7 Design Optimisations
- 4.3 Interrupts
  - 4.3.1 Overview
  - 4.3.2 Hardware Implementation
  - 4.3.3 Software Interface
  - 4.3.4 Design Improvements
- 4.4 Verification

# 4.1 Introduction

While the majority of this report will focus on the multi-processing functionality of this project, it is important to understand the design decisions of the single core in order to understand the features and limitations of the multi-core system-on-chip as a whole.

# 4.2 Design and Implementation

The single-core design is a traditional 5-stage RISC processor (fetch, decode, execute, memory, write-back). The core uses separate instruction and data memories in the style of a Harvard architecture [?].

To satisfy CD5, the Verilog code will be self-contained in a single file. This reduces the hierarchical complexity and eases cross-vendor project set-up as only a single file is required to be included.



**Figure 4.1:** Vmicro16 RISC 5-stage RTL diagram showing: instruction pipelining (data passed forward through clocked register banks at each stage); branch address calculation; ALU operand calculation (rd2 or imm); and program counter incrementing.

A small reduction in size within the single-core will result in substantial size reductions in multi-core designs.

## 4.2.1 Instruction Set Architecture

Core deliverable CD1 details the background for the requirement of a custom instruction set architecture. The 16-bit instruction set listing is shown in Figure B.2.

In this proposed architecture, most instructions are *destructive*, meaning that source operands also act as the destination, hence effectively *destroying* the original source operand.

This design decision reduces the complexity of the ISA, as traditional three-operand instructions, for example add r0, r0, r1, can be encoded using only two operands: add r0, r1. However, this does increase the complexity of compilers, as they may need to make temporary copies of registers because the instructions will *destroy* the original source data.
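The effect of destructive operands on a compiler can be illustrated with a small behavioural model. This is a Python sketch rather than Vmicro16 code; the `mov` helper is a hypothetical register-copy operation introduced only for the illustration.

```python
MASK16 = 0xFFFF  # registers are 16 bits wide


def add(regs, rd, rs):
    """Destructive add: rd <- rd + rs, overwriting the first source operand."""
    regs[rd] = (regs[rd] + regs[rs]) & MASK16


def mov(regs, rd, rs):
    """Hypothetical register copy, used to preserve a value before it is overwritten."""
    regs[rd] = regs[rs]


# Goal: r2 = r0 + r1 while keeping r0 intact.
regs = [5, 7, 0, 0, 0, 0, 0, 0]  # r0..r7
mov(regs, 2, 0)  # temporary copy: r2 <- r0
add(regs, 2, 1)  # r2 <- r2 + r1, r0 is untouched
print(regs[:3])  # [5, 7, 12]
```

Without the copy, `add(regs, 0, 1)` would leave the sum in r0 and the original value of r0 would be lost, which is the extra bookkeeping a compiler has to manage.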

The instruction set is split into 7 categories (highlighted by colours in Figure B.2):

- Special instructions, such as halting and interrupt returns;
- Bitwise operations, such as XOR and AND;
- Signed arithmetic;
- Unsigned arithmetic;
- Conditional branches and compare instructions;
- and Load/store instructions, with their atomic equivalents.

## 4.2.2 Memory Management Unit

It was decided to use a memory management unit (MMU) to make communication with external peripherals and additional registers easier and more extensible. This method transparently uses the existing LW[EX]/SW[EX] instructions to provide an arbitrary number of peripherals and special-purpose addresses to the software running on the processor.
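The address-decoding idea behind the MMU can be sketched as a behavioural model. The Python below is not the Verilog implementation: the class and method names are hypothetical, and the scratch memory base address and depth are assumptions. Only the interrupt vector (0x0100 to 0x0107) and interrupt mask (0x0108) addresses are taken from this report.

```python
class Mmu:
    """Routes load/store addresses to memory-mapped regions (behavioural model)."""

    def __init__(self):
        self.regions = []  # list of (base, size, storage)

    def map_region(self, base, size):
        storage = [0] * size
        self.regions.append((base, size, storage))
        return storage

    def _find(self, addr):
        for base, size, storage in self.regions:
            if base <= addr < base + size:
                return storage, addr - base
        raise ValueError(f"unmapped address 0x{addr:04x}")

    def sw(self, addr, value):
        """Store word: the same call reaches memory or a peripheral register."""
        storage, offset = self._find(addr)
        storage[offset] = value & 0xFFFF

    def lw(self, addr):
        """Load word from whichever region owns the address."""
        storage, offset = self._find(addr)
        return storage[offset]


mmu = Mmu()
scratch = mmu.map_region(0x0000, 64)    # assumed base address and depth
int_vector = mmu.map_region(0x0100, 8)  # 0x0100-0x0107
int_mask = mmu.map_region(0x0108, 1)    # 0x0108
mmu.sw(0x0108, 0x0F)                    # software writes the interrupt mask
print(hex(mmu.lw(0x0108)))              # 0xf
```

Adding a further peripheral in this model is a single `map_region` call, which mirrors the extensibility argument made above.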

## 4.2.3 Instruction and Data Memory

The design uses separate instruction and data memories, similar to a Harvard architecture computer. This architecture was chosen because it is generally easier to implement; however, it later resulted in design challenges in large multi-core designs. This is discussed later in the report.

Each single-core has its own *scratch* memory – a small RAM-like memory which can be used for stack space and arrays too large to fit into the 8 registers. These memories are provided as is, meaning it is up to the software to implement any stack-frame, function, and calling functionality. Each core also features its own read-only instruction memory that is programmed at compile time of the design, or via the UART0 receiver interface (discussed later). Both of these memories map onto synchronous, read-first, single-port FPGA block RAMs to minimise LUT requirements.

Users can customise the size of these memories by tweaking the following parameters in the vmicro16\_soc\_config.v file: DEF\_MEM\_INSTR\_DEPTH for the instruction memory, and DEF\_MEM\_SCRATCH\_DEPTH for the scratch memory.
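The read-first behaviour of the block RAMs described above can be illustrated with a small behavioural model. This Python sketch is an assumption-laden illustration: the class name is hypothetical and the depth of 64 words merely stands in for a value of DEF\_MEM\_SCRATCH\_DEPTH.

```python
class ReadFirstRam:
    """Synchronous, single-port, read-first RAM (behavioural model)."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.dout = 0  # registered output, updated only on a clock edge

    def clock(self, addr, we=False, din=0):
        """One rising clock edge on the single port.

        Read-first: dout receives the OLD contents of mem[addr], even when
        the same address is being written on this edge.
        """
        self.dout = self.mem[addr]
        if we:
            self.mem[addr] = din & 0xFFFF
        return self.dout


ram = ReadFirstRam(depth=64)  # 64 is an assumed depth
old = ram.clock(3, we=True, din=0xBEEF)  # write: output still shows old data
new = ram.clock(3)                       # next edge: the written value appears
print(hex(old), hex(new))                # 0x0 0xbeef
```

Because the output is registered, a read always costs one clock, which is one reason the surrounding processor logic has to account for memory latency.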

## 4.2.4 ALU Design

The Vmicro16's ALU is an asynchronous module that has 3 inputs (data a, data b, and opcode op) and one output (data c). The ALU is able to operate on both register data (rd1 and rd2) and immediate values. A switch is used to set the b input to either the rd2 or imm value from the previous stage.



Figure 4.2: Vmicro16 ALU diagram showing clocked inputs from the previous IDEX stage being

The ALU also performs comparison (CMP) operations, in which it returns flags similar to the x86 overflow, sign, and zero flags. Combinations of these flags can be used to compute relationships between the two input operands. For example, if the zero flag is not equal to the sign flag, then the relationship between inputs a and b is that a < b.

```
module branch (
    input      [3:0] flags,  // status flags from the last compare
    input      [7:0] cond,   // branch condition
    output reg       en      // 1 when the branch should be taken
);
    always @(*)
        case (cond)
            `VMICRO16_OP_BR_U:  en = 1;
            `VMICRO16_OP_BR_E:  en = (flags[`VMICRO16_SFLAG_Z] == 1);
            `VMICRO16_OP_BR_NE: en = (flags[`VMICRO16_SFLAG_Z] == 0);
            `VMICRO16_OP_BR_G:  en = (flags[`VMICRO16_SFLAG_Z] == 0) &&
                                     (flags[`VMICRO16_SFLAG_N] == flags[`VMICRO16_SFLAG_V]);
            `VMICRO16_OP_BR_L:  en = (flags[`VMICRO16_SFLAG_Z] != flags[`VMICRO16_SFLAG_N]);
            `VMICRO16_OP_BR_GE: en = (flags[`VMICRO16_SFLAG_Z] == flags[`VMICRO16_SFLAG_N]);
            `VMICRO16_OP_BR_LE: en = (flags[`VMICRO16_SFLAG_Z] == 1) ||
                                     (flags[`VMICRO16_SFLAG_N] != flags[`VMICRO16_SFLAG_V]);
            default:            en = 0;
        endcase
endmodule
```

Listing 1: ALU branch detection using flags: zero (Z), overflow (V), and negative (N).

The Verilog implementation of the ALU is shown in Listing 2. The ALU's asynchronous output is clocked with other registers, such as destination register rs1 and other control signals, in the EXME register bank.

```
always @(*) case (op)
    // branch/nop, output nothing
    `VMICRO16_ALU_BR,
    `VMICRO16_ALU_NOP:       c = {DATA_WIDTH{1'b0}};
    // load/store addresses (use value in rd2)
    `VMICRO16_ALU_LW,
    `VMICRO16_ALU_SW:        c = b;
    // bitwise operations
    `VMICRO16_ALU_BIT_OR:    c = a | b;
    `VMICRO16_ALU_BIT_AND:   c = a & b;
    `VMICRO16_ALU_BIT_NOT:   c = ~b;
    `VMICRO16_ALU_BIT_LSHFT: c = a << b;
    `VMICRO16_ALU_BIT_RSHFT: c = a >> b;
    // remaining operations are not shown in this excerpt
endcase
```

Listing 2: Vmicro16's ALU implementation named vmicro16\_alu. vmicro16.v
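The ALU's combinational behaviour, including the rd2/imm operand switch, can be checked against a simple software model. The Python below is a sketch under assumptions: the operation names are shortened, hypothetical labels, and modelling NOT as the inversion of operand b is an assumption rather than a confirmed detail of the design.

```python
DATA_WIDTH = 16
MASK = (1 << DATA_WIDTH) - 1


def select_b(rd2, imm, use_imm):
    """Operand switch: the b input is either register data or an immediate."""
    return imm if use_imm else rd2


def alu(op, a, b):
    """Combinational ALU model; results are truncated to DATA_WIDTH bits."""
    if op in ("BR", "NOP"):
        return 0                  # branch/nop output nothing
    if op in ("LW", "SW"):
        return b & MASK           # pass the address operand through
    if op == "OR":
        return (a | b) & MASK
    if op == "AND":
        return (a & b) & MASK
    if op == "NOT":
        return ~b & MASK          # assumption: NOT inverts operand b
    if op == "LSHFT":
        return (a << b) & MASK
    if op == "RSHFT":
        return (a & MASK) >> b
    raise ValueError(f"unknown op {op}")


b = select_b(rd2=0x1234, imm=0x8, use_imm=True)
print(hex(alu("LSHFT", 0x1, b)))  # 0x100
```

A model like this can serve as the reference ("golden") output in a self-checking test bench.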

## 4.2.5 Decoder Design

Instruction decoding occurs between the IFID and IDEX stages. The decoder extracts register selects and operands from the input instruction. The decoder outputs are asynchronous, which allows the register selects to be passed to the register set and register data to be read asynchronously. The register selects and register read data are then clocked into the IDEX register bank.

Listing 4: Vmicro16's decoder module code showing nested bit switches to determine the intended opcode. vmicro16.v

In Listing 4, it can be seen that the first 4 opcode cases (BR, MULT, CMP, SETC) are each identified by bits 15-11 (the opcode) alone. The BIT instructions, however, share a single opcode and so require another bit range to be compared to determine the output function.
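The nested decode described above can be sketched in software. In this Python model only the opcode position (bits 15-11) is taken from the text; the opcode values, the register select positions (bits 10-8 and 7-5), and the secondary function field (bits 4-0) are assumptions made for illustration and may not match the real encoding in Figure B.2.

```python
# Hypothetical encodings, for illustration only.
OP_BR, OP_MULT, OP_CMP, OP_SETC, OP_BIT = 0b00001, 0b00010, 0b00011, 0b00100, 0b00101
SIMPLE = {OP_BR: "BR", OP_MULT: "MULT", OP_CMP: "CMP", OP_SETC: "SETC"}
BIT_FUNCS = {0b00000: "BIT_OR", 0b00001: "BIT_XOR", 0b00010: "BIT_AND"}


def decode(instr):
    """Return (operation name, rs1, rs2) for a 16-bit instruction word."""
    opcode = (instr >> 11) & 0x1F  # bits 15-11, as described in the text
    rs1 = (instr >> 8) & 0x7       # assumed field position
    rs2 = (instr >> 5) & 0x7       # assumed field position
    func = instr & 0x1F            # assumed secondary field
    if opcode in SIMPLE:
        name = SIMPLE[opcode]      # opcode alone selects the operation
    elif opcode == OP_BIT:
        name = BIT_FUNCS.get(func, "NOP")  # nested switch on a second field
    else:
        name = "NOP"
    return name, rs1, rs2


word = (OP_BIT << 11) | (2 << 8) | (5 << 5) | 0b00010
print(decode(word))  # ('BIT_AND', 2, 5)
```

Sharing one opcode between related instructions in this way leaves more of the 5-bit opcode space free, at the cost of a second comparison in the decoder.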

## 4.2.6 Pipelining

In the interim progress update, the processor design featured *instruction pipelining* to meet requirement **ED1**. Instruction pipelining allows instruction executions to be overlapped in the pipeline, resulting in higher throughput (up to one instruction per clock) at the expense of 5-6 clocks of latency and *significant* code complexity. As the development of the project shifted from single-core to multi-core, it became obvious that the complexity of the pipelined processor would inhibit the integration of multi-core functionality. It was decided to remove the instruction pipelining functionality and use a state-machine based pipeline that is much simpler to extend and would cause fewer challenges later in the project.
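The throughput cost of this decision can be estimated with a simple cycle-count model. The Python below is an idealised sketch: it assumes a pipeline with no stalls or hazards, and the number of states per instruction in the state-machine design is an assumed parameter, not a figure from this report.

```python
def pipelined_cycles(n_instr, stages=5):
    """Ideal pipeline: fill latency once, then one instruction completes per clock."""
    if n_instr == 0:
        return 0
    return stages + (n_instr - 1)


def state_machine_cycles(n_instr, states_per_instr=5):
    """State machine: each instruction finishes before the next one starts."""
    return n_instr * states_per_instr


n = 100
print(pipelined_cycles(n))      # 104
print(state_machine_cycles(n))  # 500
```

Under these assumptions the state-machine design is several times slower per core, which is the performance that was traded for simpler multi-core integration.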

## 4.2.7 Design Optimisations

In a design that has many instantiations of the same component, a small resource saving within the component can produce a significant overall saving. Project requirement CD5 requires the design to be compiled for a range of FPGA sizes, and so space-saving optimisations are considered.

#### **Register Set Size Improvements**

A register set in a CPU is a fast, temporary, and small memory that software instructions directly manipulate to perform computation. In the Vmicro16 instruction set, eight registers named r0 to r7 are available to software. The instruction set allows up to two registers to be referenced in most instructions; for example, the instruction add r0, r1 tells the processor to perform the following actions:

- **Clock 1.** Fetch r0 and r1 from the register set
- **Clock 2.** Add the two values together in the ALU
- **Clock 3.** Store the result back to the register set in r0

For Clock 1, it was originally decided to use a dual-port register set (meaning that two data reads, in this case r0 and r1, can be performed in a single clock). However, due to the asynchronous design of the register set (for speed), the RTL produced consumed a significant amount of FPGA resources: approximately 256 flip-flops (16 (data width) \* 8 (registers) \* 2 (ports)). To reduce this, it was decided to split Clock 1 into two steps over two clock cycles using a single-port register set. This required the processor pipeline to use another clock cycle, resulting in slightly lower performance; however, the size improvements allow more cores to be instantiated in the design. This optimisation is also applied to the interrupt register set, resulting in a saving of approximately 256 flip-flops per core (128 in the normal mode register set, and 128 in the interrupt register set). As shown, adding a single clock delay saves a significant amount of FPGA resources. This saving is amplified in designs with many cores.
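The flip-flop estimate above can be reproduced with a few lines of arithmetic. This Python sketch follows the report's own approximation (data width × registers × ports); the function names are hypothetical, and real synthesis results will vary by tool and device.

```python
DATA_WIDTH = 16
NUM_REGS = 8


def regset_flops(ports):
    """Approximate flip-flop count of one register set, per the estimate above."""
    return DATA_WIDTH * NUM_REGS * ports


def saving_per_core(register_sets=2):
    """Saving from dual-port to single-port across the normal and interrupt sets."""
    return register_sets * (regset_flops(2) - regset_flops(1))


def saving_total(cores):
    """The per-core saving scales linearly with the number of cores."""
    return cores * saving_per_core()


print(regset_flops(2))    # 256
print(saving_per_core())  # 256
print(saving_total(8))    # 2048
```

The last line illustrates how the saving grows with core count, here for a hypothetical 8-core design.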

# 4.3 Interrupts

Interrupts are a technique used by processors to run software functions when an event occurs within the processor, such as exceptions, or signalled from an external source, such as a UART receiver signalling it has received new data. Today, it is common for micro-controllers, soft-processors, and desktop processors, to all feature interrupts. Modern implementations support an *interrupt vector* which is a memory array that contains addresses to different *interrupt handlers* (a software function called when a particular interrupt is received).

Although interrupts are not a requirement for a multi-core system, it was decided to implement this functionality to boost my understanding of such systems. In addition, example demos provided with this project are better visualised with interrupt functionality.

## 4.3.1 Overview

The interrupt functionality in this project supports the following:

- Per-core 8 cell interrupt vector accessible to software.
   Software programs running on the Vmicro16 processor can edit the interrupt vector to add their own interrupt handlers at runtime.
- Fast context switching.
   A dedicated interrupt register set is multiplexed with the normal mode register set to provide faster context switching. It should be noted that only the registers are saved during a context switch. This means that the stack is not saved. A schematic of the register multiplex is shown in Figure B.1.
- Parametrised interrupt sources and widths.
   Users can configure the width of the interrupt input signals and the data width per interrupt source via vmicro16\_soc\_config.v. By default, 8 interrupt sources are available and each can provide 8 bits of data.

## 4.3.2 Hardware Implementation

## **Context Switching**

When acting upon an incoming interrupt, the current state of the processor must be saved so that changes made by the interrupt handler, such as register writes and branches, do not affect the current state. After the interrupt handler function signals it has finished (by using the *Interrupt Return* INTR instruction) the saved state is restored. In the case of the Vmicro16 processor, the program counter r\_pc[15:0] and register set regs instance are the only states that are saved. Going forth, the terms *normal mode* and *interrupt mode* are used to describe which registers the processor should use when executing instructions.

When saving the state, to avoid clocking 128 bits (8 registers of 16 bits) into another register (which would increase timing delays and logic elements), a dedicated register set for the interrupt mode (regs\_isr) is multiplexed with the normal mode register set (regs). Then depending on the mode (identified by the register regs\_use\_int) the processor can easily switch between the two large states without significantly affecting timing.
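The multiplexed register sets can be modelled in software to show why no bulk copy is needed. In this Python sketch the names regs, regs\_isr, and regs\_use\_int follow the text above, while the class, the method names, and the saved\_pc field are hypothetical.

```python
class BankedRegisters:
    """Two register sets selected by a mode bit (behavioural model)."""

    def __init__(self):
        self.regs = [0] * 8        # normal mode register set
        self.regs_isr = [0] * 8    # interrupt mode register set
        self.regs_use_int = False  # mode select (the multiplexer control)
        self.pc = 0
        self.saved_pc = 0          # hypothetical name for the saved program counter

    def _active(self):
        return self.regs_isr if self.regs_use_int else self.regs

    def write(self, index, value):
        self._active()[index] = value & 0xFFFF

    def read(self, index):
        return self._active()[index]

    def enter_interrupt(self, handler_addr):
        """Save the program counter and flip the mode bit; no registers are copied."""
        self.saved_pc = self.pc
        self.pc = handler_addr
        self.regs_use_int = True

    def intr(self):
        """Interrupt return: restore the program counter and the normal register set."""
        self.pc = self.saved_pc
        self.regs_use_int = False


cpu = BankedRegisters()
cpu.pc = 0x10
cpu.write(0, 0x1234)       # normal mode r0
cpu.enter_interrupt(0x40)
cpu.write(0, 0xFF)         # handler overwrites its own r0 only
cpu.intr()
print(hex(cpu.read(0)), hex(cpu.pc))  # 0x1234 0x10
```

Switching context costs a single bit flip in this model, which corresponds to changing the multiplexer select rather than clocking 128 bits into a shadow register.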

The timing diagram in Figure 4.3 shows the behavioural logic for the TIMR0 interrupt source.



**Figure 4.3:** Timing diagram showing the TIMR0 peripheral emitting a 1 µs periodic interrupt signal (out) to the processor. The processor acknowledges the interrupt (int\_pending\_ack) and enters the interrupt mode (regs\_use\_int) for a period of time. When the interrupt handler reaches the Interrupt Return instruction (indicated by w\_intr) the processor returns to normal mode and restores the normal state.

## 4.3.3 Software Interface

A memory-mapped software interface is provided through the MMU to allow easy software control of the interrupt behaviour. The interface is provided at the address range 0x0100 to 0x0108. This interface is per-core allowing each core to individually control what interrupts it receives and what functions to call upon an interrupt. This enables complex functionality, such as allowing each core to execute different functions upon the same interrupt.



**Figure 4.4:** The interrupt vector (0x0100 - 0x0107) consists of eight 16-bit values that point to memory addresses of the instruction memory to jump to.

### **Interrupt Vector (0x0100-0x0107)**

The interrupt vector is a per-core register that is used to store the addresses of interrupt handlers. An interrupt handler is simply a software function residing in instruction memory that is branched to when a particular interrupt is received.

#### Interrupt Mask (0x0108)

The interrupt mask is a per-core register that is used to mask (ignore) or listen to specific interrupt sources. This enables processing cores to individually select which interrupts they respond to. This allows for multi-processor designs where each core can be dedicated to a particular interrupt source, improving the response time to the interrupt for time-critical programs. The Interrupt Mask register is an 8-bit read/write register where each bit corresponds to a particular interrupt source and to the interrupt handler at the same index in the interrupt vector. The interrupt mask register is shown in Figure 4.5.

**Figure 4.5:** Interrupt Mask register (0x0108). Each bit corresponds to an interrupt source. 1 signifies the interrupt is enabled for/visible to the core. Bits [7:2] are left to the designer to assign. Bit 0 is assigned to TIMR0's interval timer. Bit 1 is assigned to the UART0's receiver (unassigned if DEF\_USE\_REPROG is enabled).
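The interaction between the interrupt mask and the interrupt vector can be modelled as a small lookup. This Python sketch is illustrative: the function name is hypothetical, and servicing the lowest-numbered pending source first is an assumed priority scheme that this report does not specify.

```python
def select_handler(pending, mask, vector):
    """Return (source, handler_address) for the interrupt to service, or None.

    pending: 8-bit value, one bit per interrupt source currently raised
    mask:    8-bit Interrupt Mask register (1 = source visible to this core)
    vector:  list of 8 handler addresses (the interrupt vector)
    """
    enabled = pending & mask & 0xFF
    for source in range(8):  # assumed priority: lowest bit first
        if enabled & (1 << source):
            return source, vector[source]
    return None


vector = [0x0020, 0x0030, 0, 0, 0, 0, 0, 0]  # hypothetical handler addresses
print(select_handler(0b00000010, 0x0F, vector))  # (1, 48)
print(select_handler(0b00010000, 0x0F, vector))  # None, source 4 is masked
```

Giving each core a different mask value in this model reproduces the use case described above, where one core is dedicated to a single interrupt source.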

## Software Example

To better understand the usage of the described interrupt registers, a simple software program is described below. The following software program produces a simple and power efficient routine to initialise the interrupt vector and interrupt mask.

```
setup_interrupts:
    // Set interrupt vector at 0x100
    // Move address of isr0 function to vector[0]
    movi    r0, isr0
    // create 0x100 value by left shifting 1 8 bits
    movi    r1, #0x1
    movi    r2, #0x8
    lshft   r1, r2
    // write isr0 address to vector[0]
    sw      r0, r1
enable_interrupts:
    // enable all interrupts by writing 0x0f to 0x108
    movi    r0, #0x0f
    sw      r0, r1 + #0x8   // (0x100 + 0x8 = 0x108)
    halt                    // enter low power idle state

isr0:                       // arbitrary name
    movi    r0, #0xff       // do something
    intr                    // return from interrupt
```

A more complex example software program utilising interrupts and the TIMR0 interrupt is described in Section D.1.

## 4.3.4 Design Improvements

The hardware and software interrupt designs have changed throughout the project's life cycle. In initial versions of the interrupt implementation, the software program, while waiting for an interrupt, would sit in a tight infinite loop (branching to the same instruction). This resulted in the processor using all pipeline stages during this time. The pipeline stages produce many logic transitions and memory fetches, which raise power consumption and temperature. This is especially noticeable when running on the Spartan-6 LX9 FPGA.

To improve this, it was decided to implement a new state within the processor's state machine that, when entered, did not produce high frequency logic transitions or memory fetches. The HALT instruction was modified to enter this state and the only way to leave is from an interrupt or top-level reset. This removes the need for a software infinite loop that produces high frequency logic transitions (decoding, ALU, register reads, etc.) and memory fetches.

# 4.4 Verification

Various verification techniques are employed to ensure correct operation of the processor.

The first technique involves using static assertions to identify incorrect configuration parameters at compile time, such as a zero instruction memory or scratch memory depth. These assertions use the static\_assert macro for top-level checks and the static\_assert\_ng macro for checks inside generate blocks.

The second verification technique is to use assertions in always blocks to identify incorrect behavioural states. This is done using the rassert (run-time assert) macro.

The third verification technique is to use automatic verifying test benches. These test benches drive components of the processor, such as the ALU and decoder, and check the output against the correct value. This uses the rassert macro.

The final method of verification is to verify the complete design via a behavioural test bench. The design is passed a compiled software program with a known expected output, and is run until the r\_halt signal is raised. The test bench then checks the values on the debug0, debug1, and debug2 signals against the expected values. If these match, it is assumed that the sub-components of the design also operate correctly. This technique does not monitor the states of sub-components or statistics (such as the time taken to execute an instruction), which leaves the possibility that some components could have entered an illegal state.

# Chapter 5

# Interconnect

| Section | Title                       |
|---------|-----------------------------|
| 5.1     | Introduction                |
| 5.1.1   | Comparison of On-chip Buses |
| 5.2     | Overview                    |
| 5.2.1   | Design Considerations       |
| 5.3     | Interfaces                  |
| 5.3.1   | Master to Slave Interface   |
| 5.3.2   | Multi-master Support        |
| 5.4     | Further Work                |

## 5.1 Introduction

The Vmicro16 processor needs to communicate with multiple peripheral modules (such as UART, timers, GPIO, and more) to provide useful functionality for the end user.

In my previous designs, each peripheral was connected directly to a main driver through the unique inputs and outputs that the peripheral required. For example, a timer peripheral would have dedicated wires for its load and prescaler values, wires for enabling and resetting, and wires for reading. A memory peripheral would have wires for its address, read and write data, and a write enable signal. This resulted in each peripheral having a unique interface and unique driving logic, which consumed significant amounts of limited FPGA resources.

It can be seen that many of the peripherals need similar inputs and outputs (for example read and write data signals, write enables, and addresses), and because of this, a standard interface can be used to interface with each peripheral. Using a standard interface can reduce logic requirements as each peripheral can be driven by a single driver.

# 5.1.1 Comparison of On-chip Buses

The choice of on-chip interconnect has changed multiple times over the life cycle of this project, primarily due to ease of implementation and resource requirements.

Originally, it was planned to use the Wishbone bus [? ] due to its popularity within open-source FPGA modules and its good quality documentation.

Late in the project, it was decided to use the AMBA APB protocol [? ] as it is more commonly used in large commercial designs, and understanding how the interface works would be of greater benefit to me. APB describes an intuitive and easy to implement 2-state interface aimed at communicating with low-throughput devices, such as UARTs, timers, and watchdogs.



Figure 5.1: Waveform showing an APB read transaction.
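The read transaction in Figure 5.1 can be approximated cycle by cycle in software. The following Python sketch follows the standard APB setup and access phases; it assumes no wait states, models the slave as a plain dictionary, and uses a hypothetical helper name (`apb_read`) that does not appear in the project sources.

```python
# Cycle-level sketch of an APB read: a setup phase followed by an access phase.
# Assumptions: no wait states, and the slave is modelled as a dict of
# address -> 16-bit value.


def apb_read(slave_mem, addr):
    """Return (read data, list of per-cycle signal values) for one read."""
    trace = []
    # Setup phase: PSEL high, PENABLE low, address and direction are valid.
    trace.append({"PSEL": 1, "PENABLE": 0, "PWRITE": 0, "PADDR": addr})
    # Access phase: PENABLE is raised and the slave drives PRDATA.
    prdata = slave_mem.get(addr, 0) & 0xFFFF
    trace.append({"PSEL": 1, "PENABLE": 1, "PWRITE": 0, "PADDR": addr,
                  "PRDATA": prdata})
    # Idle: the master lowers PSEL and PENABLE, releasing the bus.
    trace.append({"PSEL": 0, "PENABLE": 0, "PWRITE": 0, "PADDR": addr})
    return prdata, trace


data, cycles = apb_read({0x0080: 0x1234}, 0x0080)
for cycle in cycles:
    print(cycle)
```

A write would follow the same two phases with PWRITE high and PWDATA driven by the master instead of PRDATA by the slave.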

# 5.2 Overview

The system-on-chip design is split into 3 main parts: peripheral interconnect (red), CPU array (gray), and the instruction memory interconnect (green).

A block diagram of this project is shown in Figure 5.2.



Figure 5.2: Block diagram of the Vmicro16 system-on-chip.

#### 5.2.1 Design Considerations

There are several design issues to consider for this project. These are listed below:

- **Design size limitations.** The target devices for this project are small to medium sized FPGAs (featuring approximately 10,000 to 30,000 logic cells). Because of this, it is important to use a bus interconnect that has a small logic footprint yet is able to scale reasonably well.
- **Ease of implementation.** The interconnect and any peripherals should be easy to implement within the time allocations specified in Figure 3.1.
- **Scalability.** The interconnect should allow the number of master and slave interfaces to be scaled easily with minimal code changes.

## 5.3 Interfaces



#### 5.3.1 Master to Slave Interface

| Signal        | Bits  | Contents   |
|---------------|-------|------------|
| PADDR[20:0]   | 20    | LE         |
|               | 19    | SE         |
|               | 18:16 | CORE\_ID   |
|               | 15:0  | Address    |
| PWDATA[15:0]  | 15:0  | Write data |
| PRDATA[15:0]  | 15:0  | Read data  |
| PWRITE[0:0]   | 0     | WE         |
| PENABLE[0:0]  | 0     | Enable     |

#### 5.3.2 Multi-master Support

In this design, each processor can act as an APB master to communicate with peripherals, for example to write a value to the UART or to the shared memory peripheral. Because each core runs independently of the other cores, it is likely, especially in many-core systems, that two or more processors will want to use the peripheral bus at the same time.

As the peripheral and instruction interconnects use a shared one-to-many (one master to many slaves) bus architecture, only one master can use the bus at any time. To enable multiple masters to use the bus, a device called an *arbiter* must be used to control which master gets access to drive the shared interconnect.

Arbiters can vary in complexity, mostly relative to throughput requirements.

An ideal interconnect for this design, which could feature many (possibly tens of) high-throughput masters, would likely use a priority-based, pipelined arbiter together with performance features such as cache coherency.

#### Overview

Due to this project's limited time, and my limited knowledge in this area, a simple rotating arbiter is used. This arbitration scheme is likely the simplest possible. A schematic of the arbiter interconnect is shown in Figure 5.3.

In this scheme, access to the bus is given incrementally to each master port, even if the master port has not requested to use the bus. The active master port can use the bus for as long as it requires, and signals that it has finished by lowering its PSEL signal. When the PSEL signal is lowered, the arbiter grants access to the next master port. If this next master port has not raised its PSEL signal (i.e. it has not requested access to the bus) then the arbiter grants access to the next master port, and so on. In Verilog, this is simply an incrementing counter which is used to index the master ports array. To support a variable number of master ports, the width of each APB signal is multiplied by the number of cores, as shown in Listing 7.
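The rotating scheme can be modelled in a few lines. The following is a minimal Python sketch of the behaviour described above, not a translation of the project's Verilog; the class name `RotatingArbiter` and the convention of returning `None` for an idle cycle are assumptions made for illustration.

```python
class RotatingArbiter:
    """Grant counter that advances whenever the current master is not requesting."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.grant = 0  # index of the master port that currently owns the bus

    def clock(self, psel):
        """Advance one clock cycle.

        psel is a list of 0/1 request lines, one per master port.
        Returns the master using the bus this cycle, or None if the cycle is idle.
        """
        if psel[self.grant]:
            return self.grant  # the active master keeps the bus while PSEL is high
        # The current master is idle or has finished: move to the next port.
        self.grant = (self.grant + 1) % self.num_masters
        return None


arbiter = RotatingArbiter(num_masters=4)
requests = [0, 0, 1, 0]  # only master 2 wants the bus
print([arbiter.clock(requests) for _ in range(4)])
```

In this model the grant starts at master 0, so two idle cycles pass before master 2 is reached.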



Figure 5.3: Schematic of the arbiter interconnect.

| Bits        | 21N-1 : 21(N-1) | ... | 41:21  | 20:0   |
|-------------|-----------------|-----|--------|--------|
| Master port | Core N-1        | ... | Core 1 | Core 0 |

#### 5.4 Further Work

The submitted design is acceptable for a multi-core system as it fulfils the following requirements:

- Supports an arbitrary number of peripherals.
- Supports memory-mapped address decoding.
- Supports multiple master interfaces.

#### **Arbiter Performance Improvements**

However, it fails in the performance aspect. A one clock cycle penalty occurs each time the next master port has not requested the bus. This may seem a small price to pay for such a simple arbiter design, but it can add up significantly in many-core designs. For example, if core #0 performs some action on the bus, but core #10 is the next master that wants to use the bus, then the arbiter will waste time incrementally granting access to cores #1 to #9, which do not need the bus. This is made worse when one of the cores is blocking access to a peripheral resource, such as through a mutex or semaphore.

To overcome this penalty, a scheme could use an algorithm to find the next master port requesting access, and grant access directly to it when the current master has finished. Another scheme could be to use a priority encoder. Here, a hard-coded lookup table (LUT) could be used, where the inputs are each master port's PSEL signal (acting as a bus request line) and the output is the master to grant access to. As this design targets FPGA devices, this implementation would require few LUT resources for the arbiter, due to the hard-coded LUT approach. An example of this is given in M. Weber's *Arbiters: Design Ideas and Coding Styles* [13, p. 2].
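Both alternatives can be sketched as pure functions of the request lines. The Python below is an illustrative model only: the function names are hypothetical, and the fixed-priority variant assumes that the lowest-numbered master wins, which is one common convention rather than something specified by this design.

```python
def next_requesting_master(current, psel):
    """Search forward from `current` for the next master with PSEL raised.

    Returns its index, or None if no master is requesting the bus.
    """
    count = len(psel)
    for offset in range(1, count + 1):
        candidate = (current + offset) % count
        if psel[candidate]:
            return candidate
    return None


def priority_encoder(psel):
    """Fixed-priority grant: the lowest-numbered requesting master wins (assumed)."""
    for index, request in enumerate(psel):
        if request:
            return index
    return None


# Core #0 has just finished and only core #10 is requesting (16-core example).
requests = [0] * 16
requests[10] = 1
print(next_requesting_master(0, requests), priority_encoder(requests))
```

The forward search keeps the fairness of the rotating scheme while skipping idle ports. A fixed-priority encoder is smaller, but a continuously requesting high-priority master can starve the others.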

#### **APB Bus Errors and Recovery**

This project's implementation of a multi-master APB interconnect does not provide a method of detecting errors and stalls. This is mainly due to time constraints.

One error that is easy to detect is a PADDR address that does not fall into a memory-mapped address range. This can be detected easily and cheaply in the address decoding module, and is discussed in detail in the next chapter.

As previously stated, the active bus master can take control of the bus for as long as it wants to. This is useful for high-throughput transactions, such as memory operations to global memory, but it means that a stalled or glitched operation is not immediately identifiable. If an active master stalls or glitches, it may not be able to lower its PSEL line, so the transaction appears to the arbiter to be proceeding normally. To overcome this, a timer could be used to detect stalled operations and reset the affected peripheral (essentially a watchdog, but for an interconnect).
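One possible shape for such a timer is shown below as a Python behavioural sketch. It is not part of the submitted design: the class name `BusWatchdog` and the timeout value are hypothetical, and the sketch only flags the stall, leaving the recovery action open.

```python
class BusWatchdog:
    """Counts consecutive cycles with PSEL held high and flags a presumed stall."""

    def __init__(self, timeout_cycles):
        self.timeout_cycles = timeout_cycles
        self.count = 0

    def clock(self, psel_active):
        """Call once per clock cycle; returns True once the timeout is reached."""
        if psel_active:
            self.count += 1
        else:
            self.count = 0  # the bus was released, so the transaction completed
        return self.count >= self.timeout_cycles


watchdog = BusWatchdog(timeout_cycles=3)
print([watchdog.clock(1) for _ in range(4)])  # PSEL never lowered
```

The timeout would need to be longer than the longest legitimate transaction, otherwise long transfers to global memory would be flagged as stalls.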

## Chapter 6

# **Memory Mapping**

| Section | Title                 |
|---------|-----------------------|
| 6.1     | Introduction          |
| 6.2     | Address Decoding      |
| 6.2.1   | Decoder Optimisations |
| 6.3     | Memory Map            |

The Vmicro16 processor uses a memory-mapping scheme to communicate with peripherals and other cores. This chapter describes the design decisions and implementation of the memory-map used in this project.

#### 6.1 Introduction

Memory mapping is a common technique used by CPUs, micro-controllers, and other system-on-chip devices that enables peripherals and other devices to be accessed via a memory address on a common bus. In a processor use-case, this allows existing instructions (commonly memory load/store instructions) to be reused to communicate with external peripherals with little additional logic.

## 6.2 Address Decoding

An address decoder is used to determine the peripheral that the address is requesting. The address decoder module, addr\_dec in apb\_intercon.v, takes the 16-bit PADDR from the active APB interface and checks for set bits to determine which peripheral to select. The decoder outputs a chip enable signal PSEL for the selected peripheral. For example, if bit 12 is set in PADDR then the shared memory peripheral's PSEL is set high and others to low. A schematic for the decoder is shown in Figure 6.1.



Figure 6.1: Schematic showing the address decoder (addr\_dec) accepting the active PADDR signal and outputting PSEL chip enable signals to each peripheral.

#### **6.2.1** Decoder Optimisations

Performing a 16-bit equality comparison of the PADDR signal against each peripheral memory address consumes a significant amount of logic. Depending on the synthesis tools and FPGA features, a 16-bit comparator might require a fixed 16-bit value input to compare against (where the 0s are inverted) and a wide-AND to reduce and compare [14, 15]. An example 4-bit comparator is shown below in Figure 6.2.



**Figure 6.2:** Example 4-bit binary comparator which compares the bits (a, b, c, d) to the constant value 1010. The 0s of the constant are inverted and then all are passed to a wide-AND.

As we are targeting FPGAs, which use LUTs to implement combinatorial logic, we can conveniently utilise Verilog's == operator on fairly large operands without worrying about consuming too many resources. The targeted FPGA devices in this project, the Cyclone V and Spartan 6, feature 6-input LUTs which allow 64 different configurations [16, 17]. Knowing this, we can design the address decoder to utilise the FPGA's LUTs more effectively and reduce its footprint significantly.

We can use part of the PADDR signal as a chip select and the other bits as sub-addresses to interface with the peripheral. The addressing bits are passed into the FPGA's 6-input LUTs which are programmed (via the bitstream) to output 1 or 0 depending on the address. Figure 6.3 below shows a LUT based approach to address decoding which will utilise approximately one ALM/CLB module per peripheral chip select (PSEL) and one for error detection. This method of comparison (LUT based) is utilised in the addr\_dec module in apb\_intercon.v.



**Figure 6.3:** Bits [7:3] of an 8-bit PADDR signal are used as inputs to 5-bit LUTs to generate a PSEL signal. In addition, a default error case is shown allowing the address decoder to detect incorrect PADDR values (e.g. if no PSEL signals are generated).
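The LUT-based decoder of Figure 6.3 can be mimicked in software by giving each PSEL output its own 32-entry truth table. The Python sketch below is illustrative only: the peripheral names and their PADDR[7:3] field values are hypothetical and are not taken from the project's memory map.

```python
# Each PSEL is produced by a 32-entry truth table indexed by PADDR[7:3],
# mirroring a 5-input LUT whose contents are fixed when the design is built.
# The peripheral names and field values below are hypothetical.
PERIPHERAL_FIELDS = {"periph_a": 0b10000, "periph_b": 0b10010, "periph_c": 0b10100}


def build_lut(match_field):
    """Truth table that outputs 1 only for the matching 5-bit field value."""
    return [1 if index == match_field else 0 for index in range(32)]


LUTS = {name: build_lut(field) for name, field in PERIPHERAL_FIELDS.items()}


def decode(paddr):
    """Return (PSEL per peripheral, sub-address, error flag) for an 8-bit PADDR."""
    field = (paddr >> 3) & 0x1F          # PADDR[7:3] selects the peripheral
    sub_addr = paddr & 0x07              # PADDR[2:0] is the sub-address
    psel = {name: lut[field] for name, lut in LUTS.items()}
    error = int(not any(psel.values()))  # default case: no PSEL was generated
    return psel, sub_addr, error


print(decode(0b10010101))
print(decode(0b00000000))
```

The error flag corresponds to the default case in the figure, where no peripheral claims the address.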

The address decoding methods discussed above are examples of *full-address* decoding, where every bit (whether required or not) is compared. It is possible to further reduce the required logic by utilising *partial-address* decoding [18], which does not compare all bits. For example, if the set bits in address 0x0100 do not conflict with the bits of other addresses (i.e. bit 8 is not high in any other address), then the address decoder need only examine bit 8 and can ignore the other bits. This is visualised in Figure 6.4 below. This method is utilised in the MMU's address decoder (module vmicro16\_mmu in vmicro16.v:181). As this is an optimisation per core, significant resources can be saved when a large number of cores are used.



Figure 6.4: Partial address decoding used by the Vmicro16 SoC design. Each peripheral shown only needs to decode a single bit to determine if it is enabled.
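The difference between the two decoding styles can be shown with a small comparison. This Python sketch uses two addresses that appear in this report (0x0100 for the interrupt vector and 0x1000 for the shared memory base); the function names are hypothetical and the single-bit assignment is a simplification of the real decoder.

```python
def full_match(paddr, base):
    """Full-address decoding: all 16 bits take part in the comparison."""
    return (paddr & 0xFFFF) == base


def partial_match(paddr, bit):
    """Partial-address decoding: only one bit is examined."""
    return bool((paddr >> bit) & 1)


INTERRUPT_BIT = 8     # 0x0100 (interrupt vector) has only bit 8 set
SHARED_MEM_BIT = 12   # 0x1000 (shared memory base) has only bit 12 set

for addr in (0x0100, 0x0108, 0x1000, 0x1ABC):
    print(hex(addr),
          full_match(addr, 0x0100),
          partial_match(addr, INTERRUPT_BIT),
          partial_match(addr, SHARED_MEM_BIT))
```

The partial decoder accepts every address with the chosen bit set, so 0x0108 selects the same region as 0x0100 with no extra comparison logic. The cost is aliasing: in this sketch an address with both bits set would match both regions, so the address map has to avoid such values.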

## 6.3 Memory Map

The system-on-chip's memory map is shown below in Figure 6.5. The addresses for each peripheral have been carefully chosen for both:

- Easy software access: creating addresses in software requires few instructions (normally one to four MOVI and LSHIFT instructions to address 0x0000 to 0xffff), which increases software performance.
- Reduced address decoding logic: most addresses can be decoded using partial decoding techniques.



Figure 6.5: Memory map showing addresses of various memory sections.

## Chapter 7

## **Multi-core Communication**

| Section | Title                  |
|---------|------------------------|
| 7.1     | Introduction           |
| 7.1.1   | Design Goals           |
| 7.1.2   | Context Identification |
| 7.1.3   | Thread Synchronisation |

So far we have discussed the features and design of the Vmicro16 system-on-chip. This chapter discusses the multi-processing functionality and how to use it.

#### 7.1 Introduction

Multi-processing functionality is the primary deliverable of this project.

#### 7.1.1 Design Goals

- **Support common synchronisation primitives.** Software should be able to implement common synchronisation primitives, such as mutexes, semaphores, and memory barriers, to perform atomic operations and avoid race conditions, which are critical in parallel and concurrent software applications.
- **Context identification.** The SoC should expose configuration information such as the number of processing cores, the amount of shared and scratch memory, and the CORE\_ID, to each thread.

#### 7.1.2 Context Identification

A goal of the multi-processing functionality of this project is to allow software written for it to be run on any number of cores. This means that a software program will scale to use all cores in the SoC without needing to be rewritten. To enable this functionality, the software must be able to read contextual information about the SoC, such as the number of cores, how much global and scratch memory is available, and the CORE\_ID of the current core.



Figure 7.1: Block diagram showing the main multi-processing components: the CPU array and a peripheral interconnect used for core synchronisation.

This information is provided through the Special Registers peripheral (0x0080 - 0x008F), shown in Figure 7.2. This register set provides relevant information for writing software that can dynamically scale for various SoC configurations.



Figure 7.2: Vmicro16 Special Registers layout (0x0080 - 0x008F).

#### 7.1.3 Thread Synchronisation

In multi-threaded software it is important

The mutex functionality is implemented using a similar scheme to that of ARM's *Global Monitor* [?].

#### Mutexes

In software, a mutex is an object used to control access to a shared resource. The term *object* is used as its implementation is normally platform dependent, meaning that the processor may provide a hardware mechanism or it is left for the operating system to provide.

In this project, mutexes are provided by the processor through the Shared Memory Peripheral (0x1000 to 0x1FFF), which provides a large RAM-style memory accessible by all cores through the peripheral interconnect bus. This large memory is explicitly defined to use the FPGA's BRAM blocks using Xilinx's Verilog ram\_style="block" attribute to avoid wasting LUTs when using high core counts. The peripheral allows each memory cell to be *locked*, meaning that only the cell owner can modify its contents. This is implemented by using another large memory, locks, to store the CORE\_ID + 1 of the owner, as shown in Listing 5. In this system, a lock containing the value 0 indicates an unlocked cell. As CORE\_IDs are indexed from zero, 1 is added to the CORE\_ID before it is stored in the lock. For example, if core #2 wants to lock a memory cell, the value 3 is written to the lock.

```
reg [15:0] ram [0:8191];              // 16 KB RAM (8192 x 16-bit words)
reg [clog2(CORES):0] locks [0:8191];  // owner of each memory cell (CORE_ID + 1, 0 = unlocked)
```

Listing 5: RAM and lock memories instantiated by the shared memory peripheral.

To lock and unlock cells, the LWEX and SWEX instructions are used. These instructions are similar to the LW/SW instructions but provide locking functionality. The *EX* in the instruction names indicates *exclusive access*. LWEX is used to read memory contents (like LW) and also lock the cell if it is not already locked. If a core attempts to lock an already locked cell, the lock does not change. Unlocking is done by the SWEX instruction, which writes to the memory cell only if it is locked by the same core. Unlike SW, SWEX returns a value: zero for success, or one for failure if the cell is locked by another core.
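The locking rules can be captured in a small behavioural model. The Python sketch below follows the description above and the CORE\_ID + 1 encoding of Listing 5; it is not the peripheral's Verilog. One detail is an assumption: the text does not say what SWEX returns for a cell that nobody has locked, so this model treats that case as a failure.

```python
class SharedMemory:
    """Behavioural model of the shared memory peripheral's lockable cells."""

    def __init__(self, depth):
        self.ram = [0] * depth
        self.locks = [0] * depth  # 0 = unlocked, otherwise CORE_ID + 1

    def lwex(self, core_id, addr):
        """Read a cell and lock it for core_id if it is currently unlocked."""
        if self.locks[addr] == 0:
            self.locks[addr] = core_id + 1
        return self.ram[addr]

    def swex(self, core_id, addr, value):
        """Write and unlock the cell if core_id owns the lock.

        Returns 0 on success and 1 on failure. Failing on an unlocked cell
        is an assumption of this sketch.
        """
        if self.locks[addr] == core_id + 1:
            self.ram[addr] = value & 0xFFFF
            self.locks[addr] = 0
            return 0
        return 1


mem = SharedMemory(depth=16)
mem.lwex(core_id=2, addr=5)                  # core #2 takes the lock (lock value 3)
print(mem.swex(core_id=0, addr=5, value=7))  # core #0 does not own the lock
print(mem.swex(core_id=2, addr=5, value=7))  # the owner writes and unlocks
```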

```
lock_mutex:
    // attempt lock
    lwex    r0, r1
    // check success
    swex    r0, r1
    cmp     r0, r3
    // if not equal (NE), retry
    movi    r4, lock_mutex
    br      r4, BR_NE
critical:
    // core has the mutex
```

**Figure 7.3:** Assembly code for locking a mutex. r1 is the address to lock. r3 is zero. r4 is the branch address.

Figure 7.3 shows a simple assembly function to lock a memory cell.

#### **Barriers**

Barriers are a useful software sequence used to block execution until all other threads (or a subset) have reached the same point. Barriers are often used for broadcast and gather actions (sending values to each core or receiving them). They are also used to synchronise program execution if some threads have more work to do than others.

The Vmicro16 processor provides barrier synchronisation through the Shared Memory Peripheral. Like the mutex code, the barrier code uses the LWEX and SWEX instructions to lock a memory cell. Instead of treating the lock as an abstract object, the barrier code treats the cell as a normal memory cell containing a numeric value. Listing 6 shows a software example of this. When the barrier\_reached code is reached, the code increments the shared memory value (addressed by r5) by 1, indicating that the number of threads that have reached this point has increased by one. The barrier\_wait function is then entered, which waits until this value is equal to the number of threads (r7) in the system. When it is, all threads have reached the barrier\_wait function and can continue with normal program execution.

```
barrier_reached:
    // load latest count
    lwex    r0, r5
    // try increment count
    // increment by 1
    addi    r0, r3 + #0x01
    // attempt store
    swex    r0, r5
    // check success (== 0)
    cmp     r0, r3
    // branch if failed
    movi    r4, barrier_reached
    br      r4, BR_NE

barrier_wait:
    // load the count
    lw      r0, r5
    // compare with number of threads
    cmp     r0, r7
    // jump back to barrier if not equal
    movi    r4, barrier_wait
    br      r4, BR_NE
```

**Listing 6:** Assembly code for a memory barrier. Threads will wait in the barrier\_wait function until all other threads have reached that code point.
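The same counting barrier can be tried out on a desktop machine with Python threads. This is only an analogy to the assembly in Listing 6: a `threading.Lock` stands in for the LWEX/SWEX retry loop, a dictionary entry stands in for the shared memory cell, and all names are hypothetical.

```python
import threading
import time

NUM_THREADS = 4
state = {"count": 0}             # stands in for the shared memory cell (r5)
count_lock = threading.Lock()    # stands in for the LWEX/SWEX retry loop
passed = []                      # records which threads got past the barrier


def worker(thread_id):
    # barrier_reached: atomically increment the arrival count.
    with count_lock:
        state["count"] += 1
    # barrier_wait: spin until every thread has arrived.
    while state["count"] != NUM_THREADS:
        time.sleep(0)            # yield so that the other threads can run
    passed.append(thread_id)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(sorted(passed))
```

As in the assembly version, the counter is never reset, so this barrier can be used once per counter cell.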

## **Chapter 8**

# **Analysis & Results**

| Section | Title                    |
|---------|--------------------------|
| 8.1     | Introduction             |
| 8.2     | Implementation Analysis  |
| 8.2.1   | Design Size              |
| 8.2.2   | Maximum Frequency        |
| 8.3     | Scenario Performance     |
| 8.3.1   | Scenario Overview        |
| 8.3.2   | Performance Measurements |
| 8.3.3   | Performance Results      |

So far the system's design, implementation, and example usage have been presented and discussed.

#### 8.1 Introduction

This chapter presents analytic information

## 8.2 Implementation Analysis

This section analyses the synthesised and implemented system-on-chip design to see the effect of increasing core counts.

#### 8.2.1 Design Size

#### **Implementation**

On a minimal system-on-chip configuration, with one core and minimal peripherals and features (no reprogramming, no interrupts, no UART), the design requires as few as 500 LUTs with the processor core requiring approximately 300-400 LUTs.

#### **Constraints**

As discussed in Chapter 4 Single-core Design, each processor core features two memories, instruction and scratch memory, which can both map onto synchronous, single-port FPGA BRAM blocks. While this reduces LUT requirements in designs with few cores, it becomes a non-trivial problem as the core count increases. FPGAs have a fixed number of hard BRAM blocks available for inference by the HDL compiler. For example, the low-end Xilinx Spartan-6 XC6SLX9 FPGA features 32 18 Kb BRAM blocks [19, p. 2], and the Cyclone V 5CSEMA5F31C6N (used in the DE1-SoC) has 397 10 Kb blocks [20, p. 22].



**Figure 8.1:** A theoretical FPGA device with 3 BRAM blocks running a 4-core design. Each core can map onto a BRAM block, however as there are more cores than BRAM blocks available, some core memories will be implemented as distributed RAM, or in the worst case using ALMs.

As shown in Figure 8.1, as the number of processor cores increases, their memories eventually outnumber the available BRAM blocks and are instead implemented in either distributed RAM or ALMs. Both can consume significant logic resources of the FPGA, which reduces the maximum possible core count.
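The point at which memories start to spill out of BRAM can be estimated with simple arithmetic. The Python sketch below is a rough estimate under loudly stated assumptions: every core memory occupies exactly one BRAM block, nothing else in the design uses BRAM, and the function name is hypothetical. The block count of 32 is the Spartan-6 XC6SLX9 figure quoted above.

```python
def memories_without_bram(num_cores, bram_blocks, memories_per_core=2):
    """Count core memories that cannot be placed in a dedicated BRAM block.

    Assumes one BRAM block per memory and no other BRAM users.
    """
    required = num_cores * memories_per_core
    return max(0, required - bram_blocks)


SPARTAN6_LX9_BRAM_BLOCKS = 32  # from the device figures quoted above

for cores in (8, 16, 20):
    spilled = memories_without_bram(cores, SPARTAN6_LX9_BRAM_BLOCKS)
    print(cores, "cores:", spilled, "memories implemented outside BRAM")
```

Under these assumptions the instruction and scratch memories of up to 16 cores fit in BRAM on this device, and every additional core pushes two more memories into general logic. Figure 8.1 corresponds to the call `memories_without_bram(4, 3, memories_per_core=1)`.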

#### 8.2.2 Maximum Frequency

#### 8.3 Scenario Performance

To evaluate the performance of the system-on-chip, scenarios encompassing computational problems that are reflective of real-world applications are compiled and run on the design.

#### 8.3.1 Scenario Overview

The scenario is a software program that runs a parallel implementation of the summation function, i.e. sum [1..10] which returns 55. While this may seem too simple at first to measure performance of a multi-core system-on-chip, the function is actually quite appropriate as it encompasses various parallel problems, such as: a fixed time/size serial part; broadcasting of the data set (in this case the range of the summation); thread synchronisation (to know when the data is ready and to schedule gathering of intermediary results); and is highly scalable.



Figure 8.2: Maximum design frequency for various core count configurations.

The summation task flow is as follows:

1. Root (core #0) broadcasts the range of the summation (i.e. sum 1 to 10) to all cores via the global shared memory.
2. Non-root cores wait for this broadcast to finish (memory barrier), then calculate their own subset of the range to sum. For example, if Root broadcasts that there are 240 samples and 10 cores in the system, each core calculates the subset size:

   $$240/10 = 24 \tag{8.1}$$

   with its calculations starting from:

   $$ID_{CORE} \times 24 \tag{8.2}$$

   For example, Core #5 will start its 24-sample subset summation from

   $$5 \times 24 = 120 \tag{8.3}$$

   effectively performing sum [120..143].
3. All cores perform an intermediary summation over their subset of the range (serial part).
4. All cores attempt to add their intermediary result to a global sum value in global shared memory (mutex).
5. All cores halt, signalling that their work has been committed to the global shared memory and that they have finished the program.
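The partitioning arithmetic in the steps above can be checked with a short script. This Python sketch is a sequential stand-in for the assembly program: each loop iteration plays the role of one core, the function names are hypothetical, and it assumes that the number of samples divides evenly by the number of cores.

```python
def subset_range(core_id, num_cores, num_samples):
    """Return the inclusive (first, last) sample indices for one core."""
    size = num_samples // num_cores   # Equation 8.1: subset size
    start = core_id * size            # Equation 8.2: starting sample
    return start, start + size - 1


def parallel_sum(num_cores, num_samples):
    """Sum 0..num_samples-1 by combining each core's intermediary result."""
    global_sum = 0
    for core_id in range(num_cores):  # each iteration stands in for one core
        first, last = subset_range(core_id, num_cores, num_samples)
        intermediary = sum(range(first, last + 1))
        global_sum += intermediary    # on hardware this add is guarded by a mutex
    return global_sum


print(subset_range(5, 10, 240))
print(parallel_sum(10, 240) == sum(range(240)))
```

For core #5 the script gives a subset starting at sample 120 and ending at sample 143, which is 24 samples in total.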

This program is written in assembly in the file sw/demos/asm/sum64.s and can be compiled using the assembly compiler (developed for deliverable ED4) using the command below. The assembly compiler outputs the file asm.s.hex containing hex instruction words for use in Verilog's \$readmemh function. This data is used for each core's instruction memory. The assembly program is also shown in Section D.2.

python sw/asm.py sw/demos/asm/sum64.s

#### **8.3.2** Performance Measurements

Behavioural simulation will be used to measure the following metrics to estimate general performance of the system-on-chip:

- **Total program run-time.** This is the time from when the reset signal is de-asserted to when all cores have halted. Each core has an output halt signal which the SoC can use to determine if all cores have halted using wire all\_halted = &core\_halts;
- **Time spent on the serial part.** The serial part of this scenario consists of each core's intermediary summation of its subset range. As each core is performing this task, the average will be used.
- **Time spent on communication.** This includes time spent on thread synchronisation, i.e. waiting for the global memory to become available and waiting on the root to finish its broadcast. Again, the average time will be used.
- **Time spent fetching instructions.** Instruction fetches occur during stage STAGE\_IF of the pipeline. The behavioural test bench will record the number of clock cycles each core spends in this state, then calculate the average time spent fetching instructions.

#### 8.3.3 Performance Results



Figure 8.3: Chart showing how the communication times (Tbus) and serial times (Tsum) change with core count.



Figure 8.4: Similar to Figure 8.3 but using shared instruction memory to reduce block memory requirements per core.


#### References

- [1] Tech Differences, "Difference between loosely coupled and tightly coupled multiprocessor system (with comparison chart)," Jul 2017. [Online]. Available: https://techdifferences.com/difference-between-loosely-coupled-and-tightly-coupled-multiprocessor-system.html (Accessed 2019-04-20).

- [2] N. Chatterjee, S. Paul, and S. Chattopadhyay, "Fault-tolerant dynamic task mapping and scheduling for network-on-chip-based multicore platform," *ACM Transactions on Embedded Computing Systems*, vol. 16, pp. 1–24, 05 2017.
- [3] Terasic Technologies, "SoC Platform Cyclone DE1-SoC Board." [Online]. Available: https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=836 (Accessed 2019-04-20).
- [4] *MiniSpartan6+*, Scarab Hardware, 2014. [Online]. Available: https://www.scarabhardware.com/minispartan6/ (Accessed 2019-04-20).
- [5] V. Subramanian, "Multiple gate field-effect transistors for future CMOS technologies," *IETE Technical review*, vol. 27, no. 6, pp. 446–454, 2010.
- [6] M. J. Flynn, Computer architecture: Pipelined and parallel processor design. Jones & Bartlett Learning, 1995.
- [7] L. Benini and G. De Micheli, "Networks on Chips: A new SoC paradigm," *Computer*, vol. 35, pp. 70–78, 02 2002.
- [8] D. Zhu, L. Chen, S. Yue, T. M. Pinkston, and M. Pedram, "Balancing On-Chip Network Latency in Multi-application Mapping for Chip-Multiprocessors," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 872–881.
- [9] Xilinx, Spartan-6 FPGA Block RAM Resources, Xilinx.
- [10] Altera, Recommended HDL Coding Styles QII51007-9.0.0, Altera.
- [11] B. Lancaster, "FPGA-based RISC Microprocessor and Compiler," vol. 3.14, pp. 37–50. [Online]. Available: https://github.com/bendl/prco304 (Accessed March 2018).
- [12] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," arXiv preprint arXiv:1801.01203, 2018.
- [13] M. Weber, "Arbiters: Design Ideas and Coding Styles," p. 2.
- [14] A. Palchaudhuri and R. S. Chakraborty, *High Performance Integer Arithmetic Circuit Design on FPGA: Architecture, Implementation and Design Automation*. Springer, 2015, vol. 51.
- [15] V. Salauyou and M. Gruszewski, "Designing of hierarchical structures for binary comparators on fpga/soc," in *IFIP International Conference on Computer Information Systems and Industrial Management*. Springer, 2015, pp. 386–396.


- [16] Xilinx, Spartan-6 FPGA Configurable Logic Block User Guide UG384, Xilinx.
- [17] Altera, Cyclone V Device Handbook Device Interfaces and Integration CV-5V2, Altera.
- [18] A. S. Tanenbaum, Structured Computer Organization. Pearson Education India, 2016.
- [19] Xilinx, Spartan-6 Family Overview DS160, Xilinx.
- [20] Intel, Cyclone V Device Overview CV-51001, Intel.

# Appendix A

# **Peripheral Information**

| Section | Title                        |
|---------|------------------------------|
| A.1     | Special Registers            |
| A.2     | Watchdog Timer               |
| A.3     | GPIO Interface               |
| A.4     | Timer with Interrupt         |
| A.5     | UART Interface               |
| B.1     | Register Set Multiplex       |
| B.2     | Instruction Set Architecture |

To provide users with useful functionality, common system-on-chip peripherals were created. This section describes each peripheral and its design decisions. The full memory map is shown in Figure 6.5.

## A.1 Special Registers

From the software perspective, it is important for both the developer and software algorithms to know the target system's architecture to better utilise the resources available to them. Software written for one architecture with N cores must also run on an architecture with M cores. To enable such portability, the software must query the system for information such as the number of processor cores and the current core identifier. Without this information, the developer would be required to produce software for each individual architecture (e.g. an Intel i5 with 4 cores, an Intel i7 with 8 cores, or an NVIDIA GTX 970 with 1664 CUDA cores).

The special register peripheral is shown below.



Figure A.1: Vmicro16 Special Registers layout (0x0080 - 0x008F).

## A.2 Watchdog Timer

In any multi-threaded system there exists the possibility for a deadlock – a state where all threads are in a waiting state – and algorithm execution is forever blocked. This can occur either by poor software programming or incorrect thread arbitration by the processor. A common method of detecting a deadlock is to make each thread signal that it is not blocked by resetting a countdown timer. If the countdown timer is not reset, it will eventually reach zero and it is assumed that all threads are blocked as none have reset the countdown.

In this system-on-chip design, software can reset the watchdog timer by writing any 16-bit value to the address 0x00B8.

This peripheral is optional and can be enabled using the configuration parameters described in Configuration Options.
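The countdown-and-reset behaviour described above can be illustrated with a small behavioural model. This is only a sketch in Python, not the Verilog peripheral: the class name `Watchdog`, the helper `run`, and the timeout of 10 cycles are hypothetical, and the real timeout value is not stated in this section.

```python
class Watchdog:
    """Behavioural model of a countdown watchdog (illustrative only)."""

    def __init__(self, timeout_cycles):
        self.timeout_cycles = timeout_cycles
        self.counter = timeout_cycles
        self.expired = False

    def kick(self):
        """Model a software write to the watchdog address: reload the counter."""
        self.counter = self.timeout_cycles

    def tick(self):
        """Advance one clock cycle; flag expiry when the counter reaches zero."""
        if self.counter > 0:
            self.counter -= 1
        if self.counter == 0:
            self.expired = True
        return self.expired


def run(wdog, cycles, kick_every=None):
    """Clock the watchdog, optionally kicking it every `kick_every` cycles.

    Returns the cycle on which the watchdog expired, or None if it never did.
    """
    for cycle in range(1, cycles + 1):
        if kick_every and cycle % kick_every == 0:
            wdog.kick()
        if wdog.tick():
            return cycle
    return None


healthy = run(Watchdog(10), cycles=100, kick_every=5)  # threads keep resetting
blocked = run(Watchdog(10), cycles=100)                # no thread resets it
```

In the healthy run the counter never reaches zero, whereas in the blocked run the model reports expiry once the timeout has elapsed, which is the condition the hardware treats as a deadlock.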



## A.3 GPIO Interface

| Address | Access | Bits | Register     |
|---------|--------|------|--------------|
| 0x0090  | RW     | 7-0  | GPIO0 Output |
| 0x0091  | RW     | 15-0 | GPIO1 Output |
| 0x0092  | RW     | 7-0  | GPIO2 Output |
| 0x0093  | R      | 7-0  | GPIO3 Input  |
On the DE1-SoC board, GPIO0 is assigned to the LEDs, and GPIO1 and GPIO2 to the 6 seven-segment displays.
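As a rough illustration of how software might drive the six displays, the sketch below packs six hexadecimal digits into the two register values. The nibble order is taken from the `top_ms.v` listing in Appendix E (ssd0 to ssd3 on GPIO1, ssd4 and ssd5 on GPIO2); that listing targets the minispartan6+ board, so applying the same order to the DE1-SoC is an assumption, and the function name `pack_ssd_digits` is hypothetical.

```python
def pack_ssd_digits(digits):
    """Pack six hex digits (ssd0 first) into (gpio1, gpio2) register values.

    Assumed layout: gpio1[3:0]=ssd0, gpio1[7:4]=ssd1, gpio1[11:8]=ssd2,
    gpio1[15:12]=ssd3, gpio2[3:0]=ssd4, gpio2[7:4]=ssd5.
    """
    if len(digits) != 6 or any(not 0 <= d <= 0xF for d in digits):
        raise ValueError("expected six values in the range 0..15")
    gpio1 = 0
    for position, digit in enumerate(digits[:4]):
        gpio1 |= digit << (4 * position)
    gpio2 = digits[4] | (digits[5] << 4)
    return gpio1, gpio2


# Values software would write to 0x0091 (GPIO1) and 0x0092 (GPIO2).
gpio1, gpio2 = pack_ssd_digits([1, 2, 3, 4, 5, 6])
```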

## A.4 Timer with Interrupt



**Clock Frequency** Uses the top-level FPGA clock (normally 50 MHz).

**Load Value** Value from which the timer counts down.

**I** Interrupt enable bit. Default 0.

**R** Reset the Load Value and Prescaler to their last written values.

**S** Start the timer countdown. 1 = start, 0 = stop.

**Prescaler** Number of FPGA clock cycles to wait between each decrement.
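The comments in the D.1 demonstration listing give the interrupt period as 20 ns × load × prescaler, where 20 ns is the period of the 50 MHz clock. The following Python sketch evaluates that relation for the values used in the listing; the function name `timer_period_seconds` is hypothetical and the code is a calculator, not part of the design.

```python
CLOCK_HZ = 50_000_000  # top-level FPGA clock stated above (20 ns period)


def timer_period_seconds(load, prescaler, clock_hz=CLOCK_HZ):
    """Return the time between timer interrupts in seconds.

    The timer decrements once every `prescaler` clock cycles and fires
    after `load` decrements, so one period is load * prescaler cycles.
    """
    return load * prescaler / clock_hz


one_second = timer_period_seconds(0x3000, 0x1000)   # roughly 1.0 s
half_second = timer_period_seconds(0x3000, 0x0800)  # roughly 0.5 s
```

With a load of 0x3000, halving the prescaler from 0x1000 to 0x0800 halves the period, which is how the demonstration selects its 0.5 second blink rate.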

## A.5 UART Interface



**E** Enable the UART component.

**I** Enable an interrupt upon receiving new data. Default 1.

Note: If DEF\_USE\_REPROG is enabled in vmicro16\_soc\_config.v, the receiver port is reserved for programming the instruction memory, and reads from and writes to addresses 0x00A1 and 0x00A2 return 0.

# Appendix B

# **Additional Figures**

```
input      [MASTER_PORTS*BUS_WIDTH-1:0]  S_PADDR,
input      [MASTER_PORTS-1:0]            S_PWRITE,
input      [MASTER_PORTS-1:0]            S_PSELx,
input      [MASTER_PORTS-1:0]            S_PENABLE,
input      [MASTER_PORTS*DATA_WIDTH-1:0] S_PWDATA,
output reg [MASTER_PORTS*DATA_WIDTH-1:0] S_PRDATA,
output reg [MASTER_PORTS-1:0]            S_PREADY,
```

Listing 7: Variable size inputs and outputs to the interconnect.
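Each port in Listing 7 is a single flattened vector that carries one field per master. A minimal Python sketch of how such a vector can be packed and sliced is given below. It assumes that master 0 occupies the least significant bits (the equivalent of `bus[port*WIDTH +: WIDTH]` in Verilog), which the listing itself does not confirm; `pack_ports` and `slice_port` are hypothetical helper names.

```python
def pack_ports(values, width):
    """Concatenate one `width`-bit value per master into a flat integer bus.

    Assumption: master 0 is placed in the least significant bits.
    """
    mask = (1 << width) - 1
    flat = 0
    for port, value in enumerate(values):
        flat |= (value & mask) << (port * width)
    return flat


def slice_port(flat_bus, port, width):
    """Return the `width`-bit field that belongs to master `port`."""
    return (flat_bus >> (port * width)) & ((1 << width) - 1)


# Three masters, each driving a 16-bit write-data value.
flat = pack_ports([0x1111, 0x2222, 0xABCD], width=16)
third = slice_port(flat, port=2, width=16)
```

Flattening keeps the port list fixed while the number of masters is set by a parameter, since Verilog-2001 module ports cannot be declared as arrays.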

## **B.1** Register Set Multiplex



Figure B.1: Normal mode (bottom) and interrupt mode (top) register sets are multiplexed to switch between contexts.

## **B.2** Instruction Set Architecture

| Instruction | 15-11 | 10-8    | 7-5       | 4-0   | Operation                             |
|-------------|-------|---------|-----------|-------|---------------------------------------|
| *format*    | 15-11 | 10-8    | 7-5       | 4-0   | rd ra simm5                           |
| *format*    | 15-11 | 10-8    | 7-0       |       | rd imm8                               |
| *format*    | 15-11 | 10-0    |           |       | nop                                   |
| *format*    | 15    | 14:12   | 11:0      |       | extended immediate                    |
| SPCL        | 00000 | 11 bits |           |       | NOP                                   |
| SPCL        | 00000 | 11'h000 |           |       | NOP                                   |
| SPCL        | 00000 | 11'h001 |           |       | HALT                                  |
| SPCL        | 00000 | 11'h002 |           |       | Return from interrupt                 |
| LW          | 00001 | Rd      | Ra        | s5    | Rd <= RAM[Ra+s5]                      |
| SW          | 00010 | Rd      | Ra        | s5    | RAM[Ra+s5] <= Rd                      |
| BIT         | 00011 | Rd      | Ra        | s5    | bitwise operations                    |
| BIT_OR      | 00011 | Rd      | Ra        | 00000 | Rd <= Rd \| Ra                        |
| BIT_XOR     | 00011 | Rd      | Ra        | 00001 | Rd <= Rd ^ Ra                         |
| BIT_AND     | 00011 | Rd      | Ra        | 00010 | Rd <= Rd & Ra                         |
| BIT_NOT     | 00011 | Rd      | Ra        | 00011 | Rd <= ~Ra                             |
| BIT_LSHFT   | 00011 | Rd      | Ra        | 00100 | Rd <= Rd << Ra                        |
| BIT_RSHFT   | 00011 | Rd      | Ra        | 00101 | Rd <= Rd >> Ra                        |
| MOV         | 00100 | Rd      | Ra        | X     | Rd <= Ra                              |
| MOVI        | 00101 | Rd      | i8        |       | Rd <= i8                              |
| ARITH_U     | 00110 | Rd      | Ra        | s5    | unsigned arithmetic                   |
| ARITH_UADD  | 00110 | Rd      | Ra        | 11111 | Rd <= uRd + uRa                       |
| ARITH_USUB  | 00110 | Rd      | Ra        | 10000 | Rd <= uRd - uRa                       |
| ARITH_UADDI | 00110 | Rd      | Ra        | 0AAAA | Rd <= uRd + Ra + AAAA                 |
| ARITH_S     | 00111 | Rd      | Ra        | s5    | signed arithmetic                     |
| ARITH_SADD  | 00111 | Rd      | Ra        | 11111 | Rd <= sRd + sRa                       |
| ARITH_SSUB  | 00111 | Rd      | Ra        | 10000 | Rd <= sRd - sRa                       |
| ARITH_SSUBI | 00111 | Rd      | Ra        | 0AAAA | Rd <= sRd - sRa + AAAA                |
| BR          | 01000 | Rd      | i8        |       | conditional branch                    |
| BR_U        | 01000 | Rd      | 0000 0000 |       | Any                                   |
| BR_E        | 01000 | Rd      | 0000 0001 |       | Z=1                                   |
| BR_NE       | 01000 | Rd      | 0000 0010 |       | Z=0                                   |
| BR_G        | 01000 | Rd      | 0000 0011 |       | Z=0 and S=O                           |
| BR_GE       | 01000 | Rd      | 0000 0100 |       | S=O                                   |
| BR_L        | 01000 | Rd      | 0000 0101 |       | S != O                                |
| BR_LE       | 01000 | Rd      | 0000 0110 |       | Z=1 or (S != O)                       |
| BR_S        | 01000 | Rd      | 0000 0111 |       | S=1                                   |
| BR_NS       | 01000 | Rd      | 0000 1000 |       | S=0                                   |
| CMP         | 01001 | Rd      | Ra        | X     | SZO <= CMP(Rd, Ra)                    |
| SETC        | 01010 | Rd      | Imm8      |       | Rd <= (Imm8 _f_ SZO) ? 1 : 0          |
| MULT        | 01011 | Rd      | Ra        | X     | Rd <= uRd * uRa                       |
| HALT        | 01100 |         | X         |       |                                       |
| LWEX        | 01101 | Rd      | Ra        | s5    | Rd <= RAM[Ra+s5]                      |
| SWEX        | 01110 | Rd      | Ra        | s5    | RAM[Ra+s5] <= Rd; Rd <= 0 on success, 1 on failure |

Figure B.2: Vmicro16 instruction set architecture.
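To make the two main instruction formats in Figure B.2 concrete, the sketch below assembles 16-bit instruction words from their fields. It is an illustration in Python rather than the project's assembler: the helper names are hypothetical, only a few opcodes from the figure are included, and treating `simm5` as a two's complement field is an assumption.

```python
# Opcodes (bits 15-11) copied from Figure B.2.
OPCODES = {"LW": 0b00001, "MOV": 0b00100, "MOVI": 0b00101, "CMP": 0b01001}


def _check(name, value, low, high):
    if not low <= value <= high:
        raise ValueError(f"{name}={value} outside {low}..{high}")


def encode_rri(opcode, rd, ra, simm5=0):
    """Encode the 'rd ra simm5' format: [15:11] op, [10:8] rd, [7:5] ra, [4:0] simm5."""
    _check("rd", rd, 0, 7)
    _check("ra", ra, 0, 7)
    _check("simm5", simm5, -16, 15)  # assumed two's complement
    return (opcode << 11) | (rd << 8) | (ra << 5) | (simm5 & 0x1F)


def encode_ri8(opcode, rd, imm8):
    """Encode the 'rd imm8' format: [15:11] op, [10:8] rd, [7:0] imm8."""
    _check("rd", rd, 0, 7)
    _check("imm8", imm8, 0, 0xFF)
    return (opcode << 11) | (rd << 8) | imm8


movi_word = encode_ri8(OPCODES["MOVI"], rd=1, imm8=0x30)       # movi r1, #0x30
lw_word = encode_rri(OPCODES["LW"], rd=4, ra=5, simm5=1)       # lw r4, r5 + #0x01
```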

# Appendix C

# **Configuration Options**

| Section | Title                                |
|---------|--------------------------------------|
| C.1     | System-on-chip Configuration Options |
| C.2     | Core Options                         |
| C.3     | Peripheral Options                   |

The following configuration options are defined in vmicro16\_soc\_config.v.

A blank Default value signifies that the preprocessor define is commented out, not defined, disabled by default, or computed from other parameters.

## C.1 System-on-chip Configuration Options

| Macro            | Default | Purpose                                                                                      |
|------------------|---------|----------------------------------------------------------------------------------------------|
| CORES            | 4       | Number of CPU cores in the SoC                                                               |
| SLAVES           | 8       | Number of peripherals                                                                        |
| DEF_USE_WATCHDOG | //      | Enable watchdog module to recover from deadlocks and infinite loops                          |
| DEF_GLOBAL_RESET | //      | Enable synchronous reset logic. Will consume more LUT resources. Does not reset BRAM blocks. |

Table C.1: SoC Configuration Options

## **C.2** Core Options

| Macro                  | Default | Purpose                                                                                                                                                                                                             |
|------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DATA_WIDTH             | 16      | Width of CPU registers in bits                                                                                                                                                                                      |
| DEF_CORE_HAS_INSTR_MEM | //      | Enable a per core instruction memory cache                                                                                                                                                                          |
| DEF_MEM_INSTR_DEPTH    | 64      | Instruction memory cache per core                                                                                                                                                                                   |
| DEF_MEM_SCRATCH_DEPTH  | 64      | RW RAM per core                                                                                                                                                                                                     |
| DEF_ALU_HW_MULT        | 1       | Enable/disable HW multiply (1 clock)                                                                                                                                                                                |
| FIX_T3                 | //      | Enable a T3 state for the APB transaction                                                                                                                                                                           |
| DEF_USE_REPROG         | //      | Programme instruction memory via UART0. Requires DEF_GLOBAL_RESET. Enabling this will reserve the UART0 RX port for exclusive use for programming the instruction memory. Software reads of UART0 RX will return 0. |

Table C.2: Core Options

## C.3 Peripheral Options

| Macro           | Default  | Purpose                                             |
|-----------------|----------|-----------------------------------------------------|
| APB_WIDTH       |          | AMBA APB PADDR signal width                         |
| APB_PSELX_GPIO0 | 0        | GPIO0 index                                         |
| APB_PSELX_UART0 | 1        | UART0 index                                         |
| APB_PSELX_REGS0 | 2        | REGS0 index                                         |
| APB_PSELX_BRAM0 | 3        | BRAM0 index                                         |
| APB_PSELX_GPIO1 | 4        | GPIO1 index                                         |
| APB_PSELX_GPIO2 | 5        | GPIO2 index                                         |
| APB_PSELX_TIMR0 | 6        | TIMR0 index                                         |
| APB_BRAM0_CELLS | 4096     | Shared memory words                                 |
| DEF_MMU_TIM0_S  | 16'h0000 | Per core scratch memory start/end address           |
| DEF_MMU_TIM0_E  | 16'h007F | "                                                   |
| DEF_MMU_SREG_S  | 16'h0080 | Per core special registers start/end address        |
| DEF_MMU_SREG_E  | 16'h008F | "                                                   |
| DEF_MMU_GPIO0_S | 16'h0090 | Shared GPIOn start/end address                      |
| DEF_MMU_GPIO0_E | 16'h0090 | "                                                   |
| DEF_MMU_GPIO1_S | 16'h0091 | "                                                   |
| DEF_MMU_GPIO1_E | 16'h0091 | "                                                   |
| DEF_MMU_GPIO2_S | 16'h0092 | "                                                   |
| DEF_MMU_GPIO2_E | 16'h0092 | "                                                   |
| DEF_MMU_UART0_S | 16'h00A0 | Shared UART start/end address                       |
| DEF_MMU_UART0_E | 16'h00A1 | "                                                   |
| DEF_MMU_REGS0_S | 16'h00B0 | Shared registers start/end address                  |
| DEF_MMU_REGS0_E | 16'h00B7 | "                                                   |
| DEF_MMU_BRAM0_S | 16'h1000 | Shared memory with global monitor start/end address |
| DEF_MMU_BRAM0_E | 16'h1FFF | "                                                   |
| DEF_MMU_TIMR0_S | 16'h0200 | Shared timer peripheral start/end address           |
| DEF_MMU_TIMR0_E | 16'h0202 | "                                                   |

 Table C.3: Peripheral Options
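The start and end addresses in Table C.3 define how an address is routed to a peripheral. The Python sketch below decodes an address against those ranges. It is only a software illustration of the mapping, not the MMU logic itself, and the names `MEMORY_MAP` and `decode` are hypothetical.

```python
# (name, start, end) ranges copied from Table C.3; end addresses are inclusive.
MEMORY_MAP = [
    ("TIM0",  0x0000, 0x007F),
    ("SREG",  0x0080, 0x008F),
    ("GPIO0", 0x0090, 0x0090),
    ("GPIO1", 0x0091, 0x0091),
    ("GPIO2", 0x0092, 0x0092),
    ("UART0", 0x00A0, 0x00A1),
    ("REGS0", 0x00B0, 0x00B7),
    ("TIMR0", 0x0200, 0x0202),
    ("BRAM0", 0x1000, 0x1FFF),
]


def decode(address):
    """Return (peripheral name, offset within it), or None if unmapped."""
    for name, start, end in MEMORY_MAP:
        if start <= address <= end:
            return name, address - start
    return None


shared_word = decode(0x1001)  # second word of the shared BRAM
timer_ctrl = decode(0x0201)   # second register of the timer peripheral
unmapped = decode(0x0300)     # no peripheral at this address
```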

# Appendix D

# Viva Demonstration Examples

| Section | Title                          |
|---------|--------------------------------|
| D.1     | 2-core Timer Interrupt and ISR |
| D.2     | 1-160 Core Parallel Summation  |

## D.1 2-core Timer Interrupt and ISR

This demo, shown during the viva, blinks an LED every 0.5 seconds using a timer interrupt. Core 0 sets up the interrupt vector (by writing the address of the isr0 function to it) and enables all interrupt sources. Core 1 configures the timer peripheral to produce an interrupt every 0.5 seconds. Core 1 also performs the interrupt handler (isr0), which toggles an LED, writes the state to UART0, and resets the watchdog.

```
// interrupts.s
//   Toggle LED in ISR

    // get the core id (special register 0x80) into r7
    movi    r7, #0x80
    lw      r7, r7
    // core1 sets up the timer
    // Core0 enables interrupts and performs the isr
    cmp     r7, r0
    movi    r0, timer
    br      r0, BR_NE

    // Set interrupt vector (0)
    movi    r0, isr0
    movi    r1, #0x1
    movi    r2, #0x08
    lshft   r1, r2
    sw      r0, r1

    // enable all interrupts
    movi    r0, #0x0f
    sw      r0, r1 + #0x8

    // enter idle state
    halt    r0, r0

timer:
    // set timr0 address 0x200 into r0
    // shift 0x01 left 9 places
    movi    r0, #0x01
    movi    r1, #0x09
    lshft   r0, r1

    // Set load value
    //movi    r1, #0x31
    //sw      r1, r0
    // test that we read the expected value back
    //lw      r2, r0

    // set load = 0x3000
    movi    r1, #0x3
    movi    r2, #0x0C
    //movi    r2, #0x04
    lshft   r1, r2
    sw      r1, r0

    // Set prescale value (0x1000 gives ~1.0s, 0x0800 is used for 0.5s)
    //   20ns * load * prescaler = nanosecond delay
    //   20ns * 10000 * 5000 = 1.0s
    //   20ns * 0x3000 * 0x1000 = ~1.0s
    movi    r1, #0x1
    // 1.0 second
    //movi    r2, #0x0C
    // 0.5 second
    movi    r2, #0x0B
    // 0.25 second
    //movi    r2, #0x0a
    // 0.0625 second
    //movi    r2, #0x04
    lshft   r1, r2
    sw      r1, r0 + #0x02

    // Start the timer (write 0x0001 to 0x0201)
    movi    r1, #0x01
    sw      r1, r0 + #0x01

exit:
    // enter idle state
    halt    r0, r0

isr0:
    // read the current GPIO0 (LED) state
    movi    r0, #0x90
    lw      r1, r0
    // xor with 1
    movi    r2, #0x1
    xor     r1, r2
    // write back
    sw      r1, r0

    // write ascii value to uart0
    movi    r0, #0xa0
    movi    r2, #0x30
    add     r1, r2
    sw      r1, r0

    // reset watchdog
    movi    r0, #0xb8
    sw      r1, r0

    // return from interrupt
    intr    r0, r0
```

## D.2 1-160 Core Parallel Summation

This demo performs a parallel summation of the numbers 1 to 320. The algorithm *assigns* each core a subset of the summation space, using the core's ID and the number of cores in the system. Core 0 broadcasts the number of samples that each core must sum, and each core then calculates its subset start and end positions using the formulas below. Each core sums over its subset and adds the result to a global shared value. After every core has pushed its result, the global shared value contains the final summation result.

$$N_{samples} = 320 \tag{D.1}$$

$$N_{threads} = 64 \tag{D.2}$$

$$\mathit{subset} = N_{samples} / N_{threads} \tag{D.3}$$

$$\mathit{start} = \mathit{ID} \times \mathit{subset} \tag{D.4}$$

$$\mathit{end} = \mathit{start} + \mathit{subset} \tag{D.5}$$
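Equations D.1 to D.5 can be checked with a short Python model of the partitioning. This is a sketch under two assumptions that the equations do not state explicitly: the division in D.3 is an integer division, and each core sums the half-open range from start up to (but excluding) end, which matches a loop that stops when the index equals the limit. The function names are hypothetical.

```python
def subset_bounds(core_id, n_samples, n_threads):
    """Return (start, end) for one core, following equations D.3 to D.5."""
    subset = n_samples // n_threads  # D.3, assumed integer division
    start = core_id * subset         # D.4
    end = start + subset             # D.5
    return start, end


def parallel_sum(n_samples, n_threads):
    """Emulate every core summing its own subset into a shared total."""
    total = 0
    for core_id in range(n_threads):
        start, end = subset_bounds(core_id, n_samples, n_threads)
        total += sum(range(start, end))  # half-open range [start, end)
    return total


first = subset_bounds(0, 320, 64)
last = subset_bounds(63, 320, 64)
total = parallel_sum(320, 64)
```

Under these assumptions each of the 64 cores handles 5 values, the subsets do not overlap, and together they cover the indices 0 to 319.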

```
// sum64.s
// Simple 1-160 core summation program

// Set up common values, such as:
//   Core id (r6), number of threads (cores) (r7),
//   shared memory addresses (r5)
entry:
    // Core id in r6
    movi    r0, #0x80
    lw      r0, r0
    // store in r6
    mov     r6, r0

    // get number of threads
    movi    r0, #0x81
    lw      r0, r0
    // store in r7
    mov     r7, r0

    // BRAM0 shared memory 0x1000
    movi    r5, #0x01
    movi    r2, #0x0C
    lshft   r5, r2

jmp_to_barrier:
    // NOT_ROOT
    //   wait at barrier
    cmp     r6, r3
    movi    r4, barrier_arrive
    br      r4, BR_NE

    // ROOT
    //   calculates nsamples_per_thread
    //   ns = 100
    //   nst = ns / (num_threads)
    //   nst = ns >> (num_threads - 1)
    //   r0 = (num_threads - 1) WRONG!!!
root_broadcast:
    // The root (core idx 0) broadcasts the number of samples
    // 16 cores
    //movi    r4, #0x14
    // 32 cores
    //movi    r4, #0x0a
    // 64 cores
    movi    r4, #0x05
    // 80 cores
    //movi    r4, #0x04
    // 160 cores
    //movi    r4, #0x02

    // ROOT
    // Do the broadcast
    //   write nsamples_per_thread to shared bram (broadcast)
    //   0x1001
    sw      r4, r5 + #0x01

// Reach the barrier to tell everyone
//   that we have arrived
barrier_arrive:
    // load latest count
    lwex    r0, r5
    // try increment count
    // increment by 1
    addi    r0, r3 + #0x01
    // attempt store
    swex    r0, r5
    // check success (== 0)
    cmp     r0, r3
    // branch if failed
    movi    r4, barrier_arrive
    br      r4, BR_NE

// Wait in an infinite loop
//   for all cores to 'arrive'
barrier:
    // load the count
    lw      r0, r5
    // compare with number of threads
    cmp     r0, r7
    // jump back to barrier if not equal
    movi    r4, barrier
    br      r4, BR_NE

// EACH CORE
// All cores have arrived and are in sync
synced1:
    // Load the nsamples_per_thread
    lw      r4, r5 + #0x01
    // Calculate nstart = idx * nsamples_per_thread
    //   in r2
    mov     r2, r6
    mult    r2, r4
    // Loop limit in r4
    //   samples_per_thread -> samples_per_thread + nstart
    add     r4, r2

// Perform the summation in a tight for loop
// Sum numbers from nstart to limit
sum_loop:
    // sum += i
    add     r1, r2
    // increment i
    addi    r2, r3 + #0x01
    // check end
    cmp     r2, r4
    movi    r0, sum_loop
    br      r0, BR_NE

// Summation of the subset finished, result is in r1
// Now use a mutex to add it to the global sum value in shared mem
sum_mutex:
    // load latest global sum
    lwex    r0, r5 + #0x2
    // add this core's subset sum
    add     r0, r1
    // make copy as swex has a return value
    mov     r2, r0
    // attempt store
    swex    r0, r5 + #0x02
    // check success (== 0)
    cmp     r0, r3
    // branch if failed
    movi    r4, sum_mutex
    br      r4, BR_NE

// Write the latest global sum value to gpio1
write_gpio:
    movi    r3, #0x91
    sw      r2, r3

// Write this core's id (as an ascii character) to uart0 tx
write_uart_done:
    movi    r3, #0xa0
    movi    r2, #0x30
    add     r2, r6
    sw      r2, r3

// This core has finished
// Enter a low power state
exit:
    halt    r0, r0
```
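The `barrier_arrive` and `sum_mutex` sections of the listing both retry an `lwex`/`swex` pair until the store succeeds. The Python sketch below models that retry loop. The return convention (0 on success) follows the comments in the listing, but the monitor policy used here (one reservation per address, taken by the most recent `lwex`) is an assumption made for illustration, and the class and function names are hypothetical.

```python
class ExclusiveMemory:
    """Toy model of shared memory with a global exclusive-access monitor."""

    def __init__(self):
        self.cells = {}
        self.reservations = {}  # address -> id of the core holding it

    def lwex(self, core, address):
        """Load a word and reserve the address for `core`."""
        self.reservations[address] = core
        return self.cells.get(address, 0)

    def swex(self, core, address, value):
        """Store only if `core` still holds the reservation.

        Returns 0 on success and 1 on failure.
        """
        if self.reservations.get(address) != core:
            return 1
        self.cells[address] = value
        del self.reservations[address]
        return 0


def atomic_add(memory, core, address, amount, interfere=None):
    """Retry lwex/swex until the add succeeds; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        value = memory.lwex(core, address)
        if interfere and attempts == 1:
            interfere()  # another core wins the race on the first attempt
        if memory.swex(core, address, value + amount) == 0:
            return attempts


mem = ExclusiveMemory()
uncontended = atomic_add(mem, core=0, address=0x1002, amount=10)
contended = atomic_add(mem, core=1, address=0x1002, amount=5,
                       interfere=lambda: atomic_add(mem, 2, 0x1002, 7))
```

In the contended case the first store by core 1 fails because core 2 took the reservation, so core 1 reloads the updated value and retries, and no update is lost.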

# Appendix E

# **Code Listing**

| Section | Title                   |
|---------|-------------------------|
| E.1     | SoC Code Listing        |
| E.1.1   | vmicro16_soc_config.v   |
| E.1.2   | top_ms.v                |
| E.1.3   | vmicro16_soc.v          |
| E.1.4   | vmicro16.v              |
| E.2     | Peripheral Code Listing |

## E.1 SoC Code Listing

### E.1.1 vmicro16\_soc\_config.v

Configuration file for configuring the vmicro16\_soc.v and vmicro16.v features.

```
// Configuration defines for the vmicro16_soc and vmicro16 cpu.

`ifndef VMICRO16_SOC_CONFIG_H
`define VMICRO16_SOC_CONFIG_H

`include "clog2.v"

`define FORMAL

`define CORES  4
`define SLAVES 8

// Top level data width for registers, memory cells, bus widths
`define DATA_WIDTH 16

// Set this to use a workaround for the MMU's APB T2 clock
//`define FIX_T3

// Instruction memory (read only)
//   Must be large enough to support software program.
`ifdef DEF_CORE_HAS_INSTR_MEM
    // 64 16-bit words per core
    `define DEF_MEM_INSTR_DEPTH 64
`else
    // 4096 16-bit words global
    `define DEF_MEM_INSTR_DEPTH 4096
`endif

// Scratch memory (read/write) on each core.
//   See `DEF_MMU_TIM0_* defines for info.
`define DEF_MEM_SCRATCH_DEPTH 64

// Enables hardware multiplier and mult rr instruction
`define DEF_ALU_HW_MULT 1

// Enables global reset (requires more luts)
//`define DEF_GLOBAL_RESET

// Enable a watch dog timer to reset the soc if threadlocked
//`define DEF_USE_WATCHDOG

// Enables instruction memory programming via UART0
//`define DEF_USE_REPROG

`ifdef DEF_USE_REPROG
    `ifndef DEF_GLOBAL_RESET
        `error_DEF_USE_REPROG_requires_DEF_GLOBAL_RESET
    `endif
`endif

// Peripheral select (PSELx) indexes
`define APB_PSELX_GPIO0 0
`define APB_PSELX_UART0 1
`define APB_PSELX_REGS0 2
`define APB_PSELX_BRAM0 3
`define APB_PSELX_GPIO1 4
`define APB_PSELX_GPIO2 5
`define APB_PSELX_TIMR0 6
`define APB_PSELX_WDOG0 7

`define APB_GPIO0_PINS 8
`define APB_GPIO1_PINS 16
`define APB_GPIO2_PINS 8

// Shared memory words
`define APB_BRAM0_CELLS 4096

// Memory mapping

// TIM0
//   Number of scratch memory cells per core
`define DEF_MMU_TIM0_CELLS 64
`define DEF_MMU_TIM0_S     16'h0000
`define DEF_MMU_TIM0_E     16'h007F
// SREG
`define DEF_MMU_SREG_S     16'h0080
`define DEF_MMU_SREG_E     16'h008F
// GPIO0
`define DEF_MMU_GPIO0_S    16'h0090
`define DEF_MMU_GPIO0_E    16'h0090
// GPIO1
`define DEF_MMU_GPIO1_S    16'h0091
`define DEF_MMU_GPIO1_E    16'h0091
// GPIO2
`define DEF_MMU_GPIO2_S    16'h0092
`define DEF_MMU_GPIO2_E    16'h0092
// UART0
`define DEF_MMU_UART0_S    16'h00A0
`define DEF_MMU_UART0_E    16'h00A1
// REGS0
`define DEF_MMU_REGS0_S    16'h00B0
`define DEF_MMU_REGS0_E    16'h00B7
// WDOG0
`define DEF_MMU_WDOG0_S    16'h00B8
`define DEF_MMU_WDOG0_E    16'h00B8
// BRAM0
`define DEF_MMU_BRAM0_S    16'h1000
`define DEF_MMU_BRAM0_E    16'h1fff
// TIMR0
`define DEF_MMU_TIMR0_S    16'h0200
`define DEF_MMU_TIMR0_E    16'h0202

// Interrupts
// Enable/disable interrupts
//   Disabling will free up resources for other features
//`define DEF_ENABLE_INT
// Number of interrupt in signals
`define DEF_NUM_INT 8
// Default interrupt bitmask (0 = hidden, 1 = enabled)
`define DEF_INT_MASK
// Bit position of the TIMR0 interrupt signal
`define DEF_INT_TIMR0
// Interrupt vector memory location
`define DEF_MMU_INTSV_S    16'h0100
`define DEF_MMU_INTSV_E    16'h0107

`endif
```

### E.1.2 top\_ms.v

Top level module that connects the SoC design to hardware pins on the FPGA.

```
module seven_display # (
    parameter INVERT = 1
) (
 1
 3
                       input [3:0] n,
output [6:0] segments
 4
 5
 6
7
                       reg [6:0] bits;
                       assign segments = (INVERT ? "bits : bits);
 9
                       always @(n)
10
                      always w(m)
case (n)
4'h0: bits = 7'b0111111; // 0
4'h1: bits = 7'b0000110; // 1
11
                              4'h1: bits = 7'b0000110; // 1
4'h2: bits = 7'b1011011; // 2
4'h3: bits = 7'b1011011; // 3
4'h4: bits = 7'b1101101; // 4
4'h5: bits = 7'b1101101; // 5
4'h6: bits = 7'b1111101; // 6
4'h7: bits = 7'b1111101; // 6
4'h7: bits = 7'b1111111; // 8
4'h9: bits = 7'b1110111; // 8
4'h8: bits = 7'b1110111; // A
4'h8: bits = 7'b111100; // B
13
14
15
17
18
19
20
21
22
23
                               4'hC: bits = 7'b0111100; // B
4'hC: bits = 7'b0111001; // C
4'hD: bits = 7'b1011110; // D
4'hE: bits = 7'b1111001; // E
4'hF: bits = 7'b1110001; // F
24
25
26
27
28
                       endcase
              endmodule
30
31
             // minispartan6+ XC6SLX9
module top_ms # (
    parameter GPIO_PINS = 8
) ( .
    input           CLK50,
    input  [3:0]    SW,
    // UART
    input           RXD,
    output          TXD,
    // Peripherals
    output [7:0]    LEDS,

    // 3v3 input from the s6 on the delsoc
    input           S6_3v3,

    // SSDs
    output [6:0] ssd0,
    output [6:0] ssd1,
    output [6:0] ssd2,
    output [6:0] ssd3,
    output [6:0] ssd4,
    output [6:0] ssd5
);
    //wire [15:0]    M_PADDR;
    //wire           M_PWRITE;
    //wire [5-1:0]   M_PSELx;   // not shared
    //wire           M_PENABLE;
    //wire [15:0]    M_PWDATA;
    //wire [15:0]    M_PRDATA;  // input to intercon
    //wire           M_PREADY;  // input to intercon

    wire [7:0]  gpio0;
    wire [15:0] gpio1;
    wire [7:0]  gpio2;

    vmicro16_soc soc (
        .clk        (CLK50),
        .reset      (~SW[0]),

        //.M_PADDR    (M_PADDR),
        //.M_PWRITE   (M_PWRITE),
        //.M_PSELx    (M_PSELx),
        //.M_PENABLE  (M_PENABLE),
        //.M_PWDATA   (M_PWDATA),
        //.M_PRDATA   (M_PRDATA),
        //.M_PREADY   (M_PREADY),

        // UART
        .uart_tx    (TXD),
        .uart_rx    (RXD),

        // GPIO
        .gpio0      (LEDS[3:0]),
        .gpio1      (gpio1),
        .gpio2      (gpio2),

        // DBUG
        .dbug0      (LEDS[4])
        //.dbug1    (LEDS[7:4])
    );

    assign LEDS[7:5] = {TXD, RXD, S6_3v3};

    // SSD displays (split across 2 gpio ports 1 and 2)
    wire [3:0] ssd_chars [0:5];
    assign ssd_chars[0] = gpio1[3:0];
    assign ssd_chars[1] = gpio1[7:4];
    assign ssd_chars[2] = gpio1[11:8];
    assign ssd_chars[3] = gpio1[15:12];
    assign ssd_chars[4] = gpio2[3:0];
    assign ssd_chars[5] = gpio2[7:4];
    seven_display ssd_0 (.n(ssd_chars[0]), .segments (ssd0));
    seven_display ssd_1 (.n(ssd_chars[1]), .segments (ssd1));
    seven_display ssd_2 (.n(ssd_chars[2]), .segments (ssd2));
    seven_display ssd_3 (.n(ssd_chars[3]), .segments (ssd3));
    seven_display ssd_4 (.n(ssd_chars[4]), .segments (ssd4));
    seven_display ssd_5 (.n(ssd_chars[5]), .segments (ssd5));

endmodule
```
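The top-level listing above drives six seven-segment displays from 4-bit characters through `seven_display` instances, whose source is not reproduced in this section. As a rough illustration of what such a decoder might look like, the sketch below maps a hex nibble to segment outputs. The module name `seven_display_sketch`, the segment ordering (`segments[0]` = a through `segments[6]` = g), and the active-high polarity are all assumptions; the real module and the board wiring may differ.

```verilog
// Hypothetical stand-in for a hex-to-seven-segment decoder using the same
// port names as the seven_display instances above (n, segments).
// Assumptions: segments[0] = a ... segments[6] = g, active-high outputs.
module seven_display_sketch (
    input  wire [3:0] n,
    output reg  [6:0] segments
);
    always @(*)
        case (n)            //     gfedcba
            4'h0: segments = 7'b0111111;
            4'h1: segments = 7'b0000110;
            4'h2: segments = 7'b1011011;
            4'h3: segments = 7'b1001111;
            4'h4: segments = 7'b1100110;
            4'h5: segments = 7'b1101101;
            4'h6: segments = 7'b1111101;
            4'h7: segments = 7'b0000111;
            4'h8: segments = 7'b1111111;
            4'h9: segments = 7'b1101111;
            4'hA: segments = 7'b1110111;
            4'hB: segments = 7'b1111100;
            4'hC: segments = 7'b0111001;
            4'hD: segments = 7'b1011110;
            4'hE: segments = 7'b1111001;
            4'hF: segments = 7'b1110001;
        endcase
endmodule
```

If the displays on the target board are active-low, the outputs would need inverting (for example `assign ssd0 = ~segments;` at the instantiation site).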

## E.1.3 vmicro16\_soc.v

```
`include "vmicro16_soc_config.v"
`include "clog2.v"
`include "formal.v"

module pow_reset # (
    parameter INIT = 1,
    parameter N    = 8
) (
    input       clk,
    input       reset,
    output reg  resethold
);
    reg [`clog2(N)-1:0] hold = (N-1);

    initial resethold = INIT ? (N-1) : 0;

    always @(*)
        resethold = |hold;

    always @(posedge clk)
        if (reset)
            hold <= N-1;
        else
            if (hold)
                hold <= hold - 1;
endmodule

// Vmicro16 multi-core SoC with various peripherals
// and interrupts
module vmicro16_soc (
    input clk,
    input reset,

    // UART0
    input                           uart_rx,
    output                          uart_tx,

    output [`APB_GPIO0_PINS-1:0]    gpio0,
    output [`APB_GPIO1_PINS-1:0]    gpio1,
    output [`APB_GPIO2_PINS-1:0]    gpio2,

    output                          halt,

    output [`CORES-1:0]             dbug0,
    output [`CORES*8-1:0]           dbug1
);
    wire [`CORES-1:0] w_halt;
    assign halt = &w_halt;

    assign dbug0 = w_halt;

    // Watchdog reset pulse signal.
    //   Passed to pow_reset to generate a longer reset pulse
    wire wdreset;
    wire prog_prog;

    // soft register reset hold for brams and registers
    wire soft_reset;
    `ifdef DEF_GLOBAL_RESET
        pow_reset # (
            .INIT       (1),
            .N          (8)
        ) por_inst (
            .clk        (clk),
            `ifdef DEF_USE_WATCHDOG
            .reset      (reset | wdreset | prog_prog),
            `else
            .reset      (reset),
            `endif
            .resethold  (soft_reset)
        );
    `else
        assign soft_reset = 0;
    `endif

    // Peripherals (master to slave)
    wire [`APB_WIDTH-1:0]           M_PADDR;
    wire                            M_PWRITE;
    wire [`SLAVES-1:0]              M_PSELx;  // not shared
    wire                            M_PENABLE;
    wire [`DATA_WIDTH-1:0]          M_PWDATA;
    wire [`SLAVES*`DATA_WIDTH-1:0]  M_PRDATA; // input to intercon
    wire [`SLAVES-1:0]              M_PREADY; // input

    // Master apb interfaces
    wire [`CORES*`APB_WIDTH-1:0]    w_PADDR;
    wire [`CORES-1:0]               w_PWRITE;
    wire [`CORES-1:0]               w_PSELx;
    wire [`CORES-1:0]               w_PENABLE;
    wire [`CORES*`DATA_WIDTH-1:0]   w_PWDATA;
    wire [`CORES*`DATA_WIDTH-1:0]   w_PRDATA;
    wire [`CORES-1:0]               w_PREADY;

    // Interrupts
    `ifdef DEF_ENABLE_INT
    wire [`DEF_NUM_INT-1:0]             ints;
    wire [`DEF_NUM_INT*`DATA_WIDTH-1:0] ints_data;
    `endif

    apb_intercon_s # (
        .MASTER_PORTS   (`CORES),
        .SLAVE_PORTS    (`SLAVES),
        .BUS_WIDTH      (`APB_WIDTH),
        .DATA_WIDTH     (`DATA_WIDTH),
        .HAS_PSELX_ADDR (1)
    ) apb (
        .clk        (clk),
        .reset      (soft_reset),
        // APB master to slave
        .S_PADDR    (w_PADDR),
        .S_PWRITE   (w_PWRITE),
        .S_PSELx    (w_PSELx),
        .S_PENABLE  (w_PENABLE),
        .S_PWDATA   (w_PWDATA),
        .S_PRDATA   (w_PRDATA),
        .S_PREADY   (w_PREADY),
        // shared bus
        .M_PADDR    (M_PADDR),
        .M_PWRITE   (M_PWRITE),
        .M_PSELx    (M_PSELx),
        .M_PENABLE  (M_PENABLE),
        .M_PWDATA   (M_PWDATA),
        .M_PRDATA   (M_PRDATA),
        .M_PREADY   (M_PREADY)
    );

    `ifdef DEF_USE_WATCHDOG
    vmicro16_watchdog_apb # (
        .BUS_WIDTH  (`APB_WIDTH),
        .NAME       ("WDOG0")
    ) wdog0_apb (
        .clk        (clk),
        .reset      (),
        // apb slave to master interface
        .S_PADDR    (),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_WDOG0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (),
        .S_PRDATA   (),
        .S_PREADY   (M_PREADY[`APB_PSELX_WDOG0]),

        .wdreset    (wdreset)
    );
    `endif

    vmicro16_gpio_apb # (
        .BUS_WIDTH  (`APB_WIDTH),
        .DATA_WIDTH (`DATA_WIDTH),
        .PORTS      (`APB_GPIO0_PINS),
        .NAME       ("GPIO0")
    ) gpio0_apb (
        .clk        (clk),
        .reset      (soft_reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_GPIO0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_GPIO0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_GPIO0]),
        .gpio       (gpio0)
    );

    // GPIO1 for Seven segment displays (16 pin)
    vmicro16_gpio_apb # (
        .BUS_WIDTH  (`APB_WIDTH),
        .DATA_WIDTH (`DATA_WIDTH),
        .PORTS      (`APB_GPIO1_PINS),
        .NAME       ("GPIO1")
    ) gpio1_apb (
        .clk        (clk),
        .reset      (soft_reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_GPIO1]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_GPIO1*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_GPIO1]),
        .gpio       (gpio1)
    );

    // GPIO2 for Seven segment displays (8 pin)
    vmicro16_gpio_apb # (
        .BUS_WIDTH  (`APB_WIDTH),
        .DATA_WIDTH (`DATA_WIDTH),
        .PORTS      (`APB_GPIO2_PINS),
        .NAME       ("GPIO2")
    ) gpio2_apb (
        .clk        (clk),
        .reset      (soft_reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_GPIO2]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_GPIO2*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_GPIO2]),
        .gpio       (gpio2)
    );

    apb_uart_tx # (
        .DATA_WIDTH (8),
        .ADDR_EXP   (4) // 2^4 = 16 FIFO words
    ) uart0_apb (
        .clk        (clk),
        .reset      (soft_reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_UART0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_UART0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_UART0]),
        // uart wires
        .tx_wire    (uart_tx),
        .rx_wire    (uart_rx)
    );

    timer_apb timr0 (
        .clk        (clk),
        .reset      (soft_reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_TIMR0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_TIMR0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_TIMR0])

        `ifdef DEF_ENABLE_INT
        ,.out       (ints     [`DEF_INT_TIMR0]),
         .int_data  (ints_data[`DEF_INT_TIMR0*`DATA_WIDTH +: `DATA_WIDTH])
        `endif
    );

    vmicro16_regs_apb # (
        .BUS_WIDTH          (`APB_WIDTH),
        .DATA_WIDTH         (`DATA_WIDTH),
        .CELL_DEPTH         (8),
        .PARAM_DEFAULTS_R0  (`CORES),
        .PARAM_DEFAULTS_R1  (`SLAVES)
    ) regs0_apb (
        .clk        (clk),
        .reset      (soft_reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_REGS0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_REGS0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_REGS0])
    );

    vmicro16_bram_ex_apb # (
        .BUS_WIDTH    (`APB_WIDTH),
        .MEM_WIDTH    (`DATA_WIDTH),
        .MEM_DEPTH    (`APB_BRAM0_CELLS),
        .CORE_ID_BITS (`clog2(`CORES))
    ) bram_apb (
        .clk        (clk),
        .reset      (soft_reset),
        // apb slave to master interface
        .S_PADDR    (M_PADDR),
        .S_PWRITE   (M_PWRITE),
        .S_PSELx    (M_PSELx[`APB_PSELX_BRAM0]),
        .S_PENABLE  (M_PENABLE),
        .S_PWDATA   (M_PWDATA),
        .S_PRDATA   (M_PRDATA[`APB_PSELX_BRAM0*`DATA_WIDTH +: `DATA_WIDTH]),
        .S_PREADY   (M_PREADY[`APB_PSELX_BRAM0])
    );

    // There must be at least 1 core
    `static_assert(`CORES > 0)
    `static_assert(`DEF_MEM_INSTR_DEPTH > 0)
    `static_assert(`DEF_MMU_TIM0_CELLS > 0)

    // Single instruction memory
    `ifndef DEF_CORE_HAS_INSTR_MEM
    // slave input/outputs from interconnect
    wire [`APB_WIDTH-1:0]           instr_M_PADDR;
    wire                            instr_M_PWRITE;
    wire [1-1:0]                    instr_M_PSELx;   // not shared
    wire                            instr_M_PENABLE;
    wire [`DATA_WIDTH-1:0]          instr_M_PWDATA;
    wire [1*`DATA_WIDTH-1:0]        instr_M_PRDATA;  // slave response
    wire [1-1:0]                    instr_M_PREADY;  // slave response

    // Master apb interfaces
    wire [`CORES*`APB_WIDTH-1:0]    instr_w_PADDR;
    wire [`CORES-1:0]               instr_w_PWRITE;
    wire [`CORES-1:0]               instr_w_PSELx;
    wire [`CORES-1:0]               instr_w_PENABLE;
    wire [`CORES*`DATA_WIDTH-1:0]   instr_w_PWDATA;
    wire [`CORES*`DATA_WIDTH-1:0]   instr_w_PRDATA;
    wire [`CORES-1:0]               instr_w_PREADY;

    `ifdef DEF_USE_REPROG
        wire [`clog2(`DEF_MEM_INSTR_DEPTH)-1:0] prog_addr;
        wire [`DATA_WIDTH-1:0]                  prog_data;
        wire                                    prog_we;
        uart_prog rom_prog (
            .clk        (clk),
            .reset      (reset | wdreset),
            // input stream
            .uart_rx    (uart_rx),
            // programmer
            .addr       (prog_addr),
            .data       (prog_data),
            .we         (prog_we),
            .prog       (prog_prog)
        );
    `endif

    `ifdef DEF_USE_REPROG
        vmicro16_bram_prog_apb
    `else
        vmicro16_bram_apb
    `endif
    # (
        .BUS_WIDTH  (`APB_WIDTH),
        .MEM_WIDTH  (`DATA_WIDTH),
        .MEM_DEPTH  (`DEF_MEM_INSTR_DEPTH),
        .USE_INITS  (1),
        .NAME       ("INSTR_ROM_G")
    ) instr_rom_apb (
        .clk        (clk),
        .reset      (reset),
        .S_PADDR    (instr_M_PADDR),
        .S_PWRITE   (0),
        .S_PSELx    (instr_M_PSELx),
        .S_PENABLE  (instr_M_PENABLE),
        .S_PWDATA   (0),
        .S_PRDATA   (instr_M_PRDATA),
        .S_PREADY   (instr_M_PREADY)

        `ifdef DEF_USE_REPROG
            ,
            .addr   (prog_addr),
            .data   (prog_data),
            .we     (prog_we),
            .prog   (prog_prog)
        `endif
    );

    apb_intercon_s # (
        .MASTER_PORTS   (`CORES),
        .SLAVE_PORTS    (1),
        .BUS_WIDTH      (`APB_WIDTH),
        .DATA_WIDTH     (`DATA_WIDTH),
        .HAS_PSELX_ADDR (0)
    ) apb_instr_intercon (
        .clk        (clk),
        .reset      (soft_reset),
        // APB master from cores
        // master
        .S_PADDR    (instr_w_PADDR),
        .S_PWRITE   (instr_w_PWRITE),
        .S_PSELx    (instr_w_PSELx),
        .S_PENABLE  (instr_w_PENABLE),
        .S_PWDATA   (instr_w_PWDATA),
        .S_PRDATA   (instr_w_PRDATA),
        .S_PREADY   (instr_w_PREADY),
        // shared bus slaves
        // slave outputs
        .M_PADDR    (instr_M_PADDR),
        .M_PWRITE   (instr_M_PWRITE),
        .M_PSELx    (instr_M_PSELx),
        .M_PENABLE  (instr_M_PENABLE),
        .M_PWDATA   (instr_M_PWDATA),
        .M_PRDATA   (instr_M_PRDATA),
        .M_PREADY   (instr_M_PREADY)
    );
    `endif

    genvar i;
    generate for(i = 0; i < `CORES; i = i + 1) begin : cores

        vmicro16_core # (
            .CORE_ID            (i),
            .DATA_WIDTH         (`DATA_WIDTH),

            .MEM_INSTR_DEPTH    (`DEF_MEM_INSTR_DEPTH),
            .MEM_SCRATCH_DEPTH  (`DEF_MMU_TIM0_CELLS)
        ) c1 (
            .clk        (clk),
            .reset      (soft_reset),

            // debug
            .halt       (w_halt[i]),

            // interrupts
            .ints       (ints),
            .ints_data  (ints_data),

            // Output master port 1
            .w_PADDR    (w_PADDR   [`APB_WIDTH*i +: `APB_WIDTH]  ),
            .w_PWRITE   (w_PWRITE  [i]                           ),
            .w_PSELx    (w_PSELx   [i]                           ),
            .w_PENABLE  (w_PENABLE [i]                           ),
            .w_PWDATA   (w_PWDATA  [`DATA_WIDTH*i +: `DATA_WIDTH]),
            .w_PRDATA   (w_PRDATA  [`DATA_WIDTH*i +: `DATA_WIDTH]),
            .w_PREADY   (w_PREADY  [i]                           )

    `ifndef DEF_CORE_HAS_INSTR_MEM
            // APB instruction rom
            , // Output master port 2
            .w2_PADDR   (instr_w_PADDR   [`APB_WIDTH*i +: `APB_WIDTH]  ),
            //.w2_PWRITE  (instr_w_PWRITE  [i]                           ),
            .w2_PSELx   (instr_w_PSELx   [i]                           ),
            .w2_PENABLE (instr_w_PENABLE [i]                           ),
            //.w2_PWDATA  (instr_w_PWDATA  [`DATA_WIDTH*i +: `DATA_WIDTH]),
            .w2_PRDATA  (instr_w_PRDATA  [`DATA_WIDTH*i +: `DATA_WIDTH]),
            .w2_PREADY  (instr_w_PREADY  [i]                           )
    `endif
        );
    end
    endgenerate

    // Formal Verification
    `ifdef FORMAL
    wire all_halted = &w_halt;

    // Count number of clocks each core is spending on
    //   bus transactions
    reg [15:0] bus_core_times  [0:`CORES-1];
    reg [15:0] core_work_times [0:`CORES-1];
    integer i2;
    initial
        for(i2 = 0; i2 < `CORES; i2 = i2 + 1) begin
            bus_core_times[i2]  = 0;
            core_work_times[i2] = 0;
        end

    // total bus time
    generate
        genvar g2;
        for (g2 = 0; g2 < `CORES; g2 = g2 + 1) begin : formal_for_times
            always @(posedge clk) begin
                if (w_PSELx[g2])
                    bus_core_times[g2] <= bus_core_times[g2] + 1;

                // Core working time
                `ifndef DEF_CORE_HAS_INSTR_MEM
                    if (!w_PSELx[g2] && !instr_w_PSELx[g2])
                `else
                    if (!w_PSELx[g2])
                `endif
                        if (!w_halt[g2])
                            core_work_times[g2] <= core_work_times[g2] + 1;
            end
        end
    endgenerate

    reg [15:0] bus_time_average   = 0;
    reg [15:0] bus_reqs_average   = 0;
    reg [15:0] fetch_time_average = 0;
    reg [15:0] work_time_average  = 0;

    always @(all_halted) begin
        for (i2 = 0; i2 < `CORES; i2 = i2 + 1) begin
            bus_time_average   = bus_time_average   + bus_core_times[i2];
            bus_reqs_average   = bus_reqs_average   + bus_core_reqs_count[i2];
            work_time_average  = work_time_average  + core_work_times[i2];
            fetch_time_average = fetch_time_average + instr_fetch_times[i2];
        end

        bus_time_average   = bus_time_average   / `CORES;
        bus_reqs_average   = bus_reqs_average   / `CORES;
        work_time_average  = work_time_average  / `CORES;
        fetch_time_average = fetch_time_average / `CORES;
    end

    // Count number of bus requests per core

    // clock delay of w_PSELx
    reg  [`CORES-1:0] bus_core_reqs_last;
    // rising edges of each
    wire [`CORES-1:0] bus_core_reqs_real;
    // storage for counters for each core
    reg  [15:0]       bus_core_reqs_count [0:`CORES-1];

    initial
        for(i2 = 0; i2 < `CORES; i2 = i2 + 1)
            bus_core_reqs_count[i2] = 0;

    // 1 clk delay to detect rising edge
    always @(posedge clk)
        bus_core_reqs_last <= w_PSELx;

    generate
        genvar g3;
        for (g3 = 0; g3 < `CORES; g3 = g3 + 1) begin : formal_for_reqs
            // Detect new reqs for each core
            assign bus_core_reqs_real[g3] = w_PSELx[g3] >
                                            bus_core_reqs_last[g3];

            always @(posedge clk)
                if (bus_core_reqs_real[g3])
                    bus_core_reqs_count[g3] <= bus_core_reqs_count[g3] + 1;
        end
    endgenerate

    `ifndef DEF_CORE_HAS_INSTR_MEM
        reg [15:0] instr_fetch_times [0:`CORES-1];
        integer i3;
        initial
            for(i3 = 0; i3 < `CORES; i3 = i3 + 1)
                instr_fetch_times[i3] = 0;

        // total bus time
        // Instruction fetches occur on the w2 master port
        generate
            genvar g4;
            for (g4 = 0; g4 < `CORES; g4 = g4 + 1) begin : formal_for_fetch_times
                always @(posedge clk)
                    if (instr_w_PSELx[g4])
                        instr_fetch_times[g4] <= instr_fetch_times[g4] + 1;
            end
        endgenerate
    `endif

    `endif // end FORMAL

endmodule
```
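The `pow_reset` module at the top of this listing stretches a short reset request into a longer hold pulse by reloading a down-counter and holding its output high until the counter reaches zero. The following self-contained sketch illustrates the same idea with a small testbench. It is not the project's module: the names `reset_stretch_sketch` and `reset_stretch_sketch_tb` are hypothetical, and it assumes the Verilog-2005 `$clog2` system function in place of the project's `` `clog2 `` macro.

```verilog
// Sketch of a reset stretcher in the style of pow_reset.
// Assumptions: $clog2 (Verilog-2005) replaces the project's `clog2 macro;
// all module names here are hypothetical.
module reset_stretch_sketch #(
    parameter N = 8
) (
    input  wire clk,
    input  wire reset,
    output wire resethold
);
    // Down-counter: reloaded while reset is high, then counts to zero.
    reg [$clog2(N)-1:0] hold = N-1;

    // resethold stays asserted while the counter is non-zero.
    assign resethold = |hold;

    always @(posedge clk)
        if (reset)
            hold <= N-1;
        else if (hold != 0)
            hold <= hold - 1;
endmodule

module reset_stretch_sketch_tb;
    reg  clk   = 0;
    reg  reset = 0;
    wire resethold;
    integer high_cycles = 0;

    reset_stretch_sketch #(.N(8)) dut (
        .clk(clk), .reset(reset), .resethold(resethold)
    );

    always #5 clk = ~clk;

    // Count the clock edges on which resethold is still asserted.
    always @(posedge clk)
        if (resethold) high_cycles = high_cycles + 1;

    initial begin
        // One-cycle reset request, then observe the stretched output.
        @(negedge clk) reset = 1;
        @(negedge clk) reset = 0;
        repeat (20) @(posedge clk);
        $display("resethold was high for %0d clock edges", high_cycles);
        $finish;
    end
endmodule
```

Running the testbench in a simulator that supports Verilog-2005 should show `resethold` remaining asserted for several clocks after `reset` is released, which is the behaviour the SoC relies on to hold its block RAMs and registers in reset.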

## **E.1.4** vmicro16.v

Vmicro16 CPU core module.

```
// This file contains multiple modules.
//   Verilator likes 1 file for each module
/* verilator lint_off DECLFILENAME */
/* verilator lint_off UNUSED */
/* verilator lint_off BLKSEQ */
/* verilator lint_off WIDTH */

// Include Vmicro16 ISA containing definitions for the bits
`include "vmicro16_isa.v"

`include "clog2.v"
`include "formal.v"

// This module aims to be a SYNCHRONOUS, WRITE_FIRST BLOCK RAM
//   https://www.xilinx.com/support/documentation/user_guides/ug473_7Series_Memory_Resources.pdf
//   https://www.xilinx.com/support/documentation/user_guides/ug383.pdf
//   https://www.xilinx.com/support/documentation/sw_manuals/xilinx2016_4/ug901-vivado-synthesis.pdf
module vmicro16_bram # (
    parameter MEM_WIDTH         = 16,
    parameter MEM_DEPTH         = 64,
    parameter CORE_ID           = 0,
    parameter USE_INITS         = 0,
    parameter PARAM_DEFAULTS_R0 = 0,
    parameter PARAM_DEFAULTS_R1 = 0,
    parameter PARAM_DEFAULTS_R2 = 0,
    parameter PARAM_DEFAULTS_R3 = 0,
    parameter NAME              = "BRAM"
) (
    input clk,
    input reset,

    input      [`clog2(MEM_DEPTH)-1:0] mem_addr,
    input      [MEM_WIDTH-1:0]         mem_in,
    input                              mem_we,
    output reg [MEM_WIDTH-1:0]         mem_out
);
    // memory vector
    (* ram_style = "block" *)
    reg [MEM_WIDTH-1:0] mem [0:MEM_DEPTH-1];

                     // not synthesizable
 43
                    integer i;
initial begin
 44
 45
                           for (i = 0; i < MEM_DEPTH;
mem[0] = PARAM_DEFAULTS_RO;
 46
                                                i < MEM_DEPTH; i = i + 1) mem[i] = 0;</pre>
 47
                           mem[1] = PARAM_DEFAULTS_R1;
mem[2] = PARAM_DEFAULTS_R2;
mem[3] = PARAM_DEFAULTS_R3;
 48
 49
 50
 51
        if (USE_INITS) begin
            //`define TEST_SW
            `ifdef TEST_SW
            $readmemh("E:\\Projects\\uni\\vmicro16\\sw\\verilog_memh.txt", mem);
            `endif

            `define TEST_ASM
            `ifdef TEST_ASM
            $readmemh("E:\\Projects\\uni\\vmicro16\\sw\\asm.s.hex", mem);
            `endif

            //`define TEST_COND
            `ifdef TEST_COND
            mem[0] = {`VMICRO16_OP_MOVI, 3'h7, 8'hC0}; // lock
            `endif

            //`define TEST_CMP
            `ifdef TEST_CMP
            mem[0] = {`VMICRO16_OP_MOVI, 3'h0, 8'h0A};
            mem[1] = {`VMICRO16_OP_MOVI, 3'h1, 8'h0B};
            mem[2] = {`VMICRO16_OP_CMP,  3'h1, 3'h0, 5'h1};
            `endif

            //`define TEST_LWEX
            `ifdef TEST_LWEX
            mem[0] = {`VMICRO16_OP_MOVI, 3'h0, 8'hC5};
            mem[1] = {`VMICRO16_OP_SW,   3'h0, 3'h0, 5'h1};
            mem[2] = {`VMICRO16_OP_LW,   3'h2, 3'h0, 5'h1};
            mem[3] = {`VMICRO16_OP_LWEX, 3'h2, 3'h0, 5'h1};
            mem[4] = {`VMICRO16_OP_SWEX, 3'h3, 3'h0, 5'h1};
            `endif

            //`define TEST_MULTICORE
            `ifdef TEST_MULTICORE
            mem[0] = {`VMICRO16_OP_MOVI, 3'h0, 8'h90};
            mem[1] = {`VMICRO16_OP_MOVI, 3'h1, 8'h33};
            mem[2] = {`VMICRO16_OP_SW,   3'h1, 3'h0, 5'h0};
            mem[3] = {`VMICRO16_OP_MOVI, 3'h0, 8'h80};
            mem[4] = {`VMICRO16_OP_MOVI, 3'h2, 3'h0, 5'h0};
            mem[5] = {`VMICRO16_OP_MOVI, 3'h1, 8'h33};
            mem[6] = {`VMICRO16_OP_MOVI, 3'h1, 8'h33};
            mem[7] = {`VMICRO16_OP_MOVI, 3'h1, 8'h33};
            mem[8] = {`VMICRO16_OP_MOVI, 3'h0, 8'h91};
            mem[9] = {`VMICRO16_OP_SW,   3'h2, 3'h0, 5'h0};
```

```
            `endif

            //`define ALL_TEST
            `ifdef ALL_TEST
            // Standard all test
            // REGS0
            mem[0] = {`VMICRO16_OP_MOVI, 3'h0, 8'h81};
            mem[1] = {`VMICRO16_OP_SW,   3'h1, 3'h0, 5'h0}; // MMU[0x81] = 6
            mem[2] = {`VMICRO16_OP_SW,   3'h2, 3'h0, 5'h1}; // MMU[0x82] = 6
            // GPIO0
            mem[3] = {`VMICRO16_OP_MOVI, 3'h0, 8'h90};
            mem[4] = {`VMICRO16_OP_MOVI, 3'h1, 8'hD};
            mem[5] = {`VMICRO16_OP_SW,   3'h1, 3'h0, 5'h0};
            mem[6] = {`VMICRO16_OP_LW,   3'h2, 3'h0, 5'h0};
            // TIM0
            mem[7] = {`VMICRO16_OP_MOVI, 3'h0, 8'h07};
            mem[8] = {`VMICRO16_OP_LW,   3'h3, 3'h0, 5'h03};
            // UART0
                                         3'h0, 8'hA0};      // UART0
                                         3'h1, 8'h41};      // ascii A
                                         3'h1, 3'h0, 5'h0};
                                         3'h1, 8'h42};      // ascii B
                                         3'h1, 3'h0, 5'h0};
                                         3'h1, 8'h43};      // ascii C
                                         3'h1, 3'h0, 5'h0};
                                         3'h1, 8'h44};      // ascii D
                                         3'h1, 3'h0, 5'h0};
                                         3'h1, 8'h45};      // ascii E
                                         3'h1, 3'h0, 5'h0};
                                         3'h1, 3'h0, 5'h0};
            // BRAM0
            mem[22] = {`VMICRO16_OP_MOVI, 3'h0, 8'hC0};
            mem[23] = {`VMICRO16_OP_MOVI, 3'h1, 8'hA};
            mem[24] = {`VMICRO16_OP_SW,   3'h1, 3'h0, 5'h5};
            mem[25] = {`VMICRO16_OP_LW,   3'h2, 3'h0, 5'h5};
                                          3'h0, 8'h91};
                                          3'h1, 8'h12};
                                          3'h1, 3'h0, 5'h0};
                                          3'h2, 3'h0, 5'h0};
            // GPIO2
            mem[30] = {`VMICRO16_OP_MOVI, 3'h0, 8'h92};
            mem[31] = {`VMICRO16_OP_MOVI, 3'h1, 8'h56};
            mem[32] = {`VMICRO16_OP_SW,   3'h1, 3'h0, 5'h0};
            `endif

            //`define TEST_BRAM
            `ifdef TEST_BRAM
            // 2 core BRAM0 test
            mem[0] = {`VMICRO16_OP_MOVI, 3'h0, 8'hC0};
            mem[1] = {`VMICRO16_OP_MOVI, 3'h1, 8'hA};
            mem[2] = {`VMICRO16_OP_SW,   3'h1, 3'h0, 5'h5};
            mem[3] = {`VMICRO16_OP_LW,   3'h2, 3'h0, 5'h5};
            `endif
        end
    end

    always @(posedge clk) begin
        // synchronous WRITE_FIRST (page 13)
        mem_out <= mem[mem_addr];
    end

    // TODO: Reset impl = every clock while reset is asserted, clear each cell
    //       one at a time, mem[i++] <= 0
endmodule

module vmicro16_core_mmu # (
    parameter MEM_WIDTH    = 16,
    parameter MEM_DEPTH    = 64,

    parameter CORE_ID      = 3'h0,
    parameter CORE_ID_BITS = `clog2(`CORES)
) (
    input clk,
    input reset,

    input  req,
    output busy,

    // From core
    input      [MEM_WIDTH-1:0]  mmu_addr,
    input      [MEM_WIDTH-1:0]  mmu_in,
```
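
The TODO at the end of `vmicro16_bram` proposes clearing one memory cell per clock while `reset` is asserted. The following is a minimal sketch of that idea, not the implementation used in this project. The module name `bram_reset_sketch`, the `ADDR_BITS` parameter, and the `rst_ptr` counter are hypothetical, `MEM_DEPTH` is assumed to be a power of two, and whether a given tool still infers block RAM from this form has not been verified.

```verilog
// Hypothetical sketch: sequential clear-on-reset for a synchronous RAM.
module bram_reset_sketch # (
    parameter MEM_WIDTH = 16,
    parameter MEM_DEPTH = 64,
    parameter ADDR_BITS = 6            // assumed: log2(MEM_DEPTH)
) (
    input                      clk,
    input                      reset,
    input      [ADDR_BITS-1:0] mem_addr,
    input      [MEM_WIDTH-1:0] mem_in,
    input                      mem_we,
    output reg [MEM_WIDTH-1:0] mem_out
);
    reg [MEM_WIDTH-1:0] mem [0:MEM_DEPTH-1];
    reg [ADDR_BITS-1:0] rst_ptr = 0;   // hypothetical clear pointer

    always @(posedge clk) begin
        if (reset) begin
            // Clear one cell per clock. The pointer wraps, so holding
            // reset for MEM_DEPTH cycles clears every cell.
            mem[rst_ptr] <= {MEM_WIDTH{1'b0}};
            rst_ptr      <= rst_ptr + 1'b1;
        end else begin
            rst_ptr <= 0;
            if (mem_we) begin
                // write-first: the written value also appears on the output
                mem[mem_addr] <= mem_in;
                mem_out       <= mem_in;
            end else begin
                mem_out <= mem[mem_addr];
            end
        end
    end
endmodule
```

With this approach the reset pulse must last at least `MEM_DEPTH` clock cycles for the whole memory to be cleared; a shorter pulse clears only the first cells.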

```
    input                       mmu_we,
    input                       mmu_lwex,
    input                       mmu_swex,
    output reg [MEM_WIDTH-1:0]  mmu_out,

    // interrupts
    output reg [`DATA_WIDTH*`DEF_NUM_INT-1:0] ints_vector,
    output reg [`DEF_NUM_INT-1:0]             ints_mask,

    // TO APB interconnect
    output reg [`APB_WIDTH-1:0] M_PADDR,
    output reg                  M_PWRITE,
    output reg                  M_PSELx,
    output reg                  M_PENABLE,
    output reg [MEM_WIDTH-1:0]  M_PWDATA,
    // from interconnect
    input      [MEM_WIDTH-1:0]  M_PRDATA,
    input                       M_PREADY
);
    localparam MMU_STATE_T1 = 0;
    localparam MMU_STATE_T2 = 1;
    localparam MMU_STATE_T3 = 2;
    reg [1:0] mmu_state     = MMU_STATE_T1;

    reg  [MEM_WIDTH-1:0] per_out = 0;
    wire [MEM_WIDTH-1:0] tim0_out;

    assign busy = req || (mmu_state == MMU_STATE_T2);

    // more luts than below but easier
    //wire tim0_en = (mmu_addr >= `DEF_MMU_TIM0_S)
    //            && (mmu_addr <= `DEF_MMU_TIM0_E);
    //wire sreg_en = (mmu_addr >= `DEF_MMU_SREG_S)
    //            && (mmu_addr <= `DEF_MMU_SREG_E);
    //wire intv_en = (mmu_addr >= `DEF_MMU_INTSV_S)
    //            && (mmu_addr <= `DEF_MMU_INTSV_E);
    //wire intm_en = (mmu_addr >= `DEF_MMU_INTSM_S)
    //            && (mmu_addr <= `DEF_MMU_INTSM_E);

    wire tim0_en = ~mmu_addr[12] && ~mmu_addr[9] && ~mmu_addr[7];
    wire intv_en = mmu_addr[8] && mmu_addr[9] && mmu_addr[5];
    wire intm_en = mmu_addr[8] && mmu_addr[3];

    wire apb_en  = !(|{tim0_en, sreg_en, intv_en, intm_en});
    wire tim0_we
    wire intv_we
    wire intm_we

    // Special register selects
    localparam SPECIAL_REGS = 8;
    wire [MEM_WIDTH-1:0] sr_val;

    // Interrupt vector and mask
    initial ints_vector = 0;
    initial ints_mask   = 0;
    wire [2:0] intv_addr = mmu_addr[`clog2(`DEF_NUM_INT)-1:0];

    always @(posedge clk)
        if (intv_we)
            ints_vector[intv_addr*`DATA_WIDTH +: `DATA_WIDTH] <= mmu_in;

    always @(posedge clk)
        if (intm_we)
            ints_mask <= mmu_in;

    always @(ints_vector)
        $display($time,
            "\tC%d\t\tints_vector W: | %h %h %h %h %h %h %h %h",
            CORE_ID,
            ints_vector[0*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[1*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[2*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[3*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[4*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[5*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[6*`DATA_WIDTH +: `DATA_WIDTH],
            ints_vector[7*`DATA_WIDTH +: `DATA_WIDTH]);

    always @(intm_we)
        $display($time, "\tC%d\t\tintm_we W: %b", CORE_ID, ints_mask);

    // Output port
    always @(*)
        if      (tim0_en) mmu_out = tim0_out;
        else if (sreg_en) mmu_out = sr_val;
        else if (intv_en) mmu_out = ints_vector[mmu_addr[2:0]*`DATA_WIDTH
                                                +: `DATA_WIDTH];
```

```

    // APB master to slave interface
    always @(posedge clk)
        if (reset) begin
            mmu_state <= MMU_STATE_T1;
            M_PENABLE <= 0;
            M_PADDR   <= 0;
            M_PWDATA  <= 0;
            M_PSELx   <= 0;
            M_PWRITE  <= 0;
        end
        else
            casex (mmu_state)
                MMU_STATE_T1: begin
                    if (req && apb_en) begin
                        M_PADDR   <= {mmu_lwex,
                                      mmu_swex,
                                      CORE_ID[CORE_ID_BITS-1:0],
                                      mmu_addr[MEM_WIDTH-1:0]};

                        M_PWDATA  <= mmu_in;
                        M_PSELx   <= 1;
                        M_PWRITE  <= mmu_we;

                        mmu_state <= MMU_STATE_T2;
                    end
                end

                `ifdef FIX_T3
                    MMU_STATE_T2: begin
                        M_PENABLE <= 1;

                        if (M_PREADY == 1'b1) begin
                            mmu_state <= MMU_STATE_T3;
                        end
                    end

                    MMU_STATE_T3: begin
                        // Slave has output a ready signal (finished)
                        M_PENABLE <= 0;
                        M_PADDR   <= 0;
                        M_PWDATA  <= 0;
                        M_PSELx   <= 0;
                        M_PWRITE  <= 0;
                        // Clock the peripheral output into a reg,
                        // to output on the next clock cycle
                        per_out   <= M_PRDATA;

                        mmu_state <= MMU_STATE_T1;
                    end
                `else
                    // No FIX_T3
                    MMU_STATE_T2: begin
                        if (M_PREADY == 1'b1) begin
                            M_PENABLE <= 0;
                            M_PADDR   <= 0;
                            M_PWDATA  <= 0;
                            M_PSELx   <= 0;
                            M_PWRITE  <= 0;
                            // Clock the peripheral output into a reg,
                            // to output on the next clock cycle
                            per_out   <= M_PRDATA;

                            mmu_state <= MMU_STATE_T1;
                        end else begin
                            M_PENABLE <= 1;
                        end
                    end
                `endif
            endcase

    (* ram_style = "block" *)
    vmicro16_bram # (
        .MEM_WIDTH         (MEM_WIDTH),
        .MEM_DEPTH         (SPECIAL_REGS),
        .USE_INITS         (0),
        .PARAM_DEFAULTS_R0 (CORE_ID),
        .PARAM_DEFAULTS_R1 (`CORES),
        .PARAM_DEFAULTS_R2 (`APB_BRAM0_CELLS),
        .PARAM_DEFAULTS_R3 (`SLAVES),
        .NAME              ("ram_sr")
    ) ram_sr (
        .clk      (clk),
        .reset    (reset),
        .mem_addr (mmu_addr[`clog2(SPECIAL_REGS)-1:0]),
        .mem_in   (),
        .mem_we   (1'b0),
        .mem_out  (sr_val)
```
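
In state `MMU_STATE_T1` above, the MMU packs the exclusive-access flags and the core ID into the upper bits of `M_PADDR`, above the 16-bit address. The sketch below shows how a receiver could split those fields apart again. It is illustrative only: the module name `paddr_unpack_sketch` and its port names are hypothetical, and it assumes the bus width equals `2 + CORE_ID_BITS + MEM_WIDTH`, which the listing does not confirm (the real width comes from the `` `APB_WIDTH `` macro defined elsewhere).

```verilog
// Hypothetical sketch: unpack the fields that MMU_STATE_T1 packs into M_PADDR.
module paddr_unpack_sketch # (
    parameter MEM_WIDTH    = 16,
    parameter CORE_ID_BITS = 2            // assumed value, e.g. a 4-core build
) (
    input  [2 + CORE_ID_BITS + MEM_WIDTH - 1:0] paddr,
    output                                      lwex,    // load-exclusive flag
    output                                      swex,    // store-exclusive flag
    output [CORE_ID_BITS-1:0]                   core_id, // requesting core
    output [MEM_WIDTH-1:0]                      addr     // peripheral address
);
    // Same field order as the concatenation in the master:
    // {mmu_lwex, mmu_swex, CORE_ID, mmu_addr}
    assign {lwex, swex, core_id, addr} = paddr;
endmodule
```

Keeping the field order identical on both sides means a single concatenation on each end is enough, and a peripheral that ignores exclusivity can simply use the low `MEM_WIDTH` bits.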

```
    );

    // Each M core has a TIM0 scratch memory
    (* ram_style = "block" *)
    vmicro16_bram # (
        .MEM_WIDTH (MEM_WIDTH),
        .MEM_DEPTH (MEM_DEPTH),
        .USE_INITS (0),
        .NAME      ("TIM0")
    ) TIM0 (
        .clk      (clk),
        .reset    (reset),
        .mem_addr (mmu_addr[7:0]),
        .mem_in   (mmu_in),
        .mem_we   (tim0_we),
        .mem_out  (tim0_out)
    );
endmodule

module vmicro16_regs # (
    parameter CELL_WIDTH        = 16,
    parameter CELL_DEPTH,
    parameter CELL_SEL_BITS     = `clog2(CELL_DEPTH),
    parameter CELL_DEFAULTS     = 0,
    parameter DEBUG_NAME,
    parameter CORE_ID           = 0,
    parameter PARAM_DEFAULTS_R0 = 16'h0000,
    parameter PARAM_DEFAULTS_R1 = 16'h0000
) (
    input clk,
    input reset,
    // Dual port register reads
    input  [CELL_SEL_BITS-1:0] rs1, // port 1
    output [CELL_WIDTH-1   :0] rd1,
    //input  [CELL_WIDTH-1 :0] rs2, // port 2
    //output [CELL_WIDTH-1 :0] rd2,
    input                      we,
    input  [CELL_SEL_BITS-1:0] ws1,
    input  [CELL_WIDTH-1:0]    wd
);
    (* ram_style = "distributed" *)
    reg [CELL_WIDTH-1:0] regs [0:CELL_DEPTH-1] /*verilator public_flat*/;

    // Initialise registers with default values
    //   Really only used for special registers used by the soc
    // TODO: How to do this on reset?
    integer i;
    initial
        if (CELL_DEFAULTS)
            $readmemh(CELL_DEFAULTS, regs);
        else begin
            for (i = 0; i < CELL_DEPTH; i = i + 1)
                regs[i] = 0;
            regs[0] = PARAM_DEFAULTS_R0;
            regs[1] = PARAM_DEFAULTS_R1;
        end

    `ifdef ICARUS
        always @(regs)
            $display($time, "\tC%02h\t\t| %h %h %h %h %h %h %h %h", CORE_ID,
                regs[0], regs[1], regs[2], regs[3],
                regs[4], regs[5], regs[6], regs[7]);
    `endif

    always @(posedge clk)
        if (reset) begin
            for (i = 0; i < CELL_DEPTH; i = i + 1)
                regs[i] <= 0;
            regs[0] <= PARAM_DEFAULTS_R0;
            regs[1] <= PARAM_DEFAULTS_R1;
        end
        else if (we) begin
            // Perform the write
            regs[ws1] <= wd;
        end

    // sync writes, async reads
    assign rd1 = regs[rs1];
    //assign rd2 = regs[rs2];
endmodule

          module vmicro16_dec # (
```

```
    parameter INSTR_WIDTH    = 16,
    parameter INSTR_OP_WIDTH = 5,
    parameter INSTR_RS_WIDTH = 3,
    parameter ALU_OP_WIDTH   = 5
) (
    //input clk,   // not used yet (all combinational)
    //input reset, // not used yet (all combinational)

    input  [INSTR_WIDTH-1:0]      instr,

    output [INSTR_OP_WIDTH-1:0]   opcode,
    output [INSTR_RS_WIDTH-1:0]   rd,
    output [INSTR_RS_WIDTH-1:0]   ra,
    output [3:0]                  imm4,
    output [7:0]                  imm8,
    output [11:0]                 imm12,
    output [4:0]                  simm5,

    // This can be freely increased without affecting the isa
    output reg [ALU_OP_WIDTH-1:0] alu_op,

    output reg has_imm4,
    output reg has_imm8,
    output reg has_imm12,
    output reg has_we,
    output reg has_br,
    output reg has_mem,
    output reg has_mem_we,
    output reg has_cmp,

    output halt,
    output intr,

    output reg has_lwex,
    output reg has_swex

    // TODO: Use to identify bad instruction and
    //       raise exceptions
    //,output is_bad
);
    assign opcode = instr[15:11];
    assign rd     = instr[10:8];
    assign ra     = instr[7:5];
    assign imm4   = instr[3:0];
    assign imm8   = instr[7:0];
    assign imm12  = instr[11:0];
    assign simm5  = instr[4:0];

                                  alu_op = `VMICRO16_ALU_NOP;
                                  alu_op = `VMICRO16_ALU_NOP; endcase

        `VMICRO16_OP_LW:          alu_op = `VMICRO16_ALU_LW;
        `VMICRO16_OP_SW:          alu_op = `VMICRO16_ALU_SW;
        `VMICRO16_OP_LWEX:        alu_op = `VMICRO16_ALU_LW;
        `VMICRO16_OP_SWEX:        alu_op = `VMICRO16_ALU_SW;

        `VMICRO16_OP_MOV:         alu_op = `VMICRO16_ALU_MOV;
        `VMICRO16_OP_MOVI:        alu_op = `VMICRO16_ALU_MOVI;

        `VMICRO16_OP_BR:          alu_op = `VMICRO16_ALU_BR;
        `VMICRO16_OP_MULT:        alu_op = `VMICRO16_ALU_MULT;

        `VMICRO16_OP_CMP:         alu_op = `VMICRO16_ALU_CMP;
        `VMICRO16_OP_SETC:        alu_op = `VMICRO16_ALU_SETC;

        `VMICRO16_OP_BIT:         casez (simm5)
            `VMICRO16_OP_BIT_OR:      alu_op = `VMICRO16_ALU_BIT_OR;
            `VMICRO16_OP_BIT_XOR:     alu_op = `VMICRO16_ALU_BIT_XOR;
            `VMICRO16_OP_BIT_AND:     alu_op = `VMICRO16_ALU_BIT_AND;
            `VMICRO16_OP_BIT_NOT:     alu_op = `VMICRO16_ALU_BIT_NOT;
            `VMICRO16_OP_BIT_LSHFT:   alu_op = `VMICRO16_ALU_BIT_LSHFT;
            `VMICRO16_OP_BIT_RSHFT:   alu_op = `VMICRO16_ALU_BIT_RSHFT;
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

        `VMICRO16_OP_ARITH_U:     casez (simm5)
            `VMICRO16_OP_ARITH_UADD:  alu_op = `VMICRO16_ALU_ARITH_UADD;
            `VMICRO16_OP_ARITH_USUB:  alu_op = `VMICRO16_ALU_ARITH_USUB;
            `VMICRO16_OP_ARITH_UADDI: alu_op = `VMICRO16_ALU_ARITH_UADDI;
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

        `VMICRO16_OP_ARITH_S:     casez (simm5)
            `VMICRO16_OP_ARITH_SADD:  alu_op = `VMICRO16_ALU_ARITH_SADD;
            `VMICRO16_OP_ARITH_SSUB:  alu_op = `VMICRO16_ALU_ARITH_SSUB;
            `VMICRO16_OP_ARITH_SSUBI: alu_op = `VMICRO16_ALU_ARITH_SSUBI;
            default:                  alu_op = `VMICRO16_ALU_BAD; endcase

```

```
        default: begin
            alu_op = `VMICRO16_ALU_NOP;
            $display($time, "\tDEC: unknown opcode: %h ... NOPPING", opcode);
        end
    endcase

    // Special opcodes
    //assign nop == ((opcode == `VMICRO16_OP_SPCL) & (~instr[0]));
    assign halt = ((opcode == `VMICRO16_OP_SPCL) & instr[0]);
    assign intr = ((opcode == `VMICRO16_OP_SPCL) & instr[1]);

    always @(*) case (opcode)
        `VMICRO16_OP_LWEX,
        `VMICRO16_OP_LW,
        `VMICRO16_OP_MOV,
        `VMICRO16_OP_MOVI,
        //`VMICRO16_OP_ARITH_U,
        `VMICRO16_OP_ARITH_S,
        `VMICRO16_OP_SETC,
        `VMICRO16_OP_BIT,
        `VMICRO16_OP_MULT: has_we = 1'b1;
        default:           has_we = 1'b0;
    endcase

    // Contains 4-bit immediate
    always @(*)
        if( ((opcode == `VMICRO16_OP_ARITH_U) && (simm5[4] == 0)) ||
            ((opcode == `VMICRO16_OP_ARITH_S) && (simm5[4] == 0)) )
            has_imm4 = 1'b1;
        else
            has_imm4 = 1'b0;

                             has_imm8 = 1'b1;
        default:             has_imm8 = 1'b0;
    endcase

    //// Contains 12-bit immediate
    //always @(*) case (opcode)
    //    `VMICRO16_OP_MOVI_L: has_imm12 = 1'b1;
    //    default:             has_imm12 = 1'b0;
    //endcase

    // Will branch the pc
    always @(*) case (opcode)
        `VMICRO16_OP_BR:   has_br = 1'b1;
        default:           has_br = 1'b0;
    endcase

    // Requires external memory
    always @(*) case (opcode)
        `VMICRO16_OP_LW,
        `VMICRO16_OP_LWEX,
        `VMICRO16_OP_SWEX: has_mem = 1'b1;
        default:           has_mem = 1'b0;
    endcase

        default:           has_mem_we = 1'b0;
    endcase

    endcase

    always @(*) case (opcode)
        `VMICRO16_OP_SWEX: has_swex = 1'b1;
        default:           has_swex = 1'b0;
    endcase
endmodule

```
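
The decoder above fixes the instruction layout: a 5-bit opcode in `instr[15:11]`, `rd` in `instr[10:8]`, then either an 8-bit immediate in `instr[7:0]` or `ra` in `instr[7:5]` followed by a 5-bit immediate in `instr[4:0]`. The test programs in `vmicro16_bram` build instructions by concatenating exactly these fields. The sketch below wraps the two concatenations in helper functions. The module and function names (`instr_encode_sketch`, `enc_ri8`, `enc_rr5`) are hypothetical and are not part of the project sources.

```verilog
// Hypothetical sketch: helpers that mirror the decoder's field positions.
module instr_encode_sketch;

    // {opcode[4:0], rd[2:0], imm8[7:0]}, e.g. the MOVI-style entries
    function [15:0] enc_ri8(input [4:0] op,
                            input [2:0] rd,
                            input [7:0] imm8);
        enc_ri8 = {op, rd, imm8};
    endfunction

    // {opcode[4:0], rd[2:0], ra[2:0], simm5[4:0]}, e.g. the LW/SW-style entries
    function [15:0] enc_rr5(input [4:0] op,
                            input [2:0] rd,
                            input [2:0] ra,
                            input [4:0] simm5);
        enc_rr5 = {op, rd, ra, simm5};
    endfunction

endmodule
```

Because both helpers return 16 bits, an entry such as ``mem[0] = enc_ri8(`VMICRO16_OP_MOVI, 3'h0, 8'h90);`` would be equivalent to the concatenation form used in the listing, provided the functions are visible in the calling module.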

```
module vmicro16_alu # (
    parameter OP_WIDTH   = 5,
    parameter DATA_WIDTH = 16,
    parameter CORE_ID
) (
    // input clk, // TODO: make clocked

    input      [OP_WIDTH-1:0]   op,
    input      [DATA_WIDTH-1:0] a, // rs1/dst
    input      [DATA_WIDTH-1:0] b, // rs2
    input      [3:0]            flags,
    output reg [DATA_WIDTH-1:0] c
);
    localparam TOP_BIT = (DATA_WIDTH-1);
    // 17-bit register
    reg [DATA_WIDTH:0] cmp_tmp = 0; // = {carry, [15:0]}

    always @(*) begin
        cmp_tmp = 0;
        case (op)
        // branch/nop, output nothing
        `VMICRO16_ALU_BR,
        `VMICRO16_ALU_NOP:         c = {DATA_WIDTH{1'b0}};

        // load/store addresses (use value in rd2)
        `VMICRO16_ALU_LW,
        `VMICRO16_ALU_SW:          c = b;

        // bitwise operations
        `VMICRO16_ALU_BIT_OR:      c = a | b;
        `VMICRO16_ALU_BIT_XOR:     c = a ^ b;
        `VMICRO16_ALU_BIT_AND:     c = a & b;
        `VMICRO16_ALU_BIT_NOT:     c = ~(b);
        `VMICRO16_ALU_BIT_LSHFT:   c = a << b;
        `VMICRO16_ALU_BIT_RSHFT:   c = a >> b;

        `VMICRO16_ALU_MOV:         c = b;
        `VMICRO16_ALU_MOVI:        c = b;
        `VMICRO16_ALU_MOVI_L:      c = b;

        `VMICRO16_ALU_ARITH_UADD:  c = a + b;
        `VMICRO16_ALU_ARITH_USUB:  c = a - b;
        // TODO: ALU should have simm5 as input
        `VMICRO16_ALU_ARITH_UADDI: c = a + b;

        `ifdef DEF_ALU_HW_MULT
            `VMICRO16_ALU_MULT:    c = a * b;
        `endif

        // TODO: ALU should have simm5 as input
        `VMICRO16_ALU_ARITH_SSUBI: c = $signed(a) - $signed(b);

        `VMICRO16_ALU_CMP: begin
            // TODO: Do a-b in 17-bit register
            // Set zero, overflow, carry, signed bits in result
            cmp_tmp = a - b;

            // N   Negative condition code flag
            // Z   Zero condition code flag
            // C   Carry condition code flag
            // V   Overflow condition code flag
            c[`VMICRO16_SFLAG_N] = cmp_tmp[TOP_BIT];
            c[`VMICRO16_SFLAG_Z] = (cmp_tmp == 0);
            c[`VMICRO16_SFLAG_C] = 0; //cmp_tmp[TOP_BIT+1]; // not used

            // Overflow flag
            // https://stackoverflow.com/questions/30957188/
            // https://github.com/bendl/prco304/blob/master/prco_core/rtl/prco_alu.v#L50
            case(cmp_tmp[TOP_BIT+1:TOP_BIT])
                2'b01: c[`VMICRO16_SFLAG_V] = 1;
                2'b10: c[`VMICRO16_SFLAG_V] = 0;
            endcase

            $display($time, "\tC%02h: ALU CMP: %h %h = %h = %b", CORE_ID, a, b, cmp_tmp, c[3:0]);
        end

        `VMICRO16_ALU_SETC: c = { {15{1'b0}}, r_setc };

        // TODO: Parameterise
        default: begin
            $display($time, "\tALU: unknown op: %h", op);
            cmp_tmp = 0;
        end
        endcase
    end


    branch setc_check (
        .flags  (flags),
        .cond   (b[7:0]),
        .en     (r_setc)
    );
endmodule

// flags = 4 bit r_cmp_flags register
// cond  = 8 bit VMICRO16_OP_BR_? value. See vmicro16_isa.v
module branch (
    input [3:0] flags,
    input [7:0] cond,
    output reg  en
);
    always @(*)
        // ...
endmodule

module vmicro16_core # (
    parameter DATA_WIDTH        = 16,
    parameter MEM_INSTR_DEPTH   = 64,
    parameter MEM_SCRATCH_DEPTH = 64,
    parameter MEM_WIDTH,
    parameter CORE_ID           = 3'h0
) (
    input        clk,
    input        reset,

    output [7:0] dbug,

    output       halt,

    // interrupt sources
    input  [`DEF_NUM_INT-1:0]             ints,
    input  [`DEF_NUM_INT*`DATA_WIDTH-1:0] ints_data,
    output [`DEF_NUM_INT-1:0]             ints_ack,

    // APB master to slave interface (apb_intercon)
    output [`APB_WIDTH-1:0]     w_PADDR,
    output                      w_PWRITE,
    output                      w_PSELx,
    output                      w_PENABLE,
    output [DATA_WIDTH-1:0]     w_PWDATA,
    input  [DATA_WIDTH-1:0]     w_PRDATA,
    input                       w_PREADY

`ifndef DEF_CORE_HAS_INSTR_MEM
    ,
    output reg [`APB_WIDTH-1:0] w2_PADDR,
    output reg                  w2_PWRITE,
    output reg                  w2_PSELx,
    output reg                  w2_PENABLE,
    output reg [DATA_WIDTH-1:0] w2_PWDATA,
    input      [DATA_WIDTH-1:0] w2_PRDATA,
    input                       w2_PREADY
`endif
);
    localparam STATE_IF   = 0;
    localparam STATE_R1   = 1;
    localparam STATE_R2   = 2;
    localparam STATE_ME   = 3;
    localparam STATE_WB   = 4;
    localparam STATE_FE   = 5;
    localparam STATE_IDLE = 6;
    localparam STATE_HALT = 7;

    reg  [2:0]            r_state    = STATE_IF;

    reg  [DATA_WIDTH-1:0] r_pc       = 16'h0000;
    reg  [DATA_WIDTH-1:0] r_pc_saved = 16'h0000;
    reg  [DATA_WIDTH-1:0] r_instr    = 16'h0000;
    wire [DATA_WIDTH-1:0] w_mem_instr_out;
    wire                  w_halt;

    assign dbug = {7'h00, w_halt};
    assign halt = w_halt;

    wire [4:0]            r_instr_opcode;
    wire [4:0]            r_instr_alu_op;
    wire [2:0]            r_instr_rsd;
    wire [2:0]            r_instr_rsa;
    reg  [DATA_WIDTH-1:0] r_instr_rdd = 0;
    reg  [DATA_WIDTH-1:0] r_instr_rda = 0;
    wire [3:0]            r_instr_imm4;
    wire [7:0]            r_instr_imm8;
    wire [4:0]            r_instr_simm5;
    wire                  r_instr_has_imm4;
    wire                  r_instr_has_imm8;
    wire                  r_instr_has_we;
    wire                  r_instr_has_br;
    wire                  r_instr_has_cmp;
    wire                  r_instr_has_mem;
    wire                  r_instr_has_mem_we;
    wire                  r_instr_halt;
    wire                  r_instr_has_lwex;
    wire                  r_instr_has_swex;

    wire [DATA_WIDTH-1:0] r_alu_out;

    wire [DATA_WIDTH-1:0] r_mem_scratch_addr = $signed(r_alu_out) + $signed(r_instr_simm5);
    wire [DATA_WIDTH-1:0] r_mem_scratch_in   = r_instr_rdd;
    wire [DATA_WIDTH-1:0] r_mem_scratch_out;
    wire                  r_mem_scratch_we   = r_instr_has_mem_we && (r_state == STATE_ME);
    reg                   r_mem_scratch_req  = 0;
    wire                  r_mem_scratch_busy;

    reg  [2:0]            r_reg_rs1 = 0;
    wire [DATA_WIDTH-1:0] r_reg_rd1_s;
    wire [DATA_WIDTH-1:0] r_reg_rd1_i;
    wire [DATA_WIDTH-1:0] r_reg_rd1 = regs_use_int ? r_reg_rd1_i : r_reg_rd1_s;
    //wire [15:0]         r_reg_rd2;
    wire [DATA_WIDTH-1:0] r_reg_wd  = (r_instr_has_mem) ? r_mem_scratch_out : r_alu_out;
    wire                  r_reg_we  = r_instr_has_we && (r_state == STATE_WB);

    // branching
    wire                  w_intr;
    wire                  w_branch_en;
    wire                  w_branching = r_instr_has_br && w_branch_en;
    reg  [3:0]            r_cmp_flags = 4'h00; // N, Z, C, V

    // 2 cycle register fetch
    always @(*) begin
        r_reg_rs1 = 0;
        if (r_state == STATE_R1)
            r_reg_rs1 = r_instr_rsd;
        else if (r_state == STATE_R2)
            r_reg_rs1 = r_instr_rsa;
        else
            r_reg_rs1 = 3'h0;
    end

    reg regs_use_int = 0;
    `ifdef DEF_ENABLE_INT
    wire [`DEF_NUM_INT*`DATA_WIDTH-1:0] ints_vector;
    wire [`DEF_NUM_INT-1:0]             ints_mask;
    wire                                has_int = ints & ints_mask;
    reg int_pending     = 0;
    reg int_pending_ack = 0;
    always @(posedge clk)
        if (int_pending_ack)
            // We've now branched to the isr
            int_pending <= 0;
        else if (has_int)
            // Notify fsm to switch to the ints_vector at the last stage
            int_pending <= 1;
        else if (w_intr)
            // Return to Interrupt instruction called,
            // so we've finished with the interrupt
            int_pending <= 0;
    `endif

    // Next program counter logic
    reg [`DATA_WIDTH-1:0] next_pc = 0;

    always @(posedge clk)
        // ...
                        ints_vector[0 +: `DATA_WIDTH]);
                    // TODO: check bounds
                    // Save state
                    r_pc_saved      <= r_pc + 1;
                    regs_use_int    <= 1;
                    int_pending_ack <= 1;
                    // Jump to ISR
                    r_pc            <= ints_vector[0 +: `DATA_WIDTH];
                end else if (w_intr) begin
                    $display($time, "\tC%02h: Returning from ISR: %h",
                        CORE_ID, r_pc_saved);
                    // ...
                    int_pending_ack <= 0;
                end else
                `endif
                if (w_branching) begin
                    $display($time, "\tC%02h: branching to %h", CORE_ID, r_instr_rdd);
                    r_pc <= r_instr_rdd;

                    `ifdef DEF_ENABLE_INT
                        int_pending_ack <= 0;
                    `endif
                end else if (r_pc < (MEM_INSTR_DEPTH-1)) begin
                    // normal increment
                    // pc <= pc + 1
                    r_pc <= r_pc + 1;

                    `ifdef DEF_ENABLE_INT
                        int_pending_ack <= 0;
                    `endif
                end
            end // end r_state == STATE_WB
            else if (r_state == STATE_HALT) begin
                `ifdef DEF_ENABLE_INT
                // ...
                    r_pc_saved      <= r_pc;// + 1; HALT = stay with same PC
                    regs_use_int    <= 1;
                    int_pending_ack <= 1;
                    // Jump to ISR
                    r_pc            <= ints_vector[0 +: `DATA_WIDTH];
                end
                `endif
            end

    `ifndef DEF_CORE_HAS_INSTR_MEM
    initial w2_PSELx   = 0;
    initial w2_PENABLE = 0;
    initial w2_PADDR   = 0;
    `endif

    // cpu state machine
    always @(posedge clk)
        if (reset) begin
            r_state           <= STATE_IF;
            r_instr           <= 0;
            r_mem_scratch_req <= 0;
            r_instr_rdd       <= 0;
            r_instr_rda       <= 0;
        end
        else begin
    `ifdef DEF_CORE_HAS_INSTR_MEM
            if (r_state == STATE_IF) begin
                r_instr <= w_mem_instr_out;

                $display("");
                $display($time, "\tC%02h: PC: %h", CORE_ID, r_pc);
                $display($time, "\tC%02h: INSTR: %h", CORE_ID, w_mem_instr_out);

                r_state <= STATE_R1;
            end
    `else
            // wait for global instruction rom to give us our instruction
            if (r_state == STATE_IF) begin
                // wait for ready signal
                if (!w2_PREADY) begin
                    w2_PSELx   <= 1;
                    w2_PWRITE  <= 0;
                    w2_PENABLE <= 1;
                    w2_PWDATA  <= 0;
                    w2_PADDR   <= r_pc;
                end else begin
                    w2_PSELx   <= 0;
                    w2_PWRITE  <= 0;
                    w2_PENABLE <= 0;
                    w2_PWDATA  <= 0;

                    r_instr <= w2_PRDATA;

                    $display("");
                    $display($time, "\tC%02h: PC: %h", CORE_ID, r_pc);
                    $display($time, "\tC%02h: INSTR: %h", CORE_ID, w2_PRDATA);

                    r_state <= STATE_R1;
                end
            end
    `endif
            else if (r_state == STATE_R1) begin
                if (w_halt) begin
                    $display("");
                    $display("");
                    $display($time, "\tC%02h: PC: %h HALT", CORE_ID, r_pc);
                    r_state <= STATE_HALT;
                end else begin
                    // primary operand
                    r_instr_rdd <= r_reg_rd1;
                    r_state     <= STATE_R2;
                end
            end
            else if (r_state == STATE_R2) begin
                // ...
                if (r_instr_has_mem) begin
                    r_state <= STATE_ME;
                    // Pulse req
                    r_mem_scratch_req <= 1;
                end else
                    r_state <= STATE_WB;
            end
            else if (r_state == STATE_ME) begin
                // Pulse req
                r_mem_scratch_req <= 0;
                // Wait for MMU to finish
                if (!r_mem_scratch_busy)
                    r_state <= STATE_WB;
            end
            else if (r_state == STATE_WB) begin
                // ...
                    r_cmp_flags <= r_alu_out[3:0];

                r_state <= STATE_FE;
            end
            else if (r_state == STATE_FE)
                r_state <= STATE_IF;
            else if (r_state == STATE_HALT) begin
                `ifdef DEF_ENABLE_INT
                    if (int_pending) begin
                        r_state <= STATE_FE;
                    end
                `endif
            end
        end

    `ifdef DEF_CORE_HAS_INSTR_MEM
    // Instruction ROM
    (* rom_style = "distributed" *)
    vmicro16_bram # (
        .MEM_WIDTH  (DATA_WIDTH),
        .MEM_DEPTH  (MEM_INSTR_DEPTH),
        .CORE_ID    (CORE_ID),
        .USE_INITS  (1),
        .NAME       ("INSTR_MEM")
    ) mem_instr (
        .clk        (clk),
        .reset      (reset),
        // port 1
        .mem_addr   (r_pc),
        .mem_in     (0),
        .mem_we     (1'b0), // ROM
        .mem_out    (w_mem_instr_out)
    );
    `endif

    // MMU
    vmicro16_core_mmu # (
        .MEM_WIDTH   (DATA_WIDTH),
        .MEM_DEPTH   (MEM_SCRATCH_DEPTH),
        .CORE_ID     (CORE_ID)
    ) mmu (
        .clk         (clk),
        .reset       (reset),
        .req         (r_mem_scratch_req),
        .busy        (r_mem_scratch_busy),
        // interrupts
        .ints_vector (ints_vector),
        .ints_mask   (ints_mask),
        // port 1
        .mmu_addr    (r_mem_scratch_addr),
        .mmu_in      (r_mem_scratch_in),
        .mmu_we      (r_mem_scratch_we),
        .mmu_lwex    (r_instr_has_lwex),
        .mmu_swex    (r_instr_has_swex),
        .mmu_out     (r_mem_scratch_out),
        // APB master to slave
        .M_PADDR     (w_PADDR),
        .M_PWRITE    (w_PWRITE),
        .M_PSELx     (w_PSELx),
        .M_PENABLE   (w_PENABLE),
        .M_PWDATA    (w_PWDATA),
        .M_PRDATA    (w_PRDATA),
        .M_PREADY    (w_PREADY)
    );

    // Instruction decoder
    vmicro16_dec dec (
        // input
        .instr       (r_instr),
        // output async
        .opcode      (),
        .rd          (r_instr_rsd),
        .ra          (r_instr_rsa),
        .imm4        (r_instr_imm4),
        .imm8        (r_instr_imm8),
        .imm12       (),
        .simm5       (r_instr_simm5),
        .alu_op      (r_instr_alu_op),
        .has_imm4    (r_instr_has_imm4),
        .has_imm8    (r_instr_has_imm8),
        .has_we      (r_instr_has_we),
        .has_br      (r_instr_has_br),
        .has_cmp     (r_instr_has_cmp),
        .has_mem     (r_instr_has_mem),
        .has_mem_we  (r_instr_has_mem_we),
        .halt        (w_halt),
        .intr        (w_intr),
        .has_lwex    (r_instr_has_lwex),
        .has_swex    (r_instr_has_swex)
    );

    // Software registers
    vmicro16_regs # (
        .CORE_ID     (CORE_ID),
        .CELL_WIDTH  (`DATA_WIDTH)
    ) regs (
        .clk         (clk),
        .reset       (reset),
        // async port 0
        .rs1         (r_reg_rs1),
        .rd1         (r_reg_rd1_s),
        // async port 1
        //.rs2       (),
        //.rd2       (),
        // write port
        .we          (r_reg_we && ~regs_use_int),
        .ws1         (r_instr_rsd),
        .wd          (r_reg_wd)
    );

    // Interrupt replacement registers
    `ifdef DEF_ENABLE_INT
    vmicro16_regs # (
        .CORE_ID     (CORE_ID),
        .CELL_WIDTH  (`DATA_WIDTH),
        .DEBUG_NAME  ("REGSINT")
    ) regs_intr (
        .clk         (clk),
        .reset       (reset),
        // async port 0
        .rs1         (r_reg_rs1),
        .rd1         (r_reg_rd1_i),
        // async port 1
        //.rs2       (),
        //.rd2       (),
        // write port
        .we          (r_reg_we && regs_use_int),
        .ws1         (r_instr_rsd),
        .wd          (r_reg_wd)
    );
    `endif

    vmicro16_alu # (
        .CORE_ID(CORE_ID)
    ) alu (
        .op          (r_instr_alu_op),
        .a           (r_instr_rdd),
        .b           (r_instr_rda),
        .flags       (r_cmp_flags),
        // async output
        .c           (r_alu_out)
    );

    branch branch_check (
        .flags       (r_cmp_flags),
        .cond        (r_instr_imm8),
        .en          (w_branch_en)
    );
endmodule
```

## **E.2** Peripheral Code Listing

This listing contains various memory-mapped APB peripherals, such as GPIO, UART, timers, and memory.

```
// Vmicro16 peripheral modules

`include "vmicro16_soc_config.v"
`include "formal.v"

// Simple watchdog peripheral
module vmicro16_watchdog_apb # (
    parameter BUS_WIDTH = 16,
    parameter NAME      = "WD",
    parameter CLK_HZ    = 50_000_000
) (
    input clk,
    input reset,

    // APB Slave to master interface
    input  [0:0] S_PADDR, // not used (optimised out)
    input        S_PWRITE,
    input        S_PSELx,
    input        S_PENABLE,
    input  [0:0] S_PWDATA,

    // prdata not used
    output [0:0] S_PRDATA,
    output       S_PREADY,

    // watchdog reset, active high
    output reg   wdreset
);
    //assign S_PRDATA = (S_PSELx & S_PENABLE) ? gpio : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1 : 1'b0;
    wire   we       = (S_PSELx & S_PENABLE & S_PWRITE);

    // countdown timer
    reg [`clog2(CLK_HZ)-1:0] timer = CLK_HZ;

    wire w_wdreset = (timer == 0);

    // infer a register to aid timing
    initial wdreset = 0;
    always @(posedge clk)
        wdreset <= w_wdreset;

    always @(posedge clk)
        if (we) begin
            $display($time, "\t%s <= RESET", NAME);
            timer <= CLK_HZ;
        end else begin
            timer <= timer - 1;
        end
endmodule

module timer_apb # (
    parameter CLK_HZ = 50_000_000
) (
    input clk,
    input reset,

    input clk_en,

    // 0 16-bit value     R/W
    // 1 16-bit control   R     b0 = start, b1 = reset
    // 2 16-bit prescaler
    input  [1:0]                 S_PADDR,

    input                        S_PWRITE,
    input                        S_PSELx,
    input                        S_PENABLE,
    input  [`DATA_WIDTH-1:0]     S_PWDATA,

    output reg [`DATA_WIDTH-1:0] S_PRDATA,
    output                       S_PREADY,

    output out,
    output [`DATA_WIDTH-1:0] int_data
);
    // ...

    reg [`DATA_WIDTH-1:0] r_counter = 0;
    reg [`DATA_WIDTH-1:0] r_load    = 0;
    reg [`DATA_WIDTH-1:0] r_pres    = 0;
    reg [`DATA_WIDTH-1:0] r_ctrl    = 0;

    localparam CTRL_START = 0;
    localparam CTRL_RESET = 1;
    localparam CTRL_INT   = 2;

    localparam ADDR_LOAD = 2'b00;
    localparam ADDR_CTRL = 2'b01;
    localparam ADDR_PRES = 2'b10;

    always @(*) begin
        S_PRDATA = 0;
        if (en)
            case(S_PADDR)
                ADDR_LOAD: S_PRDATA = r_counter;
                ADDR_CTRL: S_PRDATA = r_ctrl;
                //ADDR_CTRL: S_PRDATA = r_pres;
                default:   S_PRDATA = 0;
            endcase
    end

    // prescaler counts from r_pres to 0, emitting a stb signal
    // to enable the r_counter step
    reg [`DATA_WIDTH-1:0] r_pres_counter = 0;
    wire counter_en = (r_pres_counter == 0);
    always @(posedge clk)
        if (r_pres_counter == 0)
            r_pres_counter <= r_pres;
        else
            r_pres_counter <= r_pres_counter - 1;

    always @(posedge clk)
        if (we)
            case(S_PADDR)
                // Write to the load register:
                //   Set load register
                //   Set counter register
                ADDR_LOAD: begin
                    r_load <= S_PWDATA;
                    // ...
                end
                ADDR_CTRL: begin
                    r_ctrl <= S_PWDATA;
                    $display($time, "\t\ttimr0: WRITE CTRL: %h", S_PWDATA);
                end
                ADDR_PRES: begin
                    r_pres <= S_PWDATA;
                    $display($time, "\t\ttimr0: WRITE PRES: %h", S_PWDATA);
                end
            endcase
        else
            if (r_ctrl[CTRL_START]) begin
                if (r_counter == 0)
                    r_counter <= r_load;
                else if(counter_en)
                    r_counter <= r_counter -1;
            end else if (r_ctrl[CTRL_RESET])
142
                               r_counter <= r_load;</pre>
143
               // generate the output pulse when r_counter == 0
// out = (counter reached zero &% counter star:
144
               // out = (counter reached zero & counter started)
assign out = (r_counter == 0) && r_ctrl[CTRL_START]; // && r_ctrl[CTRL_INT];
assign int_data = {`DATA_WIDTH{1'b1}};
145
146
147
         endmodule
148
150
// APB wrapped programmable vmicro16_bram
module vmicro16_bram_prog_apb # (
    parameter BUS_WIDTH = 16,
    parameter MEM_WIDTH = 16,
    parameter MEM_DEPTH = 64,
    parameter APB_PADDR = 0,
    parameter USE_INITS = 0,
    parameter NAME      = "BRAMPROG",
    parameter CORE_ID   = 0   // default not legible in the listing; 0 assumed
) (
    input clk,
    input reset,
    // APB Slave to master interface
    // (S_PADDR is missing from the listing; assumed to match vmicro16_bram_apb)
    input  [`clog2(MEM_DEPTH)-1:0]  S_PADDR,
    input                           S_PWRITE,
    input                           S_PSELx,
    input                           S_PENABLE,
    input  [BUS_WIDTH-1:0]          S_PWDATA,

    output [BUS_WIDTH-1:0]          S_PRDATA,
    output                          S_PREADY,

    // interface to program the instruction memory
    input  [`clog2(`DEF_MEM_INSTR_DEPTH)-1:0] addr,
    input  [`DATA_WIDTH-1:0]                  data,
    input                                     we,
    input                                     prog
);
    wire [MEM_WIDTH-1:0] mem_out;

    assign S_PRDATA = (S_PSELx & S_PENABLE) ? mem_out : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1    : 1'b0;
    wire   s_we     = (S_PSELx & S_PENABLE & S_PWRITE);

    // The declarations of mem_addr, mem_data and mem_we are missing from the
    // listing. Assumed reconstruction: while prog is asserted the programming
    // interface drives the memory, otherwise the APB bus does.
    wire [`clog2(MEM_DEPTH)-1:0] mem_addr = prog ? addr : S_PADDR;
    wire [MEM_WIDTH-1:0]         mem_data = prog ? data : S_PWDATA;
    wire                         mem_we   = prog ? we   : s_we;

    vmicro16_bram # (
        .MEM_WIDTH  (MEM_WIDTH),
        .MEM_DEPTH  (MEM_DEPTH),
        .NAME       ("BRAMPROG"),
        .USE_INITS  (0),
        .CORE_ID    (-1)
    ) bram_apb (
        .clk        (clk),
        .reset      (reset),

        .mem_addr   (mem_addr),
        .mem_in     (mem_data),
        .mem_we     (mem_we),
        .mem_out    (mem_out)
    );
endmodule

// APB wrapped vmicro16_bram
module vmicro16_bram_apb # (
    parameter BUS_WIDTH = 16,
    parameter MEM_WIDTH = 16,
    parameter MEM_DEPTH = 64,
    parameter APB_PADDR = 0,
    parameter USE_INITS = 0,
    parameter NAME      = "BRAM",
    parameter CORE_ID   = 0   // default not legible in the listing; 0 assumed
) (
    input clk,
    input reset,
    // APB Slave to master interface
    input  [`clog2(MEM_DEPTH)-1:0]  S_PADDR,
    input                           S_PWRITE,
    input                           S_PSELx,
    input                           S_PENABLE,
    input  [BUS_WIDTH-1:0]          S_PWDATA,

    output [BUS_WIDTH-1:0]          S_PRDATA,
    output                          S_PREADY
);
    wire [MEM_WIDTH-1:0] mem_out;

    assign S_PRDATA = (S_PSELx & S_PENABLE) ? mem_out : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1    : 1'b0;
    assign we       = (S_PSELx & S_PENABLE & S_PWRITE);

    always @(*)
        if (S_PSELx && S_PENABLE)
            $display($time, "\t\t%s => %h", NAME, mem_out);

    always @(posedge clk)
        if (we)
            ; // body (a debug trace) is missing from the listing

    vmicro16_bram # (
        .MEM_WIDTH  (MEM_WIDTH),
        .MEM_DEPTH  (MEM_DEPTH),
        .NAME       (NAME),
        .USE_INITS  (1),
        .CORE_ID    (-1)
    ) bram_apb (
        .clk        (clk),
        .reset      (reset),

        .mem_addr   (S_PADDR),
        .mem_in     (S_PWDATA),
        .mem_we     (we),
        .mem_out    (mem_out)
    );
endmodule

// Shared memory with hardware monitor (LWEX/SWEX)
module vmicro16_bram_ex_apb # (
    parameter BUS_WIDTH    = 16,
    parameter MEM_WIDTH    = 16,
    parameter MEM_DEPTH    = 64,
    parameter CORE_ID_BITS = 3,
    parameter SWEX_SUCCESS = 16'h0000,
    parameter SWEX_FAIL    = 16'h0001
) (
    input clk,
    input reset,

    // APB Slave to master interface
    // (S_PADDR and S_PWRITE are missing from the listing; both are used
    //  below and are declared here as assumed from that usage)
    input  [`APB_WIDTH-1:0]     S_PADDR,
    input                       S_PWRITE,
    input                       S_PSELx,
    input                       S_PENABLE,
    input  [MEM_WIDTH-1:0]      S_PWDATA,

    output reg [MEM_WIDTH-1:0]  S_PRDATA,
    output                      S_PREADY
);
    // exclusive flag checks
    wire [MEM_WIDTH-1:0] mem_out;
    reg                  swex_success = 0;

    localparam ADDR_BITS = `clog2(MEM_DEPTH);

    // hack to create a 1 clock delay to S_PREADY
    // for bram to be ready
    reg cdelay = 1;
    always @(posedge clk)
        if (S_PSELx)
            cdelay <= 0;
        else
            cdelay <= 1;

    //assign S_PRDATA = (S_PSELx & S_PENABLE) ? swex_success ? 16'hF0F0 : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE & (!cdelay)) ? 1'b1 : 1'b0;
    assign we       = (S_PSELx & S_PENABLE & S_PWRITE);
    wire   en       = (S_PSELx & S_PENABLE);

    // Similar to:
    //   http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204f/Cihbghef.html

    // mem_wd is the CORE_ID sent in bits [18:16]
    localparam TOP_BIT_INDEX     = `APB_WIDTH -1;
    localparam PADDR_CORE_ID_MSB = TOP_BIT_INDEX - 2;
    localparam PADDR_CORE_ID_LSB = PADDR_CORE_ID_MSB - (CORE_ID_BITS-1);

    wire                    lwex     = S_PADDR[TOP_BIT_INDEX];
    wire                    swex     = S_PADDR[TOP_BIT_INDEX-1];
    wire [CORE_ID_BITS-1:0] core_id  = S_PADDR[PADDR_CORE_ID_MSB:PADDR_CORE_ID_LSB];
    // CORE_ID to write to ex_flags register
    wire [ADDR_BITS-1:0]    mem_addr = S_PADDR[ADDR_BITS-1:0];

    wire [CORE_ID_BITS:0] ex_flags_read;
    wire is_locked      = |ex_flags_read;
    wire is_locked_self = is_locked && (core_id == (ex_flags_read-1));

    // Check exclusive access flags
    always @(*) begin
        swex_success = 0;
        if (en)
            // bug!
            if (!swex && !lwex)
                swex_success = 1;
            else if (swex)
                if (is_locked && !is_locked_self)
                    // someone else has locked it
                    swex_success = 0;
                else if (is_locked && is_locked_self)
                    swex_success = 1;
    end

    always @(*)
        if (swex)
            if (swex_success)
                S_PRDATA = SWEX_SUCCESS;
            else
                S_PRDATA = SWEX_FAIL;
        else
            S_PRDATA = mem_out;

    wire reg_we = en && ((lwex && !is_locked)
                      || (swex && swex_success));

    reg [CORE_ID_BITS:0] reg_wd;
    always @(*) begin
        reg_wd = {{CORE_ID_BITS}{1'b0}};

        if (en)
            // if wanting to lock the addr
            if (lwex) begin
                // and not already locked
                if (!is_locked) begin
                    reg_wd = (core_id + 1);
                end
            end
            else if (swex)
                if (is_locked && is_locked_self)
                    reg_wd = {{CORE_ID_BITS}{1'b0}};
    end

    // Exclusive flag for each memory cell
    vmicro16_bram # (
        .MEM_WIDTH  (CORE_ID_BITS + 1),
        .MEM_DEPTH  (MEM_DEPTH),
        .USE_INITS  (0),
        .NAME       ("rexram")
    ) ram_exflags (
        .clk        (clk),
        .reset      (reset),

        .mem_addr   (mem_addr),
        .mem_in     (reg_wd),
        .mem_we     (reg_we),
        .mem_out    (ex_flags_read)
    );

    always @(*)
        if (S_PSELx && S_PENABLE)
            $display($time, "\t\tBRAMex[%h] READ %h\tCORE: %h",
                mem_addr, mem_out, S_PADDR[16 +: CORE_ID_BITS]);

    always @(posedge clk)
        if (we)
            ; // body (a debug trace) is missing from the listing

    vmicro16_bram # (
        .MEM_WIDTH  (MEM_WIDTH),
        .MEM_DEPTH  (MEM_DEPTH),
        .USE_INITS  (0),   // value not legible in the listing; 0 assumed
        .NAME       ("BRAMexinst")
    ) bram_apb (
        .clk        (clk),
        .reset      (reset),

        .mem_addr   (mem_addr),
        .mem_in     (S_PWDATA),
        .mem_we     (we && swex_success),
        .mem_out    (mem_out)
    );
endmodule

// Simple APB memory-mapped register set
module vmicro16_regs_apb # (
    parameter BUS_WIDTH         = 16,
    parameter DATA_WIDTH        = 16,
    parameter CELL_DEPTH        = 8,
    parameter PARAM_DEFAULTS_R0 = 0,
    parameter PARAM_DEFAULTS_R1 = 0
) (
    input clk,
    input reset,
    // APB Slave to master interface
    input  [`clog2(CELL_DEPTH)-1:0] S_PADDR,
    input                           S_PWRITE,
    input                           S_PSELx,
    input                           S_PENABLE,
    input  [DATA_WIDTH-1:0]         S_PWDATA,

    output [DATA_WIDTH-1:0]         S_PRDATA,
    output                          S_PREADY
);
    wire [DATA_WIDTH-1:0] rd1;

    // The assignments at this point are missing from the listing. The three
    // lines below are assumed by analogy with vmicro16_bram_apb and from the
    // assertion on reg_we further down.
    assign S_PRDATA = (S_PSELx & S_PENABLE) ? rd1  : 16'h0000;
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1 : 1'b0;
    assign reg_we   = (S_PSELx & S_PENABLE & S_PWRITE);

    always @(*)
        if (reg_we)
            $display($time, "\t\tREGS_APB[%h] <= %h",
                S_PADDR, S_PWDATA);

    always @(*)
        `rassert(reg_we == (S_PSELx & S_PENABLE & S_PWRITE))

    vmicro16_regs # (
        .CELL_DEPTH         (CELL_DEPTH),
        .CELL_WIDTH         (DATA_WIDTH),
        .PARAM_DEFAULTS_R0  (PARAM_DEFAULTS_R0),
        .PARAM_DEFAULTS_R1  (PARAM_DEFAULTS_R1)
    ) regs_apb (
        .clk    (clk),
        .reset  (reset),
        // port 1
        .rs1    (S_PADDR),
        .rd1    (rd1),
        .we     (reg_we),
        .ws1    (S_PADDR),
        .wd     (S_PWDATA)
        // port 2 unconnected
        //.rs2  (),
        //.rd2  ()
    );
endmodule

// Simple GPIO write only peripheral
module vmicro16_gpio_apb # (
    parameter BUS_WIDTH  = 16,
    parameter DATA_WIDTH = 16,
    parameter PORTS      = 8,
    parameter NAME       = "GPIO"
) (
    input clk,
    input reset,
    // APB Slave to master interface
    input                    S_PADDR, // not used (optimised out)
    input                    S_PWRITE, // missing from the listing; assumed as in the other APB slaves
    input                    S_PSELx,
    input                    S_PENABLE,
    input  [DATA_WIDTH-1:0]  S_PWDATA,

    output [DATA_WIDTH-1:0]  S_PRDATA,
    output                   S_PREADY,
    output reg [PORTS-1:0]   gpio
);
    // The assignments that drive S_PRDATA, S_PREADY and ports_we are missing
    // from the listing. S_PREADY and ports_we below are assumed by analogy
    // with the other APB slaves; the S_PRDATA driver is not recoverable.
    assign S_PREADY = (S_PSELx & S_PENABLE) ? 1'b1 : 1'b0;
    assign ports_we = (S_PSELx & S_PENABLE & S_PWRITE);

    always @(posedge clk)
        if (reset)
            gpio <= 0;
        else if (ports_we) begin
            $display($time, "\t\t%s <= %h", NAME, S_PWDATA[PORTS-1:0]);
            gpio <= S_PWDATA[PORTS-1:0];
        end
endmodule
```