# **Multi-core RISC Processor Design & Implementation**

**Demonstration Viva** 

Ben Lancaster

201280376 ELEC5881M - Main Project

July 18, 2019

## **Quick Links**


- GitHub repository: https://github.com/bendl/vmicro16
- Full Report: https://github.com/bendl/vmicro16/blob/master/docs/reports/build/ELEC5881M\_Ben\_Lancaster\_201280376\_Final.pdf


1. Introduction
   - Why Multi-core?
   - Why RISC?
2. Top Level Design
   - Overview
   - Memory Map
   - Interconnect
   - Interrupts
3. Multi-core Functionality
   - HW/SW Requirements
   - Context Identification
   - Atomics
4. Results
   - Results 1
5. Conclusion
   - Accomplishments
   - Future Improvements
   - Q&A


## Why Multi-core?



# Why RISC?



### What this project produces:

- System-on-Chip with multi-processor functionality
   Tested on FPGA hardware with 1-96 CPU cores.
- Custom 16-bit RISC CPU
   With interrupts and its own Instruction Set Architecture (ISA).
- Software/Assembly compiler
   PRCO304 programming language/Intel assembly syntax.
- Aimed at Design Engineers, not end users
   Project is provided as source code/design files for Design Engineers to customise and implement in hardware themselves.

# **Top Level Hierarchy**




# **Memory Map**



- Shared Memory with Global Monitor
- Timer with Interrupt
- Per-core Interrupt Vector and Mask
- Shared Register Set
- UART Transceiver
- Multiple GPIO ports
- Per-core scratch memory
- Per-core Special Registers
- Customisable by designers

## Interconnect


## **Interrupts**




Demo: 2 Core LED toggle (GPIO0) with TIMR0 1s interrupt (interrupts\_2.s)

# **Timer Interrupt Example**

[Waveform screenshot: simulation trace from about 2,500 ns to 5,500 ns, with markers at 2,870 ns and 3,870 ns. Signals shown: `r_pc[15:0]`, `r_instr[15:0]`, `regs[0:7][15:0]`, `ints[7:0]`, `ints_vector[127:0]`, `ints_mask[7:0]`, `w_intr`, `has_int`, `int_pending`, `int_pending_ack`, `regs_use_int`, and the TIMR0 signals `out` and `S_PSELx`.]

Figure: TIMR0 1us interrupt with context switching

# **Timer Peripheral Registers**




Figure: t = 20 ns \* load \* prescaler

Period range (32-bit timer): 20 ns to approx. 85 s.

## Examples:

- For 1us: Load = 0x32, Prescaler = 0 (20ns \* 0x32 = 1000ns)
- For 1s: Load = 0x1000, Prescaler = 0x3000 (demo)
   (20ns \* 0x1000 \* 0x3000 = approx. 1s)
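The arithmetic in these examples can be checked with a short script. This is a minimal sketch rather than project code: the function name `timer_period_ns` is hypothetical, and it assumes that a prescaler value of 0 leaves the clock undivided (a multiplier of 1), which is how the 1us example reads.

```python
CLOCK_PERIOD_NS = 20  # one timer tick, from t = 20 ns * load * prescaler


def timer_period_ns(load: int, prescaler: int) -> int:
    """Return the timer period in nanoseconds for the given register values.

    Assumption: a prescaler of 0 means no prescaling (multiplier of 1).
    """
    multiplier = prescaler if prescaler != 0 else 1
    return CLOCK_PERIOD_NS * load * multiplier


if __name__ == "__main__":
    print(timer_period_ns(0x32, 0))          # 1us example
    print(timer_period_ns(0x1000, 0x3000))   # 1s demo
```

Under that assumption the 1s demo values give about 1.007 s, which matches the "approx. 1s" note.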




## **HW/SW Requirements**

### Hardware:

- Bus Arbitration (scheduling: priority, rotating, etc.)
- Atomic functions
   (atomic versions of load/store to prevent race conditions)
- Per-core instruction memory
- Per-core context-switching for interrupt handling

### Software:

- Semaphores/Mutexes
   (exclusive memory access)
- Thread synchronisation (memory barriers)
- Context identification
   What core am I?
   How many cores?
   How much memory?
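Rotating (round-robin) bus arbitration, one of the scheduling options named in the hardware list, can be illustrated with a small software model. This is a sketch under stated assumptions, not the project's hardware description: the function `rotating_grant` and its arguments are hypothetical names chosen for illustration.

```python
from typing import List, Optional


def rotating_grant(requests: List[bool], last_granted: int) -> Optional[int]:
    """Pick the next core to own the shared bus.

    The search starts at the core after `last_granted` and wraps around,
    so every requesting core is served within len(requests) grants.
    Returns None when no core is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        core = (last_granted + offset) % n
        if requests[core]:
            return core
    return None


if __name__ == "__main__":
    requests = [True, False, True, True]
    print(rotating_grant(requests, 0))  # next requester after core 0
```

A fixed-priority arbiter would instead always scan from core 0, which is simpler but can starve higher-numbered cores.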

## **Context Identification**




Figure: Special Registers 0x0080 to 0x008F

```
entry:
    // get core idx (special register 0x80) in r7
    movi    r7, #0x80
    lw      r7, r7
    // branch away if not core 0
    // (r0 is assumed to still hold 0 at this point)
    cmp     r7, r0
    movi    r0, exit
    br      r0, BR_NE
    // Core 0 only instructions
    nop
    nop
    nop
exit:
    halt
```

## **Atomic Instructions**


- Enables semaphores, mutexes, memory barriers
- Prevents race conditions between threads/cores
- LW[EX] and SW[EX]
- Implementation in next slide

### Example:

```
try_inc:
    // load and lock (if not already locked)
    lwex    r0, r1
    // do something (i.e. add 1 (semaphore))
    addi    r0, #0x01
    // attempt store
    swex    r0, r1
    // check success (== 0)
    cmp     r0, r3
    // if not equal (NE), retry
    movi    r4, try_inc
    br      r4, BR_NE
critical:
    // r0 is latest value
```
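The load-lock / store-check pattern can also be modelled in software. The following is a toy sketch, not the project's implementation: the class `GlobalMonitor`, the helper `atomic_inc`, the failure code 1, and the choice to release the lock on a successful store are all assumptions made for illustration. Only the "0 means success" convention comes from the example.

```python
class GlobalMonitor:
    """Toy model of an exclusive-access monitor guarding shared memory."""

    def __init__(self, size: int) -> None:
        self.mem = [0] * size
        self.owner = {}  # address -> index of the core holding the lock

    def lwex(self, core: int, addr: int) -> int:
        """Load a word and lock the address if no core holds it yet."""
        self.owner.setdefault(addr, core)
        return self.mem[addr]

    def swex(self, core: int, addr: int, value: int) -> int:
        """Store only if `core` holds the lock.

        Returns 0 on success and 1 on failure (the failure code is assumed).
        """
        if self.owner.get(addr) != core:
            return 1
        self.mem[addr] = value
        del self.owner[addr]  # assumed: a successful store releases the lock
        return 0


def atomic_inc(mon: GlobalMonitor, core: int, addr: int) -> int:
    """Retry loop mirroring try_inc: repeat until the store succeeds."""
    while True:
        value = mon.lwex(core, addr) + 1
        if mon.swex(core, addr, value) == 0:
            return value
```

In this model a second core that loads while the address is locked still reads the value, but its store is rejected, so it has to loop back and retry.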

## **Exclusive Access Flow Chart**




## HW - How do I know which core this lwex/swex is from?




The Core Idx is sent with each MMU request to the shared bus.

| Bits 83:63   | Bits 62:42 | Bits 41:21 | Bits 20:0 |
|--------------|------------|------------|-----------|
| Core $N-1$   | ...        | Core 1     | Core 0    |

PADDR\*NUMCORES-1:0 interconnect input.
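Slicing one core's request out of the concatenated interconnect input can be sketched as below. This is an illustrative model only: the 21-bit slot width is inferred from the bit labels in the figure (20, 41, 62, 83), and the names `SLOT_WIDTH`, `pack_requests` and `unpack_request` are hypothetical.

```python
SLOT_WIDTH = 21  # assumption: inferred from the bit labels 20, 41, 62, 83


def pack_requests(slots):
    """Concatenate per-core request words; core 0 occupies the lowest bits."""
    bus = 0
    for core, word in enumerate(slots):
        assert 0 <= word < (1 << SLOT_WIDTH), "word does not fit in one slot"
        bus |= word << (core * SLOT_WIDTH)
    return bus


def unpack_request(bus, core):
    """Extract the request word for `core` from the packed bus value."""
    return (bus >> (core * SLOT_WIDTH)) & ((1 << SLOT_WIDTH) - 1)


if __name__ == "__main__":
    bus = pack_requests([0x1, 0x2, 0x3, 0x1FFFFF])
    print(hex(unpack_request(bus, 3)))
```

With four cores the packed value is 84 bits wide, matching the 83:0 range shown in the figure.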

## **Exclusive Access**

```
mutex_claim:
    // load and lock (if not already locked)
    lwex    r0, r1
    // do something (i.e. add 1 (semaphore))
    addi    r0, #0x01
    // attempt store
    swex    r0, r1
    // check success (== 0)
    cmp     r0, r3
    // if not equal (NE), retry
    movi    r4, mutex_claim
    br      r4, BR_NE
critical:
```



Figure: Hardware implementation

Demo: 8 core number summation (sum.s)


# Multi-core vs Single-Core for Summation


Each core has a low workload.

- Sums a subset of the numbers in a for loop
- Ideal scenario for parallelism
  - Highly parallelisable
  - Few inter-thread dependencies
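The workload split can be sketched as a sequential model in which each core sums its own contiguous share and then adds the partial result to a shared total. How `sum.s` actually divides the numbers is not shown here, so the partitioning scheme and the names `core_slice` and `parallel_sum` are assumptions for illustration.

```python
def core_slice(core_id: int, num_cores: int, count: int) -> range:
    """Indices handled by one core: a contiguous, near-equal share."""
    base, extra = divmod(count, num_cores)
    start = core_id * base + min(core_id, extra)
    stop = start + base + (1 if core_id < extra else 0)
    return range(start, stop)


def parallel_sum(numbers, num_cores: int) -> int:
    """Sequential model of per-core partial sums and the final accumulation."""
    total = 0
    for core_id in range(num_cores):
        indices = core_slice(core_id, num_cores, len(numbers))
        partial = sum(numbers[i] for i in indices)
        # On hardware, this shared update would need an exclusive
        # load/store pair (lwex/swex) to avoid a race between cores.
        total += partial
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101)), 8))
```

The only point where cores interact is the final accumulation, which is why this workload has few inter-thread dependencies.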

Insert graph showing core count vs total time

## **Results 1**


# **Accomplishments**


- Near-complete System-on-Chip design with various peripherals
   Timers, GPIO, UART, Registers, Memory
- Common multi-thread/core synchronisation primitives
   Semaphores, Mutexes, Memory Barriers, Atomic Instructions
- AMBA APB bus interface with Global Monitor
   Timers, GPIO, UART, Registers, Memory
- Working shared bus arbitration
   Schedules access to shared resources
- Working FPGA implementation for a 96-core design
   Nearly fills the Cyclone V FPGA on the DE1-SoC
- Interrupts with hardware context-switching
   Low latency to react to an interrupt
- Acknowledges design limitations and attempts to overcome them
   LUT resources, block memories, power and temperature requirements

## **Future Improvements**


- Working Global Reset
   Global resets are expensive (LUT resources). Resetting block memories is not trivial.
- On-chip Programming
   Use the UART0 receiver to program each core's flash memory.
- Per-core gating/enabling
   Improve power efficiency for ASIC implementation by disabling cores at run-time via software.
- Improve memory bottleneck
   Each core requires its own memory; reduce this by multiplexing access to a single large memory.


Q&A