### 1. IDEA SIMULATOR AND TOY BENCHMARKS

This report describes a behavioral level simulator written for the iDEA processor in order to model and accurately predict the behavior of the instructions' execution. The simulator is written in Python and available as open source. Using the simulator, we can obtain performance metrics such as instruction count and clock cycles, as well as ensuring logical and functional correctness. For the purpose of evaluating performance and comparing to alternatives, we simulate the execution of a number of small benchmark programs. The benchmark programs are written in C and compiled using the mips-gcc toolset to elf32-bigmips assembly code. By analyzing the statistics gathered from the simulations, we can predict potential savings and impact of hardware changes. Here, we present the simulator, compiler and benchmark programs used.

The simulator consists of two main components; an instruction parser and the pipeline simulator itself.



Fig. 1. Simulator data flow

#### 1.1. Instruction Parser

The instruction parser takes as input an elf32-bigmips format assembly file, the standard format produced by mipself-gcc. The benchmark programs are written in C and compiled using mips-elf-gcc with the appropriate flags (-c -g) set to produce the elf32 assembly code.

In the instruction parser, the elf32 file is parsed and converted to an intermediate, simplified assembly format. The resulting assembly code is run through an instruction parser, which also expands pseudo-instructions and performs instruction substitutions where necessary, to fit the iDEA instruction set. As a final preprocessing stage, the instructions are checked for dependencies and NOP instructions are inserted where necessary to avoid data hazards. Because the iDEA architecture does not currently implement hardware stalling or data forwarding, dependencies have to be resolved in software. The NOP insertion algorithm is described by Algorithm 1.

### Algorithm 1 NOP-Insertion Algorithm

```
Input: I, list of instructions
Output: I', I with dependencies solved by NOP insertion
N \leftarrow nPipelineStages - nIFStages
for i from 0 to len(I) do
    d \leftarrow 0
    k \leftarrow 1
    while d < N and i - k \ge 0 do
        if I[i] is a jump or branch then
            L[i] \leftarrow N
        else if I[i], I[i-k] are overlapping stores then
            L[i] \leftarrow N - d
        else if I[i] depends on I[i-k] then
            L[i] \leftarrow N - d
        d \leftarrow d + L[i] + L[i-k] + 1
        k \leftarrow k + 1
    end while
end for
for i from 0 to len(L) do
    Insert L[i] instructions before I[i]
end for
```

A dependency is detected if the destination register of an instruction is equal to an operand register or the destination register of a previous instruction at most N instructions before (taking into account the L NOP insertion list). Overlapping stores are defined as any store instructions with references to the same memory location. After the data dependencies have been found in this first iteration, all branch and jump targets are reevaluated according to the new instruction memory locations. In the third and final iteration, branches and jumps are checked for dependencies across PC-changes,

which may require additional NOPs to be inserted. Finally, the branch and jump targets are again reevaluated. The final instruction list is passed on to the simulator.

The output of the preprocessor and instruction parser consists of two files: a preprocessing log file containing information about code adjustments to fit iDEA, NOP-insertions and instruction expansions, as well as a file containing the assembly code as input to the simulator.

# 1.2. Simulator

The simulator is a behavioral-level simulator modelling the iDEA pipeline, written in Python. It can be configured to model a variable number of pipeline stages and different pipeline configurations. All iDEA instructions are supported by the simulator. The simulator implementation is based on an open-source project, a very simple MIPS simulator modelling the basic MIPS pipeline and a small core MIPS instruction set using a specialised assembly format for the input file. This limitation of the original simulatior means that examples have to be hand coded. The iDEA simulator extends the basic simulator to work on more practical problems and directly on the assembly code produced by mipsgcc. This allows us to automate the simulation process, from a C program to a simulation log file and run statistics file, with automatic checking of correctness. The main modifications made on the starting point simulator are as follows.

- Conversion from elf32-bigmips to simulator assembly
- Data dependency checking and NOP insertion
- Improved statistics collection
- Expanded the supported instruction set
- Variable pipeline length
- Easy-to-use command line interface
- Logfiles of simulation and preprocessing
- Error handling for input and memory references
- Support for subroutine calls
- · Memory dump and automated correctness checking
- · Modelling data and instruction memory
- Support for preloading data memory

Our simulator models the iDEA pipeline stage by stage and the instruction passing between stages. Register file, data memory and instruction memory are modelled individually. By enabling the verbose output option for the simulation log file, the simulator will output detailed information about the state of the instructions in the pipeline and the register values for each clock cycle. The pipeline length and configuration are fully customizable.

The pipeline simulator uses one branch delay slot (to fit the code generated for the MIPS platform), but this can easily be modified. Statistics collected during the simulation run are cycle count, NOP count, simulation run time and **core instruction** count.

By putting the tags **START\_CCORE** and **END\_CCORE** in the elf32 assembly input file, the simulator will count the instructions executed within those tags as core instructions and output this count as a separate statistic. This can be useful to ignore setup/initialization and finalization code, instead only counting the instructions executed within the computation itself.

The main limitations of the simulator is that it is dependent on the elf32 format as input, and the parsing is not very robust in the sense. The output files generated by the simulator are simple text files, and all debugging and output tracing has to be done manually, which is tedious for large programs. Large simulations (100,000s of cycles) take several seconds to run.

#### 1.3. Benchmarks

Five benchmarks, presented below, are used to evaluate performance. The benchmarks are written in standard C and use a self-checking loop to verify correctness. Data and expected output are stored in constant arrays which are then pre-loaded into the data memory before simulation. These benchmarks were selected to represent simple applications commonly run on softcore processors in FPGA.

- 1. **Fib**. Generating the 50 first Fibonacci numbers.
- 2. **Fir**. An FIR filter application using 5 taps on an array of 50 integers.
- 3. **MMult**. Matrix multiplication of two 5x5 integer constant matrices.
- 4. **Median**. A median filter with a window size of 5, on an array of 50 integers.
- 5. **QSort**. Quicksort on an array of 100 random integers.

## 1.4. Result Analysis

Table 1.3 shows the cycle and NOP counts for the benchmarks described in Section 1.3. The results show a large amount of NOP insertions to resolve dependencies. In most benchmarks the NOP instructions make up the vast majority of the total instructions executed. This can partially be traced to the compiler, which targets the MIPS platform rather than iDEA, and therefore might not generate very efficient code to run on iDEA. We also observe that for some of the

|                 |           | Total Cycles |       |       |       | NOPs   |       |       |       |
|-----------------|-----------|--------------|-------|-------|-------|--------|-------|-------|-------|
| Pipeline Stages | Benchmark | -00          | -01   | -02   | -03   | -00    | -01   | -02   | -03   |
| 5               | fib       | 4897         | 1986  | 1713  | 1223  | 0      | 0     | 1     | 1     |
|                 | fir       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | median    | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | mmult     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | bubble    | 136077       | 32497 | 43332 | 43332 | 107062 | 32497 | 33725 | 33725 |
|                 | qsort     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
| 6               | fib       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | fir       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | median    | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | mmult     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | qsort     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
| 7               | fib       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | fir       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | median    | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | mmult     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | qsort     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
| 8               | fib       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | fir       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | median    | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | mmult     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | qsort     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
| 9               | fib       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | fir       | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | median    | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | mmult     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |
|                 | qsort     | 427          | 681   | 0     | 7     | 0      | 0     | 1     | 1     |

Table 1. Benchmark simulation results

benchmarks the lower optimization levels actually perform better than higher optimization levels. In a few cases, -O2 and -O3 are equivalent in terms of clock cycles. This can be explained by the fact that we use a compiler which tries to optimize for the MIPS architecture, which might lead to improvements for MIPS, but not for iDEA.

### 1.5. Usage

The command line interface of the simulator is used as follows.

```
python run-simulator.py [options]
<Input ASM> <Sim. Log>
<Sim. ASM File> <Preproc. Log>
<Memory Dump>
```

## Example:

usr@sys:~\$ python src/run-simulator.py -v
benchmark/toy/median/median-02.asm

This will run the elf32-bigmips assembly file median-O2 with verbose output.

Only the Assembler Input File is a required input, others are optional (default filenames are used).

### **Options**

| -h, –help    | Show help message and exit.    |  |  |  |  |
|--------------|--------------------------------|--|--|--|--|
| -v, -verbose | Verbose debug output.          |  |  |  |  |
| -q, –quiet   | Supress simulator output.      |  |  |  |  |
| -p N         | Set number of pipeline stages. |  |  |  |  |
| -f N         | Set number of IF stages.       |  |  |  |  |
| -d N         | Set number of ID stages.       |  |  |  |  |
| -e N         | Set number of EX/MEM stages.   |  |  |  |  |
| -w N         | Set number of WB stages.       |  |  |  |  |

The pipeline must contain at least 4 stages (IF-ID-EX-WB). If the pipeline is deeper than 4 stages, the EX/MEM stage will be allocated more cycles, up to a maximum of 4. After that, the pipeline will be padded with IF.

# **Automated Testing**

runSimulations.sh is a Bash script that automatically runs all benchmarks and outputs success/failure notification and the most vital statistics for a given interval of pipeline stages.

```
usr@sys:~$ ./runSimulations.sh
```