|  |  |
| --- | --- |
| **Computer Architectures [GRB-ZZZ]** | Delivery date:  18th November 2022 |
| **Laboratory**  **5** | Expected delivery of lab\_05.zip must include:   * this document, filled in, preferably in PDF format. |

1. **Processor configuration and performance checking in gem5**

Download the supporting material from the course site: lab5\_material.zip

Unzip the file in your working directory (for example, my\_gem5Dir) to obtain the following files:

1. start.sh

A bash script that sets the gem5 paths needed to execute the Python script mygem5script.py.

2. mygem5script.py

A configurable gem5 script that lets you set different features of the simulated processor. In short, the script configures an Out-of-Order (O3) processor based on the *DerivO3CPU*, a superscalar processor with a reduced number of features.

The processor pipeline stages can be summarized as:

* *Fetch stage*: instructions are fetched from the instruction cache. The fetchWidth parameter sets the maximum number of instructions fetched per clock cycle. This stage performs branch prediction and branch target prediction.
* *Decode stage*: This stage decodes instructions and handles the execution of unconditional branches. The decodeWidth parameter sets the maximum number of instructions processed per clock cycle.
* *Rename stage*: the parameters relevant to this stage are the number of entries in the reorder buffer and in the instruction queue (a kind of shared reservation station). The register operands of each instruction are renamed, updating a renaming map (stalls may occur if no entries are available). The maximum number of instructions processed per clock cycle is set by the renameWidth parameter.
* *Dispatch/issue stage*: instructions whose renamed operands are available are dispatched to functional units. Loads and stores are dispatched to the Load/Store Queue (LSQ). The simulated processor has a single instruction queue from which all instructions are issued. Ordinarily, instructions are taken in order from this queue. The maximum number of instructions processed per clock cycle is set by the dispatchWidth parameter.
* *Execute stage*: the functional units process their instructions. Each functional unit can be configured with a different latency. Conditional branch mispredictions are identified here. The maximum number of instructions processed per clock cycle depends on the functional units configured and their latencies.
* *Write stage*: it sends the result of the instruction to the reorder buffer. The maximum number of instructions processed per clock cycle is set by the wbWidth parameter.
* *Commit stage*: it processes the reorder buffer, freeing up reorder buffer entries. The maximum number of instructions processed per clock cycle is set by the commitWidth parameter.
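
As a rough illustration of why the per-stage width parameters matter, the sketch below computes an upper bound on the sustained instructions per cycle as the width of the narrowest stage. The numeric widths are hypothetical placeholders, not the values found in mygem5script.py, and the bound deliberately ignores cache misses, data dependencies, functional unit latencies, and squashing.

```python
# Hypothetical stage widths (instructions per clock cycle). The real values
# are set in mygem5script.py and may differ from these placeholders.
stage_widths = {
    "fetchWidth": 2,
    "decodeWidth": 2,
    "renameWidth": 2,
    "dispatchWidth": 2,
    "wbWidth": 2,
    "commitWidth": 1,
}


def peak_ipc(widths):
    """Upper bound on sustained IPC: no stage can pass more than its width."""
    return min(widths.values())


def bottleneck_stages(widths):
    """Names of the stages whose width equals the bound."""
    limit = peak_ipc(widths)
    return [name for name, width in widths.items() if width == limit]


print("peak IPC bound:", peak_ipc(stage_widths))
print("limited by:", bottleneck_stages(stage_widths))
```

With these placeholder values the commit stage limits the pipeline, so widening any other stage alone would not raise the bound.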

In the event of a branch misprediction, trap, or other speculative execution event, "squashing" can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.

1. Simulate the program basicmath\_large (from MiBench) following the steps below. Remember to modify the program to **reduce the simulation time**. Please write here the changes that you have made to your program (basicmath\_large):
   1. Run the start.sh script for setting the gem5 paths

|  |
| --- |
| **~/my\_gem5Dir$** source start.sh |

   2. Simulate the program

|  |
| --- |
| **~/my\_gem5Dir$** /opt/gem5/build/ALPHA/gem5.opt mygem5script.py -c basicmath\_large |

Notice that the program output is automatically redirected to the file m5out/program.out.

Check the statistics file (in m5out) and collect the following parameters:

  1. Number of instructions simulated
  2. Number of CPU Clock Cycles
  3. Clock Cycles per Instruction (CPI)
  4. Number of instructions committed
  5. Host time in seconds
  6. Prediction ratio for Conditional Branches
     + Prediction ratio = Number of Incorrectly Predicted Conditional Branches / Number of Predicted Conditional Branches
  7. BTB hits.

Collect these parameters in Table 1 in the column *Basic configuration*.
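
The collection step can be scripted. The following is a minimal sketch, assuming the statistics file uses gem5's usual `name value # comment` line layout. The stat names in the sample text are assumptions based on older gem5 releases and should be checked against your own file; the numbers in the sample are made up for illustration.

```python
# Made-up sample in the usual "name value # comment" layout. The stat names
# are assumptions; verify them against the file produced by your gem5 build.
SAMPLE = """\
---------- Begin Simulation Statistics ----------
sim_insts                                   2000    # Number of instructions simulated
system.cpu.numCycles                        5000    # number of cpu cycles simulated
system.cpu.branchPred.condPredicted          400    # Number of conditional branches predicted
system.cpu.branchPred.condIncorrect           60    # Number of conditional branches incorrect
"""


def parse_stats(text):
    """Parse 'name value # comment' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        try:
            stats[fields[0]] = float(fields[1])
        except ValueError:
            continue  # skip banner lines and non-numeric entries
    return stats


stats = parse_stats(SAMPLE)
cpi = stats["system.cpu.numCycles"] / stats["sim_insts"]
prediction_ratio = (
    stats["system.cpu.branchPred.condIncorrect"]
    / stats["system.cpu.branchPred.condPredicted"]
)
print(f"CPI = {cpi:.3f}, prediction ratio = {prediction_ratio:.3f}")
```

To process a real run, the sample string can be replaced with the contents of the statistics file, for example `parse_stats(open("m5out/stats.txt").read())` if the file has that name.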

2. Modify the processor configuration by doubling the parameters in the stages: fetch, decode, rename, dispatch, execute, write, and commit. **Do not change any value related to the branch predictors**. Simulate again the program basicmath\_large and collect the statistics in Table 1 in the column *X2 configuration*.

Modify the processor configuration once more by doubling the same stage parameters again: fetch, decode, rename, dispatch, execute, write, and commit. **Do not change any value related to the branch predictors**. Simulate again the program basicmath\_large and collect the statistics in Table 1 in the column *X4 configuration*.
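
One way to keep the X2 and X4 configurations consistent is to derive them from the basic one. The sketch below models the configuration as a plain dictionary instead of touching gem5 objects. The width names come from the stage descriptions above; `issueWidth`, `numROBEntries`, and `numIQEntries` are assumptions about what the script exposes, and all starting values except the BTB size are hypothetical.

```python
# Parameters treated as stage parameters. The *Width names for fetch, decode,
# rename, dispatch, write and commit appear in this handout; issueWidth,
# numROBEntries and numIQEntries are assumptions about the script.
STAGE_PARAMS = {
    "fetchWidth", "decodeWidth", "renameWidth", "dispatchWidth",
    "issueWidth", "wbWidth", "commitWidth",
    "numROBEntries", "numIQEntries",
}


def scale_stage_params(config, factor):
    """Return a copy of config with only the stage parameters multiplied."""
    return {
        name: value * factor if name in STAGE_PARAMS else value
        for name, value in config.items()
    }


# Hypothetical starting values, for illustration only.
basic = {
    "fetchWidth": 2,
    "decodeWidth": 2,
    "commitWidth": 2,
    "numROBEntries": 16,
    "BTBEntries": 256,  # branch predictor parameter: must stay unchanged
}

x2 = scale_stage_params(basic, 2)
x4 = scale_stage_params(x2, 2)
print(x2)
print(x4)
```

Listing the scaled parameters explicitly means anything not named, such as the branch predictor sizes, is left untouched by construction.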

TABLE1: basicmath\_large program behavior on different CPU configurations

|  |  |  |  |
| --- | --- | --- | --- |
| CPUs  Parameters | Basic configuration | X2  configuration | X4 configuration |
| Ticks | 405446289000 | 275974271500 | 234159980000 |
| CPU clock domain | 500 | 500 | 500 |
| Clock Cycles | 810892599 | 551948576 | 468319962 |
| Instructions simulated | 371766395 | 371766395 | 371766395 |
| CPI | 2.181 | 1.484665 | 1.259716 |
| Committed instructions | 377572880 | 371766395 | 371766395 |
| Host seconds | 1112.23 | 956.71 | 921.26 |
| Prediction ratio | 0.198708 | 0.17415 | 0.14 |
| BTB hits | 34281937 | 39064202 | 47212135 |

3. Select one of the previous hardware configurations (Basic, X2, X4):

|  |  |
| --- | --- |
| Selected hardware Configuration: | X4 |

Besides hardware enhancements for increasing CPU performance, remember that optimizing compilers for programs written in high-level languages also exist. The aim of an optimizing compiler is to minimize or maximize some attribute of an executable program (code size, performance, etc.). Compilers are also aware of hardware enhancements, which lets them perform very accurate optimizations.

Compilers can be your best friend (or worst enemy!). The more information you provide in your program, the better the optimized program will be.

   1. Compile the program basicmath\_large with the provided *Makefile* and the ALPHA compiler, using different optimization levels (DO NOT CONFUSE THE -O3 OPTIMIZATION LEVEL WITH THE O3 PROCESSOR).

*hint:*

|  |
| --- |
| Add a variable to the Makefile to set the optimization level:  OPT="-O3"  and replace every occurrence of -O3 with the new variable, as follows:  -O3 → $(OPT) |

   2. To see which optimizations the compiler enables at a given level, you can run:

|  |
| --- |
| ~/my\_gem5Dir$ /opt/alphaev67-unknown-linux-gnu/bin/alphaev67-unknown-linux-gnu-gcc -c -Q -O2 --help=optimizers |

By replacing the -O2 parameter with the desired level, you can see which optimizations are enabled or disabled at that level.

Here are some references on the possible types of optimizations:

* <https://en.wikipedia.org/wiki/Optimizing_compiler>
* <https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>

   3. Simulate the program for different optimization levels and collect statistics (change the OPT variable in the Makefile; O3 is the default, so set OPT according to the values in parentheses).
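
Rebuilding for each level can be automated. The sketch below only constructs and prints the commands; it assumes the Makefile already uses the OPT variable from the hint above and that it provides a `clean` target, which is an assumption about the provided Makefile. It relies on the fact that a variable given on the `make` command line overrides the assignment inside the Makefile.

```python
import shlex

# Optimization settings listed in the header of Table 2.
OPT_LEVELS = {
    "O0": "-O0",
    "O1": "-O1",
    "O2": "-O2",
    "O3": "-O3",
    "Os": "-Os",
    "fast": "-O3 -ffast-math",
}


def build_commands(flags):
    """Commands to rebuild from scratch with a given OPT value.

    Assumes the Makefile has a 'clean' target, so that stale object files
    compiled at another optimization level are not reused.
    """
    return [["make", "clean"], ["make", f"OPT={flags}"]]


for label, flags in OPT_LEVELS.items():
    for command in build_commands(flags):
        print(f"[{label}]", shlex.join(command))
```

To actually run a build, each command list can be passed to `subprocess.run(command, check=True)` from the directory containing the Makefile.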

TABLE2: basicmath\_large program behavior at different compiler optimization levels

|  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| Optimization  Parameters | Selected hw configuration (-O3) | Opt lvl 0 (-O0) | Opt lvl 1 (-O1) | Opt lvl 2 (-O2) | Opt size (-Os) | Opt fast (-O3 -ffast-math) |
| Ticks | 234159980000 | 318603453500 | 432933279500 | 234159980000 | 236323403500 | 234159980000 |
| CPU clock domain | 500 | 500 | 500 | 500 | 500 | 500 |
| Clock Cycles | 468319962 | 637215809 | 865866799 | 468319962 | 472646815 | 468319962 |
| Instructions simulated | 371766395 | 514979779 | 386620224 | 371766395 | 371797124 | 371766395 |
| CPI | 1.259716 | 1.237361 | 2.239580 | 1.259716 | 1.271249 | 1.259716 |
| Committed instructions | 371766395 | 514979779 | 392637642 | 371766395 | 377568707 | 377572880 |
| Host seconds | 921.26 | 1320.25 | 2528.47 | 920.47 | 916.41 | 900.06 |
| Prediction ratio | 0.14 | 0.142371 | 0.20232 | 0.148248 | 0.153544 | 0.148248 |
| BTB hits | 47212135 | 65273434 | 34219978 | 47212135 | 44804486 | 47212135 |
| Executable Size | 738900 | 740660 | 738892 | 738900 | 738812 | 672220 |

2. **Branch predictors comparison**

gem5 includes several branch predictors:

* LocalBP:

Implements a local predictor that uses the PC to index into a table of counters. It is like a basic BHT.
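
To make the idea concrete, here is a minimal sketch of a PC-indexed table of saturating counters. It is not gem5's LocalBP code: the 2-bit counter width, the table size, the initial counter value, and the index function are all assumptions chosen for illustration.

```python
class LocalPredictor:
    """PC-indexed table of 2-bit saturating counters (a basic BHT)."""

    def __init__(self, entries=64):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries

    def _index(self, pc):
        # Drop the byte offset of 4-byte instructions, then wrap to the table.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop branch: taken nine times, then not taken at loop exit, three times over.
loop_outcomes = ([True] * 9 + [False]) * 3
print(mispredictions(LocalPredictor(), 0x1000, loop_outcomes))
```

The 2-bit counter tolerates the single not-taken outcome at each loop exit without flipping its prediction for the next loop entry.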

* BiModeBP:

The bi-mode predictor is a two-level branch predictor that has three separate history arrays: a taken array, a not-taken array, and a choice array. The taken/not-taken arrays are indexed by a hash of the PC and the global history. The choice array is indexed by the PC only. Because the taken/not-taken arrays use the same index, they must be the same size.

The bi-mode branch predictor aims to eliminate the destructive aliasing that occurs when two branches of opposite biases share the same global history pattern. By separating the predictors into taken/not-taken arrays, and using the branch's PC to choose between the two, destructive aliasing is reduced.
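
The structure described above can be sketched as follows. This is an illustrative model, not gem5's BiModeBP implementation: the XOR hash, the 2-bit counters, the table sizes, and the initial values are assumptions. The update rule follows the usual bi-mode scheme, in which only the selected direction array is trained and the choice counter is left alone when it disagreed with the outcome but the selected array was still right.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class BiModePredictor:
    def __init__(self, entries=64, history_bits=6):
        self.mask = entries - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0                 # global branch history
        self.choice = [1] * entries      # indexed by the PC only
        self.taken = [2] * entries       # taken array, starts weakly taken
        self.not_taken = [1] * entries   # not-taken array, same size

    def _indices(self, pc):
        pc_idx = (pc >> 2) & self.mask
        dir_idx = ((pc >> 2) ^ self.history) & self.mask  # assumed hash: XOR
        return pc_idx, dir_idx

    def predict(self, pc):
        pc_idx, dir_idx = self._indices(pc)
        table = self.taken if self.choice[pc_idx] >= 2 else self.not_taken
        return table[dir_idx] >= 2

    def update(self, pc, outcome):
        pc_idx, dir_idx = self._indices(pc)
        use_taken = self.choice[pc_idx] >= 2
        table = self.taken if use_taken else self.not_taken
        final = table[dir_idx] >= 2
        table[dir_idx] = bump(table[dir_idx], outcome)
        # Keep the choice counter unchanged when it disagreed with the outcome
        # but the selected direction array still predicted correctly.
        if not (use_taken != outcome and final == outcome):
            self.choice[pc_idx] = bump(self.choice[pc_idx], outcome)
        self.history = ((self.history << 1) | int(outcome)) & self.hist_mask


predictor = BiModePredictor()
wrong = 0
for outcome in [True] * 20:  # a branch that is always taken
    if predictor.predict(0x40) != outcome:
        wrong += 1
    predictor.update(0x40, outcome)
print("mispredictions:", wrong)
```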

* TournamentBP:

Implements a tournament branch predictor, intended to match the one used in the Alpha 21264.

It has a local predictor, which uses a local history table to index into a table of counters, and a global predictor, which uses a global history to index into a table of counters. A choice predictor chooses between the two. Both the global history register and the selected local history are speculatively updated.
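
The following sketch models the three components just described. It is a simplified illustration and not gem5's TournamentBP code: table sizes, 2-bit counters, and initial values are assumptions, the choice table is indexed by the global history as an assumption, and histories are updated only when the outcome is known instead of speculatively.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, entries=64, history_bits=6):
        self.pc_mask = entries - 1
        self.hist_mask = (1 << history_bits) - 1
        size = 1 << history_bits
        self.local_history = [0] * entries  # per-branch history, indexed by PC
        self.local_counters = [1] * size    # indexed by a local history
        self.global_history = 0
        self.global_counters = [1] * size   # indexed by the global history
        self.choice = [1] * size            # >= 2 means "trust the global side"

    def _lookup(self, pc):
        pc_idx = (pc >> 2) & self.pc_mask
        local_idx = self.local_history[pc_idx]
        local_pred = self.local_counters[local_idx] >= 2
        global_pred = self.global_counters[self.global_history] >= 2
        return pc_idx, local_idx, local_pred, global_pred

    def predict(self, pc):
        _, _, local_pred, global_pred = self._lookup(pc)
        use_global = self.choice[self.global_history] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        pc_idx, local_idx, local_pred, global_pred = self._lookup(pc)
        g = self.global_history
        # Train the chooser only when the two components disagree.
        if local_pred != global_pred:
            self.choice[g] = bump(self.choice[g], global_pred == taken)
        self.local_counters[local_idx] = bump(self.local_counters[local_idx], taken)
        self.global_counters[g] = bump(self.global_counters[g], taken)
        self.local_history[pc_idx] = ((local_idx << 1) | int(taken)) & self.hist_mask
        self.global_history = ((g << 1) | int(taken)) & self.hist_mask


def run(predictor, pc, outcomes):
    """Return a list of booleans: True where the prediction was correct."""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


alternating = [True, False] * 100  # taken, not taken, taken, ...
results = run(TournamentPredictor(), 0x2000, alternating)
print("mispredictions:", results.count(False))
```

An alternating branch defeats a plain counter table, but once the history registers fill with the pattern the history-indexed counters predict it reliably.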

Starting from your Custom Configuration and the default optimization level (O3), enable each of the branch predictors, one at a time, in the section of mygem5script.py called BPU SELECTION, and collect the resulting statistics for each configuration in the following table. Then select one of the branch predictors and customize its values. Report the results in the last column of the table.

TABLE3: basicmath\_large program behavior with different branch predictors

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
| CPUs  Parameters | Local predictor | Bimodal predictor | Tournament predictor | Custom configuration |
| Ticks | 234159980000 | 249700442500 | 250246598500 | 230891326500 |
| CPU clock domain | 500 | 500 | 500 | 500 |
| Clock Cycles | 468319962 | 499400907 | 500493225 | 461782655 |
| Instructions simulated | 371766395 | 386620224 | 386620224 | 386620224 |
| CPI | 1.259716 | 1.291709 | 1.294535 | 1.194409 |
| Committed instructions | 371766395 | 392637642 | 392637642 | 392637642 |
| Host seconds | 921.26 | 1848.88 | 1858.60 | 1776.71 |
| Prediction ratio | 0.14 | 0.14712 | 0.14666 | 0.10999 |
| BTB hits | 47212135 | 44024031 | 44376034 | 46120404 |

Report the branch prediction configuration of your custom configuration.

TABLE4: BPU custom configuration vs. the basic one

|  |  |  |
| --- | --- | --- |
| Parameter name | *Basic configuration* value | New value |
| my\_predictor.globalPredictorSize | 64 | 128 |
| my\_predictor.choicePredictorSize | 64 | 128 |
| my\_predictor.BTBEntries | 256 | 512 |
|  |  |  |
|  |  |  |
|  |  |  |