|  |  |
| --- | --- |
| **Computer Architectures 02LSEOV 02LSEOQ [AA-LZ]** | Delivery date:  19/11/2020 |
| **Laboratory**  **5** | Expected delivery of lab\_05.zip must include:   * This document compiled in pdf format. |

**CPU Architect Engineer for one hour**

**1. Introduction**

**Set up the environment**: download from the Portale della Didattica an updated version (gem5\_env\_2020\_lab5.zip) of the gem5\_env\_2020 used in the previous session.

The environment is similar to the one used previously, but includes an additional benchmark (located in benchmarks/basic\_math) and a python script mygem5script.py.

mygem5script.py is a configurable script for gem5 that allows you to set different features to the simulated processor. The script configures an Out-of-Order (O3) processor based on the *DerivO3CPU,* a superscalar processor.

The processor pipeline stages can be summarized as:

* *Fetch stage*: instructions are fetched from the instruction cache. The fetchWidth parameter set the number of fetched instructions. This stage does branch prediction and branch target prediction.
* *Decode stage*: This stage decode instructions and handles execution of unconditional branches. The decodeWidth parameter sets the maximum number of instructions processed per clock cycle.
* *Rename stage*: parameters relevant for this stage are the entries in the re-order buffer and the instruction queue (a kind of shared reservation stations). Register operands of the instruction are renamed, updating a renaming map (stall may appear if not available entries). The maximum number of instructions processed per clock cycle is set by the renameWidth parameter.
* *Dispatch/issue stage*: instructions whose renamed operands are available are dispatched to functional units. For loads, stores, they are dispatched to the Load/Store Queue (LSQ). The simulated processor has a single instruction queue from which all instructions issue. Ordinarily instructions are taken in-order from this queue. The maximum number of instructions processed per clock cycle is set by the dispatchWidth parameter.
* *Execute stage*: the functional unit actually processes their instruction. Each functional can be configured with a different latency. Conditional branch mispredictions are identified here. The maximum number of instructions processed per clock cycle depends on the different functional units configured and their latencies.
* *Write stage*: send the result of the instruction to the reorder buffet. The maximum number of instructions processed per clock cycle is set by the wbWidth parameter.
* *Commit stage*: process the reorder buffer, freeing up reorder buffer entries. The maximum number of instructions processed per clock cycle is set by the commitWidth parameter.

In the event of branch misprediction, trap, or other speculative execution event, "squashing" can occur at all stages of this pipeline. When a pending instruction is squashed, it is removed from the instruction queues, reorder buffers, requests to the instruction cache, etc.

**Disclaimer**: During this laboratory, you are asked to play the role of a CPU architect engineer, in charge of designing a new processor. Therefore, you must rely on your own creativity and imagination.

**2. 1st part: All the best money can buy**

Initially, in the first part of the laboratory we will give you infinite resources. You are asked to design a new processor able to achieve the best performances money can buy. You can invest for improving the processor without any constraint.

The design of a new processor is guided by a set of benchmarks. They are used for evaluating whether the new architecture performs better compared to a previous one. For the evaluation, a set of statistics of interest are monitored. In our case, we will use only one benchmark and a limited set of statistics.

As done in the previous laboratory, do not forget to set up the environment with the start.sh script (or start\_vbox.sh).

For this 1st part, we will use a new benchmark, the program basicmath\_small. The program can be compiled with the MakeFile (command make basicmath\_small).

Simulate the program with the following command

|  |
| --- |
| ~/gem5\_env\_2020\_lab5$ gem5\_sim mygem5script.py -c basicmath\_small |

Notice that the program output is automatically redirected to the file m5out/program.out.

Check the statistics (in m5out) file and collect the following parameters:

* sim\_ticks (Number of ticks simulated)
* sim\_insts (Number of instructions simulated)
* system.cpu.numCycles (Number of CPU Clock Cycles)
* system.cpu.cpi (Clock Cycles per Instruction)
* system.cpu.committedInsts (Number of instructions committed)
* host\_seconds (Host time in seconds)
* Prediction ratio for Conditional Branches: system.cpu.branchPred.condIncorrect / system.cpu.branchPred.condPredicted
* system.cpu.branchPred.BTBHits (Number of BTB hits)

Collect these parameters in Table 1 in the column *Basic configuration*. This represents the starting point. From this configuration, you are asked to perform the so-called “Design Space Exploration”. In this phase, different parameters are modified in order to find the best configuration. **Modify exclusively the parameters in the stages: Fetch, decode, rename, dispatch, execute, write and commit. Do not change any value related to the branch predictors.**

**Experiment1:** Modify the processor configuration by doubling the parameters in the stages: Fetch, decode, rename, dispatch, execute, write and commit. Simulate again the benchmark basicmath\_small and collect the statistics in the Table 1 in the column *X2 configuration*.

**Experiment2:** Modify one more time the processor configuration by doubling again the parameters in the stages: Fetch, decode, rename, dispatch, execute, write and commit. Simulate again the benchmark basicmath\_small and collect the statistics in the Table 1 in the column *X4 configuration*.

Now you should have an initial picture of the effects of the changes performed in Experiment 1 and 2. Analyze the results, and modify the basic processor configuration in order to improve the previous results as much as possible. **Again,** **modify exclusively the parameters in the stages: Fetch, decode, rename, dispatch, execute, write and commit. Do not change any value related to the branch predictors.**

For evaluating the effectiveness of your configuration use always the benchmark basicmath\_small and collect the statistics in the Table 1 in the column *Custom configuration*.

TABLE1

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
| CPUs  Parameters | Basic configuration | X2  configuration | X4 configuration | Custom configuration |
| sim\_ticks | 10131916500 | 6744904500 | 5155653000 | 4460673500 |
| sim\_insts | 8803289 | 8803289 | 8803289 | 8803289 |
| system.cpu.numCycles | 20264076 | 13490047 | 10311308 | 8921351 |
| system.cpu.cpi | 2.301876 | 1.532387 | 1.171302 | 1.013411 |
| system.cpu.committedInsts | 8803289 | 8803289 | 8803289 | 8803289 |
| host\_seconds | 53.02 | 35.50 | 30.27 | 33.78 |
| Prediction ratio | 204264/1072117=0.19052398199 | 214708/1260655=0.17031463802 | 231794/1453963=0.15942221363 | 260575/1479293=0.17614833572 |
| system.cpu.branchPred.BTBHits | 809387 | 920057 | 1033381 | 1075469 |

Report also the processor configuration values of your custom configuration in Table 2 (increase the number of rows if required).

TABLE2: CPU custom configuration Vs. the basic one

|  |  |  |
| --- | --- | --- |
| Parameter name | Basic configuration value | New value |
| fetchWidth | 2 | 8 |
| fetchBufferSize | 8 | 64 |
| fetchQueueSize | 8 | 64 |
| decodeWidth | 2 | 8 |
| numIQEntries | 4 | 256 |
| numROBEntries | 4 | 512 |
| renameWidth | 2 | 8 |
| dispatchWidth | 2 | 8 |
| issueWidth | 2 | 8 |
| LQEntries | 8 | 256 |
| SQEntries | 8 | 256 |
| MyInt\_ALU\_Count | 1 | 8 |
| MyFP\_ALU\_Count | 1 | 8 |
| MyFP\_MultDiv\_Count | 1 | 8 |
| wbWidth | 2 | 8 |
| commitWidth | 2 | 8 |
| squashWidth | 2 | 128 |

When performing Design Space Exploration, only few parameters are changed at a time. Otherwise, the size of the problem would rapidly growth and become untreatable. Now that we have a stable configuration (your *Custom Configuration* derived previously) it is time to consider also different units. Let’s move to Branch predictors.

The gem5 includes different branch predictors:

* LocalBP:

Implements a local predictor that uses the PC to index into a table of predictors. it is similar to a basic BHT.

* BiModeBP

The bi-mode predictor is a two-level branch predictor that has three separate history arrays: a taken array, a not-taken array, and a choice array. The taken/not-taken arrays are indexed by a hash of the PC and the global history. The choice array is indexed by the PC only. Because the taken/not-taken arrays use the same index, they must be the same size.

The bi-mode branch predictor aims to eliminate the destructive aliasing that occurs when two branches of opposite biases share the same global history pattern. By separating the predictors into taken/not-taken arrays, and using the branch's PC to choose between the two, destructive aliasing is reduced.

* TournamentBP

Implements a tournament branch predictor, hopefully identical to the one used in the 21264.

It has a local predictor, which uses a local history table to index into a table of counters, and a global predictor, which uses a global history to index into a table of counters. A choice predictor chooses between the two. Both the global history register and the selected local history are speculatively updated.

Starting from your custom configuration, enable one at a time, each one of the different branch predictors in the mygem5script.py section called: BPU SELECTION. Collect the resulting statistics for each configuration in the Table 3. Select exclusively **one (don’t ask which one you should select)** of the branch predictors and customize its values. Report the results in the last column labeled *BP Custom Configuration*.

TABLE3

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
| CPUs  Parameters | Local BP | BiModeBP | TournamentBP | Custom configuration |
| sim\_ticks | 4460673500 | 4331095500 | 4375377500 | 3207278500 |
| sim\_insts | 8803289 | 8803289 | 8803289 | 8803289 |
| system.cpu.numCycles | 8921351 | 8662193 | 8750762 | 6414608 |
| system.cpu.cpi | 1.013411 | 0.983972 | 0.994033 | 0.728660 |
| system.cpu.committedInsts | 8803289 | 8803289 | 8803289 | 8803289 |
| host\_seconds | 33.78 | 28.83 | 30.39 | 22.50 |
| Prediction ratio | 260575/1479293=0.17614833572 | 241837/1423869=0.16984497871 | 250247/1440792=0.17368711097 | 39118/1010786=0.03870057559 |
| system.cpu.branchPred.BTBHits | 1075469 | 1014783 | 1031510 | 869257 |

In the following, report the branch predictor selected for the improvements and the branch prediction configuration of your custom configuration (Table 4).

Branch Predictor Selected: Ho scelto il BiModeBP come partenza essendo quello con Prediction rate inferiore e prestazioni migliori.

TABLE4:BPU custom configuration Vs. the basic one

|  |  |  |
| --- | --- | --- |
| Parameter name | Basic configuration value | New value |
| globalPredictorSize | 64 | 131072 |
| choicePredictorSize | 64 | 131072 |
| BTBEntries | 256 | 262144 |
|  |  |  |
|  |  |  |
|  |  |  |

**3. 2nd part: The reality**

Unfortunately, the life of an engineer is not that easy. You are always requested to make choices, based on your intuition and experience. In particular, in a real-case scenario, you won’t have all these degrees of freedom, but your manager will always give you a limited amount of resources in terms of engineers, computers and time to complete the duty. Additionally, you will work on a specific portion of the processor – not the entire one.

Simulations of a real set of benchmarks might take weeks, and implementing a new architectural feature requires months of coding and debugging (not just changing the value of one variable in a python script).

To get a glimpse of this, let’s focus on the Functional Units only (see FUNCTIONAL UNIT DEFINITION in mygem5script.py). For each Functional Unit, we have the *count* (i.e., how many of those units are present) and a set of operation it can be used for. For each operation, it is also defined the *latency*.

For the following tasks, **consider exclusively the fft benchmark of the previous laboratory**.

Preliminarily, simulate the benchmark fft with the Basic Configuration (i.e., the initial one that you have in the mygem5script.py) in order to obtain the stats.txt file.

**Task 1**: You are asked to improve the functional units responsible for the floating-point operations, by *halving their latency*. Consider the following operations:

* FloatAdd
* FloatCmp
* FloatCvt
* FloatMult
* FloatDiv

In this case, changing the latency can be done instantly but when considering a real processor this can take months of research. Therefore, before moving into the implementation part, you must have some insights that you are not going to waste your time and your company money.

Due to a limited budget, you are asked to select only one operation among the one listed above as target for the improvements.

Which one would you select? For answering this question, you have to resort (**again!**) to the Amdahl’s Law.

During the Lab 2, we have used the formula with a great level of detail and accuracy since we were able to see clock cycle per clock cycle the evolution of the pipeline. However, in practice industrial architectural simulators do not have this level of detail. Additionally, the benchmarks are long and complex programs. Consequently, it is unfeasible to analyse them as we did in the Lab2.

In this lab, you have to use the statics generated by gem5 (in stats.txt) and some approximations in order to compute all the elements composing the formula. Specifically, the Fraction Enhanced should be computed as follows (in the example, it is used the FloatMult):

For each operation, compute the Speedup Overall resorting to the Amdahl’s Law and then with the simulator. Report the obtained values in Table 5.

TABLE5

|  |  |  |
| --- | --- | --- |
| Operation | Speedup Overall by Amdahl  1/((1-fraction\_enhanced)+fraction\_enhanced/speedup\_enhanced) | Speedup Overall by Simulation  numcycles\_default/numcycles\_enhanced |
| FloatAdd | 1/((1-0.05870378755488381)+0.05870378755488381/2)=1.0302394797757035 | 40565696/39815912=1.01883126525 |
| FloatCmp | 1/((1-0.005843512108816942)+0.005843512108816942/2)=1.0029303177279716 | 40565696/40496194=1.0017162600515002 |
| FloatCvt | 1/((1-0.005186890894952692)+0.005186890894952692/2)=1.00260018889553 | 40565696/40534880=1.00076023414896 |
| FloatMult | 1/((1-0.039254688779609)+0.039254688779609/2)=1.0200202895067367 | 40565696/40036644=1.0132141944764401 |
| FloatDiv | 1/((1-7.0995946920274705e-06)+7.0995946920274705e-06/2)=1.000003549809947 | 40565696/40565696=1.0 |

As you can see, the Amdahl’s Law is a valuable tool for designers since it represents a good approximation of what the speedup will be (providing an upper bound), without running simulations.

**Task 2:** Probably, you have also noticed that for each Functional Unit, you can also increase how many of them are present in the processor. Now you are asked to decide, starting from the *Base Configuration and considering the fft benchmark*, for the following operations whether is more advantageous to:

* double the unit that implement that operation, or;
* halve the latency of that operation.

The operations to be considered are in the following Table 6. In the column *Action to be Taken* report which of the two option you would consider (Double Unit or Halve Latency). This task cannot be done with the Amdahl’s Law solely, therefore you can resort to gem5 for deciding.

TABLE6

|  |  |
| --- | --- |
| Operation | Action to be Taken |
| IntALU | Double Unit |
| FloatAdd | Halve Latency |
| FloatCmp | Halve Latency |
| FloatCvt | Halve Latency |
| FloatMult | Halve Latency |
| FloatDiv | Double Unit |

For each operation, provide also a motivation for the action to be taken that you have chosen. **Please note that motivations like “the speedup overall is greater in this case ….” Are not accepted. Use the statistics and reason on their meaning.**

Your Motivation:

IntALU: Non possiamo dimezzare la latenza dell’unità perchè di default è già di 1 cc.

Raddoppiando il numero di unità funzionali, i tentativi di usare l’unità funzionale quando nessuna è disponibile passano in totale da 1780807 a 9186 (da 1772120 a 0 per la IntAlu), diminuendo il numero di cc totali da 40565696cc a 39981916cc.

Raddoppiando il numero di unità funzionali FP\_ALU, i tentativi di usare l’unità funzionale quando nessuna è disponibile in totale passano da 1780807 a 1778745, (da 6472 a 0 per la FloatAdd, da 15 a 0 per la FloatCmp, da 2131 a 0 per la FloatCvt) ma il numero di cc totali non varia.

FloatAdd:

Dimezzando la latenza dell’unità funzionale, sebbene aumentino i tentativi di usare l’unità funzionale quando nessuna è disponibile passino da 6472 a 7105, l’aumento di velocità permette di passare da 40565696cc a 39815912cc ed è quindi preferibile.

FloatCmp:

Dimezzando la latenza dell’unità funzionale invece, sebbene aumentino i tentativi di usare l’unità funzionale quando nessuna è disponibile passino da 15 a 1166, l’aumento di velocità permette di passare da 40565696 a 40496194 ed è quindi preferibile.

FloatCvt:

L’aumento di velocità permette di passare da 40565696cc a 40534880cc ed è quindi preferibile.

Raddoppiando il numero di unità funzionali FP\_MultDiv, i tentativi di usare l’unità funzionale quando nessuna è disponibile in totale passano da 1780807 a 1780737, (da 69 a 0 per la FloatMult, mentre la FloatDiv rimane a 0) ma il numero di cc totali non varia.

FloatMult:

L’aumento di velocità permette di passare da 40565696cc a 40036644cc ed è quindi preferibile.

FloatDiv:

L’aumento di velocità non varia il numero di cicli totali perchè sono presenti istruzioni div in un numero troppo limitato. Raddoppiare il numero di unità funzionali FP\_MultDiv è preferibile solo in quanto si diminuisce maggiormente il numero di stalli strutturali, rimuovendo anche quelli sulle FloatMul, sebbene in entrambi i casi il numero di cc totali non cambi.

As for the previous tasks, the budget allows for one action only (i.e., improvement). Once you have identified for each operation the most appropriate action to be taken, you are now requested to decide which action is the most urgent one (i.e., the one that gives the best results). Also here show your motivations.

Your answer and motivation:

Potendo scegliere un solo miglioramento, quello che porta a un maggiore beneficio in termini di numero di cicli impiegati dal programma è il dimezzamento della latenza dell’unità FloatAdd. Essendo l’istruzione Float più usata, dimezzarne la latenza migliora molto le prestazioni, consentendo di fare prima il commit delle varie istruzioni che attendono la sua terminazione e quindi velocizzare l’intero programma.

Il secondo candidato era il raddoppio dell’unità IntAlu, che diminuisce nettamente i tentativi falliti di accesso all’unità funzionale. Questo tuttavia non aumenta le prestazioni quanto il primo, perchè le istruzioni Alu hanno solo 1cc di latenza, quindi anche se devono attendere che la FU si liberi, l’attesa è molto breve. Inoltre tipicamente queste operazioni avvengono in contemporanea alle FloatAdd, che risultano essere un collo di bottiglia più grave in questo caso, e quindi, anche se ci sono molte istruzioni in attesa di entrare nell’unità funzionale IntAlu, l’esecuzione complessiva non viene rallentata eccessivamente.

**Hint:** For your reasoning, focus on the statistics: instructions simulated, type of FU issued, attempts to use FU when none available, CPU cycles required and operation latency.