**Exercise 3**

**3.1 Reading**

The article “The Future of Microprocessors” by Shekhar Borkar and Andrew Chien discusses the shift away from relying only on transistor speed for microprocessor performance scaling. The authors argue that, to keep up with Moore’s law, significant architectural changes are necessary, with a particular emphasis on energy efficiency in chip design. If one simply added as many compute cores at maximum frequency as the transistor-integration capacity allows, the power consumption of the microprocessor would quickly grow into the hundreds of watts. Instead, the authors suggest that cache sizes will increase, since caches require less energy than logic, and that microprocessors will feature smaller, fine-grained compute cores capable of operating at different frequencies, as well as accelerators for special workloads.

Some predictions from the paper have become reality. For example, per-core frequency adjustment is common even in consumer-grade CPUs (Intel Turbo Boost, AMD Turbo Core). However, cache sizes did not increase as rapidly as predicted. Most mainstream CPUs have a cache of around 30 MB and some reach 64 MB, which falls short of the originally expected development. Notable exceptions, such as the IBM z15, use exotic architectures with very large off-chip caches, which are implemented as eDRAM and reach up to 960 MB in the L4 cache. The integration of smaller compute cores and accelerators into CPUs can be observed mainly in exotic architectures, while x64-based systems often rely on external many-core accelerators such as GPUs or the Xeon Phi.

Nevertheless, the core premise of the paper is correct, as energy efficiency has become a very important metric not only in the HPC sector, but also in the consumer sector, especially for mobile devices. We therefore accept the paper.

**3.2 Matrix multiply – sequential version**

The octance cluster has an AMD EPYC 7351P CPU, which according to the Geekbench Mandelbrot benchmark (<https://browser.geekbench.com/v3/cpu/8555579>) reaches a single-core performance of 3.35 GFLOP/s.

The naïve implementation uses three nested for loops to perform the matrix multiplication. It achieves the following results:

*Running on one node*

*======================*

*Unoptimized :*

258s 0.06639 GFlops/s

*======================*
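
The three nested loops described above can be sketched as follows. This is a minimal illustration and not the code that was actually measured: the function names `matmul_naive` and `gflops`, the square n x n shape, the `double` element type and the row-major flat-array layout are all assumptions. The `gflops` helper assumes the usual count of 2·n³ floating-point operations per multiplication, which may differ from how the benchmark program computes its figure.

```c
#include <assert.h>
#include <stddef.h>

/* Naive C = A * B for square n x n matrices stored row-major in flat
 * arrays (hypothetical layout, chosen for illustration only). */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* A is read contiguously along row i, but B is read
                 * with stride n (one element from each row of B). */
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

/* Achieved rate in GFLOP/s, assuming 2 * n^3 floating-point operations
 * (one multiply and one add per innermost iteration). */
double gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

Calling `matmul_naive(n, A, B, C)` inside a timed region and passing the elapsed wall-clock time to `gflops` would yield a figure comparable in kind to the one printed above.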

There are two main problems with the implementation:

* Hardware utilization: The processor used has 16 CPU cores. The naïve code, however, only uses one thread and therefore one core.
* Cache utilization: The naïve version reads matrix B with a strided access pattern, which causes frequent cache misses on accesses to B.

The optimised code uses a transposed matrix for the multiplication. This improves performance because the inner loop then reads consecutive memory locations, so multiple elements from the same cache line can be used before the line is evicted. It achieves the following results:

*Running on one node*

*======================*

*Optimized :*

40s 0.42928 GFlops/s

*======================*
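
The transposition idea can be sketched as follows. This is a hedged illustration under stated assumptions, not the measured code: the function name `matmul_transposed`, the row-major flat-array layout, the choice to transpose B into a temporary copy, and the error convention (0 on success, -1 on allocation failure) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* C = A * B for square n x n row-major matrices, using a transposed
 * copy of B so that the inner loop walks both operands contiguously.
 * Returns 0 on success and -1 if the scratch buffer cannot be allocated. */
int matmul_transposed(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    if (n == 0) {
        return 0; /* nothing to compute */
    }

    double *Bt = malloc(n * n * sizeof *Bt);
    if (Bt == NULL) {
        return -1;
    }

    /* Build Bt = transpose(B): Bt[j][i] = B[i][j]. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            Bt[j * n + i] = B[i * n + j];
        }
    }

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* Row i of A and row j of Bt are both read with stride 1. */
                sum += A[i * n + k] * Bt[j * n + k];
            }
            C[i * n + j] = sum;
        }
    }

    free(Bt);
    return 0;
}
```

The one-off transposition costs n² copies, which is small compared with the n³ multiply-add steps. With the timings reported above (258 s versus 40 s), the transposed version is roughly 6.5 times faster than the naïve one.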

**3.6 Willingness to present**

Willing to present all exercises.