## Intel<sup>®</sup> oneAPI VTune<sup>™</sup> Profiler 2021.1.1 Gold

**Elapsed Time:** 0.045s

Application execution time is too short. Metrics data may be unreliable. Consider reducing the sampling interval or increasing your application execution time.

**Clockticks:** 144,000,000 **Instructions Retired:** 133,200,000

**CPI Rate:** 1.081

The CPI may be too high. This could be caused by issues such as memory stalls, instruction starvation, branch misprediction or long latency instructions. Explore the other hardware-related metrics to identify what is causing high CPI.
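The CPI Rate above is the ratio of the two counters reported just before it. A minimal sketch that reproduces it from the report's values (the variable and function names are illustrative, not part of VTune):

```python
# Counter values copied from the report above.
clockticks = 144_000_000
instructions_retired = 133_200_000


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction; lower means more work done per cycle."""
    return cycles / instructions


cpi_rate = cpi(clockticks, instructions_retired)
ipc = instructions_retired / clockticks  # the inverse view of the same ratio

print(f"CPI: {cpi_rate:.3f}")  # prints "CPI: 1.081"
print(f"IPC: {ipc:.3f}")       # prints "IPC: 0.925"
```

Because the run lasted only 0.045s, the ratio rests on very few samples and should be read as indicative only.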

MUX Reliability: 0.975

**Retiring:** 25.8% of Pipeline Slots 28.9% of Pipeline Slots

FP Arithmetic: 0.0% of uOps

FP x87: 0.0% of uOps

FP Scalar: 0.0% of uOps

FP Vector: 0.0% of uOps

Other: 100.0% of uOps

**Heavy Operations:** 0.0% of Pipeline Slots

**Microcode Sequencer:** 4.3% of Pipeline Slots

**Assists:** 0.0% of Pipeline Slots

**Front-End Bound:** 27.0% of Pipeline Slots

Issue: A significant portion of Pipeline Slots is remaining empty due to issues in the Front-End.

Tips: Make sure the code working size is not too large, the code layout does not require too many memory accesses per cycle to get enough instructions for filling four pipeline slots, or check for microcode assists.

**Front-End Latency:** 28.1% of Pipeline Slots

This metric represents a fraction of slots during which CPU was stalled due to front-end latency issues, such as instruction-cache misses, ITLB misses or fetch stalls after a branch misprediction. In such cases, the front-end delivers no uOps.

ICache Misses: 0.0% of Clockticks

ITLB Overhead: 1.5% of Clockticks

**Branch Resteers:** 0.0% of Clockticks

**Mispredicts Resteers:** 0.0% of Clockticks

**Clears Resteers:** 0.0% of Clockticks

Unknown Branches: 0.0% of Clockticks

DSB Switches: 0.0% of Clockticks

**Length Changing Prefixes:** 0.0% of Clockticks

MS Switches: 0.0% of Clockticks

Issue: A significant fraction of cycles was stalled due to switches of uOp delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uOp flows required by CISC instructions like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals. Note that this metric value may be highlighted due to Microcode Sequencer issue.

Front-End Bandwidth: 0.0% of Pipeline Slots

Front-End Bandwidth MITE: 28.1% of Clockticks

Front-End Bandwidth DSB: 0.0% of Clockticks

(Info) DSB Coverage: 37.1%

**Bad Speculation:** 2.3% of Pipeline Slots

Branch Mispredict: 0.0% of Pipeline Slots

Machine Clears: 0.0% of Pipeline Slots

**Back-End Bound:** 44.9% of Pipeline Slots

A significant portion of pipeline slots remains empty. When operations take too long in the back-end, they introduce bubbles in the pipeline, so fewer pipeline slots containing useful work are retired per cycle than the machine is capable of supporting. This opportunity cost results in slower execution. Long-latency operations like divides and memory operations can cause this, as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).
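The four top-level categories of this report (Retiring, Front-End Bound, Bad Speculation, Back-End Bound) should together account for all pipeline slots. A minimal sketch of that sanity check, assuming Retiring is the 25.8% value and Bad Speculation is the 2.3% value (the dictionary name is illustrative):

```python
# Top-level pipeline-slot breakdown, in percent, copied from the report.
# Assumption: Retiring = 25.8 and Bad Speculation = 2.3, the only
# assignment of the reported values that closes to 100%.
level1_slots = {
    "Retiring": 25.8,
    "Front-End Bound": 27.0,
    "Bad Speculation": 2.3,
    "Back-End Bound": 44.9,
}

total = sum(level1_slots.values())
print(f"Total pipeline slots accounted for: {total:.1f}%")

# Largest category first: this is where tuning effort should start.
for name, pct in sorted(level1_slots.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {pct:5.1f}%")
```

With these values the categories close to 100%, and Back-End Bound is the dominant one.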

Memory Bound: 24.4% of Pipeline Slots

The metric value is high. This can indicate that a significant fraction of execution pipeline slots could be stalled due to demand memory loads and stores. Use Memory Access analysis to break the metric down by memory hierarchy level, memory bandwidth, and memory object.

L1 Bound:

DTLB Overhead:

Load STLB Hit:

Load STLB Miss:

0.0% of Clockticks

0.0% of Clockticks

0.0% of Clockticks

**Loads Blocked by Store Forwarding:** 0.0% of Clockticks

Lock Latency:
Split Loads:
4K Aliasing:
FB Full:

L2 Bound:

0.0% of Clockticks
0.0% of Clockticks
0.0% of Clockticks

L3 Bound: 11.2% of Clockticks

This metric shows how often the CPU was stalled on the L3 cache or contended with a sibling core. Avoiding cache misses (L2 misses/L3 hits) improves latency and increases performance.

Contested Accesses: 0.0% of Clockticks
Data Sharing: 1.6% of Clockticks
L3 Latency: 0.0% of Clockticks

This metric shows a fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency, reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings.

**SQ Full:** 0.0% of Clockticks

**DRAM Bound:** 0.0% of Clockticks

**Memory Bandwidth:** 7.5% of Clockticks

**Memory Latency:** 11.2% of Clockticks

**Store Bound:** 0.0% of Clockticks

**Store Latency:** 15.0% of Clockticks

**False Sharing:** 0.0% of Clockticks

**Split Stores:** 0.0% of Clockticks

**DTLB Store Overhead:** 0.2% of Clockticks

Store STLB Hit: 0.0% of Clockticks

Store STLB Miss: 0.2% of Clockticks

**Core Bound:** 20.5% of Pipeline Slots

This metric represents how much of a bottleneck non-memory core issues were. A shortage of hardware compute resources and dependencies among the software's instructions are both categorized under Core Bound. Hence it may indicate that the machine ran out of OOO (out-of-order) resources, that certain execution units are overloaded, or that dependencies in the program's data or instruction flow are limiting performance (e.g. FP-chained long-latency arithmetic operations).
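Back-End Bound splits into Memory Bound and Core Bound, so the two sub-metrics should add up to the 44.9% reported for the parent. A minimal sketch of that check using the report's values (variable names are illustrative):

```python
# Pipeline-slot percentages copied from the report.
back_end_bound = 44.9
memory_bound = 24.4
core_bound = 20.5

children_sum = memory_bound + core_bound
print(f"Memory Bound + Core Bound = {children_sum:.1f}%")

# Share of back-end stalls attributed to each child.
memory_share = memory_bound / back_end_bound
core_share = core_bound / back_end_bound
print(f"Memory share of back-end stalls: {memory_share:.0%}")
print(f"Core share of back-end stalls:   {core_share:.0%}")
```

The split is close to even here, so neither memory access nor core execution resources alone explains the back-end stalls.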

**Divider:** 0.0% of Clockticks

**Port Utilization:** 9.4% of Clockticks

**Cycles of 0 Ports Utilized:** 25.8% of Clockticks

**Serializing Operations:** 15.0% of Clockticks

Mixing Vectors: 0.0% of uOps

Cycles of 1 Port Utilized: 11.7% of Clockticks

Cycles of 2 Ports Utilized: 16.4% of Clockticks

Cycles of 3+ Ports Utilized: 14.1% of Clockticks

ALU Operation Utilization: 11.7% of Clockticks

Port 0: 9.4% of Clockticks
Port 5: 4.7% of Clockticks
Port 6: 23.4% of Clockticks
Load Operation Utilization: 7.0% of Clockticks

Port 2: 9.4% of Clockticks 9.4% of Clockticks 9.4% of Clockticks 4.7% of Clockticks 4.7% of Clockticks 4.7% of Clockticks 9.0% of Clockticks 9.0% of Clockticks 9.0% of Clockticks

**Vector Capacity Usage (FPU):** 0.0%

**Average CPU Frequency:** 1.208 GHz
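As a rough cross-check, the clocktick count and the average frequency give an estimate of total CPU time, which can be compared with the elapsed time multiplied by the average number of busy logical cores (2.603, reported further down). This is a sketch under the assumption that the average frequency is approximately clockticks divided by CPU time; the variable names are illustrative.

```python
# Values copied from the report.
clockticks = 144_000_000
avg_frequency_hz = 1.208e9   # 1.208 GHz
elapsed_s = 0.045
avg_busy_logical_cores = 2.603

# Assumption: average frequency ~= clockticks / CPU time.
cpu_time_from_clockticks = clockticks / avg_frequency_hz
cpu_time_from_utilization = elapsed_s * avg_busy_logical_cores

print(f"CPU time from clockticks:  {cpu_time_from_clockticks:.3f} s")   # prints 0.119 s
print(f"CPU time from utilization: {cpu_time_from_utilization:.3f} s")  # prints 0.117 s
```

The two estimates agree to within a few milliseconds, which suggests the headline numbers are internally consistent even though the run is very short.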

**Total Thread Count:** 9

**Paused Time:** 0s

**Effective Physical Core Utilization:** 52.1% (2.082 out of 4)

The metric value is low, which may signal poor utilization of physical CPU cores caused by:

- load imbalance
- threading runtime overhead
- contended synchronization
- thread/process underutilization
- incorrect affinity that utilizes logical cores instead of physical cores

Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism or run the Locks and Waits analysis to identify parallel bottlenecks for other parallel runtimes.

**Effective Logical Core Utilization:** 32.5% (2.603 out of 8)

The metric value is low, which may signal poor utilization of logical CPU cores. Consider improving physical core utilization as the first step and then look at opportunities to utilize logical cores, which in some cases can improve processor throughput and overall performance of multi-threaded applications.
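Both utilization percentages are the average number of busy cores divided by the number of cores available. A minimal sketch that recomputes them from the figures in parentheses (the function name is illustrative; small differences from the report come from rounding of the displayed averages):

```python
def utilization_pct(avg_busy_cores: float, total_cores: int) -> float:
    """Average busy cores as a percentage of the cores available."""
    return 100.0 * avg_busy_cores / total_cores


# "2.082 out of 4" physical cores and "2.603 out of 8" logical cores.
physical_pct = utilization_pct(2.082, 4)
logical_pct = utilization_pct(2.603, 8)

print(f"Physical core utilization: {physical_pct:.2f}%")
print(f"Logical core utilization:  {logical_pct:.2f}%")
```

On average only about two to three of the nine threads were running at any time, which is consistent with the low utilization flagged above.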

## **Collection and Platform Info:**

**Application Command Line:** ./codecs/HHI-VVC/decoder/vvdecapp "-b" "./bin/HHI-VVC/randomaccess\_faster.cfg/CLASS\_C/ RaceHorses\_416x240\_30\_QP\_27\_HHI-VVC.bin"

**User Name:** root

**Operating System:** 5.4.0-72-generic DISTRIB\_ID=Ubuntu DISTRIB\_RELEASE=18.04 DISTRIB\_CODENAME=bionic DISTRIB\_DESCRIPTION="Ubuntu 18.04.5 LTS"

**Computer Name:** eimon

**Result Size:** 14.1 MB

**Collection start time:** 22:33:27 18/04/2021 UTC

**Collection stop time:** 22:33:28 18/04/2021 UTC

**Collector Type:** Event-based sampling driver

CPU:

Name: Intel(R) Processor code named Kabylake ULX

Frequency: 1.992 GHz

**Logical CPU Count:** 8

**Cache Allocation Technology:** 

**Level 2 capability:** not detected

Level 3 capability: not detected