## Intel<sup>®</sup> oneAPI VTune<sup>™</sup> Profiler 2021.1.1 Gold

**Elapsed Time:** 40.799s

**Clockticks:** 89,982,000,000

**Instructions Retired:** 204,046,200,000

CPI Rate: 0.441

MUX Reliability: 0.999
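The CPI Rate can be cross-checked against the two raw counters. The following is a minimal sketch, assuming CPI is simply clockticks divided by retired instructions; the function and variable names are illustrative and are not part of the VTune output.

```python
# Counter values copied from the summary above.
clockticks = 89_982_000_000
instructions_retired = 204_046_200_000


def cpi_rate(ticks: int, instructions: int) -> float:
    """Cycles per instruction: clockticks divided by retired instructions."""
    return ticks / instructions


cpi = cpi_rate(clockticks, instructions_retired)
ipc = 1.0 / cpi  # reciprocal view: instructions per cycle

print(f"CPI = {cpi:.3f}, IPC = {ipc:.2f}")
```

Rounded to three decimals, the computed value agrees with the reported CPI Rate of 0.441.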

**Retiring:** 58.7% of Pipeline Slots

- **Light Operations:** 54.5% of Pipeline Slots
  - FP Arithmetic: 1.2% of uOps
    - FP x87: 0.0% of uOps
    - FP Scalar: 1.1% of uOps
    - FP Vector: 0.0% of uOps
  - Other: 98.8% of uOps
- Heavy Operations: 4.2% of Pipeline Slots
  - Microcode Sequencer: 1.0% of Pipeline Slots
    - Assists: 0.0% of Pipeline Slots

**Front-End Bound:** 20.8% of Pipeline Slots

Issue: A significant portion of Pipeline Slots is remaining empty due to issues in the Front-End.

Tips: Make sure the code working size is not too large, the code layout does not require too many memory accesses per cycle to get enough instructions for filling four pipeline slots, or check for microcode assists.

Front-End Latency: 8.3% of Pipeline Slots

- **ICache Misses:** 2.9% of Clockticks
- ITLB Overhead: 0.3% of Clockticks
- **Branch Resteers:** 5.2% of Clockticks
  - **Mispredicts Resteers:** 4.2% of Clockticks
  - **Clears Resteers:** 0.0% of Clockticks
  - **Unknown Branches:** 0.9% of Clockticks
- DSB Switches: 2.2% of Clockticks
- **Length Changing Prefixes:** 0.0% of Clockticks
- MS Switches: 0.7% of Clockticks

Front-End Bandwidth: 12.4% of Pipeline Slots

This metric represents a fraction of slots during which CPU was stalled due to front-end bandwidth issues, such as inefficiencies in the instruction decoders or code restrictions for caching in the DSB (decoded uOps cache). In such cases, the front-end typically delivers a non-optimal amount of uOps to the back-end.

Front-End Bandwidth MITE: 29.5% of Clockticks

This metric represents a fraction of cycles during which CPU was stalled due to the MITE fetch pipeline issues, such as inefficiencies in the instruction decoders.

**Front-End Bandwidth DSB:** 6.2% of Clockticks

(Info) DSB Coverage: 36.2%

Issue: A significant fraction of uOps was not delivered by the DSB (known as Decoded ICache or uOp Cache). This may happen if a hot code region is too large to fit into the DSB.

Tips: Consider changing the code layout (for example, via profile-guided optimization) to help your hot regions fit into the DSB.

See the "Optimization for Decoded ICache" section in the Intel 64 and IA-32 Architectures Optimization Reference Manual.

```
Bad Speculation: 12.1% of Pipeline Slots
  Branch Mispredict: 12.1% of Pipeline Slots
  Machine Clears: 0.1% of Pipeline Slots
Back-End Bound: 8.4% of Pipeline Slots
  Memory Bound: 2.0% of Pipeline Slots
     L1 Bound: 5.5% of Clockticks
        DTLB Overhead: 6.0% of Clockticks
           Load STLB Hit: 5.9% of Clockticks
           Load STLB Miss: 0.0% of Clockticks
        Loads Blocked by Store Forwarding: 3.7% of Clockticks
        Lock Latency: 0.0% of Clockticks
        Split Loads: 0.0% of Clockticks
        4K Aliasing: 1.1% of Clockticks
        FB Full: 0.0% of Clockticks
     L2 Bound: 0.4% of Clockticks
     L3 Bound: 0.5% of Clockticks
        Contested Accesses: 0.0% of Clockticks
        Data Sharing: 0.0% of Clockticks
        L3 Latency: 1.7% of Clockticks
        SQ Full: 0.1% of Clockticks
     DRAM Bound: 0.1% of Clockticks
        Memory Bandwidth: 0.7% of Clockticks
        Memory Latency: 3.4% of Clockticks
     Store Bound: 0.5% of Clockticks
        Store Latency: 7.7% of Clockticks
        False Sharing: 0.0% of Clockticks
        Split Stores: 0.2% of Clockticks
        DTLB Store Overhead: 4.3% of Clockticks
           Store STLB Hit: 4.3% of Clockticks
           Store STLB Miss: 0.0% of Clockticks
  Core Bound: 6.4% of Pipeline Slots
     Divider: 0.6% of Clockticks
     Port Utilization: 22.6% of Clockticks
        Cycles of 0 Ports Utilized: 7.6% of Clockticks
           Serializing Operations: 0.5% of Clockticks
           Mixing Vectors: 0.0% of uOps
        Cycles of 1 Port Utilized: 5.4% of Clockticks
        Cycles of 2 Ports Utilized: 8.9% of Clockticks
        Cycles of 3+ Ports Utilized: 29.0% of Clockticks
           ALU Operation Utilization: 36.0% of Clockticks
              Port 0: 30.8% of Clockticks
              Port 1: 35.2% of Clockticks
              Port 5: 34.5% of Clockticks
              Port 6: 43.4% of Clockticks
           Load Operation Utilization: 35.3% of Clockticks
              Port 2: 43.5% of Clockticks
              Port 3: 44.3% of Clockticks
           Store Operation Utilization: 35.0% of Clockticks
              Port 4: 35.0% of Clockticks
              Port 7: 17.9% of Clockticks
        Vector Capacity Usage (FPU): 24.4%
```
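The slot-based metrics in this report can be sanity-checked for internal consistency. The sketch below assumes that the four top-level categories partition all pipeline slots and that each category's slot-based children sum to their parent, up to rounding; clocktick-based sub-metrics are left out because they are not additive in this way. The dictionary names and the tolerance are illustrative choices, not part of the VTune output.

```python
# Percent of pipeline slots, copied from the report.
TOP_LEVEL = {
    "Retiring": 58.7,
    "Front-End Bound": 20.8,
    "Bad Speculation": 12.1,
    "Back-End Bound": 8.4,
}

# Slot-based children of each top-level category, also from the report.
CHILDREN = {
    "Retiring": {"Light Operations": 54.5, "Heavy Operations": 4.2},
    "Front-End Bound": {"Front-End Latency": 8.3, "Front-End Bandwidth": 12.4},
    "Bad Speculation": {"Branch Mispredict": 12.1, "Machine Clears": 0.1},
    "Back-End Bound": {"Memory Bound": 2.0, "Core Bound": 6.4},
}

# Hypothetical allowance for values that were rounded to one decimal place.
TOLERANCE = 0.25


def residual(parent: float, parts) -> float:
    """Sum of the parts minus the parent, rounded to one decimal place."""
    return round(sum(parts) - parent, 1)


top_residual = residual(100.0, TOP_LEVEL.values())
child_residuals = {
    name: residual(TOP_LEVEL[name], kids.values())
    for name, kids in CHILDREN.items()
}

print(f"top level: residual {top_residual:+.1f}")
for name, value in child_residuals.items():
    status = "ok" if abs(value) <= TOLERANCE else "check"
    print(f"{name}: residual {value:+.1f} ({status})")
```

A residual of a tenth of a percentage point, as appears for Front-End Bound and Bad Speculation, is what rounding each metric independently would be expected to produce.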

**Average CPU Frequency:** 2.239 GHz

**Total Thread Count:** 1

Paused Time: 0s

## **Effective Physical Core Utilization:** 24.1% (0.964 out of 4)

The metric value is low, which may signal a poor physical CPU cores utilization caused by:

- load imbalance
- threading runtime overhead
- contended synchronization
- thread/process underutilization
- incorrect affinity that utilizes logical cores instead of physical cores

Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism or run the Locks and Waits analysis to identify parallel bottlenecks for other parallel runtimes.

## **Effective Logical Core Utilization:** 12.3% (0.985 out of 8)

The metric value is low, which may signal a poor logical CPU cores utilization. Consider improving physical core utilization as the first step and then look at opportunities to utilize logical cores, which in some cases can improve processor throughput and overall performance of multi-threaded applications.
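Both utilization percentages follow from the figures in parentheses in the two headings. The sketch below assumes the percentage is the effective core count divided by the number of available cores; the report does not state the formula, and the function name is illustrative.

```python
def utilization_percent(effective_cores: float, available_cores: int) -> float:
    """Effective core count expressed as a percentage of available cores."""
    return 100.0 * effective_cores / available_cores


# Figures copied from the report headings.
physical = utilization_percent(0.964, 4)  # physical cores
logical = utilization_percent(0.985, 8)   # logical cores

print(f"physical: {physical:.1f}%  logical: {logical:.1f}%")
```

With a Total Thread Count of 1, roughly one core is busy at any time, so the percentages are bounded by about 25% of 4 physical cores and 12.5% of 8 logical cores regardless of how well the single thread performs.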

## **Collection and Platform Info:**

**Application Command Line:** ./codecs/hm/encoder/TAppEncoderStatic "-c" "./configs/hm/encoder\_intra\_main.cfg" "-i" "./sequences/CLASS\_C/RaceHorses\_416x240\_30.yuv" "-wdt" "416" "-hgt" "240" "-b" "./bin/hm/encoder\_intra\_main.cfg/CLASS\_C/RaceHorses\_416x240\_30\_QP\_32\_hm.bin" "-o" "./rec\_yuv/hm/encoder\_intra\_main.cfg/CLASS\_C/RaceHorses\_416x240\_30\_QP\_32\_hm.yuv" "-fr" "30" "-fs" "0" "-f" "50" "-q" "32"

**User Name:** root

**Operating System:** 5.4.0-65-generic DISTRIB\_ID=Ubuntu DISTRIB\_RELEASE=18.04 DISTRIB\_CODENAME=bionic DISTRIB\_DESCRIPTION="Ubuntu 18.04.5 LTS"

**Computer Name:** eimon

**Result Size:** 99.8 MB

**Collection start time:** 20:46:22 09/02/2021 UTC

**Collection stop time:** 20:47:03 09/02/2021 UTC

**Collector Type:** Event-based sampling driver

CPU:

- Name: Intel(R) Processor code named Kabylake ULX
- Frequency: 1.992 GHz
- **Logical CPU Count:** 8
- **Cache Allocation Technology:**
  - Level 2 capability: not detected
  - **Level 3 capability:** not detected