# Intel<sup>®</sup> oneAPI VTune<sup>™</sup> Profiler 2021.1.1 Gold

#### **Recommendations:**

Hotspots: Start with Hotspots analysis to understand the efficiency of your algorithm.

Use Hotspots analysis to identify the most time consuming functions. Drill down to see the time spent on every line of code.

Microarchitecture Exploration: There is low microarchitecture usage (38.9%) of available hardware resources. of Pipeline Slots

Run Microarchitecture Exploration analysis to analyze CPU microarchitecture bottlenecks that can affect application performance.

Threading: There is poor utilization of logical CPU cores (22.1%) in your application.

Use Threading to explore more opportunities to increase parallelism in your application.

**Elapsed Time:** 0.202s

**CPU:** 

**IPC:** 1.188

**SP GFLOPS:** 0.000

**DP GFLOPS:** 0.000

**x87 GFLOPS:** 0.026

**Average CPU Frequency:** 2.372 GHz

**GPU:** 

Time: 1.2% (0.002s) of Elapsed time

GPU utilization is low. Consider offloading more work to the GPU to increase overall application performance.

**IPC Rate:** 1.286

**Effective Logical Core Utilization:** 22.1% (1.765 out of 8)

The metric value is low, which may signal a poor logical CPU cores utilization. Consider improving physical core utilization as the first step and then look at opportunities to utilize logical cores, which in some cases can improve

processor throughput and overall performance of multithreaded applications.

### **Effective Physical Core Utilization:** 41.1% (1.642 out of 4)

The metric value is low, which may signal a poor physical CPU cores utilization caused by:

- load imbalance
- threading runtime overhead
- contended synchronization
- thread/process underutilization
- incorrect affinity that utilizes logical cores

instead of physical cores

Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism or run the Locks and Waits analysis to identify parallel bottlenecks for other parallel runtimes.

## Microarchitecture Usage: 38.9% of Pipeline Slots

You code efficiency on this platform is too low.

Possible cause: memory stalls, instruction starvation, branch misprediction or long latency instructions.

Next steps: Run Microarchitecture Exploration analysis to identify the cause of the low microarchitecture usage efficiency.

**Retiring:** 38.9% of Pipeline Slots **Front-End Bound:** 15.6% of Pipeline Slots **Back-End Bound:** 35.9% of Pipeline Slots

A significant portion of pipeline slots are remaining empty. When operations take too long in the back-end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work to be retired per cycle than the machine is capable to support. This opportunity cost results in slower execution. Long-latency operations like divides and memory operations can cause this, as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).

Memory Bound: 19.4% of Pipeline Slots Core Bound: 16.5% of Pipeline Slots

This metric represents how much Core non-memory issues were of a bottleneck. Shortage in hardware compute resources, or dependencies software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an 000 resources, certain execution units are overloaded or dependencies in program's data- or instruction- flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).

**Bad Speculation:** 9.5% of Pipeline Slots

Memory Bound:19.4% of Pipeline SlotsL1 Bound:4.8% of ClockticksL2 Bound:1.0% of ClockticksL3 Bound:4.7% of ClockticksDRAM Bound:10.6% of Clockticks

**DRAM Bandwidth Bound:** 0.0% of Elapsed Time

**Store Bound:** 1.9% of Clockticks

**Vectorization:** 0.0% of Packed FP Operations

A significant fraction of floating point arithmetic instructions are scalar. Use Intel Advisor to see possible reasons why the code was not vectorized.

**Instruction Mix:** 

**SP FLOPs:** 0.0% of uOps

Packed: 43.9% from SP FP 128-bit: 7.0% from SP FP

A significant fraction of floating point arithmetic vector instructions is executed with a partial vector load. Make sure you compile the code with the latest instruction set or use Intel Advisor for vectorization help.

**256-bit:** 36.8% from SP FP **Scalar:** 56.1% from SP FP

A significant fraction of floating point arithmetic instructions are scalar. Use Intel Advisor to see possible reasons why the code was not vectorized.

**DP FLOPs:** 0.0% of uOps

 Packed:
 1.5% from DP FP

 128-bit:
 1.5% from DP FP

 256-bit:
 8.0% from DP FP

 Scalar:
 98.5% from DP FP

A significant fraction of floating point arithmetic instructions are scalar. Use Intel Advisor to see possible reasons why the code was not vectorized.

**x87 FLOPs:** 0.4% of uOps

Non-FP: 99.6% of uOps

## FP Arith/Mem Rd Instr. Ratio: 0.017

The metric value is low. This can be a result of unaligned access to data for vector operations. Use Intel Advisor to find possible data access inefficiencies for vector operations.

## FP Arith/Mem Wr Instr. Ratio: 0.045

The metric value is low. This can be a result of unaligned access to data for vector operations. Use Intel Advisor to find possible data access inefficiencies for vector operations.

#### **GPU Active Time:** 1.2%

GPU utilization is low. Consider offloading more work to the GPU to increase overall application performance.

### **GPU Utilization when Busy: 34.2%**

The percentage of time when the EUs were stalled or idle is high, which has a negative impact on compute-bound applications.

IPC Rate: 1.286
EU State: 34.2%
Active: 34.2%
Stalled: 32.1%

A significant portion of GPU time is lost due to stalls. For compute-bound code, this could indicate that performance is limited by memory or sampler acesses.

Idle: 33.8%

A significant portion of GPU time is spent idle. This is usually caused by imbalance or thread scheduling problems.

**Occupancy:** 44.4% of peak value

Low value of the occupancy metric may be caused by inefficient work scheduling. Make sure work items are neither too small nor too large.

# **Collection and Platform Info:**

**Application Command Line:** ./codecs/hm/decoder/TAppDecoderStatic "-b" "./bin/hm/encoder\_randomaccess\_main.cfg/CLASS\_A/ Kimono\_1920x1080\_24\_QP\_22\_hm.bin"

**Operating System:** 5.4.0-65-generic DISTRIB\_ID=Ubuntu DISTRIB\_RELEASE=18.04 DISTRIB\_CODENAME=bionic DISTRIB\_DESCRIPTION="Ubuntu 18.04.5 LTS"

**Computer Name:** eimon

Result Size: 3.7 MB

**Collection start time:** 09:56:10 10/02/2021 UTC

**Collection stop time:** 09:56:10 10/02/2021 UTC

**Collector Type:** Event-based sampling driver, Event-based

counting driver

CPU:

Name: Intel(R) Processor code named Kabylake

ULX

Frequency: 1.992 GHz

**Logical CPU Count:** 8

**Max DRAM Single-Package Bandwidth:** 10.000 GB/s

**Cache Allocation Technology:** 

Level 2 capability: not detected

**Level 3 capability:** not detected

**GPU:** 

Name:

Display controller: Intel Corporation Device 22807

**Vendor:** Intel Corporation

EU Count: 24

**Max EU Thread Count:** 7

**Max Core Frequency:** 1.150 GHz