





# XPU-Point: Simulator-Agnostic Sample Selection Methodology for Heterogeneous CPU-GPU Applications

Alen Sabu, Harish Patil, Wim Heirman, Changxi Liu, Trevor E. Carlson



**Parallel Architectures and Compilation Techniques (PACT)** 

November 06, 2025

## Complex Architectures → Unrealistic Simulation Times



Estimated simulation times using gem5 (CPU portion) and AccelSim (GPU portion) for heterogeneous CPU-GPU benchmarks: SPEChpc 2021 and PyTorch inference

Modern architectures require smarter simulation techniques



# **Simulation: Key Questions**

### Where to Simulate?

#### Unit of Work/Simulation

- Repeatable across runs
- Microarchitecture-independent

### How to Simulate?

- Trace-driven/Checkpoint-driven
- System-level/User-level

### Are Simulation Regions Representative?

#### Compute Sampling Error

- Using simulation
- Using native execution (simulator-agnostic)























# **Projection Methodology**

Instead of all regions...

BBV<sub>1</sub> BBV<sub>2</sub> BBV<sub>3</sub> BBV<sub>4</sub> BBV<sub>5</sub> BBV<sub>6</sub> BBV<sub>7</sub> BBV<sub>8</sub> BBV<sub>9</sub> BBV<sub>10</sub> BBV<sub>11</sub> BBV<sub>12</sub>

...simulate only selected regions

Project performance using weights



**Speedup** = 12/3 = 4
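The projection step can be illustrated with a minimal sketch. It assumes the clustering has already been done, that each cluster's weight is simply the number of regions it contains, and that all regions are of comparable size; the labels, representative indices, and cycle counts below are hypothetical and are not taken from the talk.

```python
from collections import Counter


def project_performance(cluster_of_region, representative_of_cluster, measured_perf):
    """Project a whole-program metric from a few representative regions.

    cluster_of_region: cluster label of every region, in program order.
    representative_of_cluster: cluster label -> index of its representative region.
    measured_perf: region index -> metric (e.g. cycles) measured for that region.
    """
    # Weight of a cluster = how many regions it stands in for (an assumption).
    weights = Counter(cluster_of_region)
    projected = sum(
        weights[cluster] * measured_perf[rep]
        for cluster, rep in representative_of_cluster.items()
    )
    # Speedup as defined on the slide: total regions / simulated regions.
    speedup = len(cluster_of_region) / len(representative_of_cluster)
    return projected, speedup


# Hypothetical example: 12 regions grouped into 3 clusters.
labels = [0, 0, 1, 0, 1, 1, 2, 2, 0, 1, 2, 2]
representatives = {0: 1, 1: 4, 2: 7}       # one region index per cluster
cycles = {1: 100.0, 4: 250.0, 7: 40.0}     # measured only for representatives

projected, speedup = project_performance(labels, representatives, cycles)
print(projected, speedup)  # 1560.0 4.0
```

Only the three representative regions need to be simulated; the remaining nine are accounted for through the weights.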



# Simulation Region Validation With Simulation

Are Simulation Regions Representative?



Compute Performance Stats & Sampling Error

$$\text{Sampling Error} = \left| 1 - \frac{\text{Extrapolated Perf}}{\text{Actual Perf}} \right|$$
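As a quick worked example of the formula, the following sketch computes the sampling error for a pair of hypothetical values (the numbers are illustrative only, and the function name is an assumption).

```python
def sampling_error(extrapolated_perf, actual_perf):
    """Return |1 - extrapolated/actual| for a performance metric."""
    if actual_perf == 0:
        raise ValueError("actual_perf must be non-zero")
    return abs(1.0 - extrapolated_perf / actual_perf)


# Hypothetical values: 980 extrapolated cycles against 1000 measured cycles.
error = sampling_error(980.0, 1000.0)
print(f"{error:.2%}")  # 2.00%
```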

#### **Challenge:**

Whole-program simulation is very slow

#### **Workarounds:**

- Use short workloads
- Use a fast, less accurate simulator



# Simulation Region Validation With Native Execution

Are Simulation Regions Representative?



Run natively: Gather TSC/performance counters



Compute Performance Stats & Sampling Error

$$\text{Sampling Error} = \left| 1 - \frac{\text{Extrapolated Perf}}{\text{Actual Perf}} \right|$$

#### Simulator-agnostic:

- Using native system as the simulator
- Much faster

#### **Challenge:**

Precisely gathering region performance
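One way to picture the native measurement is as a list of timestamps taken at every region boundary. The sketch below assumes such a list is already available (for example, TSC readings captured by an instrumentation tool) and only derives per-region durations from it; the function name and the timestamp values are hypothetical.

```python
def region_durations(boundary_timestamps):
    """Turn N+1 boundary timestamps into N per-region durations (in ticks)."""
    if len(boundary_timestamps) < 2:
        raise ValueError("need at least two boundary timestamps")
    durations = [
        later - earlier
        for earlier, later in zip(boundary_timestamps, boundary_timestamps[1:])
    ]
    if any(d < 0 for d in durations):
        raise ValueError("timestamps must be non-decreasing")
    return durations


# Hypothetical timestamps at the start of the run and at each region boundary.
timestamps = [1000, 1450, 2100, 2500]
durations = region_durations(timestamps)
print(durations)        # [450, 650, 400]
print(sum(durations))   # 1500, the whole-run duration
```

The durations of the selected regions feed the extrapolation, while the whole-run duration serves as the actual performance in the sampling-error formula.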



# Simulation Region Selection at Intel: Past 20 years

| Methodology | Scope | Regions (Unit of Work/Simulation) | Sample Validation Technique | Comment |
|-------------|-------|-----------------------------------|-----------------------------|---------|
| PinPoints (MICRO 2004) | Single-threaded/Itanium | Fixed instructions | [simulator-agnostic] Pin (JIT) + perfmon | Fixed-length intervals only |
| Cross-binary Simulation Points (ISPASS 2007, 2015) | Single-threaded, multiple binaries/x86 | Fixed instructions (binary 1) | CMP\$IM: Fast Pin-based cache simulator | Less detailed simulator used |
| GT-PinPoints (IISWC 2015) | OpenCL: GPU-only/Intel GPUs | GPU kernels | [simulator-agnostic] CoFluent | GPU-only |
| LoopPoint (HPCA 2022) | Multi-threaded/x86 | Loop iterations | Sniper: Pin/SDE-based simulator | SPEC 'train' runs used |
| XPU-Point (PACT 2025) | Heterogeneous CPU-GPU | GPU kernel: end to end | [simulator-agnostic] Pin (probe) + GT-Pin & NVBit | Co-analysis of CPU and GPU |



# Why Heterogeneous Architectures?

- Multi-cores aren't scaling well<sup>1</sup> because of power and thermal constraints
- XPU: Heterogeneous system w/ CPU, GPU, and memory co-packaged





# Simulation of Heterogeneous Architectures

Heterogeneous CPU-GPU simulation is extremely challenging

### **Simulation Slowdowns**



- CPU simulation: >10,000× slowdown<sup>1</sup>
- GPU simulation: >1,000,000,000× slowdown<sup>2</sup>

## Phase-based CPU-GPU Region Selection

- Modern CPU-GPU workloads are co-operative (e.g., GROMACS)
- Need CPU and GPU co-analysis for combined phase detection



Challenge: No framework for simultaneous CPU and GPU analysis



# XPU-Pin: Framework for Co-Analysis of Heterogeneous Execution



Support for generic accelerators: requires an instrumentation tool available as a shared library



## **XPU-Point: End-to-End Workflow**



$$\text{Sampling Error} = \left| 1 - \frac{\text{Extrapolated Perf}}{\text{Actual Perf}} \right|$$



## Unit of Work for XPU-Point





## **XPU-Profiler: CPU-GPU BBV Generation**



Challenge: Overhead of profiling  $\rightarrow$  Be selective (shared libraries)



## **XPU-BBVs: CPU-GPU BBV Concatenation**

[Figure: for the region between (kernel name<sub>i-1</sub>, count<sub>i-1</sub>) and (kernel name<sub>i</sub>, count<sub>i</sub>), the CPU BBVs (BBV<sub>t0</sub> ... BBV<sub>tN</sub>) and the GPU BBVs (BBV<sub>w0</sub>, BBV<sub>w1</sub>, ...) are concatenated into a single XPU-BBV.]
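The concatenation idea can be sketched as follows. This is an illustration under assumptions, not the XPU-Point implementation: the slide does not say what the `t` and `w` subscripts index, so the sketch treats the CPU and GPU sides as generic lists of basic-block count vectors, and the kernel names and counts are hypothetical.

```python
from itertools import chain


def concatenate_bbvs(cpu_bbvs, gpu_bbvs):
    """Flatten the CPU BBVs, then the GPU BBVs, into one XPU-BBV."""
    return list(chain.from_iterable(cpu_bbvs)) + list(chain.from_iterable(gpu_bbvs))


# Hypothetical region delimited by two (kernel name, count) markers.
region = {
    "start": ("kernel_a", 3),              # (kernel name_{i-1}, count_{i-1})
    "end": ("kernel_b", 1),                # (kernel name_i, count_i)
    "cpu_bbvs": [[4, 0, 2], [1, 1, 0]],    # CPU-side basic-block counts
    "gpu_bbvs": [[7, 5]],                  # GPU-side basic-block counts
}

xpu_bbv = concatenate_bbvs(region["cpu_bbvs"], region["gpu_bbvs"])
print(xpu_bbv)  # [4, 0, 2, 1, 1, 0, 7, 5]
```

Because every region yields one vector covering both devices, a single clustering pass can detect phases that depend on CPU and GPU behavior together.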



# **XPU-Timer: Time Stamps for CPU-GPU Regions**





## **Experimental Setup**

- CPUs
  - Multiple Intel Client/Server CPUs
- GPUs
  - Intel: Iris Xe (Integrated), Discrete Graphics 2 (DG2), Ponte Vecchio (PVC)
  - NVIDIA: A100, GeForce GTX 1080, Titan XP
- Compilers
  - Intel oneAPI, GNU, NVCC



## Results Reported

1. Sampling Error

$$\text{Sampling Error} = \left| 1 - \frac{\text{Extrapolated Perf}}{\text{Actual Perf}} \right|$$

2. Speedup

$$\text{Speedup} = \frac{\text{Number of Total Regions}}{\text{Number of Simulation Regions}}$$

- Base analysis
  - BBV generation and error measurement on the same machine
- Cross analysis
  - Profiling (Machine<sub>1</sub>/GPU<sub>1</sub>) → Measurement (Machine<sub>2</sub>/GPU<sub>2</sub>)



## Results: SPECaccel2023



#### **XPU slices:**

Combined CPU-GPU phase detection



#### **GPU slices:**

GPU-only phase detection

Focusing on GPU-only evaluation could lead to inaccurate decisions



# Results: SPEChpc2021





# **GROMACS: Various Configurations**

| Type | nb  | pme | pmefft | bonded | update | #slices |
|------|-----|-----|--------|--------|--------|---------|
| A    | GPU | CPU | CPU    | CPU    | CPU    | 305     |
| B    | GPU | CPU | CPU    | GPU    | CPU    | 506     |
| C    | GPU | GPU | CPU    | CPU    | CPU    | 707     |
| D    | GPU | GPU | CPU    | GPU    | CPU    | 908     |
| E    | GPU | GPU | GPU    | CPU    | CPU    | 3730    |
| F    | GPU | GPU | GPU    | GPU    | CPU    | 3931    |

GROMACS configurations are classified by the device (CPU or GPU) that executes each calculation. All types are run with `-nsteps 200` and `-notunepme`.
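The six configurations map naturally onto `gmx mdrun` offload options. The sketch below builds one argument list per type; it assumes the binary is invoked as `gmx`, that the table columns correspond to the `-nb`, `-pme`, `-pmefft`, `-bonded`, and `-update` options, and it omits the input-file arguments, which the slides do not give.

```python
import shlex

# Offload device per calculation, transcribed from the table above.
CONFIGS = {
    "A": {"nb": "gpu", "pme": "cpu", "pmefft": "cpu", "bonded": "cpu", "update": "cpu"},
    "B": {"nb": "gpu", "pme": "cpu", "pmefft": "cpu", "bonded": "gpu", "update": "cpu"},
    "C": {"nb": "gpu", "pme": "gpu", "pmefft": "cpu", "bonded": "cpu", "update": "cpu"},
    "D": {"nb": "gpu", "pme": "gpu", "pmefft": "cpu", "bonded": "gpu", "update": "cpu"},
    "E": {"nb": "gpu", "pme": "gpu", "pmefft": "gpu", "bonded": "cpu", "update": "cpu"},
    "F": {"nb": "gpu", "pme": "gpu", "pmefft": "gpu", "bonded": "gpu", "update": "cpu"},
}
OPTION_ORDER = ("nb", "pme", "pmefft", "bonded", "update")


def mdrun_args(config_type):
    """Build the mdrun argument list for one configuration type (inputs omitted)."""
    config = CONFIGS[config_type]
    args = ["gmx", "mdrun"]  # assumed binary name
    for option in OPTION_ORDER:
        args += [f"-{option}", config[option]]
    # Settings used for all types, as stated in the table note.
    args += ["-nsteps", "200", "-notunepme"]
    return args


for config_type in CONFIGS:
    print(config_type, shlex.join(mdrun_args(config_type)))
```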



# Results: GROMACS





# PyTorch Inference Workloads: Overheads



PyTorch Inference runs evaluated on a platform with an Intel Sapphire Rapids CPU and an Intel Ponte Vecchio GPU

Challenge: Overhead of profiling  $\rightarrow$  Be selective (shared libraries)



# Results: PyTorch Inference



PyTorch Inference (selective profiling) runs evaluated on Intel Ponte Vecchio GPU



## Summary

- XPU-Point is the first to enable accelerated heterogeneous simulation through CPU-GPU co-sampling
- Works for both Intel- and NVIDIA-based CPU-GPU platforms
- XPU-Point tools are open-sourced on GitHub
  - https://github.com/nus-comparch/xpupoint
- Acknowledgments
  - Roland Schulz, Edward Mascarenhas, Aleksandr Bobyr, Intel GTPin Team

