# GPU power and energy optimization interactive walkthrough

*Alan Gray, NVIDIA, February 2025*

In this exercise, you will be guided to optimize the energy usage of a benchmark by reducing the GPU clock frequency (which in turn reduces power draw).

We will use a [QCD (Physics) benchmark](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/master/qcd/part_1) here, but the same techniques can be used for any code.

This notebook is designed for T4 GPUs, but can be run on other GPUs with some minor adjustments. The commands can also be copied and run directly on any other GPU server.

If running this notebook on Google Colab: to get started, change the runtime type to "T4 GPU" via the "Additional connection options" button in the top right of this notebook, and click "Connect". (*If no T4 GPUs are available at the current time, or you get a message about usage limits, you can try again later*.)

Press the "play" button on each code cell in turn, following the instructions. When a cell has completed, a tick will appear and you can move on to the next cell.

## Setup

Test an interactive prompt, using nvidia-smi to print details of the available GPU:

In [None]:
! nvidia-smi

Get the benchmark code:

In [None]:
! git clone --depth=1 https://repository.prace-ri.eu/git/UEABS/ueabs.git

Create a config file to allow the benchmark to be built (the details of this are not important for this exercise):

In [2]:
with open('ueabs/qcd/part_1/config.mk', 'w') as f:
    f.write("""
MPIDIR=/usr/lib/x86_64-linux-gnu/openmpi
GPUS_PER_NODE=1
NVARCH=sm_75 # For T4 GPU, adjust for any other GPU
CFLAGS = $(DEFINES) -O2 -DARCH=0 -w -I $(MPIDIR)/include
LDFLAGS = -lm  -arch=$(NVARCH) -L./targetDP -ltarget -L$(MPIDIR)/lib -lmpi -lm -lgomp
CC=mpicc
TARGETCC=nvcc
TARGETCFLAGS=-x cu -arch=$(NVARCH) -I. -DCUDA -DVVL=1 -DSoA -DGPUSPN=$(GPUS_PER_NODE) -dc -c $(CFLAGS)
    """)


Build the "targetDP" library, a dependency of the benchmark:

In [None]:
! pushd ueabs/qcd/part_1/targetDP; make clean; make; popd

Build the benchmark:

In [None]:
! pushd ueabs/qcd/part_1/src; make clean; make; popd

Adjust the input file to specify 500 iterations, which is large enough for representative power measurements:

In [5]:
! sed -i "s/max_cg_iters 1/max_cg_iters 500/g" ueabs/qcd/part_1/src/kernel_E.input

Create a script to run the code while monitoring GPU power and clock frequency. See the inline comments in the script.

In [6]:
with open('run.sh', 'w') as f:
    f.write("""
# start nvidia-smi looping in the background, writing power and clock measurements to a CSV file every 100 ms.
nvidia-smi --query-gpu=index,power.draw,clocks.gr --format=csv --loop-ms 100 > GPU_readings.csv 2>&1 &
# run the benchmark
pushd ueabs/qcd/part_1/src/
mpirun --allow-run-as-root -np 1 ./bench
popd
# stop nvidia-smi
pkill nvidia-smi
    """)

## Run at default clock

**Run the script you just created (this takes around 1 minute on a T4 GPU; ignore the warning about a missing reference file):**

(the **bold** text here and below is to highlight parts that will be repeated in this exercise)

In [None]:
! bash ./run.sh 2>&1 | tee run.log

Let's have a look at the first few lines of the GPU_readings.csv file. You will see that it consists of a header, plus power and clock measurements:

In [None]:
! head -5 GPU_readings.csv

**Extract the power readings into a separate file:**

In [15]:
! cat GPU_readings.csv | grep -v index | awk '{ print $2}'  > GPU_power.csv
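
As an alternative to the `grep`/`awk` pipeline, the readings can be parsed in Python. The following is a minimal sketch, assuming each data line has the form `index, <power> W, <clock> MHz`. The function name `parse_readings` and the sample lines are illustrative, not real measurements, and the exact header text is an assumption.

```python
def parse_readings(lines):
    """Parse nvidia-smi CSV lines into (power_watts, clock_mhz) tuples.

    The header line and any non-numeric lines (for example, messages
    captured through the 2>&1 redirection) are skipped.
    """
    readings = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        try:
            power = float(fields[1].split()[0])  # "27.50 W" -> 27.5
            clock = float(fields[2].split()[0])  # "300 MHz" -> 300.0
        except (ValueError, IndexError):
            continue  # header or malformed line
        readings.append((power, clock))
    return readings


# Illustrative sample in the assumed output format (not real measurements).
sample = """index, power.draw [W], clocks.current.graphics [MHz]
0, 27.50 W, 300 MHz
0, 69.80 W, 1590 MHz
"""
print(parse_readings(sample.splitlines()))
```

To apply this to the real data, pass the open file instead of the sample, e.g. `parse_readings(open('GPU_readings.csv'))`.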

Create a graph of power measurements. You will see that the power ramps up to the maximum (70 W for T4) as the benchmark starts up.

(Optional: Create a similar graph of clock frequency measurements - see end of notebook.)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('GPU_power.csv', header=None)
column = data.iloc[:, 0]
plt.figure(figsize=(5, 3))
plt.plot(column)
plt.xlabel('Sample')
plt.ylabel('GPU Power (W)')
plt.show()


**Strip the first 200 and last 100 entries from the power readings file, to remove startup and shutdown parts (which won't be relevant for a real-world longer run):**

In [17]:
! sed -i '1,200d' GPU_power.csv; sed -i "$(($(wc -l < GPU_power.csv)-99)),\$d" GPU_power.csv
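
The same trimming can be expressed with a Python slice. This is a minimal sketch; the helper name `trim_samples` is hypothetical. At the nominal 100 ms sampling interval used in `run.sh`, 200 samples correspond to roughly the first 20 s and 100 samples to roughly the last 10 s of the run.

```python
def trim_samples(samples, head=200, tail=100):
    """Return samples with the first `head` and last `tail` entries removed."""
    if len(samples) <= head + tail:
        raise ValueError("not enough samples to trim startup and shutdown")
    return samples[head:len(samples) - tail]


# Stand-in data: 600 numbered samples, about what a 1 minute run produces.
demo = list(range(600))
steady = trim_samples(demo)
print(len(steady), steady[0], steady[-1])
```

The explicit length check avoids silently producing an empty list when a run is too short to have a steady-state section.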

Create a script to extract average power, time and energy (power multiplied by time) readings from our data:

In [18]:
with open('write_stats.sh', 'w') as f:
    f.write("""
# Extract average power across the samples
power=`cat GPU_power.csv | awk 'BEGIN{sum=0;count=0}{ sum += $1; count+=1}END{print sum/count}'`;
# Extract time from benchmark output
time=`grep "BENCHMARK TIME " run.log | awk '{ print $3 }'`;
# Calculate Energy = Power x Time
energy=$(awk -v p="$power" -v t="$time" 'BEGIN {print (p*t/1000)}');
printf "Power = %.1f W Time = %.1f s Energy = %.1f kJ" $power $time $energy
    """)

**Run our script to write the power, time and energy readings:**

In [None]:
! bash ./write_stats.sh
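
For reference, the arithmetic performed by `write_stats.sh` can be sketched in Python: average the power samples, then multiply by the run time and divide by 1000 to convert joules to kilojoules. The function name `energy_stats` and the input values are illustrative only.

```python
def energy_stats(power_samples_w, time_s):
    """Return (average power in W, energy in kJ) for one run."""
    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    energy_kj = avg_power_w * time_s / 1000.0  # W x s = J, and 1000 J = 1 kJ
    return avg_power_w, energy_kj


# Illustrative inputs (not real measurements).
run_time_s = 60.0
avg_w, energy = energy_stats([69.0, 70.0, 71.0], run_time_s)
print(f"Power = {avg_w:.1f} W Time = {run_time_s:.1f} s Energy = {energy:.1f} kJ")
```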

## Record Results

Double click on this cell to edit it and take a note of your measurements at default clock:

Default clock:
Power = ?? W Time = ?? s Energy = ?? kJ

Reduced clock:
Power = ?? W Time = ?? s Energy = ?? kJ

## Repeat with reduced clock

Reduce the maximum GPU clock frequency to 990 MHz (for T4 GPU) by setting "Application Clocks". (Note that 5001 MHz is the memory clock for the T4 GPU, and we are not changing that in this exercise.)


(Optional: query which clocks are supported and choose a different maximum frequency - see end of notebook.)

In [None]:
! nvidia-smi -ac 5001,990

Now re-run the benchmark with the reduced clock by going back and clicking play on all the cells marked with a **bold** text description above, and update the "Record Results" cell with the "Reduced clock" values. You will see that the power is reduced with minimal impact on the benchmark time, meaning that the GPU energy used by the benchmark decreases 😀.
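
To quantify the comparison, the relative change in each quantity can be computed from the two sets of recorded values. This is a minimal sketch; the numbers below are placeholders for illustration only, not measured or expected results, so substitute your own values.

```python
def percent_change(default, reduced):
    """Relative change from the default-clock value, in percent."""
    return 100.0 * (reduced - default) / default


# PLACEHOLDER values for illustration only; replace with your measurements.
default_run = {"power_w": 70.0, "time_s": 60.0}
reduced_run = {"power_w": 50.0, "time_s": 63.0}

for run in (default_run, reduced_run):
    # Energy (kJ) = average power (W) x time (s) / 1000
    run["energy_kj"] = run["power_w"] * run["time_s"] / 1000.0

for key in ("power_w", "time_s", "energy_kj"):
    change = percent_change(default_run[key], reduced_run[key])
    print(f"{key}: {change:+.1f}%")
```

A negative energy change means the reduced clock saved energy, even if the run time increased slightly.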

You can now try some of the optional commands below, to

*   analyse the clock frequency behaviour
*   re-run at different maximum clock frequencies
*   run using a power limit instead of a maximum clock frequency

This benchmark is mainly sensitive to GPU memory bandwidth, which is why reducing the GPU core clock has little effect on its run time. While behavior varies across different codes and GPUs, there is usually similar potential for energy savings. For more details, see my GTC presentations:

https://www.nvidia.com/en-us/on-demand/session/gtc24-s62419/

https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52087/  



## Optional extra commands

Extract the clock readings into a separate file:

In [None]:
! cat GPU_readings.csv | grep -v index | awk '{ print $4}'  > GPU_clock.csv

Create a graph of clock measurements:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('GPU_clock.csv', header=None)
column = data.iloc[:, 0]
plt.figure(figsize=(5, 3))
plt.plot(column)
plt.xlabel('Sample')
plt.ylabel('SM Clock (MHz)')
plt.show()

Query the supported clocks on the GPU (to allow repeated runs with any supported clock):

In [None]:
! nvidia-smi -q -d SUPPORTED_CLOCKS
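
To repeat the run at several clocks, the query output can be turned into a list of `nvidia-smi -ac` commands. The sketch below assumes the output lists each memory clock as `Memory : <N> MHz` followed by its `Graphics : <N> MHz` entries. The sample text is illustrative and abridged (values other than 5001, 1590 and 990 are placeholders), and the function names are hypothetical.

```python
import re


def graphics_clocks(query_output, mem_clock_mhz):
    """Collect the graphics clocks listed under the given memory clock."""
    clocks = []
    current_mem = None
    for line in query_output.splitlines():
        mem = re.match(r"\s*Memory\s*:\s*(\d+)\s*MHz", line)
        gr = re.match(r"\s*Graphics\s*:\s*(\d+)\s*MHz", line)
        if mem:
            current_mem = int(mem.group(1))
        elif gr and current_mem == mem_clock_mhz:
            clocks.append(int(gr.group(1)))
    return clocks


def application_clock_commands(mem_clock_mhz, graphics_clocks_mhz):
    """Build one 'nvidia-smi -ac' command string per graphics clock."""
    return [f"nvidia-smi -ac {mem_clock_mhz},{gr}" for gr in graphics_clocks_mhz]


# Illustrative, abridged sample in the assumed format (not a real query result).
sample_output = """
    Supported Clocks
        Memory                            : 5001 MHz
            Graphics                      : 1590 MHz
            Graphics                      : 1290 MHz
            Graphics                      : 990 MHz
        Memory                            : 405 MHz
            Graphics                      : 645 MHz
"""
for cmd in application_clock_commands(5001, graphics_clocks(sample_output, 5001)):
    print(cmd)
```

Each printed command can be run in a cell (prefixed with `!`) before re-running the cells marked in bold.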

Set a power limit instead of a maximum clock, to compare the two techniques. First, reset the application clocks to the highest supported values, then set the power limit (e.g. to the power you measured above at reduced clock) and re-run the benchmark. Comparing the resulting benchmark time with the reduced-clock run will tell you whether either technique has an advantage.

In [None]:
! nvidia-smi -ac 5001,1590; nvidia-smi -pl 60