# Profiling and Optimisation of CPU and GPU Code
Writing working code is the first problem; the second is usually **optimising** it for performance, an equally important skill, especially in GPU computing. Before optimising, you need to know *where* the time is being spent, which is where **profiling** comes in. Profiling means measuring the performance characteristics of your program, typically which parts of the code consume the most time or resources.

## Profiling Python Code with cProfile (CPU) 
Python has a built-in profiler called **cProfile**. It can help you find which functions are taking up the most time in your program. This is key before you move on to GPU acceleration: you might find bottlenecks in places you didn't expect, or identify the parts of the code that would benefit most from being moved to the GPU.

### How to use cProfile 
You can run cProfile from the command line: `python -m cProfile -o profile_results.pstats myscript.py`. This runs `myscript.py` under the profiler and writes the stats to a file.
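A saved stats file can be loaded and inspected with the standard-library `pstats` module. The sketch below is self-contained, so instead of `myscript.py` it profiles a small stand-in function (`busy_work`, a hypothetical name) and writes the stats to a temporary file before loading them back.

```python
import cProfile
import io
import os
import pstats
import tempfile


def busy_work(n: int = 50_000) -> int:
    """Hypothetical stand-in for whatever myscript.py would do."""
    total = 0
    for k in range(n):
        total += k * k
    return total


# Produce a stats file, as `python -m cProfile -o <file> myscript.py` would.
stats_path = os.path.join(tempfile.mkdtemp(), "profile_results.pstats")
profiler = cProfile.Profile()
profiler.runcall(busy_work)
profiler.dump_stats(stats_path)

# Load the saved file and print the ten entries with the largest cumtime.
report = io.StringIO()
stats = pstats.Stats(stats_path, stream=report)
stats.strip_dirs().sort_stats("cumtime").print_stats(10)
print(report.getvalue())
```

With a real run you would skip the first half and pass the path of your own `.pstats` file to `pstats.Stats`.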

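You can also drive the profiler from inside a script. The example below profiles a naïve Game of Life implementation:

```python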
import cProfile
import pstats
import numpy as np

# ─────────────────────────────────────────────────────────────────────────────
# 1) Naïve Game of Life implementation
# ─────────────────────────────────────────────────────────────────────────────

def life_step_naive(grid: np.ndarray) -> np.ndarray:
    N, M = grid.shape
    new = np.zeros((N, M), dtype=int)
    for i in range(N):
        for j in range(M):
            cnt = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ni, nj = (i + di) % N, (j + dj) % M
                    cnt += grid[ni, nj]
            if grid[i, j] == 1:
                new[i, j] = 1 if (cnt == 2 or cnt == 3) else 0
            else:
                new[i, j] = 1 if (cnt == 3) else 0
    return new

def simulate_life_naive(N: int, timesteps: int, p_alive: float = 0.2):
    grid = np.random.choice([0, 1], size=(N, N), p=[1-p_alive, p_alive])
    history = []
    for _ in range(timesteps):
        history.append(grid.copy())
        grid = life_step_naive(grid)
    return history

# ─────────────────────────────────────────────────────────────────────────────
# 2) Profiling using cProfile
# ─────────────────────────────────────────────────────────────────────────────

N = 200
STEPS = 100
P_ALIVE = 0.2

profiler = cProfile.Profile()
profiler.enable()                  # ── start profiling ────────────────

# Run the full naïve simulation
simulate_life_naive(N=N, timesteps=STEPS, p_alive=P_ALIVE)

profiler.disable()                 # ── stop profiling ─────────────────

stats = pstats.Stats(profiler).sort_stats('cumtime')
stats.print_stats(10)              # print top 10 functions by cumulative time


```

**Interpreting cProfile output**: When you print the stats, you'll see a table with columns including:

- **ncalls**: number of calls to the function.
- **tottime**: total time spent in the function itself (excluding sub-function calls).
- **cumtime**: cumulative time spent in the function, including sub-function calls.
- **filename:lineno(function)**: where the function is defined, and its name.

```bash 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.034    0.034    4.312    4.312 4263274180.py:27(simulate_life_naive)
      100    4.147    0.041    4.150    0.041 4263274180.py:9(life_step_naive)
... (other functions)
```
In the table above, `ncalls` (100) tells you that `life_step_naive` was invoked 100 times. `tottime` (4.147 s) is the time spent inside `life_step_naive` itself, excluding any functions it calls, and `cumtime` (4.150 s) is that time plus the time spent in those calls. So `life_step_naive` spent about 4.147 s in its own Python loops and only ~0.003 s in the functions it calls (such as `np.zeros`). Array indexing and `%` arithmetic are not function calls from the profiler's point of view, so their cost is counted in `tottime`. The two `percall` columns are simply `tottime/ncalls` and `cumtime/ncalls`. The single call to `simulate_life_naive` has a cumulative time of 4.312 s, which includes all 100 naive steps plus the grid initialisation, the `grid.copy()` calls and the list appends.
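The difference between `tottime` and `cumtime` is easy to see on a toy example. This is a minimal sketch using two hypothetical functions, `inner` and `outer`, and `pstats.Stats.get_stats_profile()` (available from Python 3.9) to read the table as plain objects rather than printed text.

```python
import cProfile
import pstats


def inner(n: int) -> int:
    """Does all of its work in its own loop, so tottime is close to cumtime."""
    total = 0
    for k in range(n):
        total += k * k
    return total


def outer(repeats: int, n: int) -> int:
    """Does almost nothing itself; nearly all of its cumtime is spent in inner()."""
    total = 0
    for _ in range(repeats):
        total += inner(n)
    return total


profiler = cProfile.Profile()
profiler.runcall(outer, 5, 200_000)

# func_profiles maps each function name to its row of the stats table.
rows = pstats.Stats(profiler).get_stats_profile().func_profiles

for name in ("outer", "inner"):
    row = rows[name]
    print(f"{name}: ncalls={row.ncalls} "
          f"tottime={row.tottime:.3f} cumtime={row.cumtime:.3f}")
```

`outer` should show a `tottime` near zero but a `cumtime` at least as large as that of `inner`, which is the same pattern as `simulate_life_naive` and `life_step_naive` above.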

### Finding Bottlenecks 

To pinpoint where your code spends most of its time, look at the cumulative time (`cumtime`) column in the profiler report. This shows the total time in a function plus all of its sub-calls. A high total time (`tottime`) means that the function's own Python code is heavy, whereas a large gap between `cumtime` and `tottime` reveals significant work in the functions it calls (for example NumPy routines or your own helper functions).

In our naive Game of Life example:
- `life_step_naive` is called 100 times, with `tottime ≈ 4.147 s` and `cumtime ≈ 4.150 s`.
    - Almost all the work is in its own nested loops and per-cell logic.
    - Only a few milliseconds are spent in the functions it calls (such as `np.zeros`); grid indexing and `%` arithmetic are counted in `tottime`.
- `simulate_life_naive` appears once with `cumtime ≈ 4.312 s`, which covers the single Python loop plus all 100 calls to `life_step_naive`.

Once you’ve identified the culprit:
- If a Python function has a high `tottime`, consider vectorising its inner loops (e.g. switch to NumPy's `np.roll` + `np.where`) or using a compiled extension.
- If you have heavy external calls under your `cumtime`, then you may want to explore hardware acceleration (e.g. GPU via `CuPy`) or more efficient algorithms.
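The two views of a profile can be compared by sorting the same stats in two ways. This sketch uses hypothetical functions (`slow_loop`, `driver`) and captures the reports in strings: sorting by `tottime` puts functions whose own code is heavy first, while sorting by `cumtime` puts the expensive entry points first.

```python
import cProfile
import io
import pstats


def slow_loop(n: int) -> float:
    """Heavy pure-Python loop: high tottime."""
    total = 0.0
    for k in range(n):
        total += k ** 0.5
    return total


def driver() -> float:
    """Cheap itself, but its cumtime includes every slow_loop() call."""
    total = 0.0
    for _ in range(5):
        total += slow_loop(100_000)
    return total


profiler = cProfile.Profile()
profiler.runcall(driver)


def top_entries(sort_key: str, limit: int = 3) -> str:
    """Return the first `limit` rows of the report sorted by `sort_key`."""
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer).strip_dirs()
    stats.sort_stats(sort_key).print_stats(limit)
    return buffer.getvalue()


by_own_time = top_entries("tottime")     # where is the Python code itself slow?
by_cumulative = top_entries("cumtime")   # which call trees are expensive?
print(by_own_time)
print(by_cumulative)
```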

## Profiling the CPU-Vectorised Implementation using NumPy

```python
import cProfile
import pstats
import numpy as np

# ─────────────────────────────────────────────────────────────────────────────
# 1) NumPy Game of Life implementation
# ─────────────────────────────────────────────────────────────────────────────

def life_step_numpy(grid: np.ndarray) -> np.ndarray:
    neighbours = (
        np.roll(np.roll(grid, 1, axis=0), 1, axis=1) +
        np.roll(np.roll(grid, 1, axis=0), -1, axis=1) +
        np.roll(np.roll(grid, -1, axis=0), 1, axis=1) +
        np.roll(np.roll(grid, -1, axis=0), -1, axis=1) +
        np.roll(grid, 1, axis=0) +
        np.roll(grid, -1, axis=0) +
        np.roll(grid, 1, axis=1) +
        np.roll(grid, -1, axis=1)
    )
    return np.where((neighbours == 3) | ((grid == 1) & (neighbours == 2)), 1, 0)

def simulate_life_numpy(N: int, timesteps: int, p_alive: float = 0.2):
    grid = np.random.choice([0, 1], size=(N, N), p=[1-p_alive, p_alive])
    history = []
    for _ in range(timesteps):
        history.append(grid.copy())
        grid = life_step_numpy(grid)
    return history

# ─────────────────────────────────────────────────────────────────────────────
# 2) Profiling using cProfile
# ─────────────────────────────────────────────────────────────────────────────

N = 200
STEPS = 100
P_ALIVE = 0.2

profiler = cProfile.Profile()
profiler.enable()  # ── start profiling ────────────────────────

# Run the full NumPy-based simulation
simulate_life_numpy(N=N, timesteps=STEPS, p_alive=P_ALIVE)

profiler.disable()  # ── stop profiling ─────────────────────────

stats = (
    pstats.Stats(profiler)
          .strip_dirs()                  # remove full paths
          .sort_stats('cumtime')         # sort by cumulative time
)
# show only the NumPy functions in the report
stats.print_stats(r"life_step_numpy|simulate_life_numpy")
```
```bash
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100    0.028    0.000    0.055    0.001 2865127924.py:9(life_step_numpy)
     1    0.000    0.000    0.011    0.011 2865127924.py:22(simulate_life_numpy)
```

### Interpreting the Results
**`life_step_numpy`**

- **`ncalls = 100`**: called once per generation for 100 generations, the same as before.
- **`tottime ≈ 0.028 s`**: time spent in the function's own body. This includes the array additions and comparisons (`+`, `==`, `|`, `&`), which run as NumPy operators rather than as profiled function calls, but excludes the time inside `np.roll` and `np.where`.
- **`cumtime ≈ 0.055 s`**: `tottime` plus the time spent inside the `np.roll` and `np.where` calls.

**`simulate_life_numpy`**

- **`ncalls = 1`**: the top-level driver is run once.
- **`cumtime ≈ 0.011 s`**: meant to cover grid initialisation, all 100 calls to `life_step_numpy`, and the history list appends. By definition this value cannot be smaller than the `cumtime` of `life_step_numpy` (0.055 s), so treat the 0.011 s shown in the listing with caution.

### Why is it so much faster than the naive version?
- **Bulk C-level operations**
    - The `np.roll` shifts (twelve calls producing eight shifted copies), the array additions and the single `np.where` all do their per-element work in NumPy's compiled loops.
    - cProfile attributes only a few tens of milliseconds to this function because the heavy lifting happens outside Python's interpreter.
- **Minimal Python overhead**
    - We pay one Python-level call per generation (100 calls total), versus 200 × 200 × 8 = 320,000 inner-loop iterations per generation (32 million in total) in the naive version.
    - That drops the step function's `tottime` from ~4 s (naive) to ~0.03 s (NumPy).
- **Cache and vector-friendly memory access**
    - NumPy works on large contiguous buffers, so the CPU prefetches data and applies vector instructions.
    - The naïve per-cell modulo arithmetic and scattered indexing defeat those hardware optimisations.

Overall, by moving the neighbour counting and rule application into a few large NumPy calls, we cut down Python‐level time from over 4 seconds to under 0.1 seconds for 100 generations on a 200×200 grid.
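A speed-up only counts if the fast version still gives the same answer. The sketch below checks the two implementations against each other on a small, non-square grid. It restates both functions so that it runs on its own, and it writes the neighbour sum with tuple arguments to `np.roll`, a more compact form assumed here to stand in for the nested rolls above.

```python
import numpy as np


def life_step_naive(grid: np.ndarray) -> np.ndarray:
    """Reference implementation: explicit loops with wrap-around edges."""
    n_rows, n_cols = grid.shape
    new = np.zeros((n_rows, n_cols), dtype=int)
    for i in range(n_rows):
        for j in range(n_cols):
            cnt = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    cnt += grid[(i + di) % n_rows, (j + dj) % n_cols]
            if grid[i, j] == 1:
                new[i, j] = 1 if cnt in (2, 3) else 0
            else:
                new[i, j] = 1 if cnt == 3 else 0
    return new


def life_step_numpy(grid: np.ndarray) -> np.ndarray:
    """Vectorised implementation: sum the eight shifted copies of the grid."""
    neighbours = sum(
        np.roll(grid, (di, dj), axis=(0, 1))
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    return np.where((neighbours == 3) | ((grid == 1) & (neighbours == 2)), 1, 0)


rng = np.random.default_rng(seed=0)
grid = rng.integers(0, 2, size=(20, 30))   # small, non-square test grid

naive_grid, numpy_grid = grid.copy(), grid.copy()
for _ in range(5):
    naive_grid = life_step_naive(naive_grid)
    numpy_grid = life_step_numpy(numpy_grid)

print("implementations agree:", np.array_equal(naive_grid, numpy_grid))
```

Using a non-square grid helps catch row/column mix-ups that a square grid would hide.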


## Profiling GPU Code with NVIDIA Nsight Systems 
When we involve GPUs, cProfile alone isn't enough. cProfile will tell us about the Python side, but we also need to know what's happening on the GPU. Does the GPU spend most of its time computing, or is it idle while waiting for data? Are there a few kernel launches that take a long time or many tiny kernel launches?

**NVIDIA Nsight Systems** is a profiler for GPU applications that provides a timeline of CPU and GPU activity. It can show:
- When your code launched GPU kernels and how long they ran.
- GPU memory transfers between host and device.
- CPU-side functions as well (to correlate CPU and GPU activity).

### Using Nsight Systems 
Nsight Systems can be used via a GUI or the command line. On clusters, you will usually use the CLI (`nsys`), assuming it is installed.

You will need to run your script under Nsight: 
```bash 
nsys profile -o profile_report python my_gpu_script.py
```

This runs `my_gpu_script.py` and records profiling data into a report file: `profile_report.nsys-rep` with recent versions of Nsight Systems, or `profile_report.qdrep` with older ones.
After that, you typically open the report file in the Nsight Systems GUI for detailed analysis. If you only have the command line, you can run `nsys stats profile_report.nsys-rep` to get summary information.

**What to look for**: When you open the timeline, you'll see tracks for CPU threads and GPU streams. For example:
- The CPU timeline might show segments where Python is executing, interspersed with calls to the CUDA driver/runtime to launch kernels or copy data.
- The GPU timeline will show the kernels (like `elementwise_kernel` or others if using CuPy, or specific names if using custom CUDA code) and memory copy operations (labelled like `memcpyHtoD` for host-to-device copies, etc.).

You might discover:
- A lot of small kernel launches with gaps, which could indicate overhead or that the GPU is not fully utilised (maybe launch fewer, larger kernels if possible).
- A particular kernel takes a majority of the time, which is a candidate for optimisation (maybe its algorithm could be improved, or check if it's doing unnecessary work).
- Significant time spent in memory transfers: you should minimise moving data between CPU and GPU.

For instance, if we profiled our CuPy Game of Life, we might see at least one kernel for each `cp.roll` call, appearing as several short kernel executions per time step. If there are gaps between them, they could perhaps be fused (an advanced optimisation). Likewise, if we called `cp.asnumpy` every step for visualisation, the resulting memory copies would show up in the timeline, and if they dominated we would know to copy less often.

Nsight Systems also provides summary statistics, such as the number of kernel launches and the total time spent in kernels and memory copies.

## General Optimisation Strategies
Once you have the profiling data, here are some strategies:

**On the CPU side (Python)**:
- **Vectorise Operations**: We saw this with NumPy; doing things in batch is faster than Python loops. 
- **Use efficient libraries**: If a certain computation is slow in Python, see if there is a library (NumPy, SciPy, etc) that does it in C or another language. 
- **Optimise algorithms**: Sometimes, a better algorithm speeds things up more than any amount of low-level tuning. For example, if a slow computation has O(N²) complexity, see if you can make it O(N log N) or similar.
- **Consider multiprocessing or parallelisation**: Use multiple CPU cores (with `multiprocessing` or `joblib` or others) if appropriate.
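To illustrate the algorithmic point, here is a minimal sketch on a made-up task that is not part of the Game of Life example: finding the smallest gap between any two values in a list. The brute-force version compares every pair, which is O(N²); sorting first means only neighbouring values need comparing, which is O(N log N).

```python
import math
import random


def min_gap_quadratic(values: list[float]) -> float:
    """Compare every pair of values: O(N^2) comparisons."""
    best = float("inf")
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            best = min(best, abs(values[i] - values[j]))
    return best


def min_gap_sorted(values: list[float]) -> float:
    """Sort first, then compare only adjacent values: O(N log N)."""
    ordered = sorted(values)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


random.seed(0)
data = [random.uniform(0.0, 1000.0) for _ in range(500)]

same = math.isclose(min_gap_quadratic(data), min_gap_sorted(data))
print("both methods agree:", same)
```

Profiling both functions with cProfile on larger inputs should show the gap between them widening as N grows.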

**On the GPU side**:
- **Minimise data transfers**: Once data is on the GPU, try to do as much as possible there. Transferring large arrays back and forth every iteration will kill performance. Maybe accumulate results and transfer once at the end, or use pinned memory for faster transfers if you must.
- **Kernel fusion / reducing launch overhead**: Each array operation (like our multiple `cp.roll` calls) launches one or more separate kernels. Combining operations into one kernel lets the GPU do the work in a single pass over the data. Some tools help with this: CuPy provides the `cupy.fuse` decorator for fusing elementwise operations, and deep learning compilers fuse many operations automatically. Otherwise, you can write a custom CUDA kernel that does more work in one go.
- **Asynchronous overlap**: GPUs operate asynchronously relative to the CPU. The CPU can queue up work and then do something else (such as preparing the next batch of data) while the GPU is processing. Nsight Systems can show whether your CPU and GPU overlap or whether one is waiting for the other. Ideally, you overlap communication (PCIe transfers) with computation.
- **Memory access patterns**: This is more advanced, but if you write custom kernels, coalesced memory access (neighbouring threads accessing consecutive memory addresses) is important for performance. Uncoalesced or random access can slow a kernel down even when it does little arithmetic.
- **Use specialised libraries**: For certain tasks, libraries like cuDNN (deep neural nets), cuBLAS (linear algebra), etc., are heavily optimised. Always prefer a library call (e.g., `cp.fft` or `cp.linalg`) over writing your own, if it fits the need, because those are likely tuned for performance.

### Example: Profiling and optimising Game of Life
Suppose profiling shows that our GPU version is spending a lot of time launching 8 kernels for the neighbour count (one for each roll) and then another for the `where`. One optimisation strategy is to write a custom kernel (with `cp.RawKernel`, Numba, or C++/CUDA) that, for each cell, reads its 8 neighbours and computes the next state in a single launch. That would likely be faster (one launch per step instead of 9). It is more complex to implement, but that is the trade-off: simple code vs optimised code.
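A fused kernel of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not a tuned implementation: the kernel name `life_step`, the 16×16 block size and the `int32` grid type are choices made here. It only launches the kernel when CuPy and a CUDA device are available, and checks the result against a NumPy reference.

```python
import numpy as np

try:
    import cupy as cp
    GPU_AVAILABLE = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp = None
    GPU_AVAILABLE = False

# CUDA C source: one thread per cell, periodic (wrap-around) boundaries.
KERNEL_SOURCE = r"""
extern "C" __global__
void life_step(const int* grid, int* out, int n_rows, int n_cols) {
    int j = blockDim.x * blockIdx.x + threadIdx.x;   // column index
    int i = blockDim.y * blockIdx.y + threadIdx.y;   // row index
    if (i >= n_rows || j >= n_cols) return;

    int count = 0;
    for (int di = -1; di <= 1; ++di) {
        for (int dj = -1; dj <= 1; ++dj) {
            if (di == 0 && dj == 0) continue;
            int ni = (i + di + n_rows) % n_rows;
            int nj = (j + dj + n_cols) % n_cols;
            count += grid[ni * n_cols + nj];
        }
    }
    int alive = grid[i * n_cols + j];
    out[i * n_cols + j] = (count == 3 || (alive == 1 && count == 2)) ? 1 : 0;
}
"""

# Compilation is deferred until the kernel is first launched.
_kernel = cp.RawKernel(KERNEL_SOURCE, "life_step") if GPU_AVAILABLE else None


def life_step_numpy(grid):
    """CPU reference used to check the kernel's output."""
    neighbours = sum(
        np.roll(grid, (di, dj), axis=(0, 1))
        for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)
    )
    return np.where((neighbours == 3) | ((grid == 1) & (neighbours == 2)), 1, 0)


def life_step_fused(grid_gpu):
    """Advance one generation with a single kernel launch (needs a GPU)."""
    grid_gpu = cp.ascontiguousarray(grid_gpu, dtype=cp.int32)
    n_rows, n_cols = grid_gpu.shape
    out = cp.empty_like(grid_gpu)
    block = (16, 16)
    blocks = ((n_cols + 15) // 16, (n_rows + 15) // 16)
    _kernel(blocks, block, (grid_gpu, out, np.int32(n_rows), np.int32(n_cols)))
    return out


if GPU_AVAILABLE:
    rng = np.random.default_rng(0)
    grid = rng.integers(0, 2, size=(64, 48)).astype(np.int32)
    result = cp.asnumpy(life_step_fused(cp.asarray(grid)))
    matches = bool(np.array_equal(result, life_step_numpy(grid)))
    print("fused kernel matches NumPy:", matches)
else:
    matches = None
    print("No CUDA device found; kernel not run.")
```

Whether this beats the `cp.roll` version on a given GPU and grid size is something to confirm by profiling both.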

## Analysing Results 

As an example, after profiling, you might summarise:
- **CPU profile result**: 80% of time spent in function `foo()`, which does heavy maths in pure Python.
    - **Action**: Try to use NumPy or CuPy in `foo()` to exploit vectorisation/GPU. Or rewrite critical parts in C if necessary (via Cython, etc.).
- **GPU profile result (Nsight)**: GPU utilisation only 30%, with many small gaps; a lot of time in data transfer.
    - **Action**: Batch operations to use the GPU more fully (e.g., process larger chunks of data at once). Reduce data transfers (e.g., move a data transfer outside of a loop so it happens once, not every iteration).
- **GPU profile result 2**: One kernel uses most of the GPU time, and the GPU is 100% utilised during it, but overall speed is still slow.
    - **Action**: That kernel might need algorithmic improvement. Maybe it is using a brute-force method that could be optimised (e.g., use shared memory tiling in a custom kernel, or use a more efficient algorithm).

## Putting it into Practice 
- **Profile the baseline CPU implementation**: Identify slow parts. Optimise those as much as possible (maybe by moving to NumPy, etc.) before moving to GPU. (No point offloading something trivial; offload the heavy parts.)
- **Port to GPU (CuPy)**: Then profile the GPU run. Ensure you measure correctly (synchronise before timing to get true GPU time).
- **Profile with Nsight**: see the breakdown of GPU activity.
- **Tackle the biggest issues**:
    - If GPU utilisation is low due to overhead -> try to batch operations.
    - If GPU kernels are the bottleneck -> consider algorithm changes or lower-level optimisations.
    - If CPU is becoming a bottleneck (it can happen, if CPU is preparing data slowly) -> optimise CPU side or move more logic to GPU.
- **Iterate**: optimisation is often iterative; fix one bottleneck, then the next reveals itself.
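The "synchronise before timing" advice can be wrapped in a small helper. The sketch below is one possible shape for it (`time_call` and its parameters are names chosen here). It runs with NumPy as written; the CuPy usage is shown only in comments, since it needs a GPU.

```python
import time
from typing import Callable, Optional

import numpy as np


def time_call(fn: Callable[[], object],
              sync: Optional[Callable[[], None]] = None,
              repeats: int = 5) -> float:
    """Return the best wall-clock time of `fn()` over `repeats` runs.

    `sync` is called before starting and before stopping the clock, so that
    asynchronously queued work is included in the measurement.
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()                      # drain work queued earlier
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                      # wait for fn's own work to finish
        best = min(best, time.perf_counter() - start)
    return best


# CPU example: NumPy calls are synchronous, so no sync function is needed.
a = np.random.default_rng(0).random((500, 500))
cpu_seconds = time_call(lambda: a @ a)
print(f"NumPy matmul: {cpu_seconds:.4f} s")

# With CuPy (assumption: CuPy is installed and a GPU is present):
#   import cupy as cp
#   a_gpu = cp.asarray(a)
#   gpu_seconds = time_call(lambda: a_gpu @ a_gpu,
#                           sync=cp.cuda.Device().synchronize)
```

Without the second `sync()` call, the timer would stop as soon as the kernel had been queued, which can make GPU code look far faster than it really is. CuPy also ships `cupyx.profiler.benchmark`, which handles this synchronisation for you.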

## Summary 
- Always measure (profile) first; don’t guess where the bottleneck is – surprises are common!
- Use the right tool: cProfile for Python, Nsight for GPU, and others like memory profilers if needed.
- After identifying hotspots, use strategies like vectorisation, better algorithms, or hardware-specific optimisations.
- Keep an eye on both CPU and GPU usage. Perfect GPU utilisation means nothing if your CPU is stuck waiting, and vice versa.
- Also, be mindful of readability vs performance. Optimise where needed, but don't prematurely micro-optimise code that isn't actually slow.


### Exercise

Run the different versions of the code to get a feel for each style, and explore how their run times scale with problem size. The aim is less about writing new code (in practice, much of GPU programming is offloading existing work rather than writing kernels) and more about measuring. Try to recreate, for your own system, the style of graph shown on the previous slide.
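One possible starting point for the scaling measurements is sketched below. It times only a NumPy step function over a few grid sizes; the sizes, step count and the helper name `seconds_per_step` are assumptions made here, and the same loop could be repeated for the naive and GPU versions.

```python
import time

import numpy as np


def life_step_numpy(grid: np.ndarray) -> np.ndarray:
    """One Game of Life generation with wrap-around edges."""
    neighbours = sum(
        np.roll(grid, (di, dj), axis=(0, 1))
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    return np.where((neighbours == 3) | ((grid == 1) & (neighbours == 2)), 1, 0)


def seconds_per_step(n: int, steps: int = 10,
                     p_alive: float = 0.2, seed: int = 0) -> float:
    """Average wall-clock time of one generation on an n x n grid."""
    rng = np.random.default_rng(seed)
    grid = (rng.random((n, n)) < p_alive).astype(int)
    start = time.perf_counter()
    for _ in range(steps):
        grid = life_step_numpy(grid)
    return (time.perf_counter() - start) / steps


sizes = [50, 100, 200, 400]
timings = {n: seconds_per_step(n) for n in sizes}

for n, t in timings.items():
    print(f"N={n:4d}  {t * 1e3:8.3f} ms per step")
```

Plotting these timings against N on log-log axes, with one line per implementation, gives a scaling graph that can be compared across systems.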