# Profiling and Optimisation of CPU and GPU Code
Writing working code is only the first step; the second is **optimising** it for performance, an equally important skill, especially in GPU computing. Before optimising, you need to know *where* the time is being spent, which is where **profiling** comes in. Profiling means measuring the performance characteristics of your program, typically which parts of the code consume the most time or resources.

## Profiling Python Code with cProfile (CPU)
Python has a built-in profiler called **cProfile**. It can help you find which functions are taking up the most time in your program. This is key before you go into GPU acceleration; sometimes, you might find bottlenecks in places you didn't expect or identify parts of the code that would benefit the most from being moved to the GPU.

### How to use cProfile 
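You can run cProfile from the command line: `python -m cProfile -o profile_results.pstats myscript.py`, which runs `myscript.py` under the profiler and writes the statistics to a file. Alternatively, you can drive the profiler from within Python, which lets you profile just one part of a program: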

```Python 
import cProfile
import pstats

import numpy as np  # needed for np.random.choice below

# Suppose we want to profile the Game of Life CPU simulation for 100 steps
def run_game_of_life(grid, steps=100):
    for _ in range(steps):
        grid = life_step(grid)  # life_step as defined earlier
    return grid

# Create an initial grid
N = 200
grid = np.random.choice([0,1], size=(N,N), p=[0.8,0.2])

# Run profiler
profiler = cProfile.Profile()
profiler.enable()
run_game_of_life(grid, steps=100)
profiler.disable()

# Print profiling results
stats = pstats.Stats(profiler).sort_stats('cumtime')
stats.print_stats(10)  # print top 10 functions by cumulative time
```

**Interpreting cProfile output**: When you print the stats, you'll see a table with columns such as:
- **ncalls**: the number of calls to the function.
- **tottime**: the total time spent in the function itself, excluding calls to sub-functions.
- **cumtime**: the cumulative time spent in the function, including sub-functions.
- **filename:lineno(function)**: where the function is defined, and its name.

For example, you could see:
```text
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   100    0.050    0.000    0.400    0.004  life.py:10(life_step)
... (other functions)
```

This would mean that `life_step` was called 100 times and took 0.4 seconds in total (cumtime), of which 0.05 seconds were spent in its own code and the rest in sub-calls (likely NumPy operations). The two percall columns are tottime/ncalls and cumtime/ncalls respectively.
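
To see how `tottime` and `cumtime` differ in practice, here is a minimal, self-contained sketch. The functions `inner_work` and `outer` are hypothetical stand-ins (not part of the Game of Life code): `outer` does almost nothing itself, so its `cumtime` should be large while its `tottime` stays small.

```python
import cProfile
import io
import pstats


def inner_work(n):
    # Hypothetical hot spot: a pure-Python loop
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer(calls=50, n=20_000):
    # Does very little itself; nearly all of its time is spent in inner_work
    return sum(inner_work(n) for _ in range(calls))


profiler = cProfile.Profile()
profiler.enable()
result = outer()
profiler.disable()

# Write the report into a string buffer instead of straight to stdout
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)  # rank by time in each function's own code
report = buffer.getvalue()
print(report)
```

Sorting by `tottime` should put `inner_work` at the top, whereas sorting by `cumtime` would rank `outer` first, since it includes everything it calls.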

**Finding Bottlenecks**: We look for functions with high cumulative time; those are our bottlenecks. In pure Python code, the culprit is often a function containing a heavy loop or computation. In our Game of Life, `life_step` might show up as heavy, or NumPy internal functions might appear (often listed as `{built-in method ...}` or `{method ...}` entries rather than with a file name and line number).

Once you know where the time is going, you can think about optimising: 
- If the bottleneck is a Python loop, can you vectorise it (use NumPy) or use a faster library? 
- If the bottleneck is an external call (say a NumPy function), maybe using the GPU or another algorithm would help. 
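
As an illustration of the first point, the sketch below counts live neighbours twice: once with a pure-Python double loop and once with whole-array `np.roll` shifts. It assumes wrap-around (periodic) edges; the function names are illustrative and are not the `life_step` defined earlier.

```python
import numpy as np


def count_neighbours_loop(grid):
    # Pure-Python double loop with wrap-around (periodic) edges
    rows, cols = grid.shape
    counts = np.zeros_like(grid)
    for r in range(rows):
        for c in range(cols):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the cell itself
                    total += grid[(r + dr) % rows, (c + dc) % cols]
            counts[r, c] = total
    return counts


def count_neighbours_vectorised(grid):
    # Same result: add up eight shifted copies of the whole array
    counts = np.zeros_like(grid)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            counts += np.roll(grid, shift=(dr, dc), axis=(0, 1))
    return counts


rng = np.random.default_rng(0)
grid = (rng.random((40, 40)) < 0.2).astype(np.int64)
same = np.array_equal(count_neighbours_loop(grid), count_neighbours_vectorised(grid))
print("results match:", same)
```

Running both versions under cProfile should show the loop version dominated by Python-level indexing, while the vectorised version spends its time inside NumPy.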

## Profiling GPU Code with NVIDIA Nsight Systems 
When we involve GPUs, cProfile alone isn't enough. cProfile will tell us about the Python side, but we also need to know what's happening on the GPU. Does the GPU spend most of its time computing, or is it idle while waiting for data? Are there a few kernel launches that take a long time or many tiny kernel launches?

**NVIDIA Nsight Systems** is a profiler for GPU applications that provides a timeline of CPU and GPU activity. It can show:
- When your code launched GPU kernels and how long they ran.
- GPU memory transfers between host and device.
- CPU-side functions as well (to correlate CPU and GPU activity).

### Using Nsight Systems 
Nsight Systems can be used via a GUI or the command line. On clusters, you will typically use the CLI, assuming it is installed.

You will need to run your script under Nsight: 
```bash 
nsys profile -o profile_report python my_gpu_script.py
```

This will run `my_gpu_script.py` and record profiling data into `profile_report.nsys-rep` (older versions of Nsight Systems write `profile_report.qdrep` instead).
After that, you typically open the report file in the Nsight Systems GUI for detailed analysis. If you only have the command line, you can use `nsys stats profile_report.nsys-rep` to get summary information.

**What to look for**: When you open the timeline, you'll see tracks for CPU threads and GPU streams. For example:
- The CPU timeline might show segments where Python is executing, interspersed with calls to the CUDA driver/runtime to launch kernels or copy data.
- The GPU timeline will show the kernels (like `elementwise_kernel` or others if using CuPy, or specific names if using custom CUDA code) and memory copy operations (labelled like `memcpyHtoD` for host-to-device copies, etc.).

You might discover:
- A lot of small kernel launches with gaps, which could indicate overhead or that the GPU is not fully utilised (maybe launch fewer, larger kernels if possible).
- A particular kernel takes a majority of the time, which is a candidate for optimisation (maybe its algorithm could be improved, or check if it's doing unnecessary work).
- Significant time spent in memory transfers: you should minimise moving data between CPU and GPU.

For instance, if we profiled our CuPy Game of Life, we might see a kernel for each `cp.roll` call, appearing as several short kernel executions per time step. If there are gaps between them, they could perhaps be fused (an advanced optimisation). Likewise, if we called `cp.asnumpy` every step for visualisation, the resulting memory copies would show up in the timeline, and we would know to copy less often.

Nsight Systems also provides summary statistics, such as overall GPU utilisation and the number of kernel launches.

## General Optimisation Strategies
Once you have the profiling data, here are some strategies:

**On the CPU side (Python)**:
- **Vectorise Operations**: We saw this with NumPy; doing things in batch is faster than Python loops. 
- **Use efficient libraries**: If a certain computation is slow in Python, see if there is a library (NumPy, SciPy, etc) that does it in C or another language. 
- **Optimise algorithms**: Sometimes, a better algorithm speeds things up more than any amount of low-level tuning. For example, if you find that a slow computation has O(N^2) complexity, see if you can make it O(N log N) or similar.
- **Consider multiprocessing or parallelisation**: Use multiple CPU cores (with `multiprocessing` or `joblib` or others) if appropriate.
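
To make the algorithmic point concrete, here is a small sketch with a hypothetical task (checking a list for duplicates) that is not part of the Game of Life code. The first version compares every pair, which is O(N^2); the second sorts first, which is O(N log N).

```python
def has_duplicates_quadratic(values):
    # Compares every pair of elements: O(N^2) comparisons in the worst case
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_sorted(values):
    # Sort first (O(N log N)); any duplicates then sit next to each other
    ordered = sorted(values)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


sample = [3, 1, 4, 1, 5, 9, 2, 6]
print(has_duplicates_quadratic(sample), has_duplicates_sorted(sample))
```

Both functions give the same answer, but on large inputs without duplicates the sorted version should scale far better. Profiling each on growing input sizes is a good way to confirm this.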

**On the GPU side**:
- **Minimise data transfers**: Once data is on the GPU, try to do as much as possible there. Transferring large arrays back and forth every iteration will kill performance. Maybe accumulate results and transfer once at the end, or use pinned memory for faster transfers if you must.
- **Kernel fusion / reducing launch overhead**: Each array operation (like our multiple `cp.roll` calls) launches a separate kernel. Combining operations into one kernel lets the GPU do the work in a single pass. Some libraries and tools can do this for you (for example, CuPy offers the `cp.fuse` decorator for elementwise operations, and deep learning frameworks fuse many operations). Otherwise, you can write a custom CUDA kernel that does more work in one go.
- **Asynchronous overlap**: GPUs operate asynchronously relative to the CPU. The CPU can queue up work and then do something else (such as preparing the next batch of data) while the GPU is processing. Nsight Systems can show whether CPU and GPU work overlaps or whether one is waiting for the other. Ideally, you also overlap communication (PCIe transfers) with computation.
- **Memory access patterns**: This is more advanced, but if you write a custom kernel, coalesced memory access (neighbouring threads accessing consecutive memory addresses) is important for performance. Uncoalesced or random access can slow a kernel down even when it does little arithmetic.
- **Use specialised libraries**: For certain tasks, libraries like cuDNN (deep neural nets), cuBLAS (linear algebra), etc., are heavily optimised. Always prefer a library call (e.g., `cp.fft` or `cp.linalg`) over writing your own, if it fits the need, because those are likely tuned for performance.
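
The first point (minimising transfers) can be sketched as follows. This is an illustration under stated assumptions, not the course's reference code: `smooth_step` is a hypothetical per-step update, and the script falls back to NumPy when CuPy or a GPU is unavailable so that it still runs (in which case no real transfers take place).

```python
import numpy as np

try:
    import cupy as cp
    on_gpu = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    on_gpu = False

xp = cp if on_gpu else np  # fall back to NumPy so the sketch still runs


def to_host(array):
    # Copy a device array back to host memory (plain pass-through for NumPy)
    return cp.asnumpy(array) if on_gpu else np.asarray(array)


def smooth_step(field):
    # Hypothetical per-step update: average each cell with its 4 neighbours
    return (field
            + xp.roll(field, 1, axis=0) + xp.roll(field, -1, axis=0)
            + xp.roll(field, 1, axis=1) + xp.roll(field, -1, axis=1)) / 5.0


def run_copy_every_step(field_host, steps):
    # Anti-pattern: host -> device -> host round trip on every iteration
    for _ in range(steps):
        field_dev = xp.asarray(field_host)
        field_host = to_host(smooth_step(field_dev))
    return field_host


def run_copy_once(field_host, steps):
    # Preferred: one upload, all steps on the device, one download at the end
    field_dev = xp.asarray(field_host)
    for _ in range(steps):
        field_dev = smooth_step(field_dev)
    return to_host(field_dev)


field = np.random.default_rng(0).random((64, 64))
a = run_copy_every_step(field, steps=10)
b = run_copy_once(field, steps=10)
print("results agree:", np.allclose(a, b))
```

On a GPU, a Nsight Systems timeline of `run_copy_every_step` should show a pair of memory copies per iteration, while `run_copy_once` should show only one copy in each direction.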

### Example: Profiling and optimising Game of Life
Suppose profiling shows that our GPU version spends a lot of time launching 8 kernels for the neighbour count (one for each roll) and then another for the `where`. One optimisation strategy would be to write a custom kernel (using `cp.RawKernel`, Numba, or C++/CUDA) that, for each cell, reads its 8 neighbours and computes the next state in a single kernel. That would likely be faster (one launch per step instead of 9). It is more complex to implement, but that is the trade-off: simple code versus optimised code.
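
A minimal sketch of such a fused kernel using `cp.RawKernel` might look like the following. It assumes standard Conway rules, wrap-around (periodic) edges, and a C-contiguous `int32` grid; the names `life_step_fused`, `life_step_gpu` and `life_step_reference` are illustrative and not taken from the earlier code. If no GPU is available, only the NumPy reference runs.

```python
import numpy as np

try:
    import cupy as cp
    on_gpu = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    on_gpu = False

# CUDA C source: one thread per cell, periodic (wrap-around) edges assumed
LIFE_KERNEL_SOURCE = r"""
extern "C" __global__
void life_step_fused(const int* grid, int* out, int rows, int cols) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= rows * cols) return;
    int r = idx / cols;
    int c = idx % cols;
    int neighbours = 0;
    for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            int rr = (r + dr + rows) % rows;
            int cc = (c + dc + cols) % cols;
            neighbours += grid[rr * cols + cc];
        }
    }
    int alive = grid[idx];
    // Conway's rules: birth on 3 neighbours, survival on 2 or 3
    out[idx] = (neighbours == 3 || (alive && neighbours == 2)) ? 1 : 0;
}
"""

# Compilation happens lazily, on the first launch
life_kernel = cp.RawKernel(LIFE_KERNEL_SOURCE, "life_step_fused") if on_gpu else None


def life_step_gpu(grid_dev):
    # One kernel launch per step; expects a C-contiguous int32 CuPy array
    rows, cols = grid_dev.shape
    out = cp.empty_like(grid_dev)
    threads = 256
    blocks = (rows * cols + threads - 1) // threads
    life_kernel((blocks,), (threads,),
                (grid_dev, out, np.int32(rows), np.int32(cols)))
    return out


def life_step_reference(grid):
    # NumPy version of the same rules, used to check the kernel's output
    neighbours = sum(
        np.roll(grid, (dr, dc), axis=(0, 1))
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(np.int32)


grid = (np.random.default_rng(0).random((64, 64)) < 0.2).astype(np.int32)
expected = life_step_reference(grid)
if on_gpu:
    result = cp.asnumpy(life_step_gpu(cp.asarray(grid)))
    print("kernel matches reference:", np.array_equal(result, expected))
```

Checking the kernel against a simple reference like this is worth the extra lines: a fused kernel is easy to get subtly wrong at the grid edges.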

## Analysing Results 

As an example, after profiling, you might summarise:
- **CPU profile result**: 80% of time spent in function `foo()` which does heavy math in pure Python.
    - **Action**: Try to use NumPy or CuPy in `foo()` to exploit vectorisation/GPU. Or rewrite critical parts in C if necessary (via Cython, etc.).
- **GPU profile result (Nsight)**: GPU utilisation only 30%, with many small gaps; a lot of time in data transfer.
    - **Action**: Batch operations to use GPU more fully (e.g., process larger chunks of data at once). Reduce data transfers (e.g., move data transfer outside of a loop so it happens once, not every iteration).
- **GPU profile result 2**: One kernel uses most of GPU time, and GPU is 100% utilised during it, but overall speed is still slow.
    - **Action**: That kernel might need algorithmic improvement. Maybe it’s using a brute-force method that could be optimised (e.g., use shared memory tiling in a custom kernel, or use a more efficient algorithm).

## Putting it into Practice 
- **Profile the baseline CPU implementation**: Identify slow parts. Optimise those as much as possible (maybe by moving to NumPy, etc.) before moving to GPU. (No point offloading something trivial; offload the heavy parts.)
- **Port to GPU (CuPy)**: Then profile the GPU run. Ensure you measure correctly (synchronise before timing to get true GPU time).
- **Profile with Nsight**: see the breakdown of GPU activity.
- **Tackle the biggest issues**:
    - If GPU utilisation is low due to overhead -> try to batch operations.
    - If GPU kernels are the bottleneck -> consider algorithm changes or lower-level optimisations.
    - If CPU is becoming a bottleneck (it can happen, if CPU is preparing data slowly) -> optimise CPU side or move more logic to GPU.
- **Iterate**: optimisation is often iterative; fix one bottleneck, then the next reveals itself.
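
Because GPU work is queued asynchronously, a timer stopped without synchronising may only measure the launch. The helper below is a minimal sketch of timing with synchronisation; `time_call` and `synchronise` are hypothetical names, and the script falls back to NumPy (where synchronisation is a no-op) if CuPy or a GPU is unavailable.

```python
import time

import numpy as np

try:
    import cupy as cp
    on_gpu = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    on_gpu = False

xp = cp if on_gpu else np  # fall back to NumPy so the sketch still runs


def synchronise():
    # Block until all queued GPU work has finished (no-op on the CPU)
    if on_gpu:
        cp.cuda.Stream.null.synchronize()


def time_call(func, *args, repeats=5):
    # Return the best wall-clock time over several runs, in seconds
    func(*args)      # warm-up run: keeps one-off compilation costs out of the timing
    synchronise()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        synchronise()  # without this we might only time the kernel launch
        best = min(best, time.perf_counter() - start)
    return best


data = xp.ones((512, 512), dtype=xp.float32)
elapsed = time_call(lambda a: (a * 2 + 1).sum(), data)
print(f"best of 5: {elapsed * 1e3:.3f} ms on {'GPU' if on_gpu else 'CPU'}")
```

Taking the best of several runs after a warm-up is one reasonable convention; reporting the mean or median is equally valid as long as the synchronisation stays in place.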

## Summary 
- Always measure (profile) first; don’t guess where the bottleneck is – surprises are common!
- Use the right tool: cProfile for Python, Nsight for GPU, and others like memory profilers if needed.
- After identifying hotspots, use strategies like vectorisation, better algorithms, or hardware-specific optimisations.
- Keep an eye on both CPU and GPU usage. Perfect GPU utilisation means little if your CPU is stuck waiting, and vice versa.
- Also, be mindful of readability vs performance. Optimise where needed, but don’t prematurely micro-optimise code that isn’t actually slow.


