![GPU ecosystem](img/02_GPU_Ecosystem/Folie1.PNG)

![GPU Ecosystem](img/13_ecosystem.png)
(Image credits: https://blogs.nvidia.com/blog/what-is-cuda-2/)

### CuPy
CuPy provides a NumPy-compatible API whose backend is implemented in CUDA C++. Users already familiar with NumPy can therefore move many array computations to the GPU by changing little more than the import statement.

In [1]:
import cupy as cp
import numpy as np
from numba import cuda

cp_stream = cp.cuda.Stream()
cp_stream.use()

# We will call this whenever we need to wait for CuPy's asynchronous GPU work to finish.
cp_stream.synchronize()

[NVSHARE][INFO]: Using CUDA device with id 0
[NVSHARE][INFO]: Using scheduler socket path: /var/run/nvshare/scheduler0.sock
[NVSHARE][INFO]: Successfully initialized nvshare GPU
[NVSHARE][INFO]: Client ID = d581a26579178066


We will use `cp_stream.synchronize()` to ensure that asynchronous GPU calls are completed before moving on.  
At this point, we will not go into the details of what a CUDA stream is or how it works.  

Many examples online synchronize with `cp.cuda.Stream.null.synchronize()`, which waits on the **default stream**. That is fine if you are the only one using the GPU.  
In our environment, however, the GPUs are shared, and we use some additional abstractions under the hood to manage this fairly and safely. For this reason, our examples explicitly call `cp_stream.synchronize()`.
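
Why synchronization matters can be seen in a small timing sketch. GPU work is launched asynchronously, so a timer stopped right after a call may only measure the launch, not the computation. The names `demo_stream` and `timed_matmul` are illustrative and not part of the course environment, and the NumPy fallback is an assumption added so the snippet also runs on a machine without a usable GPU.

```python
import contextlib
import time

import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:   # also raises if CUDA is unusable
        raise RuntimeError("no CUDA device")
    xp = cp
    demo_stream = cp.cuda.Stream()             # hypothetical stream just for this demo
except Exception:
    xp = np                                    # assumption: CPU fallback without a GPU
    demo_stream = None


def timed_matmul(n=1000):
    """Return (result, time until the call returned, time until the work finished)."""
    ctx = demo_stream if demo_stream is not None else contextlib.nullcontext()
    with ctx:                                  # demo_stream is current inside this block
        a = xp.ones((n, n), dtype=xp.float32)
        if demo_stream is not None:
            demo_stream.synchronize()          # input must be ready before timing starts

        start = time.perf_counter()
        result = a @ a                         # on the GPU, this returns before the work is done
        t_launch = time.perf_counter() - start

        if demo_stream is not None:
            demo_stream.synchronize()          # wait until the GPU has actually finished
        t_done = time.perf_counter() - start
    return result, t_launch, t_done


result, t_launch, t_done = timed_matmul()
print(f"call returned after {t_launch * 1e3:.3f} ms, "
      f"work finished after {t_done * 1e3:.3f} ms")
```

On a GPU the first number is typically much smaller than the second; on the CPU fallback the two are nearly identical because NumPy executes synchronously.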

In [2]:
# Create CuPy arrays of 10,000 elements and perform some operations
x = cp.arange(10_000, dtype=cp.float32)        # array([0., 1., 2., ...], dtype=float32) on GPU
y = cp.ones(10_000, dtype=cp.float32)          # array([1., 1., ...], dtype=float32) on GPU
z = x * 2 + y                                  # elementwise operations on GPU
print("CuPy result z (on GPU):", z)            # This is a cupy.ndarray
print("Type of z:", type(z))
print("Retrieve z to host as NumPy:", cp.asnumpy(z))  # convert to NumPy array for printing

CuPy result z (on GPU): [1.0000e+00 3.0000e+00 5.0000e+00 ... 1.9995e+04 1.9997e+04 1.9999e+04]
Type of z: <class 'cupy.ndarray'>
Retrieve z to host as NumPy: [1.0000e+00 3.0000e+00 5.0000e+00 ... 1.9995e+04 1.9997e+04 1.9999e+04]


When you run this, you should see that `z` is a `cupy.ndarray`, and that converting it to a NumPy array yields the expected values (`[1., 3., 5., ...]` for the computation above). The syntax and results match NumPy, but all calculations took place on the GPU.

We can also demonstrate some common mathematical operations and verify that CuPy’s results match NumPy’s:

In [13]:
data_cpu = np.linspace(0, 1, 5)
data_gpu = cp.linspace(0, 1, 5)
print("NumPy exp:", np.exp(data_cpu))
print("CuPy exp:", cp.exp(data_gpu))

# Verify they match (after bringing the CuPy result to host)
assert np.allclose(np.exp(data_cpu), cp.asnumpy(cp.exp(data_gpu)))
print("CuPy matches NumPy for exp computation.")

NumPy exp: [1.         1.28402542 1.64872127 2.11700002 2.71828183]
CuPy exp: [1.         1.28402542 1.64872127 2.11700002 2.71828183]
CuPy matches NumPy for exp computation.


This shows how you can use `cp.exp`, `cp.sin`, `cp.log`, etc., just like their NumPy counterparts. Most universal functions (ufuncs) are supported by CuPy.

#### Memory transfers
Note that creating `data_gpu` directly on the GPU (as above with `cp.linspace`) avoids any CPU-GPU copy. If you instead wrote `data_gpu = cp.asarray(data_cpu)`, the NumPy array would be copied from host memory to GPU memory. In HPC it is important to minimize host-device transfers, because the PCIe link between CPU and GPU has much lower bandwidth than the GPU's own memory.
**Try to perform as much computation on the GPU as possible once the data is there, and only bring results back to the host when needed.**
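
A minimal sketch of this advice, assuming the array fits in GPU memory: copy the input to the device once, chain several operations there, and bring back only the final scalar. The alias `xp` and the NumPy fallback are assumptions added so the snippet also runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:   # also raises if CUDA is unusable
        raise RuntimeError("no CUDA device")
    xp = cp
except Exception:
    xp = np   # assumption: CPU fallback for machines without a usable GPU

data_cpu = np.linspace(0.0, 1.0, 1_000_000)

# One host-to-device copy (no copy at all when xp is NumPy).
data_dev = xp.asarray(data_cpu)

# Several operations chained on the device; intermediates never leave GPU memory.
shifted = xp.exp(data_dev) + 1.0
total_dev = xp.sqrt(shifted).sum()

# One small device-to-host copy: only the final scalar comes back.
total = float(total_dev)
print("sum of sqrt(exp(x) + 1):", total)
```

Calling `cp.asnumpy` after every intermediate step would produce the same number but would move the full array across the bus each time.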

One of CuPy's strengths is how well its memory model interoperates with other Python frameworks. **NumPy** arrays can be transferred to and from CuPy with a single call, and CuPy arrays can be shared with **Numba's CUDA** extension without copying the data, because both libraries support the `__cuda_array_interface__` protocol.

This tight integration allows researchers and developers to combine the ease of NumPy, the GPU acceleration of CuPy, and the just-in-time compilation capabilities of Numba in a single workflow. Below are some common examples of casting between these frameworks.

In [16]:
# NumPy -> CuPy
array_np = np.array([1, 2, 3])
array_cp = cp.asarray(array_np)

# CuPy -> NumPy
array_np = cp.asnumpy(array_cp)

# CuPy -> Numba (DeviceNDArray)
array_numba = cuda.as_cuda_array(array_cp)

# Numba -> CuPy
array2_cp = cp.asarray(array_numba)

#### CuPy for FFT and Linear Algebra

Beyond basic arithmetic, CuPy also supports advanced numerical routines. For example, Fast Fourier Transforms (FFT) are available via `cupy.fft` (using NVIDIA’s cuFFT library under the hood), and linear algebra routines (e.g. matrix multiplication, inversion, SVD) via `cupy.linalg` (using cuBLAS, cuSOLVER, etc.). This means you can accelerate signal processing and linear algebra workloads with minimal changes to your code.

### Task 1

Create an example using the Fast Fourier Transform (FFT):

1. Generate a large synthetic signal (e.g., a sine wave or a combination of waves)
2. Compute its frequency spectrum on the CPU using NumPy  
3. Compute the same frequency spectrum on the GPU using CuPy  
4. Compare the two results to verify that they match
5. Compare the runtime of the two implementations
6. Compare the total runtime (including the host-to-device (H2D) and device-to-host (D2H) transfers) to the kernel-only runtime

In [27]:
#ToDo

Difference between CPU FFT and GPU FFT: 3.753457592312974e-10
CPU implementation: 34.5 ms ± 352 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
GPU kernel:         669 μs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
GPU implementation: 3.75 ms ± 8.45 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


For linear algebra, CuPy’s `linalg` module includes functions such as `cp.linalg.inv` (matrix inverse), `cp.linalg.svd` (singular value decomposition), and `cp.linalg.solve` (linear systems). Matrix multiplication is available through the `@` operator or `cp.dot`.
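
As a small illustration of the `linalg` API, the sketch below factorizes a matrix with `svd` and checks that the factors reproduce it. The matrix size, the alias `xp`, and the NumPy fallback are assumptions chosen for illustration so the snippet also runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:   # also raises if CUDA is unusable
        raise RuntimeError("no CUDA device")
    xp = cp
except Exception:
    xp = np   # assumption: CPU fallback for machines without a usable GPU

rng = np.random.default_rng(0)
a = xp.asarray(rng.standard_normal((200, 50)))   # 200 x 50 matrix, float64

# Thin SVD: a = u @ diag(s) @ vh
u, s, vh = xp.linalg.svd(a, full_matrices=False)

# Scaling the columns of u by s is equivalent to u @ diag(s).
a_rebuilt = (u * s) @ vh
max_err = float(xp.abs(a - a_rebuilt).max())

print("shapes:", u.shape, s.shape, vh.shape)
print("max reconstruction error:", max_err)
```

The only library-specific line is the choice of `xp`; the factorization and the check are identical for NumPy and CuPy.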

### Task 2

Use CuPy to solve a system of linear equations:

1. Create a large random matrix `A` and a random vector `b`.  
2. Solve the system $Ax = b$ on the GPU using `cp.linalg.solve`.  
3. Solve the same system on the CPU using `np.linalg.solve`.  
4. Compare the runtime of both approaches.  
5. Verify that the solutions are consistent (e.g., by checking the maximum absolute difference).

In [34]:
#ToDo

Max difference between CPU and GPU results: 3.70e-04
CPU implementation: 281 ms ± 723 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
GPU kernel:         5.98 ms ± 1.84 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Writing Custom CUDA Kernels with CuPy (Advanced)

This section is optional, since you already learned to write custom GPU kernels with Numba in a previous notebook. It is still good to know that CuPy can compile and launch kernels written in CUDA C, which lets you go “closer to the metal” when necessary. In practice, many problems can be solved with CuPy’s high-level functions alone, but if you have an existing CUDA kernel or need something very specialized, CuPy can incorporate it.

In [36]:
# A simple example of a custom CUDA kernel using CuPy
kernel_code = r"""
extern "C" __global__
void add_arrays(const float* x, const float* y, float* out, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if(idx < n) {
        out[idx] = x[idx] + y[idx];
    }
}
"""
# Compile the kernel
module = cp.RawModule(code=kernel_code, name_expressions=["add_arrays"])
add_arrays_kernel = module.get_function("add_arrays")

# Prepare data
n = 1000000
x = cp.random.random(n, dtype=cp.float32)
y = cp.random.random(n, dtype=cp.float32)
out = cp.empty(n, dtype=cp.float32)

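# Launch kernel (grid = ceil(n/256) blocks, block = 256 threads)
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Pass n as a 32-bit integer to match the kernel's `int n` parameter
# (CuPy passes a plain Python int as a 64-bit integer).
add_arrays_kernel((blocks,), (threads_per_block,), (x, y, out, cp.int32(n)))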
cp_stream.synchronize()

# Verify correctness
assert cp.allclose(out, x + y)
print("Custom kernel produced correct results.")

Custom kernel produced correct results.


### RAPIDS: GPU DataFrames and Machine Learning (cuDF, cuML, etc.)

RAPIDS is an open-source suite of libraries led by NVIDIA that brings the familiar data science stack (pandas, SQL, scikit-learn, etc.) to the GPU. The key idea is to provide GPU-accelerated DataFrames, machine learning algorithms, and more, with APIs that mirror popular Python libraries, allowing users to leverage GPUs with minimal code changes. RAPIDS is built on top of the CUDA libraries. Two primary components of RAPIDS we will explore are cuDF and cuML:

cuDF – a GPU DataFrame library that provides a pandas-like API for loading, filtering, joining, grouping, and analyzing tabular data on the GPU. You can think of cuDF as “pandas on CUDA”. It uses a C++ GPU library called libcudf under the hood for performance.

cuML – a suite of GPU-accelerated machine learning algorithms (part of RAPIDS) that mirror the scikit-learn API. It offers algorithms like linear regression, logistic regression, PCA, clustering (e.g., K-Means, DBSCAN), random forests, etc., all running on GPUs.

RAPIDS also includes cuGraph (GPU graph analytics with a NetworkX-like API), cuSignal (GPU signal processing), cuxfilter (GPU-accelerated dashboards), and others.

⚠️ **Important:** Please switch to the **`hpcp_rapids`** kernel.  
Managing all required packages for this course in a single, cross-compatible environment is challenging, so we provide this preconfigured kernel for you.


In [5]:
import cudf
import numpy as np
import pandas as pd

# Create a cuDF DataFrame from Python data
df = cudf.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value': [10.0, 15.5, 7.2, 23.3, 42.0]
})
print("cuDF DataFrame:\n", df)
print("Compute summary stats (like pandas describe):\n", df.describe())

cuDF DataFrame:
    id  value
0   1   10.0
1   2   15.5
2   3    7.2
3   4   23.3
4   5   42.0
Compute summary stats (like pandas describe):
              id      value
count  5.000000   5.000000
mean   3.000000  19.600000
std    1.581139  13.947939
min    1.000000   7.200000
25%    2.000000  10.000000
50%    3.000000  15.500000
75%    4.000000  23.300000
max    5.000000  42.000000


The DataFrame is displayed much as pandas would show it, with an index and the `id` and `value` columns. `df.describe()` computes the summary statistics on the GPU. The API is visibly pandas-like: many pandas methods (such as `df.head()`, `df.tail()`, `df.mean()`, etc.) are available in cuDF.
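
To make the pandas-like API concrete, the sketch below runs the same filter-and-aggregate function on whichever DataFrame library is available. The function name `mean_value_above` and the pandas fallback are assumptions for illustration, so the snippet also runs outside the `hpcp_rapids` kernel.

```python
import pandas as pd

try:
    import cudf
    cudf.DataFrame({"probe": [0]})   # fails early if no usable GPU is present
    xdf = cudf
except Exception:
    xdf = pd                         # assumption: fall back to pandas on CPU-only machines


def mean_value_above(df_lib, threshold):
    """Build a small DataFrame with df_lib and average 'value' over rows above threshold."""
    df = df_lib.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "value": [10.0, 15.5, 7.2, 23.3, 42.0],
    })
    selected = df[df["value"] > threshold]   # boolean-mask filtering, as in pandas
    return float(selected["value"].mean())


result = mean_value_above(xdf, 10.0)
print(f"{xdf.__name__}: mean of values above 10 = {result:.4f}")
```

The body of `mean_value_above` contains no library-specific calls, which is the point: the same DataFrame code runs on pandas or cuDF.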

Now, let’s do something more substantial with cuDF. We’ll generate a large random dataset and perform some typical data analysis tasks on GPU: filtering, arithmetic, grouping, and aggregation. We will also compare the performance with pandas to highlight the benefit.

In [6]:
# Generate a large random dataset
N = 1_000_000  # one million rows
# Using NumPy to generate data, then we'll transfer to cuDF
age = np.random.randint(18, 90, size=N)            # random integer ages from 18 to 89 (upper bound is exclusive)
income = np.random.random(size=N) * 100000         # random income between 0 and 100k
# Create a pandas DataFrame (just to compare, optional)
pdf = pd.DataFrame({'age': age, 'income': income})

# Create a cuDF DataFrame from the NumPy arrays
gdf = cudf.DataFrame({
    'age': age,       # cudf can accept numpy arrays, it will copy to GPU
    'income': income
})

In [7]:
# 1. Filtering: find how many people are over 65 years old
num_over_65 = gdf[gdf.age > 65].shape[0]
print("Number of people over 65:", num_over_65)

# 2. Arithmetic: add a new column for income in thousands
gdf['income_k'] = gdf['income'] / 1000.0
print(gdf.head())

# 3. Group by: compute average income by age group (decade of age)
gdf['age_decade'] = (gdf['age'] // 10) * 10
avg_income_by_decade = gdf.groupby('age_decade').income.mean()
print("Average income by age decade: ", end="")
print(avg_income_by_decade)

Number of people over 65: 333448
   age        income   income_k
0   55  56036.257305  56.036257
1   26  82063.073881  82.063074
2   30  94068.201321  94.068201
3   27  17022.360668  17.022361
4   22  88255.036825  88.255037
Average income by age decade: age_decade
40    50124.512709
10    50400.116154
20    49887.161590
70    50054.429544
80    50045.276308
30    49958.066663
50    50180.339086
60    49880.791125
Name: income, dtype: float64


In [8]:
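# Compare with pandas
%timeit pdf[pdf.age > 65]
%timeit gdf[gdf.age > 65]
# Note: the pandas groupby computes the decade on the fly, while cuDF reuses the
# precomputed 'age_decade' column, so the pandas line does slightly more work.
%timeit pdf.groupby(pdf.age // 10).income.mean()
%timeit gdf.groupby('age_decade').income.mean()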

6.46 ms ± 32.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.42 ms ± 13.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
18.2 ms ± 116 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.8 ms ± 10.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### Using cuML (GPU Machine Learning)

After preparing data with cuDF, you may want to run machine learning or analytics on it. This is where cuML comes in. cuML provides GPU implementations of many machine learning algorithms with an API that closely mirrors scikit-learn, so you can often swap `sklearn` for `cuml` in your imports and get an accelerated version of the algorithm. Supported algorithms include linear regression, logistic regression, ridge regression, K-Means clustering, DBSCAN, PCA, t-SNE, UMAP, random forests, and more. Many cuML algorithms accept cuDF DataFrames or CuPy arrays as input and return results as cuDF or CuPy objects, so the data stays on the GPU throughout the pipeline.

Let’s do a simple example: K-Means clustering on a set of points. We will generate a synthetic dataset of points on the GPU (using CuPy) and then use cuML’s KMeans to cluster them. Finally, we’ll compare the result with CPU scikit-learn KMeans on a smaller sample (for validation).

In [10]:
from cuml.cluster import KMeans
from sklearn.cluster import KMeans as skKMeans
import cupy as cp

# Generate synthetic data: 100,000 points in 2D forming 3 clusters
# We'll sample points around three centers for clear clustering
cp.random.seed(0)
N = 100_000
# Three cluster centers
centers = cp.array([[0,0], [5,5], [0,5]], dtype=cp.float32)
labels_true = cp.random.randint(0, 3, size=N)  # true cluster labels (random assignment)
# Points = cluster center + some random noise
points = centers[labels_true] + 0.5 * cp.random.randn(N, 2).astype(cp.float32)

# GPU KMeans
gpu_km = KMeans(n_clusters=3, max_iter=100)
gpu_labels = gpu_km.fit_predict(points)
print("First 10 predicted cluster labels (GPU):", gpu_labels[:10])

# (Optional) CPU KMeans on a subset to verify
sample_points = cp.asnumpy(points[:5000])  # take 5000 points to CPU for quick check
cpu_km = skKMeans(n_clusters=3, n_init='auto', max_iter=100, random_state=0)
cpu_labels = cpu_km.fit_predict(sample_points)
# Compare cluster centers (the cluster order may differ between the two implementations)
print("GPU cluster centers:\n", cp.asnumpy(gpu_km.cluster_centers_))
print("CPU cluster centers (approx, on subset):\n", cpu_km.cluster_centers_)


First 10 predicted cluster labels (GPU): [1 2 1 2 2 2 1 1 1 0]
GPU cluster centers:
 [[-4.6802746e-04  4.9970641e+00]
 [ 8.1728620e-04 -3.5804582e-03]
 [ 4.9998760e+00  5.0027781e+00]]
CPU cluster centers (approx, on subset):
 [[-0.00886273  5.002054  ]
 [ 5.005546    4.997921  ]
 [-0.00847459 -0.01666689]]


### Task 3: Linear Regression using cuML

Suppose we want to fit a linear model:

$$
y = 3.5 \cdot x_0 - 2.2 \cdot x_1 + 1.0 + \text{noise}
$$

Here, **3.5** and **-2.2** are the coefficients for the features $x_0$ and $x_1$, and **1.0** is the bias (the constant offset).

1. Instantiate `LinearRegression` using `algorithm='svd'`
2. Fit the model on `(X, y)`
3. Retrieve and print the estimated coefficients and intercept/bias
4. Compare them to the ground-truth values (coefficients `[3.5, -2.2]`, intercept `1.0`) and report the absolute errors

In [13]:
from cuml.linear_model import LinearRegression

# Generate random data for linear regression: y = 3.5*x0 - 2.2*x1 + 1.0 + noise
M, D = 500000, 2  # 500k samples, 2 features
X = cp.random.rand(M, D, dtype=cp.float32)
true_coeff = cp.array([3.5, -2.2], dtype=cp.float32)
y = X.dot(true_coeff) + 1.0  # true bias = 1.0
y += 0.01 * cp.random.randn(M, dtype=cp.float32)  # add some noise

# Fit linear regression on GPU
#ToDo: fit the model and store the results in `coef_gpu` and `intercept_gpu`

# Print results
print("Estimated coefficients (GPU):", coef_gpu)
print("Estimated intercept (GPU):", intercept_gpu)

Estimated coefficients (GPU): [ 3.5000002 -2.1999416]
Estimated intercept (GPU): 0.9999674558639526


### JAX: GPU-Accelerated Numerical Computing

JAX is a Python library for accelerator-oriented numerical computing that brings high-performance, GPU/TPU-backed computation to familiar NumPy-like code. In a nutshell, JAX can be thought of as “NumPy on the CPU, GPU, and TPU” with support for automatic differentiation, which makes it popular in machine learning research. Its capabilities go well beyond machine learning, though: it is equally useful for general-purpose numerical tasks in scientific computing and HPC. With JAX, you write ordinary Python functions using `jax.numpy` (which mirrors the NumPy API) and get GPU-accelerated execution without writing GPU-specific code. If a GPU (or other accelerator) is available and a matching JAX build is installed, JAX will by default allocate arrays and run computations on it. This lets developers and researchers write code once and run it efficiently on CPUs or GPUs.

To try this, go back to our day 1 tutorial on JAX and update the JAX installation to one that includes CUDA support. Once that is done, the same code runs on the GPU without further changes.
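
A quick way to check which backend is in use is to list the devices JAX sees and run a small computation. The sketch below is independent of the day 1 material, and the function name `mean_squared` is made up for illustration. It also runs on a CPU-only installation, in which case `jax.devices()` reports CPU devices.

```python
import jax
import jax.numpy as jnp

# Lists GPU devices if a CUDA-enabled JAX build found one, otherwise CPU devices.
print("JAX devices:", jax.devices())


@jax.jit                      # compile the function for the default device
def mean_squared(x):
    return jnp.mean(x ** 2)


grad_fn = jax.grad(mean_squared)   # automatic differentiation of the same function

x = jnp.arange(4.0)           # [0., 1., 2., 3.]
value = mean_squared(x)       # (0 + 1 + 4 + 9) / 4 = 3.5
gradient = grad_fn(x)         # d/dx mean(x^2) = 2x / n = [0., 0.5, 1., 1.5]

print("value:", value)
print("gradient:", gradient)
```

Nothing in the code names a device; JAX places `x` and runs both functions on its default backend.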