# 🚀 CPU vs GPU Benchmarking Notebook
**Author:** ssinha78  
**Description:** Beginner-friendly examples comparing CPU libraries (NumPy, Pandas, scikit-learn) with their GPU counterparts (CuPy, cuDF, cuML),
timed with several benchmarking techniques: `%timeit`, `perf_counter`, cell magics, and profiling.


## 0. Environment Setup
Import the CPU and GPU libraries, plus `perf_counter` for timing.

In [None]:
import time
import numpy as np
import pandas as pd
from time import perf_counter

# GPU libraries (these imports need a CUDA-capable GPU environment with CuPy, cuDF, and cuML installed)
import cupy as cp
import cudf
from cuml.linear_model import LogisticRegression as cuLogisticRegression
from cuml.cluster import KMeans as cuKMeans


## 1. Vector Addition
**Benchmark with `%timeit`**

In [None]:
# CPU (NumPy)
n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)
%timeit -n 5 c = a + b

In [None]:
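# GPU (CuPy)
a_gpu = cp.asarray(a)  # host-to-device copy happens here, outside the timed statement
b_gpu = cp.asarray(b)
# CuPy launches kernels asynchronously, so synchronize inside the timed statement
%timeit -n 5 c_gpu = a_gpu + b_gpu; cp.cuda.Device().synchronize()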
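
Outside of `%timeit`, the same measurement pattern (untimed warm-up, repeated runs, synchronize before stopping the clock) can be written by hand. The helper below is a minimal sketch: `bench` and its `sync` parameter are hypothetical names introduced for illustration, and the CPU demo uses a smaller array than the cells above so it runs quickly.

```python
from time import perf_counter
from statistics import median

import numpy as np


def bench(fn, sync=None, warmup=1, repeat=5):
    """Time fn() `repeat` times and return (best, median) in seconds.

    `sync` is an optional zero-argument callable invoked before the clock
    stops, for example a GPU synchronize call. It is a hypothetical hook
    for this sketch, not part of any library API.
    """
    for _ in range(warmup):  # untimed runs absorb one-time setup costs
        fn()
        if sync is not None:
            sync()
    times = []
    for _ in range(repeat):
        start = perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued asynchronous work to finish
        times.append(perf_counter() - start)
    return min(times), median(times)


# CPU usage with a smaller array than the notebook's, to keep the demo quick
a_demo = np.random.rand(1_000_000)
b_demo = np.random.rand(1_000_000)
best, mid = bench(lambda: a_demo + b_demo)
print(f"best {best * 1e3:.3f} ms, median {mid * 1e3:.3f} ms")

# On a GPU the same helper would be called as (assuming the CuPy arrays exist):
# bench(lambda: a_gpu + b_gpu, sync=cp.cuda.Device().synchronize)
```

Reporting both the best and the median time makes it easier to spot noisy runs: a large gap between the two suggests interference from other work on the machine.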

## 2. Custom CUDA Kernel
**Benchmark with `perf_counter`**

In [None]:
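from cupy import RawKernel
kernel_code = r'''
extern "C" __global__
void square(const float* x, float* y, size_t n) {
    // Compute the index as size_t so it matches the type of n
    size_t i = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = x[i] * x[i];
}'''
sq_kernel = RawKernel(kernel_code, 'square')

# Prepare data
x = np.random.rand(n).astype('float32')
x_gpu = cp.asarray(x)
y_gpu = cp.empty_like(x_gpu)
threads = 256
blocks = (n + threads - 1) // threads  # ceiling division so every element is covered

# Warm-up launch: the first call compiles the kernel, which would otherwise dominate the timing
sq_kernel((blocks,), (threads,), (x_gpu, y_gpu, np.uint64(n)))
cp.cuda.Stream.null.synchronize()

start = perf_counter()
sq_kernel((blocks,), (threads,), (x_gpu, y_gpu, np.uint64(n)))  # np.uint64 matches size_t
cp.cuda.Stream.null.synchronize()
print('RawKernel square time:', perf_counter() - start)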
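
The grid size in the cell above comes from a ceiling division, which usually launches slightly more threads than there are elements. That is why the kernel needs its `if (i < n)` guard. The sketch below works through the arithmetic in plain Python; `launch_config` is a hypothetical helper name, not a CuPy function.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, threads_per_block) covering n elements with a 1-D grid."""
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


n_elems = 10_000_000
blocks, threads = launch_config(n_elems)
launched = blocks * threads
idle = launched - n_elems  # threads whose index is >= n and must do nothing

print(f"blocks={blocks}, threads launched={launched}, idle threads={idle}")
# prints: blocks=39063, threads launched=10000128, idle threads=128
```

Without the bounds check, those 128 extra threads would read and write past the end of the arrays.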

## 3. DataFrame GroupBy & Filter
**Benchmark with `perf_counter` and cell profiling**

In [None]:
from time import perf_counter
n = 5_000_000  # 5 million rows
data = {'key': np.random.randint(0,1000,n),
        'v1': np.random.randn(n),
        'v2': np.random.randn(n)}
pdf = pd.DataFrame(data)

# CPU
start = perf_counter()
res_cpu = pdf.groupby('key').agg({'v1':['sum','mean'],'v2':['min','max']})
cpu_time = perf_counter() - start
print(f'Pandas groupby time: {cpu_time:.4f}s')

# GPU
gdf = cudf.DataFrame.from_pandas(pdf)  # host-to-device copy, excluded from the timing
start = perf_counter()
res_gpu = gdf.groupby('key').agg({'v1':['sum','mean'],'v2':['min','max']})
cp.cuda.Stream.null.synchronize()
gpu_time = perf_counter() - start
print(f'cuDF groupby time: {gpu_time:.4f}s')
print(f'Speedup: {cpu_time/gpu_time:.2f}x')

In [None]:
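%%cudf.pandas.profile
# This magic is only available after `%load_ext cudf.pandas`, which must run
# before pandas is first imported (restart the kernel if pandas is already loaded).
pdf = pd.DataFrame(data)
res = pdf.groupby('key').agg({'v1':['sum','mean'],'v2':['min','max']})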

## 4. Memory Transfer: Pageable vs Pinned
**Benchmark with `perf_counter`**

In [None]:
h = np.random.rand(20_000_000).astype('float32')
# Pageable
start = perf_counter()
d = cp.array(h)
cp.cuda.Stream.null.synchronize()
print('Pageable copy:', perf_counter() - start)

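# Pinned: allocate page-locked host memory and expose it as a NumPy array
h_pin = cp.cuda.alloc_pinned_memory(h.nbytes)
# np.frombuffer takes an element count (an int), not a shape tuple
view = np.frombuffer(h_pin, h.dtype, h.size).reshape(h.shape)
view[...] = h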
start = perf_counter()
d2 = cp.asarray(view)
cp.cuda.Stream.null.synchronize()
print('Pinned copy:   ', perf_counter() - start)
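
Raw copy times are easier to compare once they are converted to an effective transfer rate. The sketch below shows the arithmetic; `transfer_gbs` is a hypothetical helper, and the timings in the loop are made-up values chosen only to illustrate the calculation, not measurements.

```python
def transfer_gbs(nbytes, seconds):
    """Effective transfer rate in GB/s (1 GB = 1e9 bytes)."""
    return nbytes / seconds / 1e9


# 20 million float32 values, as in the cell above: 4 bytes each
nbytes = 20_000_000 * 4

# Hypothetical timings chosen only to illustrate the arithmetic
for label, seconds in [("example A", 0.020), ("example B", 0.010)]:
    print(f"{label}: {transfer_gbs(nbytes, seconds):.1f} GB/s")
```

In the notebook, `h.nbytes` and the measured `perf_counter` differences would take the place of the illustrative values.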

## 5. Machine Learning
### 5.1 Logistic Regression
**Benchmark with `perf_counter`**

In [None]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from cuml.metrics import accuracy_score as cu_acc

X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)
# Fix the seed so the CPU and GPU models are trained and scored on the same split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# CPU
start = perf_counter()
clf = LogisticRegression(max_iter=100).fit(Xtr, ytr)
pred = clf.predict(Xte)
sk_time = perf_counter() - start  # stop the clock before scoring
print('SKLearn acc:', accuracy_score(yte, pred))
print('SKLearn time:', sk_time)

# GPU: copy the same split to the device (the transfer is not timed)
Xtr_c, Xte_c = cp.asarray(Xtr), cp.asarray(Xte)
ytr_c, yte_c = cp.asarray(ytr), cp.asarray(yte)
start = perf_counter()
cuclf = cuLogisticRegression(max_iter=100).fit(Xtr_c, ytr_c)
pred_c = cuclf.predict(Xte_c)
cp.cuda.Stream.null.synchronize()
cuml_time = perf_counter() - start  # stop the clock before scoring
print('cuML acc:', cu_acc(yte_c, pred_c))
print('cuML time:', cuml_time)

### 5.2 K-Means Clustering
**Benchmark with `%timeit`**

In [None]:
# CPU KMeans
from sklearn.cluster import KMeans
data = np.random.rand(1_000_000, 2)
# n_init=1: run a single initialization so each timing covers exactly one fit
%timeit -n3 km = KMeans(n_clusters=3, n_init=1).fit(data)

In [None]:
# GPU KMeans
data_c = cp.asarray(data)
# n_init=1: run a single initialization so each timing covers exactly one fit
%timeit -n3 km_c = cuKMeans(n_clusters=3, n_init=1).fit(data_c); cp.cuda.Device().synchronize()

## 6. Best Practices & Tips
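- Minimize host↔device transfers: move data to the GPU once, in large batches, and keep intermediate results there.
- Use pinned (page-locked) host memory for large or repeated copies.
- Profile often: `%timeit` and `perf_counter` for wall-clock time, Nsight for kernel-level detail.
- Prefer tuned library routines (cuBLAS, cuDNN) over hand-written kernels when one fits the task.
- Tune the launch configuration: 128–256 threads per block is a common starting point, but measure before settling on a value.
- Consider mixed precision (FP16) where the reduced accuracy is acceptable.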
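
The mixed-precision tip trades accuracy for memory and bandwidth. The sketch below uses NumPy on the CPU to make that trade-off concrete; it only illustrates the number format and does not benchmark anything on the GPU. The array size and value range are arbitrary choices for the demo.

```python
import numpy as np

x32 = np.linspace(0.0, 1.0, 1_000_000, dtype=np.float32)
x16 = x32.astype(np.float16)

# Half precision uses half the bytes, so less data moves through memory
print(x32.nbytes, x16.nbytes)  # 4000000 2000000

# The cost is resolution: float16 has an 11-bit significand, so above 2048
# consecutive integers are no longer all representable
print(np.float16(2049) == np.float16(2048))  # True

# Worst-case rounding error introduced by the conversion on [0, 1]
max_err = float(np.max(np.abs(x16.astype(np.float32) - x32)))
print(f"max abs error: {max_err:.2e}")
```

Whether an error of that size is acceptable depends on the workload, so it is worth checking results against a full-precision run before switching.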