# C Kernel Engine: PyTorch-Based Kernel Testing

This notebook shows how to:

1. Build the `libckernel_engine.so` shared library.
2. Call C kernels (LayerNorm, GELU, softmax, GEMM) from Python via `ctypes`.
3. Compare their outputs and gradients against PyTorch references.

The workflow is:

- **Write/modify a kernel** in `src/kernels/*.c`.
- **Rebuild the shared library** with `make`.
- **Run the tests** from this notebook, which reuse the helpers in `unittest/*.py`.

You can keep this notebook open while iterating on C code to get fast feedback on correctness (max diff, RMSE, etc.).

In [None]:
# Run this notebook from the C-Kernel-Engine root. The listing below should
# then include the Makefile, src/ and unittest/.
import os

print("CWD:", os.getcwd())
print("Contents:", os.listdir("."))

## 1. Build the Shared Library

This runs the `Makefile` to produce `libckernel_engine.so`.
Re-run this cell after changing any C kernel implementation.

In [None]:
%%bash
set -e
make
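
Once the build succeeds, the library can be opened from Python with `ctypes`. The following is a minimal sketch, not the loader the test helpers use: it assumes `libckernel_engine.so` ends up in the current working directory, and `load_ckernel_library` is a hypothetical helper name. Kernel signatures are deliberately not declared here because they depend on `include/ckernel_engine.h`.

```python
import ctypes
from pathlib import Path


def load_ckernel_library(path="libckernel_engine.so"):
    """Load the shared library, failing early if it has not been built.

    The default path is an assumption: it expects the .so in the CWD.
    """
    lib_path = Path(path).resolve()
    if not lib_path.is_file():
        raise FileNotFoundError(f"{lib_path} not found; run `make` first")
    # An absolute path avoids depending on the dynamic loader's search path.
    return ctypes.CDLL(str(lib_path))


try:
    lib = load_ckernel_library()
    print("loaded:", lib)
except FileNotFoundError as exc:
    print("build needed:", exc)
```

Checking for the file before calling `ctypes.CDLL` turns a fairly opaque loader error into a message that points back at the build step.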

## 2. LayerNorm: Forward and Backward Tests

We reuse the helpers from `unittest/test_layernorm.py`:

- Forward: compare the naive, rolled, and unrolled C kernels against `torch.layer_norm`.
- Backward: compare the C `layernorm_backward_kernel` against PyTorch autograd for the gradients with respect to `x`, `gamma`, and `beta`.
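
For orientation, the forward computation being checked can be written out in a few lines. This is a dependency-free sketch on nested lists rather than the repository's code. It assumes normalization over the last dimension with a biased variance and `eps=1e-5`, which matches PyTorch's defaults; the C kernels' own `eps` is not stated here. `layernorm_reference` is a hypothetical name.

```python
import math


def layernorm_reference(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last dimension of a [T, D] list of lists.

    Returns (out, mean, rstd), with one mean and one rstd per row.
    """
    out, means, rstds = [], [], []
    for row in x:
        d = len(row)
        mean = sum(row) / d
        var = sum((v - mean) ** 2 for v in row) / d  # biased variance
        rstd = 1.0 / math.sqrt(var + eps)  # reciprocal standard deviation
        out.append([(v - mean) * rstd * g + b
                    for v, g, b in zip(row, gamma, beta)])
        means.append(mean)
        rstds.append(rstd)
    return out, means, rstds


x = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
out, mean, rstd = layernorm_reference(
    x, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0]
)
print("mean:", mean)
print("rstd:", rstd)
```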

In [None]:
import importlib
from unittest import test_layernorm

# Reload in case you've edited unittest/test_layernorm.py
importlib.reload(test_layernorm)

print("=== LayerNorm forward tests ===")
test_layernorm.run_single_test(T=32, D=128)

print("\n=== LayerNorm backward tests ===")
test_layernorm.run_backward_test(T=16, D=32)
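
One thing to watch for: the `unittest/` directory shares its name with Python's standard-library `unittest` package, so which of the two `from unittest import ...` resolves to depends on `sys.path` order and on how the directory is set up. If that import fails in your environment, one possible workaround is to load the test file by explicit path. The sketch below is an assumption-laden alternative, not how the repository does it; `load_module_from_path` and the module name `ck_test_layernorm` are hypothetical.

```python
import importlib.util
import sys
from pathlib import Path


def load_module_from_path(module_name, file_path):
    """Import a Python file by explicit path, bypassing package-name lookup."""
    file_path = Path(file_path)
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {file_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before executing the file
    spec.loader.exec_module(module)
    return module


# Hypothetical usage, assuming the notebook runs from the repository root:
# test_layernorm = load_module_from_path(
#     "ck_test_layernorm", "unittest/test_layernorm.py"
# )
```

To pick up edits to the test file, call the loader again rather than relying on `importlib.reload`. Test files that import sibling modules may need additional path setup.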

## 3. GELU Forward Test

Compare `gelu_fast_inplace` vs PyTorch's `F.gelu(approximate="tanh")`.
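
The tanh approximation that `approximate="tanh"` selects is short enough to write out by hand. The sketch below uses only the standard library and compares it with the exact erf-based GELU. Whether `gelu_fast_inplace` uses exactly these constants is an assumption; the function names here are illustrative.

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU (the formula behind approximate="tanh")."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for v in (-2.0, 0.0, 1.0, 3.0):
    print(f"x={v:+.1f}  tanh={gelu_tanh(v):.6f}  exact={gelu_exact(v):.6f}")
```

The two variants differ by a small but nonzero amount, so a test against the wrong PyTorch variant can fail on tolerance even when the kernel is correct.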

In [None]:
from unittest import test_gelu
importlib.reload(test_gelu)

print("=== GELU forward test ===")
test_gelu.run_single_test(N=1024)

## 4. Causal Softmax: Forward and Backward

Forward:
- `causal_softmax_head_major` vs a PyTorch row-wise masked softmax.

Backward:
- `backward_causal_softmax_head_major` vs a pure-PyTorch Jacobian-vector product implementation.
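
Both references are compact when written for a single head. The following sketch works on one `[T, T]` score matrix as nested lists; the head-major `[H, T, T]` case would repeat it per head. It assumes masked positions are written as zeros, which may or may not match what the C kernels store there. The function names are illustrative.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over [T, T] where row i only sees columns 0..i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


def causal_softmax_backward(y, dy):
    """Per row: dx = y * (dy - sum_j dy_j * y_j).

    Masked entries stay zero because y is zero there.
    """
    dx = []
    for y_row, dy_row in zip(y, dy):
        dot = sum(a * b for a, b in zip(y_row, dy_row))
        dx.append([a * (b - dot) for a, b in zip(y_row, dy_row)])
    return dx


scores = [[0.0, 9.0, 9.0], [0.0, 0.0, 9.0], [0.0, 0.0, 0.0]]
y = causal_softmax(scores)
dx = causal_softmax_backward(y, [[1.0, 2.0, 3.0]] * 3)
for y_row, dx_row in zip(y, dx):
    print(y_row, dx_row)
```

Each row of `dx` sums to zero, which is a quick sanity check to run on a kernel's output before comparing element by element.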

In [None]:
from unittest import test_softmax, test_softmax_backward
importlib.reload(test_softmax)
importlib.reload(test_softmax_backward)

print("=== Softmax forward test ===")
test_softmax.run_single_test(H=2, T=8)

print("\n=== Softmax backward test ===")
test_softmax_backward.run_single_test(H=2, T=8)

## 5. GEMM Tests for LLM Shapes

We test all GEMM variants (`naive`, `avx512`, `fine_grained`, `blocked_serial`) against PyTorch matmul + bias, for shapes that matter to LLMs:

- `[T, D] · [D, 4D]` (MLP1)
- `[T, d] · [T, d]ᵀ` (QKᵀ-style, producing `[T, T]`)
- `[T, T] · [T, d]` (SV, attention scores times values)
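
To make the shapes concrete, here is a naive reference GEMM with bias on nested lists, applied to the three cases. It is a sketch for checking dimensions, not one of the C variants; the sizes `T=4`, `D=8`, `d=2` and the helper names are arbitrary choices for illustration. The QKᵀ-style case transposes the second operand before multiplying.

```python
def gemm_bias(a, b, bias):
    """C = A @ B + bias for A [M, K], B [K, N], bias [N], as nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) + bias[j] for j in range(n)]
        for i in range(m)
    ]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def ones(rows, cols):
    return [[1.0] * cols for _ in range(rows)]


def shape(mat):
    return (len(mat), len(mat[0]))


T, D, d = 4, 8, 2
mlp1 = gemm_bias(ones(T, D), ones(D, 4 * D), [0.0] * (4 * D))  # [T, 4D]
qk = gemm_bias(ones(T, d), transpose(ones(T, d)), [0.0] * T)   # [T, T]
sv = gemm_bias(qk, ones(T, d), [0.0] * d)                      # [T, d]
print(shape(mlp1), shape(qk), shape(sv))
```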

In [None]:
from unittest import test_gemm
importlib.reload(test_gemm)

print("=== GEMM tests ===")
test_gemm.run_all()

## 6. Exploring Intermediate Values

Because these tests are ordinary Python modules, you can import them and inspect intermediate tensors directly. For example, to inspect the per-row mean and reciprocal standard deviation (`rstd`) that LayerNorm computes:

```python
import torch
from unittest import test_layernorm

x = torch.randn(4, 8)
gamma = torch.randn(8)
beta = torch.randn(8)

out, mean, rstd = test_layernorm.run_c_layernorm_naive(x, gamma, beta)
print("mean:", mean)
print("rstd:", rstd)
```

You can follow the same pattern for any new kernel you add (e.g., RMSNorm):

1. Implement the kernel in `src/kernels/*.c` and expose it in `include/ckernel_engine.h`.
2. Add a small `unittest/test_<kernel>.py` that:
   - Builds random tensors.
   - Computes a PyTorch reference.
   - Calls the C kernel and reports max diff / RMSE.
3. Add a section here that imports and runs that test.
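
As a rough illustration of step 2 for RMSNorm, the skeleton below builds random input, computes a reference, calls a kernel, and reports max diff and RMSE. It is a sketch under several assumptions: it uses plain Python lists instead of PyTorch tensors so it runs without the shared library, `eps=1e-6` is an arbitrary choice, and `kernel_fn` is a hypothetical stand-in for a `ctypes` wrapper around the C kernel.

```python
import math
import random


def rmsnorm_reference(x, gamma, eps=1e-6):
    """RMSNorm per row of [T, D]: x / sqrt(mean(x^2) + eps) * gamma."""
    out = []
    for row in x:
        rstd = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * rstd * g for v, g in zip(row, gamma)])
    return out


def run_single_test(kernel_fn, T=4, D=8, seed=0):
    """Compare kernel_fn with the reference; return (max_diff, rmse)."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
    gamma = [rng.gauss(0.0, 1.0) for _ in range(D)]
    ref = rmsnorm_reference(x, gamma)
    got = kernel_fn(x, gamma)  # hypothetical wrapper around the C kernel
    diffs = [g - r
             for got_row, ref_row in zip(got, ref)
             for g, r in zip(got_row, ref_row)]
    max_diff = max(abs(v) for v in diffs)
    rmse = math.sqrt(sum(v * v for v in diffs) / len(diffs))
    print(f"RMSNorm T={T} D={D}: max diff={max_diff:.3e}, RMSE={rmse:.3e}")
    return max_diff, rmse


# With the reference passed in as the "kernel", both metrics are zero.
run_single_test(rmsnorm_reference)
```

Passing the kernel in as a callable keeps the comparison logic separate from the `ctypes` plumbing, so the same harness can be reused for each variant of a kernel.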

This keeps the workflow interactive and repeatable as you design new kernels.