<a href="https://colab.research.google.com/github/daisysong76/AI--Machine--learning/blob/main/Custom_Kernel_fusion_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

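# CUDA source: the kernel and its host-side launcher.
# Both are compiled by nvcc, which is required for the <<<...>>> launch syntax.
cuda_kernel = """
#include <torch/types.h>
#include <cuda_runtime.h>
#include <c10/cuda/CUDAException.h>

__global__ void custom_sin_kernel(const float *input, float *output, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        output[idx] = sinf(input[idx]);
    }
}

void custom_sin_cuda(torch::Tensor input, torch::Tensor output) {
    // data_ptr<float>() checks the dtype itself; device and layout are checked by hand.
    TORCH_CHECK(input.is_cuda() && output.is_cuda(), "expected CUDA tensors");
    TORCH_CHECK(input.is_contiguous() && output.is_contiguous(), "expected contiguous tensors");
    TORCH_CHECK(input.numel() == output.numel(), "input and output sizes differ");

    int size = input.numel();
    if (size == 0) return;  // a launch with zero blocks is an error
    int blockSize = 256;
    int gridSize = (size + blockSize - 1) / blockSize;

    custom_sin_kernel<<<gridSize, blockSize>>>(input.data_ptr<float>(), output.data_ptr<float>(), size);
    C10_CUDA_KERNEL_LAUNCH_CHECK();  // surface launch errors as Python exceptions
}
"""

# C++ source: declares the launcher and binds it into a Python module
cpp_wrapper = """
#include <torch/extension.h>

// Defined in the CUDA source above.
void custom_sin_cuda(torch::Tensor input, torch::Tensor output);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &custom_sin_cuda, "Custom sin CUDA kernel");
}
"""

# Load the custom extension
custom_sin_module = load_inline(
    name="custom_sin_module",
    cpp_sources=[cpp_wrapper],
    cuda_sources=[cuda_kernel],
    # No `functions=` argument: cpp_wrapper already defines PYBIND11_MODULE,
    # and asking load_inline to generate bindings too would define it twice.
    extra_cuda_cflags=['-O3'],  # compiler flags for nvcc (link flags go in extra_ldflags)
    verbose=True
)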

# Usage in PyTorch
class CustomSin(nn.Module):
    def forward(self, x):
        # The kernel indexes raw memory linearly, so it needs a contiguous layout.
        x = x.contiguous()
        output = torch.empty_like(x)
        custom_sin_module.forward(x, output)
        return output

# Example Usage
input_tensor = torch.randn(1024, device='cuda')
custom_sin_layer = CustomSin().cuda()
output_tensor = custom_sin_layer(input_tensor)

# Verify the result
reference_output = torch.sin(input_tensor)
print(torch.allclose(output_tensor, reference_output)) # Should print True

**Explanation of a Real-World Custom Kernel Example:**

1.  **CUDA Kernel (`cuda_kernel`):**
    * This string holds the source of the CUDA kernel.
    * `__global__` marks the function as a kernel: it runs on the GPU and is launched from host code.
    * `blockIdx.x`, `blockDim.x`, and `threadIdx.x` are combined into the global thread index, `idx = blockIdx.x * blockDim.x + threadIdx.x`.
    * `sinf()` is the single-precision sine function in CUDA's math library.
    * Each thread computes the sine of one input element and stores it in the output tensor. The `idx < size` check keeps surplus threads in the last block from touching memory past the end of the buffers.
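
The index arithmetic in the kernel can be hard to picture, so here is a minimal CPU-only sketch that replays it sequentially in plain Python. The function `emulate_sin_kernel` and its parameters are hypothetical names for illustration, and the two nested loops stand in for threads that the GPU would run in parallel.

```python
import math

def emulate_sin_kernel(values, grid_dim, block_dim):
    """Replay the kernel body once per emulated thread, one after another."""
    size = len(values)
    output = [0.0] * size
    for block_idx in range(grid_dim):          # plays the role of blockIdx.x
        for thread_idx in range(block_dim):    # plays the role of threadIdx.x
            idx = block_idx * block_dim + thread_idx  # global thread index
            if idx < size:                     # surplus threads do nothing
                output[idx] = math.sin(values[idx])
    return output

# 4 blocks of 256 threads give 1024 emulated threads for 1000 elements,
# so the last 24 threads fail the bounds check and write nothing.
values = [0.1 * i for i in range(1000)]
result = emulate_sin_kernel(values, grid_dim=4, block_dim=256)
```

Each element is written by exactly one emulated thread, which is why the real kernel needs no synchronization between threads.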

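2.  **C++ Wrapper (`cpp_wrapper`):**
    * This C++ code is the interface between PyTorch and the CUDA kernel.
    * `torch/extension.h` provides PyTorch's C++ API, including `torch::Tensor` and the pybind11 binding macros.
    * `cuda_runtime.h` provides the CUDA runtime API.
    * `custom_sin_cuda` function:
        * Obtains raw device pointers from the PyTorch tensors using `data_ptr<float>()`, which assumes contiguous `float32` tensors that live on the GPU.
        * Calculates the launch dimensions: `blockSize` threads per block and `gridSize = (size + blockSize - 1) / blockSize` blocks, a ceiling division that covers every element.
        * Launches the CUDA kernel using `<<<gridSize, blockSize>>>`. This launch syntax is only understood by `nvcc`, so the function body has to be compiled as CUDA source; the C++ source only needs its declaration.
    * `PYBIND11_MODULE`: Creates the Python module and exposes `custom_sin_cuda` to Python under the name `forward`.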
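
To make the launch-dimension calculation concrete, the following sketch reproduces the wrapper's ceiling division in Python and counts how many threads end up without an element to process. `launch_config` is a hypothetical helper, not part of PyTorch or CUDA; the block size of 256 matches the value used in the wrapper.

```python
def launch_config(size, block_size=256):
    """Return (grid_size, idle_threads) for a 1-D launch covering `size` elements."""
    grid_size = (size + block_size - 1) // block_size  # ceiling division
    idle_threads = grid_size * block_size - size       # threads past the end of the data
    return grid_size, idle_threads

for n in (1024, 1000, 1025):
    grid, idle = launch_config(n)
    print(f"size={n}: gridSize={grid}, idle threads={idle}")
# size=1024: gridSize=4, idle threads=0
# size=1000: gridSize=4, idle threads=24
# size=1025: gridSize=5, idle threads=255
```

The last case shows the cost of a size that is just over a multiple of the block size: a whole extra block is launched to handle a single element.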

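3.  **`load_inline`:**
    * This PyTorch function compiles the CUDA kernel and C++ wrapper into a PyTorch extension at runtime and imports the result as a Python module.
    * `cuda_sources` and `cpp_sources` provide the source code. CUDA sources are compiled with `nvcc`, C++ sources with the host C++ compiler.
    * `functions` asks `load_inline` to generate the Python bindings itself, so it should be left out when the C++ source already contains a `PYBIND11_MODULE` block.
    * `extra_cuda_cflags` passes extra compiler flags to `nvcc`. Linker options such as `-lcudart` do not belong here (they go in `extra_ldflags`), and PyTorch already links the CUDA runtime.
    * `verbose=True` prints the compilation output.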
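
Because the build happens at runtime, a missing toolchain only shows up when `load_inline` is called. The sketch below is a rough pre-flight check that uses only the standard library; `missing_build_tools` is a hypothetical helper. It assumes the tools are discoverable on `PATH`, whereas PyTorch can also locate `nvcc` through `CUDA_HOME`, so a reported miss is a hint rather than proof that the build will fail.

```python
import shutil

def missing_build_tools(tools=("nvcc", "ninja")):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_build_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("nvcc and ninja were both found on PATH")
```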

4.  **`CustomSin` Module:**
    * This PyTorch `nn.Module` wraps the custom CUDA kernel.
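    * The `forward` method allocates the output tensor and calls `custom_sin_module.forward`, which in turn launches the CUDA kernel.
    * The kernel writes through raw pointers, so autograd does not record it: the module as written is forward-only and gradients do not flow through it.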

5.  **Usage:**
    * The example creates an input tensor on the GPU.
    * It creates an instance of the `CustomSin` module.
    * It calls the module instance, which runs its `forward` method and executes the custom kernel.
    * It then uses `torch.allclose` to check that the custom kernel matches PyTorch's own `torch.sin()` to within floating-point tolerance.
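
The comparison uses a tolerance rather than exact equality because two single-precision sine implementations can differ in the last bits. As an illustration, here is a plain-Python version of the documented `torch.allclose` criterion, `|actual - expected| <= atol + rtol * |expected|`, with the same default tolerances. This local `allclose` is a stand-in written for this example, it assumes both sequences have the same length, and the perturbed inputs are made-up values rather than measured kernel output.

```python
import math

def allclose(actual, expected, rtol=1e-5, atol=1e-8):
    """Elementwise closeness test following the torch.allclose formula."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))

reference = [math.sin(0.5), math.sin(1.0)]
slightly_off = [r * (1 + 1e-7) for r in reference]  # error on the order of float32 rounding
clearly_off = [r * (1 + 1e-3) for r in reference]   # error far above the tolerance

print(allclose(slightly_off, reference))  # True
print(allclose(clearly_off, reference))   # False
```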

**Real-World Use Cases:**

* **Optimized Activation Functions:** Implement custom activation functions, for example by fusing several elementwise steps into one kernel so that intermediate results never round-trip through GPU memory.
* **Specialized Mathematical Operations:** Implement highly optimized mathematical operations that are not available in PyTorch.
* **Image Processing:** Implement custom image processing kernels for operations like convolution, filtering, or color space conversion.
* **Signal Processing:** Implement custom signal processing kernels for operations like FFT, filtering, or modulation.
* **Graph Neural Networks:** Implement custom graph convolution kernels for specialized graph structures.
* **Any Computationally Intensive Operation:** If you have a computationally intensive operation that is a bottleneck in your model, you can implement it as a custom CUDA kernel to improve performance.