# Introduction to GPU Architecture 

In this notebook, we'll explore the basics of GPU architecture and introduce **Triton**, a Python library that makes it easier to write efficient GPU programs. 

The purpose of this notebook is to help users onboard to developing custom GPU kernels in Python, using **Google Colab** to access GPU resources. By the end, you’ll have a foundational understanding of how GPUs work and be ready to write your first Triton kernel.

---

### Core Concepts

- **Cores**: 
  - GPUs are designed with thousands of small processing units called **cores**. Each core can execute operations simultaneously, making GPUs ideal for parallel processing tasks like deep learning and scientific computing.
  - This parallelism allows a GPU to handle many operations at once, providing massive speedup over serial processing on CPUs.

- **Memory Hierarchy**: 
  - **Global Memory**: 
    - This is the main memory accessible by all cores. It has a large capacity but is relatively slow. Global memory is often used to store large datasets, like images or matrices, that threads will work on.
  - **Shared Memory**:
    - A small, high-speed memory accessible only to threads within the same thread block. Shared memory is critical for operations where multiple threads need to access or modify the same data, such as matrix multiplication.
  - **Registers**:
    - Registers are the fastest type of memory, used for temporary data storage within each thread. They are private to each thread and offer minimal latency, making them ideal for frequently accessed data.

- **Thread Blocks**:
  - A **thread block** is a group of threads that execute concurrently, can synchronize with each other, and can exchange data through **shared memory**.
  - In Triton, each **program instance** plays the role of a CUDA thread block. You write code that operates on a whole block of data, and the compiler decides how to distribute that work across the block's threads and when to stage data in shared memory. In CUDA, by contrast, you manage threads, synchronization, and shared memory yourself.

  - **Why Thread Blocks?**
    - By dividing the overall workload into thread blocks, GPUs can process data in parallel, where each thread performs a part of the computation. This parallelism enables GPUs to handle large datasets much faster than serial CPU processing.

  - **Execution in Triton**:
    - In Triton kernels, you choose the **grid** (how many program instances to launch) and the **block size** (how many elements each instance processes). Both matter for performance, because they determine how much parallelism is exposed and how memory is accessed. The number of threads per instance is not set directly; it follows from the `num_warps` launch option, which Triton fills in with a default if you omit it.

  - **Example**:
    - In image processing, each thread might represent a single pixel. By processing all pixels in parallel, a GPU can efficiently handle high-resolution images in real-time, allowing for quick operations on each pixel independently.

- **Block Size**:
  - **Block size** refers to the number of elements that a single Triton program instance operates on at once.
  - It is defined as a power of two (e.g., 128, 256, 512). `tl.arange` requires a power-of-two range, and these sizes also align well with the GPU’s memory structure and access patterns. The choice of block size can have a significant impact on performance and memory efficiency, so it’s often tuned to the task and data size.
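To make the relationship between data size, block size, and grid size concrete, here is a minimal sketch in plain Python (no GPU needed). The helper names `is_power_of_two` and `num_program_instances` are hypothetical and used only for illustration; the ceiling division is the same calculation used later in this notebook to size the launch grid.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def num_program_instances(n_elements: int, block_size: int) -> int:
    """Number of blocks needed to cover n_elements (ceiling division)."""
    if not is_power_of_two(block_size):
        raise ValueError("block_size should be a power of two")
    return (n_elements + block_size - 1) // block_size


# An exact multiple, a size with a partially filled last block, and a tiny input
for n in (1024, 1000, 1):
    print(f"N={n}: {num_program_instances(n, 128)} program instance(s)")
```

When `N` is not a multiple of the block size, the last program instance is only partially filled, which is why kernels need a bounds check.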

---

With these foundational concepts in mind, we’ll start by setting up Triton in Colab and verifying that our GPU is ready to use. Let’s dive in!


# Setting up Google Colab for Triton

To run Triton code in Google Colab, follow these setup steps:

1. **Enable GPU**:
   - Go to **Runtime > Change runtime type**.
   - Set **Hardware accelerator** to **GPU**, then click **Save**.

2. **Install Triton**:
   - Run the following command in the next cell to install Triton.

In [8]:
# Install Triton
!pip install triton

# Verifying GPU Availability

Let's check if a GPU is available in this Colab environment. We can use `torch.cuda.is_available()` to confirm. If a GPU is detected, we’ll print its name.

In [7]:
import torch

if torch.cuda.is_available():
    print("GPU is available:", torch.cuda.get_device_name(0))
else:
    print("No GPU found. Please enable GPU under Runtime > Change runtime type.")

No GPU found. Please enable GPU under Runtime > Change runtime type.


# Writing a Simple Triton Kernel

Triton makes it easy to write GPU kernels with a Pythonic interface. We'll start with a basic operation: **vector addition**.

### Vector Addition

Consider two vectors, $A$ and $B$, each with $N$ elements. We want to compute their element-wise sum to produce a new vector, $C$, where each element is defined by:

$C[i] = A[i] + B[i]$

This is a great starting point for understanding GPU parallelization, as each element addition is independent and can be done in parallel.
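For reference, the same computation written as a serial loop on the CPU might look like the sketch below. The function name `vector_add_reference` is hypothetical; it simply restates the formula above and is handy as a baseline for checking GPU results.

```python
def vector_add_reference(a, b):
    """Serial element-wise sum: C[i] = A[i] + B[i] for every index i."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same length")
    # Each iteration is independent of the others, which is what makes
    # this operation easy to parallelize.
    return [x + y for x, y in zip(a, b)]


A = [1.0, 2.0, 3.0]
B = [10.0, 20.0, 30.0]
print(vector_add_reference(A, B))  # [11.0, 22.0, 33.0]
```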


In [4]:
# Import Triton libraries for writing and running GPU kernels
import triton
import triton.language as tl

# Step 1: Define the kernel function for vector addition.
# This kernel will add elements from two input vectors, A and B,
# and store the result in a third vector, C.
@triton.jit
def vector_add_kernel(A_ptr, B_ptr, C_ptr, N, BLOCK_SIZE: tl.constexpr):
    
    # Compute the element offsets handled by this program instance.
    # `tl.arange(0, BLOCK_SIZE)` produces the local offsets 0..BLOCK_SIZE-1, and
    # `tl.program_id(0) * BLOCK_SIZE` shifts them to this instance's slice of the vectors.
    idx = tl.arange(0, BLOCK_SIZE) + tl.program_id(0) * BLOCK_SIZE

    # Build a mask so that offsets past the end of the vectors are ignored.
    # Only offsets < N are loaded, added, and stored.
    mask = idx < N
    
    # Load a block of elements from global memory at the computed offsets.
    # `tl.load` fetches elements from A and B, using the mask to avoid invalid accesses.
    a = tl.load(A_ptr + idx, mask=mask)
    b = tl.load(B_ptr + idx, mask=mask)
    
    # Perform element-wise addition of the two blocks.
    # The whole block of BLOCK_SIZE elements is added in one expression.
    c = a + b
    
    # Store the result in vector C at the corresponding offsets, with masking.
    # `tl.store` writes the block of results back to global memory.
    tl.store(C_ptr + idx, c, mask=mask)


### Understanding the Kernel

In this kernel, we use several Triton functions to perform vector addition in parallel across the GPU. Here’s a breakdown of the core functions:

- **`tl.arange(0, BLOCK_SIZE)`**: Generates the local offsets `0` to `BLOCK_SIZE - 1` as a single block of values. These identify which elements one program instance works on.

- **`tl.program_id(0)`**: Returns the index of the current program instance along axis 0 of the launch grid. Multiplying this index by `BLOCK_SIZE` gives the starting offset of the instance's slice, ensuring that different instances operate on different data segments.

- **`tl.load` and `tl.store`**: These functions handle memory access:
    - **`tl.load`**: Reads data from global memory into the kernel for processing. In this case, it loads elements from `A_ptr` and `B_ptr`.
    - **`tl.store`**: Writes processed data back to global memory. Here, it stores the result of the addition in `C_ptr`.

- **`mask`**: Ensures that we don’t access out-of-bounds memory when the array length isn’t a perfect multiple of the block size. 

> **Note**: In addition to safety, masked-off positions are neither loaded nor stored, so the final, partially filled block does not touch memory outside the arrays.

Each program instance computes `BLOCK_SIZE` elements of the vector sum independently of the other instances, allowing the GPU to process large vectors in parallel efficiently.
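To see how the offsets and the mask fit together, the sketch below emulates the kernel's indexing on the CPU in plain Python. It is an illustration under simplifying assumptions, not how Triton executes the kernel: the helper names `program_indices` and `emulate_vector_add` are hypothetical, and the program instances run one after another here instead of in parallel.

```python
def program_indices(pid, block_size, n):
    """Offsets and mask for one program instance.

    Mirrors `tl.arange(0, BLOCK_SIZE) + tl.program_id(0) * BLOCK_SIZE`
    and `mask = idx < N` from the kernel.
    """
    idx = [pid * block_size + k for k in range(block_size)]
    mask = [i < n for i in idx]
    return idx, mask


def emulate_vector_add(a, b, block_size):
    """Run every program instance in sequence and apply masked load/add/store."""
    n = len(a)
    c = [None] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):
        idx, mask = program_indices(pid, block_size, n)
        for i, valid in zip(idx, mask):
            if valid:  # masked-off offsets are never read or written
                c[i] = a[i] + b[i]
    return c


a = list(range(10))
b = [10 * x for x in a]

# The last instance (pid=2) covers offsets 8..11, but only 8 and 9 are in bounds
print(program_indices(2, 4, 10))  # ([8, 9, 10, 11], [True, True, False, False])
print(emulate_vector_add(a, b, block_size=4) == [x + y for x, y in zip(a, b)])  # True
```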

Now, let’s initialize some data and launch the kernel!


### Executing the Kernel 

Now, let’s initialize some vectors and execute the kernel to see Triton in action. We’ll create two random input vectors, `A` and `B`, and an empty output vector, `C`. Then we’ll launch our `vector_add_kernel` to compute the element-wise sum of `A` and `B` in parallel on the GPU. Finally, we’ll print out a "Hello, GPU!" message and display a portion of the result to confirm that the kernel worked as expected.


In [9]:
import torch

# Initialize input vectors A and B with N elements
# These vectors are created on the GPU using torch's `device='cuda'`
N = 1024
A = torch.rand(N, device='cuda')    # Random values in vector A
B = torch.rand(N, device='cuda')    # Random values in vector B
C = torch.empty(N, device='cuda')   # Empty vector C for storing the result

# Launch the kernel
# BLOCK_SIZE defines how many elements each program instance processes
BLOCK_SIZE = 128
# The launch grid must be a tuple (or a callable returning one), with one entry per grid axis
grid = ((N + BLOCK_SIZE - 1) // BLOCK_SIZE,)  # Number of program instances (ceiling division)

# Execute the kernel with the specified grid size and block size
vector_add_kernel[grid](A, B, C, N, BLOCK_SIZE=BLOCK_SIZE)

# Print "Hello, GPU!" message and show the first 10 elements of the result
print("Hello, GPU!")
print("Result of A + B (first 10 elements):", C[:10].cpu().numpy())


### Comparing Performance: Triton vs. PyTorch (CUDA)

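To see how Triton compares with a built-in implementation, we’ll time the same vector addition using our Triton kernel and PyTorch’s CUDA-backed `A + B`. Each approach is run multiple times and the timings are averaged. Vector addition is limited by memory bandwidth rather than arithmetic, so the two timings are typically close; the interesting result is that a short Triton kernel can be competitive with PyTorch’s optimized operator.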
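When timing code like this, it helps to discard a few warm-up runs (the first Triton launch includes one-time JIT compilation) and to summarize several repetitions. The sketch below shows that pattern with a generic helper; the name `benchmark` and its `sync` parameter are hypothetical. It uses a wall-clock timer, so for GPU work you would pass a synchronization function such as `torch.cuda.synchronize` as `sync`, because kernel launches return before the GPU finishes.

```python
import statistics
import time


def benchmark(fn, warmup=3, reps=10, sync=None):
    """Return the median wall-clock time of fn() in milliseconds.

    sync: optional callable that blocks until asynchronous work is done.
    """
    def run_once():
        fn()
        if sync is not None:
            sync()

    # Warm-up runs absorb one-time costs such as compilation and caching
    for _ in range(warmup):
        run_once()

    times_ms = []
    for _ in range(reps):
        start = time.perf_counter()
        run_once()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms)


# CPU-only usage example so the sketch runs anywhere
median_ms = benchmark(lambda: sum(range(100_000)))
print(f"median: {median_ms:.3f} ms")
```

The median is less sensitive to an occasional slow run than the mean, which is one reason to prefer it for short measurements.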


In [10]:
# Triton Vector Addition Kernel (same as the one we defined previously)
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(A_ptr, B_ptr, C_ptr, N, BLOCK_SIZE: tl.constexpr):
    idx = tl.arange(0, BLOCK_SIZE) + tl.program_id(0) * BLOCK_SIZE
    mask = idx < N
    a = tl.load(A_ptr + idx, mask=mask)
    b = tl.load(B_ptr + idx, mask=mask)
    c = a + b
    tl.store(C_ptr + idx, c, mask=mask)


# Function to execute Triton kernel and measure time
def triton_vector_add(A, B, C, N, BLOCK_SIZE=128):
    # The launch grid must be a tuple, with one entry per grid axis
    grid = ((N + BLOCK_SIZE - 1) // BLOCK_SIZE,)
    # Record CUDA events around the launch (kernel launches are asynchronous)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    vector_add_kernel[grid](A, B, C, N, BLOCK_SIZE=BLOCK_SIZE)
    end.record()
    torch.cuda.synchronize()  # Wait for the kernel to finish before reading the timer
    return start.elapsed_time(end)  # Time in milliseconds



In [None]:
# PyTorch CUDA Vector Addition
def cuda_vector_add(A, B):
    # Start timing
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    C = A + B  # Note: allocates a new output tensor on each call, unlike the Triton path
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # Time in milliseconds

In [None]:
# Initialize data and run comparison 
import torch

# Initialize vectors on GPU
N = 1024 * 1024  # Use a large N to better observe performance differences
A = torch.rand(N, device='cuda')
B = torch.rand(N, device='cuda')
C = torch.empty(N, device='cuda')

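# Warm up once so that Triton's one-time JIT compilation is not included in the timings
triton_vector_add(A, B, C, N)
cuda_vector_add(A, B)

# Measure performance over multiple runs and average results
triton_times = [triton_vector_add(A, B, C, N) for _ in range(10)]
cuda_times = [cuda_vector_add(A, B) for _ in range(10)]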

avg_triton_time = sum(triton_times) / len(triton_times)
avg_cuda_time = sum(cuda_times) / len(cuda_times)

print(f"Average Triton time: {avg_triton_time:.3f} ms")
print(f"Average CUDA time: {avg_cuda_time:.3f} ms")
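Because vector addition is memory-bound, raw milliseconds are easier to interpret when converted to effective memory bandwidth. The sketch below assumes `float32` data (4 bytes per element, the default for `torch.rand`) and three arrays' worth of traffic (read `A`, read `B`, write `C`). The helper name `effective_bandwidth_gbps` is hypothetical, and the timing in the usage line is a made-up placeholder, not a measurement.

```python
def effective_bandwidth_gbps(n_elements, elapsed_ms, bytes_per_element=4, num_arrays=3):
    """Effective memory bandwidth in GB/s for an element-wise kernel.

    Assumes each of num_arrays arrays is read or written exactly once.
    """
    total_bytes = num_arrays * n_elements * bytes_per_element
    seconds = elapsed_ms * 1e-3
    return total_bytes / seconds / 1e9


# Placeholder timing for illustration only; substitute avg_triton_time or avg_cuda_time
example_ms = 0.05
print(f"{effective_bandwidth_gbps(1024 * 1024, example_ms):.1f} GB/s")
```

Comparing this figure with the GPU's peak memory bandwidth gives a sense of how close either implementation is to the hardware limit.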


In [None]:
import matplotlib.pyplot as plt

# Plotting the results
plt.bar(["Triton", "CUDA (PyTorch)"], [avg_triton_time, avg_cuda_time], color=['royalblue', 'darkorange'])
plt.ylabel("Average Time (ms)")
plt.title("Performance Comparison: Triton vs. CUDA (PyTorch)")
plt.show()