# Introduction to GPU Architecture 

In this notebook, we'll cover the basics of GPU architecture and see how to use Triton for efficient GPU programming.

The goal is to help you get started developing custom kernels in Python, using Google Colab to access GPUs.

### Core Concepts

- **Cores**: GPUs contain thousands of small cores designed for parallel processing.
- **Memory Hierarchy**: 
  - **Global Memory**: Main memory, which is slower but accessible by all cores.
  - **Shared Memory**: Fast, small, and accessible only by threads within the same thread block.
  - **Registers**: The fastest memory, used for temporary storage within each thread.
    

- **Thread Blocks**:

A thread block is a group of threads that execute concurrently and can communicate with each other through shared memory.

In Triton, similar to CUDA, thread blocks allow developers to structure workloads efficiently by grouping threads that process a subset of data. Threads within the same block can synchronize and share data through shared memory, which is high-speed memory accessible to all threads in the block.

**Why Thread Blocks?** By dividing the overall workload into blocks, GPUs can process data in parallel, where each thread performs part of the computation. This parallelism enables GPUs to handle large data sets much faster than serial CPU processing.

**Execution in Triton**: In Triton kernels, users define the grid (the number of program instances to launch) and the block size (the number of elements each instance processes) to control work distribution across the GPU. These two choices are crucial for optimizing performance, as they determine memory access patterns and processing efficiency.

**Example**: In image processing, each thread might represent a single pixel. By processing all pixels in parallel, a GPU can efficiently handle high-resolution images in real-time.


**Block size** 

Block size is the number of elements that a single Triton program instance operates on simultaneously.

It's typically defined as a power of two (e.g., 128, 256, 512) for optimal performance.
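
As a rough illustration of how block size relates to the number of program instances, the sketch below uses plain Python (no GPU required). The helper names `is_power_of_two` and `num_program_instances` are hypothetical and not part of Triton; they only show the arithmetic.

```python
def is_power_of_two(x: int) -> bool:
    """Return True if x is a positive power of two (1, 2, 4, 8, ...)."""
    return x > 0 and (x & (x - 1)) == 0


def num_program_instances(n_elements: int, block_size: int) -> int:
    """Ceiling division: how many blocks of block_size are needed to cover n_elements."""
    return (n_elements + block_size - 1) // block_size


# Compare a few common block sizes for a 1000-element vector
for block_size in (128, 256, 512):
    print(
        f"BLOCK_SIZE={block_size}",
        f"power_of_two={is_power_of_two(block_size)}",
        f"instances={num_program_instances(1000, block_size)}",
    )
```

When the element count is not a multiple of the block size, the last program instance is only partially filled, which is why kernels usually guard their loads and stores with a mask.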

# Setting up Google Colab for Triton

To run Triton code in Google Colab, follow these setup steps:

1. **Enable GPU**:
   - Go to **Runtime > Change runtime type**.
   - Set **Hardware accelerator** to **GPU**, then click **Save**.

2. **Install Triton**:
   - Run the following command in the next cell to install Triton.


In [1]:
# Install Triton
!pip install triton

ERROR: Could not find a version that satisfies the requirement triton (from versions: none)
ERROR: No matching distribution found for triton

# Verifying GPU Availability

Let's check if a GPU is available in this Colab environment. We can use `torch.cuda.is_available()` to confirm. If a GPU is detected, we’ll print its name.

In [2]:
import torch

if torch.cuda.is_available():
    print("GPU is available:", torch.cuda.get_device_name(0))
else:
    print("No GPU found. Please enable GPU under Runtime > Change runtime type.")

No GPU found. Please enable GPU under Runtime > Change runtime type.


# Writing a Simple Triton Kernel

Triton makes it easy to write GPU kernels with a Pythonic interface. We'll start with a basic operation: **vector addition**.

### Vector Addition

Consider two vectors, \(A\) and \(B\), each with \(N\) elements. We want to compute their element-wise sum to produce a new vector, \(C\), where each element is defined by:

$$
C[i] = A[i] + B[i]
$$

This is a great starting point for understanding GPU parallelization, as each element addition is independent and can be done in parallel.
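
For comparison, a serial reference version of the same operation can be sketched in plain Python. The function name `vector_add_reference` is an illustrative assumption; it is useful mainly as a baseline to check a GPU kernel against.

```python
def vector_add_reference(a, b):
    """Serial element-wise sum: C[i] = A[i] + B[i] for every index i."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    # Each iteration is independent of the others, which is what makes
    # this operation easy to parallelize on a GPU.
    return [x + y for x, y in zip(a, b)]


A = [1.0, 2.0, 3.0]
B = [10.0, 20.0, 30.0]
print(vector_add_reference(A, B))
```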


In [4]:
# Import Triton libraries

import triton 
import triton.language as tl 

@triton.jit
def vector_add_kernel(
    A_ptr, B_ptr, C_ptr, N, 
    BLOCK_SIZE: tl.constexpr
):
    # Compute the global element offsets handled by this program instance
    idx = tl.arange(0, BLOCK_SIZE) + tl.program_id(0) * BLOCK_SIZE

    # idx is a tensor of BLOCK_SIZE offsets, one per element this instance processes.
    # Computing it this way:
    # 1. Determines which data elements each program instance should process.
    # 2. Ensures that different program instances don't overlap in their computations.
    # 3. Maps program instances to data elements in a way that scales with the grid size.
    
    # Set mask to avoid out-of-bounds access
    mask = idx < N
    
    # Load data from pointers A and B, add them, and store to C
    a = tl.load(A_ptr + idx, mask=mask)
    b = tl.load(B_ptr + idx, mask=mask)
    c = a + b
    tl.store(C_ptr + idx, c, mask=mask)
    

ModuleNotFoundError: No module named 'triton'

In [3]:
# Best practices for validating and benchmarking your custom ops against native reference implementations.

### Understanding the Kernel 

In this kernel: 

- **`tl.arange(0, BLOCK_SIZE)`**: Creates a range of `BLOCK_SIZE` consecutive offsets within a block.
- **`tl.program_id(0)`**: Returns the index of the current program instance along axis 0. Multiplying it by `BLOCK_SIZE` gives the global starting index of that instance's block.
- **`tl.load` and `tl.store`**: Load from and store to global memory, respectively, using the pointers `A_ptr`, `B_ptr`, and `C_ptr`.
- **`mask`**: Ensures that we don’t access out-of-bounds memory when the array length isn’t a perfect multiple of the block size.
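
To make the index and mask arithmetic concrete, here is a CPU-only emulation in plain Python. It is a sketch of the logic only, not how Triton executes the kernel: the functions `block_offsets` and `emulate_vector_add` are hypothetical, and the outer loop runs serially where a GPU would run the program instances in parallel.

```python
def block_offsets(program_id: int, block_size: int, n: int):
    """Mimic `tl.arange(0, BLOCK_SIZE) + program_id * BLOCK_SIZE` and `idx < N`."""
    offsets = [program_id * block_size + i for i in range(block_size)]
    mask = [off < n for off in offsets]
    return offsets, mask


def emulate_vector_add(a, b, block_size: int):
    """Run every program instance in turn, applying the mask to each access."""
    n = len(a)
    c = [None] * n
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):  # one iteration per program instance
        offsets, mask = block_offsets(pid, block_size, n)
        for off, valid in zip(offsets, mask):
            if valid:  # masked-out offsets are never read or written
                c[off] = a[off] + b[off]
    return c


# With N = 10 and a block size of 4, the last instance is only half full
offsets, mask = block_offsets(program_id=2, block_size=4, n=10)
print(offsets, mask)
print(emulate_vector_add(list(range(10)), [100] * 10, block_size=4))
```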


Each program instance computes one block of `BLOCK_SIZE` elements of the vector sum, and all instances run in parallel.



# Executing the Kernel 

Now, let’s initialize some vectors and execute the kernel to see Triton in action. We’ll output the resulting vector as a confirmation that our kernel is working.


In [5]:
import torch

# Initialize input vectors A and B with N elements
N = 1024
A = torch.rand(N, device='cuda')
B = torch.rand(N, device='cuda')
C = torch.empty(N, device='cuda')


# Launch the kernel
BLOCK_SIZE = 128
grid = ((N + BLOCK_SIZE - 1) // BLOCK_SIZE,)  # 1D launch grid (a tuple): number of program instances
vector_add_kernel[grid](A, B, C, N, BLOCK_SIZE=BLOCK_SIZE)

# Print "Hello, GPU!" message and a slice of the output vector
print("Hello, GPU!")
print("Result of A + B (first 10 elements):", C[:10].cpu().numpy())

AssertionError: Torch not compiled with CUDA enabled
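
Printing a slice of the output is only a quick sanity check. A more systematic way to validate a custom op is to compare its full output with a reference implementation, within a tolerance. The sketch below shows that idea in plain Python so it runs without a GPU. The helpers `max_abs_error` and `matches_reference` are hypothetical names, and the two lists are stand-ins for the kernel output and the native result.

```python
import math


def max_abs_error(result, reference):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(result) != len(reference):
        raise ValueError("length mismatch between result and reference")
    return max((abs(r - e) for r, e in zip(result, reference)), default=0.0)


def matches_reference(result, reference, rel_tol=1e-6, abs_tol=1e-8):
    """True if every element agrees with the reference within the given tolerances."""
    return len(result) == len(reference) and all(
        math.isclose(r, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, e in zip(result, reference)
    )


a = [0.1, 0.2, 0.3]
b = [1.0, 2.0, 3.0]
reference = [x + y for x, y in zip(a, b)]  # stand-in for a native implementation
result = [y + x for x, y in zip(a, b)]     # stand-in for the custom kernel's output
print(matches_reference(result, reference), max_abs_error(result, reference))
```

With PyTorch tensors, the equivalent check would typically compare `C` against `A + B`, for example with `torch.allclose`.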