[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dougc333/Colab-Notebooks/blob/main/cuda_samples.ipynb)


CUDA Samples
https://github.com/NVIDIA/cuda-samples/blob/master/Samples/6_Performance/transpose/transpose.cu


1.   https://developer.nvidia.com/blog/unified-memory-cuda-beginners/   
2.   https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
3. https://www.nvidia.com/en-us/on-demand/session/gtcspring21-cwes1175/


In [5]:
from google.colab import drive
drive.mount('/content/drive')

!apt-get install emacs


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
emacs is already the newest version (1:27.1+1-3ubuntu5.2).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [6]:
%cd /content/drive/MyDrive/cuda

/content/drive/MyDrive/cuda


In [7]:
import torch

if torch.cuda.is_available():
    print("CUDA is available. PyTorch can use your GPU.")
    print(f"Number of GPUs available: {torch.cuda.device_count()}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}") # Prints the name of the first GPU
else:
    print("CUDA is not available. PyTorch will use your CPU.")

CUDA is available. PyTorch can use your GPU.
Number of GPUs available: 1
GPU Name: Tesla T4


In [None]:
#faster
!apt-get -y install cuda-toolkit-12-4
!rm /etc/alternatives/cuda
!ln -s  /usr/local/cuda-12.4 /etc/alternatives/cuda


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  cuda-cccl-12-4 cuda-command-line-tools-12-4 cuda-compiler-12-4 cuda-crt-12-4
  cuda-cudart-12-4 cuda-cudart-dev-12-4 cuda-cuobjdump-12-4 cuda-cupti-12-4
  cuda-cupti-dev-12-4 cuda-cuxxfilt-12-4 cuda-documentation-12-4
  cuda-driver-dev-12-4 cuda-gdb-12-4 cuda-libraries-12-4
  cuda-libraries-dev-12-4 cuda-nsight-12-4 cuda-nsight-compute-12-4
  cuda-nsight-systems-12-4 cuda-nvcc-12-4 cuda-nvdisasm-12-4
  cuda-nvml-dev-12-4 cuda-nvprof-12-4 cuda-nvprune-12-4 cuda-nvrtc-12-4
  cuda-nvrtc-dev-12-4 cuda-nvtx-12-4 cuda-nvvm-12-4 cuda-nvvp-12-4
  cuda-opencl-12-4 cuda-opencl-dev-12-4 cuda-profiler-api-12-4
  cuda-sanitizer-12-4 cuda-toolkit-12-4-config-common cuda-tools-12-4
  cuda-visual-tools-12-4 default-jre default-jre-headless fonts-dejavu-core
  fonts-dejavu-extra gds-tools-12-4 libatk-wrapper-java
  libatk-wrapper-java-jni libcublas-12-4

In [8]:
!nvidia-smi

Fri Aug 29 01:52:49 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [9]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0


In [None]:
%%writefile check_unified.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    int device = 0;  // pick device 0

    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    printf("Device %d: %s\n", device, prop.name);
    printf("Unified addressing: %d\n", prop.unifiedAddressing);

    return 0;
}

Overwriting check_unified.cu


In [None]:
!/content/drive/MyDrive/cuda nvcc -o check_unified check_unified.cu

/bin/bash: line 1: /content/drive/MyDrive/cuda: Is a directory


In [None]:
!chmod 755 /content/drive/MyDrive/cuda/check_unified

In [None]:
!/content/drive/MyDrive/cuda/check_unified

/bin/bash: line 1: /content/drive/MyDrive/cuda/check_unified: Permission denied


** Unified Addressing **
Unified addressing is a single address space accessible by any GPU/CPU
It adds a page handling mechanism in the OS kernel on a page fault when mapping VM address to physical address.

Reference: https://developer.nvidia.com/blog/maximizing-unified-memory-performance-in-cuda/

<img src="https://drive.google.com/uc?id=1xrOgMck9tXkDHiHoKldqVkFRjYtXSDAH">

Unified memory


<img src="https://drive.google.com/uc?id=1FLrnkIgEPIBtrndQbdpSpegM2dxYYVCj">

In [None]:
import os
os.getcwd()

'/content/drive/MyDrive/cuda'

First cuda program

Called saxpy. Copied verbatim from blog. Structure of cuda program


1.   ```__global__ void fn``` is a kernel. This code runs on the gpu
2. there are 2 models, a hardware model of the chip and a logical model of grids, blocks and threads.

**Hardware model**
*   The GPU consists of SMs. Each block is run in a SM. One SM can run multiple blocks.
*   Blocks contain threads. Each group of 32 threads is a warp. Blocks are interleaved in SMs
```
 ---------------
| SM0: B0, B3
|---------------
| SM1: B1, B4
|---------------
| SM2: B2, B5
|---------------
```
The gpu scheduler runs blocks in SMs dynamically. The above figure can be deceptive in guaranteeing an order of execution.

**Logical model **

*   The logical model starts with a 1D, 2D or 3D assumption.
*   







3.  Thread indexing:  i = blockIdx.x * blockDim.x + threadIdx.x. This is global thread indexing. Global thread indexing has implications on how <<gridDim, blockDim, sharedMem, stream>> are computed.

*   gridDim=num blocks to launch
*   blockDim=num threads per block
*   sharedmem:allocate sm per block
*   stream:used to overlap kernels and memcopies

How to set <<gridDim, blockDim>>? numBlocks is calculated from gridDim. numBlocks isn't a reserved keyword it is our abbreviation for number of blocks


*  For 1D, numBlocks=gridDim.x,
*  2D, numBlocks=gridDim.x * gridDim.y.
*  3d numBlocks = gridDim.x * gridDim.y * gridDimg.z

How is N divided? One way is to divide each single N into a separate thread.





https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/



In [31]:
%%writefile stall.cu

#include <nvtx3/nvToolsExt.h>

__global__ void kernel(float *a, int N ) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < N) {
        a[gid] = a[gid] * 2.0f;
    }
}

int main() {
    int N = 1<<20;
    float *a, *d_a;
    a = (float *)malloc(N*sizeof(float));

    cudaMalloc(&d_a, N*sizeof(float));

    //initialize data before memcpy
    for(int i=0;i<N; i++){
      a[i] = float(i);
    }
    cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    nvtxRangePushA("Kernel Launch");
    kernel<<<(N+255)/256, 256>>>(d_a, N);
    cudaMemcpy(a, d_a, N*sizeof(float), cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();
    nvtxRangePop();
    for(int j=0;j<N;j++){
      printf("a:%f",a[j]);
    }
    cudaFree(d_a);
    free(a);
}

Overwriting stall.cu


In [32]:
!nvcc -o stall stall.cu

In [33]:
!chmod +x /content/drive/MyDrive/cuda/stall

In [None]:
!/content/drive/MyDrive/cuda/stall

In [None]:
%%writefile saxpy.cu
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}



int main(void)
{
  int N = 1<<20;

  float *x, *y, *d_x, *d_y;

  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}



  ```
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
```
i corresponds to each thread. We have to test i<n because
```
 (1024*1024+255)/256
4096.99609375
```
we get 4097 but want 4096 so test


In [None]:
!/content/drive/MyDrive/cuda nvcc -o saxpy saxpy.cu

In [None]:
!chmod 755 /content/drive/MyDrive/cuda/saxpy

In [None]:
!/content/drive/MyDrive/cuda/saxpy

https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
Second saxpy post


In [None]:
%%writefile saxpy_2nd_blog_post.cu

//https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
//confusing first part of post uses cpu timers; incomplete

#include <stdio.h>
#include <chrono>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}



int main(void)
{
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);


  int N = 20*(1<<20);
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);


  cudaEventRecord(start);
  saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
  cudaEventRecord(stop);

  // Perform SAXPY on 1M elements
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  cudaEventSynchronize(stop);
  float milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f milliseconds t2-t1: %f\n", maxError,milliseconds );
  printf("Effective Bandwidth (GB/s): %f \n", N*4*3/milliseconds/1e6);
  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}


Overwriting saxpy_2nd_blog_post.cu


In [None]:
!nvcc -o saxpy_2nd_blog_post saxpy_2nd_blog_post.cu

In [None]:
!chmod 755 /content/drive/MyDrive/cuda/saxpy_2nd_blog_post

In [None]:
!/content/drive/MyDrive/cuda/saxpy_2nd_blog_post

['Max error: 0.000000 milliseconds t2-t1: 1.081248',
 'Effective Bandwidth (GB/s): 232.747920 ']

In [None]:
%%writefile saxpy_unified.cu
#include <stdio.h>
#include <chrono>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}


int main(void)
{
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);


  int N = 20*(1<<20);
  float *x, *y, *d_x, *d_y;
  //shouldnt need a malloc either?
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMallocManaged(&d_x, N*sizeof(float));
  cudaMallocManaged(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  //this is wrong, shouldnt need memcpy w unified
  /*cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);*/
  /*cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);*/

  cudaEventRecord(start);
  saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
  cudaEventRecord(stop);

  // Perform SAXPY on 1M elements
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  cudaEventSynchronize(stop);
  float milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);


  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f milliseconds t2-t1: %f\n", maxError,milliseconds );
  printf("Effective Bandwidth (GB/s): %f \n", N*4*3/milliseconds/1e6);
  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

Writing saxpy_unified.cu


In [None]:
%cd /content/drive/MyDrive/cuda
!nvcc -o saxpy_unified saxpy_unified.cu

/content/drive/MyDrive/cuda


In [None]:
!chmod 755 /content/drive/MyDrive/cuda/saxpy_unified

In [None]:
!/content/drive/MyDrive/cuda/saxpy_unified

Max error: 0.000000 milliseconds t2-t1: 1.229472
Effective Bandwidth (GB/s): 204.688048 


In [None]:
!nvprof /content/drive/MyDrive/cuda/saxpy_unified

==38747== NVPROF is profiling process 38747, command: /content/drive/MyDrive/cuda/saxpy_unified
Max error: 0.000000 milliseconds t2-t1: 1.286400
Effective Bandwidth (GB/s): 195.629856 
==38747== Profiling application: /content/drive/MyDrive/cuda/saxpy_unified
==38747== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   69.42%  42.531ms         2  21.266ms  21.122ms  21.409ms  [CUDA memcpy HtoD]
                   29.00%  17.771ms         1  17.771ms  17.771ms  17.771ms  [CUDA memcpy DtoH]
                    1.58%  965.64us         1  965.64us  965.64us  965.64us  saxpy(int, float, float*, float*)
      API calls:   51.15%  95.186ms         2  47.593ms  8.3260us  95.178ms  cudaEventCreate
                   33.42%  62.194ms         3  20.731ms  19.058ms  21.754ms  cudaMemcpy
                   10.98%  20.424ms         2  10.212ms  63.777us  20.360ms  cudaMallocManaged
                    4.18%  7.7761ms         2  3.88

** Pinned Memory **

Pinned memory cannot be paged by the CPU. Use cudaHostAlloc instead of cudaMalloc. This allocates a fixed area in the CPU address space which is directly DMAed into GPU address space over interconnect, PCIe or nvlink. cudaMemcpyAsync requires pinned memory.

In [None]:
%%writefile cuda_memcpy_async

#include <iostream>
#include <cuda_runtime.h>

// A simple kernel that doubles each element
__global__ void doubleElements(int *d_data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        d_data[idx] *= 2;
    }
}

int main() {
    const int N = 1 << 16;  // 65536 ints
    const int SIZE = N * sizeof(int);

    int *h_data;            // host pointer
    int *d_data;            // device pointer
    cudaStream_t stream;    // CUDA stream for async ops

    // Allocate pinned (page-locked) memory on host (needed for async copies)
    cudaMallocHost((void**)&h_data, SIZE);
    // Allocate memory on device
    cudaMalloc((void**)&d_data, SIZE);

    // Initialize host data
    for (int i = 0; i < N; i++) h_data[i] = i;

    // Create a stream
    cudaStreamCreate(&stream);

    // Async copy host->device
    cudaMemcpyAsync(d_data, h_data, SIZE, cudaMemcpyHostToDevice, stream);

    // Launch kernel in same stream
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    doubleElements<<<blocks, threads, 0, stream>>>(d_data, N);

    // Async copy device->host
    cudaMemcpyAsync(h_data, d_data, SIZE, cudaMemcpyDeviceToHost, stream);

    // Wait for stream to complete
    cudaStreamSynchronize(stream);

    // Check results
    std::cout << "h_data[0] = " << h_data[0] << "\n";
    std::cout << "h_data[N-1] = " << h_data[N-1] << "\n";

    // Cleanup
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);

    return 0;
}

In [None]:
!git clone https://github.com/NVIDIA/cuda-samples.git

Cloning into 'cuda-samples'...
remote: Enumerating objects: 30412, done.[K
remote: Counting objects: 100% (14672/14672), done.[K
remote: Compressing objects: 100% (1471/1471), done.[K
remote: Total 30412 (delta 13818), reused 13201 (delta 13201), pack-reused 15740 (from 2)[K
Receiving objects: 100% (30412/30412), 135.80 MiB | 17.20 MiB/s, done.
Resolving deltas: 100% (26469/26469), done.
Updating files: 100% (2052/2052), done.


In [None]:
!apt install cmake

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [None]:
%cd /content/drive/MyDrive/cuda/cuda-samples/

/content/drive/MyDrive/cuda/cuda-samples


In [None]:
import os
os.getcwd()

'/content/drive/MyDrive/cuda/cuda-samples'

In [None]:
!mkdir build


In [None]:
%cd /content/drive/MyDrive/cuda/cuda-samples/build/

/content/drive/MyDrive/cuda/cuda-samples/build


In [None]:
!cmake ..

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.4.131 with host compiler GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found version "12.4.131")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - S

In [None]:
!cd build/build
! make

make: *** No rule to make target 'all'.  Stop.


https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/


In [None]:
%%writefile um1.cu
template <typename data_type, op_type op>
__global__ void stream_thread(data_type *ptr, const size_t size,
                              data_type *output, const data_type val)
{
  size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
  size_t n = size / sizeof(data_type);
  data_type accum = 0;

  for(; tid < n; tid += blockDim.x * gridDim.x)
    if (op == READ) accum += ptr[tid];
      else ptr[tid] = val;

  if (op == READ)
    output[threadIdx.x + blockIdx.x * blockDim.x] = accum;
}
void main(){

}

In [None]:
%%writefile add_grid.cu

#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory -- accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}


In [None]:
%%writefile async_api.cu


/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/*
 * This sample illustrates the usage of CUDA events for both GPU timing and
 * overlapping CPU and GPU execution.  Events are inserted into a stream
 * of CUDA calls.  Since CUDA stream calls are asynchronous, the CPU can
 * perform computations while GPU is executing (including DMA memcopies
 * between the host and device).  CPU can query CUDA events to determine
 * whether GPU has completed tasks.
 */

// includes, system
#include <stdio.h>

// includes CUDA Runtime
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// includes, project
#include "helper_cuda.h"
#include "helper_functions.h" // helper utility functions

__global__ void increment_kernel(int *g_data, int inc_value)
{
    int idx     = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = g_data[idx] + inc_value;
}

bool correct_output(int *data, const int n, const int x)
{
    for (int i = 0; i < n; i++)
        if (data[i] != x) {
            printf("Error! data[%d] = %d, ref = %d\n", i, data[i], x);
            return false;
        }

    return true;
}

int main(int argc, char *argv[])
{
    int            devID;
    cudaDeviceProp deviceProps;

    printf("[%s] - Starting...\n", argv[0]);

    // This will pick the best possible CUDA capable device
    devID = findCudaDevice(argc, (const char **)argv);

    // get device name
    checkCudaErrors(cudaGetDeviceProperties(&deviceProps, devID));
    printf("CUDA device [%s]\n", deviceProps.name);

    int n      = 16 * 1024 * 1024;
    int nbytes = n * sizeof(int);
    int value  = 26;

    // allocate host memory
    int *a = 0;
    checkCudaErrors(cudaMallocHost((void **)&a, nbytes));
    memset(a, 0, nbytes);

    // allocate device memory
    int *d_a = 0;
    checkCudaErrors(cudaMalloc((void **)&d_a, nbytes));
    checkCudaErrors(cudaMemset(d_a, 255, nbytes));

    // set kernel launch configuration
    dim3 threads = dim3(512, 1);
    dim3 blocks  = dim3(n / threads.x, 1);

    // create cuda event handles
    cudaEvent_t start, stop;
    checkCudaErrors(cudaEventCreate(&start));
    checkCudaErrors(cudaEventCreate(&stop));

    StopWatchInterface *timer = NULL;
    sdkCreateTimer(&timer);
    sdkResetTimer(&timer);

    checkCudaErrors(cudaDeviceSynchronize());
    float gpu_time = 0.0f;

    // asynchronously issue work to the GPU (all to stream 0)
    checkCudaErrors(cudaProfilerStart());
    sdkStartTimer(&timer);
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
    sdkStopTimer(&timer);
    checkCudaErrors(cudaProfilerStop());

    // have CPU do some work while waiting for stage 1 to finish
    unsigned long int counter = 0;

    while (cudaEventQuery(stop) == cudaErrorNotReady) {
        counter++;
    }

    checkCudaErrors(cudaEventElapsedTime(&gpu_time, start, stop));

    // print the cpu and gpu times
    printf("time spent executing by the GPU: %.2f\n", gpu_time);
    printf("time spent by CPU in CUDA calls: %.2f\n", sdkGetTimerValue(&timer));
    printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);

    // check the output for correctness
    bool bFinalResults = correct_output(a, n, value);

    // release resources
    checkCudaErrors(cudaEventDestroy(start));
    checkCudaErrors(cudaEventDestroy(stop));
    checkCudaErrors(cudaFreeHost(a));
    checkCudaErrors(cudaFree(d_a));

    exit(bFinalResults ? EXIT_SUCCESS : EXIT_FAILURE);
}


Overwriting async_api.cu


In [None]:
%%writefile clock.cu
/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/*
 * This example shows how to use the clock function to measure the performance
 * of block of threads of a kernel accurately. Blocks are executed in parallel
 * and out of order. Since there's no synchronization mechanism between blocks,
 * we measure the clock once for each block. The clock samples are written to
 * device memory.
 */

// System includes
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// CUDA runtime
#include <cuda_runtime.h>

// helper functions and utilities to work with CUDA
#include "helper_cuda.h"
#include "helper_functions.h"

// This kernel computes a standard parallel reduction and evaluates the
// time it takes to do that for each block. The timing results are stored
// in device memory.
__global__ static void timedReduction(const float *input, float *output, clock_t *timer)
{
    // __shared__ float shared[2 * blockDim.x];
    extern __shared__ float shared[];

    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    if (tid == 0)
        timer[bid] = clock();

    // Copy input.
    shared[tid]              = input[tid];
    shared[tid + blockDim.x] = input[tid + blockDim.x];

    // Perform reduction to find minimum.
    for (int d = blockDim.x; d > 0; d /= 2) {
        __syncthreads();

        if (tid < d) {
            float f0 = shared[tid];
            float f1 = shared[tid + d];

            if (f1 < f0) {
                shared[tid] = f1;
            }
        }
    }

    // Write result.
    if (tid == 0)
        output[bid] = shared[0];

    __syncthreads();

    if (tid == 0)
        timer[bid + gridDim.x] = clock();
}

#define NUM_BLOCKS  64
#define NUM_THREADS 256

// It's interesting to change the number of blocks and the number of threads to
// understand how to keep the hardware busy.
//
// Here are some numbers I get on my G80:
//    blocks - clocks
//    1 - 3096
//    8 - 3232
//    16 - 3364
//    32 - 4615
//    64 - 9981
//
// With less than 16 blocks some of the multiprocessors of the device are idle.
// With more than 16 you are using all the multiprocessors, but there's only one
// block per multiprocessor and that doesn't allow you to hide the latency of
// the memory. With more than 32 the speed scales linearly.

// Start the main CUDA Sample here
int main(int argc, char **argv)
{
    printf("CUDA Clock sample\n");

    // This will pick the best possible CUDA capable device
    int dev = findCudaDevice(argc, (const char **)argv);

    float   *dinput  = NULL;
    float   *doutput = NULL;
    clock_t *dtimer  = NULL;

    clock_t timer[NUM_BLOCKS * 2];
    float   input[NUM_THREADS * 2];

    for (int i = 0; i < NUM_THREADS * 2; i++) {
        input[i] = (float)i;
    }

    checkCudaErrors(cudaMalloc((void **)&dinput, sizeof(float) * NUM_THREADS * 2));
    checkCudaErrors(cudaMalloc((void **)&doutput, sizeof(float) * NUM_BLOCKS));
    checkCudaErrors(cudaMalloc((void **)&dtimer, sizeof(clock_t) * NUM_BLOCKS * 2));

    checkCudaErrors(cudaMemcpy(dinput, input, sizeof(float) * NUM_THREADS * 2, cudaMemcpyHostToDevice));

    timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>(dinput, doutput, dtimer);

    checkCudaErrors(cudaMemcpy(timer, dtimer, sizeof(clock_t) * NUM_BLOCKS * 2, cudaMemcpyDeviceToHost));

    checkCudaErrors(cudaFree(dinput));
    checkCudaErrors(cudaFree(doutput));
    checkCudaErrors(cudaFree(dtimer));

    long double avgElapsedClocks = 0;

    for (int i = 0; i < NUM_BLOCKS; i++) {
        avgElapsedClocks += (long double)(timer[i + NUM_BLOCKS] - timer[i]);
    }

    avgElapsedClocks = avgElapsedClocks / NUM_BLOCKS;
    printf("Average clocks/block = %Lf\n", avgElapsedClocks);

    return EXIT_SUCCESS;
}

Overwriting clock.cu


<h1>Triton kernels</h1>
vLLM offline mode

<href=https://www.youtube.com/watch?v=E8Mju53VB00&list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_&index=6">

In [None]:
%cd /content/drive/MyDrive

/content/drive/MyDrive


In [None]:
!pip install triton



In [None]:
# 📦 Triton kernel for PagedAttention (simplified illustration)
# NOTE: This is not production vLLM code, but illustrates the idea

import triton
import triton.language as tl
import torch

@triton.jit
def paged_attention_kernel(
    Q_ptr, K_ptr, V_ptr, Out_ptr,
    B, H, L_Q, L_KV, D,
    stride_qb, stride_qh, stride_qd,
    stride_kb, stride_kh, stride_kd,
    stride_vb, stride_vh, stride_vd,
    stride_ob, stride_oh, stride_od,
    BLOCK_D: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    b = pid // H  # batch
    h = pid % H   # head

    offs_d = tl.arange(0, BLOCK_D)

    q_ptrs = Q_ptr + b * stride_qb + h * stride_qh + offs_d * stride_qd
    k_ptrs = K_ptr + b * stride_kb + h * stride_kh + offs_d * stride_kd
    v_ptrs = V_ptr + b * stride_vb + h * stride_vh + offs_d * stride_vd
    o_ptrs = Out_ptr + b * stride_ob + h * stride_oh + offs_d * stride_od

    q = tl.load(q_ptrs)
    k = tl.load(k_ptrs)
    v = tl.load(v_ptrs)

    score = tl.dot(q, k)
    weight = tl.softmax(score)
    out = tl.dot(weight, v)

    tl.store(o_ptrs, out)

# Example usage
B, H, L, D = 1, 1, 32, 64
Q = torch.randn((B, H, L, D), device="cuda")
K = torch.randn((B, H, L, D), device="cuda")
V = torch.randn((B, H, L, D), device="cuda")
Out = torch.empty_like(Q)

paged_attention_kernel[(B * H,)](
    Q, K, V, Out,
    B, H, L, L, D,
    Q.stride(0), Q.stride(1), Q.stride(3),
    K.stride(0), K.stride(1), K.stride(3),
    V.stride(0), V.stride(1), V.stride(3),
    Out.stride(0), Out.stride(1), Out.stride(3),
    BLOCK_D=D,
)

<h1>LLM Implementation</h1>
<h4>lecture 3</h4>
<href="https://www.youtube.com/watch?v=ptFiH_bHnJw&list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_&index=3"/>

<p>Notes from lecture: What has changed in the last year? The new models are CommandA LLM, 2 Olmo Furious, SmolLM2, Phi-3, Phi-4, most of these are in their 2-4x iteration reflecting changes in industry. 19 dense model releases in last year. Not going to recreate foundation model training but can create vllm offline mode for these models.  </p>