<a href="https://colab.research.google.com/github/dagmaros27/AIMS_Notebooks/blob/main/CUDA_Practical_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CUDA Programming on NVIDIA GPUs**

# **Practical 3**

As before, make sure the correct runtime is in use: click the Runtime menu at the top, choose "Change runtime type", and select an appropriate GPU such as the T4.

Then run the command below to check the details of the GPU that is available to you.

In [None]:
!nvidia-smi


Tue Jan 20 06:39:23 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

---

First we download two header files from the course webpage.

In [None]:
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h


--2026-01-20 06:39:23--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:201::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27832 (27K) [text/x-chdr]
Saving to: ‘helper_cuda.h’


2026-01-20 06:39:24 (197 KB/s) - ‘helper_cuda.h’ saved [27832/27832]

--2026-01-20 06:39:24--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:201::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14875 (15K) [text/x-chdr]
Saving to: ‘helper_string.h’


2026-01-20 06:39:25 (366 KB/s) - ‘helper_string.h’ saved [14875/14875]





---

The next step is to create the file laplace3d.cu, which contains a reference C++ routine against which the CUDA results are compared.

In [None]:
%%writefile laplace3d.cu

////////////////////////////////////////////////////////////////////////
//
// Program to solve Laplace equation on a regular 3D grid
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// kernel function
////////////////////////////////////////////////////////////////////////

__global__ void GPU_laplace3d(long long NX, long long NY, long long NZ,
                              const float* __restrict__ d_u1,
                                    float* __restrict__ d_u2)
{
  long long i, j, k, indg, IOFF, JOFF, KOFF;
  float     u2, sixth=1.0f/6.0f;

  //
  // define global indices and array offsets
  //

  i    = threadIdx.x + blockIdx.x*blockDim.x;
  j    = threadIdx.y + blockIdx.y*blockDim.y;
  indg = i + j*NX;

  IOFF = 1;
  JOFF = NX;
  KOFF = NX*NY;

  if ( i>=0 && i<=NX-1 && j>=0 && j<=NY-1 ) {

    for (k=0; k<NZ; k++) {

      if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
        u2 = d_u1[indg];  // Dirichlet b.c.'s
      }
      else {
        u2 = ( d_u1[indg-IOFF] + d_u1[indg+IOFF]
             + d_u1[indg-JOFF] + d_u1[indg+JOFF]
             + d_u1[indg-KOFF] + d_u1[indg+KOFF] ) * sixth;
      }
      d_u2[indg] = u2;

      indg += KOFF;
    }
  }
}


////////////////////////////////////////////////////////////////////////
// Gold routine -- reference C++ code
////////////////////////////////////////////////////////////////////////

void Gold_laplace3d(long long NX, long long NY, long long NZ, float* u1, float* u2)
{
  long long i, j, k, ind;
  float     sixth=1.0f/6.0f;  // predefining this improves performance more than 10%

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {   // i loop innermost for sequential memory access
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
          u2[ind] = u1[ind];          // Dirichlet b.c.'s
        }
        else {
          u2[ind] = ( u1[ind-1    ] + u1[ind+1    ]
                    + u1[ind-NX   ] + u1[ind+NX   ]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }
      }
    }
  }
}

////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv){

  int       NX=512, NY=512, NZ=512, REPEAT=20,
            BLOCK_X, BLOCK_Y, bx, by, i, j, k;
  float    *h_u1, *h_u2, *h_foo,
           *d_u1, *d_u2, *d_foo;

  size_t    ind, bytes = sizeof(float) * NX*NY*NZ;

  printf("Grid dimensions: %d x %d x %d \n\n", NX, NY, NZ);

  // initialise card

  findCudaDevice(argc, argv);

  // initialise CUDA timing

  float milli;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // allocate memory for arrays

  h_u1 = (float *)malloc(bytes);
  h_u2 = (float *)malloc(bytes);
  checkCudaErrors( cudaMalloc((void **)&d_u1, bytes) );
  checkCudaErrors( cudaMalloc((void **)&d_u2, bytes) );

  // initialise u1

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1)
          h_u1[ind] = 1.0f;           // Dirichlet b.c.'s
        else
          h_u1[ind] = 0.0f;
      }
    }
  }

  // copy u1 to device

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(d_u1, h_u1, bytes,
                              cudaMemcpyHostToDevice) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u1 to device: %.1f (ms) \n\n", milli);

  // Gold treatment

  cudaEventRecord(start);
  for (i=0; i<REPEAT; i++) {
    Gold_laplace3d(NX, NY, NZ, h_u1, h_u2);
    h_foo = h_u1; h_u1 = h_u2; h_u2 = h_foo;   // swap h_u1 and h_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("%dx Gold_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Set up the execution configuration

  BLOCK_X = 16; // number of threads
  BLOCK_Y = 16; // in each direction

  bx = 1 + (NX-1)/BLOCK_X; // number of blocks
  by = 1 + (NY-1)/BLOCK_Y; // in each direction

  dim3 dimGrid(bx,by);
  dim3 dimBlock(BLOCK_X,BLOCK_Y);

  // Execute GPU kernel

  cudaEventRecord(start);

  for (i=0; i<REPEAT; i++) {
    GPU_laplace3d<<<dimGrid, dimBlock>>>(NX, NY, NZ, d_u1, d_u2);
    getLastCudaError("GPU_laplace3d execution failed\n");

    d_foo = d_u1; d_u1 = d_u2; d_u2 = d_foo;   // swap d_u1 and d_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Block dimensions: %d x %d \n", BLOCK_X,BLOCK_Y);
  printf("%dx GPU_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Read back GPU results

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(h_u2, d_u1, bytes, cudaMemcpyDeviceToHost) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u2 to host: %.1f (ms) \n\n", milli);

  // error check

  float err = 0.0;

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;
        err += (h_u1[ind]-h_u2[ind])*(h_u1[ind]-h_u2[ind]);
      }
    }
  }

  printf("rms error = %f \n",sqrt(err/ (float)(NX*NY*NZ)));

  // Release GPU and CPU memory

  checkCudaErrors( cudaFree(d_u1) );
  checkCudaErrors( cudaFree(d_u2) );
  free(h_u1);
  free(h_u2);

  cudaDeviceReset();
}


Writing laplace3d.cu
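
The indexing and update rule used by both `Gold_laplace3d` and the kernel can be tried out on a tiny grid. The following is a minimal host-only sketch, not part of the course code: `flat_index` and `jacobi_sweep` are hypothetical helper names introduced here, and the sweep simply mirrors the logic of `Gold_laplace3d` (boundary points copied, interior points set to the average of their six neighbours).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flattened index of grid point (i, j, k): i varies fastest, then j, then k.
// (Hypothetical helper; laplace3d.cu writes this expression inline.)
inline std::size_t flat_index(std::size_t i, std::size_t j, std::size_t k,
                              std::size_t NX, std::size_t NY)
{
  return i + j*NX + k*NX*NY;
}

// One sweep with the same rule as Gold_laplace3d: boundary values are copied,
// interior values become the average of the six nearest neighbours.
std::vector<float> jacobi_sweep(const std::vector<float>& u1,
                                std::size_t NX, std::size_t NY, std::size_t NZ)
{
  assert(u1.size() == NX*NY*NZ);
  std::vector<float> u2(u1.size());
  const float sixth = 1.0f/6.0f;

  for (std::size_t k=0; k<NZ; k++) {
    for (std::size_t j=0; j<NY; j++) {
      for (std::size_t i=0; i<NX; i++) {   // i innermost: sequential memory access
        const std::size_t ind = flat_index(i, j, k, NX, NY);

        if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
          u2[ind] = u1[ind];               // Dirichlet b.c.'s
        }
        else {
          u2[ind] = ( u1[ind-1    ] + u1[ind+1    ]
                    + u1[ind-NX   ] + u1[ind+NX   ]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }
      }
    }
  }
  return u2;
}
```

On a 3 x 3 x 3 grid with boundary values 1 and the single interior value 0, one sweep should bring the centre point (flattened index 13) to approximately 1, since all six of its neighbours lie on the boundary.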



---

We can now compile and run the executable.


In [None]:
!nvcc laplace3d.cu -o laplace3d -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z13GPU_laplace3dxxxPKfPf' for 'sm_70'
ptxas info    : Function properties for _Z13GPU_laplace3dxxxPKfPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 392 bytes cmem[0]


In [None]:
!./laplace3d

Grid dimensions: 512 x 512 x 512 

GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 114.2 (ms) 

20x Gold_laplace3d: 28674.7 (ms) 

Block dimensions: 16 x 16 
20x GPU_laplace3d: 183.4 (ms) 

Copy u2 to host: 114.4 (ms) 

rms error = 0.000000 
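
The timings above can be turned into two rough figures of merit. The sketch below is only a back-of-envelope aid: the function names are hypothetical, and the bandwidth estimate assumes that each grid point is read from device memory once and written once per iteration (that is, neighbouring values are reused from cache), so it is at best a lower bound on the memory traffic.

```cpp
#include <cassert>

// Ratio of CPU time to GPU time for the same number of iterations.
double speedup(double cpu_ms, double gpu_ms)
{
  assert(gpu_ms > 0.0);
  return cpu_ms / gpu_ms;
}

// Estimated bandwidth in GB/s.
// Assumption: one float read and one float written per grid point per
// iteration; real traffic may be higher if neighbour values miss in cache.
double effective_bandwidth_gbs(long long points, int repeat, double gpu_ms)
{
  assert(gpu_ms > 0.0);
  const double bytes = 2.0 * sizeof(float) * (double)points * repeat;
  return bytes / (gpu_ms * 1.0e-3) / 1.0e9;
}
```

With the figures printed above, `speedup(28674.7, 183.4)` comes to roughly 156, and `effective_bandwidth_gbs(512LL*512*512, 20, 183.4)` to roughly 117 GB/s under the stated assumption.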




---
By going back to the previous code block you can modify the code to complete the initial Practical 3 exercises. Remember to first make your own copy of the notebook so that you are able to edit it.

For students doing this as an assignment to be assessed, you should again add your name to the title of the notebook (as in "Practical 3 -- Mike Giles.ipynb"), make it shared (see the Share option in the top-right corner) and provide the shared link as the submission mechanism.


---

For the later parts of Practical 3, the instructions below create, compile and execute a second version of the code in which each CUDA thread computes the value for just one 3D grid point.

In [None]:
%%writefile laplace3d_new.cu

//
// Program to solve Laplace equation on a regular 3D grid
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// kernel function
////////////////////////////////////////////////////////////////////////

__global__ void GPU_laplace3d(long long NX, long long NY, long long NZ,
                              const float* __restrict__ d_u1,
                                    float* __restrict__ d_u2)
{
  long long i, j, k, indg, IOFF, JOFF, KOFF;
  float     u2, sixth=1.0f/6.0f;

  //
  // define global indices and array offsets
  //

  i    = threadIdx.x + blockIdx.x*blockDim.x;
  j    = threadIdx.y + blockIdx.y*blockDim.y;
  k    = threadIdx.z + blockIdx.z*blockDim.z;

  IOFF = 1;
  JOFF = NX;
  KOFF = NX*NY;

  indg = i + j*JOFF + k*KOFF;

  if (i>=0 && i<=NX-1 && j>=0 && j<=NY-1 && k>=0 && k<=NZ-1) {
    if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
      u2 = d_u1[indg];  // Dirichlet b.c.'s
    }
    else {
      u2 = ( d_u1[indg-IOFF] + d_u1[indg+IOFF]
           + d_u1[indg-JOFF] + d_u1[indg+JOFF]
           + d_u1[indg-KOFF] + d_u1[indg+KOFF] ) * sixth;
    }
    d_u2[indg] = u2;
  }
}

////////////////////////////////////////////////////////////////////////
// Gold routine -- reference C++ code
////////////////////////////////////////////////////////////////////////

void Gold_laplace3d(long long NX, long long NY, long long NZ, float* u1, float* u2)
{
  long long i, j, k, ind;
  float     sixth=1.0f/6.0f;  // predefining this improves performance more than 10%

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {   // i loop innermost for sequential memory access
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
          u2[ind] = u1[ind];          // Dirichlet b.c.'s
        }
        else {
          u2[ind] = ( u1[ind-1    ] + u1[ind+1    ]
                    + u1[ind-NX   ] + u1[ind+NX   ]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }
      }
    }
  }
}


////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv){

  int       NX=512, NY=512, NZ=512, REPEAT=20,
            BLOCK_X, BLOCK_Y, BLOCK_Z,bx, by, bz, i, j, k;
  float    *h_u1, *h_u2, *h_foo,
           *d_u1, *d_u2, *d_foo;

  size_t    ind, bytes = sizeof(float) * NX*NY*NZ;

  printf("Grid dimensions: %d x %d x %d \n\n", NX, NY, NZ);

  // initialise card

  findCudaDevice(argc, argv);

  // initialise CUDA timing

  float milli;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // allocate memory for arrays

  h_u1 = (float *)malloc(bytes);
  h_u2 = (float *)malloc(bytes);
  checkCudaErrors( cudaMalloc((void **)&d_u1, bytes) );
  checkCudaErrors( cudaMalloc((void **)&d_u2, bytes) );

  // initialise u1

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1)
          h_u1[ind] = 1.0f;           // Dirichlet b.c.'s
        else
          h_u1[ind] = 0.0f;
      }
    }
  }

  // copy u1 to device

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(d_u1, h_u1, bytes,
                              cudaMemcpyHostToDevice) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u1 to device: %.1f (ms) \n\n", milli);

  // Gold treatment

  cudaEventRecord(start);
  for (i=0; i<REPEAT; i++) {
    Gold_laplace3d(NX, NY, NZ, h_u1, h_u2);
    h_foo = h_u1; h_u1 = h_u2; h_u2 = h_foo;   // swap h_u1 and h_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("%dx Gold_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Set up the execution configuration

  BLOCK_X = 8; // number of threads
  BLOCK_Y = 8; // in each direction
  BLOCK_Z = 8; // of 3D thread block

  bx = 1 + (NX-1)/BLOCK_X; // number of blocks
  by = 1 + (NY-1)/BLOCK_Y; // in each direction
  bz = 1 + (NZ-1)/BLOCK_Z; // of 3D grid of blocks

  dim3 dimGrid(bx,by,bz);
  dim3 dimBlock(BLOCK_X,BLOCK_Y,BLOCK_Z);

  // Execute GPU kernel

  cudaEventRecord(start);

  for (i=0; i<REPEAT; i++) {
    GPU_laplace3d<<<dimGrid, dimBlock>>>(NX, NY, NZ, d_u1, d_u2);
    getLastCudaError("GPU_laplace3d execution failed\n");

    d_foo = d_u1; d_u1 = d_u2; d_u2 = d_foo;   // swap d_u1 and d_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Block dimensions: %d x %d x %d\n", BLOCK_X,BLOCK_Y,BLOCK_Z);
  printf("%dx GPU_laplace3d_new: %.1f (ms) \n\n", REPEAT, milli);

  // Read back GPU results

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(h_u2, d_u1, bytes, cudaMemcpyDeviceToHost) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u2 to host: %.1f (ms) \n\n", milli);

  // error check

  float err = 0.0;

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;
        err += (h_u1[ind]-h_u2[ind])*(h_u1[ind]-h_u2[ind]);
      }
    }
  }

  printf("rms error = %f \n",sqrt(err/ (float)(NX*NY*NZ)));

  // Release GPU and CPU memory

  checkCudaErrors( cudaFree(d_u1) );
  checkCudaErrors( cudaFree(d_u2) );
  free(h_u1);
  free(h_u2);

  cudaDeviceReset();
}


Writing laplace3d_new.cu
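
The two versions differ mainly in how the work is divided among threads. The host-only sketch below reproduces the block-count arithmetic from `main()` so the two launch configurations can be compared. `blocks_needed`, `LaunchSize`, `launch_size_2d` and `launch_size_3d` are hypothetical names introduced for illustration only.

```cpp
#include <cassert>

// Blocks needed to cover n grid points with `block` threads per block in one
// direction. Same as the 1 + (n-1)/block expression in main(), valid for n >= 1.
long long blocks_needed(long long n, long long block)
{
  assert(n >= 1 && block >= 1);
  return 1 + (n - 1)/block;
}

// Totals for one kernel launch (hypothetical helper type for this sketch).
struct LaunchSize {
  long long blocks;
  long long threads_per_block;
  long long threads;
};

// First version: 2D grid of 2D blocks, each thread loops over all k.
LaunchSize launch_size_2d(long long NX, long long NY,
                          long long BX, long long BY)
{
  const long long blocks = blocks_needed(NX, BX) * blocks_needed(NY, BY);
  const long long tpb    = BX*BY;
  return LaunchSize{blocks, tpb, blocks*tpb};
}

// Second version: 3D grid of 3D blocks, one thread per grid point.
LaunchSize launch_size_3d(long long NX, long long NY, long long NZ,
                          long long BX, long long BY, long long BZ)
{
  const long long blocks = blocks_needed(NX, BX) * blocks_needed(NY, BY)
                         * blocks_needed(NZ, BZ);
  const long long tpb    = BX*BY*BZ;
  return LaunchSize{blocks, tpb, blocks*tpb};
}
```

For the 512 x 512 x 512 grid, `launch_size_2d(512, 512, 16, 16)` gives 1024 blocks of 256 threads (262,144 threads, each handling 512 points), while `launch_size_3d(512, 512, 512, 8, 8, 8)` gives 262,144 blocks of 512 threads (134,217,728 threads, each handling one point).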


In [None]:
!nvcc laplace3d_new.cu -o laplace3d_new -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z13GPU_laplace3dxxxPKfPf' for 'sm_70'
ptxas info    : Function properties for _Z13GPU_laplace3dxxxPKfPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 392 bytes cmem[0]


In [None]:
!./laplace3d_new

Grid dimensions: 512 x 512 x 512 

GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 114.4 (ms) 

20x Gold_laplace3d: 29493.6 (ms) 

Block dimensions: 8 x 8 x 8
20x GPU_laplace3d_new: 228.6 (ms) 

Copy u2 to host: 113.3 (ms) 

rms error = 0.000000 



---

The next commands use the Nsight Compute profiler (ncu) to count the fp32 and integer instructions executed by each of the two versions.

In [None]:
!ncu --metrics "smsp__sass_thread_inst_executed_op_fp32_pred_on.sum,smsp__sass_thread_inst_executed_op_integer_pred_on.sum" ./laplace3d
!ncu --metrics "smsp__sass_thread_inst_executed_op_fp32_pred_on.sum,smsp__sass_thread_inst_executed_op_integer_pred_on.sum" ./laplace3d_new

Grid dimensions: 512 x 512 x 512 

==PROF== Connected to process 1922 (/content/laplace3d)
GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 120.7 (ms) 

20x Gold_laplace3d: 40849.2 (ms) 

==PROF== Profiling "GPU_laplace3d" - 0: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 1: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 2: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 3: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 4: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 5: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 6: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 7: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 8: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 9: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 10: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 11: 0%....50%..
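
When reading the fp32 counter, it can help to have a rough expected value in mind. The sketch below is a back-of-envelope estimate only, with hypothetical function names. It assumes that each interior point costs exactly five additions and one multiplication, as written in the source code, and that boundary points cost no fp32 instructions. The count reported by the profiler may differ, depending on what the compiler actually emits.

```cpp
#include <cassert>

// Interior points updated by one kernel launch; boundary points are only copied.
long long interior_points(long long NX, long long NY, long long NZ)
{
  assert(NX >= 2 && NY >= 2 && NZ >= 2);
  return (NX - 2)*(NY - 2)*(NZ - 2);
}

// Rough fp32 instruction count for one kernel launch.
// Assumption: 5 additions + 1 multiplication per interior point, taken from
// the source expression; the generated machine code is not inspected here.
long long expected_fp32_ops(long long NX, long long NY, long long NZ)
{
  return 6 * interior_points(NX, NY, NZ);
}
```

For the 512 x 512 x 512 grid this gives 132,651,000 interior points and an estimate of 795,906,000 fp32 instructions per kernel launch. Since both versions evaluate the same expression, one might expect similar fp32 counts from each, with any difference showing up mainly in the integer counter.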

In [None]:
!ncu ./laplace3d
!ncu ./laplace3d_new

Grid dimensions: 512 x 512 x 512 

==PROF== Connected to process 2460 (/content/laplace3d)
GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 120.3 (ms) 

20x Gold_laplace3d: 41574.9 (ms) 

==PROF== Profiling "GPU_laplace3d" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 1: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 2: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 3: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 4: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 5: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 6: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 7: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 8: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 9: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 10: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplac

In [None]:
from google.colab import runtime
runtime.unassign()