<a href="https://colab.research.google.com/github/ggruszczynski/gpu_colab/blob/main/30_matrix_matrix_multiplication.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Matrix x Matrix multiplication
Tutorial 2 walked through the steps in detail; now it is time for some stand-alone practice.

Accelerate the code: finish the matrix multiplication CUDA kernel.

In [8]:
%%file matrix_add.cu

// This program computes a simple version of matrix multiplication
// By: Nick from CoffeeBeforeArch

#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <vector>

using std::cout;
using std::generate;
using std::vector;

__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
  // Compute each thread's global row and column index
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  // Iterate over row, and down column
  c[row * N + col] = 0;
  for (int k = 0; k < N; k++) {
    // Accumulate results for a single element
    // TODO: write your code here
  }
}

// Check result on the CPU
void verify_result(vector<int> &a, vector<int> &b, vector<int> &c, int N) {
  for (int row = 0; row < N; row++) {
    for (int col = 0; col < N; col++) {
      int tmp = 0; // For every element in the row-column pair
      for (int k = 0; k < N; k++) {
        // Accumulate the partial results
        tmp += a[row * N + k] * b[k * N + col];
      }
      // Check against the CPU result
      assert(tmp == c[row * N + col]);
    }
  }
}

int main() {
  int N = 1 << 10;  // Matrix size of 1024 x 1024

  // Size (in bytes) of matrix
  size_t bytes = N * N * sizeof(int);

  // Host vectors
  vector<int> h_a(N * N);
  vector<int> h_b(N * N);
  vector<int> h_c(N * N);

  // Initialize matrices
  generate(h_a.begin(), h_a.end(), []() { return rand() % 100; });
  generate(h_b.begin(), h_b.end(), []() { return rand() % 100; });

  // Allocate device memory
  int *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);

  // Copy data to the device
  cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

  // Threads per CTA dimension
  int THREADS = 32;

  // Blocks per grid dimension (assumes THREADS divides N evenly)
  int BLOCKS = N / THREADS;

  // Use dim3 structs for block and grid dimensions
  dim3 threads(THREADS, THREADS);
  dim3 blocks(BLOCKS, BLOCKS);

  // Launch kernel
  matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, N);

  // Copy back to the host
  cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

  // Check result
  verify_result(h_a, h_b, h_c, N);

  cout << "COMPLETED SUCCESSFULLY\n";

  // Free memory on device
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);

  return 0;
}

Overwriting matrix_add.cu
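
The kernel above assigns one thread to each output element, and its launch code assumes that `THREADS` divides `N` evenly. As a minimal host-only C++ sketch (not part of the original notebook, and no GPU required), the launch geometry can be emulated on the CPU to check that every `(row, col)` pair is visited exactly once. The helper names `blocks_per_dim`, `count_visits`, and `covers_exactly_once` are hypothetical. The sketch swaps the plain division for ceiling division plus a bounds guard, a common pattern when `N` is not a multiple of the block size; the original kernel has no such guard.

```cpp
#include <vector>

// Ceiling division: smallest block count with blocks * threads >= n.
// (Hypothetical helper; the notebook itself uses N / THREADS.)
int blocks_per_dim(int n, int threads) {
  return (n + threads - 1) / threads;
}

// Emulate a 2D launch on the CPU and count how many times each element
// of an n x n output matrix would be written.
std::vector<int> count_visits(int n, int threads) {
  int blocks = blocks_per_dim(n, threads);
  std::vector<int> visits(n * n, 0);
  for (int by = 0; by < blocks; by++) {
    for (int bx = 0; bx < blocks; bx++) {
      for (int ty = 0; ty < threads; ty++) {
        for (int tx = 0; tx < threads; tx++) {
          // Same index arithmetic as in the kernel
          int row = by * threads + ty;
          int col = bx * threads + tx;
          // Bounds guard (an addition of this sketch): skip the
          // surplus threads of the last, partially filled blocks
          if (row < n && col < n) {
            visits[row * n + col]++;
          }
        }
      }
    }
  }
  return visits;
}

// True if every output element is written by exactly one emulated thread.
bool covers_exactly_once(int n, int threads) {
  for (int v : count_visits(n, threads)) {
    if (v != 1) return false;
  }
  return true;
}
```

For example, `covers_exactly_once(1000, 32)` checks a size that 32 does not divide; with plain `N / THREADS` only 31 blocks per dimension would be launched and the last 8 rows and columns would be left uncomputed.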


In [2]:
!echo "Check your GPU version"
!nvidia-smi

Check your GPU version
Sat Oct 28 11:58:49 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [9]:
%%bash

CUDA_SUFF=70 # or CUDA_SUFF=35 for older GPUs
nvcc -gencode arch=compute_${CUDA_SUFF},code=sm_${CUDA_SUFF} ./matrix_add.cu -o matrix_add
./matrix_add

COMPLETED SUCCESSFULLY


In [10]:
%%bash
# ls

nvprof  ./matrix_add

COMPLETED SUCCESSFULLY


==1365== NVPROF is profiling process 1365, command: ./matrix_add
==1365== Profiling application: ./matrix_add
==1365== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   86.82%  14.491ms         1  14.491ms  14.491ms  14.491ms  matrixMul(int const *, int const *, int*, int)
                    9.39%  1.5670ms         2  783.48us  743.80us  823.16us  [CUDA memcpy HtoD]
                    3.79%  633.34us         1  633.34us  633.34us  633.34us  [CUDA memcpy DtoH]
      API calls:   93.85%  293.91ms         3  97.972ms  68.515us  293.77ms  cudaMalloc
                    5.60%  17.526ms         3  5.8420ms  978.63us  15.553ms  cudaMemcpy
                    0.28%  861.92us         1  861.92us  861.92us  861.92us  cuDeviceGetPCIBusId
                    0.20%  640.75us         3  213.58us  205.89us  226.45us  cudaFree
                    0.05%  164.87us       101  1.6320us     190ns  66.125us  cuDeviceGetAttribute
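
To put the profile above into perspective, the reported times can be turned into rough throughput figures. The following is a small C++ sketch of that arithmetic, not something the original notebook contains. The helper names are hypothetical, and two conventions are assumptions of this sketch: one multiply plus one add per inner-loop step is counted as two integer operations, and bandwidth is expressed in binary gigabytes (GiB).

```cpp
#include <cstddef>

// Size in bytes of one n x n matrix of int, as computed in main().
double matrix_bytes(int n) {
  return static_cast<double>(n) * n * sizeof(int);
}

// Transfer rate in GiB/s (1 GiB = 1024^3 bytes; unit choice is an assumption).
double gib_per_second(double bytes, double seconds) {
  return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}

// Naive matrix multiplication performs n^3 multiply-add steps; counting
// each step as 2 operations is a convention chosen for this sketch.
double giga_ops_per_second(int n, double seconds) {
  return 2.0 * n * n * n / seconds / 1e9;
}
```

With the numbers from this run, `giga_ops_per_second(1024, 14.491e-3)` estimates the kernel's integer throughput, and `gib_per_second(2 * matrix_bytes(1024), 1.5670e-3)` estimates the combined rate of the two host-to-device copies.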
        

### What is the difference between ‘GPU activities’ and ‘API calls’ in the results of ‘nvprof’?

Answer from <https://forums.developer.nvidia.com/t/what-is-the-difference-between-gpu-activities-and-api-calls-in-the-results-of-nvprof/71338/1>

The ‘GPU activities’ section lists activities that execute on the GPU, such as CUDA kernels, CUDA memcpy, and CUDA memset operations. The timing information here is the execution time on the GPU.

The ‘API calls’ section lists CUDA Runtime/Driver API calls. The timing information here is the execution time on the host.

For example, CUDA kernel launches are asynchronous from the point of view of the CPU.
The launch call returns immediately, before the kernel has completed and perhaps before it has even started.
The time spent in that call is recorded for the launch API, such as cuLaunchKernel, in the ‘API calls’ section.
Eventually the kernel starts executing on the GPU and runs to completion.
That execution time is recorded for the kernel under ‘GPU activities’.
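
The difference between the two timings can be illustrated with a host-only C++ analogy. This is a sketch and not CUDA code: `std::async` stands in for the asynchronous kernel launch, a sleeping worker thread stands in for the GPU, and the names `Timings` and `launch_and_wait` are hypothetical. The time until the launch call returns plays the role of the ‘API calls’ entry, while the time until the work has finished corresponds to what ‘GPU activities’ would report.

```cpp
#include <chrono>
#include <future>
#include <thread>

struct Timings {
  double launch_ms;  // time until the asynchronous launch call returned
  double total_ms;   // time until the launched work had finished
};

// Start work_ms milliseconds of simulated work on another thread and
// measure how long the launch itself took versus the whole operation.
Timings launch_and_wait(int work_ms) {
  using steady = std::chrono::steady_clock;
  auto t0 = steady::now();

  // Returns right away; the work runs concurrently on a worker thread
  std::future<void> done = std::async(std::launch::async, [work_ms] {
    std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
  });
  auto t1 = steady::now();

  // Block until the work completes (loosely comparable to a synchronizing
  // call such as the device-to-host cudaMemcpy in the program above)
  done.get();
  auto t2 = steady::now();

  std::chrono::duration<double, std::milli> launch = t1 - t0;
  std::chrono::duration<double, std::milli> total = t2 - t0;
  return {launch.count(), total.count()};
}
```

Calling `launch_and_wait(50)` should yield a `launch_ms` far below `total_ms`, mirroring how a kernel launch costs little host time even when the kernel itself runs for milliseconds.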

In [11]:
%%bash
nvprof --print-gpu-trace ./matrix_add --benchmark

COMPLETED SUCCESSFULLY


==1459== NVPROF is profiling process 1459, command: ./matrix_add --benchmark
==1459== Profiling application: ./matrix_add --benchmark
==1459== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
364.99ms  819.58us                    -               -         -         -         -  4.0000MB  4.7662GB/s    Pageable      Device     Tesla T4 (0)         1         7  [CUDA memcpy HtoD]
366.05ms  753.69us                    -               -         -         -         -  4.0000MB  5.1828GB/s    Pageable      Device     Tesla T4 (0)         1         7  [CUDA memcpy HtoD]
366.81ms  14.434ms            (32 32 1)       (32 32 1)        28        0B        0B         -           -           -           -     Tesla T4 (0)         1         7  matrixMul(int const *, int const *, int*, int) [117]
381.26ms  653.47us                    -               -        