<a href="https://colab.research.google.com/github/aditya-malte/Simple-LP1-Codes/blob/master/HPC2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Select Runtime as GPU

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


##Note
`\\n` is used for newlines in the code below instead of `\n` because the code is held in a Python string that is written to a file. The backslash has to be escaped so that the file receives the literal characters `\n` rather than an actual line break.

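The effect of the escaping can be checked directly in Python. The sketch below uses a hypothetical one-statement stand-in for the `code` string (the names `snippet` and `broken` are illustrative only) and prints both variants with `repr` so the difference is visible.

```python
# Hypothetical miniature of the `code` string: a single C++ statement.
snippet = """cout<<"hi\\n";"""   # escaped: the backslash survives, C++ sees \n
broken = """cout<<"hi\n";"""     # unescaped: Python inserts a real line break

# repr() makes the difference visible: the escaped version contains a
# backslash followed by 'n', the unescaped one contains a newline character.
print(repr(snippet))
print(repr(broken))
```

Written to a file, `broken` would split the C++ string literal across two lines, which the compiler rejects. A raw string (`r"""..."""`) is another way to keep the backslash without doubling it.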
##CUDA code explained
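1. M and N are the `width*width` input matrices; P is the output matrix.
2. Each element of P is computed by its own thread: the block index selects the row and the thread index selects the column.
3. All matrices (M, N, P) are stored in row-major format, so element `[row][col]` is at offset `row*width+col`.
4. All variable names starting with `d_` refer to device (GPU) memory in this code.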
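
The thread-to-element mapping and the row-major indexing described above can be emulated on the CPU in plain Python. This is only a sketch under stated assumptions: `matmul_thread` is a hypothetical stand-in for one GPU thread, the nested loops stand in for the `<<<width, width>>>` launch, and the matrices are the ones used in the CUDA code.

```python
width = 4

# Row-major flattening: element [row][col] is stored at index row*width + col.
M = [5, 7, 9, 10,   2, 3, 3, 8,   8, 10, 2, 3,   3, 3, 4, 8]
N = [3, 10, 12, 18,   12, 1, 4, 9,   9, 10, 12, 2,   3, 12, 4, 10]
P = [0] * (width * width)


def matmul_thread(block_idx, thread_idx):
    """Hypothetical stand-in for one GPU thread: computes one element of P."""
    row, col = block_idx, thread_idx
    acc = 0
    for i in range(width):
        acc += M[row * width + i] * N[i * width + col]
    P[row * width + col] = acc


# Emulate the launch: `width` blocks, each with `width` threads.
for block_idx in range(width):
    for thread_idx in range(width):
        matmul_thread(block_idx, thread_idx)

# Print P row by row.
for row in range(width):
    print(*P[row * width:(row + 1) * width])
```

On the GPU the sixteen calls run concurrently rather than in a loop, but the index arithmetic per thread is the same.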

In [0]:
code = """
#include <iostream>
using namespace std;

//One block per row of P and one thread per column, so a launch of
//<<<width, width>>> assigns exactly one thread to each output element
__global__ void matmul(int* M, int* N, int* P, int* width)
{
  int w = *width;           //width is passed as a device pointer, so read it once
  int row = blockIdx.x;
  int col = threadIdx.x;
  int sum = 0;              //accumulate locally instead of in global memory
  for(int i=0; i<w; i++)
  {
    sum += M[row*w+i]*N[i*w+col]; //row-major storage: A[row][col] is A[row*width+col]
  }
  P[row*w+col] = sum;       //write the finished element once
}
//d_xyz in my code means xyz is on the device
int main()
{
  const int width = 4;  //width of the n*n matrices; const so the array sizes below are compile-time constants
  int* d_width;
  cudaMalloc(&d_width, sizeof(int));
  //copy width
  cudaMemcpy(d_width, &width, sizeof(int), cudaMemcpyHostToDevice);
  
  //define input matrices
  int M[width][width] = {{5,7,9,10},
                        {2,3,3,8},  
                        {8,10,2,3},
                        {3,3,4,8}
                        };

  int N[width][width] = {{3,10,12,18},
                        {12,1,4,9},
                        {9,10,12,2},
                        {3,12,4,10}};
  
  //declare output matrix on host side
  int P[width][width];

  int *d_M, *d_N, *d_P;
  cudaMalloc(&d_M, sizeof(int)*width*width);
  cudaMalloc(&d_N, sizeof(int)*width*width);
  cudaMalloc(&d_P, sizeof(int)*width*width);

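  //copy matrices to GPU
  cudaMemcpy(d_M, M, sizeof(int)*width*width, cudaMemcpyHostToDevice);
  cudaMemcpy(d_N, N, sizeof(int)*width*width, cudaMemcpyHostToDevice);
  //P is still uninitialized here and the kernel overwrites every element,
  //so this copy is not required for correctness
  cudaMemcpy(d_P, P, sizeof(int)*width*width, cudaMemcpyHostToDevice);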
  
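  //launch width blocks of width threads each: one thread per element of P
  matmul<<<width, width>>>(d_M, d_N, d_P, d_width);
  //copy the result back; this cudaMemcpy waits for the kernel to finish
  cudaMemcpy(P, d_P, sizeof(int)*width*width, cudaMemcpyDeviceToHost);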
  
  cout<<"The output is:\\n";
  for(int i=0; i<width; i++)
  {
    for(int j=0; j<width; j++)
    {
      cout<<P[i][j]<<" ";
    }
    cout<<"\\n";
  }
  //release all device allocations, including d_width
  cudaFree(d_M);
  cudaFree(d_N);
  cudaFree(d_P);
  cudaFree(d_width);
  return 0;
}
"""


In [0]:
# Write the CUDA source to disk; the with-block closes the file automatically
with open("code.cu", "w") as text_file:
    text_file.write(code)

In [0]:
!nvcc code.cu

In [5]:
!./a.out

The output is:
210 267 236 271 
93 149 104 149 
171 146 172 268 
105 169 128 169 


In [6]:
!nvprof ./a.out

==2494== NVPROF is profiling process 2494, command: ./a.out
The output is:
210 267 236 271 
93 149 104 149 
171 146 172 268 
105 169 128 169 
==2494== Profiling application: ./a.out
==2494== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   43.51%  7.2960us         1  7.2960us  7.2960us  7.2960us  matmul(int*, int*, int*, int*)
                   42.18%  7.0720us         4  1.7680us  1.5680us  2.3040us  [CUDA memcpy HtoD]
                   14.31%  2.4000us         1  2.4000us  2.4000us  2.4000us  [CUDA memcpy DtoH]
      API calls:   99.13%  137.16ms         4  34.290ms  8.0030us  137.13ms  cudaMalloc
                    0.39%  538.83us         1  538.83us  538.83us  538.83us  cuDeviceTotalMem
                    0.21%  295.91us        96  3.0820us     155ns  134.06us  cuDeviceGetAttribute
                    0.14%  187.18us         1  187.18us  187.18us  187.18us  cudaLaunchKernel
                    0.09%  121.28us