# Report Lab 1

Andres Calderon

October 29, 2015

### 1 Code

The following code was used to complete the report:

#### 1.1 kernel.cu

```
/******************************************************************************
 *cr
 *cr            (C) Copyright 2010 The Board of Trustees of the
 *cr                        University of Illinois
 *cr                         All Rights Reserved
 *cr
 ******************************************************************************/

#include <stdio.h>

#define TILE_SIZE 16

__global__ void mysgemm(int m, int n, int k, const float *A, const float *B, float *C) {

    /********************************************************************
     *
     * Compute C = A x B
     *   where A is a (m x k) matrix
     *   where B is a (k x n) matrix
     *   where C is a (m x n) matrix
     *
     * Use shared memory for tiling
     *
     ********************************************************************/

    // INSERT KERNEL CODE HERE

    // Declaring the variables in shared memory...
    __shared__ float A_s[TILE_SIZE][TILE_SIZE];
    __shared__ float B_s[TILE_SIZE][TILE_SIZE];

    // Finding the coordinates for the current thread...
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int col = blockIdx.x * blockDim.x + tx;
    int row = blockIdx.y * blockDim.y + ty;

    float sum = 0.0f;
    for (int i = 0; i < ((k - 1) / TILE_SIZE) + 1; ++i) {
        // Validation for the case where the thread would read outside
        // the dimensions of matrix A...
        if (row < m && (i * TILE_SIZE + tx) < k) {
            A_s[ty][tx] = A[(row * k) + (i * TILE_SIZE + tx)];
        } else {
            // In that case, just write a 0, which will not affect
            // the computation...
            A_s[ty][tx] = 0.0f;
        }
        // Similar validation for B...
        if ((i * TILE_SIZE + ty) < k && col < n) {
            B_s[ty][tx] = B[((i * TILE_SIZE + ty) * n) + col];
        } else {
            B_s[ty][tx] = 0.0f;
        }
        // Wait for all the threads to write to shared memory...
        __syncthreads();

        // Compute the multiplication on the tile...
        for (int j = 0; j < TILE_SIZE; ++j) {
            sum += A_s[ty][j] * B_s[j][tx];
        }
        // Wait for all the threads to finish before moving on to the next phase...
        __syncthreads();
    }
    // Write the final result to C only if it is inside the valid
    // dimensions...
    if (row < m && col < n) {
        C[row * n + col] = sum;
    }
}

void basicSgemm(char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda,
                const float *B, int ldb, float beta, float *C, int ldc)
{
    if ((transa != 'N') && (transa != 'n')) {
        printf("unsupported value of 'transa'\n");
        return;
    }

    if ((transb != 'N') && (transb != 'n')) {
        printf("unsupported value of 'transb'\n");
        return;
    }

    if ((alpha - 1.0f > 1e-10) || (alpha - 1.0f < -1e-10)) {
        printf("unsupported value of alpha\n");
        return;
    }

    if ((beta - 0.0f > 1e-10) || (beta - 0.0f < -1e-10)) {
        printf("unsupported value of beta\n");
        return;
    }

    const unsigned int BLOCK_SIZE = TILE_SIZE;

    // Initialize thread block and kernel grid dimensions
    const dim3 dim_block(BLOCK_SIZE, BLOCK_SIZE, 1);
    const dim3 dim_grid(((n - 1) / BLOCK_SIZE) + 1, ((m - 1) / BLOCK_SIZE) + 1, 1);

    // Calling the kernel with the above-mentioned setting...
    mysgemm<<<dim_grid, dim_block>>>(m, n, k, A, B, C);
}
```
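The boundary handling in the listing (zero-filling tile entries that fall outside A or B, and rounding the grid and phase counts up with `((x - 1) / d) + 1`) can be checked on the host without a GPU. The following is a minimal sketch, not part of the original lab code: it emulates the load, compute, and write-back phases sequentially and compares the result with a naive triple loop. The function names (`ceilDiv`, `matmulNaive`, `matmulTiledHost`, `tiledMatchesNaive`) and the integer-valued test data are assumptions introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Ceiling division with the same ((x - 1) / d) + 1 idiom used in the listing.
// Only valid for x >= 1 and d >= 1.
int ceilDiv(int x, int d) {
    assert(x >= 1 && d >= 1);
    return ((x - 1) / d) + 1;
}

// Naive reference: C = A * B with row-major A (m x k), B (k x n), C (m x n).
std::vector<float> matmulNaive(int m, int n, int k,
                               const std::vector<float>& A,
                               const std::vector<float>& B) {
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) sum += A[row * k + p] * B[p * n + col];
            C[row * n + col] = sum;
        }
    }
    return C;
}

// Sequential emulation of the tiled scheme. The (by, bx) loops stand in for
// the grid of thread blocks and the (ty, tx) loops for the threads of a block.
std::vector<float> matmulTiledHost(int m, int n, int k, int tile,
                                   const std::vector<float>& A,
                                   const std::vector<float>& B) {
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    std::vector<float> As(tile * tile), Bs(tile * tile), acc(tile * tile);
    for (int by = 0; by < ceilDiv(m, tile); ++by) {
        for (int bx = 0; bx < ceilDiv(n, tile); ++bx) {
            std::fill(acc.begin(), acc.end(), 0.0f);
            for (int i = 0; i < ceilDiv(k, tile); ++i) {
                // Load phase: out-of-range entries become 0, as in the kernel.
                for (int ty = 0; ty < tile; ++ty) {
                    for (int tx = 0; tx < tile; ++tx) {
                        int row = by * tile + ty, col = bx * tile + tx;
                        int aCol = i * tile + tx, bRow = i * tile + ty;
                        As[ty * tile + tx] = (row < m && aCol < k) ? A[row * k + aCol] : 0.0f;
                        Bs[ty * tile + tx] = (bRow < k && col < n) ? B[bRow * n + col] : 0.0f;
                    }
                }
                // Compute phase: partial dot product over the current tile.
                for (int ty = 0; ty < tile; ++ty)
                    for (int tx = 0; tx < tile; ++tx)
                        for (int j = 0; j < tile; ++j)
                            acc[ty * tile + tx] += As[ty * tile + j] * Bs[j * tile + tx];
            }
            // Write-back phase: only positions inside C are stored.
            for (int ty = 0; ty < tile; ++ty) {
                for (int tx = 0; tx < tile; ++tx) {
                    int row = by * tile + ty, col = bx * tile + tx;
                    if (row < m && col < n) C[row * n + col] = acc[ty * tile + tx];
                }
            }
        }
    }
    return C;
}

// Fills A and B with small integers (so float sums are exact) and compares
// the tiled emulation with the naive reference element by element.
bool tiledMatchesNaive(int m, int n, int k, int tile) {
    std::vector<float> A(static_cast<std::size_t>(m) * k), B(static_cast<std::size_t>(k) * n);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = static_cast<float>(static_cast<int>(i % 7) - 3);
    for (std::size_t i = 0; i < B.size(); ++i) B[i] = static_cast<float>(static_cast<int>(i % 5) - 2);
    return matmulNaive(m, n, k, A, B) == matmulTiledHost(m, n, k, tile, A, B);
}
```

Calling `tiledMatchesNaive(5, 7, 9, 4)` exercises the case where none of the dimensions is a multiple of the tile size, which is where the zero-padding branches matter. The emulation says nothing about performance; it only mirrors the indexing logic.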

### 2 Answers to Questions

1. In your kernel implementation, how many threads can be simultaneously executing? Assume a GeForce GTX 280 GPU, which has 30 streaming multiprocessors.
2. Use `nvcc --ptxas-options="-v"` to report the resource usage of your implementation. Note that the compilation will fail, but you will still get a report of the relevant information. Experiment with the Nvidia Visual Profiler, which is part of the CUDA Toolkit, and use it to further understand the resource usage. In particular, report your branch divergence behavior and whether your memory accesses are coalesced.
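One way to approach question 1 is to work out how many thread blocks fit on one streaming multiprocessor and multiply by the 30 multiprocessors. The sketch below is an illustration under stated assumptions, not the report's measured answer. It assumes the compute capability 1.3 per-multiprocessor limits (1024 resident threads, 8 resident blocks, 16 KB of shared memory, 16384 registers) and ignores allocation granularity. The struct and function names are hypothetical, and the registers-per-thread argument is a placeholder that has to be replaced with the value from the ptxas report.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. The values in gtx280Limits() are the compute
// capability 1.3 figures as assumed in the text above.
struct SmLimits {
    int numSMs;
    int maxThreadsPerSM;
    int maxBlocksPerSM;
    int sharedBytesPerSM;
    int registersPerSM;
};

SmLimits gtx280Limits() {
    return {30, 1024, 8, 16 * 1024, 16384};
}

// Number of blocks that can be resident on one multiprocessor, taking the
// tightest of the thread, block, shared-memory and register limits.
// Allocation granularity is ignored, so this is an upper-bound estimate.
int residentBlocksPerSM(const SmLimits& lim, int threadsPerBlock,
                        int sharedBytesPerBlock, int regsPerThread) {
    assert(threadsPerBlock > 0 && sharedBytesPerBlock > 0 && regsPerThread > 0);
    int byThreads = lim.maxThreadsPerSM / threadsPerBlock;
    int byShared = lim.sharedBytesPerSM / sharedBytesPerBlock;
    int byRegs = lim.registersPerSM / (regsPerThread * threadsPerBlock);
    return std::min({lim.maxBlocksPerSM, byThreads, byShared, byRegs});
}

// Threads that can be resident on the whole device at once.
int residentThreads(const SmLimits& lim, int threadsPerBlock,
                    int sharedBytesPerBlock, int regsPerThread) {
    return lim.numSMs *
           residentBlocksPerSM(lim, threadsPerBlock, sharedBytesPerBlock, regsPerThread) *
           threadsPerBlock;
}

// Static shared memory declared by the kernel: two TILE_SIZE x TILE_SIZE
// float tiles (A_s and B_s).
int tileSharedBytes(int tileSize) {
    return 2 * tileSize * tileSize * static_cast<int>(sizeof(float));
}
```

With `TILE_SIZE` 16, a block has 256 threads and `tileSharedBytes(16)` is 2048 bytes. With a placeholder of 16 registers per thread, `residentThreads(gtx280Limits(), 256, 2048, 16)` gives 30720, because the thread limit allows 4 blocks per multiprocessor. With a placeholder of 20 registers per thread, the register limit drops that to 3 blocks and 23040 threads. The shared-memory figure reported by ptxas may be slightly higher than 2048 bytes, so the reported value should be used where it differs.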



Figure 1: NVVP performance analysis for sgemm-tiled.



Figure 2: NVVP performance analysis for sgemm.





Figure 3: First performance comparison between the tiled and non-tiled versions.

3. Compare the performance of the tiled matrix multiplication to the simple matrix multiplication as you increase the size of the matrices and for different tile sizes. Explain any trends that you see.
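A rough way to reason about the trends asked for in question 3 is to count global-memory loads. The sketch below is a simplified model, not a measurement. It assumes the simple version uses one thread per element of C, reading a full row of A and a full column of B from global memory, and that the tiled version follows the listing in Section 1.1, where each in-range element of A is loaded once per block column and each in-range element of B once per block row. Caches, coalescing, and occupancy are not modeled, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for 64-bit counts, valid for x >= 1 and d >= 1.
std::int64_t ceilDiv64(std::int64_t x, std::int64_t d) {
    assert(x >= 1 && d >= 1);
    return ((x - 1) / d) + 1;
}

// Assumed simple kernel: each of the m * n threads reads k elements of A
// and k elements of B directly from global memory.
std::int64_t globalLoadsNaive(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2 * m * n * k;
}

// Tiled kernel: every in-range element of A is loaded once by each block
// column, and every in-range element of B once by each block row.
std::int64_t globalLoadsTiled(std::int64_t m, std::int64_t n, std::int64_t k,
                              std::int64_t tile) {
    std::int64_t loadsA = m * k * ceilDiv64(n, tile);
    std::int64_t loadsB = k * n * ceilDiv64(m, tile);
    return loadsA + loadsB;
}
```

For square 1024 x 1024 matrices this model gives 2147483648 loads for the simple version, 134217728 for a tile size of 16, and 67108864 for a tile size of 32, so the load count shrinks by a factor equal to the tile size when the dimensions are multiples of it. Whether measured run times follow the same ratio depends on factors outside this model.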




Figure 4: Second performance comparison between the tiled and non-tiled versions.



Figure 5: Performance using different values of TILE\_SIZE.



Figure 6: Performance of TILE\_SIZE 16 and 32 with more data.