# Report Lab 2

Andres Calderon - SID:861243796

October 30, 2015

### 1 Code

The following code was used to complete the report:

#### 1.1 kernel.cu

```
2
                  (C) Copyright 2010 The Board of Trustees of the
3
     *cr
                             University of Illinois
                              All Rights Reserved
    #include <stdio.h>
9
10
    #define TILE_SIZE 16
11
12
    __global__ void mysgemm(int m, int n, int k, const float *A, const float *B, float *C) {
13
       15
16
        * Compute C = A \times B
17
           where A is a (m x k) matrix
18
19
           where B is a (k \times n) matrix
           where C is a (m x n) matrix
20
21
        * Use shared memory for tiling
23
        *************************
25
       // INSERT KERNEL CODE HERE
26
27
       // Declaring the variables in shared memory...
      __shared__ float A_s[TILE_SIZE][TILE_SIZE];
28
      __shared__ float B_s[TILE_SIZE][TILE_SIZE];
29
      // Finding the coordinates for the current thread...
31
32
     int tx = threadIdx.x;
     int ty = threadIdx.y;
33
     int col = blockIdx.x * blockDim.x + tx;
34
35
     int row = blockIdx.y * blockDim.y + ty;
36
37
     float sum = 0.0f;
     for(int i = 0; i < ((k - 1) / TILE_SIZE) + 1; ++i){</pre>
39
40
       \ensuremath{/\!/} Validation in the case the thread tries to write in share
       // memory a value outside the dimensions of matrix A...
41
       if(row < m && (i * TILE_SIZE + tx) < k){</pre>
42
         A_s[ty][tx] = A[(row * k) + (i * TILE_SIZE + tx)];
43
       } else {
44
         /\!/ In that case, just write a 0 which will no affect
45
         // the computation...
         A_s[ty][tx] = 0.0f;
47
```

```
48
          // Similar validation for B...
49
         \label{eq:colored} \mbox{if((i * TILE\_SIZE + ty) < k \&\& col < n){}} \{
50
           B_s[ty][tx] = B[((i * TILE_SIZE + ty) * n) + col];
51
         } else {
52
            B_s[ty][tx] = 0.0f;
53
54
         // Wait for all the threads to write in share memory
55
56
         __syncthreads();
57
          // Compute the multiplication on the tile...
         for(int j = 0; j < TILE_SIZE; ++j){</pre>
59
           sum += A_s[ty][j] * B_s[j][tx];
60
61
         // Wait to finish before to go ahead with the next phase...
62
63
         __syncthreads();
64
        // Write the final result in C just if it is inside of the valid
65
        // dimensions...
66
       if(row < m && col < n){
67
         C[row * n + col] = sum;
68
69
70
71
     void basicSgemm(char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda, const
72
          float *B, int ldb, float beta, float *C, int ldc)
73
         if ((transa != 'N') && (transa != 'n')) {
74
         printf("unsupported value of 'transa'\n");
75
76
         return;
77
78
         if ((transb != 'N') && (transb != 'n')) {
79
         printf("unsupported value of 'transb'\n");
80
81
         return:
82
83
         if ((alpha - 1.0f > 1e-10) || (alpha - 1.0f < -1e-10)) {
84
         printf("unsupported value of alpha\n");
85
         return:
86
87
88
         if ((beta - 0.0f > 1e-10) || (beta - 0.0f < -1e-10)) {
89
         printf("unsupported value of beta\n");
90
91
         return:
         }
         const unsigned int BLOCK_SIZE = TILE_SIZE;
93
94
         // Initialize thread block and kernel grid dimensions
95
         const dim3 dim_block(BLOCK_SIZE, BLOCK_SIZE, 1);
96
         const dim3 dim_grid(((n - 1) / BLOCK_SIZE) + 1, ((m - 1) / BLOCK_SIZE) + 1, 1);
97
98
          // Calling the kernel with the above-mentioned setting...
99
         mysgemm<<<dim_grid, dim_block>>>(m, n, k, A, B, C);
100
101
```

Validation in line 40 is based in the code explained in [3].

## 2 Answer to Questions

1. In your kernel implementation, how many threads can be simultaneously executing? Assume a GeForce GTX 280 GPU which has 30 streaming multiprocessors.

According to [5] GeForce GTX 280 allows up to 1024 threads per Streaming Multiprocessor (SM). In my implementation I am using a BLOCK\_SIZE (TILE\_SIZE) of  $16 \times 16 = 256$  threads per block so each SM can allocate 4 blocks. With 30 SMs in total, the number of threads simultaneously executing is equal to:

```
16 \times 16 \times 4 \times 30 = 30720.
```

2. Use nvcc -ptxas-options="-v" to report the resource usage of your implementation your implementation. Note that the compilation will fail but you will still get a report of the relevant information. Experiment with the Nvidia visual profiler, which is part of the CUDA toolkit, and use it to further understand the resource usage. In particular, report your branch divergence behavior and whether your memory accesses are coalesced.

The -ptxas-options="-v" option was included in the Makefile as shows figure 1 (line 2). The output result can be seen in figure 2. In line 5, it shows the number of registers allocated (12), the size of share memory (around 2 Kb) and the number of bytes used for constant memory (12 bytes). It is consistent with the use of two floating-point arrays in share memory for matrices A and B, each of  $16 \times 16$  (TILE\_SIZE). The remaining bytes and size of constant memory could be explained by the use of kernel arguments and internal operations.

```
NVCC
1
                   = nvcc
       NVCC FLAGS
                   = --ptxas-options="-v" -03 -I/usr/local/cuda/include
2
                   = -lcudart -L/usr/local/cuda/lib64
       LD FLAGS
3
       EXE
                    = sgemm-tiled
4
                    = main.o support.o
5
6
7
       default: $(EXE)
8
9
       main.o: main.cu kernel.cu support.h
         $(NVCC) -c -o $@ main.cu $(NVCC_FLAGS)
10
11
       support.o: support.cu support.h
12
         $(NVCC) -c -o $@ support.cu $(NVCC_FLAGS)
13
14
       $(EXE): $(OBJ)
15
         $(NVCC) $(OBJ) -o $(EXE) $(LD_FLAGS)
16
17
18
         rm -rf *.o $(EXE)
19
```

Figure 1: Content of Makefile.

```
storm.ee.ucr.edu /home/tempmaj/classacc2391/PhD/Y1Q1/GPU/lab2 $ make
1
      nvcc -c -o main.o main.cu --ptxas-options="-v" -03 -I/usr/local/cuda/include
2
3
      ptxas info
                    : 0 bytes gmem
     ptxas info
                    : Compiling entry function '_Z7mysgemmiiiPKfS0_Pf' for 'sm_10'
4
      ptxas info
5
                    : Used 12 registers, 2104 bytes smem, 12 bytes cmem[1]
      nvcc main.o support.o -o sgemm-tiled -lcudart -L/usr/local/cuda/lib64
6
      storm.ee.ucr.edu /home/tempmaj/classacc2391/PhD/Y1Q1/GPU/lab2 $
```

Figure 2: Output of compilation using -ptxas-options="-v".

Using the Nvidia Visual Profiler (NVVP) it was possible to extract valuable information about the performance of the code. A very gentle introduction to the use of NVVP is available at [4]. NVVP 5.0 (available at storm.ee.ucr.edu) was used to analyze the implementation. For TILE\_SIZE=16 the results are shown in figure 3. All the tests were run using the default parameters.

We can see that the values for Branch Divergence Overhead, which measures the instruction issue overhead caused by divergent branches, and Total Replay Overhead, the percentage of instruction issues due to memory replays, are relatively small in comparison with the same values for the no tiled version (figure 4). Similarly, the value of Global Memory Replay Overhead, the percentage of instruction issues due to replays for non-coalesced global memory accesses, is consistently smaller in the tiled version. However, as we can see for the Global Load Efficiency metric, the use of global memory bandwidth is more efficient for the simple implementation (61.2%) than for the tiled implementation (39.8%).



Figure 3: NVVP performance analysis for sgemm-tiled.



Figure 4: NVVP performance analysis for sgemm.

```
#!/bin/bash
1
2
      date
      for i in 'seq 50 50 1000'; do
3
4
         ./sgemm-tiled $i
         ./sgemm-tiled $i
5
         ./sgemm-tiled $i
6
7
      done
      STRING="Done!!!"
8
      echo $STRING
9
      date
```

Figure 5: Script to collect test data.

3. Compare the performance of the The Tiled Matrix multiplication to the simple matrix multiplication as you increase the size of the matrices and for different tile sizes. Explain any trends that you see.

In order to collect data, the script in figure 5 was used to run three iterations of the implementations with different values of N (size of a square matrix). The results of the iterations were saved to log files using the following command-line function:

```
storm.ee.ucr.edu /home/tempmaj/classacc2391/PhD/Y1Q1/GPU/lab2 $ ./test.sh > \leftrightarrow testing_tiled_1K-20K_T16.log
```

Then, the times for kernel execution are extracted using:

```
storm.ee.ucr.edu /home/tempmaj/classacc2391/PhD/Y1Q1/GPU/lab2 $ more

testing_tiled_1K-20K_T16.log | grep 'Launching kernel...' | grep -Po '\d+.\d+' >

times_tiled_1K-20K_T16.dat
```

The files were processed in  $\mathbb{R}^1$  to get the average of the three iterations and generate some plots. All data, code and figures are available at [1].

Figures 6 and 7 show the performance between both tiled and no tiled implementations. The tests were set using different values of N for square matrices. Overall, the tiled implementation outperforms the simple one in small and relatively bigger datasets. Although, the tiled implementation shows a lower global load efficiency, it seems that a more appropriate control of branch divergence and coalesced access give better results.

Figure 8 shows the performance for different values of TILE\_SIZE. The general trend shows that the larger the size of the tile the better performance. However, there is not significant difference between sizes 16 and 32. Figure 9 shows an additional test with bigger values of N for just these two values. It seems there is not an increase in performance using a size of 32 over 16. That is explained by the fact that a TILE\_SIZE equal to 32 allows 1024 threads per block, but the hardware limitation for each Streaming Multiprocessor (SM) is up to 1536 threads [2]. So, under this configuration just one block (1024 threads) is actively executing.

#### References

- [1] Andres Calderon. GitHub Personal Repository, 2015. https://github.com/aocalderon/PhD/tree/master/Y1Q1/GPU/lab2.
- [2] David Kirk and Wen-Mei Hwu. Programming Massively Parallel Processors: A Hands-On Approach. Morgan Kaufmann, 2012.

<sup>1</sup>https://www.r-project.org/



Figure 6: First performance comparison between tiled and no tiled versions (N ranges from 50 to 1000).



Figure 7: Second performance comparison between tiled and no tiled versions (N ranges from 1000 to 20000).



Figure 8: Performance using different values of TILE\_SIZE.



Figure 9: Performance of TILE\_SIZE 16 y 32 with more data.

- [3] Wen-Mei Hwu. A Tiled Kernel for Arbitrary Matrix Dimensions Heterogeneous Parallel Programming. Coursera Course, 2015. https://www.dropbox.com/s/4y06b1m6dozp2kt/2%20-%208%20-%202.8-%20A%20Tiled%20Kernel%20for%20Arbitrary%20Matrix%20Dimensions.mp4?dl=0.
- [4] David Luebke, John Owens, Mike Roberts and Cheng-Han Lee. *Using NVVP Part1 and Part 2 Intro to Parallel Programming*. Udacity Course, 2015. https://www.youtube.com/watch?v=hyKA5fb5ZJI.
- [5] Nvidia Corporation. GeForce® GTX 200 GPU Architectural Overview. Technical Brief #TB-04044-001\_v01, 2008. http://www.nvidia.com/docs/IO/55506/GeForce\_GTX\_200\_GPU\_Technical\_Brief.pdf.