3.36pt





# Memory Access Coalescing

Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur

March 25, 2020



### Recap: Memory Spaces



Figure: Global Memory Accesses



### Access Scopes







# Memory Access Types

Latency of accesses differ for different memory spaces

- ► Global Memory (accessible by all threads) is the slowest
- ► Shared Memory (accessible by threads in a block) is very fast.
- ► Registers (accessible by one thread) is the fastest.



# Warp Requests to Memory

- ► The GPU coalesces global memory loads and stores requested by a warp of threads into global memory transactions.
- ► A warp typically requests 32 aligned 4 byte words in one global memory transaction.
- ► Reducing number of global memory transactions by warps is one of the keys for optimizing execution time
- ► Efficient memory access expressions must be designed by the user for the same.







warp 0

5 6

A[4] A[5] A[6] A[7]

A[12] A[13] A[14] A[15]

A[20] A[21] A[22] A[23]

A[28] A[29] A[30] A[31]





warp 0

4 5 6

A[4] A[5] A[6]

A[12] A[13] A[14] A[15]

A[20] A[21] A[22] A[23]

A[28] A[29] A[30] A[31]



1 global memory transaction for read 1 global memory transaction for write

|   | A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|---|-------|-------|-------|-------|-------|-------|-------|-------|
|   | A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
| ′ | A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
|   | A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |
|   |       |       |       |       |       |       |       |       |

13

14 15





{
 int tid= blockDim.x \* blockIdx.x + threadIdx.x;
 a[tid] = a[tid] + 1;
}
 gl
 me

1 global memory transaction for read 1 global memory transaction for write global memory

| • | A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|---|-------|-------|-------|-------|-------|-------|-------|-------|
|   | A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
|   | A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
|   | A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |

warp 2

18 19 20 21 22 23







A[24] A[25] A[26] A[27] A[28] A[29] A[30] A[31]

warp 0



6



```
__global__ void offset_access(float* a, int s)
 int tid= blockDim.x * blockIdx.x + threadIdx.x;
 a[tid+s] = a[tid+s] + 1;
```

Misaligned offset access: s=1

2 global memory transactions for read

2 global memory transactions for write

|        | A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|
| global | A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
| memory | A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
|        | A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |





global

```
__global__ void offset_access(float* a, int s)
 int tid= blockDim.x * blockIdx.x + threadIdx.x:
 a[tid+s] = a[tid+s] + 1;
                                                 memory
```

Misaligned offset access: s=1

2 global memory transactions for read

2 global memory transactions for write

|   | A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|---|-------|-------|-------|-------|-------|-------|-------|-------|
| y | A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
|   | A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
|   | A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |





```
__global__ void offset_access(float* a, int s)
 int tid= blockDim.x * blockIdx.x + threadIdx.x;
 a[tid+s] = a[tid+s] + 1;
```

Aligned offset access: s=8

- 1 global memory transaction for read
- 1 global memory transaction for write

|        | A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|
| global | A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
| memory | A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
|        | A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |







warp 0

5 6

A[4] A[5] A[6] A[7]

A[12] A[13] A[14] A[15]

A[20] A[21] A[22] A[23]

A[28] A[29] A[30] A[31]







```
__global__ void strided_access(float* a, int s)
{
  int tid= blockDim.x * blockIdx.x + threadIdx.x;
  a[tid*s] = a[tid*s] + 1;
}
  global
  memor
```

Misaligned strided access: s=2

- 2 global memory transactions for read
- 2 global memory transactions for write

|        | A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|
| global | A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
| memory | A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
|        | A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |





```
__global__ void strided_access(float* a, int s)
{
  int tid= blockDim.x * blockIdx.x + threadIdx.x;
  a[tid*s] = a[tid*s] + 1;
}
```

global memory

| A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|-------|-------|-------|-------|-------|-------|-------|-------|
| A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
| A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
| A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |

warp 0

Misaligned strided access: s=4

- 2 global memory transactions for read
- 2 global memory transactions for write





```
__global__ void strided_access(float* a, int s)
{
  int tid= blockDim.x * blockIdx.x + threadIdx.x;
  a[tid*s] = a[tid*s] + 1;
}
```

global memory

|   | A[0]  | A[1]  | A[2]  | A[3]  | A[4]  | A[5]  | A[6]  | A[7]  |
|---|-------|-------|-------|-------|-------|-------|-------|-------|
|   | A[8]  | A[9]  | A[10] | A[11] | A[12] | A[13] | A[14] | A[15] |
| / | A[16] | A[17] | A[18] | A[19] | A[20] | A[21] | A[22] | A[23] |
|   | A[24] | A[25] | A[26] | A[27] | A[28] | A[29] | A[30] | A[31] |

warp 0

Misaligned strided access: s=4

4 global memory transactions for read

4 global memory transactions for write



# Profiling

- ► Profiling can be performed using the CUDA event API.
- ► CUDA events are of type cudaEvent\_t
- Events are created using cudaEventCreate() and destroyed using cudaEventDestroy()
- Events can record timestamps using cudaEventRecord()
- ► The time elapsed between two recorded events is done using cudaEventElapsedTime()



#### Driver Code: Offset Access

```
cudaEvent_t startEvent, stopEvent;
float ms:
int blockSize = 1024:
int n = nMB*1024*1024/sizeof(float); //nMB=128
cudaMalloc(&d_a, n * sizeof(float));
for (int i = 0; i <= 32; i++)
        cudaMemset(d_a, 0.0, n * sizeof(float));
        cudaEventRecord(startEvent);
        offset_access << n/blockSize, blockSize >> (d_a, i):
        cudaEventRecord(stopEvent);
        cudaEventSynchronize(stopEvent);
        cudaEventElapsedTime(&ms, startEvent, stopEvent);
        printf("%d, %fn", i, 2*nMB/ms);
```

Source:

https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/



#### Driver Code: Strided Access

```
cudaEvent_t startEvent, stopEvent;
float ms:
int blockSize = 1024:
int n = nMB*1024*1024/sizeof(float); //nMB=128
cudaMalloc(&d_a, n * 33 * sizeof(float));
for (int i = 0; i <= 32; i++)
        cudaMemset(d_a, 0.0, n * sizeof(float));
        cudaEventRecord(startEvent);
        offset_access << n/blockSize, blockSize >> (d_a, i):
        cudaEventRecord(stopEvent);
        cudaEventSynchronize(stopEvent);
        cudaEventElapsedTime(&ms, startEvent, stopEvent);
        printf("%d, %fn", i, 2*nMB/ms);
```

Source:

https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/





Figure: Memory Bandwidth Plot



# Using Shared Memory

- ► Applications typically require different threads to access the same data over and over again (data reuse)
- ► Redundant global memory accesses can be avoided by loading data into shared memory.



# Using Shared Memory

- ► Each SM typically has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory.
- ► Settings are typically 48KB shared memory / 16KB L1 cache, and 16KB shared memory / 48KB L1 cache. By default the 48KB shared memory setting is used.
- ► This can be configured during runtime API from the host for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig()



# Recap: Matrix Multiplication Kernel

```
__global__
void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int N){
int i=blockIdx.y*blockDim.y+threadIdx.y;
int j=blockIdx.x*blockDim.x+threadIdx.x;
if ((i<N) && (j<N)) {
  float Pvalue = 0.0;
  for (int k = 0; k < N; ++k) {
     Pvalue += d_M[i*N+k]*d_N[k*N+j];
  }
  d_P[i*N+j] = Pvalue;
}
</pre>
```



# Recap Matrix Multiplication Kernel

- ▶ Number of threads launched is equal to the number of elements in the matrix
- ► The same row and column is accessed multiple times by different threads.
- Redundant global memory accesses are a bottleneck to performance



# Recap: Matrix Multiplication Kernel



# Total Mem. accesses required

$$= N^2 (N + N/32)$$

$$\approx N^3$$



# Matrix Multiplication Kernel using Tiling

An alternative strategy is to use shared memory for reducing global memory traffic

- Partition the data into subsets called tiles so that each tile fits into shared memory
- ► Threads in a block collaboratively load tiles into shared memory before they use the elements for the dot-product calculation









### Matrix Multiplication Kernel using Tiling

```
__global__
void MatrixMulKernel(float* d_M, float* d_N, float* d_P,int Width) {.

    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
```



```
int Row = by * TILE_WIDTH + ty;
int Col = bx * TILE_WIDTH + tx;
float Pvalue = 0;
for (int m = 0; m < Width/TILE_WIDTH; ++m) {
  Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
  Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
  __syncthreads();
  for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += Mds[ty][k] * Nds[k][tx];
  __syncthreads();
}
d_P[Row*Width + Col] = Pvalue;
}</pre>
```





Figure: Load and compute tiles in shared memory









# Tranpose Operation







# Matrix Transpose CPU only

Professional CUDA C Programming by Cheng et al.



#### Matrix Transpose GPU Kernel- Naive Row

```
__global__ void transposeNaiveRow(float *out, float *in, const int nx, int ny)
{
  unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
  if (ix < nx && iy < ny) {
    out[ix * ny + iy] = in[iy * nx + ix];
  }
}</pre>
```

Loads by rows and stores by columns



#### Matrix Transpose GPU Kernel- Naive Col

```
__global__ void transposeNaiveRow(float *out, float *in, const int nx,int ny)
{
  unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
  if (ix < nx && iy < ny) {
    out[iy*nx + ix] = in[ix*ny + iy];
  }
}</pre>
```

Loads by columns and stores by rows





```
int main(int argc, char **argv)
  // set up device
   int dev = 0:
   cudaDeviceProp deviceProp;
   CHECK (cudaGetDeviceProperties (&deviceProp, dev));
   printf("%s starting transpose at ", argv[0]);
   printf("device %d: %s ", dev, deviceProp.name);
   CHECK (cudaSetDevice(dev));
  // set up array size 8192*8192
   int nx = 1 << 13;
   int nv = 1 << 13:
   // select a kernel and block size
   int iKernel = 0:
   int blockx = 32:
   int blocky = 32;
   if (argc > 1) iKernel = atoi(argv[1]);
```



```
size_t nBytes = nx * ny * sizeof(float);
// execution configuration
dim3 block (blockx, blocky);
dim3 grid ((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y
    ):
// allocate host memory
float *h_A = (float *)malloc(nBytes);
float *hostRef = (float *)malloc(nBytes);
float *gpuRef = (float *)malloc(nBytes);
// initialize host array
initialData(h_A, nx * ny);
// allocate device memory
float *d_A, *d_C:
CHECK(cudaMalloc((float**)&d_A, nBytes));
CHECK(cudaMalloc((float**)&d_C, nBytes));
// copy data from host to device
CHECK(cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice));
```



```
// kernel pointer and descriptor
void (*kernel)(float *, float *, int, int);
char *kernelName;
// set up kernel
switch (iKernel)
   case 0:
     kernel = &transposeNaiveRow; kernelName = "NaiveRow"; break;
   case 1:
     kernel = &transposeNaiveCol; kernelName = "NaiveCol"; break:
   // run kernel
   kernel <<< grid, block>>>(d_C, d_A, nx, ny);
        CHECK(cudaGetLastError()):
        CHECK(cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost));
```



## Profile using NVPROF

- ▶ nvprof is a command-line profiler available for Linux, Windows, and OS X.
- nvprof is able to collect statistics pertaining to multiple events/metrics at the same time.
- ▶ nvprof is a standalonetool and does not require the programmer to use the CUDA events API.



#### Execute Code: NaiveRow

```
nvprof –devices 0 –metrics gst throughput, gld throughput ./transpose 0
==108029== NVPROF is profiling process 108029, command: ./transpose 0
./transpose starting transpose at device 0: Tesla K40m with matrix nx 8192 ny
    8192 with kernel 0
==108029== Some kernel(s) will be replayed on device 0 in order to collect all
     events/metrics.
==108029== Replaying kernel "transposeNaiveRow(float*, float*, int, int)" (
    done)
==108029== Metric result:
Invocations Metric Name
                              Metric Description
                                                        Min
                                                                   Max
Device "Tesla K40m (0)"
Kernel: transposeNaiveRow(float*, float*, int, int)
           gst_throughput
                           Global Store Throughput 249.37GB/s 249.37GB/s
           gld_throughput
                           Global Load Throughput 31.171GB/s 31.171GB/s
```



#### Execute Code: NaiveCol

```
nvprof –devices 0 –metrics gst throughput, gld throughput ./transpose 1
==108037== NVPROF is profiling process 108037, command: ./transpose 1
./transpose starting transpose at device 0: Tesla K40m with matrix nx 8192 ny
    8192 with kernel 1
==108037== Some kernel(s) will be replayed on device 0 in order to collect all
     events/metrics.
==108037== Replaying kernel "transposeNaiveCol(float*, float*, int, int)" (
    done)
==108037== Metric result:
Invocations Metric Name Metric Description
                                                      Min
                                                                  Max
Device "Tesla K40m (0)"
Kernel: transposeNaiveCol(float*, float*, int, int)
         gst_throughput Global Store Throughput 17.421GB/s 17.421GB/s
          gld_throughput Global Load Throughput 139.37GB/s 139.37GB/s
```



## Using Nvidia Visual Profiler

- ► The nvvp software provides a GUI based tool for analyzing CUDA applications and supports a guided analysis mode for optimizing kernels.
- ▶ nvprof provides a —analysis-metrics option to capture all GPU metrics for use by NVIDIA Visual Profiler software during its guided analysis mode.
- ► The -o flag can be used with nvprof to dump a logs file that can be imported into nvvp.



## Naive Row Kernel Profiling Analysis





## Naive Col Kernel Profiling Analysis





## Compute Analysis





#### Memory Bandwidth Analysis: Naive Row





### Memory Bandwidth Analysis: Naive Col





#### Latency Analysis in NVVP

Instruction stalls prevents warps from executing on any given cycle and are of the following types.

- ▶ Pipeline busy: The compute resources required by the instruction is not available.
- ► Constant: A constant load is blocked due to a miss in the constants cache.
- ► Memory Throttle: Large number of pending memory operations prevent further forward progress.
- ► **Texture:** The texture subsystem is fully utilized or has too many outstanding requests.
- ► Synchronization: The warp is blocked at a \_\_syncthreads() call.



#### Latency Analysis in NVVP

Instruction stalls prevents warps from executing on any given cycle and are of the following types.

- ▶ Instruction Fetch: The next assembly instruction has not yet been fetched.
- **Execution Dependency:** An input required by the instruction is not yet available.
- ► Memory Dependency: A load/store cannot be made because the required resources are not available, or are fully utilized, or too many requests of a given type are oustanding.
- ▶ Not Selected: Warp was ready to issue, but some other warp was issued instead.



# Latency Analysis





#### Transpose using Shared Memory

```
#define TILE DIM 32
#define BLOCK ROWS 32
__global__ void transposeCoalesced(float *odata, float *idata, const int nx,
   const int ny)
  __shared__ float tile[TILE_DIM][TILE_DIM];
 int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;
 for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)</pre>
    tile[threadIdx.v+i][threadIdx.x] = idata[(v+i)*width + x];
 __svncthreads():
```

Source: https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/



#### Transpose using Shared Memory

```
x = blockIdx.y * TILE_DIM + threadIdx.x; // transpose block offset
y = blockIdx.x * TILE_DIM + threadIdx.y;

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
   odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];</pre>
```



#### Execute Code: TransposeCoalesced

```
./transpose 2

==108373== NVPROF is profiling process 108373, command: ./transpose 2
./transpose starting transpose at device 0: Tesla K40m with matrix nx 8192 ny 8192 with kernel 2
==108373== Metric result:
Invocations Metric Name Metric Description Min Max
Device "Tesla K40m (0)"
Kernel: transposeCoalesced(float*, float*, int, int)
1 shared_store_throughput Shared Memory Store Throughput 81.40GB/s 81.40GB/s shared_load_throughput Shared Memory Load Throughput 1e+03GB/s 1e+03GB/s
```

nyprof –devices 0 –metrics shared store throughput, shared load throughput



#### Kernel Analysis





## Compute and Latency Analysis







### Memory Bandwidth Analysis





### Using Shared Memory: Simple Copy

```
__global__ void copySharedMem(float *odata, float *idata, const int nx, const
    int ny)
{
    __shared__ float tile[TILE_DIM * TILE_DIM];
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];
    __syncthreads();
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}</pre>
```



### Profiling Results: CopySharedMem





#### No Bank Conflicts

```
__global__ void transposeNoBankConflicts(float *odata, float *idata, const int
    nx, const int nv)
   shared float tile[TILE DIM][TILE DIM+1]:
   int x = blockIdx.x * TILE_DIM + threadIdx.x;
   int y = blockIdx.y * TILE_DIM + threadIdx.y;
   int width = gridDim.x * TILE_DIM;
   for (int j = 0; j < TILE DIM; j += BLOCK ROWS)
     tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];
   __svncthreads():
  x = blockIdx.y * TILE_DIM + threadIdx.x; // transpose block offset
   y = blockIdx.x * TILE_DIM + threadIdx.y;
   for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)</pre>
     odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
```



#### No Bank Conflicts





#### Profiling Results: No bank conflicts







#### Profiling Results: No bank conflicts





#### Transpose Fine Grained



#### Profiling Results: Transpose FineGrained





#### Transpose Coarse Grained

```
__global__ void transposeCoarseGrained(float *odata, float *idata, int width,
   int height)
 shared float block[TILE DIM][TILE DIM+1]:
 int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
 int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
 int index_in = xIndex + (vIndex)*width;
 xIndex = blockIdx.v * TILE DIM + threadIdx.x;
 yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
 int index_out = xIndex + (yIndex)*height;
 for (int i=0; i<TILE_DIM; i += BLOCK_ROWS)</pre>
          block[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
 _svncthreads():
 for (int i=0; i<TILE_DIM; i += BLOCK_ROWS)</pre>
   odata[index_out+i*height] = block[threadIdx.v+i][threadIdx.x];
```



## Profiling Results: Transpose CoarseGrained





- ▶ Just as shared memory performance can be degraded via bank conflicts, an analogous performance degradation can occur with global memory access through 'partition camping'.
- ► Global memory is divided into either 6 partitions (on 8- and 9-series GPUs) or 8 partitions (on 200-and 10-series GPUs) of 256-byte width.
- ► To use global memory effectively, concurrent accesses to global memory by all active warps should be divided evenly amongst partitions.
- ▶ partition camping occurs when: global memory accesses are directed through a subset of partitions, causing requests to queue up at some partitions while other partitions go unused.



- ➤ Since partition camping concerns how active thread blocks behave, the issue of how thread blocks are scheduled on multiprocessors is important.
- ► When a kernel is launched, the order in which blocks are assigned to multiprocessors is determined by the one-dimensional block ID defined as:

```
bid = blockIdx.x + gridDim.x*blockIdx.y;
```

- a row-major ordering of the blocks in the grid.
- ► Ref: "Optimizing Matrix Transpose in CUDA" Greg Ruetsch, Paulius Micikevicius
- ► Ref: "High-Performance Computing with CUDA" Marc Moreno Maza



- ► Once maximum occupancy is reached, additional blocks are assigned to multiprocessors as needed
- ► How quickly and the order in which blocks complete cannot be determined
- ► So active blocks are initially contiguous but become less contiguous as execution of the kernel progresses.





- ▶ With 8 partitions of 256-byte width, all data in strides of 2048 bytes (or 512 floats) map to the same partition.
- ► Any float matrix with 512 × k columns, such as our 2048 × 2048 matrix, will contain columns whose elements map to a single partition.
- $\blacktriangleright$  With tiles of 32  $\times$  32 floats whose one-dimensional block IDs are shown in the figures, the mapping of idata and odata onto the partitions is depicted next.





- ► Concurrent blocks will be accessing tiles row-wise in idata which will be roughly equally distributed amongst partitions
- ► However these blocks will access tiles column-wise in odata which will typically access global memory through just a few partitions.
- ▶ Just as with shared memory, padding would be an option (potentially expensive) but there is a better one ...







- The key idea is to view the grid under a diagonal coordinate system. If blockldx.x and blockldx.y represent the diagonal coordinates, then (for block-square matrixes) the corresponding cartesian coordinates are given by: blockldx\_y = blockldx.x; blockldx\_x = (blockldx.x+blockldx.y)% gridDim.x;
- ➤ One would simply include the previous two lines of code at the beginning of the kernel, and write the kernel assuming the cartesian interpretation of blockldx fields, except using blockldx\_x and blockldx\_y in place of blockldx.x and blockldx.y, respectively, throughout the kernel.



```
__global__ void transposeDiagonal(float *odata,
float *idata, int width, int height)
        __shared__ float tile[TILE_DIM][TILE_DIM+1];
        int blockIdx_x, blockIdx_v;
        // diagonal reordering
        if (width == height) {
                blockIdx_y = blockIdx.x;
                blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;
        } else {
                int bid = blockIdx.x + gridDim.x*blockIdx.y;
                blockIdx_y = bid%gridDim.y;
                blockIdx_x = ((bid/gridDim.y)+blockIdx_y)%gridDim.x;
```



```
int xIndex = blockIdx x*TILE DIM + threadIdx.x:
 int vIndex = blockIdx_v*TILE_DIM + threadIdx.v;
int index_in = xIndex + (yIndex)*width;
xIndex = blockIdx_y*TILE_DIM + threadIdx.x:
yIndex = blockIdx_x*TILE_DIM + threadIdx.y;
int index_out = xIndex + (yIndex)*height;
for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {</pre>
        tile[threadIdx.v+i][threadIdx.x] =
        idata[index_in+i*width];
__syncthreads()
for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {</pre>
        odata[index out+i*height] =
        tile[threadIdx.x][threadIdx.y+i];
```





#### Diagonal

| 0 | 64 | 128 |     |     |    |
|---|----|-----|-----|-----|----|
|   | 4  | 65  | 129 |     |    |
|   |    | 2   | 66  | 130 |    |
| 1 |    |     | 3   | 67  |    |
|   |    |     |     | 4   | 68 |
| X |    |     |     |     | 5  |





- ightharpoons
- $\triangleright$
- ightharpoons
- $\triangleright$

