## **Shared Memory Bank Conflicts**

## Shared memory is physically stored in banks



**Logical Shared Memory** cuda.shared.array(4,4)



**Physical Shared Memory** in 4 banks



Real shared memory is organized into 32 banks, each 4 bytes wide. To save space in these slides, we portray shared memory as having 4 banks (**A, B, C, D**) and a warp as having 4 threads.
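As a rough illustration of what "32 banks, 4 bytes wide" means for addressing, the sketch below assumes the usual interleaved mapping, in which consecutive 4-byte words are assigned to consecutive banks. The function name `bank_of_address` is hypothetical and only models the mapping on the CPU; it is not part of Numba or CUDA.

```python
NUM_BANKS = 32   # banks on real hardware, per the slide
BANK_WIDTH = 4   # bytes per bank word

def bank_of_address(byte_addr, num_banks=NUM_BANKS, bank_width=BANK_WIDTH):
    """Return the bank holding the word that contains byte_addr (assumed model)."""
    return (byte_addr // bank_width) % num_banks

# Consecutive 4-byte words fall in consecutive banks and wrap after 32 words.
print([bank_of_address(a) for a in (0, 4, 8, 124, 128)])  # [0, 1, 2, 31, 0]
```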




Successive 4-byte words (one box each in these slides) belong to different banks.
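The picture for the (4,4) tile can be reproduced with a small sketch. It assumes a row-major array whose first word sits in bank A, and uses the slides' 4 illustrative banks; `bank_layout` is a hypothetical helper, not a library call.

```python
BANKS = "ABCD"  # the four illustrative banks used in these slides

def bank_layout(rows, cols, num_banks=len(BANKS)):
    """Return a rows x cols grid of bank letters for a row-major array."""
    return [[BANKS[(r * cols + c) % num_banks] for c in range(cols)]
            for r in range(rows)]

for row in bank_layout(4, 4):
    print(" ".join(row))
# A B C D
# A B C D
# A B C D
# A B C D
```

Under these assumptions every column of the (4,4) tile lives entirely in one bank.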




When several threads in a warp access the same bank at the same time, the accesses are serialized. We call this a bank conflict.
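To make the cost concrete, the following sketch counts how many distinct words a 4-thread warp requests from each bank, using the slides' 4-bank model. The helper `conflict_degree` is hypothetical. It counts distinct words only, on the understanding that threads reading the very same word are served by a broadcast rather than serialized.

```python
from collections import defaultdict

def conflict_degree(word_indices, num_banks=4):
    """Largest number of distinct words one warp requests from a single bank."""
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % num_banks].add(w)
    return max(len(ws) for ws in words_per_bank.values())

COLS = 4  # row length of the (4,4) tile
row_access = [0 * COLS + t for t in range(4)]  # thread t reads tile[0][t]
col_access = [t * COLS + 0 for t in range(4)]  # thread t reads tile[t][0]

print(conflict_degree(row_access))  # 1: every thread hits its own bank
print(conflict_degree(col_access))  # 4: all four reads land in bank A
```

A degree of 1 means the warp is served in a single pass; a degree of 4 means the four reads are handled one after another.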






Here is a technique we can use to avoid bank conflicts when we know we need column-wise access to shared memory.



First, when we allocate our shared memory tile, we pad it with an extra column.



Logical Shared Memory
cuda.shared.array(4,5)




Next, when we write to the tile, we act as if the tile were (4,4) and only write to addresses in the range `[0:4][0:4]`, leaving the padding column unused.




So if we consider how the array is laid out within the memory banks, we see the following:
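The padded layout can be reproduced with a short sketch. It assumes a row-major (4,5) array starting in bank A and the slides' 4 illustrative banks; `padded_layout` is a hypothetical helper.

```python
BANKS = "ABCD"

def padded_layout(rows, cols, pad=1, num_banks=len(BANKS)):
    """Bank letter of each element in a row-major (rows, cols + pad) array."""
    stride = cols + pad
    return [[BANKS[(r * stride + c) % num_banks] for c in range(stride)]
            for r in range(rows)]

for row in padded_layout(4, 4):
    # The entry after the bar is the unused padding column.
    print(" ".join(row[:4]), "|", row[4])
# A B C D | A
# B C D A | B
# C D A B | C
# D A B C | D
```

Each row is shifted by one bank relative to the row above, so reading down any of the four logical columns visits banks A, B, C, and D once each.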




Now when we access a column of shared memory, each element resides in a different bank and there are no bank conflicts.
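This claim can be checked for every column at once. The sketch below assumes a square logical tile stored row-major with a given row stride; both function names are hypothetical. The last call applies the same check at the real hardware size of 32 banks, where a (32,32) tile would be padded to (32,33).

```python
def column_banks(rows, row_stride, col, num_banks):
    """Banks touched when thread r of a warp reads element [r][col]."""
    return [(r * row_stride + col) % num_banks for r in range(rows)]

def is_conflict_free(rows, row_stride, num_banks):
    """True if every logical column maps its elements to distinct banks."""
    return all(len(set(column_banks(rows, row_stride, c, num_banks))) == rows
               for c in range(rows))

print(is_conflict_free(4, 4, 4))     # False: unpadded (4,4) tile
print(is_conflict_free(4, 5, 4))     # True: padded (4,5) tile
print(is_conflict_free(32, 33, 32))  # True: same idea with 32 banks
```

The padding works because a row stride of 5 advances the starting bank by one on each row, instead of returning to the same bank as a stride of 4 does.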





















Returning to our earlier matrix transpose example, a single change suffices to avoid bank conflicts while retaining correctness: allocate the shared-memory tile with one extra column, as in the first line of the code below.

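```
from numba import cuda, float32

@cuda.jit
def transpose(a_in, a_out):
    # Tile padded with one extra column to avoid bank conflicts.
    # float32 is an assumed element type; the slide omits the dtype.
    tile = cuda.shared.array((2, 3), float32)
    x, y = cuda.grid(2)
    t_x, t_y = cuda.threadIdx.x, cuda.threadIdx.y

    tile[t_y, t_x] = a_in[y, x]   # row-wise write into the tile
    cuda.syncthreads()            # wait until the whole tile is loaded

    o_x = cuda.blockIdx.y * cuda.blockDim.y + t_x
    o_y = cuda.blockIdx.x * cuda.blockDim.x + t_y
    a_out[o_y, o_x] = tile[t_x, t_y]  # column-wise read from the padded tile
```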
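To see why correctness is retained, the kernel's indexing can be emulated on the CPU. The sketch below is plain Python rather than Numba, assumes a square matrix whose size is a multiple of the block size, and uses the hypothetical name `transpose_tiled`. It loops over blocks and threads in place of a real kernel launch.

```python
def transpose_tiled(a_in, block=2, pad=1):
    """Emulate the tiled transpose for a square matrix (size multiple of block)."""
    n = len(a_in)
    a_out = [[None] * n for _ in range(n)]
    for b_y in range(n // block):
        for b_x in range(n // block):
            # One padded "shared memory" tile per block; the pad column is never touched.
            tile = [[None] * (block + pad) for _ in range(block)]
            for t_y in range(block):  # load phase
                for t_x in range(block):
                    tile[t_y][t_x] = a_in[b_y * block + t_y][b_x * block + t_x]
            # On the GPU, cuda.syncthreads() separates the two phases.
            for t_y in range(block):  # store phase
                for t_x in range(block):
                    o_x = b_y * block + t_x
                    o_y = b_x * block + t_y
                    a_out[o_y][o_x] = tile[t_x][t_y]
    return a_out

a = [[4 * r + c for c in range(4)] for r in range(4)]
assert transpose_tiled(a) == [list(col) for col in zip(*a)]
```

Because only indices in `[0:2][0:2]` of each tile are ever read or written, the extra column changes where elements sit in memory but not which values are moved.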








