# GPU-accelerated vs. CPU-only Applications













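[Figure: timeline (Time axis) of work in a GPU-accelerated application, with the data initialized by a=np.arange(n). Surviving caption fragment: "...both host and device memory."]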



In accelerated applications there is













# **CUDA Thread Hierarchy**











CUDA can process thousands of threads in parallel. The numbers of threads and blocks shown in these images are greatly reduced for simplicity.











## Kernels are launched with an execution configuration
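As a rough mental model only, the launch syntax used later in this deck, do\_work[2, 4](d\_a), passes an execution configuration of 2 blocks with 4 threads each. The pure-Python sketch below emulates that bracket syntax on the CPU. SimulatedKernel and record\_launch are hypothetical names invented for this illustration, not part of Numba or CUDA, and the simulated threads run one after another rather than in parallel.

```python
class SimulatedKernel:
    """CPU-only stand-in for a GPU kernel (hypothetical helper, not a Numba API)."""

    def __init__(self, func):
        self.func = func

    def __getitem__(self, config):
        # config is the execution configuration: (blocks, threads_per_block)
        blocks, threads_per_block = config

        def launch(*args):
            # Run the kernel body once per simulated thread, sequentially.
            for block_idx in range(blocks):
                for thread_idx in range(threads_per_block):
                    self.func(block_idx, thread_idx, *args)

        return launch


def record_launch(block_idx, thread_idx, log):
    # Kernel body: note which (block, thread) pair executed.
    log.append((block_idx, thread_idx))


calls = []
do_work = SimulatedKernel(record_launch)
do_work[2, 4](calls)  # 2 blocks x 4 threads per block
print(len(calls))     # one call per thread: 2 * 4 = 8
```

The point of the sketch is that the kernel body executes once per thread, so the execution configuration alone determines how many times it runs.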













# CUDA-Provided Thread Hierarchy Variables

Inside kernel definitions, CUDA-provided variables describe the executing thread, block, and grid.



**gridDim.x** is the number of blocks in the grid, in this case **2**.







Inside a kernel **blockDim.x** describes the number of threads in a block. In this case **4** 





Inside a kernel **threadIdx.x** describes the index of the thread within a block. In this case **0**.
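To make the values of these variables concrete, here is a small CPU-only sketch that lists what each thread would see in a grid of 2 blocks with 4 threads per block. ThreadContext and thread\_contexts are hypothetical names made up for this example; in a real kernel the values come from the CUDA-provided variables themselves.

```python
from collections import namedtuple

# Hypothetical record of the hierarchy variables visible to one thread.
ThreadContext = namedtuple(
    "ThreadContext", "gridDim_x blockIdx_x blockDim_x threadIdx_x"
)


def thread_contexts(blocks, threads_per_block):
    """List the hierarchy variables for every thread in a 1D grid."""
    return [
        ThreadContext(blocks, block_idx, threads_per_block, thread_idx)
        for block_idx in range(blocks)
        for thread_idx in range(threads_per_block)
    ]


contexts = thread_contexts(blocks=2, threads_per_block=4)
for ctx in contexts:
    print(ctx)
```

Note that gridDim\_x and blockDim\_x are the same for every thread, blockIdx\_x changes only from block to block, and threadIdx\_x restarts at 0 in each block.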


















# **Coordinating Parallel Threads**











[Figure: GPU data array with eight elements, indexed 0 to 7.]
















There is an idiomatic way to calculate this value, however. Recall that each thread has access to the size of its block via blockDim.x






...and the index of its block within the grid via blockIdx.x






...and its own index within its block via threadIdx.x






Using these variables, the formula threadIdx.x + blockIdx.x \* blockDim.x will return the thread's unique index in the whole grid, which we can then map to data elements.
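The formula can be checked by hand on the CPU. The sketch below evaluates it for every thread of a grid with 2 blocks of 4 threads; global\_index is a hypothetical helper name used only here. In Numba kernels, cuda.grid(1) returns this same value for a one-dimensional grid.

```python
def global_index(thread_idx_x, block_idx_x, block_dim_x):
    """threadIdx.x + blockIdx.x * blockDim.x for a 1D grid."""
    return thread_idx_x + block_idx_x * block_dim_x


BLOCKS, THREADS_PER_BLOCK = 2, 4

data_indices = [
    global_index(thread_idx, block_idx, THREADS_PER_BLOCK)
    for block_idx in range(BLOCKS)
    for thread_idx in range(THREADS_PER_BLOCK)
]
print(data_indices)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Each of the 8 threads gets a distinct index, so each can be mapped to exactly one of 8 data elements.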





[Figure sequence: each thread evaluates threadIdx.x + blockIdx.x \* blockDim.x to obtain its data\_index into the eight data elements (0 to 7). With blockDim.x = 4, thread 0 of block 0 gets 0 + 0 \* 4 = 0, thread 1 of block 1 gets 1 + 1 \* 4 = 5, and thread 3 of block 1 gets 3 + 1 \* 4 = 7.]

















# **Grid-Stride Loops**




Often there are more data elements than there are threads in the grid

do\_work[2, 4](d\_a)

[Figure: data array with 32 elements, indexed 0 to 31, processed by a grid of 2 blocks with 4 threads each.]









... or else work is left undone
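A quick CPU-side count illustrates the problem, assuming the 32-element array and the 2 x 4 launch configuration shown in the figures. If each simulated thread handles only the element at its own grid index, most of the array is never touched. The variable names are illustrative only.

```python
N_ELEMENTS = 32
BLOCKS, THREADS_PER_BLOCK = 2, 4

done = [False] * N_ELEMENTS
for block_idx in range(BLOCKS):
    for thread_idx in range(THREADS_PER_BLOCK):
        # Each thread processes a single element: its own grid index.
        i = thread_idx + block_idx * THREADS_PER_BLOCK
        if i < N_ELEMENTS:
            done[i] = True

untouched = [i for i, finished in enumerate(done) if not finished]
print(len(untouched))  # 24 of the 32 elements are left undone
```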



























Numba provides a convenience function for this common calculation: cuda.gridsize(), returning the number of threads in the grid (blockDim.x \* gridDim.x).
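As a sketch of how that stride is used, the snippet below computes the grid size for 2 blocks of 4 threads and lists the elements a single thread would visit in a 32-element array. elements\_for\_thread is a hypothetical helper; inside a Numba kernel the stride would come from cuda.gridsize(1) rather than being computed by hand.

```python
def elements_for_thread(start, grid_size, n_elements):
    """Indices visited by one thread that strides by the grid size."""
    return list(range(start, n_elements, grid_size))


BLOCKS, THREADS_PER_BLOCK = 2, 4
grid_size = BLOCKS * THREADS_PER_BLOCK  # gridDim.x * blockDim.x = 8

# The thread whose first element is index 3 (thread 3 of block 0).
visited = elements_for_thread(3, grid_size, 32)
print(visited)  # [3, 11, 19, 27]
```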





















With all threads working in parallel using a grid-stride loop...
















With all threads working in this way, all elements are covered, with the performance advantage of memory coalescing: at each step, neighboring threads access neighboring elements.
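The coverage claim can be checked with a sequential CPU simulation, again assuming 32 elements and a grid of 2 blocks with 4 threads each. This is only a sketch of the access pattern, not GPU code: the names are illustrative, and real threads would run these loops concurrently.

```python
N_ELEMENTS = 32
BLOCKS, THREADS_PER_BLOCK = 2, 4
GRID_SIZE = BLOCKS * THREADS_PER_BLOCK  # stride: threads in the whole grid

visits = [0] * N_ELEMENTS
steps = {}  # loop iteration -> indices touched by all threads in that iteration

for block_idx in range(BLOCKS):
    for thread_idx in range(THREADS_PER_BLOCK):
        start = thread_idx + block_idx * THREADS_PER_BLOCK
        # Grid-stride loop: begin at the thread's own index, step by grid size.
        for step, i in enumerate(range(start, N_ELEMENTS, GRID_SIZE)):
            visits[i] += 1
            steps.setdefault(step, []).append(i)

print(all(count == 1 for count in visits))  # every element handled exactly once
print(steps[0])  # [0, 1, 2, 3, 4, 5, 6, 7]
print(steps[1])  # [8, 9, 10, 11, 12, 13, 14, 15]
```

Within each loop iteration the threads touch a contiguous run of indices, which is the access pattern that allows memory coalescing on the GPU.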


















