This assignement works with parallel prefix scan using upsweep and downsweep approach. The upsweep uses inclusive scan while downsweep uses exclusive scan. The CPU results are generated using exclusive scan and compared against GPU results.
git clone https://github.com/amanmaldar/BellochPrefixSum
- Go to main function and update dimensions threadDim.x (threadsInBlock) and gridDim.x (numberOfBlocks)
- Update
#define printing 0
to#define printing 1
to enable printing if needed nvcc bellochPrefixSum.cu -std=c++11 -o bellochPrefixSum
./bellochPrefixSum
- Asssume we have array A of size 64. blockDim.x = 8, gridDim.x = 8
- Copy A to GPU DRAM as A_D. Copy A_D to correpsonding shared memory of each block.
- Each block runs the upsweep addtion. We are interested in last element in each block.
- Copy this last element from each block to CPU.
- CPU generates the cummulative addition for this block. The results are copied back to GPU as blocksum_device.
- Load nth block's last element with blocksum_device[n-1] element. Do this for all blocks.
- Perform the downsweep on all the blocks now. Copy the result to CPU. Compare with reference CPU generated results.
Downsweep kernel takes less time than upsweep kernel.
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
8 16 24 32 40 48 56 64
1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 8
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Total entries: 64
Total CPU version time is: 0 mSec
GPU upsweep kernel time is: 0.0398159 mSec
GPU downsweep kernel time is: 0.0219345 mSec
----------------------------------------------------------------------------------------------------------------------------------
Total entries: 32000000
Total CPU version time is: 89.1559 mSec
GPU upsweep kernel time is: 45.217 mSec
GPU downsweep kernel time is: 24.3959 mSec
----------------------------------------------------------------------------------------------------------------------------------
Total entries: 64000000
Total CPU version time is: 177.801 mSec
GPU upsweep kernel time is: 89.3421 mSec
GPU downsweep kernel time is: 48.7161 mSec