GitHub - falarion08/Dot-Product-Implementation at https://git.chanpinqingbaoju.com

Dot Product Implementation

Link to CUDA implementation: https://colab.research.google.com/drive/1dxFkxKCokf-MbbmoTKiWeVs1MpCF6-hx?usp=sharing

Important Notes:

Each kernel result were tested on an input size of 2^20 and the average time was taken on 30 number of runs.
The number of blocks and threads for each input sizes are both 1024 in length.
The implementation for CUDA uses cooperative groups with prefetching

I. Analysis

Average Runtime (microseconds)

Kernel	2^20	2^24	2^28
C	4,633.33 us	129,600.00 us	1,289,266.67 us
x86-64	1,433.33 us	28,366.67 us	271,266.67 us
x86-SIMD	1,133.33 us	16,166.67 us	246,133.33 us
CUDA	355.81us	590.65 us	8,013.80 us

Measuring Performance of x86-64, x86-SIMD, and CUDA to C Implementation (How many times faster than normal C implementation)

Kernel	2^20	2^24	2^28
x86-64	3.23	4.57	4.75
x86-SIMD	4.09	8.02	5.24
CUDA	13.02	219.42	160.88

Measuring Performance of x86-SIMD and CUDA to x86-64 (How many times faster than x86-64)

Kernel	2^20	2^24	2^28
x86-SIMD	1.26	1.75	5.24
CUDA	4.02	48.03	33.85

Measuring Performance of CUDA to x86-SIMD (How many times faster than x86-SIMD)

Kernel	2^20	2^24	2^28
CUDA	3.19	27.37	30.71

From the table provided, it is observable that the assembly implementation of dot product is always faster than normal C Programming. This is mainly because it takes several instructions to run a for loop in C when translated in assembly than doing a LOOP instruction call in assembly. x86-SIMD provides an even faster computation in computing the dot matrix as it can process multiple data with a single instruction. Specifically, a single instruction can perform operations on 8 or 16 integer elements, in which you could multiply, add, and other operations on the elements. For the CUDA implementation, cooperative groups was used along with prefetching in order to obtain huge perfomance gain by having faster execution time. Threads are partitions into smaller group of threads and performs mathematical operations and binary reduction. The execution time increases as smaller group of threads are synchronized and does not wait for all the threads in a block to reach a certain barrier of the code. In addition, prefetching sends the GPU data before it is needed while the CPU executes instructions. As a result, memory latency (cost of transferring data) to decrease.

Although, the average execution time for all array sizes is fastest in CUDA, the consequence of GPU is the overheard costs as compared to running the program in x64 which directly occurs in the CPU and does not need any time to transfer data. Consider the execution time of running CUDA for 30 times (results of V.). The execution time of for the kernel is 11.61 ms. Considering the overhead time (Data Transfer Time, GPU Page Fault Groups, Page Throttling), the actual runtime for executing the kernel is 20.800 ms. Consider the second best performing kernel, which is the implementation for x86-SIMD (results of IV.), the execution time for running an array size of 2^20 is 43.000 ms. With prefetching and cooperative groups implementation of CUDA, it still performs better than the x64-SIMD by 2.06 times.

II. C Program Result

III. x86-64 Program Result

IV. x86-SIMD Program Result

Register YMM0 stores the base address of pointer a, and loads the initial first 8 elements
Register YMM1 stores the base address of pointer b, and loads the initial first 8 elements
Register YMM2 stores the vertical multiplation of pointer a and pointer b
Register YMM3 stores the horizontal addition of register YMM2 (Duplicates of the horizontal additions will occur)

V. Cuda Program Result

Overhead time

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.vs		.vs
Dot Product		Dot Product
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.vs

.vs

Dot Product

Dot Product

.gitattributes

.gitattributes

LICENSE

LICENSE

README.md

README.md

Repository files navigation

About

Releases

Packages

Languages

License

falarion08/Dot-Product-Implementation

Folders and files

Latest commit

History

Repository files navigation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages