# Dot Product Implementation

An implementation of the dot product using CUDA, x86-64 assembly, and x86 SIMD, operating on the 32-bit integer data type, written in C.

Link to CUDA implementation: https://colab.research.google.com/drive/1dxFkxKCokf-MbbmoTKiWeVs1MpCF6-hx?usp=sharing

Important Notes:

- Each kernel was tested on input sizes of 2^20, 2^24, and 2^28, and the average time was taken over 30 runs (a minimal timing harness is sketched below).
- For every input size, the kernel is launched with 1024 blocks of 1024 threads each.
- The CUDA implementation uses cooperative groups with prefetching.
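As a concrete illustration of this methodology, here is a minimal CUDA timing sketch using CUDA events. The harness itself is hypothetical (the function and kernel names are illustrative, not the repository's code); it assumes a `dotProduct` kernel like the one sketched in Section I.

```cuda
#include <cuda_runtime.h>

// Forward declaration of a dot-product kernel like the one sketched in
// Section I (hypothetical name, not necessarily the repository's).
__global__ void dotProduct(const int *a, const int *b, int *result, int n);

// Average kernel time over 30 runs, launching 1024 blocks of
// 1024 threads each, as described in the notes above.
float averageKernelTimeMs(const int *a, const int *b, int *result, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (int run = 0; run < 30; ++run) {
        cudaMemset(result, 0, sizeof(int));     // reset the accumulator
        cudaEventRecord(start);
        dotProduct<<<1024, 1024>>>(a, b, result, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);             // wait for the kernel
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop); // elapsed time in ms
        totalMs += ms;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / 30.0f;                     // average in milliseconds
}
```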

## I. Analysis

**Average Runtime (microseconds)**

| Kernel | 2^20 | 2^24 | 2^28 |
| --- | --- | --- | --- |
| C | 4,633.33 us | 129,600.00 us | 1,289,266.67 us |
| x86-64 | 1,433.33 us | 28,366.67 us | 271,266.67 us |
| x86-SIMD | 1,133.33 us | 16,166.67 us | 246,133.33 us |
| CUDA | 355.81 us | 590.65 us | 8,013.80 us |

**Speedup of x86-64, x86-SIMD, and CUDA over the C implementation** (how many times faster than plain C)

| Kernel | 2^20 | 2^24 | 2^28 |
| --- | --- | --- | --- |
| x86-64 | 3.23 | 4.57 | 4.75 |
| x86-SIMD | 4.09 | 8.02 | 5.24 |
| CUDA | 13.02 | 219.42 | 160.88 |

**Speedup of x86-SIMD and CUDA over x86-64** (how many times faster than x86-64)

| Kernel | 2^20 | 2^24 | 2^28 |
| --- | --- | --- | --- |
| x86-SIMD | 1.26 | 1.75 | 1.10 |
| CUDA | 4.02 | 48.03 | 33.85 |

**Speedup of CUDA over x86-SIMD** (how many times faster than x86-SIMD)

| Kernel | 2^20 | 2^24 | 2^28 |
| --- | --- | --- | --- |
| CUDA | 3.19 | 27.37 | 30.71 |

From the tables above, the assembly implementation of the dot product is always faster than the plain C implementation. This is mainly because a `for` loop in C compiles to several instructions per iteration, whereas the hand-written assembly iterates with a compact `LOOP` instruction. x86-SIMD computes the dot product faster still because it processes multiple data elements with a single instruction: one 256-bit instruction can multiply or add eight 32-bit integers at once (sixteen with wider registers).

The CUDA implementation uses cooperative groups together with prefetching to obtain a large performance gain. Threads are partitioned into smaller groups (tiles) that perform the multiply-accumulate work and a binary reduction. Execution time decreases because only the threads within a tile need to synchronize at each reduction step, rather than every thread in a block waiting at a common barrier. In addition, prefetching sends data to the GPU before it is needed, while the CPU continues executing instructions; as a result, memory latency (the cost of transferring data) decreases. A minimal sketch of such a kernel follows.
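This sketch assumes unified (managed) memory with `cudaMemPrefetchAsync` and the 1024-block by 1024-thread launch configuration from the notes above; the kernel name and tile width are illustrative, not necessarily the repository's exact code.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Dot-product kernel using cooperative groups with a tiled binary
// reduction (illustrative sketch, not the repository's exact kernel).
__global__ void dotProduct(const int *a, const int *b, int *result, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    // Grid-stride loop: each thread accumulates a partial sum.
    int sum = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];

    // Binary reduction within the 32-thread tile; only the tile,
    // not the whole block, synchronizes at each step.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        sum += tile.shfl_down(sum, offset);

    // One thread per tile commits its partial result.
    if (tile.thread_rank() == 0)
        atomicAdd(result, sum);
}

int main() {
    const int n = 1 << 20;
    int *a, *b, *result;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&result, sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = 1; b[i] = 2; }
    *result = 0;

    // Prefetch managed memory to the GPU before launch so the kernel
    // does not stall on page faults (assumes device 0 is current).
    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(a, n * sizeof(int), device);
    cudaMemPrefetchAsync(b, n * sizeof(int), device);

    dotProduct<<<1024, 1024>>>(a, b, result, n);
    cudaDeviceSynchronize();
    printf("dot = %d\n", *result);  // expected: 2 * 2^20 = 2097152
    cudaFree(a); cudaFree(b); cudaFree(result);
    return 0;
}
```

The 32-thread tile matches the hardware warp, so `shfl_down` exchanges partial sums entirely through registers; only one `atomicAdd` per tile touches global memory.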

Although the average execution time for every array size is lowest in CUDA, the trade-off of using the GPU is overhead cost, compared with running the program in x86-64, which executes directly on the CPU and needs no time to transfer data. Consider the execution time of running CUDA 30 times (results in Section V): the kernel execution time is 11.61 ms, and once the overhead (data transfer time, GPU page fault groups, page throttling) is included, the actual runtime is 20.800 ms. For the second-best-performing kernel, x86-SIMD (results in Section IV), the execution time for an array size of 2^20 is 43.000 ms. Even with overhead included, the prefetching and cooperative-groups CUDA implementation still outperforms x86-SIMD by about 2.07 times (43.000 ms / 20.800 ms).

## II. C Program Result

[Screenshot: C program output]

## III. x86-64 Program Result

[Screenshot: x86-64 program output]

## IV. x86-SIMD Program Result

[Screenshot: x86-SIMD program output]

- Register YMM0 is loaded from the base address of pointer `a` with the first 8 elements of the array.

  [Screenshot: YMM0 contents]

- Register YMM1 is loaded from the base address of pointer `b` with the first 8 elements of the array.

  [Screenshot: YMM1 contents]

- Register YMM2 stores the vertical (element-wise) multiplication of the elements from `a` and `b`.

  [Screenshot: YMM2 contents]

- Register YMM3 stores the horizontal addition of register YMM2 (the horizontal sums appear duplicated across lanes); an intrinsics sketch of these steps follows this list.

  [Screenshot: YMM3 contents]
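The same register choreography can be expressed in C with AVX2 intrinsics. This is a sketch assuming AVX2 (256-bit YMM registers) and 32-bit integers; the repository's actual code is hand-written assembly, and the variable names below merely mirror the registers described above.

```c
#include <stdio.h>
#include <immintrin.h>  /* AVX2 intrinsics; compile with -mavx2 */

int dot_product_avx2(const int *a, const int *b, int n) {
    __m256i acc = _mm256_setzero_si256();             /* running sums */
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256i ymm0 = _mm256_loadu_si256((const __m256i *)(a + i)); /* 8 ints from a */
        __m256i ymm1 = _mm256_loadu_si256((const __m256i *)(b + i)); /* 8 ints from b */
        __m256i ymm2 = _mm256_mullo_epi32(ymm0, ymm1);               /* vertical multiply */
        acc = _mm256_add_epi32(acc, ymm2);                           /* accumulate */
    }

    /* Horizontal addition: each hadd produces duplicated partial sums,
     * matching the duplicated lanes noted for YMM3 above. */
    __m256i ymm3 = _mm256_hadd_epi32(acc, acc);
    ymm3 = _mm256_hadd_epi32(ymm3, ymm3);

    int lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, ymm3);
    int total = lanes[0] + lanes[4];  /* combine the two 128-bit halves */

    for (int i = n & ~7; i < n; ++i)  /* scalar tail when n % 8 != 0 */
        total += a[i] * b[i];
    return total;
}

int main(void) {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int b[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    printf("%d\n", dot_product_avx2(a, b, 10)); /* expected: 55 */
    return 0;
}
```

Because `_mm256_hadd_epi32` adds pairs only within each 128-bit lane, two passes leave the low-half and high-half sums in separate lanes, which is why the final step adds `lanes[0]` and `lanes[4]`.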

## V. CUDA Program Result

[Screenshot: CUDA program output]

**Overhead time**

[Screenshot: overhead-time profile]
