This is a 100-day challenge to master CUDA C/C++ programming. I'll also be learning new C++ language features (C++14/17/20) and experimenting with them when possible.
- Programming Massively Parallel Processors 4th edition by David Kirk and Wen-mei Hwu.
- Oak Ridge x Nvidia CUDA Training series
- Refresher on basic CUDA syntax and concepts.
- Read PMPP Chapter 4.
- Implemented simple vector addition using grid-stride loop.
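For reference, a minimal sketch of the grid-stride pattern (illustrative names, not necessarily the exact code from that day):

```cpp
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Grid-stride loop: each thread handles multiple elements,
    // so one launch configuration works for any array size n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}
```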
- Refresher on C++ vectors and timing (std::chrono).
- Learned about `cudaEvent_t`, `cudaEventCreate(&event)`, `cudaEventRecord(event)`, and `cudaEventElapsedTime(&ms, start, end)`, and how to use them to compute the runtime of a CUDA kernel.
- Benchmarked CPU vs GPU versions.
- Started reading Chapter 5 as a refresher for different types of device memory.
- Implemented naive matrix multiplication.
- Refresher on C++ 2d arrays and dynamic memory allocation.
- Benchmarked CPU vs GPU versions.
- Profiled the naive matrix multiplication (Nsight Compute) and reduced runtime by 50% by tuning the number of threads per block (reduced threadsPerBlock from 32x32 to 16x16 and increased the number of blocks from 52 to 108 to better balance the workload across SMs).
- Implemented a small script to get the device properties.
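The device-properties query is essentially a call to `cudaGetDeviceProperties()`; a minimal sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Name: %s\n", prop.name);
    printf("SM count: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```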
- Finished reading PMPP Chapter 5.
- Implemented tiled matrix multiplication with shared memory.
- Benchmarked `mmSharedTile_gpu` against the CPU and naive GPU versions.
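A sketch of the tiled-multiplication pattern used here (square matrices with `N` divisible by `TILE_DIM`, as assumed at this stage; the real `mmSharedTile_gpu` may differ in details):

```cpp
#define TILE_DIM 16

__global__ void mmSharedTile(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE_DIM][TILE_DIM];
    __shared__ float tileB[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * TILE_DIM + threadIdx.y;
    int col = blockIdx.x * TILE_DIM + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE_DIM; ++t) {
        // Each thread loads one element of the current A and B tiles into shared memory.
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_DIM + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_DIM + threadIdx.y) * N + col];
        __syncthreads();

        // Partial dot product over the tile, reusing each shared-memory load TILE_DIM times.
        for (int k = 0; k < TILE_DIM; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```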
- Re-read some parts of Chapters 4 and 5.
- Modified `mmSharedTile_gpu()` to handle arbitrary matrix widths (`N` does not have to be divisible by `TILE_DIM`).
- Worked on some of the exercises of Chapter 5.
- Started reading Chapter 6.
- Modified `mmSharedTile_gpu()` to handle arbitrary matrix sizes (`nRows != nCols`).
- Continued reading Chapter 6.
- Started reading about Gaussian blur and implementing it.
- Implemented the 2D version of the normal distribution density function: `normal_pdf_gpu()` and `normal_pdf_cpu()`.
- Finished implementing Gaussian blur on the CPU. Used the "edge expansion" padding strategy.
- Created some testing code to compare outputs with `cv2` outputs (in Python).
- Refresher on `fstream` in C++.
- Implemented a naive version of Gaussian blur on the GPU.
- Had a refresher on `atomicAdd()` and `cudaDeviceSynchronize()`.
- Something weird: random init using `std::mt19937(seed)` yielded different results when compiling .cu and .cpp files containing the same init function. Is this due to different compilers / flags used for g++ vs nvcc? Need to investigate more.
- Read Chapter 7 about convolutions.
- Optimized the Gaussian blur by storing the kernel in constant memory for efficient use of the cache and higher performance.
- Started incorporating the tiling technique with shared memory to address memory bandwidth bottlenecks.
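The constant-memory change is essentially this pattern (sketch; `KERNEL_DIM` and the symbol name are illustrative):

```cpp
#define KERNEL_DIM 5  // e.g. a 5x5 Gaussian filter

// The filter weights live in constant memory: read-only, cached, and
// broadcast efficiently when all threads read the same element.
__constant__ float d_gaussKernel[KERNEL_DIM * KERNEL_DIM];

// Host side: copy the precomputed weights once, before launching the blur kernel.
// cudaMemcpyToSymbol(d_gaussKernel, h_gaussKernel,
//                    KERNEL_DIM * KERNEL_DIM * sizeof(float));
```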
- Finished reading Chapter 7.
- Finished optimizing Gaussian blur using tiling and shared memory.
- Implemented some activation functions for 2D matrices: ReLU, sigmoid, and tanh.
- Re-read Chapters 6 and 7.
- Started reading Chapter 10 about reduction patterns.
- Implemented a simple sum reduction `simpleSumReduction_gpu()` that only handles arrays that fit in 1 block.
- Implemented `simpleSumReductionV2_gpu()`, which uses a better thread assignment strategy for better execution resource utilization and less control divergence.
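The improved assignment is the classic convergent scheme where active threads stay contiguous; a sketch for an array that fits in one block (assumes `blockDim.x` is a power of two and `n <= 2 * blockDim.x`):

```cpp
__global__ void simpleSumReductionV2(float* input, float* output, int n) {
    unsigned int i = threadIdx.x;
    // Active threads keep consecutive indices, so whole warps retire together
    // (less control divergence than the interleaved stride-1,2,4,... version).
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {
        if (i < stride && i + stride < n)
            input[i] += input[i + stride];
        __syncthreads();
    }
    if (i == 0) *output = input[0];
}
```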
- Finished reading Chapter 10.
- Optimized the sum reduction by implementing `sumReductionSharedMultiSegment_gpu()`, which uses shared memory to reduce load times and generalizes to arbitrarily sized arrays (i.e. not limited to 1 block).
- Read about precision loss in parallel sum reduction. Floating-point summation yields different results depending on the algorithm, e.g. sequential sum (CPU version) results != hierarchical reduction sum (GPU version) results.
- Added an integer version of the kernel, which yields the exact same result as the CPU version.
- Started to work on implementing Softmax.
- Got a refresher about the log-sum-exp and maximum tricks for numerical stability of softmax.
- Implemented 1D Parallel maximum.
- Implemented the softmax function as described below:
  - To improve numerical stability, we use the max trick, where we subtract the maximum value from all inputs: $$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j} e^{x_j - \max(x)}}$$
- The softmax function was implemented in a non-optimized manner, with each operation (subtraction, exponentiation, division, max, and sum) executed in separate kernels. This introduces overhead and increases runtime.
- The next step is to apply kernel fusion to enhance performance.
- Optimized the softmax function by applying kernel fusion as follows:
  - `expArray_gpu()` and `substractArray_gpu()` are now combined and integrated into the `sumArray_gpu()` kernel (named `sumExpArray_gpu()`).
- This helped improve performance by:
  - Reduced kernel launch overhead, as the number of kernels went from 5 to 3.
  - Lower memory bandwidth usage, as the fused kernels do fewer global memory reads and writes (i.e. fewer intermediate results).
  - Better memory locality.
- Results:
- Improved runtime by 30% on a test 1D array of size 4096 x 1024.
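A sketch of the fused kernel idea: the max subtraction and exponentiation are folded into the sum reduction, so `exp(x - max)` is computed while loading instead of being written to intermediate arrays (names mirror the ones above, but the body is illustrative and assumes `blockDim.x` is a power of two):

```cpp
__global__ void sumExpArray(const float* input, float* blockSums, float maxVal, int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Fused load: subtract the max and exponentiate on the fly.
    sdata[tid] = (i < n) ? expf(input[i] - maxVal) : 0.0f;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}
```

The per-block partial sums still need one more (much smaller) reduction, or an `atomicAdd` into a single accumulator.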
- Read Chapter 9 (Histograms).
- Implemented naive parallel histogram kernel.
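The naive version is essentially one `atomicAdd` per input element into global memory (sketch; binning values into `NUM_BINS` equal-width bins over `[0, maxVal)` is an assumption):

```cpp
#define NUM_BINS 256

__global__ void histogramNaive(const float* input, unsigned int* bins, int n, float maxVal) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int bin = (int)(input[i] / maxVal * NUM_BINS);
        bin = max(0, min(bin, NUM_BINS - 1));  // clamp edge cases to a valid bin
        atomicAdd(&bins[bin], 1u);  // contention on popular bins is what makes this "naive"
    }
}
```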
- I'll be implementing different types of attention for the next few days, e.g. scoring functions such as dot product, Bahdanau, and Luong, as well as self attention.
- Refresher on attention in general, and Bahdanau and Luong scoring functions.
- Implemented the simple dot product based scoring function.
- Very short session (not enough time to work on much).
- Fused the dot product kernels (multiplication and sum) into one for better performance and less memory consumption.
- Laid out the idea for implementing Luong attention (will implement tomorrow).
- Implemented the Luong attention scoring function, i.e. the bilinear attention score: $$\text{score}(h_t, s_k) = h^T_{t} W s_k$$
- We can compute all scores $\text{score}(h_t, s_k)$ for $k$ in $1..m$ in one go as $h^T_{t} W S$, where:
  - `S`: encoder states matrix of shape `(d, m)`, where each column represents an encoder state $s_k$, for $k$ in $1..m$.
  - `W`: weight matrix of shape `(d, d)`.
  - $h^T_{t}$: the decoder vector at timestep $t$, of shape `(1, d)`.
- Quick session, started implementing the Bahdanau attention score, i.e. the MLP score $$\text{score}(h_t, s_k) = W^T_{2} \tanh(W_1 [h_t, s_k])$$
- Completed the Bahdanau attention score implementation.
- Implemented vector concatenation and 2D tanh, which are used to calculate the score.
- Refresher on CMake and how to write one for a CUDA and C++ project.
- Refactored some of the attention code to separate kernels from testing code.
- Started implementing self attention.
- Implemented a 2D softmax kernel and ran tests comparing the CPU and GPU versions.
- Unlike my 1D softmax implementation, which used 3 kernels, the 2D version uses only 1 kernel, which gets rid of the kernel launch overhead.
- Implemented utility functions for self attention:
  - `matmul()` is a tiled matrix multiplication with shared memory that handles arbitrary matrix shapes.
  - `matmulScaledTranspose()` computes $\frac{Q K^T}{scalingFactor}$.
- Implemented self attention, which combines the `softmax2D()`, `matmul()` and `matmulScaledTranspose()` kernels to get the attention matrix given $Q$, $K$, and $V$.
- Next up is to test it and then refactor it into an `AttentionLayer` class to separate the weight creation and initialization from the forward pass. The next objective would be to align it with PyTorch's `MultiheadAttention` layer and compare performance (using only 1 head for PyTorch's layer as a first step).
- Developed the `AttentionLayer` class to replicate PyTorch's `MultiheadAttention` functionality.
- Refactored the `selfAttention()` kernel to seamlessly integrate with `AttentionLayer`.
- Designed `AttentionLayer` to handle data allocation and initialization using Xavier normal initialization on a CUDA device.
- Implemented the `forward()` function to invoke the self-attention kernel and return the computed attention output.
Example usage:

```cpp
AttentionLayer attentionLayer(seq_len, dim_emb, "cuda");
float* attentionOutput = attentionLayer.forward(queries, keys, values);
```
- Implemented the CPU version of selfAttention.
- Added testing for GPU version and spent some time debugging.
- Fixed boundary checks and indexing bugs in the `matmulTransposScaled()` kernel.
- In the next few days, I'll be implementing different kinds of quantization techniques (symmetric, asymmetric, LLM.int8(), ...).
- Implemented symmetric quantization.
  - Scale factor (S): $$S = \frac{\max(|x|)}{2^{b-1} - 1}$$ where:
    - $\max(|x|)$ is the maximum absolute floating-point value.
    - $b$ is the bit width of the quantized representation.
  - Zero-point (Z): in symmetric quantization, $Z = 0$.
  - Quantization: $$x_q = \mathrm{round}\left(\frac{x}{S}\right)$$ where the result is clipped to $[-(2^{b-1}-1),\ 2^{b-1}-1]$.
  - Dequantization: $$x \approx S \cdot x_q$$
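A sketch of the quantization step on the GPU, assuming the scale `S` has already been computed from the global absmax (the actual implementation also computes the max on the device):

```cpp
#include <cstdint>

__global__ void symmetricQuant8bits(const float* input, int8_t* output, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // x_q = clip(round(x / S), -127, 127)
        int q = __float2int_rn(input[i] / scale);  // round to nearest int
        q = max(-127, min(127, q));
        output[i] = static_cast<int8_t>(q);
    }
}
```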
- Implemented a CPU version of 8-bit symmetric quantization for testing against the GPU one, and fixed a bug in the max calculation of the GPU version.
- Implemented 8-bit asymmetric quantization (GPU and CPU versions + testing):
  - Scale factor (S): $$S = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}}$$ where:
    - $x_{\max}$, $x_{\min}$ are the maximum and minimum floating-point values.
    - $q_{\max}$, $q_{\min}$ are the maximum and minimum quantized integer values.
  - Zero-point (Z): in asymmetric quantization, $$Z = \mathrm{round}\left(q_{\min} - \frac{x_{\min}}{S}\right)$$
  - Quantization: $$x_q = \mathrm{round}\left(\frac{x}{S}\right) + Z$$ where the result is clipped to $[q_{\min},\ q_{\max}]$.
  - Dequantization: $$x \approx S \cdot (x_q - Z)$$
- Learned how to add Python bindings for CUDA kernels and spent some time debugging some Windows errors.
- Added a Python binding test for symmetric quantization.
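The binding layer boils down to a small `torch/extension.h` module; a minimal sketch (the function name and signature are illustrative, not the repo's exact API):

```cpp
#include <torch/extension.h>

// Declared here, defined in the .cu file: wraps the symmetric quantization kernel launch.
torch::Tensor symmetric_quant_8bit(torch::Tensor input);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("symmetric_quant_8bit", &symmetric_quant_8bit,
          "8-bit symmetric quantization (CUDA)");
}
```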
- Read the LLM.int8() paper.
- Started implementing the LLM.int8() specific kernels.
- Implemented vector-wise quantization, which applies 8-bit absmax quantization row-wise (will implement column-wise later).
- Implemented the CPU version and tested my kernel against it.
- Next up will be implementing the mixed-precision multiplication for the outlier / non-outlier matrices.
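A sketch of the row-wise (vector-wise) absmax idea: one block per row computes that row's absolute maximum in shared memory, then quantizes the row with its own scale (illustrative; the real kernel may differ, e.g. in how it returns the per-row scales). Assumes `blockDim.x` is a power of two:

```cpp
#include <cstdint>

// One block per row.
__global__ void rowWiseQuant8bits(const float* input, int8_t* output,
                                  float* rowScales, int nCols) {
    extern __shared__ float smax[];
    const float* row = input + (size_t)blockIdx.x * nCols;

    // 1) Each thread scans a strided chunk of the row for the absolute max.
    float localMax = 0.0f;
    for (int c = threadIdx.x; c < nCols; c += blockDim.x)
        localMax = fmaxf(localMax, fabsf(row[c]));
    smax[threadIdx.x] = localMax;
    __syncthreads();

    // 2) Tree reduction in shared memory to get the row absmax.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + stride]);
        __syncthreads();
    }
    float scale = fmaxf(smax[0], 1e-8f) / 127.0f;  // guard against an all-zero row
    if (threadIdx.x == 0) rowScales[blockIdx.x] = scale;

    // 3) Quantize the row with its per-row scale.
    for (int c = threadIdx.x; c < nCols; c += blockDim.x) {
        int q = __float2int_rn(row[c] / scale);
        output[(size_t)blockIdx.x * nCols + c] = static_cast<int8_t>(max(-127, min(127, q)));
    }
}
```

Launched with `<<<nRows, threads, threads * sizeof(float)>>>`.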
- Continued implementing LLM.int8().
- Implemented GPU and CPU versions of column-wise quantization and tested them.
- Implemented the third step of LLM.int8(), which is basically a matmul between two int8 matrices that results in an int32 matrix.
- Finished working on the building blocks of LLM.int8() quantization method.
- Implemented tiled matrix multiplication and learned how to dynamically allocate shared memory inside templated functions.
- Implemented the dequantization kernel.
- Implemented an outlier detection kernel to get the columns that contain at least 1 value higher than `threshold` (defaults to 6).
- Implemented CPU versions + basic testing for all the above kernels.
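A sketch of the outlier-detection idea: flag every column that contains at least one element whose magnitude exceeds `threshold` (using the absolute value here is an assumption; the real kernel may also return column indices rather than flags):

```cpp
__global__ void detectOutlierColumns(const float* input, int* colIsOutlier,
                                     int nRows, int nCols, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < nRows * nCols) {
        int col = idx % nCols;
        if (fabsf(input[idx]) > threshold)
            colIsOutlier[col] = 1;  // benign race: every writer stores the same value
    }
}
```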
- Converted `rowWiseQuant8bits()` to work with half precision (fp16) and refactored the testing function to handle it.
- Added a PyTorch binding for that function to allow testing against the PyTorch version.
- Learned more about `torch.profiler` and ran benchmarks against the bitsandbytes `int8_vectorwise_quant()` function.
- My CUDA implementation is 30% slower: 8.59ms vs 11.6ms. Note that my CUDA kernel (i.e. the computation) runs in 406µs, which means a lot of the time is spent copying data between host and device.
- I'll dig deeper into profiling to see what bottlenecks my implementation has and how to improve it.
- Focused on profiling C++ and Python-bound CUDA kernels using Nvidia Nsight Systems and NVTX.
- Identified bottlenecks in vector-wise quantization, primarily due to:
- Unnecessary memory allocation, which adds overhead for copying data between CPU and GPU and for freeing memory.
- The wrapper function around the kernel is 30% slower due to these issues.
- However, the actual kernel execution time is 39µs, faster than BitsandBytes (54µs).
- BitsandBytes includes an extra outlier detection step, which accounts for some of the difference.
- Identified my implementation's bottlenecks and optimized them:
  - My kernel wrapper was taking host pointers (from `tensor.data_ptr<T>()`) and then allocating, copying, and freeing new device arrays, which added most of the overhead, whereas the bnb implementation takes tensors that are already on device. => Modified my kernel wrapper to directly take device pointers from torch tensors.
  - I was using `torch::zeros()` to initialize the tensors that will contain the results, which is inefficient as it adds an extra unnecessary init operation. => Replaced it with `torch::empty()`.
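A sketch of what the fixed wrapper looks like after both changes, reusing the hypothetical `rowWiseQuant8bits` kernel sketched earlier: it requires the input tensor to already live on the GPU, allocates outputs with `torch::empty()`, and launches directly on the tensors' device pointers (the real wrapper also returns the per-row scales):

```cpp
#include <torch/extension.h>
#include <cstdint>

torch::Tensor row_wise_quant_8bit(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must already be a CUDA tensor");
    const int nRows = input.size(0);
    const int nCols = input.size(1);

    // torch::empty() skips the zero-fill that torch::zeros() would perform.
    auto opts = torch::TensorOptions().device(input.device());
    auto output = torch::empty({nRows, nCols}, opts.dtype(torch::kInt8));
    auto scales = torch::empty({nRows}, opts.dtype(torch::kFloat32));

    const int threads = 256;
    rowWiseQuant8bits<<<nRows, threads, threads * sizeof(float)>>>(
        input.data_ptr<float>(), output.data_ptr<int8_t>(),
        scales.data_ptr<float>(), nCols);
    return output;
}
```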
- Created more robust benchmarking:
- 50 runs per function to get more reliable statistics.
- Use `torch.cuda.synchronize()` before and after function calls to ensure device readiness.
Benchmark was done on a 4096x4096 tensor:
- My implementation: 34µs.
- bitsandbytes implementation: 53µs.

Note that the bitsandbytes implementation contains one extra step for detecting the outliers.
- Back to reading PMPP. I started reading Chapter 14 about sparse matrix-vector multiplication.
- Implemented a COOMatrix class to represent sparse matrices in the COO format.
- Implemented SpMV for COO matrices.
- Implemented reference counting for COOMatrixGPU to prevent premature destruction (the destructor was being called, and the CUDA memory for rowsIdx, colIdx, and value freed, after passing my matrix by value to the kernel, which made it unusable afterwards).
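The COO SpMV kernel itself is short: one thread per nonzero, with an `atomicAdd` because several nonzeros can land on the same output row (sketch; member names approximate the COOMatrix fields above):

```cpp
__global__ void spmv_coo(const int* rowIdx, const int* colIdx, const float* values,
                         int nnz, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nnz) {
        // y[row] accumulates contributions from every nonzero that falls in that row.
        atomicAdd(&y[rowIdx[i]], values[i] * x[colIdx[i]]);
    }
}
```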
- Continued reading about SpMV, specifically the CSR matrix format.
- Implemented an SpMV kernel for CSR matrices.
- Started implementing the CSRMatrix class (will continue implementing it later).
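For CSR, the natural mapping is one thread per row, with no atomics needed since each row is owned by exactly one thread (sketch):

```cpp
__global__ void spmv_csr(const int* rowPtr, const int* colIdx, const float* values,
                         int numRows, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        // rowPtr[row] .. rowPtr[row + 1] delimits this row's nonzeros.
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += values[j] * x[colIdx[j]];
        y[row] = sum;
    }
}
```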
- Completed the implementation of CSRMatrixGPU.
- Implemented testing for spvm_csr().