Robust Mini-Batch Gradient Descent

This repository contains the source code for the final project of COMP_SCI 468: Programming Massively Parallel Processors with CUDA at Northwestern University.

In this project, we implemented the best model found in Robust Mini-Batch Gradient Descent (ε-then-Z-score-trimmed Huber Loss without preprocessing) in both C++ (as the baseline) and CUDA.
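
For reference, the following is a minimal, single-threaded sketch of what the trimmed loss computes, as we understand it; the parameter names (epsilon, z_threshold, delta) and the exact tie-breaking and averaging details are illustrative assumptions, not code from either repo.

// Illustrative sketch only: epsilon-then-z-score-trimmed Huber loss on a batch of residuals.
// The parameter names epsilon, z_threshold and delta are ours, not the repo's.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double trimmed_huber_loss(std::vector<double> residuals,
                          double epsilon,      // fraction of largest-|residual| samples to drop
                          double z_threshold,  // maximum allowed z-score after the first trim
                          double delta) {      // Huber transition point
    // 1. epsilon-trimming: drop the epsilon fraction of samples with the largest |residual|.
    std::sort(residuals.begin(), residuals.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    residuals.resize(static_cast<std::size_t>((1.0 - epsilon) * residuals.size()));
    if (residuals.empty()) return 0.0;

    // 2. z-score statistics over the surviving samples.
    double mean = 0.0, var = 0.0;
    for (double r : residuals) mean += r;
    mean /= residuals.size();
    for (double r : residuals) var += (r - mean) * (r - mean);
    const double stddev = std::sqrt(var / residuals.size());

    // 3. Huber loss over the samples that also pass the z-score trim.
    double loss = 0.0;
    std::size_t kept = 0;
    for (double r : residuals) {
        if (stddev > 0.0 && std::fabs(r - mean) / stddev > z_threshold) continue;
        const double a = std::fabs(r);
        loss += (a <= delta) ? 0.5 * r * r : delta * (a - 0.5 * delta);
        ++kept;
    }
    return kept > 0 ? loss / kept : 0.0;
}

The CUDA implementation parallelizes each of these steps, as described under Contribution below.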

Note: the CUDA code is optimized for the RTX 2060 Super graphics cards in the Wilkinson Lab.

For more details on the model itself, please visit the Robust Mini-Batch Gradient Descent repo.

Run Tests

make dependencies    # only before the first run
make                 # build both the C++ and the CUDA implementations
make run             # run the same test on both implementations
# make clean         # to clean up the binaries and input/output files generated
# make clobber       # to also clean up the graphs and python virtual environment

Two graphs, testing.png and training.png, will be generated, showing the regression lines on the training set and the testing set. These graphs serve as a correctness check for the implementations. For more test cases, please visit the Robust Mini-Batch Gradient Descent repo.

Results

Implementation   # of models trained   Training time (ms)
C++              1                     54.701
C++              10                    542.758
C++              100                   5,336.640
C++              1000                  54,236.484
CUDA             1                     119.817
CUDA             10                    121.361
CUDA             100                   132.146
CUDA             1000                  848.172

TODO

  • Find out ways to reduce shared memory usage

  • Add padding to adapt to non-power-of-2 batch sizes (a possible approach is sketched below)
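
One possible approach, sketched under our own assumptions (next_pow2, pad_residuals, and the FLT_MAX sentinel are illustrative names and choices, not code from this repo), is to round the batch up to the next power of two and fill the tail with values that the trimming steps will always discard:

// Illustrative only: pad the residual buffer up to the next power of two so that
// power-of-2 reductions and sorts still work; padded entries sort to the end and
// are removed by the trimming steps.
#include <cfloat>

__host__ __device__ unsigned int next_pow2(unsigned int n) {
    unsigned int p = 1;
    while (p < n) p <<= 1;
    return p;
}

__global__ void pad_residuals(float* residuals, unsigned int batch_size, unsigned int padded_size) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= batch_size && i < padded_size)
        residuals[i] = FLT_MAX;  // sentinel: always treated as the largest "residual" and trimmed away
}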

Contribution

Hanming Wang

  • Implemented the baseline implementation in C++

  • Implemented a naive implementation in CUDA

  • Implemented the Python test script

  • Parallelized residual calculation

  • Parallelized the merge sort (later replaced by the enumeration sort)

  • Parallelized ε-trimming

  • Parallelized z-score-trimming

  • Parallelized loss and gradient calculation

  • Implemented a local copy of the weight vector in shared memory, writing back to global memory only when the model is fully trained

  • Changed X from row-major to column-major, and applied other techniques for coalesced global memory access (see the sketch after this list)

  • Changed the batch sample selection mechanism to avoid the overhead of random number generation

  • Applied thread coarsening

  • Solved many thread divergence issues
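
To make the shared-memory and coalescing items above concrete, here is a hedged sketch of a residual kernel that caches the weight vector in shared memory and reads a column-major X so that consecutive threads in a warp touch consecutive addresses. The dimension names, bias placement, and exact layout are our assumptions and may differ from the actual kernels.

// Hedged sketch: one thread per sample, weights cached in shared memory,
// X stored so that the sample index varies fastest (coalesced reads per warp).
// n_samples, n_features and the bias placement are illustrative assumptions.
__global__ void compute_residuals(const float* __restrict__ X,  // n_features x n_samples, sample index contiguous
                                  const float* __restrict__ y,
                                  const float* __restrict__ w,  // n_features weights, bias at w[n_features]
                                  float* __restrict__ r,
                                  int n_samples, int n_features) {
    extern __shared__ float w_s[];                         // local copy of the weight vector
    for (int j = threadIdx.x; j <= n_features; j += blockDim.x)
        w_s[j] = w[j];                                     // each block loads w from global memory once
    __syncthreads();

    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per sample
    if (i >= n_samples) return;

    float pred = w_s[n_features];                          // bias term
    for (int j = 0; j < n_features; ++j)
        // Threads i, i+1, ... in a warp read consecutive addresses: coalesced access.
        pred += w_s[j] * X[j * n_samples + i];
    r[i] = pred - y[i];
}

Such a kernel would be launched with (n_features + 1) * sizeof(float) bytes of dynamic shared memory.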

Jiren Li

  • Implemented the enumeration sort (sketched after this list)

  • Implemented the odd-even sort (which was replaced by the enumeration sort)

  • Unrolled the loops of the parallel merge sort (later replaced by the odd-even sort)
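
For reference, an enumeration (rank) sort lets each thread compute the final position of one element by counting how many elements must precede it, so every output slot is written exactly once. A minimal sketch, with tie-breaking by index as our assumption:

// Hedged sketch of an enumeration (rank) sort: each thread ranks one element by
// counting how many elements must precede it, then writes it straight to its slot.
__global__ void enumeration_sort(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float v = in[i];
    int rank = 0;
    for (int j = 0; j < n; ++j) {
        const float u = in[j];
        if (u < v || (u == v && j < i))  // ties broken by index, so every rank is unique
            ++rank;
    }
    out[rank] = v;  // each thread writes a distinct slot, so no atomics or synchronization needed
}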

Acknowledgement

We thank Prof. Nikos Hardavellas for the wonderful lectures throughout the quarter. Hanming especially thanks Nikos for the time he spent during and beyond office hours answering our numerous questions (some admittedly naive) about CUDA.