Refactoring & Improvements to reduce LOC #2
base: less_idle_more_brrr
Conversation
…ernel changes make it compile with cudnn default batchsize is 24 now
updated the profile script to be more robust, and adapted to recent k…
Adding a java port of the project
const int warp_size = 32;
const int block_size = 512;
const int OC_per_warp = warp_size * x128::size; // 256 at BF16
const int block_size_x = 32;
const int block_size_y = block_size / block_size_x; // 16
const int grid_size_x = OC / OC_per_warp; // e.g. 3 horizontal blocks for 768 OCs at BF16
const int grid_size_y = max(1, cuda_threads_per_SM * cuda_num_SMs / (block_size * grid_size_x)); // full GPU!
const int grid_size_y = max(1, deviceProp.maxThreadsPerMultiProcessor * deviceProp.multiProcessorCount / (block_size * grid_size_x)); // full GPU!
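The launch-geometry arithmetic above can be sketched on the host with plain integer math. A minimal sketch, assuming illustrative stand-in values for the device properties (`MAX_THREADS_PER_SM`, `NUM_SMS`) — the real code reads them from `cudaGetDeviceProperties()`:

```c
#include <assert.h>

/* Illustrative stand-ins for cudaDeviceProp fields; real values come
   from cudaGetDeviceProperties(). */
#define MAX_THREADS_PER_SM 1536
#define NUM_SMS 80

static int max_int(int a, int b) { return a > b ? a : b; }

/* Mirror of the grid-sizing arithmetic in the diff. x128::size is 8 for
   BF16 (128-bit vector load / 2 bytes per element). */
static void compute_grid(int OC, int *gx, int *gy) {
    const int warp_size = 32;
    const int block_size = 512;
    const int x128_size = 8;                       /* BF16 elements per 128-bit load */
    const int OC_per_warp = warp_size * x128_size; /* 256 at BF16 */
    const int block_size_x = 32;
    const int block_size_y = block_size / block_size_x; /* 16 */
    const int grid_size_x = OC / OC_per_warp;      /* 3 blocks for 768 OCs */
    /* enough y-blocks to fill every SM, but never fewer than 1 */
    const int grid_size_y = max_int(1, MAX_THREADS_PER_SM * NUM_SMS
                                       / (block_size * grid_size_x));
    (void)block_size_x; (void)block_size_y;
    *gx = grid_size_x;
    *gy = grid_size_y;
}
```

For GPT-2's OC = 768 this yields a 3-wide grid, with the y-dimension scaled so block_size * grid_size_x * grid_size_y saturates the (stand-in) GPU.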
+1 for just storing the entire deviceProp
@@ -1636,29 +1504,28 @@ void matmul_backward(floatX* dinp, floatX* dweight, floatX* dbias,
            dim3(block_size_x, block_size_y),
            OC_per_warp * sizeof(float), main_stream>>>(dbias_buffer, dout, B, T, OC);
    cast_and_add_kernel<<<CEIL_DIV(OC, 256), 256, 0, main_stream>>>(dbias, dbias_buffer, OC);
    cudaCheck(cudaGetLastError());
why remove the kernel launch check?
I figured it'd be more consistent to always have a single "cudaCheck(cudaGetLastError())" at the end of every function.
That should be more than enough for the initial debugging of where a problem comes from, without cluttering e.g. attention_forward/backward with three error checks each.
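The convention being discussed can be sketched in plain C. In this sketch, `fakeError_t`, `fakeGetLastError`, and `errCheck` are stand-ins for `cudaError_t`, `cudaGetLastError()`, and llm.c's `cudaCheck` macro, so the pattern compiles without the CUDA runtime:

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for the CUDA runtime's sticky last-error mechanism. */
typedef int fakeError_t;
#define fakeSuccess 0
static fakeError_t g_last_error = fakeSuccess;
static fakeError_t fakeGetLastError(void) { return g_last_error; }

/* A check macro in the spirit of cudaCheck: report file/line and abort
   on any non-success status. */
#define errCheck(err) do {                                                  \
    fakeError_t e_ = (err);                                                 \
    if (e_ != fakeSuccess) {                                                \
        fprintf(stderr, "error %d at %s:%d\n", e_, __FILE__, __LINE__);     \
        exit(EXIT_FAILURE);                                                 \
    }                                                                       \
} while (0)

/* The convention: launch several kernels back to back, then make ONE
   errCheck(fakeGetLastError()) call at the end of the function instead
   of one after every launch. */
static void matmul_backward_sketch(void) {
    /* kernel launch 1 ... */
    /* kernel launch 2 ... */
    /* kernel launch 3 ... */
    errCheck(fakeGetLastError());
}
```

Because CUDA's last error is sticky until queried, a single check at the end still identifies which function a launch failure came from, which is the granularity the comment above argues is sufficient for initial debugging.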
Remove the duplicated #include <assert.h>
… into ademeure-less_idle_more_brrr-3
…into ngc92-separate-compilation
…h feels slightly cleaner
…work on older CUDAs) + fixes
Refactoring and removing unused functions to reduce the number of lines of code and make everything slightly more consistent (while still having space for the code to breathe).
Also update encoder_backward with my version from the more_stochastic branch in order to delete atomicAddX() from the codebase, while improving accuracy very slightly by using stochastic rounding (it's literally the only place in the entire code where we are not accumulating in FP32!)
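The stochastic-rounding idea mentioned above can be sketched for the FP32 → BF16 case. This is a minimal host-side sketch of the technique, not llm.c's actual kernel (which differs in details such as the random-number source): BF16 keeps the top 16 bits of an FP32 value, so adding uniform random noise in the 16 bits that will be discarded before truncating rounds up with probability proportional to the discarded fraction.

```c
#include <stdint.h>
#include <string.h>

/* Stochastically round an FP32 value to BF16 (returned as raw uint16
   bits). rand_bits should be uniform random; only its low 16 bits are
   used as the rounding noise. With rand_bits == 0 this is truncation. */
static uint16_t float_to_bf16_stochastic(float x, uint32_t rand_bits) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits)); /* reinterpret float as uint32 */
    bits += rand_bits & 0xFFFFu;     /* noise below the rounding point */
    return (uint16_t)(bits >> 16);   /* keep the top 16 bits: BF16 */
}

/* Widen BF16 bits back to FP32 (exact: BF16 is a prefix of FP32). */
static float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof(x));
    return x;
}
```

Averaged over many samples the rounding error is unbiased, which is why it helps accuracy in the one place (encoder_backward) where gradients are not accumulated in FP32.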
This is based off my other PR: karpathy#343 - assuming everyone likes these changes, I will try to merge this back with latest version of the main branch at some point next week once that PR has been integrated, then create a new PR.