Refactoring & Improvements to reduce LOC #2

Open · wants to merge 33 commits into base: less_idle_more_brrr

Conversation

@ademeure (Owner) commented May 4, 2024

Refactoring and removing unused functions to reduce the number of lines of code and make everything slightly more consistent (while still having space for the code to breathe).

Also update encoder_backward with my version from the more_stochastic branch so that atomicAddX() can be deleted from the codebase. This also improves accuracy very slightly by using stochastic rounding (it's literally the only place in the entire codebase where we are not accumulating in FP32!).
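
To illustrate the idea, here is a minimal sketch of stochastic rounding from FP32 down to BF16. This is not the actual encoder_backward change in this PR, and the helper name is made up; it just shows why the trick keeps the quantization error zero-mean:

```cuda
#include <cuda_bf16.h>

// Hypothetical helper, for illustration only: stochastically round an FP32 value
// down to BF16. `random` is assumed to be a uniformly distributed 32-bit value,
// e.g. from a cheap per-thread RNG.
__device__ __nv_bfloat16 stochastic_round_bf16(float value, unsigned int random) {
    // BF16 keeps the top 16 bits of the FP32 bit pattern. Adding a random offset
    // to the 16 bits that are about to be discarded, then truncating, rounds up
    // with probability proportional to the discarded fraction, so the rounding
    // error averages out to zero instead of being biased by truncation.
    unsigned int bits = __float_as_uint(value);
    bits += (random & 0xFFFFu);
    return __ushort_as_bfloat16((unsigned short)(bits >> 16));
}
```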

This is based on my other PR: karpathy#343. Assuming everyone likes these changes, I will try to merge this back with the latest version of the main branch at some point next week, once that PR has been integrated, and then create a new PR.

const int warp_size = 32;
const int block_size = 512;
const int OC_per_warp = warp_size * x128::size; // 256 at BF16
const int block_size_x = 32;
const int block_size_y = block_size / block_size_x; // 16
const int grid_size_x = OC / OC_per_warp; // e.g. 3 horizontal blocks for 768 OCs at BF16
-const int grid_size_y = max(1, cuda_threads_per_SM * cuda_num_SMs / (block_size * grid_size_x)); // full GPU!
+const int grid_size_y = max(1, deviceProp.maxThreadsPerMultiProcessor * deviceProp.multiProcessorCount / (block_size * grid_size_x)); // full GPU!

+1 for just storing the entire deviceProp
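
For illustration, "storing the entire deviceProp" could look roughly like this; setup_device and the global name are just illustrative, and cudaCheck() is assumed to be the repo's usual error-checking macro:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: keep the whole cudaDeviceProp struct in one global
// instead of copying individual fields into separate cached variables.
cudaDeviceProp deviceProp;

void setup_device(int device_id) {
    cudaCheck(cudaSetDevice(device_id));
    cudaCheck(cudaGetDeviceProperties(&deviceProp, device_id));
    // launch-configuration code can now read e.g. deviceProp.multiProcessorCount
    // and deviceProp.maxThreadsPerMultiProcessor directly, no extra globals needed
}
```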

@@ -1636,29 +1504,28 @@ void matmul_backward(floatX* dinp, floatX* dweight, floatX* dbias,
dim3(block_size_x, block_size_y),
OC_per_warp * sizeof(float), main_stream>>>(dbias_buffer, dout, B, T, OC);
cast_and_add_kernel<<<CEIL_DIV(OC, 256), 256, 0, main_stream>>>(dbias, dbias_buffer, OC);
cudaCheck(cudaGetLastError());
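
For context on the hunk above: dbias is first accumulated into the FP32 dbias_buffer, and the cast_and_add_kernel launch then converts that buffer and adds it into dbias. A kernel of that shape could look roughly like this (a sketch, not necessarily the PR's exact implementation; floatX is the repo's precision typedef):

```cuda
// Sketch only: add the FP32-accumulated bias gradient (src) into the persistent,
// possibly lower-precision dbias tensor (dst), one element per thread.
__global__ void cast_and_add_kernel(floatX* dst, const float* src, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dst[idx] = (floatX)((float)dst[idx] + src[idx]);
    }
}
```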

why remove the kernel launch check?

@ademeure (Owner, Author) replied:

I figured it'd be more consistent to always have a single "cudaCheck(cudaGetLastError())" at the end of every function.

That should be more than enough to initially narrow down where a problem is coming from, without cluttering e.g. attention_forward/backward with 3 error checks each.
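
As a self-contained illustration of that convention (dummy kernels and a stand-in error-check macro, not the real attention code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Stand-in for the repo's cudaCheck macro (assumption: the real one is similar):
// abort with file/line info if a CUDA API call or kernel launch failed.
#define cudaCheck(err) do {                                        \
    cudaError_t e = (err);                                          \
    if (e != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                cudaGetErrorString(e), __FILE__, __LINE__);         \
        exit(EXIT_FAILURE);                                         \
    }                                                               \
} while (0)

// Two dummy kernels standing in for the real per-layer kernels.
__global__ void scale_kernel(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
__global__ void add_kernel(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += a;
}

// The convention described above: several launches, one check at the very end.
void scale_then_add(float* d_x, int n, cudaStream_t stream) {
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 2.0f, n);
    add_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0f, n);
    cudaCheck(cudaGetLastError()); // single launch-error check covers both launches
}
```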
