Refactoring & Improvements to reduce LOC #2

Open · wants to merge 33 commits into base: less_idle_more_brrr

Conversation

@ademeure (Owner) commented May 4, 2024

Refactoring and removing unused functions to reduce the number of lines of code and make everything slightly more consistent (while still having space for the code to breathe).

Also update encoder_backward with my version from the more_stochastic branch so that atomicAddX() can be deleted from the codebase. This also improves accuracy very slightly by using stochastic rounding (it's literally the only place in the entire codebase where we are not accumulating in FP32!).
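
To illustrate the idea, here is a minimal sketch of stochastic rounding from FP32 down to BF16. This is not the actual encoder_backward change in this PR, and the helper name is made up; it just shows why the trick keeps the quantization error zero-mean:

```cuda
#include <cuda_bf16.h>

// Hypothetical helper, for illustration only: stochastically round an FP32 value
// down to BF16. `random` is assumed to be a uniformly distributed 32-bit value,
// e.g. from a cheap per-thread RNG.
__device__ __nv_bfloat16 stochastic_round_bf16(float value, unsigned int random) {
    // BF16 keeps the top 16 bits of the FP32 bit pattern. Adding a random offset
    // to the 16 bits that are about to be discarded, then truncating, rounds up
    // with probability proportional to the discarded fraction, so the rounding
    // error averages out to zero instead of being biased by truncation.
    unsigned int bits = __float_as_uint(value);
    bits += (random & 0xFFFFu);
    return __ushort_as_bfloat16((unsigned short)(bits >> 16));
}
```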

This is based on my other PR: karpathy#343. Assuming everyone likes these changes, I will try to merge this back with the latest version of the main branch at some point next week, once that PR has been integrated, and then create a new PR.

const int warp_size = 32;
const int block_size = 512;
const int OC_per_warp = warp_size * x128::size; // 256 at BF16
const int block_size_x = 32;
const int block_size_y = block_size / block_size_x; // 16
const int grid_size_x = OC / OC_per_warp; // e.g. 3 horizontal blocks for 768 OCs at BF16
-const int grid_size_y = max(1, cuda_threads_per_SM * cuda_num_SMs / (block_size * grid_size_x)); // full GPU!
+const int grid_size_y = max(1, deviceProp.maxThreadsPerMultiProcessor * deviceProp.multiProcessorCount / (block_size * grid_size_x)); // full GPU!

+1 for just storing the entire deviceProp
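
For illustration, "storing the entire deviceProp" could look roughly like this; setup_device and the global name are just illustrative, and cudaCheck() is assumed to be the repo's usual error-checking macro:

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: keep the whole cudaDeviceProp struct in one global
// instead of copying individual fields into separate cached variables.
cudaDeviceProp deviceProp;

void setup_device(int device_id) {
    cudaCheck(cudaSetDevice(device_id));
    cudaCheck(cudaGetDeviceProperties(&deviceProp, device_id));
    // launch-configuration code can now read e.g. deviceProp.multiProcessorCount
    // and deviceProp.maxThreadsPerMultiProcessor directly, no extra globals needed
}
```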

@@ -1636,29 +1504,28 @@ void matmul_backward(floatX* dinp, floatX* dweight, floatX* dbias,
dim3(block_size_x, block_size_y),
OC_per_warp * sizeof(float), main_stream>>>(dbias_buffer, dout, B, T, OC);
cast_and_add_kernel<<<CEIL_DIV(OC, 256), 256, 0, main_stream>>>(dbias, dbias_buffer, OC);
cudaCheck(cudaGetLastError());
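
For context on the hunk above: dbias is first accumulated into the FP32 dbias_buffer, and the cast_and_add_kernel launch then converts that buffer and adds it into dbias. A kernel of that shape could look roughly like this (a sketch, not necessarily the PR's exact implementation; floatX is the repo's precision typedef):

```cuda
// Sketch only: add the FP32-accumulated bias gradient (src) into the persistent,
// possibly lower-precision dbias tensor (dst), one element per thread.
__global__ void cast_and_add_kernel(floatX* dst, const float* src, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dst[idx] = (floatX)((float)dst[idx] + src[idx]);
    }
}
```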

why remove the kernel launch check?

@ademeure (Owner, Author) replied:

I figured it'd be more consistent to always have a single "cudaCheck(cudaGetLastError())" at the end of every function.

That should be more than enough to initially narrow down where a problem is coming from, without cluttering e.g. attention_forward/backward with 3 error checks each.
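
As a self-contained illustration of that convention (dummy kernels and a stand-in error-check macro, not the real attention code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Stand-in for the repo's cudaCheck macro (assumption: the real one is similar):
// abort with file/line info if a CUDA API call or kernel launch failed.
#define cudaCheck(err) do {                                        \
    cudaError_t e = (err);                                          \
    if (e != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                cudaGetErrorString(e), __FILE__, __LINE__);         \
        exit(EXIT_FAILURE);                                         \
    }                                                               \
} while (0)

// Two dummy kernels standing in for the real per-layer kernels.
__global__ void scale_kernel(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
__global__ void add_kernel(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += a;
}

// The convention described above: several launches, one check at the very end.
void scale_then_add(float* d_x, int n, cudaStream_t stream) {
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 2.0f, n);
    add_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.0f, n);
    cudaCheck(cudaGetLastError()); // single launch-error check covers both launches
}
```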
