# GPT-2 KV Cache Experiments (gpt2_optim)

This notebook builds and runs the KV cache experiments under `gpt2_optim/`.
It focuses on correctness validation, speed comparison, and profiling.

Assumptions:
- CUDA is available (Colab GPU runtime).
- You have access to `gpt2_124M_bf16.bin` and `gpt2_tokenizer.bin` (both downloaded by the `llm.c` starter pack script). The build below uses `PRECISION=BF16`, so every run loads the BF16 checkpoint.
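
The assumptions above can be checked before building. The cell below is a minimal sketch, not part of `gpt2_optim`: the helper name `check_prerequisites` is hypothetical, and it assumes the starter pack places the files directly under `llm.c/`, as the download log further down shows.

```python
import os
import shutil

def check_prerequisites(root="llm.c",
                        files=("gpt2_124M.bin", "gpt2_124M_bf16.bin", "gpt2_tokenizer.bin")):
    """Report whether nvidia-smi is on PATH and whether each model file exists under root."""
    status = {"nvidia-smi": shutil.which("nvidia-smi") is not None}
    for name in files:
        status[name] = os.path.isfile(os.path.join(root, name))
    return status

# Before the starter pack has been downloaded, the files are expected to be missing.
for item, ok in check_prerequisites().items():
    print(f"{'ok' if ok else 'MISSING':8s}{item}")
```

Run it again after the download cell to confirm that the checkpoint and tokenizer are in place.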


## Setup
Clone both repositories (`llm.c` and `gpt2_optim`), download the starter pack, and build the project.

In [1]:
!rm -rf llm.c
!git clone https://github.com/karpathy/llm.c.git


Cloning into 'llm.c'...
remote: Enumerating objects: 6149, done.
remote: Total 6149 (delta 0), reused 0 (delta 0), pack-reused 6149 (from 1)
Receiving objects: 100% (6149/6149), 2.25 MiB | 6.01 MiB/s, done.
Resolving deltas: 100% (3971/3971), done.


In [2]:
!cd llm.c && chmod u+x dev/download_starter_pack.sh && ./dev/download_starter_pack.sh


Downloaded tiny_shakespeare_val.bin to /content/llm.c/dev/data/tinyshakespeare/tiny_shakespeare_val.bin
Downloaded tiny_shakespeare_train.bin to /content/llm.c/dev/data/tinyshakespeare/tiny_shakespeare_train.bin
Downloaded gpt2_tokenizer.bin to /content/llm.c/dev/../gpt2_tokenizer.bin
Downloaded gpt2_124M_bf16.bin to /content/llm.c/dev/../gpt2_124M_bf16.bin
Downloaded gpt2_124M.bin to /content/llm.c/dev/../gpt2_124M.bin
Downloaded gpt2_124M_debug_state.bin to /content/llm.c/dev/../gpt2_124M_debug_state.bin
Downloaded hellaswag_val.bin to /content/llm.c/dev/data/hellaswag/hellaswag_val.bin
All files downloaded and saved in their respective directories


In [36]:
!rm -rf gpt2_optim
!git clone https://github.com/agridrama/gpt2_optim.git


Cloning into 'gpt2_optim'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 64 (delta 36), reused 50 (delta 22), pack-reused 0 (from 0)
Receiving objects: 100% (64/64), 36.11 KiB | 18.05 MiB/s, done.
Resolving deltas: 100% (36/36), done.


In [37]:
!cd gpt2_optim && make all GPU_COMPUTE_CAPABILITY=75 PRECISION=BF16 LLM_C_ROOT=../llm.c


/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 --generate-code arch=compute_75,code=[compute_75,sm_75] -DENABLE_BF16 -I/content/gpt2_optim/src -I../llm.c /content/gpt2_optim/src/inference_gpt2_optimize.cu -lcublas -lcublasLt -lnvidia-ml -lnvToolsExt  -o /content/gpt2_optim/bin/inference_gpt2optimcu
      multi_gpu_config = multi_gpu_config_init(1, 0, 1, empty_str, empty_str, "mpi");
                                                                              ^


      multi_gpu_config = multi_gpu_config_init(1, 0, 1, empty_str, empty_str, "mpi");
                                                                              ^


/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 --generate-code arch=compute_75,code=[compute_75,sm_75] -DENABLE_BF16 -I/content/gpt2_optim/src -I../llm.c /content/gpt2_optim/src/validate_kvcache_optimization.cu -lcublas -lcublasLt -lnvidia-ml -lnvToolsExt  -o /content/gpt2_optim/bin/validate_kvcache_optimizatio
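
The build passes `GPU_COMPUTE_CAPABILITY=75`, which matches a Tesla T4 (compute capability 7.5). On a different GPU the value has to change. The sketch below shows one way to derive it; the helper names are hypothetical, and it assumes the installed driver's `nvidia-smi` supports the `compute_cap` query field, which older drivers may not.

```python
import shutil
import subprocess

def to_make_flag(compute_cap):
    """Convert a dotted compute capability such as '7.5' into the '75' form used by make."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"

def detect_compute_capability():
    """Return the first GPU's compute capability as e.g. '75', or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    try:
        return to_make_flag(lines[0])
    except ValueError:
        # Output was not in 'major.minor' form (for example an unsupported query field).
        return None

print(to_make_flag("7.5"))           # Tesla T4
print(detect_compute_capability())   # None when no GPU or no nvidia-smi is available
```

The result can then be passed to the build as `GPU_COMPUTE_CAPABILITY=<value>`.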

## Inference with KV Cache Optimization
Command line arguments:
- `-e`: model checkpoint path (example: `../llm.c/gpt2_124M_bf16.bin`)
- `-tk`: tokenizer path (example: `../llm.c/gpt2_tokenizer.bin`)
- `-g`: number of tokens to generate (example: `64`)
- `-b`: batch size (example: `4`)
- `-m`: sampling method (`0` = random sampling from the logits distribution, `1` = greedy sampling, reported as `argmax` in the run config)
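
To make the two `-m` modes concrete, here is a minimal pure-Python sketch of random versus greedy token selection. It illustrates the general technique only; how `inference_gpt2optimcu` implements sampling internally is not shown in this notebook, and the function names and toy logits are made up for the example.

```python
import math
import random

def softmax(logits):
    """Turn logits into probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_greedy(logits):
    """Mode 1 (argmax): always pick the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_random(logits, rng):
    """Mode 0 (random): draw an index with probability softmax(logits)[i]."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 0.5, -1.0, 1.5]   # toy values for a 4-token vocabulary
rng = random.Random(0)           # fixed seed so the draws are repeatable
print("greedy:", sample_greedy(logits))
print("random:", [sample_random(logits, rng) for _ in range(8)])
```

Greedy selection is deterministic, which is why the greedy run in this notebook settles into a repeated sentence, while random sampling produces more varied text.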

In [16]:
# Random sampling (draw each token from the logits distribution)
!cd gpt2_optim && ./bin/inference_gpt2optimcu \
  -e ../llm.c/gpt2_124M_bf16.bin \
  -tk ../llm.c/gpt2_tokenizer.bin \
  -g 64 -b 4 -m 0


Multi-GPU support is disabled. Using a single GPU.
[System]
Device 0: Tesla T4
Loading GPT-2 model from ../llm.c/gpt2_124M_bf16.bin
 -> max_seq_len: 1024
 -> vocab_size: 50257
 -> padded_vocab_size: 50304
 -> num_layers: 12
 -> num_heads: 12
 -> channels: 768
allocating 2475 MiB for activations
device memory usage: 2854 MiB / 14912 MiB
memory per sequence: 618 MiB
 -> estimated maximum batch size: 23
=== GPT-2 Inference (gpt2_optim) ===
[Run Config]
  checkpoint: ../llm.c/gpt2_124M_bf16.bin
  tokenizer:  ../llm.c/gpt2_tokenizer.bin
  genT:       64
  batch size: 4
  sampling:   random
  validation: off

=== Section: Naive Inference (Baseline) ===

Base implementation: total time = 5845.26 ms, forward time = 5723.97 ms
Token per second: 43.80
Generated tokens:
Batch 0:
!
<|endoftext|>There were<|endoftext|>Forum Jump<|endoftext|>Copyright by WZZ<|endoftext|>The slow-motion shot gives you a glimpse at a picture that<|endoftext|>In case you<|endoftext|>Chennai: the government has<|endofte

In [None]:
# Greedy sampling (always pick the token with the highest logit)
!cd gpt2_optim && ./bin/inference_gpt2optimcu \
  -e ../llm.c/gpt2_124M_bf16.bin \
  -tk ../llm.c/gpt2_tokenizer.bin \
  -g 64 -b 4 -m 1


Multi-GPU support is disabled. Using a single GPU.
[System]
Device 0: Tesla T4
Loading GPT-2 model from ../llm.c/gpt2_124M_bf16.bin
 -> max_seq_len: 1024
 -> vocab_size: 50257
 -> padded_vocab_size: 50304
 -> num_layers: 12
 -> num_heads: 12
 -> channels: 768
allocating 2475 MiB for activations
device memory usage: 2854 MiB / 14912 MiB
memory per sequence: 618 MiB
 -> estimated maximum batch size: 23
=== GPT-2 Inference (gpt2_optim) ===
[Run Config]
  checkpoint: ../llm.c/gpt2_124M_bf16.bin
  tokenizer:  ../llm.c/gpt2_tokenizer.bin
  genT:       64
  batch size: 4
  sampling:   argmax
  validation: off

=== Section: Naive Inference (Baseline) ===

Base implementation: total time = 5845.68 ms, forward time = 5821.09 ms
Token per second: 43.79
Generated tokens:
Batch 0:
!

The first thing I did was to go to the local store and buy a few of the "B"s. I was told that the Bs were the best I had ever had. I was so excited to try them out. I was so excited to try them out. I was so excited to
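
Both runs above report a throughput figure. The reported values are consistent with dividing the total number of generated tokens (batch size times `genT`) by the total time. The sketch below reproduces that arithmetic; the formula is inferred from the printed numbers, not read from the source code.

```python
def tokens_per_second(batch_size, gen_tokens, total_ms):
    """Throughput = generated tokens across the whole batch / elapsed seconds."""
    return batch_size * gen_tokens / (total_ms / 1000.0)

# Values taken from the two baseline runs above (batch size 4, genT 64).
print(f"random sampling: {tokens_per_second(4, 64, 5845.26):.2f} tokens/s")
print(f"greedy sampling: {tokens_per_second(4, 64, 5845.68):.2f} tokens/s")
```

The same helper can be applied to the optimized path's total time to get a like-for-like speed comparison.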

## KV Cache Validation
Compare the baseline and optimized logits at each generation step.

In [38]:
!cd gpt2_optim && ./bin/validate_kvcache_optimization \
  -e ../llm.c/gpt2_124M_bf16.bin \
  -tk ../llm.c/gpt2_tokenizer.bin \
  -g 32 -b 2


Multi-GPU support is disabled. Using a single GPU.
[System]
Device 0: Tesla T4
Loading GPT-2 model from ../llm.c/gpt2_124M_bf16.bin
 -> max_seq_len: 1024
 -> vocab_size: 50257
 -> padded_vocab_size: 50304
 -> num_layers: 12
 -> num_heads: 12
 -> channels: 768
=== KV Cache Validation ===
[Run Config]
  checkpoint: ../llm.c/gpt2_124M_bf16.bin
  tokenizer:  ../llm.c/gpt2_tokenizer.bin
  genT:       32
  batch size: 2
  precision:  BF16

allocating 1237 MiB for activations
device memory usage: 1616 MiB / 14912 MiB
memory per sequence: 618 MiB
 -> estimated maximum batch size: 23
Step | max_abs_diff | rmse      | base_top3(token:val)                 | opt_top3(token:val)
-----+--------------+-----------+-------------------------------------+-------------------------------------
   0 |     0.000000 |  0.000000 |    11: -28.3750    13: -28.5000    11: -28.7500 |    11: -28.3750    13: -28.5000    11: -28.7500
   1 |     0.500000 |  0.267544 |   290: -78.5000   262: -79.0000   543: -79.0000 | 
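
The table reports `max_abs_diff`, `rmse`, and the top-3 logits per step. A plausible way to compute these metrics is sketched below in pure Python. The exact definitions in `validate_kvcache_optimization.cu` may differ (for example in which vocabulary entries are included), and the input vectors are toy values, not data from the run.

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two logit vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def rmse(a, b):
    """Root mean squared error between two logit vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def top_k(logits, k=3):
    """Return the k largest logits as (token_id, value) pairs, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, logits[i]) for i in order[:k]]

def bf16_spacing(x):
    """Gap between neighbouring BF16 values near a nonzero normal x (7 stored mantissa bits)."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - 7)

base = [1.0, 2.0, 3.0, 4.0]   # toy baseline logits
opt = [1.0, 2.5, 3.0, 4.0]    # toy optimized logits, one entry off by 0.5
print(max_abs_diff(base, opt), rmse(base, opt), top_k(base))
print(bf16_spacing(-78.5))    # spacing of BF16 values with magnitude in [64, 128)
```

If the logits are held in BF16, values with magnitude between 64 and 128 are spaced 0.5 apart, so the `max_abs_diff` of 0.5 at step 1 (logits near -78.5) is consistent with a single rounding step rather than a logic error.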

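## Profiling
Install Nsight Systems, run the profiling binary once on its own, then run it under `nsys`.

In [None]: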
# Install Nsight Systems (nsys); this may take a few minutes.
# -y answers apt's confirmation prompt so the cell does not stall or abort.
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nsight-systems-2025.5.2_2025.5.2.266-1_amd64.deb
!apt update
!apt install -y ./nsight-systems-2025.5.2_2025.5.2.266-1_amd64.deb
!apt --fix-broken install -y


In [19]:
!cd gpt2_optim && ./bin/profile_kvcache_optimization \
    -e ../llm.c/gpt2_124M_bf16.bin \
    -tk ../llm.c/gpt2_tokenizer.bin \
    -g 128 -b 2

Multi-GPU support is disabled. Using a single GPU.
[System]
Device 0: Tesla T4
Loading GPT-2 model from ../llm.c/gpt2_124M_bf16.bin
 -> max_seq_len: 1024
 -> vocab_size: 50257
 -> padded_vocab_size: 50304
 -> num_layers: 12
 -> num_heads: 12
 -> channels: 768
allocating 1237 MiB for activations
device memory usage: 1616 MiB / 14912 MiB
memory per sequence: 618 MiB
 -> estimated maximum batch size: 23
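
The log above reports activation memory. For comparison, the size of a KV cache can be estimated from the printed model configuration (12 layers, 768 channels, `max_seq_len` 1024). The sketch below assumes the cache stores one key vector and one value vector of `channels` elements per layer per token in BF16 (2 bytes each). The actual layout in `gpt2_optim` is not shown here and may differ.

```python
def kv_cache_bytes(num_layers, channels, seq_len, batch_size, bytes_per_value=2):
    """Estimated KV cache size: 2 tensors (K and V) per layer, per token, per sequence."""
    return 2 * num_layers * channels * seq_len * batch_size * bytes_per_value

per_token = kv_cache_bytes(num_layers=12, channels=768, seq_len=1, batch_size=1)
full = kv_cache_bytes(num_layers=12, channels=768, seq_len=1024, batch_size=2)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full / 2**20:.0f} MiB for batch size 2 at the full 1024-token context")
```

Under these assumptions the cache is small next to the 618 MiB per sequence that the log reports for activations.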


In [26]:
!cd gpt2_optim && nsys profile -t cuda,nvtx \
  -o prof_kvcache \
  ./bin/profile_kvcache_optimization \
    -e ../llm.c/gpt2_124M_bf16.bin \
    -tk ../llm.c/gpt2_tokenizer.bin \
    -g 128 -b 2


Collecting data...
Multi-GPU support is disabled. Using a single GPU.
[System]
Device 0: Tesla T4
Loading GPT-2 model from ../llm.c/gpt2_124M_bf16.bin
 -> max_seq_len: 1024
 -> vocab_size: 50257
 -> padded_vocab_size: 50304
 -> num_layers: 12
 -> num_heads: 12
 -> channels: 768
allocating 1237 MiB for activations
device memory usage: 1644 MiB / 14912 MiB
memory per sequence: 618 MiB
 -> estimated maximum batch size: 23
Generating '/tmp/nsys-report-93e9.qdstrm'
Generated:
	/content/gpt2_optim/prof_kvcache.nsys-rep
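
The generated `prof_kvcache.nsys-rep` file can be opened in the Nsight Systems GUI or summarized on the command line with `nsys stats`. The sketch below builds such a command from Python. The helper name is hypothetical, and the report names `cuda_gpu_kern_sum` and `nvtx_sum` are assumed to exist in the installed Nsight Systems version; adjust them if `nsys stats` rejects them.

```python
import os
import shutil
import subprocess

def nsys_stats_command(report_path, reports=("cuda_gpu_kern_sum", "nvtx_sum")):
    """Build an `nsys stats` command line that prints the requested summary reports."""
    cmd = ["nsys", "stats"]
    for name in reports:
        cmd += ["--report", name]
    cmd.append(report_path)
    return cmd

report = "gpt2_optim/prof_kvcache.nsys-rep"
cmd = nsys_stats_command(report)
print(" ".join(cmd))

# Only run the command when nsys is installed and the profile actually exists.
if shutil.which("nsys") and os.path.exists(report):
    subprocess.run(cmd, check=False)
```

The kernel summary shows where GPU time goes, and the NVTX summary groups time by the ranges collected with `-t cuda,nvtx`.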
