By https://www.kaggle.com/code/carrot1500/nanogpt-trained-on-openwebtext/notebook

In [1]:
from pathlib import Path

In [2]:
!(python --version; \
 pip --version; \
 conda --version; \
 python -c "import torch; print(torch.__version__)")

Python 3.10.12
pip 23.2.1 from /opt/conda/lib/python3.10/site-packages/pip (python 3.10)
conda 23.7.4
2.0.0


In [3]:
# Install the dependencies with pip (output hidden), then clone nanoGPT
!pip install torch numpy transformers datasets tiktoken wandb tqdm >/dev/null 2>&1
# Upgrade datasets: the default Kaggle kernel ships 2.1.0, versus 2.12.0 upstream
!pip install datasets -U >/dev/null 2>&1
!git clone https://github.com/karpathy/nanoGPT.git

fatal: destination path 'nanoGPT' already exists and is not an empty directory.


In [4]:
# nanoGPT expects the data at a fixed path; the files are large, so symlink them instead of copying
!for s in train val; do echo $s && ln -s /kaggle/input/openwebtext-data-prepared-for-nanogpt/$s.bin /kaggle/working/nanoGPT/data/openwebtext/$s.bin; done

train
ln: failed to create symbolic link '/kaggle/working/nanoGPT/data/openwebtext/train.bin': File exists
val
ln: failed to create symbolic link '/kaggle/working/nanoGPT/data/openwebtext/val.bin': File exists
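
The "File exists" errors above appear when the cell is run a second time. As a minimal sketch, the same linking step can be made safe to re-run from Python. The helper name `link_split` is hypothetical and not part of nanoGPT; the paths are the ones used in the cell above.

```python
from pathlib import Path


def link_split(src_dir, dst_dir, split):
    """Symlink <src_dir>/<split>.bin into dst_dir, replacing any existing link."""
    src = Path(src_dir) / f"{split}.bin"
    dst = Path(dst_dir) / f"{split}.bin"
    # Remove a stale link or file first so that a re-run does not fail.
    if dst.is_symlink() or dst.exists():
        dst.unlink()
    dst.symlink_to(src)
    return dst


src_dir = Path("/kaggle/input/openwebtext-data-prepared-for-nanogpt")
dst_dir = Path("/kaggle/working/nanoGPT/data/openwebtext")
# Only attempt the links when both directories exist (i.e. on Kaggle).
if src_dir.is_dir() and dst_dir.is_dir():
    for split in ("train", "val"):
        print(link_split(src_dir, dst_dir, split))
```

Unlinking before linking mirrors what `ln -sf` does in the shell, so either approach avoids the error on repeated runs.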


In [5]:
data_dir = Path("/kaggle/working/nanoGPT/data/openwebtext/")
sorted(data_dir.glob("*"))

[PosixPath('/kaggle/working/nanoGPT/data/openwebtext/prepare.py'),
 PosixPath('/kaggle/working/nanoGPT/data/openwebtext/readme.md'),
 PosixPath('/kaggle/working/nanoGPT/data/openwebtext/train.bin'),
 PosixPath('/kaggle/working/nanoGPT/data/openwebtext/val.bin')]
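
As a quick sanity check on the linked files, the number of tokens in each split can be estimated from the file size alone. This is a sketch that assumes the `.bin` files follow the convention of nanoGPT's `prepare.py`, which writes token ids as a flat array of 16-bit unsigned integers; the helper name `count_tokens` is hypothetical.

```python
from pathlib import Path

BYTES_PER_TOKEN = 2  # assumption: token ids are stored as uint16


def count_tokens(bin_path):
    """Return the number of tokens in a flat uint16 token file."""
    size = Path(bin_path).stat().st_size
    if size % BYTES_PER_TOKEN:
        raise ValueError(f"{bin_path}: size {size} is not a multiple of {BYTES_PER_TOKEN}")
    return size // BYTES_PER_TOKEN


data_dir = Path("/kaggle/working/nanoGPT/data/openwebtext/")
for split in ("train", "val"):
    path = data_dir / f"{split}.bin"
    # Skip splits that are not present (e.g. when run outside Kaggle).
    if path.exists():
        print(split, count_tokens(path))
```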

The command below trains a GPT-2-style model on a single GPU (I'm using a P100). Since we have only 1 GPU instead of the 8 A100s in the reference configuration, I've reduced the gradient accumulation steps from $5 \times 8 = 40 \rightarrow 4$. Each iteration therefore processes a smaller batch (about 50k tokens instead of about 500k), so I've also reduced the learning rate from $0.0006 \rightarrow 0.0001$. This roughly follows linear scaling, which would give $0.00006$ for a batch that is 10 times smaller.
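
The batch arithmetic can be checked with a short sketch. The function names are illustrative and not part of nanoGPT; the numbers are taken from `config/train_gpt2.py` and from the overrides passed to `train.py` in this notebook.

```python
def tokens_per_iter(batch_size, block_size, grad_accum_steps, n_gpus=1):
    """Tokens processed per optimizer step across all GPUs."""
    return batch_size * block_size * grad_accum_steps * n_gpus


def linearly_scaled_lr(base_lr, base_tokens, new_tokens):
    """Scale the learning rate in proportion to the batch size."""
    return base_lr * new_tokens / base_tokens


# Reference setup: 12 sequences * 1024 tokens * 5 accumulation steps * 8 GPUs
reference = tokens_per_iter(12, 1024, 5, n_gpus=8)
# This notebook: 12 sequences * 1024 tokens * 4 accumulation steps * 1 GPU
single_gpu = tokens_per_iter(12, 1024, 4)
strict_lr = linearly_scaled_lr(6e-4, reference, single_gpu)

print(reference)               # 491520
print(single_gpu)              # 49152
print(reference / single_gpu)  # 10.0
print(f"{strict_lr:.1e}")      # 6.0e-05
```

Strict linear scaling gives $0.00006$, so the $0.0001$ used here is slightly more aggressive than a proportional reduction.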

The default GPT-2 model has 124M parameters, which is too much for the P100, so we are using $H = L = 6$ and $d_{embd} = 384$ (`n_head`, `n_layer`, and `n_embd` in the command below) for a model with roughly 30M parameters, about a quarter of the size.
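
A rough parameter count makes the size comparison concrete. This sketch uses the common approximation of $12 \cdot L \cdot d^2$ weights for the transformer blocks plus $V \cdot d$ for the token embedding, which is shared with the output layer. It ignores biases, layer norms, and position embeddings, and it assumes a vocabulary of 50304 entries (the GPT-2 vocabulary padded up to a multiple of 64). The function name is hypothetical.

```python
def approx_params(n_layer, n_embd, vocab_size=50304):
    """Approximate GPT parameter count: block weights plus tied token embedding."""
    block_weights = 12 * n_layer * n_embd ** 2  # attention (4 d^2) + MLP (8 d^2) per layer
    embedding = vocab_size * n_embd             # shared with the output projection
    return block_weights + embedding


configs = {
    "GPT-2 default (n_layer=12, n_embd=768)": (12, 768),
    "command in this notebook (n_layer=6, n_embd=384)": (6, 384),
    "8-layer, 512-wide variant (n_layer=8, n_embd=512)": (8, 512),
}
for name, (n_layer, n_embd) in configs.items():
    print(f"{name}: {approx_params(n_layer, n_embd) / 1e6:.1f}M")
# GPT-2 default (n_layer=12, n_embd=768): 123.6M
# command in this notebook (n_layer=6, n_embd=384): 29.9M
# 8-layer, 512-wide variant (n_layer=8, n_embd=512): 50.9M
```

At these sizes the token embedding accounts for a large share of the total, so the parameter count shrinks more slowly than the per-layer compute does.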

In [7]:
%%bash
cd nanoGPT && python3 train.py config/train_gpt2.py --wandb_log=False \
  --max_iters=1000 \
  --log_interval=1 \
  --eval_interval=200 \
  --eval_iters=20 \
  --learning_rate="0.0001" \
  --gradient_accumulation_steps=4 \
  --batch_size=12 \
  --n_layer=6 \
  --n_head=6 \
  --n_embd=384 \
  --compile=False \
  --out_dir=out

Overriding config with config/train_gpt2.py:
# config for training GPT-2 (124M) down to very nice loss of ~2.85 on 1 node of 8X A100 40GB
# launch as the following (e.g. in a screen session) and wait ~5 days:
# $ torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py

wandb_log = True
wandb_project = 'owt'
wandb_run_name='gpt2-124M'

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8

# this makes total number of tokens be 300B
max_iters = 600000
lr_decay_iters = 600000

# eval stuff
eval_interval = 1000
eval_iters = 200
log_interval = 10

# weight decay
weight_decay = 1e-1

Overriding: wandb_log = False
Overriding: max_iters = 1000
Overriding: log_interval = 1
Overriding: eval_interval = 200
Overriding: eval_iters = 20
Overriding: learning_rate = 0.0001
Overriding: gradient_accumulation_steps = 4
Overriding: batch_size = 12
Overriding: n_laye

In [8]:
# Move the model outputs to the working directory and clean up the git repo
!mv /kaggle/working/nanoGPT/out /kaggle/working/model
!rm -rf /kaggle/working/nanoGPT