The primary objective of this project is to demonstrate how to develop LLMs that can run and scale on PyTorch/CUDA and PyTorch/XLA without modifications to the model. The secondary objectives of this project are as follows:
- Provide an easy-to-understand LLM implementation for learning the GPT2 model architecture in detail
- Demonstrate the use of FSDP and activation checkpointing to scale LLM training by wrapping the appropriate transformer layer class (see the sketch after this list)
- Provide an easy-to-understand FSDP training loop that uses TensorBoard for training metrics
- Demonstrate how to debug LLM code in Visual Studio Code for CUDA/XLA using the Docker extension
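As a rough illustration of the FSDP objective above, the snippet below sketches how a GPT2-style model can be sharded at the transformer-block level, with activation checkpointing applied to the same blocks. This is a minimal sketch under stated assumptions, not the repo's exact code; `block_cls` is a placeholder for whatever transformer layer class the model defines.

```python
# Minimal sketch, assuming a GPT2-style model whose transformer layer class
# is passed in as block_cls (a placeholder, not the repo's actual name).
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

def shard_model(model, block_cls):
    # Shard parameters at the granularity of each transformer block
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={block_cls}
    )
    model = FSDP(model, auto_wrap_policy=wrap_policy)
    # Recompute each block's activations during backward to save memory
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda module: isinstance(module, block_cls),
    )
    return model
```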
To keep things simple, we started with the karpathy/nanoGPT repository, and simplified and adapted the model and the training loop to achieve the objectives outlined above.
For pre-training, the GPT2 model can be initialized from scratch, or loaded from the following pre-trained Hugging Face models: `gpt2`, `gpt2-medium`, `gpt2-large`, and `gpt2-xl`. To load from a pre-trained Hugging Face model, use the command line argument `--hf_model=model-name`. Command line arguments to the training script are automatically parsed and mapped to the training configuration.
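For example, `python3 train_fsdp.py --hf_model=gpt2-medium` would start from the `gpt2-medium` weights. The sketch below shows one common way such automatic argument-to-configuration mapping can work; it is a hypothetical illustration, and the fields other than `hf_model` are made up, not the repo's actual configuration.

```python
# Hypothetical sketch of mapping "--key=value" arguments onto a config
# object; only hf_model is taken from the README, the rest is illustrative.
import sys
from dataclasses import dataclass

@dataclass
class TrainConfig:
    hf_model: str = ""           # e.g. "gpt2-medium"; empty means from scratch
    batch_size: int = 8          # illustrative field
    learning_rate: float = 6e-4  # illustrative field

def apply_overrides(config: TrainConfig) -> TrainConfig:
    for arg in sys.argv[1:]:
        key, _, value = arg.lstrip("-").partition("=")
        if hasattr(config, key):
            field_type = type(getattr(config, key))
            setattr(config, key, field_type(value))  # cast to the field's type
    return config
```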
Below we provide a step-by-step tutorial on how to debug and pre-train the model using an AWS Deep Learning Desktop. Any other equivalent or better NVIDIA GPU machine can be used to walk through the tutorial as well.
Follow the instructions at this repository to launch a `g5.12xlarge` desktop with 4 NVIDIA A10G Tensor Core GPUs and 24 GB memory per GPU. Specify an `EbsVolumeSize` parameter of at least 500 GB when launching the desktop.
Clone this repository under the home directory on the launched desktop, and `cd` into the cloned repository. Activate the `pytorch` conda environment:
conda activate pytorch
Install `tiktoken==0.5.2` and the remaining requirements in your `pytorch` conda environment:
pip3 install tiktoken==0.5.2
pip3 install -r requirements.txt
Convert the Hugging Face `openwebtext` dataset into a dataset we can use with GPT2 by running:
python3 gpt_data_converter.py
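Conceptually, the conversion tokenizes the raw text with tiktoken's GPT2 encoding and writes the token ids into flat binary files that the training loop can memory-map. The sketch below is modeled on nanoGPT's data prep, which this repo derives from; the file name and exact steps in `gpt_data_converter.py` may differ.

```python
# Rough sketch of the conversion, assuming the nanoGPT-style approach;
# not necessarily what gpt_data_converter.py does step for step.
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
dataset = load_dataset("openwebtext", split="train")

def tokenize(example):
    ids = enc.encode_ordinary(example["text"])
    ids.append(enc.eot_token)  # mark the document boundary
    return {"ids": ids, "len": len(ids)}

tokenized = dataset.map(tokenize, remove_columns=["text"])

# GPT2 token ids fit in uint16 (vocab size 50257 < 65536)
total = int(np.sum(tokenized["len"], dtype=np.uint64))
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(total,))
idx = 0
for sample in tokenized:
    arr[idx : idx + sample["len"]] = sample["ids"]
    idx += sample["len"]
arr.flush()
```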
Launch the pre-installed Visual Studio Code and open this repository in Code. Install the Python and Docker extensions for Visual Studio Code. Select the `pytorch` conda environment Python interpreter (use `Shift + CMD + P` > `Python: Select Interpreter`).
There are three options for debugging the current file `train_fsdp.py` in Code:

- To debug with CUDA running on the desktop, use the `Python: CurrentFile` debugger configuration in Code
- To debug in a docker container running CUDA, use the `Docker: Python Debug CUDA` debugger configuration
- To debug in a docker container running XLA on top of CUDA, use the `Docker: Python Debug XLA` debugger configuration
First, build the CUDA Docker image using the following command:
docker buildx build -t gp2-fsdp-cuda:latest -f Dockerfile.cuda .
Next, start the docker container for running distributed training on CUDA:
./docker-cuda.sh
Next, `exec` into the running Docker container using the container's short id:
docker exec -it CONTAINER_ID /bin/bash
Next, launch distributed training within the docker container:
./run_cuda.sh 1>run_cuda.out 2>&1 &
Note: In this case we are mounting the cloned repository on the `/app` directory of the Docker container.
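Inside the container, `run_cuda.sh` launches one training process per GPU. A minimal sketch of the per-process CUDA setup such a launch implies is shown below, assuming a `torchrun`-style launcher that sets the usual `RANK`/`LOCAL_RANK`/`WORLD_SIZE` environment variables; the actual contents of the script may differ.

```python
# Minimal sketch of per-process CUDA setup under a torchrun-style launcher;
# an assumption about the launch, not the repo's verbatim code.
import os

import torch
import torch.distributed as dist

def init_cuda_distributed() -> torch.device:
    dist.init_process_group(backend="nccl")     # NCCL for multi-GPU CUDA
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)           # pin this process to one GPU
    return torch.device("cuda", local_rank)
```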
First, build the XLA-on-CUDA Docker image using the following command:
docker buildx build -t gp2-fsdp-xla-cuda:latest -f Dockerfile.xla.cuda .
Next, start the docker container for running distributed training on XLA on top of CUDA:
./docker-xla-cuda.sh
Next, `exec` into the running Docker container using the container's short id:
docker exec -it CONTAINER_ID /bin/bash
Next, launch distributed training within the docker container:
./run_xla_cuda.sh 1>run_xla_cuda.out 2>&1 &
Note: In this case we are mounting the cloned repository on the `/app` directory of the Docker container.
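On the XLA path, the model code stays the same and only the device handling differs, which is what lets the same script scale on both backends. The sketch below illustrates the standard PyTorch/XLA multiprocessing pattern; how `train_fsdp.py` actually selects between backends is an assumption on our part.

```python
# Sketch of the standard torch_xla multiprocessing pattern; an illustration
# of the concept, not necessarily how train_fsdp.py structures its entry point.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()  # one XLA device per spawned process
    # Build the model, optimizer, and training loop exactly as on CUDA,
    # moving tensors to `device`; the model itself needs no changes.
    ...

if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```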