The primary objective of this project is to demonstrate how to develop LLMs that can run and scale on PyTorch/CUDA and PyTorch/XLA without modifications to the model. The secondary objectives of this project are as follows:
- Provide an easy-to-understand LLM implementation for learning the GPT2 model architecture in detail
- Demonstrate the use of FSDP and activation checkpointing to scale LLM training by wrapping the appropriate transformer layer class (see the sketch after this list)
- Provide an easy-to-understand FSDP training loop that uses TensorBoard for training metrics
- Demonstrate how to debug LLM code in Visual Studio Code for CUDA/XLA using the Docker extension
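As a rough illustration of the FSDP objective above, the snippet below sketches how a GPT2-style model can be sharded at the transformer-block level, with activation checkpointing applied to the same blocks. This is a minimal sketch under stated assumptions, not the repo's exact code; `block_cls` is a placeholder for whatever transformer layer class the model defines.

```python
# Minimal sketch, assuming a GPT2-style model whose transformer layer class
# is passed in as block_cls (a placeholder, not the repo's actual name).
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

def shard_model(model, block_cls):
    # Shard parameters at the granularity of each transformer block
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={block_cls}
    )
    model = FSDP(model, auto_wrap_policy=wrap_policy)
    # Recompute each block's activations during backward to save memory
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda module: isinstance(module, block_cls),
    )
    return model
```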
To keep things simple, we started with the karpathy/nanoGPT repository, and simplified and adapted the model and the training loop to achieve the objectives outlined above.
For pre-training, the GPT2 model can be initialized from scratch, or loaded from the following pre-trained Hugging Face models: `gpt2`, `gpt2-medium`, `gpt2-large`, and `gpt2-xl`. To load from a pre-trained Hugging Face model, use the command line argument `--hf_model=model-name`. Command line arguments to the training script are automatically parsed and mapped to the training configuration.
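For example, `python3 train_fsdp.py --hf_model=gpt2-medium` would start from the `gpt2-medium` weights. The sketch below shows one common way such automatic argument-to-configuration mapping can work; it is a hypothetical illustration, and the fields other than `hf_model` are made up, not the repo's actual configuration.

```python
# Hypothetical sketch of mapping "--key=value" arguments onto a config
# object; only hf_model is taken from the README, the rest is illustrative.
import sys
from dataclasses import dataclass

@dataclass
class TrainConfig:
    hf_model: str = ""           # e.g. "gpt2-medium"; empty means from scratch
    batch_size: int = 8          # illustrative field
    learning_rate: float = 6e-4  # illustrative field

def apply_overrides(config: TrainConfig) -> TrainConfig:
    for arg in sys.argv[1:]:
        key, _, value = arg.lstrip("-").partition("=")
        if hasattr(config, key):
            field_type = type(getattr(config, key))
            setattr(config, key, field_type(value))  # cast to the field's type
    return config
```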
Below we provide a step-by-step tutorial on how to debug and pre-train the model using an AWS Deep Learning Desktop. Any other equivalent or better NVIDIA GPU machine can be used to walk through the tutorial as well.
Follow the instructions at this repository to launch a `g5.12xlarge` desktop with 4 NVIDIA A10G Tensor Core GPUs and 24 GB memory per GPU. Specify an `EbsVolumeSize` parameter of at least 500 GB when launching the desktop.
Clone this repository under the home directory on the launched desktop, and `cd` into the cloned repository. Activate the `pytorch` conda environment:
conda activate pytorch
Install `tiktoken==0.5.2` and the remaining requirements in your `pytorch` conda environment:
pip3 install tiktoken==0.5.2
pip3 install -r requirements.txt
Convert the Hugging Face `openwebtext` dataset into a dataset we can use with GPT2 by running:
python3 gpt_data_converter.py
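Conceptually, the conversion tokenizes the raw text with tiktoken's GPT2 encoding and writes the token ids into flat binary files that the training loop can memory-map. The sketch below is modeled on nanoGPT's data prep, which this repo derives from; the file name and exact steps in `gpt_data_converter.py` may differ.

```python
# Rough sketch of the conversion, assuming the nanoGPT-style approach;
# not necessarily what gpt_data_converter.py does step for step.
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
dataset = load_dataset("openwebtext", split="train")

def tokenize(example):
    ids = enc.encode_ordinary(example["text"])
    ids.append(enc.eot_token)  # mark the document boundary
    return {"ids": ids, "len": len(ids)}

tokenized = dataset.map(tokenize, remove_columns=["text"])

# GPT2 token ids fit in uint16 (vocab size 50257 < 65536)
total = int(np.sum(tokenized["len"], dtype=np.uint64))
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(total,))
idx = 0
for sample in tokenized:
    arr[idx : idx + sample["len"]] = sample["ids"]
    idx += sample["len"]
arr.flush()
```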
Launch the pre-installed Visual Studio Code and open this repository in Code. Install the Python and Docker extensions for Visual Studio Code. Select the `pytorch` conda environment Python interpreter (use `Shift + CMD + P` > `Python: Select Interpreter`).
There are three options for debugging the current file `train_fsdp.py` in Code:

- To debug with CUDA running on the desktop, use the `Python: CurrentFile` debugger configuration in Code
- To debug in a docker container running CUDA, use the `Docker: Python Debug CUDA` debugger configuration
- To debug in a docker container running XLA on top of CUDA, use the `Docker: Python Debug XLA` debugger configuration
First, build the CUDA Docker image using the following command:
docker buildx build -t gp2-fsdp-cuda:latest -f Dockerfile.cuda .
Next, start the docker container for running distributed training on CUDA:
./docker-cuda.sh
Next, `exec` into the running Docker container using the container's short id:
docker exec -it CONTAINER_ID /bin/bash
Next, launch distributed training within the docker container:
./run_cuda.sh 1>run_cuda.out 2>&1 &
Note: In this case we are mounting the cloned repository on the `/app` directory of the Docker container.
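Inside the container, `run_cuda.sh` launches one training process per GPU. A minimal sketch of the per-process CUDA setup such a launch implies is shown below, assuming a `torchrun`-style launcher that sets the usual `RANK`/`LOCAL_RANK`/`WORLD_SIZE` environment variables; the actual contents of the script may differ.

```python
# Minimal sketch of per-process CUDA setup under a torchrun-style launcher;
# an assumption about the launch, not the repo's verbatim code.
import os

import torch
import torch.distributed as dist

def init_cuda_distributed() -> torch.device:
    dist.init_process_group(backend="nccl")     # NCCL for multi-GPU CUDA
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)           # pin this process to one GPU
    return torch.device("cuda", local_rank)
```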
First, build the XLA-on-CUDA Docker image using the following command:
docker buildx build -t gp2-fsdp-xla-cuda:latest -f Dockerfile.xla.cuda .
Next, start the docker container for running distributed training on XLA on top of CUDA:
./docker-xla-cuda.sh
Next, `exec` into the running Docker container using the container's short id:
docker exec -it CONTAINER_ID /bin/bash
Next, launch distributed training within the docker container:
./run_xla_cuda.sh 1>run_xla_cuda.out 2>&1 &
Note: In this case we are mounting the cloned repository on the `/app` directory of the Docker container.
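On the XLA path, the model code stays the same and only the device handling differs, which is what lets the same script scale on both backends. The sketch below illustrates the standard PyTorch/XLA multiprocessing pattern; how `train_fsdp.py` actually selects between backends is an assumption on our part.

```python
# Sketch of the standard torch_xla multiprocessing pattern; an illustration
# of the concept, not necessarily how train_fsdp.py structures its entry point.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()  # one XLA device per spawned process
    # Build the model, optimizer, and training loop exactly as on CUDA,
    # moving tensors to `device`; the model itself needs no changes.
    ...

if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```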