PyTorch Distributed Training Examples

Practical implementations for mastering distributed LLM training, from the basics to production-scale systems.

📚 Learning Path

Week 1-2: PyTorch DDP

01_ddp_basics - Multi-GPU training fundamentals
02_ddp_advanced - Gradient accumulation, mixed precision, profiling
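
As a rough illustration of the gradient accumulation topic, the sketch below works through the arithmetic in plain Python. It assumes equal-sized micro-batches and the common pattern of scaling each micro-batch gradient by 1/accum_steps. Both function names are hypothetical and are not taken from the example directories.

```python
def effective_batch_size(micro_batch: int, accum_steps: int, world_size: int) -> int:
    """Samples contributing to one optimizer step across all data-parallel ranks."""
    return micro_batch * accum_steps * world_size


def accumulated_mean_grad(per_sample_grads: list[float], micro_batch: int) -> float:
    """Accumulate micro-batch mean gradients, each scaled by 1/accum_steps.

    With equal-sized micro-batches this equals the mean over the full batch,
    which is why the loss is usually divided by accum_steps before backward().
    """
    chunks = [per_sample_grads[i:i + micro_batch]
              for i in range(0, len(per_sample_grads), micro_batch)]
    accum_steps = len(chunks)
    total = 0.0
    for chunk in chunks:
        micro_grad = sum(chunk) / len(chunk)  # mean gradient of one micro-batch
        total += micro_grad / accum_steps     # scaled accumulation
    return total
```

For example, a micro-batch of 8 with 4 accumulation steps on 4 GPUs gives an effective batch of 128 samples per optimizer step.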

Week 3-4: PyTorch FSDP

03_fsdp_basics - Fully sharded data parallel
04_fsdp_advanced - CPU offloading, auto-wrap policies
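
The core idea behind fully sharded data parallel can be sketched without PyTorch: parameters are flattened, padded so they divide evenly, and each rank keeps only its own slice. The helper below is a conceptual illustration under that assumption. `shard_bounds` is a hypothetical name, and the real FSDP implementation differs in detail.

```python
import math


def shard_bounds(numel: int, world_size: int, rank: int) -> tuple[int, int, int]:
    """Return (start, end, padding) for one rank's slice of a flat parameter.

    Every rank owns a shard of ceil(numel / world_size) elements. Ranks whose
    slice runs past the end of the real data hold padding instead.
    """
    shard_size = math.ceil(numel / world_size)
    start = min(rank * shard_size, numel)
    end = min(start + shard_size, numel)
    padding = shard_size - (end - start)
    return start, end, padding
```

With 10 elements on 4 ranks, each shard holds 3 slots, and the last rank holds 1 real element plus 2 padding slots.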

Week 5-6: DeepSpeed

05_deepspeed_zero - ZeRO stage 1/2/3 implementations
06_deepspeed_advanced - Offloading, 3D parallelism
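
To see why the ZeRO stages matter, it can help to estimate per-GPU model-state memory. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state, and it ignores activations and temporary buffers. `zero_model_state_gb` is a hypothetical helper, not a DeepSpeed API.

```python
def zero_model_state_gb(num_params: float, world_size: int, stage: int) -> float:
    """Estimate per-GPU model-state memory in GB (10^9 bytes).

    Stage 1 shards optimizer state, stage 2 also shards gradients,
    stage 3 also shards the parameters themselves.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter (assumed)
    if stage >= 1:
        optim /= world_size
    if stage >= 2:
        grads /= world_size
    if stage >= 3:
        params /= world_size
    return num_params * (params + grads + optim) / 1e9
```

Under these assumptions, a 7.5B-parameter model on 64 GPUs needs about 120 GB per GPU without sharding and about 1.9 GB per GPU at stage 3.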

Week 7-8: Megatron-LM

07_tensor_parallel - Tensor parallelism
08_pipeline_parallel - Pipeline parallelism, 3D parallelism
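
A useful number to keep in mind for pipeline parallelism is the bubble fraction, the share of time stages sit idle while the pipeline fills and drains. The sketch below assumes a simple GPipe-style schedule with equal per-stage compute time and no interleaving. The function name is hypothetical.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of total step time spent idle in a non-interleaved pipeline.

    Each stage waits (num_stages - 1) micro-batch slots out of a total of
    (num_microbatches + num_stages - 1) slots.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)
```

With 4 stages, going from 4 to 32 micro-batches shrinks the bubble from 3/7 to 3/35 of the step, which is why more micro-batches per step usually helps.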

Week 9: NCCL

09_nccl_benchmarks - Collective operations, profiling
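
When reading collective benchmark output, two bandwidth figures usually appear: algorithm bandwidth (message size divided by time) and bus bandwidth (scaled to reflect the traffic each link carries). The sketch below applies the all-reduce correction factor 2(n-1)/n used by the nccl-tests suite. The function name is hypothetical.

```python
def allreduce_bandwidth_gbps(message_bytes: float, seconds: float,
                             num_ranks: int) -> tuple[float, float]:
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for an all-reduce."""
    if seconds <= 0 or num_ranks < 1:
        raise ValueError("seconds must be > 0 and num_ranks >= 1")
    algbw = message_bytes / seconds / 1e9
    busbw = algbw * 2 * (num_ranks - 1) / num_ranks  # all-reduce factor
    return algbw, busbw
```

Bus bandwidth is the figure to compare against hardware link speed, since it is roughly independent of the number of ranks.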

Week 10: Mixed Precision

10_mixed_precision - FP16/BF16 training
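
The practical difference between FP16 and BF16 is range versus precision: FP16 overflows above roughly 65504, while BF16 keeps the FP32 exponent range with fewer mantissa bits. The standard-library sketch below emulates both. It assumes BF16 conversion by truncation, whereas hardware typically rounds to nearest, and the helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; overflow is mapped to infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]
```

A value such as 70000.0 overflows in FP16 but survives in BF16 with reduced precision, and a small value such as 1e-8 flushes to zero in FP16 but not in BF16. This is the usual motivation for loss scaling with FP16.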

Week 11-12: AWS Infrastructure

11_aws_infrastructure - Multi-node training on EC2/SageMaker/Trainium
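
In multi-node runs, each worker needs to know its place in the job. `torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every process it launches. The sketch below reads them with single-process defaults. The `DistEnv` class and `read_dist_env` function are hypothetical helpers, not part of PyTorch.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this node, typically the GPU index
    world_size: int  # total number of processes in the job

    @property
    def is_main(self) -> bool:
        """True on the process that should log and write checkpoints."""
        return self.rank == 0


def read_dist_env(env=None) -> DistEnv:
    """Read launcher-provided variables, falling back to a single process."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

Falling back to rank 0 and world size 1 lets the same script run without a launcher during local debugging.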

🚀 Quick Start

# Setup environment
conda create -n llm-training python=3.10 -y
conda activate llm-training
pip install -r requirements.txt

# Run the first example (launches 4 processes on this node, typically one per GPU)
cd 01_ddp_basics
torchrun --nproc_per_node=4 train_ddp.py

📊 Progress Tracker

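| Week | Topic | Status | Throughput | Memory | Blog Post |
|------|-------|--------|------------|--------|-----------|
| 1-2  | DDP   | 🔄     | -          | -      | Link      |
| 3-4  | FSDP  | ⏳     | -          | -      | -         |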

🎯 Goals

Train models from 124M to 13B+ parameters
Master all parallelism strategies
Achieve >50% model FLOPs utilization (MFU) on production workloads
Build portfolio for AWS AGI Senior Research Engineer role
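
To make the MFU goal measurable, the sketch below uses the common approximation that a dense transformer needs about 6 FLOPs per parameter per token for a forward and backward pass, ignoring attention FLOPs and any recomputation. The function name and the peak-FLOPs argument are assumptions you would fill in from your own model and hardware specifications.

```python
def model_flops_utilization(num_params: float, tokens_per_second: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate MFU as achieved training FLOPs/s over aggregate peak FLOPs/s."""
    if num_gpus < 1 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus must be >= 1 and peak_flops_per_gpu > 0")
    achieved = 6.0 * num_params * tokens_per_second  # approximate FLOPs/s
    return achieved / (num_gpus * peak_flops_per_gpu)
```

For instance, a 1B-parameter model processing 26,000 tokens per second on a single accelerator with a 312 TFLOP/s peak works out to 50% MFU under this approximation.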

