Practical implementations for mastering distributed LLM training, from the basics to production-scale systems.
- 01_ddp_basics - Multi-GPU training fundamentals
- 02_ddp_advanced - Gradient accumulation, mixed precision, profiling
- 03_fsdp_basics - Fully sharded data parallel
- 04_fsdp_advanced - CPU offloading, auto-wrap policies
- 05_deepspeed_zero - ZeRO stage 1/2/3 implementations
- 06_deepspeed_advanced - Offloading, 3D parallelism
- 07_tensor_parallel - Tensor parallelism
- 08_pipeline_parallel - Pipeline parallelism, 3D parallelism
- 09_nccl_benchmarks - Collective operations, profiling
- 10_mixed_precision - FP16/BF16 training
- 11_aws_infrastructure - Multi-node training on EC2/SageMaker/Trainium
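To give a sense of why the ZeRO stages in 05_deepspeed_zero matter, here is a minimal sketch that estimates per-GPU memory for model states only. It assumes mixed-precision Adam with the commonly cited accounting of 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer state. Activations, buffers, and fragmentation are ignored, and the function name `zero_memory_per_gpu_gb` is hypothetical, not part of this repo or of DeepSpeed.

```python
def zero_memory_per_gpu_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Rough per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumption: mixed-precision Adam, i.e. 2 bytes/param (fp16 weights),
    2 bytes/param (fp16 gradients), 12 bytes/param (fp32 master weights,
    momentum, variance). Activations are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0 (plain data parallel), 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 optimizer state

    if stage >= 1:              # stage 1 shards optimizer state
        optim /= num_gpus
    if stage >= 2:              # stage 2 also shards gradients
        grads /= num_gpus
    if stage >= 3:              # stage 3 also shards the weights
        params /= num_gpus

    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Illustrative sizes only: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = zero_memory_per_gpu_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

The estimate is a planning aid, not a measurement. Real runs should be checked against the Memory column in the progress table once it is filled in.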
# Setup environment
conda create -n llm-training python=3.10 -y
conda activate llm-training
pip install -r requirements.txt
# Run first example
cd 01_ddp_basics
torchrun --nproc_per_node=4 train_ddp.py

| Week | Topic | Status | Throughput | Memory | Blog Post |
|---|---|---|---|---|---|
| 1-2 | DDP | π | - | - | Link |
| 3-4 | FSDP | ⏳ | - | - | - |
- Train models from 124M to 13B+ parameters
- Master all parallelism strategies
- Achieve >50% MFU on production workloads
- Build a portfolio for an AWS AGI Senior Research Engineer role
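The MFU target above can be checked with a short calculation. The sketch below assumes the common approximation of about 6 FLOPs per parameter per token for a combined forward and backward pass, which ignores attention FLOPs and so slightly underestimates utilization at long sequence lengths. The function name and the throughput figure in the example are hypothetical; the peak value should come from the hardware vendor's spec sheet for the precision in use.

```python
def model_flops_utilization(
    num_params: float,
    tokens_per_second: float,
    peak_flops_per_second: float,
) -> float:
    """Approximate MFU as achieved model FLOPs over hardware peak FLOPs.

    Assumption: training costs about 6 * num_params FLOPs per token
    (forward plus backward), with attention FLOPs ignored.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    # Hypothetical run: 1.3B parameters, 160k tokens/s across 8 GPUs,
    # each assumed to peak at 312 TFLOP/s in BF16.
    mfu = model_flops_utilization(
        num_params=1.3e9,
        tokens_per_second=160_000,
        peak_flops_per_second=8 * 312e12,
    )
    print(f"MFU: {mfu:.1%}")
```

Here `tokens_per_second` is the aggregate throughput across all devices, so the peak must also be summed across devices, as in the example.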