Skip to content

Vikasht34/PyTorch-Examples

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 

Repository files navigation

PyTorch Distributed Training Examples

Practical implementations for mastering distributed LLM training. From basics to production-scale systems.

πŸ“š Learning Path

Week 1-2: PyTorch DDP

Week 3-4: PyTorch FSDP

Week 5-6: DeepSpeed

Week 7-8: Megatron-LM

Week 9: NCCL

Week 10: Mixed Precision

Week 11-12: AWS Infrastructure

πŸš€ Quick Start

# Setup environment
conda create -n llm-training python=3.10 -y
conda activate llm-training
pip install -r requirements.txt

# Run first example
cd 01_ddp_basics
torchrun --nproc_per_node=4 train_ddp.py

πŸ“Š Progress Tracker

Week Topic Status Throughput Memory Blog Post
1-2 DDP πŸ”„ - - Link
3-4 FSDP ⏳ - - -

🎯 Goals

  • Train models from 124M to 13B+ parameters
  • Master all parallelism strategies
  • Achieve >50% MFU on production workloads
  • Build portfolio for AWS AGI Senior Research Engineer role

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors