aivanou/disttraining
This repository contains examples of distributed training jobs with PyTorch.

The models can be trained on single-GPU instances as well as in production environments such as SLURM clusters.

CharNN

The model is a simple minGPT implementation that supports the following features:

  • DDP training on a single node or multiple nodes
  • Checkpointing
  • Metrics logging with TensorBoard
  • Profiling support
  • Job configuration via Hydra

Running a single process:

pip install -r requirements.txt
python charnn/main.py

Running on multiple GPUs on a single host:

torchrun --nnodes 1 --nproc_per_node 4 \
--rdzv_backend c10d \
--rdzv_endpoint localhost:29500 apps/charnn/main.py
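Under the hood, torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each worker, and the training script wraps its model in DistributedDataParallel. A minimal sketch of that setup, assuming PyTorch is installed (the tiny Linear model and single training step are illustrative, not the repo's actual charnn code):

```python
# Minimal DDP sketch: the names and model here are illustrative assumptions,
# not code from this repository.
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> float:
    # torchrun sets these for each worker; default to a single local
    # process so the sketch also runs standalone.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29501")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(8, 2))  # gradients are synced across ranks
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(4, 8), torch.randn(4, 2)
    loss = F.mse_loss(model(x), y)
    loss.backward()  # gradient all-reduce across ranks happens here
    opt.step()

    dist.destroy_process_group()
    return loss.item()


final_loss = main()
print(f"loss after one step: {final_loss:.4f}")
```

Each of the 4 processes launched by `--nproc_per_node 4` runs this same script; DDP keeps their model replicas in sync by averaging gradients during `backward()`.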

Running with checkpointing and TensorBoard logging:

mkdir -p logs/tb

torchrun --nnodes 1 --nproc_per_node 4 \
--rdzv_backend c10d \
--rdzv_endpoint localhost:29500 apps/charnn/main.py \
+trainer.checkpoint_path=./logs/model.pt \
+trainer.log_dir=./logs/tb
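The checkpointing behind `+trainer.checkpoint_path` could look roughly like the following sketch, assuming PyTorch is installed; the function names are illustrative assumptions, not the repo's actual API:

```python
# Hedged sketch of checkpoint save/resume; names are hypothetical, not
# taken from this repository.
import os

import torch


def save_checkpoint(path: str, model, opt, epoch: int) -> None:
    # Persist both model and optimizer state so training can resume exactly.
    torch.save(
        {"model": model.state_dict(), "opt": opt.state_dict(), "epoch": epoch},
        path,
    )


def load_checkpoint(path: str, model, opt) -> int:
    """Restore state if a checkpoint exists; return the epoch to start from."""
    if not os.path.exists(path):
        return 0  # fresh run
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    return state["epoch"] + 1  # resume at the next epoch
```

With this pattern a preempted or restarted job picks up from the last saved epoch instead of starting over, which matters for multi-node runs where any worker failure restarts the whole job.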

Setting up a SLURM cluster and running the job on SLURM
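On SLURM, one common pattern is to submit an sbatch script that launches torchrun once per node via srun. The script below is a hypothetical sketch: the node counts, GPU counts, and rendezvous port are placeholders, not values from this repository.

```shell
#!/bin/bash
# Hypothetical sbatch script; resource numbers are placeholders.
#SBATCH --job-name=charnn
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4

# Use the first node in the allocation as the c10d rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes "$SLURM_NNODES" \
    --nproc_per_node 4 \
    --rdzv_backend c10d \
    --rdzv_endpoint "$head_node:29500" \
    apps/charnn/main.py
```

Because `--ntasks-per-node=1`, srun starts exactly one torchrun per node, and torchrun then spawns the 4 per-GPU worker processes and handles rendezvous across nodes.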
