## Introduction

Limited compute resources might discourage you from trying distributed training across multiple GPUs, but wrapping a model with DDP (short for DistributedDataParallel) is basically an easy job.

In this repo, I want to share code and insights with all beginners in DDP. I’m not going to include a detailed explanation of how DDP works; instead, I provide the minimum knowledge needed to make a model run on multiple GPUs. Note that I only introduce DDP on one machine with multiple GPUs, which is the most common case. (Otherwise, we should use model parallelism, as described in the official [tutorial](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html).)

1. [Intro - notes about pytorch ddp]()
1. [Data Parallelism](02-Data_Parallelism.ipynb)
1. [Data Parallel v2](02-Data_Parallelism_2.ipynb)
1. [Model Parallel](03-Model_Parallel.ipynb)
1. [Pipeline Parallelism]()
1. [ZeRO](04-ZeRO.ipynb)
1. [Memory]()

### Get familiar with the system architecture

In [1]:
! nvidia-smi topo -m

	GPU0	GPU1	GPU2	GPU3	CPU Affinity
GPU0	 X 	NV1	NV1	NV2	0-39
GPU1	NV1	 X 	NV2	NV1	0-39
GPU2	NV1	NV2	 X 	NV1	0-39
GPU3	NV2	NV1	NV1	 X 	0-39

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
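
To make the matrix easier to work with, here is a minimal sketch that parses `nvidia-smi topo -m` text into a dictionary of GPU pairs. The helper names `parse_topology` and `nvlink_count` are hypothetical, and `SAMPLE` is the matrix from the output above with whitespace normalized. It assumes every row starts with a `GPU<n>` label and that cells are whitespace-separated; other driver versions may print extra columns.

```python
import re

# Matrix copied from the output above (whitespace normalized).
SAMPLE = """
      GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0   X   NV1  NV1  NV2  0-39
GPU1  NV1   X   NV2  NV1  0-39
GPU2  NV1  NV2   X   NV1  0-39
GPU3  NV2  NV1  NV1   X   0-39
"""


def parse_topology(text):
    """Return {(src, dst): link_type} for every GPU pair in the matrix."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header row: keep only the GPU column names (drops "CPU Affinity").
    header = [tok for tok in lines[0].split() if tok.startswith("GPU")]
    links = {}
    for row in lines[1:]:
        cells = row.split()
        src = cells[0]
        if not src.startswith("GPU"):
            continue  # skip legend lines or anything that is not a GPU row
        for dst, link in zip(header, cells[1:]):
            links[(src, dst)] = link
    return links


def nvlink_count(link):
    """Number of bonded NVLinks for an 'NV#' entry, 0 for anything else."""
    match = re.fullmatch(r"NV(\d+)", link)
    return int(match.group(1)) if match else 0


links = parse_topology(SAMPLE)
for (src, dst), link in sorted(links.items()):
    if src < dst:  # print each unordered pair once
        print(f"{src} <-> {dst}: {link} ({nvlink_count(link)} NVLink(s))")
```

On this machine every GPU pair is connected by NVLink, with GPU0/GPU3 and GPU1/GPU2 sharing two bonded links each.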


### Distributed launch

In [2]:
! python -m torch.distributed.launch

and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

usage: launch.py [-h] [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE]
                 [--rdzv_backend RDZV_BACKEND] [--rdzv_endpoint RDZV_ENDPOINT]
                 [--rdzv_id RDZV_ID] [--rdzv_conf RDZV_CONF] [--standalone]
                 [--max_restarts MAX_RESTARTS]
                 [--monitor_interval MONITOR_INTERVAL]
                 [--start_method {spawn,fork,forkserver}] [--role ROLE] [-m]
                 [--no_python] [--run_path] [--log_dir LOG_DIR] [-r REDIRECTS]
                 [-t TEE] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
                 [--master_port MASTER_PORT] [--use_env]
                 training_script ...
launch.py: error: the fol
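
The warning above asks scripts to read the local rank from `os.environ['LOCAL_RANK']` rather than from a `--local_rank` argument. One possible way to support both during a migration is sketched below. `get_local_rank` is a hypothetical helper, and the `argv` and `environ` parameters exist only so the function can be exercised without a real launcher.

```python
import argparse
import os


def get_local_rank(argv=None, environ=None):
    """Prefer LOCAL_RANK from the environment, fall back to --local_rank."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Scripts started by launch.py without --use_env receive --local_rank=<n>.
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores the script's other arguments.
    args, _unknown = parser.parse_known_args(argv)
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    return args.local_rank


# Old style: rank passed as a command-line argument.
print(get_local_rank(argv=["--local_rank=2"], environ={}))  # 2
# New style (torchrun): rank passed through the environment.
print(get_local_rank(argv=[], environ={"LOCAL_RANK": "1"}))  # 1
```

In a real script you would call `get_local_rank()` with no arguments so that it reads `sys.argv` and `os.environ`.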

### Torchrun

In [3]:
! torchrun

usage: torchrun [-h] [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE]
                [--rdzv_backend RDZV_BACKEND] [--rdzv_endpoint RDZV_ENDPOINT]
                [--rdzv_id RDZV_ID] [--rdzv_conf RDZV_CONF] [--standalone]
                [--max_restarts MAX_RESTARTS]
                [--monitor_interval MONITOR_INTERVAL]
                [--start_method {spawn,fork,forkserver}] [--role ROLE] [-m]
                [--no_python] [--run_path] [--log_dir LOG_DIR] [-r REDIRECTS]
                [-t TEE] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
                [--master_port MASTER_PORT]
                training_script ...
torchrun: error: the following arguments are required: training_script, training_script_args
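
As the error says, `torchrun` needs a training script. Below is a minimal sketch of what such a script could look like, not a reference implementation. The `nn.Linear` toy model, the random batches, and the `train` function name are assumptions made for illustration. The `setdefault` calls are also an assumption: they only let the file run as a single process with plain `python`, and are a no-op under `torchrun`, which sets these variables itself.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train(steps=3):
    # torchrun exports these; the defaults allow a single-process dry run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    use_cuda = torch.cuda.is_available()
    # NCCL for GPUs; gloo lets the same sketch run on a CPU-only machine.
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    if use_cuda:
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    model = nn.Linear(10, 1).to(device)  # toy model standing in for a real one
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    loss = None
    for _ in range(steps):
        x = torch.randn(8, 10, device=device)  # random toy batch
        y = torch.randn(8, 1, device=device)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()  # DDP averages gradients across processes here
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    final_loss = train()
    print(f"rank {os.environ['RANK']} final loss: {final_loss:.4f}")
```

Saved as, say, `train.py` (a hypothetical file name), it could be launched on the four GPUs shown earlier with `torchrun --standalone --nproc_per_node=4 train.py`; both flags appear in the usage message above.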
