# Scaling out (multi-node + multi-GPU)

Lightning ships with many built-in options for multi-node, multi-GPU training on a cluster. On an HPC-grade cluster, it would be foolish not to leverage the interconnect capabilities. Lightning makes it easy to launch with SLURM at negligible cost.

### SLURM

SLURM is an open-source, flexible job scheduler used to manage resources in an HPC context. It provides three key functions:
* An exclusive or non-exclusive resource allocation system for a specific time period
* A tool for executing and monitoring jobs on a set of allocated resources
* A scheduling system that manages contention for resources

### Quick Guide to Resource Allocation and Job Submission

**This part is just a brief introduction to SLURM, not intended as a comprehensive review.**

A few commands:

* List visible cluster resources with `sinfo <options>`
* Allocate resources with `salloc <resources_type_and_options>`
* Run a command directly on allocated resources with `srun <command>`
* Run a set of commands with `sbatch <script>`

Running a command results in SLURM building the execution environment, including setting environment variables and network devices so that the allocated nodes can communicate over the interconnect.
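The variables that SLURM sets can be inspected from inside a task. The sketch below assumes the standard per-task variables exported by `srun` (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NODEID`, `SLURM_NTASKS`). The `SlurmTaskInfo` class and the `read_slurm_task_info` helper are hypothetical names introduced here for illustration; they are not part of SLURM or Lightning.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SlurmTaskInfo:
    """Position of the current task within the SLURM job (illustrative helper)."""
    global_rank: int  # SLURM_PROCID: rank of the task across all nodes
    local_rank: int   # SLURM_LOCALID: rank of the task on its own node
    node_rank: int    # SLURM_NODEID: index of the node within the allocation
    world_size: int   # SLURM_NTASKS: total number of tasks in the job


def read_slurm_task_info(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[SlurmTaskInfo]:
    """Return the task layout set by srun, or None when not running under SLURM."""
    env = os.environ if env is None else env
    required = ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID", "SLURM_NTASKS")
    if any(name not in env for name in required):
        return None
    return SlurmTaskInfo(
        global_rank=int(env["SLURM_PROCID"]),
        local_rank=int(env["SLURM_LOCALID"]),
        node_rank=int(env["SLURM_NODEID"]),
        world_size=int(env["SLURM_NTASKS"]),
    )
```

Calling `read_slurm_task_info()` without arguments reads the real environment, so it returns `None` on a workstation and a populated object inside a task started by `srun`.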

### Scale-out with SLURM

The `trainout.py` Python script is very similar to the regular scale-up method developed in `trainup_gpu.py`.

In [3]:
!cat trainout.py

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

from kosmoss.parallel.data import FlattenedDataModule
from kosmoss.parallel.models import LitMLP

# This file launches a training run with srun. To perform a run with a
# SlurmCluster object instead, follow the guide at
# https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster.html#building-slurm-scripts.
# We are not much interested in the grid search promoted by that guide, however,
# since we will be using the Ray Tune framework in the last part.

def main(config):
    
    datamodule = FlattenedDataModule(**config["data"])
    model = LitMLP(**config["models"])
    
    trainer = Trainer(
        gpus=8, 
        num_nodes=4, 
        strategy="ddp",
        
        # By default, Lightning requeues the SLURM job when it reaches its wall time.
        # To deactivate this behavior, configure the SLURM environment with auto_requeue=False.
        plugins=[SLURMEnvironment(auto_requeue=False)]
    )
    trainer.fit(model, datamodule=datamodule)

In our case, the resource allocation is configured within the sbatch script `submit.sbatch` itself: lines 3 to 8 serve this purpose.

In [4]:
!cat submit.sbatch

#!/bin/bash -l

#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0
#SBATCH --time=0-00:30:00
#SBATCH -p dgx_a100
#SBATCH --signal=SIGUSR1@90

# activate conda env
source activate $1

# debugging flags (optional)
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo

# might need the latest CUDA
# module load NCCL/2.4.7-1-cuda.10.0

# select the torch distributed backend. 'nccl' or 'gloo'
export PL_TORCH_DISTRIBUTED_BACKEND='nccl'

# run script from above
srun python trainout.py

In detail:
* `#SBATCH --nodes=2` requests 2 nodes (this should match `num_nodes` in the `Trainer`)
* `#SBATCH --gres=gpu:2` requests 2 GPUs per node
* `#SBATCH --ntasks-per-node=8` requests 8 tasks per node (this should match the number of GPUs per node given to the `Trainer`)
* `#SBATCH --mem=0` requests all the memory available on the node
* `#SBATCH --time=0-00:30:00` sets the limit on total run time to 30 minutes
* `#SBATCH -p dgx_a100` selects only nodes from the `dgx_a100` partition
* `#SBATCH --signal=SIGUSR1@90` sends SIGUSR1 to the job 90 seconds before its time limit, so that it can save the model weights to disk. This is not automatic and has to be implemented in a handler for the signal.
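Reacting to the `SIGUSR1` warning can be sketched with Python's standard `signal` module. This is a minimal illustration rather than Lightning's own mechanism: `WallTimeGuard`, `train`, `send_test_signal` and the `save_checkpoint` callback are hypothetical names, and the loop body is a placeholder. The handler only records that the signal arrived, and the loop checkpoints at the next safe point.

```python
import os
import signal


class WallTimeGuard:
    """Remember that SLURM warned us the job is about to reach its time limit."""

    def __init__(self, signum=signal.SIGUSR1):
        self.triggered = False
        # Handlers can only be installed from the main thread of the process.
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler tiny: only record the event, and let the
        # training loop do the actual checkpointing.
        self.triggered = True


def train(num_steps, guard, save_checkpoint):
    """Toy loop: checkpoint and stop as soon as the warning has been received."""
    for step in range(num_steps):
        # ... one optimisation step would go here ...
        if guard.triggered:
            save_checkpoint(step)
            return step
    return num_steps


def send_test_signal():
    """Send SIGUSR1 to the current process, to try the handler without SLURM."""
    os.kill(os.getpid(), signal.SIGUSR1)
```

Deferring the save to the loop avoids writing a checkpoint in the middle of an optimisation step. The 90-second margin in the directive has to be long enough for `save_checkpoint` to finish.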

Finally, you can load an environment with specific packages using the command `module load <library>`, and launch the training with `srun python trainout.py`.
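Before launching, it can be worth checking that the `#SBATCH` directives agree with the arguments passed to the `Trainer`. The sketch below is one way to do that under simple assumptions: it only reads long-form `--key=value` directives, and it treats the last field of a `gpu` gres value as the per-node count. `parse_sbatch_directives` and `check_layout` are hypothetical helpers, not part of SLURM or Lightning.

```python
import re


def parse_sbatch_directives(script_text):
    """Collect long-form `#SBATCH --key=value` directives into a dict."""
    directives = {}
    for line in script_text.splitlines():
        match = re.match(r"^#SBATCH\s+--([\w-]+)=(\S+)", line.strip())
        if match:
            directives[match.group(1)] = match.group(2)
    return directives


def check_layout(directives, num_nodes, gpus_per_node):
    """Return a list of mismatches between the script and the Trainer arguments."""
    problems = []

    nodes = directives.get("nodes")
    if nodes is not None and int(nodes) != num_nodes:
        problems.append(f"--nodes={nodes} but num_nodes={num_nodes}")

    tasks = directives.get("ntasks-per-node")
    if tasks is not None and int(tasks) != gpus_per_node:
        problems.append(f"--ntasks-per-node={tasks} but gpus={gpus_per_node}")

    gres = directives.get("gres")
    if gres is not None and gres.startswith("gpu"):
        # "gpu:2" and "gpu:a100:2" both end with the count; bare "gpu" means 1.
        last_field = gres.split(":")[-1]
        count = int(last_field) if last_field.isdigit() else 1
        if count != gpus_per_node:
            problems.append(f"--gres={gres} but gpus={gpus_per_node}")

    return problems
```

A typical use is to read `submit.sbatch` with `open(...).read()`, pass the result through `parse_sbatch_directives`, and call `check_layout` with the same `num_nodes` and `gpus` values as in `trainout.py`. An empty list means the two files agree.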

Run the `sbatch submit.sbatch <env_name>` command to execute the script on the cluster, where `<env_name>` is the conda environment that the script activates through `$1`.
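The submission can also be driven from Python, for instance from a notebook cell. The sketch below assumes that `sbatch` is on the `PATH` and that it prints its usual `Submitted batch job <id>` message. `parse_job_id` and `submit` are hypothetical helpers, and `my_env` is a placeholder for the conda environment that `submit.sbatch` activates through `$1`.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout):
    """Extract the job id from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script="submit.sbatch", conda_env="my_env"):
    """Run `sbatch <script> <conda_env>` and return the SLURM job id."""
    result = subprocess.run(
        ["sbatch", script, conda_env],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if sbatch rejects the job
    )
    return parse_job_id(result.stdout)
```

The returned job id can then be passed to commands such as `squeue -j <id>` or `scancel <id>` to monitor or cancel the run.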