# Scaling out (multi-node + multi-GPU)

Lightning ships with many built-in options for multi-node, multi-GPU training on a cluster. On an HPC-grade cluster, it would be foolish not to take advantage of the interconnect. Lightning makes it easy to launch training with SLURM at negligible cost.

### SLURM

SLURM is an open-source, flexible job scheduler used to manage resources in an HPC context. It has three key functions:
* Exclusive or non-exclusive resource allocation system for a specific time period
* A tool for executing and monitoring jobs on a set of allocated resources
* A scheduling system that manages contention for resources

### Quick Guide to Resource Allocation and Job Submission

**This part is only a brief introduction to SLURM; it is not intended as a comprehensive review.**

A few commands:

* List visible cluster resources with `sinfo <options>`
* Allocate resources with `salloc <resources_type_and_options>`
* Run a command directly on allocated resources with `srun <command>`
* Run a batch of commands with `sbatch <script>`
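
As an illustration, the four commands in this list could be assembled from Python as follows. This is only a sketch: the resource values (4 nodes, 8 GPUs per node, one hour) and the script name `job.sh` are hypothetical, and the flags shown (`--Node`, `--long`, `--nodes`, `--gres`, `--time`) are standard SLURM options that a given cluster may restrict through its own configuration.

```python
import shlex

# Hypothetical resource request: 4 nodes, 8 GPUs per node, for one hour.
NODES, GPUS_PER_NODE, TIME_LIMIT = 4, 8, "01:00:00"


def slurm_commands(script="job.sh"):
    """Return one example command line per SLURM command discussed here."""
    resources = [
        f"--nodes={NODES}",
        f"--gres=gpu:{GPUS_PER_NODE}",
        f"--time={TIME_LIMIT}",
    ]
    return {
        "sinfo": shlex.join(["sinfo", "--Node", "--long"]),
        "salloc": shlex.join(["salloc", *resources]),
        "srun": shlex.join(["srun", "hostname"]),
        "sbatch": shlex.join(["sbatch", script]),
    }


# Only print the command lines; nothing is submitted to a scheduler.
for name, command in slurm_commands().items():
    print(f"{name:>6}: {command}")
```

The strings are printed rather than executed, so the sketch runs on a machine without SLURM installed. On a real cluster, they would typically be typed in a shell or passed to `subprocess.run`.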

Running a command causes SLURM to build the execution environment, which includes setting environment variables and configuring network devices so that the allocated nodes can communicate over the interconnect.
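
To make the environment variables more concrete, here is a minimal sketch of how a task might inspect them. The variable names (`SLURM_NODEID`, `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`, `SLURM_JOB_NUM_NODES`) are standard SLURM ones; the helper `slurm_context` and its fallback defaults are hypothetical and not part of SLURM or Lightning.

```python
import os


def slurm_context(env=os.environ):
    """Collect the SLURM variables describing where the current task runs.

    `env` is any mapping of variable names to strings; it defaults to the
    process environment. Missing variables fall back to single-process values.
    """
    def as_int(name, default):
        return int(env.get(name, default))

    return {
        "node_id": as_int("SLURM_NODEID", 0),         # index of the node in the job
        "global_rank": as_int("SLURM_PROCID", 0),     # task index across all nodes
        "local_rank": as_int("SLURM_LOCALID", 0),     # task index on this node
        "world_size": as_int("SLURM_NTASKS", 1),      # total number of tasks
        "num_nodes": as_int("SLURM_JOB_NUM_NODES", 1),
    }


# Outside of a SLURM job, this prints the single-process defaults.
print(slurm_context())
```

Lightning reads this kind of information on its own when launched through `srun`, so a helper like this is mainly useful for debugging or logging.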

### Scale-out with SLURM

The `trainout.py` Python script is very similar to the regular scale-up method developed in `trainup_gpu.py`.

In [1]:
!cat trainout.py

import argparse
import os.path as osp
from pytorch_lightning import Trainer

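# Assumption: FlattenedDataModule lives in kosmoss.parallel.data; the original listing omits this import.
from kosmoss.parallel.data import FlattenedDataModule
from kosmoss.parallel.models import LitMLP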

# This file launches a training run with srun. To perform a run with a
# SlurmCluster object instead, follow the guide at
# https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster.html#building-slurm-scripts.
# We are not much interested in the grid search promoted by that guide,
# however, since we will be using the Ray Tune framework in the last part.

def main(config):
    
    datamodule = FlattenedDataModule(**config["data"])
    model = LitMLP(**config["models"])
    
    # 8 GPUs on each of the 4 nodes, with one DDP process per GPU (32 in total).
    trainer = Trainer(gpus=8, num_nodes=4, strategy="ddp")
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    
    config = {
        "data": {
            "batch_size": 256,
            "num_workers": 16,
        },
        "models": {
            "in_channels": 20,
            "hidden_channels": 100,
            "out_channels": 4,
            "lr": 1e-4
        }
    }

    main(config)


In our case, the resource allocation is configured within the `sbatch` script.

echo ${SLURM_NODEID} && cat /etc/docker/daemon.json
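
The `sbatch` script itself is not reproduced here. As a rough sketch of what such a script might contain, the snippet below renders one whose resource request mirrors `Trainer(gpus=8, num_nodes=4)`. Everything in it is an assumption made for illustration: the helper `render_sbatch`, the one-hour time limit, and the choice of `#SBATCH` directives. The actual script used on this cluster may differ. Setting `--ntasks-per-node` equal to the number of GPUs per node follows the usual convention of one task per GPU with DDP.

```python
def render_sbatch(nodes=4, gpus_per_node=8, script="trainout.py", time_limit="01:00:00"):
    """Render a minimal, hypothetical sbatch script as a string."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",                    # should match Trainer(num_nodes=...)
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one task per GPU
        f"#SBATCH --gres=gpu:{gpus_per_node}",         # should match Trainer(gpus=...)
        f"#SBATCH --time={time_limit}",                # assumed wall-clock limit
        "",
        f"srun python {script}",
    ]
    return "\n".join(lines) + "\n"


# Print the script; writing it to a file and submitting it is left to the user.
print(render_sbatch())
```

Saving the printed text to a file and passing that file to `sbatch` would submit the job, with `srun` starting one Python process per requested task.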