# Example: Single Node Multi-GPU training with `quix` in Jupyter

In this example, we look at how `quix` can be used for single-node training in a Jupyter Notebook.
Note that this is not the most efficient way to train, but it can be useful for light training on a 
node where multiple users have direct access to the GPUs, such as `samsida.hpc.uio.no`. As such, this is an *illustrative
example*, and in general *not the most efficient methodology for training with `quix`*.

## Step 1: Setting magics and environment variables
To start off, we set up some standard Jupyter magic commands that are often useful. They add little to 
this particular notebook, but it is good practice to keep these magic commands in the first cell that is
run. We load the `autoreload` extension, and set `matplotlib inline` to allow simple plotting in the notebook.

In addition, we will set some specific environment variables for running in a notebook.
We use the magic `%env` for this purpose, and we set
- `CUDA_VISIBLE_DEVICES`: This tells PyTorch to use only a subset of the available devices on the node, since we are sharing it with others. In this case, GPUs 0, 1, 2 and 3 are available for training.
- `MASTER_ADDR`: The master address for the distributed training, used to communicate between processes on the node with DDP.
- `MASTER_PORT`: The port on the master address that the processes use to connect to each other for distributed training.

In a `slurm` setting, these environment variables can be inferred, but in this case, we explicitly set them to handle the distributed training on the node.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%env CUDA_VISIBLE_DEVICES=0,1,2,3
%env MASTER_ADDR=localhost
%env MASTER_PORT=29501

env: CUDA_VISIBLE_DEVICES=0,1,2,3
env: MASTER_ADDR=localhost
env: MASTER_PORT=29501
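
Outside a notebook, where the `%env` magic is not available, the same variables can be set through `os.environ`. The following is a minimal sketch of that; the helper name `set_ddp_env` is hypothetical and not part of `quix`. Keep in mind that `CUDA_VISIBLE_DEVICES` only has an effect if it is set before CUDA is initialized in the process.

```python
import os


def set_ddp_env(devices, addr="localhost", port=29501):
    """Set the three variables from the cell above via os.environ.

    Hypothetical helper for illustration; not part of quix.
    """
    # Must be set before CUDA is initialized; later changes have no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    os.environ["MASTER_ADDR"] = addr
    # Environment values are always strings.
    os.environ["MASTER_PORT"] = str(port)


set_ddp_env([0, 1, 2, 3])
for key in ("CUDA_VISIBLE_DEVICES", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{key}={os.environ[key]}")
```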


## Step 2: Launching a run using `single_node_launcher`

The single node launcher expects the environment variables we configured using `%env` in the previous cell.
To launch a run on the node, we need the `quix.run.Runner` class and the `quix.run.single_node_launcher` function.
In this example notebook, `quix` has not been installed using `pip`, so we include a fallback
that loads the module by adding the parent directory with `sys.path.append`. 
You can safely ignore the fallback if you have installed `quix` using `pip`.

We launch our run on the node by calling `single_node_launcher`. It expects a `Runner` class, which handles
config parsing and similar setup, along with the run configuration passed as keyword arguments. 
The `single_node_launcher` infers which GPUs to use from `CUDA_VISIBLE_DEVICES`, and launches the run
using `torch.multiprocessing.spawn`. 
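
To make the launch mechanism more concrete, here is a rough sketch of how a launcher of this kind could work. It is not the actual `quix` implementation: `visible_devices`, `worker` and `launch` are hypothetical names, and the body of the worker is left as a placeholder. The PyTorch imports are done lazily inside the functions, so the parsing helper can be tried without a GPU.

```python
import os


def visible_devices():
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES as a list of strings."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d.strip() for d in raw.split(",") if d.strip()]


def worker(rank, world_size):
    """Entry point for one process; spawn passes the process index first."""
    import torch
    import torch.distributed as dist

    # With the default env:// rendezvous, MASTER_ADDR and MASTER_PORT
    # are read from the environment.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    # Visible devices are renumbered from 0 inside the process,
    # so the rank doubles as the local device index.
    torch.cuda.set_device(rank)
    try:
        pass  # placeholder: build the model, wrap it in DDP, train
    finally:
        dist.destroy_process_group()


def launch(fn=worker):
    """Start one process per visible GPU (assumed behaviour, not quix's code)."""
    import torch.multiprocessing as mp  # requires PyTorch

    world_size = len(visible_devices())
    mp.spawn(fn, args=(world_size,), nprocs=world_size)
```

With the `spawn` start method, the function handed to `mp.spawn` generally has to be importable by the child processes, which is one reason the entry point lives in a module rather than in a notebook cell.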

In [2]:
# Check whether quix is installed in the current environment
try:
    from quix.run import single_node_launcher, Runner

# If not, assume the notebook is run from inside the repo and add the parent directory to the path
except ModuleNotFoundError:
    import sys
    sys.path.append('../')
    from quix.run import single_node_launcher, Runner

In [3]:
single_node_launcher(
    Runner,
    model='resnet18',
    custom_runid='myrun',
    project='testproject',
    dataset='Caltech256',
    num_classes=257,
    epochs=50,
    aug3=True,
    input_ext='jpg',
    target_ext='cls',
    data_path='/work2/litdata/',
    batch_size=512,
    lr_init=3e-5,
    zro=False,
    model_ema=True,
    stdout=True,
)

INFO:2023-12-28 03:56:40,873 | Parsing augmentations...
INFO:2023-12-28 03:56:40,888 | Parsing data...
INFO:2023-12-28 03:56:41,055 | Parsing model...
INFO:2023-12-28 03:56:41,266 | Parsing loss...
INFO:2023-12-28 03:56:41,266 | Parsing parameter groups...
INFO:2023-12-28 03:56:41,267 | Parsing optimizer...
INFO:2023-12-28 03:56:41,268 | Parsing scaler...
INFO:2023-12-28 03:56:41,268 | Parsing scheduler...
INFO:2023-12-28 03:56:41,268 | Parsing DDP...
INFO:2023-12-28 03:56:41,284 | Parsing EMA...
INFO:2023-12-28 03:56:41,300 | Parsing checkpoint...
INFO:2023-12-28 03:56:41,303 | Parsing logger...
INFO:2023-12-28 03:56:41,304 | Finished parsing!
DEBUG:2023-12-28 03:57:00,537 | epoch=0 iteration=0 training=True timedelta=18.825194835662842 loss=5.678041934967041 Acc1=0.005859375 Acc5=0.017578125 last_lr=3.242188066968856e-05 gpumem=16714.125
DEBUG:2023-12-28 03:57:01,754 | epoch=0 iteration=1 training=True timedelta=0.5081648826599121 loss=5.6404924392700195 Acc1=0.00390625 Acc5=0.015625
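
The `DEBUG` lines above carry the training metrics as `key=value` pairs, which makes them easy to turn into records for plotting. The sketch below parses lines in the format printed to stdout here; `parse_log_line` is a hypothetical helper, and whether the log files written to disk use the same layout is an assumption.

```python
def _convert(value):
    """Turn a logged value into bool, int or float where possible."""
    if value in ("True", "False"):
        return value == "True"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_log_line(line):
    """Parse the key=value pairs after the ' | ' separator into a dict.

    Lines without key=value pairs (e.g. the INFO lines) give an empty dict.
    """
    _, sep, payload = line.partition(" | ")
    if not sep:
        return {}
    record = {}
    for token in payload.split():
        key, eq, value = token.partition("=")
        if eq:
            record[key] = _convert(value)
    return record


line = (
    "DEBUG:2023-12-28 03:57:00,537 | epoch=0 iteration=0 training=True "
    "timedelta=18.825194835662842 loss=5.678041934967041"
)
print(parse_log_line(line))
```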

## Step 3: Inference

So, where did our model go? Well, the `Runner` instance has done its job in each process by:
- Initializing the process group for distributed training.
- Parsing the configuration from the Config classes and config kwargs passed to `single_node_launcher`.
- Initializing all relevant modules, checkpoints, optimizers, schedulers, etc., in each process to perform the run.
- Running the training on each process with checkpoints and logging.
    - The checkpoints and logs are by default saved in the path `<savedir>/<project>/<runid>`. 
        - Checkpoints are stored in `<savedir>/<project>/<runid>/checkpoint`.
        - Logs are stored in `<savedir>/<project>/<runid>/log`.
    - `<savedir>` defaults to the user's `$HOME` folder.
    - If either `project` or `runid` is not specified, it is dropped from the path.
        - E.g., if you did not specify a `project` or `savedir` for the run, the path is inferred as `~/<runid>/checkpoint|log`.
- By default, the runner uses `rolling_checkpoints = 5`, which means that only the last 5 checkpoints are kept from the run, and stored in the checkpoint path inferred by the `Runner`. All this can be customized by subclassing `Runner`.
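
The path rules in the list above can be sketched as a small function. This is an illustration of the described behaviour only, not the code `Runner` uses; `run_dirs` is a hypothetical name.

```python
from pathlib import Path


def run_dirs(runid=None, project=None, savedir=None):
    """Resolve <savedir>/<project>/<runid>/{checkpoint,log}, dropping unset parts."""
    # savedir falls back to the user's home folder.
    base = Path(savedir) if savedir is not None else Path.home()
    for part in (project, runid):
        if part:
            base = base / part
    return base / "checkpoint", base / "log"


ckpt_dir, log_dir = run_dirs(runid="myrun", project="testproject")
print(ckpt_dir)
print(log_dir)
```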

So, in this case, our checkpoints and logs are stored in `~/testproject/myrun/checkpoint|log` and we can retrieve anything we want from there to do inference.
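
To pick up a checkpoint for inference, it helps to locate the most recent file in the checkpoint folder. The sketch below does that, and also shows one way a rolling window of checkpoints could be maintained. Both helpers are hypothetical, and the `*.pth` pattern and the ordering by modification time are assumptions, since the file naming used by `Runner` is not shown here. Loading the file itself is left out, as it depends on what the `Runner` stores in a checkpoint.

```python
from pathlib import Path


def _by_mtime(ckpt_dir, pattern):
    """List matching files, oldest first."""
    return sorted(Path(ckpt_dir).glob(pattern), key=lambda p: p.stat().st_mtime)


def latest_checkpoint(ckpt_dir, pattern="*.pth"):
    """Return the most recently modified checkpoint, or None if there is none."""
    files = _by_mtime(ckpt_dir, pattern)
    return files[-1] if files else None


def prune_checkpoints(ckpt_dir, keep=5, pattern="*.pth"):
    """Delete all but the `keep` newest checkpoints and return the ones kept."""
    files = _by_mtime(ckpt_dir, pattern)
    # files[:-0] would be empty, so keep=0 is handled separately.
    stale = files[:-keep] if keep > 0 else files
    for path in stale:
        path.unlink()
    return files[len(stale):]
```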

## Conclusion

We've seen a method for training a model on a single node with multiple GPUs 
using the Jupyter Notebook format. 

Even though we need to jump through a few 
hoops (defining the main function in a separate script), we see that it can be 
done `quixly`$^\mathrm{TM}$ with a few lines of code.