## Supervised ImageNet Training

In this tutorial we will learn how to train a *supervised* model on the ImageNet dataset using our VISSL framework. Although VISSL focuses on self-supervised learning, it can also be used to train supervised image classification models.

VISSL relies on config files that can fully reproduce an experiment. This tutorial uses the config provided at `hydra_configs/config/pretrain/supervised/supervised_1gpu_resnet_example.yaml`, which trains a ResNet-50 model on a single GPU. Later sections show how to extend it to multi-GPU or multi-node training.


We first go over the requirements for training this model.

## Requirements

This tutorial assumes you have access to a relatively modern machine running Linux with a GPU and enough disk space for the ImageNet images (~160 GB).


1. Install VISSL and its dependencies. VISSL can be installed using `conda` or `pip` as described in `INSTALL.MD`.
2. Download the ImageNet images and preprocess the `validation` set images using the [script](https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh). This preprocessing is exactly the same as in the standard [PyTorch ImageNet example](https://github.com/pytorch/examples/tree/master/imagenet), and so it can be skipped if you already have the preprocessed data.
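
Before editing the config, it can help to confirm that the `train` and `val` folders have the one-subfolder-per-class layout that the preprocessing script produces. The following is a minimal sketch using only the Python standard library. The function name `count_class_folders` and the script name in the usage comment are hypothetical and are not part of VISSL.

```python
from pathlib import Path


def count_class_folders(split_dir):
    """Return (number of class subfolders, number of files inside them)."""
    root = Path(split_dir)
    # Each immediate subdirectory is treated as one class.
    class_dirs = [d for d in root.iterdir() if d.is_dir()]
    # Count only regular files that sit directly inside a class folder.
    num_files = sum(1 for d in class_dirs for f in d.iterdir() if f.is_file())
    return len(class_dirs), num_files


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_layout.py <path to train> <path to val>
    for split in sys.argv[1:]:
        classes, files = count_class_folders(split)
        print(f"{split}: {classes} class folders, {files} files")
```

For ImageNet-1k, each split should report 1000 class folders. A `val` folder that reports zero class folders has most likely not been preprocessed yet.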

## Training

You will need to change the paths to the dataset in the config file `hydra_configs/config/pretrain/supervised/supervised_1gpu_resnet_example.yaml`. Specifically, you will need to provide the paths to the `train` and `val` folders of ImageNet.

You can optionally change the output path by changing the string at `CHECKPOINT.DIR` in the config file. By default, the config will output the training log and the model checkpoints to a directory named `checkpoints`.


Assuming you are in the root directory of the VISSL repo, you can start training with the command:
```sh
python tools/distributed_train.py config=pretrain/supervised/supervised_1gpu_resnet_example
``` 
**Note** that the initial part of the config path, `hydra_configs/config`, and the `.yaml` extension are omitted when running the training command.
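
The relationship between the `config=` argument and the file on disk can be sketched as follows. This only illustrates the naming convention used in this tutorial; the helper `config_file_for` is hypothetical, and VISSL itself resolves configs through Hydra rather than through code like this.

```python
from pathlib import Path

# Location of the config tree, taken from the paths used in this tutorial.
CONFIG_ROOT = Path("hydra_configs/config")


def config_file_for(config_arg):
    """Map the value passed as `config=...` to the YAML file it refers to."""
    return CONFIG_ROOT / f"{config_arg}.yaml"


path = config_file_for("pretrain/supervised/supervised_1gpu_resnet_example")
print(path.as_posix())
```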

If all goes well, you should see the training script start with messages such as:

```bash
INFO 2020-07-01 14:06:06,196 distributed_train.py: 132: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: 127.0.0.1:51899
INFO 2020-07-01 14:06:06,197 train.py:  50: Env set for rank: 0, dist_rank: 0
INFO 2020-07-01 14:06:06,198 env.py:  20: printing stats env
```

VISSL is aimed at reproducibility, so the training script will first print out the running configuration -- the environment variables, versions of various libraries, the full training config, data size, model etc. All the outputs printed on `stdout` are also stored in a `log.txt` file under the output directory specified by `CHECKPOINT.DIR`.

You can then see the training stats printed out:
```bash
INFO 2020-07-01 14:06:34,485 log_hooks.py: 126: Rank: 0; [ep: 0] iter: 1; lr: 0.025; loss: 7.08849; btime(ms): 10386
INFO 2020-07-01 14:06:36,878 log_hooks.py: 126: Rank: 0; [ep: 0] iter: 5; lr: 0.025; loss: 8.26936; btime(ms): 2500
INFO 2020-07-01 14:06:39,867 log_hooks.py: 126: Rank: 0; [ep: 0] iter: 10; lr: 0.025; loss: 8.94679; btime(ms): 1549
INFO 2020-07-01 14:06:42,854 log_hooks.py: 126: Rank: 0; [ep: 0] iter: 15; lr: 0.025; loss: 7.74198; btime(ms): 1231
INFO 2020-07-01 14:06:45,857 log_hooks.py: 126: Rank: 0; [ep: 0] iter: 20; lr: 0.025; loss: 7.50769; btime(ms): 1074
```

VISSL also prints out the GPU memory usage and the ETA (approximate time for the experiment to finish).
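
Because the same lines are written to `log.txt`, the training stats can be extracted afterwards for plotting or monitoring. Below is a minimal sketch that parses the per-iteration lines in the format shown in the excerpt above. The pattern and the function `parse_stats_line` are assumptions written for this tutorial, and the log format may differ in other VISSL versions.

```python
import re

# Pattern for the per-iteration lines shown in the log excerpt above.
STATS_PATTERN = re.compile(
    r"\[ep: (?P<epoch>\d+)\] iter: (?P<iter>\d+); "
    r"lr: (?P<lr>[\d.eE+-]+); loss: (?P<loss>[\d.eE+-]+); "
    r"btime\(ms\): (?P<btime_ms>\d+)"
)


def parse_stats_line(line):
    """Return a dict of training stats, or None if the line has no stats."""
    match = STATS_PATTERN.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "iter": int(match["iter"]),
        "lr": float(match["lr"]),
        "loss": float(match["loss"]),
        "btime_ms": int(match["btime_ms"]),
    }


line = (
    "INFO 2020-07-01 14:06:34,485 log_hooks.py: 126: Rank: 0; "
    "[ep: 0] iter: 1; lr: 0.025; loss: 7.08849; btime(ms): 10386"
)
print(parse_stats_line(line))
```

Lines that do not carry training stats, such as the startup messages, return `None` and can be skipped when iterating over the log file.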

## Understanding the training config

We now walk through the main parts of the training config file.

### Data
The input data and labels needed to train the model are specified under the 
`DATA` key. The training and testing data are specified under `DATA.TRAIN` and `DATA.TEST`. 
For example,
```yaml
DATA:
    TRAIN:
      DATA_SOURCES: [disk_folder]
      DATA_PATHS: ["<path to train folder>"]
      LABEL_SOURCES: [disk_folder]
      DATASET_NAMES: [imagenet1k]
      BATCHSIZE_PER_REPLICA: 64
```
This specifies that the model will train on the images in the folder given by `DATA.TRAIN.DATA_PATHS` and infer the labels from the directory structure of the images. The model is trained with a batch size of 64 images per GPU.
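
To make "infer the labels from the directory structure" concrete, here is a sketch of the convention used by torchvision's `ImageFolder`: class folder names are sorted and each is assigned an integer index. We assume here, without the source confirming it, that the `disk_folder` label source follows the same convention. The function `labels_from_folders` is hypothetical.

```python
import tempfile
from pathlib import Path


def labels_from_folders(split_dir):
    """Assign an integer label to each class folder, in sorted name order."""
    class_names = sorted(d.name for d in Path(split_dir).iterdir() if d.is_dir())
    return {name: index for index, name in enumerate(class_names)}


# Demonstration on a temporary directory with two ImageNet-style class folders.
with tempfile.TemporaryDirectory() as root:
    for wnid in ["n01443537", "n01440764"]:
        (Path(root) / wnid).mkdir()
    print(labels_from_folders(root))
```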

The image transforms are specified in `TRANSFORMS` and generally wrap the `torchvision` image transforms. One can compose multiple transforms by listing them in the config file, or implement custom image transforms. VISSL uses this compositionality of data transforms to implement many self-supervised methods as well.
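
The idea of composing transforms from a config list can be illustrated with a toy example. This is not VISSL's implementation: the registry, the transform names `scale` and `shift`, and the spec format are all hypothetical stand-ins, and the "images" are plain lists of numbers so that the sketch runs without any dependencies.

```python
# Toy transforms operating on lists of numbers (stand-ins for image transforms).
def scale(values, factor=1.0):
    return [v * factor for v in values]


def shift(values, offset=0.0):
    return [v + offset for v in values]


# Hypothetical registry mapping a config name to a transform function.
REGISTRY = {"scale": scale, "shift": shift}


def build_pipeline(specs):
    """Turn a list of {"name": ..., **kwargs} dicts into one callable."""
    steps = []
    for spec in specs:
        kwargs = {k: v for k, v in spec.items() if k != "name"}
        steps.append((REGISTRY[spec["name"]], kwargs))

    def pipeline(values):
        # Apply each transform in the order it was listed.
        for fn, kwargs in steps:
            values = fn(values, **kwargs)
        return values

    return pipeline


pipeline = build_pipeline([
    {"name": "scale", "factor": 2.0},
    {"name": "shift", "offset": 1.0},
])
print(pipeline([1.0, 2.0]))  # [3.0, 5.0]
```

The order of the entries matters: swapping the two specs would shift first and then scale.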

### Model
VISSL specifies the model as a `TRUNK` (the base ConvNet) and a `HEAD` (the classification or task-specific parameters). This allows one to cleanly separate the logic between the task itself and the ConvNet. Multiple model trunks (see listing under `vissl/model/trunks`) can be used for the same task.

A ResNet-50 model that outputs classification scores for 1000 classes (the number of classes in ImageNet) is specified as
```yaml
MODEL:
    TRUNK:
      NAME: resnet
      TRUNK_PARAMS:
        RESNETS:
          DEPTH: 50
    HEAD:
      PARAMS: [
        ["mlp", {"dims": [2048, 1000]}],
      ]
```
Here `TRUNK` specifies the base ConvNet architecture, and `HEAD` specifies a single fully connected layer (a special case of an MLP) that maps the 2048-dimensional trunk features to 1000 outputs.
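
The `dims` list can be read as a chain of layer sizes: each consecutive pair gives the input and output width of one fully connected layer. The sketch below shows this reading and counts the resulting parameters. It assumes plain linear layers with a bias term and nothing else, which may not match every option of the actual head; both function names are hypothetical.

```python
def mlp_layer_shapes(dims):
    """Pair consecutive entries of `dims` into (in_features, out_features)."""
    return list(zip(dims[:-1], dims[1:]))


def count_linear_params(dims):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(i * o + o for i, o in mlp_layer_shapes(dims))


print(mlp_layer_shapes([2048, 1000]))     # [(2048, 1000)]
print(count_linear_params([2048, 1000]))  # 2049000
```

Under the same reading, a longer list such as `[2048, 512, 1000]` would describe two layers, one from 2048 to 512 and one from 512 to 1000.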

VISSL automatically sets the model to `eval` mode when using the data in `DATA.TEST`. This ensures that layers such as BatchNorm and Dropout behave correctly when reporting test set accuracies.

### Criterion and Optimizer
The criterion and optimizer are specified under the `CRITERION` and `OPTIMIZER` keys. VISSL criterions behave similarly to the default `torch.nn` criterions.

The `OPTIMIZER` contains information about the base optimizer (SGD in this case) and the learning rate scheduler (`OPTIMIZER.param_schedulers`).

### Measuring Accuracy
Accuracy meters are specified under `METERS` and measure the top-1 and top-5 accuracies.

### Number of GPUs
The number of GPUs and number of nodes are specified under `DISTRIBUTED`. VISSL seamlessly runs the **same** code on either a single GPU or across multiple nodes/GPUs.

## Extending the config

### Distributed Training
VISSL supports distributed training without any extra code: you only need to change the number of nodes/GPUs specified under `DISTRIBUTED`. The *same* code can be run in single GPU or multi-GPU settings.


```yaml
DISTRIBUTED:
    NUM_NODES: 1 # number of nodes
    NUM_PROC_PER_NODE: 1 # number of gpus per node
```

After changing the number of nodes/GPUs, you can run your experiment with the same command as earlier:
```bash
python tools/distributed_train.py config=<config> 
```

If running on more than one node, you will need to run this command on each of the nodes. Please note the following:
1. The batch size specified in the configs under `DATA.TRAIN.BATCHSIZE_PER_REPLICA` (denoted as *B*) is **per GPU**. So if you run your code on *N* nodes with *G* GPUs each, then the total effective batch size is *B*`*`*N*`*`*G*.
2. Since running on multiple GPUs changes the effective batch size, you may also want to change the learning rate or use learning rate warmup (see [the ImageNet in 1 hour paper](https://arxiv.org/abs/1706.02677)).
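
The two notes above can be sketched in a few lines. The effective batch size follows the *B*`*`*N*`*`*G* formula, and the learning rate uses the linear scaling rule from the paper linked above, with its reference setting of 0.1 for a batch of 256. Both function names are hypothetical, and whether this rule suits your experiment is a judgment call, not something VISSL applies for you in this sketch.

```python
def effective_batch_size(batch_per_gpu, num_nodes, gpus_per_node):
    """Total images per optimizer step: B * N * G."""
    return batch_per_gpu * num_nodes * gpus_per_node


def linearly_scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


# Single GPU run from this tutorial: B=64, N=1, G=1.
single = effective_batch_size(64, 1, 1)
print(single, linearly_scaled_lr(single))

# Hypothetical run on 2 nodes with 8 GPUs each.
multi = effective_batch_size(64, 2, 8)
print(multi, linearly_scaled_lr(multi))
```

For the single GPU case the rule gives 0.025, the same `lr` value that appears in the log excerpt earlier in this tutorial. The hypothetical 16-GPU run has an effective batch size of 1024, for which the rule suggests 0.4, typically combined with a warmup phase as described in the paper.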


### Mixed Precision or FP16 training
If you installed VISSL with Apex, you can train the model using mixed precision. This requires adding the following lines to the config file under the `MODEL` key:

```yaml
    AMP_PARAMS:
      USE_AMP: True
      AMP_ARGS: {"opt_level": "O1"}
```
This will run the model using the `O1` setting in Apex, which should generally result in stable training while saving GPU memory (and possibly training faster, depending on the GPU architecture). See the [apex documentation](https://nvidia.github.io/apex/amp.html#opt-levels-and-properties) for more information on what the different mixed precision flags do.

### Using SyncBatchNorm in the model
SyncBatchNorm can be enabled in the config under the `MODEL` key:

```yaml
    SYNC_BN_CONFIG:
      CONVERT_BN_TO_SYNC_BN: True
      SYNC_BN_TYPE: pytorch
```

If you have `apex` installed, you can use a faster version of SyncBatchNorm by setting:

```yaml
    SYNC_BN_CONFIG:
      CONVERT_BN_TO_SYNC_BN: True
      SYNC_BN_TYPE: apex
      GROUP_SIZE: 8 # set to number of GPUs per node for fast performance.
```

Our model definitions are written such that one can easily replace BatchNorm with other normalization functions (LayerNorm, GroupNorm etc.) by changing arguments in the config file.
