## Best Practices

### Practical Advice for Training NNs

In [3]:
import torch

### Source Code Organization

Make your own skeleton! Keep it lightweight.

Recommended folder structure

- data
- runs
- skeleton
- models
- experiments
    - experiment1
        - settings.json
        - main.py
        - train.py
        - test.py
    - ...
- utils




### ArgumentParser vs. Settings File

- ArgumentParser defaults are often misused as inputs
- W&B can track hyperparameters, but what if we want to share the source code?
- Often many hyperparameters (fills your terminal)
- Same source code, multiple experiments

``` bash

python train.py \
    --log_dir "../runs/project/experiment" \
    --batch_size 10 \
    --model "LSTM" --lr 0.001 --epochs 10 \
    --num_layers 3 --hidden_size 512 \
    --pretrained --seq_len 20

```

### Typical Settings File

``` json
{
    "train_data":               "../../data/HFR/train.txt",
    "test_data":                "../../data/HFR/test.txt",
    "val_data":                 "../../data/HFR/val.txt",

    "batch_size":               5,
    "epochs":                   1000,

    "lr":                       0.00001,
    "lr_decay_factor":          0.98,

    "weight_reconstruction":    1.0,
    "weight_perceptual":        0.1,
    "weight_binarization":      0.8,
    "weight_hinge":             0.8,

    "crop_size":                [224, 224],

    "device":                   "cuda",
    "gpus":                     [2, 3],
    "num_workers":              4,

    "log_interval":             100,
    "log_dir":                  "../../runs/flow/layered/01-initial",
    "save":                     "checkpoint.pt",
    "resume":                   false,

    "comment":                  "Layered representation for optical flow. The mask is now used to inpaint only the background. Added an extra conv layer to the upsampling block in M. Jin's network. The Inpainter now also takes the mask as input, and the non-inpainted pixels are replaced with the original values. The frame skip is now 10 instead of 5."
}
```

### Loading the Settings File

In [1]:
import json
import os
import shutil
from argparse import Namespace


def load_config(file):
    # read the JSON settings and expose them as attributes (config.batch_size, ...)
    with open(file, 'r') as f:
        config = json.load(f)
        config = Namespace(**config)
    return config

def backup_config(file, log_dir):
    # backup the config file to the log-folder
    shutil.copy(file, os.path.join(log_dir, os.path.basename(file)))
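A minimal sketch of how the settings file could replace a long list of command-line flags: the script takes a single `--config` argument and reads everything else from the JSON file. The `--config` flag name, the `parse_args` helper, and the demo values are assumptions made for illustration, not part of the original skeleton.

```python
import argparse
import json
import shutil
import tempfile
from argparse import Namespace
from pathlib import Path


def load_config(file):
    """Read a JSON settings file into a Namespace (attribute access)."""
    with open(file, "r") as f:
        return Namespace(**json.load(f))


def parse_args(argv=None):
    # Hypothetical CLI: the only input is the path to the settings file.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="settings.json")
    return parser.parse_args(argv)


# Demo with a throwaway settings file (values are made up for illustration).
workdir = Path(tempfile.mkdtemp())
settings_file = workdir / "settings.json"
settings_file.write_text(
    json.dumps({"batch_size": 5, "lr": 1e-5, "log_dir": str(workdir / "runs")})
)

args = parse_args(["--config", str(settings_file)])
config = load_config(args.config)

# Back up the settings next to the logs so the run can be reproduced later.
Path(config.log_dir).mkdir(parents=True, exist_ok=True)
backup = shutil.copy(args.config, config.log_dir)

print(config.batch_size, config.lr)
```

Running a second experiment then only requires a second settings file; the source code stays the same.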

### Alternatives: 

- YAML
- Text file
- Bash file 

### Checkpoints!
Use them! If the machine crashes, you can resume training instead of starting from scratch.

In [30]:
import torchvision.models as models
model = models.resnet18(pretrained=True)
optimizer = torch.optim.Adam(model.parameters())

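# Not recommended: pickles the whole model object, which ties the file to your class definitions
torch.save(model, 'runs/full-model.pth')

# Better: save only the weights
torch.save(model.state_dict(), 'runs/model-weights.pth')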


In [19]:
# Even better, save additional stuff!
torch.save({
    'epoch': 10,
    'model': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'val_loss': 0.456,
    'is_best': True,
}, 'runs/state_dict.pth')

In [20]:
# Load the checkpoint
state = torch.load('runs/state_dict.pth')

In [21]:
print(state['epoch'])
print(state['is_best'])

10
True


In [None]:
print(state['model'].keys())
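To actually resume from such a checkpoint, the saved state has to be loaded back into freshly constructed objects. The following is a rough sketch under the assumption of a tiny `nn.Linear` stand-in model and a temporary file path; the dictionary keys mirror the ones used above.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in model; in practice this is your real network.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step so the optimizer has internal state worth saving.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "state_dict.pth")
torch.save({
    "epoch": 10,
    "model": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, path)

# Resume: rebuild the objects first, then load the saved state into them.
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.Adam(new_model.parameters(), lr=1e-3)

state = torch.load(path, map_location="cpu")
new_model.load_state_dict(state["model"])
new_optimizer.load_state_dict(state["optimizer_state_dict"])
start_epoch = state["epoch"] + 1  # continue with the next epoch
```

Restoring the optimizer state matters for optimizers such as Adam, which keep running statistics per parameter.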

### Why Checkpoints

- Computers like to crash for unknown reasons
- Early stopping
- Researchers appreciate if you publish the weights/checkpoints
- Finetuning on new task

Recommendation:
- 1 checkpoint per epoch
- If file is large: keep latest N checkpoints, delete rest.
- Always keep track of *best* checkpoint
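The recommendation above could be implemented roughly as follows. This is only a sketch: `save_checkpoint`, its parameters, and the file naming scheme are hypothetical, and a small JSON writer stands in for `torch.save` so the example runs without a model. In a real training loop you would pass `save_fn=torch.save`.

```python
import json
import shutil
import tempfile
from pathlib import Path


def _save_json(state, path):
    # Stand-in for torch.save so the sketch runs without a model.
    Path(path).write_text(json.dumps(state))


def save_checkpoint(state, ckpt_dir, epoch, val_loss, best_loss,
                    keep_last=3, save_fn=_save_json):
    """Save one checkpoint per epoch, keep the newest `keep_last`, track the best.

    Returns the (possibly updated) best validation loss.
    """
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    path = ckpt_dir / f"epoch-{epoch:04d}.pth"
    save_fn(state, path)

    # Keep a separate copy of the best checkpoint so rotation never deletes it.
    if val_loss < best_loss:
        shutil.copyfile(path, ckpt_dir / "best.pth")
        best_loss = val_loss

    # Zero-padded epoch numbers make lexicographic order equal epoch order.
    checkpoints = sorted(ckpt_dir.glob("epoch-*.pth"))
    for old in checkpoints[:-keep_last]:
        old.unlink()
    return best_loss


ckpt_dir = Path(tempfile.mkdtemp())
best = float("inf")
fake_val_losses = [0.9, 0.5, 0.7, 0.6, 0.8]  # made-up numbers
for epoch, val_loss in enumerate(fake_val_losses, start=1):
    state = {"epoch": epoch, "val_loss": val_loss}
    best = save_checkpoint(state, ckpt_dir, epoch, val_loss, best)

print(sorted(p.name for p in ckpt_dir.iterdir()))
```

Copying the best checkpoint to its own file costs extra disk space, but it survives the rotation even when the best epoch is long past.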

### Reproducibility

In my experience, many researchers still don't give a shit about reproducibility.
This must change!

- Source code should be self-contained
- Requirements go into requirements.txt file!
- Provide the code for ALL of your experiments
- Publish network weights
- Report final numbers obtained with a manually set random seed!

### Random Seed

In [29]:
# Torch
torch.manual_seed(42)

# cuDNN: use deterministic algorithms and disable the auto-tuner
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# If you use numpy as well ...
import numpy as np
np.random.seed(42)

# ... or random package
import random
random.seed(42)
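The individual calls above can be bundled into one helper that is called once at the start of every experiment. The name `set_seed` is an assumption for illustration; the sketch also checks that reseeding reproduces the same random numbers.

```python
import random

import numpy as np
import torch


def set_seed(seed):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Reseeding with the same value should reproduce the same draws.
set_seed(42)
first = (random.random(), np.random.rand(), torch.rand(3))
set_seed(42)
second = (random.random(), np.random.rand(), torch.rand(3))

print(first[0] == second[0], first[1] == second[1], torch.equal(first[2], second[2]))
```

Storing the seed in the settings file keeps it together with the other hyperparameters of the run.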

### Help others port your code!

- Publish error curves (training, validation, etc.) ...
- ... or just the error after first weight update (fixed seed)

### Useful Tools and Tricks

Not really best practices, but useful to know.

#### Command line tools
- GPU usage
    ```
    watch -n 0.5 nvidia-smi
    ```
- CPU usage
    ```
    htop
    ```

<img src="./figures/nvidia-smi.png" width="50%"/>


<img src="./figures/htop.png" width="50%"/>

#### Data Loading

Sometimes data pre-processing is heavy, and you need fast read access to the data.

- Load dataset into RAM:
    ``` cp -r path/to/my/dataset /dev/shm/ ```
- Works only if your dataset fits into memory (duh!)
- For proof of concept, entire dataset often not needed anyway

You can increase the shared memory by editing the [``` /etc/fstab ```](https://masukkhan.wordpress.com/2015/12/09/resize-devshm-filesystem-in-linux/) file.
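Before copying, it can be worth checking whether the dataset actually fits. The sketch below does this from Python with the standard library; `directory_size` and `copy_if_it_fits` are hypothetical helper names, and the demo uses temporary directories instead of `/dev/shm` so that it runs on any system.

```python
import shutil
import tempfile
from pathlib import Path


def directory_size(path):
    """Total size in bytes of all regular files below `path`."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def copy_if_it_fits(dataset_dir, target_dir="/dev/shm"):
    """Copy `dataset_dir` into `target_dir` only if there is enough free space.

    Returns the new path, or None if the dataset does not fit.
    """
    needed = directory_size(dataset_dir)
    free = shutil.disk_usage(target_dir).free
    if needed > free:
        return None
    destination = Path(target_dir) / Path(dataset_dir).name
    return Path(shutil.copytree(dataset_dir, destination, dirs_exist_ok=True))


# Demo with temporary directories standing in for the dataset and /dev/shm.
dataset = Path(tempfile.mkdtemp()) / "dataset"
dataset.mkdir()
(dataset / "sample.bin").write_bytes(b"\0" * 1024)

target = Path(tempfile.mkdtemp())
copied = copy_if_it_fits(dataset, target_dir=target)
print(copied)
```

On a Linux machine, calling `copy_if_it_fits("path/to/my/dataset")` with the default target corresponds to the `cp -r` command above.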



### Useful Links

##### A Recipe for Training Neural Networks:
[https://karpathy.github.io/2019/04/25/recipe/](https://karpathy.github.io/2019/04/25/recipe/)