## About last lecture

- Catching bugs requires a **lot of effort**! Luckily, we have many weapons

- Many best practices for debugging and code organization also **greatly increase readability**!

- There are **several tips** concerning **code organization and structure** that we should apply depending on our scenario

- There are many more, often **problem specific** and **very hard to generalize**

## What's left?

We briefly overviewed many general aspects of coding, from debugging to coding principles.

But many of us are interested in domain-specific coding tips $\rightarrow$ **deep learning**!

## About this lecture

We are going to briefly overview some major coding aspects in deep learning

Mainly showing examples in Pytorch and Tensorflow

# DL Best Practices

### Modeling ([Karpathy's blog](https://karpathy.github.io/2019/04/25/recipe/))
- Inspect data
- Start simple
- Overfit
- Apply some regularizations
- Hyper-parameters tuning

### Scripting
- File Organization
- Sequential layers
- Calling layers
- Training and evaluation modes
- Mixing operations
- Detaching
- Numerical stability
- Shuffling data
- Gradient clipping
- tf.reshape/th.view vs tf.transpose/th.permute
- Squeezing
- Working with GPU devices
- Freeing GPU memory

## As always, let's first check your experience!

Time for another [Google Form](https://forms.gle/qmQuLWXaygSjft3V7) (**5 mins**)

<center>
<div>
<img src="../Images/Lecture-7/qsn_dl.png" width="600" alt='Deep Learning'/>
</div>
</center>

## Modeling

*Because, eventually, we all prefer a todo listing...*

### Intro

Neural networks are **not a plug-and-play module** that is expected to work effortlessly

Instead, neural networks often **fail silently** (*no exceptions!*) and can still manage to reach **some satisfying result** (*unexpected*)
   - Leakage
   - Data errors
   - Wrong initialization, regularization, optimization, etc...

#### What you should do (*in pills*)

- **Start slow** $\rightarrow$ don't be eager to test out your super fancy model on your ultra fancy data
- **Check** your data attentively
- **Start simple** $\rightarrow$ progressively introduce complexity
- **Hammer your model** until it is sufficiently powerful and regularized to work well

### Become one with the data

Intuitively, we **never start from modeling**, but rather we **look at data**

- Understand relevant **features** for addressing the task
- Spot **outliers** (*may be symptomatic of data collection errors*)
- Spot **errors** (e.g., duplicates, wrong labeling)
- Check data **distributions**
- Identify **biases**
- Identify possible **correct pre-processing** steps
- Helps in **understanding** model post-training evaluation (*error analysis*)

### Start simple

Still, we are not yet ready to plug in our fancy model

- Write a **training/evaluation skeleton**
- Test **simple baselines** (*how far can they go?*) $\rightarrow$ downplaying possible scripting errors
- **Avoid** any unnecessary complexity $\rightarrow$ pick the simplest setting possible
- **Overfit one single batch** $\rightarrow$ spotting errors, evaluate model capacity, evaluate data
- Check **learning curves**
- Inspect model **prediction dynamics** $\rightarrow$ gives good intuition of model training

### Overfit

Pick a **large enough model** that is able to overfit $\rightarrow$ we understand that the task is *solvable* on training data

- Pick **existing models** (*even their simplest version*) rather than making custom ones
- **Progressively** introduce complexity $\rightarrow$ multiple inputs, larger inputs, etc.
- Pay attention to employed **optimizers** $\rightarrow$ weight decay, learning rate decay

### Regularize

Once we have a big enough model, we have to **regularize it** to allow better generalization capabilities

- Add **more data** (*if you can!*) $\rightarrow$ data-augmentation, real data, synthetic data
- Check **spurious** inputs
- **Decrease** model size
- **Decrease** batch size (*if you are using batch normalization*)
- Dropout
- Weight decay
- Early stopping

### Tuning

To find a good regularized model, we may consider a **hyper-parameter calibration** routine

- Random search (*it usually works quite well to explore the hyper-parameter space*)
- Bayesian optimization (e.g., Optuna)
- **Use your brain** $\rightarrow$ in many cases, you may need to think about the valid search space

## Scripting

Let's see some concrete examples of advanced coding that might be helpful.

### File Organization

**Split** your model into individual layers and losses to

- Enhance **re-usability** (*easier to spot errors*)
- Enhance **readability** (*top-down view of a model*)

The same applies for **nested** models, layers and losses

In [None]:
# losses.py
class CustomLoss(th.nn.Module):
    def __init__(self, *args, **kwargs):
        ...
        
    def forward(self, inputs):
        ...
        
# layers.py
class CustomLayer(th.nn.Module):
    def __init__(self, *args, **kwargs):
        ...
        
    def forward(self, inputs):
        ...
        
# models.py
class CustomModel(th.nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()  # required before assigning sub-modules
        self.layer = CustomLayer(...)
        self.loss_op = CustomLoss(...)
        
    def forward(self, inputs):
        pass

In **Tensorflow** this is particularly recommended when considering ```tf.function```

#### tf.function covers function nesting

If a function invokes other functions, you just need to decorate the top-level function with ```tf.function```

In [None]:
class MyModel(tf.keras.Model):
    
    def loss_op(self, x, y, training=False):
        output = self.model(x, training=training)
        loss = ...
        return loss

    def train_op(self, x, y):
        with tf.GradientTape() as tape:
            loss = self.loss_op(x=x, y=y, training=True)
        grads = tape.gradient(loss, self.model.trainable_variables)
        return loss, grads  # batch_fit unpacks both values
    
    @tf.function
    def batch_fit(self, x, y):
        loss, grads = self.train_op(x, y)
        self.optimizer.apply_gradients(zip(grads, 
                                           self.model.trainable_variables))
        return loss
    
    @tf.function
    def batch_evaluate(self, x, y):
        return self.loss_op(x=x, y=y, training=False)
        
    @tf.function
    def batch_predict(self, x):
        return self.model(x, training=False)

### Sequential Layers

In many cases, we may have to define a sequential network

In [None]:
# (A)
import torch as th

class ConvBlock(th.nn.Module):
    def __init__(self):
        super(ConvBlock, self).__init__()
        self.block = th.nn.Sequential(
            th.nn.Conv2d(...),
            th.nn.ReLU(...),
            th.nn.BatchNorm2d(...)
        )
        
    def forward(self, x):
        return self.block(x)

In [None]:
# (B)
import torch as th

class ConvBlock(th.nn.Module):
    def __init__(self):
        super(ConvBlock, self).__init__()
        self.blocks = [
            th.nn.Conv2d(...),
            th.nn.ReLU(...),
            th.nn.BatchNorm2d(...)
        ]
        
    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

In [None]:
# (C)
import torch as th

class ConvBlock(th.nn.Module):
    def __init__(self):
        super(ConvBlock, self).__init__()
        
    def forward(self, x):
        x = th.nn.Conv2d(...)(x)
        x = th.nn.ReLU(...)(x)
        x = th.nn.BatchNorm2d(...)(x)
        return x

### Which one do you use?

#### (A) is the best choice

- [**TH**] Use ```th.nn.Sequential(...)``` to define a sequential network $\rightarrow$ higher efficiency, readability
- [**TF**] Likewise, use ```tf.keras.Sequential``` in tensorflow

#### (B) may give some problems with list wrapping

- [**TH**] Consider using ```th.nn.ModuleList(...)``` rather than a list
- [**TF**] It is **fine to wrap** multiple layers in a list, but the **recommended way** is to use ```tf.keras.Sequential```

#### (C) is terrible!

- Creates new layers (with freshly initialized weights) at **each** forward pass $\rightarrow$ losing track of model weights to update
- You need to define layers in the ```__init__(...)``` method so that model weights are **kept throughout the life of the model**
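
The list-wrapping issue in (B) can be checked directly. The following is a minimal sketch, assuming two hypothetical toy modules (`ListBlock` and `ModuleListBlock` are illustrative names, not part of the lecture code): it counts the parameters that an optimizer would receive through ```model.parameters()```.

```python
import torch as th


class ListBlock(th.nn.Module):
    """Layers kept in a plain Python list: NOT registered as sub-modules."""

    def __init__(self):
        super().__init__()
        self.blocks = [th.nn.Linear(4, 4), th.nn.ReLU()]

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


class ModuleListBlock(th.nn.Module):
    """Same layers wrapped in th.nn.ModuleList: registered as sub-modules."""

    def __init__(self):
        super().__init__()
        self.blocks = th.nn.ModuleList([th.nn.Linear(4, 4), th.nn.ReLU()])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# Count the parameters exposed through model.parameters()
n_list = sum(p.numel() for p in ListBlock().parameters())
n_module_list = sum(p.numel() for p in ModuleListBlock().parameters())
print(n_list, n_module_list)  # 0 20
```

With the plain list, the optimizer receives no parameters at all, so training silently does nothing for those layers; the same applies to ```model.to(device)``` and ```model.state_dict()```.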

### Calling layers

Both Tensorflow and Pytorch implement layers as **callable** objects

**Always invoke** your layer/model as ```layer(...)``` and ```model(...)```, respectively.

#### Never invoke ```call(...)``` or ```forward(...)``` explicitly!
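
One reason for this rule in Pytorch is that ```__call__``` wraps ```forward(...)``` with extra logic such as hooks. A minimal sketch, assuming a toy linear layer and a hypothetical `calls` list used only to record hook executions:

```python
import torch as th

layer = th.nn.Linear(3, 2)
calls = []

# A forward hook is one example of logic that lives in __call__, not in forward
handle = layer.register_forward_hook(
    lambda module, inputs, output: calls.append("hook")
)

x = th.zeros(1, 3)
layer(x)          # goes through __call__ -> the hook fires
layer.forward(x)  # bypasses __call__ -> the hook is skipped
handle.remove()

print(calls)  # ['hook']
```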

### Training and Evaluation modes

Torch has two model modes: ```model.train()``` and ```model.eval()```

In [None]:
data_iterator = data_iterator()
model.eval()
for step in range(steps):
    batch = next(data_iterator)
    batch_loss, batch_preds = model.batch_predict(*batch)
    ...

In [None]:
data_iterator = data_iterator()
model.eval()
with th.no_grad():  # <--- the difference is here
    for step in range(steps):
        batch = next(data_iterator)
        batch_loss, batch_preds = model.batch_predict(*batch)
        ...

### Which one do you use?

#### Both are correct!

#### Model.eval()

- Just **changes model execution** so that layers like Dropout, BatchNorm can execute correctly

#### th.no_grad()

- **Disables automatic differentiation** saving up memory and time

In the common case where you **don't compute** any gradient during evaluation, you can use both to **gain** some speed-up and use less memory.
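
The difference between the two can be observed on a toy model. This is a minimal sketch under the assumption of a small hypothetical network with a Dropout layer; it only inspects whether outputs are still tracked by autograd.

```python
import torch as th

model = th.nn.Sequential(th.nn.Linear(4, 4), th.nn.Dropout(p=0.5))
x = th.ones(2, 4)

model.eval()
out_eval = model(x)
# eval() only switches layer behaviour: autograd is still recording
tracked_in_eval = out_eval.requires_grad

with th.no_grad():
    out_no_grad = model(x)
tracked_in_no_grad = out_no_grad.requires_grad

# Dropout is inactive in eval mode, so repeated calls agree
deterministic = th.equal(model(x), model(x))
print(tracked_in_eval, tracked_in_no_grad, deterministic)  # True False True
```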

#### Beware of incorrect behaviours!

In [None]:
def train(model, optimizer, epochs, train_steps, train_data_iterator,
          val_steps, val_data_iterator):
    model.train()
    for epoch in range(epochs):
        epoch_train_iterator = train_data_iterator()
        for step in range(train_steps):
            batch = next(epoch_train_iterator)
            batch_loss = model.batch_fit(*batch)
            
        # Get loss on validation set
        val_loss, val_metrics = evaluate(model=model,
                                         steps=val_steps,
                                         data_iterator=val_data_iterator)
                
                
def evaluate(model, steps, data_iterator):
    model.eval()
    ...

Can you spot the **error**?
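
One reading of the issue: ```model.train()``` is called only once, before the epoch loop, so after the first call to ```evaluate(...)``` the model stays in evaluation mode for all remaining epochs. A minimal sketch of a possible fix, assuming a hypothetical toy model and a stripped-down `evaluate` function, is to switch back to training mode at the start of every epoch:

```python
import torch as th

model = th.nn.Sequential(th.nn.Linear(4, 4), th.nn.Dropout(p=0.5))


def evaluate(model):
    model.eval()  # switches to evaluation mode and leaves the model there


modes = []
for epoch in range(2):
    model.train()  # re-enable training mode at the start of every epoch
    modes.append(model.training)
    evaluate(model)
    modes.append(model.training)

print(modes)  # [True, False, True, False]
```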

Tensorflow is as **straightforward** as Pytorch

You just have to remember to call your model with ```training=True|False``` for training and evaluation modes, respectively

In [None]:
    def loss_op(self, x, y, training=False):
        output = self.model(x, training=training)  # <---
        loss = ...
        return loss

    def train_op(self, x, y):
        with tf.GradientTape() as tape:
            loss = self.loss_op(x=x, y=y, training=True)   # <---
        grads = tape.gradient(loss, self.model.trainable_variables)
        return loss, grads  # batch_fit unpacks both values
    
    @tf.function
    def batch_fit(self, x, y):
        loss, grads = self.train_op(x, y)
        self.optimizer.apply_gradients(zip(grads, 
                                           self.model.trainable_variables))
        return loss
    
    @tf.function
    def batch_evaluate(self, x, y):
        return self.loss_op(x=x, y=y, training=False)   # <---
        
    @tf.function
    def batch_predict(self, x):
        return self.model(x, training=False)  # <---

### Mixing operations

Consider the following code snippet

In [None]:
# Numpy version
loss = np.square(y_pred - y_true).sum()

# Torch version
loss = (y_pred - y_true).pow(2).sum()

The numpy code is **always run on the CPU**, while the torch code may also run on the GPU

- **Avoid mixing** numpy and torch operations in the ```forward(...)``` method: numpy operations force GPU-to-CPU transfers that slow down your code and are not tracked by automatic differentiation!

### Detaching

How to **properly** collect model outputs?

In [None]:
for epoch in range(self.epochs):
    epoch_train_iterator = train_data_iterator()
    for step in range(steps):
        batch = next(epoch_train_iterator)
        batch_loss, batch_loss_info = model.batch_fit(*batch)
        batch_loss_info = {f'train_{key}': item.detach().numpy() 
                           for key, item in batch_loss_info.items()}
        batch_loss_info['train_loss'] = batch_loss.detach().numpy()

In [None]:
for epoch in range(self.epochs):
    epoch_train_iterator = train_data_iterator()
    for step in range(steps):
        batch = next(epoch_train_iterator)
        batch_loss, batch_loss_info = model.batch_fit(*batch)
        batch_loss_info = {f'train_{key}': item.detach().numpy()
                           for key, item in batch_loss_info.items()}
        # the difference is here
        batch_loss_info['train_loss'] = batch_loss

#### [TH] Detaching is needed!

- **Removes a tensor from torch tracking** for automatic differentiation
- If you don't do that, every stored tensor keeps its computation graph alive, which **wastes memory and slows down** your program execution!
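
A minimal sketch of what detaching changes, assuming a hypothetical scalar loss built from a single trainable tensor: converting an attached tensor to numpy fails, while the detached one converts fine.

```python
import torch as th

weight = th.ones(3, requires_grad=True)
loss = (weight * 2.0).sum()

try:
    loss.numpy()  # the tensor is still attached to the autograd graph
    raised = False
except RuntimeError:
    raised = True

# detach() drops the autograd history; cpu() is a no-op here
# but is required when the tensor lives on a GPU
value = loss.detach().cpu().numpy()
print(raised, float(value))  # True 6.0
```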

#### [TF] use ```tensor.numpy()```

- If your operation is **outside the tensorflow graph**, you receive a ```tf.EagerTensor``` storing numerical content (based on provided inputs)

### Numerical Stability

Mathematical correctness of your code **doesn't necessarily translate** to correct results

In [None]:
# from https://github.com/vahidk/EffectivePytorch
def unstable_softmax(logits):
    exp = th.exp(logits)
    return exp / th.sum(exp)

# prints [nan, 0.]
print(unstable_softmax(th.tensor([1000., 0.])).numpy())

In [None]:
# from https://github.com/vahidk/EffectivePytorch
def unstable_softmax_cross_entropy(labels, logits):
    # log(softmax(.)) yields log(0) = -inf for very small probabilities
    logits = th.log(th.softmax(logits, dim=0))
    return -th.sum(labels * logits)

labels = th.tensor([0.5, 0.5])
logits = th.tensor([1000., 0.])

# prints inf
ce = unstable_softmax_cross_entropy(labels, logits).numpy()
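
For comparison, here is a minimal sketch of numerically stable counterparts (the `stable_*` function names are illustrative). It assumes the usual two tricks: subtracting the maximum logit before exponentiating, and computing the log-softmax in one step rather than taking the log of a softmax.

```python
import torch as th


def stable_softmax(logits):
    # Subtracting the max leaves the result unchanged but keeps exp() finite
    exp = th.exp(logits - th.max(logits))
    return exp / th.sum(exp)


def stable_softmax_cross_entropy(labels, logits):
    # log_softmax avoids the intermediate log(0) = -inf
    log_probs = th.nn.functional.log_softmax(logits, dim=0)
    return -th.sum(labels * log_probs)


probs = stable_softmax(th.tensor([1000., 0.]))
ce = stable_softmax_cross_entropy(th.tensor([0.5, 0.5]),
                                  th.tensor([1000., 0.]))
print(probs.tolist(), ce.item())  # [1.0, 0.0] 500.0
```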

### Shuffling data

One common error is to **not appropriately shuffle data**

In [None]:
from typing import Iterator

import pandas as pd

def light_iterator(
        df: pd.DataFrame,
) -> Iterator:
    texts, labels = df.text.values, df.label.values
    assert len(texts) == len(labels), \
        'Inconsistent number of texts and labels'

    for (text, label) in zip(texts, labels):
        yield text, label

In [None]:
# Torch
from functools import partial
from torchdata.datapipes.iter import IterableWrapper

data_generator = partial(light_iterator, df=df)
data = IterableWrapper(data_generator(), deepcopy=False)

if shuffle:
    data = data.shuffle(buffer_size=100)    # <---

In [None]:
# Tensorflow
from functools import partial
import tensorflow as tf

data_generator = partial(light_iterator, df=df)
data = tf.data.Dataset.from_generator(generator=data_generator,
                                      output_signature=(
                                          tf.TensorSpec(shape=(),
                                                        dtype=tf.string),
                                          tf.TensorSpec(shape=(),
                                                        dtype=tf.int64)
                                      ))
if shuffle:
    data = data.shuffle(buffer_size=100)     # <---

#### Beware of buffer_size

These data pipeline APIs work by setting up a **buffer from which we sample** examples.

If your buffer is **too small**, you may end up sampling only examples of the same class!

- **Always** pre-shuffle your data (if possible)
- Set a ```buffer_size``` **equal to** your data size (if not too big)
- **Inspect** your data stream for sanity check
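
The effect of a small buffer can be reproduced without any framework. The following is a simplified pure-Python model of a shuffle buffer (`buffered_shuffle` is a hypothetical helper, not the actual Tensorflow or torchdata implementation), applied to a stream that is sorted by class.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified shuffle buffer: emit a random buffered item,
    then refill the freed slot with the next item from the stream."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical stream sorted by class: 100 examples of class 0, then 100 of class 1
labels = [0] * 100 + [1] * 100

small = list(buffered_shuffle(labels, buffer_size=10))
full = list(buffered_shuffle(labels, buffer_size=len(labels)))

# Number of class-1 examples among the first 90 outputs
print(sum(small[:90]))  # 0: the first batches only ever see class 0
print(sum(full[:90]))   # both classes are mixed from the start
```

With ```buffer_size=10```, a class-1 example cannot even enter the buffer before about 90 examples have been emitted, so the first batches are made of a single class.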

### Gradient clipping

In many cases, you may want to clip gradients **to increase model training stability**

<center>
<div>
<img src="../Images/Lecture-7/gradient_clipping.png" width="1300"/>
</div>
</center>


In [None]:
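# Torch
optimizer.zero_grad()
loss.backward()
# clip after backward() and before the optimizer step
# (clip_grad_norm_ is the in-place, non-deprecated variant)
th.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
optimizer.step()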

# Tensorflow
grads = tape.gradient(loss, self.model.trainable_variables)
grads, _ = tf.clip_by_global_norm(grads, 10)
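
To see what norm clipping does to the gradients, here is a minimal Pytorch sketch on a hypothetical two-element parameter whose gradient norm is exactly 10 before clipping.

```python
import torch as th

w = th.nn.Parameter(th.tensor([3.0, 4.0]))
loss = (w ** 2).sum()
loss.backward()  # grad = 2 * w = [6, 8], whose L2 norm is 10

max_norm = 1.0
# clip_grad_norm_ rescales gradients in place and returns the norm before clipping
norm_before = th.nn.utils.clip_grad_norm_([w], max_norm=max_norm)
norm_after = w.grad.norm()

# The direction of the gradient is preserved, only its length changes
print(float(norm_before), round(float(norm_after), 4))  # 10.0 1.0
```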

### tf.reshape/th.view vs tf.transpose/th.permute

In many cases, we **may erroneously** use one operation in place of the other one!

In [None]:
x = tf.reshape(tf.range(6), [2, 3])
# <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
#  array([[0, 1, 2], [3, 4, 5]], dtype=int32)>

tf.reshape(x, [3, 2])
# <tf.Tensor: shape=(3, 2), dtype=int32, numpy=
# array([[0, 1],
#        [2, 3],
#        [4, 5]], dtype=int32)>

tf.transpose(x)
# <tf.Tensor: shape=(3, 2), dtype=int32, numpy=
# array([[0, 3],
#        [1, 4],
#        [2, 5]], dtype=int32)>

#### Beware of reshaping!

If used incorrectly, it may lead to **leakage between** batch samples! 
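
The same pitfall in Pytorch, as a minimal sketch assuming a hypothetical batch of 2 samples with 3 time steps each that we want to turn into a time-major layout:

```python
import torch as th

# sample 0 -> [0, 1, 2], sample 1 -> [3, 4, 5]
x = th.arange(6).view(2, 3)

viewed = x.view(3, 2)       # reinterprets memory: columns now mix both samples
permuted = x.permute(1, 0)  # swaps axes: column i is still sample i

print(viewed.tolist())    # [[0, 1], [2, 3], [4, 5]]
print(permuted.tolist())  # [[0, 3], [1, 4], [2, 5]]
```

Both results have shape ```[3, 2]```, so no shape check will catch the mistake: only the permuted tensor keeps each sample in its own column.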

### Squeezing

Squeezing is the operation of removing dimensions of size 1 from a tensor

In [None]:
# Shape -> [2, 3, 1, 1]
x = tf.reshape(tf.range(6), [2, 3, 1, 1])

# Shape -> [2, 3]
x = tf.squeeze(x)

#### Beware of squeezing with batched tensors

One time I lost a **couple of hours** with a strange error...

It turned out that the batching with a specific ```batch_size``` led to a batch **with a single sample**

$\rightarrow$ squeezing without specifying any dimension inherently converted my input 3D tensors to 2D!
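
A minimal Pytorch sketch of this failure mode, assuming a hypothetical tensor of shape ```[batch, features, 1]``` (`squeezed_shape` is an illustrative helper):

```python
import torch as th


def squeezed_shape(batch_size, squeeze_dim=None):
    # Hypothetical tensor of shape [batch, features, 1]
    x = th.zeros(batch_size, 5, 1)
    out = x.squeeze() if squeeze_dim is None else x.squeeze(squeeze_dim)
    return tuple(out.shape)


print(squeezed_shape(4))                  # (4, 5)
print(squeezed_shape(1))                  # (5,)  <- the batch dimension is gone
print(squeezed_shape(1, squeeze_dim=-1))  # (1, 5)
```

Always passing the dimension to squeeze keeps the output rank independent of the batch size.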

### Working with GPU devices

When working with GPUs we have to carefully inspect **how** these devices are used

$\rightarrow$ this is particularly **annoying** in Tensorflow!

- By default, Tensorflow reserves (nearly) **all available memory** from the selected GPU
- Moreover, Tensorflow also reserves some memory in **all discovered GPUs**, even if you are in a single-GPU setting (*efficiency reasons to reduce memory fragmentation*)

In [None]:
# Tensorflow
gpus = tf.config.list_physical_devices('GPU')

# Lets Tensorflow allocate GPU memory on demand instead of reserving it all upfront
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

# Limiting visibility (only first gpu)
gpu_start_index = 0
gpu_end_index = 1
tf.config.set_visible_devices(gpus[gpu_start_index:gpu_end_index],
                              "GPU")

# Setting max GPU memory (e.g., 3GB on first GPU)
# NOTE: meant as an alternative to memory growth on the same GPU
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=3072)])

In [None]:
# Torch
# Setting max GPU memory (e.g., 30% of total GPU memory)
th.cuda.set_per_process_memory_fraction(0.3, device='cuda:0')

### Freeing GPU memory

In many cases, you may find yourself in a scenario where **multiple models** have to be trained (e.g., cross-validation).

In these cases, you may need to **efficiently release GPU usage** to avoid memory problems

In [None]:
# Tensorflow
from tensorflow.keras import backend as K  # public API instead of tensorflow.python
import gc

del model
gc.collect()
K.clear_session()

In [None]:
# Torch
import gc
import torch as th

del model
gc.collect()
th.cuda.empty_cache()

#### Manually freeing GPU memory may not be needed

Generally speaking, Tensorflow/Torch **might efficiently re-use** the previously allocated memory

$\rightarrow$ manually freeing memory might lead to a **minor slowdown** in code execution

#### Tensorflow has a problematic GPU memory management

According to this [thread](https://github.com/tensorflow/tensorflow/issues/36465), commands like ```K.clear_session()``` **may not really work**.

Instead, the recommended way is to run your train/evaluation routine in a **separate process**

In [None]:
import multiprocessing

process_eval = multiprocessing.Process(target=evaluate, args=(...))
process_eval.start()
process_eval.join()

## Bonus

- Mixed-precision
- Gradient accumulation

### Mixed-precision

In many cases, you may want to **speed up your training** by relying on mixed-precision operations

In [None]:
# Autocast
with th.cuda.amp.autocast():
    outputs = model(inputs)
    loss = loss_op(outputs, targets)
    
# GradScaler
scaler = th.cuda.amp.GradScaler()

loss = ...
optimizer = ...

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

#### th.cuda.amp.autocast()

- Automatically **casts down** heavy operations (e.g., convolution, matrix multiplication) to 16-bit
- Allows mixed-precision computations

#### th.cuda.amp.GradScaler()

- Allows working with 16-bit gradient values while **avoiding underflows/overflows**
- Scales up the loss to **avoid gradient underflows**
- Scales gradient values back down before the gradient update **to ensure correct** model weight updates

In Tensorflow, mixed-precision is pretty easy to set up as well

In [None]:
# Non-experimental API (available since TF 2.4)
tf.keras.mixed_precision.set_global_policy('mixed_float16')

### Gradient accumulation

Gradient accumulation is the technique of accumulating gradients over **multiple batches** and performing a **single cumulative gradient update**

$\rightarrow$ Emulates training with **bigger batch sizes** than would fit in memory, allowing for more robust optimization

#### Tensorflow

A quick way for doing gradient accumulation is to leverage the [```gradient-accumulator```](https://pypi.org/project/gradient-accumulator/) library

$\rightarrow$ since there is **no straightforward way** of doing gradient accumulation in Tensorflow

In [None]:
from gradient_accumulator import GradientAccumulateModel
from tensorflow.keras.models import Model

model = Model(...)
model = GradientAccumulateModel(accum_steps=4, 
                                inputs=model.input,
                                outputs=model.output)

#### Pytorch

In [None]:
accumulation_size = 4

for step in range(steps):
    batch_x, batch_y = next(train_iterator)
    preds = model.batch_predict(x=batch_x)
    loss = model.loss_op(y_pred=preds, y_true=batch_y)
    
    # Normalize loss to account for batch accumulation
    loss = loss / accumulation_size
    
    loss.backward()
    
    if (step + 1) % accumulation_size == 0 or step == steps - 1:
        optimizer.step()
        optimizer.zero_grad()
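
The loss normalization can be sanity-checked on a toy problem. This is a minimal sketch, assuming a hypothetical linear model, a mean-reduced MSE loss and equally sized micro-batches: the accumulated gradient should match the gradient of one large batch.

```python
import torch as th

th.manual_seed(0)
model = th.nn.Linear(3, 1)
x, y = th.randn(8, 3), th.randn(8, 1)
loss_op = th.nn.MSELoss()  # mean reduction

# Reference: one backward pass over the full batch of 8 samples
model.zero_grad()
loss_op(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 samples, each loss divided by 4
accumulation_size = 4
model.zero_grad()
for micro_x, micro_y in zip(x.chunk(accumulation_size),
                            y.chunk(accumulation_size)):
    loss = loss_op(model(micro_x), micro_y) / accumulation_size
    loss.backward()  # gradients add up in .grad until zero_grad() is called
accumulated_grad = model.weight.grad.clone()

same = th.allclose(full_grad, accumulated_grad, atol=1e-5)
print(same)  # True
```

Note that the equivalence relies on micro-batches of equal size; layers that depend on batch statistics (e.g., batch normalization) would still behave differently.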

# Concluding Remarks

- Machine learning always starts from data! $\rightarrow$ deep learning is no exception!

- The general recommendation is always to start small and progressively increase complexity (*I understand you want to try out your fancy architecture you saw in your dreams...*)

- Tensorflow and Pytorch mainly **share the same type of common mistakes and best practices**

# 次回 (Jikai!, "Next time")

Actually, **there's nothing left** to show you... (or, better, there is still **way too much** stuff to talk about!)

- In the first edition, 10 hours were not enough. This year's edition is no exception!

- Anyway, I thought it could have been a **good opportunity** to show you something I've been working on.

# Any questions?

<center>
<div>
<img src="../Images/Lecture-3/jojo-arrivederci.gif" width="1200" alt='JOJO_arrivederci'/>
</div>
</center>