## It's coding time again!

<center>
<div>
<img src="Images/Lecture-3/programming_skills.png" width="1800" alt='programming_skills'/>
</div>
</center>

<br/>
<br/>
<br/>

## Let's recall your interests

61.8% of you are interested in programming mistakes!

<center>
<div>
<img src="Images/Lecture-1/google_form_true.png" width="2200" alt='google_form_true'/>
</div>
</center>


Let's focus on this topic!

## Why do we need this lecture?

Whether you like it or not, your experimental setting will likely require you to write some code.

Coding translates to: 

1. Transparency (*don't you dare do some cheap tricks!*)
2. Correctness (*your code should reflect your paper statements*) 
3. **Readability** (*please, don't make this a nightmare*)
4. Efficiency (*time is money*)
5. **Maintainability** (*I'm sure you'll re-use this code*)

In the past lecture, we showed some tools to make our code more efficient in Tensorflow and Pytorch (point 4).

We now provide some tips & tricks concerning readability and maintainability (points 3 and 5).

## What are we going to cover?

- Debugging
- General coding best practices
- Tensorflow and PyTorch best practices
- Misc: code documentation, README, controlled environments

*Your five-minutes-in-the-future self and your friends will appreciate it!*

# Debugging

<center>
<div>
<img src="Images/Lecture-4/programming_meme.png" width="600"/>
</div>
</center>


### What are we going to see

- Using a debugger
- Type hints
- Typechecking
- Assertions
- Logging

### The debugging slice

Whether you've already realized it or you still have to, debugging usually takes around 90% of your work time

- 9% the idea
- 1% code writing
- 90% debugging

It is tiresome, boring, stressful, annoying $\rightarrow$ we all know that!

#### My point

- We don't really need to be good programmers to write correct code (*there are many high-level libraries!*)
- This lecture is not a computer science 101 course about programming (*I'm far from being competent on this matter*)
- What I want to say is that you should learn how to think to tackle debugging

#### Debugging is like an investigation game where you have to find the culprit, and usually the culprit is you!

#### What are our weapons?

- [Dynamic] Debugger
- [Static] Type hints
- [Dynamic] Typechecking
- [Dynamic] Assertions
- [Dynamic] Logging

All these methods, combined, allow us to better inspect our code to find unwanted behaviours/features (*bugs*)
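
As a rough illustration of how three of these tools fit together (the debugger is interactive, so it is left out), here is a minimal sketch. The function `normalize` and its checks are hypothetical and are not taken from the lecture scripts.

```python
import logging
from typing import List

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


def normalize(values: List[float]) -> List[float]:  # [static] type hints
    # [dynamic] type check at runtime
    if not isinstance(values, list):
        raise TypeError(f"Expected a list, got {type(values).__name__}")
    # [dynamic] assertion on an assumption the code relies on
    assert len(values) > 0, "values must not be empty"

    total = sum(values)
    # [dynamic] logging instead of print(...)
    logger.debug("Normalizing %d values (sum=%.3f)", len(values), total)
    return [value / total for value in values]


normalized = normalize([1.0, 1.0, 2.0])
```

Each check fails loudly and close to the source of the problem, which is usually cheaper than tracing a wrong number back through a whole training run.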

### Using a Debugger

Long story short, we have a powerful tool to inspect our code

- Stop using ```print(...)``` for debugging
- Just use a debugger
- Just use a debugger
- Just use a debugger
- Got it?

There are many powerful IDEs that ship with a debugger: PyCharm, Spyder, Visual Studio Code

#### Unless you are a skilled programmer, just avoid programming with 
   - text editors like vim, sublimetext, nano, notepad++, etc.
   - notebooks (unless you need to run real experiments)

*personal opinion*

### Type hints [[Read1](https://peps.python.org/pep-0483/), [Read2](https://bernat.tech/posts/the-state-of-type-hints-in-python/)]

If debugging is our dynamic way of inspecting code, type hints are our way to statically analyze it

In [None]:
from typing import Tuple, List

# With type hints (torch_datapipe_record_pipeline.py, Lines 46-53)
def parse_inputs(input_data: Tuple[str, int]) -> Tuple[List[int], int]:
    text, label = input_data
    text = preprocess_text(text=text)
    tokens = vocab(tokenizer(text))
    return tokens, label


# Without type hints
def parse_inputs(input_data):
    text, label = input_data
    text = preprocess_text(text=text)
    tokens = vocab(tokenizer(text))
    return tokens, label

#### Advantages

- Integrated documentation with code $\rightarrow$ much more readable than docstrings
- Accurate code re-factoring for IDEs
- Allows code auto-completion for IDEs
- Linters (included in IDEs) can tell wrong function calls based on type hints $\rightarrow$ warnings!!
- We can define compound types like ```List[int]```

#### Disadvantages

- We need at least ```Python 3.6``` (*reasonable given any deep learning framework requirements*)
- May have conflicts with docstrings depending on the tool used
- Minor added computation overhead
- Forces you to import all type dependencies, even though they are not used at runtime at all
- Compound types may require some additional operations by the interpreter

The last two points are solved via postponed evaluation of annotations (*requires ```Python 3.7```*)

In [2]:
# w/o postponed evaluation
class A:
    def f(self) -> A: # NameError: name 'A' is not defined
        pass
    
# w/ postponed evaluation (the __future__ import must be the first statement of the module)
from __future__ import annotations

class A:
    def f(self) -> A:
        pass
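
When the `__future__` import is not an option (for instance on interpreters older than 3.7), a quoted forward reference avoids the same `NameError`. A minimal sketch:

```python
class A:
    def f(self) -> "A":  # the quoted name is not evaluated at definition time
        return self


# The annotation is stored as a plain string
return_annotation = A.f.__annotations__["return"]
```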

#### An example

We can use ```mypy```, Python's reference static type checker, to check our type-hinted code


#### Script

```mypy_example.py```
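
The script itself is not reproduced here. As a stand-in, a hypothetical file (call it `mypy_sketch.py`, which is not the lecture's `mypy_example.py`) might look like this:

```python
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


ok = mean([1.0, 2.0, 3.0])

# A static checker such as mypy is expected to report the call below as an
# incompatible argument type (str where List[float] is required), without
# running the code. Uncommented, it would fail only at runtime with a TypeError.
# bad = mean("123")
```

Running `mypy mypy_sketch.py` with the last line uncommented should flag the call before the script is ever executed.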

#### Types of type hints

- Nominal types: ```int```, ```float```, ```bool```, etc. (*all built-in types*)
- Compound types: ```List[int]```
    - We can also define type aliases for readability: ```CustomType = Optional[Union[List[int], Dict[str, str]]]```
    
- Compositional types: ```Union[...]``` (*one of*), ```Optional[...]``` (*can be None*); ```Intersection[...]``` (*each one*) is discussed in PEP 483 but is not available in the ```typing``` module
- Generic types:

In [None]:
from typing import TypeVar, Generic, Iterable, Iterator

T = TypeVar('T')   # the string must match the variable name

class CustomClass(Generic[T]):
    def __init__(self, value: T) -> None:
        self.value: T = value


# Module-level function: CustomClass is fully defined at this point
def get_iterator(values: Iterable[CustomClass[int]]) -> Iterator[CustomClass[int]]:
    for value in values:
        yield value

####  Proper function overloading

Suppose you have the following function

In [None]:
from typing import List, Union

def do_stuff(x: Union[int, List[int]]) -> Union[int, List[int]]:
    ...

The type checker assumes that every input/output combination is possible when calling ```do_stuff```:
   - ```input x is int returns int```
   - ```input x is int returns List[int]```
   - ```input x is List[int] returns int```
   - ```input x is List[int] returns List[int]```
   
#### How to avoid this?

In [None]:
from typing import List, Union, overload

@overload
def do_stuff(x: int) -> int:
    ...
    
@overload
def do_stuff(x: List[int]) -> List[int]:
    ...


def do_stuff(x: Union[int, List[int]]) -> Union[int, List[int]]:
    ...
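
To see the overloads in action, here is a self-contained sketch with a body filled in. The doubling behaviour is a made-up placeholder, since the original leaves the body elided.

```python
from typing import List, Union, overload


@overload
def do_stuff(x: int) -> int: ...
@overload
def do_stuff(x: List[int]) -> List[int]: ...


def do_stuff(x: Union[int, List[int]]) -> Union[int, List[int]]:
    # Hypothetical behaviour: double the input, keeping its "shape"
    if isinstance(x, int):
        return x * 2
    return [item * 2 for item in x]


single = do_stuff(3)          # a type checker infers int
many = do_stuff([1, 2, 3])    # a type checker infers List[int]
```

Only the undecorated definition exists at runtime; the `@overload` stubs are there purely for the type checker.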

#### What type hints don't do

- Runtime type checking $\rightarrow$ we need some libraries that leverage type hints
- Performance tuning $\rightarrow$ type hints are treated just like comments
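
A quick way to convince yourself that the interpreter does not enforce annotations (the function `add` is just an illustrative placeholder):

```python
def add(a: int, b: int) -> int:
    return a + b


# The interpreter ignores the annotations: this call runs without any error
result = add("type", "hints")

# The hints are only stored as metadata on the function object
hints = add.__annotations__
```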

#### Only type hinted code is type-checked!

#### Bonus: type hints and Sphinx for merging documentation

We can remove all typing information from docstrings and infer them from type hints

Sphinx has the plugin [```agronholm/sphinx-autodoc-typehints```](https://github.com/agronholm/sphinx-autodoc-typehints)

Then add the following extensions to Sphinx's ```conf.py```

```
extensions = ["sphinx.ext.autodoc", "sphinx_autodoc_typehints"]
```

### Typechecking

### Assertions

### Logging

# Coding Best Practices

### What are we going to see

- Naming choice
- Comments
- Nesting
- Inheritance
- Abstraction
- Organization
- Profiling
- Code optimization
- Testing

# Tensorflow and PyTorch Best Practices

### Modeling ([Karpathy's blog](https://karpathy.github.io/2019/04/25/recipe/))
- Inspect data
- Start simple
- Overfit
- Apply some regularizations
- Hyper-parameters tuning

### Scripting
- File Organization
- Sequential layers
- Calling layers
- Training and evaluation modes
- Mixing operations
- Detaching
- Numerical stability
- Shuffling data
- Gradient clipping
- tf.reshape/th.view vs tf.transpose/th.permute
- Squeezing
- Working with GPU devices
- Freeing GPU memory

## Modeling

*Because, eventually, we all prefer a to-do list...*

### Intro

Neural networks are not a plug-and-play module that is expected to work effortlessly

Instead, neural networks often fail silently (*no exceptions!*) and can still manage to reach some satisfying result (*unexpected*)
   - Leakage
   - Data errors
   - Wrong initialization, regularization, optimization, etc...
    
#### What you should do (*in a nutshell*)

- Start slow $\rightarrow$ don't be eager to test out your super fancy model on your ultra fancy data
- Check your data attentively
- Start simple $\rightarrow$ progressively introduce complexity
- Hammer your model until it is sufficiently powerful and regularized to work well

### Become one with the data

Intuitively, we never start from modeling, but rather we look at data

- Understand relevant features for addressing the task
- Spot outliers (*may be symptomatic of data collection errors*)
- Spot errors (e.g., duplicates, wrong labeling)
- Check data distributions
- Identify biases
- Identify possible correct pre-processing steps
- Helps in understanding model post-training evaluation (*error analysis*)

### Start simple

We are not ready yet to plug-in our fancy model

- Write a training/evaluation skeleton
- Test simple baselines (*how far can they go?*) $\rightarrow$ helps rule out possible scripting errors
- Avoid any unnecessary complexity $\rightarrow$ pick the simplest setting possible
- Overfit one single batch $\rightarrow$ spotting errors, evaluate model capacity, evaluate data
- Check learning curves
- Inspect model prediction dynamics $\rightarrow$ gives good intuition of model training

### Overfit

Pick a large enough model that is able to overfit $\rightarrow$ we understand that the task is *solvable* on training data

- Pick existing models (*even their simplest version*) rather than making custom ones
- Progressively introduce complexity $\rightarrow$ multiple inputs, larger inputs, etc.
- Pay attention to employed optimizers $\rightarrow$ weight decay, learning rate decay

### Regularize

Once we have a big enough model, we have to regularize it to allow better generalization capabilities

- Add more data (*if you can!*) $\rightarrow$ data-augmentation, real data, synthetic data
- Check spurious inputs
- Decrease model size
- Decrease batch size (*if you are using batch normalization*)
- Dropout
- Weight decay
- Early stopping

### Tuning

To find a good regularized model, we may consider a hyper-parameter calibration routine

- Random search (*it usually works quite well to explore the hyper-parameter space*)
- Bayesian optimization (e.g., Optuna)
- Use your brain $\rightarrow$ in many cases, you may need to think about the valid search space

## Scripting

### File Organization

Split your model into individual layers and losses to

- Enhance re-usability (*easier to spot errors*)
- Enhance readability (*top-down view of a model*)

The same applies to nested models, layers and losses

In [None]:
import torch as th

# losses.py
class CustomLoss(th.nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
        ...
        
    def forward(self, inputs):
        ...
        
# layers.py
class CustomLayer(th.nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()
        ...
        
    def forward(self, inputs):
        ...
        
# models.py
class CustomModel(th.nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()  # required before assigning sub-modules
        self.layer = CustomLayer(...)
        self.loss_op = CustomLoss(...)
        
    def forward(self, inputs):
        pass

In Tensorflow this is particularly recommended when considering ```tf.function```

#### tf.function covers function nesting

If a function invokes other functions, you just need to decorate the top-level function with ```tf.function```

In [None]:
class MyModel(tf.keras.Model):
    
    def loss_op(self, x, y, training=False):
        output = self.model(x, training=training)
        loss = ...
        return loss

    def train_op(self, x, y):
        with tf.GradientTape() as tape:
            loss = self.loss_op(x=x, y=y, training=True)
        grads = tape.gradient(loss, self.model.trainable_variables)
        return loss, grads
    
    @tf.function
    def batch_fit(self, x, y):
        loss, grads = self.train_op(x, y)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
        return loss
    
    @tf.function
    def batch_evaluate(self, x, y):
        return self.loss_op(x=x, y=y, training=False)
        
    @tf.function
    def batch_predict(self, x):
        return self.model(x, training=False)

### Sequential Layers

In many cases, we may have to define a sequential network

#### Which one do you use?

<table><tr>
<td> <img src="Images/Lecture-4/th-examples-1-1.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-1-2.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-1-3.png" width="1100"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"> <strong>(A)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(B)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(C)</strong> </td>
</tr>
</table>

#### (A) is the best choice

- [**TH**] Uses ```nn.Sequential(...)``` to define a sequential network $\rightarrow$ higher efficiency, readability
- [**TF**] Likewise, use ```tf.keras.Sequential``` in tensorflow
    
#### (B) may give some problems with list wrapping

- [**TH**] Consider using ```th.nn.ModuleList(...)``` rather than a list
- [**TF**] It is fine to wrap multiple layers in a list, but the recommended way is to use ```tf.keras.Sequential```
    
#### (C) is terrible!

- Generates layers at each forward pass $\rightarrow$ losing track of model weights to update
- You need to define layers in the ```__init__(...)``` method so that model weights are kept throughout the lifetime of the model
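
The slide snippets are images, so here is an independent sketch of the same three ideas in PyTorch. The class names and layer sizes are made up for illustration. Counting registered parameters shows why a plain Python list is risky.

```python
import torch as th


class SequentialNet(th.nn.Module):
    def __init__(self, in_dim: int = 8, hidden_dim: int = 16, out_dim: int = 2):
        super().__init__()
        # Layers are created once and registered as sub-modules
        self.net = th.nn.Sequential(
            th.nn.Linear(in_dim, hidden_dim),
            th.nn.ReLU(),
            th.nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, inputs: th.Tensor) -> th.Tensor:
        return self.net(inputs)


class PlainListNet(th.nn.Module):
    def __init__(self, in_dim: int = 8, out_dim: int = 2):
        super().__init__()
        # A plain Python list hides the layers from PyTorch
        self.layers = [th.nn.Linear(in_dim, out_dim)]


class ModuleListNet(th.nn.Module):
    def __init__(self, in_dim: int = 8, out_dim: int = 2):
        super().__init__()
        # th.nn.ModuleList registers each layer properly
        self.layers = th.nn.ModuleList([th.nn.Linear(in_dim, out_dim)])


n_sequential = len(list(SequentialNet().parameters()))    # 2 weights + 2 biases
n_plain_list = len(list(PlainListNet().parameters()))     # nothing registered
n_module_list = len(list(ModuleListNet().parameters()))   # 1 weight + 1 bias
```

An optimizer built from `PlainListNet().parameters()` would receive no parameters at all, so those weights would never be updated.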

### Calling layers

Both Tensorflow and Pytorch implement layers as callable objects

Always invoke your layer/model as ```layer(...)``` and ```model(...)```, respectively.

#### Never invoke ```call(...)``` or ```forward(...)``` explicitly!
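
One concrete reason on the PyTorch side: calling the module goes through `__call__`, which wraps `forward(...)` with extra machinery such as hooks. A small sketch (the layer sizes are arbitrary):

```python
import torch as th

layer = th.nn.Linear(4, 2)
calls = []

# A forward hook runs every time the layer is invoked through __call__
handle = layer.register_forward_hook(
    lambda module, inputs, output: calls.append("hook")
)

x = th.zeros(1, 4)
layer(x)            # goes through __call__: the hook fires
layer.forward(x)    # bypasses __call__: the hook is skipped
handle.remove()
```

After both calls, `calls` holds a single entry, so anything relying on hooks silently misses the explicit `forward(...)` call.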

### Training and Evaluation modes

Torch has two model modes: ```model.train()``` and ```model.eval()```

#### Which one do you use?

<table><tr>
<td> <img src="Images/Lecture-4/th-examples-3-1.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-3-2.png" width="1100"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"> <strong>(A)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(B)</strong> </td>
</tr>
</table>

#### Both are correct!

#### Model.eval()

- Just changes model execution so that layers like Dropout, BatchNorm can execute correctly

#### torch.no_grad()

- Disables gradient tracking, saving memory and time


In the common case where you don't compute any gradient during evaluation, you can use both to gain some speed-up and use less memory.
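
A minimal sketch of using both together, with a toy model chosen only so that Dropout makes the difference visible:

```python
import torch as th

model = th.nn.Sequential(th.nn.Linear(4, 4), th.nn.Dropout(p=0.5))
x = th.ones(2, 4)

model.eval()                 # Dropout becomes the identity
with th.no_grad():           # no computation graph is recorded
    first = model(x)
    second = model(x)

model.train()                # switch back before resuming training
```

In eval mode the two forward passes give identical outputs, and under `no_grad()` the outputs do not require gradients.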

#### Beware of incorrect behaviours!

In [None]:
def train(model, optimizer, epochs, train_steps, train_data_iterator, val_steps, val_data_iterator):
    model.train()  # called only once: evaluate() switches to eval mode and nothing switches back for the next epoch!
    for epoch in range(epochs):
        epoch_train_iterator = train_data_iterator()
        for step in range(train_steps):
            batch = next(epoch_train_iterator)
            batch_loss = model.batch_fit(*batch)
            
        # Get loss on validation set
        val_loss, val_metrics = evaluate(model=model, steps=val_steps, data_iterator=val_data_iterator)
                
                
def evaluate(model, steps, data_iterator):
    model.eval()
    ...

Tensorflow is as straightforward as Pytorch

You just have to remember to call your model with ```training=True|False``` for training and evaluation modes, respectively

In [None]:
    def loss_op(self, x, y, training=False):
        output = self.model(x, training=training)  # <---
        loss = ...
        return loss

    def train_op(self, x, y):
        with tf.GradientTape() as tape:
            loss = self.loss_op(x=x, y=y, training=True)   # <---
        grads = tape.gradient(loss, self.model.trainable_variables)
        return loss, grads
    
    @tf.function
    def batch_fit(self, x, y):
        loss, grads = self.train_op(x, y)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
        return loss
    
    @tf.function
    def batch_evaluate(self, x, y):
        return self.loss_op(x=x, y=y, training=False)   # <---
        
    @tf.function
    def batch_predict(self, x):
        return self.model(x, training=False)  # <---

### Mixing operations

Consider the following code snippet

In [None]:
# Numpy version
loss = np.square(y_pred - y_true).sum()

# Torch version
loss = (y_pred - y_true).pow(2).sum()

The numpy code is always run on the CPU, while the torch code may also run on the GPU

- Avoid mixing numpy and torch operations in ```forward(...)``` method since numpy operations slow down your code execution!

### Detaching

How to properly collect model outputs?

#### Which one do you use?

<table><tr>
<td> <img src="Images/Lecture-4/th-examples-2-1.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-2-2.png" width="1100"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"> <strong>(A)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(B)</strong> </td>
</tr>
</table>

#### [TH] Detaching is needed!

- Removes the tensor from torch's tracking for automatic differentiation
- If you don't do that, the unnecessary recording of these tensors slows down your program execution!
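
Since the slide snippets are images, here is an independent sketch of collecting outputs in PyTorch. The model and batch sizes are placeholders.

```python
import torch as th

model = th.nn.Linear(4, 1)
collected = []

for _ in range(3):
    batch = th.randn(8, 4)
    output = model(batch)
    # detach() drops the autograd history; cpu() moves values off the GPU (a no-op here)
    collected.append(output.detach().cpu())

predictions = th.cat(collected)
```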

#### [TF] use ```tensor.numpy()```

- If your operation is outside the tensorflow graph, you receive a ```tf.EagerTensor``` storing numerical content (based on provided inputs)

### Numerical Stability

Mathematical correctness of your code doesn't necessarily translate to correct results

#### Some examples

<table><tr>
<td> <img src="Images/Lecture-4/th-examples-4-1.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-4-2.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-4-3.png" width="1100"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"> <strong>(A)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(B)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(C)</strong> </td>
</tr>
</table>

#### Stable versions

<table><tr>
<td> <img src="Images/Lecture-4/th-examples-5-1.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-5-2.png" width="1100"/> </td>
<td> <img src="Images/Lecture-4/th-examples-5-3.png" width="1100"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: middle;"> <strong>(A)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(B)</strong> </td>
<td style="text-align: center; vertical-align: middle;"> <strong>(C)</strong> </td>
</tr>
</table>
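
The slide examples are images, so the following is an independent illustration rather than a transcription of them. It uses the classic log-sum-exp trick in plain Python to show a mathematically correct formula failing numerically.

```python
import math
from typing import List


def naive_logsumexp(xs: List[float]) -> float:
    return math.log(sum(math.exp(x) for x in xs))


def stable_logsumexp(xs: List[float]) -> float:
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


scores = [1000.0, 1000.0]

try:
    naive_logsumexp(scores)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable = stable_logsumexp(scores)  # mathematically 1000 + log(2)
```

Both functions compute the same quantity on paper, but only the second one survives large inputs.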

### Shuffling data

One common error is to not appropriately shuffle data

In [None]:
# Torch (torch_datapipe_example.py, Lines 92-96)
data_generator = partial(self.light_iterator, df=df)
data = IterableWrapper(data_generator(), deepcopy=False)

if shuffle:
    data = data.shuffle(buffer_size=100)    # <---
    
#------------------------------------------------------------------------------------------
    
# Tensorflow (tf_data_pipeline_gen.py, Lines 65-72)
data_generator = partial(self.light_iterator, df=df)
data = tf.data.Dataset.from_generator(generator=data_generator,
                                      output_signature=(
                                          tf.TensorSpec(shape=(), dtype=tf.string),
                                          tf.TensorSpec(shape=(), dtype=tf.int64)
                                      ))
if shuffle:
    data = data.shuffle(buffer_size=100)     # <---

#### Beware of buffer_size

These data pipeline APIs work by setting up a buffer from which random examples are sampled.

If your buffer is too small, you may end up sampling only examples of the same class (e.g., when your data is sorted by label)!

- Always pre-shuffle your data (if possible)
- Set a buffer_size equal to your data size (if not too big)
- Inspect your data stream as a sanity check
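
To make the buffer effect tangible, here is a simplified pure-Python model of buffer-based shuffling. It is a sketch of the general idea, not the actual implementation of either library, and `buffered_shuffle` is a hypothetical helper.

```python
import random
from typing import Iterable, Iterator, List


def buffered_shuffle(stream: Iterable[int], buffer_size: int, seed: int = 0) -> Iterator[int]:
    """Simplified buffer-based shuffle: fill a buffer, then emit random elements from it."""
    rng = random.Random(seed)
    buffer: List[int] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain what is left once the stream is exhausted
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


# Labels sorted by class: 500 examples of class 0 followed by 500 of class 1
labels = [0] * 500 + [1] * 500

small = list(buffered_shuffle(labels, buffer_size=10))
large = list(buffered_shuffle(labels, buffer_size=len(labels)))

# Fraction of class-1 labels within the first 100 emitted examples
small_ratio = sum(small[:100]) / 100   # 0.0: the small buffer has only seen class 0
large_ratio = sum(large[:100]) / 100
```

With the small buffer, the first batches contain a single class, while the full-size buffer mixes both classes from the start.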

### Gradient clipping

In many cases, you may want to clip gradients to increase model training stability

<center>
<div>
<img src="Images/Lecture-4/gradient_clipping.png" width="1000"/>
</div>
</center>


In [None]:
# Torch
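optimizer.zero_grad()
loss.backward()
# clip_grad_norm_ (trailing underscore) clips in place; call it before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
optimizer.step()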

# Tensorflow
grads = tape.gradient(loss, self.model.trainable_variables)
grads, _ = tf.clip_by_global_norm(grads, 10)

### tf.reshape/th.view vs tf.transpose/th.permute

In many cases, we may erroneously use one operation in place of the other one!

In [None]:
x = tf.reshape(tf.range(6), [2, 3])
# <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
#  array([[0, 1, 2], [3, 4, 5]], dtype=int32)>

tf.reshape(x, [3, 2])
# <tf.Tensor: shape=(3, 2), dtype=int32, numpy=
# array([[0, 1],
#        [2, 3],
#        [4, 5]], dtype=int32)>

tf.transpose(x)
# <tf.Tensor: shape=(3, 2), dtype=int32, numpy=
# array([[0, 3],
#        [1, 4],
#        [2, 5]], dtype=int32)>

#### Beware of reshaping!

If used incorrectly, it may lead to leakage between batch samples! 
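
The same pitfall in PyTorch, on a toy tensor whose values make the leakage easy to spot:

```python
import torch as th

# 2 samples, 3 time steps each: sample 0 holds 0..2, sample 1 holds 10..12
x = th.tensor([[0, 1, 2],
               [10, 11, 12]])

swapped = x.permute(1, 0)   # [time, batch]: each column is still one sample
mixed = x.view(3, 2)        # same shape, but rows now mix the two samples
```

Both results have shape `[3, 2]`, so no shape check will catch the mistake; only the contents differ.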

### Squeezing

Squeezing is the operation of removing size-1 dimensions from a tensor

In [None]:
# Shape -> [2, 3, 1, 1]
x = tf.reshape(tf.range(6), [2, 3, 1, 1])

# Shape -> [2, 3]
x = tf.squeeze(x)

#### Beware of squeezing with batched tensors

One time I lost a couple of hours with a strange error...

It turned out that the batching with a specific ```batch_size``` led to a batch with a single sample

$\rightarrow$ squeezing without specifying any dimension silently converted my input 3D tensors to 2D!
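
A sketch of that failure mode in PyTorch. The shapes are invented for illustration and are not the ones from the anecdote.

```python
import torch as th

# A batch that happens to contain a single sample: [batch=1, channel=1, seq_len=5, features=8]
batch = th.zeros(1, 1, 5, 8)

squeezed_all = batch.squeeze()     # drops the batch dimension as well -> [5, 8]
squeezed_dim = batch.squeeze(1)    # drops only the intended dimension -> [1, 5, 8]
```

Passing the dimension explicitly keeps the result independent of the batch size.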

### Working with GPU devices

When working with GPUs we have to carefully inspect how these devices are used

$\rightarrow$ this is particularly annoying with Tensorflow!

- Tensorflow automatically reserves all available memory from the selected GPU
- Moreover, Tensorflow also reserves some memory in all discovered GPUs, even if you are in a single-GPU setting (*efficiency reasons to reduce memory fragmentation*)

In [None]:
# Tensorflow
gpus = tf.config.list_physical_devices('GPU')

# Enforces Tensorflow to just use the necessary amount of GPU memory
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

# Limiting visibility (only first gpu)
gpu_start_index = 0
gpu_end_index = 1
tf.config.set_visible_devices(gpus[gpu_start_index:gpu_end_index], "GPU")

# Setting max GPU memory (e.g., 3GB on first GPU)
tf.config.experimental.set_virtual_device_configuration(gpus[0], 
                                                        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072)])


# Torch

# Setting max GPU memory (e.g., 30% of total GPU memory)
torch.cuda.set_per_process_memory_fraction(0.3, device='cuda:0')

### Freeing GPU memory

In many cases, you may find yourself in a scenario where multiple models have to be trained (e.g., cross-validation).

In these cases, you may need to efficiently allocate/release GPU usage to avoid memory problems

In [None]:
# Tensorflow
from tensorflow.keras import backend as K  # public API path
import gc

del model
gc.collect()
K.clear_session()

# Torch
del model
gc.collect()
torch.cuda.empty_cache()

#### Manually freeing GPU memory may not be needed

Generally speaking, Tensorflow/Torch might efficiently re-use the previously allocated memory

$\rightarrow$ manually freeing memory might lead to minor code execution speed reductions

#### Tensorflow has problematic GPU memory management

According to this [thread](https://github.com/tensorflow/tensorflow/issues/36465), commands like ```K.clear_session()``` may not really work.

Instead, the recommended way is to run your train/evaluation routine in a separate process

In [None]:
import multiprocessing

process_eval = multiprocessing.Process(target=evaluate, args=(...))
process_eval.start()
process_eval.join()

## Bonus

- Mixed-precision
- Gradient accumulation
- Distributed training

### Mixed-precision

In many cases, you may want to speed up your training by relying on mixed-precision operations

In [None]:
# Autocast
with torch.cuda.amp.autocast():
    outputs = model(inputs)
    loss = loss_op(outputs, targets)
    
# GradScaler
scaler = torch.cuda.amp.GradScaler()

loss = ...
optimizer = ...

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

#### torch.cuda.amp.autocast()

- Automatically casts down heavy operations (e.g., convolution, matrix multiplication) to 16-bit
- Allows mixed-precision computations


#### torch.cuda.amp.GradScaler()

- Lets you work with 16-bit gradient values while avoiding underflows/overflows
- Scales up the loss to avoid gradient underflows
- Unscales the gradients before the optimizer step to ensure that model weights are updated correctly

In Tensorflow, mixed-precision is pretty easy to set up as well

In [None]:
# Stable API (TF >= 2.4); it replaces the older `experimental` namespace
tf.keras.mixed_precision.set_global_policy('mixed_float16')

### Gradient accumulation

### Distributed training

## Misc 

- Code documentation
- Writing a proper README
- Controlled environments

### Code documentation

### Writing a proper README

### Controlled environments

# Concluding Remarks

- TODO

# Next time (次回, Jikai!)

Actually, there's nothing left to show you...

Since I wanted to hold a 10-hour course (thus, 2 CFUs), I thought it would be a good opportunity to show you something I've been working on.

- Deasy-learning (*a tiny tiny custom library for research*)
- Course feedback (*don't forget to leave a like and hit subscribe!* ~semicit)
- **Motivational outro** (*please, don't miss this!*)

# Any questions?

<center>
<div>
<img src="Images/Lecture-1/jojo-arrivederci.gif" width="1200" alt='JOJO_arrivederci'/>
</div>
</center>