# Good practices in NN/DL project design
## What to do and - more importantly perhaps - not to do

# Is my project right for Neural Networks?

* The thought process should not be: “I have some data, why don’t we try neural networks”
* But it should be: “Given the problem, does it make sense to use neural networks?”

    * Do I really need non-linear modelling?
    * What literature is out there for similar problems?
    * How much data will I be able to gather or put my hands on?
    * Are there datasets out there that I can re-use before I collect my data?



## Do I really need non-linear modelling?

* Sometimes linear methods perform just as well if not better
* Less risk of catastrophic overfitting
* Faster to code, optimize, run, debug
* Use linear modelling as a baseline before you move to non-linear methods?
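
A minimal sketch of what such a linear baseline could look like, using only NumPy and synthetic data (the data, the shift between classes and the least-squares classifier are all assumptions made for illustration; in practice a library implementation of logistic regression would be a natural choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: class 1 is shifted along every feature
n_samples, n_features = 400, 10
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_features)) + 1.0 * y[:, None]

# Hold out a quarter of the samples for evaluation
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

def add_bias(X):
    """Append a constant column so the model can learn an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares linear classifier: regress labels in {-1, +1}, threshold at 0
w, *_ = np.linalg.lstsq(add_bias(X_train), 2 * y_train - 1, rcond=None)
pred = (add_bias(X_val) @ w > 0).astype(int)
baseline_accuracy = float(np.mean(pred == y_val))
print(f"Linear baseline accuracy: {baseline_accuracy:.2f}")
```

Any non-linear model tried afterwards then has a concrete number to beat.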

## Real-life example

Drop-in question: "I tried deep learning on my data and it didn't perform better than this other simpler method"

* Classifying gene expression samples
* Thousands of features
* 1000 samples
* 2 classes
* NN looked like this:

```python
from keras.layers import Dense
from keras.models import Sequential

model = Sequential()
model.add(Dense(1000, input_dim=5000))
model.add(Dense(500))
model.add(Dense(2, activation="softmax"))

model.summary()
```

```
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 1000)              5001000   
_________________________________________________________________
dense_13 (Dense)             (None, 500)               500500    
_________________________________________________________________
dense_14 (Dense)             (None, 2)                 1002      
Total params: 5,502,502
Trainable params: 5,502,502
Non-trainable params: 0
_________________________________________________________________
```


## Parameters (weights) vs. samples

* If the number of parameters is many times higher than the number of samples, a NN will most likely just memorize the training data instead of generalizing
* Ideally, we are looking for the inverse: way more samples than parameters
* Some rules of thumb out there:
    * Definitely bad if number of weights > number of samples
    * 10x as many labelled samples as there are weights
    * A few thousand samples per class
    * Just try it and downscale/regularize until you're not overfitting anymore (or until you have a linear model)
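
As a quick sanity check, the parameter count of a fully connected network can be computed by hand and compared with the number of samples. The sketch below does this for the gene expression example (5000 inputs, layers of 1000, 500 and 2 units, 1000 samples); the helper name `dense_param_count` is made up for this illustration:

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

n_params = dense_param_count([5000, 1000, 500, 2])
n_samples = 1000

print(f"{n_params:,} parameters for {n_samples:,} samples")
print(f"Roughly {n_params / n_samples:.0f} parameters per sample")
```

The count matches the 5,502,502 reported by `model.summary()`, which is thousands of parameters per sample instead of the other way around.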

## And even if I have enough data for a NN...

... is Deep Learning the right choice?

* The tasks where Deep Learning shines are those that require feature extraction:
    * Imaging -> edge/object detection
    * Audio/text -> sound/word/sentence detection
    * Protein structure prediction -> mutation patterns/local structure/global structure

* Deep Learning makes feature extraction automatic and seems to work best when there is a hierarchy to these features
* Is your data made that way? 
    * Does it have an order (spatial/temporal)? 
    * Are smaller patterns going to form higher-order patterns?
* All these different types of layers need to be there for a reason


<img src="figures/feature_extraction.png"></img>

source: [datarobot](https://www.datarobot.com/blog/a-primer-on-deep-learning/)

## And even when both these conditions have been met

... you need a few more things:

* Domain knowledge is not enough
* Sometimes people with NN/DL knowledge and no domain knowledge end up being the right ones for the job (see AlphaFold)
* You also need lots of patience and time: these things rarely work out of the box

## A few more things to keep in mind

* You need extensive knowledge of your data:
    * Split the data in a rigorous way to avoid introducing biases
    * Check for _information leakage_ before you get overly optimistic results
    * Make sure that there are no errors in your data

And therein lies the main issue:
* Some think that DL is about having a model magically fixing your data
* Instead, DL is _mostly_ about knowing your data

## 1) Neural Nets are very good at detecting patterns and they will use this against you

### (a.k.a. target leakage)

## Target leakage

* Making a predictor when you know the answers is not as easy as it seems
* Need to remove any revealing info you would not have access to in a real scenario
* Classic example: predict yearly salary of employee
    * But one of the features is "monthly income"
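
One cheap first check, sketched below on made-up data, is to look for features that are almost perfectly correlated with the target. The function name, the threshold of 0.95 and the synthetic salary data are assumptions for illustration; this only catches simple linear leaks like the one in the salary example, not subtler ones:

```python
import numpy as np

def suspicious_features(X, y, names, threshold=0.95):
    """Names of features whose |Pearson correlation| with the target exceeds threshold."""
    flagged = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) > threshold:
            flagged.append(name)
    return flagged

rng = np.random.default_rng(0)
yearly_salary = rng.uniform(20_000, 100_000, size=200)
age = rng.uniform(20, 65, size=200)      # unrelated to salary in this toy data
monthly_income = yearly_salary / 12      # leaks the target
X = np.column_stack([age, monthly_income])

print(suspicious_features(X, yearly_salary, ["age", "monthly_income"]))
# prints ['monthly_income']
```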

## Example: detecting COVID-19 from chest scans 
(https://www.datarobot.com/blog/identifying-leakage-in-computer-vision-on-medical-images/)

* COVIDx dataset
* Training set: 66 positive COVID results, 120 random non-COVID examples
* 2-class classifier based on ResNet50 Featurizer
* Perfect validation results! Great!


## Example: detecting COVID-19 from chest scans 

Inspecting dataset with image embeddings tells another story: can anyone tell what's wrong?

<img src="figures/covidchest.png">
[(source)](https://www.datarobot.com/blog/identifying-leakage-in-computer-vision-on-medical-images/)

## Example: detecting COVID-19 from chest scans 

Let's look at the activation maps to see this in more detail
* Get final layer's output after activation (ReLU) and plot figure

<img src="figures/covidchest2.png"></img>
[(source)](https://www.datarobot.com/blog/identifying-leakage-in-computer-vision-on-medical-images/)

## Example: normalizing inputs on train/validation/test data

* If you normalize on validation data as well, you are using information you wouldn't have in a real scenario
* Boston housing dataset: does the same model perform differently when normalizing data based on all samples or just the training samples?
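
A minimal sketch of leak-free normalization, using synthetic data rather than the Boston housing set (array shapes and the 80/20 split are arbitrary choices for illustration). The mean and standard deviation are computed on the training samples only and then reused on the validation samples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_val = X[:80], X[80:]

# Statistics come from the training samples only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_norm = (X_train - mean) / std
X_val_norm = (X_val - mean) / std   # reuse the training statistics

# Leaky alternative, shown for contrast: statistics computed on all samples
leaky_mean = X.mean(axis=0)
print("Difference between leaky and train-only means:", leaky_mean - mean)
```

The validation set will generally not end up with exactly zero mean and unit variance, and that is expected.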

## 2) Know your train/validation/test sets

* A _train set_ is a set of samples used to tune the NN weights
* A _validation set_ is a set used to tune the NN hyperparameters:
    * Type of model (maybe not even a NN)
    * Number of layers
    * Number of neurons per layer
    * Type of layers
    * Optimizer
    * Validation set results are NOT the ones that will get published
    * This holds even if you cross-validate
* A _test set_ is a secluded set of samples that are used only once to test the final model
    * Gives an idea of how well the model generalizes to unseen data (results go on paper)
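
A simple random three-way split could be sketched as follows. The function name, the 70/15/15 proportions and the seed are assumptions for illustration, and a purely random split is only appropriate when samples are independent of each other:

```python
import numpy as np

def split_indices(n_samples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle sample indices once and cut them into three disjoint sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The test indices would then be set aside and not looked at until the final model has been chosen.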

### Beware of similar samples across sets

<img src="figures/homer.png" width=500>

<br>
<br>
<img src="figures/guyincognito.png">
(2F08 “Fear of Flying”)

## Knowing what each set does is half the battle

Train, validation and test sets cannot be too similar to each other, or you will not be able to tell if the network is generalizing or just memorizing

* _How_ different they should be depends on what you're trying to achieve
* Come up with a similarity measure
* At the very least remove duplicate samples
* You would be surprised how often scientists mess this up
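
The bare minimum, removing exact duplicates, could be sketched like this (the helper name `drop_overlap` is made up, and the comparison is byte-for-byte, so both arrays are assumed to share the same dtype; near-duplicates need a proper similarity measure instead):

```python
import numpy as np

def drop_overlap(X_train, X_other):
    """Return the rows of X_other that do not appear verbatim in X_train."""
    seen = {row.tobytes() for row in np.ascontiguousarray(X_train)}
    keep = [
        i for i, row in enumerate(np.ascontiguousarray(X_other))
        if row.tobytes() not in seen
    ]
    return X_other[keep]

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_val = np.array([[3.0, 4.0], [7.0, 8.0]])   # first row duplicates a training sample

X_val_clean = drop_overlap(X_train, X_val)
print(X_val_clean)
```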



<img src="figures/andrewng.png">

<img src="figures/trainvalidationleak1.png">

<img src="figures/andrewng.png">

<img src="figures/trainvalidationleak.png">

# Sad ending :(
<img src="figures/trainvalidationleak2.png">

## Another example, protein structure prediction

* For some reason most researchers try to split train/validation/test by sequence similarity
* If two proteins have <25% identical amino acids, they are deemed different enough
* But protein families/superfamilies contain many proteins that share no detectable sequence similarity
* Sequence similarity is not the right metric!
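
One possible alternative, assuming each protein comes with some family or superfamily label, is to assign whole groups to a single set so that related proteins never end up on both sides of the split. The sketch below uses hypothetical protein identifiers and family names:

```python
import random
from collections import defaultdict

def split_by_group(sample_ids, group_of, test_fraction=0.2, seed=0):
    """Assign whole groups to either train or test so no group spans both."""
    groups = defaultdict(list)
    for sample in sample_ids:
        groups[group_of[sample]].append(sample)

    group_names = sorted(groups)
    random.Random(seed).shuffle(group_names)

    n_test_groups = max(1, round(len(group_names) * test_fraction))
    test_groups = set(group_names[:n_test_groups])

    train = [s for s in sample_ids if group_of[s] not in test_groups]
    test = [s for s in sample_ids if group_of[s] in test_groups]
    return train, test

# Hypothetical protein identifiers and family labels, for illustration only
family = {"protA": "fam1", "protB": "fam1", "protC": "fam2",
          "protD": "fam3", "protE": "fam3", "protF": "fam4"}
train, test = split_by_group(list(family), family, test_fraction=0.25)
print("train:", train)
print("test:", test)
```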

<img src="figures/25percent.png">

## 3) Your model is only as good as your data 

Reasons why one of my networks wouldn't work:

* Labels were wrong (label for amino acid n was assigned to amino acid n+1)
* The actual target sequence was missing from the multiple sequence alignment
* Inputs weren't correctly scaled/normalized
* Script to convert 3-letter code amino acid to one letter (LYS -> K) didn't work as expected



<img src="figures/unknown.png">

## NNs are robust

They will "kind of" work even when some labels are incorrect, which makes it very tricky to tell whether something is wrong, and what

* Before training:
    * Plot data distributions
    * Test all data preparation scripts
    * Manually look at data files
    * Check labels for mistakes and class imbalance

* While training:    
    * Look at badly predicted samples
    * Be paranoid when something doesn't work well, even more when it works surprisingly well
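
Testing a data preparation script can be as simple as a few assertions. The sketch below does this for a 3-letter to 1-letter amino acid conversion like the one mentioned earlier; the function is a made-up stand-in, not the original script:

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def three_to_one(residues):
    """Convert a list of 3-letter residue codes to a 1-letter sequence.

    Raises KeyError on unknown codes instead of silently skipping them,
    so that sequence and label positions cannot drift apart unnoticed.
    """
    return "".join(THREE_TO_ONE[r.upper()] for r in residues)

# Small checks that catch the most common conversion bugs
assert three_to_one(["LYS", "ALA", "gly"]) == "KAG"
assert len(three_to_one(["MET"] * 5)) == 5   # one letter per residue

try:
    three_to_one(["LYS", "XXX"])
except KeyError:
    print("Unknown residue detected")
```

Failing loudly on unexpected input is a deliberate choice here: a skipped residue would shift every label after it by one position.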

<br>
<br>
<img src="figures/monk.jpg" width=400>

## Ok, my data is perfect but I don't have enough of it: what now?

Main avenues:
* Find more of it
* Make smaller models
* Cut down insignificant features
* Generate artificial samples: Data augmentation
* Transfer learning (so find more data, again)
* Think outside the (black) box

## Feature selection

* We are moving away from Deep Learning (automatic feature extraction from raw data)
* Remove highly correlated inputs first, that's easy
* Keep in mind that categorical inputs are more "costly" in terms of parameters
    * E.g. a 10-category input will be encoded as 10 separate inputs (one-hot)
* Feature ablation studies
* Autoencoders to compress inputs?
* Feature importance through other ML methods:
    * Random Forest
    * Logistic regression
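
Removing highly correlated inputs could be sketched as below. The greedy strategy, the function name and the 0.95 threshold are assumptions for illustration; the function returns the column indices to keep:

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Keep a feature only if its |correlation| with every kept feature is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Column 1 is almost a rescaled copy of column 0
X = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), b])

print(drop_correlated(X))
# prints [0, 2]
```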

## Feature ablation

* Ablation study (on features):
    * Remove parts of the inputs, see what happens
    * If results improve, remove some other inputs
    * If not, try removing other inputs and so on

* Could be implemented in annealing procedure to speed things up
* As usual, do this only on training data
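
The loop above can be sketched as a greedy backward elimination. To keep the example fast, a least-squares linear model stands in for the network, and the synthetic data, function names and stopping rule are assumptions for illustration. Both the fitting part and the held-out part are carved out of the training data:

```python
import numpy as np

def heldout_error(X_fit, y_fit, X_eval, y_eval, features):
    """Mean squared error of a least-squares fit restricted to `features`."""
    w, *_ = np.linalg.lstsq(X_fit[:, features], y_fit, rcond=None)
    return float(np.mean((X_eval[:, features] @ w - y_eval) ** 2))

def ablate(X_fit, y_fit, X_eval, y_eval):
    """Drop one feature at a time for as long as the held-out error improves."""
    features = list(range(X_fit.shape[1]))
    best = heldout_error(X_fit, y_fit, X_eval, y_eval, features)
    improved = True
    while improved and len(features) > 1:
        improved = False
        for f in list(features):
            trial = [g for g in features if g != f]
            err = heldout_error(X_fit, y_fit, X_eval, y_eval, trial)
            if err < best:
                best, features, improved = err, trial, True
                break
    return features, best

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
# Only features 0 and 1 carry signal in this toy problem
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=120)

# Both parts come from the training data; the test set is never touched
selected, error = ablate(X[:80], y[:80], X[80:], y[80:])
print("Selected features:", selected)
```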

## Regularizers (https://keras.io/api/layers/regularizers/)

You thought we were done with Keras API explanations, but we ain't

* Regularizers are used to constrain the training so that weights don't get too big (a cause of overfitting)
* L1 regularization (Lasso): 
    * $L_r(x,y) = L(x,y) + \lambda  \sum_{i,j} \lvert w_{i,j}\rvert $
    * Results in sparse weight matrices (many weights to 0)
* L2 regularization (Ridge):
    * $L_r(x,y) = L(x,y) + \lambda  \sum_{i,j} w_{i,j}^2 $
    * Results in smaller weights

```python
from keras.layers import Dense
from keras import regularizers

layer = Dense(
    units=64,
    kernel_regularizer=regularizers.l1(1e-5),  # the parameter is the lambda
    bias_regularizer=regularizers.l2(1e-4),
    activity_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),  # two lambdas here
)
```
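
To make the two formulas concrete, here is a small worked example in NumPy. The weight matrix, the value of lambda and the placeholder data loss are made up for illustration:

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [0.0,  3.0]])
lam = 0.1          # the lambda in the formulas above
data_loss = 0.5    # placeholder value standing in for L(x, y)

l1_penalty = lam * np.sum(np.abs(W))   # 0.1 * (1 + 2 + 0 + 3)
l2_penalty = lam * np.sum(W ** 2)      # 0.1 * (1 + 4 + 0 + 9)

print("L1-regularized loss:", data_loss + l1_penalty)
print("L2-regularized loss:", data_loss + l2_penalty)
```

The large weight (3) contributes far more to the L2 penalty than to the L1 penalty, which is why L2 mostly shrinks big weights while L1 pushes small ones to zero.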


## Data augmentation

* An example of augmentation commonly seen in image recognition:
    * If we have few images, we can flip them, rotate them, shift them...
    * Extra instances that the network will benefit from
* Pixel-wise classification was also a form of augmentation!
    * 100 images of size (640x480) suddenly become 100x640x480 labelled samples
* Extreme examples: generative neural networks (GANs, Variational Autoencoders)
* Similar things might be done to other kinds of data as well
    * Can you imagine ways to augment the data you work with?
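
A minimal sketch of flip and rotation augmentation with NumPy, using a small integer array as a stand-in for a grayscale image. This assumes the label does not change under these transformations, which is not true for every problem (think of digits or text):

```python
import numpy as np

def augment(image):
    """Return the original image plus flipped and rotated copies."""
    return [
        image,
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # 90 degree rotation
        np.rot90(image, k=2),  # 180 degree rotation
        np.rot90(image, k=3),  # 270 degree rotation
    ]

image = np.arange(12).reshape(3, 4)   # stand-in for a small grayscale image
augmented = augment(image)
print(len(augmented), "images from 1")
```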

## Transfer learning

* Say you have a small labelled dataset for a specific problem
* But there are larger datasets out there for similar applications
* Transfer learning means training a large Neural Network on the large set, then reusing parts of it on the small set


## Transfer learning

* Train a deep classifier on a large dataset
* The bottom (first) layers of the network learn to extract relevant features
* The top (last) layers learn to classify
* Keep the bottom layers, freeze them (so that the weights can't change anymore)
* Re-initialize the top layers' weights randomly
* Retrain the network on the small dataset so that only the top layers' weights are now trained
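
The steps above could be sketched in Keras roughly as follows. The data is random noise standing in for the large and the small dataset, and the layer sizes and epoch counts are arbitrary, so this only illustrates the mechanics of freezing and retraining, not a useful model:

```python
import numpy as np
from keras import Input, layers, models

rng = np.random.default_rng(0)

# Synthetic stand-ins for the large and the small dataset
X_large = rng.normal(size=(500, 20)).astype("float32")
y_large = rng.integers(0, 2, size=(500, 1)).astype("float32")
X_small = rng.normal(size=(40, 20)).astype("float32")
y_small = rng.integers(0, 2, size=(40, 1)).astype("float32")

# Bottom layers: feature extractor trained on the large dataset
base = models.Sequential([
    Input(shape=(20,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
])
pretrain = models.Sequential([base, layers.Dense(1, activation="sigmoid")])
pretrain.compile(optimizer="adam", loss="binary_crossentropy")
pretrain.fit(X_large, y_large, epochs=2, verbose=0)

# Freeze the bottom layers and attach a freshly initialized top layer
base.trainable = False
frozen_weights = [w.copy() for w in base.get_weights()]
finetune = models.Sequential([base, layers.Dense(1, activation="sigmoid")])
finetune.compile(optimizer="adam", loss="binary_crossentropy")
finetune.fit(X_small, y_small, epochs=2, verbose=0)

# Only the new top layer (kernel and bias) is trainable
print(len(finetune.trainable_weights), "trainable weight tensors")
```

Setting `trainable = False` before compiling the second model is what keeps the bottom layers fixed during the retraining step.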

<img src="figures/transferlearning.png"></img>
[(img source)](https://www.slideshare.net/xavigiro/transfer-learning-d2l4-insightdcu-machine-learning-workshop-2017)

## Think outside the box

* Neural Networks allow you to be very flexible and creative
* Say we have images of cells as in the convolutional lab, but too few of them ($n$ samples)
    * Build a network to classify whether two cells are of the same kind
    * Suddenly we have a dataset with $n^2$ samples instead of $n$
    * Two branches of the network (one per input image) with a merged output layer
    * The two branches can actually be the _same_ network repeated twice
    * When you're done with training it, do transfer learning for the actual classification task
    * Maybe issues with information leakage? Would be good to check
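
Building the pair dataset could look like the sketch below, where the image identifiers and cell-type labels are hypothetical. Regarding the leakage concern, one option would be to build the pairs separately inside each of the train, validation and test sets, so that no image appears in pairs on both sides of a split:

```python
from itertools import product

def make_pairs(samples, labels):
    """Build every ordered pair of samples with a same-class (1) or different-class (0) target."""
    pairs, targets = [], []
    for (xi, yi), (xj, yj) in product(zip(samples, labels), repeat=2):
        pairs.append((xi, xj))
        targets.append(int(yi == yj))
    return pairs, targets

# Hypothetical image identifiers and cell-type labels, for illustration only
samples = ["img0", "img1", "img2", "img3"]
labels = ["typeA", "typeA", "typeB", "typeB"]

pairs, targets = make_pairs(samples, labels)
print(len(samples), "samples ->", len(pairs), "pairs")
# prints 4 samples -> 16 pairs
```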
    


## Tips and tricks on training your Neural Networks

* Know your data
* Fix random seeds for reproducibility
* Manually calculate the metric for a naïve baseline predictor
* Overfit first, ask questions later:
    * Train on a small dataset (one batch of data) first: can you make it overfit?
    * Can you make it overfit on the full dataset?
    * Now scale it back (fewer layers/neurons etc)
* Look obsessively at training curves, compare multiple tests
* Look at samples where prediction fails, why are they special?
* Change one thing at a time!
* Be patient and let the model train when you have a reasonable one
* Neural Networks are not necessarily black boxes, visualize outputs from different layers to see where the network is focusing
* Ensemble multiple NNs to get better predictions
* Think of your labels, should you classify or regress?

## When to use GPUs, when CPUs?

* Use GPUs if it saves you a considerable amount of time
    * Especially when trying to optimize hyperparameters (fast turn-around)
    * Might be ok to use CPU if you want to train a known architecture and have some time
    * Access to GPU nodes on clusters might be slow
    * If only for inference (not training), CPU should be quick enough
* Use CPUs if you run out of memory (OOM errors):
    * Bigger GPUs have ~16GB RAM to date
    * A small cluster node usually has 128GB RAM

## Training Neural Networks in Sweden

Many resources available:

* [Snowy@UPPMAX](https://www.uppmax.uu.se/support/user-guides/using-the-gpu-nodes-on-snowy/) (Tesla T4s)
* [Kebnekaise@HPC2N](https://www.hpc2n.umu.se/resources/hardware/kebnekaise) (V100s/K80s, 46 nodes)
* [Tetralith@NSC](https://www.nsc.liu.se/support/systems/tetralith-GPU-user-guide/) (Tesla T4s, 170 nodes)
* [Alvis@C3](https://www.c3se.chalmers.se/about/Alvis/) (Tesla T4s, 17+ nodes)
* [Google Colab](https://colab.research.google.com/)
* ...?

<img src="figures/snic.png">