
# Neural networks in practice
- What type of learning (machine learning vs. deep learning)?

- What type of neural network? 


- What tooling do I use? 

- How do I get the best performance?

- How do I monitor my performance? 

# Type of learning

* Simple heuristics
* Machine learning
* Deep learning / neural networks

![footer_logo](../images/logo.png)

# Deep learning or machine learning?

### Data type
* **Structured**: tabular data
    - Handcrafted feature engineering
    - Boosting algorithms
* **Unstructured**: images/text/signals
    - Deep Learning


# Amount of data

To cite the [Deep Learning](http://www.deeplearningbook.org/contents/intro.html) book:

>  As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around **5,000 labeled examples per category**, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with **unsupervised or semi-supervised learning**.

# Problem Complexity
<center><img src="../images/chihuahua-muffin.png" width="700"></center>



# Question
### Deep learning or machine learning? 
> You want to predict the expected stay of a patient in the ICU based on their medical history contained in the electronic health records, and the lab test and measurements results during their stay so far in the ICU such as heart rate, blood pressure, etc. Do you use deep learning or traditional machine learning?

# Question
### Deep learning or machine learning? 

> You want to create a model that acts as a second reader for radiologists, the doctors who read medical images. It helps the radiologist find regions of interest that might be early signs of lung cancer on (3D) CT scans.

# Question 
### Deep learning or machine learning? 
> You want to create a system that generates a quick overview of the sentiment of a number of product reviews on a website. Deep learning or traditional machine learning?

# What type of neural network? 
* Fully-connected neural networks
* Convolutional neural networks
* Recurrent neural networks
* Generative neural networks

<center><img src="../images/neural_networks_collection.png" width="500"></center>


# Fully-connected neural networks
**Use it for**: 
- Tabular datasets

![simple nn](../images/model_diagram.gif)



# Convolutional Neural Networks
* Neural networks with convolutional (and pooling) layers
* Typically work well on data with a spatial relationship
* Translation equivariant; approximately translation invariant when combined with pooling
![](../images/ender-translated.png)


# Convolutional Neural Networks

**Use it for**: images

**Try it on**: text data, time series data, sequence input data

![](../images/example_conv_net.png)


# Recurrent Neural Networks

* Sequence prediction problems.
* LSTMs are among the most successful RNN variants

**Use it for**: text data, speech data

**Don't use it for**: tabular data, image data

<center><img src="../images/rnn-architecture.png" width="400"></center>


# Generative Models

* Generative adversarial networks (GANs)
* Variational autoencoders (VAEs)

![](https://miro.medium.com/max/1400/1*BaZPg3SRgZGVigguQCmirA.png) 

# Question
## What type of neural network? 
> You have a dataset of images of objects. For each object, you have 72 different viewpoints. This is your training set. Given an image of an object, you want to output an image of that object from a different viewpoint. What type of network would you use?

<center><img src="https://www.researchgate.net/profile/Jean_Elsner/publication/329969195/figure/fig11/AS:708799614705664@1546002405051/Samples-of-different-viewpoints-from-the-object-recognition-dataset-The-same-object.ppm" width="400"></center>


# Question
## What type of neural network? 
> You have a collection of audio fragments from music. You want to determine the genre the fragment belongs to. What type of network do you use? 

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Treble_a.svg/1200px-Treble_a.svg.png" width="200"></center>


# Question
## What type of neural network? 
> You have a collection of ECGs from patients, which denote the heart rhythm over time. You want to distinguish between normal heart rhythm cases and atrial fibrillation (abnormal). What type of neural network would be most suitable for this problem?

<center><img src="../images/ecg.png" width="200"></center>

# Infrastructure

* Local, server or cloud?
* Cloud providers: 
    - Amazon Web Services (AWS)
    - Google Cloud Platform (GCP)
    - Microsoft Azure

![](../images/cloud-logos.png)
   
   
    
 

# Infrastructure

### CPU vs GPU 
    
Required for GPU: 
* CUDA: API for Parallel Computing by NVIDIA
* cuDNN: GPU-accelerated library of primitives for deep neural networks
    
![Nvidia cards center third](../images/nvida_cards_vs.png)


# Optimize model performance

How do I get the best model performance?
* How do I preprocess my data? 
* What hyperparameters & architecture?
* Regularization
* From scratch or transfer learning?

## Data preprocessing

#### Feature data
- Zero-center: subtract the mean from every data point
- Standardize/scale: divide by the standard deviation

![center third](../images/feature_scaling.png)
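
A minimal sketch of the two steps above for a single feature column, written in plain Python so it runs without extra dependencies. The function name `standardize` and the sample values are illustrative assumptions, not part of the original material.

```python
from statistics import mean, pstdev


def standardize(column):
    """Zero-center a feature column and scale it to unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant feature cannot be scaled; return zeros instead of dividing by zero.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature values
scaled = standardize(heights_cm)
```

In practice the mean and standard deviation are computed on the training set only and then reused for the validation and test sets.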


# Question
## Why do we normalize the inputs x?

a) It makes the parameter initialization faster

b) Normalization is another word for regularization; it helps to reduce variance

c) It makes the cost function faster to optimize

d) It makes it easier to visualize the data


&rarr; c

# Data preprocessing

#### Target data
- Multi-class classification: one-hot encode the categorical targets
- Regression: match target range with output activation function

![center half](../images/target_one_hot.png)
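
One-hot encoding of categorical targets can be sketched as follows. The helper `one_hot` and the class names are hypothetical and only serve as an illustration.

```python
def one_hot(labels, classes):
    """Turn each categorical label into a 0/1 vector with a single 1 at its class index."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return rows


classes = ["cat", "dog", "bird"]  # hypothetical class names
targets = one_hot(["dog", "cat", "bird", "dog"], classes)
```

Each row has one entry per class, which matches the shape of a softmax output layer.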


# Hyperparameter choices 

- Gradient descent
- Loss function
- Weight updates
- Weight initialization


# Gradient descent

Mini-batch gradient descent is the best choice in most cases. 

- Small batches offer a regularizing effect (they add noise to the learning process), but increase run time
- Multicore architectures require a minimum batch size to be effective
- Bigger batch size = faster computation per epoch
- Batch size is usually a power of two
- Dependent on your machine and model size

Good starting point: {32, 64} for CPU or {128, 256} for GPU
    
> … [batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value

_Practical recommendations for gradient-based training of deep architectures, 2012_
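
To make the idea concrete, here is a small sketch of mini-batch gradient descent on a toy linear regression problem. The function `minibatch_sgd`, the synthetic data and the learning rate are assumptions chosen for illustration. Real projects would rely on a deep learning library instead.

```python
import random


def minibatch_sgd(xs, ys, batch_size=32, lr=0.2, epochs=300, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE, averaged over the current mini-batch only.
            errors = [w * xs[i] + b - ys[i] for i in batch]
            grad_w = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noiseless data generated from y = 3x + 1.
xs = [i / 100 for i in range(100)]
ys = [3 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys, batch_size=32)
```

Setting `batch_size=1` gives stochastic gradient descent and `batch_size=len(xs)` gives full-batch gradient descent, so the batch size interpolates between the two.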


# Loss function

In general:
<center><img src="../images/loss_functions_table.png" width="700"></center>


# Weight updates (optimizer)

**SGD** 
- Strength: often best generalization 
- Weakness: 
    - long training time
    - sensitive to initialization & learning rate parameter
    
**SGD with momentum**
- Strength: overcomes sensitivity to initialization
- Weakness:
    - sensitive to learning rate parameter $\alpha$ 
    - sensitive to momentum parameter $\beta$ 
        
**Adam**
- Strength:
    - good default settings
    - works well on sparse features
    - adapts the step size for each parameter automatically
- Weakness: can generalize worse than SGD
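
The update rules behind two of these optimizers can be sketched for a single scalar weight. The function names are illustrative assumptions. The hyperparameter defaults follow commonly used values (`lr` is the learning rate $\alpha$, `beta` the momentum parameter $\beta$), and the quadratic toy objective is chosen only to keep the example short.

```python
import math


def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update; returns the new weight and velocity."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (starting at 1); returns the new weight and both moments."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Toy objective f(w) = (w - 2)^2 with gradient 2 * (w - 2); the minimum is at w = 2.
w_mom, vel = 0.0, 0.0
w_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w_mom, vel = sgd_momentum_step(w_mom, 2 * (w_mom - 2), vel, lr=0.01)
    w_adam, m, v = adam_step(w_adam, 2 * (w_adam - 2), m, v, t, lr=0.01)
```

Adam divides each step by the running root-mean-square of the gradient, which is why it needs less learning-rate tuning than plain SGD.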
    

# Weight updates (optimizer)

#### Recommendation
* Adam is a good default choice
* SGD with momentum if you have the resources to find a good learning rate
* Sparse data: adaptive gradient methods such as Adam, RMSprop or AdaGrad
* Look at state-of-the-art papers for your dataset and/or task

# Weight updates (optimizer)
**Example:** Say you want to train a Generative Adversarial Network (GAN) to perform super-resolution on a set of images. After some research you stumble upon this paper in which the researchers used the Adam optimizer to solve the exact same problem. Wilson et al. argue that training GANs does not correspond to solving optimization problems and that Adam may be well-suited for such scenarios. Which optimizer do you choose?

   &rarr; Adam

# Weight updates (optimizer)
**Example:** For a project at your current job you have to classify written user responses into positive and negative feedback. You consider using bag-of-words as input features for your machine learning model. Since these features can be very sparse, you decide to go for an adaptive gradient method, which leaves Adam, RMSprop or AdaGrad. You also have a limited time frame for your project. Adam and RMSprop both have more tunable parameters than AdaGrad. Which optimizer do you choose?

   &rarr; AdaGrad

# Weight initialization 

With ReLU activations, use He initialization; otherwise use Xavier initialization (also called Glorot initialization).
![](https://pouannes.github.io/initialization/converge_22layers.png#center) 
Source: Kaiming He's paper comparing Xavier and He initialization with ReLU as the activation function
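
Both schemes only differ in how the spread of the random weights is chosen. The sketch below uses the normal variant of He initialization and the uniform variant of Xavier initialization. The function names and layer sizes are assumptions for illustration.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot initialization: uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)
w_relu = he_normal(256, 128, rng)       # weights for a layer followed by ReLU
w_tanh = xavier_uniform(256, 128, rng)  # weights for a layer followed by tanh or sigmoid
```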

# Architecture

**Dropout** 
- `keep_prob` (probability of keeping a unit)
    * 0.5 for fully-connected nets
    * 0.8-0.9 for convolutional nets (i.e. drop only 10-20% of the units)
- When to use:
    * after every fully-connected layer (except the last)
    * convnet: not necessary or only in the lower layers

_Important: disable during test time_
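
A minimal sketch of how `keep_prob` and the train/test switch interact, using the common "inverted dropout" formulation. The function `dropout` and the example activations are hypothetical.

```python
import random


def dropout(activations, keep_prob, training, rng):
    """Inverted dropout: keep each unit with probability keep_prob and rescale the survivors."""
    if not training:
        # Test time: dropout is disabled and activations pass through unchanged.
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000  # hypothetical layer activations
train_out = dropout(hidden, keep_prob=0.5, training=True, rng=rng)
test_out = dropout(hidden, keep_prob=0.5, training=False, rng=rng)
```

Dividing by `keep_prob` during training keeps the expected activation the same as at test time, so no extra scaling is needed when dropout is switched off.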

# Architecture

**Activation**
- ReLU for hidden layers
- Sigmoid, softmax or linear for your final layer

_Exception:_ if the fraction of 'dead' neurons is high, go for Leaky ReLU or Maxout


# Architecture

**Convolution, activation and batch normalization**
- Original paper: conv-bn-act 
- Practice: conv-act-bn 

![](../images/convbnact.png)

# Architecture
**Pooling**
- Periodically insert between convolutional layers or blocks
- Max pooling is usually more effective than average pooling
- Global average pooling or global max pooling can be used as an alternative to "flatten" the feature maps for the last layer.
- Cheaper alternative to convolution with stride

![](https://qph.fs.quoracdn.net/main-qimg-1afcb29913e4a667e78790c597e27712)
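
As a small illustration, 2x2 max pooling and global average pooling on a single feature map can be written as below. The helper names and the example feature map are assumptions.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D feature map (height and width must be even)."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]), 2)
        ]
        for r in range(0, len(feature_map), 2)
    ]


def global_average_pool(feature_map):
    """Collapse a whole feature map into one number by averaging all positions."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)          # [[4, 2], [2, 8]]
summary = global_average_pool(fmap)  # 2.8125
```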


# Architecture

![](../images/example_conv_net.png)

# Regularization
- **Dropout** and batch normalization
- Data augmentation
- Early stopping 
<center><img src="../images/dropout.png" width="700"></center>

# Regularization
- Dropout and batch normalization
- **Data augmentation**
- Early stopping 

![](../images/ender-translated.png)


# Regularization
- Dropout and batch normalization
- **Data augmentation**
- Early stopping 

A technique to increase the diversity of your training set by applying random (but realistic) transformations, such as image rotation.

![](../images/ender-rotated.png)
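
A toy sketch of random augmentation on an image stored as a nested list. The helpers are hypothetical, and flips and 90-degree rotations are only realistic for some datasets (for example, they would change the meaning of handwritten digits).

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image, rng):
    """Apply a random flip followed by a random number of 90-degree rotations."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    for _ in range(rng.randrange(4)):
        image = rotate_90(image)
    return image


image = [[1, 2], [3, 4]]  # hypothetical 2x2 grayscale image
rng = random.Random(0)
augmented = [augment(image, rng) for _ in range(8)]
```

The label of each augmented image stays the same as that of the original, which is what makes the extra samples free training data.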

# Regularization
- Dropout and batch normalization
- **Data augmentation**
- Early stopping 
![center](../images/dataaugmentation.png)


# Regularization
- Dropout and batch normalization
- Data augmentation
- **Early stopping**

![half center](../images/overfitting.png)
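
Early stopping with a patience window can be sketched as follows. The function `early_stopping`, the `patience` value and the validation losses are assumptions for illustration.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, last_epoch): stop once the validation loss has not
    improved for `patience` consecutive epochs."""
    best_loss, best_epoch, last_epoch = float("inf"), -1, -1
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a real loop would save the weights here
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, last_epoch


# Hypothetical validation losses per epoch; the model starts to overfit after epoch 3.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60, 0.40]
best_epoch, stopped_at = early_stopping(val_losses, patience=3)
```

Here training stops at epoch 6 and the weights from epoch 3 are kept. The later improvement at epoch 8 is never reached, which shows the trade-off in choosing `patience`.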


# Transfer learning
A model developed for a task is reused as the starting point for another task.
![center third](../images/andrew_ng_nips_2016_transfer_learning-1.png)


# Transfer learning

![center half](../images/pretrained_ops_accuracy.jpeg)


# Transfer learning

**Transfer learning as a feature extractor**

- Remove the last fully-connected layer
- Treat the rest as a fixed feature extractor: compute the output of the last remaining layer for your dataset
- Train a linear classifier (e.g. linear SVM or softmax) on these outputs

**Transfer learning with fine-tuning**

- Remove the last layer
- Retrain the weights of some of the earlier layers (but not all)
    


# Question
### What type of transfer learning would you use in these scenarios:

a) New dataset is _small_ and _similar_ to original dataset

b) New dataset is _large_ and _similar_ to original dataset

c) New dataset is _small_ but _very different_ from original dataset

d) New dataset is _large_ and _very different_ from original dataset

&rarr; a) Feature extractor

&rarr; b) Fine-tune

&rarr; c) Feature extractor from earlier in the network

&rarr; d) Fine-tune through the entire network

# General heuristics
- Use pre-trained networks whenever possible
- More data is better, but quality matters
- For fully-connected networks, it rarely helps to go deeper than 3 or 4 layers
- Larger networks contain significantly more local minima
- Overfitting: prefer regularization over a smaller neural network

# Question
### If your neural network model seems to have high bias (underfitting), which of the following would be promising things to try? (More than one can apply)

    a) Get more test data
    b) Add regularization
    c) Get more training data
    d) Make the Neural Network deeper 
    e) Increase the number of units in each hidden layer 

   &rarr; d & e

# Question
### Increasing the parameter `keep_prob` in Dropout from 0.5 to 0.6 will likely cause the following (choose two):

    a) Increase the regularization effect
    b) Decrease the regularization effect
    c) Increase training set error
    d) Decrease the training set error

   &rarr; b & d

# Monitoring

**Score**

- *Train/val score*: track the training and validation scores to determine the amount of overfitting.

- Ratio of updated weights
- Activation/gradient distributions per layer

![center](../images/overfitting.jpeg)
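
The "ratio of updated weights" can be computed as the size of one update relative to the size of the weights. The sketch below assumes plain SGD updates. The function `update_ratio` and the numbers are hypothetical. A commonly cited heuristic is that this ratio should be around 1e-3: much larger suggests the learning rate is too high, much smaller that it is too low.

```python
import math


def update_ratio(weights, gradients, lr):
    """Norm of one SGD update (lr * gradient) divided by the norm of the weights."""
    weight_norm = math.sqrt(sum(w * w for w in weights))
    update_norm = math.sqrt(sum((lr * g) ** 2 for g in gradients))
    return update_norm / weight_norm


# Hypothetical weights and gradients of one layer.
ratio = update_ratio([0.5, -0.3, 0.8], [0.1, -0.2, 0.05], lr=0.01)
```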


# Monitoring - TensorBoard

- Learning curves
- Computational graphs
- Weights
<center><img src="../images/tensorboard.png" width="400"></center>


# Monitoring - TensorBoard

- Learning curves
- Computational graphs
- Embeddings
- Weights

In Keras:

```python
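from tensorflow import keras  # assumes the Keras bundled with TensorFlow 2.x
tensorboard = keras.callbacks.TensorBoard(log_dir='./logs')  # pass to model.fit(..., callbacks=[tensorboard])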
```
Before starting your training session, start TensorBoard with the following bash command:

```bash
tensorboard --logdir ./logs
```

# Monitoring - Visdom

![half center](../images/visdom.gif)


# Conclusion

- Use neural networks for unstructured data
- Transfer learning helps you kick-start your problem
- Many things to consider, but a good library (e.g. Keras) will help you

![](../images/keras.png)