#**ONCODE Masterclass: Introduction to machine learning in cancer genomics**
##November 18th, 2020
##**Deep Learning 1: Image Classification with a Convolutional Neural Network (CNN)**

<img src='https://drive.google.com/uc?id=1gQU3ywndM16loOCb4k4AGklgXu0-yBwv'>

##Introduction for absolute beginners: main concepts in deep learning
Deep learning is a class of machine learning algorithms inspired by the structure and function of the brain. At the basic level is the perceptron, the mathematical representation of a biological neuron. Just like in the human cortex, there can be several layers of interconnected perceptrons. Input values get passed through this “network” of hidden layers until they eventually converge to the output layer.
<center> 
<img src='https://drive.google.com/uc?id=1YKTC1-wasZywRT6hXCLyBMw8qgrKXRHf' width="450" height="220"/>
</center>

**Weights** and **biases** are the learned parameters of a perceptron model. **Weights** control the signal (or the strength of the connection) between two neurons. In other words, a weight decides how much influence the input will have on the output.

An **activation function** (sigmoid in the figure above) acts as a mathematical ‘gateway’: the node first computes the weighted sum of its inputs plus the bias, and the activation function then transforms that sum to determine whether, and how strongly, the node fires. During learning, connections that matter receive larger weights, while weights close to zero effectively switch a connection off.

**Biases** allow you to shift the activation function to the left or right, which may be critical for successful learning.
<center> 
<img src='https://drive.google.com/uc?id=1UTyxjy_aTukY13SmgHQiYGxIVsQGnQ8d' width="200" height="80"/>
</center>

The output of the network is computed by multiplying the input (x) by the weight (w0) and passing the result through some kind of activation function (e.g. a sigmoid function).
Here is the function that this network computes, for various values of w0:
<center> 
<img src='https://drive.google.com/uc?id=1QwjY10MYcxUX0UH0vvy87IbviNl-ricV' width="460" height="340"/>
</center>

Changing the weight w0 essentially changes the "steepness" of the sigmoid. That's useful, but what if you wanted the network to output 0 when x is 2? Just changing the steepness of the sigmoid won't really work -- you want to be able to shift the entire curve to the right.

That's exactly what the bias allows you to do. If we add a bias to that network, like so:
<center> 
<img src='https://drive.google.com/uc?id=1dq-UZIOvqTY6UGPz4xhltbvbhOjBdgFQ' width="200" height="150"/>
</center>

...then the output of the network becomes sig(w0*x + w1*1.0). Here is what the output of the network looks like for various values of w1:

<center> 
<img src='https://drive.google.com/uc?id=1SWAGE5wFFv_hDww9zF-MFPODiAOAQFmb' width="460" height="340"/>
</center>

Having a weight of -5 for w1 shifts the curve to the right, which allows us to have a network that outputs 0 when x is 2.
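
The effect of the bias can be checked numerically. The sketch below is only an illustration of the formula sig(w0*x + w1*1.0) from the text: the function names `sigmoid` and `neuron_output` are made up for this example, and w0 = 1.0 in the last line is an assumed value (the text only fixes w1 = -5).

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w0, w1=0.0):
    """Output of the single-input network from the text: sig(w0*x + w1*1.0)."""
    return sigmoid(w0 * x + w1 * 1.0)

x = 2.0

# Without a bias, changing w0 only changes the steepness:
# for positive w0 the output at x = 2 always stays above 0.5.
for w0 in (0.5, 1.0, 2.0):
    print(f"w0={w0}, no bias    -> {neuron_output(x, w0):.3f}")

# With a bias weight w1 = -5 the curve shifts to the right,
# so the output at x = 2 is close to 0 (w0 = 1.0 is an assumed value).
print(f"w0=1.0, w1=-5.0 -> {neuron_output(x, 1.0, -5.0):.3f}")
```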

In a regular neural network, neurons are organized in multiple layers that are fully connected (each neuron has connections to all activations in the previous layer).

<center> 
<img src='https://drive.google.com/uc?id=1HYUFSGpX4pCMykKIq8p8-qyoseWk_yYl' width="430" height="220"/>
</center>

**How does the model learn the parameters?**

<center> 
<img src='https://drive.google.com/uc?id=1TC2tUm7kdadsz_Pc-dvA8SlAZP6eXL3p' width="430" height="270"/>
</center>

Training our deep learning model means learning the values of its parameters (weights wij and biases bj) in an iterative process in which information flows forward through the network and the error flows back.

**Forward propagation** occurs when the network is exposed to a training data sample, which passes through the entire neural network so that its prediction (label) can be calculated. 
Each neuron applies its transformation to the information it receives from the neurons of the previous layer and sends the result to the neurons of the next layer. When the data has crossed all the layers, the final layer outputs the label prediction for the input example.

Next, a **loss function** measures the loss (or error), i.e. how good or bad our prediction was compared with the correct result (remember that we are in a supervised learning setting, so we have the label that tells us the expected value). Ideally, we want the loss to be zero, that is, no divergence between estimated and expected value. Therefore, as the model is being trained, the weights of the connections between neurons are gradually adjusted until good predictions are obtained.

Once the loss has been calculated, this information is propagated backwards. Hence its name: **backpropagation**. Starting from the output layer, the loss information propagates to all the neurons in the hidden layer that contribute directly to the output. However, the neurons of the hidden layer only receive a fraction of the total loss signal, based on the relative contribution that each neuron made to the original output. This process is repeated, layer by layer, until all the neurons in the network have received a loss signal that describes their relative contribution to the total loss.

The deep learning model wants to learn weights and biases that minimize the loss function.
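
To make the forward pass, loss, backpropagation and weight update concrete, here is a minimal sketch for a single sigmoid neuron trained with plain gradient descent. Everything specific in it is an assumption chosen for illustration: the toy data, the mean squared error loss, the learning rate of 0.5 and the 200 update steps. Keras performs the same kind of loop for us, with many more parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy supervised data (hypothetical): the label is 1 for the larger inputs, 0 otherwise
x = np.array([0.0, 1.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = 0.0, 0.0        # the parameters to learn (one weight, one bias)
learning_rate = 0.5    # assumed value
losses = []

for step in range(200):
    # Forward propagation: compute predictions from the current parameters
    y_hat = sigmoid(w * x + b)

    # Loss function: mean squared error between prediction and label (assumed choice)
    loss = np.mean((y_hat - y) ** 2)
    losses.append(loss)

    # Backpropagation: the chain rule gives the gradient of the loss w.r.t. w and b
    dloss_dyhat = 2.0 * (y_hat - y) / len(x)
    dyhat_dz = y_hat * (1.0 - y_hat)      # derivative of the sigmoid
    dz = dloss_dyhat * dyhat_dz
    grad_w = np.sum(dz * x)
    grad_b = np.sum(dz)

    # Gradient descent update: move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"learned w={w:.3f}, b={b:.3f}")
```

The loss printed at the end should be lower than the initial one, which is all that "learning" means here: the parameters have moved in the direction that reduces the error.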

**Neural Network vs. Deep Learning model**

Deep learning models are neural networks with multiple hidden layers, each containing many nodes. (Typically, a model with more than 2 hidden layers is called 'deep'.)

**But why choose Deep Learning instead of conventional machine learning algorithms?**

When each sample consists of a long list of input values, conventional machine learning algorithms typically require some feature selection prior to model training. The main advantage of deep learning models is that they do not necessarily need structured data and pre-obtained/selected features to classify the data. Deep learning models send the input through the successive layers of the network, with each layer hierarchically extracting specific features of the original input data.
<center> 
<img src='https://drive.google.com/uc?id=10mZ4aCmWrv56ymrPQI4llK_F4Nn3-JiX' width="490" height="300"/>
</center>

Moreover, deep learning models perform better in so-called 'big data' conditions, as they are well suited to learn complex feature patterns from large datasets and perform with higher accuracy on certain tasks (e.g.: image recognition/classification, speech recognition).

<center> 
<img src='https://drive.google.com/uc?id=12t2HoroX8Jf02a6IxsGzO-lzoWrpQ61T' width="450" height="300"/><br>copyright: Andrew Ng</br>
</center> 

## What are Convolutional Neural Networks (CNNs)?

D. H. Hubel and T. N. Wiesel proposed an explanation for the way in which mammals visually perceive the world around them using a layered architecture of neurons in the brain. In their hypothesis, within the visual cortex, complex functional responses generated by “complex cells” are constructed from simpler responses from “simple cells”. For instance, simple cells respond to oriented edges, while complex cells also respond to oriented edges but with a degree of spatial invariance.
<center> 
<img src='https://drive.google.com/uc?id=1026v3vAbACmAmCMIkVT8dh1AQY4t9UtT' width="720" height="240"/>
</center> 

This in turn inspired the architecture of deep convolutional neural networks, which combine **local connections, layering** and **spatial invariance (robustness to shifts of the input signal)**.
Convolutional networks have been tremendously successful in practical applications where the input data has grid-like topology (e.g. images).
<center> 
<img src='https://drive.google.com/uc?id=189HShi0OjE2FXoxUdKdDsHS44YNG2_4S' width="720" height="320"/>
</center> 

The CNN is a combination of two basic building blocks:

**The Convolution Block**: Consists of the **Convolution Layer** and the **Pooling Layer**. This block forms the essential component for feature extraction.

**The Fully Connected Block**: Consists of a fully connected simple neural network architecture. This block performs the task of classification based on the output of the convolution block.

**Convolution Layer**: An operation is applied to one matrix (the image matrix) using another, smaller matrix (the filter matrix, also called a kernel). The filter is placed on a patch of the image, each value in the patch is multiplied by the value of the corresponding cell in the filter matrix, and the products are summed into a single number. 
So how do we find a particular feature? We simply slide (convolve) the filter matrix over the image matrix and collect the resulting numbers in a new matrix, the feature map, whose values are large wherever the image patch resembles the filter.
<center> 
<img src='https://drive.google.com/uc?id=1-7Dd2WdH1-dnQCSt5MHB12P4nC9YeMac' width="520" height="300"/>
</center> 
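
The sliding-window operation described above can be written in a few lines of NumPy. This is a sketch for illustration only: the function name `convolve2d_valid`, the toy 5x5 image and the hand-picked edge filter are all assumptions, and (as in most deep learning libraries) the filter is not flipped, so strictly speaking this computes a cross-correlation. In a CNN the filter values are learned rather than chosen by hand.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise product of patch and filter, summed to a single value
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right
image = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Hand-picked 3x3 filter that responds to left-to-right increases in brightness
vertical_edge_filter = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d_valid(image, vertical_edge_filter)
print(feature_map)
# [[0. 3. 3.]
#  [0. 3. 3.]
#  [0. 3. 3.]]
```

Note that a 3x3 filter without padding shrinks each spatial dimension by 2 (here from 5x5 to 3x3).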

**Pooling Layer**: This layer summarizes each small set of values by a single value, usually the maximum or the average. This reduces the size of the output matrix. 
For example, MAX-POOLING takes the maximum of, say, each 2x2 part of the matrix. We thereby keep the value that signals the presence of a feature in that section of the image and discard the information about exactly where in the section it occurred. It is common to periodically insert a pooling layer between successive convolutional layers in a CNN architecture. Its function is to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network.
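
As a small illustration of 2x2 max pooling, the sketch below reduces a 4x4 matrix to 2x2. The function name `max_pool_2x2` and the example values are made up; the function assumes non-overlapping 2x2 windows and an input with even height and width.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; assumes even height and width."""
    h, w = feature_map.shape
    # Group the matrix into non-overlapping 2x2 blocks, then take the max of each block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 1, 7, 5],
])

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[6 2]
#  [2 9]]
```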

## Overview of exercises
Now we are going to implement a CNN to classify handwritten digits of the MNIST dataset. The labeled training set consists of 60000 images of 28x28 = 784 pixels each (one gray-scale digit per image) with the corresponding labels from 0,..,9; a further 10000 labeled images form the test set. We will normalize each image so that every pixel takes a value in the range [0,1]. <br>
</br>

This architecture will be implemented with Keras and TensorFlow. To prevent the network from overfitting during training, we use dropout and data augmentation, i.e. new images are generated from the original ones via rotation, translation and zooming. If we have time, we can also explore the application of cross-validation.
## Loading packages

In [26]:
import itertools
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

import tensorflow as tf

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.optimizers import Adam
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils, plot_model
from keras.layers import Conv2D, MaxPooling2D, ZeroPadding2D, GlobalAveragePooling2D
from keras.layers.advanced_activations import LeakyReLU 
from keras.preprocessing.image import ImageDataGenerator

np.random.seed(25)

##Load data and have a look at it

In [None]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print("X_train original shape", X_train.shape)
print("y_train original shape", y_train.shape)
print("X_test original shape", X_test.shape)
print("y_test original shape", y_test.shape)

In [None]:
fig = plt.figure()
for i in range(9):
  plt.subplot(3,3,i+1)
  plt.tight_layout()
  plt.imshow(X_train[i], cmap='gray', interpolation='none')
  plt.title("Digit: {}".format(y_train[i]))
  plt.xticks([])
  plt.yticks([])
fig

## Reshaping digit images
mnist.load_data() supplies the MNIST digits with shape (number of samples, 28, 28), i.e. with 2 dimensions per example, representing a 28x28 greyscale image.

However, we are going to work with 'Conv2D' layers in Keras, which are designed to handle 3 dimensions per example, so their inputs and outputs are 4-dimensional. With the default Keras data format ('channels_last'), a batch of colour images has shape (number of samples, height, width, number of channels). More importantly, the same layout covers deeper layers of the network, where each example has become a set of feature maps, i.e. (number of samples, height, width, number of features).

Handling greyscale input would require either a different layer design (or a constructor parameter to accept a different shape) or expressing the examples explicitly as 1-channel images for a standard layer. Keras takes the latter approach, which is why we reshape the data to (number of samples, 28, 28, 1).

In [None]:
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

X_train.shape

##Normalization
The pixel values of the 28x28 images range from 0 to 255: most background pixels are close to 0, while the pixels that make up the digit are close to 255. Normalizing the input data helps to speed up the training. It can also reduce the chance of getting stuck in poor local optima, since we are using a stochastic gradient-based optimizer to find the weights of the network.<br>
</br>
So let's normalize the pixel values to lie between 0 and 1.

In [None]:
X_train/=255
X_test/=255

X_train.shape

##One-hot encoding of image labels

Our model will output a probability distribution over the digits 0-9 as a vector, so we need to convert the labels with one-hot encoding to represent each digit as a vector of the same length:
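
To see what one-hot encoding produces, here is a small NumPy sketch. The helper `one_hot` and the three example labels are made up for this illustration; in the notebook itself the conversion is done by the Keras utility in the next cell.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row k gets a 1.0 in the column given by the k-th label
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example_labels = [5, 0, 4]          # hypothetical digit labels
encoded = one_hot(example_labels, num_classes=10)
print(encoded)
```

Each row contains exactly one 1.0, and taking the argmax of a row recovers the original label.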


In [None]:
number_of_classes = 10

Y_train = np_utils.to_categorical(y_train, number_of_classes)
Y_test = np_utils.to_categorical(y_test, number_of_classes)

y_train[0], Y_train[0]

##Building a basic CNN model

**Batch size** defines the number of samples that are propagated through the network before the weights are updated. 

One **epoch** is completed when the entire dataset has been passed forward and backward through the neural network exactly ONCE. Typically, more than one epoch is used, because passing the entire dataset through a neural network once is not enough. Keep in mind that we are using a limited dataset (however big it is, it's just a sample from a population) and we are often optimizing thousands or hundreds of thousands of parameters in a deep learning model in an iterative process. So, updating the weights with a single pass over the data is usually not enough.
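
The relation between batch size, epochs and the number of weight updates is simple arithmetic. The sketch below uses the values that appear in the training cells of this notebook (60000 training images, an 80% training split, batch size 128, 5 epochs); the variable names are chosen for this example only.

```python
import math

n_train = int(60000 * 0.8)   # 48000 samples remain after holding out 20% for validation
batch_size = 128
epochs = 5

# One weight update per batch, so this is also the number of updates per epoch
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 375 1875
```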

**Softmax function** turns logits (the numeric output of the last linear layer of a multi-class classification neural network) into probabilities by taking the exponential of each output and then dividing each by the sum of those exponentials, so that the entire output vector adds up to one.
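
A minimal NumPy sketch of softmax is shown below. The logits are made-up numbers, and subtracting the maximum before exponentiating is a common numerical-stability trick added here; it does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # avoids overflow in exp; the result is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])      # made-up scores for three classes
probs = softmax(logits)
print(probs, probs.sum())
```

The largest logit receives the largest probability, and the probabilities add up to one.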

In [9]:
def build_cnn():
    # 1. Convolution
    # 2. Activation
    # 3. Pooling
    # Repeat Steps 1,2,3 for adding more hidden layers
    model = Sequential()

    model.add(Conv2D(32, (3, 3), input_shape=(28,28,1)))
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))

    model.add(Conv2D(64,(3, 3)))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))

    model.add(Flatten())

    # Fully connected block
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(Dense(10))

    model.add(Activation('softmax'))

    return model

In [10]:
model = build_cnn()

In [None]:
model.summary()

In [None]:
plot_model(model, to_file='model.png', show_shapes = True, show_layer_names = True)

Now we need to compile our model and define a few last bits, namely: the optimizer of choice and the loss function that we are going to use.

**Adam optimizer** is an adaptive learning rate method, which means that it computes individual learning rates for different parameters. The amount by which the weights are updated during training is governed by the step size or “learning rate”. The learning rate is a configurable hyperparameter with a small positive value, often in the range between 0.0 and 1.0. Calling Adam() without arguments uses the Keras default learning rate of 0.001.

We are going to use the **Categorical crossentropy** loss. If we use this loss, we will train a CNN to output a probability over the classes for each image. This loss is used for multi-class classification.
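
For a single image with one-hot label y and predicted probabilities p, the categorical cross-entropy is -sum_c y_c * log(p_c), i.e. the negative log of the probability assigned to the true class. The sketch below illustrates this with made-up predictions; the function name and the small clipping constant `eps` are assumptions for this example, not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and a predicted probability vector."""
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 0.0, 1.0])              # true class is index 2
confident_right = np.array([0.05, 0.05, 0.90])  # made-up predictions
confident_wrong = np.array([0.90, 0.05, 0.05])

loss_right = categorical_crossentropy(y_true, confident_right)
loss_wrong = categorical_crossentropy(y_true, confident_wrong)
print(loss_right)   # small loss: high probability on the true class
print(loss_wrong)   # large loss: low probability on the true class
```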

In [14]:
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

We have to divide the current training set further into the final training set (the first 80% of the samples) and a validation set (the remaining 20%).

In [15]:
size = int(len(X_train) * 0.8)

train_x, val_x = X_train[:size], X_train[size:]
train_y, val_y = Y_train[:size], Y_train[size:]

In [None]:
history_basic = model.fit(train_x, train_y, batch_size=128, epochs=5, validation_data=(val_x, val_y))


In [None]:
score = model.evaluate(X_test, Y_test)
print()
print('Test accuracy: ', score[1])

In [None]:
history_basic.history.keys() 

In [None]:
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history_basic.history['accuracy'])
plt.plot(history_basic.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
# the second curve is computed on the validation split, not on the test set
plt.legend(['train', 'validation'], loc='lower right')

plt.subplot(2,1,2)
plt.plot(history_basic.history['loss'])
plt.plot(history_basic.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')

plt.tight_layout()

And finally we plot the confusion matrix:

In [None]:
y_pred = model.predict(X_test)
Y_pred_classes = np.argmax(y_pred,axis=1) 
Y_true = np.argmax(Y_test,axis=1)

confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Normalize first so that the colours and the printed numbers agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure(figsize = (5,5))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    # Integer counts by default, two decimals for row-normalized values
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plot_confusion_matrix(confusion_mtx, classes = [0,1,2,3,4,5,6,7,8,9]) 
plt.show()

##The next level: enhancing the training set with data augmentation

In [21]:
# random rotations, shifts (translations), shear and zoom applied to the training images
gen = ImageDataGenerator(rotation_range=8, width_shift_range=0.08, shear_range=0.3,
                         height_shift_range=0.08, zoom_range=0.08)
# the test images are passed through unchanged (no augmentation)
test_gen = ImageDataGenerator()
# generators that yield batches of 64 images with their labels
train_generator = gen.flow(X_train, Y_train, batch_size=64)
test_generator = test_gen.flow(X_test, Y_test, batch_size=64)

In [None]:
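# Model.fit accepts generators directly in TensorFlow 2.1 and later; fit_generator is deprecated.
# Note: this continues training the model that was already fitted above.
history_augmented = model.fit(train_generator, steps_per_epoch=60000//64, epochs=5,
                    validation_data=test_generator, validation_steps=10000//64)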

In [None]:
score = model.evaluate(X_test, Y_test)
print()
print('Test accuracy: ', score[1])

In [None]:
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history_augmented.history['accuracy'])
plt.plot(history_augmented.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='lower right')

plt.subplot(2,1,2)
plt.plot(history_augmented.history['loss'])
plt.plot(history_augmented.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')

plt.tight_layout()

In [None]:
y_pred = model.predict(X_test)
Y_pred_classes = np.argmax(y_pred,axis=1) 
Y_true = np.argmax(Y_test,axis=1)

confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 

# plot_confusion_matrix was defined in the earlier confusion-matrix cell and is reused here

plot_confusion_matrix(confusion_mtx, classes = [0,1,2,3,4,5,6,7,8,9]) 
plt.show()

## +1 Exercise: using K-Fold cross-validation

In [None]:
num_folds = 10
# per-fold metric containers
acc_per_fold = []
loss_per_fold = []

inputs = np.concatenate((X_train, X_test), axis=0)
targets = np.concatenate((y_train, y_test), axis=0)

fold = 1
kf = KFold(n_splits=num_folds)
for train, test in kf.split(inputs, targets):
    model_fold = build_cnn()
    model_fold.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

    y_train_fold = targets[train]
    y_test_fold = targets[test]

    number_of_classes = 10
    Y_train_fold = np_utils.to_categorical(y_train_fold, number_of_classes)
    Y_test_fold = np_utils.to_categorical(y_test_fold, number_of_classes)

    print(f'Training for fold {fold} ...')

    X_train_fold = inputs[train]
    X_test_fold = inputs[test]

    size = int(len(inputs[train]) * 0.8)

    train_x_fold, val_x_fold = X_train_fold[:size], X_train_fold[size:]
    train_y_fold, val_y_fold = Y_train_fold[:size], Y_train_fold[size:]

    history_fold = model_fold.fit(train_x_fold, train_y_fold, batch_size=128, epochs=5, validation_data=(val_x_fold, val_y_fold))
  
    # Generate generalization metrics
    scores = model_fold.evaluate(X_test_fold, Y_test_fold, verbose=0)
    print(f'Score for fold {fold}: {model_fold.metrics_names[0]} of {scores[0]}; {model_fold.metrics_names[1]} of {scores[1]*100}%')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])

    fold = fold + 1

In [None]:
print('Score per fold')
for i in range(0, len(acc_per_fold)):
  print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')