# 4.3 Convolutional Neural Networks

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import h5py


import torch
import torch.nn as nn


Just like in MLPs, each neuron in a CNN receives some inputs, performs a dot product, and passes the result through an activation function. CNNs are designed to take images as inputs. Images have too many pixels for fully connected layers: connecting every pixel to every neuron would require too many trainable parameters. CNN layers are therefore not 1D hidden layers; they are volumes of neurons.
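
To make the parameter argument concrete, here is a rough count for a single layer. The sizes are illustrative assumptions (a 32x32 RGB image, a dense layer with 64 units, a convolutional layer with 64 filters of size 3x3), not values taken from a specific network.

```python
# Illustrative sizes (assumptions for this sketch)
height, width, channels = 32, 32, 3
n_units = 64          # neurons in the dense layer
n_filters, k = 64, 3  # number of filters and kernel size in the conv layer

# Dense layer: one weight per (input value, neuron) pair, plus one bias per neuron
dense_params = height * width * channels * n_units + n_units

# Conv layer: one k x k x channels kernel per filter, plus one bias per filter
conv_params = k * k * channels * n_filters + n_filters

print(f"dense: {dense_params:,} parameters")
print(f"conv:  {conv_params:,} parameters")
```

The convolutional layer reuses the same small kernels at every position of the image, which is why its parameter count does not depend on the image size.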

![Convolutional Kernel](../img/cnn2.jpeg)
Figure: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers.


The classic sequencing of a CNN (or a block) is: Convolutional Layer, Activation Layer, Pooling Layer, Fully Connected Layer.


<!-- <img src="figures/Convolution.png" alt="cnn" style="width: 400px;"/> -->
## 1.1 Anatomy of the Conv layer
* The **input** "feature map" is the input data for a given layer. It is 3D for a 2D convolution, with dimensions *height*, *width*, and *depth/channels*. For instance, the height and width are those of the 2D image (e.g. 32x32 pixels), and the depth can be the color channels (one image per color R, G, B). For *1D convolutions*, the input data should be ordered as **(batch size, number of channels, length of signal)**. For *2D convolutions*, the input size is **(batch size, number of channels, pixel height, pixel width)**.
 
* The **filters** are the **convolution kernels**; their number sets the dimensionality (depth) of the output space. In general, the number of filters is greater than the number of input channels: the heights and widths of the feature maps usually decrease through the network, so increasing the depth (number of filters) does not increase the complexity too much.
* The **kernel_size** is a pair of integers that specifies the height and width of the 2D convolution kernel. If both integers are equal, a single integer value is enough. The kernel is a filter whose weights are applied to the input values; each kernel has one bias. Let's take an example with a 3x3 input feature map, a 2x2 kernel, and a 2x2 output feature map. The parts of the maps highlighted in blue are those we focus on:

![Convolution Kernel](../img/convolution.png)
 <!-- width="600"></center> -->
Convolution with kernel window (Fig. 6.2.1 from Dive into Deep Learning).

In the example above, we perform the following operation: the top-left output value is 0×0+1×1+3×2+4×3=19. We then slide the kernel and repeat the process until all elements of the output map are filled. We also repeat this for each filter. In practice, many small kernels/filters are usually preferred over a few large ones.
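
The sliding-window operation can be sketched in a few lines of NumPy. The input and kernel values are assumed to be those of the cited Dive into Deep Learning figure (a 3x3 input holding 0 to 8 and a 2x2 kernel holding 0 to 3); `corr2d` is a helper name chosen for this sketch.

```python
import numpy as np

def corr2d(x, k):
    """Slide kernel k over x (stride 1, no padding) and sum the products."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(9, dtype=float).reshape(3, 3)  # assumed input values 0..8
k = np.arange(4, dtype=float).reshape(2, 2)  # assumed kernel values 0..3
out = corr2d(x, k)
print(out)  # the top-left entry is 19, as in the text
```

A convolutional layer does the same thing for every filter and every input channel, and adds the bias of the filter to each output value.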


![Convolutional Kernel](../img/CNN_StanfordCS230.gif)



* The **stride** is the step by which the kernel moves across the input when applying the filters. Small strides work better in practice. A larger stride is one way to reduce the size of the feature map, but the most popular choice for this is to use ***maxpooling***.


* The **padding** is set to either ``same`` or ``valid`` depending on whether the edges of the feature map are extended and filled with zeros (same) so that the output keeps the size of the input, or whether the kernel is only applied where it fits entirely within the input (valid). In PyTorch this is the ``padding`` argument; ``padding_mode`` instead selects how the padded values are filled (zeros by default). Prefer ``same``, especially for the hidden conv layers, to avoid losing data and feature knowledge at the edges.
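
How kernel size, stride, and padding together set the output size can be sketched with the standard formula floor((n + 2p - k) / s) + 1 along each dimension, where n is the input size, k the kernel size, s the stride, and p the number of zeros added on each side. The helper name `conv_output_size` is an assumption of this sketch.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension for input size n and kernel size k."""
    return (n + 2 * padding - k) // stride + 1

# 3x3 input, 2x2 kernel, no padding: 2x2 output
small = conv_output_size(3, 2)
# 28-pixel input, 5x5 kernel, 2 zeros on each side: the size is preserved
same = conv_output_size(28, 5, padding=2)
# same input and kernel without padding ("valid"): the edges are lost
valid = conv_output_size(28, 5)
# a stride of 2 roughly halves the size
strided = conv_output_size(28, 3, stride=2, padding=1)

print(small, same, valid, strided)
```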


The convolution can be 1D or 2D depending on the array input:

When the input is a single dimension array (vector, time series), use a [Conv1d layer](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html#torch.nn.Conv1d).

When the input is a 2D image (2D array) or a time series with multiple channels (for example, a seismogram with 3 components), use [Conv2d layers](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d).

Here is an example of a 32x32 image with three channels (RGB) sent to a layer with 64 output channels/filters and a kernel size of 6 pixels. The padding is "same", such that the edges of the feature maps are filled with zeros. We write the layer in PyTorch.


A great lecture about CNNs is the one given at Stanford, [CS230](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks).

In [None]:
# PyTorch version: 3 input channels (RGB), 64 filters, 6x6 kernels, zero ("same") padding
torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=6, padding="same")
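
As a quick check of the input and output shapes, the sketch below passes a random batch (an assumption standing in for real images) through the layer with and without zero padding.

```python
import torch
import torch.nn as nn

# Random batch: (batch size, channels, height, width)
images = torch.rand(4, 3, 32, 32)

conv_valid = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=6)
conv_same = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=6, padding="same")

out_valid = conv_valid(images)
out_same = conv_same(images)
print(out_valid.shape)  # height and width shrink to 32 - 6 + 1 = 27
print(out_same.shape)   # height and width stay at 32
```

In both cases the depth of the output equals the number of filters (64), regardless of the number of input channels.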

## 1.2 Pooling layers

**MaxPooling** layers are downsampling layers. They output the max value of each window in each channel of a feature map. Downsampling reduces the size of the feature map and induces spatial-filter hierarchies by making successive convolution layers look at increasingly large windows (in terms of the fraction of the original input they cover). The pooling size is the factor of reduction in the layer size.


![Max Pooling](../img/max_pooling.png)

Maximum pooling (Fig. 6.5.1 from Dive into Deep Learning).

<!-- 114 / 2 = 57 and 80 / 2 = 40 -->

**General pooling** layers use other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.
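
Both kinds of pooling can be sketched with NumPy on a toy 4x4 feature map (the values and the helper name `pool2d` are assumptions of this example). The window and the stride are both equal to the pooling size, so the map shrinks by that factor.

```python
import numpy as np

def pool2d(x, size, reduce=np.max):
    """Non-overlapping pooling: window and stride are both `size`."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # Split the map into (h, size, w, size) blocks, then reduce each block
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return reduce(blocks, axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(x, 2, np.max)
avg_pooled = pool2d(x, 2, np.mean)
print(max_pooled)
print(avg_pooled)
```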


Pooling is designed to reduce the complexity and model size in order to deal with overfitting and computational expense. Some argue that striding is sufficient. Some also found that auto-encoders and generative adversarial networks perform better without pooling.


## 1.3 Other notes

In the convolutional layer, the neurons are not connected to every part of the input data.

A dense layer learns global patterns. A convolution layer learns local patterns. Because of that, CNNs are **translation invariant**: they learn a pattern from one part of the image or time series and recognize it elsewhere. CNNs learn **hierarchical patterns**: a first layer learns local patterns, and a second layer combines these local features to create broader-scale features.
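
A small 1D sketch of this behavior, using a toy pulse and a hand-picked kernel (both assumptions of this example): shifting the input shifts the convolution output by the same amount, and taking the maximum over the output then gives the same value wherever the pulse sits.

```python
import numpy as np

kernel = np.array([1.0, 2.0, 1.0])  # hand-picked pattern detector

signal = np.zeros(20)
signal[5:8] = [1.0, 2.0, 1.0]       # a pulse starting at sample 5
shifted = np.roll(signal, 4)        # the same pulse, 4 samples later

# np.correlate slides the kernel without flipping it, like a conv layer
response = np.correlate(signal, kernel, mode="valid")
response_shifted = np.correlate(shifted, kernel, mode="valid")

print(response.argmax(), response_shifted.argmax())  # the peak moves with the pulse
print(response.max(), response_shifted.max())        # the peak value is unchanged
```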



# 2 Practice on LeNet-5 network

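We will create a CNN LeNet architecture (LeCun et al, 1998) to classify images from the MNIST data set of handwritten digits, which is the data set loaded in the cell below. We will write it in PyTorch; a Keras/TensorFlow version is kept as commented code for reference.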

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import load_digits,fetch_openml
from sklearn.preprocessing import StandardScaler
from torch.utils.data.sampler import SubsetRandomSampler
from torchvision.transforms import transforms, ToTensor, Compose,Normalize
from sklearn.model_selection import train_test_split
from torchvision import datasets
import os

dataset = datasets.MNIST(root="./",download=True,
 transform=Compose([ToTensor(),Normalize([0.5],[0.5])]))
L=len(dataset)
Lt = int(0.8*L)

train_set, val_set = torch.utils.data.random_split(dataset, [Lt,L-Lt])
loaded_train = DataLoader(train_set, batch_size=50)
loaded_test = DataLoader(val_set, batch_size=50)
print(loaded_train)

X, y = next(iter(loaded_train))
print(X.shape)

### 2.1 Create the model

We will create a CNN LeNet-5 architecture (LeCun et al, 1998) as a sequential stack of 2 convolutional layers, each followed by a pooling layer, and 3 fully connected layers. There are several graphical representations of networks that we often find in the literature, with a few examples below.

![LeNet](../img/lenet.svg)
LeNet-5 architecture

![Le-net](../img/lenet-vert.svg)

In words, the CNN takes an input map of 28x28 pixels; the images are in gray scale, so there is a single channel. It is followed by a convolution layer with kernels of size 5x5 whose output has size 28x28 and depth 6 (# of channels), a pooling layer of size 2 and stride 2, a conv layer of depth 16 (or 16 channels) and kernel size 5x5, another pooling layer of size 2 and stride 2, and then 3 fully connected (dense) layers of respective sizes 120, 84, 10. The activation functions in the original LeNet-5 were sigmoids and the last activation function was a Gaussian function, which we replaced with softmax. In the PyTorch version below, the softmax is not an explicit layer because `nn.CrossEntropyLoss` applies it internally to the raw outputs. One can test the role of activation functions by changing them to ReLU.
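
One way to try different activation functions is to make the activation a parameter of the model. The helper `make_lenet` below is a hypothetical name for this sketch; it follows the layer sizes described above and expects inputs already shaped as (batch, 1, 28, 28).

```python
import torch
import torch.nn as nn

def make_lenet(activation=nn.Sigmoid):
    """Build a LeNet-style stack; `activation` is a layer class such as nn.ReLU."""
    return nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, padding=2), activation(),
        nn.AvgPool2d(kernel_size=2, stride=2),
        nn.Conv2d(6, 16, kernel_size=5), activation(),
        nn.AvgPool2d(kernel_size=2, stride=2),
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), activation(),
        nn.Linear(120, 84), activation(),
        nn.Linear(84, 10),
    )

lenet_relu = make_lenet(nn.ReLU)
scores = lenet_relu(torch.rand(2, 1, 28, 28))  # random stand-in for two images
n_params = sum(p.numel() for p in lenet_relu.parameters())
print(scores.shape, n_params)
```

The number of trainable parameters does not depend on the choice of activation, so the sigmoid and ReLU variants can be compared at equal model size.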

In [None]:
# Implementation in Keras
# model = keras.Sequential([
# # Must define the input shape in the first layer of the neural network
# keras.layers.Conv2D(filters=6, kernel_size=5, padding='same', activation='sigmoid', input_shape=(28,28,1)),
# keras.layers.AveragePooling2D(2), # you could replace with MaxPooling2D
# keras.layers.Conv2D(filters=16, kernel_size=5, padding='same', activation='sigmoid'),
# keras.layers.AveragePooling2D(2),# you could replace with MaxPooling2D
# keras.layers.Flatten(),
# keras.layers.Dense(120, activation='sigmoid'),
# # keras.layers.Dropout(0.5),
# keras.layers.Dense(84, activation='sigmoid'),
# # keras.layers.Dropout(0.5),
# keras.layers.Dense(10, activation='softmax')])
# # Take a look at the model summary
# model.summary()

Implementation in PyTorch

In [None]:
class Reshape(torch.nn.Module):
    def forward(self, x):
        return x.view(-1, 1, 28, 28)

In [None]:
model_lenet= torch.nn.Sequential(Reshape(), nn.Conv2d(1, 6, kernel_size=5,
                                               padding=2), nn.Sigmoid(),
                          nn.AvgPool2d(kernel_size=2, stride=2),
                          nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
                          nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
                          nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
                          nn.Linear(120, 84), nn.Sigmoid(), nn.Linear(84, 10))

In [None]:
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
print('Initial input shape: \t', X.shape)
for layer in model_lenet:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape: \t', X.shape)

### 2.2 Prepare training
Choose the training parameters:
* error metric: accuracy
* loss function: *cross-entropy* for multiclass classification
* batch size
* number of epochs (passes over the training set)
* optimizer: Adam


In [None]:
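alpha = 0.005  # learning rate lr
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_lenet.parameters(), lr=alpha)
# Note: the train() function defined below creates its own loss and optimizer,
# so pass learning_rate=alpha to train() to use this value.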
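
`nn.CrossEntropyLoss` expects raw scores (logits) and integer class labels, and combines a log-softmax with the negative log-likelihood. A small sketch with random scores (an assumption in place of real model outputs) illustrates the equivalence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)              # 5 samples, 10 classes, raw scores
labels = torch.tensor([3, 0, 9, 1, 4])   # integer class labels

loss_ce = nn.CrossEntropyLoss()(logits, labels)

# Same quantity written out: mean of -log(softmax) at the true class
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(5), labels].mean()

print(loss_ce.item(), loss_manual.item())
```

This is why the last layer of the PyTorch model can be a plain `nn.Linear` without a softmax.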

In [None]:
# Set a fixed random number seed for reproducibility
torch.manual_seed(42)

### 2.3 Train the model
Plot the accuracy scores as a function of epochs to see how well we train.

In [None]:
def train(model, n_epochs, trainloader, testloader=None, learning_rate=0.001):
    """Train `model`; return the loss per epoch (and the accuracy if `testloader` is given)."""
    os.makedirs('lenet_checkpoint', exist_ok=True)
    # Define loss and optimization method
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Save loss and accuracy for plotting
    loss_time = np.zeros(n_epochs)
    accuracy_time = np.zeros(n_epochs)

    # Loop on number of epochs
    for epoch in range(n_epochs):
        model.train()  # training mode (matters for layers such as dropout)
        # Initialize the loss
        running_loss = 0
        # Loop on mini-batches in the train set
        for data in trainloader:
            # Get the batch and cast it to the types PyTorch expects
            inputs, labels = data[0], data[1]
            inputs = inputs.float()
            labels = labels.long()
            # Set the parameter gradients to zero
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            # Propagate the loss backward
            loss.backward()
            # Update the parameters
            optimizer.step()
            # Add the value of the loss for this batch
            running_loss += loss.item()
        # Save the mean batch loss at the end of each epoch
        loss_time[epoch] = running_loss / len(trainloader)

        # Save a checkpoint (overwritten at every epoch)
        checkpoint = {
            'epoch': epoch + 1,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
        f_path = './lenet_checkpoint/checkpoint.pt'
        torch.save(checkpoint, f_path)

        # After each epoch, evaluate the performance on the test set
        if testloader is not None:
            model.eval()  # evaluation mode
            correct = 0
            total = 0
            # We evaluate the model, so we do not need the gradient
            with torch.no_grad():  # context manager that disables gradient calculation
                # Loop on mini-batches in the test set
                for data in testloader:
                    inputs, labels = data[0], data[1]
                    inputs = inputs.float()
                    labels = labels.long()
                    # Use the model on the test batch
                    outputs = model(inputs)
                    # Compare predicted label and true label
                    _, predicted = torch.max(outputs, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
            # Save the accuracy (in percent) at the end of each epoch
            accuracy_time[epoch] = 100 * correct / total
            # Print intermediate results on screen
            print('[Epoch %d] loss: %.3f - accuracy: %.3f' %
                  (epoch + 1, running_loss / len(trainloader), 100 * correct / total))
        else:
            print('[Epoch %d] loss: %.3f' %
                  (epoch + 1, running_loss / len(trainloader)))

    # Return the history of loss and test accuracy
    if testloader is not None:
        return (loss_time, accuracy_time)
    else:
        return (loss_time)
        

In [None]:
(loss, accuracy) = train(model_lenet, 3,loaded_train, loaded_test)
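
The checkpoint written at each epoch can be used to resume training or to reuse the trained weights. A minimal sketch, assuming the dictionary keys used in `train` above and a small stand-in model saved to a temporary folder (so that the example runs on its own):

```python
import os
import tempfile
import torch
import torch.nn as nn

# Stand-in model and optimizer (assumptions for a self-contained example)
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Save a checkpoint with the same keys as in the training loop
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"epoch": 3,
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict()}, path)

# Restore into freshly created objects
restored = nn.Linear(4, 2)
restored_optimizer = torch.optim.Adam(restored.parameters(), lr=0.001)
checkpoint = torch.load(path)
restored.load_state_dict(checkpoint["state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"]
print("resume from epoch", start_epoch)
```

In the notebook, `model_lenet` and `./lenet_checkpoint/checkpoint.pt` would take the place of the stand-in model and the temporary path.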

In [None]:
fig, ax1 = plt.subplots()
print(len(loss))
color = 'tab:red'
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss', color=color)
ax1.plot(np.arange(1, len(loss)+1), loss, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()

color = 'tab:blue'
ax2.set_ylabel('Accuracy (%)', color=color)
ax2.plot(np.arange(1, len(accuracy)+1), accuracy, color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()
plt.show()

# 3. Example on seismic data

In this class, we will use a simplified version of ConvNetQuake (Perol et al, 2018). The network was designed as a classification algorithm that detects seismic events and assigns their location to spatial clusters. The earthquakes from a previously known earthquake catalog were clustered using k-means. We will use station seismograms already labeled as "earthquakes" or "noise" to train a binary classifier. The cells below load station OK029; the lines for a second station, OK027, are commented out.

![ConvNetQuake](../img/ConvNetQuake.jpg)
<!-- <img src="figures/ConvNetQuake.jpg" alt="ConvNetQuake" style="width: 400px;"/> -->


### 3.1 Read the data
Download the data and place it in a "data" folder.

In [None]:
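import wget
# Download the OK029 templates and move them into the data folder
os.makedirs("../../data", exist_ok=True)
wget.download("https://www.dropbox.com/s/vi9gmjy8d4zd5jv/templates_029.h5?dl=1")
os.replace("templates_029.h5", "../../data/templates_029.h5")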

In [None]:
# load OK029 template data:
with h5py.File("../../data/templates_029.h5", "r") as f:
    eq1 = np.asarray(f['earthquakes']);neq1=eq1.shape[0]
    no1 = np.asarray(f["noise"])


# # load OK027 template data:
# with h5py.File("./data/templates_027.h5", "r") as f:
#     eq2 = np.asarray(f['earthquakes'])
#     no2 = np.asarray(f["noise"])

### 3.2 Prep the data

In [None]:
#  allocate memory
quakes=np.zeros(shape=(eq1.shape[0],1000,3),dtype=np.float32)
noise=np.zeros(shape=(no1.shape[0],1000,3),dtype=np.float32)
# quakes2=np.zeros(shape=(eq2.shape[0],1000,3),dtype=np.float32)
# noise2=np.zeros(shape=(no2.shape[0],1000,3),dtype=np.float32)

# Normalize the seismograms to their peak amplitudes
for iq in range(eq1.shape[0]):
    for ic in range(3):
        if np.max(np.abs(eq1[iq,ic,:]))>0:
            quakes[iq,:,ic]=eq1[iq,ic,:]/np.max(np.abs(eq1[iq,ic,:]))
            
for iq in range(no1.shape[0]):
    for ic in range(3):
        if np.max(np.abs(no1[iq,ic,:]))>0:
            noise[iq,:,ic]=no1[iq,ic,:]/np.max(np.abs(no1[iq,ic,:]))

# for iq in range(eq2.shape[0]):
#     for ic in range(3):
#         if np.max(np.abs(eq2[iq,ic,:]))>0:
#             quakes2[iq,:,ic]=eq2[iq,ic,:]/np.max(np.abs(eq2[iq,ic,:]))
            
# for iq in range(no2.shape[0]):
#     for ic in range(3):
#         if np.max(np.abs(no2[iq,ic,:]))>0:
#             noise2[iq,:,ic]=no2[iq,ic,:]/np.max(np.abs(no2[iq,ic,:]))

# select traces whose first sample (first component) is nonzero and finite
iq1=np.where( ( np.abs(quakes[:,0,0])>0)&(np.isfinite(quakes[:,0,0])))[0]
# iq2=np.where( (np.abs(quakes2[:,0,0])>0)&(np.isfinite(quakes2[:,0,0])))[0]
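
The peak normalization written with loops above can also be expressed with array operations. The sketch below uses a synthetic array as a stand-in for the HDF5 data (an assumption, with the same (traces, components, samples) layout), including one all-zero channel to show that it is left at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the HDF5 array: (n_traces, 3 components, 1000 samples)
raw = rng.normal(size=(5, 3, 1000))
raw[2, 1, :] = 0.0  # one dead channel to exercise the zero-peak case

# Peak absolute amplitude of every trace and component
peak = np.max(np.abs(raw), axis=-1, keepdims=True)
# Divide by the peak where it is nonzero; dead channels stay at zero
normalized = np.divide(raw, peak, out=np.zeros_like(raw), where=peak > 0)
# Reorder to (n_traces, samples, components), as in the loops above
normalized = normalized.transpose(0, 2, 1).astype(np.float32)
print(normalized.shape)
```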

In [None]:
# label & data (station OK029 only; the OK027 arrays quakes2/noise2/iq2 are commented out above)
# As in the original selection, the noise traces are taken at the same indices iq1
y = np.concatenate((np.ones(len(iq1), dtype=int), np.zeros(len(iq1), dtype=int)))  # 0 for noise, 1 for event
X = np.concatenate((quakes[iq1, :, :], noise[iq1, :, :]), axis=0)
X = X[..., None]  # add the depth/channel dimension

nlabels=2 # = len(np.unique(y))

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

### 3.3 Define ML model
ConvNetQuake is a simple stack of 8 conv2d layers with 32 channels, a stride of 2, a kernel size of 3x3, ReLU activation functions, and "same" (zero) padding.
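
One possible way to write such a stack in PyTorch is sketched below. It is an illustration under several assumptions, not the published implementation: the input is arranged as (batch, 1, 1000 samples, 3 components), the zero padding is written as `padding=1` because PyTorch does not accept `padding="same"` together with a stride other than 1, and a single fully connected layer with 2 outputs (earthquake or noise) is added at the end. The name `make_convnetquake` is hypothetical.

```python
import torch
import torch.nn as nn

def make_convnetquake(n_classes=2):
    """8 strided conv layers followed by one fully connected layer."""
    layers = []
    in_channels = 1
    for _ in range(8):
        layers += [nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
                   nn.ReLU()]
        in_channels = 32
    # For a 1000 x 3 input, the 8 strided layers leave a 4 x 1 map with 32 channels
    layers += [nn.Flatten(), nn.Linear(32 * 4 * 1, n_classes)]
    return nn.Sequential(*layers)

model_quake = make_convnetquake()
batch = torch.rand(16, 1, 1000, 3)  # random stand-in for normalized seismograms
scores = model_quake(batch)
print(scores.shape)
```

The arrays prepared above are stored as (N, 1000, 3, 1), so they would need `torch.from_numpy(X).permute(0, 3, 1, 2)` before being passed to this model.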

In [None]:
# try here

In [None]:
# write model training function

In [None]:
# train the model


In [None]:
# plot the learning curve

In [None]:
# test the model

# 4. How to read and recode published networks

Let us say that you read a research paper explaining the architecture of the convolutional neural network used by the authors to carry out their data analysis. How will you try to reproduce their results? They do not provide a github!

Let us look at the following paper:

Rouet-Leduc, B., Hulbert, C., McBrearty, I. W., Johnson, P. A. (2020). Probing slow earthquakes with deep learning. Geophysical Research Letters, 47, e2019GL085870. [https://doi.org/10.1029/2019GL085870](https://doi.org/10.1029/2019GL085870).

![CNN RouetLeDuc 2020](../img/cnn_rouet-leduc.png)
<!-- <img src="figures/cnn_rouet-leduc.png" width="600"> -->
Schematic of the CNN and its architecture (Figure 1 from Rouet-Leduc et al., 2020).

* **Batch Normalization** => unclear from the paper, but this seems to be the normalization of the data
* **Dropout** => unclear from the paper what this is
* **Input** Spectrogram = Image with 129 x 95 x1 pixels
* **Conv2D** with a kernel size of 16x16; the output feature map has size 114x80 and depth 32 (# of channels); the activation is ReLU (found in the supplementary material)
* **Maxpooling** of size 2
* **Dropout** 5%, found in the supplementary material
* **Conv2D** of kernel size 8 x 8, depth 64
* **Maxpooling** of size 2
* **Dropout** 5%, found in the supplementary material
* **Fully Connected - Dense layer** with 36608 neurons (found in the supplementary material)
* **Fully Connected - Dense layer** with 10 neurons (found in the supplementary material)
* **Fully Connected - Dense layer** with 1 neuron, sigmoid activation function (found in the supplementary material)
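
When recoding a network, it helps to check that the listed sizes are mutually consistent. The sketch below traces the feature-map size through the layers, assuming no padding, a stride of 1 for the convolutions, and non-overlapping pooling. It also assumes the third convolution (kernel size 4, 128 channels, no pooling) that appears in the PyTorch code for this network later in this notebook but not in the list above.

```python
def conv(size, k):
    """Output (height, width) of an unpadded, stride-1 convolution."""
    return (size[0] - k + 1, size[1] - k + 1)

def pool(size, p):
    """Output (height, width) of non-overlapping pooling of size p."""
    return (size[0] // p, size[1] // p)

size = (129, 95)          # input spectrogram
size = conv(size, 16)     # first convolution, depth 32
first_conv = size
size = pool(size, 2)
size = conv(size, 8)      # second convolution, depth 64
size = pool(size, 2)
size = conv(size, 4)      # assumed third convolution, depth 128
n_flat = 128 * size[0] * size[1]

print("after first conv:", first_conv)
print("before flattening:", size)
print("flattened size:", n_flat)
```

Under these assumptions the flattened size matches the 36608 quoted for the first dense layer, which suggests that this number is the size of the flattened feature map.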

## 5. Tuning CNN networks

There are many hyperparameters and model choices to make:
* training: learning rate, optimizer, batch_size, loss functions, regularization
* architecture: number of layers, depth of kernels, activation functions, batch normalization

One can treat the hyperparameter search as an optimization problem. In fact, there is an entire research field about **Neural Architecture Search**. One can implement this by performing a **grid search** over the model architecture hyperparameters and picking the best performing model.

Keras tuner (https://keras-team.github.io/keras-tuner/) can be used to randomize the grid search.
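
A grid search itself takes only a few lines of plain Python. In the sketch below, the search space and the `evaluate` function are hypothetical: `evaluate` returns a toy score so that the example runs without training, and would be replaced by a function that trains a model with the given settings and returns its validation accuracy.

```python
import itertools

# Hypothetical search space
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64],
    "n_filters": [16, 32],
}

def evaluate(params):
    """Placeholder score; replace with training + validation accuracy."""
    # Toy formula so that the example runs without training (not a real result)
    return -abs(params["learning_rate"] - 1e-3) + params["n_filters"] / 1000

best_score, best_params = float("-inf"), None
n_tried = 0
# Try every combination of the listed values
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = evaluate(params)
    n_tried += 1
    if score > best_score:
        best_score, best_params = score, params

print(n_tried, "combinations tried; best:", best_params)
```

The number of combinations grows multiplicatively with each added hyperparameter, which is why randomized searches are often preferred for larger spaces.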

<!-- http://caffe.berkeleyvision.org/model_zoo.html -->

In [None]:
model = torch.nn.Sequential(nn.Conv2d(in_channels=1, out_channels=32, kernel_size=16),
                            nn.ReLU(),
                            nn.MaxPool2d(kernel_size=2),
                            nn.Dropout(0.05),
                            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=8),
                            nn.ReLU(),
                            nn.MaxPool2d(kernel_size=2),
                            nn.Dropout(0.05),
                            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=4),
                            nn.Flatten(),
                            nn.Linear(36608, 10),
                            nn.Sigmoid(),
                            nn.Linear(10, 1),
                            nn.Sigmoid())