# Deep Learning for Geo/Environmental sciences

<center><img src="../logo_2.png" alt="logo" width="600"/></center>

<em>*Created with ChatGPT</em>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Lecture 5: Neural Network Architectures

 - [Recap](#Recap)
 - [Recurrent Neural Networks](#Recurrent-Neural-Networks)
 - [Convolutional Neural Networks](#Convolutional-Neural-Networks)

This tutorial is modified from:
- Christopher Olah's blog post on LSTMs at [Colah's blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Saha's tutorial on convolutional neural networks at [Towards Data Science](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

## Recap

Neural networks are a class of models that are inspired by the human brain. They are composed of layers of neurons, where each neuron is a simple computational unit that takes in some input, processes it, and produces an output. 

The neurons are connected to each other by weights, which determine how much influence one neuron has on another. The weights are learned from data using an optimization algorithm, such as gradient descent.

Fully connected neural networks, also known as multilayer perceptrons (MLPs), are a type of neural network where each neuron in one layer is connected to every neuron in the next layer. MLPs are composed of an input layer, one or more hidden layers, and an output layer. 

![a](https://d2l.ai/_images/mlp.svg)

MLPs make no assumptions about the structure of the data and can learn complex patterns in it. However, they are prone to overfitting and can be computationally expensive to train, especially when the number of inputs is large.

In many applications, the data has a spatial or temporal structure that can be exploited to improve the performance of the model. Architectural choices that build such assumptions into the model are called 'inductive biases'.

Convolutional neural networks (CNNs) are a type of neural network that is designed to work with spatial data, such as images. They use convolutional layers to extract features from the data and pooling layers to reduce the dimensionality of the data. CNNs are widely used in computer vision tasks, such as image classification and object detection.

Temporal data, such as time series data, can be modeled using recurrent neural networks (RNNs). RNNs have connections that form a directed cycle, allowing them to capture temporal dependencies in the data. RNNs are widely used in natural language processing tasks, such as machine translation and sentiment analysis.

The way we construct a neural network is called its architecture, and is crucial to the performance of the model. In this lesson, we will discuss the architecture of RNNs and CNNs and how they can be used to model spatial and temporal data.

## Recurrent Neural Networks

In this section, we will discuss recurrent neural networks (RNNs), a type of neural network that is designed to work with temporal data. RNNs have connections that form a directed cycle, allowing them to capture temporal dependencies in the data.

They can be thought of as autoregressive models, where the output at each time step is a function of the input at that time step and the output at the previous time step. RNNs are widely used in natural language processing tasks, such as machine translation and sentiment analysis.

They do this by introducing internal loops in the network, allowing information to persist over time. This provides the network with a form of memory, which allows it to use information from earlier time steps when processing the current one.

<center><img src="_images/RNN-rolled.png" alt="RNN" width="150"/></center>

While the loop is conceptually useful, an RNN is better pictured as a chain of identical units, one per time step, each passing information to the next. Each unit has a set of weights that it uses to process the input and produce an output:

<center><img src="_images/RNN-unrolled.png" alt="RNN" width="600"/></center>

The weights are shared across all time steps, allowing the network to learn to capture temporal dependencies in the data.
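As a minimal sketch of this unrolled view, the NumPy code below applies a single `tanh` unit repeatedly to a short sequence, reusing the same weights at every step. The names `rnn_forward`, `W_xh`, `W_hh` and `b_h`, the layer sizes, and the random initialisation are illustrative assumptions rather than anything taken from the course notebooks.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])      # initial hidden state
    states = []
    for x_t in x_seq:                # one unrolled unit per time step
        # The same W_xh, W_hh and b_h are reused at every step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_steps, n_in, n_hidden = 5, 3, 4    # hypothetical sizes
x_seq = rng.normal(size=(n_steps, n_in))
W_xh = rng.normal(scale=0.5, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

H = rnn_forward(x_seq, W_xh, W_hh, b_h)
print(H.shape)  # one hidden state per time step
```

Because the hidden state `h` is fed back in at each step, the state at step `t` depends on every earlier input, which is the "memory" described above.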

These structures are good at modelling sequences, especially with short-range dependencies:

<center><img src="_images/RNN-shorttermdepdencies.png" alt="RNN" width="600"/></center>

One of the main limitations is that they have difficulty capturing long-range dependencies in the data:

<center><img src="_images/RNN-longtermdependencies.png" alt="RNN" width="600"/></center>

This is because the gradients tend to either explode or vanish as they are backpropagated through the many time steps of the network. The latter is the vanishing gradient problem we discussed last lecture.
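A rough numerical illustration of this effect, under a strong simplifying assumption: for a purely linear recurrence $h_t = W h_{t-1}$, the gradient of the final state with respect to the initial state is a product of one copy of $W$ per time step. The sketch below builds $W$ as a scaled orthogonal matrix (a hypothetical choice that makes the growth rate easy to read off) and prints the norm of that product. A real `tanh` RNN also multiplies by the derivative of `tanh`, which is at most 1 and so can only shrink the gradient further.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random orthogonal matrix: all of its singular values equal 1
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

def gradient_norm(scale, n_steps):
    """Spectral norm of d h_T / d h_0 for the linear recurrence h_t = (scale * Q) h_{t-1}."""
    W = scale * Q
    return np.linalg.norm(np.linalg.matrix_power(W, n_steps), 2)

for scale in (0.5, 1.5):
    norms = [gradient_norm(scale, t) for t in (1, 10, 50)]
    print(scale, norms)  # shrinks towards 0 for scale < 1, blows up for scale > 1
```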

### Long short-term memory (LSTM) Networks

Long short-term memory (LSTM) networks are a type of recurrent neural network that is designed to address the vanishing gradient problem. They have a more complex architecture than standard RNNs, with additional gates that control the flow of information through the network.

While an RNN has a single `tanh` layer:

<center><img src="_images/LSTM3-SimpleRNN.png" alt="RNN" width="600"/></center>

An LSTM has four layers:
<center><img src="_images/LSTM3-chain.png" alt="RNN" width="600"/></center>
 

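These four layers are three sigmoid gates and one `tanh` layer that proposes new candidate values. Together they read from and write to the cell state:
 - Forget gate: controls what information is discarded from the cell state
 - Input gate: controls what new information, taken from the `tanh` candidate layer, is written into the cell state
 - Cell state: the memory of the network, which the gates update
 - Output gate: controls the flow of information from the cell state to the output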
 

The forget gate is a sigmoid layer that decides what information to throw away from the cell state. It looks at $h_{t-1}$ and $x_t$, and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 represents "completely keep this" while a 0 represents "completely get rid of this."

<center><img src="_images/LSTM3-focus-forget.png" alt="RNN" width="600"/></center>

The input gate is a sigmoid layer that decides which values to update. It also has a tanh layer creating a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. The input gate decides how much of each value in the candidate should be added to the state.

<center><img src="_images/LSTM3-focus-input.png" alt="RNN" width="600"/></center>

The cell state is the memory of the network. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged. This is analogous to the skip connections in the UNet architecture, which help avoid vanishing gradients.

<center><img src="_images/LSTM3-focus-cell.png" alt="RNN" width="600"/></center>

The output gate decides what the next hidden state, $h_t$, should be. The hidden state is a filtered version of the cell state. The output is based on the cell state, but it's a selective copy that only includes the parts the network decides to use.

<center><img src="_images/LSTM3-focus-output.png" alt="RNN" width="600"/></center>
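The four steps above can be sketched as a single NumPy function. This follows the standard LSTM update equations; the function name `lstm_step`, the `params` dictionary layout, the sizes, and the random weights are hypothetical choices made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step; params maps each layer name to a (W, b) pair."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    W_f, b_f = params["forget"]
    W_i, b_i = params["input"]
    W_c, b_c = params["candidate"]
    W_o, b_o = params["output"]

    f_t = sigmoid(W_f @ z + b_f)             # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)             # input gate: how much of the candidate to add
    C_tilde = np.tanh(W_c @ z + b_c)         # candidate values
    C_t = f_t * C_prev + i_t * C_tilde       # updated cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # filtered copy of the cell state
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                        # hypothetical sizes
params = {
    name: (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)), np.zeros(n_hidden))
    for name in ("forget", "input", "candidate", "output")
}
x_t = rng.normal(size=n_in)
h_t, C_t = lstm_step(x_t, np.zeros(n_hidden), np.zeros(n_hidden), params)
print(h_t.shape, C_t.shape)
```

Note that the cell state update is a sum of two elementwise products, so when the forget gate is close to 1 and the input gate close to 0 the previous cell state passes through almost unchanged.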

Let's see how the LSTM works in practice in the [simple_lstm](simple_lstm.ipynb) notebook.

### Transformer Networks - Attention is all you need

Recently, transformer networks have become popular for natural language processing tasks. They are based on the idea of self-attention, where each word in the input sequence is assigned a weight based on its relevance to the other words in the sequence. This allows the model to capture long-range dependencies in the data without the need for recurrent connections.

This makes transformer networks more parallelizable and easier to train than RNNs, which can be slow and difficult to train due to the sequential nature of the computation. Transformer networks have achieved state-of-the-art performance on a wide range of natural language processing tasks, such as machine translation and sentiment analysis.

They are the basis of models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which are revolutionizing the field of natural language processing.
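To make the idea of self-attention concrete, here is a minimal single-head sketch of scaled dot-product attention in NumPy. It omits masking, multiple heads, and positional encodings; the projection matrices `W_q`, `W_k`, `W_v`, the sizes, and the random "token embeddings" are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # relevance of every position to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 6, 4              # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))      # stand-in for token embeddings
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

All positions are handled with a few matrix products and no loop over time steps, which is why this computation parallelizes more readily than a recurrent one.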

The [original](https://arxiv.org/pdf/1706.03762.pdf) transformer model consists of an encoder and a decoder, each of which is composed of multiple layers of self-attention and feedforward layers. The encoder processes the input sequence, while the decoder generates the output sequence. The model uses a variant of the attention mechanism called multi-head attention, which allows it to attend to different parts of the input sequence at the same time.

## Convolutional Neural Networks

In this section, we introduce convolutional neural network layers. These are the key components enabling deep learning on image, video, and audio data. By the end of the lesson, we will be able to:

- Understand convolution, padding, stride.
- Understand pooling.

But first, why do we need convolutional neural networks at all? Why can't we just use a fully connected network on images?

## Why not MLP for 2D images?

Recall the hidden layer in an MLP:

$$
\mathbf{H} = \mathbf{X} \mathbf{W} + \mathbf{b}
$$



Now suppose we have 2D images as inputs $\mathbf{X}$, where $[\mathbf{X}]_{i, j}$ denotes the pixel at location $(i,j)$.



We use $\mathbf{H}$ and $[\mathbf{H}]_{i, j}$ to denote the hidden layer output and its pixel at $(i,j)$, respectively.


To have each of the hidden units receive input from each of the input pixels,
we would need the following expression:

$$
\begin{aligned} 
\left[\mathbf{H}\right]_{i, j} 
&= [\mathbf{B}]_{i, j} + \sum_k \sum_l[\mathsf{W}]_{i, j, k, l}  [\mathbf{X}]_{k, l}\\ 
&=  [\mathbf{B}]_{i, j} +
\sum_a \sum_b [\mathsf{V}]_{i, j, a, b}  [\mathbf{X}]_{i+a, j+b}.
\end{aligned}
$$
where $\mathbf{B}$ and $\mathsf{W}$ denote the bias and weight, respectively.



The switch from $\mathsf{W}$ to $\mathsf{V}$ is obtained by simply re-indexing the subscripts $(k, l)$ such that $k = i+a$ and $l = j+b$, i.e., $[\mathsf{V}]_{i, j, a, b} = [\mathsf{W}]_{i, j, i+a, j+b}$. The indices $a$ and $b$ run over both positive and negative offsets, covering the entire image.



For any given location ($i$, $j$) in the hidden representation $[\mathbf{H}]_{i, j}$,
we compute its value by summing over pixels in $\mathbf{X}$,
centered around $(i, j)$ and weighted by $[\mathsf{V}]_{i, j, a, b}$. 



Consider the number of parameters required for a *single* layer in this parametrization when a $1000 \times 1000$ image (1 megapixel) is mapped to a $1000 \times 1000$ hidden representation. Each of the $10^6$ hidden units needs $10^6$ weights, giving $10^{12}$ (one trillion) parameters, far more than is practical for a single layer.

Q to ChatGPT 3.5: How many parameters do you have?

ChatGPT 3.5: I'm based on the GPT-3.5 architecture, which has 175 billion parameters. That's a lot of parameters, and they're what help me understand and generate responses to your questions!

## Convolution

Let's introduce weighting coefficients that do not depend on $(i, j)$.
In other words, we have $[\mathsf{V}]_{i, j, a, b} = [\mathbf{V}]_{a, b}$ and $\mathbf{B}$ is a constant, say $u$.



As a result, we can simplify the definition for $\mathbf{H}$:

$$[\mathbf{H}]_{i, j} = u + \sum_a\sum_b [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}.$$



This is a *convolution* (strictly speaking a **cross-correlation**, since a true convolution would index $[\mathbf{X}]_{i-a, j-b}$ instead)!



Note that $[\mathbf{V}]_{a, b}$ needs many fewer coefficients than $[\mathsf{V}]_{i, j, a, b}$ since it
no longer depends on the location within the image. 



Consequently, the number of parameters required is no longer $10^{12}$ but a much more reasonable $4 \times 10^6$: we still have the dependency on $a, b \in (-1000, 1000)$. 



We should not have
to look very far away from location $(i, j)$
in order to glean relevant information
to assess what is going on at $[\mathbf{H}]_{i, j}$.

Therefore, we set $[\mathbf{V}]_{a, b} = 0$ whenever $|a| > \Delta$ or $|b| > \Delta$.



Equivalently, we can rewrite $[\mathbf{H}]_{i, j}$ as

$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}.$$



This reduces the number of parameters from $4 \times 10^6$ to $(2\Delta + 1)^2 \approx 4 \Delta^2$, where $\Delta$ is typically smaller than $10$. 


As such, we reduced the number of parameters by another four orders of magnitude. A layer of this form is called a *convolutional layer*. 
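The three parameter counts in this argument can be checked with a few lines of arithmetic. The value `delta = 5` is a hypothetical kernel half-width chosen only for illustration. Counting every offset from $-\Delta$ to $\Delta$ in the sum above gives $(2\Delta+1)^2$ weights, which is close to the $4\Delta^2$ estimate.

```python
n = 1000       # image height and width in pixels
delta = 5      # hypothetical kernel half-width

# Fully connected: one weight for every (hidden unit, input pixel) pair
fully_connected = (n * n) * (n * n)

# Weights shared across locations: offsets a and b each span about 2n values
shared_weights = (2 * n) ** 2

# Local kernel: offsets restricted to |a| <= delta and |b| <= delta
local_kernel = (2 * delta + 1) ** 2

print(fully_connected, shared_weights, local_kernel)
```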


*Convolutional neural networks* (CNNs)
are a special family of neural networks that contain convolutional layers.

In the deep learning research community,
$\mathbf{V}$ is referred to as a *convolution kernel*,
a *filter*, or simply the layer's *weights*, which are learnable parameters.

Let's see how this works
with two-dimensional data and hidden representations.
Consider an input of a two-dimensional tensor
with a height of 3 and width of 3.



The height and width of the kernel (the weight) are both 2.

The shape of the *kernel window* (or *convolution window*)
is given by the height and width of the kernel
(here it is $2 \times 2$).

![Two-dimensional cross-correlation operation. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: $0\times0+1\times1+3\times2+4\times3=19$.](https://d2l.ai/_images/correlation.svg)



In the two-dimensional cross-correlation operation,
we begin with the convolution window positioned
at the upper-left corner of the input tensor
and slide it across the input tensor,
both from left to right and top to bottom.



When the convolution window slides to a certain position,
the input subtensor contained in that window
and the kernel tensor are multiplied elementwise
and the resulting tensor is summed up
yielding a single scalar value.



This result gives the value of the output tensor
at the corresponding location.
Here, the output tensor has a height of 2 and width of 2
and the four elements are derived from
the two-dimensional cross-correlation operation:

$$
\begin{aligned}
0\times0+1\times1+3\times2+4\times3&=19,\\
1\times0+2\times1+4\times2+5\times3&=25,\\
3\times0+4\times1+6\times2+7\times3&=37,\\
4\times0+5\times1+7\times2+8\times3&=43.
\end{aligned}
$$



Note that along each axis, the output size
is slightly smaller than the input size.


Because the kernel has width and height greater than $1$,
we can only properly compute the cross-correlation
for locations where the kernel fits wholly within the image.
The output size is therefore determined by the input size $n_\textrm{h} \times n_\textrm{w}$
and the size of the convolution kernel $k_\textrm{h} \times k_\textrm{w}$
via

$$(n_\textrm{h}-k_\textrm{h}+1) \times (n_\textrm{w}-k_\textrm{w}+1).$$


Let's take a look at the implementation of the convolution operation in the following code snippet using the `tf.nn.conv2d` function from TensorFlow.

In [3]:
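import tensorflow as tf

# Input in (batch, height, width, channels) layout: shape (1, 3, 3, 1)
X = tf.constant([[[[0.0], [1.0], [2.0]], [[3.0], [4.0], [5.0]], [[6.0], [7.0], [8.0]]]])
# Kernel in (height, width, in_channels, out_channels) layout: shape (2, 2, 1, 1)
K = tf.constant([[[[0.0]], [[1.0]]], [[[2.0]], [[3.0]]]])
# 'VALID' means no padding; the index drops the batch and channel axes
tf.nn.conv2d(X, K, strides=1, padding='VALID')[0, :, :, 0]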


<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[19., 25.],
       [37., 43.]], dtype=float32)>
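For comparison, the same sliding-window computation can be written out by hand. The helper `corr2d` below is an illustrative NumPy sketch using explicit loops (not how TensorFlow implements it internally) for a single-channel input and kernel.

```python
import numpy as np

def corr2d(X, K):
    """Two-dimensional cross-correlation of a single-channel input X with kernel K."""
    k_h, k_w = K.shape
    out_h = X.shape[0] - k_h + 1
    out_w = X.shape[1] - k_w + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window elementwise with the kernel and sum to a scalar
            Y[i, j] = np.sum(X[i:i + k_h, j:j + k_w] * K)
    return Y

X = np.arange(9.0).reshape(3, 3)         # the 3x3 input from the figure
K = np.array([[0.0, 1.0], [2.0, 3.0]])   # the 2x2 kernel from the figure
Y = corr2d(X, K)
print(Y)  # rows: 19, 25 and 37, 43
```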

## Padding

In the above example of a convolution operation, the input had both a height and width of 3
and the convolution kernel had both a height and width of 2,
yielding an output representation with dimension $2\times2$.


Assuming that the input shape is $n_\textrm{h}\times n_\textrm{w}$
and the convolution kernel shape is $k_\textrm{h}\times k_\textrm{w}$, the output shape will be $(n_\textrm{h}-k_\textrm{h}+1) \times (n_\textrm{w}-k_\textrm{w}+1)$: 
we can only shift the convolution kernel so far until it runs out
of pixels to apply the convolution to. 



One tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Consider the following figure that depicts the pixel utilization as a function of the convolution kernel size and the position within the image. The pixels in the corners are hardly used at all. 

![Pixel utilization for convolutions of size $1 \times 1$, $2 \times 2$, and $3 \times 3$ respectively.](https://d2l.ai/_images/conv-reuse.svg)



To solve this problem, we add extra pixels of filler around the boundary of our input image and set the values of the extra pixels to zero.
In the following figure, we pad a $3 \times 3$ input,
increasing its size to $5 \times 5$. The corresponding output then increases to a $4 \times 4$ matrix.
The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: $0\times0+0\times1+0\times2+0\times3=0$.

![Two-dimensional cross-correlation with padding.](https://d2l.ai/_images/conv-pad.svg)
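The padded example can be reproduced with `np.pad`, which by default fills the new border with zeros. The `corr2d` helper is the same illustrative loop-based cross-correlation as before, repeated here so the snippet runs on its own.

```python
import numpy as np

def corr2d(X, K):
    """Two-dimensional cross-correlation of a single-channel input X with kernel K."""
    k_h, k_w = K.shape
    Y = np.zeros((X.shape[0] - k_h + 1, X.shape[1] - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + k_h, j:j + k_w] * K)
    return Y

X = np.arange(9.0).reshape(3, 3)
K = np.array([[0.0, 1.0], [2.0, 3.0]])

X_padded = np.pad(X, 1)            # one row/column of zeros on every side: 3x3 -> 5x5
Y_padded = corr2d(X_padded, K)     # 5 - 2 + 1 = 4, so the output is 4x4

print(X_padded.shape, Y_padded.shape)
print(Y_padded)
```

The central $2 \times 2$ block of the padded output equals the unpadded result, and the new border entries come from windows that overlap the zero padding.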

In the following example, we create a two-dimensional convolutional layer with a kernel height and width of 3 and apply 1 pixel of padding on all sides. Given an input with a height and width of 8, we find that the height and width of the output is also 8.

In [4]:
from keras.layers import Conv2D, ZeroPadding2D, MaxPooling2D

X = tf.random.uniform((1, 8, 8, 1))

# 1 row and column is padded on either side, so a total of 2 rows or columns are added
X = ZeroPadding2D(padding=(1, 1))(X)
X = Conv2D(1, kernel_size=3)(X)
X.shape

TensorShape([1, 8, 8, 1])

When the height and width of the convolution kernel are different, we can make the output and input have the same height and width by setting different padding numbers for height and width.

In [5]:
# We use a convolution kernel with height 5 and width 3. The padding on either
# side of the height and width are 2 and 1, respectively
X = tf.random.uniform((1, 8, 8, 1))

X = ZeroPadding2D(padding=(2, 1))(X)
X = Conv2D(1, kernel_size=(5, 3))(X)
X.shape


TensorShape([1, 8, 8, 1])

## Stride

When computing the cross-correlation,
we start with the convolution window
at the upper-left corner of the input tensor,
and then slide it over all locations both down and to the right.



In the previous examples, we defaulted to sliding one element at a time.
However, sometimes, either for computational efficiency
or because we wish to downsample,
we move our window more than one element at a time,
skipping the intermediate locations. This is particularly useful if the convolution 
kernel is large since it captures a large area of the underlying image.



We refer to the number of rows and columns traversed per slide as *stride*.
So far, we have used strides of 1, both for height and width.



In the following example, we use a larger stride of 3 vertically and 2 horizontally. The shaded portions are the output elements as well as the input and kernel tensor elements used for the output computation: $0\times0+0\times1+1\times2+2\times3=8$, $0\times0+6\times1+0\times2+0\times3=6$.


![Cross-correlation with strides of 3 and 2 for height and width, respectively.](https://d2l.ai/_images/conv-stride.svg)

We can see that when the second element of the first column is generated,
the convolution window slides down three rows.

The convolution window slides two columns to the right
when the second element of the first row is generated.

When the convolution window continues to slide two columns to the right on the input,
there is no output because the input element cannot fill the window
(unless we add another column of padding).
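A strided cross-correlation only changes where each window starts. The sketch below assumes the figure uses the zero-padded $3 \times 3$ input from the padding section, which is consistent with the two sums quoted above; `corr2d_strided` is a hypothetical helper name.

```python
import numpy as np

def corr2d_strided(X, K, stride=(1, 1)):
    """Cross-correlation where the window moves stride[0] rows and stride[1] columns per step."""
    k_h, k_w = K.shape
    s_h, s_w = stride
    out_h = (X.shape[0] - k_h) // s_h + 1
    out_w = (X.shape[1] - k_w) // s_w + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * s_h, j * s_w   # upper-left corner of the current window
            Y[i, j] = np.sum(X[top:top + k_h, left:left + k_w] * K)
    return Y

X = np.pad(np.arange(9.0).reshape(3, 3), 1)   # zero-padded 5x5 input
K = np.array([[0.0, 1.0], [2.0, 3.0]])
Y = corr2d_strided(X, K, stride=(3, 2))
print(Y)  # 2x2 output: 0, 8 in the first row and 6, 8 in the second
```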


In the following example, we set the strides on both the height and width to 2, thus halving the input height and width.

In [6]:
X = tf.random.uniform((1, 8, 8, 1))

X = ZeroPadding2D(padding=1)(X)
X = Conv2D(1, kernel_size=3, strides=2)(X)

X.shape

TensorShape([1, 4, 4, 1])

Here is another example, with a different kernel size, padding, and stride along each axis:

In [7]:
X = tf.random.uniform((1, 8, 8, 1))

X = ZeroPadding2D(padding=(0, 1))(X)
X = Conv2D(1, kernel_size=(3, 5), strides=(3, 4))(X)

X.shape

TensorShape([1, 2, 2, 1])
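The output shapes in the padding and stride cells above all follow from one formula per axis: with input length $n$, kernel length $k$, $p$ zeros added on each side, and stride $s$, the output length is $\lfloor (n + 2p - k)/s \rfloor + 1$. The small helper below (a hypothetical name, `conv_output_size`) checks this against the shapes printed above.

```python
def conv_output_size(n, k, pad=0, stride=1):
    """Output length along one axis for input n, kernel k, `pad` zeros per side, and a stride."""
    return (n + 2 * pad - k) // stride + 1

# 3x3 kernel, padding 1, stride 1 on an 8x8 input -> 8
print(conv_output_size(8, 3, pad=1))
# 5x3 kernel with padding (2, 1) -> (8, 8)
print(conv_output_size(8, 5, pad=2), conv_output_size(8, 3, pad=1))
# 3x3 kernel, padding 1, stride 2 -> 4
print(conv_output_size(8, 3, pad=1, stride=2))
# 3x5 kernel, padding (0, 1), strides (3, 4) -> (2, 2)
print(conv_output_size(8, 3, pad=0, stride=3), conv_output_size(8, 5, pad=1, stride=4))
```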

## Pooling

*Pooling* serves two purposes: mitigating the sensitivity of convolutional layers to location and spatially downsampling representations.



Like convolutional layers, *pooling* operators
consist of a fixed-shape window that is slid over
all regions in the input according to its stride,
computing a single output for each location traversed
by the fixed-shape window (sometimes known as the *pooling window*).


However, unlike the cross-correlation computation
of the inputs and kernels in the convolutional layer,
the pooling layer contains no parameters (there is no *kernel*).
Instead, pooling operators are deterministic,
typically calculating either the maximum or the average value
of the elements in the pooling window.


These operations are called *maximum pooling* (*max-pooling* for short)
and *average pooling*, respectively.

![Max-pooling with a pooling window shape of $2\times 2$. The shaded portions are the first output element as well as the input tensor elements used for the output computation: $\max(0, 1, 3, 4)=4$.](https://d2l.ai/_images/pooling.svg)



The output tensor here has a height of 2 and a width of 2.
The four elements are derived from the maximum value in each pooling window:

$$
\begin{aligned}
\max(0, 1, 3, 4)&=4,\\
\max(1, 2, 4, 5)&=5,\\
\max(3, 4, 6, 7)&=7,\\
\max(4, 5, 7, 8)&=8.
\end{aligned}
$$

We can construct the input tensor `X` to validate the output of the two-dimensional max-pooling layer in the above figure using the `keras` `MaxPooling2D` layer.

In [8]:
X = tf.constant([[[[0.0], [1.0], [2.0]], [[3.0], [4.0], [5.0]], [[6.0], [7.0], [8.0]]]])
MaxPooling2D((2, 2), strides=1)(X) # Note this defaults to a stride of the pool size if not specified 

<tf.Tensor: shape=(1, 2, 2, 1), dtype=float32, numpy=
array([[[[4.],
         [5.]],

        [[7.],
         [8.]]]], dtype=float32)>
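Since pooling has no learnable parameters, it is easy to write by hand. The following NumPy sketch (with a hypothetical helper `pool2d`, for a single-channel input) covers both max and average pooling.

```python
import numpy as np

def pool2d(X, pool_size, stride=1, mode="max"):
    """Max or average pooling over a single-channel 2D input."""
    p_h, p_w = pool_size
    out_h = (X.shape[0] - p_h) // stride + 1
    out_w = (X.shape[1] - p_w) // stride + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = X[i * stride:i * stride + p_h, j * stride:j * stride + p_w]
            Y[i, j] = window.max() if mode == "max" else window.mean()
    return Y

X = np.arange(9.0).reshape(3, 3)
Y_max = pool2d(X, (2, 2), stride=1, mode="max")
Y_avg = pool2d(X, (2, 2), stride=1, mode="avg")
print(Y_max)  # rows: 4, 5 and 7, 8
print(Y_avg)  # rows: 2, 3 and 5, 6
```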

## Summary

An example of a convolution operation with stride length 2:

![Convolution Operation with Stride Length = 2](https://miro.medium.com/v2/resize:fit:640/format:webp/1*1VJDP6qDY9-ExTuQVEOlVg.gif)




An example of padding: 5x5x1 image is padded with 0s to create a 6x6x1 image

![padding: 5x5x1 image is padded with 0s to create a 6x6x1 image](https://miro.medium.com/v2/resize:fit:640/format:webp/1*nYf_cUIHFEWU1JXGwnz-Ig.gif)



An example of 3x3 pooling over 5x5 convolved feature

![3x3 pooling over 5x5 convolved feature](https://miro.medium.com/v2/resize:fit:640/format:webp/1*uoWYsCV5vBU8SHFPAPao-w.gif)




So, convolutions are used to detect patterns in the input image. The patterns are detected by sliding a kernel over the image. 


The kernel slides over the image and multiplies the values in the image with the weights in the kernel. The output of this operation is a feature map. The feature map highlights the areas in the image where the pattern is detected. 



The feature map is then passed through an activation function to introduce non-linearity and then downsampled using pooling. 

In practice we stack multiple convolutional layers, pooling layers, and fully connected layers to build complex models, often dozens of layers deep. As we pass through these layers the model learns to detect larger scale patterns in the image.
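The convolution, activation, and pooling sequence described above can be sketched end to end on a toy image. Everything here is illustrative: the image is a bright vertical stripe, the kernel is a hand-picked edge detector rather than a learned one, and ReLU is assumed as the activation function.

```python
import numpy as np

def corr2d(X, K):
    """Two-dimensional cross-correlation of a single-channel input X with kernel K."""
    k_h, k_w = K.shape
    Y = np.zeros((X.shape[0] - k_h + 1, X.shape[1] - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + k_h, j:j + k_w] * K)
    return Y

def max_pool2d(X, size=2, stride=2):
    """Max pooling with a square window over a single-channel input."""
    out_h = (X.shape[0] - size) // stride + 1
    out_w = (X.shape[1] - size) // stride + 1
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = X[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return Y

# 6x6 image with a bright vertical stripe in columns 2 and 3
X = np.tile(np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0]), (6, 1))
K = np.array([[-1.0, 1.0]])                # responds to dark-to-bright edges, left to right

feature_map = corr2d(X, K)                 # +1 at the rising edge, -1 at the falling edge
activated = np.maximum(feature_map, 0.0)   # ReLU keeps only the rising edge
pooled = max_pool2d(activated)             # downsample from 6x5 to 3x2

print(feature_map[0], activated[0], pooled[0])
```

The pooled output still records that a dark-to-bright edge occurs in the left part of the image, but at a coarser resolution.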



See the [simple_cnn](simple_cnn.ipynb) notebook for a practical example of regression using a convolutional neural network with the `keras` library to predict precipitation based on sea surface temperature (SST). Next week we will look at a more complex model to emulate a fully coupled climate model.