## Convolutional Neural Networks

- Special type of neural network to model data with a known **grid-like topology**. 


- *What kind of data is modeled with a CNN?* 
    - 1-D grid of scalars for Time series data
    - 2-D grid of pixels for Image data


- _Difference between CNN and DNN?_
    - CNNs apply convolution in place of matrix multiplication in at least 1 layer
    
### Convolution Operation

In the context of neural networks, convolution is an operation between 2 tensors: an input $f(x)$ and a kernel $g(x)$. If both are 1-dimensional (like time series data), then their convolution is:

$$
(f*g)(t) = \int_{-\infty}^\infty f(\tau).g(t-\tau) d\tau
$$

This measures how much $g$, flipped and shifted by $t$, overlaps $f$, accumulated over all $\tau$. For a time series that starts at $\tau = 0$, values at $\tau < 0$ do not exist, and values at $\tau > t$ lie in the future (which we don't know), so the integral is restricted to $[0, t]$:

$$
(f*g)(t) = \int_{0}^t f(\tau).g(t-\tau) d\tau \tag{1}
$$

This corresponds to a single entry in the 1-D convolved tensor $f*g$ (the $t^{th}$ entry). To compute the complete convolved tensor, we evaluate it for every valid value of $t$.
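
A discrete analogue of equation (1) can be sketched in plain Python, with the integral replaced by a sum over sample indices. The function name `conv1d_causal` and the sample values are illustrative assumptions, not part of the notes; samples outside either sequence are treated as zero.

```python
def conv1d_causal(f, g):
    """Discrete analogue of equation (1): out[t] = sum_{tau=0}^{t} f[tau] * g[t - tau]."""
    out = []
    for t in range(len(f) + len(g) - 1):
        total = 0
        for tau in range(t + 1):
            # Samples outside either sequence are treated as zero.
            if tau < len(f) and t - tau < len(g):
                total += f[tau] * g[t - tau]
        out.append(total)
    return out


signal = [1, 2, 3, 4]   # hypothetical input f
kernel = [1, 0, -1]     # hypothetical kernel g
result = conv1d_causal(signal, kernel)
print(result)  # [1, 2, 2, 2, -3, -4]
```

The outer loop is the "iterate over all $t$" step; each pass produces one entry of the convolved tensor.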

In machine learning, convolution is usually implemented as **cross-correlation** (convolution without the kernel flip).

$$
(f\star g)(t) =  \bar f(-t)*g(t)
$$

Here $\bar f$ is the complex conjugate of $f$ (for real-valued signals, $\bar f = f$). Writing out the convolution gives:

$$
(f \star g)(t) = \int_{-\infty}^\infty \bar f(-\tau).g(t-\tau) d\tau
$$

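Let $\tau' = -\tau$. So $d\tau' = -d\tau$, and the limits of integration swap.

$$
(f \star g)(t) = \int_{\infty}^{-\infty}  \bar f(\tau').g(t+\tau') (-d\tau')
$$

Flip the limits of the integral to absorb the minus sign, and rename $\tau'$ back to $\tau$:

$$
(f \star g)(t) = \int_{-\infty}^{\infty}  \bar f(\tau).g(t+\tau) d\tau \tag{2}
$$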
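
The relationship between the two operations can be checked numerically. The sketch below assumes real-valued sequences (so the conjugate drops out) and evaluates only positions where the kernel fits entirely inside the input; the function names and sample values are illustrative.

```python
def cross_correlate_valid(kernel, x):
    """out[t] = sum_tau kernel[tau] * x[t + tau], as in equation (2)."""
    k = len(kernel)
    return [sum(kernel[tau] * x[t + tau] for tau in range(k))
            for t in range(len(x) - k + 1)]


def convolve_valid(kernel, x):
    """Same positions, but the kernel is flipped: kernel[tau] meets x[t + k - 1 - tau]."""
    k = len(kernel)
    return [sum(kernel[tau] * x[t + k - 1 - tau] for tau in range(k))
            for t in range(len(x) - k + 1)]


x = [1, 2, 3, 4, 5]
kernel = [1, 2, 3]

corr = cross_correlate_valid(kernel, x)
conv = convolve_valid(kernel, x)
corr_flipped = cross_correlate_valid(kernel[::-1], x)

print(corr)          # [14, 20, 26]
print(conv)          # [10, 16, 22]
print(corr_flipped)  # [10, 16, 22]
```

Cross-correlating with the reversed kernel reproduces the convolution. Since the kernel is learned, a network can use either operation; it just learns the flipped weights.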

In practice, we may have multi-dimensional input tensors that require a multi-dimensional kernel tensor. Consider an image input $I$ with a 2-D kernel $h$. Convolution is defined as:

$$
\begin{aligned}
(I*h)(x, y) &= \int_0^x \int_0^y I(i, j).h(x-i, y-j) dj di \\
&= \int_0^x \int_0^y I(x-i, y-j).h(i, j) dj di
\end{aligned}
\tag{3}
$$

A kernel _usually_ has far fewer entries than an image. Hence we use the latter form, in which $(i, j)$ ranges over the kernel. Each evaluation gives a single scalar value of the convolution. We repeat the process for every point $(x, y)$ at which the convolution is defined on the image, and store each value in the convolved matrix $I*h$. This output is sometimes called a **feature map**.
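
As a rough sketch, the discrete version of the second form of equation (3) can be written with nested loops. It assumes "valid" positions only (the kernel must fit entirely inside the image) and uses nested Python lists; `convolve2d_valid` and the sample values are hypothetical.

```python
def convolve2d_valid(image, kernel):
    """Discrete form of equation (3), evaluated only where the kernel fits."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for x in range(ih - kh + 1):
        row = []
        for y in range(iw - kw + 1):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    # Kernel index (i, j) meets a mirrored image index: the kernel flip.
                    total += image[x + kh - 1 - i][y + kw - 1 - j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

feature_map = convolve2d_valid(image, kernel)
print(feature_map)  # [[4, 4], [4, 4]]
```

The two inner loops run over the kernel entries, which is why the form indexed by the kernel is the cheaper one to implement.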


### Discrete convolution

Discrete convolution (1-dimensional, i.e. univariate) can be computed as a matrix product by converting one operand (either the input or the impulse response) into a **Toeplitz matrix**. In this matrix, each row is the previous row shifted right by one column.

Consider the input:

$$
X =
\begin{bmatrix}
x_0 & x_1 & x_2 & \cdots & x_n
\end{bmatrix}
$$

Its Toeplitz matrix has one row per kernel entry ($m+1$ rows and $n+m+1$ columns):

$$
X_{Toeplitz} =
\begin{bmatrix}
    x_0 & x_1 & x_2 & \cdots & x_n & 0 & \cdots & 0 \\
    0 & x_0 & x_1 & x_2 & \cdots & x_n & \cdots & 0 \\
    \vdots & & \ddots & & & & \ddots & \vdots \\
    0 & \cdots & 0 & x_0 & x_1 & x_2 & \cdots & x_n
\end{bmatrix}
$$

The kernel is a column vector:

$$
h =
\begin{bmatrix}
    h_0 \\
    h_1 \\
    h_2 \\
    \vdots \\
    h_m
\end{bmatrix}
$$

The convolution is then a matrix product. The transpose makes the shapes agree, since $X_{Toeplitz}$ has $m+1$ rows:

$$
y = X*h = X_{Toeplitz}^\top . h
$$

Building the Toeplitz matrix amounts to shifting the input over time (one time step per row). The matrix above slides the input $X$ along $h$.
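
A minimal sketch of this construction, assuming one shifted copy of the input per kernel entry: each output entry is then the sum of one column of the shifted matrix, weighted by the entries of $h$. The helper names are hypothetical.

```python
def toeplitz_rows(x, num_rows):
    """Stack num_rows copies of x, each shifted one column further right, zero-padded."""
    width = len(x) + num_rows - 1
    return [[0] * r + list(x) + [0] * (width - len(x) - r) for r in range(num_rows)]


def conv_via_toeplitz(x, h):
    """Output entry c is the h-weighted sum of column c of the shifted matrix."""
    x_toeplitz = toeplitz_rows(x, len(h))
    width = len(x_toeplitz[0])
    return [sum(h[r] * x_toeplitz[r][c] for r in range(len(h))) for c in range(width)]


x = [1, 2, 3, 4]
h = [1, 0, -1]

x_toeplitz = toeplitz_rows(x, len(h))
for row in x_toeplitz:
    print(row)
# [1, 2, 3, 4, 0, 0]
# [0, 1, 2, 3, 4, 0]
# [0, 0, 1, 2, 3, 4]

y = conv_via_toeplitz(x, h)
print(y)  # [1, 2, 2, 2, -3, -4]
```

The result has `len(x) + len(h) - 1` entries, i.e. the full convolution including the partially overlapping positions at both ends.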

### Why convolution? 

Three reasons:
- Sparse Interactions
- Parameter Sharing
- Equivariant representations


### Sparse Interactions

- In a DNN, every neuron in one layer is connected to every neuron in the next. In a CNN, each neuron is connected to only _k_ neurons in the next layer (where _k_ is the size of the kernel in 1-D).
- **Fewer parameters to estimate**: In a DNN with $m$ input neurons and $n$ output neurons, the number of parameters is $O(m \times n)$. With sparse connections, it is $O(k \times n)$.
- Viewed from below, a neuron in a DNN affects every neuron in the next layer. A neuron in a CNN only affects _k_ neurons in the next layer.
- Viewed from above, the **receptive field** of a neuron in a DNN includes all neurons in the previous layer. The receptive field of a neuron in a CNN consists of only _k_ neurons.
- **Indirect interactions exist**: Consider a kernel of size 3. The receptive field of a neuron in layer 4 includes 3 neurons in layer 3. It also includes the receptive fields of these 3 neurons, and so on. Hence indirect interactions exist despite sparse connectivity.
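
The growth of the receptive field with depth can be sketched as below. This assumes stride 1, no dilation and no pooling between layers (none of which the notes specify), so each extra layer widens the field by `kernel_size - 1` neurons.

```python
def receptive_field(kernel_size, layers_back):
    """Number of neurons, layers_back layers down, that can influence one neuron.

    Assumes stride 1, no dilation, and no pooling between layers.
    """
    size = 1
    for _ in range(layers_back):
        size += kernel_size - 1
    return size


# A neuron in layer 4 with kernel size 3, looking back at layers 3, 2 and 1.
sizes = [receptive_field(3, depth) for depth in (1, 2, 3)]
print(sizes)  # [3, 5, 7]
```

So a neuron in layer 4 is directly connected to only 3 neurons, but is indirectly influenced by 7 neurons in layer 1.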

### Parameter Sharing

- In a DNN, each learned parameter (weight or bias) is used only once, for one specific connection between a pair of layers.
- In a CNN, the same kernel is applied at every position of the input (except for some boundary points).
- In a DNN, space = $O(m \times n)$ and time = $O(m \times n)$.
- In a CNN, space = $O(k)$ and time = $O(k \times n)$.
- Hence parameter sharing reduces the storage needed to $O(k)$.
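
To put numbers on these bounds, the sketch below counts weights for one layer under three schemes. The sizes are made up for illustration, biases are ignored, and the output size assumes the kernel is applied only where it fits.

```python
def parameter_counts(m, n, k):
    """Weights for m inputs, n outputs and kernel size k (biases ignored)."""
    return {
        "dense": m * n,            # every input connected to every output
        "sparse_unshared": k * n,  # k connections per output, each with its own weight
        "shared_kernel": k,        # one kernel reused at every position
    }


m, k = 1000, 3   # hypothetical layer width and kernel size
n = m - k + 1    # number of positions where the kernel fits: 998

counts = parameter_counts(m, n, k)
print(counts)  # {'dense': 998000, 'sparse_unshared': 2994, 'shared_kernel': 3}
```

Sparse connectivity accounts for the drop in multiplications (from `m * n` to `k * n`), and sharing the kernel accounts for the further drop in stored weights.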

### Equivariant Representations

- An operation is equivariant to a transformation $T$ if applying $T$ before or after the operation gives the same result. Convolution is equivariant to translation: translating the input and then convolving gives the same output as convolving first and then translating the output by the same amount, i.e. $\text{conv}(T(x)) = T(\text{conv}(x))$.
    - E.g. Convolution is equivariant to an operation that shifts every pixel one position to the left in 2-D (or shifts a signal in time for a 1-D time series).
    - E.g. Convolution is _not_ equivariant to operations like image rotation.
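
A small numerical check of this property, under two assumptions that are not in the notes: the signal wraps around at the boundary (circular shift), so boundary effects do not blur the comparison, and sequence reversal stands in for rotation as a 1-D transformation that is not a translation.

```python
def circular_correlate(x, kernel):
    """Slide the kernel over x (no flip), wrapping around at the end."""
    n = len(x)
    return [sum(kernel[j] * x[(t + j) % n] for j in range(len(kernel)))
            for t in range(n)]


def shift(x, steps):
    """Circularly shift a sequence to the right by steps positions."""
    steps %= len(x)
    return x[-steps:] + x[:-steps]


x = [0, 1, 3, 2, 0, 0]
kernel = [1, -1]

# Translation: shifting before or after the convolution gives the same result.
shift_then_conv = circular_correlate(shift(x, 1), kernel)
conv_then_shift = shift(circular_correlate(x, kernel), 1)
print(shift_then_conv)  # [0, -1, -2, 1, 2, 0]
print(conv_then_shift)  # [0, -1, -2, 1, 2, 0]

# Reversal: the order of operations matters, so there is no equivariance.
reverse_then_conv = circular_correlate(x[::-1], kernel)
conv_then_reverse = circular_correlate(x, kernel)[::-1]
print(reverse_then_conv)  # [0, -2, -1, 2, 1, 0]
print(conv_then_reverse)  # [0, 0, 2, 1, -2, -1]
```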

## Pooling

A typical convolutional layer has three stages:

1. **Parallel Convolutions**: Perform several convolutions in parallel. The output is a set of linear activations. On their own these do not learn much: a stack of purely linear layers collapses into a single linear map, so for classification it is only as good as logistic regression. Hence we want non-linear activations. https://www.coursera.org/learn/neural-networks-deep-learning/lecture/OASKH/why-do-you-need-non-linear-activation-functions

2. **Detector Stage**: Pass each linear activation through a non-linear activation function. High values mark regions of the input that are similar to the kernel.
3. **Pooling**: Summarize nearby outputs of the detector stage, as described below.

Pooling computes a summary statistic: it picks one value to represent its immediate neighbors (for example, the maximum in max pooling). Hence, the output of pooling is approximately **invariant** with respect to small translations, so shifting an image by a small amount leaves most of the pooled output unchanged. E.g. consider the problem of digit recognition. Given a sample image of, say, the number 5, shifting the digit by a pixel or two should not change what is recognized.

Pooling is useful when we care more about whether an object is present than about pinpointing its exact location in an image. E.g. in face detection, we are only concerned with the presence of an oval with 2 eyes, a nose and a mouth. We are _not_ concerned with the exact pixel positions of these parts.
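
The invariance claim can be illustrated with a 1-D max pooling sketch. Max pooling with window 2 and stride 2 is an assumed choice here (the notes do not fix a pooling function), and the detector outputs are made up.

```python
def max_pool_1d(x, size=2, stride=2):
    """Keep the largest value in each window of the given size."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


detector = [0, 0, 5, 0, 0, 0, 0, 0]  # one strong activation at index 2
shifted_1 = [0, 0, 0, 5, 0, 0, 0, 0]  # same activation, moved right by 1
shifted_2 = [0, 0, 0, 0, 5, 0, 0, 0]  # moved right by 2

pooled = max_pool_1d(detector)
pooled_1 = max_pool_1d(shifted_1)
pooled_2 = max_pool_1d(shifted_2)

print(pooled)    # [0, 5, 0, 0]
print(pooled_1)  # [0, 5, 0, 0]
print(pooled_2)  # [0, 0, 5, 0]
```

The one-step shift stays inside the same pooling window, so the output does not change. The two-step shift crosses into the next window and the output moves with it, which is why the invariance only holds for small translations.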

Try Time series analysis:
- Kaggle Stock data: https://www.kaggle.com/camnugent/sandp500/data
- Time series analysis blog: https://codeburst.io/neural-networks-for-algorithmic-trading-volatility-forecasting-and-custom-loss-functions-c030e316ea7e
    - Classification: https://github.com/Rachnog/Deep-Trading/blob/master/multivariate/multivariate.py#L77
    - Regression: https://github.com/Rachnog/Deep-Trading/blob/master/volatility/volatility.py
