# Convolutional Networks

A common ConvNet architecture follows this layer pattern:

```
input -> [[CONV -> RELU] * N -> POOL?] * M -> [FC -> RELU]*K -> FC
```

where `*` indicates repetition and `POOL?` indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, and K >= 0 (and usually K < 3). For common ConvNet architectures that follow this pattern, see:

- https://cs231n.github.io/convolutional-networks/#conv
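
As a rough illustration of how the pattern expands for concrete values of N, M, and K, the sketch below builds a flat list of layer names. The function `expand_pattern` and its `use_pool` flag are hypothetical helpers written for this note, not part of any library.

```python
def expand_pattern(n, m, k, use_pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
    into a flat list of layer names (illustrative helper, not a library API)."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n   # N conv/ReLU pairs per block
        if use_pool:
            layers.append("POOL")        # the optional pooling layer closes a block
    layers += ["FC", "RELU"] * k         # K hidden fully connected layers
    layers.append("FC")                  # final fully connected layer
    return layers


# N=1, M=2, K=1
print(" -> ".join(expand_pattern(n=1, m=2, k=1)))
# → INPUT -> CONV -> RELU -> POOL -> CONV -> RELU -> POOL -> FC -> RELU -> FC
```

Setting N=0, M=0, K=0 leaves only `INPUT -> FC`, which is a plain linear classifier.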

## Convolution

Convolution leverages three important ideas that can help improve a machine learning system:
- sparse interactions
- parameter sharing
- equivariant representations




Sparse interactions
- A fully connected network (FCN) layer multiplies its input by a parameter matrix in which a separate parameter describes the interaction between each input unit and each output unit. This means every output unit interacts with every input unit.
- A CNN has sparse interactions because the kernel is smaller than the input. This leads to fewer parameters, which reduces the memory requirements of the model, improves its statistical efficiency, and reduces the number of operations needed to compute the output.
- For $m$ inputs and $n$ outputs, an FCN layer requires $m \times n$ parameters and has $O(mn)$ runtime, while a CNN layer requires only $k \times n$ parameters and $O(kn)$ runtime if we limit the number of connections each output may have to $k$.
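
To make these counts concrete, here is a small sketch that compares the two costs. The sizes chosen for `m`, `n`, and `k` are made-up values for illustration, and the helper names are hypothetical.

```python
def dense_cost(m, n):
    """Parameters (and multiply-adds) when every output sees every input."""
    return m * n


def sparse_cost(k, n):
    """Parameters (and multiply-adds) when each output sees only k inputs."""
    return k * n


m = n = 320 * 280   # assumption: one unit per pixel of a 320x280 image
k = 3 * 3           # assumption: each output connects to a 3x3 neighborhood

print(dense_cost(m, n))                        # m * n
print(sparse_cost(k, n))                       # k * n
print(dense_cost(m, n) // sparse_cost(k, n))   # reduction factor, about m / k
```

The reduction factor depends only on the ratio of `m` to `k`, so the savings grow with the input size as long as the kernel stays small.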

Parameter Sharing
- In an FCN layer, each element of the weight matrix is used exactly once when computing the output of the layer. In a CNN, the same kernel parameter is reused at every position of the input, so the layer learns one set of weights rather than a separate set for each location.
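
A minimal sketch of parameter sharing in one dimension: the same two kernel weights are applied at every input position. The function name `conv1d_valid` is hypothetical, and the code computes cross-correlation (no kernel flip), which is what deep learning libraries usually call convolution.

```python
def conv1d_valid(x, kernel):
    """Slide one shared kernel over x (cross-correlation, no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * x[i + j] for j in range(k))   # same weights at every i
        for i in range(len(x) - k + 1)
    ]


x = [1.0, 2.0, 4.0, 7.0, 11.0]
kernel = [-1.0, 1.0]             # two shared weights: a simple difference filter

print(conv1d_valid(x, kernel))   # → [1.0, 2.0, 3.0, 4.0]
```

The layer has two parameters regardless of the input length, whereas a dense layer mapping these 5 inputs to 4 outputs would need 20 weights.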

Spatial Awareness
- A CNN preserves the spatial relationships in the data.
- An FCN flattens a 2-D image into a 1-D vector, losing spatial information.

Equivariant representation
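- Because the same filters are applied at every position, a CNN detects features such as edges and shapes wherever they appear in the image: shifting the input shifts the feature map by the same amount (translation equivariance). Pooling then adds approximate invariance to small shifts.
- An FCN has no such property, since each input position has its own weights: a pattern learned at one location does not transfer to another.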
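
The equivariance property can be checked numerically. The sketch below assumes circular (wrap-around) boundaries so that the identity holds exactly; with zero padding it holds only away from the borders. Both helper functions are hypothetical names for this example.

```python
def circular_conv1d(x, kernel):
    """Cross-correlate x with kernel, wrapping around at the boundary."""
    n = len(x)
    return [
        sum(w * x[(i + j) % n] for j, w in enumerate(kernel))
        for i in range(n)
    ]


def shift(x, s):
    """Circularly shift a list s positions to the right."""
    s %= len(x)
    return x[-s:] + x[:-s]


x = [0.0, 1.0, 3.0, 0.0, 0.0, 2.0]
kernel = [1.0, -2.0, 1.0]

shift_then_conv = circular_conv1d(shift(x, 2), kernel)
conv_then_shift = shift(circular_conv1d(x, kernel), 2)
print(shift_then_conv == conv_then_shift)   # → True
```

Shifting the input and then convolving gives the same result as convolving and then shifting the output, which is what equivariance to translation means.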

## Nonlinearity

This stage applies a nonlinear activation function to the convolution outputs so that the network can model nonlinear relationships.
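
One way to see why the nonlinearity matters is to stack two linear layers with and without it. The sketch below uses scalar layers with arbitrary example weights; all names are hypothetical.

```python
def linear(w, b):
    """Return the affine map x -> w * x + b."""
    def apply(x):
        return w * x + b
    return apply


def relu(x):
    return max(0.0, x)


layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)


def stacked(x):
    """Two layers with no activation in between."""
    return layer2(layer1(x))


# -3 * (2x - 1) + 0.5 = -6x + 3.5, so the stack is still one affine map.
collapsed = linear(-6.0, 3.5)


def with_relu(x):
    """Same two layers with ReLU in between: the output bends at x = 0.5."""
    return layer2(relu(layer1(x)))


for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, stacked(x), collapsed(x), with_relu(x))
```

Without the activation, the two layers are indistinguishable from a single affine layer. With ReLU in between, the composite is piecewise linear and can no longer be written as one affine map.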

## Pooling

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, max pooling reports the maximum output within the neighborhood defined by the pooling window.

- Pooling makes the representation approximately invariant to small translations of the input.
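
A small sketch of 1-D max pooling, using made-up feature values, shows the effect of a one-position shift. The function name and its parameters are assumptions for this example.

```python
def max_pool1d(x, size, stride):
    """Maximum over each window of `size` values, moving by `stride`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


features = [0.1, 0.9, 0.2, 0.0, 0.0, 0.8, 0.3, 0.1]
shifted = [0.0] + features[:-1]   # same detector outputs, moved right by one

print(max_pool1d(features, size=4, stride=4))   # → [0.9, 0.8]
print(max_pool1d(shifted, size=4, stride=4))    # → [0.9, 0.8]
```

The invariance is only approximate: a shift large enough to move a peak across a window boundary changes the pooled output.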


## Interview Questions

1. Can you describe the structure of CNNs, including the different layers and activation functions? What are some key properties of activation functions?

- Convolutional Neural Networks (CNNs) are a class of deep neural networks widely used in processing data with a grid-like topology, such as images. They are known for their ability to detect hierarchical patterns in data. Here’s an overview of their structure, including layers and activation functions:
- Structure of CNNs
    - Convolutional Layers: These layers apply a set of learnable filters (kernels) to the input. Each filter convolves across the width and height of the input volume, computing the dot product between the filter and input, producing a 2D activation map.
      - Key Property: Convolutional layers are adept at capturing spatial hierarchies in images by learning from local regions (like edges, textures) in the early layers and more complex patterns (like objects, shapes) in deeper layers.
    - Pooling Layers: Often placed after convolutional layers, pooling layers (such as max pooling or average pooling) reduce the spatial dimensions (width and height) of the input volume, which reduces the computation and the number of parameters in later layers.
      - Key Property: Pooling makes feature detection approximately invariant to small translations of the input.
    - Fully Connected Layers: At the end of the network, one or more fully connected layers are used where each neuron is connected to all neurons in the previous layer. These layers are typically used for classifying the features learned by the convolutional layers into different classes.
      - Key Property: Fully connected layers combine features to make final predictions.
    - Dropout: Dropout is a regularization technique used in CNNs to prevent overfitting. It randomly “drops” a subset of neurons in a layer during training, forcing the network to learn redundant representations and enhancing its generalization capabilities.
    - Batch Normalization: Batch normalization is a technique to stabilize and accelerate the training of deep networks. It normalizes the activations of a previous layer at each batch, i.e., it applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.
- Activation Functions
  1. ReLU (Rectified Linear Unit):
     - Formula: $f(x)=\max (0, x)$
     - Properties: Non-linear, which allows models to account for complex data patterns; simple and efficient to compute.
     - Variants such as Leaky ReLU and Parametric ReLU address the “dying ReLU” problem, where neurons become inactive and stop contributing to the learning process.
  2. Sigmoid:
     - Formula: $\sigma(x)=\frac{1}{1+e^{-x}}$
     - Properties: Smooth and differentiable everywhere; squashes values into the range between 0 and 1. It is often used in the output layer for binary classification.
  3. Tanh (Hyperbolic Tangent):
     - Formula: $\tanh (x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$
     - Properties: Similar to sigmoid but squashes values into the range between -1 and 1. It is zero-centered, making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
  4. Softmax: Used in the output layer of a CNN for multi-class classification; it turns logits into probabilities that sum to one.
     - Properties: Softmax is non-linear and handles multiple mutually exclusive classes.
- Key Properties of Activation Functions
  - Nonlinearity: This allows CNNs to capture complex relationships in data. Without nonlinearity, the network would behave like a linear model.
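  - Differentiability: Essential for backpropagation, where gradients are computed during training. Differentiability almost everywhere is enough in practice; ReLU, for example, is not differentiable at 0.
  - Computational Efficiency: Activation functions that are cheap to compute (like ReLU) lead to quicker training.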
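
The activation functions listed above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the `slope=0.01` default for Leaky ReLU is an assumed value, and the simple `sigmoid` here is only meant for moderate inputs.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs (assumed 0.01)."""
    return x if x > 0 else slope * x


def sigmoid(x):
    """Squash x into (0, 1); math.exp can overflow for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    shift = max(logits)   # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), sigmoid(x), math.tanh(x))

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Subtracting the largest logit before exponentiating does not change the softmax result, but it avoids overflow when the logits are large.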

In summary, the structure of CNNs, characterized by alternating convolutional and pooling layers followed by fully connected layers, combined with dropout for regularization and batch normalization for faster training, is well suited to feature detection and classification. The choice of activation function, critical for introducing nonlinearity, depends on the specific requirements of the task and the network architecture.
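
For the two training aids mentioned in the summary, here is a simplified sketch on a single list of activations. It makes several assumptions: the batch normalization omits the learnable scale and shift and the running statistics used at inference time, and the dropout is the "inverted" variant, which rescales the kept values during training. The function names and the `eps` default are hypothetical.

```python
import random


def batch_norm(values, eps=1e-5):
    """Normalize a batch of activations to roughly zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print(normalized)

rng = random.Random(0)   # fixed seed, only for reproducibility
print(dropout(normalized, p=0.5, rng=rng))
```

Dropout is applied only during training. Rescaling by `1 / keep` keeps the expected activation unchanged, so no adjustment is needed at inference time.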