# Deep Learning Fundamentals

- Intelligence: The ability to process information to inform future decisions
- Deep (Machine, Structured) Learning=Deep Neural Networks=Hierarchical Learning

![image](dl.png)

- Hand engineered features are time consuming, brittle and not scalable in practice.
- Learn underlying features/patterns directly from raw data automatically, without needing a human to come in and annotate the rules that the system needs to learn:
  - From low-level: lines, corners & edges
  - To high-level: facial structure
- Neural Networks with at least one hidden layer are universal approximators: given enough hidden units, they can approximate any continuous function. However, it is an empirical observation that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks, despite the fact that their representational power is equal.
- In many cases, more than one layer is needed to capture more variation in the function the network computes. It is possible in principle to write one complicated function that represents the composition of all the layers of the network, but in most cases composing the functions by hand is very hard. So, in practice, many layers that each apply a linear combination followed by a non-linearity are used.
- Overall picture for deep learning:

![image](overall_dl.png)

## Naming Conventions: 

- Notice that when we say N-layer neural network, we do **not count the input layer**. 
- Therefore, a single-layer neural network describes a network with no hidden layers (input directly mapped to output). 
- In that sense, you can sometimes hear people say that logistic regression or SVMs are simply a special case of single-layer Neural Networks. You may also hear these networks interchangeably referred to as "Artificial Neural Networks" (ANN) or "Multi-Layer Perceptrons" (MLP). Many people do not like the analogies between Neural Networks and real brains and prefer to refer to neurons as units.

## Sizing Neural Networks: 

- The two metrics that people commonly use to measure the size of neural networks are: 
  1. The number of neurons
  2. The number of parameters (more common)
- Working with the two example networks pictured below:

![image](neural_net.jpeg)
![image](neural_net2.jpeg)

- The first network has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
- The second network has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
- In addition, for the second network: 
  - The input is a [3x1] vector. Notice that instead of having a single input column vector, the variable x could hold an entire batch of training data (where each input example would be a column of x) and then all examples would be efficiently evaluated in parallel.
  - All connection strengths for a layer can be stored in a single matrix.
  - The first hidden layer’s weights W1 is of size [4x3] and the biases for all units is in the vector b1, of size [4x1]. Here, every single neuron has its weights in a row of W1, so the matrix vector multiplication np.dot(W1,x) evaluates the activations of all neurons in that layer.
  - Similarly, W2 is a [4x4] matrix that stores the connections of the second hidden layer.
  - W3 is a [1x4] matrix for the last (output) layer. 
  - The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function.
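
The forward pass described above can be sketched in NumPy roughly as follows. The shapes follow the second example network (3 inputs, two hidden layers of 4 units, 1 output); the random initial values, the choice of sigmoid as the activation `f`, and the names `h1`, `h2`, `out` and `n_params` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    # Sigmoid, chosen here only as an example activation function.
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialised parameters with the shapes given in the text.
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))   # one input example as a [3x1] column vector

h1 = f(np.dot(W1, x) + b1)        # first hidden layer activations, [4x1]
h2 = f(np.dot(W2, h1) + b2)       # second hidden layer activations, [4x1]
out = np.dot(W3, h2) + b3         # output score, [1x1] (no activation applied here)

# 12 + 4 + 16 + 4 + 4 + 1 learnable parameters in total.
n_params = sum(p.size for p in (W1, b1, W2, b2, W3, b3))
print(out.shape, n_params)
```

Replacing `x` with a `[3xN]` matrix whose columns are separate examples evaluates the whole batch with the same three matrix multiplications, because the bias vectors broadcast across the columns.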

## Biological Motivation and Connections:

- The basic **computational unit of the brain** is a **neuron**. Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 synapses. The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right). 
- Each neuron **receives** input signals from its **dendrites**. 
- Each neuron **produces** output signals along its (single) **axon**. 
- The axon eventually branches out and **connects via synapses** to dendrites of other neurons. 
- In the computational model of a neuron, the signals that travel along the axons (e.g. x0) interact multiplicatively (e.g. w0x0) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. w0). 
- The idea is that the **synaptic strengths (the weights w)** are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another. 
- In the basic model, the dendrites carry the signal to **the cell body** where they all get summed. If the final sum is above a certain threshold, the neuron can fire, sending a spike along its axon. 
- In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information. Based on this rate code interpretation, we model the firing rate of the neuron with an activation function f, which represents the frequency of the spikes along the axon. Historically, a common choice of activation function is the sigmoid function $\sigma$, since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1.

![image](neuron.png)

![image](neuron_model.jpeg)

- **Differences:** The dendrites in biological neurons perform complex **nonlinear computations**. The synapses are **not just a single weight**, they’re a complex non-linear dynamical system. The exact **timing of the output spikes** in many systems is known to be **important**, suggesting that the rate code approximation may not hold.

## Parameters vs Hyperparameters:

- Parameters are learned directly and automatically from the training data. (Weights and biases)
- Hyperparameters are set manually to optimize learning. (Learning rate, batch size etc.)

## Deep Learning vs Nearest Neighbour:

### TODO: simplify

- First, note that the single matrix multiplication $Wx_i$ is effectively evaluating one classifier per class in parallel (10 separate classifiers for a 10-class problem), where each classifier is a row of W.
- Notice also that we think of the input data (xi,yi) as given and fixed, but we have control over the setting of the parameters W,b. Our goal will be to set these in such way that the computed scores match the ground truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we wish that the correct class has a score that is higher than the scores of incorrect classes.
- An advantage of this approach is that the training data is used to learn the parameters W,b, but once the learning is complete we can discard the entire training set and only keep the learned parameters. That is because a new test image can be simply forwarded through the function and classified based on the computed scores.
- Lastly, note that classifying the test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images.
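
A minimal sketch of this scoring step, assuming 10 classes and images flattened into 3072-dimensional vectors (both numbers, the random stand-in values for the learned `W` and `b`, and the variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, dim = 10, 3072          # assumed: 10 classes, 32x32x3 images flattened
W = 0.01 * rng.standard_normal((num_classes, dim))   # stand-in for learned weights
b = np.zeros(num_classes)                            # stand-in for learned biases

x_i = rng.random(dim)                # stand-in for one flattened test image

# One matrix multiplication and one addition give a score for every class;
# row k of W acts as the classifier for class k.
scores = np.dot(W, x_i) + b
predicted_class = int(np.argmax(scores))
print(scores.shape, predicted_class)
```

No training images are needed at this point: only `W` and `b` are kept, which is the contrast with a nearest neighbour classifier drawn above.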
  
## Linear Classifiers:

- A linear classifier performs statistical classification by making a decision based on the value of a linear combination of the characteristics. In other words, the classification algorithm makes its predictions based on a linear predictor function that combines a set of weights with the feature vector.
- The decision boundary is linear.
- They can solve linearly separable problems (e.g. or, and) but they can’t solve non-linearly separable problems such as simple XOR (unless input is transformed into a better representation).
- Perceptron is a type of linear classifier.
- These models are of the generic form $y=Wx(+b)$, and we try to find which one of them best fits the current dataset. $W$ is called the weights of the network and can be initialised randomly. These types of models are called feed-forward linear layers.
- Weights produced after the learning procedure are used by linear classifiers as **the templates/prototypes** to be compared.
- Every row of W is a template for one of the classes. The geometric interpretation of these numbers is that as we change one of the rows of W, the corresponding line in the pixel space will **rotate** in different directions. The biases b, on the other hand, allow our classifiers to **translate** the lines. In particular, note that without the bias terms, plugging in $x_i=0$ would always give score of zero regardless of the weights, so all lines would be forced to cross the origin.

- Building a template out of the training data is **difficult** if:
  - Intraclass variation is high
  - Training images contain multiple objects
  - The class appears in different locations in the training images
  
### Perceptron: 

- The structural building block of deep learning. Neuron of Neural Network.
- A single neuron can be used to implement a binary classifier with an appropriate loss function on the neuron’s output (e.g. binary Softmax (logistic regression) or binary SVM classifiers).

#### Forward Propagation:

![image](formula.png)
![image](scheme.png)

- The forward pass of a fully-connected layer corresponds to **one matrix multiplication** followed by **a bias offset** and **an activation function**.

#### Activation Functions: 

- The purpose of activation functions is to introduce non-linearities into the network.
- All commonly used hidden-layer activation functions are non-linear.
- **Bias** term allows us to shift activation function left and right:
  - Eliminate the effect of classes with higher intensities
  - Incline towards some particular classes more
- Sigmoid is useful for probabilities and distributions because it gives outputs in the [0, 1] range
![image](activation.png)

![image](separator.png)

- Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input. The non-linearity is where we get the wiggle.
- The ability to draw arbitrarily complex decision boundaries in the feature space is what makes Neural Networks so powerful in practice.
- We don't explicitly enforce any behaviour on hidden layers. These layers are learned. Hidden unit is a single node in a hidden layer.
- Fully connected layers: Neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections. All possible linear combinations of inputs based on weights are formed. Neural Networks with fully-connected layers define a family of functions that are parameterized by the weights of the network. Another common name for these is dense layers. For regular neural networks, the most common layer type is the fully-connected layers.
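
The point about collapsing two matrices can be checked numerically. This is a small sketch with hand-picked values (the matrices and the input are assumptions chosen so the arithmetic is easy to follow):

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([[2.0],
              [1.0]])

# Two stacked linear layers with no activation in between...
two_layers = np.dot(W2, np.dot(W1, x))
# ...equal one linear layer whose weight matrix is the product of W2 and W1.
W_collapsed = np.dot(W2, W1)
one_layer = np.dot(W_collapsed, x)
print(np.allclose(two_layers, one_layer))    # True

# With a ReLU in between, the result is no longer given by the collapsed matrix.
with_relu = np.dot(W2, np.maximum(0.0, np.dot(W1, x)))
print(np.allclose(with_relu, one_layer))     # False
```

Here `np.dot(W1, x)` is `[1, -1]`, so the purely linear stack outputs 0, while the ReLU zeroes the negative entry and the output becomes 1.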

##### Sigmoid:

![image](sigmoid.jpeg)

$\displaystyle \sigma(x) = 1/(1+e^{-x})$

- Sigmoid non-linearity squashes real numbers to range between [0,1].
- In particular, large negative numbers become 0 and large positive numbers become 1. 
- The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). 
- In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has **three** major drawbacks:
  1. **Sigmoids saturate and kill gradients:** A very undesirable property of the sigmoid neuron is that when the neuron’s activation saturates at either tail of 0 or 1, the gradient at these regions is **almost zero**. Recall that during backpropagation, this (local) gradient will be multiplied to the gradient of this gate’s output for the whole objective. Therefore, if the local gradient is very small, it will effectively **"kill" the gradient** and almost no signal will flow through the neuron to its weights and recursively to its data. Additionally, **one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation**. For example, if the initial weights are too large then most neurons would become saturated and the network will **barely learn**.
  2. **Sigmoid outputs are not zero-centered:** This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because **if the data coming into a neuron is always positive (e.g. x>0 elementwise in $f=w^Tx+b$)**, then **the gradient on the weights w** (for a sigmoid neuron, $\displaystyle \frac{\partial \sigma(f)}{\partial w} = \sigma(f)(1-\sigma(f))x$, which has the same sign as x in every component) will during backpropagation become either **all positive, or all negative** (depending on the gradient of the whole expression f). This could introduce undesirable **zig-zagging dynamics in the gradient updates for the weights**. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.
  3. Exponentiation is expensive (When compared to max operation used by ReLU.).
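
The saturation problem from the first drawback can be seen by evaluating the local gradient $\sigma(x)(1-\sigma(x))$ at a few points. This is an illustrative sketch; the function names and the sample inputs are assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Local gradient d(sigma)/dx = sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient peaks at 0.25 for `x = 0` and drops below `1e-4` at `x = -10` and `x = 10`, so almost no signal flows back through a saturated neuron.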

##### tanh:

![image](tanh.jpeg)

$\displaystyle \tanh(x)=\frac{\sinh(x)}{\cosh(x)}=\frac{e^x-e^{-x}}{e^x+e^{-x}}=\frac{e^{2x}-1}{e^{2x}+1}$

- The tanh non-linearity squashes real numbers to range between [-1,1].
- **Like the sigmoid** neuron, its activations **saturate**.
- However, **unlike the sigmoid** neuron its output is **zero-centered**. 
- Therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity. 
- Note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: $\tanh(x) = 2 \sigma(2x) -1$.
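
The identity $\tanh(x) = 2\sigma(2x) - 1$ can be checked numerically with a short sketch (the helper name `tanh_via_sigmoid` and the sample points are assumptions for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh written as a scaled and shifted sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```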

##### ReLU:

![image](relu.jpeg)

$\displaystyle f(x) = \max(0, x)$

- The Rectified Linear Unit has become very popular in the last few years.

- There are several pros and cons to using the ReLUs:
  - (+) It was found to greatly **accelerate (e.g. a factor of 6) the convergence of SGD** compared to the sigmoid/tanh functions. It is argued that this is due to its **linear, non-saturating** form.
  - (+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented **efficiently** by simply thresholding a matrix of activations at zero.
  - (-) ReLU units can be fragile during training and can **"die"**. For example, **a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again**. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) **if the learning rate is set too high**. With a proper setting of the learning rate this is less frequently an issue. Alternatively, small positive biases like 0.01 can be applied before ReLU.
  
##### Leaky ReLU:

$\displaystyle f(x) = \mathbb{1}(x < 0) (\alpha x) + \mathbb{1}(x \geq 0) (x)$ where $\alpha$ is a small constant.

(or equivalently $\displaystyle f(x) = \max(\alpha x, x)$ for $0 < \alpha < 1$)

- Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU has a small non-zero slope in the negative region (of 0.01, or so).
- Some people report success with this form of activation function, but the results are not always consistent.
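
Both variants amount to elementwise thresholding, which can be sketched in NumPy as follows (the function names and the sample array are assumptions; `alpha=0.01` is the slope mentioned above):

```python
import numpy as np

def relu(x):
    # Threshold the activations at zero.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Keep a small slope alpha for negative inputs instead of zeroing them.
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # negative entries become 0
print(leaky_relu(x))   # negative entries are scaled by alpha
```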

##### Maxout:

$\displaystyle f(x)=\max(w_1^Tx+b_1, w_2^Tx + b_2)$

- Maxout neuron generalizes the ReLU and its leaky version.
- Notice that both ReLU and Leaky ReLU are special cases of the Maxout form (for example, for ReLU we have $w_1 = 0$, $b_1 = 0$).
- The Maxout neuron therefore enjoys **all the benefits of a ReLU unit (linear regime of operation, no saturation)** and does **not have its drawbacks (dying ReLU)**. 
- However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a **high total number of parameters**.
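
As a quick check of the special case above, setting $w_1 = 0$ and $b_1 = 0$ in a Maxout unit reproduces a ReLU applied to $w_2^Tx + b_2$. The weights and inputs below are hand-picked assumptions for illustration:

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    # Maximum of two affine functions of the input.
    return max(np.dot(w1, x) + b1, np.dot(w2, x) + b2)

w1, b1 = np.zeros(2), 0.0               # first piece is constantly zero
w2, b2 = np.array([0.5, 1.0]), 0.25     # second piece is an ordinary affine unit

x_neg = np.array([1.0, -2.0])   # w2.x + b2 = -1.25, so ReLU would output 0
x_pos = np.array([3.0, 1.0])    # w2.x + b2 = 2.75, so ReLU would output 2.75

print(maxout(x_neg, w1, b1, w2, b2), maxout(x_pos, w1, b1, w2, b2))
```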

##### ELU:

$\displaystyle g(z)=\begin{cases} z & \text{if } z > 0 \\ \alpha(e^z-1) & \text{if } z \leq 0 \end{cases}$

- Smooth ReLU.
- Differentiable everywhere when $\alpha = 1$.

## Losses:

- Once that prediction is made, its distance from the ground truth (error, loss) can be measured.
- The loss function is a performance metric of how well the NN (neural network) manages to reach its goal of generating outputs as close as possible to the desired values. It quantifies our unhappiness with predictions on the training set.
- We are seeking to minimize the error (the loss, the output of loss function or the objective function.).
- The main aim is to minimize it across the entire training set not just for a particular data point.
- Empirical loss: The mean of the losses over all data in the dataset.
- When the target outputs are 0 or 1 (as in the case of binary classification), it is usually desirable to use a sigmoid (binary softmax) output.
- **Examples of loss functions:**
  - Square loss (least-squares / ridge regression loss)
  - Hinge loss (Max-margin loss)
  - Logistic loss
  - Cross-entropy loss (Log loss)
  - Exponential loss
- As a summary, the loss function is an error metric, that gives an indicator on how much precision we lose, if we replace the real desired output by the actual output generated by our trained neural network model. That’s why it’s called loss.

![image](dataflow.jpeg)

  $\displaystyle L =  \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss}$

### Multiclass Support Vector Machine (SVM) Loss:

- The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$. 

- Formula: $\displaystyle L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$ where:
  - $x_i$: the pixels of i-th image
  - $y_i$: the label that specifies the index of the correct class
  - $s$: Short for scores. The score function takes the pixels of the i-th image ($x_i$) and computes the vector $f(x_i,W)$ of class scores. 
  - $s_j$: The score for the j-th class is the j-th element of $f(x_i,W)$: $s_j = f(x_i, W)_j$.
  - $\Delta$: A hyperparameter that forces the score of the correct class $y_i$ to be larger than the incorrect class scores by at least this amount
- **Note that the expression above sums over all incorrect classes ($j\neq y_i$)**

![image](margin.jpg)

- If any incorrect class has a score inside the margin of the actual class score (in the red region) or higher than the score of the actual class (in the region to the right of the actual class score), then loss will be accumulated. Otherwise (in the green region), the loss will be zero.
- The threshold at zero, the $\max(0,-)$ function, is often called the **hinge loss**. Sometimes people instead use the squared hinge loss SVM (or L2-SVM), which uses the form $\max(0,-)^2$ that penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better. This can be determined during cross-validation.

#### Example: 

Suppose that we have three classes that receive the scores $s = [13, -7, 11]$, and that the first class is the true class (i.e. $y_i = 0$). Also assume that $\Delta$ is 10. The expression above sums over all incorrect classes ($j\neq y_i$), so we get two terms:

$L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)$

- You can see that the first term gives zero since $[-7 - 13 + 10]$ gives a negative number, which is then thresholded to zero with the $\max(0,-)$ function. We get zero loss for this pair because the correct class score (13) was greater than the incorrect class score (-7) by at least the margin 10. In fact the difference was 20, which is much greater than 10, but the SVM only cares that the difference is at least 10; any additional difference above the margin is clamped at zero with the max operation. 
- The second term computes $[11 - 13 + 10]$ which gives 8. That is, even though the correct class had a higher score than the incorrect class (13 > 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much larger the difference between the correct and incorrect class scores would need to be to meet the margin).
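
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `svm_loss_i` and its argument names are assumptions for illustration:

```python
def svm_loss_i(scores, y, delta):
    # Sum the hinge terms max(0, s_j - s_y + delta) over all incorrect classes j != y.
    return sum(max(0, s_j - scores[y] + delta)
               for j, s_j in enumerate(scores) if j != y)

# Scores [13, -7, 11], true class index 0, margin 10: the two terms are 0 and 8.
print(svm_loss_i([13, -7, 11], y=0, delta=10))   # 8
```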

### Softmax Classifier (Multinomial Logistic Regression):

- The Softmax classifier, which uses a different loss function, is also a popular choice of classifier.
- The Softmax classifier is the generalization of the Binary Logistic Regression classifier to multiple classes.
- Unlike the SVM which treats the outputs $f(x_i,W)$ as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output **(normalized class probabilities which allows you to interpret the confidence of classifier in each class)**. 
- In the Softmax classifier, the function mapping $f(x_i;W)=Wx_i$ stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and **replace the hinge loss with a cross-entropy loss** that has the form:

  $\displaystyle L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j}$
  
- We are using the notation $f_j$ to mean the j-th element of the vector of class scores f. 
- The full loss for the dataset is again the mean of $L_i$ over all training examples together with a regularization term $R(W)$. (See the formula in regularization part)
- The function $\displaystyle f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is called the **softmax function**: 
  - It takes a vector of arbitrary real-valued scores (in $z$)
  - Squashes it to a vector of values between zero and one that sum to one.
- So, the softmax function is used to squash the raw class scores into normalized positive values that sum to one to generate a probability distribution so that the cross-entropy loss can be applied.
- The Softmax classifier gets its name from the softmax function. In particular, note that technically it doesn’t make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.

#### Different Interpretations of Softmax:

- The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:

  $\displaystyle H(p,q) = - \sum_x p(x) \log q(x)$
  
  The Softmax classifier is hence **minimizing the cross-entropy between the estimated class probabilities** ($\displaystyle q = e^{f_{y_i}}  / \sum_j e^{f_j}$ as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. $p=[0,…1,…,0]$ contains a single 1 at the $y_i$-th position.). The cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
- The expression:

  $\displaystyle P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j} }$
  
  can be interpreted as **the (normalized) probability assigned to the correct label $y_i$ given the image $x_i$ and parameterized by W**. To see this, remember that the Softmax classifier interprets the scores inside the output vector f as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore **minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE)**. [A nice feature of this view is that we can now also interpret the regularization term R(W) in the full loss function as coming from a Gaussian prior over the weight matrix W, where instead of MLE we are performing the Maximum a posteriori (MAP) estimation.]
  
#### Numeric Stability:

When you’re writing code for computing the Softmax function in practice, the intermediate terms $\displaystyle e^{f_{y_i}}$ and $\displaystyle \sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant C and push it into the sum, we get the following (mathematically equivalent) expression:

$\displaystyle \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}
= \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}}
= \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$

We are free to choose the value of C. This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for C is to set $\displaystyle  \log C = -\max_j f_j$. This simply states that we should shift the values inside the vector f so that the highest value is zero.
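
A short NumPy sketch of this trick follows. The example scores are assumptions, picked to be large enough that the naive computation overflows:

```python
import numpy as np

f = np.array([123.0, 456.0, 789.0])   # example scores with large values

# Naive version: np.exp(789) overflows to inf, so the result contains nan.
with np.errstate(over="ignore", invalid="ignore"):
    p_naive = np.exp(f) / np.sum(np.exp(f))

# Stable version: shift the scores so the largest one is 0 (log C = -max_j f_j).
f_shifted = f - np.max(f)
p = np.exp(f_shifted) / np.sum(np.exp(f_shifted))

print(p_naive)
print(p)
```

After the shift every exponent is at most zero, so each exponential lies in (0, 1] and the division is safe.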

### SVM vs Softmax:

![image](svmvssoftmax.png)

- In both cases we compute the same score vector f (e.g. by matrix multiplication: Wx). 
- The difference is in the interpretation of the scores in f: 
  - The SVM interprets these as class scores and its loss function encourages the correct class (class 2, in blue) to have a score higher by a margin than the other class scores. 
  - The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low).
- The numbers obtained from SVM and Softmax are not comparable. They are only meaningful in relation to loss computed within the same classifier and with the same data.
- The performance difference between the SVM and Softmax is usually very small.
- Compared to the Softmax classifier, the SVM is a more local objective, which could be thought of either as a bug or a feature. 
- Consider an example that achieves the scores $[10, -2, 3]$ and where the first class is correct: 
  - An SVM (e.g. with desired margin of $\Delta=1$) will see that the correct class already has a score higher than the margin compared to the other classes and it will compute loss of zero. **The SVM does not care about the details of the individual scores**: if they were instead $[10, -100, -100]$ or $[10, 9, 9]$ the SVM would be **indifferent since the margin of 1 is satisfied and hence the loss is zero.** 
  - However, these scenarios are not equivalent to a Softmax classifier, which would accumulate a much higher loss for the scores $[10, 9, 9]$ than for $[10, -100, -100]$. 
  - In other words, the Softmax classifier is **never** fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better. 
  - However, **the SVM is happy once the margins are satisfied** and it does not micromanage the exact scores beyond this constraint. This can intuitively be thought of as a feature: For example, a car classifier which is likely spending most of its “effort” on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud.
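
The three score vectors discussed above can be compared directly. This is an illustrative sketch in which the function names are assumptions, the SVM margin is $\Delta=1$ as in the example, and the first class is taken as correct:

```python
import math

def svm_loss(scores, y, delta=1.0):
    # Multiclass hinge loss for one example.
    return sum(max(0.0, s - scores[y] + delta)
               for j, s in enumerate(scores) if j != y)

def softmax_loss(scores, y):
    # Cross-entropy loss -f_y + log(sum_j exp(f_j)), with the scores
    # shifted by their maximum for numerical stability.
    m = max(scores)
    return -(scores[y] - m) + math.log(sum(math.exp(s - m) for s in scores))

for scores in ([10, -2, 3], [10, -100, -100], [10, 9, 9]):
    print(scores, svm_loss(scores, 0), softmax_loss(scores, 0))
```

The SVM loss is zero in all three cases, while the cross-entropy loss is about 0.55 for `[10, 9, 9]` and essentially zero for `[10, -100, -100]`.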
  
## Loss Optimization:

- Goal: Find the set of all weights in the neural network W that **minimize the empirical loss**.
- Two layers of abstraction:
  1. Optimization level - where techniques like SGD, Adam, Rprop, BFGS etc. come into play, which (if they are first order or higher) use the gradient computed at the level below
  2. Gradient computation - where backpropagation comes to play
  In summary, the parameters in weight matrices are learned with (stochastic) gradient descent, and their gradients are derived with backpropagation (computed with chain rule) to minimize the empirical loss.
- Any optimiser can work with 3 modes: 
  - full online learning: single data point taken from the dataset
  - mini-batch learning: batches of certain size taken from the dataset
  - full-batch learning: single batch which is the entire dataset

### 1. Loss Optimization through Gradient Descent:

- Gradient is the rate of inclination or declination of a slope and descent means the instance of descending.
- A neural network has many parameters, so we measure the partial derivative of the error with respect to each parameter, i.e. each parameter’s contribution to the total change in error. From these we can compute the best direction along which to change the weight vector, one that is mathematically guaranteed to be the direction of steepest descent (at least in the limit as the step size goes towards zero). This direction is the negative of the gradient of the loss function, since the gradient gives the direction of steepest increase by definition.
- Learning Rate (Step Size) is the factor by which the gradient is scaled when the parameters are updated: 
  - Too small: Learning is very slow and the model may get stuck in a local minimum. **(Almost constant loss)**
  - Too large: Faster learning, but the updates may overshoot and diverge from the minimum (higher chance of instability, **possibly infinite/NaN loss**)
- SGD is one of many optimization methods. It is a first-order optimizer, meaning that it is based on the gradient of the objective. Consequently, for neural networks it is often applied together with backpropagation to make efficient updates. You could also apply SGD to gradients obtained in a different way (from sampling, numerical approximation etc.), or use other optimization techniques with backprop as well: anything that can use a gradient/Jacobian.
- If the total error increases by 0.006 each time a weight is increased by 0.0001 ($\Delta W=0.0001$), **the rate of (composed) error change** (relative to the change in the weight) is **0.006/0.0001=60**.
- Instead of estimating this rate from small changes, it can be obtained exactly by directly calculating the derivative of the loss function.
- We need the terms epoch, batch size and iteration when the dataset is too big to pass to the computer all at once:
  - One **epoch** is when an **entire dataset** is passed forward and backward through the neural network **only once**.
  - **Batch** is a set or part taken out from the dataset.
  - **Batch size** is total **number of training examples present in a single batch**.
  - **Iteration** is **one pass over a single batch** (a forward and backward pass followed by a parameter update). The **number of iterations** needed to **complete one epoch** is equal to the number of batches.
  - For example, if we divide a dataset of 2000 examples into **batches of 500**, it will **take 4 iterations** to **complete 1 epoch**.
  - Minimising the loss over the whole dataset at once is called batch learning, and might be very slow for big data. 
  - Minimising the loss through batches of size N taken from the (randomly shuffled) training dataset is called mini-batch gradient descent.
- **Smaller networks are harder to train with local methods such as GD:**
  - It’s clear that loss functions of smaller networks have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). 
  - Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. 
  - In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima.
  - On the other hand, if you train a large network you’ll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.
  
#### Stochastic Gradient Descent (SGD):

- Picking a single point: Computationally simple, but since only a single point is considered, this can be very noisy. Taking a step in the direction of the point picked might mean taking a step which is not representative of the entire data set.
- Instead, use batches of size B. The estimate of the true gradient is obtained by averaging the gradients of the examples in the batch.
- More efficient than computing the gradient over the full dataset, because the batches are small and can be processed in parallel.
- More accurate than single-point SGD.
- Note that even though SGD technically refers to using a single example at a time to evaluate the gradient, you will hear people use the term SGD even when referring to mini-batch gradient descent (i.e. mentions of MGD for “Minibatch Gradient Descent”, or BGD for “Batch gradient descent” are rare to see), where it is usually assumed that mini-batches are used.
- Mini-batch SGD:
  1. Sample a batch of data
  2. Forward propagate it through the graph, obtain loss
  3. Perform backpropagation to calculate gradients
  4. Update parameters using the gradient
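
The four steps above can be sketched on a toy problem. This is a minimal illustration, assuming a linear model with a mean squared error loss and made-up synthetic data; the gradients are derived by hand here instead of by backpropagation through a graph, and the learning rate, batch size and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y = 3*x1 - 2*x2 + 1 plus small noise.
X = rng.normal(size=(2000, 2))
true_w, true_b = np.array([3.0, -2.0]), 1.0
y = X @ true_w + true_b + 0.01 * rng.normal(size=2000)

w, b = np.zeros(2), 0.0
learning_rate, batch_size, num_epochs = 0.1, 500, 50


def mse(w, b):
    """Mean squared error over the whole dataset."""
    return np.mean((X @ w + b - y) ** 2)


initial_loss = mse(w, b)
for epoch in range(num_epochs):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # 1. sample a batch
        xb, yb = X[idx], y[idx]
        err = xb @ w + b - yb                    # 2. forward pass; batch loss is mean(err**2)
        grad_w = 2 * xb.T @ err / len(idx)       # 3. gradient of the batch loss
        grad_b = 2 * err.mean()
        w -= learning_rate * grad_w              # 4. parameter update
        b -= learning_rate * grad_b
final_loss = mse(w, b)

print(initial_loss, final_loss)
print(w, b)
```

With 2000 examples and a batch size of 500, each epoch performs 4 updates, so the loop runs 200 iterations in total.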

### 2. Gradient Computation with Backpropagation: 

![image](forward_backward.png)

- A data instance flowing through a network’s parameters toward the prediction at the end is forward propagation.
- When the network propagates information about the error through the network backwards, this is backpropagation.
- Backpropagation is the central mechanism by which neural networks learn. It is the messenger telling the network whether or not the net made a mistake when it made a prediction.
- The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.
- Backpropagation is an efficient gradient computing technique in directed graphs of computations (such as neural networks).
- So, it is not a learning method, but rather a computational trick which is often used in learning methods.
- Mainly, it is the recursive application of the chain rule, which gives the ability to compute all required partial derivatives $\displaystyle\frac{\partial J(W)}{\partial W}$ in linear time in terms of the graph size (while naive gradient computations would scale exponentially with depth).
- To decrease the error, we subtract the values calculated by backpropagation from the current weights (optionally multiplied by some learning rate, $\eta$).
- A rule for weight updates is the **delta rule**:
  - New Weight = Old Weight - Derivative of Loss Function at Current Weight * Learning Rate
- We use the original weights, not the updated weights, as we continue the backpropagation algorithm backwards.
- Backpropagation takes the error associated with a wrong guess by a neural network, and uses that error to adjust the neural network’s parameters in the direction of less error. It knows the direction of less error from Gradient Descent.
- Backpropagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), to adjust the final output value.
- Analogy: You could compare a neural network to a large piece of artillery that is attempting to strike a distant object with a shell. When the neural network makes a guess about an instance of data, it fires, and the gunner tries to make out where the shell struck, and how far it was from the target. That distance from the target is the measure of error. The measure of error is then applied to the angle of and direction of the gun (parameters), before it takes another shot.

- Even though the gradient is technically a vector, we will often use terms such as “the gradient on x” instead of the technically correct phrase “the partial derivative on x” for simplicity.

![image](backprop.png)

- The derivative for the multiplication operation:

  $\displaystyle f(x,y) = x y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x$
  
  The local gradients (gradients of the immediate output of a unit with respect to its inputs) of the multiply gate are the input values (except switched), and this is multiplied by the gradient on its output during the chain rule. In the example above, the gradient on x is -8.00, which is -4.00 x 2.00.

- The derivative for the addition operation:

  $\displaystyle f(x,y) = x + y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = 1 \hspace{0.5in} \frac{\partial f}{\partial y} = 1$
  
  The add gate always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass. This follows from the fact that the local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal the gradients on the output because it will be multiplied by x1.0 (and remain unchanged). In the example circuit above, note that the + gate routed the gradient of 2.00 to both of its inputs, equally and unchanged.

- The derivative for the max operation:

  $\displaystyle f(x,y) = \max(x, y) \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = \mathbb{1}(x \geq y) \hspace{0.5in} \frac{\partial f}{\partial y} = \mathbb{1}(y \geq x)$
  
  The max gate routes the gradient. Unlike the add gate which distributed the gradient unchanged to all its inputs, the max gate distributes the gradient (unchanged) to **exactly one of its inputs (the input that had the highest value during the forward pass)**. This is because the local gradient for a max gate is 1.0 for the highest value, and 0.0 for all other values. In the example circuit above, the max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and the gradient on w remains zero.
- Notice that backpropagation is a local process. Every gate in a circuit diagram gets some inputs and can right away compute two things: 
  1. its output value 
  2. the local gradient of its output value with respect to its inputs. 
- Notice that the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in. However, once the forward pass is over, during backpropagation the gate will eventually learn about the gradient of its output value on the final output of the entire circuit. Chain rule says that the gate should take that gradient and multiply it into every gradient it normally computes for all of its inputs.
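
The gate rules above can be traced by hand on a small circuit. This sketch assumes the circuit $f(x,y,z,w) = 2(xy + \max(z,w))$ and the input values below; they are chosen to reproduce the numbers quoted in the text (gradient on x of -8.00, a gradient of 2.00 routed by the + gate, and the max gate routing to z), but they are not read from the figure itself.

```python
# Assumed circuit: f(x, y, z, w) = 2 * (x*y + max(z, w)), with assumed inputs.
x, y, z, w = 3.0, -4.0, 2.0, -1.0

# Forward pass: every gate computes its output from its inputs only.
prod = x * y             # multiply gate -> -12.0
mx = max(z, w)           # max gate      ->   2.0
total = prod + mx        # add gate      -> -10.0
f = 2.0 * total          # scale gate    -> -20.0

# Backward pass: each gate multiplies its local gradient by the gradient on its output.
df = 1.0                          # gradient of f with respect to itself
dtotal = 2.0 * df                 # scale gate: local gradient is the constant 2
dprod = 1.0 * dtotal              # add gate passes the gradient on unchanged ...
dmx = 1.0 * dtotal                # ... to both of its inputs
dx = y * dprod                    # multiply gate: local gradients are the swapped inputs
dy = x * dprod
dz = dmx if z >= w else 0.0       # max gate routes the gradient to the larger input
dw = dmx if w > z else 0.0

print(dx, dy, dz, dw)             # -8.0 6.0 2.0 0.0
```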

- Note that in linear classifiers, where the weights are multiplied with the inputs in a dot product $w^Tx_i$, the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples $x_i$ by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you’d have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! Having an intuitive understanding of how the gradients flow can help you debug some of these cases.

- [See an example calculation.](https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/)

## Underfitting and Overfitting:

- **Underfitting:** The complexity of the model is not large enough to really learn the full complexity of the data.
- **Overfitting:** The complexity of the model is so large (number of layers and number of parameters) that it essentially memorizes the training data. When it sees new data, it does not generalize, because it has matched the training data too closely. This means that it will have a high generalization error: the model performs worse on the test set than on the training set.
- Passing the entire dataset through a neural network once is not enough. We need to pass the full dataset multiple times to the same neural network because we are using a limited dataset and we are optimising the learning and the graph with Gradient Descent which is an iterative process. So, updating the weights with single pass or one epoch is not enough.
- One epoch leads to **underfitting** of the curve in the graph.
- As the number of epochs increases, (i.e. more number of times the weights are changed in the neural network) the curve goes from **underfitting** to **optimal** to **overfitting** curve.

### Regularization:

- Addresses the overfitting problem.
- There is a flaw with the Multiclass SVM loss function. Suppose that we have a dataset and a set of parameters W that correctly classify every example (i.e. all scores are so that all the margins are met, and $L_i=0$ for all $i$). The issue is that this set of W is not necessarily unique: there might be many similar W that correctly classify the examples. 
- One example is that if some parameters W correctly classify all examples (so loss is zero for each example), then any multiple of these parameters $\lambda W$ where $\lambda>1$ will also give zero loss because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between a correct class and a nearest incorrect class was 15, then multiplying all elements of W by 2 would make the new difference 30.
- In other words, we wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty $R(W)$.
- Regularization tends to improve **generalization (less overfitting)** by penalizing large weights because it means that no input dimension can have a very large influence on the scores all by itself.
- Regularization favors more evenly distributed and dense matrices over those which are more concentrated and sparse among the set of matrices W, which yield the same score. This feature also contributes to generalization because matrices with weights distributed over dimensions increase the dependence of the final score on different parts of the image and, at the same time, decrease the reliance on certain parts of it.
- The regularization loss in both SVM/Softmax cases could, from a biological point of view, be interpreted as gradual forgetting, since it has the effect of driving all synaptic weights w towards zero after every parameter update.
- Note that biases do not have the same effect since, unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to only regularize the weights W but not the biases b. (However, in practice this often turns out to have a negligible effect.) 
- Note that due to the regularization penalty we can **never achieve loss of exactly 0.0** on all examples, because this would only be possible in the pathological setting of W=0.

#### L2 Regularization:

- The most common regularization penalty is the L2 norm that discourages large weights through an elementwise quadratic penalty over all parameters:

  $\displaystyle R(W) = \sum_k\sum_l W_{k,l}^2$
  
- In the expression above, we are summing up all the squared elements of W. 
- **Notice that the regularization function is not a function of the data, it is only based on the weights.** 
- Including the regularization penalty completes the full Multiclass SVM loss, which is made up of two components: 
  1. The data loss (which is the average loss $L_i$ over all examples)
  2. The regularization loss
  
  $\displaystyle L =  \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss}$
  - where N is the number of training examples. 
  - We append the regularization penalty to the loss objective, weighted by a hyperparameter $\lambda$ **(regularization strength)**. There is no simple way of setting this hyperparameter and it is usually determined by cross-validation.
- The full Multiclass SVM loss becomes:

  $\displaystyle L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k\sum_l W_{k,l}^2$
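
The full loss above can be written out with NumPy. This is a minimal sketch, assuming linear scores $f(x_i; W) = x_i W$ with one example per row of `X`; the function name and the tiny dataset are made up for illustration.

```python
import numpy as np


def svm_loss(W, X, y, delta=1.0, lam=0.0):
    """Multiclass SVM loss with an L2 penalty.

    W: (D, C) weights, X: (N, D) inputs, y: (N,) integer class labels.
    Assumes linear scores f(x_i; W) = x_i @ W.
    """
    N = X.shape[0]
    scores = X @ W                                        # (N, C)
    correct = scores[np.arange(N), y][:, None]            # score of the true class, (N, 1)
    margins = np.maximum(0.0, scores - correct + delta)   # hinge on every class
    margins[np.arange(N), y] = 0.0                        # the sum skips j == y_i
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W ** 2)
    return data_loss + reg_loss


X = np.eye(3)                 # three toy examples, one per class
y = np.array([0, 1, 2])
W = 2.0 * np.eye(3)           # classifies every example with margin 2 > delta

print(svm_loss(W, X, y))              # 0.0: all margins are met
print(svm_loss(2 * W, X, y))          # 0.0: scaling W keeps the data loss at zero
print(svm_loss(W, X, y, lam=0.1))     # the penalty now prefers W ...
print(svm_loss(2 * W, X, y, lam=0.1)) # ... over 2W
```

Without regularization, `W` and `2 * W` are indistinguishable; with `lam > 0` the smaller weights receive the lower loss, which is the ambiguity the penalty is meant to remove.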

##### Example:

- Suppose that we have:
  - an input vector $x=[1,1,1,1]$
  - two weight vectors: 
    - $w_1=[1,0,0,0]$
    - $w_2=[0.25,0.25,0.25,0.25]$
- Then $w_1^Tx = w_2^Tx = 1$ so both weight vectors lead to the same dot product, but: 
  - **the L2 penalty of w1: 1.0** 
  - **the L2 penalty of w2: 0.25** 
- Therefore, according to the L2 penalty the weight vector w2 would be preferred since it achieves a lower regularization loss. 
- Intuitively, this is because **the weights in w2 are smaller and more diffuse**.
- As a result, **the final classifier is encouraged to take into account all input dimensions to small amounts rather than a few input dimensions and very strongly.** 
- This effect can improve the generalization performance of the classifiers on test images and lead to less overfitting.
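
The numbers in this example are easy to verify. A minimal check, with `l2_penalty` as a hypothetical helper name:

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])


def l2_penalty(w):
    """Sum of squared weights, R(W) for a single weight vector."""
    return float(np.sum(w ** 2))


print(w1 @ x, w2 @ x)                   # 1.0 1.0  (same score)
print(l2_penalty(w1), l2_penalty(w2))   # 1.0 0.25 (w2 is preferred)
```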

#### Dropout: 

- Randomly dropping out (zeroing) a number of neurons with some probability at every training iteration. The result behaves like an ensemble of many sub-networks that share weights along different paths through the network, and therefore **generalizes** better.
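
A minimal sketch of dropout on a layer's activations. It assumes the "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing has to change at test time; this variant is a common choice and not something the notes above specify, and the function name is hypothetical.

```python
import numpy as np


def dropout_forward(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations                     # test time: use all units unchanged
    keep_prob = 1.0 - drop_prob
    # Boolean keep-mask divided by keep_prob, so the expected activation is preserved.
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask


rng = np.random.default_rng(0)
h = np.ones((4, 1000))                         # made-up activations
h_train = dropout_forward(h, 0.5, rng, training=True)
h_test = dropout_forward(h, 0.5, rng, training=False)

print(np.unique(h_train))                      # kept units are scaled to 2.0, dropped are 0.0
print(h_train.mean())                          # close to 1.0 on average
```

A new mask is drawn on every call, so each training iteration effectively trains a different sub-network.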

# Deep Learning for Computer Vision

- Biologists have traced the origins of vision back 540 million years, to the Cambrian explosion.
- The reason that vision seems so easy for us as humans is that we have 540 million years of data that evolution has effectively trained on.
- Deep Learning allows us to extract and detect features in an image automatically and in a hierarchical fashion.


## Convolutional Neural Networks:

- Convolutional Neural Networks = ConvNet = CNN
- Convolution operation allows us to capitalize on the spatial structure that is inherent in visual data. It extracts local features by using the weights in filters.
- Feature map: The resulting matrix when a filter is convolved with an input. It shows where in the input image the filter was activated; higher values indicate stronger activation.
- Because of the convolution operation, a CNN preserves the spatial information in the image. This informs the network about the spatial structure of the picture.
- In this way, each neuron/unit in a hidden layer only sees a particular region of what the input to that layer is.
- Reduces the number of weights.
- Uses the fact that spatially close pixels are probably somehow related.
- Learning Task in CNN: Learn features directly from the image data = Learning the weights of the convolution filters
- All weights including the ones in filters and those in the fully connected layer at the end are learned through the learning process. 
- Fully connected layer is used for the actual classification task. By using the softmax function, the result outputted from this layer can be turned into a probability distribution indicating the membership possibility of images over a set of possible classes. (categorical probability distribution over the set of possible classes)
- The last layer of (most) CNNs is a linear classifier.
- Three main operations in CNN:
  - Convolution 
  - Non-linearity
  - Pooling: Downsampling, reducing dimensionality while going from one layer to the next.
- Steps for each (convolutional) layer:
  1. Apply a window of weights (convolution)
  2. Compute the linear combination of the weights against the input
     - Add bias
  3. Apply activation function on the output
- Output layer dimensions: $h \times w \times d$ where: 
  - $h$ and $w$ depend on size of the input, size of the filter and the stride,
  - $d$ depends on the number of different filters used. 
- Activating with ReLU: Negative values in the convolution result indicate the negative detection of the associated feature, and ReLU sets them to zero.
- Max Pooling: Take the maximum value in the patch
- If softmax is used and probability distribution is obtained at the end, cross entropy loss can be used to optimize weights via backpropagation.
- A famous CNN dataset: ImageNet dataset
- CNN is actually defined with the feature learning part of the pipeline.
- Second part of the pipeline can be replaced to suit many different applications, not just classification:
  - Object recognition
  - Segmentation: Downsampling (Feature extraction as before) + Upsampling (replaces classification part of the pipeline)
    - Semantic Segmentation: Group objects in an image semantically.
    - Depth Estimation:
    - Instance Segmentation: Identifying different instances of the same object
  - Image captioning: Feature extraction (as before) + RNN (replaces classification part of the pipeline)
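
The per-layer steps listed above (slide a window of weights, take the linear combination plus bias, apply the activation) can be sketched for a single-channel image and a single filter. This is a naive loop-based illustration with made-up values, not an efficient implementation; like most deep learning code it computes a cross-correlation (the kernel is not flipped), and the function name is hypothetical.

```python
import numpy as np


def conv2d_relu(image, kernel, bias=0.0, stride=1):
    """Slide one square filter over a 2D image, add a bias, apply ReLU."""
    H, W = image.shape
    F = kernel.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + F, j * stride:j * stride + F]
            out[i, j] = np.sum(patch * kernel) + bias   # linear combination + bias
    return np.maximum(out, 0.0)                          # ReLU non-linearity


# A 5x5 image with a dark-to-bright vertical edge between columns 2 and 3.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d_relu(image, kernel)
print(feature_map.shape)     # (3, 3)
print(feature_map)           # zero away from the edge, 3.0 where the filter overlaps it

# The opposite (bright-to-dark) detector gives negative responses, which ReLU zeroes.
print(conv2d_relu(image, -kernel))
```

In a trained CNN the kernel values are learned; the hand-written edge detector only stands in for one such learned filter.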
  
  
# TODO:
### Basics of NN Training for Images:
- Preprocess with **mean subtraction** to center the data at the origin: X -= np.mean(X, axis = 0). Note that the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test). The data can be normalized into range [-1, 1] after centering. However, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform normalization with standard deviation. Using PCA and whitening are not common.
- Do **NOT** initialize weights with all zeroes. **Network gets stuck.** If every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates. In other words, there is **no source of asymmetry between neurons** if their weights are initialized to be the same. 
- Small random numbers are generally used for symmetry breaking. The idea is that the neurons are **all random and unique in the beginning**, so they will compute **distinct updates** and integrate themselves as diverse parts of the full network. However, one problem with this suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs.
- We need to calibrate the variances by scaling the weights with 1/sqrt(n), where n is the number of inputs. Therefore, use **Xavier initialization (adapted version for ReLU, often called He initialization)**: w = np.random.randn(fan_in, fan_out) * sqrt(2.0/fan_in). This gives small calibrated random numbers for initialization and ensures that all neurons in the network initially have approximately the same output distribution. Empirically, this method improves the rate of convergence.
- fan-in = its number of inputs
- fan-out = its number of outputs
- **WARNING:** It’s not necessarily the case that smaller numbers will work strictly better. For example, a Neural Network layer that has very small weights will during backpropagation compute very small gradients on its data (since this gradient is proportional to the value of the weights). This could greatly diminish the “gradient signal” flowing backward through a network **due to the excessive multiplication of small numbers**, and could become a concern for deep networks.
- **Batch Normalization** can also be used. It initializes neural networks properly by explicitly forcing the activations throughout a network to take on a unit **gaussian distribution** at the beginning of the training. The core observation is that this is possible because normalization is a simple differentiable operation. In the implementation, we usually insert the BatchNorm layer immediately after fully connected layers (or convolutional layers), and before non-linearities. Note that it has become a very common practice to use Batch Normalization in neural networks. **In practice networks that use Batch Normalization are significantly more robust to bad initialization.** Additionally, batch normalization can be **interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiable manner.**
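
The mean subtraction and weight initialization notes above can be sketched as follows. The data is random and hypothetical (flattened 32x32x3 images with values in [0, 255]), the layer size is an arbitrary choice, and the scaling used is the ReLU-adapted one, `sqrt(2 / fan_in)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: flattened 32x32x3 "images" with pixel values in [0, 255].
X_train = rng.uniform(0, 255, size=(500, 3072))
X_test = rng.uniform(0, 255, size=(100, 3072))

# Mean subtraction: the mean is computed on the training split only ...
mean_image = X_train.mean(axis=0)
X_train_c = X_train - mean_image
X_test_c = X_test - mean_image        # ... and the same mean is subtracted from every split

# Calibrated random initialization for a ReLU layer, with zero biases.
fan_in, fan_out = 3072, 100
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
b = np.zeros(fan_out)

print(np.abs(X_train_c.mean(axis=0)).max())    # numerically zero
print(W.std(), np.sqrt(2.0 / fan_in))          # the two values should be close
```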

![image](batchnorm.png)
![image](batchnorm2.png)
![image](batchnorm3.png)

<p style="text-align: center;">Note that this selection of $\gamma$ and $\beta$ parameters performs identity mapping.</p>
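
A forward-pass sketch of Batch Normalization for a fully connected layer. The class name, the momentum value and the running-average update are assumptions made for illustration; the backward pass is omitted. During training the statistics come from the current batch, while at test time the stored running averages are used.

```python
import numpy as np


class BatchNorm1D:
    """Forward pass only; gamma and beta would be learned by gradient descent."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Empirical mean and variance, independently for each dimension.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Test time: fixed statistics gathered during training.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)     # normalize
        return self.gamma * x_hat + self.beta            # scale and shift


rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(256, 4))    # made-up activations, far from unit Gaussian

bn = BatchNorm1D(4)
out = bn.forward(x, training=True)
print(out.mean(axis=0), out.std(axis=0))   # approximately 0 and 1 per dimension

# Choosing gamma = sqrt(var + eps) and beta = mean undoes the normalization.
identity = BatchNorm1D(4)
identity.gamma = np.sqrt(x.var(axis=0) + identity.eps)
identity.beta = x.mean(axis=0)
print(np.allclose(identity.forward(x, training=True), x))
```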

- Batch Normalization steps:
  1. Compute empirical mean and variance independently for each dimension
  2. Normalize
- Batch Normalization pros:
  - Improves gradient flow through the network by preventing vanishing gradient in networks with saturable nonlinearities (sigmoid, tanh, etc). With Batch normalization, we ensure that the inputs of any activation function do not vary into saturable regions. Batch normalization transforms the distribution of those inputs to be unit Gaussian **(zero-centered and unit variance)**.
  - Allows higher learning rates: By preventing issues with vanishing gradient during training, we can afford to set higher learning rates. Batch normalization also reduces dependence on parameter scale. Large learning rates can increase the scale of layer parameters, which causes the gradients to amplify as they are passed back during backpropagation.
  - Reduces strong dependence on initialization.
  - Performs some regularization.
- Note that at test time BatchNorm functions differently. The mean/std are not computed based on the batch. Instead, a single fixed empirical mean and std of the activations, gathered during training, are used. (For example, these can be estimated during training with running averages.) 
- Use 0 for bias initialization.
- Use **ReLU** as activation function.
- Try different learning rates:
  - Start with high
  - Continue with low after some point
- Use **L2 regularization**
- Sanity Check - Overfit a tiny subset of data: Before training on the full dataset, try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it’s also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that you may be able to overfit a very small dataset and still have an incorrect implementation. For instance, if your datapoints’ features are random due to some bug, then it will be possible to overfit your small training set, but you will never see any generalization when you move on to your full dataset.
- It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size.
- Track the following:
  - Loss function
  - Validation/training accuracy
  - Ratio of Weights/Updates
  - Activation / Gradient distributions per layer
  - First-layer visualizations
- The most common hyperparameters in context of Neural Networks include:
  - the initial learning rate
  - learning rate decay schedule (such as the decay constant)
  - regularization strength (L2 penalty, dropout strength)
- Hyperparameter Optimization:
  - Search ranges: Search for hyperparameters **on log scale**. For example, a typical sampling of **the learning rate** would look as follows: learning_rate = 10 ** uniform(-6, 1). That is, we generate a random number from a uniform distribution and then raise 10 to that power. The same strategy should be used for **the regularization strength**. **Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics.** For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rates multiplied or divided by some value than a range of learning rates with some value added or subtracted. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. dropout = uniform(0,1)).
  - Prefer random search to grid search: Randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). **Performing random search rather than grid search allows you to much more precisely discover good values for the important ones.**

![image](gridsearchbad.jpeg)
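
A minimal sketch of random search with log-scale sampling. The learning rate and dropout ranges follow the notes above; the range for the regularization strength and the function name are assumptions for illustration. Each sampled configuration would then be trained and scored on the validation set, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_hyperparameters(rng):
    """Draw one random configuration; ranges are illustrative, not recommendations."""
    return {
        "learning_rate": 10 ** rng.uniform(-6, 1),   # log scale
        "reg_strength": 10 ** rng.uniform(-5, 1),    # log scale (assumed range)
        "dropout": rng.uniform(0, 1),                # original scale
    }


trials = [sample_hyperparameters(rng) for _ in range(1000)]
learning_rates = np.array([t["learning_rate"] for t in trials])

# Every decade between 1e-6 and 1e1 is sampled about equally often.
counts, _ = np.histogram(np.log10(learning_rates), bins=np.arange(-6, 2))
print(counts)
```

Sampling the exponent uniformly spends as many trials between 1e-6 and 1e-5 as between 1 and 10, which a uniform draw over the raw values would never do.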

# CNN

- Regular Neural Nets don’t scale well to full images. Considering images of size 200x200x3, a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 200x200x3 = 120,000 weights. As we want to have several such neurons, the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
- Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function.
- Three main types of layers to build ConvNet architectures: 
  1. Convolutional Layer
  2. Pooling Layer
  3. Fully-Connected Layer
- In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a **fixed function**. The parameters **in the CONV/FC layers** will be **trained** with gradient descent.
  - While CONV/FC have **parameters**, RELU/POOL don’t.
  - While CONV/FC/POOL have additional **hyperparameters**, RELU doesn’t.
  
## 1. Convolutional Layer:

- During the forward pass, each filter is convolved across the width and height of the input volume. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.
- **Local Connectivity:** When dealing with high-dimensional inputs such as images, as we saw above it is **impractical to connect neurons to all neurons in the previous volume**. Instead, we will connect each neuron to only **a local region of the input volume**. The spatial extent of this connectivity is a hyperparameter called the **receptive field of the neuron (equivalently this is the filter size)**. The connections are local in space (along width and height), but the extent of the connectivity along the depth axis is always equal to the depth of the input volume. With local connectivity, every entry in the 3D output volume can be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially **(since these numbers all result from applying the same filter)**.
  - Examples:
    - Input volume: [32x32x3], Receptive field (or the filter size): 5x5, Number of parameters: 5\*5\*3 = 75 weights (and +1 bias parameter)
    - Input volume: [16x16x20], Receptive field (or the filter size): 3x3, Number of connections with input:  3\*3\*20 = 180 (=number of weights).
- **Spatial arrangement:** Three hyperparameters control the size of the output volume (the number of neurons): 
  1. Depth: The depth corresponds to the number of filters we would like to use, each learning to look for something different in the input. For example, if the first Convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in presence of various oriented edges, or blobs of color. We will refer to a set of neurons that are all looking at the same region of the input as a depth column (some people also prefer the term fibre).
  2. Stride: Second, we must specify the stride with which we slide the filter. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2 (or 3 or more, though this is rare in practice) then the filters jump 2 pixels at a time as we slide them around. This will produce smaller output volumes spatially.
  3. Zero-padding: As we will soon see, sometimes it will be convenient to pad the input volume with zeros around the border. Zero padding allows us to preserve the spatial size of the input volume so the input and output width and height are the same.
- The spatial size (number of neurons) of the output volume:
  $\displaystyle \frac{(W-F+2P)}{S}+1$ where:
  - the input volume size (W), 
  - the receptive field size of the Conv Layer neurons (F), 
  - the stride with which they are applied (S),
  - the amount of zero padding used (P) on the border.
- **Parameter Sharing:** We can dramatically reduce the number of parameters by making one reasonable assumption: If one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). (For example, there is therefore no need to relearn to detect a horizontal edge at every one of the 55x55 distinct locations in the Conv layer output volume.) In other words, denoting a single 2D slice of depth as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to **constrain the neurons in each depth slice to use the same weights and bias.*** In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice. Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron’s weights with the input volume (Hence the name: Convolutional Layer). This is why it is common to refer to the sets of weights as a filter (or a kernel), that is convolved with the input.
  - Bias sharing*:
    - Tied bias: where you share one bias per kernel
    - Untied bias: where you use one bias per kernel and output location
  - Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the input are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.

- **1x1 convolution:** Some people are at first confused to see 1x1 convolutions, especially when they come from a signal processing background. Normally signals are 2D, so 1x1 convolutions do not make sense (it’s just pointwise scaling). However, in ConvNets this is not the case because convolution here operates over 3D volumes, and the **filters always extend through the full depth of the input volume**. For example, if the input is [32x32x3] then doing 1x1 convolutions would effectively be doing 3-dimensional dot products (since the input depth is 3 channels). **This has the following benefits:**
  - 1x1 convolutions can be used to make **reductions in depth dimension** before the expensive 3x3 and 5x5 convolutions.
  - They can also be used to **increase the depth dimension**.
  - It is also possible to include the use of **rectified linear activation** in this process. 
- All 4 hyperparameters needed for Convolutional Layer:
  1. Number of filters: K
  2. Spatial extent of filters (kernel size): F
  3. Stride: S
  4. Amount of zero padding: P
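
Given the four hyperparameters K, F, S and P, the output volume and the parameter count of a convolutional layer follow directly from the formula $(W-F+2P)/S+1$ and from parameter sharing. The function names are hypothetical, and the 227x227x3 input in the second example is an assumption chosen so that the output matches the [55x55x96] volume mentioned above.

```python
def conv_output_size(W, F, S, P):
    """Spatial output size along one dimension: (W - F + 2P) / S + 1."""
    numerator = W - F + 2 * P
    if numerator % S != 0:
        raise ValueError("filter does not tile the input evenly with this stride/padding")
    return numerator // S + 1


def conv_layer_shape_and_params(input_shape, K, F, S, P):
    """Return the output volume shape and the number of learnable parameters."""
    W_in, H_in, D_in = input_shape
    W_out = conv_output_size(W_in, F, S, P)
    H_out = conv_output_size(H_in, F, S, P)
    weights_per_filter = F * F * D_in              # filters span the full input depth
    total_params = K * (weights_per_filter + 1)    # parameter sharing: +1 bias per filter
    return (W_out, H_out, K), total_params


# 32x32x3 input, ten 5x5 filters, stride 1, padding 2: spatial size is preserved.
print(conv_layer_shape_and_params((32, 32, 3), K=10, F=5, S=1, P=2))
# ((32, 32, 10), 760): each filter has 5*5*3 = 75 weights plus 1 bias

# Assumed 227x227x3 input, 96 filters of size 11x11, stride 4, no padding.
print(conv_layer_shape_and_params((227, 227, 3), K=96, F=11, S=4, P=0))
# ((55, 55, 96), 34944)
```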

## 2. Pooling Layer:

- The idea is to progressively reduce the spatial size of the representation to **reduce the amount of parameters** and computation in the network, and hence to also **control overfitting**.
- The Pooling Layer operates **independently on every depth slice** (the depth dimension remains unchanged) of the input and resizes it spatially, using the "max" operation. 
- The most common form is a pooling layer with filters of size **2x2** applied with a **stride of 2**, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.
- The 2 hyperparameters needed for a Pooling Layer:
  1. Spatial extent of filters (kernel size): F
  2. Stride: S
- For Pooling layers, it is **not** common to pad the input using zero-padding.
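
As a minimal sketch of the pooling operation described above, the function below applies max pooling to a single depth slice stored as a list of lists. It assumes no padding, so the output size along each axis is (W - F)/S + 1; the function name and the sample values are illustrative only.

```python
def max_pool2d(depth_slice, f=2, s=2):
    """Max-pool one depth slice (list of rows) with an f x f window and stride s."""
    rows, cols = len(depth_slice), len(depth_slice[0])
    out_rows = (rows - f) // s + 1
    out_cols = (cols - f) // s + 1
    return [
        [
            max(depth_slice[r * s + i][c * s + j] for i in range(f) for j in range(f))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


depth_slice = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(depth_slice)
print(pooled)  # [[6, 8], [3, 4]]

# 16 activations go in and 4 come out, so 75% are discarded.
kept = sum(len(row) for row in pooled)
total = sum(len(row) for row in depth_slice)
print(1 - kept / total)  # 0.75
```

In a full layer the same function would be applied to every depth slice independently, leaving the depth dimension unchanged.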

## 3. Fully-connected Layer:

- Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.

## Converting FC Layers to CONV Layers:

- The only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical: 
  - CONV->FC: For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except at certain blocks (due to local connectivity), where the weights in many of the blocks are equal (due to parameter sharing).
  - FC->CONV: Any FC layer can be converted to a CONV layer. For example, an FC layer with K=4096 that is looking at some input volume of size 7×7×512 can be equivalently expressed as a CONV layer with **F=7,P=0,S=1,K=4096**. In other words, we are setting the filter size to be **exactly the size of the input volume**, and hence the output will simply be 1×1×4096 since only a single depth column “fits” across the input volume, giving identical result as the initial FC layer.
- Of these two conversions, **the ability to convert an FC layer to a CONV layer is particularly useful in practice**. This conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
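
The FC->CONV conversion and the "sliding" behaviour can be checked numerically. The sketch below uses NumPy with a deliberately small, hypothetical volume (3x3x2 input, K=4) instead of the 7×7×512 example, and a naive loop-based valid convolution written only for illustration; none of these names come from a particular library.

```python
import numpy as np


def conv_valid(x, filters, bias, stride=1):
    """Naive CONV with P=0. x: (H, W, D); filters: (K, F, F, D); bias: (K,)."""
    k, f = filters.shape[0], filters.shape[1]
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    out = np.empty((out_h, out_w, k))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            # Contract the (F, F, D) axes: one dot product per filter.
            out[i, j] = np.tensordot(filters, patch, axes=3) + bias
    return out


rng = np.random.default_rng(0)
F, D, K = 3, 2, 4  # small stand-ins for F=7, depth 512, K=4096

# FC layer looking at a flattened F x F x D volume.
w_fc = rng.standard_normal((K, F * F * D))
b = rng.standard_normal(K)

# Same weights viewed as K filters whose size equals the input volume.
filters = w_fc.reshape(K, F, F, D)

x = rng.standard_normal((F, F, D))
fc_out = w_fc @ x.ravel() + b
conv_out = conv_valid(x, filters, b)  # shape (1, 1, K)

# On a larger 5x5 input, one CONV pass evaluates the FC layer at every crop.
big = rng.standard_normal((5, 5, D))
sliding = conv_valid(big, filters, b)  # shape (3, 3, K)
crop_out = w_fc @ big[1:4, 2:5, :].ravel() + b

print(np.allclose(conv_out[0, 0], fc_out))
print(sliding.shape, np.allclose(sliding[1, 2], crop_out))
```

Each spatial position of `sliding` matches what the original FC layer would produce on the corresponding 3x3 crop, which is why the converted network can be evaluated on a larger image in a single forward pass.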

## Practical Considerations:

- Split your training data randomly into train/validation splits. As a rule of thumb, between 70-90% of your data usually goes to the train split. This setting depends on how many hyperparameters you have and how much of an influence you expect them to have. If there are many hyperparameters to estimate, you should err on the side of having a larger validation set to estimate them effectively. If you are concerned about the size of your validation data, it is best to split the training data into folds and perform cross-validation. If you can afford the computational budget, it is always safer to go with cross-validation (the more folds the better, but the more expensive). In most cases, though, a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You’ll hear people say they “cross-validated” a parameter, but many times it is assumed that they still only used a single validation set.
- It is very rare to mix and match different types of neurons (classifier+activation function) in the same network, even though there is no fundamental problem with doing so.
- In practice, the current recommendation for activation is ReLU units with initialization: w = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in).
- Note that it has become a very common practice to use Batch Normalization in neural networks.
- Unlike the other layers in a Neural Network, the output layer neurons most commonly do not have an activation function (or you can think of them as having a linear identity activation function). **This is because the last output layer is usually taken to represent the class scores (e.g. in classification), which are arbitrary real-valued numbers, or some kind of real-valued target (e.g. in regression).**
- As we increase the size and number of layers in a Neural Network, the capacity of the network increases. That is, the space of representable functions grows since the neurons can collaborate to express many different functions.
- Modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers. This is because, in CNNs, depth has been found to be an extremely important component. One argument for this observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which are made up of edges, etc.), so several layers of processing make intuitive sense for this data domain.
- For regular Neural Networks, in practice, it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4-, 5-, or 6-layer) rarely helps much more and may cause overfitting.
- In practice, it is always better to use regularization methods to control overfitting instead of changing the number of neurons.
- What neuron type should I use? Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.
- In SVM, it turns out that the margin hyperparameter $\Delta$ can safely be set to $\Delta=1.0$ in all cases.
- Gradient Descent is currently by far the most common and established way of optimizing Neural Network loss functions.
- Note that in the mathematical formulation the gradient is defined in the limit as h goes towards zero, but in practice it is often sufficient to use a very small value **(in the range [1e-5, 1e-3])**. Ideally, you want to use the smallest step size that does not lead to numerical issues. Additionally, in practice it often works better to compute the numeric gradient using the centered difference formula: $\displaystyle \frac{f(x+h)-f(x-h)}{2h}$.
- The size of the mini-batch is a hyperparameter but it is not very common to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2. In current state of the art ConvNets, a typical batch contains 256 examples from the entire training set of 1.2 million.
- A common setting for the CONV layer hyperparameters is F=3,S=1,P=1.
- There are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with F=3,S=2 (also called overlapping pooling), and more commonly F=2,S=2. Pooling sizes with larger receptive fields are **too destructive**.
- Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.
- Why use stride of 1 in CONV? Smaller strides work better in practice.
- Why use padding? In addition to the benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be “washed away” too quickly.
- Always save the model weights if the performance of the model on a holdout dataset is better than at the previous epoch. That way, you will always have the model with the best performance on the holdout set.
- Early Stopping: Some common triggers for stopping training early:
  - No change in metric over a given number of epochs.
  - An absolute change in a metric.
  - A decrease in performance observed over a given number of epochs.
  - Average change in metric over a given number of epochs.
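
The checkpointing rule and the first early-stopping trigger above (no improvement over a given number of epochs) can be combined in a small helper. This is only a sketch: the class name, the `patience` and `min_delta` parameters, and the hard-coded validation accuracies are hypothetical, and the string assignment stands in for actually saving model weights.

```python
class EarlyStopping:
    """Signal a stop when the holdout metric has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = None
        self.stale_epochs = 0

    def update(self, epoch, metric):
        """Record a holdout metric (higher is better); return (improved, should_stop)."""
        improved = metric > self.best + self.min_delta
        if improved:
            self.best = metric
            self.best_epoch = epoch
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return improved, self.stale_epochs >= self.patience


# Made-up validation accuracies, one per epoch.
val_accuracy = [0.61, 0.70, 0.74, 0.73, 0.74, 0.72, 0.71]

stopper = EarlyStopping(patience=3)
best_weights = None
stopped_epoch = None
for epoch, acc in enumerate(val_accuracy):
    improved, should_stop = stopper.update(epoch, acc)
    if improved:
        best_weights = f"weights@epoch{epoch}"  # stand-in for saving a checkpoint
    if should_stop:
        stopped_epoch = epoch
        break

print(stopper.best_epoch, best_weights, stopped_epoch)  # 2 weights@epoch2 5
```

Because weights are saved only on improvement, the checkpoint kept at the end is the one from the best epoch, not the last one.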

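The centered difference formula mentioned in the list above can be sketched in a few lines. The function name and the quadratic test function are illustrative assumptions; the step h=1e-5 is one value from the range quoted earlier.

```python
def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x with centered differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    # Illustrative function: f(x0, x1) = x0^2 + 3 * x0 * x1
    return x[0] ** 2 + 3 * x[0] * x[1]


point = [1.0, 2.0]
numeric = numerical_gradient(f, point)
analytic = [2 * point[0] + 3 * point[1], 3 * point[0]]  # [8.0, 3.0]

max_error = max(abs(n - a) for n, a in zip(numeric, analytic))
print(max_error < 1e-6)  # True
```

Comparing the numeric result against an analytic gradient in this way is the usual basis of a gradient check.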