>>> Work in Progress (The following are my notes on the lectures of Prof Fei-Fei Li/Prof Justin Johnson/Prof Serena Yeung - CS231n - Stanford. This is my interpretation of their excellent teaching and I take full responsibility for any misinterpretation or misinformation provided herein.)

### Lecture 3: Loss Function and Optimization

#### Outline:
- Loss functions
  - A loss function quantifies our unhappiness with the scores across the training data
    - a function that takes in a W and tells us quantitatively how bad that W is
    - goal is to minimize the loss on the training examples
    - different types of loss 
- Optimization
  - Come up with an efficient procedure to find W
    - search through the space of all possible Ws and find the value of W that is the least bad

#### Loss function
- Given a dataset $\{(x_{i}, y_{i})\}_{i=1}^{N}$, where $x_{i}$ is image and $y_{i}$ is (integer) label
- Loss over the dataset is the average of the loss over examples, plus a regularization term:
> $L(W) = \frac{1}{N}\sum\limits_{i}L_{i}(f(x_{i},W),y_{i}) + \lambda R(W)$
  - where 1st term is the data loss
  - 2nd term is the regularization term - it encourages a simpler model
- binary SVM - has 2 classes - each example is classified as a positive or negative example
- multiclass SVM - handles multiple classes
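
As a rough sketch of how the two terms of $L(W)$ fit together, the snippet below averages a per-example loss and adds an L2 penalty. It assumes a linear classifier $f(x_{i}, W) = Wx_{i}$ and an L2 choice for $R(W)$; the names `full_loss`, `per_example_loss` and `reg_strength` are hypothetical and not from the lecture.

```python
import numpy as np

def full_loss(W, X, y, per_example_loss, reg_strength):
    """Average data loss over the dataset plus lambda * R(W).

    W: (C, D) weights, X: (N, D) inputs, y: (N,) integer labels.
    per_example_loss: hypothetical callable (scores, label) -> float.
    """
    # Assumption: linear classifier, so each row of `scores` is f(x_i, W).
    scores = X @ W.T
    # Data loss: mean of L_i over the N training examples.
    data_loss = np.mean([per_example_loss(s, t) for s, t in zip(scores, y)])
    # Regularization loss: lambda * R(W), with R chosen as L2 for illustration.
    reg_loss = reg_strength * np.sum(W * W)
    return data_loss + reg_loss
```

Any per-example loss, such as the hinge loss or the softmax loss, can be plugged in as `per_example_loss`.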

#### Multiclass SVM loss
- Given a dataset $\{(x_{i}, y_{i})\}_{i=1}^{N}$, where $x_{i}$ is image and $y_{i}$ is (integer) label
- and scores vector $s = f(x_{i}, W)$
  - predicted scores that are coming from the classifier
  - $y_{i}$ is the ground truth label
  - $s_{y_{i}}$ denotes score of the true class for the ith example in training set
  - e.g., $s_{1}$ and $s_{2}$ could be the cat and dog scores respectively

- SVM loss has the form - **Hinge Loss**:  

> \begin{equation}\\
\begin{aligned}\\
  L_{i} &= \sum\limits_{j \neq y_{i}}
    \begin{cases}
      0 & \text{if $s_{y_{i}} \geq s_{j} + 1$}\\
      s_{j} - s_{y_{i}} + 1 & \text{otherwise}\\
    \end{cases}\\       
    &= \sum\limits_{j \neq y_{i}} \max(0, s_{j} - s_{y_{i}} + 1)
\end{aligned}\\
\end{equation}\\

<img src="images/03_Ws.png" height=400 width=400>
$\tiny{\text{YouTube-Stanford-CS231n-Justin Johnson}}$   

- If the score of the true class exceeds every other class score by at least the margin of 1, the loss is zero. Otherwise, we incur a loss that grows linearly with the size of the violation.
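
The hinge loss formula can be sketched in NumPy for a single example. This is a minimal illustration rather than the course's reference code; the function name `svm_loss_single` and the three-class scores (cat, car, frog) are made up for the example.

```python
import numpy as np

def svm_loss_single(scores, true_class, margin=1.0):
    """Multiclass SVM (hinge) loss L_i for one example."""
    # max(0, s_j - s_{y_i} + margin) for every class j at once.
    margins = np.maximum(0.0, scores - scores[true_class] + margin)
    # The sum runs over j != y_i, so drop the true class term.
    margins[true_class] = 0.0
    return margins.sum()

# Illustrative scores for three classes; the true class is index 0 (cat).
scores = np.array([3.2, 5.1, -1.7])
loss = svm_loss_single(scores, true_class=0)
print(loss)
```

For these scores the loss is about 2.9: the second class violates the margin (5.1 - 3.2 + 1), while the third class is far enough below the true score to contribute nothing.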

<img src="images/03_Ls1.png" height=400 width=400>  
$\tiny{\text{YouTube-Stanford-CS231n-Justin Johnson}}$   

<img src="images/03_Ls2.png" height=400 width=400>  
$\tiny{\text{YouTube-Stanford-CS231n-Justin Johnson}}$   

<img src="images/03_Ls3.png" height=400 width=400>  
$\tiny{\text{YouTube-Stanford-CS231n-Justin Johnson}}$   


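- Why +1?
  - We only care about the relative differences between scores, so the exact margin is an arbitrary choice
  - rescaling W rescales all score differences, so any other positive margin could be absorbed into the scale of W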

#### Regularization - 2nd term
> $L(W) = \frac{1}{N}\sum\limits_{i}L_{i}(f(x_{i},W),y_{i}) + \lambda R(W)$  
> where $\lambda$ is the regularization strength (hyperparameter)  

- Types of regularization:
  - L2 regularization - weight decay - penalizes the squared Euclidean norm of the weights
    > $R(W) = \sum_{k}\sum_{l}W^{2}_{k,l}$
  - L1 regularization - nice property of encouraging sparsity in matrix W
    > $R(W) = \sum_{k}\sum_{l}|W_{k,l}|$
  - Elastic net (L1 + L2) regularization - combination of L1 and L2
    > $R(W) = \sum_{k}\sum_{l}\beta W^{2}_{k,l} + |W_{k,l}|$
  - Max norm regularization - penalizes the max norm rather than L1 and L2 norm
  - Dropout regularization - specific to deep learning
  - Fancier regularization: Batch normalization, stochastic depth  

- The goal of the regularization term is to penalize the complexity of the model, rather than to fit the training data more closely
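
The three penalty formulas above translate directly into NumPy. The sketch below is illustrative only: the function names and the two example weight vectors are assumptions chosen to show how L2 and L1 rank the same pair of weights differently.

```python
import numpy as np

def l2_penalty(W):
    """R(W) = sum of squared entries."""
    return np.sum(W ** 2)

def l1_penalty(W):
    """R(W) = sum of absolute values of entries."""
    return np.sum(np.abs(W))

def elastic_net_penalty(W, beta):
    """R(W) = sum of beta * W^2 + |W|."""
    return np.sum(beta * W ** 2 + np.abs(W))

# Two weight vectors that give the same score on x = [1, 1, 1, 1].
x = np.ones(4)
w_sparse = np.array([1.0, 0.0, 0.0, 0.0])
w_spread = np.array([0.25, 0.25, 0.25, 0.25])
print(w_sparse @ x, w_spread @ x)                     # same score for both
print(l2_penalty(w_sparse), l2_penalty(w_spread))     # L2 penalties
print(l1_penalty(w_sparse), l1_penalty(w_spread))     # L1 penalties
```

Here L2 assigns the spread-out vector the smaller penalty (0.25 versus 1.0), while L1 rates the two equally (1.0 each), so L1 does not push weight away from the sparse solution.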

#### Softmax Classifier (Multinomial Logistic Regression)

- Multiclass SVM
  - there was no interpretation attached to the scores in the loss function
  - the model f spits out scores for the classes, which didn't actually have much meaning
  - all we cared about was that the score of the correct class is greater than the scores of the incorrect classes by a margin
- Multinomial Logistic Regression
  - in this case, the scores will have meaning
  > Softmax function $P(Y=k|X=x_{i}) = \frac{e^{s_{k}}}{\sum_{j}e^{s_{j}}}$  
  > where scores $s = f(x_{i}; W)$ = unnormalized log probabilities of the classes
  - the softmax probabilities are non-negative and sum to 1
  - We want to maximize the log likelihood, or (for a loss function) minimize the negative log likelihood, of the correct class:
  > $L_{i} = -\log P(Y=y_{i}|X=x_{i})$
  - the target distribution puts all the mass (i.e., probability of 1) on the correct class (e.g., cat) and 0 probability on all other classes
  - the probability distribution computed by the softmax function should match this target distribution that has all the mass on the correct class
    - this can be framed as minimizing the KL divergence between the target and computed distributions
    - or equivalently as a maximum likelihood estimate
  - Goal is for the probability of the true class to be high, as close to 1 as possible
- Loss function will be the -log of the probability of the true class
  > $L_{i} = -\log \frac{e^{s_{y_{i}}}}{\sum_{j}e^{s_{j}}}$
- Calculation steps:
  - compute the scores (unnormalized log probabilities), as above
  - exponentiate them (unnormalized probabilities)
  - normalize so that they sum to 1 (probabilities)
  - take the negative log of the true-class probability (softmax loss, also called multinomial logistic regression loss)
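
The calculation steps can be sketched for one example as follows. The function name `softmax_loss_single` and the scores are illustrative assumptions. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the steps above; it leaves the resulting probabilities unchanged.

```python
import numpy as np

def softmax_loss_single(scores, true_class):
    """Softmax (cross-entropy) loss L_i for one example.

    Returns (loss, probabilities).
    """
    # Stability shift: a constant offset cancels in the normalization.
    shifted = scores - np.max(scores)
    # Exponentiate: unnormalized probabilities (up to the common shift factor).
    unnormalized = np.exp(shifted)
    # Normalize so the values sum to 1.
    probs = unnormalized / unnormalized.sum()
    # Negative (natural) log of the probability of the true class.
    return -np.log(probs[true_class]), probs

# Illustrative scores for three classes; the true class is index 0.
loss, probs = softmax_loss_single(np.array([3.2, 5.1, -1.7]), true_class=0)
print(probs, loss)
```

With these scores the true class gets a probability of roughly 0.13, giving a loss of roughly 2.04 when the natural log is used.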

#### Optimization

- find bottom of the valley
- use an iterative method
- types
  - random search
    - depends on luck
  - follow the slope. 
    - use local geometry: which direction will take me a little bit downhill
    - gradient is the vector of partial derivatives along each dimension
    - slope in any direction is the dot product of the direction with the gradient
    - direction of steepest descent is negative gradient
    - use finite differences
      - adv
        - easy to write
      - disadv
        - approximate
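        - very slow when W is large (one loss evaluation per dimension)
        - in practice, it is never used for training, only for checking
    - instead compute analytic gradient
      - use calculus to write down an expression for dW and compute it in one step, instead of looping over every dimension of W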
        - adv
          - exact, fast
        - disadv
          - error prone
    - in practice, always use analytic gradient, but check implementation with numerical gradient - gradient check
    
<img src = "images/03_finiteMethod1.png" width=400 height=400>  
$\tiny{\text{YouTube-Stanford-CS231n-Justin Johnson}}$   

<img src = "images/03_finiteMethod2.png" width=400 height=400>  
$\tiny{\text{YouTube-Stanford-CS231n-Justin Johnson}}$   
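
A gradient check compares a finite-difference estimate with the analytic gradient. The sketch below is one possible way to do it, not the course's own code: it uses a centered difference (a common, more accurate variant of the one-sided difference) and a toy quadratic loss whose analytic gradient is known. All function names are hypothetical.

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Finite-difference estimate of df/dW, one dimension at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                      # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Toy loss with a known analytic gradient (assumption for illustration).
def toy_loss(W):
    return np.sum(W ** 2)

def toy_analytic_grad(W):
    return 2 * W

W = np.array([[1.0, -2.0], [0.5, 3.0]])
num = numerical_gradient(toy_loss, W)
ana = toy_analytic_grad(W)
# Relative error between the two gradients; small values mean they agree.
rel_error = np.max(np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana)))
print(rel_error)
```

The loop needs two loss evaluations per entry of W, which is why this approach is only practical as a check on small inputs.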

  - gradient descent
    - most used
    - initialize W randomly
    - compute loss and gradient
    - update the weights in the direction opposite to the gradient
      - gradient points in the direction of greatest increase
      - minus gradient points in the direction of greatest decrease
      - take a small step in the direction of minus gradient
      - repeat until it converges
    - step size or learning rate is a hyperparameter
      - tells us how far we step in the direction of the negative gradient
      - step size is usually the first hyperparameter to tune
      - model size and regularization can be dealt with later, but step size should be the primary focus
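
A minimal sketch of the gradient descent loop, under a few assumptions: a fixed number of steps stands in for "repeat until convergence", and a simple bowl-shaped loss stands in for the classifier loss. The names `gradient_descent`, `loss_and_grad` and `bowl` are hypothetical.

```python
import numpy as np

def gradient_descent(loss_and_grad, W_init, step_size=0.1, num_steps=100):
    """Vanilla gradient descent.

    loss_and_grad: hypothetical callable W -> (loss, gradient).
    """
    W = W_init.copy()
    for _ in range(num_steps):
        loss, grad = loss_and_grad(W)
        # Step against the gradient, scaled by the step size (learning rate).
        W -= step_size * grad
    return W

# Toy bowl-shaped loss with its minimum at W = 0 (assumption for illustration).
def bowl(W):
    return np.sum(W ** 2), 2 * W

rng = np.random.default_rng(0)
W_final = gradient_descent(bowl, rng.normal(size=3))
print(W_final)
```

On this toy loss each update shrinks W by a constant factor, so the result ends up very close to zero. With a step size that is too large the same loop would overshoot and diverge, which is one reason the step size is tuned first.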
            

#### Stochastic Gradient Descent (SGD)
> $L(W) = \frac{1}{N}\sum\limits_{i}L_{i}(f(x_{i},W),y_{i}) + \lambda R(W)$  
> $\nabla_{W}L(W) = \frac{1}{N}\sum\limits_{i}\nabla_{W} L_{i}(f(x_{i},W),y_{i}) + \lambda\nabla_{W} R(W)$
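- Vanilla Minibatch Gradient Descent
  - the full sum is expensive when N is large, so approximate it using a minibatch of examples
  - minibatch sizes of 32/64/128 are common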
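
The minibatch idea can be sketched as below. To keep the example short and checkable it uses a noise-free least-squares problem rather than an image classifier, and it leaves out the regularization term; both are assumptions for illustration, as are the names `minibatch_sgd`, `grad_fn` and `least_squares_grad`.

```python
import numpy as np

def minibatch_sgd(grad_fn, W_init, X, y, step_size=0.05, batch_size=32,
                  num_steps=500, seed=0):
    """Vanilla minibatch gradient descent.

    grad_fn(W, X_batch, y_batch) is a hypothetical callable returning the
    gradient of the average loss over the batch with respect to W.
    """
    rng = np.random.default_rng(seed)
    W = W_init.copy()
    for _ in range(num_steps):
        # Sample a minibatch of example indices without replacement.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        # The batch gradient is a noisy estimate of the full-dataset gradient.
        W -= step_size * grad_fn(W, X[idx], y[idx])
    return W

# Toy problem: noise-free least squares with known true weights.
def least_squares_grad(W, X_batch, y_batch):
    residual = X_batch @ W - y_batch
    return 2.0 * X_batch.T @ residual / len(X_batch)

data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true
W_fit = minibatch_sgd(least_squares_grad, np.zeros(2), X, y)
print(W_fit)
```

Each step touches only 32 of the 256 examples, yet the weights still approach `w_true`, because the minibatch gradient points in roughly the same direction as the full gradient on average.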
  

#### Image features
- Feeding raw pixels into linear classifiers does not work too well
- Prior to the popularity of deep neural networks, a two-stage approach was used
- first, take your image and compute various feature representations
- then concatenate these feature vectors to give a feature representation of the image, and feed that to the linear classifier
- the trick is to use the right feature transform for the problem
- example
  - color histogram
    - count how many pixels fall into each bucket
    - tells us what types of color exist in the image
  - histogram of oriented gradients (HoG)
    - dominant edge direction of each pixel
    - compute a histogram over these edge orientations, quantized into buckets
    - tells us what types of edges exist in the image
    - was used for object recognition in the past
  - bag of words (comes from NLP)
    - in NLP, the number of occurrences of each word in a paragraph is counted
    - apply the same concept to images
    - no straightforward analogy between words and images
    - so create your own vocabulary of visual words
    - get sample 
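
As a small illustration of the color histogram feature, the sketch below quantizes each RGB channel into a few buckets and counts pixels per bucket. The bucket layout and the name `color_histogram` are assumptions made for this example; other color spaces and bucket counts are equally valid.

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """Normalized color histogram of an (H, W, 3) uint8 image.

    Returns a vector of length bins_per_channel ** 3.
    """
    # Map each 0..255 channel value to a bucket index 0..bins_per_channel-1.
    quantized = (image.astype(np.int64) * bins_per_channel) // 256
    # Combine the three per-channel indices into one bucket id per pixel.
    bucket = ((quantized[..., 0] * bins_per_channel + quantized[..., 1])
              * bins_per_channel + quantized[..., 2])
    # Count how many pixels fall into each bucket.
    counts = np.bincount(bucket.ravel(), minlength=bins_per_channel ** 3)
    return counts / counts.sum()

# A tiny all-red image: every pixel lands in the same bucket.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[..., 0] = 255
features = color_histogram(image)
print(features.shape, features.argmax())
```

The resulting vector can be concatenated with other feature vectors (for example with `np.concatenate`) before being passed to a linear classifier.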
    
- ConvNets