In [2]:
from sklearn.Linear_

0.7

In [4]:
import numpy as np
1/(1+np.exp(-0.7))

0.6681877721681662

In [5]:
np.tanh(0.7)

0.6043677771171635

## Moving away from Perceptrons : NN

### Why not accuracy as cost function?

* In MNIST, the cost function is the MSE (quadratic term between $y$ and $\hat y$), not the accuracy of the classified images
    $$ C(w, b) = \frac{1}{2n} \sum_x || y(x) - \hat y(x) ||^2 $$
    * $y(x)$ is the target output vector for input $x$, typically containing 1 at the position of the correct digit and 0 elsewhere

* The above cost function is a smooth function of w, b -> small changes in w, b give small changes in cost, so it is easy to see which changes improve it

* With accuracy -> small changes in w, b usually won't cause any change in accuracy at all, so it gives no signal on how to adjust them
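
To make the point above concrete, here is a minimal sketch on a made-up 1-D dataset with a single sigmoid neuron (the data, the starting `w`, `b` and the step `dw` are assumptions chosen only for illustration). A tiny nudge to `w` moves the quadratic cost, while the accuracy stays exactly where it was.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: the label is 1 when x > 0
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def quadratic_cost(w, b):
    y_hat = sigmoid(w * x + b)
    return np.sum((y - y_hat) ** 2) / (2 * len(x))

def accuracy(w, b):
    y_hat = sigmoid(w * x + b)
    return np.mean((y_hat > 0.5) == (y == 1.0))

w, b = 0.5, 0.0
dw = 0.01  # small nudge to the weight

cost_before, cost_after = quadratic_cost(w, b), quadratic_cost(w + dw, b)
acc_before, acc_after = accuracy(w, b), accuracy(w + dw, b)

print("cost    :", cost_before, "->", cost_after)   # changes smoothly
print("accuracy:", acc_before, "->", acc_after)     # does not move
```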


## Gradient Descent

* Conceptually, dividing by the number of examples in the gradient descent step makes little difference, since it’s equivalent to rescaling the learning rate
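
A quick sketch of that equivalence, assuming a one-parameter least-squares problem on synthetic data (the data and the values of `eta` and the step count are made up). Averaging the gradient with learning rate `eta` gives the same trajectory as summing the gradient with learning rate `eta / n`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)
n = len(x)

def grad_sum(w):
    # gradient of 0.5 * sum((w*x - y)^2) w.r.t. w, NOT divided by n
    return np.sum((w * x - y) * x)

eta = 0.1
w_avg, w_sum = 0.0, 0.0
for _ in range(20):
    w_avg -= eta * (grad_sum(w_avg) / n)    # averaged gradient, rate eta
    w_sum -= (eta / n) * grad_sum(w_sum)    # summed gradient, rate eta / n

print(w_avg, w_sum)
```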


## Simple to Abstract Layers : Breaking down questions

* The end result is a network which breaks down a very complicated question – does this image show a face or not – into very simple questions answerable at the level of single pixels.

* It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. 

* Networks with this kind of many-layer structure – two or more hidden layers – are called deep neural networks

## Why Cross-Entropy & Not MSE:

* Basic Equations
    * Quadratic Loss Function : $ C(w, b) = \frac{1}{2n} \sum (y - \hat y)^2 $

    * Cross-Entropy Loss Function : $ C(w, b) = -\frac{1}{n} \sum [y \log(\hat y) + (1 - y) \log(1 - \hat y)] $

    * where $ \hat y = \sigma(z) $ and $ z = w x + b $

* Intuition
    * We want the error to be very high whenever the network misclassifies.

    * We often learn faster when we are badly wrong -> learning is slow in networks when the partial derivatives are small -> we want the partial derivatives to be large when the error is large

* **Case a - Quadratic loss**
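    * For a single training example: $$ \frac{\partial C}{\partial w} = (\hat y - y)*{\hat y^\prime}, \quad \hat y^\prime = \frac{\partial \hat y}{\partial w} = \sigma^\prime(w x + b) \, x $$
    * The gradient depends on the derivative of the sigmoid -> which is very small over most of its input range (wherever the neuron is saturated) -> even when the error term is large, it is scaled down by $\sigma^\prime$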

* **Case b - Cross entropy loss**
    * CE loss has 2 properties:
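        * It is non-negative -> $\hat y$ and $1 - \hat y$ lie in 0-1, so their logs are always negative -> the leading minus sign makes the cost always positive
        * For a correct prediction, the CE tends towards 0. Both properties also hold for the quad loss.
    * $$ \frac{\partial C}{\partial w} = \frac{1}{n} \sum (\hat y - y)*x $$ 
    * Here the $\sigma^\prime$ factor cancels out, so the slowdown caused by the small sigmoid gradient is not there: the larger the error, the faster the learning.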

* **Verdict**
    * Removing the sigmoid derivative from the gradient avoids the learning slowdown
    * The learning curve typically drops more steeply when CE loss is used.
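
The two cases above can be compared numerically. This is a minimal sketch for a single sigmoid neuron that starts out badly wrong; the values `x = 1`, `y = 0`, `w = 2`, `b = 2` are assumptions picked so that the neuron is saturated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.0          # target is 0 ...
w, b = 2.0, 2.0          # ... but the neuron outputs ~0.98 (badly wrong)

z = w * x + b
y_hat = sigmoid(z)
sigma_prime = y_hat * (1.0 - y_hat)   # derivative of the sigmoid at z

# Case a: quadratic loss, gradient carries the sigma'(z) factor
grad_quadratic = (y_hat - y) * sigma_prime * x

# Case b: cross-entropy loss, sigma'(z) has cancelled out
grad_cross_entropy = (y_hat - y) * x

print("y_hat              :", y_hat)
print("grad, quadratic    :", grad_quadratic)
print("grad, cross-entropy:", grad_cross_entropy)
```

Both gradients point in the same direction, but the cross-entropy gradient is far larger here because it is not multiplied by the small `sigma_prime`.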


### Quadratic Cost for regression problems 

* For regression, the output activation can be linear rather than sigmoid

* Hence, the derivative when using quad loss becomes
    * $ \frac{\partial C}{\partial w} = (\hat y - y)*{\hat y^\prime} $, where $\hat y^\prime$ is the derivative of $\hat y$ with respect to $w$

    * If linear, $\hat y = w x + b$ so $\hat y^\prime = x$ -> $ \frac{\partial C}{\partial w} = (\hat y - y)*{x} $

    * It doesn't depend on the derivative of the sigmoid, which slows learning. It depends only on x and the error -> no saturation -> learning can be faster -> quad loss can still be used

* If the output neurons are linear neurons, then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.
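
As a sanity check of the linear-output gradient above, here is a small sketch that compares the analytic expression with a finite-difference estimate. The numbers for `x`, `y`, `w`, `b` are arbitrary assumptions.

```python
import numpy as np

x, y = 1.5, 2.0
w, b = -3.0, 0.5           # deliberately bad starting point

def cost(w, b):
    y_hat = w * x + b      # linear output neuron, no sigmoid
    return 0.5 * (y - y_hat) ** 2

y_hat = w * x + b
grad_analytic = (y_hat - y) * x   # (y_hat - y) * x, no sigma' factor

# central finite difference as an independent check
eps = 1e-6
grad_numeric = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)

print(grad_analytic, grad_numeric)
```

The gradient grows with the error itself, so a badly wrong linear neuron still learns quickly under the quadratic cost.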




## Softmax Layer with Log-Likelihood Cost

* Softmax is an activation which turns the outputs into probabilities: each lies between 0 and 1 and they sum to 1
    * $ a_j = \frac{e^{z_j}}{\sum_k e^{z_k}} $

* Log-Likelihood Cost $ C = -\ln(a_y) $, where $a_y$ is the softmax output for the correct class $y$
    * If the network is confident in the correct class -> $a_y$ is close to 1 -> C near zero
    * If the network gives the correct class a low probability -> $a_y$ is close to 0 -> C very high
    * Similar intuition to cross entropy

* We can show that when using this loss with a softmax output layer, the derivatives w.r.t. w, b have the same form as for CE with sigmoid activation

* Implying no learning slowdown

* **Verdict**
    * Use CE loss with sigmoid activations
    * Use Log Likelihood loss with softmax activations
    * Also use softmax, when you want output probabilities
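
A minimal NumPy sketch of softmax with the log-likelihood cost, assuming three output neurons with made-up weighted inputs `z`. It also shows the gradient of the cost with respect to `z`, which is simply `a - one_hot(y)` and has no sigmoid-derivative factor.

```python
import numpy as np

def softmax(z):
    # subtracting the max is only for numerical stability; result is unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])   # hypothetical weighted inputs of the output layer
a = softmax(z)                   # probabilities, sum to 1

cost_if_class0 = -np.log(a[0])   # correct class got the highest probability
cost_if_class2 = -np.log(a[2])   # correct class got the lowest probability

# gradient of C = -ln(a_y) w.r.t. z, for correct class y = 0
y = 0
grad_z = a.copy()
grad_z[y] -= 1.0

print("probabilities:", a)
print("cost (correct = 0):", cost_if_class0)
print("cost (correct = 2):", cost_if_class2)
print("dC/dz:", grad_z)
```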
    

## Sparse Categorical Entropy:

* If the number of classes is high, to avoid converting the target to a one-hot encoding -> we can use sparse categorical cross entropy rather than categorical cross entropy

* Implies we can use softmax as the output activation & keep the target as the integer index of the output class (no need to convert with to_categorical)

* In calculating the sparse categorical cross entropy loss, ensure to give the correct ordering : loss(y, yhat)
    * where yhat is a matrix of probabilities, each row containing the probability of each class
    * y is the target class vector containing the integer index of the target class
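
To show that integer targets and one-hot targets give the same loss, here is a plain NumPy sketch. The two functions are illustrative re-implementations written for these notes (they only borrow the familiar loss names and are not a library's code), and the probabilities in `y_hat` are made up.

```python
import numpy as np

# y_hat: each row holds the predicted class probabilities (rows sum to 1)
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
y = np.array([0, 1, 2])   # integer class labels, not one-hot

def sparse_categorical_crossentropy(y, y_hat):
    # pick the predicted probability of the true class in each row
    p_true = y_hat[np.arange(len(y)), y]
    return -np.mean(np.log(p_true))

def categorical_crossentropy(y_onehot, y_hat):
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

y_onehot = np.eye(3)[y]   # what a one-hot conversion would produce
loss_sparse = sparse_categorical_crossentropy(y, y_hat)
loss_onehot = categorical_crossentropy(y_onehot, y_hat)

print(loss_sparse, loss_onehot)
```

The sparse version never builds the one-hot matrix, which is what saves memory when the number of classes is large.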

## Overfitting

* "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk" (attributed to John von Neumann)

* The more parameters, the more flexibility to learn and fit unnecessary noise and patterns

* Prevent Overfitting:
    * More data : In general, one of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit.

    * Regularization

        *  $ C(w, b) = -\frac{1}{n} \sum [y \log(\hat y) + (1 - y) \log(1 - \hat y)] + \frac{\lambda}{2n} \sum w^2$ 

        *  $ C(w, b) = C_0 + \frac{\lambda}{2n} \sum w^2$ 
        * where $C_0$ is the unregularized cost function, which can be either the cross-entropy or the MSE (quad) loss function

        * the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal.
        * Large weights will only be allowed if they considerably improve the first part of the cost function.

        * $\lambda$ small -> prefer to minimize the original cost function
        *  $\lambda$ large -> prefer small weights
        
        * $w = (1 - \frac{\eta \lambda}{n}) w - \eta \frac{\partial C_0}{\partial w} $

        * The factor $(1 - \frac{\eta \lambda}{n})$ in the first term is called weight decay. It doesn't always mean the weights are decreasing -> the other term can cause a weight to increase.

        * The weight decay factor depends on the number of training examples n -> with a dataset of a different size the same $\lambda$ gives a different amount of decay -> we need to change $\lambda$ to keep the required regularization effect
        * Regularization -> small weights -> the network's output won't change much when a few random inputs change -> difficult to learn local noise effects -> it responds to evidence that is seen often -> think of fitting a straight line vs a wiggly line: when do you expect the straight line to change course?
        * Regularized models learn patterns seen often in the training data and are resistant to learning noise
        * Rather than saying regularization gives simpler models -> say regularization provides better generalization

        * L2 Vs L1 :
            * L2 : $w = (1 - \frac{\eta \lambda}{n}) w - \eta \frac{\partial C_0}{\partial w} $

            * L1 : $w = w - \frac{\eta \lambda}{n} \mathrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w} $

            * In L2 -> weights shrink by an amount proportional to w

            * In L1 -> weights shrink by a constant amount towards 0

            * When |w| is large -> L2 shrinks the weight much more than L1

            * When |w| is small -> L1 shrinks the weight much more than L2

            * In general, L1 concentrates the weight of the network in a small number of important connections, while the others are pushed towards zero

            * In L1, the derivative of |w| is not defined at 0 -> use the convention sgn(0) = 0, i.e. at w = 0 the unregularized update rule is used

    * Dropout

        * Randomly (and temporarily) disconnect half of the hidden neurons -> forward-prop input x through the modified network -> backprop the result also through the modified network.
        
        * Update w, b
        * Repeat the process -> first restore the dropped-out neurons -> choose a new random subset of hidden neurons to delete -> forward-prop, backprop -> update w, b
        * Finally, when run on the full network -> twice as many hidden neurons are active -> we halve the weights outgoing from the hidden neurons

        * Why dropout works
            * Consider averaging the outputs of different NNs -> better results -> reduces overfitting. Dropout is like fitting different NNs and averaging their results
            * Avoids relying on particular neurons -> learns robust features that work with many random subsets of neurons

    * Data Augmentation
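
Going back to the L2 vs L1 update rules in the regularization notes above, here is a small sketch of how much each rule shrinks a weight in one step. The values of `eta`, `lam` and `n` are made-up assumptions, and the gradient of the unregularized cost is set to 0 so that only the regularization term acts.

```python
import numpy as np

eta, lam, n = 0.5, 0.1, 10      # hypothetical hyperparameters
shrink = eta * lam / n          # eta * lambda / n = 0.005

def l2_step(w, grad_c0=0.0):
    return (1 - shrink) * w - eta * grad_c0

def l1_step(w, grad_c0=0.0):
    # np.sign(0) == 0, so a weight at exactly 0 gets the unregularized update
    return w - shrink * np.sign(w) - eta * grad_c0

w = np.array([-5.0, -0.001, 0.0, 0.001, 5.0])
l2_change = np.abs(w - l2_step(w))   # proportional to |w|
l1_change = np.abs(w - l1_step(w))   # same constant for every nonzero weight

print("w        :", w)
print("L2 change:", l2_change)
print("L1 change:", l1_change)
```

For the large weights the L2 change dominates, while for the tiny weights the constant L1 change is much bigger, which is why L1 tends to drive small weights to zero.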
            

## Weight Initialization

* Random Normal Initialization:
    * Assume a case with 1000 inputs where 500 are zero and 500 are one, and w, b are drawn from N(0, 1)

    * The weighted sum z = sum(w*x) + b then has 501 non-zero terms (500 weights on the active inputs, plus b); the terms with x = 0 drop out

    * z is the sum of those 501 terms, so $z \sim N(0, \sqrt{501})$ (std dev about 22.4). As the std dev is large -> |z| >> 1 is very likely -> sigmoid(z) is squished towards 0 or 1

    * Implies a small change in w barely changes the squished sigmoid output -> slower learning even with cross entropy -> saturated neurons

    * If the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to 0 or 1, and learning will proceed very slowly.

* Alternate Initialization:
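    * Initialize w as $\sim N(0, \frac{1}{\sqrt{n_{weights}}})$, where the second argument is the std dev and $n_{weights}$ is the number of input weights of the neuron; b stays $\sim N(0, 1)$

    * For the example above this gives $z \sim N(0, \sqrt{\frac{3}{2}})$ (variance 500/1000 + 1 = 3/2). This is much more peaked -> less likely to saturate the sigmoids

* The new approach to weight initialization starts us off in a much better regime in the learning curves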
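
The two initialization schemes can be compared with a quick simulation. This is a sketch under the assumptions of the example above (1000 inputs, half of them 1 and half 0, bias drawn from N(0, 1)); the trial count and the saturation threshold `|z| > 4` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
x = np.zeros(n_in)
x[:500] = 1.0                  # half the inputs are 1, half are 0
trials = 2000                  # number of random initializations to sample

def sample_z(weight_std):
    # draw fresh weights and bias for every trial, then compute z = w.x + b
    w = rng.normal(0.0, weight_std, size=(trials, n_in))
    b = rng.normal(0.0, 1.0, size=trials)
    return w @ x + b

z_plain = sample_z(1.0)                    # w ~ N(0, 1)
z_scaled = sample_z(1.0 / np.sqrt(n_in))   # w ~ N(0, 1/sqrt(n_in))

# fraction of trials where the sigmoid would be saturated
frac_plain = np.mean(np.abs(z_plain) > 4)
frac_scaled = np.mean(np.abs(z_scaled) > 4)

print("std of z, plain :", z_plain.std(), "(theory:", np.sqrt(501), ")")
print("std of z, scaled:", z_scaled.std(), "(theory:", np.sqrt(1.5), ")")
print("saturated fraction:", frac_plain, "vs", frac_scaled)
```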

## Mini-Batch

* choosing the best mini-batch size is a compromise. 

* Too small, you don’t get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. 

* Too large and you’re simply not updating your weights often enough.

In [11]:
# a = [2, 3, 1]
# print(f"x : {a[:-1]}")
# print(f"y : {a[1:]}")
import numpy as np
np.log(0.4)

-0.916290731874155

In [9]:
np.arange(0, 20, 5)

array([ 0,  5, 10, 15])

In [10]:
a = np.arange(1, 21)
batch = 5
[a[k:k+batch] for k in np.arange(0, 20, 5)]

[array([1, 2, 3, 4, 5]),
 array([ 6,  7,  8,  9, 10]),
 array([11, 12, 13, 14, 15]),
 array([16, 17, 18, 19, 20])]