# Machine learning setup

* quickly iterating over ideas, code and experiments
* experimental design important (train/dev/test splits & eval)
    * when the number of observations is ~ 10,000, traditional splits work (70/30 train/test or 60/20/20 train/dev/test)
    * when the number of observations is ~ 10^6, dev/test sets of ~ 10,000 examples each are enough (roughly 98/1/1)
    * train-dev-test distributions might be mismatched
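
The two split regimes above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: `split_indices` and its parameters are hypothetical names, and the exact set sizes are assumptions chosen to match the ballpark figures in the list.

```python
import numpy as np

def split_indices(n, dev_size, test_size, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    dev = idx[:dev_size]
    test = idx[dev_size:dev_size + test_size]
    train = idx[dev_size + test_size:]
    return train, dev, test

# Small dataset: classic 60/20/20 split.
train, dev, test = split_indices(10_000, dev_size=2_000, test_size=2_000)
print(len(train), len(dev), len(test))  # 6000 2000 2000

# Large dataset: fixed-size dev/test sets of 10,000 each (98/1/1).
train, dev, test = split_indices(1_000_000, dev_size=10_000, test_size=10_000)
print(len(train), len(dev), len(test))  # 980000 10000 10000
```

Shuffling before cutting assumes all examples come from one distribution; with mismatched train and dev/test distributions the dev and test sets would need to be drawn from the target distribution instead.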

**Bias/Variance**

* high bias (underfitting the data), just right, high variance (overfitting the data)
* train set error vs dev set error
    * train set error high, dev set error high -> underfitting (high bias)
    * train set error low, dev set error high -> overfitting (high variance)
    * train set error high, dev set error very high -> high bias and high variance: the decision boundary is wrong overall yet overly detailed in places (!)
    * the optimistic case: train set error low, dev set error low -> low bias and low variance
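
The four cases above can be written as a small decision rule. This is only a sketch: the function name `diagnose`, the `base_err` reference error, and the `tol` threshold for what counts as a "high" gap are all assumptions made for illustration, not values from these notes.

```python
def diagnose(train_err, dev_err, base_err=0.0, tol=0.02):
    """Label a model from its train/dev errors (fractions in [0, 1]).

    base_err: assumed best achievable error (hypothetical parameter).
    tol: assumed gap that counts as "high" (hypothetical threshold).
    """
    high_bias = (train_err - base_err) > tol      # train error far from the reference
    high_variance = (dev_err - train_err) > tol   # dev error far from train error
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias and low variance"

print(diagnose(0.15, 0.16))   # high bias (underfitting)
print(diagnose(0.01, 0.11))   # high variance (overfitting)
print(diagnose(0.15, 0.30))   # high bias and high variance
print(diagnose(0.005, 0.01))  # low bias and low variance
```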

**The recipe**

* Does the algorithm have high bias? (train performance)
    * bigger network
    * train longer
    * use different optimization algorithms
    * different architecture
    * experiment until acceptable bias achieved

* Does the algorithm have high variance? (dev performance)
    * more data
    * regularization
    * different architecture
    * experiment until acceptable variance achieved

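Bias-variance trade-off: in the previous ML era, reducing one usually meant increasing the other. For NNs this trade-off is much weaker, since a bigger network (with regularization) reduces bias and more data reduces variance, largely without hurting the other.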

# Regularization

**L2**  

* helps with high variance, hyperparameter $\lambda$ (strength of the regularization)
* for logreg cost function $J(w,b)$ in regularization scenarios with L2 reg $J(w,b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\big\| w\big\|_2^2$
* bias term usually omitted
* L2 reg -> $\big\| w\big\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$, most commonly used
* L1 reg -> $\big\| w\big\|_1 = \sum_{j=1}^{n_x}|w_j|$, making model sparse
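
A minimal NumPy sketch of the L2-regularized logistic regression cost above. The function name, the array shapes, and the small `eps` guard inside the logarithms are assumptions for illustration; the penalty follows the $\frac{\lambda}{2m}\| w\|_2^2$ term and leaves the bias out.

```python
import numpy as np

def l2_logreg_cost(w, b, X, y, lam):
    """Cross-entropy cost with an L2 penalty on w (b is not penalized).

    X: (n_x, m) inputs, y: (m,) labels in {0, 1}, w: (n_x,), lam: lambda.
    """
    m = X.shape[1]
    z = w @ X + b
    y_hat = 1.0 / (1.0 + np.exp(-z))  # sigmoid
    eps = 1e-12                       # guards against log(0)
    data_loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    penalty = lam / (2 * m) * np.dot(w, w)  # ||w||_2^2 = w^T w
    return data_loss + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))
y = (rng.random(50) > 0.5).astype(float)
w = np.array([0.5, -1.0, 2.0])

cost_plain = l2_logreg_cost(w, 0.1, X, y, lam=0.0)
cost_reg = l2_logreg_cost(w, 0.1, X, y, lam=1.0)
print(cost_reg - cost_plain)  # equals lam / (2m) * ||w||^2
```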

* in neural net cost function $J(w^{[1]},b^{[1]}, ..., w^{[K]},b^{[K]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{K}\big\| w^{[l]}\big\|_F^2$, where $K$ is the number of layers (the penalty sums over all layers)
* $\big\| w^{[l]}\big\|_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{i,j}^{[l]})^2$, where $i$ indexes the neurons in the current layer and $j$ the neurons in the previous layer (matching the $n^{[l]} \times n^{[l-1]}$ shape of the $w^{[l]}$ matrix); this is the squared Frobenius norm
* in backprop $dW^{[l]} = (from \ bp) + \frac{\lambda}{m}W^{[l]}$
* update params $W^{[l]} = W^{[l]} - \frac{\alpha\lambda}{m}W^{[l]} - \alpha (from \ bp)$
* where the first two terms combine to $(1-\frac{\alpha\lambda}{m})W^{[l]}$, thus L2 is also called weight decay (each step shrinks the weight matrices by a factor slightly below 1)
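
To see why the L2 update is the same thing as weight decay, the sketch below computes one step both ways and compares them. The function names and the random test matrices are made up for illustration; `dW_backprop` stands for the "(from bp)" term.

```python
import numpy as np

def l2_update(W, dW_backprop, alpha, lam, m):
    """One gradient step with the L2 term added to the backprop gradient."""
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW

def weight_decay_update(W, dW_backprop, alpha, lam, m):
    """Same step, written as: shrink W, then apply the backprop gradient."""
    return (1 - alpha * lam / m) * W - alpha * dW_backprop

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
dW_bp = rng.normal(size=(4, 3))

step_a = l2_update(W, dW_bp, alpha=0.1, lam=0.7, m=32)
step_b = weight_decay_update(W, dW_bp, alpha=0.1, lam=0.7, m=32)
print(np.allclose(step_a, step_b))  # True
```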

* setting $\lambda$ to large values pushes weights toward zero, reducing the impact of many hidden units (effectively removing them) and thus making the network simpler
* another intuition: in a network with tanh activations, forcing the weights to be small keeps each unit's input in the roughly linear region of tanh, so the whole network behaves closer to a linear function (a NN without non-linear activations can only represent linear functions)
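
A quick numeric check of the tanh intuition: for small inputs tanh is almost the identity, for large inputs it is not. The input ranges are arbitrary values picked for illustration.

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)  # pre-activations when weights are small
z_large = np.linspace(-3.0, 3.0, 5)  # pre-activations when weights are large

err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))
print(err_small)  # tiny: tanh(z) is close to z here
print(err_large)  # large: tanh saturates, far from linear
```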

**Dropout**

* setting probability of removing a node from a layer, resulting in simplified version of the network
* a temporary thinned architecture is sampled at random for each training example
* step-by-step forward pass of "inverted dropout"
    * set probability for keeping the unit
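    * generate an array of uniform random numbers in [0, 1) with the shape of the activations; the mask is 1 if random < keep prob, 0 otherwise (so each unit is kept with probability keep prob)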
    * multiply activation result with the boolean vector
    * scale up the total result of activation back (divide by keep prob) -> "inverted dropout" to keep comparable expected values
* used only during training, not in inference part
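
The forward-pass steps above can be sketched in NumPy as follows. The function name and the `training` flag are assumptions for illustration; the mask keeps a unit when its random draw falls below `keep_prob`, and the division by `keep_prob` is the "inverted" part.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations `a`.

    At inference time (training=False) the activations pass through unchanged.
    """
    if not training or keep_prob >= 1.0:
        return a
    mask = rng.random(a.shape) < keep_prob  # True with probability keep_prob
    return a * mask / keep_prob             # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out = inverted_dropout(a, keep_prob=0.8, rng=rng)
print(out.mean())      # close to 1.0, the mean of the original activations
print(np.unique(out))  # kept units are scaled up to about 1.25, dropped units are 0
```

Because the scaling happens during training, no extra rescaling is needed at inference time.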

* dropout forces the network to be more robust (it cannot rely on any single feature), effectively spreading out the weights (an effect similar to L2 regularization!)
* keep prob can be set per layer, e.g. lower for layers with larger weight matrices (which are more prone to overfitting)

**Others**

* data augmentation - e.g. flipping, rotating, cropping and applying other transformations to images
* early stopping - stopping when dev error stops improving; not conceptually clean, since it couples two problems (optimizing the cost and avoiding overfitting) into one knob






# Optimization

* input normalization
    * z-scaling, minmax
    * improves symmetry in the optimization problem
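
A minimal z-scaling sketch for the input normalization above. The function names and the synthetic two-feature data are assumptions for illustration; the one point worth showing is that the mean and standard deviation are estimated on the training set and then reused for dev/test data.

```python
import numpy as np

def zscore_fit(X_train):
    """Estimate per-feature mean and std on the training set (rows = examples)."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8  # small constant avoids division by zero
    return mu, sigma

def zscore_apply(X, mu, sigma):
    """Normalize any split with the statistics from the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales.
X_train = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(1000, 2))
X_dev = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(200, 2))

mu, sigma = zscore_fit(X_train)
X_train_n = zscore_apply(X_train, mu, sigma)
X_dev_n = zscore_apply(X_dev, mu, sigma)  # reuse train statistics
print(X_train_n.mean(axis=0).round(3), X_train_n.std(axis=0).round(3))
```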

* vanishing/exploding gradients
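    * in deep nets, activations and gradients are products of many weight matrices, so they can vanish or explode exponentially with depth
    * they vanish if the weights are slightly smaller than the identity (e.g. $W = 0.9I$) and explode if slightly larger (e.g. $W = 1.1I$), where $I$ is the identity matrix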
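
A toy illustration of the exponential effect, assuming a deep linear network in which every layer uses the same scaled identity matrix. The depth of 100 and the scales 0.9 and 1.1 are arbitrary values chosen for the demo.

```python
import numpy as np

def deep_linear_output(scale, n_layers, n_units=4):
    """Push a vector of ones through n_layers copies of W = scale * I."""
    W = scale * np.eye(n_units)
    a = np.ones(n_units)
    for _ in range(n_layers):
        a = W @ a
    return a

small = deep_linear_output(0.9, 100)[0]
large = deep_linear_output(1.1, 100)[0]
print(small)  # 0.9**100, about 2.7e-5 (vanishing)
print(large)  # 1.1**100, about 1.4e4 (exploding)
```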

* weight initialization
    * a lot of inputs -> smaller weights are good
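    * for relu, setting the weight variance to $\frac{2}{n}$ helps (He initialization); for hidden layers, $n$ is the number of units in the previous layer
    * for tanh, we set the variance to $\frac{1}{n}$ (Xavier initialization)
    * other variants exist, e.g. variance $\frac{2}{n^{[l-1]}+n^{[l]}}$; the variance scale can also be tuned as a hyperparameter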
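
The variance rules above amount to scaling standard normal draws by the square root of the target variance. A minimal sketch, where the function name and layer sizes are made up for illustration:

```python
import numpy as np

def init_weights(n_prev, n_curr, activation, rng):
    """Draw an (n_curr, n_prev) weight matrix with activation-dependent variance."""
    if activation == "relu":
        var = 2.0 / n_prev
    elif activation == "tanh":
        var = 1.0 / n_prev
    else:
        raise ValueError(f"unsupported activation: {activation}")
    return rng.normal(size=(n_curr, n_prev)) * np.sqrt(var)

rng = np.random.default_rng(0)
W_relu = init_weights(1000, 500, "relu", rng)
W_tanh = init_weights(1000, 500, "tanh", rng)
print(W_relu.var(), W_tanh.var())  # close to 0.002 and 0.001
```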

* numerical approximation of gradients
    * use a two-sided difference to check the derivative numerically; it is more accurate than the classic one-sided difference (error on the order of $\epsilon^2$ instead of $\epsilon$)
    * $\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon} \approx g(\theta)$
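
A small comparison of the one-sided and two-sided approximations, using $f(\theta) = \theta^3$ with known derivative $g(\theta) = 3\theta^2$. The test function, $\theta = 1$ and $\epsilon = 0.01$ are example values chosen for illustration.

```python
def one_sided(f, theta, eps):
    """Forward difference: error shrinks linearly with eps."""
    return (f(theta + eps) - f(theta)) / eps

def two_sided(f, theta, eps):
    """Central difference: error shrinks quadratically with eps."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def f(t):
    return t ** 3

def g(t):
    return 3 * t ** 2  # analytic derivative of f

theta, eps = 1.0, 0.01
err_one = abs(one_sided(f, theta, eps) - g(theta))
err_two = abs(two_sided(f, theta, eps) - g(theta))
print(err_one)  # about 0.03
print(err_two)  # about 0.0001
```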

* gradient checking
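    * helps verify the backprop implementation
    * reshape and concatenate every parameter ($W^{[l]}$, $b^{[l]}$) into one large vector $\theta$
    * reshape and concatenate the gradients ($dW^{[l]}$, $db^{[l]}$) into one large vector $d\theta$
    * for each component $\theta_i$ get:
        * $d\theta_{approx} [i] = \frac{J(\theta_1, \theta_2, ... , \theta_i +\epsilon, ...)-J(\theta_1, \theta_2, ... , \theta_i -\epsilon, ...)}{2\epsilon} \approx d\theta[i] = \frac{\partial J}{\partial \theta_i}$
    * compute the normalized Euclidean distance between the approximate and backprop gradients, $\frac{\| d\theta_{approx} - d\theta \|_2}{\| d\theta_{approx} \|_2 + \| d\theta \|_2}$; with $\epsilon$ on the scale of $10^{-7}$, a distance around $10^{-7}$ is great, around $10^{-5}$ is acceptable but worth a closer look, around $10^{-3}$ or larger is bad (likely a bug)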
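
The procedure above can be sketched on a toy cost function. Everything here is illustrative: the function name `grad_check`, the toy cost `J`, and the deliberately buggy gradient are made up, and dividing the Euclidean distance by the sum of the two gradient norms is an assumption that makes the thresholds scale-independent.

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-7):
    """Return the normalized distance between numeric and analytic gradients."""
    d_theta = grad_J(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy()
        plus[i] += eps
        minus = theta.copy()
        minus[i] -= eps
        d_theta_approx[i] = (J(plus) - J(minus)) / (2 * eps)  # two-sided difference
    num = np.linalg.norm(d_theta_approx - d_theta)
    den = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return num / den

def J(t):
    return np.sum(t ** 2) + np.sum(np.sin(t))

def grad_ok(t):
    return 2 * t + np.cos(t)  # correct gradient of J

def grad_bug(t):
    return 2 * t - np.cos(t)  # sign error, stands in for a backprop bug

theta = np.array([0.5, -1.2, 2.0, 0.3])
dist_ok = grad_check(J, grad_ok, theta)
dist_bug = grad_check(J, grad_bug, theta)
print(dist_ok)   # very small: the implementation matches
print(dist_bug)  # far above 1e-3: the check flags the bug
```

The loop evaluates `J` twice per parameter, which is why the check is only practical for debugging.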

* gradient checking implementation notes
    * use only to debug, not during training (computationally costly)
    * if grad check fails, inspect the individual components (particular derivatives, layers) to locate the bug
    * remember to include the regularization term in both the cost and the gradients
    * does not work with dropout (turn off dropout for debug, set keep_prob=1.0)
    * check after initialization and after some training