## 1.1 Training set, validation set and test set

- **Training set**: used to fit the model parameters. A common choice is 60% of the whole data set.
- **Validation set** (also called the development set): used to select hyper-parameters (i.e., to choose among candidate models) and to avoid overfitting. A common choice is 20% of the whole data set.
- **Test set**: used to report the performance of the final trained model. A common choice is 20% of the whole data set.

**Note**:
- A **validation set** is not strictly required.
- **Validation set** and **Test set** should follow the **same distribution**.
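
The 60/20/20 split described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `split_dataset`, its parameters, and the toy arrays are hypothetical, and samples are assumed to lie along the first axis. Shuffling once before cutting is one simple way to keep the validation and test sets on the same distribution.

```python
import numpy as np

def split_dataset(X, y, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test parts.

    Assumes samples are stored along axis 0 of X and y.
    Whatever is left after the train and validation parts becomes the test set.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    order = rng.permutation(n)              # one shared shuffle for all three parts
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Toy data: 100 samples with 2 features each (hypothetical).
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```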

reference:  
https://stackoverflow.com/questions/2976452/whats-is-the-difference-between-train-validation-and-test-set-in-neural-netwo

## 1.2 Bias and variance tradeoff
- **Total Error = Bias² + Variance + Irreducible Error**
- High bias: under-fitting
- High variance: over-fitting

In deep learning, the bias-variance tradeoff is discussed less often.

### Bias and variance illustration
Click for illustration: https://github.com/gaoisbest/Machine-Learning-and-Deep-Learning-basic-concepts-and-sample-codes/blob/master/Andrew_Ng_images/Class_2_week_1/Bias_variance.png

reference:  
http://scott.fortmann-roe.com/docs/BiasVariance.html


### The relationship between model complexity and error

Click for illustration: https://github.com/gaoisbest/Machine-Learning-and-Deep-Learning-basic-concepts-and-sample-codes/blob/master/Andrew_Ng_images/Class_2_week_1/Error_complexity.png

#### How to detect high bias?
- Training error is high.
- Validation error has **similar magnitude** to training error.

#### How to detect high variance?
- Training error is low.
- Validation error is **very high**.
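
The two checks above can be turned into a rough rule of thumb. The sketch below is only an illustration: the function `diagnose`, the `target_error` baseline, and the `gap_tol` threshold are hypothetical choices, and a sensible threshold depends on the problem.

```python
def diagnose(train_error, val_error, target_error=0.0, gap_tol=0.05):
    """Label a model from its training and validation errors.

    target_error (the error considered achievable) and gap_tol (how large a gap
    counts as "high") are hypothetical knobs, not values from the text.
    """
    # High bias: the training error itself is far from the achievable error.
    high_bias = (train_error - target_error) > gap_tol
    # High variance: the validation error is much worse than the training error.
    high_variance = (val_error - train_error) > gap_tol
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "ok"

print(diagnose(0.15, 0.16))  # training error high, validation error similar
print(diagnose(0.01, 0.11))  # training error low, validation error much higher
```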

### How to solve the problem?
Click for illustration: https://github.com/gaoisbest/Machine-Learning-and-Deep-Learning-basic-concepts-and-sample-codes/blob/master/Andrew_Ng_images/Class_2_week_1/Under_over_fitting_solution.png

reference:  
http://www.learnopencv.com/bias-variance-tradeoff-in-machine-learning/

## 1.3 Reduce over-fitting
### 1.3.1: Regularization

**Forward propagation**: 
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} $$

To calculate $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$, use:
```python
np.sum(np.square(Wl))
```

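**Backward propagation**: 
Only the weight gradients (here `dW1` and `dW2`) change; the bias gradients are unaffected because the biases are not regularized.
To each weight gradient, add the gradient of the regularization term: $\frac{d}{dW} \left( \frac{1}{2}\frac{\lambda}{m} W^2 \right) = \frac{\lambda}{m} W$.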

**Note**: 
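- The scaling factor of the regularization cost is `1.0 / m * lambd / 2.0`. `lambd` is a hyperparameter (spelled this way because `lambda` is a reserved keyword in Python).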
- L2 regularization makes the decision boundary smoother.
- Weights end up smaller ("weight decay").
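
One way to see where the name "weight decay" comes from is to substitute the regularized gradient into a plain gradient-descent step. The learning rate $\alpha$ and the subscript "cross-entropy" (the gradient of the unregularized cost) are notation assumed here for illustration:

$$W^{[l]} := W^{[l]} - \alpha \left( dW^{[l]}_{\text{cross-entropy}} + \frac{\lambda}{m} W^{[l]} \right) = \left( 1 - \frac{\alpha \lambda}{m} \right) W^{[l]} - \alpha \, dW^{[l]}_{\text{cross-entropy}}$$

When $0 < \frac{\alpha \lambda}{m} < 1$, every update first shrinks the weights by the factor $1 - \frac{\alpha \lambda}{m}$ and then applies the usual gradient step.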

```python
import numpy as np

# Forward propagation computes the cost.
# Assume a neural network with a single hidden layer:
# W1 maps the input X to the hidden layer, W2 maps the hidden layer to the output.
# m, lambd, cross_entropy_cost, X, Y, A1 and A2 are assumed to come from the forward pass.
L2_regularization_cost = 1.0 / m * lambd / 2.0 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
cost = cross_entropy_cost + L2_regularization_cost

# Back propagation computes the gradients.
dZ2 = A2 - Y  # gradient of the cross-entropy loss w.r.t. Z2 (sigmoid output)
dW2 = 1. / m * np.dot(dZ2, A1.T) + W2 * lambd / m  # extra term from L2 regularization
db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

dA1 = np.dot(W2.T, dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))  # derivative of the ReLU activation
dW1 = 1. / m * np.dot(dZ1, X.T) + W1 * lambd / m  # extra term from L2 regularization
db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
```

### 1.3.2: Dropout
Note: draw the random numbers for the dropout mask from the uniform distribution on [0, 1), i.e., `np.random.rand(a, b)` (not `np.random.randn`, which samples from a Gaussian).
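
To show where the uniform random numbers enter, here is a minimal sketch of inverted dropout, one common way to implement dropout. The function names, the `keep_prob` parameter, and the toy activations are assumptions for illustration and are not taken from the text above.

```python
import numpy as np

def dropout_forward(A, keep_prob):
    """Apply inverted dropout to the activations A (any shape)."""
    # Uniform numbers in [0, 1): each unit is kept with probability keep_prob.
    D = np.random.rand(*A.shape) < keep_prob
    # Zero the dropped units and rescale so the expected activation is unchanged.
    A_dropped = A * D / keep_prob
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    """Route gradients through the same mask and scaling as the forward pass."""
    return dA * D / keep_prob

np.random.seed(0)
A1 = np.ones((50, 1000))                      # toy activations (hypothetical)
A1_dropped, D1 = dropout_forward(A1, keep_prob=0.8)
print(D1.mean(), A1_dropped.mean())           # roughly 0.8 and roughly 1.0
```

The mask is only applied during training; at test time the activations are used unchanged, which is why the forward pass divides by `keep_prob`.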

## 1.4 Accelerate deep network training
- Poor initialization can lead to **vanishing/exploding gradients**, which also slows down the optimization.
- The goal is to speed up the convergence of gradient descent.

### 1.4.1: Input data normalization
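
A minimal sketch of one common normalization scheme, assuming zero-mean, unit-variance scaling per feature and the `(features, m)` layout used elsewhere in these notes. The helper names, the `eps` guard, and the toy data are hypothetical. The statistics are computed on the training set only and reused for other data.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute the per-feature mean and standard deviation; X_train has shape (features, m)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    return mu, sigma + eps                    # eps avoids division by zero for constant features

def normalize(X, mu, sigma):
    """Shift and scale X with statistics computed on the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
feature_scales = np.array([[1.0], [10.0], [100.0]])    # features on very different scales
X_train = rng.standard_normal((3, 500)) * feature_scales + 5.0
X_test = rng.standard_normal((3, 100)) * feature_scales + 5.0

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)    # same mu and sigma as for the training set
```
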
### 1.4.2: Weights initialization
- The weights $W^{[l]}$ should be initialized randomly to **break symmetry**. 
- It is okay to initialize the biases $b^{[l]}$ to **zeros**. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly. 
- **He** initialization: scaling factor for the weights $W^{[l]}$ is $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$. It is recommended for layers with a **ReLU** activation.
- **Xavier** initialization: scaling factor for the weights $W^{[l]}$ is $\sqrt{\frac{1}{\text{dimension of the previous layer}}}$.

The basic principle behind both **He** and **Xavier** initialization is to keep the variance of a layer's output equal to that of its input, i.e., $Var(Y) = Var(X)$. Two references, [a](http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization) and [b](https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/), explain the principle in detail.
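
The variance-preserving idea can be checked numerically. The sketch below is an illustration under stated assumptions: a single linear layer without activation, standard-normal inputs, and arbitrary layer sizes. It compares weights scaled by $\sqrt{1/n}$ (the Xavier factor given above) with unscaled weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, m = 512, 256, 2000               # hypothetical layer sizes and batch size

X = rng.standard_normal((n_in, m))            # inputs with variance close to 1

# Xavier-style scaling: standard-normal weights times sqrt(1 / n_in).
W_xavier = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
Y = W_xavier @ X                              # linear layer, no activation

# Unscaled weights for comparison: the output variance grows with n_in.
W_unscaled = rng.standard_normal((n_out, n_in))
Y_unscaled = W_unscaled @ X

print(X.var(), Y.var(), Y_unscaled.var())
```

With the scaling, the output variance stays close to the input variance; without it, the variance is multiplied by roughly `n_in` at each layer, which is how activations and gradients can explode in a deep network.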

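```python
import numpy as np

def initialize_parameters_he(layers_dims):
    """He initialization; layers_dims lists the layer sizes, input layer first."""
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, not counting the input layer
    for l in range(1, L + 1):
        # Scale standard-normal weights by sqrt(2 / size of the previous layer).
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2.0/layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```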

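```python
# TensorFlow 1.x implementation (tf.contrib was removed in TensorFlow 2.x)
import tensorflow as tf

# a and b stand for the dimensions of the weight matrix
W = tf.get_variable('w', shape=[a, b], initializer=tf.contrib.layers.xavier_initializer())
# see https://stackoverflow.com/questions/33640581/how-to-do-xavier-initialization-on-tensorflow
```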