# Deep Learning Notes

## Richer ; May 15th

## 1 Introduction

Deep learning applications can be grouped roughly by domain and by the architecture typically used in each.

|  Domain   | Typical architecture  |
|  ----  | ----  |
| Images | CNN |
| NLP | RNN, LSTM |


A basic idea: training scale drives deep learning progress.

### 1.1 Introduction to the LR (logistic regression) Method

The prediction, loss function, and cost function of logistic regression (LR) are given below. The idea is to minimize the cost function by adjusting the parameters $w$ and $b$.
$$\text{Prediction: }\hat{y}=\sigma(w^{T}x+b)\text{ where } \sigma(z)=\frac{1}{1+e^{-z}}
$$
$$\text{Loss function: }L(\hat{y},y)=-\big(y\log\hat{y}+(1-y)\log(1-\hat{y})\big)$$
$$\text{Cost function: }J(w,b)=\frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)})$$

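The **gradient descent** method is applied to $w$ and $b$ as follows:
$$ w:=w-\alpha\frac{\partial J(w,b)}{\partial w}\text{ (written } dw\text{)},\qquad b:=b-\alpha\frac{\partial J(w,b)}{\partial b}\text{ (written } db\text{)}
$$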
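
The prediction, loss, and cost formulas above can be evaluated directly in NumPy. The following is a minimal sketch, assuming the examples are stored as columns of `X`; the helper names (`sigmoid`, `predict`, `cost`) and the toy data are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    """y_hat = sigma(w^T x + b) for every column of X (shape (n_x, m))."""
    return sigmoid(w.T @ X + b)

def cost(y_hat, Y):
    """Average cross-entropy loss J over the m examples."""
    return float(-np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat)))

# Toy data (hypothetical): 2 features, 4 examples stored as columns.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
w = np.zeros((2, 1))
b = 0.0

y_hat = predict(w, b, X)   # every prediction is 0.5 when w = 0 and b = 0
J = cost(y_hat, Y)         # equals log(2) at this starting point
print(y_hat, J)
```

Starting from $w=0$, $b=0$ the model is maximally uncertain, so the cost is $\log 2$; gradient descent then lowers it from there.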

Intuitively, each iteration consists of a forward propagation (compute $\hat{y}$ and the cost) and a backward propagation (compute the derivatives).

**Pseudocode**: In Python, we often write dv for $\frac{\partial J}{\partial v}$. In what follows we write $x=[x^{(1)},x^{(2)},...,x^{(m)}]$, 
$dx=[dx^{(1)},dx^{(2)},...,dx^{(m)}]$, and so on.
For example, $z=w^{T}x+b$ stands for
$$ [z^{(1)},z^{(2)},...,z^{(m)}]=w^{T}[x^{(1)},x^{(2)},...,x^{(m)}]+[b,b,...,b]
$$ 

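1. J=0; $dw=np.zeros((n_x,1))$; db=0.
2. For i=1 to m ( $i$ indexes the training examples )
<br> $\quad  z^{(i)}=w^{T}x^{(i)}+b$
<br> $\quad  a^{(i)}=\sigma(z^{(i)})$
<br> $\quad  J+=-[y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})]$
<br> $\quad  dz^{(i)}=a^{(i)}-y^{(i)}$
<br> $\quad  dw+=x^{(i)}dz^{(i)}$
<br> $\quad  db+=dz^{(i)}$
3. Average over the examples: $J=J/m$; $dw=dw/m$; $db=db/m$. In vectorized form, steps 2 and 3 collapse to $dz=a-y$, $dw=\frac{1}{m}x\,dz^{T}$, $db=\frac{1}{m}np.sum(dz)$.
4. Update the parameters
<br> $\quad  w=w-\alpha dw; b=b-\alpha db$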
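
The pseudocode above can be sketched in NumPy both as a per-example loop and in vectorized form. This is an illustration under assumptions, not a reference implementation: the function names, the synthetic data, and the learning rate `alpha = 0.1` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_loop(w, b, X, Y):
    """Per-example loop: accumulate J, dw, db, then average over m."""
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
    for i in range(m):
        x_i = X[:, i:i + 1]                      # column vector, shape (n_x, 1)
        y_i = Y[0, i]
        a_i = sigmoid((w.T @ x_i).item() + b)
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        dw += x_i * dz_i
        db += dz_i
    return J / m, dw / m, db / m

def gradients_vectorized(w, b, X, Y):
    """Same quantities without an explicit loop over the examples."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                     # shape (1, m)
    J = float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
    dZ = A - Y
    dw = X @ dZ.T / m                            # shape (n_x, 1)
    db = float(np.sum(dZ) / m)
    return J, dw, db

# Synthetic, linearly separable data (hypothetical): 3 features, 50 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)
w, b, alpha = np.zeros((3, 1)), 0.0, 0.1

# Both versions should agree at the starting point.
J_loop, dw_loop, db_loop = gradients_loop(w, b, X, Y)
J_vec, dw_vec, db_vec = gradients_vectorized(w, b, X, Y)

# Gradient descent using the vectorized gradients.
costs = []
for _ in range(100):
    J, dw, db = gradients_vectorized(w, b, X, Y)
    costs.append(J)
    w = w - alpha * dw
    b = b - alpha * db
print(costs[0], costs[-1])
```

The loop version is kept only to mirror the pseudocode; in practice the vectorized version is the one to use.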

**Code Time**: Whenever possible, avoid explicit for-loops, especially when a built-in function does the same job. This is why **vectorization** matters.



### 1.2 Vectorization Implementation

Vectorization is the art of getting rid of explicit for-loops in code. 

**Vectorization**

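    import numpy as np
    a = np.random.randn(8, 1)  # use shape (8, 1); randn(8) gives a rank-1 array whose transpose is itself
    b = np.random.randn(8, 1)
    c = np.dot(a.T, b)         # inner product, shape (1, 1)
    d = np.exp(b)              # elementwise; np.maximum, ** and np.log work the same way
    # print(type(a))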
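
To make the cost of explicit loops concrete, here is a small timing sketch that computes the same inner product with a Python for-loop and with `np.dot`. The vector length and variable names are arbitrary choices for illustration, and the measured times depend on the machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                            # arbitrary size for the demo
u = rng.standard_normal((n, 1))        # column vectors, as recommended above
v = rng.standard_normal((n, 1))

# Explicit for-loop
t0 = time.perf_counter()
loop_result = 0.0
for i in range(n):
    loop_result += u[i, 0] * v[i, 0]
t_loop = time.perf_counter() - t0

# Vectorized inner product
t0 = time.perf_counter()
vec_result = np.dot(u.T, v).item()
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop * 1e3:.2f} ms, np.dot: {t_vec * 1e3:.2f} ms")
```

Both versions give the same value up to floating-point rounding; only the running time differs.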

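**Broadcasting**

    e = np.c_[a, b, d]       # stack the columns: shape (8, 3)
    f = np.dot(e.T, d) + 1   # the scalar 1 is broadcast over the (3, 1) result
    a = a.reshape(8, 1)      # make the shape explicit
    assert a.shape == (8, 1)
    g = e / a                # more general broadcasting: (8, 3) / (8, 1)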
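
The shape rules behind broadcasting can be seen on a small example. This is a sketch with made-up numbers: an array of shape `(1, n)` or `(m, 1)` is stretched to `(m, n)` before the elementwise operation.

```python
import numpy as np

# Hypothetical 3x4 table: rows are quantities, columns are items.
A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])

col_totals = A.sum(axis=0, keepdims=True)       # shape (1, 4)
percent = 100 * A / col_totals                  # (3, 4) / (1, 4) -> (3, 4)

row_offsets = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
shifted = A + row_offsets                       # (3, 4) + (3, 1) -> (3, 4)

print(percent.shape, shifted.shape)
```

Using `keepdims=True` keeps the result two-dimensional, which avoids the rank-1 arrays that make shapes harder to reason about.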

## 2 Implementing a One-Hidden-Layer Neural Network

Some assumptions.
1. The superscript 2 in $a^{[2]}_{3}$ refers to the layer number and the subscript 3 refers to the node within that layer.
2. The structure is as follows:

**Input Layer -- Hidden Layer -- Output Layer**

where "hidden" means that the true values of these nodes are not observed in the training set.

3. Every hidden unit performs a two-step calculation, and gathering the units of a layer together gives a matrix multiplication. For layer $i$ and training example $j$:
$$ z^{[i](j)}=w^{[i]}x^{(j)}+b^{[i]}
$$
$$ a^{[i](j)}=\sigma(z^{[i](j)})
$$

We can also vectorize over the training examples by letting $x=[x^{(1)},x^{(2)},..,x^{(m)}]$.
Intuitively, the columns (horizontal direction) index the training examples and the rows (vertical direction) index the nodes.

Therefore we have a more general formula
$$ z^{[i]}=w^{[i]}x+b^{[i]}
$$
$$ a^{[i]}=\sigma(z^{[i]})
$$

More generally, replacing $x$ with $a^{[i-1]}$ (where $a^{[0]}=x$), we have
$$ z^{[i]}=w^{[i]}a^{[i-1]}+b^{[i]}
$$
$$ a^{[i]}=\sigma(z^{[i]})
$$
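
The per-example and vectorized formulas above give the same activations, which a short NumPy sketch can confirm. The layer sizes, the random data, and the names `W1`, `b1` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5                        # hypothetical sizes
X = rng.standard_normal((n_x, m))            # columns are training examples
W1 = rng.standard_normal((n_h, n_x)) * 0.01  # one row per hidden node
b1 = np.zeros((n_h, 1))

# One example at a time: z^{[1](j)} = W1 x^{(j)} + b1
A1_loop = np.zeros((n_h, m))
for j in range(m):
    x_j = X[:, j:j + 1]
    A1_loop[:, j:j + 1] = sigmoid(W1 @ x_j + b1)

# All examples at once: b1 is broadcast across the m columns
A1 = sigmoid(W1 @ X + b1)

print(A1.shape)   # rows index nodes, columns index examples
```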

### 2.1 Activation Functions and Their Slopes

In hidden layers, $\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$ usually works 
better than $\sigma(z)$ because its outputs are zero-centered. **ReLU is the most common default choice nowadays**.

Here are some rules of thumb:
1. Binary classification: $\sigma$ for the output layer; ReLU for the other layers.
2. Restricted output (e.g. predicting a house price, which is always nonnegative): ReLU can be used for the output layer.

We also have
$$ \sigma'(z)=\sigma(z)(1-\sigma(z)) \quad ; \quad \tanh'(z)=1-\tanh^2(z)
$$
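
The slope formulas above can be checked numerically with a central finite difference. The sketch below is illustrative only; the function names are made up, and taking the ReLU slope to be 0 at $z=0$ is a convention chosen here, not something the notes specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    # Convention (assumption): slope 0 at z = 0.
    return (z > 0).astype(float)

z = np.linspace(-3, 3, 7)
eps = 1e-6

# Central differences approximate the true derivatives.
num_sigmoid = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

print(np.max(np.abs(num_sigmoid - sigmoid_prime(z))))
print(np.max(np.abs(num_tanh - tanh_prime(z))))
```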

### 2.2 Implementing Gradient Descent for a NN with One Hidden Layer

**Algorithm for the case when $\hat{y}$ is a row vector**

Repeat the following steps:
1. Compute the predictions $\hat{y}^{(i)}$.
2. Compute the partial derivatives $dw^{[i]}$ and $db^{[i]}$ (computed from the loss function, e.g. $L(a^{[2]},y)$).
3. $w^{[i]}=w^{[i]}-\alpha dw^{[i]}$; $b^{[i]}=b^{[i]}-\alpha db^{[i]}$; 

Specifically (below we let $g^{[2]}=\sigma$), each iteration has two steps:

**Forward Propagation**
1. 
$$z^{[1]}=w^{[1]}x+b^{[1]}\quad  \text{ and }\quad
   a^{[1]}=g^{[1]}(z^{[1]})
$$
2. 
$$z^{[2]}=w^{[2]}a^{[1]}+b^{[2]}\quad  \text{ and }\quad
   a^{[2]}=g^{[2]}(z^{[2]})=\sigma(z^{[2]})
$$

**Backward Propagation** ( m training examples )
1. 
$$ \quad dz^{[2]}=a^{[2]}-y
$$
$$ \quad dw^{[2]}=\frac{1}{m}dz^{[2]}a^{[1]T}
$$
$$ \quad db^{[2]}=\frac{1}{m}np.sum(dz^{[2]},axis=1,keepdims=True)
$$
2. 
$$ \quad dz^{[1]}=\big(w^{[2]T}dz^{[2]}\big)* g^{[1]\prime}(z^{[1]})\quad\text{(elementwise product)}
$$
$$ \quad dw^{[1]}=\frac{1}{m}dz^{[1]}a^{[0]T}
$$
$$ \quad db^{[1]}=\frac{1}{m}np.sum(dz^{[1]},axis=1,keepdims=True)
$$
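
The forward and backward formulas above translate almost line by line into NumPy. The following is a minimal sketch, assuming $g^{[1]}=\tanh$ for the hidden layer and $g^{[2]}=\sigma$ for the output; the layer sizes, synthetic data, learning rate, and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1 @ X + b1)          # hidden layer, g^{[1]} = tanh (assumption)
    A2 = sigmoid(W2 @ A1 + b2)         # output layer, g^{[2]} = sigma
    return A1, A2

def backward(params, X, Y, A1, A2):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)  # tanh'(z) = 1 - tanh(z)^2, elementwise
    dW1 = dZ1 @ X.T / m                 # a^{[0]} = x
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

def cost(A2, Y):
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

# Synthetic data (hypothetical): 2 features, 200 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

# Small random weights, zero biases; 4 hidden units.
params = [rng.standard_normal((4, 2)) * 0.01, np.zeros((4, 1)),
          rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))]
alpha = 0.5

costs = []
for _ in range(1000):
    A1, A2 = forward(params, X)
    costs.append(cost(A2, Y))
    grads = backward(params, X, Y, A1, A2)
    params = [p - alpha * g for p, g in zip(params, grads)]
print(costs[0], costs[-1])
```

A finite-difference gradient check on a few entries of `dW1` and `dW2` is a cheap way to catch mistakes in a backward pass like this one.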







**Initialization**

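Initialize the weights randomly to **break the symmetry** between hidden units (with identical initial weights, all units in a layer compute the same function and receive the same updates), e.g.
$$ w^{[i]}=np.random.randn(2,2)*0.01 $$
where we multiply by 0.01 to keep the weights small: large weights push $\tanh$ or $\sigma$ into their flat regions, where the gradients are small and learning is slow. The biases $b^{[i]}$ can be initialized to zero.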
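
The effect of the initialization can be seen on a tiny example. This is a sketch under assumptions: `init_params`, its arguments, and the sample input are hypothetical, and $\tanh$ is used as the hidden activation.

```python
import numpy as np

def init_params(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights, zero biases (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }

params = init_params(n_x=2, n_h=4, n_y=1)

# Made-up input: 2 features, 3 examples.
X = np.array([[1.0, -2.0, 0.5],
              [0.3, 0.8, -1.0]])

# All-zero weights: every hidden unit produces the same row of activations.
A1_zero = np.tanh(np.zeros((4, 2)) @ X + np.zeros((4, 1)))
# Random weights: the rows differ, so the units can learn different features.
A1_rand = np.tanh(params["W1"] @ X + params["b1"])

rows_identical_zero = bool(np.allclose(A1_zero, A1_zero[0]))
rows_identical_rand = bool(np.allclose(A1_rand, A1_rand[0]))
print(rows_identical_zero, rows_identical_rand)
```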


## 3 Implementing a Deep (More Than One Hidden Layer) Neural Network

**Notation**
1. $L=$#layers 
2. $n^{[l]}$=#units in layer $l$.

**Formula ( vectorized: stack all the training examples )**

**Forward Propagation** $a^{[l-1]}$ to $a^{[l]}, z^{[l]}$

1. $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$
2. $a^{[l]}=g^{[l]}(z^{[l]})$
3. $x=a^{[0]}\in \mathbb{R}^{n^{[0]}\times m}\quad \text{and} \quad \hat{y}=a^{[L]}$

**Backward Propagation** $da^{[l]}$ to $da^{[l-1]},dw^{[l]},db^{[l]}$

1. $dz^{[l]}=da^{[l]}* g^{[l]\prime}(z^{[l]})$ (elementwise product)
2. $dw^{[l]}=\frac{1}{m}dz^{[l]}a^{[l-1]T}$ and $db^{[l]}=\frac{1}{m}np.sum(dz^{[l]},axis=1,keepdims=True)$
3. $da^{[l-1]}=w^{[l]T}dz^{[l]}$
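
The layer-by-layer recursion above can be sketched as two loops over the layers. This is an illustration, not a definitive implementation: it assumes ReLU in the hidden layers and a sigmoid output with the cross-entropy loss (so that $dz^{[L]}=a^{[L]}-y$), and the function names and layer sizes are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layers(layer_dims, seed=0):
    """layer_dims = [n0, n1, ..., nL]; returns a list of (W, b) per layer."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01,
             np.zeros((layer_dims[l], 1)))
            for l in range(1, len(layer_dims))]

def forward(params, X):
    """Returns a^{[L]} and a cache of (a^{[l-1]}, z^{[l]}) for each layer."""
    A, caches = X, []
    L = len(params)
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ A + b
        caches.append((A, Z))
        A = sigmoid(Z) if l == L else relu(Z)
    return A, caches

def backward(params, caches, AL, Y):
    """Returns a list of (dW, db), one pair per layer."""
    L, m = len(params), Y.shape[1]
    grads = [None] * L
    dZ = AL - Y                                  # sigmoid output + cross-entropy
    for l in reversed(range(L)):                 # l is the 0-based layer index
        W, _ = params[l]
        A_prev, _ = caches[l]
        dW = dZ @ A_prev.T / m
        db = np.sum(dZ, axis=1, keepdims=True) / m
        grads[l] = (dW, db)
        if l > 0:
            dA_prev = W.T @ dZ
            Z_prev = caches[l - 1][1]
            dZ = dA_prev * (Z_prev > 0)          # ReLU slope, elementwise
    return grads

# Hypothetical network: 5 inputs, two hidden layers (4 and 3 units), 1 output.
params = init_layers([5, 4, 3, 1])
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 10))
Y = (rng.random((1, 10)) > 0.5).astype(float)

AL, caches = forward(params, X)
grads = backward(params, caches, AL, Y)
print([dW.shape for dW, _ in grads])
```

Each `dW` has the same shape as the corresponding `W`, which is a useful sanity check when debugging.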

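**Intuition**
1. Early layers detect simple features such as edges.
2. Later layers compose these features into more complex ones.

**Parameters and hyperparameters**
1. Parameters: $w^{[i]}$, $b^{[i]}$
2. Hyperparameters: $\alpha$, number of iterations, $L$, $n^{[l]}$, choice of activation function.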

## 4 Development

### 4.1 Separation of Data

The data is split in one of the following ways:

**training ( train models )-- development set ( compare models ) -- test set ( test models )**,

**training ( train models )-- development set ( compare models )**,

1. When the data set is large, e.g. over 1,000,000 examples, a 98:1:1 split is fine.
2. Make sure the dev and test sets come from the same distribution.
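
A 98:1:1 split can be sketched by shuffling the example indices once and slicing. The helper name `split_indices` and its arguments are hypothetical; drawing the dev and test sets from the same shuffled pool is one simple way to keep them on the same distribution.

```python
import numpy as np

def split_indices(m, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the indices 0..m-1 and cut them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev = perm[:n_dev]
    test = perm[n_dev:n_dev + n_test]
    train = perm[n_dev + n_test:]
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With a data matrix `X` whose columns are examples, the training part would then be `X[:, train_idx]`, and similarly for the other two sets.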

### 4.2 Bias and Variance

We compare the training set error and the dev set error to diagnose bias and variance.

If the optimal error is close to 0 (i.e. a human can tell the difference), then for train error / dev error pairs (in %):
1. 1/11: high variance (large gap on the development set)
2. 15/16: high bias (large error on the training set)
3. 19/30: high bias & high variance
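
The reading of the three cases above can be written as a small rule. This is only a sketch: the function `diagnose` and especially the 5-point threshold `tol` are assumptions chosen to reproduce the examples, not a rule stated in the notes.

```python
def diagnose(train_err, dev_err, optimal_err=0.0, tol=5.0):
    """Errors are in percent. tol is a hypothetical threshold."""
    labels = []
    if train_err - optimal_err > tol:   # far from the optimal error on the training set
        labels.append("high bias")
    if dev_err - train_err > tol:       # large gap between training and dev sets
        labels.append("high variance")
    return labels

for train_err, dev_err in [(1, 11), (15, 16), (19, 30)]:
    print(train_err, dev_err, diagnose(train_err, dev_err))
```

In practice the threshold depends on the problem and on how close the optimal error really is to 0.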

Here are some ways to fix one problem without hurting the other:

**Regularization / more data -- reduce variance/overfitting**

**Larger network / better models -- reduce bias/underfitting**

An example: $L_2$ regularization in a neural network, where the penalty is the squared Frobenius norm of each weight matrix

$$ J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}||w^{[l]}||_{F}^2
$$
$$w^{[l]}:=w^{[l]}-\alpha dw^{[l]}\textbf{, where } dw^{[l]}=(\text{gradient from backpropagation})+\frac{\lambda}{m}w^{[l]}
$$
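
The regularized cost and update can be sketched in a few lines of NumPy. The function names and the numbers below are made up for illustration; `dW_backprop` stands for the unregularized gradient obtained from backpropagation.

```python
import numpy as np

def l2_cost(cross_entropy, weights, lam, m):
    """Add (lambda / 2m) * sum of squared weights to the unregularized cost."""
    penalty = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy + penalty

def l2_gradient(dW_backprop, W, lam, m):
    """Regularized gradient: backprop term plus (lambda / m) * W."""
    return dW_backprop + lam / m * W

def update(W, dW_backprop, lam, m, alpha):
    return W - alpha * l2_gradient(dW_backprop, W, lam, m)

# Made-up values for illustration.
W = np.array([[1.0, -2.0],
              [3.0, 4.0]])
lam, m, alpha = 0.7, 10, 0.1

J_reg = l2_cost(0.5, [W], lam, m)                   # 0.5 + 0.035 * 30
W_new = update(W, np.zeros_like(W), lam, m, alpha)  # pure shrinkage step
print(J_reg, W_new)
```

With a zero backprop gradient the update multiplies `W` by $1-\alpha\lambda/m$, which is why $L_2$ regularization is also called weight decay.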