## Neural Networks

Since simple logistic regression can only classify by dividing the sample space with a linear boundary, we need a more expressive model for data that is not linearly separable.
>The Coursera course mainly covered binary classification, hence other output functions are yet to be explored.

So we introduce some hidden layers between the input features and the output.  
**Notation**:
- Superscript $[l]$ denotes a quantity associated with the $l^{th}$ layer. 
    - Example: $a^{[L]}$ is the $L^{th}$ layer activation. $W^{[L]}$ and $b^{[L]}$ are the $L^{th}$ layer parameters.
- Superscript $(i)$ denotes a quantity associated with the $i^{th}$ example. 
    - Example: $x^{(i)}$ is the $i^{th}$ training example.
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
    - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the $l^{th}$ layer's activations.

$L$: number of layers, counting the hidden layers and the output layer but not the input layer (by convention the input layer is layer 0 and the output layer is layer $L$)  
$nh[i]$ : number of units in layer $i$  
$W^{[i]}$ : matrix of dim($nh[i], nh[i-1]$)  
$b^{[i]}$ : column matrix of dim($nh[i], 1$)  
$z^{[i]}$ : matrix obtained by $ z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]} $, dim is $(nh[i],1)$ for one example (and $(nh[i], m)$ for $Z^{[i]}$ when the $m$ examples are stacked as columns)  
$g^{[i]}$ : activation function used for the $i^{th}$ layer; for binary classification use sigmoid in the last layer  


Some activation functions are:  
   - sigmoid : $\sigma(x) = \frac{1}{1 + e^{-x}}$. Used in the output layer for binary classification. Not usually used in hidden layers because its output is not zero-centred.
   - tanh : $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$. Better than sigmoid in hidden layers because it is zero-centred, but both have nearly zero slope for inputs of large magnitude, which slows down learning, hence the need for a replacement.
   - relu : $Relu(x) = max(0,x)$. Slope is constant (equal to 1) for positive x. **Generally used**.
         Some variations of relu exist (such as leaky ReLU) but are not generally used.
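
A minimal NumPy sketch of the three activations listed above. The function names are my own choice, and the slope check at `z = 10` is only there to illustrate the saturation of sigmoid and tanh.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent, zero-centred with outputs in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Rectified linear unit: 0 for negative inputs, identity otherwise."""
    return np.maximum(0, z)

# Slopes at a large input: sigmoid and tanh saturate, relu does not.
z = np.array([10.0])
sigmoid_slope = sigmoid(z) * (1 - sigmoid(z))   # derivative of sigmoid, close to 0
tanh_slope = 1 - np.tanh(z) ** 2                # derivative of tanh, close to 0
relu_slope = (z > 0).astype(float)              # derivative of relu, 1 for z > 0
```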

**Mathematically**(for L=2, i.e. only single hidden layer):

For one example $x^{(i)}$:
$$z^{[1] (i)} =  W^{[1]} x^{(i)} + b^{[1]}\tag{1}$$ 
$$a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}$$
$$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2]}\tag{3}$$
$$\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}$$
$$y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[2](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}$$
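
Equations (1)-(5) can be sketched in vectorized form, processing all examples at once. This assumes the examples are stacked as columns of `X` with shape `(n_x, m)`. The function name, the layer sizes and the `0.01` scaling of the random weights are illustrative assumptions, not something fixed by the notes.

```python
import numpy as np

def forward_two_layer(X, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network for all m examples.

    X has shape (n_x, m), one example per column (assumed layout).
    """
    Z1 = W1 @ X + b1                      # (n_h, m), eq. (1)
    A1 = np.tanh(Z1)                      # eq. (2)
    Z2 = W2 @ A1 + b2                     # (1, m), eq. (3)
    A2 = 1.0 / (1.0 + np.exp(-Z2))        # sigmoid output, eq. (4)
    predictions = (A2 > 0.5).astype(int)  # threshold at 0.5, eq. (5)
    return A2, predictions

# Toy sizes: 2 input features, 4 hidden units, 5 examples.
rng = np.random.default_rng(0)
n_x, n_h, m = 2, 4, 5
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

A2, predictions = forward_two_layer(X, W1, b1, W2, b2)
print(A2.shape, predictions.shape)
```

Note that `b1` and `b2` are broadcast across the `m` columns, so no explicit loop over examples is needed.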

Given the predictions on all the examples, you can also compute the cost $J$ as follows: 
$$J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \right) \tag{6}$$

**Reminder**: The general methodology to build a Neural Network is to:

1. Define the neural network structure (# of input units, # of hidden units, etc).
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)

You often build helper functions to compute steps 1-3 and then merge them into one function we call `nn_model()`. Once you've built `nn_model()` and learnt the right parameters, you can make predictions on new data.

#### Defining neural networks
When deciding the number of hidden units, keep in mind that a large number of units may overfit the data, so choose it with care.  
Initialize nx, nh, ny (the sizes of the input, hidden and output layers) for further usage.

#### Initialization for W and b
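If we make W all zeroes then all units in a single layer compute the same function and receive the same gradient, so they keep acting identically. Use `np.random.randn()` for W to break this symmetry.
Generally it doesn't make a difference whether b is random or not, hence `np.zeros()` is used for it.  
Make sure to give the correct dimensions.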
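
A possible initialization helper, as a sketch. The name `initialize_parameters`, the dictionary keys `W1, b1, ...` and the list `layer_dims` are hypothetical. Scaling the random weights by `0.01` is an extra assumption on my part (a common choice to keep sigmoid/tanh away from their flat regions), not something stated above.

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=0):
    """Random W, zero b for every layer.

    layer_dims = [n_x, n_h1, ..., n_y] (hypothetical argument), so
    W{l} has shape (layer_dims[l], layer_dims[l-1]) and b{l} has
    shape (layer_dims[l], 1).
    """
    np.random.seed(seed)  # fixed seed only to make the example repeatable
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_parameters([3, 4, 1])
for name, value in params.items():
    print(name, value.shape)
```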

#### Forward propagation
Loop over the layers $i = 1, \dots, L$ (with $A^{[0]} = X$) and calculate:
$$ Z^{[i]} = W^{[i]}A^{[i-1]}+ b^{[i]} $$
$$ A^{[i]} = g^{[i]}(Z^{[i]}) $$
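
The loop can be sketched as follows. I am assuming ReLU in the hidden layers and sigmoid in the last layer (one common setup for binary classification), and a `params` dictionary with keys `W1, b1, ..., WL, bL`. Both are illustrative choices. Each layer's inputs and `Z` are cached because backward propagation needs them later.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def forward_propagation(X, params):
    """Forward pass through L layers; returns A^[L] and per-layer caches."""
    L = len(params) // 2          # each layer contributes one W and one b
    A = X                         # A^[0] = X
    caches = []
    for l in range(1, L + 1):
        A_prev = A
        W, b = params[f"W{l}"], params[f"b{l}"]
        Z = W @ A_prev + b
        A = sigmoid(Z) if l == L else relu(Z)   # assumed activation choice
        caches.append((A_prev, W, b, Z))        # kept for backward propagation
    return A, caches

rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((4, 3)) * 0.01, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 5))   # 3 features, 5 examples
AL, caches = forward_propagation(X, params)
print(AL.shape, len(caches))
```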

#### Compute Cost
Take the final $A^{[L]}$ and $Y$ and compute the cost function $J$. ($J$ is the average of the loss function over all examples, so the bracketed term together with the negative sign is our loss function here.)  
$$ J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L] (i)}\right) \right) $$
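
A small sketch of the cost computation, assuming `AL` and `Y` both have shape `(1, m)`. Clipping with `eps` is my own addition to avoid `log(0)` when a prediction saturates; it is not part of the formula.

```python
import numpy as np

def compute_cost(AL, Y, eps=1e-12):
    """Average binary cross-entropy over the m examples (columns)."""
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)   # assumption: guard against log(0)
    losses = -(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return float(np.sum(losses) / m)

Y = np.array([[1, 0]])
uninformed = compute_cost(np.array([[0.5, 0.5]]), Y)    # every prediction is 0.5
confident = compute_cost(np.array([[0.99, 0.01]]), Y)   # nearly correct predictions
print(uninformed, confident)
```

A network that always outputs 0.5 gets a cost of $\log 2$, which is a handy reference value when checking whether training has started to make progress.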


#### Backward propagation
  > Note: the shorthand used is $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$.  
  
The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$. Here are the formulas you need:
$$ dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]}  = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}$$
$$ dA^{[l-1]} = W^{[l] T} dZ^{[l]} \tag{10}$$  
To calculate $dZ^{[l]}$ use $dA^{[l]}$ and $Z^{[l]}$ (cached from forward prop), where $*$ is the element-wise product:  
$$dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]}) \tag{11}$$

Hence to start all these calculations we need $dA^{[L]}$, which for a sigmoid output layer with the loss above can be computed as follows:
```python
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
```
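
Equations (8)-(11) for a single layer can be sketched like this. The helper names (`linear_backward`, `sigmoid_backward`, `relu_backward`) and the toy sizes are assumptions for illustration. The example also checks a known simplification: for a sigmoid output, `dAL * g'(Z)` reduces to `AL - Y`.

```python
import numpy as np

def linear_backward(dZ, A_prev, W):
    """Gradients of one layer given dZ, eqs. (8)-(10)."""
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m                      # eq. (8)
    db = np.sum(dZ, axis=1, keepdims=True) / m    # eq. (9)
    dA_prev = W.T @ dZ                            # eq. (10)
    return dA_prev, dW, db

def sigmoid_backward(dA, Z):
    """eq. (11) with g = sigmoid, whose derivative is s * (1 - s)."""
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)

def relu_backward(dA, Z):
    """eq. (11) with g = relu, whose derivative is 1 where Z > 0."""
    return dA * (Z > 0)

# Output layer of a toy network: 3 units feeding 1 sigmoid unit, m = 4.
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))
W = rng.standard_normal((1, 3))
b = np.zeros((1, 1))
Y = np.array([[1, 0, 1, 0]])

Z = W @ A_prev + b
AL = 1.0 / (1.0 + np.exp(-Z))
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
dZ = sigmoid_backward(dAL, Z)                     # equals AL - Y
dA_prev, dW, db = linear_backward(dZ, A_prev, W)
```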

#### Update
Update parameters as
$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{16}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{17}$$

where $\alpha$ is the learning rate.
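
The update step as a sketch, assuming the parameters and gradients are stored in dictionaries with keys `W1, b1, ...` and `dW1, db1, ...` (a hypothetical layout):

```python
import numpy as np

def update_parameters(params, grads, learning_rate):
    """One gradient descent step, eqs. (16)-(17), for every layer."""
    L = len(params) // 2
    updated = {}
    for l in range(1, L + 1):
        updated[f"W{l}"] = params[f"W{l}"] - learning_rate * grads[f"dW{l}"]
        updated[f"b{l}"] = params[f"b{l}"] - learning_rate * grads[f"db{l}"]
    return updated

params = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.0]])}
grads = {"dW1": np.array([[0.5, 0.5]]), "db1": np.array([[1.0]])}
new_params = update_parameters(params, grads, learning_rate=0.1)
print(new_params["W1"], new_params["b1"])
```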

Repeating the forward propagation, backward propagation and update steps over many iterations gradually lowers the value of the cost function, and thus our NN is trained. Note that we should check for underfitting and overfitting.
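
Putting the steps together, a compact `nn_model()` for one hidden layer might look like the sketch below. Everything specific here is an assumption for illustration: the toy dataset, tanh in the hidden layer, 4 hidden units, the `0.01` weight scaling, the learning rate and the number of iterations.

```python
import numpy as np

def nn_model(X, Y, n_h=4, learning_rate=0.5, num_iterations=500, seed=1):
    """Train a one-hidden-layer network (tanh hidden, sigmoid output)."""
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1 = rng.standard_normal((n_h, n_x)) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((1, n_h)) * 0.01
    b2 = np.zeros((1, 1))
    costs = []
    for _ in range(num_iterations):
        # Forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        # Cost (clipped to avoid log(0))
        A2c = np.clip(A2, 1e-12, 1 - 1e-12)
        costs.append(float(-np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))))
        # Backward propagation (all gradients use the pre-update weights)
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # derivative of tanh is 1 - tanh^2
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # Gradient descent update
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}, costs

# Toy data: the label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

params, costs = nn_model(X, Y)
print(costs[0], costs[-1])   # the cost should go down over training
```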

### Multi-class Classification
We can use the following method for a general classification into C classes:  
#### Softmax regression
After calculating $z^{[L]}$ for the final layer we calculate t as `t = exp(z)` and then normalize it to obtain the final output, i.e. yhat for our NN. Note that the last layer should now contain C units.
$$ a^{[L]}_i = \frac{e^{z^{[L]}_i}}{\sum\limits_{j = 1}^{C} e^{z^{[L]}_j}} $$
This generates values within the range (0,1) that sum to 1 over the C classes.  
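
A sketch of the softmax computation, assuming `Z` has shape `(C, m)` with one example per column. Subtracting the column maximum before exponentiating is an extra step I added for numerical stability; it does not change the result because the same factor cancels in the numerator and the denominator.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (C, m)."""
    Z_shifted = Z - np.max(Z, axis=0, keepdims=True)  # assumption: stability trick
    T = np.exp(Z_shifted)                             # t = exp(z)
    return T / np.sum(T, axis=0, keepdims=True)       # normalize each column

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # C = 4 classes, one example
A = softmax(Z)
print(A.ravel(), A.sum())
```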
  
Now we need a new loss function, which is similar to that in binary classification (here the subscript $i$ runs over the classes, not over the examples):  
$$ \mathcal{L} = - \sum\limits_{i = 1}^{C} y_i \log\left(a^{[L]}_i\right) $$
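
A sketch of this loss for one-hot labels, taking it as the negative log-probability assigned to the true class (the usual cross-entropy convention). The function name, the `(C, m)` layout and the `eps` clipping are assumptions for illustration.

```python
import numpy as np

def softmax_loss(A, Y_onehot, eps=1e-12):
    """Per-example cross-entropy; A and Y_onehot have shape (C, m)."""
    A = np.clip(A, eps, 1.0)                      # assumption: guard against log(0)
    return -np.sum(Y_onehot * np.log(A), axis=0)  # one loss value per column

A = np.array([[0.7], [0.2], [0.1]])   # predicted probabilities for C = 3 classes
Y = np.array([[1], [0], [0]])         # true class is the first one
loss = softmax_loss(A, Y)
print(loss)   # equals -log(0.7) for this example
```

Since `Y` is one-hot, only the term for the true class survives the sum, so minimizing the loss pushes the probability of the true class towards 1.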