# Shallow Network

## Neural Network Overview
For one example $x^{(i)}$:
$$z^{(i)} = w^T x^{(i)} + b \tag{1}$$
$$\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})\tag{2}$$ 
$$ \mathcal{L}(a^{(i)}, y^{(i)}) =  - y^{(i)}  \log(a^{(i)}) - (1-y^{(i)} )  \log(1-a^{(i)})\tag{3}$$

The cost is then computed by averaging the loss over all $m$ training examples:
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})\tag{6}$$
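
The equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `logistic_cost`, the toy data, and the zero-valued parameters are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(w, b, X, Y):
    """Average cross-entropy loss over all examples.

    X has shape (n_x, m) with one example per column; Y has shape (1, m).
    """
    m = X.shape[1]
    Z = w.T @ X + b                                     # equation (1), all examples at once
    A = sigmoid(Z)                                      # equation (2)
    losses = -Y * np.log(A) - (1 - Y) * np.log(1 - A)   # equation (3)
    return float(np.sum(losses) / m)                    # cost J

# Toy data (made up for illustration): 2 features, 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0

cost = logistic_cost(w, b, X, Y)
print(cost)
```

With all-zero parameters every prediction is 0.5, so the cost equals $\log 2$ regardless of the labels, which makes a handy sanity check.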

- When reading these equations, keep in mind that a neuron represents two calculations, performed for every training example.
    - First: We compute the linear equation $z$ (using the X's, the weights, and the bias)
    - Then: We pass $z$ through the activation function (here, the sigmoid) to get the activation $a$, which is our prediction
    
<img src="./images/deep_45.png" alt="Drawing" style="width: 500px;"/>
- Notice at the bottom of the image that we compute equations with the same structure, but organized into "layers"

## Neural Network Representation
- Neuron Key
    - $a^{[0]} = x$
        - Meaning: The activations of the input layer are simply the inputs we pass in (the features of an observation)
    - $a^{[1]}$ = First Hidden Layer, other values: $w^{[1]}, b^{[1]}$
        - Meaning: The hidden layer's activations form a 4 by 1 vector (one value per hidden node)
    - $a^{[2]} = \hat{y}$
        - Meaning: The output (*in a 2 layer neural network*)

## Computing a Neural Network's Output
<img src="./images/deep_46.png" alt="Drawing" style="width: 300px;"/>

- Notice in the image that a neuron is a two-step process
    - First we compute the linear function $z$, then we apply the activation (sigmoid) to it

**If we look at the neuron key, the bracketed superscript represents the layer and the subscript is the node in the layer**

First Layer:
- <img src="./images/deep_47.png" alt="Drawing" style="width: 300px;"/>
- <img src="./images/deep_48.png" alt="Drawing" style="width: 250px;"/>

    - For each node, we compute the 2-step process
- The equations could be computed with a for-loop over the nodes, but that is inefficient. A better way is vectorization: each node's weight vector $w$ is a column vector by default, so we transpose it into a row ($w^T$) and use the rows to build a matrix.
- **The 4 represents the nodes in our first hidden layer**
- **The 3 represents the input features (the x variables) fed into our first hidden layer** (not written out in the equation above; they are contained in the x variable)
- We then stack the rows on top of one another to get a 4 by 3 matrix (we have 4 nodes with 3 inputs each, so every node needs 3 weights, one per input feature).
- By nodes, we mean the different 'circles' in our neural network. If we have 4 nodes, the network can learn 4 different relationships in the data, one per node.
    - Remember that the X's are not individual people; they are features that take a value for every observation or person. For example, they could be height, weight, and age, with values for each user.
- The following images show how we can use vectorization **FOR ONE EXAMPLE**
    - We compute the linear part
        - <img src="./images/deep_49.png" alt="Drawing" style="width: 400px;"/>
    - And then pass it through the sigmoid function
        - <img src="./images/deep_50.png" alt="Drawing" style="width: 250px;"/>
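
The loop-versus-vectorization point can be sketched in NumPy for a single example. This is only an illustration under assumed values: the 4 nodes and 3 features come from the notes, but the random numbers and the names `W1`, `b1`, `a_loop`, and `a1` are placeholders chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))     # one example with 3 features
W1 = rng.standard_normal((4, 3))    # row j holds the transposed weights of node j
b1 = rng.standard_normal((4, 1))    # one bias per node

# Loop version: run the two-step process one node at a time.
a_loop = np.zeros((4, 1))
for j in range(4):
    z_j = W1[j, :] @ x[:, 0] + b1[j, 0]   # step 1: linear part for node j
    a_loop[j, 0] = sigmoid(z_j)           # step 2: activation for node j

# Vectorized version: all 4 nodes in one matrix product.
z1 = W1 @ x + b1     # shape (4, 1)
a1 = sigmoid(z1)     # shape (4, 1)

print(np.allclose(a_loop, a1))
```

Both versions compute the same 4 by 1 activation vector; the vectorized form simply replaces the loop over nodes with one matrix-vector product.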

**Do not get confused by $a^{[0]} = x$**
<img src="./images/deep_51.png" alt="Drawing" style="width: 200px;"/>


## Vectorizing across multiple examples
Key idea: Given a set of inputs x, we can use a neural network to compute $\hat{y}$

**Remember that $x^{(1)}$ represents the first observation, $x^{(2)}$ the second, and so on up to $x^{(m)}$**

<img src="./images/deep_52.png" alt="Drawing" style="width: 300px;"/>

Moving from lower-case (one observation) to upper-case (all observations) notation: we stack the observation vectors side by side, as the columns of one matrix.
- <img src="./images/deep_53.png" alt="Drawing" style="width: 200px;"/>
- <img src="./images/deep_54.png" alt="Drawing" style="width: 200px;"/>

**Remember that the X's are stacked horizontally, side by side, so that the whole computation can be vectorized as a single product with W, the weights**
- Horizontally we move across different training examples, while vertically we move across different neuron units
- <img src="./images/deep_55.png" alt="Drawing" style="width: 400px;"/>
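
As a rough sketch of the stacking described above, the snippet below compares a loop over examples with a single matrix product. The sizes (3 features, 4 hidden units, 5 examples) and the random values are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m = 5                               # number of training examples (assumed)
X = rng.standard_normal((3, m))     # column i is the example x^(i)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal((4, 1))

# Loop version: process one example (one column) at a time.
A1_loop = np.zeros((4, m))
for i in range(m):
    x_i = X[:, [i]]                           # keep the (3, 1) column shape
    A1_loop[:, [i]] = sigmoid(W1 @ x_i + b1)

# Vectorized version: b1 is broadcast across all m columns.
Z1 = W1 @ X + b1    # shape (4, m)
A1 = sigmoid(Z1)    # rows are hidden units, columns are examples

print(A1.shape, np.allclose(A1_loop, A1))
```

Reading `A1` horizontally moves across training examples and reading it vertically moves across hidden units, which matches the layout in the image.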


## Explanation for Vectorized Implementation

**One thing to keep in mind is that the vectorized implementation performs the same computation for every training example. Meaning, the same weights are applied to every training example, one column at a time**

<img src="./images/deep_56.png" alt="Drawing" style="width: 500px;"/>
- We can see that vectorization works. Note: $z^{[1]}$ refers to the first layer of the neural network.


## Activation functions
There are a couple of functions that usually work better as an activation function than the logistic function

**Tanh function**
- One example is the tanh function: it has the same S-shape as the logistic function, but it is shifted and rescaled so that it passes through the origin and ranges from -1 to 1 (instead of 0 to 1).
- <img src="./images/deep_57.png" alt="Drawing" style="width: 400px;"/>

The sigmoid function is still used in the output layer when you are doing binary classification (the prediction must lie between 0 and 1 so it can be compared with the true 0/1 label).

**ReLu function**
- The beauty of the ReLu function is that its derivative is 1 when the input is positive and 0 when the input is negative
- Reminder: max(0, x) returns the larger of 0 and x, so the output is 0 for negative x and x itself for positive x
- The ReLu function is the most widely used; it is used even more than the tanh function
- <img src="./images/deep_58.png" alt="Drawing" style="width: 400px;"/>

**Leaky ReLu function**
- It is similar to the ReLu function. The difference is that instead of being 0 when the input is negative, the Leaky ReLu function has a slight slope there, so its output is slightly negative.
- <img src="./images/deep_59.png" alt="Drawing" style="width: 400px;"/>

- <img src="./images/deep_60.png" alt="Drawing" style="width: 400px;"/>
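
The four activation functions above can be written in a few lines of NumPy. This is a minimal sketch; in particular, the leaky slope of 0.01 is an assumed, commonly seen value and is not given in these notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: output in (-1, 1), passes through the origin."""
    return np.tanh(z)

def relu(z):
    """max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLu, but negative inputs are scaled by a small slope (assumed 0.01)."""
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, 0.0, 3.0])
for g in (sigmoid, tanh, relu, leaky_relu):
    print(g.__name__, g(z))
```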

## Why do you need non-linear activation functions?

Do we need an activation function? Why can't we just use z directly, without applying the g function?
- Doing so means we are using the linear activation function, also called the identity activation function

If we decide not to use a non-linear activation function, look at the image below: we gain nothing from the extra layer, since we are just composing one linear function on top of another, which is still a linear function.
- <img src="./images/deep_61.png" alt="Drawing" style="width: 300px;"/>
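
The collapse of stacked linear layers can be checked numerically. The sketch below uses made-up layer sizes and random values, with hypothetical names `W_eq` and `b_eq` for the equivalent single layer.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 6))     # 3 features, 6 examples (assumed sizes)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with the identity activation g(z) = z.
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The same mapping written as one linear layer:
# W2 (W1 X + b1) + b2 = (W2 W1) X + (W2 b1 + b2)
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
A_single = W_eq @ X + b_eq

print(np.allclose(A2, A_single))
```

Since the two-layer output matches a single linear layer exactly, the hidden layer adds no expressive power unless its activation is non-linear.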

The only place a linear activation function makes sense is when the output is a real number, as in regression. For example, if we are trying to predict home prices. Even there, we should only use the linear activation function for the output layer!

## Derivatives of activation functions

**Logistic Function**
<img src="./images/deep_62.png" alt="Drawing" style="width: 500px;"/>
<img src="./images/deep_63.png" alt="Drawing" style="width: 400px;"/>

**Tanh Function**
<img src="./images/deep_64.png" alt="Drawing" style="width: 500px;"/>
<img src="./images/deep_65.png" alt="Drawing" style="width: 400px;"/>

**ReLu/Leaky ReLu Function**

<img src="./images/deep_66.png" alt="Drawing" style="width: 500px;"/>
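
For reference alongside the images, here is a sketch of the standard derivative formulas in NumPy. The function names are hypothetical, the leaky slope of 0.01 is an assumed value, and returning 0 for the ReLu derivative at exactly z = 0 is a convention (the derivative is undefined there).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """g'(z) = a (1 - a), where a = sigmoid(z)."""
    a = sigmoid(z)
    return a * (1 - a)

def tanh_prime(z):
    """g'(z) = 1 - tanh(z)^2."""
    return 1 - np.tanh(z) ** 2

def relu_prime(z):
    """1 for positive inputs, 0 otherwise (0 at z = 0 by convention)."""
    return (z > 0).astype(float)

def leaky_relu_prime(z, slope=0.01):
    """1 for positive inputs, the small slope otherwise (slope is assumed)."""
    return np.where(z > 0, 1.0, slope)

z = np.array([-1.5, 0.5, 2.0])
h = 1e-6
# Compare the sigmoid formula with a central finite difference.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(np.allclose(numeric, sigmoid_prime(z)))
```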


## Gradient descent for Neural Networks
Formulas:

Forward Propagation:
- $Z^{[1]} = W^{[1]}X + b^{[1]}$
- $A^{[1]} = g^{[1]}(Z^{[1]})$
- $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
- $A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})$

Backpropagation (here $*$ is the element-wise product):
- $dZ^{[2]} = A^{[2]} - Y$
- $dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]T}$
- $db^{[2]} = \frac{1}{m} np.sum(dZ^{[2]}, axis=1, keepdims=True)$
- $dZ^{[1]} = W^{[2]T}dZ^{[2]} * g^{[1]'}(Z^{[1]})$
- $dW^{[1]} = \frac{1}{m} dZ^{[1]}X^{T}$
- $db^{[1]} = \frac{1}{m} np.sum(dZ^{[1]}, axis=1, keepdims=True)$
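
The forward and backward formulas can be combined into a small gradient descent loop. The sketch below is one possible arrangement, not the course's reference code: it assumes tanh for the hidden activation $g^{[1]}$, and the function names, learning rate, layer sizes, and random toy data are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)            # g^[1] = tanh (an assumption for this sketch)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)            # g^[2] = sigmoid
    return Z1, A1, Z2, A2

def cost(params, X, Y):
    A2 = forward(params, X)[3]
    m = X.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(params, X, Y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    Z1, A1, Z2, A2 = forward(params, X)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)      # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2               # same order as params

def gradient_step(params, X, Y, learning_rate=0.1):
    grads = backward(params, X, Y)
    return tuple(p - learning_rate * g for p, g in zip(params, grads))

# Toy problem (made up): 3 features, 4 hidden units, 8 examples.
rng = np.random.default_rng(3)
X = rng.standard_normal((3, 8))
Y = (rng.random((1, 8)) > 0.5).astype(float)
params = (rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1)),
          rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1)))

cost_before = cost(params, X, Y)
for _ in range(200):
    params = gradient_step(params, X, Y)
cost_after = cost(params, X, Y)
print(cost_before, cost_after)
```

Returning the gradients in the same order as the parameters keeps the update step to a single line; a dictionary keyed by name would work just as well.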

## Backpropagation intuition (optional)

Summary of the gradients for a single training example
- $dz^{[2]} = a^{[2]} - y$
- $dW^{[2]} = dz^{[2]}a^{[1]T}$
- $db^{[2]} = dz^{[2]}$
- $dz^{[1]} = W^{[2]T}dz^{[2]} * g^{[1]'}(z^{[1]})$
- $dW^{[1]} = dz^{[1]}x^{T}$
- $db^{[1]} = dz^{[1]}$

## Random Initialization
If we initialize all the weights to 0, every hidden unit computes exactly the same equation, and keeps doing so after every update, so the extra units add nothing!

Also, we cannot initialize the weights to values that are too large. If the weights are too large, z becomes large in magnitude, and with the sigmoid function the activation can get stuck at the very top or very bottom of the curve. There the slope is close to 0, which slows gradient descent and makes the network more difficult to optimize!

A good choice is to multiply the random initial weights by a small constant such as 0.01!
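
A possible initialization routine, sketched under assumptions: the function name `initialize_parameters`, the seed, and the layer sizes are hypothetical, and the biases are set to zero, which is fine because the random weights already break the symmetry. The second half illustrates the all-zeros problem with tanh hidden units.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights and zero biases for a 2-layer network."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * scale
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((n_y, n_h)) * scale
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize_parameters(n_x=3, n_h=4, n_y=1)

# Symmetry illustration on made-up inputs: with zero weights every hidden
# unit produces the same row of activations; with random weights they differ.
X = np.random.default_rng(1).standard_normal((3, 5))
A1_zero = np.tanh(np.zeros((4, 3)) @ X + np.zeros((4, 1)))
A1_rand = np.tanh(W1 @ X + b1)

print(np.allclose(A1_zero, A1_zero[0]))   # all rows identical
print(np.allclose(A1_rand, A1_rand[0]))   # rows differ
```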

## Notes
- Remember that in Andrew Ng's notation, the weights multiply the input from the left: W first and then X (each row of W takes an inner product with the input)

**REMINDERS**
- When we see brackets... think of the layer.
- When we see parentheses... think of the training example.
- When we see a subscript... think of the neuron.
