We talked about the basics of Artificial Neural Networks (ANNs), gave an intuitive summary, and provided an example in the slides. In this notebook, we will continue to explore more of the ANN and implement a simple version of it.  

After reading this notebook, you will understand some key concepts, including: *Inputs, Bias, Weights, Outputs, Targets, Activation Functions, Error, Learning rate.*

ANNs are inspired by the way neurons in the brain are thought to process inputs. Imagine you see an apple and think

<center><h3><i>"Oh wow, that's an apple!"</i></h3></center> 

From a drastically oversimplified data perspective, we might imagine a neuron:

    1. receives input data - takes the image from your retina as data input to its dendrites
    2. processes the data - weighs each input (dendrite 1 and 4 get apple vibes), then combines information 
    3. outputs an answer - fires (yes, an apple!) or doesn't fire (no, not an apple...) as a signal down its axon

<img src="https://images.deepai.org/glossary-terms/cf6448554ec74f7fb58cf83192b6e904/synapse.jpeg" width="800" />
    
We will see that in Artificial Neural Networks, the only thing a neuron can change as it learns to map inputs to the correct output is how it weighs each input. From the ANN's perspective, the only thing it is allowed to do as we train it is tinker with the weights. 
    
Now that we have a connection to Biological Neural Networks, let's explore the simplest type of Artificial Neural Network, the [Perceptron](https://en.wikipedia.org/wiki/Perceptron). Developed in the late 1950s, this was one of the first artificial neural networks to be produced. You can think of a perceptron as a two-layer neural network (input and output) without any hidden layers. Even though it has limitations, it contains the essential components found in ANNs. 

Let's explore how a perceptron works using a simple dataset with 5 observations, where each observation has 3 features. The target is what we are ultimately trying to predict from the features. Our target is either 0 or 1, representing the two classes each sample can belong to (class 0 and class 1). Our goal is to build and train a simple Perceptron model that can **output the correct target** by feeding it our 3 **features as inputs**. See the following table. This example is modified from this [great post](http://iamtrask.github.io/2015/07/12/basic-python-network/). 


|Observation|Feature1|Feature2|Feature3|Target|
|:---------:|:------:|:------:|:------:|:----:|
|     1     |    0   |    0   |    1   |   0  |
|     2     |    1   |    1   |    1   |   1  |
|     3     |    1   |    0   |    1   |   1  |
|     4     |    0   |    1   |    1   |   0  |
|     5     |    0   |    1   |    0   |   1  |

Let's have a look at the structure of our model.

<img src="https://raw.githubusercontent.com/qingkaikong/blog/master/39_ANN_part2_step_by_step/figures/figure1_perceptron_structure.jpg" width="600"/>  

Let's go through each component of the figure above. 

### Inputs (feature 1, feature 2, feature 3)
The inputs to the model are the features for a given observation. For example, in Observation 1, our input would be:

    `Feature1 = 0`
    `Feature2 = 0`
    `Feature3 = 1`

### Weights ($\omega_1, \omega_2, \omega_3$)
Each input is connected to the output layer by a weight, $\omega_i$, that it is multiplied by. In this way, each weight determines:
    1. the magnitude or importance of the input feature 
    2. the direction in which it affects the output (sign) 

It is important to note that in our perceptron, the only thing it is allowed to do as it learns to map inputs to outputs is tinker with these weights! 
    
### Bias ($1*\omega_0$)
The extra 1 we added to the input layer is the bias. It is connected to the output layer by $\omega_0$. 
    
**Why do we add a bias term?** -  Consider the case where all our features are 0: no matter what weights we have, the weighted sum will always be 0. Adding a bias avoids this problem. The bias term functions just like a y-intercept in linear regression, allowing us to shift our function away from being rooted at 0.
    
**Why do we add a 1 to the input?** - Anything multiplied by one is itself, so this may seem pointless since we could simplify this to $\omega_0$. The 1 in the input acts as a placeholder. This makes our matrix/vector operations work well (size of inputs = size of weights and bias).
    
### Summation ($\sum$)
The Sigma, $\sum$, refers to a summation. By taking the weighted sum of the inputs and bias, we can calculate $z$ as: 

$$z = 1*\omega_0 + feature1*\omega_1 + feature2*\omega_2 + feature3*\omega_3$$

This is analogous to linear regression, where we sum a y-intercept (bias) and the variables weighted by their coefficients.
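
As a quick sketch of the weighted sum, the snippet below computes $z$ for Observation 1 in two ways: term by term, and as a single dot product with the placeholder 1 in front of the features. The weight values are made up purely for illustration; they are not the weights the perceptron ends up learning.

```python
import numpy as np

# Observation 1 from the table, with the placeholder 1 for the bias in front
x = np.array([1, 0, 0, 1])            # [1, feature1, feature2, feature3]

# hypothetical weights [w0, w1, w2, w3], chosen only for illustration
w = np.array([0.5, -1.0, 2.0, 0.25])

# the weighted sum written out term by term, as in the equation above
z_manual = 1 * w[0] + 0 * w[1] + 0 * w[2] + 1 * w[3]

# the same weighted sum as a single dot product
z_dot = np.dot(x, w)

print(z_manual, z_dot)
```

Because the input vector and the weight vector have the same length, the bias needs no special handling in the dot product.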

### Activation function ($f$)   
The output layer only has one node that takes all the input and determines whether the output should fire or not (this is ANN terminology related to the brain and the firing of neurons). Usually we use a 1 to indicate a neuron firing and a 0 to indicate a neuron not firing. 

Since our summation $z$ can be anything, how can we scale it to a value between 0 and 1? 

We can use the sigmoid activation function! It has some great advantages including:
1. The output is scaled between 0 and 1.
    - If the z value is a large positive number, say 10, the output will be close to 1 while for very negative numbers, say -10, the output will be close to 0. 
2. The output can be interpreted as a probability.
    - If we have z relatively close to 0, then the scaled value will be around 0.5, which indicates the network is not very confident in its estimation. 
3. We can take the derivative of the output.
    -  The derivative can be used to update the weights while training our model.
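
The three properties listed above can be checked numerically. This is a minimal sketch that evaluates the sigmoid and its slope at a very negative, a zero, and a very positive value of $z$; the helper name `sigmoid_slope` is chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number z into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid with respect to z."""
    s = sigmoid(z)
    return s * (1 - s)

for z in (-10, 0, 10):
    # output is near 0, exactly 0.5, and near 1 respectively
    print(z, sigmoid(z), sigmoid_slope(z))
```

Note that the slope is largest at $z = 0$ and nearly 0 at both extremes.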

<img src="https://miro.medium.com/max/1868/1*9-FQ4ZfGyoYT36BTDh0hZg.png" width="600"/> 
    
### Output ($y$)
After passing our summation term, $z$, through the sigmoid function, we get our output, which we now know is a number between 0 and 1. 

### Targets
Targets are the correct output for each observation. This is separate from our output, which is an estimation from the model.
    
### How to learn from the error  

Learning will be achieved by:
1. **Estimating our error** - how far off is our estimation from the target?
2. **Updating the weights** so that they **reduce the error**.

To understand precisely the justification for how we update the weights, see the optional advanced topics section below.

### Learning rate   

*Why do we need a learning rate?* - The weights change a lot when the error is high, and we may overshoot our weight updates leading to another extreme, making the network unstable so that it never settles down. 

For this reason, when updating our weights we use a learning rate to control how fast the network learns. The learning rate is a hyperparameter we set between 0 and 1 that is multiplied by our weight updates to reduce how much we change the weights (if set to 1, it leaves the weight update unchanged). 

There is a trade off in setting the learning rate too high or too low. A high learning rate makes training our ANN fast, but can cause it to jump past the optimal weights. On the other hand a small learning rate can lead to more stable weights but make training take longer or potentially get stuck in local minima. 
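
To see this trade-off in isolation, here is a small sketch that is separate from the perceptron: it runs a few steps of gradient descent on a toy bowl-shaped cost, $C(w) = 3w^2$, whose minimum is at $w = 0$. The toy cost, the function name `descend`, and the three learning rates are assumptions chosen for illustration.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on the toy cost C(w) = 3 * w**2 (minimum at w = 0)."""
    for _ in range(steps):
        gradient = 6 * w                  # derivative of 3 * w**2
        w = w - learning_rate * gradient  # step against the slope
    return w

for lr in (0.01, 0.1, 0.5):
    # small rate: slow progress, medium rate: near 0, large rate: blows up
    print(lr, descend(lr))
```

With the smallest rate the weight is still far from 0 after 20 steps, the middle rate lands very close to 0, and the largest rate overshoots further on every step and never settles.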

<img src="https://sebastianraschka.com/images/blog/2015/singlelayer_neural_networks_files/perceptron_learning_rate.png" width="600" />

The steps we will take to train this model are:
1. Initialize the weights to small random numbers with both positive and negative values.
2. For many iterations:  
    * Calculate the output value based on all observations.<br />
    * Update the weights based on the error.

Let's implement what we discussed above, and then explain it line by line below!

In [None]:
import numpy as np

# the sigmoid activation function
def sigmoid(z):
    """
    Takes a number z (the weighted sum of inputs)
    Returns the output of the sigmoid function evaluated at z.
    """
    
    return 1/(1+np.exp(-z))

# the slope (derivative) of the sigmoid function
def sigmoid_derivative(z):
    """
    Takes an input number z (the weighted sum of inputs)
    Returns the derivative of the sigmoid with respect to z.
    """
    sig = sigmoid(z)
    return sig*(1-sig)
    

# define learning rate
learning_rate = 0.4

# input dataset with each observation as an array
X = np.array([[0,0,1],
              [1,1,1],
              [1,0,1],
              [0,1,1],
              [0,1,0]])
              
# we add a bias column of 1s to the input
X = np.concatenate((np.ones((len(X), 1)), X), axis = 1)

# true target for each data sample     
y = np.array([[0,1,1,0,1]]).T

# set seed for reproducibility
np.random.seed(1)

# initialize weights randomly between [-1, 1] with a mean 0
weights_0 = (2 * np.random.random((4,1))) - 1

print("Weights Before training: ")
print(weights_0.round(2))
print("Outputs Before Training:")
print(sigmoid(np.dot(X,weights_0)).round(2))
print('Targets')
print(y)

# store errors at each iteration to track learning              
errors=[]

# train the network with 50,000 iterations
for iter in range(50000):

    # forward propagation
    layer_0 = X
              
    # pass the weighted sum of the inputs through the sigmoid activation function
    layer_1_output = sigmoid(np.dot(layer_0,weights_0))

    # calculate the error (target value - predicted value)
    layer1_error = y - layer_1_output
              
    # store error of current iteration
    errors.append(np.sum(layer1_error**2))
              
    # compute the change (delta) for each sample by multiplying 
    # 1. the learning rate 
    # 2. error
    # 3. slope (derivative of the sigmoid with respect to the weighted sum of inputs, z)
    layer1_delta = learning_rate * layer1_error * sigmoid_derivative(np.dot(layer_0,weights_0))
    
    # compute the change (delta) for each weight
    # we multiply by our input (transpose for the correct shape), x
    layer1_delta = np.dot(layer_0.T,layer1_delta)
    
    # lastly we update weights by simply adding the delta 
    weights_0 += layer1_delta
print("-------------------------")    
print("Weights After Training:")
print(weights_0.round(2))
print("Outputs After Training:")
print(layer_1_output.round(2))
print('Targets')
print(y)



In [None]:
### Visualizing the errors over time
import matplotlib.pyplot as plt

def plot_errors(errors, window = 500):
    """
    Plots the total error over time (each iteration of weight updates).
    Takes two arguments
    1. Error - A list where each element is the total error over each iteration of training
    2. Window - An integer representing what iteration to visualize the error up to
        default: window = 500
    """
    
    fig = plt.figure(figsize=(8, 8))
    plt.title('Perceptron Total Error Over Each Iteration of Learning')
    plt.plot(errors[0:window])
    plt.ylabel('Total Error')
    plt.xlabel('Training iteration')
    plt.show()
    
plot_errors(errors)

## Explain line by line  

**Line 1:** This line imports NumPy, a numerical computing library that provides the arrays and linear algebra routines we use.   

```python
import numpy as np
```

**Line 3:** These lines define the activation function, which is a function that converts our weighted sum into a probability between 0 and 1 as we discussed above. We also define the derivative (or slope) of the sigmoid with respect to this weighted sum (z).   

```python
# the sigmoid activation function
def sigmoid(z):
    """
    Takes a number z (the weighted sum of inputs)
    Returns the output of the sigmoid function evaluated at z.
    """
    
    return 1/(1+np.exp(-z))

# the slope (derivative) of the sigmoid function
def sigmoid_derivative(z):
    """
    Takes an input number z (the weighted sum of inputs)
    Returns the derivative of the sigmoid with respect to z.
    """
    sig = sigmoid(z)
    return sig*(1-sig)
```

**Line 22:** Here we define our learning rate, which controls how fast the network learns from the data.

```python
# define the learning rate
learning_rate = 0.4
```

**Line 25:** This initializes the input dataset as a numpy matrix. Each row is a sample, and each column corresponds to a feature. We then add the bias term, which is a column of ones. The final matrix has 5 rows and 4 columns.  

```python
# input dataset with each observation as an array
X = np.array([[0,0,1],
              [1,1,1],
              [1,0,1],
              [0,1,1],
              [0,1,0]])
# we add a bias column of 1s to the input
X = np.concatenate((np.ones((len(X), 1)), X), axis = 1)
```

**Line 35:** This defines our target, the true value of each data sample. ".T" is the transpose function, which converts our output data from a row to a column vector. 

```python
# true target for each data sample          
y = np.array([[0,1,1,0,1]]).T
```
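
As a small sanity check on the shapes described above, the sketch below rebuilds the inputs and targets and prints their dimensions. The names `X_with_bias`, `y_row`, and `y_col` are used here only for illustration.

```python
import numpy as np

X = np.array([[0, 0, 1],
              [1, 1, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 0]])

# prepend a column of 1s for the bias
X_with_bias = np.concatenate((np.ones((len(X), 1)), X), axis=1)

# targets as a row vector, then transposed into a column vector
y_row = np.array([[0, 1, 1, 0, 1]])
y_col = y_row.T

print(X.shape, X_with_bias.shape)   # (5, 3) (5, 4)
print(y_row.shape, y_col.shape)     # (1, 5) (5, 1)
```

The column shape matters because it lines the targets up row by row with the 5 outputs the model produces.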

**Line 38:** Before we generate the random weights, we set a seed to ensure we can replicate our experiment by drawing the same random numbers. This is very useful when we test the algorithm, and compare our results with others. 

```python
# set seed for reproducibility
np.random.seed(1)
```

**Line 41:** This initializes the weights that we will multiply by our inputs. Since we have 4 inputs (3 features plus the bias term), we initialize the random weights as a column vector with 4 values - dimensions (4,1). Also note that the random numbers are drawn uniformly between -1 and 1, so they have an expected mean of 0.

```python
# initialize weights randomly between [-1, 1] with a mean 0
weights_0 = 2*np.random.random((4,1)) - 1
```
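
The rescaling trick in this line can be sketched in two steps: `np.random.random` draws uniform samples from $[0, 1)$, and multiplying by 2 and subtracting 1 stretches and shifts them to $[-1, 1)$. The intermediate name `raw` is introduced here only for illustration.

```python
import numpy as np

np.random.seed(1)

raw = np.random.random((4, 1))   # uniform samples in [0, 1)
weights = 2 * raw - 1            # stretched and shifted to [-1, 1)

print(raw.round(2).ravel())
print(weights.round(2).ravel())
```

With only 4 samples the actual average will usually not be exactly 0; the mean of 0 holds in expectation.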

**Line 44:** These lines print our weights, outputs and targets before training the ANN.

```python
print("Weights Before training: ")
print(weights_0.round(2))
print("Outputs Before Training:")
print(sigmoid(np.dot(X,weights_0)).round(2))
print('Targets')
print(y)
```

**Line 51:** Initialize an empty list to keep track of our error through each iteration of learning.

```python
# store errors at each iteration to track learning              
errors=[]
```

**Line 54:** We train our model with 50,000 iterations. In each iteration, we will update our weights, and track our total error.  

```python
# train the network with 50,000 iterations
for iter in range(50000):
```

**Line 57:** We explicitly assign our input data, X, as layer_0. We're going to process all the data at the same time in this implementation also known as 'batch processing'. 

```python
    # forward propagation
    layer_0 = X
```

**Line 60:** This is forward propagation. It has two steps

1. Take the dot product of the raw inputs and the weights using the np.dot function:

$$z = \text{np.dot(layer\_0, weights\_0)} = 1*\omega_0 + feature1*\omega_1 + feature2*\omega_2 + feature3*\omega_3$$

2. Pass the weighted sum, $z$, to the sigmoid function to get an output between 0 and 1.

$$output = sigmoid(z) = \frac{1}{1+e^{-z}}$$

```python
    # pass the weighted sum of the inputs through the sigmoid activation function
    layer_1_output = sigmoid(np.dot(layer_0,weights_0))
```
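
To make the shapes in forward propagation explicit, here is a minimal sketch that runs the two steps on all 5 observations at once. It uses hypothetical all-zero weights (not the random weights from the notebook) so the result is easy to predict: every weighted sum is 0, so every output is 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# inputs with the bias column of 1s already added, shape (5, 4)
X = np.array([[1, 0, 0, 1],
              [1, 1, 1, 1],
              [1, 1, 0, 1],
              [1, 0, 1, 1],
              [1, 0, 1, 0]], dtype=float)

weights = np.zeros((4, 1))   # hypothetical weights, for illustration only

z = np.dot(X, weights)       # one weighted sum per observation, shape (5, 1)
output = sigmoid(z)          # one output per observation, shape (5, 1)

print(z.shape, output.ravel())
```

A (5, 4) matrix times a (4, 1) vector gives a (5, 1) result, which is why one `np.dot` call handles the whole batch.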

**Line 63:** This is the error for our current weights. It is simply the target (y) minus our estimation (layer_1_output). The error of each observation is stored here as a 5 by 1 matrix.  

```python
    # calculate the error (target value - predicted value)
    layer1_error = y - layer_1_output
```
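
As a small worked example of the error, and of the total squared error that the full code appends to `errors` on each iteration, suppose the model outputs 0.5 for every observation. These outputs are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

y = np.array([[0, 1, 1, 0, 1]]).T    # targets, shape (5, 1)
output = np.full((5, 1), 0.5)        # hypothetical model outputs

error = y - output                   # positive where the output is too low
total_squared_error = np.sum(error ** 2)

print(error.ravel())                 # [-0.5  0.5  0.5 -0.5  0.5]
print(total_squared_error)           # 1.25
```

The sign of each error tells the update which direction to push that observation's output.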

**Line 69:** This is the most important part - learning. We calculate how much we will change each of the weights in the current iteration for a given datapoint. It has 3 parts that are multiplied together, (1) the learning rate, (2) the current error, and (3) the slope (derivative) of the sigmoid function.

```python
    # compute the change (delta) for each sample by multiplying 
    # 1. the learning rate 
    # 2. error
    # 3. slope (derivative of the sigmoid with respect to the weighted sum of inputs, z)
    layer1_delta = learning_rate * layer1_error * sigmoid_derivative(np.dot(layer_0,weights_0))
```

1. The first term is the learning rate, which controls how fast we will learn. It is just a constant value between 0 and 1, and in this case we chose 0.4. 

2. The second term is the error for each of our observations. The higher the error, the more we update our weights.

3. The final term is the derivative of the sigmoid function evaluated at the weighted sum $z$. It is largest when $z$ is near 0 (an unsure output) and close to 0 when the output is already near 0 or 1. 

**Line 75:** Lastly, we need to multiply our weight changes by the input.   

```python
    # compute the change (delta) for each weight
    # we multiply by our input (transpose for the correct shape), x
    layer1_delta = np.dot(layer_0.T,layer1_delta)
```
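
One way to read this dot product: for each weight, it multiplies every observation's delta by that observation's input and sums the results over all observations. The sketch below compares the single `np.dot` call with an explicit loop; the per-sample delta values are made up for illustration.

```python
import numpy as np

# inputs with the bias column, shape (5, 4)
X = np.array([[1, 0, 0, 1],
              [1, 1, 1, 1],
              [1, 1, 0, 1],
              [1, 0, 1, 1],
              [1, 0, 1, 0]], dtype=float)

# hypothetical per-sample deltas, shape (5, 1)
per_sample_delta = np.array([[0.1], [-0.2], [0.3], [0.0], [0.05]])

# vectorized: (4, 5) times (5, 1) gives one change per weight, shape (4, 1)
weight_delta = np.dot(X.T, per_sample_delta)

# the same thing with a loop: add up input * delta, one observation at a time
loop_delta = np.zeros((4, 1))
for x_row, d in zip(X, per_sample_delta):
    loop_delta += x_row.reshape(4, 1) * d

print(weight_delta.ravel())
print(loop_delta.ravel())
```

An input of 0 contributes nothing to its weight's update, so a weight only changes because of observations where its feature was active.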

**Line 79:** This line updates the current weights by simply adding the delta we just determined to the current weights.  

```python
    # update weights by simply adding the delta
    weights_0 += layer1_delta
```

**Line 81:** These lines print out the final updated weights and outputs after 50,000 iterations, along with the true targets, y. As you can see, we get great results!   

```python
print("-------------------------")    
print("Weights After Training:")
print(weights_0.round(2))
print("Outputs After Training:")
print(layer_1_output.round(2))
print('Targets')
print(y)
```

This concludes our coding section. Below are some more optional advanced topics in Neural Networks. 
***

## OPTIONAL: Neural Networks Advanced Topics

Think of a neural network as a bicycle with exchangeable parts for specific situations (thin tires for roads or thick tires for mountains). In our ANN we can swap out:
1. The Activation Functions
2. The Cost Functions
3. The Weight Updates
4. The Architecture

For our simple perceptron we built a neural network with:
1. The Activation Function - **sigmoid**
2. The Cost Function - **squared error, (1/2)x(target-prediction)²**
3. The Weight Updates - **(learning rate)x(error)x(slope)x(input)**
4. The Architecture - **2-layers, input and output**

### 1. Activation Functions

While we used the sigmoid activation function for this notebook, there are many different types of activation functions used in practice. These activation functions each have their own tradeoffs. Below are a collection of different activation functions, a brief description of what they aim to do, and their drawbacks:

1. **Sigmoid (Logistic)** - Squishes the output between 0 and 1, but neurons can saturate and stop learning (vanishing gradients) during backpropagation, since the derivative is near 0 when the input is very high or low. It also has a relatively high computational overhead to calculate $e^{-x}$.

2. **Tanh (Hyperbolic Tangent)** - Very similar to the sigmoid, but squishes the output between -1 and 1 and is 0-centered (allowing backpropagation to be more efficient). It still has the issues of saturating gradients and high computational overhead.

3. **ReLU (Rectified Linear Unit)** - Very computationally efficient and the gradient doesn't saturate at the positive end, but neurons can die when the input is $\le 0$ (the derivative (slope) is 0 when the input is $\le 0$). 

4. **Leaky ReLU (Leaky Rectified Linear Unit)** - Similar to ReLU, but neurons will not die when the input is $\le 0$, since the derivative there is not 0 but a small constant slope (for example 0.1, which makes the output $0.1*input$).
5. **Maxout** - A generalization of ReLU and Leaky ReLU that doesn't kill neurons, but requires double the number of parameters for each neuron. 
6. **ELU (Exponential Linear Unit)** - Similar to Leaky ReLU, but requires higher computational overhead to calculate $e^{x}$ for negative inputs. 
7. Many more!
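
Several of the activation functions above fit in one line of NumPy each. The sketch below uses common textbook definitions; the default `slope=0.1` for Leaky ReLU and `alpha=1.0` for ELU are assumptions chosen for illustration, and Maxout is left out because it needs extra weights per neuron.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    # 0 for negative inputs, unchanged for positive inputs
    return np.maximum(0, x)

def leaky_relu(x, slope=0.1):
    # small constant slope instead of a flat 0 for negative inputs
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # smooth exponential curve for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(x).round(3))
```

Comparing the values at $x = -2$ shows the main difference between the ReLU variants: ReLU returns 0, while Leaky ReLU and ELU keep a small negative signal.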

<img src="https://miro.medium.com/max/2384/0*sIJ-gbjlz0zrz8lb.png" width="600"/>


In our code above, we developed a perceptron that learns how to map inputs to targets by **reducing the errors** and **changing the weights**. 

If you understand that, fantastic! You have a good overview of how neural networks **learn**. But can we reach a bit deeper to understand *what* specific changes to weights will make our perceptron better? Below, we will see why our weight update rule

$$\text{weights}_{new}= \text{weights}_{old} + \text{learning rate}*\text{error}*\text{slope}*\text{input}$$

makes sense from a mathematical perspective.

***

#### <center> Math alert! - For completeness we will review the underlying math that governs how Artificial Neural Networks learn. </center>

### 2. Cost Function
In ANNs and machine learning in general, the objective of our algorithms is to minimize the value of a **Cost Function** (minimizing cost = much better predictions!). In our context, we want to *minimize the cost function (i.e. the prediction error) by only changing the weights of our ANN*.

Consider the cost function $C$, which takes two arguments, $y$ and $\hat{y}$, which represent the target (true answer) and the model's estimate (i.e. model prediction) respectively:

$$C(y,\hat{y}) = \frac{1}{2}(y-\hat{y})^{2}$$
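
As a quick illustration of this formula, the sketch below evaluates the cost for a good and a bad prediction of the same target. The function name `cost` and the example predictions are chosen here for illustration.

```python
import numpy as np

def cost(y, y_hat):
    """Squared-error cost for a target y and a prediction y_hat."""
    return 0.5 * (y - y_hat) ** 2

good = cost(1.0, 0.9)   # prediction close to the target
bad = cost(1.0, 0.1)    # prediction far from the target

print(good, bad)
```

Because the difference is squared, the cost is never negative and a prediction that is 9 times further off is penalized 81 times more.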


In order to minimize our cost function (get it as close to 0 as possible), we use a process called gradient descent.

**Gradient descent** is an iterative algorithm to find the parameters that minimize the value of a function. In our perceptron we do the following:
1. Calculate the derivative (slope) of our cost function given the current weights (parameters). 
2. Use this derivative (the collection of these values is called the gradient) to update the weights (parameters) towards the minimum value.

Think of our cost function as a bowl and gradient descent as a ball (value of our cost function) that we want to roll downhill to the bottom. 

<img src="https://miro.medium.com/max/736/1*e88JKNWAFok3vpjeuPfHig.gif" width="300"/>

Notice that when the slope is negative, it means we must move in the positive direction to get towards the minimum. By subtracting the gradient from our current weights we are able to:
1. Move in the right direction towards the minimum
2. Scale how big our jump is by how steep the slope is (the closer we are, the smaller our step)

$$\text{weights}_{new} = \text{weights}_{old} - \text{gradient}$$

***
#### <center> Calculus alert! - Updating weights in our ANN is simply an application of the chain rule! </center> 

### 3. Updating the Weights
We now have a **cost function** to **minimize with gradient descent** by **updating the weights**. 

We can denote how changes in a given weight affect our cost as:

$$\frac{\text{change in cost}}{\text{change in weight}}$$ 

In calculus we would state this as the derivative of the cost with respect to the weight:

$$\frac{\text{change in cost}}{\text{change in weight}} = \frac{\partial C}{\partial w}$$

There is a butterfly effect in which the weights we want to update are passed through several functions before we can see their effect on the cost function. Our cost function is dependent on our output, $\hat{y}$, which is dependent on our summation term $z$, which is dependent on our weights $w$ and inputs $x$. Below is a depiction of this.

$$C(y, \hat{y}) = \frac{1}{2}(y-\hat{y})^{2} \hspace{10pt}\rightarrow\hspace{10pt} \hat{y}(z)= \frac{1}{1+e^{-z}} \hspace{10pt}\rightarrow\hspace{10pt} z(x,w) = \sum{x_iw_i}$$

Luckily for us, the chain rule allows us to unwrap these nested functions to compute how the weights affect our cost!



Utilizing the chain rule, we can see that:
$$\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial w_i}$$

Skipping the full derivation, we can calculate each component as:
$$\begin{aligned}
\frac{\partial C}{\partial \hat{y}} &= -(y-\hat{y})\\
\frac{\partial \hat{y}}{\partial z} &= \hat{y}(1-\hat{y})\\
\frac{\partial z}{\partial w_i} &= x_i
\end{aligned}$$

Does this look familiar? Each component corresponds to what we previously calculated to update our weights!

#### <center> $\frac{\partial C}{\partial \hat{y}} = -(y- \hat{y})$ This is our **error** *-(target - prediction)*</center>
#### <center> $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$ This is our **slope** *the derivative of the sigmoid*</center>
#### <center> $\frac{\partial z}{\partial w_i} = x_i$ This is our **input** </center>

You may have noticed that the sign of our error is flipped. This is because in practice you subtract the gradient (derivative/slope) of the cost function with respect to each weight from the current weight. Subtracting a negative number is equivalent to adding! Throw in a learning rate, and you have exactly what we implemented in the code above!
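
One way to convince yourself that the chain rule result is right is a numerical gradient check: nudge each weight by a tiny amount, measure how the cost changes, and compare that with the product of the three components. This is a sketch for a single observation (Observation 3 with its bias input); the weights and the step size `eps` are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, x, y):
    """Squared-error cost of one observation for weights w."""
    y_hat = sigmoid(np.dot(x, w))
    return 0.5 * (y - y_hat) ** 2

x = np.array([1.0, 1.0, 0.0, 1.0])    # bias input plus Observation 3
y = 1.0                               # its target
w = np.array([0.1, -0.2, 0.3, 0.4])   # hypothetical weights

# analytic gradient from the chain rule: error term * slope * input
y_hat = sigmoid(np.dot(x, w))
analytic = -(y - y_hat) * y_hat * (1 - y_hat) * x

# numerical gradient: central difference, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (cost(w + step, x, y) - cost(w - step, x, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The two gradients should agree to many decimal places, and the entry for the feature whose input is 0 should be 0 in both.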

This is a lot to take in, but you just learned the underlying mathematical work-engine of neural networks: updating the weights through gradient descent, with the gradients computed via the chain rule (the idea behind backpropagation). Congratulations!

#### <center> Math alert end!</center>
***

### 4. The Architecture

We used the simplest neural network architecture in this notebook, but in practice we can add many more "hidden layers" through which our input passes before we obtain our final output. This idea gave birth to the term *deep neural network* to describe an ANN with many hidden layers.

In the next notebook we will explore this topic which helped end the winter of ANNs in the 1970s.

## References  

##### ANN Overview
[A Neural Network in 11 lines of Python](http://iamtrask.github.io/2015/07/12/basic-python-network/) (I thank the author, since my example is modified from his blog).    
[Machine learning - An Algorithmic Perspective](https://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html)
***
##### Activation Functions
[Activation functions and when to use what](https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e)
***
##### Learning Rate
[Single-Layer Neural Networks and Gradient Descent](http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html)     
[Learning rate optimization](https://www.jeremyjordan.me/nn-learning-rate/)     
***
##### Backpropagation
[Backpropagation - 3Blue1Brown](https://www.youtube.com/watch?v=Ilg3gGewQ5U&ab_channel=3Blue1Brown)     
[Backpropagation Computation Examples - Stanford Lecture](https://www.youtube.com/watch?v=d14TUNcbn1k&ab_channel=StanfordUniversitySchoolofEngineering)
***
##### Gradient Descent
[Gradient Descent - 3Blue1Brown](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=960s&ab_channel=3Blue1Brown)