We talked about the basics of Artificial Neural Networks (ANNs), gave an intuitive summary, and provided an example in the slides. In this notebook, we will continue to explore more of the ANN and implement a simple version of it.  

After reading this notebook, you will understand some key concepts, including *inputs, weights, outputs, targets, activation functions, error, the bias term, and the learning rate.*

Let's start with the simplest type of Artificial Neural Network, the [Perceptron](https://en.wikipedia.org/wiki/Perceptron). Developed in the late 1950s, it was one of the first artificial neural networks to be produced. You can think of a perceptron as a two-layer neural network without any hidden layers. Even though it has limitations, it contains the essential components found in ANNs.

Let's explore how a perceptron works using a simple dataset with 5 observations, where each sample has 3 features. The target is what we are ultimately trying to predict from the features. Our target is either 0 or 1, representing the two classes each sample can belong to (class 0 and class 1). Our goal is to build and train a simple Perceptron model that can **output the correct target** when we feed it our 3 **features as input**. See the following table. This example is modified from this [great post](http://iamtrask.github.io/2015/07/12/basic-python-network/).


|Observation|Feature1|Feature2|Feature3|Target|
|:------:|:------:|:------:|:------:|:----:|
| 1 |    0   |    0   |    1   |   0  |
| 2 |    1   |    1   |    1   |   1  |
| 3 |    1   |    0   |    1   |   1  |
| 4 |    0   |    1   |    1   |   0  |
| 5 |    0   |    1   |    0   |   1  |

Let's have a look at the structure of our model.

<img src="https://raw.githubusercontent.com/qingkaikong/blog/master/39_ANN_part2_step_by_step/figures/figure1_perceptron_structure.jpg" width="600"/>  

*Note:* we add an additional node with 1 as an input, and the weight associated with it is $\omega_0$. More on this later.

From the figure, we can see that we have two layers in this model: the input layer and the output layer. The input layer has 3 features (plus the extra node with a constant input of 1) that connect to the output layer via weights. The steps we will take to train this model are:

1. Initialize the weights to small random numbers, with both positive and negative values.
2. For many iterations:
    - Calculate the output value based on all data samples.
    - Update the weights based on the error.

Before we implement this in code, we need to go through some concepts:

### Bias term
Now let's look at the extra 1 we added to the input layer. It is connected to the output layer by weight $\omega_0$. Why do we add a 1 to the input? Consider the case when all our features are 0: no matter what weights we have, the weighted sum will always be 0. Adding this extra node with a constant input of 1 avoids this problem. The bias term functions just like a y-intercept in regression, allowing us to shift our function away from being rooted at 0.
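
To make the all-zero case concrete, here is a minimal sketch. The weight values are hypothetical and chosen only for illustration; `bias_weight` plays the role of $\omega_0$.

```python
import numpy as np

# Hypothetical weights for the three features, chosen only for illustration.
feature_weights = np.array([0.5, -1.2, 2.0])
bias_weight = -0.7  # plays the role of omega_0 (assumed value)

all_zero_features = np.array([0.0, 0.0, 0.0])

# Without a bias node, the weighted sum is 0 for any choice of weights.
z_without_bias = np.dot(all_zero_features, feature_weights)

# With a constant input of 1, the bias weight shifts the sum away from 0.
z_with_bias = 1 * bias_weight + np.dot(all_zero_features, feature_weights)

print(z_without_bias)
print(z_with_bias)
```

Whatever values we pick for `feature_weights`, the first sum stays at 0, while the second one is controlled entirely by the bias weight.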

### Activation functions   
The output layer only has one node that sums all the passed input and determines whether the output should fire or not (this is ANN terminology related to the brain and the firing of neurons). Usually we use a 1 to indicate a neuron firing and a 0 to indicate a neuron not firing. In this case, the sum of the weighted inputs is $z = 1*\omega_0 + feature1*\omega_1 + feature2*\omega_2 + feature3*\omega_3$. But what we want as the output is either 0 or 1 to represent two classes, or a number between 0 and 1 that we can use as a probability. Since this number z can be anything, how can we scale it to a value between 0 and 1?

For demonstration purposes we will use the sigmoid (also called logistic) activation function. Using the sigmoid activation function has three major advantages:
1. The output is scaled between 0 and 1.
    - If the z value is a large positive number, say 10, or a large negative number, say -10, the scaled value will be very close to 1 or 0, respectively. In these cases, the network is very confident that the observation belongs to a particular class.
    - If z is relatively close to 0, for example between -1 and 1, then the scaled value will be between about 0.27 and 0.73, which indicates the network is not so confident about the result.
2. The output is given as a probability.
    - This gives us some metric for confidence about our result.
3. We can take the derivative of the output.
    - Since the derivative can be used to update the weights while training our model, this property of the sigmoid function lets us shift the weights to predict the correct target.
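
As a quick check of the scaling described in the list above, the following sketch evaluates the sigmoid at a few values of z. The helper name `sigmoid` and the sample values are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Very negative z gives a value near 0, very positive z a value near 1,
# and z close to 0 gives a value close to 0.5.
for z in [-10, -1, 0, 1, 10]:
    print(f"z = {z:>3}: sigmoid(z) = {sigmoid(z):.5f}")
```

For z = -1 and z = 1 this gives roughly 0.27 and 0.73, and exactly 0.5 at z = 0.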


<img src="https://raw.githubusercontent.com/qingkaikong/blog/master/39_ANN_part2_step_by_step/figures/figure2_sigmoid.jpg" width="600"/>   



So now we have a perceptron network that can take some input of features, apply some weight to each feature, sum that up and pass it through an activation function to classify an observation into two different classes, great!

But what if the result is wrong? That is not only possible, but **expected**, since we initialized the weights as small **random** numbers. To update our model, we need the perceptron to have the ability to learn from the data (*more specifically, from its errors*).

### How to learn from the error  

Learning will be achieved by:
1. **Estimating our error** from the current weights
2. Finding a way to update the weights that will **reduce the error**

More details to follow in the code ahead!

### Learning rate   

Typically, when updating our weights, we use a learning rate to control how fast the network learns. We will see later in the code that the learning rate throttles how much to change the weights by. Leaving out the learning rate is the equivalent of setting it equal to 1. 

*Why do we need a learning rate?* - The weights change a lot when the error is high, and we may overshoot our weight updates leading to another extreme, making the network unstable so that it never settles down. 

There is a trade-off between setting the learning rate too high and too low. A high learning rate makes training our ANN fast, but can cause it to jump past the optimal weights. On the other hand, a small learning rate makes training take longer and can leave the network stuck in local minima, but leads to more stable weight updates.
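
The trade-off can be seen on a toy problem. The sketch below is not the perceptron itself: it assumes a one-dimensional cost $f(w) = w^2$ (minimum at $w = 0$) and a hypothetical helper `minimize_quadratic`, purely to show how the step size changes the behaviour of gradient descent.

```python
def minimize_quadratic(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(w) = w**2 and return every visited weight."""
    w = start
    history = [w]
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient  # step against the gradient
        history.append(w)
    return history

small = minimize_quadratic(learning_rate=0.01)    # stable, but slow
moderate = minimize_quadratic(learning_rate=0.4)  # reaches the minimum quickly
large = minimize_quadratic(learning_rate=1.1)     # overshoots further each step

print(abs(small[-1]), abs(moderate[-1]), abs(large[-1]))
```

With the small rate the weight is still far from 0 after 20 steps, with the moderate rate it is essentially at 0, and with the large rate it flips sign on every step and moves further away.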

<img src="https://sebastianraschka.com/images/blog/2015/singlelayer_neural_networks_files/perceptron_learning_rate.png" width="600" />

Let's implement what we discussed, and then explain it line by line below!

```python
import numpy as np
# The activation function, we will use the sigmoid
def sigmoid(x,deriv=False):
    sig = 1/(1+np.exp(-x))
    if(deriv==True):
        # slope of the sigmoid at x
        return sig*(1-sig)
    return sig

# define learning rate
learning_rate = 0.4

# input dataset with each observation as a row
X = np.array([[0,0,1],
              [1,1,1],
              [1,0,1],
              [0,1,1],
              [0,1,0]])
# we add a bias column of 1s to the input
X = np.concatenate((np.ones((len(X), 1)), X), axis = 1)

# output or target
y = np.array([[0,1,1,0,1]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly between [-1, 1]
weights_0 = (2 * np.random.random((4,1))) - 1

# train the network with 50000 iterations
for iter in range(50000):

    # forward propagation
    layer_0 = X
    # pass the weighted sum of the inputs through the sigmoid activation function
    layer_1_output = sigmoid(np.dot(layer_0,weights_0))

    # calculate the error which we define as the (actual target value - the predicted value)
    layer1_error = y - layer_1_output

    # multiply how much we missed by the learning rate and by the
    # slope of the sigmoid at the weighted sum that produced the output
    layer1_delta = learning_rate * layer1_error * sigmoid(np.dot(layer_0,weights_0),True)
    # multiply by the input to take care of the negative case,
    # summing over all observations to get one update per weight
    layer1_delta = np.dot(layer_0.T,layer1_delta)

    # update weights by simply adding the delta
    weights_0 += layer1_delta

print("Output After Training:")
print(layer_1_output)
print('Target')
print(y)
```

The cell prints the network's output for the 5 observations, followed by the targets. After the full 50000 iterations, each output should lie close to its target: near 0 for observations 1 and 4, and near 1 for observations 2, 3, and 5.


## Explain line by line  

**Line 1:** This line imports NumPy, a library for numerical computing and linear algebra.

```python
import numpy as np
```

**Line 3:** This block defines the activation function, which converts any number to a value between 0 and 1, as we discussed above. When `deriv=True`, it returns the slope of the sigmoid at `x` instead.

```python
# The activation function, we will use the sigmoid
def sigmoid(x,deriv=False):
    sig = 1/(1+np.exp(-x))
    if(deriv==True):
        return sig*(1-sig)
    return sig
```

**Line 10:** Here we define our learning rate, which controls how fast the network learns from the data. Usually the learning rate is a number between 0 and 1.

```python
# define learning rate
learning_rate = 0.4
```

**Line 13:** This initializes the input dataset as a NumPy matrix. Each row is a single data sample, and each column corresponds to one feature (one of the input nodes). We also add the bias term 1 in the last line of this block. You can see that we now have 4 input nodes and 5 training examples.

```python
# input dataset with each observation as an array
X = np.array([[0,0,1],
              [1,1,1],
              [1,0,1],
              [0,1,1],
              [0,1,0]])
# we add a bias column of 1s to the input
X = np.concatenate((np.ones((len(X), 1)), X), axis = 1)
```

**Line 17:** This initializes our output dataset. `.T` takes the transpose, which converts our output data to a column vector. You can see that we have 5 rows and 1 column, corresponding to 5 data samples and 1 output node.

```python
# output or target           
y = np.array([[0,1,1,0,1]]).T
```

**Line 21:** Before we generate the random weights, we use a seed to ensure we can replicate our experiment, drawing the same random numbers. This is very useful when we test the algorithm, and compare the results with others. This means that your results and my results should be the same. But in reality when you use the algorithm, you don't need to seed it.   

```python
# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)
```

**Line 24:** This initializes the weights that connect the input layer to the output layer. Since we have 4 inputs (the 3 features plus the bias term), we initialize the random weights as four rows with one column, i.e. dimensions (4,1). Also note that the random numbers we draw lie between -1 and 1, with a mean of 0. There is quite a bit of theory that goes into weight initialization. For now, we will simply make sure the initial weights have a mean of 0.

```python
# initialize weights randomly with mean 0
weights_0 = 2*np.random.random((4,1)) - 1
```
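
To see why `2*np.random.random(...) - 1` gives values between -1 and 1 with a mean near 0, here is a small sketch. It draws far more samples than the 4 weights we need, purely so the range and mean are visible; the sample count is an arbitrary choice.

```python
import numpy as np

np.random.seed(1)

# np.random.random draws from [0, 1); scaling by 2 and shifting by -1
# maps those draws onto [-1, 1).
samples = 2 * np.random.random((100000, 1)) - 1

print(samples.min(), samples.max(), samples.mean())
```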

**Line 27:** We begin the training and let the Perceptron learn over 50000 iterations. In each iteration, the algorithm learns from the error it made.

```python
# train the network with 50000 iterations
for iter in range(50000):
```

**Line 30:** We explicitly assign our input data to layer_0 (since we have 2 layers, we will call them layer_0 and layer_1). We're going to process all the data at the same time in this implementation; this is called 'batch processing'. There's another type of learning called ['online learning'](https://en.wikipedia.org/wiki/Online_machine_learning), which essentially changes the weights whenever a new data sample is available, but that is beyond the scope of this workshop.

```python
    # forward propagation
    layer_0 = X
```

**Line 31:** This is forward propagation. It has two steps: first, we take the weighted sum of the inputs, then we pass that sum through the sigmoid activation function, converting the output to a value between 0 and 1.

The first step is achieved by taking the dot product of the raw inputs and the weights using the np.dot function. For each observation, `np.dot(layer_0,weights_0)` computes $z = 1*\omega_0 + feature1*\omega_1 + feature2*\omega_2 + feature3*\omega_3$, and we refer to this weighted sum as z. The second step is passing z to the sigmoid function to get a number between 0 and 1.

```python
    # pass the weighted sum of the inputs through the sigmoid activation function
    layer_1_output = sigmoid(np.dot(layer_0,weights_0))
```
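
The sketch below works through the forward pass for a single observation, to show that the dot product is the same as the term-by-term weighted sum. The weight values are hypothetical, picked only so the arithmetic is easy to follow.

```python
import numpy as np

# Observation 2 from the table, with the bias input of 1 prepended.
observation = np.array([1, 1, 1, 1])

# Hypothetical weights (omega_0 .. omega_3), chosen only for illustration.
weights = np.array([0.1, 0.8, -0.3, -0.5])

# Explicit weighted sum, term by term.
z_explicit = (1 * weights[0]
              + observation[1] * weights[1]
              + observation[2] * weights[2]
              + observation[3] * weights[3])

# The same sum computed as a dot product.
z_dot = np.dot(observation, weights)

# Second step: squash z with the sigmoid.
output = 1 / (1 + np.exp(-z_dot))

print(z_explicit, z_dot, output)
```

Both ways of computing z give 0.1 here, and the sigmoid turns that into a value slightly above 0.5.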

**Line 35:** This is our error for our current weights. It is simply the true answer (y) minus our estimation (layer_1_output). The error of each observation is stored here as a 5 by 1 matrix.  

```python
    # calculate the error which we define as the target value - the estimated value
    layer1_error = y - layer_1_output
```

**Line 40:** This is the most important part: learning. This is the line that makes our algorithm learn from the data. It calculates how much we will change each of the weights in the current iteration. It has 3 parts that are multiplied together: (1) the learning rate, (2) the error we just got, and (3) the derivative of the sigmoid function at the weighted sum that produced the output.

```python
    # learning rate - controls how fast learning occurs (set to 0.4)
    # layer1_error - the error for each of our observations
    # sigmoid(np.dot(layer_0,weights_0),True) - the derivative (or slope) of the
    # sigmoid at the weighted sum z that produced each output
    layer1_delta = learning_rate * layer1_error * sigmoid(np.dot(layer_0,weights_0),True)
```

The first term is the learning rate, which controls how fast we learn. It is just a constant in this case; here we choose 0.4. The larger the value, the faster the network will learn, but a large value may make the learning unstable and cause the results to oscillate and jump past the optimal weights.

The second term is the error for each of our observations. Let's think about this error a little more...

We have two classes, class 0 and class 1. Therefore, we can have two types of error: the true class is 0 but our estimate is > 0, or the true class is 1 but our estimate is < 1. In the first case the error is a negative number, and in the second it is a positive number. If we assume all the inputs are positive and the error is negative, then we want to reduce the weights to make the next estimate smaller. Likewise, if the error is positive, we want to increase the weights so that the next estimate is larger. This is the reason we include the error term here: it determines the direction in which we change the weights.

But wait, what if the inputs contain negative values? A negative input flips the sign of its contribution to the weighted sum, which reverses the direction in which we want to change that weight. How can we handle this case? It is quite simple: we multiply by the input, so a negative error multiplied by a negative input gives a positive number, which corrects the direction of change. This is where Line 41 comes in; we separate these two lines for clearer explanation.

The final term is the derivative of the sigmoid function, evaluated at the weighted sum z that produced the output. The figure below shows the sigmoid function and its derivative at different places. The derivative at x is the slope of the sigmoid curve at x. We can clearly see that the slope differs across values of x, ranging from nearly 0 (at the negative and positive ends, i.e. for very large negative or very large positive numbers) to 0.25 (at x = 0, where the curve is steepest). This means we multiply the error, already scaled by the learning rate, by another value between 0 and 0.25.

If we look carefully at the figure, the slopes at the green and purple dots are flatter than the slope at the blue dot, which means they have smaller values. The green and purple dots represent more confidence that the result belongs to one class. At the green dot, the sigmoid value is about 0.9, which is much closer to class 1. But the blue dot has a sigmoid value of 0.5, meaning we have no idea whether it should belong to class 1 or class 0. Therefore, the slope is large when we have little confidence in the class, and small when we have more confidence.

This makes sense: if the current weights already give a very good estimate of the class (the green dot), we want to change the weights only slightly, or not at all. But if the current estimate is unclear (the blue dot), we want to change the weights by a relatively large amount to make it clearer. Combining the 3 terms, we get a delta value for each weight that moves it in the direction that reduces the error.
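
The relationship between confidence and slope can be checked numerically. In this sketch the helper name `sigmoid_slope` and the sample z values are chosen here for illustration; the slope is written as output times (1 - output), which is the standard form of the sigmoid's derivative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid at z, written in terms of its output."""
    s = sigmoid(z)
    return s * (1 - s)

# The slope peaks at z = 0 (output 0.5) and shrinks as the output
# approaches 0 or 1.
for z in [-6, -2, 0, 2, 6]:
    print(f"z = {z:>2}: output = {sigmoid(z):.3f}, slope = {sigmoid_slope(z):.3f}")
```

The slope is 0.25 at z = 0, where the output is 0.5, and drops towards 0 on both sides as the output becomes more confident.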

<img src="https://raw.githubusercontent.com/qingkaikong/blog/master/39_ANN_part2_step_by_step/figures/figure3_sigmoid_derivative.jpg" width="600"/> 

**Line 41:** As discussed for the second term under Line 40, we multiply by the input to keep the sign correct (that is, to keep the direction in which we change the weights correct!). The dot product with `layer_0.T` also adds up the contributions from all 5 observations, giving one update per weight (a 4 by 1 matrix).

```python
    layer1_delta = np.dot(layer_0.T,layer1_delta)
```
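
To see what this dot product does, the sketch below compares it with an explicit loop over the observations. The per-observation delta values are hypothetical, chosen only so the sums are easy to verify by hand.

```python
import numpy as np

# Inputs with the bias column, as in the notebook (5 observations, 4 inputs).
layer_0 = np.array([[1, 0, 0, 1],
                    [1, 1, 1, 1],
                    [1, 1, 0, 1],
                    [1, 0, 1, 1],
                    [1, 0, 1, 0]], dtype=float)

# Hypothetical per-observation deltas, chosen only for illustration.
per_observation_delta = np.array([[-0.02], [0.05], [0.03], [-0.01], [0.04]])

# One matrix product gives a (4, 1) update, one value per weight.
update_dot = np.dot(layer_0.T, per_observation_delta)

# The same result, built by adding input * delta one observation at a time.
update_loop = np.zeros((4, 1))
for inputs, delta in zip(layer_0, per_observation_delta):
    update_loop += inputs.reshape(4, 1) * delta

print(update_dot.ravel())
print(np.allclose(update_dot, update_loop))
```

Note that an input of 0 contributes nothing to the update of its weight, since that weight played no part in producing the output for that observation.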

**Line 44:** This is the last line of the loop. It updates the current weights by simply adding the delta we just determined.

```python
    # update weights by simply adding the delta
    weights_0 += layer1_delta
```

**Line 47:** This line prints the final output after the network has trained for 50000 iterations, which we can compare against the targets.

```python
print(layer_1_output)
```

This concludes this part, which showed you all the important pieces of the ANN (perceptron) algorithm. But the perceptron has its own limitations, which actually caused the ANN winter of the 1970s that we talked about, until researchers found a way around them: the multilayer perceptron algorithm, which we will talk about in the next part.

## Neural Networks Advanced topics

Think of a neural network as a bicycle with exchangeable parts. We can swap the wheels depending on whether we are on flat roads or rocky terrain, or change the type of brakes, the number of gears, etc. to make the bike work as well as possible for our given task (riding up a mountainside, or downhill to work).

<img src ="https://img.freepik.com/free-vector/realistic-bicycle-parts-set-with-isolated-illustration_1284-55952.jpg?size=626&ext=jpg" width=500/>

Neural networks also have this property of exchangeable parts. A few of the parts of a neural network we can swap out include:

1. The activation functions
2. The cost function
3. The architecture

For the simple perceptron we built, we created a neural network with:

1. The sigmoid activation function
2. A cost function that we define as our error = target - output
3. A two-layer network with an input and output layer.

### Activation Functions

While we used the sigmoid activation function for this notebook, there are many different types of activation functions used in practice. These activation functions each have their own trade-offs. Below is a collection of different activation functions, with a brief description of what they aim to do and their drawbacks:

1. **Sigmoid (Logistic)** - Squashes the output to between 0 and 1, but neurons can saturate, causing vanishing gradients during backpropagation (the derivative is near 0 when the input is very high or very low!), and there is a relatively high computational overhead to calculate $e^{-x}$.

2. **Tanh (Hyperbolic Tangent)** - Very similar to the sigmoid, but squashes the output to between -1 and 1 and is 0-centered (allowing backpropagation to be more efficient). It still has the issues of saturating gradients and high computational overhead.

3. **ReLU (Rectified Linear Unit)** - Very computationally efficient, and the gradient doesn't saturate at the positive end, but neurons can die when the input is $\le 0$ (the derivative (slope) is 0 when the input is $\le 0$).

4. **Leaky ReLU (Leaky Rectified Linear Unit)** - Similar to ReLU, but neurons will not die when the input is $\le 0$, since the output there is $0.1 * input$ rather than 0, so the derivative is a small nonzero constant.

5. **Maxout** - A generalization of ReLU and Leaky ReLU that doesn't kill neurons, but requires double the number of parameters for each neuron.

6. **ELU (Exponential Linear Unit)** - Similar to Leaky ReLU, but requires extra computational overhead to calculate $e^{x}$ for negative inputs.

7. Many more!
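
Several of the activation functions listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the negative-side slope of 0.1 follows the text, the ELU `alpha` of 1.0 is an assumed default, and Maxout is left out because it needs extra learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.1):
    # slope applies to negative inputs; 0.1 follows the text above
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # alpha = 1.0 is an assumed default
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("elu", elu)]:
    print(f"{name:>10}: {fn(x)}")
```

Because every function takes and returns an array of the same shape, swapping one for another in the forward pass only requires changing which function is called.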

<img src="https://miro.medium.com/max/2384/0*sIJ-gbjlz0zrz8lb.png" width="600"/>

### The Cost Function

We can think of the way we calculated our error (target - output) as our cost function. In artificial neural networks, and machine learning in general, our goal is to minimize the value of the cost function using gradient descent, with backpropagation supplying the gradients. In practice, there are many types of cost functions you can use, depending on the particular problem you are trying to solve.
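
As an illustration, here is a sketch of two commonly used cost functions, mean squared error and binary cross-entropy. The function names and the example outputs (`good` and `poor`) are made up for this sketch; only the targets come from our dataset.

```python
import numpy as np

def mean_squared_error(target, output):
    return np.mean((target - output) ** 2)

def binary_cross_entropy(target, output, eps=1e-12):
    # clip to avoid log(0) when an output is exactly 0 or 1
    output = np.clip(output, eps, 1 - eps)
    return -np.mean(target * np.log(output)
                    + (1 - target) * np.log(1 - output))

target = np.array([0, 1, 1, 0, 1])

# Hypothetical network outputs, chosen only for illustration.
good = np.array([0.1, 0.9, 0.8, 0.2, 0.9])
poor = np.array([0.6, 0.4, 0.5, 0.7, 0.3])

print(mean_squared_error(target, good), mean_squared_error(target, poor))
print(binary_cross_entropy(target, good), binary_cross_entropy(target, poor))
```

Both functions assign a lower cost to the outputs that are closer to the targets, which is the property training relies on.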

<img src="https://lh3.googleusercontent.com/qUJKz1DWwMBGzh-unup3AAcXGuNT5-63Ygjx00LN9YE2mJ-dUBO5fXl6E8H_AiU8IMNbBl7KsCvb4CWGCxperdFk_xmFDCDtwD6VExMUG5P8r29jgJij_8yyKnyvXoEh1scc7ixIGXNr-FbPVA" width=600/>

On top of that, we often add a regularization term that helps us prevent overfitting the data by penalizing the size of the weights.
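
A minimal sketch of this idea follows, assuming an L2 penalty (the sum of squared weights), which is one common choice. The function names, the `strength` value, and the example weights are hypothetical.

```python
import numpy as np

def l2_penalty(weights, strength=0.01):
    """Penalty that grows with the squared size of the weights."""
    return strength * np.sum(weights ** 2)

def regularized_cost(target, output, weights, strength=0.01):
    data_cost = np.mean((target - output) ** 2)  # how well we fit the data
    return data_cost + l2_penalty(weights, strength)

target = np.array([0, 1, 1, 0, 1])
output = np.array([0.1, 0.9, 0.8, 0.2, 0.9])  # hypothetical outputs

small_weights = np.array([0.5, -0.5, 0.2])
large_weights = np.array([5.0, -5.0, 2.0])

# Same fit to the data, but the larger weights cost more.
print(regularized_cost(target, output, small_weights))
print(regularized_cost(target, output, large_weights))
```

The `strength` parameter sets how strongly large weights are discouraged relative to fitting the data.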

<img src="https://miro.medium.com/max/550/1*-LydhQEDyg-4yy5hGEj5wA.png" width=300>

### The Architecture

We used the simplest neural network architecture in this notebook, but in practice we can add many more "hidden layers" that our input passes through before we obtain our final output.
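
To give a feel for what a hidden layer adds, here is a sketch of a forward pass only (no training) through a network with one hidden layer. The number of hidden units, the variable names, and the random weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)

# The 5 observations and 3 features from our dataset.
X = np.array([[0, 0, 1],
              [1, 1, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)

hidden_units = 4  # assumed size of the hidden layer

# Randomly initialized weights and zero biases for each layer.
weights_input_to_hidden = rng.uniform(-1, 1, size=(3, hidden_units))
bias_hidden = np.zeros(hidden_units)
weights_hidden_to_output = rng.uniform(-1, 1, size=(hidden_units, 1))
bias_output = np.zeros(1)

# The input passes through the hidden layer before reaching the output.
hidden = sigmoid(X @ weights_input_to_hidden + bias_hidden)
output = sigmoid(hidden @ weights_hidden_to_output + bias_output)

print(hidden.shape, output.shape)
```

Each layer repeats the same two steps we used in the perceptron, a weighted sum followed by an activation function, with the output of one layer becoming the input of the next.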



## References  

##### ANN Overview
[A Neural Network in 11 lines of Python](http://iamtrask.github.io/2015/07/12/basic-python-network/) (I thank the author, since my example is modified from his blog).    
[Machine learning - An Algorithmic Perspective](https://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html)
***
##### Activation Functions
[Activation functions and when to use what](https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e)
***
##### Learning Rate
[Single-Layer Neural Networks and Gradient Descent](http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html)     
[Learning rate optimization](https://www.jeremyjordan.me/nn-learning-rate/)     