<a href="https://colab.research.google.com/github/dylanwalker/BA865/blob/master/BA865_Lecture_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Networks

This lecture focuses on the theory of neural networks.

# From Real Neurons to Artificial Neurons

Artificial neural networks were inspired by how the real neurons in our brains work.

<img src="https://drive.google.com/uc?id=15D4uFFIloOQE0oYU6Zt8wYrN3VxSn12-" width=5000>

Left:
* a picture of an actual neuron in our brain. I won't go into the details of how this works (it's mostly biology that isn't that useful for our purposes).

Right:
* a model of an artificial neuron. 
<br>
<br>

Each neuron in our brain gets inputs from many other neurons. It only fires (activates) when these inputs combine to be greater than a certain threshold.

Each input into an artificial neuron is multiplied by some weight and then these values are all summed together. If the resultant sum is greater than some threshold, the neuron fires.
<br>
<br>

Mathematically, the mapping of input to output of this artificial neuron is:

$\hspace{5cm}y = \delta(\sum_i{w_i x_i} + b)$

where:
- $w_i$ are the weights
- $b$ is the bias
- $\delta()$ is a step function (=1 if its argument is > 0; =0 otherwise)
- note: $\sum_i{w_i x_i}$ can be thought of as a vector dot product, $\vec{w}\cdot\vec{x}$
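
As a minimal sketch, the formula above can be written in a few lines of plain Python. The function names `step` and `neuron` and the specific weights and bias are illustrative assumptions: they are hand-picked (not learned) so that the neuron behaves like a logical AND of two binary inputs.

```python
def step(z):
    """Step activation: 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

def neuron(x, w, b):
    """Output of one artificial neuron: step(w . x + b)."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return step(weighted_sum + b)

# Hand-picked (not learned) parameters that make the neuron act as a logical AND.
w = [1.0, 1.0]
b = -1.5

# Inputs in the order (0,0), (0,1), (1,0), (1,1).
and_outputs = [neuron([x1, x2], w, b) for x1 in (0, 1) for x2 in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum plus bias above zero, so it is the only case in which the neuron fires.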

<br>
We chose a step function ($\delta$) as the **activation function**. Activation functions transform the weighted inputs of a neuron to determine whether it activates. However, there are other choices we could have made (e.g., sigmoid, ReLU -- we'll talk more about these later).
<br>
<br>

Just like a real neuron, an artificial neuron is *adaptive* -- it can be trained by adjusting its parameters (the weights and the bias).

<br>

The example shown above is for a single neuron. It is termed the **single-layer perceptron**. However, you can imagine that it is possible to chain a bunch of such neurons together to create a network.
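
To make the idea of chaining concrete, here is a small sketch in which the outputs of two step-function neurons become the inputs of a third. The function `xor_network` and all of its weights and biases are assumptions chosen by hand for illustration (nothing is trained here). Together the three neurons compute XOR.

```python
def step(z):
    """Step activation: 1 if z > 0, otherwise 0."""
    return 1 if z > 0 else 0

def neuron(x, w, b):
    """One artificial neuron: step(w . x + b)."""
    return step(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

def xor_network(x1, x2):
    # First layer: two neurons that both see the raw inputs.
    h_or = neuron([x1, x2], [1.0, 1.0], -0.5)     # fires if x1 OR x2
    h_nand = neuron([x1, x2], [-1.0, -1.0], 1.5)  # fires unless x1 AND x2
    # Second layer: one neuron whose inputs are the first layer's outputs.
    return neuron([h_or, h_nand], [1.0, 1.0], -1.5)  # fires if both fired

# Inputs in the order (0,0), (0,1), (1,0), (1,1).
xor_outputs = [xor_network(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
print(xor_outputs)  # [0, 1, 1, 0]
```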









# A neural network

Typically, we think of a neural network as a set of layers between the input and output:

<img src="https://drive.google.com/uc?id=14nhiGsQVECw_XjVm7-WBftnntfYzvv0D">

The *width* of a layer is the number of neurons it contains. The *depth* of a neural network is the number of layers it contains.

So, the "deep learning" you have likely heard about refers to building and training neural networks that have many layers.

Two aspects of a Neural Network are:
- The neural network **architecture**: This describes the connections between neurons in each layer. Above, I showed you a neural net with several "fully connected" layers. However there are various choices for architecture that we will talk about.
- The neural network **training procedure**: This describes the procedure that we use to train a neural network.


Each neuron (each circle or node in the example diagram above) has its own bias and assigns a separate weight to each of its inputs (the edges in the diagram above):

<img src="https://drive.google.com/uc?id=12RwFK6JFws0f9cOoK0d_aIGW_CrGKsmu" width=500>

(notice our neuron model now explicitly has the bias incorporated and a generalized activation function)
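
A fully connected layer is just this neuron model repeated once per neuron, with every neuron seeing the same inputs. The sketch below assumes a sigmoid as the generalized activation function. The helper name `dense_layer` and all numeric values are illustrative, not taken from the diagram.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(x, weights, biases, activation):
    """Fully connected layer: one row of weights and one bias per neuron."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x_i for w, x_i in zip(w_row, x)) + b  # weighted sum plus bias
        outputs.append(activation(z))
    return outputs

# Illustrative values only: a layer of width 2 that receives 3 inputs.
x = [1.0, 2.0, 3.0]
weights = [[0.1, 0.2, 0.3],    # weights of neuron 1 (one per input)
           [-0.3, 0.0, 0.1]]   # weights of neuron 2 (one per input)
biases = [-1.4, 0.0]           # one bias per neuron

layer_output = dense_layer(x, weights, biases, sigmoid)
print(layer_output)  # one output value per neuron in the layer
```

Passing the activation in as an argument mirrors the "generalized activation function" in the diagram: swapping in a different function changes nothing else in the layer.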


As you can imagine, the number of parameters in a deep neural network can be quite large. How do we go about finding the right values of these parameters?  By "right values of the parameters", I mean the values that produce the output we want given the input.
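
To get a feel for how quickly the parameters add up, the sketch below counts them for a fully connected network: one weight per edge between consecutive layers plus one bias per neuron. The function name `count_parameters` and the layer sizes are hypothetical choices for illustration.

```python
def count_parameters(layer_widths):
    """Number of weights and biases in a fully connected network.

    layer_widths lists the number of inputs followed by the width of each layer.
    """
    total = 0
    for n_in, n_out in zip(layer_widths, layer_widths[1:]):
        total += n_in * n_out  # one weight per edge between the two layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

# Hypothetical sizes: 784 inputs, two hidden layers of width 128 and 64, 10 outputs.
n_params = count_parameters([784, 128, 64, 10])
print(n_params)  # 109386
```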




# Training a neural network




Every neural network is essentially an approximation of a function. We are trying to approximate the function that, when it operates on the inputs, returns the targets. However, because it is an approximation, there will always be some *error*. We term this error the "loss". The error depends on the parameters of the network (the weights and biases) -- and remember that there are usually a lot of parameters.

Training a neural network is an optimization problem. We are seeking a "global minimum" - i.e., the values of the parameters that minimize the loss.
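
The sketch below shows how the loss depends on a parameter, using the smallest possible example: a "network" with a single weight `w` and no bias, so that `prediction = w * input`. The toy data and the choice of mean squared error as the loss are assumptions made for illustration.

```python
def mse_loss(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Toy data that follows target = 2 * input exactly (an assumption for illustration).
inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]

# Evaluate the loss for a few candidate values of the single parameter w.
losses = {}
for w in [0.0, 1.0, 2.0, 3.0]:
    predictions = [w * x for x in inputs]
    losses[w] = mse_loss(predictions, targets)

best_w = min(losses, key=losses.get)
print(losses)
print(best_w)  # 2.0
```

Here the minimum can be found by trying a handful of values. With thousands or millions of parameters that is no longer feasible, which is why training relies on gradients instead.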

Training is accomplished by implementing a **training loop**, which *loops over epochs* and performs the following steps on each pass (shown here in pseudo-code):
```python
for epoch in epochs:
  predictions = net.forward(inputs) # 1. feedforward
  loss = loss_function(predictions, targets) # 2. calculate loss
  loss.backward() # 3. backpropagate (compute the gradients)
  optimizer.step() # 4. update parameters (weights, biases)
  optimizer.zero_grad() # set the gradients to zero before the next pass
```

1. **Feedforward** - pass the input data into the network to determine the predictions (the response of the network to the inputs).
 - This is accomplished by calling `model(inputs)` or `net.forward(inputs)`, depending on how the network has been defined (more on this later)
2. **Calculate loss** - Calculate the loss by comparing the predictions to the targets
 - This is accomplished by calling `loss_function(predictions,targets)` where `loss_function()` is a loss function (there are many to choose from, more on this later) 
3. **Backpropagate** the loss - calculate the gradients of the loss with respect to the parameters (biases and weights) using `loss.backward()`
4. **Update the parameters** - Update the weights and biases to reduce the loss
 - This is accomplished through an update function or (more commonly) through a built in optimizer, `optimizer.step()` where `optimizer` is an optimizer object (there are many to choose from, more on this later)
 - We have to be careful to zero out the gradients after each epoch (each pass through the loop), because gradients accumulate.
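
To see all of these steps actually run, here is a self-contained sketch in plain Python. It is a hand-rolled stand-in, not the library calls named above: the "network" is a single linear neuron, the loss is mean squared error, the gradients are derived by hand, and the update rule is plain gradient descent. The toy data, learning rate, and number of epochs are assumptions chosen for illustration.

```python
# Toy data following target = 2 * input + 1 (an assumption for illustration).
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

params = {"w": 0.0, "b": 0.0}  # parameters of one linear neuron
grads = {"w": 0.0, "b": 0.0}   # gradient buffers (these accumulate)
learning_rate = 0.05
n = len(inputs)

for epoch in range(2000):
    # 1. Feedforward
    predictions = [params["w"] * x + params["b"] for x in inputs]
    # 2. Calculate loss (mean squared error)
    loss = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    # 3. Backpropagate: add d(loss)/d(param) to the gradient buffers
    for p, t, x in zip(predictions, targets, inputs):
        grads["w"] += 2 * (p - t) * x / n
        grads["b"] += 2 * (p - t) / n
    # 4. Update the parameters with plain gradient descent
    for name in params:
        params[name] -= learning_rate * grads[name]
    # 5. Zero the gradients so they do not carry over into the next epoch
    for name in grads:
        grads[name] = 0.0

print(params, loss)  # w should end up close to 2 and b close to 1
```

Because step 3 adds to the buffers rather than overwriting them, skipping step 5 would make every update use the sum of all gradients seen so far. That is the accumulation the last bullet warns about.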





## Different Architectures?

## Which Activation Function?

There are a variety of different activation functions that can be used:

### Sigmoid (Logistic) and Hyperbolic Tangent

<img src="https://drive.google.com/uc?id=1e_wqLalTXFPxp0GMNMFbL_9kpXap043o" width=400>

The sigmoid and the hyperbolic tangent are the most traditional activation functions. They are defined as:

$f(x) = \frac{1}{1+e^{-x}}$ (sigmoid)

$f(x) = \frac{2}{1+e^{-2x}}-1$ (hyperbolic tangent)

Sigmoid ranges between 0 and 1 and is a "softer" version of the step function, while hyperbolic tangent ranges between -1 and 1 and is a bit steeper than the sigmoid (which can lead to faster training).
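
The two definitions above translate directly into code. This sketch uses only Python's standard `math` module. It also checks that the hyperbolic tangent formula agrees with `math.tanh` and that tanh is a rescaled, shifted sigmoid. The sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

# Sigmoid stays inside (0, 1); tanh stays inside (-1, 1).
for x in [-2.0, 0.0, 2.0]:
    print(x, sigmoid(x), tanh(x))

# The formula above agrees with the standard library's tanh ...
assert abs(tanh(0.5) - math.tanh(0.5)) < 1e-12
# ... and tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
assert abs(tanh(0.5) - (2.0 * sigmoid(2.0 * 0.5) - 1.0)) < 1e-12
```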

However, in very deep neural nets (or recurrent neural nets, which we'll talk about later), using either of them can lead to the vanishing gradient problem.
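
A rough way to see where the vanishing gradient comes from: backpropagation multiplies in one activation slope per layer, and the slope of the sigmoid is never larger than 0.25. The sketch below ignores the weights entirely (a deliberate simplification) and just multiplies the best-case sigmoid slope across a few assumed depths.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # largest at x = 0, where it equals 0.25

# Best case for the sigmoid: every layer sits at x = 0.
# Weights are ignored here, so only the activation slopes are multiplied.
bounds = {}
for depth in [1, 5, 10, 20]:
    bounds[depth] = sigmoid_grad(0.0) ** depth
    print(depth, bounds[depth])
```

Even in this best case the product shrinks by a factor of four with every extra layer, so the layers nearest the input receive very small gradient signals.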

### ReLU

ReLU (which stands for rectified linear unit) has recently become a popular choice. It is defined as:

$f(x) = \max(0,x)$

In other words, it is equal to the input when the input is positive and zero otherwise. ReLU significantly reduces the vanishing gradient problem without the added computational expense of other solutions. For this reason, it has become popular for training deep or recurrent neural nets.
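
Here is a minimal sketch of ReLU and its slope in plain Python. The helper name `relu_grad` and the choice to report a slope of 0 at exactly `x = 0` (where the slope is not defined) are assumptions. That choice is a common convention, not the only possible one.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Slope is 1 for positive inputs and 0 for negative inputs.
    # At exactly x = 0 the slope is undefined; 0 is used here by convention.
    return 1.0 if x > 0 else 0.0

relu_values = [relu(x) for x in [-2.0, -0.5, 0.0, 0.5, 2.0]]
print(relu_values)  # [0.0, 0.0, 0.0, 0.5, 2.0]

# For positive inputs the slope is exactly 1, so multiplying it across
# many layers does not shrink the gradient (20 layers assumed here).
print(relu_grad(3.0) ** 20)  # 1.0
```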

### Leaky ReLU


### Parameterized Activation Functions

There are several other activation functions that take parameters that can be learned through the training process. They are motivated by the two problems I mentioned above (vanishing gradient, dead neuron), and they do address these problems a bit better. However, they come at the cost of significantly increasing the number of parameters in the NN. I'm not going to talk about these in any detail, except to mention them:

- Parametrized ReLU:
- ELU - Exponential Linear Unit:
- Maxout Activation: 


# Which Optimizer?