# Neural Network Architecture

In [None]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

![dense](images/dogcat.gif)

# Objectives

- Describe the basic structure of densely connected neural networks
- Describe the various activation functions that are used in neural networks

# Introduction to Neural Networks

## Background

Neural networks have been around for a while. The idea dates back to 1943, when Warren McCulloch and Walter Pitts first proposed it. These first proposed neural nets had thresholds and weights, but no layers and no specific training mechanisms.

The "perceptron", the first trainable neural network, was created by Frank Rosenblatt in 1957. It consisted of a single layer with adjustable weights between the input and output layers.

![perceptron](images/nn-diagram.png)

## Wait, Wait, Wait... Why a Neural Network?

You really should take a second to realize what tools we already have and ask yourself, "Do we really need to use this 'neural network' if we already have so many machine learning algorithms?"

In short, we don't need to default to a neural network, but they have advantages in solving very complex problems. It might help to know that the idea of neural networks dates back to the 1940s and 1950s (the perceptron). It wasn't until we had a lot of data and computational power that they became reasonably useful.

## Starting with a Perceptron

### A Diagram

<img src='https://cdn-images-1.medium.com/max/1600/0*No3vRruq7Dd4sxdn.png' width=40%/>

Notice the similarity to a linear regression:


$$ x_1 w_1 + x_2 w_2  + x_3 w_3 = \text{output}$$
$$ XW = \text{output}$$
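
As a minimal sketch of the weighted sum above, the snippet below uses three made-up feature values and weights (the numbers are illustrative assumptions, not taken from any dataset) and shows that the explicit sum and the dot product agree:

```python
import numpy as np

# Hypothetical feature values and weights, chosen only for illustration
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])

# Explicit sum: x_1*w_1 + x_2*w_2 + x_3*w_3
explicit = x[0] * w[0] + x[1] * w[1] + x[2] * w[2]

# The same computation written as a dot product
output = x @ w

print(explicit, output)
```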

## Relation to Previous Models

### Logistic Regression

Think of the weights as the betas and the activation function as the sigmoid function!

### Stacking Ensembles

Various base models' predictions are fed into a "meta-estimator" that is trained to aggregate them optimally. This is analogous to the multiple **layers** of a neural network.

## Basic Architecture

For our DS purposes, we'll generally imagine our network to consist of only a few layers, including an input layer (where we feed in our data) and an output layer (comprising our predictions). Significantly, there will also (generally) be one or more layers of neurons between input and output, called **hidden layers**.

One reason these are named hidden layers is that what their output actually represents is not really known. The activation of node 1 of the first hidden layer may represent a sequence of pixel intensities corresponding to a horizontal line, or a group of dark pixels in the middle of a number's loop.

![dense](images/Deeper_network.jpg)

Because we are unaware of how exactly these hidden layers are operating, neural networks are considered **black box** algorithms.  You will not be able to gain much inferential insight from a neural net.

Each of our pixels from our digit representation goes to each of our nodes, and each node has a set of weights and a bias term associated with it.
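
To make the bookkeeping concrete, here is a rough sketch of the shapes involved. It assumes a hidden layer of 4 nodes (an arbitrary size picked for illustration) fed by the 64 pixels of one flattened 8x8 digit; the pixel values are random stand-ins rather than a real image.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels = 64   # one flattened 8x8 digit image
n_nodes = 4     # assumed hidden layer size, for illustration only

x = rng.integers(0, 17, size=n_pixels)       # stand-in pixel values in 0..16
W = rng.normal(0, 0.1, (n_pixels, n_nodes))  # one column of weights per node
b = np.zeros(n_nodes)                        # one bias term per node

z = x @ W + b                 # every pixel contributes to every node
n_params = W.size + b.size    # 64 * 4 weights plus 4 biases

print(z.shape, n_params)
```

Each column of `W` belongs to one node, so adding a node adds 64 weights and one bias.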

## Inspiration from Actual Neurons

The composition of neural networks can be **loosely** compared to a neuron.

![neuron](images/neuron.png)

Neural networks draw their inspiration from the biology of our own brains, which are of course also accurately described as 'neural networks'. A human brain contains around $10^{11}$ neurons, connected very **densely**.

This is a loose analogy, but can be a helpful **mnemonic**. The inputs to our node are like the inputs to a neuron's dendrites. They are either direct sensory information (our features) or input from other neurons' axons (nodes passing information to other nodes). The body of our neuron (soma) is where the signals of the dendrites are summed together, which is loosely analogous to our **collector function**. If the summed signal is large enough (our **activation function**), it triggers an action potential which travels down the axon to be passed as output to other dendrites. See [here](https://en.wikipedia.org/wiki/Neuron) for more.

# Neural Networks Overview

## A Couple of Ways to Think of Neural Networks

> We can think of neural networks as taking inputs and creating something like latent features.

![](images/neural_network_with_human_readable_labels.png)

> But we can also think of them as creating linear separators and then combining them together

In [None]:
x = np.random.rand(25)
y = np.random.rand(25)
z = (x + y) <= 0.8
plt.scatter(x, y, c=z)

Thinking in the more mathematical way allows us to use our linear algebra knowledge.

![](images/neural_network_mathematics.png)

## Parts of a Neural Network

### Layers

- **Input Layer**: the features we feed into our network
- **Output Layer**: the predictions (class probabilities or regression values)
- **Hidden Layer(s)**: the layers of neurons between input and output, which let the network find more complex patterns

### Weights

> The weights from our inputs describe how much each input should contribute to the next neuron

But we can also think of the weights of hidden layer neurons as telling us how these linear separators should be combined.
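
One way to see separators being combined is the classic XOR example. The sketch below uses hand-picked (not learned) weights and a simple threshold activation, both of which are assumptions made for illustration: two hidden nodes each draw a line through the unit square, and the output node combines them.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

# The four corners of the unit square
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two linear separators on x1 + x2 (hand-picked weights)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])   # node 1 acts like OR, node 2 like AND

# Output layer: combine the separators as "OR but not AND"
w_out = np.array([1.0, -1.0])
b_out = -0.5

hidden = step(X @ W_hidden + b_hidden)
output = step(hidden @ w_out + b_out)

print(output)
```

No single line separates the XOR pattern, but two lines combined by a second layer do.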

### Activation Functions

![activation](images/log-reg-nn-ex-a.png)

Once the weighted inputs are summed, we pass the result into an activation function. The activation function converts our summed inputs into an output, which is then passed on to other nodes in hidden layers, or as an end product in the output layer. This can loosely be thought of as the action potential traveling down the axon.

When we build our models in `keras`, we will specify the activation function of both hidden layers and output.

### Other Hyperparameters

We'll talk more about these when we optimize our neural networks, but some hyperparameters include:

- **Learning Rate ($\alpha$)**: how big of a step we take in gradient descent
- **Number of Epochs**: how many full passes we make over the training data
- **Batch Size**: how many data points we use for a single weight update (one epoch is made up of many batches)

Remember, any setting we choose to shape how the network learns, rather than something the network learns from the data, _is_ a hyperparameter (this includes the actual structure of the neural net).
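
The sketch below shows how epochs and batch size interact in a training loop. The data are random stand-ins, and the values of `n_epochs` and `batch_size` are arbitrary assumptions; no weights are actually updated here, the loop only counts how many updates would happen.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((1000, 64))   # stand-in training data: 1000 rows, 64 features

n_epochs = 3       # full passes over the training data
batch_size = 32    # rows used for each weight update

n_updates = 0
for epoch in range(n_epochs):
    order = rng.permutation(len(X))            # reshuffle the rows every epoch
    for start in range(0, len(X), batch_size):
        batch = X[order[start:start + batch_size]]
        # a real training loop would compute gradients on `batch`
        # and take one gradient descent step of size alpha here
        n_updates += 1

print(n_updates)
```

With 1000 rows and a batch size of 32, each epoch contains 32 batches (the last one holds only 8 rows), so three epochs give 96 updates.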

# How They Work

## Forward Propagation

Let's first look at **forward propagation** on the level of the perceptron.

We will use the built-in dataset of handwritten numbers from `sklearn`, which comes from the UCI Machine Learning collection [digits source](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). Each record is a 64-pixel (8x8) image of a handwritten number between 0 and 9. Each pixel value (a number between 0 and 16) represents the relative intensity of the pixel.

In [None]:
digits = load_digits()
flat_image = np.array(digits.data[0]).reshape(digits.data[0].shape[0], -1)
eight_by_eight_image = digits.images[0]

In [None]:
eight_by_eight_image

It is similar to the famous [**MNIST**](http://yann.lecun.com/exdb/mnist/index.html) dataset, which is sometimes referred to as the ["hello world" of computer vision](https://www.kaggle.com/c/digit-recognizer).

Let's look at one digit:

In [None]:
imgplot = plt.imshow(eight_by_eight_image, cmap='Greys')

In [None]:
# look at the matrix below and make sure you see how the large numbers 
# correspond to darker shades in the image above

eight_by_eight_image

When passing the data into our perceptron, we will **flatten** the image into a 64x1 array.

In [None]:
digits.data[0]

In [None]:
digits.data[0].shape[0]

In [None]:
digits.data[0].shape

In [None]:
flat_image = np.array(digits.data[0]).reshape(digits.data[0].shape[0], -1)

In [None]:
flat_image.shape

In [None]:
flat_image


![weights](images/log-reg-nn-ex-w.png)

We will initialize our weights with small random numbers.


In [None]:
w = np.random.uniform(-0.1, 0.1, (flat_image.shape[0], 1))
w

In [None]:
len(w)

We'll set our bias term to 0:

In [None]:
b = 0

### Summation

![sum](images/log-reg-nn-ex-sum.png)

Our inputs, the pixels, are each multiplied by their respective weights and then summed together with the bias.

This amounts to the dot product of the pixel values and the weights, plus the bias.

In [None]:
flat_image.T.shape

In [None]:
z = flat_image.T.dot(w) + b
z

### Activation Functions
We have a suite of activation functions to choose from.

#### Sigmoid

$f(x) = \frac{1}{1+e^{-x}}$

In [None]:
# z is the input from our collector: the sum of the features
# multiplied by their weights, plus the bias

def sigmoid(z):
    '''
    Input the sum of our weights times the pixel intensities, plus the bias
    Output a number between 0 and 1.
    
    '''
    return 1/(1 + np.exp(-z))

In [None]:
X = np.linspace(-10, 10, 20000)
sig = sigmoid(X)

fig, ax = plt.subplots()
ax.plot(X, sig)
ax.set_title('Sigmoid Activation');

#### tanh

$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

The hyperbolic tangent function is a rescaled and shifted version of the sigmoid: $\tanh(x) = 2\sigma(2x) - 1$, where $\sigma$ is the sigmoid. Its inflection point is at (0, 0) instead of (0, 0.5), and the output is between -1 and 1. This means the output is centered around 0, which can make learning in the next layer easier. tanh is almost always better in a **hidden layer** than sigmoid because it [speeds up learning](https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function). For the output layer, however, sigmoid makes sense for binary outcomes.

In [None]:
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

In [None]:
# Coding tanh:

X = np.linspace(-10, 10, 20000)
y_tanh = tanh(X)

fig, ax = plt.subplots()
ax.plot(X, y_tanh)
ax.set_title('Hyperbolic Tangent Activation');
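
The link between the two functions can be checked numerically. tanh is a sigmoid stretched to the range (-1, 1) and compressed horizontally, $\tanh(z) = 2\sigma(2z) - 1$; the short sketch below compares that expression with NumPy's built-in `np.tanh` on a grid of inputs.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 101)

# Stretch the sigmoid's output from (0, 1) to (-1, 1) and double its input
rescaled_sigmoid = 2 * sigmoid(2 * z) - 1

print(np.allclose(rescaled_sigmoid, np.tanh(z)))
```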

One problem with tanh (and sigmoid) is that if our input is large, the slope of the activation function flattens out.  When conducting backpropagation, we will use the derivative of the activation function as one of our terms multiplied by a learning rate to determine how big a step to take when adjusting our weights. If our derivative is close to zero, the step will be very small, so the speed of our learning will be very slow.  This is called the **vanishing gradient** problem.
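
To get a feel for how quickly the gradient vanishes, the sketch below evaluates the derivatives of sigmoid and tanh at a few inputs. It uses the standard identities $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\tanh'(z) = 1 - \tanh^2(z)$; the input values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

# The slopes are largest at 0 and shrink rapidly as the input grows
for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z), tanh_grad(z))
```

The sigmoid's slope is never larger than 0.25, so even in the best case each sigmoid layer scales the backpropagated signal down.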

#### ReLU

$f(x) = \max(0, x)$, that is, $f(x) = 0$ if $x \leq 0$ and $f(x) = x$ otherwise

ReLU is a commonly used and effective activation function, largely because of speed. Because the **output** is zero whenever the input is negative, some nodes become inactive (i.e. produce an output of 0). Zero outputs take little computational power. Also, the gradient is a constant 1 for positive inputs, which leads to faster learning in comparison to sigmoid and tanh, whose gradients come close to 0 for large positive and large negative inputs. Since the speed of our learning is linked to the derivative, a derivative close to zero will result in slow learning.

See also [this page on stackexchange](https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks).

In [None]:
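def relu(z, leaky=False, a=0.01):
    '''
    ReLU for a single number z: return z if it is positive, otherwise 0.
    With leaky=True, negative inputs are scaled by a instead of zeroed.
    '''
    if z > 0:
        return z
    elif not leaky:
        return 0
    else:
        return a*z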

In [None]:
# Coding ReLU:

X = np.linspace(-10, 10, 200)

y_relu = [relu(x) for x in X]

fig, ax = plt.subplots()
ax.plot(X, y_relu)
ax.set_title('ReLU Activation');
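
The `relu` function above handles one number at a time, which is why it is called inside a list comprehension. A possible vectorized variant is sketched below; the name `relu_vec` is hypothetical and the function is an illustration rather than part of any library.

```python
import numpy as np

def relu_vec(z, leaky=False, a=0.01):
    """Elementwise ReLU (or leaky ReLU) for scalars or arrays."""
    z = np.asarray(z, dtype=float)
    if leaky:
        # keep positive values, scale the rest by the small slope a
        return np.where(z > 0, z, a * z)
    # standard ReLU: clip everything below zero to zero
    return np.maximum(z, 0.0)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu_vec(z))
print(relu_vec(z, leaky=True))
```

Working on whole arrays lets the same function be applied directly to the output of a layer, such as a matrix of summed inputs.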

#### Swish

$f(x) = \frac{x}{1+e^{-x}}$

Newer on the scene, this function has lots of promise since it looks much like ReLU but is smooth, and its derivative is not stuck at zero for negative inputs. See [here](https://medium.com/@neuralnets/swish-activation-function-by-google-53e1ea86f820) for more.

In [None]:
def swish(z):
    return z/(1+np.exp(-z))

In [None]:
# Coding Swish

X = np.linspace(-10, 10, 200)

y_swish = [swish(x) for x in X]

fig, ax = plt.subplots()
ax.plot(X, y_swish)
ax.set_title('Swish Activation');
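
To see how Swish behaves for negative inputs, the sketch below computes its derivative with the product rule, $f'(z) = \sigma(z) + z\,\sigma(z)(1 - \sigma(z))$, and checks it against a finite-difference estimate. The helper name `swish_grad` and the sample inputs are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

def swish_grad(z):
    # product rule applied to z * sigmoid(z)
    s = sigmoid(z)
    return s + z * s * (1 - s)

z = np.array([-3.0, -1.0, 0.0, 2.0])

# Central finite differences as an independent check on the formula
eps = 1e-6
numeric = (swish(z + eps) - swish(z - eps)) / (2 * eps)

print(swish_grad(z))
print(np.allclose(swish_grad(z), numeric, atol=1e-6))
```

ReLU's gradient is exactly 0 at inputs such as -3 and -1, while the Swish gradient there is small but not zero, so those nodes can still be adjusted.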

#### Softmax

$\large f(x_0) = \frac{e^{x_0}}{\sum_{x \in X} e^{x}}$

Because this function relates each value to the sum of all values, this is the appropriate activation in the output layer for **multi-class** classification problems. We can interpret the function as outputting the probabilities of belonging to each class.
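
A minimal sketch of softmax in NumPy follows. Subtracting each row's maximum before exponentiating is a common numerical-stability trick and does not change the result; the example scores are made up.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])  # large scores would overflow a naive exp

probs = softmax(scores)
print(probs)
print(probs.sum(axis=1))   # each row sums to 1
```

The largest score in a row always gets the largest probability, and equal scores share the probability evenly.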

There are other activation functions; [see here](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a). 

Our nodes will be taking in input from multiple sources. Let's add the entire training set as our input. 


In [None]:
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target,
                                                    random_state=42,
                                                    test_size=0.2)
X_train.shape

In [None]:
X_train

In [None]:
X_train[0, :].reshape(8, 8)

In [None]:
y_train[0]

In [None]:
w_1 = np.random.normal(0, 0.1, (X_train.shape[1], 4))
w_1.shape

In [None]:
w_1

In [None]:
b_1 = 0

In [None]:
z_1 = X_train.dot(w_1) + b_1
z_1

In [None]:
len(z_1)

In [None]:
a_1 = sigmoid(z_1)
a_1

Now each of these neurons has a set of weights and a bias associated with it.

In [None]:
w_2 = np.random.normal(0, 0.1, (a_1.shape[1], 1))

w_2.shape

In [None]:
b_2 = 0

In [None]:
z_2 = a_1.dot(w_2) + b_2
z_2

In [None]:
len(z_2)

In [None]:
output = sigmoid(z_2)
y_pred = output > 0.5
y_hat = y_pred.astype(int)
y_hat[:5]
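
The forward pass above can be wrapped into one function. The sketch below is a variation rather than a copy of the cells above: it assumes 10 output nodes with a softmax activation so that each digit class gets its own probability, and it uses random stand-in "images" so that it runs on its own. The weights are untrained, so the predictions are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def forward(X, w_1, b_1, w_2, b_2):
    """One hidden layer (sigmoid) followed by a softmax output layer."""
    a_1 = sigmoid(X @ w_1 + b_1)      # hidden activations
    return softmax(a_1 @ w_2 + b_2)   # one probability per class

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 17, size=(5, 64)).astype(float)  # 5 stand-in "images"

w_1 = rng.normal(0, 0.1, (64, 4))   # 64 pixels -> 4 hidden nodes
b_1 = np.zeros(4)
w_2 = rng.normal(0, 0.1, (4, 10))   # 4 hidden nodes -> 10 digit classes
b_2 = np.zeros(10)

probs = forward(X_demo, w_1, b_1, w_2, b_2)
y_hat = probs.argmax(axis=1)        # most probable class for each row
print(probs.shape, y_hat)
```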

## Backpropagation

The **backpropagation** algorithm works out how to adjust the parameters (weights and biases) to get a better result.

We do this tuning by propagating the (average) error back through the network, with the cost function $J$ guiding us and adjusting via gradient descent.

> Turn down previous neurons that give a bad result
>
> Turn up previous neurons that give a good result

> Great video explanation of backpropagation by 3Blue1Brown (part of a full playlist): [Backpropagation calculus | Deep learning, chapter 4](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)

![](images/neural_network_graph_3blue1brown.png)
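
As a rough sketch of the idea, the code below trains a tiny network (2 inputs, 4 hidden nodes, 1 output) with hand-written backpropagation. Everything specific here is an assumption made for illustration: the toy data, the layer sizes, the sigmoid activations, binary cross-entropy as the cost function $J$, and the learning rate.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy data: points in the unit square, labeled by a linear rule
X = rng.random((200, 2))
y = ((X[:, 0] + X[:, 1]) <= 0.8).astype(float).reshape(-1, 1)
n = len(X)

# Small random weights, zero biases: 2 inputs -> 4 hidden -> 1 output
w_1, b_1 = rng.normal(0, 0.1, (2, 4)), np.zeros(4)
w_2, b_2 = rng.normal(0, 0.1, (4, 1)), np.zeros(1)

alpha = 0.5   # learning rate (assumed value)
losses = []
for epoch in range(500):
    # Forward propagation
    a_1 = sigmoid(X @ w_1 + b_1)
    a_2 = sigmoid(a_1 @ w_2 + b_2)
    losses.append(-np.mean(y * np.log(a_2) + (1 - y) * np.log(1 - a_2)))

    # Backpropagation: chain rule from the output back toward the input
    dz_2 = (a_2 - y) / n                       # dJ/dz_2 for sigmoid + cross-entropy
    dw_2, db_2 = a_1.T @ dz_2, dz_2.sum(axis=0)
    dz_1 = (dz_2 @ w_2.T) * a_1 * (1 - a_1)    # sigmoid derivative is a(1 - a)
    dw_1, db_1 = X.T @ dz_1, dz_1.sum(axis=0)

    # Gradient descent step
    w_2 -= alpha * dw_2
    b_2 -= alpha * db_2
    w_1 -= alpha * dw_1
    b_1 -= alpha * db_1

print(losses[0], losses[-1])
```

The error at the output (`a_2 - y`) is passed backward through `w_2` to assign each hidden node its share of the blame, which is the "turn down / turn up" intuition expressed in code.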


# Let's see it in action!

Now that we know the different parts, let's try it out for ourselves!

- [playground.tensorflow.org](https://playground.tensorflow.org): A visual playground for us to train a neural network