# Introduction to Neural Networks

![dense](images/dogcat.gif)

## Objectives
- Describe the basic structure of densely connected neural networks
- Describe the concept of backpropagation
- Explain the use of gradient descent in neural networks
- Use `keras` to code up a neural network model


## Background

Neural networks have been around for a while. They are over 70 years old, dating back to their proposal in 1943 by Warren McCulloch and Walter Pitts. These first proposed neural nets had thresholds and weights, but no layers and no specific training mechanisms.

The "perceptron", the first trainable neural network, was created by Frank Rosenblatt in 1957. It consisted of a single layer with adjustable weights between the input and output layers.

![perceptron](images/nn-diagram.png)

## Wait, Wait, Wait... Why a Neural Network?

You really should take a second to realize what tools we already have and ask yourself, "Do we really need to use this 'neural network' if we already have so many machine learning algorithms?"

In short, we don't need to default to a neural network, but they have advantages in solving very complex problems. It might help to know that the idea of the perceptron was developed back in the 1950s. It wasn't until we had a lot of data and computational power that neural networks became reasonably useful.

### Let's Talk About Interpretability...

![](images/accuracy_vs_interpretability.png)

[Image Source](https://medium.com/ansaro-blog/interpreting-machine-learning-models-1234d735d6c9g/)

### And yet, the pull of deep learning is strong...

<img src='images/move_on.jpg' width=350/>

## Applications of Neural Networks

- Clustering
- Pattern Recognition
- Image Recognition (CNN)
- Time Series Forecasting (RNN)
- Audio/Video/Image Generation (GAN) 

#### Limitations
- Good for prediction, bad for inference 
- Computationally expensive 

## Starting with a Perceptron

### A Diagram

<img src='https://cdn-images-1.medium.com/max/1600/0*No3vRruq7Dd4sxdn.png' width=40%/>

Notice the similarity to a linear regression:


$$ x_1 w_1 + x_2 w_2  + x_3 w_3 = \text{output}$$
$$ XW = \text{output}$$
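
To make the weighted sum concrete, here is a minimal NumPy sketch. The input and weight values are made up for illustration; the point is only that the term-by-term sum and the matrix form $XW$ give the same result.

```python
import numpy as np

# Hypothetical values for one row of data with three features
x = np.array([1.0, 2.0, 3.0])     # inputs x_1, x_2, x_3
w = np.array([0.5, -1.0, 0.25])   # weights w_1, w_2, w_3

# Term-by-term weighted sum: x_1*w_1 + x_2*w_2 + x_3*w_3
output_sum = sum(x_i * w_i for x_i, w_i in zip(x, w))

# The same computation written as a dot product
output_dot = x @ w

# Several rows at once: X has shape (n_rows, n_features)
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])
outputs = X @ w   # one output per row

print(output_sum, output_dot, outputs)
```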

## Logistic Regression as a Perceptron

* This is **one row of data**; each input is a different feature
* Weights are determined through gradient descent 
* The **bias** term is our logistic regression intercept term
* The **activation function** is the sigmoid function that forces output values between 0 and 1
* Output is our classification result

![](https://miro.medium.com/max/1280/1*8VSBCaqL2XeSCZQe_BAyVA.jpeg)


* The perceptron algorithm is about learning the weights for inputs in order to draw a **linear decision boundary** that allows us to discriminate between two linearly separable classes
* A perceptron takes in inputs, sums them up with weights, adds a bias, applies some activation function --> output
* You can have different activation functions (sigmoid, tanh, ReLU, etc.)
* Many perceptrons put together create a neural network
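
The steps in the list above (weighted sum, plus bias, through an activation) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a trained model: the function name `perceptron` and all input, weight, and bias values are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = np.dot(x, w) + b
    return activation(z)

# Hypothetical single row of data, weights, and bias
x = np.array([2.0, -1.0, 0.5])
w = np.array([0.4, 0.3, -0.2])
b = 0.1

prob = perceptron(x, w, b)   # sigmoid output, read as P(class = 1)
label = int(prob >= 0.5)     # threshold at 0.5 to get a class label
print(prob, label)
```

With the sigmoid activation this is the logistic regression prediction step; swapping in a different `activation` changes only the last stage.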

<img src='images/perceptron_binary.png'/>


## Basic Architecture

For our DS purposes, we'll generally imagine our network to consist of only a few layers, including an input layer (where we feed in our data) and an output layer (comprising our predictions). Significantly, there will also (generally) be one or more layers of neurons between input and output, called **hidden layers**.

One reason these are named hidden layers is that what their output actually represents is _not really known_.  The activation of node 1 of the first hidden layer may represent a sequence of pixel intensity corresponding to a horizontal line, or a group of dark pixels in the middle of a number's loop... etc etc.

![dense](images/Deeper_network.jpg)

Because we are unaware of how exactly these hidden layers are operating, neural networks are considered **black box** algorithms.  You will not be able to gain much inferential insight from a neural net.

Each of our pixels from our digit representation goes to each of our nodes, and each node has a set of weights and a bias term associated with it.

## Inspiration from Actual Neurons

The composition of neural networks can be **loosely** compared to a neuron.

![neuron](images/neuron.png)

Neural networks draw their inspiration from the biology of our own brains, which are of course also accurately described as 'neural networks'. A human brain contains around $10^{11}$ neurons, connected very **densely**.

This is a loose analogy, but can be a helpful **mnemonic**. The inputs to our node are like inputs to our neurons. They are either direct sensory information (our features) or input from other axons (nodes passing information to other nodes). The body of our neuron (soma) is where the signals of the dendrites are summed together, which is loosely analogous to our **collector function**. If the summed signal is large enough (our **activation function**), it triggers an action potential which travels down the axon to be passed as output to other dendrites. See [here](https://en.wikipedia.org/wiki/Neuron) for more.

## Parts of a Neural Network

### Layers

- **Input Layer**: the initial inputs (these will be the features we feed to our network)
- **Output Layer**: the classification (or regression) predictions
- **Hidden Layer(s)**: the layers of neurons between input and output, which let the network find more complex patterns

### Weights

The weights on our inputs describe how much each input should contribute to the next neuron.

We can also think of the weights on hidden layer neurons as telling us how much of each of these linear separations should be combined.

### Activation Functions

<img src='images/activation.png' width=500/>

The activation function converts our summed inputs into an output, which is then passed on to other nodes in hidden layers, or as an end product in the output layer. This can loosely be thought of as the action potential traveling down the axon.

When we build our models in `keras`, we will specify the activation function of both hidden layers and output.
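
As a rough illustration of what the common activation functions do to a summed input, here is a small NumPy sketch. The helper names and sample values are chosen for this example only; `keras` provides its own implementations.

```python
import numpy as np

def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keeps positive values, replaces negative values with 0."""
    return np.maximum(0.0, z)

# Hypothetical summed inputs arriving at three different nodes
z = np.array([-2.0, 0.0, 2.0])

sig_out = sigmoid(z)     # values between 0 and 1
tanh_out = np.tanh(z)    # values between -1 and 1
relu_out = relu(z)       # values at or above 0

print(sig_out, tanh_out, relu_out)
```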

### Other Hyperparameters

We'll talk more about this when we dive into how to optimize our neural networks, but some hyperparameters include:

- **Learning Rate ($\alpha$)**: how big of a step we take in gradient descent
- **Number of Epochs**: how many full passes we make through the training data
- **Batch Size**: how many data points we pass through the network before each weight update (many batches make up 1 epoch)
    - KEY! This is how often we send results back to update our weights, aka _backpropagation_!

Remember, any setting we adjust to shape how the neural network learns _is_ a hyperparameter (this includes the actual structure of the neural net).

## Let's see it in action!

Now we know the vocabulary of the different parts, let's try it out for ourselves!

First up:
- [playground.tensorflow.org](https://playground.tensorflow.org): A visual playground for us to train a neural network

#### Spaceship Titanic Data!

In [None]:
#Initial imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [None]:
df = pd.read_csv('data/train.csv')
holdout = pd.read_csv('data/test.csv')

In [None]:
X = df.drop(columns='Transported')
y = df['Transported']

#### Still need a Train/Test Split

Note that with neural networks it's much more common to see a train/validation/test split, i.e. three pieces instead of just two.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
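
For reference, one common way to get the three-piece split mentioned above is to call `train_test_split` twice. The sketch below uses small made-up arrays (`X_demo`, `y_demo`) rather than the Spaceship Titanic data, and the 60/20/20 proportions are an assumption, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100) % 2

# First carve off a final test set (20% of all rows)
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Then split the remainder into train and validation
# (25% of the remaining 80% = 20% of all rows)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

The validation set is what we watch while tuning the network; the test set stays untouched until the very end.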

In [None]:
# train-test split causing two cols to be bool instead of obj
# fixing to pre-empt a type error 
map_bools = {True: 'True', False:'False', np.nan:np.nan}
for col in ['VIP', 'CryoSleep']:
    for dataset in [X_train, X_test, holdout]:
        dataset[col] = dataset[col].map(map_bools)

In [None]:
# Just going to run our test on num_cols
num_cols = [col for col in X_train.columns if X_train[col].dtype != 'O']

In [None]:
X_train[num_cols].info()

In [None]:
X_train[num_cols].head()

In [None]:
# We'll use two different imputation strategies
median_num_col = ['Age']

zero_num_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

In [None]:
# Creating our imputer
median_imp = SimpleImputer(strategy="median")
zero_imp = SimpleImputer(strategy='constant', fill_value = 0)

imputer = ColumnTransformer(
    transformers=[
        ("med", median_imp, median_num_col),
        ("zero", zero_imp, zero_num_cols),
    ]
)

### Logistic Regression to Compare To!

In [None]:
clf = Pipeline(steps=[
    ('imputer', imputer),
    ('scaler', MinMaxScaler()), 
    ('logreg', LogisticRegression())
])

clf.fit(X_train[num_cols], y_train)

train_preds = clf.predict(X_train[num_cols])
test_preds = clf.predict(X_test[num_cols])

print(f"Train accuracy score: {clf.score(X_train[num_cols], y_train)}")

print(f"Test accuracy score: {clf.score(X_test[num_cols], y_test)}")

## Define Our First Model!
Models in Keras are defined as a sequence of layers. We create a Sequential model and add layers one at a time until we are happy with our network topology. Documentation: https://keras.io/guides/sequential_model/

*   Metrics: https://keras.io/api/metrics/
*   Optimizers: https://keras.io/api/optimizers/
*   Loss: https://keras.io/api/losses/

In [None]:
# Preprocess our data first
# NNs are trained with gradient descent, which is sensitive to feature scale - we still need to scale!
preprocessor = Pipeline(steps=[
    ('imputer', imputer),
    ('scaler', MinMaxScaler()),
])

preprocessor.fit(X_train[num_cols])

X_tr_pr = preprocessor.transform(X_train[num_cols])
X_te_pr = preprocessor.transform(X_test[num_cols])

In [None]:
X_tr_pr.shape

In [None]:
# Need more imports!
import tensorflow as tf
from tensorflow.keras import Sequential, regularizers
from tensorflow.keras.layers import Dense, Dropout

> Note! You may also see `keras` as its own separate library, but it's been integrated into tensorflow since TF V2.0 - you should make sure to use `keras` from the tensorflow library!
>
> [Source, which includes interesting reading on the history of tensorflow and keras](https://pyimagesearch.com/2019/10/21/keras-vs-tf-keras-whats-the-difference-in-tensorflow-2-0/)

A common way to build models in tensorflow/keras is to create an empty base model and then add layers in order - so that's what we'll do!

We'll create a NN with an input layer, one hidden layer, and then an output layer.

In [None]:
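# Create our base, empty Sequential model
model = Sequential()

# Add a dense input layer -- model.add(Dense())
# 12 nodes, input_dim = our # of cols, activation = 'relu'
model.add(Dense(12, input_dim=X_tr_pr.shape[1], activation='relu'))

# Add another 12 node dense layer with relu - no need for input_dim
model.add(Dense(12, activation='relu'))

# output layer - dense layer with 1 node, activation = 'sigmoid'
model.add(Dense(1, activation='sigmoid'))

# And then compile our model! -- model.compile()
# loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy']
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])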

In [None]:
model.summary()
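
The parameter counts a summary reports for dense layers can be worked out by hand: each node has one weight per incoming value plus one bias. The sketch below does that arithmetic for the 12-12-1 architecture described in the comments above, assuming 6 input features (the six numeric columns we imputed); `dense_param_count` is a helper written for this example, not a `keras` function.

```python
def dense_param_count(n_inputs, n_nodes):
    """One weight per input per node, plus one bias per node."""
    return n_inputs * n_nodes + n_nodes

n_features = 6             # assumption: six numeric input columns
layer_sizes = [12, 12, 1]  # two hidden-style dense layers, then the output node

counts = []
n_in = n_features
for n_nodes in layer_sizes:
    counts.append(dense_param_count(n_in, n_nodes))
    n_in = n_nodes         # this layer's outputs feed the next layer

total_params = sum(counts)
print(counts, total_params)
```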

In [None]:
# We then fit our model 
history = model.fit(X_tr_pr,        # Features
                    y_train,        # Target
                    epochs=100,     # Number of epochs
                    verbose=2,      # Verbosity level - Some output
                    batch_size=100, # Number of observations per batch
                    validation_data=(X_te_pr, y_test)) # Data for evaluation

Let's discuss this summary output...

In [None]:
# Get training and test loss/accuracy histories
training_loss = history.history['loss']
test_loss = history.history['val_loss']

training_acc = history.history['accuracy']
test_acc = history.history['val_accuracy']

# Create count of the number of epochs
epoch_count = range(1, len(training_loss) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# Visualize loss history
ax1.plot(epoch_count, training_loss, 'r--')
ax1.plot(epoch_count, test_loss, 'b-')
ax1.legend(['Training Loss', 'Test Loss'])

# Visualize accuracy  history
ax2.plot(epoch_count, training_acc, 'r--')
ax2.plot(epoch_count, test_acc, 'b-')
ax2.legend(['Training Accuracy', 'Test Accuracy']);

In [None]:
print(f"LogReg's accuracy score: {clf.score(X_test[num_cols], y_test)}")
print(f"Simple NN's best accuracy score: {max(test_acc)}")

### Discuss!

- 


## Break It Down

Now that we've built our first model and seen it in action, let's discuss some of those pieces.

### Loss Functions

First up, let's talk about our loss function.

Loss functions are akin to the cost functions we have tried to minimize before (e.g. RMSE for linear regression, Gini/entropy for trees).

1. For regression problems, keras has **mean_squared_error** or **mean_absolute_error** as a loss function, or **mean_squared_logarithmic_error** if your target has potential outliers
2. For binary classification: **binary_crossentropy** (what we used above!)
3. For multiclass problems: **categorical_crossentropy**

[This article summarizes the above, and more.](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/)
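
To see what binary cross-entropy rewards and punishes, here is a minimal NumPy sketch of the formula $-\frac{1}{n}\sum \left[ y \log(p) + (1 - y) \log(1 - p) \right]$. The function below is written for illustration (the clipping constant `eps` is an assumption to avoid `log(0)`), and the labels and predicted probabilities are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical labels and two sets of predicted probabilities
y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident predictions that are wrong are penalized far more heavily than confident predictions that are right, which is what pushes the network toward well-calibrated probabilities.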



### Gradient Descent in Neural Networks
Neural nets are usually trained at scale on large datasets, so optimizing for speed becomes a big concern. Gradient descent can take a very **very** long time to run if we update the weights and biases after every single training example. Therefore, we usually process the data in batches:

- **Batch**: 
A batch is the group of training examples we pass through forward propagation before using backpropagation to update the weights and biases. In full batch gradient descent, that group is the entire training set.

- **Epoch**: 
One epoch is complete when every training example has been passed through the network once.


We set our epoch and batch sizes when fitting the model - above, we used 100 for each. That means we broke our data down into chunks of 100 training observations - that's *one batch*. When ALL batches have gone through, that's *one epoch*!
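
The relationship between batch size, epochs, and weight updates is simple arithmetic. In the sketch below, `n_samples` is a hypothetical training set size (substitute `X_tr_pr.shape[0]` for the real one), and `updates_per_epoch` is a helper written for this example.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches (and so weight updates) in one epoch.

    The last batch may be smaller than batch_size, hence the ceiling.
    """
    return math.ceil(n_samples / batch_size)

n_samples = 6519    # hypothetical number of training rows
batch_size = 100
epochs = 100

batches_per_epoch = updates_per_epoch(n_samples, batch_size)
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)
```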


#### Types of Gradient Descent
- Stochastic Gradient Descent 

    - SGD calculates the error and updates the weights after each individual observation in the training set.

- Batch Gradient Descent

    - Batch gradient descent calculates the error for every observation, but only updates the weights after all of the observations in the training set have been passed through.

- Mini-Batch Gradient Descent

    - Mini-batch gradient descent is a compromise between batch and SGD: it splits the training examples into mini-batches, then calculates the error and updates the weights after each mini-batch has been passed through.
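
The three variants above differ only in how many observations go into each weight update. Here is a minimal NumPy sketch that trains a logistic regression (a single sigmoid node) on made-up data. The function `train_logistic`, the toy data, and the learning rate are all assumptions for illustration, and this is not how `keras` implements its optimizers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, batch_size, epochs=5, learning_rate=0.5, seed=0):
    """Fit one sigmoid node with gradient descent.

    batch_size=1 is stochastic, batch_size=len(X) is batch,
    anything in between is mini-batch.
    Returns the weights, the bias, and the number of weight updates made.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    n_updates = 0
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # shuffle each epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = sigmoid(X[idx] @ w + b) - y[idx]   # prediction error on this batch
            w -= learning_rate * (X[idx].T @ error) / len(idx)
            b -= learning_rate * error.mean()
            n_updates += 1
    return w, b, n_updates

# Hypothetical toy data: one feature, class 1 when the feature is positive
data_rng = np.random.default_rng(42)
X_toy = data_rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] > 0).astype(float)

for name, size in [("stochastic", 1), ("mini-batch", 20), ("batch", 200)]:
    w, b, n_updates = train_logistic(X_toy, y_toy, batch_size=size)
    acc = np.mean((sigmoid(X_toy @ w + b) >= 0.5) == y_toy)
    print(f"{name:>10}: {n_updates:4d} updates, accuracy {acc:.2f}")
```

Over the same number of epochs, smaller batches mean many more (noisier) updates, while full batch makes a single smooth update per epoch.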


#### Forward Propagation

Forward propagation is how data moves through the network, from the initial input layer through any hidden layers to the output layer.

Before the first pass, when we feed the node values forward through the layers, we initialize the weights with *random* values and the biases to zero.
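
A forward pass through one hidden layer can be sketched in NumPy as follows. The layer sizes, the scale of the random weights, and the input values are assumptions for illustration; `keras` chooses its own initialization scheme.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, n_hidden = 6, 12   # hypothetical layer sizes

# Initialize weights with small random values and biases with zeros
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Input layer -> hidden layer (relu) -> output layer (sigmoid)."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

# Hypothetical batch of 4 rows, already scaled to [0, 1)
X_batch = rng.random(size=(4, n_features))
probs = forward(X_batch)
print(probs.shape)
```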

#### Back Propagation

After a certain number of data points have been passed through the model (batch), the weights will be *updated* with an eye toward optimizing our loss function. (Thinking back to biological neurons, this is like revising their activation potentials.) Backpropagation works backward from the output layer, using the chain rule to compute how much each weight contributed to the loss; the weights are then adjusted, typically by some version of gradient descent.
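
To show the idea end to end, here is a minimal NumPy sketch of one backpropagation step for a tiny network with one `tanh` hidden layer and a sigmoid output, using binary cross-entropy as the loss. Every name, size, and value here is an assumption for illustration; `keras` computes these gradients automatically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    hidden = np.tanh(X @ W1 + b1)
    probs = sigmoid(hidden @ W2 + b2)
    return hidden, probs

def loss_fn(y, probs):
    """Binary cross-entropy averaged over the batch."""
    return -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))

def backward(X, y, params):
    """Gradient of the loss for every parameter, via the chain rule."""
    W1, b1, W2, b2 = params
    hidden, probs = forward(X, params)
    d_out = (probs - y) / len(X)                    # loss gradient at the output node
    grad_W2 = hidden.T @ d_out
    grad_b2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * (1 - hidden ** 2)   # pass back through tanh
    grad_W1 = X.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0)
    return [grad_W1, grad_b1, grad_W2, grad_b2]

# Hypothetical batch: 8 rows, 3 features, binary target
rng = np.random.default_rng(1)
X_batch = rng.normal(size=(8, 3))
y_batch = rng.integers(0, 2, size=(8, 1)).astype(float)
params = [rng.normal(scale=0.5, size=(3, 4)), np.zeros(4),
          rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)]

loss_before = loss_fn(y_batch, forward(X_batch, params)[1])

# One gradient descent update using the backpropagated gradients
learning_rate = 0.1
grads = backward(X_batch, y_batch, params)
params = [p - learning_rate * g for p, g in zip(params, grads)]

loss_after = loss_fn(y_batch, forward(X_batch, params)[1])
print(loss_before, loss_after)
```

Repeating this forward pass, backward pass, and update for every batch, over many epochs, is what `model.fit` is doing for us.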

#### Overview of the Forward & Back Propagation Process

![backprop](images/ff-bb.gif)

There are a lot more pieces we'll continue to explore, but for now let's copy our model from above and adjust some pieces to see how its performance changes!

In [None]:
# Code here to iterate!
model_2 = None



In [None]:
history_2 = None

In [None]:
# Get training and test loss/accuracy histories
training_loss_2 = history_2.history['loss']
test_loss_2 = history_2.history['val_loss']

training_acc_2 = history_2.history['accuracy']
test_acc_2 = history_2.history['val_accuracy']

# Create count of the number of epochs
epoch_count = range(1, len(training_loss_2) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# Visualize loss history
ax1.plot(epoch_count, training_loss_2, 'r--')
ax1.plot(epoch_count, test_loss_2, 'b-')
ax1.legend(['Training Loss', 'Test Loss'])

# Visualize accuracy  history
ax2.plot(epoch_count, training_acc_2, 'r--')
ax2.plot(epoch_count, test_acc_2, 'b-')
ax2.legend(['Training Accuracy', 'Test Accuracy']);

In [None]:
print(f"LogReg's accuracy score: {clf.score(X_test[num_cols], y_test)}")
print(f"Simple NN's best accuracy score: {max(test_acc)}")
print(f"Second NN's best accuracy score: {max(test_acc_2)}")

## Resources

- A very basic, visual intro to neural networks: https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/
- Great video explanation of backpropagation by 3Blue1Brown (part of a full playlist): [Backpropagation calculus | Deep learning, chapter 4](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)
- These are all neural networks! [The Neural Network Zoo](https://www.asimovinstitute.org/neural-network-zoo/)
- Tips and tricks from Stanford (CS 230 - Deep Learning): https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks#good-practices

#### Deep Learning Courses:

* Google's Machine Learning Crash Course (which uses tensorflow): https://developers.google.com/machine-learning/crash-course/

* Deep Learning Wizard (which uses pytorch): https://www.deeplearningwizard.com/