# Lecture 3.12: Neural Networks Pt.2

[**Lecture Slides**](https://docs.google.com/presentation/d/1RLRcvClkzBXLrixydfXnEu6b91lqurEeeH7oq2e9Fuc/edit?usp=sharing)

In this lecture, we are going to train a neural network classifier and regressor with keras.

**Learning goals:**

- train a neural network classifier
- visualize training by plotting the loss curve
- reproduce training by setting random seeds
- analyse the effect of network architecture on training speed
- train a neural network regressor

## 1. Setup

This notebook uses the keras and tensorflow deep learning libraries. If you haven't already, please follow the setup steps in the last notebook (3.11) to correctly install these dependencies.

## 2. Data Munging

Let's find some counterfeiters again! We load the banknote authentication dataset in a `DataFrame`:

In [None]:
import pandas as pd

df = pd.read_csv('bank_note.csv')
df.head()

Our features are scaled and ready to train hard! 🏋️‍♀️ We'll use all 4 features since we already went through decision boundary visualization last lecture. This time, our focus is on the _optimization_.

In [None]:
X = df[['feature_1', 'feature_2', 'feature_3', 'feature_4']].values
y = df['is_fake'].values

## 3. Training

Before we can run gradient descent and backpropagation, we need to create the structure of our neural network. Just like last lecture, this can be done with keras' `Sequential` api.

We'll try a neural network with 2 `Dense` hidden layers of 6 neurons each. We'll use one of the most common activation functions, `relu`, popular for its optimization speed and regularization properties (more info [here](https://datascience.stackexchange.com/questions/23493/why-relu-is-better-than-the-other-activation-functions)).  
The input dimension is 4 since we are using 4 features.  
The output layer has 1 `sigmoid` neuron since we are solving a _binary_ classification task.

In [None]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(6, activation='relu', input_dim=4),
    Dense(6, activation='relu'),
    Dense(1, activation='sigmoid')
])

We can investigate the neural network structure with keras' `.summary()` method:

In [None]:
model.summary()

🧠 Do these parameter counts still make sense from last lecture?
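
One way to check the counts by hand: each neuron of a `Dense` layer has one weight per input, plus one bias weight. The snippet below is a minimal sketch of that arithmetic. The helper `dense_param_count` is a hypothetical name introduced for illustration, it is not part of keras.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Return the number of trainable parameters of each Dense layer."""
    counts = []
    for n_units in layer_sizes:
        # one weight per input plus one bias weight, for every neuron
        counts.append((n_inputs + 1) * n_units)
        # this layer's outputs are the next layer's inputs
        n_inputs = n_units
    return counts


# 4 features -> 6 relu neurons -> 6 relu neurons -> 1 sigmoid neuron
per_layer = dense_param_count(4, [6, 6, 1])
print(per_layer, sum(per_layer))  # [30, 42, 7] 79
```

The per-layer numbers should line up with the `Param #` column printed by `.summary()`.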

What are the values of these 79 model weights? We haven't _trained_ our model yet, so the $\theta$s have not been optimized. However, when setting up the structure of our neural network, keras has _initialized_ its weights.

Recall from lecture 3.5 that before gradient descent can iteratively update the $\theta$s, their values must be _randomly initialized_. It's like choosing a starting point on the loss function mountain from which to go downhill, towards the minimum of $J$. 🏔 It might sound strange to pick a _random_ start, but in practice this turns out to be a good idea. We'll see more next lecture about how randomness can help neural network optimization.

There are [many ways](https://keras.io/api/layers/initializers/) of randomly initializing weights, but the keras default is pretty good. So let's not worry about this hyperparameter just yet, and have a look at our randomly initialized model parameters.
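
For the curious, the default for `Dense` layers is the `glorot_uniform` initializer for the weights, with the bias weights starting at zero. The snippet below is a rough NumPy sketch of the idea behind it, not keras' actual implementation. The function name and the seed are assumptions made for this example.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    # the range shrinks as the layer gets more inputs and outputs
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


rng = np.random.default_rng(0)  # arbitrary seed, only for this sketch

# first hidden layer of our network: 4 features -> 6 neurons
first_layer_weights = glorot_uniform(4, 6, rng)
first_layer_biases = np.zeros(6)  # bias weights start at zero

print(first_layer_weights.shape, first_layer_biases.shape)
```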

We can use the keras method [`.get_weights()`](https://keras.io/api/models/model_saving_apis/#get_weights-method). It has a strange way of providing the parameters: it returns a list of alternating weights & bias weights for each successive layer. Recall that bias weights are just the $\theta_{0}$s of each weight vector, acting on the bias nodes.

e.g. for our neural architecture:
```python
# layer1 weights
weights[0]
# layer1 bias weights
weights[1]
# layer2 weights
weights[2]
# layer2 bias weights
weights[3]
# layer3 weights
weights[4]
# layer3 bias weights
weights[5]
```
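
If the alternating list feels awkward, we can pair each weight matrix with its bias vector. Below is a small sketch with a hypothetical helper, `describe_weights`. To keep it runnable on its own, it uses zero-filled stand-in arrays with the shapes of our network rather than a real keras model.

```python
import numpy as np


def describe_weights(weights):
    """Pair up an alternating [weights, biases, weights, biases, ...] list."""
    layer_info = []
    for i in range(0, len(weights), 2):
        kernel, bias = weights[i], weights[i + 1]
        layer_info.append({
            "layer": i // 2 + 1,
            "weights_shape": kernel.shape,  # (inputs, neurons)
            "bias_shape": bias.shape,       # (neurons,)
        })
    return layer_info


# stand-in arrays shaped like our 4 -> 6 -> 6 -> 1 network (assumption)
fake_weights = [
    np.zeros((4, 6)), np.zeros(6),
    np.zeros((6, 6)), np.zeros(6),
    np.zeros((6, 1)), np.zeros(1),
]

for info in describe_weights(fake_weights):
    print(info)
```

With a real model, the same helper could be called as `describe_weights(model.get_weights())`.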

Since there are 79 weights total, let's just peek at the weights of the first layer:

In [None]:
weights = model.get_weights()
weights[0]

These look pretty random indeed! 🤪 The first row is all the $\theta_{1}$s mapping the _first_ feature to our six hidden neurons, the second row is all the $\theta_{2}$s mapping the _second_ feature to our six hidden neurons, etc.

Now that we have initialized our neural network, we must _compile_ it. This is just a way of configuring the model for training. For example, we haven't specified our _loss function_ yet, so keras has no idea if we're trying to solve a classification or a regression task here.

Compilation is done with the ... [`.compile()`](https://keras.io/api/models/model_training_apis/#compile-method) method:

In [None]:
model.compile(
    loss='binary_crossentropy',
    optimizer='adam'
)

We're using `binary_crossentropy` as our loss function. Combined with the `sigmoid` activation, this means our last layer will be a simple logistic regression layer.
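
As a refresher, here is a minimal NumPy sketch of what a sigmoid output and a binary cross-entropy loss compute. It is written for illustration only: the function names, the clipping constant `eps` and the toy numbers are assumptions, and keras' own implementation differs in its details.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels."""
    # clip to avoid log(0) for overconfident predictions
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# toy example: three pre-activation outputs and their true labels
z = np.array([2.0, -1.0, 0.0])
y_true = np.array([1.0, 0.0, 1.0])

toy_loss = binary_crossentropy(y_true, sigmoid(z))
print(toy_loss)
```

The loss is small when the predicted probabilities agree with the labels, and grows quickly for confident mistakes.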

... who is `adam` though? 🤷‍♂️ Neural network optimization is tricky, and there are many flavours of gradient descent available. More on this next lecture, where we will get to know `adam` better. 🤝

Until then, let's train our model! Now that it is configured for training, keras has all the necessary information to start optimizing the weights. Similarly to the sklearn api, we use the [`.fit()`](https://keras.io/api/models/model_training_apis/#fit-method) method.

Once again we have to sneak in a couple of extra arguments: 
- In this example, `epochs` is the number of gradient descent iterations. Keras doesn't automatically decide when the loss has converged, so we have to specify a cut-off. The full definition of `epochs` will be provided next lecture.
- `batch_size` specifies the number of examples used for each gradient descent step. We haven't talked about why changing this hyperparameter might help with optimization, so more on this next lecture.
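
To see why `epochs` equals the number of gradient descent steps here, we can count the steps ourselves: every epoch goes through the data once, in chunks of `batch_size`. The helper `gradient_steps` and the dataset size below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def gradient_steps(n_examples, batch_size, epochs):
    """Total number of gradient descent updates performed during training."""
    # the last batch of an epoch may be smaller, hence the ceiling
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


n_examples = 1000  # hypothetical dataset size

# one batch holds the whole dataset: one update per epoch
print(gradient_steps(n_examples, batch_size=n_examples, epochs=2000))  # 2000

# smaller batches: many updates per epoch
print(gradient_steps(n_examples, batch_size=32, epochs=2000))  # 64000
```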

In [None]:
history = model.fit(X, y, epochs=2000, batch_size=len(X))

Wow that took a while! ⌛️ Keras did have a lot of work to do...

🧠🧠 List all the steps that keras went through to `.fit()` this neural network. 

keras printed out a lot of information. The part we are most interested in is the `loss` value at each step of gradient descent. It looks like it's decreasing throughout the optimization, which is a good sign! We can visualize this directly by plotting the [loss curve](https://developers.google.com/machine-learning/testing-debugging/metrics/interpretic).

A [History callback](https://keras.io/api/callbacks/) stored a bunch of training information at each training epoch, which we can access as follows:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

loss = history.history['loss']

fig = plt.figure(dpi=120)
ax = fig.add_subplot(111)
ax.plot(loss)
ax.set_xlabel('epoch')
ax.set_ylabel('loss')
ax.set_title('Loss Curve');

Plotting the loss curve is an effective way of checking that everything went smoothly during training (here's some extra [tips](https://developers.google.com/machine-learning/testing-debugging/metrics/interpretic) on how to interpret different loss curve profiles). Our loss curve is decreasing, and converges towards the end of our optimization. This means that gradient descent is complete and that the model is fully trained!
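
Eyeballing the plot is usually enough, but we could also check for a plateau numerically. The sketch below uses a hypothetical helper, `has_plateaued`, with arbitrary `window` and `tol` values, and synthetic loss curves instead of a real training history. Treat it as a rough heuristic rather than a definitive convergence test.

```python
import math


def has_plateaued(loss, window=50, tol=1e-3):
    """True if the loss improved by less than `tol` (relative) over the last `window` epochs."""
    if len(loss) <= window:
        return False  # not enough epochs to judge
    old, new = loss[-window - 1], loss[-1]
    relative_improvement = (old - new) / max(abs(old), 1e-12)
    return relative_improvement < tol


# synthetic loss curves: exponential decay towards a floor of 0.1
long_run = [0.1 + math.exp(-0.01 * epoch) for epoch in range(2000)]
short_run = long_run[:200]

print(has_plateaued(long_run))   # True
print(has_plateaued(short_run))  # False
```

With a real training run, the list to pass in would be `history.history['loss']`.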

🧠 What should you do if the loss curve is still strongly decreasing in your final epochs?

We won't visualize the model's decision boundary or analyse its predictions in this notebook, since we've already done this in notebook 3.11. However, we can peek at our updated model weights:

In [None]:
model.get_weights()[0]

These values are different from our randomly initialized weights because they were _optimized_ by keras during training.

🧠🧠 Do these model parameters look overfit? Why?

## 4. Analysis

### 4.1 Training Reproducibility

Random weight initialization is one of _many_ random processes used in neural network optimization. This means that our training procedure above was not _reproducible_. We can prove this by training two consecutive models. The `create_neural_network()` and `train_neural_network()` functions will prevent us from copy-pasting code 924 times:

In [None]:
def create_neural_network(layers):
    # unpack instead of layers.pop(0), so the caller's list is not modified
    first_layer, *hidden_layers = layers
    model = Sequential()
    model.add(Dense(first_layer, activation='relu', input_dim=4))
    for layer in hidden_layers:
        model.add(Dense(layer, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model


def train_neural_network(X, y, layers, **kwargs):
    model = create_neural_network(layers)
    model.fit(X, y, batch_size=len(X), **kwargs)
    return model

model1 = train_neural_network(X, y, [6, 6], epochs=30, verbose=0)
model2 = train_neural_network(X, y, [6, 6], epochs=30, verbose=0)

print(model1.get_weights()[0])
print(model2.get_weights()[0])

The optimized weights of these two models aren't the same at all! This doesn't necessarily mean that one of these models is bad, but it is still inconvenient if we ever try to recreate some amazing results. 😰

We know that NumPy is THE math library in python, so maybe we can try setting NumPy's random seed before each training to harmonize the results?

In [None]:
import numpy as np

np.random.seed(1337)
model1 = train_neural_network(X, y, [6, 6], epochs=30, verbose=0)

np.random.seed(1337)
model2 = train_neural_network(X, y, [6, 6], epochs=30, verbose=0)

print(model1.get_weights()[0])
print(model2.get_weights()[0])

Still not enough! This doesn't work because we are using a tensorflow backend, and tensorflow is _special_ 💁‍♂️ and uses its own random number generator. We have to reset the _tensorflow_ seed before each training too:

In [None]:
import tensorflow as tf

np.random.seed(1337)
tf.random.set_seed(666)
model1 = train_neural_network(X, y, [6, 6], epochs=30, verbose=0)

np.random.seed(1337)
tf.random.set_seed(666)
model2 = train_neural_network(X, y, [6, 6], epochs=30, verbose=0)

print(model1.get_weights()[0])
print(model2.get_weights()[0])

These are close, but not quite the same. The differences are actually due to numerical errors, so our model was reproduced, but our weights are just not that accurate 💘. In fact, we'll see that for complex neural networks trained across several GPUs, it is sometimes impossible to train deterministically. This makes it even more important to have solid data engineering, and somewhere to save our model weights. More on this next chapter.
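
Rather than comparing printed weights by eye, we can measure how far apart two sets of weights are. Here is a small sketch with a hypothetical helper, `compare_weights`. The tolerance is an arbitrary choice, and the arrays are random stand-ins rather than real model weights.

```python
import numpy as np


def compare_weights(weights_a, weights_b, atol=1e-6):
    """Return (within_tolerance, largest_absolute_difference) for two weight lists."""
    max_diff = max(
        float(np.max(np.abs(a - b))) for a, b in zip(weights_a, weights_b)
    )
    return max_diff <= atol, max_diff


rng = np.random.default_rng(0)  # arbitrary seed, only for this sketch
w1 = [rng.normal(size=(4, 6)), rng.normal(size=6)]

# simulate tiny numerical differences between two training runs
w2 = [w + 1e-8 for w in w1]
# simulate two genuinely different models
w3 = [w + 0.1 for w in w1]

print(compare_weights(w1, w2))
print(compare_weights(w1, w3))
```

For the models above, this could be called as `compare_weights(model1.get_weights(), model2.get_weights())`.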

### 4.2 Training Speed

In the lecture slides, we have described neural networks as _slow_. 🐌 This is in part due to their large number of parameters. There are many ways of layering these weights into a neural network however, so let's investigate how neural architecture affects training speed.

We'll use our `train_neural_network()` function, and the [`timeit`](https://stackoverflow.com/questions/29280470/what-is-timeit-in-python) magic function will help us measure the performance of the models. We are comparing three neural networks:
- 2 hidden layers of 6 neurons each
- 2 hidden layers of 100 neurons each
- 5 hidden layers of 6 neurons each

We are not interested in successful optimization or loss convergence here, only training times. So we set `epochs=30` to speed things up:

In [None]:
%timeit train_neural_network(X, y, [6, 6], epochs=30, verbose=0)
%timeit train_neural_network(X, y, [100, 100], epochs=30, verbose=0)
%timeit train_neural_network(X, y, [6, 6, 6, 6, 6], epochs=30, verbose=0)

Our `[6, 6]` neural network is the fastest: this makes sense since it's the smallest.
The `[100, 100]` model isn't that much slower however. Moreover, the `[6, 6, 6, 6, 6]` model is the slowest! This might be surprising when looking at the number of model parameters. We're feeling lazy so we'll let keras do the linear algebra, and use `.summary()`:

In [None]:
# .summary() prints the table itself and returns None, so no print() is needed
create_neural_network([6, 6]).summary()
create_neural_network([100, 100]).summary()
create_neural_network([6, 6, 6, 6, 6]).summary()

That's right, training 10701 model weights is faster than training 205! It seems like _layers_ contribute more to overall training time than the number of hidden cells.

Indeed, training time is most impacted by the _structure_ of the neural network. This is not only because deeper layer derivatives require more chained operations using _backpropagation_, but also because many same-layer calculations are _parallelised_. Neural network optimization is a complex problem with many different approaches and solutions, including algorithmic improvements, random methods, hardware solutions, and many more discovered every _week_. 🤯
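
A simplified NumPy forward pass hints at why depth is costly. All the neurons of one layer are computed in a single matrix product, but each layer has to wait for the previous one. This is only a sketch: the function names are hypothetical, it uses `relu` on every layer for simplicity, and it ignores backpropagation and everything keras does under the hood.

```python
import numpy as np


def random_network(sizes, rng):
    """Create random weights and zero biases for layers of the given sizes."""
    weights = [rng.normal(size=(n_in, n_out))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n_out) for n_out in sizes[1:]]
    return weights, biases


def forward(X, weights, biases):
    """Run the inputs through every layer, one after the other."""
    a = X
    for W, b in zip(weights, biases):  # sequential: one step per layer
        a = np.maximum(0, a @ W + b)   # all neurons of the layer at once
    return a


rng = np.random.default_rng(0)   # arbitrary seed, only for this sketch
X_demo = rng.normal(size=(8, 4))  # 8 made-up examples with 4 features

wide = random_network([4, 100, 100, 1], rng)
deep = random_network([4, 6, 6, 6, 6, 6, 1], rng)

print(len(wide[0]), forward(X_demo, *wide).shape)  # 3 sequential steps
print(len(deep[0]), forward(X_demo, *deep).shape)  # 6 sequential steps
```

The wide network has far more weights, but only half as many sequential steps as the deep one.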

Of course, we don't just design neural network architectures because of speed, as the structure has a big effect on model accuracy and overfitting. More on that next chapter!

Next lecture we will go over basic improvements to gradient descent which can help train our models _faster_. 🏎

## 5. Exercises

💪💪💪 Train a neural network regressor on the instagram planning dataset. Some helper functions are supplied so you can visualize the results after training 😎. Here's a list of the steps you should be taking to lead your analysis:

- load the `instagram_planning_norm.csv` dataset into a DataFrame
- optionally visualize this dataset to refresh your memory
- create a feature matrix, X, and a label vector, y
  - no need to standardize the features as they are already scaled
- create a neural architecture with the Sequential api under the variable `model`
  - recommended: 1 hidden layer with 6 neurons, and a `relu` activation
  - think of the input dimensions of your feature vectors
  - think of the size and activation of your output layer: you are solving a regression task
- compile the model with `optimizer='adam'` and `loss='mean_squared_error'`
- fit the model with `batch_size=len(X)`
  - depending on your hidden layer activation function, you will need > 5000 `epochs` for the loss to converge
  - store the output of `.fit()` in a variable called `history`
- unit test your training with the provided code cell
- visualize the predictions and loss curve with the second provided code cell

🧠 Why are we using a `mean_squared_error` loss?

In [None]:
# INSERT CODE HERE

In [None]:
def test_neural_network():
    assert model, "Couldn't find model"
    assert history, "Couldn't find training history"
    # count the recorded loss values, not the keys of the history dict
    n_epochs = len(history.history['loss'])
    assert n_epochs > 10, f"You only trained your model for {n_epochs} epochs. Are you sure that's enough?"
    loss = history.history['loss'][-1]
    assert loss < 15, f"Your loss is {loss}, but it could be lower"
    print('Success! 🎉')
    
test_neural_network()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from plot import plot_regression

sns.set()

def model_viz(X, y, model, history):
    fig = plt.figure(figsize=(12,5), dpi=120)

    ax1 = fig.add_subplot(121)
    plot_regression(ax1, X, y, model)

    ax2 = fig.add_subplot(122)
    ax2.plot(history.history['loss'])
    ax2.set_xlabel('epoch')
    ax2.set_ylabel('loss')
    ax2.set_title('Loss Curve');
    
model_viz(X, y, model, history)

💪💪 Feel free to explore more architectures and try new hyperparameters! You can always use the cell above to visualize their effects.

## 6. Summary

Today we learned about **neural network optimization**. First, we revisited cost functions and **gradient descent** in the context of a neural network. We then explained how the **chain rule** can calculate derivatives of the cost function with respect to weights of the first few layers. We showed how gradients can be calculated this way by **stepping** through the network in the **reverse** direction. We defined **backpropagation** as the optimization algorithm which computes the gradients. It **caches** and reuses derivative terms to make the calculation more efficient. We could then **define** neural networks as **nested non-linear functions** structured as **layered neurons**, whose weights are optimized using **gradient descent** and **backpropagation**. Finally, we trained a neural network classifier and regressor and analysed their loss curves.


# Resources

## Core Resources
- [3Blue1Brown - deep learning calculus](https://youtu.be/tIeHLnjs5U8)  
Amazing channel with an outstanding video which derives gradient descent for neural networks. The whole series is excellent.

## Additional Resources
- [Why ReLU is better than other activation functions?](https://datascience.stackexchange.com/questions/23493/why-relu-is-better-than-the-other-activation-functions)  
Stackexchange thread explaining the popularity of ReLU as neural activation function
- [thinc backpropagation](https://thinc.ai/docs/backprop101)  
Alternative approach to explaining backpropagation, from a software engineering perspective
- [Interpreting the loss curve](https://developers.google.com/machine-learning/testing-debugging/metrics/interpretic)  
Typical loss curve profiles and how to interpret them
