# Neural Networks

In [None]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
from seaborn import set_style

## This sets the plot style
## to have a grid on a white background
set_style("whitegrid")

##### 1. Softmax activation function

When performing multiclass classification (like with the MNIST data set) we used the softmax activation function. In lecture I referred to this as the multiclass equivalent of the sigmoid function; let's see why now.

Suppose we have some vector $z = (z_1, z_2, \dots, z_K) \in \mathbb{R}^K$. The $i^\text{th}$ entry of the softmax function applied to this vector is given by:
$$
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}
$$

and so $\sigma$ maps $\mathbb{R}^K$ to $(0,1)^K$, with the entries of $\sigma(z)$ summing to $1$. For example, if $z$ holds the values of your output nodes prior to activation, then the softmax turns these into "probabilities" of your observation being of class $i$.
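To see the connection to the sigmoid concretely, here is a minimal NumPy sketch. The helper names `softmax` and `sigmoid` and the example vectors are my own choices for illustration. When $K = 2$, the first softmax entry equals the sigmoid applied to $z_1 - z_2$, since $e^{z_1}/(e^{z_1} + e^{z_2}) = 1/(1 + e^{-(z_1 - z_2)})$.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array z.

    Subtracting max(z) avoids overflow and does not change the result.
    """
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def sigmoid(t):
    """The usual logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-t))

# K = 3: entries lie strictly between 0 and 1 and sum to 1
z = np.array([2.0, 1.0, -1.0])
probs = softmax(z)
print(probs, probs.sum())

# K = 2: the first softmax entry equals sigmoid(z_1 - z_2)
z2 = np.array([0.5, -1.5])
print(softmax(z2)[0], sigmoid(z2[0] - z2[1]))
```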

##### 2. Backpropagation practice

Look at this architecture that comes from the following blog post (don't cheat and just look up the solution though!), <a href="https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/">https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/</a>

<img src="practice.png" width="60%"></img>

This is our first time with both a multi-output network and a network with bias, so I'll help you with the setup. To get started, here are the formulas for $h_1$ and $h_2$.

$$
h_1 = \Phi(w_1 x_1 + w_2 x_2 + b_1)
$$

$$
h_2 = \Phi(w_3 x_1 + w_4 x_2 + b_1)
$$

Also in the case of a multi-output network our cost function is a sum of the two errors:
$$
C = (o_1 - y_1)^2 + (o_2 - y_2)^2,
$$
where $y = (y_1,y_2)$ and $o_1,o_2$ can be thought of as the $\hat{y}$ in our simple example.

For this problem:
- Set up the equations for $o_1$ and $o_2$ in terms of $h_1$ and $h_2$,
- Calculate $\partial C/\partial w_5$, $\partial C/ \partial w_1$ and $\partial C/\partial b_2$.


You may record your answers in the markdown block if you would like to. Math can be entered with typical LaTeX commands, with equations contained in dollar signs.
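If you want to check a hand-computed partial derivative, one option is to compare it against a central finite difference. The sketch below does this for a single-neuron cost $C = (\Phi(wx + b) - y)^2$, not for the two-output network above. It assumes $\Phi$ is the sigmoid, and the helpers `cost` and `numerical_partial` as well as all numeric values are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(params, x, y):
    """Squared error of a single sigmoid neuron; params = [w, b]."""
    w, b = params
    return (sigmoid(w * x + b) - y) ** 2

def numerical_partial(f, params, index, eps=1e-6):
    """Central-difference estimate of the partial of f w.r.t. params[index]."""
    up = np.array(params, dtype=float)
    down = np.array(params, dtype=float)
    up[index] += eps
    down[index] -= eps
    return (f(up) - f(down)) / (2 * eps)

# hypothetical values, chosen only for illustration
x, y = 0.5, 1.0
params = [0.3, -0.1]

# hand-derived partial via the chain rule:
# dC/dw = 2 (o - y) * o (1 - o) * x, where o = sigmoid(w x + b)
o = sigmoid(params[0] * x + params[1])
analytic_dw = 2 * (o - y) * o * (1 - o) * x

numeric_dw = numerical_partial(lambda p: cost(p, x, y), params, index=0)
print(analytic_dw, numeric_dw)
```

The same idea carries over to the network in the figure once you have written $C$ as a function of the weights: perturb one weight at a time and compare with your formula.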



##### 3. Exclusive or (XOR) I

Recall that a failure of the perceptron was being unable to produce nonlinear decision boundaries (or nonlinear regression in a regression setting). We demonstrated this with the following picture.

In [None]:
X = np.array([[0,0],[0,1],[1,0],[1,1]])

y = np.array([1,-1,-1,1])

In [None]:
plt.figure(figsize=(10,10))

plt.scatter(X[y == 1,0],X[y == 1,1], c = 'b', label = "1", s=100)
plt.scatter(X[y == -1,0],X[y == -1,1], c = 'orange', label = "-1", s=100)

plt.legend(fontsize=12)

plt.show()

Using $\sigma = \text{sgn}$, show that a perceptron with a bias term cannot separate these data points.

<i>Hint</i>

The setup for this perceptron would be:

$$
\text{sgn}\left( w_1 x_1 + w_2 x_2 + b \right)
$$
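A numerical search is not a proof, but it can build intuition before you write the argument. The sketch below tries many randomly drawn values of $(w_1, w_2, b)$ and records the largest number of the four points that any of them classifies correctly. The sampling range, the seed, the number of draws, and the variable names are arbitrary choices of mine.

```python
import numpy as np

# the four XOR points and labels from the cell above
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([1, -1, -1, 1])

rng = np.random.default_rng(440)

best_correct = 0
for _ in range(10000):
    # draw w_1, w_2, and b uniformly from [-5, 5]
    w1, w2, b = rng.uniform(-5, 5, size=3)
    preds = np.sign(w1 * X_xor[:, 0] + w2 * X_xor[:, 1] + b)
    best_correct = max(best_correct, int(np.sum(preds == y_xor)))

# number of points (out of 4) the best random perceptron got right
print(best_correct)
```

Whatever the search prints, the exercise still asks for an argument that covers every possible choice of weights and bias.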

###### Write here



##### 4. Exclusive or II

The classification problem above is roughly equivalent to building a classifier on these data.

In [None]:
X = np.zeros((10000,2))

X[:2500,:] = np.random.random((2500,2))
X[2500:5000,:] = np.random.random((2500,2)) + np.array([2,2])
X[5000:7500,:] = np.random.random((2500,2)) + np.array([0,2])
X[7500:,:] = np.random.random((2500,2)) + np.array([2,0])

y = np.zeros(10000)

y[:2500] = 1
y[2500:5000] = 1
y[5000:7500] = -1
y[7500:] = -1

In [None]:
plt.figure(figsize=(10,10))

plt.scatter(X[y == 1,0],X[y == 1,1], c = 'b', label = "1", s=10)
plt.scatter(X[y == -1,0],X[y == -1,1], c = 'orange', label = "-1", s=10)

plt.show()

Using `sklearn`, build a perceptron and a multilayer network with a single hidden layer of $100$ nodes. What is the accuracy of each on this data set?

In [None]:
## Code here




In [None]:
## Code here




In [None]:
## Code here




In [None]:
## Code here




##### 5. Additional `keras` metrics

We used `"accuracy"` when specifying the metrics in the `compile` step. Look at the `keras` documentation on metrics to see what other metrics are available to us, <a href="https://keras.io/api/metrics/">https://keras.io/api/metrics/</a>.

##### 6. IMDB review sentiment

In the code below I load in an IMDB review sentiment data set. These data contain IMDB reviews and a corresponding sentiment, with $y=0$ indicating a negative review and $y=1$ indicating a positive review. In particular, each observation in `X_train_ff` or `X_test_ff` will have $5000$ columns corresponding to the $5000$ most used words across all reviews. The $i,j$ entry of `X_train_ff` or `X_test_ff` will thus represent the frequency at which IMDB review $i$ used word $j$.

In [None]:
## loading the data
from keras.datasets import imdb

num_words = 5000
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=num_words)

In [None]:
## preparing an X_train data set
## We're making a series of word frequency vectors
## Here column j is the frequency of word j in the review
## this is a standard way to process text based data
X_train_ff = np.zeros((len(train_data), num_words))
X_test_ff = np.zeros((len(test_data), num_words))

print("Starting word count vectorization for train data")
for i in range(len(train_data)):
    for j in train_data[i]:
        X_train_ff[i,j] = X_train_ff[i,j] + 1
print("Done with word count vectorization for train data")

print("Starting word count vectorization for test data")
for i in range(len(test_data)):
    for j in test_data[i]:
        X_test_ff[i,j] = X_test_ff[i,j] + 1
print("Done with word count vectorization for test data")

print("Now making word frequency vectors :)")
X_train_ff = X_train_ff/np.sum(X_train_ff, axis=1).reshape(-1,1)
X_test_ff = X_test_ff/np.sum(X_test_ff, axis=1).reshape(-1,1)

y_train_ff = train_labels.copy()
y_test_ff = test_labels.copy()
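
The vectorization above can be illustrated on a toy example. The sketch below builds count and frequency vectors for two made-up "reviews" over a vocabulary of $6$ word indices. The data and the names `toy_reviews` and `toy_num_words` are hypothetical, and `np.bincount` is used here as a compact alternative to the explicit double loop.

```python
import numpy as np

toy_num_words = 6
# each "review" is a list of integer word indices
toy_reviews = [[1, 2, 2, 5], [0, 3, 3, 3, 4]]

# counts[i, j] = number of times word j appears in review i
counts = np.array([np.bincount(review, minlength=toy_num_words)
                   for review in toy_reviews])

# divide each row by its total so that every row sums to 1
freqs = counts / counts.sum(axis=1, keepdims=True)

print(counts)
print(freqs)
```

Each row of `freqs` sums to $1$, which is what the division by `np.sum(..., axis=1)` accomplishes in the cell above.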

In [None]:
## The shape of X_train
X_train_ff.shape

In [None]:
## The shape of X_test
X_test_ff.shape

In [None]:
## demonstrating the labeled data
y_train_ff

Make a validation set of $15\%$ of the training data.

In [None]:
## code here




In [None]:
## code here




Build two feed forward neural networks on these data:
1. One with a single hidden layer with $64$ nodes,
2. Another with two hidden layers each with $32$ nodes.

Which one seems to perform better on the validation set?

In [None]:
## Import what you'll need from keras here


In [None]:
## Make an empty model object here
model1 = 

## Add your layers here
## the output layer should have a 'sigmoid' activation function



## compile the model here



## Fit your model here, don't forget to include your validation 
## data argument


In [None]:
## Make an empty model object here
model2 = 

## Add your layers here



## compile the model here



## Fit your model here, don't forget to include your validation 
## data argument



In [None]:
plt.figure(figsize=(14,8))




##### 7. Adding Dropout Layers

Sometimes you can combat overfitting by randomly dropping some of the nodes in the model during training. In `keras` this is accomplished with the `Dropout` layer. During training, this layer sets each node from the previous layer to $0$ with probability `rate` and scales the remaining nodes by $1/(1-\text{rate})$.
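To make the zero-and-rescale step concrete, here is a minimal NumPy sketch of that operation on a vector of activations. This is not how `keras` implements the layer internally; the variable names, the seed, and the all-ones activations are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

rate = 0.5
# pretend these are the outputs of the previous layer
activations = np.ones(10000)

# keep each node with probability 1 - rate, zero it otherwise
keep_mask = rng.random(activations.shape) >= rate

# scale the surviving nodes by 1/(1 - rate)
dropped = activations * keep_mask / (1 - rate)

print("fraction zeroed:", np.mean(dropped == 0))
print("mean before:", activations.mean(), "mean after:", dropped.mean())
```

The rescaling keeps the average size of the layer's output roughly unchanged, so the next layer sees inputs on a similar scale whether or not dropout is active.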

Return to the network you trained in the previous problem and add a `Dropout` layer using the code I provide below. Does this improve your model? (I'm not sure if it will in this particular model or not, but I wanted to give you a chance to practice adding a `Dropout` layer.)

Note you'll be plotting the validation and training accuracies of both models below, so store your history in a different variable than the one you used in the previous problem.

Docs: <a href="https://keras.io/api/layers/regularization_layers/dropout/">https://keras.io/api/layers/regularization_layers/dropout/</a>.

In [None]:
## Add your model from above.
## Make an empty model object here
model3 = 

## Add your layers here


## prior to the output layer we'll add this dropout layer



## compile the model here



## Fit your model here, don't forget to include your validation 
## data argument


In [None]:
plt.figure(figsize=(10,8))





##### 8. Compare to `SimpleRNN`

Run the code below to compare the feed forward results to the `SimpleRNN` network we built in lecture.

In [None]:
from keras.preprocessing import sequence
from keras import models, layers
from sklearn.model_selection import train_test_split

In [None]:
max_features = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

max_length = 100
X_train_seq = sequence.pad_sequences(X_train, maxlen=max_length)
X_test_seq = sequence.pad_sequences(X_test, maxlen=max_length)


X_tt_seq,X_val_seq,y_tt_seq,y_val_seq = train_test_split(X_train_seq, y_train,
                                                           test_size=.2,
                                                           shuffle=True,
                                                           stratify = y_train,
                                                           random_state=440)

In [None]:
## only needed when re-running; on a fresh kernel this raises a NameError, so skip it
del simple_rnn_model

In [None]:
simple_rnn_model = models.Sequential()

simple_rnn_model.add( layers.Embedding(max_features, 32) )
simple_rnn_model.add( layers.SimpleRNN(32, return_sequences = False) )


simple_rnn_model.add(layers.Dense(1, activation='sigmoid'))

simple_rnn_model.compile(optimizer='rmsprop',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])


epochs = 100

history_snn = simple_rnn_model.fit(X_tt_seq, y_tt_seq,
                                    epochs = epochs,
                                    batch_size=500,
                                    validation_data=(X_val_seq,y_val_seq))

history_snn_dict = history_snn.history

In [None]:
## plot the validation accuracy of both models here


##### 9. `LSTM` layer

An adjustment to the `SimpleRNN` architecture came in the form of the so-called "Long Short-Term Memory" or LSTM architecture. 

This adjustment addressed the issue of vanishing (or disappearing) gradients that occurs due to the long sequences involved in the standard RNN architecture. Because these gradients shrink during backpropagation, the networks pay much more attention to recent terms in the sequence than to terms further back.
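A small numerical sketch of this effect, assuming a one-unit RNN with a $\tanh$ activation, $h_t = \tanh(w h_{t-1} + u x_t)$. The weights, the seed, and the input sequence are made up for illustration. By the chain rule, the derivative of $h_t$ with respect to $h_0$ is a product of $t$ factors $w(1 - h_s^2)$, each of magnitude at most $|w|$, so with $|w| < 1$ it shrinks at least geometrically in $t$.

```python
import numpy as np

rng = np.random.default_rng(1)

w, u = 0.9, 1.0          # hypothetical recurrent and input weights
T = 100                  # sequence length
x = rng.normal(size=T)   # a made-up input sequence

h = 0.0
grad = 1.0               # running value of d h_t / d h_0
grad_sizes = []
for t in range(T):
    h = np.tanh(w * h + u * x[t])
    # chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
    grad *= w * (1.0 - h ** 2)
    grad_sizes.append(abs(grad))

# size of the gradient after 1, 10, and 100 steps
print(grad_sizes[0], grad_sizes[9], grad_sizes[-1])
```

The last hidden state barely depends on the earliest part of the sequence here, which is the behavior the LSTM architecture was designed to mitigate.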

While I will not dive into the mathematical setup of LSTM networks here, check out section 7.5 of this text, <a href="https://d1wqtxts1xzle7.cloudfront.net/63954267/2018_Book_NeuralNetworksAndDeepLearning20200718-22595-1luren6-with-cover-page-v2.pdf?Expires=1652983581&Signature=LT2OEq4kN4bAjeVMo0Gi1B-JPuy0TUYR1VuGhVOnEiHc-bvoUY1-OHLSiLh8EAVQhMHG5U2x6Umg1muZArOvflSiZpDpTnKVMsjGdZYQs4CULVXGw~Zf4kl7jQiZJG4jRZZuA6m2-vxb9kykkEUqNLjdGATea2UJd9AbkkFUUnLUTWdLSNy5wSLKTKU~pYxYIrfhUZgUw4~pc9RBut4Z5L5W7bYhYhMyI10TTwqvtrzqMekCVLsZt8aNjqYkcYi1bBtsGT5yxqV85s6lfPezZaBR5xBvcccaga7zq9OKuwWwltiMhuldPUXZFt9jBGs5mu-kZsauNU0fvTCdKPA-QA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA">https://d1wqtxts1xzle7.cloudfront.net/63954267/2018_Book_NeuralNetworksAndDeepLearning20200718-22595-1luren6-with-cover-page-v2.pdf?Expires=1652983581&Signature=LT2OEq4kN4bAjeVMo0Gi1B-JPuy0TUYR1VuGhVOnEiHc-bvoUY1-OHLSiLh8EAVQhMHG5U2x6Umg1muZArOvflSiZpDpTnKVMsjGdZYQs4CULVXGw~Zf4kl7jQiZJG4jRZZuA6m2-vxb9kykkEUqNLjdGATea2UJd9AbkkFUUnLUTWdLSNy5wSLKTKU~pYxYIrfhUZgUw4~pc9RBut4Z5L5W7bYhYhMyI10TTwqvtrzqMekCVLsZt8aNjqYkcYi1bBtsGT5yxqV85s6lfPezZaBR5xBvcccaga7zq9OKuwWwltiMhuldPUXZFt9jBGs5mu-kZsauNU0fvTCdKPA-QA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA</a>.

In this problem you will walk through building an LSTM network on the IMDB data and comparing the performance to the previous two models.

In [None]:
## Make an empty sequential model here
lstm_model = 

## add the same embedding layer here
lstm_model.add(  )

## add the first LSTM layer
## setting return_sequences=True allows us to add a second LSTM layer
lstm_model.add( layers.LSTM(32, return_sequences = True) )

## add the second LSTM layer
## set return_sequences=False
lstm_model.add( )


## add the output layer


## compile the model



## train the model
## Note this will take several minutes



In [None]:
## plot the validation accuracy of all 3 models here


##### Comment on what you observe



Another common RNN architecture people use is a <i>gated recurrent unit</i> (GRU).

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).