# CS-5600/6600 Lecture 19 - Training Neural Networks

**Instructor: Dylan Zwick**

*Weber State University*

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

import sys

Today, we're going to talk about the basic idea behind a neural network, and see how one can work in action on a very famous problem: classifying handwritten digits. We won't get into too much depth on the mathematics, but I've included the derivation of back propagation as an appendix.

## The Single-Layer Neural Network (Perceptron) and Logistic Regression

The simplest type of neural network, one without any internal (or "hidden") layers, is called a "perceptron", represented below:

<center>
  <img src="https://drive.google.com/uc?export=view&id=109BRkgVYMEWAEr5TQjNNetImqOKM3U-n" alt="The Perceptron">
</center>

The idea here is that it takes as its input the values of our input variables $x_{1},x_{2}, \ldots, x_{m}$, multiplies each by its weight, adds these weighted inputs together, feeds the sum to an activation function, and then passes the result to a unit step function.

This might look complicated, but if the activation function is a sigmoid (which it frequently is) then this is just logistic regression.

<center>
  <img src="https://drive.google.com/uc?export=view&id=1Li6XZfZr-O6eKMiLXW3VDEarTql2wohg" alt="The Sigmoid">
</center>

Formulating this a bit more mathematically, if the inputs are $X = (x_{1},x_{2},\ldots,x_{m})$, then these inputs are combined into the net input function through a weighted linear combination:
<center>
    <br>
    $\displaystyle z(X) = w_{0} + w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{m}x_{m}$.
    <br>
</center>
<br>
This linear combination is then sent to an activation function, in this case the sigmoid:
<center>
    <br>
    $\displaystyle \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w_{0}+w_{1}x_{1}+w_{2}x_{2}+\cdots+w_{m}x_{m})}}$
    <br>
</center>
<br>
The sigmoid function goes from 0 to 1 and provides a number we can interpret as a probability.

We then get our final prediction by sending the output of the sigmoid to a unit step function, which predicts a $1$ if the value is above a given threshold (frequently $.5$), and $0$ otherwise.
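As a minimal sketch of the pipeline just described, the net input, sigmoid, and unit step can be chained as below. The helper name `perceptron_predict`, the weights, and the inputs are all made up for illustration; they are not part of the lecture code.

```python
import numpy as np

def perceptron_predict(x, w, threshold=0.5):
    """Return (probability, label) for one observation.

    x holds the inputs (x_1, ..., x_m); w holds (w_0, w_1, ..., w_m),
    with w_0 as the bias. Name and default threshold are illustrative.
    """
    z = w[0] + np.dot(w[1:], x)          # net input z(X)
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid activation
    label = 1 if p > threshold else 0    # unit step at the threshold
    return p, label

# Hypothetical weights and inputs, chosen only for illustration
w = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 2.0])

p, label = perceptron_predict(x, w)
print(p, label)
```

Here $z = -1 + 2(1) + 0.5(2) = 2$, so the sigmoid output is above the $.5$ threshold and the predicted label is $1$.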

### WARNING: Calculus Ahead - Gradient Descent for the Perceptron

Suppose we have an observation $X$ with output $y$. We want to choose our weights so as to maximize the likelihood of the observed outcome, which is given by:

<br>
<center>
    <br>
    $\displaystyle \sigma(z)^{y}(1-\sigma(z))^{1-y}$.
    <br>
</center>
<br>

Here, we're using that $y$ is either $0$ or $1$. Now, typically instead of maximizing a function, we define a *loss function*, and seek to minimize it. Also, we can note the logarithm is an increasing function, so the input that maximizes / minimizes a function will also maximize / minimize its logarithm. Using these observations, we define our loss function:

<br>
<center>
    <br>
    $\displaystyle J = -\left(y\log{\sigma(z(X))} + (1-y)\log{(1-\sigma(z(X)))}\right)$.
    <br>
</center>
<br>

We want to find the values of the weights $w_{0},w_{1},\ldots,w_{m}$ that minimize this function, and to do this we'll want to find the function's gradient. This means finding the partial derivatives with respect to each weight. We can get the partial derivative of this function with respect to the weight $w_{j}$ using the chain rule:

<br>
<center>
    <br>
    $\displaystyle \frac{\partial J}{\partial w_{j}} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial w_{j}}$.
    <br>
</center>
<br>

This will require a bit of calculus, but the derivative of the sigmoid function is:

<br>
<center>
    <br>
    $\displaystyle \frac{d\sigma}{dz} = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^{2}} = \left(\frac{1}{1+e^{-z}}\right)\left(1-\frac{1}{1+e^{-z}}\right) = \sigma(z)(1-\sigma(z))$
    <br>
</center>
<br>

From this we get the derivative of the loss function:

<br>
<center>
    <br>
    $\displaystyle \frac{\partial J}{\partial w_{j}} = \left(-y\left(\frac{1}{\sigma(z)}\right)\sigma(z)(1-\sigma(z)) + (1-y)\left(\frac{1}{1-\sigma(z)}\right)\sigma(z)(1-\sigma(z))\right)\frac{\partial z}{\partial w_{j}} = (\sigma(z)-y)x_{j}$
    <br>
</center>
<br>

We write $\delta$ for the partial derivative $\displaystyle \frac{\partial J}{\partial z}$, and so from this we get $\delta = \sigma(z)-y$. This value is typically called the *error* of the output node.

From this we can see our gradient is $\nabla J = \delta X$, where we take $x_{0} = 1$ so that the component for the bias weight $w_{0}$ is just $\delta$.
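One way to convince yourself of this formula without redoing the calculus is a finite-difference check. The sketch below uses made-up weights and inputs (and puts $x_{0} = 1$ in front so the bias is handled like any other weight), then compares $\delta X$ against a numerical derivative of the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    """Cross-entropy loss J for one observation; x[0] = 1 carries the bias w_0."""
    p = sigmoid(np.dot(w, x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
x = np.concatenate(([1.0], rng.normal(size=3)))   # x_0 = 1, then three made-up inputs
w = rng.normal(size=4)                            # made-up weights w_0..w_3
y = 1

# Analytic gradient from the derivation: delta * X, with delta = sigma(z) - y
delta = sigmoid(np.dot(w, x)) - y
grad_analytic = delta * x

# Central finite differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    grad_numeric[j] = (loss(w + step, x, y) - loss(w - step, x, y)) / (2 * eps)

# Largest disagreement between the two gradients; should be tiny
print(np.max(np.abs(grad_analytic - grad_numeric)))
```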

## The Multi-Layer Neural Network (a.k.a. Deep Learning)

A multi-layer neural network (which is essentially everything that's actually called a neural network) is just a bunch of perceptron-like units (each with whatever activation function you want to use) meshed together.

<center>
  <img src="https://drive.google.com/uc?export=view&id=19_1Jb8hLiv3_7tD0fzgaejFpyTXQbjNi" alt="Multi-Layer Perceptron">
</center>

The number of layers and the size of each layer are hyperparameters that are set before the model trains; during training, what gets updated is the weights. These weights are updated through something called "back propagation": the errors from the output layer are fed back to the previous layer and its weights are adjusted, the error for that layer is then fed back to the layer before it, and so on. It turns out that these steps can be handled sequentially in this manner, so that the entire neural network doesn't need to be modified at once. This significantly decreases the complexity of an update (a step in gradient descent), and the reason you're able to do this is, essentially, just the chain rule from calculus.

### Forward Propagation

Suppose we've got our neural network above with $t$ output nodes, and $d$ hidden nodes.

At each non-input node there is a net input function, which we denote by $z_{j}^{(l)}$, where $l \in \{in,h,out\}$. So, for example, the net input to the index $1$ node in the hidden layer is:

<br>
<center>
    <br>
    $\displaystyle z_{1}^{(h)} = w_{0,1}^{(h)}a_{0}^{(in)} + w_{1,1}^{(h)}a_{1}^{(in)} + \cdots + w_{m,1}^{(h)}a_{m}^{(in)}$
    <br>
</center>
<br>

The output value of each node is the sigmoid of its inputs, so for example the output of the index $1$ node in the hidden layer is:

<br>
<center>
    <br>
    $\displaystyle a_{1}^{(h)} = \sigma(z_{1}^{(h)})$
    <br>
</center>
<br>

Note that we can view the set of inputs to the hidden layer as a row vector $\textbf{z}^{(h)}$, with values given by:

<br>
<center>
  $\displaystyle \textbf{z}^{(h)} = \left(\begin{array}{cccc} z_{1}^{(h)} & z_{2}^{(h)} & \cdots & z_{d}^{(h)}\end{array}\right) = \left(\begin{array}{cccc} a_{0}^{(in)} & a_{1}^{(in)} & \cdots & a_{m}^{(in)}\end{array}\right) \left(\begin{array}{cccc} w_{0,1}^{(h)} & w_{0,2}^{(h)} & \cdots & w_{0,d}^{(h)} \\ w_{1,1}^{(h)} & w_{1,2}^{(h)} & \cdots & w_{1,d}^{(h)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1}^{(h)} & w_{m,2}^{(h)} & \cdots & w_{m,d}^{(h)}\end{array}\right) = \textbf{a}^{(in)}\textbf{W}^{(h)}$
</center>
<br>
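The matrix form above is just the per-node sums computed all at once. As a small sketch with made-up sizes and random values (none of these numbers come from the lecture), we can check that the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3                                          # made-up sizes: m inputs, d hidden nodes

a_in = np.concatenate(([1.0], rng.normal(size=m)))   # a_0 = 1 is the bias unit
W_h = rng.normal(size=(m + 1, d))                    # row i, column j holds w_{i,j}

# Node by node: z_j = sum_i w_{i,j} * a_i
z_loop = np.array([sum(W_h[i, j] * a_in[i] for i in range(m + 1))
                   for j in range(d)])

# All hidden nodes at once: z = a_in W_h
z_matrix = a_in @ W_h

print(np.allclose(z_loop, z_matrix))
```

The code later in these notes keeps the bias as a separate vector instead of a row of the weight matrix, which is the same computation with the $a_{0}^{(in)} = 1$ term pulled out.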

Forward propagation is, essentially, the repeated application of these transformations, layer by layer, from the inputs through to the outputs.

Back propagation is how we handle updating the weights through gradient descent. The math behind this is a bit complicated. It's "only" multivariable calculus, but it's a serious walk through the multivariable chain rule. I've gone over it twice before in graduate machine learning classes, and both times both I and the students have regretted it. So, I've included the math at the bottom of these notes as an appendix, in case you're curious.

## Implementing a Neural Network

There are excellent libraries that we'll be learning about and using for working with neural networks. However, today will be an exercise in self-sufficiency. We're going to create a neural network using only NumPy.

We'll step through the code to implement a neural network for the famous task of identifying handwritten digits. I've broken up what would usually be a neural network class into a number of separate functions so we can better understand how they work together, although to be clear keeping them all within a single class is definitely the better way to do it!

First, let's grab the data using the *fetch_openml* function we've used before:

In [None]:
X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False,
                                parser='auto')

Now, the size of the data here is pretty significant.

In [None]:
X_mnist.shape

We won't need that many, so let's just grab the first 30,000.

In [None]:
X_train = X_mnist[:30000]
y_train = y_mnist[:30000]

Also, our $y$ values are actually stored as strings.

In [None]:
y_train

Let's convert them to integers.

In [None]:
y_train = y_train.astype(int)

Let's take a look at some of these digits:

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True)
ax = ax.flatten()
for i in range(10):
    img = X_train[y_train == i][0].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys')

ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()

For the purposes of our algorithm, we're going to encode our data using onehot encoding:

In [None]:
def onehot(y, n_classes):
    # one row per label, one column per class
    encoded = np.zeros((y.shape[0], n_classes))
    for idx, val in enumerate(y.astype(int)):
        encoded[idx, val] = 1.
    return encoded

In [None]:
example = np.array([0,1,1,0,1,0,0])
onehot(example,2)
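For comparison, here is a loop-free sketch of the same encoding. The name `onehot_vectorized` is made up for this illustration; it relies on the fact that indexing the rows of an identity matrix by the labels picks out exactly the one-hot rows.

```python
import numpy as np

def onehot_vectorized(y, n_classes):
    """Same output as the loop version: row i is all zeros except a 1 at column y[i]."""
    return np.eye(n_classes)[np.asarray(y).astype(int)]

example = np.array([0, 1, 1, 0, 1, 0, 0])
encoded = onehot_vectorized(example, 2)
print(encoded)
```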

We'll even write our own sigmoid function. Note that we've added some code to help avoid numerical overflow in the exponential. This code will make the sigmoid slower, but that's OK for us today.

In [None]:
def sigmoid(x):
    # Clip first so np.exp can never overflow (np.exp(709) is close to the max float64)
    x = np.clip(x, -709, 709)
    # Use the algebraically equivalent form that exponentiates a non-positive number
    return np.where(
        x >= 0,
        1 / (1 + np.exp(-x)),
        np.exp(x) / (1 + np.exp(x))
    )

In [None]:
print(sigmoid(-17))
print(sigmoid(0))
print(sigmoid(1))
print(sigmoid(6000))
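To see the kind of issue this guards against, here is a small sketch of what the textbook formula does for a very negative input. The name `naive_sigmoid` is made up for this illustration. NumPy still lands on the right limiting value, but only after `np.exp` overflows and raises a warning:

```python
import warnings
import numpy as np

def naive_sigmoid(x):
    # Textbook formula with no protection against overflow
    return 1 / (1 + np.exp(-x))

# Record warnings instead of printing them so we can inspect them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = naive_sigmoid(np.array([-1000.0]))

print(value)
print([str(w.message) for w in caught])
```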

Now, let's implement forward propagation:

In [None]:
def forward(X,W_h,b_h,W_out,b_out):

    z_h = np.dot(X, W_h) + b_h

    a_h = sigmoid(z_h)

    z_out = np.dot(a_h, W_out) + b_out

    a_out = sigmoid(z_out)

    return z_h, a_h, z_out, a_out
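A quick way to get a feel for forward propagation is to push a small random batch through it and look at the shapes. The sketch below restates `forward` (with a simplified sigmoid) so it runs by itself; the layer sizes mirror the MNIST setup used later, but the data and weights are random stand-ins.

```python
import numpy as np

def sigmoid(x):
    x = np.clip(x, -709, 709)        # keep np.exp within float64 range
    return 1 / (1 + np.exp(-x))

def forward(X, W_h, b_h, W_out, b_out):
    z_h = np.dot(X, W_h) + b_h               # net input to the hidden layer
    a_h = sigmoid(z_h)                       # hidden activations
    z_out = np.dot(a_h, W_out) + b_out       # net input to the output layer
    a_out = sigmoid(z_out)                   # output activations
    return z_h, a_h, z_out, a_out

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden, n_output = 5, 784, 30, 10

X = rng.random((n_samples, n_features))      # random stand-in for pixel data
W_h = rng.normal(0.0, 0.1, size=(n_features, n_hidden))
b_h = np.zeros(n_hidden)
W_out = rng.normal(0.0, 0.1, size=(n_hidden, n_output))
b_out = np.zeros(n_output)

z_h, a_h, z_out, a_out = forward(X, W_h, b_h, W_out, b_out)
print(z_h.shape, a_h.shape, z_out.shape, a_out.shape)
# (5, 30) (5, 30) (5, 10) (5, 10)
```

Each output node has its own sigmoid, so the ten values in a row of `a_out` each lie between 0 and 1 but need not sum to 1.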

Now back propagation (see the appendix for details):

In [None]:
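def back(X,y,W_h,b_h,W_out,b_out,eta):

    # forward pass to get the activations for this input
    z_h,a_h,z_out,a_out = forward(X,W_h,b_h,W_out,b_out)

    # error at the output layer: sigma(z_out) - y
    delta_out = a_out - y

    # derivative of the sigmoid at the hidden layer
    sigmoid_derivative_h = a_h * (1. - a_h)

    # error propagated back to the hidden layer
    delta_h = (np.dot(delta_out, W_out.T) * sigmoid_derivative_h)

    # gradients for the input -> hidden weights and biases
    grad_W_h = np.dot(X.T, delta_h)
    grad_b_h = np.sum(delta_h, axis=0)

    # gradients for the hidden -> output weights and biases
    grad_W_out = np.dot(a_h.T, delta_out)
    grad_b_out = np.sum(delta_out, axis=0)

    # gradient descent step (updates the arrays in place)
    W_h -= eta * grad_W_h
    b_h -= eta * grad_b_h

    W_out -= eta * grad_W_out
    b_out -= eta * grad_b_out
    return W_h,b_h,W_out,b_out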
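If you'd rather test the gradient formulas than take them on faith, a finite-difference check works here too. The sketch below is self-contained and uses a tiny made-up network (4 inputs, 5 hidden nodes, 3 outputs, random weights). It computes the weight gradients with the same expressions as `back` and compares them against numerical derivatives of the cost; the helper names `cost` and `numeric_grad` are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W_h, b_h, W_out, b_out):
    a_h = sigmoid(X @ W_h + b_h)
    a_out = sigmoid(a_h @ W_out + b_out)
    return a_h, a_out

def cost(X, y, W_h, b_h, W_out, b_out):
    _, a_out = forward(X, W_h, b_h, W_out, b_out)
    return np.sum(-y * np.log(a_out) - (1.0 - y) * np.log(1.0 - a_out))

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 4))                  # one made-up observation, 4 features
y = np.array([[0.0, 1.0, 0.0]])              # one-hot target, 3 classes
W_h = rng.normal(0.0, 0.5, size=(4, 5))
b_h = np.zeros(5)
W_out = rng.normal(0.0, 0.5, size=(5, 3))
b_out = np.zeros(3)

# Analytic gradients, using the same expressions as back()
a_h, a_out = forward(X, W_h, b_h, W_out, b_out)
delta_out = a_out - y
delta_h = (delta_out @ W_out.T) * a_h * (1.0 - a_h)
grad_W_h = X.T @ delta_h
grad_W_out = a_h.T @ delta_out

def numeric_grad(W, eps=1e-6):
    """Central-difference gradient of the cost with respect to W (perturbed in place, then restored)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = cost(X, y, W_h, b_h, W_out, b_out)
        W[idx] = original - eps
        minus = cost(X, y, W_h, b_h, W_out, b_out)
        W[idx] = original
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_W_h = numeric_grad(W_h)
num_W_out = numeric_grad(W_out)

# Largest disagreement for each weight matrix; both should be tiny
print(np.max(np.abs(grad_W_h - num_W_h)))
print(np.max(np.abs(grad_W_out - num_W_out)))
```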

Now we'll use forward propagation and back propagation to build our fit function.

In [None]:
def fit(X_train, y_train, epochs, n_hidden, eta, seed=1):

    random = np.random.RandomState(seed)

    n_output = np.unique(y_train).shape[0]  # number of class labels
    n_features = X_train.shape[1]

    # weights for input -> hidden
    b_h = np.zeros(n_hidden)
    W_h = random.normal(loc=0.0, scale=0.1,size=(n_features,n_hidden))

    # weights for hidden -> output
    b_out = np.zeros(n_output)
    W_out = random.normal(loc=0.0, scale=0.1,size=(n_hidden, n_output))

    y_train_enc = onehot(y_train, n_output)

    # iterate over training epochs
    for i in range(epochs):

        # visit the training examples in a fresh random order each epoch
        indices = np.arange(X_train.shape[0])
        random.shuffle(indices)

        # one stochastic gradient descent update per training example
        for idx in indices:
            W_h,b_h,W_out,b_out = back(X_train[[idx]],y_train_enc[[idx]],W_h,b_h,W_out,b_out,eta)

        # Evaluation after each epoch during training
        z_h, a_h, z_out, a_out = forward(X_train,W_h,b_h,W_out,b_out)
        cost = np.sum(-y_train_enc * (np.log(a_out)) - (1. - y_train_enc) * np.log(1. - a_out))

        y_train_pred = np.argmax(z_out, axis=1)

        train_acc = ((np.sum(y_train == y_train_pred)).astype(float) / X_train.shape[0])

        sys.stderr.write('\r%d/%d | Cost: %.2f '
                             '| Train Acc.: %.2f%% ' %
                             (i+1, epochs, cost,train_acc*100))
        sys.stderr.flush()
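One thing to watch for in the cost line: if a sigmoid output rounds to exactly 0 or 1, `np.log` returns `-inf` and the cost can come out as `nan`. A possible guard, sketched below, is to clip the outputs slightly away from 0 and 1 before taking logs. The function name `cross_entropy` and the value of `eps` are illustrative choices, not part of the lecture code.

```python
import numpy as np

def cross_entropy(a_out, y_enc, eps=1e-12):
    """Summed cross-entropy cost, with outputs clipped away from exactly 0 and 1.

    eps is an illustrative choice, not a value from the lecture.
    """
    a = np.clip(a_out, eps, 1.0 - eps)
    return np.sum(-y_enc * np.log(a) - (1.0 - y_enc) * np.log(1.0 - a))

y_enc = np.array([[1.0, 0.0]])
a_saturated = np.array([[1.0, 0.0]])   # sigmoid outputs that rounded to exactly 1 and 0

# The unclipped formula: silence the warnings so we can look at the value
with np.errstate(divide="ignore", invalid="ignore"):
    naive = np.sum(-y_enc * np.log(a_saturated)
                   - (1.0 - y_enc) * np.log(1.0 - a_saturated))

safe = cross_entropy(a_saturated, y_enc)

print(naive)   # nan, from 0 * log(0)
print(safe)    # small and finite
```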

Let's try it out!

In [None]:
#hyperparameters

epochs = 10
n_hidden = 30
eta = .001
seed = 42

fit(X_train, y_train, epochs, n_hidden, eta, seed)

Not bad! Now, with better neural network architectures we can do much better. We'll be discussing those later.

## Back Propagation (Appendix)

Recall what we said above about forward propagation, repeated here for reference.

Suppose we've got our neural network above with $t$ output nodes, and $d$ hidden nodes.

At each non-input node there is a net input function, which we denote by $z_{j}^{(l)}$, where $l \in \{in,h,out\}$. So, for example, the net input to the index $1$ node in the hidden layer is:

<br>
<center>
    <br>
    $\displaystyle z_{1}^{(h)} = w_{0,1}^{(h)}a_{0}^{(in)} + w_{1,1}^{(h)}a_{1}^{(in)} + \cdots + w_{m,1}^{(h)}a_{m}^{(in)}$
    <br>
</center>
<br>

The output value of each node is the sigmoid of its inputs, so for example the output of the index $1$ node in the hidden layer is:

<br>
<center>
    <br>
    $\displaystyle a_{1}^{(h)} = \sigma(z_{1}^{(h)})$
    <br>
</center>
<br>

Note that we can view the set of inputs to the hidden layer as a row vector $\textbf{z}^{(h)}$, with values given by:

<br>
<center>
  $\displaystyle \textbf{z}^{(h)} = \left(\begin{array}{cccc} z_{1}^{(h)} & z_{2}^{(h)} & \cdots & z_{d}^{(h)}\end{array}\right) = \left(\begin{array}{cccc} a_{0}^{(in)} & a_{1}^{(in)} & \cdots & a_{m}^{(in)}\end{array}\right) \left(\begin{array}{cccc} w_{0,1}^{(h)} & w_{0,2}^{(h)} & \cdots & w_{0,d}^{(h)} \\ w_{1,1}^{(h)} & w_{1,2}^{(h)} & \cdots & w_{1,d}^{(h)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1}^{(h)} & w_{m,2}^{(h)} & \cdots & w_{m,d}^{(h)}\end{array}\right) = \textbf{a}^{(in)}\textbf{W}^{(h)}$
</center>
<br>

For any given input $X$, the output will be one of the categories $1,\ldots,t$. We can encode this as a "onehot" vector, in which every element is zero except the one corresponding to the output category, which is $1$. So, for example, if the output is the third category, the output vector $\textbf{y}$ will be:
<br>
<br>
<center>
  $\displaystyle \textbf{y} = \left(\begin{array}{cccccc} 0 & 0 & 1 & 0 & \cdots & 0\end{array}\right)$
</center>
<br>
<br>

In this case, the likelihood of an observation will be:

<br>
<center>
    <br>
    $\displaystyle \prod_{i = 1}^{t}\sigma(z_{i}^{(out)})^{y_{i}}(1-\sigma(z_{i}^{(out)}))^{1-y_{i}}$,
    <br>
</center>
<br>

and the loss function will be:

<br>
<center>
    <br>
    $\displaystyle J = -\sum_{i = 1}^{t}\left(y_{i}\log{\sigma(z_{i}^{(out)})} + (1-y_{i})\log{(1-\sigma(z_{i}^{(out)}))}\right)$.
    <br>
</center>
<br>
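To make the link between the likelihood and the loss concrete, here is a small numeric sketch with made-up sigmoid outputs for $t = 4$ output nodes. It computes both quantities and confirms that $J$ is the negative log of the likelihood.

```python
import numpy as np

a_out = np.array([0.1, 0.2, 0.7, 0.05])   # made-up sigmoid outputs, one per output node
y = np.array([0.0, 0.0, 1.0, 0.0])        # one-hot target: third category

# Likelihood: product over output nodes of sigma^y * (1 - sigma)^(1 - y)
likelihood = np.prod(a_out**y * (1.0 - a_out)**(1.0 - y))

# Loss: negative sum of the corresponding log terms
J = -np.sum(y * np.log(a_out) + (1.0 - y) * np.log(1.0 - a_out))

print(likelihood, J, -np.log(likelihood))
```

With these numbers the likelihood is $0.9 \times 0.8 \times 0.7 \times 0.95 = 0.4788$, and the last two printed values agree.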

We define $\delta_{j}^{(l)}$ to be the rate of change of the loss function with respect to the input to node $j$ of layer $l$. For example:

<br>
<center>
    $\displaystyle \delta_{1}^{(h)} = \frac{\partial J}{\partial z_{1}^{(h)}}.$
</center>
<br>

From the chain rule, we know this will be:

<br>
<center>
    $\displaystyle \frac{\partial J}{\partial z_{1}^{(h)}} = \sum_{i = 1}^{t}\frac{\partial J}{\partial z_{i}^{(out)}}\frac{\partial z_{i}^{(out)}}{\partial z_{1}^{(h)}}$.
</center>
<br>

Now, by definition we have:

<br>
<center>
    $\displaystyle z_{i}^{(out)} = \sum_{j = 1}^{d}w_{j,i}^{(out)}\sigma(z_{j}^{(h)})$.
</center>
<br>

Taking the partial derivative of this with respect to $z_{1}^{(h)}$ we have:

<br>
<center>
    $\displaystyle \frac{\partial z_{i}^{(out)}}{\partial z_{1}^{(h)}} = w_{1,i}^{(out)}\sigma'(z_{1}^{(h)}) = w_{1,i}^{(out)}\sigma(z_{1}^{(h)})(1-\sigma(z_{1}^{(h)}))$
</center>
<br>

Using this to calculate $\delta_{1}^{(h)}$, and noting that $\displaystyle \delta_{i}^{(out)} = \frac{\partial J}{\partial z_{i}^{(out)}}$, we get:

<br>
<center>
    $\displaystyle \delta_{1}^{(h)} = \sum_{i = 1}^{t}\delta_{i}^{(out)}w_{1,i}^{(out)}\sigma(z_{1}^{(h)})(1-\sigma(z_{1}^{(h)}))$
</center>
<br>

Writing

<br>
<center>
  $\displaystyle \delta^{(h)} = \left(\begin{array}{cccc} \delta_{1}^{(h)} & \delta_{2}^{(h)} & \cdots & \delta_{d}^{(h)}\end{array}\right)$
</center>
<br>

we have more generally that

<br>
<center>
    $\displaystyle \delta^{(h)} = \delta^{(out)}(\textbf{W}^{(out)})^{T} \odot (a^{(h)} \odot (1-a^{(h)}))$,
</center>
<br>

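where $\odot$ is element-wise multiplication. Our network has only one hidden layer, and its input layer applies no activation function, so this is the only error term we need to propagate. In a deeper network, the same argument carries the error back from any sigmoid layer $l+1$ to the sigmoid layer $l$ before it:

<br>
<center>
    $\displaystyle \delta^{(l)} = \delta^{(l+1)}(\textbf{W}^{(l+1)})^{T} \odot (a^{(l)} \odot (1-a^{(l)}))$.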
</center>
<br>
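The vectorized expression for $\delta^{(h)}$ can be checked against the node-by-node sum it came from. The sketch below uses made-up sizes and random stand-in values for the weights, activations, and output errors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 4, 3                                      # made-up sizes: d hidden nodes, t output nodes

W_out = rng.normal(size=(d, t))                  # W_out[j, i] is w_{j,i}^{(out)}
a_h = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))  # hidden activations sigma(z_j^{(h)})
delta_out = rng.normal(size=t)                   # stand-in output errors

# Node by node: delta_j = sum_i delta_out_i * w_{j,i} * a_j * (1 - a_j)
delta_h_loop = np.array([
    sum(delta_out[i] * W_out[j, i] for i in range(t)) * a_h[j] * (1.0 - a_h[j])
    for j in range(d)
])

# Vectorized: delta_out W_out^T, then element-wise product with a_h * (1 - a_h)
delta_h_vec = (delta_out @ W_out.T) * (a_h * (1.0 - a_h))

print(np.allclose(delta_h_loop, delta_h_vec))
```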

This idea of propagating the $\delta$ terms (the *error* terms mentioned earlier) backwards through the network is the reason for the term backpropagation. Once we have them, taking the partial derivatives is easy. For any weight in $\textbf{W}^{(out)}$ (the weights connecting the hidden layer and the output layer) we have:

<br>
<center>
    $\displaystyle \frac{\partial J}{\partial w_{i,j}^{(out)}} = a_{i}^{(h)}\delta_{j}^{(out)}$,
</center>
<br>

and for any weight in $\textbf{W}^{(h)}$ (the weights connecting the input layer and the hidden layer) we have:

<br>
<center>
    $\displaystyle \frac{\partial J}{\partial w_{i,j}^{(h)}} = a_{i}^{(in)}\delta_{j}^{(h)}$.
</center>
<br>

Using these, we can calculate our updates for gradient descent.

Note that throughout we've assumed we're doing stochastic gradient descent, updating based on just one observation at a time. This is what our program does too. For mini-batch gradient descent, or just standard (full-batch) gradient descent, the math is almost identical, just with even more summations and indices to keep straight!
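In code, those extra summations mostly disappear into a matrix product. As a sketch with made-up sizes and random stand-in values, summing the one-observation gradients $a^{(in)}\delta^{(h)}$ over a batch gives the same result as a single product of the stacked inputs and stacked errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 6, 4, 3                       # made-up sizes: n observations, m inputs, d hidden nodes

X = rng.normal(size=(n, m))             # one observation per row
delta_h = rng.normal(size=(n, d))       # stand-in hidden-layer errors, one row per observation

# One-observation view: an outer product per observation, summed over the batch
grad_sum = np.zeros((m, d))
for k in range(n):
    grad_sum += np.outer(X[k], delta_h[k])

# Batch view: a single matrix product gives the same total
grad_batch = X.T @ delta_h

print(np.allclose(grad_sum, grad_batch))
```

This is why `np.dot(X.T, delta_h)` in `back` works unchanged whether `X` holds one row or many.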