# From multiple logistic regression to the very first deep neural network

In lesson 5, we showed how to apply a logistic regression model to identify defective metal-casting parts. However, its accuracy was not good enough. This lesson improves the accuracy by replacing the logistic regression model with a more advanced one: a fully connected neural network with one hidden layer. Fully connected neural networks are a cornerstone of deep learning and of so-called artificial intelligence.

## Before we start

Before we start, let's import the required libraries:

In [1]:
from autograd import numpy
from autograd import grad
from matplotlib import pyplot

We will also use random numbers in this lesson. Because this is teaching material, we fix the random seed so that the numbers are NOT really random, and you will get the same results as we did when executing this notebook:

In [2]:
numpy.random.seed(10000)

In this lesson, our example application is identifying defective metal-casting parts using computer vision. We already saw this application in lesson 5. To save space, we wrapped the code that loads and normalizes the data in the file `scripts/load_casting_data.py`. Execute the following code to get the data from the wrapper:

In [3]:
import sys
sys.path.insert(0, "../scripts")
from load_casting_data import *

The above code cell loads the following variables: `res`, `n_ok_total`, `n_ok_train`, `n_ok_val`, `n_ok_test`, `n_def_total`, `n_def_train`, `n_def_val`, `n_def_test`, `images_train`, `images_val`, `images_test`, `labels_train`, `labels_val`, `labels_test`, `mu`, and `sigma`. Though the variable names should be self-explanatory enough, it's still a good idea to revisit lesson 5 for the meaning of these variables.

Also, we will reuse some functions from lesson 5. We wrapped these functions in `scripts/lesson_5_functions.py`. Let's also import these functions:

In [4]:
from lesson_5_functions import *

The above code imports `logistic`, `classify`, and `performance`. They are the same functions we used in lesson 5.

## Introduction to fully connected neural networks

### What are fully connected neural networks?

While neural networks in machine learning may be inspired by real neural networks in our bodies, from an engineering perspective, they are just mathematical models and nothing magical. Most of them can be expressed with high-school math. Recall that a mathematical model is a hypothesized relationship between input features $\mathbf{x}$ and outputs $\mathbf{y}$. For example, a linear regression model assumes $\mathbf{y} \approx \hat{\mathbf{y}} = W^\mathsf{T}\mathbf{x}+\mathbf{b}$, and a logistic regression model assumes $\mathbf{y}\approx\hat{\mathbf{y}}=\operatorname{logistic}\left(W^\mathsf{T}\mathbf{x}+\mathbf{b}\right)$. Neural networks assume a more complicated relationship between $\mathbf{x}$ and $\mathbf{y}$.

There is no universally agreed definition of neural networks. However, fully connected neural networks are commonly regarded as the most basic type of neural network model. In a fully connected neural network, an input vector $\mathbf{x}$ goes through several consecutive, interleaved linear and nonlinear transformations before the model returns $\hat{\mathbf{y}}$. A linear transformation is something like $W^\mathsf{T}\mathbf{x}$, and a nonlinear transformation may be a logistic function, a trigonometric function, or any other function that is not linear.

Given a single input vector $\mathbf{x}$, a fully connected neural network model returns $\hat{\mathbf{y}}$ through: 

$$
\begin{aligned}
\mathbf{z}^1 &= \sigma_0\left(\left(W^0\right)^\mathsf{T}\mathbf{x} + \mathbf{b}^0\right) \\
\mathbf{z}^2 &= \sigma_1\left(\left(W^1\right)^\mathsf{T}\mathbf{z}^1 + \mathbf{b}^1\right) \\
&\vdots \\
\hat{\mathbf{y}} &= \sigma_L\left(\left(W^L\right)^\mathsf{T}\mathbf{z}^L + \mathbf{b}^L\right)
\end{aligned}
$$

Or, if we prefer an iterative expression:

$$
\begin{aligned}
\mathbf{z}^0 &\equiv \mathbf{x} \\
\mathbf{z}^{i+1} &= \sigma_{i}\left(\left(W^i\right)^\mathsf{T}\mathbf{z}^{i} + \mathbf{b}^{i}\right)\text{ for }i=0,\dots,L \\
\hat{\mathbf{y}} &\equiv \mathbf{z}^{L+1}
\end{aligned}
$$
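
The iterative expression maps directly onto a loop. The following is a minimal sketch, not the code used later in this lesson: it uses plain NumPy instead of the `autograd` wrapper, and the function name `forward`, the layer sizes, and the random parameter values are all hypothetical choices made only for illustration.

```python
import numpy


def logistic(g):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + numpy.exp(-g))


def forward(x, weights, biases, activations):
    """Evaluate z^{i+1} = sigma_i((W^i)^T z^i + b^i) for i = 0, ..., L."""
    z = x  # z^0 is the input vector
    for W, b, sigma in zip(weights, biases, activations):
        z = sigma(numpy.dot(W.T, z) + b)
    return z  # z^{L+1}, i.e., y-hat


# hypothetical sizes: 5 inputs, one intermediate vector of length 3, 1 output (L = 1)
rng = numpy.random.default_rng(0)
sizes = [5, 3, 1]
weights = [rng.uniform(-1.0, 1.0, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [numpy.zeros(n) for n in sizes[1:]]
activations = [logistic] * len(weights)

x = rng.uniform(-1.0, 1.0, sizes[0])
yhat = forward(x, weights, biases, activations)
print(yhat.shape, yhat)
```

Passing lists with a single $W$, $\mathbf{b}$, and $\sigma$ makes the loop run once, which corresponds to $L=0$.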

We use bold-face symbols because these variables are vectors and matrices, as seen in lesson 6. The math equations of the linear and logistic regression models in lessons 1 to 5 are special cases of the vector-matrix form. Let's take a look at these symbols:

- Linear transformations and model parameters: $W^0$, $\mathbf{b}^0$, $\dots$, $W^L$, $\mathbf{b}^L$
  
  $W^0$, $\mathbf{b}^0$, $\dots$, $W^L$, $\mathbf{b}^L$ are parameter matrices and vectors. They are responsible for linear transformations, just like in previous lessons. The difference in this lesson is that now we have more than one linear transformation. We will determine the values of these parameters using gradient descent in this lesson.
  
  Note that the superscript of the parameter matrices and vectors starts from $0$. Usually, in math expressions, we start indices with $1$. For example, the first element in a vector $\mathbf{x}$ is $x_1$. However, we begin with $0$ for $W$ and $\mathbf{b}$ because we want to emphasize how many ***additional*** transformations we have in the model besides the last transformation, i.e., the one that returns $\hat{\mathbf{y}}$. This information is carried by the variable $L$.

- Nonlinear transformations:  $\sigma_0$, $\sigma_1$, $\dots$, $\sigma_{L}$
    
  $\sigma_0$, $\sigma_1$, $\dots$, $\sigma_{L}$ are chosen nonlinear functions. It's not uncommon to use the same function for all of them, i.e., $\sigma_0 = \sigma_1 = \dots = \sigma_{L} = \sigma$. In addition, they are often element-wise functions. That is, if we give a vector $\mathbf{g}$ to $\sigma_0$, then
  
  $$
  \sigma_0\left(\mathbf{g}\right)=
  \sigma_0\left(\begin{bmatrix}g_1 \\ g_2 \\ \vdots \end{bmatrix} \right)=
  \begin{bmatrix}\sigma_0(g_1) \\ \sigma_0(g_2) \\ \vdots \end{bmatrix}
  $$

- Intermediate results: $\mathbf{z}^1$, $\mathbf{z}^2$, $\dots$, $\mathbf{z}^L$
  
  $\mathbf{z}^1$, $\mathbf{z}^2$, $\dots$, $\mathbf{z}^L$ are intermediate results. In rare situations, we can explain the meaning of these intermediate results. For example, if we are modeling a natural phenomenon, these results may represent some physical mechanisms or properties. However, in most applications, these intermediate results are just the side products of the calculations and do not carry any meaning. 

- Number of linear-nonlinear-transformation pairs: $L$
  
  $L$ is a user-defined parameter that controls how many linear-nonlinear transformation pairs the model applies before the final pair that returns the output $\hat{\mathbf{y}}$. We use $L$ to adjust the complexity of a fully connected neural network model: a larger $L$ makes our model more complicated, and a more complicated neural network can model a more sophisticated relationship between $\mathbf{x}$ and $\mathbf{y}$.
  
  As mentioned previously, $L$ denotes how many additional transformations we have in the model before the one returning the final result. Alternatively, we can also treat $L$ as how many intermediate results we have in the model.

- Lengths of intermediate vectors: $n_1$, $n_2$, $\dots$, $n_L$
  
  $n_1$, $n_2$, $\dots$, $n_L$ are not visible in the above matrix-vector form of the model. They represent the lengths of the intermediate vectors $\mathbf{z}^1$, $\mathbf{z}^2$, $\dots$, $\mathbf{z}^L$. These variables are user-defined and also control the complexity of the model.
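
To see how $L$ and the lengths $n_i$ determine the parameters, we can list the shapes of every $W^i$ and $\mathbf{b}^i$. This is only an illustrative sketch: the helper names are hypothetical, and we assume $n_0$ denotes the length of the input $\mathbf{x}$ and $n_{L+1}$ the length of the output $\hat{\mathbf{y}}$.

```python
def parameter_shapes(sizes):
    """Return [(shape of W^i, shape of b^i), ...] for sizes [n_0, ..., n_{L+1}]."""
    return [((m, n), (n,)) for m, n in zip(sizes[:-1], sizes[1:])]


def count_parameters(sizes):
    """Total number of scalar entries in all W^i and b^i."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))


# hypothetical example: 4 inputs, intermediate vectors of length 3 and 2, scalar output
sizes = [4, 3, 2, 1]
for i, (w_shape, b_shape) in enumerate(parameter_shapes(sizes)):
    print("W^{}: {}, b^{}: {}".format(i, w_shape, i, b_shape))
print("L =", len(sizes) - 2)
print("total parameters:", count_parameters(sizes))
```

Each $W^i$ has shape $(n_i, n_{i+1})$ so that $\left(W^i\right)^\mathsf{T}\mathbf{z}^i$ has length $n_{i+1}$, which is why both a larger $L$ and longer intermediate vectors increase the number of parameters.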

If we choose $\sigma_0(\mathbf{g}) = \sigma_1(\mathbf{g}) = \dots = \sigma_{L}(\mathbf{g}) = \operatorname{logistic}(\mathbf{g}) = \frac{1}{1+\mathrm{e}^{-\mathbf{g}}}$, then a fully connected neural network is a chain of several logistic regression models. In fact, ***a logistic regression model is a fully connected neural network*** with $L=0$.

### Our first neural network

We will use the following configuration for the fully-connected neural network in this lesson:

1. $L=1$,
2. $\sigma_0(\mathbf{g})=\sigma_1(\mathbf{g})=\operatorname{logistic}(\mathbf{g})$, and
3. $n_1=64$, meaning $\mathbf{z}^1=\begin{bmatrix}z_1^1 & z_2^1 & \cdots & z_{64}^1 \end{bmatrix}^\mathsf{T}$

Also, each of our images has a total of $128\times 128=16384$ pixels (revisit lesson 5 if you don't remember), so $\mathbf{x}=\begin{bmatrix}x_1 & x_2 & \cdots & x_{16384}\end{bmatrix}^\mathsf{T}$.

If you are not comfortable using math symbols for linear algebra, it's always nice to expand the symbols with real vectors and matrices. Let's do it with the model we'll use later in this lesson:

$$
\begin{aligned}
  \left[\begin{smallmatrix}
  z_1^1 \\ \vdots \\ z_{64}^1
  \end{smallmatrix}\right]
  &=
  \operatorname{logistic}
  \left(
  \left[\begin{smallmatrix}
  W_{1,1}^0 & \cdots & W_{1,64}^0 \\
  \vdots    & \ddots & \vdots \\
  W_{16384,1}^0 & \cdots & W_{16384,64}^0 \\
  \end{smallmatrix}\right]^\mathsf{T}
  \left[\begin{smallmatrix}
  x_1 \\ \vdots \\ x_{16384}
  \end{smallmatrix}\right]
  +
  \left[\begin{smallmatrix}
  b_1^0 \\ \vdots \\ b_{64}^0
  \end{smallmatrix}\right]
  \right)
  \\
  \hat{y}
  &=
  \operatorname{logistic}
  \left(
  \left[\begin{smallmatrix}
  W_{1}^1 \\
  \vdots \\
  W_{64}^1 \\
  \end{smallmatrix}\right]^\mathsf{T}
  \left[\begin{smallmatrix}
  z_1^1 \\ \vdots \\ z_{64}^1
  \end{smallmatrix}\right]
  +
  b^1
  \right)
\end{aligned}
$$

Don't forget the transpose symbol in the equations.

If we are computing the predictions of $N$ images at once, then we have

$$
\begin{aligned}
  \left[\begin{smallmatrix}
  z_1^{1, (1)} & \cdots & z_{64}^{1, (1)} \\
  \vdots & \ddots & \vdots \\
  z_1^{1, (N)} & \cdots & z_{64}^{1, (N)} \\ 
  \end{smallmatrix}\right]
  &=
  \operatorname{logistic}
  \left(
  \left[\begin{smallmatrix}
  x_1^{(1)} & \cdots & x_{16384}^{(1)} \\
  \vdots & \ddots & \vdots \\
  x_1^{(N)} & \cdots & x_{16384}^{(N)} \\
  \end{smallmatrix}\right]
  \left[\begin{smallmatrix}
  W_{1,1}^0 & \cdots & W_{1,64}^0 \\
  \vdots    & \ddots & \vdots \\
  W_{16384,1}^0 & \cdots & W_{16384,64}^0 \\
  \end{smallmatrix}\right]
  +
  \left[\begin{smallmatrix}
  b_1^0  & \cdots & b_{64}^0 \\
  \vdots & \ddots & \vdots \\
  b_1^0  & \cdots & b_{64}^0 \\
  \end{smallmatrix}\right]
  \right)
  \\
  \left[\begin{smallmatrix}
  \hat{y}^{(1)} \\ \vdots \\ \hat{y}^{(N)}
  \end{smallmatrix}\right]
  &=
  \operatorname{logistic}
  \left(
  \left[\begin{smallmatrix}
  z_1^{1, (1)} & \cdots & z_{64}^{1, (1)} \\
  \vdots & \ddots & \vdots \\
  z_1^{1, (N)} & \cdots & z_{64}^{1, (N)} \\ 
  \end{smallmatrix}\right]
  \left[\begin{smallmatrix}
  W_{1}^1 \\
  \vdots \\
  W_{64}^1 \\
  \end{smallmatrix}\right]
  +
  \left[\begin{smallmatrix}
  b^1 \\
  \vdots \\
  b^1
  \end{smallmatrix}\right]
  \right)
\end{aligned}
$$

You may notice that we use $W^\mathsf{T}\mathbf{x}$ when we have a single image and use $XW$ when we have multiple images ($N$ images in this case). This is because we use a column vector to describe a single image, but we use a matrix to represent multiple images, in which each row vector represents one image.

If you are confused about these linear algebra calculations, writing down the shapes of these vectors and matrices may be helpful.
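
As a quick check of the shapes, the sketch below compares the single-image form $W^\mathsf{T}\mathbf{x}+\mathbf{b}$ with the multi-image form $XW+\mathbf{b}$. It uses plain NumPy and small, hypothetical sizes (5 images, 8 pixels, 3 intermediate values) rather than the real data.

```python
import numpy

rng = numpy.random.default_rng(2)
n_images, n_pixels, n_z1 = 5, 8, 3  # hypothetical small sizes

X = rng.normal(size=(n_images, n_pixels))  # each row is one image
W0 = rng.normal(size=(n_pixels, n_z1))
b0 = rng.normal(size=n_z1)

# multi-image form: (n_images, n_pixels) times (n_pixels, n_z1) -> (n_images, n_z1)
batch = numpy.dot(X, W0) + b0  # b0 is broadcast to every row

# single-image form, applied to one row of X at a time
single = numpy.array([numpy.dot(W0.T, x) + b0 for x in X])

print(batch.shape, single.shape)
print(numpy.allclose(batch, single))
```

Row $k$ of the multi-image result is exactly the single-image result for image $k$, and NumPy's broadcasting plays the role of the matrix of repeated $\mathbf{b}^0$ rows in the equation.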

Now, let's write the code for our neural network model:

In [5]:
def neural_network_model(x, params):
    """A fully-connected neural network with L=1.
    
    Arguments
    ---------
    x : numpy.ndarray
        The input of the model. Its shape should be (n_images, n_total_pixels).
    params : a tuple/list of four elements
        - The first element is W0, a 2D array with shape (n_total_pixels, n_z1).
        - The second element is b0, a 1D array with length n_z1.
        - The third element is W1, a 1D array with length n_z1.
        - The fourth element is b1, a scalar.

    Returns
    -------
    yhat : numpy.ndarray
        The predicted values obtained from the model. It's a 1D array with
        length n_images.
    """
    z1 = logistic(numpy.dot(x, params[0])+params[1])
    yhat = logistic(numpy.dot(z1, params[2])+params[3])
    return yhat

The loss function is the same as in lesson 5, except we now consider both $W^0$ and $W^1$ in the regularization: 

In [6]:
def model_loss(x, true_labels, params):
    """Calculate the predictions and the loss w.r.t. the true values.
    
    Arguments
    ---------
    x : numpy.ndarray
        The input of the model. The shape should be (n_images, n_total_pixels).
    true_labels : numpy.ndarray
        The true labels of the input images. Should be 1D and have length of
        n_images.
    params : a tuple/list of four elements
        - The first element is W0, a 2D array with shape (n_total_pixels, n_z1).
        - The second element is b0, a 1D array with length n_z1.
        - The third element is W1, a 1D array with length n_z1.
        - The fourth element is b1, a scalar.
    
    Returns
    -------
    loss : a scalar
        The summed loss.
    """
    pred = neural_network_model(x, params)
    
    n_images = x.shape[0]
    
    # major loss
    loss = - (
        numpy.dot(true_labels, numpy.log(pred+1e-15)) +
        numpy.dot(1.-true_labels, numpy.log(1.-pred+1e-15))
    ) / n_images
    
    return loss
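
The `1e-15` inside the logarithms keeps the loss finite when a prediction is exactly 0 or 1. The sketch below isolates that part of the loss in a hypothetical helper, `cross_entropy`, using plain NumPy and made-up predictions.

```python
import numpy


def cross_entropy(true_labels, pred, eps=1e-15):
    """Mean binary cross-entropy with a small offset inside the logarithms."""
    n = true_labels.shape[0]
    return -(
        numpy.dot(true_labels, numpy.log(pred + eps)) +
        numpy.dot(1.0 - true_labels, numpy.log(1.0 - pred + eps))
    ) / n


labels = numpy.array([1.0, 0.0])

# confident and correct predictions: the loss is essentially zero
good = cross_entropy(labels, numpy.array([1.0, 0.0]))
# confident and wrong predictions: large, but finite thanks to the offset
bad = cross_entropy(labels, numpy.array([0.0, 1.0]))
print(good, bad)
```

Without the offset, the second case would evaluate `numpy.log(0.)`, which is negative infinity.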

In [7]:
def regularized_loss(x, true_labels, params, _lambda=1.):
    """Return the loss with regularization.
    
    Arguments
    ---------
    x, true_labels, params :
        Parameters for function `model_loss`.
    _lambda : float
        The weight of the regularization term. Default: 1.0
    
    Returns
    -------
    loss : a scalar
        The summed loss.
    """
    loss = model_loss(x, true_labels, params)
    Nw = params[0].shape[0] * params[0].shape[1] + params[2].size
    reg = ((params[0]**2).sum() + (params[2]**2).sum()) / Nw
    return loss + _lambda * reg
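
The regularization term is the mean of the squared entries of $W^0$ and $W^1$; the biases are not penalized. A tiny worked example, with a hypothetical helper name and made-up parameter values, makes the arithmetic explicit:

```python
import numpy


def weight_penalty(params):
    """Mean of the squared entries of W0 and W1; the biases are left out."""
    W0, _, W1, _ = params
    n_weights = W0.size + W1.size
    return ((W0**2).sum() + (W1**2).sum()) / n_weights


# hypothetical tiny parameters: W0 is 2x2, W1 has 2 entries
params = (
    numpy.array([[1.0, 2.0], [3.0, 4.0]]), numpy.zeros(2),
    numpy.array([1.0, 1.0]), 0.0,
)

# (1 + 4 + 9 + 16 + 1 + 1) / 6
print(weight_penalty(params))
```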

## Initialization

As usual, we rely on the `grad` from `autograd` to get a function that calculates the derivatives:

In [8]:
# a function to get the gradients of the regularized loss w.r.t. the parameters
gradients = grad(regularized_loss, argnum=2)

For convenience, we use the variable `n0` to store the length of the input vector $\mathbf{x}$ (recall that $\mathbf{z}^0\equiv\mathbf{x}$) and use `n1` to represent $n_1$:

In [9]:
# number of elements in z0 (i.e., x), z1, ...
n0 = images_train.shape[1]  # i.e., res * res
n1 = 64

### Our regular weight initialization is not working!!!

Intuitively, we might initialize the parameters with zeros, just as we did in previous lessons. However, this is not going to work for neural networks when $L \gt 0$. We will show you why.

In [10]:
# initialize parameters
W0 = numpy.zeros((n0, n1))
b0 = numpy.zeros(n1)
W1 = numpy.zeros(n1)
b1 = 0.0

When all elements of the parameter matrices are zero, the gradients of $W^i$ for $i \ne L$ will be zeros. In our model ($L=1$), that means the gradient of $W^0$ will be zero. Without dragging you into messy mathematical equations, we can show it by using the `gradients` function to calculate the gradients:

In [11]:
grads = gradients(images_train, labels_train, (W0, b0, W1, b1))
print("Gradients of W0 are zeros:", numpy.allclose(grads[0], 0.))
print("Gradients of W1 are zeros:", numpy.allclose(grads[2], 0.))

Gradients of W0 are zeros: True
Gradients of W1 are zeros: False


Why does this matter? Recall how we perform gradient descent: we update parameters by subtracting the gradients, scaled by the step size, from the current values of the parameters:

$$
\text{new }W^0 = \text{current }W^0 - \text{step size}\times\text{gradients of }W^0
$$

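So if the gradients of $W^0$ are zeros, this update will not change the values of $W^0$ at all. Worse, because all hidden units start from identical values, they keep receiving identical updates in later iterations, so the 64 entries of $\mathbf{z}^1$ stay equal to one another and the hidden layer can do no more than a single unit could.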
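
For readers who do want to see the equations at work, here is a minimal sketch that applies the chain rule by hand to a tiny network with $L=1$. It is an illustration only: it uses plain NumPy, made-up data, and the hypothetical helper `manual_gradients`, and it ignores both the regularization term (whose gradient is also zero when the weights are zero) and the `1e-15` offsets in the loss.

```python
import numpy


def logistic(g):
    return 1.0 / (1.0 + numpy.exp(-g))


def manual_gradients(x, y, W0, b0, W1, b1):
    """Hand-derived gradients of the unregularized cross-entropy loss."""
    n = x.shape[0]
    z1 = logistic(numpy.dot(x, W0) + b0)      # shape (n, n1)
    pred = logistic(numpy.dot(z1, W1) + b1)   # shape (n,)
    d2 = (pred - y) / n                       # d(loss)/d(input of the last logistic)
    grad_W1 = numpy.dot(z1.T, d2)             # shape (n1,)
    # chain rule back through W1 and the first logistic function
    d1 = numpy.outer(d2, W1) * z1 * (1.0 - z1)
    grad_W0 = numpy.dot(x.T, d1)              # shape (n0, n1)
    return grad_W0, grad_W1


# made-up data: 4 "images" with 6 "pixels" each, 3 intermediate values
rng = numpy.random.default_rng(3)
x = rng.normal(size=(4, 6))
y = numpy.array([1.0, 0.0, 0.0, 0.0])

gW0, gW1 = manual_gradients(x, y, numpy.zeros((6, 3)), numpy.zeros(3), numpy.zeros(3), 0.0)
print("Gradients of W0 are zeros:", numpy.allclose(gW0, 0.0))
print("Gradients of W1:", gW1)
```

The factor `numpy.outer(d2, W1)` is what wipes out the gradients of $W^0$: every term passes through $W^1$, which is zero. Note also that all entries of the gradient of $W^1$ are equal, because every entry of $\mathbf{z}^1$ is $\operatorname{logistic}(0)=0.5$.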

### Xavier/Glorot initialization

To resolve the issue, we need non-zero initial values for the parameters. Xavier initialization (sometimes called Glorot initialization) is one of the most common methods to initialize parameters. The mathematical proof of why it works requires knowledge of statistics and probability. You can check reference [1] if you're interested, but here we focus on how to do it instead:

$$
W^i = \text{random numbers uniformly drawn from interval }\left[-\sqrt{\frac{6}{n_i+n_{i+1}}}, \sqrt{\frac{6}{n_i+n_{i+1}}}\right]
$$
and
$$
\mathbf{b}^i = \mathbf{0}
$$
where the bold-face zero, $\mathbf{0}$, denotes a vector of zeros.

In [12]:
# initialize parameters
scale = numpy.sqrt(6/(n0+n1))
W0 = numpy.random.uniform(-scale, scale, (n0, n1))
b0 = numpy.zeros(n1)

scale = numpy.sqrt(6/(n1+1))
W1 = numpy.random.uniform(-scale, scale, n1)
b1 = 0.

Now if you check the gradients of $W^0$ using the function `gradients`, they are not zeros anymore.
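
For networks with more layers, the same recipe can be wrapped in a loop. The sketch below is one possible generalization, not part of the lesson's code: `xavier_uniform` is a hypothetical helper, the layer lengths are made up, and plain NumPy's `default_rng` stands in for the seeded `numpy.random` used above. Unlike the cell above, it keeps the last weight as a 2D array of shape `(n1, 1)` instead of a 1D array.

```python
import numpy


def xavier_uniform(sizes, rng):
    """Draw W^i uniformly from [-s_i, s_i], s_i = sqrt(6 / (n_i + n_{i+1})); b^i = 0."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = numpy.sqrt(6.0 / (n_in + n_out))
        W = rng.uniform(-scale, scale, (n_in, n_out))
        b = numpy.zeros(n_out)
        params.append((W, b))
    return params


rng = numpy.random.default_rng(10000)
sizes = [128, 64, 1]  # hypothetical lengths n_0, n_1, n_2
params = xavier_uniform(sizes, rng)
for i, (W, b) in enumerate(params):
    bound = numpy.sqrt(6.0 / (sizes[i] + sizes[i + 1]))
    print(i, W.shape, bool(numpy.abs(W).max() <= bound))
```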

Finally, let's take a look at the initial performance of our model:

In [13]:
# initial accuracy
pred_labels_test = classify(images_test, (W0, b0, W1, b1), neural_network_model)
perf = performance(pred_labels_test, labels_test)
print("Initial precision: {:.1f}%".format(perf[0]*100))
print("Initial recall: {:.1f}%".format(perf[1]*100))
print("Initial F-score: {:.1f}%".format(perf[2]*100))

Initial precision: 59.2%
Initial recall: 88.5%
Initial F-score: 71.0%


## Optimization

The optimization is the same as how we did it in lesson 5, except that now we have to update `W0`, `b0`, `W1`, and `b1` (in lesson 5, we only updated `W` and `b`). Also, we give you a proper step size (`lr`) directly.

In [14]:
%%time

# step size
lr = 1e-1

# a variable for the change in validation loss
change = numpy.inf

# a counter for optimization iterations
i = 0

# a variable to store the validation loss from the previous iteration
old_val_loss = 1e-15

# keep running if:
#   1. we still see significant changes in validation loss
#   2. iteration counter < 10000
while change >= 1e-6 and i < 10000:
    
    # calculate gradients and apply one gradient-descent update
    grads = gradients(images_train, labels_train, (W0, b0, W1, b1))
    W0 -= (grads[0] * lr)
    b0 -= (grads[1] * lr)
    W1 -= (grads[2] * lr)
    b1 -= (grads[3] * lr)
    
    # validation loss
    val_loss = model_loss(images_val, labels_val, (W0, b0, W1, b1))
    
    # calculate precision, recall, and F-score against the validation dataset
    pred_labels_val = classify(images_val, (W0, b0, W1, b1), neural_network_model)
    score = performance(pred_labels_val, labels_val)

    # calculate the relative change in validation loss
    change = numpy.abs((val_loss-old_val_loss)/old_val_loss)

    # update the counter and old_val_loss
    i += 1
    old_val_loss = val_loss
    
    # print the progress every 10 steps
    if i % 10 == 0:
        print("{}...".format(i), end="")

10...20...30...40...50...60...70...80...90...100...110...120...130...140...150...160...170...180...190...200...210...220...230...240...250...260...270...280...290...300...310...320...330...340...350...360...370...380...390...400...410...420...430...440...450...CPU times: user 3min, sys: 9.24 s, total: 3min 9s
Wall time: 31.7 s
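
The stopping rule in the loop above does not depend on the model, so we can try it on a toy problem that runs instantly. In this sketch, the quadratic `toy_loss` and its gradient are hypothetical stand-ins for the network's loss, and the same quantity plays the role of both the training loss and the validation loss.

```python
def toy_loss(w):
    return (w - 3.0) ** 2 + 1.0


def toy_gradient(w):
    return 2.0 * (w - 3.0)


w = 0.0
lr = 1e-1
change = float("inf")
old_loss = 1e-15
i = 0

# keep running while the relative change in the loss is still significant
while change >= 1e-6 and i < 10000:
    w -= lr * toy_gradient(w)
    loss = toy_loss(w)
    change = abs((loss - old_loss) / old_loss)
    old_loss = loss
    i += 1

print(i, w, loss)
```

The loop stops well before the iteration cap, with `w` close to (but not exactly at) the minimizer 3. A small relative change in the loss signals that progress has stalled; it does not guarantee that the exact minimum has been reached.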


In [15]:
# regularized loss on the training set (the quantity gradient descent minimizes)
train_loss_reg = regularized_loss(images_train, labels_train, (W0, b0, W1, b1))

print("")
print("")
print("Upon optimization stopped:")
print("    Iterations:", i)
print("    Validation loss w/o regularization:", val_loss)
print("    Training loss w/ regularization:", train_loss_reg)
print("    Validation precision:", score[0])
print("    Validation recall:", score[1])
print("    Validation F-score:", score[2])
print("    Change in validation loss:", change)



Upon optimization stopped:
    Iterations: 453
    Validation loss w/o regularization: 0.2787043104606595
    Training loss w/ regularization: 0.024497566748684113
    Validation precision: 0.8993288590604027
    Validation recall: 0.8589743589743589
    Validation F-score: 0.8786885245901639
    Change in validation loss: 1.6184802410438597e-07


Finally, let's check the final model performance:

In [16]:
# final accuracy
pred_labels_test = classify(images_test, (W0, b0, W1, b1), neural_network_model)
perf = performance(pred_labels_test, labels_test)
print("Final precision: {:.1f}%".format(perf[0]*100))
print("Final recall: {:.1f}%".format(perf[1]*100))
print("Final F-score: {:.1f}%".format(perf[2]*100))

Final precision: 91.8%
Final recall: 93.6%
Final F-score: 92.7%


Awesome! Compared to lesson 5, simply replacing the logistic regression model with a fully connected neural network model with $L=1$ improves the F-score from $86.6\%$ to $92.7\%$. The precision is now $91.8\%$, meaning that whenever our model predicts that a casting part is defective, there is a $91.8\%$ chance it's correct. And the recall is now $93.6\%$, which says that our model misses around $6.4\%$ of defective parts.
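
As a sanity check on these numbers, we can recompute the F-score from the printed precision and recall. This sketch assumes the F-score returned by `performance` is the harmonic mean of precision and recall (the F1 score), which is consistent with the values printed in this lesson; `f_score` is a hypothetical helper, not the lesson 5 function.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2.0 * precision * recall / (precision + recall)


# test-set precision and recall printed above, as fractions
print("F-score: {:.1f}%".format(f_score(0.918, 0.936) * 100))

# the recall also gives the share of defective parts the model misses
print("Missed defective parts: {:.1f}%".format((1.0 - 0.936) * 100))
```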

Of course, the performance may still be unsatisfying. However, one benefit of using neural networks is that we can control the complexity of our model. For example, we may get better performance if we increase $L$ and the lengths of intermediate results $\mathbf{z}^1$, $\mathbf{z}^2$, $\dots$, $\mathbf{z}^L$. 

What we didn't address in this lesson, however, is that increasing the complexity also makes optimization harder. Optimization may need much more time to converge or may not even find a satisfying solution, so we cannot freely increase $L$ and the lengths of the intermediate results to whatever values we like. One way to ease the optimization with a large $L$ and long intermediate vectors is to use advanced optimization methods; we will see more optimization methods in lesson 8. Another approach is to use other types of neural networks, such as convolutional neural networks, which are a more popular choice for computer vision applications. We will talk about this in a later lesson.

## Conclusion

In this lesson, we learned:

1. the concept of fully connected neural networks,
2. how to code a fully connected neural network, and
3. how to initialize parameters of a neural network.

We hope this lesson gave you some sense of what neural networks and deep learning are and how they are the same as or different from linear/logistic regressions.

## Reference

1. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh & M. Titterington (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Vol. 9, pp. 249–256). PMLR. http://proceedings.mlr.press/v9/glorot10a.html
