# MNIST machine learning exercise

In this exercise we will demonstrate the use of Keras and Keras Tuner to identify a feedforward neural network that best predicts a handwritten digit.

We use an MNIST-style data set of handwritten digits:

![mnist data](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

## Load and explore data (shouldn't need any transformations)

In [1]:
import os

# Set the log level before importing TensorFlow so that it takes effect.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf


In [2]:
from __future__ import print_function
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from tensorflow import keras
from tensorflow.keras import layers
from sklearn import datasets

np.random.seed(1)

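Load the digits dataset. Although this notebook refers to it as MNIST, it is the smaller Optical Recognition of Handwritten Digits dataset from the UCI Machine Learning Repository, which ships with scikit-learn.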

In [3]:
mnist = datasets.load_digits() # sklearn includes this data set .. https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

In [4]:
type(mnist)

sklearn.utils._bunch.Bunch

Notice that the dataset is stored in a Bunch object (see the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html).

We can view this dataset as similar to a dictionary; we can look at all the keys by doing the following:

In [5]:
mnist.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])

Note that there are 1797 images.

Each image is an 8x8 grid of values representing the gray level of each pixel, stored as an integer from 0 (blank background) to 16 (heaviest ink).

If we want to verify the number of images, we can use the len function.

In [6]:
print(len(mnist.data))

1797


And for each image we have a target value:

In [7]:
print(len(mnist.target))

1797


### Split data into training and test sets


In [8]:
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.2, random_state=1)
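
As a quick sanity check on the split, the arithmetic can be reproduced by hand. This is a minimal sketch that assumes scikit-learn rounds the test share up to a whole sample and gives the remainder to training, which agrees with the 1437 training rows reported by X_train.shape later in this notebook. The variable names are illustrative.

```python
import math

n_samples = 1797      # images in the digits dataset
test_size = 0.2       # fraction passed to train_test_split

# Assumption: the test count is rounded up, and the rest goes to training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
```

So every accuracy reported on the test set in this notebook is measured on a few hundred images only.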

## Network Depth and Width

A deep neural network is a neural network with a large number of layers. A wide neural network is a neural network with a large number of neurons in one or more layers. A wide network can also refer to a network that has more than one hidden layer in parallel.

**Wide and Shallow**
First, let's look at a wide and shallow network. The depth will be 1 hidden layer, while the width will be 500 neurons.

**Deep and Narrow**
Next, we will look at a deep and narrow network. The depth will be 5 hidden layers, while the width will be 200 neurons per layer.

**Wide and Deep Network (parallel)**
Finally, we will look at a wide and deep network, first as a plain stack of three hidden layers of 1000 neurons each, and then with a parallel path added. Note that placing layers side by side on the same input adds little by itself: if we concatenate 3 parallel layers of 10 neurons that all read the same input, this is equivalent to one layer of 30 neurons. Where things get more interesting is when one layer receives input from multiple earlier layers. For instance, in a three-layer network, the third layer can take the concatenation of the first and second layers' outputs as its input. This is a deep and wide network with parallel layers.
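
The equivalence between three parallel layers of 10 neurons and one layer of 30 neurons can be checked numerically. The sketch below uses plain Python in place of Keras, with a hypothetical 4-feature input and random weights; the helper `dense` is illustrative and is not a Keras function.

```python
import random

random.seed(0)

def dense(x, neuron_weights, biases):
    """ReLU dense layer: one output per (weight list, bias) pair."""
    return [max(0.0, sum(xi * w for xi, w in zip(x, weights)) + b)
            for weights, b in zip(neuron_weights, biases)]

n_inputs = 4
x = [random.uniform(-1, 1) for _ in range(n_inputs)]

# Three parallel layers of 10 neurons each, all reading the same input.
branches = []
for _ in range(3):
    w = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(10)]
    b = [random.uniform(-1, 1) for _ in range(10)]
    branches.append((w, b))

# Concatenate the outputs of the three parallel layers.
parallel_out = []
for w, b in branches:
    parallel_out += dense(x, w, b)

# One layer of 30 neurons built by stacking the same weights and biases.
wide_w = [neuron for w, _ in branches for neuron in w]
wide_b = [bias for _, b in branches for bias in b]
wide_out = dense(x, wide_w, wide_b)

print(len(parallel_out), parallel_out == wide_out)
```

Because each neuron only depends on the shared input and its own weights, splitting a layer into parallel pieces changes nothing. The architecture only gains something new when a later layer combines outputs that come from different depths.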


### Wide and Shallow

In this example we have a single hidden layer (shallow) with a relatively large number of units (wide).

Also, note that we introduce a new way of architecting the network. In the Keras intro notebook we used the following to add layers:

```python
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(64,)))  # 64 input features (8x8 pixels)
model.add(keras.layers.Dense(500, activation="relu")) 
model.add(keras.layers.Dense(10, activation="softmax")) 
```

But this same network can be defined using the technique you see below, known as the Keras functional API. Notice that the output of each layer is assigned to a variable, which allows the layers to be connected in different ways. For instance, we can connect the input layer to multiple layers, or we can connect multiple layers to a single layer. This is a very powerful way of defining a network.


In [9]:
input_ = keras.layers.Input(shape=(64,))  # 64 input features (8x8 pixels)
hidden1 = keras.layers.Dense(500, activation="relu")(input_)
output = keras.layers.Dense(10, activation="softmax")(hidden1)
model = keras.Model(inputs=[input_], outputs=[output])

Now that we have defined our neural network model, we can get a summary of the model by calling the summary() function on the model. Although it is implemented using different syntax, this network is the same model we defined in the previous intro to Keras notebook (see that notebook if you need a reminder about how to interpret this output).

In [10]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 64)]              0         
                                                                 
 dense (Dense)               (None, 500)               32500     
                                                                 
 dense_1 (Dense)             (None, 10)                5010      
                                                                 
Total params: 37,510
Trainable params: 37,510
Non-trainable params: 0
_________________________________________________________________
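
The parameter counts in this summary can be reproduced by hand: a dense layer has one weight per input per unit, plus one bias per unit. The helper name `dense_params` below is illustrative and not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (one per input per unit) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(64, 500)    # 64 pixels -> 500 hidden units
output = dense_params(500, 10)    # 500 hidden units -> 10 digit classes

print(hidden, output, hidden + output)
```

The three printed values match the Param # column and the total in the summary above.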


The above code only defines the structure of the model. We now need to compile the model. When we compile the model we specify details about how it will be trained: a loss function, an optimizer, and a metric to report.

In the following model, we will use the sparse_categorical_crossentropy loss function, which is appropriate for a multi-class classification problem whose targets are integer class labels (here, the digits 0 to 9). We will use the SGD (stochastic gradient descent) optimizer. We will use accuracy as the metric to track; note that training minimizes the loss, and accuracy is only reported.
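
The compile cell below uses `sparse_categorical_crossentropy`. To make concrete what that loss measures, here is a simplified plain-Python sketch of a softmax output and the cross-entropy for a single example. It is not the Keras implementation (which, among other things, works on batches), and the raw scores are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_crossentropy(probs, label):
    """Loss for an integer class label: -log of the probability given to that class."""
    return -math.log(probs[label])

def categorical_crossentropy(probs, one_hot):
    """The same loss, written for a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Hypothetical raw scores for the 10 digit classes; the true digit is 3.
logits = [0.1, 0.2, 0.0, 2.5, 0.3, 0.1, 0.0, 0.4, 0.2, 0.1]
probs = softmax(logits)
label = 3
one_hot = [1.0 if i == label else 0.0 for i in range(10)]

print(sparse_crossentropy(probs, label))
print(categorical_crossentropy(probs, one_hot))
```

Both calls print the same value. The sparse variant simply saves us from one-hot encoding the targets, which is convenient here because the targets are already the integers 0 to 9.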

In [11]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

Now that we have the structure of our model defined, and the details of the training process specified, we can train the model. We will train the model for 10 epochs, and use a batch size of 111. We will also evaluate the model after each epoch on the data passed as validation_data (in this notebook, the test set).

In the specific case of this dataset, we have a training dataset that is 80% of 1797 (1437) 8x8 images of handwritten digits. If we set our batch size to 111, then we will have 12 full batches and one partial batch, so 13 batches per epoch (1437/111 = 12.95). We will train the model for 10 epochs, so the weights will be updated 130 times in total.

> NOTE: Optimization algorithms (aka 'learning algorithms') generally have a number of hyperparameters. Two hyperparameters that often confuse beginners are the batch size and number of epochs.
>
> * The batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated.
> * The number of epochs is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset.
>
> So, if you have a training set of 100 observations and the batch size is 10, then the gradient descent algorithm will update the weights after every 10 observations, which is 10 updates per epoch. If we train for 100 epochs, the algorithm makes 100 complete passes through the data and updates the weights 1,000 times in total.
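
The relationship between batch size, epochs, and the number of weight updates can be written as a short calculation. This is a sketch that assumes one weight update per batch and that a final partial batch still counts as a batch; the function name `updates` is illustrative.

```python
import math

def updates(n_samples, batch_size, epochs):
    """Return (weight updates per epoch, total weight updates)."""
    per_epoch = math.ceil(n_samples / batch_size)   # a final partial batch still counts
    return per_epoch, per_epoch * epochs

print(updates(1437, 111, 10))   # this notebook's training run
print(updates(100, 10, 100))    # a toy set of 100 observations, batch size 10, 100 epochs
```

For the same data and number of epochs, a smaller batch size means more updates per epoch, and each update is computed from fewer samples.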


In [12]:
X_train.shape

(1437, 64)

In [13]:
%%time
history = model.fit(X_train, y_train, epochs=10, batch_size=111, validation_data=(X_test, y_test))

print(model.metrics_names)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
['loss', 'accuracy']
CPU times: user 732 ms, sys: 86.6 ms, total: 819 ms
Wall time: 616 ms


Now, we need to evaluate the model on the test data. We can do this with the evaluate() method.

In [14]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss {loss:.5f}\nAccuracy {accuracy:.5f}")

Loss 0.12988
Accuracy 0.96111


In [15]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 64)]              0         
                                                                 
 dense (Dense)               (None, 500)               32500     
                                                                 
 dense_1 (Dense)             (None, 10)                5010      
                                                                 
Total params: 37,510
Trainable params: 37,510
Non-trainable params: 0
_________________________________________________________________


### Deep and Narrow

Deep neural networks consist of many hidden layers. The number of layers is the depth of the network. The first layer is the input layer. The last layer is the output layer. The layers in between are the hidden layers.

In [16]:
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(64,)))  # 64 input features (8x8 pixels)
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

Alternatively, we can build a deep network using the syntax below, here with five hidden layers of 200 units each (this method is required for more complex ANN architectures, such as when some layers fork to other deeper layers).

In [17]:
input_ = keras.layers.Input(shape=(64,))  # 64 input features (8x8 pixels)
hidden1 = keras.layers.Dense(200, activation="relu")(input_)
hidden2 = keras.layers.Dense(200, activation="relu")(hidden1)
hidden3 = keras.layers.Dense(200, activation="relu")(hidden2)
hidden4 = keras.layers.Dense(200, activation="relu")(hidden3)
hidden5 = keras.layers.Dense(200, activation="relu")(hidden4)
# Five hidden layers; the output layer reads from the last one (hidden5).
output = keras.layers.Dense(10, activation="softmax")(hidden5)
model = keras.Model(inputs=[input_], outputs=[output])

In [18]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

In [19]:
history = model.fit(X_train, y_train, epochs=10, batch_size=11, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [20]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss {loss:.5f}\nAccuracy {accuracy:.5f}")

Loss 0.07236
Accuracy 0.97778


In [21]:
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 64)]              0         
                                                                 
 dense_9 (Dense)             (None, 200)               13000     
                                                                 
 dense_10 (Dense)            (None, 200)               40200     
                                                                 
 dense_11 (Dense)            (None, 200)               40200     
                                                                 
 dense_12 (Dense)            (None, 200)               40200     
                                                                 
 dense_13 (Dense)            (None, 200)               40200     
                                                                 
 dense_15 (Dense)            (None, 10)                2010

### Wide and Deep Network

Here, a 'wide and deep' network is simply a network with many layers (deep) and many units per layer (wide). The example below stacks three hidden layers of 1000 units each; the layers are connected in sequence, with no fork.

In [22]:
from tensorflow import keras
from tensorflow.keras import layers

In [23]:
input_ = keras.layers.Input(shape=(64,))  # 64 input features (8x8 pixels)
hidden1 = keras.layers.Dense(1000, activation="relu")(input_)
hidden2 = keras.layers.Dense(1000, activation="relu")(hidden1)
hidden3 = keras.layers.Dense(1000, activation="relu")(hidden2)
output = keras.layers.Dense(10, activation="softmax")(hidden3)
model = keras.Model(inputs=[input_], outputs=[output])

In [24]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

In [25]:
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [26]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss {loss:.5f}\nAccuracy {accuracy:.5f}")

Loss 0.07733
Accuracy 0.98333


In [27]:
model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 64)]              0         
                                                                 
 dense_16 (Dense)            (None, 1000)              65000     
                                                                 
 dense_17 (Dense)            (None, 1000)              1001000   
                                                                 
 dense_18 (Dense)            (None, 1000)              1001000   
                                                                 
 dense_19 (Dense)            (None, 10)                10010     
                                                                 
Total params: 2,077,010
Trainable params: 2,077,010
Non-trainable params: 0
_________________________________________________________________


### Wide and Deep Network with Parallel Layers

This network is also wide and deep, but in the example below we fork one layer: the output of the first hidden layer feeds the second hidden layer and is also concatenated with the second layer's output, so the third hidden layer sees both.

In the general wide and deep design, the network has both a wide component and a deep component. The wide component is a set of layers that are connected directly to the output layer. The deep component is a set of layers that are connected to each other, and then to the output layer. The wide component allows the network to learn simple relationships between the input features and the output. The deep component allows the network to learn complex relationships between the input features and the output. The wide and deep network combines the strengths of both components. The example below uses a variation of this idea: the short path skips a single hidden layer and rejoins the deep path through concatenation, instead of running straight to the output layer.

In [28]:
from tensorflow import keras
from tensorflow.keras import layers

In [29]:
input_ = keras.layers.Input(shape=(64,))  # 64 input features (8x8 pixels)
hidden1 = keras.layers.Dense(1000, activation="relu")(input_)
hidden2 = keras.layers.Dense(1000, activation="relu")(hidden1)
# Fork: join the outputs of hidden1 and hidden2 into one 2000-wide tensor.
concat = keras.layers.Concatenate()([hidden1, hidden2])
hidden3 = keras.layers.Dense(1000, activation="relu")(concat)
hidden4 = keras.layers.Dense(1000, activation="relu")(hidden3)
output = keras.layers.Dense(10, activation="softmax")(hidden4)
model = keras.Model(inputs=[input_], outputs=[output])

In [30]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

In [31]:
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [32]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss {loss:.5f}\nAccuracy {accuracy:.5f}")

Loss 0.06606
Accuracy 0.98611


In [33]:
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_5 (InputLayer)           [(None, 64)]         0           []                               
                                                                                                  
 dense_20 (Dense)               (None, 1000)         65000       ['input_5[0][0]']                
                                                                                                  
 dense_21 (Dense)               (None, 1000)         1001000     ['dense_20[0][0]']               
                                                                                                  
 concatenate (Concatenate)      (None, 2000)         0           ['dense_20[0][0]',               
                                                                  'dense_21[0][0]']         
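
The Concatenate layer has no parameters of its own, but it doubles the width seen by the next layer. The sketch below works out what that means for the dense layer fed by the concatenation, computed from the layer sizes in the code above. The helper name `dense_params` is illustrative and not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (one per input per unit) plus one bias per unit."""
    return n_inputs * n_units + n_units

width = 1000
concat_width = width + width   # Concatenate joins two 1000-wide outputs

after_concat = dense_params(concat_width, width)   # layer fed by the concatenation
plain_stack = dense_params(width, width)           # same layer fed by one 1000-wide layer

print(concat_width, after_concat, plain_stack)
```

The layer that reads the concatenation needs roughly twice as many weights as the same layer in a plain stack, which is the price of letting it see both the shallow and the deeper representation.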

## Summary

We explored multiple configurations of neural network architectures, primarily focusing on variations in Width, Depth, and Parallel layers.

Here are the results we obtained for these models:

* Wide and Shallow: Accuracy = 0.96111
* Deep and Narrow: Accuracy = 0.97778
* Wide and Deep: Accuracy = 0.98333
* Wide and Deep with Parallel Layers: Accuracy = 0.98611

> NOTE: There may be some variation in the above results due to the random nature of the training process. 
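
To put that variation in perspective, it helps to convert the accuracies printed by the four evaluation cells above back into counts of test images. This sketch assumes the test set holds 360 images (1797 minus the 1437 training rows).

```python
n_test = 360   # assumed test set size: 1797 images minus 1437 training rows

# Accuracies printed by the four evaluation cells above.
reported = {
    "Wide and Shallow": 0.96111,
    "Deep and Narrow": 0.97778,
    "Wide and Deep": 0.98333,
    "Wide and Deep with Parallel Layers": 0.98611,
}

# Convert each accuracy back into a count of correctly classified test images.
correct = {name: round(acc * n_test) for name, acc in reported.items()}

for name, count in correct.items():
    print(f"{name}: {count}/{n_test} correct, {n_test - count} wrong")
```

On a test set this small, the two wide and deep variants differ by a single image, so rankings between the stronger models can easily change from one training run to the next.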

In this particular instance, given the specific data, the number of layers, neurons, and interlayer connections, the architectures employing both width and depth (with and without parallel layers) exhibited superior accuracy. It's crucial, however, to highlight that this isn't a universally optimal model. Often, wide and deep architectures won't outperform their counterparts.

The key insight here is the non-existence of a universally 'best' architecture for all problems. An optimal architecture is highly dependent on specific factors like the data at hand, number of layers, the count of neurons, and the connection schema between layers. The most effective approach for uncovering the best architecture for a specific problem involves experimenting with a variety of architectures and comparing their performances.

But don't let the prospect of infinite architecture configurations and parameter settings overwhelm you. Here's where to begin:

* Literature Review: If your problem is not unique, there's a high chance it's been tackled before. Start by exploring academic and industry literature to understand what approaches others have used. This can provide you a solid base for your experiments.

* In-house Knowledge: If you're within a company, leverage the knowledge and expertise of your colleagues. They've likely worked on similar problems and their insights could be invaluable.

* Exploration and Automation: If neither of the above applies, you're essentially a pioneer, and this necessitates a more exploratory approach. You'll have to invest time in testing various combinations of architectures and parameters. Tools like Keras Tuner are highly valuable here as they facilitate automated exploration of potential architectures and parameter spaces. We'll delve deeper into this subject in a future notebook.