## Lesson 2: Computer Vision Neural Networks

#### Lesson Overview
We will walk through some basic neural network implementations where our application for these networks is the computer vision task of image classification. Specifically, we will go through a basic machine learning workflow:
1. Load, examine, understand, and preprocess the data
2. Build the model
3. Train the model
4. Test the model
5. Improve the model and repeat the process

#### Lesson Goals
By the end of this lesson you should be able to
1. Perform basic data inspection and preprocessing
2. Implement a neural network model in TensorFlow
3. Train, test, and tune a model

----

#### Motivation
On average, the postal service delivers around 300 million pieces of mail a day. For each piece of mail, the address must be processed and put into a computer system that determines where to send the mail. While this task is not difficult for a human to perform, the sheer volume of mail makes it impractical for people to look at each piece of mail individually and process the address. So, we would like to automate this task with a computer. To start out, we want our computer to be able to recognize and distinguish handwritten digits (i.e., numbers 0, 1, ..., 9) so that it can process address numbers and postal codes.

#### Why use Neural Networks for this task?
This Computer Vision task (i.e., getting a computer to "see" and "recognize" as a human is able to) of image classification is very difficult to program explicitly. It is not too difficult to imagine that different people write digits differently, for example, writing a 2 with or without a loop or writing a digit slightly slanted. All these countless different styles of writing digits would need to be accounted for in an explicit program, which is not feasible to do. Instead, we can use image data containing handwritten digits to train (optimize) a neural network to recognize said digits. After training, we hope the neural network will distinguish digits and implicitly capture different writing styles.

#### Dataset
We will use an existing dataset that can be easily loaded using TensorFlow. This dataset, named the MNIST database (see [this Wiki for more info](https://en.wikipedia.org/wiki/MNIST_database)), contains grayscale images of size 28x28 where each image shows a single handwritten digit between 0 and 9. Note that TensorFlow contains a handful of readily available datasets; see [the docs](https://www.tensorflow.org/api_docs/python/tf/keras/datasets) for a list.

In [None]:
# Load necessary packages for this lesson
import os

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn

#### Load the dataset

The MNIST database can be easily loaded using the built-in TensorFlow function
<center> tf.keras.datasets.mnist.load_data() </center>

In [None]:
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()

#### Inspect the dataset

Let's look at the size and data types of the datasets. By convention in TensorFlow, the first axis indexes the samples.

In [None]:
def print_info(dataset):
	print(f'Shape: {dataset.shape} | Data Type {dataset.dtype} \n')

print_info(train_x) # 60000 images of size 28x28. data type uint8 = unsigned integer stored with 8 bits
print_info(train_y) # 60000 category labels denoting which digit is shown on the image
print_info(test_x)
print_info(test_y)

Since we know our dataset contains images, let's plot a few to get a better idea of what the data looks like.

Plot a single image and print the corresponding label

In [None]:
idx = 0
img = train_x[idx]
plt.imshow(img, cmap='gray')
plt.title(f"Label = {train_y[idx]}")
plt.show()

Plot many images and print the corresponding labels

In [None]:
sample_size=20
np.random.seed(2)

# Randomly sample indices
sample_indices = np.random.choice(np.arange(train_x.shape[0]), size=sample_size, replace=False)
# Calculate optimal number of rows and columns
num_columns = int(np.ceil(np.sqrt(sample_size)))
num_rows = int(np.ceil(sample_size / num_columns))
# Plot images
fig, axes = plt.subplots(num_rows, num_columns, figsize=(num_columns * 2, num_rows * 2))
axes = axes.flatten()
for ax, idx in zip(axes, sample_indices):
	ax.imshow(train_x[idx], cmap='gray')
	ax.set_title(f'Label = {train_y[idx]}')
	ax.axis('off')
# Hide any remaining empty axes
for i in range(len(sample_indices), len(axes)):
	axes[i].axis('off')
plt.show()

We can see our dataset indeed contains images of digits and the corresponding labels match the digits shown in each image. Just from these few examples we can also see how the same digit can be written quite differently.

Next, let's calculate some statistics about our data. We will write a function to calculate the max, min, mean, median, and standard deviation of our dataset.

In [None]:
def print_stats(dataset):
	max_value = tf.math.reduce_max(dataset)
	min_value = tf.math.reduce_min(dataset)
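	mean_value = tf.math.reduce_mean(tf.cast(dataset, dtype='float32')) # cast to float first: reduce_mean of an integer tensor returns an integer result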
	median_value = np.median(dataset)
	std_value = tf.math.reduce_std(tf.cast(dataset, dtype='float32'))

	print(f"Max: {max_value}")
	print(f"Min: {min_value}")
	print(f"Mean: {mean_value}")
	print(f"Median: {median_value}")
	print(f"Standard Deviation: {std_value}")

Apply the stats function to our image datasets

In [None]:
print_stats(train_x)

In [None]:
print_stats(test_x)

We see the pixel values of our images range from 0 to 255 (these are in units of pixel intensity), with at least half of the pixel values being 0 (since the median value is 0). It is typical for images to take values in [0, 255], where an intensity of 0 is black and an intensity of 255 is white. Such a wide range of large input values is not ideal for neural network processing and training. Generally, we want to keep the inputs to neural networks small, e.g., in the range [-1, 1], which we can do by preprocessing our dataset.

#### Data Preprocessing

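The two most common data preprocessing techniques for neural network data are
1. `Normalization`: Scale and shift the data to mean zero and unit standard deviation
2. `Standardization` or `Unitization`: Scale and shift the data to the range [0,1]

Note that many references use these two names the other way around (calling the zero-mean, unit-standard-deviation rescaling *standardization* and the rescaling to [0,1] *normalization*); this lesson uses the names as defined above.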

Both techniques have been implemented in *custom_preprocessing.py*.

Due to the simplicity of our data and for brevity, we will use `Standardization`. Since our image data is in range [0, 255] we can map this to [0, 1] in one operation: dividing by 255.
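As a rough illustration of the two techniques listed above, the following sketch applies each one to a tiny made-up list of pixel intensities using only the Python standard library. The helper names `scale_to_unit_range` and `shift_to_zero_mean_unit_std` are hypothetical and are not the functions in *custom_preprocessing.py*.

```python
import statistics

def scale_to_unit_range(values):
    """Scale and shift values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def shift_to_zero_mean_unit_std(values):
    """Scale and shift values to mean zero and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

pixels = [0, 0, 51, 204, 255]  # made-up pixel intensities in [0, 255]

unit_range = scale_to_unit_range(pixels)
zero_mean = shift_to_zero_mean_unit_std(pixels)

print(unit_range)
print(zero_mean)
```

Because the smallest and largest made-up values are 0 and 255, scaling to [0, 1] here reduces to dividing by 255, which is the shortcut used in this lesson.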

To preprocess, we will want to convert the data type of our input images from integers to floats, which we can do with *tf.cast*

In [None]:
# Convert from uint8 to float32
train_x = tf.cast(train_x, dtype='float32')
test_x = tf.cast(test_x, dtype='float32')

# Standardize
train_x = train_x/255.0
test_x = test_x/255.0

Let's display our statistics information again to see the change

In [None]:
print_stats(train_x)

In [None]:
print_stats(test_x)

Now the data has been scaled to the range [0, 1], as expected.

#### Reexpress Labels

Similar to image preprocessing, we need to do a bit of preprocessing on the labels for each image. Right now each label is an integer in the range [0, 9]. For classification problems, we often convert such a label number into a `One-Hot` probability vector. We will use the TensorFlow function
<center> tf.one_hot </center>

which expects as input a 1d array of numbers of shape (number of samples, )

In [None]:
num_classes = len(set(train_y)) # calculate number of classes. Should be 10 as we have one for each digit 0, 1, ..., 9
print(f'Number of Classes = {num_classes}')

print(f'Old Labels \n {train_y[0:5]} \n')

train_y = tf.one_hot(train_y, num_classes)
test_y = tf.one_hot(test_y, num_classes)

print(f'New Labels \n {train_y[0:5]}')

Notice that label 0 now has a 1 only in the 0th vector entry, label 1 now has a 1 only in the 1st vector entry, label 2 now has a 1 only in the 2nd vector entry, etc. So, these `One-Hot` encodings can be viewed as probability vectors where each entry $i$ in a vector is the probability the digit in the corresponding image is $i$. For example, $[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]$ says that with 100% probability the image displays a 0.
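To make the encoding concrete, here is a minimal pure-Python sketch of the same idea. The helper `one_hot` below is a hypothetical stand-in written for illustration; it is not how *tf.one_hot* is implemented, and the labels are made up.

```python
def one_hot(labels, num_classes):
    """Return one list per label with a 1.0 at the label's index and 0.0 elsewhere."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)] for label in labels]

labels = [0, 1, 2, 7]  # made-up digit labels
encoded = one_hot(labels, num_classes=10)
for label, vector in zip(labels, encoded):
    print(label, vector)

# The original label is recovered as the index of the largest entry (the argmax)
decoded = [vector.index(max(vector)) for vector in encoded]
print(decoded)
```

Each encoded vector sums to 1, which is what allows it to be read as a probability vector.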

#### Data Splitting

The final bit of work we need to do with the data is to split it into training, testing, and validation datasets:
* Training data: Data provided to a neural network to train (i.e., optimize) the model parameters
* Testing data: Data unseen by the network during training that is used after training to assess the generalizability of the model. We want our model to perform well on any future input even if it did not explicitly see it during training
* Validation data: Data provided to a neural network during the training process to act as test data during training. Usually used to detect model overfitting. This data is NOT used to train the model parameters

Our dataset is conveniently already split into training and testing datasets. All we need now is the validation data, which we create by splitting the training dataset. We will take the final 10,000 images from the training dataset; however, we could also randomly sample from the training data to form the validation data.

In [None]:
# Take final 10000 samples as the validation data
val_x = train_x[-10000:]
val_y = train_y[-10000:]

# Remove the validation data from the training data
train_x = train_x[:-10000]
train_y = train_y[:-10000]
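The random-sampling alternative mentioned above could look roughly like the following sketch, which only chooses the indices for each split. The function `random_split` and the small sizes are assumptions made for illustration; applying the indices to the image and label tensors is left out.

```python
import random

def random_split(num_samples, num_validation, seed=0):
    """Return (train_indices, val_indices) chosen at random without overlap."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    val_indices = sorted(indices[:num_validation])
    train_indices = sorted(indices[num_validation:])
    return train_indices, val_indices

# Small made-up sizes; the lesson itself holds out 10000 of 60000 images
train_indices, val_indices = random_split(num_samples=12, num_validation=2)
print(train_indices)
print(val_indices)
```

Whichever way the split is made, the important property is that no sample appears in both the training and validation sets.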

In [None]:
print_info(train_x) # 50000 images of size 28x28
print_info(train_y) # 50000 category labels denoting which digit is shown on the image
print_info(val_x)
print_info(val_y)
print_info(test_x)
print_info(test_y)

#### Creating a Neural Network Model

Now that we have completed the inspection and preprocessing of our data, it is time to construct a neural network model, which consists of the following main components
1.	A collection of `layer`s where each `layer` is a mathematical operation applied to the output of the previous `layer`. (This collection of layers is what is colloquially known as a *neural network* or *neural network model* or *model*, and each layer may contain some *parameters* or *weights* to be optimized)
	* Ex.) A fully-connected (or dense) layer, i.e., $\sigma(Ax+b)$ for $A$ a dense matrix
2.	A `loss function` that calculates the error between the neural network output, given an input, and the actual label or output for the given input
	* Ex.) The mean squared error (or Euclidean error) $||x - y||_2^2$
3.  An `optimizer` that updates the neural network's weights using the current weights and the gradient of the `loss function` with respect to the neural network's weights
	* Ex.) A gradient descent update $x - \alpha \nabla f(x)$ with step size $\alpha$
4. (Optional) `Metric`s that calculate other error values, outside of the loss function, during network training and testing
	* Ex.) In addition to loss, calculate the mean absolute error $||x - y||_1$
5. (Optional) `Callback` functions that are called periodically throughout training
	* Ex.) Function to save the model every so often
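To ground the first two components, here is a small pure-Python sketch of a single dense layer $\sigma(Ax+b)$ with a ReLU activation, followed by a squared Euclidean error. All weights, inputs, and targets are made up, and the helper names are hypothetical; TensorFlow layers perform the same kind of computation on whole batches.

```python
def relu(vector):
    """Replace negative entries with 0."""
    return [max(0.0, v) for v in vector]

def dense(A, b, x, activation):
    """Compute activation(Ax + b) for a weight matrix A (list of rows) and bias b."""
    pre_activation = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
                      for row, b_i in zip(A, b)]
    return activation(pre_activation)

def squared_error(output, target):
    """Squared Euclidean error ||output - target||_2^2."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

# Made-up weights mapping a vector of size 3 to a vector of size 2
A = [[1.0, -2.0, 0.5],
     [0.0, 1.0, -1.0]]
b = [0.5, -1.0]
x = [2.0, 1.0, 4.0]

output = dense(A, b, x, relu)
loss = squared_error(output, target=[2.0, 0.0])
print(output, loss)
```

The entries of `A` and `b` are the parameters an optimizer would adjust to make the loss smaller.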

#### Define the collection of `layer`s and model structure

We will use a `Sequential` model, which is a basic stack of layers where each layer has exactly one input tensor and one output tensor, as this basic model type is sufficient for our purposes. A `Sequential` model takes as input a list of layers and creates a TensorFlow model with these layers.

An example dense (or fully-connected) Sequential model is given in the next cell where a Dense layer is created with the following code
<center> tf.keras.layers.Dense(output dimension, activation="name of activation function") </center>

Note that our final layer must map to $\mathbb{R}^{10}$ as we want the output from our model to be a vector of size *num_classes=10* in order to match the size of the `One-Hot` data labels.

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Flatten(), # flatten input 2d images into a long vector
	tf.keras.layers.Dense(128, activation='relu'),	# apply a dense layer mapping an input vector into a vector of dimension 128 then apply a ReLU activation function
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector of dimension 128 into a vector of dimension 10 then apply a Softmax activation function. Softmax converts a vector of values into a vector of probabilities
]

model = tf.keras.Sequential(layers)

#### Compile the model with a `loss function`, `optimizer`, and (optionally) `metric`s

In [None]:
loss_function = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001) # the learning rate is essentially the gradient descent "step size"
metrics = [
	'accuracy'
] # metrics are provided as a list and each metric in the list will be evaluated

model.compile(optimizer=optimizer, loss=loss_function, metrics=metrics) # compile the model by assigning it a loss function, optimizer, and (optionally) metrics
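To illustrate the "step size" role of the learning rate, the sketch below runs plain gradient descent on a made-up one-dimensional objective with two different learning rates. This is deliberately simpler than Adam, which adapts its steps using running gradient statistics; the function names here are hypothetical.

```python
def gradient_descent(grad, x0, learning_rate, num_steps):
    """Repeatedly apply the update x <- x - learning_rate * grad(x)."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * grad(x)
    return x

def grad_f(x):
    """Gradient of the made-up objective f(x) = (x - 3)^2, minimized at x = 3."""
    return 2.0 * (x - 3.0)

small_step = gradient_descent(grad_f, x0=0.0, learning_rate=0.01, num_steps=50)
large_step = gradient_descent(grad_f, x0=0.0, learning_rate=0.1, num_steps=50)
print(small_step, large_step)
```

With the same number of steps, the larger learning rate ends up much closer to the minimizer on this simple problem. On harder problems a learning rate that is too large can overshoot, which is why it is treated as a value to tune.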

#### Build the model

Providing an input into a TensorFlow model "builds" it (i.e., constructs the weight matrices for each layer)

In [None]:
_ = model(train_x[0:5]) # provide first 5 images as input

Display a tabular summary of the model. This shows the layers, the parameters in each layer, and the total parameters of the network.
* Flatten Layer: has no parameters as all it does is flatten the input matrices into vectors
* First Dense Layer: Maps an input vector of size 784 to an output vector of size 128 via a 784x128 matrix (with 784x128=100352 entries) and then adds a bias vector of size 128, giving 100352+128=100480 parameters in this layer
* Second Dense Layer: Maps an input vector of size 128 to an output vector of size 10 via a 128x10 matrix and then adds a bias vector of size 10 giving 128x10+10=1290 parameters in this layer

In [None]:
model.summary()
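The parameter counts described above can be double-checked with a few lines of arithmetic. The helper `dense_param_count` is a hypothetical name used only for this sketch.

```python
def dense_param_count(input_size, output_size):
    """Weights in an input_size x output_size matrix plus one bias per output."""
    return input_size * output_size + output_size

flattened_size = 28 * 28  # a 28x28 image flattened into a vector of size 784
first_dense = dense_param_count(flattened_size, 128)
second_dense = dense_param_count(128, 10)
total = first_dense + second_dense
print(first_dense, second_dense, total)
```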

#### Train the model

To train our TensorFlow model we will call the *.fit()* method built-in to all TensorFlow models. For this we need to provide:
* Training inputs: *train_x* set of input training images
* Training labels: *train_y* set of training labels
* Batch size: the number of images that will be processed by the network at the same time
* Epochs: the number of passes over the training data. In each epoch the network sees all the training data once
* Validation data: the validation images and labels as a tuple
* (Optional) Callbacks: a list of callbacks to call during training
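As a quick sanity check on how batch size and epochs interact, the sketch below counts how many batches (and therefore weight updates) occur in one epoch and over a whole run. It assumes the final, smaller batch is still processed; the helper name is hypothetical, and the numbers match the values used later in this lesson.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

num_train = 50000  # training images left after holding out the validation data
batch_size = 256
epochs = 25

per_epoch = updates_per_epoch(num_train, batch_size)
total_updates = per_epoch * epochs
print(per_epoch, total_updates)
```

Halving the batch size roughly doubles the number of updates per epoch, which is one reason batch size and learning rate are often tuned together.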

Create the folder where we will save the training results

In [None]:
save_base = os.path.join('..', 'training-results')
path_to_save = os.path.join(save_base, 'Trial 0')
count = 0
while os.path.exists(path_to_save):
	count += 1
	path_to_save = os.path.join(save_base, f'Trial {count}')
os.makedirs(path_to_save)

Create callbacks and train the model

In [None]:
callbacks = [
	# Note: newer Keras versions (Keras 3) require weights-only checkpoint filenames to end in '.weights.h5'
	tf.keras.callbacks.ModelCheckpoint(os.path.join(path_to_save, 'best_weights.h5'), monitor='val_loss', save_best_only=True, save_weights_only=True) # Will watch the loss value on the validation dataset and save the weights from the model that produce the best validation loss
]
history = model.fit(train_x, train_y, batch_size=256, epochs=25, validation_data=(val_x, val_y), callbacks=callbacks) # train model. Returns a History object containing the training and validation losses and metrics

#### Plot the training results

*model.fit()* returns a History object containing the training and validation losses and metrics in a dictionary. Let's plot these values

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(len(history.history['loss']))

plt.figure()
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.tight_layout()
plt.show()

We can see the loss decreasing with each epoch. As our loss is the mean squared error between the network outputs and the true labels, a smaller loss implies that the network better matches the true labels, which is what we want.

We also see the accuracy is increasing each epoch, which again is what we want as this says our model is better classifying the input image digits. At the end of our training, the model correctly classifies the images in the training dataset 99.68% of the time and correctly classifies the images in the validation dataset 97.67% of the time. Our simple model is very accurate!
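The per-epoch validation losses are also what the `ModelCheckpoint` callback watches when `monitor='val_loss'` and `save_best_only=True` are set. The sketch below mimics that idea on made-up loss values; it is an illustration of the bookkeeping, not the Keras implementation.

```python
# Made-up validation losses, one per epoch; a real run would read these
# from history.history['val_loss']
val_loss = [0.020, 0.012, 0.009, 0.007, 0.008, 0.0075]

# Epoch with the smallest validation loss
best_epoch = min(range(len(val_loss)), key=lambda epoch: val_loss[epoch])
print(best_epoch, val_loss[best_epoch])

# Epochs at which the running best improves, i.e. when a
# "save best only" checkpoint would be overwritten
best_so_far = float('inf')
save_epochs = []
for epoch, value in enumerate(val_loss):
    if value < best_so_far:
        best_so_far = value
        save_epochs.append(epoch)
print(save_epochs)
```

In this made-up run the last two epochs do not improve on epoch 3, so the saved weights would come from epoch 3 rather than from the end of training.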

#### Test the model

We can test our model in two ways:
1. *model.evaluate*: returns the average loss value and metrics across all images in the testing dataset

In [None]:
test_errors = model.evaluate(test_x, test_y, batch_size=256, return_dict=True)
print(test_errors)

Our model correctly classifies the input test images 97.79% of the time. Again our simple model is very accurate! Note that the test loss and accuracy are far closer to the validation loss and accuracy than to the training loss and accuracy, which is usually the case as validation results typically represent the testing results well.

2. *model.predict*: returns the model output for every input image in the testing dataset

In [None]:
test_predictions = model.predict(test_x, batch_size=256)
print(test_predictions.shape)

Now we can calculate losses and metrics ourselves

In [None]:
loss = tf.math.reduce_mean(tf.keras.losses.mse(test_y, test_predictions))
acc = tf.math.reduce_mean(tf.keras.metrics.categorical_accuracy(test_y, test_predictions))
print({'loss': loss.numpy(), 'accuracy': acc.numpy()}) # should match loss and accuracy from above
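For intuition about what these two quantities measure, here is a pure-Python sketch computing them by hand on a made-up 3-class example. The helper functions are hypothetical and only mirror the definitions (argmax agreement for accuracy, average squared difference for the loss).

```python
def argmax(vector):
    """Index of the largest entry."""
    return max(range(len(vector)), key=lambda i: vector[i])

def categorical_accuracy(labels, predictions):
    """Fraction of samples whose predicted argmax matches the one-hot label's argmax."""
    matches = [argmax(y) == argmax(p) for y, p in zip(labels, predictions)]
    return sum(matches) / len(matches)

def mean_squared_error(labels, predictions):
    """Average of the squared entrywise differences over all samples and classes."""
    squared = [(y_i - p_i) ** 2
               for y, p in zip(labels, predictions)
               for y_i, p_i in zip(y, p)]
    return sum(squared) / len(squared)

# Made-up one-hot labels and model outputs for a 3-class problem
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
predictions = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]

accuracy = categorical_accuracy(labels, predictions)
mse = mean_squared_error(labels, predictions)
print(accuracy, mse)
```

Only the third made-up sample is misclassified, so three of the four predictions count as correct.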

Let's take a closer look at the network outputs to better understand what the network returns

First we take a subset of the test images and labels and plot these

In [None]:
sample_size=4
sub_x = test_x[0:sample_size]
sub_y = test_y[0:sample_size]

sample_indices = np.arange(sample_size)

# Calculate optimal number of rows and columns
num_columns = int(np.ceil(np.sqrt(sample_size)))
num_rows = int(np.ceil(sample_size / num_columns))

# Plot images
fig, axes = plt.subplots(num_rows, num_columns, figsize=(num_columns * 3, num_rows * 3))
axes = axes.flatten()
for ax, idx in zip(axes, sample_indices):
	ax.imshow(sub_x[idx], cmap='gray')
	ax.set_title(f'Label = \n {sub_y[idx]}')
	ax.axis('off')

# Hide any remaining empty axes
for i in range(len(sample_indices), len(axes)):
	axes[i].axis('off')
	
plt.show()

Now let's see how our model classifies these images

In [None]:
label_predict_prob = model(sub_x)
print(label_predict_prob)

These outputs are probability vectors, due to the Softmax activation function, but what exactly do these numbers mean? Let's break one of these outputs down further to find out. Consider the first output
<center> [5.2421272e-09, &nbsp; 7.3183758e-11, &nbsp; 2.1322680e-06, &nbsp; 1.9078225e-05, &nbsp; 4.0508985e-11, &nbsp;
  4.7977098e-09, &nbsp; 4.1967412e-15, &nbsp; 9.9997842e-01, &nbsp; 5.1285550e-08, &nbsp; 2.7951430e-07] </center>
  
and recall the input image was the 7 in the above plot. This output from the model says that it predicts that the digit in this input image is a:
* 0 with probability 5.242e-09 (a 5.242e-07 % chance that it is a 0)
* 1 with probability 7.318e-11 (a 7.318e-09 % chance that it is a 1)
* 2 with probability 2.132e-06 (a 2.132e-04 % chance that it is a 2)
* 3 with probability 1.908e-05 (a 1.908e-03 % chance that it is a 3)
* 4 with probability 4.051e-11 (a 4.051e-09 % chance that it is a 4)
* 5 with probability 4.798e-09 (a 4.798e-07 % chance that it is a 5)
* 6 with probability 4.197e-15 (a 4.197e-13 % chance that it is a 6)
* 7 with probability 0.99998 (a 99.998 % chance that it is a 7)
* 8 with probability 5.129e-08 (a 5.129e-06 % chance that it is an 8)
* 9 with probability 2.795e-07 (a 2.795e-05 % chance that it is a 9)

That is, the network is very confident that the input image displays a 7, which is correct! The other values are so small they are effectively 0.

To assign a single classification from one of these output probability vectors produced by the network, we can simply take the argmax of the network output vector. Doing so says that the category to which the network assigns the highest probability is the category the network chooses for classification. Note that this is what happens inside the 'accuracy' metric. Let's see this in action

In [None]:
label_predict = tf.math.argmax(label_predict_prob, axis=-1)
print(label_predict)

The network classifications are 7, 2, 1, and 0, which match the four plots above!
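To see where such probability vectors come from, the following sketch implements a softmax by hand and then takes the argmax. The input values are made up, and the helper is written for illustration rather than taken from TensorFlow.

```python
import math

def softmax(values):
    """Map real values to positive numbers that sum to 1."""
    largest = max(values)
    shifted = [v - largest for v in values]  # subtracting the max avoids overflow in exp
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

raw_outputs = [1.0, 3.0, 0.5, 8.0]  # made-up pre-activation values for 4 classes
probabilities = softmax(raw_outputs)
predicted_class = probabilities.index(max(probabilities))

print(probabilities)
print(sum(probabilities), predicted_class)
```

Because the exponential exaggerates differences, one clearly largest raw value ends up with nearly all of the probability, much like the 7 in the example above.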

#### Network Tuning

Next, we will load the function *train_and_test* from the file *train_img_classification_net*. For convenience, this function implements the training and testing code we just walked through above all in one function call.

In [None]:
from train_img_classification_net import train_and_test

Recall the network structure from before, which has the layers
<center> Flatten -> Dense(128) -> Dense(10) </center>

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Flatten(), # flatten input 2d images into a long vector
	tf.keras.layers.Dense(128, activation='relu'),	# apply a dense layer mapping an input vector into a vector of dimension 128 then apply a ReLU activation function
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector of dimension 128 into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

We run *train_and_test* with the call in the next cell. Doing so, we see the exact same testing results that we had previously.

In [None]:
test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.001, batch_size=256, epochs=25)

The particular choice of network structure we were using was fairly arbitrary and we could alter said structure in a few key ways:
1. Change the hidden size
2. Change the number of layers
3. Change the layer types

#### 1. Change the hidden size

* Generally, a larger hidden size means more network parameters, which in turn gives the network a greater learning capacity (i.e., a better network) if enough data is available. However, with too many parameters or not enough data the model will overfit (i.e., memorize the training data and be unable to generalize well to the testing data)

Let's look at a few examples where we change the hidden size

##### Example 1: Hidden size 256

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Flatten(), # flatten input 2d images into a long vector
	tf.keras.layers.Dense(256, activation='relu'),	# apply a dense layer mapping an input vector into a vector of dimension 256 then apply a ReLU activation function
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector of dimension 256 into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.001, batch_size=256, epochs=25)

We see a slightly better test accuracy of 98.03% when using a hidden size of 256, up from 97.79% when using a hidden size of 128.

##### Example 2: Hidden size 64

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Flatten(), # flatten input 2d images into a long vector
	tf.keras.layers.Dense(64, activation='relu'),	# apply a dense layer mapping an input vector into a vector of dimension 64 then apply a ReLU activation function
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector of dimension 64 into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.001, batch_size=256, epochs=25)

We see a slightly worse test accuracy of 97.10% when using a hidden size of 64, down from 97.79% when using a hidden size of 128.

While we only ran 3 tests, right now it appears that a larger hidden size of 256 is what we want to use for our model.

#### 2. Change the number of layers

* Generally, more layers means more network parameters, which in turn gives the network a greater learning capacity (i.e., a better network) if enough data is available. However, with too many parameters or not enough data the model will overfit (i.e., memorize the training data and be unable to generalize well to the testing data)

Let's look at a few examples where we change the number of layers

##### Example 1: Two hidden dense layers of hidden size 128

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Flatten(), # flatten input 2d images into a long vector
	tf.keras.layers.Dense(128, activation='relu'),	# apply a dense layer mapping an input vector into a vector of dimension 128 then apply a ReLU activation function
	tf.keras.layers.Dense(128, activation='relu'),	# apply a dense layer mapping an input vector of dimension 128 into another vector of dimension 128 then apply a ReLU activation function
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector of dimension 128 into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.001, batch_size=256, epochs=25)

We see slightly worse test accuracy of 97.40% when using two hidden dense layers with hidden size 128 down from 97.79% when using a single hidden size of 128.

##### Example 2: Use multiple hidden layers with different hidden sizes

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Flatten(), # flatten input 2d images into a long vector
	tf.keras.layers.Dense(64, activation='relu'),	# apply a dense layer mapping an input vector into a vector of dimension 64 then apply a ReLU activation function
	tf.keras.layers.Dense(128, activation='relu'),	# apply a dense layer mapping an input vector of dimension 64 into another vector of dimension 128 then apply a ReLU activation function
	tf.keras.layers.Dense(256, activation='relu'),	# apply a dense layer mapping an input vector of dimension 128 into another vector of dimension 256 then apply a ReLU activation function
	tf.keras.layers.Dense(32, activation='relu'),	# apply a dense layer mapping an input vector of dimension 256 into another vector of dimension 32 then apply a ReLU activation function
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector of dimension 32 into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.001, batch_size=256, epochs=25)

We see a slightly worse test accuracy of 97.25% with this network, down from 97.79% when using a single hidden layer of size 128.
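Since both hidden size and depth were described in terms of the number of parameters, it may help to count the parameters of each dense network tried so far. The sketch below does this with a hypothetical helper, `dense_stack_param_count`, and assumes every layer is a dense layer with a bias acting on the flattened 784-dimensional input.

```python
def dense_stack_param_count(layer_sizes):
    """Total weights and biases for consecutive dense layers, e.g. [784, 128, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

architectures = {
    'hidden 64': [784, 64, 10],
    'hidden 128': [784, 128, 10],
    'hidden 256': [784, 256, 10],
    'two hidden 128': [784, 128, 128, 10],
    '64-128-256-32': [784, 64, 128, 256, 32, 10],
}
param_counts = {name: dense_stack_param_count(sizes)
                for name, sizes in architectures.items()}
for name, count in param_counts.items():
    print(f'{name}: {count}')
```

Note that the deepest network here actually has slightly fewer parameters than the original single-hidden-layer network, because its first layer maps the 784 inputs to only 64 values. More layers do not automatically mean more parameters.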

#### 3. Change the layer types

Convolutional layers are by far the most common layer type used for Computer Vision and image processing tasks. We will not go into much detail here on what a convolution is and how it is implemented in a convolutional layer, but a search for either term will bring up lots of pages and videos to explain if you are interested. Let's check out a few examples using *Conv2D*
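For a first feel of the operation, the sketch below slides a single 3x3 kernel over a tiny made-up image with zero padding at the border, so the output keeps the input size (the behaviour requested by `padding='same'`). Like most deep learning libraries, it does not flip the kernel, so strictly speaking it is a cross-correlation. A *Conv2D* layer additionally learns the kernel values, adds a bias, and applies an activation; none of that is shown here, and the function name is hypothetical.

```python
def conv2d_same(image, kernel):
    """Slide a 3x3 kernel over a 2D image, treating pixels outside the image as 0."""
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        total += image[rr][cc] * kernel[dr + 1][dc + 1]
            output[r][c] = total
    return output

# Made-up 4x4 image with a vertical edge between the second and third columns
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
# Made-up kernel that responds to left-to-right increases in intensity
kernel = [[-1.0, 0.0, 1.0],
          [-1.0, 0.0, 1.0],
          [-1.0, 0.0, 1.0]]

feature_map = conv2d_same(image, kernel)
for row in feature_map:
    print(row)
```

The largest responses appear next to the vertical edge, which hints at why such kernels are useful for picking out strokes in handwritten digits. The negative values in the last column come from the zero padding at the right border.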

##### Example 1: Single Convolutional layer
Note that while we use *Conv2D* layers as hidden layers, we will still use the exact same final output *Dense* layer from before so that our network outputs tensors of the same dimension as our image labels

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Reshape((28, 28, 1)), # introduce final image channel dimension to reshape images from size 28x28 to 28x28x1 as Conv2D layers expect this final image channels dimension
	tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'), # 3x3 convolution creating images with 32 filter channels. That is, the output is a tensor of shape 28x28x32
	tf.keras.layers.Flatten(), # flatten images into a long vector
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.001, batch_size=256, epochs=25)

We see a slightly better test accuracy of 97.99% with this network, up from 97.79% with our original dense network that used a single hidden layer of size 128.
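It can be instructive to count where the parameters of this convolutional network live. The sketch below uses hypothetical helper functions and assumes the standard layout of one 3x3 kernel per input channel per filter, plus one bias per filter.

```python
def conv2d_param_count(kernel_size, in_channels, filters):
    """One kernel_size x kernel_size x in_channels kernel per filter, plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_param_count(input_size, output_size):
    """Weights in an input_size x output_size matrix plus one bias per output."""
    return input_size * output_size + output_size

conv_params = conv2d_param_count(kernel_size=3, in_channels=1, filters=32)
flattened_size = 28 * 28 * 32  # padding='same' keeps the 28x28 spatial size
dense_params = dense_param_count(flattened_size, 10)
total_params = conv_params + dense_params
print(conv_params, flattened_size, dense_params, total_params)
```

The convolutional layer itself is tiny; almost all of the parameters sit in the final dense layer, because flattening the 28x28x32 output produces a very long vector.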

##### Example 2: Multiple Convolutional Layers

Note that this network may take a while to train

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Reshape((28, 28, 1)), # introduce final image channel dimension to reshape images from size 28x28 to 28x28x1 as Conv2D layers expect this final image channels dimension
	tf.keras.layers.Conv2D(8, 3, padding='same', activation='relu'), # 3x3 convolution creating images with 8 filter channels. That is, the output is a tensor of shape 28x28x8
	tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'), # 3x3 convolution creating images with 32 filter channels. That is, the output is a tensor of shape 28x28x32
	tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'), # 3x3 convolution creating images with 32 filter channels. That is, the output is a tensor of shape 28x28x32
	tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'), # 3x3 convolution creating images with 32 filter channels. That is, the output is a tensor of shape 28x28x32
	tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'), # 3x3 convolution creating images with 16 filter channels. That is, the output is a tensor of shape 28x28x16
	tf.keras.layers.Flatten(), # flatten images into a long vector
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping the flattened input vector (of dimension 28x28x16=12544) into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.001, batch_size=256, epochs=25)

Altering these three items 
1. Change the hidden size
2. Change the number of layers
3. Change the layer types

is called **tuning** of the network. The optimal choices for these are usually very problem specific and require a great deal of trial and error to find. While we did not look at examples of altering the following items in this notebook, here are additional items that can be included in network tuning

4. Change the loss function
5. Change the optimizer


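Generally, after a bunch of trial and error during **tuning**, you will choose the network that produces the best results on data it was not trained on. (Strictly speaking, this comparison is best made on the validation data so that the testing data stays an unbiased final check.) From all of our networks above, which network produced the best testing accuracy?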

#### Hyperparameter Tuning

Hyperparameter tuning goes hand-in-hand with network tuning (usually these two tunings are done at the same time) and is another trial-and-error process. Hyperparameters are typically additional quantities defining your training process which you set up front. A few examples include:
1. Learning Rate
2. Batch size
3. Number of epochs

Here is one example training our original dense network with a larger learning rate, smaller batch size, and greater number of epochs. How do the results compare to the original?

In [None]:
tf.random.set_seed(1)
np.random.seed(1)

layers = [
	tf.keras.layers.Flatten(), # flatten input 2d images into a long vector
	tf.keras.layers.Dense(128, activation='relu'),	# apply a dense layer mapping an input vector into a vector of dimension 128 then apply a ReLU activation function
	tf.keras.layers.Dense(num_classes, activation='softmax')	# apply a dense layer mapping an input vector of dimension 128 into a vector of dimension 10 then apply a Softmax activation function
]

model = tf.keras.Sequential(layers)

test_results = train_and_test(model, (train_x, train_y), (test_x, test_y), (val_x, val_y), learning_rate=0.005, batch_size=128, epochs=50)
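Since this is a trial-and-error process, one common way to organize it is to list every combination of candidate values up front. The sketch below only builds that list; the candidate values are made up, and actually training a freshly created model for each combination (for example by passing the values as the `learning_rate`, `batch_size`, and `epochs` arguments used above) is left out.

```python
import itertools

# Made-up candidate values; a real search would pick ranges suited to the problem
learning_rates = [0.001, 0.005]
batch_sizes = [128, 256]
epoch_counts = [25, 50]

grid = [
    {'learning_rate': lr, 'batch_size': bs, 'epochs': ep}
    for lr, bs, ep in itertools.product(learning_rates, batch_sizes, epoch_counts)
]
print(len(grid))
for config in grid:
    print(config)
```

The number of combinations grows quickly with each added hyperparameter, so in practice only a handful of values per hyperparameter are usually tried.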

#### Documentation for Further Reference

* [Layer documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers) for all built-in available layers in TensorFlow. The most common layers include:
	1. Dense layers: *tf.keras.layers.Dense*
	2. Convolutional layers: *tf.keras.layers.Conv1D*, *tf.keras.layers.Conv2D*, etc.

* [Loss function documentation](https://www.tensorflow.org/api_docs/python/tf/keras/losses) for all built-in available loss functions in TensorFlow. The most common loss functions include:
	1. Mean squared error $||x - y||_2^2$
	2. Mean absolute error $||x-y||_1$
	3. Cross entropy (loss between probability distributions) $-\sum_{x\in \text{classes}} p_{\text{true}}(x) \log(p_{\text{predicted}}(x))$

* [Optimizer documentation](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) for all built-in available optimizers in TensorFlow. The most common optimizers include:
	1. Adaptive Moment Estimation (Adam)
	2. Stochastic Gradient Descent (SGD)

* [Metrics documentation](https://www.tensorflow.org/api_docs/python/tf/keras/metrics) for all built-in available metrics in TensorFlow. The most common metrics include:
	1. Mean squared error $||x - y||_2^2$
	2. Mean absolute error $||x-y||_1$
	3. Accuracy: In a classification problem, determines what % of the outputs have been categorized correctly

* [Callbacks documentation](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks) for all built-in available callbacks in TensorFlow. The most common callbacks include:
	1. EarlyStopping: terminates training when a monitored quantity (e.g., the validation loss) stops improving, which helps guard against overfitting
	2. ModelCheckpoint: saves the model weights during training

* Model documentation
	1. [Sequential models](https://www.tensorflow.org/guide/keras/sequential_model): The most basic stack of layers where each layer has exactly one input tensor and one output tensor
	2. [Functional models](https://www.tensorflow.org/guide/keras/functional_api): A more flexible model that can handle non-linear topologies, shared weights, and multi-input or multi-output layers
	3. [Subclassed models](https://www.tensorflow.org/guide/keras/making_new_layers_and_models_via_subclassing): The most flexible model that allows for custom layers or custom model weights
	4. [*model.compile* documentation](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile)
	5. [*model.fit* documentation](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit)

#### Check out the [lesson 2 workbook](./2-workbook.ipynb) for practice examples writing your own neural networks for Computer Vision