## LeNet-5
LeNet-5 is a convolutional neural network architecture published in 1998 and used primarily for handwritten digit recognition. Networks of this kind were deployed commercially to read handwritten digits on bank checks.

### Architecture
LeNet consists of 8 layers: the input layer, 3 convolutional layers, 2 average pooling layers, and 2 fully connected layers. By today's standards, this is a simple and straightforward architecture. Our implementation of LeNet's architecture using TensorFlow is below.

In [None]:
import tensorflow as tf
import numpy as np

#First layer is the input. Images were originally 28x28 but are padded by 2 pixels on each side to 32x32.
#This is the only layer to use any padding, so the size of the layers shrinks at every stage
X_inp = tf.placeholder(tf.float32, shape=(None, 32, 32, 1))
#Integer class labels, one per example
y_inp = tf.placeholder(tf.int32, shape=(None,))

#First convolutional layer. The arguments of the tf.layers.conv2d function that we use are:
#inputs: the input layer
#filters: the number of output maps generated
#kernel_size: the width and height of the kernel. In this network, the width and height are always the same,
#so we just specify one number
#strides: the vertical and horizontal stride, generally a tuple, but a single number means the same stride in both directions
#padding: "valid" = no padding, may ignore some rows and columns at the bottom and right of the image
#         "same" = zero padding, adds rows and columns as needed so that output size = ceil(input size / stride)
#activation: the activation function used by the layer

#Conv1 input size: 32 x 32 x 1, output size: 28 x 28 x 6
conv1 = tf.layers.conv2d(X_inp, filters=6, kernel_size=5,  strides=1, padding="valid", activation=tf.nn.relu) 
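
The layer sizes quoted in the comments follow from the padding rule. As a minimal sketch (the helper name `conv_output_size` is ours and is not part of TensorFlow), the output length along one spatial dimension is `ceil((n - kernel + 1) / stride)` for "valid" padding and `ceil(n / stride)` for "same" padding. Tracing the 32-pixel input through the convolution and pooling layers used in this notebook should reproduce the sizes given in the comments:

```python
import math

def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one spatial dimension of a conv or pooling layer."""
    if padding == "valid":
        # No padding: the window must fit entirely inside the input.
        return math.ceil((n - kernel + 1) / stride)
    if padding == "same":
        # Zero padding is added so that only the stride shrinks the output.
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")

# (kernel, stride) for conv1, avg_pool, conv2, avg_pool2, conv3
layers = [(5, 1), (2, 2), (5, 1), (2, 2), (5, 1)]

size = 32
trace = [size]
for kernel, stride in layers:
    size = conv_output_size(size, kernel, stride, "valid")
    trace.append(size)

print(trace)  # [32, 28, 14, 10, 5, 1]
```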


### Average Pooling
LeNet uses average pooling as opposed to max pooling. The idea is the same, except that each output is the average of the values within the pool rather than the maximum. This is equivalent to applying, to each channel separately, a kernel whose entries are all 1 / (pool_width * pool_height), using the pooling stride.
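
To make the equivalence concrete, here is a small NumPy sketch (not part of the original notebook, and assuming a single channel with even height and width). It computes 2x2 average pooling with stride 2 in two ways: by averaging non-overlapping blocks, and by applying a 2x2 kernel whose entries are all 1/4.

```python
import numpy as np

def average_pool_2x2(image):
    """Average-pool a 2D array with a 2x2 window and stride 2."""
    h, w = image.shape
    # Group pixels into non-overlapping 2x2 blocks, then average each block.
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def uniform_kernel_2x2(image):
    """Same computation as a 2x2 kernel with all entries 1/4, applied with stride 2."""
    kernel = np.full((2, 2), 1.0 / (2 * 2))
    h, w = image.shape
    out = np.empty((h // 2, w // 2))
    for i in range(h // 2):
        for j in range(w // 2):
            window = image[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
pooled = average_pool_2x2(image)
print(pooled)
# [[ 2.5  4.5]
#  [10.5 12.5]]
print(np.allclose(pooled, uniform_kernel_2x2(image)))  # True
```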

In [None]:
#average_pooling2d args: 
#input: input layer
#pool_size: width, height of pool size, single number means same width + height
#strides: horiz and vertical stride, 1 number = same width + height

#avg_pool input size: 28 x 28 x 6, output size: 14 x 14 x 6
#pool size of (2,2) and stride size of (2,2) halves the dimensions of the previous layer
avg_pool = tf.layers.average_pooling2d(conv1, pool_size=2, strides=2)

#Conv2 input size: 14 x 14 x 6, output size: 10 x 10 x 16
#with "valid" padding, a 5x5 kernel and stride 1, the kernel fits in 14 - 5 + 1 = 10 positions
#along each row and column
conv2 = tf.layers.conv2d(avg_pool, filters=16, kernel_size = 5, strides=1,
                       padding="valid", activation = tf.nn.relu)
#avg_pool2 input size: 10 x 10 x 16, output size: 5 x 5 x 16
avg_pool2 = tf.layers.average_pooling2d(conv2, pool_size = 2, strides=2)

#Conv3 input size: 5 x 5 x 16, output size: 1 x 1 x 120
#Each of the 120 filters spans the entire 5 x 5 x 16 input, so each filter produces a single 1 x 1 output
conv3 = tf.layers.conv2d(avg_pool2, filters=120, kernel_size = 5, strides=1,
                      padding="valid", activation = tf.nn.relu)

#Flatten the 1 x 1 x 120 convolutional output to shape (batch_size, 120) for use in the fully connected layer
flat = tf.reshape(conv3, [-1, 120])

#fully connected dense layer. size: 120x84
dense = tf.layers.dense(inputs=flat, units=84, activation=tf.nn.relu)

#output layer: inputs to softmax. size: 84 x 10
logits = tf.layers.dense(dense, units=10)

softmax = tf.nn.softmax(logits)
predict = tf.argmax(softmax, axis=1)

y_labels = tf.one_hot(y_inp, 10)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=y_labels)
#loss function is the mean cross entropy for softmax applied to the output layer
loss = tf.reduce_mean(cross_entropy)
#use adam optimizer
opt = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = opt.minimize(loss)
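
For readers who want to see what the loss computes, the following NumPy sketch reproduces the mean softmax cross entropy for integer labels. It is an illustration only, not the TensorFlow implementation, and the function name `mean_softmax_cross_entropy` is ours.

```python
import numpy as np

def mean_softmax_cross_entropy(logits, labels):
    """Mean over the batch of -log softmax(logits)[true class]."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Subtract the row maximum for numerical stability; softmax is unchanged by this shift.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # With a one-hot target, cross entropy is minus the log-probability of the true class.
    per_example = -log_probs[np.arange(len(labels)), labels]
    return per_example.mean()

# With all-zero logits every class gets probability 1/10, so the loss is ln(10).
logits = np.zeros((2, 10))
loss = mean_softmax_cross_entropy(logits, [3, 7])
print(loss, np.log(10))
```

Because softmax is monotonic, taking the argmax of the logits gives the same predicted class as taking the argmax of the softmax output.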

### Data Augmentation
To make the training data more complete and make the CNN robust to minor variations in the input, we augment the training data with random rotations, shears, and shifts, using the Keras preprocessing library (a random zoom step is present in the code but commented out). We do this as an optional first stage in our training pipeline, where the user can enter parameters that control which transformations are applied and how they are applied.

In [None]:
# Method of the data augmentation transformer (class definition not shown)
def fit_transform(self, X, y):
        # print("before TF, shapes = %s, %s" % (X.shape, y.shape))
        X_aug = []
        y_aug = []

        for x, label in zip(X, y):
            # include originals
            X_aug.append(x)
            y_aug.append(label)

            # random rotations up to -60/60 degrees
            for degree in range(*self.rotation):
                rotated = tf.contrib.keras.preprocessing.image.random_rotation(
                            x, degree, row_axis=0, col_axis=1, channel_axis=2)
                X_aug.append(rotated)
                y_aug.append(label)


                if self.rot_and_shear:
                    # random shears up to 40% intensity        
                    for sh in range(*self.shear):
                        sh /= 10
                        sheared = tf.contrib.keras.preprocessing.image.random_shear(
                                rotated, sh, row_axis=0, col_axis=1, channel_axis=2)
                        X_aug.append(sheared)
                        y_aug.append(label)

            # random shifts 20% left, right, up or down
            for xsh in range(*self.x_shifts):
                xsh /= 10
                for ysh in range(*self.y_shifts):
                    ysh /= 10
                    shifted = tf.contrib.keras.preprocessing.image.random_shift(
                            x, xsh, ysh, row_axis=0, col_axis=1, channel_axis=2)
                    X_aug.append(shifted)
                    y_aug.append(label)

            # random zoom up to 20%
            # zoomed = tf.contrib.keras.preprocessing.image.random_zoom(
            #         x, (0.9, 1.0), row_axis=0, col_axis=1, channel_axis=2)
            # X_aug.append(zoomed)
            # y_aug.append(label)

        X_aug = np.array(X_aug)
        y_aug = np.array(y_aug)

        # print("after TF, shapes = %s, %s" % (X_aug.shape, y_aug.shape))
        return (X_aug, y_aug)
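
Since every input image produces several augmented copies, it helps to estimate how much the training set grows. The sketch below mirrors the loop structure of `fit_transform` above. The helper `augmented_count` is ours, and the shift ranges in the example are hypothetical placeholders, since the notebook does not show their defaults.

```python
def augmented_count(rotation, shear, rot_and_shear, x_shifts, y_shifts):
    """Number of images emitted per input image by the loops in fit_transform."""
    n_rot = len(range(*rotation))
    n_shear = len(range(*shear))
    n_shift = len(range(*x_shifts)) * len(range(*y_shifts))

    count = 1                      # the original image
    count += n_rot                 # one rotated copy per rotation value
    if rot_and_shear:
        count += n_rot * n_shear   # each rotated copy is sheared once per shear value
    count += n_shift               # one shifted copy per (x, y) shift pair
    return count

# Rotation and shear ranges are taken from the grid search in this notebook;
# the shift ranges [-2, 3, 2] are assumed values for illustration only.
with_shear = augmented_count([-60, 60, 10], [0, 50, 10], True, [-2, 3, 2], [-2, 3, 2])
without_shear = augmented_count([-60, 60, 10], [0, 50, 10], False, [-2, 3, 2], [-2, 3, 2])
print(with_shear, without_shear)  # 82 22
```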



### LeNet Class as SkLearn Estimator + Pipeline
We wrote a class that wraps LeNet as a scikit-learn estimator, and ran a grid search over it together with the data augmentation transformer to find a good combination of hyperparameters.
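
The wrapper class itself is not shown here, so as a rough illustration of the interface that a grid search relies on, the sketch below wraps a toy nearest-centroid classifier instead of the CNN. The class `CentroidClassifier` and its `shrink` hyperparameter are invented for this example. The parts that carry over to a real wrapper are an `__init__` that only stores hyperparameters, a `fit` that returns `self`, and a `predict` method; `ClassifierMixin` then supplies `score` as accuracy.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

class CentroidClassifier(ClassifierMixin, BaseEstimator):
    """Toy stand-in for a CNN wrapper: predicts the class with the nearest mean."""

    def __init__(self, shrink=0.0):
        # Only store hyperparameters here, unmodified, so that
        # get_params/set_params and cloning work inside GridSearchCV.
        self.shrink = shrink

    def fit(self, X, y):
        X = np.asarray(X, dtype=float).reshape(len(X), -1)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        means = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        # Pull each class mean toward the overall mean by the `shrink` fraction.
        overall = X.mean(axis=0)
        self.centroids_ = overall + (1.0 - self.shrink) * (means - overall)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float).reshape(len(X), -1)
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]

# Two well-separated synthetic classes with 16 features each.
rng = np.random.RandomState(0)
X_toy = np.concatenate([rng.normal(0.0, 0.1, (20, 16)), rng.normal(1.0, 0.1, (20, 16))])
y_toy = np.array([0] * 20 + [1] * 20)

search = GridSearchCV(CentroidClassifier(), {"shrink": [0.0, 0.5]}, cv=2)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```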

In [None]:
import tensorflow as tf

import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
import pandas as pd

from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from ext_test import kaggle_test

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


##LOAD INPUT
X = np.load("x_mnist1000.npy")
X = X.reshape((-1, 28, 28, 1))
y = np.load("y_mnist1000.npy")

np.random.seed(1)

indices = np.random.permutation(len(X))

valid_ind = indices[:100]
test_ind = indices[100:200]
train_ind = indices[200:]

X_train = X[train_ind]
y_train = y[train_ind]

X_validation = X[valid_ind]
y_validation = y[valid_ind]

X_test = X[test_ind]
y_test = y[test_ind]



default_activations = [
    "conv1",
    "avg_pool",
    "conv2",
    "avg_pool2",
    "conv3",
    "dense",
]

elu = {x: tf.nn.elu for x in default_activations}
selu = {x: tf.nn.selu for x in default_activations}
relu = {x: tf.nn.relu for x in default_activations}
tanh= {x: tf.nn.tanh for x in default_activations}

param_grid = {
    "aug__rotation": [[-60, 60, 10], [-30, 30, 10]],
    "aug__shear": [[0, 50, 10], [0, 20, 10]],
    "aug__rot_and_shear": [True, False],
    "clf__activations": [selu, elu, tanh],
    "clf__dropout_rate": [0.25, 0.35, 0.45],
    "clf__batch_size": [500],
    "clf__using_da": [True],
}

da = DataAugment()
clf = CNNClassifier()

pipe = Pipeline(
    steps=[
        ("aug", da),
        ("clf", clf),
    ]
)

now = datetime.now().strftime("%H-%M-%S")
log_name = "./logs/cnn_log_" + now + ".log"
with open(log_name, "w") as log:
    log.write("Param grid:\n")
    for p in param_grid:
        log.write("%s: %s\n" % (p, param_grid[p]))
 
    gs = LoggingGridSearch(pipe, param_grid=param_grid)
    gs.fit(X_train, y_train)
    best_msg = "best score: %s, params: %s" % (gs.best_score_, gs.best_estimator_.get_params())
    print(best_msg)
    # End each log entry with a newline so the messages do not run together
    log.write(best_msg + "\n")

    test_msg = "Test accuracy: %s" % gs.best_estimator_.score(X_test, y_test)
    print(test_msg)
    log.write(test_msg + "\n")
    
df = pd.DataFrame(gs.cv_results_)
df.to_csv("./gs_results/" + now + ".csv")
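
A quick way to gauge the cost of this search is to count the parameter combinations in the grid. The sketch below copies `param_grid` from above, with plain strings standing in for the activation dictionaries, and also shows how the `step__parameter` naming routes each entry to a pipeline step.

```python
from itertools import product

grid = {
    "aug__rotation": [[-60, 60, 10], [-30, 30, 10]],
    "aug__shear": [[0, 50, 10], [0, 20, 10]],
    "aug__rot_and_shear": [True, False],
    "clf__activations": ["selu", "elu", "tanh"],  # names stand in for the activation dicts
    "clf__dropout_rate": [0.25, 0.35, 0.45],
    "clf__batch_size": [500],
    "clf__using_da": [True],
}

# Every combination of one value per parameter is one candidate configuration.
combinations = list(product(*grid.values()))
print(len(combinations))  # 72

# The prefix before "__" names the pipeline step that receives the parameter.
by_step = {}
for name in grid:
    step, param = name.split("__", 1)
    by_step.setdefault(step, []).append(param)
print(by_step)
```

Each candidate is trained once per cross-validation fold, so the total number of model fits is 72 times the number of folds, plus one final refit on the full training set if refitting is enabled.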


### Best Hyperparameters + Performance Plots