In [1]:
#Populating the interactive namespace from numpy and matplotlib
%pylab inline
import pandas as pd
import numpy as np
import scipy

Populating the interactive namespace from numpy and matplotlib


---
# Classifying Noisy Digits - Comp 598 Project 3 Report

#### Team: Duck Duck Duck

#### Stuart Spence, Josh Romoff, Charlie Bloomfield

---
## Introduction

Image classification is an area of computer science that borrows techniques from machine learning, computer vision, and data science to categorize images. It involves several tasks, including image processing, feature extraction, and feature learning.

In this report, we present our methods for classifying a modified [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Each of the images in the original MNIST dataset has been modified by several transformations: embossing, rotation, scaling, and a texture overlay. We explore several data-preprocessing techniques and several training models in search of the best way to classify these images, and present a fine-tuned Convolutional Neural Network as the best model for classifying them.

---
## Data Preprocessing Methods

#### Augmenting the dataset

A common mantra in machine learning is that more data beats a better algorithm. In this project, we were presented with 50,000 processed images to learn from. While this is a sizeable dataset, we suspected that adding to it might improve the performance of our learners.

Our first approach to augmenting the dataset involved rotating each image in the provided dataset by 0, 90, 180, and 270 degrees. We suspected that adding these rotations could help our Neural Network become more robust to rotational variance between images. Image rotations can be performed easily using [scipy's rotate](http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.ndimage.interpolation.rotate.html) function, which rotates an array in the plane defined by two of its axes.
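
A minimal sketch of this augmentation, assuming the images are stored as a square `(n, height, width)` NumPy array. It uses `np.rot90` instead of scipy's `rotate`, which is sufficient for multiples of 90 degrees; `augment_with_rotations` and the stand-in data are hypothetical names, not the code used for the project.

```python
import numpy as np

def augment_with_rotations(images, labels):
    """Return the dataset extended with 90, 180 and 270 degree rotations.

    images: array of shape (n, height, width) with height == width.
    labels: array of shape (n,).
    """
    # k counts quarter turns; k=0 keeps the original images.
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    # Every rotated copy keeps the label of the image it came from.
    return np.concatenate(rotated), np.tile(labels, 4)

# Tiny stand-in for the real 48 x 48 images (hypothetical data).
images = np.arange(2 * 48 * 48, dtype=np.float32).reshape(2, 48, 48)
labels = np.array([3, 7])
aug_images, aug_labels = augment_with_rotations(images, labels)
```

The augmented set is four times the size of the input, which is worth keeping in mind given the memory limits discussed in this section.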

---
Our second approach for augmenting the dataset involved downloading the original MNIST data and processing it in a way similar to the images provided. We knew the transformations that the provided images had been processed with, so transforming MNIST images in a similar fashion was rather simple. MNIST images have a 28 x 28 pixel resolution, so we first rescaled the images to match the provided dataset's 48 x 48 resolution using [scipy's imresize](http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.misc.imresize.html) function and then used [scipy's imfilter](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.misc.imfilter.html) function to emboss each image. We could then rotate images in the same way mentioned previously.

Unfortunately, we were not able to leverage the processed MNIST dataset, as it increased the raw dataset size to more than 10GB. At this size, the machine on which we were running our learners could not hold the entire dataset in memory. Had we initially structured our model training so that the entire dataset did not need to be in memory at once, we could have used these images, and we expect that we would have seen at least marginal improvements in the classification performance of our Neural Network.

#### Trimming Images

Another method we applied in data preprocessing involved trimming the edges of the images before passing them as input to the classifiers. We suspected that the edges added meaningless noise to our feature set, slowing down our learning times and causing the learners to overfit. This method proved to be effective, improving the accuracy of our Neural Network by over half a percentage point.
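
As an illustration, trimming can be done with a single slice. The sketch below assumes a symmetric margin of 5 pixels per side, which is what turns the 48 x 48 images into the 38 x 38 inputs used by our Convolutional Neural Network; `trim_edges` is a hypothetical helper, and the exact crop used in the project may differ.

```python
import numpy as np

def trim_edges(images, margin):
    """Remove `margin` pixels from every side of each image in a batch.

    images: array of shape (n, height, width).
    """
    if margin == 0:
        return images
    return images[:, margin:-margin, margin:-margin]

# Stand-in batch of four blank 48 x 48 images (hypothetical data).
batch = np.zeros((4, 48, 48), dtype=np.float32)
trimmed = trim_edges(batch, margin=5)  # 48 - 2 * 5 = 38 pixels per side
```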

## Feature Design Methods

We used a simple normalization on the pixel values: subtracting the mean and dividing by the standard deviation.
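
A minimal sketch of this normalization, assuming a single global mean and standard deviation computed over all training pixels (a per-pixel variant would pass `axis=0` instead). The helper names and the small `eps` guard are illustrative assumptions.

```python
import numpy as np

def fit_normalizer(train_pixels):
    """Compute the statistics on the training set only."""
    return train_pixels.mean(), train_pixels.std()

def normalize(pixels, mean, std, eps=1e-8):
    """Subtract the mean and divide by the standard deviation."""
    # eps guards against division by zero for constant inputs.
    return (pixels - mean) / (std + eps)

# Hypothetical stand-in data: 100 flattened 48 x 48 images.
rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(100, 2304))
mean, std = fit_normalizer(train)
train_normalized = normalize(train, mean, std)
```

Reusing the training statistics for validation and test images keeps all three sets on the same scale.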

-----
## Algorithms Used

---
#### Logistic Regression

Logistic Regression is a statistical method that can be used for binary classification (classification of data that can be in one of two classes). It is based on the log-odds function, and uses it to compute the likelihood that a test instance is part of a given class.

The log odds ratio is defined as follows:
$$\ln(\frac{p}{1 - p})$$

For each feature, we can use the log odds ratio to determine a probabilistic relationship between the feature and the instance's class. This is a commonly used technique for feature selection, as it provides a way to filter features that show little correlation to any of the output classes.

For the task of classification, one can solve for the set of weights that maximizes the likelihood of observing the labels in the training set. With these weights, one can find the probability that a new instance is part of a given class with the following expression:

$$p_{W}(x) = \frac{1}{1 + e ^ {-W ^ {T} x}}$$

One drawback of Logistic Regression for the problem at hand is that it only classifies binary data. It can be adapted to a multi-class problem by applying binary classification for each class versus the group of all other classes, referred to as one-versus-all classification. The likelihood of each class versus the group can then be compared, and we can choose the class that has the highest probability for the given inputs. Unfortunately, one-versus-all is a costly classification technique, and does not scale well to datasets with many classes. For the digit dataset at hand, we need to generate ten sets of weights, one for each class of digit.
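
The prediction step of one-versus-all classification can be sketched as follows, assuming the per-class weight vectors have already been fitted. The weights below are made-up values for three classes, not results from our experiments.

```python
import numpy as np

def sigmoid(z):
    """The logistic function p_W(x) = 1 / (1 + exp(-W^T x))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(X, weights):
    """Pick the class whose binary classifier is most confident.

    X: (n, d) feature matrix; weights: (k, d), one row per class.
    """
    probabilities = sigmoid(X @ weights.T)  # shape (n, k)
    return probabilities.argmax(axis=1)

# Hypothetical weights for 3 classes and 2 features.
weights = np.array([[ 2.0, -1.0],
                    [-1.0,  2.0],
                    [-1.0, -1.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
predictions = predict_one_vs_all(X, weights)
```

For the digit dataset, `weights` would have ten rows, which is where the tenfold training cost comes from.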

---
##### Results

We used [scikit-learn's logistic regression classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). This classifier provides many hyperparameters for model tuning, and integrates with the [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [GridSearch](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) classes provided by scikit-learn, which let us easily explore different hyperparameter and feature selection combinations to isolate which combinations provide the best classification accuracy. 

We explore two of the provided parameters: the regularization strength and the penalization norm used by the cost function. The following table shows the best results for a grid search over a large set of C values with 10-fold cross validation.

| C    | L1 penalty | L2 penalty |
|------|------------|------------|
| 1.0  | 19.6%      | 19.6%      |
| 2.0  | 18.8%      | 19.8%      |
| 10.0 | 17.8%      | 19.8%      |


As shown, Logistic Regression performed poorly in the domain of transformed image classification. Furthermore, the variation among the classification percentages shows little correlation with changes in the model's hyperparameters. Nonetheless, it provided us with a baseline against which we could compare our other models.

---
#### Linear Support Vector Machine (SVM)

Linear SVM classification is another statistical technique that can be used for binary classification. SVMs classify data points by finding a hyperplane in the dataset's feature space that optimizes a conceptually simple constraint problem.

The approach involves finding the hyperplane that **maximizes the minimum geometric margin** of the given feature set, where the geometric margin is defined as follows:

$$\gamma = y_{i} \Big[\big(\frac{W}{||W||}\big) ^ {T} x_{i}\Big]$$

$$\text{where } y_{i} \in \{-1, 1\}$$

This expression can be optimized by solving the maximization problem:

$$\max_{\gamma, W} \gamma$$

Subject to the constraints:

$$y_{i}(W ^ {T} x_{i}) \ge \gamma$$

$$||W|| = 1$$

---
Unfortunately, this problem is hard to optimize directly because the second of the two constraints is not convex. However, it can be rewritten in the following form:

$$\min_{W} \Big(\frac{1}{2}||W||^{2}\Big)$$

Subject to the constraint:

$$y_{i}W^{T}x_{i} \ge 1$$


In this form, we have an expression that can be solved using standard optimization techniques, namely quadratic optimization. However, we must consider how to fit the hyperplane when the data is not perfectly linearly separable. Many datasets are simply not linearly separable. More importantly, we often do not want a hyperplane that perfectly separates our training data, but instead one that generalizes well to data not in our dataset (i.e., does not overfit). To accomplish this goal, we can add a cost term to the optimization problem that penalizes points violating the margin.

This results in the following optimization function:

$$\min_{W, \xi} \Big[\frac{1}{2}||W||^{2} + C \sum_{i=1}^{m} \xi_{i}\Big]$$

Subject to the constraints:

$$y_{i}W^{T}x_{i} \ge 1 - \xi_{i}$$

$$\xi_{i} \ge 0$$
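
To make the objective concrete, the sketch below evaluates it for a fixed weight vector. For a given $W$, the smallest slack satisfying both constraints is $\xi_{i} = \max(0, 1 - y_{i}W^{T}x_{i})$. The data and weights are made-up values, and, as in the formulas above, no bias term is used.

```python
import numpy as np

def soft_margin_objective(W, X, y, C):
    """Evaluate 0.5 * ||W||^2 + C * sum(xi) for a fixed W.

    X: (m, d) examples; y: (m,) labels in {-1, 1}.
    Returns the objective value and the per-example slack xi.
    """
    margins = y * (X @ W)
    # Smallest xi_i with y_i W^T x_i >= 1 - xi_i and xi_i >= 0.
    slack = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(W, W) + C * slack.sum(), slack

# Hypothetical example: three positive points, one on the wrong side.
W = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0], [0.5, 1.0], [-1.0, 0.0]])
y = np.array([1, 1, 1])
objective, slack = soft_margin_objective(W, X, y, C=10.0)
```

The first point lies outside the margin and costs nothing, the second lies inside the margin, and the third is misclassified and contributes the most slack. A larger C makes such violations more expensive relative to the margin width.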

---
As with logistic regression, SVMs are designed to classify binary data, and can be adapted to a multi-class setting by using one-versus-all classification. Again, this is a costly operation, as we need to find a decision boundary for each of the ten classes of digits.

It is worth noting here that this discussion did not consider non-linear decision boundaries. The methods discussed above can be adapted to other types of decision boundaries, such as polynomials.


---
##### Results

We used [scikit-learn's support vector classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). This classifier has built-in support for tuning the cost constant C defined in the optimization function above. The following table shows the classification accuracy of the linear SVC for different values of C.

|C    |Accuracy   |
|-----|-----|
|1.0| 21.2%|
|10.0|23.4%|
|100.0|20.8%|
|1000.0|20.6%|

This is only a slight improvement over the performance of the Logistic Regression classifier. As with Logistic Regression, the variation in accuracy across the cross-validation folds shows little correlation between the hyperparameter's value and the prediction output.

---
#### Full Implementation for Part 3

Part 3 requires a fully connected neural network with backpropagation. We also implemented mini-batch and a dynamic learning rate.

For mini-batch, instead of performing weight adjustments after every backpropagation, we sum the corrections and defer their application until a fixed number of trials (the batch size) has been processed. This saves us some processing time because weight adjustments are very computation-heavy operations (discussed later).
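
The deferral can be sketched as follows. `BatchedUpdater` is a hypothetical class written for illustration, not the code in our implementation; it assumes the corrections are gradients that are subtracted from the weights.

```python
import numpy as np

class BatchedUpdater:
    """Accumulate corrections and apply them once per batch."""

    def __init__(self, weights, batch_size):
        self.weights = weights
        self.batch_size = batch_size
        self.pending = np.zeros_like(weights)
        self.count = 0

    def add_correction(self, correction, learning_rate):
        # Summing is cheap; the weight update itself is deferred.
        self.pending += correction
        self.count += 1
        if self.count == self.batch_size:
            self.weights -= learning_rate * self.pending
            self.pending[:] = 0.0
            self.count = 0

# Hypothetical 2 x 2 weight matrix with a batch size of 3.
updater = BatchedUpdater(np.zeros((2, 2)), batch_size=3)
correction = np.ones((2, 2))
updater.add_correction(correction, learning_rate=0.1)
updater.add_correction(correction, learning_rate=0.1)
weights_after_two = updater.weights.copy()    # still unchanged
updater.add_correction(correction, learning_rate=0.1)
weights_after_three = updater.weights.copy()  # one combined update
```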

For the dynamic learning rate in part 3, we simply did a linear interpolation. There is a starting learning rate, and a final learning rate. Given what percent done we are in processing our samples, we gradually approach the final (usually smaller) learning rate.
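
A minimal sketch of this interpolation, using the default start and final rates quoted later in this report (0.03 and 0.003); the function name is a hypothetical one chosen for illustration.

```python
def interpolated_learning_rate(start_rate, final_rate, trials_done, total_trials):
    """Linearly move from start_rate to final_rate as training progresses."""
    fraction_done = trials_done / float(total_trials)
    return start_rate + (final_rate - start_rate) * fraction_done

# Learning rate at the start, middle and end of a 50,000-trial session.
rate_start = interpolated_learning_rate(0.03, 0.003, 0, 50000)
rate_middle = interpolated_learning_rate(0.03, 0.003, 25000, 50000)
rate_end = interpolated_learning_rate(0.03, 0.003, 50000, 50000)
```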

---
#### Convolutional Neural Network
The Convolutional Neural Network (CNN) was inspired by how the human brain processes images. That is, the brain has multiple layers of cells that each respond to certain orientations and frequencies in the image. One of the most famous examples of a CNN is "LeNet-5", which was very successful in classifying digits. Our model is based on "LeNet-5" with some slight modifications.

All of our models use the following layers:

- Input Layer: Each image of dimension 38 x 38 after clipping.

- Convolutional Layer: Convolves a given number of filters (found by grid search) of a given size ((3,3) for every model) on the image.

- Max Pooling Layer: Downsamples by a given factor (2 for every model) along both the x and y axes.

- Dropout Layer: Randomly drops a certain percentage (found by grid search) of the layer's connections.

- Dense Layer: A certain number of sigmoid neurons (256 for every model)

- Output Layer: Layer that maps Dense layer to the 10 classes using Softmax
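
To illustrate what the pooling and dropout layers compute, here is a NumPy sketch for a single feature map. It is not the Lasagne implementation: it assumes non-overlapping 2 x 2 pooling windows that discard an incomplete border, and "inverted" dropout that rescales the surviving activations at training time.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by 2 along both axes."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Group pixels into 2 x 2 blocks and keep the maximum of each block.
    blocks = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

def dropout(activations, p, rng):
    """Zero each activation with probability p and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

pooled = max_pool_2x2(np.arange(16).reshape(4, 4))
dropped = dropout(np.ones((100, 100)), p=0.5, rng=np.random.default_rng(0))
```

The rescaling by `1 / (1 - p)` keeps the expected activation unchanged, so nothing needs to be adjusted when dropout is switched off at prediction time.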

We used the following models:

- Model 1: Input->Conv->Pool->Drop->Conv->Pool->Drop->Dense->Drop->Dense->Out 
- Model 2: Input->Conv->Conv->Pool->Drop->Conv->Conv->Pool->Drop->Dense->Drop->Dense->Out 
- Model 3: Input->Conv->Conv->Conv->Pool->Drop->Conv->Conv->Conv->Pool->Drop->Dense->Drop->Dense->Out 
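
The size of the feature maps that reach the first Dense layer follows from simple arithmetic. The sketch below works it out for Model 1, assuming 3 x 3 filters with a padding of 1 and a stride of 1 (so a convolution preserves the spatial size), pooling that discards an incomplete border, and 64 filters in the second convolutional layer. The helper names are hypothetical.

```python
def conv_output_size(size, filter_size=3, pad=1, stride=1):
    """Spatial size after a convolution."""
    return (size + 2 * pad - filter_size) // stride + 1

def pool_output_size(size, pool_size=2):
    """Spatial size after non-overlapping pooling (border discarded)."""
    return size // pool_size

def model1_feature_count(input_size=38, last_num_filters=64):
    size = input_size
    for _ in range(2):  # Model 1 has two Conv -> Pool stages
        size = conv_output_size(size)
        size = pool_output_size(size)
    return size, last_num_filters * size * size

# 38 -> 38 -> 19 -> 19 -> 9, then 64 * 9 * 9 values are flattened.
final_size, flattened = model1_feature_count()
```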

Using the nolearn library made designing the convnet simple! Here is the definition of Model 1:

In [2]:
from lasagne.layers import DenseLayer
from lasagne.layers import InputLayer
from lasagne.layers import DropoutLayer
from lasagne.layers import Conv2DLayer
from lasagne.layers import MaxPool2DLayer
from lasagne.nonlinearities import softmax
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

# Model 1: Input->Conv->Pool->Drop->Conv->Pool->Drop->Dense->Drop->Dense->Out
convnet = NeuralNet(
    layers=[
        # 38 x 38 single-channel images; None leaves the batch size unspecified
        (InputLayer, {'shape': (None, 1, 38, 38)}),
        (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
        (MaxPool2DLayer, {'pool_size': (2, 2)}),
        (DropoutLayer, {'p': .5}),
        (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
        (MaxPool2DLayer, {'pool_size': (2, 2)}),
        (DropoutLayer, {'p': .5}),
        (DenseLayer, {'num_units': 256}),
        (DropoutLayer, {'p': .5}),
        (DenseLayer, {'num_units': 256}),
        # Softmax output over the 10 digit classes
        (DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
    ],
    update_learning_rate=0.005,
    update_momentum=0.9,
    max_epochs=200,
)



## Optimization

#### Python Neural Network Full Implementation
For part 3 of the assignment, we used numpy to perform matrix and vector operations. This is a far better alternative to iterating in pure Python, because numpy runs these operations in compiled code rather than in the Python interpreter. We expect this gave our custom neural network implementation a dramatic performance improvement; however, we did not measure the improvement ourselves, because this is a well-documented aspect of Python and numpy.

Our Python implementation badly needed (and still needs more) optimization. We concluded this after we implemented a simple PerformanceTimer class (found in utilities.py) to monitor time usage for various parts of our neural network computations. In order of most time consuming:

<pre>ADJUST WEIGHTS 166.40998 ms (401 times)
FORWARD 30.38259 ms (201 times)
START 18.95308 ms (2 times)
CORRECTIONS 0.38668 ms (401 times)
NP.ARRAY DESIRED 0.01468 ms (201 times)
OUTPUTS MINUS DESIRED 0.00732 ms (201 times)
OVERHEAD 0.06398 ms (200 times)</pre>

We learned that matrix addition and subtraction did not cost much, forward passes cost some, and the weight adjustment at each back propagation is very expensive. We decided to implement the mini-batch strategy. Unfortunately, this only defers a matrix times a constant operation, and a matrix minus a matrix operation. The costly matrix dot operation to calculate the corrections according to each output of each forward pass still must be done, even in the mini-batch system.
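
A timer of this kind can be very small. The sketch below is a hypothetical stand-in, not the PerformanceTimer class from utilities.py: it accumulates the total time and call count per label and reports the labels from most to least time consuming.

```python
import time
from collections import defaultdict

class SimpleTimer:
    """Accumulate wall-clock time per labelled section."""

    def __init__(self):
        self.total_seconds = defaultdict(float)
        self.calls = defaultdict(int)

    def measure(self, label, function, *args):
        """Run function(*args), charging its duration to `label`."""
        start = time.perf_counter()
        result = function(*args)
        self.total_seconds[label] += time.perf_counter() - start
        self.calls[label] += 1
        return result

    def report(self):
        """Return one line per label, most time consuming first."""
        ordered = sorted(self.total_seconds.items(),
                         key=lambda item: item[1], reverse=True)
        return ["%s %.5f ms total (%d times)"
                % (label, 1000.0 * seconds, self.calls[label])
                for label, seconds in ordered]

timer = SimpleTimer()
results = [timer.measure("SUM", sum, [1, 2, 3]) for _ in range(3)]
lines = timer.report()
```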

Results vary greatly depending on the size of the neural network. But on larger networks and without mini-batch, on a fairly powerful cloud computing instance (Digital Ocean 4 core 8GB RAM) we could only process on the order of 30 training items per second. With mini-batch we generally saw a 30% improvement, to about 40. It's possible that our specific implementation of the neural net did not allow much performance increase with mini-batch.

Full performance results for all trials can be seen <a href="https://drive.google.com/file/d/0B2lkEEFpyGuBWHU2Y0dmalhZbE0">here</a>.

#### Convolutional Neural Network
The implementation was done using the following libraries: "nolearn", "lasagne", and "theano". We used CUDA, and also signed up for cuDNN from NVIDIA to increase convolution speeds.

## Hyperparameter Selection

#### Python Neural Network Full Implementation
For the custom implementation we wanted to raise its performance in lieu of developing more sophisticated neural network techniques from scratch (like a convolutional neural network). We considered many variations of the hyperparameters of: hidden layer count, hidden layer size, batch size, learning rate, and dynamic learning rates. In summary, by far the most significant hyperparameter was simply time spent in training, or trial count. Given the slow performance of our custom built neural network (compared to the scikit learn model used by our group) we were not able to adequately explore hyperparameters. Ideally, we'd have used orders of magnitude greater numbers of trials and neuron counts. As a result, the following results are more a proof of concept and an exploration of the lower bounds of our custom implementation.

For all hyperparameters in the following tables, these were the defaults unless specified otherwise:

<pre>Trial count of 50000. 1 hidden layer, size 600. Learning rate 0.03, final learning rate 0.003. Batch size 1000.</pre>

In [None]:
data = pd.read_csv('../data/performance/1hidden.csv').values
data=data[1:,:]
plt.plot(data[:,0], data[:,1], 'r', label='Accuracy')
plt.xlabel('Neuron Count')
plt.ylabel('Accuracy')
plt.title('Neuron Count in Hidden Layer versus Accuracy')
plt.legend(loc='best')

It appears that a simple 1 hidden layer network worked best with 400 nodes. However, these minor random fluctuations would likely prove meaningless had we explored larger node counts and greater trial sizes. 

In [None]:
data = pd.read_csv('../data/performance/learning-rates2.csv').values
plt.plot(data[:,0], data[:,1], 'g', label='Accuracy')
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.title('Start Learning Rate versus Accuracy')
plt.legend(loc='best')

With our low trial sizes, and low number of trials, a learning rate of 0.03 appeared best.

In [None]:
data = pd.read_csv('../data/performance/trials.csv').values
plt.plot(data[:,0], data[:,1], 'b', label='Accuracy')
plt.xlabel('Trials')
plt.ylabel('Accuracy')
plt.title('Trials versus Accuracy')
plt.legend(loc='best')

It appears that the neural network is still exploring local minima. Since the highest trial count in the graph above is only 350,000, there was still more data to collect. Had our part 3 implementation been faster, we likely would have seen accuracy increase as we performed one million, or ten million, backpropagations. The graph is especially jagged below 50,000 trials because we collected more data points there, since those training sessions completed much faster.

In [None]:
data = pd.read_csv('../data/performance/batch-size-600.csv').values
plt.plot(data[:,0], data[:,1], 'b', label='Duration')
plt.xlabel('Batch size')
plt.ylabel('Duration (seconds)')
plt.title('Batch Size versus Training Duration')
plt.legend(loc='best')

The benefits of mini-batch show diminishing returns very quickly.

We also explored 2 hidden layer combinations with node counts up to 3000. However because of computation time, we weren't able to find much difference. For our final training session, we settled on 1500 nodes in the first hidden layer, and 500 nodes in the second hidden layer, because it appeared to be a decent trade off between training time and accuracy.

In [None]:
data = pd.read_csv('../data/performance/batch-size-1500.csv').values
plt.plot(data[:,0], data[:,1], 'b', label='Accuracy')
plt.xlabel('Batch Size')
plt.ylabel('Accuracy')
plt.title('Batch Size versus Validation Accuracy')
plt.legend(loc='best')

Observe the Y axis: accuracy hardly varies. Through this trial and various others, we concluded batch size did not significantly impact validation accuracy, especially when running larger training sessions.

#### Convolutional Neural Network

For our grid search we used 10% of the overall training set. We then applied a stratified shuffle split to the data so that we had 80% train and 20% test. Finally, we ran a randomized grid search for 100 iterations. The following hyperparameters were sampled from the distributions listed, given as (low, high) ranges:

- update_learning_rate: uniform(0.001,0.011)
- update_momentum: uniform(0.5,0.95)
- num_filters1: randint(8,128)
- num_filters2: randint(8,128)
- dropout_rate ('p'): uniform(0.1,0.7)

Grid search is made easy by nolearn because it conforms nicely to sklearn's grid search interface. Here's how we used it:

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import RandomizedSearchCV

XDATA, yDATA = data_manager.load_test_data()

#format data
X_train,y_train = formatData(XDATA,yDATA)

#Take 10% of the data
X_train, _, y_train, _= train_test_split(X_train, y_train, test_size=0.90, random_state=42)

#split into 80/20 train test
cv = StratifiedShuffleSplit(y_train, n_iter=1,test_size=0.2, random_state=42)


# scipy's uniform(loc, scale) samples from [loc, loc + scale]
parameters = {
    'num_filters1': sp_randint(8, 128),
    'num_filters2': sp_randint(8, 128),
    'update_learning_rate': uniform(loc=0.001, scale=0.01),
    'update_momentum': uniform(loc=0.5, scale=0.45),
    'p': uniform(loc=0.1, scale=0.6),
}


grid_search = RandomizedSearchCV(convnet,parameters,cv=cv,verbose = 2,n_iter=100)


grid_search.fit(X_train, y_train)
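
Conceptually, each iteration of the randomized search draws one configuration at random and trains the network with it. The sketch below shows a single draw using NumPy rather than the scipy.stats distributions in the code above. The intervals follow from the `loc` and `scale` arguments there, since `uniform(loc, scale)` in scipy.stats covers `[loc, loc + scale]`; `sample_configuration` is a hypothetical helper.

```python
import numpy as np

def sample_configuration(rng):
    """Draw one random hyperparameter configuration."""
    return {
        # The upper bound of integers() is exclusive, like scipy's randint.
        'num_filters1': int(rng.integers(8, 128)),
        'num_filters2': int(rng.integers(8, 128)),
        'update_learning_rate': rng.uniform(0.001, 0.011),
        'update_momentum': rng.uniform(0.5, 0.95),
        'p': rng.uniform(0.1, 0.7),
    }

rng = np.random.default_rng(42)
configurations = [sample_configuration(rng) for _ in range(100)]
```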


The following is a list of our 5 best-performing parameter sets based on Model 1:

- Model with rank: 1, Mean validation score: 0.502 (std: 0.000), Parameters: 
    - 'p': 0.51221599476999473, 
    - 'update_learning_rate': 0.01064499164578778, 
    - 'update_momentum': 0.91352699589488, 
    - 'num_filters1': 30,
    - 'num_filters2': 67

- Model with rank: 2, Mean validation score: 0.492 (std: 0.000), Parameters: 
    - 'p': 0.5637040674774771, 
    - 'update_learning_rate': 0.009892801971218328, 
    - 'update_momentum': 0.9468960973673631, 
    - 'num_filters1': 63,
    - 'num_filters2': 99

- Model with rank: 3, Mean validation score: 0.480 (std: 0.000), Parameters: 
    - 'p': 0.6078659297232203, 
    - 'update_learning_rate': 0.010694675739290053, 
    - 'update_momentum': 0.7168119761446152, 
    - 'num_filters1': 15,
    - 'num_filters2': 42

- Model with rank: 4, Mean validation score: 0.477 (std: 0.000), Parameters:
    - 'p': 0.15543656731019595, 
    - 'update_learning_rate': 0.007025415126151051, 
    - 'update_momentum': 0.8022797047592151, 
    - 'num_filters1': 43,
    - 'num_filters2': 80

- Model with rank: 5, Mean validation score: 0.477 (std: 0.000), Parameters: 
    - 'p': 0.4000804018488181, 
    - 'update_learning_rate': 0.00870990383597324, 
    - 'update_momentum': 0.9160907182082736, 
    - 'num_filters1': 10,
    - 'num_filters2': 31

##### Grid Search caveats: 
We found that by lowering our learning rate to .005 we would perform better than with the suggested .01 from the best model. This can be explained by the grid search being conducted on a limited subset of the data, which biases it towards learning quickly. After a bit more manual tuning, we ended up using parameters very close to those of the best grid search model.

##### Hyperparameters selected (ALL MODELS):
update_learning_rate = .005, update_momentum = .9, num_filters1 = 32, num_filters2 = 64, dropout_rate ('p') = .5 

## Testing and Validation

#### Python Neural Network Full Implementation

The early stages of development for part 3 included a test suite (found in tests/test_nn.py). Given a reasonable learning rate and 10,000 passes, the test suite verified that the neural network code could learn the following boolean functions (in increasing order of presumed difficulty):

- A xor B. 1 hidden layer, 2 nodes
- A xor B. 2 hidden layers, 2 nodes
- A xor B. 1 hidden layer, 6 nodes
- A or (B xor C). 1 hidden layer, 10 nodes
- A or (B xor C). 1 hidden layer, 10 nodes, minibatch
- 2 booleans, 3 outputs. 1 hidden layer, 50 nodes
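
Training data for boolean functions like these can be generated by enumerating the truth table. The sketch below is an illustration with a hypothetical `truth_table` helper, not the code in tests/test_nn.py.

```python
import itertools
import numpy as np

def truth_table(function, num_inputs):
    """Enumerate every boolean input combination and its output."""
    inputs = np.array(list(itertools.product([0, 1], repeat=num_inputs)))
    outputs = np.array([function(*row) for row in inputs])
    return inputs, outputs

# A xor B
xor_inputs, xor_outputs = truth_table(lambda a, b: a ^ b, 2)
# A or (B xor C)
mixed_inputs, mixed_outputs = truth_table(lambda a, b, c: a | (b ^ c), 3)
```

A network has learned the function when rounding its output reproduces the `outputs` column for every row of `inputs`.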

After these generally succeeded, we moved on to manual testing and validation with the neural net interface (main.py). It provides command line control over the various hyperparameters like hidden layer count, hidden layer size, batch size, learning rate, and dynamic learning rates. Here is an example command used for our final training session:

<pre>python3 main.py ../../data/data_and_scripts/train_inputs.csv ../../data/data_and_scripts/train_outputs.csv --sizes=2304,1500,500,10 --validate --trials=180000 --learn-rate=0.03 --validation-ratio=0.9 --random --normalize --report --final-learn-rate=0.003 --timer=600 --verbose --batch=1000</pre>

You can also get command line help with this command:

<pre>python3 main.py -h</pre>

Once we had an idea of what basic orders of magnitude were best for the hyperparameters, we were ready to collect data on various hyperparameter configurations systematically (explored in the hyperparameter section).

#### Validation of Full Implementation

A simple validation was performed. The validation-ratio parameter specifies how much of the training set to reserve for validation. It is optionally shuffled randomly. Typically, 90% of the set was used for training and 10% for validation. Finally, after all training sessions our reports included the accuracy of the neural network on the training set as well as the validation set. Sadly, we never encountered overtraining because the custom implementation was too slow, but we were guarding against it nevertheless. The full training and validation accuracies for all trials can be found <a href="https://drive.google.com/file/d/0B2lkEEFpyGuBWHU2Y0dmalhZbE0">here</a>.
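
A minimal sketch of such a split, assuming the data is held in NumPy arrays. `split_train_validation` and its `train_fraction` parameter are hypothetical names; with `train_fraction=0.9` they reproduce the typical 90%/10% split described above.

```python
import numpy as np

def split_train_validation(inputs, outputs, train_fraction=0.9,
                           shuffle=True, seed=0):
    """Split the data into a training part and a validation part."""
    n = len(inputs)
    if shuffle:
        order = np.random.default_rng(seed).permutation(n)
    else:
        order = np.arange(n)
    cut = int(round(n * train_fraction))
    train_idx, val_idx = order[:cut], order[cut:]
    return (inputs[train_idx], outputs[train_idx],
            inputs[val_idx], outputs[val_idx])

# Hypothetical stand-in data: 100 examples with 2 features each.
inputs = np.arange(200).reshape(100, 2)
outputs = np.arange(100)
train_x, train_y, val_x, val_y = split_train_validation(inputs, outputs)
```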

#### Convolutional Neural Network

We used an 80/20 train/val split. We define the train loss as the mean of the categorical cross entropy loss between the training predictions and the actual categories. Val loss is defined similarly as the mean cross entropy loss on the validation set. The following are the train and val losses plotted for each model.
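
For reference, the mean categorical cross entropy can be computed as in the NumPy sketch below. It assumes the predictions are rows of class probabilities and the labels are integer class indices; the example values are made up, and the function is an illustration rather than the library's own loss.

```python
import numpy as np

def mean_categorical_cross_entropy(probabilities, labels, eps=1e-12):
    """Average of -log(probability assigned to the true class)."""
    picked = probabilities[np.arange(len(labels)), labels]
    # Clipping avoids log(0) when a true class gets probability zero.
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))

# Two hypothetical predictions over three classes.
probabilities = np.array([[0.5, 0.25, 0.25],
                          [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = mean_categorical_cross_entropy(probabilities, labels)
```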


<h3 align="center"> MODEL 1
<img src="src/CNNMODELS/traintestcurve1.png">

MODEL 2
<img src="src/CNNMODELS/traintestcurve1.png">

MODEL 3
<img src="src/CNNMODELS/traintestcurve1.png">
</h3> 
And here is the validation accuracy of all 3 models plotted against the number of epochs.
<h3 align="center">
All Models Validation Accuracy
<img src="src/CNNMODELS/accuracycurve.png">
</h3>

## Discussion



#### Python Neural Network Full Implementation

Although our full implementation did not achieve very high accuracies, it was significantly better than random. Given the difficulty of the data set, and the fact that it was a rather simple backpropagation implementation, the black magic of neural networks was still apparent.

A number of image transformations and enhancements, taken advantage of by our scikit learn approach, were not used in our full implementation. Many of those data preprocessing steps were ready for use too late in development to be useful, especially given how slow the hyperparameter search was. The part 3 implementation operated entirely on the initial csv dataset.

Further development of the neural net was difficult. Stuart's home laptop died months ago, so he has been using his school laptop with just 3GB of RAM! He was not able to process the full dataset in any meaningful way locally. Eventually we moved testing and validation to Digital Ocean cloud computing.

One aspect we did not take advantage of is multiple hidden layers. We only tested and ran hyperparameter searches with one, sometimes two, hidden layers, when in fact our implementation can handle an arbitrary number of hidden layers. We chose not to spend time on this given the consensus in the class that two hidden layers is more than enough. Given the sensitivity of the hyperparameters, especially the learning rate, we opted not to investigate further. Convolutional neural networks are, after all, a better approach to handling many hidden layers, as demonstrated by the other sections of our report.

#### Convolutional Neural Network

##### Model Computation Time: 
Simply put, the computation time was immense, and it scales tremendously with added layers. Therefore, rather than conducting our experiments with K-Fold cross validation, we chose to use a simple 80/20 train/test split.

##### Model depth: 
The first thing to note is that there are diminishing returns on accuracy with each increase in model depth. This can be seen in the final validation scores. It is due to the base model already being relatively accurate (91% on the validation set), so adding more layers improves the accuracy, but not immensely. The improvement between models can be understood by realizing that the early layers detect bigger patterns, which allows the deeper layers (after max pooling is done) to detect the finer details. Layering multiple convolutional layers back to back increases our model complexity and allows for a richer filtering of the image.

##### Learning rate: 
From our observations, the learning rate plays a huge role in both how quickly the models reach their peak accuracy and how high that peak is. Setting the learning rate too high results in quick gains but a lower overall limit. Conversely, a low learning rate results in slow improvement but tends to perform better.

##### Overtraining/Overfitting: 
By tuning our hyperparameters carefully, we can see that almost no overfitting occurs in our graphs. That is, the test error continues to roughly diminish; at the very least, it never gets much worse. This is most likely due to our dropout layers, an effect that has been documented in the literature. Thus, little to no benefit would be seen from early stopping.


## Statement of Contributions

#### Stuart Spence

Stuart built the full implementation of the neural network in Python for part 3, including the mini-batch and dynamic learning rate. This includes the interface and gradient descent backpropagation. He also collected data for this section, and wrote his corresponding sections of the report.

#### Josh Romoff

Josh implemented the Convolutional Neural Network. He also collected data for this section, and wrote his corresponding sections of the report.

#### Charlie Bloomfield

Charlie explored optimizing hyperparameters for Logistic Regression and SVC, and wrote methods for augmenting the provided dataset with various image transformations. He collected data for these sections, and wrote the corresponding sections in this report. 


#### We hereby state that all the work presented in this report is that of the authors.

---
## References

1. http://www.sersc.org/journals/IJMUE/vol8_no4_2013/39.pdf
2. http://www.inb.uni-luebeck.de/publications/pdfs/LaBaMa08c.pdf
3. https://github.com/mnielsen/neural-networks-and-deep-learning
4. http://vip.uwaterloo.ca/files/publications/Gaussian%20MRF%20rotation-invariant%20features%20for%20SAR%20sea%20ice%20classification.pdf
5. http://arxiv.org/pdf/1506.02025v1.pdf
6. http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
7. http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
