# GR5242 Final Project Report

## Team members:
- Jia, Kewei (kj2408@columbia.edu)
- Zhang, Yini (yz3005@columbia.edu)
- Zhu, Chenyun (cz2434@columbia.edu)

## Overview

The [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is a widely used image classification benchmark. It consists of 60,000 32x32 colour images in 10 classes (airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks), with 6,000 images per class. There are 50,000 training images and 10,000 test images.

The **GOALS** of this project are to:
- Learn how to preprocess the image data
- Implement different Convolutional Neural Network (CNN) classifiers using GPU-enabled TensorFlow and the Keras API
- Compare different CNN architectures

**Tools:**
- GPU-enabled TensorFlow
- Keras API

## 1. Data Exploration & Preprocessing

(Please refer to the *Data Exploration and Preprocessing.ipynb* for detailed code.)

### 1-1. Data Description

The version we used is [CIFAR-10 python version](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz).

The original CIFAR-10 training dataset is split into five batch files, each containing 10,000 images. The test dataset is a single file containing 10,000 images. We use the functions in our script **load_data_helper_functions.py** to load both the images and the labels of the training and test data.

The training set we obtain is a NumPy ndarray with shape (50,000, 3072), and the test set is a NumPy ndarray with shape (10,000, 3072). Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so the first 32 entries of a row are the red channel values of the first row of the image.

The labels for the training and test sets are NumPy arrays with shapes (50,000, 1) and (10,000, 1), respectively. At this stage they are not yet one-hot encoded.

### 1-2. Data Exploration

Here we reshape each row into a (32, 32, 3) NumPy array, so that each innermost array holds the three channel values (red, green, and blue) of one pixel. The reshaped training data has shape (50,000, 32, 32, 3), and the reshaped test data has shape (10,000, 32, 32, 3).
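A minimal NumPy sketch of this reshape, assuming the row layout described in Section 1-1 (1024 red values, then 1024 green, then 1024 blue, each plane in row-major order). Because the channels come first within each row, the array is first viewed as (N, 3, 32, 32) and the channel axis is then moved to the end; a direct `reshape(-1, 32, 32, 3)` would mix the channels. The function name `rows_to_images` is illustrative and is not taken from the project scripts.

```python
import numpy as np


def rows_to_images(rows):
    """Convert CIFAR-10 rows of shape (N, 3072) to images of shape (N, 32, 32, 3)."""
    rows = np.asarray(rows)
    # Each row is stored channel-first: [R plane | G plane | B plane].
    channel_first = rows.reshape(-1, 3, 32, 32)
    # Move the channel axis to the end so each pixel holds its (R, G, B) triple.
    return channel_first.transpose(0, 2, 3, 1)


# Synthetic stand-in for the real data: two rows with a distinct value per entry.
demo_rows = np.arange(2 * 3072).reshape(2, 3072)
demo_images = rows_to_images(demo_rows)
print(demo_images.shape)  # (2, 32, 32, 3)
```

With the synthetic rows, `demo_images[0, 0, 0]` holds entries 0, 1024, and 2048 of the first row, that is, the red, green, and blue values of the top-left pixel.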

To get a better understanding of the dataset, we then plot the first 10 images in the training set together with their true class labels. The images are plotted using functions in our script **preprocess_data.py**.

__The first 10 images in training set:__


<img style="float: left;" src="http://drive.google.com/uc?export=view&id=1KfGNI0WfoZbZwjokrwyKwh9OdqNvRAlu">

### 1-3. Data Preprocessing

To prepare data for training CNN models, we do the following things: 

First, we convert image labels to one-hot-encoding.

Next, we enlarge the training dataset by adding randomly distorted images, which are cropped, horizontally flipped, or adjusted in hue, contrast, and saturation. Distorting images in this way exposes the model to more variation during training, which should help the trained CNN generalize better to the test dataset. We got this idea of data preprocessing from [Magnus Erik Hvass Pedersen](http://www.hvass-labs.org/).

Last, the test images are cropped around the center without any other adjustment. The cropped size is the same as that used for the training set.
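The three steps above can be sketched in plain NumPy as follows. This is an illustrative stand-in, not the code in **preprocess_data.py**: the function names are hypothetical, the crop size of 24 is taken from the input size listed in Section 2-1, and the hue, contrast, and saturation adjustments are left out to keep the example short.

```python
import numpy as np

NUM_CLASSES = 10
CROP = 24  # cropped side length; 24 matches the input size listed in Section 2-1


def one_hot(labels, num_classes=NUM_CLASSES):
    """Turn integer labels of shape (N,) or (N, 1) into an (N, num_classes) array."""
    labels = np.asarray(labels).reshape(-1)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


def random_crop_flip(image, rng, crop=CROP):
    """Training-time distortion: random crop, then a horizontal flip half the time."""
    height, width, _ = image.shape
    top = rng.integers(0, height - crop + 1)
    left = rng.integers(0, width - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]  # reverse the width axis
    return patch


def center_crop(image, crop=CROP):
    """Test-time preprocessing: deterministic crop around the image centre."""
    height, width, _ = image.shape
    top = (height - crop) // 2
    left = (width - crop) // 2
    return image[top:top + crop, left:left + crop, :]


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # random stand-in for one CIFAR-10 image
print(one_hot([[3], [0]]).shape)           # (2, 10)
print(random_crop_flip(image, rng).shape)  # (24, 24, 3)
print(center_crop(image).shape)            # (24, 24, 3)
```

Keeping the test-time crop deterministic means that the reported test accuracy does not depend on a random seed, while the random crops and flips are applied only to training images.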

__Plot the distorted image__<br>
Here are 10 examples of the 321st image in the test dataset after preprocessing:

<img style="float: left;" src="http://drive.google.com/uc?export=view&id=1vW_iW7FQQpbenLvFT1-Cx4mtgIi_1kEV"> 

As we can see, each distorted image is either flipped or adjusted in some way that makes it differ from the original image. These images will later be used to train the CNN model.

## 2. Convolutional Neural Networks (CNN) From Scratch

To start with, we build a CNN classifier from scratch and explore details to understand how convolution works. <br>

(Please refer to the *CNN_Scratch.ipynb* for detailed code.)

### 2-1. Model 


We trained the model for 1000 epochs.<br>
Each epoch consists of 500 batches with a batch size of 100. <br>
The learning rate is 0.05. 
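The batch arithmetic behind these settings can be checked with a short sketch: 50,000 training images split into batches of 100 give 500 batches per epoch. The generator `iterate_minibatches` is a hypothetical helper, and shuffling the order in every epoch is an assumption rather than something stated in this report.

```python
import numpy as np


def iterate_minibatches(num_examples, batch_size, rng):
    """Yield index arrays that together cover the data set once, in random order."""
    order = rng.permutation(num_examples)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]


NUM_TRAIN = 50_000   # CIFAR-10 training images
BATCH_SIZE = 100     # batch size stated above
EPOCHS = 1000        # number of epochs stated above

rng = np.random.default_rng(0)
batches_per_epoch = sum(1 for _ in iterate_minibatches(NUM_TRAIN, BATCH_SIZE, rng))
print(batches_per_epoch)           # 500
print(batches_per_epoch * EPOCHS)  # 500000 gradient steps in total
```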

__Architecture:__ <br>
input = ($32\times32\times3$) -- <br>
Image Preprocessing: cropped images to shape ($24\times24\times3$) -- <br>

2D convolution layer with filters size $64$ and kernel size $5\times5$ plus ReLU activation -- <br>
2D Maxpooling -- <br>

2D convolution layer with filters size $64$ and kernel size $5\times5$ plus ReLU activation -- <br>
2D Maxpooling -- <br>
20% Dropout -- <br>

Softmax Output
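To make the convolution step concrete, here is a naive NumPy forward pass through one "convolution, ReLU, max pooling" stage of the architecture above. It is a teaching sketch rather than the code in *CNN_Scratch.ipynb*: stride 1, zero "same" padding, and 2x2 pooling are assumptions, and the kernel values are random.

```python
import numpy as np


def conv2d_same(image, kernels):
    """Naive 2D convolution (cross-correlation), stride 1, zero 'same' padding.

    image:   (H, W, C_in)
    kernels: (K, K, C_in, C_out) with odd K
    returns: (H, W, C_out)
    """
    height, width, _ = image.shape
    k, _, _, c_out = kernels.shape
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    output = np.zeros((height, width, c_out))
    for row in range(height):
        for col in range(width):
            window = padded[row:row + k, col:col + k, :]  # (K, K, C_in)
            # Sum over the window and the input channels for every output filter.
            output[row, col, :] = np.tensordot(window, kernels, axes=3)
    return output


def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; H and W must be even."""
    height, width, channels = x.shape
    blocks = x.reshape(height // 2, 2, width // 2, 2, channels)
    return blocks.max(axis=(1, 3))


rng = np.random.default_rng(0)
image = rng.random((24, 24, 3))                       # stand-in for a cropped image
kernels = rng.normal(scale=0.05, size=(5, 5, 3, 64))  # 64 filters of size 5x5
features = max_pool_2x2(relu(conv2d_same(image, kernels)))
print(features.shape)  # (12, 12, 64)
```

Under these assumptions, the first stage turns a 24x24x3 input into a 12x12x64 feature map, and a second stage of the same kind would reduce it to 6x6x64 before the softmax output.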

### 2-2. Results

#### I. Train/Test Accuracy and Loss

<img style="float: left; width:60%" src="http://drive.google.com/uc?export=view&id=1ghfpYuJu6xT2qU4mL8WPA2aJFp68dU95">

__Insights:__ <br>
We trained on a GPU, and training for 1000 epochs took around 10 hours. <br>
After 1000 epochs, the accuracy on the test dataset is around 64%, and the accuracy on the training set is slightly higher at 65%. <br>
Some overfitting exists, but it is not a serious problem: in both the accuracy and loss plots, the training and test curves stay close to each other.

The plot suggests that the model needs more training time, since the accuracy is still increasing and has not yet stabilized. However, due to computational constraints, we did not proceed further. We believe that the final accuracy would keep improving if this CNN model were trained for more epochs.

#### II. Examples of Mis-classified Images

Below are 10 examples of images that the CNN classifier predicted incorrectly. 

Some of these images are hard to classify even for humans. The third image in the first row closely resembles the shape of a cat, and the fifth image in the row has a color similar to that of a frog. 

<img style="float: left; width:60%" src="http://drive.google.com/uc?export=view&id=1FCmsI5xr8PLuE_XyUBq4cFtMWN8q4cOU">

#### III. Confusion Matrix

We then use a confusion matrix to see which classes are more likely to be misclassified.

<img style="float: left; width:60%" src="http://drive.google.com/uc?export=view&id=1C6o64PxIbve57B1pkRUbxNSWa5H_rpKZ">

The figure above shows the confusion matrix for one batch (100 images in total) of the test dataset. <br>
Automobiles are less likely to be misclassified, whereas 50% of airplanes are given the wrong label. Although airplanes are frequently misclassified, they tend to be mistaken for other man-made objects (automobiles, ships, trucks) rather than for animals.
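A confusion matrix of this kind can be computed with a few lines of NumPy. The sketch below uses made-up labels purely for illustration (they are not the project's predictions), and the helper names are hypothetical. Class indices follow the standard CIFAR-10 order.

```python
import numpy as np

CLASS_NAMES = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]


def confusion_matrix(true_labels, predicted_labels, num_classes=len(CLASS_NAMES)):
    """Entry [i, j] counts images of true class i that were predicted as class j."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return matrix


def per_class_error(matrix):
    """Fraction of each true class that was misclassified (NaN if the class is absent)."""
    totals = matrix.sum(axis=1)
    correct = np.diag(matrix)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(totals > 0, 1.0 - correct / totals, np.nan)


# Made-up labels for illustration only; these are not the project's predictions.
y_true = [0, 0, 1, 1, 3, 5]
y_pred = [0, 8, 1, 1, 5, 5]
cm = confusion_matrix(y_true, y_pred)
print(cm[0, 8])                # 1: one airplane predicted as ship
print(per_class_error(cm)[0])  # 0.5
```

Rows correspond to true classes and columns to predicted classes, so the off-diagonal entries of a row show which classes a given category is mistaken for.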

## 3. Convolutional Neural Networks (CNN) with Keras
### 3-1. Brief Introduction for Keras

*"Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano."* [1] We tried Keras for the following reasons:
- Easy to get started with
- Results in much more readable and succinct code
- Able to run on GPU (much faster than CPU)

### 3-2. CNN with Keras

__I. We first created the basic CNN with 100 Epochs.__<br>
(Please refer to the *Keras_CNN_Baseline.ipynb* for detailed code.)

__Architecture:__ <br>
input = ($32\times32\times3$) -- <br>
2D convolution layer with filters size $32$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D convolution layer with filters size $32$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D Maxpooling -- <br>
20% Dropout -- <br>

2D convolution layer with filters size $64$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D convolution layer with filters size $64$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D Maxpooling -- <br>
25% Dropout -- <br>

Softmax Output

__Baseline CNN Result:__

<img style="float: left; width:60%" src="http://drive.google.com/uc?export=view&id=1tvqt-_HYvtX_ystbJI6_BShUBYFmmLju">

__Insights:__ <br>
Note that the test accuracy is 79.7%. Next, we add more layers to the model to see whether the test accuracy improves.

__II. We then added more layers to the previous model.__<br>
(Please refer to the *Keras_CNN_Baseline_More_Layer.ipynb* for detailed code.)

__Architecture:__ <br>
input = ($32\times32\times3$) -- <br>
2D convolution layer with filters size $32$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D convolution layer with filters size $32$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D Maxpooling -- <br>
20% Dropout -- <br>

2D convolution layer with filters size $64$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D convolution layer with filters size $64$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D Maxpooling -- <br>
25% Dropout -- <br>

2D convolution layer with filters size $128$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D convolution layer with filters size $128$ and kernel size $3\times3$ plus ReLU activation -- <br>
2D Maxpooling -- <br>
30% Dropout -- <br>

Softmax Output

__Baseline CNN with More Layers Result:__

<img style="float: left; width:60%" src="http://drive.google.com/uc?export=view&id=1G6JkT3F9v9bsPhcOXLqdVvrbZ0YQUs2u">

__Insights:__<br>
The test accuracy increases from 79.7% to 83.1%. However, the model appears to be overfitting.

__III. Try to prevent overfitting by adding batch normalization and a kernel regularizer__<br>
The idea comes from a Kaggle comment by [EricAlcaideAldeano](https://www.kaggle.com/ericalcaide9834/discussion). <br>
(Please refer to the *Keras_CNN_Prevent_Overfitting.ipynb* for detailed code.)

__Architecture:__ <br>
input = ($32\times32\times3$) -- <br>
2D convolution layer with filters size $32$ and kernel size $3\times3$ plus ReLU activation and *regularizer(0.001)* -- <br>
*BatchNormalization* -- <br>
2D convolution layer with filters size $32$ and kernel size $3\times3$ plus ReLU activation and *regularizer(0.001)* -- <br>
*BatchNormalization* -- <br>
2D Maxpooling -- <br>
20% Dropout -- <br>

2D convolution layer with filters size $64$ and kernel size $3\times3$ plus ReLU activation and *regularizer(0.001)* -- <br>
*BatchNormalization* -- <br>
2D convolution layer with filters size $64$ and kernel size $3\times3$ plus ReLU activation and *regularizer(0.001)* -- <br>
*BatchNormalization* -- <br>
2D Maxpooling -- <br>
25% Dropout -- <br>

2D convolution layer with filters size $128$ and kernel size $3\times3$ plus ReLU activation and *regularizer(0.001)* -- <br>
*BatchNormalization* -- <br>
2D convolution layer with filters size $128$ and kernel size $3\times3$ plus ReLU activation and *regularizer(0.001)* -- <br>
*BatchNormalization* -- <br>
2D Maxpooling -- <br>
30% Dropout -- <br>
Softmax Output
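The architecture above could be written with the Keras `Sequential` API roughly as follows. This is a sketch under stated assumptions, not the code in *Keras_CNN_Prevent_Overfitting.ipynb*: `"same"` padding, 2x2 pooling, an L2 penalty for *regularizer(0.001)*, a `Flatten` layer before the softmax layer, and the Adam optimizer are all guesses, since the report does not specify them. The function name `build_cnn` is hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers


def build_cnn(weight_decay=0.001, num_classes=10):
    """Three conv blocks (32, 64, 128 filters) followed by a softmax classifier."""
    model = keras.Sequential([keras.Input(shape=(32, 32, 3))])
    # (filters, dropout rate) for each block, as listed in the architecture above.
    for filters, dropout_rate in [(32, 0.2), (64, 0.25), (128, 0.3)]:
        for _ in range(2):
            model.add(layers.Conv2D(
                filters, (3, 3),
                padding="same",                                    # assumption
                activation="relu",
                kernel_regularizer=regularizers.l2(weight_decay),  # assumed to be L2
            ))
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))           # assumed 2x2 pooling
        model.add(layers.Dropout(dropout_rate))
    model.add(layers.Flatten())                                    # assumption
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model


model = build_cnn()
model.compile(
    optimizer="adam",                 # assumption: the optimizer is not stated
    loss="categorical_crossentropy",  # matches one-hot encoded labels
    metrics=["accuracy"],
)
print(model.output_shape)  # (None, 10)
```

Dropping the `BatchNormalization` layers and the `kernel_regularizer` argument gives a comparable sketch of model II, and additionally removing the third block gives one of the baseline model I.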

__Previous Model with Batchnorm and Kernel Regularizer Added Result:__

<img style="float: left; width:60%" src="http://drive.google.com/uc?export=view&id=1zGp093ocKkfTrV5vEImBr76FeTLwdkHm">

__Insights:__<br>
After applying batch normalization and the kernel regularizer, the test accuracy of 85.2% is better than the 83.1% of the previous model. In addition, the model generalizes better, as the graph shows: the gap between training accuracy and validation accuracy is smaller than before.
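For readers unfamiliar with the two techniques, the following NumPy sketch shows what each one computes. It is a simplified illustration: batch normalization is shown in training mode on a 2D array of shape (batch, features), whereas convolutional layers normalize per channel, and the regularizer is assumed to be an L2 penalty with strength 0.001. The function names are hypothetical.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization over the batch axis of an (N, D) array."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, roughly unit variance
    return gamma * x_hat + beta              # learnable scale and shift


def l2_penalty(weights, strength=0.001):
    """L2 term added to the loss: strength * sum of squared weights."""
    return strength * np.sum(np.square(weights))


rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
normalized = batch_norm_forward(activations, gamma=np.ones(4), beta=np.zeros(4))
print(np.abs(normalized.mean(axis=0)).max() < 1e-9)  # True: each feature is centred
print(l2_penalty(np.array([1.0, -2.0])))              # 0.005
```

The penalty grows with the squared magnitude of the weights, so minimizing the total loss pushes the weights toward smaller values, while batch normalization keeps the scale of the activations stable from layer to layer.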

### 3-3. Findings of CNN with Keras

- Adding a moderate number of extra layers can achieve higher accuracy
- Batch normalization and a kernel regularizer help prevent overfitting and keep the weights small, so that the model generalizes well

### 3-4. Next Steps

Apply data augmentation to the model. Running 25 epochs takes 17 hours, so limited time is the main obstacle to adding data augmentation.

## 4. Discussions

### 4-1. Summarize the Findings
- Data preprocessing is important for building a good classifier
- Adding a moderate number of extra layers can achieve higher accuracy
- Batch normalization and a kernel regularizer help prevent overfitting

### 4-2. Model Comparison

When comparing the CNN classifier built from scratch with the ones built with Keras, it is clear that the Keras models reach higher accuracy and train much faster. Interestingly, the Keras models already reach a fairly high test accuracy of around 0.6 in the early phases, even in the very first epoch. The scratch model, however, starts from a much lower level and needs almost 500 epochs to raise its accuracy from 0.2 to 0.6. One possible cause is the choice of initializer [2]. Keras may ship with empirically well-chosen default initializers that shorten the time-consuming early training phase.
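As an illustration of what such a default looks like, the sketch below implements the Glorot (Xavier) uniform scheme in NumPy. Glorot uniform is the default `kernel_initializer` of the Keras `Conv2D` and `Dense` layers; it is not the method proposed in [2], and the initializer used in the from-scratch model is not stated in this report. The function name is hypothetical.

```python
import numpy as np


def glorot_uniform(shape, rng):
    """Glorot (Xavier) uniform initializer for a conv kernel (K, K, C_in, C_out)
    or a dense weight matrix (fan_in, fan_out)."""
    receptive_field = int(np.prod(shape[:-2]))  # K*K for conv kernels, 1 for dense
    fan_in = shape[-2] * receptive_field
    fan_out = shape[-1] * receptive_field
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)


rng = np.random.default_rng(0)
kernel = glorot_uniform((5, 5, 3, 64), rng)  # 5x5 kernel, 3 input and 64 output channels
print(kernel.shape)                                        # (5, 5, 3, 64)
print(np.abs(kernel).max() <= np.sqrt(6.0 / (75 + 1600)))  # True
```

The sampling range shrinks as the number of incoming and outgoing connections grows, which keeps the variance of activations and gradients roughly constant across layers at the start of training.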

Another difference concerns overfitting. Both the scratch model and the Keras models use dropout layers, yet the Keras models overfit more than the scratch model. This is probably because, in each epoch of the scratch model, we run through all images in the training set with a batch size of 100 before moving on to the next epoch, so each epoch takes 500 steps. As a result, the scratch CNN learns more slowly but more carefully than the Keras models. This can also be seen in the result plots: the curves for the scratch CNN are very dense, whereas those for the Keras models are relatively sparse.

### 4-3. Our other trials

At the beginning of our project, we also tried several other architectures:
- Logistic regression with accuracy ~ 28% (Please refer to the *logistic_reg.py* for detailed code.)
- Conv -- fc with accuracy ~ 40% (Please refer to the *conv_fc.py* for detailed code.)
- Conv -- conv -- fc -- fc with accuracy ~ 50% (Please refer to the *cnn_deep.py* for detailed code.)
- Fractional max pooling [3] with accuracy ~ 58% (Please refer to the *fractional_max_pool.py* for detailed code.)

### 4-4. Future Improvements
- Add data augmentation
- Try decreasing the learning rate as the number of epochs increases
- Train the model for a longer time
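One simple way to decrease the learning rate over time is an exponential schedule, sketched below. The starting value of 0.05 is the learning rate used for the from-scratch model; the decay rate of 0.99 is a hypothetical value chosen only for illustration and has not been tuned or tested in this project.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch


INITIAL_LR = 0.05  # learning rate used for the from-scratch model
DECAY_RATE = 0.99  # hypothetical value chosen for illustration

for epoch in (0, 100, 500):
    print(epoch, exponential_decay(INITIAL_LR, DECAY_RATE, epoch))
```

A larger learning rate early on lets the model make fast progress, while the smaller rate in later epochs allows finer adjustments once the accuracy curve starts to flatten.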

## References

[1] *Keras: The Python Deep Learning Library. keras.io/#keras-the-python-deep-learning-library.* <br>
[2] *Mishkin, D., & Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422.* <br>
[3] *Graham, B. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.*