# Exercise 2 - Image Classification using CIFAR10 dataset

In this notebook we will build on the knowledge of model building in Exercise 1 to classify objects. 

We will follow these steps:
1. Explore the dataset (CIFAR10)
2. Build a small convnet from scratch
3. Evaluate training and validation accuracy
4. Score the model against the test set and submit the result

Let's get started!

## Download and Explore the Dataset

Let's start by downloading our dataset, a .zip of 60,000 PNG pictures of different objects, and extracting it locally.

**NOTE:** The images used in this exercise are taken from the CIFAR-10 dataset, available [here](https://www.cs.toronto.edu/~kriz/cifar.html), which contains 60,000 images in 10 classes.

The contents of the `.zip` are extracted to the base directory, which contains `train`, `val` and `test` subdirectories for the training, validation and test datasets. The folders have the following structure:

```
---------------
train
|- airplane
|- automobile
|- bird
|- cat
|- deer
|- dog
|- frog
|- horse
|- ship
|- truck

val
|- airplane
|- automobile
|- bird
|- cat
|- deer
|- dog
|- frog
|- horse
|- ship
|- truck

test
|- test
```

In [None]:
# Create the "data/cifar10" directory (and its parent "data"); -p avoids an error if it already exists
!mkdir -p data/cifar10
# Downloading the CIFAR dataset
!wget -N https://s3-us-west-2.amazonaws.com/ai-camp/cifar10.zip
# Unzip the data into the folder "data/cifar10"
!unzip -qq -n cifar10.zip -d data/cifar10
# Switch directory to "data/cifar10" and show its content
!cd data/cifar10 && ls

In [None]:
import os

base_dir = 'data/cifar10'

# Directory to our training data
train_folder = os.path.join(base_dir, 'train')

# Directory to our validation data
val_folder = os.path.join(base_dir, 'val')

# Directory to our test data
test_folder = os.path.join(base_dir, 'test')

Now, let's find out the total number of images we have in each of the `train`, `val` and `test` folders.

In [None]:
# List folders and number of files
print("Directory, Number of files")
for root, subdirs, files in os.walk(base_dir):
    print(root, len(files))

We can see that there are 10 category folders in each of the `train` and `val` folders, whereas the `test` directory contains only one folder.

Now let's take a look at a few images to get a better sense of what the `frog` and `airplane` classes look like. First, configure the matplotlib parameters:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Parameters for our graph; we'll output images in a 4x4 configuration
nrows = 4
ncols = 4

# Index for iterating over images
pic_index = 0

Now, display a batch of 8 frog and 8 airplane pictures. You can rerun the cell to see a new batch.

In [None]:
# Paths to the frog and airplane training images
train_frog_dir = "data/cifar10/train/frog"
train_airplane_dir = "data/cifar10/train/airplane"
train_frog_fnames = os.listdir(train_frog_dir)
train_airplane_fnames = os.listdir(train_airplane_dir)

# Set up matplotlib fig, and size it to fit 4x4 pics
fig = plt.gcf()
fig.set_size_inches(ncols, nrows)

pic_index += 8
next_frog_pix = [os.path.join(train_frog_dir, fname) 
                for fname in train_frog_fnames[pic_index-8:pic_index]]
next_airplane_pix = [os.path.join(train_airplane_dir, fname) 
                for fname in train_airplane_fnames[pic_index-8:pic_index]]

for i, img_path in enumerate(next_frog_pix+next_airplane_pix):
    # Set up subplot; subplot indices start at 1
    sp = plt.subplot(nrows, ncols, i + 1)
    sp.axis('Off') # Don't show axes (or gridlines)
    
    img = mpimg.imread(img_path)
    plt.imshow(img)

plt.show()

## Data Preprocessing

Let's set up data generators that will read images from our source folders and convert them to float32 tensors. We'll have one generator each for the training, validation and test folders.

### Batch
Our generators will yield batches of `10` images of size `32 x 32` and their labels.
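
As a rough illustration of what one such batch looks like, the sketch below builds a stand-in batch with NumPy instead of reading images from disk. It assumes 3 colour channels and one-hot labels over 10 classes (which is what `class_mode='categorical'` produces); the random pixel values and the names `batch_images` and `batch_labels` are placeholders, not part of the exercise.

```python
import numpy as np

batch_size = 10          # images per batch, as stated above
height, width = 32, 32   # image size, as stated above
channels = 3             # assumption: RGB colour images
num_classes = 10         # CIFAR-10 has 10 classes

rng = np.random.default_rng(0)

# Stand-in for a batch of rescaled images: float32 values in [0, 1)
batch_images = rng.random((batch_size, height, width, channels), dtype=np.float32)

# Stand-in integer class ids, converted to one-hot rows
class_ids = rng.integers(0, num_classes, size=batch_size)
batch_labels = np.eye(num_classes, dtype=np.float32)[class_ids]

print(batch_images.shape)  # (10, 32, 32, 3)
print(batch_labels.shape)  # (10, 10)
```

Each row of `batch_labels` contains a single 1 at the position of the image's class and 0 elsewhere.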

### Feature scaling
As we saw in our MNIST exercise, data that goes into a neural network should be normalised so that it is easier for the network to process. In our case, we will preprocess our images by normalising the pixel values to the 0 to 1 range. We do this by dividing each pixel value by 255, a process known as data normalisation or rescaling.
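
To make the rescaling concrete, here is a minimal sketch on a made-up 2x2 single-channel "image". The array `pixels` is illustrative data only; in the exercise the same scaling is applied to every image by the generator.

```python
import numpy as np

# A tiny 2x2 "image" with 8-bit pixel values (made-up data)
pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Convert to float32 and divide by 255 so values fall in the 0-1 range
scaled = pixels.astype(np.float32) / 255.0

print(scaled)
print(scaled.min(), scaled.max())
```

The smallest possible pixel value (0) maps to 0 and the largest (255) maps to 1, so every rescaled value lies between the two.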

### Generator - ImageDataGenerator
To rescale the data, we use `keras.preprocessing.image.ImageDataGenerator` class with the `rescale` parameter. This class will also allow us to instantiate generators of augmented image batches (and their labels) via `.flow_from_directory(directory)`. These generators can then be used with the Keras model methods that accept data generators as inputs such as `fit_generator`, `evaluate_generator` and `predict_generator`.

In [None]:
from keras.preprocessing.image import ImageDataGenerator

# Batch size
bs = 10

# All images will be resized to this value
image_size = (32, 32)

# All images will be rescaled by 1./255 
train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./_____)

# Flow training images in batches of 10 using train_datagen generator
print("Preparing generator for train dataset")
train_generator = train_datagen.flow_from_directory(
    directory= train_folder, # This is the source directory for training images 
    target_size=image_size, # All images will be resized to value set in image_size
    batch_size=bs,
    class_mode='categorical')

# Flow validation images in batches of 10 using val_datagen generator
print("Preparing generator for validation dataset")
val_generator = val_datagen.flow_from_directory(
    directory= val_folder, 
    target_size=image_size,
    batch_size=bs,
    class_mode='_____')

# Flow test images in batches of 10 using test_datagen generator
# Added shuffle=False to keep data in same order as labels
print("Preparing generator for test dataset")
test_generator = test_datagen.flow_from_directory(
    directory=test_folder,
    target_size=image_size,
    batch_size=bs,
    shuffle=False)

## Building a Small Convnet Model

The images that will go into our convnet are **32 x 32** color images. You are free to resize the images for faster training time.

Here, we designed the architecture such that it contains 2 {convolution + relu + maxpooling} modules. Our convolutions operate on **3x3** windows and our maxpooling layers operate on **2x2** windows. Our first convolution extracts **16** filters and the second extracts **32** filters.

On top of the convolution layers, we flatten the 3-D feature maps into a 1-D vector and feed it into 2 fully connected layers. The first fully connected layer has 128 hidden units and the last one has as many outputs as we have classes (10).

NOTE: This is a basic configuration for image classification. You are free to modify the model to improve the accuracy while watching for overfitting/underfitting.

In [None]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Flatten, Dense

# Here we specify the input shape of our data 
# This should match the size of images ('image_size') along with the number of channels (3)
input_shape = (32, 32, 3)

# Define the number of classes
num_classes = 10

# Initialising the model
model = Sequential()

# First convolution extracts 16 filters that are of kernel size 3x3 
model.add(Conv2D(_____, (_____), 
                 padding='same', 
                 strides=2, 
                 input_shape=input_shape,
                 activation='relu'))

# Convolution is followed by max-pooling layer with a 2x2 window
model.add(MaxPooling2D(pool_size=(_____)))

# Second convolution extracts 32 filters that are of kernel size 3x3 
model.add(Conv2D(_____, (3,3), 
                 padding='same', 
                 strides=2,
                 activation='_____'))

# Convolution is followed by max-pooling layer with a 2x2 window
model.add(MaxPooling2D(pool_size=(2, 2)))

# Flatten the feature maps into a 1-D vector so we can pass it through the fully connected (dense) layers
model.add(Flatten())

# Create a fully connected layer with ReLU activation and 128 hidden units
model.add(Dense(128, activation='relu'))

# Create an output layer with the number of classes and activate using softmax
model.add(Dense(num_classes, activation='_____'))

Let's summarise the model architecture:

In [None]:
model.summary()
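
The output shapes and parameter counts in the summary can be cross-checked by hand. The sketch below does that arithmetic in plain Python, assuming the model is completed as described in the text (16 and 32 filters, 3x3 kernels, 2x2 pooling) with `strides=2` and `padding='same'` on both convolutions, and the default `MaxPooling2D` settings (`padding='valid'`, stride equal to the pool size). The helper names are made up for this illustration.

```python
import math

def conv_same_out(size, stride):
    # With padding='same', the output size is ceil(input / stride)
    return math.ceil(size / stride)

def pool_valid_out(size, pool):
    # With padding='valid' and stride equal to the pool size
    return (size - pool) // pool + 1

size = 32                        # input height and width
size = conv_same_out(size, 2)    # first convolution, strides=2
size = pool_valid_out(size, 2)   # first 2x2 max-pooling
size = conv_same_out(size, 2)    # second convolution, strides=2
size = pool_valid_out(size, 2)   # second 2x2 max-pooling

flat_units = size * size * 32    # 32 feature maps after the second convolution
print(size, flat_units)

# Parameters per layer: (inputs per unit + 1 bias) * number of units/filters
conv1_params = (3 * 3 * 3 + 1) * 16
conv2_params = (3 * 3 * 16 + 1) * 32
dense1_params = (flat_units + 1) * 128
dense2_params = (128 + 1) * 10
total_params = conv1_params + conv2_params + dense1_params + dense2_params
print(conv1_params, conv2_params, dense1_params, dense2_params, total_params)
```

If you change the strides, filter counts or image size, the same helpers give the new shapes to compare against `model.summary()`.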

Next, we will configure the specifications for model training.

We train our model with `categorical_crossentropy` loss, because this is a multi-class problem. We will use the `adam` optimizer with a learning rate of `0.001`. During training, we want to monitor `accuracy` of the classification.
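
To give a feel for what the loss measures, here is a minimal NumPy sketch of categorical cross-entropy on two made-up samples with 3 classes (rather than 10, for brevity). The function below is a simplified illustration, not the Keras implementation, and the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions to avoid log(0), then average
    # -sum(y_true * log(y_pred)) over the samples
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# One-hot truth labels and predicted class probabilities (made-up values)
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))
```

Only the probability assigned to the correct class contributes to each sample's loss, so the loss shrinks as the model becomes more confident in the right answer.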

In [None]:
from keras import optimizers

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(lr=0.001),
              metrics=['_____'])

## Model Training 

For educational purposes, let's train on 500 images per epoch for 30 epochs, and validate against 50 of the validation images after each epoch.

Note: This may take a few minutes to run.

In [None]:
history = model.fit_generator(
        train_generator, # train generator has 50,000 train images but we are not using all of them
        steps_per_epoch=50, # training 500 images = 50 steps x 10 images per batch
        epochs=30,
        validation_data=val_generator, # validation generator has 5,000 validation images
        validation_steps=5 # validating on 50 images = 5 steps x 10 images per batch
)
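
The relationship between steps, batch size and the number of images seen can be checked with a few lines of arithmetic. This sketch reuses the values from the cell above; `train_total` is taken from the comment in that cell, and the variable names are chosen only for this illustration.

```python
import math

batch_size = 10
steps_per_epoch = 50
validation_steps = 5
train_total = 50_000   # training images available, per the comment in the cell above

# Images processed per epoch = steps x images per batch
images_per_epoch = steps_per_epoch * batch_size
val_images_per_epoch = validation_steps * batch_size

# Steps needed to see every training image once in an epoch
steps_for_full_pass = math.ceil(train_total / batch_size)

print(images_per_epoch, val_images_per_epoch, steps_for_full_pass)
```

Raising `steps_per_epoch` towards `steps_for_full_pass` would use more of the training data per epoch, at the cost of longer training time.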

## Evaluating Accuracy and Loss of the Model

With a trained model, we can evaluate the model performance against the truth labels of our validation set. Note that we are validating against 50 steps (batches), i.e. 500 images, instead of the 50 images we validated against in each training epoch.

In [None]:
# Validating against 50 steps, 500 images in total
scores = model.evaluate_generator(val_generator, steps=50, verbose=1)
print('Val loss:', scores[0])
print('Val accuracy:', scores[1])
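
For intuition about the accuracy figure, the sketch below computes it by hand on four made-up samples with 3 classes: a prediction counts as correct when the class with the highest predicted probability matches the one-hot truth label. The arrays are illustrative data, not output from the model.

```python
import numpy as np

# One-hot truth labels (made-up): classes 0, 1, 2, 0
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

# Predicted class probabilities (made-up)
y_prob = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.5, 0.4, 0.1],
                   [0.8, 0.1, 0.1]])

# Compare the most likely predicted class with the true class
correct = np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1)
accuracy = correct.mean()

print(accuracy)  # 0.75
```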

## Model Prediction against Test Set

Using the trained model, we now predict the classes in the **test set**. The test set contains images from all classes, all placed in a single folder.

In [None]:
import json
import pandas as pd
import numpy as np

# 500 steps x 10 images per batch = 5,000 images
preds = model.predict_generator(test_generator, steps=500, verbose=1)

# Invert class_indices to map each class index back to its class name
# {0: 'airplane',
#  1: 'automobile',
#  ...}
label_dict = dict((v,k) for k,v in (train_generator.class_indices).items())

# Get a list of predictions ['horse', 'airplane', ...]
predicted_class_indices = np.array([np.argmax(x) for x in preds])
predictions = [label_dict[k] for k in predicted_class_indices]
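
The index-to-label step can be tried in isolation on a toy example. The sketch below uses a hypothetical `class_indices` mapping with only three classes and made-up probabilities; `preds.argmax(axis=1)` is a vectorised equivalent of the list comprehension used above.

```python
import numpy as np

# Hypothetical mapping in the style of class_indices (3 classes for brevity)
class_indices = {'airplane': 0, 'automobile': 1, 'bird': 2}

# Invert it so that we can look up a name from an index
label_dict = {index: name for name, index in class_indices.items()}

# Made-up predicted probabilities for two images
preds = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1]])

# Index of the highest probability in each row
predicted_class_indices = preds.argmax(axis=1)
predictions = [label_dict[int(k)] for k in predicted_class_indices]

print(predictions)  # ['bird', 'airplane']
```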

In [None]:
## CHANGE HERE
TEAM_ID = "_____"
SUBMISSION_TYPE = "cifar10_test_set"

# Preparing results into dataframe
results=pd.DataFrame({"filename":test_generator.filenames,
                      "prediction":predictions})

# Output a CSV (optional)
# results.to_csv('results.csv', index = None)

# Preparing submission payload
submission = {
    "team_id": TEAM_ID,
    "submission_type": SUBMISSION_TYPE
}
submission['predictions'] = json.loads(results.to_json(orient='records'))

In [None]:
# Take a look at our submission
json.dumps(submission)
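
For reference, the payload has roughly the shape sketched below. The file names and the team id `example-team` are hypothetical placeholders, and the list of records is built with plain Python here to mirror the one-dictionary-per-row layout that `to_json(orient='records')` produces.

```python
import json

# Hypothetical file names and predictions, for illustration only
filenames = ["test/0001.png", "test/0002.png"]
predictions = ["horse", "airplane"]

# One record per image, with the same keys as the dataframe columns
records = [{"filename": f, "prediction": p}
           for f, p in zip(filenames, predictions)]

submission = {
    "team_id": "example-team",            # placeholder, not a real team id
    "submission_type": "cifar10_test_set",
    "predictions": records,
}

payload = json.dumps(submission, indent=2)
print(payload)
```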

In [None]:
## Function for submission
def submit_result(submission):
    import requests
    headers = {'content-type' : 'application/json'}
    url = "https://yfpki7bqa9.execute-api.us-east-1.amazonaws.com/default/submit"
    res = requests.post(url, data=json.dumps(submission), headers=headers)
    return res.json()

## Calling the function to submit
submit_result(submission)