# 2. Data Exploration and Preprocessing

Note: The detailed code for this part is in *Data Exploration and Preprocessing.ipynb*.

## 2-1 Data Description

The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. We used the [CIFAR-10 python version](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz). The training set contains 50,000 images and the test set contains 10,000.

The original CIFAR-10 training data is split into five batch files, each containing 10,000 images. The test data is a single file containing 10,000 images. We use the functions in our script **load_data_helper_functions.py** to load the images and labels of both the training and test data.
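As an illustration of this loading step, here is a minimal sketch that reads the pickled batch files of the python version and stacks them into one array. The function names `load_batch` and `load_training_set` are hypothetical and are not taken from **load_data_helper_functions.py**; the sketch assumes the archive has been extracted so that the files `data_batch_1` to `data_batch_5` sit in one directory.

```python
import os
import pickle

import numpy as np


def load_batch(path):
    """Read one CIFAR-10 batch file and return (images, labels)."""
    with open(path, "rb") as f:
        # The batches were pickled with Python 2, so keys come back as bytes.
        batch = pickle.load(f, encoding="bytes")
    images = np.asarray(batch[b"data"], dtype=np.uint8)
    # Store labels as a column vector, one integer class index per image.
    labels = np.asarray(batch[b"labels"], dtype=np.int64).reshape(-1, 1)
    return images, labels


def load_training_set(data_dir):
    """Concatenate the five training batches found in data_dir."""
    parts = [
        load_batch(os.path.join(data_dir, f"data_batch_{i}"))
        for i in range(1, 6)
    ]
    images = np.concatenate([p[0] for p in parts], axis=0)
    labels = np.concatenate([p[1] for p in parts], axis=0)
    return images, labels
```

The test file `test_batch` has the same layout, so `load_batch` alone should be enough to read it.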

The training set we obtain is a NumPy ndarray of shape (50,000, 3072), and the test set is a NumPy ndarray of shape (10,000, 3072). Each row of the array stores one 32x32 colour image: the first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. Within each channel the image is stored in row-major order, so the first 32 entries of a row are the red channel values of the first row of the image.
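To make this layout concrete, the following small sketch computes where a given pixel value sits inside a 3072-entry row. The helper `flat_index` is a hypothetical name introduced only for illustration.

```python
def flat_index(channel, row, col, size=32):
    """Position of one pixel value inside a flattened CIFAR-10 row.

    channel: 0 = red, 1 = green, 2 = blue
    row, col: pixel coordinates inside the size x size image
    """
    # Each channel occupies a block of size*size entries,
    # and inside a block the pixels are laid out row by row.
    return channel * size * size + row * size + col
```

For example, `flat_index(1, 0, 0)` is 1024, the first entry of the green block, and `flat_index(0, 1, 0)` is 32, the start of the second image row in the red block.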

The labels for the training and test sets are NumPy arrays of shape (50,000, 1) and (10,000, 1), respectively. At this stage they are integer class indices, not yet one-hot encoded.

## 2-2 Data Exploration

Here we reshape each row into a (32, 32, 3) NumPy array, in which each innermost array is one pixel with three channel values: red, green and blue. The reshaped training data has shape (50,000, 32, 32, 3) and the reshaped test data has shape (10,000, 32, 32, 3).
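One way to perform this reshape is sketched below. Because each row stores the three channels one after another, reshaping directly to (32, 32, 3) would mix the channels up; the sketch therefore reshapes to (3, 32, 32) first and then moves the channel axis to the end. The function name `rows_to_images` is hypothetical and may differ from what our notebook uses.

```python
import numpy as np


def rows_to_images(flat):
    """Convert an (N, 3072) array of flattened images to (N, 32, 32, 3)."""
    flat = np.asarray(flat)
    # Split each row into its three channel blocks of 32 x 32 values ...
    channels_first = flat.reshape(-1, 3, 32, 32)
    # ... then reorder the axes to (image, row, column, channel).
    return channels_first.transpose(0, 2, 3, 1)
```

The result can be passed to an image plotting routine that expects the channel axis last.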

We then plot the first 10 images in the training set together with their true class labels to get a better feel for the dataset. The images are plotted using functions in our script **preprocess_data.py**.

#### The first 10 images in training set

<img style="float: left;" src="../figs/first10.png">


## 2-3 Data Preprocessing

To prepare the data for training CNN models, we take the following steps:

First, we convert the image labels to a one-hot encoding.
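A minimal NumPy sketch of this conversion, assuming the labels are integer class indices from 0 to 9 stored in an (N, 1) array; the function name `one_hot` is hypothetical.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Turn integer class labels of shape (N, 1) or (N,) into (N, num_classes)."""
    labels = np.asarray(labels).reshape(-1)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[labels]
```

Applying `np.argmax(..., axis=1)` to the result recovers the original class indices.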

Next, we enlarge the training dataset by adding randomly distorted images that are cropped, horizontally flipped, or adjusted in hue, contrast and saturation. Distorting images in this way exposes the model to more variation during training, which should help the trained CNN generalize better to the test dataset. We took this idea for data preprocessing from [Magnus Erik Hvass Pedersen](http://www.hvass-labs.org/).
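The sketch below illustrates this kind of distortion in plain NumPy. It is not the code from our notebook: the function names, the crop size of 24 and the ranges of the random factors are all assumptions chosen for illustration, and the hue adjustment is left out to keep the example short. Images are assumed to be float arrays with values in [0, 1].

```python
import numpy as np


def random_distort(image, rng, crop_size=24):
    """Return one randomly distorted copy of an (H, W, 3) image in [0, 1]."""
    h, w, _ = image.shape
    # Random crop: pick the top-left corner uniformly at random.
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    out = image[top:top + crop_size, left:left + crop_size, :].astype(np.float32)
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Contrast: scale the deviation from the per-channel mean (assumed range).
    contrast = rng.uniform(0.3, 1.0)
    mean = out.mean(axis=(0, 1), keepdims=True)
    out = (out - mean) * contrast + mean
    # Saturation: scale the deviation from the per-pixel gray value (assumed range).
    saturation = rng.uniform(0.0, 2.0)
    gray = out.mean(axis=2, keepdims=True)
    out = gray + (out - gray) * saturation
    return np.clip(out, 0.0, 1.0)


def augment(images, copies, seed=0, crop_size=24):
    """Create `copies` distorted versions of every image in an (N, H, W, 3) array."""
    rng = np.random.default_rng(seed)
    distorted = [
        random_distort(img, rng, crop_size)
        for img in images
        for _ in range(copies)
    ]
    return np.stack(distorted)
```

When using such a function, the one-hot labels need to be repeated in the same order, for example with `np.repeat(labels, copies, axis=0)`.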

Last, the test images are only cropped around the center, with no other adjustment. The crop size is the same as the one used for the training set.
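A center crop of this kind can be sketched as follows. The function name `center_crop` and the default crop size of 24 are assumptions for illustration; the actual crop size should match whatever is used for the training images.

```python
import numpy as np


def center_crop(images, crop_size=24):
    """Crop an (N, H, W, 3) array to (N, crop_size, crop_size, 3) around the center."""
    images = np.asarray(images)
    h, w = images.shape[1:3]
    # Offsets that leave an equal margin on both sides of the crop window.
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    return images[:, top:top + crop_size, left:left + crop_size, :]
```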

#### Plot the distorted image

Here are ten examples of the 321st image in the test dataset after preprocessing.

<img style="float: left;" src="../figs/distorted.png">

As we can see, the distorted images are either flipped or adjusted in some way that makes them differ from the original image. Images like these will later be used to train the CNN model.