# Experimenting with CIFAR-10

From the [website](https://www.cs.toronto.edu/~kriz/cifar.html): "The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images."

Let's try some stuff out. 

1. Dimensionality reduction (PCA)

2. K-means clustering (perhaps in different dimensions)

3. Simple least squares regression with one-hot output vectors

4. Simple feedforward NN (Dropout, L2 Regularization)

5. A convolutional neural network. I expect this to have the best results.

## Reading the Data

This was blissfully easy. 

In [2]:
#!/usr/local/bin/python

#Script source: 
#https://www.cs.toronto.edu/~kriz/cifar.html

#useful sources
#http://parneetk.github.io/blog/cnn-cifar10/

import pickle
import numpy as np 

def unpickle(file):
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes') #keys come back as bytes, e.g. b'data'
    return batch

In [3]:
batch_1 = unpickle('cifar-10-batches-py/data_batch_1')
batch_2 = unpickle('cifar-10-batches-py/data_batch_2')
batch_3 = unpickle('cifar-10-batches-py/data_batch_3')
batch_4 = unpickle('cifar-10-batches-py/data_batch_4')
batch_5 = unpickle('cifar-10-batches-py/data_batch_5')
test_batch = unpickle('cifar-10-batches-py/test_batch')

meta_data = unpickle('cifar-10-batches-py/batches.meta')
meta_data

{b'label_names': [b'airplane',
  b'automobile',
  b'bird',
  b'cat',
  b'deer',
  b'dog',
  b'frog',
  b'horse',
  b'ship',
  b'truck'],
 b'num_cases_per_batch': 10000,
 b'num_vis': 3072}

## Data Preprocessing

In [6]:
train_batches = [batch_1, batch_2, batch_3, batch_4, batch_5]

all_training_features = np.vstack((batch_1[b'data'], batch_2[b'data'], 
                               batch_3[b'data'], batch_4[b'data'], batch_5[b'data']))
all_training_labels = np.hstack((batch_1[b'labels'], batch_2[b'labels'], 
                               batch_3[b'labels'], batch_4[b'labels'], batch_5[b'labels']))

test_features = test_batch[b'data']
test_labels = test_batch[b'labels']

'''Scale the image pixel values from [0, 255] to [0, 1].'''

test_features = test_features/255
all_training_features = all_training_features/255
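
Dividing by 255 rescales the pixels to [0, 1]. Standardization in the statistical sense would also center each feature and divide by its standard deviation. The sketch below contrasts the two on random stand-in data; the array `pixels` and the epsilon value are assumptions for illustration, not part of the notebook.

```python
import numpy as np

# Stand-in for a small block of uint8 pixel rows (hypothetical data, not CIFAR-10).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(100, 3072), dtype=np.uint8)

# Rescaling, as in the cell above: values land in [0, 1].
scaled = pixels / 255

# Standardization: subtract the per-feature mean, divide by the per-feature std.
mean = scaled.mean(axis=0)
std = scaled.std(axis=0)
standardized = (scaled - mean) / (std + 1e-8)  # small epsilon guards against zero std

print(scaled.min(), scaled.max())
print(standardized.mean(), standardized.std())
```

If standardization is used, the mean and standard deviation should be computed on the training set only and then reused for the test set.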

In [7]:
'''Some useful functions.'''

label_names = meta_data[b'label_names']

'''Returns the fraction of predictions that match the true labels.
Predictions are rounded to the nearest integer first, so real-valued
regression outputs are accepted; true_labels must contain integers.'''
def accuracy(prediction, true_labels): 
    assert len(prediction) == len(true_labels), 'Mismatched prediction and label set'
    prediction = np.int_(np.rint(np.array(prediction))) #round to nearest integer and cast to integer type

    return float(np.mean(prediction == np.array(true_labels))) #fraction of exact matches

'''Takes a label value (an integer between 0 and 9, as stored in the dataset) and
returns the corresponding class name. Example: 3 --> cat '''
def number_to_name(num): 
    assert isinstance(num, (int, np.integer)), '{} is not an integer'.format(num)
    assert 0 <= num <= 9, '{} is not between 0 and 9'.format(num)
    return label_names[num].decode('utf-8')

In [13]:
batch_1[b'data'][0]

array([ 59,  43,  50, ..., 140,  84,  72], dtype=uint8)
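
Each row is a flat vector of 3072 values. According to the dataset page, the first 1024 entries are the red channel, the next 1024 green, and the last 1024 blue, each stored row by row. A minimal sketch of turning one row back into an image array follows; `row_to_image` and `demo_row` are hypothetical names, and the stand-in row is there only so the snippet runs without the dataset.

```python
import numpy as np

def row_to_image(row):
    """Convert one flat 3072-value row into a 32x32x3 (height, width, channel) array."""
    row = np.asarray(row)
    assert row.shape == (3072,), 'expected a flat row of 3072 values'
    # Layout: 1024 red, then 1024 green, then 1024 blue, each channel row-major.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

# Stand-in row; in the notebook this would be batch_1[b'data'][0].
demo_row = (np.arange(3072) % 256).astype(np.uint8)
demo_image = row_to_image(demo_row)
print(demo_image.shape)
```

The resulting array can be passed to `plt.imshow` to view the picture.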

# Least Squares

The most naive approach I know of is to model the problem as a least-squares optimization problem. Let $X$ be the feature matrix, where each row has 3072 entries corresponding to pixel values. Let $y$ be the label vector, where each entry is the label for the corresponding row of $X$. Then the goal is to find a **weight vector** $w$ such that $$Xw \approx y$$

When $X^T X$ is invertible, the analytic solution to least squares is 

$$w = (X^T X)^{-1} X^{T}y$$
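
As a sanity check on the formula, here is a small sketch comparing the normal-equation solution with `np.linalg.lstsq` on synthetic data. The matrix `X`, the vector `true_w`, and the noise level are made up for illustration and have nothing to do with CIFAR-10.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # 200 samples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])    # hypothetical ground-truth weights
y = X @ true_w + 0.01 * rng.normal(size=200)     # targets with a little noise

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver; also returns residuals, rank and singular values.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

In practice `np.linalg.lstsq` is the safer choice, since it still returns a minimum-norm solution when $X^T X$ is singular or badly conditioned.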

## Implementing on One Batch

To start, I'll implement the solution on one batch of data and see what happens.

In [54]:
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline

In [35]:
feature_matrix = batch_1[b'data']
label_vector = batch_1[b'labels']

In [36]:
#takes about 10 seconds to run

weight = np.linalg.lstsq(feature_matrix, label_vector)

In [43]:
weight_vector = weight[0]
residuals = weight[1]

In [119]:
prediction = np.rint(feature_matrix @ weight_vector)

In [120]:
accuracy(prediction, label_vector)

0.1516

Oof. We get an accuracy of 15.2%, and that's on the training set! I'll bet it's much worse on a fresh batch.

Let's use the same weight vector as before, but a new batch of image features and labels as a validation dataset.

In [121]:
feature_matrix_2 = batch_2[b'data']
labels_2 = batch_2[b'labels']

prediction_2 = np.rint(feature_matrix_2 @ weight_vector)

accuracy(prediction_2, labels_2)

0.1068

As expected, the accuracy is even lower: 10.7%, roughly what random guessing over ten classes would give.

## Training on all 5 Batches

Let's try training on all 5 batches. More data shouldn't make overfitting worse, and the model is still the same linear fit, so I don't expect much, but for sheer curiosity it's worth trying out.

In [28]:
weight_all_batches = np.linalg.lstsq(all_training_features, all_training_labels)

In [32]:
weight_vector_all_batches = weight_all_batches[0]

all_batch_prediction = all_training_features @ weight_vector_all_batches
accuracy(all_batch_prediction, all_training_labels)

0.11742

Okay, so a training accuracy of 11.7% as opposed to the earlier 15.2% for one batch. What about test accuracy?

In [33]:
all_batch_test_prediction = test_features @ weight_vector_all_batches
accuracy(all_batch_test_prediction, test_labels)

0.1148

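The test accuracy is 11.5%, close to the training accuracy and still near chance. Regressing on the integer label treats class indices as ordered quantities, which they are not, so a weak result is unsurprising. Nothing too exciting here.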
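
The outline at the top lists least squares with one-hot output vectors, while the cells above regress on the integer label directly. Below is a minimal sketch of the one-hot variant, run on synthetic blobs so that it is self-contained. The helper names (`one_hot`, `fit_one_hot_least_squares`, `predict`) and the toy data are assumptions for illustration. I have not run this on CIFAR-10, so no accuracy is claimed for the real data.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer labels 0..num_classes-1 as rows of a 0/1 matrix."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

def add_bias(features):
    """Append a column of ones so the fit includes an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_one_hot_least_squares(features, labels, num_classes):
    """Solve one least-squares problem per class column at once."""
    targets = one_hot(labels, num_classes)
    weights, *_ = np.linalg.lstsq(add_bias(features), targets, rcond=None)
    return weights

def predict(features, weights):
    """Pick the class whose regression output is largest."""
    return np.argmax(add_bias(features) @ weights, axis=1)

# Toy data: three well-separated Gaussian blobs in 2D (not CIFAR-10).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 5.0], [5.0, -5.0], [-5.0, -5.0]])
toy_labels = rng.integers(0, 3, size=300)
toy_features = centers[toy_labels] + rng.normal(size=(300, 2))

toy_weights = fit_one_hot_least_squares(toy_features, toy_labels, 3)
toy_accuracy = np.mean(predict(toy_features, toy_weights) == toy_labels)
print(toy_accuracy)
```

On the notebook's data the call would look like `fit_one_hot_least_squares(all_training_features, all_training_labels, 10)`, with `predict` applied to `test_features`. Taking the argmax over ten outputs avoids rounding a single real number to a class index.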

# Neural Nets

Let's bring out the big guns.