# Experimenting with CIFAR-10

From the [website](https://www.cs.toronto.edu/~kriz/cifar.html): "The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images."

Let's try some stuff out. 

1. Dimensionality reduction (PCA)

2. K-means clustering (perhaps in different dimensions)

3. Simple feedforward NN (Dropout, L2 Regularization)

4. Simple least squares regression with one-hot output vectors

## Reading the Data

This was blissfully easy. 

In [5]:
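#!/usr/local/bin/python

# Script source:
# https://www.cs.toronto.edu/~kriz/cifar.html

import pickle
import numpy as np

def unpickle(file):
    # Load one pickled CIFAR-10 file; the keys come back as bytes (e.g. b'data').
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')  # avoid shadowing the built-in dict
    return batch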

In [2]:
batch_1 = unpickle('cifar-10-batches-py/data_batch_1')
batch_2 = unpickle('cifar-10-batches-py/data_batch_2')
batch_3 = unpickle('cifar-10-batches-py/data_batch_3')
batch_4 = unpickle('cifar-10-batches-py/data_batch_4')
batch_5 = unpickle('cifar-10-batches-py/data_batch_5')
test_batch = unpickle('cifar-10-batches-py/test_batch')

meta_data = unpickle('cifar-10-batches-py/batches.meta')
meta_data

{b'label_names': [b'airplane',
  b'automobile',
  b'bird',
  b'cat',
  b'deer',
  b'dog',
  b'frog',
  b'horse',
  b'ship',
  b'truck'],
 b'num_cases_per_batch': 10000,
 b'num_vis': 3072}
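
The `num_vis` value of 3072 is 32 x 32 x 3: per the dataset website, each row stores the 1024 red values first, then the 1024 green, then the 1024 blue, each channel in row-major order. A minimal sketch of turning one row back into an image array follows. The helper name `row_to_image` and the synthetic `fake_row` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def row_to_image(row):
    """Turn one 3072-entry CIFAR-10 row into a (height, width, channel) array."""
    row = np.asarray(row)
    assert row.shape == (3072,), 'expected a flat row of 3072 values'
    # The row holds 1024 red values, then 1024 green, then 1024 blue,
    # so reshape to (channel, height, width) and move channels last.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

# Synthetic stand-in for a real row such as batch_1[b'data'][0].
fake_row = (np.arange(3072) % 256).astype(np.uint8)
image = row_to_image(fake_row)
print(image.shape)  # (32, 32, 3)
```

With the real data, `plt.imshow(row_to_image(batch_1[b'data'][0]))` should display the first training image, assuming matplotlib is imported as `plt`.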

In [3]:
# Each batch is a dict; the keys used here are
# b'data' (pixel rows) and b'labels' (class indices).
train_batches = [batch_1, batch_2, batch_3, batch_4, batch_5]

In [9]:
'''Some useful functions.'''

label_names = meta_data[b'label_names']

def accuracy(prediction, true_labels):
    '''Returns the accuracy rate of a prediction against the true labels.
    true_labels must contain integers; predictions are rounded to the
    nearest integer before comparison.'''
    assert len(prediction) == len(true_labels), 'Mismatched prediction and label set'
    prediction = np.int_(np.rint(np.array(prediction))) #round to nearest integer and cast to integer type

    num_accurate = 0
    for i in range(len(prediction)):
        if prediction[i] == true_labels[i]:
            num_accurate += 1
    return num_accurate / len(prediction)

def number_to_name(num):
    '''Takes a label value (an integer between 0 and 9, as stored in b'labels')
    and returns the corresponding class name. Example: 2 --> bird'''
    assert isinstance(num, (int, np.integer)), '{} is not an integer'.format(num)
    assert 0 <= num <= 9, '{} is not between 0 and 9'.format(num)
    return label_names[num].decode('utf-8')

def join_batches(*batches):
    '''Stacks the features and labels of several batches into two arrays.'''
    # Stack everything in one call; vstack cannot grow from an empty 1-D array.
    features = np.vstack([batch[b'data'] for batch in batches])
    labels = np.concatenate([batch[b'labels'] for batch in batches])
    return features, labels

np.vstack((batch_1[b'data'], batch_2[b'data']))

array([[ 59,  43,  50, ..., 140,  84,  72],
       [154, 126, 105, ..., 139, 142, 144],
       [255, 253, 253, ...,  83,  83,  84],
       ..., 
       [127, 139, 155, ..., 197, 192, 191],
       [190, 200, 208, ..., 163, 182, 192],
       [177, 174, 182, ..., 119, 127, 136]], dtype=uint8)

# Least Squares

The most naive approach I know of is to model the problem as a least-squares optimization problem. Let $X$ be the feature matrix, where each row has 3072 entries corresponding to pixel values. Let $y$ be the label vector, where each entry is the label for the corresponding row of $X$. Then the goal is to find a **weight vector** $w$ such that $$Xw \approx y$$

When $X^T X$ is invertible, the analytic solution to least squares is

$$w = (X^T X)^{-1} X^{T}y$$
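
To make the formula concrete, here is a minimal sketch on a small synthetic problem (the sizes, the seed, and `w_true` are made up for illustration and are not CIFAR data). It solves the normal equations $(X^T X) w = X^T y$ directly and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic problem: 200 samples, 5 features (hypothetical sizes).
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=200)

# Normal equations: solve (X^T X) w = X^T y instead of forming the inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library solver, which also handles rank-deficient X.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

On a well-conditioned problem like this one the two answers agree. For raw pixel data, where neighbouring columns are strongly correlated, `np.linalg.lstsq` is likely the safer choice because it does not require $X^T X$ to be invertible.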

## Implementing on One Batch

To start, I'll implement the solution on one batch of data and see what happens.

In [54]:
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline

In [35]:
feature_matrix = batch_1[b'data']
label_vector = batch_1[b'labels']

In [36]:
#takes about 10 seconds to run

weight = np.linalg.lstsq(feature_matrix, label_vector)

In [43]:
weight_vector = weight[0]
residuals = weight[1]

In [119]:
prediction = np.rint(feature_matrix @ weight_vector)

In [120]:
accuracy(prediction, label_vector)

0.1516

Oof. We get an accuracy of 15%, and that's on the training set! I'll bet it's much worse on a fresh batch.

Let's use the same weight vector as before, but a new batch of image features and labels as a validation dataset.

In [121]:
feature_matrix_2 = batch_2[b'data']
labels_2 = batch_2[b'labels']

prediction_2 = np.rint(feature_matrix_2 @ weight_vector)

accuracy(prediction_2, labels_2)

0.1068

As expected, the accuracy is even lower - a poor showing at 10.7%, which is about what random guessing achieves with ten balanced classes.
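
One likely reason for the poor fit is that regressing onto the integer label treats class indices as ordered quantities, so "truck" ends up numerically far from "airplane" for no good reason. The one-hot variant listed in the plan at the top avoids this by fitting one weight column per class and predicting the class with the highest score. A minimal sketch, checked only on a tiny synthetic dataset; the function names `fit_one_hot_least_squares` and `predict_classes` are hypothetical and not from the original notebook.

```python
import numpy as np

def fit_one_hot_least_squares(features, labels, num_classes=10):
    """Fit one weight column per class by regressing onto one-hot targets."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    targets = np.eye(num_classes)[labels]  # shape (n_samples, num_classes)
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights  # shape (n_features, num_classes)

def predict_classes(features, weights):
    """Predict the class whose output column scores highest."""
    scores = np.asarray(features, dtype=np.float64) @ weights
    return np.argmax(scores, axis=1)

# Tiny synthetic check (not CIFAR data): 3 well-separated classes.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(3)
toy_labels = np.repeat(np.arange(3), 50)
toy_features = centers[toy_labels] + 0.1 * rng.normal(size=(150, 3))
toy_weights = fit_one_hot_least_squares(toy_features, toy_labels, num_classes=3)
toy_accuracy = np.mean(predict_classes(toy_features, toy_labels * 0 + toy_weights.shape[1] - 3 + toy_weights) == toy_labels) if False else np.mean(predict_classes(toy_features, toy_weights) == toy_labels)
print(toy_accuracy)
```

On the real data the call would presumably look like `fit_one_hot_least_squares(batch_1[b'data'], batch_1[b'labels'])`, with the weights then evaluated on `batch_2`. How much this improves on the numbers above is not something this sketch establishes.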

## Training on all 5 Batches

CIFAR-10 actually contains 5 batches of training data with 10000 images each.
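
Training on all five batches would start by stacking them into one feature matrix and one label vector. The sketch below shows one way to do that and then fit least squares on the result. It runs on small fake batches so it is self-contained; `stack_batches` and `make_fake_batch` are hypothetical helpers, and the fake data only mimics the keys and dtypes of the real batch dictionaries.

```python
import numpy as np

def stack_batches(*batches):
    """Stack the b'data' rows and b'labels' lists of several batch dicts."""
    features = np.vstack([batch[b'data'] for batch in batches])
    labels = np.concatenate([np.asarray(batch[b'labels']) for batch in batches])
    return features, labels

def make_fake_batch(rng, num_rows=20, num_features=3072):
    """Hypothetical stand-in with the same keys and dtypes as a CIFAR-10 batch."""
    return {
        b'data': rng.integers(0, 256, size=(num_rows, num_features), dtype=np.uint8),
        b'labels': rng.integers(0, 10, size=num_rows).tolist(),
    }

rng = np.random.default_rng(0)
fake_batches = [make_fake_batch(rng) for _ in range(5)]
all_features, all_labels = stack_batches(*fake_batches)
print(all_features.shape, all_labels.shape)  # (100, 3072) (100,)

# Fit on everything at once, casting the uint8 pixels to float first.
all_weights, *_ = np.linalg.lstsq(all_features.astype(np.float64), all_labels, rcond=None)
```

With the real data the call would be `stack_batches(*train_batches)`, giving a 50000 x 3072 matrix. As float64 that is roughly 1.2 GB, so the fit should take noticeably longer than the single-batch run.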

In [122]:
print('hi')

hi


In [123]:
test = np.int_(np.rint(np.array([1, 2, 3])))

In [108]:
test[0]

1

In [126]:
def do_stuff(*args):
    for arg in args: 
        print(arg)

In [127]:
do_stuff(1, 2, 3, 4)

1
2
3
4


In [130]:
a = np.random.randn(5)
b = np.random.randn(5)
c = np.random.randn(5)
np.vstack((a, b, c))

array([[-0.80700818, -1.5456516 , -0.71027861,  1.35079221, -0.11563599],
       [-1.21846609, -0.20569536,  0.39768117,  0.34709642,  0.25687026],
       [ 0.31753043,  1.5201755 ,  0.1909019 , -0.13671277,  0.83897556]])