# cs7324 Lab 2 - Exploring Image Data

## 1. Business Rationale

"The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton (http://www.cs.toronto.edu/~kriz/cifar.html)."

The CIFAR-10 dataset was selected for this lab. The full dataset is a collection of 60,000 color images across 10 classes, each 32x32 pixels. It is divided into five training batches and one test batch, each containing 10,000 images. Because each image is in color, the data in each batch consists of "a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image" (https://www.cs.toronto.edu/~kriz/cifar.html).

To meet the requirements of the lab, only one 10,000-image training batch is used.

The data was originally collected for machine learning purposes by the Canadian Institute for Advanced Research. Many research papers have applied a variety of machine learning techniques to this dataset (https://en.wikipedia.org/wiki/CIFAR-10). While the dataset is aimed at academia and research, a business case can be made for using datasets like it to build image recognition software for web search engines, helping them attract users and sell advertisements. Specifically, the CIFAR-10 dataset can be used to train a computer to recognize images of airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks (https://www.cs.toronto.edu/~kriz/cifar.html).



## 2. Data Understanding

This data is stored in Python's pickle format, so it must be unpickled before use. Below is a function to do that.

In [63]:
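# Source http://www.cs.toronto.edu/~kriz/cifar.html
import pickle

import numpy as np


def unpickle(file):
    """Load one CIFAR-10 batch file and return its contents as a dictionary."""
    with open(file, 'rb') as fo:
        # encoding='bytes' makes the dictionary keys come back as bytes, e.g. b'data'
        batch = pickle.load(fo, encoding='bytes')
    return batch

# `ds` holds the unpickled batch dictionary used in the cells below
ds = unpickle(r"C:\Users\Chip\source\repos\cs7324_code\Lab 2\cifar-10-batches-py\data_batch_1")
print(ds.keys())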

dict_keys([b'batch_label', b'labels', b'data', b'filenames'])


Now we'll prep the data for PCA and RPCA analysis by assigning the relevant portions of the data to variables.

In [66]:
# Source: In class lecture and flipped assignment
X = ds[b'data'] # Assign feature vectors to 'X'
y = np.array(ds[b'labels']) # Assign target values to y, we'll be trying to predict these later on

# The labels start as a 1D array; reshape them into a column vector of y values
y = y.reshape(-1, 1)

n_samples, n_features = X.shape

# print(np.sum(~np.isfinite(X)))
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))



n_samples: 10000
n_features: 3072


Each row holds the red, green, and blue values for one picture, and a 32x32 image has 1,024 pixels, so each image has 3 x 1,024 = 3,072 features. The output above confirms that this batch contains 10,000 images with 3,072 features each.
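To make the row layout concrete, here is a minimal sketch of turning one 3,072-value row back into a 32x32 color image, following the channel order quoted from the dataset page in Section 1. The helper name `row_to_image` and the synthetic `demo_row` are illustrative assumptions, not part of the lab code; with the real data you would pass a row such as `X[0]`.

```python
import numpy as np


def row_to_image(row):
    """Convert one 3072-value CIFAR-10 row into a (32, 32, 3) height x width x channel array."""
    row = np.asarray(row)
    # The row is channel-major: 1024 red values, then 1024 green, then 1024 blue,
    # and each channel is stored row by row (row-major order).
    channels = row.reshape(3, 32, 32)
    # Move the channel axis to the end, the layout most plotting libraries expect.
    return channels.transpose(1, 2, 0)


# Synthetic stand-in for a real row: an all-red image with one green value set
demo_row = np.zeros(3072, dtype=np.uint8)
demo_row[:1024] = 255        # fill the red channel
demo_row[1024 + 1] = 200     # green channel, first image row, second column
demo_image = row_to_image(demo_row)
print(demo_image.shape)      # (32, 32, 3)
```

The resulting array can be handed to an image display function that accepts height x width x channel data, which is a quick way to check that the labels and pictures line up.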

## 3. PCA

We'll now try to reduce the number of features by performing PCA on this dataset.

**How many features should we reduce to?**
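One common way to approach the question above is to look at the cumulative explained variance and keep the smallest number of components that reaches a chosen threshold. The sketch below does this with plain NumPy on synthetic data; the function name `components_for_variance`, the 95% threshold, and the `X_demo` array are all assumptions made for illustration, not values from the lab. After fitting scikit-learn's `PCA`, the `explained_variance_ratio_` attribute provides the same per-component ratios for the components that were kept.

```python
import numpy as np


def components_for_variance(X, target=0.95):
    """Return the smallest number of principal components whose cumulative
    explained variance ratio reaches `target`, along with the cumulative curve."""
    X = np.asarray(X, dtype=np.float64)
    X_centered = X - X.mean(axis=0)
    # The squared singular values of the centered data are proportional to
    # the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_variance = singular_values ** 2
    cumulative = np.cumsum(explained_variance / explained_variance.sum())
    # First position where the cumulative ratio reaches the target (1-based count)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative)), cumulative


# Synthetic stand-in data: 200 samples, 50 features, driven by 5 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

k, cumulative = components_for_variance(X_demo, target=0.95)
print("components needed for 95% of the variance:", k)
```

On the image data the same call would be `components_for_variance(X)`, although a full SVD of a 10000x3072 array is noticeably slower than on this toy example. The threshold is a judgment call: a higher value keeps more detail at the cost of more features.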

In [None]:
# Source: In class lecture

# Use PCA to go from 3072 features down to 300 components
from sklearn.decomposition import PCA

n_components = 300
print("Extracting the top %d principal components from %d images" % (
    n_components, X.shape[0]))

pca = PCA(n_components=n_components)
%time pca.fit(X.copy())

# Each component has 3072 values laid out like an image row: 3 channels of h x w pixels
h, w = 32, 32
eigenfaces = pca.components_.reshape((n_components, 3, h, w))