#CIFAR-10 Dataset Description

CIFAR-10 is a widely used dataset for machine learning research, created by A. Krizhevsky et al.

It consists of 60,000 color images of size 32x32 in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with 50,000 training images and 10,000 testing images.

Each class has 6,000 images. The classes in a CIFAR-10 dataset are mutually exclusive.

At a glance:

Number of classes: 10

Size of image: 32 x 32 x 3

Note: In this course, we use only a subset of this dataset because of memory constraints on the online cloud platform. The upcoming cards explain how the subset is generated.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/223/478/large/705ec5e1e9e4956563c19f0aca33fdc39e269a48/cifar_10_1.jpeg)


https://www.cs.toronto.edu/~kriz/cifar.html

Sample dataset for the code below (extract the archive so that the cifar-10-batches-py folder sits next to the script):
http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

```
#Data Loading
import os
import pickle

import numpy as np


def _load_cifar10_batch(file):
    # The batches were pickled with Python 2; encoding='latin1' lets Python 3
    # read them and keeps the dictionary keys as plain strings.
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='latin1')
    # Each row stores 1024 red, then 1024 green, then 1024 blue values, so the
    # data is reshaped to (N, 3, 32, 32) and the channel axis is moved last
    # to obtain 32 x 32 x 3 images.
    data = batch['data'].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return data, batch['labels']


print('Loading...')
batch_fns = [os.path.join("./", 'cifar-10-batches-py', 'data_batch_' + str(i)) for i in range(1, 6)]
data_batches = [_load_cifar10_batch(fn) for fn in batch_fns]

#Data Stacking
data_all = np.vstack([batch[0] for batch in data_batches]).astype('float')
labels_all = np.vstack([batch[1] for batch in data_batches]).flatten()
```

#Subset Generation
As explained in the dataset description, we use only a subset of the CIFAR-10 dataset.

The training set of 50,000 samples is split in the ratio 92:8, and only the smaller portion is kept (i.e., the 8% part, which contains 4,000 images).

These 4,000 samples are used for generating the train and test sets for classification.

Here, StratifiedShuffleSplit is used to split the dataset. It shuffles the samples and preserves the class proportions in both parts; because the CIFAR-10 classes are balanced, each part receives an equal number of samples from every class.

```
#Splitting the whole training set into 92:8
seed = 7
from sklearn.model_selection import StratifiedShuffleSplit
# creating data_split object with a single split and 8% test size
data_split = StratifiedShuffleSplit(n_splits=1, test_size=0.08, random_state=seed)
for train_index, test_index in data_split.split(data_all, labels_all):
    split_data_92, split_data_8 = data_all[train_index], data_all[test_index]
    split_label_92, split_label_8 = labels_all[train_index], labels_all[test_index]
```

The 4,000 samples are then split in the ratio 7:3 (i.e., 2,800 for training and 1,200 for testing), again using StratifiedShuffleSplit.

```
#Splitting the 4000 samples into 70 and 30
train_test_split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=seed) #test_size=0.3 denotes that 30 % of the dataset is used for testing.
for train_index, test_index in train_test_split.split(split_data_8, split_label_8):
    train_data_70, test_data_30 = split_data_8[train_index], split_data_8[test_index]
    train_label_70, test_label_30 = split_label_8[train_index], split_label_8[test_index]
train_data = train_data_70 #assigning to variable train_data
train_labels = train_label_70 #assigning to variable train_labels
test_data = test_data_30
test_labels = test_label_30
```
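
To make the effect of a stratified split concrete, here is a minimal NumPy-only sketch. It is not the scikit-learn implementation: the helper `stratified_subset` and the stand-in label array are assumptions made for illustration. It keeps the same fraction of every class and shows that an 8% subset of a balanced 50,000-sample set holds 400 samples per class.

```python
import numpy as np


def stratified_subset(labels, fraction, seed=7):
    """Hypothetical helper: shuffled indices holding `fraction` of every class."""
    rng = np.random.default_rng(seed)
    picked = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)          # all samples of this class
        n_keep = int(round(fraction * cls_idx.size))     # same fraction per class
        picked.append(rng.choice(cls_idx, size=n_keep, replace=False))
    picked = np.concatenate(picked)
    rng.shuffle(picked)
    return picked


# Stand-in labels laid out like the CIFAR-10 training set:
# 10 classes with 5,000 samples each (assumed here, no images are loaded).
labels = np.repeat(np.arange(10), 5000)
subset_idx = stratified_subset(labels, 0.08)
subset_counts = np.bincount(labels[subset_idx], minlength=10)
print(subset_idx.size)   # total number of samples kept
print(subset_counts)     # samples kept per class
```

Applying the same idea with a fraction of 0.3 to the 4,000 kept samples would give 120 test samples and 280 training samples per class.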

#Need for Preprocessing
Data preprocessing converts the raw data into a form suitable for subsequent analysis. All the steps before model training (model creation) can be considered preprocessing.

The quality of an image is greatly influenced by its clarity and the device used to capture it.

The captured image may contain noise and irregularities, which can be removed via preprocessing steps.

Some of the common preprocessing techniques include:

Normalization

Dimensionality reduction (e.g. PCA, SVD)

Feature Extraction (e.g. SIFT, HOG)

Whitening

Denoising

Contrast Stretching

Background subtraction

Image Enhancement

Smoothing

In the following cards, we will describe some of the preprocessing techniques that can be applied to images.

#Normalization
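Normalization is the process of rescaling the pixel intensity values to a standard scale.

Here, each image is standardized: its mean is subtracted from every pixel and the result is divided by its standard deviation. This shifts and rescales the intensities but does not change the shape of their distribution.

A normalized image has mean = 0 and variance = 1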
```
# definition of normalization function
def normalize(data, eps=1e-8):
    # subtracting the mean of each image (axes 1, 2, 3 are height, width, channel)
    data -= data.mean(axis=(1, 2, 3), keepdims=True)
    std = np.sqrt(data.var(axis=(1, 2, 3), ddof=1, keepdims=True)) # calculating standard deviation
    std[std < eps] = 1. # avoiding division by zero for constant images
    data /= std
    return data
# calling the function
train_data = normalize(train_data)
test_data = normalize(test_data)
# prints the shape of train data and test data
print('train_data: ', train_data.shape)
print('test_data: ', test_data.shape)

```
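
The claim that a normalized image has mean 0 and variance 1 can be checked on synthetic data. The sketch below is an assumption-laden illustration: `standardize_per_image` is a hypothetical, non in-place variant of the function above, and the random images stand in for real CIFAR-10 samples. One constant image is included to show why the `eps` guard is needed.

```python
import numpy as np


def standardize_per_image(images, eps=1e-8):
    """Hypothetical non in-place variant of the per-image normalization."""
    images = images.astype('float64')
    mean = images.mean(axis=(1, 2, 3), keepdims=True)
    std = images.std(axis=(1, 2, 3), ddof=1, keepdims=True)
    std[std < eps] = 1.0            # constant images would otherwise divide by zero
    return (images - mean) / std


rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3))   # stand-in images
fake_images[3] = 128                                      # a constant image
normalized = standardize_per_image(fake_images)

per_image_mean = normalized.mean(axis=(1, 2, 3))
per_image_var = normalized.var(axis=(1, 2, 3), ddof=1)
print(per_image_mean)   # close to 0 for every image
print(per_image_var)    # close to 1, except 0 for the constant image
```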

#ZCA Whitening
Normalization is followed by a ZCA whitening process.

The main aim of whitening is to reduce redundancy in the data: after whitening, the features are uncorrelated and all have the same variance.

ZCA stands for zero-phase component analysis. ZCA-whitened images still resemble the original images.
```
# Flattening the images for whitening: after the transpose, each column holds one image
train_data_flat = train_data.reshape(train_data.shape[0], -1).T
test_data_flat = test_data.reshape(test_data.shape[0], -1).T
print('train_data_flat: ', train_data_flat.shape)
print('test_data_flat: ', test_data_flat.shape)
train_data_flat_t = train_data_flat.T
test_data_flat_t = test_data_flat.T
```
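
The code above only flattens the images; it does not show how a whitening matrix is obtained. One common formulation, sketched here under stated assumptions, builds it from the eigendecomposition of the covariance matrix: W = U diag(1 / sqrt(S + eps)) U^T. The function name `zca_whitening_matrix`, the `eps` value, and the toy data are illustrative choices, and the input is assumed to have shape (samples, features), like `train_data_flat_t`.

```python
import numpy as np


def zca_whitening_matrix(X, eps=1e-5):
    """Hypothetical helper. X: (n_samples, n_features) with zero-mean columns."""
    cov = X.T @ X / X.shape[0]                 # feature covariance matrix
    U, S, _ = np.linalg.svd(cov)               # eigenvectors and eigenvalues
    return U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T


rng = np.random.default_rng(0)
mixing = np.tril(np.ones((6, 6)))              # introduces correlation between features
X = rng.normal(size=(500, 6)) @ mixing         # toy data: 500 samples, 6 features
X -= X.mean(axis=0)

cov_before = X.T @ X / X.shape[0]
W = zca_whitening_matrix(X)
X_white = X @ W
cov_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(cov_white, 3))                  # approximately the identity matrix
```

Because W is symmetric, the whitened data stays in the original pixel coordinates, which is why ZCA-whitened images remain recognizable.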

#Principal Component Analysis (PCA)
PCA decomposes a multivariate dataset into a set of successive orthogonal components, each of which explains the maximum possible amount of the remaining variance.

PCA is a dimensionality reduction technique.

The flattened data from the whitening step is given as the input to PCA.
```
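from sklearn.decomposition import PCA
# n_components specifies the number of components to keep.
# train_data_flat has shape (pixels, images), so each pixel position is treated
# as one observation and n_components equals the number of images.
train_data_pca = PCA(n_components=train_data_flat.shape[1]).fit_transform(train_data_flat)
test_data_pca = PCA(n_components=test_data_flat.shape[1]).fit_transform(test_data_flat)
# transposing so that the components come first: shape (n_components, pixels)
train_data_pca = train_data_pca.T
test_data_pca = test_data_pca.T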
```
![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/215/796/large/ae904f45bdf68a7805c76c8a430a2b6dd72a633b/PCA.jpeg)
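
The code above fits one PCA on the training data and a second, independent PCA on the test data. A common alternative, shown here only as a hedged NumPy sketch and not as the course's method, is to learn the components from the training data and reuse them for the test data so that both sets share one coordinate system. The helpers `pca_fit` and `pca_transform` and the random data are hypothetical; samples are assumed to be rows.

```python
import numpy as np


def pca_fit(X, n_components):
    """Hypothetical helper: learn the mean and principal axes from X (samples as rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the orthonormal principal axes, sorted by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def pca_transform(X, mean, components):
    """Project X onto previously learned principal axes."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
train = rng.normal(size=(200, 10))     # stand-in training features
test = rng.normal(size=(50, 10))       # stand-in test features

mean, components = pca_fit(train, n_components=3)
train_pca = pca_transform(train, mean, components)
test_pca = pca_transform(test, mean, components)
print(train_pca.shape, test_pca.shape)
```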


#Singular Value Decomposition (SVD)
SVD is a dimensionality reduction technique that has been used in several fields such as image compression, face recognition, and noise filtering.

In this method, a digital image (generally considered as a matrix) is decomposed into three other matrices.

The singular values obtained from this factorization (far fewer in number than the pixels) preserve useful features of the original image without requiring much storage space in memory.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/215/914/large/9fa5e9b578e75cd27f3cdc7be171d7e2139da7d6/Singular_Value_Decomposition_SVD.jpeg)
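
As a small worked example of the decomposition A = U Σ Vᵀ, the sketch below factorizes a random 32 x 32 matrix (a stand-in for a grayscale image, assumed for illustration) and rebuilds it from only the k largest singular values. The value k = 8 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))                      # stand-in grayscale image

# The three matrices of the decomposition; s holds the singular values.
U, s, Vt = np.linalg.svd(image, full_matrices=False)
reconstructed = U @ np.diag(s) @ Vt               # exact reconstruction

k = 8                                             # number of singular values kept
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]       # rank-k approximation
rel_error = np.linalg.norm(image - approx) / np.linalg.norm(image)

full_storage = image.size                         # 32 * 32 values
rank_k_storage = k * (32 + 32 + 1)                # k columns of U, k rows of Vt, k values
print(full_storage, rank_k_storage, rel_error)
```

The approximation error equals the energy of the discarded singular values, which is why keeping only the largest ones retains most of the image structure.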



#Singular Value Decomposition (SVD)
The SVD code below may not work in the online cloud playground because of package issues, so it is better to try it in a local Python environment.
```
from skimage import color
# definition for SVD
def svdFeatures(input_data):
    svdArray_input_data = []
    size = input_data.shape[0]
    for i in range(0, size):
        img = color.rgb2gray(input_data[i]) # converting the image to grayscale
        U, s, V = np.linalg.svd(img, full_matrices=False)
        S = [s[j] for j in range(30)] # keeping the 30 largest singular values
        svdArray_input_data.append(S)
    # building the feature matrix once, after all images are processed
    svdMatrix_input_data = np.matrix(svdArray_input_data)
    return svdMatrix_input_data
# apply SVD for train and test data
train_data_svd = svdFeatures(train_data)
test_data_svd = svdFeatures(test_data)
```


#Scale-Invariant Feature Transform for Feature Generation (SIFT)
SIFT is mainly used for images that are complex and unstructured.

Even photographs of the same object differ in scale depending on the distance from the object, the focal length, etc. This is one of the reasons why raw pixel values are not considered useful features for images.

The main aim of using SIFT for feature extraction is to obtain features that are not sensitive to changes in scale, rotation, image resolution, illumination, etc.

The major steps involved in SIFT algorithm are:

Scale-space Extrema Detection

Keypoint Localization

Orientation Assignment

Keypoint Descriptor
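
The first of these steps, scale-space extrema detection, can be sketched with NumPy alone. This is a simplified illustration, not a full SIFT implementation: the function names, the sigma values, and the `contrast_threshold` parameter are assumptions. The image is blurred at several scales, adjacent blurs are subtracted to form difference-of-Gaussians (DoG) layers, and a pixel is kept when it is the largest or smallest value among its 26 neighbours in space and scale.

```python
import numpy as np


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about three standard deviations."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image, sigma):
    """Separable Gaussian blur with reflected borders."""
    kernel = gaussian_kernel(sigma)
    pad = kernel.size // 2
    padded = np.pad(image, pad, mode='reflect')
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode='valid')
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode='valid')


def dog_extrema(image, sigmas=(1.0, 1.6, 2.56, 4.096), contrast_threshold=0.03):
    """Return the DoG stack and the (scale, row, column) positions of its extrema."""
    blurred = [gaussian_blur(image, s) for s in sigmas]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(len(sigmas) - 1)])
    keypoints = []
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                value = dog[s, y, x]
                if abs(value) < contrast_threshold:
                    continue                       # skip low-contrast responses
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                if value == cube.max() or value == cube.min():
                    keypoints.append((s, y, x))
    return dog, keypoints


# Synthetic test image: a single bright blob centred at row 16, column 16.
yy, xx = np.mgrid[0:32, 0:32]
blob = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2.0 * 2.0 ** 2))
dog, keypoints = dog_extrema(blob)
print(keypoints)
```

The remaining steps (keypoint localization, orientation assignment, and descriptor computation) refine and describe the candidates found this way and are not covered by this sketch.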


#Convolutional Neural Networks (CNN)
Deep learning has become increasingly important for learning complex functions. It is a branch of machine learning based on neural networks, which are loosely modeled on the brain.

A neural network consists of:

input layer

hidden layers

output layer

Each layer is composed of nodes, where the computation happens.

A neural network consists of interconnected neurons that pass messages to each other.

A CNN is a special case of neural network that consists of multiple convolutional layers, pooling layers and, finally, fully connected layers.

Because convolutional layers share their weights across the image and connect each neuron only to a local region, this structure reduces memory use and computational cost. CNNs are mainly used in pattern and image recognition problems.
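
The three kinds of layers can be illustrated with a NumPy-only forward pass. This is a minimal sketch with random, untrained weights; the function names and layer sizes are assumptions chosen for illustration, and real networks use many filters, several layers, and a deep learning framework.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out


def max_pool2x2(feature_map):
    """Keep the largest value in every non-overlapping 2 x 2 window."""
    h = feature_map.shape[0] // 2 * 2
    w = feature_map.shape[1] // 2 * 2
    windows = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return windows.max(axis=(1, 3))


rng = np.random.default_rng(0)
image = rng.random((32, 32))                 # stand-in single-channel image
kernel = rng.normal(size=(3, 3))             # one convolution filter: 9 shared weights
dense_weights = rng.normal(size=(15 * 15, 10))

feature_map = np.maximum(conv2d_valid(image, kernel), 0)   # convolution + ReLU
pooled = max_pool2x2(feature_map)                           # pooling layer
logits = pooled.reshape(-1) @ dense_weights                 # fully connected layer
print(feature_map.shape, pooled.shape, logits.shape)
```

The convolutional layer here uses 9 weights, whereas a fully connected layer mapping the 32 x 32 input to the same 30 x 30 output would need 1024 * 900 = 921,600 weights.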


#Testing
Cross validation is a model validation technique that evaluates the performance of a model on unseen data.

Testing accuracy is a better estimate of performance on unseen data than training accuracy.

Points to remember:

Cross validation estimates have high variance if the testing set and training set are not drawn from the same population.

If training data is allowed to leak into the testing data, the results will not reflect the actual performance.

In cross validation, the number of samples used for training the model is reduced, and the results depend on the choice of the training and testing sets.
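
One widely used approach is k-fold cross validation, sketched below with NumPy only. The generator name `k_fold_indices` and the sizes are assumptions for illustration. The samples are shuffled once and cut into k folds; each fold serves as the testing set exactly once while the remaining folds form the training set.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Hypothetical helper: yield (train_idx, test_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(train_idx.tolist()), sorted(test_idx.tolist()))
```

Training and testing indices never overlap within a split, which keeps training data out of the testing data, and every split trains on fewer samples than the full set.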


Partitioning the Data
It is a methodological mistake to train and test on the same dataset: a classifier that has simply memorized the training samples would score well, yet fail to predict correctly for unseen data. This situation is called overfitting.

To avoid this problem, the data is split into a train set, a validation set, and a test set.

Training Set: The data used to train the classifier.

Validation Set: The data used to tune the classifier's hyperparameters, i.e., to check how well the model has been trained (taken from the training data).

Testing Set: The data used to evaluate the performance of the classifier (unseen data by the classifier).

This helps us measure the performance of our model.

Since the online platform used in this course does not support large datasets, only a few samples are taken for training and testing.
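
A three-way partition can be sketched as follows. The helper name `three_way_split` and the 70/15/15 proportions are assumptions chosen for illustration; the course code itself uses only a train and a test set.

```python
import numpy as np


def three_way_split(n_samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Hypothetical helper: disjoint train, validation and test indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(4000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation indices are used while tuning the model, and the test indices are touched only once, for the final evaluation.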


Classification accuracy is defined as the percentage of correct predictions.

To calculate class-wise accuracy,

         CA = (correctly predicted images of a class / total images of the class) * 100

```
#To see the accuracy of each class.
accuracy = []
leng = len(conf_matrix) #finding the length of confusion matrix
for i in range(leng):
    # each diagonal element (conf_matrix[i,i]) is divided by the sum of the
    # elements of that particular row (conf_matrix[i].sum()).
    ac = (conf_matrix[i, i] / (conf_matrix[i].sum() + .0000001)) * 100
    accuracy.append(ac)
print(accuracy)
```

Overall accuracy is given by OA = sum of class-wise accuracies / number of classes.

The code is as follows:

```
summation = 0
no_of_classes = 10
for i in range(0, len(accuracy)):
    summation += accuracy[i]
overall_accuracy = summation / no_of_classes
print(overall_accuracy)
```
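
The snippets above use `conf_matrix` without showing how it is built. A minimal sketch, assuming rows are true classes and columns are predicted classes (which matches dividing each diagonal element by its row sum), is given below. The helper name and the tiny label arrays are made up for illustration.

```python
import numpy as np


def confusion_matrix_from_labels(true_labels, predicted_labels, n_classes):
    """Hypothetical helper: rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)   # count each (true, predicted) pair
    return matrix


true_labels = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
predicted = np.array([0, 0, 1, 0, 1, 1, 2, 0, 2, 2])
conf_matrix = confusion_matrix_from_labels(true_labels, predicted, n_classes=3)

class_accuracy = conf_matrix.diagonal() / conf_matrix.sum(axis=1) * 100
mean_class_accuracy = class_accuracy.mean()                 # OA as defined above
fraction_correct = conf_matrix.trace() / conf_matrix.sum() * 100
print(conf_matrix)
print(class_accuracy, mean_class_accuracy, fraction_correct)
```

In this toy example the mean of the class-wise accuracies (about 83.3) differs from the plain percentage of correct predictions (80.0) because the classes have different sizes; the two agree when every class has the same number of test images.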


#TASK KERAS

```
from keras.datasets import fashion_mnist
from keras.utils import to_categorical
import numpy as np

# load dataset
(trainX, trainy), (testX, testy) = fashion_mnist.load_data()
# load train and test dataset
def load_dataset():
    # load dataset
    (trainX, trainy), (testX, testY) = fashion_mnist.load_data()
    # reshape dataset to have a single channel
    trainX = trainX.reshape((trainX.shape[0], 28, 28, 1))
    testX = testX.reshape((testX.shape[0], 28, 28, 1))
    # one hot encode target values
    trainy = to_categorical(trainy)
    testY = to_categorical(testY)
    return trainX, trainy, testX, testY
```

Using TensorFlow backend.


Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz


```
seed=9

from sklearn.model_selection import StratifiedShuffleSplit
# n_splits is left at its default of 10, so the loop below keeps the indices of the last split
data_split = StratifiedShuffleSplit(test_size=0.08, random_state=seed)
for train_index, test_index in data_split.split(trainX, trainy):
    split_data_92, split_data_8 = trainX[train_index], trainX[test_index]
    split_label_92, split_label_8 = trainy[train_index], trainy[test_index]
train_test_split = StratifiedShuffleSplit(test_size=0.3, random_state=seed) #test_size=0.3 denotes that 30 % of the dataset is used for testing.
```


```
for train_index, test_index in train_test_split.split(split_data_8,split_label_8):
    train_data_70, test_data_30 = split_data_8[train_index], split_data_8[test_index]
    train_label_70, test_label_30 = split_label_8[train_index], split_label_8[test_index]
train_data = train_data_70 #assigning to variable train_data
train_labels = train_label_70 #assigning to variable train_labels
test_data = test_data_30
test_labels = test_label_30
print('train_data : ', train_data)
print('train_labels : ', train_labels)
print('test_data : ', test_data)
print('test_labels : ', test_labels)
```

train_data :  [[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 1 0 0]
  ...
  [0 0 0 ... 5 0 0]
  [0 0 0 ... 6 0 0]
  [0 0 0 ... 1 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 ...

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]]
train_labels :  [0 1 6 ... 3 1 1]
test_data :  [[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 

```
# definition of normalization function
def normalize(data, eps=1e-8):
    # the arrays here have shape (samples, 28, 28), so axes (0, 1, 2) cover the whole array:
    # one global mean and standard deviation are used instead of per-image statistics
    data -= data.mean(axis=(0, 1, 2), keepdims=True)
    std = np.sqrt(data.var(axis=(0, 1, 2), ddof=1, keepdims=True)) # calculating standard deviation
    std[std < eps] = 1.
    data /= std
    return data
train_data=train_data.astype('float64')
test_data=test_data.astype('float64')
# calling the function
train_data = normalize(train_data)
test_data = normalize(test_data)
# prints the train data and test data
print('train_data: ', train_data)
print('test_data: ', test_data)
```

train_data:  [[[-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.79589088 -0.80699357
   -0.80699357]
  ...
  [-0.80699357 -0.80699357 -0.80699357 ... -0.7514801  -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.74037741 -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.79589088 -0.80699357
   -0.80699357]]

 [[-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
   -0.80699357]
  ...
  [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
   -0.80699357]
  [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80

```
# Flattening the images for whitening: after the transpose, each column holds one image
train_data_flat = train_data.reshape(train_data.shape[0], -1).T
test_data_flat = test_data.reshape(test_data.shape[0], -1).T
print('train_data_flat: ', train_data_flat)
print('test_data_flat: ', test_data_flat)

train_data_flat_t = train_data_flat.T
test_data_flat_t = test_data_flat.T
```


train_data_flat:  [[-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
  -0.80699357]
 [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
  -0.80699357]
 [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
  -0.80699357]
 ...
 [-0.79589088 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
  -0.80699357]
 [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
  -0.80699357]
 [-0.80699357 -0.80699357 -0.80699357 ... -0.80699357 -0.80699357
  -0.80699357]]
test_data_flat:  [[-0.81083669 -0.81083669 -0.81083669 ... -0.81083669 -0.81083669
  -0.81083669]
 [-0.81083669 -0.81083669 -0.81083669 ... -0.81083669 -0.81083669
  -0.81083669]
 [-0.81083669 -0.81083669 -0.81083669 ... -0.81083669 -0.81083669
  -0.81083669]
 ...
 [-0.81083669 -0.81083669 -0.81083669 ... -0.81083669 -0.81083669
  -0.81083669]
 [-0.81083669 -0.81083669 -0.81083669 ... -0.81083669 -0.81083669
  -0.81083669]
 [-0.81083669 -0.81083669 -0.81083669 ... -0.81083669 -0.810836

```
from sklearn.decomposition import PCA

# n_components specifies the number of components to keep
# (train_data.shape[1] is 28 here, so 28 components are kept)
train_data_pca = PCA(n_components=train_data.shape[1]).fit_transform(train_data_flat)
test_data_pca = PCA(n_components=test_data.shape[1]).fit_transform(test_data_flat)
print(train_data_pca)
print(test_data_pca)

train_data_pca = train_data_pca.T
test_data_pca = test_data_pca.T
```

[[-48.8353608  -10.09929446  -5.68424705 ...   0.16228964   0.2467461
    0.72782834]
 [-48.83400607 -10.09922325  -5.68301089 ...   0.16147601   0.24772916
    0.72671348]
 [-48.81999397 -10.10285095  -5.6753278  ...   0.16142327   0.2621737
    0.72885717]
 ...
 [-47.54340165  -9.42846415  -4.6915174  ...   0.23164538   1.07587441
    0.50553854]
 [-48.57670043  -9.86242271  -5.60291988 ...   0.05444069   0.45313798
    0.66871658]
 [-48.81420729 -10.07988015  -5.69104811 ...   0.12613004   0.25425758
    0.71167824]]
[[-32.31296431  -6.2126007   -3.36998485 ...   0.38448069   0.24667314
    0.65584848]
 [-32.31296431  -6.2126007   -3.36998485 ...   0.38448069   0.24667314
    0.65584848]
 [-32.3010017   -6.21872634  -3.36594828 ...   0.38422524   0.24350656
    0.66066067]
 ...
 [-31.57566912  -5.88950443  -2.9665535  ...   0.78878668   0.33532483
    0.26736214]
 [-32.08537108  -6.04989022  -3.37871062 ...   0.42926997   0.19569179
    0.34402801]
 [-32.28224143  -6.197356    -3.37

```
from skimage import color
def svdFeatures(input_data):
    svdArray_input_data = []
    size = input_data.shape[0]
    for i in range(0, size):
        img = color.rgb2gray(input_data[i])
        U, s, V = np.linalg.svd(img, full_matrices=False)
        S = [s[j] for j in range(28)] # a 28 x 28 image has 28 singular values
        svdArray_input_data.append(S)
    # building the feature matrix once, after all images are processed
    svdMatrix_input_data = np.matrix(svdArray_input_data)
    return svdMatrix_input_data

# apply SVD for train and test data
train_data_svd = svdFeatures(train_data)
test_data_svd = svdFeatures(test_data)
print(train_data_svd.shape)
print(test_data_svd.shape)
```

(3360, 28)
(1440, 28)


```
from sklearn import svm

#Creating a svm classifier model
clf = svm.SVC(gamma=0.001, probability=True)

#Model training
train = clf.fit(train_data_flat_t, train_labels)
predicted = clf.predict(test_data_flat_t)

score = clf.score(test_data_flat_t, test_labels)
print("score",score)
```


score 0.8277777777777777
