# Problem Set 3

### Import Packages

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

import numpy as np
import pandas as pd
import pprint as pp
from skimage import data
import seaborn as sns

import sklearn
from sklearn.decomposition import PCA, TruncatedSVD, NMF

import timeit

## Question 1

### 1) Load the CIFAR-10 dataset from (https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz). (Don’t upload the dataset to github.)

To avoid uploading the data to GitHub, the directory with all subsequent data files has been added to the .gitignore for my repo. 

According to the dataset's README (https://www.cs.toronto.edu/~kriz/cifar.html), CIFAR-10 consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images, *roughly an 83-17 split, which approximates the 80-20 split requested for this assignment.*

The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. 

The png files can be found here: https://www.kaggle.com/swaroopkml/cifar10-pngs-in-folders. 

In [2]:
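# function to open pickled Python files
import pickle

def unpickle(file):
    # keys are loaded as bytes (e.g. b'data'), hence encoding='bytes'
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')  # avoid shadowing the built-in dict
    return batch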

### Meta Data

There are 10,000 cases per batch and 3,072 values per image (32 x 32 pixels x 3 colour channels).  
There are 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

In [3]:
batches_meta = unpickle("data/cifar_10_batches_py/batches.meta")
print(batches_meta.keys())
print(batches_meta)

dict_keys([b'num_cases_per_batch', b'label_names', b'num_vis'])
{b'num_cases_per_batch': 10000, b'label_names': [b'airplane', b'automobile', b'bird', b'cat', b'deer', b'dog', b'frog', b'horse', b'ship', b'truck'], b'num_vis': 3072}


### Explore Batch 1

There are four keys in each batch: 
- the batch label
- the labels corresponding to the image
- the 3072 array to define each image
- the filename.

In [4]:
batch_1 = unpickle("data/cifar_10_batches_py/data_batch_1")
print(len(batch_1))
print(batch_1.keys())

4
dict_keys([b'batch_label', b'labels', b'data', b'filenames'])


In [None]:
pp.pprint(batch_1[b'batch_label'])

In [None]:
pp.pprint(batch_1[b'labels'][:20]) # first 20 image labels

In [None]:
pp.pprint(batch_1[b'data'][0]) # first image array
print("Dimension:", batch_1[b'data'][0].ndim)
print("Shape:", batch_1[b'data'][0].shape)
print("Filename:", batch_1[b'filenames'][0])

Let's view the training set's first image...

In [None]:
image1 = batch_1[b'data'][0] # extract first set of 3072 values

# divide array into 2D arrays for each color
# divide by 255 (the maximum 8-bit RGB value) to scale each channel to [0, 1]
reds = np.reshape(image1[:1024],(32,32))/255
greens = np.reshape(image1[1024:2048],(32,32))/255
blues = np.reshape(image1[2048:],(32,32))/255

# create a 3D array
image1 = np.dstack((reds,greens,blues))
print("Dimension:", image1.ndim)
print("Shape:", image1.shape)

In [None]:
plt.figure(figsize=(10,2.5))
plt.subplot(131)
plt.gca().set_title('Red channel')
plt.imshow(reds, cmap='Reds', interpolation='nearest')
plt.subplot(132)
plt.gca().set_title('Green channel')
plt.imshow(greens, cmap='Greens', interpolation='nearest')
plt.subplot(133)
plt.gca().set_title('Blue channel')
plt.imshow(blues, cmap='Blues', interpolation='nearest')

plt.show()

In [None]:
plt.figure(figsize=(2.5,2.5))
plt.imshow(image1) # a colormap is ignored for RGB arrays, so none is passed
plt.show()

Yay! After converting the array into a 32x32x3 image, we can see that this is an image of a frog.
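
The per-channel slicing above handles one image at a time. As a minimal sketch, the same conversion can be applied to a whole batch with a single reshape and transpose. The helper name `rows_to_images` and the synthetic `fake_batch` are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np

def rows_to_images(rows):
    """Convert CIFAR-style flat rows (N, 3072) into images (N, 32, 32, 3)."""
    rows = np.asarray(rows)
    # Each row stores 1024 red, then 1024 green, then 1024 blue values,
    # so first view it as (N, channel, height, width) ...
    chw = rows.reshape(-1, 3, 32, 32)
    # ... then move the channel axis last, the layout plt.imshow expects.
    return chw.transpose(0, 2, 3, 1)

# Synthetic stand-in for batch_1[b'data'] (hypothetical values, not real images)
fake_batch = np.arange(2 * 3072).reshape(2, 3072) % 256
images = rows_to_images(fake_batch)
print(images.shape)
```

For display, the result would still need to be divided by 255 (or kept as `uint8`) as in the cell above.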

**The above work helped me understand how the dataset was originally stored. At this point in my analysis, I noticed that the CIFAR-10 dataset could be accessed through Keras.** 

In [2]:
import tensorflow
import keras
from keras.datasets import cifar10

In [3]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

### 2)	Make an 80%-20% split on the dataset into test and train data.

In [None]:
# The training and testing data are in approximately an 80-20 split (50,000 vs. 10,000 images).
print('Training data shape:', x_train.shape)
print('Testing data shape:', x_test.shape)

In [None]:
# corresponding training and testing labels
print('Training labels shape:', y_train.shape)
print('Testing labels shape:', y_test.shape)

In [4]:
# unique numbers from the train labels
classes = np.unique(y_train)
nClasses = len(classes)
print('Total number of classes : ', nClasses)
print('Output classes : ', classes)

# from meta data above we know which numbers correspond to which image class
label_dict = {
 0: 'airplane',
 1: 'automobile',
 2: 'bird',
 3: 'cat',
 4: 'deer',
 5: 'dog',
 6: 'frog',
 7: 'horse',
 8: 'ship',
 9: 'truck',
}

Total number of classes :  10
Output classes :  [0 1 2 3 4 5 6 7 8 9]


View example images from training and testing set.

In [None]:
plt.figure(figsize=[5,5])

# Display the first image in training data (SAME AS ABOVE)
plt.subplot(121)
curr_img = np.reshape(x_train[0], (32,32,3))
plt.imshow(curr_img)
plt.title(str(label_dict[y_train[0][0]]))

# Display the first image in testing data
plt.subplot(122)
curr_img = np.reshape(x_test[0],(32,32,3))
plt.imshow(curr_img)
plt.title(str(label_dict[y_test[0][0]]))

### 3)	Scale the data so that each feature has a minimum value of 0 and a maximum value of 1.

In [5]:
np.min(x_train),np.max(x_train) # current minimum and maximum

(0, 255)

In [6]:
x_train = x_train/255.0
print("New Minimum:" , np.min(x_train), "New Maximum:" , np.max(x_train))
print("Same Training Shape:" ,x_train.shape)

New Minimum: 0.0 New Maximum: 1.0
Same Training Shape: (50000, 32, 32, 3)
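
Dividing by 255 rescales all pixels with one global constant. A stricter reading of "each feature" would rescale every column by its own minimum and maximum, learned from the training data only. A minimal sketch of that alternative with scikit-learn's `MinMaxScaler`, on synthetic data rather than CIFAR-10:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for flattened pixel matrices (hypothetical values)
rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(100, 12)).astype(float)
test = rng.integers(0, 256, size=(20, 12)).astype(float)

scaler = MinMaxScaler()                  # learns a min and max per column
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)     # reuses the training min/max

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

With this approach, test values can fall slightly outside [0, 1] when a test column exceeds the range seen in training.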


### 4)	Use the following dimensionality reduction techniques for feature extraction: (More in Question 3)
    a) Principal Component Analysis
    b) Singular Value Decomposition
    c) Non-negative Matrix Factorization
    
To begin, I will create a dataframe of pixel values for each image with their respective labels.

In [7]:
# reshape image dimensions from three to one
x_train_flat = x_train.reshape(-1,3072)

# name each column by pixel number
feat_cols = ['p'+str(i + 1) for i in range(x_train_flat.shape[1])]
df_cifar = pd.DataFrame(x_train_flat,columns=feat_cols)

# add image labels
df_cifar['label'] = y_train

# check dataframe shape
print('Size of the dataframe: {}'.format(df_cifar.shape))

Size of the dataframe: (50000, 3073)


In [8]:
df_cifar.head() # each row is image, each column contains pixel or label info

Unnamed: 0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,...,p3064,p3065,p3066,p3067,p3068,p3069,p3070,p3071,p3072,label
0,0.231373,0.243137,0.247059,0.168627,0.180392,0.176471,0.196078,0.188235,0.168627,0.266667,...,0.847059,0.721569,0.54902,0.592157,0.462745,0.329412,0.482353,0.360784,0.282353,6
1,0.603922,0.694118,0.733333,0.494118,0.537255,0.533333,0.411765,0.407843,0.372549,0.4,...,0.560784,0.521569,0.545098,0.560784,0.52549,0.556863,0.560784,0.521569,0.564706,9
2,1.0,1.0,1.0,0.992157,0.992157,0.992157,0.992157,0.992157,0.992157,0.992157,...,0.305882,0.333333,0.32549,0.309804,0.333333,0.32549,0.313725,0.337255,0.329412,9
3,0.109804,0.098039,0.039216,0.145098,0.133333,0.07451,0.14902,0.137255,0.078431,0.164706,...,0.211765,0.184314,0.109804,0.247059,0.219608,0.145098,0.282353,0.254902,0.180392,4
4,0.666667,0.705882,0.776471,0.658824,0.698039,0.768627,0.694118,0.72549,0.796078,0.717647,...,0.294118,0.309804,0.321569,0.278431,0.294118,0.305882,0.286275,0.301961,0.313725,1


#### a) Principal Component Analysis

Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

In [9]:
pca = PCA(0.9) # want PCA model to capture 90% of variance

pca.fit(x_train_flat)
pca.n_components_ # need 99 components to achieve 90% variance explained

99

Therefore, to achieve 90% variance explained, use 99 principal components compared to the original 3072 pixels.
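
The component count can also be read off the cumulative explained-variance curve. The sketch below illustrates the idea on scikit-learn's small digits dataset, which is a stand-in chosen for speed and not the CIFAR-10 data, so the count it prints will differ from 99.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data / 16.0      # 1797 images of 8x8 pixels, scaled to [0, 1]

pca_full = PCA().fit(X)            # keep every component
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# index of the first cumulative ratio that reaches 0.90, converted to a count
n_needed = int(np.searchsorted(cum_var, 0.90) + 1)

# PCA(0.9) selects the number of components by the same criterion
n_auto = PCA(0.9).fit(X).n_components_
print(n_needed, n_auto)
```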

In [10]:
ncomp = 99
pca_cifar = PCA(n_components=ncomp)
principalComponents_cifar = pca_cifar.fit_transform(df_cifar.iloc[:,:-1])

# convert PC into dataframe
feat_cols = ['PC'+str(i + 1) for i in range(principalComponents_cifar.shape[1])]
principal_cifar_Df = pd.DataFrame(data = principalComponents_cifar, columns = feat_cols)
principal_cifar_Df['y'] = y_train
principal_cifar_Df.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC91,PC92,PC93,PC94,PC95,PC96,PC97,PC98,PC99,y
0,-6.401018,2.729039,1.501711,-2.953333,-4.452582,0.64715,0.568989,0.092877,3.451771,1.168442,...,0.418062,0.226053,-0.056544,-0.318404,-0.716454,-0.523308,-0.447695,0.133987,-0.106993,6
1,0.829783,-0.949943,6.003753,1.504931,-1.3685,1.225687,0.606882,-0.523086,2.58415,2.565564,...,-0.46099,-0.114007,0.031109,-0.110551,0.39852,-0.368037,0.374922,-0.281805,-0.379971,9
2,7.7302,-11.522102,-2.753621,2.333595,-1.584409,-2.272213,-0.610438,-1.361358,-0.730908,-1.125914,...,0.268505,-0.04826,-0.569063,0.183344,0.333791,0.694297,-0.521619,-0.416999,0.047737,9
3,-10.347817,0.010738,1.101019,-1.30454,-1.59487,0.8676,0.194107,0.232392,1.467262,-0.359152,...,0.022401,0.093721,0.146062,-0.197351,0.153781,0.246829,-0.230318,0.161916,0.065638,4
4,-2.625651,-4.96924,1.034585,3.306459,1.261683,0.031241,5.655493,1.426761,3.918136,-1.955221,...,0.679009,0.488839,-0.69237,-0.018541,-0.035966,-0.315796,0.410882,0.145903,0.355507,1


In [11]:
print(pca_cifar.explained_variance_ratio_)
print('Total Variance Explained',sum(pca_cifar.explained_variance_ratio_))

[0.2907663  0.11253144 0.06694414 0.03676459 0.03608843 0.0280923
 0.02712992 0.02167162 0.02064641 0.01438001 0.01310563 0.01065978
 0.01049981 0.01004269 0.00918482 0.008174   0.00739608 0.0071613
 0.00687472 0.00643243 0.00594396 0.00587355 0.00495567 0.00490792
 0.00480452 0.00465877 0.00451348 0.00443654 0.00400781 0.00393866
 0.00366217 0.0033314  0.00323965 0.00310246 0.00307587 0.0029125
 0.00261219 0.00259261 0.00254345 0.00248378 0.00242671 0.0022932
 0.00228175 0.00221518 0.0021026  0.00206732 0.00192457 0.00190379
 0.0018466  0.00181696 0.00178052 0.001736   0.00171165 0.00169759
 0.00162334 0.0015859  0.00156412 0.001543   0.00153092 0.00149923
 0.00145783 0.00142325 0.00141115 0.00137706 0.00134853 0.00132664
 0.00128842 0.00124309 0.00121383 0.00121199 0.00118513 0.00116823
 0.00113991 0.00112302 0.00111627 0.0011102  0.00110244 0.00104774
 0.00103586 0.00101727 0.00099838 0.00098963 0.00097765 0.00097181
 0.00094219 0.00092738 0.00091825 0.00089843 0.0008827  0.00087738

#### b) Singular Value Decomposition

In linear algebra, the singular value decomposition is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any $m \times n$ matrix.
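
As a small illustration of the factorization, the sketch below decomposes a random $6 \times 4$ matrix with NumPy and then keeps only the two largest singular values, which is the idea behind truncated SVD. The matrix and the choice `k = 2` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                   # any m x n matrix

# A = U @ diag(s) @ Vt, with singular values s sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))

# Keeping the k largest singular values gives a rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Projecting onto the top-k right singular vectors gives k features per row
reduced = A @ Vt[:k].T
print(A_k.shape, reduced.shape)
```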

In [None]:
# need 95 components to achieve at least 90% variance explained
svd = TruncatedSVD(n_components=95)

# fit on the pixel columns only; the last column of df_cifar is the label
svd.fit(df_cifar.iloc[:,:-1])
print('Total Variance Explained',svd.explained_variance_ratio_.sum())

In [None]:
svd.singular_values_

In [None]:
# apply SVD transform to the pixel columns (label column excluded)
transformed_svd = svd.fit_transform(df_cifar.iloc[:,:-1])
transformed_svd

In [None]:
# convert SVD into dataframe
feat_cols = ['SVD'+str(i + 1) for i in range(transformed_svd.shape[1])]
svd_cifar_Df = pd.DataFrame(data = transformed_svd, columns = feat_cols)
svd_cifar_Df['y'] = y_train
svd_cifar_Df.head()

#### c) Non-negative Matrix Factorization

Non-negative matrix factorization (NMF), also called non-negative matrix approximation, is a group of algorithms in multivariate analysis and linear algebra in which a matrix V is factorized into two matrices W and H, with the property that all three matrices have no negative elements.
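
A minimal sketch of the factorization on a small random non-negative matrix, using scikit-learn's `NMF`. The matrix size, `n_components=3`, and the `init` and `max_iter` settings are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((20, 8))                      # non-negative data, values in [0, 1)

model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)                   # (20, 3) per-sample weights
H = model.components_                        # (3, 8) non-negative basis rows

print(W.shape, H.shape, (W >= 0).all(), (H >= 0).all())
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

Because W and H cannot contain negative entries, each row of V is approximated as an additive combination of the rows of H.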

In [None]:
nmf = NMF(n_components=100) # arbitrarily choose 100 components
trans_nmf = nmf.fit_transform(df_cifar.iloc[:,:-1]) # pixel columns only, label excluded
print("Shape:", trans_nmf.shape)
feat_cols = ['NMF'+str(i + 1) for i in range(trans_nmf.shape[1])]
nmf_cifar_Df = pd.DataFrame(data = trans_nmf, columns = feat_cols)
nmf_cifar_Df['y'] = y_train
nmf_cifar_Df.head()

### 5)	Bonus: Provide some visualization on how these methods are different.

In [None]:
ncomp = 2

# PCA Example
pca_cifar_ex = PCA(n_components=ncomp)
principalComponents_cifar_ex = pca_cifar_ex.fit_transform(df_cifar.iloc[:,:-1])

principal_cifar_Df_ex = pd.DataFrame(data = principalComponents_cifar_ex
                    , columns = ['principal component 1', 'principal component 2'])
principal_cifar_Df_ex['y'] = y_train

# SVD Example
svd_ex = TruncatedSVD(n_components=ncomp)
svd_ex.fit(df_cifar.iloc[:,:-1]) # pixel columns only, label excluded

svd_transform_ex = svd_ex.transform(df_cifar.iloc[:,:-1])
feat_cols = ['SVD'+str(i + 1) for i in range(svd_transform_ex.shape[1])]
svd_cifar_Df_ex = pd.DataFrame(data = svd_transform_ex, columns = feat_cols)
svd_cifar_Df_ex['y'] = y_train

# NMF Example
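nmf_ex = NMF(n_components=ncomp)
# use the 2-component model defined above (not the 100-component nmf) on the pixel columns
trans_nmf_ex = nmf_ex.fit_transform(df_cifar.iloc[:,:-1])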

feat_cols = ['NMF'+str(i + 1) for i in range(trans_nmf_ex.shape[1])]
nmf_cifar_Df_ex = pd.DataFrame(data = trans_nmf_ex, columns = feat_cols)
nmf_cifar_Df_ex['y'] = y_train

In [None]:
plt.figure(figsize=(16,20))

# PCA plot
plt.subplot(311)
plt.gca().set_title('a) Principal Component Analysis')
sns.scatterplot(
    x="principal component 1", y="principal component 2",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=principal_cifar_Df_ex,
    legend="full",
    alpha=0.3
)

# SVD plot
plt.subplot(312)
plt.gca().set_title('b) Singular Value Decomposition')
sns.scatterplot(
    x="SVD1", y="SVD2",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=svd_cifar_Df_ex,
    legend="full",
    alpha=0.3
)

# NMF plot
plt.subplot(313)
plt.gca().set_title('c) Non-negative Matrix Factorization')
sns.scatterplot(
    x="NMF1", y="NMF2",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=nmf_cifar_Df_ex,
    legend="full",
    alpha=0.3
)

plt.savefig('images/question1.png')

## Question 2

### 1) Fit the following classifiers on the dataset:
    a) Linear SVC
    b) Logistic Regression Classifier
    c) K-nearest Neighbors Classifier
    d) Perceptron

In [101]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score,precision_score


In [13]:
# prep testing data
x_test = x_test/255.0
x_test = x_test.reshape(-1,32,32,3)

# reshape image dimensions from three to one
x_test_flat = x_test.reshape(-1,3072)

In [14]:
# transform new data using already fitted pca
newdata_transformed = pca.transform(x_test_flat)

# name each column by principal component number
feat_cols = ['p'+str(i + 1) for i in range(newdata_transformed.shape[1])]
principal_cifar_test = pd.DataFrame(newdata_transformed,columns=feat_cols)

# add image labels
principal_cifar_test['label'] = y_test
principal_cifar_test.head()

Unnamed: 0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,...,p91,p92,p93,p94,p95,p96,p97,p98,p99,label
0,-3.479671,0.906426,1.251956,3.164191,-1.91824,1.659476,-0.068083,0.160349,-0.472169,-0.182596,...,-0.545092,-0.608618,0.108551,-0.183645,0.20923,0.027931,-0.47339,-0.073715,-0.174357,3
1,9.943158,-9.580553,5.068578,-2.9612,1.110012,-0.266616,-1.147632,-3.261594,1.486868,-3.944965,...,0.413626,0.06585,-0.101268,-0.307669,0.49875,-0.19966,0.099927,-0.676948,0.699004,8
2,4.7043,-8.837206,4.109285,1.030721,0.204448,-0.911759,5.313861,1.123139,1.150991,0.448419,...,-0.031693,-0.224322,-0.146823,-0.280326,-0.046095,-0.143887,0.043336,-0.119398,0.410197,8
3,8.046408,-3.812435,6.268061,-0.799734,-0.160306,4.558274,0.988285,0.892768,-1.147928,0.752415,...,-0.18359,-0.161597,-0.222708,0.778089,0.058857,0.282721,0.413735,-0.300682,-0.444346,0
4,-5.254615,4.320979,1.844344,-0.784487,0.386743,0.410722,0.666861,1.029134,0.203599,0.083327,...,0.187285,0.034242,0.170983,-0.663774,0.380714,-0.183378,-0.217617,0.329352,0.585919,6


In [15]:
X_train = principal_cifar_Df.iloc[:,:-1]
X_test = principal_cifar_test.iloc[:,:-1]

In [16]:
from IPython.display import display, HTML
display(HTML(f"""
   
        <ul class="list-group">
          <li class="list-group-item disabled" aria-disabled="true"><h4>Shape of Train and Test Dataset</h4></li>
          <li class="list-group-item"><h4>Number of rows in Train dataset is: <span class="label label-primary">{ X_train.shape[0]:,}</span></h4></li>
          <li class="list-group-item"> <h4>Number of columns Train dataset is <span class="label label-primary">{X_train.shape[1]}</span></h4></li>
          <li class="list-group-item"><h4>Number of rows in Test dataset is: <span class="label label-success">{ X_test.shape[0]:,}</span></h4></li>
          <li class="list-group-item"><h4>Number of columns Test dataset is <span class="label label-success">{X_test.shape[1]}</span></h4></li>
        </ul>
  
    """))

In [17]:
y_train = principal_cifar_Df.iloc[:,-1]
y_test = principal_cifar_test.iloc[:,-1]
print('Shape of the training labels: {}'.format(y_train.shape))
print('Shape of the testing labels: {}'.format(y_test.shape))

Shape of the training labels: (50000,)
Shape of the testing labels: (10000,)


#### a) Linear SVC

In [29]:
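# Note: SVC() defaults to an RBF kernel; SVC(kernel='linear') or
# sklearn.svm.LinearSVC would give a strictly linear model
Model = SVC()
Model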

SVC()

In [30]:
%%timeit
Model.fit(X_train, y_train)

5min ± 11.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
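
A strictly linear support vector classifier can be fitted with scikit-learn's `LinearSVC`. The sketch below is not the model timed above; it runs on a synthetic dataset from `make_classification` as a stand-in for the 99 PCA features, and the `C` and `max_iter` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the PCA features (assumption, not CIFAR-10)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# LinearSVC trains one linear separator per class (one-vs-rest)
linear_model = LinearSVC(C=1.0, max_iter=5000)
linear_model.fit(X_tr, y_tr)
print("test accuracy:", linear_model.score(X_te, y_te))
```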


In [31]:
y_pred_svc = Model.predict(X_test) 

In [32]:
np.save("results/y_pred_svc.npy", y_pred_svc)

In [77]:
# load prediction array
y_pred_svc = np.load("results/y_pred_svc.npy")

# Accuracy score
print('Accuracy : ',accuracy_score(y_test, y_pred_svc))

# confusion matrix
print(confusion_matrix(y_test, y_pred_svc))

# Summary of the predictions made by the classifier
parta = classification_report(y_test, y_pred_svc)
print(parta)

Accuracy :  0.5395
[[616  28  56  19  27  22  19  25 143  45]
 [ 33 653  18  43   7  22  15  25  55 129]
 [ 82  28 398  96 135  56 118  47  27  13]
 [ 32  23  80 384  61 170 120  44  30  56]
 [ 46  13 154  69 429  39 139  69  24  18]
 [ 22  14  82 197  69 435  83  54  20  24]
 [ 10  17  74  91 100  39 630  14  11  14]
 [ 39  20  55  66  79  71  37 559  15  59]
 [ 87  61  18  29  23  18  13  15 683  53]
 [ 44 143  11  46  11  15  24  35  63 608]]
              precision    recall  f1-score   support

           0       0.61      0.62      0.61      1000
           1       0.65      0.65      0.65      1000
           2       0.42      0.40      0.41      1000
           3       0.37      0.38      0.38      1000
           4       0.46      0.43      0.44      1000
           5       0.49      0.43      0.46      1000
           6       0.53      0.63      0.57      1000
           7       0.63      0.56      0.59      1000
           8       0.64      0.68      0.66      1000
         

#### b) Logistic Regression Classifier

In [46]:
Model = LogisticRegression(max_iter = 400)
Model

LogisticRegression(max_iter=400)

In [47]:
%%timeit 
Model.fit(X_train, y_train)

3.84 s ± 52.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [48]:
y_pred_log = Model.predict(X_test)

In [49]:
np.save("results/y_pred_log.npy", y_pred_log)

In [78]:
# load prediction array
y_pred_log = np.load("results/y_pred_log.npy")

# Accuracy score
print('Accuracy : ',accuracy_score(y_test, y_pred_log))

# confusion matrix
print(confusion_matrix(y_test, y_pred_log))

# Summary of the predictions made by the classifier
partb = classification_report(y_test, y_pred_log)
print(partb)

Accuracy :  0.3955
[[457  55  60  29  18  35  25  55 194  72]
 [ 64 494  30  34  24  41  39  49  80 145]
 [102  43 256  88 127  87 159  68  46  24]
 [ 42  66  98 270  59 176 135  53  38  63]
 [ 54  30 150  63 285  79 169 117  29  24]
 [ 49  50  96 157  71 325  96  80  53  23]
 [ 10  41  78 132  97  73 484  32  17  36]
 [ 45  37  68  57  96  71  70 422  46  88]
 [155  67  21  29  12  49  13  18 524 112]
 [ 73 181  25  33  13  23  47  56 111 438]]
              precision    recall  f1-score   support

           0       0.43      0.46      0.45      1000
           1       0.46      0.49      0.48      1000
           2       0.29      0.26      0.27      1000
           3       0.30      0.27      0.29      1000
           4       0.36      0.28      0.32      1000
           5       0.34      0.33      0.33      1000
           6       0.39      0.48      0.43      1000
           7       0.44      0.42      0.43      1000
           8       0.46      0.52      0.49      1000
         

#### c) K-nearest Neighbors Classifier

In [54]:
Model = KNeighborsClassifier(n_neighbors=10)
Model

KNeighborsClassifier(n_neighbors=10)

In [55]:
%%timeit 
Model.fit(X_train, y_train)

365 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [56]:
y_pred_knn = Model.predict(X_test)

In [57]:
np.save("results/y_pred_knn.npy", y_pred_knn)

In [79]:
# load prediction array
y_pred_knn = np.load("results/y_pred_knn.npy")

# Accuracy score
print('Accuracy : ',accuracy_score(y_test, y_pred_knn))

# confusion matrix
print(confusion_matrix(y_test, y_pred_knn))

# Summary of the predictions made by the classifier
partc = classification_report(y_test, y_pred_knn)
print(partc)

Accuracy :  0.3824
[[563  12 100  14  49   7  33  14 204   4]
 [ 99 276  92  34 121  21 106  19 197  35]
 [109   4 429  39 222  24 114  16  40   3]
 [ 56  13 202 184 181  90 192  24  43  15]
 [ 61   2 252  26 507   7  82  24  39   0]
 [ 47   8 194 129 188 229 130  19  49   7]
 [ 18   0 210  29 247  18 453   4  18   3]
 [ 81   9 146  47 252  53  82 263  52  15]
 [126  14  40  31  59   9  25   9 674  13]
 [140  77  59  32 102  19  91  25 209 246]]
              precision    recall  f1-score   support

           0       0.43      0.56      0.49      1000
           1       0.67      0.28      0.39      1000
           2       0.25      0.43      0.31      1000
           3       0.33      0.18      0.24      1000
           4       0.26      0.51      0.35      1000
           5       0.48      0.23      0.31      1000
           6       0.35      0.45      0.39      1000
           7       0.63      0.26      0.37      1000
           8       0.44      0.67      0.53      1000
         

#### d) Perceptron

In [95]:
Model = Perceptron(tol=1e-3, random_state=1021)
Model

Perceptron(random_state=1021)

In [96]:
%%timeit
Model.fit(X_train, y_train)

1.3 s ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [97]:
y_pred_p = Model.predict(X_test)

In [98]:
np.save("results/y_pred_p.npy", y_pred_p)

In [99]:
# load prediction array
y_pred_p = np.load("results/y_pred_p.npy")

# Accuracy score
print('Accuracy : ',accuracy_score(y_test, y_pred_p))

# confusion matrix
print(confusion_matrix(y_test, y_pred_p))

# Summary of the predictions made by the classifier
partd = classification_report(y_test, y_pred_p)
print(partd)

Accuracy :  0.2262
[[218 104  28  95  37  27   9  43 131 308]
 [ 44 353  45  48  12  81  22  41 193 161]
 [ 47  66 188 136  99 156  71  63  78  96]
 [ 30  93 147 144  32 195  52  89 120  98]
 [ 43  70 228  56 147 163  69 104  56  64]
 [ 33  83 123 129  65 198  41 120 110  98]
 [ 12  47 180  71  60 256 147  78  79  70]
 [ 62  79 108  68  70 121  27 237  92 136]
 [113 117  17 109  12  23  10  20 315 264]
 [ 87 178  40  52  17  67  27  37 180 315]]
              precision    recall  f1-score   support

           0       0.32      0.22      0.26      1000
           1       0.30      0.35      0.32      1000
           2       0.17      0.19      0.18      1000
           3       0.16      0.14      0.15      1000
           4       0.27      0.15      0.19      1000
           5       0.15      0.20      0.17      1000
           6       0.31      0.15      0.20      1000
           7       0.28      0.24      0.26      1000
           8       0.23      0.32      0.27      1000
         

### 2) Report various metrics of the test data for the fitted models, such as averaged 
- precision
- recall
- f1 score
- accuracy 
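
As a small worked example of macro averaging, the sketch below scores six hand-made labels (hypothetical values, not model output). scikit-learn's metric functions take `(y_true, y_pred)` in that order, and swapping the two arguments exchanges precision and recall.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# macro averaging: compute the metric per class, then take the unweighted mean
precision = precision_score(y_true, y_hat, average="macro")
recall = recall_score(y_true, y_hat, average="macro")
f1 = f1_score(y_true, y_hat, average="macro")
accuracy = accuracy_score(y_true, y_hat)

# with the arguments swapped, "precision" is really the macro recall
swapped = precision_score(y_hat, y_true, average="macro")
print(precision, recall, f1, accuracy, swapped)
```

When every class has the same number of true samples, as in the CIFAR-10 test set, macro recall equals accuracy.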

In [108]:
# scikit-learn metrics take (y_true, y_pred) in that order
# precision
svc_pre = round(precision_score(y_test, y_pred_svc, average = "macro"),4)
log_pre = round(precision_score(y_test, y_pred_log, average = "macro"),4)
knn_pre = round(precision_score(y_test, y_pred_knn, average = "macro"),4)
p_pre = round(precision_score(y_test, y_pred_p, average = "macro"),4)

# recall 
svc_recall = round(recall_score(y_test, y_pred_svc, average = "macro"),4)
log_recall = round(recall_score(y_test, y_pred_log, average = "macro"),4)
knn_recall = round(recall_score(y_test, y_pred_knn, average = "macro"),4)
p_recall = round(recall_score(y_test, y_pred_p, average = "macro"),4)

# f1 score
svc_f1 = round(f1_score(y_test, y_pred_svc, average = "macro"),4)
log_f1 = round(f1_score(y_test, y_pred_log, average = "macro"),4)
knn_f1 = round(f1_score(y_test, y_pred_knn, average = "macro"),4)
p_f1 = round(f1_score(y_test, y_pred_p, average = "macro"),4)

# accuracy
svc_acc = accuracy_score(y_test, y_pred_svc)
log_acc = accuracy_score(y_test, y_pred_log)
knn_acc = accuracy_score(y_test, y_pred_knn)
p_acc = accuracy_score(y_test, y_pred_p)

In [111]:
data = {'Classifier':['Linear SVC', 'Logistic Regression Classifier', 'K-nearest Neighbors Classifier', 'Perceptron'],
        'Precision':[svc_pre, log_pre, knn_pre, p_pre],
        'Recall':[svc_recall, log_recall, knn_recall, p_recall],
        'F1 Score':[svc_f1, log_f1, knn_f1, p_f1],
        'Accuracy':[svc_acc, log_acc, knn_acc, p_acc]}

In [112]:
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
df

Unnamed: 0,Classifier,Precision,Recall,F1 Score,Accuracy
0,Linear SVC,0.5389,0.5395,0.5382,0.5395
1,Logistic Regression Classifier,0.391,0.3955,0.3918,0.3955
2,K-nearest Neighbors Classifier,0.4556,0.3824,0.3751,0.3824
3,Perceptron,0.2385,0.2262,0.224,0.2262


## Question 3

### 1)	There are several combinations of dimensionality reduction methods, model selection and hyperparameter values for both. Use sklearn’s GridSearchCV and Pipeline features to go over these combinations for selecting the combination that gives the best f1 score averaged over 5 folds (5-fold cross-validation). Each model in grid search takes a relatively large amount of time to train. Specifically, you need to consider the following options:
    a)	Dimensionality reduction method (2 options)
    b)	Classification model used (4 options)
    c)	For models that support only binary classification, 1-vs-1 or 1-vs-rest? (2 options)
    d)	Type of regularization (if applicable) (l1, l2)


In [115]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:

# tune k for the k-NN classifier on the PCA-reduced training data
grid = {'n_neighbors': np.arange(1,50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, grid, cv=5, scoring='f1_macro') # 5-fold CV, macro-averaged F1
knn_cv.fit(X_train, y_train) # Fit

# Print hyperparameter
print("Tuned hyperparameter k: {}".format(knn_cv.best_params_)) 
print("Best score: {}".format(knn_cv.best_score_))
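
The combinations listed in the question can be searched in one pass by wrapping the reducer and the classifier in a `Pipeline` and swapping whole steps inside the parameter grid. The sketch below is one possible layout under stated assumptions: it runs on scikit-learn's small digits dataset instead of CIFAR-10, covers only two of the four classifiers, and the step names `reduce` and `clf`, the 20 components, and the solver settings are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)   # small stand-in dataset, not CIFAR-10
X = X / 16.0                          # scale pixel values to [0, 1]

# Placeholder steps; the grid below replaces them with each candidate
pipe = Pipeline([("reduce", PCA()), ("clf", KNeighborsClassifier())])

# Each dict is one family of combinations; "step__param" reaches into a step
reducers = [PCA(n_components=20), TruncatedSVD(n_components=20)]
param_grid = [
    {"reduce": reducers,
     "clf": [KNeighborsClassifier()],
     "clf__n_neighbors": [5, 10]},
    {"reduce": reducers,
     "clf": [LogisticRegression(solver="saga", max_iter=300)],
     "clf__penalty": ["l1", "l2"]},
]

# 5-fold cross-validation, ranking candidates by macro-averaged F1
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_)
print("best mean F1:", round(search.best_score_, 4))
```

Fitting the reducer inside the pipeline means it is refitted on each training fold, so the validation fold never influences the learned components.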



### 2)	Run the grid search and describe the best pipeline you found. Report various metrics for this pipeline, such as averaged precision, recall, f1 score and accuracy on the test data. Why are they so low?