## PCA Analysis

In this notebook, PCA is fitted on the training images only, and the fitted model is then used to transform both the training and test sets. The number of principal components is chosen so that the retained components explain 99% of the variance in the training data. This reduces the number of features per image from 460 $\times$ 700 $\times$ 3 = 966,000 to 1,346, which makes training the classifier far cheaper while preserving the main structure of the training images. Finally, the PCA-transformed arrays and the target labels are saved to a NumPy `.npz` archive.

In [1]:
import cv2
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

In [None]:
df_train = pd.read_csv('train_data.csv')
df_test  = pd.read_csv('test_data.csv')

In [9]:
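dst_folder = './Images/'
img_list = []

# Read every training image listed in the CSV (cv2.imread returns BGR arrays)
for index, row in df_train.iterrows():
    file_name = dst_folder + row['Image_Id'] + '.png'
    img = cv2.imread(file_name)
    if img is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(file_name)
    img_list.append(img)

# Stack into (n_images, height, width, 3), then flatten each image to one row
X_train = np.stack(img_list, axis=0)
X_train = X_train.reshape(len(X_train), -1)

del img_list

# Convert to float64 and scale pixel values to [0, 1]
X_train = X_train.astype('float')
X_train /= 255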

In [10]:
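dst_folder = './Images/'
img_list = []

# Read every test image listed in the CSV, exactly as for the training set
for index, row in df_test.iterrows():
    file_name = dst_folder + row['Image_Id'] + '.png'
    img = cv2.imread(file_name)
    if img is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(file_name)
    img_list.append(img)

# Stack into (n_images, height, width, 3), then flatten each image to one row
X_test = np.stack(img_list, axis=0)
X_test = X_test.reshape(len(X_test), -1)

del img_list

# Convert to float64 and scale pixel values to [0, 1]
X_test = X_test.astype('float')
X_test /= 255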
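
Casting with `astype('float')` produces float64 arrays, so the flattened image matrices are large. The rough estimate below uses the shapes reported in this notebook (1510 training and 464 test images, 966,000 features each) to compare float64 with float32. The helper name `size_gib` is hypothetical, and whether float32 precision is acceptable for this dataset is an assumption that would need to be checked.

```python
import numpy as np

# Shapes taken from the notebook: images are 460 x 700 pixels with 3 channels.
n_features = 460 * 700 * 3          # 966,000 values per flattened image
n_train, n_test = 1510, 464         # row counts of Z_train and Z_test


def size_gib(n_rows, n_cols, dtype):
    """Return the in-memory size of a dense (n_rows, n_cols) array in GiB."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 2**30


for name, n_rows in [("train", n_train), ("test", n_test)]:
    for dtype in (np.float64, np.float32):
        gib = size_gib(n_rows, n_features, dtype)
        print(f"{name:5s} {np.dtype(dtype).name:8s} {gib:6.2f} GiB")

# The same scaling step on a tiny stand-in array, kept in float32.
tiny = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = tiny.astype(np.float32) / 255
```

The float64 training matrix alone comes to roughly 10.9 GiB, which is probably why the notebook deletes `img_list` and, later, `X_train` and `X_test` as soon as they are no longer needed.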

In [11]:
y_train = np.array(df_train['Label'])
y_test  = np.array(df_test['Label'])

In [12]:
pca = PCA(n_components=0.99)
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [13]:
pca.n_components_

1346
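
Passing a fraction such as `n_components=0.99` asks scikit-learn to keep the smallest number of components whose cumulative explained-variance ratio exceeds that fraction. The sketch below reproduces that selection by hand. It runs on small synthetic data rather than the images, and the data and the names `full_pca`, `auto_pca`, and `k_manual` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (an assumption): 5 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 40))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 40))

# Fit with every component kept to inspect the full variance spectrum.
full_pca = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.99.
k_manual = int(np.searchsorted(cumulative, 0.99, side="right")) + 1

# Let scikit-learn make the same choice from a fractional n_components.
auto_pca = PCA(n_components=0.99, svd_solver="full").fit(X)
print(k_manual, auto_pca.n_components_)
```

Plotting `cumulative` against the component index is a common way to judge whether a lower threshold, such as 95%, would give a much smaller representation.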

In [14]:
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)
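
Because the PCA is fitted on the training set only, the test images are centered with the training mean and projected onto the training principal axes, so no test-set statistics leak into the representation. The sketch below illustrates this with the default `whiten=False`. The random matrices `X_tr` and `X_te` and the name `pca_demo` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_tr = rng.normal(size=(100, 20))   # hypothetical training matrix
X_te = rng.normal(size=(30, 20))    # hypothetical test matrix

# All statistics (mean and principal axes) come from the training data only.
pca_demo = PCA(n_components=5).fit(X_tr)

# With whiten=False, transform() centers with the training mean and projects
# onto the principal axes learned from the training data.
Z_te_manual = (X_te - pca_demo.mean_) @ pca_demo.components_.T
Z_te = pca_demo.transform(X_te)

print(np.allclose(Z_te, Z_te_manual))
```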

In [15]:
del X_train
del X_test

In [16]:
Z_train.shape

(1510, 1346)

In [17]:
Z_test.shape

(464, 1346)

In [28]:
np.savez('pca_data', Z_train, Z_test, y_train, y_test)
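
When arrays are passed to `np.savez` positionally, as above, they are stored under the keys `arr_0`, `arr_1`, and so on, and the file is written as `pca_data.npz` without compression. The sketch below shows how such an archive could be read back, along with a possible variant using keyword names and `np.savez_compressed`. The small placeholder arrays and the temporary directory are assumptions made to keep the example self-contained.

```python
import os
import tempfile
import numpy as np

# Small placeholder arrays standing in for Z_train, Z_test, y_train, y_test.
Z_tr = np.arange(6, dtype=float).reshape(3, 2)
Z_te = np.arange(4, dtype=float).reshape(2, 2)
y_tr = np.array([0, 1, 0])
y_te = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pca_data")      # np.savez appends ".npz"

    # Positional arrays are stored under the keys arr_0, arr_1, ...
    np.savez(path, Z_tr, Z_te, y_tr, y_te)
    with np.load(path + ".npz") as data:
        positional_keys = list(data.files)
        restored = data["arr_0"]

    # Keyword arguments give self-describing keys; savez_compressed also
    # applies compression to the archive.
    np.savez_compressed(path + "_named", Z_train=Z_tr, Z_test=Z_te,
                        y_train=y_tr, y_test=y_te)
    with np.load(path + "_named.npz") as data:
        named_keys = sorted(data.files)
        Z_train_loaded = data["Z_train"]

print(positional_keys)
print(named_keys)
```

Named keys make a downstream notebook less dependent on remembering the order in which the arrays were saved.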