# Principle component analysis for data compression

In the following we use the concept of principle component analysis (PCA). For information on PCA, we refer to https://en.wikipedia.org/wiki/Principal_component_analysis and for a discussion on the python implementation see https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60.

The data to be analyzed captures mobile phone user motion information and can be downloaded from: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones/data.

# Training models after PCA

In this notebook, we seek to use the results of the PCA to train various classification models (logistic regression, kNN, SVM, Naive Bayes, Decision Tree and Random Forest).

In [23]:
#importing necessary packages

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

%matplotlib inline
import matplotlib.pyplot as plt

import time

from helper import plot_classifier #helper.py is saved in the repository

In [2]:
train = pd.read_csv("./train.csv.bz2")
test = pd.read_csv("./test.csv.bz2")

In [3]:
#define variables
X_train = train.drop("subject", axis = 1).drop("Activity", axis = 1) #drop two last columns
Y_train = train["Activity"]

X_test = test.drop("subject", axis = 1).drop("Activity", axis = 1) #drop two last columns
Y_test = test["Activity"]

In [4]:
#rescale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [5]:
#proceed with PCA (reduce dimensions)
pca = PCA() #PCA for all components!
pca.fit(X_train) #you are fitting PCA on the training set only

X_transformed = pca.transform(X_train)

In [6]:
print(pca.explained_variance_ratio_[:10]) #to give the variance ratios for the first 10 pcs

print("The first 10 principle components account for " + str(100*np.sum(pca.explained_variance_ratio_[:10])) + " percent of the overall variance.")

[0.50781172 0.0658068  0.02806437 0.02503953 0.01888285 0.01724006
 0.01371011 0.01199078 0.0099586  0.00965087]
The first 10 principle components account for 70.81556875869077 percent of the overall variance.


In [7]:
#print pcs
#pca.components_[0] #for the first pc
pca.components_[0].shape

(561,)

### Training of a logistic regression model after PCA

In [26]:
#proceed with PCA (reduce dimensions)
pca = PCA(n_components = 10) #break down 561 columns/axes down to 10!
pca.fit(X_train) #you are fitting PCA on the training set only
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)

In [27]:
#train logistic regression model on PCA reduced data
#one - vs - all classification

start_time = time.time()

classifier = LogisticRegression()

classifier.fit(X_train_transformed, Y_train)

Y_predicted = classifier.predict(X_test_transformed) #to predict if successful or not

#score the quality of the prediction
print(classifier.score(X_test_transformed, Y_test))

print("--- %s seconds ---" % (time.time() - start_time))

0.5829657278588395
--- 0.06118583679199219 seconds ---




In [28]:
#train logistic regression model on original data
#one - vs - all classification

start_time = time.time()

classifier2 = LogisticRegression()
classifier2.fit(X_train, Y_train)

print(classifier2.score(X_test, Y_test))

print("--- %s seconds ---" % (time.time() - start_time))



0.9643705463182898
--- 11.682186841964722 seconds ---


We observe that data compression achieved by PCA leads to drastic speed improvement for the learning algorithms.

### Training of a SVM model after PCA

In [31]:
#proceed with PCA (reduce dimensions)
pca = PCA(n_components = 10) #break down 561 columns/axes down to 10!
pca.fit(X_train) #you are fitting PCA on the training set only
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)

In [32]:
#train SVM RBF kernel classifier on PCA reduced data
from sklearn.svm import SVC

start_time = time.time()

#Kernel form K(x,x')=exp(-||x-x'||^2/(2gamma^2))
#gamma large: kernel strongly peaked, gamma small: kernel slowly decaying wide peak
#C value as for linear kernels modulates emphasis of landmark data points
classifier = SVC(kernel = "rbf", gamma = 1, C = 1)

classifier.fit(X_train_transformed, Y_train)

Y_predicted = classifier.predict(X_test_transformed)

print("Score: " + str(classifier.score(X_test_transformed, Y_test)))

print("--- %s seconds ---" % (time.time() - start_time))

Score: 0.35493722429589414
--- 7.8921730518341064 seconds ---


In [34]:
#train SVM RBF kernel classifier on original data

start_time = time.time()

classifier2 = SVC(kernel = "rbf", gamma = 1, C = 1)
classifier2.fit(X_train, Y_train)

print(classifier2.score(X_test, Y_test))

print("--- %s seconds ---" % (time.time() - start_time))

0.18221920597217509
--- 95.60716915130615 seconds ---


### Training of a kNN model after PCA

In [38]:
#proceed with PCA (reduce dimensions)
pca = PCA(n_components = 10) #break down 561 columns/axes down to 10!
pca.fit(X_train) #you are fitting PCA on the training set only
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)

In [39]:
#train KNN classifier on PCA reduced data
from sklearn.neighbors import KNeighborsClassifier

start_time = time.time()

classifier = KNeighborsClassifier(n_neighbors = 5) #standard is n_neighbors = 5

classifier.fit(X_train_transformed, Y_train)

Y_predicted = classifier.predict(X_test_transfored)

print("Score: " + str(classifier.score(X_test_transformed, Y_test)))

print("--- %s seconds ---" % (time.time() - start_time))

Score: 0.8171021377672208
--- 0.20365166664123535 seconds ---


In [40]:
#train KNN classifier on original data

start_time = time.time()

classifier2 = KNeighborsClassifier(n_neighbors = 5)
classifier2.fit(X_train, Y_train)

print(classifier2.score(X_test, Y_test))

print("--- %s seconds ---" % (time.time() - start_time))

0.8917543264336614
--- 22.89915108680725 seconds ---


### Training of a decision tree model after PCA

In [41]:
#proceed with PCA (reduce dimensions)
pca = PCA(n_components = 10) #break down 561 columns/axes down to 10!
pca.fit(X_train) #you are fitting PCA on the training set only
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)

In [42]:
#train decision tree classifier on PCA reduced data
from sklearn.tree import DecisionTreeClassifier

start_time = time.time()

classifier = DecisionTreeClassifier(criterion = "entropy")

classifier.fit(X_train_transformed, Y_train)

Y_predicted = classifier.predict(X_test_transformed)

print("Score: " + str(classifier.score(X_test_transformed, Y_test)))

print("--- %s seconds ---" % (time.time() - start_time))

Score: 0.7689175432643366
--- 0.2121567726135254 seconds ---


In [43]:
#train decision tree classifier on original data

start_time = time.time()

classifier2 = DecisionTreeClassifier(criterion = "entropy")
classifier2.fit(X_train, Y_train)

print(classifier2.score(X_test, Y_test))

print("--- %s seconds ---" % (time.time() - start_time))

0.8544282321004412
--- 7.132295846939087 seconds ---


### Training of a random forest model after PCA

In [None]:
#proceed with PCA (reduce dimensions)
pca = PCA(n_components = 10) #break down 561 columns/axes down to 10!
pca.fit(X_train) #you are fitting PCA on the training set only
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)

In [44]:
#train random forest classifier on PCA reduced data
from sklearn.ensemble import RandomForestClassifier

start_time = time.time()

classifier = RandomForestClassifier(criterion = "entropy", n_estimators = 10)

classifier.fit(X_train_transformed, Y_train)

Y_predicted = classifier.predict(X_test_transformed)

print("Score: " + str(classifier.score(X_test_transformed, Y_test)))

print("--- %s seconds ---" % (time.time() - start_time))

Score: 0.7994570749915167
--- 0.4485588073730469 seconds ---


In [45]:
#train random forest classifier on original data

start_time = time.time()

classifier2 = RandomForestClassifier(criterion = "entropy", n_estimators = 10)
classifier2.fit(X_train, Y_train)

print(classifier2.score(X_test, Y_test))

print("--- %s seconds ---" % (time.time() - start_time))

0.8866644044791313
--- 2.3397629261016846 seconds ---


### Training of a Naive Bayes (GaussianNB) model after PCA

In [46]:
#proceed with PCA (reduce dimensions)
pca = PCA(n_components = 10) #break down 561 columns/axes down to 10!
pca.fit(X_train) #you are fitting PCA on the training set only
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)

In [54]:
#train naive Bayes classifier using GaussianNB on PCA reduced data
from sklearn.naive_bayes import GaussianNB

start_time = time.time()

classifier = GaussianNB()

classifier.fit(X_train_transformed, Y_train)

Y_predicted = classifier.predict(X_test_transformed)

print("Score: " + str(classifier.score(X_test_transformed, Y_test)))

print("--- %s seconds ---" % (time.time() - start_time))

Score: 0.7923311842551748
--- 0.03618884086608887 seconds ---


In [55]:
#train naive Bayes classifier on original data

start_time = time.time()

classifier2 = GaussianNB()
classifier2.fit(X_train, Y_train)

print(classifier2.score(X_test, Y_test))

print("--- %s seconds ---" % (time.time() - start_time))

0.5714285714285714
--- 0.2728719711303711 seconds ---


We observe that the training on the PCA reduced data is much faster. In most cases it gives even better results than training on the original data. Results in this notebook could be improved by increasing the number of principle components to capture more of the variance of the original data (while still trying to train faster). Standardly, hyperparameter optimization with grid search could be mounted to further improve results.