# Introduction

The objective of this project is to classify a user's activity, based on readings from a smartphone's embedded accelerometer and gyroscope, as one of the following:
<ul>
    <li>Walking</li>
    <li>Walking upstairs</li>
    <li>Walking downstairs</li>
    <li>Sitting</li>
    <li>Standing</li>
    <li>Laying</li>
</ul>

We'll start by importing the data and examining its columns.

## Exploratory Data Analysis

We'll be making use of Python's pandas for storing and manipulating our data.

In [1]:
#Import libraries
import pandas as pd
import numpy as np

#Read from CSV file
SensorData = pd.read_csv("data/SensorData.csv", index_col=False)

#Print first 5 rows
SensorData.head()

Unnamed: 0,rn,activity,tBodyAcc.mean.X,tBodyAcc.mean.Y,tBodyAcc.mean.Z,tBodyAcc.std.X,tBodyAcc.std.Y,tBodyAcc.std.Z,tBodyAcc.mad.X,tBodyAcc.mad.Y,...,fBodyBodyGyroJerkMag.meanFreq,fBodyBodyGyroJerkMag.skewness,fBodyBodyGyroJerkMag.kurtosis,angle.tBodyAccMean.gravity,angle.tBodyAccJerkMean.gravityMean,angle.tBodyGyroMean.gravityMean,angle.tBodyGyroJerkMean.gravityMean,angle.X.gravityMean,angle.Y.gravityMean,angle.Z.gravityMean
0,7,STANDING,0.279,-0.0196,-0.11,-0.997,-0.967,-0.983,-0.997,-0.966,...,0.146,-0.217,-0.564,-0.213,-0.231,0.0146,-0.19,-0.852,0.182,-0.043
1,11,STANDING,0.277,-0.0127,-0.103,-0.995,-0.973,-0.985,-0.996,-0.974,...,0.121,0.349,0.0577,0.0807,0.596,-0.476,0.116,-0.852,0.188,-0.0347
2,14,STANDING,0.277,-0.0147,-0.107,-0.999,-0.991,-0.993,-0.999,-0.991,...,0.74,-0.564,-0.766,0.106,-0.0903,-0.132,0.499,-0.85,0.189,-0.0351
3,15,STANDING,0.298,0.0271,-0.0617,-0.989,-0.817,-0.902,-0.989,-0.794,...,0.131,0.208,-0.0681,0.0623,-0.0587,0.0312,-0.269,-0.731,0.283,0.0364
4,20,STANDING,0.276,-0.017,-0.111,-0.998,-0.991,-0.998,-0.998,-0.989,...,0.667,-0.942,-0.966,0.245,0.103,0.0661,-0.412,-0.761,0.263,0.0296


As you can see, there are 563 columns. The "rn" column is only a row identifier and "activity" is the target we want to predict, so neither is used as a predictor. That still leaves 561 feature columns, which makes this high-dimensional data. We'll therefore apply a dimensionality reduction technique before classifying.
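The column accounting above can be checked in code rather than by eye. The sketch below uses a tiny made-up frame in place of `SensorData`, and `split_columns` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy stand-in for SensorData: an id column, the label, and three features.
toy = pd.DataFrame({
    "rn": [7, 11, 14],
    "activity": ["STANDING", "SITTING", "WALKING"],
    "tBodyAcc.mean.X": [0.279, 0.277, 0.277],
    "tBodyAcc.mean.Y": [-0.0196, -0.0127, -0.0147],
    "tBodyAcc.mean.Z": [-0.110, -0.103, -0.107],
})

def split_columns(frame, id_col="rn", label_col="activity"):
    """Return the feature columns, i.e. everything except the id and the label."""
    return [c for c in frame.columns if c not in (id_col, label_col)]

feature_cols = split_columns(toy)
print(toy.shape)          # (rows, columns)
print(len(feature_cols))  # number of predictor columns
```

Applied to the real data, `len(split_columns(SensorData))` should give 561 if the counts above hold.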

Let's first look at the summary of the data available to check for column types.

In [2]:
SensorData.describe()

Unnamed: 0,rn,tBodyAcc.mean.X,tBodyAcc.mean.Y,tBodyAcc.mean.Z,tBodyAcc.std.X,tBodyAcc.std.Y,tBodyAcc.std.Z,tBodyAcc.mad.X,tBodyAcc.mad.Y,tBodyAcc.mad.Z,...,fBodyBodyGyroJerkMag.meanFreq,fBodyBodyGyroJerkMag.skewness,fBodyBodyGyroJerkMag.kurtosis,angle.tBodyAccMean.gravity,angle.tBodyAccJerkMean.gravityMean,angle.tBodyGyroMean.gravityMean,angle.tBodyGyroJerkMean.gravityMean,angle.X.gravityMean,angle.Y.gravityMean,angle.Z.gravityMean
count,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,...,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0,3609.0
mean,5152.43059,0.274544,-0.017415,-0.109195,-0.608457,-0.506265,-0.614482,-0.634634,-0.52166,-0.616047,...,0.128804,-0.300815,-0.6194,0.007561,0.009484,0.029185,-0.010632,-0.496977,0.06004,-0.050202
std,2975.767839,0.063589,0.042589,0.056218,0.439157,0.501627,0.399514,0.413194,0.485282,0.394932,...,0.240278,0.317963,0.308303,0.332249,0.448971,0.613615,0.49083,0.509336,0.311308,0.263935
min,7.0,-0.521,-1.0,-0.926,-1.0,-0.999,-1.0,-1.0,-0.999,-1.0,...,-0.786,-0.968,-0.995,-0.969,-0.997,-1.0,-0.993,-0.999,-1.0,-0.971
25%,2570.0,0.262,-0.0252,-0.122,-0.992,-0.976,-0.979,-0.993,-0.976,-0.978,...,-0.0158,-0.533,-0.836,-0.118,-0.281,-0.478,-0.398,-0.816,-0.0156,-0.122
50%,5158.0,0.277,-0.0172,-0.109,-0.939,-0.812,-0.844,-0.946,-0.816,-0.837,...,0.132,-0.341,-0.706,0.00774,0.00983,0.0296,-0.0134,-0.716,0.183,-0.00526
75%,7727.0,0.287,-0.011,-0.098,-0.254,-0.0517,-0.283,-0.306,-0.0845,-0.288,...,0.29,-0.118,-0.501,0.142,0.309,0.554,0.374,-0.522,0.252,0.104
max,10281.0,0.693,1.0,1.0,1.0,0.98,1.0,1.0,0.988,1.0,...,0.871,0.99,0.957,0.981,0.997,0.999,0.996,0.977,1.0,0.998


Since the output has 562 columns, every column except "activity" is numeric. Let's see if there are any nulls.

In [3]:
SensorData.isnull().values.any()

False

As is evident from the output, there are no null values in the data.
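If we wanted more detail than a single `True`/`False`, the same checks can be made per column. This is a minimal sketch on a made-up frame that deliberately contains missing values; the data is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and deliberately missing values.
toy = pd.DataFrame({
    "rn": [7, 11, 14],
    "activity": ["STANDING", "SITTING", None],
    "tBodyAcc.mean.X": [0.279, np.nan, 0.277],
})

# Separate numeric columns from everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count nulls per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(numeric_cols)
print(non_numeric_cols)
print(cols_with_nulls)
```

On `SensorData` we would expect `non_numeric_cols` to contain only "activity" and `cols_with_nulls` to be empty.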

# Dimensionality Reduction - Principal Component Analysis

We'll first reduce the dimensionality with Principal Component Analysis (PCA) and then build models with Decision Tree, K-Nearest Neighbors, and SVM classifiers. Before that, let's separate the predictors from the target and split our data into training (80%) and testing (20%) sets.

In [4]:
#Drop "rn" column
SensorData = SensorData.drop(columns=["rn"])

#Store the target variable as a 1-D Series (scikit-learn classifiers expect a 1-D y)
target = SensorData["activity"]

#Store the predictor variables
predictors = SensorData.drop(columns=["activity"])
predictors.head()

Unnamed: 0,tBodyAcc.mean.X,tBodyAcc.mean.Y,tBodyAcc.mean.Z,tBodyAcc.std.X,tBodyAcc.std.Y,tBodyAcc.std.Z,tBodyAcc.mad.X,tBodyAcc.mad.Y,tBodyAcc.mad.Z,tBodyAcc.max.X,...,fBodyBodyGyroJerkMag.meanFreq,fBodyBodyGyroJerkMag.skewness,fBodyBodyGyroJerkMag.kurtosis,angle.tBodyAccMean.gravity,angle.tBodyAccJerkMean.gravityMean,angle.tBodyGyroMean.gravityMean,angle.tBodyGyroJerkMean.gravityMean,angle.X.gravityMean,angle.Y.gravityMean,angle.Z.gravityMean
0,0.279,-0.0196,-0.11,-0.997,-0.967,-0.983,-0.997,-0.966,-0.983,-0.941,...,0.146,-0.217,-0.564,-0.213,-0.231,0.0146,-0.19,-0.852,0.182,-0.043
1,0.277,-0.0127,-0.103,-0.995,-0.973,-0.985,-0.996,-0.974,-0.985,-0.94,...,0.121,0.349,0.0577,0.0807,0.596,-0.476,0.116,-0.852,0.188,-0.0347
2,0.277,-0.0147,-0.107,-0.999,-0.991,-0.993,-0.999,-0.991,-0.992,-0.943,...,0.74,-0.564,-0.766,0.106,-0.0903,-0.132,0.499,-0.85,0.189,-0.0351
3,0.298,0.0271,-0.0617,-0.989,-0.817,-0.902,-0.989,-0.794,-0.888,-0.926,...,0.131,0.208,-0.0681,0.0623,-0.0587,0.0312,-0.269,-0.731,0.283,0.0364
4,0.276,-0.017,-0.111,-0.998,-0.991,-0.998,-0.998,-0.989,-0.997,-0.946,...,0.667,-0.942,-0.966,0.245,0.103,0.0661,-0.412,-0.761,0.263,0.0296


In [5]:
from sklearn.model_selection import train_test_split

#Split into 80% training and 20% testing data
#X_train - predictor variables for training set
#Y_train - target variable for training set
#X_test - predictor variables for testing set
#Y_test - target variable for testing set
X_train, X_test, Y_train, Y_test = train_test_split(predictors, target, test_size=0.2)
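The split above is random, so the accuracies reported later will vary from run to run. As a sketch of one way to make it repeatable, `train_test_split` accepts `random_state` and `stratify`. Both arguments, the seed value, and the toy data below are assumptions for illustration and were not used to produce the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 4 features, two imbalanced classes (60 / 40).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])
toy_y = pd.Series(["SITTING"] * 60 + ["WALKING"] * 40, name="activity")

# random_state fixes the shuffle; stratify keeps class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

With stratification, the 60/40 class ratio of the toy labels is preserved in the 20-row test set.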

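Now that we have our training and testing sets, we should standardize the features before we proceed with PCA. PCA looks for the directions of largest variance, so without standardization the features with the largest scales would dominate the components and could compromise the performance of our model. To do this, we use scikit-learn's StandardScaler, fitted on the training set only and then applied to both sets.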

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

#Fitting on training set
scaler.fit(X_train)

#Applying standard scaler object
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
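To make the transformation concrete: StandardScaler subtracts each feature's training mean and divides by its training standard deviation, and the test set is scaled with those same training statistics. The small arrays below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test data with two features on different scales.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

scaler = StandardScaler().fit(train)

# Manual equivalent, using statistics from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_test = (test - mean) / std

print(np.allclose(scaler.transform(test), manual_test))
```

Because the test row equals the training mean of each feature, it maps to zeros after scaling.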

The description of the features states the following:

<i>The features selected for this database come from the accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix 't' to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. Similarly, the acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using another low pass Butterworth filter with a corner frequency of 0.3 Hz. 

Subsequently, the body linear acceleration and angular velocity were derived in time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). Also the magnitude of these three-dimensional signals were calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag). 

Finally a Fast Fourier Transform (FFT) was applied to some of these signals producing fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag. (Note the 'f' to indicate frequency domain signals). 

These signals were used to estimate variables of the feature vector for each pattern:  
'-XYZ' is used to denote 3-axial signals in the X, Y and Z directions.
<ul>
<li>tBodyAcc-XYZ</li>
<li>tGravityAcc-XYZ</li>
<li>tBodyAccJerk-XYZ</li>
<li>tBodyGyro-XYZ</li>
<li>tBodyGyroJerk-XYZ</li>
<li>tBodyAccMag</li>
<li>tGravityAccMag</li>
<li>tBodyAccJerkMag</li>
<li>tBodyGyroMag</li>
<li>tBodyGyroJerkMag</li>
<li>fBodyAcc-XYZ</li>
<li>fBodyAccJerk-XYZ</li>
<li>fBodyGyro-XYZ</li>
<li>fBodyAccMag</li>
<li>fBodyAccJerkMag</li>
<li>fBodyGyroMag</li>
<li>fBodyGyroJerkMag</li>
</ul>

Additional vectors obtained by averaging the signals in a signal window sample. These are used on the angle() variable:

<ul>
<li>gravityMean</li>
<li>tBodyAccMean</li>
<li>tBodyAccJerkMean</li>
<li>tBodyGyroMean</li>
<li>tBodyGyroJerkMean</li>
</ul>
</i>

Thus, the 561 features are all derived from these 17 signals. As a rough heuristic, when applying PCA we therefore set its n_components parameter to request close to 17, or let's say 20, components after transformation, on the assumption that this many components preserves most of the information.
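One way to sanity-check a choice like 20 is to look at how much variance the leading components explain. The sketch below does this on synthetic data built from 3 hidden signals; the data and the 95% threshold are assumptions for illustration, not values taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 30 features driven by 3 latent signals plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 30))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed)
```

On our data, the same check would be `np.cumsum(pca_obj.explained_variance_ratio_)` after fitting. scikit-learn's PCA also accepts a fraction such as `n_components=0.95` to pick the number of components by explained variance.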

In [7]:
from sklearn.decomposition import PCA

#We need 20 components that preserve the information of 561 components 
pca_obj = PCA(n_components = 20)

#Fitting the PCA object on training set
pca_obj.fit(X_train)

#Applying PCA on both training and testing data
X_train = pca_obj.transform(X_train)
X_test = pca_obj.transform(X_test)

Now let's use different classification algorithms and compare their performance.

In [8]:
#Import accuracy score for performance measurement
from sklearn.metrics import accuracy_score

#Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier as DTC
dtree = DTC(max_depth = 5).fit(X_train, Y_train)
dtree_pred = dtree.predict(X_test)
 
#Percentage of correctly classified observations
accuracy_score(Y_test, dtree_pred)

0.796398891966759
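Accuracy alone doesn't show which activities get confused with one another. As a small sketch, a confusion matrix can be computed alongside the accuracy score. The labels and predictions below are made up for illustration and are not results from our model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for three activities.
labels = ["SITTING", "STANDING", "WALKING"]
y_true = ["SITTING", "SITTING", "STANDING", "STANDING", "WALKING", "WALKING"]
y_pred = ["SITTING", "STANDING", "STANDING", "STANDING", "WALKING", "WALKING"]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(cm)
```

For our model the equivalent call would be `confusion_matrix(Y_test, dtree_pred)`.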

In [9]:
#Linear SVM classifier
from sklearn.svm import SVC
svm_model = SVC(kernel = 'linear', C = 1).fit(X_train, Y_train)
svm_pred = svm_model.predict(X_test)
 
#Model accuracy for X_test  
accuracy = svm_model.score(X_test, Y_test)
print(accuracy)


0.9196675900277008


In [10]:
#K-Nearest Neighbours classification
from sklearn.neighbors import KNeighborsClassifier as KNN
knn_model = KNN(n_neighbors = 5).fit(X_train, Y_train)

# accuracy on X_test
accuracy = knn_model.score(X_test, Y_test)
print(accuracy)

0.9016620498614959



On this train/test split, the linear SVM classifier performs best with a classification accuracy of 91.97%, followed by the kNN classifier at 90.17% and the Decision Tree classifier at 79.64%.
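These numbers come from a single random split, so the ranking could shift on another split. One possible way to get a steadier comparison is to wrap scaling, PCA, and the classifier in a pipeline and cross-validate it. This is a sketch on synthetic data from `make_classification`, not a rerun of our experiment; the 5-fold setting and the seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 samples, 40 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, n_classes=3, random_state=0
)

# Same three classifiers and hyperparameters as in the cells above.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "linear SVM": SVC(kernel="linear", C=1),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, clf in models.items():
    # Scaler and PCA are refit inside each fold, so no test-fold data leaks in.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=20), clf)
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The printed values are mean accuracies over the five folds, which are less sensitive to one lucky or unlucky split.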