## Introduction
The objective of this project is to train a model that predicts which of six activities a person is performing, based on readings from the inertial sensors embedded in a waist-mounted smartphone. The original dataset is taken from the **Kaggle** website under the name *"Human Activity Recognition with Smartphones"*.
#### Problem Description :
The dataset was collected from 30 volunteers who performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a smartphone with an embedded accelerometer and gyroscope. The sensors captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The dataset description also states that the sensor signals were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec with 50% overlap (128 readings/window).
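
To make the windowing arithmetic concrete, here is a minimal sketch of how a 50 Hz signal could be cut into 2.56 sec windows with 50% overlap. The helper `sliding_windows` and the toy `signal` are hypothetical illustrations, not the dataset authors' actual pre-processing code, and the helper assumes the signal is at least one window long.

```python
import numpy as np

def sliding_windows(signal, window_size=128, overlap=0.5):
    """Cut a 1-D signal into fixed-width windows that overlap by `overlap`."""
    step = int(window_size * (1 - overlap))            # 64 samples for 50% overlap
    n_windows = (len(signal) - window_size) // step + 1
    return np.stack([signal[i * step: i * step + window_size]
                     for i in range(n_windows)])

sampling_rate_hz = 50
window_seconds = 2.56
window_size = int(round(sampling_rate_hz * window_seconds))  # 50 * 2.56 = 128 readings

# Toy signal: 10 seconds of readings (500 samples), values are just sample indices
signal = np.arange(10 * sampling_rate_hz)
windows = sliding_windows(signal, window_size)
print(windows.shape)
```

Each row of `windows` is one 128-reading window, and consecutive rows start 64 readings apart.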

##### Attribute Information
-> Triaxial acceleration from accelerometer<br>
-> Triaxial angular velocity from gyroscope<br>
-> A 561-feature vector with time and frequency domain variables<br>
-> Its activity label<br>
-> An identifier of the subject who carried out the experiment<br>

Let's start by importing relevant libraries and datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

In [10]:
from sklearn.utils import shuffle

In [2]:
train_df  = pd.read_csv('D:/Users/gajesing/Desktop/Kaggle/human-activity-recognition-with-smartphones/train.csv')
test_df  = pd.read_csv('D:/Users/gajesing/Desktop/Kaggle/human-activity-recognition-with-smartphones/test.csv')

In [3]:
train_df.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,1,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,1,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,1,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,1,STANDING


In [4]:
train_df.shape, test_df.shape

((7352, 563), (2947, 563))

In [5]:
train_df.columns

Index(['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z',
       'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z',
       'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z',
       'tBodyAcc-max()-X',
       ...
       'fBodyBodyGyroJerkMag-kurtosis()', 'angle(tBodyAccMean,gravity)',
       'angle(tBodyAccJerkMean),gravityMean)',
       'angle(tBodyGyroMean,gravityMean)',
       'angle(tBodyGyroJerkMean,gravityMean)', 'angle(X,gravityMean)',
       'angle(Y,gravityMean)', 'angle(Z,gravityMean)', 'subject', 'Activity'],
      dtype='object', length=563)

The training data has 7352 observations with 563 columns, and the test data has 2947 observations with the same columns. The first few columns hold the mean and standard deviation of body acceleration in the three spatial dimensions (X, Y, Z). The last two columns are "subject" and "Activity": the subject the observation was taken from and the corresponding activity, respectively. Let's see what activities have been recorded in this data.

In [7]:
print('Training Labels :\n', train_df.Activity.unique())
print('Test Labels :\n', test_df.Activity.unique())

Training Labels :
 ['STANDING' 'SITTING' 'LAYING' 'WALKING' 'WALKING_DOWNSTAIRS'
 'WALKING_UPSTAIRS']
Test Labels :
 ['STANDING' 'SITTING' 'LAYING' 'WALKING' 'WALKING_DOWNSTAIRS'
 'WALKING_UPSTAIRS']


We have 6 activities, 3 passive (laying, standing and sitting) and 3 active (walking, walking_downstairs, walking_upstairs). So, each observation in the dataset represents one of the six activities, described by the first 561 variables. Our goal is to train a model to predict one of the six activities given a feature set of these 561 variables.

Next we should check for missing or null values, which most models do not accept. If we find any, we have to replace them using an imputation technique.

In [9]:
print('Training set : ',train_df.isnull().sum().sum())
print('Test set : ',test_df.isnull().sum().sum())

Training set :  0
Test set :  0


No missing data! Cool :) So we don't need to impute anything.
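
Had the check above found missing values, one simple option would have been median imputation. The sketch below uses two toy frames (`toy_train` and `toy_test` are hypothetical, not the HAR data) and takes the medians from the training frame only, so nothing from the test frame influences the fill values.

```python
import numpy as np
import pandas as pd

# Toy frames with gaps (hypothetical data for illustration)
toy_train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
toy_test = pd.DataFrame({'a': [np.nan, 2.0], 'b': [6.0, np.nan]})

# Column medians computed on the training frame only (NaNs are skipped)
train_medians = toy_train.median()

# Fill both frames with the training medians
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)

print('Training set : ', toy_train_filled.isnull().sum().sum())
print('Test set : ', toy_test_filled.isnull().sum().sum())
```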

Let's shuffle our training and test datasets to remove any ordering patterns and make sure that our models remain general.

In [12]:
train_df = shuffle(train_df)
test_df = shuffle(test_df)

We don't need to split our dataset into training and test data, because the data we have is already split into two sets. So we'll train our models on the training set and test them on the test set.

In [17]:
X_train = train_df.iloc[:, :-2].values
Y_train = train_df.iloc[:, -1].values
X_test = test_df.iloc[:, :-2].values
Y_test = test_df.iloc[:, -1].values

In [18]:
X_train.shape, Y_train.shape

((7352, 561), (7352,))

In [19]:
X_test.shape, Y_test.shape

((2947, 561), (2947,))

Some classification models (like LR, KNN, SVC, MLP) are sensitive to the scale of the attributes. Though all our attributes have almost the same scale, we can still standardize the dataset so that these models learn faster. We'll standardize based on the mean and standard deviation of the training set only, because we should not touch the test set before fitting the model. Otherwise **data leakage** can occur and make our evaluation optimistically biased.

In [20]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
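
To show what the two calls above compute, here is a minimal NumPy sketch of the same idea on random stand-in arrays (`X_tr` and `X_te` are hypothetical, not the HAR features): the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # stand-in training features
X_te = rng.normal(loc=5.0, scale=2.0, size=(40, 3))   # stand-in test features

# "fit": per-column statistics from the training set only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

# "transform": apply the training statistics to both sets
X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma

print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
```

The standardized training columns have mean 0 and standard deviation 1 by construction; the test columns will only be close to that, which is expected.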

Here I am using 9 frequently used models and evaluating the mean accuracy of each across the validation folds of stratified k-fold cross-validation.
1. Logistic Regression
2. Decision Tree Classifier
3. Support Vector Classifier
4. K Nearest Neighbors Classifier
5. Gradient Boosting Classifier
6. Random Forest Classifier
7. AdaBoost Classifier
8. Naive Bayes Classifier
9. Multi Layer Perceptron

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Here we train all the models without tuning their parameters and evaluate them with cross-validation. Out of these 9 models, we'll tune the best 4 and then choose the best of those 4.

In [22]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('Tree', DecisionTreeClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVC', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('Forest', RandomForestClassifier()))
models.append(('GBoost', GradientBoostingClassifier()))
models.append(('AdaBoost', AdaBoostClassifier()))
models.append(('MLP', MLPClassifier()))

In [25]:
names = []
scores = []
for name, model in models:
    kfold = StratifiedKFold(n_splits = 10, random_state = 0)
    start = time.time()
    val_score = cross_val_score(model, X_train_std, Y_train, cv = kfold, scoring = 'accuracy')
    end = time.time()
    scores.append(val_score)
    names.append(name)
    print('%s : %f (%f) in %f sec.' %(name, val_score.mean(), val_score.std(), end - start))

LR : 0.985854 (0.002666) in 88.377260 sec.
Tree : 0.944369 (0.008752) in 36.274448 sec.
KNN : 0.964644 (0.006788) in 35.858841 sec.
SVC : 0.978504 (0.005393) in 60.175031 sec.
GNB : 0.736361 (0.031642) in 2.026816 sec.
Forest : 0.971848 (0.007267) in 9.602464 sec.
GBoost : 0.989249 (0.005825) in 2068.713859 sec.
AdaBoost : 0.544886 (0.000503) in 217.487000 sec.
MLP : 0.985177 (0.003950) in 55.651362 sec.
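
One caveat about the loop above: it scores models on `X_train_std`, which was standardized using all training rows, so each validation fold has already influenced the scaler. Below is a hedged sketch of one way to keep the scaling inside each fold, using a scikit-learn pipeline on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical). It also passes `shuffle=True`, which recent scikit-learn releases require whenever `random_state` is set on `StratifiedKFold`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / Y_train (not the HAR data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# random_state only takes effect (and is only accepted) together with shuffle=True
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler is re-fit on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='accuracy')
print('%f (%f)' % (fold_scores.mean(), fold_scores.std()))
```

The same pipeline object could be passed to `GridSearchCV` in place of a bare estimator, with parameter names prefixed by the step name.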


We found that the **Logistic Regression**, **Gradient Boosting**, **SVC** and **MLP** classifiers are the four most accurate models in 10-fold cross-validation. So we can tune the parameters for these 4 models.
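
The selection can be reproduced by sorting the mean cross-validation accuracies printed above. This small sketch copies those numbers into a dictionary by hand (the name `cv_results` is just for illustration):

```python
# Mean 10-fold CV accuracies copied from the output above
cv_results = {
    'LR': 0.985854, 'Tree': 0.944369, 'KNN': 0.964644,
    'SVC': 0.978504, 'GNB': 0.736361, 'Forest': 0.971848,
    'GBoost': 0.989249, 'AdaBoost': 0.544886, 'MLP': 0.985177,
}

# Sort by accuracy, best first, and keep the top four
ranked = sorted(cv_results.items(), key=lambda kv: kv[1], reverse=True)
top_four = [name for name, _ in ranked[:4]]
print(top_four)
```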

### Model - 1. Logistic Regression  

In [26]:
from sklearn.model_selection import GridSearchCV

In [27]:
lr = LogisticRegression()
kfold = StratifiedKFold(n_splits = 10, random_state = 0)
grid_values = {'penalty' : ['l1', 'l2'], 'C' : [0.1, 0.3, 1]}
lr_tuned = GridSearchCV(lr, grid_values, cv = kfold, scoring = 'accuracy', n_jobs = -1)
lr_tuned.fit(X_train_std, Y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=0, shuffle=False),
       error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 0.3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [28]:
lr_tuned.best_params_

{'C': 1, 'penalty': 'l1'}

In [29]:
lr_tuned.best_score_

0.9878944504896626

### Model - 2. MLP Classifier

In [30]:
mlp = MLPClassifier(verbose = True)
kfold = StratifiedKFold(n_splits = 10, random_state = 0)
grid_values = {'hidden_layer_sizes' : [(100,), (180, 60), (100, 50)], 'activation' : ['relu', 'logistic', 'tanh'], 'alpha' : [0.0001, 0.0003]}
mlp_tuned = GridSearchCV(mlp, grid_values, cv = kfold, scoring = 'accuracy', n_jobs = -1)
mlp_tuned.fit(X_train_std, Y_train)

Iteration 1, loss = 0.46919331
Iteration 2, loss = 0.11730976
Iteration 3, loss = 0.08071720
Iteration 4, loss = 0.05950913
Iteration 5, loss = 0.05009980
Iteration 6, loss = 0.04071802
Iteration 7, loss = 0.03651860
Iteration 8, loss = 0.03581308
Iteration 9, loss = 0.03658988
Iteration 10, loss = 0.02573168
Iteration 11, loss = 0.02703680
Iteration 12, loss = 0.02810736
Iteration 13, loss = 0.01804405
Iteration 14, loss = 0.01510275
Iteration 15, loss = 0.01253899
Iteration 16, loss = 0.01030649
Iteration 17, loss = 0.01012733
Iteration 18, loss = 0.00691303
Iteration 19, loss = 0.00544229
Iteration 20, loss = 0.00491143
Iteration 21, loss = 0.00369778
Iteration 22, loss = 0.00314947
Iteration 23, loss = 0.00320445
Iteration 24, loss = 0.00454467
Iteration 25, loss = 0.00314914
Training loss did not improve more than tol=0.000100 for two consecutive epochs. Stopping.


GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=0, shuffle=False),
       error_score='raise',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=True, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'hidden_layer_sizes': [(100,), (180, 60), (100, 50)], 'activation': ['relu', 'logistic', 'tanh'], 'alpha': [0.0001, 0.0003]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [31]:
mlp_tuned.best_params_

{'activation': 'relu', 'alpha': 0.0003, 'hidden_layer_sizes': (180, 60)}

In [32]:
mlp_tuned.best_score_

0.9874863982589771

### Model - 3. Gradient Boosted Classifier

In [33]:
gboost = GradientBoostingClassifier()
kfold = StratifiedKFold(n_splits = 10, random_state = 0)
grid_values = {'n_estimators' : [100, 200, 300], 'max_depth' : [3, 4, 5], 'max_features' : ['auto', 'log2', None]}
gboost_tuned = GridSearchCV(gboost, grid_values, cv = kfold, scoring = 'accuracy', n_jobs = -1, verbose = 1)
gboost_tuned.fit(X_train_std, Y_train)

Fitting 10 folds for each of 27 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 70.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 348.7min
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed: 476.4min finished


GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=0, shuffle=False),
       error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [100, 200, 300], 'max_depth': [3, 4, 5], 'max_features': ['auto', 'log2', None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [34]:
gboost_tuned.best_params_

{'max_depth': 4, 'max_features': None, 'n_estimators': 200}

In [35]:
gboost_tuned.best_score_

0.9923830250272034

### Model - 4. SVC 

In [42]:
svc = SVC()
kfold = StratifiedKFold(n_splits = 10, random_state = 0)
grid_values = {'kernel' : ['linear', 'rbf'], 'C' : [0.03, 0.1, 0.3, 1]}
svc_tuned = GridSearchCV(svc, grid_values, cv = kfold, scoring = 'accuracy', n_jobs = -1, verbose = 1)
svc_tuned.fit(X_train_std, Y_train)

Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  9.8min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed: 15.0min finished


GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=0, shuffle=False),
       error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'kernel': ['linear', 'rbf'], 'C': [0.03, 0.1, 0.3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [43]:
svc_tuned.best_params_

{'C': 0.1, 'kernel': 'linear'}

In [44]:
svc_tuned.best_score_

0.9862622415669206
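
The "candidates" and "fits" counts in the grid search logs follow directly from the grid sizes: every combination of parameter values is one candidate, and each candidate is fit once per fold. A minimal sketch of that arithmetic, where `count_fits` is a hypothetical helper and not part of scikit-learn:

```python
from itertools import product

def count_fits(param_grid, n_splits):
    """Return (number of candidates, total number of fits) for a grid search."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * n_splits

gboost_grid = {'n_estimators': [100, 200, 300], 'max_depth': [3, 4, 5],
               'max_features': ['auto', 'log2', None]}
svc_grid = {'kernel': ['linear', 'rbf'], 'C': [0.03, 0.1, 0.3, 1]}

print(count_fits(gboost_grid, n_splits=10))
print(count_fits(svc_grid, n_splits=10))
```

This is why the gradient boosting search, with 27 candidates of an already slow model, took far longer than the SVC search.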

Now we can test our four tuned models and choose one of them as the final model. Let's see how accurate these models are on the training and test sets.

In [46]:
print('Training Set :\n')
print('Logistic Regression : ',accuracy_score(Y_train, lr_tuned.predict(X_train_std)))
print('MLP Classifier : ',accuracy_score(Y_train, mlp_tuned.predict(X_train_std)))
print('Gradient Boosted Classifier : ',accuracy_score(Y_train, gboost_tuned.predict(X_train_std)))
print('SVC : ', accuracy_score(Y_train, svc_tuned.predict(X_train_std)))
print('\nTest Set :\n')
print('Logistic Regression : ',accuracy_score(Y_test, lr_tuned.predict(X_test_std)))
print('MLP Classifier : ',accuracy_score(Y_test, mlp_tuned.predict(X_test_std)))
print('Gradient Boosted Classifier : ',accuracy_score(Y_test, gboost_tuned.predict(X_test_std)))
print('SVC : ', accuracy_score(Y_test, svc_tuned.predict(X_test_std)))
print('\nClassification Report of Test Set : ')
print('Logistic Regression \n: ',classification_report(Y_test, lr_tuned.predict(X_test_std)))
print('MLP Classifier : \n',classification_report(Y_test, mlp_tuned.predict(X_test_std)))
print('Gradient Boosted Classifier : \n',classification_report(Y_test, gboost_tuned.predict(X_test_std)))
print('SVC :\n ', classification_report(Y_test, svc_tuned.predict(X_test_std)))

Training Set :

Logistic Regression :  0.9952393906420022
MLP Classifier :  1.0
Gradient Boosted Classifier :  1.0
SVC :  0.9926550598476604

Test Set :

Logistic Regression :  0.9599592806243638
MLP Classifier :  0.9450288428910757
Gradient Boosted Classifier :  0.9314557176789956
SVC :  0.9616559212758737

Classification Report of Test Set : 
Logistic Regression 
:                      precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       537
           SITTING       0.97      0.86      0.91       491
          STANDING       0.89      0.97      0.93       532
           WALKING       0.95      1.00      0.97       496
WALKING_DOWNSTAIRS       1.00      0.98      0.99       420
  WALKING_UPSTAIRS       0.97      0.95      0.96       471

       avg / total       0.96      0.96      0.96      2947

MLP Classifier : 
                     precision    recall  f1-score   support

            LAYING       1.00      0.96      0.98       537
      

Here we can conclude that the **MLP** and **Gradient Boosting** classifiers fit the training set best, whereas **LR** and **SVC** are competitive on the training set and better on the test set. The MLP and GBoost classifiers have more variance than LR and SVC.
**SVC** with parameters *C* = 0.1 and *kernel* = linear is the most generalized and accurate model. Tuning its parameters raised its 10-fold cross-validation accuracy from about 97.9% to 98.6%.
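
The variance claim can be checked by looking at the gap between training and test accuracy for each tuned model. The sketch below copies the accuracies from the output above by hand (`train_acc`, `test_acc` and `gaps` are illustrative names):

```python
# Accuracies copied from the training/test output above
train_acc = {'LR': 0.9952393906420022, 'MLP': 1.0,
             'GBoost': 1.0, 'SVC': 0.9926550598476604}
test_acc = {'LR': 0.9599592806243638, 'MLP': 0.9450288428910757,
            'GBoost': 0.9314557176789956, 'SVC': 0.9616559212758737}

# A larger train-minus-test gap suggests more overfitting
gaps = {name: train_acc[name] - test_acc[name] for name in train_acc}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print('%s : %.4f' % (name, gap))
```

SVC has the smallest gap and Gradient Boosting the largest, which is consistent with the ranking on the test set.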

So our best model, **SVC**, predicted the training set with **99.3%** accuracy and the test set with **96.2%** accuracy.