# Machine Learning Engineer Nanodegree
# Capstone
# Project: To find an Optimum Machine Learning Model for Human Activity        Recognition with Smartphone Sensor Data

This is the Capstone for MLND.
Our purpose is to recognize human activity based on the sensor data from smartphones and to find an optimum model with can predict these activities with high degree of accuracy.

In [1]:
# Import the Necessary Libraries

import time, pickle
import numpy as np
from numpy import array
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from IPython.display import display # Allows the use of display() for DataFrames
from sklearn.model_selection import train_test_split, GridSearchCV
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Use pickle to open stored objects.
with open('data.pickle','rb') as f:
    data = pickle.load(f)
with open('X_train.pickle','rb') as f:
    X_train = pickle.load(f)
with open('X_test.pickle','rb') as f:
    X_test = pickle.load(f)
with open('y_train.pickle','rb') as f:
    y_train = pickle.load(f)
with open('y_test.pickle','rb') as f:
    y_test = pickle.load(f)
with open('y_true.pickle','rb') as f:
    y_true = pickle.load(f)

## Logistic Regression Classifier:

Now we will use a Logistic Regression Classifier

In [2]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression
clf_lr= LogisticRegression(solver= 'liblinear', multi_class='ovr', verbose=22, n_jobs= -1)
start = time.clock()
clf_lr.fit(X_train, y_train)
pred_lr = clf_lr.predict(X_test)
end = time.clock()
y_pred = clf_lr.predict_proba(X_test)
print('Accuracy Score for Logistic Regression is: {}'.format(accuracy_score(y_test, pred_lr)))
print('F1 Score for Logistic Regression is: {}'.format(f1_score(y_test, pred_lr, average = 'weighted')))
print('Runtime is {} seconds'.format(end-start))
print('Log Loss Score is: {}'.format(log_loss(y_true, y_pred)))

  " = {}.".format(self.n_jobs))


[LibLinear]Accuracy Score for Logistic Regression is: 0.981877022654
F1 Score for Logistic Regression is: 0.981834941184
Runtime is 6.154393 seconds
Log Loss Score is: 0.0623213805757


### Parameter Optimization:

Now let us apply GridSearch to optimize the solver followed by C.

In [80]:
# optimize the solver and C
clf_lr= LogisticRegression(random_state = 21)
parameters = {'solver' : ['liblinear', 'sag', 'newton-cg', 'lbfgs'],
              'C':[0.001, 0.01, 1, 10, 100, 200, 500]}
grid = GridSearchCV(clf_lr, parameters, scoring = 'accuracy', n_jobs = -1, verbose=True)
grid.fit(X_train, y_train)
print("The best classifier is: ", grid.best_estimator_)
print("The best score is: ", grid.best_score_)

Fitting 3 folds for each of 28 candidates, totalling 84 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done  84 out of  84 | elapsed: 18.4min finished


('The best classifier is: ', LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=21, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False))
('The best score is: ', 0.98141212373422115)


From the above Grid Search, we get solver='newton-cg' and C = 100 as our best choice.
Let us tune C again in a range of 60 to 100.

In [87]:
# optimize the C.
clf_lr= LogisticRegression(random_state = 21)
parameters = {'solver' : ['newton-cg'],
              'C':[i*10 for i in range(6, 15)]}
grid = GridSearchCV(clf_lr, parameters, scoring = 'accuracy', n_jobs = -1, verbose=True)
grid.fit(X_train, y_train)
print("The best classifier is: ", grid.best_estimator_)
print("The best score is: ", grid.best_score_)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed: 12.2min finished


('The best classifier is: ', LogisticRegression(C=60, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=21, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False))
('The best score is: ', 0.98168955472326258)


In [88]:
# optimize the C.
clf_lr= LogisticRegression(random_state = 21)
parameters = {'solver' : ['newton-cg'],
              'C':[i*10 for i in range(1, 7)]}
grid = GridSearchCV(clf_lr, parameters, scoring = 'accuracy', n_jobs = -1, verbose=True)
grid.fit(X_train, y_train)
print("The best classifier is: ", grid.best_estimator_)
print("The best score is: ", grid.best_score_)

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  6.8min finished


('The best classifier is: ', LogisticRegression(C=50, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=21, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False))
('The best score is: ', 0.98210570120682483)


In [89]:
# optimize the C in 45 to 55
clf_lr= LogisticRegression(random_state = 21)
parameters = {'solver' : ['newton-cg'],
              'C':[i for i in range(45, 56)]}
grid = GridSearchCV(clf_lr, parameters, scoring = 'accuracy', n_jobs = -1, verbose=True)
grid.fit(X_train, y_train)
print("The best classifier is: ", grid.best_estimator_)
print("The best score is: ", grid.best_score_)

Fitting 3 folds for each of 11 candidates, totalling 33 fits


[Parallel(n_jobs=-1)]: Done  33 out of  33 | elapsed: 16.5min finished


('The best classifier is: ', LogisticRegression(C=46, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=21, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False))
('The best score is: ', 0.98210570120682483)


In [91]:
# optimize the C.
clf_lr= LogisticRegression(random_state = 21)
parameters = {'solver' : ['newton-cg'],
              'C':[(i/10) for i in range(455, 466)]}
grid = GridSearchCV(clf_lr, parameters, scoring = 'accuracy', n_jobs = -1, verbose=True)
grid.fit(X_train, y_train)
print("The best classifier is: ", grid.best_estimator_)
print("The best score is: ", grid.best_score_)

Fitting 3 folds for each of 11 candidates, totalling 33 fits


[Parallel(n_jobs=-1)]: Done  33 out of  33 | elapsed: 13.4min finished


('The best classifier is: ', LogisticRegression(C=46, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=21, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False))
('The best score is: ', 0.98210570120682483)





** _After a series of optimizing best C we found with newton-cg as the scoring function and l2 as penalty is: 46_**

Now lets fit the classifier and find out the accuracy we get.

In [111]:
clf_lr= LogisticRegression(penalty = 'l2', C = 46, solver= 'newton-cg', multi_class='ovr')
strt = time.clock()
clf_lr.fit(X_train, y_train)
pred_lr = clf_lr.predict(X_test)
end = time.clock()
y_pred = clf_lr.predict_proba(X_test)
print('Accuracy Score for Logistic Regression is: {}'.format(accuracy_score(y_test, pred_lr)))
print('F1 Score for Logistic Regression is: {}'.format(f1_score(y_test, pred_lr, average = 'weighted')))
print('Runtime is {} seconds'.format(end-start))
print('Log Loss Score is: {}'.format(log_loss(y_true, y_pred)))

Accuracy Score for Logistic Regression is: 0.988025889968
F1 Score for Logistic Regression is: 0.988019092014
Runtime is 202.429314 seconds
Log Loss Score is: 0.040566206566


### Logistic Regression: Optimum Values:

After multiple Grid Search we found the best parameters as:

* Solver= newton-cg
* C = 46

The performance of our model is as follows:

* Accuracy Score for Logistic Regression is: 0.988025889968
* F1 Score for Logistic Regression is: 0.988019092014
* Runtime is 202.429314 seconds
* Log Loss Score is: 0.040566206566

