# Machine Learning Engineer Nanodegree
# Capstone
# Project: To find an Optimum Machine Learning Model for Human Activity Recognition with Smartphone Sensor Data

This is the capstone project for the Machine Learning Engineer Nanodegree (MLND).
Our purpose is to recognize human activity from smartphone sensor data and to find an optimum model that can predict these activities with a high degree of accuracy.

In [1]:
# Import the Necessary Libraries

import time, pickle
import numpy as np
from numpy import array
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from IPython.display import display # Allows the use of display() for DataFrames
from sklearn.model_selection import train_test_split, GridSearchCV
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Use pickle to open stored objects.
with open('data.pickle','rb') as f:
    data = pickle.load(f)
with open('X_train.pickle','rb') as f:
    X_train = pickle.load(f)
with open('X_test.pickle','rb') as f:
    X_test = pickle.load(f)
with open('y_train.pickle','rb') as f:
    y_train = pickle.load(f)
with open('y_test.pickle','rb') as f:
    y_test = pickle.load(f)
with open('y_true.pickle','rb') as f:
    y_true = pickle.load(f)

## LightGBM Classifier:

Now let us use the LightGBM classifier for our task. First we will use the classifier with its default parameter values, and then tune its parameters to obtain better accuracy and log loss scores. Let's see how good a score we can obtain with the LGBM classifier :)

In [2]:
# Import LightGBM Classifier for our Task
from lightgbm import LGBMClassifier
# Define the classifier
clf_lgbm = LGBMClassifier()
clf_lgbm

LGBMClassifier(boosting_type='gbdt', colsample_bytree=1.0, learning_rate=0.1,
        max_bin=255, max_depth=-1, min_child_samples=10,
        min_child_weight=5, min_split_gain=0.0, n_estimators=10, n_jobs=-1,
        num_leaves=31, objective=None, random_state=0, reg_alpha=0.0,
        reg_lambda=0.0, silent=True, subsample=1.0,
        subsample_for_bin=50000, subsample_freq=1)

In [201]:
import warnings
warnings.simplefilter('ignore')

In [205]:
from sklearn.metrics import accuracy_score

In [209]:
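# time.clock() was removed in Python 3.8; time.perf_counter() measures elapsed wall-clock time
start = time.perf_counter()
clf_lgbm.fit(X_train, y_train)
# Make predictions
preds = clf_lgbm.predict(X_test)
end = time.perf_counter()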
y_pred = clf_lgbm.predict_proba(X_test)
print('Accuracy Score for LGBM Classifier is: {}'.format(accuracy_score(y_test, preds)))
print('F1 Score for LGBM Classifier is: {}'.format(f1_score(y_test, preds, average = 'weighted')))
print('Runtime is {} seconds'.format(end-start))
print('Log Loss Score is: {}'.format(log_loss(y_true, y_pred)))

Accuracy Score for LGBM Classifier is: 0.96957928802589
F1 Score for LGBM Classifier is: 0.9695155916317623
Runtime is 406.51 seconds
Log Loss Score is: 0.595676708696094
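
Accuracy only looks at which class gets the highest probability, while log loss also penalises low confidence in the correct class. One possible reading of the numbers above (an assumption, not something this notebook verifies) is that with only the 10 boosting rounds shown in the default parameters, the predicted probabilities are still far from 0 or 1. The sketch below illustrates the effect in pure Python; the helpers `multiclass_log_loss` and `accuracy` and the toy probabilities are made up for illustration.

```python
import math

def multiclass_log_loss(labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(labels, probs):
        p = min(max(row[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(labels)

def accuracy(labels, probs):
    """Fraction of samples whose highest-probability class is the true class."""
    hits = sum(1 for label, row in zip(labels, probs) if row.index(max(row)) == label)
    return hits / len(labels)

# Toy data (hypothetical): three samples, three classes, every sample classified correctly.
toy_labels = [0, 1, 2]
hesitant = [[0.5, 0.3, 0.2], [0.25, 0.5, 0.25], [0.2, 0.3, 0.5]]
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]

print(accuracy(toy_labels, hesitant), multiclass_log_loss(toy_labels, hesitant))
print(accuracy(toy_labels, confident), multiclass_log_loss(toy_labels, confident))
```

Both toy models reach the same accuracy, but the hesitant one has a much larger log loss, so the two metrics can move independently.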


### GridSearchCV for tuning Parameters:
Now we will set _'objective'_ to _multiclass_ and _'num_class'_ to 6, and then tune 'num_iterations' and 'learning_rate', followed by these parameters:
* num_leaves
* min_data_in_leaf
* max_depth

_(Reference: http://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)_
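
The tuning order described above is a one-parameter-at-a-time search: each parameter is searched while the earlier winners stay fixed. A minimal sketch of that idea in plain Python follows. The function `tune_one_at_a_time` and the scoring function `toy_score` are hypothetical stand-ins; in this notebook the score would come from the mean cross-validation score reported by `GridSearchCV`.

```python
def tune_one_at_a_time(score_fn, base_params, grids):
    """Tune parameters sequentially, freezing each winner before the next search.

    score_fn: callable taking a params dict and returning a score (higher is better).
    grids: list of (name, candidate_values) pairs, searched in the given order.
    """
    best_params = dict(base_params)
    best_score = None
    for name, candidates in grids:
        scored = [(score_fn({**best_params, name: value}), value) for value in candidates]
        best_score, best_value = max(scored, key=lambda pair: pair[0])
        best_params[name] = best_value
    return best_params, best_score


# Hypothetical stand-in for a cross-validated score; it peaks at num_leaves=8, max_depth=7.
def toy_score(params):
    return -abs(params["num_leaves"] - 8) - abs(params["max_depth"] - 7)


best, score = tune_one_at_a_time(
    toy_score,
    base_params={"num_leaves": 31, "max_depth": 9},
    grids=[("num_leaves", [4, 8, 16, 32]), ("max_depth", [6, 7, 8, 9])],
)
print(best, score)
```

This approach needs far fewer fits than a full joint grid, at the cost of possibly missing interactions between parameters.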

In [212]:
# Let's fix the learning rate at 0.5 and see which num_iterations value is optimal. Our objective is to obtain a num_iterations value close to 100,
# and then, for the final score evaluation, we will increase it while decreasing the learning rate correspondingly.
params = {'num_iterations' : [i*10 for i in range(6, 15, 2)],
          'learning_rate' : [0.5]}
GSCV = GridSearchCV(clf_lgbm, params, verbose = True, n_jobs = -1)
GSCV.fit(X_train, y_train)
GSCV.best_params_

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  1.4min finished


{'learning_rate': 0.5, 'num_iterations': 100}
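
The "5 candidates, 15 fits" line follows directly from the size of the grid and the number of folds. As a quick check, the grid can be expanded by hand; this sketch only reproduces the arithmetic and does not train anything (the names `param_grid`, `n_folds` and `candidates` are illustrative).

```python
from itertools import product

param_grid = {
    "num_iterations": [i * 10 for i in range(6, 15, 2)],
    "learning_rate": [0.5],
}
n_folds = 3  # matches the "Fitting 3 folds" line in the output

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

print(param_grid["num_iterations"])                # [60, 80, 100, 120, 140]
print(len(candidates), len(candidates) * n_folds)  # 5 15
```

The selected value of 100 sits in the middle of the candidate list rather than at either end.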

In [213]:
# Let's set num_iterations=100 and learning_rate=0.5 and move on to tuning the other parameters
clf_lgbm = LGBMClassifier(objective = 'multiclass', num_class = 6, learning_rate = 0.5, num_iterations = 100)

#GridSearch for 'num_leaves'
params = {'num_leaves' : [2**i for i in range(2, 6)]}
GSCV = GridSearchCV(clf_lgbm, params, verbose = True, n_jobs = -1)
GSCV.fit(X_train, y_train)
GSCV.best_params_, GSCV.best_score_

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Done   8 out of  12 | elapsed:   44.3s remaining:   22.1s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:  1.1min finished


({'num_leaves': 8}, 0.98807046747121652)

In [214]:
clf_lgbm = LGBMClassifier(boosting_type='gbdt', objective = 'multiclass', num_class = 6, learning_rate = 0.5, num_iterations = 100, num_leaves = 8)

#GridSearch for 'min_data_in_leaf'
params = {'min_data_in_leaf' : [i*100 for i in range(1, 5)]}
GSCV = GridSearchCV(clf_lgbm, params, verbose = True, n_jobs = -1)
GSCV.fit(X_train, y_train)
GSCV.best_params_, GSCV.best_score_

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Done   8 out of  12 | elapsed:   47.5s remaining:   23.8s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:   49.8s finished


({'min_data_in_leaf': 300}, 0.98820918296573723)

In [215]:
# Let's tune 'min_data_in_leaf' one more time, over the range 200 to 390 with a step of 10
# Grid Search CV
params = {'min_data_in_leaf' : [i for i in range(200, 400, 10)]}
GSCV = GridSearchCV(clf_lgbm, params, verbose = True)
GSCV.fit(X_train, y_train)
GSCV.best_params_, GSCV.best_score_

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:  5.2min finished


({'min_data_in_leaf': 220}, 0.98848661395477877)

In [216]:
# One last time for 'min_data_in_leaf', over the range 210 to 229
params = {'min_data_in_leaf' : [i for i in range(210, 230)]}
GSCV = GridSearchCV(clf_lgbm, params, verbose = True)
GSCV.fit(X_train, y_train)
GSCV.best_params_, GSCV.best_score_

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:  3.2min finished


({'min_data_in_leaf': 229}, 0.98890276043834091)
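
When a search is refined from coarse to fine like this, it can be worth checking whether the winner lands on the edge of the grid, since that may mean a better value lies just outside the range tried. A small sketch of that check, using a hypothetical helper `on_grid_edge` and the two `min_data_in_leaf` grids from the cells above:

```python
def on_grid_edge(best_value, candidates):
    """Return True when the winner is the smallest or largest value tried."""
    return best_value in (min(candidates), max(candidates))

coarse = list(range(200, 400, 10))  # 200, 210, ..., 390
fine = list(range(210, 230))        # 210, 211, ..., 229

print(on_grid_edge(220, coarse))  # 220 lies inside the coarse grid
print(on_grid_edge(229, fine))    # 229 is the last value of the fine grid
```

Here the final winner, 229, is the largest value in the fine grid, so extending the range slightly upward might be a reasonable follow-up experiment.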

In [217]:
# Grid Search for 'max_depth'
params = {'max_depth' : [i for i in range(6, 10)]}
GSCV = GridSearchCV(clf_lgbm, params, verbose = True)
GSCV.fit(X_train, y_train)
GSCV.best_params_, GSCV.best_score_

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  2.2min finished


({'max_depth': 7}, 0.98890276043834091)

In [218]:
# Pass the tuned params to the classifier, using a seed for reproducible results.
# Decrease the learning rate and increase num_iterations correspondingly (num_iterations from 100 to 1000, learning_rate from 0.5 to 0.05).
clf_lgbm = LGBMClassifier(boosting_type='gbdt', objective = 'multiclass', num_class = 6,
                          learning_rate = 0.05, num_iterations = 1000, num_leaves = 8, 
                          min_data_in_leaf = 229, max_depth = 7, seed = 31)

# Fit to the training data
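# time.clock() was removed in Python 3.8; time.perf_counter() measures elapsed wall-clock time
start = time.perf_counter()
clf_lgbm.fit(X_train, y_train)
# Make predictions
preds = clf_lgbm.predict(X_test)
end = time.perf_counter()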
y_pred = clf_lgbm.predict_proba(X_test)
print('Accuracy Score for LGBM Classifier is: {}'.format(accuracy_score(y_test, preds)))
print('F1 Score for LGBM Classifier is: {}'.format(f1_score(y_test, preds, average = 'weighted')))
print('Runtime is {} seconds'.format(end-start))
print('Log Loss Score is: {}'.format(log_loss(y_true, y_pred)))

Accuracy Score for LGBM Classifier is: 0.991262135922
F1 Score for LGBM Classifier is: 0.991257859265
Runtime is 264.508288 seconds
Log Loss Score is: 0.0247173917944
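
To put the improvement in perspective, the two accuracy figures reported in this notebook can be turned into error rates and compared. This is only arithmetic on the printed scores; the variable names are illustrative.

```python
baseline_accuracy = 0.96957928802589  # default LGBMClassifier, from the earlier output
tuned_accuracy = 0.991262135922       # tuned LGBMClassifier, from the output above

baseline_error = 1.0 - baseline_accuracy
tuned_error = 1.0 - tuned_accuracy
relative_reduction = (baseline_error - tuned_error) / baseline_error

print('Error rate: {:.2%} -> {:.2%}'.format(baseline_error, tuned_error))
print('Relative reduction in errors: {:.1%}'.format(relative_reduction))
```

Viewed this way, a gain of about two percentage points in accuracy corresponds to removing roughly seven in ten of the baseline model's test errors.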
