<a href="https://colab.research.google.com/github/ashishpal2702/HumanActivityrecognition/blob/main/Logistic_Regression_and_Classification_POC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

We will use the [Human Activity Recognition with Smartphones](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record as one of six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time- and frequency-domain variables.
- The activity label.

More information about the features is available on the website above.

In [1]:
#from google.colab import drive
#drive.mount('/content/drive')

In [1]:
from __future__ import print_function
import os
data_path = [ 'data']

from sklearn.preprocessing import LabelEncoder , StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize
from sklearn.ensemble import ExtraTreesClassifier


In [2]:
#import matplotlib.pyplot as plt
#import seaborn as sns
#%matplotlib inline

## 1. Data Import

Import the data and do the following:

* Examine the data types--there are many columns, so it might be wise to use value counts
* Determine if the floating point values need to be scaled
* Determine the breakdown of each activity
* Encode the activity label as an integer

In [6]:
import pandas as pd
import numpy as np
import os
#filepath = '/Users/apal/Documents/PathtoAI/AnalyticsVidhya/Mlops/data//Human_Activity_Recognition_Using_Smartphones_Data_augmented_data.gzip'
filepath = '/Users/apal/Documents/PathtoAI/AnalyticsVidhya/Mlops/data/New_Data.gzip'
data = pd.read_parquet(filepath)

In [7]:
data.shape

(100000, 563)

In [8]:
sensors = set()
for col in data.columns:
    sensors.add(col.split("-")[0])

In [9]:
sensors

{'Activity',
 'angle(X,gravityMean)',
 'angle(Y,gravityMean)',
 'angle(Z,gravityMean)',
 'angle(tBodyAccJerkMean),gravityMean)',
 'angle(tBodyAccMean,gravity)',
 'angle(tBodyGyroJerkMean,gravityMean)',
 'angle(tBodyGyroMean,gravityMean)',
 'date_time',
 'fBodyAcc',
 'fBodyAccJerk',
 'fBodyAccMag',
 'fBodyBodyAccJerkMag',
 'fBodyBodyGyroJerkMag',
 'fBodyBodyGyroMag',
 'fBodyGyro',
 'tBodyAcc',
 'tBodyAccJerk',
 'tBodyAccJerkMag',
 'tBodyAccMag',
 'tBodyGyro',
 'tBodyGyroJerk',
 'tBodyGyroJerkMag',
 'tBodyGyroMag',
 'tGravityAcc',
 'tGravityAccMag'}
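
The loop above keeps the text before the first hyphen in each column name. A minimal sketch of the same grouping, run on a hand-picked subset of column names rather than the full frame (the `columns` list is an illustration only), shows why names without a hyphen, such as `Activity` and the `angle(...)` columns, appear whole in the set:

```python
# Hand-picked subset of column names from the dataset, for illustration only.
columns = [
    "tBodyAcc-mean()-X",
    "tBodyAcc-std()-Y",
    "fBodyGyro-mean()-Z",
    "angle(X,gravityMean)",
    "Activity",
]

# Keep the text before the first hyphen; names without a hyphen are kept whole.
prefixes = {col.split("-")[0] for col in columns}
print(sorted(prefixes))
```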

In [10]:
data['Activity'].value_counts()

LAYING                16762
WALKING               16728
WALKING_UPSTAIRS      16675
STANDING              16645
WALKING_DOWNSTAIRS    16627
SITTING               16563
Name: Activity, dtype: int64

In [11]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| tBodyAcc-mean()-X | 100000.0 | 0.138041 | 0.086520 | -0.863040 | 0.066226 | 0.134419 | 0.205561 | 0.771421 |
| tBodyAcc-mean()-Y | 100000.0 | -0.008688 | 0.017784 | -0.584658 | -0.013820 | -0.007170 | -0.002103 | 0.511605 |
| tBodyAcc-mean()-Z | 100000.0 | -0.054425 | 0.040249 | -0.918692 | -0.080716 | -0.051144 | -0.024272 | 0.937914 |
| tBodyAcc-std()-X | 100000.0 | -0.282349 | 0.307097 | -0.998199 | -0.492172 | -0.197145 | -0.045912 | 0.839073 |
| tBodyAcc-std()-Y | 100000.0 | -0.233839 | 0.320421 | -0.994999 | -0.465923 | -0.118005 | -0.005036 | 0.934541 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| angle(tBodyGyroMean,gravityMean) | 100000.0 | 0.012051 | 0.342313 | -0.987830 | -0.157721 | 0.004293 | 0.187061 | 0.988863 |
| angle(tBodyGyroJerkMean,gravityMean) | 100000.0 | -0.006645 | 0.245722 | -0.931989 | -0.128887 | -0.001805 | 0.112867 | 0.962053 |
| angle(X,gravityMean) | 100000.0 | -0.261718 | 0.319036 | -0.992246 | -0.500460 | -0.281813 | -0.067636 | 0.948844 |
| angle(Y,gravityMean) | 100000.0 | 0.039957 | 0.170272 | -0.975458 | 0.003913 | 0.060276 | 0.138061 | 0.988527 |


In [12]:
le = LabelEncoder()
data['Activity'] = le.fit_transform(data['Activity'])

In [13]:
data['Activity'].value_counts()

0    16762
3    16728
5    16675
2    16645
4    16627
1    16563
Name: Activity, dtype: int64
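
`LabelEncoder` numbers the class names in sorted order, which is why the counts above match the earlier ones (for example, 16762 rows for both `LAYING` and `0`). A plain-Python sketch of that mapping, offered as an illustration rather than the library's implementation:

```python
# The six activity names from the value counts above.
activities = [
    "LAYING", "WALKING", "WALKING_UPSTAIRS",
    "STANDING", "WALKING_DOWNSTAIRS", "SITTING",
]

# Sort the unique names and number them from zero, as LabelEncoder does.
classes = sorted(set(activities))
label_to_id = {name: i for i, name in enumerate(classes)}

for name, i in label_to_id.items():
    print(i, name)
```

In the notebook itself, `le.classes_` holds the same sorted list, so `le.classes_[3]` is the name behind label `3`.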

## 2. Feature engineering

In [46]:
from sklearn.model_selection import train_test_split
X = data.drop(['date_time','Activity'] , axis = 1)
Y = data['Activity']

In [47]:
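def get_top_k_features(X, Y, k):
    """Return the names of the k features ranked highest by an ExtraTrees model."""
    # Note: the ranking is fitted on all rows, before the train/test split below.
    clf = ExtraTreesClassifier(n_estimators=150)
    clf.fit(X, Y)
    # Pair each column name with its importance and keep the k largest.
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k).index.values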

In [48]:
top_k_features = get_top_k_features(X, Y, k=10)


In [49]:
top_k_features


array(['tGravityAcc-energy()-X', 'angle(X,gravityMean)',
       'tGravityAcc-mean()-X', 'tGravityAcc-min()-Y',
       'tGravityAcc-max()-X', 'tGravityAcc-max()-Y',
       'tGravityAcc-min()-X', 'tGravityAcc-mean()-Y',
       'angle(Y,gravityMean)', 'fBodyAccJerk-entropy()-Y'], dtype=object)

## 3. Data preparation

* Split the data into train and test data sets. 
* Regardless of methods used to split the data, compare the ratio of classes in both the train and test splits.
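
One way to compare the class ratios is to compute each class's share of the labels in both splits and read them side by side. The sketch below uses small made-up label lists; `train_labels` and `test_labels` are hypothetical stand-ins for the `y_train` and `y_test` produced by the split.

```python
from collections import Counter

def class_ratios(labels):
    """Return each class's share of the labels as a fraction of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Made-up labels standing in for y_train and y_test.
train_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
test_labels = [0, 1, 2, 2]

train_ratios = class_ratios(train_labels)
test_ratios = class_ratios(test_labels)

# Report the per-class shares of the two splits side by side.
for cls in train_ratios:
    print(cls, round(train_ratios[cls], 3), round(test_ratios.get(cls, 0.0), 3))
```

With pandas, `y_train.value_counts(normalize=True)` gives the same shares directly, and passing `stratify=Y` to `train_test_split` keeps the ratios nearly identical across the two splits.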


In [50]:
X = X[top_k_features]

In [51]:
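# Default split: 75% train, 25% test. No random_state is set, so the split (and the scores below) vary between runs.
x_train, x_test, y_train, y_test = train_test_split(X, Y)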


In [52]:
x_train.shape , y_train.shape


((75000, 10), (75000,))

In [54]:
x_test.shape , y_test.shape


((25000, 10), (25000,))

## Standardise data

In [55]:
sc = StandardScaler()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)
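
`StandardScaler` learns the mean and standard deviation of each column from the training data only, then applies those same values to the test data. A single-column sketch in plain Python, with made-up numbers and the population standard deviation to match the scaler's default:

```python
from statistics import fmean, pstdev

# Made-up values for one feature column.
train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
test_col = [3.0, 6.0]

# Fit: learn the statistics from the training column only.
mu = fmean(train_col)
sigma = pstdev(train_col)

# Transform: reuse the training statistics for both splits.
train_std = [(x - mu) / sigma for x in train_col]
test_std = [(x - mu) / sigma for x in test_col]

print(train_std)
print(test_std)
```

Reusing the training statistics on the test column is what keeps information about the test set out of the fitted scaler.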


## 4. Model Training

Fit the following models and compare the results:

1. Logistic regression
2. Decision tree classifier
3. Random forest classifier
4. Gradient boosting classifier

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


In [57]:
model_result = {}


In [58]:
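# The default max_iter=100 is not enough for the solver to converge here (see the warning below); a larger max_iter may remove it.
lr = LogisticRegression()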
lr.fit(x_train_std, y_train)
train_accuracy = round(lr.score(x_train_std, y_train)*100,2)
test_accuracy = round(lr.score(x_test_std, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['Logistic_Regression'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }


Training Accuracy 70.12
Test Accuracy 69.82


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [59]:
dt =  DecisionTreeClassifier()
dt.fit(x_train, y_train)
train_accuracy = round(dt.score(x_train, y_train)*100,2)
test_accuracy = round(dt.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['Decision_Tree'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }


Training Accuracy 100.0
Test Accuracy 65.82


In [61]:
rfc =  RandomForestClassifier(n_estimators=50)
rfc.fit(x_train, y_train)
train_accuracy = round(rfc.score(x_train, y_train)*100,2)
test_accuracy = round(rfc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['RandomForest'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }


Training Accuracy 99.99
Test Accuracy 72.03


In [62]:
gbc =  GradientBoostingClassifier(n_estimators=50)
gbc.fit(x_train, y_train)
train_accuracy = round(gbc.score(x_train, y_train)*100,2)
test_accuracy = round(gbc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['GradientBoosting'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }

Training Accuracy 72.74
Test Accuracy 72.02


## 5. Model comparison

In [63]:
pd.DataFrame(model_result).T

| | train_accuracy | test_accuracy |
|---|---|---|
| Logistic_Regression | 70.12 | 69.82 |
| Decision_Tree | 100.0 | 65.82 |
| RandomForest | 99.99 | 72.03 |
| GradientBoosting | 72.74 | 72.02 |
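
The gap between training and test accuracy is a quick check for overfitting. A short sketch using the figures from the table above (the `results` dictionary is a copy made for this example, not the notebook's `model_result` object):

```python
# Accuracies (in percent) copied from the comparison table above.
results = {
    "Logistic_Regression": {"train_accuracy": 70.12, "test_accuracy": 69.82},
    "Decision_Tree": {"train_accuracy": 100.0, "test_accuracy": 65.82},
    "RandomForest": {"train_accuracy": 99.99, "test_accuracy": 72.03},
    "GradientBoosting": {"train_accuracy": 72.74, "test_accuracy": 72.02},
}

# Train-minus-test gap in percentage points.
gaps = {
    name: round(r["train_accuracy"] - r["test_accuracy"], 2)
    for name, r in results.items()
}

# Print the models with the largest gap first.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap} points")
```

The decision tree and the default random forest fit the training set almost perfectly, yet neither scores above roughly 72% on the test set, which is a reason to constrain the forest's depth when tuning.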


## Hyperparameter tuning

In [64]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [4, 12],
    "criterion": ["gini", "entropy"],
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(x_train, y_train)

In [66]:
CV_rfc.best_params_

{'criterion': 'entropy', 'max_depth': 12, 'n_estimators': 100}
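
`GridSearchCV` evaluates every combination in `param_grid` on each cross-validation fold. For the grid above, the amount of work can be counted with the standard library alone. This is only a sketch of the bookkeeping, not of the search itself:

```python
from itertools import product

# Same grid as the one passed to GridSearchCV above.
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [4, 12],
    "criterion": ["gini", "entropy"],
}
cv_folds = 5

# Every combination of one value per parameter.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(combinations) * cv_folds

print(len(combinations), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, one further fit of the best candidate on the whole training set follows the cross-validation fits.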

In [67]:
rfc =  RandomForestClassifier(criterion= 'entropy', max_depth= 12, n_estimators= 100)
rfc.fit(x_train, y_train)
train_accuracy = round(rfc.score(x_train, y_train)*100,2)
test_accuracy = round(rfc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)

Training Accuracy 77.06
Test Accuracy 72.82


In [68]:
train_accuracy = round(CV_rfc.score(x_train, y_train)*100,2)
test_accuracy = round(CV_rfc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)

Training Accuracy 76.95
Test Accuracy 72.76


## 6. Final Model

In [69]:
importances = rfc.feature_importances_

In [70]:
## Top 10 Features contributing to Model 
feature_df = pd.DataFrame( x_train.columns ,  columns = ['variables'])
feature_df['importance'] = importances
feature_df.sort_values( by = 'importance' , ascending = False).head(10)


| | variables | importance |
|---|---|---|
| 9 | fBodyAccJerk-entropy()-Y | 0.451563 |
| 6 | tGravityAcc-min()-X | 0.119655 |
| 2 | tGravityAcc-mean()-X | 0.099543 |
| 3 | tGravityAcc-min()-Y | 0.076955 |
| 1 | angle(X,gravityMean) | 0.055138 |
| 7 | tGravityAcc-mean()-Y | 0.047969 |
| 5 | tGravityAcc-max()-Y | 0.040498 |
| 0 | tGravityAcc-energy()-X | 0.039626 |
| 8 | angle(Y,gravityMean) | 0.039577 |
| 4 | tGravityAcc-max()-X | 0.029476 |
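
Random forest importances are normalised to sum to 1, so a running total shows how much of the total each prefix of the ranking accounts for. A small sketch using the values from the table above:

```python
# Importances copied from the table above, already sorted in descending order.
importances = [
    ("fBodyAccJerk-entropy()-Y", 0.451563),
    ("tGravityAcc-min()-X", 0.119655),
    ("tGravityAcc-mean()-X", 0.099543),
    ("tGravityAcc-min()-Y", 0.076955),
    ("angle(X,gravityMean)", 0.055138),
    ("tGravityAcc-mean()-Y", 0.047969),
    ("tGravityAcc-max()-Y", 0.040498),
    ("tGravityAcc-energy()-X", 0.039626),
    ("angle(Y,gravityMean)", 0.039577),
    ("tGravityAcc-max()-X", 0.029476),
]

# Accumulate the importances in ranking order.
running = 0.0
cumulative = []
for name, value in importances:
    running += value
    cumulative.append((name, round(running, 4)))

for name, share in cumulative:
    print(f"{share:.4f}  {name}")
```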


In [71]:
y_pred = rfc.predict(x_test)

## 7. Model Evaluation

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

Decide how to combine the multi-class metrics into a single value for each model.
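
The per-class metrics can all be derived from the confusion matrix, and averaging them is one way to get a single value per model. The sketch below uses a made-up 3-class matrix and assumes rows are true classes and columns are predicted classes. The helper `per_class_metrics` is a hypothetical name, not a scikit-learn function.

```python
def per_class_metrics(cm):
    """Precision, recall and F1 per class from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Made-up 3-class confusion matrix (rows: true, columns: predicted).
cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 2, 8],
]
metrics = per_class_metrics(cm)

# Accuracy is the diagonal over the total; macro F1 is the plain mean over classes.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
print(round(accuracy, 3), round(macro_f1, 3))
```

A macro average treats every class equally, while a weighted average scales each class by its support. With classes as balanced as they are in this dataset, the two stay close.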

In [72]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, classification_report


In [73]:
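# scikit-learn's signature is confusion_matrix(y_true, y_pred); with the arguments in this order, rows are predicted classes and columns are true classes.
confusion_matrix(y_pred, y_test)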

array([[4146,    0,    0,    0,    0,    0],
       [   0, 3534,  498,    0,    0,    0],
       [   0,  582, 3735,    0,    0,   75],
       [   0,    0,    0, 1414, 1004,  627],
       [   0,    1,    0, 1481, 2224,  346],
       [   0,    0,    1, 1246,  935, 3151]])

In [74]:
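# Arguments are (y_pred, y_test) rather than (y_true, y_pred), so the precision and recall columns are swapped and support counts predicted labels.
print(classification_report(y_pred, y_test))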

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4146
           1       0.86      0.88      0.87      4032
           2       0.88      0.85      0.87      4392
           3       0.34      0.46      0.39      3045
           4       0.53      0.55      0.54      4052
           5       0.75      0.59      0.66      5333

    accuracy                           0.73     25000
   macro avg       0.73      0.72      0.72     25000
weighted avg       0.75      0.73      0.73     25000



## 8. Model registration


In [77]:
## Create the output folders (change the paths to your own folder if needed)
os.makedirs('model_weights/', exist_ok=True)
os.makedirs('model_features/', exist_ok=True)

In [78]:
## Let's save model weights
import joblib
# save
joblib.dump(rfc, "./model_weights/my_random_forest.joblib")

['./model_weights/my_random_forest.joblib']

In [79]:
## Save Final columns
final_columns = np.array(x_train.columns) 
joblib.dump(final_columns, "./model_features/train_features.joblib")

['./model_features/train_features.joblib']

## 9. Model Prediction

In [82]:
## load Test Data set 
test_data = pd.read_csv('/Users/apal/Documents/PathtoAI/AnalyticsVidhya/Mlops/data/test_data_100.csv')

In [83]:
test_data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.298676,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.595051,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.390748,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.11729,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.351471,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,STANDING


In [84]:
## Load Features and model weight
train_features = joblib.load("./model_features/train_features.joblib")

model = joblib.load("./model_weights/my_random_forest.joblib")


In [85]:
test_data_features = test_data[train_features]

In [94]:
train_features

array(['tGravityAcc-energy()-X', 'angle(X,gravityMean)',
       'tGravityAcc-mean()-X', 'tGravityAcc-min()-Y',
       'tGravityAcc-max()-X', 'tGravityAcc-max()-Y',
       'tGravityAcc-min()-X', 'tGravityAcc-mean()-Y',
       'angle(Y,gravityMean)', 'fBodyAccJerk-entropy()-Y'], dtype=object)

In [86]:
# Assign the result instead of using inplace=True, which modifies a slice of test_data and raises SettingWithCopyWarning.
test_data_features = test_data_features.fillna(0)


In [87]:
y_prediction = model.predict(test_data_features)
y_prediction_label = le.inverse_transform(y_prediction)
test_data['Prediction_label'] = y_prediction_label
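
`le.inverse_transform` turns the integer predictions back into activity names by looking them up in the encoder's sorted class list. A plain-Python sketch of that lookup, with a made-up `predictions` list standing in for `y_prediction`:

```python
# Class names in the sorted order that LabelEncoder stores them.
classes = [
    "LAYING", "SITTING", "STANDING",
    "WALKING", "WALKING_DOWNSTAIRS", "WALKING_UPSTAIRS",
]

# Made-up integer predictions standing in for y_prediction.
predictions = [2, 0, 1, 3, 2]

# Look each integer up in the class list to recover the activity name.
labels = [classes[i] for i in predictions]
print(labels)
```

Note that `le` is the encoder fitted earlier in the same session; it is not among the objects saved with `joblib` above.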

In [88]:
test_data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity,Prediction_label
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,STANDING,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,STANDING,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,STANDING,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,STANDING,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,STANDING,STANDING


In [89]:
test_data['Prediction_label'].value_counts()

STANDING    27
LAYING      27
SITTING     24
WALKING     22
Name: Prediction_label, dtype: int64

#### Evaluate Model prediction

In [90]:
y_test = test_data['Activity'].values
y_pred = test_data['Prediction_label']

In [91]:
confusion_matrix(y_test, y_pred)

array([[27,  0,  0,  0],
       [ 0, 24,  0,  0],
       [ 0,  0, 27,  0],
       [ 0,  0,  0, 22]])

In [92]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      LAYING       1.00      1.00      1.00        27
     SITTING       1.00      1.00      1.00        24
    STANDING       1.00      1.00      1.00        27
     WALKING       1.00      1.00      1.00        22

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100



## End