<a href="https://colab.research.google.com/github/ashishpal2702/HumanActivityrecognition/blob/main/Logistic_Regression_and_Classification_POC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

We will use the [Human Activity Recognition with Smartphones](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record into one of six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time-domain and frequency-domain variables.
- The activity label.

More information about the features is available on the website above.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [21]:
from __future__ import print_function
import os
data_path = [ 'data']

from sklearn.preprocessing import LabelEncoder , StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 1. Data Import

Import the data and do the following:

* Examine the data types. There are many columns, so a value count of the dtypes is more useful than a full listing
* Determine if the floating point values need to be scaled
* Determine the breakdown of each activity
* Encode the activity label as an integer
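
The checks listed above can be sketched on a small stand-in frame. This is only an illustration: the `toy` DataFrame, its column names, and its values are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real data: two float features and a label column.
toy = pd.DataFrame({
    "feat_a": [0.1, -0.5, 0.9, -1.0],
    "feat_b": [0.0, 0.3, -0.2, 1.0],
    "Activity": ["WALKING", "SITTING", "WALKING", "LAYING"],
})

# Count columns per dtype instead of listing every column.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Look at the overall range of the numeric features to judge whether scaling is needed.
numeric_cols = toy.select_dtypes(include="number").columns
feature_min = toy[numeric_cols].min().min()
feature_max = toy[numeric_cols].max().max()
print("feature range:", feature_min, "to", feature_max)

# Breakdown of rows per activity.
activity_counts = toy["Activity"].value_counts()
print(activity_counts)
```

If the numeric features already share a common range, scaling matters less for tree-based models, although it can still help gradient-based solvers such as the one used by logistic regression.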

In [5]:
import pandas as pd
import numpy as np
import os
filepath = '/content/drive/MyDrive/MLOPs/data/Human_Activity_Recognition_Using_Smartphones_Data_augmented_data.gzip'
data = pd.read_parquet(filepath)

In [6]:
data.shape

(566445, 562)

In [7]:
# Work on a random sample of 10,000 rows; the outputs below were produced from such a sample
data = data.sample(10000)

In [8]:
# Collect the signal prefix of every column name (the part before the first "-")
sensors = set()
for col in data.columns:
    sensors.add(col.split("-")[0])

In [10]:
sensors

{'Activity',
 'angle(X,gravityMean)',
 'angle(Y,gravityMean)',
 'angle(Z,gravityMean)',
 'angle(tBodyAccJerkMean),gravityMean)',
 'angle(tBodyAccMean,gravity)',
 'angle(tBodyGyroJerkMean,gravityMean)',
 'angle(tBodyGyroMean,gravityMean)',
 'fBodyAcc',
 'fBodyAccJerk',
 'fBodyAccMag',
 'fBodyBodyAccJerkMag',
 'fBodyBodyGyroJerkMag',
 'fBodyBodyGyroMag',
 'fBodyGyro',
 'tBodyAcc',
 'tBodyAccJerk',
 'tBodyAccJerkMag',
 'tBodyAccMag',
 'tBodyGyro',
 'tBodyGyroJerk',
 'tBodyGyroJerkMag',
 'tBodyGyroMag',
 'tGravityAcc',
 'tGravityAccMag'}

In [11]:
data['Activity'].value_counts()

WALKING_UPSTAIRS      1697
STANDING              1689
WALKING_DOWNSTAIRS    1687
LAYING                1662
WALKING               1647
SITTING               1618
Name: Activity, dtype: int64

In [12]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tBodyAcc-mean()-X | 10000.0 | 0.273878 | 0.065804 | -0.966040 | 0.262319 | 0.277197 | 0.288261 | 0.683677 |
| tBodyAcc-mean()-Y | 10000.0 | -0.017268 | 0.027614 | -0.618098 | -0.024210 | -0.017175 | -0.011057 | 0.815044 |
| tBodyAcc-mean()-Z | 10000.0 | -0.108445 | 0.043946 | -0.739033 | -0.120130 | -0.108339 | -0.097960 | 0.948795 |
| tBodyAcc-std()-X | 10000.0 | -0.560174 | 0.452662 | -0.999948 | -0.991385 | -0.584698 | -0.208369 | 0.795908 |
| tBodyAcc-std()-Y | 10000.0 | -0.465878 | 0.504038 | -0.999492 | -0.974604 | -0.355204 | -0.029963 | 0.975364 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| angle(tBodyGyroMean,gravityMean) | 10000.0 | 0.019023 | 0.597627 | -0.998072 | -0.463277 | 0.016459 | 0.510076 | 0.998648 |
| angle(tBodyGyroJerkMean,gravityMean) | 10000.0 | -0.014538 | 0.426657 | -0.980865 | -0.338392 | -0.009602 | 0.299186 | 0.964529 |
| angle(X,gravityMean) | 10000.0 | -0.526110 | 0.484357 | -0.999059 | -0.818913 | -0.726849 | -0.554940 | 0.948302 |
| angle(Y,gravityMean) | 10000.0 | 0.080501 | 0.291130 | -0.991998 | 0.033089 | 0.187818 | 0.252485 | 0.999754 |


In [13]:
le = LabelEncoder()
data['Activity'] = le.fit_transform(data['Activity'])

In [14]:
data['Activity'].value_counts()

5    1697
2    1689
4    1687
0    1662
3    1647
1    1618
Name: Activity, dtype: int64

## 2. EDA
 
* Calculate the pairwise correlations between the independent variables (the features).
* Identify highly correlated features and drop them.

In [15]:
data_features = data.drop('Activity' , axis = 1)
# Create correlation matrix
corr_matrix = data_features.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.80
to_drop = [column for column in upper.columns if any(upper[column] > 0.8)]
print(to_drop)

['tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X', 'tBodyAcc-max()-Y', 'tBodyAcc-max()-Z', 'tBodyAcc-min()-X', 'tBodyAcc-min()-Y', 'tBodyAcc-min()-Z', 'tBodyAcc-sma()', 'tBodyAcc-energy()-X', 'tBodyAcc-energy()-Y', 'tBodyAcc-energy()-Z', 'tBodyAcc-iqr()-X', 'tBodyAcc-iqr()-Y', 'tBodyAcc-iqr()-Z', 'tBodyAcc-entropy()-X', 'tBodyAcc-entropy()-Y', 'tBodyAcc-entropy()-Z', 'tBodyAcc-arCoeff()-X,1', 'tBodyAcc-arCoeff()-X,2', 'tBodyAcc-arCoeff()-Y,2', 'tBodyAcc-arCoeff()-Z,2', 'tGravityAcc-mad()-X', 'tGravityAcc-mad()-Y', 'tGravityAcc-mad()-Z', 'tGravityAcc-max()-X', 'tGravityAcc-max()-Y', 'tGravityAcc-max()-Z', 'tGravityAcc-min()-X', 'tGravityAcc-min()-Y', 'tGravityAcc-min()-Z', 'tGravityAcc-energy()-X', 'tGravityAcc-energy()-Y', 'tGravityAcc-energy()-Z', 'tGravityAcc-iqr()-X', 'tGravityAcc-iqr()-Y', 'tGravityAcc-iqr()-Z', 'tGravityAcc-arCoeff()-X,2', 'tGravityAcc-arCoeff()-X,3', 'tGravityAcc-arCoeff()-X,4', 'tGravityAcc-ar

In [16]:
len(to_drop)

420

In [17]:
# Drop features
data.drop(to_drop, axis=1, inplace=True)
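
To see why the upper-triangle mask drops only one member of each correlated pair, the same filter can be run on a toy frame. This is a minimal sketch: the columns `a`, `a_copy`, and `b` are hypothetical, and the 0.8 threshold simply mirrors the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical features: "a_copy" is almost a linear copy of "a"; "b" is independent.
toy = pd.DataFrame({
    "a": base,
    "a_copy": 2.0 * base + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only entries strictly above the diagonal, so each pair is inspected once
# and a column is only compared with the columns that come before it.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates above the threshold with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because the comparison only looks at earlier columns, the first feature of a correlated group is kept and the later ones are removed, so the result depends on column order.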

## 3. Data preparation

* Split the data into train and test sets.
* Regardless of the method used to split the data, compare the ratio of classes in the train and test splits.
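
One way to carry out the class-ratio comparison is sketched below on synthetic labels. The `stratify` and `random_state` arguments, the toy arrays, and the variable names are assumptions made for illustration; they are not part of the notebook's own split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 600 rows, 4 features, 6 balanced classes (100 rows each).
X_toy = pd.DataFrame(rng.normal(size=(600, 4)), columns=list("abcd"))
y_toy = pd.Series(np.repeat(np.arange(6), 100), name="Activity")

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Side-by-side class proportions for the two splits.
ratios = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(ratios)
```

Without `stratify`, the proportions in the two columns can drift apart by chance, which is what the comparison is meant to reveal.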


In [18]:
from sklearn.model_selection import train_test_split
X = data.drop('Activity' , axis = 1)
Y = data['Activity']
x_train , x_test , y_train , y_test = train_test_split(X, Y)

In [19]:
x_train.shape , y_train.shape

((7500, 141), (7500,))

In [20]:
x_test.shape , y_test.shape

((2500, 141), (2500,))

## Standardise data

In [24]:
sc = StandardScaler()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)

## 4. Model Training

Fit the following models and compare the results:

1. Logistic regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. Adaptive Boosting (AdaBoost) Classifier

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier


In [36]:
model_result = {}

In [40]:
lr =  LogisticRegression()
lr.fit(x_train_std, y_train)
train_accuracy = round(lr.score(x_train_std, y_train)*100,2)
test_accuracy = round(lr.score(x_test_std, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['Logistic_Regression'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }

Training Accuracy 99.39
Test Accuracy 98.32


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
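
The warning above says the solver stopped at its iteration limit (scikit-learn's `LogisticRegression` defaults to `max_iter=100`). A minimal sketch of raising that limit follows, run on synthetic data from `make_classification` rather than on the notebook's features; the value `max_iter=1000` is an arbitrary assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic six-class problem standing in for the real features.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_syn_std = StandardScaler().fit_transform(X_syn)

# Give the solver more room than the default 100 iterations.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_syn_std, y_syn)

# n_iter_ reports how many iterations were actually used.
iterations_used = int(clf.n_iter_.max())
train_score = clf.score(X_syn_std, y_syn)
print("iterations used:", iterations_used)
print("training accuracy:", round(train_score, 3))
```

If `iterations_used` stays below the limit, the solver converged and the warning should no longer appear.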


In [41]:
dt =  DecisionTreeClassifier()
dt.fit(x_train, y_train)
train_accuracy = round(dt.score(x_train, y_train)*100,2)
test_accuracy = round(dt.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['Decision_Tree'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }

Training Accuracy 100.0
Test Accuracy 95.4


In [42]:
rfc =  RandomForestClassifier()
rfc.fit(x_train, y_train)
train_accuracy = round(rfc.score(x_train, y_train)*100,2)
test_accuracy = round(rfc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['RandomForest'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }

Training Accuracy 100.0
Test Accuracy 98.52


In [43]:
abc =  AdaBoostClassifier()
abc.fit(x_train, y_train)
train_accuracy = round(abc.score(x_train, y_train)*100,2)
test_accuracy = round(abc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['AdaBoost'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }

Training Accuracy 50.23
Test Accuracy 50.16


## 5. Model comparison

In [45]:
pd.DataFrame(model_result).T

| | train_accuracy | test_accuracy |
| --- | --- | --- |
| Logistic_Regression | 99.39 | 98.32 |
| Decision_Tree | 100.0 | 95.4 |
| RandomForest | 100.0 | 98.52 |
| AdaBoost | 50.23 | 50.16 |


## 6. Final Model

In [46]:
importances = rfc.feature_importances_


In [47]:
## Top 10 Features contributing to Model 
feature_df = pd.DataFrame( x_train.columns ,  columns = ['variables'])
feature_df['importance'] = importances
feature_df.sort_values( by = 'importance' , ascending = False).head(10)

| | variables | importance |
| --- | --- | --- |
| 15 | tGravityAcc-mean()-X | 0.072212 |
| 16 | tGravityAcc-mean()-Y | 0.067462 |
| 3 | tBodyAcc-std()-X | 0.060798 |
| 17 | tGravityAcc-mean()-Z | 0.030893 |
| 27 | tGravityAcc-arCoeff()-Z,1 | 0.027299 |
| 26 | tGravityAcc-arCoeff()-Y,1 | 0.023477 |
| 110 | fBodyAccJerk-bandsEnergy()-57,64.1 | 0.021434 |
| 100 | fBodyAccJerk-min()-X | 0.021152 |
| 23 | tGravityAcc-entropy()-Y | 0.019749 |
| 12 | tBodyAcc-correlation()-X,Y | 0.019723 |


In [48]:
y_pred = rfc.predict(x_test)

## 7. Model Evaluation

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

Decide how to combine the multi-class metrics into a single value for each model.
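
One common way to collapse per-class metrics into a single number is support-weighted averaging, sketched below on a tiny hand-made example. The labels `y_true` and `y_hat` are hypothetical, and `average="weighted"` is only one option; `"macro"` treats every class equally instead.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical three-class labels: one sample of class 1 is predicted as class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

# average="weighted" averages the per-class scores, weighted by class support.
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_true, y_hat, average="weighted"
)
accuracy = accuracy_score(y_true, y_hat)

summary = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "fscore": fscore,
}
print(summary)
```

With weighted averaging, recall equals overall accuracy, so precision and F-score are the figures that add information beyond the accuracy already reported.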

In [49]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, classification_report


In [50]:
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(y_test, y_pred)

array([[407,   0,   0,   0,   0,   0],
       [  0, 427,  10,   0,   0,   0],
       [  0,  18, 401,   0,   0,   0],
       [  0,   0,   0, 420,   0,   5],
       [  0,   0,   0,   0, 396,   3],
       [  0,   0,   0,   0,   1, 412]])

In [51]:
# Pass the true labels first; swapping the arguments exchanges precision and recall
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       407
           1       0.96      0.98      0.97       437
           2       0.98      0.96      0.97       419
           3       1.00      0.99      0.99       425
           4       1.00      0.99      0.99       399
           5       0.98      1.00      0.99       413

    accuracy                           0.99      2500
   macro avg       0.99      0.99      0.99      2500
weighted avg       0.99      0.99      0.99      2500



## 8. Model registration


In [63]:
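## Change the Drive path to your folder
## makedirs with exist_ok=True creates missing parent folders and does not fail on re-runs
os.makedirs('/content/drive/MyDrive/MLOPs/model_weights/', exist_ok=True)
os.makedirs('/content/drive/MyDrive/MLOPs/model_features/', exist_ok=True)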

In [64]:
## Let's save model weights
import joblib
# save
joblib.dump(rfc, "/content/drive/MyDrive/MLOPs/model_weights/my_random_forest.joblib")

['/content/drive/MyDrive/MLOPs/model_weights/my_random_forest.joblib']

In [65]:
## Save Final columns
final_columns = np.array(x_train.columns) 
joblib.dump(final_columns, "/content/drive/MyDrive/MLOPs/model_features/train_features.joblib")

['/content/drive/MyDrive/MLOPs/model_features/train_features.joblib']

## 9. Model Prediction

In [61]:
## load Test Data set 
test_data = pd.read_csv('/content/drive/MyDrive/MLOPs/data/test_data.csv')

In [66]:
## Load Features and model weight
train_features = joblib.load("/content/drive/MyDrive/MLOPs/model_features/train_features.joblib")

model = joblib.load("/content/drive/MyDrive/MLOPs/model_weights/my_random_forest.joblib")


In [68]:
test_data_features = test_data[train_features]

In [72]:
## Assign the result instead of using inplace=True on a slice of test_data,
## which raises pandas' SettingWithCopyWarning
test_data_features = test_data_features.fillna(0)


In [79]:
y_prediction = model.predict(test_data_features)
y_prediction_label = le.inverse_transform(y_prediction)
test_data['Prediction_label'] = y_prediction_label
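
The cell above relies on the fitted `le` object still being in memory. In a fresh session the encoder would need to be restored as well. A minimal sketch of persisting it with `joblib` follows; the file name `label_encoder.joblib` and the temporary directory are assumptions for illustration, not paths used in this notebook.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Fit an encoder on the six activity names, as the notebook does on the Activity column.
encoder = LabelEncoder()
encoder.fit([
    "LAYING", "SITTING", "STANDING",
    "WALKING", "WALKING_DOWNSTAIRS", "WALKING_UPSTAIRS",
])

# Save and reload the encoder; a temporary directory stands in for the Drive folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    encoder_path = os.path.join(tmp_dir, "label_encoder.joblib")  # hypothetical file name
    joblib.dump(encoder, encoder_path)
    restored = joblib.load(encoder_path)

# The restored encoder maps integer predictions back to activity names.
decoded = restored.inverse_transform([2, 5])
print(list(decoded))
```

Storing the encoder next to the model weights and feature list keeps all three artefacts needed for prediction in one place.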

In [80]:
test_data.head()

| | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tBodyAcc-mad()-X | tBodyAcc-mad()-Y | tBodyAcc-mad()-Z | tBodyAcc-max()-X | ... | fBodyBodyGyroJerkMag-kurtosis() | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | angle(X,gravityMean) | angle(Y,gravityMean) | angle(Z,gravityMean) | Activity | Prediction_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.288585 | -0.020294 | -0.132905 | -0.995279 | -0.983111 | -0.913526 | -0.995112 | -0.983185 | -0.923527 | -0.934724 | ... | -0.710304 | -0.112754 | 0.0304 | -0.464761 | -0.018446 | -0.841247 | 0.179941 | -0.058627 | STANDING | STANDING |
| 1 | 0.278419 | -0.016411 | -0.12352 | -0.998245 | -0.9753 | -0.960322 | -0.998807 | -0.974914 | -0.957686 | -0.943068 | ... | -0.861499 | 0.053477 | -0.007435 | -0.732626 | 0.703511 | -0.844788 | 0.180289 | -0.054317 | STANDING | STANDING |
| 2 | 0.279653 | -0.019467 | -0.113462 | -0.99538 | -0.967187 | -0.978944 | -0.99652 | -0.963668 | -0.977469 | -0.938692 | ... | -0.760104 | -0.118559 | 0.177899 | 0.100699 | 0.808529 | -0.848933 | 0.180637 | -0.049118 | STANDING | STANDING |
| 3 | 0.279174 | -0.026201 | -0.123283 | -0.996091 | -0.983403 | -0.990675 | -0.997099 | -0.98275 | -0.989302 | -0.938692 | ... | -0.482845 | -0.036788 | -0.012892 | 0.640011 | -0.485366 | -0.848649 | 0.181935 | -0.047663 | STANDING | STANDING |
| 4 | 0.276629 | -0.01657 | -0.115362 | -0.998139 | -0.980817 | -0.990482 | -0.998321 | -0.979672 | -0.990441 | -0.942469 | ... | -0.699205 | 0.12332 | 0.122542 | 0.693578 | -0.615971 | -0.847865 | 0.185151 | -0.043892 | STANDING | STANDING |
