<a href="https://colab.research.google.com/github/ashishpal2702/HumanActivityrecognition/blob/main/Logistic_Regression_and_Classification_POC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

We will use the [Human Activity Recognition with Smartphones](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record as one of six activities: walking, walking upstairs, walking downstairs, sitting, standing, or laying.

Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration. 
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables. 
- Its activity label. 

More information about the features is available on the website above.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from __future__ import print_function
import os
data_path = [ 'data']

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 1. Data Import

Import the data and do the following:

* Examine the data types--there are many columns, so it might be wise to use value counts
* Determine if the floating point values need to be scaled
* Determine the breakdown of each activity
* Encode the activity label as an integer
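
The first three checks in the list above can be sketched on a small stand-in frame. This is only an illustration: `demo` and its columns `f1`, `f2`, `f3` are hypothetical, and in the notebook the same calls would be applied to `data` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: three float features plus a string label.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(-1, 1, size=(6, 3)), columns=["f1", "f2", "f3"])
demo["Activity"] = ["WALKING", "SITTING", "LAYING", "WALKING", "SITTING", "LAYING"]

# Count columns per dtype instead of listing every column.
dtype_counts = demo.dtypes.value_counts()
print(dtype_counts)

# Global range of the numeric columns; values already bounded in [-1, 1]
# suggest that additional scaling may be optional.
numeric = demo.select_dtypes(include="number")
overall_min, overall_max = numeric.min().min(), numeric.max().max()
print(overall_min, overall_max)

# Class breakdown as proportions rather than raw counts.
class_share = demo["Activity"].value_counts(normalize=True)
print(class_share)
```

Summarising by dtype and by global range keeps the output readable when there are several hundred columns.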

In [None]:
import pandas as pd
import numpy as np
import os
filepath = '/content/drive/MyDrive/MLOPs/data/Human_Activity_Recognition_Using_Smartphones_Data_augmented_data.gzip'
data = pd.read_parquet(filepath)

In [None]:
data.shape

(566445, 562)

In [60]:
sensors = set()
for col in data.columns:
  sensors.add(col.split("-")[0])

In [61]:
sensors

{'Activity',
 'angle(tBodyAccJerkMean),gravityMean)',
 'angle(tBodyAccMean,gravity)',
 'angle(tBodyGyroJerkMean,gravityMean)',
 'angle(tBodyGyroMean,gravityMean)',
 'fBodyAcc',
 'fBodyAccJerk',
 'fBodyAccMag',
 'fBodyBodyAccJerkMag',
 'fBodyBodyGyroJerkMag',
 'fBodyBodyGyroMag',
 'fBodyGyro',
 'tBodyAcc',
 'tBodyAccJerk',
 'tBodyAccJerkMag',
 'tBodyAccMag',
 'tBodyGyro',
 'tBodyGyroJerk',
 'tBodyGyroJerkMag',
 'tBodyGyroMag',
 'tGravityAcc'}

In [None]:
data['Activity'].value_counts()

LAYING                94635
STANDING              94597
SITTING               94468
WALKING               94413
WALKING_UPSTAIRS      94235
WALKING_DOWNSTAIRS    94097
Name: Activity, dtype: int64

In [None]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tBodyAcc-mean()-X,566445.0,0.274936,0.059249,-1.0,0.262873,0.277136,0.288076,1.0
tBodyAcc-mean()-Y,566445.0,-0.017419,0.029188,-1.0,-0.024297,-0.017138,-0.011188,1.0
tBodyAcc-mean()-Z,566445.0,-0.108818,0.043453,-1.0,-0.120096,-0.108533,-0.098246,1.0
tBodyAcc-std()-X,566445.0,-0.565518,0.450373,-1.0,-0.991588,-0.637147,-0.211506,1.0
tBodyAcc-std()-Y,566445.0,-0.469409,0.503388,-1.0,-0.974501,-0.393744,-0.029494,1.0
...,...,...,...,...,...,...,...,...
"angle(tBodyGyroMean,gravityMean)",566445.0,0.021176,0.594952,-1.0,-0.451814,0.021843,0.510186,1.0
"angle(tBodyGyroJerkMean,gravityMean)",566445.0,-0.015486,0.426282,-1.0,-0.341371,-0.014399,0.299621,1.0
"angle(X,gravityMean)",566445.0,-0.523804,0.486867,-1.0,-0.816700,-0.724314,-0.555039,1.0
"angle(Y,gravityMean)",566445.0,0.079559,0.293166,-1.0,0.029635,0.188226,0.253245,1.0


In [None]:
le = LabelEncoder()
data['Activity'] = le.fit_transform(data['Activity'])

In [None]:
data['Activity'].value_counts()

0    94635
2    94597
1    94468
3    94413
5    94235
4    94097
Name: Activity, dtype: int64

## 2. EDA
 
* Calculate the correlations between the features (the independent variables).
* Identify highly correlated features and drop them.
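
To see what the upper-triangle approach does, here is a minimal sketch on a toy frame. The frame `toy` and its columns `a`, `a_copy` and `b` are hypothetical. Only the later column of each highly correlated pair is flagged, so one representative of the pair is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: "a_copy" is nearly collinear with "a", "b" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

# Absolute correlations, keeping only entries strictly above the diagonal
# so that each pair of features is considered once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is flagged if it correlates above 0.8 with any earlier column.
to_drop_toy = [c for c in upper.columns if (upper[c] > 0.8).any()]
print(to_drop_toy)
```

Because the rule depends on column order, reordering the columns can change which member of a correlated pair survives.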

In [None]:
data_features = data.drop('Activity' , axis = 1)
# Create correlation matrix
corr_matrix = data_features.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.80
to_drop = [column for column in upper.columns if any(upper[column] > 0.8)]
print(to_drop)

['tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X', 'tBodyAcc-max()-Y', 'tBodyAcc-max()-Z', 'tBodyAcc-min()-X', 'tBodyAcc-min()-Y', 'tBodyAcc-min()-Z', 'tBodyAcc-sma()', 'tBodyAcc-energy()-X', 'tBodyAcc-energy()-Y', 'tBodyAcc-energy()-Z', 'tBodyAcc-iqr()-X', 'tBodyAcc-iqr()-Y', 'tBodyAcc-iqr()-Z', 'tBodyAcc-entropy()-X', 'tBodyAcc-entropy()-Y', 'tBodyAcc-entropy()-Z', 'tBodyAcc-arCoeff()-X,1', 'tBodyAcc-arCoeff()-X,2', 'tBodyAcc-arCoeff()-X,3', 'tBodyAcc-arCoeff()-Y,2', 'tBodyAcc-arCoeff()-Z,2', 'tGravityAcc-mad()-X', 'tGravityAcc-mad()-Y', 'tGravityAcc-mad()-Z', 'tGravityAcc-max()-X', 'tGravityAcc-max()-Y', 'tGravityAcc-max()-Z', 'tGravityAcc-min()-X', 'tGravityAcc-min()-Y', 'tGravityAcc-min()-Z', 'tGravityAcc-energy()-X', 'tGravityAcc-energy()-Y', 'tGravityAcc-energy()-Z', 'tGravityAcc-iqr()-X', 'tGravityAcc-iqr()-Y', 'tGravityAcc-iqr()-Z', 'tGravityAcc-arCoeff()-X,2', 'tGravityAcc-arCoeff()-X,3', 'tGravityAcc-arCoe

In [None]:
len(to_drop)

420

In [None]:
# Drop features
data.drop(to_drop, axis=1, inplace=True)

## 3. Data preparation

* Split the data into train and test data sets. 
* Regardless of methods used to split the data, compare the ratio of classes in both the train and test splits.
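
One way to keep the class ratios aligned and then check them is a stratified split. This is a sketch on hypothetical toy data (`X_demo`, `y_demo`); the `stratify`, `test_size` and `random_state` settings are illustrative choices and are not taken from the notebook's own split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 120 rows, 3 balanced classes.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(120, 4)), columns=list("abcd"))
y_demo = pd.Series(np.repeat([0, 1, 2], 40), name="Activity")

# stratify preserves class proportions in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Side-by-side class proportions for the two splits.
ratios = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(ratios)
```

With nearly balanced classes and a large sample, as here, a plain random split will usually give similar ratios, but comparing them explicitly confirms it.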


In [None]:
from sklearn.model_selection import train_test_split
X = data.drop('Activity' , axis = 1)
Y = data['Activity']
x_train , x_test , y_train , y_test = train_test_split(X, Y)

In [None]:
x_train.shape , y_train.shape

((424833, 141), (424833,))

In [None]:
x_test.shape , y_test.shape

((141612, 141), (141612,))

## 4. Model Training

Fit the following models and compare the results:

1. Logistic regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. Adaptive Boosting Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier


In [None]:
lr =  LogisticRegression()
lr.fit(x_train, y_train)
print("Training Accuracy", round(lr.score(x_train, y_train)*100,2))
print("Test Accuracy", round(lr.score(x_test, y_test)*100,2))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training Accuracy 98.99
Test Accuracy 98.97
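
The convergence warning above suggests two remedies: scaling the data or raising `max_iter`. A minimal sketch of both, using synthetic data in place of `x_train` and `y_train` (the names `X_demo`, `y_demo`, `lr_pipe` and the value `max_iter=1000` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the notebook's features and labels.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Scaling inside the pipeline means the scaler is fit on training data only;
# max_iter is raised above the scikit-learn default of 100.
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipe.fit(X_tr, y_tr)

print("Training Accuracy", round(lr_pipe.score(X_tr, y_tr) * 100, 2))
print("Test Accuracy", round(lr_pipe.score(X_te, y_te) * 100, 2))
```

Whether this removes the warning on the full dataset, and how many iterations are needed, would have to be checked by rerunning the cell.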


In [None]:
dt =  DecisionTreeClassifier()
dt.fit(x_train, y_train)
print("Training Accuracy", round(dt.score(x_train, y_train)*100,2))
print("Test Accuracy", round(dt.score(x_test, y_test)*100,2))

Training Accuracy 100.0
Test Accuracy 99.84


In [None]:
rfc =  RandomForestClassifier()
rfc.fit(x_train, y_train)
print("Training Accuracy", round(rfc.score(x_train, y_train)*100,2))
print("Test Accuracy", round(rfc.score(x_test, y_test)*100,2))

Training Accuracy 100.0
Test Accuracy 100.0


In [42]:
abc =  AdaBoostClassifier()
abc.fit(x_train, y_train)
print("Training Accuracy", round(abc.score(x_train, y_train)*100,2))
print("Test Accuracy", round(abc.score(x_test, y_test)*100,2))

Training Accuracy 33.53
Test Accuracy 33.03


### Final Model

In [50]:
importances = rfc.feature_importances_


In [62]:
## Top 10 Features contributing to Model 
feature_df = pd.DataFrame( x_train.columns ,  columns = ['variables'])
feature_df['importance'] = importances
feature_df.sort_values( by = 'importance' , ascending = False).head(10)

Unnamed: 0,variables,importance
14,tGravityAcc-mean()-X,0.079016
15,tGravityAcc-mean()-Y,0.07122
3,tBodyAcc-std()-X,0.070572
16,tGravityAcc-mean()-Z,0.035166
22,tGravityAcc-entropy()-Y,0.024533
25,"tGravityAcc-arCoeff()-Y,1",0.02432
24,"tGravityAcc-arCoeff()-X,1",0.024049
26,"tGravityAcc-arCoeff()-Z,1",0.021462
117,fBodyGyro-maxInds-Z,0.020082
86,fBodyAcc-min()-X,0.019553


In [44]:
y_pred = rfc.predict(x_test)

## 7. Model Evaluation

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

Decide how to combine the multi-class metrics into a single value for each model.
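
One common way to combine the per-class metrics is to average them, for example weighting each class by its support. The sketch below uses hypothetical label arrays and a hypothetical helper, `summarize`; in the notebook the inputs would be `y_test` and each model's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels and predictions for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_hat = np.array([0, 0, 1, 2, 2, 2, 1, 1])

def summarize(y_true, y_hat, average="weighted"):
    """Collapse per-class precision, recall and F-score into one value each."""
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_true, y_hat, average=average, zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }

summary = summarize(y_true, y_hat)
print(summary)
```

`average="weighted"` weights each class by its number of true samples, while `average="macro"` treats all classes equally. With classes as balanced as they are here, the two should give very similar values.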

In [45]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, classification_report


In [46]:
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(y_test, y_pred)

array([[23464,     0,     0,     0,     0,     0],
       [    0, 23864,     3,     0,     0,     0],
       [    0,     0, 23315,     0,     0,     0],
       [    0,     0,     0, 23814,     0,     0],
       [    0,     0,     0,     0, 23653,     0],
       [    0,     0,     0,     0,     0, 23499]])

In [47]:
# Pass the true labels first; le.classes_ restores the original activity names
print(classification_report(y_test, y_pred, target_names=le.classes_))

                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00     23464
           SITTING       1.00      1.00      1.00     23867
          STANDING       1.00      1.00      1.00     23315
           WALKING       1.00      1.00      1.00     23814
WALKING_DOWNSTAIRS       1.00      1.00      1.00     23653
  WALKING_UPSTAIRS       1.00      1.00      1.00     23499

          accuracy                           1.00    141612
         macro avg       1.00      1.00      1.00    141612
      weighted avg       1.00      1.00      1.00    141612



## 8. Model registration


In [64]:
## Save the trained model to disk
import joblib
# save
joblib.dump(rfc, "my_random_forest.joblib")

['my_random_forest.joblib']
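
To confirm that a saved model can be restored, it can be loaded back with `joblib.load` and its predictions compared with the original's. This is a sketch with a small hypothetical model and a temporary path, not the notebook's `rfc` or its file.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical small model standing in for `rfc`.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary location, then load the model back.
model_path = os.path.join(tempfile.mkdtemp(), "demo_random_forest.joblib")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A joblib file is a pickle, so it should be loaded only from trusted sources and ideally with the same scikit-learn version that produced it.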