<a href="https://colab.research.google.com/github/ashishpal2702/HumanActivityrecognition/blob/main/Logistic_Regression_and_Classification_POC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

We will be using the [Human Activity Recognition with Smartphones](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record into one of six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.

Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.

More information about the features is available on the website above.

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [18]:
import os

from sklearn.preprocessing import LabelEncoder , StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize
from sklearn.ensemble import ExtraTreesClassifier


In [None]:
#import matplotlib.pyplot as plt
#import seaborn as sns
#%matplotlib inline

## 1. Data Import

Import the data and do the following:

* Examine the data types--there are many columns, so it might be wise to use value counts
* Determine if the floating point values need to be scaled
* Determine the breakdown of each activity
* Encode the activity label as an integer
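
The first three checks in the list above can be sketched on a small stand-in frame. The `toy` DataFrame and its values are hypothetical and only mimic the shape of `data`; this is an illustration of the idea, not the notebook's actual inspection step.

```python
import pandas as pd

# Hypothetical stand-in for `data`: two float features, a timestamp, and a label.
toy = pd.DataFrame({
    "tBodyAcc-mean()-X": [0.27, 0.28, 0.26, 0.29],
    "tBodyAcc-std()-X": [-0.99, -0.62, -0.21, 1.0],
    "date_time": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
    "Activity": ["LAYING", "WALKING", "SITTING", "WALKING"],
})

# Count columns per dtype instead of listing every column.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Check whether the float columns already share a common range.
float_cols = toy.select_dtypes(include="float64").columns
value_range = toy[float_cols].agg(["min", "max"])
print(value_range)

# Breakdown of each activity as a share of all rows.
activity_share = toy["Activity"].value_counts(normalize=True)
print(activity_share)
```

If every float column turns out to lie in the same bounded range, scaling matters less for tree-based models, although it can still help a linear model converge.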

In [32]:
import pandas as pd
import numpy as np
import os
#filepath = '/Users/apal/Documents/PathtoAI/AnalyticsVidhya/Mlops/data//Human_Activity_Recognition_Using_Smartphones_Data_augmented_data.gzip'
filepath = '/Users/apal/Documents/PathtoAI/AnalyticsVidhya/Mlops/data/Human_Activity_Recognition_Using_Smartphones_Data_augmented_timedata_sample.gzip'
data = pd.read_parquet(filepath)

In [33]:
data.shape

(100000, 563)

In [34]:
#data = data.sample(10000)

In [35]:
sensors = set()
for col in data.columns:
    sensors.add(col.split("-")[0])

In [36]:
sensors

{'Activity',
 'angle(X,gravityMean)',
 'angle(Y,gravityMean)',
 'angle(Z,gravityMean)',
 'angle(tBodyAccJerkMean),gravityMean)',
 'angle(tBodyAccMean,gravity)',
 'angle(tBodyGyroJerkMean,gravityMean)',
 'angle(tBodyGyroMean,gravityMean)',
 'date_time',
 'fBodyAcc',
 'fBodyAccJerk',
 'fBodyAccMag',
 'fBodyBodyAccJerkMag',
 'fBodyBodyGyroJerkMag',
 'fBodyBodyGyroMag',
 'fBodyGyro',
 'tBodyAcc',
 'tBodyAccJerk',
 'tBodyAccJerkMag',
 'tBodyAccMag',
 'tBodyGyro',
 'tBodyGyroJerk',
 'tBodyGyroJerkMag',
 'tBodyGyroMag',
 'tGravityAcc',
 'tGravityAccMag'}

In [37]:
data['Activity'].value_counts()

Activity
LAYING                16762
WALKING               16728
WALKING_UPSTAIRS      16675
STANDING              16645
WALKING_DOWNSTAIRS    16627
SITTING               16563
Name: count, dtype: int64

In [38]:
data.describe().T

| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| tBodyAcc-mean()-X | 100000.0 | 0.275078 | -0.986314 | 0.262825 | 0.277178 | 0.288289 | 0.965808 | 0.060068 |
| tBodyAcc-mean()-Y | 100000.0 | -0.017384 | -0.89446 | -0.024411 | -0.017155 | -0.011201 | 0.87536 | 0.029089 |
| tBodyAcc-mean()-Z | 100000.0 | -0.108674 | -0.936031 | -0.120134 | -0.108572 | -0.098326 | 0.999455 | 0.044118 |
| tBodyAcc-std()-X | 100000.0 | -0.564353 | -1.0 | -0.991627 | -0.620961 | -0.211144 | 1.0 | 0.450584 |
| tBodyAcc-std()-Y | 100000.0 | -0.467669 | -0.999629 | -0.974517 | -0.382965 | -0.028442 | 0.999009 | 0.503729 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| angle(tBodyGyroJerkMean,gravityMean) | 100000.0 | -0.013572 | -1.0 | -0.339358 | -0.012546 | 0.30096 | 0.988742 | 0.426647 |
| angle(X,gravityMean) | 100000.0 | -0.523201 | -0.999923 | -0.816508 | -0.724015 | -0.554554 | 0.997924 | 0.487325 |
| angle(Y,gravityMean) | 100000.0 | 0.079965 | -0.997831 | 0.029963 | 0.18861 | 0.253884 | 0.999779 | 0.293338 |
| angle(Z,gravityMean) | 100000.0 | -0.041572 | -0.99383 | -0.103348 | 0.002657 | 0.109922 | 0.999776 | 0.256959 |


In [39]:
le = LabelEncoder()
data['Activity'] = le.fit_transform(data['Activity'])

In [40]:
data['Activity'].value_counts()

Activity
0    16762
3    16728
5    16675
2    16645
4    16627
1    16563
Name: count, dtype: int64

## 2. Feature engineering

In [42]:
from sklearn.model_selection import train_test_split
X = data.drop(['date_time','Activity'] , axis = 1)
Y = data['Activity']

In [43]:
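def get_top_k_features(X, Y, k):
    """Return the names of the k features ranked highest by an extra-trees model."""
    # Fit an ensemble of randomized trees to obtain impurity-based importances.
    clf = ExtraTreesClassifier(n_estimators=50)
    clf = clf.fit(X, Y)
    # Pair each column name (column 0) with its importance (column 1), highest first.
    feature_df = pd.DataFrame(data=(X.columns, clf.feature_importances_)).T.sort_values(by=1, ascending=False)
    cols = feature_df.head(k)[0].values
    return cols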

In [44]:
# Note: importances are computed on the full dataset, before the train/test split.
top_k_features = get_top_k_features(X, Y, k=10)


In [45]:
top_k_features


array(['tGravityAcc-max()-X', 'tGravityAcc-max()-Y',
       'tGravityAcc-mean()-X', 'tGravityAcc-energy()-X',
       'angle(X,gravityMean)', 'tGravityAcc-min()-Y',
       'tGravityAcc-min()-X', 'tGravityAcc-mean()-Y',
       'tGravityAcc-energy()-Y', 'angle(Y,gravityMean)'], dtype=object)

## 3. Data preparation

* Split the data into train and test data sets. 
* Regardless of methods used to split the data, compare the ratio of classes in both the train and test splits.


In [46]:
X = X[top_k_features]

In [47]:
x_train , x_test , y_train , y_test = train_test_split(X, Y)


In [48]:
x_train.shape , y_train.shape


((75000, 10), (75000,))

In [49]:
x_test.shape , y_test.shape


((25000, 10), (25000,))
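
The class-ratio comparison asked for at the start of this section can be sketched as follows. The toy data, the `stratify` argument and the fixed `random_state` are assumptions added for illustration; the split above uses the `train_test_split` defaults.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for X and Y: 600 rows, six balanced classes.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(600, 3)), columns=["f1", "f2", "f3"])
Y_toy = pd.Series(np.repeat(np.arange(6), 100), name="Activity")

# stratify=Y_toy keeps class proportions equal across the two splits;
# random_state makes the split reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, Y_toy, test_size=0.25, stratify=Y_toy, random_state=42
)

# Side-by-side class ratios for the train and test splits.
ratios = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(ratios)
```

Without `stratify`, the two columns would usually differ slightly. With classes as balanced as the ones in this dataset, the difference is likely to be small.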

### Standardise data

In [50]:
sc = StandardScaler()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)


## 4. Model Training

Fit the following models and compare their results:

1. Logistic Regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. Gradient Boosting Classifier (recorded under the key `AdaBoost` in the results)
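
The fit-and-score pattern used for the four models can also be written as a single loop. The sketch below runs on synthetic data from `make_classification`; the dataset, the hyperparameters and the names `models` and `demo_result` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the selected features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# One entry per model; the dictionary key becomes the row label in the results.
models = {
    "Logistic_Regression": LogisticRegression(max_iter=1000),
    "Decision_Tree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

demo_result = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    demo_result[name] = {
        "train_accuracy": round(model.score(x_tr, y_tr) * 100, 2),
        "test_accuracy": round(model.score(x_te, y_te) * 100, 2),
    }
print(demo_result)
```

Keeping the label next to the estimator in one dictionary makes it harder for a results key to drift away from the model it describes.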

In [51]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


In [52]:
model_result = {}


In [53]:
lr =  LogisticRegression()
lr.fit(x_train_std, y_train)
train_accuracy = round(lr.score(x_train_std, y_train)*100,2)
test_accuracy = round(lr.score(x_test_std, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['Logistic_Regression'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }


Training Accuracy 71.5
Test Accuracy 72.42


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [54]:
dt =  DecisionTreeClassifier()
dt.fit(x_train, y_train)
train_accuracy = round(dt.score(x_train, y_train)*100,2)
test_accuracy = round(dt.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['Decision_Tree'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }


Training Accuracy 100.0
Test Accuracy 93.61


In [55]:
rfc =  RandomForestClassifier()
rfc.fit(x_train, y_train)
train_accuracy = round(rfc.score(x_train, y_train)*100,2)
test_accuracy = round(rfc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
model_result['RandomForest'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }


Training Accuracy 100.0
Test Accuracy 96.65


In [57]:
gbc =  GradientBoostingClassifier()
gbc.fit(x_train, y_train)
train_accuracy = round(gbc.score(x_train, y_train)*100,2)
test_accuracy = round(gbc.score(x_test, y_test)*100,2)
print("Training Accuracy", train_accuracy)
print("Test Accuracy", test_accuracy)
# Note: the key says 'AdaBoost', but the fitted model is a GradientBoostingClassifier.
model_result['AdaBoost'] = {'train_accuracy': train_accuracy,'test_accuracy':test_accuracy }

Training Accuracy 85.25
Test Accuracy 84.51


## 5. Model comparison

In [58]:
pd.DataFrame(model_result).T

| | train_accuracy | test_accuracy |
|---|---|---|
| Logistic_Regression | 71.5 | 72.42 |
| Decision_Tree | 100.0 | 93.61 |
| RandomForest | 100.0 | 96.65 |
| AdaBoost | 85.25 | 84.51 |


## 6. Final Model

In [59]:
importances = rfc.feature_importances_


In [60]:
## Top 10 Features contributing to Model 
feature_df = pd.DataFrame( x_train.columns ,  columns = ['variables'])
feature_df['importance'] = importances
feature_df.sort_values( by = 'importance' , ascending = False).head(10)


| | variables | importance |
|---|---|---|
| 6 | tGravityAcc-min()-X | 0.199422 |
| 4 | angle(X,gravityMean) | 0.123819 |
| 0 | tGravityAcc-max()-X | 0.114376 |
| 5 | tGravityAcc-min()-Y | 0.096854 |
| 2 | tGravityAcc-mean()-X | 0.090635 |
| 9 | angle(Y,gravityMean) | 0.088676 |
| 1 | tGravityAcc-max()-Y | 0.079719 |
| 3 | tGravityAcc-energy()-X | 0.078245 |
| 7 | tGravityAcc-mean()-Y | 0.071489 |
| 8 | tGravityAcc-energy()-Y | 0.056765 |


In [61]:
y_pred = rfc.predict(x_test)

## 7. Model Evaluation

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

Decide how to combine the multi-class metrics into a single value for each model.
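
One way to collapse the per-class metrics into a single value is support-weighted averaging. The labels below are hypothetical and chosen so the arithmetic is easy to follow; `average="weighted"` is one option among several, not a choice made by the notebook.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

# average="weighted" weights each class's metric by its support;
# average="macro" would give every class equal weight instead.
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_true, y_hat, average="weighted"
)

summary = {
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision,
    "recall": recall,
    "fscore": fscore,
}
print(summary)
```

With classes as balanced as the ones in this dataset, weighted and macro averages should be close, so either is a reasonable single-number summary.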

In [62]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, classification_report


In [63]:
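# Note: sklearn expects (y_true, y_pred); in this order, rows are predictions and columns are true labels.
confusion_matrix(y_pred, y_test)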

array([[4245,    0,    0,    0,    0,    0],
       [   0, 3931,  134,    2,    4,    2],
       [   0,  146, 3802,   21,    5,    6],
       [   0,   15,   65, 4014,   46,   35],
       [   0,   13,   49,   53, 4095,   65],
       [   0,    5,   60,   31,   81, 4075]])

In [64]:
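# Note: with (y_pred, y_test) in this order, the precision and recall columns are swapped and support counts predictions.
print(classification_report(y_pred, y_test))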

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4245
           1       0.96      0.97      0.96      4073
           2       0.93      0.96      0.94      3980
           3       0.97      0.96      0.97      4175
           4       0.97      0.96      0.96      4275
           5       0.97      0.96      0.97      4252

    accuracy                           0.97     25000
   macro avg       0.97      0.97      0.97     25000
weighted avg       0.97      0.97      0.97     25000
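
Argument order matters for both outputs above, because scikit-learn's metric functions take the true labels first. The following sketch uses hypothetical labels to show that swapping the arguments of `confusion_matrix` transposes the result:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

# Documented order: rows are true labels, columns are predicted labels.
cm_true_first = confusion_matrix(y_true, y_hat)

# Swapped order: rows are predicted labels, columns are true labels.
cm_pred_first = confusion_matrix(y_hat, y_true)

print(cm_true_first)
print(np.array_equal(cm_pred_first, cm_true_first.T))
```

Accuracy and the diagonal are unaffected by the swap. Only the off-diagonal cells and the per-class precision and recall change places.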



## 8. Model registration


In [None]:
## Create the output folders if they do not exist (change the paths to your own folders)
os.makedirs('../model_weights/', exist_ok=True)
os.makedirs('../model_features/', exist_ok=True)

In [67]:
## Let's save model weights
import joblib
# save
joblib.dump(rfc, "../model_weights/my_random_forest.joblib")

In [None]:
## Save Final columns
final_columns = np.array(x_train.columns) 
joblib.dump(final_columns, "../model_features/train_features.joblib")

## 9. Model Prediction

In [65]:
## load Test Data set 
test_data = pd.read_csv('/Users/apal/Documents/PathtoAI/AnalyticsVidhya/Mlops/data/Human_Activity_Recognition_Using_Smartphones_TestData.csv')

In [None]:
## Load Features and model weight
train_features = joblib.load("../model_features/train_features.joblib")

model = joblib.load("../model_weights/my_random_forest.joblib")


In [None]:
test_data_features = test_data[train_features]

In [None]:
# Assign the result instead of using inplace=True, which would act on a slice of test_data.
test_data_features = test_data_features.fillna(0)

In [None]:
y_prediction = model.predict(test_data_features)
y_prediction_label = le.inverse_transform(y_prediction)
test_data['Prediction_label'] = y_prediction_label

In [None]:
test_data.head()