## Introduction

We will be using the [Human Activity Recognition with Smartphones](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record as one of six activities: walking, walking upstairs, walking downstairs, sitting, standing, or laying.

For each record, the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables. 
- Its activity label. 

More information about the features is available on the website above.

In [None]:
from __future__ import print_function
import os
data_path = [ 'data']
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 1. Data Import

Import the data and do the following:

* Examine the data types--there are many columns, so it might be wise to use value counts
* Determine if the floating point values need to be scaled
* Determine the breakdown of each activity
* Encode the activity label as an integer

In [None]:
import pandas as pd
import numpy as np

# Load the augmented dataset; despite the .gzip extension, the file is read as Parquet.
filepath = 'Human_Activity_Recognition_Using_Smartphones_Data_augmented_data.gzip'
data = pd.read_parquet(filepath)

In [None]:
data.shape

In [15]:
data['Activity'].value_counts()

LAYING                94635
STANDING              94597
SITTING               94468
WALKING               94413
WALKING_UPSTAIRS      94235
WALKING_DOWNSTAIRS    94097
Name: Activity, dtype: int64

In [16]:
data.describe().T

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| tBodyAcc-mean()-X | 566445.0 | 0.274936 | 0.059249 | -1.0 | 0.262873 | 0.277136 | 0.288076 | 1.0 |
| tBodyAcc-mean()-Y | 566445.0 | -0.017419 | 0.029188 | -1.0 | -0.024297 | -0.017138 | -0.011188 | 1.0 |
| tBodyAcc-mean()-Z | 566445.0 | -0.108818 | 0.043453 | -1.0 | -0.120096 | -0.108533 | -0.098246 | 1.0 |
| tBodyAcc-std()-X | 566445.0 | -0.565518 | 0.450373 | -1.0 | -0.991588 | -0.637147 | -0.211506 | 1.0 |
| tBodyAcc-std()-Y | 566445.0 | -0.469409 | 0.503388 | -1.0 | -0.974501 | -0.393744 | -0.029494 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| angle(tBodyGyroMean,gravityMean) | 566445.0 | 0.021176 | 0.594952 | -1.0 | -0.451814 | 0.021843 | 0.510186 | 1.0 |
| angle(tBodyGyroJerkMean,gravityMean) | 566445.0 | -0.015486 | 0.426282 | -1.0 | -0.341371 | -0.014399 | 0.299621 | 1.0 |
| angle(X,gravityMean) | 566445.0 | -0.523804 | 0.486867 | -1.0 | -0.816700 | -0.724314 | -0.555039 | 1.0 |
| angle(Y,gravityMean) | 566445.0 | 0.079559 | 0.293166 | -1.0 | 0.029635 | 0.188226 | 0.253245 | 1.0 |


In [17]:
le = LabelEncoder()
data['Activity'] = le.fit_transform(data['Activity'])

In [18]:
data['Activity'].value_counts()

0    94635
2    94597
1    94468
3    94413
5    94235
4    94097
Name: Activity, dtype: int64
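
The four checks listed at the top of this section can be sketched on a tiny stand-in frame. This is a minimal illustration, not the notebook's actual cells: the `toy` DataFrame, its column names `feat_a` and `feat_b`, and the `already_scaled` flag are all hypothetical. The range test assumes that features bounded by -1.0 and 1.0 (as the `describe()` output above reports for the listed columns) need no further scaling.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the real dataset: two float features and a label column.
toy = pd.DataFrame({
    "feat_a": [-1.0, 0.2, 1.0, 0.5],
    "feat_b": [0.1, -0.3, 0.9, -1.0],
    "Activity": ["SITTING", "WALKING", "SITTING", "LAYING"],
})

# 1. Summarize data types with value counts instead of listing every column.
dtype_counts = toy.dtypes.value_counts()

# 2. Check whether the float features already lie within [-1, 1].
features = toy.drop(columns="Activity")
global_min = features.min().min()
global_max = features.max().max()
already_scaled = bool(global_min >= -1.0 and global_max <= 1.0)

# 3. Breakdown of each activity as a fraction of all records.
activity_share = toy["Activity"].value_counts(normalize=True)

# 4. Encode the activity label as an integer (classes are sorted alphabetically).
le = LabelEncoder()
toy["Activity"] = le.fit_transform(toy["Activity"])

print(dtype_counts)
print("already scaled:", already_scaled)
print(activity_share)
print(dict(zip(le.classes_, range(len(le.classes_)))))
```

Keeping the fitted `le` object around is useful because `le.inverse_transform` maps integer predictions back to activity names.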

## 2. EDA
 
* Calculate the correlations between the independent variables (the features).
* Create a histogram of the correlation values
* Identify those that are most correlated (either positively or negatively).

In [32]:
#data.corr()

In [None]:
# Heatmap of all pairwise correlations (sns is the seaborn alias imported earlier).
sns.heatmap(data.corr())

In [None]:
corr_data = data.corr()


In [None]:
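# Entries above 0.8 are kept, the rest become NaN; compare corr_data.abs() instead to also catch strong negative correlations.
corr_data[corr_data > 0.8]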
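
One way to carry out the three EDA steps is sketched below on synthetic data. It is an illustration under stated assumptions, not the notebook's own solution: the `toy` frame and the feature names `f1` to `f4` are made up, and the 0.8 cutoff simply mirrors the threshold used in the cell above. The sketch keeps each feature pair once (strict upper triangle), bins the absolute correlations, and sorts to find the strongest pairs in either direction.

```python
import numpy as np
import pandas as pd

# Synthetic features: f2 tracks f1, f3 mirrors f1, f4 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.1, size=200),
    "f3": -base + rng.normal(scale=0.1, size=200),
    "f4": rng.normal(size=200),
})

corr = toy.corr()

# Indices of the strict upper triangle: each pair once, diagonal excluded.
i, j = np.triu_indices(len(corr), k=1)
pairs = pd.DataFrame({
    "feature_1": corr.index[i],
    "feature_2": corr.columns[j],
    "correlation": corr.to_numpy()[i, j],
})
pairs["abs_correlation"] = pairs["correlation"].abs()

# Histogram of absolute correlation values (bin counts only; plt.hist would draw it).
counts, bin_edges = np.histogram(pairs["abs_correlation"], bins=10, range=(0, 1))

# Most correlated pairs, positive or negative, strongest first.
top_pairs = pairs.sort_values("abs_correlation", ascending=False)
strong = top_pairs[top_pairs["abs_correlation"] > 0.8]
print(strong)
```

On the full dataset the label column would be dropped before calling `corr()`, so that only feature-to-feature correlations are examined.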

## 3. Data preparation

* Split the data into train and test data sets. 
* Regardless of methods used to split the data, compare the ratio of classes in both the train and test splits.


In [20]:
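from sklearn.model_selection import train_test_split

# Separate the 561 features from the encoded activity label.
X = data.drop('Activity', axis=1)
Y = data['Activity']
# Default split: 25% test, shuffled, not stratified, no fixed random_state.
x_train, x_test, y_train, y_test = train_test_split(X, Y)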

In [21]:
x_train.shape , y_train.shape

((424833, 561), (424833,))

In [22]:
x_test.shape , y_test.shape

((141612, 561), (141612,))
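
The second task in this section, comparing class ratios across the splits, could look like the sketch below. It uses `StratifiedShuffleSplit` (imported at the top of the notebook) rather than the `train_test_split` call above, and runs on synthetic data; `X_toy`, `y_toy`, and the deliberately imbalanced class sizes are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, deliberately imbalanced data: 300 / 200 / 100 records per class.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(600, 4)), columns=["a", "b", "c", "d"])
y_toy = pd.Series(np.repeat([0, 1, 2], [300, 200, 100]), name="Activity")

# A single stratified split that holds out 25% of the records.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy))

X_tr, X_te = X_toy.iloc[train_idx], X_toy.iloc[test_idx]
y_tr, y_te = y_toy.iloc[train_idx], y_toy.iloc[test_idx]

# Side-by-side class proportions; stratification should make the columns match.
ratios = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(ratios)
```

The same `value_counts(normalize=True)` comparison applies to `y_train` and `y_test` from the unstratified split; with roughly balanced classes and a large sample, the proportions are likely to be close there as well.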

## 4. Model Training

* Fit a logistic regression model without any regularization using all of the features. Be sure to read the documentation about fitting a multi-class model so you understand the coefficient output. Store the model.

In [23]:
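# Note: LogisticRegression() applies L2 regularization by default (C=1.0),
# so this fit is regularized despite the instruction above.
lr = LogisticRegression()
lr.fit(x_train, y_train)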

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [24]:
print("Training Accuracy", lr.score(x_train, y_train)) 

Training Accuracy 0.9945437383630744


In [25]:
print("Test Accuracy", lr.score(x_test, y_test))

Test Accuracy 0.9948733158206932


In [26]:
y_pred = lr.predict(x_test)

In [27]:
y_pred

array([0, 1, 2, ..., 3, 1, 5])
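
The sketch below touches on two points from this section: the request for a fit without regularization, and the convergence warning's suggestion to scale the data or raise `max_iter`. It is an approximation on synthetic data, not a rerun of the model above. A very large `C` (here `1e6`, an arbitrary choice) makes the default L2 penalty negligible; depending on the installed scikit-learn version there may also be an explicit option to switch the penalty off, so check the `LogisticRegression` documentation. The names `X_toy`, `y_toy`, and `model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 8 features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

# Standardize the features, then fit with a negligible penalty (large C)
# and a higher iteration cap than the default.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1e6, max_iter=2000),
)
model.fit(X_toy, y_toy)

# In a multi-class fit, coef_ holds one row of coefficients per class.
clf = model.named_steps["logisticregression"]
print("coef_ shape:", clf.coef_.shape)            # (n_classes, n_features)
print("intercept_ shape:", clf.intercept_.shape)  # (n_classes,)
print("training accuracy:", model.score(X_toy, y_toy))
```

Reading `coef_` row by row gives the per-class weight of each feature, which is the coefficient output the task description refers to.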

## 7. Model Evaluation

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

Decide how to combine the multi-class metrics into a single value for each model.

In [28]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, classification_report


In [29]:
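# Arguments are passed as (y_pred, y_test), so rows are predicted labels and columns are true labels; scikit-learn's signature is confusion_matrix(y_true, y_pred).
confusion_matrix(y_pred, y_test)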

array([[23834,     0,     0,     0,     0,     0],
       [    0, 23191,   352,     0,     0,     0],
       [    0,   360, 23383,     0,     0,     0],
       [    0,     0,     0, 23555,     3,     3],
       [    0,     0,     0,     1, 23500,     0],
       [    0,     6,     0,     0,     1, 23423]])

In [31]:
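# With (y_pred, y_test) in place of scikit-learn's (y_true, y_pred) order, the precision and recall columns below are swapped and support counts predictions.
print(classification_report(y_pred, y_test))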

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     23834
           1       0.98      0.99      0.98     23543
           2       0.99      0.98      0.99     23743
           3       1.00      1.00      1.00     23561
           4       1.00      1.00      1.00     23501
           5       1.00      1.00      1.00     23430

    accuracy                           0.99    141612
   macro avg       0.99      0.99      0.99    141612
weighted avg       0.99      0.99      0.99    141612
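
To combine the multi-class metrics into a single value per model, one option is support-weighted averaging. The sketch below uses hand-made labels so the numbers can be checked by hand; the arrays and the choice of `average="weighted"` are assumptions for illustration, and `"macro"` or `"micro"` averaging are equally valid alternatives. Arguments follow scikit-learn's `(y_true, y_pred)` order.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hand-made labels: one class-1 record is misclassified as class 2.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# Support-weighted averages collapse the per-class scores into one number each.
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
accuracy = accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} fscore={fscore:.3f}")
print(cm)
```

Weighted recall equals accuracy by construction, so reporting both adds no information; precision and F-score are the values that differ.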



## 8. Model registration


## 9. Model prediction