Human activity monitoring is a growing field within data science. It has practical uses in the healthcare industry, particularly in tracking the elderly to make sure they do not do things that might cause them to hurt themselves. Governments are also very interested in it so that they can detect unusual crowd activities and perimeter breaches, or identify specific activities such as loitering, littering, or fighting. Fitness apps also use activity monitoring to better estimate the number of calories the body burns over a period of time.

We will train a random forest on a public-domain human activity dataset titled Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements, containing 165,633 samples, one of which is invalid. Within this dataset, there are five target activities:
    1. Sitting
    2. Sitting Down
    3. Standing
    4. Standing Up
    5. Walking

These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle.

In [1]:
import os
import time
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")
# Grab the DLA HAR dataset from:
# http://groupware.les.inf.puc-rio.br/har
# http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip

In [2]:
#
# Loading up the dataset into dataframe 'X'
#
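# The file uses ';' as the field separator and ',' as the decimal mark.
# A forward slash in the path works on Windows as well as on Linux and macOS.
X = pd.read_csv("Datasets/dataset-har-PUC-Rio-ugulino.csv", sep=';', header=0, decimal=',')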
print(X.head(2))

     user gender  age  how_tall_in_meters  weight  body_mass_index  x1  y1  \
0  debora  Woman   46                1.62      75             28.6  -3  92   
1  debora  Woman   46                1.62      75             28.6  -3  94   

   z1  x2  y2  z2  x3   y3  z3   x4   y4    z4    class  
0 -63 -23  18 -19   5  104 -92 -150 -103  -147  sitting  
1 -64 -21  18 -18 -14  104 -90 -149 -104  -145  sitting  


In [3]:
# Exploring the numerical data
print(X.describe())

                 age  how_tall_in_meters         weight  body_mass_index  \
count  165633.000000       165633.000000  165633.000000    165633.000000   
mean       38.265146            1.639712      70.819408        26.188522   
std        13.184091            0.052820      11.296527         2.995777   
min        28.000000            1.580000      55.000000        22.000000   
25%        28.000000            1.580000      55.000000        22.000000   
50%        31.000000            1.620000      75.000000        28.400000   
75%        46.000000            1.710000      83.000000        28.600000   
max        75.000000            1.710000      83.000000        28.600000   

                  x1             y1             z1             x2  \
count  165633.000000  165633.000000  165633.000000  165633.000000   
mean       -6.649327      88.293667     -93.164611     -87.827504   
std        11.616238      23.895829      39.409423     169.435194   
min      -306.000000    -271.000000    

In [4]:
# Exploring the categorical data
print (X.describe(include=['O']))

          user  gender      z4    class
count   165633  165633  165633   165633
unique       4       2     276        5
top     debora   Woman    -163  sitting
freq     51577  101374    5698    50631


In [5]:
# Checking the data types
print (X.dtypes)

user                   object
gender                 object
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                     object
class                  object
dtype: object


We make the following observations:
    1. z4 has dtype object even though it should contain only numbers, which suggests that at least one value is malformed.
    2. The gender attribute has just two values and is of type object, so it can be encoded as 0/1.
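
One way to see why a numeric-looking column ends up as object is to coerce it to numbers and look at what fails to parse. The sketch below uses a small made-up frame (the values, including the malformed one, are illustrative and not taken from the dataset) and assumes the column holds strings.

```python
import pandas as pd

# Toy stand-in for the z4 column: numeric strings plus one malformed entry.
toy = pd.DataFrame({"z4": ["-147", "-145", "12-bad-value", "-163"]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(toy["z4"], errors="coerce")

# Rows whose original value was present but could not be parsed.
bad_rows = toy[parsed.isna() & toy["z4"].notna()]
print(bad_rows)
```

Applied to the real frame, the same pattern would list any offending rows before a strict conversion is attempted.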

In [6]:
#
# Encoding the gender column, 0 as male, 1 as female
#
X.gender = X.gender.map({'Man': 0, 'Woman': 1})

In [7]:
# 
# Checking whether the dataset contains any null values
#
print (X.isnull().values.any())

False


In [8]:
# Without this cell, converting z4 to numeric raises:
# ValueError: Unable to parse string "-14420-11-2011 04:50:23.713"
# This value is malformed, so instead of trying to repair it we discard the row.

# Find the index of the row holding the bad value (row 122076 in this file)
bad_idx = X.index[X['z4'] == "-14420-11-2011 04:50:23.713"]

# Drop that row from the dataset, using the index we just found
X = X.drop(bad_idx)

In [9]:
# Converting the z4 column from object to int64
X.z4 = pd.to_numeric(X.z4, errors='raise')

In [10]:
# Extracting the labels and one-hot encoding them
y = X["class"]
y = pd.get_dummies(y)

# The user attribute will not help us here because we are
# generalizing postures and movements across users, not predicting
# postures and movements for each individual user.
# Hence, the user variable is of little use to our model

X.drop(["user","class"], inplace=True, axis = 1)
print (X.describe())

print (X[pd.isnull(X).any(axis=1)])

              gender            age  how_tall_in_meters         weight  \
count  165632.000000  165632.000000       165632.000000  165632.000000   
mean        0.612044      38.264925            1.639712      70.819431   
std         0.487286      13.183821            0.052820      11.296557   
min         0.000000      28.000000            1.580000      55.000000   
25%         0.000000      28.000000            1.580000      55.000000   
50%         1.000000      31.000000            1.620000      75.000000   
75%         1.000000      46.000000            1.710000      83.000000   
max         1.000000      75.000000            1.710000      83.000000   

       body_mass_index             x1             y1             z1  \
count    165632.000000  165632.000000  165632.000000  165632.000000   
mean         26.188535      -6.649319      88.293591     -93.164449   
std           2.995781      11.616273      23.895881      39.409487   
min          22.000000    -306.000000    -271.000
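
Because the labels were one-hot encoded with `pd.get_dummies`, the target is a table with one column per activity rather than a single column of names. The sketch below shows that encoding on a few made-up labels and one way to map it back to class names; the `dtype=int` argument and the toy series are choices made for this illustration, not part of the notebook.

```python
import pandas as pd

# A handful of illustrative labels, not taken from the dataset.
labels = pd.Series(["sitting", "walking", "standing", "sitting"], name="class")

# One column per class, with a single 1 in each row.
one_hot = pd.get_dummies(labels, dtype=int)

# idxmax along the columns recovers the original label for each row.
recovered = one_hot.idxmax(axis=1)

print(one_hot.columns.tolist())
print(recovered.tolist())
```

With a two-dimensional indicator target like this, scikit-learn treats the task as multi-label, and `score` reports subset accuracy, where a row counts as correct only if every column matches.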

In [11]:
# Now that our data is ready, let us split it into training (70%) and testing (30%) datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
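
The split above is purely random. When class frequencies are uneven, `train_test_split` also accepts a `stratify` argument that keeps class proportions the same in both parts. The sketch below illustrates this on made-up data; the notebook does not use stratification, and whether it would change the results here has not been checked.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 15 "sitting" rows and 5 "walking" rows (a 3:1 ratio).
toy_X = pd.DataFrame({"x1": range(20)})
toy_y = pd.Series(["sitting"] * 15 + ["walking"] * 5)

# stratify=toy_y asks for the same 3:1 ratio in both train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=7, stratify=toy_y
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```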

In [12]:
#
# Creating a random forest classifier model
#
model = RandomForestClassifier(n_estimators=30, max_depth=10, oob_score=True, random_state=0)

In [13]:
print ("Fitting...")
s = time.time()
#
# Training the model on your training set
#
model.fit(X_train,y_train)
print ("Fitting completed in: ", time.time() - s)

score = model.oob_score_
print ("OOB Score: ", round(score*100, 3))

Fitting...
Fitting completed in:  14.174395322799683
OOB Score:  98.744


In [14]:
print ("Scoring...")
s = time.time()
#
# Scoring the model on your test set
#
score = model.score(X_test,y_test)
print ("Score: ", round(score*100, 3))
print ("Scoring completed in: ", time.time() - s)

Scoring...
Score:  95.687
Scoring completed in:  0.7184772491455078
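
A single accuracy figure does not show which activities get confused with each other. A confusion matrix gives that per-class view. The sketch below is self-contained and uses synthetic three-class data from `make_classification` with integer labels as a stand-in for the accelerometer features, so the numbers it prints say nothing about the HAR dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

clf = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0)
clf.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)

# The diagonal holds the correct predictions, so trace / total is the accuracy.
print("Accuracy:", cm.trace() / cm.sum())
```

To apply the same idea to the notebook's model, the one-hot targets and predictions would first need to be mapped back to single class labels.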
