# DAT210x - Programming with Python for DS

## Module6 - Lab6

Human activity monitoring is a growing field within data science. It has practical use within the healthcare industry, particularly for tracking the elderly to make sure they don't end up doing things that might cause them to hurt themselves. Governments are also very interested in it so that they can detect unusual crowd activities, perimeter breaches, or specific activities such as loitering, littering, or fighting. Fitness apps also make use of activity monitoring to better estimate the number of calories used by the body during a period of time.

In this lab, you will be training a random forest against a public domain Human Activity Dataset titled Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements, containing 165,633 records, one of which is invalid. Within the dataset, there are five target activities:

- Sitting
- Sitting Down
- Standing
- Standing Up
- Walking

These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle. To get started:

- Acquire the DLA HAR Dataset from their webpage. Be sure to get the dataset-har-PUC-Rio-ugulino.zip file and not the weight lifting one. That's a bonus dataset you can try fitting afterwards! If the GroupWare website is down, we have a cached copy of the dataset in the course repo.
- Open up the sample code located in Module6/Module6 - Lab6.ipynb and read through it.
- Complete all the requisite TODOs as usual.
- Finally, answer the following questions:

In [2]:
import pandas as pd
import time

### How to Get The Dataset

Grab the DLA HAR dataset from:

- http://groupware.les.inf.puc-rio.br/har
- http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip
- A cached copy of the dataset is included in the course repository.

After extracting it, load the dataset into a dataframe named `X` and do your regular dataframe examination:

In [12]:
X = pd.read_csv(r'Datasets/dataset-har-PUC-Rio-ugulino.csv', sep=';')
X.head()


Unnamed: 0,user,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4,class
0,debora,Woman,46,162,75,286,-3,92,-63,-23,18,-19,5,104,-92,-150,-103,-147,sitting
1,debora,Woman,46,162,75,286,-3,94,-64,-21,18,-18,-14,104,-90,-149,-104,-145,sitting
2,debora,Woman,46,162,75,286,-1,97,-61,-12,20,-15,-13,104,-90,-151,-104,-144,sitting
3,debora,Woman,46,162,75,286,-2,96,-57,-15,21,-16,-13,104,-89,-153,-103,-142,sitting
4,debora,Woman,46,162,75,286,-1,96,-61,-13,20,-15,-13,104,-89,-153,-104,-143,sitting


In [15]:
# X.z4.value_counts() reveals one malformed entry: '-14420-11-2011 04:50:23.713'
# X.info() reports these columns as object dtype:
# how_tall_in_meters    165633 non-null object
# body_mass_index       165633 non-null object
# z4                    165633 non-null object
# class                 165633 non-null object

Encode the gender column such that `0` is male and `1` is female:

In [17]:
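# Note: this mapping assigns Woman -> 0 and Man -> 1, which is the reverse of the
# encoding described above (0 = male, 1 = female).
X.gender = X.gender.map({'Woman': 0, 'Man': 1})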

Clean up any columns with commas in them so that they're properly represented as decimals:

In [26]:
X.how_tall_in_meters = X.how_tall_in_meters.str.replace(',', '.')
X.body_mass_index = X.body_mass_index.str.replace(',', '.')
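As an alternative to replacing commas after loading, `pd.read_csv` accepts a `decimal` argument that handles comma decimals at parse time. The sketch below uses a small in-memory sample; the two rows and the name `sample` are made up for illustration and are not read from the dataset file. Note that this would not fix the malformed `z4` record, which still has to be dropped separately.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the file's ';' separator and ',' decimals
sample = io.StringIO(
    "how_tall_in_meters;weight;body_mass_index\n"
    "1,62;75;28,6\n"
    "1,58;55;22,0\n"
)

# decimal=',' tells the parser to treat ',' as the decimal point
df = pd.read_csv(sample, sep=';', decimal=',')
print(df.dtypes)
```

With this approach the height and BMI columns should arrive as floats, so the `str.replace` step would not be needed for them.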


Let's take a peek at your data types:

In [27]:
X.dtypes

user                  object
gender                 int64
age                    int64
how_tall_in_meters    object
weight                 int64
body_mass_index       object
x1                     int64
y1                     int64
z1                     int64
x2                     int64
y2                     int64
z2                     int64
x3                     int64
y3                     int64
z3                     int64
x4                     int64
y4                     int64
z4                    object
class                 object
dtype: object

Convert any column that needs it to numeric, using `errors='raise'`. This will alert you if something ends up being problematic.

In [49]:
X.how_tall_in_meters = pd.to_numeric(X.how_tall_in_meters, errors='raise')
X.body_mass_index = pd.to_numeric(X.body_mass_index, errors='raise')
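# X[X.z4 == '-14420-11-2011 04:50:23.713'] shows the malformed record at index 122076.
# Drop it before converting z4. X.index[122076] is positional, so re-running this
# cell would silently drop one more row each time.
X = X.drop(labels=X.index[122076], axis=0)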
X.z4 = pd.to_numeric(X.z4, errors='raise')

If you find any problematic records, drop them before calling the `to_numeric` methods above.
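One way to locate problematic records without hunting for them by hand is to convert with `errors='coerce'` first and look at what turned into NaN. This is a minimal sketch on a toy column, not the lab's required approach; the frame `df` and its four values are invented for illustration, and it assumes the column has no genuinely missing values to begin with.

```python
import pandas as pd

# Hypothetical toy column: one entry is a malformed string like the one found in z4
df = pd.DataFrame({'z4': ['-147', '-145', '-14420-11-2011 04:50:23.713', '-142']})

# errors='coerce' turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(df['z4'], errors='coerce')

# Rows that failed to parse show up as NaN in the converted series
bad_rows = df[converted.isna()]
print(bad_rows)

# Keep only the parseable rows; unlike a positional drop, this filter is safe to re-run
clean = df[converted.notna()].copy()
clean['z4'] = pd.to_numeric(clean['z4'], errors='raise')
print(clean.dtypes)
```

Filtering with a boolean mask has the side benefit of being idempotent, so running the cell twice leaves the same rows in place.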

Okay, now encode your `y` value as a Pandas dummies version of your dataset's `class` column:

In [56]:
y = X['class']
y = pd.get_dummies(y)
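To see what the dummies encoding produces, here is a small sketch on a made-up label series (the four labels below are illustrative, not taken from the file). Each distinct label becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical labels standing in for the dataset's `class` column
labels = pd.Series(['sitting', 'walking', 'standing', 'sitting'], name='class')

# One indicator column per distinct label
dummies = pd.get_dummies(labels)
print(dummies.columns.tolist())
print(dummies.shape)
```

Because `y` becomes a multi-column frame, the classifier is trained on several indicator targets at once rather than on a single label column.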

In fact, get rid of the `user` and `class` columns:

In [58]:
X = X.drop(labels=['user', 'class'], axis=1)

Let's take a look at your handiwork:

In [59]:
X.describe()

Unnamed: 0,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4
count,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0,165631.0
mean,0.387953,38.264703,1.639712,70.819454,26.188548,-6.649311,88.293562,-93.164383,-87.828528,-52.065984,-175.056559,17.423719,104.516987,-93.881689,-167.640732,-92.624955,-159.651321
std,0.487285,13.183552,0.052821,11.296587,2.995786,11.616307,23.89595,39.409597,169.435957,205.160699,192.817336,52.635641,54.156143,45.389903,38.310955,19.968389,13.220353
min,0.0,28.0,1.58,55.0,22.0,-306.0,-271.0,-603.0,-494.0,-517.0,-617.0,-499.0,-506.0,-613.0,-702.0,-526.0,-537.0
25%,0.0,28.0,1.58,55.0,22.0,-12.0,78.0,-120.0,-35.0,-29.0,-141.0,9.0,95.0,-103.0,-190.0,-103.0,-167.0
50%,0.0,31.0,1.62,75.0,28.4,-6.0,94.0,-98.0,-9.0,27.0,-118.0,22.0,107.0,-90.0,-168.0,-91.0,-160.0
75%,1.0,46.0,1.71,83.0,28.6,0.0,101.0,-64.0,4.0,86.0,-29.0,34.0,120.0,-80.0,-153.0,-80.0,-153.0
max,1.0,75.0,1.71,83.0,28.6,509.0,533.0,411.0,473.0,295.0,122.0,507.0,517.0,410.0,-13.0,86.0,-43.0


You can also easily display which rows have NaNs in them, if any:

In [60]:
X[pd.isnull(X).any(axis=1)]

Unnamed: 0,gender,age,how_tall_in_meters,weight,body_mass_index,x1,y1,z1,x2,y2,z2,x3,y3,z3,x4,y4,z4


Create an RForest classifier named `model` and set `n_estimators=30`, the `max_depth` to 10, `oob_score=True`, and `random_state=0`:

In [62]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30, max_depth=10, oob_score=True, random_state=0)

Split your data into `test` / `train` sets. Your `test` size can be 30%, with `random_state` 7. Use variable names: `X_train`, `X_test`, `y_train`, and `y_test`:

In [64]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

### Now the Fun Stuff

In [65]:
print("Fitting...")
s = time.time()

# TODO: train your model on your training set
model.fit(X_train, y_train)

print("Fitting completed in: ", time.time() - s)

Fitting...
Fitting completed in:  9.439747095108032


Display the OOB Score of your data:

In [66]:
score = model.oob_score_
print("OOB Score: ", round(score*100, 3))

OOB Score:  98.853
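The out-of-bag (OOB) score is computed from training samples that each tree did not see in its bootstrap sample, so it gives an accuracy estimate without touching the test set. The sketch below shows the same call pattern on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `forest` are placeholders, and the numbers it prints say nothing about the HAR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the HAR dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=30, max_depth=10, oob_score=True, random_state=0
)
forest.fit(X_demo, y_demo)

# oob_score_ is only populated when oob_score=True was passed to the constructor
print("OOB Score: ", round(forest.oob_score_ * 100, 3))
```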


In [67]:
print("Scoring...")
s = time.time()

# TODO: score your model on your test set
score = model.score(X_test, y_test)

print("Score: ", round(score*100, 3))
print("Scoring completed in: ", time.time() - s)

Scoring...
Score:  96.386
Scoring completed in:  0.6602330207824707


At this point, go ahead and answer the lab questions, then return here to experiment more:

Try playing around with the gender column. For example, encode `gender` as `Male: 1` and `Female: 0`. Also try encoding it as a Pandas dummies variable and see what effect that has. You can also try dropping gender entirely from the dataframe. How does that change the score of the model? This will be a key insight into how your feature encoding alters your overall scoring, and why it's important to choose a good one.
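The three gender variants described above could be set up roughly as follows. This is a sketch on a made-up three-row frame called `toy`, assuming the raw `gender` column still holds the strings `'Woman'` and `'Man'`; in the lab you would apply the same idea to `X` before splitting and fitting.

```python
import pandas as pd

# Hypothetical miniature frame with the raw gender strings
toy = pd.DataFrame({'gender': ['Woman', 'Man', 'Woman'], 'age': [46, 31, 28]})

# Variant 1: Male -> 1, Female -> 0
v1 = toy.assign(gender=toy.gender.map({'Man': 1, 'Woman': 0}))

# Variant 2: one dummy column per gender value
v2 = pd.get_dummies(toy, columns=['gender'])

# Variant 3: drop gender entirely
v3 = toy.drop(columns=['gender'])

for name, frame in [('mapped', v1), ('dummies', v2), ('dropped', v3)]:
    print(name, frame.columns.tolist())
```

Refit the model on each variant with the same `random_state` values so that any change in score comes from the encoding rather than from the split.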

In [None]:
# .. your code changes above ..

After that, try messing with `y`. Right now it's encoded with dummies, but try other encoding methods to see what effects they have.
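Two alternative encodings for `y` are sketched here on a made-up label series; these are suggestions for the experiment, not the lab's prescribed answer. scikit-learn classifiers accept string labels directly, and a pandas categorical can supply integer codes if you prefer numbers.

```python
import pandas as pd

# Hypothetical labels standing in for the dataset's `class` column
labels = pd.Series(['sitting', 'walking', 'standing', 'sitting'], name='class')

# Option A: keep the raw strings as a single target column
y_strings = labels

# Option B: integer codes via a pandas categorical (categories are sorted)
as_category = labels.astype('category')
y_codes = as_category.cat.codes

# Mapping from integer code back to the original label
print(dict(enumerate(as_category.cat.categories)))
```

Either option gives the model a single target column instead of one indicator column per activity, which changes how accuracy is counted when you call `score`.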