# DAT210x - Programming with Python for DS

## Module 6 - Lab 6

In [1]:
import pandas as pd
import time

### How to Get The Dataset

Grab the DLA HAR dataset from:

- http://groupware.les.inf.puc-rio.br/har
- http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip
- A cached copy of the dataset is included in the course repository.

After extracting it, load the dataset into a dataframe named `X` and do your regular dataframe examination:

In [2]:
X = pd.read_csv('Datasets/dataset-har-PUC-Rio-ugulino.csv',delimiter=';')



Encode the gender column such that `0` is male and `1` is female:

In [3]:
# This run drops gender entirely; the encodings below are alternatives to try.
X.drop('gender', axis=1, inplace=True)
#X.gender = X.gender.map({'Man':0, 'Woman':1})
#X.gender = pd.get_dummies(X.gender)
#print(X.gender.unique())
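
The two encodings hinted at in the commented-out lines can be sketched on a toy frame. This is only an illustration: the `demo` frame is hypothetical, and the `'Man'` / `'Woman'` labels are assumed from the commented-out mapping rather than checked against the dataset.

```python
import pandas as pd

# Hypothetical toy data; labels assumed from the commented-out mapping.
demo = pd.DataFrame({'gender': ['Man', 'Woman', 'Woman', 'Man']})

# Option 1: explicit mapping into a single column, 0 = male, 1 = female.
mapped = demo['gender'].map({'Man': 0, 'Woman': 1})

# Option 2: one indicator column per category.
dummies = pd.get_dummies(demo['gender'])

print(mapped.tolist())
print(list(dummies.columns))
```

Note that `get_dummies` returns one column per category, so its result would normally be joined to the frame rather than assigned to a single column.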

Clean up any columns with commas in them so that they're properly represented as decimals:

In [4]:
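# Assign the result back instead of calling replace(inplace=True) on a column
# selection, which may act on a copy in newer pandas versions.
X['how_tall_in_meters'] = X['how_tall_in_meters'].replace({',': '.'}, regex=True)
X['body_mass_index'] = X['body_mass_index'].replace({',': '.'}, regex=True)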
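
As a small sketch of the comma-to-decimal cleanup, the snippet below parses a made-up two-row CSV two ways: by letting `read_csv` handle decimal commas through its `decimal` parameter, and by replacing the comma by hand. The column names `height` and `bmi` and the inline CSV text are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data that uses decimal commas.
csv_text = "height;bmi\n1,62;28,4\n1,71;28,6\n"

# Option 1: tell the parser that ',' is the decimal separator.
fixed = pd.read_csv(io.StringIO(csv_text), sep=';', decimal=',')

# Option 2: read as text, then swap the comma and convert manually.
manual = pd.read_csv(io.StringIO(csv_text), sep=';')
manual['height'] = manual['height'].str.replace(',', '.', regex=False).astype(float)

print(fixed.dtypes)
print(manual['height'].tolist())
```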

Let's take a peek at your data types:

In [5]:
X.dtypes

user                  object
age                    int64
how_tall_in_meters    object
weight                 int64
body_mass_index       object
x1                     int64
y1                     int64
z1                     int64
x2                     int64
y2                     int64
z2                     int64
x3                     int64
y3                     int64
z3                     int64
x4                     int64
y4                     int64
z4                    object
class                 object
dtype: object

Convert any column that needs it to numeric, using `errors='raise'`. This will alert you if something ends up being problematic.

In [6]:
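X['how_tall_in_meters'] = pd.to_numeric(X['how_tall_in_meters'], errors='raise')
X['body_mass_index'] = pd.to_numeric(X['body_mass_index'], errors='raise')
# z4 uses errors='coerce' instead: unparseable values become NaN
# and those rows are dropped in the next cell.
X['z4'] = pd.to_numeric(X['z4'], errors='coerce')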

If you find any problematic records, drop them before calling the `to_numeric` methods above.
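
One way to locate problematic records, sketched here on a made-up series: convert with `errors='coerce'`, then look at the positions that became NaN. The values in `z` (including the bad entry `'oops'`) are hypothetical.

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed as a number.
z = pd.Series(['-160', '-153', 'oops', '-167'])

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(z, errors='coerce')

# The original values at the NaN positions are the problematic records.
bad_rows = z[converted.isna()]
clean = converted.dropna()

print(bad_rows)
print(clean.tolist())
```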

Okay, now encode your `y` value as a Pandas dummies version of your dataset's `class` column:

In [7]:
print(X.isnull().any())
X.dropna(axis=0, inplace=True)
y = X['class'].astype("category").cat.codes
#y = pd.get_dummies(X['class'])
#print(y)

user                  False
age                   False
how_tall_in_meters    False
weight                False
body_mass_index       False
x1                    False
y1                    False
z1                    False
x2                    False
y2                    False
z2                    False
x3                    False
y3                    False
z3                    False
x4                    False
y4                    False
z4                     True
class                 False
dtype: bool
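
For comparison, here is a minimal sketch of the two `y` encodings that appear in the cell above: integer category codes (the one actually used) and a dummies frame (the commented-out alternative). The label values are hypothetical stand-ins for the `class` column.

```python
import pandas as pd

# Hypothetical activity labels standing in for the 'class' column.
labels = pd.Series(['sitting', 'walking', 'standing', 'sitting'])

# Single column of integer codes; categories are ordered alphabetically.
codes = labels.astype('category').cat.codes

# One indicator column per distinct label.
dummies = pd.get_dummies(labels)

print(codes.tolist())
print(list(dummies.columns))
```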


Now that `y` has been extracted, get rid of the `user` and `class` columns:

In [8]:
X.drop(['class','user'],axis=1,inplace=True)
print(X.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165632 entries, 0 to 165632
Data columns (total 16 columns):
age                   165632 non-null int64
how_tall_in_meters    165632 non-null float64
weight                165632 non-null int64
body_mass_index       165632 non-null float64
x1                    165632 non-null int64
y1                    165632 non-null int64
z1                    165632 non-null int64
x2                    165632 non-null int64
y2                    165632 non-null int64
z2                    165632 non-null int64
x3                    165632 non-null int64
y3                    165632 non-null int64
z3                    165632 non-null int64
x4                    165632 non-null int64
y4                    165632 non-null int64
z4                    165632 non-null float64
dtypes: float64(3), int64(13)
memory usage: 21.5 MB
None


Let's take a look at your handiwork:

In [9]:
X.describe()

| | age | how_tall_in_meters | weight | body_mass_index | x1 | y1 | z1 | x2 | y2 | z2 | x3 | y3 | z3 | x4 | y4 | z4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 | 165632.0 |
| mean | 38.264925 | 1.639712 | 70.819431 | 26.188535 | -6.649319 | 88.293591 | -93.164449 | -87.827956 | -52.065911 | -175.055647 | 17.423517 | 104.517056 | -93.881641 | -167.641211 | -92.625235 | -159.650985 |
| std | 13.183821 | 0.05282 | 11.296557 | 2.995781 | 11.616273 | 23.895881 | 39.409487 | 169.435606 | 205.160081 | 192.817111 | 52.635546 | 54.155987 | 45.38977 | 38.311336 | 19.968653 | 13.22102 |
| min | 28.0 | 1.58 | 55.0 | 22.0 | -306.0 | -271.0 | -603.0 | -494.0 | -517.0 | -617.0 | -499.0 | -506.0 | -613.0 | -702.0 | -526.0 | -537.0 |
| 25% | 28.0 | 1.58 | 55.0 | 22.0 | -12.0 | 78.0 | -120.0 | -35.0 | -29.0 | -141.0 | 9.0 | 95.0 | -103.0 | -190.0 | -103.0 | -167.0 |
| 50% | 31.0 | 1.62 | 75.0 | 28.4 | -6.0 | 94.0 | -98.0 | -9.0 | 27.0 | -118.0 | 22.0 | 107.0 | -90.0 | -168.0 | -91.0 | -160.0 |
| 75% | 46.0 | 1.71 | 83.0 | 28.6 | 0.0 | 101.0 | -64.0 | 4.0 | 86.0 | -29.0 | 34.0 | 120.0 | -80.0 | -153.0 | -80.0 | -153.0 |
| max | 75.0 | 1.71 | 83.0 | 28.6 | 509.0 | 533.0 | 411.0 | 473.0 | 295.0 | 122.0 | 507.0 | 517.0 | 410.0 | -13.0 | 86.0 | -43.0 |


You can also easily display which rows have NaNs in them, if any:

In [10]:
X[pd.isnull(X).any(axis=1)]

| | age | how_tall_in_meters | weight | body_mass_index | x1 | y1 | z1 | x2 | y2 | z2 | x3 | y3 | z3 | x4 | y4 | z4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


Create a random forest classifier named `model` and set `n_estimators=30`, `max_depth=10`, `oob_score=True`, and `random_state=0`:

In [11]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30,max_depth=10,oob_score=True,random_state=0)

Split your data into `test` / `train` sets. Your `test` size can be 30%, with `random_state` 7. Use variable names: `X_train`, `X_test`, `y_train`, and `y_test`:

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=7)
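
To see what the split produces, here is a small sketch on a made-up ten-row array using the same `test_size` and `random_state`. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and binary labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

# 30% of the rows go to the test set; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

print(X_tr.shape, X_te.shape)
```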

### Now the Fun Stuff

In [13]:
print("Fitting...")
s = time.time()

# Train the model on the training set
model.fit(X_train, y_train)

print("Fitting completed in: ", time.time() - s)

Fitting...
('Fitting completed in: ', 3.450000047683716)


Display the OOB Score of your data:

In [14]:
score = model.oob_score_
print("OOB Score: ", round(score*100, 3))

('OOB Score: ', 97.379)
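
The OOB score is computed from the training samples each tree did not see in its bootstrap sample, so it gives an accuracy estimate without touching the test set. The sketch below uses synthetic data from `make_classification` rather than the HAR dataset, so the numbers it prints say nothing about this lab's results; the names `forest`, `oob`, and `test_acc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the HAR dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

# Same hyperparameters as the lab's model.
forest = RandomForestClassifier(
    n_estimators=30, max_depth=10, oob_score=True, random_state=0
)
forest.fit(X_tr, y_tr)

oob = forest.oob_score_            # accuracy on out-of-bag training samples
test_acc = forest.score(X_te, y_te)  # accuracy on the held-out test set

print("OOB:", round(oob * 100, 3), "Test:", round(test_acc * 100, 3))
```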


In [15]:
print("Scoring...")
s = time.time()

# Score the model on the test set
score = model.score(X_test, y_test)

print("Score: ", round(score*100, 3))
print("Scoring completed in: ", time.time() - s)

Scoring...
('Score: ', 97.517)
('Scoring completed in: ', 0.19100022315979004)


At this point, go ahead and answer the lab questions, then return here to experiment more.

Try playing around with the gender column. For example, encode `gender` as `Male:1` and `Female:0`. Also try encoding it as a Pandas dummies variable and see what effect that has. You can also try dropping gender entirely from the dataframe. How does that change the score of the model? This will be a key insight into how your feature encoding alters your overall scoring, and why it's important to choose a good encoding.

In [16]:
# Dropping the gender column noticeably increases the OOB score; however, the OOB score drops a bit when y is encoded in a single column.

After that, try messing with `y`. Right now it's encoded with dummies, but try other encoding methods to see what effects they have.

In [17]:
# A higher score is achieved when y is encoded in 1 column instead of 4, but the OOB score drops a bit.
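
One plausible reason the `y` encoding changes the score: when `y` is a 2-D indicator frame, scikit-learn treats the task as multilabel, and `score` reports subset accuracy, where every column of a row must match. The sketch below only shows how the two targets differ in shape; it uses synthetic data and hypothetical labels `'a'`, `'b'`, `'c'`, not the HAR classes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features and three hypothetical class labels.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
labels = pd.Series(
    np.where(X_demo[:, 0] > 0.5, 'a', np.where(X_demo[:, 1] > 0, 'b', 'c'))
)

# Single-column integer codes versus one indicator column per class.
y_codes = labels.astype('category').cat.codes
y_dummies = pd.get_dummies(labels).astype(int)

rf_codes = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_demo, y_codes)
rf_dummies = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_demo, y_dummies)

# 1-D target -> one predicted class per row; 2-D target -> one 0/1 per column.
pred_codes = rf_codes.predict(X_demo)
pred_dummies = rf_dummies.predict(X_demo)

print(pred_codes.shape, pred_dummies.shape)
```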