Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [X] Choose your target. Which column in your tabular dataset will you predict? : CONDITION
- [X] Is your problem regression or classification? CLASSIFICATION
- [X] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced? Balanced 3
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [141]:
%%capture
!pip install category_encoders

In [142]:
import numpy as np
from sklearn.model_selection import train_test_split 
import os
import sklearn.pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import logging, sys

import warnings
warnings.filterwarnings("ignore")

In [143]:
# Read data
import pandas as pd
train = '/Users/filch/Dropbox/repositories/HRV/hrv_dataset/data/final/train.csv'
train = pd.read_csv(train)
test = '/Users/filch/Dropbox/repositories/HRV/hrv_dataset/data/final/test.csv'
test = pd.read_csv(test)

In [144]:
train.head()

Unnamed: 0,MEAN_RR,MEDIAN_RR,SDRR,RMSSD,SDSD,SDRR_RMSSD,HR,pNN25,pNN50,SD1,...,HF,HF_PCT,HF_NU,TP,LF_HF,HF_LF,sampen,higuci,datasetId,condition
0,885.157845,853.76373,140.972741,15.554505,15.553371,9.063146,69.499952,11.133333,0.533333,11.001565,...,15.522603,0.421047,1.514737,3686.666157,65.018055,0.01538,2.139754,1.163485,2,no stress
1,939.425371,948.357865,81.317742,12.964439,12.964195,6.272369,64.36315,5.6,0.0,9.170129,...,2.108525,0.070133,0.304603,3006.487251,327.296635,0.003055,2.174499,1.084711,2,interruption
2,898.186047,907.00686,84.497236,16.305279,16.305274,5.182201,67.450066,13.066667,0.2,11.533417,...,13.769729,0.512671,1.049528,2685.879461,94.28091,0.010607,2.13535,1.176315,2,interruption
3,881.757865,893.46003,90.370537,15.720468,15.720068,5.748591,68.809562,11.8,0.133333,11.119476,...,18.181913,0.529387,1.775294,3434.52098,55.328701,0.018074,2.178341,1.179688,2,no stress
4,809.625331,811.184865,62.766242,19.213819,19.213657,3.266724,74.565728,20.2,0.2,13.590641,...,48.215822,1.839473,3.279993,2621.175204,29.487873,0.033912,2.221121,1.249612,2,no stress


In [145]:
train.shape

(369289, 36)

In [146]:
train['condition'].value_counts()


no stress        200082
interruption     105150
time pressure     64057
Name: condition, dtype: int64

In [147]:
conditions = ['no stress','interruption','time pressure']

##Engineer new columns for conditions of testing.

HRV is a measurement of heart rate variability. HRV is an indicator of functional health. The higher your HRV is the stronger and more resilient your body is. The easier it is to recover from stressors, etc. This data is from a series of experiments in which test subjects were subjected to 3 conditions while performing a set of predetermined tasks. Some were interrupted while performing task, some were given time pressure to complete task, and some were given no stress or time table to complete task.

I originally thought I would be looking at the output of the HRV and heart data and predicting in which of those conditions they completed the task. The original files have the three conditions of the experiments. My thought was I would predict for those.  But I think I've gotten confused. The predictions are not binary. I had this idea that the encoder would split out the conditions into separate columns as numbers. But now that I am at this point I have realized that the encoder won't do this as I am removing the target from the df. And if I were to engineer these into a new columns that would essentially constitute data leakage because the prediction is in the newly engineered columns.

Code sketch I wrote to do this engineering (Since removed):

df['stress'] = df['condition'] == 'no stress'
df['interruption'] = df['condition'] == 'interruption'
df['time pressure'] = df['condition'] == 'time pressure'
df['stress'].value_counts()
df['interruption'].value_counts()
df['time pressure'].value_counts()

In [148]:
train.shape

(369289, 36)

In [149]:
train.isna().sum()

MEAN_RR              0
MEDIAN_RR            0
SDRR                 0
RMSSD                0
SDSD                 0
SDRR_RMSSD           0
HR                   0
pNN25                0
pNN50                0
SD1                  0
SD2                  0
KURT                 0
SKEW                 0
MEAN_REL_RR          0
MEDIAN_REL_RR        0
SDRR_REL_RR          0
RMSSD_REL_RR         0
SDSD_REL_RR          0
SDRR_RMSSD_REL_RR    0
KURT_REL_RR          0
SKEW_REL_RR          0
VLF                  0
VLF_PCT              0
LF                   0
LF_PCT               0
LF_NU                0
HF                   0
HF_PCT               0
HF_NU                0
TP                   0
LF_HF                0
HF_LF                0
sampen               0
higuci               0
datasetId            0
condition            0
dtype: int64

In [150]:
train, val = train_test_split(df, train_size=0.80, test_size=0.20, 
                              stratify=df['condition'], random_state=42)

In [151]:
train.shape

(295431, 36)

In [152]:
val.shape

(73858, 36)

In [153]:
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

target = 'condition'
features = train.columns.drop([target])
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

pipeline = make_pipeline( 
    DecisionTreeClassifier(max_depth=10)
)

pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.9186953342901243


In [154]:
X_train.head()

Unnamed: 0,MEAN_RR,MEDIAN_RR,SDRR,RMSSD,SDSD,SDRR_RMSSD,HR,pNN25,pNN50,SD1,...,LF_NU,HF,HF_PCT,HF_NU,TP,LF_HF,HF_LF,sampen,higuci,datasetId
78976,830.694562,843.47897,78.115334,14.080812,14.080802,5.547644,72.956075,6.8,0.4,9.959953,...,96.56239,28.412214,1.489073,3.43761,1908.046673,28.089977,0.0356,1.952928,1.240773,2
277609,741.173652,740.082575,45.450598,10.84732,10.847283,4.19003,81.257701,1.333333,0.0,7.672747,...,89.400754,48.622893,4.521232,10.599246,1075.434581,8.434633,0.118559,2.171717,1.27638,2
97270,837.620774,819.564645,82.860048,10.764555,10.764397,7.697489,72.299017,2.733333,0.2,7.614118,...,96.639853,19.021061,1.749134,3.360147,1087.455871,28.760604,0.03477,2.004642,1.215812,2
157663,745.815948,739.995175,76.772033,22.853646,22.853418,3.35929,81.301519,29.666667,2.266667,16.165199,...,95.607441,117.626508,3.054277,4.392559,3851.206643,21.765771,0.045944,2.212711,1.212051,2
178408,1037.637734,1032.13655,136.480628,16.092104,16.091682,8.481217,58.858271,11.2,1.133333,11.382335,...,99.72109,2.496212,0.056564,0.27891,4413.051163,357.538193,0.002797,2.230512,1.095117,2


In [155]:
def simple_model_evaluation():
    select = SelectKBest(k=20)
    target = 'condition'
    hrv_features = list(train)
    hrv_features = [x for x in hrv_features if x not in [target]]
    X_train= train[hrv_features]
    y_train= train[target]
    X_test = test[hrv_features]
    y_test = test[target]
    classifiers = [
                    RandomForestClassifier(n_estimators=100, max_features='log2', n_jobs=-1),
                    SVC(C=20, kernel='rbf'),   
                 ]
    for clf in classifiers:
        name = str(clf).split('(')[0]
        if 'svc' == name.lower():
            # Normalize the attribute values to mean=0 and variance=1
            from sklearn.preprocessing import StandardScaler
            scaler = StandardScaler()
            scaler.fit(X_train)
            X_train = scaler.transform(X_train)
            X_test = scaler.transform(X_test)
        clf = RandomForestClassifier()
        steps = [('feature_selection', select),
             ('model', clf)]
        pipeline = sklearn.pipeline.Pipeline(steps)
        pipeline.fit(X_train, y_train)
        y_prediction = pipeline.predict(X_test)
        print("----------------------------{0}---------------------------".format(name))
        print(sklearn.metrics.classification_report(y_test, y_prediction))
        print()
        pipeline.fit(X_test, y_test)
        y_pred = pipeline.predict(X_test)
        print('Test Accuracy', pipeline.score(X_test, y_test))
        print()
        print()

In [156]:
simple_model_evaluation()


----------------------------RandomForestClassifier---------------------------
               precision    recall  f1-score   support

 interruption       1.00      1.00      1.00     11782
    no stress       1.00      1.00      1.00     22158
time pressure       1.00      1.00      1.00      7093

     accuracy                           1.00     41033
    macro avg       1.00      1.00      1.00     41033
 weighted avg       1.00      1.00      1.00     41033


Test Accuracy 0.9999756293714815


----------------------------SVC---------------------------
               precision    recall  f1-score   support

 interruption       1.00      1.00      1.00     11782
    no stress       1.00      1.00      1.00     22158
time pressure       1.00      1.00      1.00      7093

     accuracy                           1.00     41033
    macro avg       1.00      1.00      1.00     41033
 weighted avg       1.00      1.00      1.00     41033


Test Accuracy 1.0


