## Regression Analysis on the outputs from the previously coded algorithms

- The sampling rate of the empkins dataset (1000 Hz) does not match the sampling rate of the guardian dataset (500 Hz). How to deal with that? Convert sample indices to milliseconds when building the train and test data
- Normalize B-point samples according to the start and end point of the heartbeat as part of data preprocessing?
    - Try both approaches
- Start without feature selection, since the tree should use all outputs generated by the algorithms
- How to impute NaN values? If normalized between 0 and 1, just use the mean?
    - Drop them first
    - Background: many algorithms don't handle NaN values
    - Check how many entries contain NaN
    - Use a SimpleImputer (e.g. with the mean) or a KNNImputer first to test the pipeline properly (https://scikit-learn.org/1.5/modules/impute.html)
- GroupKFold could be used for cross-validation to ensure that a participant is not present in both the train and the test data --> until now I only used KFold
- Splitting of the data:
    - Use biopsykit and apply GroupKFold. The MultiIndex can remain in the dataframe; it is important that there is one heartbeat per row
    - I split the data on the participant level using GroupShuffleSplit
    - I treated the datasets separately --> 20% test data for each of the guardian and empkins datasets
    - The train and test data of the empkins and guardian datasets were joined --> the amount of guardian and empkins data in the train and test datasets is approximately equal
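
The first two notes (unit conversion and per-heartbeat normalization) can be sketched as follows. This is only an illustration under assumptions: the function names `samples_to_ms` and `normalize_within_heartbeat` are hypothetical helpers, not part of biopsykit, and the numbers are made up.

```python
import numpy as np


def samples_to_ms(samples, sampling_rate_hz):
    """Convert sample indices to milliseconds for a given sampling rate (hypothetical helper)."""
    return np.asarray(samples, dtype=float) * 1000.0 / sampling_rate_hz


def normalize_within_heartbeat(b_point, heartbeat_start, heartbeat_end):
    """Express a B-point position as a fraction (0..1) of its heartbeat (hypothetical helper)."""
    b_point = np.asarray(b_point, dtype=float)
    start = np.asarray(heartbeat_start, dtype=float)
    end = np.asarray(heartbeat_end, dtype=float)
    return (b_point - start) / (end - start)


# The same instant, 0.3 s after the start of a recording, in both datasets
empkins_ms = samples_to_ms([300], sampling_rate_hz=1000)  # 1000 Hz recording
guardian_ms = samples_to_ms([150], sampling_rate_hz=500)  # 500 Hz recording

# A B-point at sample 150 in a heartbeat spanning samples 100..300
relative_position = normalize_within_heartbeat(150, 100, 300)
```

After the conversion both datasets share the same time unit, so their rows can be stacked. The normalized variant is independent of the sampling rate by construction, which is one reason to compare both approaches.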

### Setup and helper functions

In [2]:
import json

from pathlib import Path

import pandas as pd
import numpy as np

import biopsykit as bp
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer

# Classification
from sklearn.tree import DecisionTreeClassifier

# Regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

# Cross-Validation
from sklearn.model_selection import KFold

from biopsykit.classification.model_selection import SklearnPipelinePermuter

import matplotlib.pyplot as plt

%matplotlib widget
%load_ext autoreload
%autoreload 2

### Load data

In [3]:
data_path = Path("../../results/train_test_data")
data_path

WindowsPath('../../results/train_test_data')

In [4]:
models_path = Path("../../results/models")

In [10]:
train_data = pd.read_csv(data_path.joinpath("train_data.csv")).drop(columns=["Unnamed: 0"])
train_target = pd.read_csv(data_path.joinpath("train_target.csv")).drop(columns=["Unnamed: 0"])
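
The CSV files loaded here come from the participant-level split described in the notes at the top. The code that produced them is not part of this notebook, so the following is only a sketch of how such a split could look. The column name `participant`, the helper `split_by_participant`, and the toy dataframes are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_participant(df, group_col="participant", test_size=0.2, seed=0):
    """Split one dataset so that each participant lands entirely in train or in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]


# Toy stand-ins for the two datasets: 10 participants with 5 heartbeats each
toy_datasets = {}
for name in ("empkins", "guardian"):
    participants = np.repeat([f"{name}_{i}" for i in range(10)], 5)
    toy_datasets[name] = pd.DataFrame(
        {"participant": participants, "b_point_ms": np.arange(len(participants), dtype=float)}
    )

# Split each dataset separately (20% of participants for testing), then join the parts
train_parts, test_parts = zip(*(split_by_participant(df) for df in toy_datasets.values()))
train_df = pd.concat(train_parts, ignore_index=True)
test_df = pd.concat(test_parts, ignore_index=True)
```

Splitting each dataset separately before concatenating keeps the share of empkins and guardian data comparable in both partitions, as long as the datasets themselves are of similar size.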

In [4]:
X = train_data.to_numpy()
y = train_target.to_numpy()

### Impute the missing values by using the mean of the row
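- The length of a heartbeat can vary
    - Therefore averaging a column over all heartbeats (which is what SimpleImputer does column by column) is not a good approach
    - It is better to average over each row, i.e. over the outputs of the other algorithms for the same heartbeat
        --> Transpose the data before imputation and transpose it back afterwards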
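
A small toy example of the transpose trick, with made-up numbers, showing that imputing on `X.T` fills each gap with the mean of its own row:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two heartbeats (rows), three algorithm outputs (columns), one gap per row
X_demo = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan]])

# SimpleImputer works column-wise, so transposing makes it average within a heartbeat
imputer = SimpleImputer(strategy="mean")
X_row_imputed = imputer.fit_transform(X_demo.T).T

# Reference result computed directly with NumPy
row_means = np.nanmean(X_demo, axis=1, keepdims=True)
X_expected = np.where(np.isnan(X_demo), row_means, X_demo)
```

One caveat: with its default settings SimpleImputer drops columns that contain only NaN during `fit`, so a heartbeat for which every algorithm returned NaN would silently disappear from the transposed result and change its shape. Checking the shape after imputation, as done in the cells below, catches this.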

In [21]:
print(f"Does the target data contain np.nan values?\n {train_target.isna().any()}\n")
print(f"Does the train data contain np.nan values?\n {train_data.isna().any()}")

Does the target data contain np.nan values?
 b_point_samplereference    False
dtype: bool

Does the train data contain np.nan values?
 multiple-conditions_autoregression          False
multiple-conditions_linear-interpolation    False
multiple-conditions_none                     True
second-derivative_autoregression            False
second-derivative_linear-interpolation      False
second-derivative_none                       True
straight-line_autoregression                False
straight-line_linear-interpolation          False
straight-line_none                          False
third-derivative_autoregression             False
third-derivative_linear-interpolation       False
third-derivative_none                        True
dtype: bool


In [6]:
data_imputer = SimpleImputer(strategy="mean")

In [10]:
print(f"Shape of X: {X.shape}")

Shape of X: (9147, 12)


In [14]:
X_train_imputed_transposed = data_imputer.fit_transform(X.T)
print(f"Shape of X.T after data imputation: {X_train_imputed_transposed.shape}")

Shape of X.T after data imputation: (12, 9147)


In [15]:
X_train_imputed = X_train_imputed_transposed.T
print(f"Shape of X after data imputation: {X_train_imputed.shape}")

Shape of X after data imputation: (9147, 12)


In [41]:
X_train_imputed = X_train_imputed.astype(np.float64)
y = y.astype(np.float64)
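
The training log further down repeats `y = column_or_1d(y, warn=True)` for the SVR pipelines. This is the tail of scikit-learn's DataConversionWarning, raised when a column vector of shape `(n, 1)` is passed where a 1-D target is expected. `train_target.to_numpy()` on a single-column dataframe produces exactly that shape. A minimal sketch of the usual remedy, using toy values rather than the real target:

```python
import numpy as np

# Column-vector target, as produced by DataFrame.to_numpy() on one column
y_column = np.array([[1074.0], [1849.0], [2518.0]])

# Flatten to shape (n_samples,) before passing it to an estimator
y_flat = y_column.ravel()
```

Flattening would also change the nesting of the stored `true_labels` in the summary table, so the outputs shown in this notebook correspond to the unflattened target.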

### Specify Estimator Combinations and Parameters for Hyperparameter Search

In [63]:
model_dict = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    "clf": {
        "KNeighborsRegressor": KNeighborsRegressor(),
        "DecisionTreeRegressor": DecisionTreeRegressor(),
        "SVR": SVR(),
        #"AdaBoostRegressor": AdaBoostRegressor(),
    },
}

In [68]:
params_dict = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "KNeighborsRegressor": {"n_neighbors": [2, 4]},
    "DecisionTreeRegressor": {"max_depth": [2, 4]},
    "SVR": [
        {
            "kernel": ["linear"],
            "C": np.logspace(start=-2, stop=2, num=5)
        },
        {
            "kernel": ["rbf"],
            "C": np.logspace(start=-2, stop=2, num=5),
            "gamma": np.logspace(start=-2, stop=2, num=5)
        }
    ],
    #"AdaBoostRegressor": {
    #    "estimator": [DecisionTreeRegressor(max_depth=1)],
    #    "n_estimators": np.arange(20, 110, 10),
    #    "learning_rate": np.arange(0.6, 1.1, 0.1)
    #},
}

In [69]:
hyper_search_dict = {"DecisionTreeRegressor": {"search_method": "random", "n_iter": 2}}

### Setup PipelinePermuter and Cross-Validations for Model Evaluation

In [70]:
pipeline_permuter = SklearnPipelinePermuter(
    model_dict=model_dict, param_dict=params_dict, hyper_search_dict=hyper_search_dict
)
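
The notes at the top mention GroupKFold as a way to keep a participant out of train and test folds at the same time, while the next cell uses plain KFold. The following is a sketch of a grouped nested cross-validation written with scikit-learn only. The `groups` array, the toy data, and the single pipeline are assumptions for illustration. Whether `SklearnPipelinePermuter.fit` forwards group labels is not shown in these notes and would need to be checked in the biopsykit documentation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Toy data: 10 participants with 20 heartbeats each (values are made up)
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(10), 20)
X_toy = rng.normal(size=(len(groups), 3))
y_toy = X_toy.sum(axis=1) + rng.normal(scale=0.1, size=len(groups))

pipe = Pipeline([("scaler", StandardScaler()), ("clf", DecisionTreeRegressor(random_state=0))])
param_grid = {"clf__max_depth": [2, 4]}

outer_cv = GroupKFold(n_splits=5)
inner_cv = GroupKFold(n_splits=3)

outer_scores = []
participant_overlaps = []
for train_idx, test_idx in outer_cv.split(X_toy, y_toy, groups=groups):
    # Inner loop: hyperparameter search that also respects participant boundaries
    search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="r2")
    search.fit(X_toy[train_idx], y_toy[train_idx], groups=groups[train_idx])
    # Outer loop: score the refitted best model on unseen participants
    outer_scores.append(search.score(X_toy[test_idx], y_toy[test_idx]))
    participant_overlaps.append(set(groups[train_idx]) & set(groups[test_idx]))
```

With plain KFold, heartbeats of one participant can end up on both sides of a split, which tends to make the scores look better than they would be on new participants.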

In [71]:
outer_cv = KFold(5)
inner_cv = KFold(5)

pipeline_permuter.fit(X=X_train_imputed, y=y, outer_cv=outer_cv, inner_cv=inner_cv, scoring="r2")

Pipeline Combinations:   0%|          | 0/6 [00:00<?, ?it/s]

### Running hyperparameter search for pipeline: (('scaler', 'StandardScaler'), ('clf', 'KNeighborsRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'clf__n_neighbors': [2, 4]}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits

### Running hyperparameter search for pipeline: (('scaler', 'StandardScaler'), ('clf', 'DecisionTreeRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'random', 'n_iter': 2}): {'clf__max_depth': [2, 4]}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits

### Running hyperparameter search for pipeline: (('scaler', 'StandardScaler'), ('clf', 'SVR')) with 2 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'clf__kernel': ['linear'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)

Parameter grid #1 ({'search_method': 'grid'}): {'clf__kernel': ['rbf'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), 'clf__gamma': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)

### Running hyperparameter search for pipeline: (('scaler', 'MinMaxScaler'), ('clf', 'KNeighborsRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'clf__n_neighbors': [2, 4]}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits

### Running hyperparameter search for pipeline: (('scaler', 'MinMaxScaler'), ('clf', 'DecisionTreeRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'random', 'n_iter': 2}): {'clf__max_depth': [2, 4]}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits

### Running hyperparameter search for pipeline: (('scaler', 'MinMaxScaler'), ('clf', 'SVR')) with 2 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'clf__kernel': ['linear'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
  y = column_or_1d(y, warn=True)

Parameter grid #1 ({'search_method': 'grid'}): {'clf__kernel': ['rbf'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), 'clf__gamma': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}

Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
  y = column_or_1d(y, warn=True)

### Display the results of the pipeline permuter

To print the results, I had to exclude the confusion matrix in the biopsykit function. Make sure to include it again afterwards.

In [72]:
pipeline_permuter.metric_summary()

| pipeline_scaler | pipeline_clf | mean_test_r2 | std_test_r2 | test_r2_fold_0 | test_r2_fold_1 | test_r2_fold_2 | test_r2_fold_3 | test_r2_fold_4 |
|---|---|---|---|---|---|---|---|---|
| StandardScaler | KNeighborsRegressor | 0.999994 | 3e-06 | 0.999989 | 0.999992 | 0.999997 | 0.999996 | 0.999996 |
| StandardScaler | DecisionTreeRegressor | 0.996157 | 0.0001 | 0.996268 | 0.996288 | 0.996073 | 0.9961 | 0.996054 |
| StandardScaler | SVR | 0.99981 | 2e-05 | 0.999828 | 0.999831 | 0.999818 | 0.999788 | 0.999785 |
| MinMaxScaler | KNeighborsRegressor | 0.999994 | 3e-06 | 0.999989 | 0.999992 | 0.999997 | 0.999997 | 0.999996 |
| MinMaxScaler | DecisionTreeRegressor | 0.996156 | 0.0001 | 0.996268 | 0.996287 | 0.996073 | 0.9961 | 0.996054 |
| MinMaxScaler | SVR | 0.999792 | 6e-06 | 0.999799 | 0.999796 | 0.999794 | 0.999786 | 0.999784 |

The list-valued columns (`true_labels`, `true_labels_folds`, `predicted_labels`, `predicted_labels_folds`, `train_indices`, `train_indices_folds`, `test_indices`, `test_indices_folds`) are truncated in the notebook output and only summarized here:

- `true_labels` (all pipelines): `[[1074.0], [1849.0], [2518.0], [3252.0], [3933...`
- `predicted_labels`, KNeighborsRegressor (both scalers): `[[1114.25], [1855.5], [2533.75], [3255.5], [39...`
- `predicted_labels`, DecisionTreeRegressor (both scalers): `[1122.7484143763213, 1122.7484143763213, 2855....`
- `predicted_labels`, SVR with StandardScaler: `[1346.474763842738, 1897.7439746244672, 2502.1...`
- `predicted_labels`, SVR with MinMaxScaler: `[1377.9323030828727, 2063.1115307359905, 2706....`
- `train_indices` (all pipelines): `[1830, 1831, 1832, 1833, 1834, 1835, 1836, 183...`
- `test_indices` (all pipelines): `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...`
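
An R² this close to 1 does not by itself mean that the individual predictions are close. R² compares the squared error to the variance of the target, and here the target values span thousands of samples. As a rough check, the sketch below takes the first four true labels from this run (1074.0, 1849.0, 2518.0, 3252.0) and the corresponding KNeighborsRegressor predictions (1114.25, 1855.5, 2533.75, 3255.5) and computes the mean absolute error next to R². Four values are far too few for a real evaluation; this only illustrates why an absolute error metric is worth reporting as well.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# First four entries of true_labels and of the KNeighborsRegressor predictions
y_true = np.array([1074.0, 1849.0, 2518.0, 3252.0])
y_pred = np.array([1114.25, 1855.5, 2533.75, 3255.5])

mae = mean_absolute_error(y_true, y_pred)  # average absolute deviation, in samples
r2 = r2_score(y_true, y_pred)              # stays above 0.99 despite a 40-sample miss
```

Passing an additional scorer such as `"neg_mean_absolute_error"` to the cross-validation would be one way to obtain this for all folds. Whether the pipeline permuter accepts several scorers at once is an assumption that would need to be checked.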


### Save the results of the pipeline permuter to a pickle file

In [52]:
pipeline_permuter.to_pickle(models_path.joinpath("DT_SVR_AdaBoost.pkl"))