## Regression Analysis on the outputs of the B-Point detection algorithms

- The sampling rate of the empkins dataset (1000 Hz) does not match the sampling rate of the guardian dataset (500 Hz). How should this be handled?
    - Convert sample indices to milliseconds. Do it when building the train and test data.
- Normalize B-point samples according to the start and end point of each heartbeat as part of data preprocessing?
    - Try both approaches (this kind of normalization does not make sense)
- Start without feature selection, since the models should use all outputs generated by the algorithms
- How to impute NaN values? If normalized between 0 and 1, just use the mean?
    - Drop them first
    - Background: many algorithms don't handle NaN values
    - Check how many entries contain NaN
    - Use a SimpleImputer (e.g. with the mean) or a KNNImputer first to test the pipeline properly (https://scikit-learn.org/1.5/modules/impute.html)
- GroupKFold should be used for cross-validation to ensure that no participant is present in both the train and the test data
- Splitting of the data:
    - Use biopsykit and apply GroupKFold. The MultiIndex can remain in the dataframe; it is important that there is one heartbeat per row
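
The unit conversion from the first note can be sketched as follows. This is only a minimal illustration: the helper name `samples_to_ms` and the example values are hypothetical and are not part of the actual data-building code.

```python
import pandas as pd


def samples_to_ms(samples: pd.Series, sampling_rate_hz: float) -> pd.Series:
    """Convert sample indices to milliseconds for the given sampling rate."""
    return samples / sampling_rate_hz * 1000.0


# Hypothetical B-point positions in samples (not taken from the real datasets)
empkins_samples = pd.Series([250, 500, 1000])   # recorded at 1000 Hz
guardian_samples = pd.Series([125, 250, 500])   # recorded at 500 Hz

empkins_ms = samples_to_ms(empkins_samples, sampling_rate_hz=1000)
guardian_ms = samples_to_ms(guardian_samples, sampling_rate_hz=500)

# After conversion both datasets share the same unit and can be concatenated
print(empkins_ms.tolist(), guardian_ms.tolist())
```

Converting before the datasets are merged keeps features and targets comparable across participants, regardless of the recording device.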

### Setup and helper functions

In [None]:
import json

from pathlib import Path

import pandas as pd
import numpy as np

import biopsykit as bp
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer


# Feature Selection
from sklearn.feature_selection import SelectKBest, RFE

# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

# Cross-Validation
from sklearn.model_selection import GroupKFold

from biopsykit.classification.model_selection import SklearnPipelinePermuter

import matplotlib.pyplot as plt

%matplotlib widget
%load_ext autoreload
%autoreload 2

#### Specify whether the results should be saved or not

In [None]:
save_results = False

### Load data

In [3]:
data_path = Path("../../results/train_test_data")
data_path

WindowsPath('../../results/train_test_data')

In [4]:
models_path = Path("../../results/models")

In [76]:
input_data = pd.read_csv(data_path.joinpath("combined_data.csv"))

In [77]:
X, y, groups = bp.classification.utils.prepare_df_sklearn(data=input_data, label_col="b_point_samplereference", subject_col="participant", print_summary=True)

KeyError: 'Requested level (b_point_samplereference) does not match index name (None)'

In [5]:
train_data = pd.read_csv(data_path.joinpath("train_data_all_algos.csv")).drop(columns=["Unnamed: 0"])
target_data = pd.read_csv(data_path.joinpath("target_data_all_algos.csv")).drop(columns=["Unnamed: 0"])
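
The NaN handling from the notes (count, drop, then impute) could look roughly like the sketch below. The dataframe `features` and its column names are hypothetical stand-ins for the algorithm outputs; only `SimpleImputer` itself comes from the notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the algorithm outputs (one heartbeat per row)
features = pd.DataFrame(
    {
        "algo_a": [100.0, np.nan, 104.0, 102.0],
        "algo_b": [98.0, 101.0, np.nan, 99.0],
    }
)

# Step 1: check how many entries contain NaN
nan_per_column = features.isna().sum()
rows_with_nan = int(features.isna().any(axis=1).sum())
print(nan_per_column, rows_with_nan)

# Step 2 (first approach): drop every heartbeat that has a missing value
dropped = features.dropna()

# Step 3 (alternative): replace missing values with the column mean
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
```

If imputation is kept, it is probably safer to place the imputer inside the pipeline so that it is fitted on the training folds only.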

### Specify Estimator Combinations and Parameters for Hyperparameter Search

In [65]:
model_dict = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    #"reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "reduce_dim": {"SelectKBest": SelectKBest()},
    "clf": {
        #"KNeighborsRegressor": KNeighborsRegressor(),
        "RandomForestRegressor": RandomForestRegressor(),
        #"HistGradientBoostingRegressor": HistGradientBoostingRegressor(),
    },
}

In [66]:
params_dict = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, 6, 8, 10, "all"]},
    #"KNeighborsRegressor": {
    #    "n_neighbors": [8,9,10,11,12,13,14],
    #    "weights": ["uniform", "distance"],
    #    },
    "RandomForestRegressor": {
        #"n_estimators": [80, 100, 120],
        "criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson"],
        #"min_samples_split": [1, 2, 3, 4, 5],
        #"min_samples_leaf": [1, 2, 3, 4, 5],
        #"max_features": ["sqrt", "log2", None],
    },
    #"HistGradientBoostingRegressor": None,
}

In [67]:
#hyper_search_dict = {"DecisionTreeRegressor": {"search_method": "random", "n_iter":2}}

### Setup PipelinePermuter and Cross-Validations for Model Evaluation

In [68]:
pipeline_permuter = SklearnPipelinePermuter(
    model_dict=model_dict, param_dict=params_dict
)
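
As a quick sanity check on the size of the search, the number of candidates can be counted with scikit-learn's `ParameterGrid`. The sketch assumes the grid that the permuter reports in its log (prefixed parameter names `clf__criterion` and `reduce_dim__k`) and five inner folds.

```python
from sklearn.model_selection import ParameterGrid

# Grid as reported in the permuter log for the RandomForestRegressor pipelines
grid = ParameterGrid(
    {
        "clf__criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson"],
        "reduce_dim__k": [2, 4, 6, 8, 10, "all"],
    }
)

n_candidates = len(grid)   # 4 criteria x 6 values of k
n_inner_folds = 5          # assumed to match GroupKFold(n_splits=5) for the inner CV
n_fits = n_candidates * n_inner_folds

print(n_candidates, n_fits)
```

These counts correspond to the "24 candidates, totalling 120 fits" reported per outer fold in the log.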

In [69]:
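# Features: keep only the algorithm outputs by dropping the metadata columns
X = train_data.drop(columns=["participant", "condition", "phase", "heartbeat_idreference"])
# Target: what remains after dropping the metadata columns (still a DataFrame, not a 1-D array)
y = target_data.drop(columns=["participant", "condition", "phase", "heartbeat_idreference"])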
groups = train_data["participant"]
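
To illustrate why `GroupKFold` is used with the participant as group, the following sketch checks that no group ends up in both the train and the test split. The arrays `X_demo`, `y_demo` and `groups_demo` are hypothetical toy data, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 10 heartbeats from 5 hypothetical participants (2 heartbeats each)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
groups_demo = np.array(["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4", "p5", "p5"])

cv = GroupKFold(n_splits=5)

overlaps = []
for train_idx, test_idx in cv.split(X_demo, y_demo, groups=groups_demo):
    train_groups = set(groups_demo[train_idx])
    test_groups = set(groups_demo[test_idx])
    # Participants that appear on both sides of the split (should be none)
    overlaps.append(train_groups & test_groups)

print(overlaps)
```

The same check can be run on the real `X`, `y` and `groups` before starting the (much slower) hyperparameter search.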

In [70]:
outer_cv = GroupKFold(n_splits=5)
inner_cv = GroupKFold(n_splits=5)

pipeline_permuter.fit(X=X.values, y=y.values, outer_cv=outer_cv, inner_cv=inner_cv, scoring="neg_mean_absolute_error", groups=groups)

Pipeline Combinations:   0%|          | 0/2 [00:00<?, ?it/s]

### Running hyperparameter search for pipeline: (('scaler', 'StandardScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'RandomForestRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'clf__criterion': ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'], 'reduce_dim__k': [2, 4, 6, 8, 10, 'all']}


Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 24 candidates, totalling 120 fits


  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  return fit_method(estimator, *args, **kwargs)




### Running hyperparameter search for pipeline: (('scaler', 'MinMaxScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'RandomForestRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'clf__criterion': ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'], 'reduce_dim__k': [2, 4, 6, 8, 10, 'all']}




Outer CV:   0%|          | 0/5 [00:00<?, ?it/s]

Fitting 5 folds for each of 24 candidates, totalling 120 fits


  y = column_or_1d(y, warn=True)
  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  y = column_or_1d(y, warn=True)
  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  y = column_or_1d(y, warn=True)
  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  y = column_or_1d(y, warn=True)
  return fit_method(estimator, *args, **kwargs)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


  y = column_or_1d(y, warn=True)
  return fit_method(estimator, *args, **kwargs)






### Display the results of the pipeline permuter

To print the results, I had to exclude the confusion matrix in the BioPsyKit function.  
Make sure to include it again afterwards.

In [73]:
pipeline_permuter.metric_summary()

| pipeline_scaler | pipeline_reduce_dim | pipeline_clf | true_labels | true_labels_folds | predicted_labels | predicted_labels_folds | train_indices | train_indices_folds | test_indices | test_indices_folds | mean_test_neg_mean_absolute_error | std_test_neg_mean_absolute_error | test_neg_mean_absolute_error_fold_0 | test_neg_mean_absolute_error_fold_1 | test_neg_mean_absolute_error_fold_2 | test_neg_mean_absolute_error_fold_3 | test_neg_mean_absolute_error_fold_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StandardScaler | SelectKBest | RandomForestRegressor | [[622.0], [1231.0], [3342.0], [4596.0], [5233.... | [[[622.0], [1231.0], [3342.0], [4596.0], [5233... | [624.22, 1273.84, 3340.91, 4641.41, 5282.65, 5... | [[624.22, 1273.84, 3340.91, 4641.41, 5282.65, ... | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13... | [1397, 1398, 1399, 1400, 1401, 1402, 1403, 140... | [[1397, 1398, 1399, 1400, 1401, 1402, 1403, 14... | 13.735736 | 0.661483 | 14.275545 | 13.314558 | 14.008832 | 12.654518 | 14.425228 |
| MinMaxScaler | SelectKBest | RandomForestRegressor | [[622.0], [1231.0], [3342.0], [4596.0], [5233.... | [[[622.0], [1231.0], [3342.0], [4596.0], [5233... | [624.31, 1273.7, 3340.16, 4641.41, 5282.65, 58... | [[624.31, 1273.7, 3340.16, 4641.41, 5282.65, 5... | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13... | [1397, 1398, 1399, 1400, 1401, 1402, 1403, 140... | [[1397, 1398, 1399, 1400, 1401, 1402, 1403, 14... | 13.723299 | 0.686075 | 14.286753 | 13.276116 | 14.024607 | 12.603995 | 14.425024 |
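
The summary columns can be cross-checked against the per-fold scores. The sketch below uses the five fold values reported for the StandardScaler pipeline and assumes that the reported standard deviation is the population standard deviation (`ddof=0`), which is what the numbers appear to be consistent with.

```python
import numpy as np

# Per-fold mean absolute errors reported for the StandardScaler pipeline
fold_mae = np.array([14.275545, 13.314558, 14.008832, 12.654518, 14.425228])

mean_mae = fold_mae.mean()
std_mae = fold_mae.std()  # NumPy default ddof=0, i.e. population standard deviation

print(round(mean_mae, 6), round(std_mae, 6))
```

Note that these outer-fold values are listed as positive numbers although the scorer is `neg_mean_absolute_error`, whereas the inner-CV results of `best_hyperparameter_pipeline()` are negative, so only the magnitudes should be compared.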


In [74]:
pipeline_permuter.best_hyperparameter_pipeline()

  .agg(["mean", "std"])


| outer_fold | mean_test_neg_mean_absolute_error | param_clf__criterion | param_reduce_dim__k | params | rank_test_neg_mean_absolute_error | split0_test_neg_mean_absolute_error | split1_test_neg_mean_absolute_error | split2_test_neg_mean_absolute_error | split3_test_neg_mean_absolute_error | split4_test_neg_mean_absolute_error | std_test_neg_mean_absolute_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -13.697996 | absolute_error | 4 | {'clf__criterion': 'absolute_error', 'reduce_d... | 2 | -13.787161 | -14.243101 | -13.618856 | -13.241658 | -13.599204 | 0.325379 |
| 1 | -13.957128 | absolute_error | 4 | {'clf__criterion': 'absolute_error', 'reduce_d... | 2 | -13.457105 | -13.638498 | -13.554444 | -14.539383 | -14.596208 | 0.502224 |
| 2 | -13.730262 | absolute_error | 4 | {'clf__criterion': 'absolute_error', 'reduce_d... | 1 | -13.451196 | -14.598374 | -13.303103 | -13.980553 | -13.318086 | 0.499298 |
| 3 | -14.304568 | absolute_error | 4 | {'clf__criterion': 'absolute_error', 'reduce_d... | 1 | -14.152876 | -14.361676 | -14.103827 | -14.158255 | -14.746205 | 0.237916 |
| 4 | -13.93038 | absolute_error | 4 | {'clf__criterion': 'absolute_error', 'reduce_d... | 1 | -13.213718 | -12.648656 | -14.886735 | -15.116419 | -13.786372 | 0.948521 |


### Save the results of the pipeline permuter to a pickle file

In [75]:
if save_results:
    pipeline_permuter.to_pickle(models_path.joinpath("Scaler_Feature_Elimination_Random_Forest.pkl"))