## Regression Analysis on the outputs of the B-Point detection algorithms

- The sampling rate of the empkins dataset (1000 Hz) does not match the sampling rate of the guardian dataset (500 Hz). How should this be handled?
    - Convert sample indices to milliseconds when building the train and test data.
- Normalize B-point samples according to the start and end point of each heartbeat as part of data preprocessing?
    - Try both approaches (this kind of normalization does not make sense).
- Start without feature selection, since the models should use all outputs generated by the algorithms.
- How to impute NaN values? If normalized between 0 and 1, just use the mean?
    - Drop them first.
    - Background: many algorithms don't handle NaN values.
    - Check how many entries contain NaN.
    - Use a SimpleImputer (e.g. with the mean) or a KNNImputer first to test the pipeline properly (https://scikit-learn.org/1.5/modules/impute.html).
- GroupKFold should be used for cross-validation to ensure that a participant is not present in both the train and the test data.
- Splitting of the data:
    - Use BioPsyKit and apply GroupKFold. The MultiIndex can remain in the dataframe; it is important that there is one heartbeat per row.
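The sampling-rate mismatch noted in the list above can be handled by converting sample indices to milliseconds before the two datasets are combined. A minimal sketch, assuming B-point positions are stored as sample indices; the function name `samples_to_ms` and the example values are hypothetical and not taken from the actual datasets:

```python
import numpy as np


def samples_to_ms(samples, sampling_rate_hz):
    """Convert sample indices to milliseconds for a given sampling rate (hypothetical helper)."""
    return np.asarray(samples, dtype=float) * 1000.0 / sampling_rate_hz


# Hypothetical B-point positions (in samples) recorded at the two sampling rates
b_points_1000hz = samples_to_ms([1532, 2212], sampling_rate_hz=1000)
b_points_500hz = samples_to_ms([766, 1106], sampling_rate_hz=500)

# After conversion, both arrays are on the same time axis (milliseconds)
print(b_points_1000hz)
print(b_points_500hz)
```

Doing the conversion while building the train and test data keeps all downstream features and targets in one unit, so errors can be read directly in milliseconds.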

### Setup and helper functions

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
from biopsykit.classification.model_selection import SklearnPipelinePermuter

# Feature selection
from sklearn.feature_selection import SelectKBest

# Cross-validation
from sklearn.model_selection import GroupKFold

# Preprocessing (feature scaling)
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Regression models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

%matplotlib widget
%load_ext autoreload
%autoreload 2

### Load data

In [2]:
data_path = Path("../../results/train_test_data")
data_path

WindowsPath('../../results/train_test_data')

In [3]:
models_path = Path("../../results/models")

In [4]:
train_data = pd.read_csv(data_path.joinpath("train_data.csv")).drop(columns=["Unnamed: 0"])
target_data = pd.read_csv(data_path.joinpath("target_data.csv")).drop(columns=["Unnamed: 0"])
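As planned in the notes at the top, it is worth checking how many rows contain NaN before fitting. A minimal sketch on a small hypothetical frame; the column names `algo_a` and `algo_b` are placeholders and not the actual columns of `train_data.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the algorithm outputs (one heartbeat per row)
demo = pd.DataFrame(
    {"algo_a": [100.0, np.nan, 104.0, 98.0], "algo_b": [101.0, 99.0, np.nan, 97.0]}
)

# Number of rows with at least one NaN, and NaN count per column
n_rows_with_nan = int(demo.isna().any(axis=1).sum())
nan_per_column = demo.isna().sum()

# Option 1: drop incomplete rows
demo_dropped = demo.dropna()

# Option 2: replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
demo_imputed = pd.DataFrame(
    imputer.fit_transform(demo), columns=demo.columns, index=demo.index
)

print(n_rows_with_nan)
print(nan_per_column)
```

When imputation is used together with cross-validation, placing the imputer inside the pipeline means it is fitted on the training folds only.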

### Specify Estimator Combinations and Parameters for Hyperparameter Search

In [5]:
model_dict = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    # "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "reduce_dim": {"SelectKBest": SelectKBest()},
    "clf": {
        "KNeighborsRegressor": KNeighborsRegressor(),
        "DecisionTreeRegressor": DecisionTreeRegressor(),
        "SVR": SVR(),
    },
}
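One detail to keep in mind: `SelectKBest()` defaults to `score_func=f_classif`, which is designed for classification targets. For a continuous target such as the B-point position, `f_regression` may be the more appropriate scoring function. A minimal sketch on synthetic data (the data and `k=2` are illustrative only):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Synthetic target that depends only on features 0 and 3, plus a little noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Score features with a univariate regression F-test instead of the default f_classif
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)

selected = selector.get_support(indices=True)
print(selected)
```

In `model_dict` this would correspond to `SelectKBest(score_func=f_regression)`. The results in this notebook were produced with the default scoring function.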

In [6]:
params_dict = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    # "RFE": {"n_features_to_select": [2, 4, None]},
    "KNeighborsRegressor": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
    "DecisionTreeRegressor": {
        "max_depth": [2, 4, 6],
        "min_samples_split": [2, 6, 10, 14],
        "min_samples_leaf": [2, 6, 10, 14],
        "max_leaf_nodes": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    },
    "SVR": [
        {"kernel": ["linear"], "C": np.logspace(start=-2, stop=2, num=5)},
        {"kernel": ["rbf"], "C": np.logspace(start=-2, stop=2, num=5), "gamma": np.logspace(start=-2, stop=2, num=5)},
    ],
}
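The size of each grid determines how many candidates the search evaluates per outer fold. One way to count them is `sklearn.model_selection.ParameterGrid`. The sketch below assumes the step-name prefixes (`reduce_dim__`, `clf__`) that the permuter reports in its log output:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

k_values = [2, 4, "all"]
c_values = np.logspace(start=-2, stop=2, num=5).tolist()
gamma_values = np.logspace(start=-2, stop=2, num=5).tolist()

knn_grid = ParameterGrid(
    {"reduce_dim__k": k_values, "clf__n_neighbors": [2, 4], "clf__weights": ["uniform", "distance"]}
)
svr_linear_grid = ParameterGrid(
    {"reduce_dim__k": k_values, "clf__kernel": ["linear"], "clf__C": c_values}
)
svr_rbf_grid = ParameterGrid(
    {"reduce_dim__k": k_values, "clf__kernel": ["rbf"], "clf__C": c_values, "clf__gamma": gamma_values}
)
tree_grid = ParameterGrid(
    {
        "reduce_dim__k": k_values,
        "clf__max_depth": [2, 4, 6],
        "clf__min_samples_split": [2, 6, 10, 14],
        "clf__min_samples_leaf": [2, 6, 10, 14],
        "clf__max_leaf_nodes": list(range(10, 101, 10)),
    }
)

# Number of parameter combinations per grid
print(len(knn_grid), len(svr_linear_grid), len(svr_rbf_grid), len(tree_grid))
```

The decision tree grid is by far the largest, which is a reason to evaluate it with a randomized search instead of an exhaustive grid search.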

In [7]:
hyper_search_dict = {"DecisionTreeRegressor": {"search_method": "random", "n_iter": 2}}

### Setup PipelinePermuter and Cross-Validations for Model Evaluation

In [8]:
pipeline_permuter = SklearnPipelinePermuter(
    model_dict=model_dict, param_dict=params_dict, hyper_search_dict=hyper_search_dict
)

In [9]:
X = train_data.drop(columns=["participant", "condition", "phase", "heartbeat_idreference"])
y = target_data.drop(columns=["participant", "condition", "phase", "heartbeat_idreference"])
groups = train_data["participant"]
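Using the participant column as `groups` is what keeps all heartbeats of one participant on the same side of each split. A minimal sketch that checks this property on hypothetical participant labels (the labels and fold count are illustrative, not the real data):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical participant labels: 10 participants with 6 heartbeats each
groups_demo = np.repeat([f"P{i:02d}" for i in range(10)], 6)
X_demo = np.arange(len(groups_demo), dtype=float).reshape(-1, 1)

cv = GroupKFold(n_splits=5)
overlaps = []
for train_idx, test_idx in cv.split(X_demo, groups=groups_demo):
    # Participants that appear in both the train and the test part of this split
    shared = set(groups_demo[train_idx]) & set(groups_demo[test_idx])
    overlaps.append(len(shared))

print(overlaps)
```

An overlap count of zero in every split means no participant contributes data to both the training and the test set of that split.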

In [10]:
outer_cv = GroupKFold(n_splits=5)
inner_cv = GroupKFold(n_splits=5)

pipeline_permuter.fit(X=X.values, y=y.values, outer_cv=outer_cv, inner_cv=inner_cv, scoring="r2", groups=groups)


### Running hyperparameter search for pipeline: (('scaler', 'StandardScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'KNeighborsRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'reduce_dim__k': [2, 4, 'all'], 'clf__n_neighbors': [2, 4], 'clf__weights': ['uniform', 'distance']}


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits


### Running hyperparameter search for pipeline: (('scaler', 'StandardScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'DecisionTreeRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'random', 'n_iter': 2}): {'reduce_dim__k': [2, 4, 'all'], 'clf__max_depth': [2, 4, 6], 'clf__min_samples_split': [2, 6, 10, 14], 'clf__min_samples_leaf': [2, 6, 10, 14], 'clf__max_leaf_nodes': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}




  y = column_or_1d(y, warn=True)
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits


### Running hyperparameter search for pipeline: (('scaler', 'StandardScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'SVR')) with 2 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'reduce_dim__k': [2, 4, 'all'], 'clf__kernel': ['linear'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}




Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits



Parameter grid #1 ({'search_method': 'grid'}): {'reduce_dim__k': [2, 4, 'all'], 'clf__kernel': ['rbf'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), 'clf__gamma': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}




Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits




### Running hyperparameter search for pipeline: (('scaler', 'MinMaxScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'KNeighborsRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'reduce_dim__k': [2, 4, 'all'], 'clf__n_neighbors': [2, 4], 'clf__weights': ['uniform', 'distance']}




Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits




### Running hyperparameter search for pipeline: (('scaler', 'MinMaxScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'DecisionTreeRegressor')) with 1 parameter grid(s):
Parameter grid #0 ({'search_method': 'random', 'n_iter': 2}): {'reduce_dim__k': [2, 4, 'all'], 'clf__max_depth': [2, 4, 6], 'clf__min_samples_split': [2, 6, 10, 14], 'clf__min_samples_leaf': [2, 6, 10, 14], 'clf__max_leaf_nodes': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}




Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits


### Running hyperparameter search for pipeline: (('scaler', 'MinMaxScaler'), ('reduce_dim', 'SelectKBest'), ('clf', 'SVR')) with 2 parameter grid(s):
Parameter grid #0 ({'search_method': 'grid'}): {'reduce_dim__k': [2, 4, 'all'], 'clf__kernel': ['linear'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}




Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Fitting 5 folds for each of 15 candidates, totalling 75 fits



Parameter grid #1 ({'search_method': 'grid'}): {'reduce_dim__k': [2, 4, 'all'], 'clf__kernel': ['rbf'], 'clf__C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), 'clf__gamma': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}




Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits
Fitting 5 folds for each of 75 candidates, totalling 375 fits
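The `y = column_or_1d(y, warn=True)` lines in the log are the tail of scikit-learn's `DataConversionWarning`, which is raised when a column vector of shape `(n_samples, 1)` is passed where a 1-D target is expected. A minimal sketch of one way to avoid it, assuming the target frame has a single column (the column name `b_point_sample` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical single-column target frame
y_frame = pd.DataFrame({"b_point_sample": [1532.0, 2212.0, 2937.0]})

y_2d = y_frame.values          # shape (3, 1): column vector, triggers the warning in some estimators
y_1d = y_frame.values.ravel()  # shape (3,): flat vector, the usual form for a single regression target

print(y_2d.shape, y_1d.shape)
```

Passing the flattened array to `fit` would likely also change the shape of the labels stored in the summary tables, so the results should be regenerated after such a change.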






### Display the results of the pipeline permuter

To print the results, I had to exclude the confusion matrix in the BioPsyKit function.  
Make sure to include it again afterwards.

In [11]:
pipeline_permuter.metric_summary()

| pipeline_scaler | pipeline_reduce_dim | pipeline_clf | true_labels | true_labels_folds | predicted_labels | predicted_labels_folds | train_indices | train_indices_folds | test_indices | test_indices_folds | mean_test_r2 | std_test_r2 | test_r2_fold_0 | test_r2_fold_1 | test_r2_fold_2 | test_r2_fold_3 | test_r2_fold_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StandardScaler | SelectKBest | KNeighborsRegressor | `[[1532.0], [2212.0], [2937.0], [3644.0], [4405...` | `[[[1532.0], [2212.0], [2937.0], [3644.0], [440...` | `[[1529.8376867659717], [2204.7238099442], [293...` | `[[[1529.8376867659717], [2204.7238099442], [29...` | `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...` | `[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...` | `[1928, 1929, 1930, 1931, 1932, 1933, 1934, 193...` | `[[1928, 1929, 1930, 1931, 1932, 1933, 1934, 19...` | 0.999998 | 4.883419e-07 | 0.999998 | 0.999998 | 0.999999 | 0.999997 | 0.999998 |
| StandardScaler | SelectKBest | DecisionTreeRegressor | `[[1532.0], [2212.0], [2937.0], [3644.0], [4405...` | `[[[1532.0], [2212.0], [2937.0], [3644.0], [440...` | `[1592.013698630137, 2449.83, 3289.685446009389...` | `[[1592.013698630137, 2449.83, 3289.68544600938...` | `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...` | `[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...` | `[1928, 1929, 1930, 1931, 1932, 1933, 1934, 193...` | `[[1928, 1929, 1930, 1931, 1932, 1933, 1934, 19...` | 0.993984 | 0.005643079 | 0.999726 | 0.987121 | 0.987424 | 0.999644 | 0.996005 |
| StandardScaler | SelectKBest | SVR | `[[1532.0], [2212.0], [2937.0], [3644.0], [4405...` | `[[[1532.0], [2212.0], [2937.0], [3644.0], [440...` | `[2253.402911161542, 2794.623628313333, 3385.79...` | `[[2253.402911161542, 2794.623628313333, 3385.7...` | `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...` | `[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...` | `[1928, 1929, 1930, 1931, 1932, 1933, 1934, 193...` | `[[1928, 1929, 1930, 1931, 1932, 1933, 1934, 19...` | 0.998931 | 0.0003157163 | 0.999116 | 0.999001 | 0.999006 | 0.999213 | 0.998319 |
| MinMaxScaler | SelectKBest | KNeighborsRegressor | `[[1532.0], [2212.0], [2937.0], [3644.0], [4405...` | `[[[1532.0], [2212.0], [2937.0], [3644.0], [440...` | `[[1529.8380884154735], [2204.7238322334238], [...` | `[[[1529.8380884154735], [2204.7238322334238], ...` | `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...` | `[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...` | `[1928, 1929, 1930, 1931, 1932, 1933, 1934, 193...` | `[[1928, 1929, 1930, 1931, 1932, 1933, 1934, 19...` | 0.999998 | 4.886307e-07 | 0.999998 | 0.999998 | 0.999999 | 0.999997 | 0.999998 |
| MinMaxScaler | SelectKBest | DecisionTreeRegressor | `[[1532.0], [2212.0], [2937.0], [3644.0], [4405...` | `[[[1532.0], [2212.0], [2937.0], [3644.0], [440...` | `[2081.975, 2081.975, 2081.975, 2081.975, 5341....` | `[[2081.975, 2081.975, 2081.975, 2081.975, 5341...` | `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...` | `[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...` | `[1928, 1929, 1930, 1931, 1932, 1933, 1934, 193...` | `[[1928, 1929, 1930, 1931, 1932, 1933, 1934, 19...` | 0.985042 | 0.02554465 | 0.995761 | 0.999729 | 0.934066 | 0.999645 | 0.996008 |
| MinMaxScaler | SelectKBest | SVR | `[[1532.0], [2212.0], [2937.0], [3644.0], [4405...` | `[[[1532.0], [2212.0], [2937.0], [3644.0], [440...` | `[2254.517914410306, 2641.535492595587, 3135.24...` | `[[2254.517914410306, 2641.535492595587, 3135.2...` | `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...` | `[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...` | `[1928, 1929, 1930, 1931, 1932, 1933, 1934, 193...` | `[[1928, 1929, 1930, 1931, 1932, 1933, 1934, 19...` | 0.998863 | 0.000398646 | 0.999062 | 0.998963 | 0.998968 | 0.999232 | 0.99809 |
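The `mean_test_r2` and `std_test_r2` columns summarize the five outer-fold scores. As a sanity check, the values for the `StandardScaler` / `DecisionTreeRegressor` row can be reproduced from its fold scores. The sketch assumes the population standard deviation (`ddof=0`), which agrees with the reported number up to rounding of the displayed fold scores:

```python
import numpy as np

# Outer-fold R² scores of the StandardScaler / SelectKBest / DecisionTreeRegressor row
fold_scores = np.array([0.999726, 0.987121, 0.987424, 0.999644, 0.996005])

mean_r2 = fold_scores.mean()
std_r2 = fold_scores.std(ddof=0)  # population standard deviation across the five folds

print(round(mean_r2, 6), round(std_r2, 6))
```

Because the spread is computed over only five folds, a single weaker fold (such as `test_r2_fold_2` of the `MinMaxScaler` / `DecisionTreeRegressor` row) has a visible effect on both summary columns.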


In [12]:
pipeline_permuter.best_estimator_summary()

| pipeline_scaler | pipeline_reduce_dim | pipeline_clf | best_estimator |
|---|---|---|---|
| StandardScaler | SelectKBest | KNeighborsRegressor | `[Pipeline(memory=Memory(location=cachedir\jobl...` |
| StandardScaler | SelectKBest | DecisionTreeRegressor | `[Pipeline(memory=Memory(location=cachedir\jobl...` |
| StandardScaler | SelectKBest | SVR | `[Pipeline(memory=Memory(location=cachedir\jobl...` |
| MinMaxScaler | SelectKBest | KNeighborsRegressor | `[Pipeline(memory=Memory(location=cachedir\jobl...` |
| MinMaxScaler | SelectKBest | DecisionTreeRegressor | `[Pipeline(memory=Memory(location=cachedir\jobl...` |
| MinMaxScaler | SelectKBest | SVR | `[Pipeline(memory=Memory(location=cachedir\jobl...` |


### Save the results of the pipeline permuter to a pickle file

In [13]:
pipeline_permuter.to_pickle(models_path.joinpath("Feature_Elimination_NN_DT_SVR.pkl"))