## Collecting the dataset



In [2]:
import numpy as np
from os import getcwd
import pandas as pd
from pathlib import Path
from tqdm import tqdm

# Put the parent directory on sys.path so the project packages (data, model) can be imported
import os, sys
up1 = os.path.abspath('..'); sys.path.insert(0, up1)

# Silence warnings (mainly raised by sklearn)
import warnings
warnings.filterwarnings("ignore")

import data.manipulators as dm
import data.utilities as du
import model.utilities as mu
from model.assets.labelmodel_heuristics import get_vote_vector_nk


# Collect data
config = mu.getModelConfig()
featureSet = config.features_nk  # the "nk" suffix marks NeuroKit versions of the features/featurizations (HeartPy versions also exist)
print(f'Collecting unlabeled featurized data from source: {config.trainDataFile} ("label" column is populated from prior labelmodel annotations)')
df = pd.read_csv(
    Path(getcwd()).parent / 'data' / 'assets' / config.trainDataFile,
    parse_dates=['start', 'stop'])

print(f'Collecting annotated featurized data from source: {config.goldDataFile}')
goldDF = pd.read_csv(
    Path(getcwd()).parent / 'data' / 'assets' / config.goldDataFile,
    parse_dates=['start', 'stop'])

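# Helper to collect labeling-function (LF) votes; see model/assets/labelmodel_heuristics.py for the LFs
def getHeuristicVotes(featurizedData):
    """Stack the vote vector of every row of featurizedData into a NumPy array."""
    votes = []
    # Each row's columns are passed to get_vote_vector_nk as keyword arguments
    for _, row in tqdm(featurizedData.iterrows(), total=len(featurizedData)):
        votes.append(get_vote_vector_nk(**row))
    return np.array(votes)
L_train = getHeuristicVotes(df)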
df, goldDF





Collecting unlabeled featurized data from source: trainset_featurized_nk.csv ("label" column is populated from prior labelmodel annotations)
Collecting annotated featurized data from source: final_annotations_featurized_nk.csv



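The `getHeuristicVotes` helper above passes each row's columns to `get_vote_vector_nk` as keyword arguments. The real labeling functions live in `model/assets/labelmodel_heuristics.py` and are not shown here, so the following is only a minimal sketch of the pattern. The vote encoding, the two labeling functions, their thresholds, and `toy_vote_vector` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed vote encoding; the repository's actual encoding may differ.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_high_sd1sd2(hrv_sd1sd2, **_):
    # Hypothetical LF: vote POSITIVE when the Poincare ratio is large.
    return POSITIVE if hrv_sd1sd2 > 0.6 else ABSTAIN


def lf_low_variation(b2b_var, **_):
    # Hypothetical LF: vote NEGATIVE when RR intervals barely vary.
    return NEGATIVE if b2b_var < 0.05 else ABSTAIN


def toy_vote_vector(**features):
    # Stand-in for get_vote_vector_nk: one vote per labeling function.
    # Each LF picks the columns it needs and ignores the rest via **_.
    return [lf(**features) for lf in (lf_high_sd1sd2, lf_low_variation)]


toy_df = pd.DataFrame({
    "b2b_var": [0.02, 0.20],
    "hrv_sd1sd2": [0.30, 0.90],
})

# Same row-wise pattern as getHeuristicVotes, without the progress bar.
L_toy = np.array([toy_vote_vector(**row) for _, row in toy_df.iterrows()])
print(L_toy)
```

The result has one row per data row and one column per labeling function, which is the shape a label model typically expects as input.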


# Appendix

## Features

features_nk:
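   - `b2b_var`     coefficient of variation (std / mean) of the R-peak-to-R-peak (RR) intervals in the segment
   - `b2b_iqr`     interquartile range of the RR intervals in the segment
   - `b2b_range`   range (max - min) of the RR intervals in the segment
   - `b2b_std`     standard deviation of the RR intervals in the segment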
   - `hrv_sd1`     standard deviation of the Poincaré plot (RR_n vs. RR_{n+1}) projected onto the line perpendicular to the identity line (see [here](https://www.researchgate.net/figure/The-Poincare-plot-SD1-and-SD2-standard-deviations-of-the-scattergram_fig1_290416554))
   - `hrv_sd2`     standard deviation of the Poincaré plot (RR_n vs. RR_{n+1}) projected onto the identity line (see the figure linked under `hrv_sd1`)
   - `hrv_sd1sd2`  ratio sd1 / sd2
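   - `ecg_rate_mean`  mean heart rate in beats per minute (bpm)
   - `hrv_pnn20`      percentage of successive NN intervals that differ by more than 20 ms
   - `hrv_pnn50`      percentage of successive NN intervals that differ by more than 50 ms
   - `hrv_rmssd`      square root of the mean of the squared successive differences between adjacent RR intervals
   - `hrv_sdsd`       standard deviation of the successive differences between adjacent RR intervals
   - `hrv_sdnn`       standard deviation of the NN intervals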
   - `hrv_lf`         power in the low-frequency band (0.04 Hz to 0.15 Hz) of the RR intervals
   - `hrv_hf`         power in the high-frequency band (0.15 Hz to 0.4 Hz) of the RR intervals
   - `hrv_lfhf`       ratio lf / hf
   - `hrv_sampen`     sample entropy of the RR intervals
   - `hrv_shanen`     Shannon entropy of the RR intervals
   - `hrv_apen`       approximate entropy of the RR intervals
   - `hrv_hfd`        Higuchi fractal dimension of the RR intervals
   - `hopkins_statistic`  a measure of the clusterability of the RR intervals (see the Wikipedia article on the Hopkins statistic)
   - `max_sil_score`      maximum silhouette score over k-means clusterings of the (RR_n, RR_{n+1}) pairs, with k taking values from 2 through 11 inclusive (Towards Data Science has a good description of the silhouette score)
   - `sse_1_clusters`     sum of squared errors of a k-means fit with k = 1 to the (RR_n, RR_{n+1}) pairs
   - `sse_2_clusters`     sum of squared errors of a k-means fit with k = 2 to the (RR_n, RR_{n+1}) pairs
   - `sse_diff`           sse_1_clusters - sse_2_clusters; captures how much better two clusters fit the RR interval pairs than a single cluster
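
To make a few of these definitions concrete, here is a rough sketch that computes some of the time-domain, Poincaré, and clustering features from a toy list of RR intervals in milliseconds. It is not the repository's featurization and may not match NeuroKit's exact conventions: the helper names `rr_features` and `sse_features`, the use of the sample standard deviation (`ddof=1`), the k-means settings, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def rr_features(rr_ms):
    """Illustrative versions of a few listed features (RR intervals in ms)."""
    rr = np.asarray(rr_ms, dtype=float)
    diffs = np.diff(rr)                 # successive differences RR_{n+1} - RR_n
    x, y = rr[:-1], rr[1:]              # Poincare pairs (RR_n, RR_{n+1})
    # Project the pairs onto the line perpendicular to the identity line (sd1)
    # and onto the identity line itself (sd2), then take the spread.
    sd1 = np.std((y - x) / np.sqrt(2), ddof=1)
    sd2 = np.std((y + x) / np.sqrt(2), ddof=1)
    return {
        "b2b_var": np.std(rr, ddof=1) / np.mean(rr),
        "b2b_range": np.max(rr) - np.min(rr),
        "hrv_rmssd": np.sqrt(np.mean(diffs ** 2)),
        "hrv_pnn20": 100 * np.mean(np.abs(diffs) > 20),
        "hrv_pnn50": 100 * np.mean(np.abs(diffs) > 50),
        "hrv_sd1": sd1,
        "hrv_sd2": sd2,
        "hrv_sd1sd2": sd1 / sd2,
    }


def sse_features(rr_ms, seed=0):
    """Sum of squared errors of k-means fits (k = 1, 2) to the Poincare pairs."""
    rr = np.asarray(rr_ms, dtype=float)
    pairs = np.column_stack([rr[:-1], rr[1:]])
    # KMeans.inertia_ is the sum of squared distances to the nearest centroid.
    sse = {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pairs).inertia_
        for k in (1, 2)
    }
    return {
        "sse_1_clusters": sse[1],
        "sse_2_clusters": sse[2],
        "sse_diff": sse[1] - sse[2],
    }


# Toy segment: mostly steady beats with a short/long pair in the middle.
rr = [800, 810, 790, 870, 800, 805, 640, 980, 800]
feats = {**rr_features(rr), **sse_features(rr)}
for name, value in feats.items():
    print(f"{name:>15}: {value:.3f}")
```

Because a second centroid can only reduce the within-cluster error, `sse_diff` is non-negative. Larger values suggest that the interval pairs split into two distinct groups.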