## Part 1: Pulse Rate Algorithm

### Contents
Fill out this notebook as part of your final project submission.

**You will have to complete both the Code and Project Write-up sections.**
- The [Code](#Code) is where you will write a **pulse rate algorithm** and already includes the starter code.
   - Imports - These are the imports needed for Part 1 of the final project. 
     - [glob](https://docs.python.org/3/library/glob.html)
     - [numpy](https://numpy.org/)
     - [scipy](https://www.scipy.org/)
- The [Project Write-up](#Project-Write-up) to describe why you wrote the algorithm for the specific case.


### Dataset
You will be using the **Troika**[1] dataset to build your algorithm. Find the dataset under `datasets/troika/training_data`. The `README` in that folder will tell you how to interpret the data. The starter code contains a function to help load these files.

1. Zhilin Zhang, Zhouyue Pi, Benyuan Liu, ‘‘TROIKA: A General Framework for Heart Rate Monitoring Using Wrist-Type Photoplethysmographic Signals During Intensive Physical Exercise,’’IEEE Trans. on Biomedical Engineering, vol. 62, no. 2, pp. 522-531, February 2015. Link

-----

### Code

In [1]:
import matplotlib.pyplot as plt
import scipy.signal as sg
import pandas as pd
import copy
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import normalize
import os

import glob

import numpy as np
import scipy as sp
import scipy.io


def LoadTroikaDataset():
    """
    Retrieve the .mat filenames for the troika dataset.

    Review the README in ./datasets/troika/ to understand the organization
    of the .mat files.

    Returns:
        data_fls: Names of the .mat files that contain signal data
        ref_fls: Names of the .mat files that contain reference data
        <data_fls> and <ref_fls> are ordered correspondingly,
        so that ref_fls[5] is the reference data for data_fls[5], etc...
    """
    data_dir = "./datasets/troika/training_data"
    data_fls = sorted(glob.glob(data_dir + "/DATA_*.mat"))
    ref_fls = sorted(glob.glob(data_dir + "/REF_*.mat"))
    return data_fls, ref_fls


def LoadTroikaDataFile(data_fl):
    """
    Loads and extracts signals from a troika data file.

    Usage:
        data_fls, ref_fls = LoadTroikaDataset()
        ppg, accx, accy, accz = LoadTroikaDataFile(data_fls[0])

    Args:
        data_fl: (str) filepath to a troika .mat file.

    Returns:
        numpy arrays for ppg, accx, accy, accz signals.
    """
    data = sp.io.loadmat(data_fl)["sig"]
    return data[2:]


# in your prediction file
def load_model(model_path):
    """
    Loads a saved sklearn regression model or train new model
    Args:
    Returns:
        return the saved model
    """
    path_n = model_path
    if os.path.isfile(path_n):
        fil = open(path_n, "rb")
        prediction_model = pickle.load(fil)
        print("model found")
    else:
        prediction_model = model_train()
        print("new model is created")
    return prediction_model


def Load_labels(data_fl, ts):
    """
    Load the reference PBM for a file and perform interpolation to convert it
    to the same timestamp of the measurments
    Args:
        data_fl: (str) filepath to a troika .mat file.
        ts: time stamp of the sensors measurements for the same file
    Returns:
        numpy arrays for ref BPM with the same length of sensor measurements
    """
    # as mentioed in the dataset read_me the ref value is obtained using a
    # window of 8s and 6s overlap
    # so we can assume that the sample freq of the ref data is 2s
    fs_labels = 2
    data = sp.io.loadmat(data_fl)["BPM0"]
    ts_labels = np.arange(0, len(data) / fs_labels, 1 / fs_labels)
    labels_interp = np.interp(ts, ts_labels, data[:, 0])
    return labels_interp


def butter_bandpass(data, rang, fs):
    """
    bandpass the data withn certain frequency range for all sensors
    Args:
        data: a list of sensors measurements
        range: tuple contain the min, max frequency
        fs: sample frequence for the signal
    Returns:
        return a list arrays for all the filtered signals in the time domain
    """
    results = []
    zeros, poles = sg.butter(5, rang, btype="bandpass", fs=fs)
    # filter each sensor measurement
    for key in data:
        results.append(sg.filtfilt(zeros, poles, key))
    return results


def fft_sensors(data, fs):
    """
    fast fourier transform for the each sensor data
    Args:
        data: a list of sensors measurements
        fs: sample frequence for the signal
    Returns:
        return a list of arrays for all the signals in the frequency domaoin
        data is on this format data - > sensor1_freq, sensor1_powers
                                    - > sensor2_freq, sensor2_powers
                                    - > sensorn_freq, sensorn_powers
    """
    ff_data = []
    # FFT each sensor measurement
    for key in data:
        ff_freq = np.fft.rfftfreq(len(key), 1.0 / fs)
        ff_power = np.fft.rfft(key)
        ff_data.append(np.array([ff_freq, ff_power]))
    return ff_data


def fft_inv(fftdata):
    """
    inverse fast fourier transform for each sensor data
    Args:
        data: a list of arrays for each sensor frequencies and powers
        fs: sample frequence for the signal
    Returns: list of arrays for each sensor signal in the time domain
    """
    data = {}
    # IFFT for each sensor data
    for key in fftdata:
        # Each sensor has freqs, power arrays so we chose the power here
        data[key] = np.fft.irfft(fftdata[key][1])
    return data


# Deprecated -> was used for visualization in the begining only
def vizualize_data(data_dict_fs, wind_title, xlim, vtype="freq"):
    len_dict = len(data_dict_fs)
    plt.clf
    fig, ax = plt.subplots(len_dict, 1, figsize=(10, 6 * len_dict))
    fig.canvas.set_window_title(wind_title)
    ax = ax.flatten()
    for key, axes in zip(data_dict_fs, ax):
        if type == "freq":
            power = np.abs(key[0])
            freqs = np.abs(key[1])
            axes.plot(freqs, power, label=key)
            axes.set_xlabel(key)
            axes.set_xlim(xlim)
            axes.legend()
        else:
            axes.plot(key)
            # axes.set_xlabel(key)
            axes.set_xlim(xlim)
            axes.legend()

        # axes.set_xlabel('Time (sec)')


def load_all_dataset(data_fls, ref_fls, fs, window_size=6, window_overlab=4):

    """
    Load data, labels from all files, windowing on each 6 seconds data with
    4 sec overlab each sample data is a window of 6 seconds and shifted 2 s
    from the previoues one
    Args:
        data_fls: all the files for the dataset sensor measurements
        ref_fls: all the files for the dataset sensor labels
        fs: sample frequency
    Returns: list of array for all samples in the dataset, each sample data
     has all the sensors measurements in this sample in the time domain
    """
    X = []
    Y = []
    window_len = 6 * fs
    window_shift_len = (window_size - window_overlab) * fs

    dataset = {"ppg": [], "accx": [], "accy": [], "accz": [], "labels": []}
    # Read All Files
    for data_file, ref_file in zip(data_fls, ref_fls):
        ppg, accx, accy, accz = LoadTroikaDataFile(data_file)
        dataset["ppg"].extend(ppg)
        dataset["accx"].extend(accx)
        dataset["accy"].extend(accy)
        dataset["accz"].extend(accz)
        ts = np.arange(0, len(ppg) / fs, 1 / fs)
        labels = Load_labels(ref_file, ts)
        labels /= 60.0
        dataset["labels"].extend(labels)
    # dataframe have data and labels for all the data and labels
    # all files are appended
    df = pd.DataFrame(dataset)
    # Convert dataframe to samples each one is shifted 2s
    number_samples = int(len(df) / window_shift_len)
    for window_ind in range(number_samples):
        sample_start = int(window_ind * window_shift_len)
        sample_end = int(sample_start + window_len)
        ppg = df["ppg"][sample_start:sample_end].values
        accx = df["accx"][sample_start:sample_end].values
        accy = df["accy"][sample_start:sample_end].values
        accz = df["accz"][sample_start:sample_end].values
        labels = df["labels"][sample_start:sample_end].values
        X.append(np.array([ppg, accx, accy, accz]))
        Y.append(np.array([labels]))
    return X, Y


def clean_datset(samples, labels, fs):
    """
    Clean each seample data by performing bandpass filter on range of 0.7-4
    Hz and add FFT for each sample in the dataset
    Args:
        samples: list of array for each sample. each sample data has several
         sensor measurements
        labels: list of arrays for each sample labels
    Returns: tuble of dataset time and frequency data for all sensors, labels
     for each sample
    """
    clean_x = []
    clean_y = []
    for sample, label in zip(samples, labels):
        bandpass_filtered = butter_bandpass(sample, (0.7, 4), fs)
        fft_sample = fft_sensors(bandpass_filtered, fs)
        clean_x.append((bandpass_filtered, fft_sample))
        clean_y.append([np.mean(label)])
    return (clean_x, clean_y)


def AggregateErrorMetric(pr_errors, confidence_est):
    """
    Computes an aggregate error metric based on confidence estimates.

    Computes the MAE at 90% availability.

    Args:
        pr_errors: a numpy array of errors between pulse rate estimates and
         corresponding reference heart rates.
        confidence_est: a numpy array of confidence estimates for each pulse
         rate error.

    Returns:
        the MAE at 90% availability
    """
    # Higher confidence means a better estimate. The best 90% of the estimates
    #    are above the 10th percentile confidence.
    percentile90_confidence = np.percentile(confidence_est, 10)

    # Find the errors of the best pulse rate estimates
    best_estimates = pr_errors[confidence_est >= percentile90_confidence]

    # Return the mean absolute error
    return np.mean(np.abs(best_estimates))


def remove_powerful_acc_freqs_from_ppg(
    peaks_freq_ppg, peaks_freqs_ppg_vals, peaks_freq_acc
):
    """
    in any sample, filter the ppg peak frequencies that exist in the
     acceleration peak frequencies .
    If an acceleration frequency exists within +-0.25Hz range for any of
     the ppg peak freqs, this freq is removed.
    Args:
        peaks_freq_ppg: a list of the powerful freqs in the ppg freq signal
        peaks_freqs_ppg_vals: power of each ppg peak frequency
        peaks_freq_acc:a list of the powerful freqs in the accel freq signal
    Returns:return a tuple of the most powerful ppg freq and its power after
     removing near acc freqs
    """
    # if there is only one peak in this sample, take this sample
    if len(peaks_freq_ppg) > 1:

        # sort ppg freqs according to their power
        sorted_ppg_freqs = peaks_freq_ppg[np.argsort(peaks_freqs_ppg_vals)]
        sorted_ppg_powers = peaks_freqs_ppg_vals[np.argsort(
                                                peaks_freqs_ppg_vals)]
        ppg_freqs_acc_removed = []
        ppg_freqs_acc_removed_val = []
        # for each ppg freq search for powerful freqs near the range
        for freq, power in zip(sorted_ppg_freqs, sorted_ppg_powers):
            flag = 0
            for acc_freq in peaks_freq_acc:
                if (freq + 0.15 <= acc_freq) & (acc_freq <= freq + 0.15):
                    flag = 1
                    break
            if flag == 0:
                ppg_freqs_acc_removed.append(freq)
                ppg_freqs_acc_removed_val.append(power)

        ppg_1st_power_freq_acc_removed = ppg_freqs_acc_removed[-1]
        ppg_1st_power_freq_acc_removed_val = ppg_freqs_acc_removed_val[-1]
    else:
        ppg_1st_power_freq_acc_removed = peaks_freq_ppg[-1]
        ppg_1st_power_freq_acc_removed_val = peaks_freqs_ppg_vals[-1]
    return (ppg_1st_power_freq_acc_removed, ppg_1st_power_freq_acc_removed_val)


def featurize_samples(samples, fs):
    """
    samples->(time_data,freq_data)->(ppg,accx,accy,accz)
                                    ->fppg,faccx,faccy,faccz->freqs,power
    """
    """
    time and freq featurization for each sample sensors data
    Args:
        samples: a list of the dataset samples
        fs: sample frequency
    Returns: return a list of arrays for each sample features
    """
    features = []
    for sample in samples:
        # extract each sample to each sensor time & freq data
        ppg_t = sample[0][0]
        freqs = np.abs(sample[1][0][0])
        ppg_f = np.abs(sample[1][0][1])
        accx_f = np.abs(sample[1][1][1])
        accy_f = np.abs(sample[1][2][1])
        accz_f = np.abs(sample[1][3][1])
        #  average freq power for all accel axes
        acc_all_f = (accx_f + accy_f + accz_f) / 3.0

        # Time features
        peaks, _ = sg.find_peaks(
            ppg_t, height=1.5, distance=30
        )
        # minimum 0.25s distance(4Hz)
        # periods features
        peaks_dif = np.diff(peaks) / 60.0
        min_peak = np.min(peaks_dif)
        max_peak = np.max(peaks_dif)
        mean_peak = np.mean(peaks_dif)
        std_peak = np.std(peaks_dif)

        # peaks for ppg freqs & their power
        # minimum 1/3Hz distance
        peaks_freq_ppg, _ = sg.find_peaks(ppg_f, distance=3)
        peaks_freqs_ppg_vals = ppg_f[peaks_freq_ppg]
        peaks_freq_ppg = freqs[peaks_freq_ppg]
        # take the peaks in the allowed freq only
        peaks_freqs_ppg_vals = peaks_freqs_ppg_vals[
            (peaks_freq_ppg < 4.0) & (peaks_freq_ppg > 0.67)
        ]
        peaks_freq_ppg = peaks_freq_ppg[
            (peaks_freq_ppg < 4.0) & (peaks_freq_ppg > 0.67)
        ]
        # peaks for accel freqs & their power
        peaks_freq_acc, _ = sg.find_peaks(
            acc_all_f, distance=3
        )  # minimum 1/3Hz distance
        peaks_freqs_acc_vals = acc_all_f[peaks_freq_acc]
        peaks_freq_acc = freqs[peaks_freq_acc]
        peaks_freqs_acc_vals = peaks_freqs_acc_vals[
            (peaks_freq_acc < 4.0) & (peaks_freq_acc > 0.67)
        ]
        peaks_freq_acc = peaks_freq_acc[
            (peaks_freq_acc < 4.0) & (peaks_freq_acc > 0.67)
        ]
        # freq features for ppg
        ppg_freq_mean = np.mean(peaks_freq_ppg)
        ppg_freq_std = np.std(peaks_freq_ppg)
        ppg_1st_power_freq = peaks_freq_ppg[np.argsort(
                                    peaks_freqs_ppg_vals)][-1]
        ppg_1st_power_freq_val = peaks_freqs_ppg_vals[np.argsort(
                                    peaks_freqs_ppg_vals)][-1]
        (ppg_1st_freq_acc_fil, ppg_1st_freq_acc_fil_val,) = \
            remove_powerful_acc_freqs_from_ppg(
                        peaks_freq_ppg, peaks_freqs_ppg_vals, peaks_freq_acc
                                    )

        # freq features for acc
        acc_all_freq_mean = np.mean(peaks_freq_acc)
        acc_all_freq_std = np.std(peaks_freq_acc)
        acc_all_max_strong_freq = peaks_freq_acc[np.argsort(
                                            peaks_freqs_acc_vals)][-1]
        acc_all_max_strong_freq_power = peaks_freqs_acc_vals[
            np.argsort(peaks_freqs_acc_vals)
        ][-1]

        time_features = [min_peak, max_peak, mean_peak, std_peak]
        freq_features = [
            ppg_freq_mean,
            ppg_freq_std,
            ppg_1st_power_freq,
            ppg_1st_power_freq_val,
            ppg_1st_freq_acc_fil,
            ppg_1st_freq_acc_fil_val,
            acc_all_freq_mean,
            acc_all_freq_std,
            acc_all_max_strong_freq,
            acc_all_max_strong_freq_power,
        ]
        total_features = time_features + freq_features
        features.append(np.array(total_features))

    return np.array(features)


def model_train(estimators=650, depth=14, file_path="model_final"):
    """
    train a random forest regressor on the dataset and save it in the
     provided path
    Args:
        estimators: number of trees in the model
        depth: single value for the depth of each tree of the model
        file_path: a path string to which model will be saved
    Returns: return the trained regression model
    """
    # Reading ref and sensors data, create timestamp for both
    fs = 125
    data_fls, ref_fls = LoadTroikaDataset()
    samples, labels = load_all_dataset(data_fls, ref_fls, fs)
    clean_x, clean_y = clean_datset(samples, labels, fs)
    """
    samples->(time_data,freq_data)->(ppg,accx,accy,accz)
                                    ->fppg,faccx,faccy,faccz->freqs,power
    """
    dataset_feats = featurize_samples(clean_x, fs)
    train_x, test_x, train_y, test_y = train_test_split(
        dataset_feats, clean_y, random_state=42, test_size=0.2
    )
    clf = RandomForestRegressor(
        n_estimators=estimators, max_depth=depth, random_state=42
    )
    clf.fit(train_x, np.ravel(train_y))
    y_pred = clf.predict(test_x)
    mae_val = mean_absolute_error(
        y_pred * 60.0, np.array(test_y) * 60.0
    )  # *60 toconvert from Hz t BPM
    mse_val = mean_squared_error(
        y_pred * 60, np.array(test_y) * 60
    )  # *60 to convert from Hz t BPM
    # res={"label":np.ravel(test_y)*60,"prediction":y_pred*60}
    print("Model Results : \n")
    print("mean absolute error", mae_val)
    print("mean square error", mse_val)
    with open(file_path, "wb") as f:
        pickle.dump(clf, f)
        print("model saved in the following dir: %s" % file_path)
    return clf


def Evaluate():
    """
    Top-level function evaluation function.
    Runs the pulse rate algorithm on the Troika dataset and returns an
     aggregate error metric.

    Returns:
        Pulse rate error on the Troika dataset. See AggregateErrorMetric.
    """
    # Retrieve dataset files
    data_fls, ref_fls = LoadTroikaDataset()
    errs, confs = [], []
    for data_fl, ref_fl in zip(data_fls, ref_fls):
        # Run the pulse rate algorithm on each trial in the dataset
        errors, confidence = RunPulseRateAlgorithm(data_fl, ref_fl)
        # errors=list(errors)
        # confidence=list(confidence)
        # print(errors.shape,confidence.shape)
        errs.append(errors)
        confs.append(confidence)
        # Compute aggregate error metric
    errs = np.hstack(errs)
    confs = np.hstack(confs)
    # print(len(errs),len(confs))
    return AggregateErrorMetric(errs, confs)


def RunPulseRateAlgorithm(data_fl, ref_fl):
    """
    load regresssion model, perform prediction over a file  samples
    return mean error and confidence for each sample
    Args:
        data_fl: file name for sensor measurements
        ref_fl: file name that has true PBM value
        file_path: a path string to which model will be saved
    Returns: tuple of two arrays for mean error confidence
    """
    model_path = "model_final"
    fs = 125
    samples, labels = load_all_dataset([data_fl], [ref_fl], fs)
    clean_x, clean_y = clean_datset(samples, labels, fs)
    file_samples_x = featurize_samples(clean_x, fs)
    file_samples_y = np.array(clean_y)
    # ppg, accx, accy, accz = LoadTroikaDataFile(data_fl)
    reg_model = load_model(model_path)
    samples_pred = reg_model.predict(file_samples_x)

    mae_val = np.abs(
        (samples_pred.reshape(-1, 1)) * 60.0 - np.array(file_samples_y) * 60.0
    )  # *60 toconvert from Hz t BPM
    # Compute pulse rate estimates and estimation confidence.
    confidence = []
    mae_val = np.ravel(mae_val)
    for ind in range(len(clean_y)):
        pred_freq = samples_pred[ind]
        fft_freqs = np.abs(clean_x[ind][1][0][0])
        fft_f_power = np.abs(clean_x[ind][1][0][1])
        freq_wind = \
            (fft_freqs > pred_freq - 0.5) & (fft_freqs < pred_freq + 0.5)
        conf = np.sum(fft_f_power[freq_wind]) / np.sum(fft_f_power)
        confidence.append(conf)
    # print(mae_val.shape,np.array(confidence).shape)
    # Return per-estimate mean absolute error and confidence as
    # a 2-tuple of numpy arrays.
    return mae_val, np.array(confidence)


def get_predictions(data_fl, ref_fl, model_path="model_final"):
    """
    load regresssion model, perform prediction over a file  samples
    return prediction and confidence for each sample
    Args:
        data_fl: file name for sensor measurements
        model_path: a path string to which model will be saved
    Returns: a data for predictions and confidence of each 2s measurement
    """
    fs = 125
    samples, labels = load_all_dataset([data_fl], [ref_fl], fs)
    clean_x, clean_y = clean_datset(samples, labels, fs)
    file_samples_x = featurize_samples(clean_x, fs)
    file_samples_y = np.array(clean_y)
    # ppg, accx, accy, accz = LoadTroikaDataFile(data_fl)
    reg_model = load_model(model_path)
    samples_pred = reg_model.predict(file_samples_x)
    confidence = []
    for ind in range(len(clean_y)):
        pred_freq = samples_pred[ind]
        fft_freqs = np.abs(clean_x[ind][1][0][0])
        fft_f_power = np.abs(clean_x[ind][1][0][1])
        freq_wind = \
            (fft_freqs > pred_freq - 0.5) & (fft_freqs < pred_freq + 0.5)
        conf = np.sum(fft_f_power[freq_wind]) / np.sum(fft_f_power)
        confidence.append(conf)
    # print(len(file_samples_y),len(samples_pred),len(confidence))
    results = {
        "true PPM": np.ravel(file_samples_y) * 60.0,
        "predicted_PPM": samples_pred * 60.0,
        "confidence": confidence,
    }
    return pd.DataFrame(results)


In [2]:
model_train(estimators=650, depth=14, file_path="model_final")

Model Results : 

mean absolute error 8.600705195109164
mean square error 166.6567147439998
model saved in the following dir: model_final


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=14,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=650,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

In [3]:
Evaluate()

model found
model found
model found
model found
model found
model found
model found
model found
model found
model found
model found
model found


6.82323605387397

-----
### Project Write-up

Answer the following prompts to demonstrate understanding of the algorithm you wrote for this specific context.

> - **Code Description** - Include details so someone unfamiliar with your project will know how to run your code and use your algorithm. 
> - **Data Description** - Describe the dataset that was used to train and test the algorithm. Include its short-comings and what data would be required to build a more complete dataset.
> - **Algorithhm Description** will include the following:
>   - how the algorithm works
>   - the specific aspects of the physiology that it takes advantage of
>   - a describtion of the algorithm outputs
>   - caveats on algorithm outputs 
>   - common failure modes
> - **Algorithm Performance** - Detail how performance was computed (eg. using cross-validation or train-test split) and what metrics were optimized for. Include error metrics that would be relevant to users of your algorithm. Caveat your performance numbers by acknowledging how generalizable they may or may not be on different datasets.

Your write-up goes here...

### Code Description
##### Running the code
 1. download all requred libraries are avaiable and all dependancies with Python 3.7
 2. to run test if everything is working, you can use ```RunPulseRateAlgorithm``` function. To use this function you should provide 1 file that have PPG , 3-axis Accelerometer sensors measurements and 1 more file that have the true BPM values. All files should be the same format to Troika Dataset.
 3. to run the training, testing process at the same at once, you can use ```Evaluate``` function which will do everything and return the mean average error of the algorithm. Troika dataset should be provided in the same directory of the notebook in a folder called 'datasets'
 4. to run the project for getting BPM values for certain file you can run the following code
 

In [4]:
data_fs,ref_fs=LoadTroikaDataset()
df=get_predictions(data_fs[0],ref_fs[0])
df.sample(10)

model found


Unnamed: 0,true PPM,predicted_PPM,confidence
147,154.220779,153.05811,0.458667
72,154.220779,154.790161,0.438753
31,163.973422,157.487565,0.536739
36,154.490148,153.304711,0.513044
1,74.225355,88.902268,0.376885
144,154.220779,154.264625,0.374124
84,154.220779,154.766573,0.402636
145,154.220779,154.270538,0.468361
111,154.220779,154.119899,0.42242
130,154.220779,154.837754,0.496803


### Data Description


#### Description
- The used dataset is a modiffied version of the Troika dataset recorded from 12 male subjects with yellow skin and ages ranging from 18 to 35.
- Dataset provide measurements from PPG ,3-axis Acceleromenter at sample freuquency = 125 Hz
- For each subject, the PPG signal was recorded from wrist using a pulse oximeter with green LED (wavelength: 515nm).
- The acceleration signal was also recorded from wrist using a three-axis accelerometer. Both the pulse oximeter and the accelerometer were embedded in a wristband, which was comfortably worn. 

- During data recording subjects walked or ran on a treadmill with the following speeds in order: the speed of 1-2 km/hour for 0.5 minute, the speed of 6-8 km/hour for 1 minute, the speed of 12-15 km/hour for 1 minute, the speed of 6-8 km/hour for 1 minutes, the speed of 12-15 km/hour for 1 minute, and the speed of 1-2 km/hour for 0.5 minute. The subjects were asked to purposely use the hand with the wristband to pull clothes, wipe sweat on forehead in addition to freely swing

#### Limitations
- predictions can have bias towards the provided age range, males, yellow skin as a result of the unbalanced distrubution of the dataset
- predictions can have small bias to the used pattern when collecting the data
- predictions can have bias towards the used sensors when using different sensors
- predictions may have high mean average error when predicting PPM at rest.
#### Optimum dataset
- demographic distribution should represent the real world in (age, gender,skin color)
- size of the dataset should be increased  to decrease any chacn of overfitting
- more daily life activitites should be monitored too


### Algorithm Description
#### Preprocessing
- sensors time series are filtered with bandpass filter within 40-240BPM
- windowing algorithm is performed to split each data series to 6s samples and each sample is overlapped 4s with the ealrier one to achieve 2Hz output
- FFT(fast fourier transform) is performed to convert samples to frequency domain
- each sample time and frequency measurements are packed together for featurization process
#### Featurization

 For each sample the following features was taken:  
 - minimum time between 2 sequential peaks in ppg singal
 - maximum time between 2 sequential peaks in ppg singal
 - the mean time between 2 sequential peaks in ppg singal 
 - standard deviation between 2 sequential peaks in ppg singal 
 - mean frequency in the ppg signal
 - standard deviation for frequency in ppg signal
 - most powerful freq in ppg singal and it's value
 - most powerful freq in acceleration singal and it's value
 - acceleration frequencies mean and standard deviation
 - most powerful freq in ppg singal that doesn't exist in the acceleration freqs and it's value 
#### Regression model
Algorithm takes all the features of each sample and feed this data to a regression Random Forest model which is trained using 650 tree and a depth of 12 for each tree.
 
#### Physiology 
This algorithm mainly input is from PPG sensors which uses emits light through skin and recives this signal again and depending on the absorbtion rate of the recevied light we can tell weather this measuremet is for blood that is going to the body or towards heart because the absorbtion rate is different for blood with high and low red cells percentage. As a result we can know when conctraction and retraction of the heart happen.
#### Output

Algorithm privodes heart rate measurements in PPM and the fonfidence of each measurement depending on the measured power around this frequency.

#### Caveats on algorithm outputs

- algorithm isn't still general for any dataset format and files should have same format as Troika dataset
- the function that gives prediction still requires reference file because it uses several preprocessing, cleaning, filtering functions that was used for training although the ref file don't affect the prediction process by any way.
- no specific failure cases was noticed


### Algorithm Performance
#### Performance 

- performance was computed using mean absolute error MAE metric over both 90th precentile and 10th percntile 
- dataset was split and shuffled using train_test split sklearn functions and the error was measured over the testset 
- overall MAE= 6.8
- Confidence of the model can also be used as indication to the error in the measurements. When the confidence is low this means that this PPM have ony small power in the obtained data or the measurements have other high power frequencies created by movement noise or any other distortion.




-----
### Next Steps
You will now go to **Test Your Algorithm** (back in the Project Classroom) to apply a unit test to confirm that your algorithm met the success criteria. 