# Capstone Proposal: A Three-Way Consensus Pipeline for Stress Level Detection
Trenton Potgieter

## Background
As one gets older, awareness of a parent's mortality becomes an increasingly difficult concern. Personally, my parents are both in their early 70s and, according to a 2015 study [^1] by the __American Heart Association__, almost a *third* of the population is at risk of *Heart Disease* leading to *Heart Attack* as one approaches 80+ years of age. Having no personal experience in the coronary field of medical research, it would be difficult for me to diagnose any potential warning signs, but with the advent of wearable technology, the mechanisms are in place to potentially aid in the early warning and detection of heart attacks. The majority of wearable technology today has the built-in ability to monitor heart rates. This information can be uploaded or sent to a data ingestion pipeline that is capable of interpreting, analyzing and detecting any patterns that could be classified as symptoms of a heart attack.

One of the potential symptoms is an increase in heart rate. A number of factors can influence an increase in heart rate, but there are well-published guidelines [^2] that can be used to determine anomalous patterns. If these anomalies occur, the data ingestion pipeline could determine whether a heart attack is about to occur *or* has occurred and alert the appropriate medical response, thus helping to prevent a fatal or near-fatal heart attack. Additionally, the pipeline mechanism can be used to monitor patients who are in *Cardiac Rehabilitation* [^3].

<!---
According to < https://www.heart.org/idc/groups/ahamah-public/@wcm/@sop/@smd/documents/downloadable/ucm_480086.pdf > around 370,000 people die of heart attacks each year and is the No. 1 cause of in the United States. In 2014, around 356,500 people experienced heart attacks out of the hospital. Of that  amount  only 12% survived due to emergency medical services intervention. Personally, I would not like my parents to be one the 88%  who suffered from a fatal heart attack and didn't survive  due to the fact that there was no intervention by emergency medical services. To this end I propose ...
--->





## Problem Statement
For this Project, I propose creating a classification pipeline that ingests heart-rate signal data (from a simulated wearable monitor) and classifies whether the subject is in a stressful situation that could lead to *Cardiac Unrest*. Additionally, in order to prevent a "cry-wolf" scenario or *false-positives*, the pipeline employs a consensus mechanism where three classifiers must all agree on the classification. To accomplish this, the project is comprised of three stages:

1. __Ingesting signal data__. $\rightarrow$ Collect already-filtered PPG [^4] signal data in which symbolic peaks (and other features) have been extracted for each one-minute time segment. Each one-minute time segment is considered an observation labeled with the class `relax` or `stress`.
2. __Classification model training__.
3. __Classification model application on new, unseen data__. $\rightarrow$ The final classification is made by a Weighted Majority Rule Ensemble Classifier [^5], based on the probability of the time segment observation belonging to either class, using the following:
$$
\hat{y} = \arg\max_{i}\sum^{m}_{j=1}w_{j}p_{ij},
$$
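where $w_{j}$ is the weight that can be assigned to the $j^{th}$ classifier, $p_{ij}$ is the probability that the $j^{th}$ classifier assigns to class $i$, and $m$ is the number of classifiers.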
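
As a small worked illustration of the weighted rule above, the sketch below computes $\hat{y}$ for a single observation in plain Python. The probabilities and weights are made-up numbers, and `weighted_soft_vote` is a hypothetical helper name rather than part of any library.

```python
def weighted_soft_vote(probabilities, weights):
    """Return (winning class index, weighted score per class).

    probabilities[j][i] is p_ij: the probability classifier j assigns to class i.
    weights[j] is w_j: the weight given to classifier j.
    """
    n_classes = len(probabilities[0])
    # Weighted sum over classifiers, computed separately for each class i.
    scores = [
        sum(w * p[i] for w, p in zip(weights, probabilities))
        for i in range(n_classes)
    ]
    # arg max over the class index.
    winner = max(range(n_classes), key=lambda i: scores[i])
    return winner, scores


# Illustrative values only: class 0 = stress, class 1 = relax.
probabilities = [
    [0.7, 0.3],  # classifier 1
    [0.4, 0.6],  # classifier 2
    [0.3, 0.7],  # classifier 3
]
weights = [2.0, 1.0, 1.0]  # assumed weights, favouring classifier 1

winner, scores = weighted_soft_vote(probabilities, weights)
print(winner, scores)
```

With these numbers the weighted vote selects class 0, whereas equal weights would select class 1, which shows how the choice of $w_{j}$ can change the final classification.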

## Datasets and Inputs
The dataset used for this Project was obtained as part of a *Proof of Concept (POC)* project in the __Dell IoT Solutions Lab__ [^6] in Santa Clara, California, where a PPG [^4] pulse sensor, similar to those found on current wearables like the __Fitbit Charge 2__ [^8], was used to measure Heart Rate Variability (HRV) [^7] readings. The scope of the original POC was simply to verify whether the data can be extracted and filtered to detect peaks in the PPG signal for a one-minute data segment. Four separate test subjects (between the ages of 68 and 76) were subjected to different stimuli to induce *stress* and *relaxing* scenarios. The one-minute observations are stored in a *.csv* file.

For the scope of this project, however, I propose training three separate supervised machine learning models by applying the following methodology to create a pipeline:

1. Separate the input data into two separate repositories: one for the observations and one for the labeled output.
2. Apply __normalization__ and/or __standardization__ techniques to pre-process the data.
3. Define three separate models to evaluate the data.
4. Apply the models and measure their performance.
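
The first step of this methodology can be sketched with the standard library alone. The proposal does not describe the layout of the *.csv* file, so the column names (`mean_hr`, `peak_count`, `label`) and the sample rows below are purely hypothetical stand-ins.

```python
import csv
import io

# Made-up stand-in for the POC .csv file; the column names are hypothetical.
SAMPLE_CSV = """mean_hr,peak_count,label
72.0,70,relax
95.5,96,stress
68.2,66,relax
101.3,99,stress
"""


def split_observations_and_labels(csv_text, label_column="label"):
    """Split CSV rows into a list of feature vectors and a list of labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    observations, labels = [], []
    for row in reader:
        # Remove the label from the row; what remains are the features.
        labels.append(row.pop(label_column))
        observations.append([float(value) for value in row.values()])
    return observations, labels


observations, labels = split_observations_and_labels(SAMPLE_CSV)
print(observations)
print(labels)
```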

## Evaluation Metrics
Since the success criterion of the project is based on the overall __probability__ of the time segment observation belonging to either class (stressed or relaxed), each individual model as well as the overall consensus pipeline will be evaluated using a __Confusion Matrix__, with specific attention to the following aspects:

1. __Precision:__ $\rightarrow$ Measure the proportion of observations predicted as positive that are truly positive, for each model as well as the overall pipeline.
2. __Sensitivity:__ $\rightarrow$ Measure the proportion of truly positive observations that are correctly identified, i.e. how thorough each model and the overall pipeline are.
3. __Specificity:__ $\rightarrow$ Measure the proportion of truly negative observations that are correctly identified as negative, for each model as well as the overall pipeline.
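
For reference, the three measures can be derived from the four cells of a binary confusion matrix using their standard definitions. The sketch below uses made-up labels, hypothetical helper names, and assumes `stress` is treated as the positive class; the proposal itself does not fix which class is positive.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1  # predicted positive, truly positive
            else:
                fp += 1  # predicted positive, truly negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, truly positive
            else:
                tn += 1  # predicted negative, truly negative
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Avoid division by zero when a class is never predicted or present."""
    return numerator / denominator if denominator else 0.0


# Made-up labels for eight one-minute observations.
y_true = ["stress", "stress", "stress", "stress", "relax", "relax", "relax", "relax"]
y_pred = ["stress", "stress", "stress", "relax", "relax", "relax", "stress", "stress"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="stress")
precision = safe_ratio(tp, tp + fp)    # TP / (TP + FP)
sensitivity = safe_ratio(tp, tp + fn)  # TP / (TP + FN)
specificity = safe_ratio(tn, tn + fp)  # TN / (TN + FP)
print(precision, sensitivity, specificity)
```

For these illustrative labels the precision is 0.6, the sensitivity is 0.75 and the specificity is 0.5.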

## Overall Design

![Figure 1: Training/Testing Pipeline](images/Pipeline.jpg)

Figure 1 (above) provides an overview of the proposed pipeline that addresses the solution scope: determining whether an individual's heart rate indicates that they are in a position of stress.

The pipeline is separated into two specific workflows:

1. Training
2. Production

### Training Pipeline
The Training Pipeline comprises three specific stages.

#### Feature Extraction
The first process, Feature Extraction, separates the incoming signal data from the heart rate monitor into two separate training data sets. The first data set contains the signal observations, while the second contains the training labels associated with each observation. The labels are further converted to a binary integer value, with $1$ denoting "relaxed" and $0$ denoting "stressed". Additionally, in order to reduce the influence of outlier values and of features with very different ranges, the data is further standardized and scaled using the following:

>__Standardize:__
$$
X = \frac{x - \mu}{\sigma}
$$
where $x$ is a raw feature value, and $\mu$ and $\sigma$ are the mean and standard deviation of that feature.

>__Scale:__
$$
X' = \frac{X - X_{min}}{X_{max} - X_{min}}
$$
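
A minimal sketch of these two pre-processing steps for a single feature, assuming the usual z-score form of standardization (each value minus the feature mean, divided by the feature standard deviation) followed by min-max scaling. The use of the population standard deviation, the helper names and the sample heart rates are assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score each value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to spread
    return [(x - mu) / sigma for x in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(x - lowest) / (highest - lowest) for x in values]


# Label encoding described in the text: 1 for relaxed, 0 for stressed.
LABEL_TO_INT = {"relax": 1, "stress": 0}

heart_rates = [60.0, 70.0, 80.0, 90.0, 100.0]  # made-up feature values
standardized = standardize(heart_rates)
scaled = min_max_scale(standardized)
encoded = [LABEL_TO_INT[label] for label in ["relax", "stress", "relax"]]
print(standardized)
print(scaled)
print(encoded)
```

After standardization the values have zero mean and unit standard deviation; after scaling they lie in the range 0 to 1.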

#### Model Training
Once the data has been pre-processed, three separate classifiers are trained on the data:

1. Decision Tree
2. Support Vector Machines (SVM)
3. Neural Network

Once each of these models has been trained, their resulting classification probabilities undergo a consensus vote to determine the <!---INSET HERE--->.
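
One possible way to combine the three classifiers is scikit-learn's `VotingClassifier` with soft voting, the mechanism referenced in [^5]. The sketch below is not the project's implementation: it uses synthetic data from `make_classification` in place of the POC observations, and the hyperparameters and equal weights are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the one-minute observations (not the POC data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the class probabilities of the three classifiers,
# weighted by `weights`, and picks the class with the highest average.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                              random_state=0)),
    ],
    voting="soft",
    weights=[1, 1, 1],  # assumed equal weights
)

# Standardize inside the pipeline so the scaler is fitted on training data only.
model = make_pipeline(StandardScaler(), ensemble)
model.fit(X_train, y_train)

probabilities = model.predict_proba(X_test)  # one row of class probabilities per observation
predictions = model.predict(X_test)
print("held-out accuracy:", model.score(X_test, y_test))
```

Placing the scaler and the ensemble in one pipeline means the same fitted pre-processing is reused whenever the model is applied to new observations.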

#### Final Model
The last stage of the Training Pipeline produces an optimized classification model that can be applied to new data.

### Testing/Production Pipeline
Like the Training Pipeline, the Production/Testing Pipeline also comprises three stages.

#### Feature Extraction
Unlike the first stage of the Training Pipeline, the data from the heart rate monitor is not separated into two data sets. Rather, the signal data is only pre-processed: standardized and scaled.

#### Observation Segmentation
The pre-processed data is then split into one-minute segments based on the time stamp of the data. Each one-minute segment is treated as a single observation of the test individual's stress level at the given time.
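
The segmentation step could look like the following sketch. It assumes each sample arrives as a `(timestamp_in_seconds, value)` pair measured from the start of the recording; that format, the helper name and the sample values are hypothetical, since the proposal does not specify the time stamp layout.

```python
from collections import defaultdict


def segment_by_minute(samples):
    """Group (timestamp_seconds, value) samples into one-minute segments.

    Returns a dict mapping the minute index (timestamp // 60) to the list of
    values that fall within that minute, in arrival order.
    """
    segments = defaultdict(list)
    for timestamp, value in samples:
        segments[int(timestamp // 60)].append(value)
    return dict(segments)


# Made-up samples: seconds since the start of the recording, signal value.
samples = [(0.0, 71), (30.5, 73), (59.9, 72), (60.0, 90), (119.0, 94), (125.0, 88)]
segments = segment_by_minute(samples)
print(segments)
```

Each value in the resulting dictionary corresponds to one observation that can then be passed to the classifier.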

#### Classification
The final model from the Training Pipeline is then executed against each one-minute observation segment to classify whether the test subject is stressed or relaxed.

Based on this final classification, additional future actions can be implemented that are currently outside the scope of this project. <!---See the section on Further Research--->

## Solution 
Once created, the pipeline will be used to test and deploy the models on a sample of unseen data from the test subjects and hence predict their stress levels. The objective of this project is to re-apply the resulting pipeline to a set of new test subjects and, hopefully, provide a viable prototype that can preemptively warn of potential heart attacks.




[^1]: (http://www.heart.org/idc/groups/heart-public/@wcm/@sop/@smd/documents/downloadable/ucm_449846.pdf)
[^2]: (http://www.heart.org/HEARTORG/HealthyLiving/PhysicalActivity/FitnessBasics/Target-Heart-Rates_UCM_434341_Article.jsp#.WHEiXbGZNE4)
[^3]: (https://www.nhlbi.nih.gov/health/health-topics/topics/rehab)
[^4]: (https://en.wikipedia.org/wiki/Photoplethysmogram)
[^5]: (http://scikit-learn.org/stable/modules/ensemble.html#weighted-average-probabilities-soft-voting)
[^6]: (https://www.dell.com/en-us/work/learn/internet-of-things-labs)
[^7]: (http://www.myithlete.com/what-is-hrv/)
[^8]: (https://www.fitbit.com/charge2)



<!---TODO:
1. Research Classification Accuracy; Confusion MAtrix; and ROC curves to be used.
2. Instead of doing what the code does, evaluate the following for the consensus vote:
[Implementing a Weighted Majority Rule Ensemble Classifier in sklearn](http://sebastianraschka.com/Articles/2014_ensemble_classifier.html)
[Weighted Average Probabilities (Soft Voting)](http://scikit-learn.org/stable/modules/ensemble.html#weighted-average-probabilities-soft-voting)
[EnsembleVoteClassifier](http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/)
--->

<!--
Further Research:
Factor in targeted heart for test case ages. < http://www.heart.org/HEARTORG/HealthyLiving/PhysicalActivity/FitnessBasics/Target-Heart-Rates_UCM_434341_Article.jsp>
Should the stress level be determined outside of target Age rate of the test case, then automate alerting to emergency medical services.
Extend Steps 1 and 2 into a Lambda  Function and take the pipeline into production with a RESTful interface that can be leveraged by wearable/heart rate monitors.
--->