## CardioPredict: Assessing Heart Disease Risk
_CardioPredict harnesses the power of kNN Models to analyze key health indicators and provide a predictive model for assessing the risk of coronary heart disease in individuals._

by Joel Wu, Sandra Gross, He Ma and Doris Wang 



In [None]:
import pandas as pd
import pickle
from myst_nb import glue

In [None]:
glue(...)

In [None]:
glue(...)

In [None]:
with open('../results/models/preprocessor.pickle', 'rb') as f:
    cardio_fit = pickle.load(f)

## Summary
This analysis examines the development and evaluation of two models, Logistic Regression and k-Nearest Neighbours (kNN), for predicting the risk of cardiovascular disease (CVD) using the Framingham Heart Study dataset. The Logistic Regression model, tuned through hyperparameter search, reached a modest test accuracy of 55.3% but a high recall of 84% for class 1 (disease present). Its precision was low, however, indicating a high rate of false positives. The kNN model, selected by cross-validation, initially showed a higher accuracy of approximately 79% but a very low recall of 5.17%, meaning that it missed most positive cases. After oversampling was applied to address class imbalance, the kNN model's recall rose to 51.7%, while its accuracy fell to 63.7%. We conclude that Logistic Regression is preferable because of its higher recall, but both models show limited accuracy. Further investigation of the dataset's features and more refined preprocessing are recommended to improve model performance.
This study could encourage collaboration between data scientists, healthcare professionals, and epidemiologists to develop more robust models for disease prediction.

## Introduction
Heart disease remains the leading cause of death worldwide, with an estimated 17.9 million lives lost annually {cite}`whr2023`. Global deaths from cardiovascular disease (CVD) rose from about 12.1 million in 1990 to 18.6 million in 2019 {cite}`whf2023`. Identifying high-risk individuals and providing timely treatment is critical to reducing premature mortality from CVD.

The assertion that up to 80 percent of premature heart attacks and strokes are preventable highlights the critical role of quality data in formulating effective health policies {cite}`whf2023`. The report also highlights the importance of data-driven approaches to predicting CVD risk. Leveraging data from the Framingham Heart Study {cite}`fhs`, our project examines the potential of machine learning classification methods to assess heart disease risk and identify risk factors. The dataset contains 1,363 records and includes variables such as age, gender, blood pressure, cholesterol levels, and smoking habits, which we use to predict the likelihood of developing heart disease. Because many cardiovascular conditions present no initial symptoms and are preventable through lifestyle modifications {cite}`who`, an algorithm capable of accurately detecting at-risk individuals can be pivotal. Early detection enables early interventions, such as lifestyle counseling and proactive medication management, and thereby improves patient outcomes.

## Data

The CardioPredict project uses data from the Framingham Heart Study (FHS), a pivotal cohort study initiated by the US Public Health Service in 1948 in Framingham, Massachusetts {cite}`nih`. The FHS, which celebrated its seventieth anniversary in 2018, is the longest-running cardiovascular epidemiological cohort study in the USA {cite}`Andersson2019`. It has been widely used for research and has contributed to the understanding of risk factors for cardiovascular disease (CVD) such as hypertension, dyslipidemia, smoking, and diabetes {cite}`Tsao201`. It has also facilitated the development and implementation of effective treatments for these conditions. The FHS has been instrumental in investigating the role of lifestyle habits, social networks, and genetics in CVD. In addition, the study has provided valuable data for genomic research, including genome-wide single-nucleotide polymorphism (SNP) data and other 'omics' data types such as DNA methylation and gene expression. The FHS remains relevant today in addressing challenges such as physical inactivity, obesity, and diabetes, and it continues to be an important institution for training CVD epidemiologists and researchers {cite}`Andersson2021`.

The FHS is also widely used by the National Institutes of Health (NIH) as a source of data for educational purposes in the medical field {cite}`nih2021`.
Our dataset, derived from this study and provided by Professor Paul Blanche {cite}`data`, comprises health metrics from 1,363 individuals, recorded to gauge the risk of developing heart disease.

### Data Description

Key variables in this dataset include age, gender, systolic and diastolic blood pressure (SBP and DBP), cholesterol levels, Framingham relative weight (FRW), and smoking habits (measured as cigarettes per day). Together these variables describe each individual's health status and form the basis of our predictive analysis. Specifically, 'sex' is categorized as Female or Male, 'AGE' is the age of the individual in years, 'FRW' is the Framingham relative weight percentage at baseline (ranging from 52 to 222), 'SBP' and 'DBP' measure blood pressure in mmHg, 'CHOL' gives cholesterol levels in mg/100ml, and 'CIG' counts cigarettes smoked per day. The 'disease' variable, which serves as our target variable, marks the occurrence of coronary heart disease during the study: 1 for occurrence and 0 otherwise.


| Variable | Explanation |
|----------|-------------|
| sex      | sex (Female/Male) |
| AGE      | Age in years |
| FRW      | "Framingham relative weight" (pct.) at baseline (52-222) |
| SBP      | systolic blood pressure at baseline mmHg (90-300) |
| DBP      | diastolic blood pressure at baseline mmHg (50-160) |
| CHOL     | cholesterol at baseline mg/100ml (96-430) |
| CIG      | cigarettes per day at baseline (0-60) |
| disease  | 1 if coronary heart disease occurred during the follow-up, 0 otherwise |

### Data Visualization

As shown in {numref}`distribution_of_disease_occurrence`, there are 1,095 individuals without heart disease and 268 individuals with heart disease, so non-disease cases dominate the sample and the classes are highly imbalanced.

```{figure} ../results/figures/distribution_of_disease_occurrence.png
---
width: 400px
name: distribution_of_disease_occurrence
---
Distribution of Disease Occurrence
```
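The class counts above also set a reference point for reading accuracy figures. The short sketch below uses only the two counts reported in the text and computes what a trivial classifier that always predicts "no disease" would score. It is an illustration of why accuracy alone can mislead on imbalanced data, not part of the project's pipeline.

```python
# Class counts as reported in the text above.
n_no_disease = 1095
n_disease = 268
n_total = n_no_disease + n_disease

# A classifier that always predicts "no disease" is correct for every
# negative case and wrong for every positive case.
baseline_accuracy = n_no_disease / n_total
baseline_recall = 0 / n_disease  # it never flags a positive case

print(f"total records:           {n_total}")
print(f"share of positive cases: {n_disease / n_total:.3f}")
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
print(f"majority-class recall:   {baseline_recall:.3f}")
```

Any model whose accuracy is close to this baseline while its recall stays near zero has learned little about the minority class, which is the motivation for looking beyond accuracy.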

The distributions in {numref}`age_and_health_indicators_exhibit_elevated_heart_disease` show that SBP, DBP, and CHOL differ most clearly between individuals with heart disease (1) and those without (0): individuals with heart disease tend to have higher systolic blood pressure, diastolic blood pressure, and cholesterol levels. Age, relative weight (FRW), and cigarette consumption (CIG) do not show as pronounced a difference between the two groups. Most of the features, except for CIG and AGE, are roughly normally distributed.

```{figure} ../results/figures/age_and_health_indicators_exhibit_elevated_heart_disease.png
---
width: 800px
name: age_and_health_indicators_exhibit_elevated_heart_disease
---
Age and Health Indicators Exhibit Elevated Heart Disease
```

As shown in {numref}`correlation_matrix_of_the_features`, systolic and diastolic blood pressure (SBP and DBP) are strongly positively correlated. There is also a notable positive correlation between disease occurrence and each of SBP, DBP, and age, while cigarette smoking (CIG) is negatively correlated with Framingham relative weight (FRW) and cholesterol levels (CHOL).

```{figure} ../results/figures/correlation_matrix_of_the_features.png
---
width: 600px
name: correlation_matrix_of_the_features
---
Correlation Matrix of the Numerical Features 
```

In {numref}`pairwise_scatter_plot_matrix`, systolic and diastolic blood pressure (SBP and DBP) show a strong linear relationship. The color coding also suggests trends between these variables and the presence of heart disease, with some variables such as SBP showing clusters that may correspond to a higher incidence of the disease.

```{figure} ../results/figures/pairwise_scatter_plot_matrix.png
---
width: 800px
name: pairwise_scatter_plot_matrix
---
Pairwise Scatter Plot Matrix
```

In {numref}`distribution_of_the_sex_variable`, the two sexes appear in approximately equal numbers, with a similar number of disease cases in each group.

```{figure} ../results/figures/distribution_of_the_sex_variable.png
---
width: 400px
name: distribution_of_the_sex_variable
---
Distribution of the sex variable
```

The outliers in the boxplot (as shown in {numref}`boxplot_of_specified_numerical_features`) for the features 'AGE', 'FRW', 'SBP', 'DBP', 'CHOL', and 'CIG' could be indicative of significant health-related phenomena or critical cases that warrant further medical investigation. For instance, extreme values in 'SBP' and 'DBP' may reflect cases of hypertension or hypotension, which are clinically relevant, while variations in 'CHOL' and 'CIG' might highlight individuals with high cholesterol levels or heavy smoking habits, both of which are key factors in assessing cardiovascular risk. Therefore, retaining these outliers is essential for a comprehensive analysis, as they contribute to the understanding of the full range of health profiles within the population studied.

```{figure} ../results/figures/boxplot_of_specified_numerical_features.png
---
width: 600px
name: boxplot_of_specified_numerical_features
---
Boxplot of Numerical Features
```

## Methodology

### Model Specification and Rationale

The selection of k-Nearest Neighbors (kNN) for predicting cardiovascular disease (CVD) rests on its established use in clinical diagnostics. kNN is simple and can capture intricate, non-linear relationships, which makes it well suited to medical datasets where such complexity is common. It has been applied successfully across a range of diagnostic problems. {cite}`Lahmiri2023` uses convolutional neural networks (CNN) for automatic feature extraction from MRI images, applies receiver operating characteristic (ROC) based feature ranking and selection to reduce the dimensionality of the CNN-based features, and then tunes a kNN classifier with a Bayesian optimization (BO) algorithm. The proposed system achieves high classification performance and shows promise for efficient Alzheimer's disease (AD) diagnosis. {cite}`Cherif2018` optimizes the performance of kNN for breast cancer diagnosis by clustering class instances, selecting significant attributes, and weighting similarities with reliability coefficients. The results show that this approach outperforms other classification techniques, achieving an average F-measure exceeding 94% on the breast cancer dataset.

Recent work has combined kNN with other computational techniques to improve its diagnostic performance for CVD. {cite}`jabbar2013` proposes an algorithm that combines kNN with a genetic algorithm for the classification of heart disease. kNN is used for pattern recognition based on the class of the nearest neighbor, while the genetic algorithm performs a global search for optimal solutions. The experimental results show that the proposed algorithm improves accuracy in diagnosing heart disease. {cite}`Ali2021` aimed to predict heart disease using data mining techniques and compared different classification algorithms on a heart disease dataset with 14 attributes. The kNN, random forest (RF), and decision tree (DT) algorithms were reported to achieve 100% accuracy in predicting heart disease.


### Evaluation Metrics and Model Performance

Given the critical nature of accurately identifying at-risk patients, the study prioritizes recall as the primary evaluation metric. In CVD prediction, the cost of false negatives – overlooking individuals who are actually at risk – is exceedingly high, potentially leading to missed opportunities for early intervention. Therefore, recall, which measures the model's ability to detect all positive cases, emerges as the most pertinent metric for this study.
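To make the metric concrete, recall is TP / (TP + FN), the share of truly positive cases that the model flags. The sketch below computes recall, precision, and accuracy from confusion-matrix counts. The counts are made up for illustration and are not results from this study, and the helper names are hypothetical.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that are detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for 100 patients, 20 of whom have the disease.
tp, fn = 12, 8    # diseased patients flagged / missed
fp, tn = 25, 55   # healthy patients flagged / correctly cleared

print(f"recall:    {recall(tp, fn):.2f}")
print(f"precision: {precision(tp, fp):.2f}")
print(f"accuracy:  {accuracy(tp, tn, fp, fn):.2f}")
```

In this toy case each missed patient (a false negative) lowers recall directly, whereas false alarms only affect precision and accuracy, which is why recall matches the stated priority of not overlooking at-risk individuals.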

### Data Preprocessing and Feature Handling

A thorough preprocessing protocol is adopted to ensure the data's suitability for analysis. Missing values in numerical attributes are filled by median imputation, a choice driven by the skewed distribution of certain variables, notably cigarette consumption (CIG), because the median is less sensitive to skew than the mean. To account for the different scales of the numerical features, the dataset is then normalized using standard scaling, which prevents any single variable from exerting undue influence because of its scale. This matters for kNN in particular, since its distance computations depend on feature scale. Finally, the binary variable 'sex' is one-hot encoded with female as the reference group.
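The steps described above can be sketched with scikit-learn transformers. This is a minimal illustration under stated assumptions, not necessarily the preprocessor saved by the project: the column names follow the variable table, the toy data frame is invented, and `drop="if_binary"` is one way to obtain a single indicator column with Female (the alphabetically first category) as the reference group.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["AGE", "FRW", "SBP", "DBP", "CHOL", "CIG"]
binary_features = ["sex"]

# Median imputation followed by standard scaling for numeric columns.
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)

# For a two-category column, drop="if_binary" keeps one 0/1 column and
# drops the first category in sorted order ("Female"), making it the reference.
binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("bin", binary_transformer, binary_features),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Invented toy data with a few missing values, for illustration only.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "AGE": [45, 52, 60, 49],
    "FRW": [98.0, 110.0, np.nan, 105.0],
    "SBP": [120, 150, 170, 130],
    "DBP": [80, 95, 100, 85],
    "CHOL": [210.0, 250.0, 300.0, np.nan],
    "CIG": [0.0, 20.0, np.nan, 0.0],
})

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)      # six scaled numeric columns plus one sex indicator
print(transformed[:, -1])     # indicator is 1 for Male, 0 for Female
```

In practice the preprocessor would be fitted on the training split only and then applied to the test split, so that imputation medians and scaling statistics do not leak information from held-out data.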

### Hyperparameter Optimization and Oversampling Approach

For the k-Nearest Neighbors (kNN) model, this study tunes the number of neighbors (n_neighbors). Cross-validation was applied across a range of 1 to 20 neighbors, in increments of 5, to balance underfitting against overfitting and identify the most effective neighbor count. To address class imbalance in the kNN model, we applied Random Over Sampling (ROS), which yields a balanced class distribution in the training set. This approach substantially improved the model's ability to detect the minority class, a common challenge in medical datasets with disproportionate class sizes.
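Random oversampling itself is simple enough to sketch in plain Python. The function below is a hypothetical stand-in for a library implementation: it duplicates randomly chosen minority-class rows, with replacement, until every class matches the largest one. The toy rows and labels are invented, and the grid shows the neighbor counts implied by "1 to 20 in increments of 5".

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == cls]
        # Sample with replacement; the majority class needs zero extra rows.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Candidate values for n_neighbors: start at 1, step by 5, stay within 20.
k_grid = list(range(1, 21, 5))

# Invented training fold: five negatives and one positive (SBP, DBP pairs).
train_rows = [[130, 80], [150, 95], [120, 70], [170, 100], [125, 82], [160, 98]]
train_labels = [0, 0, 0, 0, 1, 0]

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(k_grid)
print(Counter(train_labels), "->", Counter(balanced_labels))
```

When oversampling is combined with cross-validation, it is generally applied to each training fold only and never to the validation fold. Otherwise duplicated rows can appear on both sides of the split and inflate the scores.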

### Estimating Strategy

This research uses the scikit-learn Python package {cite}`Pedregosa2012`, which provides efficient implementations of Logistic Regression and kNN together with a broad range of data preprocessing techniques and evaluation metrics, all within a consistent framework for model development and assessment.

## Results

This analysis reveals a trade-off between accuracy ({numref}`accuracy_lines`) and recall ({numref}`recall_lines`) when comparing k-Nearest Neighbors (kNN) with and without oversampling. The accuracy plot shows steadier, more uniform performance with oversampling, maintaining a high accuracy across the range of neighbors considered. In contrast, the kNN model without oversampling shows a gradual increase in accuracy as the number of neighbors grows, which points to the model's sensitivity to the neighbor parameter.

However, this study focuses on recall, a metric of paramount importance in medical diagnostics, where missing a true case (a false negative) can have dire consequences. The recall plot shows a stark improvement in recall from oversampling, particularly at lower neighbor counts. The curve without oversampling drops sharply, indicating a high rate of false negatives, whereas the curve with oversampling consistently captures a higher proportion of positive cases. This improvement is critical: it suggests that oversampling equips kNN to better identify at-risk patients, making it the preferred approach despite potential trade-offs in accuracy. kNN with oversampling, tuned for recall, is therefore our chosen strategy for early identification of cardiovascular disease, in line with the medical imperative to prioritize patient outcomes over model precision.

```{figure} ../results/figures/accuracy_lines.png
---
width: 800px
name: accuracy_lines
---
Accuracy from 20-fold cross-validation to choose K.
```

```{figure} ../results/figures/recall_lines.png
---
width: 800px
name: recall_lines
---
Recall from 20-fold cross-validation to choose K.
```

```{figure} ../results/figures/radar_feature_importance.png
---
width: 800px
name: radar_feature_importance
---
Feature importance from kNN model with over-sampling.
```

To examine the predictive model more closely, we constructed a radar chart of the relative importance of the clinical features in predicting CVD ({numref}`radar_feature_importance`). The chart highlights AGE and SBP (systolic blood pressure) as particularly influential. It also shows that heart disease risk depends on several factors at once: while AGE and SBP are established risk factors, our model also identifies CHOL (cholesterol levels) and CIG (cigarette consumption) as significant contributors to the predictive performance of the kNN model.

### Implications for Clinical Practice

The insights gleaned from our analysis carry profound implications for clinical practice. The data-driven identification of key risk factors not only corroborates established medical knowledge but also provides a quantifiable measure of their impact. This enhanced understanding facilitates a more nuanced risk assessment, potentially leading to tailored interventions that reflect an individual's unique risk profile.


## Discussion

### Contribution

This study has significant implications for both clinical practice and public health policy, primarily in enhancing early detection of cardiovascular disease (CVD) risk and enabling more tailored healthcare approaches.

Clinical Applications:

The findings offer a critical tool for healthcare providers in identifying individuals with elevated CVD risk. Early detection through predictive modeling facilitates timely interventions and personalized treatment strategies, potentially improving patient outcomes. The ability to profile individual risk factors paves the way for more customized patient care, aligning treatment plans with specific patient needs.

Public Health Policy and Prevention:

From a policy perspective, the analysis informs targeted preventive strategies for populations at higher CVD risk. The insights can guide healthcare policy-makers in developing focused interventions to address modifiable risk factors, optimizing resource allocation in public health. By incorporating such predictive models, healthcare systems can prioritize screenings and preventive measures for high-risk groups, ultimately contributing to more efficient and effective public health initiatives.

### Future Work

In medical diagnostics, especially for critical conditions like coronary heart disease, the priority is often to minimize false negatives so that patients do not miss critical treatment. In our analysis we therefore tried different measures to increase recall. However, the relatively moderate accuracy and recall levels (53.1% for logistic regression and 64% for kNN after oversampling) may be lower than what one would expect of medical diagnostic tools. This could result from the imbalance in class distribution and needs to be evaluated within the specific application to determine whether such low accuracy is acceptable. Such an evaluation should also contain an estimate of the costs incurred by false positives and false negatives, along with an approach to minimize these costs.


## References


```{bibliography}
```