# Assessing Cardiovascular Disease Risk Utilizing the K-Nearest Neighbors Algorithm

_CardioPredict harnesses the power of kNN Models to analyze key health indicators and provide a predictive model for assessing the risk of cardiovascular disease in individuals._

Authors: (listed alphabetically)

Gross, Sandy - UBC-MDS

Ma, He - UBC-MDS

Wang, Doris - UBC-MDS

Wu, Joey - UBC-MDS

In [1]:
import pandas as pd
import pickle
from myst_nb import glue

In [2]:
confusion_df = pd.read_csv("../results/tables/knn_test_confusion_matrix.csv", index_col=0)
confusion_df.index.name = None
confusion_df.columns = ['Predicted non-CVD', 'Predicted CVD']
confusion_df.index = ['Actual non-CVD', 'Actual CVD']

true_negatives = confusion_df.loc['Actual non-CVD', 'Predicted non-CVD']
true_positives = confusion_df.loc['Actual CVD', 'Predicted CVD']
false_negatives = confusion_df.loc['Actual CVD', 'Predicted non-CVD']
false_positives = confusion_df.loc['Actual non-CVD', 'Predicted CVD']

glue("true_pos", true_positives, display=False)
glue("true_neg", true_negatives, display=False)
glue("false_neg", false_negatives, display=False)
glue("false_pos", false_positives, display=False)

glue("confusion_df", confusion_df, display=False)

In [3]:
test_scores_df = pd.read_csv("../results/tables/knn_test_data_classification_report.csv").round(3)
test_scores_df.set_index('Unnamed: 0', inplace=True)
test_scores_df.index = test_scores_df.index.map({'0': 'non-CVD', '1': 'CVD', 'accuracy': 'accuracy', 'macro avg': 'macro avg', 'weighted avg': 'weighted avg'})
test_scores_df.index.name = None

glue("non_cvd_recall", test_scores_df.loc["non-CVD", "recall"], display=False)
glue("cvd_recall", test_scores_df.loc["CVD", "recall"], display=False)
glue("non_cvd_case", test_scores_df.loc["non-CVD", "support"], display=False)
glue("cvd_case", test_scores_df.loc["CVD", "support"], display=False)

glue("accuracy", test_scores_df.loc["accuracy", "support"], display=False)
glue("total_case", test_scores_df.loc["weighted avg", "support"], display=False)
test_scores_df = test_scores_df.style.format('{:.3f}') \
                             .set_caption("Classification Report") \
                             .set_table_styles([{
                                 'selector': 'caption',
                                 'props': [('color', 'black'), ('font-size', '16px')]
                             }])
glue("test_scores_df", test_scores_df, display=False)

## Abstract

Cardiovascular disease (CVD) ranks as a foremost global health challenge, driving an urgent need for refined predictive methodologies. Leveraging data from the Framingham Heart Study, this research examines clinical, demographic, and lifestyle factors influencing CVD risk. Central to the study is a hyperparameter-optimized k-Nearest Neighbors (kNN) algorithm, augmented by an oversampling strategy to enhance sensitivity under class imbalance. Despite modest levels of accuracy ({glue:text}`accuracy`) and recall ({glue:text}`cvd_recall`), the analysis underscores the significance of cholesterol levels and smoking habits as substantial contributors to cardiovascular disease risk, alongside established factors such as age and systolic blood pressure. These findings motivate further exploration of the interactions among risk factors, with the aim of improving the precision and practical application of predictive models in combating CVD.

## Introduction
Heart disease remains the leading cause of death worldwide, with an estimated 17.9 million lives lost annually {cite}`whr2023`. The global fatalities from cardiovascular disease (CVD) have risen from about 12.1 million in 1990 to 18.6 million in 2019 {cite}`whf2023`. Identifying high-risk individuals and providing timely treatment is critical to reducing premature mortality from CVDs.

The assertion that up to 80 percent of premature heart attacks and strokes are preventable highlights the critical role of quality data in formulating effective health policies {cite}`whf2023`. The report also highlights the importance of data-driven approaches in predicting cardiovascular disease (CVD) risk. Leveraging data from the famous Framingham Heart Study {cite}`fhs`, our project examines the potential of machine learning classification methods to assess heart disease risk and identify the risk factors. The dataset contains 1,363 records and includes variables such as age, gender, blood pressure, cholesterol levels, and smoking habits, analyzed to predict the likelihood of developing heart disease. Given that many cardiovascular conditions present no initial symptoms and are preventable through lifestyle modifications {cite}`who`, an algorithm capable of accurately detecting at-risk individuals can be pivotal. Early detection is paramount to providing early interventions, which may include lifestyle counseling and proactive medication management, thereby improving patient outcomes.

## Data

The CardioPredict project uses data from the Framingham Heart Study (FHS), a pivotal cohort study initiated by the US Public Health Service in 1948 in Framingham, Massachusetts {cite}`nih`. The FHS, which celebrated its seventieth anniversary in 2018, is the longest-running cardiovascular epidemiological cohort study in the USA {cite}`Andersson2019`. It has been widely used for research and has contributed to the understanding of risk factors for cardiovascular diseases (CVD) such as hypertension, dyslipidemia, smoking, and diabetes {cite}`Tsao201`. It has also facilitated the development and implementation of effective treatments for these conditions. The FHS has been instrumental in investigating the role of lifestyle habits, social networks, and genetics in CVD. Additionally, the study has provided valuable data for genomic research, including genome-wide single-nucleotide polymorphism (SNP) data and other 'omics' data types such as DNA methylation and gene expression. The FHS continues to be relevant today in addressing challenges such as physical inactivity, obesity, and diabetes, and remains an important institution for training CVD epidemiologists and researchers {cite}`Andersson2021`.

The FHS is also widely used by the National Institutes of Health (NIH) as a source of data for educational purposes in the medical field {cite}`nih2021`.
Our dataset, derived from this study and provided by Professor Paul Blanche {cite}`data`, encompasses a comprehensive collection of health metrics from 1,363 individuals, meticulously recorded and analyzed to gauge the risk of developing heart disease.

### Data Description

Key variables in this dataset include age, sex, systolic and diastolic blood pressure (SBP and DBP), cholesterol levels, Framingham relative weight (FRW), and smoking habits (measured as cigarettes per day). These variables offer a multifaceted view into each individual's health status, providing a robust foundation for our predictive analysis. For example, 'sex' is categorized as Female or Male, 'AGE' represents the age of individuals in years, 'FRW' indicates the Framingham relative weight percentage at baseline (ranging from 52 to 222), 'SBP' and 'DBP' measure blood pressure in mmHg, 'CHOL' shows cholesterol levels in mg/100ml, and 'CIG' quantifies cigarette consumption per day. The 'disease' variable, which serves as our target variable, marks the occurrence of coronary heart disease during the study, coded as 1 for occurrence and 0 otherwise.


| Variable | Explanation |
|----------|-------------|
| sex      | sex (Female/Male) |
| AGE      | Age in years |
| FRW      | "Framingham relative weight" (pct.) at baseline (52-222) |
| SBP      | systolic blood pressure at baseline mmHg (90-300) |
| DBP      | diastolic blood pressure at baseline mmHg (50-160) |
| CHOL     | cholesterol at baseline mg/100ml (96-430) |
| CIG      | cigarettes per day at baseline (0-60) |
| disease  | 1 if coronary heart disease occurred during the follow-up, 0 otherwise |

### Data Visualization

As shown in {numref}`Figure {number} <distribution_of_disease_occurrence>`, there are 1,095 individuals without heart disease and 268 individuals with heart disease, indicating a substantial class imbalance in favor of non-disease cases.

```{figure} ../results/figures/distribution_of_disease_occurrence.png
---
width: 400px
name: distribution_of_disease_occurrence
---
Distribution of Disease Occurrence
```

The distributions in {numref}`Figure {number} <age_and_health_indicators_exhibit_elevated_heart_disease>` indicate that several variables differ between individuals with heart disease (1) and those without (0), most clearly SBP, DBP, and CHOL. Individuals with heart disease tend to have higher systolic blood pressure (SBP), diastolic blood pressure (DBP), and cholesterol (CHOL) levels, while age (AGE), relative weight (FRW), and cigarette consumption (CIG) do not show as pronounced a difference between the two groups. Most of the features, except for CIG and AGE, are roughly normally distributed.

```{figure} ../results/figures/age_and_health_indicators_exhibit_elevated_heart_disease.png
---
width: 800px
name: age_and_health_indicators_exhibit_elevated_heart_disease
---
Age and Health Indicators Exhibit Elevated Heart Disease
```

As shown in {numref}`Figure {number} <correlation_matrix_of_the_features>`, systolic and diastolic blood pressure (SBP and DBP) are strongly positively correlated. Additionally, there are notable positive correlations between SBP and the occurrence of disease, between DBP and disease, and between age and disease, while cigarette smoking (CIG) is negatively correlated with Framingham relative weight (FRW) and cholesterol levels (CHOL).

```{figure} ../results/figures/correlation_matrix_of_the_features.png
---
width: 600px
name: correlation_matrix_of_the_features
---
Correlation Matrix of the Numerical Features 
```

In {numref}`Figure {number} <pairwise_scatter_plot_matrix>`, systolic and diastolic blood pressure (SBP and DBP) show a strong linear relationship. Additionally, the color differentiation indicates potential trends between these variables and the presence of heart disease, with some variables such as SBP exhibiting clusters that may correspond to higher incidence of the disease.

```{figure} ../results/figures/pairwise_scatter_plot_matrix.png
---
width: 800px
name: pairwise_scatter_plot_matrix
---
Pairwise Scatter Plot Matrix
```

{numref}`Figure {number} <distribution_of_the_sex_variable>` shows that the two sexes are approximately equally represented, with a similar number of disease cases in each group.

```{figure} ../results/figures/distribution_of_the_sex_variable.png
---
width: 400px
name: distribution_of_the_sex_variable
---
Distribution of the sex variable
```

The outliers in the boxplot as shown in {numref}`Figure {number} <boxplot_of_specified_numerical_features>` for the features 'AGE', 'FRW', 'SBP', 'DBP', 'CHOL', and 'CIG' could be indicative of significant health-related phenomena or critical cases that warrant further medical investigation. For instance, extreme values in 'SBP' and 'DBP' may reflect cases of hypertension or hypotension, which are clinically relevant, while variations in 'CHOL' and 'CIG' might highlight individuals with high cholesterol levels or heavy smoking habits, both of which are key factors in assessing cardiovascular risk. Therefore, retaining these outliers is essential for a comprehensive analysis, as they contribute to the understanding of the full range of health profiles within the population studied.

```{figure} ../results/figures/boxplot_of_specified_numerical_features.png
---
width: 600px
name: boxplot_of_specified_numerical_features
---
Boxplot of Numerical Features
```

## Methodology

### Model Specification and Rationale

The selection of k-Nearest Neighbors (kNN) for the prediction of cardiovascular disease (CVD) is underpinned by its well-recognized efficacy in clinical diagnostics. kNN's simplicity and its inherent capacity to discern intricate, non-linear relationships make it particularly suited for medical datasets, where such complexities are the norm. It has gained traction within the medical research community for its successful application across various diagnostic challenges. {cite}`Lahmiri2023` utilizes convolutional neural networks (CNN) for automatic feature extraction from MRI images. Receiver operating characteristic (ROC) based feature ranking and selection is employed to reduce the dimensionality of the CNN-based features, and the kNN classifier is then optimized using a Bayesian optimization (BO) algorithm. The proposed system achieves high classification performance and shows promise for efficient Alzheimer's disease (AD) diagnosis. {cite}`Cherif2018` optimizes the performance of the kNN algorithm for breast cancer diagnosis by clustering class instances, selecting significant attributes, and weighting similarities with reliability coefficients. The results demonstrate that this approach outperforms other classification techniques, achieving an average f-measure exceeding 94% on the breast cancer dataset.

Recent scholarly work has showcased the integration of kNN with advanced computational techniques to amplify its diagnostic capabilities in CVD. {cite}`jabbar2013` proposes a new algorithm that combines the kNN method with a genetic algorithm for the classification of heart disease. The kNN algorithm is utilized for pattern recognition based on the class of the nearest neighbor, while the genetic algorithm performs a global search to find optimal solutions. The experimental results demonstrate that the proposed algorithm enhances the accuracy in diagnosing heart disease. The study of {cite}`Ali2021` aimed to predict heart disease using data mining techniques and used a heart disease dataset with 14 attributes to compare different classification algorithms. The results showed that the kNN, random forest (RF), and decision tree (DT) algorithms achieved 100% accuracy in predicting heart disease.


### Evaluation Metrics and Model Performance

Given the critical nature of accurately identifying at-risk patients, the study prioritizes recall as the primary evaluation metric. In CVD prediction, the cost of false negatives – overlooking individuals who are actually at risk – is exceedingly high, potentially leading to missed opportunities for early intervention. Therefore, recall, which measures the model's ability to detect all positive cases, emerges as the most pertinent metric for this study.
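To make the metric concrete, the sketch below computes recall, precision, and accuracy from confusion-matrix counts. The function names and the counts are hypothetical and chosen only for illustration; they are not the study's results.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positive cases that were detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of predicted positive cases that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for an imbalanced test set (50 CVD, 150 non-CVD).
tp, fn, fp, tn = 30, 20, 40, 110
print(f"recall={recall(tp, fn):.3f}")
print(f"precision={precision(tp, fp):.3f}")
print(f"accuracy={accuracy(tp, tn, fp, fn):.3f}")

# A classifier that always predicts non-CVD on the same data:
# accuracy stays high, yet every CVD case is missed.
print(f"always-negative accuracy={accuracy(0, 150, 0, 50):.3f}")
print(f"always-negative recall={recall(0, 50):.3f}")
```

The always-negative case shows why accuracy alone can be misleading under class imbalance, which is the motivation for prioritizing recall here.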

### Data Preprocessing and Feature Handling

A thorough preprocessing protocol is adopted to ensure the data's suitability for analysis. Missing values in numerical attributes are addressed through median imputation, a decision driven by the skewed nature of certain variables, notably cigarette consumption (CIG); the median is less sensitive to skew than the mean, so imputed values distort these distributions less. Additionally, because the numerical features span very different scales, they are standardized using standard scaling. This ensures equitable representation of each feature in the model, preventing any single variable from exerting undue influence on kNN's distance calculations due to scale disparities. Furthermore, the study applies one-hot encoding to the binary variable sex, with female as the reference group.
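The preprocessing steps described above could be expressed with scikit-learn roughly as follows. This is a sketch under assumptions, not the project's actual code: the column names and the `Female`/`Male` labels follow the variable table, the object names (`numeric_steps`, `preprocessor`, `toy`) are hypothetical, and the four-row data frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["AGE", "FRW", "SBP", "DBP", "CHOL", "CIG"]
categorical_features = ["sex"]

# Median imputation followed by standard scaling for numerical columns.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_features),
        # Dropping "Female" makes it the reference group (assumed label).
        ("cat", OneHotEncoder(drop=["Female"]), categorical_features),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Made-up rows with missing values, for illustration only.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "AGE": [45, 52, 60, 48],
    "FRW": [100, 120, np.nan, 95],
    "SBP": [120, 150, 170, 130],
    "DBP": [80, 95, 100, 85],
    "CHOL": [220, 260, 300, 240],
    "CIG": [0, 20, np.nan, 0],
})

X = preprocessor.fit_transform(toy)
print(X.shape)  # six scaled numerical columns plus one indicator for Male
```

In a full workflow the transformer would be fitted on the training split only and then applied to the test split, so that imputation medians and scaling statistics do not leak information from held-out data.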

### Hyperparameter Optimization and Over-Sampling Approach

For the k-Nearest Neighbors (kNN) model, this study optimizes the number of neighbors (n_neighbors). A cross-validation approach was applied across a range of 1 to 20 neighbors, in increments of 5, to determine the ideal balance between underfitting and overfitting, thus ascertaining the most effective neighbor count for accurate classification. To address class imbalance in the kNN model, we implemented Random Over Sampling (ROS), ensuring an equitable class distribution in the training set. This approach significantly enhanced the model’s ability to detect the minority class, a prevalent challenge in medical datasets with disproportionate class sizes.
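A minimal sketch of this search is given below, under stated assumptions: synthetic data from `make_classification` stands in for the Framingham features, the helpers `random_oversample` and `cv_recall` are hypothetical names, and 5 folds are used only to keep the example fast. Over-sampling is applied to each training fold only, so duplicated minority rows never appear in the validation fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]


def cv_recall(X, y, k, n_splits=5, oversample=True, seed=0):
    """Mean validation recall of kNN with k neighbors across stratified folds."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if oversample:
            # Balance the training fold only; the validation fold is untouched.
            X_tr, y_tr = random_oversample(X_tr, y_tr, rng)
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(recall_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))


# Synthetic, imbalanced stand-in data (about 20% positives); not the study data.
X, y = make_classification(n_samples=600, n_features=6, weights=[0.8, 0.2],
                           random_state=0)

for k in range(1, 20, 5):  # candidate neighbor counts: 1, 6, 11, 16
    without = cv_recall(X, y, k, oversample=False)
    with_ros = cv_recall(X, y, k, oversample=True)
    print(f"k={k:2d}  recall without ROS={without:.3f}  with ROS={with_ros:.3f}")
```

On real data the scaling and imputation steps would also be fitted inside each fold, for the same leakage reason that motivates over-sampling the training fold only.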

### Estimation Strategy

Utilizing the scikit-learn Python package {cite}`Pedregosa2012`, this research benefits from its extensive suite of machine learning tools. This package not only provides efficient implementations of Logistic Regression and kNN but also encompasses a broad spectrum of data pre-processing techniques and metrics for comprehensive model evaluation, all encapsulated within an intuitive framework conducive to streamlined model development and assessment.

## Results

In the quest to refine cardiovascular disease prediction models, this analysis reveals a nuanced balance between accuracy ({numref}`Figure {number} <accuracy_lines>`) and recall ({numref}`Figure {number} <recall_lines>`), particularly when comparing k-Nearest Neighbors (kNN) with and without oversampling. The accuracy plot shows a steadier, more uniform performance with oversampling, maintaining a high accuracy across the range of neighbors considered. In contrast, the kNN model without oversampling shows a gradual increase in accuracy as the number of neighbors grows, hinting at the model's sensitivity to the neighbor parameter.

However, this study's focus pivots towards recall, a metric of paramount importance in medical diagnostics where missing a true case (false negative) can have dire consequences. The recall plot vividly illustrates the stark enhancement in the recall due to oversampling, particularly in lower neighbor counts. The without-oversampling curve plummets, indicating a high rate of false negatives, whereas the with-oversampling curve consistently captures a higher proportion of positive cases. This enhancement is critical, as it suggests oversampling equips kNN to better identify at-risk patients, making it the preferred approach despite potential trade-offs in accuracy. Thus, the kNN with oversampling, optimized for recall, emerges as a crucial strategy for early and accurate identification of cardiovascular diseases, aligning with the medical imperative to prioritize patient outcomes over model precision.

The selection of $k=9$ as the optimal number of neighbors for the kNN model with over-sampling was guided by a careful balance between model complexity and the ability to generalize. A lower $k$ value, while potentially offering higher flexibility and sensitivity to the training data, runs the risk of overfitting, as it may capture noise and lead to an overly complex model. Conversely, a higher $k$ tends to smooth the decision boundary, potentially leading to underfitting. At $k=9$, the model achieves a favorable trade-off, maintaining a satisfactory level of recall that is crucial for medical diagnosis, while mitigating overfitting and reducing computational cost. This equilibrium also addresses the bias-variance trade-off, providing a generalization that can be expected to perform well on unseen data.

```{figure} ../results/figures/accuracy_lines.png
---
width: 800px
name: accuracy_lines
---
Accuracy from 20-fold cross-validation to choose $k$.
```

```{figure} ../results/figures/recall_lines.png
---
width: 800px
name: recall_lines
---
Recall from 20-fold cross-validation to choose $k$.
```

Delving deeper into the predictive model, a radar chart was constructed using permutation importance to visualize the relative importance of various clinical features in predicting CVD ({numref}`Figure {number} <radar_feature_importance>`). Although kNN does not inherently provide feature importance scores, permutation importance can be applied after the model is trained to evaluate how much it relies on each feature. Permutation importance works by randomly shuffling a single feature in the validation set and measuring how much the shuffling affects the accuracy of the model. If the model's performance drops significantly when a feature's values are shuffled, the model relied on that feature for its predictions, and the feature is therefore important.
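The shuffling procedure can be sketched in a few lines. The function `permutation_importance_manual` is a hypothetical helper written for illustration, and the two-feature synthetic data (one informative feature, one pure-noise feature) is an assumption, not the study data. scikit-learn also ships a packaged version as `sklearn.inspection.permutation_importance`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def permutation_importance_manual(model, X_val, y_val, n_repeats=10, seed=0):
    """Mean drop in validation accuracy after shuffling each feature in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(model.predict(X_val) == y_val)
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            # Shuffle one column, breaking its link with the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(model.predict(X_perm) == y_val))
        importances[j] = np.mean(drops)
    return importances


# Synthetic data: the label depends on feature 0 only; feature 1 is noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

model = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)
importances = permutation_importance_manual(model, X_val, y_val)
print(dict(zip(["informative", "noise"], importances.round(3))))
```

Shuffling the informative feature should reduce accuracy far more than shuffling the noise feature, which is the signal the radar chart summarizes for the clinical variables.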

This graphical representation underscores the pivotal role of certain attributes, with AGE and SBP (systolic blood pressure) emerging as particularly influential. The radar chart illustrates a landscape of feature importance, revealing the multifaceted nature of heart disease risk factors. It is noteworthy that, while AGE and SBP are established risk factors, our model also identifies other variables such as CHOL (cholesterol levels) and CIG (cigarette consumption) as significant contributors to the predictive prowess of the kNN model.

The insights gleaned from our analysis carry profound implications for clinical practice. The data-driven identification of key risk factors not only corroborates established medical knowledge but also provides a quantifiable measure of their impact. This enhanced understanding facilitates a more nuanced risk assessment, potentially leading to tailored interventions that reflect an individual's unique risk profile.

```{figure} ../results/figures/radar_feature_importance.png
---
width: 600px
name: radar_feature_importance
---
Feature importance from kNN model with over-sampling.
```

The confusion matrix shows that out of the cases predicted, {glue:text}`true_neg` true negatives (correctly predicted non-CVD cases) and {glue:text}`true_pos` true positives (correctly predicted CVD cases) were observed. However, there were also {glue:text}`false_neg` false negatives (CVD cases incorrectly predicted as non-CVD) and {glue:text}`false_pos` false positives (non-CVD cases incorrectly predicted as CVD), indicating potential areas for improving the model's precision and recall.

```{figure} ../results/figures/knn_test_data_confusion_matrix.png
---
width: 600px
name: knn_test_data_confusion_matrix
---
Confusion matrix of the kNN model with over-sampling on the test data.
```

The classification report ({numref}`Figure {number} <test_scores_df>`) provides insights into the model's performance on the test set, which contains {glue:text}`total_case` cases in total. The support for non-CVD cases is {glue:text}`non_cvd_case`, meaning that there were {glue:text}`non_cvd_case` instances of non-CVD in the test data that the model attempted to classify. In contrast, there were {glue:text}`cvd_case` instances of CVD, reflecting a smaller subset of the data. The classifier achieves a recall of {glue:text}`cvd_recall` for CVD cases, meaning that approximately 55.2% of the {glue:text}`cvd_case` actual CVD cases were correctly identified. While the precision for non-CVD cases is higher, it is the recall metric that is of paramount concern in medical diagnostics, to ensure at-risk patients are not overlooked. The overall accuracy of the model is {glue:text}`accuracy`, which, while indicative of a moderate predictive ability, underscores the necessity of optimizing the recall for CVD patients due to the high stakes involved in early and accurate diagnosis. It is crucial to enhance the model's sensitivity to CVD cases to improve patient outcomes and healthcare delivery.

```{glue:figure} test_scores_df
:figwidth: 400px
:name: "test_scores_df"

Classification report from prediction of kNN (k=9) model with oversampling on the test data.
```

## Discussion

### Contribution

This study has significant implications for both clinical practice and public health policy, primarily in enhancing early detection of cardiovascular disease (CVD) risk and enabling more tailored healthcare approaches.

Clinical Applications: The findings offer a critical tool for healthcare providers in identifying individuals with elevated CVD risk. Early detection through predictive modeling facilitates timely interventions and personalized treatment strategies, potentially improving patient outcomes. The ability to profile individual risk factors paves the way for more customized patient care, aligning treatment plans with specific patient needs.

Public Health Policy and Prevention: This study delivers actionable insights that could refine preventative strategies aimed at populations predisposed to CVD. Notably, the kNN model underscores the significance of cholesterol levels and smoking habits alongside traditional markers like age and systolic blood pressure, pointing to a broader spectrum of modifiable risk factors. This can help policymakers design targeted interventions and allocate resources judiciously, thereby enhancing the efficacy of public health campaigns. Moreover, the implementation of such predictive models in healthcare systems could direct preventative efforts toward individuals with the highest risk, improving overall health outcomes at the population level.

### Limitations

This research, while offering valuable insights into the predictive modeling of cardiovascular disease (CVD) risk, encounters several limitations that warrant acknowledgment and consideration for future work. The study achieved modest levels of accuracy and recall in the predictive modeling. This limitation is partly attributable to the constrained size and scope of the dataset utilized. With just over 1000 observations sourced from publicly available data primarily intended for teaching and practice, the model's training and validation might not comprehensively encapsulate the complex, multifaceted nature of CVD risk factors. Larger, more diverse datasets could potentially enhance the model's predictive power. 

While the Framingham Heart Study is a rich source of clinical data, its span of over 70 years introduces certain challenges. Medical practices, diagnostic criteria, and patient demographics have evolved considerably over this period. Therefore, applying findings from this historical dataset to contemporary clinical scenarios may have inherent limitations. Ideally, employing machine learning techniques on real-time, real-world clinical data could yield more nuanced and current insights. 

The research primarily revolves around the kNN algorithm, supplemented by oversampling to address class imbalance. Although kNN is a robust and effective method, relying exclusively on it might overlook the potential benefits of other predictive models. Machine learning in healthcare is a rapidly advancing field, and integrating a variety of models, such as random forests, support vector machines, or deep learning approaches, could offer a more comprehensive understanding and improved predictive accuracy.

The generalizability of the study's findings might be limited due to the specific demographic and clinical characteristics of the Framingham study cohort. The cohort, primarily from a specific geographical location and demographic, may not accurately represent the global population's diverse spectrum. Future studies should aim to include more heterogeneous populations to enhance the applicability and relevance of the findings.

These limitations highlight the need for ongoing research and iterative model refinement. Future studies should aim to incorporate larger, more diverse datasets, explore a wider range of predictive models, and continually adapt to the evolving landscape of cardiovascular health and risk factors.

### Future Work

The imperative to mitigate false negatives in cardiovascular disease diagnostics cannot be overstated, particularly given the dire consequences of delayed or missed treatment opportunities. In response, this study has concentrated on refining the k-Nearest Neighbors (kNN) model—a method with a rich history of application in medical diagnostics—augmented with an oversampling technique aimed at bolstering the recall metric. Despite these efforts, the accuracy and recall rates achieved through this model may not fully align with the elevated benchmarks typically anticipated of medical diagnostic tools.

The modest accuracy and recall observed underscore a need for further methodological refinements. This includes a comprehensive assessment of the model within clinical contexts, where the balance between sensitivity (recall, the true positive rate) and specificity (the true negative rate) must be carefully managed against the backdrop of the costs associated with false positives and false negatives. Such costs, both in economic and human terms, necessitate a nuanced approach to optimize the model's performance in actual clinical practice.

Furthermore, while age and systolic blood pressure are well-documented risk factors, our analysis illuminates the substantial roles that cholesterol levels and smoking habits play in the model's predictive capacity. These findings suggest potential avenues for future research to delve into the intricate web of causative factors contributing to CVD, thereby enhancing the model's diagnostic precision.

In light of these considerations, future work will focus on advancing the model through multidimensional strategies. This includes integrating a broader array of clinical data, exploring advanced algorithmic enhancements, and applying cost-sensitive learning frameworks to calibrate the model more finely to the realities of clinical decision-making. It is through such endeavors that we can aspire to reach the zenith of predictive accuracy, ultimately translating into saved lives and improved patient outcomes.

## References


```{bibliography}
```