## Introduction

Cardiovascular disease remains the leading cause of death worldwide, claiming an estimated 17.9 million lives each year [(World Health Organization. (2023))](https://www.who.int/health-topics/cardiovascular-diseases). Global deaths from cardiovascular diseases (CVDs) rose from about 12.1 million in 1990 to 18.6 million in 2019 [(World Heart Federation. (2023))](https://world-heart-federation.org/wp-content/uploads/World-Heart-Report-2023.pdf). Identifying high-risk individuals and providing timely treatment is critical to reducing premature mortality from CVDs.

Professor Fausto Pinto states, "Up to 80 percent of premature heart attacks and strokes are preventable... Quality data plays a pivotal role in shaping effective policies. The opportunity to expedite efforts in reducing premature mortality from non-communicable diseases (NCDs) by one-third by 2030 remains attainable." [(World Heart Federation. (2023). Deaths from cardiovascular disease surged 60% globally over the last 30 years, report)](https://world-heart-federation.org/news/deaths-from-cardiovascular-disease-surged-60-globally-over-the-last-30-years-report/)

This emphasis on quality data motivates data-driven approaches to predicting cardiovascular disease risk. Using data from the well-known Framingham Heart Study, our project examines whether machine learning classification methods can assess heart disease risk and identify its main risk factors. [The dataset](https://paulblanche.com/files/DataFramingham.html) contains 1,363 records and includes variables such as age, gender, blood pressure, cholesterol levels, and smoking habits, which we use to predict the likelihood of developing heart disease. Because many cardiovascular conditions present no initial symptoms and are preventable through lifestyle modifications [(World Health Organization. (2023))](https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)), an algorithm that accurately detects at-risk individuals could be pivotal. Early detection enables early intervention, such as lifestyle counseling and proactive medication management, which can improve patient outcomes.

## Discussion
### Summary of Findings
1. **Logistic Regression:**
   - The model was adjusted to prioritize minimizing false negatives, which is crucial in medical diagnostics.
   - The regularization parameter (C) was tuned using grid search to optimize the model.
   - The optimized logistic regression model achieved an accuracy of 53.1%.
   - It exhibited high recall but low precision: it identified most true cases of coronary heart disease (CHD) but also produced a high rate of false positives.

2. **k-Nearest Neighbors (k-NN):**
   - The initial k-NN model reached an accuracy of about 79% but had very low recall, meaning it missed many true cases (a large number of false negatives).
   - To address the issue of class imbalance in the dataset, oversampling was applied.
   - After oversampling, the recall of the model improved notably, indicating fewer false negatives. However, this improvement came at the cost of reduced overall accuracy (64%).
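
The oversampling step mentioned for k-NN can be sketched as follows. This is a minimal illustration of random oversampling using only the Python standard library; the function name `random_oversample` and the toy data are hypothetical and are not taken from the project's code, which may have used a different oversampling method or library.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Sample with replacement to make up the shortfall.
        for _ in range(target - n):
            j = rng.choice(idx)
            out_rows.append(rows[j])
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data (not from the Framingham dataset):
# 8 negatives (0) and 2 positives (1), one feature per row.
X = [[age] for age in (40, 45, 50, 52, 55, 58, 60, 63, 66, 70)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 8 rows
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.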

### Discussion on Expectations
1. **Expectation of High Recall in Medical Diagnostics:**
   - In medical diagnostics, especially for critical conditions like coronary heart disease, the priority is often to minimize false negatives (missed diagnoses). The study's emphasis on high recall for the logistic regression model aligns with this expectation.

2. **Trade-off Between Recall and Precision:**
   - The trade-off between recall and precision, particularly in the logistic regression model, is a well-known challenge. In medical contexts, this trade-off is crucial and often skewed towards maximizing recall, as observed in the study.
   - The decrease in overall accuracy of the k-NN model after oversampling to improve recall is also a common outcome in machine learning, especially when addressing imbalanced datasets.

3. **Overall Accuracy and Model Performance:**
   - The accuracy levels (53.1% for logistic regression and 64% for k-NN after oversampling) are lower than one would expect of a medical diagnostic tool. This may partly reflect the complexity of predicting diseases like CHD, which are influenced by a multitude of factors.
   - Difficulty in achieving high accuracy is common when predicting complex health outcomes, especially with traditional models like logistic regression and k-NN.
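
The trade-off between recall, precision, and accuracy discussed above can be illustrated with a small worked example. The sketch below uses hypothetical labels and predicted scores (not results from this study) and shows how lowering the decision threshold raises recall while precision and accuracy fall. The helper names `confusion` and `metrics` are illustrative only.

```python
def confusion(y_true, y_score, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical imbalanced data: 8 negatives, 2 positives, with made-up scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.60, 0.50, 0.80]

for threshold in (0.7, 0.5, 0.3):
    acc, prec, rec = metrics(*confusion(y_true, y_score, threshold))
    print(f"threshold={threshold:.1f} accuracy={acc:.2f} "
          f"precision={prec:.2f} recall={rec:.2f}")
# threshold=0.7 accuracy=0.90 precision=1.00 recall=0.50
# threshold=0.5 accuracy=0.80 precision=0.50 recall=1.00
# threshold=0.3 accuracy=0.40 precision=0.25 recall=1.00
```

In this toy case the strictest threshold gives the highest accuracy but misses half of the positive cases, which is the pattern the initial k-NN model showed; the most lenient threshold catches every positive case at the cost of many false positives, similar to the recall-oriented logistic regression.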
   
### Impact of Findings
1. **Medical Practice and Patient Care:**
   - Improved Risk Stratification: The models can help in identifying individuals at higher risk of CHD, enabling early intervention and potentially more effective treatment strategies.
   - Precision Medicine: Tailoring patient care based on individual risk profiles, informed by these predictive models, could lead to more personalized and effective healthcare.

2. **Healthcare Policy and Preventive Strategies:**
   - Policymaking: These findings can inform healthcare policies focused on preventive measures for high-risk groups, potentially leading to public health initiatives that target modifiable risk factors for CHD.
   - Resource Allocation: Healthcare systems could use such models to allocate resources more effectively, prioritizing high-risk populations for screenings and interventions.

3. **Research and Development:**
   - Cross-disciplinary Collaboration: This study could encourage collaborations between data scientists, healthcare professionals, and epidemiologists to develop more robust models for disease prediction.

### Future Questions
1. **Feature Relevance:** Which features most strongly predict coronary heart disease? Is there a need for additional or different types of data?
2. **Model Improvement:** How can the models be further refined for better accuracy and recall? Would more complex models like neural networks offer improvements?
3. **Generalizability:** How well do these models perform on diverse populations beyond the Framingham study?
4. **Longitudinal Analysis:** Could incorporating time-series data from patient histories improve predictive accuracy?