![Be Fast Awareness](https://healthaware.com/wp-content/uploads/2023/02/be-fast-stroke-awareness@2x.png)

# Capstone Two: Predicting Stroke Risk

## Problem Statement

*Stroke is a serious medical condition that can have devastating consequences.
In the United States, stroke ranks as the fifth leading cause of death and a primary source of long-term disability. According to the American Brain Foundation, every year, over 800,000 individuals experience a new or recurrent stroke, and unfortunately, at least 140,000 succumb to this disease. Can we develop a highly accurate predictive model for stroke occurrence, utilizing a comprehensive set of patient data and advanced machine learning techniques?*

## Data Wrangling

For this project, I utilized the **‘healthcare-dataset-stroke-data.csv’** dataset from Kaggle. The dataset comprises twelve columns, several of which contained missing data.

* id
* gender
* age
* hypertension
* heart_disease
* ever_married
* work_type
* Residence_type
* avg_glucose_level
* bmi
* smoking_status
* stroke

To clean up the data, I removed the **‘id’** column, as it served solely as a unique identifier for each patient and was not relevant to the analysis. Additionally, I eliminated any instances with missing values to ensure data integrity and consistency.

### Problems Encountered:
* The `age` column contained a few inconsistently formatted values
* The `bmi` column had missing values
* The `work_type`, `Residence_type`, and `smoking_status` columns each contained more than two categories

### Problem Solution:
* Applied `.astype(float)` so that every value in the `age` column shares the same numeric type
* Iterated over the values in the `gender` and `ever_married` columns
* Applied `get_dummies` to the multi-category columns to create new binary columns
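
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact project code: the four-row DataFrame is made-up stand-in data, and the 0/1 mappings for `gender` and `ever_married` are an assumption about how those columns were encoded.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows with the same column names as the Kaggle file.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67, 1.32, 80, 49],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "work_type": ["Private", "children", "Self-employed", "Private"],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban"],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "smoking_status": ["formerly smoked", "Unknown", "never smoked", "smokes"],
    "stroke": [1, 0, 1, 0],
})

# Drop the identifier column and any row with a missing value.
df = raw.drop(columns=["id"]).dropna().copy()

# Make sure every age value is stored as a float.
df["age"] = df["age"].astype(float)

# Assumed binary encodings; any category not listed here would become NaN.
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})

# One binary column per category for the multi-category columns.
df = pd.get_dummies(df, columns=["work_type", "Residence_type", "smoking_status"])
print(df.columns.tolist())
```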

## Exploratory Data Analysis
In this phase, we delved into the data to uncover hidden connections between variables and stroke. 


![Stroke Percentage](Stroke1.png)

Analysis of the **‘healthcare-dataset-stroke-data.csv’** dataset shows a stroke prevalence of 4.3%; the vast majority of patients (95.7%) did not suffer a stroke.
After cleaning the data, we focused on the 209 patients who had experienced a stroke.
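
As a quick sanity check, the percentages can be reproduced from counts alone. The sketch below assumes the cleaned dataset has 4909 rows (the sum of the training and testing set sizes reported under Preprocessing Results) and rebuilds a stand-in `stroke` column from the 209 stroke cases.

```python
import pandas as pd

# Counts taken from this write-up: 209 stroke cases, 3927 + 982 = 4909 rows in total.
n_stroke = 209
n_total = 3927 + 982

# Stand-in for the real 'stroke' column, rebuilt from the counts above.
stroke = pd.Series([1] * n_stroke + [0] * (n_total - n_stroke), name="stroke")

# Share of each class, expressed as a percentage.
prevalence_pct = stroke.value_counts(normalize=True) * 100
print(prevalence_pct.round(1))
```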

![Correlation Between Stroke and Age](Stroke2.png)![Correlation1](Stroke3.png)![Correlation2](Stroke4.png)![Correlation3](Stroke5.png)

To explore relationships in the dataset, we used kernel density estimate (KDE) plots and categorical plots, which help visualize how the variables relate to stroke.

![Correlation4](Stroke6.png)

To visualize relationships across the dataset, we used a pair plot. This analysis revealed several key findings:
* The age distribution is skewed toward the older age groups, indicating a higher concentration of older individuals. There seems to be a positive correlation between age and stroke, as the scatterplot shows a general upward trend.
* The scatterplots between hypertension, heart disease, and stroke suggest potential correlations, especially between hypertension and stroke.
* There seems to be a positive correlation between average glucose level and stroke; however, it is less pronounced than the relationship between age and stroke.
* BMI and stroke show a less definitive relationship, with a more scattered distribution.

**To check the patterns suggested by the pair plot, a series of chi-squared tests was used to analyze the association between the 'stroke' variable and several other variables (hypertension, heart_disease, ever_married, avg_glucose_level, bmi, and age).**

![Correlation5](Stroke7.png)

* Based on these chi-squared tests, hypertension, heart_disease, ever_married, and age are significantly associated with stroke: their chi-squared statistics are relatively high and their p-values are very low.
* The chi-squared statistic indicates only a weak association between BMI and stroke.
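
A single chi-squared test of this kind can be sketched with `scipy.stats.chi2_contingency`. The counts below are made up for illustration and are not the project's results; continuous variables such as age would need to be binned into categories before being tested this way.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data: 80 patients without hypertension, 20 with.
df = pd.DataFrame({
    "hypertension": [0] * 80 + [1] * 20,
    "stroke": [0] * 76 + [1] * 4 + [0] * 12 + [1] * 8,
})

# Contingency table of observed counts (rows: hypertension, columns: stroke).
table = pd.crosstab(df["hypertension"], df["stroke"])

# For a 2x2 table, SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.5f}, dof={dof}")
```

A large statistic with a small p-value points to an association between the two variables; it does not by itself say how strong or how clinically important that association is.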

**To control for the increased chance of false positives when running several chi-squared tests, a *Bonferroni correction* was applied to the results.**

![Correlation6](Stroke8.png)
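
The Bonferroni correction itself is simple arithmetic: the significance level is divided by the number of tests. The sketch below uses hypothetical p-values chosen only to show the mechanics; they are not the values from this analysis.

```python
# Hypothetical p-values for illustration only; not results from this project.
p_values = {
    "hypertension": 1e-6,
    "heart_disease": 2e-5,
    "ever_married": 3e-4,
    "avg_glucose_level": 0.004,
    "bmi": 0.03,
    "age": 1e-9,
}

alpha = 0.05
n_tests = len(p_values)

# Each individual test must now clear a stricter threshold.
adjusted_alpha = alpha / n_tests

significant = {name: p < adjusted_alpha for name, p in p_values.items()}
for name, is_significant in significant.items():
    print(f"{name}: p={p_values[name]:.2g}, significant={is_significant}")
```

With six tests the threshold drops from 0.05 to about 0.0083, so a p-value such as 0.03 that would pass an uncorrected test no longer counts as significant.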

## Preprocessing and Training Data

**To develop a robust machine learning model, we need to preprocess the data to ensure its quality and consistency. Then, by training the model on this prepared data, we can equip it to learn meaningful patterns and make accurate stroke predictions that generalize well to new, unseen data.**

To do so, we identified categorical columns for encoding. The following columns were used for the preprocessing:
* 'hypertension'
* 'heart_disease'
* 'ever_married'
* 'avg_glucose_level_bin'
* 'bmi_bin'
* 'age_bin'
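
The `_bin` columns suggest that the continuous variables were grouped into categories before encoding. One way to do this is with `pd.cut`, sketched below; the bin edges, labels, and sample values are illustrative assumptions, not the ones used in the project.

```python
import pandas as pd

# Made-up sample values for the three continuous columns.
df = pd.DataFrame({
    "age": [5, 34, 52, 71],
    "avg_glucose_level": [85.0, 110.5, 150.2, 228.7],
    "bmi": [17.0, 23.5, 28.1, 36.6],
})

# Bin edges and labels are assumptions chosen for illustration.
df["age_bin"] = pd.cut(
    df["age"], bins=[0, 18, 40, 60, 120],
    labels=["child", "young_adult", "middle_aged", "senior"],
)
df["bmi_bin"] = pd.cut(
    df["bmi"], bins=[0, 18.5, 25, 30, 100],
    labels=["underweight", "normal", "overweight", "obese"],
)
df["avg_glucose_level_bin"] = pd.cut(
    df["avg_glucose_level"], bins=[0, 100, 140, 200, 500],
    labels=["low", "medium", "high", "very_high"],
)

# One-hot encode the binned columns so a model can consume them.
encoded = pd.get_dummies(df[["age_bin", "bmi_bin", "avg_glucose_level_bin"]])
print(encoded.columns.tolist())
```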

### Preprocessing Results
![Prep and Train](Stroke9.png)

The output shapes confirm that the dataset has been split correctly and is well-prepared for training a machine learning model.
* The training set has **3927** samples, which gives the model enough data to learn the underlying patterns in the dataset.
* The testing set has **982** samples, a reasonable size for evaluating meaningful performance metrics.
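
These sizes are consistent with an 80/20 split of 4909 rows. A minimal sketch with scikit-learn's `train_test_split` is shown below; the feature matrix is random placeholder data, and the use of `stratify` and the `random_state` value are assumptions rather than confirmed project settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 3927 + 982  # total rows implied by the reported split sizes

# Placeholder features and a label vector with 209 positive cases.
rng = np.random.default_rng(0)
X = rng.normal(size=(n_rows, 6))
y = np.zeros(n_rows, dtype=int)
y[:209] = 1

# stratify=y keeps the stroke rate similar in both splits (an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (3927, 6) (982, 6)
```

Stratifying matters for a rare outcome like stroke: without it, a random split could leave the test set with too few positive cases to evaluate recall reliably.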


## Modeling
**To determine which models to use to accurately predict the risk of stroke, I chose to test four types of models:**
* Logistic Regression
* Decision Tree Classifier
* Random Forest Classifier
* Gradient Boosting Classifier

While Logistic Regression exhibited the highest overall accuracy among the models evaluated, it had difficulty predicting certain classes, as evidenced by scikit-learn's ***'UndefinedMetricWarning'***. This warning appears when a metric such as precision cannot be computed for a class, typically because the model never predicts that class. It suggests that the model was struggling to capture the nuances of these specific classes, potentially due to factors such as class imbalance or insufficient data. Further investigation and adjustments to the model or data preprocessing were needed to address this issue and improve performance for all classes. Because the warning persisted, I applied the Synthetic Minority Over-sampling Technique (SMOTE).
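
To show the idea behind SMOTE, here is a simplified, hand-written sketch: each synthetic row is placed at a random point on the line between a minority-class row and one of its nearest minority-class neighbours. This is not the code used in the project; the function name `smote_like_oversample` and the data are hypothetical, and in practice the `SMOTE` class from the `imbalanced-learn` package is the usual choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each row is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    # Pick a random minority row, then one of its k true neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]
    # Random position along the segment between the two rows.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])


# Placeholder minority-class features (20 rows, 3 numeric columns).
rng = np.random.default_rng(1)
X_minority = rng.normal(loc=2.0, size=(20, 3))

synthetic = smote_like_oversample(X_minority, n_new=80)
print(synthetic.shape)
```

Oversampling of this kind is normally applied to the training split only, so that the test set still reflects the real class balance.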

### Results for Each Model
![Model](Stroke10.png)![Model2](Stroke11.png)

The bar graph indicates that Logistic Regression showed a significantly **lower accuracy (0.7424)** in predicting strokes than the other models.

### HOWEVER
![Model3](Stroke12.png)

**To select the optimal model, we compare the performance metrics across all models, including accuracy, precision, recall, F1-score, and ROC AUC. Looking at all of these together gives a more reliable basis for choosing the best model than accuracy alone.**
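
A comparison table like this can be produced with a short loop over the four model types. The sketch below runs on synthetic imbalanced data from `make_classification` with mostly default hyperparameters, so its numbers will not match the results reported here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 10% positives (not the stroke dataset).
X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_prob),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```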



### Deep Dive into Each Model
* **Random Forest and Gradient Boosting** have the highest accuracy, indicating that they perform well overall.
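* **Gradient Boosting** has a precision of 1.000000, which would normally mean that every positive prediction it makes is correct. *However,* its recall is zero, meaning that it does not identify any positive cases; a perfect precision alongside zero recall suggests the model made no positive predictions at all.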
* **Logistic Regression** has the highest recall, meaning that it identifies a large portion of actual positive cases.
* **Logistic Regression** also has the best F1 score which indicates a better balance between precision and recall.
* **Logistic Regression** ALSO has a decent ROC AUC score, which indicates good discrimination ability.
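
A precision of 1.0 combined with a recall of zero can look contradictory. One possible explanation, which is an assumption and not confirmed by the project code, is that the model never predicted the positive class, so precision was undefined and reported as 1.0 through scikit-learn's `zero_division` setting. The toy labels below show that behaviour.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: two real positives, but the model never predicts the positive class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]

# Recall is well defined: 0 of the 2 real positives were found.
recall = recall_score(y_true, y_pred)

# Precision is 0 / 0 here, so its reported value depends on zero_division.
precision_as_one = precision_score(y_true, y_pred, zero_division=1)
precision_as_zero = precision_score(y_true, y_pred, zero_division=0)

print(recall, precision_as_one, precision_as_zero)
```

If this is what happened, the 1.0 carries no information about the quality of the model's positive predictions, which is another reason to weigh recall and F1 more heavily here.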

Therefore, to reduce the number of false negatives and capture as many true stroke cases as possible, **Logistic Regression** is the better choice.