# Stroke Prediction

## Context

Strokes are a significant health concern worldwide, often leading to severe consequences including long-term disability or death. Predicting the likelihood of a stroke can play a crucial role in early intervention and treatment, potentially saving lives and improving patient outcomes.

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This project uses a dataset to predict whether a patient is likely to suffer a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the dataset provides relevant information about a patient.

## Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data

## Data Dictionary

- **id**: Unique identifier for each patient. It contains numeric data.
- **gender**: Gender of the patient. It contains categorical data. (**"Male", "Female", or "Other"**)
- **age**: Age of the patient. It contains numeric data.
- **hypertension**: Whether the patient has hypertension. It contains binary data: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension.
- **heart_disease**: Whether the patient has heart disease. It contains binary data: 0 if the patient doesn't have heart disease, 1 if the patient has heart disease.
- **ever_married**: Whether the patient has ever been married. It contains categorical data. (**"No" or "Yes"**)
- **work_type**: Type of work of the patient. It contains categorical data. (**"children", "Govt_job", "Never_worked", "Private", or "Self-employed"**)
- **Residence_type**: Type of residence of the patient. It contains categorical data. (**"Rural" or "Urban"**)
- **avg_glucose_level**: Average glucose level in blood. It contains numeric data.
- **bmi**: Body mass index of the patient. It contains numeric data.
- **smoking_status**: Status of smoking habit of the patient. It contains categorical data. (**"formerly smoked", "never smoked", "smokes", or "Unknown"**)
- **stroke**: The output feature. 1 if the patient had a stroke, 0 if not.

**Note:** "Unknown" in `smoking_status` means that the information is unavailable for this patient.
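
The processed file loaded later in this notebook has columns such as `gender_Male` and `smoking_status_smokes` instead of the raw categorical fields. The sketch below shows one way such columns could be produced with `pd.get_dummies`. It is an assumption about the preprocessing, not a record of it: the toy rows are invented and only reuse the category values listed in the data dictionary.

```python
import pandas as pd

# Toy rows using the category values from the data dictionary (not real patients)
raw = pd.DataFrame({
    "age": [67.0, 80.0, 49.0, 12.0, 30.0],
    "gender": ["Male", "Female", "Other", "Female", "Male"],
    "work_type": ["Private", "Govt_job", "Never_worked", "children", "Self-employed"],
    "smoking_status": ["formerly smoked", "never smoked", "smokes", "Unknown", "never smoked"],
})

# One-hot encode the categorical fields; drop_first=True removes the first
# (alphabetically sorted) level of each field so it acts as the reference level
encoded = pd.get_dummies(
    raw,
    columns=["gender", "work_type", "smoking_status"],
    drop_first=True,
    dtype=int,
)

print(list(encoded.columns))
```

With `drop_first=True`, "Female", "Govt_job", and "Unknown" become the implicit reference levels, so a row with zeros in every `smoking_status_*` column corresponds to "Unknown".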

## Problem Statement

1. **Model Training**: Train a model on the data so that it can predict the risk of stroke.
2. **Model Evaluation**: Evaluate the model's performance using accuracy, precision, recall, and F1 score.


### Load Libraries

In [111]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model & Evaluation metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Optimization
from sklearn.model_selection import GridSearchCV

### Settings

In [93]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "stroke_d_e.csv")

### Load Data

In [94]:
df = pd.read_csv(csv_path)

In [95]:
# Check data
df.head()

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,stroke,gender_Male,gender_Other,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,1,1,228.69,36.6,1,1,0,0,1,0,0,1,0,0
1,80.0,0,1,1,0,105.92,32.5,1,1,0,0,1,0,0,0,1,0
2,49.0,0,0,1,1,171.23,34.4,1,0,0,0,1,0,0,0,0,1
3,74.0,1,1,1,0,70.09,27.4,1,1,0,0,1,0,0,0,1,0
4,69.0,0,0,0,1,94.39,22.8,1,0,0,0,1,0,0,0,1,0
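
The `Unnamed: 0` column in the preview above is the name pandas gives to a header-less first column, which typically appears when a DataFrame was saved with its row index. Assuming that is what happened here, the column carries row positions rather than patient information, and it may be worth keeping it out of the feature set. A minimal sketch on an in-memory toy CSV (the values are illustrative only):

```python
import io

import pandas as pd

# Toy frame written the default way: to_csv() also writes the row index
toy = pd.DataFrame({"age": [67.0, 80.0], "stroke": [1, 1]})
csv_text = toy.to_csv()

# Reading it back naively turns the saved index into a regular column
with_index_col = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores it as the index, so it never becomes a feature
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_col.columns))
print(list(clean.columns))
```

An equivalent option after loading is `df.drop(columns="Unnamed: 0")`.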


### Data Preprocessing

In [96]:
# Separate the input and output features
X = df.drop("stroke", axis=1)
y = df["stroke"]

In [97]:
# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
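
When the positive class is rare, a purely random split can leave the train and test sets with different shares of positives. Passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below uses synthetic labels with 5% positives, an arbitrary figure chosen for illustration and not the stroke rate in this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 50 of them positive (5%)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo preserves the 5% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Train positives: {y_tr.sum()} of {len(y_tr)}")
print(f"Test positives: {y_te.sum()} of {len(y_te)}")
```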

In [98]:
# Standardize the data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

### Model Training and Evaluation

In [105]:
# Train the model and evaluate metrics
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict with train and test data
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Evaluate Train data
    print("=" * 60)
    print("TRAINING METRICS")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred): .2f}")
    print(f"Precision: {precision_score(y_train, y_train_pred): .2f}")
    print(f"Recall: {recall_score(y_train, y_train_pred): .2f}")
    print(f"F1: {f1_score(y_train, y_train_pred): .2f}")

    # Evaluate Test data
    print("=" * 60)
    print("TEST METRICS")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred): .2f}")
    print(f"Precision: {precision_score(y_test, y_test_pred): .2f}")
    print(f"Recall: {recall_score(y_test, y_test_pred): .2f}")
    print(f"F1: {f1_score(y_test, y_test_pred): .2f}")

In [106]:
# Random Forest
rf = RandomForestClassifier()
train_evaluate(rf)

TRAINING METRICS
Accuracy:  1.00
Precision:  1.00
Recall:  1.00
F1:  1.00
TEST METRICS
Accuracy:  0.96
Precision:  0.00
Recall:  0.00
F1:  0.00


### Conclusion

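- After dropping the rows with missing BMI values and the outlier rows, the trained model reaches **96%** accuracy on the test set. Its test precision and recall are both 0.00, however, so the model identifies none of the stroke cases in the test set, and accuracy alone overstates its usefulness.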
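
High accuracy combined with zero precision and recall is the pattern a classifier produces when it never predicts the positive class on imbalanced data. The sketch below reproduces that pattern on synthetic labels; the 4% positive rate is an assumption picked to match the 0.96 figure, not a measured property of the dataset.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Synthetic labels: 4 positives and 96 negatives
y_true = np.array([1] * 4 + [0] * 96)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 makes the undefined precision (no predicted positives) an explicit 0
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}")
print(cm)  # rows are true classes, columns are predicted classes
```

Printing the confusion matrix next to the summary metrics makes this failure mode visible at a glance: the second column, which counts predicted positives, is all zeros.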

### Model Optimization

In [107]:
# Define parameter dictionary
param_dict = {
    "n_estimators": [100, 200, 500],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 2, 3, 5],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 3]
}

In [109]:
# Hyperparameter tuning
rfht = RandomForestClassifier()
gscv = GridSearchCV(estimator=rfht, param_grid=param_dict, cv=5, verbose=1)

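# The search is fit on the full, unscaled X and y, so the rows of the test split
# also take part in cross-validation and the score is not a held-out estimate
gscv.fit(X, y)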

best_params = gscv.best_params_
print(f"Best Accuracy: {gscv.best_score_}")
print(f"Best Parameter Set: {best_params}")

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best Accuracy: 0.9631057737146606
Best Parameter Set: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
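
`GridSearchCV` ranks candidates by accuracy by default for classifiers, which on imbalanced data can favor models that rarely predict the positive class. A possible variant, sketched here under stated assumptions, fits the search on the training split only and ranks candidates by F1. The data is a synthetic stand-in from `make_classification`, and `small_grid` is a reduced hypothetical grid chosen to keep the example fast; neither comes from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the stroke data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Reduced hypothetical grid, only to keep the sketch quick to run
small_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 2]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    scoring="f1",  # rank candidates by F1 on the positive class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)  # the held-out test rows never reach the search

print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.2f}")
```

Other scorers such as `"recall"` can be swapped in, depending on whether missed strokes or false alarms are the greater concern.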


In [110]:
# Train with the best set of hyperparameters
model = RandomForestClassifier(**best_params)
train_evaluate(model)

TRAINING METRICS
Accuracy:  0.97
Precision:  1.00
Recall:  0.08
F1:  0.15
TEST METRICS
Accuracy:  0.96
Precision:  0.00
Recall:  0.00
F1:  0.00


In [112]:
# Save the model
trained_model_path = os.path.join(model_path, "rfcmodel.pkl")
with open(trained_model_path, "wb") as trained_model:
    pickle.dump(model, trained_model)
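
The pickled model was trained on standardized features, so anything that loads it later also needs the fitted `StandardScaler`. One option, sketched below as an assumption rather than what this notebook does, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` and pickle that single object. The data is synthetic and the file name `rf_pipeline_demo.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaler and model travel together, so raw features can be passed at predict time
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_demo, y_demo)

# Round-trip through pickle in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "rf_pipeline_demo.pkl")  # hypothetical file name
    with open(demo_path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should give the same predictions as the original
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and they are generally safest to load with the same scikit-learn version that created them.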