# Stroke Prediction

## Context

Strokes are a significant health concern worldwide, often leading to severe consequences including long-term disability or death. Predicting the likelihood of a stroke can play a crucial role in early intervention and treatment, potentially saving lives and improving patient outcomes.

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This project uses a dataset to predict whether a patient is likely to suffer a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the dataset provides relevant information about a patient.

## Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data

## Data Dictionary

- **id**: Unique identifier for each patient. It contains numeric data.
- **gender**: Gender of the patient. It contains categorical data. (**"Male", "Female", or "Other"**)
- **age**: Age of the patient. It contains numeric data.
- **hypertension**: Whether the patient has hypertension. It contains binary data. (**0** if the patient does not have hypertension, **1** if the patient does)
- **heart_disease**: Whether the patient has a heart disease. It contains binary data. (**0** if the patient does not have a heart disease, **1** if the patient does)
- **ever_married**: Whether the patient has ever been married. It contains categorical data. (**"No" or "Yes"**)
- **work_type**: Type of work of the patient. It contains categorical data. (**"children", "Govt_job", "Never_worked", "Private", or "Self-employed"**)
- **Residence_type**: Type of residence of the patient. It contains categorical data. (**"Rural" or "Urban"**)
- **avg_glucose_level**: Average glucose level in blood. It contains numeric data.
- **bmi**: Body mass index of the patient. It contains numeric data.
- **smoking_status**: Smoking status of the patient. It contains categorical data. (**"formerly smoked", "never smoked", "smokes", or "Unknown"**)
- **stroke**: The output feature. It contains binary data. (**1** if the patient had a stroke, **0** if not)

**Note:** "Unknown" in `smoking_status` means that the information is unavailable for this patient.

## Problem Statement

1. **Feature Engineering**: Encode the categorical features as numeric values using suitable encoding techniques.
2. **Feature Selection**: Select the features most useful for accurately predicting a stroke.
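
As a rough illustration of the second objective, the sketch below ranks numeric features by the absolute value of their Pearson correlation with the target. This is only one simple filter-style approach and is not necessarily the method used later in this project. The helper `rank_features` and the toy values are hypothetical and made up for the example; only the column names come from the data dictionary.

```python
import pandas as pd


def rank_features(df: pd.DataFrame, target: str = "stroke") -> pd.Series:
    """Rank numeric features by absolute correlation with the target (hypothetical helper)."""
    features = df.drop(columns=target)
    # corrwith computes the Pearson correlation of each column with the target
    return features.corrwith(df[target]).abs().sort_values(ascending=False)


# Made-up toy values, for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "age": [67, 80, 49, 30, 25, 40],
    "hypertension": [0, 1, 0, 0, 1, 0],
    "stroke": [1, 1, 1, 0, 0, 0],
})

ranking = rank_features(toy)
print(ranking)
```

In this toy frame `age` ranks first, while `hypertension` is uncorrelated with the target because it occurs equally often in both classes. On real data, a correlation ranking like this is only a first look, since it misses non-linear effects and interactions.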


### Load Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import warnings

### Settings

In [53]:
# Suppress warnings to keep the notebook output clean
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
# csv_path = os.path.join(data_path, "stroke_dropped.csv")
csv_path = os.path.join(data_path, "stroke_do.csv")

### Load Data

In [54]:
df = pd.read_csv(csv_path)

In [55]:
# Check data
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
4,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1


### Feature Engineering

In [56]:
# Encode the ever_married with mapping
df["ever_married"] = df["ever_married"].map({ "Yes": 1, "No": 0})

In [57]:
# Sanity check
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,1,Private,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,1,Private,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,1,Private,Urban,171.23,34.4,smokes,1
3,Male,74.0,1,1,1,Private,Rural,70.09,27.4,never smoked,1
4,Female,69.0,0,0,0,Private,Urban,94.39,22.8,never smoked,1
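
One caveat with dictionary-based mapping is that `Series.map` yields `NaN` for any value that is not a key in the dictionary. A minimal sketch of a check for such values follows; the toy series and the label "Maybe" are made up for illustration and do not appear in the dataset.

```python
import pandas as pd

# Toy column with one unexpected label ("Maybe") that is not in the mapping
toy = pd.Series(["Yes", "No", "Maybe", "Yes"], name="ever_married")
mapping = {"Yes": 1, "No": 0}

encoded = toy.map(mapping)

# Values missing from the mapping become NaN, so list them explicitly
unmapped = toy[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here would indicate that every label was covered by the mapping.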


In [58]:
# Encode the Residence_type with mapping
df["Residence_type"] = df["Residence_type"].map({"Urban": 1, "Rural": 0})

In [59]:
# Sanity check
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,1,Private,1,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,1,Private,0,105.92,32.5,never smoked,1
2,Female,49.0,0,0,1,Private,1,171.23,34.4,smokes,1
3,Male,74.0,1,1,1,Private,0,70.09,27.4,never smoked,1
4,Female,69.0,0,0,0,Private,1,94.39,22.8,never smoked,1


In [60]:
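# One-hot encode the remaining categorical columns (gender, work_type, smoking_status);
# drop_first=True drops the first category of each column to avoid redundant dummies
df = pd.get_dummies(df, drop_first=True, dtype=int)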

In [61]:
# Sanity check
df.head()

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,stroke,gender_Male,gender_Other,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,1,1,228.69,36.6,1,1,0,0,1,0,0,1,0,0
1,80.0,0,1,1,0,105.92,32.5,1,1,0,0,1,0,0,0,1,0
2,49.0,0,0,1,1,171.23,34.4,1,0,0,0,1,0,0,0,0,1
3,74.0,1,1,1,0,70.09,27.4,1,1,0,0,1,0,0,0,1,0
4,69.0,0,0,0,1,94.39,22.8,1,0,0,0,1,0,0,0,1,0
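
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame (the rows are made up; only the `gender` categories come from the data dictionary). The first category in sorted order is dropped and becomes the implicit baseline, encoded as all zeros.

```python
import pandas as pd

# Toy frame with the three gender categories from the data dictionary
toy = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})

dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

# "Female" sorts first, so it is dropped and represented by zeros in both columns
print(dummies.columns.tolist())
print(dummies)
```

The same logic explains the output above: there are no `gender_Female`, `work_type_Govt_job`, or `smoking_status_Unknown` columns, because those are the first categories of their respective features.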


In [62]:
# Save the data
output_path = os.path.join(data_path, "stroke_d_e.csv")
df.to_csv(output_path, index=False)
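
A quick round-trip check can confirm that a file written with `index=False` reloads with the same columns and no extra index column. The sketch below uses a temporary directory and a made-up two-row frame rather than the project's actual files.

```python
import os
import tempfile

import pandas as pd

# Made-up toy frame, for illustration only
toy = pd.DataFrame({"age": [67.0, 80.0], "stroke": [1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
print(same_columns, reloaded.shape)
```

Without `index=False`, the row index would be written as an unnamed first column and would come back as an extra column when the file is read again.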