# Supervised Learning Final Project: Stroke Prediction

## 1. Project Topic
**Type of Learning:** Supervised Learning
**Type of Task:** Binary Classification

**Goal:**
The goal of this project is to predict whether a patient is likely to have a stroke based on features such as gender, age, existing medical conditions, and smoking status. Strokes are a leading cause of death globally, and early identification of high-risk patients can be crucial for prevention.

By analyzing this dataset, I aim to build a predictive model that classifies patients into two groups: likely to suffer a stroke (1) and unlikely to suffer a stroke (0). I will compare multiple machine learning models to determine which offers the best predictive performance, focusing on metrics suitable for imbalanced medical data.

**Project link:**
https://github.com/VTornoreanu/Stroke-Prediction-Analysis

## 2. Data Source
**Source:**
The dataset used for this project is the "Stroke Prediction Dataset" sourced from Kaggle.

**Citation:**
Fedesoriano. (2021). *Stroke Prediction Dataset*. Kaggle. Retrieved from [https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset].

**Data Description:**
The dataset contains medical and demographic features.
* **Target Variable:** `stroke` (0 = No, 1 = Yes).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files
import io

# 1. Upload the file
print("Please upload 'healthcare-dataset-stroke-data.csv' when prompted:")
uploaded = files.upload()

# 2. Load the data into a DataFrame
# Note: the first uploaded file is used, whatever its name
filename = list(uploaded.keys())[0]
df = pd.read_csv(io.BytesIO(uploaded[filename]))

# 3. Display Data Info to answer Rubric questions about size/features
print("\n--- Data Info ---")
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()

print("\n--- First 5 Rows ---")
display(df.head())

# 4. Check for Missing Values (Crucial for Data Cleaning Step)
print("\n--- Missing Values ---")
print(df.isnull().sum())

## 3. Data Cleaning

**Cleaning Strategy:**
1.  **Drop `id` column:** This column is a unique patient identifier with no predictive value. Keeping it could let a model fit spurious patterns in the ID numbers.
2.  **Impute Missing `bmi` values:** The `bmi` feature has 201 missing values (approx. 4% of the data). Dropping these rows would discard otherwise usable records. Because BMI is typically right-skewed with high outliers, I fill the missing entries with the column **median**, which is less sensitive to extreme values than the mean.

In [None]:
# 1. Drop the 'id' column
df.drop('id', axis=1, inplace=True)

# 2. Impute missing 'bmi' values with the median
bmi_median = df['bmi'].median()
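# Assign the result back: inplace=True on a column selection triggers warnings in recent pandas versions
df['bmi'] = df['bmi'].fillna(bmi_median)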

# 3. Verify cleaning
print("Missing values after cleaning:")
print(df.isnull().sum())
print(f"\nNew Data Shape: {df.shape}")
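The reasoning behind choosing the median over the mean can be illustrated with a minimal sketch. The values below are made up for illustration (they are not taken from the stroke dataset), and `bmi_sample` is a hypothetical name.

```python
import statistics

# Hypothetical right-skewed sample: most values sit between 22 and 31,
# with two extreme outliers at the high end.
bmi_sample = [22.0, 24.5, 26.0, 27.5, 28.0, 29.5, 31.0, 60.0, 92.0]

mean_bmi = statistics.mean(bmi_sample)
median_bmi = statistics.median(bmi_sample)

# The outliers pull the mean upward, while the median stays near the bulk of the data.
print(f"Mean:   {mean_bmi:.2f}")
print(f"Median: {median_bmi:.2f}")
```

In this toy sample the mean lands well above most of the values, so filling gaps with it would assign an unusually high BMI to every imputed patient.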

## 4. Exploratory Data Analysis (EDA)

In this section, I will analyze the data distribution to understand:
1.  **Class Imbalance:** How many patients actually had a stroke?
2.  **Correlations:** Which features are most strongly related to strokes?
3.  **Feature Distribution:** How does age affect stroke probability?

In [None]:
# Set visual style
sns.set_style('whitegrid')

# Plot 1: Target Variable Distribution
# hue='stroke' with legend=False avoids seaborn's warning about passing palette without hue
plt.figure(figsize=(6, 4))
sns.countplot(x='stroke', data=df, hue='stroke', legend=False, palette='coolwarm')
plt.title('Class Distribution: Stroke vs No Stroke')
plt.xlabel('Stroke (0=No, 1=Yes)')
plt.ylabel('Count')
plt.show()

# Calculate the exact percentage of imbalance
stroke_count = df['stroke'].value_counts()
print(f"No Stroke: {stroke_count[0]} ({round(stroke_count[0]/len(df)*100, 2)}%)")
print(f"Stroke:    {stroke_count[1]} ({round(stroke_count[1]/len(df)*100, 2)}%)")

# Plot 2: Correlation Matrix (Heatmap)
plt.figure(figsize=(10, 8))
numeric_df = df.select_dtypes(include=[np.number])
corr = numeric_df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Matrix')
plt.show()

# Plot 3: Age Distribution by Stroke Status
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df[df['stroke'] == 0], x='age', fill=True, color='blue', label='No Stroke')
sns.kdeplot(data=df[df['stroke'] == 1], x='age', fill=True, color='red', label='Stroke')
plt.title('Age Distribution: Stroke vs No Stroke')
plt.legend()
plt.show()

## 5. Data Preprocessing

**Feature Engineering:**
The scikit-learn models used here require numeric input. I will use **One-Hot Encoding** to convert the categorical variables (`gender`, `ever_married`, `work_type`, `Residence_type`, `smoking_status`) into binary indicator features.
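As a rough illustration of what one-hot encoding with `drop_first=True` produces, here is a small sketch on a toy frame. The `toy` DataFrame and its values are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "smoking_status": ["smokes", "never smoked", "formerly smoked", "smokes"],
    "age": [61, 34, 70, 45],
})

# Each category becomes its own indicator column; drop_first=True drops the
# first category (alphabetically), which then acts as the implicit baseline.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Numeric columns such as `age` pass through unchanged, and a row with all indicator columns false belongs to the dropped baseline category.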

**Data Splitting:**
I will split the dataset into a **Training Set (80%)** to teach the models and a **Test Set (20%)** to evaluate their performance on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# 1. One-Hot Encoding for categorical variables
# drop_first=True prevents multicollinearity (dummy variable trap)
df_encoded = pd.get_dummies(df, drop_first=True)

print("Columns after encoding:")
print(df_encoded.columns)

# 2. Define X (features) and y (target)
X = df_encoded.drop('stroke', axis=1)
y = df_encoded['stroke']

# 3. Split the data (80% Train, 20% Test)
# stratify=y ensures the imbalance is preserved in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nTraining Shape: {X_train.shape}")
print(f"Testing Shape: {X_test.shape}")
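To see what `stratify=y` does in the split above, the following sketch runs the same call on hypothetical labels with a 5% positive rate. `X_demo` and `y_demo` are placeholder names and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1000 samples, 50 positives (5%)
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the original 5% positive rate
print(f"Train positive rate: {y_tr.mean():.3f} ({y_tr.sum()} of {len(y_tr)})")
print(f"Test positive rate:  {y_te.mean():.3f} ({y_te.sum()} of {len(y_te)})")
```

Without stratification, a random 20% split of such rare positives could by chance leave the test set with noticeably more or fewer stroke cases than expected, which makes recall estimates less stable.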

## 6. Model Building & Analysis

**Strategy:**
I will train three different supervised learning models to compare their performance.
1.  **Logistic Regression:** A baseline linear model.
2.  **Decision Tree:** A non-linear model that captures decision rules.
3.  **Random Forest:** An ensemble of decision trees that typically reduces overfitting relative to a single tree.

**Handling Imbalance:**
Since the dataset is highly imbalanced (approx. 5% stroke cases), I will use the parameter `class_weight='balanced'` for all models. This weights each class inversely to its frequency, so misclassifying a stroke case costs more during training and the model is pushed to pay attention to the minority class.
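For intuition, scikit-learn documents the `'balanced'` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on hypothetical labels. `balanced_class_weights` is an illustrative helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 950 negatives and 50 positives (5% positive rate)
y_demo = [0] * 950 + [1] * 50
weights = balanced_class_weights(y_demo)

for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
```

With a 5% positive rate, each positive sample ends up weighted about 19 times more heavily than a negative one.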

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize models with class_weight='balanced' to handle the imbalance
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
    "Decision Tree": DecisionTreeClassifier(class_weight='balanced', random_state=42),
    "Random Forest": RandomForestClassifier(class_weight='balanced', random_state=42)
}

# Loop through models to train and predict
for name, model in models.items():
    print(f"\n--- {name} ---")

    # Train
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

    # Evaluate
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix

# Plot Confusion Matrix for each model
plt.figure(figsize=(18, 5))

for i, (name, model) in enumerate(models.items()):
    plt.subplot(1, 3, i+1)

    # Predict
    y_pred = model.predict(X_test)

    # Compute Matrix
    cm = confusion_matrix(y_test, y_pred)

    # Plot heatmap
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'{name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')

plt.tight_layout()
plt.show()

## 7. Results and Analysis

**Model Comparison:**
* **Random Forest** achieved the highest accuracy (95%) but failed completely to detect stroke cases (Recall = 0.00). It effectively predicted the majority class ("No Stroke") for every patient, making it useless as a medical screening tool.
* **Decision Tree** performed slightly better on the minority class but still missed most stroke cases (Recall = 0.12).
* **Logistic Regression** had the lowest accuracy (75%) but achieved the highest **Recall (0.80)**. It correctly identified 40 out of 50 stroke cases in the test set.

**Metric Selection:**
For this medical problem, **Accuracy is misleading**. A model that predicts "No Stroke" for everyone would be about 95% accurate yet would not detect a single stroke case.
I prioritize **Recall** (Sensitivity) because missing a stroke case (False Negative) is far more costly than a False Positive (an unnecessary checkup).
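The gap between accuracy and recall for an always-negative classifier can be checked with a short sketch. The labels are hypothetical (1000 patients, 50 of them positive) and `accuracy_and_recall` is an illustrative helper.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall for the positive class) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# Hypothetical test set: 50 stroke cases among 1000 patients
y_true = [1] * 50 + [0] * 950
always_negative = [0] * len(y_true)  # predicts "No Stroke" for everyone

acc, rec = accuracy_and_recall(y_true, always_negative)
print(f"Accuracy: {acc:.2f}, Recall: {rec:.2f}")
```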

**Conclusion:**
Despite the lower accuracy, **Logistic Regression is the superior model** for this specific task because it successfully identifies high-risk patients.

## 8. Discussion and Conclusion

**Learnings:**
This project highlighted the dangers of relying on accuracy for imbalanced datasets. The Random Forest model fell into the "accuracy paradox," optimizing for the majority class at the expense of the minority class, despite using class weights.

**Why Random Forest Failed:**
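Even with `class_weight='balanced'`, the Random Forest likely struggled because the "Stroke" cases are not easily separable from the "No Stroke" cases in the feature space. One plausible contributing factor is that the trees were grown with default settings (no depth limit), so their leaves tend to be nearly pure on the training data and the class weights have limited influence on the final predicted labels. I did not test this explanation directly.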

**Future Improvements:**
To improve the Random Forest or Decision Tree models, I would:
1.  **Resampling:** Use an oversampling method such as SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic stroke samples, or undersample the majority class. Either would be applied to the training data only, so the test set keeps its real class distribution.
2.  **Threshold Tuning:** Instead of using the default 0.5 probability threshold for classification, I would lower the threshold (e.g., to 0.2) to catch more positive cases, accepting more false positives in exchange.
3.  **Additional Features:** Collect more specific medical data (e.g., family history, physical activity frequency) to help the complex models distinguish between classes better.
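As a rough sketch of the threshold tuning idea, the example below compares recall at 0.5 and 0.2 using `predict_proba`. It uses synthetic data from `make_classification` as a stand-in for the stroke dataset, so the names `X_demo`, `y_demo`, and `clf` are hypothetical and the numbers it prints say nothing about the real models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 5% positives), not the stroke dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class for each test sample
proba_pos = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.2):
    # Label a sample positive when its probability reaches the threshold
    y_pred = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_pred)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f}, "
          f"predicted positives={int(y_pred.sum())}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, but precision usually drops. In practice the threshold would be chosen on a validation split rather than on the test set.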