# Hands-On Workshop - Big Data in Healthcare 8400

### Amit Levon - December 2025

Welcome to this hands-on workshop on Big Data in Healthcare. In this workshop, we will perform Exploratory Data Analysis (EDA) to gain insights and intuition about a stroke records dataset.

We will be covering the following topics:

1. Introduction to the dataset and problem
2. Data cleaning and preparation
3. Exploratory Data Analysis (EDA)
4. Feature engineering
5. Model training and evaluation

Let's get started!

## Stroke Prediction Dataset

The dataset we will be using in this workshop is the [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) from **Kaggle**. It contains health records of over 5000 individuals, some of whom have suffered a stroke.

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset can be used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

We will use this dataset to get initial insights about the data and a *feel* for the work of a data scientist. We will see how to use basic Python scripting, packages, and techniques to explore the data, and how to use this information to train a Machine Learning (ML) model in a subsequent workshop.

Hopefully, this workshop will give you a taste of what it's like to be a data scientist, and will demonstrate why Python and Jupyter notebooks are so useful for data science.

# Step 0: Imports and Reading Data

In [None]:
# Import NumPy: fundamental package for numerical computing with arrays and matrices
import numpy as np

# Import Pandas: data manipulation and analysis library with DataFrame structures
import pandas as pd

# Import Matplotlib: plotting library for creating static and interactive visualizations
import matplotlib.pyplot as plt

# Import Seaborn: statistical data visualization library built on matplotlib
import seaborn as sns

# Set default matplotlib style to 'ggplot' for better-looking plots
plt.style.use('ggplot')

In [None]:
# Download into this current session the data shared by Amit from his Google Drive.
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=19cNSK5MahWk4ErIf_Bs8aIJFkNUXUkBB' -O healthcare-dataset-stroke-data.csv

# Read the CSV file containing stroke prediction data into a pandas DataFrame
# The CSV file contains health records of 5,115 patients with 12 health/demographic features
# and a target variable 'stroke' indicating whether the patient had a stroke (1) or not (0)
df = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Step 1: Data Understanding

* Dataframe `shape`
* `head` and `tail`
*  `dtypes`
*  `describe`

In [None]:
# Display shape of dataframe
df.shape

In [None]:
# Display first 10 rows of dataframe
df.head(10)
#df.tail(10)

In [None]:
# Display column names
df.columns

The current dataset contains the following features:

* `id`: unique identifier
* `gender`: "Male", "Female" or "Other"
* `age`: age of the patient
* `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* `ever_married`: "No" or "Yes"
* `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* `Residence_type`: "Rural" or "Urban"
* `avg_glucose_level`: average glucose level in blood
* `bmi`: body mass index
* `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"*
* `stroke`: 1 if the patient had a stroke or 0 if not

###### **Note**: "Unknown" in smoking_status means that the information is unavailable for this patient

In [None]:
# Display information about types of data in dataframe
df.dtypes

In [None]:
# Describe basic statistics about dataframe
df.describe()


# AI Features:
# 1) Please explain this code:
# 2) Could you use more simple terms?
# 3) Could you help me write code to analyze the other non-numeric columns?
#cf.describe # Create an error!!
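
One possible answer to the third prompt above: `describe()` only summarizes numeric columns by default, but applied to the text columns alone it reports counts, the number of unique values, and the most frequent category. The sketch below uses a small made-up DataFrame (`toy` and its values are illustrative, not the workshop data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the workshop DataFrame
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other", "Female"],
    "smoking_status": ["smokes", "never smoked", "Unknown", "never smoked", "never smoked"],
    "age": [67, 61, 80, 49, 79],
})

# Keep only the non-numeric columns, then summarize them:
# count, unique (number of categories), top (most frequent), freq (its count)
cat_summary = toy.select_dtypes(exclude="number").describe()
print(cat_summary)

# value_counts(normalize=True) gives the share of each category
gender_share = toy["gender"].value_counts(normalize=True)
print(gender_share)
```

On the real data, the same two calls could be applied to `df` to inspect columns such as `work_type` or `smoking_status`.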

# Step 2: Data Preparation

* Dropping irrelevant columns and rows
* Identifying duplicated columns and rows
* Renaming Columns
* Handling missing values
* Feature Creation

In [None]:
# Remove columns that don't contribute to the analysis
# 'irrelevant_column': contains non-informative data (all 'A' or 'C')
# 'duplicate_column': duplicates avg_glucose_level values, adding no new information
# axis=1 specifies we're dropping columns (not rows)
# inplace=True modifies the dataframe directly without creating a new copy
df.drop(['irrelevant_column', 'duplicate_column'], axis=1, inplace=True)
df.head()

In [None]:
# Rename columns to match standard naming conventions in the documentation
# 'sex' → 'gender': more inclusive terminology for gender variable
# 'patient_age' → 'age': shorter, cleaner column name
# inplace=True modifies the original dataframe
df.rename(columns={'sex': 'gender', 'patient_age': 'age'}, inplace=True)
df.head()

In [None]:
# Identifying duplicate rows
df.duplicated()

In [None]:
# Display all rows that are duplicates
# This shows us which rows have identical values
# Useful for understanding what duplicate data we have
df[df.duplicated()]

In [None]:
# Remove all duplicate rows from the dataframe
# Keeps the first occurrence of each duplicate row by default
# Updates the dataframe and returns the new shape to confirm duplicates were removed
df = df.drop_duplicates()
df.shape

In [None]:
# Checking the number of missing values
df.isnull().sum()

In [None]:
# Calculate the percentage of missing BMI values
# Divides missing count by total rows, multiplies by 100 for percentage
# Helps decide whether to drop rows or impute missing values
# Rule of thumb: if > 5% of values are missing, consider imputation; if > 20%, consider dropping the column
df['bmi'].isnull().sum() / len(df) * 100

In [None]:
# Impute missing BMI values with the column mean
# df['bmi'].mean() calculates the average BMI of all non-missing values
# fillna() replaces all NaN values with this mean
# This is a simple imputation strategy; advanced methods (KNN, MICE) could be used
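# Assigning the result back to df['bmi'] updates the column
# (calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas)
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())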
df.isnull().sum()
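
Mean imputation is sensitive to extreme values, so it can be worth comparing it with the median before committing to one. The following is a minimal sketch on made-up BMI values (the numbers are illustrative assumptions, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with two gaps and one extreme value
bmi = pd.Series([22.0, 25.0, np.nan, 28.0, 60.0, np.nan])

# mean() and median() both skip NaN by default
mean_filled = bmi.fillna(bmi.mean())
median_filled = bmi.fillna(bmi.median())

print("mean used:", bmi.mean())      # pulled upward by the 60.0 value
print("median used:", bmi.median())  # less sensitive to the 60.0 value
print("remaining NaN:", mean_filled.isnull().sum(), median_filled.isnull().sum())
```

If the two fill values differ a lot on the real column, that is a hint that the distribution is skewed and the median may be the safer choice.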

In [None]:
# ==============================================================================
# CREATE NEW FEATURE: Health Risk Score
# ==============================================================================
# This engineered feature combines multiple health indicators into a single score
# Each component is roughly scaled toward a 0-1 range so that no single term dominates:

# Calculate raw health risk score by summing the scaled health indicators:
# - age/100: scaled age (assuming max age ~100)
# - hypertension: binary flag (0 or 1)
# - heart_disease: binary flag (0 or 1)
# - avg_glucose_level/200: scaled glucose (levels above 200 give a term > 1)
# - bmi/50: scaled BMI (BMI values above 50 give a term > 1)

df['health_risk_score'] = (
    (df['age'] / 100) +
    df['hypertension'] +
    df['heart_disease'] +
    (df['avg_glucose_level'] / 200) +
    (df['bmi'] / 50))

# Min-Max normalize the health_risk_score to scale between 0 and 1
# This ensures the final score is comparable and easier to interpret
# Formula: (x - min) / (max - min) scales any distribution to [0, 1] range
df['health_risk_score'] = (df['health_risk_score'] - df['health_risk_score'].min()) / (df['health_risk_score'].max() - df['health_risk_score'].min())

# Display first few rows to verify the new feature was created
df.head()
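
To see the min-max formula at work on numbers that are easy to check by hand, here is a small sketch. The helper `min_max_scale` and the raw scores are hypothetical and exist only for this illustration.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values identical: avoid dividing by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical raw scores, before scaling
raw_scores = [0.5, 1.0, 1.5, 3.0]
scaled = min_max_scale(raw_scores)
print(scaled)
```

Note that the result depends on the minimum and maximum of the data at hand, so a single extreme patient can compress everyone else's score toward 0.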

# Step 3: Feature Understanding - Univariate Analysis

* Plotting Feature Distributions
  * Histogram
  * KDE
  * Boxplot

In [None]:
# Count the frequency of each smoking status category
# value_counts() returns counts sorted in descending order
# Shows the distribution of smoking statuses in our dataset
df['smoking_status'].value_counts()

In [None]:
# Create a bar plot of smoking status distribution
# .value_counts() gets frequency of each category
# .head(10) limits to top 10 categories (though smoking_status has only 4)
# kind='bar' creates a bar chart
# figsize=(7, 5) sets plot dimensions
# color='green' sets bar color
# fontsize=10 sets the font size of the axis tick labels
ax = df['smoking_status'].value_counts() \
    .head(10) \
    .plot(kind='bar', title='Smoking Status', figsize=(7, 5), color='green', fontsize=10)
ax.set_xlabel('Smoking Status')
ax.set_ylabel('Count')

In [None]:
# ==============================================================================
# EXPLORATORY DATA ANALYSIS: Dual Histogram with Twin Axis
# ==============================================================================
# This visualization shows the distribution of two continuous variables simultaneously

# Create a figure with a single axis
# figsize=(15, 10) sets the plot size
fig, ax1 = plt.subplots(figsize=(15, 10))

# Plot histogram of 'age' on the first y-axis
# bins=20 divides the age range into 20 bins
# alpha=0.5 sets transparency to 50% so overlapping bars are visible
# color='blue' colors age bars blue
# label='Age' provides legend text
ax1.hist(df['age'], bins=20, alpha=0.5, color='blue', label='Age')

# Set labels and font sizes for the first axis
# Both histograms share this x-axis, so the label names both variables
ax1.set_xlabel('Age (years) / Avg Glucose Level', fontsize=20)
ax1.set_ylabel('Count of People with Age', fontsize=20)
ax1.tick_params(axis='both', which='major', labelsize=20)

# Create a second y-axis that shares the same x-axis
# This allows plotting two different scales on the same figure
ax2 = ax1.twinx()

# Plot histogram of 'avg_glucose_level' on the second y-axis
# Its bin counts differ from the age counts, so they get their own y-axis scale
ax2.hist(df['avg_glucose_level'], bins=20, alpha=0.5, color='green', label='Avg Glucose Level')

# Set labels for the second y-axis
ax2.set_ylabel('Count of People with Avg Glucose Level', fontsize=20)
ax2.tick_params(axis='y', which='major', labelsize=20)

# Add legends from both axes for complete information
ax1.legend(loc='upper left', fontsize=15)
ax2.legend(loc='upper right', fontsize=15)

# Showing the plot
plt.show()

In [None]:
# =============================================================================
# VISUALIZATION: Health Risk Score Distribution using KDE
# =============================================================================

# Create a new figure with specified size
plt.figure(figsize=(10, 6))

# Kernel Density Estimation (KDE) plot:
# KDE is a statistical technique that estimates the probability density function
# of a continuous variable by smoothing the distribution
# 
# Key parameters:
#   - df['health_risk_score']: the data to visualize (values range from 0 to 1)
#   - fill=True: fills the area under the curve with color (red in this case)
#     This makes the distribution more visually prominent
#     (older seaborn versions called this parameter 'shade', which is now deprecated)
#   - color="r": uses red color, which stands out and emphasizes the distribution
#
# Why use KDE instead of histogram?
#   - KDE shows a smooth, continuous curve rather than discrete bars
#   - Better for identifying overall distribution shape
#   - Easier to spot multiple peaks (modes) if they exist
#   - More professional appearance for presentations
#
# What to look for:
#   - Shape: Is the distribution normal (bell-shaped), skewed, or has multiple peaks?
#   - Location: Where are most health risk scores concentrated?
#   - Spread: How wide is the distribution? (Narrow = consistent risk, Wide = variable risk)
#   - Outliers: Are there extreme values (very low or very high risk scores)?
sns.kdeplot(df['health_risk_score'], fill=True, color="r")

# Set the title of the plot
plt.title('KDE of Health Risk Score')

# Label the x-axis
# The x-axis shows the health risk score values (0 = lowest risk, 1 = highest risk)
plt.xlabel('Health Risk Score')

# Label the y-axis
# The y-axis shows probability density, not count: the area under the curve sums to 1
# Higher density = relatively more patients near that risk score value
plt.ylabel('Density')

# Display the plot in the notebook
plt.show()
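
To make the idea of a density curve concrete, the sketch below builds a Gaussian KDE by hand with NumPy: one small bell curve is placed on each data point and the curves are averaged. The scores and the bandwidth are made-up values chosen for illustration; seaborn selects its own bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance of each grid point to each sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical risk scores; the bandwidth is picked by hand for this example
scores = np.array([0.10, 0.15, 0.20, 0.22, 0.60])
grid = np.linspace(-0.5, 1.5, 2001)
density = gaussian_kde_1d(scores, grid, bandwidth=0.05)

# The area under a density curve is about 1, even though the curve itself can exceed 1
step = grid[1] - grid[0]
area = density.sum() * step
print("area under curve:", round(area, 3))
print("peak density:", round(density.max(), 2))
```

This is why the y-axis of a KDE plot should be read as relative concentration rather than as a probability or a patient count.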

# Step 4: Feature Relationship - Bivariate Analysis

* Scatterplot
* Heatmap Correlation
* Pairplot
* Groupby comparisons

In [None]:
# =============================================================================
# VISUALIZATION: Age vs Health Risk Score with Stroke Outcome Coloring
# =============================================================================

plt.figure(figsize=(10, 5))

# Scatter plot: plots individual data points showing relationship between two variables
# This visualization helps identify patterns, clusters, and potential correlations
#
# Key parameters explained:
#   - x=df['age']: x-axis represents patient age (in years, typically 0-82 in this dataset)
#   - y=df['health_risk_score']: y-axis represents the composite health risk score we created
#     (values range from 0 to 1, where 0 = lowest risk, 1 = highest risk)
#
#   - hue=df['stroke']: This parameter colors points based on stroke status
#     hue = "shade" or "color" in visualization terminology
#     Each unique value in df['stroke'] gets a different color
#     Stroke=0: no stroke (blue), Stroke=1: had stroke (red)
#     This allows us to visually compare risk profiles between stroke/no-stroke patients
#
#   - palette=['blue', 'red']: defines the color scheme
#     Blue = safer/no event, Red = danger/event occurred
#
#   - alpha=0.5: transparency level (0=invisible, 1=opaque)
#     alpha=0.5 means 50% transparency
#     When points overlap (because many patients have similar age/risk scores),
#     we can see through overlapping points to the points beneath them
#     Darker areas indicate high concentration of patients (overlapping points)
#     This reveals hidden patterns in dense areas of data
#
# What patterns to look for:
#   - Do blue points cluster in lower age/lower risk areas? (expected for healthy)
#   - Do red points cluster in higher age/higher risk areas? (expected for stroke risk)
#   - Are there any red points in low-risk areas? (surprising cases)
#   - Is there a clear separation between blue and red? (indicates age/risk are good predictors)
#   - Do we see a trend where risk increases with age?
sns.scatterplot(
    x=df['age'], 
    y=df['health_risk_score'],
    hue=df['stroke'],
    palette=['blue', 'red'],
    alpha=0.5)

# Set the plot title with specified font size
plt.title('Scatter Plot of Age and Health Risk Score', fontsize=20)

# Label the x-axis with specified font size
# X-axis = Age: shows patient age in years
plt.xlabel('Age', fontsize=10)

# Label the y-axis with specified font size
# Y-axis = Health Risk Score: our engineered feature combining multiple health factors
#   Range: 0 to 1 (0 = lowest risk, 1 = highest risk)
#   Higher scores indicate worse overall health profile
plt.ylabel('Health Risk Score', fontsize=10)

# Display the plot in the notebook
plt.show()

In [None]:
# =============================================================================
# VISUALIZATION: Age vs Health Risk Score with Linear Regression Trend Line
# =============================================================================

plt.figure(figsize=(10, 5))

# Regression plot: combines scatter plot with linear regression analysis
# This visualization shows both individual data points AND the overall trend
# 
# Linear regression finds the "best-fit" straight line through the data
# The line equation: health_risk_score = intercept + slope * age
# 
# Key parameters explained:
#   - x=df['age']: x-axis variable (patient age in years)
#   - y=df['health_risk_score']: y-axis variable (our engineered health risk score)
#     The regression line predicts y-values based on x-values
#     Equation: predicted_score = a + b*age
#     If b (slope) is positive: older age → higher risk score (expect this)
#     If b (slope) is negative: older age → lower risk score (unexpected)
#     If b (slope) ≈ 0: age has little effect on risk score (weak relationship)
#
#   - color='green': base color for the visualization
#     Used as default color when scatter_kws and line_kws don't override
#
#   - scatter_kws={'alpha': 0.5}: keyword arguments for scatter points
#     'kws' = keyword arguments (configuration options)
#     scatter_kws = options specifically for the scatter plot portion
#     alpha=0.5 makes points 50% transparent
#     Shows clustering: darker areas = more overlapping points = more patients there
#     This reveals which age/risk combinations are most common in our dataset
#
#   - line_kws={'color': 'red'}: keyword arguments for the regression line
#     line_kws = options specifically for the regression line
#     color='red' makes the line stand out prominently against green scatter points
#     Red line = easy to see the overall trend direction
#     Visually distinguishes the trend from individual data points
sns.regplot(
    x=df['age'], 
    y=df['health_risk_score'], 
    color='green',
    scatter_kws={'alpha': 0.5}, 
    line_kws={'color': 'red'})

plt.title('Regression Plot of Age and Health Risk Score', fontsize=20)

plt.xlabel('Age', fontsize=10)

plt.ylabel('Health Risk Score', fontsize=10)

plt.show()
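
The regression line drawn above can also be obtained as numbers. As a rough sketch, `np.polyfit` with degree 1 returns the slope and intercept of the least-squares line; the ages and scores below are invented so that the result is easy to verify by eye.

```python
import numpy as np

# Hypothetical ages and risk scores lying on a straight line
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
score = np.array([0.10, 0.15, 0.20, 0.25, 0.30])

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(age, score, deg=1)

# Equation from the comments above: predicted_score = intercept + slope * age
predicted_at_45 = intercept + slope * 45

print("slope:", round(slope, 4))
print("intercept:", round(intercept, 4))
print("predicted score at age 45:", round(predicted_at_45, 4))
```

A positive slope here corresponds to the upward-tilting line in the plot: each additional year of age adds `slope` to the predicted score.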

In [None]:
# Create a pairplot matrix showing relationships between multiple variables
# A pairplot shows scatter plots for all pairs of variables
# Diagonal shows distributions of each variable

# Use seaborn's pairplot to create a matrix of relationships
# df: the dataframe to plot
# vars=[...]: specific variables to include (numeric features of interest)
# hue='stroke': color points by stroke status for pattern detection
# palette=['blue', 'red']: blue=no stroke, red=stroke
sns.pairplot(
    df,
    vars=['age', 'avg_glucose_level', 'bmi', 'hypertension', 'heart_disease'],
    hue='stroke',
    palette=['blue', 'red']
)
plt.show()


In [None]:
# ==============================================================================
# EXPLORATORY DATA ANALYSIS: Multi-panel Analysis of Key Features
# ==============================================================================
# Create a 3x3 grid of plots analyzing three key health features

# Create a 3x3 subplot grid (3 rows, 3 columns)
# Each row analyzes one health feature (age, glucose, BMI) in three ways
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(16, 12))

# ==============================================================================
# ROW 1: Age Analysis
# ==============================================================================
# Box plot of age (shows quartiles, median, outliers)
sns.boxplot(x=df['age'], ax=axes[0, 0]).set_title("BoxPlot For Age Col")
# Histogram with KDE curve (shows distribution shape)
sns.histplot(data=df, x='age', kde=True, ax=axes[0, 1]).set_title("Distribution Of Age")
# Line plot of age vs stroke (shows trend with target variable)
sns.lineplot(data=df, x='age', y="stroke", ax=axes[0, 2]).set_title('Lineplot For Age With Stroke')

# ==============================================================================
# ROW 2: Average Glucose Level Analysis
# ==============================================================================
# Box plot of glucose levels
sns.boxplot(x=df['avg_glucose_level'], ax=axes[1, 0]).set_title("BoxPlot For Glucose")
# Histogram with KDE of glucose levels
sns.histplot(data=df, x='avg_glucose_level', kde=True, ax=axes[1, 1]).set_title("Distribution Of Glucose")
# Line plot showing glucose relationship with stroke
sns.lineplot(data=df, x='avg_glucose_level', y="stroke", ax=axes[1, 2]).set_title('Lineplot For Glucose Level With Stroke')

# ==============================================================================
# ROW 3: BMI Analysis
# ==============================================================================
# Box plot of BMI values
sns.boxplot(x=df['bmi'], ax=axes[2, 0]).set_title("BoxPlot For Bmi Col")
# Histogram with KDE of BMI distribution
sns.histplot(data=df, x='bmi', kde=True, ax=axes[2, 1]).set_title("Distribution Of Bmi")
# Line plot of BMI vs stroke outcome
sns.lineplot(data=df, x='bmi', y="stroke", ax=axes[2, 2]).set_title('Lineplot For Bmi With Stroke')

# Automatically adjust spacing to prevent label overlap
plt.tight_layout()
# Display the complete grid of plots
plt.show()

In [None]:
# Create a correlation heatmap of numeric features
# Shows how strongly pairs of variables are related

# Create a new figure
plt.figure(figsize=(10, 5))

# Calculate correlation matrix for selected numeric features
# .corr() computes Pearson correlation coefficients between all variables
# Values range from -1 (perfect negative) to +1 (perfect positive) correlation
# Create heatmap visualization
# annot=True displays correlation values in each cell
# cmap='coolwarm': uses blue-red colormap (blue=negative, red=positive correlations)
sns.heatmap(
    df[['age', 'avg_glucose_level', 'bmi', 'hypertension', 'heart_disease']].corr(), 
    annot=True,
    cmap='coolwarm')

# We can add 'health_risk_score'.

# Add title
plt.title('Correlation Matrix')

# Display the plot
plt.show()
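
Each cell of the heatmap is a Pearson correlation coefficient. The sketch below computes one by hand and checks it against `DataFrame.corr()`; the two columns hold made-up values and are only meant to show where the number comes from.

```python
import numpy as np
import pandas as pd

# Hypothetical values for two numeric features
toy = pd.DataFrame({
    "age": [25, 40, 55, 70, 80],
    "avg_glucose_level": [85.0, 90.0, 110.0, 105.0, 140.0],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = toy["age"] - toy["age"].mean()
dy = toy["avg_glucose_level"] - toy["avg_glucose_level"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# The same number appears off the diagonal of the correlation matrix
r_pandas = toy.corr().loc["age", "avg_glucose_level"]
print(round(r_manual, 4), round(r_pandas, 4))
```

Keep in mind that Pearson correlation only measures linear association, and that for 0/1 columns such as `hypertension` it tends to produce small values even when the feature matters.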

# Step 5: Stroke Prediction Modeling

In [None]:
# ==============================================================================
# MODEL PREPARATION: ENCODE CATEGORICAL VARIABLES
# ==============================================================================
# Convert categorical (text) features to numeric codes for machine learning models
# Models require numeric input, so we map categories to integers
df_ml = df.copy()  # For model training (with encoding)

# Encode 'gender' column
# Male → 0, Female → 1, Other → -1
# map() looks up each category in the dictionary
# astype(np.int8) stores the codes as small signed integers (uses less memory);
# a signed type is needed because unsigned uint8 cannot represent -1
df_ml['gender'] = df_ml['gender'].map({'Male':0,'Female':1,'Other':-1}).astype(np.int8)

# Encode 'Residence_type' column
# Rural → 0, Urban → 1
df_ml['Residence_type'] = df_ml['Residence_type'].map({'Rural':0,'Urban':1}).astype(np.int8)

# Encode 'work_type' column (5 categories need 5 different codes)
# Private → 0, Self-employed → 1, Govt_job → 2
# children → -1, Never_worked → -2 (special codes for age-specific groups)
df_ml['work_type'] = df_ml['work_type'].map({'Private':0,'Self-employed':1,'Govt_job':2,'children':-1,'Never_worked':-2}).astype(np.int8)

# Display updated dataframe with encoded variables
df_ml
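
Integer codes like the ones above impose an ordering (for example `Govt_job` = 2 looks "larger" than `Private` = 0), which linear models such as Logistic Regression may read as a magnitude. One common alternative is one-hot encoding. The following is a small sketch with `pd.get_dummies` on invented rows; it is not the encoding used in the rest of this workshop.

```python
import pandas as pd

# Hypothetical rows using the work_type categories listed earlier
toy = pd.DataFrame({
    "age": [67, 8, 45],
    "work_type": ["Private", "children", "Govt_job"],
})

# One 0/1 column per category, so no category is treated as "larger" than another
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)
print(encoded)
```

The trade-off is more columns: a feature with five categories becomes five 0/1 columns instead of one integer column.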

In [None]:
# ==============================================================================
# MACHINE LEARNING: PREPARE FEATURES (X) AND TARGET (y)
# ==============================================================================

# Define feature matrix X: select columns to use for model training
# These are the input variables (predictors) the model will learn from
X  = df_ml[['gender','age','hypertension','heart_disease','work_type','avg_glucose_level','bmi']]

# Define target variable y: what we want to predict
# stroke: 0=No stroke, 1=Stroke occurred
y = df_ml['stroke']

# Import train_test_split function for dividing data into training and testing sets
from sklearn.model_selection import train_test_split

# Split data into training (70%) and testing (30%) sets
# This allows us to train the model on one subset and evaluate on unseen data
# train_size=0.7 means use 70% for training (30% for testing)
# random_state=42 ensures reproducible splits for consistency across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

# Display first 2 rows of test set to verify the split worked correctly
X_test.head(2)
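
With a rare outcome such as stroke, a purely random split can leave the test set with noticeably more or fewer positive cases than the training set. A possible refinement is the `stratify` argument of `train_test_split`, sketched here on synthetic data (the arrays and the `_toy`, `_tr`, `_te` names are illustrative and chosen so they do not overwrite the notebook's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 100 rows, 10 positives (10%)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the positive rate the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

In the workshop's own split, this would amount to adding `stratify=y` to the existing call.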

In [None]:
# ==============================================================================
# MACHINE LEARNING: CREATE ML PIPELINES WITH SCALING AND CLASSIFIERS
# ==============================================================================
# Pipelines combine preprocessing (scaling) with model training in a single workflow
# This ensures consistent processing of training and test data

# Import required libraries
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# STANDARDSCALER
# ======================================================================
# StandardScaler transforms features to have mean 0 and standard deviation 1,
# which normalizes all features to the same scale so that features with large 
# values (like age or glucose level) don't dominate the machine learning model 
# over features with smaller values.

# ==============================================================================
# PIPELINE 1: Random Forest Classifier with StandardScaler
# ==============================================================================
# RandomForestClassifier: ensemble of decision trees, robust to outliers
# Random Forest Classifier builds many decision trees (each asking yes/no 
# questions about features like "Is age > 60?" or "Does patient have hypertension?"), 
# then combines their predictions by majority vote to make a final stroke prediction.

# Step 1: Normalize features
# Step 2: Train RF model
rf_pipeline = Pipeline(steps = [('scale',StandardScaler()),('RF',RandomForestClassifier(random_state=42))])

# ==============================================================================
# PIPELINE 2: Support Vector Machine with StandardScaler
# ==============================================================================
# SVC: finds optimal hyperplane to separate classes, good for high-dimensional data
# Support Vector Machine finds the optimal boundary (hyperplane) that 
# maximally separates stroke patients from non-stroke patients while 
# leaving the largest margin of safety between the two groups.

# Step 1: Normalize features
# Step 2: Train SVM model
svm_pipeline = Pipeline(steps = [('scale',StandardScaler()),('SVM',SVC(random_state=42))])

# ==============================================================================
# PIPELINE 3: Logistic Regression with StandardScaler
# ==============================================================================
# LogisticRegression: linear model for binary classification, interpretable
# Logistic Regression predicts the probability of stroke by finding the 
# best-fitting line/curve that separates patients with stroke from those 
# without stroke, using a mathematical formula that outputs values between 0 and 1.

# Step 1: Normalize features
# Step 2: Train LR model
logreg_pipeline = Pipeline(steps = [('scale',StandardScaler()),('LR',LogisticRegression(random_state=42))])

In [None]:
# Check the class distribution of the target variable
# Count how many samples are in each class (stroke vs no stroke)
# This reveals if the data is balanced or imbalanced
df_ml['stroke'].value_counts()

In [None]:
# ==============================================================================
# HANDLING CLASS IMBALANCE: SMOTE (Synthetic Minority Over-sampling Technique)
# ==============================================================================
# Our data has far fewer stroke cases (positive class) than non-stroke (negative class)
# This imbalance can bias the model toward the majority class
# SMOTE synthetically generates new samples of the minority class

# Import SMOTE from imbalanced-learn library
from imblearn.over_sampling import SMOTE

# Create SMOTE resampler object
# This will generate synthetic samples for the minority class
# random_state=42 makes the synthetic samples reproducible across runs
oversample = SMOTE(random_state=42)

# Apply SMOTE to training data
# fit_resample(): generates synthetic minority samples to balance the data
# X_train_resh, y_train_resh: resampled training features and labels
# Now X_train_resh has balanced number of stroke and non-stroke cases
# y_train is passed directly (Series.ravel() is deprecated in recent pandas)
X_train_resh, y_train_resh = oversample.fit_resample(X_train, y_train)

# UNDERSAMPLING
# ======================================================================
# Undersampling randomly removes majority class samples (non-stroke patients) 
# to match the number of minority class samples (stroke patients), creating 
# a balanced dataset but discarding potentially useful data.
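
The undersampling idea described above can be sketched with plain pandas. The tiny `train` frame below is made up for the example; imbalanced-learn also offers a `RandomUnderSampler` class for the same purpose, which is not shown here.

```python
import pandas as pd

# Hypothetical training frame: 8 non-stroke rows and 2 stroke rows
train = pd.DataFrame({
    "age": [30, 35, 40, 45, 50, 55, 60, 65, 70, 80],
    "stroke": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

minority = train[train["stroke"] == 1]
majority = train[train["stroke"] == 0]

# Randomly keep only as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle so the classes are not grouped together
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)
print(balanced["stroke"].value_counts())
```

Here 6 of the 8 majority rows are thrown away, which illustrates the cost mentioned above: the classes become balanced, but most of the data is no longer used.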

In [None]:
# ==============================================================================
# MODEL EVALUATION: CROSS-VALIDATION USING F1 SCORE
# ==============================================================================
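# Cross-validation: train and evaluate model on multiple data splits
# F1 score: harmonic mean of precision and recall (good for imbalanced data)
# More robust than accuracy for imbalanced classification problems
# Caveat: SMOTE was applied before the folds were made, so validation folds contain
# synthetic samples and these scores are likely optimistic compared with X_test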

# Import cross-validation function
from sklearn.model_selection import cross_val_score

# Print header for results
print('Mean f1 scores:')

# Evaluate Random Forest Classifier
# cv=10: 10-fold cross-validation (split data into 10 folds)
# scoring='f1': use F1 score as evaluation metric
# .mean(): calculate average F1 score across all 10 folds
print('Random Forest mean :',cross_val_score(rf_pipeline,X_train_resh,y_train_resh,cv=10,scoring='f1').mean())

# Evaluate Support Vector Machine
print('SVM mean :',cross_val_score(svm_pipeline,X_train_resh,y_train_resh,cv=10,scoring='f1').mean())

# Evaluate Logistic Regression
print('Logistic Regression mean :',cross_val_score(logreg_pipeline,X_train_resh,y_train_resh,cv=10,scoring='f1').mean())

# F1 SCORE
# ======================================================================
# F1 Score is the harmonic mean of precision and recall, measuring how well 
# the model balances correctly identifying stroke patients (recall) with 
# avoiding false alarms (precision), making it ideal for imbalanced datasets 
# where both missed strokes and unnecessary alerts are costly.
# Precision = Of patients we predict will have stroke, how many actually do?
# Recall = Of patients who actually had stroke, how many did we catch?
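
The definitions above can be checked by hand on a handful of predictions. This is a minimal pure-Python sketch with made-up labels; in practice `sklearn.metrics` provides these scores directly.

```python
# Hypothetical labels: 1 = stroke, 0 = no stroke
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Count the outcomes that matter for the positive (stroke) class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # strokes caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # strokes missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy for comparison: it also counts the easy "no stroke" cases
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Accuracy comes out higher than F1 in this example because the many correctly predicted non-stroke cases count toward accuracy but not toward precision or recall.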

# Step 6: Ask a Question about the data
* Try to answer a question you have about the data using a plot or statistic.

In [None]:
# ==============================================================================
# Identify Which Groups Are Most Susceptible to Stroke
# ==============================================================================
# Count stroke cases by demographic group (residence type + work type)
# Note: raw counts also reflect how large each group is, not only how risky it is

# Filter for stroke cases and group by residence and work type
# df[df['stroke'] == 1]: select only rows where a stroke occurred
# .groupby(['Residence_type', 'work_type']): group by these two variables
# .size(): count cases in each group
# .reset_index(name='count'): convert to dataframe with 'count' column
stroke_susceptibility = df[df['stroke'] == 1].groupby(['Residence_type', 'work_type']).size().reset_index(name='count')

# Sort by stroke count in descending order
# Shows which groups have the highest number of stroke cases
stroke_susceptibility = stroke_susceptibility.sort_values('count', ascending=False)

# Create a combined label for better visualization
# Combines 'Residence_type' and 'work_type' into a single string
stroke_susceptibility['Residence_Work'] = stroke_susceptibility['Residence_type'] + ", " + stroke_susceptibility['work_type']

# Create horizontal bar plot
# kind='barh': horizontal bar chart
# x='Residence_Work': labels on y-axis (grouped categories)
# y='count': bar lengths (number of strokes)
# figsize=(20, 7): wide plot for readability
# title: descriptive title
ax = stroke_susceptibility.plot(
    kind='barh',
    x='Residence_Work',
    y='count',
    figsize=(20, 7),
    title='Number of Strokes by Residence and Work Type'
)

ax.set_xlabel('Number of Strokes')
ax.set_ylabel('Residence Type, Work Type')

plt.show()
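
Because large groups naturally contain more stroke cases, a follow-up question is which groups have the highest stroke *rate*. Since `stroke` is coded 0/1, its mean within a group is exactly that rate. The sketch below uses invented records to show how counts and rates can rank groups differently.

```python
import pandas as pd

# Hypothetical records: one large group and one small group
toy = pd.DataFrame({
    "work_type": ["Private"] * 8 + ["Self-employed"] * 2,
    "stroke":    [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# sum = number of strokes, count = group size, mean = stroke rate (stroke is 0/1)
summary = toy.groupby("work_type")["stroke"].agg(
    strokes="sum", patients="count", rate="mean"
)
summary = summary.sort_values("rate", ascending=False)
print(summary)
```

In this toy example the larger group has more strokes in absolute terms, yet the smaller group has the higher rate. On the real data, the same aggregation could be run on `df` grouped by `['Residence_type', 'work_type']`, keeping in mind that rates from very small groups are unreliable.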