### Feature Engineering: Steps followed
<pre>
0. Data Splitting
    * Since the dataset has an imbalanced target (stroke), use stratify=y in train_test_split.
    * Fit every preprocessing step on the training set only, then apply the fitted transformers to the test set, to avoid data leakage.
    * The test set must never be used to fit a transformer and must never be oversampled.

1. Data Imputation
    * Check whether the data is Missing Completely at Random (MCAR).
    * Use: Chi-square test (Missing Indicator vs. each categorical predictor)
           Logistic regression (Missing Indicator ~ Predictors) for numerical ones
    * Apply IterativeImputer (initial_strategy='median') to numerical columns with missing values (e.g., bmi).

2. Handle Categorical Columns
    * Apply One-Hot Encoding to convert categorical variables into numerical format.
    * This ensures compatibility with models and SMOTE later.

3. Transform Numerical Columns
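    * If numerical features are skewed, apply a transformation chosen by distribution:
            log (positive, right-skewed data),
            Box-Cox (strictly positive data), or
            Yeo-Johnson (the PowerTransformer default; also handles zero and negative values).
    * Reducing skew makes the later scaling step more meaningful and tends to help linear and distance-based models.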

4. Normalization/Scaling
    * Use StandardScaler or MinMaxScaler on numerical columns.
    * Scaling is essential before SMOTE, as it relies on distance metrics.

5. SMOTE (Synthetic Minority Over-sampling Technique)
    * Apply SMOTE only on the training set.
    * Perform after all preprocessing steps (imputation, encoding, transformation, scaling).
    * SMOTE should never touch the test data to avoid information leakage.

6. Modeling
    * Train the model using the balanced and preprocessed training data.
    * Evaluate it using the preprocessed (but not oversampled) test set.
</pre>
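
Step 5 depends on distances: SMOTE builds each synthetic minority row by interpolating between a minority row and one of its nearest minority neighbours, which is why scaling (step 4) has to come first. The block below is a minimal NumPy sketch of that interpolation idea only. It is not the imbalanced-learn implementation, and the helper name `smote_sketch` and its parameters are made up for illustration.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Hypothetical helper: build n_new synthetic rows from minority rows X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per row

    base = rng.integers(0, len(X_min), size=n_new)        # random base rows
    choice = rng.integers(0, k, size=n_new)               # which neighbour to use
    picked = neighbours[base, choice]
    gap = rng.random((n_new, 1))                          # interpolation weight in [0, 1)

    # Each synthetic row lies on the segment between a base row and its neighbour
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_minority, n_new=6)
print(synthetic.shape)
```

Because the neighbours are chosen by Euclidean distance, an unscaled feature with a large range (for example avg_glucose_level) would dominate the neighbour search.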

In [1]:
import pandas as pd
from sklearn.compose import make_column_selector as selector
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import mannwhitneyu
from scipy.stats import chi2_contingency


In [17]:
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
df.drop(['id'], axis=1, inplace=True)
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# There is class imbalance in the data
X = df.drop('stroke', axis=1)
y = df['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [8]:
num_cols = ['age','avg_glucose_level','bmi']
# Build the list from X, not df, so the target column 'stroke' is not treated as a feature
cat_cols = [col for col in X.columns if col not in num_cols]

In [9]:
y_test.value_counts()

stroke
0    1458
1      75
Name: count, dtype: int64

In [10]:
df['stroke'].value_counts()

stroke
0    4861
1     249
Name: count, dtype: int64
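
The two outputs above are enough to check that `stratify=y` preserved the class ratio. A small arithmetic sketch, with the counts copied by hand from those outputs:

```python
# Counts copied from the two value_counts() outputs above
full_neg, full_pos = 4861, 249   # whole dataset
test_neg, test_pos = 1458, 75    # test split

full_rate = full_pos / (full_neg + full_pos)
test_rate = test_pos / (test_neg + test_pos)
test_share = (test_neg + test_pos) / (full_neg + full_pos)

print(f"stroke rate, full data: {full_rate:.4f}")
print(f"stroke rate, test set:  {test_rate:.4f}")
print(f"test share of rows:     {test_share:.2f}")
```

Both rates come out just under 5%, so the test split mirrors the imbalance of the full data.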

### 1. Impute Missing values

### 1.1 Check for outliers and impute values

In [11]:
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [40]:
df['bmi_missing'] = df['bmi'].isnull().astype(int)

In [48]:
categorical_vars = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'stroke']

for col in categorical_vars:
    contingency = pd.crosstab(df[col], df['bmi_missing'])
    chi2, p, _, _ = chi2_contingency(contingency)
    print(f"Chi-sq test for: {col}")
    print(f"  p-value = {p:.4f} {'--> Related to missingness' if p < 0.05 else '--> Not related'}\n")


Chi-sq test for: gender
  p-value = 0.0097 --> Related to missingness

Chi-sq test for: ever_married
  p-value = 0.0118 --> Related to missingness

Chi-sq test for: work_type
  p-value = 0.0382 --> Related to missingness

Chi-sq test for: Residence_type
  p-value = 0.6258 --> Not related

Chi-sq test for: smoking_status
  p-value = 0.0000 --> Related to missingness

Chi-sq test for: stroke
  p-value = 0.0000 --> Related to missingness
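
A small p-value only says that missingness and the category are associated. It does not show which levels drive the association or how large the difference is. One way to look at that is the share of missing bmi within each level. The sketch below uses a made-up toy frame, not the stroke data, and the counts in it are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data for illustration only
toy = pd.DataFrame({
    "smoking_status": ["smokes"] * 40 + ["Unknown"] * 40,
    "bmi_missing":    [0] * 38 + [1] * 2 + [0] * 28 + [1] * 12,
})

# Share of rows with missing bmi inside each category level
missing_rate = toy.groupby("smoking_status")["bmi_missing"].mean()
print(missing_rate)

# The same contingency table that the chi-square test works on
table = pd.crosstab(toy["smoking_status"], toy["bmi_missing"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"p-value = {p:.4f}, dof = {dof}")
```

Running the same `groupby` on `df` for each column flagged as related would show where the missing bmi values concentrate.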



In [47]:
df.loc[df['bmi_missing'] == 0, 'smoking_status'].value_counts()

smoking_status
never smoked       1852
Unknown            1483
formerly smoked     837
smokes              737
Name: count, dtype: int64

In [49]:
import statsmodels.api as sm
numerical_vars = ['age', 'avg_glucose_level']

X = df[numerical_vars].dropna()
y = df.loc[X.index, 'bmi_missing']  # Align indices

# Add constant term for intercept
X_const = sm.add_constant(X)

logit_model = sm.Logit(y, X_const)
result = logit_model.fit()

print(result.summary())


Optimization terminated successfully.
         Current function value: 0.160504
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:            bmi_missing   No. Observations:                 5110
Model:                          Logit   Df Residuals:                     5107
Method:                           MLE   Df Model:                            2
Date:                Sun, 03 Aug 2025   Pseudo R-squ.:                 0.03208
Time:                        16:05:03   Log-Likelihood:                -820.18
converged:                       True   LL-Null:                       -847.36
Covariance Type:            nonrobust   LLR p-value:                 1.567e-12
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -4.6307      0.220    -21.005      0.000      -5.063      -4.199
age     

In [52]:

# Using a pipeline (kept for reference as a string, not executed)
'''
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

bmi_pipeline = Pipeline([
    ('imputer', IterativeImputer(random_state=0, initial_strategy='median', max_iter=10)),
    ('scaler', StandardScaler())
])

cat_encoder = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('bmi_pipe', bmi_pipeline, ['bmi']),
    ('num_scaler', StandardScaler(), ['age', 'avg_glucose_level']),
    ('cat_encoder', cat_encoder, categorical_cols)
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=0, class_weight='balanced'))
])
'''

In [53]:
df.columns

Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke', 'bmi_missing'],
      dtype='object')

In [54]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.compose import make_column_selector as selector

In [56]:
# Assuming df is your DataFrame
target_col = 'stroke'

numerical_cols = ['age', 'avg_glucose_level', 'bmi']
categorical_cols = ['gender', 'hypertension', 'heart_disease', 'ever_married',
                    'work_type', 'Residence_type', 'smoking_status']

# Step 1: Split data
X = df.drop(columns=[target_col])
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


In [57]:
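# Step 2: Impute bmi (IterativeImputer)
# Fit on all numerical columns so age and avg_glucose_level can inform the bmi estimate;
# given bmi alone, IterativeImputer has nothing to regress on and returns the median fill.
# Only bmi has missing values, so the other two columns pass through unchanged.
X_train = X_train.copy()  # work on copies to avoid SettingWithCopyWarning
X_test = X_test.copy()

bmi_imputer = IterativeImputer(random_state=0, initial_strategy='median')
X_train[numerical_cols] = bmi_imputer.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = bmi_imputer.transform(X_test[numerical_cols])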
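
`IterativeImputer` models each feature with missing values as a function of the other features, so its result depends on which columns it is fitted on. With a single column there is nothing to regress on and the output is just the `initial_strategy` fill. The sketch below compares the two cases on made-up numbers; the arrays `age` and `bmi` here are toy values, not the stroke data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up values: bmi is roughly 2 * age, with one missing entry
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
bmi = np.array([40.0, 60.0, np.nan, 100.0, 120.0, 145.0])

# Single column: falls back to the median of the observed values
single = IterativeImputer(random_state=0, initial_strategy="median")
bmi_single = single.fit_transform(bmi.reshape(-1, 1)).ravel()

# Two columns: the missing bmi is predicted from age
both = IterativeImputer(random_state=0, initial_strategy="median")
bmi_both = both.fit_transform(np.column_stack([age, bmi]))[:, 1]

print("median of observed bmi:", np.nanmedian(bmi))
print("single-column fill:    ", bmi_single[2])
print("two-column fill:       ", bmi_both[2])
```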


In [None]:
# Step 3: Scale numerical columns (after imputation)
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[numerical_cols])
X_test_num = scaler.transform(X_test[numerical_cols])

X_train_num_df = pd.DataFrame(X_train_num, columns=numerical_cols, index=X_train.index)
X_test_num_df = pd.DataFrame(X_test_num, columns=numerical_cols, index=X_test.index)




In [None]:
# Step 4: Encode categorical columns
# sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_train_cat = encoder.fit_transform(X_train[categorical_cols])
X_test_cat = encoder.transform(X_test[categorical_cols])

encoded_cat_cols = encoder.get_feature_names_out(categorical_cols)
X_train_cat_df = pd.DataFrame(X_train_cat, columns=encoded_cat_cols, index=X_train.index)
X_test_cat_df = pd.DataFrame(X_test_cat, columns=encoded_cat_cols, index=X_test.index)



In [None]:
# Step 5: Concatenate processed numerical + categorical features
X_train_final = pd.concat([X_train_num_df, X_train_cat_df], axis=1)
X_test_final = pd.concat([X_test_num_df, X_test_cat_df], axis=1)



In [None]:
# Step 6: Train model (handle imbalance)
model = RandomForestClassifier(random_state=0, class_weight='balanced')
model.fit(X_train_final, y_train)



In [None]:
# Step 7: Predict and evaluate
y_pred = model.predict(X_test_final)
print(classification_report(y_test, y_pred))
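
Steps 2 to 7 can also be written as a single scikit-learn `Pipeline`, which fits every transformer on the training split only and so removes the risk of leaking test data by hand. The block below is a minimal sketch on a random toy frame that only mimics the column layout of the stroke data. The frame `toy` and all values in it are made up, so the predictions themselves carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random toy data for illustration only
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "age": rng.uniform(20, 80, n),
    "avg_glucose_level": rng.uniform(60, 250, n),
    "bmi": rng.uniform(18, 40, n),
    "gender": rng.choice(["Male", "Female"], n),
    "smoking_status": rng.choice(["smokes", "never smoked", "Unknown"], n),
})
toy.loc[rng.choice(n, 30, replace=False), "bmi"] = np.nan  # inject missing bmi
toy["stroke"] = (rng.random(n) < 0.15).astype(int)          # imbalanced target

num_cols = ["age", "avg_glucose_level", "bmi"]
cat_cols = ["gender", "smoking_status"]

preprocessor = ColumnTransformer([
    # Numerical block: impute first, then scale
    ("num", Pipeline([
        ("imputer", IterativeImputer(random_state=0, initial_strategy="median")),
        ("scaler", StandardScaler()),
    ]), num_cols),
    # Categorical block: one-hot encode, ignoring levels unseen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=0, class_weight="balanced")),
])

X_toy, y_toy = toy.drop(columns="stroke"), toy["stroke"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

clf.fit(X_tr, y_tr)        # transformers and model are fitted on the training split only
y_hat = clf.predict(X_te)  # the test split is transformed with the fitted steps, then scored
print(y_hat.shape)
```

A plain scikit-learn `Pipeline` cannot hold a resampling step such as SMOTE, which is one reason this sketch keeps `class_weight='balanced'` for the imbalance instead.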

In [None]:
### Data Splitting
We stratify the split because of the class imbalance in the target (stroke).

In [None]:


target_col = 'stroke'

numerical_cols = ['age', 'avg_glucose_level', 'bmi']
categorical_cols = ['gender', 'hypertension', 'heart_disease', 'ever_married',
                    'work_type', 'Residence_type', 'smoking_status']