### Feature Engineering: Steps Followed
<pre>
0. Data Splitting
    * Since the dataset has an imbalanced target (stroke), use stratify=y in train_test_split.
    * Fit every preprocessing step (imputer, encoder, transformer, scaler) on the training set only to avoid data leakage.
    * Apply the fitted steps to the test set with transform only; never fit or oversample on the test set.

1. Data Imputation
    * Check whether the data is Missing Completely at Random (MCAR).
    * Use: Chi-square test (Missing Indicator vs. predictor) for categorical predictors
           Logistic regression (Missing Indicator ~ Predictors) for numerical ones
    * Apply IterativeImputer (initial_strategy='median') to numerical columns with missing values (e.g., bmi).

2. Handle Categorical Columns
    * Apply One-Hot Encoding to convert categorical variables into numerical format.
    * This ensures compatibility with models and SMOTE later.

3. Transform Numerical Columns
    * If numerical features are skewed, apply a transformation chosen by distribution:
            PowerTransformer (Box-Cox for strictly positive data, Yeo-Johnson otherwise), or
            log.
    * Reducing skew helps scaling and models that are sensitive to feature distributions.

4. Normalization/Scaling
    * Use StandardScaler or MinMaxScaler on numerical columns.
    * Scaling is essential before SMOTE, as it relies on distance metrics.

5. SMOTE (Synthetic Minority Over-sampling Technique for Imbalanced Data)
    * Apply SMOTE only on the training set.
    * Perform after all preprocessing steps (imputation, encoding, transformation, scaling).
    * SMOTE should never touch the test data to avoid information leakage.

6. Modeling
    * Train the model using the balanced and preprocessed training data.
    * Evaluate it using the preprocessed (but not oversampled) test set.
</pre>
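The steps above can be sketched as a single scikit-learn pipeline, so that every transformer is fitted on the training split only. This is a minimal sketch on synthetic data, not the notebook's actual code: the `toy` frame, its values, and the choice of `LogisticRegression` with `class_weight="balanced"` (used here instead of SMOTE) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# Synthetic stand-in for the stroke data (hypothetical values)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "age": rng.uniform(1, 82, n),
    "bmi": rng.normal(28, 6, n).clip(12, 60),
    "smoking_status": rng.choice(["never smoked", "smokes", "Unknown"], n),
})
toy.loc[rng.choice(n, 30, replace=False), "bmi"] = np.nan
target = (rng.random(n) < 0.15).astype(int)

# Step 0: stratified split before any fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.3, random_state=42, stratify=target
)

# Steps 1, 3, 4: impute, then power-transform (standardizes by default)
numeric = Pipeline([
    ("impute", IterativeImputer(initial_strategy="median", random_state=0)),
    ("power", PowerTransformer(method="yeo-johnson")),
])
# Step 2: one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoking_status"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# fit() touches the training split only; the test split is only transformed
model.fit(X_tr, y_tr)
test_proba = model.predict_proba(X_te)[:, 1]
```

Wrapping the steps in one `Pipeline` means cross-validation refits the preprocessing inside each fold, which is harder to guarantee when each step is applied by hand.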

In [2]:
import pandas as pd
from sklearn.compose import make_column_selector as selector
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import mannwhitneyu
from scipy.stats import chi2_contingency


from sklearn.model_selection import train_test_split

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler
import statsmodels.api as sm


In [3]:
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
df.drop(['id'], axis=1, inplace=True)
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [7]:
df[numerical_cols].describe()

Unnamed: 0,age,avg_glucose_level,bmi
count,5110.0,5110.0,4909.0
mean,43.226614,106.147677,28.893237
std,22.612647,45.28356,7.854067
min,0.08,55.12,10.3
25%,25.0,77.245,23.5
50%,45.0,91.885,28.1
75%,61.0,114.09,33.1
max,82.0,271.74,97.6


In [6]:

# Automatically select columns
numerical_selector = selector(dtype_include=['int64', 'float64'])
categorical_selector = selector(dtype_include=['object', 'category', 'bool'])

# Get initial lists
numerical_cols = numerical_selector(df)
categorical_cols = categorical_selector(df)

# Identify binary numeric columns (with only two unique values)
binary_numerical_cols = [col for col in numerical_cols 
                         if df[col].nunique(dropna=False) == 2]

# Move them from numerical to categorical
numerical_cols = [col for col in numerical_cols if col not in binary_numerical_cols]
categorical_cols = categorical_cols + binary_numerical_cols

# Results
print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)

Numerical columns: ['age', 'avg_glucose_level', 'bmi']
Categorical columns: ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'hypertension', 'heart_disease', 'stroke']


### 0. Data Splitting

In [8]:
# Assuming df is your DataFrame
target_col = 'stroke'

numerical_cols = ['age', 'avg_glucose_level', 'bmi']
categorical_cols = ['gender', 'hypertension', 'heart_disease', 'ever_married',
                    'work_type', 'Residence_type', 'smoking_status']

# Step 1: Split data
X = df.drop(columns=[target_col])
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
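To see what `stratify=y` buys, the following sketch splits a synthetic target with a 5% positive rate and checks the positive rate in each split. The arrays `X_demo` and `y_demo` are hypothetical and only mimic the imbalance of the stroke target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 950 negatives, 50 positives
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Both splits keep the 5% positive rate of the full data
print(y_tr_s.mean(), y_te_s.mean())
```

Without stratification the number of positives in a 30% test split would vary from seed to seed, which makes recall and F1 on the test set noisier.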


### 1. Data Imputation

### 1. Is the missing data random?

In [9]:

df['missing_bmi'] = df['bmi'].isna().astype(int)
cat_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'hypertension', 'heart_disease', 'stroke']
num_cols = ['age', 'avg_glucose_level']

# Predictors exclude bmi itself: keeping bmi and then calling dropna() would remove
# every row where missing_bmi == 1 and leave a constant target.
df_enc = pd.get_dummies(df[cat_cols + num_cols + ['missing_bmi']], drop_first=True).astype(float)
df_enc = df_enc.dropna()

# Separate names so the X / y used for the train/test split are not overwritten
X_miss = sm.add_constant(df_enc.drop('missing_bmi', axis=1))
y_miss = df_enc['missing_bmi']

model = sm.Logit(y_miss, X_miss).fit(disp=0)
print(model.summary())

<pre>
The missingness in the bmi column was analyzed with a logistic regression of the missing indicator on the other features and found to be significantly related to features such as hypertension and stroke. This is consistent with the data being Missing At Random (MAR) rather than Missing Completely at Random (MCAR). A model-based imputation method was therefore chosen so that these relationships can inform the imputed values.
</pre>
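The steps listed at the top also call for a chi-square test between the missing indicator and each categorical predictor. A minimal sketch of that test follows; the `demo` frame and the missingness rates in it are invented for illustration and are not taken from the stroke data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({"hypertension": rng.binomial(1, 0.1, n)})

# Hypothetical mechanism: bmi is missing more often for hypertensive patients
p_missing = np.where(demo["hypertension"] == 1, 0.30, 0.03)
demo["missing_bmi"] = rng.binomial(1, p_missing)

# Contingency table of predictor level vs. missing indicator
table = pd.crosstab(demo["hypertension"], demo["missing_bmi"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```

A small p-value says missingness depends on the predictor, which rules out MCAR for that variable; it does not by itself distinguish MAR from data that is missing not at random.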

### Data Imputation

In [18]:


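# Impute bmi from the other numerical features. With bmi as the only column the
# imputer has no predictors to model from and can only return the median fill.
impute_cols = ['age', 'avg_glucose_level', 'bmi']
bmi_imputer = IterativeImputer(random_state=0, initial_strategy='median')

# Fit on the training set only, then apply the fitted imputer to both sets
X_train[impute_cols] = bmi_imputer.fit_transform(X_train[impute_cols])
X_test[impute_cols] = bmi_imputer.transform(X_test[impute_cols])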
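The difference between imputing from one column and imputing from several can be checked on synthetic data where the true values are known. In this sketch the linear relation between `age` and `bmi` is an assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 200)
bmi = 20 + 0.15 * age + rng.normal(0, 1, 200)  # hypothetical: bmi rises with age

# Hide 20 bmi values so the imputations can be compared with the truth
bmi_obs = bmi.copy()
missing = rng.choice(200, 20, replace=False)
bmi_obs[missing] = np.nan

# One column: nothing to regress on, so the result is the initial median fill
single = IterativeImputer(initial_strategy="median", random_state=0)
bmi_single = single.fit_transform(bmi_obs.reshape(-1, 1))[:, 0]

# Two columns: the missing bmi values are modelled from age
multi = IterativeImputer(initial_strategy="median", random_state=0)
bmi_multi = multi.fit_transform(np.column_stack([age, bmi_obs]))[:, 1]

err_single = np.abs(bmi_single[missing] - bmi[missing]).mean()
err_multi = np.abs(bmi_multi[missing] - bmi[missing]).mean()
print(f"mean abs error: single={err_single:.2f}, multi={err_multi:.2f}")
```

How much the extra columns help on the real data depends on how strongly `bmi` is related to the other features.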

### Handling categorical data

In [19]:
df[categorical_cols].head()

Unnamed: 0,gender,hypertension,heart_disease,ever_married,work_type,Residence_type,smoking_status
0,Male,0,1,Yes,Private,Urban,formerly smoked
1,Female,0,0,Yes,Self-employed,Rural,never smoked
2,Male,0,1,Yes,Private,Rural,never smoked
3,Female,0,0,Yes,Private,Urban,smokes
4,Female,1,0,Yes,Self-employed,Rural,never smoked


In [20]:
cat_ohe_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']


# Fit encoder
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
X_train_ohe = pd.DataFrame(ohe.fit_transform(X_train[cat_ohe_cols]), 
                           columns=ohe.get_feature_names_out(cat_ohe_cols),
                           index=X_train.index)
X_test_ohe = pd.DataFrame(ohe.transform(X_test[cat_ohe_cols]), 
                          columns=ohe.get_feature_names_out(cat_ohe_cols),
                          index=X_test.index)

# Drop original and concatenate
X_train = pd.concat([X_train.drop(columns=cat_ohe_cols), X_train_ohe], axis=1)
X_test = pd.concat([X_test.drop(columns=cat_ohe_cols), X_test_ohe], axis=1)

In [21]:
X_train.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,gender_Male,gender_Other,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
2226,52.0,0,0,107.84,22.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3927,62.0,0,0,88.32,36.3,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3358,81.0,0,1,95.49,29.4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4152,55.0,0,0,73.57,28.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4866,37.0,0,0,103.66,36.1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


### Transformation

In [22]:
pt = PowerTransformer(method='box-cox')

# Fit on train and transform both train and test
X_train[['bmi', 'avg_glucose_level']] = pt.fit_transform(X_train[['bmi', 'avg_glucose_level']])
X_test[['bmi', 'avg_glucose_level']] = pt.transform(X_test[['bmi', 'avg_glucose_level']])
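As a quick check of what the Box-Cox step does, the sketch below applies it to a synthetic right-skewed, strictly positive column and compares skewness before and after. The `glucose` array and the `skewness` helper are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Right-skewed, strictly positive values (Box-Cox rejects zeros and negatives)
glucose = rng.lognormal(mean=4.6, sigma=0.4, size=(1000, 1))

pt_demo = PowerTransformer(method="box-cox")  # standardize=True by default
glucose_t = pt_demo.fit_transform(glucose)

def skewness(a):
    """Sample skewness: third central moment divided by std cubed."""
    a = a.ravel()
    return float(((a - a.mean()) ** 3).mean() / a.std() ** 3)

print(f"skew before={skewness(glucose):.2f}, after={skewness(glucose_t):.2f}")
print(f"mean after={glucose_t.mean():.3f}, std after={glucose_t.std():.3f}")
```

Because `PowerTransformer` standardizes its output by default, the transformed columns already have zero mean and unit variance on the training set before any separate scaler is applied.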


In [23]:
X_train.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,gender_Male,gender_Other,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
2226,52.0,0,0,0.448242,-0.904104,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3927,62.0,0,0,-0.163428,1.013194,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3358,81.0,0,1,0.091233,0.208501,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4152,55.0,0,0,-0.849117,0.021761,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4866,37.0,0,0,0.337227,0.992152,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


### Scaling

In [24]:
numerical_cols = ['age', 'avg_glucose_level', 'bmi']

scaler = StandardScaler()

# Fit on train, transform train and test
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])


In [25]:
X_train.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,gender_Male,gender_Other,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
2226,0.389044,0,0,0.448242,-0.904104,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3927,0.833687,0,0,-0.163428,1.013194,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3358,1.67851,0,1,0.091233,0.208501,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4152,0.522437,0,0,-0.849117,0.021761,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4866,-0.277921,0,0,0.337227,0.992152,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


In [28]:
y_train.head()

2226    0
3927    0
3358    0
4152    0
4866    0
Name: stroke, dtype: int64

### Feature selection

In [29]:
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Compute Mutual Information
mi_scores = mutual_info_classif(X_train, y_train, discrete_features='auto', random_state=42)
mi_df = pd.DataFrame({'Feature': X_train.columns, 'MI Score': mi_scores}).sort_values(by='MI Score', ascending=False)

print("Mutual Information Scores:\n", mi_df)


Mutual Information Scores:
                            Feature  MI Score
0                              age  0.034544
4                              bmi  0.012080
7                 ever_married_Yes  0.009049
3                avg_glucose_level  0.005901
11              work_type_children  0.004515
8           work_type_Never_worked  0.001774
9                work_type_Private  0.001752
15           smoking_status_smokes  0.001686
1                     hypertension  0.001492
2                    heart_disease  0.000832
14     smoking_status_never smoked  0.000647
5                      gender_Male  0.000000
6                     gender_Other  0.000000
10         work_type_Self-employed  0.000000
13  smoking_status_formerly smoked  0.000000
12            Residence_type_Urban  0.000000
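With `discrete_features='auto'` and a dense input, `mutual_info_classif` treats every column as continuous, including the 0/1 indicator columns. A possible refinement is to pass a boolean mask that marks the binary columns as discrete. The sketch below shows the call on synthetic data; the column names and the target mechanism are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
X_demo = pd.DataFrame({
    "age": rng.uniform(0, 82, n),
    "hypertension": rng.binomial(1, 0.1, n),
    "noise_flag": rng.binomial(1, 0.5, n),
})
# Hypothetical target that depends on age only
y_demo = (rng.random(n) < X_demo["age"] / 200).astype(int)

# Mark columns with at most two distinct values as discrete
discrete_mask = (X_demo.nunique() <= 2).to_numpy()

mi = mutual_info_classif(
    X_demo, y_demo, discrete_features=discrete_mask, random_state=42
)
mi_table = pd.Series(mi, index=X_demo.columns).sort_values(ascending=False)
print(mi_table)
```

The estimates are noisy at small sample sizes, so scores close to zero are best read as "no detectable dependence" rather than ranked against each other.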


In [30]:
print(ohe.categories_)
print(X_train_ohe.columns)


[array(['Female', 'Male', 'Other'], dtype=object), array(['No', 'Yes'], dtype=object), array(['Govt_job', 'Never_worked', 'Private', 'Self-employed', 'children'],
      dtype=object), array(['Rural', 'Urban'], dtype=object), array(['Unknown', 'formerly smoked', 'never smoked', 'smokes'],
      dtype=object)]
Index(['gender_Male', 'gender_Other', 'ever_married_Yes',
       'work_type_Never_worked', 'work_type_Private',
       'work_type_Self-employed', 'work_type_children', 'Residence_type_Urban',
       'smoking_status_formerly smoked', 'smoking_status_never smoked',
       'smoking_status_smokes'],
      dtype='object')


In [32]:
# Select the k best features by mutual information

from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

In [33]:
selected_features = X_train.columns[selector.get_support()]
print("Selected Features:", list(selected_features))


Selected Features: ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'ever_married_Yes', 'work_type_children', 'Residence_type_Urban', 'smoking_status_formerly smoked', 'smoking_status_never smoked']


## Apply SMOTE

In [40]:
# 3. Apply SMOTE only on training data

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_selected, y_train)
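When the oversampled training set is later passed to cross-validation, synthetic points derived from a sample can end up in a different fold than that sample, so the validation folds are no longer independent of the training folds. One way around this is to oversample inside each fold. The sketch below illustrates the idea with a hand-written interpolation helper, `smote_like_oversample`, which is hypothetical and is not imblearn's implementation; it assumes class 1 is the minority and has more than `k` samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X, y, k=5, seed=0):
    """Hypothetical helper: add synthetic minority rows by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), n_new)
    partner = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[partner] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out

# Synthetic imbalanced data standing in for the preprocessed training set
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.95], random_state=0
)

fold_f1 = []
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    # Oversample the training part of the fold only
    X_bal, y_bal = smote_like_oversample(X_demo[train_idx], y_demo[train_idx])
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # Score on the untouched, still imbalanced validation part
    fold_f1.append(f1_score(y_demo[val_idx], clf.predict(X_demo[val_idx])))

print("F1 per fold:", np.round(fold_f1, 3))
```

With imblearn installed, the same effect is usually obtained by placing `SMOTE` as a step in `imblearn.pipeline.Pipeline` and passing that pipeline to `cross_val_score`.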

## Baseline model

In [41]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, f1_score, recall_score, precision_score, accuracy_score

# Initialize dummy classifier
dummy_clf = DummyClassifier(strategy='most_frequent')

# Define scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score)
}

# Cross-validation on balanced training set
cv_results = cross_validate(dummy_clf, X_train_bal, y_train_bal, cv=5, scoring=scoring)

# Print average scores
print("CV Accuracy:", cv_results['test_accuracy'].mean())
print("CV F1 Score:", cv_results['test_f1'].mean())
print("CV Precision:", cv_results['test_precision'].mean())
print("CV Recall:", cv_results['test_recall'].mean())


CV Accuracy: 0.49970609845701686
CV F1 Score: 0.2665360117589417
CV Precision: 0.19985304922850844
CV Recall: 0.4
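The baseline above is scored on the SMOTE-balanced set, where the two classes are equally frequent. For comparison, this sketch scores the same kind of dummy classifier on data with the original imbalance; the 950/50 split is a hypothetical stand-in for the stroke target.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced target: 95% negatives, 5% positives
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by DummyClassifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

acc = accuracy_score(y_demo, pred)
rec = recall_score(y_demo, pred)
print(acc, rec)  # 0.95 0.0
```

High accuracy with zero recall is why F1 and recall, not accuracy, are the metrics to beat on the untouched test set.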




In [42]:
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score, recall_score
import numpy as np

# Use F1 or recall as scoring
scorer = make_scorer(f1_score)  # you can also switch to recall_score

# Stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Optuna objective function
def objective(trial, model_type='logistic'):
    if model_type == 'logistic':
        # Hyperparameters
        # Recent scikit-learn versions expect penalty=None, not the string 'none'
        penalty = trial.suggest_categorical('penalty', ['l1', 'l2', 'elasticnet', None])
        solver_options = {'l1': 'liblinear', 'l2': 'lbfgs', 'elasticnet': 'saga', None: 'lbfgs'}
        solver = solver_options[penalty]

        l1_ratio = None
        if penalty == 'elasticnet':
            l1_ratio = trial.suggest_float('l1_ratio', 0.0, 1.0)

        # suggest_loguniform is deprecated; suggest_float(..., log=True) replaces it.
        # C is ignored by scikit-learn when penalty is None.
        C = trial.suggest_float('C', 0.01, 100, log=True)
        max_iter = trial.suggest_int('max_iter', 100, 5000)

        clf = LogisticRegression(penalty=penalty, solver=solver, C=C, l1_ratio=l1_ratio,
                                 max_iter=max_iter, class_weight='balanced', random_state=42)

    elif model_type == 'decision_tree':
        max_depth = trial.suggest_int('max_depth', 2, 50)
        min_samples_split = trial.suggest_int('min_samples_split', 2, 50)
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 20)
        criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
        clf = DecisionTreeClassifier(max_depth=max_depth,
                                     min_samples_split=min_samples_split,
                                     min_samples_leaf=min_samples_leaf,
                                     criterion=criterion,
                                     class_weight='balanced',
                                     random_state=42)

    elif model_type == 'xgboost':
        n_estimators = trial.suggest_int('n_estimators', 50, 1000)
        max_depth = trial.suggest_int('max_depth', 2, 20)
        learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3, log=True)
        subsample = trial.suggest_float('subsample', 0.5, 1.0)
        colsample_bytree = trial.suggest_float('colsample_bytree', 0.5, 1.0)
        gamma = trial.suggest_float('gamma', 0, 5)
        reg_alpha = trial.suggest_float('reg_alpha', 0, 5)
        reg_lambda = trial.suggest_float('reg_lambda', 0, 5)
        min_child_weight = trial.suggest_int('min_child_weight', 1, 10)

        # use_label_encoder is deprecated in recent XGBoost releases and is omitted here
        clf = XGBClassifier(n_estimators=n_estimators,
                            max_depth=max_depth,
                            learning_rate=learning_rate,
                            subsample=subsample,
                            colsample_bytree=colsample_bytree,
                            gamma=gamma,
                            reg_alpha=reg_alpha,
                            reg_lambda=reg_lambda,
                            min_child_weight=min_child_weight,
                            scale_pos_weight=1,  # adjust if your classes are imbalanced
                            eval_metric='logloss',
                            random_state=42)

    else:
        raise ValueError(f"Unknown model_type: {model_type!r}")

    # Cross-validation
    scores = cross_val_score(clf, X_train_bal, y_train_bal, cv=cv, scoring=scorer)
    return np.mean(scores)

# Example: create study for Logistic Regression
study_lr = optuna.create_study(direction='maximize')
study_lr.optimize(lambda trial: objective(trial, 'logistic'), n_trials=50)

# Example: create study for Decision Tree
study_dt = optuna.create_study(direction='maximize')
study_dt.optimize(lambda trial: objective(trial, 'decision_tree'), n_trials=50)

# Example: create study for XGBoost
study_xgb = optuna.create_study(direction='maximize')
study_xgb.optimize(lambda trial: objective(trial, 'xgboost'), n_trials=50)
