 # Preprocessing Pipeline: Diabetes Prediction



 **Extension of Nguyen & Zhang (2025)**



 ## Notebook Structure

 0. Forensics: Understanding the Paper's Approach

 1. Load Data

 2. Train/Test Split

 3. Feature Engineering

 4. Class Balancing (SMOTE vs Paper's Undersampling)

 5. Feature Scaling

 6. Save Preprocessed Data

 7. Data Leakage Check

 ## 0. Forensics: Understanding the Paper's Approach



 Before preprocessing, we reverse-engineer how Nguyen & Zhang (2025) produced their 50-50 balanced dataset. This informs our extension strategy.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
import pickle
import warnings
warnings.filterwarnings('ignore')


In [2]:
# Load both datasets for comparison
df_balanced = pd.read_csv('diabetes_binary_5050split_health_indicators_BRFSS2015.csv')
df_imbalanced = pd.read_csv('diabetes_binary_health_indicators_BRFSS2015.csv')

print("=" * 70)
print("FORENSICS: PAPER'S DATASET vs ORIGINAL")
print("=" * 70)
print(f"Paper's balanced:   {df_balanced.shape[0]:,} samples")
print(f"Original imbalanced: {df_imbalanced.shape[0]:,} samples")
print(f"Samples discarded:   {df_imbalanced.shape[0] - df_balanced.shape[0]:,}")


FORENSICS: PAPER'S DATASET vs ORIGINAL
Paper's balanced:   70,692 samples
Original imbalanced: 253,680 samples
Samples discarded:   182,988


In [3]:
# Check 1: Class distribution
print("\n" + "-" * 70)
print("CHECK 1: CLASS DISTRIBUTION")
print("-" * 70)
print("\nPaper's balanced dataset:")
print(df_balanced['Diabetes_binary'].value_counts())
print(f"Ratio: {df_balanced['Diabetes_binary'].value_counts()[0] / df_balanced['Diabetes_binary'].value_counts()[1]:.2f}:1")

print("\nOriginal imbalanced dataset:")
print(df_imbalanced['Diabetes_binary'].value_counts())
print(f"Ratio: {df_imbalanced['Diabetes_binary'].value_counts()[0] / df_imbalanced['Diabetes_binary'].value_counts()[1]:.2f}:1")



----------------------------------------------------------------------
CHECK 1: CLASS DISTRIBUTION
----------------------------------------------------------------------

Paper's balanced dataset:
Diabetes_binary
0.0    35346
1.0    35346
Name: count, dtype: int64
Ratio: 1.00:1

Original imbalanced dataset:
Diabetes_binary
0.0    218334
1.0     35346
Name: count, dtype: int64
Ratio: 6.18:1


In [4]:
# Check 2: Scaling detection
print("\n" + "-" * 70)
print("CHECK 2: SCALING DETECTION")
print("-" * 70)

features = [c for c in df_balanced.columns if c != 'Diabetes_binary']
stats_balanced = df_balanced[features].agg(['mean', 'std', 'min', 'max']).T
stats_imbalanced = df_imbalanced[features].agg(['min', 'max']).T

# StandardScaler check
scaled = stats_balanced['mean'].abs().mean() < 0.1 and (stats_balanced['std'] - 1).abs().mean() < 0.1
print(f"StandardScaler applied: {'YES' if scaled else 'NO'}")

# MinMaxScaler check  
minmax = (stats_balanced['min'] == 0).all() and (stats_balanced['max'] == 1).all()
print(f"MinMaxScaler applied: {'YES' if minmax else 'NO'}")

# Raw values check
comparison = pd.DataFrame({
    'bal_min': stats_balanced['min'],
    'bal_max': stats_balanced['max'],
    'imbal_min': stats_imbalanced['min'],
    'imbal_max': stats_imbalanced['max']
})
comparison['min_match'] = comparison['bal_min'] == comparison['imbal_min']
comparison['max_match'] = comparison['bal_max'] == comparison['imbal_max']
raw = comparison['min_match'].all() and comparison['max_match'].all()
print(f"Raw values (no scaling): {'YES' if raw else 'NO'}")



----------------------------------------------------------------------
CHECK 2: SCALING DETECTION
----------------------------------------------------------------------
StandardScaler applied: NO
MinMaxScaler applied: NO
Raw values (no scaling): YES


In [5]:
# Check 3: Balancing method detection
print("\n" + "-" * 70)
print("CHECK 3: BALANCING METHOD DETECTION")
print("-" * 70)

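# Check if balanced data is a subset (undersampling) or has new samples (SMOTE).
# Note: set() collapses duplicate rows, so the counts printed below are unique
# feature rows, not the 35,346 diabetic samples reported in Check 1.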
diabetic_balanced = df_balanced[df_balanced['Diabetes_binary'] == 1].drop('Diabetes_binary', axis=1)
diabetic_imbalanced = df_imbalanced[df_imbalanced['Diabetes_binary'] == 1].drop('Diabetes_binary', axis=1)

bal_tuples = set(diabetic_balanced.apply(tuple, axis=1))
imbal_tuples = set(diabetic_imbalanced.apply(tuple, axis=1))
overlap = len(bal_tuples.intersection(imbal_tuples))

print(f"Diabetic samples in balanced: {len(bal_tuples):,}")
print(f"Diabetic samples in imbalanced: {len(imbal_tuples):,}")
print(f"Overlap: {overlap:,} ({overlap/len(bal_tuples)*100:.1f}%)")

if overlap / len(bal_tuples) > 0.95:
    print("\n→ Method: RANDOM UNDERSAMPLING")
    print("  (All diabetic cases retained, non-diabetic randomly sampled)")
else:
    print("\n→ Method: Unknown (possibly SMOTE or augmentation)")



----------------------------------------------------------------------
CHECK 3: BALANCING METHOD DETECTION
----------------------------------------------------------------------
Diabetic samples in balanced: 35,097
Diabetic samples in imbalanced: 35,097
Overlap: 35,097 (100.0%)

→ Method: RANDOM UNDERSAMPLING
  (All diabetic cases retained, non-diabetic randomly sampled)
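The check above covers only the diabetic class and compares sets, so duplicate rows are counted once. A stricter variant compares rows as multisets, so a row that appears twice in the candidate subset must appear at least twice in the full data. The sketch below uses only the standard library; `is_row_subset` and the toy rows are hypothetical illustrations, not part of the original notebook.

```python
from collections import Counter

def is_row_subset(subset_rows, full_rows):
    """True if every row in subset_rows occurs in full_rows at least as often."""
    need = Counter(map(tuple, subset_rows))
    have = Counter(map(tuple, full_rows))
    return all(have[row] >= count for row, count in need.items())

# Toy rows (made-up values, not BRFSS records)
full = [(1, 30.0), (1, 30.0), (0, 22.0), (0, 25.0)]
undersampled = [(1, 30.0), (1, 30.0), (0, 22.0)]  # drawn from `full`
synthetic = [(1, 30.0), (0.4, 27.3)]              # contains an interpolated row

print(is_row_subset(undersampled, full))  # True
print(is_row_subset(synthetic, full))     # False
```

In this notebook the same function could presumably be applied to the non-diabetic rows of `df_balanced` and `df_imbalanced` (for example via `itertuples(index=False)`) to confirm that the majority class was sampled rather than synthesized.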


In [6]:
# Forensics summary
print("\n" + "=" * 70)
print("FORENSICS SUMMARY")
print("=" * 70)
print("""
PAPER'S APPROACH:
  - Scaling: None (raw values)
  - Balancing: Random undersampling of majority class
  - Data discarded: ~183,000 non-diabetic samples
  - Test distribution: Balanced 50/50

OUR EXTENSION:
  - Scaling: StandardScaler on ordinal + numeric features
  - Balancing: SMOTE oversampling (preserves all original data)
  - Additional: SMOTE-Tomek hybrid, class_weight comparison
  - Test distribution: Realistic 86/14 imbalanced

WHY THIS MATTERS:
  - Our SMOTE approach is a genuine methodological extension
  - Different test distributions will affect F1 scores (not model quality)
  - Fair comparison requires ROC-AUC (threshold-independent)
""")



FORENSICS SUMMARY

PAPER'S APPROACH:
  - Scaling: None (raw values)
  - Balancing: Random undersampling of majority class
  - Data discarded: ~183,000 non-diabetic samples
  - Test distribution: Balanced 50/50

OUR EXTENSION:
  - Scaling: StandardScaler on ordinal + numeric features
  - Balancing: SMOTE oversampling (preserves all original data)
  - Additional: SMOTE-Tomek hybrid, class_weight comparison
  - Test distribution: Realistic 86/14 imbalanced

WHY THIS MATTERS:
  - Our SMOTE approach is a genuine methodological extension
  - Different test distributions will affect F1 scores (not model quality)
  - Fair comparison requires ROC-AUC (threshold-independent)
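The point that the test distribution moves F1 without any change in the model can be checked with simple arithmetic. The sketch below assumes a hypothetical classifier with a fixed true positive rate of 0.80 and false positive rate of 0.20 (illustrative values, not results from this notebook or the paper) and evaluates F1 at the two class prevalences discussed above.

```python
def f1_at_prevalence(tpr, fpr, prevalence):
    """F1 of a classifier with fixed TPR/FPR when positives make up `prevalence` of the test set."""
    tp = tpr * prevalence              # true positives (as a fraction of all samples)
    fp = fpr * (1 - prevalence)        # false positives
    fn = (1 - tpr) * prevalence        # false negatives
    return 2 * tp / (2 * tp + fp + fn)

# Same hypothetical classifier, two test distributions
f1_balanced = f1_at_prevalence(tpr=0.80, fpr=0.20, prevalence=0.50)
f1_realistic = f1_at_prevalence(tpr=0.80, fpr=0.20, prevalence=0.139)

print(f"F1 on 50/50 test set: {f1_balanced:.3f}")   # 0.800
print(f"F1 on 86/14 test set: {f1_realistic:.3f}")  # 0.527
```

TPR and FPR do not depend on prevalence, which is why the ROC curve (and its AUC) is the same in both settings while precision and F1 drop on the imbalanced test set.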



 ## 1. Load Data

In [7]:
print("\n" + "=" * 70)
print("PREPROCESSING: OUR APPROACH")
print("=" * 70)

df = pd.read_csv('diabetes_binary_health_indicators_BRFSS2015.csv')
X = df.drop('Diabetes_binary', axis=1)
y = df['Diabetes_binary']

print(f"\nOriginal data shape: {X.shape}")
print(f"Class distribution:\n{y.value_counts(normalize=True).round(3)}")
print(f"Imbalance ratio: {y.value_counts()[0] / y.value_counts()[1]:.1f}:1")



PREPROCESSING: OUR APPROACH

Original data shape: (253680, 21)
Class distribution:
Diabetes_binary
0.0    0.861
1.0    0.139
Name: proportion, dtype: float64
Imbalance ratio: 6.2:1


In [8]:
# Feature groups
BINARY_FEATURES = [
    'HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke',
    'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
    'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'DiffWalk', 'Sex'
]
ORDINAL_FEATURES = ['GenHlth', 'Age', 'Education', 'Income']
NUMERIC_FEATURES = ['BMI', 'MentHlth', 'PhysHlth']
FEATURES_TO_SCALE = ORDINAL_FEATURES + NUMERIC_FEATURES


 ## 2. Train/Test Split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("\n" + "=" * 70)
print("TRAIN/TEST SPLIT")
print("=" * 70)
print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"Train class distribution:\n{y_train.value_counts(normalize=True).round(3)}")



TRAIN/TEST SPLIT
Training set: 177,576 samples
Test set: 76,104 samples
Train class distribution:
Diabetes_binary
0.0    0.861
1.0    0.139
Name: proportion, dtype: float64


 ## 3. Feature Engineering



 Based on EDA findings:

 - BMI is right-skewed → log transform

 - MentHlth/PhysHlth are zero-inflated → binning
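As a rough illustration of the two transforms, the sketch below reproduces the binning rule used in the next cell (right-closed bins `(-1, 0]`, `(0, 10]`, `(10, 30]`) with the standard library and applies a natural log to a few BMI values. The helper name `bin_health_days` and the sample values are hypothetical.

```python
import math
from bisect import bisect_left

# Upper edges of the right-closed bins (-1, 0], (0, 10], (10, 30]
BIN_EDGES = [0, 10, 30]

def bin_health_days(days):
    """Map a 0-30 day count to 0 (none), 1 (1-10 days), or 2 (11-30 days)."""
    if not 0 <= days <= 30:
        raise ValueError(f"days out of range: {days}")
    return bisect_left(BIN_EDGES, days)

print([bin_health_days(d) for d in (0, 1, 10, 11, 30)])  # [0, 1, 1, 2, 2]

# Log transform compresses the right tail of a skewed variable
for bmi in (20.0, 27.0, 40.0, 80.0):  # illustrative BMI values
    print(bmi, round(math.log(bmi), 3))
```

The zero bin keeps the large mass of "no unhealthy days" responses separate from the positive counts, which is the motivation given above for binning zero-inflated features.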

In [10]:
def engineer_features(df, log_transform_bmi=True, bin_health_days=True):
    """Apply feature engineering based on EDA insights."""
    df = df.copy()
    
    if log_transform_bmi:
        df['BMI_log'] = np.log(df['BMI'])
    
    if bin_health_days:
        df['MentHlth_binned'] = pd.cut(
            df['MentHlth'], bins=[-1, 0, 10, 30], labels=[0, 1, 2]
        ).astype(int)
        df['PhysHlth_binned'] = pd.cut(
            df['PhysHlth'], bins=[-1, 0, 10, 30], labels=[0, 1, 2]
        ).astype(int)
    
    return df


In [11]:
APPLY_FEATURE_ENGINEERING = True

if APPLY_FEATURE_ENGINEERING:
    X_train = engineer_features(X_train)
    X_test = engineer_features(X_test)
    new_features = [c for c in X_train.columns if c not in X.columns]
    print("\n" + "=" * 70)
    print("FEATURE ENGINEERING")
    print("=" * 70)
    print(f"New features added: {new_features}")
    print(f"Total features: {X_train.shape[1]}")



FEATURE ENGINEERING
New features added: ['BMI_log', 'MentHlth_binned', 'PhysHlth_binned']
Total features: 24


 ## 4. Class Balancing



 We create multiple versions for comparison:

 - **SMOTE**: Synthetic oversampling of minority class

 - **SMOTE-Tomek**: SMOTE + remove overlapping samples

 - **None**: Use class_weight in models instead
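To make "synthetic oversampling" concrete, the sketch below shows only the interpolation step at the core of SMOTE: a new point is placed at a random position on the segment between a minority sample and one of its minority-class neighbors. It is a simplified illustration, not the imbalanced-learn implementation (which also performs the nearest-neighbor search), and the two feature vectors are made up.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Interpolate between a minority sample and a neighbor: x + lam * (neighbor - x)."""
    lam = rng.random()  # uniform in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 28.0]         # hypothetical [HighBP, BMI] of a minority sample
neighbor = [0.0, 34.0]  # hypothetical minority-class neighbor

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

One consequence worth keeping in mind: when the two samples disagree on a binary feature, the interpolated value falls strictly between 0 and 1. With 14 binary features in this dataset, it may be worth inspecting `X_train_smote` for fractional values; imbalanced-learn also offers `SMOTENC` for mixed categorical and continuous data, though that variant is not used here.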

In [12]:
def balance_data(X_train, y_train, method='smote', random_state=42):
    """Balance training data using specified method."""
    if method == 'none':
        return X_train, y_train
    elif method == 'smote':
        sampler = SMOTE(random_state=random_state)
    elif method == 'smote_tomek':
        sampler = SMOTETomek(random_state=random_state)
    else:
        raise ValueError(f"Unknown method: {method}")
    
    X_balanced, y_balanced = sampler.fit_resample(X_train, y_train)
    return X_balanced, y_balanced


In [13]:
print("\n" + "=" * 70)
print("CLASS BALANCING")
print("=" * 70)

# No balancing (for class_weight approach)
X_train_none, y_train_none = balance_data(X_train, y_train, method='none')
print(f"\nNo balancing (class_weight):")
print(f"  Samples: {len(X_train_none):,}")
print(f"  Class 0: {(y_train_none == 0).sum():,}, Class 1: {(y_train_none == 1).sum():,}")

# SMOTE
X_train_smote, y_train_smote = balance_data(X_train, y_train, method='smote')
print(f"\nSMOTE:")
print(f"  Samples: {len(X_train_smote):,} (was {len(X_train):,})")
print(f"  Class 0: {(y_train_smote == 0).sum():,}, Class 1: {(y_train_smote == 1).sum():,}")

# SMOTE-Tomek
X_train_smote_tomek, y_train_smote_tomek = balance_data(X_train, y_train, method='smote_tomek')
print(f"\nSMOTE-Tomek:")
print(f"  Samples: {len(X_train_smote_tomek):,}")
print(f"  Class 0: {(y_train_smote_tomek == 0).sum():,}, Class 1: {(y_train_smote_tomek == 1).sum():,}")



CLASS BALANCING

No balancing (class_weight):
  Samples: 177,576
  Class 0: 152,834, Class 1: 24,742

SMOTE:
  Samples: 305,668 (was 177,576)
  Class 0: 152,834, Class 1: 152,834

SMOTE-Tomek:
  Samples: 305,466
  Class 0: 152,733, Class 1: 152,733
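For the unresampled variant, the weights that scikit-learn's `class_weight="balanced"` option would assign can be worked out from the counts printed above, using the documented formula `n_samples / (n_classes * count_per_class)`. The sketch below is plain arithmetic on those counts; the variable names are illustrative.

```python
# Training class counts taken from the "No balancing" output above
class_counts = {0: 152_834, 1: 24_742}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
# class 0: weight 0.581
# class 1: weight 3.589
```

The ratio of the two weights equals the imbalance ratio (about 6.2:1), so each class contributes the same total weight to the loss without adding or removing any samples.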


 ## 5. Feature Scaling



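 Unlike the paper (no scaling), we apply StandardScaler to the original ordinal and numeric features. Binary features and the engineered features (BMI_log, MentHlth_binned, PhysHlth_binned) are not in FEATURES_TO_SCALE and are left unscaled.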
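The rule that matters for leakage is that the mean and standard deviation come from the training data only and are then reused on the test data. A minimal standard-library sketch of that fit/transform split follows; the helper names and BMI values are hypothetical, and the population standard deviation is used to match StandardScaler's behavior.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn mean and population std from training values only."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

train_bmi = [22.0, 27.0, 31.0, 40.0]  # illustrative training values
test_bmi = [25.0, 35.0]               # illustrative test values

mean, std = fit_standardizer(train_bmi)
train_z = standardize(train_bmi, mean, std)
test_z = standardize(test_bmi, mean, std)  # reuses the training statistics

print([round(z, 3) for z in test_z])  # [-0.758, 0.758]
```

The transformed training values have mean 0 and standard deviation 1 by construction; the test values generally do not, which is expected.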

In [14]:
def scale_features(X_train, X_test, features_to_scale):
    """Scale specified features. Fit on train, transform both."""
    X_train_scaled = X_train.copy()
    X_test_scaled = X_test.copy()
    
    scaler = StandardScaler()
    X_train_scaled[features_to_scale] = scaler.fit_transform(X_train[features_to_scale])
    X_test_scaled[features_to_scale] = scaler.transform(X_test[features_to_scale])
    
    return X_train_scaled, X_test_scaled, scaler


In [15]:
print("\n" + "=" * 70)
print("FEATURE SCALING")
print("=" * 70)
print(f"Scaling features: {FEATURES_TO_SCALE}")

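# Scale each version with its own scaler, fit on that version's training data.
# Caveat: only the SMOTE-fitted scaler and its X_test_scaled are kept; the test
# sets returned for the other two versions are discarded (`_`).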
X_train_smote_scaled, X_test_scaled, scaler = scale_features(
    X_train_smote, X_test, FEATURES_TO_SCALE
)
X_train_smote_tomek_scaled, _, _ = scale_features(
    X_train_smote_tomek, X_test, FEATURES_TO_SCALE
)
X_train_unbalanced_scaled, _, _ = scale_features(
    X_train, X_test, FEATURES_TO_SCALE
)

print("\nScaled feature stats (SMOTE training set):")
print(X_train_smote_scaled[FEATURES_TO_SCALE].describe().round(2).loc[['mean', 'std']])



FEATURE SCALING
Scaling features: ['GenHlth', 'Age', 'Education', 'Income', 'BMI', 'MentHlth', 'PhysHlth']

Scaled feature stats (SMOTE training set):
      GenHlth  Age  Education  Income  BMI  MentHlth  PhysHlth
mean      0.0 -0.0       -0.0     0.0  0.0       0.0      -0.0
std       1.0  1.0        1.0     1.0  1.0       1.0       1.0


 ## 6. Save Preprocessed Data

In [16]:
preprocessed_data = {
    'smote': {
        'X_train': X_train_smote_scaled,
        'y_train': y_train_smote,
        'X_test': X_test_scaled,
        'y_test': y_test,
        'description': 'SMOTE oversampling + StandardScaler'
    },
    'smote_tomek': {
        'X_train': X_train_smote_tomek_scaled,
        'y_train': y_train_smote_tomek,
        'X_test': X_test_scaled,
        'y_test': y_test,
        'description': 'SMOTE-Tomek hybrid + StandardScaler'
    },
    'class_weight': {
        'X_train': X_train_unbalanced_scaled,
        'y_train': y_train,
        'X_test': X_test_scaled,
        'y_test': y_test,
        'description': 'No resampling (use class_weight="balanced")'
    }
}

print("\n" + "=" * 70)
print("PREPROCESSED DATA SUMMARY")
print("=" * 70)
for name, data in preprocessed_data.items():
    print(f"\n{name}:")
    print(f"  {data['description']}")
    print(f"  Train: {data['X_train'].shape[0]:,} samples, {data['X_train'].shape[1]} features")
    print(f"  Test: {data['X_test'].shape[0]:,} samples")



PREPROCESSED DATA SUMMARY

smote:
  SMOTE oversampling + StandardScaler
  Train: 305,668 samples, 24 features
  Test: 76,104 samples

smote_tomek:
  SMOTE-Tomek hybrid + StandardScaler
  Train: 305,466 samples, 24 features
  Test: 76,104 samples

class_weight:
  No resampling (use class_weight="balanced")
  Train: 177,576 samples, 24 features
  Test: 76,104 samples


In [17]:
# Save
with open('preprocessed_data.pkl', 'wb') as f:
    pickle.dump(preprocessed_data, f)

with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("\n✓ Data saved to preprocessed_data.pkl")
print("✓ Scaler saved to scaler.pkl")



✓ Data saved to preprocessed_data.pkl
✓ Scaler saved to scaler.pkl
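A downstream script would presumably reload these files with `pickle.load`. The sketch below demonstrates the round trip on a small stand-in dictionary with the same top-level keys, written to a temporary directory so it does not touch the real files. The stand-in contents are placeholders, not the notebook's data.

```python
import os
import pickle
import tempfile

# Stand-in with the same top-level keys as preprocessed_data (placeholder contents)
bundle = {
    'smote': {'description': 'SMOTE oversampling + StandardScaler'},
    'smote_tomek': {'description': 'SMOTE-Tomek hybrid + StandardScaler'},
    'class_weight': {'description': 'No resampling (use class_weight="balanced")'},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessed_data.pkl')
    with open(path, 'wb') as f:
        pickle.dump(bundle, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(sorted(restored))  # ['class_weight', 'smote', 'smote_tomek']
```

Pickle files can execute arbitrary code when loaded, so they should only be read from trusted sources.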


 ## 7. Data Leakage Check

In [18]:
print("\n" + "=" * 70)
print("DATA LEAKAGE CHECK")
print("=" * 70)

# Test set size unchanged
print(f"✓ Test set size unchanged: {len(X_test) == len(X_test_scaled)}")

# Test class distribution unchanged
print(f"✓ Test class distribution preserved:")
print(f"    {y_test.value_counts(normalize=True).round(3).to_dict()}")

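# Sanity check: training means near 0 are consistent with the scaler having been
# fit on this training set (this alone does not prove the test set was excluded)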
train_means = X_train_smote_scaled[FEATURES_TO_SCALE].mean()
print(f"✓ Training scaled means near 0: {(train_means.abs() < 0.01).all()}")

print("\n" + "=" * 70)
print("PREPROCESSING COMPLETE")
print("=" * 70)
print("""
SUMMARY OF DIFFERENCES FROM PAPER:

| Aspect            | Paper             | Our Extension           |
|-------------------|-------------------|-------------------------|
| Scaling           | None              | StandardScaler          |
| Balancing         | Undersampling     | SMOTE / SMOTE-Tomek     |
| Data preserved    | 70,692            | 253,680 (all)           |
| Test distribution | 50/50 balanced    | 86/14 realistic         |
| Feature eng.      | None              | BMI_log, binned health  |

→ Proceed to 03_modeling.py
""")


DATA LEAKAGE CHECK
✓ Test set size unchanged: True
✓ Test class distribution preserved:
    {0.0: 0.861, 1.0: 0.139}
✓ Training scaled means near 0: True

PREPROCESSING COMPLETE

SUMMARY OF DIFFERENCES FROM PAPER:

| Aspect            | Paper             | Our Extension           |
|-------------------|-------------------|-------------------------|
| Scaling           | None              | StandardScaler          |
| Balancing         | Undersampling     | SMOTE / SMOTE-Tomek     |
| Data preserved    | 70,692            | 253,680 (all)           |
| Test distribution | 50/50 balanced    | 86/14 realistic         |
| Feature eng.      | None              | BMI_log, binned health  |

→ Proceed to 03_modeling.py

