# 2. Feature Engineering

## Objective
Engineer domain-specific features, encode categorical variables, scale numerical variables, select the final feature set, and balance the training classes for modeling.

### Input
- `data/processed/1_cleaned_data.csv`

### Output
- `data/processed/2_featured_data.csv`
- `data/processed/3_final_data_resampled.npz`
- `models/preprocessor.joblib`

---

In [1]:
# ==============================================================================
# SETUP CELL: Environment and Imports
# ==============================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import joblib

# Set project root directory for robust path handling
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
DATA_DIR = os.path.join(PROJECT_ROOT, "data")
RAW_DATA_DIR = os.path.join(DATA_DIR, "raw")
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, "processed")
MODELS_DIR = os.path.join(PROJECT_ROOT, "models")

# Ensure directories exist
os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)
os.makedirs(MODELS_DIR, exist_ok=True)

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

print("Libraries imported and environment set successfully.")
print(f"Project Root: {PROJECT_ROOT}")

Libraries imported and environment set successfully.
Project Root: /home/ghost/workspace/university/machine_learning_and_computer_vision/assessment_main


## Step 1: Load the Cleaned Data

Load the cleaned dataset from the previous notebook and verify its structure.

In [28]:
# Load the cleaned data
file_path = os.path.join(PROCESSED_DATA_DIR, '1_cleaned_data.csv')
df_clean = pd.read_csv(file_path)

# Display basic information about the dataset
print("Dataset shape:", df_clean.shape)
print("\nFirst few rows:")
print(df_clean.head())
print("\nDataset info:")
df_clean.info()

Dataset shape: (101763, 47)

First few rows:
              race  gender      age  admission_type_id  \
0        Caucasian  Female   [0-10)                  6   
1        Caucasian  Female  [10-20)                  1   
2  AfricanAmerican  Female  [20-30)                  1   
3        Caucasian    Male  [30-40)                  1   
4        Caucasian    Male  [40-50)                  1   

   discharge_disposition_id  admission_source_id  time_in_hospital  \
0                        25                    1                 1   
1                         1                    7                 3   
2                         1                    7                 2   
3                         1                    7                 2   
4                         1                    7                 1   

          medical_specialty  num_lab_procedures  num_procedures  ...  insulin  \
0  Pediatrics-Endocrinology                  41               0  ...       No   
1                   Mis

## Step 2: Domain-Specific Feature Engineering

Create new features based on the research paper's logic about HbA1c measurement and medication changes.

In [29]:
# Identify the medication columns by name prefix (23 are present in this dataset)
medication_prefixes = (
    'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
    'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
    'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
    'examide', 'citoglipton', 'insulin', 'glyburide-metformin',
    'glipizide-metformin', 'glimepiride-pioglitazone',
    'metformin-rosiglitazone', 'metformin-pioglitazone',
)
# str.startswith accepts a tuple and returns True if any prefix matches
medication_columns = [col for col in df_clean.columns if col.startswith(medication_prefixes)]

print(f"Found {len(medication_columns)} medication columns:")
print(medication_columns)

Found 23 medication columns:
['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']
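
Prefix matching also captures any other column whose name merely begins with a drug name. A stricter alternative is to match exact names. The sketch below assumes the 23 names printed above are the complete list; `find_medication_columns` and the `metformin_dose_note` column are hypothetical and exist only for illustration.

```python
import pandas as pd

# The 23 medication column names printed by the cell above
MEDICATION_NAMES = [
    "metformin", "repaglinide", "nateglinide", "chlorpropamide", "glimepiride",
    "acetohexamide", "glipizide", "glyburide", "tolbutamide", "pioglitazone",
    "rosiglitazone", "acarbose", "miglitol", "troglitazone", "tolazamide",
    "examide", "citoglipton", "insulin", "glyburide-metformin",
    "glipizide-metformin", "glimepiride-pioglitazone",
    "metformin-rosiglitazone", "metformin-pioglitazone",
]


def find_medication_columns(df: pd.DataFrame) -> list:
    """Return the known medication columns present in df, in df's column order."""
    known = set(MEDICATION_NAMES)
    return [col for col in df.columns if col in known]


# Toy frame standing in for df_clean; 'metformin_dose_note' is a made-up column
# that a prefix match on 'metformin' would pick up by accident.
toy = pd.DataFrame(
    columns=["race", "metformin", "insulin", "glyburide-metformin", "metformin_dose_note"]
)
toy_meds = find_medication_columns(toy)
print(toy_meds)
```

On the real data both approaches should select the same columns, as long as no unrelated column shares a drug-name prefix.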


In [30]:
# Create the change_of_meds feature
# A change is defined as any medication column having the value 'Up' or 'Down' (case-sensitive)
med_change_mask = df_clean[medication_columns].isin(['Up', 'Down']).any(axis=1)
df_clean['change_of_meds'] = med_change_mask.astype(int)

print("Distribution of change_of_meds feature:")
print(df_clean['change_of_meds'].value_counts())
print(f"\nPercentage with medication changes: {df_clean['change_of_meds'].mean() * 100:.2f}%")

Distribution of change_of_meds feature:
change_of_meds
0    74060
1    27703
Name: count, dtype: int64

Percentage with medication changes: 27.22%
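
To make the row-wise logic concrete, here is a minimal sketch on a toy frame. The two columns and their values are invented for illustration; only the `isin(...).any(axis=1)` pattern comes from the cell above.

```python
import pandas as pd

# Toy stand-in for df_clean[medication_columns]
toy = pd.DataFrame({
    "metformin": ["No", "Steady", "Up", "No"],
    "insulin": ["No", "Down", "Steady", "Steady"],
})

# A row is flagged when at least one medication value is 'Up' or 'Down'
changed = toy.isin(["Up", "Down"]).any(axis=1).astype(int)
print(changed.tolist())

# The match is case-sensitive: lowercase labels flag nothing
lowercase_hits = int(toy.isin(["up", "down"]).any(axis=1).sum())
print(lowercase_hits)
```

'Steady' and 'No' both count as "no change", so a patient who stays on a drug at the same dose is not flagged.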


In [31]:
# Create the HbA1c_Change_Group feature
# This combines HbA1c test results with medication changes
conditions = [
    df_clean['A1Cresult'].isna(),  # Group 1: No Test
    df_clean['A1Cresult'] == 'Norm',  # Group 2: Normal
    (df_clean['A1Cresult'].isin(['>7', '>8'])) & (df_clean['change_of_meds'] == 0),  # Group 3: High, No Change
    (df_clean['A1Cresult'].isin(['>7', '>8'])) & (df_clean['change_of_meds'] == 1)   # Group 4: High, Change
]

choices = [
    'No Test',
    'Normal',
    'High, No Change',
    'High, Change'
]

df_clean['HbA1c_Change_Group'] = np.select(conditions, choices, default='Unknown')

print("Distribution of HbA1c_Change_Group feature:")
print(df_clean['HbA1c_Change_Group'].value_counts())
print("\nPercentage distribution:")
print(df_clean['HbA1c_Change_Group'].value_counts(normalize=True) * 100)

Distribution of HbA1c_Change_Group feature:
HbA1c_Change_Group
No Test            84745
High, No Change     7043
Normal              4990
High, Change        4985
Name: count, dtype: int64

Percentage distribution:
HbA1c_Change_Group
No Test            83.276829
High, No Change     6.920983
Normal              4.903550
High, Change        4.898637
Name: proportion, dtype: float64
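
The grouping can be checked on a small example. The sketch below reuses the same conditions on a toy frame with invented values, one row per expected group plus a second "High, No Change" row.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_clean with only the two columns the grouping needs
toy = pd.DataFrame({
    "A1Cresult": [np.nan, "Norm", ">7", ">8", ">8"],
    "change_of_meds": [1, 0, 0, 1, 0],
})

high = toy["A1Cresult"].isin([">7", ">8"])
conditions = [
    toy["A1Cresult"].isna(),              # no HbA1c test recorded
    toy["A1Cresult"] == "Norm",           # test taken, normal result
    high & (toy["change_of_meds"] == 0),  # high result, medication unchanged
    high & (toy["change_of_meds"] == 1),  # high result, medication changed
]
choices = ["No Test", "Normal", "High, No Change", "High, Change"]

# np.select takes the first matching condition per row; rows matching none get the default
groups = np.select(conditions, choices, default="Unknown")
print(groups.tolist())
```

In the real output above the four group counts sum to 101763, so no row fell through to 'Unknown'.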


## Step 3: Define Target and Feature Variables

Create the binary target variable and feature matrix for modeling.

In [32]:
# Create the target variable (y)
# Convert readmitted to binary: <30 -> 1, >30 or NO -> 0
y = (df_clean['readmitted'] == '<30').astype(int)

print("Target variable distribution:")
print(y.value_counts())
print("\nPercentage distribution:")
print(y.value_counts(normalize=True) * 100)
print(f"\nClass imbalance ratio: {y.value_counts()[1] / y.value_counts()[0]:.3f}")

Target variable distribution:
readmitted
0    90406
1    11357
Name: count, dtype: int64

Percentage distribution:
readmitted
0    88.839755
1    11.160245
Name: proportion, dtype: float64

Class imbalance ratio: 0.126
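
The printed figures follow directly from the class counts. As a quick arithmetic check, using only the counts from the output above:

```python
# Class counts copied from the target distribution printed above
n_negative = 90406  # readmitted is '>30' or 'NO'
n_positive = 11357  # readmitted is '<30'

n_total = n_negative + n_positive
positive_rate = n_positive / n_total       # share of positives in the whole dataset
imbalance_ratio = n_positive / n_negative  # minority count divided by majority count

print(f"Positive rate: {positive_rate * 100:.2f}%")
print(f"Imbalance ratio: {imbalance_ratio:.3f}")
```

The ratio is minority over majority, so a value of 1.0 would mean perfectly balanced classes.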


In [34]:
# Create the feature matrix (X)
# Drop both target columns (readmitted, readmitted_binary), the individual medication columns, and the original change column
columns_to_drop = ['readmitted', 'readmitted_binary'] + medication_columns + ['change']
X = df_clean.drop(columns=columns_to_drop)

print(f"Feature matrix shape: {X.shape}")
print("\nFeature columns:")
print(X.columns.tolist())
print("\nFirst few rows of feature matrix:")
X.head()

Feature matrix shape: (101763, 23)

Feature columns:
['race', 'gender', 'age', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'diabetesMed', 'change_of_meds', 'HbA1c_Change_Group']

First few rows of feature matrix:


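|   | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | ... | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | diabetesMed | change_of_meds | HbA1c_Change_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [0-10) | 6 | 25 | 1 | 1 | Pediatrics-Endocrinology | 41 | 0 | ... | 0 | 250.83 | 0.0 | 0 | 1 | NaN | NaN | No | 0 | No Test |
| 1 | Caucasian | Female | [10-20) | 1 | 1 | 7 | 3 | Missing | 59 | 0 | ... | 0 | 276.0 | 250.01 | 255 | 9 | NaN | NaN | Yes | 1 | No Test |
| 2 | AfricanAmerican | Female | [20-30) | 1 | 1 | 7 | 2 | Missing | 11 | 5 | ... | 1 | 648.0 | 250.0 | V27 | 6 | NaN | NaN | Yes | 0 | No Test |
| 3 | Caucasian | Male | [30-40) | 1 | 1 | 7 | 2 | Missing | 44 | 1 | ... | 0 | 8.0 | 250.43 | 403 | 7 | NaN | NaN | Yes | 1 | No Test |
| 4 | Caucasian | Male | [40-50) | 1 | 1 | 7 | 1 | Missing | 51 | 0 | ... | 0 | 197.0 | 157.0 | 250 | 5 | NaN | NaN | Yes | 0 | No Test |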


## Step 4: Identify Numerical and Categorical Features

Separate features into numerical and categorical for appropriate preprocessing.

In [35]:
# Identify numerical and categorical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

print("Numerical features:")
print(numerical_features)
print(f"\nCount: {len(numerical_features)} numerical features")

print("\nCategorical features:")
print(categorical_features)
print(f"\nCount: {len(categorical_features)} categorical features")

# Verify our engineered features are correctly categorized
print(f"\nEngineered features:")
print(f"change_of_meds: {'Numerical' if 'change_of_meds' in numerical_features else 'Categorical'}")
print(f"HbA1c_Change_Group: {'Numerical' if 'HbA1c_Change_Group' in numerical_features else 'Categorical'}")

Numerical features:
['admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'change_of_meds']

Count: 12 numerical features

Categorical features:
['race', 'gender', 'age', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult', 'diabetesMed', 'HbA1c_Change_Group']

Count: 11 categorical features

Engineered features:
change_of_meds: Numerical
HbA1c_Change_Group: Categorical
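
Note that dtype-based selection places the three `*_id` columns among the numerical features, so they are scaled as if their codes were ordered quantities. Whether that is appropriate is a modeling judgment. The sketch below shows one possible alternative, casting ID-like columns to `category` before the split; the toy frame is invented, and the results in this notebook were produced without this step.

```python
import pandas as pd

# Toy stand-in for X with one ID-like column, one count, and one text column
toy = pd.DataFrame({
    "admission_type_id": [6, 1, 1],
    "time_in_hospital": [1, 3, 2],
    "race": ["Caucasian", "Caucasian", "AfricanAmerican"],
})

# Assumption: these integer codes are nominal labels, not magnitudes
id_like = ["admission_type_id", "discharge_disposition_id", "admission_source_id"]
present = [col for col in id_like if col in toy.columns]
toy = toy.astype({col: "category" for col in present})

# Select by "numeric or not" so the cast columns land on the categorical side
numerical = toy.select_dtypes(include=["number"]).columns.tolist()
categorical = toy.select_dtypes(exclude=["number"]).columns.tolist()
print(numerical)
print(categorical)
```

Applying this to the real data would change the feature counts and the transformed shape reported in the later steps.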


## Step 5: Split Data into Training and Testing Sets

Split the data using stratification to maintain class distribution.

In [36]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("\nTraining target distribution:")
print(y_train.value_counts(normalize=True) * 100)
print("\nTesting target distribution:")
print(y_test.value_counts(normalize=True) * 100)
print("\nStratification successful - distributions are similar!")

Training set shape: (81410, 23)
Testing set shape: (20353, 23)

Training target distribution:
readmitted
0    88.839209
1    11.160791
Name: proportion, dtype: float64

Testing target distribution:
readmitted
0    88.84194
1    11.15806
Name: proportion, dtype: float64

Stratification successful - distributions are similar!
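
The success message above is printed unconditionally. A programmatic check is a small addition. The sketch below runs on synthetic labels with a 10% positive rate, and the 0.01 tolerance is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 900 negatives and 100 positives
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()

# Fail loudly if the positive rates drift apart by more than the tolerance
assert abs(train_rate - test_rate) < 0.01, "Stratification check failed"
print(f"Train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```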


## Step 6: Build Preprocessing Pipeline

Create a ColumnTransformer with StandardScaler for numerical features and OneHotEncoder for categorical features.

In [37]:
# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ],
    remainder='drop'  # Drop any columns not specified
)

print("Preprocessor created successfully!")
print(f"Numerical features to scale: {len(numerical_features)}")
print(f"Categorical features to encode: {len(categorical_features)}")

# Fit the preprocessor on training data only
preprocessor.fit(X_train)

print("\nPreprocessor fitted on training data!")
print("This prevents data leakage from test set.")

Preprocessor created successfully!
Numerical features to scale: 12
Categorical features to encode: 11

Preprocessor fitted on training data!
This prevents data leakage from test set.


## Step 7: Transform the Data

Apply the fitted preprocessor to transform both training and testing data.

In [38]:
# Transform the training and testing data
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

print("Data transformation completed!")
print(f"Original training shape: {X_train.shape}")
print(f"Transformed training shape: {X_train_transformed.shape}")
print(f"Original testing shape: {X_test.shape}")
print(f"Transformed testing shape: {X_test_transformed.shape}")

print(f"\nFeature expansion due to one-hot encoding: {X_train_transformed.shape[1] - X_train.shape[1]} new features")
print(f"Data type: {type(X_train_transformed)}")
print(f"Sample values from transformed training data (first 5 features of first sample):")
print(X_train_transformed[0, :5])

Data transformation completed!
Original training shape: (81410, 23)
Transformed training shape: (81410, 2290)
Original testing shape: (20353, 23)
Transformed testing shape: (20353, 2290)

Feature expansion due to one-hot encoding: 2267 new features
Data type: <class 'numpy.ndarray'>
Sample values from transformed training data (first 5 features of first sample):
[ 0.67700876  2.70926497 -1.17301699  1.20916585 -1.27570736]


## Step 7b: Address Class Imbalance with SMOTE

**Critical:** The class imbalance (88.8% not readmitted vs 11.2% readmitted) tends to produce poor recall for readmissions, because a model can reach high accuracy by always predicting the majority class.

We use **SMOTE (Synthetic Minority Over-sampling Technique)** to balance the training data:
- Creates synthetic samples for the minority class (readmissions)
- **Only applied to training data** - never to test data
- Reduces the model's bias toward predicting "no readmission"

This matters for healthcare applications, where detecting readmissions (the minority class) is the main goal.

In [41]:
from imblearn.over_sampling import SMOTE

# Check class distribution before SMOTE
print("Class Distribution BEFORE SMOTE:")
print("="*70)
print(f"Training set:")
print(f"  Class 0 (No Readmission): {(y_train == 0).sum()} ({(y_train == 0).sum() / len(y_train) * 100:.2f}%)")
print(f"  Class 1 (Readmission):    {(y_train == 1).sum()} ({(y_train == 1).sum() / len(y_train) * 100:.2f}%)")
print(f"  Imbalance Ratio: {(y_train == 1).sum() / (y_train == 0).sum():.3f}")
print("="*70)

# Apply SMOTE to training data only
print("Applying SMOTE to balance training data...")
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_transformed, y_train)

# Check class distribution after SMOTE
print("Class Distribution AFTER SMOTE:")
print("="*70)
print(f"Training set:")
print(f"  Class 0 (No Readmission): {(y_train_resampled == 0).sum()} ({(y_train_resampled == 0).sum() / len(y_train_resampled) * 100:.2f}%)")
print(f"  Class 1 (Readmission):    {(y_train_resampled == 1).sum()} ({(y_train_resampled == 1).sum() / len(y_train_resampled) * 100:.2f}%)")
print(f"  Imbalance Ratio: {(y_train_resampled == 1).sum() / (y_train_resampled == 0).sum():.3f}")
print("="*70)

print(f"Original training samples: {X_train_transformed.shape[0]}")
print(f"Resampled training samples: {X_train_resampled.shape[0]}")
print(f"Synthetic samples created: {X_train_resampled.shape[0] - X_train_transformed.shape[0]}")

print("⚠ IMPORTANT: Test set remains unchanged to provide unbiased evaluation!")
print(f"Test set shape: {X_test_transformed.shape}")
print(f"Test set class distribution:")
print(f"  Class 0: {(y_test == 0).sum()} ({(y_test == 0).sum() / len(y_test) * 100:.2f}%)")
print(f"  Class 1: {(y_test == 1).sum()} ({(y_test == 1).sum() / len(y_test) * 100:.2f}%)")

Class Distribution BEFORE SMOTE:
Training set:
  Class 0 (No Readmission): 72324 (88.84%)
  Class 1 (Readmission):    9086 (11.16%)
  Imbalance Ratio: 0.126
Applying SMOTE to balance training data...
Class Distribution AFTER SMOTE:
Training set:
  Class 0 (No Readmission): 72324 (50.00%)
  Class 1 (Readmission):    72324 (50.00%)
  Imbalance Ratio: 1.000
Original training samples: 81410
Resampled training samples: 144648
Synthetic samples created: 63238
⚠ IMPORTANT: Test set remains unchanged to provide unbiased evaluation!
Test set shape: (20353, 2290)
Test set class distribution:
  Class 0: 18082 (88.84%)
  Class 1: 2271 (11.16%)
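
One caveat when SMOTE runs after one-hot encoding: it creates each synthetic sample by interpolating between a minority sample and one of its nearest minority neighbours, so one-hot columns can take fractional values. The sketch below reproduces only that interpolation step in NumPy on two invented samples; it is not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two minority samples laid out as [scaled numeric, one-hot A, one-hot B]
sample = np.array([0.5, 1.0, 0.0])     # belongs to category A
neighbour = np.array([1.5, 0.0, 1.0])  # belongs to category B

# SMOTE-style interpolation: a random point on the segment between the two
lam = rng.uniform(0, 1)
synthetic = sample + lam * (neighbour - sample)
print(synthetic)

# The one-hot pair still sums to 1 but is no longer a valid 0/1 indicator
one_hot_part = synthetic[1:]
print(one_hot_part.sum())
```

imbalanced-learn also provides `SMOTENC`, which is designed for mixed numerical and categorical data and is applied before one-hot encoding. Whether switching would change results here is untested.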


## Step 8: Save Preprocessor and Final Data

Save the fitted preprocessor and transformed data for use in the modeling notebook.

In [42]:
# Save the preprocessor
preprocessor_path = os.path.join(MODELS_DIR, "preprocessor.joblib")
joblib.dump(preprocessor, preprocessor_path)
print(f"Preprocessor saved to: {preprocessor_path}")

# Save the RESAMPLED training data with original test data
final_data_path = os.path.join(PROCESSED_DATA_DIR, "3_final_data_resampled.npz")
np.savez_compressed(
    final_data_path,
    X_train=X_train_resampled,  # Using resampled training data
    X_test=X_test_transformed,   # Original test data (no resampling)
    y_train=y_train_resampled,   # Resampled training labels
    y_test=y_test.values         # Original test labels
)
print(f"Final data saved to: {final_data_path}")
print(f"  - Training data: RESAMPLED (balanced classes)")
print(f"  - Test data: ORIGINAL (real-world distribution)")

# Also save the feature engineering results as CSV for reference
featured_data_path = os.path.join(PROCESSED_DATA_DIR, "2_featured_data.csv")
df_featured = X.copy()
df_featured["readmitted_binary"] = y
df_featured.to_csv(featured_data_path, index=False)
print(f"Featured data saved to: {featured_data_path}")

Preprocessor saved to: /home/ghost/workspace/university/machine_learning_and_computer_vision/assessment_main/models/preprocessor.joblib


Final data saved to: /home/ghost/workspace/university/machine_learning_and_computer_vision/assessment_main/data/processed/3_final_data_resampled.npz
  - Training data: RESAMPLED (balanced classes)
  - Test data: ORIGINAL (real-world distribution)
Featured data saved to: /home/ghost/workspace/university/machine_learning_and_computer_vision/assessment_main/data/processed/2_featured_data.csv
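
The modeling notebook can read the arrays back with `np.load`, using the same keys that were passed to `np.savez_compressed`. The sketch below round-trips small placeholder arrays through a temporary directory so it does not touch the real project files; the array contents are invented.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "3_final_data_resampled.npz")

    # Placeholder arrays standing in for the real train/test data
    np.savez_compressed(
        path,
        X_train=np.zeros((4, 3)),
        X_test=np.ones((2, 3)),
        y_train=np.array([0, 1, 0, 1]),
        y_test=np.array([0, 1]),
    )

    # np.load returns a lazy archive; read the arrays before it is closed
    with np.load(path) as data:
        saved_keys = sorted(data.files)
        X_train_loaded = data["X_train"]
        y_train_loaded = data["y_train"]

print(saved_keys)
print(X_train_loaded.shape, y_train_loaded.shape)
```

The fitted preprocessor can be restored in the same spirit with `joblib.load(preprocessor_path)`.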


## Summary

Feature engineering completed successfully! Key accomplishments:

1. **Domain-Specific Features Created:**
   - `change_of_meds`: Binary feature indicating medication changes
   - `HbA1c_Change_Group`: 4-category feature combining HbA1c results with medication changes

2. **Data Preprocessing:**
   - Target variable converted to binary (<30 days = 1, others = 0)
   - 23 individual medication columns removed (information captured in engineered features)
   - Features separated into numerical and categorical for appropriate processing

3. **Pipeline Construction:**
   - StandardScaler applied to numerical features
   - OneHotEncoder applied to categorical features
   - Proper train/test split with stratification
   - Preprocessor fitted only on training data to prevent leakage

4. **Class Imbalance Handling (CRITICAL):**
   - Applied SMOTE to training data to balance classes (88.8/11.2 → 50/50)
   - Test data kept at original distribution for unbiased evaluation
   - This is intended to improve recall for the minority class (readmissions)

5. **Outputs Saved:**
   - `preprocessor.joblib`: Fitted preprocessor for reuse
   - `3_final_data_resampled.npz`: **Resampled** training data + original test data
   - `2_featured_data.csv`: Reference dataset with engineered features

The data is now ready for neural network modeling in Notebook 3!