### Tabular Data Science – Final Research Project
### Overview
This notebook presents an automated feature engineering approach to improve model performance across multiple datasets. We compare a baseline model trained on the original features with an enhanced model that includes automatically generated features.
### Approach
1. Baseline Model: Train and evaluate a simple model on the raw dataset.
2. Feature Engineering: Automatically generate, filter, and rank new features.
3. Enhanced Model: Train and evaluate a model with the engineered features.
4. Comparison: Compare baseline vs. enhanced model performance using statistical tests.
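
The generate-and-rank idea in step 2 can be sketched as follows. This is a minimal illustration, not the implementation inside `SemiAutomatedFeatureEngineering`: the helper names `generate_pairwise_features` and `rank_by_target_correlation` are hypothetical, the toy data is made up, and absolute Pearson correlation stands in for whatever importance score the real pipeline computes. Only the `_plus_`, `_minus_`, `_times_`, `_div_` naming is taken from the pipeline output shown later in this notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def generate_pairwise_features(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Add plus/minus/times/div combinations for every pair of numeric features."""
    numeric_cols = [c for c in df.select_dtypes(include="number").columns if c != target]
    new_cols = {}
    for a, b in combinations(numeric_cols, 2):
        new_cols[f"{a}_plus_{b}"] = df[a] + df[b]
        new_cols[f"{a}_minus_{b}"] = df[a] - df[b]
        new_cols[f"{a}_times_{b}"] = df[a] * df[b]
        # Division by zero yields +/-inf; map it to NaN so later filters can handle it.
        new_cols[f"{a}_div_{b}"] = (df[a] / df[b]).replace([np.inf, -np.inf], np.nan)
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank features by absolute Pearson correlation with the target (a stand-in score)."""
    corr = df.drop(columns=[target]).corrwith(df[target]).abs()
    return corr.dropna().sort_values(ascending=False)


# Made-up toy data, for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 0.0, 1.0, 3.0], "y": [0, 0, 1, 1]})
expanded = generate_pairwise_features(toy, target="y")
ranking = rank_by_target_correlation(expanded, target="y")
print(expanded.columns.tolist())
print(ranking)
```

Two features already produce four new columns, so the number of candidates grows quadratically with the number of inputs. This is why a filtering step follows generation.
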
### Datasets Used
We evaluate our approach on four different datasets:
- Cancer Patient Data (Classification)
- Amsterdam Rental Prices (Regression)
- Student Performance Factors (Classification)
- Weather Data (Regression)

In [None]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
from scipy.stats import ttest_rel, wilcoxon
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    r2_score, mean_squared_error, mean_absolute_error,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Ensure the feature_engineering module is in the Python path
sys.path.append(os.path.abspath("feature_engineering"))
from feature_generator import SemiAutomatedFeatureEngineering


### Dataset 1: Cancer Patient Data (Classification)
Goal: Predict cancer severity level based on patient attributes.

- Type: Classification (Target: Level)
- Baseline Model: RandomForestClassifier
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: Accuracy, Precision, Recall, F1 Score
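
The cell below runs its paired test on the four metric values from a single train/test split. One alternative, sketched here under stated assumptions, is to pair per-fold scores from cross-validation so that each pair is a separate evaluation. The data is synthetic (`make_classification`), the "engineered" column is a hypothetical product of two features, and the fold count and scoring choice are illustrative. None of this comes from the cancer dataset or the project's pipeline.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a three-class tabular dataset (assumption)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)
# Hypothetical engineered feature: product of the first two columns
X_enh = np.column_stack([X, X[:, 0] * X[:, 1]])

# A fixed random_state gives both models exactly the same folds,
# which is what makes the per-fold scores paired.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


def fold_scores(features):
    """Weighted F1 per fold for a random forest with the notebook's settings."""
    model = RandomForestClassifier(
        n_estimators=50, max_depth=5, min_samples_split=10, random_state=42
    )
    return cross_val_score(model, features, y, cv=cv, scoring="f1_weighted")


base_scores = fold_scores(X)
enh_scores = fold_scores(X_enh)
t_stat, p_value = ttest_rel(base_scores, enh_scores)
print("Baseline F1 per fold:", np.round(base_scores, 3))
print("Enhanced F1 per fold:", np.round(enh_scores, 3))
print(f"Paired t-test over folds: t={t_stat:.4f}, p={p_value:.4f}")
```

Fold scores share training data, so they are not fully independent. The result is still easier to interpret than a test across different metric types.
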

In [68]:
###############################################################################
# 1. Load & Preprocess the Data
###############################################################################
dataset_path = os.path.abspath("../Data/cancer_patient.csv")
df_cancer = pd.read_csv(dataset_path)

# Drop identifier columns if present, then remove duplicate rows
df_cancer = df_cancer.drop(columns=['index', 'Patient Id'], errors='ignore')
df_cancer = df_cancer.drop_duplicates()

# Encode the target column
label_encoder = LabelEncoder()
df_cancer['Level'] = label_encoder.fit_transform(df_cancer['Level'])

# Save the cleaned dataset
cleaned_path = os.path.abspath("../Data/cancer_patient_clean.csv")
df_cancer.to_csv(cleaned_path, index=False)

###############################################################################
# 2. Baseline Model Training
###############################################################################
X = df_cancer.drop(columns=['Level'])
y = df_cancer['Level']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Simple baseline model
baseline_model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=10, random_state=42
)
baseline_model.fit(X_train, y_train)

# Evaluate baseline model
y_pred_baseline = baseline_model.predict(X_test)
baseline_results = {
    "Accuracy": accuracy_score(y_test, y_pred_baseline),
    "Precision": precision_score(y_test, y_pred_baseline, average='weighted'),
    "Recall": recall_score(y_test, y_pred_baseline, average='weighted'),
    "F1 Score": f1_score(y_test, y_pred_baseline, average='weighted')
}

print("\nBaseline Model Performance:")
print(baseline_results)

###############################################################################
# 3. Feature Engineering (SemiAutomatedFeatureEngineering)
###############################################################################
print("\nRunning Feature Engineering...")
feature_engineer = SemiAutomatedFeatureEngineering(
    df_cancer.copy(), 
    target_column="Level",
    task="classification"
)
enhanced_results = feature_engineer.run_pipeline()

###############################################################################
# 4. Train a New Model on the Enhanced Data
###############################################################################
# The feature_engineer.df now has new features
df_enhanced = feature_engineer.df
X_enhanced = df_enhanced.drop(columns=["Level"])
y_enhanced = df_enhanced["Level"]

X_train_enhanced, X_test_enhanced, y_train_enhanced, y_test_enhanced = train_test_split(
    X_enhanced, y_enhanced, test_size=0.2, random_state=42
)

enhanced_model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=10, random_state=42
)
enhanced_model.fit(X_train_enhanced, y_train_enhanced)

# Evaluate the enhanced model
y_pred_enhanced = enhanced_model.predict(X_test_enhanced)
enhanced_model_results = {
    "Accuracy": accuracy_score(y_test_enhanced, y_pred_enhanced),
    "Precision": precision_score(y_test_enhanced, y_pred_enhanced, average='weighted'),
    "Recall": recall_score(y_test_enhanced, y_pred_enhanced, average='weighted'),
    "F1 Score": f1_score(y_test_enhanced, y_pred_enhanced, average='weighted')
}

###############################################################################
# 5. Compare Baseline vs. Enhanced
###############################################################################
print("\nComparison Between Baseline and Enhanced Model:")
print("Baseline Model:", baseline_results)
print("Enhanced Model:", enhanced_model_results)

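# Statistical test
# Note: the paired samples are the four metric values from a single
# train/test split (n=4), not repeated runs, so the p-value is only a rough indicator.
baseline_scores = np.array(list(baseline_results.values()))
enhanced_scores = np.array(list(enhanced_model_results.values()))
t_stat, p_value = ttest_rel(baseline_scores, enhanced_scores)
print(f"\nPaired T-Test: t={t_stat:.4f}, p={p_value:.4f}")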

###############################################################################
# 6. SHAP Analysis for Both Models
###############################################################################
# print("\nRunning SHAP Analysis for Baseline Model...")
# explainer_baseline = shap.Explainer(baseline_model, X_test)
# shap_values_baseline = explainer_baseline(X_test)

# if len(shap_values_baseline.values.shape) == 3:
#     # Multi-class scenario
#     for class_idx in range(shap_values_baseline.values.shape[2]):
#         plt.figure(figsize=(15, 10))
#         shap.summary_plot(shap_values_baseline[..., class_idx], X_test, 
#                           feature_names=X_test.columns, show=True)
#         plt.savefig(f"shap_baseline_model_class_{class_idx}.png", 
#                     dpi=300, bbox_inches="tight")
#         plt.close()
#         print(f"SHAP plot (Baseline, Class {class_idx}) saved.")

# print("\nRunning SHAP Analysis for Enhanced Model...")
# explainer_enhanced = shap.Explainer(enhanced_model, X_test_enhanced)
# shap_values_enhanced = explainer_enhanced(X_test_enhanced)

# if len(shap_values_enhanced.values.shape) == 3:
#     # Multi-class scenario
#     for class_idx in range(shap_values_enhanced.values.shape[2]):
#         plt.figure(figsize=(15, 10))
#         shap.summary_plot(shap_values_enhanced[..., class_idx], X_test_enhanced, 
#                           feature_names=X_test_enhanced.columns, show=True)
#         plt.savefig(f"shap_enhanced_model_class_{class_idx}.png", 
#                     dpi=300, bbox_inches="tight")
#         plt.close()
#         print(f"SHAP plot (Enhanced, Class {class_idx}) saved.")

# print("SHAP analysis complete.")



Baseline Model Performance:
{'Accuracy': 0.9354838709677419, 'Precision': 0.946236559139785, 'Recall': 0.9354838709677419, 'F1 Score': 0.9319648093841643}

Running Feature Engineering...
Generating new features...
Filtering irrelevant features...
Computing feature importance...

Important Features:
 Air Pollution_times_Coughing of Blood          0.014227
Air Pollution_plus_Coughing of Blood           0.012577
Alcohol use_plus_Fatigue                       0.011524
Balanced Diet_div_Clubbing of Finger Nails     0.010567
Balanced Diet_plus_Coughing of Blood           0.010485
                                                 ...   
Dust Allergy_times_Dry Cough                   0.000000
Dust Allergy_times_Frequent Cold               0.000000
Dust Allergy_minus_Clubbing of Finger Nails    0.000000
Dust Allergy_minus_Wheezing                    0.000000
Dry Cough_div_Snoring                          0.000000
Length: 921, dtype: float64
Training and evaluating the model...

Model Performanc

### Dataset 2: Amsterdam Rental Prices (Regression)
Goal: Predict rental prices of properties in Amsterdam.

- Type: Regression (Target: realSum)
- Baseline Model: RandomForestRegressor
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: R², RMSE, MAE
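
As a reminder of what the three metrics measure, here is a small NumPy sketch of their standard definitions. The helper name `regression_report` and the three-point example are illustrative only. The notebook itself uses `r2_score`, `mean_squared_error`, and `mean_absolute_error` from scikit-learn.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Compute R², RMSE and MAE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return {
        "R²": float(1.0 - ss_res / ss_tot),               # higher is better, at most 1
        "RMSE": float(np.sqrt(np.mean(residuals ** 2))),  # lower is better, target units
        "MAE": float(np.mean(np.abs(residuals))),         # lower is better, target units
    }


# Made-up example: one prediction is off by 1
report = regression_report([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(report)
```

R² improves upward while RMSE and MAE improve downward, so baseline-minus-enhanced differences carry opposite signs for the same improvement. This matters when the three values are compared together.
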

In [70]:
###############################################################################
# 1. LOAD & CLEAN (Baseline)
###############################################################################
dataset_path = os.path.abspath("../Data/amsterdam_weekdays.csv")
df_amsterdam = pd.read_csv(dataset_path)

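# Convert boolean-like values (True/False, "TRUE"/"FALSE", 0/1) to 0/1
bool_map = {"TRUE": 1, "FALSE": 0, "1": 1, "0": 0}
for col in ["host_is_superhost", "room_private", "room_shared"]:
    if col in df_amsterdam.columns:
        # Mapping normalized strings avoids the downcasting FutureWarning
        # that replace() emits on these columns in recent pandas versions.
        df_amsterdam[col] = (
            df_amsterdam[col].astype(str).str.upper().map(bool_map).astype(int)
        )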

# One-hot encode 'room_type' if present
if "room_type" in df_amsterdam.columns:
    df_amsterdam = pd.get_dummies(df_amsterdam, columns=["room_type"], prefix="room_type")

# Convert any leftover bools
bool_cols = df_amsterdam.select_dtypes(include='bool').columns
df_amsterdam[bool_cols] = df_amsterdam[bool_cols].astype(int)

target_col = "realSum"
if target_col not in df_amsterdam.columns:
    raise KeyError(f"Target '{target_col}' not found in amsterdam_weekdays.csv")

# (Optional) Save cleaned
cleaned_path = os.path.abspath("../Data/amsterdam_weekdays_clean.csv")
df_amsterdam.to_csv(cleaned_path, index=False)

# Prepare baseline data
X_amst = df_amsterdam.drop(columns=[target_col])
y_amst = df_amsterdam[target_col]

scaler = StandardScaler()
X_amst_scaled = pd.DataFrame(scaler.fit_transform(X_amst), columns=X_amst.columns)

X_train_amst, X_test_amst, y_train_amst, y_test_amst = train_test_split(
    X_amst_scaled, y_amst, test_size=0.2, random_state=42
)

# Baseline model
baseline_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
baseline_model.fit(X_train_amst, y_train_amst)

# Evaluate baseline
y_pred_baseline = baseline_model.predict(X_test_amst)
baseline_metrics = {
    "R²": r2_score(y_test_amst, y_pred_baseline),
    "RMSE": np.sqrt(mean_squared_error(y_test_amst, y_pred_baseline)),
    "MAE": mean_absolute_error(y_test_amst, y_pred_baseline),
}
print("\n=== Amsterdam Baseline Model ===")
print(baseline_metrics)

###############################################################################
# 2. SEMI-AUTOMATED FEATURE ENGINEERING
###############################################################################
print("\nRunning Semi-Automated Feature Engineering (Amsterdam)...")

feature_engineer = SemiAutomatedFeatureEngineering(
    df_amsterdam.copy(),  # pass a copy
    target_column=target_col,
    task="regression"
    # correlation_threshold=0.05, variance_threshold=0.01 (if your class supports those)
)
pipeline_results = feature_engineer.run_pipeline()  # This prints out results, features, etc.

# The pipeline's final DataFrame with new features
df_amst_enh = feature_engineer.df
X_enh = df_amst_enh.drop(columns=[target_col])
y_enh = df_amst_enh[target_col]

# Scale again with new features
X_enh_scaled = pd.DataFrame(scaler.fit_transform(X_enh), columns=X_enh.columns)

X_train_enh, X_test_enh, y_train_enh, y_test_enh = train_test_split(
    X_enh_scaled, y_enh, test_size=0.2, random_state=42
)

enh_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
enh_model.fit(X_train_enh, y_train_enh)

y_pred_enh = enh_model.predict(X_test_enh)
enhanced_metrics = {
    "R²": r2_score(y_test_enh, y_pred_enh),
    "RMSE": np.sqrt(mean_squared_error(y_test_enh, y_pred_enh)),
    "MAE": mean_absolute_error(y_test_enh, y_pred_enh),
}
print("\n=== Amsterdam Enhanced Model ===")
print(enhanced_metrics)

# Compare
print("\nComparison Baseline vs. Enhanced (Amsterdam)")
print("Baseline:", baseline_metrics)
print("Enhanced:", enhanced_metrics)

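# Note: this test pairs the three metric values from a single split. R² is
# higher-is-better while RMSE and MAE are lower-is-better, and an exact
# two-sided Wilcoxon test on three pairs cannot yield p < 0.25, so treat
# the result as descriptive only.
base_scores = np.array(list(baseline_metrics.values()))
enh_scores = np.array(list(enhanced_metrics.values()))
stat, pval = wilcoxon(base_scores, enh_scores)
print(f"Wilcoxon: statistic={stat:.4f}, pvalue={pval:.4f}")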




=== Amsterdam Baseline Model ===
{'R²': 0.481964703404132, 'RMSE': 224.68364147185966, 'MAE': 148.43478257371325}

Running Semi-Automated Feature Engineering (Amsterdam)...
Generating new features...
Filtering irrelevant features...
Computing feature importance...





Important Features:
 person_capacity_plus_room_type_Entire home/apt     17.286412
room_private_div_person_capacity                    9.517621
person_capacity_minus_metro_dist                    9.327779
bedrooms_plus_room_type_Entire home/apt             8.414398
person_capacity_times_room_type_Entire home/apt     7.440377
                                                     ...    
room_shared_plus_room_type_Entire home/apt          0.001522
room_private                                        0.001091
room_private_plus_room_type_Private room            0.000826
room_private_times_room_type_Private room           0.000628
room_private_minus_room_type_Entire home/apt        0.000443
Length: 545, dtype: float64
Training and evaluating the model...

Model Performance with Engineered Features: {'R²': 0.570615190219463, 'RMSE': 204.55736294415533}

=== Amsterdam Enhanced Model ===
{'R²': 0.5824470572475611, 'RMSE': 201.71934648304074, 'MAE': 134.1443843798962}

Comparison Baseline vs. Enh

### Dataset 3: Student Performance Factors (Classification)
Goal: Predict whether a student passes based on academic and lifestyle factors.

- Type: Classification (Target: Passed)
- Baseline Model: RandomForestClassifier
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: Accuracy, Precision, Recall, F1 Score
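
Because `Passed` is derived from `Exam_Score`, any engineered feature that combines `Exam_Score` with another column also encodes the target. A minimal sketch of one way to strip such columns is shown below. The helper name `drop_columns_derived_from` is hypothetical and the toy values are made up. It assumes derived features keep the source column name inside their own name, as in the `A_plus_B` naming seen in the pipeline output.

```python
import pandas as pd


def drop_columns_derived_from(df: pd.DataFrame, source_col: str):
    """Drop source_col and every column whose name contains it.

    Assumes engineered features embed their parent column names. A plain
    substring match would also drop unrelated columns that happen to
    contain the same text.
    """
    leaky = [c for c in df.columns if source_col in c]
    return df.drop(columns=leaky), leaky


# Made-up rows; the column names follow the naming pattern in the pipeline output
toy = pd.DataFrame({
    "Tutoring_Sessions": [1, 3],
    "Exam_Score": [55, 72],
    "Internet_Access_minus_Exam_Score": [-54, -71],
    "Exam_Score_plus_Teacher_Quality_Low": [56, 72],
    "Passed": [0, 1],
})
clean, removed = drop_columns_derived_from(toy, "Exam_Score")
print("Kept:", clean.columns.tolist())
print("Removed:", removed)
```
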

In [None]:
###############################################################################
# 1. BASELINE
###############################################################################
df_students = pd.read_csv("../Data/StudentPerformanceFactors.csv")

# Create "Passed" from "Exam_Score"
if "Exam_Score" not in df_students.columns:
    raise KeyError("No 'Exam_Score' column found; cannot create 'Passed' column.")
df_students["Passed"] = (df_students["Exam_Score"] >= 60).astype(int)

# Convert boolean-like columns
bool_cols_list = ["Internet_Access", "Learning_Disabilities", "Extracurricular_Activities"]
for col in bool_cols_list:
    if col in df_students.columns:
        if df_students[col].dtype == bool:
            df_students[col] = df_students[col].astype(int)
        else:
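            # map() turns unrecognized values (including "nan" from missing
            # entries) into NaN, which fillna(0) then sets to 0.
            df_students[col] = (
                df_students[col]
                .astype(str)
                .str.lower()
                .map({"true": 1, "false": 0, "yes": 1, "no": 0})
                .fillna(0)
                .astype(int)
            )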

# Identify other categorical columns & one-hot
categorical_cols = [
    "Parental_Involvement", "Access_to_Resources", "Motivation_Level",
    "Family_Income", "Teacher_Quality", "School_Type", "Peer_Influence",
    "Parental_Education_Level", "Distance_from_Home", "Gender"
]
existing_cat = [c for c in categorical_cols if c in df_students.columns]
if existing_cat:
    df_students = pd.get_dummies(df_students, columns=existing_cat, drop_first=True)

# Convert leftover bool
bool_cols2 = df_students.select_dtypes(include='bool').columns
df_students[bool_cols2] = df_students[bool_cols2].astype(int)

target_col = "Passed"
if target_col not in df_students.columns:
    raise KeyError(f"Target '{target_col}' not found in student dataset.")

# For baseline, we'll remove "Exam_Score" from features to avoid direct leak
X_stud = df_students.drop(columns=["Exam_Score", target_col])
y_stud = df_students[target_col]

scaler = StandardScaler()
X_stud_scaled = pd.DataFrame(scaler.fit_transform(X_stud), columns=X_stud.columns)

X_train_stud, X_test_stud, y_train_stud, y_test_stud = train_test_split(
    X_stud_scaled, y_stud, test_size=0.2, random_state=42
)

baseline_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
baseline_model.fit(X_train_stud, y_train_stud)

y_pred_base = baseline_model.predict(X_test_stud)
baseline_metrics = {
    "Accuracy": accuracy_score(y_test_stud, y_pred_base),
    "Precision": precision_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
    "Recall": recall_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
    "F1 Score": f1_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
}
print("\n=== Students Baseline Model ===")
print(baseline_metrics)

###############################################################################
# 2. SEMI-AUTOMATED FEATURE ENGINEERING
###############################################################################
print("\nRunning Semi-Automated Feature Engineering (Students)...")

feature_engineer = SemiAutomatedFeatureEngineering(
    df_students.copy(),
    target_column=target_col,
    task="classification"
)
pipeline_results = feature_engineer.run_pipeline()

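df_stud_enh = feature_engineer.df
# Drop "Exam_Score" and every engineered feature derived from it (e.g.
# "Learning_Disabilities_plus_Exam_Score"): "Passed" is computed from
# "Exam_Score", so any of these columns would leak the target.
leaky_cols = [c for c in df_stud_enh.columns if "Exam_Score" in c]
X_stud_enh = df_stud_enh.drop(columns=leaky_cols + [target_col], errors='ignore')
y_stud_enh = df_stud_enh[target_col]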

X_stud_enh_scaled = pd.DataFrame(scaler.fit_transform(X_stud_enh), columns=X_stud_enh.columns)

X_train_stud2, X_test_stud2, y_train_stud2, y_test_stud2 = train_test_split(
    X_stud_enh_scaled, y_stud_enh, test_size=0.2, random_state=42
)

enh_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
enh_model.fit(X_train_stud2, y_train_stud2)

y_pred_enh = enh_model.predict(X_test_stud2)
enhanced_metrics = {
    "Accuracy": accuracy_score(y_test_stud2, y_pred_enh),
    "Precision": precision_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
    "Recall": recall_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
    "F1 Score": f1_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
}
print("\n=== Students Enhanced Model ===")
print(enhanced_metrics)

# Compare
print("\nCompare Baseline vs. Enhanced (Students):")
print("Baseline:", baseline_metrics)
print("Enhanced:", enhanced_metrics)

base_scores = np.array(list(baseline_metrics.values()))
enh_scores = np.array(list(enhanced_metrics.values()))
stat, pval = wilcoxon(base_scores, enh_scores)
print(f"\nWilcoxon: statistic={stat:.4f}, pvalue={pval:.4f}")






=== Students Baseline Model ===
{'Accuracy': 0.9916792738275341, 'Precision': 0.9834277821391052, 'Recall': 0.9916792738275341, 'F1 Score': 0.9875362916732983}

Running Semi-Automated Feature Engineering (Students)...
Generating new features...
Filtering irrelevant features...
Computing feature importance...

Important Features:
 Exam_Score                                               0.046962
Learning_Disabilities_plus_Exam_Score                    0.041836
Exam_Score_plus_Teacher_Quality_Low                      0.033608
Exam_Score_plus_Parental_Education_Level_Postgraduate    0.032655
Internet_Access_minus_Exam_Score                         0.022138
                                                           ...   
Internet_Access_minus_Access_to_Resources_Low            0.000000
Internet_Access_minus_Motivation_Level_Low               0.000000
Internet_Access_times_Distance_from_Home_Near            0.000000
Tutoring_Sessions_minus_Learning_Disabilities            0.000000
Hours_S

### Dataset 4: Weather Data (Regression)
Goal: Predict maximum temperature (max_temp °c) based on weather conditions.

- Type: Regression (Target: max_temp °c)
- Baseline Model: RandomForestRegressor
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: R², RMSE, MAE