### Tabular Data Science – Final Research Project
### Overview
This notebook presents an automated feature engineering approach to improve model performance across multiple datasets. We compare a baseline model trained on the original features with an enhanced model that includes automatically generated features.
### Approach
1. Baseline Model: Train and evaluate a simple model on the raw dataset.
2. Feature Engineering: Automatically generate, filter, and rank new features.
3. Enhanced Model: Train and evaluate a model with the engineered features.
4. Comparison: Compare baseline vs. enhanced model performance using statistical tests.
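
The comparison in step 4 can be illustrated on synthetic data. A paired test needs repeated, like-for-like measurements of a single metric. The sketch below assumes that per-fold accuracy from identical cross-validation splits plays that role. The synthetic dataset, the product columns standing in for "engineered" features, and all variable names are illustrative assumptions, not part of the project code.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a real dataset (assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)

# Hypothetical "engineered" matrix: original columns plus pairwise products
# of the first three columns.
products = np.column_stack([X[:, i] * X[:, j] for i in range(3) for j in range(i + 1, 3)])
X_enh = np.hstack([X, products])

# Identical folds for both feature sets, so the per-fold scores are paired.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

base_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
enh_scores = cross_val_score(model, X_enh, y, cv=cv, scoring="accuracy")

# Paired t-test over the five fold-level accuracy values.
t_stat, p_value = ttest_rel(base_scores, enh_scores)
print(f"baseline mean={base_scores.mean():.4f}, enhanced mean={enh_scores.mean():.4f}")
print(f"paired t-test over folds: t={t_stat:.4f}, p={p_value:.4f}")
```

Each fold contributes one pair of scores for the same metric, so the differences being tested share a scale and a direction.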
### Datasets Used
We evaluate our approach on four different datasets:
- Cancer Patient Data (Classification)
- Amsterdam Rental Prices (Regression)
- Student Performance Factors (Classification)
- Life Expectancy (Regression)

In [21]:
import pandas as pd
import numpy as np
import shap
import os
import sys
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    r2_score, mean_squared_error, mean_absolute_error,
)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from scipy.stats import ttest_rel, wilcoxon

sys.path.append(os.path.abspath("feature_engineering"))
from feature_generator import SemiAutomatedFeatureEngineering


### Dataset 1: Cancer Patient Data (Classification)
Goal: Predict cancer severity level based on patient attributes.

- Type: Classification (Target: Level)
- Baseline Model: RandomForestClassifier
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: Accuracy, Precision, Recall, F1 Score
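
The classification metrics in this notebook are computed with `average='weighted'`, which averages per-class scores weighted by class support. A small worked example (toy labels, chosen only for illustration) shows what that means for recall and why weighted recall coincides with accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy three-class labels (illustrative values, not project data).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])

# Per-class recall and the number of true samples in each class.
per_class_recall = recall_score(y_true, y_pred, average=None)
support = np.bincount(y_true)

# Support-weighted mean of the per-class recalls.
weighted_recall = np.sum(per_class_recall * support) / support.sum()

print("per-class recall:", per_class_recall)
print("weighted recall :", weighted_recall)
print("accuracy        :", accuracy_score(y_true, y_pred))
```

Because each class recall is (correct in class) / (class size), weighting by class size and summing gives (total correct) / (total samples), which is the accuracy.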

In [22]:
###############################################################################
# 1. Load & Preprocess the Data
###############################################################################
dataset_path = os.path.abspath("../Data/cancer_patient.csv")
df_cancer = pd.read_csv(dataset_path)

# Drop unnecessary columns
if 'index' in df_cancer.columns:
    df_cancer.drop(columns=['index'], inplace=True)
if 'Patient Id' in df_cancer.columns:
    df_cancer.drop(columns=['Patient Id'], inplace=True)

df_cancer.drop_duplicates(inplace=True)

# Encode the target column
label_encoder = LabelEncoder()
df_cancer['Level'] = label_encoder.fit_transform(df_cancer['Level'])

# Save the cleaned dataset
cleaned_path = os.path.abspath("../Data/cancer_patient_clean.csv")
df_cancer.to_csv(cleaned_path, index=False)

###############################################################################
# 2. Baseline Model Training
###############################################################################
X = df_cancer.drop(columns=['Level'])
y = df_cancer['Level']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Simple baseline model
baseline_model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=10, random_state=42
)
baseline_model.fit(X_train, y_train)

# Evaluate baseline model
y_pred_baseline = baseline_model.predict(X_test)
baseline_results = {
    "Accuracy": accuracy_score(y_test, y_pred_baseline),
    "Precision": precision_score(y_test, y_pred_baseline, average='weighted'),
    "Recall": recall_score(y_test, y_pred_baseline, average='weighted'),
    "F1 Score": f1_score(y_test, y_pred_baseline, average='weighted')
}

print("\nBaseline Model Performance:")
print(baseline_results)

###############################################################################
# 3. Feature Engineering (SemiAutomatedFeatureEngineering)
###############################################################################
print("\nRunning Feature Engineering...")
feature_engineer = SemiAutomatedFeatureEngineering(
    df_cancer.copy(), 
    target_column="Level",
    task="classification",
    correlation_threshold=0.05,
    variance_threshold=0.01
)
enhanced_results = feature_engineer.run_pipeline()

###############################################################################
# 4. Train a New Model on the Enhanced Data
###############################################################################
# The feature_engineer.df now has new features
df_enhanced = feature_engineer.df
X_enhanced = df_enhanced.drop(columns=["Level"])
y_enhanced = df_enhanced["Level"]

X_train_enhanced, X_test_enhanced, y_train_enhanced, y_test_enhanced = train_test_split(
    X_enhanced, y_enhanced, test_size=0.2, random_state=42
)

enhanced_model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=10, random_state=42
)
enhanced_model.fit(X_train_enhanced, y_train_enhanced)

# Evaluate the enhanced model
y_pred_enhanced = enhanced_model.predict(X_test_enhanced)
enhanced_model_results = {
    "Accuracy": accuracy_score(y_test_enhanced, y_pred_enhanced),
    "Precision": precision_score(y_test_enhanced, y_pred_enhanced, average='weighted'),
    "Recall": recall_score(y_test_enhanced, y_pred_enhanced, average='weighted'),
    "F1 Score": f1_score(y_test_enhanced, y_pred_enhanced, average='weighted')
}

###############################################################################
# 5. Compare Baseline vs. Enhanced
###############################################################################
print("\nComparison Between Baseline and Enhanced Model:")
print("Baseline Model:", baseline_results)
print("Enhanced Model:", enhanced_model_results)

# Statistical test: paired t-test over the four metric values (one pair per metric)
baseline_scores = np.array(list(baseline_results.values()))
enhanced_scores = np.array(list(enhanced_model_results.values()))
t_stat_cancer, p_value_cancer = ttest_rel(baseline_scores, enhanced_scores)
print(f"\nPaired T-Test: t={t_stat_cancer:.4f}, p={p_value_cancer:.4f}")


Baseline Model Performance:
{'Accuracy': 0.9354838709677419, 'Precision': 0.946236559139785, 'Recall': 0.9354838709677419, 'F1 Score': 0.9319648093841643}

Running Feature Engineering...
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Added 1012 new features.
Sample new features: ['Balanced Diet_div_Swallowing Difficulty', 'Balanced Diet_times_Clubbing of Finger Nails', 'Balanced Diet_times_Dry Cough', 'Dust Allergy_minus_Clubbing of Finger Nails', 'Chest Pain_minus_Shortness of Breath', 'Frequent Cold_times_Snoring', 'Gender_plus_Passive Smoker', 'Chest Pain_plus_Wheezing', 'chronic Lung Disease_minus_Chest Pain', 'Obesity_times_Clubbing of Finger Nails']
[FeatureEngineering] Filtering features based on correlation & variance...
[FeatureEngineering] Dropping 114
[FeatureEngineering] 900 new features were retained after filtering.
[FeatureEngineering] Checking final set of features for importance...
[FeatureEngineering] Co
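
The logged feature names (`..._plus_...`, `..._minus_...`, `..._times_...`, `..._div_...`) suggest pairwise arithmetic combinations of the input columns. The internals of `SemiAutomatedFeatureEngineering` are not shown in this notebook, so the following is only a minimal sketch of that idea. The helper `generate_pairwise_features` and the toy columns are hypothetical, and mapping division by zero to 0 is an arbitrary choice made for the example.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def generate_pairwise_features(df: pd.DataFrame, target_column: str) -> pd.DataFrame:
    """Return df extended with plus/minus/times/div features for every column pair."""
    features = [c for c in df.columns if c != target_column]
    new_cols = {}
    for a, b in combinations(features, 2):
        new_cols[f"{a}_plus_{b}"] = df[a] + df[b]
        new_cols[f"{a}_minus_{b}"] = df[a] - df[b]
        new_cols[f"{a}_times_{b}"] = df[a] * df[b]
        # Division by zero produces inf or NaN; both are replaced by 0 here
        # (an assumption of this sketch, not necessarily what the project does).
        ratio = (df[a] / df[b]).replace([np.inf, -np.inf], np.nan).fillna(0.0)
        new_cols[f"{a}_div_{b}"] = ratio
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)


toy = pd.DataFrame({"Age": [30, 45, 60], "Smoking": [1, 0, 7], "Level": [0, 1, 2]})
expanded = generate_pairwise_features(toy, target_column="Level")
print(expanded.columns.tolist())
```

With four operations per unordered pair, a table with 23 feature columns would yield 4 × C(23, 2) = 1012 new columns, which matches the count logged above, though the actual generator may work differently.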

### Dataset 2: Amsterdam Rental Prices (Regression)
Goal: Predict rental prices of properties in Amsterdam.

- Type: Regression (Target: realSum)
- Baseline Model: RandomForestRegressor
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: R², RMSE, MAE
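
The three regression metrics differ in scale and direction: R² is unitless and higher is better, while RMSE and MAE are in the units of the target and lower is better. A short worked example on made-up prices (illustrative values only) computes each one from its definition and checks it against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices (illustrative values, not project data).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))
# MAE: mean absolute residual.
mae = np.mean(np.abs(residuals))
# R²: 1 minus residual sum of squares over total sum of squares.
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"R²={r2:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}")
print("matches sklearn:",
      np.isclose(r2, r2_score(y_true, y_pred)),
      np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))),
      np.isclose(mae, mean_absolute_error(y_true, y_pred)))
```

For these values R² is 0.98, RMSE is about 15.81, and MAE is 15.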

In [26]:
###############################################################################
# 1. LOAD & CLEAN (Baseline)
###############################################################################
dataset_path = os.path.abspath("../Data/amsterdam_weekdays.csv")
df_amsterdam = pd.read_csv(dataset_path)

# Convert booleans/strings to 0/1
for col in ["host_is_superhost", "room_private", "room_shared"]:
    if col in df_amsterdam.columns:
        df_amsterdam[col] = df_amsterdam[col].replace({
            False: 0, True: 1, "FALSE": 0, "TRUE": 1
        }).astype(int)

# One-hot encode 'room_type' if present
if "room_type" in df_amsterdam.columns:
    df_amsterdam = pd.get_dummies(df_amsterdam, columns=["room_type"], prefix="room_type")

# Convert any leftover bools
bool_cols = df_amsterdam.select_dtypes(include='bool').columns
df_amsterdam[bool_cols] = df_amsterdam[bool_cols].astype(int)

target_col = "realSum"
if target_col not in df_amsterdam.columns:
    raise KeyError(f"Target '{target_col}' not found in amsterdam_weekdays.csv")

# Save cleaned
cleaned_path = os.path.abspath("../Data/amsterdam_weekdays_clean.csv")
df_amsterdam.to_csv(cleaned_path, index=False)

# Prepare baseline data
X_amst = df_amsterdam.drop(columns=[target_col])
y_amst = df_amsterdam[target_col]

scaler = StandardScaler()
X_amst_scaled = pd.DataFrame(scaler.fit_transform(X_amst), columns=X_amst.columns)

X_train_amst, X_test_amst, y_train_amst, y_test_amst = train_test_split(
    X_amst_scaled, y_amst, test_size=0.2, random_state=42
)

# Baseline model
baseline_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
baseline_model.fit(X_train_amst, y_train_amst)

# Evaluate baseline
y_pred_baseline = baseline_model.predict(X_test_amst)
baseline_metrics = {
    "R²": r2_score(y_test_amst, y_pred_baseline),
    "RMSE": np.sqrt(mean_squared_error(y_test_amst, y_pred_baseline)),
    "MAE": mean_absolute_error(y_test_amst, y_pred_baseline),
}
print("\n=== Amsterdam Baseline Model ===")
print(baseline_metrics)

###############################################################################
# 2. SEMI-AUTOMATED FEATURE ENGINEERING
###############################################################################

feature_engineer = SemiAutomatedFeatureEngineering(
    df_amsterdam.copy(), 
    target_column=target_col,
    task="regression",
    correlation_threshold=0.05,
    variance_threshold=0.01
)
pipeline_results = feature_engineer.run_pipeline()

# The pipeline's final DataFrame with new features
df_amst_enh = feature_engineer.df
X_enh = df_amst_enh.drop(columns=[target_col])
y_enh = df_amst_enh[target_col]

# Scale again with new features
X_enh_scaled = pd.DataFrame(scaler.fit_transform(X_enh), columns=X_enh.columns)

X_train_enh, X_test_enh, y_train_enh, y_test_enh = train_test_split(
    X_enh_scaled, y_enh, test_size=0.2, random_state=42
)

enh_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
enh_model.fit(X_train_enh, y_train_enh)

y_pred_enh = enh_model.predict(X_test_enh)
enhanced_metrics = {
    "R²": r2_score(y_test_enh, y_pred_enh),
    "RMSE": np.sqrt(mean_squared_error(y_test_enh, y_pred_enh)),
    "MAE": mean_absolute_error(y_test_enh, y_pred_enh),
}
print("\n=== Amsterdam Enhanced Model ===")
print(enhanced_metrics)

# Compare
print("\nComparison Baseline vs. Enhanced (Amsterdam)")
print("Baseline:", baseline_metrics)
print("Enhanced:", enhanced_metrics)

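# Paired t-test over the three metric values (one pair per metric).
# Note: R² is higher-is-better, whereas RMSE and MAE are lower-is-better.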
baseline_scores = np.array(list(baseline_metrics.values()))
enhanced_scores = np.array(list(enhanced_metrics.values()))
t_stat_ams, p_value_ams = ttest_rel(baseline_scores, enhanced_scores) 
print(f"\nPaired T-Test: t={t_stat_ams:.4f}, p={p_value_ams:.4f}") 




=== Amsterdam Baseline Model ===
{'R²': 0.481964703404132, 'RMSE': 224.68364147185966, 'MAE': 148.43478257371325}
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Added 840 new features.
Sample new features: ['bedrooms_div_rest_index', 'multi_div_room_type_Entire home/apt', 'host_is_superhost_plus_room_type_Shared room', 'room_shared_times_rest_index', 'room_shared_times_room_type_Private room', 'cleanliness_rating_div_bedrooms', 'rest_index_minus_rest_index_norm', 'dist_div_rest_index', 'person_capacity_plus_rest_index', 'room_shared_times_room_type_Shared room']
[FeatureEngineering] Filtering features based on correlation & variance...
[FeatureEngineering] Dropping 198
[FeatureEngineering] Dropping 70
[FeatureEngineering] 579 new features were retained after filtering.
[FeatureEngineering] Checking final set of features for importance...
[FeatureEngineering] Computing feature importance via SHAP & feature_importances_...





[FeatureEngineering] Important Features (Combined SHAP + Permutation):
 person_capacity_plus_room_type_Entire home/apt     17.100581
person_capacity_minus_metro_dist                   14.274584
room_private_div_person_capacity                    9.024589
person_capacity_times_room_type_Entire home/apt     7.837398
bedrooms_plus_room_type_Entire home/apt             7.156334
person_capacity_minus_dist                          7.011777
guest_satisfaction_overall_plus_attr_index_norm     6.579408
person_capacity_div_room_type_Entire home/apt       5.831489
bedrooms_minus_dist                                 4.501200
attr_index_norm_div_room_type_Entire home/apt       4.000648
bedrooms_times_attr_index_norm                      3.653305
person_capacity_times_attr_index                    3.164379
guest_satisfaction_overall_plus_rest_index_norm     2.550856
attr_index_norm_times_room_type_Entire home/apt     2.537424
person_capacity_div_dist                            2.532622
person_capac
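
The log above reports a filtering step "based on correlation & variance", controlled by `correlation_threshold` and `variance_threshold`. How the class applies these thresholds is not shown here. One plausible reading, sketched below with a hypothetical `filter_features` helper, first drops near-constant columns and then drops columns whose absolute Pearson correlation with the target falls below the threshold.

```python
import pandas as pd


def filter_features(df, target_column, correlation_threshold=0.05, variance_threshold=0.01):
    """Keep features that pass both a variance and a target-correlation threshold."""
    features = df.drop(columns=[target_column])

    # Step 1: discard columns whose sample variance is at or below the threshold.
    variances = features.var()
    high_variance = variances[variances > variance_threshold].index

    # Step 2: discard columns weakly correlated with the target.
    correlations = features[high_variance].corrwith(df[target_column]).abs()
    kept = correlations[correlations >= correlation_threshold].index.tolist()

    return df[kept + [target_column]]


# Toy data: "signal" tracks the target, "constant" never varies,
# and "noise" is constructed to be uncorrelated with the target.
toy = pd.DataFrame({
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "noise": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
    "target": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})
filtered = filter_features(toy, target_column="target")
print(filtered.columns.tolist())
```

Applying the variance filter first also avoids computing correlations for constant columns, where the Pearson coefficient is undefined.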

### Dataset 3: Student Performance (Classification)
Goal: Predict whether a student passes based on academic and lifestyle factors.

- Type: Classification (Target: Passed)
- Baseline Model: RandomForestClassifier
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: Accuracy, Precision, Recall, F1 Score

In [27]:
###############################################################################
# 1. BASELINE
###############################################################################
df_students = pd.read_csv("../Data/StudentPerformanceFactors.csv")

# Create "Passed" from "Exam_Score"
if "Exam_Score" not in df_students.columns:
    raise KeyError("No 'Exam_Score' column found; cannot create 'Passed' column.")
df_students["Passed"] = (df_students["Exam_Score"] >= 60).astype(int)

# Convert boolean-like columns
bool_cols_list = ["Internet_Access", "Learning_Disabilities", "Extracurricular_Activities"]
for col in bool_cols_list:
    if col in df_students.columns:
        if df_students[col].dtype == bool:
            df_students[col] = df_students[col].astype(int)
        else:
            df_students[col] = (
                df_students[col]
                .astype(str)
                .str.lower()
                .replace({"true": 1, "false": 0, "yes": 1, "no": 0})
                .fillna(0)
                .astype(int)
            )

# Identify other categorical columns & one-hot
categorical_cols = [
    "Parental_Involvement", "Access_to_Resources", "Motivation_Level",
    "Family_Income", "Teacher_Quality", "School_Type", "Peer_Influence",
    "Parental_Education_Level", "Distance_from_Home", "Gender"
]
existing_cat = [c for c in categorical_cols if c in df_students.columns]
if existing_cat:
    df_students = pd.get_dummies(df_students, columns=existing_cat, drop_first=True)

# Convert leftover bool
bool_cols2 = df_students.select_dtypes(include='bool').columns
df_students[bool_cols2] = df_students[bool_cols2].astype(int)

target_col = "Passed"
if target_col not in df_students.columns:
    raise KeyError(f"Target '{target_col}' not found in student dataset.")

# Remove "Exam_Score" from the baseline features: "Passed" is derived from it,
# so keeping it would leak the target into the model
X_stud = df_students.drop(columns=["Exam_Score", target_col])
y_stud = df_students[target_col]

scaler = StandardScaler()
X_stud_scaled = pd.DataFrame(scaler.fit_transform(X_stud), columns=X_stud.columns)

X_train_stud, X_test_stud, y_train_stud, y_test_stud = train_test_split(
    X_stud_scaled, y_stud, test_size=0.2, random_state=42
)

baseline_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
baseline_model.fit(X_train_stud, y_train_stud)

y_pred_base = baseline_model.predict(X_test_stud)
baseline_metrics_stu = {
    "Accuracy": accuracy_score(y_test_stud, y_pred_base),
    "Precision": precision_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
    "Recall": recall_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
    "F1 Score": f1_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
}
print("\n=== Students Baseline Model ===")
print(baseline_metrics_stu)

###############################################################################
# 2. SEMI-AUTOMATED FEATURE ENGINEERING
###############################################################################

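# NOTE: df_students still contains "Exam_Score" at this point, so generated
# features such as "Exam_Score_plus_Teacher_Quality_Medium" are derived from the
# column that defines "Passed". Drop "Exam_Score" before this call to avoid leakage.
feature_engineer = SemiAutomatedFeatureEngineering(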
    df_students.copy(),
    target_column=target_col,
    task="classification"
)
pipeline_results = feature_engineer.run_pipeline()

df_stud_enh = feature_engineer.df
X_stud_enh = df_stud_enh.drop(columns=["Exam_Score", target_col], errors='ignore')
y_stud_enh = df_stud_enh[target_col]

X_stud_enh_scaled = pd.DataFrame(scaler.fit_transform(X_stud_enh), columns=X_stud_enh.columns)

X_train_stud2, X_test_stud2, y_train_stud2, y_test_stud2 = train_test_split(
    X_stud_enh_scaled, y_stud_enh, test_size=0.2, random_state=42
)

enh_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
enh_model.fit(X_train_stud2, y_train_stud2)

y_pred_enh = enh_model.predict(X_test_stud2)
enhanced_metrics_stu = {
    "Accuracy": accuracy_score(y_test_stud2, y_pred_enh),
    "Precision": precision_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
    "Recall": recall_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
    "F1 Score": f1_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
}
print("\n=== Students Enhanced Model ===")
print(enhanced_metrics_stu)

# Compare
print("\nCompare Baseline vs. Enhanced (Students):")
print("Baseline:", baseline_metrics_stu)
print("Enhanced:", enhanced_metrics_stu)

# Statistical test: paired t-test over the four metric values (one pair per metric)
baseline_scores_stu = np.array(list(baseline_metrics_stu.values()))
enhanced_scores_stu = np.array(list(enhanced_metrics_stu.values()))
t_stat_stu, p_value_stu = ttest_rel(baseline_scores_stu, enhanced_scores_stu) 
print(f"\nPaired T-Test: t={t_stat_stu:.4f}, p={p_value_stu:.4f}") 






=== Students Baseline Model ===
{'Accuracy': 0.9916792738275341, 'Precision': 0.9834277821391052, 'Recall': 0.9916792738275341, 'F1 Score': 0.9875362916732983}
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Added 1512 new features.
Sample new features: ['Teacher_Quality_Medium_plus_Peer_Influence_Neutral', 'Extracurricular_Activities_plus_Learning_Disabilities', 'Exam_Score_plus_Teacher_Quality_Medium', 'Sleep_Hours_plus_Parental_Education_Level_High School', 'Parental_Involvement_Medium_div_Teacher_Quality_Medium', 'Access_to_Resources_Low_plus_Teacher_Quality_Low', 'Attendance_div_Family_Income_Low', 'Learning_Disabilities_div_Exam_Score', 'Motivation_Level_Low_plus_Gender_Male', 'Parental_Education_Level_High School_minus_Distance_from_Home_Moderate']
[FeatureEngineering] Filtering features based on correlation & variance...
[FeatureEngineering] Dropping 1217
[FeatureEngineering] Dropping 21
[FeatureEngineering] 296 ne




[FeatureEngineering] Important Features (Combined SHAP + Permutation):
 Attendance       0.00005
Hours_Studied    0.00005
dtype: float64
[FeatureEngineering] Training & evaluating final model with new features...

[FeatureEngineering] Model Performance with Engineered Features: {'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}

[FeatureEngineering] Summary:
- 1512 new features generated
- 303 total features after filtering
- 296 new features retained

=== Students Enhanced Model ===
{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}

Compare Baseline vs. Enhanced (Students):
Baseline: {'Accuracy': 0.9916792738275341, 'Precision': 0.9834277821391052, 'Recall': 0.9916792738275341, 'F1 Score': 0.9875362916732983}
Enhanced: {'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}

Paired T-Test: t=-5.7796, p=0.0103
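
The sample feature names logged above include `Exam_Score_plus_Teacher_Quality_Medium` and `Learning_Disabilities_div_Exam_Score`. Since `Passed` is defined as `Exam_Score >= 60`, such columns may carry the label into the feature matrix, which would be one possible explanation for perfect scores. Below is a minimal sketch of a name-based guard, assuming derived columns keep the source column name as a substring. The helper `drop_columns_derived_from` and the toy values are hypothetical.

```python
import pandas as pd


def drop_columns_derived_from(df: pd.DataFrame, source_column: str):
    """Drop source_column and every column whose name contains it."""
    leaky = [c for c in df.columns if source_column in c]
    return df.drop(columns=leaky), leaky


# Toy frame mimicking the naming pattern of the generated features.
toy = pd.DataFrame({
    "Hours_Studied": [5, 20, 12],
    "Exam_Score": [55, 78, 61],
    "Exam_Score_plus_Attendance": [125, 170, 141],
    "Attendance_div_Hours_Studied": [14.0, 4.6, 6.7],
    "Passed": [0, 1, 1],
})

safe, removed = drop_columns_derived_from(toy, "Exam_Score")
print("removed:", removed)
print("kept   :", safe.columns.tolist())
```

Substring matching can over-match when one column name is contained in another, so dropping the source column before feature generation is the more direct option where that is possible.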


### Dataset 4: Life Expectancy (Regression)
Goal: Predict life expectancy from country-level health and economic indicators.

- Type: Regression (Target: Life expectancy)
- Baseline Model: XGBRegressor
- Enhanced Model: XGBRegressor with Engineered Features
- Comparison Metrics: R², RMSE, MAE


In [25]:
from xgboost import XGBRegressor
###############################################################################
# 1. Load & Preprocess the Data
###############################################################################
print("\nLoading Life Expectancy Dataset...")
dataset_path = "../Data/LifeExpect.csv"
df_life = pd.read_csv(dataset_path)

print("Columns in the dataset:", df_life.columns.tolist())

# Drop unnecessary columns if applicable
drop_columns = ["ID", "Index"]
df_life.drop(columns=[col for col in drop_columns if col in df_life.columns], inplace=True)

# Separate numeric and categorical columns
numeric_cols = df_life.select_dtypes(include=["int64", "float64"]).columns
categorical_cols = df_life.select_dtypes(include=["object"]).columns

# Handle missing values
df_life[numeric_cols] = df_life[numeric_cols].fillna(df_life[numeric_cols].median())
for col in categorical_cols:
    df_life[col] = df_life[col].fillna(df_life[col].mode()[0])

# Convert boolean columns to numeric (0/1)
bool_cols = df_life.select_dtypes(include='bool').columns
df_life[bool_cols] = df_life[bool_cols].astype(int)

# Identify categorical columns
object_cols = categorical_cols.tolist()
CARDINALITY_THRESHOLD = 50
low_cardinality_cols = [col for col in object_cols if df_life[col].nunique() <= CARDINALITY_THRESHOLD]
high_cardinality_cols = [col for col in object_cols if df_life[col].nunique() > CARDINALITY_THRESHOLD]

print(f"Low-cardinality columns: {low_cardinality_cols}")
print(f"High-cardinality columns: {high_cardinality_cols}")

# One-Hot Encoding for low-cardinality categorical features
df_life = pd.get_dummies(df_life, columns=low_cardinality_cols, drop_first=True)

# Frequency Encoding for high-cardinality categorical features
for col in high_cardinality_cols:
    freq_map = df_life[col].value_counts(normalize=True)
    df_life[f"{col}_freq"] = df_life[col].map(freq_map)
df_life.drop(columns=high_cardinality_cols, inplace=True, errors='ignore')

# Ensure all boolean columns are numeric (0/1)
bool_cols_after = df_life.select_dtypes(include='bool').columns
df_life[bool_cols_after] = df_life[bool_cols_after].astype(int)

# Keep only numeric columns ("number" covers every integer and float dtype)
df_life = df_life.select_dtypes(include="number")

# Define the target variable (Life Expectancy)
target_col = "Life expectancy " 
if target_col not in df_life.columns:
    raise KeyError(f"Target column '{target_col}' not found. Available: {df_life.columns.tolist()}")

# Split the data into features and target
X_life = df_life.drop(columns=[target_col])
y_life = df_life[target_col]

# Standardize features
scaler = StandardScaler()
X_life_scaled = pd.DataFrame(scaler.fit_transform(X_life), columns=X_life.columns)

# Save cleaned dataset
cleaned_path = "../Data/life_expectancy_clean.csv"
df_life.to_csv(cleaned_path, index=False)

###############################################################################
# 2. Baseline Model Training (XGBRegressor)
###############################################################################
X_train_life, X_test_life, y_train_life, y_test_life = train_test_split(
    X_life_scaled, y_life, test_size=0.2, random_state=42
)

# Train a baseline model with XGBoost
baseline_model = XGBRegressor(n_estimators=100, max_depth=7, random_state=42, n_jobs=-1)
baseline_model.fit(X_train_life, y_train_life)

# Make predictions
y_pred_life = baseline_model.predict(X_test_life)

# Calculate regression metrics
baseline_results_exp = {
    "R²": r2_score(y_test_life, y_pred_life),
    "RMSE": np.sqrt(mean_squared_error(y_test_life, y_pred_life)),
    "MAE": mean_absolute_error(y_test_life, y_pred_life),
}

print("\n=== Life Expectancy Baseline Model ===")
print(baseline_results_exp)

###############################################################################
# 3. Feature Engineering (SemiAutomatedFeatureEngineering)
###############################################################################
print("\nRunning Feature Engineering on Life Expectancy Data...")

feature_engineer = SemiAutomatedFeatureEngineering(
    df_life.copy(),
    target_column=target_col,
    task="regression",
    correlation_threshold=0.1,  
    variance_threshold=0.05
)

feature_engineer.run_pipeline()

###############################################################################
# 4. Train a New Model on the Enhanced Data
###############################################################################
# Get the enhanced dataset
df_life_enhanced = feature_engineer.df
X_life_enhanced = df_life_enhanced.drop(columns=[target_col])
y_life_enhanced = df_life_enhanced[target_col]

# Scale enhanced features
X_life_enhanced_scaled = pd.DataFrame(scaler.fit_transform(X_life_enhanced), columns=X_life_enhanced.columns)

X_train_life_enh, X_test_life_enh, y_train_life_enh, y_test_life_enh = train_test_split(
    X_life_enhanced_scaled, y_life_enhanced, test_size=0.2, random_state=42
)

# Train an enhanced model with XGBoost
enhanced_model = XGBRegressor(n_estimators=100, max_depth=7, random_state=42, n_jobs=-1)
enhanced_model.fit(X_train_life_enh, y_train_life_enh)

# Evaluate the enhanced model
y_pred_life_enh = enhanced_model.predict(X_test_life_enh)
enhanced_results_exp = {
    "R²": r2_score(y_test_life_enh, y_pred_life_enh),
    "RMSE": np.sqrt(mean_squared_error(y_test_life_enh, y_pred_life_enh)),
    "MAE": mean_absolute_error(y_test_life_enh, y_pred_life_enh),
}

###############################################################################
# 5. Compare Baseline vs. Enhanced
###############################################################################
print("\nComparison Between Baseline and Enhanced Model (Life Expectancy Prediction):")
print("Baseline Model:", baseline_results_exp)
print("Enhanced Model:", enhanced_results_exp)

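# Paired t-test over the three metric values (one pair per metric).
# Note: R² is higher-is-better, whereas RMSE and MAE are lower-is-better.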
baseline_scores = np.array(list(baseline_results_exp.values()))
enhanced_scores = np.array(list(enhanced_results_exp.values()))
t_stat, p_value = ttest_rel(baseline_scores, enhanced_scores) 
print(f"\nPaired T-Test: t={t_stat:.4f}, p={p_value:.4f}") 



Loading Life Expectancy Dataset...
Columns in the dataset: ['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Schooling']
Low-cardinality columns: ['Status']
High-cardinality columns: ['Country']

=== Life Expectancy Baseline Model ===
{'R²': 0.965400247335062, 'RMSE': 1.731666654894535, 'MAE': 1.1672954948580996}

Running Feature Engineering on Life Expectancy Data...
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Added 840 new features.
Sample new features: ['infant deaths_plus_Total expenditure', 'Measles _plus_ BMI ', 'Hepatitis B_times_GDP', 'Polio_div_GDP', 'Year_minus_ thinness 5-9 years', 'percentage expenditure_minus_Country_




[FeatureEngineering] Important Features (Combined SHAP + Permutation):
 under-five deaths _times_ HIV/AIDS                         1.228899
Adult Mortality_div_Schooling                              0.517097
Income composition of resources_minus_Status_Developing    0.360611
 HIV/AIDS_minus_Income composition of resources            0.290027
 HIV/AIDS_minus_Schooling                                  0.256694
Year_div_Income composition of resources                   0.170533
infant deaths_times_ HIV/AIDS                              0.166747
Diphtheria _times_Income composition of resources          0.159529
Year_times_Income composition of resources                 0.134318
Diphtheria _div_ HIV/AIDS                                  0.104327
 BMI _div_ HIV/AIDS                                        0.086058
Year_minus_Adult Mortality                                 0.083090
Adult Mortality_div_Income composition of resources        0.073742
infant deaths_div_under-five deaths        
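
The Life Expectancy cell frequency-encodes the high-cardinality `Country` column, replacing each category with its relative frequency. A small sketch of the same idea, under the assumption that the mapping is learned from training rows only and that unseen categories are encoded as 0 (the toy frames and that fallback are illustrative choices, not what the cell above does):

```python
import pandas as pd

# Toy train/test frames (illustrative values only).
train = pd.DataFrame({"Country": ["A", "A", "B", "C"]})
test = pd.DataFrame({"Country": ["A", "D"]})

# Relative frequency of each category in the training rows.
freq_map = train["Country"].value_counts(normalize=True)

train["Country_freq"] = train["Country"].map(freq_map)
# Categories absent from the training rows map to NaN; fall back to 0.
test["Country_freq"] = test["Country"].map(freq_map).fillna(0.0)

print(train)
print(test)
```

Learning the frequencies from the training rows keeps information about the test rows out of the encoded feature.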

### Comparison to the autofeat Library's Algorithm

In [29]:
# Import required libraries
import os
import pandas as pd
import numpy as np
from autofeat import AutoFeatRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from scipy.stats import ttest_rel  # Using paired t-test
from sklearn.preprocessing import StandardScaler

###############################################################################
# 1. Load & Preprocess the Data
###############################################################################
dataset_path = os.path.abspath("../Data/amsterdam_weekdays_clean.csv")
df_amsterdam = pd.read_csv(dataset_path)

# Define Features and Target
target_col = "realSum"
X = df_amsterdam.drop(columns=[target_col])  # Features
y = df_amsterdam[target_col]  # Target (Regression Task)

# Standardize the features
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

###############################################################################
# 2. Apply AutoFeat for Feature Engineering
###############################################################################
print("\nApplying AutoFeat on Amsterdam Dataset...")

# Initialize AutoFeat Regressor
auto_feat = AutoFeatRegressor(verbose=1, feateng_steps=1)

# Fit and Transform Training Data
X_train_autofeat = auto_feat.fit_transform(X_train, y_train)
X_test_autofeat = auto_feat.transform(X_test)

# Number of New Features Created
print(f"AutoFeat generated {X_train_autofeat.shape[1] - X_train.shape[1]} new features.")

###############################################################################
# 3. Train a RandomForest Regressor with AutoFeat Features
###############################################################################
auto_feat_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
auto_feat_model.fit(X_train_autofeat, y_train)

# Predict and Evaluate
y_pred_autofeat = auto_feat_model.predict(X_test_autofeat)

auto_feat_results = {
    "R²": r2_score(y_test, y_pred_autofeat),
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred_autofeat)),
    "MAE": mean_absolute_error(y_test, y_pred_autofeat),
}

print("\n=== AutoFeat Model Performance ===")
print(auto_feat_results)

###############################################################################
# 4. Load Semi-Automated Feature Engineering Results
###############################################################################
# These values should be replaced with actual results from the semi-automated process
print("\nLoading Results from Semi-Automated Feature Engineering...")
semi_auto_results = {
    "R²": 0.5894,  
    "RMSE": 200.03,
    "MAE": 134.65
}

print("\n=== Semi-Automated Feature Engineering Performance ===")
print(semi_auto_results)

###############################################################################
# 5. Compare Semi-Automated vs. AutoFeat
###############################################################################
print("\nComparison Between Semi-Automated and AutoFeat (Amsterdam)")
print("Semi-Automated:", semi_auto_results)
print("AutoFeat:", auto_feat_results)

# Convert dictionary values to arrays for comparison (order: R², RMSE, MAE)
semi_auto_scores = np.array(list(semi_auto_results.values()))
auto_feat_scores = np.array(list(auto_feat_results.values()))

# Keep only the metrics that are non-NaN for both approaches
valid_mask = ~np.isnan(semi_auto_scores) & ~np.isnan(auto_feat_scores)
semi_auto_valid = semi_auto_scores[valid_mask]
auto_feat_valid = auto_feat_scores[valid_mask]

# Run the paired t-test only if at least 2 valid metrics exist.
# Caveat: the pairs are three metrics on different scales and with opposite
# directions (higher R² is better, lower RMSE/MAE is better), so with n=3 the
# test has very little power and its result is only a rough indication.
if len(semi_auto_valid) > 1 and len(auto_feat_valid) > 1:
    t_stat, p_value = ttest_rel(semi_auto_valid, auto_feat_valid)

    print(f"\nPaired T-Test (Semi-Automated vs. AutoFeat): t={t_stat:.4f}, p-value={p_value:.4f}")

    # Conclusion
    if p_value < 0.05:
        print("Statistically significant difference found between Semi-Automated and AutoFeat approach.")
    else:
        print("No significant difference found between Semi-Automated and AutoFeat approach.")
else:
    print("Not enough valid metrics to perform the paired t-test.")


2025-03-10 12:05:57,074 INFO: [AutoFeat] The 1 step feature engineering process could generate up to 147 features.
2025-03-10 12:05:57,076 INFO: [AutoFeat] With 882 data points this new feature matrix would use about 0.00 gb of space.
2025-03-10 12:05:57,078 INFO: [feateng] Step 1: transformation of original features



Applying AutoFeat on Amsterdam Dataset...
[feateng]               0/             21 features transformed

2025-03-10 12:05:58,369 INFO: [feateng] Generated 65 transformed features from 21 original features - done.
2025-03-10 12:05:58,372 INFO: [feateng] Generated altogether 65 new features in 1 steps
2025-03-10 12:05:58,373 INFO: [feateng] Removing correlated features, as well as additions at the highest level
2025-03-10 12:05:58,380 INFO: [feateng] Generated a total of 65 additional features
2025-03-10 12:05:58,385 INFO: [featsel] Feature selection run 1/5
2025-03-10 12:05:58,536 INFO: [featsel] Feature selection run 2/5


[featsel] Scaling data...done.


2025-03-10 12:05:58,678 INFO: [featsel] Feature selection run 3/5
2025-03-10 12:05:58,828 INFO: [featsel] Feature selection run 4/5
2025-03-10 12:05:58,966 INFO: [featsel] Feature selection run 5/5
2025-03-10 12:05:59,108 INFO: [featsel] 15 features after 5 feature selection runs
2025-03-10 12:05:59,111 INFO: [featsel] 12 features after correlation filtering
2025-03-10 12:05:59,142 INFO: [featsel] 11 features after noise filtering
2025-03-10 12:05:59,144 INFO: [AutoFeat] Computing 6 new features.


[AutoFeat]     5/    6 new features

2025-03-10 12:05:59,671 INFO: [AutoFeat]     6/    6 new features ...done.
2025-03-10 12:05:59,673 INFO: [AutoFeat] Final dataframe with 27 feature columns (6 new).
2025-03-10 12:05:59,674 INFO: [AutoFeat] Training final regression model.
2025-03-10 12:05:59,690 INFO: [AutoFeat] Trained model: largest coefficients:
2025-03-10 12:05:59,692 INFO: 402.11845964909133
2025-03-10 12:05:59,693 INFO: 92.500695 * room_type_Entire home/apt
2025-03-10 12:05:59,693 INFO: 79.735985 * person_capacity
2025-03-10 12:05:59,694 INFO: 76.659469 * attr_index
2025-03-10 12:05:59,696 INFO: 62.997990 * exp(guest_satisfaction_overall)
2025-03-10 12:05:59,697 INFO: -41.016056 * Abs(lng)
2025-03-10 12:05:59,698 INFO: 36.235626 * bedrooms
2025-03-10 12:05:59,699 INFO: 28.353970 * Unnamed0**2
2025-03-10 12:05:59,700 INFO: 24.991998 * bedrooms**2
2025-03-10 12:05:59,701 INFO: 19.892433 * exp(person_capacity)
2025-03-10 12:05:59,702 INFO: -10.822153 * dist
2025-03-10 12:05:59,702 INFO: 4.142612 * bedrooms**3
2025-0

AutoFeat generated 6 new features.

=== AutoFeat Model Performance ===
{'R²': 0.5516051871390136, 'RMSE': 209.03647251328667, 'MAE': 144.84566065118793}

Loading Results from Semi-Automated Feature Engineering...

=== Semi-Automated Feature Engineering Performance ===
{'R²': 0.5894, 'RMSE': 200.03, 'MAE': 134.65}

Comparison Between Semi-Automated and AutoFeat (Amsterdam)
Semi-Automated: {'R²': 0.5894, 'RMSE': 200.03, 'MAE': 134.65}
AutoFeat: {'R²': 0.5516051871390136, 'RMSE': 209.03647251328667, 'MAE': 144.84566065118793}

Paired T-Test (Semi-Automated vs. AutoFeat): t=-1.9770, p-value=0.1867
No significant difference found between Semi-Automated and AutoFeat approach.
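
The paired t-test above pairs only three aggregate metrics. A possibly more informative comparison pairs the per-sample absolute errors of the two models on the same test set, which gives one pair per test row instead of one pair per metric. The block below is a minimal sketch of that idea, not the notebook's method: `compare_paired_errors` is a hypothetical helper, and the demo uses synthetic data because the semi-automated model's test predictions are not available in this cell.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon


def compare_paired_errors(y_true, pred_a, pred_b, alpha=0.05):
    """Paired tests on per-sample absolute errors of two models.

    Hypothetical helper: both prediction arrays must come from the
    same test rows, in the same order.
    """
    y_true = np.asarray(y_true, dtype=float)
    err_a = np.abs(y_true - np.asarray(pred_a, dtype=float))
    err_b = np.abs(y_true - np.asarray(pred_b, dtype=float))

    # Paired t-test on the mean difference of absolute errors
    t_stat, t_p = ttest_rel(err_a, err_b)
    # Wilcoxon signed-rank test as a non-parametric alternative
    w_stat, w_p = wilcoxon(err_a, err_b)

    return {
        "mean_abs_err_a": float(err_a.mean()),
        "mean_abs_err_b": float(err_b.mean()),
        "t_stat": float(t_stat),
        "t_p": float(t_p),
        "wilcoxon_stat": float(w_stat),
        "wilcoxon_p": float(w_p),
        "significant": bool(t_p < alpha),
    }


# Synthetic demo (assumption: stands in for real test-set predictions)
rng = np.random.default_rng(0)
y_demo = rng.normal(500, 200, size=200)
pred_a_demo = y_demo + rng.normal(0, 50, size=200)   # lower-noise model
pred_b_demo = y_demo + rng.normal(0, 120, size=200)  # higher-noise model

demo = compare_paired_errors(y_demo, pred_a_demo, pred_b_demo)
print(demo)
```

In the notebook this would be called with `y_test`, the semi-automated model's test predictions, and `y_pred_autofeat`, provided both models were evaluated on the same train/test split.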
