### Tabular Data Science – Final Research Project
### Overview
This notebook presents an automated feature engineering approach to improve model performance across multiple datasets. We compare a baseline model trained on the original features with an enhanced model that includes automatically generated features.
### Approach
1. Baseline Model: Train and evaluate a simple model on the raw dataset.
2. Feature Engineering: Automatically generate, filter, and rank new features.
3. Enhanced Model: Train and evaluate a model with the engineered features.
4. Comparison: Compare baseline vs. enhanced model performance using statistical tests.
### Datasets Used
We evaluate our approach on four different datasets:
- Cancer Patient Data (Classification)
- Amsterdam Rental Prices (Regression)
- Student Performance Factors (Classification)
- Life expectancy (Regression)

In [1]:
import pandas as pd
import numpy as np
import shap
import os
import sys
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,r2_score, mean_squared_error, mean_absolute_error
from scipy.stats import ttest_rel
from sklearn.preprocessing import StandardScaler,LabelEncoder
from scipy.stats import wilcoxon
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Ensure the feature_engineering module is in the Python path
sys.path.append(os.path.abspath("feature_engineering"))
from feature_generator import SemiAutomatedFeatureEngineering


  from .autonotebook import tqdm as notebook_tqdm


### Dataset 1: Cancer Patient Data (Classification)
Goal: Predict cancer severity level based on patient attributes.

- Type: Classification (Target: Level)
- Baseline Model: RandomForestClassifier
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: Accuracy, Precision, Recall, F1 Score

In [None]:
###############################################################################
# 1. Load & Preprocess the Data
###############################################################################
dataset_path = os.path.abspath("../Data/cancer_patient.csv")
df_cancer = pd.read_csv(dataset_path)

# Drop unnecessary columns
if 'index' in df_cancer.columns:
    df_cancer.drop(columns=['index'], inplace=True)
if 'Patient Id' in df_cancer.columns:
    df_cancer.drop(columns=['Patient Id'], inplace=True)

df_cancer.drop_duplicates(inplace=True)

# Encode the target column
label_encoder = LabelEncoder()
df_cancer['Level'] = label_encoder.fit_transform(df_cancer['Level'])

# Save the cleaned dataset
cleaned_path = os.path.abspath("../Data/cancer_patient_clean.csv")
df_cancer.to_csv(cleaned_path, index=False)

###############################################################################
# 2. Baseline Model Training
###############################################################################
X = df_cancer.drop(columns=['Level'])
y = df_cancer['Level']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Simple baseline model
baseline_model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=10, random_state=42
)
baseline_model.fit(X_train, y_train)

# Evaluate baseline model
y_pred_baseline = baseline_model.predict(X_test)
baseline_results = {
    "Accuracy": accuracy_score(y_test, y_pred_baseline),
    "Precision": precision_score(y_test, y_pred_baseline, average='weighted'),
    "Recall": recall_score(y_test, y_pred_baseline, average='weighted'),
    "F1 Score": f1_score(y_test, y_pred_baseline, average='weighted')
}

print("\nBaseline Model Performance:")
print(baseline_results)

###############################################################################
# 3. Feature Engineering (SemiAutomatedFeatureEngineering)
###############################################################################
print("\nRunning Feature Engineering...")
feature_engineer = SemiAutomatedFeatureEngineering(
    df_cancer.copy(), 
    target_column="Level",
    task="classification"
)
enhanced_results = feature_engineer.run_pipeline()

###############################################################################
# 4. Train a New Model on the Enhanced Data
###############################################################################
# The feature_engineer.df now has new features
df_enhanced = feature_engineer.df
X_enhanced = df_enhanced.drop(columns=["Level"])
y_enhanced = df_enhanced["Level"]

X_train_enhanced, X_test_enhanced, y_train_enhanced, y_test_enhanced = train_test_split(
    X_enhanced, y_enhanced, test_size=0.2, random_state=42
)

enhanced_model = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=10, random_state=42
)
enhanced_model.fit(X_train_enhanced, y_train_enhanced)

# Evaluate the enhanced model
y_pred_enhanced = enhanced_model.predict(X_test_enhanced)
enhanced_model_results = {
    "Accuracy": accuracy_score(y_test_enhanced, y_pred_enhanced),
    "Precision": precision_score(y_test_enhanced, y_pred_enhanced, average='weighted'),
    "Recall": recall_score(y_test_enhanced, y_pred_enhanced, average='weighted'),
    "F1 Score": f1_score(y_test_enhanced, y_pred_enhanced, average='weighted')
}

###############################################################################
# 5. Compare Baseline vs. Enhanced
###############################################################################
print("\nComparison Between Baseline and Enhanced Model:")
print("Baseline Model:", baseline_results)
print("Enhanced Model:", enhanced_model_results)

# Statistical Test
baseline_scores = np.array(list(baseline_results.values()))
enhanced_scores = np.array(list(enhanced_model_results.values()))
t_stat, p_value = ttest_rel(baseline_scores, enhanced_scores)
print(f"\nPaired T-Test: t={t_stat:.4f}, p={p_value:.4f}")


Baseline Model Performance:
{'Accuracy': 0.9354838709677419, 'Precision': 0.946236559139785, 'Recall': 0.9354838709677419, 'F1 Score': 0.9319648093841643}

Running Feature Engineering...
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Filtering features based on correlation & variance...
[FeatureEngineering] Dropping 114 low-correlation features: ['Age', 'Swallowing Difficulty', 'Age_plus_Gender', 'Age_minus_Gender', 'Age_minus_Dust Allergy', 'Age_minus_OccuPational Hazards', 'Age_plus_Weight Loss', 'Age_minus_Weight Loss', 'Age_times_Weight Loss', 'Age_plus_Shortness of Breath', 'Age_minus_Shortness of Breath', 'Age_plus_Wheezing', 'Age_minus_Wheezing', 'Age_plus_Swallowing Difficulty', 'Age_minus_Swallowing Difficulty', 'Age_times_Swallowing Difficulty', 'Age_div_Swallowing Difficulty', 'Age_plus_Clubbing of Finger Nails', 'Age_minus_Clubbing of Finger Nails', 'Age_plus_Frequent Cold', 'Age_minus_Frequent Cold', 'Age_plu

### Dataset 2: Amsterdam Rental Prices (Regression)
Goal: Predict rental prices of properties in Amsterdam.

- Type: Regression (Target: realSum)
- Baseline Model: RandomForestRegressor
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: R², RMSE, MAE

In [7]:
###############################################################################
# 1. LOAD & CLEAN (Baseline)
###############################################################################
dataset_path = os.path.abspath("../Data/amsterdam_weekdays.csv")
df_amsterdam = pd.read_csv(dataset_path)

# Convert booleans/strings to 0/1
for col in ["host_is_superhost", "room_private", "room_shared"]:
    if col in df_amsterdam.columns:
        df_amsterdam[col] = df_amsterdam[col].replace({
            False: 0, True: 1, "FALSE": 0, "TRUE": 1
        }).astype(int)

# One-hot encode 'room_type' if present
if "room_type" in df_amsterdam.columns:
    df_amsterdam = pd.get_dummies(df_amsterdam, columns=["room_type"], prefix="room_type")

# Convert any leftover bools
bool_cols = df_amsterdam.select_dtypes(include='bool').columns
df_amsterdam[bool_cols] = df_amsterdam[bool_cols].astype(int)

target_col = "realSum"
if target_col not in df_amsterdam.columns:
    raise KeyError(f"Target '{target_col}' not found in amsterdam_weekdays.csv")

# Save cleaned
cleaned_path = os.path.abspath("../Data/amsterdam_weekdays_clean.csv")
#df_amsterdam.to_csv(cleaned_path, index=False)

# Prepare baseline data
X_amst = df_amsterdam.drop(columns=[target_col])
y_amst = df_amsterdam[target_col]

scaler = StandardScaler()
X_amst_scaled = pd.DataFrame(scaler.fit_transform(X_amst), columns=X_amst.columns)

X_train_amst, X_test_amst, y_train_amst, y_test_amst = train_test_split(
    X_amst_scaled, y_amst, test_size=0.2, random_state=42
)

# Baseline model
baseline_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
baseline_model.fit(X_train_amst, y_train_amst)

# Evaluate baseline
y_pred_baseline = baseline_model.predict(X_test_amst)
baseline_metrics = {
    "R²": r2_score(y_test_amst, y_pred_baseline),
    "RMSE": np.sqrt(mean_squared_error(y_test_amst, y_pred_baseline)),
    "MAE": mean_absolute_error(y_test_amst, y_pred_baseline),
}
print("\n=== Amsterdam Baseline Model ===")
print(baseline_metrics)

###############################################################################
# 2. SEMI-AUTOMATED FEATURE ENGINEERING
###############################################################################

feature_engineer = SemiAutomatedFeatureEngineering(
    df_amsterdam.copy(),  # pass a copy
    target_column=target_col,
    task="regression"
    # correlation_threshold=0.05, variance_threshold=0.01 (if your class supports those)
)
pipeline_results = feature_engineer.run_pipeline()  # This prints out results, features, etc.

# The pipeline's final DataFrame with new features
df_amst_enh = feature_engineer.df
X_enh = df_amst_enh.drop(columns=[target_col])
y_enh = df_amst_enh[target_col]

# Scale again with new features
X_enh_scaled = pd.DataFrame(scaler.fit_transform(X_enh), columns=X_enh.columns)

X_train_enh, X_test_enh, y_train_enh, y_test_enh = train_test_split(
    X_enh_scaled, y_enh, test_size=0.2, random_state=42
)

enh_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
enh_model.fit(X_train_enh, y_train_enh)

y_pred_enh = enh_model.predict(X_test_enh)
enhanced_metrics = {
    "R²": r2_score(y_test_enh, y_pred_enh),
    "RMSE": np.sqrt(mean_squared_error(y_test_enh, y_pred_enh)),
    "MAE": mean_absolute_error(y_test_enh, y_pred_enh),
}
print("\n=== Amsterdam Enhanced Model ===")
print(enhanced_metrics)

# Compare
print("\nComparison Baseline vs. Enhanced (Amsterdam)")
print("Baseline:", baseline_metrics)
print("Enhanced:", enhanced_metrics)

base_scores = np.array(list(baseline_metrics.values()))
enh_scores = np.array(list(enhanced_metrics.values()))
stat, pval = wilcoxon(base_scores, enh_scores)
print(f"Wilcoxon: statistic={stat:.4f}, pvalue={pval:.4f}")

  df_amsterdam[col] = df_amsterdam[col].replace({
  df_amsterdam[col] = df_amsterdam[col].replace({
  df_amsterdam[col] = df_amsterdam[col].replace({



=== Amsterdam Baseline Model ===
{'R²': 0.481964703404132, 'RMSE': 224.68364147185966, 'MAE': 148.43478257371325}
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Filtering features based on correlation & variance...
[FeatureEngineering] Dropping 198 low-correlation features: ['Unnamed: 0', 'room_shared', 'biz', 'cleanliness_rating', 'lng', 'room_type_Shared room', 'Unnamed: 0_plus_room_shared', 'Unnamed: 0_minus_room_shared', 'Unnamed: 0_times_room_shared', 'Unnamed: 0_div_room_shared', 'Unnamed: 0_plus_room_private', 'Unnamed: 0_minus_room_private', 'Unnamed: 0_plus_person_capacity', 'Unnamed: 0_minus_person_capacity', 'Unnamed: 0_plus_host_is_superhost', 'Unnamed: 0_minus_host_is_superhost', 'Unnamed: 0_plus_multi', 'Unnamed: 0_minus_multi', 'Unnamed: 0_plus_biz', 'Unnamed: 0_minus_biz', 'Unnamed: 0_times_biz', 'Unnamed: 0_div_biz', 'Unnamed: 0_plus_cleanliness_rating', 'Unnamed: 0_minus_cleanliness_rating', 'Unnamed: 0_




[FeatureEngineering] Important Features (Combined SHAP + Permutation):
 person_capacity_plus_room_type_Entire home/apt     17.100581
person_capacity_minus_metro_dist                   14.274584
room_private_div_person_capacity                    9.024589
person_capacity_times_room_type_Entire home/apt     7.837398
bedrooms_plus_room_type_Entire home/apt             7.156334
person_capacity_minus_dist                          7.011777
guest_satisfaction_overall_plus_attr_index_norm     6.579408
person_capacity_div_room_type_Entire home/apt       5.831489
bedrooms_minus_dist                                 4.501200
attr_index_norm_div_room_type_Entire home/apt       4.000648
bedrooms_times_attr_index_norm                      3.653305
person_capacity_times_attr_index                    3.164379
guest_satisfaction_overall_plus_rest_index_norm     2.550856
attr_index_norm_times_room_type_Entire home/apt     2.537424
person_capacity_div_dist                            2.532622
person_capac

### Dataset 3: Student Performance (Classification)
Goal: Predict whether a student passes based on academic and lifestyle factors.

- Type: Classification (Target: Passed)
- Baseline Model: RandomForestClassifier
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: Accuracy, Precision, Recall, F1 Score

In [5]:
###############################################################################
# 1. BASELINE
###############################################################################
df_students = pd.read_csv("../Data/StudentPerformanceFactors.csv")

# Create "Passed" from "Exam_Score"
if "Exam_Score" not in df_students.columns:
    raise KeyError("No 'Exam_Score' column found; cannot create 'Passed' column.")
df_students["Passed"] = (df_students["Exam_Score"] >= 60).astype(int)

# Convert boolean-like columns
bool_cols_list = ["Internet_Access", "Learning_Disabilities", "Extracurricular_Activities"]
for col in bool_cols_list:
    if col in df_students.columns:
        if df_students[col].dtype == bool:
            df_students[col] = df_students[col].astype(int)
        else:
            df_students[col] = (
                df_students[col]
                .astype(str)
                .str.lower()
                .replace({"true": 1, "false": 0, "yes": 1, "no": 0})
                .fillna(0)
                .astype(int)
            )

# Identify other categorical columns & one-hot
categorical_cols = [
    "Parental_Involvement", "Access_to_Resources", "Motivation_Level",
    "Family_Income", "Teacher_Quality", "School_Type", "Peer_Influence",
    "Parental_Education_Level", "Distance_from_Home", "Gender"
]
existing_cat = [c for c in categorical_cols if c in df_students.columns]
if existing_cat:
    df_students = pd.get_dummies(df_students, columns=existing_cat, drop_first=True)

# Convert leftover bool
bool_cols2 = df_students.select_dtypes(include='bool').columns
df_students[bool_cols2] = df_students[bool_cols2].astype(int)

target_col = "Passed"
if target_col not in df_students.columns:
    raise KeyError(f"Target '{target_col}' not found in student dataset.")

# For baseline, we'll remove "Exam_Score" from features to avoid direct leak
X_stud = df_students.drop(columns=["Exam_Score", target_col])
y_stud = df_students[target_col]

scaler = StandardScaler()
X_stud_scaled = pd.DataFrame(scaler.fit_transform(X_stud), columns=X_stud.columns)

X_train_stud, X_test_stud, y_train_stud, y_test_stud = train_test_split(
    X_stud_scaled, y_stud, test_size=0.2, random_state=42
)

baseline_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
baseline_model.fit(X_train_stud, y_train_stud)

y_pred_base = baseline_model.predict(X_test_stud)
baseline_metrics = {
    "Accuracy": accuracy_score(y_test_stud, y_pred_base),
    "Precision": precision_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
    "Recall": recall_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
    "F1 Score": f1_score(y_test_stud, y_pred_base, average='weighted', zero_division=0),
}
print("\n=== Students Baseline Model ===")
print(baseline_metrics)

###############################################################################
# 2. SEMI-AUTOMATED FEATURE ENGINEERING
###############################################################################

feature_engineer = SemiAutomatedFeatureEngineering(
    df_students.copy(),
    target_column=target_col,
    task="classification"
)
pipeline_results = feature_engineer.run_pipeline()

df_stud_enh = feature_engineer.df
X_stud_enh = df_stud_enh.drop(columns=["Exam_Score", target_col], errors='ignore')
y_stud_enh = df_stud_enh[target_col]

X_stud_enh_scaled = pd.DataFrame(scaler.fit_transform(X_stud_enh), columns=X_stud_enh.columns)

X_train_stud2, X_test_stud2, y_train_stud2, y_test_stud2 = train_test_split(
    X_stud_enh_scaled, y_stud_enh, test_size=0.2, random_state=42
)

enh_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
enh_model.fit(X_train_stud2, y_train_stud2)

y_pred_enh = enh_model.predict(X_test_stud2)
enhanced_metrics = {
    "Accuracy": accuracy_score(y_test_stud2, y_pred_enh),
    "Precision": precision_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
    "Recall": recall_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
    "F1 Score": f1_score(y_test_stud2, y_pred_enh, average='weighted', zero_division=0),
}
print("\n=== Students Enhanced Model ===")
print(enhanced_metrics)

# Compare
print("\nCompare Baseline vs. Enhanced (Students):")
print("Baseline:", baseline_metrics)
print("Enhanced:", enhanced_metrics)

base_scores = np.array(list(baseline_metrics.values()))
enh_scores = np.array(list(enhanced_metrics.values()))
stat, pval = wilcoxon(base_scores, enh_scores)
print(f"\nWilcoxon: statistic={stat:.4f}, pvalue={pval:.4f}")



  .replace({"true": 1, "false": 0, "yes": 1, "no": 0})
  .replace({"true": 1, "false": 0, "yes": 1, "no": 0})
  .replace({"true": 1, "false": 0, "yes": 1, "no": 0})



=== Students Baseline Model ===
{'Accuracy': 0.9916792738275341, 'Precision': 0.9834277821391052, 'Recall': 0.9916792738275341, 'F1 Score': 0.9875362916732983}
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Filtering features based on correlation & variance...
[FeatureEngineering] Dropping 1217 low-correlation features: ['Extracurricular_Activities', 'Sleep_Hours', 'Previous_Scores', 'Internet_Access', 'Physical_Activity', 'Parental_Involvement_Low', 'Parental_Involvement_Medium', 'Access_to_Resources_Low', 'Access_to_Resources_Medium', 'Motivation_Level_Medium', 'Family_Income_Low', 'Family_Income_Medium', 'Teacher_Quality_Low', 'Teacher_Quality_Medium', 'School_Type_Public', 'Peer_Influence_Neutral', 'Peer_Influence_Positive', 'Parental_Education_Level_High School', 'Parental_Education_Level_Postgraduate', 'Distance_from_Home_Moderate', 'Distance_from_Home_Near', 'Gender_Male', 'Hours_Studied_minus_Previous_Scores', 'Ho

### Dataset 4: Life expectancy (Regression)
Goal: Predict life expectancy 
- Type: Regression (Target: Life expectancy)
- Baseline Model: RandomForestRegressor
- Enhanced Model: Random Forest with Engineered Features
- Comparison Metrics: R², RMSE, MAE


In [None]:
###############################################################################
# 1. Load & Preprocess the Data
###############################################################################
print("\nLoading Life Expectancy Dataset...")
dataset_path = "../Data/LifeExpect.csv"
df_life = pd.read_csv(dataset_path)

print("Columns in the dataset:", df_life.columns.tolist())

# Drop unnecessary columns if applicable
drop_columns = ["ID", "Index"]
df_life.drop(columns=[col for col in drop_columns if col in df_life.columns], inplace=True)

# Separate numeric and categorical columns
numeric_cols = df_life.select_dtypes(include=["int64", "float64"]).columns
categorical_cols = df_life.select_dtypes(include=["object"]).columns

# Handle missing values
df_life[numeric_cols] = df_life[numeric_cols].fillna(df_life[numeric_cols].median())  # Median imputation for numeric columns
for col in categorical_cols:
    df_life[col] = df_life[col].fillna(df_life[col].mode()[0])  # Fill categorical columns with mode

# Convert boolean columns to numeric (0/1)
bool_cols = df_life.select_dtypes(include='bool').columns
df_life[bool_cols] = df_life[bool_cols].astype(int)

# Identify categorical columns
object_cols = categorical_cols.tolist()
CARDINALITY_THRESHOLD = 50
low_cardinality_cols = [col for col in object_cols if df_life[col].nunique() <= CARDINALITY_THRESHOLD]
high_cardinality_cols = [col for col in object_cols if df_life[col].nunique() > CARDINALITY_THRESHOLD]

print(f"Low-cardinality columns: {low_cardinality_cols}")
print(f"High-cardinality columns: {high_cardinality_cols}")

# One-Hot Encoding for low-cardinality categorical features
df_life = pd.get_dummies(df_life, columns=low_cardinality_cols, drop_first=True)

# Frequency Encoding for high-cardinality categorical features
for col in high_cardinality_cols:
    freq_map = df_life[col].value_counts(normalize=True)
    df_life[f"{col}_freq"] = df_life[col].map(freq_map)
df_life.drop(columns=high_cardinality_cols, inplace=True, errors='ignore')

# Ensure all boolean columns are numeric (0/1)
bool_cols_after = df_life.select_dtypes(include='bool').columns
df_life[bool_cols_after] = df_life[bool_cols_after].astype(int)

# Ensure all columns are numeric
df_life = df_life.select_dtypes(include=["int", "float"])

# Define the target variable (Life Expectancy)
target_col = "Life expectancy "  # Ensure correct column name
if target_col not in df_life.columns:
    raise KeyError(f"Target column '{target_col}' not found. Available: {df_life.columns.tolist()}")

# Split the data into features and target
X_life = df_life.drop(columns=[target_col])
y_life = df_life[target_col]

# Standardize features
scaler = StandardScaler()
X_life_scaled = pd.DataFrame(scaler.fit_transform(X_life), columns=X_life.columns)

# Save cleaned dataset
cleaned_path = "../Data/life_expectancy_clean.csv"
df_life.to_csv(cleaned_path, index=False)

###############################################################################
# 2. Baseline Model Training (XGBRegressor)
###############################################################################
X_train_life, X_test_life, y_train_life, y_test_life = train_test_split(
    X_life_scaled, y_life, test_size=0.2, random_state=42
)

# Train a baseline model with XGBoost
baseline_model = XGBRegressor(n_estimators=100, max_depth=7, random_state=42, n_jobs=-1)
baseline_model.fit(X_train_life, y_train_life)

# Make predictions
y_pred_life = baseline_model.predict(X_test_life)

# Calculate regression metrics
baseline_results = {
    "R²": r2_score(y_test_life, y_pred_life),
    "RMSE": np.sqrt(mean_squared_error(y_test_life, y_pred_life)),
    "MAE": mean_absolute_error(y_test_life, y_pred_life),
}

print("\n=== Life Expectancy Baseline Model ===")
print(baseline_results)

###############################################################################
# 3. Feature Engineering (SemiAutomatedFeatureEngineering)
###############################################################################
print("\nRunning Feature Engineering on Life Expectancy Data...")

feature_engineer = SemiAutomatedFeatureEngineering(
    df_life.copy(),
    target_column=target_col,
    task="regression",
    correlation_threshold=0.1,  # Stricter correlation threshold
    variance_threshold=0.05  # Drop low variance features
)

feature_engineer.run_pipeline()

###############################################################################
# 4. Train a New Model on the Enhanced Data
###############################################################################
# Get the enhanced dataset
df_life_enhanced = feature_engineer.df
X_life_enhanced = df_life_enhanced.drop(columns=[target_col])
y_life_enhanced = df_life_enhanced[target_col]

# Scale enhanced features
X_life_enhanced_scaled = pd.DataFrame(scaler.fit_transform(X_life_enhanced), columns=X_life_enhanced.columns)

X_train_life_enh, X_test_life_enh, y_train_life_enh, y_test_life_enh = train_test_split(
    X_life_enhanced_scaled, y_life_enhanced, test_size=0.2, random_state=42
)

# Train an enhanced model with XGBoost
enhanced_model = XGBRegressor(n_estimators=100, max_depth=7, random_state=42, n_jobs=-1)
enhanced_model.fit(X_train_life_enh, y_train_life_enh)

# Evaluate the enhanced model
y_pred_life_enh = enhanced_model.predict(X_test_life_enh)
enhanced_results = {
    "R²": r2_score(y_test_life_enh, y_pred_life_enh),
    "RMSE": np.sqrt(mean_squared_error(y_test_life_enh, y_pred_life_enh)),
    "MAE": mean_absolute_error(y_test_life_enh, y_pred_life_enh),
}

###############################################################################
# 5. Compare Baseline vs. Enhanced
###############################################################################
print("\nComparison Between Baseline and Enhanced Model (Life Expectancy Prediction):")
print("Baseline Model:", baseline_results)
print("Enhanced Model:", enhanced_results)

# Statistical Test (Paired T-Test)
baseline_scores = np.array(list(baseline_results.values()))
enhanced_scores = np.array(list(enhanced_results.values()))
t_stat, p_value = ttest_rel(baseline_scores, enhanced_scores) 
print(f"\nPaired T-Test: t={t_stat:.4f}, p={p_value:.4f}") 



Loading Life Expectancy Dataset...
Columns in the dataset: ['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Schooling']
Low-cardinality columns: ['Status']
High-cardinality columns: ['Country']

=== Life Expectancy Baseline Model ===
{'R²': 0.965400247335062, 'RMSE': 1.731666654894535, 'MAE': 1.1672954948580996}

Running Feature Engineering on Life Expectancy Data...
[FeatureEngineering] Starting pipeline...
[FeatureEngineering] Generating new features...
[FeatureEngineering] Filtering features based on correlation & variance...
[FeatureEngineering] Dropping 141 low-correlation features: ['Population', 'Country_freq', 'Year_div_Hepatitis B', 'Year_div_Measles ', 'Year_minus_Total expenditure', '




[FeatureEngineering] Important Features (Combined SHAP + Permutation):
 under-five deaths _times_ HIV/AIDS                         1.228899
Adult Mortality_div_Schooling                              0.517097
Income composition of resources_minus_Status_Developing    0.360611
 HIV/AIDS_minus_Income composition of resources            0.290027
 HIV/AIDS_minus_Schooling                                  0.256694
Year_div_Income composition of resources                   0.170533
infant deaths_times_ HIV/AIDS                              0.166747
Diphtheria _times_Income composition of resources          0.159529
Year_times_Income composition of resources                 0.134318
Diphtheria _div_ HIV/AIDS                                  0.104327
 BMI _div_ HIV/AIDS                                        0.086058
Year_minus_Adult Mortality                                 0.083090
Adult Mortality_div_Income composition of resources        0.073742
infant deaths_div_under-five deaths        

comparison to the autofeat library's algorithm

In [None]:
# import os
# import pandas as pd
# import numpy as np
# from autofeat import AutoFeatClassifier
# from sklearn.model_selection import train_test_split
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# from scipy.stats import wilcoxon
# from sklearn.preprocessing import StandardScaler

# ###############################################################################
# # 1. Load & Preprocess the Data
# ###############################################################################
# dataset_path = os.path.abspath("../Data/cancer_patient_clean.csv")
# df_cancer = pd.read_csv(dataset_path)

# # Define Features and Target
# X = df_cancer.drop(columns=['Level'])  # Features
# y = df_cancer['Level']                 # Target (Classification Task)

# # Train/Test Split
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42
# )

# ###############################################################################
# # 2. Apply AutoFeat for Feature Engineering
# ###############################################################################
# print("\nApplying AutoFeat on Cancer Dataset...")

# # Initialize AutoFeat Classifier
# auto_feat = AutoFeatClassifier(verbose=1, feateng_steps=2)

# # Fit and Transform Training Data
# X_train_autofeat = auto_feat.fit_transform(X_train, y_train)
# X_test_autofeat = auto_feat.transform(X_test)

# # Number of New Features Created
# print(f"AutoFeat generated {X_train_autofeat.shape[1] - X_train.shape[1]} new features.")

# ###############################################################################
# # 3. Train a RandomForest Classifier with AutoFeat Features
# ###############################################################################
# auto_feat_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
# auto_feat_model.fit(X_train_autofeat, y_train)

# # Predict and Evaluate
# y_pred_autofeat = auto_feat_model.predict(X_test_autofeat)

# auto_feat_results = {
#     "Accuracy": accuracy_score(y_test, y_pred_autofeat),
#     "Precision": precision_score(y_test, y_pred_autofeat, average='weighted'),
#     "Recall": recall_score(y_test, y_pred_autofeat, average='weighted'),
#     "F1 Score": f1_score(y_test, y_pred_autofeat, average='weighted'),
# }

# print("\n=== AutoFeat Model Performance ===")
# print(auto_feat_results)

# ###############################################################################
# # 4. Compare AutoFeat vs. Semi-Automated Feature Engineering
# ###############################################################################

# print("\nLoading Results from Semi-Automated Feature Engineering...")
# semi_auto_results = {
#     "Accuracy": 0.967741935483871,  
#     "Precision": 0.969758064516129,
#     "Recall": 0.967741935483871,
#     "F1 Score": 0.9667959511872103
# }

# print("\n=== Semi-Automated Feature Engineering Performance ===")
# print(semi_auto_results)

# # Perform Wilcoxon Signed-Rank Test for Statistical Significance
# auto_feat_scores = np.array(list(auto_feat_results.values()))
# semi_auto_scores = np.array(list(semi_auto_results.values()))

# stat, pval = wilcoxon(auto_feat_scores, semi_auto_scores)
# print(f"\nWilcoxon Signed-Rank Test: statistic={stat:.4f}, p-value={pval:.4f}")

# # Conclusion
# if pval < 0.05:
#     print("Statistically significant difference found between AutoFeat and Semi-Automated approach.")
# else:
#     print("No significant difference found between AutoFeat and Semi-Automated approach.")


2025-03-06 14:45:03,994 INFO: [AutoFeat] The 2 step feature engineering process could generate up to 13041 features.
2025-03-06 14:45:03,995 INFO: [AutoFeat] With 121 data points this new feature matrix would use about 0.01 gb of space.
2025-03-06 14:45:03,997 INFO: [feateng] Step 1: transformation of original features



Applying AutoFeat on Cancer Dataset...
[feateng]               0/             23 features transformed

2025-03-06 14:45:06,644 INFO: [feateng] Generated 131 transformed features from 23 original features - done.
2025-03-06 14:45:06,647 INFO: [feateng] Step 2: first combination of features


[feateng]           11100/          11781 feature tuples combined

2025-03-06 14:45:10,513 INFO: [feateng] Generated 11709 feature combinations from 11781 original feature tuples - done.
2025-03-06 14:45:10,519 INFO: [feateng] Generated altogether 11841 new features in 2 steps


[feateng]           11700/          11781 feature tuples combined

2025-03-06 14:45:10,520 INFO: [feateng] Removing correlated features, as well as additions at the highest level
2025-03-06 14:45:10,668 INFO: [feateng] Generated a total of 8823 additional features


[featsel] Scaling data...

2025-03-06 14:45:11,996 INFO: [featsel] Feature selection run 1/5


done.


2025-03-06 14:49:26,706 INFO: [featsel] Feature selection run 2/5


KeyboardInterrupt: 