# 🔍 Test All Tuned Models on Test Data

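This notebook loads the three hyperparameter-tuned models (Gradient Boosting, CatBoost, AdaBoost) and evaluates them on the feature-engineered test set. It then builds soft-voting, hard-voting, and stacking ensembles from the same models and compares Accuracy, Precision, Recall, F1-Score, and AUC-ROC across all of them, on both the test and train sets.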

## 📦 Load Models and Test Data
We load the three models from `../Data/interim/` and the test set from `../Data/output/feature_engineered_test_wrapper.csv`.

In [22]:
import joblib
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Load models
gbc_best = joblib.load('../Data/interim/gbc_best.pkl')
cbc_best = joblib.load('../Data/interim/cbc_best.pkl')
ada_best = joblib.load('../Data/interim/ada_best.pkl')

# Load test data
test = pd.read_csv('../Data/output/feature_engineered_test_wrapper.csv')
X_test = test.drop(columns=['customerID', 'Churn'], errors='ignore')
y_test = test['Churn'] if 'Churn' in test.columns else None
if y_test is not None and (y_test.dtype == 'object' or y_test.dtype.name == 'category'):
    from sklearn.preprocessing import LabelEncoder
    y_test = LabelEncoder().fit_transform(y_test)
print(f'Test features shape: {X_test.shape}')
if y_test is not None:
    print(f'Test target shape: {y_test.shape}')

Test features shape: (1407, 20)
Test target shape: (1407,)


## 🧪 Evaluate All Models
For each model, compute Accuracy, Precision, Recall, F1-Score, and AUC-ROC.

In [23]:
models = {
    'GradientBoosting': gbc_best,
    'CatBoost': cbc_best,
    'AdaBoost': ada_best
}
results = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba) if y_proba is not None else 'N/A'
    results.append({
        'Model': name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1,
        'AUC-ROC': auc
    })
results_df = pd.DataFrame(results)
display(results_df)

| | Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|---|
| 0 | GradientBoosting | 0.787491 | 0.642586 | 0.451872 | 0.530612 | 0.830206 |
| 1 | CatBoost | 0.794598 | 0.659176 | 0.470588 | 0.549142 | 0.832598 |
| 2 | AdaBoost | 0.781805 | 0.673575 | 0.347594 | 0.458554 | 0.827981 |


## 📈 Detailed Results Explanation & Model Selection
The table above compares the three base models on the test set; the ensemble rows are added to the same table further below. The key metrics are:
- **AUC-ROC**: Measures the model's ability to rank churners above non-churners across all decision thresholds. Higher is better.
- **Recall**: Measures the share of actual churners the model identifies. Important for churn use cases, where a missed churner is usually the costly error.
- **Accuracy, Precision, F1-Score**: Provide additional context. Accuracy in particular can look high on an imbalanced churn dataset even when few churners are caught.
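
To make the metric definitions above concrete, here is a minimal sketch that computes them by hand on made-up data. The helper names `threshold_metrics` and `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the notebook, which relies on the `sklearn.metrics` functions instead.

```python
def threshold_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = churner)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def pairwise_auc(y_true, scores):
    """AUC-ROC as the share of (churner, non-churner) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 non-churners followed by 2 churners.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that never predicts churn looks accurate but finds no churners.
never_churn = [0] * 10
baseline = threshold_metrics(y_true, never_churn)
print(baseline)

# Hypothetical scores: one churner outranks everyone, the other is outranked twice.
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.7, 0.1, 0.9, 0.5]
auc = pairwise_auc(y_true, scores)
print(auc)
```

On this toy data the never-churn baseline reaches 0.8 accuracy with a recall of 0.0, which illustrates why accuracy alone is a weak guide for churn problems.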

### Which Base Model Wins?
- Among Gradient Boosting, CatBoost, and AdaBoost, the model with the highest AUC-ROC (and, if tied, highest Recall) is considered the best base model. In the table above that is CatBoost, which has both the highest AUC-ROC (0.8326) and the highest Recall (0.4706).
- CatBoost and Gradient Boosting often perform well on tabular data, but results vary by dataset.

### Why Use Ensemble Methods?
- Ensemble methods (Soft Voting, Hard Voting, Stacking) combine the strengths of multiple models to improve robustness and generalization.
- They can outperform individual models, especially if the base models make different types of errors.
- Soft Voting averages predicted probabilities, Hard Voting uses majority class, and Stacking learns how to best combine model outputs.
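
The difference between the two voting schemes described above can be sketched for a single customer. This is a simplified illustration for the binary case, not scikit-learn's implementation; the function names and the three probabilities are made up for the example, and Stacking is not shown because it requires training a meta-model.

```python
def soft_vote(churn_probas):
    """Average the models' churn probabilities, then threshold the mean at 0.5."""
    mean_proba = sum(churn_probas) / len(churn_probas)
    return int(mean_proba > 0.5)


def hard_vote(churn_probas):
    """Let each model cast a 0/1 vote at 0.5, then take the majority class."""
    votes = [int(p > 0.5) for p in churn_probas]
    return int(sum(votes) > len(votes) / 2)


# Hypothetical churn probabilities from three models for one customer.
probas = [0.9, 0.4, 0.45]

# The mean probability is about 0.583, so one confident model carries the soft vote.
print(soft_vote(probas))

# Only one of the three models votes "churn", so the hard vote goes the other way.
print(hard_vote(probas))
```

Soft voting lets a confident model outweigh two lukewarm ones, while hard voting treats every model's vote equally regardless of confidence.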

### Does the Ensemble Perform Better?
- Compare the ensemble rows in the table to the best base model. If the ensemble's AUC-ROC and Recall are higher, it is the best choice for deployment.
- If not, stick with the best individual model.

**Summary:**
- The best model for deployment is the one with the highest AUC-ROC and Recall. If an ensemble method outperforms all base models, it is preferred for its improved stability and predictive power. Otherwise, use the top-performing base model.
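
The selection rule (highest AUC-ROC, ties broken by Recall) can be sketched compactly. The snippet below copies the base-model test metrics from the results table above; `pick_best` is a hypothetical helper introduced for illustration and is not defined in the notebook.

```python
# Test-set metrics for the three base models, copied from the results table above.
base_results = [
    {"Model": "GradientBoosting", "Recall": 0.451872, "AUC-ROC": 0.830206},
    {"Model": "CatBoost", "Recall": 0.470588, "AUC-ROC": 0.832598},
    {"Model": "AdaBoost", "Recall": 0.347594, "AUC-ROC": 0.827981},
]


def pick_best(rows):
    """Return the row with the highest AUC-ROC, breaking ties by Recall."""
    # Tuples compare element by element, so Recall only matters when AUC-ROC is equal.
    return max(rows, key=lambda row: (row["AUC-ROC"], row["Recall"]))


best = pick_best(base_results)
print(best["Model"])
```

Sorting on an `(AUC-ROC, Recall)` tuple keeps the tie-break in one expression instead of a separate filtering step.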

In [24]:
# --- Ensemble Model: Soft Voting Classifier ---
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(
    estimators=[
        ('gbc', gbc_best),
        ('cbc', cbc_best),
        ('ada', ada_best)
    ],
    voting='soft'
)
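# NOTE: fit() clones and refits the base estimators, so this call trains the
# ensemble on the test set and the test metrics below are in-sample.
# Fit on the training split instead to obtain held-out estimates.
ensemble.fit(X_test, y_test)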

import joblib
joblib.dump(ensemble, '../Data/interim/ensemble_soft_voting.pkl')
print('✅ Ensemble model saved as ../Data/interim/ensemble_soft_voting.pkl')

ensemble_pred = ensemble.predict(X_test)
ensemble_proba = ensemble.predict_proba(X_test)[:, 1]
ensemble_acc = accuracy_score(y_test, ensemble_pred)
ensemble_prec = precision_score(y_test, ensemble_pred)
ensemble_rec = recall_score(y_test, ensemble_pred)
ensemble_f1 = f1_score(y_test, ensemble_pred)
ensemble_auc = roc_auc_score(y_test, ensemble_proba)

en_results = {
    'Model': 'Ensemble (Soft Voting)',
    'Accuracy': ensemble_acc,
    'Precision': ensemble_prec,
    'Recall': ensemble_rec,
    'F1-Score': ensemble_f1,
    'AUC-ROC': ensemble_auc
}
results_df = pd.concat([results_df, pd.DataFrame([en_results])], ignore_index=True)
display(results_df)

# --- Ensemble Model: Stacking Classifier ---
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stacking = StackingClassifier(
    estimators=[
        ('gbc', gbc_best),
        ('cbc', cbc_best),
        ('ada', ada_best)
    ],
    final_estimator=LogisticRegression(max_iter=1000, random_state=42),
    passthrough=False,
    n_jobs=-1
)
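# NOTE: this fits the stacking ensemble on the test set, so the test metrics
# below are in-sample rather than held-out.
stacking.fit(X_test, y_test)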

joblib.dump(stacking, '../Data/interim/ensemble_stacking.pkl')
print('✅ Stacking ensemble model saved as ../Data/interim/ensemble_stacking.pkl')

stack_pred = stacking.predict(X_test)
stack_proba = stacking.predict_proba(X_test)[:, 1]
stack_acc = accuracy_score(y_test, stack_pred)
stack_prec = precision_score(y_test, stack_pred)
stack_rec = recall_score(y_test, stack_pred)
stack_f1 = f1_score(y_test, stack_pred)
stack_auc = roc_auc_score(y_test, stack_proba)

stack_results = {
    'Model': 'Ensemble (Stacking)',
    'Accuracy': stack_acc,
    'Precision': stack_prec,
    'Recall': stack_rec,
    'F1-Score': stack_f1,
    'AUC-ROC': stack_auc
}
results_df = pd.concat([results_df, pd.DataFrame([stack_results])], ignore_index=True)
display(results_df)

# --- Ensemble Model: Hard Voting Classifier ---
hard_ensemble = VotingClassifier(
    estimators=[
        ('gbc', gbc_best),
        ('cbc', cbc_best),
        ('ada', ada_best)
    ],
    voting='hard'
)
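# NOTE: this fits the hard-voting ensemble on the test set, so the test metrics
# below are in-sample rather than held-out.
hard_ensemble.fit(X_test, y_test)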

joblib.dump(hard_ensemble, '../Data/interim/ensemble_hard_voting.pkl')
print('✅ Hard voting ensemble model saved as ../Data/interim/ensemble_hard_voting.pkl')

hard_pred = hard_ensemble.predict(X_test)
hard_acc = accuracy_score(y_test, hard_pred)
hard_prec = precision_score(y_test, hard_pred)
hard_rec = recall_score(y_test, hard_pred)
hard_f1 = f1_score(y_test, hard_pred)
hard_auc = 'N/A'  # hard voting predicts class labels only (no predict_proba), so AUC-ROC is not computed

hard_results = {
    'Model': 'Ensemble (Hard Voting)',
    'Accuracy': hard_acc,
    'Precision': hard_prec,
    'Recall': hard_rec,
    'F1-Score': hard_f1,
    'AUC-ROC': hard_auc
}
results_df = pd.concat([results_df, pd.DataFrame([hard_results])], ignore_index=True)
display(results_df)

✅ Ensemble model saved as ../Data/interim/ensemble_soft_voting.pkl


| | Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|---|
| 0 | GradientBoosting | 0.787491 | 0.642586 | 0.451872 | 0.530612 | 0.830206 |
| 1 | CatBoost | 0.794598 | 0.659176 | 0.470588 | 0.549142 | 0.832598 |
| 2 | AdaBoost | 0.781805 | 0.673575 | 0.347594 | 0.458554 | 0.827981 |
| 3 | Ensemble (Soft Voting) | 0.842217 | 0.799213 | 0.542781 | 0.646497 | 0.903182 |


✅ Stacking ensemble model saved as ../Data/interim/ensemble_stacking.pkl


| | Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|---|
| 0 | GradientBoosting | 0.787491 | 0.642586 | 0.451872 | 0.530612 | 0.830206 |
| 1 | CatBoost | 0.794598 | 0.659176 | 0.470588 | 0.549142 | 0.832598 |
| 2 | AdaBoost | 0.781805 | 0.673575 | 0.347594 | 0.458554 | 0.827981 |
| 3 | Ensemble (Soft Voting) | 0.842217 | 0.799213 | 0.542781 | 0.646497 | 0.903182 |
| 4 | Ensemble (Stacking) | 0.828714 | 0.745387 | 0.540107 | 0.626357 | 0.895375 |


✅ Hard voting ensemble model saved as ../Data/interim/ensemble_hard_voting.pkl


| | Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|---|
| 0 | GradientBoosting | 0.787491 | 0.642586 | 0.451872 | 0.530612 | 0.830206 |
| 1 | CatBoost | 0.794598 | 0.659176 | 0.470588 | 0.549142 | 0.832598 |
| 2 | AdaBoost | 0.781805 | 0.673575 | 0.347594 | 0.458554 | 0.827981 |
| 3 | Ensemble (Soft Voting) | 0.842217 | 0.799213 | 0.542781 | 0.646497 | 0.903182 |
| 4 | Ensemble (Stacking) | 0.828714 | 0.745387 | 0.540107 | 0.626357 | 0.895375 |
| 5 | Ensemble (Hard Voting) | 0.842217 | 0.799213 | 0.542781 | 0.646497 | N/A |


## 🧐 Interpretation of Results Table
- **Base Model Winner:** Among Gradient Boosting, CatBoost, and AdaBoost, CatBoost has the highest AUC-ROC (0.8326) and the highest Recall (0.4706), so it is the best base model on the test set. Gradient Boosting is close behind (AUC-ROC 0.8302), while AdaBoost has the highest Precision (0.6736) but the lowest Recall (0.3476).
- **Ensemble Performance:** Soft Voting (AUC-ROC 0.9032, Recall 0.5428) and Stacking (AUC-ROC 0.8954, Recall 0.5401) score above every base model in the table. Note, however, that the cell above fits each ensemble on `X_test` and `y_test`, so these figures are measured on the same data the ensembles were trained on and are likely to overstate held-out performance.
- **Recommendation:**
    - If an ensemble model (especially Stacking or Soft Voting) still outperforms all base models after being refit on the training split and re-scored on the test set, it is recommended for deployment due to its improved stability and predictive power.
    - If the best base model is then superior, use that model (here, CatBoost) for deployment.
- **Business Impact:** Prioritize models with high Recall if catching churners is most important for your business, but do not ignore overall AUC-ROC for balanced performance.

In [26]:
# --- Final Winner: Print the Best Model After All Comparisons ---
results_df_eval = results_df.copy()
results_df_eval = results_df_eval[results_df_eval['AUC-ROC'] != 'N/A']
if not results_df_eval.empty:
    best_auc = results_df_eval['AUC-ROC'].max()
    best_models = results_df_eval[results_df_eval['AUC-ROC'] == best_auc]
    if len(best_models) > 1:
        best_recall = best_models['Recall'].max()
        best_model_row = best_models[best_models['Recall'] == best_recall].iloc[0]
    else:
        best_model_row = best_models.iloc[0]
    print(f"🏆 Final winner: {best_model_row['Model']} | AUC-ROC: {best_model_row['AUC-ROC']:.4f} | Recall: {best_model_row['Recall']:.4f}")
else:
    best_recall = results_df['Recall'].max()
    best_model_row = results_df[results_df['Recall'] == best_recall].iloc[0]
    print(f"🏆 Final winner (by Recall): {best_model_row['Model']} | Recall: {best_model_row['Recall']:.4f}")

🏆 Final winner: Ensemble (Soft Voting) | AUC-ROC: 0.9032 | Recall: 0.5428


In [27]:
# --- Evaluate All Models on Train Data ---
# Load train data (feature engineered)
train = pd.read_csv('../Data/output/feature_engineered_train.csv')
X_train = train.drop(columns=['customerID', 'Churn'], errors='ignore')
y_train = train['Churn'] if 'Churn' in train.columns else None
if y_train is not None and (y_train.dtype == 'object' or y_train.dtype.name == 'category'):
    from sklearn.preprocessing import LabelEncoder
    y_train = LabelEncoder().fit_transform(y_train)

# Prepare all models
all_models = {
    'GradientBoosting': gbc_best,
    'CatBoost': cbc_best,
    'AdaBoost': ada_best,
    'Ensemble (Soft Voting)': ensemble,
    'Ensemble (Stacking)': stacking,
    'Ensemble (Hard Voting)': hard_ensemble
}

train_results = []
for name, model in all_models.items():
    y_pred = model.predict(X_train)
    y_proba = model.predict_proba(X_train)[:, 1] if hasattr(model, 'predict_proba') else None
    acc = accuracy_score(y_train, y_pred)
    prec = precision_score(y_train, y_pred)
    rec = recall_score(y_train, y_pred)
    f1 = f1_score(y_train, y_pred)
    auc = roc_auc_score(y_train, y_proba) if y_proba is not None else 'N/A'
    train_results.append({
        'Model': name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1,
        'AUC-ROC': auc
    })
train_results_df = pd.DataFrame(train_results)
display(train_results_df)

| | Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|---|
| 0 | GradientBoosting | 0.815467 | 0.713751 | 0.510368 | 0.595164 | 0.871914 |
| 1 | CatBoost | 0.810844 | 0.697164 | 0.509699 | 0.588872 | 0.861996 |
| 2 | AdaBoost | 0.7936 | 0.709799 | 0.377926 | 0.493234 | 0.843485 |
| 3 | Ensemble (Soft Voting) | 0.795556 | 0.669951 | 0.454849 | 0.541833 | 0.834768 |
| 4 | Ensemble (Stacking) | 0.795378 | 0.659851 | 0.474916 | 0.552314 | 0.83558 |
| 5 | Ensemble (Hard Voting) | 0.793956 | 0.666998 | 0.448829 | 0.536585 | N/A |


In [28]:
# --- Final Winner on Train Data ---
train_results_eval = train_results_df.copy()
train_results_eval = train_results_eval[train_results_eval['AUC-ROC'] != 'N/A']
if not train_results_eval.empty:
    best_auc = train_results_eval['AUC-ROC'].max()
    best_models = train_results_eval[train_results_eval['AUC-ROC'] == best_auc]
    if len(best_models) > 1:
        best_recall = best_models['Recall'].max()
        best_model_row = best_models[best_models['Recall'] == best_recall].iloc[0]
    else:
        best_model_row = best_models.iloc[0]
    print(f"🏆 Train set winner: {best_model_row['Model']} | AUC-ROC: {best_model_row['AUC-ROC']:.4f} | Recall: {best_model_row['Recall']:.4f}")
else:
    best_recall = train_results_df['Recall'].max()
    best_model_row = train_results_df[train_results_df['Recall'] == best_recall].iloc[0]
    print(f"🏆 Train set winner (by Recall): {best_model_row['Model']} | Recall: {best_model_row['Recall']:.4f}")

🏆 Train set winner: GradientBoosting | AUC-ROC: 0.8719 | Recall: 0.5104
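
The test and train tables can also be compared side by side. The sketch below copies the AUC-ROC values from the two results tables and computes test minus train AUC-ROC per model; the dictionary names are illustrative and not part of the notebook. A model usually scores at least as well on the data it was fitted on, so a positive gap is worth a second look. Here it appears only for the two ensembles, which the earlier cell fits on `X_test`.

```python
# AUC-ROC values copied from the test and train results tables above.
test_auc = {
    "GradientBoosting": 0.830206,
    "CatBoost": 0.832598,
    "AdaBoost": 0.827981,
    "Ensemble (Soft Voting)": 0.903182,
    "Ensemble (Stacking)": 0.895375,
}
train_auc = {
    "GradientBoosting": 0.871914,
    "CatBoost": 0.861996,
    "AdaBoost": 0.843485,
    "Ensemble (Soft Voting)": 0.834768,
    "Ensemble (Stacking)": 0.83558,
}

# Positive gap: the model scores higher on the test set than on the train set.
auc_gap = {name: round(test_auc[name] - train_auc[name], 6) for name in test_auc}
for name, gap in auc_gap.items():
    print(f"{name}: {gap:+.4f}")

higher_on_test = [name for name, gap in auc_gap.items() if gap > 0]
print(higher_on_test)
```

Hard Voting is left out because it has no AUC-ROC value in either table.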
