  <h3 style="color: teal; background-color: white; padding: 10px; border-radius: 5px; text-align:center">
  5: Evaluation and Interpretation
</h3>

The goal of this section is to evaluate the trained models using appropriate metrics, compare their performance, and interpret the results in a business-relevant and critical manner.

In [53]:
import pickle
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score,
    confusion_matrix
)

In [54]:
# Load saved artifacts
with open("../outputs/models.pkl", "rb") as f:
    models = pickle.load(f)

with open("../outputs/X_test.pkl", "rb") as f:
    X_test = pickle.load(f)

with open("../outputs/y_test.pkl", "rb") as f:
    y_test = pickle.load(f)


<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  5.1 Model Performance Metrics
</h4>

In [55]:
# Prepare metrics
metrics_list = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    metrics_list.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1-score": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_proba)
    })

metrics_df = pd.DataFrame(metrics_list).sort_values("ROC-AUC", ascending=False)

print("Model Performance Metrics:")
print(metrics_df)

Model Performance Metrics:
                    Model  Accuracy  Precision    Recall  F1-score   ROC-AUC
3  Random Forest Weighted  0.857490   0.413986  0.637931  0.502120  0.812733
1           Decision Tree  0.899369   0.621622  0.272629  0.379026  0.810393
2           Random Forest  0.901554   0.720755  0.205819  0.320201  0.810029
0     Logistic Regression  0.901554   0.695652  0.224138  0.339038  0.800654


Interpretation:
- Random Forest (Weighted) achieves the highest ROC-AUC (0.813), indicating the best ability to rank clients by likelihood of subscription. The margin is small, however: Decision Tree and Random Forest follow at 0.810, and Logistic Regression at 0.801.
- The other three models report higher accuracy (≈0.90 vs. ≈0.86) despite their lower ROC-AUC. This is an effect of class imbalance: because non-subscribers dominate, a model that rarely predicts "subscriber" is correct most of the time. Accuracy alone is therefore misleading in this context.
- Precision vs Recall Tradeoff:
  - Logistic Regression: High precision (≈0.696), low recall (≈0.224) → predicts subscribers cautiously, resulting in fewer false positives, but misses many actual subscribers.
  - Random Forest (Weighted): Lower precision (≈0.414), higher recall (≈0.638) → better at capturing actual subscribers, but produces more false positives. This makes it more suitable when missing subscribers is costly.
  - Random Forest (Normal): High precision (≈0.721), low recall (≈0.206) → very cautious in predicting subscribers, similar to Logistic Regression but slightly more extreme.
  - Decision Tree: Moderate precision (≈0.622), low recall (≈0.273) → strikes a middle ground but is less effective at identifying subscribers than weighted Random Forest.
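
To make the accuracy caveat concrete, the sketch below scores a trivial rule that never predicts a subscriber. The labels are hypothetical (1,000 clients with an assumed 11% subscription rate, chosen for illustration and not taken from the dataset), so only the pattern matters, not the exact numbers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 110 subscribers out of 1,000 clients (assumed rate)
y_true = np.array([1] * 110 + [0] * 890)

# A "model" that never predicts a subscriber
y_never = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_never)
baseline_recall = recall_score(y_true, y_never, zero_division=0)

print(f"Accuracy: {baseline_accuracy:.3f}, Recall: {baseline_recall:.3f}")
```

Under these assumptions the rule reaches 0.890 accuracy while finding no subscribers at all, which is why the accuracy values near 0.90 in the table say little on their own.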

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  5.2 Visual Comparison (ROC-AUC)
</h4>

In [None]:
# Visualization of ROC-AUC comparison
fig_roc = px.bar(
    metrics_df,
    x="Model",
    y="ROC-AUC",
    text="ROC-AUC",
    color="ROC-AUC",
    color_continuous_scale="Viridis",
    title="Model Comparison: ROC-AUC",
    template="plotly_white"
)

fig_roc.update_traces(
    texttemplate='%{text:.3f}',
    textposition='outside'
)
fig_roc.update_layout(
    yaxis=dict(range=[metrics_df["ROC-AUC"].min() - 0.02, metrics_df["ROC-AUC"].max() + 0.02]),
    xaxis_title="Model",
    yaxis_title="ROC-AUC",
    uniformtext_minsize=8,
    uniformtext_mode='hide'
)
fig_roc.show()


Interpretation:

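ROC-AUC ranks the weighted Random Forest first at separating clients likely to subscribe from those who are not, although its lead over the Decision Tree and the unweighted Random Forest is narrow (0.813 vs. 0.810). Because the y-axis of the chart is truncated around the observed values, the bar heights exaggerate these differences. The weighted model's lower raw accuracy comes from the additional false positives it accepts in exchange for higher recall, not from weaker ranking ability.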
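
As a reminder of what ROC-AUC measures, it equals the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber (ties count as half). The sketch below checks this on a small set of hypothetical labels and scores, which are invented for illustration and unrelated to the models above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores (illustration only)
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 0])
scores = np.array([0.10, 0.35, 0.40, 0.20, 0.80, 0.55, 0.30, 0.05])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every (positive, negative) pair of scores
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sk_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sk_auc)
```

Both computations give 0.8 here: 12 of the 15 positive/negative pairs are ordered correctly. This is why ROC-AUC is read as a ranking metric and is independent of any particular decision threshold.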

In [57]:
fig_all = px.bar(
    metrics_df.melt(id_vars='Model', value_vars=['Accuracy', 'Precision', 'Recall', 'F1-score', 'ROC-AUC']),
    x='Model',
    y='value',
    color='variable',
    barmode='group',
    title='Model Performance Comparison Across Metrics',
    text_auto='.3f',
    template="plotly_white"
)
fig_all.update_layout(
    yaxis_title='Metric Value',
    legend_title_text='Metric'
)
fig_all.show()

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  5.3 Detailed Interpretation
</h4>

1. Random Forest (Weighted)
- ROC-AUC: 0.813 → best discriminative ability for ranking clients by likelihood of subscription
- Recall: 0.638 → identifies the majority of actual subscribers
- Precision: 0.414 → about 59% of predicted subscribers are false positives
- Use Case: Ideal when maximizing identification of potential subscribers is more important than avoiding some unnecessary calls.

2. Random Forest (Normal)
- ROC-AUC: 0.810 → slightly lower than weighted RF
- Recall: 0.206 → misses most actual subscribers
- Precision: 0.721 → relatively few false positives (about 28% of predicted subscribers)
- Use Case: Conservative approach, prioritizing precision over recall; suitable when false positives are costly.

3. Decision Tree
- ROC-AUC: 0.810 → on par with normal RF, slightly lower than weighted RF
- Recall: 0.273 → misses most actual subscribers
- Precision: 0.622 → cautious predictions
- Use Case: Simpler and interpretable, but less effective at capturing subscribers; good for quick insights or business rules.

4. Logistic Regression
- ROC-AUC: 0.801 → baseline model
- Recall: 0.224 → captures very few actual subscribers
- Precision: 0.696 → predictions are mostly correct
- Use Case: Very interpretable; useful if minimizing false positives is critical.

Key Insight:
- Overall accuracy (>0.85) can be misleading due to class imbalance (most clients do not subscribe).
- ROC-AUC, recall, and F1-score are more informative metrics for this marketing classification task.
- Weighted Random Forest achieves the highest F1-score (0.502), i.e. the best balance between capturing subscribers (recall) and avoiding unnecessary calls (precision) among the four models.
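
The F1-score is the harmonic mean of precision and recall, so it penalizes models that do well on only one of the two. As a quick check, the sketch below recomputes F1 from the precision and recall values printed in the metrics table in 5.1; the helper name `f1_from` is introduced here for illustration.

```python
# Precision and recall as printed in the metrics table (section 5.1)
reported = {
    "Random Forest Weighted": (0.413986, 0.637931),
    "Decision Tree": (0.621622, 0.272629),
    "Random Forest": (0.720755, 0.205819),
    "Logistic Regression": (0.695652, 0.224138),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.3f}")
```

The recomputed values agree with the F1-score column of the table, and they show why the three high-precision models score low: their recall of roughly 0.21 to 0.27 drags the harmonic mean down.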

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  5.4 Feature Importance (Random Forest)
</h4>

In [60]:
# Extract weighted Random Forest classifier from the loaded models
rf_weighted_model = models["Random Forest Weighted"].named_steps["classifier"]

# Get feature names from the preprocessing pipeline
preprocessor = models["Random Forest Weighted"].named_steps["preprocessing"]
feature_names = preprocessor.get_feature_names_out()

# Get feature importances
importances = rf_weighted_model.feature_importances_

# Create a DataFrame and sort by importance
importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values("Importance", ascending=False)

# Select top N features
top_n = 15
top_features = importance_df.head(top_n)

# Plot using Plotly
fig = go.Figure(
    go.Bar(
        x=top_features["Importance"],
        y=top_features["Feature"],
        orientation='h',
        marker_color='seagreen',
        text=top_features["Importance"].round(3),
        textposition='auto'
    )
)

fig.update_layout(
    title=f"Top {top_n} Feature Importances (Weighted Random Forest)",
    yaxis=dict(autorange="reversed"), # Reverse y-axis to have the most important on top
    xaxis_title="Importance",
    template="plotly_white"
)

fig.show()

Interpretation:

- Macroeconomic features dominate: euribor3m, nr.employed, and emp.var.rate rank highest in importance. This suggests that broader economic conditions are strongly associated with subscription likelihood, although feature importance reflects what the model relies on rather than a causal effect.

- Client history / campaign features are important: pdays_transformed, contacted_before, total_contacts, poutcome_success are meaningful predictors.

- Demographics like age are moderately predictive.

- Month (month_may) and contact method (contact_telephone) also influence probability, though less than macroeconomic and campaign-related features.

Implication: The model relies more on economic context and past contact behavior than on basic client demographics. This insight can guide marketing strategies: prioritize campaigns when economic indicators are favorable and target clients with prior engagement.
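
The `feature_importances_` attribute used above is impurity-based, and such scores can favor continuous or high-cardinality features. One common cross-check is permutation importance on held-out data. The sketch below is self-contained and runs on synthetic data from `make_classification`; the data, the generic feature names, and the use of `class_weight="balanced"` are assumptions for illustration, not the notebook's actual pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the campaign dataset)
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, stratify=y, random_state=42
)

# class_weight="balanced" is an assumed stand-in for the "weighted" model
rf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
).fit(X_train, y_train)

# Drop in ROC-AUC when each feature is shuffled on held-out data
perm = permutation_importance(
    rf, X_hold, y_hold, scoring="roc_auc", n_repeats=5, random_state=42
)

comparison_df = pd.DataFrame({
    "Feature": feature_names,
    "Impurity importance": rf.feature_importances_,
    "Permutation importance": perm.importances_mean,
}).sort_values("Permutation importance", ascending=False)

print(comparison_df)
```

In this notebook, the same call could presumably be applied to the full `models["Random Forest Weighted"]` pipeline with `X_test` and `y_test`, in which case the raw input columns are permuted rather than the encoded ones. If both rankings agree on the macroeconomic features, the interpretation above is on firmer ground.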

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  5.5 Confusion Matrix
</h4>

In [59]:

# Get predictions from the full weighted Random Forest pipeline
# (a distinct name avoids shadowing the bare classifier extracted in 5.4)
rf_weighted_pipeline = models["Random Forest Weighted"]
y_pred_weighted = rf_weighted_pipeline.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred_weighted)

# Labels
labels = ["No", "Yes"]

# Create heatmap with Plotly
fig = ff.create_annotated_heatmap(
    z=cm,
    x=labels,
    y=labels,
    colorscale='Blues',
    showscale=True,
    reversescale=False
)

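fig.update_layout(
    title="Confusion Matrix – Weighted Random Forest",
    xaxis_title="Predicted",
    # Reverse the y-axis so the first row ("No") is drawn on top,
    # matching the row order of sklearn's confusion_matrix output
    yaxis=dict(title="Actual", autorange="reversed"),
    template="plotly_white"
)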

fig.show()


Interpretation:
- True negatives dominate, as expected given the class imbalance.
- False negatives are costly: each one is a missed potential subscriber.
- Depending on business strategy, recall may be more important than precision.
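
Precision and recall can be read directly off the four cells of a confusion matrix. The sketch below uses a small set of hypothetical labels and predictions (invented for illustration) and checks the hand-computed values against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (0 = No, 1 = Yes)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted "Yes" that are correct
recall = tp / (tp + fn)     # share of actual "Yes" that are found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision:.3f}, Recall={recall:.3f}")

# Cross-check against the library functions
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```

Applied to `cm` from the cell above, the same unpacking gives the counts behind the weighted Random Forest's precision and recall.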

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  5.6 Business Insights & Recommendations
</h4>

1. Prioritize campaigns based on economic conditions:
  - Since euribor3m, nr.employed, and emp.var.rate are strong predictors, campaigns launched during favorable macroeconomic conditions may yield higher subscription rates.

2. Target clients with prior engagement:
  - Features like contacted_before, total_contacts, and poutcome_success show that prior interactions matter. Clients previously contacted or showing positive outcomes are more likely to subscribe.

3. Use channel & timing strategically:
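  - contact_telephone and month_may indicate that contact channel and campaign timing affect outcomes. Feature importance does not show the direction of the effect, so marketing operations should confirm which channels and periods perform better before optimizing around them.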

4. Demographic targeting is secondary:
  - age has some influence, but less than macroeconomic and campaign features. Demographics should be considered, but behavior and context are more critical.

5. Model choice & deployment:
  - The weighted Random Forest is the recommended model for production: it has the highest ROC-AUC (0.813), recall (0.638), and F1-score (0.502) of the four models, at the cost of lower precision (0.414).
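
All precision and recall figures in this section come from `predict`, which for a binary classifier effectively applies a 0.5 probability cut-off. If the business wants a different balance between missed subscribers and unnecessary calls, the cut-off applied to `predict_proba` can be adjusted. The sketch below illustrates the effect with hypothetical labels and probabilities; the thresholds 0.3, 0.5, and 0.7 are arbitrary examples, not tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted subscription probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.90])

rows = []
for threshold in (0.3, 0.5, 0.7):
    # Predict "subscriber" when the probability reaches the threshold
    y_pred = (y_proba >= threshold).astype(int)
    rows.append({
        "Threshold": threshold,
        "Clients called": int(y_pred.sum()),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
    })

threshold_df = pd.DataFrame(rows)
print(threshold_df)
```

In this toy example, lowering the threshold to 0.3 finds every subscriber but requires seven calls, while raising it to 0.7 makes only two calls and misses half of the subscribers. A production threshold would need to be chosen on validation data against the actual cost of a call and the value of a subscription.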
