In [1]:
# -*- coding: utf-8 -*-
"""Taiwan Credit Card Default – Rigorous Accuracy–Interpretability Study (Code B style)"""

# ============================================================
# 1. Install & Imports
# ============================================================
!pip install -q imbalanced-learn shap lightgbm xgboost seaborn scikit-learn pandas numpy matplotlib

import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns, warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb, lightgbm as lgb
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
import shap

plt.style.use('default'); sns.set_palette("husl"); np.random.seed(42)

# ============================================================
# 2. Load & Preprocess Taiwan Dataset
# ============================================================
import kagglehub, os
path = kagglehub.dataset_download("uciml/default-of-credit-card-clients-dataset")
csv_path = os.path.join(path, "UCI_Credit_Card.csv")
df = pd.read_csv(csv_path)

# Clean
df = df.drop(columns=['ID']) if 'ID' in df.columns else df
df.rename(columns={'default.payment.next.month': 'target'}, inplace=True)
df['target'] = df['target'].astype(int)

X = df.drop('target', axis=1)
y = df['target']

# Categorical & numeric
cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE']
num_cols = [c for c in X.columns if c not in cat_cols]

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])

# ============================================================
# 3. Models & Samplers
# ============================================================
models = {
    'RF': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1),
    'XGB': xgb.XGBClassifier(n_estimators=100, max_depth=6, random_state=42, eval_metric='logloss', n_jobs=-1),
    'LGB': lgb.LGBMClassifier(n_estimators=100, max_depth=6, random_state=42, verbose=-1, n_jobs=-1)
}

resamplers = {
    'None': None,
    'SMOTE': SMOTE(random_state=42),
    'Borderline': BorderlineSMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'SMOTEENN': SMOTEENN(random_state=42),
    'SMOTETomek': SMOTETomek(random_state=42),
    'Under': RandomUnderSampler(random_state=42),
    'CostSensitive': 'cost'
}

classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
class_weight_dict = {int(cls): float(w) for cls, w in zip(classes, weights)}

# ============================================================
# 4. Explanation Functions (same as Code B)
# ============================================================
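def get_shap_reliable(pipe, X_test):
    """Return per-instance attributions for the positive class, shape (n_samples, n_features)."""
    clf = pipe.named_steps['clf']
    X_proc = pipe.named_steps['prep'].transform(X_test)
    try:
        explainer = shap.TreeExplainer(clf)
        sv = explainer.shap_values(X_proc)
        if isinstance(sv, list):              # older SHAP: one array per class
            return sv[1]
        sv = np.asarray(sv)
        if sv.ndim == 3:                      # newer SHAP: (n_samples, n_features, n_classes)
            return sv[:, :, 1]
        return sv
    except Exception:
        # Fallback: permutation importance against the classifier's own predictions,
        # repeated for every instance. clf (not pipe) is scored because X_proc is already transformed.
        from sklearn.inspection import permutation_importance
        res = permutation_importance(clf, X_proc, clf.predict(X_proc), n_repeats=3, random_state=42)
        return np.tile(res.importances_mean, (X_proc.shape[0], 1))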

def compute_banzhaf(pipe, X_test, n_samples=5, max_instances=5):
    clf = pipe.named_steps['clf']
    X_proc = pipe.named_steps['prep'].transform(X_test)
    n_feat = X_proc.shape[1]
    n_inst = min(max_instances, X_proc.shape[0])
    mat = np.zeros((n_inst, n_feat))
    for i in range(n_inst):
        x = X_proc[i:i+1]
        for f in range(n_feat):
            contrib = []
            for _ in range(n_samples):
                coal = np.random.binomial(1, 0.5, n_feat)
                coal[f] = 0
                p0 = clf.predict_proba(x * coal.reshape(1, -1))[0, 1]
                coal[f] = 1
                p1 = clf.predict_proba(x * coal.reshape(1, -1))[0, 1]
                contrib.append(p1 - p0)
            mat[i, f] = np.mean(contrib)
    return mat

feature_groups = {
    'Demographic': ['SEX', 'EDUCATION', 'MARRIAGE', 'AGE'],
    'Limit': ['LIMIT_BAL'],
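    # 'PAY_' alone also matches PAY_AMT1..6, which belong to the Payment group
    'History': [c for c in X.columns if c.startswith('PAY_') and not c.startswith('PAY_AMT')],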
    'Bill': [c for c in X.columns if 'BILL_AMT' in c],
    'Payment': [c for c in X.columns if 'PAY_AMT' in c]
}

def compute_owen(pipe, X_test, feature_groups, n_samples=3, max_instances=5):
    clf = pipe.named_steps['clf']; prep = pipe.named_steps['prep']
    X_proc = prep.transform(X_test)
    fnames = prep.get_feature_names_out()
    group_idx = {g: [i for i, n in enumerate(fnames) if any(f in n for f in feats)]
                 for g, feats in feature_groups.items()}
    n_feat = X_proc.shape[1]
    n_inst = min(max_instances, X_proc.shape[0])
    mat = np.zeros((n_inst, n_feat))
    for i in range(n_inst):
        x = X_proc[i:i+1]
        for f in range(n_feat):
            contrib = []
            for _ in range(n_samples):
                gmask = {g: np.random.choice([0,1]) for g in group_idx}
                mask = np.zeros(n_feat)
                for g, idxs in group_idx.items():
                    if gmask[g]:
                        if f in idxs:
                            for idx in idxs: mask[idx] = np.random.choice([0,1])
                        else: mask[idxs] = 1
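                # Marginal contribution of f: the same coalition with f switched off vs. on
                mask0 = mask.copy(); mask0[f] = 0
                mask1 = mask.copy(); mask1[f] = 1
                p0 = clf.predict_proba(x * mask0)[0,1]
                p1 = clf.predict_proba(x * mask1)[0,1]
                contrib.append(p1 - p0)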
            mat[i, f] = np.mean(contrib)
    return mat

# ============================================================
# 5. Metrics (identical to Code B)
# ============================================================
def stability_cv(expl_list):
    if len(expl_list) < 2:
        return 1.0
    arr = np.stack([np.abs(e) for e in expl_list])           # shape: (n_folds, n_samples, n_features)
    mean = arr.mean(axis=0) + 1e-8
    std = arr.std(axis=0)
    cv_per_feature = std / mean
    return float(np.mean(cv_per_feature))                    # scalar

def jaccard_topk(expl_list, k=5):
    if len(expl_list) < 2:
        return 0.0
    sets = []
    for exp in expl_list:
        # Mean absolute SHAP/value per feature across instances
        imp = np.abs(exp).mean(axis=0).ravel()
        # Get indices of top-k features → convert to tuple (hashable!)
        topk_indices = tuple(np.argsort(imp)[-k:].tolist())
        sets.append(topk_indices)

    # Pairwise Jaccard
    sims = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            inter = len(set(sets[i]) & set(sets[j]))
            union = len(set(sets[i]) | set(sets[j]))
            sims.append(inter / union if union > 0 else 0.0)
    return float(np.mean(sims)) if sims else 0.0

def interpretability_score(cv, j, beta=0.5):
    return beta * (1 - cv) + (1 - beta) * j

def normalize(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def tradeoff_metric(auc_series, I_series, alpha=0.5):
    return alpha * normalize(auc_series) + (1-alpha) * normalize(I_series)

# ============================================================
# 6. 4-Fold CV Loop (the gold standard)
# ============================================================
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
records = []

for mname, model in models.items():
    for sname, sampler in resamplers.items():
        aucs = []
        shap_runs, banzhaf_runs, owen_runs = [], [], []

        for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), 1):
            X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
            y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

            steps = [('prep', preprocessor)]
            if sampler and sampler != 'cost':
                steps.append(('samp', sampler))
            steps.append(('clf', model))
            pipe = ImbPipeline(steps)

            # Cost-sensitive
            if sname == 'CostSensitive':
                if mname == 'RF':
                    pipe.named_steps['clf'].set_params(class_weight=class_weight_dict)
                elif mname == 'XGB':
                    ratio = class_weight_dict[1] / class_weight_dict[0]
                    pipe.named_steps['clf'].set_params(scale_pos_weight=ratio)
                elif mname == 'LGB':
                    pipe.named_steps['clf'].set_params(class_weight=class_weight_dict)

            pipe.fit(X_tr, y_tr)
            auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
            aucs.append(auc)
            print(f"Fold {fold} | {mname}+{sname:12} → AUC: {auc:.4f}")

            # Explanations on small sample
            X_sample = X_te.sample(n=min(50, len(X_te)), random_state=42)
            shap_runs.append(get_shap_reliable(pipe, X_sample))
            banzhaf_runs.append(compute_banzhaf(pipe, X_sample))
            owen_runs.append(compute_owen(pipe, X_sample, feature_groups))

        # Aggregate
        auc_mean = np.mean(aucs)
        for method, runs in zip(['SHAP','Banzhaf','Owen'], [shap_runs, banzhaf_runs, owen_runs]):
            cv_val = stability_cv(runs)
            jacc = jaccard_topk(runs)
            I = interpretability_score(cv_val, jacc)
            records.append({
                'Model': mname, 'Sampler': sname, 'Method': method,
                'AUC': auc_mean, 'CV': cv_val, 'Stability': 1-cv_val,
                'Jaccard': jacc, 'I': I
            })

# ============================================================
# 7. Results & Visualisations
# ============================================================
metrics = pd.DataFrame(records)
metrics['T(α=0.5)'] = tradeoff_metric(metrics['AUC'], metrics['I'])

print("\n=== TAIWAN DATASET – FINAL METRICS (4-fold CV) ===")
print(metrics.round(4).to_string(index=False))

# Average by method
print("\n=== Average by Explanation Method ===")
print(metrics.groupby('Method')[['AUC','Stability','Jaccard','I','T(α=0.5)']].mean().round(4))

# LaTeX tables (copy-paste into thesis)
for method in ['SHAP','Banzhaf','Owen']:
    subset = metrics[metrics['Method']==method]
    latex = subset[['Model','Sampler','AUC','Stability','Jaccard','I','T(α=0.5)']].round(4).to_latex(
        index=False, caption=f"Taiwan Dataset – {method} Results (4-fold CV)", label=f"tab:taiwan_{method.lower()}")
    print(f"\nLaTeX TABLE — {method}:\n{latex}")

# Summary table
summary = metrics.groupby('Method')[['AUC','Stability','Jaccard','I','T(α=0.5)']].mean().round(4)
print("\nSUMMARY LaTeX:\n", summary.to_latex(caption="Taiwan Dataset – Method Comparison", label="tab:taiwan_summary"))

Fold 1 | RF+None         → AUC: 0.7891
Fold 2 | RF+None         → AUC: 0.7756
Fold 3 | RF+None         → AUC: 0.7834
Fold 4 | RF+None         → AUC: 0.7724
Fold 1 | RF+SMOTE        → AUC: 0.7830
Fold 2 | RF+SMOTE        → AUC: 0.7665
Fold 3 | RF+SMOTE        → AUC: 0.7783
Fold 4 | RF+SMOTE        → AUC: 0.7707
Fold 1 | RF+Borderline   → AUC: 0.7805
Fold 2 | RF+Borderline   → AUC: 0.7636
Fold 3 | RF+Borderline   → AUC: 0.7758
Fold 4 | RF+Borderline   → AUC: 0.7689
Fold 1 | RF+ADASYN       → AUC: 0.7794
Fold 2 | RF+ADASYN       → AUC: 0.7630
Fold 3 | RF+ADASYN       → AUC: 0.7773
Fold 4 | RF+ADASYN       → AUC: 0.7676
Fold 1 | RF+SMOTEENN     → AUC: 0.7836
Fold 2 | RF+SMOTEENN     → AUC: 0.7662
Fold 3 | RF+SMOTEENN     → AUC: 0.7793
Fold 4 | RF+SMOTEENN     → AUC: 0.7708
Fold 1 | RF+SMOTETomek   → AUC: 0.7824
Fold 2 | RF+SMOTETomek   → AUC: 0.7668
Fold 3 | RF+SMOTETomek   → AUC: 0.7807
Fold 4 | RF+SMOTETomek   → AUC: 0.7715
Fold 1 | RF+Under        → AUC: 0.7876
Fold 2 | RF+Under        

In [2]:
# ============================================================
# 10. Statistical Validation Suite — Taiwan Dataset
# ============================================================

from scipy import stats
import numpy as np
import pandas as pd
from statsmodels.stats.power import TTestIndPower

print("\n================ STATISTICAL VALIDATION SUITE (TAIWAN) ================\n")

# ------------------------------------------------------------
# 10.1 Friedman Test (Overall Model Differences)
# ------------------------------------------------------------
print("\nFriedman's Test for Model Comparisons:")
for metric in ['AUC', 'I', 'T(α=0.5)']:
    pivoted = metrics.pivot_table(values=metric, index=['Sampler','Method'], columns='Model')

    if len(pivoted) < 3 or pivoted.shape[1] != 3 or pivoted.isnull().any().any():
        print(f"  Skipping {metric}: insufficient data")
        continue

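    # Distinct names so the xgb / lgb module aliases imported in the first cell are not overwritten
    rf_vals, xgb_vals, lgb_vals = pivoted['RF'].values, pivoted['XGB'].values, pivoted['LGB'].values
    stat, p = stats.friedmanchisquare(rf_vals, xgb_vals, lgb_vals)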

    k = pivoted.shape[1]
    n = pivoted.shape[0]
    kendall_w = stat / (n * (k - 1))

    print(f"  {metric}: stat={stat:.2f}, p={p:.4f} ({'significant' if p<0.05 else 'not significant'})")
    print(f"    Kendall's W: {kendall_w:.4f}")

# ------------------------------------------------------------
# 10.2 Nemenyi Post-Hoc Test
# ------------------------------------------------------------
def nemenyi_posthoc(data, model_names, alpha=0.05):
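    # rankdata ranks ascending: rank 1 is the lowest value in a row, so a higher mean rank means a higher metric
    ranks = stats.rankdata(data, axis=1)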
    mean_ranks = np.mean(ranks, axis=0)
    n, k = data.shape
    q_alpha = 2.343  # for k=3 at alpha=0.05
    cd = q_alpha * np.sqrt(k*(k+1)/(6*n))

    print("Mean Ranks:", dict(zip(model_names, mean_ranks)))
    print(f"Critical Difference (CD): {cd:.4f}")

    for i in range(k):
        for j in range(i+1, k):
            diff = abs(mean_ranks[i] - mean_ranks[j])
            sig = "SIGNIFICANT" if diff > cd else "not significant"
            print(f"  {model_names[i]} vs {model_names[j]}: |rank diff|={diff:.4f} → {sig}")

print("\n================ Nemenyi Post-Hoc Tests ================")
for metric in ['AUC', 'I', 'T(α=0.5)']:
    print(f"\n=== Nemenyi Test for {metric} ===")
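    pivoted = metrics.pivot_table(values=metric, index=['Sampler','Method'], columns='Model')
    # pivot_table sorts the columns alphabetically (LGB, RF, XGB), so take the labels from the pivot itself
    nemenyi_posthoc(pivoted.values, list(pivoted.columns))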

# ------------------------------------------------------------
# 10.3 Wilcoxon Signed-Rank Test
# ------------------------------------------------------------
print("\n================ Wilcoxon Signed-Rank Tests ================")
pairs = [('RF','XGB'), ('RF','LGB'), ('XGB','LGB')]
for metric in ['AUC','I','T(α=0.5)']:
    print(f"\nWilcoxon Test for {metric}:")
    for m1, m2 in pairs:
        df1 = metrics[metrics['Model']==m1][metric].values
        df2 = metrics[metrics['Model']==m2][metric].values
        stat, p = stats.wilcoxon(df1, df2)
        print(f"  {m1} vs {m2}: stat={stat:.3f}, p={p:.4f}")

# ------------------------------------------------------------
# 10.4 Cliff's Delta (Effect Size)
# ------------------------------------------------------------
def cliffs_delta(x, y):
    comparisons = [1 if xi>yj else -1 if xi<yj else 0 for xi in x for yj in y]
    return sum(comparisons) / len(comparisons)

def interpret_delta(ad):
    ad = abs(ad)
    if ad > 0.474: return "large"
    elif ad > 0.33: return "medium"
    elif ad > 0.147: return "small"
    else: return "negligible"

print("\n================ Cliff's Delta Effect Sizes ================")
for metric in ['AUC','I','T(α=0.5)']:
    print(f"\nEffect sizes for {metric}:")
    for m1, m2 in pairs:
        # a / b instead of x / y so the target Series y from the first cell is not overwritten
        a = metrics[metrics['Model']==m1][metric].values
        b = metrics[metrics['Model']==m2][metric].values
        delta = cliffs_delta(a, b)
        print(f"  {m1} vs {m2}: delta={delta:.4f} ({interpret_delta(delta)})")

# ------------------------------------------------------------
# 10.5 Bootstrap Confidence Intervals
# ------------------------------------------------------------
def mean_ci(data, confidence=0.95, n_boot=1000):
    if len(data) < 2:
        return np.nan, np.nan
    res = stats.bootstrap((data,), np.mean, confidence_level=confidence,
                          n_resamples=n_boot, random_state=42)
    return res.confidence_interval.low, res.confidence_interval.high

print("\n================ Bootstrap 95% Confidence Intervals ================")
for metric in ['CV','Jaccard','I','AUC','T(α=0.5)']:
    print(f"\nBootstrap CI for {metric}:")
    for method in ['SHAP','Banzhaf','Owen']:
        data = metrics[metrics['Method']==method][metric].values
        low, high = mean_ci(data)
        print(f"  {method}: mean={np.mean(data):.4f} [{low:.4f}, {high:.4f}]")
    overall = metrics[metric].values
    low, high = mean_ci(overall)
    print(f"  Overall: mean={np.mean(overall):.4f} [{low:.4f}, {high:.4f}]")

# ------------------------------------------------------------
# 10.6 Shapiro-Wilk Normality Test
# ------------------------------------------------------------
print("\n================ Shapiro-Wilk Normality Tests ================")
for metric in ['AUC','I','T(α=0.5)']:
    print(f"\nNormality for {metric}:")
    for model in ['RF','XGB','LGB']:
        data = metrics[metrics['Model']==model][metric].values
        stat, p = stats.shapiro(data)
        print(f"  {model}: stat={stat:.4f}, p={p:.4f} ({'normal' if p>0.05 else 'not normal'})")

# ------------------------------------------------------------
# 10.7 Levene's Test (Equal Variances)
# ------------------------------------------------------------
print("\n================ Levene's Test for Equal Variances ================")
for metric in ['AUC','I','T(α=0.5)']:
    groups = [metrics[metrics['Model']==m][metric].values for m in ['RF','XGB','LGB']]
    stat, p = stats.levene(*groups)
    print(f"  {metric}: stat={stat:.2f}, p={p:.4f} ({'equal variances' if p>0.05 else 'unequal variances'})")

# ------------------------------------------------------------
# 10.8 Spearman Correlation (AUC vs I)
# ------------------------------------------------------------
print("\n================ Spearman Correlation ================")
rho, p = stats.spearmanr(metrics['AUC'], metrics['I'])
print(f"Overall AUC vs I: rho={rho:.4f}, p={p:.4f}")

for method in ['SHAP','Banzhaf','Owen']:
    sub = metrics[metrics['Method']==method]
    rho, p = stats.spearmanr(sub['AUC'], sub['I'])
    print(f"  {method}: rho={rho:.4f}, p={p:.4f}")

# ------------------------------------------------------------
# 10.9 Power Analysis
# ------------------------------------------------------------
print("\n================ Power Analysis ================")

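# TTestIndPower assumes two independent groups; the model comparisons above are paired,
# so the required n printed below is only a rough guide.
power_analysis = TTestIndPower()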
alpha = 0.05
power = 0.80

for effect_size in [0.5, 0.8]:
    required_n = power_analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
    print(f"\nEffect size d={effect_size}: required n ≈ {required_n:.1f}")
    print(f"Your sample size per model: {metrics['Model'].value_counts().iloc[0]}")





Friedman's Test for Model Comparisons:
  AUC: stat=42.75, p=0.0000 (significant)
    Kendall's W: 0.8906
  I: stat=4.33, p=0.1146 (not significant)
    Kendall's W: 0.0903
  T(α=0.5): stat=46.08, p=0.0000 (significant)
    Kendall's W: 0.9601


=== Nemenyi Test for AUC ===
Mean Ranks: {'RF': np.float64(2.125), 'XGB': np.float64(2.875), 'LGB': np.float64(1.0)}
Critical Difference (CD): 0.6764
  RF vs XGB: |rank diff|=0.7500 → SIGNIFICANT
  RF vs LGB: |rank diff|=1.1250 → SIGNIFICANT
  XGB vs LGB: |rank diff|=1.8750 → SIGNIFICANT

=== Nemenyi Test for I ===
Mean Ranks: {'RF': np.float64(1.9166666666666667), 'XGB': np.float64(2.3333333333333335), 'LGB': np.float64(1.75)}
Critical Difference (CD): 0.6764
  RF vs XGB: |rank diff|=0.4167 → not significant
  RF vs LGB: |rank diff|=0.1667 → not significant
  XGB vs LGB: |rank diff|=0.5833 → not significant

=== Nemenyi Test for T(α=0.5) ===
Mean Ranks: {'RF': np.float64(2.0416666666666665), 'XGB': np.float64(2.9583333333333335), 'LGB': np.f

1. Model Performance (AUC & T(α=0.5))
Across all sampling methods and explanation methods:

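Best → Worst
Random Forest (RF): best overall performance

LightGBM (LGB): second

XGBoost (XGB): worst

The differences are significant in the Friedman, Nemenyi, and Wilcoxon tests, and the Cliff’s delta effect sizes are medium to large. The order follows from the Nemenyi mean ranks, where a higher rank means a higher metric value and the pivot columns are ordered LGB, RF, XGB.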
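
T(α=0.5) is built from min-max normalised AUC and I, so its values are relative to the configurations included in the run. The sketch below repeats the two helper functions from the notebook on three made-up configurations; the AUC and I numbers are illustrative assumptions, not results from this study.

```python
import numpy as np

def normalize(s):
    """Min-max scale to [0, 1]; the small constant avoids division by zero."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def tradeoff_metric(auc, interp, alpha=0.5):
    """Weighted sum of normalised accuracy and normalised interpretability."""
    return alpha * normalize(auc) + (1 - alpha) * normalize(interp)

# Hypothetical values for three configurations
auc = np.array([0.760, 0.770, 0.780])
interp = np.array([0.60, 0.40, 0.50])

t = tradeoff_metric(auc, interp)
print(t.round(3))
```

Because both inputs are rescaled against the observed minimum and maximum, adding or removing a configuration can change every T value.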

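2. Interpretability (I)
The Friedman test finds no significant differences in I between RF, XGB, and LGB (p = 0.1146). Only the RF vs XGB Wilcoxon comparison reaches significance.

Effect sizes are negligible or small.

Interpretability is practically the same across models.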

3. Explanation Methods
SHAP
Highest interpretability (I)

Highest Jaccard

Lowest stability (more variation)

Banzhaf
Middle ground

Moderate interpretability and stability

Owen
Most stable

Lowest interpretability

All three explanation methods produce the same AUC, because AUC depends on the model, not the explainer.
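
A method can score highest on I while being the least stable because I = β(1 − CV) + (1 − β)·Jaccard mixes two different properties. The sketch below is one way this can happen, using toy attribution arrays that are purely hypothetical: method A changes magnitude between folds but keeps the same top features, method B barely changes magnitude but its top features flip.

```python
import numpy as np

def stability_cv(expl_list):
    """Mean coefficient of variation of |attribution| across folds."""
    arr = np.stack([np.abs(e) for e in expl_list])
    return float(np.mean(arr.std(axis=0) / (arr.mean(axis=0) + 1e-8)))

def jaccard_topk(expl_list, k=2):
    """Mean pairwise Jaccard similarity of the top-k feature sets."""
    tops = [set(np.argsort(np.abs(e).mean(axis=0))[-k:]) for e in expl_list]
    sims = [len(a & b) / len(a | b)
            for i, a in enumerate(tops) for b in tops[i + 1:]]
    return float(np.mean(sims))

def interpretability_score(cv, j, beta=0.5):
    return beta * (1 - cv) + (1 - beta) * j

# Hypothetical attributions: two folds, one instance, four features
method_a = [np.array([[4.0, 3.0, 0.5, 0.5]]), np.array([[2.0, 1.5, 0.25, 0.25]])]
method_b = [np.array([[1.00, 1.02, 1.04, 1.06]]), np.array([[1.06, 1.04, 1.02, 1.00]])]

cv_a, jac_a = stability_cv(method_a), jaccard_topk(method_a)
cv_b, jac_b = stability_cv(method_b), jaccard_topk(method_b)
i_a = interpretability_score(cv_a, jac_a)
i_b = interpretability_score(cv_b, jac_b)
print(f"A: CV={cv_a:.3f} Jaccard={jac_a:.1f} I={i_a:.3f}")
print(f"B: CV={cv_b:.3f} Jaccard={jac_b:.1f} I={i_b:.3f}")
```

Here A is less stable than B (higher CV) yet ends up with the higher I, because its top-k agreement dominates the score.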

4. Sampling Methods
Across RF, XGB, and LGB:

None, Under-sampling, and CostSensitive tend to give the best AUC.

SMOTE, Borderline, ADASYN slightly reduce AUC for all models.

SMOTEENN / SMOTETomek are mixed but never outperform the best methods.

Conclusion:  
Oversampling does not improve AUC on this dataset.

1. Friedman Tests (Overall Model Differences)
AUC: Significant differences between models (p < 0.0001).

T(α=0.5): Significant differences (p < 0.0001).

I (Interpretability Index): No significant differences (p = 0.1146).

Interpretation:  
Models differ strongly in predictive performance (AUC, T), but not in interpretability (I).
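
Kendall's W rescales the Friedman statistic to the range 0 to 1 via W = χ² / (n(k − 1)), where n is the number of blocks and k the number of models. A minimal sketch on made-up scores, assuming SciPy is available as in the notebook; the helper name `kendalls_w` is introduced here for illustration only.

```python
import numpy as np
from scipy import stats

def kendalls_w(*groups):
    """Friedman chi-square, its p-value, and Kendall's W for k related samples."""
    stat, p = stats.friedmanchisquare(*groups)
    n, k = len(groups[0]), len(groups)
    return stat, p, stat / (n * (k - 1))

# Hypothetical scores: model c always beats b, which always beats a
a = np.array([0.70, 0.71, 0.72, 0.73, 0.74])
b = a + 0.01
c = a + 0.02

stat, p, w = kendalls_w(a, b, c)
print(f"chi2={stat:.2f}, p={p:.4f}, W={w:.2f}")
```

With perfectly consistent rankings W reaches 1. For the AUC row reported above, 42.75 / (24 × 2) = 0.8906, which matches the printed value.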

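2. Nemenyi Post‑Hoc Tests (Pairwise Model Ranking)
Reading the ranks: rankdata assigns rank 1 to the lowest value in each row, so a higher mean rank means a higher metric. pivot_table also sorts the model columns alphabetically (LGB, RF, XGB), whereas the output shown above was produced with the labels supplied in the order RF, XGB, LGB. The printed AUC mean ranks 2.125, 2.875, and 1.0 therefore belong to LGB, RF, and XGB respectively.

AUC Ranking:
RF (best, mean rank 2.875)

LGB (mean rank 2.125)

XGB (worst, mean rank 1.0)

All pairwise differences exceed the critical difference (CD = 0.6764) and are significant.

I (Interpretability):
No significant differences between any models.

T(α=0.5):
Same pattern as AUC:

RF > LGB > XGB, all significant.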
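
Interpreting mean ranks depends on two library behaviours: pandas orders pivot columns alphabetically, and ranks are assigned in ascending order. The sketch below checks both on a small made-up table (all AUC values are hypothetical) and computes the critical difference with the same q = 2.343 used in the notebook for k = 3.

```python
import numpy as np
import pandas as pd

# Hypothetical AUCs: RF is highest and XGB lowest under every sampler
toy = pd.DataFrame({
    "Model":   ["RF", "XGB", "LGB"] * 3,
    "Sampler": ["None"] * 3 + ["SMOTE"] * 3 + ["Under"] * 3,
    "AUC":     [0.78, 0.76, 0.77, 0.77, 0.75, 0.76, 0.79, 0.76, 0.78],
})

pivoted = toy.pivot_table(values="AUC", index="Sampler", columns="Model")
names = list(pivoted.columns)                    # alphabetical, not insertion order
mean_ranks = pivoted.rank(axis=1).mean(axis=0)   # ascending: rank 1 = lowest AUC

n, k = pivoted.shape
cd = 2.343 * np.sqrt(k * (k + 1) / (6 * n))      # Nemenyi critical difference

print(names)
print(mean_ranks.to_dict())
print(f"CD = {cd:.4f}")
```

Taking the labels from `pivoted.columns` keeps each mean rank attached to the right model, and the model with the highest mean rank is the one with the highest metric.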

3. Wilcoxon Signed‑Rank Tests (Pairwise Performance)
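AUC:
All comparisons significant (p < 0.0001).
→ All three models differ pairwise on AUC. The direction (RF > LGB > XGB) comes from the mean ranks, since the two-sided test only reports that a difference exists.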

I:
RF vs XGB: small but significant difference.

RF vs LGB: not significant.

XGB vs LGB: not significant.
→ Interpretability is similar across models.

T(α=0.5):
All comparisons significant.
→ Consistent with the rank order RF > LGB > XGB.
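
The signed-rank test works on per-configuration differences, so the two arrays must be aligned row by row (same sampler and explanation method in the same position). A minimal sketch on synthetic scores, assuming SciPy is available; the numbers are generated for illustration and are not the study's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic paired scores: model_a is about 0.01 above model_b in every configuration
model_b = rng.normal(0.77, 0.01, size=12)
model_a = model_b + 0.01 + rng.normal(0, 0.001, size=12)

stat, p = stats.wilcoxon(model_a, model_b)
print(f"aligned:  stat={stat:.1f}, p={p:.5f}")

# Breaking the row alignment changes the differences the test sees
shuffled = rng.permutation(model_a)
stat_s, p_s = stats.wilcoxon(shuffled, model_b)
print(f"shuffled: stat={stat_s:.1f}, p={p_s:.5f}")
```

In the notebook the alignment holds only because every model is evaluated over the samplers and methods in the same order; pairing through a pivot table would make that explicit.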

4. Cliff’s Delta (Effect Sizes)
AUC:
RF vs XGB: large

RF vs LGB: medium

XGB vs LGB: large

→ Strong practical differences in performance.

I:
All effect sizes are small or negligible.
→ Interpretability is similar.

T(α=0.5):
Large effects again, consistent with the rank order RF > LGB > XGB.
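
Cliff's delta is the share of pairs where the first group is larger minus the share where it is smaller. The sketch below is a vectorised NumPy equivalent of the list-comprehension version in the notebook, with the same interpretation thresholds; the input values are made up.

```python
import numpy as np

def cliffs_delta(x, y):
    """P(x > y) - P(x < y) over all cross pairs, in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sign(x[:, None] - y[None, :]).mean())

def interpret_delta(d):
    d = abs(d)
    if d > 0.474:
        return "large"
    if d > 0.33:
        return "medium"
    if d > 0.147:
        return "small"
    return "negligible"

d_sep = cliffs_delta([1, 2, 3], [4, 5, 6])      # complete separation
d_same = cliffs_delta([1, 2, 3], [1, 2, 3])     # identical groups
d_shift = cliffs_delta([2, 3, 4], [1, 2, 3])    # partial overlap
print(d_sep, d_same, round(d_shift, 4), interpret_delta(d_shift))
```

The sign carries the direction: a positive delta means the first group tends to have the larger values.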

5. Bootstrap Confidence Intervals
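AUC:
All three explanation methods show the same mean AUC (0.7672) because each model and sampler combination contributes one AUC value that is repeated in its SHAP, Banzhaf, and Owen rows.
→ The identical means are a bookkeeping consequence and say nothing about the stability of the AUC estimates. The same repetition means the AUC tests rest on 8 distinct values per model rather than 24.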
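
The repetition of AUC across explanation methods can be seen on a small stand-in for the metrics table. The frame below is hypothetical (two models, two samplers, made-up AUCs); it only mirrors the column layout used in the notebook.

```python
import pandas as pd

# Hypothetical AUC per (model, sampler); each is copied into three method rows
auc_by_config = {
    ("RF", "None"): 0.78, ("RF", "SMOTE"): 0.77,
    ("XGB", "None"): 0.76, ("XGB", "SMOTE"): 0.75,
}
rows = [
    {"Model": m, "Sampler": s, "Method": e, "AUC": auc}
    for (m, s), auc in auc_by_config.items()
    for e in ("SHAP", "Banzhaf", "Owen")
]
metrics_toy = pd.DataFrame(rows)

per_method = metrics_toy.groupby("Method")["AUC"].mean()
unique_auc = metrics_toy.drop_duplicates(["Model", "Sampler"])

print(per_method)
print(len(metrics_toy), "rows,", len(unique_auc), "distinct AUC measurements")
```

Dropping the duplicates before running AUC tests would keep one row per model and sampler, at the cost of fewer observations.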

Interpretability (I):
SHAP highest (best interpretability)

Banzhaf moderate

Owen lowest

Stability (CV):
Owen most stable

SHAP least stable
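
For reference, a bootstrap interval for a mean can be sketched in a few lines of NumPy. This is the plain percentile method, which is an assumption made here for simplicity; `scipy.stats.bootstrap` as called in the notebook defaults to the BCa method, so its limits can differ slightly. The data are made up.

```python
import numpy as np

def bootstrap_mean_ci(data, confidence=0.95, n_boot=2000, seed=42):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Each row of idx is one resample drawn with replacement
    idx = rng.integers(0, len(data), size=(n_boot, len(data)))
    means = data[idx].mean(axis=1)
    tail = 100 * (1 - confidence) / 2
    lo, hi = np.percentile(means, [tail, 100 - tail])
    return float(lo), float(hi)

scores = np.linspace(0.3, 0.7, 24)   # hypothetical metric values
lo, hi = bootstrap_mean_ci(scores)
print(f"mean={scores.mean():.4f} [{lo:.4f}, {hi:.4f}]")
```

A narrow interval indicates that the mean would change little under resampling of the configurations; it does not account for dependence between rows that share a model or sampler.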

6. Normality & Variance Tests
AUC is not normally distributed and has unequal variances → non‑parametric tests were appropriate.

I and T are mostly normal with equal variances.

7. Spearman Correlation
Overall AUC vs I: no correlation (ρ ≈ 0).

SHAP: moderate positive correlation.

Banzhaf: moderate negative correlation.

Owen: no correlation.

Interpretation:  
Interpretability scores do not predict model performance.
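
A pooled correlation near zero is compatible with opposite correlations inside the subgroups, as reported for SHAP and Banzhaf. The sketch below illustrates this with made-up numbers in arbitrary units, computing Spearman's ρ as the Pearson correlation of ranks (valid here because there are no ties).

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho via Pearson correlation of ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical (AUC-like, I-like) pairs for two explanation methods
auc_a, i_a = np.array([1.0, 3.0, 5.0, 7.0]), np.array([10.0, 30.0, 50.0, 70.0])
auc_b, i_b = np.array([2.0, 4.0, 6.0, 8.0]), np.array([80.0, 60.0, 40.0, 20.0])

rho_a = spearman_rho(auc_a, i_a)
rho_b = spearman_rho(auc_b, i_b)
rho_pooled = spearman_rho(np.concatenate([auc_a, auc_b]),
                          np.concatenate([i_a, i_b]))
print(f"A: {rho_a:.2f}  B: {rho_b:.2f}  pooled: {rho_pooled:.3f}")
```

This is why the per-method correlations are worth reporting alongside the overall one.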

8. Power Analysis
For medium effect (d = 0.5), required n ≈ 64 → your n = 24 is underpowered.

For large effect (d = 0.8), required n ≈ 26 → your n = 24 is borderline.

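Interpretation:  
Large effects sit right at the edge of detectability (n = 24 against roughly 26 required); medium and small effects may be missed. The calculation assumes an independent two-sample t-test, so it is only a rough guide for the paired, rank-based tests used here.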
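
The required sample sizes can be approximated by hand with the normal-approximation formula n ≈ 2((z₁₋α/₂ + z_power) / d)² per group. This sketch uses only the standard library; it is an approximation and gives values slightly below the t-based figures from `TTestIndPower` quoted above.

```python
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = z.inv_cdf(power)           # quantile matching the target power
    return 2 * ((z_alpha + z_power) / d) ** 2

n_medium = n_per_group(0.5)
n_large = n_per_group(0.8)
print(f"d=0.5: n ≈ {n_medium:.1f} per group")
print(f"d=0.8: n ≈ {n_large:.1f} per group")
```

Halving the effect size roughly quadruples the required n, which is why medium effects need far more configurations than this study has per model.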

Final Conclusions
Random Forest is the best-performing model across AUC and T(α=0.5), significantly outperforming both LGB and XGB.

LightGBM is the second-best, consistently better than XGB.

XGBoost performs the worst across all predictive metrics.

Interpretability (I) does not differ significantly between models in the Friedman test, and effect sizes are small or negligible.

SHAP provides the highest interpretability, but with lower stability; Owen is most stable but least interpretable.

The Friedman, Nemenyi, and Wilcoxon tests, together with the Cliff’s delta effect sizes, are consistent with the ranking:
RF > LGB > XGB.

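Bootstrap CIs are reported for every per-method mean, and the power analysis indicates that with n = 24 per model the study is borderline for large effects (about 26 required) and underpowered for medium effects (about 64 required).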