# Class 1 Degradation Analysis: Train vs Test (Overall)

## Objective
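Diagnose why the Class 1 rate drops from train to test by running a decision tree path analysis on the overall dataset. Throughout the notebook, "test" and "val" both refer to the July 2024 hold-out period; train covers everything up to 30 June 2024.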

## Methodology
1. **Overall Decision Tree Analysis**:
   - Fit a decision tree on the combined train+test data to predict `is_val_set` (0 = train, 1 = test)
   - Extract the leaf nodes, each of which groups rows with a similar train vs test mix
   - Compare the Class 1 rate of train and test rows within each leaf
2. **Comprehensive Output**:
   - Print overall statistics (class 1 rates, sample sizes)
   - Print leaf characteristics showing degradation
   - Identify the feature combinations (decision paths) of the leaves with the largest Class 1 drop
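
Before reading individual leaves, it can help to check how separable train and test are at all. The sketch below is not part of the original analysis: it uses synthetic stand-in data (`X_train_demo`, `X_val_demo` are hypothetical names) and scores the same kind of train-vs-test tree with cross-validated ROC AUC. A value near 0.5 would suggest little covariate shift; values well above 0.5 suggest the leaves are worth inspecting.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): feature 0 is shifted in the later period.
X_train_demo = rng.normal(0.0, 1.0, size=(4000, 3))
X_val_demo = rng.normal(0.0, 1.0, size=(400, 3))
X_val_demo[:, 0] += 1.5

# Stack both periods and label each row with its origin (0 = train, 1 = val).
X_all = np.vstack([X_train_demo, X_val_demo])
is_val_set = np.concatenate([
    np.zeros(len(X_train_demo), dtype=int),
    np.ones(len(X_val_demo), dtype=int),
])

# Same model family as the notebook; the hyperparameters here are illustrative.
clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=42)
auc = cross_val_score(clf, X_all, is_val_set, cv=5, scoring="roc_auc").mean()
print(f"Train-vs-val ROC AUC: {auc:.3f}")
```

Cross-validation is used here because an AUC measured on the rows the tree was fitted on would overstate separability.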

## Expected Output
- Overall class 1 rates for train and test
- Decision tree leaf characteristics
- Leaf-wise degradation analysis

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.tree import DecisionTreeClassifier, _tree, plot_tree

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

print("✓ Libraries imported")

✓ Libraries imported


In [2]:
# Load and prepare data
df = pd.read_parquet('../actual_data.parquet')
print(f"✓ Data loaded: {df.shape}")

# Clean data
df = df.replace([np.inf, -np.inf], np.nan)
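for col in df.select_dtypes(include=['object']).columns:
    # errors='ignore' is deprecated in recent pandas; leave non-numeric columns unchanged
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass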
df = df.fillna(-99999)

# Date splits
june_2024_end = pd.Timestamp('2024-06-30')
july_2024_end = pd.Timestamp('2024-07-31')

# Parse the cutoff date once and reuse it for both splits
cutoff_date = pd.to_datetime(df['CUTOFF_DATE'])
train_df = df[cutoff_date <= june_2024_end].query("EVER_4DPD_IN_120DAYS != -99999").copy()
val_df = df[(cutoff_date > june_2024_end) & (cutoff_date <= july_2024_end)].copy()

# Load features
features = pd.read_csv('../column_order_401_9_jan.csv')['features'].tolist()

# Extract X and y
X_train = train_df[features].clip(lower=-1e20, upper=1e20).copy()
y_train = train_df['EVER_4DPD_IN_120DAYS'].copy()
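# Clip validation features the same way as train so both sets are preprocessed identically
X_val = val_df[features].clip(lower=-1e20, upper=1e20).copy()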
y_val = val_df['EVER_4DPD_IN_120DAYS'].copy()

print(f"\n✓ Train: {len(X_train):,} samples, Class 1 rate: {y_train.mean():.4f} ({y_train.mean()*100:.2f}%)")
print(f"✓ Val: {len(X_val):,} samples, Class 1 rate: {y_val.mean():.4f} ({y_val.mean()*100:.2f}%)")
print(f"\n⚠️ Class 1 rate change: {((y_val.mean() - y_train.mean()) / y_train.mean() * 100):.2f}%")

✓ Data loaded: (135592, 362)

✓ Train: 123,208 samples, Class 1 rate: 0.3119 (31.19%)
✓ Val: 7,802 samples, Class 1 rate: 0.3013 (30.13%)

⚠️ Class 1 rate change: -3.37%
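
The change printed above is a relative change, so it is worth separating it from the absolute change in percentage points and checking how large the drop is against sampling noise. The sketch below is an addition, not part of the original notebook. It uses only the rounded figures printed above, so its relative change can differ slightly from the printed -3.37%, and it applies a pooled two-proportion z-test under a normal approximation.

```python
import math

# Figures taken from the printed output above (rates are rounded to 4 decimals).
n_train, p_train = 123_208, 0.3119
n_val, p_val = 7_802, 0.3013

abs_change_pp = (p_val - p_train) * 100              # absolute change, percentage points
rel_change_pct = (p_val - p_train) / p_train * 100   # relative change, as printed by the notebook

# Pooled two-proportion z-test (normal approximation, assumes independent samples).
p_pool = (p_train * n_train + p_val * n_val) / (n_train + n_val)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_train + 1 / n_val))
z = (p_val - p_train) / se
p_value = math.erfc(abs(z) / math.sqrt(2))           # two-sided p-value

print(f"Absolute change: {abs_change_pp:.2f} pp")
print(f"Relative change: {rel_change_pct:.2f}%")
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With a validation set of only 7,802 rows, the standard error of the difference is dominated by the `1 / n_val` term, which is why a drop of about one percentage point is not automatically a large effect.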


## Step 1: Fit Decision Tree and Extract Leaf Characteristics

In [3]:
# Fit decision tree on overall data
print(f"{'='*120}")
print("FITTING DECISION TREE ON OVERALL DATA")
print(f"{'='*120}\n")

# Combine train and test data
X_combined = pd.concat([X_train, X_val], axis=0, ignore_index=True)
y_combined_class1 = pd.concat([y_train, y_val], axis=0, ignore_index=True)
is_val_combined = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_val))])

print(f"✓ Combined dataset: {len(X_combined):,} samples")
print(f"  - Train: {len(X_train):,} ({len(X_train)/len(X_combined)*100:.2f}%)")
print(f"  - Val: {len(X_val):,} ({len(X_val)/len(X_combined)*100:.2f}%)")

# Fit decision tree
min_samples_leaf = max(50, int(0.005 * len(X_combined)))

tree_model = DecisionTreeClassifier(
    max_depth=8,
    min_samples_leaf=min_samples_leaf,
    criterion='gini',
    random_state=42
)

tree_model.fit(X_combined, is_val_combined)

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': tree_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\n✓ Decision tree fitted")
print(f"  - Tree depth: {tree_model.get_depth()}")
print(f"  - Number of leaves: {tree_model.get_n_leaves()}")
print(f"  - Min samples per leaf: {min_samples_leaf}")

print(f"\n📊 Top 20 features distinguishing test from train:")
print(feature_importance.head(20).to_string(index=False))

print(f"\n{'='*120}")


FITTING DECISION TREE ON OVERALL DATA

✓ Combined dataset: 131,010 samples
  - Train: 123,208 (94.04%)
  - Val: 7,802 (5.96%)

✓ Decision tree fitted
  - Tree depth: 8
  - Number of leaves: 84
  - Min samples per leaf: 655

📊 Top 20 features distinguishing test from train:
                                                          feature  importance
                           MAX_DAYS_PAST_DUE_ACTIVE_30_TO_60_DAYS    0.187132
                                                      BUREAUSCORE    0.174774
                                   NUM_MAINMONEYCLICK_91_TO_120_D    0.103194
                                     TOTAL_OVERDUE_AMOUNT_60TO90D    0.088811
                                     NUM_PAYOUTCREATED_31_TO_60_D    0.057169
                         MIN_CURRENT_BALANCE_UNSECURED_AND_ACTIVE    0.057159
                 MAX_DAYS_SINCE_LAST_PAYMENT_ACTIVE_AND_UNSECURED    0.055423
                                   NUM_MAINBILLSCLICK_91_TO_120_D    0.052539
                 

In [4]:
def format_bounded_conditions(path_conditions, X_leaf_data, feature_names):
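    """Collapse path conditions into one bounded range per feature.

    A bound the path does not set is filled with the min/max observed in the leaf,
    so a filled lower bound is inclusive even though it is printed with '<'.
    """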
    feature_bounds = {}
    for cond in path_conditions:
        if ' <= ' in cond:
            feat, val = cond.split(' <= ')
            val = float(val)
            if feat not in feature_bounds:
                feature_bounds[feat] = {'lower': None, 'upper': None}
            if feature_bounds[feat]['upper'] is None or val < feature_bounds[feat]['upper']:
                feature_bounds[feat]['upper'] = val
        elif ' > ' in cond:
            feat, val = cond.split(' > ')
            val = float(val)
            if feat not in feature_bounds:
                feature_bounds[feat] = {'lower': None, 'upper': None}
            if feature_bounds[feat]['lower'] is None or val > feature_bounds[feat]['lower']:
                feature_bounds[feat]['lower'] = val
    
    for feat, bounds in feature_bounds.items():
        feat_idx = feature_names.index(feat)
        feat_data = X_leaf_data[:, feat_idx]
        if bounds['lower'] is None:
            bounds['lower'] = feat_data.min()
        if bounds['upper'] is None:
            bounds['upper'] = feat_data.max()
    
    formatted = []
    for feat, bounds in sorted(feature_bounds.items()):
        formatted.append(f"{bounds['lower']:.4f} < {feat} <= {bounds['upper']:.4f}")
    return formatted

def extract_decision_paths(tree, feature_names, X, y_class1, is_val):
    """Extract all decision paths from tree"""
    tree_ = tree.tree_
    feature_name = [feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!" for i in tree_.feature]
    paths = []
    # Assign every row to its leaf once instead of re-running tree.apply for each leaf
    leaf_ids = tree.apply(X)
    
    def recurse(node, path_conditions):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            recurse(tree_.children_left[node], path_conditions + [f"{name} <= {threshold:.4f}"])
            recurse(tree_.children_right[node], path_conditions + [f"{name} > {threshold:.4f}"])
        else:
            leaf_idx = leaf_ids == node
            n_train = (is_val[leaf_idx] == 0).sum()
            n_val = (is_val[leaf_idx] == 1).sum()
            train_class1_rate = y_class1[leaf_idx][is_val[leaf_idx] == 0].mean() if n_train > 0 else 0
            val_class1_rate = y_class1[leaf_idx][is_val[leaf_idx] == 1].mean() if n_val > 0 else 0
            class1_diff = val_class1_rate - train_class1_rate
            
            paths.append({
                'node_id': node,
                'path': ' AND '.join(format_bounded_conditions(path_conditions, X[leaf_idx], feature_names)) if path_conditions else 'root',
                'n_conditions': len(path_conditions),
                'n_samples': n_train + n_val,
                'n_train': n_train,
                'n_val': n_val,
                'val_purity': n_val / (n_train + n_val) if (n_train + n_val) > 0 else 0,
                'train_class1_rate': train_class1_rate,
                'val_class1_rate': val_class1_rate,
                'class1_diff': class1_diff,
                'class1_diff_pct': (class1_diff / train_class1_rate * 100) if train_class1_rate > 0 else 0,
                'impact_score': class1_diff * (n_train + n_val)
            })
    
    recurse(0, [])
    return pd.DataFrame(paths)

# Extract paths
print(f"\n{'='*120}")
print("EXTRACTING DECISION TREE LEAF CHARACTERISTICS")
print(f"{'='*120}\n")

df_paths = extract_decision_paths(tree_model, features, X_combined.values, y_combined_class1.values, is_val_combined)

# Filter valid leaves
min_samples_per_split = 20
df_paths_valid = df_paths[(df_paths['n_train'] >= min_samples_per_split) & (df_paths['n_val'] >= min_samples_per_split)].copy()

print(f"✓ Total leaves: {len(df_paths)}")
print(f"✓ Valid leaves (min {min_samples_per_split} samples): {len(df_paths_valid)}")

# Add metrics
expected_val_pct = len(X_val) / len(X_combined)
df_paths_valid['test_overrep_%'] = (df_paths_valid['val_purity'] / expected_val_pct - 1) * 100
df_paths_valid['pct_of_train'] = df_paths_valid['n_train'] / len(X_train) * 100
df_paths_valid['pct_of_val'] = df_paths_valid['n_val'] / len(X_val) * 100
df_ranked = df_paths_valid.sort_values('impact_score', ascending=True).reset_index(drop=True)

print(f"\n{'='*120}")
print("TOP 10 LEAVES WITH HIGHEST CLASS 1 SUPPRESSION")
print(f"{'='*120}\n")

for idx, row in df_ranked.head(10).iterrows():
    print(f"\n{'-'*120}")
    print(f"Leaf #{idx + 1} | Impact Score: {row['impact_score']:.2f}")
    print(f"{'-'*120}")
    print(f"\n📍 Decision Path ({row['n_conditions']} conditions):")
    if row['path'] != 'root':
        for i, cond in enumerate(row['path'].split(' AND '), 1):
            print(f"  {i}. {cond}")
    print(f"\n📊 Population:")
    print(f"  • Total: {row['n_samples']:,} ({row['n_samples']/len(X_combined)*100:.2f}%)")
    print(f"  • Train: {row['n_train']:,} ({row['pct_of_train']:.2f}%)")
    print(f"  • Val: {row['n_val']:,} ({row['pct_of_val']:.2f}%)")
    print(f"  • Test purity: {row['val_purity']:.2%}")
    print(f"\n🎯 Class 1 Analysis:")
    print(f"  • Train: {row['train_class1_rate']:.4f} ({row['train_class1_rate']*100:.2f}%)")
    print(f"  • Val: {row['val_class1_rate']:.4f} ({row['val_class1_rate']*100:.2f}%)")
    print(f"  • Change: {row['class1_diff_pct']:.2f}%")

print(f"\n{'='*120}")



EXTRACTING DECISION TREE LEAF CHARACTERISTICS

✓ Total leaves: 84
✓ Valid leaves (min 20 samples): 46

TOP 10 LEAVES WITH HIGHEST CLASS 1 SUPPRESSION


------------------------------------------------------------------------------------------------------------------------
Leaf #1 | Impact Score: -732.27
------------------------------------------------------------------------------------------------------------------------

📍 Decision Path (8 conditions):
  1. -99999.0000 < BUREAUSCORE <= 771.5000
  2. -99999.0000 < MAX_DAYS_PAST_DUE_ACTIVE_30_TO_60_DAYS <= -49999.5000
  3. -249916.0000 < MIN_CURRENT_BALANCE_ACTIVE <= 1196.5000
  4. -249916.0000 < MIN_CURRENT_BALANCE_UNSECURED_AND_ACTIVE <= 2451.0000
  5. 0.5000 < NUM_MAINMONEYCLICK_61_TO_90_D <= 351.0000
  6. 0.0000 < NUM_PAYOUTSETTLED_1_TO_30_D <= 0.5000
  7. -99999.0000 < RATIO_AVG_DEBIT_AMOUNT_AVG_CDT_AMT_LAST30D_SMS_BANKING <= 1.6576
  8. -99999.0000 < TENURE_MONTHS <= 11.5000

📊 Population:
  • Total: 7,060 (5.39%)
  

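The per-leaf counts, Class 1 rates, over-representation and impact score computed in `extract_decision_paths` can also be derived with a single `groupby` on the leaf ids, which makes the definitions easier to audit. The sketch below is an illustrative alternative on synthetic data (all `*_demo` names are hypothetical), not the notebook's own implementation. One deliberate difference: a rate for an empty group comes out as `NaN` rather than `0`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in data (assumption): val rows are shifted relative to train rows.
n_tr, n_va = 5000, 500
X_demo = np.vstack([rng.normal(0.0, 1.0, (n_tr, 2)), rng.normal(0.8, 1.0, (n_va, 2))])
is_val_demo = np.concatenate([np.zeros(n_tr, dtype=int), np.ones(n_va, dtype=int)])
y_demo = rng.binomial(1, 0.3, size=n_tr + n_va)  # stand-in Class 1 labels

tree_demo = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100, random_state=42)
tree_demo.fit(X_demo, is_val_demo)

# One row per sample: which leaf it lands in, which split it came from, its label.
frame = pd.DataFrame({"leaf": tree_demo.apply(X_demo), "is_val": is_val_demo, "y": y_demo})

# Aggregate per (leaf, split), then pivot the split into columns 0 (train) and 1 (val).
stats = frame.groupby(["leaf", "is_val"])["y"].agg(["count", "mean"]).unstack("is_val")
counts = stats["count"].fillna(0)
rates = stats["mean"]

leaf_stats = pd.DataFrame({
    "n_train": counts[0].astype(int),
    "n_val": counts[1].astype(int),
    "train_class1_rate": rates[0],
    "val_class1_rate": rates[1],
})
n_total = leaf_stats["n_train"] + leaf_stats["n_val"]
expected_val_pct = n_va / (n_tr + n_va)

# Same definitions as in the notebook cell above.
leaf_stats["val_purity"] = leaf_stats["n_val"] / n_total
leaf_stats["test_overrep_%"] = (leaf_stats["val_purity"] / expected_val_pct - 1) * 100
leaf_stats["class1_diff"] = leaf_stats["val_class1_rate"] - leaf_stats["train_class1_rate"]
leaf_stats["impact_score"] = leaf_stats["class1_diff"] * n_total

print(leaf_stats.sort_values("impact_score").head())
```

Because `impact_score` multiplies the rate difference by the total leaf size (train plus val), large leaves can rank high even when the rate difference itself is small.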
In [5]:
# Summary
print(f"\n{'='*120}")
print("SUMMARY")
print(f"{'='*120}\n")

print(f"📊 OVERALL STATISTICS:")
print(f"  • Train samples: {len(X_train):,}, Class 1: {y_train.mean()*100:.2f}%")
print(f"  • Test samples: {len(X_val):,}, Class 1: {y_val.mean()*100:.2f}%")
print(f"  • Class 1 degradation: {((y_val.mean() - y_train.mean()) / y_train.mean() * 100):.2f}%")

print(f"\n🌳 TREE ANALYSIS:")
print(f"  • Tree depth: {tree_model.get_depth()}, Leaves: {tree_model.get_n_leaves()}")
print(f"  • Valid leaves analyzed: {len(df_paths_valid)}")
print(f"  • Leaves with degradation: {(df_paths_valid['class1_diff'] < 0).sum()}")

print(f"\n📈 KEY FINDINGS:")
print(f"  • Average impact score: {df_paths_valid['impact_score'].mean():.2f}")
print(f"  • Worst degradation: {df_ranked.iloc[0]['class1_diff_pct']:.2f}%")

print(f"\n💡 RECOMMENDATIONS:")
print(f"  1. Focus on top 10 leaves with highest degradation")
print(f"  2. Investigate feature distributions in affected leaves")
print(f"  3. Monitor similar patterns in future data")

print(f"\n✓ Analysis complete!")
print(f"{'='*120}")



SUMMARY

📊 OVERALL STATISTICS:
  • Train samples: 123,208, Class 1: 31.19%
  • Test samples: 7,802, Class 1: 30.13%
  • Class 1 degradation: -3.37%

🌳 TREE ANALYSIS:
  • Tree depth: 8, Leaves: 84
  • Valid leaves analyzed: 46
  • Leaves with degradation: 32

📈 KEY FINDINGS:
  • Average impact score: -34.44
  • Worst degradation: -33.70%

💡 RECOMMENDATIONS:
  1. Focus on top 10 leaves with highest degradation
  2. Investigate feature distributions in affected leaves
  3. Monitor similar patterns in future data

✓ Analysis complete!
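
For the second recommendation, one common way to compare a feature's distribution between train and val rows of a leaf is the population stability index (PSI). The helper below is a minimal sketch and an assumption on my part: `population_stability_index` is a hypothetical function, not something defined in this notebook, and the data is synthetic. Bins are quantiles of the train sample, so sentinel values such as -99999 would simply fall into the lowest bin.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI of `actual` against `expected`, using quantile bins of `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior quantile cut points; np.unique removes duplicates caused by ties.
    cuts = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1]))
    # searchsorted maps each value to a bin index in 0..len(cuts).
    exp_counts = np.bincount(np.searchsorted(cuts, expected, side="right"), minlength=len(cuts) + 1)
    act_counts = np.bincount(np.searchsorted(cuts, actual, side="right"), minlength=len(cuts) + 1)
    # Clip proportions away from zero so the logarithm stays finite.
    exp_pct = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_pct = np.clip(act_counts / act_counts.sum(), eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(2)
train_feature = rng.normal(0.0, 1.0, 5000)   # stand-in for a feature on train rows of a leaf
val_same = rng.normal(0.0, 1.0, 2000)        # same distribution as train
val_shifted = rng.normal(1.0, 1.0, 2000)     # mean shifted by one standard deviation

psi_same = population_stability_index(train_feature, val_same)
psi_shifted = population_stability_index(train_feature, val_shifted)
print(f"PSI (no shift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a substantial shift, but those cut-offs are conventions rather than statistical tests, and PSI is noisy when a leaf contains few val rows.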


## Step 2: Comprehensive Summary and Export