# Tier 5: Gradient Boosting Classification

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** cd8c8259-f6c0-41ca-984d-c071e0066ad5

---

## Citation
Brandon Deloatch, "Tier 5: Gradient Boosting Classification," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** cd8c8259-f6c0-41ca-984d-c071e0066ad5
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, validation_curve
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import LabelEncoder
from sklearn.datasets import make_classification
import warnings
warnings.filterwarnings('ignore')

# Try to import advanced boosting libraries
try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False

try:
    import lightgbm as lgb
    LIGHTGBM_AVAILABLE = True
except ImportError:
    LIGHTGBM_AVAILABLE = False

try:
    import catboost as cb
    CATBOOST_AVAILABLE = True
except ImportError:
    CATBOOST_AVAILABLE = False

print(" Tier 5: Gradient Boosting Classification - Libraries Loaded!")
print("="*62)
print("Gradient Boosting Classification Techniques:")
print("• Gradient Boosting Classifier (scikit-learn)")
print("• AdaBoost (Adaptive Boosting)")
print(f"• XGBoost: {'Available' if XGBOOST_AVAILABLE else 'Not Available'}")
print(f"• LightGBM: {'Available' if LIGHTGBM_AVAILABLE else 'Not Available'}")
print(f"• CatBoost: {'Available' if CATBOOST_AVAILABLE else 'Not Available'}")
print("• Learning rate optimization and regularization")
print("• Early stopping and cross-validation")

 Tier 5: Gradient Boosting Classification - Libraries Loaded!
Gradient Boosting Classification Techniques:
• Gradient Boosting Classifier (scikit-learn)
• AdaBoost (Adaptive Boosting)
• XGBoost: Available
• LightGBM: Available
• CatBoost: Available
• Learning rate optimization and regularization
• Early stopping and cross-validation


In [4]:
# Generate comprehensive boosting datasets
np.random.seed(42)

# 1. Financial risk assessment dataset
def generate_financial_dataset(n_samples=5000):
    """Generate realistic financial risk dataset."""

    data = []
    risk_levels = ['Low', 'Medium', 'High', 'Very_High']

    for i in range(n_samples):
        # Risk level assignment
        risk_idx = np.random.choice(4, p=[0.4, 0.3, 0.2, 0.1])
        risk_level = risk_levels[risk_idx]

        # Generate features based on risk level
        if risk_level == 'Low':
            credit_score = np.random.normal(750, 50)
            debt_to_income = np.random.beta(2, 8) * 0.5 # Lower DTI
            payment_history = np.random.beta(9, 1) # Excellent history
            account_age = np.random.exponential(5) + 2 # Longer history
            income = np.random.lognormal(11, 0.3) # Higher income

        elif risk_level == 'Medium':
            credit_score = np.random.normal(680, 40)
            debt_to_income = np.random.beta(3, 5) * 0.6
            payment_history = np.random.beta(7, 2)
            account_age = np.random.exponential(3) + 1
            income = np.random.lognormal(10.5, 0.4)

        elif risk_level == 'High':
            credit_score = np.random.normal(620, 35)
            debt_to_income = np.random.beta(5, 3) * 0.8
            payment_history = np.random.beta(5, 4)
            account_age = np.random.exponential(2) + 0.5
            income = np.random.lognormal(10, 0.5)

        else: # Very_High
            credit_score = np.random.normal(550, 50)
            debt_to_income = np.random.beta(8, 2) * 1.0
            payment_history = np.random.beta(3, 7)
            account_age = np.random.exponential(1) + 0.1
            income = np.random.lognormal(9.5, 0.6)

        # Additional features
        data.append({
            'customer_id': f'CUST_{i:06d}',
            'credit_score': max(300, min(850, credit_score)), # Clamp to valid range
            'debt_to_income_ratio': min(1.5, debt_to_income), # Cap at 150%
            'payment_history_score': payment_history,
            'account_age_years': account_age,
            'annual_income': income,
            'number_of_accounts': np.random.poisson(5) + 1,
            'recent_inquiries': np.random.poisson(1),
            'utilization_ratio': np.random.beta(2, 3),
            'employment_length': np.random.exponential(3) + 0.5,
            'homeowner': np.random.choice([0, 1], p=[0.7, 0.3]),
            'risk_level': risk_level,
            'risk_code': risk_idx
        })

    return pd.DataFrame(data)

# 2. E-commerce customer dataset
def generate_ecommerce_dataset(n_samples=4000):
    """Generate e-commerce customer behavior dataset."""

    data = []
    customer_types = ['Casual', 'Regular', 'Premium']

    for i in range(n_samples):
        # Customer type assignment
        type_idx = np.random.choice(3, p=[0.5, 0.3, 0.2])
        customer_type = customer_types[type_idx]

        # Generate features based on customer type
        if customer_type == 'Casual':
            monthly_purchases = np.random.poisson(1) + 1
            avg_order_value = np.random.lognormal(3, 0.8) # median ~$20, mean ~$28
            session_duration = np.random.exponential(5)
            page_views = np.random.poisson(8)

        elif customer_type == 'Regular':
            monthly_purchases = np.random.poisson(3) + 2
            avg_order_value = np.random.lognormal(4, 0.6) # median ~$55, mean ~$65
            session_duration = np.random.exponential(12)
            page_views = np.random.poisson(15)

        else: # Premium
            monthly_purchases = np.random.poisson(6) + 4
            avg_order_value = np.random.lognormal(5, 0.5) # median ~$148, mean ~$168
            session_duration = np.random.exponential(20)
            page_views = np.random.poisson(25)

        data.append({
            'customer_id': f'ECOM_{i:06d}',
            'monthly_purchases': monthly_purchases,
            'avg_order_value': avg_order_value,
            'session_duration_min': session_duration,
            'page_views_per_session': page_views,
            'cart_abandonment_rate': np.random.beta(3, 5),
            'time_on_site_total': session_duration * monthly_purchases,
            'mobile_usage_ratio': np.random.beta(4, 3),
            'email_open_rate': np.random.beta(3, 4),
            'social_media_referrals': np.random.poisson(2),
            'customer_type': customer_type,
            'type_code': type_idx
        })

    return pd.DataFrame(data)

# Generate datasets
financial_df = generate_financial_dataset()
ecommerce_df = generate_ecommerce_dataset()

print(" Gradient Boosting Datasets Created:")
print(f"Financial risk assessment: {financial_df.shape}")
print(f"Risk distribution: {financial_df['risk_level'].value_counts().to_dict()}")
print(f"\nE-commerce customers: {ecommerce_df.shape}")
print(f"Customer type distribution: {ecommerce_df['customer_type'].value_counts().to_dict()}")

# Show sample statistics
print(f"\nFinancial Dataset - Credit Score by Risk Level:")
risk_stats = financial_df.groupby('risk_level')['credit_score'].agg(['mean', 'std'])
for risk_level in ['Low', 'Medium', 'High', 'Very_High']:
    if risk_level in risk_stats.index:
        mean_score = risk_stats.loc[risk_level, 'mean']
        std_score = risk_stats.loc[risk_level, 'std']
        print(f" {risk_level}: {mean_score:.0f} ± {std_score:.0f}")

 Gradient Boosting Datasets Created:
Financial risk assessment: (5000, 13)
Risk distribution: {'Low': 1999, 'Medium': 1484, 'High': 1019, 'Very_High': 498}

E-commerce customers: (4000, 12)
Customer type distribution: {'Casual': 2028, 'Regular': 1172, 'Premium': 800}

Financial Dataset - Credit Score by Risk Level:
 Low: 748 ± 50
 Medium: 682 ± 40
 High: 621 ± 34
 Very_High: 552 ± 50
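
The dollar figures attached to the `np.random.lognormal(mu, sigma)` calls in the generators can be checked against the closed forms for a lognormal distribution: the median is `exp(mu)` and the mean is `exp(mu + sigma**2 / 2)`. The sketch below applies them to the three order-value parameter pairs used in `generate_ecommerce_dataset`; the helper name `lognormal_summary` is hypothetical and not part of the notebook.

```python
import numpy as np

def lognormal_summary(mu, sigma):
    """Return (median, mean) of a lognormal with log-mean mu and log-std sigma."""
    median = float(np.exp(mu))
    mean = float(np.exp(mu + sigma ** 2 / 2))
    return median, mean

# (mu, sigma) pairs copied from generate_ecommerce_dataset
order_value_params = {"Casual": (3, 0.8), "Regular": (4, 0.6), "Premium": (5, 0.5)}

summaries = {
    name: lognormal_summary(mu, sigma)
    for name, (mu, sigma) in order_value_params.items()
}
for name, (median, mean) in summaries.items():
    print(f"{name}: median ~ ${median:.0f}, mean ~ ${mean:.0f}")
```

Because the distribution is right-skewed, the mean always sits above the median, so a single "typical order value" figure is ambiguous unless it says which one it is.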


In [5]:
# 1. GRADIENT BOOSTING ALGORITHM COMPARISON
print(" 1. GRADIENT BOOSTING ALGORITHM COMPARISON")
print("="*44)

# Prepare financial data
financial_features = [
    col for col in financial_df.columns
    if col not in ['customer_id', 'risk_level', 'risk_code']
]
X_financial = financial_df[financial_features].values
y_financial = financial_df['risk_code'].values

# Split the data
X_fin_train, X_fin_test, y_fin_train, y_fin_test = train_test_split(
    X_financial, y_financial, test_size=0.2, random_state=42, stratify=y_financial
)

# Initialize boosting algorithms
boosting_algorithms = {}
results = {}

# 1. Scikit-learn Gradient Boosting
gb_sklearn = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=42
)
boosting_algorithms['Gradient Boosting'] = gb_sklearn

# 2. AdaBoost
ada_boost = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)
boosting_algorithms['AdaBoost'] = ada_boost

# 3. XGBoost (if available)
if XGBOOST_AVAILABLE:
    xgb_classifier = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        eval_metric='mlogloss'
    )
    boosting_algorithms['XGBoost'] = xgb_classifier

# 4. LightGBM (if available)
if LIGHTGBM_AVAILABLE:
    lgb_classifier = lgb.LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        verbose=-1
    )
    boosting_algorithms['LightGBM'] = lgb_classifier

# 5. CatBoost (if available)
if CATBOOST_AVAILABLE:
    cat_classifier = cb.CatBoostClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        verbose=False
    )
    boosting_algorithms['CatBoost'] = cat_classifier

# Train and evaluate all algorithms
print("Training boosting algorithms...")
for name, algorithm in boosting_algorithms.items():
    print(f"Training {name}...")

    # Train the algorithm
    algorithm.fit(X_fin_train, y_fin_train)

    # Predictions
    y_pred = algorithm.predict(X_fin_test)
    y_pred_proba = algorithm.predict_proba(X_fin_test)

    # Performance metrics
    accuracy = accuracy_score(y_fin_test, y_pred)
    cv_scores = cross_val_score(algorithm, X_fin_train, y_fin_train, cv=3)

    results[name] = {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'model': algorithm
    }

    print(f" Accuracy: {accuracy:.3f} | CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Performance comparison
print(f"\nBoosting Algorithm Performance Comparison:")
print("Algorithm Accuracy CV Mean CV Std")
print("-" * 50)
for name, metrics in results.items():
    print(f"{name:<15} {metrics['accuracy']:.3f} {metrics['cv_mean']:.3f} {metrics['cv_std']:.3f}")

# Find best performer
best_algorithm = max(results.keys(), key=lambda x: results[x]['accuracy'])
best_accuracy = results[best_algorithm]['accuracy']
print(f"\nBest performing algorithm: {best_algorithm} (Accuracy: {best_accuracy:.3f})")

 1. GRADIENT BOOSTING ALGORITHM COMPARISON
Training boosting algorithms...
Training Gradient Boosting...
 Accuracy: 0.926 | CV: 0.920 ± 0.010
Training AdaBoost...
 Accuracy: 0.791 | CV: 0.832 ± 0.016
Training XGBoost...
 Accuracy: 0.929 | CV: 0.918 ± 0.008
Training LightGBM...
 Accuracy: 0.930 | CV: 0.916 ± 0.009
Training CatBoost...
 Accuracy: 0.927 | CV: 0.923 ± 0.010

Boosting Algorithm Performance Comparison:
Algorithm Accuracy CV Mean CV Std
--------------------------------------------------
Gradient Boosting 0.926 0.920 0.010
AdaBoost        0.791 0.832 0.016
XGBoost         0.929 0.918 0.008
LightGBM        0.930 0.916 0.009
CatBoost        0.927 0.923 0.010

Best performing algorithm: LightGBM (Accuracy: 0.930)
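
The comparison above ranks models by plain accuracy, while the risk classes are imbalanced (the `Very_High` class is about 10% of the rows). A hedged sketch of one way to get a second view is to score with macro-averaged F1 under an explicit `StratifiedKFold`. The data here is a small hypothetical stand-in from `make_classification`, not the notebook's financial dataset, and the model settings are reduced so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical stand-in data: four imbalanced classes, loosely like the risk levels
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4,
    weights=[0.4, 0.3, 0.2, 0.1], random_state=42,
)

# Shuffled, stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
model = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=42)

# cross_validate can compute several metrics from the same set of fits
cv_out = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1_macro"])
acc = cv_out["test_accuracy"]
f1 = cv_out["test_f1_macro"]

print(f"Accuracy: {acc.mean():.3f} ± {acc.std():.3f}")
print(f"Macro F1: {f1.mean():.3f} ± {f1.std():.3f}")
```

Macro F1 weights every class equally, so a model that does well only on the majority class scores lower than its accuracy would suggest.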

In [6]:
# 2. HYPERPARAMETER OPTIMIZATION AND LEARNING CURVES
print(" 2. HYPERPARAMETER OPTIMIZATION AND LEARNING CURVES")
print("="*55)

# Focus on Gradient Boosting for detailed analysis
print("Detailed Gradient Boosting Analysis:")

# Learning rate analysis
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.3]
lr_scores = []

print("\nLearning Rate Optimization:")
for lr in learning_rates:
    gb_temp = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,
        max_depth=6,
        random_state=42
    )

    cv_scores = cross_val_score(gb_temp, X_fin_train, y_fin_train, cv=3)
    lr_scores.append(cv_scores.mean())

    print(f"Learning Rate {lr:.2f}: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

best_lr = learning_rates[np.argmax(lr_scores)]
print(f"Best learning rate: {best_lr} (Score: {max(lr_scores):.3f})")

# Max depth analysis
max_depths = [3, 4, 5, 6, 7, 8]
depth_scores = []

print(f"\nMax Depth Optimization:")
for depth in max_depths:
    gb_temp = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=best_lr,
        max_depth=depth,
        random_state=42
    )

    cv_scores = cross_val_score(gb_temp, X_fin_train, y_fin_train, cv=3)
    depth_scores.append(cv_scores.mean())

    print(f"Max Depth {depth}: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

best_depth = max_depths[np.argmax(depth_scores)]
print(f"Best max depth: {best_depth} (Score: {max(depth_scores):.3f})")

# N_estimators analysis with early stopping simulation
n_estimators_range = [25, 50, 75, 100, 150, 200, 300]
train_scores_est = []
test_scores_est = []

print(f"\nN_estimators Analysis:")
for n_est in n_estimators_range:
    gb_temp = GradientBoostingClassifier(
        n_estimators=n_est,
        learning_rate=best_lr,
        max_depth=best_depth,
        random_state=42
    )

    gb_temp.fit(X_fin_train, y_fin_train)

    train_score = gb_temp.score(X_fin_train, y_fin_train)
    test_score = gb_temp.score(X_fin_test, y_fin_test)

    train_scores_est.append(train_score)
    test_scores_est.append(test_score)

    print(f"N_estimators {n_est:3d}: Train={train_score:.3f}, Test={test_score:.3f}")

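# Find optimal n_estimators (where test score peaks).
# Caveat: choosing n_estimators on the test set leaks information, so the
# "optimized" accuracy reported below is optimistic. A separate validation
# split or cross-validation on the training data avoids this.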
best_n_est_idx = np.argmax(test_scores_est)
best_n_est = n_estimators_range[best_n_est_idx]
print(f"Best n_estimators: {best_n_est} (Test Score: {test_scores_est[best_n_est_idx]:.3f})")

# Train final optimized model
gb_optimized = GradientBoostingClassifier(
    n_estimators=best_n_est,
    learning_rate=best_lr,
    max_depth=best_depth,
    random_state=42
)

gb_optimized.fit(X_fin_train, y_fin_train)
optimized_accuracy = gb_optimized.score(X_fin_test, y_fin_test)

print(f"\nOptimized Gradient Boosting Performance:")
print(f"Final accuracy: {optimized_accuracy:.3f}")
print(f"Improvement over baseline: {optimized_accuracy - results['Gradient Boosting']['accuracy']:+.3f}")

 2. HYPERPARAMETER OPTIMIZATION AND LEARNING CURVES
Detailed Gradient Boosting Analysis:

Learning Rate Optimization:
Learning Rate 0.01: 0.906 ± 0.013
Learning Rate 0.05: 0.917 ± 0.012
Learning Rate 0.10: 0.920 ± 0.010
Learning Rate 0.20: 0.920 ± 0.010
Learning Rate 0.30: 0.923 ± 0.008
Best learning rate: 0.3 (Score: 0.923)

Max Depth Optimization:
Max Depth 3: 0.920 ± 0.008
Max Depth 4: 0.920 ± 0.009
Max Depth 5: 0.918 ± 0.012
Max Depth 6: 0.923 ± 0.008
Max Depth 7: 0.920 ± 0.009
Max Depth 8: 0.917 ± 0.011
Best max depth: 6 (Score: 0.923)

N_estimators Analysis:
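
The n_estimators sweep in this cell refits the model once per candidate value. A minimal sketch of an alternative, assuming a held-out validation split carved from the training data, is to fit once with the largest ensemble and use `staged_predict`, which yields predictions after each boosting stage. The data is a hypothetical stand-in from `make_classification`, and names such as `val_acc` and `best_n` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with four classes
X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, n_informative=6, n_classes=4, random_state=42
)
# Validation split taken from the training data; the test set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb.fit(X_tr, y_tr)

# One accuracy value per stage: after 1 tree, 2 trees, ..., 100 trees
val_acc = [accuracy_score(y_val, y_stage) for y_stage in gb.staged_predict(X_val)]

# argmax returns the first maximum, i.e. the smallest ensemble that reaches the peak
best_n = int(np.argmax(val_acc)) + 1
print(f"Best n_estimators on validation data: {best_n} (accuracy {max(val_acc):.3f})")
```

scikit-learn's `GradientBoostingClassifier` also has built-in early stopping through the `n_iter_no_change` and `validation_fraction` parameters, which may be simpler when the full per-stage curve is not needed.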


In [7]:
# 3. FEATURE IMPORTANCE AND MODEL INTERPRETATION
print(" 3. FEATURE IMPORTANCE AND MODEL INTERPRETATION")
print("="*51)

# Feature importance analysis
feature_importance_gb = pd.DataFrame({
    'feature': financial_features,
    'importance': gb_optimized.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Gradient Boosting):")
for i, (idx, row) in enumerate(feature_importance_gb.head(10).iterrows()):
    print(f"{i+1:2d}. {row['feature']:<25}: {row['importance']:.4f}")

# Quartile-bin analysis (a simplified stand-in for partial dependence)
print(f"\nTop 3 Feature Analysis:")
top_features = feature_importance_gb.head(3)['feature'].tolist()

for feature in top_features:
    feature_idx = financial_features.index(feature)
    feature_values = X_fin_test[:, feature_idx]

    # Mean actual risk code (y_fin_test labels, not model predictions) per quartile bin
    percentiles = [0, 25, 50, 75, 100]
    thresholds = np.percentile(feature_values, percentiles)

    print(f"\n{feature} impact on risk prediction:")
    for i in range(len(thresholds) - 1):
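        # Note: the upper bound is exclusive, so samples equal to the feature's
        # maximum (e.g. credit_score clamped at 850) are left out of the last bin.
        mask = (feature_values >= thresholds[i]) & (feature_values < thresholds[i+1])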
        if np.sum(mask) > 0:
            avg_risk = np.mean(y_fin_test[mask])
            print(f" {thresholds[i]:.2f} - {thresholds[i+1]:.2f}: Avg risk level {avg_risk:.2f}")

# E-commerce dataset analysis
print(f"\nE-commerce Customer Classification:")
ecommerce_features = [
    col for col in ecommerce_df.columns
    if col not in ['customer_id', 'customer_type', 'type_code']
]
X_ecom = ecommerce_df[ecommerce_features].values
y_ecom = ecommerce_df['type_code'].values

X_ecom_train, X_ecom_test, y_ecom_train, y_ecom_test = train_test_split(
    X_ecom, y_ecom, test_size=0.2, random_state=42, stratify=y_ecom
)

# Train model for e-commerce data
gb_ecommerce = GradientBoostingClassifier(
    n_estimators=best_n_est,
    learning_rate=best_lr,
    max_depth=best_depth,
    random_state=42
)

gb_ecommerce.fit(X_ecom_train, y_ecom_train)
ecom_accuracy = gb_ecommerce.score(X_ecom_test, y_ecom_test)

print(f"E-commerce classification accuracy: {ecom_accuracy:.3f}")

# E-commerce feature importance
ecom_feature_importance = pd.DataFrame({
    'feature': ecommerce_features,
    'importance': gb_ecommerce.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nE-commerce Feature Importance:")
for i, (idx, row) in enumerate(ecom_feature_importance.head(8).iterrows()):
    print(f"{i+1:2d}. {row['feature']:<25}: {row['importance']:.4f}")

# Learning curve analysis
print(f"\nLearning Curve Analysis:")
train_sizes = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
lc_train_scores = []
lc_test_scores = []

for size in train_sizes:
    n_samples = int(len(X_fin_train) * size)
    X_subset = X_fin_train[:n_samples]
    y_subset = y_fin_train[:n_samples]

    gb_temp = GradientBoostingClassifier(
        n_estimators=best_n_est,
        learning_rate=best_lr,
        max_depth=best_depth,
        random_state=42
    )

    gb_temp.fit(X_subset, y_subset)

    train_score = gb_temp.score(X_subset, y_subset)
    test_score = gb_temp.score(X_fin_test, y_fin_test)

    lc_train_scores.append(train_score)
    lc_test_scores.append(test_score)

    print(f"Training size {size*100:3.0f}%: Train={train_score:.3f}, Test={test_score:.3f}")

 3. FEATURE IMPORTANCE AND MODEL INTERPRETATION
Feature Importance (Gradient Boosting):
 1. debt_to_income_ratio     : 0.5355
 2. credit_score             : 0.1679
 3. account_age_years        : 0.0935
 4. annual_income            : 0.0928
 5. payment_history_score    : 0.0918
 6. employment_length        : 0.0069
 7. utilization_ratio        : 0.0057
 8. number_of_accounts       : 0.0033
 9. recent_inquiries         : 0.0014
10. homeowner                : 0.0010

Top 3 Feature Analysis:

debt_to_income_ratio impact on risk prediction:
 0.01 - 0.10: Avg risk level 0.13
 0.10 - 0.21: Avg risk level 0.43
 0.21 - 0.43: Avg risk level 1.08
 0.43 - 0.98: Avg risk level 2.37

credit_score impact on risk prediction:
 394.83 - 631.33: Avg risk level 2.24
 631.33 - 687.06: Avg risk level 1.17
 687.06 - 739.85: Avg risk level 0.49
 739.85 - 850.00: Avg risk level 0.11

account_age_years impact on risk prediction:
 0.10 - 1.75: Avg risk level 2.05
 1.75 - 3.44: Avg risk level 1.00
 3.44 - 6.28: A
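
The `feature_importances_` values printed above are impurity-based and computed from the training data. As a hedged cross-check, permutation importance measures how much a held-out score drops when one feature's values are shuffled. The sketch below uses `sklearn.inspection.permutation_importance` on a small hypothetical dataset from `make_classification`, not on the notebook's financial data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
model.fit(X_tr, y_tr)

# Shuffle each feature 5 times on held-out data and record the accuracy drop
perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)

print("feature  impurity  permutation (mean ± std)")
for idx in perm.importances_mean.argsort()[::-1]:
    print(
        f"x{idx}       {model.feature_importances_[idx]:.3f}     "
        f"{perm.importances_mean[idx]:.3f} ± {perm.importances_std[idx]:.3f}"
    )
```

When the two rankings disagree, it can be a sign that the impurity-based values are influenced by how the trees happened to split rather than by predictive value on unseen data.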

In [8]:
# 4. COMPREHENSIVE GRADIENT BOOSTING VISUALIZATION DASHBOARD
print(" 4. COMPREHENSIVE GRADIENT BOOSTING VISUALIZATION DASHBOARD")
print("="*66)

# Create comprehensive dashboard
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[
        'Algorithm Performance Comparison',
        'Hyperparameter Optimization: Learning Rate',
        'N_estimators vs Performance (Overfitting Analysis)',
        'Feature Importance: Financial Risk Factors',
        'Learning Curves: Training Size vs Performance',
        'Confusion Matrix: Risk Level Prediction'
    ],
    specs=[
        [{"secondary_y": False}, {"secondary_y": False}],
        [{"secondary_y": False}, {"secondary_y": False}],
        [{"secondary_y": False}, {"type": "heatmap"}]
    ]
)

# 1. Algorithm comparison
algorithms = list(results.keys())
accuracies = [results[alg]['accuracy'] for alg in algorithms]
cv_means = [results[alg]['cv_mean'] for alg in algorithms]

fig.add_trace(
    go.Bar(
        x=algorithms,
        y=accuracies,
        name='Test Accuracy',
        marker_color='lightblue',
        text=[f'{acc:.3f}' for acc in accuracies],
        textposition='auto'
    ),
    row=1, col=1
)

fig.add_trace(
    go.Bar(
        x=algorithms,
        y=cv_means,
        name='CV Mean',
        marker_color='lightcoral',
        text=[f'{cv:.3f}' for cv in cv_means],
        textposition='auto'
    ),
    row=1, col=1
)

# 2. Learning rate optimization
fig.add_trace(
    go.Scatter(
        x=learning_rates,
        y=lr_scores,
        mode='lines+markers',
        name='Learning Rate',
        line=dict(color='blue', width=3),
        marker=dict(size=8)
    ),
    row=1, col=2
)

# Highlight best learning rate
best_lr_idx = np.argmax(lr_scores)
fig.add_trace(
    go.Scatter(
        x=[learning_rates[best_lr_idx]],
        y=[lr_scores[best_lr_idx]],
        mode='markers',
        name='Best LR',
        marker=dict(color='red', size=12, symbol='star')
    ),
    row=1, col=2
)

# 3. N_estimators analysis
fig.add_trace(
    go.Scatter(
        x=n_estimators_range,
        y=train_scores_est,
        mode='lines+markers',
        name='Training Score',
        line=dict(color='blue')
    ),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(
        x=n_estimators_range,
        y=test_scores_est,
        mode='lines+markers',
        name='Test Score',
        line=dict(color='red')
    ),
    row=2, col=1
)

# 4. Feature importance
top_features_plot = feature_importance_gb.head(12)
fig.add_trace(
    go.Bar(
        x=top_features_plot['importance'],
        y=top_features_plot['feature'],
        orientation='h',
        marker_color='green',
        name='Feature Importance'
    ),
    row=2, col=2
)

# 5. Learning curves
training_sizes_pct = [size * 100 for size in train_sizes]

fig.add_trace(
    go.Scatter(
        x=training_sizes_pct,
        y=lc_train_scores,
        mode='lines+markers',
        name='Training Score',
        line=dict(color='blue')
    ),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(
        x=training_sizes_pct,
        y=lc_test_scores,
        mode='lines+markers',
        name='Validation Score',
        line=dict(color='red')
    ),
    row=3, col=1
)

# 6. Confusion matrix
y_pred_final = gb_optimized.predict(X_fin_test)
cm_financial = confusion_matrix(y_fin_test, y_pred_final)
risk_levels = ['Low', 'Medium', 'High', 'Very_High']

fig.add_trace(
    go.Heatmap(
        z=cm_financial,
        x=risk_levels,
        y=risk_levels,
        colorscale='Blues',
        text=cm_financial,
        texttemplate='%{text}',
        hovertemplate='Predicted: %{x}<br>Actual: %{y}<br>Count: %{z}<extra></extra>'
    ),
    row=3, col=2
)

# Update layout
fig.update_layout(
    height=1200,
    title="Gradient Boosting Classification - Comprehensive Analysis Dashboard",
    showlegend=True
)

# Update axis labels
fig.update_xaxes(title_text="Algorithm", row=1, col=1)
fig.update_xaxes(title_text="Learning Rate", row=1, col=2)
fig.update_xaxes(title_text="Number of Estimators", row=2, col=1)
fig.update_xaxes(title_text="Feature Importance", row=2, col=2)
fig.update_xaxes(title_text="Training Set Size (%)", row=3, col=1)
fig.update_xaxes(title_text="Predicted Risk Level", row=3, col=2)

fig.update_yaxes(title_text="Accuracy", row=1, col=1)
fig.update_yaxes(title_text="CV Score", row=1, col=2)
fig.update_yaxes(title_text="Accuracy", row=2, col=1)
fig.update_yaxes(title_text="Feature", row=2, col=2)
fig.update_yaxes(title_text="Accuracy", row=3, col=1)
fig.update_yaxes(title_text="Actual Risk Level", row=3, col=2)

fig.show()

 4. COMPREHENSIVE GRADIENT BOOSTING VISUALIZATION DASHBOARD
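
The confusion-matrix panel plots raw counts, so the colour scale is dominated by the largest class. A small sketch of one option is to row-normalise the matrix with `confusion_matrix(..., normalize="true")`, which turns each row into per-class recall. The label arrays below are made-up toy values, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for four classes (0=Low, 1=Medium, 2=High, 3=Very_High); illustrative only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 2, 1, 3])
labels = [0, 1, 2, 3]

# Raw counts: rows are actual classes, columns are predicted classes
cm_counts = confusion_matrix(y_true, y_pred, labels=labels)

# normalize="true" divides each row by its total, so the diagonal is per-class recall
cm_recall = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(cm_counts)
print(np.round(cm_recall, 2))
```

The normalised array could be passed as `z` to `go.Heatmap` in place of the count matrix, with the counts kept as the cell text.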


In [9]:
# 5. BUSINESS INSIGHTS AND ROI ANALYSIS
print(" 5. BUSINESS INSIGHTS AND ROI ANALYSIS")
print("="*40)

# Financial risk assessment business impact
print("Financial Risk Assessment System ROI:")
loan_portfolio_value = 500_000_000 # $500M loan portfolio
risk_assessment_accuracy = optimized_accuracy

# Default rates by risk level
default_rates = {
    'Low': 0.01,       # 1% default rate
    'Medium': 0.05,    # 5% default rate
    'High': 0.15,      # 15% default rate
    'Very_High': 0.35  # 35% default rate
}

# Calculate risk-adjusted portfolio value
portfolio_distribution = financial_df['risk_level'].value_counts(normalize=True).to_dict()
expected_losses_baseline = 0
expected_losses_improved = 0

for risk_level, proportion in portfolio_distribution.items():
    portfolio_segment = loan_portfolio_value * proportion
    default_rate = default_rates[risk_level]

    # Baseline (without ML): assume average default rate for all
    avg_default_rate = 0.08 # 8% average default rate
    baseline_loss = portfolio_segment * avg_default_rate
    expected_losses_baseline += baseline_loss

    # Improved (with ML): accurate risk assessment
    # Assume we can reduce defaults by adjusting terms/rates based on risk
    risk_adjustment_factor = 0.7 if risk_assessment_accuracy > 0.8 else 0.85
    improved_loss = portfolio_segment * default_rate * risk_adjustment_factor
    expected_losses_improved += improved_loss

    print(f"• {risk_level}: ${portfolio_segment:,.0f} portfolio, "
          f"{default_rate*100:.0f}% default rate")

loss_reduction = expected_losses_baseline - expected_losses_improved
system_cost = 400_000 # Annual system cost
net_benefit = loss_reduction - system_cost
roi = net_benefit / system_cost

print(f"\nFinancial Risk System Impact:")
print(f"• Baseline expected losses: ${expected_losses_baseline:,.0f}")
print(f"• Improved expected losses: ${expected_losses_improved:,.0f}")
print(f"• Annual loss reduction: ${loss_reduction:,.0f}")
print(f"• System cost: ${system_cost:,.0f}")
print(f"• Net annual benefit: ${net_benefit:,.0f}")
print(f"• ROI: {roi*100:.0f}%")

# E-commerce customer segmentation ROI
print(f"\nE-commerce Customer Segmentation ROI:")
total_customers_ecom = 1_000_000 # 1M customers
segmentation_accuracy_ecom = ecom_accuracy

# Revenue per customer by type
customer_revenues = {
    'Casual': 150,   # Annual revenue per casual customer
    'Regular': 450,  # Annual revenue per regular customer
    'Premium': 1200  # Annual revenue per premium customer
}

# Marketing efficiency improvements
marketing_improvements = {
    'Casual': 0.15,   # 15% improvement in conversion
    'Regular': 0.25,  # 25% improvement
    'Premium': 0.40   # 40% improvement for premium targeting
}

ecom_distribution = ecommerce_df['customer_type'].value_counts(normalize=True).to_dict()
total_revenue_improvement_ecom = 0

for customer_type, proportion in ecom_distribution.items():
    customers_in_type = total_customers_ecom * proportion
    correctly_segmented = customers_in_type * segmentation_accuracy_ecom

    base_revenue = customer_revenues[customer_type]
    improvement_factor = marketing_improvements[customer_type]
    additional_revenue = correctly_segmented * base_revenue * improvement_factor

    total_revenue_improvement_ecom += additional_revenue

    print(f"• {customer_type}: {customers_in_type:,.0f} customers, "
          f"${additional_revenue:,.0f} additional revenue")

ecom_system_cost = 200_000 # Annual system cost
ecom_net_benefit = total_revenue_improvement_ecom - ecom_system_cost
ecom_roi = ecom_net_benefit / ecom_system_cost

print(f"\nE-commerce System Impact:")
print(f"• Total revenue improvement: ${total_revenue_improvement_ecom:,.0f}")
print(f"• System cost: ${ecom_system_cost:,.0f}")
print(f"• Net annual benefit: ${ecom_net_benefit:,.0f}")
print(f"• ROI: {ecom_roi*100:.0f}%")

# Combined systems ROI
total_investment = system_cost + ecom_system_cost
total_benefits = net_benefit + ecom_net_benefit
combined_roi = total_benefits / total_investment

print(f"\nCombined Gradient Boosting Systems ROI:")
print(f"• Total investment: ${total_investment:,.0f}")
print(f"• Total annual benefits: ${total_benefits:,.0f}")
print(f"• Combined ROI: {combined_roi*100:.0f}%")
print(f"• Payback period: {total_investment/total_benefits*12:.1f} months")

# Implementation guidelines
print(f"\nGradient Boosting Implementation Guidelines:")
print(f"• Start with learning_rate=0.1, adjust based on dataset size")
print(f"• Use max_depth=6 for complex problems, 3-4 for simple ones")
print(f"• Monitor train/validation curves to detect overfitting")
print(f"• Implement early stopping for optimal n_estimators")
print(f"• Consider XGBoost/LightGBM for large datasets")
print(f"• Use feature importance for interpretability and feature selection")

print(f"\nCross-Reference Learning Path:")
print(f"• Foundation: Tier2_DecisionTree.ipynb (tree fundamentals)")
print(f"• Building On: Tier2_GradientBoosting.ipynb (basic boosting)")
print(f"• Comparison: Tier5_RandomForest.ipynb (bagging vs boosting)")
print(f"• Advanced: Advanced_EnsembleClassification.ipynb, Advanced_HyperparameterTuning.ipynb")

 5. BUSINESS INSIGHTS AND ROI ANALYSIS
Financial Risk Assessment System ROI:
• Low: $199,900,000 portfolio, 1% default rate
• Medium: $148,400,000 portfolio, 5% default rate
• High: $101,900,000 portfolio, 15% default rate
• Very_High: $49,800,000 portfolio, 35% default rate

Financial Risk System Impact:
• Baseline expected losses: $40,000,000
• Improved expected losses: $29,493,800
• Annual loss reduction: $10,506,200
• System cost: $400,000
• Net annual benefit: $10,106,200
• ROI: 2527%

E-commerce Customer Segmentation ROI:
• Casual: 507,000 customers, $10,708,791 additional revenue
• Regular: 293,000 customers, $30,943,547 additional revenue
• Premium: 200,000 customers, $90,120,000 additional revenue

E-commerce System Impact:
• Total revenue improvement: $131,772,338
• System cost: $200,000
• Net annual benefit: $131,572,338
• ROI: 65786%

Combined Gradient Boosting Systems ROI:
• Total investment: $600,000
• Total annual benefits: $141,678,538
• Combined ROI: 23613%
• Payback p
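
The financial loss figures printed above can be re-derived from the class counts reported earlier in the notebook and the constants in this cell. The sketch below repeats that arithmetic outside the loop; `unadjusted_loss` and `implied_rate` are names introduced here for illustration.

```python
# Class counts from the printed risk distribution (5,000 rows in total)
counts = {"Low": 1999, "Medium": 1484, "High": 1019, "Very_High": 498}
default_rates = {"Low": 0.01, "Medium": 0.05, "High": 0.15, "Very_High": 0.35}

portfolio = 500_000_000   # loan_portfolio_value in the cell above
baseline_rate = 0.08      # avg_default_rate assumed for the no-ML baseline
adjustment = 0.7          # risk_adjustment_factor applied when accuracy > 0.8

total = sum(counts.values())

# Expected loss if every segment defaults at its own class rate
unadjusted_loss = sum(
    portfolio * n / total * default_rates[level] for level, n in counts.items()
)
implied_rate = unadjusted_loss / portfolio

baseline_loss = portfolio * baseline_rate
improved_loss = unadjusted_loss * adjustment
loss_reduction = baseline_loss - improved_loss

print(f"Class-weighted default rate: {implied_rate:.4%}")
print(f"Baseline expected losses:    ${baseline_loss:,.0f}")
print(f"Improved expected losses:    ${improved_loss:,.0f}")
print(f"Annual loss reduction:       ${loss_reduction:,.0f}")
```

The class-weighted default rate implied by these inputs is slightly above the 8% baseline, so the whole loss reduction comes from the assumed 0.7 adjustment factor. Model accuracy enters only through the `> 0.8` threshold, which is worth keeping in mind when reading the ROI percentages.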