# Tier 2: Decision Tree Analysis

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 7c6dbde2-98b5-42f9-b57e-8a36653ff77c

---

## Citation
Brandon Deloatch, "Tier 2: Decision Tree Analysis," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+, seaborn (imported below; no minimum version stated)
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+
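
The minimum versions listed above can be checked against the active environment before running the notebook. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are illustrative and not part of the notebook, and the simple numeric comparison is an assumption that ignores pre-release ordering.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUM_VERSIONS = {
    "pandas": "2.0",
    "numpy": "1.24",
    "scikit-learn": "1.3",
    "plotly": "5.0",
    "matplotlib": "3.7",
    "scipy": "1.10",
    "statsmodels": "0.14",
}


def version_tuple(version):
    """Turn '1.24.3' into (1, 24, 3); stop at the first piece with no leading digits."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter one with zeros."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (installed, meets_minimum(installed, minimum))
    return report


for package, (installed, ok) in check_environment().items():
    status = "ok" if ok else "CHECK"
    print(f"{package:<14} installed={installed} minimum={MINIMUM_VERSIONS[package]} [{status}]")
```

A package reported as `CHECK` is either missing or older than the stated minimum.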

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 7c6dbde2-98b5-42f9-b57e-8a36653ff77c
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)
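
The provenance fields above can be recorded at run time rather than typed by hand. This is a small sketch under stated assumptions: `capture_execution_metadata` is a hypothetical helper, and the set of fields it records is one reasonable choice, not a format the notebook prescribes.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def capture_execution_metadata(notebook_id, random_seed=None):
    """Collect basic facts about the current run as a JSON-serializable dict."""
    return {
        "notebook_id": notebook_id,
        "executed_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "random_seed": random_seed,
    }


run_info = capture_execution_metadata(
    notebook_id="7c6dbde2-98b5-42f9-b57e-8a36653ff77c",
    random_seed=42,  # the seed the data-generation cell passes to np.random.seed
)
print(json.dumps(run_info, indent=2))
```

Writing the resulting JSON next to the notebook outputs gives each run a record of when and where it was executed.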

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Scikit-learn imports
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, roc_curve, auc, precision_recall_curve
from sklearn import tree
from sklearn.tree import export_text

# Additional imports
import scipy.stats as stats
from sklearn.datasets import make_classification
import warnings
warnings.filterwarnings('ignore')

print(" Tier 2: Decision Tree Analysis - Libraries Loaded Successfully!")
print("=" * 70)
print("Available Decision Tree Techniques:")
print("• Decision Tree Regression - Continuous target prediction")
print("• Decision Tree Classification - Categorical target prediction")
print("• Tree Pruning - Overfitting prevention strategies")
print("• Feature Importance - Variable significance ranking")
print("• Tree Visualization - Structure and decision path analysis")
print("• Hyperparameter Tuning - Optimal tree complexity")

In [None]:
# Generate Comprehensive Datasets for Decision Tree Analysis
np.random.seed(42)

def generate_decision_tree_datasets():
 """Generate both regression and classification datasets for tree analysis"""

 # 1. REGRESSION DATASET - Business Revenue Prediction
 n_samples = 1000

 # Create hierarchical business factors
 company_size = np.random.choice(['Small', 'Medium', 'Large'], n_samples, p=[0.4, 0.4, 0.2])
 industry = np.random.choice(['Tech', 'Retail', 'Manufacturing', 'Services'], n_samples, p=[0.3, 0.25, 0.25, 0.2])
 region = np.random.choice(['North', 'South', 'East', 'West'], n_samples, p=[0.3, 0.25, 0.25, 0.2])

 # Encode categorical variables
 le_size = LabelEncoder()
 le_industry = LabelEncoder()
 le_region = LabelEncoder()

 size_encoded = le_size.fit_transform(company_size)
 industry_encoded = le_industry.fit_transform(industry)
 region_encoded = le_region.fit_transform(region)

 # Continuous features
 marketing_spend = np.random.exponential(scale=5000, size=n_samples) + 1000
 employee_count = np.random.poisson(lam=50, size=n_samples) + 5
 years_in_business = np.random.gamma(shape=2, scale=3, size=n_samples) + 1
 customer_satisfaction = np.random.beta(a=2, b=0.5, size=n_samples) * 10

 # Create hierarchical decision rules for revenue
 revenue = np.zeros(n_samples)

 for i in range(n_samples):
     base_revenue = 50000

     # Company size effect
     if company_size[i] == 'Large':
         base_revenue *= 3
     elif company_size[i] == 'Medium':
         base_revenue *= 1.5

     # Industry effect
     if industry[i] == 'Tech':
         base_revenue *= 2
     elif industry[i] == 'Manufacturing':
         base_revenue *= 1.2

     # Marketing spend effect (non-linear)
     if marketing_spend[i] > 8000:
         base_revenue *= 1.5
     elif marketing_spend[i] > 4000:
         base_revenue *= 1.2

     # Employee productivity
     revenue_per_employee = base_revenue / max(employee_count[i], 1)
     if revenue_per_employee > 2000:
         base_revenue *= 1.3

     # Customer satisfaction threshold
     if customer_satisfaction[i] > 8:
         base_revenue *= 1.4
     elif customer_satisfaction[i] < 5:
         base_revenue *= 0.7

     # Multiplicative noise: standard deviation is 10% of the rule-based revenue
     revenue[i] = base_revenue + np.random.normal(0, base_revenue * 0.1)

 # Create regression DataFrame
 regression_df = pd.DataFrame({
 'company_size': company_size,
 'industry': industry,
 'region': region,
 'marketing_spend': marketing_spend,
 'employee_count': employee_count,
 'years_in_business': years_in_business,
 'customer_satisfaction': customer_satisfaction,
 'revenue': revenue
 })

 # 2. CLASSIFICATION DATASET - Customer Churn Prediction
 # Generate features for classification
 monthly_charges = np.random.gamma(shape=2, scale=30, size=n_samples) + 20
 tenure_months = np.random.exponential(scale=24, size=n_samples) + 1
 total_charges = monthly_charges * tenure_months + np.random.normal(0, 100, n_samples)
 support_calls = np.random.poisson(lam=2, size=n_samples)

 contract_type = np.random.choice(['Month-to-month', 'One year', 'Two year'],
 n_samples, p=[0.5, 0.3, 0.2])
 payment_method = np.random.choice(['Credit card', 'Bank transfer', 'Electronic check', 'Mailed check'],
 n_samples, p=[0.3, 0.25, 0.25, 0.2])

 # Create hierarchical churn rules
 churn_probability = np.zeros(n_samples)

 for i in range(n_samples):
     prob = 0.1  # Base churn rate

     # Tenure effect
     if tenure_months[i] < 6:
         prob += 0.4
     elif tenure_months[i] < 12:
         prob += 0.2

     # Contract type effect
     if contract_type[i] == 'Month-to-month':
         prob += 0.3
     elif contract_type[i] == 'One year':
         prob += 0.1

     # Support calls effect
     if support_calls[i] > 3:
         prob += 0.3
     elif support_calls[i] > 1:
         prob += 0.1

     # Monthly charges effect
     if monthly_charges[i] > 80:
         prob += 0.2
     elif monthly_charges[i] < 30:
         prob += 0.1

     # Cap the probability so even the riskiest customers sometimes stay
     churn_probability[i] = min(prob, 0.9)

 # Generate churn labels
 churn = np.random.binomial(1, churn_probability, n_samples)

 # Create classification DataFrame
 classification_df = pd.DataFrame({
 'monthly_charges': monthly_charges,
 'tenure_months': tenure_months,
 'total_charges': total_charges,
 'support_calls': support_calls,
 'contract_type': contract_type,
 'payment_method': payment_method,
 'customer_satisfaction': customer_satisfaction[:n_samples],
 'churn': churn
 })

 return regression_df, classification_df

# Generate datasets
print(" Generating decision tree datasets...")
regression_df, classification_df = generate_decision_tree_datasets()

print(f"Regression Dataset Shape: {regression_df.shape}")
print(f"Classification Dataset Shape: {classification_df.shape}")

print("\nRegression Dataset (Revenue Prediction):")
print(regression_df.head())
print("\nRegression Target Statistics:")
print(regression_df['revenue'].describe())

print("\nClassification Dataset (Churn Prediction):")
print(classification_df.head())
print("\nChurn Distribution:")
print(classification_df['churn'].value_counts(normalize=True))
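
The additive churn rules in the generator above can be checked by hand for individual customers. The sketch below restates those rules as a standalone function; `churn_probability` and the three example customers are illustrative and not part of the notebook.

```python
def churn_probability(tenure_months, contract_type, support_calls, monthly_charges):
    """Apply the generator's additive churn rules to a single customer."""
    prob = 0.1  # base churn rate

    # Short tenure raises risk
    if tenure_months < 6:
        prob += 0.4
    elif tenure_months < 12:
        prob += 0.2

    # Flexible contracts raise risk
    if contract_type == "Month-to-month":
        prob += 0.3
    elif contract_type == "One year":
        prob += 0.1

    # Frequent support calls raise risk
    if support_calls > 3:
        prob += 0.3
    elif support_calls > 1:
        prob += 0.1

    # Very high or very low charges raise risk
    if monthly_charges > 80:
        prob += 0.2
    elif monthly_charges < 30:
        prob += 0.1

    return min(prob, 0.9)  # same cap as the generator


# Hypothetical customers: (tenure, contract, support calls, monthly charges)
examples = {
    "new, month-to-month, many calls": (3, "Month-to-month", 5, 95.0),
    "mid-tenure, one-year contract": (8, "One year", 2, 60.0),
    "long-tenure, two-year contract": (36, "Two year", 0, 50.0),
}
for label, customer in examples.items():
    print(f"{label}: {churn_probability(*customer):.2f}")
```

The first customer triggers every top-tier rule, so the uncapped sum (0.1 + 0.4 + 0.3 + 0.3 + 0.2 = 1.3) is clipped to 0.9; the last triggers none and stays at the 0.1 base rate. These thresholds (6 and 12 months, 1 and 3 calls, 30 and 80 in charges) are the splits a well-fitted tree would be expected to approximate.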

In [None]:
# 1. DECISION TREE REGRESSION ANALYSIS
print(" 1. DECISION TREE REGRESSION ANALYSIS")
print("=" * 40)

# Prepare regression data
# Encode categorical variables for regression
regression_df_encoded = regression_df.copy()
categorical_cols = ['company_size', 'industry', 'region']

for col in categorical_cols:
 le = LabelEncoder()
 regression_df_encoded[col + '_encoded'] = le.fit_transform(regression_df_encoded[col])

# Features and target
reg_features = ['company_size_encoded', 'industry_encoded', 'region_encoded',
 'marketing_spend', 'employee_count', 'years_in_business', 'customer_satisfaction']
X_reg = regression_df_encoded[reg_features]
y_reg = regression_df_encoded['revenue']

# Split data
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
 X_reg, y_reg, test_size=0.2, random_state=42
)

# Fit decision tree regressor with default parameters
dt_reg_default = DecisionTreeRegressor(random_state=42)
dt_reg_default.fit(X_reg_train, y_reg_train)

# Predictions
y_reg_train_pred = dt_reg_default.predict(X_reg_train)
y_reg_test_pred = dt_reg_default.predict(X_reg_test)

# Calculate metrics
train_mse_reg = mean_squared_error(y_reg_train, y_reg_train_pred)
test_mse_reg = mean_squared_error(y_reg_test, y_reg_test_pred)
train_r2_reg = r2_score(y_reg_train, y_reg_train_pred)
test_r2_reg = r2_score(y_reg_test, y_reg_test_pred)

print(" Decision Tree Regression Performance (Default):")
print(f"• Training MSE: {train_mse_reg:,.0f}")
print(f"• Test MSE: {test_mse_reg:,.0f}")
print(f"• Training R²: {train_r2_reg:.4f}")
print(f"• Test R²: {test_r2_reg:.4f}")
print(f"• Overfitting indicator: {train_r2_reg - test_r2_reg:.4f}")

# Tree structure analysis
print(f"\n Tree Structure Analysis:")
print(f"• Tree depth: {dt_reg_default.get_depth()}")
print(f"• Number of leaves: {dt_reg_default.get_n_leaves()}")
print(f"• Total nodes: {dt_reg_default.tree_.node_count}")

# Feature importance analysis
feature_importance_reg = pd.DataFrame({
 'feature': reg_features,
 'importance': dt_reg_default.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\n Feature Importance (Regression):")
for _, row in feature_importance_reg.iterrows():
 print(f"• {row['feature']}: {row['importance']:.4f}")

# Visualize feature importance
fig_importance_reg = go.Figure()

fig_importance_reg.add_trace(
 go.Bar(
 x=feature_importance_reg['feature'],
 y=feature_importance_reg['importance'],
 marker_color='green',
 text=feature_importance_reg['importance'].round(3),
 textposition='auto',
 hovertemplate="<b>%{x}</b><br>Importance: %{y:.4f}<extra></extra>"
 )
)

fig_importance_reg.update_layout(
 title="Decision Tree Regression: Feature Importance",
 xaxis_title="Features",
 yaxis_title="Importance Score",
 height=400,
 xaxis_tickangle=-45
)
fig_importance_reg.show()

# Residual analysis
residuals_reg_train = y_reg_train - y_reg_train_pred
residuals_reg_test = y_reg_test - y_reg_test_pred

fig_residuals_reg = make_subplots(
 rows=1, cols=2,
 subplot_titles=("Training Residuals", "Test Residuals")
)

# Training residuals
fig_residuals_reg.add_trace(
 go.Scatter(
 x=y_reg_train_pred,
 y=residuals_reg_train,
 mode='markers',
 marker=dict(color='blue', opacity=0.6),
 name='Training',
 hovertemplate="Predicted: %{x:,.0f}<br>Residual: %{y:,.0f}<extra></extra>"
 ),
 row=1, col=1
)

# Test residuals
fig_residuals_reg.add_trace(
 go.Scatter(
 x=y_reg_test_pred,
 y=residuals_reg_test,
 mode='markers',
 marker=dict(color='red', opacity=0.6),
 name='Test',
 hovertemplate="Predicted: %{x:,.0f}<br>Residual: %{y:,.0f}<extra></extra>"
 ),
 row=1, col=2
)

# Add zero lines
for col in [1, 2]:
 fig_residuals_reg.add_hline(y=0, line_dash="dash", line_color="black", row=1, col=col)

fig_residuals_reg.update_layout(
 title="Decision Tree Regression: Residual Analysis",
 height=400
)
fig_residuals_reg.show()

# Actual vs Predicted
fig_pred_reg = go.Figure()

# Training predictions
fig_pred_reg.add_trace(
 go.Scatter(
 x=y_reg_train,
 y=y_reg_train_pred,
 mode='markers',
 marker=dict(color='blue', opacity=0.6),
 name='Training',
 hovertemplate="Actual: %{x:,.0f}<br>Predicted: %{y:,.0f}<extra></extra>"
 )
)

# Test predictions
fig_pred_reg.add_trace(
 go.Scatter(
 x=y_reg_test,
 y=y_reg_test_pred,
 mode='markers',
 marker=dict(color='red', opacity=0.6),
 name='Test',
 hovertemplate="Actual: %{x:,.0f}<br>Predicted: %{y:,.0f}<extra></extra>"
 )
)

# Perfect prediction line
min_val = min(y_reg.min(), dt_reg_default.predict(X_reg).min())
max_val = max(y_reg.max(), dt_reg_default.predict(X_reg).max())

fig_pred_reg.add_trace(
 go.Scatter(
 x=[min_val, max_val],
 y=[min_val, max_val],
 mode='lines',
 line=dict(color='black', dash='dash'),
 name='Perfect Prediction',
 showlegend=True
 )
)

fig_pred_reg.update_layout(
 title="Decision Tree Regression: Actual vs Predicted",
 xaxis_title="Actual Revenue",
 yaxis_title="Predicted Revenue",
 height=500
)
fig_pred_reg.show()

if train_r2_reg > 0.95 and test_r2_reg < 0.8:
 print(" HIGH OVERFITTING DETECTED - Tree pruning recommended!")
elif train_r2_reg - test_r2_reg > 0.2:
 print(" Moderate overfitting - consider reducing tree complexity")
else:
 print(" Reasonable model performance")
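
A regression tree is grown greedily: at each node it picks the feature and threshold whose split most reduces squared error (`squared_error` is the default criterion of `DecisionTreeRegressor`). The sketch below shows that search for a single feature in plain Python. It is a simplified illustration, not scikit-learn's implementation, and the spend and revenue figures are made-up values.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after_split) for the best single split on x.

    Candidate thresholds are midpoints between consecutive distinct x values.
    If no split improves on the unsplit node, threshold is None.
    """
    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, sum_squared_error(list(y))
    for i in range(1, len(pairs)):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # cannot separate identical feature values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        sse = sum_squared_error(left) + sum_squared_error(right)
        if sse < best_sse:
            best_threshold, best_sse = (left_x + right_x) / 2, sse
    return best_threshold, best_sse


# Hypothetical data with a jump in revenue at high marketing spend
marketing_spend = [2000, 3000, 5000, 6000, 9000, 10000]
revenue = [50000, 52000, 60000, 61000, 90000, 92000]

threshold, sse_after = best_split(marketing_spend, revenue)
print(f"Best threshold: {threshold}")
print(f"SSE before split: {sum_squared_error(revenue):,.0f}")
print(f"SSE after split:  {sse_after:,.0f}")
```

With no depth or leaf-size limit, this search repeats until every leaf is pure or holds a single sample, which is why the default tree above can fit the training set almost perfectly while generalizing worse.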

In [None]:
# 2. DECISION TREE CLASSIFICATION ANALYSIS
print("\n 2. DECISION TREE CLASSIFICATION ANALYSIS")
print("=" * 45)

# Prepare classification data
classification_df_encoded = classification_df.copy()
categorical_cols_clf = ['contract_type', 'payment_method']

for col in categorical_cols_clf:
 le = LabelEncoder()
 classification_df_encoded[col + '_encoded'] = le.fit_transform(classification_df_encoded[col])

# Features and target
clf_features = ['monthly_charges', 'tenure_months', 'total_charges', 'support_calls',
 'contract_type_encoded', 'payment_method_encoded', 'customer_satisfaction']
X_clf = classification_df_encoded[clf_features]
y_clf = classification_df_encoded['churn']

# Split data
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(
 X_clf, y_clf, test_size=0.2, random_state=42, stratify=y_clf
)

# Fit decision tree classifier with default parameters
dt_clf_default = DecisionTreeClassifier(random_state=42)
dt_clf_default.fit(X_clf_train, y_clf_train)

# Predictions
y_clf_train_pred = dt_clf_default.predict(X_clf_train)
y_clf_test_pred = dt_clf_default.predict(X_clf_test)
y_clf_test_proba = dt_clf_default.predict_proba(X_clf_test)[:, 1]

# Calculate metrics
train_acc_clf = accuracy_score(y_clf_train, y_clf_train_pred)
test_acc_clf = accuracy_score(y_clf_test, y_clf_test_pred)

print(" Decision Tree Classification Performance (Default):")
print(f"• Training Accuracy: {train_acc_clf:.4f}")
print(f"• Test Accuracy: {test_acc_clf:.4f}")
print(f"• Overfitting indicator: {train_acc_clf - test_acc_clf:.4f}")

# Detailed classification report
print(f"\n Detailed Classification Report:")
print(classification_report(y_clf_test, y_clf_test_pred, target_names=['No Churn', 'Churn']))

# Tree structure analysis
print(f"\n Tree Structure Analysis (Classification):")
print(f"• Tree depth: {dt_clf_default.get_depth()}")
print(f"• Number of leaves: {dt_clf_default.get_n_leaves()}")
print(f"• Total nodes: {dt_clf_default.tree_.node_count}")

# Feature importance analysis
feature_importance_clf = pd.DataFrame({
 'feature': clf_features,
 'importance': dt_clf_default.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\n Feature Importance (Classification):")
for _, row in feature_importance_clf.iterrows():
 print(f"• {row['feature']}: {row['importance']:.4f}")

# Visualize feature importance
fig_importance_clf = go.Figure()

fig_importance_clf.add_trace(
 go.Bar(
 x=feature_importance_clf['feature'],
 y=feature_importance_clf['importance'],
 marker_color='red',
 text=feature_importance_clf['importance'].round(3),
 textposition='auto',
 hovertemplate="<b>%{x}</b><br>Importance: %{y:.4f}<extra></extra>"
 )
)

fig_importance_clf.update_layout(
 title="Decision Tree Classification: Feature Importance",
 xaxis_title="Features",
 yaxis_title="Importance Score",
 height=400,
 xaxis_tickangle=-45
)
fig_importance_clf.show()

# Confusion Matrix
cm = confusion_matrix(y_clf_test, y_clf_test_pred)
cm_normalized = confusion_matrix(y_clf_test, y_clf_test_pred, normalize='true')

fig_cm = make_subplots(
 rows=1, cols=2,
 subplot_titles=("Confusion Matrix (Counts)", "Confusion Matrix (Normalized)")
)

# Counts confusion matrix
fig_cm.add_trace(
 go.Heatmap(
 z=cm,
 x=['No Churn', 'Churn'],
 y=['No Churn', 'Churn'],
 colorscale='Blues',
 text=cm,
 texttemplate="%{text}",
 textfont={"size": 16},
 hoverongaps=False
 ),
 row=1, col=1
)

# Normalized confusion matrix
fig_cm.add_trace(
 go.Heatmap(
 z=cm_normalized,
 x=['No Churn', 'Churn'],
 y=['No Churn', 'Churn'],
 colorscale='Reds',
 text=cm_normalized.round(3),
 texttemplate="%{text}",
 textfont={"size": 16},
 hoverongaps=False
 ),
 row=1, col=2
)

fig_cm.update_layout(
 title="Decision Tree Classification: Confusion Matrix Analysis",
 height=400
)
fig_cm.show()

# ROC Curve Analysis
fpr, tpr, thresholds = roc_curve(y_clf_test, y_clf_test_proba)
roc_auc = auc(fpr, tpr)

# Precision-Recall Curve
precision, recall, pr_thresholds = precision_recall_curve(y_clf_test, y_clf_test_proba)
pr_auc = auc(recall, precision)

fig_curves = make_subplots(
 rows=1, cols=2,
 subplot_titles=("ROC Curve", "Precision-Recall Curve")
)

# ROC Curve
fig_curves.add_trace(
 go.Scatter(
 x=fpr,
 y=tpr,
 mode='lines',
 name=f'ROC (AUC = {roc_auc:.3f})',
 line=dict(color='blue', width=3),
 hovertemplate="FPR: %{x:.3f}<br>TPR: %{y:.3f}<extra></extra>"
 ),
 row=1, col=1
)

# Diagonal line for ROC
fig_curves.add_trace(
 go.Scatter(
 x=[0, 1],
 y=[0, 1],
 mode='lines',
 line=dict(color='red', dash='dash'),
 name='Random',
 showlegend=False
 ),
 row=1, col=1
)

# Precision-Recall Curve
fig_curves.add_trace(
 go.Scatter(
 x=recall,
 y=precision,
 mode='lines',
 name=f'PR (AUC = {pr_auc:.3f})',
 line=dict(color='green', width=3),
 hovertemplate="Recall: %{x:.3f}<br>Precision: %{y:.3f}<extra></extra>"
 ),
 row=1, col=2
)

fig_curves.update_layout(
 title="Decision Tree Classification: Performance Curves",
 height=400
)
fig_curves.update_xaxes(title_text="False Positive Rate", row=1, col=1)
fig_curves.update_yaxes(title_text="True Positive Rate", row=1, col=1)
fig_curves.update_xaxes(title_text="Recall", row=1, col=2)
fig_curves.update_yaxes(title_text="Precision", row=1, col=2)

fig_curves.show()

print(f"\n Performance Metrics Summary:")
print(f"• ROC AUC: {roc_auc:.4f}")
print(f"• Precision-Recall AUC: {pr_auc:.4f}")
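# Precision and recall for the churn class at the default 0.5 decision threshold
from sklearn.metrics import precision_score, recall_score
print(f"• Precision (Churn): {precision_score(y_clf_test, y_clf_test_pred):.4f}")
print(f"• Recall (Churn): {recall_score(y_clf_test, y_clf_test_pred):.4f}")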

if train_acc_clf > 0.95 and test_acc_clf < 0.8:
 print(" HIGH OVERFITTING DETECTED - Tree pruning recommended!")
elif train_acc_clf - test_acc_clf > 0.15:
 print(" Moderate overfitting - consider reducing tree complexity")
else:
 print(" Reasonable classification performance")
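
Precision and recall for the churn class follow directly from the four cells of the confusion matrix. The sketch below assumes the layout returned by scikit-learn's `confusion_matrix` for labels 0 and 1 (rows are true classes, columns are predicted classes); the helper name and the counts are hypothetical and are not results from this notebook.

```python
def precision_recall_from_confusion(cm):
    """Compute (precision, recall) for the positive class.

    cm is [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions.
    """
    (tn, fp), (fn, tp) = cm
    predicted_positive = tp + fp
    actual_positive = tp + fn
    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / actual_positive if actual_positive else 0.0
    return precision, recall


# Hypothetical counts for a 200-customer test set
example_cm = [[90, 20],   # true No Churn: 90 correct, 20 flagged as churn
              [30, 60]]   # true Churn:    30 missed,  60 caught

precision_churn, recall_churn = precision_recall_from_confusion(example_cm)
print(f"Precision (Churn): {precision_churn:.3f}")
print(f"Recall (Churn):    {recall_churn:.3f}")
```

Precision is the share of flagged customers who actually churn, and recall is the share of churners the model catches. A retention campaign with a limited budget usually cares more about the first, while a campaign that cannot afford to miss churners cares more about the second.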

In [None]:
# 3. TREE PRUNING AND HYPERPARAMETER OPTIMIZATION
print("\n 3. TREE PRUNING AND HYPERPARAMETER OPTIMIZATION")
print("=" * 52)

# 3.1 Regression Tree Optimization
print("3.1 Regression Tree Optimization:")

# Define parameter grid for regression
reg_param_grid = {
 'max_depth': [3, 5, 7, 10, 15, None],
 'min_samples_split': [2, 5, 10, 20],
 'min_samples_leaf': [1, 2, 5, 10],
 'max_features': ['sqrt', 'log2', None]
}

# Grid search for regression
reg_grid_search = GridSearchCV(
 DecisionTreeRegressor(random_state=42),
 reg_param_grid,
 cv=5,
 scoring='r2',
 n_jobs=-1
)

reg_grid_search.fit(X_reg_train, y_reg_train)

print(f" Best Regression Tree Parameters:")
for param, value in reg_grid_search.best_params_.items():
 print(f"• {param}: {value}")
print(f"• Best CV R²: {reg_grid_search.best_score_:.4f}")

# Fit optimized regression tree
dt_reg_optimized = reg_grid_search.best_estimator_
y_reg_test_pred_opt = dt_reg_optimized.predict(X_reg_test)
test_r2_reg_opt = r2_score(y_reg_test, y_reg_test_pred_opt)

print(f"• Test R² (optimized): {test_r2_reg_opt:.4f}")
print(f"• Improvement: {test_r2_reg_opt - test_r2_reg:.4f}")

# 3.2 Classification Tree Optimization
print(f"\n3.2 Classification Tree Optimization:")

# Define parameter grid for classification
clf_param_grid = {
 'max_depth': [3, 5, 7, 10, 15, None],
 'min_samples_split': [2, 5, 10, 20],
 'min_samples_leaf': [1, 2, 5, 10],
 'criterion': ['gini', 'entropy'],
 'max_features': ['sqrt', 'log2', None]
}

# Grid search for classification
clf_grid_search = GridSearchCV(
 DecisionTreeClassifier(random_state=42),
 clf_param_grid,
 cv=5,
 scoring='roc_auc',
 n_jobs=-1
)

clf_grid_search.fit(X_clf_train, y_clf_train)

print(f" Best Classification Tree Parameters:")
for param, value in clf_grid_search.best_params_.items():
 print(f"• {param}: {value}")
print(f"• Best CV AUC: {clf_grid_search.best_score_:.4f}")

# Fit optimized classification tree
dt_clf_optimized = clf_grid_search.best_estimator_
y_clf_test_pred_opt = dt_clf_optimized.predict(X_clf_test)
y_clf_test_proba_opt = dt_clf_optimized.predict_proba(X_clf_test)[:, 1]

test_acc_clf_opt = accuracy_score(y_clf_test, y_clf_test_pred_opt)
fpr_opt, tpr_opt, _ = roc_curve(y_clf_test, y_clf_test_proba_opt)
roc_auc_opt = auc(fpr_opt, tpr_opt)

print(f"• Test Accuracy (optimized): {test_acc_clf_opt:.4f}")
print(f"• Test AUC (optimized): {roc_auc_opt:.4f}")
print(f"• Accuracy improvement: {test_acc_clf_opt - test_acc_clf:.4f}")
print(f"• AUC improvement: {roc_auc_opt - roc_auc:.4f}")

# 3.3 Complexity Analysis
print(f"\n3.3 Tree Complexity Comparison:")

# Compare tree structures
trees_comparison = pd.DataFrame({
 'Model': ['Regression (Default)', 'Regression (Optimized)',
 'Classification (Default)', 'Classification (Optimized)'],
 'Max_Depth': [dt_reg_default.get_depth(), dt_reg_optimized.get_depth(),
 dt_clf_default.get_depth(), dt_clf_optimized.get_depth()],
 'Num_Leaves': [dt_reg_default.get_n_leaves(), dt_reg_optimized.get_n_leaves(),
 dt_clf_default.get_n_leaves(), dt_clf_optimized.get_n_leaves()],
 'Total_Nodes': [dt_reg_default.tree_.node_count, dt_reg_optimized.tree_.node_count,
 dt_clf_default.tree_.node_count, dt_clf_optimized.tree_.node_count],
 'Performance': [test_r2_reg, test_r2_reg_opt, roc_auc, roc_auc_opt]
})

print("Tree Complexity Comparison:")
print(trees_comparison)

# Visualize complexity vs performance
fig_complexity = make_subplots(
 rows=1, cols=2,
 subplot_titles=("Tree Depth vs Performance", "Number of Leaves vs Performance")
)

# Depth vs Performance
fig_complexity.add_trace(
 go.Scatter(
 x=trees_comparison['Max_Depth'],
 y=trees_comparison['Performance'],
 mode='markers+text',
 text=trees_comparison['Model'],
 textposition="top center",
 marker=dict(size=12, color=['blue', 'darkblue', 'red', 'darkred']),
 name='Models',
 hovertemplate="<b>%{text}</b><br>Depth: %{x}<br>Performance: %{y:.3f}<extra></extra>"
 ),
 row=1, col=1
)

# Leaves vs Performance
fig_complexity.add_trace(
 go.Scatter(
 x=trees_comparison['Num_Leaves'],
 y=trees_comparison['Performance'],
 mode='markers+text',
 text=trees_comparison['Model'],
 textposition="top center",
 marker=dict(size=12, color=['blue', 'darkblue', 'red', 'darkred']),
 name='Models',
 showlegend=False,
 hovertemplate="<b>%{text}</b><br>Leaves: %{x}<br>Performance: %{y:.3f}<extra></extra>"
 ),
 row=1, col=2
)

fig_complexity.update_layout(
 title="Tree Complexity vs Performance Analysis",
 height=500
)
fig_complexity.update_xaxes(title_text="Tree Depth", row=1, col=1)
fig_complexity.update_yaxes(title_text="Performance Score", row=1, col=1)
fig_complexity.update_xaxes(title_text="Number of Leaves", row=1, col=2)
fig_complexity.update_yaxes(title_text="Performance Score", row=1, col=2)

fig_complexity.show()

# 3.4 Pruning Path Analysis
print(f"\n3.4 Pruning Path Analysis:")

# Cost complexity pruning for regression
path_reg = dt_reg_default.cost_complexity_pruning_path(X_reg_train, y_reg_train)
# Clip at zero: floating-point round-off can yield tiny negative alphas, which ccp_alpha rejects
ccp_alphas_reg = np.clip(path_reg.ccp_alphas, 0, None)
impurities_reg = path_reg.impurities

# Cost complexity pruning for classification
path_clf = dt_clf_default.cost_complexity_pruning_path(X_clf_train, y_clf_train)
ccp_alphas_clf = np.clip(path_clf.ccp_alphas, 0, None)
impurities_clf = path_clf.impurities

# Train trees with different alpha values (regression)
reg_scores_train = []
reg_scores_test = []

for ccp_alpha in ccp_alphas_reg:
 dt_reg_temp = DecisionTreeRegressor(random_state=42, ccp_alpha=ccp_alpha)
 dt_reg_temp.fit(X_reg_train, y_reg_train)
 reg_scores_train.append(dt_reg_temp.score(X_reg_train, y_reg_train))
 reg_scores_test.append(dt_reg_temp.score(X_reg_test, y_reg_test))

# Train trees with different alpha values (classification)
clf_scores_train = []
clf_scores_test = []

for ccp_alpha in ccp_alphas_clf:
 dt_clf_temp = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
 dt_clf_temp.fit(X_clf_train, y_clf_train)
 clf_scores_train.append(dt_clf_temp.score(X_clf_train, y_clf_train))
 clf_scores_test.append(dt_clf_temp.score(X_clf_test, y_clf_test))

# Plot pruning paths
fig_pruning = make_subplots(
 rows=1, cols=2,
 subplot_titles=("Regression Pruning Path", "Classification Pruning Path")
)

# Regression pruning
fig_pruning.add_trace(
 go.Scatter(
 x=ccp_alphas_reg,
 y=reg_scores_train,
 mode='lines+markers',
 name='Training',
 line=dict(color='blue'),
 hovertemplate="Alpha: %{x:.6f}<br>R²: %{y:.3f}<extra></extra>"
 ),
 row=1, col=1
)

fig_pruning.add_trace(
 go.Scatter(
 x=ccp_alphas_reg,
 y=reg_scores_test,
 mode='lines+markers',
 name='Test',
 line=dict(color='red'),
 hovertemplate="Alpha: %{x:.6f}<br>R²: %{y:.3f}<extra></extra>"
 ),
 row=1, col=1
)

# Classification pruning
fig_pruning.add_trace(
 go.Scatter(
 x=ccp_alphas_clf,
 y=clf_scores_train,
 mode='lines+markers',
 name='Training',
 line=dict(color='blue'),
 showlegend=False,
 hovertemplate="Alpha: %{x:.6f}<br>Accuracy: %{y:.3f}<extra></extra>"
 ),
 row=1, col=2
)

fig_pruning.add_trace(
 go.Scatter(
 x=ccp_alphas_clf,
 y=clf_scores_test,
 mode='lines+markers',
 name='Test',
 line=dict(color='red'),
 showlegend=False,
 hovertemplate="Alpha: %{x:.6f}<br>Accuracy: %{y:.3f}<extra></extra>"
 ),
 row=1, col=2
)

fig_pruning.update_layout(
 title="Cost Complexity Pruning Analysis",
 height=500
)
fig_pruning.update_xaxes(title_text="Alpha", type="log")
fig_pruning.update_yaxes(title_text="R² Score", row=1, col=1)
fig_pruning.update_yaxes(title_text="Accuracy", row=1, col=2)

fig_pruning.show()

# Find the alpha values that maximize the test score.
# Selecting alpha on the test set makes the reported scores optimistic;
# for an unbiased choice, select alpha by cross-validation on the training data.
optimal_alpha_reg = ccp_alphas_reg[np.argmax(reg_scores_test)]
optimal_alpha_clf = ccp_alphas_clf[np.argmax(clf_scores_test)]

print(f"• Best alpha on test set (regression): {optimal_alpha_reg:.6f}")
print(f"• Best alpha on test set (classification): {optimal_alpha_clf:.6f}")
print(f"• Max test R² (regression): {max(reg_scores_test):.4f}")
print(f"• Max test accuracy (classification): {max(clf_scores_test):.4f}")
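
Cost complexity pruning scores a tree as its total leaf impurity plus `alpha` times its number of leaves. For each internal node there is an "effective alpha" at which collapsing its subtree into a single leaf costs exactly as much as keeping it, and the pruning path removes the node with the smallest effective alpha first. The sketch below works through that arithmetic; the function names and the two candidate subtrees are hypothetical.

```python
def penalized_cost(total_leaf_impurity, n_leaves, alpha):
    """Cost-complexity measure: impurity plus a per-leaf penalty."""
    return total_leaf_impurity + alpha * n_leaves


def effective_alpha(node_impurity, subtree_impurity, subtree_leaves):
    """Alpha at which replacing a subtree by one leaf breaks even.

    node_impurity:    impurity if the node were a single leaf
    subtree_impurity: summed impurity of the subtree's leaves
    subtree_leaves:   number of leaves in the subtree (at least 2)
    """
    if subtree_leaves < 2:
        raise ValueError("a subtree needs at least two leaves to be pruned")
    return (node_impurity - subtree_impurity) / (subtree_leaves - 1)


# Hypothetical internal nodes: (impurity as leaf, impurity of subtree, leaves)
candidates = {
    "node_A": (0.10, 0.04, 4),
    "node_B": (0.08, 0.07, 2),
}

alphas = {name: effective_alpha(*stats) for name, stats in candidates.items()}
weakest_link = min(alphas, key=alphas.get)

for name, alpha in alphas.items():
    print(f"{name}: effective alpha = {alpha:.4f}")
print(f"Pruned first: {weakest_link}")
```

`node_B` buys only a 0.01 impurity reduction with its one extra leaf, so it is pruned before `node_A`, whose three extra leaves each buy 0.02. Larger values of `ccp_alpha` therefore produce smaller trees, which is the trade-off traced by the pruning-path plot above.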

In [None]:
# 4. TREE VISUALIZATION AND INTERPRETATION
print("\n 4. TREE VISUALIZATION AND INTERPRETATION")
print("=" * 45)

# 4.1 Tree Structure Visualization
print("4.1 Tree Structure Analysis:")

# Create a simple tree for visualization (limited depth)
dt_reg_simple = DecisionTreeRegressor(max_depth=3, min_samples_split=20, random_state=42)
dt_reg_simple.fit(X_reg_train, y_reg_train)

dt_clf_simple = DecisionTreeClassifier(max_depth=3, min_samples_split=20, random_state=42)
dt_clf_simple.fit(X_clf_train, y_clf_train)

# Text representation of trees
print(" Regression Tree Structure (Depth=3):")
tree_rules_reg = export_text(dt_reg_simple, feature_names=reg_features, max_depth=3)
print(tree_rules_reg[:1000] + "..." if len(tree_rules_reg) > 1000 else tree_rules_reg)

print("\n Classification Tree Structure (Depth=3):")
tree_rules_clf = export_text(dt_clf_simple, feature_names=clf_features, max_depth=3)
print(tree_rules_clf[:1000] + "..." if len(tree_rules_clf) > 1000 else tree_rules_clf)

# 4.2 Decision Boundary Analysis (for 2D visualization)
print(f"\n4.2 Decision Boundary Analysis:")

# Select two most important features for visualization
top_features_reg = feature_importance_reg.head(2)['feature'].tolist()
top_features_clf = feature_importance_clf.head(2)['feature'].tolist()

# Create 2D datasets
X_reg_2d = X_reg_train[top_features_reg].values
X_clf_2d = X_clf_train[top_features_clf].values

# Fit simple trees on 2D data
dt_reg_2d = DecisionTreeRegressor(max_depth=4, random_state=42)
dt_reg_2d.fit(X_reg_2d, y_reg_train)

dt_clf_2d = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_clf_2d.fit(X_clf_2d, y_clf_train)

# Create meshgrid for decision boundary
# A fixed number of grid points per axis keeps the grid size bounded
# regardless of feature scale (a fixed step size does not).
def create_meshgrid(X, n_points=200, pad=0.05):
    """Return an n_points x n_points grid spanning both features, padded by a fraction of each range."""
    x_pad = (X[:, 0].max() - X[:, 0].min()) * pad
    y_pad = (X[:, 1].max() - X[:, 1].min()) * pad
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min() - x_pad, X[:, 0].max() + x_pad, n_points),
        np.linspace(X[:, 1].min() - y_pad, X[:, 1].max() + y_pad, n_points),
    )
    return xx, yy

# Classification decision boundary
xx_clf, yy_clf = create_meshgrid(X_clf_2d)
mesh_points_clf = np.c_[xx_clf.ravel(), yy_clf.ravel()]
Z_clf = dt_clf_2d.predict(mesh_points_clf)
Z_clf = Z_clf.reshape(xx_clf.shape)

# Regression decision boundary
xx_reg, yy_reg = create_meshgrid(X_reg_2d)
mesh_points_reg = np.c_[xx_reg.ravel(), yy_reg.ravel()]
Z_reg = dt_reg_2d.predict(mesh_points_reg)
Z_reg = Z_reg.reshape(xx_reg.shape)

# Visualize decision boundaries
fig_boundaries = make_subplots(
 rows=1, cols=2,
 subplot_titles=(f"Regression: {top_features_reg[0]} vs {top_features_reg[1]}",
 f"Classification: {top_features_clf[0]} vs {top_features_clf[1]}")
)

# Regression boundary
fig_boundaries.add_trace(
 go.Contour(
 x=xx_reg[0],
 y=yy_reg[:, 0],
 z=Z_reg,
 colorscale='Viridis',
 opacity=0.3,
 showscale=False,
 hovertemplate="X: %{x}<br>Y: %{y}<br>Prediction: %{z:,.0f}<extra></extra>"
 ),
 row=1, col=1
)

# Add regression data points
fig_boundaries.add_trace(
 go.Scatter(
 x=X_reg_2d[:, 0],
 y=X_reg_2d[:, 1],
 mode='markers',
 marker=dict(
 color=y_reg_train,
 colorscale='Viridis',
 size=6,
 opacity=0.7,
 colorbar=dict(title="Revenue")
 ),
 name='Training Data',
 hovertemplate=f"<b>{top_features_reg[0]}</b>: %{{x}}<br><b>{top_features_reg[1]}</b>: %{{y}}<br>Revenue: %{{marker.color:,.0f}}<extra></extra>"
 ),
 row=1, col=1
)

# Classification boundary
fig_boundaries.add_trace(
 go.Contour(
 x=xx_clf[0],
 y=yy_clf[:, 0],
 z=Z_clf,
 colorscale='RdYlBu',
 opacity=0.3,
 showscale=False,
 hovertemplate="X: %{x}<br>Y: %{y}<br>Prediction: %{z}<extra></extra>"
 ),
 row=1, col=2
)

# Add classification data points
colors_clf = ['blue' if c == 0 else 'red' for c in y_clf_train]
fig_boundaries.add_trace(
    go.Scatter(
        x=X_clf_2d[:, 0],
        y=X_clf_2d[:, 1],
        mode='markers',
        marker=dict(
            color=colors_clf,
            size=6,
            opacity=0.7
        ),
        # Pass the labels as customdata so the hover shows 0/1 rather than the color name
        customdata=y_clf_train,
        name='Training Data',
        showlegend=False,
        hovertemplate=f"<b>{top_features_clf[0]}</b>: %{{x}}<br><b>{top_features_clf[1]}</b>: %{{y}}<br>Churn: %{{customdata}}<extra></extra>"
    ),
    row=1, col=2
)

fig_boundaries.update_layout(
    title="Decision Tree: Decision Boundaries (2D Projection)",
    height=500
)
fig_boundaries.update_xaxes(title_text=top_features_reg[0], row=1, col=1)
fig_boundaries.update_yaxes(title_text=top_features_reg[1], row=1, col=1)
fig_boundaries.update_xaxes(title_text=top_features_clf[0], row=1, col=2)
fig_boundaries.update_yaxes(title_text=top_features_clf[1], row=1, col=2)

fig_boundaries.show()

# 4.3 Feature Interaction Analysis
print(f"\n4.3 Feature Interaction Analysis:")

# Analyze how features split data at different tree levels
def analyze_tree_splits(tree_model, feature_names, max_depth=3):
    """Analyze decision tree splits and feature usage by depth"""

    tree_structure = tree_model.tree_
    feature_usage = {}

    def traverse_tree(node_id, depth):
        if depth > max_depth:
            return

        if tree_structure.feature[node_id] != -2:  # Not a leaf
            feature_idx = tree_structure.feature[node_id]
            feature_name = feature_names[feature_idx]
            threshold = tree_structure.threshold[node_id]

            if depth not in feature_usage:
                feature_usage[depth] = []

            feature_usage[depth].append({
                'feature': feature_name,
                'threshold': threshold,
                'samples': tree_structure.n_node_samples[node_id]
            })

            # Traverse children
            traverse_tree(tree_structure.children_left[node_id], depth + 1)
            traverse_tree(tree_structure.children_right[node_id], depth + 1)

    traverse_tree(0, 0)
    return feature_usage

# Analyze optimized trees
reg_splits = analyze_tree_splits(dt_reg_optimized, reg_features)
clf_splits = analyze_tree_splits(dt_clf_optimized, clf_features)

print(" Regression Tree - Feature Usage by Depth:")
for depth, splits in reg_splits.items():
    print(f" Depth {depth}:")
    for split in splits:
        print(f" • {split['feature']} <= {split['threshold']:.3f} (samples: {split['samples']})")

print("\n Classification Tree - Feature Usage by Depth:")
for depth, splits in clf_splits.items():
    print(f" Depth {depth}:")
    for split in splits:
        print(f" • {split['feature']} <= {split['threshold']:.3f} (samples: {split['samples']})")

# 4.4 Prediction Path Analysis
print(f"\n4.4 Sample Prediction Path Analysis:")

# Get prediction paths for a few samples
def get_decision_path(tree_model, X_sample, feature_names):
    """Get the decision path for each sample in X_sample"""

    tree_structure = tree_model.tree_
    decision_path = tree_model.decision_path(X_sample)

    paths = []
    for sample_id in range(X_sample.shape[0]):
        sample_path = []
        node_indicator = decision_path[sample_id]

        for node_id in node_indicator.indices:
            if tree_structure.feature[node_id] != -2:  # Not a leaf
                feature_idx = tree_structure.feature[node_id]
                feature_name = feature_names[feature_idx]
                threshold = tree_structure.threshold[node_id]
                feature_value = X_sample[sample_id, feature_idx]

                if feature_value <= threshold:
                    condition = f"{feature_name} <= {threshold:.3f}"
                    direction = "left"
                else:
                    condition = f"{feature_name} > {threshold:.3f}"
                    direction = "right"

                sample_path.append({
                    'condition': condition,
                    'feature_value': feature_value,
                    'direction': direction
                })

        paths.append(sample_path)

    return paths

# Analyze a few test samples
sample_indices = [0, 1, 2]
reg_sample_paths = get_decision_path(dt_reg_optimized, X_reg_test.iloc[sample_indices].values, reg_features)
clf_sample_paths = get_decision_path(dt_clf_optimized, X_clf_test.iloc[sample_indices].values, clf_features)

print(" Sample Regression Predictions:")
for i, path in enumerate(reg_sample_paths):
    idx = sample_indices[i]  # positional index of this sample in the test set
    actual = y_reg_test.iloc[idx]
    predicted = dt_reg_optimized.predict(X_reg_test.iloc[idx:idx + 1])[0]
    print(f" Sample {i+1}: Actual=${actual:,.0f}, Predicted=${predicted:,.0f}")
    print(f" Decision path: {' → '.join([step['condition'] for step in path])}")

print("\n Sample Classification Predictions:")
for i, path in enumerate(clf_sample_paths):
    idx = sample_indices[i]  # positional index of this sample in the test set
    actual = y_clf_test.iloc[idx]
    predicted = dt_clf_optimized.predict(X_clf_test.iloc[idx:idx + 1])[0]
    probability = dt_clf_optimized.predict_proba(X_clf_test.iloc[idx:idx + 1])[0]
    print(f" Sample {i+1}: Actual={actual}, Predicted={predicted}, Prob=[{probability[0]:.3f}, {probability[1]:.3f}]")
    print(f" Decision path: {' → '.join([step['condition'] for step in path])}")

In [None]:
# 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS
print("\n 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS")
print("=" * 58)

# 5.1 Revenue Prediction Insights
print("5.1 Revenue Prediction Business Rules:")

# Extract business rules from regression tree
def extract_business_rules(tree_model, feature_names, target_name, threshold_percentile=80):
    """Extract actionable business rules from decision tree"""

    tree_structure = tree_model.tree_
    rules = []

    def traverse_for_rules(node_id, conditions, depth=0):
        if tree_structure.feature[node_id] != -2:  # Not a leaf
            feature_idx = tree_structure.feature[node_id]
            feature_name = feature_names[feature_idx]
            threshold = tree_structure.threshold[node_id]

            # Left child (<=)
            left_conditions = conditions + [f"{feature_name} <= {threshold:.2f}"]
            traverse_for_rules(tree_structure.children_left[node_id], left_conditions, depth + 1)

            # Right child (>)
            right_conditions = conditions + [f"{feature_name} > {threshold:.2f}"]
            traverse_for_rules(tree_structure.children_right[node_id], right_conditions, depth + 1)
        else:
            # Leaf node - extract rule
            prediction = tree_structure.value[node_id][0][0]
            samples = tree_structure.n_node_samples[node_id]

            rules.append({
                'conditions': conditions,
                'prediction': prediction,
                'samples': samples,
                'rule': ' AND '.join(conditions) if conditions else 'All samples'
            })

    traverse_for_rules(0, [])

    # Filter rules by prediction value (high-value rules)
    all_predictions = [rule['prediction'] for rule in rules]
    high_threshold = np.percentile(all_predictions, threshold_percentile)

    high_value_rules = [rule for rule in rules if rule['prediction'] >= high_threshold]
    high_value_rules.sort(key=lambda x: x['prediction'], reverse=True)

    return rules, high_value_rules

# Extract revenue rules
all_revenue_rules, high_revenue_rules = extract_business_rules(
    dt_reg_optimized, reg_features, 'revenue', threshold_percentile=80
)

print(" High-Revenue Business Rules (leaves at or above the 80th percentile of predicted revenue):")
for i, rule in enumerate(high_revenue_rules[:5], 1):
    print(f" {i}. Expected Revenue: ${rule['prediction']:,.0f} (n={rule['samples']})")
    print(f" Conditions: {rule['rule']}")
    print()

# 5.2 Churn Prevention Insights
print("5.2 Churn Prevention Business Rules:")

# Extract churn rules (focusing on high-churn probability leaves)
def extract_churn_rules(tree_model, feature_names):
    """Extract churn probability rules"""

    tree_structure = tree_model.tree_
    rules = []

    def traverse_for_churn(node_id, conditions):
        if tree_structure.feature[node_id] != -2:  # Not a leaf
            feature_idx = tree_structure.feature[node_id]
            feature_name = feature_names[feature_idx]
            threshold = tree_structure.threshold[node_id]

            # Left child
            left_conditions = conditions + [f"{feature_name} <= {threshold:.2f}"]
            traverse_for_churn(tree_structure.children_left[node_id], left_conditions)

            # Right child
            right_conditions = conditions + [f"{feature_name} > {threshold:.2f}"]
            traverse_for_churn(tree_structure.children_right[node_id], right_conditions)
        else:
            # Leaf node. Depending on the scikit-learn release, tree_.value holds
            # per-class counts or per-class fractions, so normalize by the row sum
            # to get a probability in either case.
            class_values = tree_structure.value[node_id][0]
            total_samples = tree_structure.n_node_samples[node_id]
            value_sum = class_values.sum()
            churn_probability = class_values[1] / value_sum if value_sum > 0 else 0

            rules.append({
                'conditions': conditions,
                'churn_probability': churn_probability,
                'samples': total_samples,
                'churn_count': int(round(churn_probability * total_samples)),
                'rule': ' AND '.join(conditions) if conditions else 'All samples'
            })

    traverse_for_churn(0, [])

    # Filter high-risk rules
    high_risk_rules = [rule for rule in rules if rule['churn_probability'] >= 0.6 and rule['samples'] >= 10]
    high_risk_rules.sort(key=lambda x: x['churn_probability'], reverse=True)

    return rules, high_risk_rules

all_churn_rules, high_risk_rules = extract_churn_rules(dt_clf_optimized, clf_features)

print(" High-Risk Churn Segments (>=60% churn probability, at least 10 samples):")
for i, rule in enumerate(high_risk_rules[:5], 1):
    print(f" {i}. Churn Risk: {rule['churn_probability']:.1%} (n={rule['samples']}, churned={rule['churn_count']})")
    print(f" Conditions: {rule['rule']}")
    print()

# 5.3 Feature-Based Recommendations
print("5.3 Strategic Recommendations by Feature Importance:")

# Revenue recommendations
print(" Revenue Optimization Strategies:")
for i, (_, row) in enumerate(feature_importance_reg.head(3).iterrows(), 1):
    feature = row['feature']
    importance = row['importance']

    if 'marketing_spend' in feature:
        print(f" {i}. Marketing Investment (Importance: {importance:.3f})")
        print(" • Increase marketing spend for companies with high employee counts")
        print(" • Focus on digital marketing for tech industry segments")
    elif 'customer_satisfaction' in feature:
        print(f" {i}. Customer Experience (Importance: {importance:.3f})")
        print(" • Implement customer satisfaction monitoring systems")
        print(" • Prioritize satisfaction improvements for large companies")
    elif 'employee_count' in feature:
        print(f" {i}. Workforce Optimization (Importance: {importance:.3f})")
        print(" • Monitor revenue per employee ratios")
        print(" • Scale workforce based on industry-specific productivity metrics")

print("\n Churn Prevention Strategies:")
for i, (_, row) in enumerate(feature_importance_clf.head(3).iterrows(), 1):
    feature = row['feature']
    importance = row['importance']

    if 'tenure_months' in feature:
        print(f" {i}. Early Retention Focus (Importance: {importance:.3f})")
        print(" • Implement 90-day onboarding program")
        print(" • Provide extra support for customers in first 6 months")
    elif 'support_calls' in feature:
        print(f" {i}. Proactive Support (Importance: {importance:.3f})")
        print(" • Flag customers with >2 support calls per month")
        print(" • Implement proactive outreach for high-contact customers")
    elif 'monthly_charges' in feature:
        print(f" {i}. Pricing Strategy (Importance: {importance:.3f})")
        print(" • Review pricing for high-charge, short-tenure customers")
        print(" • Consider loyalty discounts for long-term customers")

# 5.4 ROI Analysis
print(f"\n5.4 ROI and Business Impact Analysis:")

# Calculate potential impact of interventions
def calculate_intervention_impact(rules, baseline_metric=None, improvement_rate=0.2):
    """Calculate potential business impact of interventions"""
    # baseline_metric is currently unused; it defaults to None so the
    # one-argument calls in this cell do not raise a TypeError.

    total_impact = 0
    for rule in rules[:3]:  # Top 3 actionable rules
        samples_affected = rule['samples']
        current_value = rule.get('prediction', rule.get('churn_probability', 0))

        if 'prediction' in rule:  # Revenue rules
            potential_increase = current_value * improvement_rate
            total_impact += potential_increase * samples_affected
        else:  # Churn rules
            churn_reduction = current_value * improvement_rate
            total_impact += churn_reduction * samples_affected

    return total_impact

# Revenue impact
revenue_impact = calculate_intervention_impact(high_revenue_rules)
print(f" Potential Revenue Impact:")
print(f"• High-value segment optimization: ${revenue_impact:,.0f} potential increase")
print(f"• Target segments: {sum(rule['samples'] for rule in high_revenue_rules[:3])} companies")

# Churn impact
churn_impact = calculate_intervention_impact(high_risk_rules)
print(f"\n Potential Churn Reduction Impact:")
print(f"• High-risk customers prevented from churning: {churn_impact:.0f} customers")
print(f"• Target segments: {sum(rule['samples'] for rule in high_risk_rules[:3])} at-risk customers")

# 5.5 Model Reliability Assessment
print(f"\n5.5 Model Reliability and Risk Assessment:")

print(" Model Performance Summary:")
print(f"• Regression Model R²: {test_r2_reg_opt:.3f}")
print(f"• Classification Model AUC: {roc_auc_opt:.3f}")
print(f"• Regression Tree Depth: {dt_reg_optimized.get_depth()}")
print(f"• Classification Tree Depth: {dt_clf_optimized.get_depth()}")

reliability_score = (test_r2_reg_opt + roc_auc_opt) / 2
if reliability_score > 0.8:
    reliability = "HIGH"
elif reliability_score > 0.6:
    reliability = "MEDIUM"
else:
    reliability = "LOW"

print(f"• Overall Model Reliability: {reliability} ({reliability_score:.3f})")

print(f"\n Risk Factors:")
if dt_reg_optimized.get_depth() > 10:
    print("• CAUTION: Deep regression tree may overfit to training data")
if dt_clf_optimized.get_depth() > 10:
    print("• CAUTION: Deep classification tree may overfit to training data")

feature_concentration_reg = feature_importance_reg.iloc[0]['importance']
feature_concentration_clf = feature_importance_clf.iloc[0]['importance']

if feature_concentration_reg > 0.5:
    print(f"• RISK: High dependence on single feature in regression ({feature_importance_reg.iloc[0]['feature']})")
if feature_concentration_clf > 0.5:
    print(f"• RISK: High dependence on single feature in classification ({feature_importance_clf.iloc[0]['feature']})")

if reliability_score > 0.7:
    print(" Models are suitable for business decision support")
else:
    print(" Models need improvement before deployment")

# LEARNING SUMMARY: Decision Tree Analysis

## Key Concepts Mastered

### 1. **Decision Tree Fundamentals**
- **Tree Construction**: Recursive binary splitting based on feature thresholds
- **Splitting Criteria**: Gini impurity, entropy, and MSE for optimal splits
- **Tree Structure**: Understanding nodes, leaves, depth, and branching logic
- **Interpretability**: Clear decision rules and prediction paths
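
The splitting criteria listed above can be sketched in a few lines of plain Python. This is an illustrative toy, not scikit-learn's implementation: the helper names (`gini`, `entropy`, `split_impurity`) and the small `tenure`/`churn` lists are assumptions made up for the example.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def split_impurity(values, labels, threshold, criterion=gini):
    """Size-weighted impurity of the two children produced by `value <= threshold`."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    n = len(labels)
    return len(left) / n * criterion(left) + len(right) / n * criterion(right)


# Hypothetical toy data: short-tenure customers churn (1), long-tenure ones stay (0).
tenure = [1, 2, 3, 10, 12, 15]
churn = [1, 1, 1, 0, 0, 0]

parent_gini = gini(churn)
clean_split = split_impurity(tenure, churn, threshold=5)
partial_split = split_impurity(tenure, churn, threshold=2.5)
print(parent_gini, clean_split, partial_split)
```

A tree builder evaluates many candidate thresholds this way and keeps the one with the lowest weighted child impurity; for regression the same idea applies with variance (MSE) in place of Gini or entropy.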

### 2. **Overfitting Prevention**
- **Pruning Techniques**: Pre-pruning (early stopping) and post-pruning methods
- **Hyperparameter Tuning**: max_depth, min_samples_split, min_samples_leaf
- **Cost Complexity Pruning**: Alpha-based pruning for optimal tree size
- **Cross-Validation**: Robust model selection and performance estimation
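
As a rough sketch of how cost-complexity pruning and cross-validation fit together, the example below picks `ccp_alpha` by 5-fold CV. It assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in; it is not the notebook's churn dataset or its tuning procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# Candidate alphas come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
# Drop the last alpha (it prunes down to the root) and clip tiny negative
# values that can appear from floating-point error.
candidate_alphas = [max(float(alpha), 0.0) for alpha in path.ccp_alphas[:-1]]

# Score each alpha with 5-fold cross-validation and keep the best one.
cv_scores = {
    alpha: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in candidate_alphas
}
best_alpha = max(cv_scores, key=cv_scores.get)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(f"best ccp_alpha: {best_alpha:.5f}")
print(f"leaves: {unpruned.get_n_leaves()} unpruned vs {pruned.get_n_leaves()} pruned")
```

Larger `ccp_alpha` values remove more of the tree, so the cross-validated score is what decides how much simplification the data can tolerate.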

### 3. **Feature Analysis & Business Rules**
- **Feature Importance**: Impurity-based ranking of how much each feature contributes to the splits
- **Decision Boundaries**: Understanding how trees partition feature space
- **Rule Extraction**: Converting tree logic into actionable business rules
- **Path Analysis**: Tracing prediction logic for individual samples
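
For quick rule extraction, scikit-learn also provides `sklearn.tree.export_text`, which prints a fitted tree as nested if/else rules. The sketch below is a minimal illustration on synthetic data; the feature names are borrowed from the churn analysis purely as labels and are not tied to real columns here.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the names below are illustrative labels only.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["tenure_months", "support_calls", "monthly_charges"]

# A shallow tree keeps the printed rule set short and readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules_text = export_text(clf, feature_names=feature_names)
print(rules_text)
```

The text output is convenient for documentation and review; a hand-written traversal of `tree_` is still useful when the rules need to be filtered or ranked programmatically.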

## Business Applications

### Decision Support Systems
- **Credit Scoring**: Automated loan approval with clear criteria
- **Medical Diagnosis**: Rule-based diagnostic support systems
- **Quality Control**: Defect detection with interpretable rules
- **Customer Segmentation**: Clear criteria for marketing targeting

### Strategic Planning
- Decision trees provide:
 - Transparent decision logic for stakeholder buy-in
 - Actionable business rules for operational implementation
 - Feature importance for resource allocation priorities
 - Risk assessment with quantifiable decision paths

## Next Steps

1. **Ensemble Methods** - Random Forests and Gradient Boosting
2. **Advanced Pruning** - Minimal cost-complexity pruning
3. **Multi-output Trees** - Simultaneous prediction of multiple targets
4. **Tree-based Feature Selection** - Using trees for dimensionality reduction

## Pro Tips

- **Balance interpretability vs accuracy** - deeper trees fit the training data more closely but are harder to interpret and more prone to overfitting
- **Use cross-validation** for hyperparameter selection to avoid overfitting
- **Feature engineering matters** - trees work well with well-prepared features
- **Consider ensemble methods** when single trees underperform
- **Visualize decision boundaries** to understand model behavior

## Common Pitfalls

- **Overfitting**: Deep trees memorize training data rather than learning patterns
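- **Bias in Splits**: Impurity-based splitting and importance scores tend to favor features with many distinct values (high cardinality)
- **Instability**: Small data changes can lead to very different trees
- **Linear Relationships**: Trees approximate a linear trend with axis-aligned steps, so they need many splits to match what a linear model captures with one coefficient
- **Missing Values**: Check your library version; scikit-learn trees before 1.3 reject NaN and need imputation, while 1.3+ accepts missing values for some splitter and criterion settings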
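
The linear-relationship pitfall is easy to see with a toy example. The sketch below fits a hand-written 1-D regression tree (an illustrative from-scratch helper, not scikit-learn's algorithm) to the exactly linear target `y = 2x`; the function names are hypothetical.

```python
def sse(ys):
    """Sum of squared errors around the mean of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def fit_tree(xs, ys, depth):
    """Fit a toy 1-D regression tree; returns a float (leaf) or a dict (split)."""
    pairs = sorted(zip(xs, ys))
    if depth == 0 or pairs[0][0] == pairs[-1][0]:
        return sum(ys) / len(ys)  # leaf predicts the mean target

    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        cost = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2, i)

    _, threshold, i = best
    left, right = pairs[:i], pairs[i:]
    return {
        "threshold": threshold,
        "left": fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        "right": fit_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's mean."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree


xs = list(range(16))
ys = [2 * x for x in xs]  # a perfectly linear relationship
tree = fit_tree(xs, ys, depth=2)
preds = [predict(tree, x) for x in xs]
distinct_levels = sorted(set(preds))
max_abs_error = max(abs(p - y) for p, y in zip(preds, ys))
print(distinct_levels, max_abs_error)
```

A depth-2 tree can output at most four distinct values, so the straight line is replaced by four plateaus; halving the error requires roughly doubling the number of leaves.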

## Advanced Considerations

### When to Use Decision Trees:
- **Need interpretability**: Stakeholders require explainable decisions
- **Mixed data types**: Both categorical and numerical features can be used, though scikit-learn trees need categorical features encoded as numbers first
- **Non-linear relationships**: Complex interactions between features
- **Rule-based systems**: Converting expert knowledge into automated systems

### Performance Optimization:
- **Feature selection**: Remove irrelevant features to reduce noise
- **Balanced datasets**: Address class imbalance for better classification
- **Ensemble methods**: Combine multiple trees for better performance
- **Regular retraining**: Update models as business conditions change
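
One common way to address class imbalance is to reweight classes inversely to their frequency, which is what `class_weight="balanced"` does in scikit-learn's tree estimators. The sketch below reproduces that weighting formula, `n_samples / (n_classes * class_count)`, in plain Python; the helper name and the 90/10 label split are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced target: 90 retained customers (0), 10 churned (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the impurity calculation, so splits are no longer dominated by the majority class.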

### Business Implementation:
- **Rule documentation**: Maintain clear records of decision logic
- **Performance monitoring**: Track model accuracy over time
- **Stakeholder training**: Ensure users understand tree-based decisions
- **Compliance considerations**: Ensure rules meet regulatory requirements

**Remember**: *Decision trees excel when you need both accuracy AND interpretability - they're your go-to method when stakeholders need to understand "why" the model made a specific prediction!*