# Model Auditor Example Notebook

This notebook demonstrates the core functionality of the `model-auditor` package for evaluating ML model performance across subgroups with stratified metrics and bootstrap confidence intervals.

## Setup

First, let's import the necessary modules and create a synthetic dataset for demonstration.

In [None]:
import numpy as np
import pandas as pd
np.random.seed(42)

In [None]:
from model_auditor import Auditor
from model_auditor.metrics import (
    Sensitivity, Specificity, Precision, Recall,
    F1Score, AUROC, AUPRC, MatthewsCorrelationCoefficient,
    FBetaScore, TPR, FPR, nData, nPositive, nNegative
)

## Create Synthetic Dataset

We'll create a synthetic medical dataset simulating a disease prediction model with demographic features for stratified evaluation.

In [None]:
n_samples = 2000

# Create demographic features
age_groups = np.random.choice(['18-30', '31-50', '51-70', '70+'], n_samples, p=[0.2, 0.35, 0.30, 0.15])
gender = np.random.choice(['Male', 'Female'], n_samples, p=[0.48, 0.52])
region = np.random.choice(['North', 'South', 'East', 'West'], n_samples, p=[0.25, 0.25, 0.25, 0.25])

# Create ground truth labels (disease status) with some demographic variation
base_prevalence = 0.3
disease_prob = np.where(age_groups == '70+', base_prevalence + 0.15,
               np.where(age_groups == '51-70', base_prevalence + 0.08,
               np.where(age_groups == '31-50', base_prevalence,
                        base_prevalence - 0.05)))
disease_status = np.random.binomial(1, disease_prob)

# Create model prediction scores (continuous 0-1)
# Model has varying performance across groups
noise = np.random.normal(0, 0.15, n_samples)
risk_score = np.clip(disease_status * 0.6 + (1 - disease_status) * 0.3 + noise, 0, 1)

# Add some systematic bias for demonstration
risk_score = np.where(age_groups == '70+', risk_score + 0.05, risk_score)
risk_score = np.clip(risk_score, 0, 1)

# Create DataFrame
df = pd.DataFrame({
    'patient_id': range(n_samples),
    'age_group': age_groups,
    'gender': gender,
    'region': region,
    'risk_score': risk_score,
    'disease_status': disease_status
})

print(f"Dataset shape: {df.shape}")
print(f"\nDisease prevalence: {df['disease_status'].mean():.1%}")
df.head(10)

## Basic Usage: Evaluating Model Performance

Let's set up the Auditor to evaluate our model's performance across different demographic groups.

In [None]:
# Initialize the Auditor
auditor = Auditor()

# Add the dataset
auditor.add_data(df)

# Add stratification features
auditor.add_feature(name='age_group', label='Age Group')
auditor.add_feature(name='gender', label='Gender')
auditor.add_feature(name='region', label='Region')

# Add the prediction score column with a threshold
auditor.add_score(name='risk_score', label='Risk Score', threshold=0.5)

# Add the outcome (ground truth) column
auditor.add_outcome(name='disease_status')

# Set metrics to evaluate
auditor.set_metrics([
    nData(),           # Sample size
    nPositive(),       # Number of positive cases
    Sensitivity(),     # True positive rate
    Specificity(),     # True negative rate
    Precision(),       # Positive predictive value
    F1Score(),         # Harmonic mean of precision and recall
    AUROC(),           # Area under ROC curve
])

print("Auditor configured successfully!")

In [None]:
# Run evaluation with bootstrap confidence intervals
# Using 500 bootstraps for faster execution (use 1000+ for production)
results = auditor.evaluate(score_name='risk_score', n_bootstraps=500)

In [None]:
# View all results as a DataFrame
results_df = results.to_dataframe(n_decimals=3, metric_labels=True)
results_df

## Accessing Specific Results

Results are nested: `results.features[...]` selects one stratification feature, and `.levels[...]` selects a single level within that feature.

In [None]:
# Get results for a specific feature
print("=== Age Group Results ===")
age_results = results.features['age_group'].to_dataframe(n_decimals=3, metric_labels=True)
display(age_results)

In [None]:
# Get results for a specific level within a feature
print("=== Results for Age Group 70+ ===")
elderly_results = results.features['age_group'].levels['70+'].to_dataframe(n_decimals=3, metric_labels=True)
display(elderly_results)

## Threshold Optimization

The Auditor can select a decision threshold by maximizing the Youden index, J = sensitivity + specificity - 1.
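
As a rough illustration of the idea, independent of the package, the Youden index can be maximized by hand by trying each observed score as a cut-off. The function `youden_threshold` and the toy data are hypothetical, and the `score >= threshold` convention is an assumption; `optimize_score_threshold` may search or break ties differently.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    best_j, best_t = -np.inf, None
    # Try each observed score as a cut-off (predict positive if score >= t)
    for t in np.unique(scores):
        pred = scores >= t
        sensitivity = (pred & y_true).sum() / n_pos
        specificity = (~pred & ~y_true).sum() / n_neg
        j = sensitivity + specificity - 1
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_j, best_t = j, t
    return float(best_t), float(best_j)

# Toy data: positives tend to score higher than negatives
y = [0, 0, 0, 1, 0, 1, 1, 1, 1]
s = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
t, j = youden_threshold(y, s)
print(f"threshold={t:.2f}, J={j:.2f}")
```

The exhaustive loop is quadratic in the number of distinct scores, which is fine for a demonstration but slow on large datasets.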

In [None]:
# Find optimal threshold
optimal_threshold = auditor.optimize_score_threshold(score_name='risk_score')

In [None]:
# Re-evaluate with optimal threshold
results_optimal = auditor.evaluate(
    score_name='risk_score', 
    threshold=optimal_threshold,
    n_bootstraps=500
)

print(f"\nResults with optimal threshold ({optimal_threshold:.3f}):")
results_optimal.features['overall'].to_dataframe(n_decimals=3, metric_labels=True)

## Fast Evaluation (Without Confidence Intervals)

For quick exploration, you can skip the bootstrap confidence intervals by passing `n_bootstraps=None`.
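
To see why the bootstrap dominates run time, here is a minimal sketch of a percentile bootstrap: the statistic is recomputed once per resample. This shows the general technique only. `bootstrap_ci` and its parameters are hypothetical names, and the package may compute its intervals differently.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_bootstraps=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `stat` over a 1-D sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    # Resample with replacement and recompute the statistic each time
    estimates = np.array([
        stat(values[rng.integers(0, n, size=n)])
        for _ in range(n_bootstraps)
    ])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(stat(values)), float(lower), float(upper)

# Toy example: interval for the share of positive predictions
preds = np.array([1] * 30 + [0] * 70)
point, lo, hi = bootstrap_ci(preds)
print(f"{point:.3f} [{lo:.3f}, {hi:.3f}]")
```

The cost grows linearly with `n_bootstraps`, so disabling the resampling removes most of the work per metric.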

In [None]:
# Fast evaluation without CIs
results_fast = auditor.evaluate(score_name='risk_score', n_bootstraps=None)
results_fast.to_dataframe(n_decimals=3, metric_labels=True)

## Custom Metrics

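You can create custom metrics by implementing the `AuditorMetric` protocol. In the examples below, each metric declares a `name`, a display `label`, the `inputs` columns it reads, a `ci_eligible` flag, and a `data_call` method that returns the metric value.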
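
The custom metrics in this section sum `tp`, `tn`, `fp`, and `fn` columns. The sketch below assumes these are per-row 0/1 indicators derived from the outcome and the thresholded score. That is an inference from the `.sum()` calls, not confirmed package behavior, and `confusion_indicators` is a hypothetical helper used only to check the arithmetic on toy data.

```python
import numpy as np
import pandas as pd

def confusion_indicators(y_true, scores, threshold=0.5):
    """Build one 0/1 indicator column per confusion-matrix cell (assumed layout)."""
    y = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores) >= threshold  # assumption: >= counts as positive
    return pd.DataFrame({
        "tp": (pred & y).astype(int),
        "tn": (~pred & ~y).astype(int),
        "fp": (pred & ~y).astype(int),
        "fn": (~pred & y).astype(int),
    })

toy = confusion_indicators([1, 1, 0, 0, 1], [0.9, 0.4, 0.2, 0.7, 0.6])
counts = toy.sum()
accuracy = (counts["tp"] + counts["tn"]) / len(toy)
print(counts.to_dict(), accuracy)
```

Each row falls into exactly one cell, so the four column sums add up to the number of rows.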

In [None]:
from model_auditor.metrics import AuditorMetric

class Accuracy(AuditorMetric):
    """Custom accuracy metric."""
    name = "accuracy"
    label = "Accuracy"
    inputs = ["tp", "tn", "fp", "fn"]
    ci_eligible = True
    
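    def data_call(self, data: pd.DataFrame) -> float:
        tp = data["tp"].sum()
        tn = data["tn"].sum()
        fp = data["fp"].sum()
        fn = data["fn"].sum()
        total = tp + tn + fp + fn
        # Return NaN explicitly for an empty subgroup instead of dividing 0 by 0
        return (tp + tn) / total if total else float("nan")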


class BalancedAccuracy(AuditorMetric):
    """Balanced accuracy (average of sensitivity and specificity)."""
    name = "balanced_accuracy"
    label = "Balanced Accuracy"
    inputs = ["tp", "tn", "fp", "fn"]
    ci_eligible = True
    
    def data_call(self, data: pd.DataFrame, eps: float = 1e-8) -> float:
        tp = data["tp"].sum()
        tn = data["tn"].sum()
        fp = data["fp"].sum()
        fn = data["fn"].sum()
        sensitivity = tp / (tp + fn + eps)
        specificity = tn / (tn + fp + eps)
        return (sensitivity + specificity) / 2

In [None]:
# Use custom metrics
auditor.set_metrics([
    nData(),
    Accuracy(),
    BalancedAccuracy(),
    Sensitivity(),
    Specificity(),
    MatthewsCorrelationCoefficient(),
])

results_custom = auditor.evaluate(score_name='risk_score', n_bootstraps=500)
results_custom.features['age_group'].to_dataframe(n_decimals=3, metric_labels=True)

## Using F-beta Score

The F-beta score allows you to weight precision vs recall. Beta < 1 favors precision, beta > 1 favors recall.
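
For reference, the standard definition is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below evaluates it directly. `f_beta` is a hypothetical helper, not the package's `FBetaScore`, and returning 0.0 for an undefined score is one possible convention.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Convention chosen here: return 0.0 when the score is undefined
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom

# A model with high precision but modest recall
p, r = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(p, r, beta):.3f}")
```

Because precision exceeds recall in this toy case, the score falls as beta grows and recall gets more weight.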

In [None]:
# Compare different F-beta scores
auditor.set_metrics([
    nData(),
    Precision(),
    Recall(),
    FBetaScore(beta=0.5),   # Weights precision higher
    F1Score(),              # Equal weight (beta=1)
    FBetaScore(beta=2.0),   # Weights recall higher
])

results_fbeta = auditor.evaluate(score_name='risk_score', n_bootstraps=None)
results_fbeta.features['overall'].to_dataframe(n_decimals=3, metric_labels=True)

## Hierarchical Visualization

The `HierarchyPlotter` compiles a DataFrame into the parallel `labels`, `ids`, `parents`, `values`, and `colors` lists that sunburst and treemap charts expect.
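
To make that output format concrete, the sketch below flattens nested group counts into parallel id, label, parent, and value lists with plain pandas. It is not the `HierarchyPlotter` implementation. `sunburst_nodes`, the `/`-joined id scheme, and the toy data are all assumptions for illustration.

```python
import pandas as pd

def sunburst_nodes(data, features, root="All"):
    """Flatten nested group counts into parallel id/label/parent/value lists."""
    ids, labels, parents, values = [root], [root], [""], [len(data)]
    for depth in range(1, len(features) + 1):
        cols = features[:depth]
        # One node per observed combination of the first `depth` features
        for key, count in data.groupby(cols).size().items():
            key = key if isinstance(key, tuple) else (key,)
            path = [root, *map(str, key)]
            ids.append("/".join(path))            # unique id is the full path
            labels.append(path[-1])               # label is the last component
            parents.append("/".join(path[:-1]))   # parent id is the path prefix
            values.append(int(count))
    return ids, labels, parents, values

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "gender": ["F", "M", "F", "F", "M"],
})
ids, labels, parents, values = sunburst_nodes(toy, ["region", "gender"])
for row in zip(ids, parents, values):
    print(row)
```

Using the full path as the id keeps nodes distinct when the same label (for example `F`) appears under several parents, and child counts sum to their parent's count, as `branchvalues='total'` requires.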

In [None]:
from model_auditor.plotting import HierarchyPlotter

In [None]:
# Set up the hierarchy plotter
plotter = HierarchyPlotter()
plotter.set_data(df)
plotter.set_features(['region', 'age_group', 'gender'])  # Hierarchy order
plotter.set_score(name='risk_score')
plotter.set_aggregator('mean')  # Aggregate scores by mean

# Compile the plot data
plot_data = plotter.compile(container='All Patients')

print(f"Number of nodes: {len(plot_data.labels)}")
print(f"\nFirst 10 labels: {plot_data.labels[:10]}")
print(f"First 10 values: {plot_data.values[:10]}")

In [None]:
# Create a sunburst visualization with Plotly (if available)
try:
    import plotly.graph_objects as go
    
    fig = go.Figure(go.Sunburst(
        labels=plot_data.labels,
        ids=plot_data.ids,
        parents=plot_data.parents,
        values=plot_data.values,
        marker=dict(
            colors=plot_data.colors,
            colorscale='RdYlGn',
            cmid=0.5
        ),
        branchvalues='total',
        hovertemplate='<b>%{label}</b><br>Count: %{value}<br>Mean Score: %{color:.3f}<extra></extra>'
    ))
    
    fig.update_layout(
        title='Risk Score Distribution by Demographics',
        width=800,
        height=800
    )
    
    fig.show()
except ImportError:
    print("Plotly not installed. Install with: pip install plotly")
    print("\nPlot data is available in plot_data object for use with other visualization libraries.")

## Custom Hierarchies

For more complex visualizations, you can define custom hierarchies with conditional features. The example below builds a simple two-level hierarchy with one feature per level.

In [None]:
from model_auditor.plotting.schemas import Hierarchy, HLevel, HItem

# Create a custom hierarchy
custom_hierarchy = Hierarchy(levels=[
    HLevel([HItem(name='gender')]),           # First level: Gender
    HLevel([HItem(name='age_group')]),        # Second level: Age Group
])

plotter2 = HierarchyPlotter()
plotter2.set_data(df)
plotter2.set_features(custom_hierarchy)
plotter2.set_score(name='risk_score')
plotter2.set_aggregator('median')  # Use median instead of mean

plot_data2 = plotter2.compile(container='Population')

# Show the hierarchy structure
hierarchy_df = pd.DataFrame({
    'Label': plot_data2.labels,
    'Parent': plot_data2.parents,
    'Count': plot_data2.values,
    'Median Score': [f"{c:.3f}" for c in plot_data2.colors]
})
hierarchy_df

## Complete Workflow Example

Here's a complete example putting everything together.

In [None]:
# Complete workflow
from model_auditor import Auditor
from model_auditor.metrics import (
    Sensitivity, Specificity, AUROC, F1Score, 
    nData, nPositive, nNegative
)

# 1. Initialize and configure
auditor = Auditor()
auditor.add_data(df)
auditor.add_feature(name='age_group', label='Age Group')
auditor.add_feature(name='gender', label='Gender')
auditor.add_score(name='risk_score', label='Risk Score')
auditor.add_outcome(name='disease_status')

# 2. Find optimal threshold
threshold = auditor.optimize_score_threshold(score_name='risk_score')

# 3. Set comprehensive metrics
auditor.set_metrics([
    nData(),
    nPositive(),
    nNegative(),
    Sensitivity(),
    Specificity(),
    F1Score(),
    AUROC(),
])

# 4. Evaluate with confidence intervals
results = auditor.evaluate(
    score_name='risk_score',
    threshold=threshold,
    n_bootstraps=500
)

# 5. Display results
print("\n" + "="*60)
print("MODEL EVALUATION REPORT")
print("="*60)
print(f"\nOptimal Threshold: {threshold:.3f}")
print(f"Total Samples: {len(df)}")
print(f"Disease Prevalence: {df['disease_status'].mean():.1%}")
print("\n" + "-"*60)
print("OVERALL PERFORMANCE")
print("-"*60)
display(results.features['overall'].to_dataframe(n_decimals=3, metric_labels=True))

print("\n" + "-"*60)
print("PERFORMANCE BY AGE GROUP")
print("-"*60)
display(results.features['age_group'].to_dataframe(n_decimals=3, metric_labels=True))

print("\n" + "-"*60)
print("PERFORMANCE BY GENDER")
print("-"*60)
display(results.features['gender'].to_dataframe(n_decimals=3, metric_labels=True))

## Summary

This notebook demonstrated:

1. **Basic Usage**: Setting up the Auditor with data, features, scores, and outcomes
2. **Metric Evaluation**: Computing stratified metrics with bootstrap confidence intervals
3. **Threshold Optimization**: Finding optimal decision thresholds using the Youden index
4. **Custom Metrics**: Creating and using custom metric implementations
5. **F-beta Scores**: Adjusting precision/recall trade-offs
6. **Hierarchical Visualization**: Creating data for sunburst/treemap plots
7. **Complete Workflow**: Putting it all together for a comprehensive model audit

For more information, see the [README](README.md) or the module docstrings.