# Fairness Pipeline Development Toolkit - Demonstration

This notebook demonstrates the complete end-to-end fairness pipeline using a real-world loan approval dataset.

## Table of Contents

1. [Setup and Installation](#setup)
2. [Understanding the Pipeline](#architecture)
3. [Data Preparation](#data)
4. [Running the Pipeline](#execution)
5. [Analyzing Results](#results)
6. [MLflow Tracking](#mlflow)

## 1. Setup and Installation

### Import Required Dependencies

In [None]:
import sys
from pathlib import Path

# Add the parent of the working directory to the import path so `src` can be imported
sys.path.insert(0, str(Path.cwd().parent))

import pandas as pd
import numpy as np
import yaml
import mlflow

from src.run_pipeline import PipelineOrchestrator

print("All dependencies loaded successfully")

## 2. Understanding the Pipeline

The pipeline executes in the following sequence:

1. **Configuration Loading**: Validates config.yml using Pydantic
2. **Data Loading**: Loads loan approval dataset and creates train/test split
3. **Baseline Measurement**: Calculates initial fairness metrics
4. **Preprocessing**: Applies bias mitigation transformations
5. **Model Training**: Trains model with fairness constraints
6. **Final Validation**: Evaluates fairness improvements
7. **MLflow Logging**: Records all artifacts and metrics
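
Steps 3 and 6 both reduce predictions to a fairness score. As a rough illustration, the sketch below computes a demographic parity difference (the gap between the highest and lowest positive-prediction rates across groups) in plain Python. The function name and the toy inputs are hypothetical, and the toolkit's own metric implementation may differ.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return the gap between the highest and lowest positive-prediction rates.

    predictions: iterable of 0/1 model outputs
    groups: iterable of group labels, aligned with predictions
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)

    if not totals:
        raise ValueError("at least one prediction is required")

    # Positive-prediction (selection) rate per group
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values())


# Toy, made-up inputs: group "A" is approved 3 of 4 times, group "B" 1 of 4.
toy_predictions = [1, 1, 1, 0, 1, 0, 0, 0]
toy_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
dpd = demographic_parity_difference(toy_predictions, toy_groups)
print(f"{dpd:.2f}")  # 0.75 - 0.25 = 0.50
```

A value of 0 means every group receives positive predictions at the same rate, which is why a lower score counts as fairer when it is compared against a threshold.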

## 3. Data Preparation

### Load Sample Dataset

This demonstration uses a loan approval dataset containing:
- **Features**: loan_amount, income, credit_score, employment_status
- **Protected Attributes**: gender, race, age_group
- **Target**: loan_approved (binary)

In [None]:
data_path = Path('../data/loan_approval.csv')

if data_path.exists():
    df = pd.read_csv(data_path)
    print(f"Dataset shape: {df.shape}")
    print(f"\nColumns: {df.columns.tolist()}")
    print("\nFirst few rows:")
    display(df.head())
    print("\nTarget distribution:")
    print(df['loan_approved'].value_counts(normalize=True))
else:
    print(f"Dataset not found at {data_path}")
    print("Please ensure the dataset is available before running the pipeline")

### View Current Configuration

In [None]:
with open('../config.yml', 'r') as f:
    config = yaml.safe_load(f)

print("Current Pipeline Configuration:")
print("=" * 70)
print(f"Data Source: {config['data']['path']}")
print(f"Target Column: {config['data']['target_column']}")
print(f"Protected Attributes: {config['data']['protected_attributes']}")
print(f"\nPreprocessing: {config['preprocessing']['transformers']}")
print(f"Repair Level: {config['preprocessing']['repair_level']}")
print(f"\nTraining Method: {config['training']['method']}")
print(f"Fairness Constraint: {config['training']['constraint']}")
print(f"\nPrimary Metric: {config['fairness']['primary_metric']}")
print(f"Threshold: {config['fairness']['threshold']}")
print("=" * 70)
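
The cell above indexes straight into nested keys, so a missing entry surfaces as a bare `KeyError`. A minimal sketch of a pre-check follows. The helper `find_missing_keys` and the `REQUIRED_KEYS` table are hypothetical: they only mirror the keys printed above and are not the toolkit's Pydantic schema.

```python
# Keys mirrored from the printout above; this is not the toolkit's Pydantic schema.
REQUIRED_KEYS = {
    "data": ["path", "target_column", "protected_attributes"],
    "preprocessing": ["transformers", "repair_level"],
    "training": ["method", "constraint"],
    "fairness": ["primary_metric", "threshold"],
}


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return dotted names of required keys absent from a nested config dict."""
    missing = []
    for section, keys in required.items():
        section_values = config.get(section)
        if not isinstance(section_values, dict):
            # The whole section is absent or is not a mapping
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in section_values)
    return missing


# Placeholder config with made-up values, used only to exercise the helper.
example_config = {
    "data": {"path": "data.csv", "target_column": "loan_approved"},
    "preprocessing": {"transformers": [], "repair_level": 1.0},
    "training": {"method": "example", "constraint": "example"},
}
missing_keys = find_missing_keys(example_config)
print(missing_keys)  # ['data.protected_attributes', 'fairness']
```

In this notebook, `find_missing_keys(config)` could be called right after `yaml.safe_load` to report every gap at once before any value is printed.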

## 4. Running the Pipeline

### Execute Complete Pipeline

In [None]:
orchestrator = PipelineOrchestrator(config_path="../config.yml")
orchestrator.run()

## 5. Analyzing Results

### Extract Pipeline Results

In [None]:
baseline_metrics = orchestrator.baseline_metrics
final_metrics = orchestrator.final_metrics

print("Pipeline Execution Results")
print("=" * 70)
print("\nBaseline Metrics:")
for metric, value in baseline_metrics.items():
    print(f"  {metric}: {value:.4f}")

print("\nFinal Metrics:")
for metric, value in final_metrics.items():
    print(f"  {metric}: {value:.4f}")

### Calculate Improvements

In [None]:
primary_metric = config['fairness']['primary_metric']
threshold = config['fairness']['threshold']

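perf_metrics = ['accuracy', 'precision', 'recall']
nan = float('nan')

# Use NaN rather than 0 for metrics that were not recorded, so gaps stay visible
comparison_data = {
    'Metric': [primary_metric] + perf_metrics,
    'Baseline': [baseline_metrics.get(primary_metric, nan)]
                + [baseline_metrics.get(m, nan) for m in perf_metrics],
    'Final': [final_metrics.get(primary_metric, nan)]
             + [final_metrics.get(m, nan) for m in perf_metrics],
}

df_comparison = pd.DataFrame(comparison_data)

# Lower is better for the fairness metric; higher is better for performance metrics
is_fairness = df_comparison['Metric'] == primary_metric
delta = df_comparison['Final'] - df_comparison['Baseline']
df_comparison['Improvement'] = delta.where(~is_fairness, -delta)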

print("\nFairness-Performance Analysis:")
display(df_comparison)

final_fairness = final_metrics.get(primary_metric, float('inf'))
if final_fairness <= threshold:
    print(f"\nSUCCESS: Fairness metric ({final_fairness:.4f}) is within threshold ({threshold:.4f})")
else:
    print(f"\nALERT: Fairness metric ({final_fairness:.4f}) exceeds threshold ({threshold:.4f})")

### Visualize Results

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Fairness Improvement', 'Model Performance')
)

fairness_baseline = baseline_metrics.get(primary_metric, 0)
fairness_final = final_metrics.get(primary_metric, 0)

fig.add_trace(
    go.Bar(
        x=['Baseline', 'Final', 'Threshold'],
        y=[fairness_baseline, fairness_final, threshold],
        marker_color=['red', 'green', 'blue'],
        text=[f'{fairness_baseline:.3f}', f'{fairness_final:.3f}', f'{threshold:.3f}'],
        textposition='auto'
    ),
    row=1, col=1
)

performance_metrics = ['accuracy', 'precision', 'recall', 'f1_score']
performance_values = [final_metrics.get(m, 0) for m in performance_metrics]

fig.add_trace(
    go.Bar(
        x=performance_metrics,
        y=performance_values,
        marker_color='lightblue',
        text=[f'{v:.3f}' for v in performance_values],
        textposition='auto'
    ),
    row=1, col=2
)

fig.update_layout(
    title_text="Pipeline Results Dashboard",
    showlegend=False,
    height=400
)

fig.show()

## 6. MLflow Tracking

### Query MLflow Experiments

In [None]:
experiment_name = config['mlflow']['experiment_name']
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment:
    print(f"Experiment: {experiment_name}")
    print(f"Experiment ID: {experiment.experiment_id}")
    print(f"Artifact Location: {experiment.artifact_location}")
    
    runs = mlflow.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["start_time DESC"],
        max_results=5
    )
    
    print(f"\nRecent Runs ({len(runs)}):")
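    wanted = ['run_id', 'start_time', 'metrics.final_accuracy',
              'metrics.final_demographic_parity_difference']
    # Metric columns exist only if at least one returned run logged them
    display(runs[[c for c in wanted if c in runs.columns]])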
else:
    print(f"Experiment '{experiment_name}' not found")
    print("Run the pipeline first to create the experiment")

### View MLflow UI

To view the MLflow UI, run the following in a terminal:

```bash
mlflow ui --port 5000
```

Then open: http://localhost:5000

## Summary

This demonstration showcased:

- **Declarative Configuration**: YAML-based pipeline control with Pydantic validation
- **Modular Architecture**: Clean separation of concerns across modules
- **Automated Orchestration**: End-to-end execution with a single command
- **MLflow Integration**: Complete experiment tracking and reproducibility
- **Real-World Application**: Loan approval use case in the finance domain

### Next Steps

1. Customize configuration for your specific use case
2. Integrate your own dataset
3. Extend modules with custom functionality
4. Deploy as part of an MLOps pipeline
5. Monitor fairness in production