# Chapter 7 Activity: Differential Privacy in Action

**Estimated Time: 30 minutes**

## Learning Objectives
By the end of this activity, you will be able to:
- Implement basic differential privacy mechanisms
- Analyze privacy-utility trade-offs in real datasets
- Compare different noise mechanisms for privacy protection
- Evaluate privacy budget consumption and composition

## Scenario: Privacy-Preserving Healthcare Analytics

You are a data scientist at MedSecure Analytics, tasked with analyzing patient data while ensuring strict privacy protection. The hospital wants to publish statistics about patient demographics and treatment outcomes, but must comply with HIPAA and implement differential privacy to protect individual patient information.

Your mission: Implement differential privacy mechanisms and analyze the trade-offs between privacy protection and data utility.

## Setup: Import Libraries and Generate Synthetic Patient Data

### What We're About to Do

In this first section, we'll import the essential Python libraries needed for differential privacy implementation. We'll also create a synthetic healthcare dataset that mimics real patient data but is completely safe to experiment with. This includes patient demographics, medical conditions, and treatment information.

The synthetic data generation is important because it allows us to practice differential privacy techniques without any real privacy concerns while learning the concepts.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("SETUP: Differential Privacy Laboratory Initialized")
print("STATUS: Privacy protection mechanisms ready")
print("READY: Synthetic patient data generator prepared")

### Understanding the Output

The output above confirms that all necessary libraries are loaded and ready. We've imported:
- **NumPy & Pandas**: For data manipulation and mathematical operations
- **Matplotlib & Seaborn**: For creating visualizations of privacy-utility trade-offs
- **SciPy**: For statistical functions
- **Warnings**: Suppressed to keep output clean during experimentation

The random seed makes the results reproducible: anyone who runs the cells in order from a fresh kernel gets the same numbers, which makes it easier to follow along and compare outcomes.

In [None]:
# Generate synthetic patient dataset
def generate_patient_data(n_patients=1000):
    """Generate synthetic patient data for privacy analysis"""
    
    # Patient demographics
    ages = np.random.normal(45, 15, n_patients).astype(int)
    ages = np.clip(ages, 18, 90)  # Realistic age range
    
    genders = np.random.choice(['M', 'F'], n_patients, p=[0.48, 0.52])
    
    # Medical conditions (binary indicators)
    diabetes = np.random.binomial(1, 0.12, n_patients)
    hypertension = np.random.binomial(1, 0.25, n_patients)
    heart_disease = np.random.binomial(1, 0.08, n_patients)
    
    # Treatment costs (influenced by conditions)
    base_cost = np.random.exponential(5000, n_patients)
    condition_multiplier = 1 + 0.5 * diabetes + 0.3 * hypertension + 0.8 * heart_disease
    treatment_costs = base_cost * condition_multiplier
    
    # Hospital stay length
    stay_length = np.random.poisson(3, n_patients) + 1
    
    return pd.DataFrame({
        'patient_id': range(1, n_patients + 1),
        'age': ages,
        'gender': genders,
        'diabetes': diabetes,
        'hypertension': hypertension,
        'heart_disease': heart_disease,
        'treatment_cost': treatment_costs,
        'stay_length': stay_length
    })

# Generate the dataset
patient_data = generate_patient_data(1000)
print(f"GENERATED: Dataset with {len(patient_data)} patients")
print("\nDATASET OVERVIEW:")
print(patient_data.describe())
patient_data.head()

### Understanding the Synthetic Dataset

The output shows our synthetic dataset contains realistic healthcare data:
- **1,000 patients** with ages ranging from 18-90 years
- **Medical conditions**: Diabetes (12%), Hypertension (25%), Heart Disease (8%). These are the sampling probabilities set in the generator, chosen as plausible illustrative rates, so the realized proportions will differ slightly
- **Treatment costs**: Influenced by medical conditions, with more complex cases costing more
- **Hospital stays**: Typically 1-7 days with most patients staying 3-4 days

This synthetic data is perfect for learning differential privacy because:
1. It mimics real healthcare patterns without using actual patient data
2. We know the "ground truth" to measure privacy-utility trade-offs
3. We can experiment freely without privacy concerns

Notice how the data shows realistic relationships - patients with diabetes or heart disease have higher treatment costs, demonstrating the complex correlations found in real medical data.

## Part 1: Implementing Basic Differential Privacy Mechanisms (10 minutes)

Let's implement the core differential privacy mechanisms: Laplace and Gaussian noise addition.

### What Are Differential Privacy Mechanisms?

Before we implement the mechanisms, let's understand what we're building:

**Differential Privacy** protects individual privacy by adding carefully calibrated noise to query results. The key insight is that by adding random noise, we can hide whether any specific individual was in the dataset while still providing useful aggregate statistics.

**Two Main Mechanisms:**
1. **Laplace Mechanism**: Adds noise from a Laplace distribution - provides "pure" differential privacy
2. **Gaussian Mechanism**: Adds noise from a Gaussian (normal) distribution - provides "approximate" differential privacy with slightly different guarantees

**Key Concepts:**
- **Epsilon (ε)**: The privacy parameter - smaller values mean stronger privacy but more noise
- **Sensitivity**: How much a single person's data can change the query result
- **Privacy Budget**: Like spending money - each query "costs" some privacy
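
To make these three concepts concrete, here is a minimal sketch of how they interact for a counting query. The helper name `laplace_scale` and the list of per-query ε values are illustrative assumptions, not part of the toolkit built later in this activity.

```python
def laplace_scale(sensitivity, epsilon):
    """Scale b of the Laplace noise for a query with the given sensitivity.

    Hypothetical helper for illustration: b = sensitivity / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

# A counting query has sensitivity 1: one person changes the count by at most 1.
for eps in (0.1, 1.0, 10.0):
    b = laplace_scale(sensitivity=1, epsilon=eps)
    # For Laplace noise, the expected absolute error equals the scale b.
    print(f"epsilon={eps:>4}: noise scale = {b:.1f}")

# Privacy budget: under basic sequential composition the epsilons add up.
query_epsilons = [0.1, 0.1, 0.1]
total_epsilon = sum(query_epsilons)
print(f"total epsilon spent: {total_epsilon:.1f}")
```

Smaller ε gives a larger noise scale, and three queries at ε = 0.1 spend about 0.3 of the budget in total.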

In [None]:
class DifferentialPrivacyMechanisms:
    """Implementation of core differential privacy mechanisms"""
    
    def __init__(self):
        self.privacy_budget_used = 0.0
        self.query_history = []
    
    def laplace_mechanism(self, true_value, sensitivity, epsilon):
        """Add Laplace noise for pure differential privacy"""
        scale = sensitivity / epsilon
        noise = np.random.laplace(0, scale)
        
        # Update privacy accounting
        self.privacy_budget_used += epsilon
        self.query_history.append({
            'mechanism': 'Laplace',
            'epsilon': epsilon,
            'sensitivity': sensitivity,
            'noise_scale': scale
        })
        
        return true_value + noise
    
    def gaussian_mechanism(self, true_value, sensitivity, epsilon, delta=1e-5):
        """Add Gaussian noise for (ε,δ)-differential privacy"""
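        # Noise scale from the classical Gaussian-mechanism bound.
        # Note: this bound is stated for 0 < epsilon < 1, and `sensitivity`
        # should be the L2 sensitivity of the query.
        scale = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        noise = np.random.normal(0, scale)
        
        # Update privacy accounting (only epsilon is summed here;
        # delta is recorded per query in the history below)
        self.privacy_budget_used += epsilon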
        self.query_history.append({
            'mechanism': 'Gaussian',
            'epsilon': epsilon,
            'delta': delta,
            'sensitivity': sensitivity,
            'noise_scale': scale
        })
        
        return true_value + noise
    
    def private_count(self, data, epsilon):
        """Count query with differential privacy"""
        true_count = len(data)
        sensitivity = 1  # Adding/removing one person changes count by 1
        return self.laplace_mechanism(true_count, sensitivity, epsilon)
    
    def private_mean(self, data, data_range, epsilon):
        """Mean query with differential privacy"""
        # Clamp data to known range
        clamped_data = np.clip(data, data_range[0], data_range[1])
        true_mean = np.mean(clamped_data)
        
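        # Sensitivity of a mean over values bounded in data_range, treating
        # the dataset size n as public: changing one record moves the mean
        # by at most (max - min) / n
        sensitivity = (data_range[1] - data_range[0]) / len(data)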
        
        return self.laplace_mechanism(true_mean, sensitivity, epsilon)
    
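    def private_histogram(self, data, bins, epsilon):
        """Histogram query with differential privacy"""
        # Create true histogram
        hist, bin_edges = np.histogram(data, bins=bins)
        
        # Bins are disjoint, so adding/removing one person changes exactly one
        # bin count by 1 (sensitivity = 1 for the whole histogram). By parallel
        # composition each bin can use the full epsilon, and the histogram is
        # charged epsilon once rather than once per bin.
        sensitivity = 1
        scale = sensitivity / epsilon
        noisy_hist = hist + np.random.laplace(0, scale, size=len(hist))
        
        # Update privacy accounting (one entry for the whole histogram)
        self.privacy_budget_used += epsilon
        self.query_history.append({
            'mechanism': 'Laplace (histogram)',
            'epsilon': epsilon,
            'sensitivity': sensitivity,
            'noise_scale': scale
        })
        
        # Clamping at zero is post-processing and does not affect privacy
        return np.maximum(noisy_hist, 0), bin_edges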
    
    def get_privacy_report(self):
        """Generate privacy budget usage report"""
        return {
            'total_epsilon_used': self.privacy_budget_used,
            'queries_executed': len(self.query_history),
            'query_details': self.query_history
        }

# Initialize our differential privacy toolkit
dp = DifferentialPrivacyMechanisms()
print("INITIALIZED: Differential Privacy mechanisms initialized")
print("READY: Ready to execute private queries")

### Understanding the Implementation

We've just built a small differential privacy toolkit. Here's what each component does:

**Core Mechanisms:**
- `laplace_mechanism()`: Adds Laplace-distributed noise for pure ε-differential privacy
- `gaussian_mechanism()`: Adds Gaussian noise for (ε,δ)-differential privacy

**Query Types:**
- `private_count()`: Counts records with noise (sensitivity = 1 because adding/removing one person changes the count by at most 1)
- `private_mean()`: Calculates averages with privacy protection by first clamping data to known ranges
- `private_histogram()`: Creates histograms where each bin gets independent noise

**Privacy Accounting:**
The class tracks every query executed and the privacy budget consumed. This is crucial because:
1. Privacy "compounds" - multiple queries reveal more information
2. We need to stay within our total privacy budget
3. Regulatory compliance often requires detailed privacy auditing

The implementation handles the complex mathematics behind differential privacy while providing a simple interface for data analysts to use.
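
One property of histogram queries is worth checking directly: because bins are disjoint, removing one person changes exactly one bin count by 1. The sketch below illustrates this with fixed, data-independent bin edges. The variable names and the choice of edges are assumptions made for the illustration (the toolkit above passes an integer `bins`, which lets NumPy derive the edges from the data).

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 91, size=1000)   # synthetic ages in [18, 90]
bin_edges = np.arange(18, 99, 8)         # fixed edges: 18, 26, ..., 98 (10 bins)

hist_with, _ = np.histogram(ages, bins=bin_edges)
hist_without, _ = np.histogram(ages[1:], bins=bin_edges)  # same data minus one person

# Compare the two histograms bin by bin.
changed = np.abs(hist_with - hist_without)
print("bins changed:", int(np.count_nonzero(changed)))  # one bin
print("total change:", int(changed.sum()))              # by one count
```

This is why a histogram over disjoint bins is commonly treated as a single query with sensitivity 1 (parallel composition), rather than as one query per bin.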

### Exercise 1: Basic Private Queries

Let's execute some basic private queries on our patient dataset.

### Getting Ready for Your First Private Queries

Now we'll execute three basic queries to see differential privacy in action. Each query uses ε = 0.1, which provides strong privacy protection but will add noticeable noise to the results.

We'll compare the true (non-private) results with the private results to see the privacy-utility trade-off in practice. Remember: the noise is random, so re-running this cell without resetting the seed will give slightly different results each time.

In [None]:
# Exercise 1: Execute basic private queries
print("EXERCISE 1: Basic Private Queries")
print("=" * 50)

# Query 1: Total number of patients
true_patient_count = len(patient_data)
private_patient_count = dp.private_count(patient_data, epsilon=0.1)

print(f"PATIENT COUNT:")
print(f"   True count: {true_patient_count}")
print(f"   Private count: {private_patient_count:.1f}")
print(f"   Error: {abs(true_patient_count - private_patient_count):.1f}")

# Query 2: Average age
true_avg_age = patient_data['age'].mean()
private_avg_age = dp.private_mean(patient_data['age'], data_range=[18, 90], epsilon=0.1)

print(f"\nAVERAGE AGE:")
print(f"   True average: {true_avg_age:.2f} years")
print(f"   Private average: {private_avg_age:.2f} years")
print(f"   Error: {abs(true_avg_age - private_avg_age):.2f} years")

# Query 3: Number of diabetes patients
true_diabetes_count = patient_data['diabetes'].sum()
private_diabetes_count = dp.private_count(patient_data[patient_data['diabetes'] == 1], epsilon=0.1)

print(f"\nDIABETES PATIENTS:")
print(f"   True count: {true_diabetes_count}")
print(f"   Private count: {private_diabetes_count:.1f}")
print(f"   Error: {abs(true_diabetes_count - private_diabetes_count):.1f}")

# Check privacy budget usage
print(f"\nPRIVACY BUDGET USED: ε = {dp.privacy_budget_used:.2f}")

### Analyzing Your First Private Query Results

Look at the results above - this is differential privacy in action! Notice several important patterns:

**Error Analysis:**
- Each query shows some error between the true and private results
- The error varies because we're adding random noise for privacy protection
- With ε = 0.1 (strong privacy), the Laplace noise scale for a count query is 1/0.1 = 10, so count errors average about 10 and exceed about 30 only 5% of the time

**Privacy Budget Consumption:**
- We've used ε = 0.3 total (0.1 for each of the three queries)
- This "composition" means our privacy guarantees are now ε = 0.3 overall
- Every additional query adds its own ε to this running total

**Key Insights:**
- **Patient Count**: An error of about 10 on a count of 1,000 is usually acceptable for aggregate statistics
- **Average Age**: The noise scale is only (90 - 18) / 1000 / 0.1 = 0.72 years, so the error is typically around a year or less
- **Diabetes Count**: The same absolute error of about 10 is a larger relative error on a count of roughly 120, which may still be reasonable for population studies

The fundamental trade-off is clear: more privacy protection (lower ε) means higher errors, but better protection for individual patients.
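
The errors observed in Exercise 1 can be compared against what Laplace noise predicts. The following is a minimal sketch, assuming the same sensitivities used by `private_count()` and `private_mean()`. The helper `laplace_error_summary` is a hypothetical name introduced here for illustration.

```python
import math

def laplace_error_summary(sensitivity, epsilon, quantile=0.95):
    """Return (expected |noise|, quantile of |noise|) for Laplace noise.

    For Laplace noise with scale b, |noise| is exponentially distributed
    with mean b, so its q-quantile is -b * ln(1 - q).
    """
    b = sensitivity / epsilon
    return b, -b * math.log(1 - quantile)

epsilon = 0.1
n = 1000  # dataset size used in this activity

# Count queries: sensitivity 1
count_mean, count_p95 = laplace_error_summary(1, epsilon)
# Mean age: values clamped to [18, 90], so sensitivity is (90 - 18) / n
age_mean, age_p95 = laplace_error_summary((90 - 18) / n, epsilon)

print(f"count: mean error {count_mean:.1f}, 95th percentile {count_p95:.1f}")
print(f"age:   mean error {age_mean:.2f}, 95th percentile {age_p95:.2f}")
```

A single run can land anywhere in these ranges, so one observed error says little on its own. The averages over repeated trials in Part 2 are a better guide.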

## Part 2: Privacy-Utility Trade-off Analysis (10 minutes)

Now let's analyze how different privacy parameters (epsilon values) affect the accuracy of our results.

### Why Study Privacy-Utility Trade-offs?

The next analysis will systematically explore how different privacy levels (ε values) affect query accuracy. This is crucial for real-world applications because:

- **Business Decisions:** Organizations need to balance privacy protection with data utility for decision-making
- **Regulatory Compliance:** Understanding the trade-offs helps choose appropriate privacy parameters for different types of sensitive data
- **Risk Assessment:** Different queries may require different privacy levels based on sensitivity

We'll test ε values from 0.01 (very strong privacy) to 10.0 (weak privacy) and measure the average error across 20 trials per value. This gives us an empirical picture of the privacy-utility relationship.

In [None]:
def analyze_privacy_utility_tradeoff():
    """Analyze the trade-off between privacy and utility"""
    
    # Test different epsilon values from very private to less private
    epsilon_values = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
    
    # True values for comparison
    true_count = len(patient_data)
    true_avg_age = patient_data['age'].mean()
    true_avg_cost = patient_data['treatment_cost'].mean()
    
    results = []
    
    for epsilon in epsilon_values:
        # Run multiple trials to get average error
        trials = 20
        count_errors = []
        age_errors = []
        cost_errors = []
        
        for _ in range(trials):
            # Fresh DP instance for each trial
            dp_trial = DifferentialPrivacyMechanisms()
            
            # Execute private queries
            private_count = dp_trial.private_count(patient_data, epsilon)
            private_avg_age = dp_trial.private_mean(patient_data['age'], [18, 90], epsilon)
            private_avg_cost = dp_trial.private_mean(patient_data['treatment_cost'], [0, 50000], epsilon)
            
            # Calculate absolute errors
            count_errors.append(abs(true_count - private_count))
            age_errors.append(abs(true_avg_age - private_avg_age))
            cost_errors.append(abs(true_avg_cost - private_avg_cost))
        
        # Store results
        results.append({
            'epsilon': epsilon,
            'count_error': np.mean(count_errors),
            'age_error': np.mean(age_errors),
            'cost_error': np.mean(cost_errors),
            'count_std': np.std(count_errors),
            'age_std': np.std(age_errors),
            'cost_std': np.std(cost_errors)
        })
    
    return pd.DataFrame(results)

# Perform trade-off analysis
print("INITIALIZING: Analyzing Privacy-Utility Trade-offs...")
tradeoff_results = analyze_privacy_utility_tradeoff()

print("\nRESULTS: Privacy-Utility Trade-off Results:")
print(tradeoff_results.round(2))

### Understanding the Trade-off Results

The table above shows how query accuracy changes with different privacy levels:

**Key Patterns:**
- **Lower ε (e.g., 0.01)**: Very high errors but maximum privacy protection
- **Higher ε (e.g., 10.0)**: Much lower errors but weaker privacy protection
- **Standard deviation**: Shows the variability in errors across multiple trials

**Real-World Interpretation:**
- **ε = 0.1**: Suitable for highly sensitive data where privacy is paramount
- **ε = 1.0**: A common choice balancing privacy and utility for many applications
- **ε = 5.0+**: Appropriate when some privacy protection is needed but accuracy is critical

Notice how different query types have different error sensitivities. Treatment costs show higher absolute errors because the clamping range, and therefore the sensitivity, is larger ($0-$50,000 versus 18-90 years).

In [None]:
# Visualize the privacy-utility trade-off
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Count Error vs Epsilon
axes[0].errorbar(tradeoff_results['epsilon'], tradeoff_results['count_error'], 
                yerr=tradeoff_results['count_std'], marker='o', capsize=5)
axes[0].set_xlabel('Privacy Parameter (ε)')
axes[0].set_ylabel('Average Count Error')
axes[0].set_title('Patient Count Query Error')
axes[0].set_xscale('log')
axes[0].grid(True, alpha=0.3)

# Plot 2: Age Error vs Epsilon
axes[1].errorbar(tradeoff_results['epsilon'], tradeoff_results['age_error'], 
                yerr=tradeoff_results['age_std'], marker='s', capsize=5, color='orange')
axes[1].set_xlabel('Privacy Parameter (ε)')
axes[1].set_ylabel('Average Age Error (years)')
axes[1].set_title('Average Age Query Error')
axes[1].set_xscale('log')
axes[1].grid(True, alpha=0.3)

# Plot 3: Cost Error vs Epsilon
axes[2].errorbar(tradeoff_results['epsilon'], tradeoff_results['cost_error'], 
                yerr=tradeoff_results['cost_std'], marker='^', capsize=5, color='green')
axes[2].set_xlabel('Privacy Parameter (ε)')
axes[2].set_ylabel('Average Cost Error ($)')
axes[2].set_title('Treatment Cost Query Error')
axes[2].set_xscale('log')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("KEY OBSERVATIONS:")
print("• Lower ε (more privacy) → Higher query error")
print("• Higher ε (less privacy) → Lower query error")
print("• Error is inversely proportional to ε (noise scale = sensitivity / ε)")
print("• Different queries have different error sensitivities")

### Interpreting the Privacy-Utility Visualizations

The three charts above reveal the fundamental privacy-utility relationship:

**Chart Analysis:**
1. **Left (Count Error)**: Patient count errors scale as 1/ε: about 100 at ε = 0.01 but only about 1 at ε = 1.0, against a true count of 1,000
2. **Middle (Age Error)**: Age queries show moderate errors - a few years difference is often acceptable for population studies
3. **Right (Cost Error)**: Treatment cost queries show the highest absolute errors due to the large value range ($0-$50,000)

**Key Business Insights:**
- **Logarithmic Scale**: The x-axis uses a log scale because the tested ε values span three orders of magnitude (0.01 to 10). Expected error is inversely proportional to ε, so doubling ε halves the error
- **Error Bars**: Show the uncertainty/variability in results - important for setting expectations with stakeholders
- **Sweet Spot**: Many organizations find ε ≈ 1.0 provides a good balance for most use cases

**Decision Framework:**
- High-stakes decisions may need ε ≥ 2.0 for acceptable accuracy
- Population research might work well with ε = 0.5-1.0  
- Public datasets with strong privacy requirements might use ε ≤ 0.1
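
The decision framework above can be turned around: given an accuracy target, what ε does a Laplace-based query need? The sketch below is one way to estimate this. The helper `epsilon_for_target_error` and the example targets (within 5 patients, within $100) are assumptions chosen for illustration.

```python
import math

def epsilon_for_target_error(sensitivity, target_error, confidence=0.95):
    """Smallest epsilon such that Laplace noise stays within target_error
    with the given probability (hypothetical planning helper).

    P(|noise| <= t) = 1 - exp(-t / b) with b = sensitivity / epsilon,
    so epsilon = sensitivity * ln(1 / (1 - confidence)) / t.
    """
    if target_error <= 0 or not 0 < confidence < 1:
        raise ValueError("target_error must be > 0 and confidence in (0, 1)")
    return sensitivity * math.log(1 / (1 - confidence)) / target_error

n = 1000  # dataset size used in this activity

# Patient count accurate to within 5 patients, 95% of the time
eps_count = epsilon_for_target_error(1, target_error=5)
# Mean treatment cost (clamped to [0, 50000]) accurate to within $100
eps_cost = epsilon_for_target_error(50000 / n, target_error=100)

print(f"count within 5:   epsilon >= {eps_count:.2f}")
print(f"cost within $100: epsilon >= {eps_cost:.2f}")
```

Working backward from an accuracy requirement like this makes the cost of each query explicit before any budget is spent.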

## Part 3: Comparing Noise Mechanisms (5 minutes)

Let's compare the Laplace and Gaussian mechanisms for differential privacy.

### Laplace vs. Gaussian: Choosing Your Noise Mechanism

Different noise mechanisms provide different privacy guarantees and have different mathematical properties. Understanding when to use each is important for real-world implementations.

**Laplace Mechanism:**
- Provides "pure" ε-differential privacy
- Has heavier tails than a Gaussian with the same variance (extreme values are more likely)
- Simpler mathematical analysis
- Common choice for single, low-dimensional queries

**Gaussian Mechanism:**
- Provides (ε,δ)-differential privacy, a relaxation that tolerates a small additional failure probability δ
- Has lighter tails, but needs a larger noise scale than Laplace for a single scalar query at the same ε
- Better composition properties for multiple queries
- Often preferred for complex query sequences

We'll compare both mechanisms using the same privacy parameter to see their practical differences.
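
Before running the simulation, the two noise scales can be computed directly. This is a minimal sketch using the same calibration formulas as the toolkit above. The function names are illustrative, and ε = 0.5 is chosen here because the classical Gaussian bound is stated for ε < 1.

```python
import math

def laplace_std(sensitivity, epsilon):
    """Standard deviation of Laplace noise with scale sensitivity / epsilon."""
    return math.sqrt(2) * sensitivity / epsilon

def gaussian_std(sensitivity, epsilon, delta):
    """Sigma from the classical Gaussian-mechanism bound (stated for epsilon < 1)."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

eps, delta = 0.5, 1e-5
lap = laplace_std(1.0, eps)
gau = gaussian_std(1.0, eps, delta)

print(f"Laplace std:  {lap:.2f}")
print(f"Gaussian std: {gau:.2f}")
print(f"ratio:        {gau / lap:.2f}")
```

For a single scalar query the Gaussian noise is several times wider than the Laplace noise at the same ε, and the ratio does not depend on ε.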

In [None]:
def compare_noise_mechanisms(epsilon=1.0, n_trials=1000):
    """Compare Laplace vs Gaussian noise mechanisms"""
    
    true_value = 100.0  # The value we want to privatize
    sensitivity = 1.0   # How much one person can change the result
    delta = 1e-5       # Small probability parameter for Gaussian
    
    laplace_results = []
    gaussian_results = []
    
    for _ in range(n_trials):
        # Laplace mechanism
        laplace_scale = sensitivity / epsilon
        laplace_noise = np.random.laplace(0, laplace_scale)
        laplace_results.append(true_value + laplace_noise)
        
        # Gaussian mechanism  
        gaussian_scale = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        gaussian_noise = np.random.normal(0, gaussian_scale)
        gaussian_results.append(true_value + gaussian_noise)
    
    return np.array(laplace_results), np.array(gaussian_results)

# Compare the two mechanisms
print("COMPARING: Noise Mechanisms")
print("=" * 40)

laplace_vals, gaussian_vals = compare_noise_mechanisms(epsilon=1.0)
true_value = 100.0

# Calculate error statistics
laplace_error = np.abs(laplace_vals - true_value)
gaussian_error = np.abs(gaussian_vals - true_value)

print(f"Mechanism Comparison (ε = 1.0):")
print(f"\nLaplace Mechanism:")
print(f"   Mean error: {np.mean(laplace_error):.2f}")
print(f"   Std error: {np.std(laplace_error):.2f}")
print(f"   95th percentile error: {np.percentile(laplace_error, 95):.2f}")

print(f"\nGaussian Mechanism:")
print(f"   Mean error: {np.mean(gaussian_error):.2f}")
print(f"   Std error: {np.std(gaussian_error):.2f}")
print(f"   95th percentile error: {np.percentile(gaussian_error, 95):.2f}")

# Create visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(laplace_vals, bins=50, alpha=0.7, label='Laplace', density=True)
plt.hist(gaussian_vals, bins=50, alpha=0.7, label='Gaussian', density=True)
plt.axvline(true_value, color='red', linestyle='--', label='True Value')
plt.xlabel('Privatized Value')
plt.ylabel('Density')
plt.title('Distribution of Privatized Values')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.hist(laplace_error, bins=50, alpha=0.7, label='Laplace Error', density=True)
plt.hist(gaussian_error, bins=50, alpha=0.7, label='Gaussian Error', density=True)
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.title('Distribution of Errors')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKEY DIFFERENCES:")
print("• Laplace: Pure ε-DP, heavier tails, simpler analysis")
print("• Gaussian: (ε,δ)-DP, lighter tails, better composition properties")
print("• Choice depends on privacy requirements and composition needs")

### Understanding the Mechanism Comparison Results

The analysis above reveals important practical differences between the two mechanisms:

**Statistical Differences:**
- **Mean Error**: At the same ε, the Gaussian mechanism shows a noticeably larger average error for this single scalar query. With ε = 1.0 and δ = 1e-5, the Gaussian standard deviation is about 4.8, versus a Laplace scale of 1.0 (expected absolute error of 1.0)
- **Error Distribution**: Laplace noise has heavier tails than a Gaussian of equal variance, but here the Gaussian noise is several times wider overall, so its errors are more spread out
- **95th Percentile**: The error exceeded in only 5% of trials: about 3 for Laplace and about 9.5 for Gaussian

**Practical Implications:**
- **Laplace**: Better for single low-sensitivity queries, and when you want pure ε-DP without a δ parameter
- **Gaussian**: Better when you're running many queries and need good composition properties
- **Caveat**: The classical Gaussian calibration used here is stated for ε < 1, so ε = 1.0 sits at the edge of its validity

**Visualization Insights:**
- **Left Chart**: Shows the full distribution of privatized values around the true value (100)
- **Right Chart**: Focuses on error magnitudes. The Laplace errors are concentrated near zero, while the Gaussian errors extend much further

**Choosing Between Them:**
- Use **Laplace** for simple, one-off queries or when regulations require pure ε-DP
- Use **Gaussian** for complex analytics workflows with multiple related queries
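
To see where the Gaussian mechanism starts to pay off, consider releasing k counting queries at once. The sketch below assumes one person can change each of the k answers by at most 1, so the L1 sensitivity is k and the L2 sensitivity is sqrt(k). The function name and parameter values are illustrative, and the comparison is only indicative because the two mechanisms give different guarantees (pure ε-DP versus (ε,δ)-DP).

```python
import math

def per_query_std(k, epsilon, delta):
    """Noise std per answer when releasing k counting queries together.

    Laplace is calibrated to L1 sensitivity k; Gaussian (classical bound)
    is calibrated to L2 sensitivity sqrt(k).
    """
    laplace = math.sqrt(2) * k / epsilon
    gaussian = math.sqrt(k) * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return laplace, gaussian

eps, delta = 0.5, 1e-5
for k in (1, 4, 16, 64):
    lap, gau = per_query_std(k, eps, delta)
    print(f"k={k:>2}: Laplace std = {lap:7.2f}, Gaussian std = {gau:7.2f}")

# Smallest k at which the Gaussian noise is narrower than the Laplace noise
crossover = next(
    k for k in range(1, 1000)
    if per_query_std(k, eps, delta)[1] < per_query_std(k, eps, delta)[0]
)
print("Gaussian becomes narrower at k =", crossover)
```

Under these assumptions Laplace noise grows linearly in k while Gaussian noise grows with sqrt(k), which is the sense in which Gaussian "composes better" for larger query workloads.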

## Part 4: Privacy Budget Management (5 minutes)

Let's explore how to manage privacy budgets and understand composition effects.

### Why Privacy Budget Management Matters

In real-world applications, organizations rarely run just one query. They need ongoing analytics while maintaining privacy protection. This creates a critical challenge: **privacy budget composition**.

**The Problem:**
- Each query consumes some privacy budget (ε)
- Multiple queries compound the privacy loss
- Without tracking, you might accidentally exceed safe privacy levels

**The Solution:**
We'll implement a privacy accountant, based on simple additive (sequential) composition, that:
- Tracks every query and its privacy cost
- Prevents budget overspending
- Provides detailed audit trails for compliance
- Visualizes budget consumption over time

This mirrors real-world systems where privacy engineers must carefully manage privacy budgets across entire organizations or research projects.

In [None]:
class AdvancedPrivacyAccountant:
    """Advanced privacy budget accounting with composition"""
    
    def __init__(self, total_budget=1.0):
        self.total_budget = total_budget
        self.remaining_budget = total_budget
        self.query_log = []
    
    def execute_query(self, query_name, epsilon_cost, query_function, *args, **kwargs):
        """Execute a query with privacy budget checking"""
        
        if epsilon_cost > self.remaining_budget:
            raise ValueError(f"Insufficient privacy budget! Need {epsilon_cost}, have {self.remaining_budget}")
        
        # Execute the query
        result = query_function(*args, **kwargs)
        
        # Update budget
        self.remaining_budget -= epsilon_cost
        
        # Log the query
        self.query_log.append({
            'query_name': query_name,
            'epsilon_cost': epsilon_cost,
            'remaining_budget': self.remaining_budget,
            'timestamp': len(self.query_log) + 1
        })
        
        print(f"EXECUTED: Query '{query_name}' executed (ε={epsilon_cost})")
        print(f"   Remaining budget: {self.remaining_budget:.3f}")
        
        return result
    
    def get_budget_report(self):
        """Generate comprehensive budget report"""
        used_budget = self.total_budget - self.remaining_budget
        
        print("\nREPORT: Privacy Budget Report")
        print("=" * 30)
        print(f"Total budget: {self.total_budget}")
        print(f"Used budget: {used_budget:.3f} ({used_budget/self.total_budget*100:.1f}%)")
        print(f"Remaining budget: {self.remaining_budget:.3f}")
        print(f"Queries executed: {len(self.query_log)}")
        
        if self.query_log:
            print("\nHISTORY: Query History:")
            for query in self.query_log:
                print(f"  {query['timestamp']}. {query['query_name']} (ε={query['epsilon_cost']})")
    
    def visualize_budget_usage(self):
        """Visualize privacy budget consumption over time"""
        if not self.query_log:
            print("No queries executed yet!")
            return
        
        # Calculate cumulative budget usage
        timestamps = [0] + [q['timestamp'] for q in self.query_log]
        remaining = [self.total_budget] + [q['remaining_budget'] for q in self.query_log]
        
        plt.figure(figsize=(10, 6))
        plt.plot(timestamps, remaining, marker='o', linewidth=2, markersize=6)
        plt.axhline(y=0, color='red', linestyle='--', alpha=0.7, label='Budget Exhausted')
        plt.xlabel('Query Number')
        plt.ylabel('Remaining Privacy Budget (ε)')
        plt.title('Privacy Budget Consumption Over Time')
        plt.grid(True, alpha=0.3)
        plt.legend()
        
        # Add query labels
        for i, query in enumerate(self.query_log):
            plt.annotate(query['query_name'], 
                        (query['timestamp'], query['remaining_budget']),
                        xytext=(5, 5), textcoords='offset points',
                        fontsize=8, alpha=0.8)
        
        plt.tight_layout()
        plt.show()

# Initialize privacy accountant with budget
accountant = AdvancedPrivacyAccountant(total_budget=1.0)
print("INITIALIZED: Privacy Budget Manager initialized with ε = 1.0")

### Understanding the Privacy Accountant

We've just implemented a simple privacy budget manager. Here's what it provides:

**Core Features:**
- **Budget Checking**: Prevents queries that would exceed the remaining privacy budget
- **Query Logging**: Maintains a complete audit trail of all privacy-consuming operations
- **Real-time Tracking**: Shows remaining budget after each query
- **Compliance Reporting**: Generates detailed reports for regulatory requirements

**Key Methods:**
- `execute_query()`: Safely runs queries while checking budget constraints
- `get_budget_report()`: Provides comprehensive budget usage statistics  
- `visualize_budget_usage()`: Creates charts showing budget consumption over time

**Real-World Applications:**
- **Healthcare**: Tracking privacy budget across multiple research studies
- **Finance**: Managing privacy across different analytical teams
- **Government**: Ensuring public dataset releases stay within privacy limits
- **Technology**: Balancing user privacy with product analytics

An accountant along these lines could be integrated into production data systems to enforce privacy budgets automatically.

In [None]:
# Exercise: Execute a series of queries while managing budget
print("Exercise: Privacy Budget Management")
print("=" * 45)

# Create a fresh DP mechanism for budget-tracked queries
dp_managed = DifferentialPrivacyMechanisms()

try:
    # Query 1: Patient count (ε = 0.2)
    result1 = accountant.execute_query(
        "Patient Count", 0.2,
        dp_managed.private_count, patient_data, 0.2
    )
    
    # Query 2: Average age (ε = 0.3)  
    result2 = accountant.execute_query(
        "Average Age", 0.3,
        dp_managed.private_mean, patient_data['age'], [18, 90], 0.3
    )
    
    # Query 3: Diabetes prevalence (ε = 0.2)
    diabetes_patients = patient_data[patient_data['diabetes'] == 1]
    result3 = accountant.execute_query(
        "Diabetes Count", 0.2,
        dp_managed.private_count, diabetes_patients, 0.2
    )
    
    # Query 4: Age histogram (ε = 0.25)
    result4 = accountant.execute_query(
        "Age Histogram", 0.25,
        dp_managed.private_histogram, patient_data['age'], 10, 0.25
    )
    
    # Try one more query that might exceed budget
    print("\nAttempting query that might exceed budget...")
    result5 = accountant.execute_query(
        "Treatment Cost Average", 0.3,
        dp_managed.private_mean, patient_data['treatment_cost'], [0, 50000], 0.3
    )
    
except ValueError as e:
    print(f"Budget exceeded: {e}")

# Generate budget report
accountant.get_budget_report()

# Visualize budget usage
print("\nPrivacy Budget Usage Visualization:")
accountant.visualize_budget_usage()

### Analyzing the Budget Management Results

The exercise above demonstrates several critical aspects of privacy budget management:

**What Just Happened:**
- **Successful Queries**: The first 4 queries executed successfully, consuming ε = 0.95 total
- **Budget Protection**: The 5th query was blocked because it would exceed our ε = 1.0 budget
- **Audit Trail**: Every query was logged with its privacy cost and sequence number (stored in the `timestamp` field)
- **Real-time Monitoring**: You could see the remaining budget decrease after each query

**Key Lessons:**
1. **Planning is Essential**: You need to plan your query budget allocation in advance
2. **Prioritization Matters**: Run the most important queries first in case you run out of budget
3. **Automatic Protection**: The system prevents accidental privacy budget overspending
4. **Transparency**: Complete audit trails support regulatory compliance and internal governance
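
One way to act on the planning lesson is to check the plan before spending anything. The five queries attempted above request ε = 1.25 in total against a budget of ε = 1.0. The sketch below shrinks every query's ε by the same factor so the whole plan fits. The helper `rescale_plan` is hypothetical and not part of the activity's accountant, and uniform scaling is only one possible policy.

```python
# Planned per-query epsilons from the exercise; they sum to 1.25.
planned = {
    "Patient Count": 0.2,
    "Average Age": 0.3,
    "Diabetes Count": 0.2,
    "Age Histogram": 0.25,
    "Treatment Cost Average": 0.3,
}
total_budget = 1.0


def rescale_plan(plan, budget):
    """Hypothetical helper: scale all epsilons uniformly so they sum to the budget."""
    total = sum(plan.values())
    if total <= budget:
        return dict(plan)  # the plan already fits, so leave it unchanged
    factor = budget / total
    return {name: eps * factor for name, eps in plan.items()}


scaled = rescale_plan(planned, total_budget)
for name, eps in scaled.items():
    print(f"{name:<24} planned ε = {planned[name]:.2f} -> scaled ε = {eps:.2f}")
print(f"Total after rescaling: ε = {sum(scaled.values()):.2f}")
```

The trade-off is that every query now runs, but each one receives a smaller ε and therefore more noise. If some statistics matter more than others, dropping or shrinking only the low-priority queries may be a better choice.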

**Professional Applications:**
- **Research Teams**: Allocate budget across different research questions
- **Business Analytics**: Balance operational reporting with exploratory analysis  
- **Regulatory Compliance**: Demonstrate privacy protection measures to regulators
- **Risk Management**: Prevent privacy budget exhaustion that could halt critical operations

The visualization shows how the privacy budget was depleted query by query. Privacy engineers monitor this kind of consumption curve in production systems.

### Hands-On Budget Management Exercise

Now we'll simulate a realistic scenario where a data analyst needs to run multiple queries while staying within their privacy budget. We start with a total budget of ε = 1.0 and execute a series of queries whose individual costs range from ε = 0.2 to ε = 0.3.

**The Scenario:**
You're a healthcare data analyst who needs to generate a weekly report. You have a limited privacy budget and need to prioritize the most important queries. Watch how the budget decreases with each query and what happens when you try to exceed the limit.

**Query Plan:**
1. Patient count (ε = 0.2) - Basic statistic
2. Average age (ε = 0.3) - Demographics  
3. Diabetes count (ε = 0.2) - Health metric
4. Age histogram (ε = 0.25) - Distribution analysis
5. Treatment costs (ε = 0.3) - Financial analysis

Total planned: ε = 1.25, which exceeds our budget of ε = 1.0. Let's see what happens!
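
The arithmetic behind this plan can be previewed without touching any data. The sketch below walks the plan in order and marks which queries would fit. `preview_plan` is a hypothetical helper, and it assumes basic sequential composition (costs add up) and an accountant that rejects any query whose cost would push the total past the budget. The activity's accountant may differ in its details.

```python
def preview_plan(plan, budget, tol=1e-9):
    """Hypothetical dry run: report which queries fit, in order, without running them."""
    spent = 0.0
    rows = []
    for name, eps in plan:
        # A small tolerance guards against floating-point rounding in the running sum
        fits = spent + eps <= budget + tol
        if fits:
            spent += eps
        rows.append((name, eps, fits, budget - spent))
    return rows, spent


query_plan = [
    ("Patient count", 0.2),
    ("Average age", 0.3),
    ("Diabetes count", 0.2),
    ("Age histogram", 0.25),
    ("Treatment costs", 0.3),
]

rows, spent = preview_plan(query_plan, budget=1.0)
for name, eps, fits, remaining in rows:
    status = "OK" if fits else "BLOCKED"
    print(f"{name:<16} ε = {eps:.2f}  {status:<8} remaining ε = {remaining:.2f}")
print(f"Total spent: ε = {spent:.2f}")
```

A dry run like this makes it clear in advance that the last query will not fit, so you can reorder, drop, or shrink queries before any budget is consumed.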

## Summary and Reflection

### What You've Learned

In this 30-minute hands-on activity, you have:

1. **Implemented Core DP Mechanisms**: You built Laplace and Gaussian noise mechanisms for differential privacy

2. **Analyzed Privacy-Utility Trade-offs**: You discovered how different ε values affect query accuracy

3. **Compared Noise Mechanisms**: You saw the differences between Laplace and Gaussian mechanisms

4. **Managed Privacy Budgets**: You learned how to track and manage privacy budget consumption

### Key Insights

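- **Privacy Parameter (ε)**: Lower values provide stronger privacy but higher query error
- **Sensitivity**: The maximum change one individual's record can cause in a query result; the required noise grows in proportion to it (for the Laplace mechanism, scale = sensitivity / ε)
- **Composition**: Under basic sequential composition, the ε costs of multiple queries on the same data add up
- **Mechanism Choice**: Laplace provides pure ε-DP, while Gaussian provides the relaxed (ε, δ)-DP guarantee; the choice depends on the privacy model and composition needs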
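
To make the first two insights concrete, the sketch below computes the noise scale for a counting query (sensitivity 1) at several ε values. It uses the standard Laplace scale, sensitivity / ε, and the classic Gaussian bound σ = sensitivity · sqrt(2 ln(1.25/δ)) / ε, which holds only for 0 < ε < 1. The function names and the value δ = 1e-5 are assumptions chosen for illustration. They are not taken from the activity's `DifferentialPrivacyMechanisms` class.

```python
import numpy as np


def laplace_scale(sensitivity, epsilon):
    """Laplace noise scale for pure ε-DP (uses L1 sensitivity)."""
    return sensitivity / epsilon


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classic Gaussian-mechanism bound for (ε, δ)-DP (uses L2 sensitivity)."""
    if not 0 < epsilon < 1:
        raise ValueError("this bound assumes 0 < epsilon < 1")
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon


rng = np.random.default_rng(0)
sensitivity = 1.0  # counting query: one patient changes the count by at most 1
delta = 1e-5       # assumed δ, for illustration only

mean_abs_error = {}
for eps in [0.1, 0.25, 0.5]:
    b = laplace_scale(sensitivity, eps)
    sigma = gaussian_sigma(sensitivity, eps, delta)
    # Empirical check: the mean absolute Laplace noise should be close to b
    noise = rng.laplace(0.0, b, size=20_000)
    mean_abs_error[eps] = float(np.abs(noise).mean())
    print(f"ε = {eps:<4} Laplace scale = {b:6.2f}  "
          f"Gaussian σ = {sigma:6.2f}  "
          f"mean |Laplace noise| = {mean_abs_error[eps]:6.2f}")
```

Halving ε doubles the Laplace scale, which is the privacy-utility trade-off in its simplest form. For a single scalar query the Gaussian σ from this bound is larger than the Laplace scale at the same ε.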

### Real-World Applications

- **Healthcare**: Protecting patient data while enabling medical research
- **Finance**: Enabling fraud detection while protecting transaction privacy
- **Government**: Publishing census data with individual privacy protection
- **Technology**: Improving services while protecting user privacy

In [None]:
# BONUS CHALLENGE: Design a Private Analytics Dashboard
print("BONUS CHALLENGE: Design a Private Analytics Dashboard")
print("=" * 60)

print("\nScenario: Hospital Privacy Dashboard")
print("Create a dashboard showing key hospital metrics while protecting patient privacy.")
print("\nYour constraints:")
print("• Total privacy budget: ε = 2.0") 
print("• Must provide 5 key statistics for hospital administrators")
print("• Balance privacy protection with decision-making utility")

print("\nSuggested statistics to consider:")
print("1. Total patient count")
print("2. Average patient age") 
print("3. Percentage with chronic conditions")
print("4. Average treatment cost")
print("5. Average hospital stay length")

print("\nDesign questions to consider:")
print("• Which statistics are most critical for hospital operations?")
print("• How should you allocate your ε = 2.0 budget across the 5 queries?")
print("• What level of error is acceptable for each metric?")

print("\n" + "="*60)
print("IMPLEMENTATION SPACE")
print("Try creating your own private analytics system below:")
print("="*60)

# Your implementation here - example starter code:
def create_hospital_dashboard():
    """
    Create a privacy-preserving hospital dashboard
    
    Recommended approach:
    1. Initialize a new privacy accountant with ε = 2.0 budget
    2. Choose 5 key metrics from the patient data  
    3. Allocate privacy budget based on importance
    4. Execute queries and display results
    5. Show remaining budget
    """
    
    # Example budget allocation (modify as needed):
    # - Patient count: ε = 0.3 (high importance, low sensitivity)
    # - Average age: ε = 0.4 (important for planning)  
    # - Chronic conditions: ε = 0.5 (critical for resource allocation)
    # - Treatment costs: ε = 0.4 (important for financial planning)
    # - Stay length: ε = 0.4 (operational planning)
    # Total: ε = 2.0
    
    print("Dashboard implementation - your code here!")
    
# Call your dashboard function
create_hospital_dashboard()

print("\nActivity Complete!")
print("You now have a foundation for applying differential privacy in real-world scenarios.")
print("Key skills gained:")
print("• Understanding privacy-utility trade-offs")
print("• Implementing DP mechanisms")  
print("• Managing privacy budgets")
print("• Comparing noise mechanisms")
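
If you want a starting point for the bonus challenge, the five steps in the docstring can be sketched as follows. This is one possible shape, not a reference solution. It runs on hypothetical toy data with made-up column names (`chronic`, `stay_days`), uses its own small Laplace helpers instead of the activity's classes, and treats the number of records as public when computing the sensitivity of a mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical toy data; swap in the activity's patient data if the columns match
toy = pd.DataFrame({
    "age": rng.integers(18, 91, size=500),
    "chronic": rng.integers(0, 2, size=500),
    "treatment_cost": rng.uniform(0, 50000, size=500),
    "stay_days": rng.integers(1, 31, size=500),
})


def laplace_count(n, epsilon):
    # Counting query: one patient changes the count by at most 1
    return n + rng.laplace(0.0, 1.0 / epsilon)


def laplace_mean(values, lower, upper, epsilon):
    # Clip to the stated bounds so one record shifts the mean by at most
    # (upper - lower) / n; n is treated as public (a simplifying assumption)
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    return float(clipped.mean() + rng.laplace(0.0, sensitivity / epsilon))


def sketch_dashboard(df, total_budget=2.0):
    # Budget allocation taken from the example in the challenge (sums to 2.0)
    plan = [
        ("Patient count", 0.3, lambda e: laplace_count(len(df), e)),
        ("Average age", 0.4, lambda e: laplace_mean(df["age"].to_numpy(), 18, 90, e)),
        ("Chronic condition rate", 0.5, lambda e: laplace_mean(df["chronic"].to_numpy(), 0, 1, e)),
        ("Average treatment cost", 0.4, lambda e: laplace_mean(df["treatment_cost"].to_numpy(), 0, 50000, e)),
        ("Average stay (days)", 0.4, lambda e: laplace_mean(df["stay_days"].to_numpy(), 1, 30, e)),
    ]
    spent, results = 0.0, {}
    for name, eps, query in plan:
        if spent + eps > total_budget + 1e-9:  # tolerance for float rounding
            raise ValueError(f"{name} would exceed the privacy budget")
        results[name] = query(eps)
        spent += eps
    return results, max(0.0, total_budget - spent)


results, remaining = sketch_dashboard(toy)
for name, value in results.items():
    print(f"{name:<24} {value:12.2f}")
print(f"Remaining budget: ε = {remaining:.2f}")
```

Try changing the allocation and watching how the error on each metric responds; metrics with wide bounds, such as treatment cost, are the most sensitive to a small ε.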

## Chapter 7 Knowledge Assessment

Test your understanding of differential privacy, ethical AI frameworks, and privacy-preserving techniques with our interactive quiz. The quiz covers key concepts from both the chapter content and hands-on activities.

In [None]:
# Launch Chapter 7 Quiz
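import webbrowser
from pathlib import Path

# Resolve the quiz file relative to the current working directory
quiz_path = Path('chapter7_quiz.html').resolve()

if quiz_path.exists():
    # as_uri() builds a valid file:// URL on Windows, macOS, and Linux
    webbrowser.open(quiz_path.as_uri())
    print("Opening Chapter 7 Quiz in your web browser...")
    print("The quiz covers differential privacy, ethical AI, and privacy-preserving techniques.")
    print("Complete the quiz to test your understanding of the chapter content!")
else:
    print(f"Quiz file not found: {quiz_path}")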