# DPtoolkit - Basic Usage Tutorial

This notebook demonstrates the basic workflow for applying differential privacy to a dataset using DPtoolkit.

## What you'll learn:
1. Creating a sample dataset
2. Getting privacy recommendations
3. Configuring protection settings
4. Applying differential privacy
5. Comparing original vs. protected data
6. Visualizing the comparison
7. Exporting results and generating a PDF report

In [None]:
# Install DPtoolkit if not already installed
# !pip install -e ..

In [None]:
import pandas as pd
import numpy as np

# Create a sample healthcare dataset for demonstration
np.random.seed(42)
n_rows = 1000

df = pd.DataFrame({
    'patient_id': [f'P{i:05d}' for i in range(n_rows)],
    'age': np.random.randint(18, 90, n_rows),
    'weight_kg': np.random.normal(70, 15, n_rows).round(1),
    'blood_pressure_systolic': np.random.randint(90, 180, n_rows),
    'diagnosis_code': np.random.choice(['A01', 'B02', 'C03', 'D04', 'E05'], n_rows),
    'department': np.random.choice(['Cardiology', 'Neurology', 'Oncology', 'Pediatrics'], n_rows),
    'visit_date': pd.date_range('2024-01-01', periods=n_rows, freq='h'),  # hourly; the 'H' alias is deprecated since pandas 2.2
})

print(f"Dataset shape: {df.shape}")
df.head()

## 1. Getting Privacy Recommendations

DPtoolkit can automatically analyze your columns and recommend privacy settings based on column names and data patterns.
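
To make the idea of name-based analysis concrete, here is a minimal sketch of a keyword heuristic. It is purely illustrative: the `RULES` table and the `guess_sensitivity` function are hypothetical and are not part of DPtoolkit, whose actual rules (including its data-pattern checks) are not shown in this notebook.

```python
# Hypothetical keyword rules, checked in order: (keywords, sensitivity label).
# These are illustrative assumptions, not DPtoolkit's real rule set.
RULES = [
    (("id", "ssn"), "identifier"),
    (("diagnosis",), "high"),
    (("age", "weight", "blood", "date"), "medium"),
]


def guess_sensitivity(column_name):
    """Return a sensitivity label based only on the tokens in the column name."""
    tokens = column_name.lower().split("_")
    for keywords, label in RULES:
        if any(token in keywords for token in tokens):
            return label
    return "low"


for name in ["patient_id", "age", "diagnosis_code", "department"]:
    print(f"{name:20} -> {guess_sensitivity(name)}")
```

A real advisor would typically combine such name hints with checks on the data itself, for example cardinality or value ranges.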

In [None]:
from dp_toolkit.recommendations.advisor import RecommendationAdvisor

# Get recommendations for the dataset
advisor = RecommendationAdvisor()
recommendations = advisor.recommend_for_dataset(df)

print(f"Recommended total epsilon: {recommendations.total_epsilon:.2f}")
print("\nPer-column recommendations:")
print("-" * 60)

for col, rec in recommendations.column_recommendations.items():
    eps = rec.epsilon_recommendation.epsilon
    sens = rec.epsilon_recommendation.sensitivity.value
    mech = rec.mechanism_recommendation.mechanism.value
    print(f"{col:30} ε={eps:.2f}  sensitivity={sens:8}  mechanism={mech}")

## 2. Configuring Protection Settings

Based on the recommendations, let's configure how to protect each column:
- **EXCLUDE**: Remove sensitive identifiers (patient_id)
- **PROTECT**: Apply DP noise to sensitive data
- **PASSTHROUGH**: Keep non-sensitive data unchanged (department)
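
Each PROTECT column spends part of the privacy budget. As a rough sketch, assuming basic sequential composition (per-column epsilons simply add up), the total for the epsilon values configured in this section can be worked out by hand. The `total_epsilon` helper is hypothetical, and DPtoolkit's own accounting may differ; the `total_epsilon` value reported by the transformer is the one to rely on.

```python
# Per-column epsilons used in this tutorial's configuration.
# EXCLUDE and PASSTHROUGH columns get no noise and are assumed to spend no budget.
column_epsilons = {
    "age": 0.5,
    "weight_kg": 1.0,
    "blood_pressure_systolic": 1.0,
    "diagnosis_code": 0.5,
    "visit_date": 1.0,
}


def total_epsilon(epsilons):
    """Sequential composition: the epsilons of separate releases add up."""
    return sum(epsilons.values())


budget = total_epsilon(column_epsilons)
print(f"Total epsilon under sequential composition: {budget:.2f}")  # 4.00
```

Note that a PASSTHROUGH column costs no budget only because it receives no protection at all, so it should be reserved for data that is genuinely non-sensitive.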

In [None]:
from dp_toolkit.data.transformer import (
    DatasetTransformer,
    DatasetConfig,
    DatasetColumnConfig,
    ProtectionMode
)

# Create configuration
config = DatasetConfig(global_epsilon=1.0)

# Configure each column
config.column_configs['patient_id'] = DatasetColumnConfig(
    mode=ProtectionMode.EXCLUDE  # Remove identifier
)

config.column_configs['age'] = DatasetColumnConfig(
    mode=ProtectionMode.PROTECT,
    epsilon=0.5  # Lower epsilon for more privacy
)

config.column_configs['weight_kg'] = DatasetColumnConfig(
    mode=ProtectionMode.PROTECT,
    epsilon=1.0
)

config.column_configs['blood_pressure_systolic'] = DatasetColumnConfig(
    mode=ProtectionMode.PROTECT,
    epsilon=1.0
)

config.column_configs['diagnosis_code'] = DatasetColumnConfig(
    mode=ProtectionMode.PROTECT,
    epsilon=0.5  # More privacy for diagnosis
)

config.column_configs['department'] = DatasetColumnConfig(
    mode=ProtectionMode.PASSTHROUGH  # Not sensitive, keep as-is
)

config.column_configs['visit_date'] = DatasetColumnConfig(
    mode=ProtectionMode.PROTECT,
    epsilon=1.0
)

print("Configuration ready!")

## 3. Applying Differential Privacy

Now let's apply the DP transformation to create a privacy-protected version of the dataset.
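
For intuition about what "applying DP" means for a numeric column, here is a minimal sketch of the Laplace mechanism applied to individual bounded values. It is an illustration under stated assumptions, not a description of what `DatasetTransformer` does internally: the helper names, the bounds (18 to 90 for age), and the choice of sensitivity as the full value range are all assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def protect_value(value, lower, upper, epsilon, rng):
    """Clip to [lower, upper], then add Laplace noise with scale = sensitivity / epsilon."""
    clipped = min(max(value, lower), upper)
    # Assumption: changing one record can move the value across the whole range.
    sensitivity = upper - lower
    return clipped + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
ages = [23, 45, 67, 89]
protected = [protect_value(a, 18, 90, epsilon=0.5, rng=rng) for a in ages]
print([round(p, 1) for p in protected])
```

Noisy values can land outside the original bounds. Clamping them afterwards is a post-processing step and does not weaken the privacy guarantee.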

In [None]:
# Apply transformation
transformer = DatasetTransformer()
result = transformer.transform(df, config)

protected_df = result.data

print(f"Original shape: {df.shape}")
print(f"Protected shape: {protected_df.shape}")
print(f"\nTotal epsilon used: {result.total_epsilon:.2f}")
print(f"Excluded columns: {result.excluded_columns}")
print(f"Passthrough columns: {result.passthrough_columns}")
print("\nProtected dataset preview:")
protected_df.head()

## 4. Comparing Original vs. Protected Data

Let's analyze how the data changed and check data quality metrics.
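
The metrics reported in this section follow standard definitions. The sketch below spells them out in plain Python: row-aligned MAE and RMSE for numeric columns, and a mean absolute difference of relative category frequencies for categorical ones. The function names are hypothetical, and the frequency formula in particular is an assumption about what a "frequency MAE" usually means. `DatasetComparator` may compute its values differently.

```python
import math
from collections import Counter


def mae(original, protected):
    """Mean absolute error between row-aligned numeric values."""
    return sum(abs(o - p) for o, p in zip(original, protected)) / len(original)


def rmse(original, protected):
    """Root mean squared error between row-aligned numeric values."""
    squared = sum((o - p) ** 2 for o, p in zip(original, protected))
    return math.sqrt(squared / len(original))


def frequency_mae(original, protected):
    """Mean absolute difference in relative frequency over all observed categories."""
    orig_counts, prot_counts = Counter(original), Counter(protected)
    categories = set(orig_counts) | set(prot_counts)
    total_diff = sum(
        abs(orig_counts[c] / len(original) - prot_counts[c] / len(protected))
        for c in categories
    )
    return total_diff / len(categories)


print(mae([1, 2, 3], [2, 2, 5]))                                   # 1.0
print(round(rmse([1, 2, 3], [2, 2, 5]), 3))                        # 1.291
print(round(frequency_mae(["A", "A", "B", "C"], ["A", "B", "B", "C"]), 4))  # 0.1667
```

RMSE penalizes large individual errors more heavily than MAE, so a wide gap between the two suggests a few heavily perturbed values rather than uniform noise.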

In [None]:
from dp_toolkit.analysis.comparator import DatasetComparator

# Drop patient_id from the original so its columns match the protected dataset
original_filtered = df.drop(columns=['patient_id'])

# Compare datasets
comparator = DatasetComparator()
comparison = comparator.compare(
    original=original_filtered,
    protected=protected_df,
    numeric_columns=['age', 'weight_kg', 'blood_pressure_systolic'],
    categorical_columns=['diagnosis_code']
)

print("=" * 60)
print("COMPARISON RESULTS")
print("=" * 60)
print(f"\nRows analyzed: {comparison.row_count}")
print(f"Columns analyzed: {comparison.column_count}")

if comparison.correlation_preservation:
    print(f"\nCorrelation preservation: {comparison.correlation_preservation.preservation_rate:.1%}")

print(f"\nOverall numeric MAE: {comparison.overall_numeric_mae:.4f}")
print(f"Overall numeric RMSE: {comparison.overall_numeric_rmse:.4f}")

In [None]:
# Detailed per-column comparison
print("\nNumeric Column Comparison:")
print("-" * 60)
for num_comp in comparison.numeric_comparisons:
    print(f"\n{num_comp.column_name}:")
    print(f"  MAE: {num_comp.divergence.mae:.4f}")
    print(f"  RMSE: {num_comp.divergence.rmse:.4f}")
    print(f"  Mean difference: {num_comp.divergence.mean_difference:.4f}")
    print(f"  Std difference: {num_comp.divergence.std_difference:.4f}")

In [None]:
print("\nCategorical Column Comparison:")
print("-" * 60)
for cat_comp in comparison.categorical_comparisons:
    print(f"\n{cat_comp.column_name}:")
    print(f"  Categories: {cat_comp.divergence.cardinality_original}")
    print(f"  Frequency MAE: {cat_comp.divergence.frequency_mae:.4f}")
    print(f"  Category drift: {cat_comp.divergence.category_drift:.4f}")

## 5. Visualizing the Comparison

Let's create visualizations to see how the distributions changed.

In [None]:
from dp_toolkit.analysis.visualizer import (
    create_histogram_overlay,
    create_box_comparison,
    create_category_bar_chart
)

# Histogram overlay for age
fig = create_histogram_overlay(
    original=original_filtered['age'],
    protected=protected_df['age'],
    column_name='age'
)
fig.show()

In [None]:
# Box plot comparison for weight
fig = create_box_comparison(
    original=original_filtered['weight_kg'],
    protected=protected_df['weight_kg'],
    column_name='weight_kg'
)
fig.show()

In [None]:
# Category distribution for diagnosis codes
fig = create_category_bar_chart(
    original=original_filtered['diagnosis_code'],
    protected=protected_df['diagnosis_code'],
    column_name='diagnosis_code'
)
fig.show()

## 6. Exporting the Protected Dataset

Finally, let's export the protected data and generate a PDF report.

In [None]:
# Export to CSV
protected_df.to_csv('protected_patients.csv', index=False)
print("Saved: protected_patients.csv")

# Export to Excel with metadata
from dp_toolkit.data.exporter import DataExporter

exporter = DataExporter()
exporter.export_excel(
    transform_result=result,
    path='protected_patients.xlsx',
    include_metadata=True
)
print("Saved: protected_patients.xlsx (with metadata sheet)")

In [None]:
# Generate PDF report
from dp_toolkit.reports.pdf_generator import PDFReportGenerator, ReportMetadata

metadata = ReportMetadata(
    title="Differential Privacy Analysis Report",
    original_filename="patient_data.csv"
)

# Build column configs dict for the report
column_configs_dict = {
    'age': {'mode': 'protect', 'epsilon': 0.5, 'mechanism': 'laplace'},
    'weight_kg': {'mode': 'protect', 'epsilon': 1.0, 'mechanism': 'laplace'},
    'blood_pressure_systolic': {'mode': 'protect', 'epsilon': 1.0, 'mechanism': 'laplace'},
    'diagnosis_code': {'mode': 'protect', 'epsilon': 0.5, 'mechanism': 'exponential'},
    'department': {'mode': 'passthrough'},
    'visit_date': {'mode': 'protect', 'epsilon': 1.0, 'mechanism': 'laplace'},
    'patient_id': {'mode': 'exclude'},
}

generator = PDFReportGenerator()
report = generator.generate(
    original_df=original_filtered,
    protected_df=protected_df,
    comparison=comparison,
    column_configs=column_configs_dict,
    metadata=metadata
)

# Save PDF
with open('privacy_report.pdf', 'wb') as f:
    f.write(report.content)

print("\nGenerated: privacy_report.pdf")
print(f"Pages: {report.page_count}")
print(f"Generation time: {report.generation_time:.2f}s")

## Summary

In this tutorial, we:

1. **Created** a sample healthcare dataset
2. **Got recommendations** for privacy settings based on column names and data patterns
3. **Configured** protection modes (Exclude, Protect, Passthrough)
4. **Applied** differential privacy with specified epsilon values
5. **Compared** original vs. protected data with quality metrics
6. **Visualized** the distribution changes
7. **Exported** the protected data and generated a PDF report

### Key Takeaways:

- Lower epsilon = more privacy = more noise
- As a rule of thumb, keep the total epsilon at 5 or below for sensitive healthcare data
- Always exclude direct identifiers (SSN, patient ID, etc.)
- Use the comparison metrics to verify data utility is preserved
- Generate PDF reports for compliance documentation
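
The first takeaway can be made quantitative. For the Laplace mechanism, the noise scale is sensitivity divided by epsilon, and the expected absolute noise equals that scale. The sketch below tabulates this for a few epsilon values. The `laplace_scale` helper and the age bounds of 18 to 90 are assumptions chosen for illustration, not DPtoolkit settings.

```python
def laplace_scale(sensitivity, epsilon):
    """Laplace noise scale b = sensitivity / epsilon; E|noise| equals b."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


age_range = 90 - 18  # assumed public bounds for the age column

for eps in (0.1, 0.5, 1.0, 5.0):
    scale = laplace_scale(age_range, eps)
    print(f"epsilon={eps:<4} expected |noise| = {scale:.1f} years")
```

Halving epsilon doubles the expected error, which is why the comparison metrics should be re-checked whenever the budget is tightened.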