# Assignment 4 – Privacy-Preserving Federated Learning Simulation

This notebook runs exploratory data analysis (**EDA**) on the scikit-learn breast cancer dataset, creates **non-IID splits**, trains **centralized baselines**, and prepares artifacts for **federated learning (Flower)** with **differential privacy (Opacus)**.


In [None]:
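# Reproducibility
import os
import random
import sys

import numpy as np

SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# Paths: BASE is the project root, one level above this notebook
BASE = ".."
# Make the `src` package in the project root importable from here
sys.path.insert(0, os.path.abspath(BASE))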

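The federated clients are meant to use Opacus, which builds on PyTorch, so seeding PyTorch as well may help reproducibility. A minimal sketch, assuming a hypothetical helper `set_seed` (not part of the project code) and an environment where PyTorch may or may not be installed:

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy and, if it is installed, PyTorch (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional: only needed by the FL/DP training code
    except ImportError:
        return
    torch.manual_seed(seed)


# Re-seeding with the same value should reproduce the same random draws.
set_seed(42)
first = np.random.rand(3)
set_seed(42)
second = np.random.rand(3)
print(np.allclose(first, second))
```

Seeding alone does not guarantee bit-identical training runs, for example when GPU kernels are nondeterministic, but it removes the most common source of run-to-run variation.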

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Load the dataset once and reuse the returned Bunch for features, labels and class names
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
df = X.copy()
df['target'] = y
target_names = data.target_names
print("Target mapping:", {0: target_names[0], 1: target_names[1]})
print("Shape:", df.shape)
print(df['target'].value_counts())

# Simple stats
display(df.describe().T[['mean','std']].head())

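# Correlation heatmap (first 10 features only, to keep it light)
corr = df.drop(columns=['target']).corr().iloc[:10, :10]
plt.figure(figsize=(6, 5))
# Fix the colour scale to [-1, 1] so colours map directly to correlation values
plt.imshow(corr, interpolation='nearest', cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.index)), corr.index)
plt.title('Feature correlation (first 10 features)')
plt.colorbar()
plt.tight_layout()
plt.show()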


In [None]:
# Create non-IID hospital splits (feature-median strategy) and save the CSVs to the project root
from src.data_split import main as split_main
split_main(outdir=BASE, test_size=0.2, split_strategy="feature")

# Inspect the class distribution of the resulting splits
dist = pd.read_csv(os.path.join(BASE, "class_distribution.csv"))
display(dist)

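The code of `src.data_split` is not shown here, so the following is only an illustration of what a feature-median split could look like. It assumes that the split thresholds a single feature at its median; the choice of `mean radius` and the names `site_a` and `site_b` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # 30 features plus a 'target' column (0 = malignant, 1 = benign)

# Assumption: split on the median of one feature; 'mean radius' is an arbitrary pick.
feature = "mean radius"
median = df[feature].median()
site_a = df[df[feature] <= median]  # hypothetical hospital A: smaller mean radius
site_b = df[df[feature] > median]   # hypothetical hospital B: larger mean radius

# Per-site class counts expose the label skew that makes the split non-IID.
dist = pd.DataFrame({
    "site_A": site_a["target"].value_counts(),
    "site_B": site_b["target"].value_counts(),
}).fillna(0).astype(int)
print(dist)
```

Because the thresholded feature is correlated with the label, each site ends up with a different class balance, which is the kind of heterogeneity a federated setup has to cope with.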

In [None]:
# Centralized baselines: Logistic Regression and Random Forest
from src.centralized_training import train_and_eval
metrics_df = train_and_eval(outdir=os.path.join(BASE, "models"), test_size=0.2)
display(metrics_df)

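`train_and_eval` lives in `src.centralized_training` and is not reproduced in this notebook. As a rough, self-contained illustration of such a baseline comparison, the sketch below trains both model types with scikit-learn; the hyperparameters, the metric choice and the name `metrics_sketch` are assumptions, not the project's settings.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumed settings: scaling before logistic regression, 200 trees for the forest.
models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    })

metrics_sketch = pd.DataFrame(rows).set_index("model")
print(metrics_sketch.round(3))
```

Numbers from a centralized run like this give the reference point against which the federated and DP results can be compared.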

## Federated Learning (run outside the notebook)

Open **three terminals** in the project root:

**Server:**
```
python -m src.fl_server --rounds 10 --eval-every 1 --test-split 0.2
```

**Client A:**
```
python -m src.fl_client --site A --epochs 2 --batch-size 64 --lr 0.001 --dp 0
```

**Client B:**
```
python -m src.fl_client --site B --epochs 2 --batch-size 64 --lr 0.001 --dp 0
```
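After each round the server combines the clients' model updates. The source does not say which strategy `src.fl_server` configures; the sketch below assumes weighted federated averaging (FedAvg) and uses made-up parameters and sample counts for two hypothetical sites.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Average per-layer parameters, weighting each client by its sample count."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]


# Two hypothetical sites, each holding a list of per-layer parameter arrays.
site_a = [np.array([1.0, 2.0]), np.array([0.5])]
site_b = [np.array([3.0, 4.0]), np.array([1.5])]
global_weights = fedavg([site_a, site_b], client_sizes=[100, 300])
print(global_weights)
```

Weighting by sample count means a site with more data pulls the global model further towards its local solution, which matters when the sites are non-IID.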

To enable differential privacy, repeat the run (server and both clients) with the clients' `--dp 0` replaced by `--dp 1 --clip 1.0 --noise 1.1 --delta 1e-5`.

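To make the clipping and noise parameters concrete, here is a NumPy sketch of one DP-SGD gradient step. It assumes that `--clip` sets the maximum per-sample gradient L2 norm and `--noise` the noise multiplier, matching Opacus's `max_grad_norm` and `noise_multiplier`; how `src.fl_client` actually wires these flags is not shown in this notebook. The function name `dp_sgd_gradient` and the gradients are made up for illustration.

```python
import numpy as np


def dp_sgd_gradient(per_sample_grads, clip, noise_multiplier, rng):
    """Clip each per-sample gradient to L2 norm `clip`, sum, add Gaussian noise, average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Noise standard deviation is noise_multiplier * clip, applied to the summed gradient.
    noise = rng.normal(0.0, noise_multiplier * clip, size=per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)


rng = np.random.default_rng(42)
grads = rng.normal(size=(64, 5)) * 3.0  # made-up per-sample gradients, batch of 64
noisy = dp_sgd_gradient(grads, clip=1.0, noise_multiplier=1.1, rng=rng)
exact = dp_sgd_gradient(grads, clip=1.0, noise_multiplier=0.0, rng=rng)
print("with noise   :", noisy)
print("clipping only:", exact)
```

A smaller clipping bound or a larger noise multiplier strengthens the privacy guarantee (lower ε for a given δ) at the cost of a noisier gradient; δ itself only enters through the privacy accounting, which this sketch omits.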

### What to include in the Report
- Methods: data split logic, FL setup, DP params (ε, δ, clipping, noise)
- Results: tables + plots (centralized and FL)
- Discussion: privacy–utility trade-offs, sensitivity to noise/clipping, effect of more rounds/clients, adaptation to SeleneX
