# Differentially Private Federated Learning with Opacus and Flower

In this notebook, we will learn how to use Opacus with PyTorch and Flower for differentially private federated learning. The code builds on earlier examples such as the Flower Introduction, so topics covered in previous notebooks are discussed only briefly.

Federated learning inherently offers a form of privacy protection because clients send only model updates to the server rather than raw data. However, it has been shown that attacks can still, for example, reconstruct parts of the training data or infer the membership of certain users given only the trained model. A popular approach to mitigating this privacy risk is differential privacy.

This notebook introduces the concept and demonstrates how to add it to a PyTorch-based Flower client using the [Opacus library](https://opacus.ai/).

## What is differential privacy?

Essentially, it is a definition of privacy that lets you reason mathematically about the risk of an adversary learning information about a user, given a model's parameters and arbitrary side information. For a training algorithm to satisfy differential privacy, it must ensure that a single user appearing in or missing from the training set does not have too much of an effect on the model.

Formally ([Dwork et al. 2006](https://www.iacr.org/archive/eurocrypt2006/40040493/40040493.pdf)), given two adjacent datasets $d$ and $d'$, where $d'$ can be formed from $d$ by adding or removing all the samples belonging to a single user, a randomized training algorithm $\mathcal{A}: \mathcal{D} \rightarrow \mathcal{M}$, mapping possible training datasets to possible trained models, is $(\varepsilon, \delta)$-differentially private if for any subset of outputs $O \subseteq \mathcal{M}$ we have
$$\Pr[\mathcal{A}(d) \in O] \leq e^\varepsilon \Pr[\mathcal{A}(d') \in O] + \delta.$$

Here $\varepsilon$ is the so-called privacy budget: the lower it is, the stronger the privacy guarantee you can provide. The other parameter, $\delta$, bounds the probability that something goes wrong and the guarantee fails to hold (and again, the lower the better).
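To make the definition concrete, here is a small worked example that is not part of the original notebook. It uses randomized response, a classic mechanism in which a user reports their true bit with probability `p_truth` and the flipped bit otherwise. The function names are hypothetical, and the sketch only covers the pure case $\delta = 0$; for a mechanism with finitely many outputs it is enough to compare the probabilities of each single output.

```python
import math


def randomized_response_probs(true_bit: int, p_truth: float) -> dict:
    """Output distribution of randomized response for one user's bit."""
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}


def epsilon_of(p_truth: float) -> float:
    """Smallest epsilon for which the mechanism is (epsilon, 0)-DP.

    The two adjacent "datasets" are a user holding bit 0 and a user
    holding bit 1. Assumes 0 < p_truth < 1.
    """
    worst_ratio = 1.0
    for output in (0, 1):
        p_d = randomized_response_probs(0, p_truth)[output]
        p_d_prime = randomized_response_probs(1, p_truth)[output]
        worst_ratio = max(worst_ratio, p_d / p_d_prime, p_d_prime / p_d)
    return math.log(worst_ratio)


print(f"epsilon = {epsilon_of(0.75):.4f}")  # prints "epsilon = 1.0986"
```

Telling the truth three times out of four gives $\varepsilon = \ln 3$, while a fair coin flip (`p_truth = 0.5`) gives $\varepsilon = 0$: the output then reveals nothing about the user's bit.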

For deep learning, the most commonly used algorithm for achieving $(\varepsilon, \delta)$-differential privacy is currently DP-SGD ([Abadi et al. 2016](https://arxiv.org/pdf/1607.00133.pdf)), which can easily be adapted for use in federated learning strategies such as FedAvg ([McMahan et al. 2017](https://arxiv.org/pdf/1710.06963v1.pdf)).

It has two main steps used to increase user privacy:

- **Gradient norm clipping**, which bounds a single client's influence on the overall average by clipping its gradient (or model update) to a maximum norm of $L$ (a good value of $L$ depends on many factors determined by your model and data), and
- **Gaussian noising**, which introduces the required randomness by adding noise drawn from $\mathcal{N}(0, L^2\sigma^2)$ to the clipped gradient or model update ($\sigma$ is referred to as the noise multiplier).
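The two steps can be sketched on a plain list of numbers. This is a simplified illustration under stated assumptions, not how Opacus implements them: Opacus clips per-sample gradients inside the optimizer step, whereas the hypothetical helpers below treat one flat update vector and use only the standard library.

```python
import math
import random


def clip_by_l2_norm(update, max_norm):
    """Scale the vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def add_gaussian_noise(update, max_norm, noise_multiplier, rng):
    """Add N(0, (max_norm * noise_multiplier)^2) noise to every coordinate."""
    std = max_norm * noise_multiplier
    return [x + rng.gauss(0.0, std) for x in update]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
raw_update = [3.0, 4.0]  # L2 norm is 5, well above the clipping bound
clipped = clip_by_l2_norm(raw_update, max_norm=1.2)
private = add_gaussian_noise(clipped, max_norm=1.2, noise_multiplier=0.4, rng=rng)
print("clipped:", clipped)
print("noised: ", private)
```

Clipping preserves the direction of the update and only shrinks its length, and the noise scale is tied to the clipping bound, which is why the two parameters have to be tuned together.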

The algorithm also relies on clients and data samples being selected uniformly at random.

For federated learning in particular, you can either add noise locally to each model update before sending it to the server, or add noise centrally to the averaged model. The central approach leads to more accurate models because less noise is added overall, but it relies on the assumption of a trusted server. The following example focuses on local differential privacy.
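A rough back-of-the-envelope comparison shows why the central approach adds less noise. The sketch below rests on simplifying assumptions that are not from the original notebook: an unweighted average over `num_clients` clipped updates, independent Gaussian noise, and the same noise multiplier in both settings. Keep in mind that the two settings protect against different adversaries, so the resulting guarantees are not directly interchangeable.

```python
import math


def local_noise_std(max_norm, noise_multiplier, num_clients):
    """Per-coordinate noise std in the average when every client adds
    N(0, (max_norm * noise_multiplier)^2) to its own update."""
    return max_norm * noise_multiplier / math.sqrt(num_clients)


def central_noise_std(max_norm, noise_multiplier, num_clients):
    """Per-coordinate noise std when the server adds noise once to the
    average, whose sensitivity to one client is max_norm / num_clients."""
    return max_norm * noise_multiplier / num_clients


n = 100
print("local:  ", local_noise_std(1.2, 0.4, n))
print("central:", central_noise_std(1.2, 0.4, n))
```

Under these assumptions the noise left in the averaged model is larger by a factor of $\sqrt{n}$ in the local setting, which is the price paid for not having to trust the server.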

Another important job of any library providing differentially private mechanisms is keeping track of the privacy budget. Applying a mechanism to the same dataset multiple times weakens the privacy guarantee you can provide, because every access to the data potentially reveals more information about a client. Computing $\varepsilon$ values and bounds is mathematically quite involved, but broadly speaking, every epoch of training performed on a client makes $\varepsilon$ grow roughly additively, while across your federated setting you have a form of parallel composition, where the maximum $\varepsilon$ over all of your clients determines your overall budget.
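The two composition rules mentioned above can be written down in a few lines. This is only the basic composition bound, shown for intuition with hypothetical function names and made-up per-epoch values; the accountant in Opacus uses tighter analysis, so the $\varepsilon$ it reports will generally be smaller than a plain sum.

```python
def sequential_epsilon(per_epoch_epsilons):
    """Basic sequential composition: repeated passes over the same
    client's data add up."""
    return sum(per_epoch_epsilons)


def parallel_epsilon(per_client_epsilons):
    """Parallel composition: clients hold disjoint data, so the overall
    budget is the worst case over clients."""
    return max(per_client_epsilons)


# Illustrative numbers only: client A trains 3 epochs, client B trains 4
client_a = sequential_epsilon([0.5, 0.5, 0.5])
client_b = sequential_epsilon([0.25, 0.25, 0.25, 0.25])
print(parallel_epsilon([client_a, client_b]))  # prints 1.5
```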

## Installing Opacus

Let's start again by installing and importing the necessary dependencies, including Opacus. The current release of Opacus is not (yet) compatible with the latest version of PyTorch, which is why we install earlier versions of both `torch` and `torchvision`:

In [None]:
!pip install torchcsprng==0.1.3+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install matplotlib opacus==0.14.0 torch==1.7.0 torchvision==0.8.0 git+https://github.com/adap/flower.git@release/0.17#egg=flwr["simulation"]

from collections import OrderedDict
from typing import List

import flwr as fl
import numpy as np
import opacus                                           # <-- NEW
from opacus import PrivacyEngine                        # <-- NEW
from opacus.dp_model_inspector import DPModelInspector  # <-- NEW
import torch
import torch.nn as nn
import torchvision
import torch.nn.functional as F
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import CIFAR10

print(torch.__version__)
print(torchvision.__version__)
print(opacus.__version__)

DEVICE = torch.device("cpu")  # This example runs on CPU
print(f"Training on {DEVICE}")

## Defining our task

As before, we begin by defining our basic task. This includes data loading and model architecture, and the usual `get_parameters`/`set_parameters`:

In [None]:
NUM_CLIENTS = 2  # This time, we'll only use two clients
BATCH_SIZE = 32

def load_datasets():
    # Download and transform CIFAR-10 (train and test)
    transform = transforms.Compose(
      [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    trainset = CIFAR10("./dataset", train=True, download=True, transform=transform)
    testset = CIFAR10("./dataset", train=False, download=True, transform=transform)

    # Split training set into NUM_CLIENTS partitions to simulate the individual datasets
    partition_size = len(trainset) // NUM_CLIENTS
    lengths = [partition_size] * NUM_CLIENTS
    datasets = random_split(trainset, lengths, torch.Generator().manual_seed(42))

    # Split each partition into train/val and create DataLoader
    trainloaders = []
    valloaders = []
    for ds in datasets:
        len_val = len(ds) // 10  # 10 % validation set
        len_train = len(ds) - len_val
        lengths = [len_train, len_val]
        ds_train, ds_val = random_split(ds, lengths, torch.Generator().manual_seed(42))
        trainloaders.append(DataLoader(ds_train, batch_size=BATCH_SIZE, shuffle=True))
        valloaders.append(DataLoader(ds_val, batch_size=BATCH_SIZE))
    testloader = DataLoader(testset, batch_size=BATCH_SIZE)
    return trainloaders, valloaders, testloader

trainloaders, valloaders, testloader = load_datasets()


class Net(nn.Module):
    def __init__(self) -> None:
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def get_parameters(net) -> List[np.ndarray]:
    return [val.cpu().numpy() for _, val in net.state_dict().items()]

def set_parameters(net, parameters: List[np.ndarray]):
    params_dict = zip(net.state_dict().keys(), parameters)
    state_dict = OrderedDict({k: torch.Tensor(v) for k, v in params_dict})
    net.load_state_dict(state_dict, strict=True)

## A simple example using Opacus

There are multiple libraries that implement differentially private machine learning. They typically depend on the library you use for training your models, since the clipping step is usually performed at each training step and hence requires library-specific wrappers for optimizers. This example uses Opacus, which is specific to PyTorch; there is also an example of a [Flower client implementing DP-SGD using TensorFlow Privacy](https://github.com/adap/flower/tree/main/examples/dp-sgd-mnist).

The full code for this example and instructions for running it can be found [here](). It builds on the [PyTorch Quickstart](https://flower.dev/docs/quickstart_pytorch.html) example, which trains a convolutional network on the CIFAR10 dataset, so we assume that you are familiar with it and have the code ready.

The first step is to check whether your network is compatible with Opacus, since it doesn't currently support all layers (see the [documentation](https://github.com/pytorch/opacus/blob/master/opacus/README.md) for more on this). To do so, pass a model instance to the `DPModelInspector`:

In [None]:
def validate_model():
    model = Net()
    inspector = DPModelInspector()
    print(inspector.validate(model))


validate_model()

The trickiest part about the setup is finding the right privacy parameters for the privacy engine:

- **Target delta $\delta$**: should be set to be less than the inverse of the size of the training dataset, so for example if your dataset has 50,000 training points like CIFAR10, a good value to set it to is $10^{-5}$.
- **Noise multiplier $\sigma$**: determines the amount of noise added in each step; the larger it is, the smaller the resulting $\varepsilon$ value.
- **Target epsilon $\varepsilon$**: as an alternative to fixing the noise multiplier, the engine can compute it from a target $\varepsilon$. However, this requires you to also provide the number of training epochs, which is harder to know in a federated setting since you need to consider both global rounds and local epochs.
- **Maximum gradient norm $L$**: this parameter depends heavily on factors such as model architecture, amount of training data on the client and learning rate, so it can be useful to do a grid search. Setting it too low can result in high bias, whereas setting it too high might destroy model utility, because the added noise scales with $L$ ([Andrew et al. 2021](https://arxiv.org/pdf/1905.03871.pdf)).

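The engine also needs the sample rate, that is, the fraction of a client's data used in one training step. Instead of a sample rate you can also provide both the batch size (the number of training samples taken in one step) and the sample size (the overall size of the dataset on one client), since `sample_rate = batch_size / sample_size`.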
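The bookkeeping behind these parameters is simple arithmetic, sketched below with hypothetical helper names that are not part of Opacus. The $\delta$ heuristic picks the largest power of ten strictly below the inverse of the dataset size, which is one way to follow the rule of thumb above. The step count assumes the last, smaller batch is kept, as with a default PyTorch `DataLoader`.

```python
import math


def sample_rate(batch_size, sample_size):
    """Fraction of the client's data used in one training step."""
    return batch_size / sample_size


def suggest_target_delta(sample_size):
    """Largest power of ten strictly below 1 / sample_size."""
    num_digits = len(str(sample_size))  # 10 ** num_digits > sample_size
    return float(f"1e-{num_digits}")


def total_local_steps(sample_size, batch_size, local_epochs, num_rounds):
    """Optimizer steps one client takes over the whole federated run."""
    steps_per_epoch = math.ceil(sample_size / batch_size)
    return steps_per_epoch * local_epochs * num_rounds


# One client in this notebook trains on 22,500 CIFAR-10 images
print(sample_rate(32, 22500))
print(suggest_target_delta(50000))  # prints 1e-05
print(total_local_steps(22500, 32, local_epochs=1, num_rounds=5))  # prints 3520
```

The last number is the kind of quantity you would need when working from a target $\varepsilon$: each client here takes 704 steps per round, so five rounds with one local epoch each amount to 3,520 noisy steps on the same data.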

There are more advanced parameters you can specify; see the Opacus [documentation](https://opacus.ai/api/privacy_engine.html) and [tutorials](https://opacus.ai/tutorials/) for details.

The example uses the following parameters. They are not tuned, and the best values depend on where you want to land on the utility-privacy trade-off:

In [None]:
PARAMS = {
    'batch_size': 32,
    'train_split': 0.7,
    'local_epochs': 1
}
PRIVACY_PARAMS = {
    'target_delta': 1e-5,
    'noise_multiplier': 0.4,
    'max_grad_norm': 1.2
}

The last step is to attach the privacy engine to your optimizer in each round of training, and optionally get and return the privacy budget spent so far:

In [None]:
def train(net, trainloader, privacy_engine, epochs):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    # Attach privacy engine to optimizer
    privacy_engine.attach(optimizer)
    for _ in range(epochs):
        for images, labels in trainloader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            optimizer.zero_grad()
            loss = criterion(net(images), labels)
            loss.backward()
            optimizer.step()
    # Get privacy budget
    epsilon, _ = optimizer.privacy_engine.get_privacy_spent(PRIVACY_PARAMS['target_delta'])
    return epsilon


# Same as before
def test(net, testloader):
    """Evaluate the network on the entire test set."""
    criterion = torch.nn.CrossEntropyLoss()
    correct, total, loss = 0, 0, 0.0
    net.eval()
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            outputs = net(images)
            loss += criterion(outputs, labels).item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    loss /= len(testloader.dataset)
    accuracy = correct / total
    return loss, accuracy

Next you specify an instance of the PrivacyEngine in your client:

In [None]:
PE = {}  # Maps client ID -> PrivacyEngine, kept alive across rounds

def get_privacy_engine(cid, model, sample_rate):
    if cid not in PE:
        PE[cid] = PrivacyEngine(
            model,
            sample_rate=sample_rate,
            target_delta=PRIVACY_PARAMS['target_delta'],
            max_grad_norm=PRIVACY_PARAMS['max_grad_norm'],
            noise_multiplier=PRIVACY_PARAMS['noise_multiplier'],
        )
    return PE[cid]  # Use the previously created PrivacyEngine

It is important to have only one engine per model, persisting across all rounds of training, since otherwise the privacy tracking won't be accurate. To achieve that whilst still using the Flower Virtual Client Engine for a resource-efficient single-machine simulation, we have to use a little trick: we initialize the privacy engine for each client lazily and keep a reference to it in a dictionary. This allows us to use the Virtual Client Engine and discard `FlowerClient` instances after use, but still re-use the same `PrivacyEngine` instance every time a given client is re-created.

## Implement the FlowerClient

You can then also return the privacy budget as a custom metric in the client's fitting method (which can then be further used by [overriding the aggregation strategy](https://flower.dev/docs/saving-progress.html)):

In [None]:
class FlowerClient(fl.client.NumPyClient):
    def __init__(self, cid, net, trainloader, valloader, privacy_engine):
        super().__init__()
        self.cid = cid
        self.net = net
        self.trainloader = trainloader
        self.valloader = valloader
        self.privacy_engine = privacy_engine

    def get_parameters(self):
        return get_parameters(self.net)
    
    def fit(self, parameters, config):
        set_parameters(self.net, parameters)
        epsilon = train(self.net, self.trainloader, self.privacy_engine, PARAMS['local_epochs'])
        print(f"[CLIENT {self.cid}] epsilon = {epsilon:.2f}")
        return get_parameters(self.net), len(self.trainloader), {"epsilon":epsilon}

    def evaluate(self, parameters, config):
        set_parameters(self.net, parameters)
        loss, accuracy = test(self.net, self.valloader)
        print(f"[CLIENT {self.cid}] loss {loss}, accuracy {accuracy}")
        return float(loss), len(self.valloader), {"accuracy": float(accuracy)}
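On the server side, the `epsilon` metrics returned by `fit` could be condensed into a single per-round summary. The helper below is a hypothetical sketch and not part of Flower: it only assumes that you have collected the metrics dictionaries of all clients for one round, for example inside an overridden `aggregate_fit`. How exactly those dictionaries are extracted depends on your Flower version, so that part is left out.

```python
from typing import Dict, List


def summarize_epsilons(client_metrics: List[Dict[str, float]]) -> Dict[str, float]:
    """Condense per-client privacy budgets into round-level numbers.

    The maximum is the figure that matters under parallel composition;
    the mean is reported for monitoring only.
    """
    epsilons = [m["epsilon"] for m in client_metrics if "epsilon" in m]
    if not epsilons:
        return {}
    return {
        "epsilon_max": max(epsilons),
        "epsilon_mean": sum(epsilons) / len(epsilons),
    }


# Made-up values standing in for the metrics of two clients
round_metrics = [{"epsilon": 2.0}, {"epsilon": 3.0}]
print(summarize_epsilons(round_metrics))
```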

## Start the training

And that's it, you can run your Flower client just as before! Let's train for a few rounds of federated learning and see if the accuracy of our differentially private model still increases:

In [None]:
def client_fn(cid) -> FlowerClient:
    """Create a Flower client representing a single organization."""

    # Load model
    net = Net().to(DEVICE)

    # Load data (CIFAR-10)
    trainloader = trainloaders[int(cid)]
    valloader = valloaders[int(cid)]

    # PrivacyEngine
    sample_rate = BATCH_SIZE / len(trainloader.dataset) 
    pe = get_privacy_engine(cid, net, sample_rate)

    # Create a single Flower client representing a single organization
    return FlowerClient(cid, net, trainloader, valloader, pe)


# Create FedAvg strategy
strategy = fl.server.strategy.FedAvg(
        fraction_fit=1.0,  # Sample 100% of available clients for training
        fraction_eval=1.0,  # Sample 100% of available clients for evaluation
        min_fit_clients=2,  # Never sample fewer than 2 clients for training
        min_eval_clients=2,  # Never sample fewer than 2 clients for evaluation
        min_available_clients=2,  # Wait until both clients are available
)

# Start simulation
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_CLIENTS,
    num_rounds=5,
    strategy=strategy,
)

## Limitations and next steps

While implementing a basic setup like this is relatively simple, there are a lot of considerations and open questions when it comes to deploying differentially private models in a federated setting in practice.

Ultimately it comes down to considering the trade-off between utility and privacy. On the one hand, differentially private models take longer to converge, requiring more computation ([McMahan et al. 2017](https://arxiv.org/pdf/1710.06963v1.pdf)), and can result in worse models ([Bagdasaryan et al. 2019](https://proceedings.neurips.cc/paper/2019/file/fc0de4e0396fff257ea362983c2dda5a-Paper.pdf)). On the other hand, even relatively large privacy budgets are beneficial for user privacy ([Thakkar et al. 2020](https://arxiv.org/pdf/2006.07490.pdf)). Hence considering your privacy parameters and assumptions about the required utility and privacy becomes crucial.

For Flower, the next step would be to find ways of training models in a differentially private way such that the mechanism becomes independent of the machine learning library. Even so, this example will hopefully be a useful first step for experimenting with privacy in federated learning.