
## Implementing Differentially Private SGD (DP-SGD) with Opacus






### Library Installation and Dataset Preparation

First, let's install Opacus, a library for training PyTorch models with differential privacy.

In [None]:
# install opacus [https://github.com/pytorch/opacus]
!pip install opacus



In [None]:
!pip install -U -q PyDrive
import os
import pandas as pd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
if not os.path.exists('/content/BMI.csv'):
    link = '1_jBaXC32QcfGOHo2J5040PuEU7TAKj7Y'  # Restricted shared link
    downloaded = drive.CreateFile({'id':link}) 
    downloaded.GetContentFile('BMI.csv')

df = pd.read_csv('/content/BMI.csv')

df.head()

| Unnamed: 0 | UnderwaterDensity | BodyFatSiriEqu | Age | Height | Weight(kg) | NeckCircumf | ChestCircumf | Abdomen2Circumf | HipCircumf | ThighCircumf | KneeCircumf | AnkleCircumf | ExtendBicepsCircumf | ForearmCircumf | WristCircumf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0708 | 12.3 | 23 | 172.085 | 69.96662 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 1 | 1.0853 | 6.1 | 22 | 183.515 | 78.58488 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 2 | 1.0414 | 25.3 | 22 | 168.275 | 69.85322 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 3 | 1.0751 | 10.4 | 26 | 183.515 | 83.80119 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 4 | 1.034 | 28.7 | 24 | 180.975 | 83.57439 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |


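We treat the columns in the CSV as features and use them to classify each person's BMI category.

The dataset has no BMI column, so we compute it from weight and height. The `Height` column is in centimetres, so it has to be converted to metres first:
$ \mathrm{BMI}=\frac{\mathrm{Weight\ (kg)}}{\mathrm{Height\ (m)}^2} $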
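
As a quick check of the formula and the four categories used in the dataset class, here is a minimal sketch applied to the first row of the table (height 172.085 cm, weight 69.96662 kg). The helper names `bmi_from_cm` and `bmi_category` are made up for this illustration and are not part of the notebook; the thresholds 18.5, 25, and 30 are the ones used in the dataset code.

```python
def bmi_from_cm(weight_kg, height_cm):
    """Compute BMI from weight in kilograms and height in centimetres."""
    height_m = height_cm / 100.0  # convert cm to m before squaring
    return weight_kg / (height_m ** 2)


def bmi_category(bmi):
    """Map a BMI value to one of the four class indices and its name."""
    if bmi < 18.5:
        return 0, "Underweight"
    if bmi < 25:
        return 1, "Healthy Weight Range"
    if bmi < 30:
        return 2, "Overweight"
    return 3, "Obese"


# First row of the table above
bmi = bmi_from_cm(weight_kg=69.96662, height_cm=172.085)
label, name = bmi_category(bmi)
print(f"BMI = {bmi:.2f}, class {label} ({name})")
```

Dividing the squared height in centimetres by 10000, as the dataset class does, is the same conversion written in a single step.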

We now create our dataset.

In [None]:
from PIL import Image
import matplotlib.pyplot as plt
import torch
import numpy as np
from torch.utils.data import Dataset, Subset, DataLoader
from torchvision import datasets, transforms,models
import torch.optim as optim
import os

class MyDataset(Dataset):

  def __init__(self, df, mean=None, std=None):
    # Work on a copy so the caller's DataFrame is left untouched
    # (this also avoids pandas' SettingWithCopyWarning on slices)
    df = df.copy()
    # Use every column of the CSV as an input feature
    x = df.to_numpy(dtype=np.float32)
    if mean is not None:
      # Reuse statistics computed elsewhere (e.g. on the training set)
      self.mean = mean
      self.std = std
    else:
      self.mean = x.mean(axis=0, dtype=np.float32)
      self.std = x.std(axis=0, dtype=np.float32)
    # Define BMI; Height is in cm, so dividing by 10000 converts cm^2 to m^2
    df['BMI'] = df['Weight(kg)'] / (df['Height'] * df['Height'] / 10000)
    # Divide BMI into 4 categories; half-open ranges leave no gaps
    # (a BMI of 24.995 falls in "Healthy Weight Range")
    bmi = df['BMI']
    underweight = bmi < 18.5
    healthy = (bmi >= 18.5) & (bmi < 25)
    overweight = (bmi >= 25) & (bmi < 30)
    obese = bmi >= 30

    df['BMI_category'] = "not defined"
    df.loc[underweight, 'BMI_category'] = "Underweight"
    df.loc[healthy, 'BMI_category'] = "Healthy Weight Range"
    df.loc[overweight, 'BMI_category'] = "Overweight"
    df.loc[obese, 'BMI_category'] = "Obese"

    df['BMI_category_int'] = -1
    df.loc[underweight, 'BMI_category_int'] = 0
    df.loc[healthy, 'BMI_category_int'] = 1
    df.loc[overweight, 'BMI_category_int'] = 2
    df.loc[obese, 'BMI_category_int'] = 3

    y = df['BMI_category_int'].to_numpy(dtype=np.int64)

    self.x = torch.tensor(x, dtype=torch.float32)  # features
    self.y = torch.tensor(y, dtype=torch.int64)    # class labels

  def mean_std(self):
    return self.mean, self.std

  def __len__(self):
    return len(self.y)

  def __getitem__(self, idx):
    data = self.x[idx]
    target = self.y[idx]
    # Standardize each feature with the stored mean and std
    mean = torch.as_tensor(self.mean, dtype=torch.float32)
    std = torch.as_tensor(self.std, dtype=torch.float32)
    data = (data - mean) / std
    return data, target

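We split the CSV into a training set (the first 80% of the rows) and a test set (the remaining 20%). The test set is normalized with the mean and standard deviation of the training set.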

In [None]:
df = pd.read_csv('/content/BMI.csv')
df_size = df.shape[0]
split_ratio = 0.8
index = int(df_size*split_ratio)
train_df = df.iloc[0: index,:]
test_df = df.iloc[index:, :]

trainset = MyDataset(train_df)
mean, std = trainset.mean_std()
testset = MyDataset(test_df, mean, std)
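
The cell above passes the training set's `mean` and `std` into the test dataset, so both sets are standardized with the same statistics. A minimal, library-free sketch of that idea on made-up numbers follows; the helpers `fit_standardizer` and `standardize` and the toy rows are hypothetical and only illustrate the pattern.

```python
import math


def fit_standardizer(rows):
    """Return per-column means and standard deviations of the given rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Population standard deviation (divide by n), as NumPy's std does by default
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    return means, stds


def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy (height, weight) rows; these values are invented for the example
train_rows = [[170.0, 60.0], [180.0, 80.0], [175.0, 70.0], [185.0, 90.0]]
test_rows = [[178.0, 75.0]]

# Statistics come from the training rows only and are reused for the test rows
means, stds = fit_standardizer(train_rows)
train_z = standardize(train_rows, means, stds)
test_z = standardize(test_rows, means, stds)
print(means, test_z)
```

Computing the statistics on the training rows only keeps information about the test rows out of the preprocessing step.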


### Build Toy Neural Network

Opacus wraps ordinary PyTorch models and optimizers so that they train with DP-SGD. To show how it is used, we take a toy neural network as an example.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyNN(nn.Module):
    def __init__(self):
        super(ToyNN, self).__init__()
        self.fc1 = nn.Linear(15, 30)
        self.fc2 = nn.Linear(30, 8)
        self.fc3 = nn.Linear(8, 4)
        self.act = nn.Tanh()
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.fc1(x) # -> [B, 30]
        x = self.act(x) # -> [B, 30]
        x = self.dropout(x)
        x = self.fc2(x) # -> [B, 8]
        x = self.act(x) # -> [B, 8]
        x = self.dropout(x)
        x = self.fc3(x) # -> [B, 4]
        return x

    def name(self):
        return "ToyNN"

We can simply wrap our ToyNN, optimizer, and data loader with the `PrivacyEngine`.

In [None]:
from opacus import PrivacyEngine

# set standard parameters
epochs = 50
lr = 1e-3
batchsize = 16
weight_decay = 5e-4
# set parameters for DP-SGD
noise_multiplier = 1.1
max_grad_norm = 1.0
delta = 1e-5

# create data loaders
trainloader = DataLoader(trainset, batch_size=batchsize, shuffle=True, num_workers=1)
testloader = DataLoader(testset, batch_size=batchsize, shuffle=False, num_workers=1)
# define your components as usual
model = ToyNN()
# define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
criterion = nn.CrossEntropyLoss()
# enter PrivacyEngine
privacy_engine = PrivacyEngine()
model, optimizer, trainloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=trainloader,
    noise_multiplier=noise_multiplier,
    max_grad_norm=max_grad_norm,
)

  "Secure RNG turned off. This is perfectly fine for experimentation as it allows "


With the returned model, optimizer, and trainloader, we can train our network with differential privacy. The $\epsilon$ reported after each epoch and the fixed $\delta$ correspond to the ($\epsilon$, $\delta$)-DP guarantee in the lecture slides.
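
To make the roles of `max_grad_norm` and `noise_multiplier` more concrete, the core of one DP-SGD update can be sketched in plain Python: clip every per-sample gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise whose standard deviation is `noise_multiplier * max_grad_norm`, and average. This is a simplified illustration under those assumptions, not the Opacus implementation; the function `dp_sgd_aggregate` and the toy gradients are hypothetical.

```python
import math
import random


def dp_sgd_aggregate(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum them, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its L2 norm exceeds max_grad_norm
        scale = min(1.0, max_grad_norm / (norm + 1e-6))
        for i, g in enumerate(grad):
            total[i] += g * scale
    # The noise standard deviation is tied to the clipping bound
    sigma = noise_multiplier * max_grad_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
# Two toy per-sample gradients: the first has norm 5 and gets clipped,
# the second has norm 0.5 and is left unchanged
grads = [[3.0, 4.0], [0.3, 0.4]]

clipped_only = dp_sgd_aggregate(grads, max_grad_norm=1.0, noise_multiplier=0.0, rng=rng)
private = dp_sgd_aggregate(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng)
print(clipped_only, private)
```

Clipping bounds how much any single example can move the update, and the noise is scaled to that bound. A larger `noise_multiplier` gives a smaller $\epsilon$ for the same number of steps, at the cost of noisier updates.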

In [None]:
# define LR schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# start training
for epoch in range(1, epochs + 1):
    model.train()
    train_losses = []
    train_correct = 0
    train_total = 0
    for batch_idx, (data, target) in enumerate(trainloader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        # the predicted class is the index of the largest logit
        pred = output.argmax(dim=1)
        # compare predictions to true labels
        train_correct += (pred == target).sum().item()
        train_total += data.size(0)
    train_acc = 100. * train_correct / train_total

    # update the learning rate once per epoch, after the optimizer steps
    scheduler.step()
    lr = scheduler.get_last_lr()[0]

    # privacy budget spent so far for the chosen delta
    epsilon = privacy_engine.get_epsilon(delta)

    model.eval()
    test_losses = []
    test_correct = 0
    test_total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(testloader):
            output = model(data)
            loss = criterion(output, target)
            test_losses.append(loss.item())
            # the predicted class is the index of the largest logit
            pred = output.argmax(dim=1)
            # compare predictions to true labels
            test_correct += (pred == target).sum().item()
            test_total += data.size(0)
    test_acc = 100. * test_correct / test_total

    print("Epoch: {0},  Train_Loss: {1:.2f}, Train_Acc:{2:.2f}, Test_Loss:{3:.2f}, Test_Acc:{4:.2f}, eps: {5:.2f}".format(epoch, np.mean(train_losses), train_acc, np.mean(test_losses), test_acc, epsilon))



Epoch: 1,  Train_Loss: 1.36, Train_Acc:41.58, Test_Loss:1.29, Test_Acc:45.10, eps: 2.49
Epoch: 2,  Train_Loss: 1.30, Train_Acc:43.13, Test_Loss:1.27, Test_Acc:49.02, eps: 3.04
Epoch: 3,  Train_Loss: 1.29, Train_Acc:41.62, Test_Loss:1.25, Test_Acc:50.98, eps: 3.49
Epoch: 4,  Train_Loss: 1.25, Train_Acc:54.07, Test_Loss:1.23, Test_Acc:54.90, eps: 3.88
Epoch: 5,  Train_Loss: 1.23, Train_Acc:48.97, Test_Loss:1.21, Test_Acc:60.78, eps: 4.24
Epoch: 6,  Train_Loss: 1.22, Train_Acc:51.28, Test_Loss:1.19, Test_Acc:66.67, eps: 4.56
Epoch: 7,  Train_Loss: 1.22, Train_Acc:51.69, Test_Loss:1.18, Test_Acc:68.63, eps: 4.87
Epoch: 8,  Train_Loss: 1.21, Train_Acc:48.11, Test_Loss:1.16, Test_Acc:64.71, eps: 5.16
Epoch: 9,  Train_Loss: 1.19, Train_Acc:53.93, Test_Loss:1.15, Test_Acc:64.71, eps: 5.44
Epoch: 10,  Train_Loss: 1.14, Train_Acc:59.11, Test_Loss:1.13, Test_Acc:64.71, eps: 5.70
Epoch: 11,  Train_Loss: 1.13, Train_Acc:59.41, Test_Loss:1.12, Test_Acc:64.71, eps: 5.96
Epoch: 12,  Train_Loss: 1.11, 

### Practice Questions

1. Try to modify ToyNN and fit again to see what happens.

2. Explore PrivacyEngine modules.