# Differential Privacy in Deep Learning

Previously, we defined perfect privacy as "a query to a database returns the same value even if we remove any person from the database", and used this intuition in the description of epsilon/delta. In the context of deep learning we have a similar standard.

Training a model on a dataset should return the same model even if we remove any person from the dataset.
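For reference, the epsilon/delta description mentioned above is usually written as follows. This is the standard textbook statement rather than something specific to this notebook; here $\mathcal{M}$ stands for the (randomized) training procedure, $D$ and $D'$ are two datasets that differ in one person's records, and $S$ is any set of possible output models.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

With $\varepsilon = 0$ and $\delta = 0$ the two output distributions are identical, which is the "perfect privacy" standard stated above. Larger values allow the outputs to differ more, that is, more leakage.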

## Hospital Dataset Example
You represent a hospital that has an unlabelled dataset. You want to train a classification model on this data, but the data is not annotated.

There are 10 other hospitals that do have labelled data, and you can reach out to them. However, they will not share their data with you. Hence, the procedure becomes as follows:

1. All 10 hospitals will train their own models
2. You will use those 10 models to create 10 predicted labels for each of your data points
3. For each local data point, you will run a DP query (a max-count function) over the 10 labels, adding Laplacian noise to the counts to make it DP
4. You will train a model on your local dataset, using the results of the noisy max query as labels

Let's say we have 10,000 training examples, and we've got 10 labels for each example (from our 10 "teacher models", which were trained directly on private data). Each label is chosen from a set of 10 possible labels (categories).

In [None]:
import numpy as np

In [None]:
num_teachers = 10 # working with 10 partner hospitals
num_examples = 10000 # len(our_dataset)
num_labels = 10 # number of labels for our classifier

# Creating a synthetic label vector for each hospital where each row in the 
# following represents the 10k labels that their models predicted
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int)
preds

array([[8, 6, 6, ..., 9, 6, 3],
       [3, 9, 0, ..., 0, 5, 3],
       [2, 3, 2, ..., 8, 0, 7],
       ...,
       [9, 6, 4, ..., 6, 1, 6],
       [0, 9, 1, ..., 0, 1, 1],
       [9, 8, 0, ..., 2, 6, 7]])

In [None]:
preds.shape # each row is the prediction of that hospital for the 10,000 examples

(10, 10000)

In [None]:
preds[0] # all predictions of 0th hospital

array([8, 6, 6, ..., 9, 6, 3])

In [None]:
len(preds[0])

10000

In [None]:
preds[:, 0] # all 10 predictions for the 0th training example

array([8, 3, 2, 5, 5, 8, 5, 9, 0, 9])

Here each patient gets 10 predictions. We need to convert each vector of predictions into a single prediction.

In [None]:
an_image = preds[:, 0]
np.bincount(an_image, minlength=num_labels) # counts the frequency of individual elements and returns a list of counts for the increasing order of elements

array([1, 0, 1, 1, 0, 3, 0, 0, 2, 2])

In [None]:
label_counts = np.bincount(an_image, minlength=num_labels)
np.argmax(label_counts) # Index of the largest count will give the aggregated label

5

In [None]:
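# This is a straightforward prediction. We need to add Laplacian noise to the counts to make it differentially private
epsilon = 0.1
beta = 1/epsilon # scale of the Laplace noise

for i in range(len(label_counts)):
    # label_counts is an integer array, so each noisy value is truncated to an integer on assignment
    label_counts[i] += np.random.laplace(0, beta)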

In [None]:
label_counts

array([ -3,   3, -17,   4,   0, -25,  12,  -1,  -7,  -9])

In [None]:
np.argmax(label_counts)

6

Here the answer (label) changed from 5 to 6. This is the tradeoff: we accept some wrong labels in exchange for privacy. The assumption is that later on, as the DNN trains, it will filter through the noise and learn to predict reasonably accurately (federated learning integrates this into learning).
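To get a feel for this tradeoff, the sketch below estimates how often Laplace noise changes the winning label for a few values of epsilon. It is an illustration only: `flip_rate`, its `trials` and `seed` parameters, and the use of `np.random.default_rng` are assumptions made for this example, and the counts are kept as floats instead of being truncated to integers.

```python
import numpy as np

def flip_rate(label_counts, epsilon, trials=2000, seed=0):
    """Estimate how often Laplace noise changes the winning label (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(label_counts, dtype=float)
    clean_label = np.argmax(counts)
    # One row of independent noise per trial, with scale beta = 1 / epsilon
    noise = rng.laplace(0.0, 1.0 / epsilon, size=(trials, counts.size))
    noisy_labels = np.argmax(counts + noise, axis=1)
    return np.mean(noisy_labels != clean_label)

counts = np.array([1, 0, 1, 1, 0, 3, 0, 0, 2, 2])  # vote counts from the example above
for eps in (0.1, 1.0, 10.0):
    print(eps, flip_rate(counts, eps))
```

A smaller epsilon means a larger noise scale, so the label flips more often. With only 10 teachers and a winning margin of one vote, epsilon = 0.1 leaves very little signal in a single query.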

In [None]:
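# Simulated predictions, transposed so that each row holds the 10 teacher labels for one example
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int).transpose(1,0)

epsilon = 0.1
beta = 1/epsilon # scale of the Laplace noise

# Repeat the noisy max query for all 10,000 examples
new_labels = []
for an_image in preds:
    label_counts = np.bincount(an_image, minlength=num_labels)

    for i in range(len(label_counts)):
        # label_counts is an integer array, so each noisy value is truncated to an integer on assignment
        label_counts[i] += np.random.laplace(0, beta)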
    
    new_label = np.argmax(label_counts)

    new_labels.append(new_label)

len(new_labels)

10000

In [18]:
new_labels[0]

2
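The per-example loop above can also be written without Python loops. The following is a minimal vectorized sketch, not the notebook's original code: `noisy_max_labels` and the `demo_*` names are hypothetical, it uses `np.random.default_rng` rather than the legacy `np.random` functions, and it keeps the noisy counts as floats instead of truncating them to integers.

```python
import numpy as np

def noisy_max_labels(teacher_preds, num_labels, epsilon, rng):
    """Noisy-max aggregation for every example at once (hypothetical helper).

    teacher_preds has shape (num_examples, num_teachers).
    """
    # counts[i, k] = number of teachers that voted for label k on example i
    counts = (teacher_preds[:, :, None] == np.arange(num_labels)).sum(axis=1)
    # Independent Laplace noise with scale 1 / epsilon on every count
    noise = rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    return np.argmax(counts + noise, axis=1)

rng = np.random.default_rng(0)
demo_preds = rng.integers(0, 10, size=(10000, 10))  # simulated teacher labels
demo_labels = noisy_max_labels(demo_preds, num_labels=10, epsilon=0.1, rng=rng)
print(demo_labels.shape)  # (10000,)
```

The comparison against `np.arange(num_labels)` builds a temporary boolean array of shape (num_examples, num_teachers, num_labels), which is fine at this size but may need chunking for much larger datasets.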

### Differential Privacy for the participating hospitals
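If we can remove any one participating hospital's result (label) from the set of 10 labels and the output of the query (argmax) does not change, we call it perfect privacy.

However, whether this holds depends on what the labels actually were: when the hospitals largely agree, removing one vote rarely changes the winner, but when the vote is close it easily can.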
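This dependence on the actual labels can be checked directly for the plain (noise-free) argmax. The sketch below is an illustration under stated assumptions: `argmax_survives_removal` is a hypothetical helper, and it relies on `np.argmax` breaking ties in favour of the lowest label index.

```python
import numpy as np

def argmax_survives_removal(votes, num_labels):
    """True if dropping any single teacher's vote leaves the plain argmax unchanged."""
    votes = np.asarray(votes)
    full_label = np.argmax(np.bincount(votes, minlength=num_labels))
    for t in range(len(votes)):
        reduced = np.delete(votes, t)  # votes without teacher t
        reduced_label = np.argmax(np.bincount(reduced, minlength=num_labels))
        if reduced_label != full_label:
            return False
    return True

print(argmax_survives_removal([5] * 10, 10))               # True: unanimous vote
print(argmax_survives_removal([0, 0, 0, 1, 1, 1, 1], 2))   # False: a 4-3 vote can flip
```

In the second case, removing one of the four votes for label 1 produces a 3-3 tie, which `np.argmax` resolves to label 0, so the output changes.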

In [20]:
!pip install syft==0.2.9 >/dev/null

## PATE Analysis

In [23]:
from syft.frameworks.torch.dp import pate # syft 0.2.9 uses `dp` here; earlier versions used `differential_privacy`

num_teachers, num_examples, num_labels = (100, 100, 10)
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int) # simulated teacher predictions
indices = (np.random.rand(num_examples) * num_labels).astype(int) # simulated stand-in for the aggregated label of each example

In [24]:
data_dep_epsilon, data_indep_epsilon = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5)
data_dep_epsilon, data_indep_epsilon

(11.756462732485105, 11.756462732485115)

To demo, let's force the teacher predictions to 0 for the first 5 examples and see whether the epsilons differ.

In [25]:
# For the first 5 examples, all 100 hospitals agree on label 0
preds[:,  0:5] *= 0
preds

array([[0, 0, 0, ..., 6, 1, 7],
       [0, 0, 0, ..., 5, 7, 6],
       [0, 0, 0, ..., 2, 6, 0],
       ...,
       [0, 0, 0, ..., 3, 2, 3],
       [0, 0, 0, ..., 0, 5, 3],
       [0, 0, 0, ..., 6, 5, 4]])

In [28]:
data_dep_epsilon, data_indep_epsilon = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5)
data_dep_epsilon, data_indep_epsilon

(8.503131916570494, 11.756462732485115)

Here the data-dependent epsilon changed (from about 11.76 to about 8.50), while the data-independent epsilon stayed the same.

In [30]:
# For the first 50 examples, all 100 hospitals agree on label 0
preds[:,  0:50] *= 0
data_dep_epsilon, data_indep_epsilon = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5)
data_dep_epsilon, data_indep_epsilon



(1.52655213289881, 11.756462732485115)

We get significantly lower privacy leakage: the data-dependent epsilon drops to about 1.53.

The intuition here is that the more the predictions agree with each other, the tighter the epsilon value can get, and hence the lower the privacy leakage.
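One rough way to see how much the teachers agree is to compute, per example, the fraction of teachers that voted for the most popular label. The sketch below is only a proxy for intuition and is not what `pate.perform_analysis` computes internally; `teacher_agreement` and the `demo` array are hypothetical names introduced for this example.

```python
import numpy as np

def teacher_agreement(teacher_preds, num_labels):
    """Fraction of teachers voting for the most popular label, per example.

    teacher_preds has shape (num_teachers, num_examples).
    """
    # counts[i, k] = number of teachers that voted for label k on example i
    counts = (teacher_preds[:, :, None] == np.arange(num_labels)).sum(axis=0)
    return counts.max(axis=1) / teacher_preds.shape[0]

rng = np.random.default_rng(0)
demo = rng.integers(0, 10, size=(100, 100))  # simulated teacher predictions
demo[:, 0:50] = 0                            # force full agreement on the first 50 examples
agreement = teacher_agreement(demo, 10)
print(agreement[:50].mean(), agreement[50:].mean())
```

The forced examples have an agreement of 1.0, while the purely random ones stay far lower, which matches the direction of the drop in the data-dependent epsilon seen above.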