<a href="https://colab.research.google.com/github/ShowLongYoung/SecurePrivateAILab/blob/solution/7_differential_privacy_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Privacy Preserving Machine Learning

First things first. Let's run the package installations. They take quite a while.

Switch the runtime of this notebook to GPU first; otherwise training will be pretty slow.

To do so, go to Runtime -> Change runtime type and select GPU.

In [None]:
!pip install syft==0.2.9 keras==2.2.3 tensorflow_privacy==0.2.2

# !git clone https://github.com/OpenMined/PySyft.git
# !pip install -e PySyft
# !pip install tensorflow_federated

fatal: destination path 'PySyft' already exists and is not an empty directory.
Obtaining file:///content/PySyft
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Installing collected packages: syft
  Found existing installation: syft 0.2.4
    Can't uninstall 'syft'. No files were found to uninstall.
  Running setup.py develop for syft
Successfully installed syft


Next we'll get our usual boilerplate code out of the way: data loading, splitting, etc.

Load our data set and split it into test and training data.

## Differential Privacy

Below we will train a model to perform image classification on the MNIST data. We will train it using the differentially private SGD (DP-SGD) optimizer.

How does the privacy budget `epsilon` change when you tweak the parameters of the optimizer? How does it influence accuracy?
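
To make the question concrete, here is a minimal sketch of what one DP-SGD step does conceptually: clip each per-example (microbatch) gradient to `l2_norm_clip`, sum the clipped gradients, add Gaussian noise with standard deviation `noise_multiplier * l2_norm_clip`, and average. The function `dp_sgd_step` and the toy gradients are illustrative assumptions, not part of `tensorflow_privacy`; the real optimizer does this on tensors inside the training loop.

```python
import math
import random


def dp_sgd_step(per_example_grads, l2_norm_clip, noise_multiplier, rng=None):
    """Return the noisy average gradient for one (hypothetical) DP-SGD step."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its L2 norm exceeds the clip bound.
        scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
        for k, g in enumerate(grad):
            summed[k] += g * scale
    # Gaussian noise is calibrated to the clip bound.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Toy gradients with L2 norms 5.0 and 0.5; only the first one gets clipped.
grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = dp_sgd_step(grads, l2_norm_clip=1.5, noise_multiplier=0.0)
with_noise = dp_sgd_step(grads, l2_norm_clip=1.5, noise_multiplier=1.3)
print(no_noise, with_noise)
```

A larger `noise_multiplier` adds more noise per step, which lowers epsilon and usually lowers accuracy. `l2_norm_clip` bounds how far any single example can move the weights.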

In [None]:
import keras
import tensorflow as tf
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy
from tensorflow_privacy.privacy.optimizers.dp_optimizer import DPGradientDescentGaussianOptimizer
from keras.datasets import mnist
import numpy as np

(mnist_x_train, mnist_y_train), (mnist_x_test, mnist_y_test) = mnist.load_data()

mnist_x_train = mnist_x_train.astype( np.float32 ) / 255
mnist_x_test = mnist_x_test.astype( np.float32 ) / 255

mnist_x_train = mnist_x_train.reshape( -1, 28, 28, 1)
mnist_x_test = mnist_x_test.reshape( -1, 28, 28, 1)


mnist_y_train = keras.utils.to_categorical( mnist_y_train )
mnist_y_test = keras.utils.to_categorical( mnist_y_test )

EPOCHS = 10
BATCH_SIZE = 250

model = tf.keras.models.Sequential()
model.add( tf.keras.layers.Conv2D( 32, kernel_size=(3, 3), activation='relu', input_shape=mnist_x_train.shape[ 1: ]  ) )
model.add( tf.keras.layers.MaxPooling2D( pool_size=(2, 2) ) )
model.add( tf.keras.layers.Conv2D( 64, kernel_size=(3, 3), activation='relu' ) )
model.add( tf.keras.layers.Flatten() )
model.add( tf.keras.layers.Dense( 128, activation='relu' ) )
model.add( tf.keras.layers.Dense( 10, activation='softmax' ) )


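optimizer = DPGradientDescentGaussianOptimizer(
    l2_norm_clip=1.5,
    noise_multiplier=1.3,
    num_microbatches=BATCH_SIZE,  # one microbatch per example; must divide the batch size
    learning_rate=0.25)

# The model already ends in a softmax layer, so the loss receives probabilities, not logits.
# Reduction.NONE keeps one loss value per example, which the DP optimizer needs for per-microbatch clipping.
loss = tf.keras.losses.CategoricalCrossentropy( from_logits=False, reduction=tf.losses.Reduction.NONE )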

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(mnist_x_train, mnist_y_train,
          epochs=EPOCHS,
          batch_size=BATCH_SIZE)

print( 'test acc:', model.evaluate( mnist_x_test, mnist_y_test, batch_size=250 ) )

# Account for the same settings that were used for training above.
eps = compute_dp_sgd_privacy.compute_dp_sgd_privacy(n=len( mnist_x_train ), batch_size=BATCH_SIZE, noise_multiplier=1.3, epochs=EPOCHS, delta=1e-5)
print( 'epsilon: ', eps )


Using TensorFlow backend.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
test acc: [2.365349054336548, 0.0957999974489212]
DP-SGD with sampling rate = 0.417% and noise_multiplier = 1.3 iterated over 3600 steps satisfies differential privacy with eps = 1.18 and delta = 1e-05.
The optimal RDP order is 17.0.
epsilon:  (1.1799006739827, 17.0)
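
The accounting log above reports a sampling rate and a step count. As a rough sketch of where those numbers come from, assuming the accountant derives them from `n`, `batch_size` and `epochs` as shown, the helper below (a hypothetical name, not a library function) reproduces them in plain Python. Alongside these two values, the reported epsilon depends on `noise_multiplier` and `delta`; `l2_norm_clip` and `learning_rate` affect accuracy but not epsilon.

```python
import math


def dp_sgd_accounting_inputs(n, batch_size, epochs):
    """Sampling rate and number of steps for a DP-SGD run (illustrative helper)."""
    sampling_rate = batch_size / n
    steps = int(math.ceil(epochs * n / batch_size))
    return sampling_rate, steps


for epochs in (10, 15):
    q, steps = dp_sgd_accounting_inputs(n=60000, batch_size=250, epochs=epochs)
    print(f"epochs={epochs}: sampling rate = {q * 100:.3f}%, steps = {steps}")
```

With 15 epochs this gives the 3600 steps shown in the log; a 10-epoch run corresponds to 2400 steps, so the epoch count passed to the accountant should match the one used for training.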


# PATE

First we need to split up the data and train the teachers. For simplicity we will work with 3 teachers and a small amount of data.


We are going to split the data into 4 partitions of 500 instances each: 3 partitions for the teachers and one for the student.



In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

n_instances = 500
n_teachers = 3

# load data and transform it
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

x_train = x_train.astype( float ) / 255.
x_test = x_test.astype( float ) / 255.

x_train = x_train.reshape( -1, 28, 28, 1)
x_test = x_test.reshape( -1, 28, 28, 1)

y_train = keras.utils.to_categorical( y_train )
y_test = keras.utils.to_categorical( y_test )

# shuffle data
idx = np.arange( len( x_train ) )
np.random.shuffle( idx )
x_train = x_train[ idx ]
y_train = y_train[ idx ]

# gather the teacher data
teacher_data_x = [ x_train[ i * n_instances : ( i + 1 ) * n_instances ] for i in range( n_teachers ) ]
teacher_data_y = [ y_train[ i * n_instances : ( i + 1 ) * n_instances ] for i in range( n_teachers ) ]

# gather the student data
student_data_x = x_train[ n_teachers * n_instances : ( n_teachers + 1 ) * n_instances ]
student_data_y = y_train[ n_teachers * n_instances : ( n_teachers + 1 ) * n_instances ]








Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [None]:
# train the teacher models
def get_model():
  model = keras.models.Sequential()
  model.add( keras.layers.Conv2D( 32, 3, 2, activation='relu', input_shape=x_train.shape[ 1: ] ) )
  model.add( keras.layers.MaxPooling2D( ) )
  model.add( keras.layers.Conv2D( 16, 3, 2, activation='relu' ) )
  model.add( keras.layers.Flatten() )
  model.add( keras.layers.Dense(32, activation='relu') )
  model.add( keras.layers.Dense(10, activation='softmax') )

  model.compile( optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'] )

  return model

# list of teacher models
teacher_models = [ get_model() for _ in range( n_teachers ) ]

# train teacher models
for i, (model, x, y) in enumerate( zip( teacher_models, teacher_data_x, teacher_data_y ) ):
  print( 'teacher', i )
  model.fit( x, y, epochs=16, verbose=0 )
  print( 'test accuracy:', model.evaluate( x_test, y_test, verbose=0 )[ 1 ] )


teacher 0
test accuracy: 0.8320000171661377
teacher 1
test accuracy: 0.8119999766349792
teacher 2
test accuracy: 0.7914000153541565


## Train the student model

To train the student model, we need to label the student's training data using the teacher models. We'll use majority voting with added noise to determine each label.
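
The noisy vote for a single example can be sketched as follows. This is a minimal standard-library illustration, assuming the Laplace scale is `1 / noise_eps` (so a scale of 5 corresponds to `noise_eps=0.2`); the function name `noisy_majority_vote` is hypothetical and not part of any library.

```python
import random
from collections import Counter


def noisy_majority_vote(teacher_labels, n_classes, noise_eps, rng=None):
    """Return the class with the highest Laplace-noised vote count."""
    rng = rng or random.Random()
    scale = 1.0 / noise_eps  # assumed relation between epsilon and Laplace scale
    counts = Counter(teacher_labels)
    noisy_counts = []
    for cls in range(n_classes):
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy_counts.append(counts.get(cls, 0) + noise)
    return max(range(n_classes), key=noisy_counts.__getitem__)


# Three teachers vote 7, 7 and 2 for one student example.
print(noisy_majority_vote([7, 7, 2], n_classes=10, noise_eps=0.2))
```

With only 3 teachers, noise of scale 5 is large compared to the vote counts (at most 3), so the noisy label frequently differs from the plain majority. More teachers or a smaller noise scale make the labels more reliable at the cost of a larger privacy budget.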

In [None]:
# label the data
labels = [ teacher.predict( student_data_x ) for teacher in teacher_models ]

# perform the voting (np.float was removed from recent NumPy releases; the builtin float is equivalent)
votes = np.zeros( ( student_data_x.shape[ 0 ], 10 ), dtype=float )
for i in range( len( student_data_x ) ):
  for j in range( n_teachers ):
    label = np.argmax( labels[ j ][ i ] )
    votes[ i, label ] += 1
  # add the noise per class
  for j in range( 10 ):
    votes[ i, j ] += np.random.laplace(loc=0.0, scale=5 )

student_data_y = keras.utils.to_categorical( np.argmax( votes, axis=1 ) )

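# train the student on the noisily labelled student data
student_model = get_model()
print( 'training student model' )
student_model.fit( student_data_x, student_data_y, epochs=16, verbose=0 )
# evaluate on the held-out test set, which still carries the true labels
print( 'test accuracy:', student_model.evaluate( x_test, y_test, verbose=0 )[ 1 ] )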


training student model
test accuracy: 0.15199999511241913


In [None]:
# privacy analysis
from syft.frameworks.torch.dp import pate


teacher_preds = np.argmax( np.array( labels ), axis=2 )
print( teacher_preds.shape )

data_dep_eps, data_indep_eps = pate.perform_analysis( teacher_preds=teacher_preds,
                                                      indices=np.argmax( votes, axis=1 ),
                                                      noise_eps=0.2,
                                                      delta=1/1500
                                                     )

print(data_dep_eps, data_indep_eps)

(3, 500)
87.3132203870893 87.31322038709031


What's more: try it yourself with PyTorch.

https://opacus.ai/tutorials/building_image_classifier

[Opacus](https://github.com/pytorch/opacus) is a library for training PyTorch models with differential privacy. It requires only minimal code changes on the client side, has little impact on training performance, and lets the client track the privacy budget spent so far at any point during training.

Just click the badge to open the tutorial in Colab:

<a href="https://colab.research.google.com/github/pytorch/opacus/blob/main/tutorials/building_image_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The notebook comes from
https://github.com/pytorch/opacus/blob/main/tutorials/building_image_classifier.ipynb


Another notebook, this one on PATE analysis:

https://github.com/erinqhu/differential-privacy-PATE/blob/master/PATE_analysis.ipynb

<a href="https://colab.research.google.com/github/erinqhu/differential-privacy-PATE/blob/master/PATE_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>