# Cifar-10 tutorial

This notebook introduces advanced tools such as the MLP Mixer, which involves residual connections with Lipschitz guarantees, an alternative input space (HSV), and loss gradient clipping.

## Imports

The library is based on TensorFlow.

In [1]:
import tensorflow as tf

### lip-dp dependencies

We need a model `DP_Model` that handles the noisification of gradients. It is trained with a `loss`. The model is initialized with the convenience function `DPParameters`.

In [2]:
from deel.lipdp import losses
from deel.lipdp.model import DP_Model
from deel.lipdp.model import DPParameters

The `DP_Accountant` callback keeps track of $(\epsilon,\delta)$-DP values epoch after epoch. In practice we may be interested in reaching the maximum val_accuracy under a privacy constraint $\epsilon$: the convenience function `get_max_epochs` does exactly that by performing a dichotomy (bisection) search over the number of epochs.

In [3]:
from deel.lipdp.model import DP_Accountant
from deel.lipdp.sensitivity import get_max_epochs

The framework requires a control of the maximum norm of inputs. This can be ensured with input clipping for example: `bound_clip_value`.

In [4]:
from deel.lipdp.pipeline import bound_clip_value
from deel.lipdp.pipeline import load_and_prepare_data
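
To make the idea of controlling the maximum input norm concrete, here is a minimal pure-Python sketch of norm clipping. It illustrates the general technique only and is not the implementation of `bound_clip_value`; the helper name `clip_by_l2_norm` is hypothetical.

```python
import math


def clip_by_l2_norm(values, max_norm):
    """Rescale `values` so that its L2 norm does not exceed `max_norm`.

    Hypothetical helper for illustration; inputs already inside the
    ball of radius `max_norm` are returned unchanged.
    """
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= max_norm:
        return list(values)
    scale = max_norm / norm
    return [v * scale for v in values]


inside = clip_by_l2_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5, untouched
outside = clip_by_l2_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0, rescaled
outside_norm = math.sqrt(sum(v * v for v in outside))
print(inside, outside, outside_norm)
```

After clipping, every input lies in a ball of known radius, which is the kind of bound the framework needs as a starting point.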

## Setup DP Lipschitz model

Here we apply the "global" strategy, with a noise multiplier of $4.0$. Note that for Cifar-10 the dataset size is $N=50,000$, and it is recommended that $\delta<\frac{1}{N}$. So we propose a value of $\delta=10^{-5}$.

In [5]:
import warnings
warnings.filterwarnings("ignore")

dp_parameters = DPParameters(
    noisify_strategy="global",
    noise_multiplier=4.0,
    delta=1e-5,
)

epsilon_max = 10.0
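
As a quick sanity check of the recommendation $\delta<\frac{1}{N}$, the arithmetic can be spelled out in a few lines. The variable names below are illustrative and not part of the library.

```python
# Training set size of Cifar-10 and the delta chosen above.
n_train = 50_000
delta = 1e-5

# The recommendation is delta < 1/N.
delta_limit = 1.0 / n_train
delta_is_valid = delta < delta_limit

print(f"1/N = {delta_limit:.1e}, delta = {delta:.1e}, valid = {delta_is_valid}")
```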

With many parameters, it can be worthwhile to use the `local` strategy instead of `global`, since the effective noise grows as $\mathcal{O}(\sqrt{D})$ in the `global` strategy. Since the privacy leakage is higher with the `local` strategy, we compensate with a high `noise_multiplier`.
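
One way to read the $\mathcal{O}(\sqrt{D})$ claim, assuming $D$ denotes the number of coordinates that receive independent Gaussian noise, is that the L2 norm of the noise vector grows like $\sigma\sqrt{D}$. The sketch below checks this empirically with the standard library only; the function name and the chosen dimensions are illustrative.

```python
import math
import random


def mean_noise_norm(dim, sigma=1.0, trials=200, seed=0):
    """Average L2 norm of `dim` i.i.d. Gaussian samples with std `sigma`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        squared = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dim))
        total += math.sqrt(squared)
    return total / trials


norm_100 = mean_noise_norm(100)
norm_400 = mean_noise_norm(400)

# Multiplying the dimension by 4 should roughly double the norm.
ratio = norm_400 / norm_100
print(f"D=100: {norm_100:.2f}, D=400: {norm_400:.2f}, ratio: {ratio:.2f}")
```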

![DP-SGD accountant](fig_accountant.png "DP-SGD accountant")

### Loading the data

We clip the elementwise input upper-bound to $30.0$. The model operates in `HSV` space. The train set is augmented with random left/right flips.

In [6]:
def augmentation_fct(image, label):
    image = tf.image.random_flip_left_right(image)
    return image, label

input_upper_bound = 30.0
ds_train, ds_test, dataset_metadata = load_and_prepare_data(
    "cifar10",
    colorspace="HSV",
    batch_size=10_000,
    drop_remainder=True,  # accounting assumes fixed batch size
    augmentation_fct=augmentation_fct,
    bound_fct=bound_clip_value(  # other strategies are possible, like normalization.
        input_upper_bound
    ),  # clipping preprocessing allows to control input bound
)
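
With a fixed batch size, the quantities a privacy accountant typically relies on are easy to derive by hand. The following sketch uses the dataset size and batch size from this notebook; the variable names are illustrative, and which of these values the library consumes internally is not shown here.

```python
n_train = 50_000      # Cifar-10 training set size
batch_size = 10_000   # value passed to load_and_prepare_data above

# Number of optimizer steps per epoch when incomplete batches are dropped.
steps_per_epoch = n_train // batch_size

# Fraction of the dataset seen at each step.
sampling_rate = batch_size / n_train

# Examples discarded per epoch by drop_remainder=True.
dropped_examples = n_train - steps_per_epoch * batch_size

print(steps_per_epoch, sampling_rate, dropped_examples)
```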


### Build the MLP Mixer model

We imitate the interface of Keras. We use common layers found in deel-lip, each wrapped so that it handles the bound propagation.

In [7]:
from deel.lipdp.layers import DP_AddBias
from deel.lipdp.layers import DP_BoundedInput
from deel.lipdp.layers import DP_ClipGradient
from deel.lipdp.layers import DP_Flatten
from deel.lipdp.layers import DP_GroupSort
from deel.lipdp.layers import DP_Lambda
from deel.lipdp.layers import DP_LayerCentering
from deel.lipdp.layers import DP_Permute
from deel.lipdp.layers import DP_QuickSpectralDense
from deel.lipdp.layers import DP_Reshape
from deel.lipdp.layers import DP_ScaledGlobalL2NormPooling2D
from deel.lipdp.layers import DP_ScaledL2NormPooling2D
from deel.lipdp.layers import DP_QuickSpectralConv2D

The MLP Mixer uses residual connections. Residual connections are handled with the utility function `make_residuals`, which wraps the layers inside a block that handles bounds propagation.

![Residuals Connections](residuals.png "Residual Connections")

In [8]:
from deel.lipdp.layers import make_residuals
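
A plain residual `x + f(x)` can be up to 2-Lipschitz even when `f` is 1-Lipschitz, whereas the average `0.5 * (x + f(x))` stays 1-Lipschitz. The sketch below checks this numerically in pure Python. That the `"1-lip-add"` mode used later relies on this averaging is an assumption suggested by its name, not something confirmed here; all function names are hypothetical.

```python
import math
import random


def l2_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def branch(x):
    # Any 1-Lipschitz map works; elementwise absolute value is one of them.
    return [abs(v) for v in x]


def plain_residual(x):
    return [xi + fi for xi, fi in zip(x, branch(x))]


def averaged_residual(x):
    return [0.5 * (xi + fi) for xi, fi in zip(x, branch(x))]


def stretch(fn, a, b):
    """Ratio between output distance and input distance for one pair."""
    return l2_distance(fn(a), fn(b)) / l2_distance(a, b)


# Deterministic pair on which the plain residual doubles distances.
plain_worst_case = stretch(plain_residual, [1.0] * 8, [2.0] * 8)

# Random pairs: the averaged residual never stretches distances.
rng = random.Random(0)
worst_averaged = 0.0
for _ in range(1000):
    a = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    worst_averaged = max(worst_averaged, stretch(averaged_residual, a, b))

print(plain_worst_case, worst_averaged)
```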

Now, we proceed with the creation of the environment: the hyperparameters and the function that builds the model.

In [9]:
skip_connections = False  # use skip connections, like in original MLP Mixer architecture.
clip_loss_gradient = 2**0.5  # elementwise gradient is clipped to value sqrt(2) - which is the maximum for CCE loss.
add_biases = False  # Add biases after linear transformations.
biases_norm_max = 0.05
hidden_size = 64
mlp_seq_dim = 64
mlp_channel_dim = 128
num_mixer_layers = 2  # Two MLP Mixer blocks.
layer_centering = False  # Centering operation (like LayerNormalization without the reducing operation). Linear 1-Lipschitz.
patch_size = 4  # Side length (in pixels) of each square patch.

def create_MLP_Mixer(dp_parameters, dataset_metadata, upper_bound):
    input_shape = (32, 32, 3)
    layers = [DP_BoundedInput(input_shape=input_shape, upper_bound=upper_bound)]

    layers.append(
        DP_Lambda(
            tf.image.extract_patches,
            arguments=dict(
                sizes=[1, patch_size, patch_size, 1],
                strides=[1, patch_size, patch_size, 1],
                rates=[1, 1, 1, 1],
                padding="VALID",
            ),
        )
    )

    seq_len = (input_shape[0] // patch_size) * (input_shape[1] // patch_size)

    layers.append(DP_Reshape((seq_len, (patch_size ** 2) * input_shape[-1])))
    layers.append(
        DP_QuickSpectralDense(
            units=hidden_size, use_bias=False, kernel_initializer="identity"
        )
    )

    for _ in range(num_mixer_layers):
        to_add = [
            DP_Permute((2, 1)),
            DP_QuickSpectralDense(
                units=mlp_seq_dim, use_bias=False, kernel_initializer="identity"
            ),
        ]
        if add_biases:
            to_add.append(DP_AddBias(biases_norm_max))
        to_add.append(DP_GroupSort(2))
        if layer_centering:
            to_add.append(DP_LayerCentering())
        to_add += [
            DP_QuickSpectralDense(
                units=seq_len, use_bias=False, kernel_initializer="identity"
            ),
            DP_Permute((2, 1)),
        ]

        if skip_connections:
            layers += make_residuals("1-lip-add", to_add)
        else:
            layers += to_add

        to_add = [
            DP_QuickSpectralDense(
                units=mlp_channel_dim, use_bias=False, kernel_initializer="identity"
            ),
        ]
        if add_biases:
            to_add.append(DP_AddBias(biases_norm_max))
        to_add.append(DP_GroupSort(2))
        if layer_centering:
            to_add.append(DP_LayerCentering())
        to_add.append(
            DP_QuickSpectralDense(
                units=hidden_size, use_bias=False, kernel_initializer="identity"
            )
        )

        if skip_connections:
            layers += make_residuals("1-lip-add", to_add)
        else:
            layers += to_add

    layers.append(DP_Flatten())
    layers.append(
        DP_QuickSpectralDense(units=10, use_bias=False, kernel_initializer="identity")
    )

    layers.append(DP_ClipGradient(clip_loss_gradient))

    model = DP_Model(
        layers,
        dp_parameters=dp_parameters,
        dataset_metadata=dataset_metadata,
        name="mlp_mixer",
    )

    model.build(input_shape=(None, *input_shape))

    return model
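
The shapes produced by the patch-embedding stage of `create_MLP_Mixer` can be checked by hand. The sketch below redoes the arithmetic with the hyperparameters defined above; it only counts the weights of a bias-free dense kernel and makes no claim about any auxiliary variables a layer may hold.

```python
input_shape = (32, 32, 3)
patch_size = 4
hidden_size = 64

# Number of non-overlapping patches (the "sequence length" of the mixer).
seq_len = (input_shape[0] // patch_size) * (input_shape[1] // patch_size)

# Each patch is flattened into a vector of patch_size * patch_size * channels.
patch_dim = (patch_size ** 2) * input_shape[-1]

# A bias-free dense layer mapping patch_dim -> hidden_size has this many weights.
embedding_weights = patch_dim * hidden_size

print(f"seq_len={seq_len}, patch_dim={patch_dim}, embedding_weights={embedding_weights}")
```

The embedding therefore maps a tensor of shape `(seq_len, patch_dim)` to `(seq_len, hidden_size)` per image.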

We compile the model with:
* any first-order optimizer (e.g. Adam). No adaptation is needed.
* a loss with a known Lipschitz constant, e.g. Categorical Cross-entropy with temperature.
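
To see why a bound on the loss gradient is available, consider plain softmax cross-entropy (without temperature, as a simplifying assumption): its gradient with respect to the logits is `softmax(z) - onehot(y)`, whose L2 norm never exceeds $\sqrt{2}$. The following pure-Python sketch checks this on a few cases; the function names are illustrative.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cce_logit_gradient_norm(logits, label):
    """L2 norm of d(cross-entropy)/d(logits) = softmax(logits) - onehot(label)."""
    probs = softmax(logits)
    grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


confident_logits = [50.0] + [0.0] * 9

uniform = cce_logit_gradient_norm([0.0] * 10, label=0)
confident_right = cce_logit_gradient_norm(confident_logits, label=0)
confident_wrong = cce_logit_gradient_norm(confident_logits, label=1)

print(uniform, confident_right, confident_wrong, math.sqrt(2.0))
```

The bound is approached when the model is confidently wrong, which matches the value used for `clip_loss_gradient` above.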

In [10]:
model = create_MLP_Mixer(dp_parameters, dataset_metadata, input_upper_bound)
model.compile(
    # Compile model using DP loss
    loss=losses.DP_TauCategoricalCrossentropy(256.0),
    # this method is compatible with any first order optimizer
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
    metrics=["accuracy"],
)
model.summary()

Model: "mlp_mixer"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dp__bounded_input (DP_Bound  multiple                 0         
 edInput)                                                        
                                                                 
 dp__lambda (DP_Lambda)      multiple                  0         
                                                                 
 dp__reshape (DP_Reshape)    multiple                  0         
                                                                 
 dp__quick_spectral_dense (D  multiple                 3072      
 P_QuickSpectralDense)                                           
                                                                 
 dp__permute (DP_Permute)    multiple                  0         
                                                                 
 dp__quick_spectral_dense_1   multiple                 40

Observe that the model contains only 246K parameters. This is an advantage of MLP Mixer architectures: the number of parameters is small. However, the number of FLOPs can be quite high. Without gradient clipping, huge batch sizes can be used, which benefits the privacy/utility trade-off.

In order to control epsilon, we compute the adequate number of epochs.

In [11]:
num_epochs = get_max_epochs(epsilon_max, model)

epoch bounds = (0, 512.0) and epsilon = 14.81894855578722 at epoch 512.0
epoch bounds = (256.0, 512.0) and epsilon = 9.820083418023108 at epoch 256.0
epoch bounds = (256.0, 384.0) and epsilon = 12.31951600358698 at epoch 384.0
epoch bounds = (256.0, 320.0) and epsilon = 11.069799714608529 at epoch 320.0
epoch bounds = (256.0, 288.0) and epsilon = 10.44494156631582 at epoch 288.0
epoch bounds = (256.0, 272.0) and epsilon = 10.132512492169463 at epoch 272.0
epoch bounds = (264.0, 272.0) and epsilon = 9.976297955096285 at epoch 264.0
epoch bounds = (264.0, 268.0) and epsilon = 10.054405223632873 at epoch 268.0
epoch bounds = (264.0, 266.0) and epsilon = 10.015351589364581 at epoch 266.0
epoch bounds = (265.0, 266.0) and epsilon = 9.995824772230431 at epoch 265.0
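
The trace above is consistent with a bisection over the number of epochs. A minimal sketch of such a search is given below. It is not the code of `get_max_epochs`: the function `max_epochs_under_budget` and the accountant stand-in `toy_epsilon` are hypothetical, and the search assumes that epsilon is non-decreasing in the number of epochs and that zero epochs fits the budget.

```python
import math


def max_epochs_under_budget(epsilon_at, epsilon_max, upper=512):
    """Largest integer e in [0, upper] such that epsilon_at(e) <= epsilon_max."""
    if epsilon_at(upper) <= epsilon_max:
        return upper
    # Invariant: epsilon_at(low) <= epsilon_max < epsilon_at(high).
    low, high = 0, upper
    while high - low > 1:
        mid = (low + high) // 2
        if epsilon_at(mid) <= epsilon_max:
            low = mid
        else:
            high = mid
    return low


def toy_epsilon(epochs):
    # Hypothetical stand-in for a real accountant: grows monotonically.
    return 0.6 * math.sqrt(epochs)


best = max_epochs_under_budget(toy_epsilon, epsilon_max=10.0)
print(best, toy_epsilon(best), toy_epsilon(best + 1))
```

Each iteration halves the interval, so a budget can be located among 512 candidate epoch counts with about ten accountant evaluations.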


## Train the model

The model can be trained, and the DP Accountant will automatically track the privacy loss.

In [12]:
hist = model.fit(
    ds_train,
    epochs=num_epochs,
    validation_data=ds_test,
    callbacks=[
        # accounting is done thanks to a callback
        DP_Accountant(log_fn="logging"),  # wandb.log also available.
    ],
)

Epoch 1/265
 (0.5205893807331654, 1e-05)-DP guarantees for epoch 1 

Epoch 2/265
 (0.7169615437758403, 1e-05)-DP guarantees for epoch 2 

Epoch 3/265
 (0.8714581783028138, 1e-05)-DP guarantees for epoch 3 

Epoch 4/265
 (1.0041033056975341, 1e-05)-DP guarantees for epoch 4 

Epoch 5/265
 (1.121902451763874, 1e-05)-DP guarantees for epoch 5 

Epoch 6/265
 (1.2297900098052366, 1e-05)-DP guarantees for epoch 6 

Epoch 7/265
 (1.3301791512711914, 1e-05)-DP guarantees for epoch 7 

Epoch 8/265
 (1.425115891691246, 1e-05)-DP guarantees for epoch 8 

Epoch 9/265
 (1.512644960027369, 1e-05)-DP guarantees for epoch 9 

Epoch 10/265
 (1.599192443478913, 1e-05)-DP guarantees for epoch 10 

Epoch 11/265
 (1.6782666312983627, 1e-05)-DP guarantees for epoch 11 

Epoch 12/265
 (1.7566369758486253, 1e-05)-DP guarantees for epoch 12 

Epoch 13/265
 (1.833150779023074, 1e-05)-DP guarantees for epoch 13 

Epoch 14/265
 (1.903546174784228, 1e-05)-DP guarantees for epoch 14 

Epoch 15/265
 (1.9739415712927

This final val_accuracy is consistent with results reported for other frameworks. For comparison, in the Opacus tutorials, ResNet-18 reaches 60% val_accuracy at $\epsilon=47$, but 15% at $\epsilon=13$.