# Label Only Membership Inference (Revisited on points)

### Threat Model:

- **Black Box** access to an overfitted classifier with no access to actual $D_{train}$
- Predict API returns **only labels instead of confidence vectors**
- We have some insight into the training data distribution through a dataset $D_{out}$, **but** $D_{train} \cap D_{out} = \varnothing$
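The label-only restriction can be pictured as a thin wrapper around any scoring function. The sketch below is only an illustration: `label_only_api` and `toy_scores` are hypothetical names that are not part of the `mia` package, and the toy scores stand in for a real model.

```python
import numpy as np

def label_only_api(score_fn):
    """Wrap a scoring function so that callers only ever see hard labels."""
    def predict(X):
        scores = np.asarray(score_fn(X))   # shape (n_samples, n_classes)
        return np.argmax(scores, axis=1)   # confidence values are discarded
    return predict

# Hypothetical stand-in for the target model: 3 samples, 4 classes.
def toy_scores(X):
    return np.array([[0.1, 0.7, 0.1, 0.1],
                     [0.9, 0.0, 0.05, 0.05],
                     [0.2, 0.2, 0.5, 0.1]])

predict = label_only_api(toy_scores)
labels = predict(np.zeros((3, 32, 32, 3)))
print(labels)  # [1 0 2]
```

Everything the attacker learns about the target has to be derived from such label vectors.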


### Attack Target: 
- Train local shadow models and attack them to extract membership leakage features
- Use data perturbations to exploit how close training and test points lie to the classification boundaries
- Perform the boundary-based attack on the actual model

### Evaluation Target
- Score over $50\%$ accuracy
- Train the attack model under these assumptions and compare it with the confidence-vector attack
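The $50\%$ target is the accuracy of random guessing when the evaluation set holds equally many members and non-members, as in the evaluation further down (500 'in' and 500 'out' records). A minimal sketch of that comparison, where `attack_accuracy` and the toy guesses are hypothetical:

```python
import numpy as np

def attack_accuracy(is_member_true, is_member_pred):
    """Fraction of membership guesses that are correct."""
    is_member_true = np.asarray(is_member_true)
    is_member_pred = np.asarray(is_member_pred)
    return float(np.mean(is_member_true == is_member_pred))

# Balanced toy evaluation set: 4 members (1) and 4 non-members (0).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
random_guess = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # correct half of the time
better_guess = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # 6 of 8 correct

baseline = attack_accuracy(y_true, random_guess)
improved = attack_accuracy(y_true, better_guess)
print(baseline, improved)  # 0.5 0.75
```

Any attack that stays consistently above the baseline leaks membership information.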

Implemented based on [this paper](https://arxiv.org/abs/2007.14321).

In [1]:
import numpy as np
import matplotlib.pyplot as plt

import math
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import regularizers

# for image interpolation
import scipy.ndimage.interpolation as interpolation

from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import classification_report

from mia.attack_model import *
from mia.label_only import *
from mia.shadow_models import *
from mia.utilities import *


from tqdm import tqdm
import sys
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))


Num GPUs Available:  1


## Target Model

### Model Architecture
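- 2 Conv2D layers of 32 $3\times 3$ filters followed by MaxPooling and Dropout ($0.3$)
- 2 Conv2D layers of 64 $3\times 3$ filters followed by MaxPooling and Dropout ($0.3$)
- Dense layers of 256 and 512 neurons, with Dropout ($0.3$) between them
- Dense output layer of 10 neurons (raw logits)
- Every hidden layer has ReLU activation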


In [2]:
D_TARGET_SIZE = 7500

In [3]:
def f_target(X_train, y_train, X_test=None, y_test=None, epochs=100):
  """
  Returns a trained target model. If test data are specified they are used as validation data during training, otherwise 20% of the training data is held out for validation.
  """
  model = models.Sequential()
  model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
  model.add(layers.Conv2D(32, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Dropout(0.3))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Dropout(0.3))

  model.add(layers.Flatten())
  model.add(layers.Dense(256, activation='relu'))
  model.add(layers.Dropout(0.3))
  model.add(layers.Dense(512, activation='relu'))

  model.add(layers.Dense(10))
  
  optimizer = keras.optimizers.Adam(learning_rate=0.001)
  model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
  if X_test is None or y_test is None:
    history = model.fit(X_train, y_train, epochs=epochs, 
                    validation_split=0.2)
  else:
    history = model.fit(X_train, y_train, epochs=epochs, 
                    validation_data=(X_test, y_test))
  return model

In [4]:
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# the first D_TARGET_SIZE training records belong to the target; everything else is the attacker's 'out' data
attacker_labels = np.concatenate((train_labels[D_TARGET_SIZE:], test_labels))
attacker_images = np.concatenate((train_images[D_TARGET_SIZE:], test_images))
target_images = train_images[:D_TARGET_SIZE]
target_labels = train_labels[:D_TARGET_SIZE]

In [5]:
train_images, eval_images, train_labels, eval_labels = train_test_split(target_images, target_labels, test_size=0.2, shuffle=True)
target_model = f_target(train_images, train_labels, eval_images, eval_labels, epochs=100) 



### Perturbed Instance Behaviour

Next we apply perturbations to data instances from inside and outside $D_{target}$ and count, per class, how often the predicted label changes under these perturbations.
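The counting step can be sketched in plain NumPy as follows. This is an illustration under assumptions, not the `mia` implementation: `divergence_per_class` is a hypothetical helper and the label arrays are made up.

```python
import numpy as np

def divergence_per_class(y_pred, y_perturbed, n_classes):
    """Percentage of perturbed queries whose label differs from y_pred, per class.

    y_pred:      shape (n_samples,), label of the unperturbed instance
    y_perturbed: shape (n_samples, n_perturbations)
    """
    changed = y_perturbed != y_pred[:, None]   # True where the label flipped
    result = []
    for c in range(n_classes):
        rows = changed[y_pred == c]
        # Classes with no predicted instance get 0.0 to avoid dividing by zero.
        result.append(100.0 * rows.mean() if rows.size else 0.0)
    return result

y_pred = np.array([0, 0, 1])
y_perturbed = np.array([[0, 0, 1, 0],    # 1 of 4 flipped
                        [0, 0, 0, 0],    # none flipped
                        [1, 2, 2, 1]])   # 2 of 4 flipped
percentages = divergence_per_class(y_pred, y_perturbed, n_classes=3)
print(percentages)  # [12.5, 50.0, 0.0]
```

A lower percentage for training records than for unseen records is the signal the attack tries to exploit.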



## Shadow Models
Next we define our own shadow models.

### Shadow Model Architecture
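- 3 Conv2D layers of $32, 64, 128$ filters of size $3 \times 3$, each with ReLU activation and followed by MaxPooling
- Dense layers of 256 and 512 nodes with ReLU activation
- Dense layer of 10 nodes as output layer

As in the target model, the output layer returns raw logits; softmax is applied on top of them (inside the loss during training, via `from_logits=True`) whenever probability vectors are needed.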
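For a label-only attack the distinction between logits and probabilities does not change the observed output, because softmax is monotonic and therefore preserves the argmax. A small NumPy check with made-up logits:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[2.0, -1.0, 0.5],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)

labels_from_logits = logits.argmax(axis=1)
labels_from_probs = probs.argmax(axis=1)
print(labels_from_logits, labels_from_probs)  # [0 2] [0 2]
```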



### Shadow Dataset Composition

We simply divide the CIFAR-10 dataset into $D_{out}$ and $D_{train}$ such that $D_{train} \cap D_{out} = \varnothing$, and use $D_{out}$ to train and test the shadow models and the attack model.
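The disjointness condition is easy to verify on indices. The sketch below uses a random permutation, whereas the notebook takes the first `D_TARGET_SIZE` training records as $D_{train}$; `split_disjoint` and the index names are hypothetical.

```python
import numpy as np

def split_disjoint(n_total, n_train, seed=0):
    """Return two index arrays that have no index in common."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_total)
    return order[:n_train], order[n_train:]

# 60,000 = CIFAR-10 train (50,000) + test (10,000); 7,500 target records.
d_train_idx, d_out_idx = split_disjoint(n_total=60_000, n_train=7_500)

overlap = np.intersect1d(d_train_idx, d_out_idx)
print(len(d_train_idx), len(d_out_idx), len(overlap))  # 7500 52500 0
```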

In [6]:
N_SHADOWS = 1
D_SHADOW_SIZE = 5000

In [7]:
def f_shadow():
  model = models.Sequential()
  model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(128, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  
  
  model.add(layers.Flatten())
  model.add(layers.Dense(256, activation='relu'))
  model.add(layers.Dense(512, activation='relu'))

  model.add(layers.Dense(10))
  
  optimizer = keras.optimizers.Adam()
  model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
  return model

In [8]:
# returns list of (trained shadow_model, D_shadow)
def create_shadows(D_shadows):
  shadow_models_bundle = ShadowModelBatch(len(D_shadows), f_shadow) # shadow model list
  shadow_models_bundle.fit_all(D_shadows, epochs=50)
  return shadow_models_bundle # return a list where every item is (model, acc), train-data, test-data

In [9]:
# generate shadow datasets
D_shadows = generate_shadow_dataset(target_model, N_SHADOWS, D_SHADOW_SIZE, 10, attacker_images, attacker_labels)

In [10]:
# train the shadow models
shadow_models = create_shadows(D_shadows)



## Attack Model

### Attack Model Architecture
The attack model consists of one shallow layer of 10 neurons, as proposed in Shokri et al. and in the related label-only attack paper.


### Attack Dataset
The attack dataset consists of vectors $x_i$, where each $x_i$ contains:
- the real label
- the predicted label
- a bitstring of length $n'$, where $x_{i,j+2}, \; j \in \{1, ..., n'\}$, is 1 if the label of the $j$-th perturbation is the same as the predicted label and 0 otherwise.
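One such vector can be assembled as in the sketch below, which follows the list above and compares each perturbed label with the predicted label. `attack_row` is a hypothetical helper and the labels are made up; the `mia` package may lay out its rows differently.

```python
import numpy as np

def attack_row(y_true, y_pred, perturbed_labels):
    """Build one attack vector: [y_true, y_pred, b_1, ..., b_n']."""
    perturbed_labels = np.asarray(perturbed_labels)
    # 1 if the label survived the perturbation, 0 if it changed
    bits = (perturbed_labels == y_pred).astype(np.int8)
    return np.concatenate(([y_true, y_pred], bits))

# Hypothetical record: true class 3, predicted class 3, five perturbed queries.
row = attack_row(y_true=3, y_pred=3, perturbed_labels=[3, 3, 5, 3, 1])
print(row)  # [3 3 1 1 0 1 0]
```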


### Perturbed Queries for feature extraction and Attack Dataset

In order to construct the actual attack dataset we have 2 perturbation functions:
- Translate
- Rotate

which apply the augmentations needed to build the feature vector for a query.

Every augmentation is applied to the input $X$ and the target model is queried on each result, which yields a binary vector $x_{attack}$ with

$$x_{attack_p} = \begin{cases} 1 & \text{if } y_p = y_{true} \\ 0 & \text{otherwise} \end{cases} \quad \forall p \in Perturbations(X)$$

where $y_p$ is the label predicted for perturbation $p$ of input $X$.
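The set $Perturbations(X)$ can be enumerated from a rotation range $r$ and a translation range $d$. The sketch below is an assumption about how such a grid could look: integer angles in degrees from $-r$ to $r$ and pixel shifts with $|dx| + |dy| = d$. It reproduces the counts used in this notebook ($2r$ rotations and $4d$ translations, neutral excluded), but `rotation_angles` and `translation_offsets` are hypothetical and `create_rotates` / `create_translates` in `mia` may enumerate differently.

```python
def rotation_angles(r):
    """Non-neutral rotation angles: -r..r without 0, so 2*r values (assumed degrees)."""
    return [a for a in range(-r, r + 1) if a != 0]

def translation_offsets(d):
    """Non-neutral pixel shifts (dx, dy) with |dx| + |dy| == d, so 4*d values."""
    return [(dx, dy)
            for dx in range(-d, d + 1)
            for dy in range(-d, d + 1)
            if abs(dx) + abs(dy) == d]

r, d = 3, 1
angles = rotation_angles(r)
offsets = translation_offsets(d)
n_features = len(angles) + len(offsets) + 2   # + y_pred and y_true
print(angles)      # [-3, -2, -1, 1, 2, 3]
print(offsets)     # [(-1, 0), (0, -1), (0, 1), (1, 0)]
print(n_features)  # 12
```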

In [11]:
r = 3 # rotate range => creating 2*r+1 rotations
d = 1 # translate range => creating 4*d+1 translations

In [12]:
# input dims = 2*r (rotations, neutral excluded) + 4*d (translations, neutral excluded) + 2 (y_pred, y_true)
attack_model = LabelOnlyAttackModel(shadow_models, 10, (2*r+4*d+2,), 'adam')

In [13]:
attack_model.fit(r, d, epochs=100)

Preparing shadow batch of size 66
Done!


## Attack Evaluation

In [14]:
D_in = attack_model.prepare_batch(target_model, train_images[:500], train_labels[:500], True)

KeyboardInterrupt: ignored

In [None]:
D_out = attack_model.prepare_batch(target_model, attacker_images[:500], attacker_labels[:500], False)

In [None]:
y_pred = (attack_model.predict(np.concatenate((D_out[:, :-1], D_in[:, :-1]))) > 0.5).astype(np.int8)
y_true = np.concatenate((D_out[:, -1], D_in[:, -1]))
print(classification_report(y_true.reshape(-1), y_pred.reshape(-1)))

In [None]:

y_pred = attack_model.predict(np.concatenate((D_out[:, :-1], D_in[:, :-1])))
y_true = np.concatenate((D_out[:, -1], D_in[:, -1]))

fpr, tpr, _ = roc_curve(y_true, y_pred)

plt.plot(fpr, tpr)

print(f"AUC = {roc_auc_score(y_true, y_pred)}")
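The AUC reported above can be read as the probability that a randomly chosen member receives a higher attack score than a randomly chosen non-member, with ties counted as one half. A minimal NumPy sketch of that definition, where `pairwise_auc` and the toy scores are hypothetical and the pairwise matrix is only practical for small evaluation sets:

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(member score > non-member score), ties counted as 1/2."""
    y_true = np.asarray(y_true).reshape(-1)
    scores = np.asarray(scores, dtype=float).reshape(-1)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all member/non-member pairs
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

y_true_toy = np.array([1, 1, 0, 0])
auc_perfect = pairwise_auc(y_true_toy, [0.9, 0.8, 0.2, 0.1])
auc_chance = pairwise_auc(y_true_toy, [0.5, 0.5, 0.5, 0.5])
auc_mixed = pairwise_auc(y_true_toy, [0.9, 0.3, 0.4, 0.1])
print(auc_perfect, auc_chance, auc_mixed)  # 1.0 0.5 0.75
```

An AUC of $0.5$ corresponds to an attack that cannot separate members from non-members at all.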

# Extras

## Check the target model's behaviour on perturbed images

In [None]:
def study_perturbations(model, X, y, rs, ts):
  """Per predicted class, return the percentage of perturbed queries whose label differs from the unperturbed prediction."""
  diffs = []
  y_pred = target_predict(model, X)
  for c in range(10):
    # select the instances that the model assigns to class c
    idx = y_pred[:, 0] == c
    X_c = X[idx]
    y_pred_c = y_pred[idx]
    if len(X_c) == 0:
      diffs.append(0) # no instance was predicted as class c, avoid dividing by zero
      continue
    # binary labels: 1 where y_perturbed == y_pred, otherwise 0
    perturbed_labels = (augmented_queries(model, X_c, y_pred_c, rs, ts) == y_pred_c).astype(np.int8)
    n_queries = perturbed_labels.size
    # count how many perturbed labels diverge from the predicted label
    diff = n_queries - int(perturbed_labels.sum())
    diffs.append(int(100 * diff / n_queries)) # percentage of changed labels in the class sample

  return diffs

In [None]:
N_SAMPLES = 100
train_idx = np.random.choice(range(train_images.shape[0]), N_SAMPLES, replace=False)
test_idx = np.random.choice(range(attacker_images.shape[0]), N_SAMPLES, replace=False)

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# tests in D_in
diffs_per_class_D_in = study_perturbations(target_model, train_images[train_idx], train_labels[train_idx], r, d) 
total_diffs_in = sum(diffs_per_class_D_in)/10
axes[0].set_title('D_in predicted label divergence');
axes[0].barh(list(range(10)), diffs_per_class_D_in, tick_label=[f'Class-{i}' for i in range(10)])

# tests in D_out  
diffs_per_class_D_out= study_perturbations(target_model,attacker_images[test_idx],attacker_labels[test_idx], r, d)
total_diffs_out = sum(diffs_per_class_D_out)/10
axes[1].set_title('D_out predicted label divergence');
axes[1].barh(list(range(10)), diffs_per_class_D_out, tick_label=[f'Class-{i}' for i in range(10)])

axes[2].set_title('Total predicted label divergence percentage');
axes[2].barh([1, 0], [total_diffs_in, total_diffs_out], tick_label=['In', 'Out'])

plt.setp(axes, xticks=range(0, 101, 10), xticklabels=[f'{i}%' for i in range(0, 101, 10)])

## Attack a perturbation trained model

- We apply augmentations to the dataset and re-train the target model.
- The attacker **does not** know our augmentation settings, so they train on a normal dataset without augmentations.
- We want to measure the quality of the attack when the target tries to defend against MIAs by adding perturbed copies of its data samples.

In [None]:
# We defend with the same rotations and translations that the attack model uses (worst case for the attacker)
rotates = create_rotates(r)
translates = create_translates(d)

X_train_aug = train_images
X_eval_aug = eval_images
y_train_aug = np.concatenate(tuple([train_labels] + [train_labels for rot in rotates] + [train_labels for tra in translates]))
y_eval_aug = np.concatenate(tuple([eval_labels] + [eval_labels for rot in rotates] + [eval_labels for tra in translates]))



for rot in rotates:
  aug_x = apply_augment(train_images, rot, 'r')
  X_train_aug = np.concatenate((X_train_aug,aug_x))
  aug_x = apply_augment(eval_images, rot, 'r')
  X_eval_aug = np.concatenate((X_eval_aug,aug_x))

for tra in translates:
  aug_x = apply_augment(train_images, tra, 'd')
  X_train_aug = np.concatenate((X_train_aug,aug_x))
  aug_x = apply_augment(eval_images, tra, 'd')
  X_eval_aug = np.concatenate((X_eval_aug ,aug_x))


In [None]:

with tf.device('/gpu:0'):
  X_train_aug = tf.convert_to_tensor(X_train_aug)
  y_train_aug = tf.convert_to_tensor(y_train_aug)
  X_eval_aug = tf.convert_to_tensor(X_eval_aug)
  y_eval_aug = tf.convert_to_tensor(y_eval_aug)
  target_model = f_target(X_train_aug, y_train_aug, X_eval_aug, y_eval_aug, epochs=10) 

The model is quite overfitted, so all that is left is to evaluate the attack model we created earlier against the newly trained, "defended" target model, whose training set now includes perturbed images.

In [None]:
D_in = prepare_batch(target_model, train_images[:1000], train_labels[:1000], True)
print("Testing with 'in' data only:")
res_in = evaluate_attack(attack_model_bundle, D_in[:, :-1], D_in[:, -1], 10)

D_out = prepare_batch(target_model, attacker_images[:1000], attacker_labels[:1000], False)
print("\nTesting with 'out' data only:")
res_out = evaluate_attack(attack_model_bundle, D_out[:, :-1], D_out[:, -1], 10)

print("\nTesting with all prev data: ")
res_all = evaluate_attack(attack_model_bundle, np.concatenate((D_out[:, :-1], D_in[:, :-1])), np.concatenate((D_out[:, -1], D_in[:, -1])), 10)

print(f"\nTotal attack accuracy: {np.mean(res_all)}")

### Conclusion

To decide whether the model has become more vulnerable, we must measure the label divergence percentage of the model that was retrained on augmentations.

In [None]:
# test it on the same data that we used for the model trained without augmentations
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# tests in D_in
diffs_per_class_D_in = study_perturbations(target_model, train_images[train_idx], train_labels[train_idx], r, d) 
total_diffs_in = sum(diffs_per_class_D_in)/10
axes[0].set_title('D_in predicted label divergence');
axes[0].barh(list(range(10)), diffs_per_class_D_in, tick_label=[f'Class-{i}' for i in range(10)])

# tests in D_out  
diffs_per_class_D_out= study_perturbations(target_model,attacker_images[test_idx],attacker_labels[test_idx], r, d)
total_diffs_out = sum(diffs_per_class_D_out)/10
axes[1].set_title('D_out predicted label divergence');
axes[1].barh(list(range(10)), diffs_per_class_D_out, tick_label=[f'Class-{i}' for i in range(10)])

axes[2].set_title('Total predicted label divergence percentage');
axes[2].barh([1, 0], [total_diffs_in, total_diffs_out], tick_label=['In', 'Out'])

plt.setp(axes, xticks=range(0, 101, 10), xticklabels=[f'{i}%' for i in range(0, 101, 10)])

We can see that the overall percentage of predicted label divergence has fallen, **but** the model now predicts the labels of perturbed instances of $D_{in}$ even more consistently than before. This means that the adjusted model is even more vulnerable. The next step is to repeat the whole procedure with a well-generalized model and tune the attack model for maximum accuracy.