# Membership Inference Attacks (Problems)

IMPORTANT: make a copy of this notebook before proceeding!

This is a brief tutorial on membership inference attacks (MIAs) using the nursery dataset (available on Kaggle: https://www.kaggle.com/datasets/nimapourmoradi/nursery).

We have already preprocessed the dataset such that all categorical features are one-hot encoded, and the data was scaled using sklearn's StandardScaler.
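
For readers curious what that preprocessing involves, here is a minimal NumPy sketch on a single made-up categorical column. The toy values and variable names are assumptions for illustration and are not the actual nursery pipeline. The standardization step mirrors what `StandardScaler` computes: subtract the column mean and divide by the population standard deviation.

```python
import numpy as np

# Toy categorical column (hypothetical values, not the real nursery data).
parents = np.array(["usual", "pretentious", "great_pret", "usual"])

# One-hot encode: map each category to an integer code, then index an identity matrix.
categories, codes = np.unique(parents, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# Standardize each column to zero mean and unit variance.
# np.std defaults to the population standard deviation (ddof=0), as StandardScaler does.
scaled = (one_hot - one_hot.mean(axis=0)) / one_hot.std(axis=0)

print(categories)
print(one_hot)
print(scaled.round(3))
```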

Run these cells to set up and train the model (don't worry about what's going on):

In [None]:
# SETUP (run first)
!pip install adversarial-robustness-toolbox

import os
import sys

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from art.utils import load_nursery
from art.estimators.classification.scikitlearn import ScikitlearnRandomForestClassifier

sys.path.insert(0, os.path.abspath('..'))
(x_train, y_train), (x_test, y_test), _, _ = load_nursery(test_set=0.5)



Since we have a relatively simple categorical dataset, we'll use a random forest model rather than a larger neural network for this demonstration. Here's how you might train a random forest classifier.

In [None]:
model = RandomForestClassifier()
model.fit(x_train, y_train)

art_classifier = ScikitlearnRandomForestClassifier(model)

print('Base model accuracy: ', model.score(x_test, y_test))

Base model accuracy:  0.9745291756715035


# Attacks

## Attempt 1: Basic Attack Model
A membership inference attack (MIA) attempts to determine whether a particular data point was used to train an ML model.

Question: what information can a membership inference attack reveal? Why are they dangerous?

<details>
  <summary>Click to see answer</summary>
  <p>
  Medical: if an attacker can determine that someone's data was used to train a biological foundation model, they might infer that the person has a relevant medical condition.
  
  Finance: for a model trained on financial data, membership might reveal someone's financial status.
  
  Recommendation systems: membership might reveal personal preferences.
  </p>
</details>

### How do MIAs Work?

MIAs exploit differences in how ML models behave on training data versus data they haven't seen before. An ideal model that learns exactly what the data looks like (e.g. the true differences between cats and dogs) would treat training and test data the same way. In reality, a model may memorize specific patterns from its training data: it might be much more confident that a picture of a cat it has seen before is a cat than it is when shown a picture of a new cat.
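
To make this concrete, here is a small sketch on synthetic data (not the nursery dataset). The data, the 20% label noise, and names such as `toy_model` and `true_class_confidence` are assumptions chosen for illustration. It trains a random forest and compares the average probability assigned to the true label on data the model has seen versus data it has not.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic binary task with 20% label noise, so the model has something to memorize.
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(400) < 0.2
y[flip] = 1 - y[flip]

X_seen, y_seen = X[:200], y[:200]      # used for training ("members")
X_unseen, y_unseen = X[200:], y[200:]  # held out ("non-members")

toy_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_seen, y_seen)

def true_class_confidence(model, X, y):
    """Probability the model assigns to each sample's true label."""
    probs = model.predict_proba(X)
    return probs[np.arange(len(y)), y]

conf_seen = true_class_confidence(toy_model, X_seen, y_seen).mean()
conf_unseen = true_class_confidence(toy_model, X_unseen, y_unseen).mean()
print(f"Mean confidence on seen data:   {conf_seen:.3f}")
print(f"Mean confidence on unseen data: {conf_unseen:.3f}")
```

The gap between the two averages is the kind of signal a membership inference attack tries to exploit.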

We'll implement an attack that uses these differences to build a binary classifier that predicts whether a data point was in the training set.

### Training Initial Model

First, let's train a random forest classifier (using sklearn) on our dataset. This is our **target model**, the one we want to attack. It represents a deployed ML model that an attacker might try to exploit.

In [None]:
# TODO: Train the target model
###############################################################################
# 1. Load the nursery dataset (train and test)
# 2. Create and train a RandomForestClassifier
# 3. Evaluate its accuracy on the test set
###############################################################################
(x_train, y_train), (x_test, y_test), min_, max_ = load_nursery(test_set=0.5)

target_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the target model
target_model.fit(x_train, y_train)

# Evaluate base accuracy of the target model (not the earlier `model`)
base_acc = target_model.score(x_test, y_test)
print(f"Target Model Test Accuracy: {base_acc:.4f}")

Target Model Test Accuracy: 0.9745


How accurate can you get? I got $97.3$%.

### The Attack

The key to a successful MIA is choosing good features that distinguish between members and non-members. Members are data points we know are in the model's training set, and non-members are data points the model has never seen before.

We'll compute two common features:
1. Predicted probability of the true class
2. Negative log-likelihood of the true class

Features like the negative log-likelihood (NLL) help predict membership because models tend to assign higher probabilities, and therefore lower NLL, to training examples they have seen before.
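
As a quick worked example, the negative log-likelihood of a sample is $-\log p$, where $p$ is the probability the model assigns to the sample's true class. The sketch below uses made-up probabilities (an assumption for illustration) and shows why clipping matters: a probability of exactly zero would otherwise give an infinite value.

```python
import numpy as np

# Hypothetical true-class probabilities, chosen only to illustrate the formula.
p_true = np.array([0.99, 0.60, 0.0])

# Clip away exact zeros so the logarithm stays finite.
nll = -np.log(np.clip(p_true, 1e-10, 1.0))

for p, loss in zip(p_true, nll):
    print(f"p = {p:.2f} -> NLL = {loss:.3f}")
```

A confident, correct prediction gives an NLL close to zero, while a low true-class probability gives a large NLL.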

In [None]:
# TODO: Implement the attack features
def compute_attack_features(model, x, y_true):
    """
    TODO: Implement a function that computes features for the attack model.

    Returns a feature matrix for the attack model, each row is of the format:
      [predicted_prob_of_correct_class, NLL]

    Hint: Use model.predict_proba() to get class probabilities
    """
    # Your code here:
    probs = model.predict_proba(x)

    # Probability assigned to the true class of each sample
    # (assumes y_true holds integer labels matching the columns of probs)
    p_correct = probs[np.arange(len(probs)), y_true]

    # Clip the probabilities before taking the log for numerical stability
    nll = -np.log(np.clip(p_correct, 1e-10, 1.0))

    # Return combined features, one row per sample: [p_correct, nll]
    return np.column_stack([p_correct, nll])

Now, we'll create the dataset for training our attack model. Our dataset should consist of:

1. "Member" examples from the train set
2. "Non-member" examples (i.e. from the test set)
3. Attack features computed with `compute_attack_features` (defined above)
4. Labels indicating membership (1 or 0)

In [None]:
# TODO: Prepare the attack dataset
###############################################################################
# TODO:
# 1. Select a portion of training data as "members"
# 2. Select a portion of test data as "non-members"
# 3. Compute attack features for both sets
# 4. Create appropriate labels (1 for members, 0 for non-members)
# 5. Combine into final training set
###############################################################################
attack_train_ratio = 0.5
member_size = int(len(x_train)*attack_train_ratio)
non_member_size = int(len(x_test) * attack_train_ratio)

# Select member samples
x_member = x_train[:member_size]
y_member = y_train[:member_size]

# Select non-member samples
x_non_member = x_test[:non_member_size]
y_non_member = y_test[:non_member_size]

#Features computed
X_member_features = compute_attack_features(target_model, x_member, y_member)
X_non_member_features = compute_attack_features(target_model, x_non_member, y_non_member)

#Labels indicating membership
y_member_attack = np.ones(member_size, dtype=int)
y_non_member_attack = np.zeros(non_member_size, dtype=int)

#Combine into final training set
X_attack_train = np.concatenate([X_member_features, X_non_member_features], axis=0)
y_attack_train = np.concatenate([y_member_attack, y_non_member_attack], axis=0)

Next, we train our attack model. We'll use another random forest classifier, but feel free to try any other binary classifier and see how well you can do!
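
As one example of a different binary classifier, here is a sketch of a simple threshold attack: it predicts "member" whenever the true-class confidence is at or above a cutoff, and picks the cutoff that works best on the attack training data. The function name `fit_confidence_threshold` and the toy numbers are assumptions for illustration. This is a swapped-in alternative, not the random forest attack used in this notebook.

```python
import numpy as np

def fit_confidence_threshold(confidences, is_member):
    """Pick the confidence cutoff that best separates members (1) from non-members (0)."""
    best_threshold, best_accuracy = 0.0, 0.0
    for threshold in np.unique(confidences):
        predicted = (confidences >= threshold).astype(int)
        accuracy = np.mean(predicted == is_member)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy

# Made-up confidences for illustration: members tend to score higher.
toy_conf = np.array([0.99, 0.97, 0.95, 0.80, 0.90, 0.70, 0.60, 0.96])
toy_membership = np.array([1, 1, 1, 1, 0, 0, 0, 0])

threshold, accuracy = fit_confidence_threshold(toy_conf, toy_membership)
print(f"Threshold: {threshold:.2f}, accuracy on toy data: {accuracy:.3f}")
```

In this notebook, the first column of `X_attack_train` (the predicted probability of the true class) would play the role of `toy_conf`, and `y_attack_train` the role of `toy_membership`.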

In [None]:
# TODO: Train the attack model
###############################################################################
# 1. Create a binary classifier for the attack
# 2. Train it on the attack features
###############################################################################
attack_model = RandomForestClassifier(n_estimators=100, random_state=42)
# TODO: train the above
attack_model.fit(X_attack_train, y_attack_train)

Finally, we can evaluate our attack!

In [None]:
# Problem: Evaluate the attack
###############################################################################
# TODO:
# 1. Prepare a test set from remaining data
# 2. Compute attack features for test set
# 3. Make predictions with attack model
# 4. Calculate and report attack accuracy
###############################################################################

# Get remaining members/nonmembers
x_member_test = x_train[member_size:]
y_member_test = y_train[member_size:]
x_non_member_test = x_test[non_member_size:]
y_non_member_test = y_test[non_member_size:]

# Compute features
X_member_test_feats = compute_attack_features(target_model, x_member_test, y_member_test)
X_non_member_test_feats = compute_attack_features(target_model, x_non_member_test, y_non_member_test)

# Create labels (binary classification)
y_member_test_attack = np.ones(len(X_member_test_feats), dtype=int)
y_non_member_test_attack = np.zeros(len(X_non_member_test_feats), dtype=int)

# Combine test sets
X_attack_test = np.concatenate([X_member_test_feats, X_non_member_test_feats], axis=0)
y_attack_test = np.concatenate([y_member_test_attack, y_non_member_test_attack], axis=0)

# Evaluate attack
y_attack_pred = attack_model.predict(X_attack_test)
attack_accuracy = np.mean(y_attack_pred == y_attack_test)
print(f"Attack Model Accuracy: {attack_accuracy:.4f}")

Attack Model Accuracy: 0.7975


If your accuracy is above $0.50$ (what random guessing would achieve on an evenly balanced mix of members and non-members), this attack is successful!
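
A single accuracy number can hide how the attack behaves on each group. The sketch below (the function name `attack_report` and the toy arrays are assumptions for illustration) splits the result into a true positive rate on members and a false positive rate on non-members. It also reports the accuracy of always guessing the majority group, which is the baseline to beat when the two groups are not the same size.

```python
import numpy as np

def attack_report(predicted, actual):
    """Summarize a membership attack: accuracy, true/false positive rates, majority baseline."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    accuracy = np.mean(predicted == actual)
    tpr = np.mean(predicted[actual == 1] == 1)  # members correctly flagged
    fpr = np.mean(predicted[actual == 0] == 1)  # non-members wrongly flagged
    majority_baseline = max(np.mean(actual), 1 - np.mean(actual))
    return accuracy, tpr, fpr, majority_baseline

# Made-up predictions for illustration only.
toy_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
toy_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_acc, toy_tpr, toy_fpr, toy_baseline = attack_report(toy_pred, toy_true)
print(f"accuracy={toy_acc:.3f} tpr={toy_tpr:.3f} fpr={toy_fpr:.3f} baseline={toy_baseline:.3f}")
```

In this notebook, the same function could be called as `attack_report(y_attack_pred, y_attack_test)`.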

# ART Attacks
Below is the code to run some more sophisticated MIAs using the Adversarial Robustness Toolbox (ART)! Feel free to play around with them.

## Attack
### Rule-based attack
The rule-based attack uses a simple rule to determine membership in the training data: if the model's prediction for a sample is correct, the sample is classified as a member; otherwise, it is classified as a non-member.
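
The rule itself fits in one line. The sketch below is a standalone illustration with made-up labels, not ART's actual implementation, and the function name `rule_based_membership` is hypothetical.

```python
import numpy as np

def rule_based_membership(predicted_labels, true_labels):
    """Flag a sample as a member (1) when the model's prediction matches its true label."""
    return (np.asarray(predicted_labels) == np.asarray(true_labels)).astype(int)

# Made-up labels for illustration only.
toy_predictions = np.array([0, 2, 1, 3, 3])
toy_labels = np.array([0, 2, 2, 3, 1])
print(rule_based_membership(toy_predictions, toy_labels))
```

Because the rule depends only on correctness, a model that is highly accurate on unseen data will cause most non-members to be flagged as members too.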

In [None]:
import numpy as np
from art.attacks.inference.membership_inference import MembershipInferenceBlackBoxRuleBased

attack = MembershipInferenceBlackBoxRuleBased(art_classifier)

# infer membership for training samples (members) and test samples (non-members)
inferred_train = attack.infer(x_train, y_train)
inferred_test = attack.infer(x_test, y_test)

# check accuracy
train_acc = np.sum(inferred_train) / len(inferred_train)
test_acc = 1 - (np.sum(inferred_test) / len(inferred_test))
acc = (train_acc * len(inferred_train) + test_acc * len(inferred_test)) / (len(inferred_train) + len(inferred_test))
# print(f"Members Accuracy: {train_acc:.4f}")
# print(f"Non Members Accuracy {test_acc:.4f}")
print(f"Attack Accuracy {acc:.4f}")

Attack Accuracy 0.5127


This means that, on average, membership status is inferred correctly for about 51% of the data.
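
The accuracy above is a size-weighted average of the accuracy on members and the accuracy on non-members. The sketch below, using made-up attack outputs (an assumption for illustration), checks that this weighted average equals the fraction of all predictions that match the ground-truth membership labels.

```python
import numpy as np

# Made-up attack outputs for illustration: 1 = "member", 0 = "non-member".
toy_inferred_train = np.array([1, 1, 1, 0])        # all of these are true members
toy_inferred_test = np.array([1, 0, 0, 1, 0, 1])   # all of these are true non-members

member_acc = np.mean(toy_inferred_train == 1)
non_member_acc = np.mean(toy_inferred_test == 0)
n_members, n_non_members = len(toy_inferred_train), len(toy_inferred_test)

weighted_acc = (member_acc * n_members + non_member_acc * n_non_members) / (n_members + n_non_members)

# Equivalent: compare all predictions against all ground-truth labels directly.
all_pred = np.concatenate([toy_inferred_train, toy_inferred_test])
all_true = np.concatenate([np.ones(n_members), np.zeros(n_non_members)])
direct_acc = np.mean(all_pred == all_true)

print(weighted_acc, direct_acc)
```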

In [None]:
def calc_precision_recall(predicted, actual, positive_value=1):
    score = 0  # both predicted and actual are positive
    num_positive_predicted = 0  # predicted positive
    num_positive_actual = 0  # actual positive
    for i in range(len(predicted)):
        if predicted[i] == positive_value:
            num_positive_predicted += 1
        if actual[i] == positive_value:
            num_positive_actual += 1
        if predicted[i] == actual[i]:
            if predicted[i] == positive_value:
                score += 1

    if num_positive_predicted == 0:
        precision = 1
    else:
        precision = score / num_positive_predicted  # the fraction of predicted “Yes” responses that are correct
    if num_positive_actual == 0:
        recall = 1
    else:
        recall = score / num_positive_actual  # the fraction of “Yes” responses that are predicted correctly

    return precision, recall

# rule-based
print(calc_precision_recall(np.concatenate((inferred_train, inferred_test)),
                            np.concatenate((np.ones(len(inferred_train)), np.zeros(len(inferred_test))))))

(0.506449847549058, 1.0)
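
A recall of 1.0 combined with a precision near 0.5 is what you would expect from an attack that labels almost every sample as a member. The sketch below is a vectorized alternative to `calc_precision_recall` (the name `precision_recall` and the toy arrays are assumptions for illustration), applied to an attack that calls every sample a member.

```python
import numpy as np

def precision_recall(predicted, actual, positive_value=1):
    """Vectorized precision and recall; returns 1 when a denominator would be zero."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    predicted_positive = predicted == positive_value
    actual_positive = actual == positive_value
    true_positive = np.sum(predicted_positive & actual_positive)
    precision = true_positive / predicted_positive.sum() if predicted_positive.any() else 1
    recall = true_positive / actual_positive.sum() if actual_positive.any() else 1
    return precision, recall

# Made-up example: an attack that calls every sample a member.
toy_all_member_pred = np.ones(10, dtype=int)
toy_half_member_truth = np.array([1] * 5 + [0] * 5)
print(precision_recall(toy_all_member_pred, toy_half_member_truth))
```

With half of the samples being true members, flagging everything yields perfect recall but only 50% precision.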


### Black-box attack
The black-box attack trains an additional classifier (called the attack model) to predict the membership status of a sample. Depending on the type of model and the provided configuration, the attack model takes either probabilities/logits or losses as input.

#### Train attack model

In [None]:
from art.attacks.inference.membership_inference import MembershipInferenceBlackBox

attack_train_ratio = 0.5
attack_train_size = int(len(x_train) * attack_train_ratio)
attack_test_size = int(len(x_test) * attack_train_ratio)

bb_attack = MembershipInferenceBlackBox(art_classifier)

# train attack model
bb_attack.fit(x_train[:attack_train_size], y_train[:attack_train_size],
              x_test[:attack_test_size], y_test[:attack_test_size])

#### Infer membership and check accuracy

In [None]:
# get inferred values
inferred_train_bb = bb_attack.infer(x_train[attack_train_size:], y_train[attack_train_size:])
inferred_test_bb = bb_attack.infer(x_test[attack_test_size:], y_test[attack_test_size:])
# check accuracy
train_acc = np.sum(inferred_train_bb) / len(inferred_train_bb)
test_acc = 1 - (np.sum(inferred_test_bb) / len(inferred_test_bb))
acc = (train_acc * len(inferred_train_bb) + test_acc * len(inferred_test_bb)) / (len(inferred_train_bb) + len(inferred_test_bb))
print(f"Attack Accuracy {acc:.4f}")

This attack achieves much better results than the rule-based attack.

In [None]:
# black-box
print(calc_precision_recall(np.concatenate((inferred_train_bb, inferred_test_bb)),
                            np.concatenate((np.ones(len(inferred_train_bb)), np.zeros(len(inferred_test_bb))))))