# Membership Inference Example

A notebook showing how we _could_ create a fairly simple single-use pipeline for assessing models and data.

The notebook follows these steps:
1. Get data
1. Remove rows with NA and drop ID and group columns
1. Remove `val_data` from training. This will be used later for membership inference
1. Train the classifier (target is `outcome`)
1. Compute classifier AUC (just to check it has learnt something)
1. Create the membership inference (MI) dataset. This is a binary dataset in which every example used for training has class 1 and every validation example has class 0
1. Split the MI dataset into train and test portions
1. Train a classifier (I have used a random forest) on the train portion and assess AUC on the test portion

If the AUC of this second classifier is high (well above the 0.5 expected from random guessing), then we are able to classify points as having belonged to the training set or not.
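
The steps above can be sketched end to end as a single function. This is a minimal illustration rather than the notebook's exact code: `membership_auc` is a hypothetical helper, the data comes from scikit-learn's `make_classification` as a synthetic stand-in for the real file, and the fixed `seed` is an assumption added so the run is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def membership_auc(X, y, seed=0):
    """Return (target-model AUC, membership-classifier AUC) for features X and labels y."""
    # Hold out half the rows; these become the "non-member" examples.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed
    )

    # Target model, checked on the held-out half.
    target = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    model_auc = roc_auc_score(y_val, target.predict_proba(X_val)[:, 1])

    # Membership dataset: 1 = row was used for training, 0 = row was held out.
    mi_X = np.vstack((X_train, X_val))
    mi_y = np.concatenate(
        (np.ones(len(X_train), dtype=int), np.zeros(len(X_val), dtype=int))
    )

    # Train the membership classifier on 80% and score it on the remaining 20%.
    mi_train_X, mi_test_X, mi_train_y, mi_test_y = train_test_split(
        mi_X, mi_y, test_size=0.2, stratify=mi_y, random_state=seed
    )
    attack = RandomForestClassifier(random_state=seed).fit(mi_train_X, mi_train_y)
    mi_auc = roc_auc_score(mi_test_y, attack.predict_proba(mi_test_X)[:, 1])
    return model_auc, mi_auc


# Synthetic stand-in for the real data (an assumption made for illustration only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
model_auc, mi_auc = membership_auc(X, y)
print(f"Model AUC: {model_auc:.3f}, membership AUC: {mi_auc:.3f}")
```

Wrapping the steps in one function makes it easy to repeat the assessment with different seeds or datasets, which is useful because a single split can give a noisy AUC.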

In [2]:
import os
import logging
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
# Root project folder: two levels above the current working directory (nothing is added to sys.path here)
ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath('')))

logging.basicConfig()
logger = logging.getLogger('MI')
logger.setLevel(logging.INFO)

In [4]:
# Assuming we'll keep data files in a (non-version tracked) folder called data
DATA_FOLDER = os.path.join(ROOT_DIR, 'data')

In [5]:
DATA_FILE_NAME = "data01.csv" # I didn't choose this name :-), it's what the data comes as!
DATA_URL = "https://datadryad.org/stash/downloads/file_stream/773992"

Check that the data file exists. If it doesn't, chuck out an error with instructions

In [6]:
data_path = os.path.join(DATA_FOLDER, DATA_FILE_NAME)
if not os.path.exists(data_path):
    logger.error("File not available, download from %s and save to GRAIMatter/data", DATA_URL)
    # Stop here: later cells need input_data, which would otherwise be undefined
    raise FileNotFoundError(data_path)
input_data = pd.read_csv(data_path)

## Data pre-processing

- Remove NA rows
- Remove irrelevant columns
- Split into training and validation 

In [8]:
clean_data = input_data.dropna(axis=0, how='any').drop(columns=["group", "ID"])

# Split into a set for training, and a held out set that can be used for MI assessment
train_data, val_data = train_test_split(clean_data, test_size=0.5, stratify=clean_data.outcome, shuffle=True)
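
To see what `stratify` does in the split above, here is a small sketch on toy data. The arrays and the `random_state` are assumptions for illustration, not the notebook's data. It checks that both halves keep the original class balance and share no rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positives (hypothetical, not the notebook's data).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, shuffle=True, random_state=0
)

# Positive rate in each half; stratification keeps it at the original 20%.
print(y_train.mean(), y_val.mean())

# Rows present in both halves; this should be empty because the split is disjoint.
overlap = set(X_train.ravel()) & set(X_val.ravel())
print(overlap)
```

The disjointness matters for the membership inference step: a row that appeared in both halves would carry both labels.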

## Train classifier

Use a random forest, making sure we remove `outcome` from the features.

In [9]:
rf = RandomForestClassifier() # setting bootstrap=False would encourage overfitting
rf.fit(train_data.drop(columns=["outcome"]), train_data.outcome)

RandomForestClassifier()

## Assess performance on val data

In [10]:
preds = rf.predict_proba(val_data.drop(columns=['outcome']))
auc = roc_auc_score(val_data.outcome, preds[:, 1])
logger.info("Model AUC: %f", auc)

INFO:MI:Model AUC: 0.821781
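
For intuition about the number reported above: for binary labels, the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way in plain Python. `pairwise_auc` is a hypothetical helper written for illustration, and the four labels and scores are made up.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# Pairs: (0.35, 0.1) ok, (0.35, 0.4) wrong, (0.8, 0.1) ok, (0.8, 0.4) ok
print(pairwise_auc(labels, scores))  # 3 of 4 pairs ordered correctly -> 0.75
```

An AUC of 0.5 therefore means the scores order positives and negatives no better than a coin flip, and 1.0 means every positive outranks every negative.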


## Construct a dataset labelled by training-set membership

Stack `train_data` and `val_data` (after dropping `outcome`). Create a target (1 if the example is in the training data, 0 otherwise).

In [11]:
miX = pd.concat(
    (
        train_data.drop(columns=["outcome"]),
        val_data.drop(columns=["outcome"])
    )
).values

# 1 for every training row, 0 for every validation row, in the same order as miX
miY = np.concatenate(
    (
        np.ones(len(train_data), dtype=int),
        np.zeros(len(val_data), dtype=int)
    )
)

Split the MI dataset into train and test portions

In [12]:
mi_train_x, mi_test_x, mi_train_y, mi_test_y = train_test_split(miX, miY, test_size=0.2, stratify=miY)

Train the membership inference classifier

In [15]:
mi_rf = RandomForestClassifier()
mi_rf.fit(mi_train_x, mi_train_y)

RandomForestClassifier()

In [16]:
pred_probs = mi_rf.predict_proba(mi_test_x)
mi_auc = roc_auc_score(mi_test_y, pred_probs[:, 1])
logger.info("Membership AUC = %f", mi_auc)

INFO:MI:Membership AUC = 0.417523
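
One way to judge whether a membership AUC counts as "high" is to compare it with the AUCs obtained when the membership labels are randomly shuffled, so that no real signal remains. The sketch below is an optional sanity check, not part of the original pipeline. `shuffled_label_aucs` is a hypothetical helper, and `demo_X` and `demo_y` are synthetic stand-ins for `miX` and `miY`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def shuffled_label_aucs(mi_X, mi_y, n_repeats=10, seed=0):
    """AUCs of a membership classifier trained on randomly permuted labels."""
    rng = np.random.default_rng(seed)
    aucs = []
    for i in range(n_repeats):
        # Permuting the labels breaks any link between features and membership.
        permuted = rng.permutation(mi_y)
        train_X, test_X, train_y, test_y = train_test_split(
            mi_X, permuted, test_size=0.2, stratify=permuted, random_state=i
        )
        clf = RandomForestClassifier(n_estimators=50, random_state=i)
        clf.fit(train_X, train_y)
        aucs.append(roc_auc_score(test_y, clf.predict_proba(test_X)[:, 1]))
    return np.array(aucs)


# Synthetic stand-in for miX / miY (assumed shape: 300 rows, 8 features).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(300, 8))
demo_y = np.repeat([1, 0], 150)

null_aucs = shuffled_label_aucs(demo_X, demo_y)
print(f"Chance-level AUC range: {null_aucs.min():.3f} to {null_aucs.max():.3f}")
```

If the observed membership AUC falls inside the range produced by shuffled labels, it is hard to distinguish from chance. Only values clearly above that range would suggest that membership can be inferred.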
