# Membership Inference Example

A notebook showing how we _could_ create a fairly simple single-use pipeline for assessing models and data.

The notebook follows these steps:
1. Get data
1. Remove rows with NA and drop ID and group columns
1. Remove `val_data` from training. This will be used later for membership inference
1. Train the classifier (target is `outcome`)
1. Compute classifier AUC (just to check it has learnt something)
1. Create the membership inference (MI) dataset. This is a binary classification dataset. The _features_ are the predictive probabilities of the trained model. We combine the predictive probabilities for the _training_ data and the _validation_ data (held out earlier). Observations have class 1 if they came from the training set and class 0 if they came from the validation set.
1. Split the MI dataset into train and test portions
1. Train a classifier (I have used RF) on the train portion and assess AUC on the test portion

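If the AUC of this second classifier is high (well above 0.5, the value expected from random guessing), then the model's outputs alone are enough to tell whether a point belonged to the training set, i.e. the model leaks membership information.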
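
The steps above can be sketched end to end as a single function. This is a minimal sketch, not the notebook's own code: it uses synthetic data from `make_classification` as a stand-in for the project's data loader, and the function name `membership_inference_auc` and the `seed` parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def membership_inference_auc(X, y, seed=0):
    """Return (target model AUC, membership inference AUC) for features X, labels y."""
    # Hold out validation rows that the target model never sees.
    train_X, val_X, train_y, val_y = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Target model; bootstrap=False encourages overfitting, as in the notebook.
    target = RandomForestClassifier(bootstrap=False, random_state=seed)
    target.fit(train_X, train_y)
    model_auc = roc_auc_score(val_y, target.predict_proba(val_X)[:, 1])

    # MI dataset: features are predicted probabilities, label 1 = training row.
    mi_X = np.vstack((target.predict_proba(train_X), target.predict_proba(val_X)))
    mi_y = np.concatenate((np.ones(len(train_X), int), np.zeros(len(val_X), int)))

    # Train the attack classifier on one portion and score it on the other.
    mi_train_X, mi_test_X, mi_train_y, mi_test_y = train_test_split(
        mi_X, mi_y, test_size=0.2, stratify=mi_y, random_state=seed
    )
    attack = RandomForestClassifier(random_state=seed)
    attack.fit(mi_train_X, mi_train_y)
    mi_auc = roc_auc_score(mi_test_y, attack.predict_proba(mi_test_X)[:, 1])
    return model_auc, mi_auc


# Synthetic stand-in data (an assumption; the notebook uses its own loader).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
model_auc, mi_auc = membership_inference_auc(X_demo, y_demo)
print(f"Model AUC: {model_auc:.3f}, membership AUC: {mi_auc:.3f}")
```

Fixing `random_state` makes the run repeatable, which the notebook cells themselves do not do.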

In [1]:
import os
import sys
import logging
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
# add the root project folder to the path
ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath('')))

logging.basicConfig()
logger = logging.getLogger('MI')
logger.setLevel(logging.INFO)

sys.path.append(ROOT_DIR)
from data_preprocessing.data_interface import get_data_sklearn

## Load and pre-process data

In [2]:
X, y = get_data_sklearn('in-hospital-mortality')

## Train classifier

Use a random forest, making sure `outcome` is not among the features.

In [6]:
train_data_X, val_data_X, train_data_y, val_data_y = train_test_split(X, y, test_size=0.2)

In [7]:
rf = RandomForestClassifier(bootstrap=False)  # bootstrap=False will help the model overfit
rf.fit(train_data_X, train_data_y)

RandomForestClassifier(bootstrap=False)

## Assess performance on val data

In [8]:
preds = rf.predict_proba(val_data_X)
auc = roc_auc_score(val_data_y, preds[:, 1])
logger.info("Model AUC: %f", auc)

INFO:MI:Model AUC: 0.864242
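
The validation AUC shows the model has learnt something, but not how much it has overfit. One rough check, sketched here on synthetic data rather than the notebook's dataset, is to compare the AUC on the training rows with the AUC on the held-out rows. All variable names below are illustrative, and `flip_y=0.1` is an assumed noise level chosen only to make the gap visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10% label noise (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0
)
tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same model settings as the notebook's target classifier.
model = RandomForestClassifier(bootstrap=False, random_state=0)
model.fit(tr_X, tr_y)

# A large gap between the two scores is a sign of overfitting.
train_auc = roc_auc_score(tr_y, model.predict_proba(tr_X)[:, 1])
val_auc = roc_auc_score(va_y, model.predict_proba(va_X)[:, 1])
print(f"Train AUC: {train_auc:.3f}, validation AUC: {val_auc:.3f}")
```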


## Construct a dataset labelled by training-set membership

Stack the model's predicted probabilities for `train_data_X` and `val_data_X`. Create a target (1 if the example was in the training data, 0 otherwise)

In [10]:
miX = np.vstack(
    (
        rf.predict_proba(train_data_X),
        rf.predict_proba(val_data_X)
    )
)

miY = np.concatenate(
    (
        np.ones(len(train_data_X), int),
        np.zeros(len(val_data_X), int)
    )
)

Split the MI dataset into train and test portions

In [11]:
mi_train_x, mi_test_x, mi_train_y, mi_test_y = train_test_split(miX, miY, test_size=0.2, stratify=miY)

Train the membership inference classifier

In [12]:
mi_rf = RandomForestClassifier()
mi_rf.fit(mi_train_x, mi_train_y)

RandomForestClassifier()

In [13]:
pred_probs = mi_rf.predict_proba(mi_test_x)
mi_auc = roc_auc_score(mi_test_y, pred_probs[:, 1])
logger.info("Membership AUC = %f", mi_auc)

INFO:MI:Membership AUC = 1.000000
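
A membership AUC of exactly 1.0 is striking. One plausible explanation (an interpretation, not something the notebook verifies) is that with `bootstrap=False` every fully grown tree fits all of its training rows, so training rows receive probabilities of exactly 0 or 1, while held-out rows often receive fractional values. The sketch below checks this on synthetic data from `make_classification`, used as a stand-in for the notebook's dataset; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10% label noise (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0
)
tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = RandomForestClassifier(bootstrap=False, random_state=0)
model.fit(tr_X, tr_y)

# Confidence = probability assigned to the most likely class for each row.
train_conf = model.predict_proba(tr_X).max(axis=1)
val_conf = model.predict_proba(va_X).max(axis=1)

# Fraction of rows on which every tree in the forest agrees.
train_certain = float(np.mean(np.isclose(train_conf, 1.0)))
val_certain = float(np.mean(np.isclose(val_conf, 1.0)))
print(f"Fully confident rows: train {train_certain:.2f}, validation {val_certain:.2f}")
```

If the training fraction is near 1 and the validation fraction is clearly lower, the predicted probabilities alone separate most members from non-members, which is the signal the MI classifier exploits.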
