# Counterfactuals Training Data Extraction Experiment

In [1]:
import pandas as pd
import sklearn.ensemble as es
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
import random
import logging
import time
import dice_ml

In [2]:
%run experiment_setup.ipynb

This notebook tests whether training data extraction is possible with counterfactuals (CF) that are drawn from the training data. Training data extraction means that an attacker can recover the feature values of training samples without prior knowledge of them. The attacker has access only to the model's prediction function and the explanations.

This attack should be trivial because any counterfactual that is shown as an explanation was picked directly from the training data.

First we define the class that runs the experiment for the different variations. The attacker repeatedly queries the explainer with random input values. To do this, the attacker must know the minimum and maximum value of each continuous feature in the training data, and the possible categories of each categorical feature. Each returned counterfactual is recorded, and at the end we check what percentage of them are contained in the training data.
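
The final membership check can be sketched as follows. This is only a minimal illustration under stated assumptions: the actual check lives in `experiment_setup.ipynb`, which is not shown here, and the helper name `count_exact_matches` as well as the toy data frames are hypothetical.

```python
import pandas as pd

def count_exact_matches(extracted: pd.DataFrame, train: pd.DataFrame) -> int:
    """Count distinct extracted rows that equal a training row on every shared column.

    Hypothetical helper for illustration; not taken from experiment_setup.ipynb.
    """
    cols = [c for c in train.columns if c in extracted.columns]
    # Drop duplicates so that each distinct candidate is counted once.
    candidates = extracted[cols].drop_duplicates()
    reference = train[cols].drop_duplicates()
    # An inner join on all shared columns keeps only rows present in both frames.
    return len(candidates.merge(reference, on=cols, how="inner"))

# Toy data: only the first extracted row also occurs in the training data.
train = pd.DataFrame({"age": [39, 44, 64], "sys_bp": [106.0, 106.9, 120.0]})
extracted = pd.DataFrame({"age": [44, 64, 50], "sys_bp": [106.9, 121.0, 126.0]})

matches = count_exact_matches(extracted, train)
print(matches, "of", len(extracted), "extracted rows appear in the training data")
```

Because the comparison is exact, a counterfactual that differs from its training sample in a single feature value does not count as a match.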

In [3]:
class CounterfactualTDE(XaiPrivacyExperiment):
    def train_explainer(self, data_train, model):
        # train explainer on training data
        d = dice_ml.Data(dataframe=data_train, continuous_features=self.continuous_features,\
                         outcome_name=self.outcome_name)
        m = dice_ml.Model(model=model, backend="sklearn", model_type='classifier')
        
        # use method "kdtree" to get counterfactuals drawn from the training data
        return dice_ml.Dice(d, m, method="kdtree")
        
    @staticmethod
    def training_data_extraction_model_access(explainer, stop_after, feature_formats, rng, model):
        # Get all feature names
        feature_names = []
        
        for feature in feature_formats:
            feature_names.append(feature['name'])
        
        # Generate random samples as queries for the explainer.
        samples = []
        for i in range(stop_after):
            sample = {}
            for feature in feature_formats:
                if feature['isCont']:
                    # rng.integers excludes the upper bound, so feature['max'] itself is never drawn
                    sample[feature['name']] = rng.integers(feature['min'], feature['max'])
                else:
                    sample[feature['name']] = random.choice(feature['categories'])
            samples.append(sample)

        # DataFrame.append was removed in pandas 2.0, so build the frame from the list in one step
        samples_df = pd.DataFrame(samples, columns=feature_names)

        # Cast categorical features to string again because of DiCE peculiarities
        for feature in feature_formats:
            if not feature['isCont']:
                samples_df[feature['name']] = samples_df[feature['name']].astype(str)
            else:
                samples_df[feature['name']] = samples_df[feature['name']].astype(int)
                
        # Collect all extracted samples in this dataframe
        extracted_samples_df = pd.DataFrame(columns=feature_names)

        # Get one counterfactual for each random sample
        for index in range(len(samples_df)):
            # needs double brackets so that iloc returns a dataframe instead of a series
            row = samples_df.iloc[[index], :]

            logging.debug(f'Random sample to enter: {row.to_numpy()}')

            # there is an issue with dice where desired_class="opposite" does not calculate counterfactuals of opposite class 
            # for class 1: https://github.com/interpretml/DiCE/issues/215
            # this is why we need to manually set the desired class. This requires access to the model which would otherwise 
            # not be necessary.
            model_pred = model.predict(row)[0]
            logging.debug(f'Prediction by model: {model_pred}')

            # Get counterfactual as dataframe. See https://github.com/interpretml/DiCE/issues/174
            # Use sparse final cfs instead of just final_cfs_df because those are the counterfactuals that are also shown by 
            # default with the function visualize_as_dataframe()
            e1 = explainer.generate_counterfactuals(row, total_CFs=1, desired_class=int(1-model_pred))
            cf = e1.cf_examples_list[0].final_cfs_df_sparse
            logging.debug(f'Counterfactual: {cf.to_numpy()}')

            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            extracted_samples_df = pd.concat([extracted_samples_df, cf], ignore_index=True)
        
        return extracted_samples_df

# Dataset 1: Heart Disease

We now generate five counterfactuals for the first sample from the training data to demonstrate counterfactual explanations in general.

In [4]:
features = data_num.drop('heart_disease_label', axis=1)
labels = data_num['heart_disease_label']

# Train a random forest on training data.
model = es.RandomForestClassifier(random_state=0)
model = model.fit(features, labels)

# Train explainer
d = dice_ml.Data(dataframe=data_num, continuous_features=continuous_features_num, outcome_name=outcome_name_num)

m = dice_ml.Model(model=model, backend="sklearn", model_type='classifier')
# Generating counterfactuals from training data (kd-tree)
exp = dice_ml.Dice(d, m, method="kdtree")

In [5]:
e1 = exp.generate_counterfactuals(features[0:1], total_CFs=5, desired_class="opposite")
e1.visualize_as_dataframe(show_only_changes=True)

100%|██████████| 1/1 [00:02<00:00,  2.70s/it]

Query instance (original outcome : 0)

index,age,cigs_per_day,total_chol,sys_bp,dia_bp,bmi,heart_rate,glucose,heart_disease_label
0,39.0,0.0,195.0,106.0,70.0,26.97,80.0,77.0,0.0

Diverse Counterfactual set (new outcome: 1.0)

index,age,cigs_per_day,total_chol,sys_bp,dia_bp,bmi,heart_rate,glucose,heart_disease_label
4188,44.0,-,180.0,106.9,-,23.98,92.0,67.0,1.0
1358,64.0,-,210.0,120.0,70.1,24.77,-,-,1.0
2633,43.0,-,196.0,121.5,86.5,20.82,92.0,-,-
1345,49.0,-,211.0,104.0,66.5,24.17,75.0,87.0,1.0
1253,50.0,-,196.0,126.0,88.0,26.73,-,77.1,1.0


We can see that the counterfactuals are similar to the query sample and that most of them have a flipped prediction. These are the two general properties of counterfactual explanations.
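
Both properties can be checked numerically. The sketch below uses the first four features of the query instance and of counterfactuals 4188 and 1358 from the output above, with `-` read as "unchanged". The choice of an unscaled L1 distance is an assumption made for illustration; it is not necessarily the distance that DiCE uses internally.

```python
import pandas as pd

# First four features of the query instance shown above.
query = pd.Series({"age": 39.0, "cigs_per_day": 0.0, "total_chol": 195.0, "sys_bp": 106.0})

# The same features for counterfactuals 4188 and 1358 ("-" means unchanged).
cfs = pd.DataFrame(
    {
        "age": [44.0, 64.0],
        "cigs_per_day": [0.0, 0.0],
        "total_chol": [180.0, 210.0],
        "sys_bp": [106.9, 120.0],
    },
    index=[4188, 1358],
)

# True wherever a counterfactual differs from the query in that feature.
changed = cfs.ne(query, axis=1)
n_changed = changed.sum(axis=1)

# Unscaled L1 distance between each counterfactual and the query.
l1_distance = (cfs - query).abs().sum(axis=1)

print(pd.DataFrame({"n_changed": n_changed, "l1_distance": l1_distance}))
```

A feature that is left unchanged, such as `cigs_per_day` here, contributes nothing to the distance, which is what makes sparse counterfactuals easier to read.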

We will now do a small proof of concept of the experiment with logging enabled to demonstrate how it works.

In [6]:
logging.root.setLevel(logging.DEBUG)

experiment_num = CounterfactualTDE(data_num, continuous_features_num, outcome_name_num, 0)
experiment_num.training_data_extraction_experiment(10, es.RandomForestClassifier(random_state=0), model_access=True)

logging.root.setLevel(logging.ERROR)

DEBUG:root:Random sample to enter: [[ 64  44 410 140  76  16  51  45]]
DEBUG:root:Prediction by model: 0.0
100%|██████████| 1/1 [00:00<00:00,  6.66it/s]
DEBUG:root:Counterfactual: [[ 64.     0.   372.   169.    85.    26.01  75.    79.     1.  ]]
DEBUG:root:Random sample to enter: [[ 38  56 491 276  95  39 140 298]]
DEBUG:root:Prediction by model: 1.0
100%|██████████| 1/1 [00:00<00:00,  6.05it/s]
DEBUG:root:Counterfactual: [[ 54.     0.   326.   187.    95.    29.94  67.   235.     0.  ]]
DEBUG:root:Random sample to enter: [[ 56  38 439 281  74  48 110  40]]
DEBUG:root:Prediction by model: 1.0
100%|██████████| 1/1 [00:00<00:00,  5.32it/s]
DEBUG:root:Counterfactual: [[ 49.    30.   350.   174.    90.    18.44 110.    78.     0.  ]]
DEBUG:root:Random sample to enter: [[ 46  60 436  90 119  44 127 102]]
DEBUG:root:Prediction by model: 1.0
100%|██████████| 1/1 [00:00<00:00,  6.54it/s]
DEBUG:root:Counterfactual: [[ 42.    20.   410.   116.    83.    21.68  90.    83.     0.  ]]
DEBUG:root:R

Number of extracted samples: 9
Number of accurate extracted samples: 7
Precision: 0.7777777777777778, recall: 0.7


We can see that some counterfactuals can be found in the training data, while others cannot. This is due to the sparsity step of the counterfactual explainer. Not every counterfactual retains all the feature values of the training sample it was drawn from: where a feature value of the counterfactual lies close to the corresponding value of the query instance, it is replaced by the query instance's value. The resulting counterfactual is then a mixture of a training sample and the query, so it may not appear in the training data.
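
The effect of this replacement can be illustrated with a toy example. This is a deliberately simplified sketch and not DiCE's actual post-processing, which is more involved; the function `snap_to_query` and the per-feature tolerances are hypothetical.

```python
import numpy as np

def snap_to_query(query, counterfactual, tolerance):
    """Replace counterfactual values within `tolerance` of the query by the query's values.

    Hypothetical simplification of a sparsity step, for illustration only.
    """
    query = np.asarray(query, dtype=float)
    counterfactual = np.asarray(counterfactual, dtype=float)
    close = np.abs(counterfactual - query) <= tolerance
    return np.where(close, query, counterfactual)

train_row = np.array([49.0, 211.0, 104.0])   # training sample picked as counterfactual
query = np.array([39.0, 195.0, 106.0])       # the attacker's random query
tolerance = np.array([1.0, 5.0, 3.0])        # assumed per-feature thresholds

sparse_cf = snap_to_query(query, train_row, tolerance)
print(sparse_cf)                              # third value is taken from the query
print(np.array_equal(sparse_cf, train_row))   # no longer identical to the training row
```

Only the third feature is within its tolerance, so only that value is overwritten. The attacker still learns the other two training values, but an exact-match check against the training data fails for this row.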

We can now begin with the actual experiments.

In [7]:
results_ = {'dataset': [], 'model': [], 'precision': [], 'recall': []}

results = pd.DataFrame(data = results_)

In [8]:
print("features: continuous, model: decision tree.")

start_time = time.time()

experiment_num = CounterfactualTDE(data_num, continuous_features_num, outcome_name_num, 0)
precision, recall = experiment_num.training_data_extraction_experiment(100,\
                                    DecisionTreeClassifier(random_state=0), model_access=True)

results.loc[len(results.index)] = ['continuous', 'decision tree', precision, recall]

print("--- %s seconds ---" % (time.time() - start_time))

features: continuous, model: decision tree.



Number of extracted samples: 79
Number of accurate extracted samples: 37
Precision: 0.46835443037974683, recall: 0.37
--- 9.532217264175415 seconds ---


In [9]:
print("features: continuous, model: random forest.")

start_time = time.time()

experiment_num = CounterfactualTDE(data_num, continuous_features_num, outcome_name_num, 0)
precision, recall = experiment_num.training_data_extraction_experiment(100,\
                                    es.RandomForestClassifier(random_state=0), model_access=True)

results.loc[len(results.index)] = ['continuous', 'random forest', precision, recall]

print("--- %s seconds ---" % (time.time() - start_time))

features: continuous, model: random forest.



Number of extracted samples: 75
Number of accurate extracted samples: 34
Precision: 0.4533333333333333, recall: 0.34
--- 37.465229749679565 seconds ---





In [10]:
print("features: continuous, model: neural network.")

start_time = time.time()

experiment_num = CounterfactualTDE(data_num, continuous_features_num, outcome_name_num, 0)
precision, recall = experiment_num.training_data_extraction_experiment(100,\
                                    MLPClassifier(hidden_layer_sizes=(32, 32, 32), random_state=0), model_access=True)

results.loc[len(results.index)] = ['continuous', 'neural network', precision, recall]

print("--- %s seconds ---" % (time.time() - start_time))

features: continuous, model: neural network.



Number of extracted samples: 70
Number of accurate extracted samples: 24
Precision: 0.34285714285714286, recall: 0.24
--- 10.378497123718262 seconds ---





# Dataset 2: Census Income (categorical)

In [11]:
# DiCE needs categorical features to be strings:
categorical_features = data_cat.columns.difference([outcome_name_cat])

for col in categorical_features:
    data_cat[col] = data_cat[col].astype(str)

In [12]:
print("features: categorical, model: decision tree.")

start_time = time.time()

experiment_cat = CounterfactualTDE(data_cat, continuous_features_cat, outcome_name_cat, 0)
precision, recall = experiment_cat.training_data_extraction_experiment(100,\
                                    DecisionTreeClassifier(random_state=0), model_access=True)

results.loc[len(results.index)] = ['categorical', 'decision tree', precision, recall]

print("--- %s seconds ---" % (time.time() - start_time))

features: categorical, model: decision tree.



Number of extracted samples: 2
Number of accurate extracted samples: 2
Precision: 1.0, recall: 0.02
--- 14.234471559524536 seconds ---





In [13]:
print("features: categorical, model: random forest.")

start_time = time.time()

experiment_cat = CounterfactualTDE(data_cat, continuous_features_cat, outcome_name_cat, 0)
precision, recall = experiment_cat.training_data_extraction_experiment(100,\
                                    es.RandomForestClassifier(random_state=0), model_access=True)

results.loc[len(results.index)] = ['categorical', 'random forest', precision, recall]

print("--- %s seconds ---" % (time.time() - start_time))

features: categorical, model: random forest.



Number of extracted samples: 2
Number of accurate extracted samples: 2
Precision: 1.0, recall: 0.02
--- 36.449681997299194 seconds ---





In [14]:
print("features: categorical, model: neural network.")

start_time = time.time()

experiment_cat = CounterfactualTDE(data_cat, continuous_features_cat, outcome_name_cat, 0)
precision, recall = experiment_cat.training_data_extraction_experiment(100,\
                                    MLPClassifier(hidden_layer_sizes=(32, 32, 32), random_state=0), model_access=True)

results.loc[len(results.index)] = ['categorical', 'neural network', precision, recall]

print("--- %s seconds ---" % (time.time() - start_time))

features: categorical, model: neural network.



Number of extracted samples: 2
Number of accurate extracted samples: 2
Precision: 1.0, recall: 0.02
--- 25.1548433303833 seconds ---





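# Results

"Precision" is the share of extracted counterfactuals that matched a sample in the training data exactly. In the runs above, "recall" equals the number of accurate extracted samples divided by the number of queries, for example 37 / 100 = 0.37 for the decision tree on the continuous dataset.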

In [15]:
results

index,dataset,model,precision,recall
0,continuous,decision tree,0.468354,0.37
1,continuous,random forest,0.453333,0.34
2,continuous,neural network,0.342857,0.24
3,categorical,decision tree,1.0,0.02
4,categorical,random forest,1.0,0.02
5,categorical,neural network,1.0,0.02
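
The table values can be reproduced from the counts printed by each run. The sketch below assumes that precision is accurate / extracted and that recall is accurate / queries. Both assumptions are consistent with every run in this notebook, but the exact definitions live in `experiment_setup.ipynb`, which is not shown here. The function name `precision_recall` is hypothetical.

```python
def precision_recall(n_extracted, n_accurate, n_queries):
    """Compute precision and recall from the counts printed by a run.

    Assumed definitions (not confirmed by the setup notebook):
    precision = accurate / extracted, recall = accurate / queries.
    """
    precision = n_accurate / n_extracted
    recall = n_accurate / n_queries
    return precision, recall

# (extracted, accurate, queries) as printed by the runs above.
runs = {
    ("continuous", "decision tree"): (79, 37, 100),
    ("continuous", "random forest"): (75, 34, 100),
    ("continuous", "neural network"): (70, 24, 100),
    ("categorical", "decision tree"): (2, 2, 100),
}

for (dataset, model), counts in runs.items():
    precision, recall = precision_recall(*counts)
    print(f"{dataset:12s} {model:15s} precision={precision:.6f} recall={recall:.2f}")
```

On the categorical dataset only two distinct counterfactuals were extracted from 100 queries, which explains the combination of perfect precision and very low recall.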


In [16]:
results.to_csv('results/cf-training-data-extraction-results.csv', index=False, na_rep='NaN')