# Example 1 : Classification with Model Agnostic CF Method

Adult Income Dataset 

Steps:

0. Install packages, load data
1. Convert Training data to DiCE data
2. Instantiate pre-trained model (Black Box Model) 
3. Combine Model and Data 
4. Specify CF Method (model agnostic method only)
5. Specify Constraints 
6. Define Feature Importance


Goal: Classify whether an adult's income is above or below $50K

Model: Sklearn Random Forest Classifier

CF Method: independent random sampling 

Illustration: Getting started + implementing different constraints (Baseline)

# Step 0: Installing the Packages

In [15]:
!pip install dice-ml
!pip install RISE
!pip install tensorflow 
import tensorflow as tf
import numpy as np
import timeit
import random
import dice_ml
from numpy.random import seed
from dice_ml.utils import helpers  # helper functions
from dice_ml import Dice
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier



In [16]:
%load_ext autoreload
%autoreload 2

# Step 0: Loading Dataset

In [25]:
dataset = helpers.load_adult_income_dataset()
adult_info = helpers.get_adult_data_info()
adult_info

{'age': 'age',
 'workclass': 'type of industry (Government, Other/Unknown, Private, Self-Employed)',
 'education': 'education level (Assoc, Bachelors, Doctorate, HS-grad, Masters, Prof-school, School, Some-college)',
 'marital_status': 'marital status (Divorced, Married, Separated, Single, Widowed)',
 'occupation': 'occupation (Blue-Collar, Other/Unknown, Professional, Sales, Service, White-Collar)',
 'race': 'white or other race?',
 'gender': 'male or female?',
 'hours_per_week': 'total work hours per week',
 'income': '0 (<=50K) vs 1 (>50K)'}

# Step 1: Convert Training data to DiCE data (+ Splitting Dataset)

In [26]:
target = dataset["income"]
train_dataset, test_dataset, y_train, y_test = train_test_split(dataset,
                                                                target,
                                                                test_size=0.2,
                                                                random_state=0,
                                                                stratify=target)
x_train = train_dataset.drop('income', axis=1)
x_test = test_dataset.drop('income', axis=1)

In [None]:
d = dice_ml.Data(dataframe=train_dataset, 
                 continuous_features=['age', 'hours_per_week'], 
                 outcome_name='income')

# Step 2: Instantiating the Pipeline 

In [28]:
numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

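# Note: only the categorical columns are listed here, so ColumnTransformer
# drops 'age' and 'hours_per_week' (its default is remainder='drop').
transformations = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical)])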

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)

# Step 3 + 4: Combining Data and Model and Selecting the Method

In [29]:
# Using sklearn backend
m = dice_ml.Model(model=model, backend="sklearn")
exp = dice_ml.Dice(d, m, method="random")  # alternatives: method="genetic" or method="kdtree"
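
To make the idea behind independent random sampling more concrete, here is a minimal sketch on a toy problem. It is not DiCE's implementation: `toy_predict`, `FEATURE_VALUES` and `random_counterfactuals` are hypothetical names introduced only for illustration, and the rule-based classifier stands in for the fitted random forest.

```python
import random

# Hypothetical stand-in for a trained black-box classifier (not the random forest above).
def toy_predict(person):
    """Return 1 (>50K) for graduate degrees or self-employed professionals, else 0."""
    if person["education"] in {"Masters", "Doctorate"}:
        return 1
    if person["workclass"] == "Self-Employed" and person["occupation"] == "Professional":
        return 1
    return 0

# Values each feature may take (illustrative subset of the Adult categories).
FEATURE_VALUES = {
    "workclass": ["Private", "Self-Employed", "Government"],
    "education": ["HS-grad", "Bachelors", "Masters", "Doctorate"],
    "occupation": ["Blue-Collar", "Professional", "Service"],
}

def random_counterfactuals(query, predict, total_cfs, max_tries=10_000, seed=0):
    """Sample feature values independently until `total_cfs` distinct points flip the prediction."""
    rng = random.Random(seed)
    target = 1 - predict(query)
    found = []
    for _ in range(max_tries):
        candidate = dict(query)
        # Pick a random subset of features, then draw each new value independently.
        n_changes = rng.randint(1, len(FEATURE_VALUES))
        for feature in rng.sample(sorted(FEATURE_VALUES), n_changes):
            candidate[feature] = rng.choice(FEATURE_VALUES[feature])
        if predict(candidate) == target and candidate not in found:
            found.append(candidate)
            if len(found) == total_cfs:
                break
    return found

query = {"workclass": "Private", "education": "HS-grad", "occupation": "Blue-Collar"}
cfs = random_counterfactuals(query, toy_predict, total_cfs=2)
for cf in cfs:
    print(cf, "->", toy_predict(cf))
```

Because every candidate is drawn independently of the others, this kind of search is fast but makes no attempt to keep the counterfactuals close to the query.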

## Generating Counterfactuals 

Now we can generate CFs. We start with the most basic specification, without any constraints.

In [30]:
e1 = exp.generate_counterfactuals(x_test[0:1], total_CFs=2, desired_class="opposite")
e1.visualize_as_dataframe(show_only_changes=False)

100%|██████████| 1/1 [00:00<00:00,  5.56it/s]

Query instance (original outcome : 0)





Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,29,Private,HS-grad,Married,Blue-Collar,White,Female,38,0



Diverse Counterfactual set (new outcome: 1.0)


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,29,Self-Employed,HS-grad,Married,White-Collar,White,Female,38,1
1,29,Self-Employed,HS-grad,Married,Professional,White,Female,38,1


# Step 5: More confined Restrictions and Constraints  

We can now add constraints, for instance by allowing only education and occupation to change (**features_to_vary**).
We can also raise the number of CFs to 3 by setting **total_CFs=3**.
Finally, we display only the changed values by setting **show_only_changes=True**.
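
What **show_only_changes=True** does to the displayed table can be sketched as follows. This is an illustrative approximation, not DiCE's rendering code; `only_changes` and the sample rows are hypothetical.

```python
def only_changes(query, counterfactuals, outcome_name):
    """Replace unchanged feature values with '-' and always keep the outcome column."""
    rows = []
    for cf in counterfactuals:
        row = {}
        for feature, value in cf.items():
            if feature != outcome_name and value == query[feature]:
                row[feature] = "-"      # same as the query: hide it
            else:
                row[feature] = value    # changed feature or the outcome: show it
        rows.append(row)
    return rows

# Illustrative query and counterfactuals (made up for this sketch).
query = {"age": 29, "education": "HS-grad", "occupation": "Blue-Collar", "income": 0}
cfs = [
    {"age": 29, "education": "Doctorate", "occupation": "Blue-Collar", "income": 1},
    {"age": 29, "education": "HS-grad", "occupation": "Professional", "income": 1},
]
masked = only_changes(query, cfs, outcome_name="income")
for row in masked:
    print(row)
```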

In [1]:
# Changing only education and occupation
e2 = exp.generate_counterfactuals(x_test[0:1],
                                  total_CFs=3,
                                  desired_class="opposite",
                                  features_to_vary=["education", "occupation"]
                                  )
e2.visualize_as_dataframe(show_only_changes=True)


## More Strict Restrictions

As the next restriction, we can also specify a **permitted_range**. Here we restrict the values that 'age' and 'education' may take.
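
A minimal sketch of such a constraint check is shown here. It assumes that a feature may either keep its original value or move to a permitted one; `satisfies_permitted_range` is a hypothetical helper, not part of the DiCE API.

```python
def satisfies_permitted_range(candidate, query, permitted_range, continuous_features):
    """True if every constrained feature is either unchanged or inside its permitted range."""
    for feature, allowed in permitted_range.items():
        value = candidate[feature]
        if value == query[feature]:
            continue                      # unchanged features are always acceptable
        if feature in continuous_features:
            low, high = allowed           # continuous: [lower, upper], inclusive
            if not low <= value <= high:
                return False
        elif value not in allowed:        # categorical: explicit list of categories
            return False
    return True

permitted = {"age": [20, 30], "education": ["Doctorate", "Prof-school"]}
continuous = {"age", "hours_per_week"}
query = {"age": 29, "education": "HS-grad", "occupation": "Blue-Collar"}

candidates = [
    {"age": 29, "education": "HS-grad", "occupation": "White-Collar"},   # education unchanged
    {"age": 25, "education": "Doctorate", "occupation": "Blue-Collar"},  # both within range
    {"age": 29, "education": "Masters", "occupation": "Blue-Collar"},    # degree not permitted
    {"age": 45, "education": "Doctorate", "occupation": "Blue-Collar"},  # age out of range
]
results = [satisfies_permitted_range(c, query, permitted, continuous) for c in candidates]
print(results)
```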

In [47]:
# Restricting age to be between [20,30] and Education to be either {'Doctorate', 'Prof-school'}.
e3 = exp.generate_counterfactuals(x_test[0:1],
                                  total_CFs=3,
                                  desired_class="opposite",
                                  permitted_range={'age': [20, 30], 'education': ['Doctorate', 'Prof-school']})
e3.visualize_as_dataframe(show_only_changes=True)

100%|██████████| 1/1 [00:00<00:00,  4.86it/s]

Query instance (original outcome : 0)





Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,29,Private,HS-grad,Married,Blue-Collar,White,Female,38,0



Diverse Counterfactual set (new outcome: 1.0)


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,-,Self-Employed,-,-,White-Collar,-,-,-,1
1,-,-,Doctorate,-,White-Collar,-,-,-,1
2,-,-,Doctorate,-,Professional,-,-,-,1


# Step 6 (Optional): Local/Global Feature Importance 

In [48]:

query_instance = x_test[0:1]
imp = exp.local_feature_importance(query_instance, total_CFs=10)
print(imp.local_importance)
query_instances = x_test[0:20]
imp = exp.global_feature_importance(query_instances)
print(imp.summary_importance)

100%|██████████| 1/1 [00:00<00:00,  2.80it/s]


[{'education': 0.7, 'workclass': 0.5, 'occupation': 0.5, 'gender': 0.2, 'marital_status': 0.1, 'race': 0.0, 'age': 0.0, 'hours_per_week': 0.0}]


100%|██████████| 20/20 [00:07<00:00,  2.56it/s]

{'education': 0.675, 'occupation': 0.27, 'age': 0.25, 'marital_status': 0.22, 'hours_per_week': 0.195, 'workclass': 0.19, 'race': 0.125, 'gender': 0.065}
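
The scores above are consistent with counting how often each feature changes across the generated counterfactuals (with total_CFs=10, the local scores are multiples of 0.1). The sketch below illustrates that counting under this assumption; `local_importance` and the sample data are hypothetical and not DiCE's own code.

```python
def local_importance(query, counterfactuals):
    """Fraction of counterfactuals in which each feature differs from the query."""
    total = len(counterfactuals)
    scores = {
        feature: sum(cf[feature] != value for cf in counterfactuals) / total
        for feature, value in query.items()
    }
    # Sort from most to least frequently changed.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))

# Illustrative query and counterfactuals (made up for this sketch).
query = {"education": "HS-grad", "occupation": "Blue-Collar", "age": 29}
cfs = [
    {"education": "Doctorate", "occupation": "Blue-Collar", "age": 29},
    {"education": "Masters", "occupation": "Professional", "age": 29},
    {"education": "HS-grad", "occupation": "White-Collar", "age": 29},
    {"education": "Prof-school", "occupation": "Blue-Collar", "age": 29},
]
scores = local_importance(query, cfs)
print(scores)
```

A global score can be obtained the same way by pooling the counterfactuals of many query instances before counting.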





# Example 2: Regression with Model Agnostic Method 

Steps: 
1. Convert Training data to DiCE data
2. Instantiate pre-trained model (Black Box Model) 
3. Combine Model and Data 
4. Specify CF Method (model agnostic method only)
5. Generating CF and Specify Constraints (optional)


Goal: Estimate median house value for California districts

Model: Sklearn Random Forest Regressor 

CF Method: genetic algorithm

Illustration: Getting started + implementing different constraints (Baseline)

# Step 0:  Importing Packages 

In [17]:
import dice_ml
from dice_ml import Dice
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing()

## Minor Pre-Processing

In [19]:
df_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
df_housing['outcome_name'] = pd.Series(housing.target)
df_housing.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,outcome_name
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


# Target 
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). See https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

In [2]:
continuous_features_cali = df_housing.drop('outcome_name', axis=1).columns.tolist()
target = df_housing['outcome_name']


# Step 1-3

Again, we need to identify the categorical and numerical features. In this dataset all eight features are continuous, so categorical_features ends up empty.

In [21]:
datasetX = df_housing.drop('outcome_name', axis=1)
x_train, x_test, y_train, y_test = train_test_split(datasetX,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=0)
categorical_features = x_train.columns.difference(continuous_features_cali)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features_cali),
        ('cat', categorical_transformer, categorical_features)])

In [None]:
# build the pipeline: preprocessing followed by the regressor
regr_cali = Pipeline(steps=[('preprocessor', transformations),
                            ('regressor', RandomForestRegressor())])
# fit the pipeline on the training data
model_cali = regr_cali.fit(x_train, y_train)

In [None]:
d_cali = dice_ml.Data(dataframe=df_housing,
                      continuous_features=continuous_features_cali,
                      outcome_name='outcome_name')

# Step 4: Specifying CF Method and instantiating Model and Data 

In [22]:
# We provide the type of model as a parameter (model_type)
m_cali = dice_ml.Model(model=model_cali,
                       backend="sklearn",
                       model_type='regressor')
exp_genetic_cali = Dice(d_cali,
                        m_cali,
                        method="genetic")
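
For intuition about what a genetic search for regression counterfactuals involves, here is a toy sketch. It is not DiCE's genetic algorithm: the two-feature `toy_predict` regressor, the `BOUNDS`, the penalty constants and the mutation scale are all assumptions chosen for illustration.

```python
import random

# Hypothetical stand-in for a trained regressor: value rises with income, falls with house age.
def toy_predict(x):
    med_inc, house_age = x
    return 0.6 * med_inc - 0.01 * house_age

BOUNDS = [(0.5, 15.0), (1.0, 52.0)]   # illustrative feature ranges

def fitness(x, query, desired_range):
    """Lower is better: normalized distance to the query plus a penalty for missing the range."""
    low, high = desired_range
    pred = toy_predict(x)
    miss = max(low - pred, pred - high, 0.0)
    penalty = 0.0 if miss == 0.0 else 10.0 + 10.0 * miss
    distance = sum(abs(a - b) / (hi - lo) for a, b, (lo, hi) in zip(x, query, BOUNDS))
    return penalty + distance

def genetic_counterfactual(query, desired_range, generations=60, pop_size=40, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.uniform(lo, hi) for lo, hi in BOUNDS) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda x: fitness(x, query, desired_range))
        parents = population[: pop_size // 2]                 # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(len(child))                     # mutate one feature
            lo, hi = BOUNDS[i]
            child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.05 * (hi - lo))))
            children.append(tuple(child))
        population = parents + children
    return min(population, key=lambda x: fitness(x, query, desired_range))

query = (3.6, 40.0)   # toy_predict(query) is 1.76, below the desired range
best = genetic_counterfactual(query, desired_range=(3.5, 4.0))
print(best, toy_predict(best))
```

Unlike independent random sampling, the population is refined over generations, so the result tends to stay closer to the query while still reaching the desired range.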

# Step 5: Generating Counterfactuals


In [23]:
# Multiple queries can be given as input at once
query_instances_cali = x_train[17:19]
genetic_cali = exp_genetic_cali.generate_counterfactuals(query_instances_cali, 
                                                             total_CFs=2, 
                                                             desired_range=[3.5, 4])
genetic_cali.visualize_as_dataframe(show_only_changes=True)

100%|██████████| 2/2 [00:01<00:00,  1.56it/s]

Query instance (original outcome : 3)





Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,outcome_name
0,6.0145,14.0,6.156749,1.082729,1858.0,2.696662,34.23,-118.87,2.96717



Diverse Counterfactual set (new outcome: [3.5, 4])


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,outcome_name
0,5.7128,24.0,5.3,1.1,1830.0,2.2,33.85,-118.39,3.8099504999999954
0,4.9297,31.0,5.3,1.0,1895.0,2.1,34.19,-118.22,3.859700499999999


Query instance (original outcome : 2)


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,outcome_name
0,3.6413,40.0,4.604736,1.041894,1318.0,2.400729,34.27,-119.26,2.26566



Diverse Counterfactual set (new outcome: [3.5, 4])


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,outcome_name
0,3.6202,34.0,4.4,1.1,1338.0,1.6,34.15,-118.36,3.9834513
0,3.6202,34.0,4.0,1.1,3.0,0.7,34.15,-118.36,3.5879317999999985


# Example 3: Deep Learning Application: Gradient Descent Method 

Steps:
1. Convert Training data to DiCE data
2. Instantiate pre-trained model (Black Box Model)
3. Combine Model and Data
4. Specify CF Method (gradient-based method, requires a differentiable model)
5. Specify Constraints

Goal: See Example 1

Illustration: Advanced Methods

In [2]:
!pip install tensorflow
import tensorflow as tf





# Step 1: Convert data

In [5]:
dataset = helpers.load_adult_income_dataset()
d = dice_ml.Data(dataframe=dataset, continuous_features=['age', 
                                                         'hours_per_week'],
                 outcome_name='income')

# Step 2: Instantiate pre-trained model

In [6]:
# seeding random numbers for reproducibility
seed(1)
# from tensorflow import set_random_seed; set_random_seed(2) # for tf1
tf.random.set_seed(1)

In [7]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)  # suppress deprecation warnings from TF
backend = 'TF' + tf.__version__[0]  # 'TF1' or 'TF2', depending on the installed TensorFlow version
ML_modelpath = helpers.get_adult_income_modelpath(backend=backend)
m = dice_ml.Model(model_path=ML_modelpath, backend=backend)  # Step 2: dice_ml.Model
# Step 3 (initiating DiCE with dice_ml.Dice(d, m)) follows once the query instance is defined

In [8]:
# query instance in the form of a dictionary or a dataframe; keys: feature name, values: feature value
query_instance = {'age': 22,
                  'workclass': 'Private',
                  'education': 'HS-grad',
                  'marital_status': 'Single',
                  'occupation': 'Service',
                  'race': 'White',
                  'gender': 'Female',
                  'hours_per_week': 45}

# Step 3 + Step 4: Combine Model and Data and Select CF Method

Here the CFs are generated by gradient descent rather than by sampling points: proximity, validity and diversity are all incorporated in the loss and optimized jointly. As a result, generation takes considerably longer.
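
The following sketch shows how validity, proximity and diversity can be combined into one objective over a set of counterfactuals. It is a simplified stand-in and not DiCE's actual loss: the squared-error validity term, the mean pairwise distance used for diversity, and the `predict_proba` model are all assumptions made for illustration.

```python
import math
from itertools import combinations

# Hypothetical differentiable model: logistic regression on two scaled features.
def predict_proba(x):
    return 1.0 / (1.0 + math.exp(-(4.0 * x[0] + 2.0 * x[1] - 3.0)))

def l1(a, b):
    """Mean absolute difference between two points."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)

def cf_loss(cfs, query, target=1.0, proximity_weight=0.5, diversity_weight=1.0):
    """Simplified objective for a whole set of counterfactuals (lower is better)."""
    validity = sum((predict_proba(c) - target) ** 2 for c in cfs) / len(cfs)
    proximity = sum(l1(c, query) for c in cfs) / len(cfs)
    pairs = list(combinations(cfs, 2))
    diversity = sum(l1(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return validity + proximity_weight * proximity - diversity_weight * diversity

query = (0.2, 0.3)                      # predict_proba(query) is about 0.17
identical = [(0.9, 0.6), (0.9, 0.6)]    # both flip the prediction, but are not diverse
spread = [(0.9, 0.3), (0.5, 1.0)]       # both flip the prediction and differ from each other

print(cf_loss(identical, query))
print(cf_loss(spread, query))
print(cf_loss(spread, query, proximity_weight=1.5))
```

The diverse set scores lower than the duplicated one, and a larger proximity weight penalizes the same set more for its distance from the query. A gradient-based method minimizes such an objective over the counterfactuals themselves, which is why it needs a differentiable model.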

In [10]:
exp = dice_ml.Dice(d, m)
dice_exp = exp.generate_counterfactuals(query_instance,
                                        total_CFs=4, 
                                        desired_class="opposite")



Diverse Counterfactuals found! total time taken: 00 min 44 sec


In [11]:
# Optional 
dice_exp.visualize_as_dataframe(show_only_changes=True)

Query instance (original outcome : 0)


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,22.0,Private,HS-grad,Single,Service,White,Female,45.0,0.019



Diverse Counterfactual set (new outcome: 1.0)


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,70.0,-,Masters,-,White-Collar,-,-,51.0,0
1,-,Self-Employed,Doctorate,Married,-,-,-,-,0
2,47.0,-,-,Married,-,-,-,-,0
3,36.0,-,Prof-school,Married,-,-,-,62.0,0


## Step 5: Feature Weights + Hyperparameter Weights

The Median Absolute Deviation (MAD) of a continuous feature conveys the variability of the feature and is more robust than the standard deviation, as it is less affected by outliers and non-normality.
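
As a small illustration of that robustness, the sketch below computes the MAD with the standard library and compares it with the standard deviation on made-up working hours, with and without an outlier. The `mad` helper and the sample values are hypothetical.

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)

hours = [38, 40, 40, 40, 42, 45, 50]   # made-up weekly working hours
with_outlier = hours + [99]            # one extreme value added

print("MAD:  ", mad(hours), "->", mad(with_outlier))
print("stdev:", statistics.stdev(hours), "->", statistics.stdev(with_outlier))

# Inverse-MAD weight, mirroring the 1/MAD weighting used for feature weights.
weight = round(1 / mad(hours), 2)
print("weight:", weight)
```

The single outlier leaves the MAD unchanged while the standard deviation grows several-fold, which is why inverse-MAD weights are a stable way to put features on a comparable scale.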

In [12]:
# Median absolute deviation (MAD) of each continuous feature
mads = d.get_mads(normalized=True)
feature_weights = {}  # inverse-MAD feature weights
for feature in mads:
    feature_weights[feature] = round(1/mads[feature], 2)

print(feature_weights)
feature_weights = {'age': 1, 'hours_per_week': 1}  # override with equal weights
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=2, 
                                        desired_class="opposite",
                                        feature_weights=feature_weights,
                                        loss_converge_maxiter=2) 

# change proximity_weight from default value of 0.5 to 1.5
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4, 
                                        desired_class="opposite",
                                        proximity_weight=1.5, 
                                        diversity_weight=1.0, 
                                        loss_converge_maxiter=2)
dice_exp.visualize_as_dataframe(show_only_changes=True)

{'age': 7.3, 'hours_per_week': 24.5}
Diverse Counterfactuals found! total time taken: 02 min 55 sec
