This notebook shows how to use the CARLA library.

# How to use CARLA


In [1]:
from IPython.display import display

import warnings
warnings.filterwarnings('ignore')

## Data

Before we can do anything we need some data. Using CARLA, you have several options to handle data.

1. You could import one of the datasets from our [OnlineCatalog](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/data.html#module-data.catalog.online_catalog).
2. However, you may want to use your own data instead. This can easily be done by using the [CsvCatalog](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/data.html#module-data.catalog.csv_catalog).

### Using the <code>OnlineCatalog</code>

Using the <code>OnlineCatalog</code> is straightforward. Currently, we support four data sets: "heloc", "adult", "compas", and "give_me_credit". The examples below use the adult data set, starting with how to load it through the <code>OnlineCatalog</code>.

In [2]:
from carla.data.catalog import OnlineCatalog

# load catalog dataset
data_name = "adult"
dataset = OnlineCatalog(data_name)

Using TensorFlow backend.


[INFO] Using Python-MIP package version 1.12.0 [model.py <module>]


Below, we take a look at how you can add your own data to CARLA.

### Using the <code>CsvCatalog</code>

The <code>CsvCatalog</code> takes five arguments. <code>file_path</code> is the path of the CSV file you want to use. <code>continuous</code> and <code>categorical</code> list the two types of features, and <code>immutables</code> lists the features that cannot be changed. Finally, <code>target</code> is the column that contains the targets/labels. For the Adult Income data set, this is "income", i.e., whether an individual earned more or less than \$50,000.

Note that when using the <code>CsvCatalog</code> the data should already be cleaned; e.g., your .csv file should not contain any NaNs.
Also make sure that the categorical variables are binary encoded, i.e., $x_j \in \{0,1\}$ if feature $j$ is a categorical variable (e.g., "workclass_Private"). We are currently working on extensions to this.
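The two requirements above (no NaNs, binary-encoded categoricals) can be checked before the file is handed to the catalog. The following is a minimal sketch using plain pandas; the helper name `check_csv_ready` and the toy frame are hypothetical and not part of CARLA.

```python
import pandas as pd


def check_csv_ready(df, binary_columns):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper for illustration only, not a CARLA function.
    """
    problems = []
    # Requirement 1: the data must not contain NaNs.
    nan_columns = df.columns[df.isna().any()].tolist()
    if nan_columns:
        problems.append(f"columns with NaNs: {nan_columns}")
    # Requirement 2: encoded categorical columns may only hold 0 or 1.
    for col in binary_columns:
        if not df[col].isin([0, 1]).all():
            problems.append(f"column {col!r} has values outside {{0, 1}}")
    return problems


# Toy frame with one NaN and one invalid categorical value.
toy = pd.DataFrame({"age": [0.3, 0.45, None], "workclass_Private": [0.0, 1.0, 2.0]})
problems = check_csv_ready(toy, binary_columns=["workclass_Private"])
for problem in problems:
    print(problem)
```

In practice you would run such a check on `pd.read_csv(...)` of your own file and fix any reported columns before creating the catalog.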

In [3]:
from carla.data.catalog import CsvCatalog

continuous = ["age", "fnlwgt", "education-num", "capital-gain", "hours-per-week", "capital-loss"]
categorical = ["marital-status", "native-country", "occupation", "race", "relationship", "sex", "workclass"]
immutable = ["age", "sex"]

dataset = CsvCatalog(file_path="adult.csv",
                     continuous=continuous,
                     categorical=categorical,
                     immutables=immutable,
                     target='income')

display(dataset.df.head())

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.30137,0.044131,0.8,0.02174,0.0,...,0.0,1.0,1.0,1.0,0.0
1,0.452055,0.048052,0.8,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.4,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0
4,0.150685,0.220635,0.8,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0


## ML Classifier

Now that we have the data loaded we also need a classification model. Again, you have two options:

1. You can easily define your own model. In our [model documentation](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/examples.html#black-box-model) we describe how you can do that.
2. Here we will show how you can train one of our [catalog](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/mlmodel.html#module-models.catalog.catalog) models.
Depending on your data and your use-case you might need to tweak the training hyperparameters.

For example, for the **ANN** used here we need to define the following hyperparameters:
- *learning rate*
- *number of epochs*
- *batch size*
- *sizes of the hidden layers*.

After defining the model using the <code>MLModelCatalog</code>, just call the *train* method with those parameters and you are good to go!

In [4]:
from carla.models.catalog import MLModelCatalog

In [5]:
training_params = {"lr": 0.002, "epochs": 10, "batch_size": 1024, "hidden_size": [18, 9, 3]}

ml_model = MLModelCatalog(
    dataset, 
    model_type="ann", 
    load_online=False, 
    backend="pytorch"
)

ml_model.train(
    learning_rate=training_params["lr"],
    epochs=training_params["epochs"],
    batch_size=training_params["batch_size"],
    hidden_size=training_params["hidden_size"]
)

Loaded model from C:\Users\fred0\carla\models\custom\ann_layers_18_9_3.pt
test accuracy for model: 0.8414154652686763


## Counterfactual Explanations and Algorithmic Recourse

Now that we have both the data and a model, we can start using CARLA to generate counterfactuals. Again, you have two options:

1. You can pick a [recourse method](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/recourse.html) from the catalog.
2. Or you can implement one yourself using our [recourse interface](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/recourse.html#recourse-api). If you would like to add a new method to the library, just submit a pull-request. :)

In the following example, we select the samples that the model labels negatively, i.e., the samples for which we would like to find counterfactuals.
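Conceptually, selecting negatively labeled samples amounts to keeping the rows whose predicted probability of the favorable class falls below a decision threshold. The sketch below illustrates this idea with a toy scoring function; `select_negative_instances`, `toy_predict_proba`, and the threshold of 0.5 are assumptions made for illustration, and CARLA's own `predict_negative_instances` may differ in detail.

```python
import numpy as np
import pandas as pd


def select_negative_instances(predict_proba, df, feature_columns, threshold=0.5):
    """Keep rows whose predicted favorable-class probability is below the threshold.

    Hypothetical helper for illustration only.
    """
    scores = predict_proba(df[feature_columns].to_numpy())
    return df[scores < threshold]


def toy_predict_proba(features):
    # Toy logistic model on a single feature (an assumption, not a trained model).
    return 1.0 / (1.0 + np.exp(-(4.0 * features[:, 0] - 2.0)))


toy = pd.DataFrame({"education-num": [0.2, 0.8, 0.4, 0.9]})
negatives = select_negative_instances(toy_predict_proba, toy, ["education-num"])
print(negatives)
```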

In [6]:
from carla.models.negative_instances import predict_negative_instances
import carla.recourse_methods.catalog as recourse_catalog

In [7]:
factuals = predict_negative_instances(ml_model, dataset.df)
test_factual = factuals.iloc[:5]

display(test_factual)

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.30137,0.044131,0.8,0.02174,0.0,...,0.0,1.0,1.0,1.0,0.0
1,0.452055,0.048052,0.8,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.4,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0
4,0.150685,0.220635,0.8,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0


### Wachter et al (2018) (gradient method)

The recourse objective function looks as follows:
\begin{align}
\delta_x^* & = \operatorname*{argmin}_{\delta_x,\, x + \delta_x \in \mathcal{A}} \, \ell\big(h(x + \delta_x), 0.5\big) + \lambda \cdot d(x + \delta_x, x),
\end{align}
where $\lambda \geq 0$ is a trade-off parameter, $0.5$ is the probabilistic target, $\mathcal{A}$ is the feasible set of actions, and $\ell(\cdot, \cdot)$ is the binary-cross-entropy loss. The first term on the right-hand side ensures that the model prediction corresponding to the counterfactual, i.e., $h(x + \delta_x)$, is close to the favorable outcome with classification label $1$. The second term encourages low-cost recourses; for example, Wachter et al (2018) propose $\ell_1$ or $\ell_2$ distances to ensure that the distance between the factual instance $x$ and the counterfactual $\check{x} = x + \delta_x^*$ is small.
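To make the objective more concrete, here is a minimal sketch of a gradient-based search for a toy logistic model $h(x) = \sigma(w^\top x + b)$. It is not CARLA's implementation: the function name `wachter_sketch`, the squared $\ell_2$ distance, the fixed $\lambda$, and the stopping rule (stop once the prediction exceeds $0.5$) are all assumptions, and the feasible set $\mathcal{A}$ is not enforced.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def wachter_sketch(x, w, b, lam=0.1, lr=0.05, max_iter=5000, threshold=0.5):
    """Gradient descent on BCE(h(x + delta), 1) + lam * ||delta||_2^2.

    Illustrative only: h is a logistic model with weights w and bias b.
    """
    delta = np.zeros_like(x)
    for _ in range(max_iter):
        p = sigmoid(w @ (x + delta) + b)
        if p > threshold:
            break  # the prediction has flipped to the favorable class
        # d/d(delta) of -log(p) is -(1 - p) * w; d/d(delta) of lam * ||delta||^2 is 2 * lam * delta.
        grad = -(1.0 - p) * w + 2.0 * lam * delta
        delta -= lr * grad
    return x + delta


x = np.array([0.0, 0.5])
w, b = np.array([2.0, -1.0]), -1.0
cf = wachter_sketch(x, w, b)
print("factual prediction:", sigmoid(w @ x + b))
print("counterfactual prediction:", sigmoid(w @ cf + b))
```

A larger `lam` keeps the counterfactual closer to the factual but can prevent the prediction from flipping; implementations commonly adjust $\lambda$ during the search for this reason.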

In [8]:
hyperparams = {"loss_type": "BCE", "binary_cat_features": False}
recourse_method = recourse_catalog.Wachter(ml_model, hyperparams)
df_cfs = recourse_method.get_counterfactuals(test_factual)

display(df_cfs)

[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]


Unnamed: 0,age,capital-gain,capital-loss,education-num,fnlwgt,...,race_White,relationship_Non-Husband,sex_Male,workclass_Private,income
0,0.341076,0.061459,0.03971,0.839714,0.083811,...,1.0,1.0,0.0,0.037467,1.0
1,0.47199,0.019946,0.019943,0.819944,0.067987,...,1.0,0.0,1.0,0.019928,1.0
2,0.435641,0.149357,0.149372,0.681845,0.041204,...,0.0,1.0,0.0,0.851989,1.0
3,0.552305,0.059178,0.059163,0.459171,0.20938,...,0.0,0.0,1.0,0.940863,1.0
4,0.276991,0.126107,0.126601,0.925433,0.154797,...,0.0,1.0,0.0,0.875522,1.0


### CCHVAE by Pawelczyk et al (2020) (manifold method)

Let $g: \mathcal{Z} \to \mathcal{X}$ be the decoder of a generative model (e.g., a VAE), and let $e: \mathcal{X} \to \mathcal{Z}$ be the corresponding encoder. We encode the factual input $x$, for which we wish to find a counterfactual explanation, as $e(x)=z$, and conduct the search in the latent space.
Manifold-based methods solve an objective function that looks as follows:
\begin{align}
\delta_z^* & = \operatorname*{argmin}_{\delta_z,\, g(z + \delta_z) \in \mathcal{A}} \, \ell\big(h(g(z + \delta_z)), 0.5\big) + \lambda \cdot d(z + \delta_z, z),
\label{eq:wachter}
\end{align}
where $\lambda \geq 0$ is a trade-off parameter, $0.5$ is the probabilistic target,
$\ell(\cdot, \cdot)$ is the binary-cross-entropy loss, and $\mathcal{A}$ is the feasible set of actions. The first term on the right-hand side ensures that the model prediction corresponding to the counterfactual, i.e., $h(g(z + \delta_z))$, is close to the favorable outcome with classification label $1$. The second term encourages low-cost recourses.

For example, Pawelczyk et al (2020) use random search in the latent space to approximate the above objective function, while Joshi et al (2019) use a gradient-based algorithm on a variant of the above objective function. We refer to the respective papers for more details.
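The random-search idea can be sketched as follows: encode the factual, sample latent perturbations in shells of growing radius, decode them, and return the closest decoded candidate that the classifier labels favorably. This is a toy illustration under stated assumptions, not the CCHVAE implementation: the linear `encode`/`decode` pair stands in for a trained VAE, `predict_proba` is a toy logistic model, and the function name `latent_random_search` is hypothetical. The argument names mirror the `n_search_samples`, `step`, `max_iter`, and `p_norm` hyperparameters used in the next cell, but their exact semantics in CARLA may differ.

```python
import numpy as np


def latent_random_search(x, encode, decode, predict_proba,
                         n_search_samples=100, step=0.1, max_iter=1000,
                         p_norm=1, seed=0):
    """Random search in latent space with a growing search radius (illustrative)."""
    rng = np.random.default_rng(seed)
    z = encode(x)
    radius = step
    for _ in range(max_iter):
        # Sample random directions and radii inside the shell [radius - step, radius].
        directions = rng.normal(size=(n_search_samples, z.shape[0]))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(radius - step, radius, size=(n_search_samples, 1))
        candidates = decode(z + directions * radii)
        positive = predict_proba(candidates) > 0.5
        if positive.any():
            # Among the favorable candidates, return the one closest to x in input space.
            dists = np.linalg.norm(candidates[positive] - x, ord=p_norm, axis=1)
            return candidates[positive][np.argmin(dists)]
        radius += step  # nothing found: widen the search
    return None


# Toy stand-ins for a trained VAE and classifier (assumptions for illustration).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A_pinv = np.linalg.pinv(A)
w, b = np.array([1.0, 1.0, 1.0]), -2.0


def encode(x):
    return A_pinv @ x


def decode(z):
    return z @ A.T


def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


x = np.zeros(3)
cf = latent_random_search(x, encode, decode, predict_proba)
print("counterfactual:", cf)
```

Because every candidate is produced by the decoder, the returned point lies on the decoder's output manifold by construction, which is the motivation for searching in the latent space.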

In [9]:
hyperparams = {
    "data_name": dataset.name,
    "n_search_samples": 100,
    "p_norm": 1,
    "step": 0.1,
    "max_iter": 1000,
    "clamp": True,
    "binary_cat_features": False,
    "vae_params": {
        "layers": [len(ml_model.feature_input_order), 512, 256, 8],
        "train": True,
        "lambda_reg": 1e-6,
        "epochs": 5,
        "lr": 1e-3,
        "batch_size": 32,
    },
}

cchvae = recourse_catalog.CCHVAE(ml_model, hyperparams)
df_cfs = cchvae.get_counterfactuals(test_factual)

display(df_cfs)

[INFO] Start training of Variational Autoencoder... [models.py fit]
[INFO] [Epoch: 0/5] [objective: 0.381] [models.py fit]
[INFO] [ELBO train: 0.38] [models.py fit]
[INFO] [ELBO train: 0.14] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] ... finished training of Variational Autoencoder. [models.py fit]


Unnamed: 0,age,capital-gain,capital-loss,education-num,fnlwgt,...,race_White,relationship_Non-Husband,sex_Male,workclass_Private,income
0,0.296056,0.036247,0.03969,0.601053,0.120308,...,1.0,0.0,1.0,0.737635,1.0
1,0.296061,0.036247,0.039689,0.60105,0.120308,...,1.0,0.0,1.0,0.737636,1.0
2,0.296058,0.036248,0.039689,0.60105,0.120308,...,1.0,0.0,1.0,0.737637,1.0
3,0.296058,0.036247,0.03969,0.601051,0.120308,...,1.0,0.0,1.0,0.737637,1.0
4,0.296059,0.036248,0.039689,0.601049,0.120308,...,1.0,0.0,1.0,0.737638,1.0


### FOCUS by Lucic et al (2021) (tree method)

Our library also supports sklearn and xgboost tree-based classifiers such as *Random Forests*, *Decision Trees*, or *Gradient Boosted Decision Trees*.
These classifiers are needed for methods that explicitly require tree models (e.g., FeatureTweak and FOCUS).

In [10]:
from carla.recourse_methods.catalog.focus.tree_model import ForestModel, XGBoostModel
ml_model = XGBoostModel(dataset)

factuals = predict_negative_instances(ml_model, dataset.df)
test_factual = factuals.iloc[:5]

display(test_factual)

[0]	validation_0-logloss:0.58281	validation_1-logloss:0.58635
[1]	validation_0-logloss:0.52201	validation_1-logloss:0.52660
[2]	validation_0-logloss:0.48373	validation_1-logloss:0.48970
[3]	validation_0-logloss:0.45787	validation_1-logloss:0.46419
[4]	validation_0-logloss:0.43901	validation_1-logloss:0.44431


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.30137,0.044131,0.8,0.02174,0.0,...,0.0,1.0,1.0,1.0,0.0
1,0.452055,0.048052,0.8,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.4,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0
4,0.150685,0.220635,0.8,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0


Below we start generating counterfactuals using FOCUS.

In [11]:
hyperparams = {
    "optimizer": "adam",
    "lr": 0.001,
    "n_class": 2,
    "n_iter": 1000,
    "sigma": 1.0,
    "temperature": 1.0,
    "distance_weight": 0.01,
    "distance_func": "l1",
}

focus = recourse_catalog.FOCUS(ml_model, hyperparams)
df_cfs = focus.get_counterfactuals(test_factual)
display(df_cfs)



Unnamed: 0,age,fnlwgt,education-num,capital-gain,hours-per-week,capital-loss
0,0.301443,0.044131,0.800003,0.051235,0.398013,0.0
1,0.452152,0.048052,0.799976,0.051206,0.122428,0.0
2,0.287656,0.137581,0.533235,0.051204,0.397913,0.0
3,0.493167,0.150486,0.399988,0.051275,0.397919,0.0
4,0.15764,0.220635,0.799542,0.05157,0.398128,0.0
