In the following notebook we will show how you can use the CARLA library.

# How to use CARLA


In [1]:
from IPython.display import display

%load_ext autoreload
%autoreload 2

## Data

Before we can do anything we need some data. Using CARLA, you have several options to handle data.

1. You could import one of the datasets from our [OnlineCatalog](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/data.html#module-data.catalog.online_catalog).
2. However, you may want to use your own data instead. This can easily be done by using the [CsvCatalog](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/data.html#module-data.catalog.csv_catalog).

### Using the <code>OnlineCatalog</code>

The <code>OnlineCatalog</code> currently supports four data sets: "heloc", "adult", "compas", and "give_me_credit". The example below loads the adult data set, which we use throughout this notebook.

In [13]:
from carla.data.catalog import OnlineCatalog

# load catalog dataset
data_name = "adult"
dataset = OnlineCatalog(data_name)

Below, we take a look at how you can add your own data to CARLA.

### Using the <code>CsvCatalog</code>

The <code>CsvCatalog</code> takes five arguments. <code>file_path</code> is the path to the CSV file you want to use. <code>continuous</code> and <code>categorical</code> list the two types of features, and <code>immutables</code> lists the features that cannot be changed. Finally, <code>target</code> is the column that contains the targets/labels. For the Adult Income data set, this is "income", i.e., whether an individual earned more or less than \$50,000.

Note that when using the <code>CsvCatalog</code> the data should already be cleaned; e.g., your .csv file should not contain any NaNs.
Also make sure that the categorical variables are binary encoded, i.e., $x_j \in \{0,1\}$ if feature $j$ is a categorical variable (e.g., "workclass_private"). We are currently working on extensions to this.
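The two requirements above (no missing values, 0/1-encoded categorical columns) can be checked before the file is handed to the <code>CsvCatalog</code>. The following is a minimal sketch using only the standard library; `check_csv_ready` and the toy column names are hypothetical and not part of CARLA.

```python
import csv
import io


def check_csv_ready(csv_text, categorical):
    """Return a list of problems found in the CSV text.

    Hypothetical helper for illustration only; it is not a CARLA function.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            text = (value or "").strip()
            # Treat empty cells and the literal string "nan" as missing.
            if text == "" or text.lower() == "nan":
                problems.append(f"row {row_number}: missing value in '{column}'")
                continue
            if column in categorical:
                try:
                    is_binary = float(text) in (0.0, 1.0)
                except ValueError:
                    is_binary = False
                if not is_binary:
                    problems.append(f"row {row_number}: '{column}' is not 0/1 (got {text})")
    return problems


categorical_columns = {"sex_Male", "workclass_Private"}
clean = "age,sex_Male,workclass_Private\n0.30,1,0\n0.45,0,1\n"
dirty = "age,sex_Male,workclass_Private\n0.30,2,0\n,0,1\n"

print(check_csv_ready(clean, categorical_columns))
print(check_csv_ready(dirty, categorical_columns))
```

An empty list means the file passes both checks; each entry otherwise names the offending row and column.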

In [2]:
from carla.data.catalog import CsvCatalog

continuous = ["age", "fnlwgt", "education-num", "capital-gain", "hours-per-week", "capital-loss"]
categorical = ["marital-status", "native-country", "occupation", "race", "relationship", "sex", "workclass"]
immutable = ["age", "sex"]

dataset = CsvCatalog(file_path="adult.csv",
                     continuous=continuous,
                     categorical=categorical,
                     immutables=immutable,
                     target='income')

display(dataset.df.head())



Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.30137,0.044131,0.8,0.02174,0.0,...,0.0,1.0,1.0,1.0,0.0
1,0.452055,0.048052,0.8,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.4,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0
4,0.150685,0.220635,0.8,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0


## ML Classifier

Now that we have the data loaded we also need a classification model. Again, you have two options:

1. You can easily define your own model. In our [model documentation](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/examples.html#black-box-model) we describe how you can do that.
2. Here we will show how you can train one of our [catalog](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/mlmodel.html#module-models.catalog.catalog) models.
Depending on your data and your use-case you might need to tweak the training hyperparameters.

For example, for the **ANN** used here we need to define the following hyperparameters:
- *learning rate*
- *number of epochs*
- *batch size*
- *sizes of the hidden layers*.

After defining the model using the <code>MLModelCatalog</code>, just call the *train* method with those parameters and you are good to go!

In [3]:
from carla.models.catalog import MLModelCatalog

In [6]:
training_params = {"lr": 0.002, "epochs": 10, "batch_size": 1024, "hidden_size": [18, 9, 3]}

ml_model = MLModelCatalog(
    dataset, 
    model_type="ann", 
    load_online=False, 
    backend="pytorch"
)

ml_model.train(
    learning_rate=training_params["lr"],
    epochs=training_params["epochs"],
    batch_size=training_params["batch_size"],
    hidden_size=training_params["hidden_size"]
)

balance on test set 0.23774027959807775, balance on test set 0.24410222804718218
Epoch 0/9
----------
train Loss: 0.4692 Acc: 0.7623

test Loss: 0.4262 Acc: 0.7559

Epoch 1/9
----------
train Loss: 0.4114 Acc: 0.7623

test Loss: 0.4054 Acc: 0.7559

Epoch 2/9
----------
train Loss: 0.3967 Acc: 0.7623

test Loss: 0.3938 Acc: 0.7559

Epoch 3/9
----------
train Loss: 0.3851 Acc: 0.8229

test Loss: 0.3912 Acc: 0.8147

Epoch 4/9
----------
train Loss: 0.3745 Acc: 0.8311

test Loss: 0.3727 Acc: 0.8306

Epoch 5/9
----------
train Loss: 0.3663 Acc: 0.8324

test Loss: 0.3721 Acc: 0.8308

Epoch 6/9
----------
train Loss: 0.3598 Acc: 0.8350

test Loss: 0.3612 Acc: 0.8355

Epoch 7/9
----------
train Loss: 0.3534 Acc: 0.8371

test Loss: 0.3537 Acc: 0.8388

Epoch 8/9
----------
train Loss: 0.3512 Acc: 0.8378

test Loss: 0.3509 Acc: 0.8391

Epoch 9/9
----------
train Loss: 0.3468 Acc: 0.8399

test Loss: 0.3511 Acc: 0.8396



## Counterfactual Explanations and Algorithmic Recourse

Now that we have both the data and a model, we can start using CARLA to generate counterfactuals. Again, you have two options:

1. You can pick a [recourse method](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/recourse.html) from the catalog.
2. Or you can implement one yourself using our [recourse interface](https://carla-counterfactual-and-recourse-library.readthedocs.io/en/latest/recourse.html#recourse-api). If you would like to add a new method to the library, just submit a pull-request. :)

In the following example, we first select the samples that the model labels negatively; these are the factuals for which we would like to find counterfactuals.

In [4]:
from carla.models.negative_instances import predict_negative_instances
import carla.recourse_methods.catalog as recourse_catalog

In [8]:
factuals = predict_negative_instances(ml_model, dataset.df)
test_factual = factuals.iloc[:5]

display(test_factual)

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.30137,0.044131,0.8,0.02174,0.0,...,0.0,1.0,1.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.4,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0
4,0.150685,0.220635,0.8,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0
6,0.438356,0.100061,0.266667,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0
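To make the idea of "negatively labeled samples" concrete, here is a minimal sketch of the selection step on toy data. It assumes that a sample counts as negative when the predicted probability of the favorable class is below 0.5; the logistic model, its weights, and the helper names are made up for illustration and do not reproduce the internals of `predict_negative_instances`.

```python
import math


def predict_proba_positive(row, weights, bias):
    """Probability of the favorable class under a toy logistic model."""
    score = bias + sum(weights[name] * row[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-score))


def select_negative_instances(rows, weights, bias, threshold=0.5):
    """Keep only the rows whose predicted probability is below the threshold."""
    return [
        row for row in rows
        if predict_proba_positive(row, weights, bias) < threshold
    ]


rows = [
    {"age": 0.30, "education-num": 0.80},
    {"age": 0.45, "education-num": 0.80},
    {"age": 0.15, "education-num": 0.27},
]
weights = {"age": 3.0, "education-num": 2.0}  # hypothetical model weights
bias = -2.8

negatives = select_negative_instances(rows, weights, bias)
print(negatives)
```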


### Wachter et al (2018) (gradient method)

The recourse objective function looks as follows:
\begin{align}
\delta_x^* & = \operatorname*{arg\,min}_{\delta_x,\, x + \delta_x \in \mathcal{A}} \, \ell\big(h(x + \delta_x), 0.5\big) + \lambda \cdot d(x + \delta_x, x),
\end{align}
where $\lambda \geq 0$ is a trade-off parameter, $0.5$ is the probabilistic target, $\mathcal{A}$ is the feasible set of actions, and $\ell(\cdot, \cdot)$ is the binary-cross-entropy loss. The first term on the right-hand side ensures that the model prediction corresponding to the counterfactual, i.e., $h(x + \delta_x)$, is close to the favorable outcome with classification prediction $1$. The second term encourages low-cost recourses; for example, Wachter et al (2018) propose $\ell_1$ or $\ell_2$ distances to ensure that the distance between the factual instance $x$ and the counterfactual $\check{x} = x + \delta_x^*$ is small.
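The objective above can be illustrated with a few lines of plain gradient descent. This is a minimal sketch under stated assumptions, not CARLA's implementation: the classifier $h$ is a toy logistic model with made-up weights, the loss is the binary cross-entropy against the favorable label $1$, the distance is the squared $\ell_2$ distance (chosen because its gradient is simple), and the search stops once the prediction reaches $0.5$. The name `wachter_sketch` is hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(x, w, b):
    """Toy logistic classifier h(x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def wachter_sketch(x, w, b, lam=0.1, lr=0.1, max_iter=1000, threshold=0.5):
    """Gradient descent on BCE(h(x_cf), 1) + lam * 0.5 * ||x_cf - x||_2^2."""
    x_cf = list(x)
    for _ in range(max_iter):
        p = predict_proba(x_cf, w, b)
        if p >= threshold:
            break  # the prediction has flipped, so we stop early
        # d/dx_cf of -log(p) is -(1 - p) * w for a logistic model;
        # d/dx_cf of the distance term is lam * (x_cf - x).
        grad = [
            -(1.0 - p) * wi + lam * (ci - xi)
            for wi, ci, xi in zip(w, x_cf, x)
        ]
        x_cf = [ci - lr * g for ci, g in zip(x_cf, grad)]
    return x_cf


w, b = [2.0, -1.0], -1.5       # hypothetical model parameters
x = [0.2, 0.5]                 # factual, predicted negative
x_cf = wachter_sketch(x, w, b)

print(predict_proba(x, w, b), predict_proba(x_cf, w, b))
```

A larger `lam` keeps the counterfactual closer to the factual but makes it harder to cross the decision boundary.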

In [9]:
hyperparams = {"loss_type": "BCE", "binary_cat_features": False}
recourse_method = recourse_catalog.Wachter(ml_model, hyperparams)
df_cfs = recourse_method.get_counterfactuals(test_factual)

display(df_cfs)

[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]
[INFO] Counterfactual Explanation Found [wachter.py wachter_recourse]


Unnamed: 0,age,capital-gain,capital-loss,education-num,fnlwgt,...,race_White,relationship_Non-Husband,sex_Male,workclass_Private,income
0,0.350025,0.070708,0.048941,0.849036,0.000182,...,1.0,1.0,0.0,0.042775,1.0
2,0.414486,0.125881,0.127741,0.658535,0.033042,...,0.0,1.0,0.0,0.873237,1.0
3,0.550512,0.058995,0.058867,0.459005,0.210247,...,0.0,0.0,1.0,0.950902,1.0
4,0.257907,0.106133,0.107396,0.905325,0.150003,...,0.0,1.0,0.0,0.894212,1.0
6,0.605591,0.167044,0.167439,0.433625,-0.056802,...,0.0,1.0,0.0,0.834089,1.0


### CCHVAE by Pawelczyk et al (2020) (manifold method)

Let $g: \mathcal{Z} \to \mathcal{X}$ be the decoder of a generative model (e.g., VAE). Let $e: \mathcal{X} \to \mathcal{Z}$ be the corresponding encoder. We then encode the factual input $x$, for which we wish to find a counterfactual explanation, as $e(x)=z$, and conduct the search in the latent space.
Manifold-based methods solve an objective function that looks as follows:
\begin{align}
\delta_z^* & = \operatorname*{arg\,min}_{\delta_z,\, g(z + \delta_z) \in \mathcal{A}} \, \ell\big(h(g(z + \delta_z)), 0.5\big) + \lambda \cdot d(z + \delta_z, z),
\end{align}
where $\lambda \geq 0$ is a trade-off parameter, $0.5$ is the probabilistic target, $\ell(\cdot, \cdot)$ is the binary-cross-entropy loss, and $\mathcal{A}$ is the feasible set of actions. The first term on the right-hand side ensures that the model prediction corresponding to the counterfactual, i.e., $h(g(z + \delta_z))$, is close to the favorable outcome with classification label $1$. The second term encourages low-cost recourses.

For example, Pawelczyk et al (2020) use random search in the latent space to approximate the above objective function, while Joshi et al (2019) use a gradient-based algorithm on a variant of the above objective function. We refer to the respective papers for more details.
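The random search in the latent space can be sketched as follows. This is a simplified illustration under stated assumptions, not the CCHVAE implementation: `encode` and `decode` are toy invertible maps standing in for a trained VAE, the classifier is a logistic model with made-up weights, and candidates are drawn from shells of growing radius around $z = e(x)$ until one of them is classified favorably. The parameter names `n_search_samples`, `step`, and `max_iter` mirror the hyperparameters used in this notebook, but their exact semantics in CARLA may differ.

```python
import math
import random


def classifier(x, w=(2.0, -1.0), b=-1.5):
    """Toy logistic classifier h(x); the weights are made up."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))


def encode(x):
    """Toy stand-in for the encoder e: X -> Z."""
    return [2.0 * (xi - 0.5) for xi in x]


def decode(z):
    """Toy stand-in for the decoder g: Z -> X (inverse of `encode`)."""
    return [0.5 * zi + 0.5 for zi in z]


def latent_random_search(x, n_search_samples=100, step=0.1, max_iter=1000, seed=0):
    rng = random.Random(seed)
    z = encode(x)
    low, high = 0.0, step
    for _ in range(max_iter):
        best, best_dist = None, float("inf")
        for _ in range(n_search_samples):
            # Sample a perturbation delta_z whose norm lies in [low, high].
            direction = [rng.gauss(0.0, 1.0) for _ in z]
            norm = math.sqrt(sum(d * d for d in direction)) or 1.0
            radius = rng.uniform(low, high)
            candidate = decode([zi + radius * d / norm for zi, d in zip(z, direction)])
            if classifier(candidate) >= 0.5:
                # Among the valid candidates keep the one closest to x (l1 cost).
                dist = sum(abs(ci - xi) for ci, xi in zip(candidate, x))
                if dist < best_dist:
                    best, best_dist = candidate, dist
        if best is not None:
            return best
        low, high = high, high + step  # nothing found: grow the search shell
    return None


x = [0.2, 0.5]
x_cf = latent_random_search(x)
print(classifier(x), x_cf)
```

Because every candidate is produced by the decoder, the counterfactual stays on the data manifold learned by the generative model, which is the main motivation for searching in $\mathcal{Z}$ rather than in $\mathcal{X}$.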

In [10]:
hyperparams = {
    "data_name": dataset.name,
    "n_search_samples": 100,
    "p_norm": 1,
    "step": 0.1,
    "max_iter": 1000,
    "clamp": True,
    "binary_cat_features": False,
    "vae_params": {
        "layers": [len(ml_model.feature_input_order), 512, 256, 8],
        "train": True,
        "lambda_reg": 1e-6,
        "epochs": 5,
        "lr": 1e-3,
        "batch_size": 32,
    },
}

cchvae = recourse_catalog.CCHVAE(ml_model, hyperparams)
df_cfs = cchvae.get_counterfactuals(test_factual)

display(df_cfs)

[INFO] Start training of Variational Autoencoder... [models.py fit]
[INFO] [Epoch: 0/5] [objective: 0.381] [models.py fit]
[INFO] [ELBO train: 0.38] [models.py fit]
[INFO] [ELBO train: 0.14] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] ... finished training of Variational Autoencoder. [models.py fit]


Unnamed: 0,age,capital-gain,capital-loss,education-num,fnlwgt,...,race_White,relationship_Non-Husband,sex_Male,workclass_Private,income
0,0.296009,0.036202,0.039718,0.601063,0.12027,...,1.0,0.0,1.0,0.73757,1.0
2,0.296006,0.036202,0.039718,0.601062,0.12027,...,1.0,0.0,1.0,0.737569,1.0
3,0.296006,0.036203,0.039718,0.601061,0.12027,...,1.0,0.0,1.0,0.737569,1.0
4,0.295999,0.036202,0.039718,0.601063,0.12027,...,1.0,0.0,1.0,0.737569,1.0
6,0.29601,0.036202,0.039718,0.601064,0.12027,...,1.0,0.0,1.0,0.73757,1.0


### FOCUS by Lucic et al (2021) (tree method)

Our library also supports sklearn and xgboost tree-based classifiers such as *Random Forests*, *Decision Trees*, or *Gradient Boosted Decision Trees*.
These classifiers are needed for methods that explicitly require tree models (e.g., FeatureTweak and FOCUS).

In [5]:
ml_model = MLModelCatalog(dataset, "forest", backend="sklearn", load_online=False)
ml_model.train(max_depth=2, n_estimators=5, force_train=True)

factuals = predict_negative_instances(ml_model, dataset.df)
test_factual = factuals.iloc[:5]

display(test_factual)

balance on test set 0.2406618610747051, balance on test set 0.23533748361730014
model fitted with training score 0.8020150720838795 and test score 0.8075032765399738


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.30137,0.044131,0.8,0.02174,0.0,...,0.0,1.0,1.0,1.0,0.0
1,0.452055,0.048052,0.8,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.4,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0
4,0.150685,0.220635,0.8,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0


In [7]:
from carla.models.api import MLModel
import xgboost

In [10]:
class XGBoostModel(MLModel):
    """The default way of implementing XGBoost
    https://xgboost.readthedocs.io/en/latest/python/python_intro.html"""

    def __init__(self, data):
        super().__init__(data)

        # get preprocessed data
        df_train = self.data.df_train
        df_test = self.data.df_test

        x_train = df_train[self.data.continuous]
        y_train = df_train[self.data.target]
        x_test = df_test[self.data.continuous]
        y_test = df_test[self.data.target]

        self._feature_input_order = self.data.continuous

        param = {
            "max_depth": 2,  # determines how deep the tree can go
            "objective": "binary:logistic",  # determines the loss function
            "n_estimators": 5,
        }
        self._mymodel = xgboost.XGBClassifier(**param)
        self._mymodel.fit(
            x_train,
            y_train,
            eval_set=[(x_train, y_train), (x_test, y_test)],
            eval_metric="logloss",
            verbose=True,
        )

    @property
    def feature_input_order(self):
        # List of the feature order the ml model was trained on
        return self._feature_input_order

    @property
    def backend(self):
        # The ML framework the model was trained on
        return "xgboost"

    @property
    def raw_model(self):
        # The black-box model object
        return self._mymodel

    @property
    def tree_iterator(self):
        # make a copy of the trees, else feature names are not saved
        booster_it = [booster for booster in self.raw_model.get_booster()]
        # set the feature names
        for booster in booster_it:
            booster.feature_names = self.feature_input_order
        return booster_it

    # The predict method outputs
    # the predicted class label for each row
    def predict(self, x):
        return self._mymodel.predict(self.get_ordered_features(x))

    # The predict_proba method outputs
    # the prediction as class probabilities
    def predict_proba(self, x):
        return self._mymodel.predict_proba(self.get_ordered_features(x))

In [11]:
ml_model = XGBoostModel(dataset)

factuals = predict_negative_instances(ml_model, dataset.df)
test_factual = factuals.iloc[:5]

display(test_factual)

[0]	validation_0-logloss:0.58413	validation_1-logloss:0.58327
[1]	validation_0-logloss:0.52405	validation_1-logloss:0.52332
[2]	validation_0-logloss:0.48522	validation_1-logloss:0.48436
[3]	validation_0-logloss:0.45917	validation_1-logloss:0.45862
[4]	validation_0-logloss:0.44013	validation_1-logloss:0.43889


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.30137,0.044131,0.8,0.02174,0.0,...,0.0,1.0,1.0,1.0,0.0
1,0.452055,0.048052,0.8,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.4,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0
4,0.150685,0.220635,0.8,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0


Below we start generating counterfactuals using FOCUS.

In [12]:
hyperparams = {
    "optimizer": "adam",
    "lr": 0.001,
    "n_class": 2,
    "n_iter": 1000,
    "sigma": 1.0,
    "temperature": 1.0,
    "distance_weight": 0.01,
    "distance_func": "l1",
}

focus = recourse_catalog.FOCUS(ml_model, hyperparams)
df_cfs = focus.get_counterfactuals(test_factual)
display(df_cfs)



Unnamed: 0,age,fnlwgt,education-num,capital-gain,hours-per-week,capital-loss
0,0.301348,0.044131,0.799989,0.051228,0.398055,0.0
1,0.452092,0.048052,0.800043,0.051198,0.122474,7.3e-05
2,0.287735,0.137581,0.533309,0.051196,0.39797,0.0
3,0.493114,0.150486,0.40008,0.05123,0.397947,0.0
4,0.158401,0.220635,0.800588,0.051254,0.397995,0.0
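Gradient-based optimizers such as the "adam" setting above need a differentiable model, whereas the splits of a tree are hard thresholds. A common way around this, and to our understanding the role of the `sigma` hyperparameter, is to replace each hard split with a sigmoid. The following is a minimal sketch of that relaxation for a single split; it assumes that larger `sigma` values give a sharper approximation and does not reproduce the FOCUS implementation. `hard_split` and `soft_split` are hypothetical names.

```python
import math


def hard_split(x, threshold):
    """Indicator of the left branch of a tree node: 1 if x <= threshold, else 0."""
    return 1.0 if x <= threshold else 0.0


def soft_split(x, threshold, sigma=1.0):
    """Sigmoid relaxation of the same split; differentiable in x."""
    return 1.0 / (1.0 + math.exp(-sigma * (threshold - x)))


# The relaxation approaches the hard split as sigma grows.
for sigma in (1.0, 10.0, 100.0):
    print(sigma, hard_split(0.45, 0.5), round(soft_split(0.45, 0.5, sigma), 3))
```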
