Hello! Thank you for checking out our tool.

The purpose of this demo is to demonstrate some of the basics by generating a flipset for one individual. Along the way, we'll show:

1. How to use the ActionSet interface to specify immutable variables and variables with custom ranges.
2. How to use a model to align an ActionSet.
3. How to use the RecourseBuilder interface to find out whether one person has recourse.

The recourse problem can be solved with either CPLEX or CBC, and the formulation is the same for both. The cells below pass `optimizer="cbc"`. To install either package, see the [README](https://github.com/ustunb/actionable-recourse/blob/master/README.md).

In [16]:
import os
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from recourse.builder import RecourseBuilder
from recourse.builder import ActionSet


# Read in the data

To start out, we need to read in and process a dataset. We'll read in the `german` credit reporting dataset and do some light processing. You can find more information about this dataset [here](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). A processed version of this dataset is included in our repository for the purposes of this demo.

We convert categorical columns to binary where possible and drop other more complicated categorical columns. In the near future, we will be able to support auditing over these features as well.

In [17]:
data_dir = "../data/2_1_experiment_1/"
data_name = 'german'
data_file = os.path.join(data_dir, '%s_processed.csv' % data_name)
## load and process data
german_df = pd.read_csv(data_file).reset_index(drop=True)

german_df = (german_df
             .assign(isMale=lambda df: (df['Gender']=='Male').astype(int))
             .drop(['PurposeOfLoan', 'Gender', 'OtherLoansAtStore'], axis=1)
            )

y = german_df['GoodCustomer']
X = german_df.drop('GoodCustomer', axis=1)

pd.set_option('display.max_columns', None)
display(X)


Unnamed: 0,ForeignWorker,Single,Age,LoanDuration,LoanAmount,LoanRateAsPercentOfIncome,YearsAtCurrentHome,NumberOfOtherLoansAtBank,NumberOfLiableIndividuals,HasTelephone,CheckingAccountBalance_geq_0,CheckingAccountBalance_geq_200,SavingsAccountBalance_geq_100,SavingsAccountBalance_geq_500,MissedPayments,NoCurrentLoan,CriticalAccountOrLoansElsewhere,OtherLoansAtBank,HasCoapplicant,HasGuarantor,OwnsHouse,RentsHouse,Unemployed,YearsAtCurrentJob_lt_1,YearsAtCurrentJob_geq_4,JobClassIsSkilled,isMale
0,0,1,67,6,1169,4,4,2,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,1,1
1,0,0,22,48,5951,2,2,1,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0
2,0,1,49,12,2096,2,3,1,2,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,1
3,0,1,45,42,7882,2,4,1,2,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,1
4,0,1,53,24,4870,3,4,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,31,12,1736,3,4,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0
996,0,0,40,30,3857,4,4,1,1,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,1
997,0,1,38,12,804,4,4,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1
998,0,1,23,45,1845,4,4,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1


Now let's set up our action set.

In [18]:
action_set = ActionSet(X = X)
# action_set['Age'].mutable = False
# action_set['Single'].mutable = False
# action_set['JobClassIsSkilled'].mutable = False
# action_set['ForeignWorker'].mutable = False
# action_set['OwnsHouse'].mutable = False
# action_set['RentsHouse'].mutable = False
# action_set['isMale'].mutable = False

Ok, great, we've instantiated an `ActionSet`. The commented-out lines show how we would mark the features that we'd like to call immutable; since they are disabled in this run, every feature is still mutable. Let's check what we've got so far:

In [19]:
action_set

+---------------------------------+---------------+---------+------------+----------------+----------------+-----------+-----------+-----------+-------+---------+
|                            name | variable type | mutable | actionable | step direction | flip direction | grid size | step type | step size |    lb |      ub |
+---------------------------------+---------------+---------+------------+----------------+----------------+-----------+-----------+-----------+-------+---------+
|                   ForeignWorker | <class 'int'> |    True |       True |              0 |            nan |         2 |  relative |      0.01 |   0.0 |     1.0 |
|                          Single | <class 'int'> |    True |       True |              0 |            nan |         2 |  relative |      0.01 |   0.0 |     1.0 |
|                             Age | <class 'int'> |    True |       True |              0 |            nan |        49 |  relative |      0.01 |  20.0 |    68.0 |
|                    L

There's a lot of stuff baked in here. Let's walk through some of it.

* You can see that every feature has a `lb` and a `ub`. These specify the lower and upper bounds for that feature, which we derived from the percentiles of that feature's values in the dataset.
* All the features have defaulted to `step_type=relative` and `step_size=.01`. This means that when we consider potential actions, we take steps of size $1\%$ through the feasible space. So, for example, if you observe in the dataset that `LoanAmount` ranges from $\$100$ to $\$200$, your actions would be: $\$100, \$101, \$102, \ldots, \$200$. The other `step_type` values are `absolute` and `percentile`. `absolute` means you can specify any positive real number, and we'll step by that amount. `percentile` means we'll consider one action to be a shift of `step_size` percentiles.
* This leads us naturally into the `action_grid`. This is a grid we set up based on `step_size` and `step_type`. We've already set up grids for these features. These will discretize continuous variables for you. Check out our paper to see how this is optimal and equivalent to treating these features as continuous. Plus, it makes your wait time a lot shorter.

See how much time and processing you save just by passing in the dataset? Told you it was worth it.
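To make the `relative` step type concrete, here is a minimal sketch of how such a grid could be built. It assumes that a relative step equals `step_size` times the width of the bounds, which matches the `LoanAmount` example above. `relative_grid` is a hypothetical helper written for this illustration, not a function from the library.

```python
def relative_grid(lb, ub, step_size):
    """Hypothetical helper: evenly spaced values from lb to ub.

    Assumes one 'relative' step is step_size * (ub - lb).
    """
    step = step_size * (ub - lb)
    n_steps = round((ub - lb) / step)  # number of steps between the bounds
    return [lb + i * step for i in range(n_steps + 1)]


# LoanAmount example from the text: bounds of 100 and 200, 1% steps.
grid = relative_grid(lb=100.0, ub=200.0, step_size=0.01)
print(len(grid), grid[:3], grid[-1])
```

With these bounds, each step is one dollar, so the grid covers every dollar amount between the two bounds.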

But ok, in the real world you might want to change these defaults. Let's see how to do that. We'll take `LoanDuration`. These are what some of the attributes look like:

In [20]:
action_set['LoanDuration']
## default str shows (lb, ub)

'LoanDuration': (6.0, 60.0)

In [21]:
action_set['LoanDuration'].step_size

0.01

To change `step_size`, we can simply set another value. UCI tells us that this is measured in months. Let's say we only want to consider 6-month intervals.

In [22]:
# action_set['LoanDuration'].step_size = 6

Whoops, running the line above (commented out here) raises an assertion error. Let's make sure to change the `step_type` first.

In [23]:
action_set['LoanDuration'].step_type = "absolute"
action_set['LoanDuration'].step_size = 6

Great. Now let's look at the bounds:

In [24]:
action_set['LoanDuration']

'LoanDuration': (6.0, 60.0)

To set bounds:

In [25]:
action_set['LoanDuration'].bounds = (1, 100)

In [26]:
action_set['LoanDuration']

'LoanDuration': (1.0, 100.0)
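As a rough illustration of what an `absolute` step of 6 months implies, the sketch below lists candidate loan durations under the original bounds and under the widened ones. It assumes the grid starts at the lower bound and never exceeds the upper bound; how the library actually anchors the grid is not shown in this demo, and `absolute_grid` is a hypothetical helper.

```python
def absolute_grid(lb, ub, step_size):
    """Hypothetical helper: lb, lb + step_size, ... while staying <= ub."""
    values = []
    value = lb
    while value <= ub:
        values.append(value)
        value += step_size
    return values


months_default = absolute_grid(6, 60, 6)   # bounds before the change
months_widened = absolute_grid(1, 100, 6)  # bounds after the change
print(months_default)
print(months_widened)
```

Under this assumption, widening the bounds also shifts every grid point, which is worth keeping in mind when the step size is meant to reflect a natural unit such as half a year.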

# Train model

Ok great, now let's get into the meat of it. Let's train up a model and see what recourse exists.

In [27]:
## grid search
clf = LogisticRegression(max_iter=1000, solver='lbfgs')
grid = GridSearchCV(
  clf, param_grid={'C': np.logspace(-4, 3)},
  cv=10,
  scoring='roc_auc',
  return_train_score=True
)
grid.fit(X, y)
clf = grid.best_estimator_
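For reference, `np.logspace(-4, 3)` uses NumPy's default of 50 points, so the search above tries 50 values of `C` spaced evenly on a log scale between $10^{-4}$ and $10^{3}$. The pure-Python sketch below reproduces that grid; `log_spaced` is a hypothetical name used only here.

```python
def log_spaced(start_exp, stop_exp, num=50):
    """Pure-Python equivalent of np.logspace(start_exp, stop_exp, num)."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10.0 ** (start_exp + i * step) for i in range(num)]


c_values = log_spaced(-4, 3)
print(len(c_values), c_values[0], c_values[-1])
```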

In [28]:
coefficients = clf.coef_[0]
intercept = clf.intercept_[0]

Ok cool, now that we have our model, we can do some exciting stuff. First things first, we need to align our `ActionSet`. What this means is that our `ActionSet` didn't know yet which directions were the _right_ directions to step in. The coefficients will tell it that.

In [29]:
action_set.align(coefficients=coefficients)
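Conceptually, alignment can be thought of as recording the sign of each coefficient as the direction that raises the score. The sketch below is an illustration under that assumption, not the library's implementation; `flip_directions` is a hypothetical helper, and the three coefficient values are copied from the coefficient table printed in this demo.

```python
def flip_directions(coefficients):
    """Hypothetical stand-in for alignment: +1, -1, or 0 by coefficient sign."""
    return {name: (w > 0) - (w < 0) for name, w in coefficients.items()}


coefs = {
    "LoanDuration": -0.024855,  # a longer loan lowers the score
    "LoanAmount": -7.1e-05,     # a larger loan lowers the score
    "HasTelephone": 0.316835,   # having a telephone raises the score
}
directions = flip_directions(coefs)
print(directions)
```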

Hold on, let's do a quick check. Sometimes, coefficients can be misleading for a host of reasons:
* A confounder can switch the direction.
* A collinear feature could negate or even flip the direction.

In [30]:
print(intercept)
display(pd.Series(coefficients, index=X.columns).to_frame('Coefficients'))

0.4045801917840976


Unnamed: 0,Coefficients
ForeignWorker,0.236718
Single,0.189415
Age,0.016919
LoanDuration,-0.024855
LoanAmount,-7.1e-05
LoanRateAsPercentOfIncome,-0.219886
YearsAtCurrentHome,0.055598
NumberOfOtherLoansAtBank,-0.107838
NumberOfLiableIndividuals,0.020561
HasTelephone,0.316835


Hmmm that's odd. `CriticalAccountOrLoansElsewhere` has a positive coefficient, meaning that opening more accounts elsewhere will _help_ someone flip their prediction. It very well might be predictive within the confines of this model, but we're looking for more than predictive power here. We're not quite looking for causality, but we certainly don't want to actively encourage someone to do something that we know a priori will be bad for them.

When we notice weird stuff like that, we can easily hop into the interface and change it. Below, we restrict `CriticalAccountOrLoansElsewhere` to step down only and `CheckingAccountBalance_geq_0` to step up only.

In [31]:
action_set['CriticalAccountOrLoansElsewhere'].step_direction = -1
action_set['CheckingAccountBalance_geq_0'].step_direction = 1
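To see what a step direction does to the set of candidate actions, here is a small sketch. It assumes that `+1` allows only increases, `-1` only decreases, and `0` both, and that "no change" is always allowed. `feasible_actions` is a hypothetical helper, not part of the library.

```python
def feasible_actions(grid, x, step_direction):
    """Hypothetical helper: changes (grid value - x) allowed by step_direction."""
    actions = [value - x for value in grid]
    if step_direction > 0:
        actions = [a for a in actions if a >= 0]
    elif step_direction < 0:
        actions = [a for a in actions if a <= 0]
    # The zero action (leave the feature alone) is always available.
    return sorted(set(actions) | {0})


binary_grid = [0, 1]
print(feasible_actions(binary_grid, x=1, step_direction=-1))  # may drop to 0
print(feasible_actions(binary_grid, x=0, step_direction=-1))  # nothing to lower
print(feasible_actions(binary_grid, x=0, step_direction=1))   # may rise to 1
```

Under this reading, someone whose `CriticalAccountOrLoansElsewhere` is already 0 simply has no action available on that feature once its direction is set to -1.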

# Generate Recourse

First, let's score everyone using our model. Let's say that we will give loans to anyone with at least an $80\%$ chance of paying it back.

In [32]:
scores = pd.Series(clf.predict_proba(X)[:, 1])

In [33]:
scores.loc[lambda s: s<.8].head()

1    0.360888
3    0.642839
4    0.565258
5    0.497203
7    0.500728
dtype: float64

In [34]:
scores.loc[lambda s: s<.8].shape

(691,)

That's quite a few people in our dataset that won't be able to qualify for a loan. Let's see what one of them can do.

In [35]:
denied_individuals = scores.loc[lambda s: s < .8].index

In [36]:
x = X.values[denied_individuals[0]]

## shift the decision boundary from p = 0.5 to p = 0.8 through the intercept
p = .8
print(coefficients)
rb = RecourseBuilder(
      optimizer="cbc",
      coefficients=coefficients,
      intercept=intercept - (np.log(p / (1. - p))),
      action_set=action_set,
      x=x
)

[ 2.36717677e-01  1.89414961e-01  1.69186891e-02 -2.48551941e-02
 -7.12020114e-05 -2.19885513e-01  5.55980597e-02 -1.07837909e-01
  2.05607407e-02  3.16835035e-01 -2.64473205e-01  1.85391395e-01
  3.41770702e-01  3.99607852e-01  1.78215063e-01 -4.39578679e-01
  6.50446480e-01 -4.71776744e-01 -1.80014706e-01  2.69615672e-01
  5.74868711e-01  6.45968443e-02 -1.29616019e-01 -1.62587451e-01
  3.46826876e-01  1.95709952e-01  1.31230489e-01]

The cell above selects CBC with `optimizer="cbc"`, so you can run it even if you don't have CPLEX.

A quick note: our decision boundary is 0 by default, which corresponds to a predicted probability of $0.5$. We shift it by tweaking the intercept. Since we used logistic regression, we use the trick above to do that: we subtract $\log(p / (1 - p))$ from the intercept. In future iterations, we will provide a more elegant way of doing this.
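The reason the intercept shift works is that the logistic function is monotone, so $\sigma(w^\top x + b) \ge p$ holds exactly when $w^\top x + b - \log\frac{p}{1-p} \ge 0$. The sketch below checks this numerically. The intercept is the value printed earlier in this demo; the linear scores in the loop are made-up values chosen only to exercise both sides of the boundary.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Logistic function, the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))


intercept = 0.4045801917840976  # printed earlier in this demo
p = 0.8
shifted_intercept = intercept - logit(p)

# Hypothetical values of w.x: both decision rules should always agree.
for wx in (-2.0, 0.0, 0.9, 1.0, 3.0):
    approved_by_probability = sigmoid(wx + intercept) >= p
    approved_by_shifted_score = wx + shifted_intercept >= 0
    assert approved_by_probability == approved_by_shifted_score

print(shifted_intercept)
```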

In [37]:
output_1 = rb.fit()
output_1


Ok, great, we have a solution! This individual has recourse. The first thing of interest to us is the total cost of all the actions needed to flip their prediction. It costs this person $.21$, meaning that the sum of percentile shifts across this person's features is $.21$. That's quite a lot. Imagine having to shift that much relative to a population. Let's check out what this means in terms of actions:

In [None]:
pd.Series(output_1['actions'], index=X.columns).to_frame('Actions')
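As an aside on the cost mentioned above: the text describes it as a sum of percentile shifts. The sketch below shows one way such a cost could be computed, using an empirical CDF over observed values. This is an illustration under that reading, and the library's actual cost function may differ. `percentile_shift_cost` and the toy data are made up for this example.

```python
from bisect import bisect_right


def percentile(sorted_values, v):
    """Fraction of observed values that are <= v (empirical CDF)."""
    return bisect_right(sorted_values, v) / len(sorted_values)


def percentile_shift_cost(x, actions, observed):
    """Sum over features of |percentile(x + a) - percentile(x)|."""
    total = 0.0
    for name, a in actions.items():
        values = sorted(observed[name])
        before = percentile(values, x[name])
        after = percentile(values, x[name] + a)
        total += abs(after - before)
    return total


# Toy data, made up for illustration only.
observed = {"LoanDuration": [6, 12, 12, 24, 24, 36, 48, 48, 60, 60]}
x = {"LoanDuration": 48}
actions = {"LoanDuration": -24}
cost = percentile_shift_cost(x, actions, observed)
print(round(cost, 2))
```

In the toy data, moving from 48 to 24 months takes this person from the 80th to the 50th percentile of loan durations, so the action costs roughly 0.3.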

Ok, so let's read this. 

* `SavingsAccountBalance_geq_100`$=1$, for example. This is a binary feature that is currently $0$ for this person, so the action can only be $1$. This also means that we're encouraging this person to increase their savings.
* `LoanDuration`$=20$. This, if we recall, is the duration of the loan in months. This means we're encouraging this person to reapply but specify a loan repayment period that is 20 months shorter.

Let's check if these two actions make sense in the context of this person:

In [None]:
X.loc[denied_individuals[0]].to_frame("Original Features")

Ok, this person originally applied with no savings and with a 4-year repayment period. So asking them to build up savings and decrease their loan repayment period by $20$ months makes sense.

(Let's leave aside the question of mutually exclusive features (e.g., `SavingsAccountBalance_geq_100` $=0$, `SavingsAccountBalance_geq_500` $=1$). We'll get back to that in later releases.)

Let's close by noting some things:

* Immutable features are __not__ changed. That's good. That's recourse.
* The changes make sense, at least directionally. We'd encourage this person to get a guarantor, to decrease their loan amount, and to decrease their loan period, among other changes.

Yes, these might be hard for someone. They might have other reasons for immutability that we're not considering. Maybe they _need_ that amount and cannot change. Ok, let's express that:

In [None]:
action_set['LoanAmount'].mutable = False

In [None]:
x = X.values[denied_individuals[0]]

print(coefficients)
print(intercept)
p = .8
rb = RecourseBuilder(
      optimizer="cbc",
      coefficients=coefficients,
      intercept=intercept - (np.log(p / (1. - p))),
      action_set=action_set,
      x=x
)

In [None]:
output_2 = rb.fit()
output_2

Ok, so their total cost actually didn't change, which is nice. Let's take a look at their new actions:

In [None]:
pd.Series(output_2['actions'], index=X.columns).to_frame("New Actions")

Ok, by decreasing their repayment period by a bit more and changing some other features, this person can still ask for the same amount. That's good.

The magical thing about both of these action sets is that this person, if they do this, _will_ qualify for a loan. Let's check that:

In [None]:
clf.predict_proba([X.loc[denied_individuals[0]] + pd.Series(output_1['actions'], index=X.columns)])[:, 1]

In [None]:
clf.predict_proba([X.loc[denied_individuals[0]] + pd.Series(output_2['actions'], index=X.columns)])[:, 1]

And there we have it. By making these tweaks, this person has two ways to get over the $.8$ threshold that we've set. This person can now get approved under this model.