Hello! Thank you for checking out our tool.

The purpose of this demo is demonstrate some of the basics. In doing so, we will generate a flipset for one individual. In doing so, we'll show:

1. How to use the ActionSet interface to specify immutable variables and variables with custom ranges.
2. How to use a model to align an ActionSet
3. How to use the RecourseBuilder interface to find the feasibility of one person.

We'll work using CPLEX. The problem is equivalent for CBC. To install either package, read [here](https://github.com/ustunb/actionable-recourse/blob/master/README.md).

In [15]:
import os
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import sklearn
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from recourse.builder import RecourseBuilder
from recourse.builder import ActionSet
from recourse.flipset import Flipset

data_dir = "../data/2_1_experiment_1/"
pd.set_option('display.max_columns', None)


In [16]:
from sklearn.preprocessing import LabelEncoder

def ohe(data, categorical_names, encoder, columns = []):
    if isinstance(data, pd.DataFrame):
        df_data = data.copy()
    else:
        if len(columns) == 0:
            raise ValueError('Need to supply columns to make a pandas dataframe')
        df_data = pd.DataFrame(data, columns = columns)        
    transformed = encoder.transform(df_data[categorical_names])
    df_transformed = (pd.DataFrame(transformed, columns = encoder.get_feature_names(input_features = categorical_names)))
    return pd.concat([df_data.reset_index(drop = True), df_transformed], axis=1).drop(categorical_names, axis=1)

def un_ohe(ohe_data, categorical_names, encoder, columns):
    ohe_categorical_columns = encoder.get_feature_names(input_features = categorical_names)
    if len(columns) == 0:
            raise ValueError('Need to supply columns to make a pandas dataframe')
    if isinstance(ohe_data, pd.DataFrame):
        df_ohe_data = ohe_data.copy()
    else:
        df_ohe_data = pd.DataFrame(ohe_data, columns = columns)        
    untransformed = encoder.inverse_transform(df_ohe_data[ohe_categorical_columns])
    untransformed_df = pd.DataFrame(untransformed, columns = categorical_names)
    to_return = pd.concat([df_ohe_data.reset_index(drop = True), untransformed_df], axis=1).drop(ohe_categorical_columns, axis=1)
    return to_return[columns]

# #NEED TO FINISH
# def ohe_coefficients(w, x, columns, enc, categorical_names, ohe_categorical_columns):
#     new_coefficients = []
#     categories_mapping = {}
#     ohe_columns = ohe(x, categorical_names, enc, columns = columns).columns
    
#     for feat_idx, feat in enumerate(categorical_names):
#         categories_mapping[feat] = compas_enc.categories_[feat_idx]
    
#     for feat in ohe_columns:
#         if "_"
#         new_coefficients.append
#     display(compas_enc.categories_)


def get_label_encoders(categorical_names, data):
    return_data = data.copy()
    categorical_encoders = {}

    for cf in categorical_names:
        le = LabelEncoder()
        le.fit(data[cf])
        return_data[cf] = le.transform(return_data[cf])
        categorical_encoders[cf] = le
    
    return categorical_encoders, return_data

# COMPAS dataset

In [17]:
data_name = "compas-scores-two-years"
data_file = os.path.join(data_dir, '%s.csv' % data_name)
## load and process data
compas_df = pd.read_csv(data_file).reset_index(drop=True)

# filter according to https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb and https://github.com/dylan-slack/Fooling-LIME-SHAP/blob/e763fdea8242f4f3a5955951161c69f573db624d/get_data.py#L5
# compas_df = compas_df.loc[(compas_df['days_b_screening_arrest'] <= 30) & \
#                           (compas_df['days_b_screening_arrest'] >= -30) & \
#                           (compas_df['is_recid'] != -1) & \
#                           (compas_df['c_charge_degree'] != "O") & \
#                           (compas_df['score_text'] != "NA")]

# cols_with_missing_values = []
# for col in compas_df.columns:
#     if len(np.where(compas_df[col].values == '?')[0]) >= 1 or compas_df[col].isnull().values.any():
#         cols_with_missing_values.append(col)    
# compas_df = compas_df.drop(cols_with_missing_values, axis=1)

compas_df['length_of_stay'] = (pd.to_datetime(compas_df['c_jail_out']) - pd.to_datetime(compas_df['c_jail_in'])).dt.days

compas_df = compas_df[['age', 'two_year_recid','c_charge_degree', 'race', 'sex', 'priors_count', 'length_of_stay', 'score_text']]
compas_df = compas_df.dropna()

compas_X = compas_df.drop('score_text', axis=1)

# compas_df = (compas_df
#              .drop(['id', 'name', 'first', 'last', 'dob', 'compas_screening_date', \
#                     'screening_date', 'v_type_of_assessment', 'type_of_assessment', 'v_screening_date'], axis=1)
#             )

# compas_df = pd.get_dummies(compas_df, columns=['sex']).drop(['sex_Female'], axis=1)
# compas_df = pd.get_dummies(compas_df, columns=['age_cat'])
# compas_df = pd.get_dummies(compas_df, columns=['score_text'])
# compas_df = pd.get_dummies(compas_df, columns=['v_score_text'])
# compas_df = pd.get_dummies(compas_df, columns=['c_charge_degree']).drop(['c_charge_degree_F'], axis=1)

compas_y = pd.Series(np.array([-1 if score == 'High' else 1 for score in compas_df['score_text']]))
# compas_X = compas_df.drop('score_text', axis=1)


# CATEGORICAL FEATURES
compas_categorical_features = [1, 2, 3, 4]

compas_X = compas_X.reset_index(drop = True)
compas_y = compas_y.reset_index(drop = True)

columns = compas_X.columns
compas_categorical_names = [columns[i] for i in compas_categorical_features] 


# ENCODE NUMERICALLY
compas_label_encoders, compas_X = (get_label_encoders(compas_categorical_names, compas_X))

# CREATE ENCODER
from sklearn.preprocessing import OneHotEncoder
compas_enc = OneHotEncoder(sparse = False, handle_unknown='error')
compas_enc.fit(compas_X[compas_categorical_names])

display(compas_X)
display(ohe(compas_X, compas_categorical_names, compas_enc))
display(un_ohe(ohe(compas_X, compas_categorical_names, compas_enc), compas_categorical_names, compas_enc, columns = compas_X.columns))


Unnamed: 0,age,two_year_recid,c_charge_degree,race,sex,priors_count,length_of_stay
0,69,0,0,5,1,0,0.0
1,34,1,0,0,1,0,10.0
2,24,1,0,0,1,4,1.0
3,44,0,1,5,1,0,1.0
4,41,1,0,2,1,14,6.0
...,...,...,...,...,...,...,...
6902,23,0,0,0,1,0,1.0
6903,23,0,0,0,1,0,1.0
6904,57,0,0,5,1,0,1.0
6905,33,0,1,0,0,3,1.0


Unnamed: 0,age,priors_count,length_of_stay,two_year_recid_0,two_year_recid_1,c_charge_degree_0,c_charge_degree_1,race_0,race_1,race_2,race_3,race_4,race_5,sex_0,sex_1
0,69,0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,34,0,10.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,24,4,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,44,0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,41,14,6.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6902,23,0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6903,23,0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6904,57,0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
6905,33,3,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Unnamed: 0,age,two_year_recid,c_charge_degree,race,sex,priors_count,length_of_stay
0,69,0,0,5,1,0,0.0
1,34,1,0,0,1,0,10.0
2,24,1,0,0,1,4,1.0
3,44,0,1,5,1,0,1.0
4,41,1,0,2,1,14,6.0
...,...,...,...,...,...,...,...
6902,23,0,0,0,1,0,1.0
6903,23,0,0,0,1,0,1.0
6904,57,0,0,5,1,0,1.0
6905,33,0,1,0,0,3,1.0


# German Credit dataset

In [18]:
data_name = 'german_processed'
data_file = os.path.join(data_dir, '%s.csv' % data_name)
## load and process data
german_df = pd.read_csv(data_file).reset_index(drop=True)

german_df = (german_df
             .assign(isMale=lambda df: (df['Gender']=='Male').astype(int))
             .drop(['PurposeOfLoan', 'Gender', 'OtherLoansAtStore'], axis=1)
            )

german_y = german_df['GoodCustomer']
german_X = german_df.drop('GoodCustomer', axis=1)

german_categorical_features = [0, 1, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
columns = german_X.columns
german_categorical_names = [columns[i] for i in german_categorical_features] 

# Adult dataset

In [19]:
data_name = "adult"
data_file = os.path.join(data_dir, '%s.csv' % data_name)
## load and process data
adult_df = pd.read_csv(data_file).reset_index(drop=True)
adult_df.columns = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex',\
                                          'capital-gain','capital-loss','hours-per-week','native-country','label']

cols_with_missing_values = []
for col in adult_df.columns:
    if len(np.where(adult_df[col].values == '?')[0]) >= 1 or adult_df[col].isnull().values.any():
        cols_with_missing_values.append(col)    

adult_df = adult_df.drop(cols_with_missing_values, axis=1)

adult_df['Married'] = adult_df.apply(lambda row: 1 if 'Married' in row['marital-status'] else 0, axis=1)
adult_df['Widowed'] = adult_df.apply(lambda row: 1 if 'Widowed' in row['marital-status'] else 0, axis=1)
adult_df['NeverMarried'] = adult_df.apply(lambda row: 1 if 'Never-married' in row['marital-status'] else 0, axis=1)

adult_df['workclass_gov'] = adult_df.apply(lambda row: 1 if 'gov' in row['workclass'] else 0, axis=1)
adult_df['workclass_private'] = adult_df.apply(lambda row: 1 if 'Private' in row['workclass'] else 0, axis=1)
adult_df['workclass_self-emp'] = adult_df.apply(lambda row: 1 if 'Self-emp' in row['workclass'] else 0, axis=1)
# adult_df['workclass_never-worked'] = adult_df.apply(lambda row: 1 if 'Never-worked' in row['workclass'] else 0, axis=1)

adult_df['White'] = adult_df.apply(lambda row: 1 if 'White' in row['race'] else 0, axis=1)

# adult_df = pd.get_dummies(adult_df, columns=['race'])
adult_df = pd.get_dummies(adult_df, columns=['sex'])

adult_df = adult_df.drop(['education', 'occupation', 'native-country', \
                          'relationship'], axis=1)
adult_df = adult_df.drop(['sex_ Female', 'race'], axis=1)
adult_df = adult_df.drop(['marital-status', 'workclass', 'fnlwgt'], axis=1)

adult_df.columns = adult_df.columns.str.replace(' ', '')

adult_X = adult_df.drop('label', axis=1)
adult_y = adult_df['label'].replace(' <=50K', -1)
adult_y = adult_y.replace(' >50K', 1)

for col in adult_X.columns:
    print(col)
    print(adult_X[col].value_counts())

adult_categorical_features = [5, 6, 7, 8, 9, 10, 11, 12]
columns = adult_X.columns
print(columns)
adult_categorical_names = [columns[i] for i in adult_categorical_features] 


  if __name__ == '__main__':


age
36    898
31    888
34    886
23    877
35    876
     ... 
83      6
85      3
88      3
87      1
86      1
Name: age, Length: 73, dtype: int64
education-num
9     10501
10     7291
13     5354
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: education-num, dtype: int64
capital-gain
0        29849
15024      347
7688       284
7298       246
99999      159
         ...  
4931         1
1455         1
6097         1
22040        1
1111         1
Name: capital-gain, Length: 119, dtype: int64
capital-loss
0       31041
1902      202
1977      168
1887      159
1848       51
        ...  
1411        1
1539        1
2472        1
1944        1
2201        1
Name: capital-loss, Length: 92, dtype: int64
hours-per-week
40    15216
50     2819
45     1824
60     1475
35     1297
      ...  
92        1
94        1
87        1
74        1
82        1
Name: hours-per-week, Lengt

Make the data not ohe.

In [20]:
#need the data for recourse and lime to be NOT one-hot-encoded and to be numerical
#need the data for the classifier to be one hot encoded

# german_df['YearsAtCurrentJob_lt_1'] = german_df['YearsAtCurrentJob_lt_1'].replace(1, 'lt_1')
# german_df['YearsAtCurrentJob'] = german_df['YearsAtCurrentJob_lt_1']
# german_df['YearsAtCurrentJob_geq_4'] = german_df['YearsAtCurrentJob_geq_4'].replace(1, 'geq_4')
# german_df['YearsAtCurrentJob'] = german_df.apply(lambda row: 'geq_4' if row['YearsAtCurrentJob_geq_4'] == 'geq_4' else row['YearsAtCurrentJob'], axis=1)
# german_df['YearsAtCurrentJob'] = german_df['YearsAtCurrentJob_lt_1'].replace(0, 'bet_1_4')
# german_df = german_df.drop(['YearsAtCurrentJob_lt_1', 'YearsAtCurrentJob_geq_4'], axis=1)

# german_df['CheckingAccountBalance_geq_0'] = german_df['CheckingAccountBalance_geq_0'].replace(1, 'geq_0')
# german_df['CheckingAccountBalance_geq_200'] = german_df['CheckingAccountBalance_geq_200'].replace(1, 'geq_200')
# german_df['CheckingAccountBalance'] = german_df['CheckingAccountBalance_geq_0']
# german_df['CheckingAccountBalance'] = german_df.apply(lambda row: 'geq_200' if row['CheckingAccountBalance_geq_200'] == 'geq_200' else row['CheckingAccountBalance'], axis=1)
# german_df['CheckingAccountBalance'] = german_df['CheckingAccountBalance'].replace('geq_0', '0_200')
# german_df = german_df.drop(['CheckingAccountBalance_geq_0', 'CheckingAccountBalance_geq_200'], axis=1)

# german_df['SavingsAccountBalance_geq_100'] = german_df['SavingsAccountBalance_geq_100'].replace(1, '100_500')
# german_df['SavingsAccountBalance_geq_500'] = german_df['SavingsAccountBalance_geq_500'].replace(1, 'geq_500')
# german_df['SavingsAccountBalance'] = german_df['SavingsAccountBalance_geq_100']
# german_df['SavingsAccountBalance'] = german_df.apply(lambda row: 'geq_500' if row['SavingsAccountBalance_geq_500'] == 'geq_500' else row['SavingsAccountBalance'], axis=1)
# german_df['SavingsAccountBalance'] = german_df['SavingsAccountBalance'].replace('0', 'lt_100')
# german_df = german_df.drop(['SavingsAccountBalance_geq_100', 'SavingsAccountBalance_geq_500'], axis=1)
# display(german_df)


In [21]:
pd.set_option('display.max_columns', None)
# display(X)
# display(y)

In [22]:
# msk = np.random.rand(len(X)) < 0.8
# train = X[msk]
# test = X[~msk]

# train_y = y[msk]
# test_y = y[~msk]

Currently, no immutable features.

# Train model

Ok great, now let's get into the meat of it. Let's train up a model as see what recourse exists.

# Generate Recourse

First, let's score everyone using our model. Now, let's say that we will give loans to anyone with a greater than a $80\%$ chance of paying it back

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from IPython.display import HTML
import time

from sklearn.tree import DecisionTreeRegressor
sys.path.append(os.path.dirname(os.getcwd()))
sys.path.append("../..")
from mlinsights.mlinsights.mlmodel import PiecewiseRegressor


# start by randomly picking an action for each feature
def sample_with_actions(instance, actions, num_samples, ordered_feature_names):
    num_features = len(ordered_feature_names)
    sampled_data = np.zeros((num_samples, num_features))
    sampled_data[0, :] = instance
    
    len_actions = [len(actions[feat]) for feat in ordered_feature_names]
    ordered_actions = [actions[feat] for feat in ordered_feature_names]
    
#     print("instance: ", instance)
#     print("actions: ", ordered_actions)
#     print("len_actions; ", len_actions)
#     print("ordered_actions: ", ordered_actions)
    
    # max number of actions
    max_actions = len(actions[max(actions, key=lambda feat:len(actions[feat]))])
            
#     print(len_actions)    
        
    for s in range(1, num_samples):
        sampled_actions = [ordered_actions[i][np.random.choice(x)] for i, x in enumerate(len_actions)]
#         print("sampled_actions: ", sampled_actions)
        sampled_data[s, :] = instance + sampled_actions
#         print("sampled_actions: ", sampled_actions)
        
    return sampled_data

def convert_binary_categorical_coefficients(exp_list):
    cleaned_exp_dict = {}
    for (feat, coeff) in exp_list:
        if "=" in feat:
            original_feat, val = feat.split("=")
            int_val = int(val)
            if int_val == 1:
                cleaned_exp_dict[original_feat] = coeff
            else:
                cleaned_exp_dict[original_feat] = -1 * coeff
        else:
            cleaned_exp_dict[feat] = coeff
    return cleaned_exp_dict

# scaled_X = (X - explainer.scaler.mean_) / explainer.scaler.scale_

In [24]:
from functools import partial
from sklearn.model_selection import train_test_split
from pyutilib.common import ApplicationError
from sklearn.preprocessing import KBinsDiscretizer

sys.path.append(os.path.dirname(os.getcwd()))
sys.path.append("../..")
    
# import MAPLE.MAPLE
from MAPLE.Code.Misc import load_normalize_data, unpack_coefs
from MAPLE.Code.MAPLE import MAPLE

def get_nonzero_actions(feature_names, action):
    action_dict = {}
    for feat_idx, feat_name in enumerate(feature_names):
        action_for_feat = action[feat_idx]
        if action_for_feat != 0:
            action_dict[feat_name] = action_for_feat
    return action_dict

# # X_train is unnormalized
# def find_testing_interventions(x, model, X_train, binary_categorical_features, num_divisions = 100):
#     interventions = [[]] * len(X_train.columns)
#     original_pred = model
#     for feat_idx, feat in X_train.columns:
#         if feat_idx not in binary_categorical_features:
#             feat_min = X_train[feat].min()
#             feat_max = X_train[feat].max()
#             increment = (feat_max - feat_min) / 100
#         else:
#             x[feat_idx] = 
        
# data is array-like of shape (n_samples, n_features)
def get_pred_function(model, categorical_names, enc, columns):
    def new_predict_proba(data):
        transformed_data = ohe(data, categorical_names, enc, columns = columns)
        return model.predict_proba(transformed_data)
    return new_predict_proba        
    
def get_lime_sampled(lime_explainer, x, new_predict_proba, num_features, num_samples, columns, true_index):
    exp = lime_explainer.explain_instance(x, new_predict_proba, num_features = num_features, num_samples = num_samples)
    return exp.inverse, exp.scaled_data, exp.all_sampled_preds, exp.weights, lime_explainer.scaler.mean_, lime_explainer.scaler.scale_
    
def get_lime_coefficients(lime_explainer, x, new_predict_proba, num_features, num_samples, columns, true_index):
    exp = lime_explainer.explain_instance(x, new_predict_proba, num_features = num_features, num_samples = num_samples)
    inverse_data = (pd.DataFrame(exp.inverse, columns = columns))
#     display(inverse_data)
#     display(pd.DataFrame([x], columns = columns))
#     display(inverse_data['race'].value_counts())
#     display(inverse_data['sex'].value_counts())
    local_pred = exp.local_pred

    coefficients = [None] * num_features

    local_exp = exp.local_exp[true_index]
        
    for (feat, coef) in local_exp:
        coefficients[feat] = coef
    
#     cleaned_exp_dict = convert_binary_categorical_coefficients(exp.as_list())
    
#     for j, col in enumerate(columns):
#         coefficients[j] = cleaned_exp_dict[col]

        
    print("exp.local_exp:", local_exp)
        
    intercept = exp.intercept[true_index]

    x_shift = np.array(lime_explainer.scaler.mean_)
    x_scale = np.array(lime_explainer.scaler.scale_)
    w = coefficients / x_scale
    b = intercept - np.sum(w * x_shift) - 0.5 # subtract 0.5 bc using probs as labels
    
    discrete_yss = (exp.yss[:, true_index] > 0.5).astype(int)
    discrete_sampled_preds = (exp.all_sampled_preds > 0.5).astype(int)
    
    num_pos_yss = (np.count_nonzero(discrete_yss == 1))
    num_neg_yss = (np.count_nonzero(discrete_yss == 0))
    
    num_accurate_preds = np.count_nonzero(discrete_yss == discrete_sampled_preds)
    accuracy_sampled = num_accurate_preds/len(discrete_yss)
    
    return w, b, local_pred, accuracy_sampled

def get_maple_coefficients(maple_explainer, x, mean, std, lime_sampled = [], model_preds_sampled = [], use_distance_weights = True):
    if lime_sampled != []:
        e_maple = maple_explainer.explain(x, lime_sampled = lime_sampled, model_preds_sampled = model_preds_sampled, use_distance_weights = use_distance_weights)
    else:
        e_maple = maple_explainer.explain(x, use_distance_weights = use_distance_weights)
        
    coefs_maple = e_maple["coefs"][1:]
    intercept_maple = e_maple["coefs"][0]
    
    
    w = coefs_maple / std
    b = intercept_maple - np.sum(w * mean) - 0.5 # subtract 0.5 bc using probs as labels
        
    num_pos_yss = (np.count_nonzero(e_maple['selected_sampled_yss'] == 1))
    num_neg_yss = (np.count_nonzero(e_maple['selected_sampled_yss'] == 0))
    
    num_accurate_preds = np.count_nonzero(e_maple['selected_sampled_yss'] == e_maple['selected_sampled_preds'])   
    accuracy_sampled = num_accurate_preds/len(e_maple['selected_sampled_preds'])
    local_pred = e_maple['pred']    
    
#     print(e_maple['weights'])
#     print(len(np.nonzero(e_maple['weights'])[0]))
#     print(e_maple['weights'][np.nonzero(e_maple['weights'])[0]])
    
    return w, b, local_pred, accuracy_sampled

def get_piecewise_coefficients_with_maple(maple_explainer, x, lime_sampled = [], model_preds_sampled = [], use_distance_weights = False):
    e_maple = maple_explainer.explain(x, lime_sampled = lime_sampled, model_preds_sampled = model_preds_sampled, use_distance_weights = use_distance_weights)
    maple_weights = e_maple['weights']
    
    model = PiecewiseRegressor(verbose=False,
                           binner=DecisionTreeRegressor(min_samples_leaf=2500))
    model.fit(lime_sampled, model_preds_sampled, sample_weight=maple_weights)
    
    estimators = model.estimators_
    
    return estimators    

def get_piecewise_coefficients(x, lime_sampled, model_preds_sampled, lime_sampled_weights):
    model = PiecewiseRegressor(verbose=False,
                           binner=DecisionTreeRegressor(min_samples_leaf=2500))
    model.fit(lime_sampled, model_preds_sampled, sample_weight=lime_sampled_weights)
    
#     sampled_preds = model.predict()
    
#     accuracy_sampled = 
    
    estimators = model.estimators_
    
    return estimators

# def get_ohe_coefficients(w):
#     df_w = pd.DataFrame(w, columns = columns)        
#     transformed = encoder.transform(df_data[categorical_names])
#     df_transformed = (pd.DataFrame(transformed, columns = encoder.get_feature_names(input_features = categorical_names)))
#     return pd.concat([df_data.reset_index(drop = True), df_transformed], axis=1).drop(categorical_names, axis=1)

    

def get_recourse(x, action_set, w, b):
    action_set.align(coefficients=w)
    fb = Flipset(x = x, action_set = action_set, coefficients = w, intercept = b)
    
    try:
        print("populating...")
        fb = fb.populate(enumeration_type = 'distinct_subsets', total_items = 20)
        actions = fb._builder.actions

        error = False

        returned_actions = [result['actions'] for result in fb.items]

    except (ValueError, ApplicationError, AssertionError) as e:
        print("excepting...")
        print(e)
#         print("coeffs from error: ", w)
        error = True

        returned_actions = []
    

    return returned_actions, error

# assumes data is properly formatted
def calculate_recourse_accuracy(model, data, enc, categorical_features, categorical_names, file_name, \
    num_samples = 5000, kernel_width = 1, explanation_type = 'lime', lime_sample_around_instance = None, \
    use_lime_sampled_maple = None, maple_use_distance_weights = None, instances_subset = None):
    
    instances_with_recourses = []
    
    use_lime_inverse = True
    
    with open(file_name, "a") as f:   
        
        X_train = data['X_train']
        y_train = data['y_train']

        X_val = data['X_val']
        y_val = data['y_val']

        X_test = data['X_test']
        y_test = data['y_test']
        
        ohe_X_train = ohe(X_train, categorical_names, enc)
        ohe_X_val = ohe(X_val, categorical_names, enc)
        ohe_X_test = ohe(X_test, categorical_names, enc)

        print("\n\nTRAIN LABEL SPLIT: ", file=f)
        print(y_train.value_counts(), file=f)

        print("validation score: ", model.score(ohe_X_val, y_val), file=f)

        new_predict_proba = get_pred_function(model, categorical_names, enc, X_train.columns)
        
        if instances_subset != None:
            calculate_subset_accuracy = True
            subset_total_recourses = 0
            subset_total_actual_recourses = 0
        else:
            instances_subset = []
            calculate_subset_accuracy = False
        
        classes = model.classes_
        true_index = list(classes).index(1)
        
        scores = pd.Series(new_predict_proba(X_test)[:, true_index])
        discrete_scores = pd.Series(model.predict(ohe_X_test))

        total_recourses = 0
        total_actual_recourses = 0
        
        total_instances_with_recourses = 0

        num_actiongrid_regressor_agree = 0
        num_lime_agree = 0
        num_sampled_total = 0

        print("NUM SAMPLES: ", num_samples, file=f)
        print("KERNEL WIDTH: ", kernel_width, file=f)
        print("EXPLANATION TYPE: ", explanation_type, file=f)
        if lime_sample_around_instance != None:
            print("LIME SAMPLE AROUND INSTANCES: ", lime_sample_around_instance, file=f)
        if use_lime_sampled_maple != None:
            print("USE SAMPLED LIME FOR MAPLE: ", use_lime_sampled_maple, file=f)
        if maple_use_distance_weights != None:
            print("USE DISTANCE WEIGHTS FOR MAPLE: ", maple_use_distance_weights, file=f)
            
        print("num unique preds: ", np.unique(discrete_scores, axis=0).shape[0])


        print("TEST LABEL SPLIT: ", file=f)

        print(discrete_scores.value_counts())

        # class_names have to be ordered according to what the classifier is using
        lime_explainer = lime_tabular.LimeTabularExplainer(X_train.values, categorical_features=categorical_features, 
                                                           categorical_names=categorical_names, \
                                                           feature_names=X_train.columns, class_names=classes, \
                                                           discretize_continuous=False, kernel_width = kernel_width, \
                                                           sample_around_instance = lime_sample_around_instance)

        train_stddev = X_train[X_train.columns[:]].std()
        train_mean = X_train[X_train.columns[:]].mean()

        # Normalize to have mean 0 and variance 1
        norm_X_train = (X_train - train_mean) / train_stddev
        norm_X_val = (X_val - train_mean) / train_stddev
        norm_X_test = (X_test - train_mean) / train_stddev

        pred_train = new_predict_proba(X_train)[:, true_index]
        pred_val = new_predict_proba(X_val)[:, true_index]

        maple_explainer = MAPLE(norm_X_train, pred_train, norm_X_val, pred_val)

        action_set = ActionSet(X = X_train)
#         action_set = ActionSet(X = ohe_X_train, custom_bounds={'race_1':(0, 1.0), 'race_4':(0, 1.0)})
        display(action_set)

        start_time = time.time()
        num_neg_test_preds = 0

        negative_scores = np.nonzero(scores < 0.5)[0]
        recourses = [None] * len(negative_scores)

        num_neg_test_preds = len(negative_scores)

        columns = X_train.columns
        
        print(len(X_train))
        
        if "cluster" in explanation_type:
            cluster_model = PiecewiseRegressor(verbose=False,
                                   binner=DecisionTreeRegressor(max_leaf_nodes = 100))
            if explanation_type == "cluster_train":
                cluster_model.fit(norm_X_train, pred_train)
#             else: NEED SAMPLED POINTS
#                 cluster_model.fit(lime_sampled, model_preds_sampled, sample_weight=lime_sampled_weights)
            cluster_estimators = cluster_model.estimators_
            print(len(cluster_estimators))
            bins = cluster_model.transform_bins(norm_X_train.values)
        
        for idx, i in enumerate(negative_scores): #scores is for X_test specifically
            if idx % 25 == 0:
                print("\n", idx, " out of ", len(negative_scores))
            if idx % 100 == 0:
                print("time elapsed: ", (time.time() - start_time) / 60, " minutes")
                start_time = time.time()
            
            x = X_test.values[i]

            num_features = len(x)

            print(explanation_type)
            
            if "cluster" in explanation_type:
                normalized_x = (x - train_mean) / train_stddev
                estimator_idx = int(cluster_model.transform_bins(np.array([normalized_x]))[0])
                print(estimator_idx)
                coefs = cluster_estimators[estimator_idx].coef_
                intercept = cluster_estimators[estimator_idx].intercept_
                
                w = coefs / train_stddev
                b = intercept - np.sum(w * train_mean) - 0.5 # subtract 0.5 bc using probs as labels                   w, b = 
                returned_actions, error = get_recourse(x, action_set, w, b)                
            
            elif explanation_type == "lime":
                w, b, local_pred, accuracy_sampled = get_lime_coefficients(lime_explainer, x, new_predict_proba, num_features, num_samples, X_train.columns, true_index)
                
                returned_actions, error = get_recourse(x, action_set, w, b)
                
            elif explanation_type == "maple":
                if use_lime_sampled_maple:
                    inverse_lime_sampled, scaled_binary_lime_sampled, model_preds_sampled, lime_sampled_weights, mean, std = get_lime_sampled(lime_explainer, x, new_predict_proba, num_features, num_samples, X_train.columns, true_index)
#                     un_ohe_lime_sampled = un_ohe(lime_sampled, categorical_names, enc, columns = columns)
                    if use_lime_inverse:
                        lime_sampled = inverse_lime_sampled
                    else:
                        lime_sampled = scaled_binary_lime_sampled
                    w, b, local_pred, accuracy_sampled = get_maple_coefficients(maple_explainer, x, train_mean, train_stddev, lime_sampled = lime_sampled, model_preds_sampled = model_preds_sampled, use_distance_weights = maple_use_distance_weights)
                else:
                    w, b, local_pred, accuracy_sampled = get_maple_coefficients(maple_explainer, x, train_mean, train_stddev, use_distance_weights = maple_use_distance_weights)
                
                returned_actions, error = get_recourse(x, action_set, w, b)
                
            elif explanation_type == "piecewise" or explanation_type == "piecewise_maple":
                inverse_lime_sampled, scaled_binary_lime_sampled, model_preds_sampled, lime_sampled_weights, mean, std = get_lime_sampled(lime_explainer, x, new_predict_proba, num_features, num_samples, X_train.columns, true_index)
                if use_lime_inverse:
                    lime_sampled = inverse_lime_sampled
                else:
                    lime_sampled = scaled_binary_lime_sampled
                if explanation_type == "piecewise":
                    estimators = get_piecewise_coefficients(x, lime_sampled, model_preds_sampled, lime_sampled_weights)
                else:
                    estimators = get_piecewise_coefficients_with_maple(maple_explainer, x, lime_sampled = lime_sampled, model_preds_sampled = model_preds_sampled, use_distance_weights = False)
                returned_actions, error = [], []
                
                print("NUM ESTIMATORS: ", len(estimators))
                
                for estimator in estimators:
                    
                    coefs = estimator.coef_
                    intercept = estimator.intercept_
            
                    print(coefs)
                    print(intercept)
            
                    w = coefs / std
                    b = intercept - np.sum(w * mean) - 0.5 # subtract 0.5 bc using probs as labels                    
                    
                    
                    ra, er = get_recourse(x, action_set, w, b)
                    for a in ra:
                        if not any((a == e).all() for e in returned_actions):
                            returned_actions.append(a)
                    error.extend([er])
                
            model_pred = (new_predict_proba([x])[0][true_index])

                
            recourse = {}
            recourse['idx'] = i
            recourse['instance'] = x
            recourse['model_prob'] = model_pred
#             recourse['local_prob'] = local_pred
            recourse['model_pred'] = 1 if model_pred >= 0.5 else -1
#             recourse['local_pred'] = 1 if local_pred >= 0.5 else -1

            recourse['scaled_coeff'] = w
            recourse['scaled_intercept'] = b
            recourse['actions'] = returned_actions
            recourse['error_solving'] = error

            recourse['explanation_type'] = explanation_type

#             recourse['accurate_pred'] = 1 if (recourse['model_pred'] == recourse['local_pred']) else 0
#             recourse['sampled_accuracy'] = accuracy_sampled

            recourse['returned_actions'] = returned_actions
    
            recourses[idx] = recourse

            print_coefs = False

                                    
            no_changes = True
            
            if len(returned_actions) != 0:
                total_instances_with_recourses += 1
                instances_with_recourses.append(i) 
            
            for action in returned_actions:
                new_x = (x + action)
                ohe_new_x = ohe(new_x.reshape(1, -1), categorical_names, enc, columns = columns)
                
                
                old_pred = recourse['model_pred']
                new_pred = model.predict(ohe_new_x)[0]

                new_lime_pred = 1 if np.dot(w, new_x) + b >= 0.0 else -1
                total_recourses += 1
                
                if i in instances_subset and calculate_subset_accuracy:
                    subset_total_recourses += 1
                
                if old_pred != new_pred:
                    print(get_nonzero_actions(columns, action))
                    total_actual_recourses += 1
                    no_changes = False
                    
                    if i in instances_subset and calculate_subset_accuracy:
                        subset_total_actual_recourses +=1
                
            if no_changes:
                print(x)
                    
            print("model_pred: ", recourse['model_pred']) 
#             print("local_pred: ", recourse['local_pred'])
            print("intercept: ", b)

        if explanation_type == "piecewise":
            total_errors = [1 for rec in recourses if (True not in rec['error_solving'])]            
        else:
            total_errors = [1 for rec in recourses if (rec['error_solving'] == True)]   
#         total_accurate_preds = [1 for rec in recourses if (rec['accurate_pred'] == True)]   
#         average_sampled_accuracy = np.mean([rec['sampled_accuracy'] for rec in recourses])
        average_recourses_per_all = np.mean([len(rec['returned_actions']) for rec in recourses])
        average_recourses_per_found = np.mean([len(rec['returned_actions']) for rec in recourses if rec['returned_actions'] != []])

        try:
        
            print("num_neg_test_preds: ", num_neg_test_preds, " out of ", len(scores), " = ", round(num_neg_test_preds/len(scores), 2), file=f)
            print("recourse accuracy: ", round(total_actual_recourses/total_recourses, 2), "; total instances with recourses found: ", total_instances_with_recourses, file=f)
            print("recourse accuracy (on all instances, assuming recourse and assuming avg per found instance would be found): ", round(total_actual_recourses/(average_recourses_per_found * num_neg_test_preds), 2), file=f)
            if instances_subset != []:
                print("subset recourse accuracy: ", round(subset_total_actual_recourses/subset_total_recourses, 2), "; total instances in subset: ", len(instances_subset), file=f)
                print("subset recourse accuracy (out of total potential): ", round(subset_total_actual_recourses/(len(instances_subset) * 20), 2), "; total instances in subset: ", len(instances_subset), file=f)
            print("number of errors: ", sum(total_errors), "; percent of total instances: ", round(sum(total_errors)/len(recourses), 2), file=f)
            print("average number of recourses per instance: ", round(average_recourses_per_all, 2), file=f)
            print("average number of recourses per instance found: ", round(average_recourses_per_found, 2), file=f)
#             print("number accurate preds (original data): ", sum(total_accurate_preds), "; percent of total instances: ", round(sum(total_accurate_preds)/len(recourses), 2), file=f)
#             print("average accuracy of preds on sampled data: ", average_sampled_accuracy, file=f)

        except ZeroDivisionError as error_msg:
            print(error_msg)
            
        return instances_with_recourses


In [25]:
def get_data(X, y, test_size = 0.5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=test_size)
    
    data = {
        'X_train': X_train,
        'y_train': y_train,

        'X_val': X_val,
        'y_val': y_val,

        'X_test': X_test,
        'y_test': y_test
    }
    
    return data    

# Experiments

In [26]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
import sys
import os
# import lime.explanation
# import lime.lime_tabular

sys.path.append(os.path.dirname(os.getcwd()))
sys.path.append("../lime_experiments")
    
import explanation
import lime_tabular
import lime_base

import importlib
importlib.reload(explanation)
importlib.reload(lime_tabular)
importlib.reload(lime_base)

<module 'lime_base' from '../lime_experiments/lime_base.py'>

In [27]:
def run_all(file_name, model, enc, data, categorical_features, categorical_names):
    
    start_time = time.time()
    
    exp1 = {'explanation_type': 'lime', 'use_lime_sampled_maple': None, 'lime_sample_around_instance': False, 'maple_use_distance_weights': None, 'instances_subset': None}

    experiments = [exp1]

    for exp in experiments:
        lime_instances = calculate_recourse_accuracy(model, data, enc, categorical_features, categorical_names, \
                                    file_name, explanation_type = exp["explanation_type"], \
                                    use_lime_sampled_maple = exp['use_lime_sampled_maple'], \
                                    lime_sample_around_instance = exp['lime_sample_around_instance'], \
                                    maple_use_distance_weights = exp['maple_use_distance_weights'], instances_subset = exp['instances_subset'])

    print("TIME FOR EXP1: ", (time.time() - start_time) / 60, " minutes")
    cluster_exp = {'explanation_type': 'cluster_train', 'use_lime_sampled_maple': None, 'lime_sample_around_instance': False, 'maple_use_distance_weights': None, 'instances_subset': lime_instances}
    
    exp2 = {'explanation_type': 'piecewise_maple', 'use_lime_sampled_maple': None, 'lime_sample_around_instance': False, 'maple_use_distance_weights': None, 'instances_subset': lime_instances}
    exp3 = {'explanation_type': 'piecewise', 'use_lime_sampled_maple': None, 'lime_sample_around_instance': False, 'maple_use_distance_weights': None, 'instances_subset': lime_instances}

    exp4 = {'explanation_type': 'maple', 'use_lime_sampled_maple': True, 'lime_sample_around_instance': False, 'maple_use_distance_weights': True, 'instances_subset': lime_instances}
    exp5 = {'explanation_type': 'maple', 'use_lime_sampled_maple': True, 'lime_sample_around_instance': False, 'maple_use_distance_weights': False, 'instances_subset': lime_instances}

    exp6 = {'explanation_type': 'maple', 'use_lime_sampled_maple': False, 'lime_sample_around_instance': False, 'maple_use_distance_weights': False, 'instances_subset': lime_instances}
    exp7 = {'explanation_type': 'maple', 'use_lime_sampled_maple': False, 'lime_sample_around_instance': False, 'maple_use_distance_weights': True, 'instances_subset': lime_instances}

    experiments = [cluster_exp, exp2, exp3, exp4, exp5, exp6, exp7]


    for exp in experiments:
        _ = calculate_recourse_accuracy(model, data, enc, categorical_features, categorical_names, \
                                    file_name, explanation_type = exp["explanation_type"], \
                                    use_lime_sampled_maple = exp['use_lime_sampled_maple'], \
                                    lime_sample_around_instance = exp['lime_sample_around_instance'], \
                                    maple_use_distance_weights = exp['maple_use_distance_weights'], instances_subset = exp['instances_subset'])

    with open(file_name, "a") as f:
        print("--------------------------------------------", file=f)
        print("lime_sample_around_instance: TRUE", file=f)

    exp8 = {'explanation_type': 'lime', 'use_lime_sampled_maple': None, 'lime_sample_around_instance': True, 'maple_use_distance_weights': None, 'instances_subset': None}
    experiments = [exp7]

    for exp in experiments:
        lime_instances = calculate_recourse_accuracy(model, data, enc, categorical_features, categorical_names, \
                                    file_name, explanation_type = exp["explanation_type"], \
                                    use_lime_sampled_maple = exp['use_lime_sampled_maple'], \
                                    lime_sample_around_instance = exp['lime_sample_around_instance'], \
                                    maple_use_distance_weights = exp['maple_use_distance_weights'], instances_subset = exp['instances_subset'])

    exp9 = {'explanation_type': 'piecewise', 'use_lime_sampled_maple': None, 'lime_sample_around_instance': True, 'maple_use_distance_weights': None, 'instances_subset': lime_instances}
    exp10 = {'explanation_type': 'maple', 'use_lime_sampled_maple': True, 'lime_sample_around_instance': True, 'maple_use_distance_weights': True, 'instances_subset': lime_instances}
    exp11 = {'explanation_type': 'maple', 'use_lime_sampled_maple': True, 'lime_sample_around_instance': True, 'maple_use_distance_weights': False, 'instances_subset': lime_instances}


    experiments = [exp9, exp10, exp11]

    for exp in experiments:
        _ = calculate_recourse_accuracy(model, data, enc, categorical_features, categorical_names, \
                                    file_name, explanation_type = exp["explanation_type"], \
                                    use_lime_sampled_maple = exp['use_lime_sampled_maple'], \
                                    lime_sample_around_instance = exp['lime_sample_around_instance'], \
                                    maple_use_distance_weights = exp['maple_use_distance_weights'], instances_subset = exp['instances_subset'])

    print("TOTAL TIME FOR ALL EXPERIMENTS: ", (time.time() - start_time) / 60, " minutes")


## COMPAS

In [31]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
import lime.explanation
import lime.lime_tabular

# NEURAL NETWORK
 
compas_nn = MLPClassifier()

compas_data = get_data(compas_X, compas_y, test_size = 0.25)

ohe_compas_X_train = ohe(compas_data['X_train'], compas_categorical_names, compas_enc)
ohe_compas_X_val = ohe(compas_data['X_val'], compas_categorical_names, compas_enc)
ohe_compas_X_test = ohe(compas_data['X_test'], compas_categorical_names, compas_enc)

compas_nn.fit(ohe_compas_X_train, compas_data['y_train']) 
print("validation score: ", round(compas_nn.score(ohe_compas_X_val, compas_data['y_val']), 2))
test_preds = pd.Series(compas_nn.predict(ohe_compas_X_test))
print("test predictions split: ")
print(test_preds.value_counts())

validation score:  0.83
test predictions split: 
 1    513
-1     57
dtype: int64


In [None]:
run_all("3_28.txt", compas_nn, compas_enc, compas_data, compas_categorical_features, \
        compas_categorical_names)

num unique preds:  2
 1    513
-1     57
dtype: int64


+-----------------+---------------+---------+------------+----------------+----------------+-----------+-----------+-----------+------+-------+
|            name | variable type | mutable | actionable | step direction | flip direction | grid size | step type | step size |   lb |    ub |
+-----------------+---------------+---------+------------+----------------+----------------+-----------+-----------+-----------+------+-------+
|             age | <class 'int'> |    True |       True |              0 |            nan |        47 |  relative |      0.01 | 20.0 |  66.0 |
|  two_year_recid | <class 'int'> |    True |       True |              0 |            nan |         2 |  relative |      0.01 |  0.0 |   1.0 |
| c_charge_degree | <class 'int'> |    True |       True |              0 |            nan |         2 |  relative |      0.01 |  0.0 |   1.0 |
|            race | <class 'int'> |    True |       True |              0 |            nan |         6 |  relative |      0.01 |  0.0 | 

  return bound(*args, **kwds)


4627

 0  out of  57
time elapsed:  2.0432472229003907e-05  minutes
lime
exp.local_exp: [(1, -0.10112168450826348), (3, 0.03829063832407629), (2, 0.03349494759616605), (5, -0.026086645328056168), (0, 0.01658108318955907), (4, 0.00904306173035872), (6, -0.0009401531152060787)]
populating...
    model=unknown;
        message from solver=<undefined>
excepting...
solver status is not OK
[25.  0.  0.  0.  1.  9.  2.]
model_pred:  -1
intercept:  -0.17620986107001507
lime
exp.local_exp: [(1, -0.11229163301035912), (3, 0.04229693388560304), (5, -0.025520650116081643), (2, 0.024923005013287407), (0, 0.01584320578521159), (4, -0.004252522719370366), (6, -0.0010155583450584507)]
populating...
obtained 20 items in 0.7 seconds
{'age': 34.0, 'race': 2.0, 'priors_count': -13.0}
{'age': 34.0, 'race': 2.0, 'priors_count': -13.0, 'length_of_stay': -1.0}
{'age': 1.0, 'two_year_recid': -1.0, 'race': 2.0}
{'age': 1.0, 'two_year_recid': -1.0, 'race': 2.0, 'length_of_stay': -1.0}
{'two_year_recid': -1.0, 'r

In [None]:
cluster_exp = {'explanation_type': 'cluster_train', 'use_lime_sampled_maple': None, 'lime_sample_around_instance': False, 'maple_use_distance_weights': None, 'instances_subset': None}

_ = calculate_recourse_accuracy(compas_nn, compas_data, compas_enc, compas_categorical_features, compas_categorical_names, \
                                "cluster_train.txt", explanation_type = cluster_exp["explanation_type"], \
                                use_lime_sampled_maple = cluster_exp['use_lime_sampled_maple'], \
                                lime_sample_around_instance = cluster_exp['lime_sample_around_instance'], \
                                maple_use_distance_weights = cluster_exp['maple_use_distance_weights'], instances_subset = cluster_exp['instances_subset'])


In [None]:
# RANDOM FOREST

compas_rf = RandomForestClassifier()

compas_rf.fit(ohe_compas_X_train, compas_data['y_train']) 
print("validation score: ", round(compas_rf.score(ohe_compas_X_val, compas_data['y_val']), 2))
test_preds = pd.Series(compas_rf.predict(ohe_compas_X_test))
print("test predictions split: ")
print(test_preds.value_counts())

In [None]:
run_all("3_27_rf_V2.txt", compas_rf, compas_enc, compas_data, compas_categorical_features, \
        compas_categorical_names)

## German

In [None]:
# NEURAL NETWORK

german_nn = MLPClassifier()

german_data = get_data(german_X, german_y)

german_nn.fit(german_data['X_train'], german_data['y_train']) 
print("validation score: ", round(german_nn.score(german_data['X_val'], german_data['y_val']), 2))
test_preds = pd.Series(german_nn.predict(german_data['X_test']))
print("test predictions split: ")
print(test_preds.value_counts())

# german_rf = RandomForestClassifier(n_estimators=40)


In [None]:
run_all("3_23_german_nn.txt", german_nn, german_data, german_categorical_features, \
        german_categorical_names)

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# NEURAL NETWORK

german_rf = RandomForestClassifier()

german_rf.fit(german_data['X_train'], german_data['y_train']) 
print("validation score: ", round(german_rf.score(german_data['X_val'], german_data['y_val']), 2))
test_preds = pd.Series(german_rf.predict(german_data['X_test']))
print("test predictions split: ")
print(test_preds.value_counts())

In [None]:
run_all("3_23_german_rf.txt", german_rf, german_data, german_categorical_features, \
        german_categorical_names)

## ADULT

In [None]:
# NEURAL NETWORK

adult_nn = MLPClassifier()

adult_data = get_data(adult_X, adult_y, test_size = 0.1)

adult_nn.fit(adult_data['X_train'], adult_data['y_train']) 
print("validation score: ", round(adult_nn.score(adult_data['X_val'], adult_data['y_val']), 2))
test_preds = pd.Series(adult_nn.predict(adult_data['X_test']))
print("test predictions split: ")
print(test_preds.value_counts())

In [None]:
run_all("3_23_adult_nn.txt", adult_nn, adult_data, adult_categorical_features, \
        adult_categorical_names)

In [None]:
# RANDOM FOREST

adult_rf = RandomForestClassifier()

adult_rf.fit(adult_data['X_train'], adult_data['y_train']) 
print("validation score: ", round(adult_rf.score(adult_data['X_val'], adult_data['y_val']), 2))
test_preds = pd.Series(adult_rf.predict(adult_data['X_test']))
print("test predictions split: ")
print(test_preds.value_counts())

In [None]:
run_all("3_23_adult_rf.txt", adult_rf, adult_data, adult_categorical_features, \
        adult_categorical_names)

You can switch optimizers if you don't have CPLEX by setting `optimizer="cbc"`. 

A quick note: Our decision boundary is by default 0. We shift this by tweaking the intercept. Since we used Logistic Regression, we use the trick above to do that. In future iterations, we will provide a more elegant way of doing this.

In [None]:
output_1 = rb.fit()
output_1

all_info = rb.populate()
print(all_info)

Ok, great, we have a solution! This individual has recourse. The total cost of all the actions needed to flip their prediction is the first thing of interest to us. It costs this person $.21$, meaning that the sum of percentile shifts across this person's features is $.21$. That's quite a lot. Imagine having to shift that much relative to a population? Let's check out what this means in terms of actions:

In [None]:
# pd.Series(output_1['actions'], index=X.columns).to_frame('Actions')
actions = [x['actions'] for x in all_info]
actions_df = pd.DataFrame(data=actions).transpose().set_index(X.columns)
person = (pd.Series(x, index=X.columns))
print(person)
display(actions_df)

Ok, so let's read this. 

* `SavingsAccountBalance_geq_100`$=1$, for example. This was a binary feature, so it can only be $1$. This also means that we're enouraging this person to increase their savings. 
* `LoanDuration`$=20$. This, if we recall, was the number of months of loan. This means we're encouraging this person to reapply but specify that their loan repayment period is 20 months shorter.

Let's check if these two actions make sense in the context of this person:

In [None]:
X.loc[denied_individuals[0]].to_frame("Original Features")

Ok, this person originally applied with no savings and with a 4-year repayment period. So asking them to get savings and decrease their loan repayment period by $20$ months make sense as actions.

(Let's leave aside the question of mutually exclusive features (eg. `SavingsAccountBalance_geq_100` $=0$, `SavingsAccountBalance_geq_500`$=1$). We'll get back to that in later releases.)

Let's close by noting some things:

* Immutable features are __not__ changed. That's good. That's recourse.
* The changes make sense, at least directionally. We'd encourage this person to get a gaurantor, to decrease their loan amount, and to decrease their loan period, among other changes.

Yes, these might be hard for someone. They might have other reasons for immutability that we're not considering. Maybe they _need_ that amount and cannot change. Ok, let's express that:

In [None]:
action_set['LoanAmount'].mutable=False

In [None]:
x = X.values[denied_individuals[0]]

p = .8
rb = RecourseBuilder(
      optimizer="cbc",
      coefficients=coefficients,
      intercept=intercept- (np.log(p / (1. - p))),
      action_set=action_set,
      x=x
)

In [None]:
output_2 = rb.fit()
output_2

Ok, so their total cost actually didn't change, which is nice. Let's take a look at their new action set:

In [None]:
pd.Series(output_2['actions'], index=X.columns).to_frame("New Actions")

Ok, by decreasing their repayment period by a bit more and changing some other features, this person can still ask for the same amount. That's good.

The magical thing about both of these action sets is that this person, if they do this, _will_ qualify for a loan. Let's check that:

In [None]:
clf.predict_proba([X.loc[denied_individuals[0]] + pd.Series(output_1['actions'], index=X.columns)])[:, 1]

In [None]:
clf.predict_proba([X.loc[denied_individuals[0]] + pd.Series(output_2['actions'], index=X.columns)])[:, 1]

And there we have it. By making these tweaks, this person has two ways to get over the $.8$ threshold that we've set. This period can now get approved under this model.