# Explainable AI - Notebook

### Table of Contents

* [Import Libraries and Data](#chapterLibraryData)

#### Part 1

* [Chapter 1. SHAP - SHapley Additive exPlanations method](#chapter1)
   * [Section 1.1 Global Level - Beeswarm plot](#section_1_1)
   * [Section 1.2 Local Level - Waterfall plot - Random Row with predicted default class](#section_1_2)
   * [Section 1.3 Local Level - Waterfall plot - for all default predictions](#section_1_3)
* [Chapter 2. LIME - Local Interpretable Model-agnostic Explanations method](#chapter2)
   * [Section 2.1 Local Level - Random Row with predicted default class](#section_2_1)
   * [Section 2.2 Local Level - For all default predictions](#section_2_2)
* [Chapter 3. Intersection between SHAP and LIME methods](#chapter3)
   * [Section 3.1 Insights - Exploratory Data Analysis](#section_3_1)
* [Chapter 4. Combination of Chapter 3 with Conformal Prediction](#chapter4)

#### Part 2

* [Chapter 1. Guided Prototypes algorithm](#chapter1)
  * [Section 1.1 Local Level - Random Row with predicted default class](#section_1_1)
  * [Section 1.2 Local Level - for all default predictions](#section_1_2)
* [Chapter 2. CEML algorithm](#chapter2)
  * [Section 2.1 Local Level - Random Row with predicted default class](#section_2_1)
  * [Section 2.2 Local Level - for all default predictions](#section_2_2)
* [Chapter 3. Insights - Exploratory Data Analysis](#chapter3)
* [Chapter 4. Combination of Chapter 3 with Conformal Prediction](#chapter4)

### Import Libraries and Data <a class="anchor" id="chapterLibraryData"></a>

In [3]:
import warnings
import tensorflow as tf
warnings.filterwarnings("ignore")
tf.compat.v1.disable_eager_execution()
from alibi.explainers import CounterfactualProto
import shap
from ceml.sklearn import generate_counterfactual
import joblib
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm

In [4]:
model_rf = joblib.load('../model/model_rf.joblib')
X_test = pd.read_csv('../data/processed/X_test_cp.csv',index_col=0)

### Part 1

## 1. SHAP - SHapley Additive exPlanations method

In [None]:
explainer = shap.TreeExplainer(model_rf)
shap_values = explainer.shap_values(X_test)

### 1.1 Global Level - Beeswarm plot

In [None]:
class_index = 1  
shap_values_class = shap_values[:, :, class_index]
feature_names = X_test.columns
shap.summary_plot(shap_values_class, X_test, feature_names=feature_names, max_display=13)

plt.show()

The plot shows the most important features for the model's default prediction, sorted by the mean of their absolute SHAP values (their importance): the absolute SHAP values of a feature are summed and then divided by the total number of instances in the dataset. The insights below cover the first 10 features.

Global level insights:

1. As AMT_CREDIT_SUM_DEBT_bur decreases, an applicant is less likely to default (lower values tend to decrease the predicted probability of class 1).
2. If an applicant owns a car (FLAG_OWN_CAR_app), that applicant is less likely to default.
3. As DAYS_LAST_PHONE_CHANGE_app increases, an applicant is less likely to default.
4. If an applicant has the Higher education type, that applicant is less likely to default.
5. If an applicant provides a home phone (FLAG_PHONE_app), that applicant is less likely to default.
6. As DAYS_ID_PUBLISH_app increases, an applicant is less likely to default.
7. If an applicant has WEEKDAY_APPR_PROCESS_START_app_THURSDAY, that applicant is less likely to default.
8. As DAYS_EMPLOYED_app increases, an applicant is less likely to default.
9. As REGION_POPULATION_RELATIVE_app increases, an applicant is less likely to default.
10. If an applicant has the income type NAME_INCOME_TYPE_app_State servant, that applicant is less likely to default.

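A SHAP value has two components, magnitude and direction, which together quantify the influence of a feature on the model's prediction. The magnitude is the strength of the impact; at the global level it is summarized by the mean of the feature's absolute SHAP values across all instances. The direction is the sign of the contribution: a negative SHAP value lowers the predicted probability of class 1 (default) and a positive value raises it. At the global level the direction is read from the side on which the majority of a feature's SHAP values fall.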
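
The two global summaries described above can be sketched on a small array. This is only an illustration: `toy_shap` and the feature names are made up, and the notebook's real values live in `shap_values_class`.

```python
import numpy as np

# Toy SHAP matrix: 4 instances x 3 features (values are made up for illustration).
toy_shap = np.array([
    [-0.20,  0.05, -0.01],
    [-0.10, -0.03,  0.02],
    [ 0.30,  0.04, -0.02],
    [-0.40,  0.02, -0.01],
])
toy_features = ["feat_a", "feat_b", "feat_c"]  # hypothetical feature names

# Magnitude: mean of the absolute SHAP values per feature.
magnitude = np.abs(toy_shap).mean(axis=0)

# Direction: share of instances where the feature lowers the class-1 prediction.
share_negative = (toy_shap < 0).mean(axis=0)

# Rank features from most to least important by magnitude.
ranking = [toy_features[i] for i in np.argsort(-magnitude)]
print(ranking, magnitude.round(3), share_negative)
```

Applied to the notebook's data, the same two lines would presumably take `shap_values_class` in place of `toy_shap`.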

### 1.2 Local Level - Waterfall plot - Random Row with predicted default class

In [None]:
X_test_copy= X_test.copy()
X_test_copy.reset_index(inplace=True)
X_test_default = X_test.copy()
X_test_default['predictions'] = model_rf.predict(X_test)
X_test_default.reset_index(inplace=True)
X_test_default['original_index'] = X_test_copy.index
X_test_default = X_test_default[X_test_default['predictions'] == 1]

id_values_predicted_default = X_test_default.original_index.tolist()

X_test_default.head(6)
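
The cells in this notebook address the same applicant both by row position and by `SK_ID_CURR`. A minimal sketch of how the two relate in pandas, using a made-up frame (the ids and the column name `feat_a` are hypothetical):

```python
import pandas as pd

# Toy frame indexed by a hypothetical SK_ID_CURR index (ids are made up).
toy = pd.DataFrame(
    {"feat_a": [0.1, 0.7, 0.4]},
    index=pd.Index([100001, 100005, 100009], name="SK_ID_CURR"),
)

position = 1
applicant_id = toy.index[position]                   # position -> label
back_to_position = toy.index.get_loc(applicant_id)   # label -> position

# .iloc selects by position, .loc selects by label; both return the same row here.
assert toy.iloc[position].equals(toy.loc[applicant_id])
print(applicant_id, back_to_position)
```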

In [None]:
shap_df = pd.DataFrame(shap_values_class, columns=feature_names) # we put the shap values in a dataframe along their corresponding features

def generate_waterfall_plot(row_idx):
    shap_values_row = shap_df.iloc[row_idx] # we extract the shap values for the row
    expected_value = explainer.expected_value[class_index]  # we extract the base value for class_index
    waterfall_legacy_plot = shap.plots._waterfall.waterfall_legacy(
        expected_value, shap_values_row.values, feature_names=X_test.columns, max_display = 10
    )
    return waterfall_legacy_plot

waterfall_plot_first_row = generate_waterfall_plot(187)
waterfall_plot_first_row

f(x) = 0.52; the predicted probability for class 1 (default)

E[f(x)] = 0.5; the base value, representing the average predicted probability for class 1 (default) - mean prediction

f(x) = base value + sum(SHAP values)

The figure above illustrates the first 16 most important features for the model's default prediction of the applicant with id 453167, sorted by their numerical contribution to the model's prediction. Each feature contributes to the predicted probability by moving it up or down from the base value.
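
The additivity relation f(x) = base value + sum(SHAP values) can be checked by hand. The sketch below uses three made-up contributions chosen so that the result matches the 0.50 base value and 0.52 prediction quoted above; the feature names are placeholders.

```python
# Toy numbers only: a base value and three per-feature SHAP contributions.
base_value = 0.50
toy_contributions = {"feat_a": 0.04, "feat_b": -0.03, "feat_c": 0.01}

# Additivity: prediction = base value + sum of all SHAP values for the row.
prediction = base_value + sum(toy_contributions.values())

# Features with a positive contribution push the prediction towards class 1.
pushes_up = sorted(
    (name for name, value in toy_contributions.items() if value > 0),
    key=lambda name: -toy_contributions[name],
)
print(round(prediction, 2), pushes_up)
```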

The following top 10 features increase the predicted probability of default (feature values in parentheses):

1. FLAG_OWN_CAR_app (0)
2. DAYS_BIRTH_app (31)
3. AMT_CREDIT_SUM_DEBT_bur (34.631)
4. DAYS_ID_PUBLISH_app (3.346)
5. FLAG_PHONE_app (0)
6. AMT_GOODS_PRICE_app (472.500)
7. FLAG_OWN_REALTY_app (0)
8. NAME_EDUCATION_TYPE_app_Higher education (0)
9. DAYS_REGISTRATION_app (3.619)
10. WEEKDAY_APPR_PROCESS_START_app_THURSDAY (0)

### 1.3 Local Level - Waterfall plot - for all default predictions

In [None]:
class_index = 1  
shap_values_class = shap_values[:, :, class_index] # array data type without SK_ID_CURR index from shap generation
shap_df = pd.DataFrame(shap_values_class, columns=feature_names) # dataframe with original index
shap_df.index = X_test.index # we pass the SK_ID_CURR index from X_test
shap_df.reset_index(inplace=True) # we set the index as column for later merging

X_test_default = X_test.copy()
X_test_default['predictions'] = model_rf.predict(X_test)
X_test_default = X_test_default[X_test_default['predictions'] == 1]
X_test_default.reset_index(inplace=True) 

df = pd.merge(X_test_default,shap_df,how='inner', on='SK_ID_CURR')
columns_to_remove = [col for col in df.columns if col.endswith('_x')]
df = df.drop(columns=columns_to_remove)
df.set_index('SK_ID_CURR', inplace=True)
del df["predictions"]

df.head()

In [None]:
for column in df.columns:
    df[column] = df[column].apply(lambda x: "positive" if x > 0 else "negative" if x < 0 else "no impact")

df.head() 

for column in df.columns:
    df[column] = df[column].astype('object')

## 2. LIME - Local Interpretable Model-agnostic Explanations method

### 2.1 Local Level - Random Row with predicted default class

In [None]:
chosen_instance = X_test.iloc[187].values  # LIME expects a 1-D NumPy array

explainer_lime = LimeTabularExplainer(X_test.values, feature_names=X_test.columns.tolist(), class_names=["non-default","default"], mode='classification', random_state=42)
exp = explainer_lime.explain_instance(chosen_instance, model_rf.predict_proba, num_features = 138)

exp.show_in_notebook(show_table=True)
print(exp.local_exp)

The LIME output for this instance consists of three parts: the predicted probabilities of the random forest model (left); the feature importance scores (middle), ranked in decreasing order and obtained from the local linear regression model; and the feature-value table (right).
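
The idea behind the feature importance scores can be sketched without the library: perturb the instance, query the model, weight the samples by proximity, and fit a weighted linear regression. This is a simplified illustration, not LIME's implementation. `black_box` is a hypothetical stand-in for the random forest, and the kernel width of 0.25 is an arbitrary choice. LIME itself also discretizes continuous features and regularizes the regression, which this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Hypothetical stand-in for model_rf.predict_proba(...)[:, 1].
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1])))

instance = np.array([0.0, 0.0])

# 1. Perturb the instance and query the black-box model.
samples = instance + rng.normal(scale=0.1, size=(500, 2))
targets = black_box(samples)

# 2. Weight each sample by its proximity to the instance (exponential kernel).
distances = np.linalg.norm(samples - instance, axis=1)
weights = np.exp(-(distances ** 2) / (2 * 0.25 ** 2))

# 3. Fit a weighted linear regression; its coefficients are the local scores.
design = np.column_stack([np.ones(len(samples)), samples])
sqrt_w = np.sqrt(weights)[:, None]
coef, *_ = np.linalg.lstsq(design * sqrt_w, targets * sqrt_w[:, 0], rcond=None)
intercept, local_scores = coef[0], coef[1:]
print(local_scores.round(2))
```

A positive coefficient plays the role of a feature that supports the "default" class near the instance, and a negative one supports "non-default".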

The following top 10 features support the default prediction, that is, they decrease the applicant's chance of being classified as non-default and receiving a loan:

1. NAME_EDUCATION_TYPE_app_Higher education (0) - not having a higher education
2. NAME_INCOME_TYPE_app_State servant (0) - not having a State servant income type
3. FLAG_PHONE_app (0) - not having a phone
4. FLAG_OWN_CAR_app (0) - not having own car
5. NAME_TYPE_SUITE_app_Group of people (0)
6. NAME_TYPE_SUITE_app_Other_A (0)
7. WEEKDAY_APPR_PROCESS_START_app_THURSDAY (0)
8. AMT_CREDIT_SUM_DEBT_bur (34.631) - it is more than 19.789
9. WEEKDAY_APPR_PROCESS_START_app_SATURDAY (0)
10. ORGANIZATION_TYPE_app_Cleaning (0)

### 2.2 Local Level - For all default predictions

In [None]:
feature_index_to_name = {i: feature_name for i, feature_name in enumerate(X_test.columns)}

row_index = 187
row = X_test.iloc[row_index]
feature_index = 89
feature_name = feature_index_to_name.get(feature_index)

feature_value = row.iloc[feature_index]
print(f"The name of the feature at index {feature_index} is: {feature_name}")
print(f"The value of this feature for the row at index {row_index} is: {feature_value}")

def generate_filter_lime_scores(row):
    
    chosen_instance = X_test.iloc[row].values  # LIME expects a 1-D NumPy array
    explainer_lime = LimeTabularExplainer(X_test.values, feature_names=X_test.columns.tolist(), class_names=["non-default", "default"], mode='classification', random_state=42)
    exp = explainer_lime.explain_instance(chosen_instance, model_rf.predict_proba, num_features=len(X_test.columns))
    
    print(exp.local_exp)
    val = exp.local_exp[1]
    filtered_exp = [tup for tup in val if tup[0] == 89]
    print(filtered_exp)

generate_filter_lime_scores(187)

In [None]:
lime_scores = []

def generate_lime_scores(row):
    
    chosen_instance = X_test.loc[row].values  # LIME expects a 1-D NumPy array
    explainer_lime = LimeTabularExplainer(X_test.values, feature_names=X_test.columns.tolist(), class_names=["non-default", "default"], mode='classification', random_state=42)
    exp = explainer_lime.explain_instance(chosen_instance, model_rf.predict_proba, num_features=len(X_test.columns))
    lime_scores_row = {feature_index_to_name[i]: round(score, 2) for i, score in exp.local_exp[1]}

    return lime_scores_row

X_test_default = X_test.copy()
X_test_default['predictions'] = model_rf.predict(X_test)
X_test_default = X_test_default[X_test_default['predictions'] == 1]
del X_test_default['predictions']

for idx, row in X_test_default.iterrows():
    lime_scores_row = generate_lime_scores(idx)
    lime_scores.append(lime_scores_row)

lime_scores_df = pd.DataFrame(lime_scores)
lime_scores_df.index = X_test_default.index
lime_scores_df.head()

In [None]:
n = 12
specific_row = lime_scores_df.loc[453167]
value_check = specific_row["ORGANIZATION_TYPE_app_Industry: type 12"]
value_check

In [None]:
for column in lime_scores_df.columns:
    lime_scores_df[column] = lime_scores_df[column].apply(lambda x: "positive" if x > 0 else "negative" if x < 0 else "no impact")
lime_scores_df.head()

# Note: for continuous variables LIME produces a specific kind of output, namely interval explanations.
# To simplify the task of keeping only the rows with common explanations between the SHAP and LIME methods,
# the interval explanations are added later to the final randomly selected default instances, in their original format before MinMax scaling was applied.
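
An interval explanation can be illustrated as follows, assuming LIME's default quartile discretizer: the background data is cut at its quartiles and a raw value is reported as the interval it falls into. The values and the feature name below are made up, and the exact label formatting in LIME may differ.

```python
import numpy as np

# Made-up values of one continuous feature in the background data.
toy_values = np.array([2.0, 5.0, 8.0, 12.0, 20.0, 30.0, 45.0, 60.0])
q1, q2, q3 = np.percentile(toy_values, [25, 50, 75])

def interval_label(name, value):
    # Map a raw value to the quartile interval it falls into.
    if value <= q1:
        return f"{name} <= {q1:.2f}"
    if value <= q2:
        return f"{q1:.2f} < {name} <= {q2:.2f}"
    if value <= q3:
        return f"{q2:.2f} < {name} <= {q3:.2f}"
    return f"{name} > {q3:.2f}"

label = interval_label("toy_feature", 50.0)
print(label)
```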

In [None]:
for column in lime_scores_df.columns:
    lime_scores_df[column] = lime_scores_df[column].astype('object')

## 3. Intersection between SHAP and LIME methods

In [None]:
df.columns = [col[:-2] if col.endswith('_y') else col for col in df.columns]
display(df.head())
display(lime_scores_df.head())

In [None]:
common_columns = df.columns.intersection(lime_scores_df.columns)

result_data = []

for index, row in df.iterrows():
    result_row = {}
    for col in common_columns:
        if row[col] == lime_scores_df.loc[index, col]: # we compare the value of the current column in df with the corresponding value in lime_scores_df
            result_row[col] = row[col] # if they are equal, then we store that value in the newdataset for that column
        else:
            result_row[col] = "no match" # if they are not equal, then we store "no match"
            
    result_data.append(result_row)

result_df = pd.DataFrame(result_data, index=df.index, columns=common_columns)

no_match_columns = result_df.columns[result_df.eq("no match").all()]
result_df.drop(columns=no_match_columns, inplace=True) # from 138 to 106

cols_to_drop = result_df.columns[((result_df == "no match") | (result_df == "negative")).all(axis=0)]
result_df.drop(cols_to_drop, axis=1, inplace=True) # from 106 to 98

no_impact_columns = result_df.columns[result_df.eq("no impact").all()]
result_df.drop(columns=no_impact_columns, inplace=True) # from 98 to 94

result_df.head()
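
The cell-by-cell comparison above can also be written without explicit loops. The sketch below uses two made-up label tables with identical index and columns; the names `toy_shap_labels` and `toy_lime_labels` are placeholders, not objects from this notebook.

```python
import pandas as pd

ids = pd.Index([11, 12], name="SK_ID_CURR")  # made-up applicant ids
toy_shap_labels = pd.DataFrame(
    {"feat_a": ["positive", "negative"], "feat_b": ["positive", "positive"]}, index=ids
)
toy_lime_labels = pd.DataFrame(
    {"feat_a": ["positive", "positive"], "feat_b": ["negative", "positive"]}, index=ids
)

# Keep a label where both methods agree; otherwise mark the cell as "no match".
agreement = toy_shap_labels.where(toy_shap_labels == toy_lime_labels, "no match")

# Drop columns on which the two methods never agree.
agreement = agreement.loc[:, ~agreement.eq("no match").all()]
print(agreement)
```

The `==` comparison requires both frames to carry the same index and column labels, so the real tables would first need to be restricted to their common columns and rows.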

In [None]:
filtered_df = result_df.loc[453167, result_df.loc[453167] == 'positive']
filtered_df

In [None]:
filtered_df = result_df.loc[209065, result_df.loc[209065] == 'positive']
filtered_df

### 3.1 Insights - Exploratory Data Analysis

In [None]:
dataframe = pd.DataFrame()

def calculate_percentage(row, value):
    total = len(row)
    value_count = row.value_counts().get(value, 0)
    return (value_count / total) * 100

for value in ['positive', 'negative', 'no match', 'no impact']:
    percentage_column_name = f'percentage_{value}'
    dataframe[percentage_column_name] = result_df.apply(lambda row: calculate_percentage(row, value), axis=1)

del dataframe["percentage_no impact"]

dataframe= dataframe.round().astype(int)
dataframe.head()
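
The per-row percentages can be computed in a vectorized way, since the row-wise mean of a boolean mask is the share of matching cells. A small sketch on a made-up table (`toy_result` and its contents are illustrative only):

```python
import pandas as pd

toy_result = pd.DataFrame(
    {
        "feat_a": ["positive", "no match"],
        "feat_b": ["positive", "negative"],
        "feat_c": ["no match", "negative"],
        "feat_d": ["positive", "positive"],
    },
    index=pd.Index([11, 12], name="SK_ID_CURR"),  # made-up ids
)

# For each label, the row-wise mean of a boolean mask is the share of features.
shares = pd.DataFrame(
    {
        f"percentage_{label}": toy_result.eq(label).mean(axis=1) * 100
        for label in ["positive", "negative", "no match"]
    }
)
print(shares)
```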

In [None]:
column_averages = dataframe.mean()
print(column_averages)

In [None]:
result_df.replace({'no match': 'not applicable', 'negative': 'not applicable'}, inplace=True)

dataframe = pd.DataFrame()

def calculate_percentage(row, value):
    total = len(row)
    value_count = row.value_counts().get(value, 0)
    return (value_count / total) * 100

for value in ['positive', 'not applicable']:
    percentage_column_name = f'percentage_{value}'
    dataframe[percentage_column_name] = result_df.apply(lambda row: calculate_percentage(row, value), axis=1)

dataframe = dataframe.round().astype(int)
dataframe.head()

In [None]:
column_averages = dataframe.mean()
print(column_averages)

## 4. Combination of Chapter 3 with Conformal Prediction

In [None]:
cp = pd.read_csv("../data/processed/conformal_prediction.csv")
dataframe.reset_index(inplace=True)
display(dataframe.head())
display(cp.head())

In [None]:
merged_df = pd.merge(cp, dataframe, on='SK_ID_CURR', how='inner')
merged_df = merged_df[["SK_ID_CURR","percentage_positive","percentage_not applicable","level"]]
merged_df.set_index('SK_ID_CURR', inplace=True) 
merged_df.head()

In [None]:
merged_df["level"].value_counts()

In [None]:
means_per_group = merged_df.groupby('level').mean()
means_per_group
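
The merge-then-group step can be tried on toy data. Everything below is made up, including the `level` values, which only stand in for whatever the conformal prediction file contains; `validate="one_to_one"` is an optional safeguard that is not used in the cells above.

```python
import pandas as pd

# Made-up conformal-prediction output; the "level" values are hypothetical.
toy_cp = pd.DataFrame(
    {"SK_ID_CURR": [11, 12, 13, 14], "level": ["high", "low", "high", "low"]}
)
toy_pct = pd.DataFrame({"SK_ID_CURR": [11, 12, 13], "percentage_positive": [40, 10, 20]})

# validate="one_to_one" raises if an id is duplicated on either side.
toy_merged = pd.merge(toy_cp, toy_pct, on="SK_ID_CURR", how="inner", validate="one_to_one")

# Average agreement percentage within each level.
toy_means = toy_merged.groupby("level")["percentage_positive"].mean()
print(toy_means)
```

The inner join silently drops ids that appear in only one table (id 14 here), so comparing `len(toy_merged)` with the input sizes is a cheap sanity check.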

### Part 2

In [None]:
model_rf = joblib.load('../model/model_rf.joblib')
X_test = pd.read_csv('../data/processed/X_test_cp.csv',index_col=0)
X_train = pd.read_csv('../data/processed/X_remaining_cp.csv',index_col=0)

X_test_default = X_test.copy()
X_test_default['predictions'] = model_rf.predict(X_test)
X_test_default = X_test_default[X_test_default['predictions'] == 1]
X_test_default.reset_index(inplace=True) 

del X_test_default['predictions']
X_test_default.set_index('SK_ID_CURR', inplace=True)

X_test_default.head()

## 1. Guided Prototypes algorithm

### 1.1 Local level - Random row with predicted default class <a class="anchor" id="chapterdefaultrow"></a>

In [None]:
X_test = X_test.to_numpy()
X_train = X_train.to_numpy()

X = X_test[187].reshape((1,) + X_test[187].shape)
shape = X.shape

cf = CounterfactualProto(model_rf.predict_proba, shape)

cf.fit(X_train)

explanation = cf.explain(X)

CF = np.array([explanation['data']['cf']['X'][0]])

print(f'Original prediction: {explanation.orig_class}')
print(f'Counterfactual prediction: {explanation.cf["class"]}')

In [None]:
model_rf = joblib.load('../model/model_rf.joblib')
X_test = pd.read_csv('../data/processed/X_test_cp.csv',index_col=0)
X_train = pd.read_csv('../data/processed/X_remaining_cp.csv',index_col=0)

X_test_default = X_test.copy()
X_test_default['predictions'] = model_rf.predict(X_test)
X_test_default = X_test_default[X_test_default['predictions'] == 1]
X_test_default.reset_index(inplace=True) 

del X_test_default['predictions']
X_test_default.set_index('SK_ID_CURR', inplace=True)

X_test_default.head()

In [None]:
X_reshaped = X.flatten() # Reshape X and CF to be 1-dimensional arrays
CF_reshaped = CF.flatten()

differences = CF_reshaped - X_reshaped

df_diff = pd.DataFrame({'Features': X_test.columns,'Original': X_reshaped,'CF': CF_reshaped,'Difference': differences})
df_diff_filtered = df_diff[df_diff['Difference'] != 0].copy() # copy so that adding a column does not raise SettingWithCopyWarning
df_diff_filtered['Result'] = df_diff_filtered['Difference'].apply(lambda x: 'higher' if x > 0 else 'lower')
print(df_diff_filtered.shape)

df_diff_filtered.head()
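
The comparison between an instance and its counterfactual can be reproduced on a three-feature toy example. The arrays and feature names are made up; in the notebook the corresponding objects are `X`, `CF` and `X_test.columns`.

```python
import numpy as np
import pandas as pd

toy_columns = ["feat_a", "feat_b", "feat_c"]         # hypothetical feature names
toy_original = np.array([[0.20, 1.00, 0.50]])        # instance, shape (1, n_features)
toy_counterfactual = np.array([[0.35, 1.00, 0.10]])  # made-up counterfactual

toy_diff = pd.DataFrame(
    {
        "Features": toy_columns,
        "Original": toy_original.ravel(),
        "CF": toy_counterfactual.ravel(),
    }
)
toy_diff["Difference"] = toy_diff["CF"] - toy_diff["Original"]

# .copy() makes the filtered frame independent, so adding a column is safe.
toy_changed = toy_diff[toy_diff["Difference"] != 0].copy()
toy_changed["Result"] = np.where(toy_changed["Difference"] > 0, "higher", "lower")
print(toy_changed)
```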

### 1.2 Local level - for all default predictions <a class="anchor" id="chapteralldefaultrows"></a>

In [None]:
cf = CounterfactualProto(model_rf.predict_proba, shape)
cf.fit(X_train.to_numpy()) # X_train was reloaded as a DataFrame above; pass a NumPy array as in section 1.1

results = []

for i, row in tqdm(X_test_default.iterrows(), total=len(X_test_default)):

    X = row.values.reshape(1, -1)  # Reshape the row to be 2-dimensional
    explanation = cf.explain(X)
    
    if explanation is not None and explanation.cf is not None and 'X' in explanation.cf:

        differences = explanation.cf['X'] - X
        
        feature_results = []
        for j, diff in enumerate(differences.flatten()):
            if diff == 0:
                feature_results.append('not applicable')
            elif diff < 0:
                feature_results.append('lower')
            else:
                feature_results.append('higher')
 
        results.append(feature_results)

    else:

        results.append(['not found'] * len(X_test.columns))  # If no counterfactual explanation was found, append 'not found' for all features

result_df = pd.DataFrame(results, columns=X_test.columns)
print(result_df.shape)
result_df.head()

In [None]:
columns_to_drop = result_df.columns[result_df.eq('not applicable').all()] # We remove the columns that had no difference
result_df = result_df.drop(columns=columns_to_drop)

## 2. CEML algorithm

### 2.1 Local level - Random row with predicted default class <a class="anchor" id="chapterdefaultrow"></a>

In [None]:
X_test = X_test.to_numpy()
x = X_test[187,:]

result_ceml = generate_counterfactual(model_rf, x, y_target=0)
x_cf_array = result_ceml['x_cf']

In [None]:
feature_names = X_test_default.columns.tolist()
df_ceml = pd.DataFrame({0: x_cf_array})

df_ceml_transposed = df_ceml.T
df_ceml_transposed.columns = feature_names
df_ceml_transposed["DAYS_LAST_PHONE_CHANGE_app"]

In [None]:
feature_names = X_test_default.columns.tolist()
df_ceml = pd.DataFrame({0: x_cf_array})

df_ceml_transposed = df_ceml.T
df_ceml_transposed.columns = feature_names

def format_non_scientific(x):
    return "{:.10f}".format(x)

pd.set_option('display.float_format', format_non_scientific)

df_ceml_transposed

In [None]:
feature_names = X_test_default.columns.tolist()
df_ceml = pd.DataFrame({0: x_cf_array})

df_ceml_transposed = df_ceml.T
df_ceml_transposed.columns = feature_names

discrete_numerical_features = ['TARGET_app', 'NAME_CONTRACT_TYPE_app', 'CODE_GENDER_app', 'FLAG_OWN_CAR_app', 'FLAG_OWN_REALTY_app', 'CNT_CHILDREN_app', 'FLAG_MOBIL_app', 'FLAG_EMP_PHONE_app', 'FLAG_WORK_PHONE_app', 'FLAG_CONT_MOBILE_app', 'FLAG_PHONE_app', 'FLAG_EMAIL_app', 'HOUR_APPR_PROCESS_START_app', 'REG_REGION_NOT_LIVE_REGION_app', 'REG_REGION_NOT_WORK_REGION_app', 'REG_CITY_NOT_LIVE_CITY_app', 'LIVE_CITY_NOT_WORK_CITY_app', 'OBS_30_CNT_SOCIAL_CIRCLE_app', 'OBS_60_CNT_SOCIAL_CIRCLE_app', 'AMT_REQ_CREDIT_BUREAU_YEAR_app', 'CREDIT_CURRENCY_bur', 'NAME_TYPE_SUITE_app_Children', 'NAME_TYPE_SUITE_app_Family', 'NAME_TYPE_SUITE_app_Group of people', 'NAME_TYPE_SUITE_app_Other_A', 'NAME_TYPE_SUITE_app_Other_B', 'NAME_TYPE_SUITE_app_Spouse, partner', 'NAME_TYPE_SUITE_app_Unaccompanied', 'NAME_INCOME_TYPE_app_Businessman', 'NAME_INCOME_TYPE_app_Commercial associate', 'NAME_INCOME_TYPE_app_State servant', 'NAME_INCOME_TYPE_app_Student', 'NAME_INCOME_TYPE_app_Working', 'NAME_EDUCATION_TYPE_app_Academic degree', 'NAME_EDUCATION_TYPE_app_Higher education', 'NAME_EDUCATION_TYPE_app_Incomplete higher', 'NAME_EDUCATION_TYPE_app_Lower secondary', 'NAME_EDUCATION_TYPE_app_Secondary / secondary special', 'NAME_FAMILY_STATUS_app_Civil marriage', 'NAME_FAMILY_STATUS_app_Married', 'NAME_FAMILY_STATUS_app_Separated', 'NAME_FAMILY_STATUS_app_Single / not married', 'NAME_FAMILY_STATUS_app_Widow', 'NAME_HOUSING_TYPE_app_Co-op apartment', 'NAME_HOUSING_TYPE_app_House / apartment', 'NAME_HOUSING_TYPE_app_Municipal apartment', 'NAME_HOUSING_TYPE_app_Office apartment', 'NAME_HOUSING_TYPE_app_Rented apartment', 'NAME_HOUSING_TYPE_app_With parents', 'REGION_RATING_CLIENT_app_1', 'REGION_RATING_CLIENT_app_2', 'REGION_RATING_CLIENT_app_3', 'WEEKDAY_APPR_PROCESS_START_app_FRIDAY', 'WEEKDAY_APPR_PROCESS_START_app_MONDAY', 'WEEKDAY_APPR_PROCESS_START_app_SATURDAY', 'WEEKDAY_APPR_PROCESS_START_app_SUNDAY', 'WEEKDAY_APPR_PROCESS_START_app_THURSDAY', 'WEEKDAY_APPR_PROCESS_START_app_TUESDAY', 
'WEEKDAY_APPR_PROCESS_START_app_WEDNESDAY', 'ORGANIZATION_TYPE_app_Advertising', 'ORGANIZATION_TYPE_app_Agriculture', 'ORGANIZATION_TYPE_app_Bank', 'ORGANIZATION_TYPE_app_Business Entity Type 1', 'ORGANIZATION_TYPE_app_Business Entity Type 2', 'ORGANIZATION_TYPE_app_Business Entity Type 3', 'ORGANIZATION_TYPE_app_Cleaning', 'ORGANIZATION_TYPE_app_Construction', 'ORGANIZATION_TYPE_app_Culture', 'ORGANIZATION_TYPE_app_Electricity', 'ORGANIZATION_TYPE_app_Emergency', 'ORGANIZATION_TYPE_app_Government', 'ORGANIZATION_TYPE_app_Hotel', 'ORGANIZATION_TYPE_app_Housing', 'ORGANIZATION_TYPE_app_Industry: type 1', 'ORGANIZATION_TYPE_app_Industry: type 10', 'ORGANIZATION_TYPE_app_Industry: type 11', 'ORGANIZATION_TYPE_app_Industry: type 12', 'ORGANIZATION_TYPE_app_Industry: type 13', 'ORGANIZATION_TYPE_app_Industry: type 2', 'ORGANIZATION_TYPE_app_Industry: type 3', 'ORGANIZATION_TYPE_app_Industry: type 4', 'ORGANIZATION_TYPE_app_Industry: type 5', 'ORGANIZATION_TYPE_app_Industry: type 6', 'ORGANIZATION_TYPE_app_Industry: type 7', 'ORGANIZATION_TYPE_app_Industry: type 9', 'ORGANIZATION_TYPE_app_Insurance', 'ORGANIZATION_TYPE_app_Kindergarten', 'ORGANIZATION_TYPE_app_Legal Services', 'ORGANIZATION_TYPE_app_Medicine', 'ORGANIZATION_TYPE_app_Military', 'ORGANIZATION_TYPE_app_Mobile', 'ORGANIZATION_TYPE_app_Other', 'ORGANIZATION_TYPE_app_Police', 'ORGANIZATION_TYPE_app_Postal', 'ORGANIZATION_TYPE_app_Realtor', 'ORGANIZATION_TYPE_app_Religion', 'ORGANIZATION_TYPE_app_Restaurant', 'ORGANIZATION_TYPE_app_School', 'ORGANIZATION_TYPE_app_Security', 'ORGANIZATION_TYPE_app_Security Ministries', 'ORGANIZATION_TYPE_app_Self-employed', 'ORGANIZATION_TYPE_app_Services', 'ORGANIZATION_TYPE_app_Telecom', 'ORGANIZATION_TYPE_app_Trade: type 1', 'ORGANIZATION_TYPE_app_Trade: type 2', 'ORGANIZATION_TYPE_app_Trade: type 3', 'ORGANIZATION_TYPE_app_Trade: type 4', 'ORGANIZATION_TYPE_app_Trade: type 5', 'ORGANIZATION_TYPE_app_Trade: type 6', 'ORGANIZATION_TYPE_app_Trade: type 7', 
'ORGANIZATION_TYPE_app_Transport: type 1', 'ORGANIZATION_TYPE_app_Transport: type 2', 'ORGANIZATION_TYPE_app_Transport: type 3', 'ORGANIZATION_TYPE_app_Transport: type 4', 'ORGANIZATION_TYPE_app_University', 'CREDIT_ACTIVE_bur_Active', 'CREDIT_ACTIVE_bur_Closed', 'CREDIT_ACTIVE_bur_Sold', 'CREDIT_TYPE_bur_Another type of loan', 'CREDIT_TYPE_bur_Car loan', 'CREDIT_TYPE_bur_Consumer credit', 'CREDIT_TYPE_bur_Credit card', 'CREDIT_TYPE_bur_Loan for business development', 'CREDIT_TYPE_bur_Microloan', 'CREDIT_TYPE_bur_Mortgage']
filtered_columns = list(set(discrete_numerical_features) & set(df_ceml_transposed.columns))

filtered_ceml = df_ceml_transposed.drop(columns=filtered_columns)

df_ceml_original = pd.DataFrame({0: x})

df_ceml_transposed_original = df_ceml_original.T
df_ceml_transposed_original.columns = feature_names

filtered_columns = list(set(discrete_numerical_features) & set(df_ceml_transposed_original.columns))
filtered_original = df_ceml_transposed_original.drop(columns=filtered_columns)

filtered_original = filtered_original.astype(float)
filtered_ceml = filtered_ceml.astype(float)

threshold = 0.0000000001
absolute_diff = np.abs(filtered_ceml - filtered_original)
subtracted_df = (filtered_ceml - filtered_original).mask(absolute_diff < threshold, 0)

columns = subtracted_df.columns[subtracted_df.eq(0).all()]
subtracted_df.drop(columns=columns, inplace=True) # we drop the columns with 0 difference
display(subtracted_df)
subtracted_df = subtracted_df.applymap(lambda x: "higher" if x > 0 else ("lower" if x < 0 else x))
display(subtracted_df)
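
The thresholding used above, where differences below a small tolerance are treated as numerical noise, can be combined with the direction labels in one step. A sketch with made-up values; `np.select` is one possible way to express it and is not what the cell above uses.

```python
import numpy as np

threshold = 1e-10  # same tolerance as in the cell above
toy_original = np.array([0.20, 1.00, 0.50, 0.30])
toy_counterfactual = np.array([0.35, 1.00 + 1e-13, 0.10, 0.30])  # made-up values

delta = toy_counterfactual - toy_original

# Differences smaller than the threshold are treated as numerical noise.
labels = np.select(
    [np.abs(delta) < threshold, delta > 0],
    ["not applicable", "higher"],
    default="lower",
)
print(labels.tolist())
```

The conditions are evaluated in order, so a tiny positive difference is labelled "not applicable" rather than "higher".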

### 2.2 Local level - for all default predictions <a class="anchor" id="chapteralldefaultrows"></a>

In [None]:
classification_df = pd.DataFrame(index=X_test_default.index, columns=X_test_default.columns)

for index, row in tqdm(X_test_default.iterrows(), total=len(X_test_default)):

    row = row.values
    result_ceml = generate_counterfactual(model_rf, row, y_target=0)
    

    if result_ceml is not None and 'x_cf' in result_ceml:
        x_cf_array = result_ceml['x_cf']
        differences = x_cf_array - row
        
        feature_results = []
        for j, diff in enumerate(differences.flatten()):
            if diff == 0:
                feature_results.append('not applicable')
            elif diff < 0:
                feature_results.append('lower')
            else:
                feature_results.append('higher')
        
        classification_df.loc[index] = feature_results

    else:

        classification_df.loc[index] = ['not found'] * len(X_test_default.columns)

classification_df

In [None]:
common_columns = result_df.columns.intersection(classification_df.columns)

# Compare the two label tables row by row, by position. Columns are selected by
# label, so names with spaces or colons are handled correctly (itertuples()
# renames such columns to positional names, which breaks getattr()).
left = result_df[common_columns].to_numpy()
right = classification_df[common_columns].to_numpy()

# Keep the label where both methods agree, otherwise store "no match".
result = pd.DataFrame(np.where(left == right, left, "no match"), index=classification_df.index, columns=common_columns)

columns_to_drop = result.columns[result.eq('not applicable').all()]
result = result.drop(columns=columns_to_drop)
no_match_columns = result.columns[result.eq("no match").all()]
result.drop(columns=no_match_columns, inplace=True)

result.head()

In [None]:
result.replace({'no match': 'not applicable'}, inplace=True)

## 3. Insights - Exploratory Data Analysis

In [None]:
dataframe = pd.DataFrame()

def calculate_percentage(row, value):
    total = len(row)
    value_count = row.value_counts().get(value, 0)
    return (value_count / total) * 100

for value in ['higher', 'lower', 'not applicable']:
    percentage_column_name = f'percentage_{value}'
    dataframe[percentage_column_name] = result.apply(lambda row: calculate_percentage(row, value), axis=1)

dataframe= dataframe.round().astype(int)
dataframe.head()

In [None]:
column_averages = dataframe.mean()
print(column_averages)

## 4. Combination of Chapter 3 with Conformal Prediction

In [None]:
cp = pd.read_csv("../data/processed/conformal_prediction.csv")
dataframe.reset_index(inplace=True)
display(dataframe.head())
display(cp.head())

In [None]:
merged_df = pd.merge(cp, dataframe, on='SK_ID_CURR', how='inner')
merged_df = merged_df[["SK_ID_CURR","percentage_higher","percentage_lower","percentage_not applicable","level"]]
merged_df.set_index('SK_ID_CURR', inplace=True) 
merged_df.head()

In [None]:
means_per_group = merged_df.groupby('level').mean()
means_per_group