# Counterfactual Explanations for Tabular Data

In this notebook, we use the `dice-ml` library to generate counterfactual explanations for tabular data. Counterfactual explanations help explain a model's prediction by evaluating the necessity and sufficiency of features in determining the model's output. A feature is necessary if changing it changes the model's prediction while all other features are held constant, and it is sufficient if the model's output cannot change while that feature remains unchanged [1]. For more details on this method, see the [DiCE getting-started notebook](https://interpret.ml/DiCE/notebooks/DiCE_getting_started.html).
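
The two definitions above can be made concrete with a small brute-force check. The sketch below is illustrative only: `toy_model`, the feature names, and the value grid are hypothetical and are not part of DiCE, which estimates these properties from generated counterfactuals rather than by exhaustive enumeration.

```python
from itertools import product

def toy_model(x):
    """Hypothetical classifier: predicts 1 when there are two or more prior visits."""
    return int(x["visits"] >= 2)

def is_necessary(model, x, feature, grid):
    """True if changing only `feature` (others held constant) can flip the prediction."""
    base = model(x)
    return any(model({**x, feature: v}) != base for v in grid[feature])

def is_sufficient(model, x, feature, grid):
    """True if no combination of the other features flips the prediction
    while `feature` keeps its original value."""
    base = model(x)
    others = [f for f in grid if f != feature]
    for combo in product(*(grid[f] for f in others)):
        candidate = {**x, **dict(zip(others, combo))}
        if model(candidate) != base:
            return False
    return True

# Hypothetical query instance and candidate values for each feature
query = {"visits": 3, "age": 65}
grid = {"visits": [0, 1, 2, 3], "age": [25, 45, 65, 85]}

visits_necessary = is_necessary(toy_model, query, "visits", grid)
visits_sufficient = is_sufficient(toy_model, query, "visits", grid)
age_necessary = is_necessary(toy_model, query, "age", grid)
age_sufficient = is_sufficient(toy_model, query, "age", grid)
```

Because the toy model ignores `age`, only `visits` turns out to be both necessary and sufficient for this query.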

## 1. Importing Required Libraries

In [1]:
import dice_ml
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier

import warnings
warnings.filterwarnings("ignore")

## 2. Data Loading and Preprocessing
In this notebook, we use the US-130 dataset, a structured medical dataset containing clinical measurements collected from 1999 to 2008 from patients diagnosed with diabetes. The dataset provides information about patient health indicators, clinical variables, and potential risk factors associated with diabetes, including demographic information, lab results, diabetes-related indicators, and medication history. Machine learning models can be applied to this dataset to estimate diabetes-related outcomes, e.g., hospital readmission. For more information, see the [UCI dataset page](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008).

In [2]:
# preprocessing function
def process_us_130_csv(df: pd.DataFrame) -> tuple[pd.DataFrame, list]:
    """
    Preprocess the raw US-130 data downloaded from UCI.

    Returns the encoded DataFrame and the list of numerical column names.
    """
    age_transform = { '[0-10)' : 5,
                      '[10-20)' : 15,
                      '[20-30)' : 25,
                      '[30-40)' : 35,
                      '[40-50)' : 45,
                      '[50-60)' : 55,
                      '[60-70)' : 65,
                      '[70-80)' : 75,
                      '[80-90)' : 85,
                      '[90-100)' : 95
                    }
    
    # Apply column-specific transformations
    df['age'] = df['age'].apply(lambda x: age_transform[x])
    # Keep only the part of each diagnosis code before the decimal point;
    # codes without a decimal point are left unchanged
    df['diag_1'] = df['diag_1'].apply(lambda x: x.split(".")[0])
    df['diag_2'] = df['diag_2'].apply(lambda x: x.split(".")[0])
    df['diag_3'] = df['diag_3'].apply(lambda x: x.split(".")[0])
    # Binarize the target: 1 if readmitted within 30 days, 0 otherwise
    df['readmitted_binarized'] = df['readmitted'].apply(lambda x: 1 if x == '<30' else 0)
    # Impute missing (non-string) values with an explicit 'Unknown' category
    df['max_glu_serum'] = df['max_glu_serum'].apply(lambda x: x if isinstance(x, str) else 'Unknown')
    df['A1Cresult'] = df['A1Cresult'].apply(lambda x: x if isinstance(x, str) else 'Unknown')

    #Drop columns which are not needed
    df = df.drop(['encounter_id', 'patient_nbr', 'examide',
                  'readmitted','weight','payer_code', 'medical_specialty'], axis=1)

    # Frequency encoding of categorical columns: each unique value in a
    # categorical column is replaced with its relative frequency in that column.
    categorical_columns = df.select_dtypes(include=['object', 'category']).columns.tolist()   

    for col in df.columns:
        if df[col].nunique() < 10 and col not in categorical_columns:
            categorical_columns.append(col)

    numerical_columns = [col for col in df.columns if col not in categorical_columns]  
    for cat_column in categorical_columns:
        if cat_column!="readmitted_binarized":
            frequency_encoding = df[cat_column].value_counts(normalize=True).to_dict()
            df[f'encoded_{cat_column}'] = df[cat_column].map(frequency_encoding)
            df = df.drop(cat_column, axis=1)

    return df, numerical_columns
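
The frequency-encoding step used in the function above can be seen in isolation on a toy column. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical categorical column with five rows
toy = pd.DataFrame({"insulin": ["No", "Steady", "No", "Up", "No"]})

# Relative frequency of each category, e.g. "No" appears in 3 of 5 rows
frequency_encoding = toy["insulin"].value_counts(normalize=True).to_dict()

# Replace each category with its relative frequency
toy["encoded_insulin"] = toy["insulin"].map(frequency_encoding)
encoded_values = toy["encoded_insulin"].tolist()
```

One consequence worth keeping in mind when reading counterfactuals: categories that occur equally often (here `Steady` and `Up`) receive the same encoded value and can no longer be told apart.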

In [3]:
# data loading and preparation
data = pd.read_csv("../datasets/diabetes_130/diabetic_data.csv")
df, numerical_columns= process_us_130_csv(data)
df.index=range(df.shape[0])
df.head() # printing the first five rows of the dataset

| | age | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | ... | encoded_insulin | encoded_glyburide-metformin | encoded_glipizide-metformin | encoded_glimepiride-pioglitazone | encoded_metformin-rosiglitazone | encoded_metformin-pioglitazone | encoded_change | encoded_diabetesMed | encoded_admission_type_id | encoded_num_procedures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 25 | 1 | 1 | 41 | 1 | 0 | 0 | 0 | 1 | ... | 0.465607 | 0.993063 | 0.999872 | 0.99999 | 0.99998 | 0.99999 | 0.538048 | 0.229969 | 0.051992 | 0.458424 |
| 1 | 15 | 1 | 7 | 3 | 59 | 18 | 0 | 0 | 0 | 9 | ... | 0.111196 | 0.993063 | 0.999872 | 0.99999 | 0.99998 | 0.99999 | 0.461952 | 0.770031 | 0.530531 | 0.458424 |
| 2 | 25 | 1 | 7 | 2 | 11 | 13 | 2 | 0 | 1 | 6 | ... | 0.465607 | 0.993063 | 0.999872 | 0.99999 | 0.99998 | 0.99999 | 0.538048 | 0.770031 | 0.530531 | 0.030246 |
| 3 | 35 | 1 | 7 | 2 | 44 | 16 | 0 | 0 | 0 | 7 | ... | 0.111196 | 0.993063 | 0.999872 | 0.99999 | 0.99998 | 0.99999 | 0.461952 | 0.770031 | 0.530531 | 0.203821 |
| 4 | 45 | 1 | 7 | 1 | 51 | 8 | 0 | 0 | 0 | 5 | ... | 0.303137 | 0.993063 | 0.999872 | 0.99999 | 0.99998 | 0.99999 | 0.461952 | 0.770031 | 0.530531 | 0.458424 |


## 3. Model Training

In this section, we train an Explainable Boosting Machine (EBM) to predict the risk of readmission within 30 days of discharge. EBM is a glass-box machine learning model designed to balance predictive performance with interpretability.
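
An EBM is interpretable because it is a generalized additive model: it learns one shape function per feature (optionally with pairwise interaction terms) and sums their contributions before applying a link function. The sketch below illustrates that structure only. The shape functions and the intercept are hypothetical hand-written values, not anything learned by `ExplainableBoostingClassifier`.

```python
import math

# Hypothetical per-feature shape functions (an EBM learns these from data)
shape_functions = {
    "number_inpatient": lambda v: 0.4 * min(v, 5),
    "time_in_hospital": lambda v: 0.05 * v,
}
intercept = -2.0  # hypothetical baseline log-odds

def additive_predict_proba(row):
    """Sum the per-feature contributions and map the log-odds to a probability."""
    contributions = {name: f(row[name]) for name, f in shape_functions.items()}
    logit = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions

low_risk, low_parts = additive_predict_proba({"number_inpatient": 0, "time_in_hospital": 1})
high_risk, high_parts = additive_predict_proba({"number_inpatient": 4, "time_in_hospital": 7})
```

Each entry in the returned `contributions` dictionary shows how much a single feature moved the score, which is what makes this family of models straightforward to inspect.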

In [None]:
X , y = df.drop("readmitted_binarized",axis=1) , df["readmitted_binarized"]
k = 2  # number of cross-validation folds
indices = np.arange(len(X))
np.random.shuffle(indices)
folds = np.array_split(indices, k)
auc_scores = []
f1_scores = []
recall_scores = []
precision_scores = []

for i in range(k):
    test_idx = folds[i]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    # Use all other folds as the training set
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    # Note: despite the variable name, this is an Explainable Boosting Machine, not XGBoost
    xgb_clf = ExplainableBoostingClassifier()

    xgb_clf.fit(X_train.values, y_train.values)
    y_pred = xgb_clf.predict(X_test.values)

    y_prob = xgb_clf.predict_proba(X_test.values)[:,1]

    test_auc = roc_auc_score(y_test, y_prob)
    test_f1 = f1_score(y_test,y_pred)
    test_precision = precision_score(y_test,y_pred)
    test_recall = recall_score(y_test,y_pred)
    
    auc_scores.append(test_auc)
    f1_scores.append(test_f1)
    recall_scores.append(test_recall)
    precision_scores.append(test_precision)
    
print("mean_test_auc",np.mean(auc_scores))
print("mean_test_f1",np.mean(f1_scores))
print("mean_test_precision",np.mean(precision_scores))
print("mean_test_recall",np.mean(recall_scores))
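
Precision and recall are easy to confuse when they are tracked in separate lists, so it can help to recall how each is defined. The following is a minimal pure-Python sketch of the definitions on made-up labels. It is not how scikit-learn implements `precision_score`, `recall_score`, or `f1_score`, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of the predicted positives, how many are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the true positives, how many were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up labels: three true readmissions, two predicted readmissions
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 1, 0, 0, 0, 0]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(y_true_toy, y_pred_toy)
```

On an imbalanced target such as 30-day readmission, the two metrics can differ substantially, which is why they are reported separately alongside AUC.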

    

In [None]:
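def counter(train_data, test_data, model, cont_features_list, outcome):
    """Generate and display counterfactuals for the first row of test_data."""
    # Wrap the training data (features plus outcome column) and the fitted model for DiCE
    d = dice_ml.Data(dataframe=train_data, continuous_features=cont_features_list, outcome_name=outcome)
    m = dice_ml.Model(model=model, backend="sklearn")
    # Use DiCE's random-sampling method to search for counterfactuals
    exp = dice_ml.Dice(d, m, method="random")
    # Request five counterfactuals that flip the predicted class of the first test instance
    e = exp.generate_counterfactuals(test_data[0:1], total_CFs=5, desired_class="opposite")
    # Display only the feature values that differ from the original instance
    e.visualize_as_dataframe(show_only_changes=True)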
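
To give an intuition for what a random counterfactual search does, the sketch below perturbs randomly chosen features of a query until the prediction flips. It is a simplified illustration and not DiCE's actual implementation. The model `toy_model`, the helper `random_counterfactuals`, and the feature ranges are all hypothetical.

```python
import random

def toy_model(x):
    """Hypothetical risk rule: predicts readmission (1) when the score reaches 20."""
    return int(8 * x["number_inpatient"] + x["time_in_hospital"] >= 20)

def random_counterfactuals(model, query, feature_ranges, total_cfs=3, max_tries=10_000, seed=0):
    """Randomly resample a subset of features until the predicted class flips."""
    rng = random.Random(seed)
    original = model(query)
    features = list(feature_ranges)
    found = []
    for _ in range(max_tries):
        if len(found) == total_cfs:
            break
        candidate = dict(query)
        # Perturb a random, non-empty subset of the features
        for name in rng.sample(features, rng.randint(1, len(features))):
            low, high = feature_ranges[name]
            candidate[name] = rng.randint(low, high)
        # Keep the candidate only if it flips the prediction and is new
        if model(candidate) != original and candidate not in found:
            found.append(candidate)
    return found

query = {"number_inpatient": 3, "time_in_hospital": 4}
ranges = {"number_inpatient": (0, 5), "time_in_hospital": (1, 14)}
counterfactuals = random_counterfactuals(toy_model, query, ranges)
```

Each returned dictionary is a modified copy of `query` that the toy model assigns to the opposite class. Comparing it with the original shows which feature changes were enough to flip the prediction.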

In [None]:
# DiCE expects the outcome column to be part of the training dataframe
X_train["readmitted_binarized"] = y_train

In [None]:
counter(X_train,X_test,xgb_clf,numerical_columns,"readmitted_binarized")

As can be observed, perturbing features such as **"number_inpatient"** and **"encoded_diabetesMed"** can change the prediction.

number_inpatient: Number of inpatient visits of the patient in the year preceding the encounter [2]. Patients with multiple previous inpatient visits may have severe conditions, leading to further readmissions.



encoded_diabetesMed: Indicates whether any diabetic medication was prescribed. Values: "yes" and "no" [2]. The prescription of diabetic medication can indicate the severity of the patient's condition and can therefore affect the likelihood of readmission.

## References

[1] https://interpret.ml/DiCE/notebooks/DiCE_getting_started.html

[2] https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008

