## Setting Up and Importing Libraries
In this step, we import the libraries and modules the notebook needs and set the working directory so that our custom modules can be found.

In [1]:
# Import necessary libraries
import pandas as pd
import os
import sys
import json


In [2]:
# Move up one level so that the working directory is the project root.
# Run this cell only once: each re-run moves up one more directory.
os.chdir(os.path.join(os.getcwd(), '..'))
print(os.getcwd())

/home/rita/TRUST_AI/trustframework/trustCE
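
Because the relative `os.chdir` call depends on where the notebook was started, a more defensive variant is sketched here. It assumes the `codice` package directory sits directly under the project root, which the imports in the next cell suggest but do not confirm. The helper name `find_project_root` is hypothetical and not part of the library.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "codice") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is assumed to be the package directory under the project root.
    Returns None when no such directory exists.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


# Possible usage in the notebook (safe to re-run, unlike a relative chdir):
# root = find_project_root(Path.cwd())
# if root is not None:
#     os.chdir(root)
```

Searching upward for a marker directory gives the same result no matter how many times the cell is executed.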


In [3]:
from codice.cfsearch import CFsearch
from codice.dataset import Dataset
from codice.cemodels.base_model import BaseModel
from codice.cemodels.sklearn_model import SklearnModel
from codice.ceinstance.instance_sampler import CEInstanceSampler
from codice.config import Config
from codice.transformer import Transformer
from codice.ceinstance.instance_factory import InstanceFactory
from codice import load_datasets

## Loading Configuration
Here, we load the configuration files that set the parameters for our counterfactual search: a YAML file covering dataset details, feature management, and related settings, and a JSON file with feature constraints.

In [4]:
# Load configuration
config_file_path = "config/conf_diabetes.yaml"
config = Config(config_file_path)

with open("config/constraints_conf_diabetes.json", 'r') as file:
    constraints = json.load(file)

print("Configuration Loaded:")
print(config)

Configuration Loaded:
<codice.config.Config object at 0x7f7197983ca0>
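
Printing the `Config` object only shows its memory address. One way to inspect the loaded settings is to query individual sections through `get_config_value`, the accessor used later in this notebook. The sketch below is self-contained, so it uses a hypothetical stand-in class (`DemoConfig`) with placeholder values. The real `Config` may behave differently, for example when a key is missing.

```python
from typing import Any, Dict, Iterable


class DemoConfig:
    """Hypothetical stand-in that only mimics Config.get_config_value."""

    def __init__(self, values: Dict[str, Any]) -> None:
        self._values = values

    def get_config_value(self, key: str) -> Any:
        return self._values[key]


def summarize_config(config: Any, keys: Iterable[str]) -> Dict[str, Any]:
    """Collect the requested top-level sections into a plain dictionary."""
    return {key: config.get_config_value(key) for key in keys}


# Placeholder values; the actual contents live in config/conf_diabetes.yaml.
demo = DemoConfig({"dataset": "<dataset settings>", "output_folder": "results"})
summary = summarize_config(demo, ["dataset", "output_folder"])
for key, value in summary.items():
    print(f"{key}: {value}")
```

In the notebook itself, `summarize_config(config, ["dataset", "model", "cfsearch", "output_folder"])` would cover the sections that later cells read.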


## Preparing Dataset and Model
In this section, we initialize our dataset, model, and the required transformers. We'll also define a sample instance for which we wish to find the counterfactuals.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


# Load the dataset and train the model
# (named `df` rather than `input` to avoid shadowing the Python built-in)
df = pd.read_csv('datasets/diabetes.csv', sep=',')

# Get X and y
print(df.columns)
X = df.drop(['Class variable'], axis=1)
y = df['Class variable']
#scaler = MinMaxScaler()
#X_normalized = scaler.fit_transform(X)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

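# Optimized parameters
# Note: these values are defined for reference but are not passed to
# LogisticRegression below, so the model is fit with scikit-learn's defaults
# (apart from max_iter). The accuracies printed by this cell reflect that.
C_optimized = 0.23357214690901212
class_weight_optimized = 'balanced'
solver_optimized = 'liblinear'
model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])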
model_pipeline.fit(X_train, y_train)
# Check the accuracy of the model
print("Accuracy on training set: ", model_pipeline.score(X_train, y_train))
print("Accuracy on test set: ", model_pipeline.score(X_test, y_test))

Index(['Number of times pregnant',
       'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
       'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age (years)', 'Class variable'],
      dtype='object')
Accuracy on training set:  0.7756874095513748
Accuracy on test set:  0.7012987012987013
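
The cell above imports `f1_score`, `recall_score`, and `precision_score` but reports only accuracy. As a reminder of what those metrics measure for a binary target, here is a dependency-free sketch that computes them from confusion counts. The labels are toy values chosen for illustration, not predictions from the diabetes model, and `binary_metrics` is a hypothetical helper.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = positive)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

In the notebook, the already imported scikit-learn functions can be applied directly, for example `f1_score(y_test, model_pipeline.predict(X_test))`.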


In [6]:
# Set the target instance path
target_instance_json = "input_instance/instance_diabetes.json"

# Load the dataset and set up the necessary objects
load_datasets.download("diabetes")

In [7]:
data = Dataset(config.get_config_value("dataset"), "Class variable")
normalization_transformer = Transformer(data, config)
instance_factory = InstanceFactory(data)
sampler = CEInstanceSampler(config, normalization_transformer, instance_factory)

model = BaseModel(config.get_config_value("model"), model_pipeline)

Features verified
Continious features: ['Number of times pregnant', 'Plasma glucose concentration a 2 hours in an oral glucose tolerance test', 'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)', '2-Hour serum insulin (mu U/ml)', 'Body mass index (weight in kg/(height in m)^2)', 'Diabetes pedigree function', 'Age (years)']
Categorical features: []
Dataset preprocessed
Feature: Diastolic blood pressure (mm Hg)
Range: [0, 122]
Feature: Age (years)
Range: [21, 81]
Feature: Diabetes pedigree function
Range: [0.078, 2.42]
Feature: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
Range: [0, 199]
Feature: Triceps skin fold thickness (mm)
Range: [0, 99]
Feature: Body mass index (weight in kg/(height in m)^2)
Range: [0.0, 67.1]
Feature: 2-Hour serum insulin (mu U/ml)
Range: [0, 846]
Feature: Number of times pregnant
Range: [0, 17]
Constraint Type: immutable
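
The ranges printed above are what a normalization step would need in order to put features on a common scale. The notebook does not show how `Transformer` uses them, so the following is only a sketch of plain min-max scaling under that assumption. The function names are hypothetical, and the three sample values are taken from the target instance displayed later in the notebook.

```python
from typing import Dict, Tuple

# Three of the feature ranges printed by the Dataset output above.
RANGES: Dict[str, Tuple[float, float]] = {
    "Age (years)": (21, 81),
    "Diastolic blood pressure (mm Hg)": (0, 122),
    "Diabetes pedigree function": (0.078, 2.42),
}


def min_max_scale(value: float, low: float, high: float) -> float:
    """Map `value` from [low, high] to [0, 1]; constant features map to 0."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


def min_max_unscale(scaled: float, low: float, high: float) -> float:
    """Inverse of min_max_scale."""
    return low + scaled * (high - low)


# Values of the target instance for these three features.
sample = {
    "Age (years)": 33,
    "Diastolic blood pressure (mm Hg)": 40,
    "Diabetes pedigree function": 2.288,
}
scaled = {name: min_max_scale(value, *RANGES[name]) for name, value in sample.items()}
for name, value in scaled.items():
    print(f"{name}: {value:.3f}")
```

Scaling by the observed range keeps distances between instances comparable across features with very different units, such as age and serum insulin.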


## Finding Counterfactuals
With everything set up, we'll now search for counterfactuals for our sample instance using the CFsearch object.

In [8]:
# Create a CFsearch object
config_for_cfsearch = config.get_config_value("cfsearch")
search = CFsearch(normalization_transformer, model, sampler, config,
                  optimizer_name=config_for_cfsearch["optimizer"], 
                  distance_continuous=config_for_cfsearch["continuous_distance"], 
                  distance_categorical=config_for_cfsearch["categorical_distance"], 
                  loss_type=config_for_cfsearch["loss_type"], 
                  coherence=config_for_cfsearch["coherence"],
                  objective_function_weights=config_for_cfsearch["objective_function_weights"])

# Load target instance and find counterfactuals
with open(target_instance_json, 'r') as file:
    target_instance_json_content = file.read()

target_instance = instance_factory.create_instance_from_json(target_instance_json_content)

In [9]:
counterfactuals = search.find_counterfactuals(target_instance, number_cf=1, desired_class="opposite", maxiterations=50)

Get values of one population item {'Number of times pregnant': 0, 'Plasma glucose concentration a 2 hours in an oral glucose tolerance test': 111.79337120315033, 'Diastolic blood pressure (mm Hg)': 63.63224496794753, 'Triceps skin fold thickness (mm)': 55.77740876057779, '2-Hour serum insulin (mu U/ml)': 512.6466044514375, 'Body mass index (weight in kg/(height in m)^2)': 44.15282177282836, 'Diabetes pedigree function': 1.4354574992214515, 'Age (years)': 77.0401721421106}
Valid counterfactuals were found:  {'Number of times pregnant': 0, 'Plasma glucose concentration a 2 hours in an oral glucose tolerance test': 41.76982703433251, 'Diastolic blood pressure (mm Hg)': 76.93875078855744, 'Triceps skin fold thickness (mm)': 6.538265247996674, '2-Hour serum insulin (mu U/ml)': 543.9708577225455, 'Body mass index (weight in kg/(height in m)^2)': 39.72825381536026, 'Diabetes pedigree function': 2.2129718395235236, 'Age (years)': 32.924782441193976}
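
A quick plausibility check on a returned counterfactual is to confirm that every feature stays within the range observed in the dataset. The sketch below does this with the ranges printed earlier and the counterfactual values above, rounded to two decimals for readability. `features_out_of_range` is a hypothetical helper, not a `codice` function.

```python
from typing import Dict, List, Tuple

# Feature ranges copied from the Dataset output earlier in the notebook.
FEATURE_RANGES: Dict[str, Tuple[float, float]] = {
    "Number of times pregnant": (0, 17),
    "Plasma glucose concentration a 2 hours in an oral glucose tolerance test": (0, 199),
    "Diastolic blood pressure (mm Hg)": (0, 122),
    "Triceps skin fold thickness (mm)": (0, 99),
    "2-Hour serum insulin (mu U/ml)": (0, 846),
    "Body mass index (weight in kg/(height in m)^2)": (0.0, 67.1),
    "Diabetes pedigree function": (0.078, 2.42),
    "Age (years)": (21, 81),
}

# Counterfactual found by the search, rounded to two decimals.
counterfactual: Dict[str, float] = {
    "Number of times pregnant": 0,
    "Plasma glucose concentration a 2 hours in an oral glucose tolerance test": 41.77,
    "Diastolic blood pressure (mm Hg)": 76.94,
    "Triceps skin fold thickness (mm)": 6.54,
    "2-Hour serum insulin (mu U/ml)": 543.97,
    "Body mass index (weight in kg/(height in m)^2)": 39.73,
    "Diabetes pedigree function": 2.21,
    "Age (years)": 32.92,
}


def features_out_of_range(
    instance: Dict[str, float], ranges: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Return the names of features whose value lies outside its range."""
    violations = []
    for name, value in instance.items():
        low, high = ranges[name]
        if not low <= value <= high:
            violations.append(name)
    return violations


violations = features_out_of_range(counterfactual, FEATURE_RANGES)
print("Out-of-range features:", violations)
```

A range check says nothing about whether the combination of values is realistic. It only rules out values the dataset never contained.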


## Evaluation and Visualization
Once the counterfactuals are generated, it's crucial to evaluate and visualize them. This helps in understanding how the counterfactuals differ from the original instance and assessing their quality.

In [10]:
# Evaluate and visualize the counterfactuals
search.evaluate_counterfactuals(target_instance, counterfactuals)

# Display the counterfactuals and original instance in the notebook
display_df = search.visualize_as_dataframe(target_instance, counterfactuals)
display(display_df)

Feature Number of times pregnant changed its value from 0 to 0
Feature Plasma glucose concentration a 2 hours in an oral glucose tolerance test changed its value from 137 to 41.76982703433251
Feature Diastolic blood pressure (mm Hg) changed its value from 40 to 76.93875078855744
Feature Triceps skin fold thickness (mm) changed its value from 35 to 6.538265247996674
Feature 2-Hour serum insulin (mu U/ml) changed its value from 168 to 543.9708577225455
Feature Body mass index (weight in kg/(height in m)^2) changed its value from 43.1 to 39.72825381536026
Feature Diabetes pedigree function changed its value from 2.2880000000000003 to 2.2129718395235236
Feature Age (years) changed its value from 33 to 32.924782441193976
CF instance:  {'Number of times pregnant': 0, 'Plasma glucose concentration a 2 hours in an oral glucose tolerance test': 41.76982703433251, 'Diastolic blood pressure (mm Hg)': 76.93875078855744, 'Triceps skin fold thickness (mm)': 6.538265247996674, '2-Hour serum insulin (

| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |



Counterfactual set (new outcome: [array([0])])


| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | - | 41.76982703433251 | 76.93875078855744 | 6.538265247996674 | 543.9708577225455 | 39.72825381536026 | 2.212971839523524 | 32.924782441193976 |


None
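
To see at a glance how far the counterfactual moved from the original instance, the per-feature differences can be tabulated by hand. This sketch uses the original values from the output above and the counterfactual rounded to two decimals. It is an illustration only and not necessarily what `evaluate_counterfactuals` computes internally.

```python
from typing import Dict

FEATURES = [
    "Number of times pregnant",
    "Plasma glucose concentration a 2 hours in an oral glucose tolerance test",
    "Diastolic blood pressure (mm Hg)",
    "Triceps skin fold thickness (mm)",
    "2-Hour serum insulin (mu U/ml)",
    "Body mass index (weight in kg/(height in m)^2)",
    "Diabetes pedigree function",
    "Age (years)",
]

# Original instance and counterfactual (rounded), in the order of FEATURES.
original = dict(zip(FEATURES, [0, 137, 40, 35, 168, 43.1, 2.288, 33]))
counterfactual = dict(zip(FEATURES, [0, 41.77, 76.94, 6.54, 543.97, 39.73, 2.21, 32.92]))


def feature_changes(
    before: Dict[str, float], after: Dict[str, float], tol: float = 1e-9
) -> Dict[str, float]:
    """Return signed differences (after - before) for features that changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if abs(after[name] - before[name]) > tol
    }


changes = feature_changes(original, counterfactual)
largest = max(changes, key=lambda name: abs(changes[name]))
print(f"{len(changes)} of {len(FEATURES)} features changed")
print(f"Largest absolute change: {largest} ({changes[largest]:+.2f})")
```

The number of changed features is a common sparsity indicator for counterfactuals, and fewer changes are usually easier to act on. Absolute differences are not comparable across units, so a range-normalized distance would be a fairer basis for ranking features.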

## Storing the Results
For reproducibility and further analysis, we'll store the counterfactuals and their evaluations in designated folders.

In [11]:
# Store results
search.store_counterfactuals(config.get_config_value("output_folder"), "diabetes_first_test")
search.store_evaluations(config.get_config_value("output_folder"), "diabetes_first_test")

Store counterfactuals to  results/diabetes_first_test_0.json
Store counterfactuals evaluation to  results/diabetes_first_test_eval_0.json
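
The stored files can be read back with the standard `json` module for later analysis. The structure of their contents is not shown in this notebook, so the sketch below only loads the data and leaves interpretation to the reader. `load_json_result` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Union


def load_json_result(path: Union[str, Path]) -> Any:
    """Load one stored result file and return the parsed JSON content."""
    with Path(path).open("r", encoding="utf-8") as handle:
        return json.load(handle)


# Possible usage with the paths printed above:
# stored_cfs = load_json_result("results/diabetes_first_test_0.json")
# stored_eval = load_json_result("results/diabetes_first_test_eval_0.json")
# print(type(stored_cfs), type(stored_eval))
```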
