# "Adult" dataset
Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.
https://archive.ics.uci.edu/dataset/2/adult

### Loading the dataset
We also create a DiCE data object, which requires specifying which features are continuous (the rest are treated as discrete) so that perturbations can be applied later.

In [1]:
import dice_ml
from dice_ml.utils import helpers
import pandas as pd
dataset = helpers.load_adult_income_dataset()
dataset.head()

Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,28,Private,Bachelors,Single,White-Collar,White,Female,60,0
1,30,Self-Employed,Assoc,Married,Professional,White,Male,65,1
2,32,Private,Some-college,Married,White-Collar,White,Male,50,0
3,20,Private,Some-college,Single,Service,White,Female,35,0
4,41,Self-Employed,Some-college,Married,White-Collar,White,Male,50,0


In [2]:
# description of transformed features
adult_info = helpers.get_adult_data_info()
adult_info

{'age': 'age',
 'workclass': 'type of industry (Government, Other/Unknown, Private, Self-Employed)',
 'education': 'education level (Assoc, Bachelors, Doctorate, HS-grad, Masters, Prof-school, School, Some-college)',
 'marital_status': 'marital status (Divorced, Married, Separated, Single, Widowed)',
 'occupation': 'occupation (Blue-Collar, Other/Unknown, Professional, Sales, Service, White-Collar)',
 'race': 'white or other race?',
 'gender': 'male or female?',
 'hours_per_week': 'total work hours per week',
 'income': '0 (<=50K) vs 1 (>50K)'}

In [3]:
def construct_variable(variable_list, info, i):
    return variable_list[i] + ": " + info[variable_list[i]]

construct_variable(dataset.columns,adult_info,1)

'workclass: type of industry (Government, Other/Unknown, Private, Self-Employed)'

In [4]:
def string_info(variable_list, info):
    string = ''
    for i in range(len(variable_list)):
        string += construct_variable(variable_list,info,i) + '\n'
    return string

print(string_info(dataset.columns,adult_info))

age: age
workclass: type of industry (Government, Other/Unknown, Private, Self-Employed)
education: education level (Assoc, Bachelors, Doctorate, HS-grad, Masters, Prof-school, School, Some-college)
marital_status: marital status (Divorced, Married, Separated, Single, Widowed)
occupation: occupation (Blue-Collar, Other/Unknown, Professional, Sales, Service, White-Collar)
race: white or other race?
gender: male or female?
hours_per_week: total work hours per week
income: 0 (<=50K) vs 1 (>50K)



Explanations are critical for machine learning, especially as machine learning-based systems are being used to inform decisions in societally critical domains such as finance, healthcare, education, and criminal justice. However, most explanation methods depend on an approximation of the ML model to create an interpretable explanation. For example, consider a person who applied for a loan and was rejected by the loan distribution algorithm of a financial company. Typically, the company may provide an explanation of why the loan was rejected, for example, due to “poor credit history”. However, such an explanation does not help the person decide what they should do next to improve their chances of being approved in the future. Critically, changing the most important feature may not be enough to flip the decision of the algorithm, and in practice that feature may not even be changeable, as with gender or race.

DiCE implements counterfactual (CF) explanations that provide this information by showing feature-perturbed versions of the same person who would have received the loan, e.g., you would have received the loan if your income was higher by $10,000. In other words, it provides “what-if” explanations for model output and can be a useful complement to other explanation methods, both for end-users and model developers.
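
To make the "what-if" idea concrete, the sketch below searches for feature changes that flip a prediction. It is a toy illustration only: `toy_predict`, `CANDIDATES`, and `find_counterfactuals` are hypothetical names, the scoring rule is invented for this example, and the brute-force search is not the algorithm DiCE uses.

```python
from itertools import combinations, product

# Toy stand-in for a trained classifier: a hand-written scoring rule,
# NOT the random forest trained in this notebook.
def toy_predict(person):
    score = 0
    if person["education"] in {"Bachelors", "Masters", "Doctorate", "Prof-school"}:
        score += 2
    if person["occupation"] in {"Professional", "White-Collar"}:
        score += 1
    if person["hours_per_week"] >= 45:
        score += 1
    return int(score >= 3)

# Values to try for each feature (an arbitrary subset chosen for the example).
CANDIDATES = {
    "education": ["HS-grad", "Bachelors", "Masters"],
    "occupation": ["Blue-Collar", "Service", "Professional"],
    "hours_per_week": [38, 45, 60],
}

def find_counterfactuals(person, predict, candidates, max_changes=2):
    """Return the feature changes (as dicts) that flip the prediction."""
    original = predict(person)
    found = []
    for k in range(1, max_changes + 1):
        for features in combinations(candidates, k):
            # Only consider values that differ from the person's current ones.
            options = [[v for v in candidates[f] if v != person[f]] for f in features]
            for values in product(*options):
                change = dict(zip(features, values))
                if predict({**person, **change}) != original:
                    found.append(change)
    return found

person = {"education": "HS-grad", "occupation": "Blue-Collar", "hours_per_week": 38}
single = find_counterfactuals(person, toy_predict, CANDIDATES, max_changes=1)
cfs = find_counterfactuals(person, toy_predict, CANDIDATES, max_changes=2)
print(len(single), len(cfs))
```

For this person no single change flips the toy prediction, while six two-feature changes do, which mirrors the point above that the most important feature alone may not be enough.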

In [7]:
from sklearn.model_selection import train_test_split
import random
random.seed(42)
target = dataset["income"]
train_dataset, test_dataset, y_train, y_test = train_test_split(dataset,
                                                                target,
                                                                test_size=0.2,
                                                                random_state=0,
                                                                stratify=target)
x_train = train_dataset.drop('income', axis=1)
x_test = test_dataset.drop('income', axis=1)


In [8]:
# Step 1: dice_ml.Data
d = dice_ml.Data(dataframe=train_dataset, continuous_features=['age', 'hours_per_week'], outcome_name='income')

In [10]:
from pathlib import Path
directory = Path("./data")
# Create the directory if it does not exist
directory.mkdir(parents=True, exist_ok=True)
train_dataset.to_csv('./data/adult_train_dataset.csv', index=False)
test_dataset.to_csv('./data/adult_test_dataset.csv', index=False)

### Loading the model

In [12]:
# Sklearn imports
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

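# Only the categorical columns are listed here. ColumnTransformer defaults to
# remainder='drop', so the numerical columns never reach the classifier.
transformations = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical)])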

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)
model

In [13]:
import pickle
from pathlib import Path
directory = Path("./models")
# Create the directory if it does not exist
directory.mkdir(parents=True, exist_ok=True)


# Open the file in binary write mode and save the object
with open('./models/loan_model.pkl', 'wb') as file:
    pickle.dump(model, file)
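
The save step above and the later load step have to agree on the file path. A minimal round-trip sketch, assuming two hypothetical helpers `save_object` and `load_object` and using a placeholder object in a temporary directory instead of the fitted pipeline:

```python
import pickle
import tempfile
from pathlib import Path

def save_object(obj, path):
    """Pickle obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)

def load_object(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

# A temporary directory keeps the sketch from touching ./models.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "loan_model.pkl"
    placeholder = {"name": "placeholder for the fitted pipeline"}
    save_object(placeholder, model_path)
    restored = load_object(model_path)

print(restored == placeholder)
```

Building the path once and passing the same `Path` to both helpers avoids mismatches between the save and load locations.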

# Examples

In [16]:
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns,adult_info) ,train_dataset,test_dataset,'zero',5, False)
exp_m.fit()
example = exp_m.explain(example = test.iloc[[0]], verbose = False)
print(example)

100%|██████████| 1/1 [00:00<00:00,  2.08it/s]


Based on the analysis of your data by our Machine Learning system, here are some steps you could take to increase your chances of earning more than $50,000 a year:

1. Invest in Higher Education: One of the most significant factors that could increase your income is obtaining a higher level of education. Pursuing a Masters or Bachelors degree, if possible, could substantially increase your chances of earning more.

2. Consider Self-Employment or Working for the Government: Even if you're only a high school graduate, changing your work sector to Self-Employed or Government can also improve your income outcome.

3. Seek Professional or White-Collar Jobs: Regardless of the industry you work in, aiming for a professional or white-collar job can also enhance your income potential.

4. Work Hours Don't Seem to Matter: Interestingly, the number of hours you work each week doesn't seem to significantly impact your potential income in these scenarios, so focus on the quality of job rather than 

# Experiments

In [15]:
from prompts import *
from prompt_processing import *
from exp_machines import *

In [6]:
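import os  # used later for os.rename
import pickle
import numpy as np  # used later in get_metrics
import pandas as pd
# Load the model from the same path it was saved to
with open('./models/loan_model.pkl', 'rb') as file:
    model = pickle.load(file)
train_dataset = pd.read_csv('./data/adult_train_dataset.csv')
test_dataset = pd.read_csv('./data/adult_test_dataset.csv')
test_df = pd.read_csv('test_examples.csv')
# Drop the target column so only the features remain
test = test_df.drop(columns=['income'])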

## Example of each of the prompt types

In [8]:
test.iloc[[0]]

Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week
0,29,Private,HS-grad,Married,Blue-Collar,White,Female,38


### Zero shot

In [17]:
exp_m.explain_evaluate(example = test.iloc[[1]], verbose = True)

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:00<00:00,  1.84it/s]


   age      workclass     education marital_status     occupation   race  \
0   50  Self-Employed   Prof-school        Married  Other/Unknown  White   
1   50  Self-Employed       Masters        Married  Other/Unknown  White   
2   34  Other/Unknown  Some-college        Married  Other/Unknown  Other   
3   84  Other/Unknown     Doctorate        Married  Other/Unknown  White   
4   50  Other/Unknown       Masters        Married   White-Collar  White   

  gender  hours_per_week  income  
0   Male              40       1  
1   Male              40       1  
2   Male              40       1  
3   Male              40       1  
4   Male              40       1  
1. Higher education such as 'Prof-school', 'Masters' or 'Doctorate' tends to predict a higher income.
2. Individuals who are 'Self-Employed' or in 'White-Collar' occupations are more likely to earn more than $50k a year.
3. Age doesn't seem to have a significant impact on income prediction in these cases.
4. Marital status 'Married

(1, 4, 3, 1, 1, 1, False, False)
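
The tuple returned by `explain_evaluate` is easier to read with named fields. The sketch below wraps it in a hypothetical `Evaluation` named tuple; the field names are taken from the unpacking used in the evaluation loops later in this notebook, and their interpretation (for example, that `rules_followed` counts how many of the `n_rules` extracted rules hold) is an assumption.

```python
from typing import NamedTuple

class Evaluation(NamedTuple):
    # Field order follows the unpacking used in the evaluation loops.
    example_label: int
    n_rules: int
    rules_followed: int
    first_rule: int
    second_rule: int
    third_rule: int
    in_cfs: bool
    in_dataset: bool

# The tuple printed above for the zero-shot example.
result = Evaluation(*(1, 4, 3, 1, 1, 1, False, False))

# Share of rules followed for this single example.
ratio = result.rules_followed / result.n_rules
print(result.n_rules, result.rules_followed, ratio)
```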

### One Shot

In [8]:
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns,adult_info), train_dataset, test_dataset,'one',5 )
exp_m.fit()
example = exp_m.explain(example = test.iloc[[1]], verbose = False)
print(example)

100%|██████████| 1/1 [00:00<00:00,  1.85it/s]


Based on the results of the analysis, there are several things that could potentially increase your chances of earning a higher income. Here's a straightforward interpretation of the findings:

Pursue Advanced Education: The data analysis suggests that obtaining a higher education degree such as a Doctorate or attending a Professional school can greatly increase your chances of earning more than 50k a year. Therefore, furthering your education could be a worthwhile investment for your future financial situation.

Consider Your Age: The analysis also indicates that people under the age of 50 are more likely to earn a higher income. While age isn't something you can change, this information could be useful in terms of career planning and financial forecasting. 

Reflect on Race: The data suggests that individuals of races other than white have a higher likelihood of earning more than 50k a year. While your race isn't something you can change, this insight may spur conversations and initi

In [9]:
exp_m.explain_evaluate(example = test.iloc[[1]], verbose = False)

100%|██████████| 1/1 [00:00<00:00,  1.80it/s]


(0, 4, 4, 1, 1, 1, False, False)

### ToT

In [3]:
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
exp_m = ToTExplanationMachine(model, model_description, train_dataset, test_dataset, prompt_type = 'zero', n_counterfactuals=5, branches = 3)
exp_m.fit()
example = exp_m.explain(example = test.iloc[[1]], verbose = False)
print(example)

100%|██████████| 1/1 [00:00<00:00,  2.01it/s]
100%|██████████| 1/1 [00:00<00:00,  2.14it/s]
100%|██████████| 1/1 [00:00<00:00,  1.83it/s]


Based on the suggestions provided by the three systems, here's an accessible summary:

1. **Education**: All three systems suggest that increasing your education level could potentially increase your income. Specifically, pursuing a 'Masters', 'Doctorate', or 'Prof-school' degree could be beneficial.

2. **Occupation**: There seems to be some consensus that changing your occupation might help. Consider roles in 'Sales', 'Service', 'White-Collar', or 'Professional' fields if possible.

3. **Work Hours**: Increasing your work hours could possibly boost your income, but the suggested hours per week vary quite a bit. While one system suggests working up to 75 hours, another suggests as many as 99 hours. Before deciding to work more hours, consider your personal circumstances.

4. **Workclass**: If you're currently in the 'Other/Unknown' workclass, you might want to seek employment in the 'Private' sector as it could potentially enhance your income.

5. **Age**: While one system suggests th

In [4]:
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
exp_m = ToTExplanationMachine(model, model_description, train_dataset, test_dataset, prompt_type = 'one', n_counterfactuals=5, branches = 3)
exp_m.fit()
example = exp_m.explain(example = test.iloc[[1]], verbose = False)
print(example)

100%|██████████| 1/1 [00:00<00:00,  1.76it/s]
100%|██████████| 1/1 [00:00<00:00,  1.70it/s]
100%|██████████| 1/1 [00:00<00:00,  1.94it/s]


Based on the insights provided by the three systems, here are some potential strategies to increase your chances of earning over 50k a year:

1. **Invest in Higher Education:** All three systems strongly suggest that higher education significantly increases your chances of earning a higher income. Whether it's a Bachelor's, Master's, or Professional degree, further education appears to be a common factor among those earning over 50k a year. 

2. **Consider Your Occupation:** The type of work you do can also impact your income. While System 2 suggests pursuing white-collar occupations, System 3 recommends targeting roles in sales. You may want to explore career opportunities in these areas to see if they align with your interests and skills.

3. **Look at Your Work Hours:** Interestingly, the number of hours you work each week doesn't always correlate with higher income. Both Systems 1 and 3 indicate that working fewer hours can potentially lead to higher earnings. This could be due to 

### User Input

In [7]:
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
exp_m = ExplanationMachine(model, model_description, train_dataset,test_dataset,'zero',5, True )
exp_m.fit()
example = exp_m.explain(example = test.iloc[[1]], verbose = False)
print(example)

100%|██████████| 1/1 [00:00<00:00,  2.00it/s]


To improve your chances of earning more than 50k $ a year, here are some steps you could consider based on the data we have analyzed:

1. Furthering your education seems to have a significant impact. Specifically, pursuing "Prof-school" or a "Masters" degree seems to increase the likelihood of earning more. This rule is supported by four counterfactuals, making it the most reliable recommendation.
   
2. Your work sector matters. Working for the "Government" or becoming "Self-Employed" appear to increase the probability of earning more than 50k $ a year. This rule is supported by two counterfactuals.

3. Race seems to play a role, with individuals classified as "Other" having a higher chance of earning more than 50k $ a year. This is supported by two counterfactuals. It's important to note that this is a reflection of the data, not an endorsement of discrimination.

4. If you're in a position to change your occupation, moving to a "Blue-Collar" job seems to increase the chance of earni

## Experimenting with datasets

In [19]:
#create datasets to test
"""test10 = test.iloc[-10:].reset_index(drop = True)
test10.to_csv('test10.csv', index = False)
test100 = test.iloc[-100:].reset_index(drop = True)
test100.to_csv('test100.csv', index = False)"""

In [10]:
test10 = pd.read_csv('test10.csv')
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
prompts = ['one']
for prompt in prompts:
    print(prompt)
    exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns,adult_info), train_dataset, test_dataset,prompt,5 )
    exp_m.fit()
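    for i in range(test10.shape[0]):
        try:
            # NOTE: per the commented-out cell above, test10 holds the last 10 rows of `test`,
            # but this call explains test.iloc[[i]] (the first 10 rows). The same pattern
            # recurs in the loops below, so results may not correspond to the rows of test10.
            example_label, n_rules, rules_followed, first_rule, second_rule,third_rule,in_cfs, in_dataset = exp_m.explain_evaluate(example = test.iloc[[i]], verbose = False)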
            os.rename('temp_csv.csv', f'ex_5fcs_{i}.csv')
            os.rename('evaluation.csv', f'eval_5fcs_{i}.csv')            
            test10.loc[i, prompt + '_label'] = example_label
            test10.loc[i, prompt + '_rules'] = n_rules
            test10.loc[i, prompt + '_rules_followed'] = rules_followed
            test10.loc[i, prompt + '_rule_1'] = first_rule
            test10.loc[i, prompt + '_rule_2'] = second_rule
            test10.loc[i, prompt + '_rule_3'] = third_rule
            test10.loc[i, prompt + '_in_cfs'] = in_cfs
            test10.loc[i, prompt + '_in_data'] = in_dataset
            test10.loc[i, prompt + '_status'] = 0
        except Exception as e:
            test10.loc[i, prompt + '_status'] = 1
test10

one


100%|██████████| 1/1 [00:00<00:00,  1.86it/s]
100%|██████████| 1/1 [00:00<00:00,  1.64it/s]
100%|██████████| 1/1 [00:00<00:00,  1.77it/s]
100%|██████████| 1/1 [00:00<00:00,  1.54it/s]
100%|██████████| 1/1 [00:00<00:00,  1.47it/s]
100%|██████████| 1/1 [00:00<00:00,  1.63it/s]
100%|██████████| 1/1 [00:00<00:00,  1.86it/s]
100%|██████████| 1/1 [00:00<00:00,  1.67it/s]
100%|██████████| 1/1 [00:00<00:00,  1.16it/s]
100%|██████████| 1/1 [00:00<00:00,  1.65it/s]


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,one_label,one_rules,one_rules_followed,one_rule_1,one_rule_2,one_rule_3,one_in_cfs,one_in_data,one_status
0,45,Self-Employed,HS-grad,Married,Blue-Collar,White,Male,50,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
1,46,Private,HS-grad,Divorced,Blue-Collar,White,Male,30,,,,,,,,,1.0
2,49,Private,HS-grad,Widowed,Service,White,Male,40,1.0,3.0,3.0,1.0,1.0,1.0,False,True,0.0
3,26,Government,Bachelors,Single,Professional,Other,Female,40,1.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0
4,45,Government,Bachelors,Divorced,Service,White,Female,40,0.0,4.0,4.0,1.0,1.0,1.0,False,False,0.0
5,35,Private,Some-college,Married,Blue-Collar,White,Male,60,1.0,3.0,3.0,1.0,1.0,1.0,False,True,0.0
6,29,Private,HS-grad,Single,Blue-Collar,Other,Male,40,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
7,40,Self-Employed,Some-college,Married,Service,White,Male,50,0.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0
8,28,Other/Unknown,HS-grad,Married,Other/Unknown,White,Female,20,0.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0
9,20,Private,Some-college,Single,Service,White,Female,40,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
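
The evaluation loop above is repeated with small variations in the cells that follow. One possible way to factor it out is sketched below. `evaluate_examples`, `FIELDS`, and `fake_explain_evaluate` are hypothetical names; the stub stands in for `exp_m.explain_evaluate`, whose real behaviour is not reproduced here, and plain dictionaries stand in for DataFrame rows. Unlike the original loop, this version also records the error message of a failed example.

```python
# Suffixes of the result columns, in the order explain_evaluate returns them.
FIELDS = ("label", "rules", "rules_followed", "rule_1",
          "rule_2", "rule_3", "in_cfs", "in_data")

def evaluate_examples(examples, explain_evaluate, prompt):
    """Evaluate each example and collect prefixed result columns plus a status flag."""
    records = []
    for example in examples:
        record = dict(example)
        try:
            values = explain_evaluate(example)
            record.update({f"{prompt}_{name}": v for name, v in zip(FIELDS, values)})
            record[f"{prompt}_status"] = 0
        except Exception as exc:
            # Keep the failure visible instead of silently discarding it.
            record[f"{prompt}_status"] = 1
            record[f"{prompt}_error"] = str(exc)
        records.append(record)
    return records

def fake_explain_evaluate(example):
    """Stub evaluator: fails for one example, returns a fixed tuple otherwise."""
    if example["age"] == 46:
        raise IndexError("single positional indexer is out-of-bounds")
    return (1, 3, 3, 1, 1, 1, False, False)

examples = [{"age": 45}, {"age": 46}]
records = evaluate_examples(examples, fake_explain_evaluate, "one")
for record in records:
    print(record)
```

The prompt type and the evaluator are parameters, so the same helper could cover the zero-shot and one-shot runs with different numbers of counterfactuals.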


In [11]:
def get_metrics(df,prompt):
    validity = df[f'{prompt}_label'].mean()
    rules = df[f'{prompt}_rules'].mean()
    rules_ratio = np.mean(df[f'{prompt}_rules_followed']/df[f'{prompt}_rules'])
    first = df[f'{prompt}_rule_1'].mean()
    second = df[f'{prompt}_rule_2'].mean()
    third = df[f'{prompt}_rule_3'].mean()
    return [validity, rules, rules_ratio, first, second, third]
get_metrics(test10,'one')

[0.6666666666666666,
 3.111111111111111,
 0.8888888888888888,
 1.0,
 1.0,
 0.6666666666666666]
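
The first three values above can be re-derived by hand from the table. Rows with `one_status == 1` contain NaN, and pandas' `mean` skips NaN by default, so each metric is an average over the nine successful rows rather than all ten. The sketch below repeats that arithmetic in plain Python; the tuples are copied from the results table, and the variable names are chosen for this example.

```python
from statistics import fmean

# (one_label, one_rules, one_rules_followed) for the nine rows with one_status == 0.
rows = [
    (1, 3, 3),  # row 0
    (1, 3, 3),  # row 2
    (1, 3, 2),  # row 3
    (0, 4, 4),  # row 4
    (1, 3, 3),  # row 5
    (1, 3, 3),  # row 6
    (0, 3, 2),  # row 7
    (0, 3, 2),  # row 8
    (1, 3, 3),  # row 9
]

validity = fmean([label for label, _, _ in rows])          # 6 / 9
mean_rules = fmean([n for _, n, _ in rows])                # 28 / 9
rules_ratio = fmean([followed / n for _, n, followed in rows])  # 8 / 9

print(validity, mean_rules, rules_ratio)
```

Because failed rows drop out of the averages, these metrics are best read alongside the failure rate.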

In [8]:
test10 = pd.read_csv('test10.csv')
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
prompts = ['one']
for prompt in prompts:
    print(prompt)
    exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns,adult_info), train_dataset, test_dataset,prompt,1 )
    exp_m.fit()
    for i in range(test10.shape[0]):
        try:
            example_label, n_rules, rules_followed, first_rule, second_rule,third_rule,in_cfs, in_dataset = exp_m.explain_evaluate(example = test.iloc[[i]], verbose = False)
            os.rename('temp_csv.csv', f'ex_1fcs_{i}.csv')
            os.rename('evaluation.csv', f'eval_1fcs_{i}.csv')
            test10.loc[i, prompt + '_label'] = example_label
            test10.loc[i, prompt + '_rules'] = n_rules
            test10.loc[i, prompt + '_rules_followed'] = rules_followed
            test10.loc[i, prompt + '_rule_1'] = first_rule
            test10.loc[i, prompt + '_rule_2'] = second_rule
            test10.loc[i, prompt + '_rule_3'] = third_rule
            test10.loc[i, prompt + '_in_cfs'] = in_cfs
            test10.loc[i, prompt + '_in_data'] = in_dataset
            test10.loc[i, prompt + '_status'] = 0
        except Exception as e:
            test10.loc[i, prompt + '_status'] = 1
test10

one


100%|██████████| 1/1 [00:00<00:00,  7.16it/s]
100%|██████████| 1/1 [00:00<00:00,  6.20it/s]
100%|██████████| 1/1 [00:00<00:00,  6.81it/s]
100%|██████████| 1/1 [00:00<00:00,  6.59it/s]
100%|██████████| 1/1 [00:00<00:00,  7.09it/s]
100%|██████████| 1/1 [00:00<00:00,  7.01it/s]
100%|██████████| 1/1 [00:00<00:00,  7.06it/s]
100%|██████████| 1/1 [00:00<00:00,  7.18it/s]
100%|██████████| 1/1 [00:00<00:00,  2.75it/s]
100%|██████████| 1/1 [00:00<00:00,  2.91it/s]


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,one_label,one_rules,one_rules_followed,one_rule_1,one_rule_2,one_rule_3,one_in_cfs,one_in_data,one_status
0,45,Self-Employed,HS-grad,Married,Blue-Collar,White,Male,50,1.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0
1,46,Private,HS-grad,Divorced,Blue-Collar,White,Male,30,1.0,3.0,3.0,1.0,1.0,1.0,False,True,0.0
2,49,Private,HS-grad,Widowed,Service,White,Male,40,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
3,26,Government,Bachelors,Single,Professional,Other,Female,40,,,,,,,,,1.0
4,45,Government,Bachelors,Divorced,Service,White,Female,40,0.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
5,35,Private,Some-college,Married,Blue-Collar,White,Male,60,1.0,3.0,3.0,1.0,1.0,1.0,False,True,0.0
6,29,Private,HS-grad,Single,Blue-Collar,Other,Male,40,0.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
7,40,Self-Employed,Some-college,Married,Service,White,Male,50,1.0,3.0,1.0,1.0,0.0,0.0,False,False,0.0
8,28,Other/Unknown,HS-grad,Married,Other/Unknown,White,Female,20,1.0,3.0,2.0,1.0,1.0,0.0,True,False,0.0
9,20,Private,Some-college,Single,Service,White,Female,40,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0


In [9]:
def get_metrics(df,prompt):
    validity = df[f'{prompt}_label'].mean()
    rules = df[f'{prompt}_rules'].mean()
    rules_ratio = np.mean(df[f'{prompt}_rules_followed']/df[f'{prompt}_rules'])
    first = df[f'{prompt}_rule_1'].mean()
    second = df[f'{prompt}_rule_2'].mean()
    third = df[f'{prompt}_rule_3'].mean()
    return [validity, rules, rules_ratio, first, second, third]
get_metrics(test10,'one')

[0.7777777777777778,
 3.0,
 0.8518518518518519,
 1.0,
 0.8888888888888888,
 0.6666666666666666]

In [9]:
test10 = pd.read_csv('test10.csv')
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
prompts = ['one']
for prompt in prompts:
    print(prompt)
    exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns,adult_info), train_dataset, test_dataset,prompt,3 )
    exp_m.fit()
    for i in range(test10.shape[0]):
        try:
            example_label, n_rules, rules_followed, first_rule, second_rule,third_rule,in_cfs, in_dataset = exp_m.explain_evaluate(example = test.iloc[[i]], verbose = False)
            os.rename('temp_csv.csv', f'ex_3fcs_{i}.csv')
            os.rename('evaluation.csv', f'eval_3fcs_{i}.csv')
            test10.loc[i, prompt + '_label'] = example_label
            test10.loc[i, prompt + '_rules'] = n_rules
            test10.loc[i, prompt + '_rules_followed'] = rules_followed
            test10.loc[i, prompt + '_rule_1'] = first_rule
            test10.loc[i, prompt + '_rule_2'] = second_rule
            test10.loc[i, prompt + '_rule_3'] = third_rule
            test10.loc[i, prompt + '_in_cfs'] = in_cfs
            test10.loc[i, prompt + '_in_data'] = in_dataset
            test10.loc[i, prompt + '_status'] = 0
        except Exception as e:
            test10.loc[i, prompt + '_status'] = 1
            print(e)
test10

one


100%|██████████| 1/1 [00:00<00:00,  5.40it/s]
100%|██████████| 1/1 [00:00<00:00,  5.66it/s]
100%|██████████| 1/1 [00:00<00:00,  4.76it/s]


single positional indexer is out-of-bounds


100%|██████████| 1/1 [00:00<00:00,  5.80it/s]
100%|██████████| 1/1 [00:00<00:00,  5.62it/s]
100%|██████████| 1/1 [00:00<00:00,  5.45it/s]
100%|██████████| 1/1 [00:00<00:00,  5.67it/s]
100%|██████████| 1/1 [00:00<00:00,  5.82it/s]
100%|██████████| 1/1 [00:00<00:00,  4.68it/s]
100%|██████████| 1/1 [00:00<00:00,  5.43it/s]


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,one_label,one_rules,one_rules_followed,one_rule_1,one_rule_2,one_rule_3,one_in_cfs,one_in_data,one_status
0,45,Self-Employed,HS-grad,Married,Blue-Collar,White,Male,50,1.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0
1,46,Private,HS-grad,Divorced,Blue-Collar,White,Male,30,1.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0
2,49,Private,HS-grad,Widowed,Service,White,Male,40,,,,,,,,,1.0
3,26,Government,Bachelors,Single,Professional,Other,Female,40,0.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
4,45,Government,Bachelors,Divorced,Service,White,Female,40,1.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0
5,35,Private,Some-college,Married,Blue-Collar,White,Male,60,1.0,3.0,1.0,1.0,0.0,0.0,False,False,0.0
6,29,Private,HS-grad,Single,Blue-Collar,Other,Male,40,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
7,40,Self-Employed,Some-college,Married,Service,White,Male,50,0.0,3.0,1.0,1.0,0.0,0.0,False,False,0.0
8,28,Other/Unknown,HS-grad,Married,Other/Unknown,White,Female,20,1.0,3.0,3.0,1.0,1.0,1.0,True,False,0.0
9,20,Private,Some-college,Single,Service,White,Female,40,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0


In [11]:
def get_metrics(df,prompt):
    validity = df[f'{prompt}_label'].mean()
    rules = df[f'{prompt}_rules'].mean()
    rules_ratio = np.mean(df[f'{prompt}_rules_followed']/df[f'{prompt}_rules'])
    first = df[f'{prompt}_rule_1'].mean()
    second = df[f'{prompt}_rule_2'].mean()
    third = df[f'{prompt}_rule_3'].mean()
    return [validity, rules, rules_ratio, first, second, third]
get_metrics(test10,'one')

[0.7777777777777778,
 3.0,
 0.7407407407407407,
 1.0,
 0.7777777777777778,
 0.4444444444444444]

In [16]:
test10 = pd.read_csv('test10.csv')
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
prompts = ['zero']
for prompt in prompts:
    print(prompt)
    exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns,adult_info), train_dataset, test_dataset,prompt,5)
    exp_m.fit()
    for i in range(test10.shape[0]):
        try:
            example_label, n_rules, rules_followed, first_rule, second_rule,third_rule,in_cfs, in_dataset = exp_m.explain_evaluate(example = test.iloc[[i]], verbose = False)
            os.rename('temp_csv.csv', f'ex_zero_5fcs_{i}.csv')
            os.rename('evaluation.csv', f'eval_zero_5fcs_{i}.csv')
            test10.loc[i, prompt + '_label'] = example_label
            test10.loc[i, prompt + '_rules'] = n_rules
            test10.loc[i, prompt + '_rules_followed'] = rules_followed
            test10.loc[i, prompt + '_rule_1'] = first_rule
            test10.loc[i, prompt + '_rule_2'] = second_rule
            
            test10.loc[i, prompt + '_rule_3'] = third_rule
            test10.loc[i, prompt + '_in_cfs'] = in_cfs
            test10.loc[i, prompt + '_in_data'] = in_dataset
            test10.loc[i, prompt + '_status'] = 0
        except Exception as e:
            test10.loc[i, prompt + '_status'] = 1
test10

zero


100%|██████████| 1/1 [00:00<00:00,  1.85it/s]
100%|██████████| 1/1 [00:00<00:00,  1.79it/s]
100%|██████████| 1/1 [00:00<00:00,  1.81it/s]
100%|██████████| 1/1 [00:00<00:00,  1.93it/s]
100%|██████████| 1/1 [00:00<00:00,  1.71it/s]
100%|██████████| 1/1 [00:00<00:00,  1.93it/s]
100%|██████████| 1/1 [00:00<00:00,  2.02it/s]
100%|██████████| 1/1 [00:00<00:00,  1.93it/s]
100%|██████████| 1/1 [00:00<00:00,  1.66it/s]
100%|██████████| 1/1 [00:00<00:00,  1.91it/s]


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,zero_label,zero_rules,zero_rules_followed,zero_rule_1,zero_rule_2,zero_rule_3,zero_in_cfs,zero_in_data,zero_status
0,45,Self-Employed,HS-grad,Married,Blue-Collar,White,Male,50,1.0,4.0,4.0,1.0,1.0,1.0,False,False,0.0
1,46,Private,HS-grad,Divorced,Blue-Collar,White,Male,30,,,,,,,,,1.0
2,49,Private,HS-grad,Widowed,Service,White,Male,40,,,,,,,,,1.0
3,26,Government,Bachelors,Single,Professional,Other,Female,40,,,,,,,,,1.0
4,45,Government,Bachelors,Divorced,Service,White,Female,40,1.0,4.0,4.0,1.0,1.0,1.0,False,False,0.0
5,35,Private,Some-college,Married,Blue-Collar,White,Male,60,,,,,,,,,1.0
6,29,Private,HS-grad,Single,Blue-Collar,Other,Male,40,,,,,,,,,1.0
7,40,Self-Employed,Some-college,Married,Service,White,Male,50,,,,,,,,,1.0
8,28,Other/Unknown,HS-grad,Married,Other/Unknown,White,Female,20,,,,,,,,,1.0
9,20,Private,Some-college,Single,Service,White,Female,40,,,,,,,,,1.0


In [17]:
def get_metrics(df,prompt):
    validity = df[f'{prompt}_label'].mean()
    rules = df[f'{prompt}_rules'].mean()
    rules_ratio = np.mean(df[f'{prompt}_rules_followed']/df[f'{prompt}_rules'])
    in_data = df[f'{prompt}_in_data'].mean()
    fail = df[f'{prompt}_status'].mean()
    first = df[f'{prompt}_rule_1'].mean()
    second = df[f'{prompt}_rule_2'].mean()
    third = df[f'{prompt}_rule_3'].mean()
    return [validity, rules, rules_ratio, in_data, first, second, third, fail]
get_metrics(test10,'zero')

[1.0, 4.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.8]

In [18]:
test10 = pd.read_csv('test10.csv')
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
prompts = ['zero']
for prompt in prompts:
    print(prompt)
    exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns,adult_info), train_dataset, test_dataset,prompt,3)
    exp_m.fit()
    for i in range(test10.shape[0]):
        try:
            example_label, n_rules, rules_followed, first_rule, second_rule,third_rule,in_cfs, in_dataset = exp_m.explain_evaluate(example = test.iloc[[i]], verbose = False)
            os.rename('temp_csv.csv', f'ex_zero_3fcs_{i}.csv')
            os.rename('evaluation.csv', f'eval_zero_3fcs_{i}.csv')
            test10.loc[i, prompt + '_label'] = example_label
            test10.loc[i, prompt + '_rules'] = n_rules
            test10.loc[i, prompt + '_rules_followed'] = rules_followed
            test10.loc[i, prompt + '_rule_1'] = first_rule
            test10.loc[i, prompt + '_rule_2'] = second_rule
            
            test10.loc[i, prompt + '_rule_3'] = third_rule
            test10.loc[i, prompt + '_in_cfs'] = in_cfs
            test10.loc[i, prompt + '_in_data'] = in_dataset
            test10.loc[i, prompt + '_status'] = 0
        except Exception as e:
            test10.loc[i, prompt + '_status'] = 1
test10

zero


100%|██████████| 1/1 [00:00<00:00,  2.55it/s]
100%|██████████| 1/1 [00:00<00:00,  1.64it/s]
100%|██████████| 1/1 [00:00<00:00,  1.86it/s]
100%|██████████| 1/1 [00:00<00:00,  1.98it/s]
100%|██████████| 1/1 [00:00<00:00,  1.55it/s]
100%|██████████| 1/1 [00:00<00:00,  1.86it/s]
100%|██████████| 1/1 [00:00<00:00,  1.94it/s]
100%|██████████| 1/1 [00:00<00:00,  1.88it/s]
100%|██████████| 1/1 [00:00<00:00,  1.74it/s]
100%|██████████| 1/1 [00:00<00:00,  2.15it/s]


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,zero_label,zero_rules,zero_rules_followed,zero_rule_1,zero_rule_2,zero_rule_3,zero_in_cfs,zero_in_data,zero_status
0,45,Self-Employed,HS-grad,Married,Blue-Collar,White,Male,50,1.0,5.0,2.0,1.0,0.0,1.0,False,False,0.0
1,46,Private,HS-grad,Divorced,Blue-Collar,White,Male,30,1.0,3.0,3.0,1.0,1.0,1.0,False,True,0.0
2,49,Private,HS-grad,Widowed,Service,White,Male,40,1.0,4.0,4.0,1.0,1.0,1.0,False,True,0.0
3,26,Government,Bachelors,Single,Professional,Other,Female,40,1.0,4.0,3.0,1.0,1.0,0.0,False,False,0.0
4,45,Government,Bachelors,Divorced,Service,White,Female,40,0.0,4.0,3.0,1.0,1.0,0.0,False,False,0.0
5,35,Private,Some-college,Married,Blue-Collar,White,Male,60,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
6,29,Private,HS-grad,Single,Blue-Collar,Other,Male,40,1.0,3.0,3.0,1.0,1.0,1.0,False,False,0.0
7,40,Self-Employed,Some-college,Married,Service,White,Male,50,1.0,4.0,3.0,1.0,1.0,1.0,False,False,0.0
8,28,Other/Unknown,HS-grad,Married,Other/Unknown,White,Female,20,1.0,4.0,4.0,1.0,1.0,1.0,False,True,0.0
9,20,Private,Some-college,Single,Service,White,Female,40,1.0,3.0,2.0,1.0,1.0,0.0,False,False,0.0


In [19]:
get_metrics(test10,'zero')

[0.9, 3.7, 0.8316666666666667, 0.3, 1.0, 0.9, 0.7, 0.0]
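
The eight values returned by `get_metrics` are not labelled. Checked against the table above, they are consistent with column means in this order: `zero_label`, `zero_rules`, the per-row ratio `zero_rules_followed / zero_rules`, `zero_in_data`, `zero_rule_1`, `zero_rule_2`, `zero_rule_3` and `zero_status`. This reading is inferred from the printed numbers, not from the definition of `get_metrics`. A minimal sketch that reproduces the list, with `summarize_run` as a hypothetical helper name and the data copied from the table:

```python
import pandas as pd

# Result columns copied from the 10-row table above.
run = pd.DataFrame({
    "zero_label":          [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    "zero_rules":          [5, 3, 4, 4, 4, 3, 3, 4, 4, 3],
    "zero_rules_followed": [2, 3, 4, 3, 3, 3, 3, 3, 4, 2],
    "zero_rule_1":         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "zero_rule_2":         [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "zero_rule_3":         [1, 1, 1, 0, 0, 1, 1, 1, 1, 0],
    "zero_in_data":        [False, True, True, False, False,
                            False, False, False, True, False],
    "zero_status":         [0] * 10,
})


def summarize_run(df, prompt):
    """Hypothetical stand-in for get_metrics: plain column means."""
    p = prompt + "_"
    # share of the generated rules that each example actually follows
    followed_share = df[p + "rules_followed"] / df[p + "rules"]
    return [
        df[p + "label"].mean(),
        df[p + "rules"].mean(),
        followed_share.mean(),
        df[p + "in_data"].mean(),   # mean of booleans = share of True
        df[p + "rule_1"].mean(),
        df[p + "rule_2"].mean(),
        df[p + "rule_3"].mean(),
        df[p + "status"].mean(),    # share of failed examples
    ]


metrics = summarize_run(run, "zero")
print(metrics)
```

In this run `zero_in_cfs` is `False` for every row, so its mean is also 0.0 and the last position cannot be told apart from it on this table alone.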

In [12]:
test10 = pd.read_csv('test10.csv')
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
prompts = ['zero']
for prompt in prompts:
    print(prompt)
    exp_m = ExplanationMachine2(model, model_description, string_info(dataset.columns, adult_info), train_dataset, test_dataset, prompt, 1)
    exp_m.fit()
    for i in range(test10.shape[0]):
        try:
            example_label, n_rules, rules_followed, first_rule, second_rule, third_rule, in_cfs, in_dataset = exp_m.explain_evaluate(example=test.iloc[[i]], verbose=False)
            # give temp_csv.csv and evaluation.csv a per-example name
            os.rename('temp_csv.csv', f'ex_zero_1fcs_{i}.csv')
            os.rename('evaluation.csv', f'eval_zero_1fcs_{i}.csv')
            test10.loc[i, prompt + '_label'] = example_label
            test10.loc[i, prompt + '_rules'] = n_rules
            test10.loc[i, prompt + '_rules_followed'] = rules_followed
            test10.loc[i, prompt + '_rule_1'] = first_rule
            test10.loc[i, prompt + '_rule_2'] = second_rule
            test10.loc[i, prompt + '_rule_3'] = third_rule
            test10.loc[i, prompt + '_in_cfs'] = in_cfs
            test10.loc[i, prompt + '_in_data'] = in_dataset
            test10.loc[i, prompt + '_status'] = 0  # 0 = explanation succeeded
        except Exception:
            # 1 = explanation failed; the result columns of this row stay NaN
            test10.loc[i, prompt + '_status'] = 1
test10

zero


100%|██████████| 1/1 [00:00<00:00,  2.94it/s]
100%|██████████| 1/1 [00:00<00:00,  2.87it/s]
100%|██████████| 1/1 [00:00<00:00,  2.75it/s]
100%|██████████| 1/1 [00:00<00:00,  2.58it/s]
100%|██████████| 1/1 [00:00<00:00,  3.12it/s]
100%|██████████| 1/1 [00:00<00:00,  2.66it/s]
100%|██████████| 1/1 [00:00<00:00,  2.67it/s]
100%|██████████| 1/1 [00:00<00:00,  2.74it/s]
100%|██████████| 1/1 [00:00<00:00,  2.65it/s]
100%|██████████| 1/1 [00:00<00:00,  2.51it/s]


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,zero_status,zero_label,zero_rules,zero_rules_followed,zero_rule_1,zero_rule_2,zero_rule_3,zero_in_cfs,zero_in_data
0,45,Self-Employed,HS-grad,Married,Blue-Collar,White,Male,50,1.0,,,,,,,,
1,46,Private,HS-grad,Divorced,Blue-Collar,White,Male,30,0.0,1.0,7.0,3.0,1.0,0.0,1.0,False,True
2,49,Private,HS-grad,Widowed,Service,White,Male,40,0.0,1.0,4.0,2.0,1.0,1.0,0.0,True,False
3,26,Government,Bachelors,Single,Professional,Other,Female,40,1.0,,,,,,,,
4,45,Government,Bachelors,Divorced,Service,White,Female,40,0.0,1.0,4.0,4.0,1.0,1.0,1.0,True,False
5,35,Private,Some-college,Married,Blue-Collar,White,Male,60,0.0,1.0,4.0,2.0,0.0,1.0,1.0,False,True
6,29,Private,HS-grad,Single,Blue-Collar,Other,Male,40,1.0,,,,,,,,
7,40,Self-Employed,Some-college,Married,Service,White,Male,50,1.0,,,,,,,,
8,28,Other/Unknown,HS-grad,Married,Other/Unknown,White,Female,20,1.0,,,,,,,,
9,20,Private,Some-college,Single,Service,White,Female,40,1.0,,,,,,,,


In [15]:
get_metrics(test10,'zero')

[1.0, 4.75, 0.6071428571428572, 0.5, 0.75, 0.75, 0.75, 0.6]
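
In this run six of the ten examples ended with `zero_status` 1, so their result columns are NaN. pandas skips NaN by default when averaging, which means the averages printed above describe only the four successful rows. A small sketch with three columns copied from the table (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

nan = np.nan
# Columns copied from the table above; NaN marks failed examples.
run = pd.DataFrame({
    "zero_status":         [1, 0, 0, 1, 0, 0, 1, 1, 1, 1],
    "zero_rules":          [nan, 7, 4, nan, 4, 4, nan, nan, nan, nan],
    "zero_rules_followed": [nan, 3, 2, nan, 4, 2, nan, nan, nan, nan],
})

n_successful = int((run["zero_status"] == 0).sum())   # 4
failure_share = run["zero_status"].mean()              # 0.6
mean_rules = run["zero_rules"].mean()                  # 4.75, NaN rows skipped
followed_share = (run["zero_rules_followed"] / run["zero_rules"]).mean()

print(n_successful, failure_share, mean_rules, followed_share)
```

The values 4.75 and roughly 0.607 match the second and third entries of the list above, and 0.6 matches its last entry. Reporting the number of successful rows next to the averages makes it clear how small the underlying sample is.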

In [8]:
test10 = pd.read_csv('test10.csv')
model_description = """ML-system that predicts whether a person will earn more than 50k $ a year"""
prompts = ['one']
for prompt in prompts:
    print(prompt)
    exp_m = ToTExplanationMachine(model, model_description, string_info(dataset.columns, adult_info), train_dataset, test_dataset, prompt, 5, 3)
    exp_m.fit()
    for i in range(test10.shape[0]):
        try:
            example_label, n_rules, rules_followed, first_rule, second_rule, third_rule, in_dataset = exp_m.explain_evaluate(user_data=test.iloc[[i]], verbose=False)
            # give temp_csv.csv and evaluation.csv a per-example name
            os.rename('temp_csv.csv', f'ex_5fcs_{i}.csv')
            os.rename('evaluation.csv', f'eval_5fcs_{i}.csv')
            test10.loc[i, prompt + '_label'] = example_label
            test10.loc[i, prompt + '_rules'] = n_rules
            test10.loc[i, prompt + '_rules_followed'] = rules_followed
            test10.loc[i, prompt + '_rule_1'] = first_rule
            test10.loc[i, prompt + '_rule_2'] = second_rule
            test10.loc[i, prompt + '_rule_3'] = third_rule
            # in_cfs is not returned by this explain_evaluate call
            #test10.loc[i, prompt + '_in_cfs'] = in_cfs
            test10.loc[i, prompt + '_in_data'] = in_dataset
            test10.loc[i, prompt + '_status'] = 0  # 0 = explanation succeeded
            # stop after the first successful example; later rows stay NaN
            break
        except Exception as e:
            test10.loc[i, prompt + '_status'] = 1  # 1 = explanation failed
            print(e)
test10


one


100%|██████████| 1/1 [00:00<00:00,  4.30it/s]
100%|██████████| 1/1 [00:00<00:00,  4.75it/s]
100%|██████████| 1/1 [00:00<00:00,  4.59it/s]


Unnamed: 0,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,one_label,one_rules,one_rules_followed,one_rule_1,one_rule_2,one_rule_3,one_in_data,one_status
0,45,Self-Employed,HS-grad,Married,Blue-Collar,White,Male,50,1.0,3.0,2.0,1.0,1.0,0.0,False,0.0
1,46,Private,HS-grad,Divorced,Blue-Collar,White,Male,30,,,,,,,,
2,49,Private,HS-grad,Widowed,Service,White,Male,40,,,,,,,,
3,26,Government,Bachelors,Single,Professional,Other,Female,40,,,,,,,,
4,45,Government,Bachelors,Divorced,Service,White,Female,40,,,,,,,,
5,35,Private,Some-college,Married,Blue-Collar,White,Male,60,,,,,,,,
6,29,Private,HS-grad,Single,Blue-Collar,Other,Male,40,,,,,,,,
7,40,Self-Employed,Some-college,Married,Service,White,Male,50,,,,,,,,
8,28,Other/Unknown,HS-grad,Married,Other/Unknown,White,Female,20,,,,,,,,
9,20,Private,Some-college,Single,Service,White,Female,40,,,,,,,,
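
The evaluation loops above repeat the same column assignments for every prompt, and two of them discard the exception that caused a failure. One possible alternative, sketched here under assumptions, is to collect one record per example and join the records to the table once. `evaluate_rows`, `RESULT_FIELDS` and `explain_fn` are hypothetical names; `explain_fn` stands in for a call such as `exp_m.explain_evaluate` and is assumed to return its values in the order used in the last cell.

```python
import pandas as pd

# Assumed order of the values returned by the explanation call.
RESULT_FIELDS = ["label", "rules", "rules_followed",
                 "rule_1", "rule_2", "rule_3", "in_data"]


def evaluate_rows(df, prompt, explain_fn):
    """Call explain_fn on each one-row slice of df and return df
    with prompt-prefixed result columns appended."""
    records = []
    for i in range(len(df)):
        record = {f"{prompt}_status": 1}          # assume failure until proven otherwise
        try:
            values = explain_fn(df.iloc[[i]])
            for field, value in zip(RESULT_FIELDS, values):
                record[f"{prompt}_{field}"] = value
            record[f"{prompt}_status"] = 0
        except Exception as exc:
            # keep the error message instead of discarding it
            record[f"{prompt}_error"] = repr(exc)
        records.append(record)
    results = pd.DataFrame(records, index=df.index)
    return df.join(results)


# Usage with a fake explanation function that fails on the second row.
def fake_explain(row):
    if row["age"].iloc[0] == 46:
        raise ValueError("boom")
    return (1, 3, 2, 1, 1, 0, False)


demo = evaluate_rows(pd.DataFrame({"age": [45, 46]}), "one", fake_explain)
print(demo)
```

Building the result table in one step keeps failed rows as NaN, as in the tables above, while the extra `one_error` column records why an example failed.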
