# Many to One Explainer - Demo of Conjunctive vs Disjunctive Explanations

This demo compares conjunctive and disjunctive explanations on several datasets, measuring both the quality metrics of the explanations and the runtime of explanation generation.\
Each dataset is put through multiple tests, and the average results are displayed at the bottom of its section.\
Additionally, the bottom of the notebook collects the per-dataset averages along with the average across all datasets.
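
To make the two explanation forms concrete, here is a minimal sketch (not pd_explain's implementation) that evaluates one conjunctive and one disjunctive rule on a toy table. It assumes that coverage is the share of the target group's rows that satisfy the rule, and that separation error is the share of rule-satisfying rows that fall outside the group. The rows, predicates and helper names are all hypothetical.

```python
# Toy rows: (age, hours_per_week, label). All values are made up for illustration.
rows = [
    (25, 20, "low"),
    (30, 45, "low"),
    (45, 50, "high"),
    (50, 30, "low"),
    (55, 60, "high"),
    (38, 55, "high"),
]

def older(row):
    return row[0] >= 40

def long_hours(row):
    return row[1] >= 45

def conjunction(*preds):
    # A row matches only if every predicate holds.
    return lambda row: all(p(row) for p in preds)

def disjunction(*preds):
    # A row matches if at least one predicate holds.
    return lambda row: any(p(row) for p in preds)

def evaluate(rule, rows, target):
    """Return (coverage, separation_error) under the assumed definitions."""
    matched = [r for r in rows if rule(r)]
    in_group = [r for r in rows if r[2] == target]
    hits = [r for r in matched if r[2] == target]
    coverage = len(hits) / len(in_group) if in_group else 0.0
    separation_error = (len(matched) - len(hits)) / len(matched) if matched else 0.0
    return coverage, separation_error

conj_metrics = evaluate(conjunction(older, long_hours), rows, "high")
disj_metrics = evaluate(disjunction(older, long_hours), rows, "high")
print("conjunctive:", conj_metrics)
print("disjunctive:", disj_metrics)
```

On this toy data the conjunction matches fewer rows, so it covers less of the group but admits fewer outsiders, while the disjunction does the opposite. The tests below look for the same kind of trade-off on real datasets.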

## Setup

In [1]:
import pandas as pd
import pd_explain
import time
import numpy as np
from sklearn.cluster import KMeans



For this demo, we use the default separation error threshold of 0.3, which reflects a real-world scenario.\
To change it, set the `separation_error` variable below.

In [3]:
separation_error = 0.3

By default, we use sampling to speed up the explanation generation process.

In [4]:
pd_explain.get_use_sampling_value()

True

This can be changed either per function call or globally.\
If you would like to run this demo without sampling, uncomment and run the cell below prior to running the rest of the notebook.

In [5]:
# pd_explain.toggle_sampling(False)
# pd_explain.get_use_sampling_value()

In [6]:
def build_comparison_df(comparison_df, test_name, explanation_type, res_df, runtime):
    try:
        # Each group without an explanation has 1 row with "No explanation found" in the 'Explanation' column / index
        num_groups_not_found = res_df.loc[:, 'No explanation found', :].shape[0]
        # Each group that does have explanations can have multiple rows, and we want to count the number of groups, which is the length of the index on the first level
        num_groups_total = len(res_df.index.levels[0])
        percentage_not_found = (num_groups_not_found / num_groups_total) * 100
    # This should only happen in case explanations were found for all instances
    except KeyError:
        percentage_not_found = 0
    try:
        # Get the string representation of the explanations, for each explanation that is not "No explanation found"
        explanations = res_df.index.get_level_values(1)[res_df.index.get_level_values(1) != 'No explanation found'].to_series()
        # Measure the length of each explanation, by counting the number of 'AND' or 'OR' substrings
        explanation_lengths = explanations.apply(lambda x: x.count('AND') + x.count('OR') + 1)
        # Calculate the average explanation length
        avg_explanation_length = explanation_lengths.mean()
        num_explanations_found = len(explanation_lengths)
    # This should only happen in case explanations were not found for any instance
    except (ZeroDivisionError, KeyError):
        avg_explanation_length = pd.NA
        num_explanations_found = 0

    comparison_df.loc[(test_name, explanation_type), :] = [np.mean(res_df['Coverage']), np.mean(res_df['Separation Error']), percentage_not_found, avg_explanation_length, num_explanations_found, runtime]
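
The explanation length above is derived by counting `'AND'` and `'OR'` substrings. That works for the explanations in this notebook, but it would overcount if an attribute name or value happened to contain those letters. Below is a hedged sketch of a stricter count that splits only on whole connectives surrounded by whitespace. Both helper names are hypothetical, and the last example string is invented to show the pitfall.

```python
import re

# Split only on the connectives AND / OR when they are surrounded by whitespace.
_CONNECTIVE = re.compile(r"\s+(?:AND|OR)\s+")

def explanation_length(explanation: str) -> int:
    """Number of predicates in an explanation string."""
    return len(_CONNECTIVE.split(explanation.strip()))

def naive_length(explanation: str) -> int:
    """Substring counting, mirroring the lambda in build_comparison_df."""
    return explanation.count("AND") + explanation.count("OR") + 1

examples = [
    "0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10",
    "occupation != Prof-specialty OR occupation != Exec-managerial",
    "marital-status == Married-civ-spouse",
    "REGION == ORLANDO",  # hypothetical value containing both "OR" and "AND"
]
for text in examples:
    print(explanation_length(text), naive_length(text), text)
```

The two counts agree on the first three strings, which come from this notebook's outputs, and differ only on the invented one.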

In [7]:
columns = ['Average Coverage', 'Average Separation Error', '% Explanation Not Found', 'Average Explanation Length', 'Number of Explanations Found', 'Runtime']
index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['Test', 'Explanation Type'])

## Adults dataset

In [8]:
adults = pd.read_csv(r"..\Examples\Datasets\adult.csv")
adults

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [9]:
adults_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [10]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels='label', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.28957557678222656


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,No explanation found,,


In [11]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels='label', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.34167981147766113


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,No explanation found,,


### Numeric only and categorical only tests

In [12]:
numeric_only = adults[['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']]
categorical_only = adults[['workclass', 'education', 'marital-status', 'occupation', 'relationship']]
labels = adults['label']

In [13]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.8245134353637695


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,No explanation found,,


In [14]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.6171953678131104


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,No explanation found,,


In [15]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.09193849563598633


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,occupation != Prof-specialty AND occupation != Exec-managerial,0.82,0.17
<=50K,occupation != Exec-managerial,0.91,0.21
<=50K,occupation != Prof-specialty,0.91,0.21
>50K,No explanation found,,


In [16]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.11255860328674316


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,occupation != Exec-managerial,0.91,0.21
<=50K,occupation != Prof-specialty,0.91,0.21
<=50K,occupation != Prof-specialty OR occupation != Exec-managerial,1.0,0.24
>50K,No explanation found,,
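
One detail worth noting about the last disjunctive explanation: a disjunction of two `!=` predicates on the same attribute is satisfied by every row, because no row can equal both values at once. That is consistent with its coverage of 1.0. If separation error is the share of matching rows outside the group (an assumption about the metric), its value here would simply be the share of rows not labelled `<=50K`. A small check, using a handful of occupation values from the dataset preview:

```python
# A few occupation values from the Adults preview, including the two named in the rule.
occupations = ["Adm-clerical", "Exec-managerial", "Handlers-cleaners", "Prof-specialty"]

def rule(occupation):
    # occupation != Prof-specialty OR occupation != Exec-managerial
    return occupation != "Prof-specialty" or occupation != "Exec-managerial"

# The rule holds for every value, including the two it names.
always_true = all(rule(o) for o in occupations)
print(always_true)
```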


### Groupby tests

In [17]:
single_attribute_groupby = adults.groupby('relationship').mean()
multi_attribute_groupby = adults.groupby(['sex', 'relationship']).mean()

In [18]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.3326494693756104


Group / Cluster,Explanation,Coverage,Separation Error
Husband,marital-status == Married-civ-spouse AND sex != Female,1.0,0.01
Husband,marital-status == Married-civ-spouse,1.0,0.11
Wife,marital-status == Married-civ-spouse AND sex == Female,0.99,0.05
Not-in-family,No explanation found,,
Other-relative,No explanation found,,
Own-child,No explanation found,,
Unmarried,No explanation found,,


In [19]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.4520368576049805


Group / Cluster,Explanation,Coverage,Separation Error
Husband,marital-status == Married-civ-spouse,1.0,0.11
Not-in-family,No explanation found,,
Other-relative,No explanation found,,
Own-child,No explanation found,,
Unmarried,No explanation found,,
Wife,No explanation found,,
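
The test cells in this notebook time each call with `time.time()`. For measuring short intervals, `time.perf_counter()` is generally the better clock, since it is monotonic and high-resolution. A minimal sketch of a reusable wrapper follows. `timed` is a hypothetical helper, not part of pd_explain, and the `sum` call merely stands in for a call such as `adults.explain(...)`.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be an explain(...) call.
result, elapsed = timed(sum, range(1_000_000))
print(f"Result: {result}, runtime: {elapsed:.4f} s")
```

Because most runtimes reported here are well under a few seconds, single measurements are noisy. Repeating each call and comparing medians would give a steadier comparison.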


In [20]:
start_time = time.time()
conj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Multi attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 2.692655086517334


Group / Cluster,Explanation,Coverage,Separation Error
"('Male', 'Husband')",marital-status == Married-civ-spouse,1.0,0.11
"('Male', 'Husband')",marital-status == Married-civ-spouse AND occupation != Adm-clerical,0.95,0.09
"('Male', 'Husband')",marital-status == Married-civ-spouse AND 39.5 <= hours-per-week <= 99,0.88,0.08
"('Male', 'Husband')",marital-status == Married-civ-spouse AND 38.5 <= hours-per-week <= 99,0.88,0.08
"('Male', 'Husband')",marital-status == Married-civ-spouse AND 40 <= hours-per-week <= 99,0.88,0.08
"('Female', 'Not-in-family')",No explanation found,,
"('Female', 'Other-relative')",No explanation found,,
"('Female', 'Own-child')",No explanation found,,
"('Female', 'Unmarried')",No explanation found,,
"('Female', 'Wife')",No explanation found,,


In [21]:
start_time = time.time()
disj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Multi attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 2.7848727703094482


Group / Cluster,Explanation,Coverage,Separation Error
"('Male', 'Husband')",marital-status == Married-civ-spouse,1.0,0.11
"('Female', 'Not-in-family')",No explanation found,,
"('Female', 'Other-relative')",No explanation found,,
"('Female', 'Own-child')",No explanation found,,
"('Female', 'Unmarried')",No explanation found,,
"('Female', 'Wife')",No explanation found,,
"('Male', 'Not-in-family')",No explanation found,,
"('Male', 'Other-relative')",No explanation found,,
"('Male', 'Own-child')",No explanation found,,
"('Male', 'Unmarried')",No explanation found,,


### Clustering test

#### 3 clusters

In [22]:
cluster_labels = KMeans(n_clusters=3).fit_predict(pd.get_dummies(adults))



In [23]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.8298277854919434


Group / Cluster,Explanation,Coverage,Separation Error
0,291675.5 <= fnlwgt <= 1490400,1.0,0.0
1,13769 <= fnlwgt <= 149628.5,1.0,0.0
2,149628.5 <= fnlwgt <= 291675.5,1.0,0.0


In [24]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.9882702827453613


Group / Cluster,Explanation,Coverage,Separation Error
0,291675.5 <= fnlwgt <= 1490400,1.0,0.0
1,13769 <= fnlwgt <= 149628.5,1.0,0.0
2,149628.5 <= fnlwgt <= 291675.5,1.0,0.0


#### 15 clusters

In [25]:
cluster_labels = KMeans(n_clusters=15).fit_predict(pd.get_dummies(adults))

In [26]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 4.713717460632324


Group / Cluster,Explanation,Coverage,Separation Error
0,145798.0 <= fnlwgt <= 179933.0,1.0,0.0
2,68158.5 <= fnlwgt <= 110200.5,1.0,0.0
5,19914 <= fnlwgt <= 68158.5,1.0,0.0
6,405736.0 <= fnlwgt <= 488706,1.0,0.0
9,110200.5 <= fnlwgt <= 145798.0,1.0,0.0
11,176467 <= fnlwgt <= 207202,0.84,0.1
11,176467 <= fnlwgt <= 231413,1.0,0.28
1,No explanation found,,
3,No explanation found,,
7,No explanation found,,


In [27]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 4.624136686325073


Group / Cluster,Explanation,Coverage,Separation Error
0,145798.0 <= fnlwgt <= 179933.0,1.0,0.0
2,68158.5 <= fnlwgt <= 110200.5,1.0,0.0
5,19914 <= fnlwgt <= 68158.5,1.0,0.0
6,405736.0 <= fnlwgt <= 488706,1.0,0.0
9,110200.5 <= fnlwgt <= 145798.0,1.0,0.0
11,176467 <= fnlwgt <= 207202,0.84,0.1
11,176467 <= fnlwgt <= 231413,1.0,0.28
1,No explanation found,,
3,No explanation found,,
7,No explanation found,,
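
Several runs above report that the labels were pruned to the 10 most common ones. How pd_explain does this internally (including how it breaks ties) is not shown in this notebook. The following is only a sketch of what such pruning could look like, with a hypothetical `prune_labels` helper and made-up labels.

```python
from collections import Counter

def prune_labels(labels, max_labels=10):
    """Keep only the positions whose label is among the max_labels most frequent."""
    keep = {label for label, _ in Counter(labels).most_common(max_labels)}
    kept_positions = [i for i, label in enumerate(labels) if label in keep]
    return keep, kept_positions

# Hypothetical labels: 12 groups, where group k appears k + 1 times.
labels = [f"g{k}" for k in range(12) for _ in range(k + 1)]
keep, kept_positions = prune_labels(labels)
print(sorted(keep), len(kept_positions))
```

This matters when reading the results: with more than 10 groups, the "% Explanation Not Found" figure is computed over the retained groups only, so the smallest groups do not count for or against either explanation form.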


### Summary of results

In [28]:
conj_averages = adults_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = adults_comparison_df.loc[:, 'disjunctive', :].mean()
adults_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
adults_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages

In [29]:
adults_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Full dataset,conjunctive,0.8725,0.1725,50.0,1.25,4.0,0.289576
Full dataset,disjunctive,0.913333,0.186667,50.0,1.0,3.0,0.34168
Numeric only,conjunctive,0.8725,0.1725,50.0,1.25,4.0,0.824513
Numeric only,disjunctive,0.913333,0.186667,50.0,1.0,3.0,0.617195
Categorical only,conjunctive,0.88,0.196667,50.0,1.333333,3.0,0.091938
Categorical only,disjunctive,0.94,0.22,50.0,1.333333,3.0,0.112559
Single attribute groupby,conjunctive,0.996667,0.056667,66.666667,1.666667,3.0,1.332649
Single attribute groupby,disjunctive,1.0,0.11,83.333333,1.0,1.0,1.452037
Multi attribute groupby,conjunctive,0.918,0.088,90.0,1.8,5.0,2.692655
Multi attribute groupby,disjunctive,1.0,0.11,90.0,1.0,1.0,2.784873
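
As a worked illustration of the averaging step, the sketch below recomputes the mean coverage per explanation type from the rows displayed above, using only the standard library. It covers just those displayed rows, so it will not necessarily match an average that also includes the clustering tests. `average_by_type` is a hypothetical helper.

```python
from statistics import mean

# (test, explanation type) -> average coverage, copied from the rows shown above.
coverage = {
    ("Full dataset", "conjunctive"): 0.8725,
    ("Full dataset", "disjunctive"): 0.913333,
    ("Numeric only", "conjunctive"): 0.8725,
    ("Numeric only", "disjunctive"): 0.913333,
    ("Categorical only", "conjunctive"): 0.88,
    ("Categorical only", "disjunctive"): 0.94,
    ("Single attribute groupby", "conjunctive"): 0.996667,
    ("Single attribute groupby", "disjunctive"): 1.0,
    ("Multi attribute groupby", "conjunctive"): 0.918,
    ("Multi attribute groupby", "disjunctive"): 1.0,
}

def average_by_type(values, explanation_type):
    """Mean of all entries whose second key component matches explanation_type."""
    return mean(v for (_, kind), v in values.items() if kind == explanation_type)

conj_avg = average_by_type(coverage, "conjunctive")
disj_avg = average_by_type(coverage, "disjunctive")
print(f"conjunctive: {conj_avg:.4f}, disjunctive: {disj_avg:.4f}")
```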


## Spotify dataset

In [30]:
spotify = pd.read_csv(r"..\Examples\Datasets\spotify_all.csv")
spotify

Unnamed: 0,acousticness,artists,danceability,energy,explicit,id,instrumentalness,key,liveness,loudness,...,name,popularity,speechiness,tempo,valence,year,decade,popularity_score,main_artist,duration_minutes
0,0.991000,['Mamie Smith'],0.598,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.3790,-12.628,...,Keep A Song In Your Soul,12,0.0936,149.976,0.6340,1920,1920,10,Mamie Smith,2.805550
1,0.643000,"[""Screamin' Jay Hawkins""]",0.852,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.026400,5,0.0809,-7.261,...,I Put A Spell On You,7,0.0534,86.889,0.9500,1920,1920,0,Screamin' Jay Hawkins,2.503333
2,0.993000,['Mamie Smith'],0.647,0.186,0,11m7laMUgmOKqI3oYzuhne,0.000018,0,0.5190,-12.098,...,Golfing Papa,4,0.1740,97.600,0.6890,1920,1920,0,Mamie Smith,2.730450
3,0.000173,['Oscar Velazquez'],0.730,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801000,2,0.1280,-7.311,...,True House Music - Xavier Santos & Carlos Gomi...,17,0.0425,127.997,0.0422,1920,1920,10,Oscar Velazquez,7.034783
4,0.295000,['Mixe'],0.704,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.4020,-6.036,...,Xuniverxe,2,0.0768,122.076,0.2990,1920,1920,0,Mixe,2.753733
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174384,0.009170,"['DJ Combo', 'Sander-7', 'Tony T']",0.792,0.866,0,46LhBf6TvYjZU2SMvGZAbn,0.000060,6,0.1780,-5.089,...,The One,0,0.0356,125.972,0.1860,2020,2020,0,DJ Combo,2.460250
174385,0.795000,['Alessia Cara'],0.429,0.211,0,7tue2Wemjd0FZzRtDrQFZd,0.000000,4,0.1960,-11.665,...,A Little More,0,0.0360,94.710,0.2280,2021,2020,0,Alessia Cara,2.412000
174386,0.806000,['Roger Fly'],0.671,0.589,0,48Qj61hOdYmUCFJbpQ29Ob,0.920000,4,0.1130,-12.393,...,Together,0,0.0282,108.058,0.7140,2020,2020,0,Roger Fly,3.635783
174387,0.920000,['Taylor Swift'],0.462,0.240,1,1gcyHQpBQ1lfXGdhZmWrHP,0.000000,0,0.1130,-12.077,...,champagne problems,69,0.0377,171.319,0.3200,2021,2020,60,Taylor Swift,4.066667


In [31]:
spotify_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [32]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels='explicit', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.755469560623169


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.3233333333333333 AND 0 <= popularity <= 46.5,0.78,0.02
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.194 AND 0 <= popularity <= 46.5,0.75,0.01
1,No explanation found,,


In [33]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels='explicit', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.8857603073120117


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= popularity <= 44.0 OR 0.0 <= speechiness <= 0.2425,1.0,0.06
0,0.0 <= speechiness <= 0.16166666666666665 OR 0 <= popularity <= 46.5,0.99,0.05
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= popularity <= 44.0 OR 0.0 <= speechiness <= 0.16166666666666665,0.99,0.05
0,0.0 <= speechiness <= 0.2425 OR 0 <= popularity <= 46.5,1.0,0.06
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= popularity <= 44.0 OR 0.0 <= speechiness <= 0.194,0.99,0.05
0,0.0 <= speechiness <= 0.485,0.97,0.06
1,No explanation found,,


### Numeric only and categorical only tests

In [34]:
numeric_only = spotify[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']]
categorical_only = spotify[['name', 'main_artist', 'year']]
labels = spotify['explicit']

In [35]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.5948865413665771


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
1,No explanation found,,


In [36]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.7212879657745361


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.3233333333333333 OR 0.0 <= instrumentalness <= 0.333,1.0,0.07
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= instrumentalness <= 0.4995 OR 0.0 <= speechiness <= 0.485,1.0,0.07
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.3233333333333333 OR 0.0 <= instrumentalness <= 0.4995,1.0,0.07
0,0.0 <= instrumentalness <= 0.24975 OR 0.0 <= speechiness <= 0.485,1.0,0.07
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= instrumentalness <= 0.333 OR 0.0 <= speechiness <= 0.485,1.0,0.07
1,No explanation found,,


In [37]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.32207226753234863


Group / Cluster,Explanation,Coverage,Separation Error
0,name == infrequent_sklearn,1.0,0.07
1,No explanation found,,


In [38]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.28539109230041504


Group / Cluster,Explanation,Coverage,Separation Error
0,name == infrequent_sklearn,1.0,0.07
1,No explanation found,,


### Groupby tests

In [39]:
single_attribute_groupby = spotify.groupby('mode').mean()

In [40]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.9400739669799805


Group / Cluster,Explanation,Coverage,Separation Error
1,0.07324999943375587 <= acousticness <= 0.996,0.78,0.28
1,0.0 <= speechiness <= 0.485,0.95,0.3
1,0.0 <= speechiness <= 0.3233333333333333 AND 0.07324999943375587 <= acousticness <= 0.996,0.73,0.27
0,No explanation found,,


In [41]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.0879826545715332


Group / Cluster,Explanation,Coverage,Separation Error
1,0.07324999943375587 <= acousticness <= 0.996,0.78,0.28
1,0.07324999943375587 <= acousticness <= 0.996 OR 0.0 <= speechiness <= 0.485,1.0,0.3
1,0.0 <= speechiness <= 0.485,0.95,0.3
0,No explanation found,,


### Clustering test

#### 3 clusters

In [42]:
# The Spotify dataset is too large to cluster on all columns via get_dummies:
# attempting it raised a MemoryError (unable to allocate 22.3 GiB).
# We therefore cluster on the numeric audio features only.
cols = ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']
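
To see why one-hot encoding every column can exhaust memory, a rough back-of-the-envelope estimate helps: a dense one-hot matrix needs about rows × total dummy columns × bytes per cell. The sketch below is only illustrative. The row count is a lower bound read off the last index in the dataset preview, the cardinalities are hypothetical placeholders rather than values measured from the dataset, and it does not attempt to reproduce the exact 22.3 GiB figure.

```python
def one_hot_bytes(n_rows, cardinalities, bytes_per_cell=1):
    """Rough size of a dense one-hot matrix: rows x total dummy columns x bytes per cell."""
    return n_rows * sum(cardinalities.values()) * bytes_per_cell

# At least this many rows, judging by the last index shown in the preview (174387).
n_rows = 174_388
# Hypothetical numbers of distinct values per text column (assumptions, not measurements).
cardinalities = {"name": 100_000, "main_artist": 20_000}

estimate = one_hot_bytes(n_rows, cardinalities)
print(f"{estimate / 2**30:.1f} GiB")
```

High-cardinality text columns dominate the total. If the encoded matrix is later converted to 64-bit floats for clustering, the footprint grows by another factor of eight (`bytes_per_cell=8`).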

In [43]:
cluster_labels = KMeans(n_clusters=3).fit_predict(spotify[cols])

In [44]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.1946170330047607


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= tempo <= 106.38400268554688,0.99,0.0
0,0.0 <= tempo <= 109.723,1.0,0.07
1,106.68099975585938 <= tempo <= 219.446,1.0,0.25
1,109.72300000000001 <= tempo <= 146.29733333333334,0.92,0.0
1,106.78049850463867 <= tempo <= 219.446,1.0,0.25
1,109.723 <= tempo <= 164.5845,0.93,0.13
2,147.2570037841797 <= tempo <= 219.446,1.0,0.0


In [45]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.3401148319244385


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= tempo <= 109.723,1.0,0.07
0,0.0 <= tempo <= 106.38400268554688,0.99,0.0
1,106.78049850463867 <= tempo <= 219.446,1.0,0.25
1,109.72300000000001 <= tempo <= 146.29733333333334,0.92,0.0
1,109.723 <= tempo <= 164.5845,0.93,0.13
1,106.68099975585938 <= tempo <= 219.446,1.0,0.25
2,147.2570037841797 <= tempo <= 219.446,1.0,0.0


#### 15 clusters

In [46]:
cluster_labels = KMeans(n_clusters=15).fit_predict(spotify[cols])

In [47]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 4.911241292953491


Group / Cluster,Explanation,Coverage,Separation Error
0,112.581 <= tempo <= 125.547 AND -20.3665 <= loudness <= 0.14300000000000068,0.96,0.2
0,112.581 <= tempo <= 125.547,0.97,0.21
3,87.1984977722168 <= tempo <= 100.50500106811523 AND -15.725500106811523 <= loudness <= 0.143,0.95,0.03
3,86.93149948120117 <= tempo <= 100.50500106811523,1.0,0.21
3,87.762 <= tempo <= 99.814,0.93,0.19
3,86.93149948120117 <= tempo <= 100.50500106811523 AND -15.067500114440918 <= loudness <= 0.143,0.93,0.02
3,86.93149948120117 <= tempo <= 100.50500106811523 AND -20.3665 <= loudness <= 0.14300000000000068,1.0,0.13
3,87.1984977722168 <= tempo <= 100.50500106811523 AND -15.725500106811523 <= loudness <= 0.143 AND 0.0228 <= speechiness <= 0.49589999999999995,0.94,0.02
3,87.762 <= tempo <= 99.814 AND -15.725500106811523 <= loudness <= 0.143,0.88,0.01
3,86.93149948120117 <= tempo <= 100.50500106811523 AND -16.27899932861328 <= loudness <= 0.143,0.97,0.04


In [48]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 4.6106226444244385


Group / Cluster,Explanation,Coverage,Separation Error
0,112.581 <= tempo <= 125.547,0.97,0.21
3,87.762 <= tempo <= 99.814,0.93,0.19
3,86.93149948120117 <= tempo <= 100.50500106811523,1.0,0.21
8,133.0308 <= tempo <= 154.0424,1.0,0.3
13,160.44000244140625 <= tempo <= 175.054,0.98,0.0
13,157.54433333333333 <= tempo <= 175.054,1.0,0.17
2,No explanation found,,
4,No explanation found,,
5,No explanation found,,
6,No explanation found,,
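
The warning printed above says that when there are more than 10 unique labels, only the 10 most common are kept. The sketch below shows what frequency-based pruning of that kind could look like. It assumes the pruning ranks labels purely by how often they occur, as the message suggests; `top_k_labels` is a hypothetical helper and not how `pd_explain` is necessarily implemented.

```python
from collections import Counter


def top_k_labels(labels, k=10):
    """Return the k most frequent labels, most common first.

    Hypothetical illustration of frequency-based label pruning.
    """
    return [label for label, _ in Counter(labels).most_common(k)]


labels = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
kept = top_k_labels(labels, k=2)
print(kept)  # [3, 0]
```

Running something similar on `cluster_labels` before calling `explain` would show in advance which clusters can appear in the output table.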


### Summary of results

In [49]:
conj_averages = spotify_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = spotify_comparison_df.loc[:, 'disjunctive', :].mean()
spotify_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
spotify_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages

In [50]:
spotify_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Full dataset,conjunctive,0.883333,0.035,50.0,1.333333,6.0,0.75547
Full dataset,disjunctive,0.971111,0.05,50.0,1.555556,9.0,0.88576
Numeric only,conjunctive,0.9425,0.045,50.0,1.0,4.0,0.594887
Numeric only,disjunctive,0.974444,0.058889,50.0,1.555556,9.0,0.721288
Categorical only,conjunctive,1.0,0.07,50.0,1.0,1.0,0.322072
Categorical only,disjunctive,1.0,0.07,50.0,1.0,1.0,0.285391
Single attribute groupby,conjunctive,0.82,0.283333,50.0,1.333333,3.0,0.940074
Single attribute groupby,disjunctive,0.91,0.293333,50.0,1.333333,3.0,1.087983
3 clusters,conjunctive,0.977143,0.1,0.0,1.0,7.0,1.194617
3 clusters,disjunctive,0.977143,0.1,0.0,1.0,7.0,1.340115


## House Prices dataset

In [51]:
houses = pd.read_csv(r"..\Examples\Datasets\houses.csv")
houses = houses[['MSSubClass', 'LotArea', 'Street', 'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'HouseStyle',
'OverallQual', 'OverallCond', 'YearBuilt', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'ExterQual', 'Foundation', 'BsmtQual',
'CentralAir', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'PoolArea', 'YrSold', 'SalePrice']]

In [52]:
houses_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [53]:
start_time = time.time()
conj_res = houses.explain(explainer='many to one', labels='HouseStyle', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.317922353744507


Group / Cluster,Explanation,Coverage,Separation Error
2Story,671.0 <= 2ndFlrSF <= 2065 AND 55.0 <= MSSubClass <= 190,0.85,0.07
2Story,545.5 <= 2ndFlrSF <= 2065 AND 55.0 <= MSSubClass <= 190,0.96,0.09
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
2Story,200.0 <= 2ndFlrSF <= 2065 AND 60 <= MSSubClass <= 190,0.99,0.1
2Story,671.0 <= 2ndFlrSF <= 2065,0.85,0.14
2Story,600 <= 2ndFlrSF <= 2065,0.91,0.17
2Story,545.5 <= 2ndFlrSF <= 2065 AND 60 <= MSSubClass <= 190,0.96,0.09
2Story,200.0 <= 2ndFlrSF <= 2065,1.0,0.29
2Story,671.0 <= 2ndFlrSF <= 2065 AND 60 <= MSSubClass <= 190,0.85,0.07
2Story,545.5 <= 2ndFlrSF <= 2065,0.97,0.2


In [54]:
start_time = time.time()
disj_res = houses.explain(explainer='many to one', labels='HouseStyle', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.0194993019104004


Group / Cluster,Explanation,Coverage,Separation Error
2Story,200.0 <= 2ndFlrSF <= 2065,1.0,0.29
2Story,545.5 <= 2ndFlrSF <= 2065 OR 55.0 <= MSSubClass <= 72.5,0.98,0.2
2Story,545.5 <= 2ndFlrSF <= 2065,0.97,0.2
2Story,600 <= 2ndFlrSF <= 2065,0.91,0.17
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
2Story,671.0 <= 2ndFlrSF <= 2065,0.85,0.14
2Story,671.0 <= 2ndFlrSF <= 2065 OR 55.0 <= MSSubClass <= 72.5,0.91,0.13
2Story,55.0 <= MSSubClass <= 72.5 OR 600 <= 2ndFlrSF <= 2065,0.95,0.17
1Story,20 <= MSSubClass <= 20,0.74,0.0
1Story,20 <= MSSubClass <= 42.5,0.83,0.01
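
The two tables above differ in how predicates are combined: conjunctive explanations join them with `AND`, disjunctive ones with `OR`. The sketch below evaluates both forms on a few made-up rows, using two intervals taken from the tables. The rows are invented for illustration and are not from `houses.csv`; `in_range` is a hypothetical helper.

```python
# Invented rows for illustration; not taken from the houses dataset.
rows = [
    {"2ndFlrSF": 800, "MSSubClass": 60},  # satisfies both predicates
    {"2ndFlrSF": 700, "MSSubClass": 20},  # satisfies only the 2ndFlrSF predicate
    {"2ndFlrSF": 0, "MSSubClass": 60},    # satisfies only the MSSubClass predicate
    {"2ndFlrSF": 0, "MSSubClass": 20},    # satisfies neither
]

# (column, lower bound, upper bound), as in the explanation strings
predicates = [("2ndFlrSF", 545.5, 2065), ("MSSubClass", 55.0, 72.5)]


def in_range(row, column, low, high):
    """True when low <= row[column] <= high."""
    return low <= row[column] <= high


conj = [all(in_range(r, *p) for p in predicates) for r in rows]
disj = [any(in_range(r, *p) for p in predicates) for r in rows]
print(conj)  # [True, False, False, False]
print(disj)  # [True, True, True, False]
```

Every row matched by the conjunction is also matched by the disjunction, so adding an `OR` predicate can only widen the set of matched rows, while adding an `AND` predicate can only narrow it.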


### Numeric only and categorical only tests

In [55]:
numeric_only = houses[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'PoolArea', 'SalePrice']]
categorical_only = houses[['MSSubClass', 'Street', 'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'ExterQual', 'Foundation', 'BsmtQual', 'CentralAir', 'YrSold']]
labels = houses['HouseStyle']

In [56]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.8937678337097168


Group / Cluster,Explanation,Coverage,Separation Error
2Story,545.5 <= 2ndFlrSF <= 2065 AND 334.0 <= GrLivArea <= 2988.0,0.95,0.19
2Story,671.0 <= 2ndFlrSF <= 2065,0.85,0.14
2Story,671.0 <= 2ndFlrSF <= 2065 AND 1964.0 <= YearBuilt <= 2010.0,0.7,0.04
2Story,671.0 <= 2ndFlrSF <= 2065 AND 334.0 <= GrLivArea <= 2988.0,0.83,0.13
2Story,545.5 <= 2ndFlrSF <= 2065 AND 1968.5 <= YearBuilt <= 2010,0.75,0.06
2Story,200.0 <= 2ndFlrSF <= 2065 AND 1964.0 <= YearBuilt <= 2010.0,0.81,0.08
2Story,600 <= 2ndFlrSF <= 2065,0.91,0.17
2Story,545.5 <= 2ndFlrSF <= 2065 AND 1941.0 <= YearBuilt <= 2010.0,0.82,0.12
2Story,200.0 <= 2ndFlrSF <= 2065,1.0,0.29
2Story,545.5 <= 2ndFlrSF <= 2065,0.97,0.2


In [57]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.799609899520874


Group / Cluster,Explanation,Coverage,Separation Error
2Story,200.0 <= 2ndFlrSF <= 2065,1.0,0.29
2Story,688.3333333333334 <= 2ndFlrSF <= 1376.6666666666667,0.78,0.12
2Story,600 <= 2ndFlrSF <= 2065,0.91,0.17
2Story,671.0 <= 2ndFlrSF <= 2065,0.85,0.14
2Story,545.5 <= 2ndFlrSF <= 2065,0.97,0.2
1Story,0 <= 2ndFlrSF <= 55.0,1.0,0.13
1Story,0 <= 2ndFlrSF <= 0,1.0,0.13
1.5Fin,No explanation found,,
1.5Unf,No explanation found,,
2.5Fin,No explanation found,,


In [58]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.45153117179870605


Group / Cluster,Explanation,Coverage,Separation Error
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
2Story,50 <= MSSubClass <= 70,0.81,0.28
1Story,20 <= MSSubClass <= 30,0.83,0.01
1Story,20 <= MSSubClass <= 42.5,0.83,0.01
1Story,20 <= MSSubClass <= 25.0,0.74,0.0
1Story,MSSubClass == 20,0.74,0.0
1Story,MSSubClass == 20,0.74,0.0
1.5Fin,MSSubClass == 50,0.92,0.02
1.5Unf,42.5 <= MSSubClass <= 47.5,0.86,0.0
1.5Unf,MSSubClass == 45,0.86,0.0


In [59]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.48761868476867676


Group / Cluster,Explanation,Coverage,Separation Error
2Story,50 <= MSSubClass <= 70,0.81,0.28
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
1Story,20 <= MSSubClass <= 30,0.83,0.01
1Story,20 <= MSSubClass <= 42.5,0.83,0.01
1Story,20 <= MSSubClass <= 20,0.74,0.0
1Story,20 <= MSSubClass <= 25.0,0.74,0.0
1.5Fin,42.5 <= MSSubClass <= 55.0,0.92,0.1
1.5Unf,42.5 <= MSSubClass <= 47.5,0.86,0.0
2.5Fin,No explanation found,,
2.5Unf,No explanation found,,


### Groupby tests

In [60]:
single_attribute_groupby = houses.groupby('ExterQual').mean()
multi_attribute_groupby = houses.groupby(['OverallQual', 'ExterQual']).mean()

In [61]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.8108689785003662


Group / Cluster,Explanation,Coverage,Separation Error
Gd,1984.5 <= YearBuilt <= 2010,0.86,0.27
Gd,1982.4 <= YearBuilt <= 2010.0,0.87,0.28
Gd,1980.5 <= YearBuilt <= 2010 AND 7.0 <= OverallQual <= 10.0,0.76,0.19
Gd,1980.5 <= YearBuilt <= 2010 AND 6.5 <= OverallQual <= 10,0.76,0.19
Gd,6.4 <= OverallQual <= 8.2,0.76,0.24
Gd,1987.0 <= YearBuilt <= 2010.0 AND 5.5 <= OverallQual <= 10.0 AND 34900.0 <= SalePrice <= 394950.0,0.82,0.22
Gd,1980.5 <= YearBuilt <= 2010 AND 171625.0 <= SalePrice <= 755000,0.79,0.23
Gd,1980.5 <= YearBuilt <= 2010 AND 6.4 <= OverallQual <= 8.2,0.72,0.12
Gd,1980.5 <= YearBuilt <= 2010 AND 6 <= OverallQual <= 10,0.88,0.25
Gd,1982.4 <= YearBuilt <= 2010.0 AND 6.4 <= OverallQual <= 8.2,0.72,0.12


In [62]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.6081326007843018


Group / Cluster,Explanation,Coverage,Separation Error
Gd,7.0 <= OverallQual <= 8.5,0.76,0.24
Gd,6.4 <= OverallQual <= 8.2,0.76,0.24
Gd,1984.5 <= YearBuilt <= 2010,0.86,0.27
Gd,1982.4 <= YearBuilt <= 2010.0,0.87,0.28
Gd,1980.5 <= YearBuilt <= 2010,0.88,0.29
TA,1872 <= YearBuilt <= 1980.5 OR 34900.0 <= SalePrice <= 214925.0,0.98,0.24
TA,1872 <= YearBuilt <= 1980.5,0.86,0.09
TA,1 <= OverallQual <= 6 OR 34900.0 <= SalePrice <= 214925.0,0.97,0.23
TA,1872 <= YearBuilt <= 1973 OR 4.0 <= OverallQual <= 7.0,1.0,0.27
TA,1.5 <= OverallQual <= 6.5,0.89,0.12


In [63]:
start_time = time.time()
conj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Multi attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 1.3080627918243408


Group / Cluster,Explanation,Coverage,Separation Error
"(4, 'TA')",No explanation found,,
"(5, 'Gd')",No explanation found,,
"(5, 'TA')",No explanation found,,
"(6, 'Gd')",No explanation found,,
"(6, 'TA')",No explanation found,,
"(7, 'Gd')",No explanation found,,
"(7, 'TA')",No explanation found,,
"(8, 'Gd')",No explanation found,,
"(9, 'Ex')",No explanation found,,
"(9, 'Gd')",No explanation found,,


In [64]:
start_time = time.time()
disj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Multi attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 1.1448500156402588


Group / Cluster,Explanation,Coverage,Separation Error
"(4, 'TA')",No explanation found,,
"(5, 'Gd')",No explanation found,,
"(5, 'TA')",No explanation found,,
"(6, 'Gd')",No explanation found,,
"(6, 'TA')",No explanation found,,
"(7, 'Gd')",No explanation found,,
"(7, 'TA')",No explanation found,,
"(8, 'Gd')",No explanation found,,
"(9, 'Ex')",No explanation found,,
"(9, 'Gd')",No explanation found,,


### Clustering test

#### 3 clusters

In [65]:
labels = KMeans(n_clusters=3).fit_predict(pd.get_dummies(houses))
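
`KMeans` works on numeric arrays, so the cell above passes the frame through `pd.get_dummies` first. The sketch below shows what that encoding does to a mixed frame. The three rows are invented for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Invented rows; only the column names mirror the houses dataset.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250],
        "Street": ["Pave", "Pave", "Grvl"],
    }
)

# Numeric columns pass through unchanged; each string column is replaced by
# one indicator column per distinct value.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['LotArea', 'Street_Grvl', 'Street_Pave']
```

No scaling is applied after the encoding, so columns with large magnitudes contribute far more to the Euclidean distances than the 0/1 indicator columns. That may be why the cluster explanations below are expressed mostly in terms of `SalePrice`.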

In [66]:
start_time = time.time()
conj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.796339750289917


Group / Cluster,Explanation,Coverage,Separation Error
0,34900 <= SalePrice <= 173816.5,1.0,0.0
1,173816.5 <= SalePrice <= 755000,1.0,0.18
1,178920.0 <= SalePrice <= 322940.0,0.9,0.07
2,296246.5 <= SalePrice <= 755000,1.0,0.0


In [67]:
start_time = time.time()
disj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.7269830703735352


Group / Cluster,Explanation,Coverage,Separation Error
0,34900 <= SalePrice <= 173816.5,1.0,0.0
1,178920.0 <= SalePrice <= 322940.0,0.9,0.07
1,173816.5 <= SalePrice <= 755000,1.0,0.18
2,296246.5 <= SalePrice <= 755000,1.0,0.0
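
To make the `Coverage` and `Separation Error` columns concrete, the sketch below computes both for a single interval rule on invented data. It assumes coverage is the share of the group's rows that satisfy the rule, and separation error is the share of rows satisfying the rule that belong to other groups. These definitions are an assumption made for illustration; the library's exact formulas may differ, and `coverage_and_separation_error` is a hypothetical helper.

```python
def coverage_and_separation_error(values, labels, group, low, high):
    """Score the rule `low <= value <= high` as an explanation of `group`.

    Assumed definitions (not confirmed against pd_explain):
      coverage         = matched rows in group / rows in group
      separation error = matched rows outside group / all matched rows
    """
    matched = [low <= v <= high for v in values]
    in_group = [label == group for label in labels]
    group_size = sum(in_group)
    matched_total = sum(matched)
    matched_in_group = sum(m and g for m, g in zip(matched, in_group))
    coverage = matched_in_group / group_size if group_size else 0.0
    separation_error = (
        (matched_total - matched_in_group) / matched_total if matched_total else 0.0
    )
    return coverage, separation_error


# Invented prices and cluster labels, for illustration only.
prices = [100, 120, 150, 170, 200, 250, 300, 400]
clusters = [0, 0, 0, 0, 1, 1, 1, 2]
cov, sep = coverage_and_separation_error(prices, clusters, group=1, low=180, high=400)
print(cov, sep)  # 1.0 0.25
```

Under these definitions, widening an interval can raise coverage but also pulls in rows from neighbouring clusters. That matches the pattern in the tables, where cluster 1's wider `SalePrice` range reaches coverage 1.0 at a separation error of 0.18, against 0.9 and 0.07 for the narrower one.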


#### 15 clusters

In [68]:
labels = KMeans(n_clusters=15).fit_predict(pd.get_dummies(houses))

In [69]:
start_time = time.time()
conj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 2.163193464279175


Group / Cluster,Explanation,Coverage,Separation Error
0,128100.0 <= SalePrice <= 151200.0 AND 483.0 <= 1stFlrSF <= 1184.5,0.78,0.0
0,128100.0 <= SalePrice <= 151200.0,0.99,0.01
3,103100.0 <= SalePrice <= 128100.0,1.0,0.01
5,301032.0 <= SalePrice <= 358165.0,0.97,0.02
7,72500 <= SalePrice <= 103100.0,1.0,0.0
9,150825.0 <= SalePrice <= 174250.0 AND Exterior1st != MetalSd,0.84,0.0
9,150825.0 <= SalePrice <= 174250.0,1.0,0.01
9,150825.0 <= SalePrice <= 174250.0 AND 20.0 <= MSSubClass <= 105.0,0.84,0.0
12,352000.0 <= SalePrice <= 415298,1.0,0.0
1,No explanation found,,


In [70]:
start_time = time.time()
disj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 1.5325193405151367


Group / Cluster,Explanation,Coverage,Separation Error
0,128100.0 <= SalePrice <= 151200.0,0.99,0.01
3,103100.0 <= SalePrice <= 128100.0,1.0,0.01
5,301032.0 <= SalePrice <= 358165.0,0.97,0.02
7,72500 <= SalePrice <= 103100.0,1.0,0.0
9,150825.0 <= SalePrice <= 174250.0,1.0,0.01
12,352000.0 <= SalePrice <= 415298,1.0,0.0
1,No explanation found,,
4,No explanation found,,
11,No explanation found,,
14,No explanation found,,


### Summary of results

In [71]:
conj_averages = houses_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = houses_comparison_df.loc[:, 'disjunctive', :].mean()
houses_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
houses_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages
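
The averaging step above selects every row of one explanation type from a two-level index and takes the column means. The sketch below reproduces that on an invented frame using `DataFrame.xs`, which should select the same rows for the purpose of averaging. The index names and values here are assumptions chosen to mirror the summary table; they are not the notebook's data.

```python
import pandas as pd

# Invented values; index names are assumed to mirror the summary table.
index = pd.MultiIndex.from_product(
    [["Test A", "Test B"], ["conjunctive", "disjunctive"]],
    names=["Test", "Explanation Type"],
)
df = pd.DataFrame(
    {
        "Average Coverage": [0.8, 0.9, float("nan"), 1.0],
        "Runtime": [1.0, 2.0, 3.0, 4.0],
    },
    index=index,
)

# Cross-section on the second index level, then column-wise mean.
conj_avg = df.xs("conjunctive", level="Explanation Type").mean()
disj_avg = df.xs("disjunctive", level="Explanation Type").mean()
print(conj_avg["Runtime"], disj_avg["Runtime"])  # 2.0 3.0
```

`mean()` skips `NaN` by default, so a test in which no explanation was found does not pull the averaged coverage down; it is simply left out of that column's average. Note also that re-running the averaging cell after the `Average` rows have been added would include those rows in the new mean.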

In [72]:
houses_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Full dataset,conjunctive,0.862727,0.045682,25.0,1.613636,44.0,2.317922
Full dataset,disjunctive,0.893125,0.105625,50.0,1.1875,16.0,1.019499
Numeric only,conjunctive,0.892778,0.130556,75.0,1.611111,18.0,1.893768
Numeric only,disjunctive,0.93,0.168571,75.0,1.0,7.0,0.79961
Categorical only,conjunctive,0.82,0.03,37.5,1.0,11.0,0.451531
Categorical only,disjunctive,0.81625,0.05125,50.0,1.0,8.0,0.487619
Single attribute groupby,conjunctive,0.815238,0.184048,50.0,1.904762,42.0,0.810869
Single attribute groupby,disjunctive,0.919524,0.209048,50.0,1.47619,21.0,0.608133
Multi attribute groupby,conjunctive,,,100.0,,0.0,1.308063
Multi attribute groupby,disjunctive,,,100.0,,0.0,1.14485


## Bank churn dataset

In [73]:
bank_df = pd.read_csv(r"..\Examples\Datasets\bank_churners_user_study.csv")
bank_df = bank_df.drop(columns=['CLIENTNUM'])
bank_df

,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Registered_Products_Count,Months_Inactive_Count_Last_Year,Contacts_Count_Last_Year,Credit_Limit,Credit_Used,Credit_Open_To_Buy,Total_Amount_Change_Q4_vs_Q1,Total_Transitions_Amount,Total_Transitions_Count,Total_Count_Change_Q4_vs_Q1,Credit_Avg_Utilization_Ratio
0,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.000
3,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.760
4,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.500,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,Existing Customer,50,M,2,Graduate,Single,$40K - $60K,Blue,40,3,2,3,4003.0,1851,2152.0,0.703,15476,117,0.857,0.462
10123,Attrited Customer,41,M,2,Unknown,Divorced,$40K - $60K,Blue,25,4,2,3,4277.0,2186,2091.0,0.804,8764,69,0.683,0.511
10124,Attrited Customer,44,F,1,High School,Married,Less than $40K,Blue,36,5,3,4,5409.0,0,5409.0,0.819,10291,60,0.818,0.000
10125,Attrited Customer,30,M,2,Graduate,Unknown,$40K - $60K,Blue,36,4,3,3,5281.0,0,5281.0,0.535,8395,62,0.722,0.000


In [74]:
bank_df_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [75]:
start_time = time.time()
conj_res = bank_df.explain(explainer='many to one', labels='Attrition_Flag', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(bank_df_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.6959044933319092


Group / Cluster,Explanation,Coverage,Separation Error
Existing Customer,537.5 <= Credit_Used <= 2415.5,0.76,0.07
Existing Customer,54.5 <= Total_Transitions_Count <= 131,0.74,0.05
Attrited Customer,No explanation found,,


In [76]:
start_time = time.time()
disj_res = bank_df.explain(explainer='many to one', labels='Attrition_Flag', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(bank_df_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.7413156032562256


Group / Cluster,Explanation,Coverage,Separation Error
Existing Customer,54.5 <= Total_Transitions_Count <= 131,0.74,0.05
Existing Customer,537.5 <= Credit_Used <= 2415.5,0.76,0.07
Attrited Customer,No explanation found,,


### Numeric only and categorical only tests

In [77]:
numeric_only = bank_df[['Customer_Age', 'Dependent_count', 'Months_on_book', 'Registered_Products_Count', 'Months_Inactive_Count_Last_Year', 'Contacts_Count_Last_Year', 'Credit_Limit', 'Credit_Used', 'Credit_Open_To_Buy', 'Total_Amount_Change_Q4_vs_Q1', 'Total_Transitions_Amount', 'Total_Transitions_Count', 'Total_Count_Change_Q4_vs_Q1', 'Credit_Avg_Utilization_Ratio']]
categorical_only = bank_df[['Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']]
labels = bank_df['Attrition_Flag']

In [78]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(bank_df_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.6578261852264404


Group / Cluster,Explanation,Coverage,Separation Error
Existing Customer,537.5 <= Credit_Used <= 2415.5,0.76,0.07
Existing Customer,54.5 <= Total_Transitions_Count <= 131,0.74,0.05
Attrited Customer,No explanation found,,


In [79]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(bank_df_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.7017171382904053


Group / Cluster,Explanation,Coverage,Separation Error
Existing Customer,54.5 <= Total_Transitions_Count <= 131,0.74,0.05
Existing Customer,537.5 <= Credit_Used <= 2415.5,0.76,0.07
Attrited Customer,No explanation found,,


In [80]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(bank_df_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.14700865745544434


Group / Cluster,Explanation,Coverage,Separation Error
Existing Customer,Card_Category != Category_infrequent_sklearn,1.0,0.17
Attrited Customer,No explanation found,,


In [81]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(bank_df_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.08714985847473145


Group / Cluster,Explanation,Coverage,Separation Error
Existing Customer,Card_Category != Category_infrequent_sklearn,1.0,0.17
Attrited Customer,No explanation found,,


### Groupby tests

In [82]:
single_attribute_groupby = bank_df.groupby('Education_Level').mean()
multi_attribute_groupby = bank_df.groupby(['Education_Level', 'Income_Category']).mean()

In [83]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(bank_df_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 9.903491497039795


Group / Cluster,Explanation,Coverage,Separation Error
College,No explanation found,,
Doctorate,No explanation found,,
Graduate,No explanation found,,
High School,No explanation found,,
Post-Graduate,No explanation found,,
Uneducated,No explanation found,,
Unknown,No explanation found,,


In [84]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(bank_df_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 10.580077171325684


Group / Cluster,Explanation,Coverage,Separation Error
College,No explanation found,,
Doctorate,No explanation found,,
Graduate,No explanation found,,
High School,No explanation found,,
Post-Graduate,No explanation found,,
Uneducated,No explanation found,,
Unknown,No explanation found,,


In [85]:
start_time = time.time()
conj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(bank_df_comparison_df, 'Multi attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 7.984471321105957


Group / Cluster,Explanation,Coverage,Separation Error
"('College', 'Less than $40K')",No explanation found,,
"('Graduate', '$40K - $60K')",No explanation found,,
"('Graduate', '$60K - $80K')",No explanation found,,
"('Graduate', '$80K - $120K')",No explanation found,,
"('Graduate', 'Less than $40K')",No explanation found,,
"('Graduate', 'Unknown')",No explanation found,,
"('High School', '$40K - $60K')",No explanation found,,
"('High School', 'Less than $40K')",No explanation found,,
"('Uneducated', 'Less than $40K')",No explanation found,,
"('Unknown', 'Less than $40K')",No explanation found,,


In [86]:
start_time = time.time()
disj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(bank_df_comparison_df, 'Multi attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 8.00534439086914


Group / Cluster,Explanation,Coverage,Separation Error
"('College', 'Less than $40K')",No explanation found,,
"('Graduate', '$40K - $60K')",No explanation found,,
"('Graduate', '$60K - $80K')",No explanation found,,
"('Graduate', '$80K - $120K')",No explanation found,,
"('Graduate', 'Less than $40K')",No explanation found,,
"('Graduate', 'Unknown')",No explanation found,,
"('High School', '$40K - $60K')",No explanation found,,
"('High School', 'Less than $40K')",No explanation found,,
"('Uneducated', 'Less than $40K')",No explanation found,,
"('Unknown', 'Less than $40K')",No explanation found,,


### Clustering test

#### 3 clusters

In [87]:
labels = KMeans(n_clusters=3).fit_predict(pd.get_dummies(bank_df))
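
Because `KMeans` uses random initialization, the cluster labels (and therefore the explanations and metrics below) can differ between runs. If reproducible results are wanted, one option is to fix `random_state`. The sketch below assumes this and uses a small toy frame in place of `bank_df`; the `Segment` column and the `cluster` helper are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for bank_df: three well-separated numeric groups plus one categorical column
toy_df = pd.DataFrame({
    "Credit_Limit": [1500.0, 1600.0, 1700.0, 9000.0, 9100.0, 9200.0, 30000.0, 30100.0, 30200.0],
    "Segment": ["low"] * 3 + ["mid"] * 3 + ["high"] * 3,
})
# One-hot encode the categorical column, as the notebook does with pd.get_dummies
features = pd.get_dummies(toy_df, dtype=float)


def cluster(features, n_clusters, seed=0):
    """Hypothetical helper: KMeans with a fixed seed so repeated runs give the same labels."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features)


labels_a = cluster(features, 3)
labels_b = cluster(features, 3)
print((labels_a == labels_b).all())
```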

In [88]:
start_time = time.time()
conj_res = bank_df.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(bank_df_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.983729362487793


Group / Cluster,Explanation,Coverage,Separation Error
0,22381.0 <= Credit_Limit <= 34516.0,0.99,0.01
0,23490.1 <= Credit_Limit <= 34516.0,0.93,0.0
0,21008.0 <= Credit_Open_To_Buy <= 34516.0,1.0,0.02
0,22381.0 <= Credit_Limit <= 34516.0 AND 21008.0 <= Credit_Open_To_Buy <= 34516.0,0.99,0.0
1,9085.0 <= Credit_Limit <= 34516.0 AND 3.0 <= Credit_Open_To_Buy <= 17259.5,0.79,0.01
1,8631.25 <= Credit_Open_To_Buy <= 17259.5,0.71,0.0
1,8650.5 <= Credit_Limit <= 34516.0 AND 3.0 <= Credit_Open_To_Buy <= 17259.5,0.82,0.04
1,8159.0 <= Credit_Limit <= 34516.0 AND 3.0 <= Credit_Open_To_Buy <= 17259.5,0.83,0.11
2,1438.3 <= Credit_Limit <= 9707.724999999999,1.0,0.04
2,1438.3 <= Credit_Limit <= 9707.724999999999 AND 3.0 <= Credit_Open_To_Buy <= 8631.25,1.0,0.02


In [89]:
start_time = time.time()
disj_res = bank_df.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(bank_df_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.8672430515289307


Group / Cluster,Explanation,Coverage,Separation Error
0,22381.0 <= Credit_Limit <= 34516.0,0.99,0.01
0,23490.1 <= Credit_Limit <= 34516.0,0.93,0.0
0,23490.1 <= Credit_Limit <= 34516.0 OR 23011.666666666668 <= Credit_Open_To_Buy <= 34516.0,0.94,0.0
0,21008.0 <= Credit_Open_To_Buy <= 34516.0,1.0,0.02
1,8631.25 <= Credit_Open_To_Buy <= 17259.5,0.71,0.0
2,1438.3 <= Credit_Limit <= 9707.724999999999,1.0,0.04
2,3.0 <= Credit_Open_To_Buy <= 6987.0 OR 1438.3 <= Credit_Limit <= 8281.5,0.99,0.0
2,1438.3 <= Credit_Limit <= 8281.5,0.98,0.0
2,3.0 <= Credit_Open_To_Buy <= 8631.25,1.0,0.04


#### 15 clusters

In [90]:
labels = KMeans(n_clusters=15).fit_predict(pd.get_dummies(bank_df))

In [91]:
start_time = time.time()
conj_res = bank_df.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(bank_df_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 3.1712865829467773


Group / Cluster,Explanation,Coverage,Separation Error
0,3399.0 <= Total_Transitions_Amount <= 6268.0 AND 1438.3 <= Credit_Limit <= 3073.0,0.83,0.0
0,24.0 <= Credit_Open_To_Buy <= 2959.0 AND 3393.0 <= Total_Transitions_Amount <= 17744 AND 1438.3 <= Credit_Limit <= 4043.0,0.98,0.06
0,24.0 <= Credit_Open_To_Buy <= 2959.0 AND 3399.0 <= Total_Transitions_Amount <= 6268.0 AND 1438.3 <= Credit_Limit <= 4043.0,0.95,0.02
0,3306.5 <= Total_Transitions_Amount <= 17744 AND 1438.3 <= Credit_Limit <= 4043.0,0.99,0.11
0,24.0 <= Credit_Open_To_Buy <= 5772.666666666667 AND 3306.5 <= Total_Transitions_Amount <= 17744,1.0,0.3
...,...,...,...
14,24.0 <= Credit_Open_To_Buy <= 2959.0 AND 530.0 <= Total_Transitions_Amount <= 3399.0,0.99,0.08
14,530.0 <= Total_Transitions_Amount <= 3399.0 AND 1438.3 <= Credit_Limit <= 3073.0,0.8,0.02
14,530.0 <= Total_Transitions_Amount <= 3399.0 AND 1438.3 <= Credit_Limit <= 4043.0,0.99,0.08
14,24.0 <= Credit_Open_To_Buy <= 2959.0 AND 530.0 <= Total_Transitions_Amount <= 3399.0 AND 1438.3 <= Credit_Limit <= 4043.0,0.99,0.03


In [92]:
start_time = time.time()
disj_res = bank_df.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(bank_df_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 1.8690433502197266


Group / Cluster,Explanation,Coverage,Separation Error
2,12464.199999999999 <= Credit_Limit <= 17977.149999999998,0.88,0.28
8,12006.0 <= Total_Transitions_Amount <= 17744.0,1.0,0.0
8,10922.0 <= Total_Transitions_Amount <= 17744,1.0,0.0
9,25893.0 <= Credit_Open_To_Buy <= 34516.0,1.0,0.0
9,28767.333333333336 <= Credit_Open_To_Buy <= 34516.0,1.0,0.0
9,27466.5 <= Credit_Open_To_Buy <= 34516.0,1.0,0.0
9,27617.6 <= Credit_Open_To_Buy <= 34516.0,1.0,0.0
11,21284.919999999995 <= Credit_Limit <= 27900.459999999995,0.93,0.01
11,20719.199999999997 <= Credit_Open_To_Buy <= 27617.6,0.81,0.0
11,21284.919999999995 <= Credit_Limit <= 27900.459999999995 OR 20719.199999999997 <= Credit_Open_To_Buy <= 27617.6,0.96,0.01


### Summary of results

In [93]:
conj_averages = bank_df_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = bank_df_comparison_df.loc[:, 'disjunctive', :].mean()
bank_df_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
bank_df_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages
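
When reading the averages, note that `DataFrame.mean()` skips `NaN` by default. Tests where no explanation was found therefore do not contribute to the average coverage or separation error, although they still count toward the average runtime. A minimal sketch on toy data (the values and test names are made up) that selects one explanation type with `xs` and averages it:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("Test A", "conjunctive"),
        ("Test A", "disjunctive"),
        ("Test B", "conjunctive"),
        ("Test B", "disjunctive"),
    ],
    names=["Test", "Explanation Type"],
)
toy = pd.DataFrame(
    {
        # NaN marks a test where no explanation was found
        "Average Coverage": [0.8, 0.9, np.nan, 0.7],
        "Runtime": [1.0, 2.0, 3.0, 4.0],
    },
    index=index,
)

# Select every 'conjunctive' row across tests, then average each column
conj_mean = toy.xs("conjunctive", level="Explanation Type").mean()
print(conj_mean)
```

Here the conjunctive coverage average is taken over "Test A" only, while the runtime average uses both tests.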

In [94]:
bank_df_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Full dataset,conjunctive,0.75,0.06,50.0,1.0,2.0,0.695904
Full dataset,disjunctive,0.75,0.06,50.0,1.0,2.0,0.741316
Numeric only,conjunctive,0.75,0.06,50.0,1.0,2.0,0.657826
Numeric only,disjunctive,0.75,0.06,50.0,1.0,2.0,0.701717
Categorical only,conjunctive,1.0,0.17,50.0,1.0,1.0,0.147009
Categorical only,disjunctive,1.0,0.17,50.0,1.0,1.0,0.08715
Single attribute groupby,conjunctive,,,100.0,,0.0,9.903491
Single attribute groupby,disjunctive,,,100.0,,0.0,10.580077
Multi attribute groupby,conjunctive,,,100.0,,0.0,7.984471
Multi attribute groupby,disjunctive,,,100.0,,0.0,8.005344


## Summary of all results

In [95]:
results_summary = pd.DataFrame(columns=columns, index=pd.MultiIndex(levels=[[], [], []], codes=[[], [], []], names=['Dataset', 'Test', 'Explanation Type']))

In [96]:
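def build_final_df(final_df, dataset_df, dataset_name):
    # Copy each (test, explanation type) row into the summary, keyed by dataset name
    for (test_name, explanation_type), row in dataset_df.iterrows():
        final_df.loc[(dataset_name, test_name, explanation_type), :] = row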

In [97]:
build_final_df(results_summary, adults_comparison_df, 'Adults')
build_final_df(results_summary, spotify_comparison_df, 'Spotify')
build_final_df(results_summary, houses_comparison_df, 'Houses')
build_final_df(results_summary, bank_df_comparison_df, 'Bank churn')

In [98]:
# Compute the average over all tests of all datasets, excluding each dataset's own 'Average' rows
all_datasets_avg_conj_averages = results_summary.loc[(slice(None), slice(None), 'conjunctive'), :].drop('Average', level=1).mean()
all_datasets_avg_disj_averages = results_summary.loc[(slice(None), slice(None), 'disjunctive'), :].drop('Average', level=1).mean()
results_summary.loc[('All Datasets', 'Average', 'conjunctive'), :] = all_datasets_avg_conj_averages
results_summary.loc[('All Datasets', 'Average', 'disjunctive'), :] = all_datasets_avg_disj_averages
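
Dropping the per-dataset 'Average' rows before averaging means every individual test is weighted equally. Averaging the 'Average' rows instead would weight every dataset equally, which gives a different number when datasets have different numbers of tests. A small sketch on made-up data, with hypothetical dataset and test names, shows the difference:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("D1", "T1", "conjunctive"),
        ("D1", "Average", "conjunctive"),
        ("D2", "T1", "conjunctive"),
        ("D2", "T2", "conjunctive"),
        ("D2", "Average", "conjunctive"),
    ],
    names=["Dataset", "Test", "Explanation Type"],
)
toy = pd.DataFrame({"Runtime": [1.0, 1.0, 2.0, 6.0, 4.0]}, index=index)

# Per-test average: drop the 'Average' rows, then average the remaining tests
per_test_mean = toy.drop("Average", level="Test")["Runtime"].mean()

# Per-dataset average: average only the 'Average' rows
per_dataset_mean = toy.xs("Average", level="Test")["Runtime"].mean()

print(per_test_mean, per_dataset_mean)
```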

In [99]:
results_summary

Dataset,Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Adults,Full dataset,conjunctive,0.8725,0.1725,50.0,1.25,4,0.289576
Adults,Full dataset,disjunctive,0.913333,0.186667,50.0,1.0,3,0.34168
Adults,Numeric only,conjunctive,0.8725,0.1725,50.0,1.25,4,0.824513
Adults,Numeric only,disjunctive,0.913333,0.186667,50.0,1.0,3,0.617195
Adults,Categorical only,conjunctive,0.88,0.196667,50.0,1.333333,3,0.091938
...,...,...,...,...,...,...,...,...
Bank churn,15 clusters,disjunctive,0.943636,0.054545,50.0,1.090909,11,1.869043
Bank churn,Average,conjunctive,0.858899,0.090196,50.0,1.335507,12.285714,3.363388
Bank churn,Average,disjunctive,0.878505,0.071354,57.142857,1.062626,3.571429,3.264556
All Datasets,Average,conjunctive,0.902117,0.094948,45.709877,1.336802,11.333333,1.954765
