# Many to One Explainer - Demo of Conjunctive vs Disjunctive Explanations

This demo compares conjunctive and disjunctive explanations on several datasets, measuring both quality metrics and explanation-generation runtime.\
Each dataset is put through several tests, and the average results are displayed at the bottom of its section.\
Additionally, the bottom of the notebook collects the average results of every dataset, together with the average across all datasets.

## Setup

In [1]:
import pandas as pd
import pd_explain
import time
import numpy as np
from sklearn.cluster import KMeans



This comparison uses a very high, practically unrealistic separation error threshold of 0.8.\
The threshold is set this high so that fewer explanations are filtered out, which avoids a situation where one form appears better than the other only because most of its results were filtered out.\
You can of course lower this threshold to a more realistic value, such as 0.2 or 0.3, to get results that are more representative of a real-world scenario.

In [2]:
separation_error = 0.8
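
To make the two quality metrics and the threshold concrete, here is a minimal sketch. It assumes that coverage is the share of a group's rows that satisfy an explanation, and that separation error is the share of rows satisfying the explanation that belong to other groups. These definitions, the toy rows, and the helper names are assumptions made for illustration and are not taken from `pd_explain`.

```python
# Toy rows: (education_num, married, label). All values are made up for illustration.
rows = [
    (13, True, ">50K"),
    (14, False, ">50K"),
    (9, True, ">50K"),
    (10, False, "<=50K"),
    (9, False, "<=50K"),
    (13, False, "<=50K"),
    (7, True, "<=50K"),
    (9, False, "<=50K"),
]


def high_education(row):
    return row[0] >= 11


def married(row):
    return row[1]


def conj(row):
    # Conjunctive form: both predicates must hold
    return high_education(row) and married(row)


def disj(row):
    # Disjunctive form: at least one predicate must hold
    return high_education(row) or married(row)


def coverage(predicate, rows, group):
    """Assumed definition: share of the group's rows that satisfy the predicate."""
    in_group = [r for r in rows if r[2] == group]
    return sum(predicate(r) for r in in_group) / len(in_group)


def separation_error(predicate, rows, group):
    """Assumed definition: share of rows satisfying the predicate that lie outside the group."""
    matched = [r for r in rows if predicate(r)]
    if not matched:
        return 0.0
    return sum(r[2] != group for r in matched) / len(matched)


candidates = {"high_education": high_education, "married": married, "conj": conj, "disj": disj}
scores = {
    name: (coverage(p, rows, ">50K"), separation_error(p, rows, ">50K"))
    for name, p in candidates.items()
}

# Keep only explanations whose separation error does not exceed the threshold
kept_strict = [name for name, (_, err) in scores.items() if err <= 0.35]
kept_loose = [name for name, (_, err) in scores.items() if err <= 0.8]

for name, (cov, err) in scores.items():
    print(f"{name}: coverage={cov:.2f}, separation error={err:.2f}")
```

On this toy data the conjunction has the lowest coverage and the lowest separation error, while the disjunction has the highest of both. A threshold of 0.35 drops the disjunction, whereas 0.8 keeps all four candidates, which is the effect the high threshold in this notebook is meant to have.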

By default, we use sampling to speed up the explanation generation process.

In [3]:
pd_explain.get_use_sampling_value()

True

This can be changed either per function call or globally.\
If you would like to run this demo without sampling, uncomment and run the cell below prior to running the rest of the notebook.

In [4]:
# pd_explain.toggle_sampling(False)
# pd_explain.get_use_sampling_value()

In [5]:
def build_comparison_df(comparison_df, test_name, explanation_type, res_df, runtime):
    try:
        # Each group without an explanation has 1 row with "No explanation found" in the 'Explanation' column / index
        num_groups_not_found = res_df.loc[:, 'No explanation found', :].shape[0]
        # Each group that does have explanations can have multiple rows, and we want to count the number of groups, which is the length of the index on the first level
        num_groups_total = len(res_df.index.levels[0])
        percentage_not_found = (num_groups_not_found / num_groups_total) * 100
    # This should only happen in case explanations were found for all instances
    except KeyError:
        percentage_not_found = 0
    try:
        # Get the string representation of the explanations, for each explanation that is not "No explanation found"
        explanations = res_df.index.get_level_values(1)[res_df.index.get_level_values(1) != 'No explanation found'].to_series()
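        # Measure the length of each explanation by counting its ' AND ' / ' OR ' connectives.
        # The surrounding spaces keep attribute names or values that merely contain 'AND' or 'OR' from being counted.
        explanation_lengths = explanations.apply(lambda x: x.count(' AND ') + x.count(' OR ') + 1)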
        # Calculate the average explanation length
        avg_explanation_length = explanation_lengths.mean()
        num_explanations_found = len(explanation_lengths)
    # This should only happen in case explanations were not found for any instance
    except (ZeroDivisionError, KeyError):
        avg_explanation_length = pd.NA
        num_explanations_found = 0

    comparison_df.loc[(test_name, explanation_type), :] = [np.mean(res_df['Coverage']), np.mean(res_df['Separation Error']), percentage_not_found, avg_explanation_length, num_explanations_found, runtime]
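
As a rough illustration of what `build_comparison_df` measures, the sketch below applies the same two ideas (explanation length and the share of groups without an explanation) to a small hand-made table. The table is hypothetical and only mimics the two-level index of the results shown in this notebook. The sketch counts connectives with surrounding spaces, which is a choice made here rather than something prescribed by `pd_explain`.

```python
import pandas as pd

NOT_FOUND = "No explanation found"

# Hypothetical result table with a (group, explanation) index
idx = pd.MultiIndex.from_tuples(
    [
        ("A", "x <= 1 AND y == 2"),
        ("A", "x <= 1"),
        ("B", NOT_FOUND),
    ],
    names=["Group / Cluster", "Explanation"],
)
toy_res = pd.DataFrame(
    {"Coverage": [0.8, 1.0, None], "Separation Error": [0.1, 0.2, None]},
    index=idx,
)

groups = toy_res.index.get_level_values("Group / Cluster")
explanations = toy_res.index.get_level_values("Explanation")

# Length of an explanation = number of connectives + 1
found = explanations[explanations != NOT_FOUND]
lengths = [e.count(" AND ") + e.count(" OR ") + 1 for e in found]
avg_length = sum(lengths) / len(lengths)

# Share of groups whose only row is the "not found" placeholder
groups_not_found = groups[explanations == NOT_FOUND].nunique()
pct_not_found = 100 * groups_not_found / groups.nunique()

print(lengths, avg_length, pct_not_found)
```

Here group `A` contributes two explanations of lengths 2 and 1, and group `B` has none, so one of the two groups is counted as not found.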

In [6]:
columns = ['Average Coverage', 'Average Separation Error', '% Explanation Not Found', 'Average Explanation Length', 'Number of Explanations Found', 'Runtime']
index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['Test', 'Explanation Type'])
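
Every test in this notebook times the `explain` call with the same start/stop pattern. As a sketch of how that repetition could be factored out, the hypothetical helper `timed` below wraps any callable and uses `time.perf_counter`, a monotonic clock intended for measuring short durations. The helper name and the stand-in workload are assumptions for illustration; the notebook itself keeps the explicit pattern.

```python
import time


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in this notebook the callable would be a DataFrame's `explain` method
result, elapsed = timed(sum, range(1_000_000))
print(f"Runtime: {elapsed:.3f} s")
```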

## Adults dataset

In [106]:
adults = pd.read_csv(r"..\Examples\Datasets\adult.csv")
adults

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [107]:
adults_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [108]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels='label', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.780376672744751


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,marital-status == Married-civ-spouse AND 8.5 <= education-num <= 16.0,0.81,0.52
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,marital-status == Married-civ-spouse,0.84,0.55
>50K,8.5 <= education-num <= 16.0,0.97,0.73


In [109]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels='label', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.0024807453155518


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,marital-status == Married-civ-spouse,0.84,0.55
>50K,8.5 <= education-num <= 16.0,0.97,0.73
>50K,8.5 <= education-num <= 16.0 OR marital-status == Married-civ-spouse,1.0,0.74


### Numeric only and categorical only tests

In [110]:
numeric_only = adults[['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']]
categorical_only = adults[['workclass', 'education', 'marital-status', 'occupation', 'relationship']]
labels = adults['label']

In [111]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.4226570129394531


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,31.5 <= age <= 90 AND 8.5 <= education-num <= 16.0,0.87,0.64
>50K,8.5 <= education-num <= 16.0 AND 37 <= age <= 90,0.72,0.62
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,31.5 <= age <= 90,0.9,0.67
>50K,37 <= age <= 90,0.75,0.65
>50K,8.5 <= education-num <= 16.0,0.97,0.73


In [112]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.7917511463165283


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,31.5 <= age <= 90,0.9,0.67
>50K,8.5 <= education-num <= 16.0,0.97,0.73
>50K,8.5 <= education-num <= 16.0 OR 37 <= age <= 90,1.0,0.74
>50K,37 <= age <= 90,0.75,0.65


In [113]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.2548182010650635


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,occupation != Prof-specialty AND occupation != Exec-managerial,0.82,0.17
<=50K,occupation != Prof-specialty,0.91,0.21
<=50K,occupation != Exec-managerial,0.91,0.21
>50K,marital-status == Married-civ-spouse,0.84,0.55


In [114]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.3261420726776123


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,occupation != Exec-managerial,0.91,0.21
<=50K,occupation != Exec-managerial OR occupation != Prof-specialty,1.0,0.24
<=50K,occupation != Prof-specialty,0.91,0.21
>50K,marital-status == Married-civ-spouse,0.84,0.55


### Groupby tests

In [115]:
single_attribute_groupby = adults.groupby('relationship').mean()
multi_attribute_groupby = adults.groupby(['sex', 'relationship']).mean()

In [116]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 3.3704495429992676


Group / Cluster,Explanation,Coverage,Separation Error
Not-in-family,marital-status != Married-civ-spouse AND 20.5 <= age <= 90,0.96,0.47
Not-in-family,marital-status != Married-civ-spouse AND 23.5 <= age <= 90,0.88,0.44
Not-in-family,marital-status != Married-civ-spouse,1.0,0.52
Husband,marital-status == Married-civ-spouse,1.0,0.11
Husband,marital-status == Married-civ-spouse AND sex != Female,1.0,0.01
Husband,marital-status == Married-civ-spouse AND sex == Male,1.0,0.01
Wife,marital-status == Married-civ-spouse AND sex == Female,0.99,0.05
Own-child,marital-status != Married-civ-spouse AND 17.0 <= age <= 31.6,0.82,0.53
Own-child,17 <= age <= 31 AND marital-status == Never-married,0.78,0.49
Own-child,marital-status != Married-civ-spouse,0.98,0.72


In [117]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 3.6502394676208496


Group / Cluster,Explanation,Coverage,Separation Error
Not-in-family,marital-status != Married-civ-spouse,1.0,0.52
Husband,marital-status == Married-civ-spouse,1.0,0.11
Own-child,marital-status != Married-civ-spouse,0.98,0.72
Own-child,17 <= age <= 28 OR marital-status == Never-married,0.92,0.64
Own-child,marital-status != Married-civ-spouse OR 17.0 <= age <= 41.33333333333333,1.0,0.8
Own-child,17 <= age <= 31 OR marital-status == Never-married,0.93,0.68
Own-child,17 <= age <= 28,0.76,0.56
Own-child,marital-status != Married-civ-spouse OR 17 <= age <= 28,0.99,0.74
Own-child,marital-status == Never-married,0.89,0.59
Own-child,marital-status == Never-married OR 17.0 <= age <= 31.6,0.93,0.68


In [118]:
start_time = time.time()
conj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Multi attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 7.376740217208862


Group / Cluster,Explanation,Coverage,Separation Error
"('Male', 'Not-in-family')",marital-status != Married-civ-spouse AND 26.5 <= age <= 90,0.78,0.69
"('Male', 'Not-in-family')",marital-status != Married-civ-spouse AND 20.5 <= age <= 90,0.97,0.71
"('Male', 'Not-in-family')",marital-status != Married-civ-spouse,1.0,0.74
"('Male', 'Not-in-family')",marital-status != Married-civ-spouse AND 23.5 <= age <= 90,0.89,0.7
"('Male', 'Husband')",marital-status == Married-civ-spouse AND 39.5 <= hours-per-week <= 99,0.88,0.08
"('Male', 'Husband')",marital-status == Married-civ-spouse AND 40 <= hours-per-week <= 99,0.88,0.08
"('Male', 'Husband')",marital-status == Married-civ-spouse AND occupation != Adm-clerical,0.95,0.09
"('Male', 'Husband')",marital-status == Married-civ-spouse,1.0,0.11
"('Male', 'Husband')",marital-status == Married-civ-spouse AND 38.5 <= hours-per-week <= 99,0.88,0.08
"('Female', 'Not-in-family')",marital-status != Married-civ-spouse AND 26.5 <= age <= 90,0.79,0.74


In [119]:
start_time = time.time()
disj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Multi attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 7.442141771316528


Group / Cluster,Explanation,Coverage,Separation Error
"('Male', 'Not-in-family')",marital-status != Married-civ-spouse,1.0,0.74
"('Male', 'Husband')",marital-status == Married-civ-spouse,1.0,0.11
"('Female', 'Not-in-family')",marital-status != Married-civ-spouse,1.0,0.78
"('Female', 'Own-child')",17.0 <= age <= 29.166666666666664,0.79,0.8
"('Female', 'Own-child')",17 <= age <= 26,0.7,0.76
"('Male', 'Own-child')",marital-status == Never-married,0.88,0.77
"('Male', 'Own-child')",17 <= age <= 28 OR marital-status == Never-married,0.91,0.8
"('Male', 'Own-child')",17 <= age <= 28,0.75,0.76
"('Female', 'Other-relative')",No explanation found,,
"('Female', 'Unmarried')",No explanation found,,


### Clustering test

#### 3 clusters

In [120]:
cluster_labels = KMeans(n_clusters=3).fit_predict(pd.get_dummies(adults))
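
Because the `KMeans` call above does not fix a seed, the cluster assignments, and therefore the explanations and metrics reported for the clustering tests, may differ from run to run. The sketch below shows one way to make the clustering repeatable with the `random_state` and `n_init` parameters of scikit-learn's `KMeans`. The synthetic data and the seed values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, well-separated 2D blobs (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# Fixing random_state makes the centroid initialization, and hence the labels, repeatable
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print(np.array_equal(labels_a, labels_b))
```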

In [121]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.1290814876556396


Group / Cluster,Explanation,Coverage,Separation Error
0,291277.5 <= fnlwgt <= 1490400,1.0,0.0
1,13769 <= fnlwgt <= 149278.5,1.0,0.0
2,149278.5 <= fnlwgt <= 291277.5,1.0,0.0


In [122]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 2.4612245559692383


Group / Cluster,Explanation,Coverage,Separation Error
0,291277.5 <= fnlwgt <= 1490400,1.0,0.0
1,13769 <= fnlwgt <= 149278.5,1.0,0.0
2,149278.5 <= fnlwgt <= 291277.5,1.0,0.0


#### 15 clusters

In [123]:
cluster_labels = KMeans(n_clusters=15).fit_predict(pd.get_dummies(adults))

In [124]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 17.394453048706055


Group / Cluster,Explanation,Coverage,Separation Error
1,175034 <= fnlwgt <= 223242,1.0,0.38
1,175034 <= fnlwgt <= 223242 AND 8.5 <= education-num <= 16.0,0.87,0.37
1,175034 <= fnlwgt <= 223242 AND 8.5 <= education-num <= 13.5,0.79,0.36
1,175034 <= fnlwgt <= 202206,0.72,0.33
2,126259.0 <= fnlwgt <= 156449.5,1.0,0.0
4,281883.0 <= fnlwgt <= 371408,1.0,0.43
4,281883.0 <= fnlwgt <= 371408 AND 8.5 <= education-num <= 13.5,0.8,0.42
5,19302 <= fnlwgt <= 58238.0,1.0,0.0
6,202206 <= fnlwgt <= 253370,1.0,0.39
6,202206 <= fnlwgt <= 253370 AND 1 <= education-num <= 10,0.71,0.38


In [125]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 19.559987545013428


Group / Cluster,Explanation,Coverage,Separation Error
1,175034 <= fnlwgt <= 223242,1.0,0.38
1,175034 <= fnlwgt <= 202206,0.72,0.33
2,126259.0 <= fnlwgt <= 156449.5,1.0,0.0
4,281883.0 <= fnlwgt <= 371408,1.0,0.43
5,19302 <= fnlwgt <= 58238.0,1.0,0.0
6,202206 <= fnlwgt <= 253370,1.0,0.39
10,254039.33333333334 <= fnlwgt <= 312723.6666666667,0.7,0.49
10,230565.59999999998 <= fnlwgt <= 300986.8,1.0,0.5
11,323591.5 <= fnlwgt <= 371408,1.0,0.0
12,156449.5 <= fnlwgt <= 183996.5,1.0,0.0


### Summary of results

In [126]:
conj_averages = adults_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = adults_comparison_df.loc[:, 'disjunctive', :].mean()
adults_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
adults_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages
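
The per-type averaging above can also be written with named index levels. The following sketch uses `DataFrame.xs` and `groupby(level=...)` on a small hypothetical comparison table; the values are made up and only the index layout mirrors the comparison DataFrame used in this notebook.

```python
import pandas as pd

# Hypothetical comparison table indexed by (Test, Explanation Type)
toy = pd.DataFrame(
    {"Average Coverage": [0.8, 0.9, 0.6, 1.0], "Runtime": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [
            ("t1", "conjunctive"),
            ("t1", "disjunctive"),
            ("t2", "conjunctive"),
            ("t2", "disjunctive"),
        ],
        names=["Test", "Explanation Type"],
    ),
)

# Select one explanation type by level name, then average over the tests
conj_avg = toy.xs("conjunctive", level="Explanation Type").mean()
disj_avg = toy.xs("disjunctive", level="Explanation Type").mean()

# Or compute both averages in one step
by_type = toy.groupby(level="Explanation Type").mean()
print(by_type)
```

One thing to keep in mind with either approach: once the `'Average'` rows have been written into the comparison DataFrame, re-running the averaging cell would include those rows in the new mean, so the averages are best computed once per fresh table.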

In [127]:
adults_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Full dataset,conjunctive,0.88875,0.40625,0.0,1.25,8.0,1.780376672744751
Full dataset,disjunctive,0.9357142857142856,0.4771428571428571,0.0,1.1428571428571428,7.0,1.0024807453155518
Numeric only,conjunctive,0.8700000000000001,0.476,0.0,1.3,10.0,1.4226570129394531
Numeric only,disjunctive,0.92,0.51375,0.0,1.125,8.0,1.7917511463165283
Categorical only,conjunctive,0.87,0.285,0.0,1.25,4.0,0.2548182010650635
Categorical only,disjunctive,0.915,0.3025,0.0,1.25,4.0,0.3261420726776123
Single attribute groupby,conjunctive,0.8850000000000001,0.4996428571428571,16.666666666666664,1.8214285714285716,28.0,3.370449542999268
Single attribute groupby,disjunctive,0.94,0.604,50.0,1.5,10.0,3.6502394676208496
Multi attribute groupby,conjunctive,0.8454285714285714,0.6571428571428571,40.0,1.8,35.0,7.376740217208862
Multi attribute groupby,disjunctive,0.87875,0.69,50.0,1.125,8.0,7.442141771316528


## Spotify dataset

In [128]:
spotify = pd.read_csv(r"..\Examples\Datasets\spotify_all.csv")
spotify

Unnamed: 0,acousticness,artists,danceability,energy,explicit,id,instrumentalness,key,liveness,loudness,...,name,popularity,speechiness,tempo,valence,year,decade,popularity_score,main_artist,duration_minutes
0,0.991,['Mamie Smith'],0.598,0.2239999999999999,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,...,Keep A Song In Your Soul,12,0.0936,149.976,0.634,1920,1920,10,Mamie Smith,2.80555
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,...,I Put A Spell On You,7,0.0534,86.889,0.95,1920,1920,0,Screamin' Jay Hawkins,2.5033333333333334
2,0.993,['Mamie Smith'],0.647,0.186,0,11m7laMUgmOKqI3oYzuhne,1.76e-05,0,0.519,-12.098,...,Golfing Papa,4,0.174,97.6,0.6890000000000001,1920,1920,0,Mamie Smith,2.73045
3,0.000173,['Oscar Velazquez'],0.73,0.7979999999999999,0,19Lc5SfJJ5O1oaxY0fpwfh,0.8009999999999999,2,0.128,-7.311,...,True House Music - Xavier Santos & Carlos Gomix Remix,17,0.0425,127.997,0.0422,1920,1920,10,Oscar Velazquez,7.034783333333333
4,0.295,['Mixe'],0.7040000000000001,0.7070000000000001,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.0360000000000005,...,Xuniverxe,2,0.0768,122.076,0.299,1920,1920,0,Mixe,2.7537333333333334
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174384,0.00917,"['DJ Combo', 'Sander-7', 'Tony T']",0.792,0.866,0,46LhBf6TvYjZU2SMvGZAbn,5.99e-05,6,0.178,-5.089,...,The One,0,0.0356,125.972,0.186,2020,2020,0,DJ Combo,2.46025
174385,0.795,['Alessia Cara'],0.429,0.211,0,7tue2Wemjd0FZzRtDrQFZd,0.0,4,0.196,-11.665,...,A Little More,0,0.036,94.71,0.228,2021,2020,0,Alessia Cara,2.412
174386,0.8059999999999999,['Roger Fly'],0.6709999999999999,0.589,0,48Qj61hOdYmUCFJbpQ29Ob,0.92,4,0.113,-12.393,...,Together,0,0.0282,108.058,0.7140000000000001,2020,2020,0,Roger Fly,3.635783333333333
174387,0.92,['Taylor Swift'],0.462,0.24,1,1gcyHQpBQ1lfXGdhZmWrHP,0.0,0,0.113,-12.077,...,champagne problems,69,0.0377,171.31900000000005,0.32,2021,2020,60,Taylor Swift,4.066666666666666


In [129]:
spotify_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [130]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels='explicit', explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.2431085109710693


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.194 AND 0 <= popularity <= 46.5,0.75,0.01
0,0.0 <= speechiness <= 0.3233333333333333 AND 0 <= popularity <= 46.5,0.78,0.02
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= speechiness <= 0.485,0.97,0.06
1,No explanation found,,


In [131]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels='explicit', explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 2.6184418201446533


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0 <= popularity <= 46.5 OR 0.0 <= speechiness <= 0.16166666666666665,0.99,0.05
0,0.0 <= speechiness <= 0.2425 OR 0.0 <= popularity <= 44.0,1.0,0.06
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= popularity <= 44.0 OR 0.0 <= speechiness <= 0.16166666666666665,0.99,0.05
0,0.0 <= popularity <= 44.0 OR 0.0 <= speechiness <= 0.194,0.99,0.05
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.2425 OR 0 <= popularity <= 46.5,1.0,0.06
1,No explanation found,,


### Numeric only and categorical only tests

In [132]:
numeric_only = spotify[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']]
categorical_only = spotify[['name', 'main_artist', 'year']]
labels = spotify['explicit']

In [133]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.027392625808716


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.485,0.97,0.06
1,No explanation found,,


In [134]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.8733229637145996


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.485 OR 0.0 <= instrumentalness <= 0.4995,1.0,0.07
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= instrumentalness <= 0.333 OR 0.0 <= speechiness <= 0.3233333333333333,1.0,0.07
0,0.0 <= instrumentalness <= 0.24975 OR 0.0 <= speechiness <= 0.485,1.0,0.07
0,0.0 <= instrumentalness <= 0.333 OR 0.0 <= speechiness <= 0.485,1.0,0.07
0,0.0 <= speechiness <= 0.3233333333333333 OR 0.0 <= instrumentalness <= 0.4995,1.0,0.07
1,No explanation found,,


In [135]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.7465677261352539


Group / Cluster,Explanation,Coverage,Separation Error
0,name == infrequent_sklearn,1.0,0.07
0,main_artist == artist_infrequent_sklearn,1.0,0.07
1,No explanation found,,


In [136]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.8602464199066162


Group / Cluster,Explanation,Coverage,Separation Error
0,name == infrequent_sklearn,1.0,0.07
0,main_artist == artist_infrequent_sklearn,1.0,0.07
1,No explanation found,,


### Groupby tests

In [137]:
single_attribute_groupby = spotify.groupby('mode').mean()

In [138]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.795407295227051


Group / Cluster,Explanation,Coverage,Separation Error
1,0.07324999943375587 <= acousticness <= 0.996,0.78,0.28
1,0.0 <= speechiness <= 0.3233333333333333 AND 0.07324999943375587 <= acousticness <= 0.996,0.73,0.27
1,0.0 <= speechiness <= 0.485,0.95,0.3
0,No explanation found,,


In [139]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 3.223463296890259


Group / Cluster,Explanation,Coverage,Separation Error
1,0.0 <= speechiness <= 0.485 OR 0.07324999943375587 <= acousticness <= 0.996,1.0,0.3
1,0.07324999943375587 <= acousticness <= 0.996,0.78,0.28
1,0.0 <= speechiness <= 0.485,0.95,0.3
0,No explanation found,,


### Clustering test

#### 3 clusters

In [140]:
# The Spotify dataset is too large to one-hot encode every column with get_dummies
# (attempting it raised a MemoryError: unable to allocate 22.3 GiB), so we cluster on the numeric audio features only
cols = ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']

In [141]:
cluster_labels = KMeans(n_clusters=3).fit_predict(spotify[cols])

In [142]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 3.391819953918457


Group / Cluster,Explanation,Coverage,Separation Error
0,147.00550079345703 <= tempo <= 219.446,1.0,0.0
1,0.0 <= tempo <= 106.19950103759766,0.99,0.0
1,0.0 <= tempo <= 109.723,1.0,0.08
1,0.0 <= tempo <= 106.32799911499023,0.99,0.0
2,106.73249816894531 <= tempo <= 147.00550079345703,1.0,0.0


In [143]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 3.8467187881469727


Group / Cluster,Explanation,Coverage,Separation Error
0,147.00550079345703 <= tempo <= 219.446,1.0,0.0
1,0.0 <= tempo <= 109.723,1.0,0.08
1,0.0 <= tempo <= 106.32799911499023,0.99,0.0
1,0.0 <= tempo <= 106.19950103759766,0.99,0.0
2,106.73249816894531 <= tempo <= 147.00550079345703,1.0,0.0


#### 15 clusters

In [144]:
cluster_labels = KMeans(n_clusters=15).fit_predict(spotify[cols])

In [145]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 12.207214593887329


Group / Cluster,Explanation,Coverage,Separation Error
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -13.771 AND 0.05980000086128712 <= acousticness <= 0.996,0.99,0.26
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -15.783,0.91,0.11
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -15.783 AND 0.05980000086128712 <= acousticness <= 0.996,0.9,0.1
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -14.898 AND 0.058150000870227814 <= acousticness <= 0.996,0.96,0.17
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -15.783 AND 0.058150000870227814 <= acousticness <= 0.996,0.9,0.1
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -14.898 AND 0.05980000086128712 <= acousticness <= 0.996,0.96,0.17
2,70.283 <= tempo <= 90.657 AND -43.738 <= loudness <= -15.783,0.82,0.05
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -13.771,1.0,0.27
2,70.283 <= tempo <= 91.1602 AND -43.738 <= loudness <= -15.783,0.84,0.06
2,70.283 <= tempo <= 94.278 AND -43.738 <= loudness <= -14.898,0.97,0.18


In [146]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 10.725688457489014


Group / Cluster,Explanation,Coverage,Separation Error
3,113.077 <= tempo <= 125.01,0.92,0.21
4,120.076 <= tempo <= 136.47,1.0,0.3
5,86.92000198364258 <= tempo <= 100.5,1.0,0.22
5,86.92000198364258 <= tempo <= 100.5620002746582,1.0,0.22
5,87.855 <= tempo <= 100.062,0.93,0.19
6,160.1125030517578 <= tempo <= 174.669,0.98,0.01
6,157.27133333333336 <= tempo <= 174.66900000000004,1.0,0.18
9,100.5 <= tempo <= 112.34099960327148,0.98,0.26
9,100.5359992980957 <= tempo <= 112.48799896240234,0.99,0.27
10,132.9146 <= tempo <= 153.79180000000002,1.0,0.3

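The warning printed in the 15-cluster runs says the labels were pruned to the 10 most common ones. The sketch below shows what such a pruning step could look like; it is an illustration only, not pd_explain's implementation, and the helper name `prune_to_top_labels` and its `max_labels` parameter are hypothetical.

```python
from collections import Counter


def prune_to_top_labels(labels, max_labels=10):
    """Return the set of labels to keep: the max_labels most frequent ones."""
    counts = Counter(labels)
    if len(counts) <= max_labels:
        return set(counts)
    return {label for label, _ in counts.most_common(max_labels)}


# Toy labels: cluster 0 has 5 members, cluster 1 has 4, and so on.
labels = [0] * 5 + [1] * 4 + [2] * 3 + [3] * 1
kept = prune_to_top_labels(labels, max_labels=2)
print(kept)
```

With 15 clusters and a limit of 10, the five smallest clusters would be dropped before any explanation is generated, so they cannot appear in either result table.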

### Summary of results

In [147]:
conj_averages = spotify_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = spotify_comparison_df.loc[:, 'disjunctive', :].mean()
spotify_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
spotify_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages

In [148]:
spotify_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Full dataset,conjunctive,0.8833333333333333,0.035,50.0,1.3333333333333333,6.0,2.2431085109710693
Full dataset,disjunctive,0.9711111111111111,0.0499999999999999,50.0,1.5555555555555556,9.0,2.6184418201446533
Numeric only,conjunctive,0.9425,0.045,50.0,1.0,4.0,2.027392625808716
Numeric only,disjunctive,0.9744444444444444,0.0588888888888888,50.0,1.5555555555555556,9.0,1.8733229637146
Categorical only,conjunctive,1.0,0.07,50.0,1.0,2.0,0.7465677261352539
Categorical only,disjunctive,1.0,0.07,50.0,1.0,2.0,0.8602464199066162
Single attribute groupby,conjunctive,0.82,0.2833333333333334,50.0,1.3333333333333333,3.0,2.795407295227051
Single attribute groupby,disjunctive,0.91,0.2933333333333334,50.0,1.3333333333333333,3.0,3.223463296890259
3 clusters,conjunctive,0.996,0.016,0.0,1.0,5.0,3.391819953918457
3 clusters,disjunctive,0.996,0.016,0.0,1.0,5.0,3.8467187881469727


## House Prices dataset

In [7]:
houses = pd.read_csv(r"..\Examples\Datasets\houses.csv")
houses = houses[['MSSubClass', 'LotArea', 'Street', 'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'HouseStyle',
'OverallQual', 'OverallCond', 'YearBuilt', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'ExterQual', 'Foundation', 'BsmtQual',
'CentralAir', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'PoolArea', 'YrSold', 'SalePrice']]

In [8]:
houses_comparison_df = pd.DataFrame(columns=columns, index=index)

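Every test cell in this notebook repeats the same start/stop timing lines. A small helper could factor that out; the sketch below is one possible shape, with the hypothetical name `timed`, and it uses `time.perf_counter()` instead of `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; any callable can be timed the same way.
result, elapsed = timed(sorted, [3, 1, 2])
print(f"Runtime: {elapsed}")
```

In this notebook the call would look like `conj_res, conj_time = timed(houses.explain, explainer='many to one', labels='HouseStyle', explanation_form='conj')`.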
### Full dataset test

In [151]:
start_time = time.time()
conj_res = houses.explain(explainer='many to one', labels='HouseStyle', explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 5.878854990005493


Group / Cluster,Explanation,Coverage,Separation Error
2Story,600 <= 2ndFlrSF <= 2065,0.91,0.17
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
2Story,545.5 <= 2ndFlrSF <= 2065,0.97,0.2
2Story,200.0 <= 2ndFlrSF <= 2065,1.0,0.29
2Story,545.5 <= 2ndFlrSF <= 2065 AND 60 <= MSSubClass <= 190,0.96,0.09
2Story,200.0 <= 2ndFlrSF <= 2065 AND 60 <= MSSubClass <= 190,0.99,0.1
2Story,671.0 <= 2ndFlrSF <= 2065 AND 60 <= MSSubClass <= 190,0.85,0.07
2Story,671.0 <= 2ndFlrSF <= 2065,0.85,0.14
2Story,545.5 <= 2ndFlrSF <= 2065 AND 55.0 <= MSSubClass <= 190,0.96,0.09
2Story,671.0 <= 2ndFlrSF <= 2065 AND 55.0 <= MSSubClass <= 190,0.85,0.07


In [152]:
start_time = time.time()
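# Note: bin_numeric=True is passed only in this disjunctive run; the conjunctive run above does not set it.
disj_res = houses.explain(explainer='many to one', labels='HouseStyle', explanation_form='disj', bin_numeric=True)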
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 3.243934392929077


Group / Cluster,Explanation,Coverage,Separation Error
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
2Story,600 <= 2ndFlrSF <= 2065,0.91,0.17
2Story,671.0 <= 2ndFlrSF <= 2065,0.85,0.14
2Story,545.5 <= 2ndFlrSF <= 2065 OR 55.0 <= MSSubClass <= 72.5,0.98,0.2
2Story,200.0 <= 2ndFlrSF <= 2065,1.0,0.29
2Story,55.0 <= MSSubClass <= 72.5 OR 671.0 <= 2ndFlrSF <= 2065,0.91,0.13
2Story,600 <= 2ndFlrSF <= 2065 OR 55.0 <= MSSubClass <= 72.5,0.95,0.17
2Story,545.5 <= 2ndFlrSF <= 2065,0.97,0.2
1Story,20 <= MSSubClass <= 30,0.83,0.01
1Story,0 <= 2ndFlrSF <= 0,1.0,0.13


### Numeric only and categorical only tests

In [7]:
numeric_only = houses[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'PoolArea', 'SalePrice']]
categorical_only = houses[['MSSubClass', 'Street', 'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'ExterQual', 'Foundation', 'BsmtQual', 'CentralAir', 'YrSold']]
labels = houses['HouseStyle']

In [154]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 5.5691399574279785


Group / Cluster,Explanation,Coverage,Separation Error
2Story,545.5 <= 2ndFlrSF <= 2065 AND 1941.0 <= YearBuilt <= 2010.0,0.82,0.12
2Story,545.5 <= 2ndFlrSF <= 2065 AND 1968.5 <= YearBuilt <= 2010,0.75,0.06
2Story,600 <= 2ndFlrSF <= 2065,0.91,0.17
2Story,671.0 <= 2ndFlrSF <= 2065 AND 334.0 <= GrLivArea <= 2988.0,0.83,0.13
2Story,671.0 <= 2ndFlrSF <= 2065 AND 1964.0 <= YearBuilt <= 2010.0,0.7,0.04
2Story,545.5 <= 2ndFlrSF <= 2065,0.97,0.2
2Story,200.0 <= 2ndFlrSF <= 2065,1.0,0.29
2Story,200.0 <= 2ndFlrSF <= 2065 AND 1964.0 <= YearBuilt <= 2010.0,0.81,0.08
2Story,545.5 <= 2ndFlrSF <= 2065 AND 334.0 <= GrLivArea <= 2988.0,0.95,0.19
2Story,671.0 <= 2ndFlrSF <= 2065,0.85,0.14


In [8]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

NameError: name 'numeric_only' is not defined

In [156]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.4885356426239014


Group / Cluster,Explanation,Coverage,Separation Error
2Story,50 <= MSSubClass <= 70,0.81,0.28
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
1Story,MSSubClass == 20,0.74,0.0
1Story,20 <= MSSubClass <= 42.5,0.83,0.01
1Story,20 <= MSSubClass <= 30,0.83,0.01
1Story,20 <= MSSubClass <= 25.0,0.74,0.0
1Story,MSSubClass == 20,0.74,0.0
1.5Fin,MSSubClass == 50,0.92,0.02
1.5Unf,MSSubClass == 45,0.86,0.0
1.5Unf,42.5 <= MSSubClass <= 47.5,0.86,0.0


In [157]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.4473304748535156


Group / Cluster,Explanation,Coverage,Separation Error
2Story,55.0 <= MSSubClass <= 72.5,0.8,0.01
2Story,50 <= MSSubClass <= 70,0.81,0.28
1Story,20 <= MSSubClass <= 20,0.74,0.0
1Story,20 <= MSSubClass <= 25.0,0.74,0.0
1Story,20 <= MSSubClass <= 42.5,0.83,0.01
1Story,20 <= MSSubClass <= 30,0.83,0.01
1.5Fin,42.5 <= MSSubClass <= 55.0,0.92,0.1
1.5Unf,42.5 <= MSSubClass <= 47.5,0.86,0.0
2.5Fin,No explanation found,,
2.5Unf,No explanation found,,


### Groupby tests

In [14]:
houses

Unnamed: 0,MSSubClass,LotArea,Street,LotConfig,LandSlope,Neighborhood,BldgType,HouseStyle,OverallQual,OverallCond,...,CentralAir,1stFlrSF,2ndFlrSF,GrLivArea,FullBath,TotRmsAbvGrd,GarageCars,PoolArea,YrSold,SalePrice
0,60,8450,Pave,Inside,Gtl,CollgCr,1Fam,2Story,7,5,...,Y,856,854,1710,2,8,2,0,2008,208500
1,20,9600,Pave,FR2,Gtl,Veenker,1Fam,1Story,6,8,...,Y,1262,0,1262,2,6,2,0,2007,181500
2,60,11250,Pave,Inside,Gtl,CollgCr,1Fam,2Story,7,5,...,Y,920,866,1786,2,6,2,0,2008,223500
3,70,9550,Pave,Corner,Gtl,Crawfor,1Fam,2Story,7,5,...,Y,961,756,1717,1,7,3,0,2006,140000
4,60,14260,Pave,FR2,Gtl,NoRidge,1Fam,2Story,8,5,...,Y,1145,1053,2198,2,9,3,0,2008,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,7917,Pave,Inside,Gtl,Gilbert,1Fam,2Story,6,5,...,Y,953,694,1647,2,7,2,0,2007,175000
1456,20,13175,Pave,Inside,Gtl,NWAmes,1Fam,1Story,6,6,...,Y,2073,0,2073,2,7,2,0,2010,210000
1457,70,9042,Pave,Inside,Gtl,Crawfor,1Fam,2Story,7,9,...,Y,1188,1152,2340,2,9,1,0,2010,266500
1458,20,9717,Pave,Inside,Gtl,NAmes,1Fam,1Story,5,6,...,Y,1078,0,1078,1,5,1,0,2010,142125


In [34]:
single_attribute_groupby = houses.groupby('ExterQual').mean()
multi_attribute_groupby = houses.groupby(['OverallQual', 'ExterQual']).mean()

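The `houses` frame mixes numeric and string columns. In plain pandas 2.0 and later, `groupby(...).mean()` on such a frame raises a `TypeError` unless `numeric_only=True` is passed; whether pd_explain's DataFrame wrapper handles this differently is not shown in this notebook. The sketch below uses a toy frame with illustrative values to show the plain-pandas form.

```python
import pandas as pd

# Toy frame with one string column and two numeric columns.
toy = pd.DataFrame(
    {
        "ExterQual": ["Gd", "TA", "Gd", "TA"],
        "Street": ["Pave", "Pave", "Grvl", "Pave"],
        "YearBuilt": [2003, 1976, 2001, 1915],
        "SalePrice": [208500, 181500, 223500, 140000],
    }
)

# numeric_only=True drops "Street" instead of trying to average strings.
means = toy.groupby("ExterQual").mean(numeric_only=True)
print(means)
```

The result keeps one row per `ExterQual` value and only the numeric columns.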
In [10]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.1962735652923584


Group / Cluster,Explanation,Coverage,Separation Error
Gd,1980.5 <= YearBuilt <= 2010 AND 6.4 <= OverallQual <= 8.2,0.72,0.12
Gd,1980.5 <= YearBuilt <= 2010 AND 7.0 <= OverallQual <= 8.5,0.72,0.12
Gd,1982.4 <= YearBuilt <= 2010.0,0.87,0.28
Gd,1964.0 <= YearBuilt <= 2010.0 AND 6.4 <= OverallQual <= 8.2,0.74,0.18
Gd,7.0 <= OverallQual <= 8.5,0.76,0.24
Gd,1987.0 <= YearBuilt <= 2010.0 AND 5.5 <= OverallQual <= 10.0,0.85,0.24
Gd,1982.4 <= YearBuilt <= 2010.0 AND 6.4 <= OverallQual <= 8.2,0.72,0.12
Gd,1980.5 <= YearBuilt <= 2010 AND 5.5 <= OverallQual <= 10.0 AND 34900.0 <= SalePrice <= 394950.0,0.85,0.23
Gd,1980.5 <= YearBuilt <= 2010 AND 5.5 <= OverallQual <= 10.0,0.88,0.25
Gd,1982.4 <= YearBuilt <= 2010.0 AND 6.5 <= OverallQual <= 10,0.76,0.19


In [18]:
start_time = time.time()
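# Note: max_labels=50 is passed only in this disjunctive run; the conjunctive run above uses the default.
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj', max_labels=50)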
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.9243829250335693


Group / Cluster,Explanation,Coverage,Separation Error
Gd,1982.4 <= YearBuilt <= 2010.0,0.87,0.28
Gd,6.4 <= OverallQual <= 8.2,0.76,0.24
Gd,1980.5 <= YearBuilt <= 2010,0.88,0.29
Gd,1984.5 <= YearBuilt <= 2010,0.86,0.27
Gd,7.0 <= OverallQual <= 8.5,0.76,0.24
TA,1872 <= YearBuilt <= 1980.5 OR 1 <= OverallQual <= 6,0.96,0.14
TA,1872 <= YearBuilt <= 1980.5 OR 1 <= OverallQual <= 6.5,0.96,0.14
TA,1 <= OverallQual <= 6,0.89,0.12
TA,1 <= OverallQual <= 6.5,0.89,0.12
TA,4.0 <= OverallQual <= 7.0 OR 1872 <= YearBuilt <= 1973,1.0,0.27


In [35]:
start_time = time.time()
conj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, 'Multi attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 3.0623204708099365


Group / Cluster,Explanation,Coverage,Separation Error
"(4, 'TA')",No explanation found,,
"(5, 'Gd')",No explanation found,,
"(5, 'TA')",No explanation found,,
"(6, 'Gd')",No explanation found,,
"(6, 'TA')",No explanation found,,
"(7, 'Gd')",No explanation found,,
"(7, 'TA')",No explanation found,,
"(8, 'Gd')",No explanation found,,
"(9, 'Ex')",No explanation found,,
"(9, 'Gd')",No explanation found,,


In [36]:
start_time = time.time()
disj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, 'Multi attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 2.44771671295166


Group / Cluster,Explanation,Coverage,Separation Error
"(4, 'TA')",No explanation found,,
"(5, 'Gd')",No explanation found,,
"(5, 'TA')",No explanation found,,
"(6, 'Gd')",No explanation found,,
"(6, 'TA')",No explanation found,,
"(7, 'Gd')",No explanation found,,
"(7, 'TA')",No explanation found,,
"(8, 'Gd')",No explanation found,,
"(9, 'Ex')",No explanation found,,
"(9, 'Gd')",No explanation found,,


### Clustering test

#### 3 clusters

In [37]:
labels = KMeans(n_clusters=3).fit_predict(pd.get_dummies(houses))


In [38]:
start_time = time.time()
conj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.9508733749389648


Group / Cluster,Explanation,Coverage,Separation Error
0,293538.5 <= SalePrice <= 755000,1.0,0.0
1,34900 <= SalePrice <= 172200.0,1.0,0.0
2,172200.0 <= SalePrice <= 293538.5,1.0,0.0


In [39]:
start_time = time.time()
disj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.61720871925354


Group / Cluster,Explanation,Coverage,Separation Error
0,293538.5 <= SalePrice <= 755000,1.0,0.0
1,34900 <= SalePrice <= 172200.0,1.0,0.0
2,172200.0 <= SalePrice <= 293538.5,1.0,0.0


#### 15 clusters

In [40]:
labels = KMeans(n_clusters=15).fit_predict(pd.get_dummies(houses))

In [41]:
start_time = time.time()
conj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(houses_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Conjunctive explanation runtime: 5.602252721786499


Group / Cluster,Explanation,Coverage,Separation Error
0,301865.3333333334 <= SalePrice <= 358581.6666666667,0.95,0.0
1,148650.0 <= SalePrice <= 170500.0,0.99,0.0
3,75000 <= SalePrice <= 102388.0,1.0,0.0
6,102388.0 <= SalePrice <= 126337.5,1.0,0.0
7,245149.0 <= SalePrice <= 301865.3333333334 AND 8996.5 <= LotArea <= 70761,0.9,0.22
7,245149.0 <= SalePrice <= 301865.3333333334,1.0,0.24
11,358581.6666666667 <= SalePrice <= 415298.0,1.0,0.0
11,356550.0 <= SalePrice <= 415298,1.0,0.0
13,170500.0 <= SalePrice <= 195200.0,1.0,0.0
14,126337.5 <= SalePrice <= 148650.0,1.0,0.0


In [42]:
start_time = time.time()
disj_res = houses.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(houses_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels.

Disjunctive explanation runtime: 4.3408918380737305


Group / Cluster,Explanation,Coverage,Separation Error
0,301865.3333333334 <= SalePrice <= 358581.6666666667,0.95,0.0
1,148650.0 <= SalePrice <= 170500.0,0.99,0.0
3,75000 <= SalePrice <= 102388.0,1.0,0.0
6,102388.0 <= SalePrice <= 126337.5,1.0,0.0
7,245149.0 <= SalePrice <= 301865.3333333334,1.0,0.24
11,356550.0 <= SalePrice <= 415298,1.0,0.0
11,358581.6666666667 <= SalePrice <= 415298.0,1.0,0.0
13,170500.0 <= SalePrice <= 195200.0,1.0,0.0
14,126337.5 <= SalePrice <= 148650.0,1.0,0.0
2,No explanation found,,


### Summary of results

In [43]:
conj_averages = houses_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = houses_comparison_df.loc[:, 'disjunctive', :].mean()
houses_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
houses_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages

In [44]:
houses_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Average Explanation Length,Number of Explanations Found,Runtime
Single attribute groupby,conjunctive,0.805278,0.187222,50.0,1.861111,36.0,1.196274
Single attribute groupby,disjunctive,0.900667,0.2,50.0,1.333333,15.0,0.924383
Multi attribute groupby,conjunctive,,,100.0,,0.0,3.06232
Multi attribute groupby,disjunctive,,,100.0,,0.0,2.447717
3 clusters,conjunctive,1.0,0.0,0.0,1.0,3.0,1.950873
3 clusters,disjunctive,1.0,0.0,0.0,1.0,3.0,1.617209
15 clusters,conjunctive,0.984,0.046,20.0,1.1,10.0,5.602253
15 clusters,disjunctive,0.993333,0.026667,20.0,1.0,9.0,4.340892
Average,conjunctive,0.929759,0.077741,42.5,1.32037,12.25,2.95293
Average,disjunctive,0.964667,0.075556,42.5,1.111111,6.75,2.33255

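The `Average` rows are column means over the per-test rows, and pandas skips `NaN` values when averaging. As a result, tests where no explanation was found (the `Multi attribute groupby` rows) contribute to `% Explanation Not Found` and `Runtime` but not to the coverage or separation-error averages. The sketch below reproduces two of the conjunctive averages using values copied from the table above. It selects rows with `DataFrame.xs`, used here as an alternative to the `.loc[:, 'conjunctive', :]` form in the notebook.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("Single attribute groupby", "conjunctive"),
        ("Multi attribute groupby", "conjunctive"),
        ("3 clusters", "conjunctive"),
        ("15 clusters", "conjunctive"),
    ],
    names=["Test", "Explanation Type"],
)
df = pd.DataFrame(
    {
        # NaN marks the test where no explanation was found.
        "Average Coverage": [0.805278, np.nan, 1.0, 0.984],
        "% Explanation Not Found": [50.0, 100.0, 0.0, 20.0],
    },
    index=index,
)

conj = df.xs("conjunctive", level="Explanation Type")
averages = conj.mean()   # NaN entries are skipped by default
n_used = conj.count()    # number of non-NaN values behind each average
print(averages)
print(n_used)
```

The coverage average is therefore taken over three tests, while the not-found percentage is taken over all four.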