# Many to One Explainer - Demo of Conjunctive vs Disjunctive Explanations

This demo compares conjunctive and disjunctive explanations on two datasets (Adults and Spotify), measuring both the quality metrics of the explanations (coverage and separation error) and the runtime of explanation generation.

## Setup

In [2]:
import pandas as pd
import pd_explain
import time
import numpy as np
from sklearn.cluster import KMeans



This comparison uses a very high, practically unrealistic separation error threshold of 0.8.\
The high threshold means fewer explanations are filtered out, which avoids a situation where one form appears better than the other only because most of its results were filtered out.\
You can of course lower this threshold to a more realistic value, such as 0.2 or 0.3, to get results that are more representative of a real-world scenario.
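To get a feel for what a stricter threshold would leave behind, a results table can also be filtered after the fact. The following is a minimal sketch on a made-up table that only mimics the layout of the explainer output; the helper `filter_by_separation_error` is hypothetical and not part of pd_explain, and post-hoc filtering is not guaranteed to match what the explainer returns when it is given the lower threshold directly.

```python
import pandas as pd

# Toy results table mimicking the explainer output layout (values are illustrative).
toy_res = pd.DataFrame(
    {
        "Coverage": [0.99, 0.75, 1.00, 0.84],
        "Separation Error": [0.20, 0.15, 0.76, 0.55],
    },
    index=pd.MultiIndex.from_tuples(
        [
            ("<=50K", "0 <= capital-gain <= 4243.5"),
            ("<=50K", "1 <= education-num <= 10"),
            (">50K", "0 <= capital-gain <= 99999"),
            (">50K", "marital-status == Married-civ-spouse"),
        ],
        names=["Group / Cluster", "Explanation"],
    ),
)


def filter_by_separation_error(res_df, threshold):
    """Hypothetical helper: keep rows whose separation error is at most `threshold`."""
    return res_df[res_df["Separation Error"] <= threshold]


strict = filter_by_separation_error(toy_res, 0.3)
print(strict)
```

With a threshold of 0.3, the `>50K` group in this toy table loses all of its explanations, which is exactly the kind of imbalance the 0.8 threshold is meant to avoid.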

In [3]:
separation_error = 0.8

By default, we use sampling to speed up the explanation generation process.

In [4]:
pd_explain.get_use_sampling_value()

True

This can be changed either per function call or globally.\
If you would like to run this demo without sampling, uncomment and run the cell below prior to running the rest of the notebook.

In [5]:
# pd_explain.toggle_sampling(False)
# pd_explain.get_use_sampling_value()

In [20]:
def build_comparison_df(comparison_df, test_name, explanation_type, res_df, runtime):
    """Add one row of summary metrics for (test_name, explanation_type) to comparison_df, in place."""
    try:
        # A group without an explanation has exactly one row, labelled "No explanation found" on the 'Explanation' index level
        num_groups_not_found = res_df.loc[:, 'No explanation found', :].shape[0]
        # A group with explanations can span several rows, so we count groups via the first index level
        num_groups_total = len(res_df.index.levels[0])
        percentage_not_found = (num_groups_not_found / num_groups_total) * 100
    # Raised when there is no "No explanation found" row, i.e. every group has an explanation
    except KeyError:
        percentage_not_found = 0
    comparison_df.loc[(test_name, explanation_type), :] = [
        np.mean(res_df['Coverage']),
        np.mean(res_df['Separation Error']),
        percentage_not_found,
        runtime,
    ]
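To make the "% Explanation Not Found" computation concrete, here is a small sketch on a toy result table. It assumes the result is indexed by two levels named `Group / Cluster` and `Explanation`, as in the outputs shown in this notebook; the helper `percent_not_found` is illustrative and uses `get_level_values` instead of the `.loc` slicing above.

```python
import numpy as np
import pandas as pd

# Toy result: group 0 has two explanations, group 1 has none (values are made up).
toy_res = pd.DataFrame(
    {"Coverage": [0.94, 0.95, np.nan], "Separation Error": [0.04, 0.05, np.nan]},
    index=pd.MultiIndex.from_tuples(
        [
            (0, "0.0 <= speechiness <= 0.2425"),
            (0, "0.0 <= speechiness <= 0.3233"),
            (1, "No explanation found"),
        ],
        names=["Group / Cluster", "Explanation"],
    ),
)


def percent_not_found(res_df):
    """Illustrative helper: share of groups whose only row is 'No explanation found'."""
    explanations = res_df.index.get_level_values("Explanation")
    groups = res_df.index.get_level_values("Group / Cluster")
    n_missing = int((explanations == "No explanation found").sum())
    return 100 * n_missing / groups.nunique()


pct = percent_not_found(toy_res)
# Series.mean skips NaN, so the placeholder row does not affect the average.
mean_cov = toy_res["Coverage"].mean()
print(pct, mean_cov)
```

Note that the averages are taken over explanation rows, not over groups, so a group with many explanations weighs more than a group with one.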

In [37]:
columns = ['Average Coverage', 'Average Separation Error', '% Explanation Not Found', 'Runtime']
index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['Test', 'Explanation Type'])

## Adults dataset

In [7]:
adults = pd.read_csv(r"..\Examples\Datasets\adult.csv")
adults

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [8]:
adults_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [18]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels='label', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.2051966190338135


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
<=50K,1 <= education-num <= 10,0.75,0.15
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,marital-status == Married-civ-spouse,0.84,0.55
>50K,8.5 <= education-num <= 16.0,0.97,0.73
>50K,marital-status == Married-civ-spouse AND 8.5 <= education-num <= 16.0,0.81,0.52
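The timing cells in this notebook all follow the same start/stop pattern. As a sketch, the pattern could be wrapped in a small helper; `timed` is a hypothetical name, and `time.perf_counter()` is used here because it is a monotonic clock intended for measuring intervals, whereas the notebook itself uses `time.time()`.

```python
import time


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in this notebook the call would be something like adults.explain(...).
result, elapsed = timed(sum, range(1_000_000))
print(f"Runtime: {elapsed:.3f}s")
```

A single run is a noisy measurement, so small runtime differences between the two forms should be read with some caution.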


In [19]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels='label', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.7175784111022949


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,marital-status == Married-civ-spouse,0.84,0.55
>50K,marital-status == Married-civ-spouse OR 8.5 <= education-num <= 16.0,1.0,0.74
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,8.5 <= education-num <= 16.0,0.97,0.73
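The tables above show the typical trade-off: combining two predicates with OR raises coverage but also raises separation error, while AND does the opposite. The sketch below reproduces this on a toy dataset. It assumes that coverage is the share of the group's rows matched by the rule, and that separation error is the share of matched rows that fall outside the group; these definitions are an assumption made for illustration and may differ from how pd_explain computes the metrics.

```python
import pandas as pd

# Toy data (made up): 4 rows in the target group '>50K', 4 rows outside it.
toy = pd.DataFrame(
    {
        "married": [1, 1, 1, 0, 0, 1, 0, 0],
        "edu": [14, 12, 8, 13, 9, 7, 10, 6],
        "label": [">50K", ">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    }
)

in_group = toy["label"] == ">50K"
p1 = toy["married"] == 1  # first predicate
p2 = toy["edu"] >= 9      # second predicate


def coverage(mask):
    """Assumed definition: fraction of the group's rows that satisfy the rule."""
    return (mask & in_group).sum() / in_group.sum()


def separation_error(mask):
    """Assumed definition: fraction of rows satisfying the rule that are outside the group."""
    return (mask & ~in_group).sum() / mask.sum()


for name, mask in [("p1", p1), ("p2", p2), ("p1 AND p2", p1 & p2), ("p1 OR p2", p1 | p2)]:
    print(f"{name:10s} coverage={coverage(mask):.2f} separation_error={separation_error(mask):.2f}")
```

Under these definitions the conjunction can never cover more of the group than either predicate alone, and the disjunction can never cover less, which matches the direction of the differences seen in the results.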


### Numeric only and categorical only tests

In [11]:
numeric_only = adults[['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']]
categorical_only = adults[['workclass', 'education', 'marital-status', 'occupation', 'relationship']]
labels = adults['label']

In [12]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.7494893074035645


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
<=50K,1 <= education-num <= 10,0.75,0.15
>50K,31.5 <= age <= 90 AND 8.5 <= education-num <= 16.0,0.87,0.64
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,8.5 <= education-num <= 16.0 AND 37 <= age <= 90,0.72,0.62
>50K,8.5 <= education-num <= 16.0,0.97,0.73
>50K,31.5 <= age <= 90,0.9,0.67
>50K,37 <= age <= 90,0.75,0.65


In [13]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.990556001663208


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
>50K,31.5 <= age <= 90,0.9,0.67
>50K,37 <= age <= 90,0.75,0.65
>50K,37 <= age <= 90 OR 8.5 <= education-num <= 16.0,1.0,0.74
>50K,0 <= capital-gain <= 99999,1.0,0.76
>50K,8.5 <= education-num <= 16.0,0.97,0.73


In [14]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.15270662307739258


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,occupation != Exec-managerial,0.91,0.21
<=50K,occupation != Prof-specialty,0.91,0.21
<=50K,occupation != Prof-specialty AND occupation != Exec-managerial,0.82,0.17
>50K,marital-status == Married-civ-spouse,0.84,0.55


In [15]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.17460989952087402


Group / Cluster,Explanation,Coverage,Separation Error
<=50K,occupation != Exec-managerial,0.91,0.21
<=50K,occupation != Prof-specialty OR occupation != Exec-managerial,1.0,0.24
<=50K,occupation != Prof-specialty,0.91,0.21
>50K,marital-status == Married-civ-spouse,0.84,0.55


### Groupby tests

In [16]:
single_attribute_groupby = adults.groupby('relationship').mean()
multi_attribute_groupby = adults.groupby(['sex', 'label']).mean()

In [21]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 4.567002058029175


Group / Cluster,Explanation,Coverage,Separation Error
Not-in-family,marital-status != Married-civ-spouse AND 20.5 <= age <= 90,0.96,0.47
Not-in-family,marital-status != Married-civ-spouse AND 23.5 <= age <= 90,0.88,0.44
Not-in-family,marital-status != Married-civ-spouse,1.0,0.52
Husband,marital-status == Married-civ-spouse AND sex != Female,1.0,0.01
Husband,marital-status == Married-civ-spouse,1.0,0.11
Husband,marital-status == Married-civ-spouse AND sex == Male,1.0,0.01
Wife,marital-status == Married-civ-spouse AND sex == Female,0.99,0.05
Own-child,marital-status != Married-civ-spouse AND 17 <= age <= 37,0.9,0.6
Own-child,17 <= age <= 31 AND marital-status == Never-married,0.78,0.49
Own-child,marital-status != Married-civ-spouse AND 17 <= age <= 28,0.76,0.47


In [22]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 3.477092742919922


Group / Cluster,Explanation,Coverage,Separation Error
Not-in-family,marital-status != Married-civ-spouse,1.0,0.52
Husband,marital-status == Married-civ-spouse,1.0,0.11
Own-child,17 <= age <= 28,0.76,0.56
Own-child,marital-status == Never-married,0.89,0.59
Own-child,17 <= age <= 28 OR marital-status != Married-civ-spouse,0.99,0.74
Own-child,17 <= age <= 28 OR marital-status == Never-married,0.92,0.64
Own-child,marital-status == Never-married OR 17 <= age <= 31,0.93,0.68
Own-child,17.0 <= age <= 31.6 OR marital-status == Never-married,0.93,0.68
Own-child,17.0 <= age <= 41.33333333333333 OR marital-status != Married-civ-spouse,1.0,0.8
Own-child,marital-status != Married-civ-spouse,0.98,0.72


In [23]:
start_time = time.time()
conj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, 'Multi attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.7219343185424805


Group / Cluster,Explanation,Coverage,Separation Error
"('Male', '<=50K')",relationship != infrequent_sklearn AND 0 <= capital-gain <= 4243.5,0.96,0.49
"('Male', '<=50K')",0 <= capital-gain <= 5095.5,1.0,0.51
"('Male', '<=50K')",relationship != infrequent_sklearn AND 0 <= capital-gain <= 5095.5,0.96,0.49
"('Female', '<=50K')",relationship != Husband,1.0,0.51
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 4243.5,0.99,0.49
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 6457.5,1.0,0.5
"('Male', '>50K')",0 <= capital-gain <= 99999,1.0,0.8
"('Male', '>50K')",8.5 <= education-num <= 16.0,0.96,0.78
"('Male', '>50K')",relationship == Husband,0.88,0.56
"('Male', '>50K')",relationship == Husband AND 8.5 <= education-num <= 16.0,0.85,0.52


In [24]:
start_time = time.time()
disj_res = multi_attribute_groupby.explain(explainer='many to one', explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, 'Multi attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.8337545394897461


Group / Cluster,Explanation,Coverage,Separation Error
"('Male', '<=50K')",0 <= capital-gain <= 5095.5,1.0,0.51
"('Female', '<=50K')",relationship != Husband,1.0,0.51
"('Male', '>50K')",relationship == Husband OR 8.5 <= education-num <= 16.0,1.0,0.78
"('Male', '>50K')",relationship == Husband,0.88,0.56
"('Male', '>50K')",0 <= capital-gain <= 99999,1.0,0.8
"('Male', '>50K')",8.5 <= education-num <= 16.0,0.96,0.78
"('Female', '>50K')",No explanation found,,


### Clustering test

#### 3 clusters

In [25]:
cluster_labels = KMeans(n_clusters=3).fit_predict(pd.get_dummies(adults))



In [26]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.0711421966552734


Group / Cluster,Explanation,Coverage,Separation Error
0,149278.5 <= fnlwgt <= 1490400 AND 8.5 <= education-num <= 16.0,0.87,0.21
0,149278.5 <= fnlwgt <= 1490400,1.0,0.22
1,291277.5 <= fnlwgt <= 1490400,1.0,0.0
2,13769 <= fnlwgt <= 149278.5,1.0,0.0


In [27]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 2.339325428009033


Group / Cluster,Explanation,Coverage,Separation Error
0,149278.5 <= fnlwgt <= 1490400,1.0,0.22
1,291277.5 <= fnlwgt <= 1490400,1.0,0.0
2,13769 <= fnlwgt <= 149278.5,1.0,0.0


#### 15 clusters

In [28]:
cluster_labels = KMeans(n_clusters=15).fit_predict(pd.get_dummies(adults))
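The `KMeans` calls in this notebook do not set a seed, so the cluster labels, and with them the explanations and runtimes, can differ between runs. A minimal sketch of a reproducible variant on toy data follows; the choice of `random_state=0` and `n_init=10` is arbitrary, and `dtype=float` is passed to `get_dummies` so the encoded frame is purely numeric.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame with one numeric and one categorical column (values are made up).
toy = pd.DataFrame(
    {
        "age": [25, 27, 52, 55, 38, 40],
        "workclass": ["Private", "Private", "State-gov", "State-gov", "Private", "State-gov"],
    }
)

# One-hot encode the categorical column so KMeans receives a numeric matrix.
features = pd.get_dummies(toy, dtype=float)

# Fixing random_state makes repeated runs return the same labels.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(labels_a, labels_b)
```

Because the features are not scaled, columns with a large numeric range dominate the distance computation, which may be why the cluster explanations above are expressed almost entirely in terms of `fnlwgt`.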

In [29]:
start_time = time.time()
conj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj', separation_threshold=separation_error)
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(adults_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 7.512061357498169


Group / Cluster,Explanation,Coverage,Separation Error
0,429223.0 <= fnlwgt <= 1490400 AND 17.0 <= age <= 53.5 AND 1 <= education-num <= 10,0.71,0.29
0,429223.0 <= fnlwgt <= 1490400,1.0,0.3
1,131643.5 <= fnlwgt <= 165154.5,1.0,0.0
2,269888.0 <= fnlwgt <= 315581.0,1.0,0.0
3,13769 <= fnlwgt <= 60226.5,1.0,0.0
4,194360 <= fnlwgt <= 255559 AND 7.5 <= education-num <= 16,0.91,0.31
4,194360 <= fnlwgt <= 255559,1.0,0.33
4,194360 <= fnlwgt <= 255559 AND 4.5 <= education-num <= 16,0.97,0.32
5,899747.6000000001 <= fnlwgt <= 1195073.8,1.0,0.0
6,97316.0 <= fnlwgt <= 131643.5,1.0,0.0


In [30]:
start_time = time.time()
disj_res = adults.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj', separation_threshold=separation_error)
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(adults_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 4.69097375869751


Group / Cluster,Explanation,Coverage,Separation Error
0,429223.0 <= fnlwgt <= 1490400,1.0,0.3
1,131643.5 <= fnlwgt <= 165154.5,1.0,0.0
2,269888.0 <= fnlwgt <= 315581.0,1.0,0.0
3,13769 <= fnlwgt <= 60226.5,1.0,0.0
4,194360 <= fnlwgt <= 255559,1.0,0.33
5,899747.6000000001 <= fnlwgt <= 1195073.8,1.0,0.0
6,97316.0 <= fnlwgt <= 131643.5,1.0,0.0
7,366566.0 <= fnlwgt <= 1490400,1.0,0.41
8,140676 <= fnlwgt <= 208043,1.0,0.48
8,157127 <= fnlwgt <= 194360,0.95,0.18


### Summary of results

In [31]:
conj_averages = adults_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = adults_comparison_df.loc[:, 'disjunctive', :].mean()
adults_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
adults_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages
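The per-form averages can also be obtained with a `groupby` on the `Explanation Type` index level. The sketch below uses a toy comparison table with made-up numbers and the same column and index names as this notebook; it is an alternative way to express the averaging step, not the code that produced the results shown here.

```python
import pandas as pd

columns = ["Average Coverage", "Average Separation Error", "% Explanation Not Found", "Runtime"]

# Toy comparison rows keyed by (Test, Explanation Type); values are illustrative.
rows = {
    ("Full dataset", "conjunctive"): [0.90, 0.40, 0.0, 1.2],
    ("Full dataset", "disjunctive"): [0.94, 0.48, 0.0, 0.7],
    ("Numeric only", "conjunctive"): [0.86, 0.48, 0.0, 0.8],
    ("Numeric only", "disjunctive"): [0.92, 0.52, 0.0, 1.0],
}
index = pd.MultiIndex.from_tuples(list(rows), names=["Test", "Explanation Type"])
toy_comparison = pd.DataFrame(list(rows.values()), index=index, columns=columns)

# One row per explanation form, averaged over all tests.
averages = toy_comparison.groupby(level="Explanation Type").mean()
print(averages)
```

Building the table from a dictionary of rows also yields numeric column dtypes from the start, which keeps `mean()` straightforward.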

In [32]:
adults_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Runtime
Full dataset,conjunctive,0.88875,0.40625,0.0,1.205197
Full dataset,disjunctive,0.935714,0.477143,0.0,0.717578
Numeric only,conjunctive,0.87,0.476,0.0,0.749489
Numeric only,disjunctive,0.92,0.51375,0.0,0.990556
Categorical only,conjunctive,0.87,0.285,0.0,0.152707
Categorical only,disjunctive,0.915,0.3025,0.0,0.17461
Single attribute groupby,conjunctive,0.885,0.499643,16.666667,4.567002
Single attribute groupby,disjunctive,0.94,0.604,50.0,3.477093
Multi attribute groupby,conjunctive,0.96,0.565,25.0,0.721934
Multi attribute groupby,disjunctive,0.973333,0.656667,25.0,0.833755


## Spotify dataset

In [33]:
spotify = pd.read_csv(r"..\Examples\Datasets\spotify_all.csv")
spotify

Unnamed: 0,acousticness,artists,danceability,energy,explicit,id,instrumentalness,key,liveness,loudness,...,name,popularity,speechiness,tempo,valence,year,decade,popularity_score,main_artist,duration_minutes
0,0.991000,['Mamie Smith'],0.598,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.3790,-12.628,...,Keep A Song In Your Soul,12,0.0936,149.976,0.6340,1920,1920,10,Mamie Smith,2.805550
1,0.643000,"[""Screamin' Jay Hawkins""]",0.852,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.026400,5,0.0809,-7.261,...,I Put A Spell On You,7,0.0534,86.889,0.9500,1920,1920,0,Screamin' Jay Hawkins,2.503333
2,0.993000,['Mamie Smith'],0.647,0.186,0,11m7laMUgmOKqI3oYzuhne,0.000018,0,0.5190,-12.098,...,Golfing Papa,4,0.1740,97.600,0.6890,1920,1920,0,Mamie Smith,2.730450
3,0.000173,['Oscar Velazquez'],0.730,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801000,2,0.1280,-7.311,...,True House Music - Xavier Santos & Carlos Gomi...,17,0.0425,127.997,0.0422,1920,1920,10,Oscar Velazquez,7.034783
4,0.295000,['Mixe'],0.704,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.4020,-6.036,...,Xuniverxe,2,0.0768,122.076,0.2990,1920,1920,0,Mixe,2.753733
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174384,0.009170,"['DJ Combo', 'Sander-7', 'Tony T']",0.792,0.866,0,46LhBf6TvYjZU2SMvGZAbn,0.000060,6,0.1780,-5.089,...,The One,0,0.0356,125.972,0.1860,2020,2020,0,DJ Combo,2.460250
174385,0.795000,['Alessia Cara'],0.429,0.211,0,7tue2Wemjd0FZzRtDrQFZd,0.000000,4,0.1960,-11.665,...,A Little More,0,0.0360,94.710,0.2280,2021,2020,0,Alessia Cara,2.412000
174386,0.806000,['Roger Fly'],0.671,0.589,0,48Qj61hOdYmUCFJbpQ29Ob,0.920000,4,0.1130,-12.393,...,Together,0,0.0282,108.058,0.7140,2020,2020,0,Roger Fly,3.635783
174387,0.920000,['Taylor Swift'],0.462,0.240,1,1gcyHQpBQ1lfXGdhZmWrHP,0.000000,0,0.1130,-12.077,...,champagne problems,69,0.0377,171.319,0.3200,2021,2020,60,Taylor Swift,4.066667


In [38]:
spotify_comparison_df = pd.DataFrame(columns=columns, index=index)

### Full dataset test

In [39]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels='explicit', explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Full dataset', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.606048583984375


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.194 AND 0 <= popularity <= 46.5,0.75,0.01
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.3233333333333333 AND 0 <= popularity <= 46.5,0.78,0.02
1,No explanation found,,


In [40]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels='explicit', explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Full dataset', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.8355743885040283


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.2425 OR 0.0 <= popularity <= 44.0,1.0,0.06
0,0.0 <= speechiness <= 0.2425 OR 0 <= popularity <= 46.5,1.0,0.06
0,0.0 <= speechiness <= 0.16166666666666665 OR 0 <= popularity <= 46.5,0.99,0.05
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.194 OR 0.0 <= popularity <= 44.0,0.99,0.05
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= speechiness <= 0.16166666666666665 OR 0.0 <= popularity <= 44.0,0.99,0.05
1,No explanation found,,


### Numeric only and categorical only tests

In [42]:
numeric_only = spotify[['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']]
categorical_only = spotify[['name', 'main_artist', 'year']]
labels = spotify['explicit']

In [43]:
start_time = time.time()
conj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Numeric only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 1.5166387557983398


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.2425,0.94,0.04
1,No explanation found,,


In [44]:
start_time = time.time()
disj_res = numeric_only.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Numeric only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 1.8359169960021973


Group / Cluster,Explanation,Coverage,Separation Error
0,0.0 <= speechiness <= 0.3233333333333333,0.95,0.05
0,0.0 <= speechiness <= 0.3233333333333333 OR 0.0 <= instrumentalness <= 0.333,1.0,0.07
0,0.0 <= speechiness <= 0.2425,0.94,0.04
0,0.0 <= instrumentalness <= 0.333 OR 0.0 <= speechiness <= 0.485,1.0,0.07
0,0.0 <= instrumentalness <= 0.24975 OR 0.0 <= speechiness <= 0.485,1.0,0.07
0,0.0 <= instrumentalness <= 0.4995 OR 0.0 <= speechiness <= 0.485,1.0,0.07
0,0.0 <= instrumentalness <= 0.4995 OR 0.0 <= speechiness <= 0.3233333333333333,1.0,0.07
0,0.0 <= speechiness <= 0.485,0.97,0.06
0,0.0 <= speechiness <= 0.16166666666666665,0.91,0.03
1,No explanation found,,


In [45]:
start_time = time.time()
conj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Categorical only', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 0.6001811027526855


Group / Cluster,Explanation,Coverage,Separation Error
0,name == infrequent_sklearn,1.0,0.07
0,main_artist == artist_infrequent_sklearn,1.0,0.07
1,No explanation found,,


In [46]:
start_time = time.time()
disj_res = categorical_only.explain(explainer='many to one', labels=labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Categorical only', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 0.738072395324707


Group / Cluster,Explanation,Coverage,Separation Error
0,name == infrequent_sklearn,1.0,0.07
0,main_artist == artist_infrequent_sklearn,1.0,0.07
1,No explanation found,,


### Groupby tests

In [48]:
single_attribute_groupby = spotify.groupby('mode').mean()

In [49]:
start_time = time.time()
conj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, 'Single attribute groupby', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.2848331928253174


Group / Cluster,Explanation,Coverage,Separation Error
1,0.0 <= speechiness <= 0.3233333333333333 AND 0.07324999943375587 <= acousticness <= 0.996,0.73,0.27
1,0.0 <= speechiness <= 0.485,0.95,0.3
1,0.07324999943375587 <= acousticness <= 0.996,0.78,0.28
0,No explanation found,,


In [50]:
start_time = time.time()
disj_res = single_attribute_groupby.explain(explainer='many to one', explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, 'Single attribute groupby', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 2.7960195541381836


Group / Cluster,Explanation,Coverage,Separation Error
1,0.07324999943375587 <= acousticness <= 0.996 OR 0.0 <= speechiness <= 0.485,1.0,0.3
1,0.0 <= speechiness <= 0.485,0.95,0.3
1,0.07324999943375587 <= acousticness <= 0.996,0.78,0.28
0,No explanation found,,


### Clustering test

#### 3 clusters

In [54]:
# The Spotify dataset is too large to one-hot encode all columns with get_dummies: attempting it raised a MemoryError (unable to allocate 22.3 GiB), so we cluster on the numeric audio features only
cols = ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']
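Before calling `get_dummies` on a large frame, the size of the encoded result can be estimated from the number of unique values per non-numeric column. This is a rough sketch on toy data: the helper `estimate_dummies_shape` is hypothetical, it treats every non-numeric, non-boolean column as one that will be encoded, and the byte estimate assumes one byte per cell, which depends on the dtype that `get_dummies` produces.

```python
import pandas as pd

# Toy frame with two high-cardinality text columns and one numeric column.
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "main_artist": ["x", "x", "y", "z"],
        "tempo": [120.0, 90.5, 101.2, 133.3],
    }
)


def estimate_dummies_shape(df):
    """Hypothetical helper: estimate the shape of pd.get_dummies(df) without building it."""
    cat_cols = df.select_dtypes(exclude=["number", "bool"]).columns
    n_dummy_cols = int(df[cat_cols].nunique().sum())  # one column per distinct value
    n_other_cols = df.shape[1] - len(cat_cols)        # numeric columns pass through unchanged
    return df.shape[0], n_dummy_cols + n_other_cols


n_rows, n_cols = estimate_dummies_shape(toy)
approx_bytes = n_rows * n_cols  # assuming 1 byte per cell
print(n_rows, n_cols, approx_bytes)
```

For a frame with free-text columns such as song names, the number of distinct values is close to the number of rows, so the encoded matrix grows roughly quadratically with the dataset size.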

In [55]:
cluster_labels = KMeans(n_clusters=3).fit_predict(spotify[cols])

In [56]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, '3 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 2.4809114933013916


Group / Cluster,Explanation,Coverage,Separation Error
0,147.00550079345703 <= tempo <= 219.446,1.0,0.0
1,0.0 <= tempo <= 106.19950103759766,0.99,0.0
1,0.0 <= tempo <= 106.32799911499023,0.99,0.0
1,0.0 <= tempo <= 109.723,1.0,0.08
2,106.73249816894531 <= tempo <= 147.00550079345703,1.0,0.0


In [57]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, '3 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 3.2725210189819336


Group / Cluster,Explanation,Coverage,Separation Error
0,147.00550079345703 <= tempo <= 219.446,1.0,0.0
1,0.0 <= tempo <= 106.32799911499023,0.99,0.0
1,0.0 <= tempo <= 109.723,1.0,0.08
1,0.0 <= tempo <= 106.19950103759766,0.99,0.0
2,106.73249816894531 <= tempo <= 147.00550079345703,1.0,0.0


#### 15 clusters

In [59]:
cluster_labels = KMeans(n_clusters=15).fit_predict(spotify[cols])

In [60]:
start_time = time.time()
conj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='conj')
conj_time = time.time() - start_time
print(f"Conjunctive explanation runtime: {conj_time}")
build_comparison_df(spotify_comparison_df, '15 clusters', 'conjunctive', conj_res, conj_time)
conj_res

Conjunctive explanation runtime: 9.19522213935852


Group / Cluster,Explanation,Coverage,Separation Error
0,100.64099884033203 <= tempo <= 113.46200180053711 AND -12.768799999999999 <= loudness <= -0.9609999999999985,0.86,0.02
0,100.64099884033203 <= tempo <= 113.46200180053711,1.0,0.28
0,100.64099884033203 <= tempo <= 113.46200180053711 AND -15.720749999999995 <= loudness <= -0.9609999999999985,0.99,0.11
0,100.64099884033203 <= tempo <= 113.46200180053711 AND -14.2475004196167 <= loudness <= -0.961,0.96,0.07
0,100.58349990844727 <= tempo <= 113.49750137329102 AND -14.2475004196167 <= loudness <= -0.961,0.96,0.07
...,...,...,...
14,90.001 <= tempo <= 107.148 AND -60.0 <= loudness <= -13.062,0.93,0.25
14,92.04199981689453 <= tempo <= 107.2234992980957 AND -60.0 <= loudness <= -14.603,0.78,0.07
14,92.04199981689453 <= tempo <= 107.2234992980957 AND -60.0 <= loudness <= -13.062,0.92,0.19
14,90.001 <= tempo <= 107.148 AND -60.0 <= loudness <= -14.603,0.79,0.15


In [61]:
start_time = time.time()
disj_res = spotify.explain(explainer='many to one', labels=cluster_labels, explanation_form='disj')
disj_time = time.time() - start_time
print(f"Disjunctive explanation runtime: {disj_time}")
build_comparison_df(spotify_comparison_df, '15 clusters', 'disjunctive', disj_res, disj_time)
disj_res

Disjunctive explanation runtime: 6.80797004699707


Group / Cluster,Explanation,Coverage,Separation Error
0,100.64099884033203 <= tempo <= 113.46200180053711,1.0,0.28
0,100.58349990844727 <= tempo <= 113.49750137329102,1.0,0.28
2,165.96400451660156 <= tempo <= 219.446,1.0,0.21
7,0.0 <= tempo <= 36.815500259399414,1.0,0.0
7,0.0 <= tempo <= 36.574333333333335,1.0,0.0
8,186.69149780273438 <= tempo <= 219.446,1.0,0.0
10,36.574333333333335 <= tempo <= 73.14866666666667,0.93,0.09
10,36.815500259399414 <= tempo <= 73.3915023803711,0.95,0.1
11,86.73649978637695 <= tempo <= 100.49849700927734,1.0,0.29
12,113.49750137329102 <= tempo <= 124.86849975585938,0.99,0.22


### Summary of results

In [62]:
conj_averages = spotify_comparison_df.loc[:, 'conjunctive', :].mean()
disj_averages = spotify_comparison_df.loc[:, 'disjunctive', :].mean()
spotify_comparison_df.loc[('Average', 'conjunctive'), :] = conj_averages
spotify_comparison_df.loc[('Average', 'disjunctive'), :] = disj_averages

In [63]:
spotify_comparison_df

Test,Explanation Type,Average Coverage,Average Separation Error,% Explanation Not Found,Runtime
Full dataset,conjunctive,0.883333,0.035,50.0,1.606049
Full dataset,disjunctive,0.971111,0.05,50.0,1.835574
Numeric only,conjunctive,0.9425,0.045,50.0,1.516639
Numeric only,disjunctive,0.974444,0.058889,50.0,1.835917
Categorical only,conjunctive,1.0,0.07,50.0,0.600181
Categorical only,disjunctive,1.0,0.07,50.0,0.738072
Single attribute groupby,conjunctive,0.82,0.283333,50.0,2.284833
Single attribute groupby,disjunctive,0.91,0.293333,50.0,2.79602
3 clusters,conjunctive,0.996,0.016,0.0,2.480911
3 clusters,disjunctive,0.996,0.016,0.0,3.272521
