# Many to one explainer - demo

This notebook demonstrates the many to one explainer on the adult dataset.\
This explainer is used to explain the relationship between a series of labels and the rest of the dataset.\
These labels can be a column from the dataset, a list of labels (for example, cluster labels), or groupby groups.\
The explainer provides rule-based explanations for the labels, based on the rest of the dataset.\
Please check the documentation for the available parameters and options for the many to one explainer, as not all of them are demonstrated here.

In [1]:
import pandas as pd
import pd_explain



In [2]:
adults = pd.read_csv("../Examples/Datasets/adult.csv")

# Simple example - using the many to one explainer on the entire dataset:

Using the many to one explainer on a column from the dataset that has a many to one relationship with other columns:

In [32]:
adults.explain(explainer='many_to_one', labels='label')

Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
>50K,No explanation found,,
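
The two metric columns are not defined in this notebook. As a minimal sketch, assume that Coverage is the fraction of a group's rows that satisfy the rule, and that Separation Error is the fraction of rows satisfying the rule that belong to a different group. These definitions are an assumption that is consistent with the numbers above, not something taken from the pd_explain source, and the `rule_metrics` helper and toy rows are hypothetical:

```python
def rule_metrics(rows, labels, group, rule):
    """Return (coverage, separation_error) of `rule` for `group`.

    Assumed definitions (not taken from the pd_explain source):
      coverage         = matching rows in the group / rows in the group
      separation error = matching rows outside the group / all matching rows
    """
    matches = [rule(row) for row in rows]
    in_group = [label == group for label in labels]

    group_size = sum(in_group)
    matched = sum(matches)
    matched_in_group = sum(m and g for m, g in zip(matches, in_group))

    coverage = matched_in_group / group_size if group_size else 0.0
    separation_error = (matched - matched_in_group) / matched if matched else 0.0
    return coverage, separation_error


# Toy data: five rows, three of them labelled '<=50K'.
rows = [
    {"capital-gain": 0},
    {"capital-gain": 2174},
    {"capital-gain": 6000},
    {"capital-gain": 0},
    {"capital-gain": 15024},
]
labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K"]

coverage, separation_error = rule_metrics(
    rows, labels, "<=50K", lambda row: 0 <= row["capital-gain"] <= 5095.5
)
print(coverage, separation_error)
```

Under this reading, a good explanation has coverage close to 1 and separation error close to 0.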


# Using the many to one explainer on a subset of the dataset:

Using the many to one explainer on subsets of the data that contain only numerical or only categorical columns:

In [33]:
adults_numeric_only = adults[['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']]
adults_numeric_only

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
0,39,77516,13,2174,0,40
1,50,83311,13,0,0,13
2,38,215646,9,0,0,40
3,53,234721,7,0,0,40
4,28,338409,13,0,0,40
...,...,...,...,...,...,...
48837,39,215419,13,0,0,36
48838,64,321403,9,0,0,40
48839,38,374983,13,0,0,50
48840,44,83891,13,5455,0,40


Just as we can pass a column name (when the labels are a column in the dataset), we can also pass a pandas Series as the `labels` parameter:

In [34]:
labels = adults['label']

In [35]:
adults_numeric_only.explain(explainer='many_to_one', labels=labels, coverage_threshold=0.8, separation_threshold=0.2)

Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
>50K,No explanation found,,
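
The result above is consistent with `coverage_threshold` acting as a minimum coverage and `separation_threshold` acting as a maximum separation error. The sketch below shows that filter under this assumption. The `Rule` tuple and `filter_rules` helper are hypothetical and not part of pd_explain; the candidate values are copied from the first explanation table in this notebook:

```python
from typing import NamedTuple


class Rule(NamedTuple):
    text: str
    coverage: float
    separation_error: float


def filter_rules(rules, coverage_threshold, separation_threshold):
    """Keep rules that cover enough of the group and leak little outside it."""
    return [
        r for r in rules
        if r.coverage >= coverage_threshold
        and r.separation_error <= separation_threshold
    ]


# Candidate rules for the '<=50K' group, taken from the first table above.
candidates = [
    Rule("0 <= capital-gain <= 4243.5", 0.99, 0.2),
    Rule("0 <= capital-gain <= 5095.5", 1.0, 0.21),
    Rule("1 <= education-num <= 10", 0.75, 0.15),
    Rule("0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10", 0.75, 0.13),
]

kept = filter_rules(candidates, coverage_threshold=0.8, separation_threshold=0.2)
print([r.text for r in kept])
```

Only the first candidate passes both thresholds, which matches the single rule reported for `<=50K`.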


In [36]:
adults_categorical_only = adults[['workclass', 'education', 'marital-status', 'occupation', 'relationship']]
adults_categorical_only

,workclass,education,marital-status,occupation,relationship
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife
...,...,...,...,...,...
48837,Private,Bachelors,Divorced,Prof-specialty,Not-in-family
48838,?,HS-grad,Widowed,?,Other-relative
48839,Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband
48840,Private,Bachelors,Divorced,Adm-clerical,Own-child


In [37]:
adults_categorical_only.explain(explainer='many_to_one', labels=labels, coverage_threshold=0.8, separation_threshold=0.3)

Group / Cluster,Explanation,Coverage,Separation Error
<=50K,occupation != Prof-specialty AND occupation != Exec-managerial,0.82,0.17
<=50K,occupation != Exec-managerial,0.91,0.21
<=50K,occupation != Prof-specialty,0.91,0.21
>50K,No explanation found,,


We can also restrict the explainer to specific columns using the `attributes` parameter:

In [38]:
adults.explain(explainer='many_to_one', attributes=['age', 'education', 'marital-status', 'relationship'], labels='label', coverage_threshold=0.8, separation_threshold=0.5)

Group / Cluster,Explanation,Coverage,Separation Error
<=50K,education != Bachelors,0.87,0.21
>50K,No explanation found,,


# Using many to one explainer with clustering labels:

In [39]:
from sklearn.cluster import KMeans

In [40]:
adults_with_dummies = pd.get_dummies(adults)

In [41]:
cluster_labels = KMeans(n_clusters=3).fit_predict(adults_with_dummies)

In [42]:
adults.explain(explainer='many_to_one', labels=cluster_labels)

Group / Cluster,Explanation,Coverage,Separation Error
0,149278.5 <= fnlwgt <= 1490400,1.0,0.22
0,149278.5 <= fnlwgt <= 1490400 AND 8.5 <= education-num <= 16.0,0.87,0.21
1,291277.5 <= fnlwgt <= 1490400,1.0,0.0
2,13769 <= fnlwgt <= 149278.5,1.0,0.0
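
Every rule in the table above is built on `fnlwgt`. A likely reason, stated here as an assumption rather than something verified against this run, is that KMeans relies on Euclidean distance, so an unscaled column with very large values dominates the clustering. The sketch below uses rows 0 and 2 of the numeric table shown earlier and computes how much each column contributes to their squared distance:

```python
columns = ["age", "fnlwgt", "education-num", "capital-gain",
           "capital-loss", "hours-per-week"]
row_0 = [39, 77516, 13, 2174, 0, 40]
row_2 = [38, 215646, 9, 0, 0, 40]

# Per-column contribution to the squared Euclidean distance.
squared_diffs = {c: (a - b) ** 2 for c, a, b in zip(columns, row_0, row_2)}
total = sum(squared_diffs.values())
shares = {c: d / total for c, d in squared_diffs.items()}

fnlwgt_share = shares["fnlwgt"]
print(f"fnlwgt share of squared distance: {fnlwgt_share:.4f}")
```

If this is indeed the cause, scaling the features before clustering (for example with scikit-learn's `StandardScaler`) would probably produce clusters that are described by a wider range of attributes.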


# Using many to one explainer to explain groupby groups:

The many to one explainer can also be used to explain groupby groups. It works with both single-attribute and multi-attribute groupbys:

In [43]:
single_attribute_groupby = adults.groupby('relationship').mean()
single_attribute_groupby

relationship,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Husband,43.906066,187193.324102,10.305133,1774.499797,122.200649,44.166109
Not-in-family,38.421601,190352.245013,10.32687,783.963284,76.840817,40.529842
Other-relative,33.420983,203524.602258,8.84927,234.989376,49.920983,37.128154
Own-child,24.79409,193782.299697,9.494394,174.38148,41.276217,33.142461
Unmarried,40.312195,191370.284878,9.63161,450.863805,46.638634,39.172488
Wife,40.078507,180748.781639,10.488631,1658.779923,116.033033,36.72973


To explain a groupby's groups, we can use the explainer on the groupby dataframe without specifying the labels parameter:

In [44]:
single_attribute_groupby.explain(explainer='many_to_one', coverage_threshold=0.8, separation_threshold=0.2)

Group / Cluster,Explanation,Coverage,Separation Error
Husband,marital-status == Married-civ-spouse AND sex != Female,1.0,0.01
Husband,marital-status == Married-civ-spouse AND sex == Male,1.0,0.01
Husband,marital-status == Married-civ-spouse,1.0,0.11
Wife,marital-status == Married-civ-spouse AND sex == Female,0.99,0.05
Not-in-family,No explanation found,,
Other-relative,No explanation found,,
Own-child,No explanation found,,
Unmarried,No explanation found,,


In [45]:
multi_attribute_groupby = adults.groupby(['sex', 'label']).mean()
multi_attribute_groupby

sex,label,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Female,<=50K,36.278999,185690.282604,9.824724,122.876586,46.910976,35.875962
Female,>50K,42.219333,183989.519503,11.832109,4313.663652,180.222725,40.678915
Male,<=50K,37.248548,192799.098671,9.454953,162.322673,58.746173,40.720702
Male,>50K,44.641863,189269.827082,11.561908,3993.827586,195.902299,46.304396


In [46]:
multi_attribute_groupby.explain(explainer='many_to_one', coverage_threshold=0.8, separation_threshold=0.55)

Group / Cluster,Explanation,Coverage,Separation Error
"('Male', '<=50K')",relationship != infrequent_sklearn AND 0 <= capital-gain <= 4243.5,0.96,0.49
"('Male', '<=50K')",relationship != infrequent_sklearn AND 0 <= capital-gain <= 5095.5,0.96,0.49
"('Male', '<=50K')",0 <= capital-gain <= 5095.5,1.0,0.51
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 4243.5,0.99,0.49
"('Female', '<=50K')",relationship != Husband,1.0,0.51
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 6457.5,1.0,0.5
"('Male', '>50K')",relationship == Husband AND 8.5 <= education-num <= 16.0,0.85,0.52
"('Female', '>50K')",No explanation found,,


It is also possible to pass a list of column names as the `labels` parameter.
The explainer then treats the combinations of values in those columns as groupby groups and explains each group separately, as if a groupby had been performed:

In [47]:
adults.explain(explainer='many_to_one', coverage_threshold=0.8, separation_threshold=0.55, labels=['sex', 'label'])

Group / Cluster,Explanation,Coverage,Separation Error
"('Male', '<=50K')",relationship != infrequent_sklearn AND 0 <= capital-gain <= 4243.5,0.96,0.49
"('Male', '<=50K')",relationship != infrequent_sklearn AND 0 <= capital-gain <= 5095.5,0.96,0.49
"('Male', '<=50K')",0 <= capital-gain <= 5095.5,1.0,0.51
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 4243.5,0.99,0.49
"('Female', '<=50K')",relationship != Husband,1.0,0.51
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 6457.5,1.0,0.5
"('Male', '>50K')",relationship == Husband AND 8.5 <= education-num <= 16.0,0.85,0.52
"('Female', '>50K')",No explanation found,,

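One way to picture this is to assume that the explainer combines the listed columns, row by row, into a single tuple label. This matches the tuple-shaped group names in the output, but it is not confirmed from the library source. A plain-Python sketch with hypothetical toy rows:

```python
from collections import Counter

rows = [
    {"sex": "Male", "label": "<=50K"},
    {"sex": "Female", "label": "<=50K"},
    {"sex": "Male", "label": ">50K"},
    {"sex": "Male", "label": "<=50K"},
]
label_columns = ["sex", "label"]

# One tuple per row, built from the listed columns in order.
composite_labels = [tuple(row[col] for col in label_columns) for row in rows]

# Each distinct tuple plays the role of one groupby group.
group_sizes = Counter(composite_labels)
print(group_sizes)
```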

# Many to one explainer when there are many groups / clusters:

In [3]:
from sklearn.cluster import KMeans
adults_with_dummies = pd.get_dummies(adults)

In [4]:
many_clusters = KMeans(n_clusters=20).fit_predict(adults_with_dummies)



When there are many groups (by default, more than 10), the explainer automatically prunes them, keeping the most common ones unless another method is chosen.\
This behavior is controlled by the `prune_if_too_many_labels` parameter, which is set to True by default. The maximum number of labels to keep is set by the `max_labels` parameter, which is 10 by default.\
There are several pruning methods you can choose from using the `pruning_method` parameter:
1. `largest` - keep the k largest clusters. This is the default method.
2. `smallest` - keep the k smallest clusters.
3. `random` - keep k random clusters.
4. `max_dist` - keep the k clusters with the largest distance between their centroids, multiplied by the size of the cluster.
5. `min_dist` - like `max_dist`, but with the smallest distance.
6. `max_silhouette` - keep the k clusters with the highest silhouette score, multiplied by the size of the cluster.
7. `min_silhouette` - like `max_silhouette`, but with the lowest silhouette score.
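
The three size-based methods can be sketched in plain Python as follows. This is only an illustration of the selection logic described in the list, not the library's implementation; the `prune_labels` helper and its `seed` argument are hypothetical, and the distance and silhouette methods are left out:

```python
import random
from collections import Counter


def prune_labels(labels, max_labels=10, pruning_method="largest", seed=0):
    """Return the set of labels to keep (sketch of three pruning methods)."""
    counts = Counter(labels)
    if len(counts) <= max_labels:
        return set(counts)  # nothing to prune
    if pruning_method == "largest":
        ranked = sorted(counts, key=lambda lab: counts[lab], reverse=True)
        return set(ranked[:max_labels])
    if pruning_method == "smallest":
        ranked = sorted(counts, key=lambda lab: counts[lab])
        return set(ranked[:max_labels])
    if pruning_method == "random":
        return set(random.Random(seed).sample(sorted(counts), max_labels))
    raise ValueError(f"unsupported pruning_method: {pruning_method!r}")


# Four clusters with sizes 5, 3, 8 and 1.
labels = [0] * 5 + [1] * 3 + [2] * 8 + [3] * 1
kept_largest = prune_labels(labels, max_labels=2, pruning_method="largest")
kept_smallest = prune_labels(labels, max_labels=2, pruning_method="smallest")
print(kept_largest, kept_smallest)
```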

In [7]:
adults.explain(explainer='many_to_one', labels=many_clusters, prune_if_too_many_labels=True, max_labels=10, pruning_method='largest')

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels using method largest



Group / Cluster,Explanation,Coverage,Separation Error
1,158146.0 <= fnlwgt <= 181984.0,1.0,0.0
3,49836.0 <= fnlwgt <= 80356.5,1.0,0.0
6,107871.5 <= fnlwgt <= 132456.0,1.0,0.0
10,181666 <= fnlwgt <= 216473 AND 5.5 <= education-num <= 16,0.96,0.25
10,181666 <= fnlwgt <= 216473,1.0,0.26
11,132456.0 <= fnlwgt <= 158146.0,1.0,0.0
13,13769 <= fnlwgt <= 49836.0,1.0,0.0
18,261827.5 <= fnlwgt <= 295067,1.0,0.0
19,80356.5 <= fnlwgt <= 107871.5,1.0,0.0
4,No explanation found,,


# Disjunctive explanations:

The many to one explainer can generate disjunctive explanations in addition to the conjunctive explanations shown above.
To generate disjunctive explanations, set the `explanation_form` parameter to 'disj' or 'disjunction':

In [51]:
single_attribute_groupby.explain(explainer='many_to_one', explanation_form='disj')

Group / Cluster,Explanation,Coverage,Separation Error
Husband,marital-status == Married-civ-spouse,1.0,0.11
Not-in-family,No explanation found,,
Other-relative,No explanation found,,
Own-child,No explanation found,,
Unmarried,No explanation found,,
Wife,No explanation found,,
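
The difference between the two forms is that a conjunctive rule matches a row only when every predicate holds, while a disjunctive rule matches when at least one predicate holds. A small plain-Python illustration (the predicates, the row, and the helper names are hypothetical):

```python
def matches_conjunction(row, predicates):
    """True only if the row satisfies every predicate (AND)."""
    return all(p(row) for p in predicates)


def matches_disjunction(row, predicates):
    """True if the row satisfies at least one predicate (OR)."""
    return any(p(row) for p in predicates)


predicates = [
    lambda row: row["marital-status"] == "Married-civ-spouse",
    lambda row: row["sex"] == "Male",
]
row = {"marital-status": "Married-civ-spouse", "sex": "Female"}

conj = matches_conjunction(row, predicates)
disj = matches_disjunction(row, predicates)
print(conj, disj)
```

Adding predicates to a disjunction can only widen the set of matching rows, so it tends to raise coverage at the cost of a higher separation error, whereas adding predicates to a conjunction does the opposite.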


# Additional features:

### Automatic label dropping:

If the labels provided to the explainer contain NA values, the explainer will automatically drop those labels and the corresponding rows from the dataset:

In [52]:
labels = adults['label'].copy()
labels.iloc[:20] = None

In [53]:
adults.drop(columns=['label']).explain(explainer='many_to_one', labels=labels)

Dropping 20 labels with missing values.



Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 4243.5 AND 1 <= education-num <= 10,0.75,0.12
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
<=50K,1 <= education-num <= 10,0.76,0.15
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.76,0.13
>50K,No explanation found,,
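
The effect of this step can be reproduced by hand with pandas. The sketch below is an assumption about what the dropping amounts to (filtering both the labels and the data with the same mask), not the library's code, and the toy frame is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"capital-gain": [0, 2174, 0, 15024]})
labels = pd.Series(["<=50K", None, "<=50K", ">50K"], index=df.index)

# Keep only rows whose label is present; the shared index keeps the
# labels and the data aligned after filtering.
mask = labels.notna()
clean_labels = labels[mask]
clean_df = df.loc[mask]

print(f"Dropping {int((~mask).sum())} labels with missing values.")
```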


### Binning numeric labels:

If the labels provided are numeric, the explainer can automatically bin them using either the 'quantile' or 'uniform' strategy. This is controlled by the `bin_numeric` parameter, which is set to False by default.\
The binning is configured by the parameters `num_bins`, which is set to 10 by default, and `binning_method`, which is set to 'quantile' by default.\
Binning happens only if the `bin_numeric` parameter is set to True, the labels are numeric, and the number of unique labels is greater than 10:

In [54]:
adults.explain(explainer='many_to_one', labels='education-num', bin_numeric=True)

Attribute 'education-num' in labels has more than specified number of 10 unique values. Binning the attribute.



Group / Cluster,Explanation,Coverage,Separation Error
7.0 < education-num <= 9.0,education == HS-grad,0.96,0.0
9.0 < education-num <= 10.0,education == Some-college,1.0,0.0
0.999 < education-num <= 7.0,No explanation found,,
10.0 < education-num <= 11.0,No explanation found,,
11.0 < education-num <= 13.0,No explanation found,,
13.0 < education-num <= 16.0,No explanation found,,
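
The output above contains 6 bins rather than 10. One plausible explanation, offered as an assumption since the binning code is not shown here, is that quantile edges which coincide get merged, as `pd.qcut` does with `duplicates="drop"`. The sketch below contrasts quantile and uniform binning on a hypothetical toy series:

```python
import pandas as pd

values = pd.Series([1, 9, 9, 9, 9, 10, 10, 10, 13, 16], name="education-num")

# Quantile binning: edges follow the data distribution. Repeated values can
# make several quantile edges coincide; duplicates="drop" merges them, so
# fewer bins than requested may come back.
quantile_bins = pd.qcut(values, q=5, duplicates="drop")

# Uniform binning: equal-width intervals over the value range.
uniform_bins = pd.cut(values, bins=5)

print(quantile_bins.cat.categories)
print(uniform_bins.cat.categories)
```

Here five quantile bins are requested but only four distinct intervals remain, while uniform binning always returns the requested number of intervals.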


### Sampling:

By default, the many to one explainer uses uniform random sampling to speed up explanation generation.\
The impact on explanation quality should be negligible in most cases, but sampling can be turned off by setting the `use_sampling` parameter to False:

In [55]:
adults.explain(explainer='many_to_one', labels='label', use_sampling=False)

Group / Cluster,Explanation,Coverage,Separation Error
<=50K,0 <= capital-gain <= 4243.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4668.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
<=50K,0.0 <= capital-gain <= 19999.8,1.0,0.23
<=50K,0 <= capital-gain <= 4668.5,0.99,0.2
<=50K,1 <= education-num <= 10,0.76,0.15
<=50K,0.0 <= capital-gain <= 24999.75,1.0,0.23
<=50K,0.0 <= capital-gain <= 16666.5,1.0,0.23
>50K,No explanation found,,
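
For intuition, uniform random sampling that keeps rows and labels aligned can be sketched as below. The sample size, the `seed` argument, and the `sample_rows` helper are hypothetical; how pd_explain chooses its sample is not covered in this notebook:

```python
import random


def sample_rows(rows, labels, sample_size, seed=0):
    """Uniformly sample rows without replacement, keeping labels aligned."""
    if sample_size >= len(rows):
        return list(rows), list(labels)
    rng = random.Random(seed)
    # Sample positions once and reuse them for both sequences.
    chosen = sorted(rng.sample(range(len(rows)), sample_size))
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


rows = list(range(100))
labels = ["even" if r % 2 == 0 else "odd" for r in rows]

sampled_rows, sampled_labels = sample_rows(rows, labels, sample_size=10)
print(sampled_rows, sampled_labels)
```

Because a sample only approximates the full data, rule boundaries and metrics can differ slightly between a sampled and an unsampled run.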
