# Many to one explainer - demo

This notebook demonstrates the many to one explainer on the adult dataset.\
This explainer explains the relationship between a series of labels and the rest of the dataset.\
These labels can be a column from the dataset, a list of labels (for example, cluster labels), or groupby groups.\
The explainer provides rule-based explanations for the labels, based on the rest of the dataset.\
Please check the documentation for the available parameters and options for the many to one explainer, as not all of them are demonstrated here.

In [1]:
import pandas as pd
import pd_explain

usetex-False


In [2]:
adults = pd.read_csv(r"../../Datasets/adult.csv")

# Simple example - using the many to one explainer on the entire dataset:

Using the many to one explainer on a column from the dataset that has a many to one relationship with other columns:

In [3]:
adults.explain(explainer='many_to_one', labels='label')

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
<=50K,1 <= education-num <= 10,0.75,0.15,100.00% from group >50K
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21,100.00% from group >50K
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13,100.00% from group >50K
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2,100.00% from group >50K
>50K,No explanation found,,,
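One way to read the Coverage and Separation Error columns is sketched below. It assumes that coverage is the fraction of a group's rows that satisfy the rule, and that separation error is the fraction of rows satisfying the rule that belong to other groups. These definitions are an assumption rather than something taken from the pd_explain source, and the `rule_metrics` helper and the toy data are hypothetical.

```python
import pandas as pd


def rule_metrics(labels: pd.Series, group, mask: pd.Series) -> tuple[float, float]:
    """Return (coverage, separation_error) of a boolean rule mask for one group.

    Assumed definitions (not taken from pd_explain itself):
      coverage         = group rows matching the rule / all group rows
      separation error = non-group rows matching the rule / all rows matching the rule
    """
    in_group = labels == group
    matched = int(mask.sum())
    coverage = (mask & in_group).sum() / in_group.sum()
    separation_error = (mask & ~in_group).sum() / matched if matched else 0.0
    return float(coverage), float(separation_error)


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "education-num": [9, 10, 7, 13, 14, 10],
    "label": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})
rule = toy["education-num"].between(1, 10)  # 1 <= education-num <= 10
coverage, separation_error = rule_metrics(toy["label"], "<=50K", rule)
print(coverage, separation_error)
```

Under these assumptions, a rule with high coverage and low separation error describes most of its group while matching few rows from other groups.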


# Using the many to one explainer on a subset of the dataset:

Using the many to one explainer on subsets of the data that contain only numerical or only categorical columns:

In [7]:
adults_numeric_only = adults[['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']]
adults_numeric_only

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
0,39,77516,13,2174,0,40
1,50,83311,13,0,0,13
2,38,215646,9,0,0,40
3,53,234721,7,0,0,40
4,28,338409,13,0,0,40
...,...,...,...,...,...,...
48837,39,215419,13,0,0,36
48838,64,321403,9,0,0,40
48839,38,374983,13,0,0,50
48840,44,83891,13,5455,0,40


Just as we can provide a column name (if the column is in the dataset), we can also pass a pandas Series as the `labels` parameter:

In [8]:
labels = adults['label']

In [6]:
adults_numeric_only.explain(explainer='many_to_one', labels=labels, coverage_threshold=0.8, separation_threshold=0.2)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2,100.00% from group >50K
>50K,No explanation found,,,
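The effect of `coverage_threshold` and `separation_threshold` can be pictured as a filter over candidate rules. The sketch below assumes the thresholds act as a minimum coverage and a maximum separation error; this is an inference from the outputs in this notebook, not documented behavior, and `filter_rules` is a hypothetical helper. The rule values are copied (rounded, as displayed) from the first output table in this notebook.

```python
import pandas as pd

# Rules for group <=50K as displayed in the first output table (rounded values).
rules = pd.DataFrame({
    "explanation": [
        "1 <= education-num <= 10",
        "0 <= capital-gain <= 5095.5",
        "0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10",
        "0 <= capital-gain <= 4243.5",
    ],
    "coverage": [0.75, 1.0, 0.75, 0.99],
    "separation_error": [0.15, 0.21, 0.13, 0.20],
})


def filter_rules(rules: pd.DataFrame, coverage_threshold: float,
                 separation_threshold: float) -> pd.DataFrame:
    """Keep rules that meet both thresholds (assumed semantics)."""
    keep = (rules["coverage"] >= coverage_threshold) & (
        rules["separation_error"] <= separation_threshold
    )
    return rules[keep]


kept = filter_rules(rules, coverage_threshold=0.8, separation_threshold=0.2)
print(kept["explanation"].tolist())
```

With thresholds of 0.8 and 0.2, only the `capital-gain <= 4243.5` rule survives, which matches the single rule shown in the output above.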


In [9]:
adults_categorical_only = adults[['workclass', 'education', 'marital-status', 'occupation', 'relationship']]
adults_categorical_only

Unnamed: 0,workclass,education,marital-status,occupation,relationship
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife
...,...,...,...,...,...
48837,Private,Bachelors,Divorced,Prof-specialty,Not-in-family
48838,?,HS-grad,Widowed,?,Other-relative
48839,Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband
48840,Private,Bachelors,Divorced,Adm-clerical,Own-child


In [8]:
adults_categorical_only.explain(explainer='many_to_one', labels=labels, coverage_threshold=0.8, separation_threshold=0.3)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
<=50K,occupation != Prof-specialty AND occupation != Exec-managerial,0.82,0.17,100.00% from group >50K
<=50K,occupation != Exec-managerial,0.91,0.21,100.00% from group >50K
<=50K,occupation != Prof-specialty,0.91,0.21,100.00% from group >50K
>50K,No explanation found,,,


We can also restrict the explainer to specific columns using the `attributes` parameter:

In [9]:
adults.explain(explainer='many_to_one', attributes=['age', 'education', 'marital-status', 'relationship'], labels='label', coverage_threshold=0.8, separation_threshold=0.5)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
<=50K,education != Bachelors,0.87,0.21,100.00% from group >50K
<=50K,education != Bachelors AND education != Masters,0.84,0.18,100.00% from group >50K
<=50K,education != Masters,0.97,0.22,100.00% from group >50K
>50K,No explanation found,,,


# Using many to one explainer with clustering labels:

In [10]:
from sklearn.cluster import KMeans

In [11]:
adults_with_dummies = pd.get_dummies(adults)

In [None]:
cluster_labels = KMeans(n_clusters=3).fit_predict(adults_with_dummies)

In [13]:
adults.explain(explainer='many_to_one', labels=cluster_labels)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
0,149278.5 <= fnlwgt <= 1490400,1.0,0.22,100.00% from group 1
0,149278.5 <= fnlwgt <= 1490400 AND 8.5 <= education-num <= 16.0,0.87,0.21,100.00% from group 1
1,291277.5 <= fnlwgt <= 1490400,1.0,0.0,Rule has no separation error.
2,13769 <= fnlwgt <= 149278.5,1.0,0.0,Rule has no separation error.


# Using many to one explainer to explain groupby groups:

The many to one explainer can also be used to explain groupby groups. It works with a groupby over a single attribute or over multiple attributes:

In [14]:
single_attribute_groupby = adults.groupby('relationship').mean()
single_attribute_groupby

relationship,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Husband,43.906066,187193.324102,10.305133,1774.499797,122.200649,44.166109
Not-in-family,38.421601,190352.245013,10.32687,783.963284,76.840817,40.529842
Other-relative,33.420983,203524.602258,8.84927,234.989376,49.920983,37.128154
Own-child,24.79409,193782.299697,9.494394,174.38148,41.276217,33.142461
Unmarried,40.312195,191370.284878,9.63161,450.863805,46.638634,39.172488
Wife,40.078507,180748.781639,10.488631,1658.779923,116.033033,36.72973


To explain a groupby's groups, we can use the explainer on the groupby dataframe without specifying the labels parameter:

In [15]:
single_attribute_groupby.explain(explainer='many_to_one', coverage_threshold=0.8, separation_threshold=0.2)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
Husband,marital-status == Married-civ-spouse AND sex == Male,1.0,0.01,"50.00% from group Other-relative, 40.91% from ..."
Husband,marital-status == Married-civ-spouse AND sex != Female,1.0,0.01,"50.00% from group Other-relative, 40.91% from ..."
Husband,marital-status == Married-civ-spouse,1.0,0.11,"87.69% from group Wife, 6.92% from group Other..."
Wife,marital-status == Married-civ-spouse AND sex != Male,0.99,0.05,"63.64% from group Other-relative, 36.36% from ..."
Wife,marital-status == Married-civ-spouse AND sex == Female,0.99,0.05,"63.64% from group Other-relative, 36.36% from ..."
Not-in-family,No explanation found,,,
Other-relative,No explanation found,,,
Own-child,No explanation found,,,
Unmarried,No explanation found,,,


In [3]:
multi_attribute_groupby = adults.groupby(['sex', 'label']).mean()
multi_attribute_groupby

sex,label,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Female,<=50K,36.278999,185690.282604,9.824724,122.876586,46.910976,35.875962
Female,>50K,42.219333,183989.519503,11.832109,4313.663652,180.222725,40.678915
Male,<=50K,37.248548,192799.098671,9.454953,162.322673,58.746173,40.720702
Male,>50K,44.641863,189269.827082,11.561908,3993.827586,195.902299,46.304396


In [17]:
multi_attribute_groupby.explain(explainer='many_to_one', coverage_threshold=0.8, separation_threshold=0.55)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
"('Male', '<=50K')",relationship != Wife,1.0,0.51,"55.36% from group ('Female', '<=50K'), 41.31% ..."
"('Male', '<=50K')",0 <= capital-gain <= 5095.5,1.0,0.51,"59.76% from group ('Female', '<=50K'), 33.80% ..."
"('Male', '<=50K')",relationship != Wife AND 0 <= capital-gain <= 5095.5,1.0,0.49,"60.39% from group ('Female', '<=50K'), 36.91% ..."
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 4243.5,0.99,0.49,"84.54% from group ('Male', '<=50K'), 10.40% fr..."
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 6457.5,1.0,0.5,"83.72% from group ('Male', '<=50K'), 11.04% fr..."
"('Female', '<=50K')",relationship != Husband,1.0,0.51,"79.11% from group ('Male', '<=50K'), 13.27% fr..."
"('Male', '>50K')",relationship == Husband AND 8.5 <= education-num <= 16.0,0.85,0.52,"100.00% from group ('Male', '<=50K')"
"('Female', '>50K')",No explanation found,,,


It is also possible to pass a list of column names as the `labels` parameter.
The explainer then treats those columns as groupby keys and explains each resulting group separately, as if a groupby had been performed:

In [18]:
adults.explain(explainer='many_to_one', coverage_threshold=0.8, separation_threshold=0.55, labels=['sex', 'label'])

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
"('Male', '<=50K')",relationship != Wife,1.0,0.51,"55.36% from group ('Female', '<=50K'), 41.31% ..."
"('Male', '<=50K')",0 <= capital-gain <= 5095.5,1.0,0.51,"59.76% from group ('Female', '<=50K'), 33.80% ..."
"('Male', '<=50K')",relationship != Wife AND 0 <= capital-gain <= 5095.5,1.0,0.49,"60.39% from group ('Female', '<=50K'), 36.91% ..."
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 4243.5,0.99,0.49,"84.54% from group ('Male', '<=50K'), 10.40% fr..."
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 6457.5,1.0,0.5,"83.72% from group ('Male', '<=50K'), 11.04% fr..."
"('Female', '<=50K')",relationship != Husband,1.0,0.51,"79.11% from group ('Male', '<=50K'), 13.27% fr..."
"('Male', '>50K')",relationship == Husband AND 8.5 <= education-num <= 16.0,0.85,0.52,"100.00% from group ('Male', '<=50K')"
"('Female', '>50K')",No explanation found,,,


Another option is to pass a DataFrame with multiple columns as the `labels` parameter.\
These columns are aggregated into a single column and treated like a groupby.
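As a rough picture of that aggregation, the sketch below combines two label columns into one tuple-valued Series and checks that the resulting groups match a pandas groupby on the same columns. How pd_explain combines the columns internally is an assumption here; the toy data and the `combined_label` name are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "label": ["<=50K", "<=50K", ">50K", "<=50K"],
    "age": [39, 28, 52, 31],
})

# Combine the label columns row by row into a single tuple-valued Series.
combined = pd.Series(
    list(zip(toy["sex"], toy["label"])), index=toy.index, name="combined_label"
)

# Each distinct tuple corresponds to one groupby group.
group_sizes = combined.value_counts().to_dict()
groupby_sizes = toy.groupby(["sex", "label"]).size().to_dict()
print(group_sizes)
```

The tuple labels produced this way have the same form as the group names shown in the explanation tables, such as `('Male', '<=50K')`.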

In [19]:
labels = adults[['sex', 'label']]
adults.drop(columns=['sex', 'label']).explain(explainer='many_to_one', labels=labels, coverage_threshold=0.8, separation_threshold=0.55)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
"('Male', '<=50K')",relationship != Wife,1.0,0.51,"55.36% from group ('Female', '<=50K'), 41.31% ..."
"('Male', '<=50K')",0 <= capital-gain <= 5095.5,1.0,0.51,"59.76% from group ('Female', '<=50K'), 33.80% ..."
"('Male', '<=50K')",relationship != Wife AND 0 <= capital-gain <= 5095.5,1.0,0.49,"60.39% from group ('Female', '<=50K'), 36.91% ..."
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 4243.5,0.99,0.49,"84.54% from group ('Male', '<=50K'), 10.40% fr..."
"('Female', '<=50K')",relationship != Husband AND 0 <= capital-gain <= 6457.5,1.0,0.5,"83.72% from group ('Male', '<=50K'), 11.04% fr..."
"('Female', '<=50K')",relationship != Husband,1.0,0.51,"79.11% from group ('Male', '<=50K'), 13.27% fr..."
"('Male', '>50K')",relationship == Husband AND 8.5 <= education-num <= 16.0,0.85,0.52,"100.00% from group ('Male', '<=50K')"
"('Female', '>50K')",No explanation found,,,


# Many to one explainer when there are many groups / clusters:

In [20]:
from sklearn.cluster import KMeans
adults_with_dummies = pd.get_dummies(adults)

In [21]:
many_clusters = KMeans(n_clusters=20).fit_predict(adults_with_dummies)

When there are many groups (by default, more than 10), the explainer automatically prunes them, by default keeping the most common ones.\
This behavior is toggled by the `prune_if_too_many_labels` parameter, which is set to True by default. The maximum number of labels to keep is controlled by the `max_labels` parameter, which is set to 10 by default.\
You can choose among several pruning methods using the `pruning_method` parameter, where k is the value of `max_labels`:
1. `largest` - keep the k largest clusters. This is the default method.
2. `smallest` - keep the k smallest clusters.
3. `random` - keep k random clusters.
4. `max_dist` - keep the k clusters with the largest distance between their centroids.
5. `min_dist` - like max_dist, but with the smallest distance.
6. `max_silhouette` - keep the k clusters with the highest silhouette score.
7. `min_silhouette` - like max_silhouette, but with the lowest silhouette score.
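The default `largest` method can be sketched in plain pandas as follows. This is only an illustration of the idea of keeping the k most frequent labels; whether the library drops the remaining rows in exactly this way is an assumption, and `prune_largest` is a hypothetical helper.

```python
import pandas as pd


def prune_largest(labels: pd.Series, max_labels: int) -> pd.Series:
    """Keep only the entries whose label is among the `max_labels` most frequent."""
    top = labels.value_counts().nlargest(max_labels).index
    return labels[labels.isin(top)]


# Toy cluster labels: label 3 occurs 4 times, 0 three times, 1 twice, 2 once.
labels = pd.Series([0, 0, 0, 1, 1, 2, 3, 3, 3, 3])
pruned = prune_largest(labels, max_labels=2)
print(sorted(pruned.unique().tolist()), len(pruned))
```

The pruned Series keeps its original index, so the matching data rows can be selected with `data.loc[pruned.index]`.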

In [22]:
adults.explain(explainer='many_to_one', labels=many_clusters, prune_if_too_many_labels=True, max_labels=10, pruning_method='largest')

There are more than the specified max number of 10 unique labels, and the option `prune_if_too_many_labels` is set to True. Pruning the labels to the top 10 most common labels using method largest



Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
0,183823.5 <= fnlwgt <= 206559.0,1.0,0.0,Rule has no separation error.
2,93156.0 <= fnlwgt <= 115335.0,1.0,0.0,Rule has no separation error.
5,13769 <= fnlwgt <= 56555.5,1.0,0.0,Rule has no separation error.
5,13769.0 <= fnlwgt <= 60842.833333333336,1.0,0.0,Rule has no separation error.
9,115335.0 <= fnlwgt <= 137813.5,1.0,0.0,Rule has no separation error.
11,262336.5 <= fnlwgt <= 296212,1.0,0.0,Rule has no separation error.
12,137813.5 <= fnlwgt <= 161472.5,1.0,0.0,Rule has no separation error.
16,161472.5 <= fnlwgt <= 183823.5,1.0,0.0,Rule has no separation error.
19,56555.5 <= fnlwgt <= 93156.0,1.0,0.0,Rule has no separation error.
8,No explanation found,,,


# Disjunctive explanations:

The many-to-one explainer supports disjunctive explanation generation, as well as the conjunctive explanations shown above.
To generate disjunctive explanations, set the `explanation_form` parameter to 'disj' or 'disjunction':

In [13]:
adults.explain(explainer='many_to_one', explanation_form='disj', labels='label', attributes=['workclass', 'education', 'marital-status', 'occupation', 'relationship'], use_sampling=False)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
<=50K,occupation != Prof-specialty OR education != Bachelors,0.96,0.23,100.00% from group >50K
<=50K,occupation != Prof-specialty,0.91,0.21,100.00% from group >50K
>50K,No explanation found,,,
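The difference between conjunctive (AND) and disjunctive (OR) rules can be illustrated with boolean masks. The sketch below uses invented toy data and assumes coverage means the fraction of the group's rows that satisfy the rule; it does not reproduce how pd_explain searches for rules.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "occupation": ["Sales", "Prof-specialty", "Prof-specialty", "Sales", "Prof-specialty"],
    "education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors", "Bachelors"],
    "label": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

not_prof = toy["occupation"] != "Prof-specialty"
not_bachelors = toy["education"] != "Bachelors"
in_group = toy["label"] == "<=50K"


def coverage(mask: pd.Series) -> float:
    """Fraction of the group's rows that satisfy the rule (assumed definition)."""
    return float((mask & in_group).sum() / in_group.sum())


conj_coverage = coverage(not_prof & not_bachelors)  # both conditions must hold
disj_coverage = coverage(not_prof | not_bachelors)  # either condition suffices
print(conj_coverage, disj_coverage)
```

A disjunction can never cover fewer rows than either of its parts, which is consistent with the OR rule above having higher coverage than the single-condition rule.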


# Additional features:

### Automatic label dropping:

If the labels provided to the explainer contain NA values, the explainer will automatically drop those labels and the corresponding rows from the dataset:

In [24]:
labels = adults['label'].copy()
labels.iloc[:20] = None

In [25]:
adults.drop(columns=['label']).explain(explainer='many_to_one', labels=labels)

Dropping 20 labels with missing values.



Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
<=50K,1 <= education-num <= 10,0.76,0.15,100.00% from group >50K
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21,100.00% from group >50K
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.76,0.13,100.00% from group >50K
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2,100.00% from group >50K
<=50K,0 <= capital-gain <= 4243.5 AND 1 <= education-num <= 10,0.75,0.12,100.00% from group >50K
>50K,No explanation found,,,
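A plain pandas equivalent of this clean-up step might look like the following. It is a sketch of the described behavior on invented toy data, not the library's own code.

```python
import pandas as pd

# Toy data, invented for illustration only.
data = pd.DataFrame({"age": [39, 50, 38, 53], "hours-per-week": [40, 13, 40, 40]})
labels = pd.Series(["<=50K", None, ">50K", None])

# Find the missing labels, then drop them together with their data rows.
missing = labels.isna()
print(f"Dropping {int(missing.sum())} labels with missing values.")
clean_labels = labels[~missing]
clean_data = data.loc[clean_labels.index]
```

Selecting the data rows through the labels' index keeps the two objects aligned after the drop.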


### Binning numeric labels:

If the labels provided are numeric, the explainer can automatically bin them using either the 'quantile' or 'uniform' strategies. This is controlled by the `bin_numeric` parameter, which is set to False by default.\
The behavior of the binning can be controlled by the parameters `num_bins`, which is set to 10 by default, and `binning_method`, which is set to 'quantile' by default.\
Binning happens only if the `bin_numeric` parameter is set to True, the labels are numeric, and the number of unique labels is greater than 10:

In [26]:
adults.explain(explainer='many_to_one', labels='education-num', bin_numeric=True)

Attribute 'education-num' in labels has more than specified number of 10 unique values. Binning the attribute.



Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
11.0 < education-num <= 13.0,education == Bachelors,0.84,0.0,Rule has no separation error.
7.0 < education-num <= 9.0,education == HS-grad,0.96,0.0,Rule has no separation error.
9.0 < education-num <= 10.0,education == Some-college,1.0,0.0,Rule has no separation error.
10.0 < education-num <= 11.0,education == Assoc-voc,1.0,0.0,Rule has no separation error.
0.999 < education-num <= 7.0,No explanation found,,,
13.0 < education-num <= 16.0,No explanation found,,,
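The difference between the two binning strategies can be illustrated with pandas' own `qcut` (quantile-based) and `cut` (equal-width) functions. Whether pd_explain uses these exact functions internally is an assumption; the values below are invented to make the contrast visible.

```python
import pandas as pd

# Skewed toy values: nine small numbers and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quantile binning: each bin holds roughly the same number of values.
quantile_bins = pd.qcut(values, q=2)
# Uniform binning: each bin spans the same width of the value range.
uniform_bins = pd.cut(values, bins=2)

quantile_counts = quantile_bins.value_counts().sort_index().tolist()
uniform_counts = uniform_bins.value_counts().sort_index().tolist()
print(quantile_counts, uniform_counts)
```

Quantile bins stay balanced on skewed data, while uniform bins can leave most values in a single bin.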


### Sampling:

By default, the many to one explainer uses uniform random sampling to speed up explanation generation.\
While the impact on explanation quality should, in most cases, be negligible, it is possible to turn off sampling by setting the `use_sampling` parameter to False.\
You can also control the sample size using the `sample_size` parameter, which is set to 5000 by default.\
This parameter can either be an integer, in which case it is the exact sample size, or a float between 0 and 1, in which case it is the fraction of the dataset to sample.\
Please note that we do not allow the sample size to be below the default value of 5000, as this can lead to poor explanation quality for a very small gain in performance - you can only increase the sample size.
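The rules above can be summarized in a small helper. This is a sketch of the described behavior only: `resolve_sample_size` is a hypothetical function, and capping the result at the dataset size is an added assumption.

```python
def resolve_sample_size(n_rows: int, sample_size=5000, minimum: int = 5000) -> int:
    """Turn an int or fractional `sample_size` into a row count.

    Assumptions: a float strictly between 0 and 1 is a fraction of the dataset,
    values below `minimum` are raised to `minimum`, and the result never
    exceeds the number of rows available.
    """
    if isinstance(sample_size, float) and 0 < sample_size < 1:
        requested = int(n_rows * sample_size)
    else:
        requested = int(sample_size)
    return min(n_rows, max(requested, minimum))


print(resolve_sample_size(40000, 0.5))   # fraction of the dataset
print(resolve_sample_size(40000, 1000))  # below the minimum, raised to it
print(resolve_sample_size(3000))         # dataset smaller than the minimum
```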

In [27]:
adults.explain(explainer='many_to_one', labels='label', use_sampling=False)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
<=50K,1 <= education-num <= 10,0.76,0.15,100.00% from group >50K
<=50K,0.0 <= capital-gain <= 16666.5,1.0,0.23,100.00% from group >50K
<=50K,0 <= capital-gain <= 4668.5 AND 1 <= education-num <= 10,0.75,0.13,100.00% from group >50K
<=50K,0 <= capital-gain <= 4668.5,0.99,0.2,100.00% from group >50K
<=50K,0.0 <= capital-gain <= 19999.8,1.0,0.23,100.00% from group >50K
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2,100.00% from group >50K
<=50K,0.0 <= capital-gain <= 24999.75,1.0,0.23,100.00% from group >50K
<=50K,0 <= capital-gain <= 4243.5 AND 1 <= education-num <= 10,0.75,0.13,100.00% from group >50K
>50K,No explanation found,,,


### Explaining the separation error

The many to one explainer, by default, also provides an explanation for the separation error.\
This explanation tells you how much of the separation error originates from each of the other groups, by percentage.\
This can be disabled by setting the `explain_errors` parameter to False.

In [28]:
adults.explain(explainer='many_to_one', labels='label', explain_errors=False)

Group / Cluster,Explanation,Coverage,Separation Error
<=50K,1 <= education-num <= 10,0.75,0.15
<=50K,0 <= capital-gain <= 5095.5,1.0,0.21
<=50K,0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10,0.75,0.13
<=50K,0 <= capital-gain <= 4243.5,0.99,0.2
>50K,No explanation found,,


This behavior is also controlled by the `error_explanation_threshold` parameter, which is set to 0.05 by default. This parameter defines the minimum share of the separation error that a group must contribute to be listed on its own. Groups that contribute less are aggregated with the other groups below the threshold and reported as 'x% of the error originates from y other groups'.

In [29]:
single_attribute_groupby.explain(explainer='many_to_one', explanation_form='disj', error_explanation_threshold=0.01)

Group / Cluster,Explanation,Coverage,Separation Error,Separation Error Origins
Husband,marital-status == Married-civ-spouse,1.0,0.11,"87.69% from group Wife, 6.92% from group Other..."
Not-in-family,No explanation found,,,
Other-relative,No explanation found,,,
Own-child,No explanation found,,,
Unmarried,No explanation found,,,
Wife,No explanation found,,,
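The thresholding of error origins can be sketched as follows. The helper `describe_error_origins`, the error counts, and the exact wording of the aggregated part are hypothetical; only the idea of listing large contributors and merging small ones comes from the description of `error_explanation_threshold`.

```python
def describe_error_origins(error_counts: dict, threshold: float = 0.05) -> str:
    """Summarize which groups the separation error comes from.

    `error_counts` maps a group name to the number of its rows wrongly matched
    by a rule. Groups contributing less than `threshold` are merged.
    """
    total = sum(error_counts.values())
    if total == 0:
        return "Rule has no separation error."
    parts = []
    small_count, small_groups = 0, 0
    # List the largest contributors first.
    for group, count in sorted(error_counts.items(), key=lambda kv: kv[1], reverse=True):
        if count / total >= threshold:
            parts.append(f"{100 * count / total:.2f}% from group {group}")
        else:
            small_count += count
            small_groups += 1
    if small_groups:
        parts.append(f"{100 * small_count / total:.2f}% from {small_groups} other groups")
    return ", ".join(parts)


# Hypothetical error counts per group.
counts = {"A": 80, "B": 15, "C": 3, "D": 2}
default_summary = describe_error_origins(counts)                  # threshold 0.05
detailed_summary = describe_error_origins(counts, threshold=0.01)
print(default_summary)
print(detailed_summary)
```

Lowering the threshold lists more groups individually, at the cost of a longer description.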
