SimpsonsParadox: Automatic Simpson's Paradox Detector

Description

This function automatically detects Simpson’s Pairs in a dataset using regression. Simpson's Pairs are pairs of independent variables where one of those variables reverses relationship with the dependent variable when conditioned on the second independent variable.

This function applies the following data pre-processing steps:

Excluding user-defined variables from the analysis
Encoding any non-numeric variables in the data set
Standardizing any continuous variables in the data
Binning large conditioning variables in the data

This function outputs the following Simpson's Pairs details:

The pair of variables with Simpson's Paradox (i.e. independent and conditioning independent variables)
Some of the summary statistics from the model generated for the aggregated data and disaggregated data
A simple plot of the regression lines for the aggregated data and each subgroup of the disaggregated data

You can find more details about pre-processing, outputs, and terminology in the Wiki.

You can also read about Simpson's Paradox in a corresponding article on KDnuggets here.

Usage

You can call this function from a Jupyter notebook by importing the simpsons_paradox module.

Download and unzip SimpsonsParadox-master.zip or use Git Bash to clone the repository.
Open Anaconda Prompt, navigate to this project's directory, and run the following commands:
- conda env create -f environment.yml
- conda activate simpsons-paradox
- jupyter lab or jupyter notebook
Load your own data, or use the examples provided in simpsons_paradox.ipynb
Modify the function parameters depending on your use case. Here is an example:

params = {
    "df": df, # The pandas dataframe with the desired data set
    "dv": 'class', # The dependent variable to run analysis for
    "ignore_columns": ['user_id'], # A list of columns to ignore
    "bin_columns": ['timestamp'], # A list of the columns to bin
    "output_plots": True # Displays plots and summary statistics
}

sp = SimpsonsParadox(**params)
sp.get_simpsons_pairs()

This outputs a list of Simpson's Pairs and displays plots and summary statistics for each Simpson's pair.

It's possible to modify other parameters, which all have defaults outlined in this Wiki.

Here are some of the parameters:

self.model = 'logistic'
self.weighting = True # Excludes weak cases of Simpson's Paradox
self.max_pvalue = 0.05 # Filters out pairs with large p-values
self.min_corr = 0.01 # Filters out pairs with no correlation
self.quiet = True # Silences all warnings and verbosity

Example

In some cases, the function will run into issues if the appropriate arguments aren't passed.

For example, if you pass these arguments on this sample data set from the notebook:

import pandas as pd
from simpsons_paradox import SimpsonsParadox
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True).frame
sp = SimpsonsParadox(df=iris, dv='target')
sp.get_simpsons_pairs()

You'll get the following error:

ValueError: You have a non-binary DV. Pass a value to the target_category in the function or re-bin your DV prior to using the function.

To fix this issue, add an additional argument target_category to set a target category for one-versus-all regression:

sp = SimpsonsParadox(df=iris, dv='target', target_category=1)
sp.get_simpsons_pairs()

Now the function should be able to run successfully.

Command-line usage

You can also call this function from the command line.

Download and unzip SimpsonsParadox-master.zip or use Git Bash to clone the repository.
Open a command line and navigate to the directory of this project: cd SimpsonsParadox
See the Wiki for sample commands, argument descriptions and their defaults.

References

Some existing tools and resources that we reference in this project:

Simpsons R Package: https://rdrr.io/cran/Simpsons/man/Simpsons.html
Can you Trust the Trend: Discovering Simpson's Paradoxes in Social Data: https://arxiv.org/abs/1801.04385

Acknowledgements

This function was created as part of a summer internship project at Altair Engineering.

Please let us know if you have any feedback and/or suggestions by starting an issue or reaching out.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
data		data
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
main.py		main.py
simpsons_paradox.ipynb		simpsons_paradox.ipynb
simpsons_paradox.py		simpsons_paradox.py
tests.py		tests.py
utilities.py		utilities.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SimpsonsParadox: Automatic Simpson's Paradox Detector

Description

Usage

Example

Command-line usage

References

Acknowledgements

About

Releases 2

Packages

Contributors 2

Languages

License

ehart-altair/SimpsonsParadox

Folders and files

Latest commit

History

Repository files navigation

SimpsonsParadox: Automatic Simpson's Paradox Detector

Description

Usage

Example

Command-line usage

References

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages