
synthetic-data

Inspired by sklearn.datasets.make_classification, which in turn is based on work for the NIPS 2003 feature selection challenge [1] targeting linear classifiers. Here the focus is on generating more complex, nonlinear datasets appropriate for use with deep learning/black box models that 'need' nonlinearity - otherwise a simpler model would (and should) be used.

Approach

Ideally, the method provides a concise specification for generating tabular data with sensible defaults. The specification should expose knobs that the end user can dial up or down to see their downstream impact.

A copula is a model for specifying the joint probability p(x1, x2, ..., xn) given a correlation structure along with a specification of each feature's marginal distribution. The current implementation uses a multivariate normal distribution with a specified covariance matrix; future work can expand this choice to other multivariate distributions.
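
To make the copula idea concrete, here is a minimal sketch (using scipy directly, not this library's internals): correlated normals are pushed through the normal CDF to produce correlated uniforms (the probability integral transform), which are then mapped through arbitrary marginal inverse CDFs. The covariance values and marginal choices below are purely illustrative.

```python
# Gaussian copula sketch: correlated normals -> uniforms -> chosen marginals.
import numpy as np
from scipy import stats

cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])           # illustrative correlation structure
z = stats.multivariate_normal(mean=[0, 0], cov=cov).rvs(size=1000)
u = stats.norm.cdf(z)                  # probability integral transform:
                                       # uniform marginals, correlation kept

# Apply the desired marginal distributions via their inverse CDFs (ppf).
x1 = stats.expon.ppf(u[:, 0])          # exponential marginal
x2 = stats.beta(2, 5).ppf(u[:, 1])     # beta(2, 5) marginal
X = np.column_stack([x1, x2])
```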

Parameters

| name | type | default | description |
|------|------|---------|-------------|
| `n_samples` | int | 100 | The number of samples. |
| `n_informative` | int | 2 | The number of informative features; all of these should appear in the symbolic expression used to generate `y_reg`. |
| `n_nuisance` | int | 0 | The number of nuisance features; these are excluded from the symbolic expression and hence play no role in the DGP. |
| `n_classes` | int | 2 | The number of classes. |
| `dist` | list | | A list of the marginal distributions to apply to the features/columns. |
| `cov` | matrix | | A square numpy array with dimensions (n_total x n_total), where n_total = n_informative + n_nuisance. |
| `expr` | str | | An expression providing y = f(X). |
| `sig_k` | float | 1.0 | The steepness of the sigmoid used in mapping `y_reg` to `y_prob`. |
| `sig_x0` | float | None | The center point of the sigmoid used in mapping `y_reg` to `y_prob`. |
| `p_thresh` | float | 0.5 | The threshold probability that determines the boundary between classes. |
| `noise_level_x` | float | 0.0 | The level of Gaussian white noise applied to X. |
| `noise_level_y` | float | 0.0 | The level of Gaussian white noise applied to `y_label` (analogous to sklearn's `flip_y`). |
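
Putting these knobs together, a call might look like the sketch below. The entry point `make_tabular_data`, its import path, and its return values are assumptions based on the parameter table above; consult the package source for the exact signature. The sigmoid referenced by `sig_k` and `sig_x0` maps the regression target to a probability as y_prob = 1 / (1 + exp(-sig_k * (y_reg - sig_x0))).

```python
# Hypothetical invocation; the function name, argument names, and
# return values are assumptions - check the package for the real API.
import numpy as np
from synthetic_data import make_tabular_data  # assumed import path

n_informative, n_nuisance = 2, 1
n_total = n_informative + n_nuisance

# (n_total x n_total) covariance: two correlated informative features
# plus one independent nuisance feature.
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

X, y_reg, y_prob, y_label = make_tabular_data(
    n_samples=1000,
    n_informative=n_informative,
    n_nuisance=n_nuisance,
    cov=cov,
    expr="x1 + 2*x2**2",  # symbolic DGP y = f(X); nuisance feature absent
    sig_k=1.0,            # y_prob = 1 / (1 + exp(-sig_k * (y_reg - sig_x0)))
    p_thresh=0.5,         # y_label = 1 where y_prob exceeds p_thresh
)
```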

Getting Started

Local Installation

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Tests

Test/Lint Dependencies

pip install -r requirements-test.txt

To run tests:

make test_local

Pre-Commit

To install pre-commit hooks, run the following commands:

pre-commit install
pre-commit run

Referencing this library

If you use this library in your work, please cite our paper:

@inproceedings{barr:2020,
  author    = {Brian Barr and Ke Xu and Claudio Silva and Enrico Bertini and Robert Reilly and C. Bayan Bruss and Jason D. Wittenbach},
  title     = {{Towards Ground Truth Explainability on Tabular Data}},
  year      = {2020},
  maintitle = {International Conference on Machine Learning},
  booktitle = {2020 ICML Workshop on Human Interpretability in Machine Learning (WHI 2020)},
  date      = {2020-07-17},
  pages     = {362--367},
}

Notes

If you have tabular data and want to fit a copula to it, consider the copulas Python library. A quick visual tutorial of copulas and the probability integral transform is also worth reviewing.
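
For example, the `copulas` package can fit a Gaussian copula to a DataFrame and resample from it in a few lines. This sketch assumes the `GaussianMultivariate` API from that package's documentation, and the input file name is hypothetical.

```python
# Fitting a copula to existing tabular data with the `copulas` package
# (pip install copulas); API per that package's documentation.
import pandas as pd
from copulas.multivariate import GaussianMultivariate

real_data = pd.read_csv("my_table.csv")   # hypothetical input file

model = GaussianMultivariate()
model.fit(real_data)                      # learn marginals + correlation
synthetic = model.sample(len(real_data))  # draw new rows, same structure
```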

To run the examples, first install their dependencies:

$ python -m pip install pandas pytest pytest-cov seaborn shap tensorflow "DataProfiler[full]"

References

[1] I. Guyon, "Design of experiments for the NIPS 2003 variable selection benchmark", 2003.
