multiple sensitive features - postprocessing (#288)
* take changes from other branch that touches all modules

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* get all tests working again

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Dashboard for Census Notebook (#171)

Update the existing Census notebook for grid search to use the new dashboard. The bulk of the notebook is unchanged (including the fictional motivating scenario).

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Stop installing old dashboard (#176)

Have moved notebooks off the old dashboard. Remove dependency from pipelines

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Update ReadMe with Yarn instructions (#177)

Now that the dashboard tarball is no longer checked in, provide instructions on creating it in a cloned repo

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Standardise input convertors for test (#178)

Create and use a standard set of convertors for use with our 'argument type' tests. This has required adding several `__init__.py` files to the `test` directory to enable `pytest` to find the common code.

Also add an 'argument type' test to `ExponentiatedGradient`

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Small fixes to get the documentation appearing (#179)

Fix issues in getting documentation to appear in Sphinx.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* law school notebook (#169)

* law school notebook

Signed-off-by: Miro Dudik <mdudik@gmail.com>

* Remove hypens from filename
Some copy edit fixes
Correct suspected bug in ExponentiatedGradient section

Signed-off-by: Richard Edgar <riedgar@microsoft.com>

* Didn't quite undo all my temporary changes

Signed-off-by: Richard Edgar <riedgar@microsoft.com>

* address some of the comments

Signed-off-by: Miro Dudik <mdudik@gmail.com>

* Fix typo in name

Signed-off-by: Richard Edgar <riedgar@microsoft.com>

* Improve spacing and add a comment in expgrad section

Signed-off-by: Richard Edgar <riedgar@microsoft.com>

* add AUC explanation

Signed-off-by: Miro Dudik <mdudik@gmail.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Code cleanups (#181)

Fix some minor things:
- Make the dashboard use the same copyright notice as the rest of the code
- Some renaming of `expgrad` to `ExponentiatedGradient`

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Build the widget (#185)

Add a job template which builds the widget to the PR-Gate, Nightly and Nightly-Fixed builds. Note that this does not run any tests, but just ensures that the widget builds successfully

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* update logging to use FileHandler instead of basicConfig (#175)

Signed-off-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Enable ReadTheDocs (#182)

Change how the documentation is done slightly, so that our documentation can show up on ReadTheDocs. Some additional copy-editing of the in-code documentation has been done as a result of this.

The docs should appear at:
https://fairlearn.readthedocs.io/en/latest/

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Pin scikit-learn (#189)

The recent update to scikit-learn is causing a break in one of the Notebooks. Until this is debugged, pin the version

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Add more flake8 checks (#187)

Add a number of extra flake8 checks:
- flake8-blind-except
- flake8-builtins
- flake8-docstrings
- flake8-logging-format
- flake8-rst-docstrings

Since these create a huge number of issues, suppress a lot of these for now in `setup.cfg` (plus a handful of special cases done inline). Put in fixes for the simpler complaints, such as:
- Separate summaries in docstrings
- Spacing within and around docstrings
- Deferring string interpolation in `logging` calls

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Rename files and update license and docs (#183)

* rename  files

* update comment

* update license

* address comments

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Fix for Law School Notebook (#191)

Tweak the Law School notebook so that it works with the latest `scikit-learn`

This enables us to unpin the version of `scikit-learn` in our `requirements.txt` file

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Markdown updates based on doc bash (#186)

* address feedback from doc bash

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* latex updates

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* latex update

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* latex update

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo latex changes

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove commas

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* rephrasing postprocessing constructor requirements

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* feedback from Miro

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Reorganise documentation (#192)

Reorganising how the documentation is presented, since the default style from `sphinx-apidoc` assumed we had lots of individual modules rather than larger packages

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Declare 0.4.0 release (#193)

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Workaround Python 3.5 issue with Linux (#194)

An issue with the `pip` install of `shap` has appeared on the Linux agents under Python 3.5. Reasons are currently obscure, but this is blocking a release. Since Python 3.5 continues to work on Windows, rely on that (pending further debugging)

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix classification bug (#201)

* Update readme for v0.4.0 (#196)

Signed-off-by: Richard Edgar <riedgar@microsoft.com>

* Pin troublesome package (#198)

During our release process, a new version of `colorama` (required by one of our dependencies) was released. This has issues with the Windows/3.7 build.

Unblock the release by pinning the version

Signed-off-by: Richard Edgar <riedgar@microsoft.com>

* fix classification bug

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* version change to address security bug (#203)

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Fix accidental merge (#205)

Some portions of the v0.4.0 release branch were accidentally merged into master
- Making the ReadMe version suitable for PyPI
- Pinning the `colorama` version to unblock the release train

This changeset undoes these fixes in master

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* ReadMe Processor for Releases (#206)

Create a python script to translate `ReadMe.md` from GitHub to PyPI. This will avoid the need to create a branch to do a release.

This script is slightly dependent on the structure of the file, so if there are substantial changes to that, this script will require updating. It also assumes that a tag `v(fairlearn.__version__)` exists in the repo.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Track pip dependencies (#208)

We've had trouble with our dependencies updating and breaking our builds

Augment the build pipelines so that they publish the output of `pip freeze` to an artifact. This will aid debugging these issues. The name of both the artifact itself and the file therein can be specified.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Remove some flake8 global suppressions (#209)

After adding more `flake8` analysers, we were obliged to put in some global suppressions to keep the number of issues manageable. Start the process of removing these with D102, D103 and D401. Some of these just move the suppression to file-level, while others tweak documentation blocks to suit.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Re-enable Linux 3.5 (#210)

Roman figured out a workaround for getting `shap` installed with Linux and Python 3.5. Put this into `fairlearn`

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Expand Notebook testing (#212)

Increase the variety of platforms used for testing our Jupyter Notebooks. Unable to test on MacOS at present, due to some problem installing `lightgbm`.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Improvements for pinning requirements (#213)

A better way of running our tests with pinned requirements. Rather than have a separate `requirements-fixed.txt` file, have a script to turn the `requirements.txt` file into the former. Update builds accordingly.
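For illustration, a minimal sketch of what such a pinning script could look like (the file names and helper name are hypothetical, not necessarily what the repo uses):

```python
# Hypothetical sketch of a requirements-pinning helper; the actual script may differ.
import re
from importlib.metadata import version  # Python 3.8+; older interpreters can use pkg_resources


def pin_requirements(src="requirements.txt", dst="requirements-fixed.txt"):
    """Rewrite each loose requirement in src to the exact version currently installed."""
    pinned = []
    with open(src) as f:
        for line in f:
            name = re.split(r"[\[<>=!~;\s]", line.strip(), maxsplit=1)[0]
            if not name or name.startswith("#"):
                pinned.append(line.rstrip())  # keep blank lines and comments as-is
                continue
            pinned.append("{}=={}".format(name, version(name)))
    with open(dst, "w") as f:
        f.write("\n".join(pinned) + "\n")
```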

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Standardise ML argument documentation (#214)

Make our documentation of fit(), X, predict(), etc. more consistent across our various submodules.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* perf test through Azure ML (#180)

* perf test first version through Azure ML

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* move some code to tempeh

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add missing files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* perf tests that get auth details through Azure Keyvault

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* upgrade to alpha tempeh version

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* exclude D100 and D103 for script generation python file

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* move variables into nightly-perf.yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* azureml sdk requirement for perf test

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove powershell syntax

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add cwd for tests

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print cwd

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove special working directory condition

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix directory handling based on ADO

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* tempeh bump to a2

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print message for debugging

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* try upper case variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove extraneous dash

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print env var names

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* try explicitly adding variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use variables directly, tempeh bump

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add hardcoded data as variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use windows instead of linux because some of the UI packages aren't available in linux

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add wheel dependency

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fake dashboard files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* pass parameters for perf tests via args

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* yaml fix

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* yaml fix

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove waiting for run to complete

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* refactor to submit all jobs without waiting for result

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove obsolete gitignore line

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* tempeh bump to 0.1.11

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* azureml-sdk warning

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* pipeline improvements to use keyvault tasks

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* logically separate script generation into steps

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* simplify writing long string of = signs

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* replace incorrect variables in yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use importerror instead of modulenotfounderror for py3.5

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add PR trigger for changes to test/perf directory

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove output from notebook

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* quotes for yaml variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* correct parameter in yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Documentation and flake8 (#215)

Various updates for the documentation:
- Remove another `flake8` global suppression
- Add explanations for remaining `flake8` suppressions
- Replace the `:any:` references in the documentation with appropriate ones
- Make some file-level suppressions (which may have actually turned `flake8` off entirely on the file) specific to the appropriate lines

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Update Release pipeline after KV move (#216)

The KeyVault containing the PyPI secrets has been moved to a more appropriate subscription. As a result, the Release pipeline needs to be updated with the correct service connection

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Basic unit tests for EqualizedOdds and DemographicParity moments (#217)

Some very basic unit tests for the `EqualizedOdds` and `DemographicParity` moment classes. These are 'pinning' tests two establish the behaviour of these classes. The `gamma` method is not yet included in these tests, since that requires a trained model.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Add more time metrics to performance tests (#219)

* perf test first version through Azure ML

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* move some code to tempeh

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add missing files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* perf tests that get auth details through Azure Keyvault

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* upgrade to alpha tempeh version

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* exclude D100 and D103 for script generation python file

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* move variables into nightly-perf.yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* azureml sdk requirement for perf test

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove powershell syntax

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add cwd for tests

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print cwd

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove special working directory condition

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix directory handling based on ADO

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* tempeh bump to a2

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print message for debugging

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* try upper case variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove extraneous dash

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print env var names

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* try explicitly adding variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use variables directly, tempeh bump

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add hardcoded data as variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use windows instead of linux because some of the UI packages aren't available in linux

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add wheel dependency

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fake dashboard files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* pass parameters for perf tests via args

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* yaml fix

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* yaml fix

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove waiting for run to complete

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* refactor to submit all jobs without waiting for result

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove obsolete gitignore line

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* tempeh bump to 0.1.11

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* azureml-sdk warning

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* pipeline improvements to use keyvault tasks

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* logically separate script generation into steps

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* simplify writing long string of = signs

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* replace incorrect variables in yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use importerror instead of modulenotfounderror for py3.5

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add PR trigger for changes to test/perf directory

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove output from notebook

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* quotes for yaml variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* correct parameter in yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add additional time-based metrics

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* adjustments to fix syntax errors and logical issues in the calculation of metrics

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add oracle calls

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* custom metrics for execution times: min, max, mean

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo sphinx special docs for test/perf

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* update description of oracle execution time properties

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Fix perf tests by logging lists through log_list instead of log (#221)

* perf test first version through Azure ML

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* move some code to tempeh

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add missing files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* perf tests that get auth details through Azure Keyvault

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* upgrade to alpha tempeh version

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* exclude D100 and D103 for script generation python file

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* move variables into nightly-perf.yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* azureml sdk requirement for perf test

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove powershell syntax

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add cwd for tests

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print cwd

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove special working directory condition

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix directory handling based on ADO

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* tempeh bump to a2

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print message for debugging

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* try upper case variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove extraneous dash

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* print env var names

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* try explicitly adding variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use variables directly, tempeh bump

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add hardcoded data as variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use windows instead of linux because some of the UI packages aren't available in linux

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add wheel dependency

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fake dashboard files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* pass parameters for perf tests via args

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* yaml fix

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* yaml fix

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove waiting for run to complete

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* refactor to submit all jobs without waiting for result

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove obsolete gitignore line

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* tempeh bump to 0.1.11

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* azureml-sdk warning

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* pipeline improvements to use keyvault tasks

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* logically separate script generation into steps

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* simplify writing long string of = signs

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* replace incorrect variables in yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* use importerror instead of modulenotfounderror for py3.5

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add PR trigger for changes to test/perf directory

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove output from notebook

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* quotes for yaml variables

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* correct parameter in yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add additional time-based metrics

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* adjustments to fix syntax errors and logical issues in the calculation of metrics

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add oracle calls

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* custom metrics for execution times: min, max, mean

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo sphinx special docs for test/perf

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* update description of oracle execution time properties

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* bug fix for list logging

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* make metric logging a lot more readable and provide additional metrics to show the overhead fairlearn adds (#228)

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Convert Notebook tests to papermill (#223)

Rather than using `nbval`, convert our notebook tests to use `papermill`. With the help of `nteract-scrapbook` we can then examine the contents of particular variables from the notebooks to ensure that we're getting the expected results.

Explicit `scrapbook` commands are required to save out values for future examination, but we don't want to include these when our users look at the notebooks. Accordingly, we include machinery for adding the necessary cells to the notebooks dynamically.
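As a rough illustration of the approach (the notebook path and scrap name below are made up, not the ones used in the build), a test can execute a notebook with `papermill` and then read recorded values back with `scrapbook`:

```python
# Illustrative only: the notebook path and scrap name are hypothetical.
import papermill as pm
import scrapbook as sb

# Execute the notebook; cells injected beforehand call sb.glue("accuracy", value) to record results.
pm.execute_notebook("example-notebook.ipynb", "example-notebook.output.ipynb")

# Read the recorded scraps from the executed notebook and check the expected results.
nb = sb.read_notebook("example-notebook.output.ipynb")
assert nb.scraps["accuracy"].data > 0.8
```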

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Bump dashboard npm package to match source code (#229)

* publish latest version

* update docs for push

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Remove unused ReST files (#233)

Two of the ReST files generated by sphinx-autodoc weren't actually used. Remove them to get rid of a warning.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Basic Moments documentation (#241)

Add some basic documentation of the `Moment` class and its subclasses.

Also:
- Turn the `n` field of the `Moment` object into a `total_samples` property
- Add `intersphinx` hook for `pandas` documentation

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* [WIP] create extensions to install custom plots separately & check in generated files (#240)

* check in generated javascript files and split into package with extensions

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add installation tests, move yml files to templates directory if appropriate, delete unused and broken yml file

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* separate directories per package, composition with minimal fairlearn package

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* script updates to get doc and wheel builds in shape

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* update yml files and scripts to enable wheel upload per package

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* address feedback from PR by adding documentation to the pipeline definition yml files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove "templates/" as location prefix for files in the templates directory itself

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* first version of widget build validation script

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* corrections in widget build validation

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo adjustments to completely split up packages

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* reverse code coverage build changes

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix yml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* ignore install tests when necessary

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add macos python 3.5

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add exceptions module back to documentation

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add name for job

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix characters in job name

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix job name

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add image label

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* correct installation path

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo other changes to rst file

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Add logging variant of numpy.all_close (#246)

The `numpy` package provides an `allclose` routine for comparing two arrays. Unfortunately, there's no mechanism for showing which elements failed the comparison. Put together a wrapper based on `numpy.isclose` which will print out information about failed comparisons.
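A minimal sketch of such a wrapper (the helper added here may differ in name and details), based on `numpy.isclose` with deferred string interpolation in the `logging` calls:

```python
# Illustrative sketch; the actual helper may differ in name and details.
import logging

import numpy as np

logger = logging.getLogger(__name__)


def assert_allclose_with_logging(actual, expected, rtol=1e-5, atol=1e-8):
    """Like numpy.allclose, but log every element that fails the comparison."""
    actual = np.asarray(actual)
    expected = np.asarray(expected)
    close = np.isclose(actual, expected, rtol=rtol, atol=atol)
    if not close.all():
        for idx in np.argwhere(~close):
            i = tuple(idx)
            logger.error("Mismatch at %s: %s != %s", i, actual[i], expected[i])
    return bool(close.all())
```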

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Implement GroupMetricSet (#250)

Create a `GroupMetricSet` class for holding collections of grouped metrics. This is to help with AzureML integration.

There has been some (possibly unnecessary) reorganisation of things under `fairlearn/metrics` but the public interface is unchanged.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Exclude install tests in code coverage check (#251)

* ignore install tests since they'll unexpectedly work

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add python -m before pip install

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* upgrade tempeh to v0.1.12

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Replace powershell scripts with python and add Makefile (#249)

* check in generated javascript files and split into package with extensions

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add installation tests, move yml files to templates directory if appropriate, delete unused and broken yml file

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* separate directories per package, composition with minimal fairlearn package

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* script updates to get doc and wheel builds in shape

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* update yml files and scripts to enable wheel upload per package

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* address feedback from PR by adding documentation to the pipeline definition yml files

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove "templates/" as location prefix for files in the templates directory itself

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* first version of widget build validation script

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* corrections in widget build validation

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo adjustments to completely split up packages

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* reverse code coverage build changes

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix yml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* ignore install tests when necessary

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add macos python 3.5

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add exceptions module back to documentation

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add name for job

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix characters in job name

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix job name

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add image label

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* correct installation path

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo other changes to rst file

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* rewrite scripts in python

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* replace widget build script with python script

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* build_widget adjustments to make it work

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* build_widget finalization plus add ls commands to find yarn installation in ADO

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* some more paths to check

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* task -> script

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* usr/bin/yarn check

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* workingDirectory adjustment

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add ls

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* adjustment for fairlearn root dir check

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add ./

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix template

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* comment about set-variable-from-file script only being required in ADO

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add makefile, update contributing guide, and replace remaining ps1 occurrences in pipeline ymls

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add romanlutz to codeowners for scripts dir

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix comment

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* make process_readme a standalone script again

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* delete build_docs

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* makefile adjustments according to feedback

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Fix pypi release yaml (#260)

* undo erroneous changes to yaml

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo prior erroneous change

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* replace job template usage with just a step

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add CHANGES.md for v0.4.2 (#262)

* add CHANGES.md for v0.4.2

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add general instructions to always do that

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Update CHANGES.md

Adding `GroupMetricSet` to the changelog

Signed-off-by: Richard Edgar <riedgar@microsoft.com>

* comment out test that fails consistently only on windows

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix readme processing script by adding fairlearn dir to sys path, add second solution for issue 265

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* fix syntax error, flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* bump version to 0.4.2

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* test with list of lists instead of single list

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

Co-authored-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Update metric keys to match dashboard (#268)

The dashboard already had its own keys defined for mapping metric functions to strings. Update the `GroupMetricSet` to use the same keys.

Figuring out how to unify the two implementations of this mapping is left as issue #269.

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Fix release blockers - widget generated files, widget validation (#267)

* add built widget file updates & fix widget build validation, as well as pypi release template for empty DEV_VERSION

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* undo DEV_VERSION change

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* add comment and link to issue

Signed-off-by: Roman Lutz <rolutz@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* remove --assert-no-changes flag in release as well (#272)

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Update default metrics in GroupMetricSet (#271)

Tweak the list of metrics computed by default by the `compute` method of `GroupMetricSet` to match those expected by the dashboard

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* set env var before installing fairlearn to correct version file name content (#274)

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* MNT use sklearn's NotFittedError instead of NotFittedException (#259)

* MNT use sklearn's NotFittedError instead of NotFittedException

Signed-off-by: adrinjalali <adrin.jalali@gmail.com>

* add to the changelog

Signed-off-by: adrinjalali <adrin.jalali@gmail.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Updates for GroupMetricResult and GroupMetricSet (#279)

Add (in)equality operators to `GroupMetricResult` and `GroupMetricSet`, along with basic tests. These will simplify other testing in future.

Change `GroupMetricSet` so that the `groups` have to be specified as sequential integers from zero. If this is not the case, the `compute()` method will remap the supplied groups to `[0, 1, 2, ...]` and put the stringified original values into the `group_names` property. Since the keys are now sequential integers, convert the `group_names` property itself from a dictionary into a list.
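A sketch of the remapping behaviour described above (illustrative, not the actual implementation):

```python
# Illustrative sketch of the group remapping; not the actual GroupMetricSet code.
def remap_groups(groups):
    """Map arbitrary group labels to sequential integers, keeping stringified originals."""
    unique = sorted(set(groups), key=str)
    group_names = [str(g) for g in unique]      # e.g. ['a', 'b']
    index_of = {g: i for i, g in enumerate(unique)}
    remapped = [index_of[g] for g in groups]    # e.g. [0, 1, 0]
    return remapped, group_names


remapped, names = remap_groups(['a', 'b', 'a'])
assert remapped == [0, 1, 0]
assert names == ['a', 'b']
```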

Closes #275

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* DOC contributing: trim lines and add notes on signoff (#276)

* DOC contributing: trim lines and add notes on signoff

Signed-off-by: adrinjalali <adrin.jalali@gmail.com>

* hook

Signed-off-by: adrinjalali <adrin.jalali@gmail.com>

* modify note to point to the right answer

Signed-off-by: adrinjalali <adrin.jalali@gmail.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Further metric changes (#281)

A number of extra changes to metrics:

- `GroupMetricResult` now dynamically calculates `maximum`, `range` etc.
- `GroupMetricSet` has a consistency check
- `GroupMetricSet` can transform itself to and from a dictionary matching the schema used by the dashboard

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Preparations for v0.4.3 Release (#284)

Bump version and update Markdown files

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* Use kwargs in metrics (#286)

Change `metric_by_group` and `make_group_metric` to understand `**kwargs`. This removes the need for lots of small wrapper functions
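The idea, sketched with illustrative signatures rather than the actual fairlearn ones: forwarding `**kwargs` to the underlying metric lets callers pass extra arguments (such as `pos_label`) straight through, so no dedicated wrapper per metric is needed:

```python
# Illustrative sketch only; the real metric_by_group/make_group_metric signatures may differ.
import numpy as np
from sklearn.metrics import recall_score


def metric_by_group(metric_function, y_true, y_pred, group_membership, **kwargs):
    """Evaluate metric_function on each group, forwarding extra keyword arguments."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, group_membership))
    return {
        g: metric_function(y_true[groups == g], y_pred[groups == g], **kwargs)
        for g in np.unique(groups)
    }


# No per-metric wrapper needed: pos_label is simply forwarded to recall_score.
by_group = metric_by_group(recall_score, [0, 1, 1, 0], [0, 1, 0, 0],
                           ['a', 'a', 'b', 'b'], pos_label=1)
```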

Signed-off-by: Richard Edgar <riedgar@microsoft.com>
Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* take changes from other branch that touches all modules

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* get all tests working again

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* squeeze instead of reshape, deselect instead of skip in pytest, utility function for compression

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

* flake8

Signed-off-by: Roman Lutz <rolutz@microsoft.com>

Co-authored-by: Richard Edgar <riedgar@microsoft.com>
Co-authored-by: MiroDudik <mdudik@gmail.com>
Co-authored-by: Ilya Matiach <ilmat@microsoft.com>
Co-authored-by: Brandon Horn <rihorn@microsoft.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
6 people committed Feb 10, 2020
1 parent 2ffe87c commit 7253eb8
Showing 7 changed files with 531 additions and 265 deletions.
100 changes: 98 additions & 2 deletions fairlearn/_input_validation.py
@@ -3,18 +3,113 @@

import numpy as np
import pandas as pd
from sklearn.utils.validation import check_X_y, check_consistent_length, check_array


_KW_SENSITIVE_FEATURES = "sensitive_features"

_MESSAGE_X_NONE = "Must supply X"
_MESSAGE_Y_NONE = "Must supply y"
_MESSAGE_SENSITIVE_FEATURES_NONE = "Must specify {0} (for now)".format(_KW_SENSITIVE_FEATURES)
_MESSAGE_X_Y_ROWS = "X and y must have same number of rows"
_MESSAGE_X_SENSITIVE_ROWS = "X and the sensitive features must have same number of rows"
_INPUT_DATA_FORMAT_ERROR_MESSAGE = "The only allowed input data formats for {} are: {}. " \
"Your provided data was of type {}."
_EMPTY_INPUT_ERROR_MESSAGE = "At least one of sensitive_features, labels, or scores are empty."
_SENSITIVE_FEATURES_NON_BINARY_ERROR_MESSAGE = "Sensitive features contain more than two unique" \
" values"
_LABELS_NOT_0_1_ERROR_MESSAGE = "Supplied y labels are not 0 or 1"
_MORE_THAN_ONE_COLUMN_ERROR_MESSAGE = "{} is a {} with more than one column"
_NOT_ALLOWED_TYPE_ERROR_MESSAGE = "{} is not an ndarray, Series or DataFrame"
_NDARRAY_NOT_TWO_DIMENSIONAL_ERROR_MESSAGE = "{} is an ndarray which is not 2D"
_NOT_ALLOWED_MATRIX_TYPE_ERROR_MESSAGE = "{} is not an ndarray or DataFrame"

_ALLOWED_INPUT_TYPES_X = [np.ndarray, pd.DataFrame]
_ALLOWED_INPUT_TYPES_SENSITIVE_FEATURES = [np.ndarray, pd.DataFrame, pd.Series, list]
_ALLOWED_INPUT_TYPES_Y = [np.ndarray, pd.DataFrame, pd.Series, list]

_SENSITIVE_FEATURE_COMPRESSION_SEPARATOR = ","


def _validate_and_reformat_input(X, y=None, expect_y=True, enforce_binary_sensitive_feature=False,
enforce_binary_labels=False, **kwargs):
"""Validate input data and return the data in an appropriate format.
:param X: The feature matrix
:type X: numpy.ndarray or pandas.DataFrame
:param y: The label vector
:type y: numpy.ndarray, pandas.DataFrame, pandas.Series, or list
:param expect_y: if True y needs to be provided, otherwise ignores the argument; default True
:type expect_y: bool
:param enforce_binary_sensitive_feature: if True raise exception if there are more than two
distinct values in the `sensitive_features` data from `kwargs`; default False
:type enforce_binary_sensitive_feature: bool
:param enforce_binary_labels: if True raise exception if there are more than two distinct
values in the `y` data; default False
:type enforce_binary_labels: bool
"""
if y is not None:
# calling check_X_y with a 2-dimensional y causes a warning, so ensure it is 1-dimensional
if isinstance(y, np.ndarray) and len(y.shape) == 2 and y.shape[1] == 1:
y = y.squeeze()
elif isinstance(y, pd.DataFrame) and y.shape[1] == 1:
y = y.to_numpy().squeeze()

X, y = check_X_y(X, y)
y = check_array(y, ensure_2d=False, dtype='numeric')
if enforce_binary_labels and not set(np.unique(y)).issubset(set([0, 1])):
raise ValueError(_LABELS_NOT_0_1_ERROR_MESSAGE)
elif expect_y:
raise ValueError(_MESSAGE_Y_NONE)
else:
X = check_array(X)

_KW_SENSITIVE_FEATURES = "sensitive_features"
sensitive_features = kwargs.get(_KW_SENSITIVE_FEATURES)
if sensitive_features is None:
raise ValueError(_MESSAGE_SENSITIVE_FEATURES_NONE)

check_consistent_length(X, sensitive_features)
sensitive_features = check_array(sensitive_features, ensure_2d=False, dtype=None)

# compress multiple sensitive features into a single column
if len(sensitive_features.shape) > 1 and sensitive_features.shape[1] > 1:
sensitive_features = \
_compress_multiple_sensitive_features_into_single_column(sensitive_features)

if enforce_binary_sensitive_feature:
if len(np.unique(sensitive_features)) > 2:
raise ValueError(_SENSITIVE_FEATURES_NON_BINARY_ERROR_MESSAGE)

return pd.DataFrame(X), pd.Series(y), pd.Series(sensitive_features.squeeze())


def _compress_multiple_sensitive_features_into_single_column(sensitive_features):
"""Compress multiple sensitive features into a single column.
The resulting mapping converts multiple dimensions into the Cartesian product of the
individual columns.
:param sensitive_features: multi-dimensional array of sensitive features
:type sensitive_features: `numpy.ndarray`
:return: one-dimensional array of mapped sensitive features
"""
if not isinstance(sensitive_features, np.ndarray):
raise ValueError("Received argument of type {} instead of expected numpy.ndarray"
.format(type(sensitive_features).__name__))
return np.apply_along_axis(
lambda row: _SENSITIVE_FEATURE_COMPRESSION_SEPARATOR.join(
[str(row[i])
.replace("\\", "\\\\") # escape backslash and separator
.replace(_SENSITIVE_FEATURE_COMPRESSION_SEPARATOR,
"\\" + _SENSITIVE_FEATURE_COMPRESSION_SEPARATOR)
for i in range(len(row))]),
axis=1,
arr=sensitive_features)


def _validate_and_reformat_reductions_input(X, y, enforce_binary_sensitive_feature=False,
**kwargs):
# TODO: remove this function once reductions use _validate_and_reformat_input from above
if X is None:
raise ValueError(_MESSAGE_X_NONE)

@@ -47,6 +142,7 @@ def _validate_and_reformat_reductions_input(X, y, enforce_binary_sensitive_featu


def _make_vector(formless, formless_name):
# TODO: remove this function once reductions use _validate_and_reformat_input from above
formed_vector = None
if isinstance(formless, list):
formed_vector = pd.Series(formless)
Expand Down Expand Up @@ -74,9 +170,9 @@ def _make_vector(formless, formless_name):


def _get_matrix_shape(formless, formless_name):
# TODO: remove this function once reductions use _validate_and_reformat_input from above
num_rows = -1
num_cols = -1

if isinstance(formless, pd.DataFrame):
num_cols = len(formless.columns)
num_rows = len(formless.index)
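For reference, a small usage sketch of the `_compress_multiple_sensitive_features_into_single_column` helper added above: each row of a multi-column sensitive feature matrix becomes a single comma-joined value (with backslashes and separators escaped), so the combined values act as the Cartesian product of the original columns.

```python
# Usage sketch based on the helper shown above; assumes the fairlearn source tree is importable.
import numpy as np

from fairlearn._input_validation import (
    _compress_multiple_sensitive_features_into_single_column)

sensitive_features = np.array([["female", "under 40"],
                               ["male", "over 40"],
                               ["female", "over 40"]])
compressed = _compress_multiple_sensitive_features_into_single_column(sensitive_features)
print(compressed)  # ['female,under 40' 'male,over 40' 'female,over 40']
```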
77 changes: 15 additions & 62 deletions fairlearn/postprocessing/_threshold_optimizer.py
@@ -16,18 +16,15 @@

from sklearn.exceptions import NotFittedError
from fairlearn.postprocessing import PostProcessing
from fairlearn._input_validation import _validate_and_reformat_input
from ._constants import (LABEL_KEY, SCORE_KEY, SENSITIVE_FEATURE_KEY, OUTPUT_SEPARATOR,
DEMOGRAPHIC_PARITY, EQUALIZED_ODDS)
from ._roc_curve_utilities import _interpolate_curve, _get_roc
from ._interpolated_prediction import InterpolatedPredictor

# various error messages
DIFFERENT_INPUT_LENGTH_ERROR_MESSAGE = "{} need to be of equal length."
EMPTY_INPUT_ERROR_MESSAGE = "At least one of sensitive_features, labels, or scores are empty."
NON_BINARY_LABELS_ERROR_MESSAGE = "Labels other than 0/1 were provided."
INPUT_DATA_FORMAT_ERROR_MESSAGE = "The only allowed input data formats are: " \
"list, numpy.ndarray, pandas.DataFrame, pandas.Series. " \
"Your provided data was of types ({}, {}, {})"
NOT_SUPPORTED_CONSTRAINTS_ERROR_MESSAGE = "Currently only {} and {} are supported " \
"constraints.".format(DEMOGRAPHIC_PARITY, EQUALIZED_ODDS)
PREDICT_BEFORE_FIT_ERROR_MESSAGE = "It is required to call 'fit' before 'predict'."
@@ -97,7 +94,8 @@ def fit(self, X, y, *, sensitive_features, **kwargs):
:type sensitive_features: currently 1D array as numpy.ndarray, list, pandas.DataFrame,
or pandas.Series
"""
self._validate_input_data(X, sensitive_features, y)
_, _, sensitive_feature_vector = _validate_and_reformat_input(
X, y, sensitive_features=sensitive_features, enforce_binary_labels=True)

# postprocessing can't handle 0/1 as floating point numbers, so this converts it to int
if type(y) in [np.ndarray, pd.DataFrame, pd.Series]:
@@ -125,7 +123,7 @@ def fit(self, X, y, *, sensitive_features, **kwargs):
raise ValueError(NOT_SUPPORTED_CONSTRAINTS_ERROR_MESSAGE)

self._post_processed_predictor_by_sensitive_feature = threshold_optimization_method(
sensitive_features, y, scores, self._grid_size, self._flip, self._plot)
sensitive_feature_vector, y, scores, self._grid_size, self._flip, self._plot)

def predict(self, X, *, sensitive_features, random_state=None):
"""Predict label for each sample in X while taking into account sensitive features.
@@ -144,12 +142,14 @@ def predict(self, X, *, sensitive_features, random_state=None):
random.seed(random_state)

self._validate_post_processed_predictor_is_fitted()
self._validate_input_data(X, sensitive_features)
_, _, sensitive_feature_vector = _validate_and_reformat_input(
X, y=None, sensitive_features=sensitive_features, expect_y=False,
enforce_binary_labels=True)
unconstrained_predictions = self._unconstrained_predictor.predict(X)

positive_probs = _vectorized_prediction(
self._post_processed_predictor_by_sensitive_feature,
sensitive_features,
sensitive_feature_vector,
unconstrained_predictions)
return (positive_probs >= np.random.rand(len(positive_probs))) * 1

@@ -167,41 +167,18 @@ def _pmf_predict(self, X, *, sensitive_features):
:rtype: numpy.ndarray
"""
self._validate_post_processed_predictor_is_fitted()
self._validate_input_data(X, sensitive_features)
_, _, sensitive_feature_vector = _validate_and_reformat_input(
X, y=None, sensitive_features=sensitive_features, expect_y=False,
enforce_binary_labels=True)
positive_probs = _vectorized_prediction(
self._post_processed_predictor_by_sensitive_feature, sensitive_features,
self._post_processed_predictor_by_sensitive_feature, sensitive_feature_vector,
self._unconstrained_predictor.predict(X))
return np.array([[1.0 - p, p] for p in positive_probs])

def _validate_post_processed_predictor_is_fitted(self):
if not self._post_processed_predictor_by_sensitive_feature:
raise NotFittedError(PREDICT_BEFORE_FIT_ERROR_MESSAGE)

def _validate_input_data(self, X, sensitive_features, y=None):
allowed_input_types = [list, np.ndarray, pd.DataFrame, pd.Series]
if type(X) not in allowed_input_types or \
type(sensitive_features) not in allowed_input_types or \
(y is not None and type(y) not in allowed_input_types):
raise TypeError(INPUT_DATA_FORMAT_ERROR_MESSAGE
.format(type(X).__name__,
type(y).__name__,
type(sensitive_features).__name__))

if len(X) == 0 or len(sensitive_features) == 0 or (y is not None and len(y) == 0):
raise ValueError(EMPTY_INPUT_ERROR_MESSAGE)

if y is None:
if len(X) != len(sensitive_features) or (y is not None and len(X) != len(y)):
raise ValueError(DIFFERENT_INPUT_LENGTH_ERROR_MESSAGE
.format("X and sensitive_features"))
else:
if len(X) != len(sensitive_features) or (y is not None and len(X) != len(y)):
raise ValueError(DIFFERENT_INPUT_LENGTH_ERROR_MESSAGE
.format("X, sensitive_features, and y"))

if set(np.unique(y)) > set([0, 1]):
raise ValueError(NON_BINARY_LABELS_ERROR_MESSAGE)


def _threshold_optimization_demographic_parity(sensitive_features, labels, scores, grid_size=1000,
flip=True, plot=False):
@@ -443,37 +420,13 @@ def _vectorized_prediction(function_dict, sensitive_features, scores):
:type scores: list, numpy.ndarray, pandas.DataFrame, or pandas.Series
"""
# handle type conversion to ndarray for other types
sensitive_features_vector = _convert_to_ndarray(
sensitive_features, MULTIPLE_DATA_COLUMNS_ERROR_MESSAGE.format("sensitive_features"))
scores_vector = _convert_to_ndarray(scores, SCORES_DATA_TOO_MANY_COLUMNS_ERROR_MESSAGE)
sensitive_features_vector = np.array(sensitive_features)
scores_vector = np.array(scores)

return sum([(sensitive_features_vector == a) * function_dict[a].predict(scores_vector)
for a in function_dict])


def _convert_to_ndarray(data, dataframe_multiple_columns_error_message):
"""Convert the input data from list, pandas.Series, or pandas.DataFrame to numpy.ndarray.
:param data: the data to be converted into a numpy.ndarray
:type data: numpy.ndarray, pandas.Series, pandas.DataFrame, or list
:param dataframe_multiple_columns_error_message: the error message to show in case the
provided data is more than 1-dimensional
:type dataframe_multiple_columns_error_message:
:return: the input data formatted as numpy.ndarray
:rtype: numpy.ndarray
"""
if type(data) == list:
data = np.array(data)
elif type(data) == pd.DataFrame:
if len(data.columns) > 1:
# TODO: extend to multiple columns for additional group data
raise ValueError(dataframe_multiple_columns_error_message)
data = data[data.columns[0]].values
elif type(data) == pd.Series:
data = data.values
return data


def _reformat_and_group_data(sensitive_features, labels, scores, sensitive_feature_names=None):
"""Reformats the data into a new pandas.DataFrame and group by sensitive feature values.
Expand Down Expand Up @@ -535,7 +488,7 @@ def _reformat_data_into_dict(key, data_dict, additional_data):
raise ValueError(
MULTIPLE_DATA_COLUMNS_ERROR_MESSAGE.format("sensitive_features"))
else:
data_dict[key] = additional_data.reshape(-1)
data_dict[key] = additional_data.squeeze()
elif type(additional_data) == pd.DataFrame:
# TODO: extend to multiple columns for additional_data by using column names
for attribute_column in additional_data.columns:
6 changes: 6 additions & 0 deletions test/unit/constants.py
@@ -0,0 +1,6 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.


MULTIPLE_SENSITIVE_FEATURE_COMPRESSION_SKIP_REASON = \
"Multiple sensitive features cannot be compressed into one-dimensional data structure."
33 changes: 30 additions & 3 deletions test/unit/input_convertors.py
@@ -4,6 +4,8 @@
import numpy as np
import pandas as pd

from fairlearn._input_validation import _compress_multiple_sensitive_features_into_single_column


def ensure_list(X):
assert X is not None
@@ -18,6 +20,19 @@ def ensure_list(X):
raise ValueError("Failed to convert to list")


def ensure_list_1d(X):
assert X is not None
if isinstance(X, list):
return X
elif isinstance(X, np.ndarray):
return X.squeeze().tolist()
elif isinstance(X, pd.Series):
return X.tolist()
elif isinstance(X, pd.DataFrame):
return X.tolist()
raise ValueError("Failed to convert to list")


def ensure_ndarray(X):
assert X is not None
if isinstance(X, list):
@@ -34,8 +49,10 @@ def ensure_ndarray(X):
def ensure_ndarray_2d(X):
assert X is not None
tmp = ensure_ndarray(X)
if len(tmp.shape) != 1:
raise ValueError("Requires 1d array")
if len(tmp.shape) not in [1, 2]:
raise ValueError("Requires 1d or 2d array")
if len(tmp.shape) == 2:
return tmp
result = np.expand_dims(tmp, 1)
assert len(result.shape) == 2
return result
@@ -46,7 +63,10 @@ def ensure_series(X):
if isinstance(X, list):
return pd.Series(X)
elif isinstance(X, np.ndarray):
return pd.Series(X)
if len(X.shape) == 1:
return pd.Series(X)
if X.shape[1] == 1:
return pd.Series(X.squeeze())
elif isinstance(X, pd.Series):
return X
elif isinstance(X, pd.DataFrame):
@@ -72,3 +92,10 @@ def ensure_dataframe(X):
ensure_ndarray_2d,
ensure_series,
ensure_dataframe]


def _map_into_single_column(matrix):
if len(np.array(matrix).shape) == 1:
return np.array(matrix)

return _compress_multiple_sensitive_features_into_single_column(matrix)
