Skip to content

amazon-science/minimax-fair

MinimaxFair - Convergent Algorithms for (Relaxed) Minimax Group Fairness

MinimaxFair is a Python package for training ML models for (relaxed) minimax group fairness as discussed in https://arxiv.org/abs/2011.03108

This repository contains python code for

  • learning models that achieve minimax group fairness for both regression and classification tasks
  • learning models that minimize error subject to relaxed group fairness constraints
  • visualizing tradeoffs between fairness and overall error

We also include some examples of fairness sensitive datasets for experimentation, though our package supports any dataset formatted as a .csv whose columns are labeled

Our algorithms support the following training objectives (loss functions):

  • Mean Squared Error
  • 0/1 Loss
  • Log Loss
  • False Positive Rate
  • False Negative Rate

Our algorithms support the following model classes:

  • Linear Regression
  • Logistic Regression
  • Paired Regression Classifier from https://github.com/algowatchpenn/GerryFair
  • Perceptron
  • Multi-Layer Perceptron for Classification - Uses custom wrapper class to work with our algorithm

Installation

To install the package and prepare for use, run:

git clone https://github.com/amazon-research/minimax-fair.git

The current version of the package uses the following packages: pandas, numpy, sklearn, matplotlib, pytorch, and scipy. If you do not have these packages, please run the following command to install them:

 pip install -r requirements.txt

Using our package

To use our package, we recommend modifying the relevant parameters in main_driver.py (which are described in detail within the file itself) and then running the following in the terminal from the root directory of MinimaxFair:

python3 main_driver.py

Datasets

Downloading pre-configured datasets.

To use one of the pre-configured datasets, you must first download the dataset from the corresponding online repository , and place it in the datasets folder with the specified file name.

We have provided a brief description of each dataset, along with a link to a page from where it can be downloaded . For some datasets, additional formatting or the running of provided scripts may be required.

Descriptions of pre-configured datasets

Adult - Predict whether income exceeds $50K/yr based on census data. http://archive.ics.uci.edu/ml/datasets/Adult

Downloading instructions: On the linked webpage click the "Data Folder" link and download adult.data. Rename the file to adult.csv and place it in the datasets folder in this package. Then, in the src folder of this package, run the script clean_adult_data.py which will create the the file adult_cleaned.csv which is formatted to be usedwith our pre-configured code.

Seoul Bike Sharing - The dataset contains count of public bikes rented at each hour in Seoul Bike sharing System with the corresponding Weather data and Holidays information. https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand

Downloading instructions: On the linked webpage click the "Data Folder" link, download SeoulBikeData.csv, and place in the datasets folder. If there are issues, make sure that the csv is saved in utf-8 encoding, and delete the "∞" symbol in the column headers "Temperature(∞C)" and "Dew point temperature(∞C)".

Communities and Crime - Communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. We added a column for plurality race, which we used to divide communities into groups. https://github.com/algowatchpenn/GerryFair/tree/master/dataset

Downloading instructions: On the provided webpage, download communities.csv and place in the datasets folder. Then, in the src folder of this package, run the script clean_communities_data.py which will create the the file communities_cleaned.csv which is formatted to be used with our pre-configured code.

COMPAS - Data from Broward County Clerk’s Office, Broward County Sherrif's Office, Florida Department of Corrections , compiled by ProPublica. https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis

Downloading instructions: Go to the linked github, download the file compas-scores-two-years.csv and place in the datasets folder.

Credit - German Credit Data for predicting good and bad credit risks, compiled by Prof Dr. Hans Hoffman. https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

Downloading instructions: Open "Data Folder" and download german.data. Rename to german.csv and place in datasets folder. Then, in the src folder of this package, run the script setup_credit_dataset.py which will create the the file german_cleaned.csv which is formatted to be used with our pre-configured code.

Default - This dataset contains information on default payments, demographic factors, credit data, history of payment , and bill statements of credit card clients in Taiwan from April 2005 to September 2005. https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Downloading instructions: Follow the "Data Folder" link and download default of credit card clients.xls. Open the file in excel and delete the top row with X1, X2, etc. Save the file as default.csv and place in datasets folder.

Fires - This is a regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data. https://archive.ics.uci.edu/ml/datasets/forest+fires

Downloading instructions: In the Data Folder, download forestfires.csv directly into datasets folder.

Heart - This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features. https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

Downloading instructions: Download heart_failure_clinical_records_dataset.csv into datasets folder.

Bank Marketing - The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y). https://archive.ics.uci.edu/ml/datasets/bank+marketing

Downloading instructions: In the Data Folder, download bank.zip, extract the folder, and move the files bank-full .csv and bank.csv into the datasets folder.

Student - Predict student performance in secondary education (high school). https://archive.ics.uci.edu/ml/datasets/student+performance

Downloading instructions: In the Data folder, download student.zip, extract, and move the file student-mat.csv to the datasets folder. If there are errors, the ";" characters may need to be replaced with ", " to meet the csv format. This can also be done by using the data-to-columns tool in Excel and using the ";" character as a delimiter to split the first column.

Wine - Using chemical analysis determine the origin of wines. https://archive.ics.uci.edu/ml/datasets/wine+quality

Downloading instructions: In the Data Folder, download winequality-red.data, and winequality-white.data and replace the .data extensions with .csv for both files, and place them in datasets folder. Then, run combine_wine_data.py which will create a new csv named winequality-full.csv in the datasets folder which is ready for use.

Running the code with pre-configured datasets

To use a pre-configured dataset simply change the value of use_preconfigured_dataset to True and select the relevant dataset index in main_drive .py. This selects the relevant parameter values which can be viewed for each dataset in the file dataset_mapping.py.

This setting can also be used to generate customized synthetic data using the dataset index 0. Settings for synthetic data generation are described in more detail in main_driver.py.

Using custom datasets

Our algorithm can be used with any dataset that is in the form of csv files with labeled columns. To use a custom dataset, do the following:

  1. Place the csv file for the dataset in the datasets folder which is located in the root directory of the package
  2. Specify the relevant parameters for interpreting the dataset in the section starting on line 63 of main_driver .py
    • Relevant parameters include target label, column labels for group categories, usable features, etc.
  3. Ensure the other parameters in main_driver.py have the desired value and run the file from the command line as described above

License

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Citation

If you publish material that uses this code, you can use the following citation:

@inproceedings{diana2021minimax,
title={Minimax Group Fairness: Algorithms and Experiments},
author={Diana, Emily and Gill, Wesley and Kearns, Michael and Kenthapadi, Krishnaram and Roth, Aaron},
year = {2021},
booktitle={Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society}
}

About

No description, website, or topics provided.

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages