Steps to create the APIX environment (returning users must run `sudo su`):
- You need to run these steps to set up the environment; please note that due to limited RAM on APIX, the data was restricted to 10k records:
- Go to the project directory and run git clone https://github.com/codesrepo/GVC2021.git
- sudo apt-get update
- sudo apt-get install wget
- wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
- sudo bash Anaconda3-2021.05-Linux-x86_64.sh
- open ~/.bashrc for editing: vim ~/.bashrc
- add this line: export PATH=/anaconda3/bin:$PATH
- source ~/.bashrc
- sudo su
- conda install xgboost
The root folder GVC2021 has the following artifacts:
Please note that only non-IP code has been provided in this sandbox; to run the IP-protected code, i.e. the "normative data approach" or the "explainability" piece, please refer to the following API, with parameters, hosted on APIX:
URL - https://api.apixplatform.com/test/api
requestid - 2324b8f0-8694-4a68-a07e-59655c9e9996
serviceid - Use normative_report or explainability
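The parameters above can be assembled into a request as in the following sketch. The field names (`requestid`, `serviceid`) are taken from the list above, but the exact request schema expected by the APIX endpoint is an assumption; consult the files in API_docs for the authoritative format.

```python
import json
from urllib import request

# Payload fields follow the parameters listed above; the exact schema
# expected by the APIX endpoint is an assumption.
payload = {
    "requestid": "2324b8f0-8694-4a68-a07e-59655c9e9996",
    "serviceid": "normative_report",  # or "explainability"
}

def call_apix(payload: dict) -> bytes:
    """POST the payload to the APIX endpoint (network access required)."""
    req = request.Request(
        "https://api.apixplatform.com/test/api",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()
```

The ready-made "explainability.json" and "normative_report.json" files in API_docs can be loaded with `json.load` and passed as the payload instead of building it by hand.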
File/Folder | Description |
---|---|
API_docs | Documentation related to running the IP-protected code via the APIX API https://api.apixplatform.com/test/api; please use "explainability.json" or "normative_report.json" provided here to call the API |
codes | All the codes are placed in this folder |
data | Sample data (only 10k records) to run the end-to-end pipeline, as well as test data to generate the actual results |
notebooks.zip | Development notebooks |
output | Output generated by running the pipeline |
README.md | This file |
requirements.txt | Python package list to simulate the environment |
TEST_MASTER_OUTPUT | The results generated by running RUN_TEST_MASTER.py; these are the results mentioned in the presentation |
tech_demo_v10152021.pptx | A ppt summarizing the submission artifacts |
Please run "RUN_TEST_MASTER.py" to generate the reports on the actual output scores that were generated in this hackathon and are used in the presentation.
The codes below take 10,000 samples of raw input to demonstrate the end-to-end pre-processing, model development, and score-correction pipeline. They will generate the same final output as "RUN_TEST_MASTER.py" if the complete dataset, rather than 10,000 samples, is used.
The codes are classified into the groups below; it is recommended that they are run in sequence:

- Data preparation
- Base model development
- Fairness assessment on the original model
- Applying decision repair (ROC) and fair training (UBR) to balance for fairness
- Fairness assessment on corrected scores
- Cost benefit analysis
- Model monitoring - exploration vs exploitation analysis
- Data preparation: Filters the new-to-credit (NTC) segment and applies pre-processing steps including label encoding, standardization, log transformation, etc.:
Script | Description | Input | Output |
---|---|---|---|
hc_data_preparation.py | Filters out the NTC segment and performs data pre-processing | Home Credit data - application_train.csv and bureau.csv files | Saves "HC_train.csv" and "HC_test.csv" at the output location |
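The pre-processing steps named above (NTC filtering, label encoding, standardization, log transformation) can be sketched as follows. This is illustrative only: the column names (`BUREAU_RECORDS`, `CODE_GENDER`, `AMT_INCOME_TOTAL`) and the NTC filter condition are assumptions and will differ from what hc_data_preparation.py actually does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative NTC filtering and pre-processing; column names and the
    filter condition are hypothetical, not those of hc_data_preparation.py."""
    # Keep only new-to-credit applicants (assumed: no bureau history).
    ntc = df[df["BUREAU_RECORDS"] == 0].copy()
    # Label-encode a categorical column.
    ntc["CODE_GENDER"] = LabelEncoder().fit_transform(ntc["CODE_GENDER"])
    # Log-transform a skewed amount, then standardize it.
    ntc["AMT_INCOME_TOTAL"] = np.log1p(ntc["AMT_INCOME_TOTAL"])
    ntc[["AMT_INCOME_TOTAL"]] = StandardScaler().fit_transform(
        ntc[["AMT_INCOME_TOTAL"]]
    )
    return ntc

# Toy stand-in for the merged application_train.csv / bureau.csv data.
df = pd.DataFrame({
    "BUREAU_RECORDS": [0, 0, 3, 0],
    "CODE_GENDER": ["F", "M", "F", "M"],
    "AMT_INCOME_TOTAL": [120000.0, 90000.0, 50000.0, 200000.0],
})
out = prepare(df)
```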
- Base model development and test data scoring: This script uses optimized XGBoost parameters to train a base model focused only on optimizing the outcome, without taking fairness into consideration.
Script | Description | Input | Output |
---|---|---|---|
hc_base_model_development.py | XGBoost base model development and scoring | "HC_train.csv" and "HC_test.csv" generated in the previous step | Saves output as "final_outcome.csv" |
- Fairness assessment on the original model: Fairness assessment is performed on the protected attributes ['AGE', 'CODE_GENDER', 'MARITAL_STATUS'] over the test scores of the original model.
Script | Description | Input | Output |
---|---|---|---|
call_fairness_original_data.py | Fairness assessment is performed over the test scores generated in the above step | The data generated above, i.e. final_outcome.csv, is used as input | Saves the aggregated data as well as the charts, named "%s_initial_fairness_report.csv" % (metric) for each metric |
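To illustrate the kind of metric such an assessment computes, here is a sketch of disparate impact over one protected attribute. The column names and the choice of privileged group are hypothetical; the actual metrics are computed by FairnessM.py, which is not reproduced here.

```python
import pandas as pd

def disparate_impact(df, group_col, privileged, pred_col="pred"):
    """Ratio of favourable-outcome rates: unprivileged / privileged.
    A value near 1 indicates parity; the 'four-fifths rule' flags < 0.8."""
    priv_rate = df.loc[df[group_col] == privileged, pred_col].mean()
    unpriv_rate = df.loc[df[group_col] != privileged, pred_col].mean()
    return unpriv_rate / priv_rate

# Toy scored data; 1 = favourable decision (e.g. approval).
scored = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "pred": [1, 1, 1, 0, 1, 0, 0, 0],
})
di = disparate_impact(scored, "CODE_GENDER", privileged="M")
# Here the unprivileged approval rate is 0.25 vs 0.75, so di = 1/3.
```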
- Applying decision repair and fair training to balance for fairness: We have used unintended bias removal (UBR) and grid search (GS) as fair training methods and the reject option classifier (ROC) as the repair approach.
Script | Description | Input | Output |
---|---|---|---|
call_ROC_correction.py | Performs ROC correction | The data generated above i.e. final_outcome.csv is used as input | Saves the repaired scores as "ROC_corrected_data_1.csv" |
call_UBR_correction.py | Performs unintended bias removal correction | The training and test data generated in step 1 are used for fair training | Saves the corrected scores as "UBR_corrected_data.csv", along with charts |
call_GS_search.py | Performs grid search correction | The training and test data generated in step 1 are used for fair training | Saves the corrected scores as "GS_corrected_data.csv", along with charts; access is restricted |
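The reject-option-classifier idea behind the ROC repair can be sketched in a few lines: inside an uncertainty band around the decision threshold, decisions are flipped in favour of the unprivileged group. This is a minimal sketch of the general technique, not the call_ROC_correction.py implementation; the threshold, band width, and the convention that 1 is the favourable decision are all assumptions.

```python
import numpy as np

def reject_option_correction(scores, group, privileged, thr=0.5, margin=0.1):
    """Reject-option-classifier sketch: within the uncertainty band
    [thr - margin, thr + margin], assign the favourable decision (1) to
    the unprivileged group and the unfavourable one (0) to the privileged
    group. Outside the band, decisions are left unchanged."""
    decisions = (scores >= thr).astype(int)
    band = np.abs(scores - thr) <= margin
    decisions[band & (group != privileged)] = 1  # favour unprivileged
    decisions[band & (group == privileged)] = 0  # disfavour privileged
    return decisions

scores = np.array([0.45, 0.55, 0.90, 0.20])
group = np.array(["F", "M", "M", "F"])
fixed = reject_option_correction(scores, group, privileged="M")
# Only the two borderline scores (0.45, 0.55) are repaired.
```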
- Fairness assessment on corrected scores: Fairness assessment is repeated on the scores produced by each correction method (ROC, UBR and GS).
Script | Description | Input | Output |
---|---|---|---|
call_fairness_ROC_data.py | Fairness assessment is performed over the test scores generated by call_ROC_correction.py | "ROC_corrected_data_1.csv" is used as input | Saves the repaired scores data as well as the charts, named "%s_ROC_fairness_report.csv" % (metric) for each metric |
call_fairness_UBR_data.py | Fairness assessment is performed over the test scores generated by call_UBR_correction.py | "UBR_corrected_data.csv" is used as input | Saves the repaired scores data as well as the charts, named "%s_UBR_fairness_report.csv" % (metric) for each metric |
call_fairness_GS_data.py | Fairness assessment is performed over the test scores generated by call_GS_search.py | "GS_corrected_data.csv" is used as input | Saves the repaired scores data as well as the charts, named "%s_GS_fairness_report.csv" % (metric) for each metric; access is restricted |
- Cost benefit analysis: Calculates the optimal operating score threshold based on the profit and loss assumptions:
Script | Description | Input | Output |
---|---|---|---|
cost_benefit_analysis.py | Uses original and corrected scores to perform cost benefit analysis | Uses the output generated in step 5 as input | "%s_cost_benefit_analysis_report.csv" |
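The threshold search can be sketched as a simple sweep: for each candidate threshold, compute the expected profit and loss of the approved population and keep the best. The profit/loss figures and the convention that a low score means low default risk are purely illustrative assumptions, not the values used by cost_benefit_analysis.py.

```python
import numpy as np

def optimal_threshold(scores, defaults, profit=100.0, loss=500.0):
    """Sweep thresholds and pick the one maximizing P&L: approve when
    score < threshold; each good loan earns `profit`, each approved
    defaulter costs `loss`. All numbers are illustrative assumptions."""
    best_thr, best_pnl = 0.0, -np.inf
    for thr in np.linspace(0.05, 0.95, 19):
        approved = scores < thr  # assumed: low score = low default risk
        pnl = (profit * np.sum(approved & (defaults == 0))
               - loss * np.sum(approved & (defaults == 1)))
        if pnl > best_pnl:
            best_thr, best_pnl = thr, pnl
    return best_thr, best_pnl

# Synthetic scores where a higher score means a riskier applicant.
rng = np.random.default_rng(1)
scores = rng.uniform(size=1000)
defaults = (rng.uniform(size=1000) < scores).astype(int)
thr, pnl = optimal_threshold(scores, defaults)
```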
- Model monitoring - exploration vs exploitation analysis: This code simulates future data to demonstrate how fairness monitoring would work. For tracking disparate impact, 50% of the test samples are generated; for tracking absolute average odds difference, 1k random samples are used. 12 such snapshots are generated:
Script | Description | Input | Output |
---|---|---|---|
monitoring_simulation.py | Simulates data to demonstrate fairness monitoring | Data generated in step 5 is used as input | "DI_AOD_monitoring.csv" |
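The monitoring idea can be sketched as follows: generate a series of snapshots and track a fairness metric (here disparate impact) across them. This is a toy simulation on synthetic data; monitoring_simulation.py instead resamples the actual step-5 output, and the group, drift, and approval-rate assumptions below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulate 12 snapshots and track disparate impact (DI) over time.
rows = []
for month in range(1, 13):
    n = 1000
    gender = rng.choice(["M", "F"], size=n)
    # Assumed approval probabilities, with slight drift for group F.
    p = np.where(gender == "M", 0.60, 0.50 - 0.005 * month)
    approved = rng.uniform(size=n) < p
    di = approved[gender == "F"].mean() / approved[gender == "M"].mean()
    rows.append({"snapshot": month, "DI": di})

# Analogous in spirit to the "DI_AOD_monitoring.csv" output.
monitoring = pd.DataFrame(rows)
```

A DI trending away from 1 across snapshots is the signal such monitoring is designed to surface.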
- Helper functions:
Script | Description | Input | Output |
---|---|---|---|
utils.py | Sets paths and stores helper functions such as cost_benefit_analysis | NA | NA |
FairnessM.py | Master class to calculate fairness metrics and generate reports | NA | NA |
Tested on Python 3.8 (should also run on other Python 3 versions). Sample install command: pip install package==1.2.2
pandas==1.2.4 numpy==1.20.1 scikit-learn==0.24.1 matplotlib==3.3.4 seaborn==0.11.1 scipy==1.6.2 xgboost==1.3.3