codesrepo/GVC2021
Experian GVC2021 submission - Theme #3: Loan Origination & KYC - Problem Statement #1

Steps to create the APIX environment (returning users only need to run `sudo su`). Please note that due to limited RAM on APIX, the data was restricted to 10k records:

  1. Go to your project directory and run `git clone https://github.com/codesrepo/GVC2021.git`
  2. `sudo apt-get update`
  3. `sudo apt-get install wget`
  4. `wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh`
  5. `sudo bash Anaconda3-2021.05-Linux-x86_64.sh`
  6. Open .bashrc for editing: `vim ~/.bashrc`
  7. Add this line: `export PATH=/anaconda3/bin:$PATH`
  8. `source ~/.bashrc`
  9. `sudo su`
  10. `conda install xgboost`

The root folder GVC2021 contains the following artifacts:

Please note that only the non-IP code is provided in this sandbox. To run the IP-protected code, i.e. the "normative data approach" or the "explainability" piece, call the following API, hosted on APIX, with these parameters:

URL - https://api.apixplatform.com/test/api

requestid - 2324b8f0-8694-4a68-a07e-59655c9e9996

serviceid - normative_report or explainability
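A minimal sketch of calling this endpoint from Python, using only the standard library. Only the URL, `requestid`, and `serviceid` values are documented above; the exact JSON body shape and response format are assumptions.

```python
import json
from urllib import request

API_URL = "https://api.apixplatform.com/test/api"
REQUEST_ID = "2324b8f0-8694-4a68-a07e-59655c9e9996"

def build_payload(serviceid: str) -> dict:
    """Assemble the request body; the field names mirror the parameters
    documented in this README, but the JSON shape is an assumption."""
    if serviceid not in ("normative_report", "explainability"):
        raise ValueError("serviceid must be 'normative_report' or 'explainability'")
    return {"requestid": REQUEST_ID, "serviceid": serviceid}

def call_api(serviceid: str) -> bytes:
    """POST the payload to the APIX endpoint (requires APIX access)."""
    req = request.Request(
        API_URL,
        data=json.dumps(build_payload(serviceid)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # network call
        return resp.read()
```

The prepared JSON files in `API_docs` ("explainability.json", "normative_report.json") can be used instead of `build_payload` if the body schema differs.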

| File/Folder | Description |
|---|---|
| API_docs | Documentation on running the IP-protected code via the APIX API (https://api.apixplatform.com/test/api); use "explainability.json" or "normative_report.json" provided here to call the API |
| codes | All the code is placed in this folder |
| data | Sample data (only 10k records) to run the end-to-end pipeline, as well as test data to generate the actual results |
| notebooks.zip | Development notebooks |
| output | Output generated by running the pipeline |
| README.md | This file |
| TEST_MASTER_OUTPUT | Results generated by running RUN_TEST_MASTER.py; this file generates the results mentioned in the presentation |
| tech_demo_v10152021.pptx | A slide deck summarizing the submission artifacts |

Please run "RUN_TEST_MASTER.py" to generate the reports on the actual output scores that were produced in this hackathon and used in the presentation.

The scripts below take a 10,000-record sample of the raw input to demonstrate the end-to-end pre-processing, model development, and score-correction pipeline. They generate the same final output as "RUN_TEST_MASTER.py" when run on the complete dataset rather than the 10,000-record sample.

The scripts are grouped as follows; it is recommended to run them in this sequence:

  1. Data preparation

  2. Base model development

  3. Fairness assessment on the original model

  4. Applying decision repair ROC and fair training UBR to balance for fairness

  5. Fairness assessment on corrected scores

  6. Cost benefit analysis

  7. Model monitoring - exploration vs exploitation analysis
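The sequence above can be sketched as a small driver that invokes each script in order. The script names are taken from the tables below; running them for real requires the repo's data files, so the sketch defaults to a dry run.

```python
import subprocess
import sys

# Suggested execution order, matching the step list above.
PIPELINE = [
    "hc_data_preparation.py",
    "hc_base_model_development.py",
    "call_fairness_original_data.py",
    "call_ROC_correction.py",
    "call_UBR_correction.py",
    "call_fairness_ROC_data.py",
    "call_fairness_UBR_data.py",
    "cost_benefit_analysis.py",
    "monitoring_simulation.py",
]

def run_pipeline(dry_run: bool = True) -> list:
    """Build the per-stage commands; execute them in order unless dry_run."""
    commands = [[sys.executable, script] for script in PIPELINE]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop on the first failure
    return commands
```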

  1. Data preparation: filters the new-to-credit (NTC) segment and applies pre-processing steps including label encoding, standardization, and log transformation:

| Script | Description | Input | Output |
|---|---|---|---|
| hc_data_preparation.py | Filters out the NTC segment and performs data pre-processing | Home Credit data - application_train.csv and bureau.csv | Saves "HC_train.csv" and "HC_test.csv" at the output location |
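The pre-processing steps named above can be illustrated with a pandas sketch. The column name `AMT_INCOME_TOTAL` is a hypothetical example (it exists in the Home Credit data, but which columns the repo actually log-transforms is an assumption), and this is not the repo's implementation.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative pre-processing: label-encode categoricals,
    log-transform a skewed amount column, standardize numerics."""
    out = df.copy()
    # Label encoding: map each string category to an integer code.
    for col in out.select_dtypes(include="object"):
        out[col] = out[col].astype("category").cat.codes
    # Log transform (log1p handles zeros) on a hypothetical amount column.
    if "AMT_INCOME_TOTAL" in out:
        out["AMT_INCOME_TOTAL"] = np.log1p(out["AMT_INCOME_TOTAL"])
    # Standardization: zero mean, unit variance per numeric column.
    num = out.select_dtypes(include="number").columns
    out[num] = (out[num] - out[num].mean()) / out[num].std(ddof=0)
    return out
```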
  2. Base model development and test data scoring: uses optimized XGBoost parameters to train a base model focused solely on optimizing the outcome, without taking fairness into consideration.

| Script | Description | Input | Output |
|---|---|---|---|
| hc_base_model_development.py | XGBoost base model development and scoring | "HC_train.csv" and "HC_test.csv" generated in the previous step | Saves output as "final_outcome.csv" |
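A minimal sketch of this training step on synthetic data. The hyperparameters are placeholders, not the tuned values used in the repo, and the sketch falls back to scikit-learn's gradient boosting if xgboost is not installed so it stays runnable.

```python
import numpy as np

try:
    from xgboost import XGBClassifier  # the model family used in the submission
except ImportError:  # fallback so the sketch runs without xgboost installed
    from sklearn.ensemble import GradientBoostingClassifier as XGBClassifier

def train_base_model(X, y):
    """Fit a gradient-boosted base model; placeholder hyperparameters."""
    model = XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(X, y)
    return model

# Synthetic stand-in for HC_train.csv.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = train_base_model(X, y)
scores = model.predict_proba(X)[:, 1]  # scores like those written to final_outcome.csv
```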
  3. Fairness assessment on the original model: fairness is assessed over the test scores of the original model on the protected attributes `['AGE', 'CODE_GENDER', 'MARITAL_STATUS']`.

| Script | Description | Input | Output |
|---|---|---|---|
| call_fairness_original_data.py | Fairness assessment over the test scores generated in the previous step | "final_outcome.csv" | Saves the aggregated data as well as the charts, named "%s_initial_fairness_report.csv" % metric |
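One of the standard metrics such an assessment computes is disparate impact, shown here as a self-contained sketch (the group encoding and any thresholding of scores into decisions are assumptions, not the repo's conventions).

```python
def disparate_impact(y_pred, group):
    """Ratio of positive-outcome rates: unprivileged / privileged.
    group: 0 = unprivileged, 1 = privileged. Values near 1 indicate parity;
    the common "80% rule" flags ratios below 0.8."""
    unpriv = [p for p, g in zip(y_pred, group) if g == 0]
    priv = [p for p, g in zip(y_pred, group) if g == 1]
    return (sum(unpriv) / len(unpriv)) / (sum(priv) / len(priv))
```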
  4. Applying decision repair (ROC) and fair training (UBR) to balance for fairness: unintended bias removal (UBR) and grid search (GS) are used as fair-training methods, and the reject option classifier (ROC) as the repair approach.

| Script | Description | Input | Output |
|---|---|---|---|
| call_ROC_correction.py | Performs ROC correction | "final_outcome.csv" | Saves the repaired scores as "ROC_corrected_data_1.csv" |
| call_UBR_correction.py | Performs unintended bias removal correction | The training and test data generated in step 1, used for fair training | Saves the scores data as "UBR_corrected_data.csv" |
| call_GS_search.py | Performs grid search | The training and test data generated in step 1, used for fair training | Saves the scores data as "GS_corrected_data.csv" (access is restricted) |
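The reject-option idea can be illustrated in a few lines: within an uncertainty band around the decision threshold, the favourable label is given to the unprivileged group. The band width, threshold, and the convention that 1 is the favourable outcome are illustrative assumptions, not the repo's actual parameters.

```python
def reject_option_correction(scores, group, threshold=0.5, margin=0.05):
    """Reject option classifier sketch: inside the band
    [threshold - margin, threshold + margin], assign the favourable label (1)
    to the unprivileged group (0) and the unfavourable label (0) to the
    privileged group (1); outside the band, threshold as usual."""
    labels = []
    for s, g in zip(scores, group):
        if threshold - margin <= s <= threshold + margin:
            labels.append(1 if g == 0 else 0)
        else:
            labels.append(1 if s >= threshold else 0)
    return labels
```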
  5. Fairness assessment on corrected scores: the fairness assessment from step 3 is repeated on the ROC-, UBR-, and GS-corrected scores.

| Script | Description | Input | Output |
|---|---|---|---|
| call_fairness_ROC_data.py | Fairness assessment over the scores generated by call_ROC_correction.py | "ROC_corrected_data_1.csv" | Saves the aggregated data as well as the charts, named "%s_ROC_fairness_report.csv" % metric |
| call_fairness_UBR_data.py | Fairness assessment over the scores generated by call_UBR_correction.py | "UBR_corrected_data.csv" | Saves the aggregated data as well as the charts, named "%s_UBR_fairness_report.csv" % metric |
| call_fairness_GS_data.py | Fairness assessment over the scores generated by call_GS_search.py | "GS_corrected_data_1.csv" | Saves the aggregated data as well as the charts, named "%s_GS_fairness_report.csv" % metric (access is restricted) |
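The other metric this pipeline tracks, average odds difference, can be computed as in this sketch (binary labels assumed; the function returns the signed value, so take its absolute value for the "absolute average odds difference" used in monitoring).

```python
def average_odds_difference(y_true, y_pred, group):
    """AOD = 0.5 * [(FPR_u - FPR_p) + (TPR_u - TPR_p)], where group 0 is
    unprivileged and 1 is privileged; 0 means both groups' error rates match."""
    def rates(g):
        tp = fp = fn = tn = 0
        for t, p, gg in zip(y_true, y_pred, group):
            if gg != g:
                continue
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            else:
                tn += 1
        return fp / (fp + tn), tp / (tp + fn)

    fpr_u, tpr_u = rates(0)
    fpr_p, tpr_p = rates(1)
    return 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
```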
  6. Cost-benefit analysis: calculates the optimal operating score threshold based on the profit and loss assumptions.

| Script | Description | Input | Output |
|---|---|---|---|
| cost_benefit_analysis.py | Uses the original and corrected scores to perform cost-benefit analysis | Output generated in step 5 | "%s_cost_benefit_analysis_report.csv" |
  7. Model monitoring - exploration vs. exploitation analysis: simulates future data to demonstrate how fairness monitoring will work. For tracking disparate impact, 50% of the test samples are drawn; for tracking absolute average odds difference, 1k random samples are used; 12 such snapshots are generated.

| Script | Description | Input | Output |
|---|---|---|---|
| monitoring_simulation.py | Simulates data to demonstrate fairness monitoring | Data generated in step 5 | "DI_AOD_monitoring.csv" |
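The snapshot mechanism can be sketched as repeated random sampling of the test predictions with a metric computed per snapshot; here disparate impact is tracked on 50% samples as described above, with an arbitrary seed and no claim to match the repo's simulation logic.

```python
import random

def simulate_snapshots(y_pred, group, n_snapshots=12, frac=0.5, seed=42):
    """Draw n_snapshots random samples of size frac*len and record
    disparate impact per snapshot (NaN if a group is empty or has
    no positive outcomes in the sample)."""
    rng = random.Random(seed)
    n = int(len(y_pred) * frac)
    history = []
    for _ in range(n_snapshots):
        idx = rng.sample(range(len(y_pred)), n)
        up = [y_pred[i] for i in idx if group[i] == 0]
        pr = [y_pred[i] for i in idx if group[i] == 1]
        if up and pr and sum(pr):
            history.append((sum(up) / len(up)) / (sum(pr) / len(pr)))
        else:
            history.append(float("nan"))
    return history
```

In production, each snapshot's value would be appended to a file like "DI_AOD_monitoring.csv" and alerted on when it drifts past a tolerance.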
  8. Helper functions:

| Script | Description | Input | Output |
|---|---|---|---|
| utils.py | Sets paths and stores helper functions (e.g. for cost_benefit_analysis) | NA | NA |
| FairnessM.py | Master class to calculate fairness metrics and generate reports | NA | NA |

Tested on Python 3.8 (should run on other Python 3 versions). Sample install command: pip install package==1.2.2

pandas==1.2.4
numpy==1.20.1
scikit-learn==0.24.1
matplotlib==3.3.4
seaborn==0.11.1
scipy==1.6.2
xgboost==1.3.3
