Steps to create the APIX environment (returning users must run `sudo su`):
- You need to run these steps to set up the environment; please note that due to limited RAM on APIX, the data was restricted to 10k records:
- Go to the project directory and run git clone https://github.com/codesrepo/GVC2021.git
- sudo apt-get update
- sudo apt-get install wget
- wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
- sudo bash Anaconda3-2021.05-Linux-x86_64.sh
- open ~/.bashrc for editing: vim ~/.bashrc
- add this line: export PATH=/anaconda3/bin:$PATH
- source ~/.bashrc
- sudo su
- conda install xgboost
The root folder GVC2021 has the following artifacts:
Please note that only non-IP code has been provided in this sandbox; to run the IP-protected code, i.e. the "normative data approach" or the "explainability" piece, please refer to the following API, with parameters, hosted on APIX:
URL - https://api.apixplatform.com/test/api
requestid - 2324b8f0-8694-4a68-a07e-59655c9e9996
serviceid - Use normative_report or explainability
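The parameters above can be assembled into a request as in the following sketch. The field names (`requestid`, `serviceid`) are taken from the list above, but the exact request schema expected by the APIX endpoint is an assumption; consult the files in API_docs for the authoritative format.

```python
import json
from urllib import request

# Payload fields follow the parameters listed above; the exact schema
# expected by the APIX endpoint is an assumption.
payload = {
    "requestid": "2324b8f0-8694-4a68-a07e-59655c9e9996",
    "serviceid": "normative_report",  # or "explainability"
}

def call_apix(payload: dict) -> bytes:
    """POST the payload to the APIX endpoint (network access required)."""
    req = request.Request(
        "https://api.apixplatform.com/test/api",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()
```

The ready-made "explainability.json" and "normative_report.json" files in API_docs can be loaded with `json.load` and passed as the payload instead of building it by hand.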
File/Folder | Description |
---|---|
API_docs | Documentation related to running the IP-protected code via the APIX API https://api.apixplatform.com/test/api; please use "explainability.json" or "normative_report.json" provided here to call the API |
codes | All the codes are placed in this folder |
data | Sample data (only 10k records) to run the end-to-end pipeline, as well as test data to generate the actual results |
notebooks.zip | Development notebooks |
output | Output generated by running the pipeline |
README.md | This file |
requirements.txt | Python package list to simulate the environment |
TEST_MASTER_OUTPUT | The results generated by running RUN_TEST_MASTER.py; these are the results mentioned in the presentation |
tech_demo_v10152021.pptx | A ppt summarizing the submission artifacts |
Please run "RUN_TEST_MASTER.py" to generate the reports on the actual output scores that were generated in this hackathon and are used in the presentation.
The codes below take 10,000 samples of raw input to demonstrate the end-to-end pre-processing, model development, and score-correction pipeline. They will generate the same final output as "RUN_TEST_MASTER.py" if the complete dataset, rather than 10,000 samples, is used.
The codes are classified into the groups below; it is recommended that they are run in sequence:

- Data preparation
- Base model development
- Fairness assessment on the original model
- Applying decision repair (ROC) and fair training (UBR) to balance for fairness
- Fairness assessment on corrected scores
- Cost benefit analysis
- Model monitoring - exploration vs exploitation analysis
- Data preparation: Filters the new-to-credit (NTC) segment and applies pre-processing steps including label encoding, standardization, log transformation, etc.:
Script | Description | Input | Output |
---|---|---|---|
hc_data_preparation.py | Filters out the NTC segment and performs data pre-processing | Home Credit data - application_train.csv and bureau.csv files | Saves "HC_train.csv" and "HC_test.csv" at the output location |
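The pre-processing steps named above (NTC filtering, label encoding, standardization, log transformation) can be sketched as follows. This is illustrative only: the column names (`BUREAU_RECORDS`, `CODE_GENDER`, `AMT_INCOME_TOTAL`) and the NTC filter condition are assumptions and will differ from what hc_data_preparation.py actually does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative NTC filtering and pre-processing; column names and the
    filter condition are hypothetical, not those of hc_data_preparation.py."""
    # Keep only new-to-credit applicants (assumed: no bureau history).
    ntc = df[df["BUREAU_RECORDS"] == 0].copy()
    # Label-encode a categorical column.
    ntc["CODE_GENDER"] = LabelEncoder().fit_transform(ntc["CODE_GENDER"])
    # Log-transform a skewed amount, then standardize it.
    ntc["AMT_INCOME_TOTAL"] = np.log1p(ntc["AMT_INCOME_TOTAL"])
    ntc[["AMT_INCOME_TOTAL"]] = StandardScaler().fit_transform(
        ntc[["AMT_INCOME_TOTAL"]]
    )
    return ntc

# Toy stand-in for the merged application_train.csv / bureau.csv data.
df = pd.DataFrame({
    "BUREAU_RECORDS": [0, 0, 3, 0],
    "CODE_GENDER": ["F", "M", "F", "M"],
    "AMT_INCOME_TOTAL": [120000.0, 90000.0, 50000.0, 200000.0],
})
out = prepare(df)
```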
- Base model development and test data scoring: This script uses optimized XGBoost parameters to train a base model focused only on optimizing the outcome, without taking fairness into consideration.
Script | Description | Input | Output |
---|---|---|---|
hc_base_model_development.py | XGBoost base model development and scoring | "HC_train.csv" and "HC_test.csv" generated in the previous step | Saves output as "final_outcome.csv" |
- Fairness assessment on the original model: Fairness assessment is performed on the protected attributes ['AGE', 'CODE_GENDER', 'MARITAL_STATUS'] over the test scores of the original model.
Script | Description | Input | Output |
---|---|---|---|
call_fairness_original_data.py | Fairness assessment is performed over the test scores generated in the above step | The data generated above, i.e. final_outcome.csv, is used as input | Saves the aggregated data as well as the charts, named "%s_initial_fairness_report.csv" % (metric) for each metric |
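To illustrate the kind of metric such an assessment computes, here is a sketch of disparate impact over one protected attribute. The column names and the choice of privileged group are hypothetical; the actual metrics are computed by FairnessM.py, which is not reproduced here.

```python
import pandas as pd

def disparate_impact(df, group_col, privileged, pred_col="pred"):
    """Ratio of favourable-outcome rates: unprivileged / privileged.
    A value near 1 indicates parity; the 'four-fifths rule' flags < 0.8."""
    priv_rate = df.loc[df[group_col] == privileged, pred_col].mean()
    unpriv_rate = df.loc[df[group_col] != privileged, pred_col].mean()
    return unpriv_rate / priv_rate

# Toy scored data; 1 = favourable decision (e.g. approval).
scored = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "pred": [1, 1, 1, 0, 1, 0, 0, 0],
})
di = disparate_impact(scored, "CODE_GENDER", privileged="M")
# Here the unprivileged approval rate is 0.25 vs 0.75, so di = 1/3.
```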
- Applying decision repair and fair training to balance for fairness: We have used unintended bias removal (UBR) and grid search (GS) as fair training methods and the reject option classifier (ROC) as the repair approach.
Script | Description | Input | Output |
---|---|---|---|
call_ROC_correction.py | Performs ROC correction | The data generated above i.e. final_outcome.csv is used as input | Saves the repaired scores as "ROC_corrected_data_1.csv" |
call_UBR_correction.py | Performs unintended bias removal correction | The training and test data generated in step 1 are used for fair training | Saves the corrected scores as "UBR_corrected_data.csv", along with charts |
call_GS_search.py | Performs grid search correction | The training and test data generated in step 1 are used for fair training | Saves the corrected scores as "GS_corrected_data.csv", along with charts; access is restricted |
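The reject-option-classifier idea behind the ROC repair can be sketched in a few lines: inside an uncertainty band around the decision threshold, decisions are flipped in favour of the unprivileged group. This is a minimal sketch of the general technique, not the call_ROC_correction.py implementation; the threshold, band width, and the convention that 1 is the favourable decision are all assumptions.

```python
import numpy as np

def reject_option_correction(scores, group, privileged, thr=0.5, margin=0.1):
    """Reject-option-classifier sketch: within the uncertainty band
    [thr - margin, thr + margin], assign the favourable decision (1) to
    the unprivileged group and the unfavourable one (0) to the privileged
    group. Outside the band, decisions are left unchanged."""
    decisions = (scores >= thr).astype(int)
    band = np.abs(scores - thr) <= margin
    decisions[band & (group != privileged)] = 1  # favour unprivileged
    decisions[band & (group == privileged)] = 0  # disfavour privileged
    return decisions

scores = np.array([0.45, 0.55, 0.90, 0.20])
group = np.array(["F", "M", "M", "F"])
fixed = reject_option_correction(scores, group, privileged="M")
# Only the two borderline scores (0.45, 0.55) are repaired.
```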
- Fairness assessment on corrected scores: Fairness assessment is repeated on the scores produced by each correction method (ROC, UBR and GS).
Script | Description | Input | Output |
---|---|---|---|
call_fairness_ROC_data.py | Fairness assessment is performed over the test scores generated by call_ROC_correction.py | "ROC_corrected_data_1.csv" is used as input | Saves the repaired scores data as well as the charts, named "%s_ROC_fairness_report.csv" % (metric) for each metric |
call_fairness_UBR_data.py | Fairness assessment is performed over the test scores generated by call_UBR_correction.py | "UBR_corrected_data.csv" is used as input | Saves the repaired scores data as well as the charts, named "%s_UBR_fairness_report.csv" % (metric) for each metric |
call_fairness_GS_data.py | Fairness assessment is performed over the test scores generated by call_GS_search.py | "GS_corrected_data.csv" is used as input | Saves the repaired scores data as well as the charts, named "%s_GS_fairness_report.csv" % (metric) for each metric; access is restricted |
- Cost benefit analysis: Calculates the optimal operating score threshold based on the profit and loss assumptions:
Script | Description | Input | Output |
---|---|---|---|
cost_benefit_analysis.py | Uses original and corrected scores to perform cost benefit analysis | Uses the output generated in step 5 as input | "%s_cost_benefit_analysis_report.csv" |
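The threshold search can be sketched as a simple sweep: for each candidate threshold, compute the expected profit and loss of the approved population and keep the best. The profit/loss figures and the convention that a low score means low default risk are purely illustrative assumptions, not the values used by cost_benefit_analysis.py.

```python
import numpy as np

def optimal_threshold(scores, defaults, profit=100.0, loss=500.0):
    """Sweep thresholds and pick the one maximizing P&L: approve when
    score < threshold; each good loan earns `profit`, each approved
    defaulter costs `loss`. All numbers are illustrative assumptions."""
    best_thr, best_pnl = 0.0, -np.inf
    for thr in np.linspace(0.05, 0.95, 19):
        approved = scores < thr  # assumed: low score = low default risk
        pnl = (profit * np.sum(approved & (defaults == 0))
               - loss * np.sum(approved & (defaults == 1)))
        if pnl > best_pnl:
            best_thr, best_pnl = thr, pnl
    return best_thr, best_pnl

# Synthetic scores where a higher score means a riskier applicant.
rng = np.random.default_rng(1)
scores = rng.uniform(size=1000)
defaults = (rng.uniform(size=1000) < scores).astype(int)
thr, pnl = optimal_threshold(scores, defaults)
```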
- Model monitoring - exploration vs exploitation analysis: This code simulates future data to demonstrate how fairness monitoring would work. For tracking disparate impact, 50% of the test samples are generated; for tracking absolute average odds difference, 1k random samples are used. 12 such snapshots are generated:
Script | Description | Input | Output |
---|---|---|---|
monitoring_simulation.py | Simulates data to demonstrate fairness monitoring | Data generated in step 5 is used as input | "DI_AOD_monitoring.csv" |
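The monitoring idea can be sketched as follows: generate a series of snapshots and track a fairness metric (here disparate impact) across them. This is a toy simulation on synthetic data; monitoring_simulation.py instead resamples the actual step-5 output, and the group, drift, and approval-rate assumptions below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulate 12 snapshots and track disparate impact (DI) over time.
rows = []
for month in range(1, 13):
    n = 1000
    gender = rng.choice(["M", "F"], size=n)
    # Assumed approval probabilities, with slight drift for group F.
    p = np.where(gender == "M", 0.60, 0.50 - 0.005 * month)
    approved = rng.uniform(size=n) < p
    di = approved[gender == "F"].mean() / approved[gender == "M"].mean()
    rows.append({"snapshot": month, "DI": di})

# Analogous in spirit to the "DI_AOD_monitoring.csv" output.
monitoring = pd.DataFrame(rows)
```

A DI trending away from 1 across snapshots is the signal such monitoring is designed to surface.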
- Helper functions:
Script | Description | Input | Output |
---|---|---|---|
utils.py | Sets paths and stores helper functions such as cost_benefit_analysis | NA | NA |
FairnessM.py | Master class to calculate fairness metrics and generate reports | NA | NA |
Tested on Python 3.8 (should also run on other Python 3 versions). Sample install command: pip install package==1.2.2
pandas==1.2.4 numpy==1.20.1 scikit-learn==0.24.1 matplotlib==3.3.4 seaborn==0.11.1 scipy==1.6.2 xgboost==1.3.3