# Fair Lending Analysis - Did Not Complete!

Fair lending analysis is an essential component of any credit modeling process. In order to analyze the equity impacts of a model, you need to be able to identify the protected demographic characteristics of customers, and then measure the model outcomes for different demographic groups. 

Your dataset contains two valuable nuggets of fair lending information: the applicant's ZIP code and their last name, which is embedded in the email address. Using these two data points, we can implement the Consumer Financial Protection Bureau's [BISG methodology](https://files.consumerfinance.gov/f/201409_cfpb_report_proxy-methodology.pdf) (Bayesian Improved Surname Geocoding) for estimating race. BISG requires several steps:
* Extract the last name from the email address provided in this data file
    * The capitalization scheme of the email addresses can help here
* Merge with the [Decennial Census Surname Files](https://www.census.gov/data/developers/data-sets/surnames.html), which contain race by surname
* Calculate protected-class demographics by ZIP code using federal surveys such as the [2020 American Community Survey](https://www.census.gov/programs-surveys/acs/data.html)
* Merge these with the ZIP codes provided in the applicant data
* Combine the geographic probabilities with the surname probabilities using Bayesian updating rules
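
The final Bayesian updating step can be sketched as follows. This is a minimal illustration rather than the CFPB's reference implementation: the function name `bisg_posterior`, the three unnamed groups, and every number are assumptions made up for the example. The surname probabilities act as the prior, and the share of each group's population living in the applicant's ZIP code acts as the likelihood.

```python
import numpy as np

def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    p_race_given_surname: P(race | surname), one entry per race group.
    p_geo_given_race: P(ZIP | race), the share of each group's
        population that lives in the applicant's ZIP code.
    """
    prior = np.asarray(p_race_given_surname, dtype=float)
    likelihood = np.asarray(p_geo_given_race, dtype=float)
    unnormalized = prior * likelihood
    total = unnormalized.sum()
    if total == 0:
        # Assumption: if the two sources never overlap, keep the surname prior.
        return prior
    return unnormalized / total

# Invented numbers for three hypothetical groups, for illustration only.
surname_probs = [0.70, 0.20, 0.10]
geo_shares = [0.0001, 0.0004, 0.0001]
posterior = bisg_posterior(surname_probs, geo_shares)
```

In this toy case the second group is over-represented in the ZIP code, so its posterior probability rises above its surname-only probability.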

This will provide each applicant's estimated race probabilities, which can be used to generate key statistics such as:
* Adverse Impact Ratio: the approval rate of the group with the lowest approval rate divided by the approval rate of the group with the highest
* False Positive assessment: Is one race more likely than another to be falsely predicted to default?
* False Negative assessment: Similarly, is one race more likely to be falsely predicted as safe?
* See [this paper](https://academic.oup.com/oxrep/article-abstract/37/3/585/6374682?redirectedFrom=fulltext#303976608) for a more comprehensive definition of fair lending statistics across performance parity, separation, and sufficiency
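
A rough sketch of how the first three statistics could be computed once race probabilities exist. Everything here is hypothetical: the column names (`p_group_a`, `approved`, `predicted_default`, `defaulted`), the toy data, and the choice to weight each applicant by their estimated group probability instead of assigning a single race. The Adverse Impact Ratio is taken as lowest approval rate over highest, which is also an assumption about direction.

```python
import pandas as pd

def group_rates(df, group_prob_cols):
    """Probability-weighted approval, false positive, and false negative rates."""
    actual_good = df["defaulted"] == 0
    actual_bad = df["defaulted"] == 1
    pred_bad = df["predicted_default"] == 1
    rows = {}
    for col in group_prob_cols:
        w = df[col]  # each applicant's estimated probability of belonging to the group
        rows[col] = {
            "approval_rate": (w * df["approved"]).sum() / w.sum(),
            # Share of non-defaulters who were predicted to default.
            "false_positive_rate": (w * (pred_bad & actual_good)).sum() / (w * actual_good).sum(),
            # Share of defaulters who were predicted to be safe.
            "false_negative_rate": (w * (~pred_bad & actual_bad)).sum() / (w * actual_bad).sum(),
        }
    return pd.DataFrame(rows).T

def adverse_impact_ratio(approval_rates):
    """Lowest group approval rate divided by the highest."""
    return approval_rates.min() / approval_rates.max()

# Invented toy data: six applicants, two hypothetical groups.
toy = pd.DataFrame({
    "p_group_a": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    "p_group_b": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    "approved": [1, 1, 0, 1, 0, 0],
    "predicted_default": [0, 0, 1, 0, 1, 1],
    "defaulted": [0, 1, 0, 0, 0, 1],
})
rates = group_rates(toy, ["p_group_a", "p_group_b"])
air = adverse_impact_ratio(rates["approval_rate"])
```

With real BISG output the weights would be fractional, so each applicant would contribute partially to every group's rates.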

Implementing this estimation and assessment methodology will show which models are the fairest, and how well the models perform on fairness overall. 

I would also be interested in implementing the XGBoost fair lending regularizer proposed by [this paper](https://arxiv.org/pdf/2009.01442.pdf). The authors report reducing the approval rate gap between protected groups by 50% or more on several different datasets.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 50)

In [2]:
# Import data and data dictionary
data = pd.read_pickle('output_data/01_data.pkl')
data_dict = pd.read_pickle('output_data/01_data_dict.pkl')

In [5]:
fair_lending_data = data.copy()

## Extracting Surnames

One variable that offers tremendous insight into borrower race is surname. If you know somebody's last name, you can link it to the Census Bureau's tabulation of racial demographics by surname for the 1000 most popular surnames in the US. The data does not contain a surname field, though, so the surname has to be extracted from the email address.
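
To illustrate the linking step, here is a small sketch of a left merge between applicant surnames and a surname table. The two-row table is a stand-in with invented percentages; its column names (`name`, `pctwhite`, `pcthispanic`) are assumptions modeled on the Census file layout, and `surname_key` is a helper column introduced only for this example.

```python
import pandas as pd

# Tiny stand-in for the Census surname table; the values are invented.
census_surnames = pd.DataFrame({
    "name": ["SMITH", "GARCIA"],
    "pctwhite": [70.0, 5.0],
    "pcthispanic": [2.0, 92.0],
})

applicants = pd.DataFrame({"surname": ["Smith", "garcia ", "Zzyzx"]})

# Normalize case and whitespace so the keys line up with the uppercase table.
applicants["surname_key"] = applicants["surname"].str.strip().str.upper()

# A left merge keeps every applicant; indicator=True flags the unmatched ones.
merged = applicants.merge(
    census_surnames,
    how="left",
    left_on="surname_key",
    right_on="name",
    indicator=True,
)
unmatched = (merged["_merge"] == "left_only").sum()
```

Counting the unmatched rows shows how many applicants would need a fallback, such as geography-only probabilities.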

#### Method 1: Surname by Name Case in Email Address

My first hypothesis is that the last name within an email address can be identified by the fact that it begins at the third and last capital letter in the address.
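
As a rough illustration of this hypothesis, the sketch below pulls out the run of letters that starts at the last capital before the `@`. The function name `surname_from_last_capital` and the sample address are hypothetical; the actual email format in the data may differ, so treat this as one possible reading of the hypothesis rather than the final method.

```python
import re

def surname_from_last_capital(email):
    """Return the letters starting at the last capital in the local part, or None."""
    local_part = email.split("@", 1)[0]
    # The greedy .* pushes the captured group to the last capital letter,
    # and [a-z]* then collects the lowercase letters that follow it.
    match = re.match(r".*([A-Z][a-z]*)", local_part)
    return match.group(1) if match else None

# Hypothetical address in a First-Middle-Last capitalization pattern.
example = surname_from_last_capital("JohnQSmith42@example.com")
```

Addresses with no capital letters return `None`, so they can be set aside for a different extraction method.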

Let's count the capitals. There should always be three. 

In [9]:
# Count the uppercase characters in a string
def count_capitals(x):
    return sum(1 for ch in x if ch.isupper())

In [10]:
# Add a column with the number of capital letters in each email address
fair_lending_data['capitals'] = fair_lending_data['email'].apply(count_capitals)

In [16]:
# Count the addresses that do not have exactly three capital letters
not_three_caps = (fair_lending_data['capitals'] != 3).sum()
print(f"Addresses with more or less than three capital letters: {not_three_caps}")

Addresses with more or less than three capital letters: 17
