# Introduction

With more time, I would implement the XGBoost fair lending regularizer proposed in [this paper](https://arxiv.org/pdf/2009.01442.pdf). The authors report reducing the approval rate gap between protected groups by 50% or more on several different datasets.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 50)

In [2]:
# Import data and data dictionary
data = pd.read_pickle('output_data/01_data.pkl')
data_dict = pd.read_pickle('output_data/01_data_dict.pkl')

In [5]:
fair_lending_data = data.copy()

## Extracting Surnames

One variable that offers considerable insight into borrower race is surname. If you know somebody's last name, you can link it to the Census Bureau's tabulation of racial demographics by surname, which covers the 1000 most common surnames in the US. To get the surname, though, we first need to extract it from the email address.
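
As a rough illustration of the linking step, the sketch below joins a surname column to a surname-level demographics table with a left merge. Everything in it is a placeholder: the surnames, the column names (`name`, `pct_group_a`, `pct_group_b`), and the percentages are invented for the example and are not real Census figures, and the `borrowers` frame stands in for data that already has a surname extracted.

```python
import pandas as pd

# Toy stand-in for a surname-level demographics table.
# Names, columns and percentages are invented placeholders.
census_surnames = pd.DataFrame({
    "name": ["ALPHA", "BRAVO"],
    "pct_group_a": [60.0, 25.0],
    "pct_group_b": [40.0, 75.0],
})

# Hypothetical borrowers with an already-extracted surname.
borrowers = pd.DataFrame({
    "borrower_id": [1, 2, 3],
    "surname": ["Alpha", "bravo ", "Charlie"],
})

# Normalise case and whitespace so the join keys line up.
borrowers["surname_key"] = borrowers["surname"].str.strip().str.upper()

# A left join keeps every borrower; unmatched surnames get NaN shares.
linked = borrowers.merge(
    census_surnames,
    how="left",
    left_on="surname_key",
    right_on="name",
)

# Share of borrowers whose surname was found in the table.
match_rate = linked["pct_group_a"].notna().mean()
print(linked[["borrower_id", "surname", "pct_group_a", "pct_group_b"]])
print(f"Match rate: {match_rate:.0%}")
```

Tracking the match rate matters here because a table limited to the most common surnames will leave some borrowers without any demographic estimate.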

#### Method 1: Surname by Name Case in Email Address

My first hypothesis is that the last name in an email address begins at the third, and final, capital letter in the address.

Let's count the capitals. If the hypothesis holds, there should always be exactly three.

In [9]:
# Count the uppercase characters in a string
def count_capitals(x):
    return sum(1 for ch in x if ch.isupper())

In [10]:
# Add a column with the number of capitals in each email address
fair_lending_data['capitals'] = fair_lending_data['email'].apply(count_capitals)
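
The same count can probably be done without a Python-level loop by using the pandas string accessor. The sketch below compares the two approaches on a few hypothetical addresses (the sample emails are made up). One difference to keep in mind: the pattern `[A-Z]` only matches ASCII capitals, while `str.isupper` also counts non-ASCII uppercase letters.

```python
import pandas as pd

# Hypothetical sample addresses, not taken from the data set.
emails = pd.Series([
    "JohnQPublic@example.com",
    "janedoe@example.com",
    "ABcD@example.com",
])

# Vectorised count of ASCII capitals via a regular expression.
capitals_regex = emails.str.count(r"[A-Z]")

# Character-by-character count with str.isupper, for comparison.
capitals_loop = emails.apply(lambda s: sum(ch.isupper() for ch in s))

print(pd.DataFrame({"email": emails, "regex": capitals_regex, "loop": capitals_loop}))
```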

In [16]:
not_three_caps = (fair_lending_data['capitals'] != 3).sum()
print(f"Addresses with more or fewer than three capital letters: {not_three_caps}")

Addresses with more or fewer than three capital letters: 17
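
For the addresses that do follow the three-capital pattern, one way to pull out the surname is sketched below. It assumes a format such as `JohnQPublic@example.com` (first name, middle initial, last name), which is a guess at the layout rather than something confirmed by the data. The helper `extract_surname` is hypothetical, and unlike the count above it only looks at the part before the `@`.

```python
import re
from typing import Optional


def extract_surname(email: str) -> Optional[str]:
    """Return the chunk starting at the third capital of the local part, or None."""
    # Only look at the part before the "@" (an assumption of this sketch).
    local_part = email.split("@", 1)[0]
    # Each chunk is one ASCII capital followed by any lowercase letters.
    chunks = re.findall(r"[A-Z][a-z]*", local_part)
    # Refuse to guess when the pattern does not hold.
    if len(chunks) != 3:
        return None
    return chunks[2]


# Hypothetical addresses used only to exercise the helper.
samples = [
    "JohnQPublic@example.com",
    "JohnQPublic99@example.com",
    "janedoe@example.com",
    "MaryAnnVanDyke@example.com",
]
surnames = [extract_surname(s) for s in samples]
print(surnames)
```

Returning `None` for addresses that break the pattern keeps those cases visible, so they can be reviewed by hand or handled by a different method.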


In [None]:
## Age and Length of Credit History
## Also analyze duration of email, bank acct, and residence
