We seek to answer the question:
### Is algorithmic lending racially biased?


Before we look for loan biases in this housing data we must import and clean the data set so that we can perform analyses. <font color='red'>We need a data dictionary, I have found some that aren't great and don't totally match our data set but are released by HMDA affiliates.</font> 
### Import Data

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import hashlib
from functools import reduce
import enum

# Get the total number of rows in the data set prior to filtering out bad, missing, or corrupt lines
# use the number to compare the size of the data set after filtering 
columnNames = []
with open('headers.txt', 'r') as headerFile:
    headerReader = csv.reader(headerFile, delimiter=',')
    for row in headerReader:
        columnNames.append(row[1])
        
numCols = len(columnNames)

https://www.ffiec.gov/hmda/glossary.htm contains explanations of many columns and acronyms

In [2]:
invalidCols = 0; duplicateRows = 0; keptRows = 0; missingCols = 0; totalRows = 0
onHeader = True
rows = set()
with open('hmda_lar.csv', 'r') as dataFile:
    with open('valid_rows_sample_small.csv', 'w') as outFile:
        dataReader = csv.reader(dataFile, delimiter=',')
        outWriter = csv.writer(outFile, delimiter = ',')
        for row in dataReader:
            # Skip the header line
            totalRows += 1
            # Ignore rows with incorrect number of columns
            if len(row) != numCols:
                invalidCols += 1
                continue 
            else:
                # Ignore rows where more than 1/2 of the entries are missing
                # Count the number of nan's in a row
                missingFields = reduce(lambda x, y: x + int(y == ""), row, 0) # do not change "" to ''
                if missingFields >= int(0.5 * numCols):
                    missingCols += 1
                    continue
                else:
                    keptRows += 1
                    outWriter.writerow(row)
print("Dropped: %d Missing: %d   Kept: %d   Total: %d" % (invalidCols, missingCols,
                                                                             keptRows, totalRows))

# If we only drop duplicates that match on all fields these are the results.   
# Dropped:     Duplicates:     De-duplicated:     Total:    

Dropped: 0 Missing: 0   Kept: 439655   Total: 439655


In [3]:
df_dup = pd.read_csv("valid_rows_sample_small.csv", sep=',', engine='python', error_bad_lines=False, dtype='unicode')

In [4]:
df_dedup = df_dup.drop_duplicates(keep='first');
duplicateRows = df_dup.shape[0]- df_dedup.shape[0]
print("Duplicates: %d" % duplicateRows)

Duplicates: 0


In [5]:
# Guarantees all rows are accounted for after filtering data
invalidCols + duplicateRows + missingCols + keptRows == totalRows

True

The following object contains suggested data types for the corresponding columns. The column headers not in this object are best represented as strings

In [6]:
colToType = {
    "tract_to_msamd_income" : float, 
    "rate_spread" : float,
    "population" : int,
    "minority_population" : bool,
    "number_of_owner_occupied_units" : int, 
    "number_of_1_to_4_family_units" : int, 
    "loan_amount_000s" : float, 
    "hud_median_family_income" : float,
    "applicant_income_000s" : float,
    "sequence_number" : int, 
    "census_tract_number" : float, 
    "as_of_year" : int,
    "application_date_indicator" : int,     
}

In [7]:
df_test = df_dedup
# Use Pandas drop_duplicates() as evidence that dataset is deduplicated
print("Deduplicated Valid Rows: %d\tFully Deduplicated: %r" 
      % (len(df_test), len(df_test) == len(df_test.drop_duplicates())))
print("Columns: %d" % len(df_test.columns.values))

# Convert types of columns
for colName, colType in colToType.items():
    if colType == int:
        df_test[colName] = df_test[colName].apply(lambda x: x if x != 'nan' else 0).astype(int)
    if colType == float:
        df_test[colName] = df_test[colName].apply(lambda x: x if x != 'nan' else float('nan')).astype(float)

Deduplicated Valid Rows: 439654	Fully Deduplicated: True
Columns: 47


Some fields may have values that are incompatible types. This may occur when no data is stored for a variable, a user did not complete the course or course registration, or a column may contain multiple data types. A string representation of an age cannot be compared to a number. If a user inputted N/A, or left that field blank, it is interpreted differently as NA, na, NaN. In this data set, missing information is encoded as "Information not provided by applicant in mail, Internet, or telephone application" as well.

In [21]:
df_test.replace("nan", np.nan, inplace=True)
df_test.replace("None", np.nan, inplace=True)
df_test.replace("Information not provided by applicant in mail, Internet, or telephone application", np.nan, inplace=True)

In [9]:
# this data set is specific to New York State in 2015 so there is no need to keep the state name, year, and abbrevation NY
df_test.drop(["state_name","state_abbr", "as_of_year"],axis=1);

In [25]:
df_test.groupby('lien_status_name').respondent_id.count()

lien_status_name
Not applicable                    61490
Not secured by a lien             23620
Secured by a first lien          340272
Secured by a subordinate lien     14272
Name: respondent_id, dtype: int64

get rid of: application withdrawn, file closed for incompleteness

In [26]:
# list(df_test.action_taken_name.values).unique()
df_test.groupby('action_taken_name').respondent_id.count()

action_taken_name
Application approved but not accepted                   14180
Application denied by financial institution             79697
Application withdrawn by applicant                      39496
File closed for incompleteness                          16733
Loan originated                                        228054
Loan purchased by the institution                       61490
Preapproval request denied by financial institution         4
Name: respondent_id, dtype: int64

A bank loan that gets approved is considered "originated" and is indicated under the "action_taken_name" column. A loan may not originated due to 1 of 6 options: the loan application was approved but not accepted, application denied by financial institution, application withdrawn by applicant, file closed for incompleteness, loan purchased by the institution, preapproval request denied by finanical institution. We are only interested in analyzing if a loan application was submitted and if that application was approved or not approved. Therefore, we can remove columns that provide additional information about action taken following a loan that was not approved, or if an application was not completed/withdrawn.

In [31]:
df_test = df_test.drop(df_test[df_test.action_taken_name == "Application withdrawn by applicant"].index)
df_test = df_test.drop(df_test[df_test.action_taken_name == "File closed for incompleteness"].index)

In [32]:
df_test.groupby('action_taken_name').respondent_id.count()

action_taken_name
Application approved but not accepted                   14180
Application denied by financial institution             79697
Loan originated                                        228054
Loan purchased by the institution                       61490
Preapproval request denied by financial institution         4
Name: respondent_id, dtype: int64

add enums

In [12]:
# See the different reasons up to 3 per application for why a loan was not originated
df_test.groupby('denial_reason_name_1').apply(lambda x: x.nunique())

Unnamed: 0_level_0,tract_to_msamd_income,rate_spread,population,minority_population,number_of_owner_occupied_units,number_of_1_to_4_family_units,loan_amount_000s,hud_median_family_income,applicant_income_000s,state_name,...,applicant_sex_name,applicant_race_name_5,applicant_race_name_4,applicant_race_name_3,applicant_race_name_2,applicant_race_name_1,applicant_ethnicity_name,agency_name,agency_abbr,action_taken_name
denial_reason_name_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Collateral,3163,0,2778,2708,1743,1977,1016,15,696,1,...,4,0,0,3,5,7,4,6,6,1
Credit application incomplete,2712,0,2433,2381,1627,1829,998,15,661,1,...,4,0,0,0,5,7,4,6,6,1
Credit history,3399,0,2945,2869,1775,2031,805,15,569,1,...,4,0,1,3,5,7,4,6,6,1
Debt-to-income ratio,3477,0,2999,2975,1791,2064,1174,15,569,1,...,4,1,1,2,5,7,4,6,6,2
Employment history,612,0,610,570,541,569,355,15,191,1,...,3,0,0,0,2,6,4,6,6,1
"Insufficient cash (downpayment, closing costs)",1074,0,1044,1008,869,953,591,15,352,1,...,4,0,0,0,2,7,4,6,6,1
Mortgage insurance denied,126,0,127,127,122,124,93,12,83,1,...,3,0,0,0,1,6,3,6,6,1
Other,2438,0,2224,2170,1550,1708,816,15,557,1,...,4,0,0,1,5,7,4,6,6,2
Unverifiable information,1379,0,1322,1288,1069,1102,652,15,380,1,...,4,0,0,1,3,7,4,6,6,1


In [13]:
# Add back in if we get rid of denial reasons and only focus on loan originated/not loan originated
# Recording reasons for denial is optional, except for institutions supervised by the Office of Thrift Supervision (OTS)* or the Office of the Comptroller of the Currency (OCC).
#df_test.drop(["denial_reason_name_1","denial_reason_name_2", "denial_reason_name_3"],axis=1);

In [20]:
df_test.groupby('agency_abbr').respondent_id.count()#apply(lambda x: x.nunique())
# 6 different agencies can approve these loans

agency_abbr
CFPB    177762
FDIC     15555
FRS      10211
HUD     150441
NCUA     50944
OCC      34741
Name: respondent_id, dtype: int64

<font color='red'>ADD: We should drop more that goes beyond approved/not approved. Thoughts: lien status is for collateral- is this too specific? </font> 

In order to process this data and model trends in loan biases, we will only work with numeric entries. Therefore, we must encode categorical columns with numbers.

In [15]:
# potential columns be encoded: 
# action_taken_name: 0 originated, 1 not originated
# agency_name: 
# action_taken: 
# applicant_race_name_1:
# applicant_race_name_2:
# applicant_race_name_3:
# applicant_race_name_4: 
# applicant_race_name_5: 
# applicant_sex_name: 0 male, 1 female
# co_applicant_ethnicity_name:
# co_applicant_race_name_1:
# co_applicant_race_name_1:
# co_applicant_race_name_1:
# co_applicant_sex_name:
# county_name:
# hoepa_status_name:
# lien_status_name:
# loan_purpose_name:
# loan_type_name:
# purchaser_type_name:

In [16]:
# encode categorical to numerical for processing
def encode_action(action_type, category):
    if action_type == category:
        return 0
    else: 
        return 1

In [17]:
df_encode = df_test.copy()
df_encode.action_taken_name = df_encode.action_taken_name.apply(lambda x: encode_action(x, 'Loan originated'))

We will bucket income into the standard US tax brackets found at https://web.blockadvisors.com/2017-tax-brackets/ in order to control for income and consider the impact of race on loan status.

In [53]:
df_test['applicant_income_000s'].describe()

count    328663.000000
mean        135.017757
std         249.788201
min           1.000000
25%          57.000000
50%          88.000000
75%         139.000000
max        9999.000000
Name: applicant_income_000s, dtype: float64

In [64]:
# the max applicant income reported is 9999 thousand and there are 37 applicants with this income. 
print("Number of applicants with reported income above $9.9 million:", 
      df_test[df_test['applicant_income_000s'] == 9999.000000].shape[0])

Number of applicants with reported income above $9.9 million: 37


In [66]:
# if you do not provide the maximum bucket value, all incomes that do not fall within these specific categories will be reported as NaN
df_encode['income_bracket'] = pd.cut(df_test['applicant_income_000s'], [0, 18, 75, 153, 233, 416, 470, 9999])

In [None]:
# download as csv for processing notebook 
df_encode.to_csv("encoded_loan_data.csv", index=False)