We seek to answer the question:
### Is algorithmic lending racially biased?


Before we look for loan biases in this housing data, we must import and clean the data set so that we can perform analyses. <font color='red'>We need a data dictionary. I have found some that aren't great and don't totally match our data set, but they are released by HMDA affiliates.</font> 
### Import Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import hashlib
from functools import reduce

# Read the column names from headers.txt (the second field of each line)
# and record how many columns a well-formed data row should have
columnNames = []
with open('headers.txt', 'r') as headerFile:
    headerReader = csv.reader(headerFile, delimiter=',')
    for row in headerReader:
        columnNames.append(row[1])
        
numCols = len(columnNames)

The glossary at https://www.ffiec.gov/hmda/glossary.htm explains many of the columns and acronyms.

In [2]:
invalidCols = 0; duplicateRows = 0; keptRows = 0; missingCols = 0; totalRows = 0
with open('hmda_lar.csv', 'r', newline='') as dataFile:
    with open('valid_rows_sample_small.csv', 'w', newline='') as outFile:
        dataReader = csv.reader(dataFile, delimiter=',')
        outWriter = csv.writer(outFile, delimiter=',')
        for row in dataReader:
            # Count every line read; the header line is not skipped here,
            # so it is counted and filtered like any other row
            totalRows += 1
            # Ignore rows with incorrect number of columns
            if len(row) != numCols:
                invalidCols += 1
                continue
            else:
                # Ignore rows where at least half of the entries are missing
                # Count the number of empty fields in a row
                missingFields = reduce(lambda x, y: x + int(y == ""), row, 0) # do not change "" to ''
                if missingFields >= int(0.5 * numCols):
                    missingCols += 1
                    continue
                else:
                    keptRows += 1
                    outWriter.writerow(row)
print("Dropped: %d Missing: %d   Kept: %d   Total: %d" % (invalidCols, missingCols,
                                                                             keptRows, totalRows))

# If we only drop duplicates that match on all fields these are the results.   
# Dropped:     Duplicates:     De-duplicated:     Total:    

Dropped: 0 Missing: 0   Kept: 439655   Total: 439655
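
The filtering rule used in the cell above can be sketched as a small standalone function. This is only an illustration on in-memory rows, not the notebook's actual pipeline; the name `classify_row` and the toy rows are hypothetical.

```python
def classify_row(row, num_cols):
    """Return 'invalid', 'missing', or 'kept' for one parsed CSV row."""
    # A row with the wrong number of columns is treated as malformed
    if len(row) != num_cols:
        return "invalid"
    # csv.reader reports blank fields as empty strings, so count those
    missing_fields = sum(1 for field in row if field == "")
    # Same threshold as the cell above: int(0.5 * num_cols) or more blanks
    if missing_fields >= int(0.5 * num_cols):
        return "missing"
    return "kept"


# Hypothetical 4-column rows, for illustration only
toy_rows = [
    ["2015", "NY", "100", "White"],  # complete row
    ["2015", "", "", "White"],       # 2 of 4 fields blank
    ["2015", "NY", "100"],           # too few columns
]
labels = [classify_row(r, 4) for r in toy_rows]
print(labels)
```

Keeping the rule in one function makes it easy to check the threshold on a few hand-written rows before running it over the full file.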


In [3]:
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df_dup = pd.read_csv("valid_rows_sample_small.csv", sep=',', engine='python', on_bad_lines='skip', dtype=str)

In [4]:
df_dedup = df_dup.drop_duplicates(keep='first')
duplicateRows = df_dup.shape[0] - df_dedup.shape[0]
print("Duplicates: %d" % duplicateRows)

Duplicates: 0


In [5]:
# Guarantees all rows are accounted for after filtering data
invalidCols + duplicateRows + missingCols + keptRows == totalRows

True

The following dictionary maps columns to suggested data types. Columns not listed in it are best represented as strings.

In [6]:
colToType = {
    "tract_to_msamd_income" : float, 
    "rate_spread" : float,
    "population" : int,
    "minority_population" : float,  # a percentage in the HMDA data, not a true/false flag
    "number_of_owner_occupied_units" : int, 
    "number_of_1_to_4_family_units" : int, 
    "loan_amount_000s" : float, 
    "hud_median_family_income" : float,
    "applicant_income_000s" : float,
    "sequence_number" : int, 
    "census_tract_number" : float, 
    "as_of_year" : int,
    "application_date_indicator" : int,     
}

In [7]:
df_test = df_dedup.copy()  # work on a copy so the conversions below do not touch df_dedup
# Use Pandas drop_duplicates() as evidence that dataset is deduplicated
print("Deduplicated Valid Rows: %d\tFully Deduplicated: %r" 
      % (len(df_test), len(df_test) == len(df_test.drop_duplicates())))
print("Columns: %d" % len(df_test.columns.values))

# Convert types of columns. Missing or unparseable entries become NaN;
# for integer columns they are then filled with 0.
for colName, colType in colToType.items():
    numeric = pd.to_numeric(df_test[colName], errors='coerce')
    if colType == int:
        df_test[colName] = numeric.fillna(0).astype(int)
    elif colType == float:
        df_test[colName] = numeric.astype(float)

Deduplicated Valid Rows: 439654	Fully Deduplicated: True
Columns: 47


Some fields may hold values of incompatible types. This can occur when no data is stored for a variable or when a column contains multiple data types. A string representation of a number cannot be compared to a number. If an entry was recorded as N/A or left blank, it may appear in several forms, such as NA, na, or NaN, and these need to be normalized to a single missing value.

In [8]:
df_test.replace("nan", np.nan, inplace=True)
df_test.replace("None", np.nan, inplace=True)
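
As a minimal sketch of this kind of normalization, the toy example below maps several spellings of "missing" to a single NaN and then converts the remaining values to numbers. The series contents and the `missing_markers` list are hypothetical; the real columns may use other markers.

```python
import pandas as pd

# Hypothetical column mixing numbers-as-strings with several "missing" spellings
raw = pd.Series(["52", "NA", "na", "NaN", "", "N/A", "87"])

# Markers assumed to mean "no data"; adjust to what the real file contains
missing_markers = ["NA", "na", "NaN", "nan", "None", "N/A", ""]

# mask() replaces every entry where the condition is True with NaN
cleaned = raw.mask(raw.isin(missing_markers))

# Convert what is left to numbers; anything unparseable also becomes NaN
numeric = pd.to_numeric(cleaned, errors="coerce")

n_missing = int(numeric.isna().sum())
present = numeric.dropna().tolist()
print(n_missing, present)
```

Once every marker is a true NaN, pandas methods such as `isna()` and `dropna()` treat the missing entries consistently.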

In [9]:
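# This data set is specific to New York State in 2015, so there is no need to keep the state name, year, and abbreviation NY.
# Note: drop() returns a new DataFrame; the result is not assigned here, so df_test itself is unchanged.
df_test.drop(["state_name","state_abbr", "as_of_year"],axis=1);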

In [10]:
df_test.groupby('lien_status_name').apply(lambda x: x.nunique())

lien_status_name,tract_to_msamd_income,rate_spread,population,minority_population,number_of_owner_occupied_units,number_of_1_to_4_family_units,loan_amount_000s,hud_median_family_income,applicant_income_000s,state_name,...,applicant_sex_name,applicant_race_name_5,applicant_race_name_4,applicant_race_name_3,applicant_race_name_2,applicant_race_name_1,applicant_ethnicity_name,agency_name,agency_abbr,action_taken_name
Not applicable,3809,0,3227,3238,1837,2137,1345,15,693,1,...,4,0,0,1,3,7,4,6,6,1
Not secured by a lien,3354,0,2924,2801,1762,2017,118,15,463,1,...,4,0,1,3,5,7,4,5,5,5
Secured by a first lien,3956,563,3349,3339,1851,2162,3192,15,2552,1,...,4,3,3,5,5,7,4,6,6,6
Secured by a subordinate lien,3011,277,2675,2565,1703,1913,523,15,610,1,...,4,1,1,2,5,7,4,6,6,5


In [11]:
# df_test.action_taken_name.unique()
df_test.groupby('action_taken_name').apply(lambda x: x.nunique())

action_taken_name,tract_to_msamd_income,rate_spread,population,minority_population,number_of_owner_occupied_units,number_of_1_to_4_family_units,loan_amount_000s,hud_median_family_income,applicant_income_000s,state_name,...,applicant_sex_name,applicant_race_name_5,applicant_race_name_4,applicant_race_name_3,applicant_race_name_2,applicant_race_name_1,applicant_ethnicity_name,agency_name,agency_abbr,action_taken_name
Application approved but not accepted,3304,0,2906,2825,1768,2017,1149,15,814,1,...,4,1,1,3,5,7,4,6,6,1
Application denied by financial institution,3942,0,3324,3306,1848,2164,1726,15,1202,1,...,4,2,2,5,5,7,4,6,6,1
Application withdrawn by applicant,3753,0,3203,3195,1829,2131,1672,15,1260,1,...,4,1,1,2,5,7,4,6,6,1
File closed for incompleteness,3418,0,2970,2936,1786,2051,1290,15,892,1,...,4,1,1,2,5,7,4,6,6,1
Loan originated,3945,638,3339,3331,1848,2160,2870,15,2194,1,...,4,2,3,5,5,7,4,6,6,1
Loan purchased by the institution,3809,0,3227,3238,1837,2137,1345,15,693,1,...,4,0,0,1,3,7,4,6,6,1
Preapproval request denied by financial institution,4,0,4,4,4,4,4,2,4,1,...,2,0,0,0,0,2,1,2,2,1


A bank loan that gets approved is considered "originated", as indicated in the "action_taken_name" column. A loan may not be originated for one of six reasons: the application was approved but not accepted, the application was denied by the financial institution, the application was withdrawn by the applicant, the file was closed for incompleteness, the loan was purchased by the institution, or a preapproval request was denied by the financial institution (to be discussed). We are only interested in analyzing whether a loan was approved or not approved. Therefore, we can remove columns that provide additional detail about the action taken on a loan that was not approved.

In [12]:
# See the different reasons (up to 3 per application) for why a loan was not originated
df_test.groupby('denial_reason_name_3').apply(lambda x: x.nunique())

denial_reason_name_3,tract_to_msamd_income,rate_spread,population,minority_population,number_of_owner_occupied_units,number_of_1_to_4_family_units,loan_amount_000s,hud_median_family_income,applicant_income_000s,state_name,...,applicant_sex_name,applicant_race_name_5,applicant_race_name_4,applicant_race_name_3,applicant_race_name_2,applicant_race_name_1,applicant_ethnicity_name,agency_name,agency_abbr,action_taken_name
Collateral,252,0,252,250,243,246,204,14,154,1,...,3,0,0,0,2,5,3,6,6,1
Credit application incomplete,80,0,80,80,77,79,70,12,60,1,...,4,0,0,0,0,6,4,5,5,1
Credit history,383,0,384,373,353,359,251,15,164,1,...,4,0,0,0,2,7,4,6,6,1
Debt-to-income ratio,314,0,310,309,292,290,193,13,155,1,...,3,0,0,0,1,6,3,6,6,1
Employment history,43,0,45,44,44,45,40,11,39,1,...,3,0,0,0,0,4,3,6,6,1
"Insufficient cash (downpayment, closing costs)",285,0,291,283,272,278,225,13,158,1,...,4,0,0,0,1,7,4,5,5,1
Mortgage insurance denied,25,0,25,25,25,25,26,8,24,1,...,3,0,0,0,0,3,3,6,6,1
Other,599,0,594,597,557,573,382,14,223,1,...,4,0,0,0,4,7,4,6,6,1
Unverifiable information,143,0,143,144,139,141,130,10,92,1,...,3,0,0,0,0,5,3,6,6,1


In [13]:
# Add back in if we get rid of denial reasons and only focus on loan originated/not loan originated
# Recording reasons for denial is optional, except for institutions supervised by the Office of Thrift Supervision (OTS)* or the Office of the Comptroller of the Currency (OCC).
#df_test.drop(["denial_reason_name_1","denial_reason_name_2", "denial_reason_name_3"],axis=1);

In [14]:
df_test.groupby('agency_abbr').apply(lambda x: x.nunique())
# 6 different agencies can approve these loans

agency_abbr,tract_to_msamd_income,rate_spread,population,minority_population,number_of_owner_occupied_units,number_of_1_to_4_family_units,loan_amount_000s,hud_median_family_income,applicant_income_000s,state_name,...,applicant_sex_name,applicant_race_name_5,applicant_race_name_4,applicant_race_name_3,applicant_race_name_2,applicant_race_name_1,applicant_ethnicity_name,agency_name,agency_abbr,action_taken_name
CFPB,3972,204,3347,3332,1852,2167,3061,15,2403,1,...,4,3,3,5,5,7,4,1,1,7
FDIC,3117,202,2724,2686,1711,1949,1150,15,735,1,...,4,0,0,1,4,7,4,1,1,6
FRS,2205,98,2060,1931,1477,1596,884,15,472,1,...,4,1,1,2,4,7,4,1,1,6
HUD,3877,475,3292,3284,1841,2150,1417,15,1125,1,...,4,1,2,4,5,7,4,1,1,6
NCUA,3323,396,2885,2832,1755,2005,939,15,726,1,...,4,1,2,2,4,7,4,1,1,7
OCC,3556,381,3070,3041,1806,2088,1348,15,909,1,...,4,0,1,3,5,7,4,1,1,6


<font color='red'>ADD: We should drop more columns that go beyond approved/not approved. Thoughts: lien status relates to collateral; is this too specific? </font> 

In order to process this data and model trends in loan biases, we will only work with numeric entries. Therefore, we must encode categorical columns with numbers.
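
One way to do this kind of encoding is sketched below on a toy frame. The column names come from this data set, but the values shown are hypothetical examples. The explicit mapping for `applicant_sex_name` follows the "0 male, 1 female" convention noted in this notebook; using `pd.factorize` for the other column is an assumption, not a choice the notebook has made.

```python
import pandas as pd

# Hypothetical values; the real columns contain more categories
toy = pd.DataFrame({
    "applicant_sex_name": ["Male", "Female", "Female", "Male"],
    "loan_purpose_name": ["Home purchase", "Refinancing",
                          "Home purchase", "Home improvement"],
})
encoded = toy.copy()

# Explicit mapping when the codes should carry a fixed meaning
sex_codes = {"Male": 0, "Female": 1}
encoded["applicant_sex_name"] = toy["applicant_sex_name"].map(sex_codes)

# factorize assigns one integer per distinct label; sort=True numbers the
# labels in sorted order rather than in order of first appearance
codes, labels = pd.factorize(toy["loan_purpose_name"], sort=True)
encoded["loan_purpose_name"] = codes

# Keep the code book so the integers can be translated back later
purpose_codes = dict(enumerate(labels))
print(encoded)
print(purpose_codes)
```

Note that `map` leaves any label missing from the dictionary as NaN, and `factorize` marks missing values with -1, so both cases should be checked before modeling.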

In [15]:
# potential columns to be encoded: 
# action_taken_name: 0 originated, 1 not originated
# agency_name: 
# action_taken: 
# applicant_race_name_1:
# applicant_race_name_2:
# applicant_race_name_3:
# applicant_race_name_4: 
# applicant_race_name_5: 
# applicant_sex_name: 0 male, 1 female
# co_applicant_ethnicity_name:
# co_applicant_race_name_1:
# co_applicant_race_name_2:
# co_applicant_race_name_3:
# co_applicant_sex_name:
# county_name:
# hoepa_status_name:
# lien_status_name:
# loan_purpose_name:
# loan_type_name:
# purchaser_type_name:

In [16]:
# encode categorical to numerical for processing
def encode_action(action_type, category):
    """Return 0 if action_type equals category, otherwise 1."""
    return 0 if action_type == category else 1

In [20]:
df_encode = df_test.copy()
df_encode.action_taken_name = df_encode.action_taken_name.apply(lambda x: encode_action(x, 'Loan originated'))
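
With the outcome encoded as 0 (originated) and 1 (not originated), the mean of the column within a group is that group's share of applications not originated. The sketch below shows the idea on hypothetical toy rows; the group labels "A" and "B" are placeholders, not values from the data set.

```python
import pandas as pd

# Hypothetical toy rows; 0 = loan originated, 1 = not originated
toy = pd.DataFrame({
    "applicant_race_name_1": ["A", "A", "B", "B", "B", "A"],
    "action_taken_name": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of 1s in each group
not_originated_rate = (
    toy.groupby("applicant_race_name_1")["action_taken_name"].mean()
)
print(not_originated_rate)
```

A raw rate like this does not account for income, loan amount, or other factors, so it is only a starting point for the analysis.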

In [27]:
# save as CSV for the processing notebook
df_encode.to_csv("encoded_loan_data.csv", index=False)