We seek to answer the question:
### Is lending racially discriminatory in the US?

Before we look for loan biases in this housing data, we must import and clean the data set so that we can analyze it. We will use our data dictionary to understand the meaning of each column.
### Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import hashlib
from functools import reduce
import enum

# Read the column names from the header file. The number of columns is used below
# to detect malformed rows when filtering out bad, missing, or corrupt lines.
columnNames = []
with open('headers.txt', 'r') as headerFile:
    headerReader = csv.reader(headerFile, delimiter=',')
    for row in headerReader:
        columnNames.append(row[1])
        
numCols = len(columnNames)

In addition to the data dictionary in this repository, https://www.ffiec.gov/hmda/glossary.htm contains explanations of columns and acronyms.

In [None]:
# Write out clean lines to new csv
invalidCols = 0; duplicateRows = 0; keptRows = 0; missingCols = 0; totalRows = 0
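# newline='' is what the csv module documentation recommends for files passed to reader/writer
with open('hmda_lar.csv', 'r', newline='') as dataFile:
    with open('valid_rows_sample_small.csv', 'w', newline='') as outFile: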
        dataReader = csv.reader(dataFile, delimiter=',')
        outWriter = csv.writer(outFile, delimiter = ',')
        for row in dataReader:
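            # Count every line read, including the header line (it is not skipped, so a
            # well-formed header is written to the cleaned csv along with the data rows)
            totalRows += 1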
            # Ignore rows with incorrect number of columns
            if len(row) != numCols:
                invalidCols += 1
                continue 
            else:
                # Ignore rows where at least 1/2 of the entries are missing
                # Count the number of empty fields in a row
                missingFields = sum(field == "" for field in row) # do not change "" to ''
                if missingFields >= int(0.5 * numCols):
                    missingCols += 1
                    continue
                else:
                    keptRows += 1
                    outWriter.writerow(row)
print("Dropped: %d Missing: %d   Kept: %d   Total: %d" % (invalidCols, missingCols,
                                                                             keptRows, totalRows))

# If we only drop duplicates that match on all fields these are the results.   
# Dropped:     Duplicates:     De-duplicated:     Total:    
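
The row filter above can be tried on a small in-memory sample before running it on the full file. This is a minimal sketch, not the notebook's own code: `filter_rows`, `max_missing_share`, and the four-column sample are hypothetical names and data chosen for illustration.

```python
import csv
import io

def filter_rows(lines, num_cols, max_missing_share=0.5):
    """Return the kept rows plus counts of total, invalid, mostly-missing, and kept rows."""
    counts = {"total": 0, "invalid": 0, "missing": 0, "kept": 0}
    kept = []
    for row in csv.reader(lines):
        counts["total"] += 1
        if len(row) != num_cols:
            # Wrong number of columns
            counts["invalid"] += 1
        elif sum(field == "" for field in row) >= int(max_missing_share * num_cols):
            # At least half of the fields are empty
            counts["missing"] += 1
        else:
            counts["kept"] += 1
            kept.append(row)
    return kept, counts

# Hypothetical sample: header, complete row, mostly empty row, short row
sample = io.StringIO("a,b,c,d\n1,2,3,4\n1,,,\n1,2\n")
kept_rows, counts = filter_rows(sample, num_cols=4)
print(counts)
```

Because every row lands in exactly one of the three outcomes, the invalid, missing, and kept counts always sum to the total.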

In [None]:
# Read in new csv with clean lines
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df_dup = pd.read_csv("valid_rows_sample_small.csv", sep=',', engine='python', on_bad_lines='skip', dtype=str)

In [None]:
# Count the duplicate rows in the cleaned csv
df_dedup = df_dup.drop_duplicates(keep='first')
duplicateRows = df_dup.shape[0] - df_dedup.shape[0]
print("Duplicates: %d" % duplicateRows)

In [None]:
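# Check that every row read is accounted for after filtering. duplicateRows is not a separate
# term: duplicates are counted within keptRows, because they are only removed after the
# cleaned csv has been written.
invalidCols + missingCols + keptRows == totalRows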

The following dictionary contains suggested data types for the corresponding columns. Columns not listed in it are best represented as strings.

In [None]:
colToType = {
    "tract_to_msamd_income" : float, 
    "rate_spread" : float,
    "population" : int,
    "minority_population" : float,  # a percentage of the tract population, not a flag
    "number_of_owner_occupied_units" : int, 
    "number_of_1_to_4_family_units" : int, 
    "loan_amount_000s" : float, 
    "hud_median_family_income" : float,
    "applicant_income_000s" : float,
    "sequence_number" : int, 
    "census_tract_number" : float, 
    "as_of_year" : int,
    "application_date_indicator" : int,     
}

In [None]:
# Convert column types
df_test = df_dedup.copy()  # copy so that the conversions below do not modify df_dedup
# Use Pandas drop_duplicates() as evidence that dataset is deduplicated
print("Deduplicated Valid Rows: %d\tFully Deduplicated: %r" 
      % (len(df_test), len(df_test) == len(df_test.drop_duplicates())))
print("Columns: %d" % len(df_test.columns.values))

# Convert types of columns. pd.to_numeric with errors='coerce' turns missing or
# unparseable entries into NaN; int columns then use 0 in place of NaN.
for colName, colType in colToType.items():
    if colType == int:
        df_test[colName] = pd.to_numeric(df_test[colName], errors='coerce').fillna(0).astype(int)
    elif colType == float:
        df_test[colName] = pd.to_numeric(df_test[colName], errors='coerce').astype(float)

Some fields hold values of incompatible types. This can happen when no data is stored for a variable, when an applicant did not complete the application, or when a column mixes several data types. For example, a string representation of an age cannot be compared to a number. A field that an applicant left blank or filled in as N/A may also be encoded inconsistently, for instance as NA, na, or NaN. In this data set, missing information is also encoded as "Information not provided by applicant in mail, Internet, or telephone application".
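
As a rough illustration of the problem described above, the sketch below maps several textual markers for missing data to a single `NaN` value so that a column can then be converted to numbers. The toy frame and the `MISSING_MARKERS` list are assumptions made for this example; the exact markers present in the real file should be checked before relying on such a list.

```python
import numpy as np
import pandas as pd

# Markers treated as missing. "nan", "None", and "Not applicable" appear in this notebook;
# the remaining variants are assumed for illustration.
MISSING_MARKERS = ["nan", "None", "Not applicable", "NA", "na", "N/A", ""]

# Hypothetical toy data with inconsistent encodings of missing values
toy = pd.DataFrame({
    "applicant_income_000s": ["85", "N/A", "nan", "120"],
    "applicant_sex_name": ["Male", "Female", "Not applicable", "Male"],
})

# Replace every marker with NaN, then convert the income column to numbers
cleaned = toy.replace(MISSING_MARKERS, np.nan)
cleaned["applicant_income_000s"] = pd.to_numeric(cleaned["applicant_income_000s"])
print(cleaned.isna().sum())
```

Once all markers share one representation, `isna()` and numeric comparisons behave consistently across columns.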

In [None]:
# Replace inconsistent markers for missing entries with NaN
df_test.replace("nan", np.nan, inplace=True)
df_test.replace("None", np.nan, inplace=True)
df_test.replace("Not applicable", np.nan, inplace = True)
df_test.replace("Information not provided by applicant in mail, Internet, or telephone application", np.nan, inplace=True)

In [None]:
# This data set is specific to New York State in 2015, so there is no need to keep the state name, state abbreviation (NY), or year
# drop() returns a new DataFrame, so the result must be assigned back
df_test = df_test.drop(["state_name", "state_abbr", "as_of_year"], axis=1)

Next we want to look at the counts for different columns. This will help us eliminate columns that are not well populated or rows that are too empty. Each group is counted by its sequence number, a unique identifier for each application.
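
One way to judge how well populated each column is, sketched here on a hypothetical toy frame, is to compute the share of non-missing entries per column. The `min_fill` threshold is an assumption for illustration, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names follow the HMDA naming used in this notebook
toy = pd.DataFrame({
    "sequence_number": [1, 2, 3, 4],
    "applicant_sex_name": ["Male", "Female", np.nan, "Male"],
    "co_applicant_race_name_5": [np.nan, np.nan, np.nan, "White"],
})

# Share of non-missing entries per column, sparsest first
fill_rate = toy.notna().mean().sort_values()

# Hypothetical cutoff: flag columns that are less than half populated
min_fill = 0.5
sparse_cols = fill_rate[fill_rate < min_fill].index.tolist()

print(fill_rate)
print("Sparse columns:", sparse_cols)
```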

In [None]:
df_test.groupby('applicant_sex_name').sequence_number.count()

There are about twice as many applicants recorded as male as applicants recorded as female in the dataset.

In [None]:
df_test.groupby('lien_status_name').sequence_number.count()

Lien status describes the collateral behind a loan. A lien secures the underlying obligation: if the obligation is not satisfied, the creditor can seize the asset that is subject to the lien.

Next we look at the action taken on each application. We will remove rows where the application was withdrawn or the file was closed for incompleteness.

In [None]:
df_test.groupby('action_taken_name').sequence_number.count()

A bank loan that gets approved is considered "originated", which is recorded in the "action_taken_name" column. A loan may not be originated for one of six reasons: the application was approved but not accepted, the application was denied by the financial institution, the application was withdrawn by the applicant, the file was closed for incompleteness, the loan was purchased by the institution, or the preapproval request was denied by the financial institution. We are only interested in whether a loan application was submitted and whether it was approved. Therefore, we can remove rows for applications that were withdrawn or never completed, since they carry no approval decision.

In [None]:
df_test = df_test.drop(df_test[df_test.action_taken_name == "Application withdrawn by applicant"].index)
df_test = df_test.drop(df_test[df_test.action_taken_name == "File closed for incompleteness"].index)

In [None]:
df_test.groupby('action_taken_name').sequence_number.count()

We treat loans that were originated or purchased by the institution as approvals. All other action types are considered rejections.
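
The approval rule above can be sketched as a binary flag, which also makes it easy to compare approval rates between groups. The toy rows are hypothetical, and `applicant_race_name_1` is an assumed column name that follows the naming pattern of the co-applicant columns in this data set.

```python
import pandas as pd

# Actions counted as approvals, following the rule stated above
APPROVED_ACTIONS = {"Loan originated", "Loan purchased by the institution"}

# Hypothetical toy data; applicant_race_name_1 is an assumed column name
toy = pd.DataFrame({
    "applicant_race_name_1": ["White", "White", "Asian", "Asian"],
    "action_taken_name": [
        "Loan originated",
        "Application denied by financial institution",
        "Loan purchased by the institution",
        "Application approved but not accepted",
    ],
})

# 1 if the action counts as an approval, 0 otherwise
toy["approved"] = toy["action_taken_name"].isin(APPROVED_ACTIONS).astype(int)

# The mean of a 0/1 flag is the approval rate within each group
approval_rate = toy.groupby("applicant_race_name_1")["approved"].mean()
print(approval_rate)
```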

In [None]:
# See the different reasons, up to 3 per application, for why a loan was not originated
df_test.groupby('denial_reason_name_1').apply(lambda x: x.nunique())

In [None]:
df_test.groupby('agency_abbr').sequence_number.count()

We see six different agencies in the data. In HMDA data, the agency is the regulator to which the lending institution reports.

In [None]:
df_test.groupby('property_type_name').describe()

We see there is no income data on multifamily dwellings, so we will drop these rows. 98.2% of the rows are for one-to-four-family dwellings, so we will also drop rows for manufactured housing. All remaining rows are then for one-to-four-family dwellings, so we can drop this column entirely.
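
A share such as the one quoted above can be read directly from `value_counts(normalize=True)`. The sketch below uses a hypothetical ten-row frame, so its percentages are not those of the real data, and the exact label of the one-to-four-family category is assumed.

```python
import pandas as pd

# Property types to drop, as named in this notebook
DROPPED_TYPES = ["Multifamily dwelling", "Manufactured housing"]

# Hypothetical toy data; the one-to-four-family label is an assumed spelling
toy = pd.DataFrame({
    "property_type_name": ["One-to-four family dwelling"] * 8 + DROPPED_TYPES
})

# Fraction of rows per property type, largest first
shares = toy["property_type_name"].value_counts(normalize=True)
print((shares * 100).round(1))

# Keep only the rows whose property type is not in the dropped list
kept = toy[~toy["property_type_name"].isin(DROPPED_TYPES)]
print(len(kept), "rows kept")
```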

In [None]:
df_test.groupby("loan_type_name").sequence_number.count()

In [None]:
df_test = df_test.drop(df_test[df_test.property_type_name == "Multifamily dwelling"].index)
df_test = df_test.drop(df_test[df_test.property_type_name == "Manufactured housing"].index)
df_test = df_test.drop(["property_type_name"], axis=1)  # assign back: drop() returns a new DataFrame

In [None]:
df_test.groupby('preapproval_name').sequence_number.count()

In [None]:
df_test.groupby('loan_purpose_name').sequence_number.count()

A majority of applications were for loans to purchase a home. However, racial discrimination may be present in any type of loan, so we will not distinguish between loan purposes in the analysis.


In [None]:
df_test.groupby('co_applicant_race_name_5').sequence_number.count()

In [None]:
df_test.groupby('co_applicant_race_name_4').sequence_number.count()

In [None]:
df_test.groupby('co_applicant_race_name_3').sequence_number.count()

In [None]:
df_test.groupby('co_applicant_race_name_2').sequence_number.count()

In [None]:
df_test.groupby('co_applicant_race_name_1').sequence_number.count()

In [None]:
# Assign back: drop() returns a new DataFrame
df_test = df_test.drop(["co_applicant_race_name_2", "co_applicant_race_name_3", "co_applicant_race_name_4", "co_applicant_race_name_5"], axis=1)

The columns co_applicant_race_name_2 through co_applicant_race_name_5 are blank in nearly all rows, so we can delete them.

In the application_date_indicator column, "0" means the application was made on or after 1/1/2004 and "2" means the application date is not available. Because we are only looking at one year, 2015, we will drop this column.

In order to process this data and model trends in loan biases, we will only work with numeric entries. Therefore, we must encode categorical columns with numbers.
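
Two common ways to turn a categorical column into numbers are integer codes and one-hot columns. The following is a minimal sketch on hypothetical toy values, not necessarily the encoding used later in this analysis.

```python
import pandas as pd

# Hypothetical toy data for one categorical column
toy = pd.DataFrame({
    "loan_purpose_name": ["Home purchase", "Refinancing", "Home purchase", "Home improvement"]
})

# pd.factorize assigns integer codes in order of first appearance
codes, categories = pd.factorize(toy["loan_purpose_name"])
toy["loan_purpose_code"] = codes
print(dict(enumerate(categories)))

# One-hot columns avoid implying an order between the categories
one_hot = pd.get_dummies(toy["loan_purpose_name"], prefix="purpose")
print(one_hot.columns.tolist())
```

Integer codes are compact, but a model may read them as ordered; one-hot columns cost more space and carry no such ordering.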

In [None]:
df_test.groupby('application_date_indicator').sequence_number.count()

In [None]:
df_test = df_test.drop(['application_date_indicator'], axis=1)  # assign back: drop() returns a new DataFrame

In [None]:
# Encodes a categorical value as 1 if it matches the given category, otherwise 0
def encode_action(action_type, category):
    return int(action_type == category)

In [None]:
df_encode = df_test.copy()
df_encode.action_taken_name = df_encode.action_taken_name.apply(lambda x: 
     int(x in ['Loan originated', 'Loan purchased by the institution']))

We will bucket income into the standard US tax brackets found at https://web.blockadvisors.com/2017-tax-brackets/ in order to control for income and consider the impact of race on loan status. 

In [None]:
df_test['applicant_income_000s'].describe()

In [None]:
# the max applicant income reported is 9999 thousand
print("Number of applicants with the maximum reported income of $9.999 million:",
      df_test[df_test['applicant_income_000s'] == 9999].shape[0])

In [None]:
# Look at distribution of loan amounts
df_test['loan_amount_000s'].describe()

In [None]:
# the max loan amount requested is 99999 thousand; these are probably outliers or reporting errors
print("Number of applicants with the maximum requested loan amount of $99.999 million:",
      df_test[df_test['loan_amount_000s'] == 99999].shape[0])

In [None]:
# pd.cut assigns NaN to any income that falls outside the bin edges, so the top edge is set to
# 9999, the highest reported value. Bins are open on the left, so an income of 0 also maps to NaN.
df_encode['income_bracket'] = pd.cut(df_test['applicant_income_000s'], [0, 18, 75, 153, 233, 416, 470, 9999])
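
To see how the bucketing above treats values at the edges, the sketch below applies the same bin edges to a few hypothetical incomes (in thousands of dollars). With the default settings of `pd.cut`, each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

# Same bin edges as the cell above, in thousands of dollars
BRACKET_EDGES = [0, 18, 75, 153, 233, 416, 470, 9999]

# Hypothetical incomes chosen to hit the edges of the bins
incomes = pd.Series([0, 18, 19, 75, 500, 9999, 12000], name="applicant_income_000s")

# Default pd.cut: intervals are closed on the right, e.g. (0, 18]
brackets = pd.cut(incomes, BRACKET_EDGES)
print(pd.concat([incomes, brackets.rename("income_bracket")], axis=1))
```

Here 0 and 12000 fall outside every bin and become `NaN`, while 18 and 75 fall into the bins they close. Passing `include_lowest=True` to `pd.cut` would make the first bin include 0 if zero incomes should be kept.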

In [None]:
df_encode.groupby("rate_spread").sequence_number.describe();

The rate spread, which is the difference between the loan's annual percentage rate (APR) and the average prime offer rate (APOR), is a useful indicator of biased or discriminatory lending. A higher-priced mortgage loan is a consumer credit transaction secured by the consumer’s principal dwelling with an APR that exceeds the APOR by a given amount. In general, a first-lien mortgage is “higher-priced” if its APR exceeds the APOR by 1.5 percentage points or more, and a subordinate-lien mortgage is “higher-priced” if its APR exceeds the APOR by 3.5 percentage points or more. Source: https://www.scotsmanguide.com/Residential/Articles/2014/11/High-Cost-vs--Higher-Priced-Mortgages/
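
The thresholds described above could be applied as in the following sketch. The function `is_higher_priced` is hypothetical, and the lien status labels are assumed spellings that should be checked against the values in `lien_status_name` before use.

```python
import pandas as pd

# Rate spread thresholds in percentage points, taken from the paragraph above.
# The lien status labels are assumed spellings.
THRESHOLDS = {
    "Secured by a first lien": 1.5,
    "Secured by a subordinate lien": 3.5,
}

def is_higher_priced(rate_spread, lien_status_name):
    """Return True if the rate spread meets the threshold for the given lien status."""
    threshold = THRESHOLDS.get(lien_status_name)
    # An unknown lien status or a missing rate spread cannot be classified
    if threshold is None or pd.isna(rate_spread):
        return False
    return bool(rate_spread >= threshold)

print(is_higher_priced(1.5, "Secured by a first lien"))
print(is_higher_priced(3.0, "Secured by a subordinate lien"))
```

Treating a missing rate spread as not higher-priced is a design choice of this sketch; a separate "unknown" category may be more appropriate for analysis.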

In [None]:
# Save as csv for the processing notebook
df_encode.to_csv("encoded_loan_data.csv", index=False)