In [1]:
import pandas as pd
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind

In [2]:
data = pd.read_csv('hmda_lar.csv')


In [3]:
data.shape

(4408826, 47)

In [4]:
# Look at the structure of the data
data.head()

Unnamed: 0,tract_to_msamd_income,rate_spread,population,minority_population,number_of_owner_occupied_units,number_of_1_to_4_family_units,loan_amount_000s,hud_median_family_income,applicant_income_000s,state_name,...,applicant_sex_name,applicant_race_name_5,applicant_race_name_4,applicant_race_name_3,applicant_race_name_2,applicant_race_name_1,applicant_ethnicity_name,agency_name,agency_abbr,action_taken_name
0,,,,,,,290,,44.0,California,...,Female,,,,,White,Hispanic or Latino,Department of Housing and Urban Development,HUD,Application denied by financial institution
1,,,,,,,600,,42.0,California,...,Male,,,,,White,Not Hispanic or Latino,National Credit Union Administration,NCUA,Application denied by financial institution
2,,,,,,,118,,43.0,California,...,Male,,,,,White,Not Hispanic or Latino,Department of Housing and Urban Development,HUD,Application denied by financial institution
3,138.410004,,4323.0,83.0,996.0,1232.0,292,63000.0,141.0,California,...,Female,,,,,White,Hispanic or Latino,Office of Thrift Supervision,OTS,Loan originated
4,,,,,,,700,,178.0,California,...,Male,,,,,Asian,Not Hispanic or Latino,National Credit Union Administration,NCUA,Application denied by financial institution


# Data preparation

Only records whose action is one of 'Loan originated', 'Loan purchased by the institution', 'Application approved but not accepted', or 'Application denied by financial institution' are kept. The first three actions are treated as approval and the last one as denial.

In [3]:
# Recode actions into approved or denied
denied_actions = {'Application denied by financial institution'}
approved_actions = {'Loan originated','Loan purchased by the institution','Application approved but not accepted'}

def recode_actions(origin):
    if origin in denied_actions:
        return 'Denied'
    if origin in approved_actions:
        return 'Approved'

data['actions_new'] = data['action_taken_name'].apply(recode_actions)
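
The same recoding can be sketched without a row-wise `apply`, which matters on a table with millions of rows. The snippet below is only an illustration: `ACTION_MAP` and `recode_actions_vectorized` are hypothetical names, and the third label in the toy series is included just to show what happens to an action outside the four retained ones.

```python
import pandas as pd

# Hypothetical lookup table built from the four action labels used above.
ACTION_MAP = {
    'Application denied by financial institution': 'Denied',
    'Loan originated': 'Approved',
    'Loan purchased by the institution': 'Approved',
    'Application approved but not accepted': 'Approved',
}

def recode_actions_vectorized(actions: pd.Series) -> pd.Series:
    """Map raw action labels to 'Approved'/'Denied'; any other label becomes NaN."""
    return actions.map(ACTION_MAP)

# Toy input; the last label only illustrates an action that is not retained.
sample = pd.Series(['Loan originated',
                    'Application denied by financial institution',
                    'Application withdrawn by applicant'])
recoded = recode_actions_vectorized(sample)
```

Labels that are not in the dictionary come back as NaN, which matches the `None` returned by `recode_actions` for those rows.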

In the data, the column "applicant_race_name_1" gives the applicant's race and the column "applicant_ethnicity_name" gives the applicant's ethnicity. Together, these two columns determine whether an applicant is a minority. Applicants whose race is White and whose ethnicity is "Not Hispanic or Latino" are defined as non-minorities; applicants with any other combination of race and ethnicity are defined as minorities. Note that both columns contain missing values. Applicants whose race is not White are minorities regardless of their ethnicity, and applicants whose ethnicity is "Hispanic or Latino" are minorities regardless of their race. Applicants whose race is White but whose ethnicity is missing are labeled as unknown. After labeling, a new variable indicating whether the applicant is a minority is created.

In [4]:
# Recode race and ethnicity into minority or non-minority

def recode_minority(ethnicity_name, race_name):
    if ethnicity_name == 'Hispanic or Latino':
        return 'Yes'
    if race_name != 'White':
        return 'Yes'
    if race_name == 'White' and ethnicity_name == 'Not Hispanic or Latino':
        return 'No'

data['Minority'] = data.apply(lambda x: recode_minority(x['applicant_ethnicity_name'], x['applicant_race_name_1']),axis=1)
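
The labeling rules can also be sketched with boolean masks instead of a row-wise `apply`. This is a minimal sketch, not the notebook's method: `recode_minority_vectorized` is a hypothetical helper, and it makes one assumption that differs from `recode_minority` above, namely that a missing race is left as unknown rather than counted as non-White.

```python
import numpy as np
import pandas as pd

def recode_minority_vectorized(ethnicity: pd.Series, race: pd.Series) -> pd.Series:
    """Return 'Yes', 'No', or NaN (unknown) for each applicant."""
    is_hispanic = ethnicity.eq('Hispanic or Latino')
    # Assumption: a missing race is NOT treated as non-White here.
    non_white = race.notna() & race.ne('White')
    non_minority = race.eq('White') & ethnicity.eq('Not Hispanic or Latino')

    labels = pd.Series(np.nan, index=ethnicity.index, dtype=object)
    labels[non_minority] = 'No'
    labels[non_white] = 'Yes'
    labels[is_hispanic] = 'Yes'   # Hispanic or Latino applicants are minorities regardless of race
    return labels

# Toy input covering each rule, including the two unknown cases.
ethnicity_demo = pd.Series(['Hispanic or Latino', 'Not Hispanic or Latino',
                            'Not Hispanic or Latino', None, None], dtype=object)
race_demo = pd.Series(['White', 'White', 'Asian', 'White', None], dtype=object)
minority_demo = recode_minority_vectorized(ethnicity_demo, race_demo)
```

Whether a missing race should count as a minority is a modeling decision; the sketch only makes that choice explicit.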

In [8]:
# For convenience, export results of recoding to csv
data[['Minority','actions_new']].to_csv('addCol.csv')

In [4]:
# Read from exported csv and append to dataset
addCol = pd.read_csv('addCol.csv')

data['actions_new'] = addCol['actions_new']
data['Minority'] = addCol['Minority']

# General Overview

First, I got an overview of application success rates by looking directly at the counts of approved and denied cases for minorities and non-minorities.

In [10]:
# Overall
data[['Minority','actions_new']].groupby(['Minority','actions_new']).size()

# Approval rate of non-minority: 84.66%
# Approval rate of minority: 79.67%

Minority  actions_new
No        Approved       2349563
          Denied          425602
Yes       Approved       1301539
          Denied          332122
dtype: int64

In [11]:
# Texas
data[['Minority','actions_new']].loc[data['state_name']=='Texas'].groupby(['Minority','actions_new']).size()

# Approval rate of non-minority: 84.52%
# Approval rate of minority: 75.02%

Minority  actions_new
No        Approved       917813
          Denied         168044
Yes       Approved       404961
          Denied         134828
dtype: int64

In [12]:
# California
data[['Minority','actions_new']].loc[data['state_name']=='California'].groupby(['Minority','actions_new']).size()

# Approval rate of non-minority: 84.75%
# Approval rate of minority: 81.96%

Minority  actions_new
No        Approved       1431750
          Denied          257558
Yes       Approved        896578
          Denied          197294
dtype: int64
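
The approval rates quoted in the cell comments are computed by hand from these counts. A short sketch of how the same rates could be obtained directly is shown here; `approval_rates` is a hypothetical helper and the toy frame stands in for `data`.

```python
import pandas as pd

def approval_rates(df, group_col='Minority', outcome_col='actions_new'):
    """Share of approved applications per group (rows with a missing label are dropped)."""
    table = pd.crosstab(df[group_col], df[outcome_col], normalize='index')
    return table['Approved']

# Toy stand-in for the real data: 3 of 4 non-minority and 1 of 2 minority cases approved.
toy = pd.DataFrame({
    'Minority': ['No', 'No', 'No', 'No', 'Yes', 'Yes'],
    'actions_new': ['Approved', 'Approved', 'Approved', 'Denied', 'Approved', 'Denied'],
})
rates = approval_rates(toy)
```

On the real data, filtering by `state_name` before calling the helper would give the per-state rates.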

Cases with minority applicants are more likely to be denied than cases with non-minority applicants.

Comparing Texas with California, the approval rate for minorities differs noticeably between the two states (75.02% versus 81.96%), while for non-minorities there is no comparable geographic effect. In both states, minorities are less likely to be approved, and, as people may have expected, the gap is larger in Texas than in California.

However, these raw rates do not establish that minorities are discriminated against in lending, or that such discrimination is more severe in Texas because it is a Red State. There may be factors that are common among minorities but less common among non-minorities and that keep applications from being approved. To determine whether such factors exist, I need to find the variables that most affect an application's outcome and compare minorities and non-minorities on those variables.

# Dealing with missing values

As mentioned above, after recoding, some records have an unknown minority status or an unknown outcome. Records that lack either piece of information should be excluded.
After excluding these records, the total number of records is 4408826. Some columns contain close to or more than 4 million null values, such as the rate spread and the denial reasons. Columns with this many null values are excluded from the analysis because they carry little usable information.

There are also roughly 15,000 to 17,500 null values in each of the location-related variables: 'tract_to_msamd_income', 'population', 'minority_population', 'number_of_owner_occupied_units', and 'number_of_1_to_4_family_units'. Looking at how these null values are distributed, I found that the records with a null in at least one of these columns make up fewer than 0.5% of all records, so I decided to exclude those records from the analysis.
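
The two exclusion rules above (drop columns that are mostly null, then drop the few rows with nulls in tract-level columns) can be sketched as follows. `drop_sparse` and the 0.5 threshold are hypothetical choices for illustration; the notebook itself does not state a numeric cutoff.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, tract_cols, max_null_frac=0.5):
    """Drop columns whose null share exceeds max_null_frac, then rows with nulls in tract_cols."""
    null_frac = df.isnull().mean()                       # share of nulls per column
    keep_cols = null_frac[null_frac <= max_null_frac].index
    trimmed = df[keep_cols]
    present = [c for c in tract_cols if c in trimmed.columns]
    return trimmed.dropna(subset=present)

# Toy stand-in: 'rate_spread' is mostly null, 'population' has a single null.
toy = pd.DataFrame({
    'rate_spread': [np.nan, np.nan, np.nan, 1.5],
    'population': [4323.0, np.nan, 1000.0, 2000.0],
    'loan_amount_000s': [290, 600, 118, 292],
})
cleaned = drop_sparse(toy, tract_cols=['population'])
```

Inspecting `df.isnull().mean()` before choosing the threshold shows how many columns a given cutoff would remove.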


As mentioned above, to show that minorities are not discriminated against in lending, I need to find factors that play an important role in approval and that differ between minorities and non-minorities. If such factors exist, they must be dependent on, or highly correlated with, the race and ethnicity variables. Because variables that depend on each other can undermine model performance, race and ethnicity are excluded from modeling.

In [15]:
# Look at missing values
data.isnull().sum()

tract_to_msamd_income               16339
rate_spread                       4292612
population                          15557
minority_population                 15686
number_of_owner_occupied_units      17508
number_of_1_to_4_family_units       15936
loan_amount_000s                        0
hud_median_family_income            14960
applicant_income_000s              257826
state_name                              0
state_abbr                              0
sequence_number                         0
respondent_id                           0
purchaser_type_name                     0
property_type_name                      0
preapproval_name                        0
owner_occupancy_name                    0
msamd_name                         183065
loan_type_name                          0
loan_purpose_name                       0
lien_status_name                        0
hoepa_status_name                       0
edit_status_name                  3777831
denial_reason_name_3              

# Feature engineering

Since I am looking at minority-related issues, the ratio of minority population makes more sense than the absolute number of minorities in an area. So I created a new variable called 'minority_ratio', the ratio of the minority population to the total population of the tract. For the economic status of the tract where the property is located, the absolute median income of the tract makes more sense than 'tract_to_msamd_income', which is the tract's median family income as a percentage of the median family income of the MSA/MD. So I created another variable called 'tract_median_income', calculated as 'tract_to_msamd_income' * 'hud_median_family_income'. Because 'tract_to_msamd_income' is a percentage, this product is 100 times the dollar amount; the constant factor does not matter once the variable is min-max scaled for modeling. 'tract_to_msamd_income' and 'hud_median_family_income' are then excluded to avoid dependency among variables. A third variable, 'income_to_loan', is the ratio of the applicant's income to the loan amount.

In [16]:
# Feature engineering
data['minority_ratio'] = data['minority_population']/data['population']
data['tract_median_income'] = data['tract_to_msamd_income'] * data['hud_median_family_income']
data['income_to_loan'] = data['applicant_income_000s'] / data['loan_amount_000s']
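
As a worked example of the 'tract_median_income' calculation, the sketch below converts the product into dollars for the one fully populated row shown in `data.head()` (index 3). It assumes, as the text describes, that 'tract_to_msamd_income' is expressed as a percentage; `tract_median_income_dollars` is a hypothetical helper.

```python
import pandas as pd

def tract_median_income_dollars(tract_to_msamd_income, hud_median_family_income):
    """Tract median income in dollars, assuming the first argument is a percentage."""
    return tract_to_msamd_income / 100.0 * hud_median_family_income

# Values taken from row 3 of data.head(): 138.410004 percent of 63000.
row = pd.DataFrame({'tract_to_msamd_income': [138.410004],
                    'hud_median_family_income': [63000.0]})
dollars = tract_median_income_dollars(row['tract_to_msamd_income'],
                                      row['hud_median_family_income'])
```

The result is about 87,198, that is, the 8719830.0 stored in 'tract_median_income' divided by 100.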

# Modeling

A random forest is used to find the most important features for predicting whether an application will be approved, and a correlation matrix is used to find out whether a feature has a positive or negative association with the outcome of the application.

In [17]:
# Define training data for modeling
train = data[['tract_median_income','number_of_owner_occupied_units', 'number_of_1_to_4_family_units',
       'applicant_income_000s', 'loan_amount_000s',
       'state_abbr', 'property_type_name',
       'preapproval_name', 'owner_occupancy_name', 'loan_type_name',
       'loan_purpose_name', 'lien_status_name', 'hoepa_status_name',
       'co_applicant_sex_name','applicant_sex_name','agency_abbr','minority_ratio','actions_new']]

train = pd.get_dummies(train)

In [18]:
train = train.dropna()

In [19]:
train.head()

Unnamed: 0,tract_median_income,number_of_owner_occupied_units,number_of_1_to_4_family_units,applicant_income_000s,loan_amount_000s,minority_ratio,state_abbr_CA,state_abbr_TX,property_type_name_Manufactured housing,property_type_name_Multifamily dwelling,...,applicant_sex_name_Male,agency_abbr_CFPB,agency_abbr_FDIC,agency_abbr_FRS,agency_abbr_HUD,agency_abbr_NCUA,agency_abbr_OCC,agency_abbr_OTS,actions_new_Approved,actions_new_Denied
3,8719830.0,996.0,1232.0,141.0,292,0.0192,1,0,0,0,...,0,0,0,0,0,0,0,1,1,0
13,9863469.0,1065.0,1381.0,40.0,294,0.006228,1,0,0,0,...,1,0,0,0,0,0,1,0,1,0
18,8673315.0,858.0,1488.0,119.0,303,0.006639,1,0,0,0,...,1,0,0,0,0,0,1,0,1,0
23,8227233.0,1519.0,1964.0,135.0,228,0.00586,1,0,0,0,...,1,0,0,0,0,0,1,0,1,0
28,15036760.0,1927.0,2241.0,188.0,457,0.003975,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0


In [21]:
y = train.iloc[:,45]
y.head()

3     1
13    1
18    1
23    1
28    1
Name: actions_new_Approved, dtype: uint8

In [22]:
X = train.iloc[:,0:45]

X.head()

Unnamed: 0,tract_median_income,number_of_owner_occupied_units,number_of_1_to_4_family_units,applicant_income_000s,loan_amount_000s,minority_ratio,state_abbr_CA,state_abbr_TX,property_type_name_Manufactured housing,property_type_name_Multifamily dwelling,...,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,agency_abbr_CFPB,agency_abbr_FDIC,agency_abbr_FRS,agency_abbr_HUD,agency_abbr_NCUA,agency_abbr_OCC,agency_abbr_OTS
3,8719830.0,996.0,1232.0,141.0,292,0.0192,1,0,0,0,...,1,0,0,0,0,0,0,0,0,1
13,9863469.0,1065.0,1381.0,40.0,294,0.006228,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
18,8673315.0,858.0,1488.0,119.0,303,0.006639,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
23,8227233.0,1519.0,1964.0,135.0,228,0.00586,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
28,15036760.0,1927.0,2241.0,188.0,457,0.003975,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0


In [23]:
# Scale numerical data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X.iloc[:,0:6])
X.iloc[:,0:6] = scaler.transform(X.iloc[:,0:6])

In [37]:
X.head()

Unnamed: 0,tract_median_income,number_of_owner_occupied_units,number_of_1_to_4_family_units,applicant_income_000s,loan_amount_000s,minority_ratio,state_abbr_CA,state_abbr_TX,property_type_name_Manufactured housing,property_type_name_One-to-four family dwelling (other than manufactured housing),...,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,agency_abbr_CFPB,agency_abbr_FDIC,agency_abbr_FRS,agency_abbr_HUD,agency_abbr_NCUA,agency_abbr_OCC,agency_abbr_OTS
3,0.297621,0.145025,0.124343,0.014003,0.003829,0.004593,1,0,0,1,...,1,0,0,0,0,0,0,0,0,1
13,0.338853,0.155092,0.139406,0.003901,0.003855,0.001414,1,0,0,1,...,0,0,1,0,0,0,0,0,1,0
18,0.295944,0.124891,0.150222,0.011802,0.003974,0.001515,1,0,0,1,...,0,0,1,0,0,0,0,0,1,0
23,0.279861,0.221331,0.198342,0.013403,0.002987,0.001324,1,0,0,1,...,0,0,1,0,0,0,0,0,1,0
28,0.525367,0.280858,0.226345,0.018704,0.006,0.000862,1,0,0,1,...,1,0,0,0,0,0,0,0,1,0


In [25]:
# Build a Model using random forest
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
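
The model above is fitted on all rows with default settings and is used only for its feature importances. One possible refinement, sketched here on synthetic data, is to hold out a test set and fix `random_state` so the importances are reproducible and the model's accuracy can be checked. The arrays `X_demo` and `y_demo` are made up for illustration and are not the HMDA data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of three features determines the label.
rng = np.random.default_rng(0)
X_demo = rng.random((400, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# Hold out 25% of the rows, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

model_demo = RandomForestClassifier(n_estimators=50, random_state=0)
model_demo.fit(X_train, y_train)

accuracy = model_demo.score(X_test, y_test)        # accuracy on unseen rows
importances = model_demo.feature_importances_      # impurity-based, sums to 1
```

Impurity-based importances tend to favor continuous features with many distinct values, so a held-out check of this kind helps judge how much weight to put on the ranking.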

In [26]:
# Output list of variables sorted by importance
pd.DataFrame({'colname': X.columns, 'importance': model.feature_importances_}).sort_values(by='importance',ascending=False).head(10)

Unnamed: 0,colname,importance
4,loan_amount_000s,0.195335
3,applicant_income_000s,0.183844
0,tract_median_income,0.127346
5,minority_ratio,0.121535
2,number_of_1_to_4_family_units,0.118322
1,number_of_owner_occupied_units,0.118083
24,lien_status_name_Not applicable,0.017169
26,lien_status_name_Secured by a first lien,0.015346
21,loan_purpose_name_Home improvement,0.008074
22,loan_purpose_name_Home purchase,0.007359


# Model interpretation and next steps

According to the random forest model, the most important features for an application's outcome are the loan amount and the applicant's income, followed by tract-related variables: 'tract_median_income', 'minority_ratio', 'number_of_1_to_4_family_units', and 'number_of_owner_occupied_units'. This suggests that the economic and demographic status of the area where the property is located also affects the success rate of a loan application. After these come the lien status of the mortgage and the loan purpose. These variables, along with the variables indicating the race and ethnicity of the applicant, are included for further analysis.

In [27]:
# Extract most important variables and look at their correlations with the target
further = data[['income_to_loan',
    'tract_median_income','number_of_owner_occupied_units', 'number_of_1_to_4_family_units',
       'loan_amount_000s', 'applicant_income_000s','loan_purpose_name','lien_status_name','Minority','minority_ratio','actions_new']]
further = pd.get_dummies(further)

further.corr()[['actions_new_Approved','Minority_Yes']]

Unnamed: 0,actions_new_Approved,Minority_Yes
income_to_loan,-0.049038,-0.007234
tract_median_income,0.074256,-0.149122
number_of_owner_occupied_units,0.029392,-0.073036
number_of_1_to_4_family_units,0.012496,-0.072768
loan_amount_000s,0.036904,-0.067238
applicant_income_000s,0.036846,-0.095418
minority_ratio,-0.004024,0.014819
loan_purpose_name_Home improvement,-0.120597,0.012521
loan_purpose_name_Home purchase,0.109009,0.068408
loan_purpose_name_Refinancing,-0.060731,-0.072199


# Further analysis

The correlation matrix of the included variables shows each variable's association with the outcome and with the Minority variable. The higher the median income of the tract where the property is located, the more likely the application is to be approved, while the properties in minorities' applications tend to be in tracts with lower median income. The same pattern holds for the number of owner-occupied units in the tract, the number of 1-to-4 family units in the tract, the loan amount, and the applicant's income. Applications for properties in tracts with a lower ratio of minority population are slightly more likely to be approved (the correlation is very weak, -0.004), while the properties in minorities' applications tend to be in tracts with a higher ratio of minority population.

In [25]:
# Split data by minority or non-minority
mino = data.loc[data['Minority'] == 'Yes']
majo = data.loc[data['Minority'] == 'No']

In [102]:
# Look at each group's important variables
mino['applicant_income_000s'].describe()

count    1.597741e+06
mean     1.008917e+02
std      1.215228e+02
min      1.000000e+00
25%      4.900000e+01
50%      7.700000e+01
75%      1.220000e+02
max      9.999000e+03
Name: applicant_income_000s, dtype: float64

In [103]:
majo['applicant_income_000s'].describe()

count    2.606991e+06
mean     1.352630e+02
std      1.915382e+02
min      1.000000e+00
25%      6.400000e+01
50%      9.900000e+01
75%      1.520000e+02
max      9.999000e+03
Name: applicant_income_000s, dtype: float64

In [104]:
mino['loan_amount_000s'].describe()

count    1.709227e+06
mean     2.414074e+02
std      1.869450e+02
min      1.000000e+00
25%      1.200000e+02
50%      2.000000e+02
75%      3.200000e+02
max      1.400000e+04
Name: loan_amount_000s, dtype: float64

In [105]:
majo['loan_amount_000s'].describe()

count    2.759894e+06
mean     2.763878e+02
std      2.567693e+02
min      1.000000e+00
25%      1.340000e+02
50%      2.200000e+02
75%      3.550000e+02
max      7.600000e+04
Name: loan_amount_000s, dtype: float64
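
The four `describe()` calls above can be condensed into one grouped summary, which puts the two groups side by side. This is a small sketch on a toy frame; `compare_groups` is a hypothetical helper and the numbers are illustrative only.

```python
import pandas as pd

def compare_groups(df, cols=('applicant_income_000s', 'loan_amount_000s')):
    """Count, mean, and median of the given columns for each Minority group."""
    return df.groupby('Minority')[list(cols)].agg(['count', 'mean', 'median'])

# Toy stand-in for the real data.
toy = pd.DataFrame({
    'Minority': ['Yes', 'Yes', 'No', 'No'],
    'applicant_income_000s': [49.0, 77.0, 64.0, 99.0],
    'loan_amount_000s': [120, 200, 134, 220],
})
summary = compare_groups(toy)
```

Because income and loan amount are right-skewed, the median column is usually the more informative comparison.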

# Controlled hypothesis tests

To find out whether minorities are discriminated against in lending, I decided to hold the tract, income, and loan amount approximately equal between minorities and non-minorities and then calculate the approval rates of minority and non-minority applicants.

In [35]:
# Drop records with nulls in location variables and create a unique tract identifier
data = data.dropna(subset = ['census_tract_number'])
data = data.dropna(subset = ['county_name'])

data['tract_uni'] = data['census_tract_number'].astype(str) + '-' + data['county_name'] + '-' + data['state_abbr']

data = data.dropna(subset = ['Minority'])

data = data.dropna(subset = ['actions_new'])

In [35]:
control = data[['as_of_year','state_abbr','loan_amount_000s','applicant_income_000s','tract_uni','Minority','actions_new']]

In [36]:
control = control.dropna()

In [38]:
new = pd.get_dummies(control['actions_new'])
control['actions_new_Approved'] = new['Approved']

In [39]:
control.head()

Unnamed: 0,as_of_year,state_abbr,loan_amount_000s,applicant_income_000s,tract_uni,Minority,actions_new,actions_new_Approved
3,2010,CA,292,141.0,5545.11-Los Angeles County-CA,Yes,Approved,1
13,2010,CA,294,40.0,4503.0-Alameda County-CA,No,Approved,1
18,2010,CA,303,119.0,3570.0-Contra Costa County-CA,No,Approved,1
23,2010,CA,228,135.0,3300.0-Contra Costa County-CA,No,Approved,1
28,2010,CA,457,188.0,3451.1-Contra Costa County-CA,Yes,Approved,1


To control the tract-related variables, I simply compared records from the same tract. To control the loan amount and income, I selected a subset of records in which, according to a T-test, the mean income and the mean loan amount of the minority group and the non-minority group are approximately equal. Since in most tracts the number of minority applicants is much smaller than the number of non-minority applicants, the goal is essentially to select a subset of non-minority applicants whose average income and loan amount are approximately equal to those of the minority applicants in the tract.

I did this by sorting the applicants by income and loan amount and performing a T-test on both the mean income and the mean loan amount. Because the distributions of income and loan amount are skewed, the T-test is performed on their logarithms. If the null hypothesis of equal means cannot be rejected, the two groups are considered approximately equal; in the code below this means a p-value of at least 0.2 for income and at least 0.1 for loan amount. If the null hypothesis is rejected, I check whether the mean income of the selected non-minority group is larger or smaller than that of the minority group. If it is larger, the record with the largest income is removed from the non-minority group; if it is smaller, the record with the smallest income is removed. These steps are repeated until the two groups are approximately equal on both income and loan amount.

Then the approval rates of minority and non-minority applicants are calculated and compared. To keep the comparison reliable, each group must retain more than 30 records; tracts in which either group falls to 30 records or fewer during trimming are left out of the analysis.
With the most important variables held approximately equal, a remaining gap in approval rates is taken as evidence that minorities are discriminated against in lending in that tract. The function below counts a tract as discriminatory whenever the minority approval rate is lower than the non-minority rate.
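
The trimming procedure described above can be isolated into a small helper to make the loop easier to follow. This is a sketch under stated assumptions, not the notebook's exact code: `trim_majority` is a hypothetical name, the demo data are synthetic, and the logarithm is taken of the raw values (multiplying by 1000 first only shifts the logs by a constant and leaves the T-test unchanged).

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def trim_majority(minority, majority, min_size=30, p_income_min=0.2, p_amount_min=0.1):
    """Trim the non-minority group until its log-income and log-loan means match the minority group.

    Returns (trimmed_majority, matched); matched is False if a group is too small.
    """
    majority = majority.sort_values(['applicant_income_000s', 'loan_amount_000s'])
    log_inc_min = np.log(minority['applicant_income_000s'])
    log_amt_min = np.log(minority['loan_amount_000s'])
    while len(majority) > min_size and len(minority) > min_size:
        log_inc_maj = np.log(majority['applicant_income_000s'])
        log_amt_maj = np.log(majority['loan_amount_000s'])
        # Welch T-tests on the log-transformed values
        _, p_income = ttest_ind(log_inc_min, log_inc_maj, equal_var=False)
        _, p_amount = ttest_ind(log_amt_min, log_amt_maj, equal_var=False)
        if p_income >= p_income_min and p_amount >= p_amount_min:
            return majority, True
        if log_inc_maj.mean() > log_inc_min.mean():
            majority = majority.iloc[:-1]   # drop the highest-income record
        else:
            majority = majority.iloc[1:]    # drop the lowest-income record
    return majority, False

# Synthetic tract: the non-minority group has higher incomes on average.
rng = np.random.default_rng(0)
minority_demo = pd.DataFrame({
    'applicant_income_000s': rng.lognormal(mean=4.3, sigma=0.5, size=100),
    'loan_amount_000s': rng.lognormal(mean=5.3, sigma=0.5, size=100),
})
majority_demo = pd.DataFrame({
    'applicant_income_000s': rng.lognormal(mean=4.6, sigma=0.5, size=300),
    'loan_amount_000s': rng.lognormal(mean=5.3, sigma=0.5, size=300),
})
trimmed, matched = trim_majority(minority_demo, majority_demo)
```

Note that failing to reject equal means is not proof of equality, so the matched groups are balanced only approximately.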

After determining the status of each tract, the overall situation is summarized by the share of tracts with discrimination. Since tracts differ in their number of records, the share can also be calculated with the number of records in each tract as the weight.
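
The difference between the unweighted and the weighted share can be seen in a small worked example. `flagged_ratios` is a hypothetical helper and the three tracts are made up for illustration.

```python
import numpy as np

def flagged_ratios(flags, tract_sizes):
    """Unweighted and record-weighted share of flagged tracts.

    flags: 1 if the tract is flagged, 0 otherwise; tract_sizes: records per tract.
    """
    flags = np.asarray(flags, dtype=float)
    sizes = np.asarray(tract_sizes, dtype=float)
    unweighted = flags.mean()
    weighted = (flags * sizes).sum() / sizes.sum()
    return unweighted, weighted

# Three tracts with 500, 300, and 200 records; the first and third are flagged.
unweighted, weighted = flagged_ratios([1, 0, 1], [500, 300, 200])
```

Here two of three tracts are flagged (about 0.667), but they hold 700 of the 1000 records, so the weighted share is 0.7.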

In [40]:
def hypo_test(select_year, year, select_state, state, num_tracts):  # Choose the year, state, and number of tracts
    if select_year: 
        print('Year of data: %d' %(year))
        test = control.loc[control['as_of_year'] == year]
    else:
        print('Year of data: 2010 and 2013')
        test = control
    if select_state:
        print('State of data: %s' %(state))
        test = test.loc[test['state_abbr'] == state]
    else:
        print('State of data: TX and CA')
    
    # Get a list of top n tracts with most number of records and calculate the total number of records in these tracts
    top_tracts = test['tract_uni'].value_counts()[0:num_tracts].index.tolist()
    total_records = test['tract_uni'].value_counts()[0:num_tracts].sum()
    print('%d tracts with %d records' %(num_tracts,total_records))
     
    
    print('Calculating...')
    
    result = []
    result_weighted = []
    total = 0
    
    # Loop through each tract, get minority and non-minority data in this tract, sort by income and loan amount
    for i in top_tracts: 
        subset = test.loc[test['tract_uni']==i]
        num = len(subset)
        minority = subset.loc[subset['Minority']=='Yes']
        majority = subset.loc[subset['Minority']=='No']
        minority = minority.sort_values(by=['applicant_income_000s','loan_amount_000s'])
        majority = majority.sort_values(by=['applicant_income_000s','loan_amount_000s'])

        flag = 0
        # Loop until the loan amount and income are approximately the same between the two groups
        while flag == 0 and len(majority) > 30 and len(minority) > 30: 
            
            # log-transform the data
            income_minority = np.log(minority['applicant_income_000s'] * 1000)
            income_majority = np.log(majority['applicant_income_000s'] * 1000)
            amount_minority = np.log(minority['loan_amount_000s'] * 1000)
            amount_majority = np.log(majority['loan_amount_000s'] * 1000)
            
            # T-test between two groups and adjust data
            t1, p_income = ttest_ind(income_minority, income_majority, equal_var=False)
            t2, p_amount = ttest_ind(amount_minority, amount_majority, equal_var=False)
            if p_income<0.2 or p_amount<0.1:
                mean_minority = income_minority.mean()
                mean_majority = income_majority.mean()
                if mean_majority > mean_minority:
                    majority = majority[:-1]
                else: 
                    majority = majority[1:]
            else:
                flag = 1
        
        # Calculate and compare the approval rates of two groups
        if flag == 1:
            total += num
            if minority['actions_new_Approved'].mean() < majority['actions_new_Approved'].mean():
                result.append(1)
                result_weighted.append(len(subset))
            else:
                result.append(0)
                result_weighted.append(0)
    
    result = np.array(result)
    result_weighted = np.array(result_weighted)
    print('Total number of valid tracts: %d' %len(result))
    print('Ratio of tracts with discrimination: %f' %np.mean(result))
    print('Weighted Ratio of tracts with discrimination: %f' %(np.sum(result_weighted)/total))
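
`hypo_test` flags a tract whenever the minority approval rate is lower than the non-minority rate, however small the gap. One possible refinement, which the notebook does not use, is to flag a tract only when the gap is statistically significant. The sketch below shows a standard pooled two-proportion z-test using only the standard library; `two_proportion_z_test` is a hypothetical helper and the counts are made up.

```python
import math

def two_proportion_z_test(approved_a, n_a, approved_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumes the pooled approval rate is strictly between 0 and 1.
    """
    p_a = approved_a / n_a
    p_b = approved_b / n_b
    pooled = (approved_a + approved_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return z, p_value

# Made-up tract: 30 of 50 minority and 45 of 50 non-minority applications approved.
z, p = two_proportion_z_test(30, 50, 45, 50)
```

With groups of just over 30 records, small gaps in approval rate will often not be significant, so a rule of this kind would be expected to flag fewer tracts than the simple comparison.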
    

In [61]:
# Two states, 4000 tracts, two years
hypo_test(False, 0, False,0, 4000)

Year of data: 2010 and 2013
State of data: TX and CA
4000 tracts with 2429812 records
Calculating...
Total number of valid tracts: 3615
Ratio of tracts with discrimination: 0.767635
Weighted Ratio of tracts with discrimination: 0.777587


In [43]:
# Texas, 2000 tracts, two years
hypo_test(False, 0, True, 'TX', 2000)

Year of data: 2010 and 2013
State of data: TX
2000 tracts with 1083068 records
Calculating...
Total number of valid tracts: 1663
Ratio of tracts with discrimination: 0.853879
Weighted Ratio of tracts with discrimination: 0.870193


In [44]:
# California, 2000 tracts, two years
hypo_test(False, 0, True, 'CA', 2000)

Year of data: 2010 and 2013
State of data: CA
2000 tracts with 1291808 records
Calculating...
Total number of valid tracts: 1877
Ratio of tracts with discrimination: 0.727224
Weighted Ratio of tracts with discrimination: 0.725169


In [45]:
# Two states, 2000 tracts, 2010
hypo_test(True, 2010, False, 0, 2000)

Year of data: 2010
State of data: TX and CA
2000 tracts with 989654 records
Calculating...
Total number of valid tracts: 1655
Ratio of tracts with discrimination: 0.706344
Weighted Ratio of tracts with discrimination: 0.750462


In [46]:
# Two states, 2000 tracts, 2013
hypo_test(True, 2013, False, 0, 2000)

Year of data: 2013
State of data: TX and CA
2000 tracts with 858311 records
Calculating...
Total number of valid tracts: 1746
Ratio of tracts with discrimination: 0.757732
Weighted Ratio of tracts with discrimination: 0.766264


# Summary

- Are minorities discriminated against in lending?

    Judging from the results of the analysis, yes.
    
- What is your hypothesis before running experiments?

    Minorities are not discriminated against in lending.
    
- Which minorities did you choose to analyze?

    I treated White, Not-Hispanic-or-Latino applicants as the majority and all other races and ethnicities as minorities, and analyzed the minorities as a whole.
    
- What was the analytical process that you chose to use?

    For preprocessing, I cleaned the data and transformed some variables. Then I used a random forest to select the variables with a large impact on the outcome of an application. Finally, I held the important variables approximately equal between the minority group and the majority group and compared the approval rates of the two groups.
    
- What metrics/variables did you use in this data to prove or disprove? Why?

    Tract-related variables, applicant income, and loan amount, because the random forest model ranks them as the most important variables.
    
- What metrics/variables did you not use in this data to prove or disprove? Why?

    Variables with too many null values and variables that do not make much sense in modeling.
    
- Are these variables explanatory or correlated or both?

    Some are explanatory and some are correlated.

- What factors did you control for? Did you use other data sets? Why? Why not?

    Tracts, loan amount, and applicant income.
    No, because it is hard to find relevant datasets that can be aligned with so many records, and the given dataset is sufficient for predicting the application result.

- Is there a geographic effect?

    Yes. Discrimination is more severe in Texas than in California: 85.4% of valid tracts are flagged in Texas versus 72.7% in California.
    
- Are the effects you are describing getting worse or better over time? Was your hypothesis supported by your analysis? Why?

    Discrimination against minorities is more severe in 2013 than in 2010: the share of flagged tracts rises from 70.6% to 75.8%.
    My hypothesis was not supported by my analysis: with the important variables held approximately equal, there is still a difference between the approval rates of the minority group and the non-minority group.
