# Lab 4 - QMSS GR5015
### Cindy Chen, cjc2279

__I will reuse my data set on 2020 US mortgage approvals for this analysis.__

In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as mpl
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
os.chdir("C:/Users/Cindy C/Downloads/2020_lar")
mortgage = pd.read_csv("2020_lar.txt", sep = "|", header = 0, nrows = 8000)

### 1. Run a simple regression, and interpret your results.  (Did the results fit your expectations?  Why?  Why not?)  

I ran a simple regression of loan_amount on two independent variables: income (log-transformed as income_ln) and applicant_age (an indicator for whether the applicant was below 25 rather than 25-34).  I was very curious about the mortgage sizes that people my age were securing.  As preparation for my independent project (where I will use this same data set), I decided to explore single-family home mortgages only, so my data was subset by these two additional conditions (applicant age group and dwelling type).

In running my simple regression, I noticed that my income coefficient was statistically significant (p < 0.001) but my applicant_age coefficient was not (p = 0.140).  My R-squared value of 0.310 seems quite high: income and age group together explain about 31% of the variation in loan amounts.  Because income enters the model in logs, the coefficient of 131,600 means that a 1% increase in income is associated with an increase of roughly \\$1,316 in the loan amount requested (131,600 × 0.01).  While my regression's intercept is negative and feels unintuitive (since a mortgage can't be negative), it describes an applicant whose income_ln is zero, which is far below the incomes in my data.  The negative intercept therefore makes sense since you need a certain level of income to even qualify for a mortgage (which is not zero).
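
As a quick check on how to read the income_ln coefficient, the sketch below works through the arithmetic for a level-log model, where the predicted change in the outcome for a p% increase in income is the coefficient times ln(1 + p/100).  The coefficient is copied from the regression summary; the function name is only illustrative and is not part of the lab code.

```python
import math

# Coefficient on income_ln reported in the regression summary (1.316e+05).
BETA_INCOME_LN = 131_600.0

def loan_change_for_income_increase(pct_increase, beta=BETA_INCOME_LN):
    """Predicted change in loan_amount (dollars) for a pct_increase % rise in income."""
    return beta * math.log(1 + pct_increase / 100)

for pct in (1, 10):
    change = loan_change_for_income_increase(pct)
    print(f"{pct}% higher income -> about ${change:,.0f} more requested")
```

For small changes, the shortcut of dividing the coefficient by 100 per 1% of income is close to the exact value; for larger changes such as 10%, the logarithm matters more.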

In [3]:
#First, I check that loan_amount is already stored as a numeric variable
print(mortgage['loan_amount'].dtype)
mortgage['loan_amount'].describe()

int64


count    8.000000e+03
mean     2.253338e+05
std      1.518443e+05
min      5.000000e+03
25%      1.250000e+05
50%      2.050000e+05
75%      2.950000e+05
max      3.605000e+06
Name: loan_amount, dtype: float64

In [4]:
mortgage['income'].describe()

count    7738.000000
mean      107.550788
std       141.720717
min        -4.000000
25%        55.000000
50%        84.000000
75%       130.000000
max      8800.000000
Name: income, dtype: float64

In [5]:
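#I want to create a log-transformed income variable since our descriptive stats show that there is quite
#a large range.  Rows with an income of exactly 0 are assigned 0.  Note that np.log is still evaluated on
#every row, so zero and negative incomes trigger RuntimeWarnings and negative incomes become NaN.
mortgage['income_ln'] = np.where(mortgage['income'] != 0, np.log(mortgage['income']), 0)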

  result = getattr(ufunc, method)(*inputs, **kwargs)
  result = getattr(ufunc, method)(*inputs, **kwargs)
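
The warnings above come from np.log being evaluated on every row before np.where chooses between its two branches.  One way to avoid them, sketched here on a small made-up income column, is to mask non-positive values first so they become missing rather than being passed to the logarithm.  Note that this treats zero income as missing instead of assigning it 0, which is a different modeling choice from the cell above.

```python
import numpy as np
import pandas as pd

# Made-up incomes (in thousands) including zero, negative, and missing values.
toy = pd.DataFrame({"income": [55.0, 84.0, 0.0, -4.0, np.nan, 8800.0]})

# Series.where keeps values where the condition holds and inserts NaN elsewhere,
# so np.log only ever sees strictly positive numbers (or NaN).
positive_income = toy["income"].where(toy["income"] > 0)
toy["income_ln"] = np.log(positive_income)

print(toy)
```

Rows with missing income_ln are then dropped automatically when a formula-based regression is fit.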


In [6]:
#I'm going to remove any rows where applicant_age or co_applicant_age is coded as '8888' or '9999'
#because these are placeholder codes, not real age ranges
mortgage = mortgage[mortgage['applicant_age'] != '8888']
mortgage = mortgage[mortgage['co_applicant_age'] != '8888']
mortgage = mortgage[mortgage['co_applicant_age'] != '9999']

In [7]:
#since my data frame is so big, I wanted to just create a smaller data frame so it's easier for me to preview
#(.copy() makes mortgage2 independent of mortgage, so adding columns later is safe)
mortgage2 = mortgage.loc[:, ['income_ln',
                             'applicant_age',
                             'state_code',
                             'derived_dwelling_category',
                             'co_applicant_age',
                             'loan_amount']].copy()
mortgage2.head(10)

| | income_ln | applicant_age | state_code | derived_dwelling_category | co_applicant_age | loan_amount |
|---|---|---|---|---|---|---|
| 0 | 3.496508 | 25-34 | LA | Single Family (1-4 Units):Site-Built | 25-34 | 165000 |
| 1 | 4.682131 | 25-34 | TX | Single Family (1-4 Units):Site-Built | <25 | 225000 |
| 2 | 4.219508 | 65-74 | TX | Single Family (1-4 Units):Site-Built | 65-74 | 235000 |
| 7 | 3.951244 | 25-34 | OK | Single Family (1-4 Units):Site-Built | 45-54 | 85000 |
| 8 | 3.806662 | 25-34 | LA | Single Family (1-4 Units):Site-Built | 55-64 | 165000 |
| 10 | 4.564348 | 35-44 | AR | Single Family (1-4 Units):Site-Built | 35-44 | 275000 |
| 11 | 4.60517 | 35-44 | TX | Single Family (1-4 Units):Site-Built | 25-34 | 245000 |
| 12 | 4.553877 | 25-34 | OK | Single Family (1-4 Units):Site-Built | 25-34 | 175000 |
| 13 | 4.430817 | 45-54 | TX | Single Family (1-4 Units):Site-Built | 35-44 | 225000 |
| 14 | 5.030438 | 35-44 | TX | Single Family (1-4 Units):Site-Built | 35-44 | 195000 |


In [8]:
#since I want to subset my data, I want to make sure that the smaller data frame is still a
#sufficient size to run my analysis.  If I only have 25 remaining rows, I might consider changing my filters!
mortgage2[(mortgage2['derived_dwelling_category'] == "Single Family (1-4 Units):Site-Built")].shape

(2966, 6)

In [24]:
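#for this analysis, I want to filter my dataset to single-family properties and to applicants under 35
#note: statsmodels spells the robust-covariance keyword cov_type; the cor_type keyword used here is not
#applied, which is why the summary reports nonrobust standard errors
lm1 = smf.ols('loan_amount ~ income_ln + C(applicant_age)',
              data = mortgage2,
              subset = (mortgage2['derived_dwelling_category'] == "Single Family (1-4 Units):Site-Built")
                       & ((mortgage2['applicant_age'] == '25-34') | (mortgage2['applicant_age'] == '<25'))).fit(cor_type = "HC3")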

lm1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | loan_amount | R-squared: | 0.310 |
| Model: | OLS | Adj. R-squared: | 0.308 |
| Method: | Least Squares | F-statistic: | 151.9 |
| Date: | Thu, 11 Nov 2021 | Prob (F-statistic): | 3.34e-55 |
| Time: | 13:14:13 | Log-Likelihood: | -8857.9 |
| No. Observations: | 679 | AIC: | 17720.0 |
| Df Residuals: | 676 | BIC: | 17740.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.336e+05 | 3.7e+04 | -9.022 | 0.000 | -4.06e+05 | -2.61e+05 |
| C(applicant_age)[T.<25] | -1.985e+04 | 1.34e+04 | -1.479 | 0.140 | -4.62e+04 | 6501.978 |
| income_ln | 1.316e+05 | 7869.273 | 16.729 | 0.000 | 1.16e+05 | 1.47e+05 |

| | | | |
|---|---|---|---|
| Omnibus: | 19.728 | Durbin-Watson: | 1.319 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 35.74 |
| Skew: | 0.177 | Prob(JB): | 1.73e-08 |
| Kurtosis: | 4.067 | Cond. No. | 41.9 |
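
The summary above lists the covariance type as nonrobust.  In statsmodels the keyword for heteroskedasticity-robust standard errors is cov_type, so a fit that requests HC3 errors could look like the sketch below.  The data here are simulated purely for illustration, and the variable names x and y are placeholders rather than columns from the mortgage data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Simulated outcome whose noise grows with |x| (heteroskedastic on purpose).
y = 1.0 + 2.0 * x + rng.normal(scale=1.0 + np.abs(x), size=n)
toy = pd.DataFrame({"x": x, "y": y})

plain = smf.ols("y ~ x", data=toy).fit()
robust = smf.ols("y ~ x", data=toy).fit(cov_type="HC3")

# The coefficients are identical; only the standard errors differ.
print(plain.cov_type, plain.bse["x"])
print(robust.cov_type, robust.bse["x"])
```

Checking the "Covariance Type" row of the summary (or the cov_type attribute of the results object) is a simple way to confirm which standard errors were actually used.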


### 2. Add an interaction term to that model that you think might moderate the original relationship between X1 and X2.  Explain why you think an interaction might be present and in what direction it would work.  Explain your results.  Did it work out?  Yes?  No?  

My interaction variable will be co-applicant age.  I wonder whether a co-applicant who is much older (likely a parent), rather than a partner of around the same age, is associated with a larger requested mortgage, and whether this relationship differs between applicants under 25 and those aged 25-34.  My reasoning for using this interaction variable is to see if parents are instrumental in helping their young adult children secure a larger mortgage, so I expected the interaction to be positive.

To do this, I changed my co_applicant_age variable into a dummy variable.  Rather than use the age ranges listed under the co_applicant_age variable, I flagged whether or not the co-applicant might be the parent.  In other words, I marked whether the co-applicant was 45 or older.  I chose this 45-year-old threshold because it is a likely age for parents of applicants below 25 years old, and it leaves a gap of at least 10 years above the oldest applicant in my subset (34 years old), so as to minimize the number of couples with large age gaps who get flagged as parent and child.

__Findings:__

When I ran this new regression with the interaction term, the results initially surprised me.  The coefficients for applicant_age and parent by themselves were not statistically significant, but the interaction term (the under-25 indicator multiplied by the parent dummy) yielded a p-value of 0.014, so the interaction term was statistically significant at the 5% level.  However, the direction of the interaction was surprising: it is negative.  Holding income constant, having a co-applicant aged 45 or older (likely a parent) lowers the requested mortgage by an additional \\$98,790 for applicants under 25 compared with applicants aged 25-34.  Adding the three coefficients together, an under-25 applicant with a likely parent co-applicant requests about \\$123,560 less than a 25-34 applicant whose co-applicant is under 45.

This might be explained by the fact that if someone is under 25 and single (since they aren't buying a place where their similarly-aged partner is the co-applicant), they are trying to buy a much smaller and therefore less expensive single-family home than a young couple who might seek a single-family property to start a family.  Likewise, if someone under 25 with negligible wealth does not qualify for a mortgage of any size by themselves, the inclusion of their parent as their co-applicant may be in essence a "Hail Mary" to help them get the smallest mortgage possible.
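
To make the interaction easier to read, the sketch below adds up the relevant coefficients for each of the four applicant groups, holding income_ln fixed.  The coefficients are copied from the regression summary; the dictionary keys and the offset function are illustrative names, not part of the lab code.

```python
# Coefficients copied from the interaction model's summary table.
coef = {
    "under25": -2922.1249,         # C(applicant_age)[T.<25]
    "parent": -21850.0,            # C(parent)[T.1]
    "under25_x_parent": -98790.0,  # C(applicant_age)[T.<25]:C(parent)[T.1]
}

def offset(under25, parent):
    """Predicted loan_amount difference from the baseline group
    (applicant 25-34, co-applicant under 45), holding income_ln fixed."""
    return (coef["under25"] * under25
            + coef["parent"] * parent
            + coef["under25_x_parent"] * under25 * parent)

for under25 in (0, 1):
    for parent in (0, 1):
        print(f"under25={under25}, parent={parent}: {offset(under25, parent):,.0f}")
```

The interaction coefficient on its own is the difference between the parent effect for under-25 applicants and the parent effect for 25-34 applicants, not the full gap for any single group.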

In [10]:
#create new dummy variable: 1 if the co-applicant is 45 or older (so it's likely a parent), 0 otherwise
parent_ages = ['45-54', '55-64', '65-74', '>74']
mortgage2['parent'] = np.where(mortgage2['co_applicant_age'].isin(parent_ages), 1, 0)

#store the flag as a string so it is treated as a category
mortgage2['parent'] = mortgage2['parent'].astype('str')

In [11]:
mortgage2.head(10)

| | income_ln | applicant_age | state_code | derived_dwelling_category | co_applicant_age | loan_amount | parent |
|---|---|---|---|---|---|---|---|
| 0 | 3.496508 | 25-34 | LA | Single Family (1-4 Units):Site-Built | 25-34 | 165000 | 0 |
| 1 | 4.682131 | 25-34 | TX | Single Family (1-4 Units):Site-Built | <25 | 225000 | 0 |
| 2 | 4.219508 | 65-74 | TX | Single Family (1-4 Units):Site-Built | 65-74 | 235000 | 1 |
| 7 | 3.951244 | 25-34 | OK | Single Family (1-4 Units):Site-Built | 45-54 | 85000 | 1 |
| 8 | 3.806662 | 25-34 | LA | Single Family (1-4 Units):Site-Built | 55-64 | 165000 | 1 |
| 10 | 4.564348 | 35-44 | AR | Single Family (1-4 Units):Site-Built | 35-44 | 275000 | 0 |
| 11 | 4.60517 | 35-44 | TX | Single Family (1-4 Units):Site-Built | 25-34 | 245000 | 0 |
| 12 | 4.553877 | 25-34 | OK | Single Family (1-4 Units):Site-Built | 25-34 | 175000 | 0 |
| 13 | 4.430817 | 45-54 | TX | Single Family (1-4 Units):Site-Built | 35-44 | 225000 | 0 |
| 14 | 5.030438 | 35-44 | TX | Single Family (1-4 Units):Site-Built | 35-44 | 195000 | 0 |
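
Before fitting an interaction between two categorical variables, it can help to check how many observations fall into each combination, since an interaction estimated from a handful of rows will be noisy.  A minimal sketch using pd.crosstab is shown below; the rows are made up for illustration and stand in for the filtered subset.

```python
import pandas as pd

# Made-up rows standing in for the filtered single-family, under-35 subset.
toy = pd.DataFrame({
    "applicant_age": ["<25", "<25", "25-34", "25-34", "25-34", "<25"],
    "parent":        ["1",   "0",   "0",     "1",     "0",     "0"],
})

# Rows are applicant age groups, columns are the parent flag.
cell_counts = pd.crosstab(toy["applicant_age"], toy["parent"])
print(cell_counts)
```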


In [12]:
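#same subset as before, now with the applicant_age x parent interaction
#note: statsmodels spells the robust-covariance keyword cov_type; the cor_type keyword used here is not
#applied, which is why the summary reports nonrobust standard errors
lm2 = smf.ols('loan_amount ~ income_ln + C(applicant_age)*C(parent)',
              data = mortgage2,
              subset = (mortgage2['derived_dwelling_category'] == "Single Family (1-4 Units):Site-Built")
                       & ((mortgage2['applicant_age'] == '25-34') | (mortgage2['applicant_age'] == '<25'))).fit(cor_type = "HC3")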

lm2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | loan_amount | R-squared: | 0.323 |
| Model: | OLS | Adj. R-squared: | 0.319 |
| Method: | Least Squares | F-statistic: | 80.5 |
| Date: | Thu, 11 Nov 2021 | Prob (F-statistic): | 7.7e-56 |
| Time: | 13:10:10 | Log-Likelihood: | -8851.3 |
| No. Observations: | 679 | AIC: | 17710.0 |
| Df Residuals: | 674 | BIC: | 17740.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.425e+05 | 3.68e+04 | -9.315 | 0.000 | -4.15e+05 | -2.7e+05 |
| C(applicant_age)[T.<25] | -2922.1249 | 1.43e+04 | -0.204 | 0.839 | -3.11e+04 | 2.52e+04 |
| C(parent)[T.1] | -2.185e+04 | 1.97e+04 | -1.111 | 0.267 | -6.05e+04 | 1.68e+04 |
| C(applicant_age)[T.<25]:C(parent)[T.1] | -9.879e+04 | 4e+04 | -2.470 | 0.014 | -1.77e+05 | -2.03e+04 |
| income_ln | 1.338e+05 | 7827.704 | 17.095 | 0.000 | 1.18e+05 | 1.49e+05 |

| | | | |
|---|---|---|---|
| Omnibus: | 25.293 | Durbin-Watson: | 1.309 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 50.073 |
| Skew: | 0.215 | Prob(JB): | 1.34e-11 |
| Kurtosis: | 4.259 | Cond. No. | 46.9 |


### 3. Give me an update on your independent project.  What do you plan to investigate?  What are your hypotheses?  What data are you using?  How can we help?  Write your answer here in the lab report, but also send your answer as an email to me at gme2101@columbia.edu with subject "Independent Project Update - [insert your name]"

* I plan to investigate racial discrimination in US mortgage approvals between 2019 and 2020.  I am curious about how soaring demand for single-family properties during the beginning of the pandemic may have influenced who was approved for mortgages to move into "more desired" neighborhoods.  To narrow the scope, I will focus on Ohio census tracts and single-family homes.  I chose this scope because Ohio had one of the highest property value growth rates in the country and it has about equal proportions of White and Black residents.


* My hypothesis is that census tracts in Ohio with higher-than-average single-family property value increases between 2019 and 2020 had greater increases in non-White mortgage denial rates compared to 2018 to 2019.  This occurs because mortgage lenders' algorithms (while they claim to be unbiased) can gatekeep and select for certain demographics in desirable communities.  After all, redlining has proven to be a discriminatory practice that endures to this day, so I want to explore whether the pandemic exacerbated this behavior.


* My data will be the 2018, 2019, and 2020 Home Mortgage Disclosure Act data sets.


* In terms of help, I was wondering how to work with large data sets that I might not be able to load into my laptop's RAM.  For instance, are there resources on campus or virtual computers to read the data?
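
One option that works on a laptop is to read the file in pieces rather than all at once.  The sketch below uses the chunksize and usecols arguments of pd.read_csv to filter a pipe-delimited file chunk by chunk, keeping only the rows and columns of interest.  The in-memory sample stands in for a real HMDA file, and the choice of columns and of Ohio as the filter is only an illustration.

```python
import io
import pandas as pd

# Tiny pipe-delimited stand-in for an HMDA file; a real run would pass a file path.
sample = io.StringIO(
    "state_code|loan_amount|income\n"
    "OH|165000|55\n"
    "TX|225000|84\n"
    "OH|85000|40\n"
    "LA|275000|130\n"
    "OH|195000|96\n"
)

kept = []
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame;
# usecols limits memory further by loading only the needed columns.
for chunk in pd.read_csv(sample, sep="|", usecols=["state_code", "loan_amount"], chunksize=2):
    kept.append(chunk[chunk["state_code"] == "OH"])

ohio = pd.concat(kept, ignore_index=True)
print(ohio)
```

Only the filtered rows are ever held in memory together, so the full file never needs to fit in RAM.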