<center><h1> Case Study 1</h1></center>
<center><h3> Week 1 (out of 5)</h3></center>

**Author(s):**
1. Belicia Rodriguez (belicia.rodriguez@emory.edu)
 
**Data Source**: W.C. Hunter and M.B. Walker (1996), [“*The Cultural Affinity Hypothesis and Mortgage Lending Decisions*,”](https://link.springer.com/article/10.1007/BF00174551) Journal of Real Estate Finance and Economics 13, 57-70.
 
**Book**: [Introductory Econometrics: A Modern Approach](https://economics.ut.ac.ir/documents/3030266/14100645/Jeffrey_M._Wooldridge_Introductory_Econometrics_A_Modern_Approach__2012.pdf) by Jeffrey Wooldridge

**Data Description**: ```http://fmwww.bc.edu/ec-p/data/wooldridge/loanapp.dta```

```
  Obs:  1989

  1. occ                       occupancy
  2. loanamt                   loan amt in thousands
  3. action                    type of action taken
  4. msa                       msa number of property
  5. suffolk                   =1 if property in Suffolk County
  6. race                      race of applicant
  7. gender                    gender of applicant
  8. appinc                    applicant income, $1000s
  9. typur                     type of purchaser of loan
 10. unit                      number of units in property
 11. married                   =1 if applicant married
 12. dep                       number of dependents
 13. emp                       years employed in line of work
 14. yjob                      years at this job
 15. self                      self-employment dummy
 16. atotinc                   total monthly income
 17. cototinc                  coapp total monthly income
 18. hexp                      propose housing expense
 19. price                     purchase price
 20. other                     other financing, $1000s
 21. liq                       liquid assets
 22. rep                       no. of credit reports
 23. gdlin                     credit history meets guidelines
 24. lines                     no. of credit lines on reports
 25. mortg                     credit history on mortgage paym
 26. cons                      credit history on consumer stuf
 27. pubrec                    =1 if filed bankruptcy
 28. hrat                      housing exp, % total inccome
 29. obrat                     other oblgs,  % total income
 30. fixadj                    fixed or adjustable rate?
 31. term                      term of loan in months
 32. apr                       appraised value
 33. prop                      type of property
 34. inss                      PMI sought
 35. inson                     PMI approved
 36. gift                      gift as down payment
 37. cosign                    is there a cosigner
 38. unver                     unverifiable info
 39. review                    number of times reviewed
 40. netw                      net worth
 41. unem                      unemployment rate by industry
 42. min30                     =1 if minority pop. > 30%
 43. bd                        =1 if boarded-up val > MSA med
 44. mi                        =1 if tract inc > MSA median
 45. old                       =1 if applic age > MSA median
 46. vr                        =1 if tract vac rte > MSA med
 47. sch                       =1 if > 12 years schooling
 48. black                     =1 if applicant black
 49. hispan                    =1 if applicant Hispanic
 50. male                      =1 if applicant male
 51. reject                    =1 if action == 3
 52. approve                   =1 if action == 1 or 2
 53. mortno                    no mortgage history
 54. mortperf                  no late mort. payments
 55. mortlat1                  one or two late payments
 56. mortlat2                  > 2 late payments
 57. chist                     =0 if accnts deliq. >= 60 days
 58. multi                     =1 if two or more units
 59. loanprc                   amt/price
 60. thick                     =1 if rep > 2
 61. white                     =1 if applicant white
 62. obwhte                    obrat*awhite
 ```

In your new job as a data analyst, you have been given this data set and asked to _build_ a machine (in this case a **linear probability model**) to help your client automate their loan approval decisions. Using your knowledge of the Ridge, Lasso, and Elastic Net estimators, find an economically sound model with the smallest *mean squared error* when 20% of the observations in this data set are held out to validate the proposed model, using a seed equal to 42.

<h4>Things to consider:</h4>

    1. Read the original paper to understand how the variables were constructed and which features enter their model.
    2. Read section 7.4 titled "Interactions Involving Dummy Variables" in 'Introductory Econometrics: A Modern Approach' by Jeffrey Wooldridge.
    3. Read section 7.5 titled "A Binary Dependent Variable: The Linear Probability Model" in 'Introductory Econometrics: A Modern Approach' by Jeffrey Wooldridge.
    4. You should try standardizing as well as normalizing your chosen features when trying different estimators.
    5. Always estimate an intercept.
    6. Your answers should include the final chosen specification and the reported mean squared error of your validation data set.
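
For reference, the objects named in the task can be written out as follows. The notation ($n$ training observations, $p$ features, penalty weight $\alpha$, mixing weight $\rho$) is an assumption of this sketch, not the paper's or the book's. The penalized objective is the form documented for scikit-learn's `ElasticNet`, where $\rho$ corresponds to `l1_ratio`; $\rho = 1$ gives the LASSO penalty and $\rho = 0$ a pure ridge-type penalty.

$$
\begin{aligned}
P(\text{approve}=1 \mid x) &= \beta_0 + \sum_{j=1}^{p} \beta_j x_j \\
\text{MSE}_{\text{val}} &= \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} \left(y_i - \hat{y}_i\right)^2 \\
\hat{\beta} &= \arg\min_{\beta_0,\beta} \; \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \alpha\rho \sum_{j=1}^{p}|\beta_j| + \frac{\alpha(1-\rho)}{2}\sum_{j=1}^{p} \beta_j^2
\end{aligned}
$$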

In [14]:
import pandas as pd
import numpy as np
import patsy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error

# import dataset
loanapp = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/loanapp.dta')

## Preprocessing

The dataset currently stores every variable as a float, and none of the variables are encoded as dummies. During preprocessing, I construct dummy variables for the intersection of race and gender, convert the categorical variables to dummy variables, and drop the first level of each categorical variable, which prevents perfect collinearity and makes constructing a model later easier. The last section then constructs demeaned and standardized versions of all the variables.
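
As a small illustration of the drop-first encoding described above, here is a sketch on a made-up frame. The values are invented; `prop` only stands in for a categorical column such as the one in `loanapp`.

```python
import pandas as pd

# Toy frame (invented values); `prop` stands in for a categorical column.
toy = pd.DataFrame({"prop": [1.0, 2.0, 3.0, 2.0], "loanamt": [90, 120, 75, 200]})

# drop_first=True omits the first level (prop == 1.0), which becomes the base group.
dummies = pd.get_dummies(toy, columns=["prop"], drop_first=True)

print(list(dummies.columns))
```

Because the base level is absorbed by the intercept, the remaining dummy columns are not perfectly collinear with it. The float codes also explain the `.0` suffix in the generated column names.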

In [15]:
# change numeric values in race and gender to their string representation
for i in range(len(loanapp)):
    # change race variables
    if loanapp['race'][i] == 5:
        loanapp['race'][i] = 'white'
    elif loanapp['race'][i] == 4:
        loanapp['race'][i] = 'black'
    elif loanapp['race'][i] == 3:
        loanapp['race'][i] = 'hispan'
    
    # change gender variables
    if loanapp['gender'][i] == 1:
        loanapp['gender'][i] = 'male'
    elif loanapp['gender'][i] == 2:
        loanapp['gender'][i] = 'female'
    elif loanapp['gender'][i] == 3:
        loanapp['gender'][i] = 'other'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

## Comment:

I do not see the advantage of changing the data type of the categorical variables to strings. If you are worried about the categories being ```float32```, you can change the type to integer (```int```), for instance with ```loanapp['race'] = loanapp['race'].astype('int32')```. Also, pay attention to the ```SettingWithCopyWarning``` (here is a good explanation of it: https://www.dataquest.io/blog/settingwithcopywarning/). Finally, if for some reason you really want to change the floats to strings, there is a faster way to do it. Instead of looping over rows, you can use the ```loc``` indexer and try something like ```loanapp.loc[loanapp.race==5,'race'] = 'white'```, repeating the process for each value. You could use a loop to iterate over the possible values that race can take.
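
Both suggestions can be sketched on a toy frame. The data are invented, and the code-to-label dictionaries simply mirror the loop above. Note that `Series.map` is swapped in here for repeated `loc` assignments; it is one possible vectorized route, not the only one.

```python
import pandas as pd

# Toy stand-in for loanapp; the codes 3/4/5 and 1/2/3 follow the loop above.
toy = pd.DataFrame({"race": [5.0, 4.0, 3.0, 5.0], "gender": [1.0, 2.0, 3.0, 1.0]})

# Option 1: keep the numeric codes but store them as integers.
codes = toy.astype({"race": "int32", "gender": "int32"})

# Option 2: map codes to labels in one vectorized step (no row loop, no chained indexing).
race_labels = {5: "white", 4: "black", 3: "hispan"}
gender_labels = {1: "male", 2: "female", 3: "other"}
labeled = codes.assign(
    race=codes["race"].map(race_labels),
    gender=codes["gender"].map(gender_labels),
)

print(labeled)
```

Neither option writes through a chained index such as `df['race'][i]`, so neither should trigger the warning.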

In [16]:
# create a list of all categorical variables in dataset
cat = ['suffolk', 'married', 'self', 'pubrec', 'cosign', 'min30', 'bd', 'mi', 'old', 'vr', 'sch', 'black', 'hispan', 'male', 'reject', 'approve', 'mortno', 'mortperf', 'mortlat1', 'mortlat2', 'chist', 'multi', 'thick', 'white', 'inss', 'inson', 'fixadj', 'unver', 'gift', 'prop', 'occ', 'action']

# convert all categorical variables to dummy variables
loanapp = pd.get_dummies(loanapp, columns = cat, drop_first = True)
loanapp = pd.get_dummies(loanapp, columns = ['race', 'gender'])

# create race/gender dummy variables
for r in ['race_black', 'race_hispan', 'race_white']:
    for g in ['gender_female', 'gender_male', 'gender_other']:
        loanapp[r[5:] + '_' + g[7:]] = loanapp[r] * loanapp[g]

# drop unneeded dummy variables
loanapp.drop(columns=['race_black', 'race_hispan', 'race_white','gender_female', 'gender_male', 'gender_other'], inplace = True)

In [4]:
# remove .0 portion of string in categorical column names that causes patsy design matrix error
for i in list(range(1,len(loanapp.columns))):
    if loanapp.columns[i][-2:] == '.0':
        loanapp.rename(columns = {loanapp.columns[i] : loanapp.columns[i][:-2]}, inplace = True)

In [5]:
# demean and standardize variables for later interactions
for i in loanapp.columns:
    loanapp[i + '_dmean'] = loanapp[i] - loanapp[i].mean(skipna = True)
    if i[-6:] != '_dmean':
        loanapp[i + '_standard'] = (loanapp[i] - loanapp[i].mean(skipna = True)) / loanapp[i].var(skipna = True)
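
For comparison, the cell above divides the demeaned values by the variance, whereas a z-score conventionally divides by the standard deviation. A minimal sketch on a toy column follows; the values are invented and the name `obrat_zscore` is hypothetical.

```python
import pandas as pd

# Toy column (invented values) standing in for a continuous regressor.
toy = pd.DataFrame({"obrat": [20.0, 30.0, 40.0, 50.0]})

mean = toy["obrat"].mean()
std = toy["obrat"].std()  # sample standard deviation (ddof=1)

toy["obrat_dmean"] = toy["obrat"] - mean            # centered at zero
toy["obrat_zscore"] = (toy["obrat"] - mean) / std   # zero mean, unit standard deviation

print(toy)
```

Dividing by the standard deviation puts every z-scored column on a unit scale, which matters for penalized estimators because the penalty treats all coefficients alike.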

## Regression Function

To avoid copying and pasting the Ridge, LASSO, and Elastic Net code, I wrote a function that runs all three regressions. Its inputs are the dataset and the specification to check; its outputs are the coefficients from each regression and the mean squared error of each prediction on the testing set.

In [6]:
# function that collects shrinkage method regression coefficients and collects test MSE

def runshrinkages(specification, df):
    # create design matrix
    y, X = patsy.dmatrices(specification, data = df, return_type = 'dataframe')

    # create the indices for the train (80%) and validation (20%) data sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

    # create a dataframe that collects the coefficients from different regressions
    shrink_coefs = pd.DataFrame(columns = ['Ridge', 'LASSO', 'Elastic'], index = X.columns)

    # create a dataframe that collects mean square error (MSE) from each test
    mse_df = pd.DataFrame(columns = ['MSE'], index = ['Ridge', 'LASSO', 'Elastic'])

    # Ridge Regression
    # create array of alphas for cross-validation alpha selection
    alphas = np.linspace(0,0.05,20) + 0.001
    
    # use cross-validation ridge regression function to choose alpha for ridge regression
    ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', normalize = True)
    ridgecv.set_params(fit_intercept=True)
    ridgecv.fit(X_train, y_train)

    # run ridge regression to fit training set, predict testing set, and extract MSE from prediction
    ridge = Ridge(alpha = ridgecv.alpha_, normalize = True)
    ridge.set_params(fit_intercept=True)
    ridge.fit(X_train, y_train)
    mse_df['MSE'][0] = mean_squared_error(y_test, ridge.predict(X_test))

    # find ridge shrinkage coefficients of specification
    ridge.fit(X,y)
    shrink_coefs[['Ridge']] = ridge.coef_.transpose(1,0).tolist()

    # Lasso Regression
    # use cross-validation LASSO regression function to choose alpha for LASSO regression
    alphas = np.linspace(0.0001,0.0005,100)
    lassocv = LassoCV(alphas = list(alphas), cv = 10, max_iter = 100000, normalize = True)
    lassocv.set_params(fit_intercept=True)
    lassocv.fit(X_train, y_train.values.ravel()) # avoid warning using ravel()

    # run LASSO regression to fit training set, predict testing set, and extract MSE from prediction
    lasso = Lasso(max_iter = 10000, normalize = True)
    lasso.set_params(alpha=lassocv.alpha_, fit_intercept=True)
    lasso.fit(X_train, y_train.values.ravel())
    mse_df['MSE'][1] = mean_squared_error(y_test, lasso.predict(X_test))

    # find LASSO shrinkage coefficients of specification
    lasso.fit(X,y)
    shrink_coefs['LASSO'] = lasso.coef_.tolist()

    # Elastic Net
    # use cross-validation Elastic Net regression function to choose alpha and L1 ratio for Elastic Net regression
    enetcv = ElasticNetCV(cv=10, random_state=42, fit_intercept=True, normalize = True)
    enetcv.fit(X_train, y_train.values.ravel()) # need .ravel() to avoid warning
    
    # run elastic regression to fit training set, predict testing set, and extract MSE from prediction
    enet = ElasticNet(alpha=enetcv.alpha_, l1_ratio=enetcv.l1_ratio_, fit_intercept=True)
    enet.fit(X_train, y_train)
    mse_df['MSE'][2] = mean_squared_error(y_test, enet.predict(X_test))

    # find elastic net shrinkage coefficients of specification
    enet.fit(X, y)
    shrink_coefs['Elastic'] = enet.coef_.tolist()

    return {'mse': mse_df, 'coefs' : shrink_coefs}


## Comment:

You standardized the variables earlier, but in the options of Ridge, LASSO, and Elastic Net you also chose ```normalize = True```. On the library's web page the authors say: "If you wish to standardize, please use ```sklearn.preprocessing.StandardScaler``` before calling fit on an estimator with ```normalize=False```." Even though you standardized the data yourself, this advice still applies.
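
One way to follow that advice is to put a `StandardScaler` in front of the estimator inside a pipeline. The sketch below uses synthetic data, not the loan data, and reuses the Ridge alpha grid from the function above. It is an illustration under those assumptions, not a drop-in replacement for `runshrinkages`.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the design matrix and a 0/1 outcome (not the loan data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
y = (X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=200) > 0).astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# The scaler is fit on the training rows only, then reused on the validation rows.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.linspace(0.001, 0.051, 20), fit_intercept=True),
)
model.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"validation MSE: {test_mse:.4f}")
```

Keeping the scaler inside the pipeline means the validation rows never influence the scaling statistics.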

## Constructing Base Model

In *The Cultural Affinity Hypothesis and Mortgage Lending Decisions*, Hunter and Walker construct a model from the variables they found to be optimal. I therefore build their model first and record its mean squared error under each shrinkage estimator. In the next section, I aim to construct a model with a lower MSE than the paper's.

In [7]:
# list of variables in paper specification
paper = ['hrat', 'obrat', 'mortlat2_1', 'pubrec_1', 'self_1', 'chist_1', 'unem', 'multi_1', 'cosign_1', 'married_1', 'loanprc', 'dep', 'sch_1', 'thick_1', 'white_1', 'male_1', 'vr_1']

# build model based on paper specification
f = 'approve_1 ~ -1 + ' + ' + '.join([ x for x in paper]) + ' + sch_1:white_1 + sch_1:male_1 + sch_1:thick_1 + sch_1:vr_1 + chist_1:white_1 + chist_1:male_1 + chist_1:thick_1 + chist_1:vr_1 + obrat:male_1'

# print resulting model specification
print('Paper Model Specification')
f

Paper Model Specification


'approve_1 ~ -1 + hrat + obrat + mortlat2_1 + pubrec_1 + self_1 + chist_1 + unem + multi_1 + cosign_1 + married_1 + loanprc + dep + sch_1 + thick_1 + white_1 + male_1 + vr_1 + sch_1:white_1 + sch_1:male_1 + sch_1:thick_1 + sch_1:vr_1 + chist_1:white_1 + chist_1:male_1 + chist_1:thick_1 + chist_1:vr_1 + obrat:male_1'

In [8]:
# run the shrinkage function
paper_model = runshrinkages(f, loanapp)

# print the MSE of each shrinkage regression
print('Paper MSE')
paper_model['mse']

Paper MSE


|  | MSE |
| --- | --- |
| Ridge | 0.0953132 |
| LASSO | 0.0942892 |
| Elastic | 0.0961222 |


In [9]:
# print the shrinkage coefficients of the model
print('Paper Coefficients')
paper_model['coefs']

Paper Coefficients


|  | Ridge | LASSO | Elastic |
| --- | --- | --- | --- |
| hrat | 0.00125201 | 0.0 | 0.001615 |
| obrat | -0.00514317 | -0.003981 | -0.007279 |
| mortlat2_1 | -0.103042 | -0.060581 | -0.090278 |
| pubrec_1 | -0.231903 | -0.231685 | -0.234832 |
| self_1 | -0.0408941 | -0.027214 | -0.043405 |
| chist_1 | 0.163122 | 0.124461 | 0.253122 |
| unem | -0.00598244 | -0.004293 | -0.006317 |
| multi_1 | -0.0739748 | -0.063757 | -0.071716 |
| cosign_1 | 0.0231465 | 0.0 | 0.027631 |
| married_1 | 0.0400116 | 0.023811 | 0.042206 |


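The regression with the lowest mean squared error (MSE) is LASSO. LASSO also shrinks some coefficients to exactly zero (for example `hrat` and `cosign_1` above), so the preferred model keeps only the variables with nonzero LASSO coefficients: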

$approve$ = $\beta_0$ + $\beta_1$ $obrat$ + $\beta_2$ $mortlat2\_1$ + $\beta_3$ $pubrec\_1$ + $\beta_4$ $self\_1$ + $\beta_5$ $chist\_1$ + $\beta_6$ $unem$ + $\beta_7$ $multi\_1$ + $\beta_8$ $married\_1$ + $\beta_9$ $loanprc$ + $\beta_{10}$ $white\_1$ + $\beta_{11}$ $vr\_1$ + $\beta_{12}$ $sch\_1:vr\_1$
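
The selection step can be sketched by filtering on nonzero coefficients. The numbers below are the LASSO column copied from the "Paper Coefficients" output above. Only the ten rows displayed there are included, so the result covers just those variables.

```python
import pandas as pd

# LASSO column copied from the "Paper Coefficients" output (first ten rows only).
lasso_coefs = pd.Series({
    "hrat": 0.0, "obrat": -0.003981, "mortlat2_1": -0.060581, "pubrec_1": -0.231685,
    "self_1": -0.027214, "chist_1": 0.124461, "unem": -0.004293, "multi_1": -0.063757,
    "cosign_1": 0.0, "married_1": 0.023811,
})

kept = lasso_coefs[lasso_coefs != 0].index.tolist()      # variables LASSO retains
dropped = lasso_coefs[lasso_coefs == 0].index.tolist()   # variables shrunk to zero

print("kept:", kept)
print("dropped:", dropped)
```

Applied to the full `paper_model['coefs']['LASSO']` column, the same filter would list every retained term, including interactions.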

## Updated Model Specifications

With the shrinkage results from the previous section, I now build on the base paper model by adding race and gender intersection dummy variables, along with interactions between those dummies and other variables I deem possibly relevant. After testing the model with the race/gender dummies and their interactions, I take those same variables and use their demeaned and standardized versions in two further models to see whether the mean squared error decreases. In total, I construct three more models.

In [10]:
# lists containing the race and gender variables and their demeaned and standardized counterparts
race_gender = ['black_female', 'black_male', 'black_other', 'hispan_female', 'hispan_male', 'hispan_other', 'white_female', 'white_other']

race_gender_dmean = ['black_female_dmean', 'black_male_dmean', 'black_other_dmean', 'hispan_female_dmean', 'hispan_male_dmean', 'hispan_other_dmean', 'white_female_dmean', 'white_other_dmean']

race_gender_standard = ['black_female_dmean', 'black_male_dmean', 'black_other_dmean', 'hispan_female_dmean', 'hispan_male_dmean', 'hispan_other_dmean', 'white_female_dmean', 'white_other_dmean']

# loop through the married, unverified info, and late mortgage payments variables and create list of the interactions with race/gender
race_gender_interact = []
for i in ['married_1', 'unver_1', 'mortlat2_1']:
    for rg in race_gender:
        race_gender_interact.append(i + ':' + rg)

race_gender_interact_dmean = []
for i in ['married_1_dmean', 'unver_1_dmean', 'mortlat2_1_dmean']:
    for rg in race_gender_dmean:
        race_gender_interact_dmean.append(i + ':' + rg)

race_gender_interact_standard = []
for i in ['married_1_standard', 'unver_1_standard', 'mortlat2_1_standard']:
    for rg in race_gender_standard:
        race_gender_interact_standard.append(i + ':' + rg)
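
The three blocks above follow one pattern, so the suffixed names can be derived from a single base list. This matters because, as written above, `race_gender_standard` lists the `_dmean` names. A sketch, with the helper name `interactions` being hypothetical:

```python
# Base race/gender dummies (white_male is the omitted reference group, as above).
race_gender = ["black_female", "black_male", "black_other", "hispan_female",
               "hispan_male", "hispan_other", "white_female", "white_other"]
drivers = ["married_1", "unver_1", "mortlat2_1"]


def interactions(drivers, terms, suffix=""):
    """Return patsy-style 'a:b' interaction terms, appending `suffix` to both names."""
    return [f"{d}{suffix}:{t}{suffix}" for d in drivers for t in terms]


race_gender_interact = interactions(drivers, race_gender)
race_gender_interact_dmean = interactions(drivers, race_gender, "_dmean")
race_gender_interact_standard = interactions(drivers, race_gender, "_standard")

print(race_gender_interact_standard[0])
```

Generating each variant from the suffix keeps the demeaned and standardized specifications from silently sharing columns.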

In [11]:
# create string specification for regular model
g = f + ' + ' + ' + '.join([x for x in race_gender]) + ' + ' + ' + '.join([x for x in race_gender_interact])

# create string specification for demeaned model 
g_dmean = f + ' + ' + ' + '.join([x for x in race_gender_dmean]) + ' + ' + ' + '.join([x for x in race_gender_interact_dmean])

# create string specification for standard model
g_standard = f + ' + ' + ' + '.join([x for x in race_gender_standard]) + ' + ' + ' + '.join([x for x in race_gender_interact_standard])

**Comment:** Good job! I like the idea of the function; it allows you to test many different models faster.

In [12]:
# run shrinkage methods for regular specification and print MSE
regular = runshrinkages(g, loanapp)
print('Regular Shrinkage')
regular['mse']

Regular Shrinkage


|  | MSE |
| --- | --- |
| Ridge | 0.0860434 |
| LASSO | 0.0863096 |
| Elastic | 0.0876498 |


In [13]:
# run shrinkage methods for demeaned specification and print MSE
demeaned = runshrinkages(g_dmean, loanapp)
print('Demeaned Shrinkage')
demeaned['mse']

Demeaned Shrinkage


|  | MSE |
| --- | --- |
| Ridge | 0.0837909 |
| LASSO | 0.0933818 |
| Elastic | 0.0946233 |


In [14]:
# run shrinkage methods for standard specification and print MSE
standard = runshrinkages(g_standard, loanapp)
print('Standard Shrinkage')
standard['mse']

Standard Shrinkage


|  | MSE |
| --- | --- |
| Ridge | 0.0837909 |
| LASSO | 0.0933818 |
| Elastic | 0.0904258 |


Of the three specifications, the regular one has all three MSEs in a narrow range (0.086-0.088), while the demeaned and standard specifications share the lowest Ridge MSE, 0.0837909. However, their LASSO and Elastic Net MSEs are higher than those of the regular specification. Therefore, I take the regular Ridge and LASSO results as the best reference points for building the model for the client.
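
The comparison can be laid out in one table. The numbers below are copied from the MSE outputs reported in this notebook; the frame itself is only a convenience for reading them side by side.

```python
import pandas as pd

# Validation MSEs copied from the outputs above (rows: estimator, columns: specification).
mse = pd.DataFrame(
    {
        "paper": [0.0953132, 0.0942892, 0.0961222],
        "regular": [0.0860434, 0.0863096, 0.0876498],
        "demeaned": [0.0837909, 0.0933818, 0.0946233],
        "standard": [0.0837909, 0.0933818, 0.0904258],
    },
    index=["Ridge", "LASSO", "Elastic"],
)

print(mse.idxmin(axis=1))      # specification with the lowest MSE for each estimator
print(mse.max() - mse.min())   # spread across estimators within each specification
```

The spread row shows why the regular specification is attractive: its three estimators agree closely, while the demeaned and standard specifications do well only under Ridge.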

In [15]:
# print out the regular specification that was used for shrinkages for reference
print('Regular Model Specification')
g

Regular Model Specification


'approve_1 ~ -1 + hrat + obrat + mortlat2_1 + pubrec_1 + self_1 + chist_1 + unem + multi_1 + cosign_1 + married_1 + loanprc + dep + sch_1 + thick_1 + white_1 + male_1 + vr_1 + sch_1:white_1 + sch_1:male_1 + sch_1:thick_1 + sch_1:vr_1 + chist_1:white_1 + chist_1:male_1 + chist_1:thick_1 + chist_1:vr_1 + obrat:male_1 + black_female + black_male + black_other + hispan_female + hispan_male + hispan_other + white_female + white_other + married_1:black_female + married_1:black_male + married_1:black_other + married_1:hispan_female + married_1:hispan_male + married_1:hispan_other + married_1:white_female + married_1:white_other + unver_1:black_female + unver_1:black_male + unver_1:black_other + unver_1:hispan_female + unver_1:hispan_male + unver_1:hispan_other + unver_1:white_female + unver_1:white_other + mortlat2_1:black_female + mortlat2_1:black_male + mortlat2_1:black_other + mortlat2_1:hispan_female + mortlat2_1:hispan_male + mortlat2_1:hispan_other + mortlat2_1:white_female + mortlat2_1

In [16]:
# display the coefficients for the regular shrinkage model results
print('Regular Coefficients')
regular['coefs']

Regular Coefficients


|  | Ridge | LASSO | Elastic |
| --- | --- | --- | --- |
| hrat | 0.00128465 | 0.0 | 0.00163 |
| obrat | -0.00525866 | -0.003931 | -0.007039 |
| mortlat2_1 | -0.0357315 | -0.006939 | -0.051918 |
| pubrec_1 | -0.219244 | -0.215262 | -0.223996 |
| self_1 | -0.043566 | -0.022992 | -0.044695 |
| chist_1 | 0.124696 | 0.10944 | 0.183596 |
| unem | -0.00580628 | -0.003201 | -0.00597 |
| multi_1 | -0.0723511 | -0.057426 | -0.072468 |
| cosign_1 | 0.0346143 | 0.0 | 0.037609 |
| married_1 | 0.0473033 | 0.017477 | 0.049739 |


Here is the final model I would propose to the client:

$approve$ = $\beta_0$ + $\beta_1$ $obrat$ + $\beta_2$ $mortlat2\_1$ + $\beta_3$ $pubrec\_1$ + $\beta_4$ $self\_1$ + $\beta_5$ $chist\_1$ + $\beta_6$ $unem$ + $\beta_7$ $multi\_1$ + $\beta_8$ $married\_1$ + $\beta_9$ $loanprc$ + $\beta_{10}$ $white\_1$ + $\beta_{11}$ $vr\_1$ + $\beta_{12}$ $sch\_1:vr\_1$ + $\beta_{13}$ $hispan\_female$ + $\beta_{14}$ $white\_female$ + $\beta_{15}$ $married\_1:hispan\_female$ + $\beta_{16}$ $unver\_1:black\_male$ + $\beta_{17}$ $unver\_1:hispan\_female$ + $\beta_{18}$ $unver\_1:hispan\_male$ + $\beta_{19}$ $unver\_1:white\_female$ + $\beta_{20}$ $mortlat2\_1:black\_male$ + $\beta_{21}$ $mortlat2\_1:hispan\_male$