<center><h1> Case Study 2</h1></center>
<center><h3> Week 2 (out of 5)</h3></center>

**Author(s):**
1. Belicia Rodriguez (belicia.rodriguez@emory.edu)
 
**Data Source**: W.C. Hunter and M.B. Walker (1996), [“*The Cultural Affinity Hypothesis and Mortgage Lending Decisions*,”](https://link.springer.com/article/10.1007/BF00174551) Journal of Real Estate Finance and Economics 13, 57-70.
 
**Book**: [Introductory Econometrics: A Modern Approach](https://economics.ut.ac.ir/documents/3030266/14100645/Jeffrey_M._Wooldridge_Introductory_Econometrics_A_Modern_Approach__2012.pdf) by Jeffrey Wooldridge

**Data Description**: ```http://fmwww.bc.edu/ec-p/data/wooldridge/loanapp.dta```

```
  Obs:  1989

  1. occ                       occupancy
  2. loanamt                   loan amt in thousands
  3. action                    type of action taken
  4. msa                       msa number of property
  5. suffolk                   =1 if property in Suffolk County
  6. race                      race of applicant
  7. gender                    gender of applicant
  8. appinc                    applicant income, $1000s
  9. typur                     type of purchaser of loan
 10. unit                      number of units in property
 11. married                   =1 if applicant married
 12. dep                       number of dependents
 13. emp                       years employed in line of work
 14. yjob                      years at this job
 15. self                      self-employment dummy
 16. atotinc                   total monthly income
 17. cototinc                  coapp total monthly income
 18. hexp                      propose housing expense
 19. price                     purchase price
 20. other                     other financing, $1000s
 21. liq                       liquid assets
 22. rep                       no. of credit reports
 23. gdlin                     credit history meets guidelines
 24. lines                     no. of credit lines on reports
 25. mortg                     credit history on mortgage paym
 26. cons                      credit history on consumer stuf
 27. pubrec                    =1 if filed bankruptcy
 28. hrat                      housing exp, % total inccome
 29. obrat                     other oblgs,  % total income
 30. fixadj                    fixed or adjustable rate?
 31. term                      term of loan in months
 32. apr                       appraised value
 33. prop                      type of property
 34. inss                      PMI sought
 35. inson                     PMI approved
 36. gift                      gift as down payment
 37. cosign                    is there a cosigner
 38. unver                     unverifiable info
 39. review                    number of times reviewed
 40. netw                      net worth
 41. unem                      unemployment rate by industry
 42. min30                     =1 if minority pop. > 30%
 43. bd                        =1 if boarded-up val > MSA med
 44. mi                        =1 if tract inc > MSA median
 45. old                       =1 if applic age > MSA median
 46. vr                        =1 if tract vac rte > MSA med
 47. sch                       =1 if > 12 years schooling
 48. black                     =1 if applicant black
 49. hispan                    =1 if applicant Hispanic
 50. male                      =1 if applicant male
 51. reject                    =1 if action == 3
 52. approve                   =1 if action == 1 or 2
 53. mortno                    no mortgage history
 54. mortperf                  no late mort. payments
 55. mortlat1                  one or two late payments
 56. mortlat2                  > 2 late payments
 57. chist                     =0 if accnts deliq. >= 60 days
 58. multi                     =1 if two or more units
 59. loanprc                   amt/price
 60. thick                     =1 if rep > 2
 61. white                     =1 if applicant white
 62. obwhte                    obrat*awhite
 ```

1. [5 points] After loading the data set into a pandas DataFrame, use the approaches described [here](https://stackoverflow.com/questions/50326157/create-dummy-variables-for-interdependent-categories-in-pandas) to create dummies for the interdependent race categories `black`, `hispan`, `white` and gender category `male`. **Hint:** You should have generated a total of 6 new dummy variables.

In [1]:
import pandas as pd
import numpy as np
import patsy
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# import dataset as pandas dataframe
df = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/loanapp.dta')

# change numerical values in race and gender to their representation (makes creation of dummy variables easier)
df.loc[df.race==5,'race'] = 'white'
df.loc[df.race==4,'race'] = 'black'
df.loc[df.race==3,'race'] = 'hispan'

df.loc[df.gender==1, 'gender'] = 'male'
df.loc[df.gender==2, 'gender'] = 'female'
df.loc[df.gender==3, 'gender'] = 'other'

# create separate 0/1 dummies for the race and gender categories
df = pd.get_dummies(df, columns = ['race', 'gender'], dtype = int)

# create race-gender interaction dummies, e.g. black_male = race_black * gender_male
for r in ['race_black', 'race_hispan', 'race_white']:
    for g in ['gender_female', 'gender_male', 'gender_other']:
        # strip the 'race_' and 'gender_' prefixes to build the new column name
        df[r[5:] + '_' + g[7:]] = df[r] * df[g]

# drop unneeded dummy variables
df.drop(columns=['race_black', 'race_hispan', 'race_white','gender_female', 'gender_male', 'gender_other'], inplace = True)
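
As a small illustration of the interaction step above, the following sketch builds race-gender dummies on a made-up four-row frame (the `toy` data and its values are hypothetical, not taken from `loanapp.dta`). It is an alternative to multiplying the separate dummies: the two labels are concatenated first, so `get_dummies` only creates columns for combinations that actually occur.

```python
import pandas as pd

# hypothetical toy data; labels mirror the recoded race and gender columns
toy = pd.DataFrame({
    'race':   ['white', 'black', 'hispan', 'white'],
    'gender': ['male', 'female', 'male', 'female'],
})

# concatenate the labels, then one-hot encode the combined category
combo = toy['race'] + '_' + toy['gender']
dummies = pd.get_dummies(combo, dtype=int)

# each row belongs to exactly one race-gender cell
print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())
```

Because combinations that never appear produce no column, this variant may yield fewer dummies than the nested loop, which always creates every race-gender pair.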

2. [5 points] Rename the 6 newly created dummy variables accordingly, i.e., `black_male` equals 1 for a black male applicant and 0 otherwise, `white_female` equals 1 for a white female applicant and 0 otherwise, etc.

I already named the newly created dummy variables this way: I first replaced the numerical codes with their string representations and then built the interactions in the for loop.

3. You are interested in building a model that accurately predicts loan rejection (`reject`) based on various applicant features such as `loanamt`, `appinc`, `unit`, `married`, `dep`, `emp`, `yjob`, `self`, `atotinc`, `cototinc`, `hexp`, `price`, `other`, `liq`, `rep`, `gdlin`, `lines`, `mortg`, `cons`, `pubrec`, `hrat`, `obrat`, `fixadj`, `term`, `apr`, `gift`, `cosign`, `netw`, `unem`, `min30`, `bd`, `mi`, `old`, `vr`, `sch`, `mortno`, `chist`, `multi`, `loanprc`, `thick`, and 5 of the 6 newly created race-gender dummies, i.e., make `white_male` the base category.
    1. [5 points] Replace *all* the proposed features measured in USD with their natural logarithm.
    2. [5 points] *Add* demeaned squared terms of *all* features measured in units of time to your data frame, e.g., `lemp2` where $\texttt{lemp2}=\left(\ln(\texttt{emp})-\overline{\ln(\texttt{emp})}\right)^2$ and $\overline{\ln(\texttt{emp})}$ represents the sample average of $\ln(\texttt{emp})$. **Note**: You do not need to do this for the remaining features that are not measured in units of time.
    3. [10 points] *Add* to your data frame _all_ products between the 5 race-gender dummy variables and the other **non-dummy** features in the data frame so far (including those you created in point B. above).

In [2]:
# list all features of model
feat = ['loanamt', 'appinc', 'unit', 'married', 'dep', 'emp', 'yjob', 'self', 'atotinc', 'cototinc', 'hexp', 'price', 'other', 'liq', 'rep', 'gdlin', 'lines', 'mortg', 'cons', 'pubrec', 'hrat', 'obrat', 'fixadj', 'term', 'apr', 'gift', 'cosign', 'netw', 'unem', 'min30', 'bd', 'mi', 'old', 'vr', 'sch', 'mortno', 'chist', 'multi', 'black_male', 'black_female', 'black_other', 'white_female', 'white_other', 'hispan_male', 'hispan_female', 'hispan_other']

feat_2 = []

# list proposed features measured in US dollars
dollars_feat = ['loanamt', 'appinc', 'atotinc', 'cototinc', 'hexp', 'price', 'other', 'liq', 'apr', 'loanprc']

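# cast to float so the logged values can be stored without a dtype conflict
df[dollars_feat] = df[dollars_feat].astype(float)

# take natural log of dollar features (only strictly positive values are logged)
for i in dollars_feat:
    for n in range(len(df)):
        # use .loc to write into df directly instead of chained indexing
        if df.loc[n, i] > 0:
            df.loc[n, i] = np.log(df.loc[n, i])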

# list features measured in units of time
time_feat = ['emp', 'yjob', 'term']

# add demeaned squared terms of the time features
time_demeaned = []
for i in time_feat:
    new_var = i + '_dmean_sq'
    df[new_var] = (df[i] - df[i].mean(skipna = True))**2

    # add new variable to time demeaned list
    time_demeaned.append(new_var)

# split feat into two lists for race-gender dummy features and other non-dummy features
race_gender_feat = ['black_male', 'black_female', 'black_other', 'white_female', 'white_other', 'hispan_male', 'hispan_female', 'hispan_other']

non_dummy = ['loanamt', 'appinc', 'unit', 'married', 'dep', 'emp', 'yjob', 'self', 'atotinc', 'cototinc', 'hexp', 'price', 'other', 'liq', 'rep', 'gdlin', 'lines', 'mortg', 'cons', 'pubrec', 'hrat', 'obrat', 'fixadj', 'term', 'apr', 'gift', 'cosign', 'netw', 'unem', 'min30', 'bd', 'mi', 'old', 'vr', 'sch', 'mortno', 'chist', 'multi']

# create products of race-gender dummy variables and other non-dummy features in dataframe
for i in race_gender_feat:
    for j in non_dummy + time_demeaned:
        join_str = '*'.join([i,j])
        df[join_str] = df[i] * df[j]

        # add new variable to features 2 list
        feat_2.append(join_str)

# add time demeaned to feat_2 list
feat_2 = feat_2 + time_demeaned

**Comment**

(2) You could avoid the loop over the index (which can take long as the number of observations grows) by first finding the variables that contain non-positive values:

```zeros = df[usd_terms].loc[:,(df[usd_terms]<=0).any()].columns.tolist()```

Then iterate over the variables in the ```zeros``` list to replace zeros with ones. With those changes you could iterate over the list ```usd_terms``` (the list of USD features, `dollars_feat` in the cell above) and apply ```np.log()``` directly (this produces the same result, since $\log(1)=0$).
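
A minimal sketch of the approach described in this comment, using a hypothetical two-column frame and a `usd_terms` list standing in for the USD features. It assumes the non-positive entries are exactly zero; a negative value would not be caught by the replacement and would produce `NaN` under the log, so the data should be checked first.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the loan data
toy = pd.DataFrame({'price': [100.0, 250.0, 80.0],
                    'other': [0.0, 12.5, 0.0]})
usd_terms = ['price', 'other']

# columns that contain at least one non-positive value
zeros = toy[usd_terms].loc[:, (toy[usd_terms] <= 0).any()].columns.tolist()

# replace zeros with ones so that log(1) = 0 keeps them at zero
toy[zeros] = toy[zeros].replace(0, 1)

# take logs of all USD columns at once, with no loop over rows
toy[usd_terms] = np.log(toy[usd_terms])

print(zeros)
print(toy)
```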

4. [20 points] Perform a __Ridge__ Logistic Regression of your model and choose the necessary hyperparameter via a 5-fold cross-validation. Report your logistic score and the confusion matrix based on a validation test made up of 20% of the original sample (use 42 as the seed). **Note**: You can do this via the `LogisticRegressionCV` function from the `sklearn.linear_model` subpackage by choosing a suitable set for the `Cs` and `l1_ratios` accordingly.

In [5]:
# create string model specification
f = 'reject ~ -1 + ' + ' + '.join([x for x in feat + feat_2])

# create design matrices for specifications
y, X = patsy.dmatrices(f, data=df, return_type='dataframe')

# partition dataset into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# standardize each column of the training and test design matrices (each with its own mean and std)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

# create kfold object
kfold = KFold(n_splits=5, random_state=42, shuffle=True)

# create dataframe that will collect scores and optimal alpha and lambdas for each test
results = pd.DataFrame(columns=['Scores', 'Lambda', 'Alpha'], index=['Ridge', 'LASSO', 'Elastic Net'])
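
The cell above calls `fit_transform` on both partitions, so the test set is standardized with its own means and standard deviations. A common alternative, sketched here on hypothetical random data rather than the loan sample, is to estimate the scaling on the training partition only and reuse it for the test partition. Applying this to the analysis may change the reported scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical design matrix: 100 rows, 3 features
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_tr, X_te = train_test_split(X_toy, test_size=0.20, random_state=42)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # learn mean/std from the training rows
X_te_std = scaler.transform(X_te)      # reuse the training mean/std

print(X_tr_std.mean(axis=0).round(6))  # approximately zero by construction
print(X_te_std.mean(axis=0).round(6))  # generally not exactly zero
```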

In [6]:
# find the optimal ridge lambda using cross validation setup
searchCV = LogisticRegressionCV(
    Cs = list(np.linspace(0.008,0.009,20,endpoint=True))
    ,penalty = 'l2'
    ,scoring = 'accuracy' #proportion of main diag of confusion matrix
    ,cv = kfold
    ,random_state = 42
    ,max_iter = 10000
    ,fit_intercept = True
    ,solver = 'lbfgs'
    ,tol = 10
)

# fit the cross validation logit regression
logitcv = searchCV.fit(X_train, y_train.values.ravel())

# calculate proportion of true predictions
results.loc['Ridge', 'Scores'] = logitcv.score(X_test, y_test)

# store the selected C (inverse of the regularization strength, i.e. 1/lambda)
results.loc['Ridge', 'Lambda'] = (logitcv.C_).round(6)

# show results
results.loc[['Ridge']]

|  | Scores | Lambda | Alpha |
|---|---|---|---|
| Ridge | 0.902507 | [0.008158] |  |


In [7]:
# calculate confusion matrix of test set
ridge_confusion_mat = confusion_matrix(y_test, logitcv.predict(X_test))
print('Ridge Confusion Matrix: \n', ridge_confusion_mat)

Ridge Confusion Matrix: 
 [[312   7]
 [ 28  12]]


5. [20 points] Perform a __Lasso__ Logistic Regression of your model and choose the necessary hyperparameter via a 5-fold cross-validation. Report your logistic score and the confusion matrix based on a validation test made up of 20% of the original sample (use 42 as the seed). **Note:** You can do this via the `LogisticRegressionCV` function from the `sklearn.linear_model` subpackage by choosing a suitable set for the `Cs` and `l1_ratios` accordingly.

In [8]:
# find the optimal lasso lambda using cross validation setup
searchCV = LogisticRegressionCV(
    Cs = list(np.linspace(0.008,0.009,20,endpoint=True))
    ,penalty = 'l1'
    ,scoring = 'accuracy' #proportion of main diag of confusion matrix
    ,cv = kfold
    ,random_state = 42
    ,max_iter = 10000
    ,fit_intercept = True
    ,solver = 'saga'
    ,tol = 10
)

# fit the cross validation logit regression
logitcv = searchCV.fit(X_train, y_train.values.ravel())

# calculate proportion of true predictions
results.loc['LASSO', 'Scores'] = logitcv.score(X_test, y_test)

# store the selected C (inverse of the regularization strength, i.e. 1/lambda)
results.loc['LASSO', 'Lambda'] = (logitcv.C_).round(6)

# show results
results.loc[['LASSO']]

|  | Scores | Lambda | Alpha |
|---|---|---|---|
| LASSO | 0.891365 | [0.008053] |  |


In [9]:
# calculate confusion matrix of test set
lasso_confusion_mat = confusion_matrix(y_test, logitcv.predict(X_test))
print('LASSO Confusion Matrix: \n', lasso_confusion_mat)

LASSO Confusion Matrix: 
 [[318   1]
 [ 38   2]]


6. [20 points] Perform an __Elastic Net__ Logistic Regression of your model and choose the necessary hyperparameter via a 5-fold cross-validation. Report your logistic score and the confusion matrix based on a validation test made up of 20% of the original sample (use 42 as the seed).

In [10]:
# find the optimal elasticnet alpha and lambda using cross validation setup
searchCV = LogisticRegressionCV(
    Cs = list(np.linspace(0.007,0.009,20,endpoint=True)) #inverse regularization strength, 1/lambda
    ,penalty = 'elasticnet'
    ,l1_ratios = np.linspace(0.4,0.052,endpoint=True) #elastic net mixing parameter alpha (default grid of 50 points)
    ,scoring = 'accuracy' #proportion of main diag of confusion matrix
    ,cv = kfold
    ,random_state = 42
    ,max_iter = 10000
    ,fit_intercept = True
    ,solver = 'saga' #only optimizer available for elasticnet
    ,tol = 10
)

# fit the cross validation logit regression
logitcv = searchCV.fit(X_train, y_train.values.ravel())

# calculate proportion of true predictions
results.loc['Elastic Net', 'Scores'] = logitcv.score(X_test, y_test)

# store the selected C (1/lambda) and the selected l1_ratio (alpha)
results.loc['Elastic Net', 'Lambda'] = (logitcv.C_).round(6)
results.loc['Elastic Net', 'Alpha'] = (logitcv.l1_ratio_).round(6)

# show results
results.loc[['Elastic Net']]

|  | Scores | Lambda | Alpha |
|---|---|---|---|
| Elastic Net | 0.89415 | [0.008474] | [0.378694] |


In [11]:
# calculate confusion matrix of test set
elastic_confusion_mat = confusion_matrix(y_test, logitcv.predict(X_test))
print('Elastic Net Confusion Matrix: \n', elastic_confusion_mat)

Elastic Net Confusion Matrix: 
 [[307  12]
 [ 26  14]]


7. [10 points] What machine among the three you built would you use and why?

In [12]:
# print results of three logistic regressions for comparison
results

|  | Scores | Lambda | Alpha |
|---|---|---|---|
| Ridge | 0.902507 | [0.008158] |  |
| LASSO | 0.891365 | [0.008053] |  |
| Elastic Net | 0.89415 | [0.008474] | [0.378694] |


In [13]:
# print confusion matrices
print('Ridge Confusion Matrix: \n', ridge_confusion_mat)
print('LASSO Confusion Matrix: \n', lasso_confusion_mat)
print('Elastic Net Confusion Matrix: \n', elastic_confusion_mat)

Ridge Confusion Matrix: 
 [[312   7]
 [ 28  12]]
LASSO Confusion Matrix: 
 [[318   1]
 [ 38   2]]
Elastic Net Confusion Matrix: 
 [[307  12]
 [ 26  14]]


I would use the Ridge logistic regression because it has the highest proportion of correct predictions on the validation set. Ridge has the highest score, meaning it correctly classified about 90% of the applications in the test set (approvals and rejections combined). Also, when comparing the total number of correct predictions (the sum of the main diagonal of the confusion matrix) across Ridge, LASSO and Elastic Net, Ridge has 324 correct test set predictions, whereas LASSO has 320 and Elastic Net has 321.
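
Accuracy is not the only summary available from the confusion matrices above. As a sketch, the snippet below recomputes accuracy and also the share of actual rejections each model caught (recall for `reject = 1`), assuming scikit-learn's layout of rows as true classes and columns as predicted classes. The `summarize` helper is hypothetical and written only for this illustration.

```python
# confusion matrices copied from the output above: [[TN, FP], [FN, TP]]
matrices = {
    'Ridge':       [[312, 7], [28, 12]],
    'LASSO':       [[318, 1], [38, 2]],
    'Elastic Net': [[307, 12], [26, 14]],
}

def summarize(cm):
    """Return (accuracy, recall for the rejected class)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    return accuracy, recall

for name, cm in matrices.items():
    acc, rec = summarize(cm)
    print(f'{name}: accuracy={acc:.3f}, rejection recall={rec:.3f}')
```

On these numbers Ridge has the highest accuracy, while Elastic Net flags the most actual rejections (14 of 40), so the preferred model can depend on how costly a missed rejection is relative to a false alarm.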