# HMDA Data -- Regression Modeling

## Using ML with *scikit-learn* for modeling -- (01) Linear Regression



This notebook explores the Home Mortgage Disclosure Act (HMDA) data for one year -- 2017. We use concepts and tools from our own research and further readings to create a machine learning logistic regression model, along with Naive Bayes classifiers, for predicting loan approval rates.

*Note that as of July 12, 2019, HMDA data is publicly available for 2007 - 2017.*  
https://www.consumerfinance.gov/data-research/hmda/explore

--

**Documentation:**  
(1) See below in '01'

*There are many learning sources and prior work around similar topics: We draw inspiration from past Cohorts as well as learning materials from peer sources such as Kaggle and Towards Data Science*

---

## Importing Libraries and Loading the Data

First, we need to import all the libraries we are going to utilize throughout this notebook. We import everything at the very top of this notebook for order and best practice.

In [2]:
# Importing Libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

import os
import psycopg2
import pandas.io.sql as psql
import sqlalchemy
from sqlalchemy import create_engine

from sklearn import preprocessing
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from scipy import stats
from pylab import*
from matplotlib.ticker import LogLocator

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

*-----*
 
Second, we establish the connection to the AWS PostgreSQL Relational Database System.

In [3]:
# Postgres (host, port, username, password, and database name) -- we define variables and format them into a connection string that the engine can use.
postgres_host = 'aws-pgsql-loan-canoe.cr3nrpkvgwaj.us-east-2.rds.amazonaws.com'  
postgres_port = '5432' 
postgres_username = 'reporting_user' 
postgres_password = 'team_loan_canoe2019'
postgres_dbname = "paddle_loan_canoe"
postgres_str = ('postgresql://{username}:{password}@{host}:{port}/{dbname}'
                .format(username = postgres_username,
                        password = postgres_password,
                        host = postgres_host,
                        port = postgres_port,
                        dbname = postgres_dbname)
               )

# Creating the connection.
cnx = create_engine(postgres_str)
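
Hard-coding credentials in a notebook becomes risky once the notebook is shared. Here is a minimal sketch of one alternative, assuming the same settings are supplied through environment variables. The variable names (`PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, `PG_DBNAME`) and the helper `build_postgres_url` are hypothetical and not part of the original project:

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Build a PostgreSQL connection URL from environment variables.

    The variable names used here are illustrative assumptions.
    """
    username = env["PG_USER"]
    # quote_plus escapes characters such as '@' or ':' that would break the URL.
    password = quote_plus(env["PG_PASSWORD"])
    host = env["PG_HOST"]
    port = env.get("PG_PORT", "5432")
    dbname = env["PG_DBNAME"]
    return f"postgresql://{username}:{password}@{host}:{port}/{dbname}"


# Example with placeholder values (not real credentials):
example_env = {
    "PG_USER": "reporting_user",
    "PG_PASSWORD": "p@ss:word",
    "PG_HOST": "localhost",
    "PG_DBNAME": "example_db",
}
print(build_postgres_url(example_env))
```

The resulting string can be passed to `create_engine` in the same way as `postgres_str`.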

*-----*
 
Last, we use pandas to read the SQL table we saved from the wrangling script (see script: HMDA DATA Summ. Stats. for more).

In [4]:
# Reading the table that the wrangling script wrote with to_sql -- note: the data is already wrangled, so we simply SELECT *.

    # --> extracting everything from our training data into a new dataframe here, and using an index_col.
train_2017 = psql.read_sql_query('SELECT * FROM aa__testing.loans_2017__training', cnx, index_col='outcome').reset_index()

    # --> using pandas to view the first 5 rows for a quick spot-check.
train_2017.head(5)


Unnamed: 0,outcome,year,dn_reason1,agency,state,county,ln_type,ln_purp,ln_amt_000s,hud_med_fm_inc,pop,rt_spread,outcome_bucket,prc_blw_hs__2013_17_5yravg,prc_hs__2013_17_5yravg,prc_ba_plus__2013_17_5yravg,r_birth_2017,r_intl_mig_2017,r_natural_inc_2017
0,Application denied by financial institution,2017,Credit application incomplete,Department of Housing and Urban Development,Michigan,Genesee County,Conventional,Refinancing,140,53700,8791,0.0,"Denied, Not Accepted, or Withdrawn",9,36,21,10,0,-1
1,Application denied by financial institution,2017,Credit application incomplete,Department of Housing and Urban Development,Michigan,Genesee County,Conventional,Refinancing,140,53700,8791,0.0,"Denied, Not Accepted, or Withdrawn",10,32,20,10,0,-1
2,Loan purchased by the institution,2017,,Department of Housing and Urban Development,Illinois,Cook County,Conventional,Home purchase,124,77500,4316,0.0,Approved or Loan Purchased by the Institution,4,22,40,9,3,1
3,Loan purchased by the institution,2017,,Department of Housing and Urban Development,Illinois,Cook County,Conventional,Home purchase,124,77500,4316,0.0,Approved or Loan Purchased by the Institution,14,24,37,9,3,1
4,Loan purchased by the institution,2017,,Department of Housing and Urban Development,Illinois,Cook County,Conventional,Home purchase,124,77500,4316,0.0,Approved or Loan Purchased by the Institution,21,37,15,9,3,1


---

## 00. A Simplistic Model Purely on One Feature *(note: not a regression)*

Just to test this out, let's look at a simple threshold-based (non-OLS) model that predicts purely on median household income.

In [5]:
# Categorizing predicted Y var (loan outcome)  & other integer vars (integer encoding) using postgreSQL and pandas.
train_2017_en = psql.read_sql_query ('''SELECT CAST( CASE 
                                                         WHEN tr17.outcome_bucket = 'Denied, Not Accepted, or Withdrawn' 
                                                             THEN 0 ELSE 1 
                                                     END As INT
                                                  ) 
                                                As outcome_flg,
                                                tr17.hud_med_fm_inc
                                        FROM aa__testing.loans_2017__training tr17'''
                                     , cnx)
train_2017_en.head(5)

Unnamed: 0,outcome_flg,hud_med_fm_inc
0,0,53700
1,0,53700
2,1,77500
3,1,77500
4,1,77500


In [7]:
# Setting some "modeling" variables.
loans_total = train_2017_en.shape[0] 
loans_approved = len(train_2017_en[train_2017_en.outcome_flg == 1])

# Generating what proportion of the loan applications were approved.
proportion_approved = float(loans_approved) /loans_total
print('The proportion of loan applications approved is %s.' % proportion_approved)

The proportion of loan applications approved is 0.5450705949563373.


In [8]:
# Next step: What is a simple way to determine what proportion of the loans were approved at certain thresholds?

    # --> starting by separating median household incomes into categories
greater_than_75K = train_2017_en[train_2017_en.hud_med_fm_inc > 75000]
less_than_75K = train_2017_en[train_2017_en.hud_med_fm_inc < 75000]

    # --> determining what proportion of loans were approved for hud med inc > $75K
proportion_greater_than_75K = float(len(greater_than_75K[greater_than_75K.outcome_flg == 1])) / len(greater_than_75K)
print('The proportion of loan applications approved -- for households with median incomes > $75K -- is %s.' % proportion_greater_than_75K)

    # --> determining what proportion of loans were approved for hud med inc < $75K    
proportion_less_than_75K = float(len(less_than_75K[less_than_75K.outcome_flg == 1])) / len(less_than_75K)
print('The proportion of loan applications approved -- for households with median incomes < $75K -- is %s.' % proportion_less_than_75K)

The proportion of loan applications approved -- for households with median incomes > $75K -- is 0.6384521231522455.
The proportion of loan applications approved -- for households with median incomes < $75K -- is 0.5069824086603518.


*Result*: This simplistic "model" lets us know that households with median incomes ```greater than $75K``` are much more likely to get their loan applications approved. We could say that our model:

- if median HH income > 75K => loan approved (63.8 percent);     
- if median HH income < 75K => loan not likely approved (50.7 percent).  

But this means that our model is not really a "model": predicting approval above $75K is wrong about 36 percent of the time, and predicting denial below $75K is wrong about 51 percent of the time. Therefore, with this preliminary result in mind, we move on to exploring machine learning models using Scikit-learn below.
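
To make the error rate of such a threshold rule concrete, here is a minimal sketch on a small made-up dataframe (the values are illustrative, not HMDA data, and the helper `threshold_rule_accuracy` is hypothetical). It predicts approval whenever income exceeds the cutoff and reports how often that prediction matches `outcome_flg`:

```python
import pandas as pd


def threshold_rule_accuracy(df, cutoff=75000):
    """Share of rows where 'income > cutoff' matches the observed outcome flag."""
    predicted = (df["hud_med_fm_inc"] > cutoff).astype(int)
    return float((predicted == df["outcome_flg"]).mean())


# Toy data for illustration only.
toy = pd.DataFrame({
    "outcome_flg":    [0, 0, 1, 1, 1, 0, 1, 0],
    "hud_med_fm_inc": [53700, 53700, 77500, 77500, 77500, 80000, 60000, 40000],
})

accuracy = threshold_rule_accuracy(toy)
print(f"Threshold rule accuracy on the toy data: {accuracy:.3f}")
```

Scoring the rule over all rows gives one overall number to compare against the scikit-learn models, instead of one proportion per income group.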

**Documentation:**

(1) From CCPE Machine Learning Labs: 
- Scikit-Learn is a powerful machine learning library implemented in Python with numeric and scientific computing powerhouses Numpy, Scipy, and matplotlib for extremely fast analysis of small to medium-sized data sets. It is open source, commercially usable and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a data scientist's toolkit for machine learning of incoming data sets. 
- Scikit-learn will expect numeric values and no blanks.  

(2) See Further Resources & Works Cited Links at the bottom.

---

## 01. Using scikit-learn -- *Linear Regression*

**Documentation:**  
(1) https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html  
(2) https://statisticalhorizons.com/linear-vs-logistic  
(3) model_selection: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html  


**First,** we need to do a bit more wrangling because Scikit-learn will expect numeric values and no blanks. For this model, I will leverage ```PostgreSQL``` to do the wrangling in lieu of pure Python (for demonstration and integration purposes).

In [9]:
# Categorizing predicted Y var (loan outcome)  & other integer vars (integer encoding) using postgreSQL and pandas
df_2017_en = psql.read_sql_query ('''SELECT 
                                            CAST (CASE 
                                                      WHEN tr17.outcome_bucket = 'Denied, Not Accepted, or Withdrawn' THEN 0 ELSE 1 
                                                  END As INT
                                                 ) 
                                            As outcome_flg,
                                            tr17.year As loan_app_yr,
                                            -- BELOW: indicator variables (i.e. dummies for agency)
                                            CAST(CASE WHEN agency = 'Office of the Comptroller of the Currency'   THEN 1  ELSE 0  END AS INT) As OCC_flg,
                                            CAST(CASE WHEN agency = 'Federal Deposit Insurance Corporation'       THEN 1  ELSE 0  END As INT) As FDIC_flg,
                                            CAST(CASE WHEN agency = 'National Credit Union Administration'        THEN 1  ELSE 0  END AS INT) As NCUA_flg,
                                            CAST(CASE WHEN agency = 'Department of Housing and Urban Development' THEN 1  ELSE 0  END As INT) AS DHUD_flg,
                                            CAST(CASE WHEN agency = 'Consumer Financial Protection Bureau'        THEN 1  ELSE 0  END AS INT) AS CFPB_flg,
                                            --*
                                            -- BELOW: indicator variables (i.e. dummies for states) to categorize by U.S. geo region
                                                      --* note: see Issue in Loan-Canoe repo for notes and development on transforming states/msa/counties
                                            CAST(CASE WHEN lower(tr17.state) In ('new york', 'new jersey', 'connecticut', 'massachusetts', 'pennsylvania') 
                                                          THEN 1 ELSE 0 END AS INT) 
                                            As northeast_flg,
                                            CAST(CASE WHEN lower(tr17.state) In ('colorado', 'new mexico', 'arizona', 'utah', 'california') 
                                                          THEN 1 ELSE 0 END AS INT) 
                                            As west_flg,
                                            CAST(CASE WHEN lower(tr17.state) In ('iowa', 'ohio', 'michigan', 'wisconsin', 'north dakota', 'nebraska',
                                                                                 'kansas') 
                                                          THEN 1 ELSE 0 END AS INT) 
                                            As midwest_flg,
                                            CAST(CASE WHEN lower(tr17.state) In ('mississippi', 'west virginia', 'south carolina', 'arkansas', 'missouri',
                                                                                 'kentucky', 'florida', 'virginia', 'georgia', 'texas') 
                                                          THEN 1 ELSE 0 END AS INT) 
                                            As south_flg,
                                            CAST(CASE WHEN lower(tr17.state) = 'alaska'  THEN 1 ELSE 0 END AS INT) As alaska_flg,
                                            CAST(CASE WHEN lower(tr17.state) = 'hawaii'  THEN 1 ELSE 0 END AS INT) As hawaii_flg,
                                                      --> note: non-continental states are put as stand alone dummies, but could've put one dummy (i.e. 'other')
                                            --*
                                            -- BELOW: indicator variables (i.e. dummies for loan type)
                                            CAST(CASE WHEN lower(tr17.ln_type) = 'fsa/rhs-guaranteed' THEN 1 ELSE 0 END AS INT) As fsa_rhs_guarnt_flg,
                                            CAST(CASE WHEN lower(tr17.ln_type) = 'fha-insured'        THEN 1 ELSE 0 END AS INT) As fha_insured_flg,
                                            CAST(CASE WHEN lower(tr17.ln_type) = 'conventional'       THEN 1 ELSE 0 END AS INT) As conventional_flg,
                                            CAST(CASE WHEN lower(tr17.ln_type) = 'va-guaranteed'      THEN 1 ELSE 0 END AS INT) As va_guarnt_flg,
                                            --*
                                            -- BELOW: indicator variables (i.e dummies for loan purpose)
                                            CAST(CASE WHEN lower(tr17.ln_purp) = 'home purchase'    THEN 1 ELSE 0 END AS INT) As home_purch_flg,
                                            CAST(CASE WHEN lower(tr17.ln_purp) = 'home improvement' THEN 1 ELSE 0 END AS INT) As home_improv_flg,
                                            CAST(CASE WHEN lower(tr17.ln_purp) = 'refinancing'      THEN 1 ELSE 0 END AS INT) AS refinance_flg,
                                            --*
                                            -- BELOW: categorical int assignment for continuous variables
                                            CASE 
                                                WHEN tr17.ln_amt_000s <= 200  THEN 1 --> category label: low
                                                WHEN tr17.ln_amt_000s <= 500  THEN 2 --> category label: medium
                                                WHEN tr17.ln_amt_000s  > 500  THEN 3 --> category label: high
                                                    --> note: we bin this continuous variable into integer categories to use as an OLS regressor.
                                            END ln_amt_000s_cat,
                                            CASE 
                                                WHEN tr17.hud_med_fm_inc/1000 <= 45   THEN 1 --> category label: low
                                                WHEN tr17.hud_med_fm_inc/1000 <= 75   THEN 2 --> category label: medium
                                                WHEN tr17.hud_med_fm_inc/1000 <= 100  THEN 3 --> category label: high
                                                    --> note: same binning as above; with no ELSE branch, values beyond the last threshold would come back as NULL.
                                            END hud_med_fm_inc_000s_cat,
                                            --* 
                                            tr17.ln_amt_000s, tr17.hud_med_fm_inc
                                            FROM aa__testing.loans_2017__training tr17'''
                                     , cnx)
df_2017_en.head(5)

Unnamed: 0,outcome_flg,loan_app_yr,occ_flg,fdic_flg,ncua_flg,dhud_flg,cfpb_flg,northeast_flg,west_flg,midwest_flg,...,fha_insured_flg,conventional_flg,va_guarnt_flg,home_purch_flg,home_improv_flg,refinance_flg,ln_amt_000s_cat,hud_med_fm_inc_000s_cat,ln_amt_000s,hud_med_fm_inc
0,0,2017,0,0,0,1,0,0,0,1,...,0,1,0,0,0,1,1,2,140,53700
1,0,2017,0,0,0,1,0,0,0,1,...,0,1,0,0,0,1,1,2,140,53700
2,1,2017,0,0,0,1,0,0,0,0,...,0,1,0,1,0,0,1,3,124,77500
3,1,2017,0,0,0,1,0,0,0,0,...,0,1,0,1,0,0,1,3,124,77500
4,1,2017,0,0,0,1,0,0,0,0,...,0,1,0,1,0,0,1,3,124,77500
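
For comparison, the same kind of indicator (dummy) encoding and binning can be sketched in pandas rather than SQL. This is an illustration on a tiny made-up frame that borrows the column names `ln_type`, `ln_purp`, and `ln_amt_000s` from the training table. Note that the generated column names differ from the hand-written `_flg` names in the query:

```python
import pandas as pd

toy = pd.DataFrame({
    "ln_type": ["Conventional", "FHA-insured", "Conventional", "VA-guaranteed"],
    "ln_purp": ["Refinancing", "Home purchase", "Home purchase", "Home improvement"],
    "ln_amt_000s": [140, 124, 310, 650],
})

# One 0/1 column per category level.
dummies = pd.get_dummies(toy[["ln_type", "ln_purp"]], dtype=int)

# Bin the continuous loan amount into 1 (low), 2 (medium), 3 (high),
# mirroring the SQL CASE thresholds of 200 and 500.
toy["ln_amt_000s_cat"] = pd.cut(
    toy["ln_amt_000s"],
    bins=[-float("inf"), 200, 500, float("inf")],
    labels=[1, 2, 3],
).astype(int)

encoded = pd.concat([dummies, toy[["ln_amt_000s_cat"]]], axis=1)
print(encoded)
```

Doing the encoding in SQL keeps the logic next to the data. The pandas version is shorter, but it only creates columns for levels that actually appear in the frame.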


### **OLS Linear Regression -- Key Points**
*Note: These are in my own words, so that they may be pulled into our final report.*  

**OLS Linear Regression:** An OLS (Ordinary Least Squares) regression is a statistical estimation technique that chooses the regressors' coefficients -- notated as β -- so as to minimize the sum of the squared residuals.

Put another way, we use this technique to estimate the coefficients of the regressors in our model, generating a function that allows us to determine how likely a mortgage loan application is to be approved given the set of variables from our data. 
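
As a small illustration of what "minimizing the sum of squared residuals" means, the sketch below fits OLS on synthetic data with NumPy's least-squares solver and compares the result with scikit-learn's `LinearRegression`. The data and variable names are made up for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two synthetic regressors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is estimated as the first coefficient.
X_design = np.column_stack([np.ones(len(X)), X])

# lstsq returns the beta that minimizes ||y - X_design @ beta||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

sk_model = LinearRegression().fit(X, y)
sk_beta = np.concatenate([[sk_model.intercept_], sk_model.coef_])

print("NumPy beta:       ", beta.round(3))
print("scikit-learn beta:", sk_beta.round(3))
```

Both routes solve the same least-squares problem, so the two coefficient vectors should agree up to numerical precision.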

**OLS has 7 classical assumptions:**
- The regression model is linear, correctly specified, and has an additive error term;
- The error term has a zero population mean;
- All explanatory variables are uncorrelated with the error term;
- Observations of the error term are uncorrelated with each other;
- The error term has a constant variance;
- No explanatory variable is a perfect linear function of any other explanatory variable;
- The error term is normally distributed.

--

**Next,** I prepare the dataset - I will use 80% of the data to train our model and 20% of our data to evaluate our model:
- Training dataset - used to train our model;
- Testing dataset - used to test if our model is making accurate predictions.

In [10]:
# Using pandas to generate an aggregate function that outputs a more detailed view of what data is missing.

total = df_2017_en.isnull().sum().sort_values(ascending=False)
percent_1 = df_2017_en.isnull().sum()/df_2017_en.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])

missing_data.head()

Unnamed: 0,Total,%
hud_med_fm_inc,0,0.0
ln_amt_000s,0,0.0
loan_app_yr,0,0.0
occ_flg,0,0.0
fdic_flg,0,0.0


###### Note: Only after confirming there is no missing data can we move on to regressions (alternatively, we can use dropna or fillna first).
 --
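
If a missing-data check like the one above had turned up gaps, two common pandas options are dropping incomplete rows or filling the gaps. Here is a minimal sketch on made-up data. Filling with the column median is one illustrative choice, not something taken from the original analysis:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ln_amt_000s": [140.0, np.nan, 124.0, 310.0],
    "hud_med_fm_inc": [53700.0, 77500.0, np.nan, 77500.0],
})

# Option 1: drop any row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill each column's gaps with that column's median.
filled = toy.fillna(toy.median())

print("rows kept after dropna:", dropped.shape[0])
print("missing values after fillna:", int(filled.isnull().sum().sum()))
```

Dropping rows shrinks the sample, while filling keeps every row but adds values that were never observed, so the better choice depends on how much data is missing.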

#### ==> *Array Slicing to Split Data -- Training and Testing Sets:*

In [43]:
# Generating arrays and performing array slicing to split the data -- note: numpy column counter begins at 0, not 1:

array = df_2017_en.values
X = array[:,2:22] #--> keep all rows, but take only the columns at index 2 through 21 (the slice end, 22, is excluded);
Y = array[:,0]    #--> keep all rows, but take only the first column (index 0 => which is the outcome_flg);

# Splitting our data for model selection with a random_state seed of 20:
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=20)
print('--')

--


#### ==> *Running Linear Regressions on the Sets of Data -- Results for R^2 and Accuracy Scores*

In [37]:
# Initializing our linear regression algorithm:

reg_model = LinearRegression().fit(X, Y)

print(f'The R-squared score is {reg_model.score(X, Y).round(3)}')
print('')
print(f'The Beta coefficients are = {reg_model.coef_}')


The R-squared score is 0.177

The Beta coefficients are = [-3.40114290e-01 -2.33078045e-01 -2.57810343e-01 -2.54730842e-01
 -2.28321693e-01  4.04647145e-02  4.95488649e-01  3.39859763e-01
  2.33051664e-01  1.56906645e+08 -5.69517582e+08  1.36412326e+11
  1.36412326e+11  1.36412326e+11  1.36412326e+11 -9.32750612e+10
 -9.32750612e+10 -9.32750612e+10  1.65446433e-01  1.64819716e-01]


In [112]:
# Training Linear Regression Model using our training set of the data:

reg_model_train = LinearRegression().fit(x_train, y_train)
xa=reg_model_train.coef_
xa
#print(f'The R-squared score for our training set of the data is {reg_model_train.score(x_train, y_train).round(3)}')
#print('')

array([-3.26489482e-01, -2.48082053e-01, -2.59372162e-01, -2.55841991e-01,
       -2.26344087e-01,  3.96031660e-02,  4.84900014e-01,  3.40648463e-01,
        2.32903098e-01, -2.91312067e+09, -6.98210176e+09,  7.63277548e+11,
        7.63277548e+11,  7.63277548e+11,  7.63277548e+11,  2.50148551e+12,
        2.50148551e+12,  2.50148551e+12,  1.66499531e-01,  1.65527177e-01])

In [135]:
# Now that we trained our model, we view the Beta terms by pairing each feature name with its coefficient

col_x = list(df_2017_en.columns[2:22])
cf_x = reg_model_train.coef_

for colu, coeff in zip(col_x, cf_x):
    print('The Beta coefficient of {} is {}'.format(colu, coeff))

print(' ')

The Beta coefficient of occ_flg is -0.32648948197896266
The Beta coefficient of fdic_flg is -0.24808205335155292
The Beta coefficient of ncua_flg is -0.25937216167892707
The Beta coefficient of dhud_flg is -0.2558419906832628
The Beta coefficient of cfpb_flg is -0.22634408744663834
The Beta coefficient of northeast_flg is 0.03960316595349022
The Beta coefficient of west_flg is 0.4849000140224471
The Beta coefficient of midwest_flg is 0.3406484632601666
The Beta coefficient of south_flg is 0.2329030976711811
The Beta coefficient of alaska_flg is -2913120666.2475586
The Beta coefficient of hawaii_flg is -6982101758.6814575
The Beta coefficient of fsa_rhs_guarnt_flg is 763277548038.2003
The Beta coefficient of fha_insured_flg is 763277548037.8229
The Beta coefficient of conventional_flg is 763277548037.742
The Beta coefficient of va_guarnt_flg is 763277548037.778
The Beta coefficient of home_purch_flg is 2501485514286.323
The Beta coefficient of home_impr

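*Note:* when displayed like this, we can immediately tell that some of these coefficients make absolutely no sense: the Alaska, Hawaii, loan-type, and loan-purpose flags carry coefficients in the billions or trillions, and the values are nearly identical within the loan-type and loan-purpose groups. This suggests OLS may not be a good model (at least not for our data when set up with these features).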
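
One likely explanation for coefficients of this size is the dummy-variable trap: when every level of a categorical variable gets its own indicator, the indicators sum to one in every row and duplicate the intercept, which conflicts with the "no perfect linear function" assumption listed earlier. The sketch below uses synthetic data (not the HMDA table, with made-up level names) to compare the rank of the design matrix when all dummies are kept against the rank when one level per group is dropped:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "ln_type": rng.choice(["conventional", "fha", "va", "fsa_rhs"], size=100),
    "ln_purp": rng.choice(["purchase", "improvement", "refinance"], size=100),
})


def design_rank(dummies):
    """Return (rank, number of columns) of the design matrix with an intercept."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(design)), design.shape[1]


full = pd.get_dummies(toy, dtype=int)                       # every level kept
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)   # one level dropped per group

print("all dummies:     rank %d of %d columns" % design_rank(full))
print("drop_first=True: rank %d of %d columns" % design_rank(reduced))
```

When the rank is lower than the number of columns, the coefficients are not uniquely determined, and a solver can return huge offsetting values. Dropping one reference level per group is a common remedy. Whether it fully explains the output above would need to be checked against the actual data.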

#### => ***Accuracy Score***

In [31]:
model = LinearRegression()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(f'The accuracy score is {accuracy_score(y_test, predictions.round())}')

The accuracy score is 0.6575538100581455


---

**Result:** So our unadjusted R-squared scores are weak, *and* we get an accuracy score of 0.66 for our test-set predictions. Additionally, before we rounded the predictions in the accuracy-score code cell above, we got a classification error (see Further Resources below), hinting that OLS is a poor classifier and, in this case especially, will very likely not separate the classes correctly.  
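
The difficulty with scoring OLS output directly is that `accuracy_score` compares class labels, while `LinearRegression.predict` returns continuous values. Here is a minimal sketch on synthetic data, assuming a 0.5 cutoff (an illustrative choice). It uses an explicit threshold as an alternative to `.round()`, which can produce labels other than 0 and 1 when a prediction strays far outside the 0-to-1 range (for example, 1.7 rounds to 2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
# Binary target driven by the first feature plus noise (synthetic).
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LinearRegression().fit(X[:400], y[:400])
raw_predictions = model.predict(X[400:])        # continuous, not restricted to [0, 1]

# Explicit cutoff: anything at or above 0.5 counts as "approved" (1).
labels = (raw_predictions >= 0.5).astype(int)

print("Distinct predicted labels:", np.unique(labels))
print("Accuracy:", accuracy_score(y[400:], labels))
```

An explicit threshold guarantees that only the two valid classes are scored, and it makes the cutoff a visible modeling decision.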

**Further Resources:**  
(1) Stack Overflow - https://stackoverflow.com/questions/38015181/accuracy-score-valueerror-cant-handle-mix-of-binary-and-continuous-target