# HMDA Data (Cont'd) -- ML Logistic Regression Model & Naive Bayes Classifiers

## Using ML with ```scikit-learn``` for a Logistic Regression Model
This notebook explores the Home Mortgage Disclosure Act (HMDA) data for one year -- 2017. We use concepts and tools from our own research and further readings to build a machine learning logistic regression model, along with Naive Bayes classifiers, to predict loan approval outcomes.

*Note that as of July 12, 2019, HMDA data is publicly available for 2007 - 2017.  
https://www.consumerfinance.gov/data-research/hmda/explore

Documentation:
--> *add links here*

*There are many learning sources and prior work around similar topics. We draw inspiration from past cohorts as well as learning materials from peer sources such as Kaggle and Towards Data Science.*

---

## Importing Libraries and Loading the Data

First, we need to import all the libraries we are going to utilize throughout this notebook. We import everything at the very top of this notebook for order and best practice.

In [9]:
# Importing Libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

import os
import psycopg2
import pandas.io.sql as psql
import sqlalchemy
from sqlalchemy import create_engine

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from scipy import stats
from pylab import*
from matplotlib.ticker import LogLocator

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

*-----*
 
Second, we establish the connection to the AWS PostgreSQL Relational Database System.

In [10]:
# Postgres settings (host, port, username, password, and database name) -- we define variables and format them into a connection string that the engine can use.
postgres_host = 'aws-pgsql-loan-canoe.cr3nrpkvgwaj.us-east-2.rds.amazonaws.com'  
postgres_port = '5432' 
postgres_username = 'reporting_user' 
postgres_password = 'team_loan_canoe2019'
postgres_dbname = "paddle_loan_canoe"
postgres_str = ('postgresql://{username}:{password}@{host}:{port}/{dbname}'
                .format(username = postgres_username,
                        password = postgres_password,
                        host = postgres_host,
                        port = postgres_port,
                        dbname = postgres_dbname)
               )

# Creating the connection.
cnx = create_engine(postgres_str)
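
A possible alternative to hardcoding the credentials in the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`), the helper `build_postgres_url`, and the placeholder values are assumptions, not part of the original setup.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=None):
    """Assemble a PostgreSQL URL from environment-style settings.

    The variable names used here are hypothetical; adjust them to
    whatever your deployment actually defines.
    """
    env = os.environ if env is None else env
    username = env["PGUSER"]
    # Percent-encode the password so characters such as '@' or '/'
    # do not break the URL structure.
    password = quote(env["PGPASSWORD"], safe="")
    host = env["PGHOST"]
    port = env.get("PGPORT", "5432")
    dbname = env["PGDATABASE"]
    return f"postgresql://{username}:{password}@{host}:{port}/{dbname}"


# Placeholder values for illustration only (not real credentials).
example_env = {
    "PGUSER": "reporting_user",
    "PGPASSWORD": "p@ss/word",
    "PGHOST": "localhost",
    "PGDATABASE": "example_db",
}
example_url = build_postgres_url(example_env)
print(example_url)
```

The resulting string could then be passed to `create_engine` in the same way as `postgres_str` above.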

*-----*
 
Last, we use pandas to read the table that the wrangling script wrote to SQL from a pandas DataFrame (see script: HMDA DATA Summ. Stats. for more).

In [11]:
#  Reading the table that the wrangling script wrote with to_sql -- note: the data is already wrangled, so we simply SELECT *.

    # --> extracting everything from our training data into a new dataframe; index_col plus reset_index() moves 'outcome' to the first column.
train_2017 = psql.read_sql_query ('SELECT * FROM aa__testing.loans_2017__training', cnx, index_col='outcome').reset_index()

    # --> using pandas to view the first 5 rows for a quick spot-check.
train_2017.head(5)


Unnamed: 0,outcome,year,dn_reason1,agency,state,county,ln_type,ln_purp,ln_amt_000s,hud_med_fm_inc,pop,rt_spread,outcome_bucket,prc_blw_hs__2013_17_5yravg,prc_hs__2013_17_5yravg,prc_ba_plus__2013_17_5yravg,r_birth_2017,r_intl_mig_2017,r_natural_inc_2017
0,Application denied by financial institution,2017,Credit application incomplete,Department of Housing and Urban Development,Michigan,Genesee County,Conventional,Refinancing,140,53700,8791,0.0,"Denied, Not Accepted, or Withdrawn",9,36,21,10,0,-1
1,Application denied by financial institution,2017,Credit application incomplete,Department of Housing and Urban Development,Michigan,Genesee County,Conventional,Refinancing,140,53700,8791,0.0,"Denied, Not Accepted, or Withdrawn",10,32,20,10,0,-1
2,Loan purchased by the institution,2017,,Department of Housing and Urban Development,Illinois,Cook County,Conventional,Home purchase,124,77500,4316,0.0,Approved or Loan Purchased by the Institution,4,22,40,9,3,1
3,Loan purchased by the institution,2017,,Department of Housing and Urban Development,Illinois,Cook County,Conventional,Home purchase,124,77500,4316,0.0,Approved or Loan Purchased by the Institution,14,24,37,9,3,1
4,Loan purchased by the institution,2017,,Department of Housing and Urban Development,Illinois,Cook County,Conventional,Home purchase,124,77500,4316,0.0,Approved or Loan Purchased by the Institution,21,37,15,9,3,1


---

### 00 - A Simplistic Model Based Purely on One Feature (note: this is not a regression)

Just to test this out, let's look at a simple threshold rule (not an OLS regression) that predicts purely on median household income (the `hud_med_fm_inc` column).

In [12]:
# Encoding the predicted Y var (loan outcome) as a 0/1 integer flag in PostgreSQL, and loading it with the median income column via pandas.
train_2017_en = psql.read_sql_query ('''SELECT CAST( CASE 
                                                         WHEN tr17.outcome_bucket = 'Denied, Not Accepted, or Withdrawn' 
                                                             THEN 0 ELSE 1 
                                                     END As INT
                                                  ) As outcome_flg,
                                                tr17.hud_med_fm_inc
                                        FROM aa__testing.loans_2017__training tr17'''
                                     , cnx)
train_2017_en.head(5)

Unnamed: 0,outcome_flg,hud_med_fm_inc
0,0,53700
1,0,53700
2,1,77500
3,1,77500
4,1,77500
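
The same 0/1 encoding could also be done in pandas after loading the data, instead of in the SQL `CASE` expression. This is a minimal sketch on a hand-made toy frame (the `toy` DataFrame and the `DENIED` constant are illustrative assumptions; only the column names and bucket labels come from the table above).

```python
import pandas as pd

# Toy rows that mimic two columns of the training table.
toy = pd.DataFrame({
    "outcome_bucket": [
        "Denied, Not Accepted, or Withdrawn",
        "Approved or Loan Purchased by the Institution",
        "Denied, Not Accepted, or Withdrawn",
    ],
    "hud_med_fm_inc": [53700, 77500, 53700],
})

DENIED = "Denied, Not Accepted, or Withdrawn"

# 0 for the denied bucket, 1 for everything else (same logic as CASE ... ELSE 1).
toy["outcome_flg"] = (toy["outcome_bucket"] != DENIED).astype(int)

encoded = toy[["outcome_flg", "hud_med_fm_inc"]]
print(encoded)
```

As with the SQL version, any bucket other than the denied one (including a missing value) ends up as 1, so it may be worth checking the distinct bucket values first.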


In [13]:
# Setting some "modeling" variables.
loans_total = train_2017_en.shape[0] 
loans_approved = len(train_2017_en[train_2017_en.outcome_flg == 1])

# Generating what proportion of the loan applications were approved.
proportion_approved = float(loans_approved) /loans_total
print('The proportion of loan applications approved is %s.' % proportion_approved)

The proportion of loan applications approved is 0.5450705949563373.


In [20]:
# Next step: What is a simple way to determine what proportion of the loans were approved at certain thresholds?

    # --> starting by separating median household incomes into categories
greater_than_75K = train_2017_en[train_2017_en.hud_med_fm_inc > 75000]
less_than_75K = train_2017_en[train_2017_en.hud_med_fm_inc < 75000]

    # --> determining what proportion of loans were approved for hud med inc > $75K
proportion_greater_than_75K = float(len(greater_than_75K[greater_than_75K.outcome_flg == 1])) / len(greater_than_75K)
print('The proportion of loan applications approved -- for households with median incomes > $75K -- is %s.' % proportion_greater_than_75K)

    # --> determining what proportion of loans were approved for hud med inc < $75K    
proportion_less_than_75K = float(len(less_than_75K[less_than_75K.outcome_flg == 1])) / len(less_than_75K)
print('The proportion of loan applications approved -- for households with median incomes < $75K -- is %s.' % proportion_less_than_75K)

The proportion of loan applications approved -- for households with median incomes > $75K -- is 0.6384521231522455.
The proportion of loan applications approved -- for households with median incomes < $75K -- is 0.5069824086603518.
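
Because `outcome_flg` is a 0/1 flag, its mean is the approval rate, so the proportions above could also be computed with a `groupby`. The sketch below uses made-up toy data, not the HMDA table. Unlike the cell above, it assigns incomes of exactly $75,000 to the lower band rather than dropping them; that choice is an assumption.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "outcome_flg":    [0, 0, 1, 1, 1, 0],
    "hud_med_fm_inc": [53700, 53700, 77500, 77500, 77500, 80000],
})

# Overall approval rate: the mean of a 0/1 flag.
overall_rate = toy["outcome_flg"].mean()

# Label each row by income band, then average the flag within each band.
income_band = toy["hud_med_fm_inc"].gt(75000).map({True: "> $75K", False: "<= $75K"})
rate_by_band = toy.groupby(income_band)["outcome_flg"].mean()

print(overall_rate)
print(rate_by_band)
```

This avoids repeating the filter-and-count code for every threshold and extends naturally to more than two bands.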


*Result*: This simplistic "model" lets us know that households with median incomes ```greater than $75K``` are noticeably more likely to get their loan applications approved (63.8 percent versus 50.7 percent). We could state our model as:

- if median HH income > 75K => loan approved (63.8 percent of these applications were approved)     
- if median HH income < 75K => loan not likely approved (50.7 percent of these applications were still approved)  

But this means that our model is not really a "model": the first rule is wrong about 36 percent of the time, and the second is wrong about half the time, because 50.7 percent of the lower-income applications were approved anyway. Therefore, with this preliminary result in mind, we move on to exploring machine learning models using scikit-learn below:
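
To put a rough number on how often the threshold rule would be right overall, the three printed proportions can be combined. This is a back-of-the-envelope sketch, not a result from the notebook: it assumes every application falls into exactly one of the two income groups (no rows at exactly $75,000 and no missing incomes), which the cells above do not verify.

```python
# Proportions printed by the cells above.
p_all = 0.5450705949563373  # approved, all applications
p_hi = 0.6384521231522455   # approved, median income > $75K
p_lo = 0.5069824086603518   # approved, median income < $75K

# Assumption: the two groups cover all applications, so
#   p_all = w_hi * p_hi + (1 - w_hi) * p_lo
# and we can solve for the share of applications in the > $75K group.
w_hi = (p_all - p_lo) / (p_hi - p_lo)

# Rule: predict "approved" above $75K and "not approved" below it.
rule_accuracy = w_hi * p_hi + (1 - w_hi) * (1 - p_lo)

# Baseline: always predict "approved".
baseline_accuracy = p_all

print(f"share of applications above $75K: {w_hi:.3f}")
print(f"threshold rule accuracy:          {rule_accuracy:.3f}")
print(f"always-approve baseline accuracy: {baseline_accuracy:.3f}")
```

Under that assumption, roughly 29 percent of applications fall in the higher-income group and the rule is right about 53.5 percent of the time, slightly below the 54.5 percent obtained by always predicting approval. That comparison is one more reason to treat the rule as a sanity check rather than a model.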

**Documentation:**

(1) From CCPE Machine Learning Labs: 
- Scikit-Learn is a powerful machine learning library implemented in Python with numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib for extremely fast analysis of small to medium-sized data sets. It is open source, commercially usable and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason, Scikit-Learn is often the first tool in a data scientist's toolkit for machine learning of incoming data sets. 
- Scikit-learn will expect numeric values and no blanks  

(2) See Further Resources & Works Cited Links at the bottom.
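
The "numeric values and no blanks" requirement can be illustrated with a small sketch: fill the blanks, encode the text column as integers, and only then fit a model. Everything below is a toy example under stated assumptions. The data are made up, the fill strategies (an `"Unknown"` label and the column median) are arbitrary choices, and the derived columns `ln_purp_code` and `hud_med_fm_inc_000s` are hypothetical names, not columns in the HMDA table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy data with one text column and some blanks.
toy = pd.DataFrame({
    "ln_purp": ["Refinancing", "Home purchase", None,
                "Home purchase", "Refinancing", "Home purchase"],
    "hud_med_fm_inc": [53700, 77500, 77500, np.nan, 53700, 77500],
    "outcome_flg": [0, 1, 1, 1, 0, 1],
})

# Step 1: remove blanks (the fill strategies here are assumptions).
toy["ln_purp"] = toy["ln_purp"].fillna("Unknown")
toy["hud_med_fm_inc"] = toy["hud_med_fm_inc"].fillna(toy["hud_med_fm_inc"].median())

# Step 2: turn text into integers; LabelEncoder assigns codes in sorted order.
toy["ln_purp_code"] = LabelEncoder().fit_transform(toy["ln_purp"])

# Step 3: express income in thousands to keep feature scales moderate.
toy["hud_med_fm_inc_000s"] = toy["hud_med_fm_inc"] / 1000

X = toy[["ln_purp_code", "hud_med_fm_inc_000s"]]
y = toy["outcome_flg"]

# Every feature is now numeric with no missing values.
assert not X.isna().any().any()

model = LogisticRegression(max_iter=1000).fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Integer codes imply an ordering between categories that may not be meaningful, so one-hot encoding (for example with `pd.get_dummies`) is a common alternative for nominal features such as loan purpose.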

---

## 01 - Using scikit-learn: Linear Regression 

**Documentation:**  
(1) https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html  
(2) https://statisticalhorizons.com/linear-vs-logistic  


In [None]:
# Listing the distinct agency values for now; the integer encoding of the loan outcome and agency is drafted in the commented-out block below.
train_2017_en = psql.read_sql_query ('''SELECT Distinct tr17.agency
                                              /*CAST (CASE 
                                                          WHEN tr17.outcome_bucket = 'Denied, Not Accepted, or Withdrawn' THEN 0 ELSE 1 
                                                      END As INT
                                                     ) As outcome_flg,
                                                CAST (CASE
                                                          WHEN agency = 'Office of the Comptroller of the Currency' THEN 1
                                                          WHEN agency = 'Federal Deposit Insurance Corporation' THEN 2
                                                          WHEN agency = 'National Credit Union Administration' THEN 3
                                                          WHEN agency = 'Department of Housing and Urban Development' THEN 4
                                                          WHEN agency = 'Consumer Financial Protection Bureau' THEN 5
                                                     END As INT
                                                     ) As agency_flg
                                                     */
                                        FROM aa__testing.loans_2017__training tr17'''
                                     
                                     , cnx)
train_2017_en.head()
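
The agency encoding drafted in the commented-out `CASE` expression could also be written as a dictionary lookup in pandas. The sketch below reuses the five agency codes from that draft. The toy rows, including the "Federal Reserve System" value, are assumptions added to show what happens to an agency that the mapping does not cover.

```python
import pandas as pd

# Codes copied from the commented-out CASE expression above.
AGENCY_CODES = {
    "Office of the Comptroller of the Currency": 1,
    "Federal Deposit Insurance Corporation": 2,
    "National Credit Union Administration": 3,
    "Department of Housing and Urban Development": 4,
    "Consumer Financial Protection Bureau": 5,
}

# Toy rows for illustration; the last value is not in the mapping.
toy = pd.DataFrame({"agency": [
    "Department of Housing and Urban Development",
    "Consumer Financial Protection Bureau",
    "Federal Reserve System",
]})

# Unmapped agencies become NaN, just as a CASE without ELSE yields NULL.
toy["agency_flg"] = toy["agency"].map(AGENCY_CODES)

# List any agencies that would leave a blank in the encoded column.
unmapped = toy.loc[toy["agency_flg"].isna(), "agency"].unique().tolist()
print(toy)
print("Unmapped agencies:", unmapped)
```

Since scikit-learn expects no blanks, checking `unmapped` against the output of the `SELECT DISTINCT` query above is one way to confirm the mapping is complete before modeling.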