# Loan Approval Prediction
Edmund Walsh - May 10th, 2020

## Introduction
This project examines data collected under the Home Mortgage Disclosure Act (HMDA), which requires mortgage lenders in the United States to disclose information about the mortgage lending decisions they have made. Specifically, we will predict whether an application will be approved or denied.

This notebook is part of a short 5-day project whose purpose goes beyond predicting mortgage approvals. The focus is on the process and the end-to-end engineering, from raw data to results and presentation. This notebook covers the data science approach and follows the roadmap illustrated in the image below:

<img src='https://miro.medium.com/max/1400/1*LoBdYL_YyIcYJ842peLDpQ.jpeg'
     alt='Data Science & Data Engineering'
     style='height: 350px; width: 750px;' />

## A Ground Up Approach
The pyramid above is a good illustration because the process truly builds step by step
towards the ultimate goal of finding results that are actionable and impactful in the real world.
The end results get all of the attention, but a project is unlikely to succeed without these strong foundations.

## Context 
While this project request didn't specifically state 'why' we are looking into this data, I will work within the context of three important rationales.
1. Mortgage due diligence is expensive and time intensive. A model that can reliably expedite it will save lenders significant time and resources.
2. From a regulatory perspective, and especially as machine intelligence becomes a larger and more common part of this process, it is important for us to be aware of and highlight any bias.
3. Some financial institutions may be more or less likely to issue mortgages, which may reflect either over- or under-utilization of the balance sheet or their risk appetite.

### Preparation
Before digging in, let's install our Python requirements and follow the instructions for setting up our Docker environment and tools in the [README.md](file:///../README.md).

In [10]:
pip install --user -r ./misc/requirements.txt

Collecting pandas==0.24.1
  Using cached pandas-0.24.1-cp36-cp36m-manylinux1_x86_64.whl (10.1 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 0.25.1
    Uninstalling pandas-0.25.1:
      Successfully uninstalled pandas-0.25.1
Successfully installed pandas-0.24.1
Note: you may need to restart the kernel to use updated packages.


### Package Import
Now we can begin importing the required packages. Most of these are common Python packages. The exception is `seed`, a set of functions we will use to import our initial data and begin the collect step on our roadmap.

In [24]:
import seed
import pandas as pd
import numpy as np
import os
import config

### A Discussion of the Data
Our data comes from three major sources, and we will use their APIs to download it.
1. Our main dataset comes from the HMDA database and includes not only a large set of information about the individual loan approval decision but also information about the originating institutions.
    a. A full list of available data on the mortgage approvals can be found [here](http://cfpb.github.io/api/hmda/fields.html)
    b. We will also look at the originating institutions; information about that dataset can be found [here](https://api.consumerfinance.gov/data/hmda/slice/institutions/metadata)
2. Our next, complementary set of information comes from the Census Bureau. We will use their County Business Patterns series, which aggregates economic information at the county level. I hope that this information will provide some valuable insight into regional economics that may affect a mortgage approval decision.
    a. Details about this dataset can be found [here](https://www.census.gov/programs-surveys/cbp.html)
    b. This API requires a key which you can sign up for [here](https://api.census.gov/data/key_signup.html)
    c. After you have signed up, please include this key in the [config.py](file:///./config.py) file.
3. Finally, we will also fill in some data from the Federal Reserve Bank of St. Louis. This data will come into use towards the end of the project, when we begin to look across time periods, as it should give us an indication of both the financial conditions and the sentiment of the originating institutions.
    a. Details about this API can be found [here](https://fred.stlouisfed.org/docs/api/fred/)

In [3]:
# after configuring the census API re-import config and check api key
import config
print(config.api_key)

8dadaedad2b940dd8ffff397507286b479540d00
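
Because the key is a credential, printing it in full in a shared notebook exposes it. One alternative, sketched below, is to read the key from an environment variable and display only a masked version when checking it. The variable name `CENSUS_API_KEY` and the helpers `load_api_key` and `mask_key` are hypothetical; they are not part of the project's `config.py`.

```python
import os


def load_api_key(env_var="CENSUS_API_KEY"):
    """Read the key from an environment variable (the variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key


def mask_key(key, visible=4):
    """Keep only the first few characters so the key can be checked without being exposed."""
    if len(key) <= visible:
        return "*" * len(key)
    return key[:visible] + "*" * (len(key) - visible)


# Example with a made-up value rather than a real credential
os.environ["CENSUS_API_KEY"] = "abcdef1234"
print(mask_key(load_api_key()))  # abcd******
```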


### Data Collection
Luckily for us, the designers of the APIs have made this pretty easy. A big thank you to them!

To start, I have selected a single year and a single state. Feel free to change these to your preferences; data is available from 2007 to 2017. One important caveat: because this is an illustrative project only, details such as the availability of data (i.e. when it was published) and data type issues would require more attention in a production environment.

In [4]:
# choose an initial state (two-letter code) and year
init_state = "OH"
init_yr = 2016
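
Since data is only available for 2007 to 2017, it can help to check the selection before starting a long download. The following is a minimal sketch; `validate_selection` is a hypothetical helper, not part of `seed`, and it only checks the shape of the state code, not whether it is a real state.

```python
VALID_YEARS = range(2007, 2018)  # 2007 through 2017 inclusive


def validate_selection(state, year):
    """Normalize the state code and confirm the year is in the available range."""
    state = state.strip().upper()
    if len(state) != 2 or not state.isalpha():
        raise ValueError(f"Expected a two-letter state code, got {state!r}")
    if year not in VALID_YEARS:
        raise ValueError(f"Year must be between 2007 and 2017, got {year}")
    return state, year


init_state, init_yr = validate_selection("oh", 2016)
print(init_state, init_yr)  # OH 2016
```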

In [5]:
# pull data from the HMDA database on mortgage approvals -- this may take a while
data_lar = seed.lar_pull(init_state, init_yr)

In [6]:
# a quick data snapshot
data_lar.head() 

Unnamed: 0,action_taken,action_taken_name,agency_code,agency_abbr,agency_name,applicant_ethnicity,applicant_ethnicity_name,applicant_income_000s,applicant_race_1,applicant_race_2,...,state_name,hud_median_family_income,loan_amount_000s,number_of_1_to_4_family_units,number_of_owner_occupied_units,minority_population,population,rate_spread,tract_to_msamd_income,uuid
0,1,Loan originated,7,HUD,Department of Housing and Urban Development,2,Not Hispanic or Latino,75,5,,...,Ohio,66600,219,3165,2746,9.869999885559082,10439,,148.33999633789062,227c8944-b4ba-4a9d-8c9a-c53871823ac7
1,3,Application denied by financial institution,7,HUD,Department of Housing and Urban Development,2,Not Hispanic or Latino,60,5,,...,Ohio,69100,293,3131,2921,16.459999084472656,8742,,172.69000244140625,6a2a0637-fb67-4be1-a78c-f63838759d7a
2,1,Loan originated,7,HUD,Department of Housing and Urban Development,2,Not Hispanic or Latino,87,5,,...,Ohio,66600,104,2320,1485,4.429999828338623,5508,,89.05000305175781,39eb12fd-5000-40bc-a306-2073a93c18b9
3,1,Loan originated,7,HUD,Department of Housing and Urban Development,2,Not Hispanic or Latino,86,5,,...,Ohio,55400,153,948,809,2.490000009536743,2973,,133.8300018310547,4c29ce4c-ca72-4a0c-ad8a-72e98dc49e1a
4,1,Loan originated,7,HUD,Department of Housing and Urban Development,2,Not Hispanic or Latino,46,5,,...,Ohio,66600,206,2375,2210,5.340000152587891,6044,,126.06999969482422,0b1a339e-6bab-48db-a9d6-6badd47eba94
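
The snapshot shows the two outcomes this project is interested in: `action_taken` 1 ("Loan originated") and 3 ("Application denied by financial institution"). A binary target could be derived from that column roughly as follows. This is a sketch only: mapping every other code to `None` (i.e. excluding it) is an assumption, not a decision stated in this notebook, and `to_target` is a hypothetical helper.

```python
# Only the two outcomes visible in the snapshot are mapped:
# 1 = loan originated (approved), 3 = application denied.
ACTION_TO_TARGET = {1: 1, 3: 0}


def to_target(action_taken):
    """Return 1 for approved, 0 for denied, None for any other code."""
    return ACTION_TO_TARGET.get(action_taken)


sample_actions = [1, 3, 1, 1, 1]  # action_taken values from the five rows shown
targets = [to_target(a) for a in sample_actions]
print(targets)  # [1, 0, 1, 1, 1]
```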


In [19]:
'{:,.0f}'.format(data_lar.shape[0]) + ' total rows  for a total of ' + \
'{:,.0f}'.format(data_lar.shape[0]*data_lar.shape[1]) + ' data points'

'493,271 total rows  for a total of 38,968,409 data points'

In [7]:
# Now pull county business patterns data from the census bureau
census_df = seed.census_pull(init_state, init_yr, data_lar, config.api_key)

In [8]:
# A quick snapshot
census_df.head()

Unnamed: 0,EMP,ESTAB,PAYANN,POP,county_code,county_name,state_abbr,state_code,year
0,3899081,195687,210202807,1250871,35,"Cuyahoga County, Ohio",OH,39,2016
1,3775265,169990,190400631,1138190,49,"Franklin County, Ohio",OH,39,2016
2,466581,35316,20664563,227255,85,"Lake County, Ohio",OH,39,2016
3,201710,14425,7926456,111289,169,"Wayne County, Ohio",OH,39,2016
4,2809075,125607,159909829,782863,61,"Hamilton County, Ohio",OH,39,2016


In [20]:
'{:,.0f}'.format(census_df.shape[0]) + ' total rows  for a total of ' + \
'{:,.0f}'.format(census_df.shape[0]*census_df.shape[1]) + ' data points'

'88 total rows  for a total of 792 data points'
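
For readers curious about what a County Business Patterns request looks like, the sketch below builds a query URL for every county in one state. It is an illustration under assumptions: `build_cbp_url` is a hypothetical helper, the internals of `seed.census_pull` are not shown in this notebook and may differ, and the variable list simply reuses the `EMP`, `ESTAB` and `PAYANN` columns seen in the snapshot. No request is sent.

```python
from urllib.parse import urlencode


def build_cbp_url(year, state_fips, api_key, variables=("EMP", "ESTAB", "PAYANN")):
    """Build a Census Data API URL for County Business Patterns, all counties in a state."""
    params = {
        "get": ",".join(variables),
        "for": "county:*",
        "in": f"state:{state_fips}",
        "key": api_key,
    }
    # Keep commas, colons and asterisks unescaped so the URL stays readable
    return f"https://api.census.gov/data/{year}/cbp?" + urlencode(params, safe=",:*")


# Ohio has state code 39, as shown in the snapshot; DEMO_KEY is a placeholder
url = build_cbp_url(2016, "39", "DEMO_KEY")
print(url)
```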

In [25]:
# Finally, let's pull data about the originating institutions
data_inst = seed.inst_pull(init_state, init_yr)

In [26]:
# A quick snapshot
data_inst.head()

Unnamed: 0,activity_year,respondent_id,agency_code,agency_abbr,agency_name,federal_tax_id,respondent_name,respondent_address,respondent_city,respondent_state,...,parent_state,parent_zip_code,respondent_name_panel,respondent_city_panel,respondent_state_panel,other_lender_code,region_code,validity_error,assets,lar_count
0,2016,46,1,OCC,Office of the Comptroller of the Currency,31-4247738,FIRST NATIONAL BANK OF MCCONNE,"86 N. KENNEBEC AVENUE, PO BOX 208",MCCONNELSVILLE,OH,...,OH,43756.0,FIRST NB,MCCONNELSVILLE,OH,0,3,N,138285,129
1,2016,47,1,OCC,Office of the Comptroller of the Currency,35-0704860,FIRST FINANCIAL BANK NA,1401 S 3RD ST,TERRE HAUTE,IN,...,IN,47807.0,FIRST FNCL BK NA,TERRE HAUTE,IN,0,3,N,2882577,2695
2,2016,86,1,OCC,Office of the Comptroller of the Currency,31-0294798,FIRST NATIONAL BANK OF GERMANT,17 NORTH MAIN STREET,GERMANTOWN,OH,...,,,FIRST NB,GERMANTOWN,OH,0,3,N,52364,38
3,2016,324,1,OCC,Office of the Comptroller of the Currency,23-0916895,FIRST NATIONAL BANK AND TRUST,40 SOUTH STATE ST,NEWTOWN,PA,...,,,FIRST NB&TC NEWTOWN,NEWTOWN,PA,0,1,N,860869,184
4,2016,325,1,OCC,Office of the Comptroller of the Currency,24-0558097,"FNB BANK, NA",354 MILL STREET,DANVILLE,PA,...,PA,17604.0,FNB BK NA,DANVILLE,PA,0,1,N,363285,270


In [27]:
'{:,.0f}'.format(data_inst.shape[0]) + ' total rows  for a total of ' + \
'{:,.0f}'.format(data_inst.shape[0]*data_inst.shape[1]) + ' data points'

'100 total rows  for a total of 2,400 data points'
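
The three size summaries above repeat the same expression. A small helper could replace them; the sketch below uses a hypothetical `size_summary` function that takes a `(rows, columns)` tuple such as a DataFrame's `.shape`. The column count of 79 is inferred from the totals reported for the HMDA data (38,968,409 / 493,271).

```python
def size_summary(shape):
    """Format a (rows, columns) tuple as a short size description."""
    rows, cols = shape
    return f"{rows:,} total rows for a total of {rows * cols:,} data points"


print(size_summary((493271, 79)))
# 493,271 total rows for a total of 38,968,409 data points
```

In the notebook this would be called as `size_summary(data_lar.shape)`.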

### Data Storage
We are very early on, but we have already pulled together a rather large dataset.