### Saving the work


*Purpose: This notebook walks step by step through a simple wrangling approach for creating a training dataset of 50,000 records (the rows in the original raw CSV are already in random order) for your single year (i.e., 2013, 2015, or 2017).*

Documentation:  
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html  
https://stackoverflow.com/questions/23103962/how-to-write-dataframe-to-postgres-table                    
https://github.com/metabase/metabase/issues/7214  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

In [1]:
# Importing Libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

import os
import psycopg2
import pandas.io.sql as psql
import sqlalchemy
from sqlalchemy import create_engine

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from scipy import stats
from pylab import *
from matplotlib.ticker import LogLocator

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# Creating a new engine to specify the "master" permissions
postgres_host = 'YOUR AWS DATA INSTANCE'  
postgres_port = 'YOUR AWS PORT' 
postgres_username = 'YOUR AWS MASTER USERNAME'
postgres_password = 'YOUR AWS PASSWORD'
postgres_dbname = "paddle_loan_canoe"
postgres_str = ('postgresql://{username}:{password}@{host}:{port}/{dbname}'
                .format(username = postgres_username,
                        password = postgres_password,
                        host = postgres_host,
                        port = postgres_port,
                        dbname = postgres_dbname)
               )


# Creating the connection.
cnx_m = create_engine(postgres_str)
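
If you would rather not type credentials into the notebook, the connection string can be assembled from environment variables instead. The cell below is a minimal sketch under that assumption: the variable names (`PG_USERNAME`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, `PG_DBNAME`) and the helper `build_postgres_url` are hypothetical, not part of the original setup. It also URL-escapes the username and password, since characters such as `@` or `/` would otherwise break the URL.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a PostgreSQL URL from a mapping of settings (hypothetical variable names)."""
    username = quote_plus(env["PG_USERNAME"])
    password = quote_plus(env["PG_PASSWORD"])  # escapes characters such as '@', ':' and '/'
    host = env["PG_HOST"]
    port = env.get("PG_PORT", "5432")  # 5432 is the default PostgreSQL port
    dbname = env.get("PG_DBNAME", "paddle_loan_canoe")
    return f"postgresql://{username}:{password}@{host}:{port}/{dbname}"


# A plain dict stands in for os.environ so the cell runs without real credentials.
example_env = {
    "PG_USERNAME": "master_user",
    "PG_PASSWORD": "p@ss/w:rd",
    "PG_HOST": "example-host",
}
example_url = build_postgres_url(example_env)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `postgres_str`.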

In [None]:
#  Reading the HMDA dataset for YOUR YEAR (a single year: 2013, 2015, or 2017), joined to the matching
#  education and population datasets, and loading the first 50,000 rows into a pandas DataFrame: df.

df = pd.read_sql_query('''SELECT 
                              --> a. main: casting a few key MORTGAGE data fields:
                                   CAST(us17.action_taken_name As varchar(56)) As outcome, us17.as_of_year As year,
                                   CAST(denial_reason_name_1 As varchar(56)) dn_reason1 , CAST(us17.agency_name As varchar(56)) As agency,
                                   CAST(us17.state_name As varchar(28)) As state,         CAST(us17.county_name As varchar(56)) As county,
                                   CAST(us17.loan_type_name As varchar(56)) As ln_type,   CAST(us17.loan_purpose_name As varchar(56)) As ln_purp, 
                                   us17.loan_amount_000s As ln_amt_000s, us17.hud_median_family_income As hud_med_fm_inc, population as pop,

                                       --two embedded functions and one CASE below: assigns hierarchy in CASE, and converts to num in two steps
                                   CAST ( CAST ( CASE
                                                     WHEN us17.rate_spread = '' Then '0'
                                                     ELSE us17.rate_spread
                                                 END As varchar(5)
                                               ) As numeric
                                        )
                                   As rt_spread,
                                       --categorize loan application outcome into two buckets: "Approved", "Denied, Not Accepted, or Withdrawn"
                                   CASE
                                       WHEN us17.action_taken_name In ('Loan originated', 'Loan purchased by the institution')
                                           THEN 'Approved or Loan Purchased by the Institution'
                                       ELSE 'Denied, Not Accepted, or Withdrawn'
                                   END outcome_bucket,
                              --*
                              --> b. macro-econ: casting and joining a few key EDUCATION data fields:
                                   CAST(educ17."Perc_adults w_less than a HS diploma_2013-17" As int)  As prc_blw_HS__2013_17_5yrAvg,
                                   CAST(educ17."Perc_adults w_ HS diploma only_2013-17" As int)        As prc_HS__2013_17_5yrAvg,
                                   CAST(educ17."Perc_adults w_BA deg or higher_2013-17" As int)        As prc_BA_plus__2013_17_5yrAvg,
                              --*
                              --> c. macro-econ: casting and joining a few key POPULATION data fields:
                                   CAST(pop17.r_birth_2017 AS INT)                                     As r_birth_2017,
                                   CAST(pop17.r_international_mig_2017 AS INT)                         As r_intl_mig_2017,
                                   CAST(pop17.r_natural_inc_2017 AS INT)                               As r_natural_inc_2017
                              --*
                           FROM YOUR_SCHEMA.YOUR_YEAR us17 
                           LEFT OUTER JOIN YOUR_SCHEMA.education__acs_1970_to_2017_5yravgs educ17 
                                   ON us17.county_name = educ17."Area name"
                           LEFT OUTER JOIN YOUR_SCHEMA.populationestimates__usda_ers_2010_to_2018 pop17
                                   ON us17.county_name = pop17.area_name
                           LIMIT 50000''', cnx_m)

# Using pandas to view the first 5 rows (NB: why does it start at 0?).
df.head(5)
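
The two derived fields in the query (`rt_spread` and `outcome_bucket`) can also be reproduced in pandas, which is a handy way to sanity-check the SQL. The cell below is a rough sketch on a made-up three-row frame: it assumes the raw `rate_spread` column arrives as text with `''` for missing values, as the CASE expression implies, and the name `toy` is only for illustration.

```python
import pandas as pd

# Toy rows standing in for the raw HMDA table; the values are made up.
toy = pd.DataFrame({
    "rate_spread": ["", "1.52", "3.10"],
    "action_taken_name": [
        "Loan originated",
        "Application denied by financial institution",
        "Loan purchased by the institution",
    ],
})

# Mirrors the nested CAST/CASE: empty strings become '0', then the text is converted to numbers.
toy["rt_spread"] = pd.to_numeric(toy["rate_spread"].replace("", "0"))

# Mirrors the outcome CASE: two action names count as approved, everything else falls in the other bucket.
approved_actions = ["Loan originated", "Loan purchased by the institution"]
toy["outcome_bucket"] = toy["action_taken_name"].isin(approved_actions).map({
    True: "Approved or Loan Purchased by the Institution",
    False: "Denied, Not Accepted, or Withdrawn",
})

print(toy[["rt_spread", "outcome_bucket"]])
```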

In [None]:
# Using PostgreSQL to count the NULL values in each of the merged "r_" variables (cast to INT)
df_test = pd.read_sql_query('''WITH count_r_vars AS 
                                ( SELECT 
                              
                                   CAST(us17.action_taken_name As varchar(56)) As outcome, us17.as_of_year As year,
                                   CAST(denial_reason_name_1 As varchar(56)) dn_reason1 , CAST(us17.agency_name As varchar(56)) As agency,
                                   CAST(us17.state_name As varchar(28)) As state,         CAST(us17.county_name As varchar(56)) As county,
                                   CAST(educ17."Perc_adults w_less than a HS diploma_2013-17" As int)  As prc_blw_HS__2013_17_5yrAvg,
                                   CAST(educ17."Perc_adults w_ HS diploma only_2013-17" As int)        As prc_HS__2013_17_5yrAvg,
                                   CAST(educ17."Perc_adults w_BA deg or higher_2013-17" As int)        As prc_BA_plus__2013_17_5yrAvg,
                                   CAST(pop17.r_birth_2017 AS INT)                                     As r_birth_2017,
                                   CAST(pop17.r_international_mig_2017 AS INT)                         As r_intl_mig_2017,
                                   CAST(pop17.r_natural_inc_2017 AS INT)                               As r_natural_inc_2017


                                   FROM YOUR_SCHEMA.YOUR_YEAR us17 
                                   LEFT OUTER JOIN YOUR_SCHEMA.education__acs_1970_to_2017_5yravgs educ17 
                                           ON us17.county_name = educ17."Area name"
                                   LEFT OUTER JOIN YOUR_SCHEMA.populationestimates__usda_ers_2010_to_2018 pop17
                                           ON us17.county_name = pop17.area_name
                                   LIMIT 50000
                                ) 
                                SELECT 'r_birth_2017' As r_var_nm, COUNT(*) As null_counts FROM count_r_vars WHERE r_birth_2017 IS NULL
                                    UNION ALL
                                SELECT 'r_intl_mig_2017' As r_var_nm, COUNT(*) As null_counts FROM count_r_vars WHERE r_intl_mig_2017 IS NULL
                                    UNION ALL
                                SELECT 'r_natural_inc_2017' As r_var_nm, COUNT(*) As null_counts FROM count_r_vars WHERE r_natural_inc_2017 IS NULL
                                '''           
                             , cnx_m)
df_test.head()
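
Once the data is in a DataFrame, the same null check can be done in pandas without another round trip to the database. The cell below is a small sketch on a made-up frame; the name `toy` and the output column names are chosen here for illustration only.

```python
import pandas as pd

# Toy frame standing in for the joined query result; the values are made up.
toy = pd.DataFrame({
    "r_birth_2017": [11.0, None, 13.0, None],
    "r_intl_mig_2017": [1.0, 2.0, None, 4.0],
    "r_natural_inc_2017": [3.0, 4.0, 5.0, 6.0],
})

r_cols = ["r_birth_2017", "r_intl_mig_2017", "r_natural_inc_2017"]

# isna().sum() counts the missing values per column; the rest only reshapes the result into a two-column table.
null_counts = (toy[r_cols].isna().sum()
               .rename_axis("r_var_nm")
               .reset_index(name="null_counts"))
print(null_counts)
```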

---

In [None]:
# Limit your preliminary analysis to loan applications under $700K (ln_amt_000s is in thousands of dollars)

df2 = df[df.ln_amt_000s < 700]

Note (for the years 2013-2017): Since median household incomes `> $100K` appear to be outliers, we'll replace them with `$91K`, which is the top of the upper whisker and therefore falls within the last quartile. This is for preliminary modeling only; the final model could not be "standardized" so simply.
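
The replacement described in this note could look roughly like the cell below. It is a sketch rather than the notebook's own code: the helpers `upper_whisker` and `cap_high_values` are hypothetical, the whisker uses the common Tukey convention (largest observation at or below Q3 + 1.5 × IQR), and the toy incomes are made up and expressed in thousands of dollars. Check the units of your own income column before choosing a threshold.

```python
import pandas as pd


def upper_whisker(values):
    """Largest observation at or below Q3 + 1.5 * IQR (Tukey box-plot convention)."""
    q1, q3 = values.quantile([0.25, 0.75])
    fence = q3 + 1.5 * (q3 - q1)
    return values[values <= fence].max()


def cap_high_values(values, threshold, replacement):
    """Replace values above `threshold` with `replacement`; other values and NaN are left as they are."""
    return values.mask(values > threshold, replacement)


# Made-up median incomes in thousands of dollars.
toy_incomes = pd.Series([50, 60, 70, 80, 91, 150])

whisker = upper_whisker(toy_incomes)
capped = cap_high_values(toy_incomes, threshold=100, replacement=whisker)
print(whisker)
print(capped.tolist())
```

Keeping the threshold and the replacement as parameters makes it easy to revisit the choice once you move past preliminary modeling.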

In [None]:
# Dropping rows where any of the three population-rate ("r_") variables is missing
df3 = df2.dropna(subset=['r_natural_inc_2017', 'r_birth_2017', 'r_intl_mig_2017'])

In [None]:
df3_dtype = {'outcome': sqlalchemy.types.VARCHAR(length=56),        'year':  sqlalchemy.types.INTEGER(),
             'dn_reason1': sqlalchemy.types.VARCHAR(length=56),     'agency': sqlalchemy.types.VARCHAR(length=56), 
             'state': sqlalchemy.types.VARCHAR(length=28),          'county': sqlalchemy.types.VARCHAR(length=56), 
             'ln_type': sqlalchemy.types.VARCHAR(length=56),        'ln_purp': sqlalchemy.types.VARCHAR(length=56),
             'ln_amt_000s': sqlalchemy.types.INTEGER(),             'hud_med_fm_inc': sqlalchemy.types.INTEGER(),
             'pop': sqlalchemy.types.INTEGER(),                     'rt_spread': sqlalchemy.types.NUMERIC(),
             # PostgreSQL folds unquoted aliases to lowercase, so the "prc_" keys are lowercase to match df3's columns
             'outcome_bucket': sqlalchemy.types.VARCHAR(length=56), 'prc_blw_hs__2013_17_5yravg': sqlalchemy.types.INTEGER(),
             'prc_hs__2013_17_5yravg': sqlalchemy.types.INTEGER(),  'prc_ba_plus__2013_17_5yravg': sqlalchemy.types.INTEGER(),
             'r_birth_2017': sqlalchemy.types.INTEGER(),            'r_intl_mig_2017': sqlalchemy.types.INTEGER(),
             'r_natural_inc_2017': sqlalchemy.types.INTEGER()
            }

In [None]:
# Using pandas to write Dataframe to PostgreSQL and replacing table if it already exists
df3.to_sql(name='loans_2017__training', schema='aa__testing', chunksize=250,
           dtype= df3_dtype, method=None, con=cnx_m, if_exists='replace', index=False)
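
After writing, it is worth confirming that the table holds the number of rows you expect. The cell below sketches that round trip with an in-memory SQLite database standing in for the PostgreSQL engine, so it runs without credentials; the table name `loans_toy__training` and the toy rows are made up. Against the real database you would run the same `COUNT(*)` query through `cnx_m` with the schema-qualified table name.

```python
import sqlite3

import pandas as pd

# Toy frame standing in for df3; the values are made up.
toy = pd.DataFrame({
    "county": ["Kent County", "Sussex County", "New Castle County"],
    "ln_amt_000s": [120, 250, 310],
})

# In-memory SQLite stands in for the PostgreSQL engine here.
conn = sqlite3.connect(":memory:")

# Writing twice shows that if_exists='replace' overwrites the table instead of appending to it.
for _ in range(2):
    toy.to_sql(name="loans_toy__training", con=conn, if_exists="replace",
               index=False, chunksize=2)

n_rows = pd.read_sql_query("SELECT COUNT(*) AS n FROM loans_toy__training", conn)["n"].iloc[0]
print(n_rows == len(toy))
conn.close()
```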
