**This notebook provides a Python-based approach to importing our project CSV files into a PostgreSQL database. The two coding challenges we face are:**

a) Getting the AWS RDS DB connection to work.  
The key step with Amazon RDS was to set Public accessibility to Yes and then edit the inbound rules of the DB instance's security group to allow the relevant IP address. This article covers the basic points; from there I worked out the equivalent steps in the AWS RDS/EC2 consoles.
https://aws.amazon.com/getting-started/tutorials/create-connect-postgresql-db/
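
Before debugging credentials, it can help to confirm that the inbound rule actually lets traffic through. The following is a minimal sketch using only the standard library; `can_reach` is a hypothetical helper name, and the endpoint in the usage comment is a placeholder, not a real host.

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False

# Hypothetical usage; substitute the endpoint shown in the RDS console.
# can_reach('<your-rds-endpoint>', 5432)
```

A `False` result usually points at networking (public accessibility or the security group rule) rather than at the username or password.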

b) Importing CSV data with missing values into PostgreSQL.
The project's CSV data source records missing values as empty strings, which cannot be loaded directly into numeric columns because PostgreSQL rejects an empty string where a number is expected (the threads linked under Option C discuss this behavior). In addition, the CSV has to be read row by row, because a whole year of data is too large to hold in local memory.

OPTION A (FAILED) "COPY FROM" with null="" failed on our files: PostgreSQL did not convert the empty values to NULL. The relevant cell has been pulled from the notebook for now, but we may add it back in to implement Option C below.
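
If the empty values in the files are quoted (`""`), PostgreSQL's CSV mode reads them as empty strings rather than NULL, which could explain the failure. PostgreSQL 9.4 and later offer a `FORCE_NULL` option for `COPY` that also matches quoted values against the null string. The sketch below only builds such a statement; it is untested against the project data, and the helper names (`quote_ident`, `build_copy_sql`) and the two example columns are illustrative assumptions.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'

def build_copy_sql(table, force_null_columns):
    """Build a COPY ... FROM STDIN statement for CSV input with a header row."""
    cols = ', '.join(quote_ident(c) for c in force_null_columns)
    return (
        f"COPY {quote_ident(table)} FROM STDIN "
        f"WITH (FORMAT csv, HEADER true, NULL '', FORCE_NULL ({cols}))"
    )

# Example columns chosen for illustration only.
copy_sql = build_copy_sql('hmda_lar_2015_allrecords',
                          ['rate_spread', 'applicant_income_000s'])
print(copy_sql)
```

With psycopg2, a statement like this could be passed to `cursor.copy_expert(copy_sql, f)` with `f` being the open CSV file, which streams the file without loading it into memory.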

OPTION B (FAILED) Pandas does not work either, because df.to_sql relies on "INSERT INTO", and the null value is converted back to an empty string when the SQL statement is passed through the Python bindings.  

OPTION C (PENDING) The likely solution is to create every field as VARCHAR for the initial import, then change the column types afterwards. We will code this at a later date.
https://www.postgresql.org/message-id/4C882E8E.6080301%40postnewspapers.com.au
https://github.com/cockroachdb/cockroach/issues/19743
https://forum.cockroachlabs.com/t/import-from-csv-fails-on-null-data-for-int-types/1067/11
https://stackoverflow.com/questions/13125236/sqlalchemy-psycopg2-and-postgresql-copy
https://stackoverflow.com/questions/21527057/python-parse-csv-ignoring-comma-with-double-quotes
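
As a rough illustration of what Option C could look like, the sketch below generates the two kinds of SQL statement involved: a `CREATE TABLE` with every column as VARCHAR, and an `ALTER TABLE ... USING NULLIF(...)` that turns empty strings into NULL while casting. The function names and the example columns and types are assumptions made for illustration, not the project's final schema.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'

def create_text_table_sql(table, columns):
    """CREATE TABLE statement with every column typed as VARCHAR."""
    cols = ', '.join(f'{quote_ident(c)} VARCHAR' for c in columns)
    return f'CREATE TABLE {quote_ident(table)} ({cols});'

def cast_column_sql(table, column, pg_type):
    """ALTER statement that maps '' to NULL and casts the column to pg_type."""
    col = quote_ident(column)
    return (
        f'ALTER TABLE {quote_ident(table)} ALTER COLUMN {col} '
        f"TYPE {pg_type} USING NULLIF({col}, '')::{pg_type};"
    )

create_sql = create_text_table_sql('hmda_test', ['loan_amount_000s', 'state_abbr'])
alter_sql = cast_column_sql('hmda_test', 'loan_amount_000s', 'numeric')
print(create_sql)
print(alter_sql)
```

The cast runs once per numeric column after the initial load, so the load itself never has to interpret an empty string as a number.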

In [None]:
import csv
import os
import pandas as pd
import psycopg2
from sqlalchemy import create_engine

In [None]:
# Python objects for script

hmda_schema = 'public'
tables = ['17', '16', '15', '14', '13', '12']
year = tables[2]
test_table = f'hmda_lar_20{year}_allrecords'

path = ''
csv_file_path = f'{path}{test_table}.csv'

print(csv_file_path)
print(test_table)

In [None]:
# Postgres username, password, and database name.
postgres_host = ''  
postgres_port = '5432' 
postgres_username = '' 
postgres_password = ''
postgres_dbname = ''
postgres_str = ('postgresql://{username}:{password}@{host}:{port}/{dbname}'
                .format(username = postgres_username,
                        password = postgres_password,
                        host = postgres_host,
                        port = postgres_port,
                        dbname = postgres_dbname)
               )


# Creating the connection.
engine = create_engine(postgres_str)
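
One caveat with building the URL by string formatting: a password containing characters such as `@` or `/` will break URL parsing unless it is percent-encoded. A minimal sketch of one way to handle this with the standard library follows; `build_postgres_url` and the example credentials are hypothetical.

```python
from urllib.parse import quote_plus

def build_postgres_url(username, password, host, port, dbname):
    """Build a PostgreSQL URL, percent-encoding the username and password."""
    return (
        f'postgresql://{quote_plus(username)}:{quote_plus(password)}'
        f'@{host}:{port}/{dbname}'
    )

# Made-up credentials, for illustration only.
example_url = build_postgres_url('analyst', 'p@ss/word', 'db.example.com', '5432', 'hmda')
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `postgres_str`.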

In [None]:
with engine.begin() as conn:  # engine.execute() was removed in SQLAlchemy 2.0
    conn.exec_driver_sql(f'DROP TABLE IF EXISTS {test_table};')

**The quickest way to create the table with the right column names is to read the header from a test file and write an empty DataFrame with df.to_sql.**

In [None]:
path_for_header_df = ''
df = pd.read_csv(path_for_header_df, low_memory=False)
df[:0].to_sql(test_table, engine, if_exists='fail', index=False)

**Create a basic CSV iterator, since we can't read a whole year of data into local memory.  This cell works.**

In [None]:
def read_csv(path):
    with open(path, 'rt', newline='') as f:  # use the argument, not the global
        reader = csv.reader(f)
        for row in reader:
            yield row

g = read_csv(csv_file_path)
next(g)
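
Reading one row at a time keeps memory flat, but sending one row at a time to the database is slow. A possible middle ground is to yield fixed-size batches, as in this minimal sketch; `read_csv_batches` and the default `batch_size` are assumptions, not part of the notebook.

```python
import csv
from itertools import islice

def read_csv_batches(path, batch_size=10000):
    """Yield (header, rows) pairs, with at most batch_size rows per pair."""
    with open(path, 'rt', newline='') as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return  # empty file: nothing to yield
        while True:
            # islice pulls the next batch_size rows without reading further.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield header, batch
```

Only one batch is held in memory at a time, so `batch_size` can be tuned to whatever the local machine tolerates.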

**This next cell, the row-by-row pandas attempt (Option B), does not work yet.**

In [None]:
# Column names for the target table, in CSV order.
columns = ['tract_to_msamd_income', 'rate_spread', 'population', 'minority_population', 'number_of_owner_occupied_units', 'number_of_1_to_4_family_units', 'loan_amount_000s', 'hud_median_family_income', 'applicant_income_000s', 'state_name', 'state_abbr', 'sequence_number', 'respondent_id', 'purchaser_type_name', 'property_type_name', 'preapproval_name', 'owner_occupancy_name', 'msamd_name', 'loan_type_name', 'loan_purpose_name', 'lien_status_name', 'hoepa_status_name', 'edit_status_name', 'denial_reason_name_3', 'denial_reason_name_2', 'denial_reason_name_1', 'county_name', 'co_applicant_sex_name', 'co_applicant_race_name_5', 'co_applicant_race_name_4', 'co_applicant_race_name_3', 'co_applicant_race_name_2', 'co_applicant_race_name_1', 'co_applicant_ethnicity_name', 'census_tract_number', 'as_of_year', 'application_date_indicator', 'applicant_sex_name', 'applicant_race_name_5', 'applicant_race_name_4', 'applicant_race_name_3', 'applicant_race_name_2', 'applicant_race_name_1', 'applicant_ethnicity_name', 'agency_name', 'agency_abbr', 'action_taken_name']

def load_csv(path):
    # Stream the file so a whole year of data never sits in memory.
    with open(path, 'rt', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            # One-row DataFrame; raises ValueError if the row length differs from len(columns).
            # Empty strings are passed through unchanged, which is the Option B failure.
            df = pd.DataFrame([row], columns=columns)
            # to_sql manages its own transaction when given an engine, so the
            # original conn.commit() (on an undefined `conn`) has been dropped.
            df.to_sql(test_table, engine, if_exists='append', index=False)

load_csv(csv_file_path)
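
The cell above hands empty strings to the database as-is. One way to address that, sketched here with only the standard library, is to replace empty strings with `None` while streaming, so that whatever inserts the rows sees a real null. `iter_clean_rows` is a hypothetical helper, and this has not been run against the project files.

```python
import csv

def iter_clean_rows(path):
    """Yield one dict per CSV row, with empty strings replaced by None."""
    with open(path, 'rt', newline='') as f:
        # DictReader takes the column names from the header row.
        for row in csv.DictReader(f):
            yield {key: (value if value != '' else None)
                   for key, value in row.items()}
```

A list of these dicts can be turned into a DataFrame with `pd.DataFrame(rows)`; `to_sql` should then write the `None` values as NULL, though that still needs to be confirmed on the real data.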

**Once we find a coding approach that works, the cells below verify that the table now exists.**

In [None]:
show_tables = '''SELECT
   *
FROM
   pg_catalog.pg_tables
WHERE
   schemaname != 'pg_catalog'
AND schemaname != 'information_schema';'''

In [None]:
print('Current tables:')
current_tables = pd.read_sql_query(show_tables, engine)
print(current_tables)