This draft script ingests 2017 Home Mortgage Disclosure Act (HMDA) data from a CSV file into a pandas DataFrame, then explores the data and does some basic wrangling.  Future drafts will generate descriptive statistics, create more refined visualizations, and examine why the CSV appears to have missing data in mandatory fields.

The key issue I had to address was the HMDA data set's "sentinel values" for missing data (including " ") in the CSV download (details below).  The most efficient solution I have found so far is to read the data into a pandas DataFrame.  As a test value, df['applicant_income_000s'][44] then returns NaN rather than " ".
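
As a minimal sketch of the sentinel handling described above: pandas already treats empty fields as missing, and the `na_values` argument of `read_csv` can add further strings such as " " to that list. The three-row sample and the variable names `sample_csv`, `sample_df`, and `missing_income` are made up for illustration and are not taken from the real download.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real file has many more columns.
sample_csv = io.StringIO(
    "applicant_income_000s,loan_amount_000s\n"
    "85,120\n"
    " ,200\n"   # single-space sentinel
    ",150\n"    # empty field
)

# na_values adds " " to the strings pandas already treats as missing.
sample_df = pd.read_csv(sample_csv, na_values=[" "])

# Count how many income values were parsed as NaN.
missing_income = int(sample_df["applicant_income_000s"].isna().sum())
print(sample_df)
print("missing incomes:", missing_income)
```

Listing the sentinel explicitly keeps the column numeric, so later comparisons and plots do not have to deal with stray strings.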

An ongoing issue is how to optimize performance when running scripts against such a large data set.  We will research this further in July.
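
One option worth testing for the memory problem is reading the CSV in chunks and keeping only the rows of interest, so the full file never sits in memory at once. The sketch below is not what this notebook currently does. The helper `read_filtered` is hypothetical, and the column name `state_abbr` is an assumption about the download's schema.

```python
import io
import pandas as pd

def read_filtered(source, column, value, chunksize=100_000, usecols=None):
    """Read a CSV in chunks and keep only rows where column == value."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize, usecols=usecols):
        kept.append(chunk[chunk[column] == value])
    return pd.concat(kept, ignore_index=True)

# Tiny stand-in for the real file; 'state_abbr' is an assumed column name.
demo_csv = io.StringIO(
    "state_abbr,loan_amount_000s\n"
    "TX,120\nCA,300\nTX,95\nNY,410\nTX,210\n"
)

# chunksize=2 forces several chunks even on this small sample.
tx_df = read_filtered(demo_csv, "state_abbr", "TX", chunksize=2)
print(tx_df)
```

Passing `usecols` with only the columns needed for the analysis should reduce memory further, at the cost of re-reading the file if more columns are needed later.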

***Information on Data Source***

Download from this link: https://www.consumerfinance.gov/data-research/hmda/explore
Select year(s) of data: 2017
Select Suggested Filters: "All records"
Time Stamp of Download: July 7 2019, 5:45 PM EDT
Website states "There are 14,285,496 HMDA records from 2017."

NOTE: Due to memory limits on my local PC, I am working with 2017 data for one state (TX), to start.  This cut of data has 1,148,206 records.

***Reference Links***

"A Guide to Home Mortgage Disclosure Act Data" http://nowdata.cinow.info/media/uploads/2011/10/7/92.pdf

https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html

https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Set an option to suppress scientific notation in output (for example, from the describe function).
See also: https://twitter.com/vboykis/status/474241498754461696?lang=en

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
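
To see what the formatter does without changing the global setting, the same lambda can be applied temporarily with `pd.option_context`. This is a small illustration only; the Series `values` is made-up data.

```python
import pandas as pd

# Made-up values: one very large, one very small.
values = pd.Series([1.5e10, 0.000123])

# option_context applies the setting only inside the with-block.
with pd.option_context("display.float_format", lambda x: "%.3f" % x):
    formatted = values.to_string()

print(formatted)
```

Note that three fixed decimals will display very small values such as 0.000123 as 0.000, which is worth keeping in mind when reading describe output.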

In [None]:
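# Set to the folder containing the CSV, including a trailing separator.
# An empty string means the current working directory.
path = ''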

In [None]:
# os.path.join inserts the path separator if 'path' lacks a trailing one
sourcedata = os.path.join(path, 'hmda_lar_test2.csv')
print(sourcedata)

In [None]:
df = pd.read_csv(sourcedata, low_memory=False)

In [None]:
df.describe(include='all')

In [None]:
df.info()

In [None]:
df.head(n=10)

We anticipate the project's key dependent variable will be action_taken_name:

In [None]:
df.groupby('action_taken_name').count()
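
For a more compact view of a single categorical column, `value_counts` returns one count per category, and `normalize=True` returns shares instead. The sketch below uses a toy frame; the category labels are illustrative and may not match the exact strings in the HMDA file.

```python
import pandas as pd

# Toy data; labels are illustrative, not copied from the HMDA file.
toy = pd.DataFrame({
    "action_taken_name": [
        "Loan originated",
        "Application denied",
        "Loan originated",
        "Loan originated",
    ]
})

counts = toy["action_taken_name"].value_counts()                # rows per category
shares = toy["action_taken_name"].value_counts(normalize=True)  # fraction per category

print(counts)
print(shares)
```

Unlike `groupby(...).count()`, which reports non-null counts for every other column, this gives a single number per category.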

One consideration with pandas and NumPy is the handling of null values: I had to remove null values before generating visualizations.

Reference Links: 
https://stackoverflow.com/questions/34955158/what-might-be-the-cause-of-invalid-value-encountered-in-less-equal-in-numpy
https://helpful.knobs-dials.com/index.php/Python_usage_notes_-_Numpy,_scipy
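
A short illustration of why nulls matter here: any ordering comparison against NaN evaluates to False, so NaN rows silently fail threshold filters. The Series `income` below is made-up data.

```python
import numpy as np
import pandas as pd

# Made-up incomes with one missing value.
income = pd.Series([85.0, np.nan, 40.0])

# The NaN row compares False, so it would be filtered out by this mask.
below_cap = income < 100
print(below_cap.tolist())  # [True, False, True]

# Dropping nulls explicitly makes the removal visible and countable.
cleaned = income.dropna()
print(len(income) - len(cleaned), "null value(s) removed")
```

Calling `dropna` before filtering, rather than relying on the comparison to discard NaN rows, makes it easy to report how many records were lost to missing data.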

In [None]:
subset_df = df[['loan_amount_000s','applicant_income_000s']].dropna()
subset_df = subset_df[ (subset_df['loan_amount_000s']<20000) & (subset_df['applicant_income_000s']<40000) ]

In [None]:
# Pass x and y as keywords; recent seaborn releases no longer accept them positionally
g = sns.lmplot(x='loan_amount_000s', y='applicant_income_000s', data=subset_df, fit_reg=False)
plt.show()

In [None]:
sns.boxplot(x=subset_df['loan_amount_000s'])

In [None]:
sns.boxplot(x=subset_df['applicant_income_000s'])