# Plotting the CT PPP data

Import the necessary libraries. To avoid truncating the number of rows displayed in output, the 'display.max_rows' option below can be set to None. Because output can exceed 60K rows, the limit of 500 is a useful starting point.

In [None]:
import numpy as np
import pandas as pd
import json
from matplotlib import pyplot as plt
%matplotlib inline

import plotnine as p9
import warnings

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = "{:,.0f}".format

## Loading and preparing the data

Read in the three JSON files that contain the CT PPP data. When working locally, it will be necessary to use full path names for the files unless they are in the same working directory as this notebook. <br><br>If using Google Colab, first upload these JSON files by clicking the folder icon in the notebook's sidebar and then clicking the upload button. Then prefix the JSON filenames below with '/content/', so that the first filename becomes '/content/ctppp_small_063020.json'. <br><br>Here we assign the <\\$150K loans to the small_df dataframe, the >\\$150K loans to the large_df dataframe, and the composite dataset to the total_df dataframe.

In [None]:
small_df = pd.read_json('ctppp_small_063020.json', dtype={'Zip': 'str'})

In [None]:
large_df = pd.read_json('ctppp_large_063020.json', dtype={'Zip': 'str'})

In [None]:
total_df = pd.read_json('ctppp_total_063020.json', dtype={'Zip': 'str'})
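Connecticut ZIP codes begin with 0, which is presumably why 'Zip' is read as a string above. The following is a minimal sketch using two made-up records (not the PPP files), and it assumes that ZIP codes are stored as JSON strings. Under that assumption, it shows that passing dtype={'Zip': 'str'} keeps the leading zero.

```python
from io import StringIO

import pandas as pd

# Two made-up records, assuming ZIP codes are stored as JSON strings.
sample_json = (
    '[{"Zip": "06510", "LoanAmount": 20000},'
    ' {"Zip": "06103", "LoanAmount": 35000}]'
)

# Reading with an explicit string dtype keeps the leading zero.
sample_df = pd.read_json(StringIO(sample_json), dtype={'Zip': 'str'})
zips = sample_df['Zip'].tolist()
print(zips)
```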

One simple approach for creating point estimates of the >\\$150K loans is to assign the same dollar value to all loans in the same dollar range, for example \\$7.5M for all \\$5M-10M loans. This is not an actual estimate per se, but a standardization of the larger loan values. For the sake of simplicity, this approach is used below to illustrate aggregate loan dollar amounts. This standardization uses the following steps:
1. Use the reported SBA total loan dollar amount for CT and subtract out deleted loans from this SBA total. In subtracting out the deleted loans, for <\\$150K loans, use the actual loan amount reported; for >\\$150K loans, use the midpoint of the dollar range for that loan. E.g., \\$250K for \\$150K-\\$350K loans.
2. Compute a percentage of the dollar range that will be added to the loan minimum to give the standardized loan value. Use the same percentage for all loan ranges. E.g., 50% would yield the midpoint for all the loan ranges. In the initial tranche of data provided by the SBA, 35.425% was deemed appropriate after cleaning the data and subtracting out deleted items.
3. Follow this approach to create another standardized column of loan amounts in the large and total loan dataframes. Actual reported loan amounts for the <\\$150K loans are still always used in this new column, however.
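
One plausible reading of step 2 is a single equation. If every large loan is valued at LoanMin + p * (LoanMax - LoanMin), then the shared percentage p that reproduces a target total is (target - sum of LoanMin) / (sum of range widths). The sketch below illustrates only that arithmetic. All figures, including 'target_large_total', are hypothetical, and the source does not state that 35.425% was derived exactly this way.

```python
import pandas as pd

# Hypothetical large loans, described only by their dollar ranges.
toy_large = pd.DataFrame({
    'LoanMin': [150_000, 350_000, 5_000_000],
    'LoanMax': [350_000, 1_000_000, 10_000_000],
})

# Hypothetical target: the reported total, less deleted loans and less
# the actual amounts of the <$150K loans.
target_large_total = 7_500_000

# Solve for the single percentage p that makes the standardized
# amounts sum to the target.
range_width = toy_large['LoanMax'] - toy_large['LoanMin']
p = (target_large_total - toy_large['LoanMin'].sum()) / range_width.sum()

# Apply the same percentage to every loan range.
toy_large['StandardLoanAmount'] = toy_large['LoanMin'] + range_width * p
standardized_total = toy_large['StandardLoanAmount'].sum()
print(round(p, 5), standardized_total)
```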

In [None]:
standardization_percentage = 0.35425
large_df['StandardLoanAmount'] = (large_df['LoanMin']+(large_df['LoanMax']-large_df['LoanMin'])*standardization_percentage)
total_df['StandardLoanAmount'] = (total_df['LoanMin']+(total_df['LoanMax']-total_df['LoanMin'])*standardization_percentage)

With these standardized loan amounts, it is easier to illustrate aggregate relationships in the data. The CT PPP data is now fully available in the notebook for analysis.

## Plotting the data

The first example generates a dotplot of retained jobs for each of the loan ranges >\\$150K. First confirm the number of borrowers that did not give information on retained jobs and the number that answered 'zero' retained jobs. Then create the dotplot using the plotnine (Grammar of Graphics) library and save it to the current working directory.

In [None]:
large_df['JobsRetained'].isnull().sum()

In [None]:
(large_df['JobsRetained'] == 0).sum()

In [None]:
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

p = (
    p9.ggplot(large_df)
    + p9.geom_point(p9.aes(x='LoanRange',y='JobsRetained'), position='jitter', alpha=0.5)
    + p9.labs(y='Self-Reported Jobs Retained', x='', title='Larger loans reported higher job retention')
    + p9.theme(axis_text_x=None, figure_size=(9,6))
    + p9.theme(plot_title=p9.element_text(size=18, face='bold'))
    + p9.coord_flip()
)
p.save(filename='large_loan_jobs.png', width=9, height=6, units='in', dpi=600)
p

The next example is a simple truncated histogram of the reported jobs retained among the <\\$150K loans. In this case, the distribution is cut off at 25 jobs.

In [None]:
small_df['JobsRetained'].isnull().sum()

In [None]:
(small_df['JobsRetained'] == 0).sum()

In [None]:
small_df['JobsRetained'].describe()

In [None]:
p = (
    p9.ggplot(small_df[(small_df['JobsRetained']<=25)])
    + p9.geom_bar(p9.aes(x='JobsRetained'), size=1.0, fill='cornflowerblue')
    + p9.labs(x='Reported Jobs Retained', y='Loan Count', title='Smaller loans averaged 5 jobs retained')
    + p9.theme(figure_size=(9,6))
    + p9.theme(plot_title=p9.element_text(size=18, face='bold'))
)
p.save(filename='small_loan_jobs.png', width=9, height=6, units='in', dpi=600)
p

The last example shows the total loan dollar volume approved each day, for the three largest and three smallest loan ranges. In order to more easily work with dates, transform the current date column, 'DateApproved', in the dataframe into a new datetime-aware column, 'DateTime', and save it in the total_df dataframe. Use this column in the plot. Find the largest daily total loan dollar amount to label the axis manually.

In [None]:
total_df['DateTime'] = pd.to_datetime(total_df['DateApproved'])

In [None]:
total_df.groupby(['DateTime','LoanRange'])['StandardLoanAmount'].sum().max()
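
The maximum computed above determines how far the y-axis must extend. As a minimal sketch with a made-up maximum, the helper below builds evenly spaced break positions and matching 'mm' labels. The name 'million_breaks' is hypothetical and not part of plotnine. The two lists could be passed together as the breaks and labels arguments of p9.scale_y_continuous.

```python
import math


def million_breaks(max_value, step=50_000_000):
    """Return break positions up to max_value and labels such as '50mm'."""
    n_steps = math.floor(max_value / step)
    breaks = [i * step for i in range(n_steps + 1)]
    labels = ['0' if b == 0 else f'{b // 1_000_000}mm' for b in breaks]
    return breaks, labels


# Made-up daily maximum, standing in for the groupby result above.
breaks, labels = million_breaks(162_000_000)
print(breaks)
print(labels)
```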

The date_format import below is used only to customize the date labels. It is optional: the import can be removed, as long as the scale_x_datetime line that uses date_format is also removed from the plot code block.

In [None]:
from mizani.formatters import date_format

In [None]:
p = (
    p9.ggplot(total_df[(total_df.LoanRange!='d $350,000-1 million') &
                              (total_df.LoanRange!='e $150,000-350,000')])
    + p9.geom_col(p9.aes(x='DateTime',y='StandardLoanAmount'), color='cornflowerblue')
    + p9.labs(x='Loan Approval Date', y='', title='Larger loans were approved earlier')
    + p9.theme(axis_text_x=p9.element_text(angle=30))
    + p9.scale_x_datetime(labels=date_format('%-m-%-d'))
    + p9.theme(axis_text_x=None, figure_size=(9,6))
    + p9.theme(plot_title=p9.element_text(size=18, face='bold'))
    + p9.scale_y_continuous(breaks=[0, 50e6, 100e6, 150e6], labels=['0', '50mm', '100mm', '150mm'])
    + p9.facet_wrap('~ LoanRange', ncol=3)
)
p.save(filename='approvals_loan_range.png', width=9, height=6, units='in', dpi=600)
p