# Loading and preparing the CT PPP data

Import the necessary libraries. Setting max_rows below to None removes the limit on the number of rows displayed in output. Because output can run to more than 60K rows, a limit of 500 is a useful starting point.

In [None]:
import numpy as np
import pandas as pd
import json

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = "{:,.0f}".format
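
If the row limit gets in the way for a single output, it can be lifted temporarily instead of globally. The sketch below is one way to do that with `pd.option_context`; `demo_df` is a made-up stand-in for a long PPP table, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for a long table (not the real PPP data).
demo_df = pd.DataFrame({'LoanAmount': range(1000)})

rows_before = pd.get_option('display.max_rows')

# Inside the block, max_rows is None (no truncation). The previous
# setting is restored automatically when the block exits.
with pd.option_context('display.max_rows', None):
    rows_inside = pd.get_option('display.max_rows')
    full_text = repr(demo_df)

rows_after = pd.get_option('display.max_rows')
```

This keeps the notebook-wide limit in place while letting one cell show every row.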

Read in the three JSON files that contain the CT PPP data. Unless the JSON files are in the same working directory as this notebook when working locally on your computer, it will be necessary to use full path names for the files. <br><br>If using Google Colab, first upload these JSON files to Colab by clicking the folder icon in the notebook's sidebar and then clicking the upload button. Then preface the JSON filenames below with '/content/', so that the first filename below is '/content/ctppp_small_063020.json' instead. <br><br>Here we assign the <\\$150K loans to the small_df dataframe, the >\\$150K loans to the large_df dataframe, and the composite dataset to the total_df dataframe.

In [None]:
small_df = pd.read_json('ctppp_small_063020.json', dtype={'Zip': 'str'})

In [None]:
large_df = pd.read_json('ctppp_large_063020.json', dtype={'Zip': 'str'})

In [None]:
total_df = pd.read_json('ctppp_total_063020.json', dtype={'Zip': 'str'})
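
To avoid editing three filenames when the files live elsewhere, the paths can be built from a single directory variable. This is only a sketch: `DATA_DIR` and `ppp_path` are hypothetical names introduced here, and the directory shown assumes the Colab upload location described above.

```python
from pathlib import Path

# Assumption: DATA_DIR is wherever the JSON files live. Use Path('/content')
# on Google Colab, or Path('.') when the files sit next to the notebook.
DATA_DIR = Path('/content')

def ppp_path(kind):
    """Return the path to one of the three files ('small', 'large' or 'total')."""
    return DATA_DIR / f'ctppp_{kind}_063020.json'

print(ppp_path('small').as_posix())  # /content/ctppp_small_063020.json
```

`pd.read_json` accepts path objects, so `ppp_path('small')` can be passed in place of the filename string in the cells above.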

Alternatively, the composite dataset can be built from the small_df and large_df dataframes, using the following approach:
```Python
concat_small_df = small_df.copy()
concat_small_df['LoanMax'] = concat_small_df['LoanAmount'].copy()
concat_small_df.rename(columns={'LoanAmount':'LoanMin'}, inplace=True)
total_df = pd.concat([large_df, concat_small_df], join='inner', ignore_index=True)
```
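
The same steps can be tried on toy data to see what `join='inner'` does. The frames and the `ExtraSmallOnly` column below are invented for illustration and do not reflect the real file contents.

```python
import pandas as pd

# Made-up rows. 'ExtraSmallOnly' is a hypothetical column that exists
# only in the small frame.
small_toy = pd.DataFrame({
    'BusinessName': ['A SHOP'],
    'LoanAmount': [20000.0],
    'ExtraSmallOnly': ['x'],
})
large_toy = pd.DataFrame({
    'BusinessName': ['B LLC'],
    'LoanMin': [150000.0],
    'LoanMax': [350000.0],
})

# Give the small loans a LoanMin and LoanMax that both equal the actual amount.
concat_small = small_toy.copy()
concat_small['LoanMax'] = concat_small['LoanAmount']
concat_small = concat_small.rename(columns={'LoanAmount': 'LoanMin'})

# join='inner' keeps only the columns present in both frames.
total_toy = pd.concat([large_toy, concat_small], join='inner', ignore_index=True)
print(total_toy)
```

Columns that appear in only one of the two frames are dropped from the result, so it is worth checking the column list of the combined dataframe.
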
With the three dataframes loaded, you can start analyzing the CT PPP data, as seen in these three one-line examples. For more detailed work, it will be helpful to estimate loan dollar amounts for the >$150K loans. This is done below.

In [None]:
small_df.groupby('CD')['LoanAmount'].agg(['sum','mean','count']).sort_values(by='sum', ascending=False)

In [None]:
large_df[large_df['BusinessName'].str.contains('CAPITAL|VENTURE|ASSET', na=False)].head()

In [None]:
total_df.groupby('Sector')['JobsRetained'].agg(['sum','mean','count']).sort_values(by='sum', ascending=False)

One simple approach for creating point estimates of the >\\$150K loans is to assign the same dollar value to all loans in the same dollar range, for example \\$7.5M for all \\$5M-10M loans. This is not an actual estimate per se, but a standardization of the larger loan values. For the sake of simplicity, this approach is used below to illustrate aggregate loan dollar amounts. This standardization uses the following steps:
1. Use the reported SBA total loan dollar amount for CT and subtract out deleted loans from this SBA total. In subtracting out the deleted loans, for <\\$150K loans, use the actual loan amount reported; for >\\$150K loans, use the midpoint of the dollar range for that loan. E.g., \\$250K for \\$150K-\\$350K loans.
2. Compute a percentage of the dollar range that will be added to the loan minimum to give the standardized loan value. Use the same percentage for all loan ranges. E.g., 50% would yield the midpoint for all the loan ranges. In the initial tranche of data provided by the SBA, 35.425% was deemed appropriate after cleaning the data and subtracting out deleted items.
3. Follow this approach to create another standardized column of loan amounts in the large and total loan dataframes. Actual reported loan amounts for the <\\$150K loans are still always used in this new column, however.
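
The steps above do not spell out how the 35.425% figure was computed. One way such a percentage could be solved for, assuming it is chosen so that the standardized large loans plus the actual small loans add up to the adjusted SBA total, is sketched below. All numbers and the names `large_toy`, `small_total` and `adjusted_sba_total` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real frames and SBA total are not reproduced here.
large_toy = pd.DataFrame({
    'LoanMin': [150_000, 350_000, 1_000_000],
    'LoanMax': [350_000, 1_000_000, 2_000_000],
})
small_total = 400_000            # sum of actual small-loan amounts (made up)
adjusted_sba_total = 2_600_000   # SBA total minus deleted loans (made up)

# Dollars that the large loans must account for.
large_target = adjusted_sba_total - small_total

min_sum = large_toy['LoanMin'].sum()
range_sum = (large_toy['LoanMax'] - large_toy['LoanMin']).sum()

# Solve min_sum + pct * range_sum == large_target for pct.
pct = (large_target - min_sum) / range_sum

# Apply the same percentage to every loan range.
standard = large_toy['LoanMin'] + (large_toy['LoanMax'] - large_toy['LoanMin']) * pct
print(round(pct, 5), round(standard.sum() + small_total))
```

Under this assumption a single percentage reproduces the target total exactly, which is why the same value can be used for every loan range.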

In [None]:
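# Fraction of each loan's dollar range that is added to its minimum (0.5 would give the midpoint).
standardization_percentage = 0.35425
# Standardized amount = LoanMin + (LoanMax - LoanMin) * percentage
large_df['StandardLoanAmount'] = large_df['LoanMin'] + (large_df['LoanMax'] - large_df['LoanMin']) * standardization_percentage
total_df['StandardLoanAmount'] = total_df['LoanMin'] + (total_df['LoanMax'] - total_df['LoanMin']) * standardization_percentage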
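
As a quick check of the arithmetic, the same formula can be applied to a few hand-made rows. The toy dataframe below is an assumption for illustration: it has two ranged loans and one small loan whose LoanMin and LoanMax both equal the reported amount, as in the concatenation approach shown earlier.

```python
import pandas as pd

standardization_percentage = 0.35425

# Hypothetical rows: two ranged loans and one small loan with
# LoanMin equal to LoanMax.
toy_df = pd.DataFrame({
    'LoanMin': [150_000, 5_000_000, 20_000],
    'LoanMax': [350_000, 10_000_000, 20_000],
})

loan_range = toy_df['LoanMax'] - toy_df['LoanMin']
toy_df['StandardLoanAmount'] = toy_df['LoanMin'] + loan_range * standardization_percentage

for amount in toy_df['StandardLoanAmount']:
    print(f'{amount:,.0f}')
# 220,850
# 6,771,250
# 20,000
```

A loan whose minimum and maximum coincide has a zero-width range, so its standardized amount is simply its reported amount.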

With these standardized loan amounts, it is easier to illustrate aggregate relationships in the data.

In [None]:
total_df['StandardLoanAmount'].sum()

In [None]:
large_df['StandardLoanAmount'].value_counts().sort_values(ascending=True)

In [None]:
total_df.groupby('LoanRange')['StandardLoanAmount'].agg(['sum','count']).sort_index()

The CT PPP data is now fully available in the notebook for analysis.