## Loading Data

In [3]:
%load_ext autoreload
%autoreload 2
import os
import sys
import pandas as pd

module_path = os.path.abspath(os.path.join(os.pardir, 'src'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from modules import dataloading as dl

from datetime import datetime

targetdir = "../data/extracted/"

The code block below downloads the 2019 Survey of Consumer Finances (SCF) from the Federal Reserve. The data is saved to `targetdir`, a directory that is excluded from the Git repo because of its size. `year` is set to 2019, and `series` specifies which variables from the dataset will be loaded. Here `series` is set to `sel_vars`, a global variable in `modules.dataloading` that lists every variable used in some way for the modeling.
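Since the download only needs to run when the data is not already on disk, a small guard can make that decision explicit. The following is a minimal sketch, not part of `modules.dataloading`: the helper name `needs_download` is hypothetical, and the relative path is the one read with `pd.read_stata` in this notebook.

```python
import os


def needs_download(targetdir, relpath="scf2019s/p19i6.dta"):
    """Return True when the extracted Stata file is missing from targetdir.

    Hypothetical helper; relpath matches the file read later in this notebook.
    """
    return not os.path.isfile(os.path.join(targetdir, relpath))


# Intended use in this notebook (dl is modules.dataloading, imported above):
# if needs_download(targetdir):
#     dl.SCF_load_data(targetdir=targetdir, year=2019, series=dl.sel_vars)
```

Building the path with `os.path.join` also avoids depending on whether `targetdir` ends with a trailing slash.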

Because the raw variable names are unwieldy strings of letters and numbers, they are renamed to describe what they measure.

In [8]:
# # Run this code if data is not downloaded locally
# dl.SCF_load_data(targetdir=targetdir, 
#                  year=2019, 
#                  series=dl.sel_vars)

# Use pandas to read the 2019 data and preview
df = pd.read_stata(targetdir + 'scf2019s/p19i6.dta', columns=dl.sel_vars)
df.head()


Unnamed: 0,yy1,y1,x42001,x7001,x7020,x102,x8000,x14,x19,x8021,...,x3748,x3754,x3760,x3765,x3732,x3738,x3744,x3750,x3756,x3762
0,1,11,30598.896539,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
1,1,12,23561.874562,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
2,1,13,25726.122276,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
3,1,14,26488.31706,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
4,1,15,23809.061856,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0


Only the variables used directly in the final model are renamed; variables that serve only as inputs to constructed variables keep their original names. For example, the survey does not aggregate the total balance on lines of credit into a single variable, so the component variables must be added together.
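As a rough illustration of this pattern, the sketch below renames a few columns with a partial dictionary and then sums component columns row by row. The three rename pairs are inferred from the previews in this notebook; the `loc_balance_*` column names and all values are made up for illustration and do not correspond to actual SCF variables.

```python
import pandas as pd

# Toy frame: three real SCF codes plus two hypothetical component columns
raw = pd.DataFrame({
    "YY1": [1, 1],
    "Y1": [11, 12],
    "X14": [75, 75],
    "loc_balance_a": [100.0, 0.0],   # hypothetical line-of-credit balance
    "loc_balance_b": [250.0, 50.0],  # hypothetical line-of-credit balance
})

# Lower-case the column names so they match the dictionary keys
raw.columns = [c.lower() for c in raw.columns]

# Columns missing from the dictionary keep their original names
partial_rename = {"yy1": "household_id", "y1": "imputed_hh_id", "x14": "ref_age"}
renamed = raw.rename(columns=partial_rename)

# Add the component columns together row by row
loc_cols = ["loc_balance_a", "loc_balance_b"]
renamed["loc_balance_total"] = renamed[loc_cols].sum(axis=1)
```

`DataFrame.rename` ignores columns that are not keys in the mapping, which is why the unrenamed inputs pass through unchanged.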

In [11]:
# putting all the columns in lower case to be absolutely sure there are no issues when renaming
df.columns = [x.lower() for x in df.columns]

# uses a global dict in modules.dataloading to rename columns
df.rename(columns=dl.rename_dict, inplace=True)
df.head()

Unnamed: 0,household_id,imputed_hh_id,weighting,persons_in_peu,spouse_part_of_peu,ref_next_relative_type,switch_of_resp_ref,ref_age,spouse_age,ref_sex,...,x3748,x3754,x3760,x3765,x3732,x3738,x3744,x3750,x3756,x3762
0,1,11,30598.896539,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
1,1,12,23561.874562,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
2,1,13,25726.122276,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
3,1,14,26488.31706,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0
4,1,15,23809.061856,1,1,0,5,75,0,2,...,0,0,0,0,0,0,0,0,0,0


The function below performs additional operations on the data to (1) aggregate the relevant variables for each household, (2) average each household's values across its implicates, and (3) calculate the target variables: `lqd_assets`, which measures liquid assets minus current debts, and `1k_target`, which indicates whether a household has more than \$1,000 in liquid net worth.
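The averaging and target steps can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the implementation of `clean_SCF_df`: the `current_debts` column is hypothetical, the values are invented, and the real `lqd_assets` is built from more components than the two account columns used here.

```python
import pandas as pd

# Two households with two implicates each (the SCF provides five per household)
toy = pd.DataFrame({
    "household_id": [1, 1, 2, 2],
    "checking_accts_value": [500.0, 700.0, 2000.0, 2000.0],
    "savings_accts_value": [100.0, 100.0, 0.0, 400.0],
    "current_debts": [0.0, 0.0, 500.0, 500.0],  # hypothetical column
})

# Step 2: average every column across each household's implicates
per_household = toy.groupby("household_id").mean()

# Step 3: liquid assets minus current debts, then the binary target
per_household["lqd_assets"] = (
    per_household["checking_accts_value"]
    + per_household["savings_accts_value"]
    - per_household["current_debts"]
)
per_household["1k_target"] = (per_household["lqd_assets"] > 1000).astype(int)
```

Grouping by `household_id` makes it the index of the result, which matches the shape of the cleaned preview shown later in this section.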

Additional cleaning is done during EDA, which explores outliers and the specially coded values that the survey uses to account for every possible answer.

In [5]:
df = dl.clean_SCF_df(df)

In [6]:
df.head()

household_id,imputed_hh_id,weighting,persons_in_PEU,spouse_part_of_PEU,ref_next_relative_type,switch_of_resp_ref,ref_age,spouse_age,ref_sex,spouse_sex,...,checking_accts_value,savings_accts_value,lqd_assets,educ_bins,doctorate_deg,master_deg,bachelor_deg,assoc_deg,hs_deg,1k_target
1,13.0,26036.854458,1.0,1.0,0.0,5.0,75.0,0.0,2.0,0.0,...,6000.0,0.0,550000.0,3,0,0,1,0,0,1
2,23.0,18969.956098,5.0,2.0,1.0,1.0,50.0,39.0,1.0,2.0,...,759.0,8.0,767.0,1,1,0,0,0,0,0
3,33.0,20483.071126,2.0,2.0,1.0,1.0,53.0,49.0,1.0,2.0,...,3750.0,0.0,6750.0,1,1,0,0,0,0,1
4,43.0,31785.437408,2.0,2.0,2.0,5.0,29.0,28.0,1.0,2.0,...,3500.0,10006.0,21506.0,4,0,0,0,1,0,1
5,53.0,21046.09621,2.0,2.0,2.0,5.0,47.0,39.0,1.0,2.0,...,-1.0,0.0,-1.0,1,1,0,0,0,0,0


## Saving a Sample

The code below saves a sample of the data if needed. 

In [None]:
# csv_head = df.head()
# # household_id is the index after cleaning, so keep the index when saving
# csv_head.to_csv('example_data.csv', index=True)
# csv_head.shape