# **Vanguard's \$1,000 Question: Analyzing U.S. Households' Liquid Assets**
## Dan Valenzuela

***

## **Overview** <a id="Overview"></a>

Vanguard Group is a financial services company that has been a pioneer in creating mutual funds, a type of investment vehicle that helps investors reach their financial goals. However, many financial services companies are making it cheaper to invest in their products, making Vanguard's market-leading inflows more precarious going forward. This project aims to help Vanguard answer the following question by implementing a logistic regression model trained on data provided by the Federal Reserve: which households have enough money to invest with Vanguard?

Using a logistic regression model trained on data transformed through various methods to account for outliers, an F1 score of approximately .915 can be achieved. Further, the model shows that households that bank with a number of financial institutions and are credit card power users are likely to be targets for Vanguard's mutual funds.

In the immediate term, Vanguard can use data tools like credit reports and Plaid (which provides information about other financial services) to improve its targeting efforts.

[**1. Problem**](#Problem)

[**2. Data Understanding**](#Data-Understanding)

[**3. Data Preparation**](#Data-Preparation)

[**4. Data Analysis**](#Data-Analysis)

[**5. Evaluation**](#Evaluation)

[**6. Conclusion & Next Steps**](#Conclusion)

[**7. Endnotes**](#Endnotes)

[**8. Appendix A: Variables**](#AppA)

***

## **Problem** <a id="Problem"></a>
[*↑ Back to overview*](#Overview)

Vanguard Group is a financial services company that has been a [pioneer](https://www.foxbusiness.com/personal-finance/vanguards-index-funds-a-history-of-evolution-for-investors) in creating mutual funds, a type of investment vehicle that helps investors reach their financial goals. However, many financial services companies are making it [cheaper to invest](https://www.marketwatch.com/story/fidelity-cuts-fees-to-0-as-it-jumps-on-zero-commission-bandwagon-2019-10-10) in their products, making Vanguard's [market-leading](https://www.inquirer.com/business/vanguard-inflows-bloomberg-jeff-demaso-index-active-mutual-funds-20190524.html) inflows more precarious going forward. 

Despite competitors' new moves, 4 in 5 households that invest with Vanguard utilize mutual funds to meet their financial goals.<a id='fn-1-src'></a>[<sup>1</sup>](#fn-1) Mutual funds are essentially like any other company but their sole purpose is to pool money from investors and invest it into stocks, bonds, or other financial products. When a person invests in Vanguard mutual funds, what they are essentially doing is buying shares in a company that Vanguard has set up for the purpose of investing money in a certain way. In return, that person gets a cut of the returns (or losses) on investment.

Why do so many households use mutual funds? Investing through a fund outsources all the effort of researching companies, executing trades, and paying trading fees to the mutual fund. For that effort, mutual funds charge a fee based on the amount households invest.<a id='fn-2-src'></a>[<sup>2</sup>](#fn-2)

However, it's not so easy as telling companies that set up mutual funds like Vanguard to shut up and take your money. Vanguard requires investors to put in at least \$1,000 for some of their mutual funds. Other mutual fund investment minimums only go higher. <a id='fn-3-src'></a>[<sup>3</sup>](#fn-3)



Two questions come to mind when considering what Vanguard can do to keep its competitive edge: 
1. Are there certain features about U.S. households that would predict whether they have enough available money to invest with Vanguard?
2. Are there certain features that would predict whether households would use that extra money to invest with Vanguard?

Although it would be great to answer question 2, the lack of data that identifies Vanguard customers is difficult to overcome without becoming a Vanguard employee. The good news is that question 1 is a much easier question to answer with public data.

This project aims to answer question 1 by implementing a logistic regression model trained on data provided by the Federal Reserve that reveals the finances of U.S. households in detail.



[*↑ Back to overview*](#Overview)
***

## **Data Understanding** <a id="Data-Understanding"></a>
[*↑ Back to overview*](#Overview)

In [21]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pandas as pd

module_path = os.path.abspath(os.path.join( 'src'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from modules import dataloading as dl
from modules import modeling

targetdir = 'data/extracted/'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Dataset

An overview of the data can be best summarized by the Federal Reserve itself <a id='fn-4-src'></a>[<sup>4</sup>](#fn-4): 

> The Federal Reserve Board’s triennial Survey of Consumer Finances (SCF) collects information about family income, net worth, balance sheet components, credit use, and other financial outcomes. \[. . . \] The SCF is a triennial interview survey of U.S. families sponsored by the
Board of Governors of the Federal Reserve System with the cooperation of the U.S.
Department of the Treasury. Since 1992, data for the SCF have been collected by NORC, a
research organization at the University of Chicago. Although the majority of the data are
collected between May and December of each survey year, a small fraction of the data
collection occurs in the first four months of the next calendar year. In the 2019 SCF, this
portion of the data collection overlapped with early months of the COVID-19 pandemic,
with about 9 percent of interviews conducted between February and April 2020.

The 2019 survey is the most recent survey and is used for this project. Although any given survey covers only about 6,000 households, the data has tricky structural issues that any model needs to address. The first is sampling.
 

#### *Sampling*

The survey constructs a full data set using two different methods. First, it draws a standard random sample of the population to cover a broad range of geographies and homeownership. Second, another sample is taken primarily from wealthy households, because non-response is higher among those households than among less well-off ones.

Although this sampling technique helps fix the issue of non-response, it also constructs a data set that is overrepresentative of wealthy households compared to the general population. The Federal Reserve provides a sample weight for each household, which is intended to represent the number of households that each surveyed household stands for in the general population.<a id='fn-5-src'></a>[<sup>5</sup>](#fn-5)

The weighting is primarily a question of implementation with respect to the model. However, there is another big quirk with the data: how it deals with missing values.

#### *Imputing Missing Values*

The number of observations in the full data set is 5 times the number of households in the dataset.

In [2]:
# # Run this code if data is not downloaded locally
# dl.SCF_load_data(targetdir=targetdir, 
#                  year=2019, 
#                  series=dl.sel_vars)

og_df = pd.read_stata(targetdir + 'scf2019s/p19i6.dta', columns=None)

In [3]:
# number of obs. divided by num. of households
og_df.shape[0] / og_df.yy1.value_counts().shape[0]

5.0

The Federal Reserve [provides](https://www.federalreserve.gov/econres/files/codebk2019.txt) 5 full observations for each household, each varying one from another, to (1) help model the wide range of possible values that can be put in for missing values and (2) obfuscate the identity of households. 

This raises the question of how to build a model from 5 copies of a data set, which is addressed in part in the SCF's [codebook](https://www.federalreserve.gov/econres/files/codebk2019.txt). How this is dealt with specifically is addressed in [Data Preparation](#Data-Preparation).

#### *Variables*

In [4]:
# Number of variables
og_df.shape[1]

5333

There are over 5,000 variables available for analysis, each of which corresponds to some item on the survey given to each household. 

Since this analysis aims to help Vanguard more easily identify potential customers, variables were chosen from the dataset that (1) help build a picture of a household's liquid assets, (2) reflect debts that may appear on a credit report and (3) contain demographic information that may be routinely collected.

Of the original 5,333 variables, 68 were used. Further feature engineering was performed on these variables, which is described in [Data Preparation](#Data-Preparation), but a comprehensive view of the variables can be seen in [Appendix A](#AppA).

In [10]:
df = pd.read_stata(targetdir + 'scf2019s/p19i6.dta', columns=dl.sel_vars)

In [6]:
df.shape[1]

68

[*↑ Back to overview*](#Overview)
***

## **Data Preparation** <a id="Data-Preparation"></a>
[*↑ Back to overview*](#Overview)

### Dealing with Implicates

In [11]:
df.columns = [x.lower() for x in df.columns]
df.rename(columns=dl.rename_dict, inplace=True)
df.head()

Unnamed: 0,household_id,weighting,persons_in_PEU,spouse_part_of_PEU,ref_age,spouse_age,ref_sex,spouse_sex,ref_race,ref_educ,...,x3748,x3754,x3760,x3765,x3732,x3738,x3744,x3750,x3756,x3762
0,1,30598.896539,1,1,75,0,2,0,1,12,...,0,0,0,0,0,0,0,0,0,0
1,1,23561.874562,1,1,75,0,2,0,1,12,...,0,0,0,0,0,0,0,0,0,0
2,1,25726.122276,1,1,75,0,2,0,1,12,...,0,0,0,0,0,0,0,0,0,0
3,1,26488.31706,1,1,75,0,2,0,1,12,...,0,0,0,0,0,0,0,0,0,0
4,1,23809.061856,1,1,75,0,2,0,1,12,...,0,0,0,0,0,0,0,0,0,0


As seen above, the first five observations in the 2019 SCF are the 5 implicates for the first household. It's possible to perform regressions on all 5 implicates. However, for the sake of simplicity, this project averages all of the values for each household. This is a method recommended by the Federal Reserve in the [codebook for the SCF](https://www.federalreserve.gov/econres/files/codebk2019.txt). 
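As a minimal sketch of the averaging step, assuming it amounts to a per-household mean (the internals of `dl.clean_SCF_df` are not shown here), the five implicate weights for household 1 from the output above can be collapsed with a `groupby`. Only the three columns below are used, and the name `implicates` is illustrative.

```python
import pandas as pd

# The five implicates of household 1, with weights copied from the output above.
implicates = pd.DataFrame({
    "household_id": [1, 1, 1, 1, 1],
    "weighting": [30598.896539, 23561.874562, 25726.122276,
                  26488.31706, 23809.061856],
    "ref_age": [75, 75, 75, 75, 75],
})

# Collapse the implicates to one row per household by taking the mean.
averaged = implicates.groupby("household_id").mean()
print(averaged)
```

The averaged weight rounds to 26037, which is the value shown for household 1 in the averaged dataframe.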

Below is the dataframe with averaged values along with additional data cleaning and feature engineering performed. 

In [12]:
avg_df = dl.clean_SCF_df(df, neg_vals=True, modeling=False)
avg_df.head()

household_id,weighting,persons_in_PEU,spouse_part_of_PEU,ref_age,spouse_age,ref_sex,spouse_sex,ref_race,ref_educ,spouse_educ,...,bachelor_deg,assoc_deg,hs_deg,educ_bins_s,doctorate_deg_s,master_deg_s,bachelor_deg_s,assoc_deg_s,hs_deg_s,1k_target
1,26037.0,1.0,1.0,75.0,0.0,2.0,0.0,1.0,12.0,0.0,...,1,0,0,0,0,0,0,0,0,1
2,18970.0,5.0,2.0,50.0,39.0,1.0,2.0,1.0,8.0,8.0,...,0,0,0,1,1,0,0,0,0,0
3,20483.0,2.0,2.0,53.0,49.0,1.0,2.0,1.0,8.0,8.0,...,0,0,0,1,1,0,0,0,0,1
4,31785.0,2.0,2.0,29.0,28.0,1.0,2.0,1.0,13.0,14.0,...,0,1,0,5,0,0,0,0,1,1
5,21046.0,2.0,2.0,47.0,39.0,1.0,2.0,1.0,8.0,8.0,...,0,0,0,1,1,0,0,0,0,0


### Feature Engineering



#### *Aggregating variables*

A number of variables needed to be aggregated across survey responses to get a complete picture of certain assets that a household had. For example, there were 7 separate variables that could contain relevant information for how much money a household has in their checking accounts as seen in [Appendix A](#AppA). 

The aggregated variable that requires a little more judgment is the `lqd_assets` variable, which is the basis for the target variable. The purpose of the `lqd_assets` variable is to capture how much in financial assets a household has at its immediate disposal to spend on whatever it wants. As such, it's fair to include not only cash in checking accounts in this variable, but also the cash value of investments or annuities. 

#### *Target variable - `1k_target`*
The target variable for the logistic regression model is a binary indicator of whether a household has at least \$1,000 in liquid assets as defined above. 
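The aggregation and the target can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the three households and their values are made up, only three of the `lqd_assets` components listed in Appendix A are included, and the threshold follows the strict `>` condition given in Appendix A.

```python
import pandas as pd

# Hypothetical households; column names follow Appendix A.
toy = pd.DataFrame(
    {
        "checking_accts_value": [300.0, 1200.0, 0.0],
        "savings_accts_value": [200.0, 5000.0, 1000.0],
        "x3915": [0.0, 10000.0, 0.0],  # market value of stock holdings
    },
    index=pd.Index([1, 2, 3], name="household_id"),
)

# Sum the component columns row by row to get liquid assets.
components = ["checking_accts_value", "savings_accts_value", "x3915"]
toy["lqd_assets"] = toy[components].sum(axis=1)

# Binary target: 1 if liquid assets exceed 1,000, else 0 (per Appendix A).
toy["1k_target"] = (toy["lqd_assets"] > 1000).astype(int)
print(toy[["lqd_assets", "1k_target"]])
```

Under the strict inequality, the third household, with exactly 1,000 in liquid assets, falls in the non-target class.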

#### *Categorical variables*
A number of variables in the dataset were taken as categorical variables and needed to be binned or dummied accordingly. For example, variables related to education needed to be binned as there are large differences between someone who only finished 11th grade and someone who graduated from high school. Others needed to be dummied such as race variables. 
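A rough sketch of both operations, using `pd.cut` for binning and `pd.get_dummies` for dummy columns. The codes, bin edges, and labels below are hypothetical and do not reproduce the SCF codebook or the bins used in `dl.clean_SCF_df`.

```python
import pandas as pd

# Hypothetical ordinal education codes and nominal race codes.
toy = pd.DataFrame({
    "educ_code": [7, 8, 10, 12],
    "race_code": [1, 2, 1, 3],
})

# Bin the ordinal codes into coarser groups (edges are illustrative only).
toy["educ_bin"] = pd.cut(
    toy["educ_code"],
    bins=[0, 7, 9, 14],
    labels=["no_hs", "hs", "college"],
)

# One dummy column per category, dropping the first as the reference category.
race_dummies = pd.get_dummies(toy["race_code"], prefix="race", drop_first=True)
print(toy.join(race_dummies))
```

Dropping the first category keeps the dummies from being perfectly collinear, which is consistent with modeling columns such as `spouse_occ_code_2.0` through `spouse_occ_code_6.0` having no `_1.0` counterpart.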

### Cleaning

#### *Nonsensical values*
Most nonsensical values came from households whose implicates had different values for a categorical variable. As a result, their averaged values became floats instead of the integer codes prescribed in the codebook. 

Some cleaning and judgment calls needed to be made, such as setting any negative float for the race variable equal to the reference category. That judgment was reasonable because so few values were affected.

Most other categorical variables had negative values to refer to inapplicable categories and as such were set to `0`, the reference category.
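The two checks described in this section might look like the sketch below. The averaged codes are hypothetical, and the sketch only flags non-integer averages, since how each one was resolved is not spelled out here.

```python
import pandas as pd

# Hypothetical averaged category codes for three households.
averaged_codes = pd.Series(
    [1.0, 1.2, -1.0],
    index=pd.Index([1, 2, 3], name="household_id"),
    name="occ_code",
)

# Averages that are not whole numbers signal implicates that disagreed.
non_integer = averaged_codes[averaged_codes % 1 != 0]

# Negative codes mark inapplicable categories; map them to 0, the reference.
cleaned = averaged_codes.where(averaged_codes >= 0, 0.0)
print(non_integer)
print(cleaned)
```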

#### *Preparing for modeling*
The final model only needs `1k_target` and the relevant engineered features. The underlying features can be dropped. As such, once all the operations above were performed, the dataframe was finally prepped for modeling. The dataframe can be seen below.

In [13]:
df = dl.clean_SCF_df(df, neg_vals=False, modeling=True)
df.head()

household_id,persons_in_PEU,ref_age,spouse_age,total_income,total_cc_limit,freq_cc_payment,rev_charge_accts,num_fin_inst,LOC_owed_now,ed_loans_owed_now,...,spouse_occ_code_2.0,spouse_occ_code_3.0,spouse_occ_code_4.0,spouse_occ_code_5.0,spouse_occ_code_6.0,primary_home_type_3.0,primary_home_type_4.0,primary_home_type_5.0,income_comparison_2.0,income_comparison_3.0
1,1.0,75.0,0.0,52800.0,15000.0,1.0,0.0,4.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
2,5.0,50.0,39.0,37800.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
3,2.0,53.0,49.0,103000.0,1000.0,3.0,0.0,2.0,0.0,0.0,...,1,0,0,0,0,1,0,0,0,0
4,2.0,29.0,28.0,122000.0,55000.0,1.0,0.0,10.0,0.0,177000.0,...,0,0,0,0,0,1,0,0,0,1
5,2.0,47.0,39.0,29200.0,0.0,0.0,0.0,3.0,0.0,0.0,...,0,0,0,0,0,1,0,0,1,0


[*↑ Back to overview*](#Overview)
***

## **Data Analysis** <a id="Data-Analysis"></a>
[*↑ Back to overview*](#Overview)

### Visualization of Features of the Data

#### Effect of weighting

As previously discussed, one of the most impactful features of the data is the fact that it oversamples on wealthy households and contains weighting to account for that oversampling. Below is a figure that describes how the weighting changes the cumulative distribution of liquid assets.

<center><img src='images/Cumulative Distribution of Liquid Assets
Among U.S. Households (Weighted and Unweighted).png'></center>

When weights are used to account for the fact that the dataset oversamples more well-off households, less well-off households increase their representation in the dataset. The figure shows that at \$1,000 in liquid assets, the share of households that have at most that amount increases from about 20% to about 24%. On the higher end of the wealth spectrum, there is an even greater increase in the share of households that have at most \$100,000 in liquid assets.
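The effect of weighting on a cumulative share can be illustrated with a few lines of NumPy. The five households and their weights are invented for illustration, and `weighted_share_at_most` is a hypothetical helper, not a function from this project.

```python
import numpy as np

def weighted_share_at_most(values, weights, threshold):
    """Share of total weight held by observations with value <= threshold."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights[values <= threshold].sum() / weights.sum()

# Hypothetical sample: wealthy households are oversampled, so each one
# stands in for fewer households in the population (smaller weight).
assets = np.array([200.0, 800.0, 5_000.0, 50_000.0, 2_000_000.0])
weights = np.array([30_000.0, 25_000.0, 20_000.0, 5_000.0, 100.0])

unweighted = weighted_share_at_most(assets, np.ones_like(assets), 1_000)
weighted = weighted_share_at_most(assets, weights, 1_000)
print(f"unweighted share: {unweighted:.3f}, weighted share: {weighted:.3f}")
```

In this toy sample the weighted share at or below 1,000 is larger than the unweighted share, which is the same direction of change as in the figure.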

#### Distribution of `lqd_assets`

Another issue with the data that needs to be dealt with in the modeling is the "class-imbalance" in the sense that the target class has orders of magnitude greater amounts of liquid assets compared to the non-target class, \$15 trillion compared to \$9 billion. This indicates a skewed underlying distribution of `lqd_assets`.

In [20]:
df = pd.read_stata(targetdir + 'scf2019s/p19i6.dta', columns=dl.sel_vars)
df.columns = [x.lower() for x in df.columns]
df.rename(columns=dl.rename_dict, inplace=True)
temp_df = dl.clean_SCF_df(df, neg_vals=False, modeling=False)

temp_df['lqd_assets_w'] = temp_df.lqd_assets * temp_df.weighting

# Total amount of liquid assets outside of target
tot_nontarget = temp_df.groupby('1k_target').lqd_assets_w.sum()[0] # ~9.49 bn weighted
tot_nontarget_str = '{:,.2f}'.format(tot_nontarget)
print('Total liquid assets in non-target class: $', tot_nontarget_str)
tot_target = temp_df.groupby('1k_target').lqd_assets_w.sum()[1] #15 tn weighted
tot_target_str = '{:,.2f}'.format(tot_target)
print('Total liquid assets in target class: $', tot_target_str)

Total liquid assets in non-target class: $ 9,488,564,438.00
Total liquid assets in target class: $ 15,225,776,464,271.00


As one can see below, the distribution of liquid assets is *extremely* skewed. Although the graph appears to show that there are few outliers, the way it is constructed means that there are a large number between the last visible bar and the maximum value of the x-axis.

<center><img src='images/Distribution of Liquid Assets
Among U.S. Households.png'></center>

### Baseline Model

In [22]:
sets = modeling.baseline(targetdir)
sets

{'F1': 0.9079854073773814,
 'Precision': 0.8756841282251759,
 'Recall': 0.9427609427609428,
 'AUC': 0.8891017830706547}

As a reminder, the models produced here are trained to target `1k_target` using the aggregated and non-constructed, non-functional features in [Appendix A](#AppA). The above model's metrics show that a logistic regression on the dataset without much cross-validating, grid searching, or including sample weights still yields a model that balances precision and recall rather well with a preference for recall.

The better recall score compared to precision is because the unweighted data overrepresents well-off households. This means that the model likely overtrained on wealthy households and will tend to predict that households have at least \$1,000 in liquid assets when they in fact do not.

This should imply that using sample weights should improve the model's F1 score, specifically with respect to precision.

### Weighted Model

In [23]:
sets = modeling.weighted(targetdir)
sets

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    4.3s finished


{'F1': 0.915,
 'Precision': 0.905940594059406,
 'Recall': 0.9242424242424242,
 'AUC': 0.8939623210051224}

As predicted, training with sample weights improved the F1, precision, and AUC scores. This model also uses `GridSearchCV` to find the best regularization penalty to help improve scores.

However, the improvement is arguably due mainly to the fact that weighting the samples makes the model train less on wealthy households. As such, it is less prone to produce false positives, which increases precision, as seen here. The decrease in recall is also reasonable: giving more weight to less well-off households leads the model to predict the positive class less often, so it captures fewer true positives. 

It should also be noted that this model's AUC score also improved, meaning that across the decision thresholds for the logistic regression model the model does better compared to baseline.

Although the weights may assist in helping represent more of the less well-off in the training, outliers in the data are also likely a source of error. The next step would be to control for those outliers by scaling the data.
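For readers who want to see the shape of a weighted grid search, here is a minimal sketch on synthetic data. It is not the body of `modeling.weighted`, which is not shown in this notebook. The choice of `C` as the tuned parameter, the eight grid values (picked to match the "8 candidates" in the log), and the `f1` scoring are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the features, target, and SCF sample weights.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
weights = rng.uniform(1.0, 10.0, size=n)

# Eight candidate regularization strengths, evaluated with 5-fold CV.
param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",
    cv=5,
)

# Extra fit parameters are forwarded to the estimator's fit for each fold.
grid.fit(X, y, sample_weight=weights)
print(grid.best_params_, round(grid.best_score_, 3))
```

In this sketch the weights affect only how each candidate is fitted; the cross-validated F1 used to pick the winner is still computed without weights.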

### Weighted and RobustScaled Model

In [26]:
sets = modeling.weightandscale(targetdir)
sets

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    1.2s finished


{'F1': 0.912106135986733,
 'Precision': 0.8986928104575164,
 'Recall': 0.9259259259259259,
 'AUC': 0.884097132151607}

`RobustScaler` was used because, according to scikit-learn's documentation, it is one of the better scalers for data with outliers. However, with its default parameters it yielded a model with a lower F1 score than the weighted-only model.

An investigation into the scaler's default parameters shows that the data is scaled according to the interquartile range, that is, the range between the 25th and 75th percentiles. This means that the scaling uses less information from the tail ends of the variables' distributions. This lack of information is likely what makes the model suffer, so the range will need to be tuned to include more information.
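The difference between the default and a wider range can be seen on a single skewed column. The sketch below uses synthetic lognormal data and scikit-learn's `quantile_range` parameter. That the `iqr` argument of `modeling.weightandscale` maps onto `quantile_range` is an assumption.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic, heavily right-skewed column standing in for a dollar amount.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=9.0, sigma=2.0, size=(5_000, 1))

# Default: center on the median, scale by the 25th-75th percentile range.
default_scaler = RobustScaler().fit(skewed)
# Wider range: scale by the 0.1th-99.9th percentile range instead.
wide_scaler = RobustScaler(quantile_range=(0.1, 99.9)).fit(skewed)

default_max = default_scaler.transform(skewed).max()
wide_max = wide_scaler.transform(skewed).max()
print(f"scale (default): {default_scaler.scale_[0]:,.0f}")
print(f"scale (wide):    {wide_scaler.scale_[0]:,.0f}")
print(f"largest scaled value: default={default_max:,.1f}, wide={wide_max:,.1f}")
```

With the wide range, every value between the 0.1th and 99.9th percentiles lands in an interval of width one, whereas the default leaves the largest values many scale units away from the median.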

### Weighted and Tuned RobustScaled  Model

In [35]:
sets = modeling.weightandscale(targetdir, iqr=(.1,99.9))    
sets

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:   11.0s finished


{'F1': 0.9158723580605057,
 'Precision': 0.9020408163265307,
 'Recall': 0.9301346801346801,
 'AUC': 0.8999102569141479}

As predicted, widening the quantile range for the RobustScaler improved the model and achieved the best F1 score thus far. Compared with the default-scaled model, both precision and recall improved; compared with the weighted-only model, recall improved at a small cost to precision. AUC improved as well.

This final model yields logistic regression coefficients like the ones below, which are the top 5 predictors for a household having more than \$1,000 in liquid assets. This is a surprising result as it indicates that `total_income` isn't as great of an indicator as one would think. This may be due to some unintentional adjustments done in the model and will need to be investigated.

However, the other features seem to make sense. A head of household with a white-collar occupation, a college degree and a high credit card limit is likely to have more liquid assets on hand. The last point makes even more sense given that credit limits are set based on how much a person earns.
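As a quick consistency check, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from the precision and recall printed in the cell outputs above. The helper name `f1_from` is illustrative.

```python
# Precision and recall copied from the four model outputs above.
reported = {
    "baseline": (0.8756841282251759, 0.9427609427609428),
    "weighted": (0.905940594059406, 0.9242424242424242),
    "weighted + RobustScaler (default)": (0.8986928104575164, 0.9259259259259259),
    "weighted + RobustScaler (tuned)": (0.9020408163265307, 0.9301346801346801),
}

def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = {name: f1_from(p, r) for name, (p, r) in reported.items()}
best = max(f1, key=f1.get)

for name, score in f1.items():
    print(f"{name}: F1 = {score:.4f}")
print("best:", best)
```

The recomputed values agree with the reported F1 scores, and the tuned model comes out highest.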

<center><img src='images/Scale of Logistic Regression
 Coefficients.png'></center>
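One way to rank features by coefficient magnitude is sketched below on synthetic data. The feature names and effect sizes are made up; this is not the code that produced the figure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features on the same scale, with known effects on the outcome.
rng = np.random.default_rng(1)
n = 2_000
X = pd.DataFrame({
    "strong_signal": rng.normal(size=n),
    "weak_signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
logits = 2.0 * X["strong_signal"] + 0.5 * X["weak_signal"]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by the absolute size of their fitted coefficients.
coefs = pd.Series(model.coef_[0], index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient magnitudes are only comparable when the features share a scale, so the ranking of a variable such as `total_income` depends on how it was scaled before fitting.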

[*↑ Back to overview*](#Overview)
***

## **Evaluation**<a id="Evaluation"></a>
[*↑ Back to overview*](#Overview)

### Feature Engineering
The features used in this model are a small subset of all the features in the dataset. As such, it may be prudent to re-evaluate the features used to see if there are other predictors that Vanguard may find valuable. Further, the surprise of `total_income` not appearing as one of the most influential features also suggests that there may be some confounding variables either excluded or included.

### Sensitivity to Outliers
The model is likely still sensitive to outliers and needs other tools to deal with them. It may make sense to remove these outliers for the purposes of this model as those with extreme amounts of wealth are not likely to use Vanguard's mutual funds through their retail channels. 
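If outliers were removed, one simple option would be to trim at a high percentile of `lqd_assets`. The sketch below uses synthetic data, and the 99th-percentile cutoff is an arbitrary choice for illustration, not a recommendation derived from this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed liquid assets for 1,000 hypothetical households.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"lqd_assets": rng.lognormal(mean=9.0, sigma=2.0, size=1_000)})

# Keep only households at or below the chosen percentile cutoff.
cutoff = toy["lqd_assets"].quantile(0.99)
trimmed = toy[toy["lqd_assets"] <= cutoff]
print(f"cutoff: {cutoff:,.0f}; kept {len(trimmed)} of {len(toy)} households")
```

If sample weights are used, a weighted percentile would be the more consistent cutoff, since an unweighted percentile is driven by the oversampled wealthy households.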

[*↑ Back to overview*](#Overview)
***

## **Conclusion & Next Steps**<a id="Conclusion"></a>
[*↑ Back to overview*](#Overview)

Given the high F1 score of all the models presented here, one can conclude that the top features presented in each model that predict membership in the target class are likely to be features that Vanguard ought to search for.

In the case of the number of financial institutions, Vanguard may consider adopting services like Plaid that other financial services companies use to understand whether households use other services. Further, the final model presented here also supports any use of credit reports in Vanguard's marketing to households. Lastly, given that having a bachelor's degree is predictive of having enough money to invest with Vanguard, it may also make sense for Vanguard to improve its marketing to those who are just graduating college.

[*↑ Back to overview*](#Overview)
***

## **Endnotes** <a id="Endnotes"></a>
[*↑ Back to overview*](#Overview)


<a id='fn-1'></a> [1.](#fn-1-src) *How America Invests 2020*. 2020. Vanguard. [https://personal.vanguard.com/pdf/how-america-invests-2020.pdf](https://personal.vanguard.com/pdf/how-america-invests-2020.pdf), 28.


<a id='fn-2'></a> [2.](#fn-2-src) U.S. Securities and Exchange Commission. "Mutual Funds". Investor.gov. Date accessed: Dec. 10, 2020. [https://www.investor.gov/introduction-investing/investing-basics/investment-products/mutual-funds-and-exchange-traded-1](https://www.investor.gov/introduction-investing/investing-basics/investment-products/mutual-funds-and-exchange-traded-1).

<a id='fn-3'></a> [3.](#fn-3-src) Vanguard. "Vanguard mutual fund fees & minimums". Mutual Funds. Date accessed: Dec. 10, 2020. [https://investor.vanguard.com/mutual-funds/fees](https://investor.vanguard.com/mutual-funds/fees).

<a id='fn-4'></a> [4.](#fn-4-src) *Changes in U.S. Family Finances from 2016 to 2019*. 2020. Board of Governors of the Federal Reserve System. [https://www.federalreserve.gov/publications/files/scf20.pdf](https://www.federalreserve.gov/publications/files/scf20.pdf), 4.

<a id='fn-5'></a> [5.](#fn-5-src) Id. at 40-42.

[*↑ Back to overview*](#Overview)
***

## **Appendix A: Variables** <a id="AppA"></a>
[*↑ Back to overview*](#Overview)



**Functional Variables**

|SCF Variable| Description |
|------|------|
|`yy1`  |Household ID  |
|`x42001`  |Weighting  |

**Demographic Variables**

|SCF Variable| Description |
|------|------|
|`x7001`  |Number of persons in economic household |
|`x7020`  | Whether spouse is part of household  |
|`x14` | Head of household age|
|`x19` | Spouse age |
|`x8021` | Sex of head of household |
|`x103`| Sex of spouse |
|`x6809`| Race of head of household |
|`x5931`| Education of head of household |
|`x6111` | Education of spouse |
|`x6780` | Whether head of household was looking for work in last year|
|`x6784` | Whether spouse was looking for work in last year |
|`x7402` | Industry code for employment of head of household|
|`x7412` | Industry code for employment of spouse |
|`x7401` | Occupation code for employment of head of household |
|`x7411` | Occupation code for employment of spouse |
|`x501` | Type of home for primary residence |



**Financial Variables - Non-Constructed**

|SCF Variable| Description |
|------|------|
|`x5729`  | Total income in last year|
|`x7650` | Income comparison compared to last year|
|`x5802` | Number of inheritances received |
|`x414` | Total credit card limit |
|`x432` | Frequency of credit card payments | 
|`x7575` | Amount of revolving charge accounts |
|`x8300` | Number of financial institutions used by household |

**Financial Variables (Aggregated) - `LOC_owed_now`**

|SCF Variable| Description |
|------|------|
|`x1108` | Owed amount on line of credit |
|`x1119` | Owed amount on line of credit |
|`x1130` |Owed amount on line of credit |
|`x1136` |Owed amount on line of credit |


**Financial Variables (Aggregated) - `ed_loans_owed_now`**

|SCF Variable| Description |
|------|------|
|`x7824` | Owed amount on education loan |
|`x7847` | Owed amount on education loan |
|`x7870` | Owed amount on education loan |
|`x7924` | Owed amount on education loan |
|`x7947` | Owed amount on education loan |
|`x7970` | Owed amount on education loan |
|`x7179` | Owed amount on education loan |


**Financial Variables (Aggregated) - `cc_newcharges_value`**

|SCF Variable| Description |
|------|------|
|`x412` | New charges to credit card |
|`x420` | New charges to credit card |
|`x426` | New charges to credit card |


**Financial Variables (Aggregated) - `cc_currbal_value`**

|SCF Variable| Description |
|------|------|
|`x413` | Value of credit card current balance |
|`x421` | Value of credit card current balance |
|`x427` | Value of credit card current balance |


**Financial Variables (Aggregated) - `checking_accts_value`**

|SCF Variable| Description |
|------|------|
|`x3506` | Checking account value |
|`x3510` | Checking account value |
|`x3514` | Checking account value |
|`x3518` | Checking account value |
|`x3522` | Checking account value |
|`x3526` | Checking account value|
|`x3765` | Cash in all other checking accounts |

**Financial Variables (Aggregated) - `savings_accts_value`**

|SCF Variable| Description |
|------|------|
|`x3730` | Savings account with type `x3732`|
|`x3736` | Savings account with type `x3738`|
|`x3742` | Savings account with type `x3744`|
|`x3748` | Savings account with type `x3750`|
|`x3754` | Savings account with type `x3756`|
|`x3760` | Savings account with type `x3762`|
|`x3765` | Cash in all other savings accounts |

**Financial Variables (Aggregated) - `lqd_assets`**

|Variable| Description |
|------|------|
|`x6704` | Market value of mutual fund holdings |
|`x6706` | Market value of bond holdings |
|`x3915` | Market value of stock holdings |
|`x6576` | Cash value of annuities |
|`x6587` | Cash value of trust interests |
|`x4006` | Cash value of life insurance |
|`savings_accts_value` | Total cash in savings accounts |
|`checking_accts_value` | Total cash in checking accounts |

**Financial Variables - `1k_target`**

|Value| Condition |
|------|------|
|`1` | If `lqd_assets` > `1000` |
|`0` | If `lqd_assets` <= `1000`|


[*↑ Back to overview*](#Overview)



***