## I. National Financial Capability Study 
### Financial Industry Regulatory Authority (FINRA) Investor Education Foundation, 2018

#### Purpose

In this notebook, we take a closer look at the FINRA dataset. Our main goals were to:
* keep only the columns that relate to our question, that is, columns covering demographic info, financial literacy education and behaviors, home ownership, and debt-to-income indicators
* quantify and record the response rate for each column and determine a cut-off (if applicable)

Doing this lets us compare the number of responses across the variables we are interested in, for example to check that we have comparable amounts of data for the "Home Ownership" and "Debt-to-Income" indicators.

#### Summary/Overview

* We reduced the original data from (27091, 128) to (27091, 65) by keeping only the columns that pertain to the variables we are interested in analyzing
* Technically speaking, our data does not have null or NaN values
* For our purposes, we want to classify "decline to state" (coded as ```99```) and blanks (appearing as an empty string, ```' '```) as **NaN**
* There is another response ("I don't know") that is coded as ```98```. For now, we are leaving that in its own category.

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv("NFCS 2018 State Data 190603.csv")
df

Unnamed: 0,NFCSID,STATEQ,CENSUSDIV,CENSUSREG,A3,A3Ar_w,A3B,A4A_new_w,A5_2015,A6,...,M42,M6,M7,M8,M31,M9,M10,wgt_n2,wgt_d2,wgt_s3
0,2018010001,48,9,4,2,5,11,1,5,4,...,,1,3,98,98,98,1,0.683683,0.519642,1.095189
1,2018010002,10,5,3,2,2,8,1,6,1,...,,1,3,98,3,1,98,0.808358,2.516841,0.922693
2,2018010003,44,7,3,2,2,8,1,6,1,...,,1,1,98,98,1,98,1.021551,1.896192,0.671093
3,2018010004,10,5,3,2,1,7,1,6,2,...,7,98,98,4,4,2,98,0.808358,2.516841,0.922693
4,2018010005,13,8,4,1,2,2,1,6,1,...,,1,3,98,2,1,98,0.448075,0.614733,1.232221
5,2018010006,11,5,3,2,2,8,1,5,1,...,,98,98,98,98,98,98,1.017910,1.329637,0.936126
6,2018010007,11,5,3,1,2,2,1,3,1,...,,1,2,1,1,1,1,1.433500,1.479585,1.020436
7,2018010008,26,4,2,2,1,7,1,2,2,...,,2,2,2,2,1,2,0.445837,1.745104,0.931951
8,2018010009,21,5,3,1,2,2,1,7,1,...,,2,2,3,3,1,2,1.138393,0.799081,0.761420
9,2018010010,10,5,3,2,6,12,1,6,2,...,,1,3,1,4,1,2,0.946931,2.789539,1.028050


In [12]:
df.shape

(27091, 128)

### Technically speaking, there are no null values in our data:

In [4]:
df.isnull().values.any()

False

# Capturing relevant columns only

In [5]:
my_list = df.columns.values.tolist()
#my_list

We might want to consider the following columns at a later date:
* J4
* J33_40
* J42_1
* B2
* C40
* F2_6
* G20
* all of G30
* G22
* G25_2
* H1
* H30_3
* M4

In [6]:
skinny_df = df[['NFCSID',
 'STATEQ',
 'CENSUSDIV',
 'CENSUSREG',
 'A3',
 'A3Ar_w',
 'A3B',
 'A4A_new_w',
 'A5_2015',
 'A6',
 'A7',
 'A7A',
 'A11',
 'A8',
 'AM21',
 'AM30',
 'AM31',
 'AM22',
 'X3',
 'X4',
 'A9',
 'A40',
 'A10',
 'A10A',
 'A21_2015',
 'A22_2015',
 'A14',
 'A41',
 'J1',
 'J2',
 'J3',
 'J5',
 'J20',
 'J32',
 'J41_1',
 'J41_2',
 'C1_2012',
 'C4_2012',
 'C5_2012',
 'D40',
 'EA_1',
 'E7',
 'E8',
 'E20',
 'E15_2015',
 'F2_1',
 'F2_2',
 'F2_3',
 'F2_4',
 'F2_5',
 'G38',
 'M40',
 'M20',
 'M21_1',
 'M21_2_2015',
 'M21_3',
 'M21_4',
 'M41',
 'M42',
 'M6',
 'M9',
 'M10',
 'wgt_n2',
 'wgt_d2',
 'wgt_s3']]
skinny_df

Unnamed: 0,NFCSID,STATEQ,CENSUSDIV,CENSUSREG,A3,A3Ar_w,A3B,A4A_new_w,A5_2015,A6,...,M21_3,M21_4,M41,M42,M6,M9,M10,wgt_n2,wgt_d2,wgt_s3
0,2018010001,48,9,4,2,5,11,1,5,4,...,,,,,1,98,1,0.683683,0.519642,1.095189
1,2018010002,10,5,3,2,2,8,1,6,1,...,,,,,1,1,98,0.808358,2.516841,0.922693
2,2018010003,44,7,3,2,2,8,1,6,1,...,,,,,1,1,98,1.021551,1.896192,0.671093
3,2018010004,10,5,3,2,1,7,1,6,2,...,1,,3,7,98,2,98,0.808358,2.516841,0.922693
4,2018010005,13,8,4,1,2,2,1,6,1,...,,,,,1,1,98,0.448075,0.614733,1.232221
5,2018010006,11,5,3,2,2,8,1,5,1,...,,,,,98,98,98,1.017910,1.329637,0.936126
6,2018010007,11,5,3,1,2,2,1,3,1,...,,,,,1,1,1,1.433500,1.479585,1.020436
7,2018010008,26,4,2,2,1,7,1,2,2,...,,,,,2,1,2,0.445837,1.745104,0.931951
8,2018010009,21,5,3,1,2,2,1,7,1,...,,,,,2,1,2,1.138393,0.799081,0.761420
9,2018010010,10,5,3,2,6,12,1,6,2,...,,,,,1,1,2,0.946931,2.789539,1.028050


### While there are technically no "NaN" values in our data, certain answers such as ```99``` or blanks can be considered non-acceptable responses

These count as non-acceptable responses because they fail to give us a clear answer to the question asked. A ```99``` indicates that the person declined to answer, and a blank means no answer was recorded.

#### Quantifying how many ```99```s there are:

In [6]:
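# The top-level pd.value_counts is deprecated in recent pandas; call the Series method instead
search99 = skinny_df.apply(lambda col: col.value_counts())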
#search99

In [7]:
search99 = search99.loc['99']

In [8]:
search99.sum()

1312.0

#### That's a lot of unanswered questions. How many individuals didn't answer a question?

In [9]:
grab99 = skinny_df[(skinny_df == '99').any(axis=1)]

In [10]:
grab99

Unnamed: 0,NFCSID,STATEQ,CENSUSDIV,CENSUSREG,A3,A3Ar_w,A3B,A4A_new_w,A5_2015,A6,...,M21_3,M21_4,M41,M42,M6,M9,M10,wgt_n2,wgt_d2,wgt_s3
34,2018010035,48,9,4,1,6,6,1,4,1,...,,,,,1,98,98,0.707918,0.514290,1.096894
36,2018010037,33,2,1,2,6,12,1,2,1,...,,,,,1,1,98,2.219286,1.532573,1.096985
40,2018010041,41,5,3,2,3,9,1,2,1,...,,,,,98,98,98,1.026420,0.654446,0.872951
72,2018010073,51,8,4,1,6,6,1,2,1,...,,,,,99,1,98,0.485971,0.225086,1.427754
149,2018010150,48,9,4,1,3,3,1,2,2,...,,,,,3,1,2,0.726409,0.503368,0.949739
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26857,2018036858,12,9,4,1,2,2,2,6,1,...,,,,,99,1,2,1.997203,0.182560,1.590288
26954,2018036955,12,9,4,2,4,10,2,4,2,...,,,,,1,1,2,1.800757,0.255835,1.033814
27006,2018037007,8,5,3,1,3,3,1,6,1,...,1,,3,7,99,1,1,1.006108,0.124663,0.723003
27024,2018037025,8,5,3,1,6,6,1,6,1,...,99,99,2,5,3,1,2,0.980497,0.130570,0.707569


#### 796 individuals declined to answer at least one question
That's about 2.9 percent of the entire sample (796 of 27,091).
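
The row-level count above can be sketched on a small made-up frame rather than the NFCS file. This is only an illustration under one assumption: a code may be stored either as the integer `99` or as the string `'99'`, depending on how pandas parsed each column, so the sketch checks both. The names `toy` and `has_code` are hypothetical.

```python
import pandas as pd

# Toy data standing in for skinny_df (values are invented for illustration).
toy = pd.DataFrame({
    "J3": [1, 99, 2, 3],            # parsed as integers
    "M6": ["1", " ", "99", "2"],    # parsed as strings because of the blank
})

def has_code(frame: pd.DataFrame, code: int) -> pd.Series:
    """True for each row that contains `code` in any column,
    whether it is stored as an int or as its string form."""
    matches = (frame == code) | (frame == str(code))
    return matches.any(axis=1)

rows_with_99 = has_code(toy, 99)
share = rows_with_99.mean() * 100   # percent of rows with at least one 99
print(int(rows_with_99.sum()), round(share, 2))
```

Taking the mean of the boolean mask gives the share of respondents directly, which avoids hard-coding the sample size.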

### There were also individuals who answered with "Don't know" (indicated by ```98```s)

In [11]:
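# The top-level pd.value_counts is deprecated in recent pandas; call the Series method instead
search98 = skinny_df.apply(lambda col: col.value_counts())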
search98 = search98.loc['98']
search98.sum()

4392.0

In [12]:
grab98 = skinny_df[(skinny_df == '98').any(axis=1)]
grab98

Unnamed: 0,NFCSID,STATEQ,CENSUSDIV,CENSUSREG,A3,A3Ar_w,A3B,A4A_new_w,A5_2015,A6,...,M21_3,M21_4,M41,M42,M6,M9,M10,wgt_n2,wgt_d2,wgt_s3
3,2018010004,10,5,3,2,1,7,1,6,2,...,1,,3,7,98,2,98,0.808358,2.516841,0.922693
5,2018010006,11,5,3,2,2,8,1,5,1,...,,,,,98,98,98,1.017910,1.329637,0.936126
6,2018010007,11,5,3,1,2,2,1,3,1,...,,,,,1,1,1,1.433500,1.479585,1.020436
7,2018010008,26,4,2,2,1,7,1,2,2,...,,,,,2,1,2,0.445837,1.745104,0.931951
17,2018010018,24,4,2,2,5,11,1,2,1,...,,,,,98,1,98,0.522265,1.833441,1.007209
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27058,2018037059,51,8,4,2,3,9,1,6,1,...,,,,,98,99,99,0.320832,0.139948,0.378946
27066,2018037067,51,8,4,2,6,12,1,4,1,...,,,,,98,1,1,0.469334,0.201007,1.144149
27067,2018037068,51,8,4,2,3,9,2,6,1,...,,,,,1,2,2,0.620629,0.170350,0.504043
27076,2018037077,51,8,4,2,3,9,1,4,2,...,,,,,1,2,98,0.404002,0.183442,0.808332


### 4392 answers of "I don't know" and 2921 individuals with at least one "IDK"
This is quite a bit larger than the number of declined answers: roughly 10.8 percent of the participants (2,921 of 27,091) answered "I don't know" to at least one question.

------------------------

### To better capture the response rate for each column, we are defining ```NaN``` to be a code (i.e., ```99```) or empty string that constitutes a non-acceptable response.

#### Note: we may consider adding "I don't know/IDK" (i.e, ```98```) to this bucket at a later time but for now, it is its own category.

#### Game Plan:
* replace empty strings and code ```99``` with ```NaN```
* compute 1 - (NaN count / number of rows) to obtain the response rate for each column
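
The two steps of the game plan can be sketched end to end on invented data. This is a minimal sketch, not the notebook's actual cells: the frame `toy` is made up, and replacing both `99` and `'99'` is an assumption to cover columns parsed as integers as well as columns parsed as text.

```python
import numpy as np
import pandas as pd

# Invented stand-in for skinny_df: one text column with blanks, one integer column.
toy = pd.DataFrame({
    "A10": ["1", " ", "2", "99", " "],
    "J3": [1, 2, 99, 3, 1],
})

# Step 1: empty or whitespace-only strings become NaN.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
# Step 2: the "declined" code becomes NaN, in integer and string form.
cleaned = cleaned.replace([99, "99"], np.nan)

# Step 3: response rate = 1 - (NaN count / number of rows), as a percentage.
response_rate = (1 - cleaned.isna().mean()) * 100
print(response_rate.round(2))
```

Because `isna().mean()` works column by column, the same three lines give a response rate for every column at once.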

In [7]:
# Please scroll down to A10:
# "Which of the following best describes your [spouse's/partner's] current employment or work status?"
for columnName, columnData in skinny_df.items():  # DataFrame.iteritems() was removed in pandas 2.0
    print(columnName)
    print(columnData.value_counts())
    print()

NFCSID
2018011123    1
2018014596    1
2018012487    1
2018035026    1
2018037075    1
2018022744    1
2018024793    1
2018018650    1
2018020699    1
2018030940    1
2018032989    1
2018026846    1
2018028895    1
2018014564    1
2018016613    1
2018010470    1
2018012519    1
2018035058    1
2018022776    1
2018024825    1
2018018682    1
2018020731    1
2018030972    1
2018033021    1
2018026878    1
2018010438    1
2018016581    1
2018014532    1
2018016549    1
2018037011    1
             ..
2018036492    1
2018034445    1
2018011928    1
2018026209    1
2018013915    1
2018026145    1
2018013883    1
2018032290    1
2018030243    1
2018020004    1
2018017957    1
2018024102    1
2018022055    1
2018036396    1
2018034349    1
2018011832    1
2018015930    1
2018028224    1
2018015962    1
2018026177    1
2018032322    1
2018030275    1
2018020036    1
2018017989    1
2018024134    1
2018022087    1
2018036428    1
2018034381    1
2018011864    1
2018011149    1
Name: NFCSID, Len

#### Isolating first case of "non-acceptable" blank response as defined above:

In [14]:
(skinny_df['A10'] == ' ').sum()

10421

In [9]:
response_rate = 1 - (float((skinny_df['A10'] == ' ').sum()/skinny_df['A10'].count()))
response_rate_percent = round(response_rate*100, 2)
print('response rate: ' + str(response_rate_percent)+'%')

response rate: 61.53%


#### Another interesting case appears with C5_2012 (counting both 99 and blanks)
Do you [or your spouse/partner] regularly contribute to a retirement account like a [Thrift Savings Plan (TSP),] 401(k) or IRA? [2012 base]

In [10]:
response_rate = 1 - (float(((skinny_df['C5_2012'] == '99').sum()+(skinny_df['C5_2012'] == ' ').sum())/skinny_df['C5_2012'].count()))
response_rate_percent = round(response_rate*100, 2)
print('response rate: ' + str(response_rate_percent)+'%')

response rate: 50.86%
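
The two single-column calculations above follow the same pattern, so they could be folded into one helper. The sketch below is hypothetical (`response_rate_percent` and `toy_c5` are not part of the notebook) and runs on invented values. It divides by the total number of rows, which matches `count()` only as long as the column has no real NaN values yet.

```python
import pandas as pd

def response_rate_percent(column: pd.Series, non_responses=(" ", "99", 99)) -> float:
    """Percentage of rows whose value is neither NaN nor listed in `non_responses`."""
    is_missing = column.isna()
    for value in non_responses:
        is_missing = is_missing | (column == value)
    return float(round((1 - is_missing.mean()) * 100, 2))

# Invented answers: two blanks and one declined answer out of eight rows.
toy_c5 = pd.Series(["1", "2", " ", "99", " ", "1", "98", "2"])
print("response rate: " + str(response_rate_percent(toy_c5)) + "%")
```

Passing a different tuple, for example one that also contains `"98"`, would treat "don't know" answers as non-responses without changing the function.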


#### Attempt at converting blanks to NA

In [17]:
skinny_df_copy = skinny_df.copy()

In [18]:
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
skinny_df_copy = skinny_df_copy.replace(r'^\s*$', np.nan, regex=True)

In [19]:
skinny_df_copy.head()

Unnamed: 0,NFCSID,STATEQ,CENSUSDIV,CENSUSREG,A3,A3Ar_w,A3B,A4A_new_w,A5_2015,A6,...,M21_3,M21_4,M41,M42,M6,M9,M10,wgt_n2,wgt_d2,wgt_s3
0,2018010001,48,9,4,2,5,11,1,5,4,...,,,,,1,98,1,0.683683,0.519642,1.095189
1,2018010002,10,5,3,2,2,8,1,6,1,...,,,,,1,1,98,0.808358,2.516841,0.922693
2,2018010003,44,7,3,2,2,8,1,6,1,...,,,,,1,1,98,1.021551,1.896192,0.671093
3,2018010004,10,5,3,2,1,7,1,6,2,...,1.0,,3.0,7.0,98,2,98,0.808358,2.516841,0.922693
4,2018010005,13,8,4,1,2,2,1,6,1,...,,,,,1,1,98,0.448075,0.614733,1.232221


Seems to have worked

## Trying the search method for empty responses

In [21]:
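# The top-level pd.value_counts is deprecated in recent pandas; call the Series method instead
searchE = skinny_df.apply(lambda col: col.value_counts())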
searchE = searchE.loc[' ']
searchE.sum()

374304.0

In [22]:
grabE = skinny_df[(skinny_df == ' ').any(axis=1)]
grabE

Unnamed: 0,NFCSID,STATEQ,CENSUSDIV,CENSUSREG,A3,A3Ar_w,A3B,A4A_new_w,A5_2015,A6,...,M21_3,M21_4,M41,M42,M6,M9,M10,wgt_n2,wgt_d2,wgt_s3
0,2018010001,48,9,4,2,5,11,1,5,4,...,,,,,1,98,1,0.683683,0.519642,1.095189
1,2018010002,10,5,3,2,2,8,1,6,1,...,,,,,1,1,98,0.808358,2.516841,0.922693
2,2018010003,44,7,3,2,2,8,1,6,1,...,,,,,1,1,98,1.021551,1.896192,0.671093
3,2018010004,10,5,3,2,1,7,1,6,2,...,1,,3,7,98,2,98,0.808358,2.516841,0.922693
4,2018010005,13,8,4,1,2,2,1,6,1,...,,,,,1,1,98,0.448075,0.614733,1.232221
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27086,2018037087,51,8,4,2,3,9,1,2,1,...,,,,,1,1,2,0.404002,0.183442,0.808332
27087,2018037088,20,1,1,2,2,8,2,2,1,...,,,,,1,98,98,0.475130,0.737766,1.116158
27088,2018037089,20,1,1,1,2,2,2,6,2,...,,,,,1,98,98,0.531368,0.713412,2.392632
27089,2018037090,20,1,1,2,1,7,2,6,2,...,,,,,1,1,2,0.377318,0.546359,0.965486


Quite a few individuals left at least one question empty. That is to be expected, since some questions, such as the military-related ones, were most likely skipped by a majority of respondents.

# Setting the "Do not want to respond" ( ```99```) responses to NaNs

In [23]:
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
skinny_df_copy = skinny_df_copy.replace(99, np.nan)

Column J3 had 178 '99' responses and no original NaN values. Let's see if the code above worked.

In [24]:
skinny_df_copy["J3"]

0        3.0
1        3.0
2        2.0
3        2.0
4        3.0
        ... 
27086    2.0
27087    NaN
27088    2.0
27089    1.0
27090    1.0
Name: J3, Length: 27091, dtype: float64

In [25]:
skinny_df_copy["J3"].isna().sum()

178

Code worked; counts match
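
One caveat may be worth checking on other columns: `replace(99, ...)` matches the integer `99`. If a column was parsed as text (for example because it contained blanks), its declined answers would be the string `'99'` and could be left untouched. Whether this applies to the NFCS columns depends on their dtypes, which are not shown here, so the sketch below uses an invented column and simply illustrates the difference.

```python
import numpy as np
import pandas as pd

# Invented text column: one blank, one declined answer, one "don't know".
text_col = pd.Series(["1", " ", "99", "2", "98"], name="toy_column")

# Replacing the integer 99 does not touch the string "99".
untouched = text_col.replace(99, np.nan)

# Converting to numeric first: unparseable blanks become NaN, "99" becomes 99.0.
numeric = pd.to_numeric(text_col, errors="coerce")
recoded = numeric.replace(99, np.nan)

print(untouched.isna().sum())  # NaN count when only the integer 99 is replaced
print(recoded.isna().sum())    # NaN count after converting to numeric first
```

Checking `skinny_df_copy.dtypes` before recoding would show which columns are stored as `object` and might need this conversion.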

# Response Rate
Now let's try to develop a response rate for every column in the dataframe!

In [26]:
skinny_df_copy.columns

Index(['NFCSID', 'STATEQ', 'CENSUSDIV', 'CENSUSREG', 'A3', 'A3Ar_w', 'A3B',
       'A4A_new_w', 'A5_2015', 'A6', 'A7', 'A7A', 'A11', 'A8', 'AM21', 'AM30',
       'AM31', 'AM22', 'X3', 'X4', 'A9', 'A40', 'A10', 'A10A', 'A21_2015',
       'A22_2015', 'A14', 'A41', 'J1', 'J2', 'J3', 'J5', 'J20', 'J32', 'J41_1',
       'J41_2', 'C1_2012', 'C4_2012', 'C5_2012', 'D40', 'EA_1', 'E7', 'E8',
       'E20', 'E15_2015', 'F2_1', 'F2_2', 'F2_3', 'F2_4', 'F2_5', 'G38', 'M40',
       'M20', 'M21_1', 'M21_2_2015', 'M21_3', 'M21_4', 'M41', 'M42', 'M6',
       'M9', 'M10', 'wgt_n2', 'wgt_d2', 'wgt_s3'],
      dtype='object')

In [32]:
Percentage_Diff = {}

In [33]:
n_rows = len(skinny_df_copy)  # 27091 respondents

for column in skinny_df_copy.columns:
    print(column)
    # Fraction of rows in this column that are NaN (blank or code 99)
    missing_fraction = skinny_df_copy[column].isna().sum() / n_rows
    response_rate_percent = round(100 - (missing_fraction * 100), 2)

    print('response rate: ' + str(response_rate_percent) + '%')
    print("")

    # Store the NaN-only rate; the next loop subtracts the NaN + IDK rate from it
    Percentage_Diff[column] = response_rate_percent

NFCSID
response rate: 100.0%

STATEQ
response rate: 100.0%

CENSUSDIV
response rate: 100.0%

CENSUSREG
response rate: 100.0%

A3
response rate: 100.0%

A3Ar_w
response rate: 100.0%

A3B
response rate: 100.0%

A4A_new_w
response rate: 100.0%

A5_2015
response rate: 100.0%

A6
response rate: 100.0%

A7
response rate: 100.0%

A7A
response rate: 100.0%

A11
response rate: 100.0%

A8
response rate: 100.0%

AM21
response rate: 98.53%

AM30
response rate: 11.28%

AM31
response rate: 11.28%

AM22
response rate: 53.36%

X3
response rate: 100.0%

X4
response rate: 3.46%

A9
response rate: 100.0%

A40
response rate: 99.36%

A10
response rate: 61.53%

A10A
response rate: 100.0%

A21_2015
response rate: 59.98%

A22_2015
response rate: 7.45%

A14
response rate: 61.53%

A41
response rate: 99.7%

J1
response rate: 99.25%

J2
response rate: 99.32%

J3
response rate: 99.34%

J5
response rate: 98.58%

J20
response rate: 99.23%

J32
response rate: 99.27%

J41_1
response rate: 99.15%

J41_2
response rate: 

## Let's find the response rate with "IDK" values (```98```) also counted as non-responses

In [34]:
less_than_50 = {}
n_rows = len(skinny_df_copy)  # 27091 respondents

for column in skinny_df_copy.columns:
    print(column)
    # Count NaN plus "don't know" answers; note that == '98' only matches values stored as strings
    non_responses = skinny_df_copy[column].isna().sum() + (skinny_df_copy[column] == '98').sum()
    response_rate_percent = round(100 - (non_responses / n_rows * 100), 2)

    print('response rate: ' + str(response_rate_percent) + '%')
    print("")

    # Difference between the NaN-only rate (stored earlier) and the NaN + IDK rate
    Percentage_Diff[column] = Percentage_Diff[column] - response_rate_percent
    if response_rate_percent <= 50:
        less_than_50[column] = response_rate_percent

NFCSID
response rate: 100.0%

STATEQ
response rate: 100.0%

CENSUSDIV
response rate: 100.0%

CENSUSREG
response rate: 100.0%

A3
response rate: 100.0%

A3Ar_w
response rate: 100.0%

A3B
response rate: 100.0%

A4A_new_w
response rate: 100.0%

A5_2015
response rate: 100.0%

A6
response rate: 100.0%

A7
response rate: 100.0%

A7A
response rate: 100.0%

A11
response rate: 100.0%

A8
response rate: 100.0%

AM21
response rate: 98.53%

AM30
response rate: 11.28%

AM31
response rate: 11.18%

AM22
response rate: 53.36%

X3
response rate: 100.0%

X4
response rate: 3.46%

A9
response rate: 100.0%

A40
response rate: 99.36%

A10
response rate: 61.53%

A10A
response rate: 100.0%

A21_2015
response rate: 59.75%

A22_2015
response rate: 7.4%

A14
response rate: 60.15%

A41
response rate: 99.7%

J1
response rate: 99.25%

J2
response rate: 99.32%

J3
response rate: 99.34%

J5
response rate: 98.58%

J20
response rate: 99.23%

J32
response rate: 99.27%

J41_1
response rate: 99.15%

J41_2
response rate: 9

## Let's find the difference between the two percentages

I went back and retroactively added a dictionary to the code so we can see the difference between the response rate that counts only NaN values as non-responses and the response rate that counts both NaN and IDK values.

In [35]:
Percentage_Diff

{'NFCSID': 0.0,
 'STATEQ': 0.0,
 'CENSUSDIV': 0.0,
 'CENSUSREG': 0.0,
 'A3': 0.0,
 'A3Ar_w': 0.0,
 'A3B': 0.0,
 'A4A_new_w': 0.0,
 'A5_2015': 0.0,
 'A6': 0.0,
 'A7': 0.0,
 'A7A': 0.0,
 'A11': 0.0,
 'A8': 0.0,
 'AM21': 0.0,
 'AM30': 0.0,
 'AM31': 0.09999999999999964,
 'AM22': 0.0,
 'X3': 0.0,
 'X4': 0.0,
 'A9': 0.0,
 'A40': 0.0,
 'A10': 0.0,
 'A10A': 0.0,
 'A21_2015': 0.22999999999999687,
 'A22_2015': 0.04999999999999982,
 'A14': 1.3800000000000026,
 'A41': 0.0,
 'J1': 0.0,
 'J2': 0.0,
 'J3': 0.0,
 'J5': 0.0,
 'J20': 0.0,
 'J32': 0.0,
 'J41_1': 0.0,
 'J41_2': 0.0,
 'C1_2012': 0.0,
 'C4_2012': 0.0,
 'C5_2012': 0.8699999999999974,
 'D40': 0.0,
 'EA_1': 0.0,
 'E7': 0.3200000000000003,
 'E8': 1.2199999999999989,
 'E20': 2.039999999999999,
 'E15_2015': 0.4100000000000037,
 'F2_1': 1.0,
 'F2_2': 1.230000000000004,
 'F2_3': 1.2600000000000051,
 'F2_4': 1.3799999999999955,
 'F2_5': 1.2800000000000011,
 'G38': 0.0,
 'M40': 0.0,
 'M20': 0.0,
 'M21_1': 0.4299999999999997,
 'M21_2_2015': 0.22000000

As we can see, the IDK responses don't really affect the response rates of any of the columns. Roughly 13 columns had a difference greater than 1 percentage point, and even then it didn't exceed 3 percentage points.
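
As an alternative to updating a dictionary inside two loops, the same comparison could be done with two pandas Series. This is only a sketch on invented data (the frame `toy` and its values are made up, and `98` is assumed to be stored numerically here). Rounding the difference at the end also avoids floating-point tails such as `0.09999999999999964`.

```python
import pandas as pd

# Invented data: None becomes NaN, 98 stands for "don't know".
toy = pd.DataFrame({
    "E20": [1.0, 98.0, None, 2.0, 98.0],
    "J3": [1.0, 2.0, None, 3.0, 1.0],
})

# Response rate treating only NaN as a non-response.
rate_nan = (1 - toy.isna().mean()) * 100
# Response rate treating NaN and the "don't know" code as non-responses.
rate_nan_idk = (1 - (toy.isna() | (toy == 98)).mean()) * 100

# Percentage-point drop caused by also excluding "don't know" answers.
diff = (rate_nan - rate_nan_idk).round(2)
print(diff)
print(diff[diff > 1].index.tolist())  # columns that move by more than 1 point
```

Keeping the two rates as separate Series also means either one can be re-inspected later without re-running the loops in order.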

# Column Analysis

In [36]:
less_than_50

{'AM30': 11.28,
 'AM31': 11.18,
 'X4': 3.46,
 'A22_2015': 7.4,
 'E20': 35.19,
 'E15_2015': 34.11,
 'M21_1': 20.22,
 'M21_2_2015': 17.08,
 'M21_3': 20.29,
 'M21_4': 3.86,
 'M41': 18.64,
 'M42': 20.32}

12 columns had a response rate that was less than 50%. 

 * AM30     -military
 * AM31     -military
 * X4       -military
 * A22_2015 -Attending school or not
 * E20      -Do you currently owe more on your home than you think you could sell it for today?
 * E15_2015 -How many times have you been late with your mortgage payments in the past 12 months?
 * M21_1    -All of the M21 questions ask when they received financial education
 * M21_2_2015
 * M21_3
 * M21_4
 * M41      -Hours of financial education
 * M42      -Quality of financial education
 
Three involve the military, one involves schooling (asking if they're currently attending), two involve homeownership, and six involve financial education.
The low rates on the financial education questions are a bit concerning, but looking at the data it seems like roughly 4000 people took some sort of financial education, and M21 just clarifies when they took it.

We might consider getting rid of A22_2015, since other questions already ask about education (A21_2015, A41).

We can discuss whether or not to use the military data. It seems fairly small so it might be best to just eliminate those columns

The mortgage questions, although their response rates are not high, are also not that small. A 35% response rate still comes out to roughly 9,500 responses (0.35 × 27,091), so I think we can still use them. It depends on how we decide to use the data.
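
To judge whether a low-rate column is still usable, it can help to turn response rates back into approximate respondent counts. The sketch below multiplies a few rates copied from the `less_than_50` output by the 27,091 rows reported by `df.shape`. The rates were already rounded to two decimals, so the counts are estimates, and the variable names are illustrative.

```python
n_respondents = 27091  # number of rows reported by df.shape

# Response rates (percent) copied from the less_than_50 output above.
rates = {"E20": 35.19, "E15_2015": 34.11, "M21_4": 3.86, "X4": 3.46}

# Approximate number of usable answers per column.
approx_counts = {col: round(n_respondents * pct / 100) for col, pct in rates.items()}

for col, count in approx_counts.items():
    print(col, count)
```

Comparing these counts, rather than the percentages alone, makes it easier to decide which columns have enough answers for the analysis we have in mind.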