# Parse 2018 Data

Per the data dictionary, household, family, and person records are all stored in the same file, so they need to be split out before they can be imported into the main notebook.

[Useful link on how to use pandas to parse fixed-width files.](https://towardsdatascience.com/parsing-fixed-width-text-files-with-pandas-f1db8f737276)

Records are organized by:

- Household: 92,139 records, 1,076 characters each - **identified by first value = 1**
- Family: 79,236 records, 1,076 characters each - **identified by first value = 2**
- Person: 180,084 records, 1,076 characters each - **identified by first value = 3**
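Since the first character of each line carries the record type, a quick tally is a cheap way to check a file against the counts above. This is a minimal sketch on a toy in-memory sample, not the real file; the shortened lines and the helper name `count_record_types` are made up for illustration.

```python
import io
from collections import Counter

# Toy stand-in for the .dat file. Real records are 1,076 characters wide;
# these lines are shortened and their contents are invented.
sample = io.StringIO(
    "100001HOUSEHOLD\n"
    "200001FAMILY\n"
    "300001PERSON-A\n"
    "300001PERSON-B\n"
)

RECORD_TYPES = {"1": "household", "2": "family", "3": "person"}


def count_record_types(lines):
    """Tally lines by the record-type flag in their first character."""
    return Counter(RECORD_TYPES.get(line[:1], "unknown") for line in lines)


counts = count_record_types(sample)
print(counts)
```

Passing an open file handle for the real data file instead of `sample` should work the same way, since the function only iterates over lines.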


In [86]:
# grab the imports needed for the project
import pandas as pd
import glob

In [11]:
data_path = '~/Documents/CNM/DataScience/'
file_name = 'asec2018_pubuse.dat'
full_file_name = data_path + file_name

In [7]:
fseq_col = ['FH_SEQ']  # Joins to household data through H_SEQ
year_col = ['DATA_YEAR']
person_cols = ['OCCUP','A_MJOCC','A_DTOCC','AGE1','A_SEX','PRDTRACE','PXRACE1','PRCITSHP',
               'A_HGA','PRERELG', 'A_GRSWK', 'HRCHECK','HRSWK','PEARNVAL','A_CLSWKR','WEIND',
               'A_MARITL','A_HSCOL','A_WKSTAT','HEA','PEINUSYR']

household_cols = ['GTMETSTA','GEDIV','GESTFIPS','HHINC','H_TENURE','H_LIVQRT']

family_cols = ['FKINDEX', 'FINC_FR','FINC_SE','FINC_WS','FINC_ANN','FINC_CSP','FINC_DIS','FINC_DIV','FINC_RNT','FINC_DST','FINC_ED','FINC_SS','FINC_SSI',
               'FINC_FIN','FINC_SUR','FINC_INT','FINC_UC','FINC_OI','FINC_VET','FINC_PAW','FINC_WC','FINC_PEN']
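The note on `FH_SEQ` says family rows join to household rows through `H_SEQ`. A minimal sketch of that join on toy frames follows; the values are invented and only the key column names come from this notebook.

```python
import pandas as pd

# Toy frames with made-up values; only the key columns mirror the notebook.
households = pd.DataFrame({"H_SEQ": [1, 2], "HHINC": [10, 20]})
families = pd.DataFrame({"FH_SEQ": [1, 1, 2], "FKINDEX": [1, 2, 1]})

# A left join keeps every family row and attaches its household's columns.
joined = families.merge(households, how="left", left_on="FH_SEQ", right_on="H_SEQ")
print(joined)
```

A left join is used here so that a family row with no matching household would still show up (with missing household values) rather than silently disappear.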

# 2018

## Household Record - 2018

In [107]:
# Parse columns based on specification
# (start, end) tuples for each column: 0-based, end exclusive,
# listed in the same order as the names in all_hh_cols
hh_specs = [(0,1),(343,358),(319,324),(1,6),(52,53),(328,329),(41,43),(271,273),(34,35),(30,32)]

# Household Columns
all_hh_cols = ['REC_TYPE','H_IDNUM1','H_IDNUM2','H_SEQ'] + household_cols
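`read_fwf` treats each `colspecs` tuple as a half-open interval, so positions from a layout document usually need converting. The sketch below assumes the layout lists a 1-based start position and a field length; the helper `to_colspec` and the two-field sample layout are hypothetical.

```python
import io
import pandas as pd


def to_colspec(start, length):
    """Convert a 1-based start position and a field length to a half-open (from, to) tuple."""
    return (start - 1, start - 1 + length)


# Hypothetical layout: record type at position 1 (length 1),
# sequence number at position 2 (length 5).
specs = [to_colspec(1, 1), to_colspec(2, 5)]

sample = io.StringIO("100042\n100043\n")
df = pd.read_fwf(sample, colspecs=specs, names=["REC_TYPE", "H_SEQ"])
print(specs)
print(df)
```

Note that the numeric columns come back as integers, so any leading zeros in the raw text are dropped.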

In [108]:
# Run command to pull data into a dataframe
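# Read the two ID parts as text so leading zeros (if any) survive and no
# float formatting ends up in the concatenated H_IDNUM
hh_data = pd.read_fwf(full_file_name, skiprows=0,
                      skipfooter=0, colspecs=hh_specs, names=all_hh_cols,
                      dtype={'H_IDNUM1': str, 'H_IDNUM2': str})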

In [109]:
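# Post processing
# Keep only household rows (record type 1)
hh_data_only = hh_data[hh_data['REC_TYPE']==1].copy()
# Combine the two ID parts into a single H_IDNUM, then drop the parts
hh_data_only['H_IDNUM'] = hh_data_only['H_IDNUM1'].map(str) + hh_data_only['H_IDNUM2'].map(str)
hh_data_only.drop(['H_IDNUM1', 'H_IDNUM2'], axis=1, inplace=True)
# Write the household records out for the main notebook
hh_data_only.to_csv(data_path + 'hhpub18.csv')
# hh_data_only.shape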
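One thing to watch when building an ID by concatenating two parsed columns: if the parts are read as numbers, leading zeros are lost before they are joined. The sketch below shows the difference on a toy sample. The column names `ID1` and `ID2` and the sample values are made up, and whether the real ID fields contain leading zeros is an assumption to check against the data dictionary.

```python
import io
import pandas as pd

# Toy records: 1-character record type, 5-character ID part, 3-character ID part.
raw = "100123045\n100999001\n"
specs = [(0, 1), (1, 6), (6, 9)]
names = ["REC_TYPE", "ID1", "ID2"]

# Default parsing infers integers, which drops leading zeros.
as_numbers = pd.read_fwf(io.StringIO(raw), colspecs=specs, names=names)

# Forcing the ID parts to text keeps them exactly as they appear in the file.
as_text = pd.read_fwf(io.StringIO(raw), colspecs=specs, names=names,
                      dtype={"ID1": str, "ID2": str})

id_from_numbers = (as_numbers["ID1"].map(str) + as_numbers["ID2"].map(str)).tolist()
id_from_text = (as_text["ID1"] + as_text["ID2"]).tolist()

print(id_from_numbers)
print(id_from_text)
```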

## Family Record - 2018

In [None]:
# Parse columns based on specification
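# tuples for start and end positions of columns
# TODO: these positions are identical to hh_specs; replace them with the
# family-record positions from the data dictionary, one tuple per name in all_ff_cols
ff_specs = [(0,1),(343,358),(319,324),(1,6),(52,53),(328,329),(41,43),(271,273),(34,35),(30,32)]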

# Family Columns
all_ff_cols = ['REC_TYPE','FH_SEQ'] + family_cols

## Person Record - 2018