# Parse 2018 Data

Per the data dictionary, household, family, and person records are all stored in the same file, so they need to be split out before they can be imported into the main notebook.

[Useful link on how to use Pandas to parse fixed-width files.](https://towardsdatascience.com/parsing-fixed-width-text-files-with-pandas-f1db8f737276)

The file holds three record types, each 1,076 characters long:

- Household: 92,139 records, **identified by first value = 1**
- Family: 79,236 records, **identified by first value = 2**
- Person: 180,084 records, **identified by first value = 3**
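
Because the record type is the first character of every line, a quick tally of that character is a cheap sanity check against the counts above. The snippet below is a minimal sketch that uses a made-up four-line sample in place of the real `.dat` file; the sample lines are hypothetical and much shorter than real records.

```python
import io
from collections import Counter

# Hypothetical miniature file: the first character of each line is the record type.
sample = io.StringIO(
    "1000010001\n"
    "2000010101\n"
    "3000010101\n"
    "3000010102\n"
)

# Tally lines by their leading record-type character, skipping blank lines.
counts = Counter(line[0] for line in sample if line.strip())
print(counts)
```

Pointing the same loop at the real file (opened with `open(...)`) should give one count per record type that can be compared with the data dictionary.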


In [29]:
# grab the imports needed for the project
import pandas as pd
import glob

In [69]:
data_path = '~/Documents/CNM/DataScience/'
# file_name = 'asec2014_pubuse_tax_fix_5x8_2017.dat'
# file_name = 'asec2014_pubuse_3x8_rerun_v2.dat'
file_name = 'asec2012_pubuse.dat'
full_file_name = data_path + file_name
hh_rec_type = 1
ff_rec_type = 2
pp_rec_type = 3

## Household Record
### 2015-2018

Column| Spec (start:length, 1-based)| Code (0-based slice)
:---|:---|:---|
H_IDNUM1| 344:15| 343-358
H_IDNUM2| 320:5| 319-324
H_SEQ| 2:5| 1-6
GTMETSTA| 53:1| 52-53
GEDIV| 329:1| 328-329
GESTFIPS| 42:2| 41-43 
HHINC| 272:2| 271-273
H_TENURE| 35:1| 34-35
H_LIVQRT| 31:2| 30-32
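
The Spec column in the table gives a 1-based start position and a length, while the Code column is the 0-based, end-exclusive range that `pd.read_fwf` expects in `colspecs`. The conversion can be sketched as below; `spec_to_colspec` is a hypothetical helper name, not part of the notebook or of pandas.

```python
from typing import Tuple


def spec_to_colspec(spec: str) -> Tuple[int, int]:
    """Convert a 1-based 'start:length' spec into a 0-based, half-open (start, end) tuple."""
    start, length = (int(part) for part in spec.split(":"))
    return (start - 1, start - 1 + length)


# H_IDNUM1 is listed as 344:15 in the table above.
print(spec_to_colspec("344:15"))
```

Generating the tuples this way, rather than typing them by hand, keeps the `colspecs` list in step with the data dictionary.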

### 2014 and earlier

*Some column names changed but will still map to the same name as later years.*

Column| Spec (start:length, 1-based)| Code (0-based slice)
:---|:---|:---|
H-IDNUM1| 344:15| 343-358
H-IDNUM2| 320:5| 319-324
H-SEQ| 2:5| 1-6
GTMETSTA| 53:1| 52-53
GEDIV| 329:1| 328-329
GESTFIPS| 42:2| 41-43 
HHINC| 272:2| 271-273
H-TENURE| 35:1| 34-35
H-LIVQRT| 31:2| 30-32
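
Since the older names differ from the newer ones only by using dashes instead of underscores, the mapping can be done mechanically. This is a small sketch; the `legacy_names` list is just a subset of the table above chosen for illustration.

```python
# Dash-style names as they appear in the 2014-and-earlier data dictionary.
legacy_names = ["H-IDNUM1", "H-IDNUM2", "H-SEQ", "GTMETSTA", "H-TENURE", "H-LIVQRT"]

# Replace dashes with underscores so every year shares one set of column names.
normalized = [name.replace("-", "_") for name in legacy_names]
print(normalized)
```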


In [70]:
household_cols = ['GTMETSTA','GEDIV','GESTFIPS','HHINC','H_TENURE','H_LIVQRT']

# tuples for start and end positions of columns
hh_specs = [(0,1),(343,358),(319,324),(1,6),(52,53),(328,329),(41,43),(271,273),(34,35),(30,32)]

# Household Columns
all_hh_cols = ['REC_TYPE','H_IDNUM1','H_IDNUM2','H_SEQ'] + household_cols

In [71]:
# Run command to pull data into a dataframe
hh_data = pd.read_fwf(full_file_name, skiprows=0, 
                      skipfooter=0, colspecs=hh_specs, names=all_hh_cols)

In [72]:
# Post processing
hh_data_only = hh_data[hh_data['REC_TYPE']==hh_rec_type].copy()
hh_data_only['H_IDNUM'] = hh_data_only['H_IDNUM1'].map(str) + hh_data_only['H_IDNUM2'].map(str)
hh_data_only.drop(['H_IDNUM1', 'H_IDNUM2'], axis=1, inplace=True)
hh_data_only['DATA_YEAR'] = '2012'
hh_data_only.to_csv(data_path + 'hhpub12.csv')
# hh_data_only.shape
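
One thing to watch when concatenating `H_IDNUM1` and `H_IDNUM2`: by default `pd.read_fwf` infers numeric types, so any leading zeros in an ID field are dropped before `.map(str)` runs. The sketch below shows the effect on a made-up two-field sample and one possible remedy, passing `dtype=str`; the sample data and the `ID1`/`ID2` names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical records: a 3-character ID part followed by a 2-character ID part.
sample = io.StringIO("00712\n04503\n")
specs = [(0, 3), (3, 5)]
names = ["ID1", "ID2"]

# Default type inference turns "007" into the integer 7.
as_numbers = pd.read_fwf(sample, colspecs=specs, names=names)
numeric_ids = (as_numbers["ID1"].map(str) + as_numbers["ID2"].map(str)).tolist()

# Reading everything as text keeps the fields exactly as they appear in the file.
sample.seek(0)
as_text = pd.read_fwf(sample, colspecs=specs, names=names, dtype=str)
text_ids = (as_text["ID1"] + as_text["ID2"]).tolist()

print(numeric_ids)
print(text_ids)
```

If the real ID fields never start with zero this makes no difference, but reading them as text removes the risk. Note that with `dtype=str` the `REC_TYPE` column would also be text, so the filter would need to compare against `'1'` rather than `1`.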

## Family Record

*Note: in 2014 and prior files, underscores are dashes, e.g. FH_SEQ is FH-SEQ.*

Column| Spec (start:length, 1-based)| Code (0-based slice)
:---|:---|:---|
FH_SEQ| 2:5| 1-6
FINC_FR| 63:1| 62-63
FINC_SE| 55:1| 54-55
FINC_WS| 47:1| 46-47
FINC_CSP| 173:1| 172-173
FINC_DIS| 125:1| 124-125
FINC_DIV| 149:1| 148-149
FINC_RNT| 157:1| 156-157
FINC_ED| 165:1| 164-165
FINC_SS| 87:1| 86-87
FINC_SSI| 95:1| 94-95
FINC_FIN| 189:1| 188-189
FINC_SUR| 117:1| 116-117
FINC_INT| 141:1| 140-141
FINC_UC| 71:1| 70-71
FINC_OI| 197:1| 196-197
FINC_VET| 109:1| 108-109
FINC_PAW| 102:1| 101-102
FINC_WC| 79:1| 78-79


In [73]:
# FKINDEX, FINC_ANN, FINC_DST, and FINC_PEN are not present in 2018 and earlier files

family_cols = ['FINC_FR','FINC_SE','FINC_WS','FINC_CSP','FINC_DIS','FINC_DIV','FINC_RNT',
               'FINC_ED','FINC_SS','FINC_SSI','FINC_FIN','FINC_SUR','FINC_INT','FINC_UC',
               'FINC_OI','FINC_VET','FINC_PAW','FINC_WC']

# tuples for start and end positions of columns
# FH_SEQ is 2:5 in the data dictionary, i.e. five characters wide: (1,6)
ff_specs = [(0,1),(1,6),(62,63),(54,55),(46,47),(172,173),(124,125),(148,149),(156,157),
            (164,165),(86,87),(94,95),(188,189),(116,117),(140,141),(70,71),
            (196,197),(108,109),(101,102),(78,79)]

# Family Columns
all_ff_cols = ['REC_TYPE','FH_SEQ'] + family_cols
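
Because `colspecs` and `names` are matched purely by position, a miscounted or mistyped tuple silently shifts or truncates a column. A small check like the following sketch can catch that before the file is read; the three-column `names` and `specs` lists here are an illustrative subset, with FH_SEQ taken as five characters wide per its 2:5 spec.

```python
# Illustrative subset of the family columns and their (start, end) positions.
names = ["REC_TYPE", "FH_SEQ", "FINC_FR"]
specs = [(0, 1), (1, 6), (62, 63)]

# Every column name needs exactly one (start, end) pair.
assert len(names) == len(specs), "names and specs must be the same length"

# Width of each field, for comparison with the length part of the Spec column.
widths = {name: end - start for name, (start, end) in zip(names, specs)}
print(widths)
```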

In [74]:
# Run command to pull data into a dataframe
ff_data = pd.read_fwf(full_file_name, skiprows=0, 
                      skipfooter=0, colspecs=ff_specs, names=all_ff_cols)

In [75]:
# Post processing
ff_data_only = ff_data[ff_data['REC_TYPE']==ff_rec_type].copy()
ff_data_only['DATA_YEAR'] = '2012'
ff_data_only.to_csv(data_path + 'ffpub12.csv')
# ff_data_only.shape

## Person Record

*Note: in 2014 and prior files, underscores are dashes, e.g. A_MJOCC is A-MJOCC.*

Column| Spec (start:length, 1-based)| Code (0-based slice)
:---|:---|:---|
PERIDNUM| 96:22| 95-117
OCCUP| 296:4| 295-299
A_MJOCC| 211:2| 210-212
A_DTOCC| 213:2| 212-214
AGE1| 44:2| 43-45
A_SEX| 24:1| 23-24
PRDTRACE| 27:2| 26-28
PXRACE1| 859:2| 858-860
PRCITSHP| 95:1| 94-95
A_HGA| 25:2| 24-26
PRERELG| 183:1| 182-183
A_GRSWK| 191:4| 190-194
HRCHECK| 270:1| 269-270
HRSWK| 268:2| 267-269
PEARNVAL| 588:8| 587-595
A_CLSWKR| 176:1| 175-176
WEIND| 287:2| 286-288
A_MARITL| 21:1| 20-21
A_HSCOL| 198:1| 197-198
A_WKSTAT| 202:1| 201-202
HEA| 691:1| 690-691
PEINUSYR| 93:2| 92-94


In [76]:
person_cols = ['OCCUP','A_MJOCC','A_DTOCC','AGE1','A_SEX','PRDTRACE','PXRACE1','PRCITSHP',
               'A_HGA','PRERELG', 'A_GRSWK', 'HRCHECK','HRSWK','PEARNVAL','A_CLSWKR','WEIND',
               'A_MARITL','A_HSCOL','A_WKSTAT','HEA','PEINUSYR']

# tuples for start and end positions of columns
# widths follow the Spec column: OCCUP 296:4, A_GRSWK 191:4, PEARNVAL 588:8
pp_specs = [(0,1),(95,117),(295,299),(210,212),(212,214),(43,45),(23,24),(26,28),(858,860),(94,95),
            (24,26),(182,183),(190,194),(269,270),(267,269),(587,595),(175,176),(286,288),
            (20,21),(197,198),(201,202),(690,691),(92,94)]

# Person Columns
all_pp_cols = ['REC_TYPE','PERIDNUM'] + person_cols

In [77]:
# Run command to pull data into a dataframe
pp_data = pd.read_fwf(full_file_name, skiprows=0, 
                      skipfooter=0, colspecs=pp_specs, names=all_pp_cols)

In [78]:
# Post processing
pp_data_only = pp_data[pp_data['REC_TYPE']==pp_rec_type].copy()
pp_data_only['DATA_YEAR'] = '2012'
pp_data_only.to_csv(data_path + 'pppub12.csv')
# pp_data_only.shape
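
The household, family, and person cells all follow the same read, filter, and tag pattern, so they could be folded into one function. The following is a minimal sketch of that idea, not the notebook's actual code; `extract_records` and the tiny in-memory sample are hypothetical.

```python
import io
import pandas as pd


def extract_records(source, colspecs, names, rec_type, data_year):
    """Read fixed-width columns, keep rows of one record type, and tag them with a year."""
    frame = pd.read_fwf(source, colspecs=colspecs, names=names)
    subset = frame[frame["REC_TYPE"] == rec_type].copy()
    subset["DATA_YEAR"] = data_year
    return subset


# Hypothetical sample: record type in column 0, a 4-character value after it.
sample = io.StringIO("1AAAA\n2BBBB\n3CCCC\n3DDDD\n")
people = extract_records(sample, [(0, 1), (1, 5)], ["REC_TYPE", "VALUE"], 3, "2012")
print(people)
```

With the real file, the same function could be called three times with `hh_specs`, `ff_specs`, and `pp_specs`, followed by `to_csv` on each result.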