Working with the Current Population Survey (CPS) in Python
=====

### Annual Social and Economic Supplement (ASEC)

-----

*Update: April 19, 2018*<br>
*Brian Dew*<br>
*@bd_econ*

The CPS ASEC, also called the March CPS, adds questions on income, poverty, health insurance coverage, and more. The [raw public use file](http://thedataweb.rm.census.gov/ftp/cps_ftp.html#cpsmarch) is in fixed-width format, and its variables are described in the associated data dictionary. Unlike the basic monthly CPS, the March CPS records are hierarchical: person rows are nested in family rows, which are nested in household rows. Person rows are identified by a record type of 3 (the `PRECORD` variable used in the filters below).
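
To make the hierarchical layout concrete, here is a minimal sketch that groups a few made-up fixed-width lines by record type. The sample lines, the `RECORD_TYPE_POS` offset, the household and family codes, and the helper name `split_by_record_type` are all illustrative assumptions rather than the actual ASEC layout; the data dictionary is the place to confirm the real position of the record-type field.

```python
from collections import defaultdict

# Assumption: 0-based position of the record-type character in each line.
# Look up the real position in the data dictionary for the file you use.
RECORD_TYPE_POS = 0

# Assumption: 1 = household and 2 = family; the notebook itself only
# confirms that person rows are coded 3.
RECORD_NAMES = {'1': 'household', '2': 'family', '3': 'person'}


def split_by_record_type(lines, pos=RECORD_TYPE_POS):
    """Group fixed-width lines by the record-type character at `pos`."""
    groups = defaultdict(list)
    for line in lines:
        groups[RECORD_NAMES.get(line[pos], 'unknown')].append(line)
    return dict(groups)


# Made-up records: one household, one family, two persons
sample = ['100001HH', '200001FF', '300001P1', '300001P2']
groups = split_by_record_type(sample)
print({name: len(rows) for name, rows in groups.items()})
```

Because every row shares the same character positions, a column read at a person-level location only makes sense on person rows, which is why the later cells filter on the record type first.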

**To do list**
* Small working example -- get age, gender and weight from person records, 2012 and 2017.


Useful blog posts from Tom Augspurger: 

* [Part 1: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps.html)
* [Part 2: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%202%29.html)
* [Part 3: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%203%29.html)
* [Part 4: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%204%29.html)


### Import preliminaries

In [1]:
# Import packages
import re

import pandas as pd
import wquantiles

print(f'pandas {pd.__version__}')

pandas 0.22.0


### Data file and data dictionary from Census FTP site

In [142]:
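# Paths to the ASEC data file and its data dictionary
datafile = 'data/asec2013_pubuse.dat'
dd_txt = 'data/08ASEC2013_Data_Dict_Full.txt'

# Read the dictionary text; the with-block closes the file handle
with open(dd_txt, 'r', encoding='iso-8859-1') as f:
    dd = f.read()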

### Obtain column and variable information from data dictionary

In [143]:
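# Retrieve variable name, length, and start position from the dictionary
# (raw string so the regex escapes are not treated as string escapes)
p = re.compile(r'D (\w+\-?\w+?)\s+(\d{1,2})\s+(\d+)\s+')
var_key = pd.DataFrame(p.findall(dd), columns=['Var', 'Len', 'Loc'])
# Convert length and location to numbers; variable names stay as strings
var_key[['Len', 'Loc']] = var_key[['Len', 'Loc']].apply(pd.to_numeric)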

# Select the person-level variables of interest. Hyphen and underscore
# spellings are both listed because the naming appears to vary by file.
s = ['A-AGE', 'A_AGE', 'A-SEX', 'A-FNLWGT', 'MARSUPWT', 'PRECORD', 'WKSWORK', 
     'HRSWK', 'WEWKRS', 'PTOTVAL', 'A-ERNLWT', 'A_ERNLWT']

# Alternative selection with the household-level variables used in the
# household income section
#s = ['H_SEQ', 'H_HHTYPE', 'H_TYPE', 'HRHTYPE','HTOTVAL', 'HSUP_WGT',
#     'PH_SEQ', 'P_STAT', 'A_AGE', 'A_SEX', 'MARSUPWT', 'PTOTVAL',
#     'GEREG']
s_key = var_key[var_key['Var'].isin(s)]
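
To see what the dictionary regex captures, the sketch below runs the same pattern (written as a raw string) on a made-up excerpt. The variable names come from this notebook, but the lengths, start positions, and description lines in `dd_sample` are invented for illustration and are not taken from the real data dictionary.

```python
import re

# Made-up excerpt in the style the regex expects: "D NAME  LENGTH  START".
# Lengths and positions here are invented for illustration only.
dd_sample = """
D PRECORD     1      1  (3:3)
          Record type
D A-AGE       2     15  (00:85)
          Age
D MARSUPWT    8    155  (0:99999999)
          Supplement final weight
"""

p = re.compile(r'D (\w+\-?\w+?)\s+(\d{1,2})\s+(\d+)\s+')

# Each match is a (name, length, 1-based start) tuple of strings
matches = [(var, int(length), int(loc))
           for var, length, loc in p.findall(dd_sample)]
print(matches)

# Convert to 0-based, half-open (start, end) pairs for slicing
colspecs = [(loc - 1, loc - 1 + length) for _, length, loc in matches]
print(colspecs)
```

The pattern needs at least two characters in the variable name, and it would also match any description line that happens to start with `D`, a name, and two numbers, so it is worth spot-checking the parsed table against the dictionary.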

### Read file into memory

In [144]:
# Read raw fwf file
data = pd.read_fwf(datafile, header=None, names=list(s_key.Var),# nrows=1000,
                 colspecs=list(zip(s_key.Loc-1, s_key.Loc + s_key.Len-1)))
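
The `colspecs` argument takes 0-based, half-open `(start, end)` pairs, which is why the 1-based dictionary location is shifted by one above. A minimal sketch on a made-up two-line file, where the layout tuples and the sample records are hypothetical:

```python
import io

import pandas as pd

# Made-up fixed-width file: record type (1 char), age (2), weight (8)
raw = io.StringIO(
    "34500059114\n"
    "35900061314\n"
)

# Hypothetical layout: (name, length, 1-based start)
layout = [('PRECORD', 1, 1), ('A-AGE', 2, 2), ('MARSUPWT', 8, 4)]
names = [name for name, _, _ in layout]

# Shift to 0-based starts; the end position is exclusive
colspecs = [(loc - 1, loc - 1 + length) for _, length, loc in layout]

sample = pd.read_fwf(raw, header=None, names=names, colspecs=colspecs)
print(sample)
```

In the full file a column such as `MARSUPWT` may come back as strings, since the same character positions hold other fields on household and family rows; that is presumably why the notebook applies `pd.to_numeric` afterwards.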

In [145]:
# Women aged 16 to 64, person records only.
# .copy() avoids the SettingWithCopyWarning when adding columns below.
df = data[(data['PRECORD'] == 3) & 
          (data['A-AGE'].between(16, 64)) & 
          (data['A-SEX'] == 2)].copy()
# Annual hours: weeks worked times usual hours per week
df['ANN_HRS'] = df['WKSWORK'] * df['HRSWK']
df['WGT'] = pd.to_numeric(df['MARSUPWT'])


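In [137]:
# Weighted median of annual hours worked
wquantiles.quantile(df['ANN_HRS'], df['WGT'], 0.5)

1290.0

In [146]:
# Total weight of the filtered sample (needed for the share below)
tot = df.WGT.sum()

In [156]:
# Weighted share of the sample with WEWKRS == 5
df[df['WEWKRS'].isin([5])].WGT.sum() / tot

0.320049007570324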
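
For intuition about the weighted median, here is a minimal sketch that returns the smallest value whose cumulative weight share reaches the requested quantile. It is not the `wquantiles` implementation, which may interpolate between observations and so can give slightly different numbers; the helper name `weighted_quantile` and the sample hours and weights are made up.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q.

    No interpolation; assumes non-empty inputs and non-negative weights.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]


# Made-up annual hours and person weights
hours = [0, 561, 1064, 1280, 2080]
weights = [10, 20, 30, 25, 15]
print(weighted_quantile(hours, weights, 0.5))
```

Each observation counts in proportion to its weight, so a few heavily weighted records can move the median more than many lightly weighted ones.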

In [153]:
df

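| | PRECORD | A-AGE | A-SEX | A-FNLWGT | A-ERNLWT | MARSUPWT | WKSWORK | HRSWK | WEWKRS | PTOTVAL | ANN_HRS | WGT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 3 | 45 | 2 | 63641 | 255594 | 00059114 | 28 | 38 | 3 | 25200 | 1064 | 59114 |
| 15 | 3 | 59 | 2 | 59095 | 0 | 00061314 | 0 | 0 | 5 | 0 | 0 | 61314 |
| 19 | 3 | 62 | 2 | 67484 | 0 | 00065654 | 52 | 32 | 2 | 45184 | 1664 | 65654 |
| 24 | 3 | 26 | 2 | 67176 | 0 | 00034743 | 32 | 40 | 3 | 3472 | 1280 | 34743 |
| 39 | 3 | 27 | 2 | 69143 | 0 | 00034999 | 51 | 11 | 2 | 8505 | 561 | 34999 |
| 46 | 3 | 40 | 2 | 60764 | 0 | 00030006 | 52 | 40 | 1 | 15000 | 2080 | 30006 |
| 47 | 3 | 19 | 2 | 60940 | 0 | 00033427 | 7 | 15 | 4 | 2800 | 105 | 33427 |
| 52 | 3 | 35 | 2 | 61435 | 0 | 00031842 | 52 | 40 | 1 | 15800 | 2080 | 31842 |
| 54 | 3 | 17 | 2 | 60940 | 0 | 00033427 | 0 | 0 | 5 | 0 | 0 | 33427 |
| 59 | 3 | 62 | 2 | 65354 | 0 | 00061963 | 0 | 0 | 5 | 22162 | 0 | 61963 |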


### Match median household income

Estimate from Census: $59,039

In [5]:
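# Median household income (close to the Census estimate).
# This cell needs the household-level variables (the commented-out list
# `s` above) to be present in `data`.
df = data[data['H_HHTYPE'] == 1]
# Keep one row per household
df = df.drop_duplicates(subset='H_SEQ', keep='first')
df = df[df['H_TYPE'] <= 8]

# The weights are divided by 100 here, i.e. treated as having two implied decimal places
print(f"Number of Households: {df.HSUP_WGT.sum()/100:,.0f}")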
med_inc = wquantiles.median(df['HTOTVAL'], df['HSUP_WGT'])
print(f"2016 Median HH Income: ${med_inc:,.2f}")

Number of Households: 126,223,685
2016 Median HH Income: $59,000.00


In [6]:
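# By region (GEREG): results are implausible (not working). A likely cause is
# that the loop filters `data`, which still includes family and person rows,
# rather than the household-only `df`.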
for gereg in [1, 2, 3, 4]:
    dft = data[data['GEREG'] == gereg]
    print(f"Number of Households: {dft.HSUP_WGT.sum()/100:,.0f}")
    print(f"2016 Median HH Income: ${wquantiles.median(dft['HTOTVAL'], dft['HSUP_WGT']):,.2f}")

Number of Households: 4,738,556,079
2016 Median HH Income: $22,220.00
Number of Households: 48,661,838
2016 Median HH Income: $36,447.79
Number of Households: 70,584,378
2016 Median HH Income: $39,600.00
Number of Households: 131,358,172
2016 Median HH Income: $0.00
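
One possible reason for the odd regional results is that the loop filters `data`, which still contains family and person rows, rather than the household-only `df`. The sketch below uses made-up records to show the general pattern of keeping one row per household before grouping by region. The values, the `weighted_median` helper, and the diagnosis itself are assumptions and have not been checked against the ASEC file.

```python
import pandas as pd


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total."""
    ordered = pd.DataFrame({'v': values, 'w': weights}).sort_values('v')
    cum_share = ordered['w'].cumsum() / ordered['w'].sum()
    return ordered.loc[cum_share >= 0.5, 'v'].iloc[0]


# Made-up records: H_SEQ identifies the household; household 1 appears twice
data = pd.DataFrame({
    'H_SEQ':    [1, 1, 2, 3, 4],
    'GEREG':    [1, 1, 1, 2, 2],
    'HTOTVAL':  [40000, 40000, 60000, 30000, 90000],
    'HSUP_WGT': [300, 300, 400, 200, 200],
})

# Keep one row per household before computing regional statistics
households = data.drop_duplicates(subset='H_SEQ', keep='first')

medians = {
    region: weighted_median(group['HTOTVAL'], group['HSUP_WGT'])
    for region, group in households.groupby('GEREG')
}
print(medians)
```

In this toy data the duplicated household would pull the region 1 median down if the rows were not deduplicated first, which is the kind of distortion that repeated or non-household rows can introduce.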


### Match median personal income

Estimate from Census: $31,099

In [121]:
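# Median personal income: result is not close enough to the Census estimate.
# .copy() avoids the SettingWithCopyWarning on the assignment below.
df = data[(data['A_AGE'] >= 15) & (data['PRECORD'] == 3)].copy()
df['MARSUPWT'] = pd.to_numeric(df['MARSUPWT'])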

print(f"Number of People: {df.MARSUPWT.sum()/100:,.0f}")
df = df[df['PTOTVAL'] > 0]
print(f"With income: {df.MARSUPWT.sum()/100:,.0f}")
med_inc = wquantiles.median(df['PTOTVAL'], df['MARSUPWT'])
print(f"2016 Median Personal Income: ${med_inc:,.2f}")

Number of People: 259,403,062
With income: 228,239,369
2016 Median Personal Income: $30,100.00


