Working with the Current Population Survey (CPS) in Python
=====

### Annual Social and Economic Supplement (ASEC)

-----

*Update: February 13, 2018*<br>
*Brian Dew*<br>
*@bd_econ*

The CPS ASEC, also called the March CPS, includes additional questions that cover income, poverty, health insurance coverage, and more. The [raw public use file](http://thedataweb.rm.census.gov/ftp/cps_ftp.html#cpsmarch) is in fixed-width format, and the position and length of each variable are given in the associated data dictionary.

**To do list**
* Use variable information to convert CPS .dat file to human-readable form
* From readable CPS, calculate the median household income for 2016


This notebook relies on the blog posts by Tom Augspurger:

* [Part 1: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps.html)
* [Part 2: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%202%29.html)
* [Part 3: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%203%29.html)
* [Part 4: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%204%29.html)


### Import preliminaries

In [8]:
# Import packages
import re

import pandas as pd
import wquantiles

print(f'pandas {pd.__version__}')

pandas 0.22.0


### Data file and data dictionary from Census FTP site

In [9]:
# Paths to the raw ASEC data file and its data dictionary
filename = 'data/asec2017_pubuse.dat'
dd_txt = 'data/08ASEC2017_Data_Dict_Full.txt'
with open(dd_txt, 'r', encoding='iso-8859-1') as f:
    dd = f.read()

### Obtain column and variable information from data dictionary

In [3]:
# Retrieve column info from dictionary
p = re.compile(r'D (\w+)\s+(\d{1,2})\s+(\d+)\s+')
var_key = pd.DataFrame(p.findall(dd), columns=['Var', 'Len', 'Loc'])
var_key[['Len', 'Loc']] = var_key[['Len', 'Loc']].apply(pd.to_numeric)

# Keep only the columns of interest
s = ['H_SEQ', 'H_HHTYPE', 'H_TYPE', 'HRHTYPE', 'HTOTVAL', 'HSUP_WGT',
     'PH_SEQ', 'P_STAT', 'A_AGE', 'A_SEX', 'MARSUPWT', 'PTOTVAL',
     'GEREG']
s_key = var_key[var_key['Var'].isin(s)]
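
To make the pattern above concrete, here is a minimal sketch that runs the same regular expression on a few made-up lines written in the style of the data dictionary. The sample text, including the lengths and start columns in it, is hypothetical and not copied from the real file.

```python
import re

# Made-up lines in the style of the data dictionary (values are illustrative only)
sample_dd = (
    "D H_SEQ       5      2     (00001:99999)\n"
    "D HTOTVAL     8    248     (-999999:99999999)\n"
    "This line is a description and is skipped.\n"
)

# Same pattern as in the notebook, written as a raw string
pattern = re.compile(r'D (\w+)\s+(\d{1,2})\s+(\d+)\s+')

# Each match is a (name, length, start column) tuple of strings
fields = [(name, int(length), int(loc))
          for name, length, loc in pattern.findall(sample_dd)]
print(fields)
```

Only lines that start a variable definition with `D`, a name, and two numbers are captured; descriptive text is ignored.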

### Read file into memory

In [4]:
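# Read raw fixed-width file; colspecs are zero-based, half-open [start, end)
colspecs = list(zip(s_key.Loc - 1, s_key.Loc + s_key.Len - 1))
data = pd.read_fwf(filename, header=None, names=list(s_key.Var),
                   colspecs=colspecs)  # add nrows=1000 for a quick test read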
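
The column arithmetic may be easier to follow on a toy record. The sketch below assumes a hypothetical three-field layout and a made-up line; it converts 1-based start columns and lengths into zero-based, half-open `(start, end)` pairs, which is the form `pd.read_fwf` accepts for `colspecs`, and slices the line by hand.

```python
# Hypothetical layout: (name, length, 1-based start column), as in a data dictionary
layout = [("REC", 1, 1), ("SEQ", 5, 2), ("AGE", 2, 7)]

# Convert to zero-based, half-open (start, end) pairs
colspecs = [(loc - 1, loc - 1 + length) for _, length, loc in layout]

line = "10000142"  # made-up fixed-width record

# Slice each field out of the line using its (start, end) pair
record = {name: line[start:end]
          for (name, _, _), (start, end) in zip(layout, colspecs)}
print(colspecs)
print(record)
```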

### Match median household income

Estimate from Census: $59,039

In [5]:
# Median household income (close to the Census estimate)
df = data[data['H_HHTYPE'] == 1]
df = df.drop_duplicates(subset='H_SEQ', keep='first')
df = df[df['H_TYPE'] <= 8]

# Weights carry two implied decimal places, hence the division by 100
print(f"Number of Households: {df.HSUP_WGT.sum()/100:,.0f}")
med_inc = wquantiles.median(df['HTOTVAL'], df['HSUP_WGT'])
print(f"2016 Median HH Income: ${med_inc:,.2f}")

Number of Households: 126,223,685
2016 Median HH Income: $59,000.00
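
For intuition about what a weighted median does, here is a minimal pure-Python sketch. It uses a simple cumulative-weight definition and is not the `wquantiles` implementation, which may interpolate and so can return slightly different values. The function name `weighted_median` and the toy numbers are assumptions for illustration.

```python
def weighted_median(values, weights):
    """Return the smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must not be empty")


# Toy data: the low-income household represents more households than the others
incomes = [20000, 50000, 90000]
weights = [300, 100, 100]
print(weighted_median(incomes, weights))
```

With equal weights the same function returns 50000 for these incomes, which shows how much the survey weights can move the estimate.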


In [6]:
# By region (GEREG): results look wrong, possibly because this loop uses the unfiltered `data`
for gereg in [1, 2, 3, 4]:
    dft = data[data['GEREG'] == gereg]
    print(f"Number of Households: {dft.HSUP_WGT.sum()/100:,.0f}")
    print(f"2016 Median HH Income: ${wquantiles.median(dft['HTOTVAL'], dft['HSUP_WGT']):,.2f}")

Number of Households: 4,738,556,079
2016 Median HH Income: $22,220.00
Number of Households: 48,661,838
2016 Median HH Income: $36,447.79
Number of Households: 70,584,378
2016 Median HH Income: $39,600.00
Number of Households: 131,358,172
2016 Median HH Income: $0.00
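
One possible explanation for the implausible totals, offered as an assumption rather than something the notebook confirms, is that the loop runs on `data` without the household filter and de-duplication applied in the previous cell. The toy sketch below uses made-up rows to show how repeated rows for one household inflate a regional weight total, and how keeping one row per `H_SEQ` avoids that. The helper `weight_by_region` is hypothetical.

```python
from collections import defaultdict

# Made-up rows: household 1 appears three times, as it could if
# non-household rows were kept alongside the household row
rows = [
    {"H_SEQ": 1, "GEREG": 1, "HSUP_WGT": 150000},
    {"H_SEQ": 1, "GEREG": 1, "HSUP_WGT": 150000},
    {"H_SEQ": 1, "GEREG": 1, "HSUP_WGT": 150000},
    {"H_SEQ": 2, "GEREG": 2, "HSUP_WGT": 200000},
]


def weight_by_region(records):
    """Sum household weights (two implied decimals) within each region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["GEREG"]] += r["HSUP_WGT"] / 100
    return dict(totals)


# Keep only the first row seen for each household
seen, households = set(), []
for r in rows:
    if r["H_SEQ"] not in seen:
        seen.add(r["H_SEQ"])
        households.append(r)

print(weight_by_region(rows))        # duplicates inflate region 1
print(weight_by_region(households))  # one row per household
```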


### Match median personal income

Estimate from Census: $31,099

In [7]:
# Result is about $1,000 below the Census estimate
df = data[data['A_AGE'] >= 15].copy()  # copy so the assignment below does not warn
df['MARSUPWT'] = df['MARSUPWT'].astype('float')

print(f"Number of People: {df.MARSUPWT.sum()/100:,.0f}")
df = df[df['PTOTVAL'] > 0]
print(f"With income: {df.MARSUPWT.sum()/100:,.0f}")
med_inc = wquantiles.median(df['PTOTVAL'], df['MARSUPWT'])
print(f"2016 Median Personal Income: ${med_inc:,.2f}")

Number of People: 259,409,062
With income: 228,239,369
2016 Median Personal Income: $30,100.00
