Working with the Current Population Survey (CPS) in Python
=====

### Annual Social and Economic Supplement (ASEC)

-----

*Update: February 13, 2018*<br>
*Brian Dew*<br>
*@bd_econ*

The CPS ASEC, also called the March CPS, includes additional questions that cover income, poverty, health insurance coverage, and more. The [raw public use file](http://thedataweb.rm.census.gov/ftp/cps_ftp.html#cpsmarch) is in fixed-width format; the name, length, and starting position of each variable are given in the associated data dictionary.

**To do list**
* Use the variable information in the data dictionary to convert the CPS .dat file to a human-readable form
* From the readable CPS data, calculate the median household income for 2016


This notebook relies on the blog posts by Tom Augspurger:

* [Part 1: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps.html)
* [Part 2: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%202%29.html)
* [Part 3: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%203%29.html)
* [Part 4: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%204%29.html)


### Import preliminaries

In [1]:
import pandas as pd
print(f'pandas {pd.__version__}')
import re, wquantiles

pandas 0.22.0


### Data file and data dictionary from Census FTP site

In [2]:
filename = 'data/asec2017_pubuse.dat'
dd_txt = 'data/08ASEC2017_Data_Dict_Full.txt'
# Read the whole data dictionary as text; the context manager closes the file
with open(dd_txt, 'r', encoding='iso-8859-1') as f:
    dd = f.read()

### Obtain column and variable information from data dictionary

In [3]:
# Raw string so the regex escapes (\w, \s, \d) are not treated as string escapes
p = re.compile(r'D (\w+)\s+(\d{1,2})\s+(\d+)\s+')
var_key = pd.DataFrame(p.findall(dd), columns=['Var', 'Len', 'Loc'])
var_key = var_key.apply(pd.to_numeric, errors='ignore')

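# Keep only the columns of interest
# Household-level variables (used by the household income cells):
#s = ['H_SEQ', 'H_HHTYPE', 'H_TYPE', 'HRHTYPE','HTOTVAL', 'HSUP_WGT']
# Person-level variables (used by the personal income cells):
s = ['PH_SEQ', 'P_STAT', 'A_AGE', 'A_SEX', 'MARSUPWT', 'PTOTVAL']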
s_key = var_key[var_key['Var'].isin(s)]
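
To see what the regular expression extracts, it can be run on a few lines laid out like the data dictionary. This is a minimal sketch: `sample_dd`, `pattern`, and `sample_key` are illustrative names, and the sample text is a short excerpt of the dictionary printed in this notebook.

```python
import re
import pandas as pd

# A few lines in the same layout as the ASEC data dictionary
sample_dd = """D HRECORD     1      1  (1:1)
U All households
V          1 .Household record

D H_SEQ       5      2  (00001:99999)
     Household sequence number

D HHPOS       2      7  (00:00)
"""

# "D <name> <length> <start position>" lines define variables
pattern = re.compile(r'D (\w+)\s+(\d{1,2})\s+(\d+)\s+')

sample_key = pd.DataFrame(pattern.findall(sample_dd),
                          columns=['Var', 'Len', 'Loc'])
# Convert only the numeric columns; variable names stay as strings
sample_key[['Len', 'Loc']] = sample_key[['Len', 'Loc']].astype(int)
print(sample_key)
```

Only the lines starting with `D ` match; the `U` (universe) and `V` (value) lines are skipped.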

### Read file into memory

In [28]:
df = pd.read_fwf(filename, header=None, names=list(s_key.Var),# nrows=1000,
                 colspecs=list(zip(s_key.Loc-1, s_key.Loc + s_key.Len-1)))
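
The `colspecs` arithmetic converts the dictionary's 1-based start positions into the 0-based, half-open intervals that `read_fwf` expects. A small sketch on made-up data, assuming a hypothetical two-line extract (`raw`) and a hand-written `layout` table:

```python
import io
import pandas as pd

# Hypothetical fixed-width lines: HRECORD (1 char at position 1),
# H_SEQ (5 chars at position 2), HHPOS (2 chars at position 7)
raw = "10000100\n10000200\n"

layout = pd.DataFrame({'Var': ['H_SEQ', 'HHPOS'],
                       'Len': [5, 2],
                       'Loc': [2, 7]})

# Start: Loc - 1 (0-based). End: Loc - 1 + Len (exclusive).
colspecs = list(zip((layout.Loc - 1).tolist(),
                    (layout.Loc + layout.Len - 1).tolist()))

demo = pd.read_fwf(io.StringIO(raw), header=None,
                   names=list(layout.Var), colspecs=colspecs)
print(colspecs)
print(demo)
```

Using `Loc + Len` as the end point instead would read one character too many and pull in the first character of the next field.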

### Match median household income

Estimate from Census: $59,039

In [None]:
df = df[df['H_HHTYPE'] == 1]
df = df.drop_duplicates(subset='H_SEQ', keep='first')
df = df[df['H_TYPE'] <= 8]

print(f"Number of Households: {df.HSUP_WGT.sum()/100:,.0f}")
med_inc = wquantiles.median(df['HTOTVAL'], df['HSUP_WGT'])
print(f"2016 Median HH Income: ${med_inc:,.2f}")
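
For intuition about what a weighted median does, here is a minimal NumPy sketch. It is not the `wquantiles` implementation: as far as I know, `wquantiles` interpolates between neighbouring values, so its result can differ slightly from this step-function version. The function name and the toy numbers are illustrative only.

```python
import numpy as np

def weighted_median_sketch(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)              # sort values, carry weights along
    cum_weights = np.cumsum(weights[order])
    cutoff = 0.5 * cum_weights[-1]
    return values[order][np.searchsorted(cum_weights, cutoff)]

# Hypothetical incomes and weights, for illustration only
incomes = [10000, 20000, 30000, 40000]
wts = [1, 1, 1, 5]
print(weighted_median_sketch(incomes, wts))
```

Because the last household carries most of the weight, the weighted median lands on its income rather than in the middle of the list.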

In [None]:
for gereg in [1, 2, 3, 4]:
    dft = df[df['GEREG'] == gereg]
    print(f"Number of Households: {dft.HSUP_WGT.sum()/100:,.0f}")
    print(f"2016 Median HH Income: ${wquantiles.median(dft['HTOTVAL'], dft['HSUP_WGT']):,.2f}")
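
The per-region household counts can also be computed in one step with `groupby`. A sketch on hypothetical rows; the division by 100 follows the notebook's own treatment of `HSUP_WGT`, which implies two decimal places in the stored weight:

```python
import pandas as pd

# Hypothetical household rows, for illustration only
demo = pd.DataFrame({
    'GEREG': [1, 1, 2, 3, 4, 4],
    'HSUP_WGT': [150000, 250000, 100000, 300000, 120000, 80000],
})

# Weighted household count per region: sum the weights, then divide by 100
households = demo.groupby('GEREG')['HSUP_WGT'].sum() / 100
print(households)
```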

### Match median personal income

Estimate from Census: $31,099

In [33]:
# .copy() so the column assignment below modifies a new frame, not a view
df = df[df['A_AGE'] >= 15].copy()
df['MARSUPWT'] = df.MARSUPWT.astype('float')

print(f"Number of People: {df.MARSUPWT.sum()/100:,.0f}")
df = df[df['PTOTVAL'] > 0]
print(f"With income: {df.MARSUPWT.sum()/100:,.0f}")
med_inc = wquantiles.median(df['PTOTVAL'], df['MARSUPWT'])
print(f"2016 Median Personal Income: ${med_inc:,.2f}")

Number of People: 259,409,062
With income: 228,239,369
2016 Median Personal Income: $30,100.00


In [31]:
df[df['PTOTVAL'] > 0].MARSUPWT.sum()/100

228239368.63999999

In [22]:
df

Unnamed: 0,PH_SEQ,A_AGE,A_SEX,P_STAT,MARSUPWT,PTOTVAL
5,4,51,2,1,69755.0,18899
9,5,85,1,1,71575.0,38059
10,5,71,2,1,71575.0,10859
16,9,55,1,1,66354.0,25000
17,9,42,2,1,66354.0,20000
18,9,21,2,1,80785.0,8400
19,9,18,2,1,73899.0,0
22,10,59,1,1,139910.0,56002
23,10,60,2,1,139910.0,50008
26,11,28,1,1,129364.0,38020
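
The age and income filters can be traced by hand on the ten rows displayed above. This sketch rebuilds just those rows (three columns only) and applies the same two steps; `rows`, `adults`, and `with_income` are illustrative names.

```python
import pandas as pd

# The ten rows shown above (A_AGE, MARSUPWT, PTOTVAL only)
rows = pd.DataFrame({
    'A_AGE': [51, 85, 71, 55, 42, 21, 18, 59, 60, 28],
    'MARSUPWT': [69755.0, 71575.0, 71575.0, 66354.0, 66354.0,
                 80785.0, 73899.0, 139910.0, 139910.0, 129364.0],
    'PTOTVAL': [18899, 38059, 10859, 25000, 20000,
                8400, 0, 56002, 50008, 38020],
})

adults = rows[rows['A_AGE'] >= 15]
with_income = adults[adults['PTOTVAL'] > 0]

# The notebook divides MARSUPWT by 100 to get the number of people represented
print(f"People represented: {adults.MARSUPWT.sum() / 100:,.2f}")
print(f"With income: {with_income.MARSUPWT.sum() / 100:,.2f}")
```

Only the 18-year-old with `PTOTVAL` of 0 drops out in the second step.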


In [21]:
df.MARSUPWT.sum()/100

259409062.11000001

In [None]:
df

In [None]:
#%matplotlib inline
#dft['HTOTVAL'].hist(bins=500, figsize=(15, 2))

In [None]:
wquantiles.median(df['HTOTVAL'], df['HSUP_WGT']/100)

In [None]:
len(df)

In [None]:
df.HSUP_WGT.sum()/100

In [None]:
df = df[df['GESTFIPS'].between(1, 56)]  # between() includes both endpoints by default

In [None]:
df.groupby('H_TYPE').count()

In [None]:
df['H_TYPE']

In [34]:
print(dd)

2017 ANNUAL SOCIAL AND ECONOMIC (ASEC) 
SUPPLEMENT DATA DICTIONARY


HOUSEHOLD RECORD



DATA       SIZE   BEGIN RANGE                  

D HRECORD     1      1  (1:1)
U All households
V          1 .Household record

D H_SEQ       5      2  (00001:99999)
     Household sequence number
V All households
V     00001- .Household sequence number
V     99999  .

D HHPOS       2      7  (00:00)
	Trailer portion of unique household ID. 00 for HH record. Same function in family record is field FFPOS (01-39). Same function in person record is PPPOS (41-79).

D HUNITS      1      9  (1:5)
	Item 78 - How many units in the structure
U H_HHTYPE = 1
V          1 .1 Unit
V          2 .2 Units
V          3 .3 - 4 Units
V          4 .5 - 9 Units
V          5 .10+ Units

D HEFAMINC    2     10  (-1:16)
	Family income
	NOTE:  If a nonfamily household, income includes only that of householder.
U All households
V         -1 .Not in universe
V         01 .Less than $5,000
V         02 .$5,000 to $7,499
V    

In [None]:
p = re.compile(r'D (\w+)\s+(\d{1,2})\s+(\d+)\s+')
var_key = pd.DataFrame(p.findall(dd), columns=['Var', 'Len', 'Loc'])
#var_key.columns = ['Var', 'Len', 'Loc']
var_key = var_key.apply(pd.to_numeric, errors='ignore')
#var_key['Start'] = var_key['Loc'] -1 
#var_key['End'] = var_key['Loc'] + var_key['Len']

# Filter out columns of interest
s = ['HTOTVAL', 'H_HHTYPE', 'HSUP_WGT', 'H_SEQ', 'HEFAMINC']
s_key = var_key[var_key['Var'].isin(s)]  

In [None]:
# Read file
df = pd.read_fwf(filename, header=None, names=list(s_key.Var), nrows=1000,
                 colspecs=list(zip(s_key.Loc-1, s_key.Loc + s_key.Len-1)))

In [None]:
list(zip(s_key.Loc-1, s_key.Loc + s_key.Len-1))

In [None]:
s_key

In [None]:
df

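In [None]:
# s_key has no Start/End columns; build the intervals from Loc and Len
[(row['Loc'] - 1, row['Loc'] + row['Len'] - 1) for _, row in s_key.iterrows()]

In [None]:
list(zip(s_key.Loc-1, s_key.Loc + s_key.Len-1))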

In [None]:
list(s_key.Var)

In [None]:
%%time
# Read file
df = pd.read_fwf(filename, header=None, names=list(s_key.Var), nrows=1000,
                 colspecs=list(zip(s_key.Loc-1, s_key.Loc + s_key.Len-1)))

In [None]:
#len(df)

### Read file into memory

In [None]:
%%time
# var_key columns are named Var, Len, Loc
df = pd.read_fwf(filename, widths=list(var_key.Len), 
                 header=None, nrows=1000) # If testing, use 
df.columns = var_key.Var.values
df = df.drop('FILLER', axis=1)

### Match median household income

Estimate from Census: $59,039

In [None]:
import wquantiles

cols = ['HTOTVAL', 'H_HHTYPE', 'HSUP_WGT', 'H_SEQ', 'HEFAMINC']
df = pd.read_csv('data/2017_CPS_ASEC.csv', usecols=cols)
len(df)
#df = df[df['H_HHTYPE'] == 1]
#df = df.drop_duplicates(subset='H_SEQ', keep='first')

#print(f"2016 Median HH Income: ${wquantiles.median(df['HTOTVAL'], df['HSUP_WGT']):,.2f}")

In [None]:
len(df)

### Store CPS ASEC as csv for future use

In [None]:
df.to_csv('data/2017_CPS_ASEC.csv')
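
By default, `to_csv` writes the DataFrame index as an extra first column. A minimal sketch of a round trip that avoids this with `index=False` and reads back only selected columns with `usecols`; the in-memory `buffer` stands in for a file on disk, and the values are made up.

```python
import io
import pandas as pd

# Hypothetical household rows, for illustration only
demo = pd.DataFrame({'H_SEQ': [1, 2],
                     'HTOTVAL': [52000, 61000],
                     'HSUP_WGT': [150000, 250000]})

buffer = io.StringIO()             # stands in for a CSV file on disk
demo.to_csv(buffer, index=False)   # do not write the index as a column
buffer.seek(0)

# Read back only the columns needed for a given calculation
subset = pd.read_csv(buffer, usecols=['H_SEQ', 'HTOTVAL'])
print(subset.columns.tolist())
```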

In [None]:
# 0-based, half-open interval for each variable (columns are named Loc and Len)
var_key['Start'] = var_key['Loc'] - 1
var_key['End'] = var_key['Loc'] + var_key['Len'] - 1