Working with the Current Population Survey (CPS) in Python
=====

### Annual Social and Economic Supplement (ASEC)

-----

*Update: May 20, 2018*<br>
*Brian Dew*<br>
*@bd_econ*

The CPS ASEC, also called the March CPS, includes additional questions that cover income, poverty, health insurance coverage, and more. The [raw public use file](http://thedataweb.rm.census.gov/ftp/cps_ftp.html#cpsmarch) is in fixed-width format, and its variables are described in the associated data dictionary. Unlike the basic monthly CPS, the March CPS records are hierarchical: person records are nested in family records, which are nested in household records. Rows in the raw data that correspond to person records begin with 3.

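To make the hierarchy concrete, here is a minimal sketch that groups the rows of a hierarchical file by their leading record-type digit. The toy lines and the `group_households` helper are hypothetical. The sketch assumes, consistent with the `PRECORD` checks later in this notebook, that household rows begin with 1, family rows with 2, and person rows with 3.

```python
# Toy rows: the first character is the record type (assumed 1/2/3)
toy_lines = [
    "1HH-A",
    "2FAM-A1",
    "3PERSON-A1a",
    "3PERSON-A1b",
    "1HH-B",
    "2FAM-B1",
    "3PERSON-B1a",
]


def group_households(lines):
    """Nest family and person rows under the most recent household row."""
    households = []
    for line in lines:
        rec_type = line[0]
        if rec_type == "1":
            households.append({"household": line, "families": [], "persons": []})
        elif rec_type == "2":
            households[-1]["families"].append(line)
        elif rec_type == "3":
            households[-1]["persons"].append(line)
    return households


households = group_households(toy_lines)
person_counts = [len(h["persons"]) for h in households]
print(person_counts)
```

Because a person row carries no household identifier of its own in this sketch, its household is known only from row order, which is why household-level values have to be carried down to person rows later on.
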
This example looks at teachers' hours and income in 2006 and 2016, both for the US as a whole and separately for states with large teacher protests in 2018.

Useful blog posts from Tom Augspurger: 

* [Part 1: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps.html)
* [Part 2: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%202%29.html)
* [Part 3: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%203%29.html)
* [Part 4: Using Python to tackle the CPS](http://tomaugspurger.github.io/tackling%20the%20cps%20%28part%204%29.html)


### Import preliminaries

In [1]:
# Import packages
import pandas as pd
print(f'pandas {pd.__version__}')
import numpy as np
import re, wquantiles

pandas 0.23.0


### Parameters

In [2]:
files = [('data/asec2007_pubuse_tax2.dat', 'data/cpsmar07.ddf', 2006),
         ('data/asec2017_pubuse.dat', 'data/08ASEC2017_Data_Dict_Full.txt', 2016)]

# Price adjustment factor applied to 2006 dollar values
cpi = 1.19075

### Loop over files and create summary statistics

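The cell below pulls each variable's name, length, and starting position out of the data dictionary with a regular expression and turns them into `colspecs` for `pd.read_fwf`. Here is a minimal sketch of that step using only the standard library. The variable names are real ASEC names, but the dictionary excerpt, positions, and data row are invented for illustration.

```python
import re

# Made-up dictionary excerpt: "D", name, length, start position (1-based)
dd = """
D PRECORD     1      1     (1:3)
D A-AGE       2     15     (0:90)
D MARSUPWT    8     20
"""

# Same pattern as the notebook, written as a raw string
pattern = re.compile(r'D (\w+\-?\w+?)\s+(\d{1,2})\s+(\d+)\s+')
var_key = [(name, int(length), int(loc))
           for name, length, loc in pattern.findall(dd)]

# Convert 1-based start and length to zero-based, half-open intervals
colspecs = {name: (loc - 1, loc + length - 1)
            for name, length, loc in var_key}

# Made-up 27-character row: record type, age, and weight
row = "3" + " " * 13 + "42" + " " * 3 + "00123456"
parsed = {name: row[start:end] for name, (start, end) in colspecs.items()}
print(colspecs)
print(parsed)
```

`pd.read_fwf` expects the same zero-based, half-open intervals, which is why the notebook subtracts 1 from `Loc` when building `colspecs`.
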
In [3]:
# Blank dataframe to return with summary statistics
stats = pd.DataFrame()

for file in files:
    # Raw data from Census FTP site
    datafile = file[0]

    # Data dictionary
    dd_txt = file[1]
    dd = open(dd_txt, 'r', encoding='iso-8859-1').read()

    # Retrieve column info from dictionary
    p = re.compile(r'D (\w+\-?\w+?)\s+(\d{1,2})\s+(\d+)\s+')
    var_key = pd.DataFrame(p.findall(dd), columns=['Var', 'Len', 'Loc'])
    var_key = var_key.apply(pd.to_numeric, errors='ignore')

    # List of variables of interest to be extracted from full file
    s = ['A_AGE', 'A_SEX', 'MARSUPWT', 'PRECORD', 'WEWKRS', 'WSAL_VAL', 
         'HRSWK', 'GESTFIPS', 'PEIOOCC', 'WKSWORK', 'A_CLSWKR',
         'PEMOMTYP', 'PEDADTYP', 'ERN_VAL', 'A-AGE', 'A-SEX', 'WSAL-VAL',
         'A-CLSWKR', 'ERN-VAL']
    s_key = var_key[var_key['Var'].isin(s)]

    # Read raw fwf file
    data = pd.read_fwf(datafile, header=None, names=list(s_key.Var),
                     colspecs=list(zip(s_key.Loc-1, s_key.Loc + s_key.Len-1)))

    # Calculate annual hours (weeks * (hours/week))
    data['ANN_HRS'] = data['WKSWORK'] * data['HRSWK']
    
    # States filled in (blank in early year person records)
    if file[2] == 2006:
        data['GESTFIPS'] = data['GESTFIPS'].replace(to_replace=0, method='ffill')
        # Rename columns to later year variable format
        data.columns = data.columns.str.replace('-', '_')
        
    if file[2] == 2016:
        # Keep the state code from household records (PRECORD == 1) only
        data['State'] = data['GESTFIPS'].where(data['PRECORD'] == 1, 0)
        data['GESTFIPS'] = data['State'].replace(to_replace=0, method='ffill')

    # Keep person records only (PRECORD == 3)
    df = data.loc[(data['PRECORD'] == 3)]

    # Convert weight variable to float
    df['MARSUPWT'] = df['MARSUPWT'].astype(float)

    # Keep only observations with a weight > 0
    df = df[df['MARSUPWT'] > 0]
    
    # Hourly wage
    df['HRLY_WAGE'] = df['ERN_VAL'] / df['ANN_HRS']

    # Identify public school teachers ages 25-54
    # (the 3/4-year annual-hours filter below is commented out)
    pt = df[(df['PEIOOCC'].between(2200,2340)) & 
            (df['A_CLSWKR'].between(3,5)) & 
            #(df['ANN_HRS'] >= 1365) & 
            (df['A_AGE'].between(25, 54))]

    # Identify subset in WV, OK, AZ, KY, NC, MS, and CO
    pt2 = pt[pt['GESTFIPS'].isin([54, 40, 4, 21, 37, 28, 8])]

    # 2006 dollar values are scaled by cpi; 2016 values are left unchanged
    adj = cpi if file[2] == 2006 else 1.0

    for g in [(pt, 'All States'), (pt2, 'Active States')]:
        col = f'{g[1]}, {file[2]}'

        # Weighted wage percentiles
        for q in [0.1, 0.25, 0.5, 0.75, 0.9]:
            stats.at[f'p{int(q * 100)}, wage', col] = round(wquantiles.quantile(
                g[0]['HRLY_WAGE'], g[0]['MARSUPWT'], q) * adj, 2)

        # Weighted mean wage among observations with a positive hourly wage
        tmp = g[0][g[0]['HRLY_WAGE'] > 0]
        stats.at['Mean', col] = round(np.average(
            tmp['HRLY_WAGE'], weights=tmp['MARSUPWT']) * adj, 2)

        # Unweighted sample size
        stats.at['n', col] = len(g[0])

        # Sum of weights, divided by 100 and rounded to the nearest hundred
        stats.at['Weighed n', col] = round(g[0]['MARSUPWT'].sum() / 100.0, -2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

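The warning above comes from assigning a column on `df`, which was created by filtering `data`. One common way to avoid it, sketched here on a toy frame with invented values, is to take an explicit copy of the filtered rows before adding or converting columns. This is a suggestion, not part of the original cell.

```python
import pandas as pd

# Toy stand-in for the parsed file: record type and raw weight
data = pd.DataFrame({'PRECORD': [1, 3, 3],
                     'MARSUPWT': [0, 150000, 0]})

# .copy() makes df independent of data, so later assignments are unambiguous
df = data.loc[data['PRECORD'] == 3].copy()
df['MARSUPWT'] = df['MARSUPWT'].astype(float)

# Keep only observations with a weight > 0
df = df[df['MARSUPWT'] > 0]
print(df)
```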

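The loop above fills in `GESTFIPS` on person rows by carrying the value forward from the preceding household row. Here is a minimal sketch of the same idea on invented data. It uses `mask` and `ffill` rather than the `replace(..., method='ffill')` call in the notebook, which later pandas versions deprecate. This is an alternative spelling, not the author's code.

```python
import pandas as pd

# Toy rows: household (1), family (2), person (3);
# the state code is present only on household rows
toy = pd.DataFrame({'PRECORD':  [1, 2, 3, 3, 1, 2, 3],
                    'GESTFIPS': [54, 0, 0, 0, 40, 0, 0]})

# Treat 0 as missing, then carry the last household's state code forward
state = toy['GESTFIPS'].mask(toy['GESTFIPS'] == 0).ffill().astype(int)
toy['GESTFIPS'] = state

persons = toy[toy['PRECORD'] == 3]
print(persons['GESTFIPS'].tolist())
```

The forward fill relies on the rows staying in file order, so it has to happen before person records are filtered out.
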
In [4]:
stats

| | All States, 2006 | Active States, 2006 | All States, 2016 | Active States, 2016 |
|---|---|---|---|---|
| p10, wage | 10.52 | 9.2 | 12.02 | 11.99 |
| p25, wage | 17.01 | 15.06 | 17.5 | 15.84 |
| p50, wage | 22.9 | 20.04 | 23.6 | 20.53 |
| p75, wage | 31.49 | 26.43 | 31.25 | 25.63 |
| p90, wage | 40.93 | 37.31 | 43.27 | 34.73 |
| Mean | 24.47 | 21.34 | 27.11 | 23.91 |
| n | 2594.0 | 256.0 | 2191.0 | 267.0 |
| Weighed n | 3570300.0 | 381000.0 | 3577900.0 | 388900.0 |

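The percentile rows above are weighted by `MARSUPWT`, so each respondent counts in proportion to the population they represent. Here is a minimal sketch of one simple definition of a weighted quantile, with no interpolation. The `weighted_quantile` helper and the toy wages are hypothetical, and the `wquantiles` package may use a different interpolation rule, so its results need not match this sketch exactly.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]


wages = [10.0, 20.0, 30.0, 40.0]
weights = [1.0, 1.0, 1.0, 5.0]

# With equal weights the median falls on the second value;
# with a heavy weight on the top earner it moves up to that value
unweighted_median = weighted_quantile(wages, [1.0] * 4, 0.5)
weighted_median = weighted_quantile(wages, weights, 0.5)
print(unweighted_median, weighted_median)
```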