## bd econ CPS extract

bd_CPS_benchmark.ipynb

February 12, 2018

Contact: Brian Dew, @bd_econ

Requires: `cpsYYYY.ft` files for each year. The bd CPS files are generated by `bd_CPS_reader.ipynb`.

-----

See [readme](https://github.com/bdecon/econ_data/tree/master/bd_CPS) for documentation.

In [1]:
import pandas as pd
print('pandas:', pd.__version__)
import numpy as np
print('numpy:', np.__version__)
import wquantiles
import os

os.chdir('/home/brian/Documents/CPS/data/clean/')

pandas: 0.24.1
numpy: 1.15.4


### 1994-onward extracts

#### Benchmark 1

In October 1999, how many people were unemployed because of losing a job?

BLS: LNU03023621: 2,162,000

In [2]:
(pd.read_feather('cps1999.ft')
   .query('MONTH==10 and UNEMPTYPE == "Job Loser"')
   ['BASICWGT']).sum()

2161502.5
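BLS publishes this series in thousands, so the check is whether the weighted sum rounds to the published figure. A minimal sketch of that comparison follows; the helper `round_to_unit` is a hypothetical name and not part of the extract code.

```python
def round_to_unit(value, unit=1000):
    """Round a positive value to the nearest multiple of unit (halves round up)."""
    return int(value / unit + 0.5) * unit

estimate = 2161502.5   # weighted sum from the cell above
published = 2162000    # LNU03023621, as quoted above

rounded = round_to_unit(estimate)
print(rounded, rounded == published)  # 2162000 True
```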

#### Benchmark 2

In February 2007, what share of age 25-54 women were employed?

BLS: LNU02300062: 72.6

In [3]:
df = (pd.read_feather('cps2007.ft')
        .query('MONTH==2 and 25 <= AGE <= 54 and FEMALE==1')
        .groupby('LFS')
        .BASICWGT.sum())

df['Employed'] / df.sum()

0.72602457
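The share is the employed weight divided by the weight of everyone in the group, including people not in the labor force. The sketch below reproduces that logic in plain Python on made-up records; the statuses and weights are hypothetical and `weighted_shares` is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical person records: (labor force status, sample weight)
records = [
    ("Employed", 1500.0),
    ("Employed", 2100.0),
    ("Unemployed", 400.0),
    ("NILF", 1000.0),
]

def weighted_shares(rows):
    """Sum weights by status, then divide each total by the grand total."""
    totals = defaultdict(float)
    for status, weight in rows:
        totals[status] += weight
    grand_total = sum(totals.values())
    return {status: total / grand_total for status, total in totals.items()}

shares = weighted_shares(records)
print(shares["Employed"])
```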

#### Benchmark 3

In May 2014, how many people had more than one job?

BLS: LNU02026619: 7,305,000

In [4]:
(pd.read_feather('cps2014.ft')
   .query('MONTH==5 and MJH==1')
   .BASICWGT).sum()

7304317.5

#### Benchmark 4

In 2017 Q1, what were the nominal median usual weekly earnings?

BLS: LEU0252881500: $865

In [5]:
df = (pd.read_feather('cps2017.ft')
        .query('MONTH < 4 and WKWAGE > 0 and WORKFT == 1'))

wquantiles.median(df['WKWAGE'], df['PWORWGT'])

865.3800048828125
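For intuition, a weighted median is the wage at which the cumulative weight reaches half of the total weight. The sketch below is a simplified stand-in, not the `wquantiles` implementation (which may interpolate between observations); the wages and weights are made up.

```python
def weighted_median(values, weights):
    """Return the smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value

# Hypothetical weekly wages and weights: the heavily weighted
# observation pulls the median above the unweighted middle value (800).
wages = [400, 800, 1200]
weights = [1, 1, 5]
print(weighted_median(wages, weights))
```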

In [6]:
# Sophisticated version
def binned_wage(group):
    """Return BLS-styled binned median wage"""
    weight = 'PWORWGT'
    wage_var = 'WKWAGE'
    percentile = 0.5
    bin_size = 50
    bins = list(np.arange(25, 3000, bin_size))
    # Cut wage series according to bins of bin_size
    bin_cut = lambda x: pd.cut(x[wage_var], bins, include_lowest=True)
    
    # Calculate cumulative sum for weight variable
    cum_sum = lambda x: x[weight].cumsum()
    
    # Sort wages then apply bin_cut and cum_sum
    df = (group.sort_values(wage_var)
               .assign(WAGE_BIN = bin_cut, CS = cum_sum))
    
    # Find the weight at the percentile of interest
    pct_wgt = df[weight].sum() * percentile

    # Find wage bin for person nearest to weighted percentile
    pct_bin = df.iloc[df['CS'].searchsorted(pct_wgt)].WAGE_BIN
    
    # Weight at bottom and top of bin
    wgt_btm, wgt_top = (df.loc[df['WAGE_BIN'] == pct_bin, 'CS']
                          .iloc[[0, -1]].values)
    
    # Find where in the bin the percentile is and return that value
    pct_value = ((((pct_wgt - wgt_btm) / 
                   (wgt_top - wgt_btm)) * bin_size) + pct_bin.left)
    return pct_value

binned_wage(df)

867.56760484129
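The binned version places the weighted midpoint inside a fixed-width wage bin by linear interpolation. The sketch below uses the textbook grouped-median formula on made-up data, assuming left-closed bins; it is not identical to the cell above, which uses `pd.cut` (right-closed bins) and measures the bin's weight from its first and last cumulative sums.

```python
def binned_median(values, weights, bin_size, first_edge):
    """Grouped median: interpolate linearly within the bin holding the midpoint."""
    half = sum(weights) / 2.0

    # Total weight falling in each bin, keyed by bin index
    bin_weights = {}
    for value, weight in zip(values, weights):
        idx = int((value - first_edge) // bin_size)
        bin_weights[idx] = bin_weights.get(idx, 0.0) + weight

    below = 0.0  # cumulative weight in all lower bins
    for idx in sorted(bin_weights):
        in_bin = bin_weights[idx]
        if below + in_bin >= half:
            left = first_edge + idx * bin_size
            return left + (half - below) / in_bin * bin_size
        below += in_bin

# Hypothetical wages with equal weights; bins are 825-875, 875-925, ...
wages = [830, 840, 860, 880, 930]
weights = [1, 1, 1, 1, 1]
result = binned_median(wages, weights, bin_size=50, first_edge=25)
print(result)
```

Because many respondents report round amounts, interpolating within a bin gives a smoother estimate than picking a single reported value.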

#### Benchmark 5

In April 2007, what was the unemployment rate for native-born Hispanic or Latino people?

BLS: LNU04073425: 5.6

In [7]:
df = (pd.read_feather('cps2007.ft')
        .query('MONTH == 4 and FORBORN == 0 and WBHAO == "Hispanic"')
        .groupby('LFS')
        .BASICWGT.sum())

df['Unemployed'].sum() / (df['Unemployed'].sum() + df['Employed'].sum())

0.055565815
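The unemployment rate divides the unemployed by the labor force (employed plus unemployed), so people not in the labor force drop out of the denominator. A minimal sketch with hypothetical weighted totals; `unemployment_rate` is an illustrative helper name.

```python
def unemployment_rate(weight_by_status):
    """Unemployed weight over labor force weight (employed + unemployed)."""
    unemployed = weight_by_status.get("Unemployed", 0.0)
    employed = weight_by_status.get("Employed", 0.0)
    return unemployed / (unemployed + employed)

# Hypothetical weighted totals by labor force status
totals = {"Employed": 9440.0, "Unemployed": 560.0, "NILF": 4000.0}
rate = unemployment_rate(totals)
print(round(rate * 100, 1))  # 5.6
```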

#### Benchmark 6

In 2017, what was the union membership rate for black men?

BLS: LUU0204905200: 13.7

In [8]:
df = (pd.read_feather('cps2017.ft')
        .query('PEERNLAB > 0 and WBHAOM == "Black" and FEMALE == 0')
        .groupby('PEERNLAB')
        .PWORWGT.sum())

df[1] / df.sum()

0.13706622

#### Benchmark 7

In November 2015, how many hours, on average, did married (spouse present) men who are usually employed full-time work?

BLS: LNU02533629: 44.1

EDIT: Tested and works (44.10250044951661), but removed because it adds two variables.

In [9]:
#df = (pd.read_feather('cps2015.ft')
#        .query('MONTH == 11 and PRFTLF == 1 and PRMARSTA in [1, 2] '
#               'and FEMALE == 0 and PRAGNA == 2 and HRSACTT > 0'))
#
#np.average(df['HRSACTT'], weights=df['BASICWGT'])

#### Benchmark 8

In 2017, what was the median hourly wage for 45- to 54-year-old female wage and salary workers paid hourly rates?

BLS: LEU0207640900: $15.16

In [10]:
df = (pd.read_feather('cps2017.ft')
        .query('45 <= AGE < 54 and FEMALE == 1 and PRERNHLY > 0 '
               'and COW1 not in ["Self-employed Incorporated", "Without Pay"]')
        .assign(HRWAGE_HRLY = lambda x: x['PRERNHLY'] / 100.0))

wquantiles.median(df['HRWAGE_HRLY'], weights=df['PWORWGT'])

15.0

In [11]:
# Sophisticated version
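def binned_wage(group):
    """Return BLS-styled binned median wage"""
    weight = 'PWORWGT'
    wage_var = 'HRWAGE_HRLY'
    decile = 0.5
    bin_size = 0.5
    bins = list(np.arange(.25, 300, bin_size))
    # Cut wage series according to bins of bin_size
    bin_cut = lambda x: pd.cut(x[wage_var], bins, include_lowest=True)
    # Calculate cumulative sum for weight variable
    cum_sum = lambda x: x[weight].cumsum()
    # Sort wages then apply bin_cut and cum_sum
    dft = (group.sort_values(wage_var)
                .assign(WAGE_RANGE = bin_cut, CS = cum_sum))
    # Find the weight at the percentile of interest
    dec_point = dft[weight].sum() * decile
    # Find wage bin for person nearest to weighted percentile
    dec_bin = (dft.iloc[(dft['CS'] - dec_point).abs().argsort()[:1]]
                  .WAGE_RANGE.values[0])
    # Observed bins in ascending wage order (dft is sorted by wage)
    wage_bins = list(dft['WAGE_RANGE'].unique())
    dec_loc = wage_bins.index(dec_bin)
    # Cumulative weight at the top of the bin below and of the percentile bin
    bin_below = dft[dft['WAGE_RANGE'] == wage_bins[dec_loc-1]].iloc[-1].CS
    bin_above = dft[dft['WAGE_RANGE'] == wage_bins[dec_loc]].iloc[-1].CS
    # Find where in the bin the percentile is and return that value
    dec_value = ((((dec_point - bin_below) /
                   (bin_above - bin_below)) * bin_size) + dec_bin.left)
    return dec_value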

binned_wage(df)

15.144216310106446

#### Benchmark 9

In 2018, how many employed people had a professional certification or license?

BLS [Table 49](https://www.bls.gov/cps/cpsaat49.htm): 37,556,000

In [12]:
(pd.read_feather('cps2018.ft')
   .query('LFS == "Employed" and CERT == 1')
   .BASICWGT.sum() / 12.0)

37523274.666666664
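Dividing the pooled weighted sum by 12 gives an annual average because the yearly file stacks twelve monthly samples, each weighted to that month's population. The sketch below checks, on made-up records, that the pooled sum divided by 12 equals the mean of the twelve monthly totals; the record layout is hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (month, weight), two people per month
records = [(month, weight)
           for month in range(1, 13)
           for weight in (50.0, 25.0 + month)]

# Pool all twelve months, then divide by 12
pooled_average = sum(weight for _, weight in records) / 12.0

# Equivalent: total each month separately, then average the monthly totals
by_month = defaultdict(float)
for month, weight in records:
    by_month[month] += weight
mean_of_months = sum(by_month.values()) / len(by_month)

print(pooled_average, mean_of_months)
```

The two agree only when all twelve months are present in the file.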

#### Benchmark 10

In 2018, how many people were employed in Logging?

BLS [Table 18](https://www.bls.gov/cps/cpsaat18.htm): 112,000

In [13]:
(pd.read_feather('cps2018.ft')
   .query('PEIO1ICD == 270')
   .BASICWGT.sum() / 12.0)

118410.8125

#### Benchmark 11

In February 2012, what was the unemployment rate for veterans age 18+?

BLS: LNU04049526: 7.0%

In [14]:
df = (pd.read_feather('cps2012.ft')
        .query('MONTH == 2 and AGE >= 18 and VETERAN == 1')
        .groupby('LFS')
        .BASICWGT.sum())

df['Unemployed'] / (df['Unemployed'].sum() + df['Employed'].sum())

0.069386184

### 1989-93 Extracts

#### Benchmark 1

How many women age 20-24 were employed in June 1992?

BLS: LNU02000038: 6,190,000

In [15]:
(pd.read_feather('cps1992.ft')
   .query('MONTH == 6 and FEMALE == 1 and 20 <= AGE <= 24 '
          'and LFSR in [1, 2]')).BASICWGT.sum()

6144925.0

#### Benchmark 2

What was the unemployment rate in February 1989?

BLS: LNU04000000: 5.6%

In [16]:
df = (pd.read_feather('cps1989.ft')
        .query('MONTH == 2 and AGE > 15')
        .groupby('LFSR')
        .BASICWGT.sum())

df[2:4].sum() / df[0:4].sum()

0.05666313

#### Benchmark 3

In December 1990, what was the unemployment rate (U-2) if you only count people who lost jobs or completed temporary jobs?

BLS: LNU04023621: 3.2%

In [17]:
df = (pd.read_feather('cps1990.ft')
        .query('MONTH == 12 and AGE > 15'))

unjl = df.query('UNEMPTYPE=="Job Loser"').BASICWGT.sum()

lf = df.query('LFSR in [1, 2, 3, 4]').BASICWGT.sum()

unjl / lf

0.03193486

#### Benchmark 4

In 1991, what share of wage and salary workers were represented by a union?

BLS: LUU0204899700: 18.1%

In [18]:
df = (pd.read_feather('cps1991.ft')
        .query('AGE > 15 and CLSWKR not in [-1, 5, 7, 8] and UNMEM > 0'))

uncov = df.query('UNMEM == 1 or UNCOV == 1').PWORWGT.sum()

total = df.PWORWGT.sum()

uncov/total

0.18115489

#### Benchmark 5

In July 1989, how many people were unemployed for 27 weeks or more?

BLS: LNU03008636: 616,000

In [19]:
(pd.read_feather('cps1989.ft')
   .query('MONTH == 7 and UNEMPDUR >= 27')
   .BASICWGT.sum())

607400.75

#### Benchmark 6

In Q2 1992, what were the median usual weekly earnings?

BLS: LEU0252881500: $436

In [20]:
months = [4, 5, 6]

df = (pd.read_feather('cps1992.ft')
        .query('MONTH in @months and WKWAGE > 0 and WORKFT == 1'))

wquantiles.median(df['WKWAGE'], df['PWORWGT'])

435.0

In [21]:
# Sophisticated version
def binned_wage(group):
    """Return BLS-styled binned median wage"""
    weight = 'PWORWGT'
    wage_var = 'WKWAGE'
    percentile = 0.5
    bin_size = 25
    bins = list(np.arange(12.5, 3000, bin_size))
    # Cut wage series according to bins of bin_size
    bin_cut = lambda x: pd.cut(x[wage_var], bins, include_lowest=True)
    
    # Calculate cumulative sum for weight variable
    cum_sum = lambda x: x[weight].cumsum()
    
    # Sort wages then apply bin_cut and cum_sum
    df = (group.sort_values(wage_var)
               .assign(WAGE_BIN = bin_cut, CS = cum_sum))
    
    # Find the weight at the percentile of interest
    pct_wgt = df[weight].sum() * percentile

    # Find wage bin for person nearest to weighted percentile
    pct_bin = df.iloc[df['CS'].searchsorted(pct_wgt)].WAGE_BIN
    
    # Weight at bottom and top of bin
    wgt_btm, wgt_top = (df.loc[df['WAGE_BIN'] == pct_bin, 'CS']
                          .iloc[[0, -1]].values)
    
    # Find where in the bin the percentile is and return that value
    pct_value = ((((pct_wgt - wgt_btm) / 
                   (wgt_top - wgt_btm)) * bin_size) + pct_bin.left)
    return pct_value

binned_wage(df)

436.6196248779963