## bd econ CPS extract

bd_CPS_benchmark.ipynb

March 8, 2019

Contact: Brian Dew, @bd_econ

Requires: `cpsYYYY.ft` files for each year. The bd CPS files are generated by bd_CPS_reader.ipynb

-----

See [readme](https://github.com/bdecon/econ_data/tree/master/bd_CPS) for documentation.

In [1]:
import pandas as pd
print('pandas:', pd.__version__)
import numpy as np
print('numpy:', np.__version__)
import wquantiles
import os

os.chdir('/home/brian/Documents/CPS/data/clean/')

pandas: 0.24.1
numpy: 1.16.2


### 1994-onward extracts

#### Benchmark 1

In October 1999, how many people were unemployed because of losing a job?

BLS: LNU03023621: 2,162,000

In [2]:
(pd.read_feather('cps1999.ft')
   .query('MONTH==10 and UNEMPTYPE == "Job Loser"')
   ['BASICWGT']).sum()

2161502.5
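
BLS publishes levels in thousands, so a weighted sum can be checked after rounding to the nearest thousand. A minimal sketch; `matches_bls` and `tol_thousands` are hypothetical names, not part of the bd CPS code:

```python
def matches_bls(estimate, published, tol_thousands=0):
    """Compare a weighted sum with a BLS level published in thousands.

    Hypothetical helper for illustration only.
    """
    # Round the raw estimate to the nearest thousand persons
    rounded = round(estimate / 1000)
    # Allow the result to differ by at most tol_thousands
    return abs(rounded - published / 1000) <= tol_thousands

# Benchmark 1: extract result vs. BLS LNU03023621
print(matches_bls(2161502.5, 2_162_000))
```

A nonzero `tol_thousands` can be used for benchmarks that are close but not exact.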

#### Benchmark 2

In February 2007, what share of age 25-54 women were employed?

BLS: LNU02300062: 72.6

In [3]:
df = (pd.read_feather('cps2007.ft')
        .query('MONTH==2 and 25 <= AGE <= 54 and FEMALE==1')
        .groupby('LFS')
        .BASICWGT.sum())

df['Employed'] / df.sum()

0.72602457

#### Benchmark 3

In May 2014, how many people had more than one job?

BLS: LNU02026619: 7,305,000

In [4]:
(pd.read_feather('cps2014.ft')
   .query('MONTH==5 and MJH==1')
   .BASICWGT).sum()

7304317.5

#### Benchmark 4

In 2017 Q1, what were the nominal median usual weekly earnings?

BLS: LEU0252881500: $865

In [5]:
df = (pd.read_feather('cps2017.ft')
        .query('MONTH < 4 and WKWAGE > 0 and WORKFT == 1'))

wquantiles.median(df['WKWAGE'], df['PWORWGT'])

865.3800048828125
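
For reference, a weighted median can be sketched in plain numpy. This version returns the smallest value whose cumulative weight reaches half the total; it is not necessarily the rule `wquantiles.median` applies, so results may differ slightly. The function name and toy data are made up for illustration:

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Sort values and carry the weights along
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_wgt = np.cumsum(weights)
    # First position where cumulative weight >= half of total weight
    idx = np.searchsorted(cum_wgt, 0.5 * cum_wgt[-1])
    return values[idx]

# Toy data: the heavily weighted 600 pulls the median down
wages = [400, 600, 900, 1500]
wgts = [1000, 3000, 1500, 500]
print(weighted_median(wages, wgts))  # 600.0
```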

In [6]:
# Sophisticated version
def binned_wage(group):
    """Return BLS-styled binned median wage"""
    weight = 'PWORWGT'
    wage_var = 'WKWAGE'
    percentile = 0.5
    bin_size = 50
    bins = list(np.arange(25, 3000, bin_size))
    # Cut wage series according to bins of bin_size
    bin_cut = lambda x: pd.cut(x[wage_var], bins, include_lowest=True)
    
    # Calculate cumulative sum for weight variable
    cum_sum = lambda x: x[weight].cumsum()
    
    # Sort wages then apply bin_cut and cum_sum
    df = (group.sort_values(wage_var)
               .assign(WAGE_BIN = bin_cut, CS = cum_sum))
    
    # Find the weight at the percentile of interest
    pct_wgt = df[weight].sum() * percentile

    # Find wage bin for person nearest to weighted percentile
    pct_bin = df.iloc[df['CS'].searchsorted(pct_wgt)].WAGE_BIN
    
    # Weight at bottom and top of bin
    wgt_btm, wgt_top = (df.loc[df['WAGE_BIN'] == pct_bin, 'CS']
                          .iloc[[0, -1]].values)
    
    # Find where in the bin the percentile is and return that value
    pct_value = ((((pct_wgt - wgt_btm) / 
                   (wgt_top - wgt_btm)) * bin_size) + pct_bin.left)
    return pct_value

binned_wage(df)

867.56760484129
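
The idea behind the binned version is to find the bin that contains the weighted 50th percentile and interpolate linearly within it. Below is a simplified variant on toy data. It interpolates between the cumulative weight at the lower and upper edges of the bin, whereas the cell above uses the cumulative weights of the first and last persons in the bin, so the two can give slightly different values. `binned_median` and its arguments are illustrative names:

```python
import numpy as np

def binned_median(values, weights, bin_size=50, start=25):
    """Weighted median interpolated within fixed-width bins (sketch)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Bin edges at start, start + bin_size, ... covering the largest value
    # (values below `start` would be dropped by np.histogram)
    edges = np.arange(start, values.max() + 2 * bin_size, bin_size)
    # Total weight falling in each bin
    bin_wgt, _ = np.histogram(values, bins=edges, weights=weights)
    cum_wgt = np.cumsum(bin_wgt)
    half = 0.5 * cum_wgt[-1]
    # First bin whose cumulative weight reaches half of the total
    k = int(np.searchsorted(cum_wgt, half))
    below = cum_wgt[k - 1] if k > 0 else 0.0
    # Linear interpolation between the bin's lower and upper edge
    return edges[k] + (half - below) / bin_wgt[k] * bin_size

wages = [300, 830, 850, 870, 890, 1400]
wgts = [2, 1, 1, 1, 1, 2]
print(binned_median(wages, wgts))
```

In the toy data the median falls in the 825 to 875 bin: weight 2 lies below the bin, weight 3 inside it, and half the total is 4, so the result is 825 + (2 / 3) * 50.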

#### Benchmark 5

In April 2007, what was the unemployment rate for native-born Hispanic or Latino people?

BLS: LNU04073425: 5.6 

In [7]:
df = (pd.read_feather('cps2007.ft')
        .query('MONTH == 4 and FORBORN == 0 and WBHAO == "Hispanic"')
        .groupby('LFS')
        .BASICWGT.sum())

df['Unemployed'].sum() / (df['Unemployed'].sum() + df['Employed'].sum())

0.055565815
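
The unemployment rate calculation recurs in several benchmarks, so it could be wrapped in a small helper. A sketch on toy data, assuming the `LFS` labels used above; `unemployment_rate` is a hypothetical name:

```python
import pandas as pd

def unemployment_rate(df, weight='BASICWGT'):
    """Weighted unemployed divided by weighted labor force."""
    totals = df.groupby('LFS')[weight].sum()
    # Labor force excludes people not in the labor force (NILF)
    labor_force = totals['Unemployed'] + totals['Employed']
    return totals['Unemployed'] / labor_force

toy = pd.DataFrame({
    'LFS': ['Employed', 'Employed', 'Unemployed', 'NILF'],
    'BASICWGT': [1500.0, 2500.0, 250.0, 3000.0]})

print(unemployment_rate(toy))  # 250 / 4250
```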

#### Benchmark 6

In 2017, what was the union membership rate for black men?

BLS: LUU0204905200: 13.7

In [8]:
df = (pd.read_feather('cps2017.ft')
        .query('UNIONMEM >= 0 and WBHAOM == "Black" and FEMALE == 0')
        .groupby('UNIONMEM')
        .PWORWGT.sum())

df[1] / df.sum()

0.13706622

#### Benchmark 7

In November 2015, on average, how many hours did usually employed full-time married (spouse present) men work?

BLS: LNU02533629: 44.1

EDIT: Tested and works (44.10250044951661), but removed because it requires two additional variables.

In [9]:
#df = (pd.read_feather('cps2015.ft')
#        .query('MONTH == 11 and PRFTLF == 1 and PRMARSTA in [1, 2] '
#               'and FEMALE == 0 and PRAGNA == 2 and HRSACTT > 0'))
#
#np.average(df['HRSACTT'], weights=df['BASICWGT'])

#### Benchmark 8

In 2017, what was the median hourly wage for 45 to 54 year old female wage and salary workers paid hourly rates?

BLS: LEU0207640900: $15.16

In [10]:
df = (pd.read_feather('cps2017.ft')
        .query('45 <= AGE < 54 and FEMALE == 1 and PAIDHRLY == 1 '
               'and COW1 not in ["Self-employed Incorporated", "Without Pay"]'))

wquantiles.median(df['HRWAGE'], weights=df['PWORWGT'])

15.0

In [11]:
# Sophisticated version
def binned_wage(group):
    """Return BLS-styled binned median wage"""
    weight = 'PWORWGT'
    wage_var = 'HRWAGE'
    decile = 0.5
    bin_size = 0.5
    bins = list(np.arange(.25, 300, bin_size))
    bin_cut = lambda x: pd.cut(x[wage_var], bins, include_lowest=True)
    cum_sum = lambda x: x[weight].cumsum()
    dft = (group.sort_values(wage_var)
                .assign(WAGE_RANGE = bin_cut, CS = cum_sum))
    dec_point = dft[weight].sum() * decile
    dec_bin = (dft.iloc[(dft['CS'] - dec_point).abs().argsort()[:1]]
                  .WAGE_RANGE.values[0])
    wage_bins = list(dft['WAGE_RANGE'].unique())
    dec_loc = wage_bins.index(dec_bin)
    bin_below = dft[dft['WAGE_RANGE'] == wage_bins[dec_loc-1]].iloc[-1].CS
    bin_above = dft[dft['WAGE_RANGE'] == wage_bins[dec_loc]].iloc[-1].CS
    dec_value = ((((dec_point - bin_below) / 
                   (bin_above - bin_below)) * bin_size) + dec_bin.left)
    return dec_value

binned_wage(df)

15.146724733885172

#### Benchmark 9

In 2018, how many employed people had a professional certification or license?

BLS [Table 49](https://www.bls.gov/cps/cpsaat49.htm): 37,556,000

In [12]:
(pd.read_feather('cps2018.ft')
   .query('LFS == "Employed" and CERT == 1')
   .BASICWGT.sum() / 12.0)

37523274.666666664
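
Dividing the yearly sum of weights by 12 gives the annual average of the monthly totals, provided all 12 months are present. A sketch on toy data showing the equivalent groupby form, which also works when only some months are loaded:

```python
import pandas as pd

# Toy stand-in for a filtered extract with three months of data
toy = pd.DataFrame({
    'MONTH': [1, 1, 2, 2, 3],
    'BASICWGT': [100.0, 200.0, 150.0, 150.0, 330.0]})

# Weighted total for each month, then the average across months
monthly = toy.groupby('MONTH')['BASICWGT'].sum()
annual_avg = monthly.mean()

# Same as the total weight divided by the number of months present
same = toy['BASICWGT'].sum() / toy['MONTH'].nunique()
print(annual_avg, same)
```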

#### Benchmark 10

In 2018, how many people were employed in Logging?

BLS [Table 18](https://www.bls.gov/cps/cpsaat18.htm): 112,000

In [13]:
(pd.read_feather('cps2018.ft')
   .query('IND == 270')
   .BASICWGT.sum() / 12.0)

118410.8125

#### Benchmark 11

In February 2012, what was the unemployment rate for veterans age 18+?

BLS: LNU04049526: 7.0%

In [14]:
df = (pd.read_feather('cps2012.ft')
        .query('MONTH == 2 and AGE >= 18 and VETERAN == 1')
        .groupby('LFS')
        .BASICWGT.sum())

df['Unemployed'] / (df['Unemployed'] + df['Employed'])

0.069386184

#### Benchmark 12 (doesn't work)

In November 2015, how many women moved from NILF to employed?

BLS: LNU07200002: 2,264,000

(from BLS: estimates use a weight calculated by BLS that is not publicly available.)

In [15]:
df = (pd.read_feather('cps2015.ft').query('FEMALE == 1'))

month1, month2 = 10, 11

# Collect total population number to reweight later
tot = df.query('MONTH == @month2').PWLGWGT.sum()
d1 = df.loc[df['MONTH'] == month1]
d2 = df.loc[df['MONTH'] == month2]
m = pd.merge(d1, d2, on=['CPSID','PULINENO'], how='inner')
m = m[(m['AGE_y'] >= m['AGE_x']) &
      (m['AGE_x'] <= m['AGE_y'] + 1)]
m_tot = m.PWLGWGT_y.sum()

# Filter annual data to keep only relevant month's data
d1 = df.loc[(df['MONTH'] == month1) & (df['LFS'] == 'NILF')]
d2 = df.loc[(df['MONTH'] == month2) & (df['LFS'] == 'Employed')]

# Combine the two months and check that the person matches
m = pd.merge(d1, d2, on=['CPSID','PULINENO'], how='inner')
m = m[(m['AGE_y'] >= m['AGE_x']) &
      (m['AGE_x'] <= m['AGE_y'] + 1)]

m['PWLGWGT_y'].sum() * (tot / m_tot)

2046516.1
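
The matching step can be illustrated on toy data. This sketch assumes the intent of the age check is that a matched person's age is unchanged or one year higher in the second month. That is stricter than the filter in the cell above, whose two conditions together only require `AGE_y >= AGE_x`. The frames `oct_df` and `nov_df` and their values are made up:

```python
import pandas as pd

oct_df = pd.DataFrame({
    'CPSID': [1, 1, 2, 3], 'PULINENO': [1, 2, 1, 1],
    'AGE': [30, 28, 45, 60],
    'LFS': ['NILF', 'Employed', 'NILF', 'NILF']})
nov_df = pd.DataFrame({
    'CPSID': [1, 1, 2, 3], 'PULINENO': [1, 2, 1, 1],
    'AGE': [30, 28, 46, 25],
    'LFS': ['Employed', 'Employed', 'Employed', 'Employed'],
    'PWLGWGT': [1000.0, 1200.0, 900.0, 800.0]})

# Match households and person line numbers across the two months
m = pd.merge(oct_df, nov_df, on=['CPSID', 'PULINENO'], how='inner')

# Assumed plausibility check: age is unchanged or one year higher
age_diff = m['AGE_y'] - m['AGE_x']
m = m[age_diff.between(0, 1)]

# Weighted count of people moving from NILF to employed
flows = m[(m['LFS_x'] == 'NILF') & (m['LFS_y'] == 'Employed')]
nilf_to_emp = flows['PWLGWGT'].sum()
print(nilf_to_emp)  # 1900.0
```

The person whose age drops from 60 to 25 is treated as a false match and excluded.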

#### Benchmark 13

In 2018, how many people age 16-64 with a disability were (on average) unemployed?

[BLS](https://www.bls.gov/news.release/disabl.htm): 445,000

In [16]:
(pd.read_feather('cps2018.ft')
   .query('16 <= AGE <= 64 and DISABILITY == 1 and LFS == "Unemployed"')
   .BASICWGT.sum() / 12)

443844.1666666667

#### Benchmark 14

In May 2012, how many people were not in the labor force?

[FRB PHL](https://www.phil.frb.org/-/media/research-and-data/publications/research-rap/2014/constructing-the-reason-for-nonparticipation-variable-using-the-monthly-cps-pdf.pdf?la=en): 87,698,000

In [17]:
(pd.read_feather('cps2012.ft')
   .query('LFS == "NILF" and AGE >= 16 and MONTH == 5')
   .BASICWGT.sum())

87967540.0

#### Benchmark 15

In 2018, how many women age 25 and over who were paid hourly rates earned at or below the prevailing Federal minimum wage?

[BLS](https://www.bls.gov/cps/cpsaat44.htm): 575,000

In [18]:
(pd.read_feather('cps2018.ft')
   .query('AGE >= 25 and FEMALE == 1 and MINWAGE == 1 and PAIDHRLY == 1')
   .PWORWGT.sum() / 12)

556591.4166666666

#### Benchmark 16

In December 2018, how many 16-19 year olds were employed?

BLS: LNU02000012: 5,023,000

In [19]:
(pd.read_feather('cps2018.ft')
   .query('LFS == "Employed" and 16 <= AGE <= 19 and MONTH == 12')
   .BASICWGT.sum())

5022600.0

#### Benchmark 17

In January 2001, how many people were part-time for economic reasons?

BLS: LNU02032194: 3,732,000

In [20]:
(pd.read_feather('cps2001.ft')
   .query('PTECON == 1 and MONTH == 1')
   .BASICWGT.sum())

3731719.2

#### Benchmark 18

In November 2017, how many people of Hispanic or Latino ethnicity were not in the labor force?

BLS: LNU05000009: 14,272,000

In [21]:
(pd.read_feather('cps2017.ft')
   .query('WBHAO == "Hispanic" and MONTH == 11 and LFS=="NILF"')
   .BASICWGT.sum())

14271504.0

#### Benchmark 19

In 2018 Q4, how many women were employed full-time in production occupations and what was their median usual weekly wage?

BLS: LEU0254726500: 2,049,000

BLS: LEU0254779900: $594.00

In [22]:
Q4 = [10, 11, 12]
query = 'FEMALE == 1 and MONTH in @Q4 and HRSUSL1 >= 35 and OCCD == 21'
df = pd.read_feather('cps2018.ft').query(query)
   
df.BASICWGT.sum() / 3

2047336.1666666667

In [23]:
# Sophisticated version
def binned_wage(group):
    """Return BLS-styled binned median wage"""
    weight = 'PWORWGT'
    wage_var = 'WKWAGE'
    percentile = 0.5
    bin_size = 50
    bins = list(np.arange(25, 3000, bin_size))
    # Cut wage series according to bins of bin_size
    bin_cut = lambda x: pd.cut(x[wage_var], bins, include_lowest=True)
    
    # Calculate cumulative sum for weight variable
    cum_sum = lambda x: x[weight].cumsum()
    
    # Sort wages then apply bin_cut and cum_sum
    df = (group.sort_values(wage_var)
               .assign(WAGE_BIN = bin_cut, CS = cum_sum))
    
    # Find the weight at the percentile of interest
    pct_wgt = df[weight].sum() * percentile

    # Find wage bin for person nearest to weighted percentile
    pct_bin = df.iloc[df['CS'].searchsorted(pct_wgt)].WAGE_BIN
    
    # Weight at bottom and top of bin
    wgt_btm, wgt_top = (df.loc[df['WAGE_BIN'] == pct_bin, 'CS']
                          .iloc[[0, -1]].values)
    
    # Find where in the bin the percentile is and return that value
    pct_value = ((((pct_wgt - wgt_btm) / 
                   (wgt_top - wgt_btm)) * bin_size) + pct_bin.left)
    return pct_value

binned_wage(df)

597.1703455307022

#### Benchmark 20

In the year ending February 2018, how many people age 25-54 were not in the labor force because of disability?

Tedeschi: 6,700,000

In [9]:
(pd.read_feather('cps2017.ft')
   .query('25 <= AGE <= 54 and NILFREASON == "Disabled/Ill" and MONTH > 2')
   .append(pd.read_feather('cps2018.ft')
   .query('25 <= AGE <= 54 and NILFREASON == "Disabled/Ill" and MONTH <= 2'))
   .BASICWGT.sum() / 12)

6707834.666666667
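
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above runs only on older versions such as the 0.24.1 listed at the top. On newer versions the two partial years can be combined with `pd.concat`. A sketch with toy stand-ins for the two filtered extracts:

```python
import pandas as pd

# Toy stand-ins for the filtered 2017 (MONTH > 2) and 2018 (MONTH <= 2) data
part_2017 = pd.DataFrame({'MONTH': [3, 12], 'BASICWGT': [10.0, 20.0]})
part_2018 = pd.DataFrame({'MONTH': [1, 2], 'BASICWGT': [30.0, 40.0]})

# Stack the two pieces; ignore_index gives a clean 0..n-1 index
combined = pd.concat([part_2017, part_2018], ignore_index=True)
print(combined['BASICWGT'].sum())  # 100.0
```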

#### Benchmark 21

In February 2018, how many discouraged workers were there?

BLS: 373,000

In [12]:
(pd.read_feather('cps2018.ft')
   .query('NILFREASON == "Discouraged" and MONTH == 2')
   .BASICWGT.sum() / 12)

397784.2916666667

### 1989-93 Extracts

#### Benchmark 1

How many women age 20-24 were employed in June 1992?

BLS: LNU02000038: 6,190,000

In [24]:
(pd.read_feather('cps1992.ft')
   .query('MONTH == 6 and FEMALE == 1 and 20 <= AGE <= 24 '
          'and LFS == "Employed"')).BASICWGT.sum()

6144925.0

#### Benchmark 2

What was the unemployment rate in February 1989?

BLS: LNU04000000: 5.6%

In [25]:
df = (pd.read_feather('cps1989.ft')
        .query('MONTH == 2 and AGE > 15')
        .groupby('LFS')
        .BASICWGT.sum())

df['Unemployed'].sum() / (df['Unemployed'].sum() + df['Employed'].sum())

0.05666313

#### Benchmark 3

In December 1990, what was the unemployment rate (U-2) if you only count people who lost jobs or completed temporary jobs?

BLS: LNU04023621: 3.2%

In [26]:
df = (pd.read_feather('cps1990.ft')
        .query('MONTH == 12 and AGE > 15'))

unjl = df.query('UNEMPTYPE=="Job Loser"').BASICWGT.sum()

lf = df.query('LFS in ["Unemployed", "Employed"]').BASICWGT.sum()

unjl / lf

0.03193486

#### Benchmark 4

In 1991, what share of wage and salary workers were represented by a union?

BLS: LUU0204899700: 18.1%

In [27]:
df = (pd.read_feather('cps1991.ft')
        .query('AGE > 15 and UNION >= 0'))

uncov = df.query('UNION == 1').PWORWGT.sum()

total = df.PWORWGT.sum()

uncov/total

0.18115489

#### Benchmark 5

In July 1989, how many people were unemployed for 27 weeks or more?

BLS: LNU03008636: 616,000

In [28]:
(pd.read_feather('cps1989.ft')
   .query('MONTH == 7 and UNEMPDUR >= 27')
   .BASICWGT.sum())

607400.75

#### Benchmark 6

In Q2 1992, what were the nominal median usual weekly earnings?

BLS: LEU0252881500: $436

In [29]:
months = [4, 5, 6]

df = (pd.read_feather('cps1992.ft')
        .query('MONTH in @months and WKWAGE > 0 and WORKFT == 1'))

wquantiles.median(df['WKWAGE'], df['PWORWGT'])

435.0

In [30]:
# Sophisticated version
def binned_wage(group):
    """Return BLS-styled binned median wage"""
    weight = 'PWORWGT'
    wage_var = 'WKWAGE'
    percentile = 0.5
    bin_size = 25
    bins = list(np.arange(12.5, 3000, bin_size))
    # Cut wage series according to bins of bin_size
    bin_cut = lambda x: pd.cut(x[wage_var], bins, include_lowest=True)
    
    # Calculate cumulative sum for weight variable
    cum_sum = lambda x: x[weight].cumsum()
    
    # Sort wages then apply bin_cut and cum_sum
    df = (group.sort_values(wage_var)
               .assign(WAGE_BIN = bin_cut, CS = cum_sum))
    
    # Find the weight at the percentile of interest
    pct_wgt = df[weight].sum() * percentile

    # Find wage bin for person nearest to weighted percentile
    pct_bin = df.iloc[df['CS'].searchsorted(pct_wgt)].WAGE_BIN
    
    # Weight at bottom and top of bin
    wgt_btm, wgt_top = (df.loc[df['WAGE_BIN'] == pct_bin, 'CS']
                          .iloc[[0, -1]].values)
    
    # Find where in the bin the percentile is and return that value
    pct_value = ((((pct_wgt - wgt_btm) / 
                   (wgt_top - wgt_btm)) * bin_size) + pct_bin.left)
    return pct_value

binned_wage(df)

436.6196248779963

#### Benchmark 7

In September 1989, how many people were part-time for economic reasons?

BLS: LNU02032194: 4,487,000

In [31]:
(pd.read_feather('cps1989.ft')
   .query('MONTH == 9 and PTECON == 1')
   .BASICWGT.sum())

4436614.5

#### Benchmark 8

In December 1991, how many black or African-American people were not in the labor force?

BLS: LNU05000006: 8,153,000

**NOTE**: The gap arises because Hispanic Black people are excluded from WBHAO == 'Black'.

Using RACE == 2 instead, the result is 8119624.0

In [32]:
(pd.read_feather('cps1991.ft')
   .query('MONTH == 12 and WBHAO == "Black" and LFS == "NILF" and AGE > 15')
   .BASICWGT.sum())

7996622.0

#### Benchmark 9

In February 1990, how many people were employed in service occupations?

BLS: LNU02032204: 17,545,000

In [33]:
(pd.read_feather('cps1990.ft')
   .query('MONTH == 2 and LFS == "Employed" and OCC80M in [6, 7, 8]')
   .BASICWGT.sum()) * (1 / 0.913) # BLS historical comparability

17293907.99561884

#### Benchmark 10 - CEPRdata

In May 1993, how many Hispanic people were unemployed (using the ORG sample)?

In [34]:
os.chdir('/home/brian/Documents/CPS/data/')

keep_cols = ['month', 'wbhao', 'unem', 'orgwgt']
(pd.read_stata('cepr_org_1993.dta', columns=keep_cols)
   .query('month == 5 and wbhao == "Hispanic" and unem == 1')
   .orgwgt.sum())

1354661.4

In [35]:
(pd.read_feather('clean/cps1993.ft')
   .query('MONTH == 5 and LFS == "Unemployed" and WBHAO == "Hispanic" and AGE > 15')
   .PWORWGT.sum())

1354661.2
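
The two estimates differ only in the last printed digit. One way to check agreement programmatically is a relative-tolerance comparison; the tolerance below is an arbitrary choice for illustration:

```python
import math

cepr_estimate = 1354661.4   # CEPR ORG extract, from the cell above
bd_estimate = 1354661.2     # bd CPS extract, from the cell above

# Relative tolerance of one part per million (arbitrary choice)
print(math.isclose(cepr_estimate, bd_estimate, rel_tol=1e-6))  # True
```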