Basic Monthly CPS Files: Reading, Adjusting, Benchmarking
=====

## Generate 2017 annual CPS early estimate from available basic monthly files

-----

*September 15, 2017*<br>
*Brian Dew, dew@cepr.net*


Census CPS Monthly files can be downloaded [here](http://thedataweb.rm.census.gov/ftp/cps_ftp.html). As of September 15, 2017, the latest available file is August 2017.

The data dictionary is [here](http://thedataweb.rm.census.gov/pub/cps/basic/201701-/January_2017_Record_Layout.txt); it describes each variable and its range of possible values.

To match the raw data to education categories, use [this](http://ceprdata.org/wp-content/cps/programs/basic/cepr_basic_educ.do) file, which is the CEPR program file for basic CPS education variables.

In [1]:
import pandas as pd
import numpy as np
import os

#os.chdir('/home/domestic-ra/Working/CPS_ORG/EPOPs/')
os.chdir('C:/Working/econ_data/micro/')

The CPS data files are in fixed-width format, so each variable occupies the same range of character positions in every row. To make the basic monthly CPS variable names correspond with the names used in the CEPR uniform extract, I rename the columns and map the raw CPS values to the values used in the CEPR extract.

Selected variables of interest from the data dictionary:

Variable | Len | Title/Name | Location
 :---|:---:|:---|:---
HRMONTH | 2 | MONTH OF INTERVIEW | 16-17
HRYEAR4 | 4 | YEAR OF INTERVIEW | 18-21
PRTAGE | 2 | PERSONS AGE | 122-123
PESEX | 2 | SEX (1 MALE, 2 FEMALE) | 129-130
PEEDUCA | 2 | HIGHEST LEVEL OF SCHOOL COMPLETED OR DEGREE RECEIVED | 137-138
PREMPNOT | 2 | MLR - EMPLOYED, UNEMPLOYED, OR NILF | 393-394
PRFTLF | 2 | FULL TIME LABOR FORCE | 397-398
PRERNWA | 8 | WEEKLY EARNINGS RECODE | 527-534
PWORWGT | 10 | OUTGOING ROTATION WEIGHT | 603-612
PWCMPWGT | 10 | COMPOSITED FINAL WEIGHT | 846-855

In [2]:
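# pandas uses 0-based, half-open ranges: subtract one from the first number
# of each data dictionary location and keep the second number as is.
# The last two ranges (sec_job, maj_occ_sec) are not listed in the table above.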
colspecs = [(15,17), (17,21), (121,123), (128,130), (136,138), (392,394), 
            (396,398), (526, 534), (602,612), (845,855), (867, 871), (487, 489)]
colnames = ['month', 'year', 'age', 'PESEX', 'PEEDUCA', 'PREMPNOT', 'PRFTLF', 
            'PRERNWA', 'orgwgt', 'fnlwgt', 'sec_job', 'maj_occ_sec']

educ_dict = {31: 'LTHS',
             32: 'LTHS',
             33: 'LTHS',
             34: 'LTHS',
             35: 'LTHS',
             36: 'LTHS',
             37: 'LTHS',
             38: 'HS',
             39: 'HS',
             40: 'Some college',
             41: 'Some college',
             42: 'Some college',
             43: 'College',
             44: 'Advanced',
             45: 'Advanced',
             46: 'Advanced',
            }

gender_dict = {1: 0, 2: 1}

empl_dict = {1: 1, 2: 0, 3: 0, 4: 0}
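
The conversion from data dictionary locations to `colspecs` can be sketched as a small helper. This is only an illustration: `location_to_colspec` is a hypothetical name, not part of the original notebook, and it assumes each location is written as `first-last` with 1-based, inclusive positions.

```python
def location_to_colspec(location):
    """Convert a location such as '122-123' to a (start, end) tuple."""
    first, last = (int(part) for part in location.split('-'))
    # read_fwf expects 0-based, half-open [start, end) intervals
    return (first - 1, last)

# Locations of HRMONTH, HRYEAR4, PRTAGE and PESEX from the table above
locations = ['16-17', '18-21', '122-123', '129-130']
example_colspecs = [location_to_colspec(loc) for loc in locations]
print(example_colspecs)
```

The result matches the first four entries of the hand-written `colspecs` list.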

Convert the fixed-width files to a pandas DataFrame. The source files are monthly `.dat` files with names such as `jan17pub.dat`.

In [8]:

frames = []   # One df per monthly file

for file in os.listdir('Data/'):
    if file.endswith('.dat'):
        df = pd.read_fwf('Data/{}'.format(file), colspecs=colspecs, header=None)
        # Set the column names to match with CEPR extracts
        df.columns = colnames
        # Keep the currently open monthly df for the combined annual df
        frames.append(df)

# Combine the monthly dfs into the annual df
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
data = pd.concat(frames)
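
To see how `pd.read_fwf` slices each row, here is a minimal sketch on two made-up records held in memory. The record layout, the values, and the names `raw`, `toy_colspecs`, and `toy_names` are assumptions for illustration only; they do not follow the real CPS record layout.

```python
import io
import pandas as pd

# Made-up layout: month in positions 1-2, year in 3-6, age in 7-8, sex in 9-10
raw = "01201734 1\n02201752 2\n"
toy_colspecs = [(0, 2), (2, 6), (6, 8), (8, 10)]
toy_names = ['month', 'year', 'age', 'sex']

# header=None because the records carry no header row
toy = pd.read_fwf(io.StringIO(raw), colspecs=toy_colspecs,
                  header=None, names=toy_names)
print(toy)
```

Each tuple in `toy_colspecs` selects one slice of every line, and surrounding blanks are stripped before the values are converted to numbers.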

In [9]:
data[data['maj_occ_sec'] > 0].groupby('month').count()

month | year | age | PESEX | PEEDUCA | PREMPNOT | PRFTLF | PRERNWA | orgwgt | fnlwgt | sec_job | maj_occ_sec
 :---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:
1 | 758 | 758 | 758 | 758 | 758 | 758 | 758 | 758 | 758 | 758 | 758
2 | 819 | 819 | 819 | 819 | 819 | 819 | 819 | 819 | 819 | 819 | 819
3 | 797 | 797 | 797 | 797 | 797 | 797 | 797 | 797 | 797 | 797 | 797
4 | 759 | 759 | 759 | 759 | 759 | 759 | 759 | 759 | 759 | 759 | 759
5 | 802 | 802 | 802 | 802 | 802 | 802 | 802 | 802 | 802 | 802 | 802
6 | 716 | 716 | 716 | 716 | 716 | 716 | 716 | 716 | 716 | 716 | 716
7 | 721 | 721 | 721 | 721 | 721 | 721 | 721 | 721 | 721 | 721 | 721
8 | 679 | 679 | 679 | 679 | 679 | 679 | 679 | 679 | 679 | 679 | 679
9 | 747 | 747 | 747 | 747 | 747 | 747 | 747 | 747 | 747 | 747 | 747
10 | 755 | 755 | 755 | 755 | 755 | 755 | 755 | 755 | 755 | 755 | 755


Map basic monthly CPS values to CEPR extract values

In [10]:
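# Values missing from a dictionary become NaN after .map()
data['educ'] = data['PEEDUCA'].map(educ_dict)
data['female'] = data['PESEX'].map(gender_dict)
data['empl'] = data['PREMPNOT'].map(empl_dict)
# PRERNWA is divided by 100 to express weekly earnings in dollars
data['weekpay'] = data['PRERNWA'].astype(float) / 100
# Full-time labor force (PRFTLF == 1) is set to 40 hours; other codes are unchanged
data['uhourse'] = data['PRFTLF'].replace(1, 40)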
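
One consequence of mapping with a dictionary is worth a small sketch: `Series.map` returns `NaN` for any value that is not a key, so a later `dropna()` removes those rows. The example assumes that `-1` marks a record outside the question universe; `educ_example` is a shortened, illustrative version of `educ_dict`.

```python
import pandas as pd

educ_example = {39: 'HS', 43: 'College'}   # subset of the mapping, for illustration
peeduca = pd.Series([39, 43, -1])          # -1 assumed to mean "not in universe"

mapped = peeduca.map(educ_example)
print(mapped.tolist())            # the unmapped -1 becomes NaN
print(mapped.dropna().tolist())   # dropna() then removes that record
```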

In [6]:
data.dropna().to_stata('Data/cepr_org_2017.dta')

### Benchmark 1:
2017 employment-to-population ratio (EPOP) for women aged 25-54, by month, compared with the BLS estimate: [LNU02300062](https://data.bls.gov/timeseries/LNU02300062)

In [11]:
data = data[data['year'] == 2017].dropna()
for month in sorted(data['month'].unique()):
    df = data[(data['female'] == 1) & 
              (data['age'].isin(range(25,55))) & # python equiv to 25-54
              (data['month'] == month)].dropna()
    # EPOP as numpy weighted average of the employed variable
    epop = np.average(df['empl'].astype(float), weights=df['fnlwgt']) * 100
    date = pd.to_datetime('{}-{}-01'.format(df['year'].values[0], month))
    print('{:%B %Y}: Women, age 25-54: {:0.1f}'.format(date, epop))

January 2017: Women, age 25-54: 71.3
February 2017: Women, age 25-54: 71.8
March 2017: Women, age 25-54: 72.2
April 2017: Women, age 25-54: 72.3
May 2017: Women, age 25-54: 72.0
June 2017: Women, age 25-54: 71.5
July 2017: Women, age 25-54: 71.4
August 2017: Women, age 25-54: 71.4
September 2017: Women, age 25-54: 72.7
October 2017: Women, age 25-54: 72.5
November 2017: Women, age 25-54: 73.0
December 2017: Women, age 25-54: 72.6
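
The EPOP calculation is a weighted mean of a 0/1 employment indicator. A minimal sketch with made-up records and weights (the numbers are illustrative, not CPS data) shows that `np.average` with `weights` equals the weighted sum of the employed divided by the sum of weights:

```python
import numpy as np
import pandas as pd

# Four hypothetical respondents: employment indicator and person weight
toy = pd.DataFrame({'empl': [1, 0, 1, 0],
                    'fnlwgt': [100, 200, 300, 400]})

epop = np.average(toy['empl'], weights=toy['fnlwgt']) * 100
# Same quantity by hand: weighted employed / weighted population
by_hand = (toy['empl'] * toy['fnlwgt']).sum() / toy['fnlwgt'].sum() * 100
print(round(epop, 1), round(by_hand, 1))
```

Here the employed carry 400 of the 1,000 total weight, so the ratio is 40 percent even though half of the respondents are employed.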


### Benchmark 2:
Median usual weekly earnings of full-time workers aged 16 and older, compared with the BLS estimate: [LEU0252881500](https://data.bls.gov/timeseries/LEU0252881500)

In [7]:
import wquantiles
df = data[(data['PRERNWA'] > -1) & 
          (data['age'] >= 16) & 
          (data['PRFTLF'] == 1) &
          (data['month'].isin([1, 2, 3]))].dropna()
print('2017 Q1 Usual Weekly Earnings: ${0:,.2f}'.format(
    # Weighted median using wquantiles package
    wquantiles.median(df['PRERNWA'], df['orgwgt']) / 100.0))

2017 Q1 Usual Weekly Earnings: $865.00
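
For readers without the `wquantiles` package, the idea of a weighted median can be sketched in plain NumPy. This is an assumption-laden illustration, not the package's implementation: `weighted_median` is a hypothetical helper that returns the smallest value at which the cumulative weight reaches half of the total, with no interpolation, so its results may differ slightly from `wquantiles.median`.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    # First sorted position where cumulative weight >= half of the total
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[idx]

# Made-up weekly earnings; the highest earner carries most of the weight
earnings = [500, 800, 1000, 1500]
weights = [1, 1, 1, 5]
print(weighted_median(earnings, weights), np.median(earnings))
```

The heavily weighted record pulls the weighted median above the unweighted one, which is why the benchmark uses `orgwgt` rather than a simple median of the sample.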
