## Current Population Survey Microdata using Python

March 10, 2018

*Note: [IPUMS](https://cps.ipums.org/cps/) is likely the quickest interface for retrieving CPS data. The process below can be avoided by using IPUMS.*

*See: Tom Augspurger's [blog](https://tomaugspurger.github.io/tackling%20the%20cps.html) and [github](https://github.com/TomAugspurger/pycps), the definitive resource for working with CPS microdata in Python.*

If your research requires reading raw CPS microdata, which are stored in fixed-width format text files covering one month each, you can use Python to do so. 

The [Census FTP page](https://thedataweb.rm.census.gov/ftp/cps_ftp.html) contains the microdata and dictionaries identifying each variable name, location, value range, and whether it applies to a restricted sample. 

In [90]:
# Import packages 
import pandas as pd  # pandas 0.22
import numpy as np
import re            # regular expressions

#### Use the January 2017 data dictionary to find variable locations

The example will calculate the employment-to-population ratio for women between the ages of 25 and 54 in April 2017. To do this, we need to find the appropriate data dictionary on the Census FTP site, in this case January_2017_Record_Layout.txt, open it with Python, and read the text inside. 

We find that the BLS composite weight is called ```PWCMPWGT```, the age variable is called ```PRTAGE```, the sex variable is called ```PESEX``` and women are identified by '2', and the employment status is stored as ```PREMPNOT```.

You may also notice that the dictionary follows a pattern, where variable names and locations are stored on the same line and in the same order. Regular expressions can be used to extract the parts of this pattern that we care about, specifically: the variable name, length, description, and location.

The python list ```dd_sel_var``` stores the variable names and locations for the four variables of interest. 
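To see what the pattern captures, here is a minimal sketch run against a single made-up dictionary line. The sample text is an assumption modeled on the layout described above (name, length, description, then column range), not a line copied from the Census file.

```python
import re

# Hypothetical line imitating the record layout: variable name, length,
# description, then the 1-based column range, separated by tabs
sample = "\nPRTAGE\t\t2\tPERSONS AGE\t\t\t\t122 - 123\n"

p = re.compile(r'\n(\w+)\s+(\d+)\s+(.*?)\t+.*?(\d\d*).*?(\d\d+)')

# Each match is a tuple: (name, length, description, start, end)
name, length, desc, start, end = p.findall(sample)[0]

# Convert the 1-based inclusive range into a 0-based Python slice
entry = (name, int(start) - 1, int(end))
print(entry)
```

Subtracting one from the start position is what lets the result be used directly as `line[start:end]` later on.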

In [71]:
# Data dictionary 
dd_file = 'January_2017_Record_Layout.txt'
dd_full = open(dd_file, 'r', encoding='iso-8859-1').read()

# Series of interest 
series = ['PWCMPWGT', 'PRTAGE', 'PREMPNOT', 'PESEX']

# Regular expression finds rows with variable location details
p = re.compile(r'\n(\w+)\s+(\d+)\s+(.*?)\t+.*?(\d\d*).*?(\d\d+)')

# Keep adjusted results for series of interest
dd_sel_var = [(i[0], int(i[3])-1, int(i[4])) 
              for i in p.findall(dd_full) if i[0] in series]

In [89]:
print(dd_sel_var)

[('PRTAGE', 121, 123), ('PESEX', 128, 130), ('PREMPNOT', 392, 394), ('PWCMPWGT', 845, 855)]


#### Read the CPS microdata for April 2017

There are many ways to accomplish this task. One that is simple for small-scale projects and still executes quickly uses a Python list comprehension to read each line of the microdata and pull out the parts we want, using the locations from the data dictionary. 

Pandas is used to make the data structure more human-readable and to make filtering the data more intuitive. The column names come from the data dictionary variable ids.

In [93]:
# Convert raw data into a list of tuples
data = [tuple(int(line[i[1]:i[2]]) for i in dd_sel_var) 
        for line in open('apr17pub.dat', 'rb')]

# Convert to pandas dataframe, add variable ids as heading
df = pd.DataFrame(data, columns=[v[0] for v in dd_sel_var])
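As one of the other ways to do this, pandas provides `read_fwf`, which accepts the same 0-based, half-open column positions through its `colspecs` argument. The sketch below is only illustrative: the three records and the `layout` list are made up and stand in for `apr17pub.dat` and `dd_sel_var`.

```python
import io
import pandas as pd

# Made-up fixed-width records (not real CPS data)
raw = (
    "42 1 1573071\n"
    "26 2 1458261\n"
    " 7 2       0\n"
)

# Hypothetical layout in the same (name, start, end) form as dd_sel_var
layout = [('AGE', 0, 2), ('SEX', 2, 4), ('WGT', 4, 12)]

df_fwf = pd.read_fwf(io.StringIO(raw),
                     colspecs=[(start, end) for _, start, end in layout],
                     names=[name for name, _, _ in layout],
                     header=None)
print(df_fwf)
```

This trades some speed for convenience, since pandas handles whitespace stripping and type inference for each column.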

#### Benchmarking against BLS published data

The last step to show that the example has worked is to compare a sample calculation, the prime age employment rate of women, to the [BLS published version of that calculation](https://data.bls.gov/timeseries/LNU02300062). If the benchmark calculation from the microdata is very close to the BLS result, we can feel a bit better about other calculations that we need to do.

In [94]:
# Temporary dataframe with only women age 25 to 54
dft = df[(df['PESEX'] == 2) & (df['PRTAGE'].between(25, 54))]

# Identify employed portion of group as 1.0 & the rest as 0.0
empl = np.where(dft['PREMPNOT'] == 1, 1.0, 0.0)

# Take weighted average of employed portion of group
epop = np.average(empl, weights=dft['PWCMPWGT']) * 100

# Print out the result to check against LNU02300062
print(f'April 2017: {round(epop, 1)}')

April 2017: 72.3
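The weighted average used here is the sum of the weights of employed people divided by the sum of all weights in the group. A small sketch with made-up values (four hypothetical respondents) shows that `np.average` with `weights` gives the same number as the manual calculation:

```python
import numpy as np

# Made-up example: 1.0 = employed, 0.0 = not employed
empl_demo = np.array([1.0, 0.0, 1.0, 1.0])
weights_demo = np.array([1500, 2500, 1000, 1000])

# Weighted share employed, as a percentage
epop_demo = np.average(empl_demo, weights=weights_demo) * 100

# Same calculation written out: sum(w * x) / sum(w)
manual = (empl_demo * weights_demo).sum() / weights_demo.sum() * 100

print(round(epop_demo, 1), round(manual, 1))
```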


#### About the U.S. Current Population Survey (CPS)

The CPS was initially deployed in 1940 to give a more accurate unemployment rate estimate, and it is still the source of the official unemployment rate. The CPS is a monthly survey of around 65,000 households. Each selected household is surveyed up to 8 times. Interviewers ask for basic demographic and employment information in the first three interview months, then ask additional detailed wage questions in the fourth interview. The household is then not surveyed for eight months, after which it is interviewed for four more months, with the detailed wage questions asked again in the fourth. 

The CPS is not a simple random sample, but a multi-stage stratified sample. In the first stage, each state and DC is divided into "primary sampling units" (PSUs), and a sample of PSUs is selected. In the second stage, a sample of housing units is drawn from the selected PSUs.

There are also months where each household receives supplemental questions on a topic of interest. The largest such "CPS supplement", conducted each March, is the Annual Social and Economic Supplement. The sample size for this supplement is expanded, and the respondents are asked questions about various sources of income, and about the quality of their jobs (for example, health insurance benefits). Other supplements cover topics like job tenure, or computer and internet use.

The CPS is a joint product of the U.S. Census Bureau and the Bureau of Labor Statistics.

*Special thanks to John Schmitt for guidance on the CPS.*

In [86]:
data = np.array([tuple(int(line[i[1]:i[2]]) for i in dd_sel_var) 
        for line in open('apr17pub.dat', 'rb')])

In [87]:
data

array([[      42,        1,        1, 15730712],
       [      26,        2,        1, 14582612],
       [      25,        2,        1, 20672047],
       ...,
       [      43,        1,        1,  2676135],
       [      27,        2,        1,  2790799],
       [       7,        2,       -1,        0]])

In [None]:
df[[s for s in s1 if s in df]] = df[[s for s in s1 if s in df]].astype(np.int8)    
df[[s for s in s2 if s in df]] = df[[s for s in s2 if s in df]].astype(np.int16)
df[[s for s in s3 if s in df]] = df[[s for s in s3 if s in df]].astype(np.int32)

In [56]:
for i in dd_sel_var:
    if i[2] - i[1] < 3:
        print(tuple([i[0], i[1], i[2], np.int8]))
    else:
        print(tuple([i[0], i[1], i[2], np.int32]))
    #print(i[2] - i[1])

('PRTAGE', 121, 123, <class 'numpy.int8'>)
('PESEX', 128, 130, <class 'numpy.int8'>)
('PREMPNOT', 392, 394, <class 'numpy.int8'>)
('PWCMPWGT', 845, 855, <class 'numpy.int32'>)


In [62]:
df

Unnamed: 0,PRTAGE,PESEX,PREMPNOT,PWCMPWGT
0,42,1,1,15730712
1,26,2,1,14582612
2,25,2,1,20672047
3,42,2,4,15492377
4,47,1,1,18155638
5,49,2,1,16330038
6,30,1,1,20020684
7,26,2,1,16112316
8,29,1,1,23116850
9,68,2,4,15760419


In [51]:
dfm = [pd.to_numeric(row) for row in data]

In [16]:
dft

Unnamed: 0,HRHHID,HRMONTH,HRYEAR4,HRMIS,HRHHID2,GESTFIPS,GTMETSTA,PRTAGE,PESEX,PEEDUCA,...,PRMJOCC1,PRSJMJ,PRERNHLY,PRERNWA,PENLFRET,PENLFACT,PWORWGT,PWSSWGT,PRCHLD,PWCMPWGT


In [17]:
df

Unnamed: 0,HRHHID,HRMONTH,HRYEAR4,HRMIS,HRHHID2,GESTFIPS,GTMETSTA,PRTAGE,PESEX,PEEDUCA,...,PRMJOCC1,PRSJMJ,PRERNHLY,PRERNWA,PENLFRET,PENLFACT,PWORWGT,PWSSWGT,PRCHLD,PWCMPWGT
0,b'000110116792163',b' 4',b'2017',b' 7',b'05011',b'01',b'1',b'42',b' 1',b'43',...,b' 3',b' 1',b' -1',b' -1',b'-1',b'-1',b' 0',b' 16044411',b' 0',b' 15730712'
1,b'000110116792163',b' 4',b'2017',b' 7',b'05011',b'01',b'1',b'26',b' 2',b'40',...,b' 3',b' 1',b' -1',b' -1',b'-1',b'-1',b' 0',b' 15049796',b' 0',b' 14582612'
2,b'000110116792163',b' 4',b'2017',b' 7',b'05111',b'01',b'1',b'25',b' 2',b'39',...,b' 3',b' 1',b' -1',b' -1',b'-1',b'-1',b' 0',b' 21403677',b' 1',b' 20672047'
3,b'000110116792163',b' 4',b'2017',b' 7',b'05111',b'01',b'1',b'42',b' 2',b'39',...,b'-1',b'-1',b' -1',b' -1',b'-1',b'-1',b' 0',b' 14333979',b' 0',b' 15492377'
4,b'000110206593381',b' 4',b'2017',b' 8',b'05011',b'01',b'1',b'47',b' 1',b'40',...,b' 3',b' 1',b' -1',b' 92307',b'-1',b'-1',b' 71919038',b' 18852532',b' 3',b' 18155638'
5,b'000110206593381',b' 4',b'2017',b' 8',b'05011',b'01',b'1',b'49',b' 2',b'43',...,b' 1',b' 1',b' -1',b' 115384',b'-1',b'-1',b' 61993609',b' 16785172',b' 3',b' 16330038'
6,b'000110206593381',b' 4',b'2017',b' 8',b'05011',b'01',b'1',b'30',b' 1',b'40',...,b' 5',b' 1',b'1100',b' 67300',b'-1',b'-1',b' 76747769',b' 20065813',b' 0',b' 20020684'
7,b'000110267480552',b' 4',b'2017',b' 8',b'05112',b'01',b'1',b'26',b' 2',b'43',...,b' 4',b' 1',b' 900',b' 30000',b'-1',b'-1',b' 65704751',b' 16621183',b' 0',b' 16112316'
8,b'000110267480552',b' 4',b'2017',b' 8',b'05112',b'01',b'1',b'29',b' 1',b'40',...,b' 4',b' 1',b'1050',b' 35000',b'-1',b'-1',b' 92431106',b' 23650892',b' 0',b' 23116850'
9,b'000110270885905',b' 4',b'2017',b' 7',b'05011',b'01',b'1',b'68',b' 2',b'39',...,b'-1',b'-1',b' -1',b' -1',b'-1',b'-1',b' 0',b' 15407404',b' 0',b' 15760419'


In [19]:
# Data dictionary
dd_file = 'January_2017_Record_Layout.txt'
dd_full = open(dd_file, 'r', encoding='iso-8859-1').read()
# Series of interest 
s = ['PWORWGT', 'PWCMPWGT', 'HRHHID', 'PULINENO', 'HRHHID2', 
     'PEHRUSL1', 'PRERNWA', 'PRERNHLY', 'PTERNWA', 'PTERNHLY', 
     'PRTAGE', 'PEAGE', 'PRUNEDUR', 'PWSSWGT', 'HRYEAR', 'HRYEAR4']

# These series can be stored as categorical later on
s2 = ['HRMONTH', 'PESEX', 'PEMLR', 'PENLFRET', 'PENLFACT', 
      'PRDISC', 'GESTFIPS', 'HRMIS', 'PRCOW1', 'PRFTLF', 
      'PREMPNOT', 'PRCIVLF', 'PEJHRSN','PRSJMJ', 'PEEDUCA', 
      'PRWKSTAT', 'PRMJOCC1', 'GTMETSTA', 'GEMETSTA', 'PEDWWNTO',
      'PRUNTYPE', 'PRMJIND1', 'PERACE', 'PTDTRACE', 'PRDTRACE', 
      'PRORIGIN', 'PRDTHSP', 'PRCHLD']   
s = s + s2
p = re.compile(r'\n(\w+)\s+(\d+)\s+(.*?)\t+.*?(\d\d*).*?(\d\d+)')
dd_sel_var = [(i[0], int(i[3])-1, int(i[4])) 
              for i in p.findall(dd_full) if i[0] in s]

In [None]:
rows = []
with open('apr17pub.dat', 'r', encoding='utf-8') as f:
    for line in f:
        if int(line[845:855]) > 0:  # Composite weight
            rows.append(tuple(int(line[i[1]:i[2]].strip()) for i in dd2))
            
df = pd.DataFrame(rows, columns=[v[0] for v in dd2])

In [4]:
df.memory_usage()

Index           80
HRHHID      818288
HRMONTH     818288
HRYEAR4     818288
HRMIS       818288
HRHHID2     818288
GESTFIPS    818288
GTMETSTA    818288
PRTAGE      818288
PESEX       818288
PEEDUCA     818288
PTDTRACE    818288
PRDTHSP     818288
PULINENO    818288
PEMLR       818288
PEHRUSL1    818288
PEDWWNTO    818288
PEJHRSN     818288
PRCIVLF     818288
PRDISC      818288
PREMPNOT    818288
PRFTLF      818288
PRUNEDUR    818288
PRUNTYPE    818288
PRWKSTAT    818288
PRCOW1      818288
PRMJIND1    818288
PRMJOCC1    818288
PRSJMJ      818288
PRERNHLY    818288
PRERNWA     818288
PENLFRET    818288
PENLFACT    818288
PWORWGT     818288
PWSSWGT     818288
PRCHLD      818288
PWCMPWGT    818288
dtype: int64
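The identical byte count for every column above suggests each one is stored as an 8-byte integer. As a rough sketch of how the footprint could be reduced, `pd.to_numeric` with `downcast='integer'` picks the smallest integer type that fits each column. The three-row frame below is a small stand-in, not the full dataset.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with one narrow and one wide column
demo = pd.DataFrame({'PESEX': [1, 2, 2],
                     'PWCMPWGT': [15730712, 14582612, 20672047]})

# Downcast each column to the smallest integer dtype that holds its values
small = demo.copy()
for col in small.columns:
    small[col] = pd.to_numeric(small[col], downcast='integer')

print(small.dtypes)
print(demo.memory_usage(index=False).sum(),
      small.memory_usage(index=False).sum())
```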

In [None]:
rows

In [None]:
df = pd.DataFrame(data, columns=[v[0] for v in dd2])

In [None]:
df.apply(pd.to_numeric)

In [None]:
p = re.compile(r'\n(\w+)\s+(\d+)\s+(.*?)\t+.*?(\d\d*).*?(\d\d+)')
# Subtract 1 from the start so the positions work as 0-based slices
dd2 = [(i[0], int(i[3])-1, int(i[4])) 
       for i in p.findall(dd_full) if i[0] in s]

In [None]:
data = [tuple(line[i[1]:i[2]] for i in dd2) 
        for line in open('apr17pub.dat', 'rb') 
        if int(line[845:855]) > 0]
df = pd.DataFrame(data, columns=[v[0] for v in dd2])

#### Example of reading first line of raw data file

In [None]:
# Structure of the data - fixed width format
with open('apr17pub.dat', 'r') as f:
    print(f.readline()) # Print first line

#### Reading a sample data dictionary

In [None]:
# Data dictionary
dd_file = 'January_2017_Record_Layout.txt'
dd_full = open(dd_file, 'r', encoding='iso-8859-1').read()

#### Series of interest

In [None]:
# Series of interest 
s = ['PWORWGT', 'PWCMPWGT', 'HRHHID', 'PULINENO', 'HRHHID2', 'PEHRUSL1', 
     'PRERNWA', 'PRERNHLY', 'PTERNWA', 'PTERNHLY', 'PRTAGE', 'PEAGE',
     'PRUNEDUR', 'PWSSWGT', 'HRYEAR', 'HRYEAR4']

# These series can be stored as categorical later on
s2 = ['HRMONTH', 'PESEX', 'PEMLR', 'PENLFRET', 'PENLFACT', 'PRDISC', 'GESTFIPS',
      'HRMIS', 'PRCOW1', 'PRFTLF', 'PREMPNOT', 'PRCIVLF', 'PEJHRSN','PRSJMJ', 
      'PEEDUCA', 'PRWKSTAT', 'PRMJOCC1', 'GTMETSTA', 'GEMETSTA', 'PEDWWNTO',
      'PRUNTYPE', 'PRMJIND1', 'PERACE', 'PTDTRACE', 'PRDTRACE', 'PRORIGIN',
      'PRDTHSP', 'PRCHLD']   
s = s + s2

#### Finding the pattern in the data dictionary and variable names, lengths, and column locations

In [None]:
p = re.compile(r'\n(\w+)\s+(\d+)\s+(.*?)\t+.*?(\d\d*).*?(\d\d+)')
# Subtract 1 from the start so the positions work as 0-based slices
dd2 = [(i[0], int(i[3])-1, int(i[4])) 
       for i in p.findall(dd_full) if i[0] in s]

#### Using the data dictionary pattern to parse the raw fixed-width format file

In [None]:
data = [tuple(line[i[1]:i[2]] for i in dd2) 
        for line in open('apr17pub.dat', 'r', encoding='utf-8') 
        if int(line[845:855]) > 0]
df = pd.DataFrame(data, columns=[v[0] for v in dd2])

In [None]:
# Read all columns and the first 50 rows
# dd entries are (name, 1-based start, length), so convert to 0-based spans
pd.read_fwf('apr17pub.dat', 
            colspecs=[(i[1] - 1, i[1] - 1 + i[2]) for i in dd], 
            names=[i[0] for i in dd],
            nrows=50)

In [None]:
# Set of functions for parsing raw data
import struct

# Use struct to read files faster 
def struct_constr(fieldspecs):
    """Specify which characters to retrieve and which to ignore"""
    unpack_len = 0
    unpack_fmt = ""
    for fieldspec in fieldspecs:
        start = fieldspec[1] - 1
        end = start + fieldspec[2]
        if start > unpack_len:
            unpack_fmt += str(start - unpack_len) + "x"
        unpack_fmt += str(end - start) + "s"
        unpack_len = end
    #print(unpack_fmt)
    return struct.Struct(unpack_fmt).unpack_from

# Convert valid lines to list
def fwf_to_list(file, unpacker):
    rows = []
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            if int(line[845:855]) > 0:  # Composite weight
                rows.append(tuple(map(int, unpacker(line.encode()))))
    return rows
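As a rough illustration of the kind of format string these functions build, the sketch below unpacks one made-up 12-byte record. The `fields` layout is hypothetical and uses the (name, 1-based start, length) convention that `struct_constr` expects.

```python
import struct

# Hypothetical layout: (name, 1-based start, length)
fields = [('AGE', 1, 2), ('SEX', 4, 1), ('WGT', 6, 7)]

# Build the format string: 'Nx' skips N bytes, 'Ns' reads N bytes
fmt, pos = '', 0
for _, start, length in fields:
    begin = start - 1
    if begin > pos:
        fmt += f'{begin - pos}x'
    fmt += f'{length}s'
    pos = begin + length

# Made-up record (not real CPS data)
record = b'42 1 1573071'
values = tuple(int(v) for v in struct.Struct(fmt).unpack_from(record))
print(fmt, values)
```

Note that the fields need to be listed in order of their start positions, since the format string is built left to right.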

In [None]:
#Read monthly file and add it to annual dataframe
row_list = fwf_to_list('apr17pub.dat', struct_constr(dd))
df = pd.DataFrame(row_list, columns=[v[0] for v in dd])

In [None]:
df

In [None]:
unpacker = struct_constr(dd)

In [None]:
with open('apr17pub.dat', 'r') as f:
    line = tuple(map(int, unpacker(f.readline().encode())))

In [None]:
line

In [None]:
dd

In [None]:
[(i[0], i[2], i[3]) for i in fields if i[0] != 'FILLER']
#[i[0] for i in fields if i[0] != 'FILLER']

In [None]:
dd2

In [None]:
dd = [(i[0], int(i[3]), int(i[1])) 
       for i in p.findall(dd_full) if i[0] in s]

In [None]:
dd

In [None]:
text = '1234567890'
w = [0, 2, 5, 7, 10]
# Slice the string at the consecutive boundaries listed in w
[text[w[i-1]:w[i]] for i in range(1, len(w))]

['12', '345', '67', '890']

In [None]:
df.memory_usage()

In [None]:
pd.DataFrame(data)

In [None]:
def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length

In [None]:
rows = []
with open('apr17pub.dat', 'r', encoding='utf-8') as f:
    for line in f:
        if int(line[845:855]) > 0:  # Composite weight
            rows.append(tuple(map(int, unpacker(line.encode()))))

In [None]:
rows