# THE PANDEMIC'S WORKING PARENTS

#### <u>DATA PREP</u>

#### The Project

The purpose of this project is to shed light on the challenges that working parents are facing during the 2020-2021 COVID-19 pandemic. My analysis of U.S. Census Household Pulse Survey data reveals that, since the beginning of the pandemic, California has ranked third in the country for parents with schoolchildren lost from the job market, trailing only Nevada and Michigan. California households with PreK-12 children were significantly more likely to lose employment income than households without children.


#### The Data
This project uses data for weeks 1 through 27 of the pandemic from the U.S. Census Bureau Household Pulse Survey Public Use files (https://www.census.gov/programs-surveys/household-pulse-survey/datasets.html). Each week's data is published in a separate CSV file. The Census Bureau also publishes weekly data dictionaries in Excel format.
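
As a rough illustration of how the separate weekly files could be combined into one dataframe, the sketch below stacks every CSV found in a folder. The function name `load_weekly_files`, the folder layout, and the glob pattern are assumptions made for this example; they are not taken from the original workflow.

```python
from pathlib import Path

import pandas as pd


def load_weekly_files(folder, pattern="*.csv"):
    """Read every CSV in `folder` (hypothetical layout) and stack the rows."""
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder}")
    frames = [pd.read_csv(path) for path in paths]
    # Columns that are absent from a given file are filled with NaN by concat.
    return pd.concat(frames, ignore_index=True, sort=False)
```

Calling `load_weekly_files('some_folder')` would return one dataframe with a fresh integer index, which could then be stored once and reloaded in later steps.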

## Importing tools

In [1]:
import pandas as pd
import numpy as np
import os

pd.options.mode.chained_assignment = None # None|'warn'|'raise'
pd.set_option('display.float_format', '{:.2f}'.format)

class color:
   BOLD = '\033[1m'
   END = '\033[0m'

<hr>

## Importing the data

I import my stored dataframe. 

In [2]:
pulse = pd.read_csv('pulse.csv')

In the process of prepping this database, I remove rows with missing answers to questions important to my analysis. I record the length of the original database now so I can compare it as I go. 
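
Because this row-count comparison recurs after each filtering step, it could be wrapped in a small helper. The sketch below is only an illustration: the function name `report_rows` and the toy `value` column are assumptions, not part of the notebook.

```python
import pandas as pd


def report_rows(df, original_length):
    """Print the running row counts and return the number of rows removed."""
    removed = original_length - len(df)
    print("Original row count:", original_length)
    print("Count of valid records:", len(df))
    print("Total removed:", removed)
    return removed


# Toy example: 5 rows originally, 2 dropped by a filter on negative codes.
demo = pd.DataFrame({"value": [1, 2, -88, 2, -99]})
demo_original_length = len(demo)
demo = demo[demo.value > 0]
removed = report_rows(demo, demo_original_length)
```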

In [3]:
original_length = len(pulse)
print('\n', color.BOLD + 'Original row count:' + color.END, original_length)


 Original row count: 2323337


<hr>

## Creating data categories
### KIDS

I am interested in identifying households with individuals under 18 and households with school-age children. In the weekly Pulse survey, the Census asks individuals how many people under 18 live in their households. That number is recorded in the <b>THHLD_NUMKID</b> column. This column does not have missing values.

In [4]:
pulse.THHLD_NUMKID.value_counts()

0.00    1505186
1.00     357116
2.00     293350
3.00     111495
4.00      37159
5.00      19031
Name: THHLD_NUMKID, dtype: int64

I create a new <b>kids</b> column where I mark 1 for households with one or more people under 18 and 0 for those without. 

In [5]:
pulse['kids'] = np.where(pulse.THHLD_NUMKID == 0, 0, 1)

I then label households with children in school or homeschooled. The database records answers to the question,

    'At any time during the 2020-2021 school year, were, or will, any children in this household enrolled in a public school, enrolled in a private school, or educated in a homeschool setting in Kindergarten through 12th grade or grade equivalent? Select all that apply.'

in three columns:

- <b>ENROLL1</b>: 'Yes, enrolled in a public or private school'
- <b>ENROLL2</b>: 'Yes, homeschooled'
- <b>ENROLL3</b>: 'No'

These columns include missing responses marked as '-88' in the database. If the interviewee addressed the question but did not select the category in a particular column, their response is marked as '-99'.

In [6]:
print(pulse.ENROLL1.value_counts(), '\n')
print(pulse.ENROLL2.value_counts(), '\n')
print(pulse.ENROLL3.value_counts())

-88.00    1595434
1.00       522921
-99.00     204982
Name: ENROLL1, dtype: int64 

-88.00    1595434
-99.00     687134
1.00        40769
Name: ENROLL2, dtype: int64 

-88.00    1595434
-99.00     560305
1.00       167598
Name: ENROLL3, dtype: int64
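
One way to make the two sentinel codes easier to work with is to turn them into missing values before testing for unanswered rows. The following is a minimal sketch on a toy dataframe; the `demo` data and the `MISSING_CODES` name are made up for illustration.

```python
import pandas as pd

MISSING_CODES = [-88, -99]  # -88: missing, -99: seen but category not selected

demo = pd.DataFrame({
    "ENROLL1": [1, -99, -88],
    "ENROLL2": [-99, 1, -88],
    "ENROLL3": [-99, -99, -88],
})

# Replace both sentinel codes with NaN so pandas treats them as missing.
cleaned = demo.mask(demo.isin(MISSING_CODES))

# True for rows where none of the three categories was selected.
no_answer = cleaned.isna().all(axis=1)
```

Here only the last toy row has no selection in any of the three columns, so it is the only one flagged by `no_answer`.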


I remove rows where households reported having members under 18 but all three ENROLL columns are marked -88 or -99.

In [7]:
# removing invalid records
pulse = pulse[~(
    (pulse.kids == 1) & 
    (pulse.ENROLL1 < 0) & 
    (pulse.ENROLL2 < 0) & 
    (pulse.ENROLL3 < 0)
)]

print('\n', color.BOLD + 'Original row count:' + color.END, original_length, 
      '\n', color.BOLD + 'Count of valid records:' + color.END, len(pulse), 
      '\n', color.BOLD + 'Total removed:' + color.END, original_length - len(pulse), '\n')


 Original row count: 2323337
 Count of valid records: 2189398
 Total removed: 133939



In a new column, <b>school_kids</b>, I mark 1 for households that reported having children enrolled in school or homeschooled, and 0 for households that answered that they did not.

In [8]:
# Labeling households with school kids, marking na for others
pulse['school_kids'] = np.where(
    (pulse.kids == 1) &
    (
        (pulse.ENROLL1 == 1) | (pulse.ENROLL2 == 1)
    ), 1, np.nan)

# Labeling households without school kids
pulse['school_kids'] = np.where(
    (pulse.kids == 1) &
    (
        (pulse.ENROLL3 == 1)
    ), 0, pulse.school_kids)


print('\n', color.BOLD + 'Kids and school kids columns' + color.END, '\n')
display(pulse[['kids', 'ENROLL1', 'ENROLL2', 'ENROLL3', 'school_kids']].head(3))


 Kids and school kids columns

| | kids | ENROLL1 | ENROLL2 | ENROLL3 | school_kids |
|---|---|---|---|---|---|
| 1 | 0 | -88.0 | -88.0 | -88.0 | NaN |
| 2 | 1 | 1.0 | -99.0 | -99.0 | 1.0 |
| 3 | 0 | -88.0 | -88.0 | -88.0 | NaN |
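
The two-step labeling above could also be written as a single `np.select` call. This is a sketch on toy data, not the notebook's own method; the "No" condition is listed first so that it takes precedence, mirroring the order in which the two `np.where` calls overwrite each other.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "kids":    [0, 1, 1, 1],
    "ENROLL1": [-88, 1, -99, -99],
    "ENROLL2": [-88, -99, 1, -99],
    "ENROLL3": [-88, -99, -99, 1],
})

has_kids = demo.kids == 1
conditions = [
    has_kids & (demo.ENROLL3 == 1),                           # answered "No"
    has_kids & ((demo.ENROLL1 == 1) | (demo.ENROLL2 == 1)),   # enrolled or homeschooled
]
# np.select uses the first matching condition; unmatched rows stay NaN.
demo["school_kids"] = np.select(conditions, [0, 1], default=np.nan)
```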


### HOUSEHOLDS WHICH EXPERIENCED A RECENT JOB LOSS

Interviewees in the survey were asked if they had experienced a loss of employment income since March 13, 2020. Their answers were recorded in the <b>WRKLOSS</b> column. The column includes missing values. 

In [9]:
pulse.WRKLOSS.value_counts()

2.00      1338296
1.00       839225
-99.00      11636
-88.00        241
Name: WRKLOSS, dtype: int64

Because this question required interviewees to select only one answer, I remove all answers marked as -88 (missing) and -99 (question seen but category not selected). 

In [10]:
pulse = pulse[~((pulse.WRKLOSS == -88) | (pulse.WRKLOSS == -99))]

print('\n', color.BOLD + 'Original row count:' + color.END, original_length, 
      '\n', color.BOLD + 'Count of valid records:' + color.END, len(pulse), 
      '\n', color.BOLD + 'Total removed:' + color.END, original_length - len(pulse))


 Original row count: 2323337
 Count of valid records: 2177521
 Total removed: 145816


In [11]:
pulse.WRKLOSS.value_counts()

2.00    1338296
1.00     839225
Name: WRKLOSS, dtype: int64

I create a new column, <b>recent_job_loss</b>, where I mark 1 for households that reported a loss (currently marked 1) and 0 for households that didn't (currently marked 2).

In [12]:
# Labeling households which experienced recent job losses
pulse['recent_job_loss'] = np.where(pulse.WRKLOSS == 1, 1, np.nan)
pulse['recent_job_loss'] = np.where(pulse.WRKLOSS == 2, 0, pulse['recent_job_loss'])

print('\n', color.BOLD + 'Recent job loss columns' + color.END, '\n')
display(pulse[['WRKLOSS', 'recent_job_loss']].head(3))


 Recent job loss columns

| | WRKLOSS | recent_job_loss |
|---|---|---|
| 1 | 1.0 | 1.0 |
| 2 | 2.0 | 0.0 |
| 3 | 1.0 | 1.0 |
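
A recode like this one can also be expressed with `Series.map`, which leaves any code that is not in the dictionary as NaN. The snippet below is a small sketch on made-up values rather than the notebook's own code.

```python
import pandas as pd

demo = pd.DataFrame({"WRKLOSS": [1, 2, -88, -99]})

# 1 (yes) -> 1, 2 (no) -> 0; codes absent from the mapping become NaN.
demo["recent_job_loss"] = demo.WRKLOSS.map({1: 1, 2: 0})
```

The same pattern would extend to columns with more answer codes by listing each valid code in the dictionary.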


### MARITAL STATUS

I follow the same process to recode the marital status of the interviewee. The <b>MS</b> column includes the following choices:

1) Now married 
2) Widowed
3) Divorced
4) Separated
5) Never married

As above, -99 indicates that the question was seen but the category was not selected and -88 indicates that the answer is missing. Because the question required interviewees to select only one answer, I mark as null all answers not ranging from 1 to 5. I do not remove invalid records because this characteristic is not central to my analysis. 

In [13]:
# Labeling interviewees who reported not being married
pulse['married'] = np.where(
    (pulse.MS == 2) | 
    (pulse.MS == 3) | 
    (pulse.MS == 4) | 
    (pulse.MS == 5), 0, np.nan)

# Labeling interviewees who reported being married
pulse['married'] = np.where(pulse.MS == 1, 1, pulse.married)

print('\n', color.BOLD + 'Marital status columns' + color.END, '\n')
display(pulse[['MS', 'married']].head(3))


 Marital status columns

| | MS | married |
|---|---|---|
| 1 | 3.0 | 0.0 |
| 2 | 1.0 | 1.0 |
| 3 | 2.0 | 0.0 |


### INCOME

I retrieved low-income limits data published in 2020 by the Department of Housing and Urban Development from https://www.huduser.gov/portal/datasets/il/il20/State-Incomelimits-Report-FY20r.pdf. I used SmallPDF to convert the documents into Excel files. I then filtered the data to only include low-income limits and added the <b>EST_ST</b> column with state codes as reported in the Census Pulse Survey data dictionary.

State income limits depend on the number of household members. However, limits are only defined for households of up to 8 members, so I apply the 8-member limit to larger households. To make sure I can match all of the data, I create a new <b>member_count</b> column that holds the number of household members reported in the <b>THHLD_NUMPER</b> survey column for households of 8 members or fewer, and 8 for larger households.

In [14]:
# Creating member counts col
pulse['member_count'] = np.where(pulse.THHLD_NUMPER <= 8, pulse.THHLD_NUMPER, 8)
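
For comparison, the same cap can be sketched with `Series.clip` on toy data. One difference worth noting: `clip` keeps a missing household size as NaN, whereas the `np.where` version sends it to 8, because a comparison with NaN evaluates to False.

```python
import pandas as pd

demo = pd.DataFrame({"THHLD_NUMPER": [1, 4, 8, 10]})

# Cap household size at 8, the largest size with a defined limit.
demo["member_count"] = demo.THHLD_NUMPER.clip(upper=8)
```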

I import the HUD state low-income limits data.

In [15]:
# Importing df
lil = pd.read_csv('State-Incomelimits-Report-FY20r-lil.csv')

print('\n', color.BOLD + 'State Low-Income Limits 2020' + color.END, '\n')
display(lil.head(3))


 State Low-Income Limits 2020

| | state | EST_ST | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALABAMA | 1 | 36550.0 | 41800.0 | 47000.0 | 52250.0 | 56400.0 | 60600.0 | 64800.0 | 68950.0 |
| 1 | ALASKA | 2 | 51650.0 | 59000.0 | 66400.0 | 73750.0 | 79650.0 | 85550.0 | 91450.0 | 97350.0 |
| 2 | ARIZONA | 4 | 40400.0 | 46150.0 | 51900.0 | 57700.0 | 62300.0 | 66900.0 | 71500.0 | 76150.0 |


I reshape the member count columns of the income data from wide to long format, so that each row holds one state, one household size, and the matching low-income limit.

In [16]:
# Reshaping df from wide to long format
lil_t = lil.melt(id_vars=['state','EST_ST']).rename(columns={
    'variable': 'member_count',
    'value': 'low_income_limit'
})
# Converting member_count col to integer
lil_t.member_count = lil_t.member_count.astype('int64')

print('\n', color.BOLD + 'State Low-Income Limits 2020 - Transposed' + color.END, '\n')
display(lil_t.head(3))


 State Low-Income Limits 2020 - Transposed

| | state | EST_ST | member_count | low_income_limit |
|---|---|---|---|---|
| 0 | ALABAMA | 1 | 1 | 36550.0 |
| 1 | ALASKA | 2 | 1 | 51650.0 |
| 2 | ARIZONA | 4 | 1 | 40400.0 |


I then left-merge the pulse dataframe with the income dataframe on the state code column, <b>EST_ST</b>, and household member count column, <b>member_count</b>. As a result, each row is assigned the appropriate low-income limit for its state and household member count.

In [17]:
pulse = pd.merge(pulse, lil_t, on=['EST_ST', 'member_count'], how='left')      

print('\n', color.BOLD + 'Pulse & State Low-Income Limits data merge' + color.END, '\n')
display(pulse.head(3))


 Pulse & State Low-Income Limits data merge

| | Unnamed: 0 | SCRAM | WEEK | EST_ST | EST_MSA | REGION | HWEIGHT_x | PWEIGHT | TBIRTH_YEAR | ABIRTH_YEAR | ... | TBEDROOMS | HWEIGHT_y | HWEIGHT | kids | school_kids | recent_job_loss | married | member_count | state | low_income_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | V220000001S10011554410113 | 22.0 | 1.0 | NaN | 2.0 | 899.97 | 1683.63 | 1947.0 | 2.0 | ... | NaN | NaN | 899.97 | 0 | NaN | 1.0 | 0.0 | 2.0 | ALABAMA | 41800.0 |
| 1 | 2 | V220000001S15010024400123 | 22.0 | 1.0 | NaN | 2.0 | 2077.84 | 3887.14 | 1989.0 | 2.0 | ... | NaN | NaN | 2077.84 | 1 | 1.0 | 0.0 | 1.0 | 6.0 | ALABAMA | 60600.0 |
| 2 | 3 | V220000001S15010351400113 | 22.0 | 53.0 | 42660.0 | 4.0 | 3555.42 | 6731.73 | 1971.0 | 2.0 | ... | NaN | NaN | 3555.42 | 0 | NaN | 1.0 | 0.0 | 2.0 | WASHINGTON | 57450.0 |
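
A left merge like this one can be checked for duplicate lookup keys and unmatched rows using the `validate` and `indicator` options of `merge`. The sketch below uses toy frames; state code 99 is a made-up value chosen so that one row fails to match.

```python
import pandas as pd

households = pd.DataFrame({
    "EST_ST": [1, 1, 99],        # 99 is a hypothetical code with no limit row
    "member_count": [2, 6, 3],
})
limits = pd.DataFrame({
    "EST_ST": [1, 1],
    "member_count": [2, 6],
    "low_income_limit": [41800.0, 60600.0],
})

merged = households.merge(
    limits,
    on=["EST_ST", "member_count"],
    how="left",
    validate="many_to_one",  # raises MergeError if limits has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no limit keep NaN in low_income_limit.
unmatched = merged[merged["_merge"] == "left_only"]
```

An empty `unmatched` frame would indicate that every household row received a limit.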


The Census Pulse data does not provide exact household incomes. Instead, it provides income brackets labeled with integers from 1 to 8 as follows:

1) Less than \$25,000  
2) \$25,000 - \$34,999  
3) \$35,000 - \$49,999  
4) \$50,000 - \$74,999  
5) \$75,000 - \$99,999  
6) \$100,000 - \$149,999  
7) \$150,000 - \$199,999  
8) \$200,000 and above

I use the same brackets to categorize low-income limits in a new column, <b>low_income_limit_cat</b>.

In [18]:
# Creating new income cat col
pulse['low_income_limit_cat'] = np.where(pulse.low_income_limit < 25000, 1, np.nan)

pulse['low_income_limit_cat'] = np.where(
    (pulse.low_income_limit >= 25000) & 
    (pulse.low_income_limit <= 34999), 2, pulse.low_income_limit_cat)

pulse['low_income_limit_cat'] = np.where(
    (pulse.low_income_limit >= 35000) & 
    (pulse.low_income_limit <= 49999), 3, pulse.low_income_limit_cat)

pulse['low_income_limit_cat'] = np.where(
    (pulse.low_income_limit >= 50000) & 
    (pulse.low_income_limit <= 74999), 4, pulse.low_income_limit_cat)

pulse['low_income_limit_cat'] = np.where(
    (pulse.low_income_limit >= 75000) & 
    (pulse.low_income_limit <= 99999), 5, pulse.low_income_limit_cat)

pulse['low_income_limit_cat'] = np.where(
    (pulse.low_income_limit >= 100000) & 
    (pulse.low_income_limit <= 149999), 6, pulse.low_income_limit_cat)

I specify conditions only for the first six categories because no state's low-income limit exceeds $149,999.
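
The bracket assignment above could also be expressed with `pd.cut`, which bins a numeric column against a list of edges in one call. This is a sketch under the assumption that the bracket edges are exactly those listed earlier; the `edges`, `labels`, and `limits` names are made up for the example.

```python
import numpy as np
import pandas as pd

# Lower edge of each income bracket, plus an open upper end.
edges = [0, 25000, 35000, 50000, 75000, 100000, 150000, 200000, np.inf]
labels = [1, 2, 3, 4, 5, 6, 7, 8]

limits = pd.Series([41800.0, 60600.0, 57450.0, np.nan])

# right=False makes each bin include its lower edge and exclude its upper edge,
# for example [25000, 35000) for category 2.
categories = pd.cut(limits, bins=edges, labels=labels, right=False).astype(float)
```

Missing limits stay missing after binning, which matches the NaN default used in the chained `np.where` calls.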

In [19]:
print('\n', color.BOLD + 'Low-income limit categories:' + color.END, pulse.low_income_limit_cat.sort_values().unique(), '\n')


 Low-income limit categories: [2. 3. 4. 5. 6.]



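Finally, I compare the income category reported by each household in the survey to the low-income limit category that corresponds to that household. A household is marked as low-income when its reported bracket is at or below the bracket of its limit.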

In [20]:
pulse['low_income'] = np.where(pulse.INCOME <= pulse.low_income_limit_cat, 1, 0)

print('\n', color.BOLD + 'Low-income limit comparison columns' + color.END, '\n')
display(pulse[['INCOME', 'low_income_limit', 'low_income_limit_cat', 'low_income']].tail(5))


 Low-income limit comparison columns

| | INCOME | low_income_limit | low_income_limit_cat | low_income |
|---|---|---|---|---|
| 2177516 | 3.0 | 57400.0 | 4.0 | 1 |
| 2177517 | 4.0 | 44650.0 | 3.0 | 0 |
| 2177518 | 4.0 | 63750.0 | 4.0 | 1 |
| 2177519 | 4.0 | 44650.0 | 3.0 | 0 |
| 2177520 | 5.0 | 51000.0 | 4.0 | 0 |
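
One caveat may be worth checking. If the <b>INCOME</b> column uses the same -88 and -99 codes as the other survey columns (an assumption, since its value counts are not shown here), those negative codes would compare as less than or equal to any limit category and be counted as low-income. The sketch below, on toy data, leaves such rows as NaN instead.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "INCOME": [3, 4, -88, -99],          # negative codes assumed to mean "no answer"
    "low_income_limit_cat": [4, 3, 4, 3],
})

valid_income = demo.INCOME > 0
demo["low_income"] = np.where(
    valid_income,
    (demo.INCOME <= demo.low_income_limit_cat).astype(float),
    np.nan,  # unanswered income stays missing rather than counting as low-income
)
```

Whether to keep, drop, or flag these rows is a judgment call that depends on how many there are in the real data.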


In [21]:
# Storing df
# index=False keeps reloads from adding another 'Unnamed: 0' column
pulse.to_csv('pulse.csv', index=False)