# THE PANDEMIC'S WORKING PARENTS

#### <u>DATA VETTING</u>

#### The Project

The purpose of this project is to shed light on the challenges that working parents are facing during the 2020-2021 COVID-19 pandemic. My analysis of U.S. Census Household Pulse Survey data reveals that in 2020 California's job market lost more parents with schoolchildren than that of almost any other state, trailing only Nevada and Michigan. In 2020, California households with PreK-12 children were significantly more likely to lose employment income than households without children.


#### The Data
This project uses data for survey weeks 1 through 25 from the U.S. Census Bureau Household Pulse Survey Public Use files (https://www.census.gov/programs-surveys/household-pulse-survey/datasets.html). Each week's data is published in a separate CSV file. The Census Bureau also publishes weekly data dictionaries in Excel format.

## Importing tools

In [1]:
import pandas as pd
import numpy as np
import os

pd.options.mode.chained_assignment = None # None|'warn'|'raise'
pd.set_option('display.float_format', '{:.2f}'.format)

class color:
   BOLD = '\033[1m'
   END = '\033[0m'

<hr>

## Importing the data

I create an empty list to store the Census Pulse Survey files I downloaded. I then create a loop which imports each file and adds it to my list. 

In [2]:
# creating empty list to store imported dfs
pulse_file_lst = []

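# looping through the folder, importing each file and adding the df to pulse_file_lst
# note: os.listdir returns entries in arbitrary order, so [:-1] skips whichever entry is listed last
for file_name in os.listdir('pulse_files')[:-1]: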
    path = '/Users/carolineghisolfi/Desktop/winter_2021/dataj_pulse/pulse_files/' + file_name
    file = pd.read_csv(path, 
                       dtype={
                           'WEEK': int
                       })
    pulse_file_lst.append(file)
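
Because `os.listdir` returns entries in arbitrary order, a loop like the one above can pick up stray non-CSV files or load weeks in an unpredictable order. The following is a minimal sketch of a more defensive loader, not the code used in this notebook: the helper name `load_weekly_csvs`, the file names, and the demo values are all hypothetical, and it assumes every survey file ends in `.csv`.

```python
import glob
import os
import tempfile

import pandas as pd


def load_weekly_csvs(folder, pattern="*.csv"):
    """Read every file matching `pattern` in `folder`, in sorted filename order."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    return [pd.read_csv(path, dtype={"WEEK": int}) for path in paths]


# Demo on a temporary folder holding two made-up CSV files and one stray file.
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({"WEEK": [1, 1], "SCRAM": ["a", "b"]}).to_csv(
        os.path.join(demo_dir, "week01.csv"), index=False
    )
    pd.DataFrame({"WEEK": [2], "SCRAM": ["c"]}).to_csv(
        os.path.join(demo_dir, "week02.csv"), index=False
    )
    with open(os.path.join(demo_dir, "notes.txt"), "w") as stray_file:
        stray_file.write("not a csv")

    demo_frames = load_weekly_csvs(demo_dir)

# Only the two CSV files are loaded, in filename order.
demo_weeks = [int(df["WEEK"].iloc[0]) for df in demo_frames]
print(len(demo_frames), demo_weeks)
```

Filtering on the extension makes the selection explicit, so no positional slice is needed to skip unwanted directory entries.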

In [3]:
print('\n', color.BOLD + 'First file in list' + color.END, '\n')
display(pulse_file_lst[0].head(3))
display(pulse_file_lst[0].info())


First file in list



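| | SCRAM | WEEK | EST_ST | EST_MSA | REGION | HWEIGHT | PWEIGHT | TBIRTH_YEAR | ABIRTH_YEAR | EGENDER | ... | PSWHYCHG1 | PSWHYCHG2 | PSWHYCHG3 | PSWHYCHG4 | PSWHYCHG5 | PSWHYCHG6 | PSWHYCHG7 | PSWHYCHG8 | PSWHYCHG9 | INCOME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | V220000001S10011352410113 | 22 | 1 | NaN | 2 | 1170.79 | 3285.4 | 1978 | 2 | 2 | ... | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 |
| 1 | V220000001S10011554410113 | 22 | 1 | NaN | 2 | 899.97 | 1683.63 | 1947 | 2 | 2 | ... | -99 | -99 | -99 | 1 | -99 | 1 | -99 | -99 | -99 | 4 |
| 2 | V220000001S15010024400123 | 22 | 1 | NaN | 2 | 2077.84 | 3887.14 | 1989 | 2 | 1 | ... | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | 5 |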


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68348 entries, 0 to 68347
Columns: 204 entries, SCRAM to INCOME
dtypes: float64(4), int64(199), object(1)
memory usage: 106.4+ MB


None

I create a master <b>pulse</b> dataframe by concatenating all the files in my list. 

In [4]:
# concatenating dfs in master
pulse = pd.concat(pulse_file_lst)

print('\n', color.BOLD + 'Weeks 1 through 25' + color.END, '\n')
display(pulse.head(3))
display(pulse.info())

print('\n', color.BOLD + 'Weeks in dataframe:' + color.END, np.sort(pulse.WEEK.unique()))


Weeks 1 through 25



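| | SCRAM | WEEK | EST_ST | EST_MSA | REGION | HWEIGHT | PWEIGHT | TBIRTH_YEAR | ABIRTH_YEAR | EGENDER | ... | SNAPMNTH4 | SNAPMNTH5 | SNAPMNTH6 | SNAPMNTH7 | SNAPMNTH8 | SNAPMNTH9 | SNAPMNTH10 | SNAPMNTH11 | SNAPMNTH12 | TBEDROOMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | V220000001S10011352410113 | 22 | 1 | NaN | 2.0 | 1170.79 | 3285.4 | 1978 | 2 | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | V220000001S10011554410113 | 22 | 1 | NaN | 2.0 | 899.97 | 1683.63 | 1947 | 2 | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | V220000001S15010024400123 | 22 | 1 | NaN | 2.0 | 2077.84 | 3887.14 | 1989 | 2 | 1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |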


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2167927 entries, 0 to 108061
Columns: 226 entries, SCRAM to TBEDROOMS
dtypes: float64(153), int64(72), object(1)
memory usage: 3.7+ GB


None


Weeks in dataframe: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25]
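
The printed list of weeks confirms coverage by eye. As a hedged sketch, the same check can be written so that a missing week is reported automatically. The `toy_pulse` dataframe and the expected range of weeks 1 through 3 are made-up stand-ins for the real <b>pulse</b> dataframe and weeks 1 through 25; `ignore_index=True` is an optional choice, not what the notebook does, that gives the concatenated frame a unique index.

```python
import pandas as pd

# Toy stand-in for the concatenated `pulse` dataframe (values are made up).
toy_pulse = pd.concat(
    [pd.DataFrame({"WEEK": [2, 2, 2]}), pd.DataFrame({"WEEK": [1, 1]})],
    ignore_index=True,  # rebuild a unique 0..n-1 index instead of repeating per-file indexes
)

# Hypothetical expected range for the toy data; the notebook expects weeks 1-25.
expected_weeks = set(range(1, 4))

weeks_present = set(toy_pulse["WEEK"].unique().tolist())
missing_weeks = sorted(expected_weeks - weeks_present)

# Number of survey responses per week, ordered by week.
rows_per_week = toy_pulse["WEEK"].value_counts().sort_index()

print("Missing weeks:", missing_weeks)
print(rows_per_week)
```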


<hr>

## Adjusting weights

The Pulse Survey underwent small but significant changes between phase 1 and phases 2 and 3 (an overview of these changes can be found <a href='https://www.huduser.gov/portal/pdredge/pdr-edge-frm-asst-sec-092820.html'>here</a>). For example, the survey was conducted weekly in the first phase but became biweekly in the second and third.

Most importantly for this analysis, the Census Bureau introduced a new household weight variable in phases 2 and 3. In phase 1 (weeks 1-12), household weights were provided in separate weekly Excel sheets; in phases 2 and 3, the Census Bureau included household weights in the <b>HWEIGHT</b> column of the main data files. As a result, <b>HWEIGHT</b> is empty for weeks 1-12 in the <b>pulse</b> dataframe.

In [5]:
# minimum household weight per week; NaN means the week has no weights at all
pulse.groupby('WEEK')['HWEIGHT'].min()

WEEK
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9      NaN
10     NaN
11     NaN
12     NaN
13   19.99
14   16.30
15   22.25
16    9.02
17   20.99
18   25.55
19   31.13
20   17.11
21   32.51
22   31.62
23   22.93
24   20.27
25   32.90
Name: HWEIGHT, dtype: float64
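
A `NaN` minimum shows that a week has no weights at all, but it does not show partially missing weeks. The sketch below, on made-up data, computes the share of missing <b>HWEIGHT</b> values per week instead. The `toy` dataframe and its values are hypothetical and only mirror the column names used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `pulse` dataframe (values are made up).
toy = pd.DataFrame({
    "WEEK": [1, 1, 12, 13, 13],
    "HWEIGHT": [np.nan, np.nan, np.nan, 19.99, 250.0],
})

# Share of rows with a missing household weight, by week (1.0 = all missing).
share_missing = toy["HWEIGHT"].isna().groupby(toy["WEEK"]).mean()

# Weeks for which no household weight is available in the main file.
weeks_without_weights = share_missing[share_missing == 1.0].index.tolist()

print(share_missing)
print("Weeks without weights:", weeks_without_weights)
```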

Once again, I create an empty list and a loop that stores each phase 1 weight file in the list. Then, I concatenate the weight files into a master dataframe named <b>weights</b>.

In [6]:
# creating empty list to store imported dfs
weight_file_lst = []

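# looping through the folder, importing each file and adding the df to weight_file_lst
# note: os.listdir returns entries in arbitrary order, so [:-1] skips whichever entry is listed last
for file_name in os.listdir('weight_files')[:-1]: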
    path = '/Users/carolineghisolfi/Desktop/winter_2021/dataj_pulse/weight_files/' + file_name
    file = pd.read_csv(path)
    weight_file_lst.append(file)
    
# concatenating dfs in master
weights = pd.concat(weight_file_lst)

In [7]:
print('\n', color.BOLD + 'Phase 1 weights' + color.END, '\n')
display(weights.head(3))


Phase 1 weights



| | WEEK | SCRAM | HWEIGHT | HWEIGHT1 | HWEIGHT2 | HWEIGHT3 | HWEIGHT4 | HWEIGHT5 | HWEIGHT6 | HWEIGHT7 | ... | HWEIGHT71 | HWEIGHT72 | HWEIGHT73 | HWEIGHT74 | HWEIGHT75 | HWEIGHT76 | HWEIGHT77 | HWEIGHT78 | HWEIGHT79 | HWEIGHT80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | V010000001S10011099370111 | 1074.75 | 1074.75 | 1133.46 | 832.28 | 2107.0 | 1110.27 | 290.29 | 1861.68 | ... | 914.35 | 1167.75 | 328.65 | 1024.18 | 1707.99 | 992.2 | 1227.19 | 1658.08 | 934.61 | 1263.89 |
| 1 | 1 | V010000001S10011900470112 | 2147.82 | 2147.82 | 3672.59 | 565.94 | 2235.81 | 3380.11 | 2198.79 | 2147.58 | ... | 3117.82 | 2363.21 | 574.87 | 3596.52 | 597.45 | 2264.79 | 2067.21 | 2334.59 | 1993.23 | 2214.06 |
| 2 | 1 | V010000001S18010744940111 | 842.56 | 842.56 | 1578.21 | 223.67 | 927.74 | 1864.36 | 612.33 | 855.89 | ... | 1433.85 | 746.41 | 279.66 | 1126.62 | 295.25 | 915.62 | 701.33 | 1091.08 | 716.94 | 841.38 |


I am only interested in the main weight, <b>HWEIGHT</b>, so I drop the other weight columns (<b>HWEIGHT1</b> through <b>HWEIGHT80</b>).

In [8]:
weights = weights[['WEEK', 'SCRAM', 'HWEIGHT']]

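I merge the <b>pulse</b> and <b>weights</b> dataframes on the week and the respondent identifier (<b>SCRAM</b>). Because both dataframes have an <b>HWEIGHT</b> column, the merge creates two columns distinguished by the <b>_x</b> (from pulse) and <b>_y</b> (from weights) suffixes. I combine the two in a new <b>HWEIGHT</b> column, using the pulse value where it exists and the phase 1 weight otherwise.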

In [9]:
# Merging dfs
pulse = pd.merge(pulse, weights, on=['WEEK', 'SCRAM'], how='left')

# Combining weight cols: keep HWEIGHT_x where present, otherwise fall back to HWEIGHT_y
pulse['HWEIGHT'] = pulse.HWEIGHT_x.fillna(pulse.HWEIGHT_y)
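
A left merge like this silently duplicates rows if a key appears more than once in the right-hand dataframe. The following is a small sketch on made-up data, not the notebook's code, of two optional pandas safeguards: `validate='many_to_one'`, which raises an error if the (`WEEK`, `SCRAM`) keys are not unique in the weights, and `indicator=True`, which records whether each row found a match. The toy dataframes and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the `pulse` and `weights` dataframes (values are made up).
toy_pulse = pd.DataFrame({
    "WEEK": [1, 1, 13],
    "SCRAM": ["A", "B", "C"],
    "HWEIGHT": [np.nan, np.nan, 500.0],  # phase 1 rows have no weight yet
})
toy_weights = pd.DataFrame({
    "WEEK": [1, 1],
    "SCRAM": ["A", "B"],
    "HWEIGHT": [100.0, 200.0],
})

merged = pd.merge(
    toy_pulse,
    toy_weights,
    on=["WEEK", "SCRAM"],
    how="left",
    validate="many_to_one",  # error out if a key repeats in toy_weights
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Keep the weight from the main file where present, else the phase 1 weight.
merged["HWEIGHT"] = merged["HWEIGHT_x"].fillna(merged["HWEIGHT_y"])

print(merged[["WEEK", "SCRAM", "HWEIGHT_x", "HWEIGHT_y", "HWEIGHT", "_merge"]])
```

Rows marked `left_only` are those with no entry in the weights, which makes unmatched weeks easy to count afterwards.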

In [10]:
print('\n', color.BOLD + 'Weight Columns' + color.END, '\n')
display(pulse[['HWEIGHT_x', 'HWEIGHT_y', 'HWEIGHT']].head(3), pulse[['HWEIGHT_x', 'HWEIGHT_y', 'HWEIGHT']].tail(3))


Weight Columns



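| | HWEIGHT_x | HWEIGHT_y | HWEIGHT |
|---|---|---|---|
| 0 | 1170.79 | NaN | 1170.79 |
| 1 | 899.97 | NaN | 899.97 |
| 2 | 2077.84 | NaN | 2077.84 |


| | HWEIGHT_x | HWEIGHT_y | HWEIGHT |
|---|---|---|---|
| 2167924 | NaN | 230.53 | 230.53 |
| 2167925 | NaN | 9.67 | 9.67 |
| 2167926 | NaN | 80.42 | 80.42 |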


<b>Note</b>: The Census Bureau did not provide household weights for <b>Week 9</b> of the survey, so <b>HWEIGHT</b> is missing for every week 9 row.

In [11]:
pulse[pulse.WEEK == 9][['HWEIGHT_x', 'HWEIGHT_y', 'HWEIGHT']]

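| | HWEIGHT_x | HWEIGHT_y | HWEIGHT |
|---|---|---|---|
| 1818774 | NaN | NaN | NaN |
| 1818775 | NaN | NaN | NaN |
| 1818776 | NaN | NaN | NaN |
| 1818777 | NaN | NaN | NaN |
| 1818778 | NaN | NaN | NaN |
| ... | ... | ... | ... |
| 1917432 | NaN | NaN | NaN |
| 1917433 | NaN | NaN | NaN |
| 1917434 | NaN | NaN | NaN |
| 1917435 | NaN | NaN | NaN |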
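
Missing weights after the merge can mean either that the week is absent from the imported weight files or that its keys failed to match. One way to tell the two apart, sketched here on made-up data, is to list which expected weeks never appear in the weights at all. The `toy_weights` dataframe and the expected range of weeks 1 through 4 are hypothetical; in this notebook the phase 1 range would be weeks 1 through 12.

```python
import pandas as pd

# Toy stand-in for the phase 1 `weights` dataframe (values are made up).
toy_weights = pd.DataFrame({
    "WEEK": [1, 1, 2, 4],
    "SCRAM": ["A", "B", "C", "D"],
    "HWEIGHT": [100.0, 200.0, 150.0, 300.0],
})

# Hypothetical phase 1 range for the toy data.
expected_phase1_weeks = range(1, 5)

weeks_with_weights = set(toy_weights["WEEK"].unique().tolist())
weeks_missing_from_weights = [
    week for week in expected_phase1_weeks if week not in weeks_with_weights
]

print("Weeks absent from the weight files:", weeks_missing_from_weights)
```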


In [12]:
# Storing df
%store pulse

Stored 'pulse' (DataFrame)
