<a href="https://colab.research.google.com/github/wenjunsun/Covid-19-analysis-with-uw-ubicomp/blob/master/week11/prepare_data_for_PSM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we prepare data for propensity score matching using the shelter-in-place (SIP) date as the before/after break point, instead of the date of the first case. The reason is that people's behavior doesn't change right after the first case. For example, King County's first case was on January 22nd, but people's behavior didn't change until March. Picking the SIP date as the break point should therefore give a more reasonable estimate of how much the policy affects people's behavior. However, this means that we need a baseline date for counties without a shelter-in-place order. We will use the mean of all shelter-in-place dates as that baseline date.

In [4]:
ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


In [5]:
cd drive/My\ Drive/week11

/content/drive/My Drive/week11


In [6]:
import pandas as pd

# First, calculate the mean shelter in place date across all counties

In [7]:
ls

 2016_US_County_Level_Presidential_Results.csv
'2020 County Health Rankings Data - Additional Measure Data.csv'
'2020 County Health Rankings Data - Ranked Measure Data.csv'
 aggregated_data.csv
 days_since.csv
 plot_time_series.ipynb
 prepare_data_for_PSM.ipynb
 us_states_governors.csv


In [8]:
# load data that has the date of each county's shelter in place
# read the SIP date column as datetime object.
days_since = pd.read_csv("days_since.csv", parse_dates=['Date - shelter in place'])

In [9]:
days_since.head()

Unnamed: 0.1,Unnamed: 0,Date - first case,Date - first death,Date - reopening,Date - shelter in place,Date - shelter in place ends,cnamelong,county,county_code,state,state_code,state_name
0,0,2020-03-24,2020-04-07,2020-04-30,2020-04-04,2020-04-30,Autauga County,1.0,1001.0,1.0,AL,Alabama
1,1,2020-03-15,2020-03-29,2020-04-30,2020-04-04,2020-04-30,Baldwin County,3.0,1003.0,1.0,AL,Alabama
2,2,2020-04-03,2020-04-29,2020-04-30,2020-04-04,2020-04-30,Barbour County,5.0,1005.0,1.0,AL,Alabama
3,3,2020-03-30,2020-05-08,2020-04-30,2020-04-04,2020-04-30,Bibb County,7.0,1007.0,1.0,AL,Alabama
4,4,2020-03-25,2020-05-17,2020-04-30,2020-04-04,2020-04-30,Blount County,9.0,1009.0,1.0,AL,Alabama


In [10]:
days_since.dtypes

Unnamed: 0                               int64
Date - first case                       object
Date - first death                      object
Date - reopening                        object
Date - shelter in place         datetime64[ns]
Date - shelter in place ends            object
cnamelong                               object
county                                 float64
county_code                            float64
state                                  float64
state_code                              object
state_name                              object
dtype: object

In [11]:
days_since['Date - shelter in place'].mean()

Timestamp('2020-03-28 09:36:50.898386944')

As we can see, the mean shelter-in-place date across all counties is 2020-03-28. We will use this date as the window start for counties without a shelter-in-place order. For counties with an order, we will use that county's own SIP date as the window start.
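A minimal sketch of this baseline rule, using a tiny made-up table rather than the real `days_since.csv`. The column name matches the notebook; the three rows and the names `toy`, `baseline`, and `window_start` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for days_since: two counties with an SIP date, one without.
toy = pd.DataFrame({
    "county_code": [1001.0, 53033.0, 5001.0],
    "Date - shelter in place": pd.to_datetime(["2020-04-04", "2020-03-23", None]),
})

# Series.mean() skips NaT, so only counties with an order contribute.
# normalize() drops the time-of-day part, keeping just the calendar date.
baseline = toy["Date - shelter in place"].mean().normalize()

# Counties without an order fall back to the baseline date.
toy["window_start"] = toy["Date - shelter in place"].fillna(baseline)
print(toy)
```

Keeping the original SIP column untouched and writing the fallback into a separate column means the missing values can still be used later to tell the two groups of counties apart.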

# Now we need to get all the data within the time window.

- time window for counties without a shelter-in-place order: 2019-03-28 to 2019-06-01 and 2020-03-28 to 2020-06-01
- time window for counties with a shelter-in-place order: from that county's shelter-in-place date to 2020-06-01, and the same window in 2019.
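The two rules above can be sketched as a small helper that returns explicit start and end dates for both years. This is only an illustration: `window_bounds`, `BASELINE`, and `WINDOW_END` are hypothetical names, and the helper shifts calendar dates back one year with `pd.DateOffset`, which is not how the notebook's own filter works (it compares day-of-year values).

```python
import pandas as pd

BASELINE = pd.Timestamp("2020-03-28")    # mean SIP date computed in the notebook
WINDOW_END = pd.Timestamp("2020-06-01")  # common end of the 2020 window

def window_bounds(sip_date):
    """Return ((start_2019, end_2019), (start_2020, end_2020)) for one county."""
    start_2020 = BASELINE if pd.isnull(sip_date) else sip_date
    one_year = pd.DateOffset(years=1)
    return ((start_2020 - one_year, WINDOW_END - one_year),
            (start_2020, WINDOW_END))

print(window_bounds(pd.NaT))                      # county without an order
print(window_bounds(pd.Timestamp("2020-03-23")))  # King County's SIP date
```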

## combine social distance data with shelter in place date data

In [12]:
data = pd.read_csv("aggregated_data.csv", parse_dates=['date_range_start'])

In [13]:
data.dtypes

date_range_start                   datetime64[ns]
state                                     float64
state_code                                 object
cnamelong                                  object
county_code                               float64
device_count                              float64
part_time_work_behavior_devices           float64
full_time_work_behavior_devices           float64
home_dwell_time                           float64
non_home_dwell_time                       float64
dtype: object

In [14]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time
0,2019-01-01,1.0,AL,Autauga County,1001.0,4708.0,296.0,66.0,4019193.0,162117.0
1,2019-01-01,1.0,AL,Baldwin County,1003.0,19655.0,1125.0,340.0,14443397.0,684136.0
2,2019-01-01,1.0,AL,Barbour County,1005.0,1570.0,84.0,27.0,1009240.0,66210.0
3,2019-01-01,1.0,AL,Bibb County,1007.0,1702.0,102.0,21.0,1450137.0,87795.0
4,2019-01-01,1.0,AL,Blount County,1009.0,5224.0,315.0,84.0,4620005.0,261769.0


In [15]:
days_since.head()

Unnamed: 0.1,Unnamed: 0,Date - first case,Date - first death,Date - reopening,Date - shelter in place,Date - shelter in place ends,cnamelong,county,county_code,state,state_code,state_name
0,0,2020-03-24,2020-04-07,2020-04-30,2020-04-04,2020-04-30,Autauga County,1.0,1001.0,1.0,AL,Alabama
1,1,2020-03-15,2020-03-29,2020-04-30,2020-04-04,2020-04-30,Baldwin County,3.0,1003.0,1.0,AL,Alabama
2,2,2020-04-03,2020-04-29,2020-04-30,2020-04-04,2020-04-30,Barbour County,5.0,1005.0,1.0,AL,Alabama
3,3,2020-03-30,2020-05-08,2020-04-30,2020-04-04,2020-04-30,Bibb County,7.0,1007.0,1.0,AL,Alabama
4,4,2020-03-25,2020-05-17,2020-04-30,2020-04-04,2020-04-30,Blount County,9.0,1009.0,1.0,AL,Alabama


In [16]:
# we only need to combine the date of shelter in place to social distancing data.
# so let's just get the columns we need
days_since = days_since[['county_code', 'Date - shelter in place']]

In [17]:
days_since.head()

Unnamed: 0,county_code,Date - shelter in place
0,1001.0,2020-04-04
1,1003.0,2020-04-04
2,1005.0,2020-04-04
3,1007.0,2020-04-04
4,1009.0,2020-04-04


In [18]:
data = pd.merge(data, days_since, how ="left", on="county_code")

In [20]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place
0,2019-01-01,1.0,AL,Autauga County,1001.0,4708.0,296.0,66.0,4019193.0,162117.0,2020-04-04
1,2019-01-01,1.0,AL,Baldwin County,1003.0,19655.0,1125.0,340.0,14443397.0,684136.0,2020-04-04
2,2019-01-01,1.0,AL,Barbour County,1005.0,1570.0,84.0,27.0,1009240.0,66210.0,2020-04-04
3,2019-01-01,1.0,AL,Bibb County,1007.0,1702.0,102.0,21.0,1450137.0,87795.0,2020-04-04
4,2019-01-01,1.0,AL,Blount County,1009.0,5224.0,315.0,84.0,4620005.0,261769.0,2020-04-04


In [21]:
# these are the counties without shelter in place.
data[data['Date - shelter in place'].isnull()].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place
110,2019-01-01,5.0,AR,Arkansas County,5001.0,1347.0,77.0,27.0,873555.0,74612.0,NaT
111,2019-01-01,5.0,AR,Ashley County,5003.0,1444.0,71.0,30.0,916515.0,86816.0,NaT
112,2019-01-01,5.0,AR,Baxter County,5005.0,2681.0,105.0,51.0,1677715.0,22716.0,NaT
113,2019-01-01,5.0,AR,Benton County,5007.0,21550.0,938.0,290.0,17354346.0,668214.0,NaT
114,2019-01-01,5.0,AR,Boone County,5009.0,2638.0,113.0,36.0,2112064.0,59270.0,NaT
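As a hedged aside on the left merge used here: `pd.merge` accepts `validate` and `indicator` arguments that make it easier to confirm the join behaved as intended. The sketch below uses made-up frames named `mobility` and `sip` (both hypothetical) and is not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins: many mobility rows per county, at most one SIP row per county.
mobility = pd.DataFrame({
    "county_code": [1001.0, 1001.0, 5001.0],
    "home_dwell_time": [10.0, 12.0, 7.0],
})
sip = pd.DataFrame({
    "county_code": [1001.0],
    "Date - shelter in place": pd.to_datetime(["2020-04-04"]),
})

# validate raises MergeError if a county appears twice on the right side;
# indicator adds a "_merge" column saying where each row found a match.
merged = pd.merge(mobility, sip, how="left", on="county_code",
                  validate="many_to_one", indicator=True)

# Rows with NaT are the counties treated as having no SIP order.
no_sip = merged[merged["Date - shelter in place"].isnull()]
print(merged)
```

The `_merge` column distinguishes a county that is absent from the SIP table (`left_only`) from one that is present but has an empty date, which a plain `isnull()` check cannot do.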


## filter out data not within the time window range.

In [22]:
from datetime import datetime

In [23]:
# Given a row, return True if its date falls inside the time window
# and False otherwise.
def isThisRowInTimeWindow(row):
  thisDataDate = row['date_range_start']
  sipDate = row['Date - shelter in place']
  # if this row's SIP date is null, begin the window at the baseline 2020-03-28,
  # else use that row's SIP date as the start of the window
  thisYearBeginDate = pd.Timestamp(2020, 3, 28) if pd.isnull(sipDate) else sipDate

  thisYearEndDate = thisYearBeginDate.replace(month=6, day=1)

  # compare by day of year so the same window applies to both 2019 and 2020
  return thisYearBeginDate.dayofyear <= thisDataDate.dayofyear <= thisYearEndDate.dayofyear
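Row-wise `apply` can be slow on a table with well over a million rows. The sketch below shows one way the same day-of-year test could be written with vectorized pandas operations. It runs on a made-up five-row frame (`toy`), and it assumes every SIP date falls in 2020, so that the window end is always 2020-06-01.

```python
import pandas as pd

# Toy rows: the date of the observation and the county's SIP date (NaT = none).
toy = pd.DataFrame({
    "date_range_start": pd.to_datetime(
        ["2019-01-01", "2019-03-28", "2019-04-15", "2020-05-31", "2020-06-02"]),
    "Date - shelter in place": pd.to_datetime(
        ["2020-03-31", "2020-04-02", None, "2020-03-23", None]),
})

# Window start: the county's SIP date, or the 2020-03-28 baseline if missing.
begin = toy["Date - shelter in place"].fillna(pd.Timestamp("2020-03-28"))
# Window end: June 1st of 2020, expressed as a day of year.
end_doy = pd.Timestamp("2020-06-01").dayofyear

# Same inclusive comparison as the row-wise function, done column-wise.
doy = toy["date_range_start"].dt.dayofyear
toy["withinTimeWindow"] = doy.between(begin.dt.dayofyear, end_doy)
print(toy)
```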

In [24]:
data.iloc[280000]

date_range_start                   2019-03-28 00:00:00
state                                               48
state_code                                          TX
cnamelong                              Colorado County
county_code                                      48089
device_count                                      1810
part_time_work_behavior_devices                    239
full_time_work_behavior_devices                    106
home_dwell_time                                 798574
non_home_dwell_time                             337115
Date - shelter in place            2020-04-02 00:00:00
Name: 280000, dtype: object

In [25]:
isThisRowInTimeWindow(data.iloc[280000])

False

In [26]:
KingData = data[data['county_code'] == 53033]

In [27]:
KingData

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place
2966,2019-01-01,53.0,WA,King County,53033.0,114583.0,3951.0,2077.0,73161723.0,4030288.0,2020-03-23
6192,2019-01-02,53.0,WA,King County,53033.0,117387.0,9283.0,5155.0,64932063.0,8105451.0,2020-03-23
9418,2019-01-03,53.0,WA,King County,53033.0,117073.0,9380.0,5711.0,65496984.0,8517352.0,2020-03-23
12644,2019-01-04,53.0,WA,King County,53033.0,117290.0,9041.0,4900.0,65660241.0,9448583.0,2020-03-23
15870,2019-01-05,53.0,WA,King County,53033.0,116990.0,4704.0,2172.0,76302229.0,4960347.0,2020-03-23
...,...,...,...,...,...,...,...,...,...,...,...
1876623,2020-08-04,53.0,WA,King County,53033.0,103704.0,5550.0,3471.0,69214475.0,4665696.0,2020-03-23
1879849,2020-08-05,53.0,WA,King County,53033.0,105099.0,6087.0,3747.0,68208583.0,4781856.0,2020-03-23
1883074,2020-08-06,53.0,WA,King County,53033.0,106226.0,6013.0,3907.0,69423495.0,4337102.0,2020-03-23
1886300,2020-08-07,53.0,WA,King County,53033.0,106192.0,5346.0,3006.0,64435351.0,5716823.0,2020-03-23


In [28]:
KingData = KingData.copy()  # explicit copy avoids SettingWithCopyWarning
KingData['withinTimeWindow'] = KingData.apply(isThisRowInTimeWindow, axis=1)


In [29]:
KingData = KingData[KingData['withinTimeWindow'] == True]

In [30]:
KingData

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow
267498,2019-03-24,53.0,WA,King County,53033.0,118808.0,6091.0,2464.0,81089794.0,6007667.0,2020-03-23,True
270724,2019-03-25,53.0,WA,King County,53033.0,122723.0,12202.0,9332.0,67392234.0,13449798.0,2020-03-23,True
273950,2019-03-26,53.0,WA,King County,53033.0,120187.0,12776.0,9735.0,63797125.0,15143574.0,2020-03-23,True
277176,2019-03-27,53.0,WA,King County,53033.0,120321.0,13304.0,8305.0,65123183.0,15443649.0,2020-03-23,True
280402,2019-03-28,53.0,WA,King County,53033.0,119195.0,13121.0,9383.0,62116175.0,16438987.0,2020-03-23,True
...,...,...,...,...,...,...,...,...,...,...,...,...
1657390,2020-05-28,53.0,WA,King County,53033.0,93936.0,4800.0,2887.0,77310261.0,2776536.0,2020-03-23,True
1660613,2020-05-29,53.0,WA,King County,53033.0,94147.0,5889.0,3464.0,71710474.0,3730639.0,2020-03-23,True
1663836,2020-05-30,53.0,WA,King County,53033.0,96908.0,4099.0,1937.0,81087270.0,1643943.0,2020-03-23,True
1667059,2020-05-31,53.0,WA,King County,53033.0,99344.0,4259.0,2390.0,83672189.0,1280814.0,2020-03-23,True


In [31]:
KingData = KingData.copy()  # explicit copy avoids SettingWithCopyWarning
KingData['2019?'] = KingData['date_range_start'].dt.year == 2019


In [32]:
KingData

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow,2019?
267498,2019-03-24,53.0,WA,King County,53033.0,118808.0,6091.0,2464.0,81089794.0,6007667.0,2020-03-23,True,True
270724,2019-03-25,53.0,WA,King County,53033.0,122723.0,12202.0,9332.0,67392234.0,13449798.0,2020-03-23,True,True
273950,2019-03-26,53.0,WA,King County,53033.0,120187.0,12776.0,9735.0,63797125.0,15143574.0,2020-03-23,True,True
277176,2019-03-27,53.0,WA,King County,53033.0,120321.0,13304.0,8305.0,65123183.0,15443649.0,2020-03-23,True,True
280402,2019-03-28,53.0,WA,King County,53033.0,119195.0,13121.0,9383.0,62116175.0,16438987.0,2020-03-23,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1657390,2020-05-28,53.0,WA,King County,53033.0,93936.0,4800.0,2887.0,77310261.0,2776536.0,2020-03-23,True,False
1660613,2020-05-29,53.0,WA,King County,53033.0,94147.0,5889.0,3464.0,71710474.0,3730639.0,2020-03-23,True,False
1663836,2020-05-30,53.0,WA,King County,53033.0,96908.0,4099.0,1937.0,81087270.0,1643943.0,2020-03-23,True,False
1667059,2020-05-31,53.0,WA,King County,53033.0,99344.0,4259.0,2390.0,83672189.0,1280814.0,2020-03-23,True,False


In [33]:
data_2019 = KingData[KingData['2019?'] == True]

In [34]:
data_2020 = KingData[KingData['2019?'] == False]

In [36]:
home_dwell_2019 = data_2019['home_dwell_time'].sum()

In [37]:
non_home_dwell_2019 = data_2019['non_home_dwell_time'].sum() 

In [38]:
home_dwell_2020 = data_2020['home_dwell_time'].sum()

In [39]:
non_home_dwell_2020 = data_2020['non_home_dwell_time'].sum()

In [41]:
print(f'2019 percentage of home dwell time: {home_dwell_2019 / (home_dwell_2019 + non_home_dwell_2019)}')
print(f'2020 percentage of home dwell time: {home_dwell_2020 / (home_dwell_2020 + non_home_dwell_2020)}')

2019 percentage of home dwell time: 0.8424550045873533
2020 percentage of home dwell time: 0.9793589266690815


In King County, the share of dwell time spent at home rose from about 84% in the 2019 window to about 98% in the 2020 window after shelter in place, which is what we expect.
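The same home-dwell share can be computed for both years in one step with a `groupby` on the year. The sketch below uses four made-up rows (`toy`) with the notebook's column names; `totals` and `home_share` are hypothetical names.

```python
import pandas as pd

# Toy daily rows for one county: two days in each year.
toy = pd.DataFrame({
    "date_range_start": pd.to_datetime(
        ["2019-04-01", "2019-04-02", "2020-04-01", "2020-04-02"]),
    "home_dwell_time": [80.0, 90.0, 95.0, 99.0],
    "non_home_dwell_time": [20.0, 10.0, 5.0, 1.0],
})

# Sum both dwell-time columns per calendar year.
totals = (toy.groupby(toy["date_range_start"].dt.year)
             [["home_dwell_time", "non_home_dwell_time"]].sum())

# Share of total dwell time spent at home: home / (home + non-home).
home_share = totals["home_dwell_time"] / totals.sum(axis=1)
print(home_share)
```

As in the cells above, this is a ratio of sums over the whole window rather than an average of daily ratios, so days with more recorded dwell time carry more weight.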

In [42]:
data[data['Date - shelter in place'].isnull()].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place
110,2019-01-01,5.0,AR,Arkansas County,5001.0,1347.0,77.0,27.0,873555.0,74612.0,NaT
111,2019-01-01,5.0,AR,Ashley County,5003.0,1444.0,71.0,30.0,916515.0,86816.0,NaT
112,2019-01-01,5.0,AR,Baxter County,5005.0,2681.0,105.0,51.0,1677715.0,22716.0,NaT
113,2019-01-01,5.0,AR,Benton County,5007.0,21550.0,938.0,290.0,17354346.0,668214.0,NaT
114,2019-01-01,5.0,AR,Boone County,5009.0,2638.0,113.0,36.0,2112064.0,59270.0,NaT


In [43]:
data.iloc[97]

date_range_start                   2019-01-01 00:00:00
state                                                4
state_code                                          AZ
cnamelong                              Coconino County
county_code                                       4005
device_count                                      7314
part_time_work_behavior_devices                    297
full_time_work_behavior_devices                    133
home_dwell_time                            3.37339e+06
non_home_dwell_time                             631045
Date - shelter in place            2020-03-31 00:00:00
Name: 97, dtype: object

In [44]:
isThisRowInTimeWindow(data.iloc[97])

False

Seems like our function is working. Let's apply this function to every row in data.

In [48]:
data['withinTimeWindow'] = data.apply(lambda row: isThisRowInTimeWindow(row) , axis = 1)

Let's check if our function did the right thing.

In [49]:
data[data['withinTimeWindow'] == True]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow
241869,2019-03-16,72.0,PR,Adjuntas Municipio,72001.0,369.0,24.0,17.0,178636.0,10671.0,2020-03-15,True
241870,2019-03-16,72.0,PR,Aguada Municipio,72003.0,1061.0,49.0,34.0,579401.0,12360.0,2020-03-15,True
241871,2019-03-16,72.0,PR,Aguadilla Municipio,72005.0,1640.0,73.0,49.0,889210.0,11242.0,2020-03-15,True
241872,2019-03-16,72.0,PR,Aguas Buenas Municipio,72007.0,597.0,34.0,19.0,271526.0,6155.0,2020-03-15,True
241873,2019-03-16,72.0,PR,Aibonito Municipio,72009.0,609.0,27.0,17.0,273104.0,6102.0,2020-03-15,True
...,...,...,...,...,...,...,...,...,...,...,...,...
1670534,2020-06-01,72.0,PR,Yabucoa Municipio,72151.0,912.0,38.0,37.0,663637.0,13601.0,2020-03-15,True
1670535,2020-06-01,72.0,PR,Yauco Municipio,72153.0,3277.0,227.0,154.0,2715685.0,51824.0,2020-03-15,True
1670536,2020-06-01,78.0,VI,St. Croix Island,78010.0,1125.0,63.0,67.0,532147.0,50517.0,NaT,True
1670537,2020-06-01,78.0,VI,St. John Island,78020.0,114.0,7.0,8.0,49063.0,4527.0,NaT,True


In [50]:
data[(data['withinTimeWindow'] == True) & (data['Date - shelter in place'].isnull())]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow
280772,2019-03-29,5.0,AR,Arkansas County,5001.0,1534.0,239.0,90.0,711530.0,358996.0,NaT,True
280773,2019-03-29,5.0,AR,Ashley County,5003.0,1545.0,205.0,72.0,808707.0,325809.0,NaT,True
280774,2019-03-29,5.0,AR,Baxter County,5005.0,2991.0,312.0,148.0,1379416.0,345637.0,NaT,True
280775,2019-03-29,5.0,AR,Benton County,5007.0,23365.0,3253.0,1526.0,13147940.0,4892840.0,NaT,True
280776,2019-03-29,5.0,AR,Boone County,5009.0,2999.0,390.0,187.0,1570295.0,571892.0,NaT,True
...,...,...,...,...,...,...,...,...,...,...,...,...
1670456,2020-06-01,69.0,MP,Saipan Municipality,69110.0,109.0,14.0,13.0,63866.0,7046.0,NaT,True
1670457,2020-06-01,69.0,MP,Tinian Municipality,69120.0,13.0,3.0,3.0,7449.0,572.0,NaT,True
1670536,2020-06-01,78.0,VI,St. Croix Island,78010.0,1125.0,63.0,67.0,532147.0,50517.0,NaT,True
1670537,2020-06-01,78.0,VI,St. John Island,78020.0,114.0,7.0,8.0,49063.0,4527.0,NaT,True


In [51]:
data[(data['withinTimeWindow'] == True) & (data['county_code'] == 66010)]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow
283803,2019-03-29,66.0,GU,Guam,66010.0,1603.0,179.0,161.0,826841.0,49443.0,NaT,True
287029,2019-03-30,66.0,GU,Guam,66010.0,1718.0,192.0,176.0,930995.0,39748.0,NaT,True
290255,2019-03-31,66.0,GU,Guam,66010.0,1658.0,179.0,166.0,928532.0,34860.0,NaT,True
293481,2019-04-01,66.0,GU,Guam,66010.0,1825.0,178.0,157.0,1016954.0,50503.0,NaT,True
296707,2019-04-02,66.0,GU,Guam,66010.0,1782.0,150.0,140.0,974901.0,55123.0,NaT,True
...,...,...,...,...,...,...,...,...,...,...,...,...
1657563,2020-05-28,66.0,GU,Guam,66010.0,4644.0,283.0,183.0,4063549.0,316162.0,NaT,True
1660786,2020-05-29,66.0,GU,Guam,66010.0,4414.0,486.0,239.0,2587026.0,289296.0,NaT,True
1664009,2020-05-30,66.0,GU,Guam,66010.0,175.0,21.0,20.0,56941.0,30067.0,NaT,True
1667232,2020-05-31,66.0,GU,Guam,66010.0,289.0,31.0,28.0,119332.0,29462.0,NaT,True


In [53]:
# King county's data
data[(data['withinTimeWindow'] == True) & (data['county_code'] == 53033)]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow
267498,2019-03-24,53.0,WA,King County,53033.0,118808.0,6091.0,2464.0,81089794.0,6007667.0,2020-03-23,True
270724,2019-03-25,53.0,WA,King County,53033.0,122723.0,12202.0,9332.0,67392234.0,13449798.0,2020-03-23,True
273950,2019-03-26,53.0,WA,King County,53033.0,120187.0,12776.0,9735.0,63797125.0,15143574.0,2020-03-23,True
277176,2019-03-27,53.0,WA,King County,53033.0,120321.0,13304.0,8305.0,65123183.0,15443649.0,2020-03-23,True
280402,2019-03-28,53.0,WA,King County,53033.0,119195.0,13121.0,9383.0,62116175.0,16438987.0,2020-03-23,True
...,...,...,...,...,...,...,...,...,...,...,...,...
1657390,2020-05-28,53.0,WA,King County,53033.0,93936.0,4800.0,2887.0,77310261.0,2776536.0,2020-03-23,True
1660613,2020-05-29,53.0,WA,King County,53033.0,94147.0,5889.0,3464.0,71710474.0,3730639.0,2020-03-23,True
1663836,2020-05-30,53.0,WA,King County,53033.0,96908.0,4099.0,1937.0,81087270.0,1643943.0,2020-03-23,True
1667059,2020-05-31,53.0,WA,King County,53033.0,99344.0,4259.0,2390.0,83672189.0,1280814.0,2020-03-23,True


These look right: the window for counties without SIP runs from 3-28 to 6-1 (the 2019 rows shown above start one day later, on 3-29), and the window for counties with SIP runs from that county's shelter-in-place date to 6-1.
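Instead of eyeballing individual counties, the first and last retained date per county and year can be tabulated directly. This is a sketch on made-up rows (`toy`), not the notebook's data; `bounds`, `first_day`, and `last_day` are hypothetical names.

```python
import pandas as pd

# Toy rows that survived the window filter for two counties.
toy = pd.DataFrame({
    "county_code": [53033.0] * 4 + [5001.0] * 2,
    "date_range_start": pd.to_datetime(
        ["2019-03-24", "2019-05-30", "2020-03-23", "2020-06-01",
         "2019-03-29", "2020-03-28"]),
})

# Earliest and latest date kept for each (county, year) pair.
year = toy["date_range_start"].dt.year.rename("year")
bounds = (toy.groupby(["county_code", year])["date_range_start"]
             .agg(first_day="min", last_day="max"))
print(bounds)
```

Applied to the filtered `data` frame, a table like this would show at a glance whether every county's window starts on its SIP date (or on the baseline date) in both years.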

In [54]:
# only get rows within the time range.
data = data[data['withinTimeWindow'] == True]

In [55]:
data.shape

(423439, 12)

# separate last year's data from this year's data

In [56]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow
241869,2019-03-16,72.0,PR,Adjuntas Municipio,72001.0,369.0,24.0,17.0,178636.0,10671.0,2020-03-15,True
241870,2019-03-16,72.0,PR,Aguada Municipio,72003.0,1061.0,49.0,34.0,579401.0,12360.0,2020-03-15,True
241871,2019-03-16,72.0,PR,Aguadilla Municipio,72005.0,1640.0,73.0,49.0,889210.0,11242.0,2020-03-15,True
241872,2019-03-16,72.0,PR,Aguas Buenas Municipio,72007.0,597.0,34.0,19.0,271526.0,6155.0,2020-03-15,True
241873,2019-03-16,72.0,PR,Aibonito Municipio,72009.0,609.0,27.0,17.0,273104.0,6102.0,2020-03-15,True


In [57]:
# add a column that indicates whether this row's date is in 2019
data['2019?'] = data['date_range_start'].apply(lambda date: date.year == 2019)

In [58]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow,2019?
241869,2019-03-16,72.0,PR,Adjuntas Municipio,72001.0,369.0,24.0,17.0,178636.0,10671.0,2020-03-15,True,True
241870,2019-03-16,72.0,PR,Aguada Municipio,72003.0,1061.0,49.0,34.0,579401.0,12360.0,2020-03-15,True,True
241871,2019-03-16,72.0,PR,Aguadilla Municipio,72005.0,1640.0,73.0,49.0,889210.0,11242.0,2020-03-15,True,True
241872,2019-03-16,72.0,PR,Aguas Buenas Municipio,72007.0,597.0,34.0,19.0,271526.0,6155.0,2020-03-15,True,True
241873,2019-03-16,72.0,PR,Aibonito Municipio,72009.0,609.0,27.0,17.0,273104.0,6102.0,2020-03-15,True,True


In [59]:
data[data['2019?'] == False].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,withinTimeWindow,2019?
1419016,2020-03-15,72.0,PR,Adjuntas Municipio,72001.0,911.0,30.0,16.0,807082.0,3328.0,2020-03-15,True,False
1419017,2020-03-15,72.0,PR,Aguada Municipio,72003.0,1927.0,75.0,34.0,1798533.0,25442.0,2020-03-15,True,False
1419018,2020-03-15,72.0,PR,Aguadilla Municipio,72005.0,2766.0,126.0,52.0,2513833.0,38051.0,2020-03-15,True,False
1419019,2020-03-15,72.0,PR,Aguas Buenas Municipio,72007.0,930.0,48.0,24.0,793095.0,23722.0,2020-03-15,True,False
1419020,2020-03-15,72.0,PR,Aibonito Municipio,72009.0,879.0,28.0,18.0,725721.0,10204.0,2020-03-15,True,False


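We can see that the 2019 window starts on 2019-03-16 even though the shelter-in-place date is 2020-03-15, so the two windows are shifted by one calendar day. This off-by-one error comes from comparing day-of-year values: 2020 is a leap year with 366 days while 2019 has 365, so day 75 is March 15 in 2020 but March 16 in 2019. We don't think this will affect our analysis much.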
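The one-day shift can be reproduced in isolation. The sketch below first shows that a March date in 2020 and the following calendar day in 2019 share the same day-of-year, then gives one possible alternative that compares (month, day) pairs instead. `in_window_by_calendar` is a hypothetical helper, not something the notebook uses, and it assumes the window does not wrap around the end of the year.

```python
import pandas as pd

sip = pd.Timestamp("2020-03-15")
end = pd.Timestamp("2020-06-01")

# 2020 has a Feb 29, so from March onward its day-of-year is one higher
# than the same calendar date in 2019.
print(sip.dayofyear, pd.Timestamp("2019-03-15").dayofyear,
      pd.Timestamp("2019-03-16").dayofyear)

def in_window_by_calendar(date, start, end):
    """Compare (month, day) pairs so a leap year does not shift the window."""
    return (start.month, start.day) <= (date.month, date.day) <= (end.month, end.day)

print(in_window_by_calendar(pd.Timestamp("2019-03-15"), sip, end))
print(in_window_by_calendar(pd.Timestamp("2019-06-02"), sip, end))
```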

In [60]:
data_2019 = data[data['2019?'] == True]

In [61]:
data_2020 = data[data['2019?'] == False]

In [62]:
# drop unnecessary columns (reassign instead of inplace to avoid SettingWithCopyWarning)
data_2019 = data_2019.drop(columns=['withinTimeWindow', '2019?'])
data_2020 = data_2020.drop(columns=['withinTimeWindow', '2019?'])


In [63]:
data_2019.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place
241869,2019-03-16,72.0,PR,Adjuntas Municipio,72001.0,369.0,24.0,17.0,178636.0,10671.0,2020-03-15
241870,2019-03-16,72.0,PR,Aguada Municipio,72003.0,1061.0,49.0,34.0,579401.0,12360.0,2020-03-15
241871,2019-03-16,72.0,PR,Aguadilla Municipio,72005.0,1640.0,73.0,49.0,889210.0,11242.0,2020-03-15
241872,2019-03-16,72.0,PR,Aguas Buenas Municipio,72007.0,597.0,34.0,19.0,271526.0,6155.0,2020-03-15
241873,2019-03-16,72.0,PR,Aibonito Municipio,72009.0,609.0,27.0,17.0,273104.0,6102.0,2020-03-15


In [64]:
data_2020.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place
1419016,2020-03-15,72.0,PR,Adjuntas Municipio,72001.0,911.0,30.0,16.0,807082.0,3328.0,2020-03-15
1419017,2020-03-15,72.0,PR,Aguada Municipio,72003.0,1927.0,75.0,34.0,1798533.0,25442.0,2020-03-15
1419018,2020-03-15,72.0,PR,Aguadilla Municipio,72005.0,2766.0,126.0,52.0,2513833.0,38051.0,2020-03-15
1419019,2020-03-15,72.0,PR,Aguas Buenas Municipio,72007.0,930.0,48.0,24.0,793095.0,23722.0,2020-03-15
1419020,2020-03-15,72.0,PR,Aibonito Municipio,72009.0,879.0,28.0,18.0,725721.0,10204.0,2020-03-15


# add a SIP column to data to indicate whether this county has SIP implemented.

In [65]:
data_2019.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place
241869,2019-03-16,72.0,PR,Adjuntas Municipio,72001.0,369.0,24.0,17.0,178636.0,10671.0,2020-03-15
241870,2019-03-16,72.0,PR,Aguada Municipio,72003.0,1061.0,49.0,34.0,579401.0,12360.0,2020-03-15
241871,2019-03-16,72.0,PR,Aguadilla Municipio,72005.0,1640.0,73.0,49.0,889210.0,11242.0,2020-03-15
241872,2019-03-16,72.0,PR,Aguas Buenas Municipio,72007.0,597.0,34.0,19.0,271526.0,6155.0,2020-03-15
241873,2019-03-16,72.0,PR,Aibonito Municipio,72009.0,609.0,27.0,17.0,273104.0,6102.0,2020-03-15


In [66]:
# 1 if the county has a shelter-in-place date, 0 otherwise
data_2019 = data_2019.assign(**{'SIP?': data_2019['Date - shelter in place'].notnull().astype(int)})
data_2020 = data_2020.assign(**{'SIP?': data_2020['Date - shelter in place'].notnull().astype(int)})


In [67]:
data_2019.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,SIP?
241869,2019-03-16,72.0,PR,Adjuntas Municipio,72001.0,369.0,24.0,17.0,178636.0,10671.0,2020-03-15,1
241870,2019-03-16,72.0,PR,Aguada Municipio,72003.0,1061.0,49.0,34.0,579401.0,12360.0,2020-03-15,1
241871,2019-03-16,72.0,PR,Aguadilla Municipio,72005.0,1640.0,73.0,49.0,889210.0,11242.0,2020-03-15,1
241872,2019-03-16,72.0,PR,Aguas Buenas Municipio,72007.0,597.0,34.0,19.0,271526.0,6155.0,2020-03-15,1
241873,2019-03-16,72.0,PR,Aibonito Municipio,72009.0,609.0,27.0,17.0,273104.0,6102.0,2020-03-15,1


In [68]:
data_2019[data_2019['SIP?'] == 0].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,SIP?
280772,2019-03-29,5.0,AR,Arkansas County,5001.0,1534.0,239.0,90.0,711530.0,358996.0,NaT,0
280773,2019-03-29,5.0,AR,Ashley County,5003.0,1545.0,205.0,72.0,808707.0,325809.0,NaT,0
280774,2019-03-29,5.0,AR,Baxter County,5005.0,2991.0,312.0,148.0,1379416.0,345637.0,NaT,0
280775,2019-03-29,5.0,AR,Benton County,5007.0,23365.0,3253.0,1526.0,13147940.0,4892840.0,NaT,0
280776,2019-03-29,5.0,AR,Boone County,5009.0,2999.0,390.0,187.0,1570295.0,571892.0,NaT,0


# group by county and sum the device counts and dwell times

In [70]:
data_2019.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,Date - shelter in place,SIP?
241869,2019-03-16,72.0,PR,Adjuntas Municipio,72001.0,369.0,24.0,17.0,178636.0,10671.0,2020-03-15,1
241870,2019-03-16,72.0,PR,Aguada Municipio,72003.0,1061.0,49.0,34.0,579401.0,12360.0,2020-03-15,1
241871,2019-03-16,72.0,PR,Aguadilla Municipio,72005.0,1640.0,73.0,49.0,889210.0,11242.0,2020-03-15,1
241872,2019-03-16,72.0,PR,Aguas Buenas Municipio,72007.0,597.0,34.0,19.0,271526.0,6155.0,2020-03-15,1
241873,2019-03-16,72.0,PR,Aibonito Municipio,72009.0,609.0,27.0,17.0,273104.0,6102.0,2020-03-15,1


In [71]:
data_2019_agg = data_2019.groupby(
    ['state', 'state_code', 'cnamelong', 'county_code', 'SIP?']
).agg(
    device_count=('device_count', 'sum'),
    part_time_work_behavior_devices=('part_time_work_behavior_devices', 'sum'),
    full_time_work_behavior_devices=('full_time_work_behavior_devices', 'sum'),
    home_dwell_time=('home_dwell_time', 'sum'),
    non_home_dwell_time=('non_home_dwell_time', 'sum'),
).reset_index()

In [72]:
data_2020_agg = data_2020.groupby(
    ['state', 'state_code', 'cnamelong', 'county_code', 'SIP?']
).agg(
    device_count=('device_count', 'sum'),
    part_time_work_behavior_devices=('part_time_work_behavior_devices', 'sum'),
    full_time_work_behavior_devices=('full_time_work_behavior_devices', 'sum'),
    home_dwell_time=('home_dwell_time', 'sum'),
    non_home_dwell_time=('non_home_dwell_time', 'sum'),
).reset_index()
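Since the 2019 and 2020 aggregations are identical, they could share one helper. The sketch below is one possible way to do that, shown on a made-up three-row frame. `aggregate_by_county`, `KEYS`, and `SUM_COLS` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

KEYS = ["state", "state_code", "cnamelong", "county_code", "SIP?"]
SUM_COLS = ["device_count", "part_time_work_behavior_devices",
            "full_time_work_behavior_devices",
            "home_dwell_time", "non_home_dwell_time"]

def aggregate_by_county(df):
    """Sum the daily counts and dwell times into one row per county."""
    return df.groupby(KEYS, as_index=False)[SUM_COLS].sum()

# Toy daily rows: two days for one county, one day for another.
toy = pd.DataFrame({
    "state": [1.0, 1.0, 5.0],
    "state_code": ["AL", "AL", "AR"],
    "cnamelong": ["Autauga County", "Autauga County", "Arkansas County"],
    "county_code": [1001.0, 1001.0, 5001.0],
    "SIP?": [1, 1, 0],
    "device_count": [10.0, 20.0, 5.0],
    "part_time_work_behavior_devices": [1.0, 2.0, 1.0],
    "full_time_work_behavior_devices": [1.0, 1.0, 0.0],
    "home_dwell_time": [100.0, 200.0, 50.0],
    "non_home_dwell_time": [10.0, 20.0, 5.0],
})

toy_agg = aggregate_by_county(toy)
print(toy_agg)
```

One caveat worth keeping in mind: by default `groupby` drops rows whose key columns contain missing values, so any county with a missing `state` or `cnamelong` would silently disappear from the aggregate.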

In [73]:
data_2019_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time
0,1.0,AL,Autauga County,1001.0,1,303064.0,39481.0,18033.0,207555919.0,59952572.0
1,1.0,AL,Baldwin County,1003.0,1,1350876.0,152967.0,63784.0,782640390.0,222484590.0
2,1.0,AL,Barbour County,1005.0,1,105108.0,10961.0,4847.0,55241847.0,16631395.0
3,1.0,AL,Bibb County,1007.0,1,126167.0,14755.0,7089.0,77506480.0,23095104.0
4,1.0,AL,Blount County,1009.0,1,351527.0,43750.0,22995.0,245274817.0,73748380.0


In [74]:
data_2020_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time
0,1.0,AL,Autauga County,1001.0,1,311971.0,26829.0,12973.0,277471829.0,34661931.0
1,1.0,AL,Baldwin County,1003.0,1,1284706.0,104511.0,44191.0,987101463.0,134823959.0
2,1.0,AL,Barbour County,1005.0,1,98849.0,8194.0,3387.0,65920332.0,12817822.0
3,1.0,AL,Bibb County,1007.0,1,121574.0,10898.0,4144.0,100388865.0,16800976.0
4,1.0,AL,Blount County,1009.0,1,370591.0,32636.0,14847.0,311330513.0,51959142.0


In [75]:
data_2019_agg.shape

(3227, 10)

In [76]:
data_2020_agg.shape

(3227, 10)

# produce the difference in stay-at-home behavior between 2020 and 2019.

In [77]:
data_2019_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time
0,1.0,AL,Autauga County,1001.0,1,303064.0,39481.0,18033.0,207555919.0,59952572.0
1,1.0,AL,Baldwin County,1003.0,1,1350876.0,152967.0,63784.0,782640390.0,222484590.0
2,1.0,AL,Barbour County,1005.0,1,105108.0,10961.0,4847.0,55241847.0,16631395.0
3,1.0,AL,Bibb County,1007.0,1,126167.0,14755.0,7089.0,77506480.0,23095104.0
4,1.0,AL,Blount County,1009.0,1,351527.0,43750.0,22995.0,245274817.0,73748380.0


In [80]:
data_2019_agg['last_year_perc_part_time'] = data_2019_agg['part_time_work_behavior_devices'] / data_2019_agg['device_count']

In [81]:
data_2019_agg['last_year_perc_full_time'] = data_2019_agg['full_time_work_behavior_devices'] / data_2019_agg['device_count']

In [82]:
data_2019_agg['last_year_perc_time_home'] = data_2019_agg['home_dwell_time'] / (data_2019_agg['home_dwell_time'] + data_2019_agg['non_home_dwell_time'])

In [84]:
data_2019_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,last_year_perc_part_time,last_year_perc_full_time,last_year_perc_time_home
0,1.0,AL,Autauga County,1001.0,1,303064.0,39481.0,18033.0,207555919.0,59952572.0,0.130273,0.059502,0.775885
1,1.0,AL,Baldwin County,1003.0,1,1350876.0,152967.0,63784.0,782640390.0,222484590.0,0.113235,0.047217,0.77865
2,1.0,AL,Barbour County,1005.0,1,105108.0,10961.0,4847.0,55241847.0,16631395.0,0.104283,0.046114,0.768601
3,1.0,AL,Bibb County,1007.0,1,126167.0,14755.0,7089.0,77506480.0,23095104.0,0.116948,0.056187,0.77043
4,1.0,AL,Blount County,1009.0,1,351527.0,43750.0,22995.0,245274817.0,73748380.0,0.124457,0.065415,0.768831


In [86]:
data_2019_agg = data_2019_agg[['state','state_code','cnamelong','county_code','SIP?',\
                               'last_year_perc_part_time', 'last_year_perc_full_time', \
                               'last_year_perc_time_home']]

In [88]:
data_2019_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,last_year_perc_part_time,last_year_perc_full_time,last_year_perc_time_home
0,1.0,AL,Autauga County,1001.0,1,0.130273,0.059502,0.775885
1,1.0,AL,Baldwin County,1003.0,1,0.113235,0.047217,0.77865
2,1.0,AL,Barbour County,1005.0,1,0.104283,0.046114,0.768601
3,1.0,AL,Bibb County,1007.0,1,0.116948,0.056187,0.77043
4,1.0,AL,Blount County,1009.0,1,0.124457,0.065415,0.768831


In [89]:
merged_data = data_2020_agg.merge(data_2019_agg, on=['state','state_code','cnamelong',\
                                                     'county_code','SIP?'])

In [90]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,last_year_perc_part_time,last_year_perc_full_time,last_year_perc_time_home
0,1.0,AL,Autauga County,1001.0,1,311971.0,26829.0,12973.0,277471829.0,34661931.0,0.130273,0.059502,0.775885
1,1.0,AL,Baldwin County,1003.0,1,1284706.0,104511.0,44191.0,987101463.0,134823959.0,0.113235,0.047217,0.77865
2,1.0,AL,Barbour County,1005.0,1,98849.0,8194.0,3387.0,65920332.0,12817822.0,0.104283,0.046114,0.768601
3,1.0,AL,Bibb County,1007.0,1,121574.0,10898.0,4144.0,100388865.0,16800976.0,0.116948,0.056187,0.77043
4,1.0,AL,Blount County,1009.0,1,370591.0,32636.0,14847.0,311330513.0,51959142.0,0.124457,0.065415,0.768831


In [91]:
merged_data['this_year_perc_part_time'] = merged_data['part_time_work_behavior_devices'] / merged_data['device_count']

In [92]:
merged_data['this_year_perc_full_time'] = merged_data['full_time_work_behavior_devices'] / merged_data['device_count']

In [93]:
merged_data['this_year_perc_time_home'] = merged_data['home_dwell_time'] / (merged_data['home_dwell_time'] + merged_data['non_home_dwell_time'])

In [94]:
merged_data['diff_in_perc_time_home'] = merged_data['this_year_perc_time_home'] - merged_data['last_year_perc_time_home']

In [95]:
merged_data['diff_in_perc_full_time'] = merged_data['this_year_perc_full_time'] - merged_data['last_year_perc_full_time']

In [96]:
merged_data['diff_in_perc_part_time'] = merged_data['this_year_perc_part_time'] - merged_data['last_year_perc_part_time']

In [97]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,last_year_perc_part_time,last_year_perc_full_time,last_year_perc_time_home,this_year_perc_part_time,this_year_perc_full_time,this_year_perc_time_home,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time
0,1.0,AL,Autauga County,1001.0,1,311971.0,26829.0,12973.0,277471829.0,34661931.0,0.130273,0.059502,0.775885,0.085998,0.041584,0.888952,0.113066,-0.017918,-0.044274
1,1.0,AL,Baldwin County,1003.0,1,1284706.0,104511.0,44191.0,987101463.0,134823959.0,0.113235,0.047217,0.77865,0.08135,0.034398,0.879828,0.101178,-0.012819,-0.031885
2,1.0,AL,Barbour County,1005.0,1,98849.0,8194.0,3387.0,65920332.0,12817822.0,0.104283,0.046114,0.768601,0.082894,0.034264,0.83721,0.068609,-0.01185,-0.021389
3,1.0,AL,Bibb County,1007.0,1,121574.0,10898.0,4144.0,100388865.0,16800976.0,0.116948,0.056187,0.77043,0.089641,0.034086,0.856635,0.086205,-0.022101,-0.027307
4,1.0,AL,Blount County,1009.0,1,370591.0,32636.0,14847.0,311330513.0,51959142.0,0.124457,0.065415,0.768831,0.088065,0.040063,0.856976,0.088145,-0.025352,-0.036392
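
As a check on the derived columns, the Autauga County row can be reproduced by hand from the aggregated dwell times printed earlier. The sketch below uses plain Python and copies the four dwell-time totals from the tables above. The helper name `share_at_home` is only illustrative.

```python
# Autauga County, AL: aggregated dwell times copied from the tables above.
home_2019, non_home_2019 = 207555919.0, 59952572.0
home_2020, non_home_2020 = 277471829.0, 34661931.0

def share_at_home(home, non_home):
    """Fraction of observed dwell time that was spent at home."""
    return home / (home + non_home)

last_year = share_at_home(home_2019, non_home_2019)  # last_year_perc_time_home
this_year = share_at_home(home_2020, non_home_2020)  # this_year_perc_time_home
diff = this_year - last_year                         # diff_in_perc_time_home

print(f"{last_year:.6f} {this_year:.6f} {diff:.6f}")
```

The three printed values should agree with `last_year_perc_time_home`, `this_year_perc_time_home` and `diff_in_perc_time_home` in row 0 (0.775885, 0.888952 and 0.113066). Note that these columns are fractions between 0 and 1 rather than percentages, despite the `perc` in their names.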


In [98]:
merged_data['diff_in_perc_time_home'].mean()

0.07330556698199603

In [99]:
merged_data[merged_data['SIP?'] == 0]['diff_in_perc_time_home'].mean()

0.06229188344218074

In [100]:
merged_data[merged_data['SIP?'] == 1]['diff_in_perc_time_home'].mean()

0.07538737230842396
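
The two group means above can also be obtained in one step with a `groupby`. The sketch below runs on a made-up toy frame rather than `merged_data`, so its numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for merged_data (made-up values).
toy = pd.DataFrame({
    'SIP?': [0, 0, 1, 1, 1],
    'diff_in_perc_time_home': [0.05, 0.07, 0.06, 0.08, 0.10],
})

# Mean change and number of counties for each treatment group.
summary = toy.groupby('SIP?')['diff_in_perc_time_home'].agg(['mean', 'count'])
print(summary)
```

These are unweighted means over counties, so a small county counts as much as a large one. In the real data the raw gap between the groups is about 0.013 (0.0754 versus 0.0623), but the two groups may differ in other ways, which is why the covariates are merged in below for propensity score matching.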

In [101]:
merged_data[merged_data['cnamelong'] == 'King County']

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,last_year_perc_part_time,last_year_perc_full_time,last_year_perc_time_home,this_year_perc_part_time,this_year_perc_full_time,this_year_perc_time_home,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time
2655,48.0,TX,King County,48269.0,1,1329.0,112.0,78.0,669812.0,281426.0,0.084531,0.042553,0.777821,0.084274,0.058691,0.704148,-0.073673,0.016138,-0.000257
2967,53.0,WA,King County,53033.0,1,6412848.0,287519.0,187209.0,5968868000.0,125800506.0,0.092081,0.058538,0.842455,0.044835,0.029193,0.979359,0.136904,-0.029345,-0.047246


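As we can see, for King County, WA the percentage of time spent at home increased after the shelter in place date (0.842 in 2019 versus 0.979 in 2020), which is what we expect. The percentages of devices showing full-time and part-time work behavior typically drop. Note that the filter also returns King County, TX, a much smaller county (1,329 devices), where time at home moved in the opposite direction, so the pattern does not hold for every county.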
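
Filtering on `cnamelong` alone returns two rows here because Texas also has a King County. A minimal sketch of a less ambiguous lookup, either by name plus `state_code` or by `county_code`, is shown below. The toy frame copies only a few columns from the two rows printed above.

```python
import pandas as pd

# The two King County rows from the output above, reduced to a few columns.
toy = pd.DataFrame({
    'state_code': ['TX', 'WA'],
    'cnamelong': ['King County', 'King County'],
    'county_code': [48269.0, 53033.0],
    'diff_in_perc_time_home': [-0.073673, 0.136904],
})

# County names repeat across states, so combine the name with the state ...
king_wa = toy[(toy['cnamelong'] == 'King County') & (toy['state_code'] == 'WA')]

# ... or select on the FIPS code, which is unique per county.
king_wa_by_fips = toy[toy['county_code'] == 53033.0]

print(king_wa)
```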

In [102]:
merged_data[merged_data['cnamelong'] == 'Los Angeles County']

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,home_dwell_time,non_home_dwell_time,last_year_perc_part_time,last_year_perc_full_time,last_year_perc_time_home,this_year_perc_part_time,this_year_perc_full_time,this_year_perc_time_home,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time
203,6.0,CA,Los Angeles County,6037.0,1,26396435.0,1424682.0,978473.0,23820930000.0,581282765.0,0.08957,0.050473,0.861709,0.053973,0.037068,0.976179,0.11447,-0.013404,-0.035598


In [104]:
# drop the raw counts - from here on we only need the percentages and their differences.
merged_data.drop(columns = ['device_count', 'part_time_work_behavior_devices', \
                            'full_time_work_behavior_devices', 'home_dwell_time',\
                            'non_home_dwell_time'], inplace=True)

In [105]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,last_year_perc_part_time,last_year_perc_full_time,last_year_perc_time_home,this_year_perc_part_time,this_year_perc_full_time,this_year_perc_time_home,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time
0,1.0,AL,Autauga County,1001.0,1,0.130273,0.059502,0.775885,0.085998,0.041584,0.888952,0.113066,-0.017918,-0.044274
1,1.0,AL,Baldwin County,1003.0,1,0.113235,0.047217,0.77865,0.08135,0.034398,0.879828,0.101178,-0.012819,-0.031885
2,1.0,AL,Barbour County,1005.0,1,0.104283,0.046114,0.768601,0.082894,0.034264,0.83721,0.068609,-0.01185,-0.021389
3,1.0,AL,Bibb County,1007.0,1,0.116948,0.056187,0.77043,0.089641,0.034086,0.856635,0.086205,-0.022101,-0.027307
4,1.0,AL,Blount County,1009.0,1,0.124457,0.065415,0.768831,0.088065,0.040063,0.856976,0.088145,-0.025352,-0.036392


In [106]:
# drop the per-year percentages - we only need the differences.
merged_data.drop(columns = ['last_year_perc_part_time', 'last_year_perc_full_time', \
                            'last_year_perc_time_home', 'this_year_perc_part_time',\
                            'this_year_perc_full_time', 'this_year_perc_time_home'], inplace=True)

In [107]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time
0,1.0,AL,Autauga County,1001.0,1,0.113066,-0.017918,-0.044274
1,1.0,AL,Baldwin County,1003.0,1,0.101178,-0.012819,-0.031885
2,1.0,AL,Barbour County,1005.0,1,0.068609,-0.01185,-0.021389
3,1.0,AL,Bibb County,1007.0,1,0.086205,-0.022101,-0.027307
4,1.0,AL,Blount County,1009.0,1,0.088145,-0.025352,-0.036392


# combine data with covariates.

In [109]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time
0,1.0,AL,Autauga County,1001.0,1,0.113066,-0.017918,-0.044274
1,1.0,AL,Baldwin County,1003.0,1,0.101178,-0.012819,-0.031885
2,1.0,AL,Barbour County,1005.0,1,0.068609,-0.01185,-0.021389
3,1.0,AL,Bibb County,1007.0,1,0.086205,-0.022101,-0.027307
4,1.0,AL,Blount County,1009.0,1,0.088145,-0.025352,-0.036392


In [110]:
data = merged_data

In [111]:
# covariates data
df_chr_1 = pd.read_csv('2020 County Health Rankings Data - Ranked Measure Data.csv')
df_chr_2 = pd.read_csv('2020 County Health Rankings Data - Additional Measure Data.csv')
df_governors = pd.read_csv('us_states_governors.csv', encoding='latin-1')
df_election = pd.read_csv('2016_US_County_Level_Presidential_Results.csv')

In [112]:
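# label each county by the party with the larger 2016 vote share (ties are labeled Democratic)
df_election['political_party'] = (df_election['per_gop'] > df_election['per_dem']).map({True: 'Republican', False: 'Democratic'})
# positive political_diff means the county leaned Democratic in 2016
df_election['political_diff'] = df_election['per_dem'] - df_election['per_gop']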

In [113]:
df_chr = df_chr_1.merge(df_chr_2, on=['FIPS', 'State', 'County'])
columns = ['FIPS', 'State', 'County', 'Population_y', 'Years of Potential Life Lost Rate', '% Fair or Poor Health', 
           'Average Number of Physically Unhealthy Days', 'Average Number of Mentally Unhealthy Days',
           '% Low Birthweight', '% Smokers', '% Adults with Obesity', 'Food Environment Index', 
           '% Physically Inactive', '% With Access to Exercise Opportunities', '% Excessive Drinking',
           '% Driving Deaths with Alcohol Involvement', 'Chlamydia Rate', 'Teen Birth Rate', '% Uninsured_x',
           'Primary Care Physicians Rate', 'Primary Care Physicians Ratio', 
           'Dentist Rate', 'Dentist Ratio', 'Mental Health Provider Rate', 'Mental Health Provider Ratio',
           'Preventable Hospitalization Rate', '% With Annual Mammogram', '% Vaccinated',
           'High School Graduation Rate', '% Some College', '% Unemployed', '% Children in Poverty',
           'Income Ratio', '% Single-Parent Households', 'Social Association Rate', 'Violent Crime Rate',
           'Injury Death Rate', 'Average Daily PM2.5', 'Presence of Water Violation', '% Severe Housing Problems',
           '% Drive Alone to Work', '% Long Commute - Drives Alone',
           'Life Expectancy', 'Age-Adjusted Death Rate', 'Child Mortality Rate',
           'Infant Mortality Rate', '% Frequent Physical Distress', '% Frequent Mental Distress',
           '% Adults with Diabetes', 'HIV Prevalence Rate', 
           '% Food Insecure', '% Limited Access to Healthy Foods',
           'Drug Overdose Mortality Rate', 'Motor Vehicle Mortality Rate',
           '% Insufficient Sleep', '% Uninsured_y', '% Uninsured.1',
           'Other Primary Care Provider Rate', 'Other Primary Care Provider Ratio','% Disconnected Youth',
           'Average Grade Performance', 'Average Grade Performance.1', 'Median Household Income', 
           '% Enrolled in Free or Reduced Lunch', 'Segregation index', 'Segregation Index', 'Homicide Rate',
           'Suicide Rate (Age-Adjusted)', 'Firearm Fatalities Rate',
           'Juvenile Arrest Rate', 'Average Traffic Volume per Meter of Major Roadways',
           '% Homeowners', '% Severe Housing Cost Burden', '% less than 18 years of age', '% 65 and over',
           '% Black', '% American Indian & Alaska Native', '% Asian', '% Native Hawaiian/Other Pacific Islander',
           '% Hispanic', '% Non-Hispanic White', '% Not Proficient in English', '% Female', '% Rural'
          ]
df_chr = df_chr[columns]

In [114]:
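# governor data is joined per state (left join, so every county is kept)
df_merged = data.merge(df_governors, how='left', left_on='state_code', right_on='State Code')
# the two county-level merges are inner joins: counties missing from either source are dropped
df_merged = df_merged.merge(df_chr, left_on='county_code', right_on='FIPS')
df_merged = df_merged.merge(df_election, left_on='county_code', right_on='combined_fips')
df_merged.head()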

Unnamed: 0.1,state,state_code,cnamelong,county_code,SIP?,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time,State Code,State Name,Governor,Party,Date,Shelter at home begins,Shelter in place ends,Reopen,FIPS,State,County,Population_y,Years of Potential Life Lost Rate,% Fair or Poor Health,Average Number of Physically Unhealthy Days,Average Number of Mentally Unhealthy Days,% Low Birthweight,% Smokers,% Adults with Obesity,Food Environment Index,% Physically Inactive,% With Access to Exercise Opportunities,% Excessive Drinking,% Driving Deaths with Alcohol Involvement,Chlamydia Rate,Teen Birth Rate,% Uninsured_x,Primary Care Physicians Rate,Primary Care Physicians Ratio,Dentist Rate,Dentist Ratio,Mental Health Provider Rate,...,Other Primary Care Provider Rate,Other Primary Care Provider Ratio,% Disconnected Youth,Average Grade Performance,Average Grade Performance.1,Median Household Income,% Enrolled in Free or Reduced Lunch,Segregation index,Segregation Index,Homicide Rate,Suicide Rate (Age-Adjusted),Firearm Fatalities Rate,Juvenile Arrest Rate,Average Traffic Volume per Meter of Major Roadways,% Homeowners,% Severe Housing Cost Burden,% less than 18 years of age,% 65 and over,% Black,% American Indian & Alaska Native,% Asian,% Native Hawaiian/Other Pacific Islander,% Hispanic,% Non-Hispanic White,% Not Proficient in English,% Female,% Rural,Unnamed: 0,votes_dem,votes_gop,total_votes,per_dem,per_gop,diff,per_point_diff,state_abbr,county_name,combined_fips,political_party,political_diff
0,1.0,AL,Autauga County,1001.0,1,0.113066,-0.017918,-0.044274,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1001,Alabama,Autauga,55601,8129.0,21,4.7,4.7,9.0,18,33,7.2,35,69.0,15,27.0,407.2,25.0,9.0,45.0,2220:1,32.0,3089:1,23.0,...,40.0,2527:1,,3.0,2.8,59338.0,43.0,25.0,24.0,5.0,18.0,16.0,11.0,88,75,13.0,23.7,15.6,19.3,0.5,1.2,0.1,3.0,74.3,1,51.4,42.0,29,5908.0,18110.0,24661.0,0.239569,0.734358,12202,49.48%,AL,Autauga County,1001,Republican,-0.494789
1,1.0,AL,Baldwin County,1003.0,1,0.101178,-0.012819,-0.031885,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1003,Alabama,Baldwin,218022,7354.0,18,4.2,4.3,8.0,17,31,8.0,27,74.0,18,31.0,325.0,28.0,11.0,73.0,1372:1,50.0,2019:1,96.0,...,56.0,1787:1,8.0,3.0,2.9,57588.0,48.0,41.0,32.0,3.0,19.0,14.0,26.0,87,74,12.0,21.6,20.4,8.8,0.8,1.2,0.1,4.6,83.1,1,51.5,42.3,30,18409.0,72780.0,94090.0,0.195653,0.773515,54371,57.79%,AL,Baldwin County,1003,Republican,-0.577862
2,1.0,AL,Barbour County,1005.0,1,0.068609,-0.01185,-0.021389,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1005,Alabama,Barbour,24881,10254.0,30,5.4,5.2,11.0,22,42,5.6,24,53.0,13,40.0,716.3,41.0,12.0,32.0,3159:1,36.0,2765:1,8.0,...,52.0,1914:1,13.0,2.7,2.4,34382.0,63.0,25.0,23.0,8.0,13.0,18.0,15.0,102,61,14.0,20.9,19.4,48.0,0.7,0.5,0.2,4.3,45.6,2,47.2,67.8,31,4848.0,5431.0,10390.0,0.466603,0.522714,583,5.61%,AL,Barbour County,1005,Republican,-0.056112
3,1.0,AL,Bibb County,1007.0,1,0.086205,-0.022101,-0.027307,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1007,Alabama,Bibb,22400,11978.0,19,4.6,4.6,10.0,19,38,7.8,34,16.0,16,28.0,339.7,42.0,10.0,49.0,2061:1,22.0,4480:1,22.0,...,112.0,896:1,,2.6,2.4,46064.0,62.0,53.0,53.0,8.0,21.0,24.0,,29,75,10.0,20.5,16.5,21.1,0.4,0.2,0.1,2.6,74.6,0,46.8,68.4,32,1874.0,6733.0,8748.0,0.21422,0.769662,4859,55.54%,AL,Bibb County,1007,Republican,-0.555441
4,1.0,AL,Blount County,1009.0,1,0.088145,-0.025352,-0.036392,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1009,Alabama,Blount,57840,11335.0,22,4.9,4.9,8.0,19,34,8.4,30,16.0,14,19.0,234.4,34.0,13.0,22.0,4463:1,19.0,5258:1,16.0,...,22.0,4449:1,19.0,3.0,2.8,50412.0,53.0,48.0,18.0,6.0,17.0,20.0,7.0,33,79,8.0,23.2,18.2,1.5,0.7,0.3,0.1,9.6,86.9,2,50.7,90.0,33,2150.0,22808.0,25384.0,0.084699,0.898519,20658,81.38%,AL,Blount County,1009,Republican,-0.81382


In [115]:
df_merged.shape

(3140, 113)

In [116]:
data.shape

(3227, 8)
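
The two shapes above show that 87 of the 3227 counties (3227 - 3140) are lost in the merges, because the county-level merges with `df_chr` and `df_election` use the default inner join. A minimal sketch of how the unmatched counties could be listed with `indicator=True` is shown below. The toy frames are made up, and which source actually lacks the missing counties is not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins (made-up subset): three counties on the left, two in the covariate table.
left = pd.DataFrame({
    'county_code': [1001.0, 1003.0, 72009.0],
    'cnamelong': ['Autauga County', 'Baldwin County', 'Aibonito Municipio'],
})
right = pd.DataFrame({
    'FIPS': [1001, 1003],
    'Median Household Income': [59338.0, 57588.0],
})

# Cast the float county code to int so both keys have the same dtype.
left = left.assign(county_code=left['county_code'].astype(int))

# A left join with indicator=True marks rows that found no partner as 'left_only'.
check = left.merge(right, how='left', left_on='county_code',
                   right_on='FIPS', indicator=True)
unmatched = check[check['_merge'] == 'left_only']
print(unmatched[['county_code', 'cnamelong']])
```

Running the same check against the real `df_chr` and `df_election` frames would show whether the dropped counties are concentrated in particular states or territories before the data goes into matching.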

In [117]:
# only select some covariates for propensity score matching
df_reduced_covariate = \
df_merged[['state', 'state_code', 'State Name','cnamelong', 'county_code',\
           'diff_in_perc_time_home', 'diff_in_perc_full_time', 'diff_in_perc_part_time',\
           'SIP?', 'Median Household Income', '% Rural', 'Population_y', 'political_diff',\
           '% less than 18 years of age', '% 65 and over', '% Asian', '% Black', '% Hispanic',\
           '% Non-Hispanic White']]

In [118]:
df_reduced_covariate.head()

Unnamed: 0,state,state_code,State Name,cnamelong,county_code,diff_in_perc_time_home,diff_in_perc_full_time,diff_in_perc_part_time,SIP?,Median Household Income,% Rural,Population_y,political_diff,% less than 18 years of age,% 65 and over,% Asian,% Black,% Hispanic,% Non-Hispanic White
0,1.0,AL,Alabama,Autauga County,1001.0,0.113066,-0.017918,-0.044274,1,59338.0,42.0,55601,-0.494789,23.7,15.6,1.2,19.3,3.0,74.3
1,1.0,AL,Alabama,Baldwin County,1003.0,0.101178,-0.012819,-0.031885,1,57588.0,42.3,218022,-0.577862,21.6,20.4,1.2,8.8,4.6,83.1
2,1.0,AL,Alabama,Barbour County,1005.0,0.068609,-0.01185,-0.021389,1,34382.0,67.8,24881,-0.056112,20.9,19.4,0.5,48.0,4.3,45.6
3,1.0,AL,Alabama,Bibb County,1007.0,0.086205,-0.022101,-0.027307,1,46064.0,68.4,22400,-0.555441,20.5,16.5,0.2,21.1,2.6,74.6
4,1.0,AL,Alabama,Blount County,1009.0,0.088145,-0.025352,-0.036392,1,50412.0,90.0,57840,-0.81382,23.2,18.2,0.3,1.5,9.6,86.9


In [119]:
# save final data
df_reduced_covariate.to_csv("county_data_with_reduced_covariates.csv", index = False)