<a href="https://colab.research.google.com/github/wenjunsun/Covid-19-analysis-with-uw-ubicomp/blob/master/week10/prepare_data_with_SIP_date.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we prepare data for propensity score matching using each county's shelter-in-place (SIP) date as the break point, instead of the date of the first case. The reason is that people's behavior doesn't change right after the first case. For example, King County's first case was on January 22nd, but people's behavior didn't change until March. Picking the SIP date as the break point might therefore yield more reasonable estimates of how much the policy affects people's behavior.

In [1]:
ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


In [2]:
cd drive/My\ Drive/week8

/content/drive/My Drive/week8


In [3]:
import pandas as pd

# First, calculate the mean shelter-in-place date across all counties

In [5]:
ls

 2016_US_County_Level_Presidential_Results.csv
 2019_data.csv
'2020 County Health Rankings Data - Additional Measure Data.csv'
'2020 County Health Rankings Data - Ranked Measure Data.csv'
 2020_data.csv
 check_if_data_is_alright.ipynb
'Copy of agg_social_dist_2.csv'
 county_data_with_covariates.csv
 county_data_with_reduced_covariates.csv
 county_data_with_reduced_covariates_up_to_2020-8-8.csv
 data_2019_agg.csv
 data_2020_agg.csv
 data_for_propensity.csv
 days_since.csv
 merged_data.csv
 prepare_data_with_SIP_date.ipynb
 prepare_new_data.ipynb
 propensity_score_matching_new_data.ipynb
 social_dist_aggregated_on_county.csv
 social_dist_all_trimmed.csv
 social_dist_all_trimmed_new.csv
 social_dist_low_device_count_filtered.csv
 social_dist_reduced.csv
 us_states_governors.csv


In [8]:
# load data that has the date of each county's shelter in place
# read the SIP date column as datetime object.
days_since = pd.read_csv("days_since.csv", parse_dates=['Date - shelter in place'], infer_datetime_format=True)

In [9]:
days_since.head()

Unnamed: 0.1,Unnamed: 0,Date - first case,Date - first death,Date - reopening,Date - shelter in place,Date - shelter in place ends,cnamelong,county,county_code,state,state_code,state_name
0,0,2020-03-24,2020-04-07,2020-04-30,2020-04-04,2020-04-30,Autauga County,1.0,1001.0,1.0,AL,Alabama
1,1,2020-03-15,2020-03-29,2020-04-30,2020-04-04,2020-04-30,Baldwin County,3.0,1003.0,1.0,AL,Alabama
2,2,2020-04-03,2020-04-29,2020-04-30,2020-04-04,2020-04-30,Barbour County,5.0,1005.0,1.0,AL,Alabama
3,3,2020-03-30,2020-05-08,2020-04-30,2020-04-04,2020-04-30,Bibb County,7.0,1007.0,1.0,AL,Alabama
4,4,2020-03-25,2020-05-17,2020-04-30,2020-04-04,2020-04-30,Blount County,9.0,1009.0,1.0,AL,Alabama


In [10]:
days_since.dtypes

Unnamed: 0                               int64
Date - first case                       object
Date - first death                      object
Date - reopening                        object
Date - shelter in place         datetime64[ns]
Date - shelter in place ends            object
cnamelong                               object
county                                 float64
county_code                            float64
state                                  float64
state_code                              object
state_name                              object
dtype: object

In [11]:
days_since['Date - shelter in place'].mean()

Timestamp('2020-03-28 09:36:50.898386944')

As we can see, the mean shelter-in-place date across all counties is 2020-03-28. We will use this date as the window start for counties without a shelter-in-place order. For counties with an order, we will use that county's own SIP date as the window start.

# Now we need to get all the data within the time window.

- The time window for counties without a shelter-in-place order is 2019-03-28 to 2019-06-01 and 2020-03-28 to 2020-06-01.
- The time window for counties with a shelter-in-place order runs from that county's shelter-in-place date to 2020-06-01, plus the same calendar window in 2019.
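The two rules above can be sketched as a small helper. This is only an illustration: `window_for` is a hypothetical name, `None` stands in for a missing SIP date (the notebook itself stores `NaT`), and the fallback start is the mean SIP date rounded to the day.

```python
from datetime import date

# Assumed constants: mean SIP date (rounded to the day) and the common end date.
FALLBACK_START = date(2020, 3, 28)
WINDOW_END = date(2020, 6, 1)

def window_for(sip_date=None):
    """Return the (start, end) window for each year, keyed by year.

    Hypothetical helper, not part of the notebook. `sip_date=None` means
    the county never issued a shelter-in-place order.
    """
    start_2020 = FALLBACK_START if sip_date is None else sip_date
    return {
        2020: (start_2020, WINDOW_END),
        # Same calendar dates one year earlier (safe here because no
        # window starts on February 29).
        2019: (start_2020.replace(year=2019), WINDOW_END.replace(year=2019)),
    }

no_sip = window_for()                       # county without an order
king = window_for(date(2020, 3, 23))        # King County's SIP date
print(no_sip)
print(king)
```

Keeping the window as explicit calendar dates makes it easy to check by eye that both years cover the same span.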

## combine social distance data with shelter in place date data

In [12]:
data = pd.read_csv("social_dist_low_device_count_filtered.csv", parse_dates=['date_range_start'],\
                   infer_datetime_format = True) 

In [13]:
data.dtypes

date_range_start                datetime64[ns]
state                                  float64
state_code                              object
cnamelong                               object
county_code                            float64
device_count                           float64
completely_home_device_count           float64
dtype: object

In [14]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count
0,2019-01-01,1.0,AL,Autauga County,1001.0,4708.0,1829.0
1,2019-01-01,1.0,AL,Baldwin County,1003.0,19655.0,7717.0
2,2019-01-01,1.0,AL,Barbour County,1005.0,1570.0,594.0
3,2019-01-01,1.0,AL,Bibb County,1007.0,1702.0,623.0
4,2019-01-01,1.0,AL,Blount County,1009.0,5224.0,1901.0


In [17]:
# we only need to combine the date of shelter in place to social distancing data.
# so let's just get the columns we need
days_since = days_since[['county_code', 'Date - shelter in place']]

In [18]:
days_since.head()

Unnamed: 0,county_code,Date - shelter in place
0,1001.0,2020-04-04
1,1003.0,2020-04-04
2,1005.0,2020-04-04
3,1007.0,2020-04-04
4,1009.0,2020-04-04


In [21]:
data = pd.merge(data, days_since, how ="left", on="county_code")

In [22]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place
0,2019-01-01,1.0,AL,Autauga County,1001.0,4708.0,1829.0,2020-04-04
1,2019-01-01,1.0,AL,Baldwin County,1003.0,19655.0,7717.0,2020-04-04
2,2019-01-01,1.0,AL,Barbour County,1005.0,1570.0,594.0,2020-04-04
3,2019-01-01,1.0,AL,Bibb County,1007.0,1702.0,623.0,2020-04-04
4,2019-01-01,1.0,AL,Blount County,1009.0,5224.0,1901.0,2020-04-04


In [24]:
# these are the counties without shelter in place.
data[data['Date - shelter in place'].isnull()].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place
97,2019-01-01,5.0,AR,Arkansas County,5001.0,1347.0,462.0,NaT
98,2019-01-01,5.0,AR,Ashley County,5003.0,1444.0,484.0,NaT
99,2019-01-01,5.0,AR,Baxter County,5005.0,2681.0,1291.0,NaT
100,2019-01-01,5.0,AR,Benton County,5007.0,21550.0,8760.0,NaT
101,2019-01-01,5.0,AR,Boone County,5009.0,2638.0,1132.0,NaT


## filter out data not within the time window range.

In [25]:
from datetime import datetime

In [26]:
# Custom function: given a row, return True if the row's date falls
# inside the time window and False otherwise.
def isThisRowInTimeWindow(row):
  thisDataDate = row['date_range_start']
  # If this row's SIP date is null, begin the window at 2020-03-28;
  # otherwise use that row's SIP date as the start of the window.
  if pd.isnull(row['Date - shelter in place']):
    thisYearBeginDate = thisDataDate.replace(2020, 3, 28)
  else:
    thisYearBeginDate = row['Date - shelter in place']

  thisYearEndDate = thisYearBeginDate.replace(month=6, day=1)

  # Return True if this row's date is within the window.
  # The window bounds are 2020 dates and 2020 is a leap year, so dates
  # after day 60 are shifted by one day to line up with 2020.
  # Note: the shift is applied to 2020 rows as well as 2019 rows.
  if thisDataDate.dayofyear <= 60:
    return thisDataDate.dayofyear >= thisYearBeginDate.dayofyear \
      and thisDataDate.dayofyear <= thisYearEndDate.dayofyear
  else:
    return thisDataDate.dayofyear + 1 >= thisYearBeginDate.dayofyear \
      and thisDataDate.dayofyear <= thisYearEndDate.dayofyear

In [34]:
data.iloc[280000]

date_range_start                2019-04-03 00:00:00
state                                            18
state_code                                       IN
cnamelong                             Morgan County
county_code                                   18109
device_count                                   5691
completely_home_device_count                   1425
Date - shelter in place         2020-03-25 00:00:00
Name: 280000, dtype: object

In [35]:
isThisRowInTimeWindow(data.iloc[280000])

True

In [40]:
data[data['Date - shelter in place'].isnull()].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place
97,2019-01-01,5.0,AR,Arkansas County,5001.0,1347.0,462.0,NaT
98,2019-01-01,5.0,AR,Ashley County,5003.0,1444.0,484.0,NaT
99,2019-01-01,5.0,AR,Baxter County,5005.0,2681.0,1291.0,NaT
100,2019-01-01,5.0,AR,Benton County,5007.0,21550.0,8760.0,NaT
101,2019-01-01,5.0,AR,Boone County,5009.0,2638.0,1132.0,NaT


In [37]:
data.iloc[97]

date_range_start                2019-01-01 00:00:00
state                                             5
state_code                                       AR
cnamelong                           Arkansas County
county_code                                    5001
device_count                                   1347
completely_home_device_count                    462
Date - shelter in place                         NaT
Name: 97, dtype: object

In [38]:
isThisRowInTimeWindow(data.iloc[97])

False

In [43]:
data[data['Date - shelter in place'].isnull()][40000:40005]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place
280100,2019-04-03,19.0,IA,Marion County,19125.0,2606.0,621.0,NaT
280101,2019-04-03,19.0,IA,Marshall County,19127.0,2459.0,771.0,NaT
280102,2019-04-03,19.0,IA,Mills County,19129.0,1203.0,300.0,NaT
280103,2019-04-03,19.0,IA,Mitchell County,19131.0,718.0,186.0,NaT
280104,2019-04-03,19.0,IA,Monona County,19133.0,614.0,176.0,NaT


In [45]:
data.iloc[280100]

date_range_start                2019-04-03 00:00:00
state                                            19
state_code                                       IA
cnamelong                             Marion County
county_code                                   19125
device_count                                   2606
completely_home_device_count                    621
Date - shelter in place                         NaT
Name: 280100, dtype: object

This row should be within the time window.

In [46]:
isThisRowInTimeWindow(data.iloc[280100])

True

Seems like our function is working. Let's apply this function to every row in data.

In [47]:
data['withinTimeWindow'] = data.apply(lambda row: isThisRowInTimeWindow(row) , axis = 1)

Let's check if our function did the right thing.

In [48]:
data[data['withinTimeWindow'] == True]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,withinTimeWindow
224582,2019-03-15,72.0,PR,Adjuntas Municipio,72001.0,402.0,157.0,2020-03-15,True
224583,2019-03-15,72.0,PR,Aguada Municipio,72003.0,1106.0,498.0,2020-03-15,True
224584,2019-03-15,72.0,PR,Aguadilla Municipio,72005.0,1688.0,801.0,2020-03-15,True
224585,2019-03-15,72.0,PR,Aguas Buenas Municipio,72007.0,650.0,251.0,2020-03-15,True
224586,2019-03-15,72.0,PR,Aibonito Municipio,72009.0,623.0,250.0,2020-03-15,True
...,...,...,...,...,...,...,...,...,...
1564063,2020-06-01,72.0,PR,Villalba Municipio,72149.0,1001.0,460.0,2020-03-15,True
1564064,2020-06-01,72.0,PR,Yabucoa Municipio,72151.0,912.0,403.0,2020-03-15,True
1564065,2020-06-01,72.0,PR,Yauco Municipio,72153.0,3277.0,1547.0,2020-03-15,True
1564066,2020-06-01,78.0,VI,St. Croix Island,78010.0,1125.0,433.0,NaT,True


In [49]:
data[(data['withinTimeWindow'] == True) & (data['Date - shelter in place'].isnull())]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,withinTimeWindow
261208,2019-03-28,5.0,AR,Arkansas County,5001.0,1469.0,249.0,NaT,True
261209,2019-03-28,5.0,AR,Ashley County,5003.0,1493.0,292.0,NaT,True
261210,2019-03-28,5.0,AR,Baxter County,5005.0,2856.0,809.0,NaT,True
261211,2019-03-28,5.0,AR,Benton County,5007.0,22612.0,5265.0,NaT,True
261212,2019-03-28,5.0,AR,Boone County,5009.0,2904.0,688.0,NaT,True
...,...,...,...,...,...,...,...,...,...
1563987,2020-06-01,56.0,WY,Washakie County,56043.0,419.0,133.0,NaT,True
1563988,2020-06-01,56.0,WY,Weston County,56045.0,386.0,124.0,NaT,True
1563989,2020-06-01,66.0,GU,Guam,66010.0,4528.0,1176.0,NaT,True
1564066,2020-06-01,78.0,VI,St. Croix Island,78010.0,1125.0,433.0,NaT,True


In [50]:
data[(data['withinTimeWindow'] == True) & (data['county_code'] == 66010)]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,withinTimeWindow
264068,2019-03-28,66.0,GU,Guam,66010.0,1672.0,663.0,NaT,True
267099,2019-03-29,66.0,GU,Guam,66010.0,1603.0,654.0,NaT,True
270123,2019-03-30,66.0,GU,Guam,66010.0,1718.0,747.0,NaT,True
273142,2019-03-31,66.0,GU,Guam,66010.0,1658.0,784.0,NaT,True
276175,2019-04-01,66.0,GU,Guam,66010.0,1825.0,757.0,NaT,True
...,...,...,...,...,...,...,...,...,...
1548793,2020-05-27,66.0,GU,Guam,66010.0,4581.0,1513.0,NaT,True
1551822,2020-05-28,66.0,GU,Guam,66010.0,4644.0,1388.0,NaT,True
1554860,2020-05-29,66.0,GU,Guam,66010.0,4414.0,1441.0,NaT,True
1560944,2020-05-31,66.0,GU,Guam,66010.0,289.0,86.0,NaT,True


In [51]:
# King county's data
data[(data['withinTimeWindow'] == True) & (data['cnamelong'] == 'King County')]

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,withinTimeWindow
248722,2019-03-23,53.0,WA,King County,53033.0,118729.0,40176.0,2020-03-23,True
251759,2019-03-24,53.0,WA,King County,53033.0,118808.0,44303.0,2020-03-23,True
254794,2019-03-25,53.0,WA,King County,53033.0,122723.0,37746.0,2020-03-23,True
257828,2019-03-26,53.0,WA,King County,53033.0,120187.0,36193.0,2020-03-23,True
260863,2019-03-27,53.0,WA,King County,53033.0,120321.0,34815.0,2020-03-23,True
...,...,...,...,...,...,...,...,...,...
1551651,2020-05-28,53.0,WA,King County,53033.0,93936.0,38052.0,2020-03-23,True
1554689,2020-05-29,53.0,WA,King County,53033.0,94147.0,36943.0,2020-03-23,True
1557731,2020-05-30,53.0,WA,King County,53033.0,96908.0,42590.0,2020-03-23,True
1560773,2020-05-31,53.0,WA,King County,53033.0,99344.0,45484.0,2020-03-23,True


These look right: for counties without an SIP order the window runs from 3-28 to 6-1, and for counties with an SIP order it runs from that county's shelter-in-place date to 6-1.

In [52]:
# only get rows within the time range.
data = data[data['withinTimeWindow'] == True]

In [53]:
data.shape

(404898, 9)

# separate last year's data from this year's data

In [55]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,withinTimeWindow
224582,2019-03-15,72.0,PR,Adjuntas Municipio,72001.0,402.0,157.0,2020-03-15,True
224583,2019-03-15,72.0,PR,Aguada Municipio,72003.0,1106.0,498.0,2020-03-15,True
224584,2019-03-15,72.0,PR,Aguadilla Municipio,72005.0,1688.0,801.0,2020-03-15,True
224585,2019-03-15,72.0,PR,Aguas Buenas Municipio,72007.0,650.0,251.0,2020-03-15,True
224586,2019-03-15,72.0,PR,Aibonito Municipio,72009.0,623.0,250.0,2020-03-15,True


In [56]:
# add a column that indicates whether this row's date is in 2019
data['2019?'] = data['date_range_start'].apply(lambda date: date.year == 2019)

In [57]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,withinTimeWindow,2019?
224582,2019-03-15,72.0,PR,Adjuntas Municipio,72001.0,402.0,157.0,2020-03-15,True,True
224583,2019-03-15,72.0,PR,Aguada Municipio,72003.0,1106.0,498.0,2020-03-15,True,True
224584,2019-03-15,72.0,PR,Aguadilla Municipio,72005.0,1688.0,801.0,2020-03-15,True,True
224585,2019-03-15,72.0,PR,Aguas Buenas Municipio,72007.0,650.0,251.0,2020-03-15,True,True
224586,2019-03-15,72.0,PR,Aibonito Municipio,72009.0,623.0,250.0,2020-03-15,True,True


In [59]:
data[data['2019?'] == False].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,withinTimeWindow,2019?
1325179,2020-03-14,72.0,PR,Adjuntas Municipio,72001.0,871.0,359.0,2020-03-15,True,False
1325180,2020-03-14,72.0,PR,Aguada Municipio,72003.0,1867.0,752.0,2020-03-15,True,False
1325181,2020-03-14,72.0,PR,Aguadilla Municipio,72005.0,2801.0,1161.0,2020-03-15,True,False
1325182,2020-03-14,72.0,PR,Aguas Buenas Municipio,72007.0,953.0,340.0,2020-03-15,True,False
1325183,2020-03-14,72.0,PR,Aibonito Municipio,72009.0,801.0,316.0,2020-03-15,True,False


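We can see that the date 2020-03-14 ends up in our data even though the shelter-in-place date is 2020-03-15. This is an off-by-one error caused by the leap-year handling: 2020 has 366 days and 2019 has 365, so after February 29 a 2020 date has a day-of-year one greater than the same calendar date in 2019. The function adds 1 to the day-of-year of every date after day 60, which lines up the 2019 rows with the 2020 window but also lets 2020 rows in one day early. We don't expect this to affect the analysis much.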
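One way to avoid the off-by-one entirely is to compare (month, day) pairs instead of day-of-year values, since calendar dates do not shift in a leap year. The sketch below is an alternative, not what the notebook ran: `in_window` is a hypothetical name, and `sip_date=None` stands in for a missing SIP date.

```python
from datetime import date

def in_window(data_date, sip_date=None, fallback=(3, 28), end=(6, 1)):
    """Hypothetical replacement for the day-of-year check.

    Compares (month, day) tuples, so the leap day in 2020 never shifts
    the window. Works with any object exposing .month and .day.
    """
    start = fallback if sip_date is None else (sip_date.month, sip_date.day)
    return start <= (data_date.month, data_date.day) <= end

sip = date(2020, 3, 15)  # e.g. the Puerto Rico rows shown above
checks = {
    "2020-03-14": in_window(date(2020, 3, 14), sip),  # day before SIP
    "2020-03-15": in_window(date(2020, 3, 15), sip),
    "2019-03-15": in_window(date(2019, 3, 15), sip),
    "2019-06-02": in_window(date(2019, 6, 2), sip),   # day after window end
    "no SIP, 2020-03-27": in_window(date(2020, 3, 27)),
}
print(checks)
```

To use this with the notebook's data, the `NaT` values in the SIP column would first need to be mapped to `None`, for example with `pd.isnull`.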

In [60]:
data_2019 = data[data['2019?'] == True]

In [61]:
data_2020 = data[data['2019?'] == False]

In [64]:
# drop unnecessary columns
data_2019.drop(columns=['withinTimeWindow', '2019?'], axis = 1, inplace = True)
data_2020.drop(columns=['withinTimeWindow', '2019?'], axis = 1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
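The warning appears because `data_2019` and `data_2020` were created by boolean indexing, so pandas cannot tell whether they are views or copies of `data`. A common way to silence it is to take an explicit `.copy()` when slicing. The sketch below uses a toy frame with made-up values, not the notebook's data.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "date_range_start": pd.to_datetime(["2019-04-01", "2020-04-01"]),
    "device_count": [100.0, 120.0],
    "2019?": [True, False],
})

# .copy() makes each subset an independent DataFrame, so later edits
# no longer trigger SettingWithCopyWarning.
toy_2019 = toy[toy["2019?"]].copy()
toy_2020 = toy[~toy["2019?"]].copy()

# Reassigning the result of drop() avoids inplace=True altogether.
toy_2019 = toy_2019.drop(columns=["2019?"])
toy_2020 = toy_2020.drop(columns=["2019?"])

# Plain column assignment is now safe on the copies.
toy_2019["SIP?"] = 1
toy_2020["SIP?"] = 1

print(toy_2019)
print(toy_2020)
```

The original `toy` frame is left untouched, which also makes it easier to re-run cells out of order.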


In [65]:
data_2019.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place
224582,2019-03-15,72.0,PR,Adjuntas Municipio,72001.0,402.0,157.0,2020-03-15
224583,2019-03-15,72.0,PR,Aguada Municipio,72003.0,1106.0,498.0,2020-03-15
224584,2019-03-15,72.0,PR,Aguadilla Municipio,72005.0,1688.0,801.0,2020-03-15
224585,2019-03-15,72.0,PR,Aguas Buenas Municipio,72007.0,650.0,251.0,2020-03-15
224586,2019-03-15,72.0,PR,Aibonito Municipio,72009.0,623.0,250.0,2020-03-15


In [66]:
data_2020.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place
1325179,2020-03-14,72.0,PR,Adjuntas Municipio,72001.0,871.0,359.0,2020-03-15
1325180,2020-03-14,72.0,PR,Aguada Municipio,72003.0,1867.0,752.0,2020-03-15
1325181,2020-03-14,72.0,PR,Aguadilla Municipio,72005.0,2801.0,1161.0,2020-03-15
1325182,2020-03-14,72.0,PR,Aguas Buenas Municipio,72007.0,953.0,340.0,2020-03-15
1325183,2020-03-14,72.0,PR,Aibonito Municipio,72009.0,801.0,316.0,2020-03-15


# add a SIP column to data to indicate whether this county has SIP implemented.

In [68]:
data_2019['SIP?'] = data_2019['Date - shelter in place'].apply(lambda x: 0 if pd.isnull(x) else 1)
data_2020['SIP?'] = data_2020['Date - shelter in place'].apply(lambda x: 0 if pd.isnull(x) else 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [69]:
data_2019.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,SIP?
224582,2019-03-15,72.0,PR,Adjuntas Municipio,72001.0,402.0,157.0,2020-03-15,1
224583,2019-03-15,72.0,PR,Aguada Municipio,72003.0,1106.0,498.0,2020-03-15,1
224584,2019-03-15,72.0,PR,Aguadilla Municipio,72005.0,1688.0,801.0,2020-03-15,1
224585,2019-03-15,72.0,PR,Aguas Buenas Municipio,72007.0,650.0,251.0,2020-03-15,1
224586,2019-03-15,72.0,PR,Aibonito Municipio,72009.0,623.0,250.0,2020-03-15,1


In [71]:
data_2019[data_2019['SIP?'] == 0].head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,Date - shelter in place,SIP?
261208,2019-03-28,5.0,AR,Arkansas County,5001.0,1469.0,249.0,NaT,0
261209,2019-03-28,5.0,AR,Ashley County,5003.0,1493.0,292.0,NaT,0
261210,2019-03-28,5.0,AR,Baxter County,5005.0,2856.0,809.0,NaT,0
261211,2019-03-28,5.0,AR,Benton County,5007.0,22612.0,5265.0,NaT,0
261212,2019-03-28,5.0,AR,Boone County,5009.0,2904.0,688.0,NaT,0


# group by county and aggregate on device_count + completely_home_device_count

In [72]:
data_2019_agg = data_2019.groupby(['state','state_code','cnamelong','county_code',\
                                   'SIP?']).agg(device_count = ('device_count', 'sum'),\
                                                completely_home_device_count = ('completely_home_device_count', 'sum')).reset_index()

In [73]:
data_2020_agg = data_2020.groupby(['state','state_code','cnamelong','county_code',\
                                   'SIP?']).agg(device_count = ('device_count', 'sum'),\
                                                completely_home_device_count = ('completely_home_device_count', 'sum')).reset_index()

In [74]:
data_2019_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,completely_home_device_count
0,1.0,AL,Autauga County,1001.0,1,308239.0,70403.0
1,1.0,AL,Baldwin County,1003.0,1,1374799.0,326561.0
2,1.0,AL,Barbour County,1005.0,1,106870.0,26794.0
3,1.0,AL,Bibb County,1007.0,1,128209.0,30203.0
4,1.0,AL,Blount County,1009.0,1,357421.0,76718.0


In [75]:
data_2020_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,completely_home_device_count
0,1.0,AL,Autauga County,1001.0,1,317591.0,88634.0
1,1.0,AL,Baldwin County,1003.0,1,1307972.0,376494.0
2,1.0,AL,Barbour County,1005.0,1,100674.0,25949.0
3,1.0,AL,Bibb County,1007.0,1,123815.0,30630.0
4,1.0,AL,Blount County,1009.0,1,377311.0,95661.0


In [76]:
data_2019_agg.shape

(3098, 7)

In [77]:
data_2020_agg.shape

(3082, 7)

# produce the difference in stay-at-home behavior between 2020 and 2019.

In [78]:
data_2019_agg['last_year_perc'] = data_2019_agg['completely_home_device_count'] / data_2019_agg['device_count']

In [79]:
data_2019_agg.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,completely_home_device_count,last_year_perc
0,1.0,AL,Autauga County,1001.0,1,308239.0,70403.0,0.228404
1,1.0,AL,Baldwin County,1003.0,1,1374799.0,326561.0,0.237534
2,1.0,AL,Barbour County,1005.0,1,106870.0,26794.0,0.250716
3,1.0,AL,Bibb County,1007.0,1,128209.0,30203.0,0.235576
4,1.0,AL,Blount County,1009.0,1,357421.0,76718.0,0.214643


In [80]:
data_2019_agg = data_2019_agg[['state','state_code','cnamelong','county_code','SIP?',\
                               'last_year_perc']]

In [81]:
merged_data = data_2020_agg.merge(data_2019_agg, on=['state','state_code','cnamelong',\
                                                     'county_code','SIP?'])

In [82]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,completely_home_device_count,last_year_perc
0,1.0,AL,Autauga County,1001.0,1,317591.0,88634.0,0.228404
1,1.0,AL,Baldwin County,1003.0,1,1307972.0,376494.0,0.237534
2,1.0,AL,Barbour County,1005.0,1,100674.0,25949.0,0.250716
3,1.0,AL,Bibb County,1007.0,1,123815.0,30630.0,0.235576
4,1.0,AL,Blount County,1009.0,1,377311.0,95661.0,0.214643


In [83]:
merged_data['this_year_perc'] = merged_data['completely_home_device_count'] / merged_data['device_count']

In [84]:
merged_data['diff_in_perc_at_home'] = merged_data['this_year_perc'] - merged_data['last_year_perc']

In [85]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,completely_home_device_count,last_year_perc,this_year_perc,diff_in_perc_at_home
0,1.0,AL,Autauga County,1001.0,1,317591.0,88634.0,0.228404,0.279082,0.050678
1,1.0,AL,Baldwin County,1003.0,1,1307972.0,376494.0,0.237534,0.287846,0.050312
2,1.0,AL,Barbour County,1005.0,1,100674.0,25949.0,0.250716,0.257753,0.007037
3,1.0,AL,Bibb County,1007.0,1,123815.0,30630.0,0.235576,0.247385,0.011809
4,1.0,AL,Blount County,1009.0,1,377311.0,95661.0,0.214643,0.253534,0.03889
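As a quick check of the arithmetic, the first row can be reproduced by hand from the aggregated counts shown earlier for Autauga County. Each share is a ratio of summed counts over the window, so days with more devices carry more weight.

```python
# Aggregated counts for Autauga County, copied from the tables above.
home_2019, devices_2019 = 70403.0, 308239.0
home_2020, devices_2020 = 88634.0, 317591.0

# Share of devices that stayed completely at home in each year's window.
last_year_perc = home_2019 / devices_2019
this_year_perc = home_2020 / devices_2020

# Difference in shares: 0.05 here means 5 percentage points.
diff_in_perc_at_home = this_year_perc - last_year_perc

print(round(last_year_perc, 6), round(this_year_perc, 6), round(diff_in_perc_at_home, 6))
```

Note that the columns are named `perc` but hold fractions between 0 and 1, not percentages.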


In [86]:
merged_data['diff_in_perc_at_home'].mean()

0.03406070169604386

In [87]:
merged_data[merged_data['SIP?'] == 0]['diff_in_perc_at_home'].mean()

0.018074559979937753

In [88]:
merged_data[merged_data['SIP?'] == 1]['diff_in_perc_at_home'].mean()

0.036809367313262245

In [89]:
merged_data[merged_data['cnamelong'] == 'King County']

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,completely_home_device_count,last_year_perc,this_year_perc,diff_in_perc_at_home
2824,53.0,WA,King County,53033.0,1,6500915.0,2971047.0,0.302285,0.45702,0.154735


In [90]:
merged_data[merged_data['cnamelong'] == 'Los Angeles County']

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,device_count,completely_home_device_count,last_year_perc,this_year_perc,diff_in_perc_at_home
189,6.0,CA,Los Angeles County,6037.0,1,26752354.0,11955990.0,0.328817,0.446914,0.118096


We see a very large change in behavior in counties like King County and Los Angeles County (about 15 and 12 percentage points), but the mean change across all counties is only about 3.4 percentage points.
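The per-group means computed above can also be obtained in a single `groupby` call, which makes it easy to add other summaries such as the median and the group size. The sketch below runs on a toy frame with made-up values; only the column names come from the notebook.

```python
import pandas as pd

# Toy values for illustration only; not taken from the real data.
toy = pd.DataFrame({
    "SIP?": [0, 0, 1, 1],
    "diff_in_perc_at_home": [0.01, 0.03, 0.04, 0.06],
})

# One row per SIP group, with several summary statistics side by side.
by_sip = toy.groupby("SIP?")["diff_in_perc_at_home"].agg(["mean", "median", "count"])
print(by_sip)
```

Looking at the median next to the mean gives a rough sense of whether a few counties with large changes are pulling the average up.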

In [None]:
# drop columns we don't need - we only need difference.
merged_data.drop(columns = ['completely_home_device_count', 'device_count'], axis=1, inplace=True)

In [93]:
merged_data.head()

Unnamed: 0,state,state_code,cnamelong,county_code,SIP?,last_year_perc,this_year_perc,diff_in_perc_at_home
0,1.0,AL,Autauga County,1001.0,1,0.228404,0.279082,0.050678
1,1.0,AL,Baldwin County,1003.0,1,0.237534,0.287846,0.050312
2,1.0,AL,Barbour County,1005.0,1,0.250716,0.257753,0.007037
3,1.0,AL,Bibb County,1007.0,1,0.235576,0.247385,0.011809
4,1.0,AL,Blount County,1009.0,1,0.214643,0.253534,0.03889


# combine data with covariates.

In [95]:
data = merged_data

In [96]:
# covariates data
df_chr_1 = pd.read_csv('2020 County Health Rankings Data - Ranked Measure Data.csv')
df_chr_2 = pd.read_csv('2020 County Health Rankings Data - Additional Measure Data.csv')
df_governors = pd.read_csv('us_states_governors.csv', encoding='latin-1')
df_election = pd.read_csv('2016_US_County_Level_Presidential_Results.csv')

In [97]:
df_election['political_party'] = df_election.apply(lambda x: 'Republican' if x['per_gop'] > x['per_dem'] else 'Democratic', axis=1)
df_election['political_diff'] = df_election.apply(lambda x: x['per_dem'] - x['per_gop'], axis=1)

In [98]:
df_chr = df_chr_1.merge(df_chr_2, on=['FIPS', 'State', 'County'])
columns = ['FIPS', 'State', 'County', 'Population_y', 'Years of Potential Life Lost Rate', '% Fair or Poor Health', 
           'Average Number of Physically Unhealthy Days', 'Average Number of Mentally Unhealthy Days',
           '% Low Birthweight', '% Smokers', '% Adults with Obesity', 'Food Environment Index', 
           '% Physically Inactive', '% With Access to Exercise Opportunities', '% Excessive Drinking',
           '% Driving Deaths with Alcohol Involvement', 'Chlamydia Rate', 'Teen Birth Rate', '% Uninsured_x',
           'Primary Care Physicians Rate', 'Primary Care Physicians Ratio', 
           'Dentist Rate', 'Dentist Ratio', 'Mental Health Provider Rate', 'Mental Health Provider Ratio',
           'Preventable Hospitalization Rate', '% With Annual Mammogram', '% Vaccinated',
           'High School Graduation Rate', '% Some College', '% Unemployed', '% Children in Poverty',
           'Income Ratio', '% Single-Parent Households', 'Social Association Rate', 'Violent Crime Rate',
           'Injury Death Rate', 'Average Daily PM2.5', 'Presence of Water Violation', '% Severe Housing Problems',
           '% Drive Alone to Work', '% Long Commute - Drives Alone',
           'Life Expectancy', 'Age-Adjusted Death Rate', 'Child Mortality Rate',
           'Infant Mortality Rate', '% Frequent Physical Distress', '% Frequent Mental Distress',
           '% Adults with Diabetes', 'HIV Prevalence Rate', 
           '% Food Insecure', '% Limited Access to Healthy Foods',
           'Drug Overdose Mortality Rate', 'Motor Vehicle Mortality Rate',
           '% Insufficient Sleep', '% Uninsured_y', '% Uninsured.1',
           'Other Primary Care Provider Rate', 'Other Primary Care Provider Ratio','% Disconnected Youth',
           'Average Grade Performance', 'Average Grade Performance.1', 'Median Household Income', 
           '% Enrolled in Free or Reduced Lunch', 'Segregation index', 'Segregation Index', 'Homicide Rate',
           'Suicide Rate (Age-Adjusted)', 'Firearm Fatalities Rate',
           'Juvenile Arrest Rate', 'Average Traffic Volume per Meter of Major Roadways',
           '% Homeowners', '% Severe Housing Cost Burden', '% less than 18 years of age', '% 65 and over',
           '% Black', '% American Indian & Alaska Native', '% Asian', '% Native Hawaiian/Other Pacific Islander',
           '% Hispanic', '% Non-Hispanic White', '% Not Proficient in English', '% Female', '% Rural'
          ]
df_chr = df_chr[columns]

In [99]:
df_merged = data.merge(df_governors, how='left', left_on='state_code', right_on='State Code')
df_merged = df_merged.merge(df_chr, left_on='county_code', right_on='FIPS')
df_merged = df_merged.merge(df_election, left_on='county_code', right_on='combined_fips')
df_merged.head()

Unnamed: 0.1,state,state_code,cnamelong,county_code,SIP?,last_year_perc,this_year_perc,diff_in_perc_at_home,State Code,State Name,Governor,Party,Date,Shelter at home begins,Shelter in place ends,Reopen,FIPS,State,County,Population_y,Years of Potential Life Lost Rate,% Fair or Poor Health,Average Number of Physically Unhealthy Days,Average Number of Mentally Unhealthy Days,% Low Birthweight,% Smokers,% Adults with Obesity,Food Environment Index,% Physically Inactive,% With Access to Exercise Opportunities,% Excessive Drinking,% Driving Deaths with Alcohol Involvement,Chlamydia Rate,Teen Birth Rate,% Uninsured_x,Primary Care Physicians Rate,Primary Care Physicians Ratio,Dentist Rate,Dentist Ratio,Mental Health Provider Rate,...,Other Primary Care Provider Rate,Other Primary Care Provider Ratio,% Disconnected Youth,Average Grade Performance,Average Grade Performance.1,Median Household Income,% Enrolled in Free or Reduced Lunch,Segregation index,Segregation Index,Homicide Rate,Suicide Rate (Age-Adjusted),Firearm Fatalities Rate,Juvenile Arrest Rate,Average Traffic Volume per Meter of Major Roadways,% Homeowners,% Severe Housing Cost Burden,% less than 18 years of age,% 65 and over,% Black,% American Indian & Alaska Native,% Asian,% Native Hawaiian/Other Pacific Islander,% Hispanic,% Non-Hispanic White,% Not Proficient in English,% Female,% Rural,Unnamed: 0,votes_dem,votes_gop,total_votes,per_dem,per_gop,diff,per_point_diff,state_abbr,county_name,combined_fips,political_party,political_diff
0,1.0,AL,Autauga County,1001.0,1,0.228404,0.279082,0.050678,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1001,Alabama,Autauga,55601,8129.0,21,4.7,4.7,9.0,18,33,7.2,35,69.0,15,27.0,407.2,25.0,9.0,45.0,2220:1,32.0,3089:1,23.0,...,40.0,2527:1,,3.0,2.8,59338.0,43.0,25.0,24.0,5.0,18.0,16.0,11.0,88,75,13.0,23.7,15.6,19.3,0.5,1.2,0.1,3.0,74.3,1,51.4,42.0,29,5908.0,18110.0,24661.0,0.239569,0.734358,12202,49.48%,AL,Autauga County,1001,Republican,-0.494789
1,1.0,AL,Baldwin County,1003.0,1,0.237534,0.287846,0.050312,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1003,Alabama,Baldwin,218022,7354.0,18,4.2,4.3,8.0,17,31,8.0,27,74.0,18,31.0,325.0,28.0,11.0,73.0,1372:1,50.0,2019:1,96.0,...,56.0,1787:1,8.0,3.0,2.9,57588.0,48.0,41.0,32.0,3.0,19.0,14.0,26.0,87,74,12.0,21.6,20.4,8.8,0.8,1.2,0.1,4.6,83.1,1,51.5,42.3,30,18409.0,72780.0,94090.0,0.195653,0.773515,54371,57.79%,AL,Baldwin County,1003,Republican,-0.577862
2,1.0,AL,Barbour County,1005.0,1,0.250716,0.257753,0.007037,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1005,Alabama,Barbour,24881,10254.0,30,5.4,5.2,11.0,22,42,5.6,24,53.0,13,40.0,716.3,41.0,12.0,32.0,3159:1,36.0,2765:1,8.0,...,52.0,1914:1,13.0,2.7,2.4,34382.0,63.0,25.0,23.0,8.0,13.0,18.0,15.0,102,61,14.0,20.9,19.4,48.0,0.7,0.5,0.2,4.3,45.6,2,47.2,67.8,31,4848.0,5431.0,10390.0,0.466603,0.522714,583,5.61%,AL,Barbour County,1005,Republican,-0.056112
3,1.0,AL,Bibb County,1007.0,1,0.235576,0.247385,0.011809,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1007,Alabama,Bibb,22400,11978.0,19,4.6,4.6,10.0,19,38,7.8,34,16.0,16,28.0,339.7,42.0,10.0,49.0,2061:1,22.0,4480:1,22.0,...,112.0,896:1,,2.6,2.4,46064.0,62.0,53.0,53.0,8.0,21.0,24.0,,29,75,10.0,20.5,16.5,21.1,0.4,0.2,0.1,2.6,74.6,0,46.8,68.4,32,1874.0,6733.0,8748.0,0.21422,0.769662,4859,55.54%,AL,Bibb County,1007,Republican,-0.555441
4,1.0,AL,Blount County,1009.0,1,0.214643,0.253534,0.03889,AL,Alabama,Kay Ivey,Republican,10-Apr-17,4/4/2020,4/30/2020,4/30/2020,1009,Alabama,Blount,57840,11335.0,22,4.9,4.9,8.0,19,34,8.4,30,16.0,14,19.0,234.4,34.0,13.0,22.0,4463:1,19.0,5258:1,16.0,...,22.0,4449:1,19.0,3.0,2.8,50412.0,53.0,48.0,18.0,6.0,17.0,20.0,7.0,33,79,8.0,23.2,18.2,1.5,0.7,0.3,0.1,9.6,86.9,2,50.7,90.0,33,2150.0,22808.0,25384.0,0.084699,0.898519,20658,81.38%,AL,Blount County,1009,Republican,-0.81382


In [100]:
df_merged.shape

(2996, 113)

In [101]:
data.shape

(3074, 8)
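
The two shapes above differ by 78 rows (3074 in `data`, 2996 in `df_merged`), presumably counties that had no match in one of the inner merges. A minimal sketch of how such rows could be identified, using pandas' `indicator=True` option on an outer merge. The toy frames `data_toy` and `chr_toy` and their values are made up for illustration and are not the notebook's real data.

```python
import pandas as pd

# Toy stand-ins for `data` and `df_chr` (hypothetical values).
data_toy = pd.DataFrame({
    "county_code": [1001, 1003, 2013],
    "diff_in_perc_at_home": [0.05, 0.05, 0.01],
})
chr_toy = pd.DataFrame({
    "FIPS": [1001, 1003, 1005],
    "% Rural": [42.0, 42.3, 67.8],
})

# An outer merge keeps all rows from both sides; indicator=True adds a
# `_merge` column saying whether each row came from left, right, or both.
check = data_toy.merge(
    chr_toy, how="outer", left_on="county_code", right_on="FIPS", indicator=True
)

# Rows marked 'left_only' are the ones an inner merge would silently drop.
dropped = check.loc[check["_merge"] == "left_only", "county_code"].tolist()
print(dropped)
```

Running the same check against the real `df_chr` and `df_election` tables would show which counties are missing from each source.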

In [102]:
# only select some covariates for propensity score matching
df_reduced_covariate = df_merged[[
    'state', 'state_code', 'State Name', 'cnamelong', 'county_code', 'diff_in_perc_at_home',
    'SIP?', 'Median Household Income', '% Rural', 'Population_y', 'political_diff',
    '% less than 18 years of age', '% 65 and over', '% Asian', '% Black', '% Hispanic',
    '% Non-Hispanic White'
]]

In [103]:
df_reduced_covariate.head()

Unnamed: 0,state,state_code,State Name,cnamelong,county_code,diff_in_perc_at_home,SIP?,Median Household Income,% Rural,Population_y,political_diff,% less than 18 years of age,% 65 and over,% Asian,% Black,% Hispanic,% Non-Hispanic White
0,1.0,AL,Alabama,Autauga County,1001.0,0.050678,1,59338.0,42.0,55601,-0.494789,23.7,15.6,1.2,19.3,3.0,74.3
1,1.0,AL,Alabama,Baldwin County,1003.0,0.050312,1,57588.0,42.3,218022,-0.577862,21.6,20.4,1.2,8.8,4.6,83.1
2,1.0,AL,Alabama,Barbour County,1005.0,0.007037,1,34382.0,67.8,24881,-0.056112,20.9,19.4,0.5,48.0,4.3,45.6
3,1.0,AL,Alabama,Bibb County,1007.0,0.011809,1,46064.0,68.4,22400,-0.555441,20.5,16.5,0.2,21.1,2.6,74.6
4,1.0,AL,Alabama,Blount County,1009.0,0.03889,1,50412.0,90.0,57840,-0.81382,23.2,18.2,0.3,1.5,9.6,86.9
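
Some columns in the merged table contain missing values, and a propensity score model typically cannot be fit on rows with missing covariates. Before saving, it may be worth counting the gaps in the selected columns. The sketch below uses a small made-up frame named `toy`; the same two calls would apply to `df_reduced_covariate`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_reduced_covariate (hypothetical values, one gap added).
toy = pd.DataFrame({
    "county_code": [1001, 1003, 1005],
    "Median Household Income": [59338.0, np.nan, 34382.0],
    "% Rural": [42.0, 42.3, 67.8],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple option (an assumption, not the notebook's method):
# keep only the rows with no missing covariates.
complete_rows = toy.dropna()
print(len(complete_rows))
```

Whether to drop or impute incomplete rows is a modeling choice; dropping is only the simplest option.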


In [104]:
# save the reduced covariate table (SIP-date version) for propensity score matching
df_reduced_covariate.to_csv("county_data_with_reduced_covariates_with_SIP.csv", index=False)