
In the original data, each row describes one subgroup (a census block group) within a county. We want one row per county per day, so we aggregate the subgroups. Count columns can simply be summed, but a column such as median home dwell time cannot be combined with a plain mean, because subgroups contain different numbers of devices. We therefore wrote a custom weighted average function, weighted by device count, that combines all rows for a county on the same day. The rest of this notebook walks through the details of the aggregation.
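As a minimal sketch of the idea, the toy example below (hypothetical values, not taken from the real data set) compares a plain mean with a mean weighted by `device_count`. The frame `toy` and the helper `weighted_mean` are names made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two block groups in the same county on the same day.
toy = pd.DataFrame({
    "county_code": [1015, 1015],
    "device_count": [80, 20],
    "median_home_dwell_time": [700.0, 200.0],
})

def weighted_mean(values):
    # Look up the device counts for exactly the rows in this group.
    return np.average(values, weights=toy.loc[values.index, "device_count"])

result = toy.groupby("county_code").agg(
    device_count=("device_count", "sum"),
    median_home_dwell_time=("median_home_dwell_time", weighted_mean),
)

plain_mean = toy["median_home_dwell_time"].mean()
weighted = result.loc[1015, "median_home_dwell_time"]
print(plain_mean, weighted)
```

The plain mean treats both block groups equally, while the weighted mean leans toward the block group with more devices.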

# Try our custom weighted mean function on a small data set to ensure correctness

In [None]:
import pandas as pd
import numpy as np

In [None]:
ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


In [None]:
cd drive/My\ Drive/COVID\ 19\ data\ analysis

/content/drive/My Drive/COVID 19 data analysis


In [None]:
ls

 aggreagate_data_on_county.ipynb
 agg_social_dist
 analysis_and_graphs.ipynb
 compare_30_days_before_and_after.ipynb
'Copy of COVID-19 US state policy database.csv'
 days_since.csv
 merge_mobility_with_first_case_and_shelter_in_place.ipynb
 small_data.csv
 social_dist_all_trimmed.csv
'Social Distancing Index Exploration.ipynb'
 social_dist_with_dates
 social_dist_with_days_since


In [None]:
# only read partial data, to see if my groupby and agg function 
# are doing things right
data = pd.read_csv("social_dist_all_trimmed.csv", nrows=100000)

In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
0,0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Calhoun County,1015,10150007002,,80,25,5,6,,752,,,5431.0
1,1,2020-01-01T00:00:00-05:00,2020-01-02T00:00:00-05:00,1.0,AL,Cleburne County,1029,10299598001,,156,39,10,17,,797,,,15016.0
2,2,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Cleburne County,1029,10299598001,,156,39,10,17,,797,,,15016.0
3,3,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Jefferson County,1073,10730109006,,38,14,2,6,,713,,,7419.0
4,4,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Tuscaloosa County,1125,11250103023,,131,30,13,31,,750,,,11979.0


In [None]:
# filter down to a single county on a single day (Calhoun County, AL, 2020-01-01)
# so that the aggregated result can be checked by hand
small_data = data[(data['state'] ==  1.0) & (data['cnamelong'] == 'Calhoun County') & (data['date_range_start'] == '2020-01-01T00:00:00-06:00')]

In [None]:
# just pick the first 5 for our purpose
small_data = small_data.iloc[:5]

In [None]:
small_data

Unnamed: 0.1,Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
0,0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Calhoun County,1015,10150007002,,80,25,5,6,,752,,,5431.0
4937,4937,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Calhoun County,1015,10150008001,,138,56,8,12,,159,,,100893.0
8572,8572,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Calhoun County,1015,10150025023,,187,41,15,29,,815,,,10692.0
12399,12399,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Calhoun County,1015,10150026002,,158,41,21,36,,771,,,13812.0
13117,13117,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Calhoun County,1015,10150010002,,69,20,2,14,,1008,,,4944.0


In [None]:
# wm stands for weighted mean. This is the function that calculates
# the weighted average of an array based on device count.
wm1 = lambda x: np.average(x, weights= small_data.loc[x.index, "device_count"])

In [None]:
# group by all the other features like state, county, and start_date.
# main purpose is to aggregate the statistics of a county based on
# multiple "origin_census_block_group"s
agg_data1 = small_data.groupby(["date_range_start",'date_range_end','state','state_code','cnamelong','county_code']).agg(
    candidate_device_count = ('candidate_device_count', 'sum'),
    device_count = ('device_count', 'sum'),
    completely_home_device_count = ('completely_home_device_count', 'sum'),
    part_time_work_behavior_devices = ('part_time_work_behavior_devices', 'sum'),
    full_time_work_behavior_devices = ('full_time_work_behavior_devices', 'sum'),
    delivery_behavior_devices = ('delivery_behavior_devices', 'sum'),
    median_home_dwell_time = ('median_home_dwell_time', wm1),
    median_non_home_dwell_time = ('median_non_home_dwell_time', wm1),
    median_percentage_time_home = ('median_percentage_time_home', wm1),
    distance_traveled_from_home = ('distance_traveled_from_home',"sum")).reset_index()

In [None]:
agg_data1.iloc[0]

date_range_start                   2020-01-01T00:00:00-06:00
date_range_end                     2020-01-02T00:00:00-06:00
state                                                      1
state_code                                                AL
cnamelong                                     Calhoun County
county_code                                             1015
candidate_device_count                                     0
device_count                                             632
completely_home_device_count                             183
part_time_work_behavior_devices                           51
full_time_work_behavior_devices                           97
delivery_behavior_devices                                  0
median_home_dwell_time                               673.856
median_non_home_dwell_time                               NaN
median_percentage_time_home                              NaN
distance_traveled_from_home                           135772
Name: 0, dtype: object

In [None]:
(752 *80 + 159*138 + 815*187 + 771*158 + 1008*69) / (80 + 138 + 187 + 158 + 69)

673.8560126582279

The hand calculation matches the aggregated value, so the weighted mean function calculates median_home_dwell_time correctly.
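The same hand check can be written as an assertion, so it fails loudly if the numbers ever stop agreeing. This is only a sketch: the arrays `dwell` and `devices` are copied by hand from the five Calhoun County rows shown earlier.

```python
import numpy as np

# Values copied from the five Calhoun County rows in small_data.
dwell = np.array([752, 159, 815, 771, 1008])    # median_home_dwell_time
devices = np.array([80, 138, 187, 158, 69])     # device_count

expected = np.average(dwell, weights=devices)

# Compare against the value computed by hand in the notebook.
assert np.isclose(expected, 673.8560126582279)
assert devices.sum() == 632
```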

# Now let's try it on the whole data set

In [None]:
data = pd.read_csv("social_dist_with_days_since")

In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home,Date - first case,Date - shelter in place,days_since_first_case,days_since_shelter
0,0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Calhoun County,1015,10150007002,,80,25,5,6,,752,,,5431.0,2020-03-18,2020-04-04 00:00:00,-77.0,-94.0
1,1,2020-01-01T00:00:00-05:00,2020-01-02T00:00:00-05:00,1.0,AL,Cleburne County,1029,10299598001,,156,39,10,17,,797,,,15016.0,2020-03-25,2020-04-04 00:00:00,-84.0,-94.0
2,2,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Cleburne County,1029,10299598001,,156,39,10,17,,797,,,15016.0,2020-03-25,2020-04-04 00:00:00,-84.0,-94.0
3,3,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Jefferson County,1073,10730109006,,38,14,2,6,,713,,,7419.0,2020-03-13,2020-04-04 00:00:00,-72.0,-94.0
4,4,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1.0,AL,Tuscaloosa County,1125,11250103023,,131,30,13,31,,750,,,11979.0,2020-03-14,2020-04-04 00:00:00,-73.0,-94.0


In [None]:
wm = lambda x: np.average(x, weights= data.loc[x.index, "device_count"])
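Note that this lambda reads the weights from whatever the global name `data` points to when it is called, and it relies on the frame having a unique index. As a hedged alternative, the sketch below binds the frame explicitly through a small factory. `make_weighted_mean` and the toy frame are hypothetical names for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_weighted_mean(frame, weight_col="device_count"):
    """Return an aggregation function whose weights come from `frame`."""
    def weighted_mean(values):
        # values.index selects the rows of `frame` that belong to this group.
        return np.average(values, weights=frame.loc[values.index, weight_col])
    return weighted_mean

# Hypothetical toy frame with two groups.
frame = pd.DataFrame({
    "group": ["a", "a", "b"],
    "device_count": [1, 3, 2],
    "value": [10.0, 20.0, 5.0],
})

wm_frame = make_weighted_mean(frame)
out = frame.groupby("group").agg(value=("value", wm_frame))
print(out)
```

Because each function is tied to one frame, a helper built for a small test frame cannot accidentally pick up weights from the full data set, or the other way around.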

In [None]:
agg_data = data.groupby(["date_range_start",'date_range_end','state','state_code','cnamelong','county_code','days_since_first_case','days_since_shelter']).agg(
    candidate_device_count = ('candidate_device_count', 'sum'),
    device_count = ('device_count', 'sum'),
    completely_home_device_count = ('completely_home_device_count', 'sum'),
    part_time_work_behavior_devices = ('part_time_work_behavior_devices', 'sum'),
    full_time_work_behavior_devices = ('full_time_work_behavior_devices', 'sum'),
    delivery_behavior_devices = ('delivery_behavior_devices', 'sum'),
    median_home_dwell_time = ('median_home_dwell_time',wm),
    median_non_home_dwell_time = ('median_non_home_dwell_time', wm),
    median_percentage_time_home = ('median_percentage_time_home', wm), 
    distance_traveled_from_home = ('distance_traveled_from_home', wm)).reset_index()

In [None]:
agg_data.head(10)

Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,days_since_first_case,days_since_shelter,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
0,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,23.0,ME,Aroostook County,23003,-93.0,-91.0,0.0,56,20,5,7,0.0,477.0,,,22780.0
1,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,23.0,ME,Washington County,23029,-93.0,-91.0,0.0,156,68,5,24,0.0,280.461538,,,89197.25
2,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Adjuntas Municipio,72001,-75.0,-74.0,0.0,241,115,14,19,0.0,890.377593,,,140597.925311
3,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Aguada Municipio,72003,-75.0,-74.0,0.0,697,327,40,59,0.0,775.674319,,,60138.74175
4,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Aguadilla Municipio,72005,-75.0,-74.0,0.0,1103,518,55,98,0.0,740.937443,,,8579.466002
5,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Aguas Buenas Municipio,72007,-75.0,-74.0,0.0,333,160,24,39,0.0,526.009009,,,6873.354354
6,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Aibonito Municipio,72009,-75.0,-74.0,0.0,222,88,14,16,0.0,708.725225,,,5225.873874
7,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Arecibo Municipio,72013,-75.0,-74.0,0.0,1637,876,88,135,0.0,815.136225,,,20897.327428
8,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Arroyo Municipio,72015,-75.0,-74.0,0.0,250,103,14,18,0.0,791.272,,,4622.016
9,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72.0,PR,Añasco Municipio,72011,-75.0,-74.0,0.0,466,234,22,31,0.0,718.133047,,,5627.523605


In [None]:
agg_data.iloc[0]

date_range_start                   2020-01-01T00:00:00-04:00
date_range_end                     2020-01-02T00:00:00-04:00
state                                                     23
state_code                                                ME
cnamelong                                   Aroostook County
county_code                                            23003
days_since_first_case                                    -93
days_since_shelter                                       -91
candidate_device_count                                     0
device_count                                              56
completely_home_device_count                              20
part_time_work_behavior_devices                            5
full_time_work_behavior_devices                            7
delivery_behavior_devices                                  0
median_home_dwell_time                                   477
median_non_home_dwell_time                               NaN
median_percentage_time_h

In [None]:
agg_data.shape

(332517, 18)

In [None]:
agg_data[(agg_data['date_range_start'] == '2020-01-01T00:00:00-05:00') & (agg_data['date_range_end'] == '2020-01-02T00:00:00-05:00') & (agg_data['county_code'] == 23003)]

Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,days_since_first_case,days_since_shelter,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
477,2020-01-01T00:00:00-05:00,2020-01-02T00:00:00-05:00,23.0,ME,Aroostook County,23003,-93.0,-91.0,0.0,2737,1044,146,245,0.0,582.172086,,,21696.611984


In [None]:
data[(data['date_range_start'] == '2020-01-01T00:00:00-05:00') & (data['date_range_end'] == '2020-01-02T00:00:00-05:00') & (data['county_code'] == 23003)].sum()

Unnamed: 0                                                                   8627543
date_range_start                   2020-01-01T00:00:00-05:002020-01-01T00:00:00-0...
date_range_end                     2020-01-02T00:00:00-05:002020-01-02T00:00:00-0...
state                                                                           1794
state_code                         MEMEMEMEMEMEMEMEMEMEMEMEMEMEMEMEMEMEMEMEMEMEME...
cnamelong                          Aroostook CountyAroostook CountyAroostook Coun...
county_code                                                                  1794234
origin_census_block_group                                             17943082149176
candidate_device_count                                                             0
device_count                                                                    2737
completely_home_device_count                                                    1044
part_time_work_behavior_devices                                  

The `.sum()` output does not let us check median_home_dwell_time directly, but the summed columns in agg_data match the original data (for example, device_count is 2737 and completely_home_device_count is 1044 in both), which suggests the aggregation is working as intended.
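One way to spot-check a weighted column as well is to recompute it for a single county from its raw rows with `np.average` and compare that with the aggregated row. The sketch below does this on a small hypothetical stand-in (`raw` and `agg`) rather than on the real `data` and `agg_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the block-group level data.
raw = pd.DataFrame({
    "date_range_start": ["2020-01-01"] * 3,
    "county_code": [23003, 23003, 23029],
    "device_count": [10, 30, 5],
    "median_home_dwell_time": [400.0, 600.0, 280.0],
})

def weighted_mean(values):
    return np.average(values, weights=raw.loc[values.index, "device_count"])

agg = raw.groupby(["date_range_start", "county_code"]).agg(
    device_count=("device_count", "sum"),
    median_home_dwell_time=("median_home_dwell_time", weighted_mean),
).reset_index()

# Recompute one county directly from its raw rows.
rows = raw[(raw["date_range_start"] == "2020-01-01") & (raw["county_code"] == 23003)]
direct = np.average(rows["median_home_dwell_time"], weights=rows["device_count"])

# Pull the same county out of the aggregated frame and compare.
aggregated = agg.loc[agg["county_code"] == 23003, "median_home_dwell_time"].iloc[0]
assert np.isclose(direct, aggregated)
```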

With this aggregation we can study the data at the county level rather than at the level of the smallest subgroup. It also reduces the size of the data dramatically, from 20 million rows to 332,517.

Now we just need to export this data into a new data set.

In [None]:
agg_data.to_csv("agg_social_dist")
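A side note on the export: by default `to_csv` also writes the row index, and reading such a file back produces an extra `Unnamed: 0` column like the ones seen in the input data earlier in this notebook. The sketch below shows the difference on a hypothetical two-row frame, writing to in-memory buffers instead of real files. Whether to pass `index=False` here depends on what the downstream notebooks expect, so treat it as an option rather than a required change.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for agg_data.
frame = pd.DataFrame({"county_code": [1015, 1029], "device_count": [632, 300]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```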