In this notebook we will produce a clean dataset from 9GB SafeGraph data from 2020-01-01 to 2020-12-18, for our final analysis of causal inference on whether the government policy changes people's behavior. Now the dataset's rows are by each census block, we will produce a dataset that each row is a county.

In [1]:
import pandas as pd
import numpy as np

We are dealing with mixed types of data in some columns, because some rows are "rogue", as we can see below.

In [2]:
data = pd.read_csv("social_dist_all_trimmed_new_dec2020.csv", nrows = 219243)

In [3]:
data.dtypes

Unnamed: 0                           int64
date_range_start                    object
date_range_end                      object
state                              float64
state_code                          object
cnamelong                           object
county_code                          int64
origin_census_block_group            int64
candidate_device_count               int64
device_count                         int64
completely_home_device_count         int64
part_time_work_behavior_devices      int64
full_time_work_behavior_devices      int64
delivery_behavior_devices            int64
median_home_dwell_time               int64
median_non_home_dwell_time           int64
median_percentage_time_home          int64
distance_traveled_from_home        float64
dtype: object

In [4]:
data = pd.read_csv("social_dist_all_trimmed_new_dec2020.csv", nrows = 219244)

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [5]:
data.dtypes

Unnamed: 0                         float64
date_range_start                    object
date_range_end                      object
state                               object
state_code                          object
cnamelong                           object
county_code                         object
origin_census_block_group           object
candidate_device_count              object
device_count                        object
completely_home_device_count        object
part_time_work_behavior_devices     object
full_time_work_behavior_devices     object
delivery_behavior_devices           object
median_home_dwell_time              object
median_non_home_dwell_time          object
median_percentage_time_home         object
distance_traveled_from_home         object
dtype: object

In [6]:
data.tail()

Unnamed: 0.1,Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
219239,219239.0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,55,WI,Kenosha County,55059,550590020001,118,67,31,1,1,1,799,7,99,5400
219240,219240.0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,55,WI,Langlade County,55067,550679605002,51,28,4,2,1,1,625,441,60,33941
219241,219241.0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,55,WI,Marathon County,55073,550730019007,88,49,12,3,3,2,489,104,76,8443
219242,219242.0,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72,PR,CanÛvanas Municipio,72029,720291001033,11,6,2,1,1,1,198,17,87,11333
219243,,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home


As we can see above, the row that makes us have mixed typed data is the row that is probably the head of another dataframe. (we produced this 9GB data by joining a lot of dataframes of data from each date together) 

To combat this, we need to use `pd.to_numeric()` function, which converts a column to numeric data type, can it will put `NaN` on the places where it cannot be converted. So we will do a `dropna()` after to drop those nonsensical rows.

But before we do that, we need to make sure that there are no null data. Let's check the nulls in our data.

In [7]:
data.isnull().sum()

Unnamed: 0                          1
date_range_start                    0
date_range_end                      0
state                              13
state_code                         13
cnamelong                          13
county_code                         0
origin_census_block_group           0
candidate_device_count              0
device_count                        0
completely_home_device_count        0
part_time_work_behavior_devices     0
full_time_work_behavior_devices     0
delivery_behavior_devices           0
median_home_dwell_time              0
median_non_home_dwell_time          0
median_percentage_time_home         0
distance_traveled_from_home         8
dtype: int64

In [8]:
data[data.isnull().any(axis=1)]

Unnamed: 0.1,Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
2660,2660.0,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72,PR,Morovis Municipio,72101,721019554012,12,5,3,1,1,1,198,0,100,
50695,50695.0,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72,PR,Jayuya Municipio,72073,720739561001,13,6,4,1,1,1,803,0,100,
68818,68818.0,2020-01-01T00:00:00-07:00,2020-01-02T00:00:00-07:00,,,,46102,461029405003,137,51,21,3,2,1,331,38,89,1092
88542,88542.0,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72,PR,San Juan Municipio,72127,721270085005,10,10,5,1,1,1,227,0,100,
101568,101568.0,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72,PR,Sabana Grande Municipio,72121,721219603001,17,9,7,1,1,1,238,0,100,
104347,104347.0,2020-01-01T00:00:00-09:00,2020-01-02T00:00:00-09:00,,,,2158,21580001003,51,24,7,2,1,1,87,13,74,67833
112106,112106.0,2020-01-01T00:00:00-04:00,2020-01-02T00:00:00-04:00,72,PR,San Juan Municipio,72127,721270053002,28,19,17,1,1,1,233,0,100,
118935,118935.0,2020-01-01T00:00:00-07:00,2020-01-02T00:00:00-07:00,,,,46102,461029409002,63,35,13,2,1,5,397,45,88,115299
121911,121911.0,2020-01-01T00:00:00-07:00,2020-01-02T00:00:00-07:00,,,,46102,461029405001,100,47,13,8,2,2,210,206,53,11739
143807,143807.0,2020-01-01T00:00:00-07:00,2020-01-02T00:00:00-07:00,,,,46102,461029409003,71,21,7,1,1,1,101,8,74,32684


We see that some rows have `NaN` values though. We can either keep them or neglect them. If we want to keep them, then we can't just drop any rows with `NaN` values, after we did `pd.to_numeric()`. 

we can drop the rows with `NaN` in `Unnamed:0` though. Pretty sure that is the only place when the "rogue" row has `NaN`. As shown above.

In [9]:
data = pd.read_csv("social_dist_all_trimmed_new_dec2020.csv", nrows = 1000000)

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [10]:
data

Unnamed: 0.1,Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
0,0.0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1,AL,Colbert County,1033,10330210004,138,93,20,8,2,3,711,95,79,13128
1,1.0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1,AL,Jefferson County,1073,10730049022,351,157,40,8,5,1,6,49,6,12147
2,2.0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1,AL,Talladega County,1121,11210118001,199,96,38,6,2,1,505,32,90,11180
3,3.0,2020-01-01T00:00:00-06:00,2020-01-02T00:00:00-06:00,1,AL,Tuscaloosa County,1125,11250106021,760,486,148,50,22,7,637,87,81,13867
4,4.0,2020-01-01T00:00:00-09:00,2020-01-02T00:00:00-09:00,2,AK,Northwest Arctic Borough,2188,21880002003,20,10,4,1,1,1,470,0,100,2.70813e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,122592.0,2020-01-05T00:00:00-05:00,2020-01-06T00:00:00-05:00,37,NC,Wilson County,37195,371950011002,97,75,19,2,1,2,944,41,95,18107
999996,122593.0,2020-01-05T00:00:00-05:00,2020-01-06T00:00:00-05:00,39,OH,Butler County,39017,390170002003,83,54,21,1,1,1,1027,22,98,3809
999997,122594.0,2020-01-05T00:00:00-05:00,2020-01-06T00:00:00-05:00,39,OH,Cuyahoga County,39035,390351361032,165,122,36,5,1,1,1016,98,90,9615
999998,122595.0,2020-01-05T00:00:00-05:00,2020-01-06T00:00:00-05:00,39,OH,Cuyahoga County,39035,390351781021,225,129,45,4,1,3,997,45,96,5491


In [11]:
data.dtypes

Unnamed: 0                         float64
date_range_start                    object
date_range_end                      object
state                               object
state_code                          object
cnamelong                           object
county_code                         object
origin_census_block_group           object
candidate_device_count              object
device_count                        object
completely_home_device_count        object
part_time_work_behavior_devices     object
full_time_work_behavior_devices     object
delivery_behavior_devices           object
median_home_dwell_time              object
median_non_home_dwell_time          object
median_percentage_time_home         object
distance_traveled_from_home         object
dtype: object

In [12]:
data[data['Unnamed: 0'].isnull()]

Unnamed: 0.1,Unnamed: 0,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
219243,,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
438634,,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
658036,,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home
877402,,date_range_start,date_range_end,state,state_code,cnamelong,county_code,origin_census_block_group,candidate_device_count,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,median_percentage_time_home,distance_traveled_from_home


This verifies our hypothesis. So we can drop rows with `NaN` in `Unnamed: 0` column to rid of bad rows.

Let's try our aggregation process on some small data

In [14]:
data = data.dropna(subset = ['Unnamed: 0'])

In [15]:
data = data.drop(['Unnamed: 0'], axis = 1)

In [16]:
# distance variable contains extreme outliers, and other variables
# we are deleting we won't be using for our analysis.
# we are using date_range_start to represent the date of a row,
# do we are deleting date_range_end because it is redundant.
data = data.drop(['distance_traveled_from_home', 'candidate_device_count',
                 'origin_census_block_group', 'median_percentage_time_home',
                 'date_range_end'], axis = 1)

In [17]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time
0,2020-01-01T00:00:00-06:00,1,AL,Colbert County,1033,93,20,8,2,3,711,95
1,2020-01-01T00:00:00-06:00,1,AL,Jefferson County,1073,157,40,8,5,1,6,49
2,2020-01-01T00:00:00-06:00,1,AL,Talladega County,1121,96,38,6,2,1,505,32
3,2020-01-01T00:00:00-06:00,1,AL,Tuscaloosa County,1125,486,148,50,22,7,637,87
4,2020-01-01T00:00:00-09:00,2,AK,Northwest Arctic Borough,2188,10,4,1,1,1,470,0


In [18]:
data.dtypes

date_range_start                   object
state                              object
state_code                         object
cnamelong                          object
county_code                        object
device_count                       object
completely_home_device_count       object
part_time_work_behavior_devices    object
full_time_work_behavior_devices    object
delivery_behavior_devices          object
median_home_dwell_time             object
median_non_home_dwell_time         object
dtype: object

As we can see, although we deleted the "rogue" rows, data types are still all objects. We need to convert numerical rows to numeric type using pandas' `to_numeric()` function

In [19]:
# convert numerical columns to numerical datatypes - now they 
data['state'] = pd.to_numeric(data['state'], errors='coerce')
data['county_code'] = pd.to_numeric(data['county_code'], errors='coerce')
data['device_count'] = pd.to_numeric(data['device_count'], errors='coerce')
data['completely_home_device_count'] = pd.to_numeric(data['completely_home_device_count'], errors='coerce')
data['part_time_work_behavior_devices'] = pd.to_numeric(data['full_time_work_behavior_devices'], errors='coerce')
data['full_time_work_behavior_devices'] = pd.to_numeric(data['completely_home_device_count'], errors='coerce')
data['delivery_behavior_devices'] = pd.to_numeric(data['delivery_behavior_devices'], errors='coerce')
data['median_home_dwell_time'] = pd.to_numeric(data['median_home_dwell_time'], errors='coerce')
data['median_non_home_dwell_time'] = pd.to_numeric(data['median_non_home_dwell_time'], errors='coerce')

In [20]:
data.dtypes

date_range_start                    object
state                              float64
state_code                          object
cnamelong                           object
county_code                          int64
device_count                         int64
completely_home_device_count         int64
part_time_work_behavior_devices      int64
full_time_work_behavior_devices      int64
delivery_behavior_devices            int64
median_home_dwell_time               int64
median_non_home_dwell_time           int64
dtype: object

In [25]:
data.isnull().sum()

date_range_start                    0
state                              53
state_code                         53
cnamelong                          53
county_code                         0
device_count                        0
completely_home_device_count        0
part_time_work_behavior_devices     0
full_time_work_behavior_devices     0
delivery_behavior_devices           0
median_home_dwell_time              0
median_non_home_dwell_time          0
dtype: int64

In [26]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time
0,2020-01-01T00:00:00-06:00,1.0,AL,Colbert County,1033,93,20,2,20,3,711,95
1,2020-01-01T00:00:00-06:00,1.0,AL,Jefferson County,1073,157,40,5,40,1,6,49
2,2020-01-01T00:00:00-06:00,1.0,AL,Talladega County,1121,96,38,2,38,1,505,32
3,2020-01-01T00:00:00-06:00,1.0,AL,Tuscaloosa County,1125,486,148,22,148,7,637,87
4,2020-01-01T00:00:00-09:00,2.0,AK,Northwest Arctic Borough,2188,10,4,1,4,1,470,0


In [29]:
# parse date_string into date object in python.
from datetime import datetime
date_parser = lambda date_string: datetime.strptime(date_string[:10], "%Y-%m-%d")

In [30]:
data['date_range_start'] = data['date_range_start'].apply(date_parser)

In [31]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time
0,2020-01-01,1.0,AL,Colbert County,1033,93,20,2,20,3,711,95
1,2020-01-01,1.0,AL,Jefferson County,1073,157,40,5,40,1,6,49
2,2020-01-01,1.0,AL,Talladega County,1121,96,38,2,38,1,505,32
3,2020-01-01,1.0,AL,Tuscaloosa County,1125,486,148,22,148,7,637,87
4,2020-01-01,2.0,AK,Northwest Arctic Borough,2188,10,4,1,4,1,470,0


Now it is time to groupby and aggregate on this small dataset.

The way to aggregate median_home_dwell_time is to first calculate the total time, and sum up the total time of each county, and in the end divide by number of devices in that county. We don't need to worry about integer overflow here because python has infinite range for integers.

In [32]:
data['total_home_dwell_time'] = data['median_home_dwell_time'] * data['device_count']
data['total_non_home_dwell_time'] = data['median_non_home_dwell_time'] * data['device_count']

In [33]:
data.head()

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,median_home_dwell_time,median_non_home_dwell_time,total_home_dwell_time,total_non_home_dwell_time
0,2020-01-01,1.0,AL,Colbert County,1033,93,20,2,20,3,711,95,66123,8835
1,2020-01-01,1.0,AL,Jefferson County,1073,157,40,5,40,1,6,49,942,7693
2,2020-01-01,1.0,AL,Talladega County,1121,96,38,2,38,1,505,32,48480,3072
3,2020-01-01,1.0,AL,Tuscaloosa County,1125,486,148,22,148,7,637,87,309582,42282
4,2020-01-01,2.0,AK,Northwest Arctic Borough,2188,10,4,1,4,1,470,0,4700,0


In [37]:
data = data.groupby(['date_range_start', 'state', 'state_code',
                     'cnamelong', 'county_code']).agg(
    device_count = ('device_count', 'sum'),
    completely_home_device_count = ('completely_home_device_count', 'sum'),
    part_time_work_behavior_devices = ('part_time_work_behavior_devices', 'sum'),
    full_time_work_behavior_devices = ('full_time_work_behavior_devices', 'sum'),
    delivery_behavior_devices = ('delivery_behavior_devices', 'sum'),
    total_home_dwell_time = ('total_home_dwell_time', 'sum'),
    total_non_home_dwell_time = ('total_non_home_dwell_time', 'sum')
).reset_index()

In [39]:
data

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,total_home_dwell_time,total_non_home_dwell_time
0,2020-01-01,1.0,AL,Autauga County,1001,5501,1578,155,1578,105,5115697,491120
1,2020-01-01,1.0,AL,Baldwin County,1003,20761,5997,569,5997,291,17810114,1968796
2,2020-01-01,1.0,AL,Barbour County,1005,1660,452,51,452,37,1308738,180423
3,2020-01-01,1.0,AL,Bibb County,1007,2040,560,47,560,40,1864093,206490
4,2020-01-01,1.0,AL,Blount County,1009,6206,1779,149,1779,84,5933303,606736
...,...,...,...,...,...,...,...,...,...,...,...,...
16076,2020-01-05,72.0,PR,Yabucoa Municipio,72151,310,85,19,85,21,193082,28100
16077,2020-01-05,72.0,PR,Yauco Municipio,72153,629,202,22,202,22,506879,45891
16078,2020-01-05,78.0,VI,St. Croix Island,78010,637,170,29,170,30,395597,72950
16079,2020-01-05,78.0,VI,St. John Island,78020,36,9,2,9,2,12236,6484


Seems like it is working to me! Now we need to compare these results with the results we get from stream processing (remember that we have 9 GB of data, which can't fit into my RAM, so we have to process data a chunk at a time.) Let's do stream processing on the same first million data to see if that generates the same result as above.

In [69]:
aggregatedData = pd.DataFrame(columns = data.columns)

In [70]:
# given dataframe, groupby by the counties and date and aggregate
# other variable.
def aggregateData(dataframe):
    return dataframe.groupby(['date_range_start', 'state', 'state_code',
                              'cnamelong', 'county_code']).agg(
        device_count = ('device_count', 'sum'),
        completely_home_device_count = ('completely_home_device_count', 'sum'),
        part_time_work_behavior_devices = ('part_time_work_behavior_devices', 'sum'),
        full_time_work_behavior_devices = ('full_time_work_behavior_devices', 'sum'),
        delivery_behavior_devices = ('delivery_behavior_devices', 'sum'),
        total_home_dwell_time = ('total_home_dwell_time', 'sum'),
        total_non_home_dwell_time = ('total_non_home_dwell_time', 'sum')
    ).reset_index()

In [71]:
# given a part of dataframe, return a dataframe that is processed
# using the procedure like the one above.
def processChunk(chunk):
    
    def convertColumnsToNumerical(dataframe):
        dataframe['state'] = pd.to_numeric(dataframe['state'], errors='coerce')
        dataframe['county_code'] = pd.to_numeric(dataframe['county_code'], errors='coerce')
        dataframe['device_count'] = pd.to_numeric(dataframe['device_count'], errors='coerce')
        dataframe['completely_home_device_count'] = pd.to_numeric(dataframe['completely_home_device_count'], errors='coerce')
        dataframe['part_time_work_behavior_devices'] = pd.to_numeric(dataframe['full_time_work_behavior_devices'], errors='coerce')
        dataframe['full_time_work_behavior_devices'] = pd.to_numeric(dataframe['completely_home_device_count'], errors='coerce')
        dataframe['delivery_behavior_devices'] = pd.to_numeric(dataframe['delivery_behavior_devices'], errors='coerce')
        dataframe['median_home_dwell_time'] = pd.to_numeric(dataframe['median_home_dwell_time'], errors='coerce')
        dataframe['median_non_home_dwell_time'] = pd.to_numeric(dataframe['median_non_home_dwell_time'], errors='coerce')
        return dataframe
    
    chunk = chunk.dropna(subset = ['Unnamed: 0']) # deal with rogue rows.
    chunk = chunk.drop(['Unnamed: 0'], axis = 1)
    chunk = chunk.drop(['distance_traveled_from_home', 'candidate_device_count',
                        'origin_census_block_group', 'median_percentage_time_home',
                        'date_range_end'], axis = 1)
    
    chunk = convertColumnsToNumerical(chunk)
    
    chunk['date_range_start'] = chunk['date_range_start'].apply(date_parser)
    
    chunk['total_home_dwell_time'] = chunk['median_home_dwell_time'] * chunk['device_count']
    chunk['total_non_home_dwell_time'] = chunk['median_non_home_dwell_time'] * chunk['device_count']
    
    chunk = aggregateData(chunk)
    
    return chunk

In [72]:
counter = 0
for chunk in pd.read_csv("social_dist_all_trimmed_new_dec2020.csv", chunksize = 10000, nrows = 1000000):
    chunk = processChunk(chunk)
    
    # groupby and aggregate data of the same county of the same date.
    aggregatedData = aggregatedData.append(chunk, ignore_index = True)
    aggregatedData = aggregateData(aggregatedData)
    
    print(counter) # track progress
    print(aggregatedData.shape[0]) # how large is our data right now
    counter += 1

0
1970
1
2489
2
2722
3
2870
4
2966
5
3019
6
3066
7
3099
8
3120
9
3133
10
3157
11
3172
12
3185
13
3191
14
3197
15
3207
16
3213
17
3216
18
3223
19
3224
20
3226
21
3696
22
5237
23
5726
24
5965
25
6105
26
6198
27
6250
28
6297
29
6325
30
6347
31
6367
32
6390
33
6406
34
6416
35
6419
36
6422
37
6427
38
6435
39
6440
40
6443
41
6447
42
6451
43
7139
44
8512
45
8960
46
9195
47
9335
48
9423
49
9489
50
9528
51
9554
52
9572
53
9593
54
9612
55
9627
56
9638
57
9648
58
9650
59
9658
60
9666
61
9667
62
9673
63
9676
64
9678
65
10536
66
11761
67
12209
68
12425
69
12562
70
12652
71
12710
72
12743
73
12781
74
12807
75
12824
76
12839
77
12857
78
12861
79
12874
80
12881
81
12887
82
12890
83
12896
84
12898
85
12901
86
12903
87
13923
88
15031
89
15432
90
15664
91
15800
92
15889
93
15935
94
15973
95
16011
96
16028
97
16046
98
16063
99
16081


In [73]:
aggregatedData

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,total_home_dwell_time,total_non_home_dwell_time
0,2020-01-01,1.0,AL,Autauga County,1001,5501,1578,155,1578,105,5115697,491120
1,2020-01-01,1.0,AL,Baldwin County,1003,20761,5997,569,5997,291,17810114,1968796
2,2020-01-01,1.0,AL,Barbour County,1005,1660,452,51,452,37,1308738,180423
3,2020-01-01,1.0,AL,Bibb County,1007,2040,560,47,560,40,1864093,206490
4,2020-01-01,1.0,AL,Blount County,1009,6206,1779,149,1779,84,5933303,606736
...,...,...,...,...,...,...,...,...,...,...,...,...
16076,2020-01-05,72.0,PR,Yabucoa Municipio,72151,310,85,19,85,21,193082,28100
16077,2020-01-05,72.0,PR,Yauco Municipio,72153,629,202,22,202,22,506879,45891
16078,2020-01-05,78.0,VI,St. Croix Island,78010,637,170,29,170,30,395597,72950
16079,2020-01-05,78.0,VI,St. John Island,78020,36,9,2,9,2,12236,6484


In [74]:
data

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,total_home_dwell_time,total_non_home_dwell_time
0,2020-01-01,1.0,AL,Autauga County,1001,5501,1578,155,1578,105,5115697,491120
1,2020-01-01,1.0,AL,Baldwin County,1003,20761,5997,569,5997,291,17810114,1968796
2,2020-01-01,1.0,AL,Barbour County,1005,1660,452,51,452,37,1308738,180423
3,2020-01-01,1.0,AL,Bibb County,1007,2040,560,47,560,40,1864093,206490
4,2020-01-01,1.0,AL,Blount County,1009,6206,1779,149,1779,84,5933303,606736
...,...,...,...,...,...,...,...,...,...,...,...,...
16076,2020-01-05,72.0,PR,Yabucoa Municipio,72151,310,85,19,85,21,193082,28100
16077,2020-01-05,72.0,PR,Yauco Municipio,72153,629,202,22,202,22,506879,45891
16078,2020-01-05,78.0,VI,St. Croix Island,78010,637,170,29,170,30,395597,72950
16079,2020-01-05,78.0,VI,St. John Island,78020,36,9,2,9,2,12236,6484


As we can see, our stream processing function gives the same result as just doing a global group by by comparing `data (global process)` and `aggregatedData (stream process)`. Now we can be sure our algorithm is correct and use it to process the 9GB data.

In [75]:
data.isnull().sum()

date_range_start                   0
state                              0
state_code                         0
cnamelong                          0
county_code                        0
device_count                       0
completely_home_device_count       0
part_time_work_behavior_devices    0
full_time_work_behavior_devices    0
delivery_behavior_devices          0
total_home_dwell_time              0
total_non_home_dwell_time          0
dtype: int64

As we can note, there are no null data now, this is because `groupby` function in pandas ignores `NaN` values. This gets rid of a couple of census_blocks' data, but hope it is not a big deal.

Stream process 9GB of data!!!!

In [79]:
aggregatedData = pd.DataFrame(columns = data.columns)

In [80]:
counter = 0
for chunk in pd.read_csv("social_dist_all_trimmed_new_dec2020.csv", chunksize = 1000000):
    chunk = processChunk(chunk)
    
    # groupby and aggregate data of the same county of the same date.
    aggregatedData = aggregatedData.append(chunk, ignore_index = True)
    aggregatedData = aggregateData(aggregatedData)
    
    print(counter) # track progress
    print(aggregatedData.shape[0]) # how large is our data right now
    counter += 1

0
16081
1
31670
2
45135
3
61041
4
74186
5
90208
6
103225
7
119288
8
134078
9
148317
10
164030
11
177332
12
193291
13
206369
14
222408
15
235409
16
251482
17
266618
18
280538
19
296325
20
309558
21
325519
22
338575
23
354604
24
367582
25
383642
26
399253
27
412693
28
428579
29
441710
30
457730
31
470742
32
486786
33
501691
34
515835
35
531584
36
544863
37
560818
38
573875
39
589901
40
602897
41
618965
42
634466
43
648002
44
663895
45
677052
46
693075
47
706099
48
722164
49
737107
50
751234
51
767021
52
780287
53
796263
54
809324
55
825356
56
838344
57
854403
58
869884
59
883427
60
899302
61
912449
62
928454
63
941465
64
957513
65
971974
66
986536
67
1002208
68
1015556
69
1031470
70
1044563
71
1060580
72
1073569
73
1089618
74
1104796
75
1118642
76
1131564


In [83]:
aggregatedData

Unnamed: 0,date_range_start,state,state_code,cnamelong,county_code,device_count,completely_home_device_count,part_time_work_behavior_devices,full_time_work_behavior_devices,delivery_behavior_devices,total_home_dwell_time,total_non_home_dwell_time
0,2020-01-01,1.0,AL,Autauga County,1001,5501,1578,155,1578,105,5115697,491120
1,2020-01-01,1.0,AL,Baldwin County,1003,20761,5997,569,5997,291,17810114,1968796
2,2020-01-01,1.0,AL,Barbour County,1005,1660,452,51,452,37,1308738,180423
3,2020-01-01,1.0,AL,Bibb County,1007,2040,560,47,560,40,1864093,206490
4,2020-01-01,1.0,AL,Blount County,1009,6206,1779,149,1779,84,5933303,606736
...,...,...,...,...,...,...,...,...,...,...,...,...
1131559,2020-12-16,72.0,PR,Yabucoa Municipio,72151,426,176,25,176,25,183193,6955
1131560,2020-12-16,72.0,PR,Yauco Municipio,72153,469,217,35,217,33,311804,5998
1131561,2020-12-16,78.0,VI,St. Croix Island,78010,825,285,49,285,47,286467,28768
1131562,2020-12-16,78.0,VI,St. John Island,78020,105,27,12,27,12,41523,6915


In [84]:
aggregatedData.to_csv('safegraph_data_by_county.csv', index = False)