<a href="https://colab.research.google.com/github/wenjunsun/Covid-19-analysis-with-uw-ubicomp/blob/master/2020-11/prepare_google_mobility_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we prepare Google's Mobility dataset for propensity score matching. The dataset comes from [this website](https://www.google.com/covid19/mobility/). We verified that its data is very similar to our dataset, although the two use different metrics for people's length of stay in retail stores and groceries. Repeating the analysis we ran on SafeGraph with this dataset is a worthwhile way to validate our results.

# 0. go to directory containing data + load packages

In [None]:
cd drive/My\ Drive/covid/PSM/data

/content/drive/My Drive/covid/PSM/data


In [None]:
ls

agg_social_dist_2.csv
agg_social_dist.csv
county_data_with_reduced_covariates_with_SIP.csv
google_mobility.csv
social_dist_aggregated_on_county.csv


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. load + clean data

In [None]:
# parse the 'date' column as datetime; infer_datetime_format is
# deprecated in pandas 2.x and no longer needed
data = pd.read_csv('google_mobility.csv', parse_dates=['date'])

In [None]:
data.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,US,United States,,,,,,2020-02-15,6.0,2.0,15.0,3.0,2.0,-1.0
1,US,United States,,,,,,2020-02-16,7.0,1.0,16.0,2.0,0.0,-1.0
2,US,United States,,,,,,2020-02-17,6.0,0.0,28.0,-9.0,-24.0,5.0
3,US,United States,,,,,,2020-02-18,0.0,-1.0,6.0,1.0,0.0,1.0
4,US,United States,,,,,,2020-02-19,2.0,0.0,8.0,1.0,1.0,0.0


In [None]:
data.dtypes

country_region_code                                           object
country_region                                                object
sub_region_1                                                  object
sub_region_2                                                  object
metro_area                                                   float64
iso_3166_2_code                                               object
census_fips_code                                             float64
date                                                  datetime64[ns]
retail_and_recreation_percent_change_from_baseline           float64
grocery_and_pharmacy_percent_change_from_baseline            float64
parks_percent_change_from_baseline                           float64
transit_stations_percent_change_from_baseline                float64
workplaces_percent_change_from_baseline                      float64
residential_percent_change_from_baseline                     float64
dtype: object

This dataset already contains each county's behavioral data for every date from mid-February to the current date. We want the average of each county's behavioral data from February to `6-1-2020`, as we did in the SafeGraph analysis.

Some rows have a null `census_fips_code` (for example, the country-level rows shown above), so we count the unique codes and then drop those rows.

In [None]:
# note: unique() counts NaN as a value, so this total includes one entry for the null codes
num_counties = len(data['census_fips_code'].unique())
print(f'there are {num_counties} unique counties within Google mobility dataset')

there are 2834 unique counties within Google mobility dataset
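
One caveat with this count: `Series.unique()` treats `NaN` as a distinct value, so while null FIPS codes are still present the total includes one extra entry. A minimal sketch on made-up toy values (not the real dataset) shows the difference from `nunique()`, which skips `NaN` by default:

```python
import numpy as np
import pandas as pd

# toy FIPS column (hypothetical values) with missing entries, like the
# non-county rows in the real file
fips = pd.Series([1001.0, 1001.0, 1003.0, np.nan, np.nan])

n_with_nan = len(fips.unique())   # NaN counts as one distinct value
n_without_nan = fips.nunique()    # NaN is excluded by default

print(n_with_nan, n_without_nan)
```

Counting after the null rows are dropped, or using `nunique()`, would give the number of actual counties.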


In [None]:
data.dropna(subset=['census_fips_code'], inplace = True)

In [None]:
data['census_fips_code'].isnull().sum()

0

In [None]:
# drop unnecessary columns
data.drop(['country_region_code', 'country_region', 'sub_region_1',
           'sub_region_2', 'metro_area', 'iso_3166_2_code'], axis = 1, inplace = True)

In [None]:
data.head()

Unnamed: 0,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
532,1001.0,2020-02-15,5.0,7.0,,,-4.0,
533,1001.0,2020-02-16,0.0,1.0,-23.0,,-4.0,
534,1001.0,2020-02-17,8.0,0.0,,,-27.0,5.0
535,1001.0,2020-02-18,-2.0,0.0,,,2.0,0.0
536,1001.0,2020-02-19,-2.0,0.0,,,2.0,0.0


# 2. filter out rows with undesired dates

Now we delete all rows dated after `6-1-2020`.

In [None]:
# Given a dataframe, the name of its date column, and a cutoff date
# (default 6-1-2020), return a new dataframe without any rows dated
# after the cutoff. Rows on the cutoff date itself are kept.
def filterDatesAfter(date = np.datetime64('2020-06-01'), dataset = data, dateColumn = 'date'):

  # boolean mask that is True for rows strictly after the cutoff;
  # uses dateColumn rather than a hard-coded 'date'
  isAfterDate = dataset[dateColumn] > date

  return dataset[~isAfterDate]

In [None]:
data = filterDatesAfter()

In [None]:
data['date'].unique()

array(['2020-02-15T00:00:00.000000000', '2020-02-16T00:00:00.000000000',
       '2020-02-17T00:00:00.000000000', '2020-02-18T00:00:00.000000000',
       '2020-02-19T00:00:00.000000000', '2020-02-20T00:00:00.000000000',
       '2020-02-21T00:00:00.000000000', '2020-02-22T00:00:00.000000000',
       '2020-02-23T00:00:00.000000000', '2020-02-24T00:00:00.000000000',
       '2020-02-25T00:00:00.000000000', '2020-02-26T00:00:00.000000000',
       '2020-02-27T00:00:00.000000000', '2020-02-28T00:00:00.000000000',
       '2020-02-29T00:00:00.000000000', '2020-03-01T00:00:00.000000000',
       '2020-03-02T00:00:00.000000000', '2020-03-03T00:00:00.000000000',
       '2020-03-04T00:00:00.000000000', '2020-03-05T00:00:00.000000000',
       '2020-03-06T00:00:00.000000000', '2020-03-07T00:00:00.000000000',
       '2020-03-08T00:00:00.000000000', '2020-03-09T00:00:00.000000000',
       '2020-03-10T00:00:00.000000000', '2020-03-11T00:00:00.000000000',
       '2020-03-12T00:00:00.000000000', '2020-03-13
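
A quicker way to confirm the cutoff than reading the whole array is to compare the minimum and maximum dates. The sketch below uses a small made-up frame (hypothetical values, not the real data) and a boolean mask. Note that the cutoff date itself is kept, since only rows strictly after it are dropped:

```python
import numpy as np
import pandas as pd

cutoff = np.datetime64('2020-06-01')

# toy frame with dates on both sides of the cutoff (hypothetical values)
toy = pd.DataFrame({
    'census_fips_code': [1001.0, 1001.0, 1001.0, 1003.0],
    'date': pd.to_datetime(['2020-05-31', '2020-06-01',
                            '2020-06-02', '2020-07-15']),
})

# keep every row that is not strictly after the cutoff
filtered = toy[~(toy['date'] > cutoff)]

# the latest remaining date should be no later than the cutoff
print(filtered['date'].min(), filtered['date'].max())
```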

# 3. Average behavioral data for each county from Feb to June

In [None]:
data.head()

Unnamed: 0,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
532,1001.0,2020-02-15,5.0,7.0,,,-4.0,
533,1001.0,2020-02-16,0.0,1.0,-23.0,,-4.0,
534,1001.0,2020-02-17,8.0,0.0,,,-27.0,5.0
535,1001.0,2020-02-18,-2.0,0.0,,,2.0,0.0
536,1001.0,2020-02-19,-2.0,0.0,,,2.0,0.0


In [None]:
# reassign instead of dropping in place, which avoids pandas'
# SettingWithCopyWarning on the filtered dataframe
data = data.drop(['date'], axis = 1)


In [None]:
data.head()

Unnamed: 0,census_fips_code,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
532,1001.0,5.0,7.0,,,-4.0,
533,1001.0,0.0,1.0,-23.0,,-4.0,
534,1001.0,8.0,0.0,,,-27.0,5.0
535,1001.0,-2.0,0.0,,,2.0,0.0
536,1001.0,-2.0,0.0,,,2.0,0.0


In [None]:
data = data.groupby(['census_fips_code']).agg(
    avg_retail_and_recreation_percent_change_6_1 = ('retail_and_recreation_percent_change_from_baseline', 'mean'),
    avg_grocery_and_pharmacy_percent_change_6_1 = ('grocery_and_pharmacy_percent_change_from_baseline', 'mean'),
    avg_parks_percent_change_6_1 = ('parks_percent_change_from_baseline', 'mean'),
    avg_transit_stations_percent_change_6_1 = ('transit_stations_percent_change_from_baseline', 'mean'),
    avg_workplaces_percent_change_6_1 = ('workplaces_percent_change_from_baseline', 'mean'),
    avg_residential_percent_change_6_1 = ('residential_percent_change_from_baseline', 'mean')
).reset_index()
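
Because `mean` skips missing values, each county average can rest on a different number of days, and an average is `NaN` only when a county has no observations at all for that metric. One way to keep track of this, sketched here on made-up toy values with hypothetical output column names (`avg_parks`, `n_days_parks`), is to aggregate a non-null count next to the mean:

```python
import numpy as np
import pandas as pd

# toy daily data (hypothetical values): county 1001 has one observed day,
# county 1003 has none
toy = pd.DataFrame({
    'census_fips_code': [1001.0, 1001.0, 1001.0, 1003.0, 1003.0],
    'parks_percent_change_from_baseline': [np.nan, -23.0, np.nan, np.nan, np.nan],
})

summary = toy.groupby('census_fips_code').agg(
    avg_parks=('parks_percent_change_from_baseline', 'mean'),      # NaN days are skipped
    n_days_parks=('parks_percent_change_from_baseline', 'count'),  # non-null days only
).reset_index()

print(summary)
```

A low count flags averages that may be too noisy to rely on in the later analysis.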

In [None]:
data

Unnamed: 0,census_fips_code,avg_retail_and_recreation_percent_change_6_1,avg_grocery_and_pharmacy_percent_change_6_1,avg_parks_percent_change_6_1,avg_transit_stations_percent_change_6_1,avg_workplaces_percent_change_6_1,avg_residential_percent_change_6_1
0,1001.0,-9.240741,9.111111,-34.000000,,-22.481481,9.480000
1,1003.0,-12.944444,6.120370,23.500000,-6.361111,-19.777778,6.564815
2,1005.0,2.608696,-6.203704,,,-15.333333,
3,1007.0,0.823529,5.571429,,,-18.175926,
4,1009.0,-6.372093,6.118812,,,-21.111111,8.720000
...,...,...,...,...,...,...,...
2827,56037.0,-11.537037,7.305882,,13.888889,-20.231481,7.120000
2828,56039.0,-39.369048,-17.661017,-24.018868,-38.691358,-37.574074,
2829,56041.0,-1.125000,14.352941,,12.537037,-16.435185,
2830,56043.0,-2.767442,-12.000000,,,-20.960526,


In [None]:
# now we are done! save the data!
data.to_csv('avg_google_mobility_up_to_6_1.csv', index = False)

# 4. Combine Google Behavioral data with our covariates.

In [54]:
google_data = data

In [56]:
safe_graph_data = pd.read_csv('county_data_with_reduced_covariates_with_SIP.csv')

In [57]:
google_data

Unnamed: 0,census_fips_code,avg_retail_and_recreation_percent_change_6_1,avg_grocery_and_pharmacy_percent_change_6_1,avg_parks_percent_change_6_1,avg_transit_stations_percent_change_6_1,avg_workplaces_percent_change_6_1,avg_residential_percent_change_6_1
0,1001.0,-9.240741,9.111111,-34.000000,,-22.481481,9.480000
1,1003.0,-12.944444,6.120370,23.500000,-6.361111,-19.777778,6.564815
2,1005.0,2.608696,-6.203704,,,-15.333333,
3,1007.0,0.823529,5.571429,,,-18.175926,
4,1009.0,-6.372093,6.118812,,,-21.111111,8.720000
...,...,...,...,...,...,...,...
2827,56037.0,-11.537037,7.305882,,13.888889,-20.231481,7.120000
2828,56039.0,-39.369048,-17.661017,-24.018868,-38.691358,-37.574074,
2829,56041.0,-1.125000,14.352941,,12.537037,-16.435185,
2830,56043.0,-2.767442,-12.000000,,,-20.960526,


In [58]:
safe_graph_data

Unnamed: 0,state,state_code,State Name,cnamelong,county_code,diff_in_perc_at_home,SIP?,Median Household Income,% Rural,Population_y,political_diff,% less than 18 years of age,% 65 and over,% Asian,% Black,% Hispanic,% Non-Hispanic White
0,1.0,AL,Alabama,Autauga County,1001.0,0.050678,1,59338.0,42.0,55601,-0.494789,23.7,15.6,1.2,19.3,3.0,74.3
1,1.0,AL,Alabama,Baldwin County,1003.0,0.050312,1,57588.0,42.3,218022,-0.577862,21.6,20.4,1.2,8.8,4.6,83.1
2,1.0,AL,Alabama,Barbour County,1005.0,0.007037,1,34382.0,67.8,24881,-0.056112,20.9,19.4,0.5,48.0,4.3,45.6
3,1.0,AL,Alabama,Bibb County,1007.0,0.011809,1,46064.0,68.4,22400,-0.555441,20.5,16.5,0.2,21.1,2.6,74.6
4,1.0,AL,Alabama,Blount County,1009.0,0.038890,1,50412.0,90.0,57840,-0.813820,23.2,18.2,0.3,1.5,9.6,86.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2991,56.0,WY,Wyoming,Sweetwater County,56037.0,-0.018485,0,73315.0,10.9,43051,-0.535382,26.2,12.1,1.0,1.1,16.1,79.3
2992,56.0,WY,Wyoming,Teton County,56039.0,0.075183,1,99087.0,46.4,23081,0.278663,18.4,15.4,1.4,0.6,14.9,81.5
2993,56.0,WY,Wyoming,Uinta County,56041.0,0.010157,0,63401.0,43.1,20299,-0.614926,28.8,14.1,0.5,0.7,9.2,87.4
2994,56.0,WY,Wyoming,Washakie County,56043.0,-0.007825,0,55190.0,36.0,7885,-0.640377,22.7,21.7,0.8,0.5,14.1,82.4


In [59]:
merged_data = safe_graph_data.merge(google_data, left_on = 'county_code', right_on = 'census_fips_code')

In [60]:
merged_data

Unnamed: 0,state,state_code,State Name,cnamelong,county_code,diff_in_perc_at_home,SIP?,Median Household Income,% Rural,Population_y,political_diff,% less than 18 years of age,% 65 and over,% Asian,% Black,% Hispanic,% Non-Hispanic White,census_fips_code,avg_retail_and_recreation_percent_change_6_1,avg_grocery_and_pharmacy_percent_change_6_1,avg_parks_percent_change_6_1,avg_transit_stations_percent_change_6_1,avg_workplaces_percent_change_6_1,avg_residential_percent_change_6_1
0,1.0,AL,Alabama,Autauga County,1001.0,0.050678,1,59338.0,42.0,55601,-0.494789,23.7,15.6,1.2,19.3,3.0,74.3,1001.0,-9.240741,9.111111,-34.000000,,-22.481481,9.480000
1,1.0,AL,Alabama,Baldwin County,1003.0,0.050312,1,57588.0,42.3,218022,-0.577862,21.6,20.4,1.2,8.8,4.6,83.1,1003.0,-12.944444,6.120370,23.500000,-6.361111,-19.777778,6.564815
2,1.0,AL,Alabama,Barbour County,1005.0,0.007037,1,34382.0,67.8,24881,-0.056112,20.9,19.4,0.5,48.0,4.3,45.6,1005.0,2.608696,-6.203704,,,-15.333333,
3,1.0,AL,Alabama,Bibb County,1007.0,0.011809,1,46064.0,68.4,22400,-0.555441,20.5,16.5,0.2,21.1,2.6,74.6,1007.0,0.823529,5.571429,,,-18.175926,
4,1.0,AL,Alabama,Blount County,1009.0,0.038890,1,50412.0,90.0,57840,-0.813820,23.2,18.2,0.3,1.5,9.6,86.9,1009.0,-6.372093,6.118812,,,-21.111111,8.720000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2823,56.0,WY,Wyoming,Sweetwater County,56037.0,-0.018485,0,73315.0,10.9,43051,-0.535382,26.2,12.1,1.0,1.1,16.1,79.3,56037.0,-11.537037,7.305882,,13.888889,-20.231481,7.120000
2824,56.0,WY,Wyoming,Teton County,56039.0,0.075183,1,99087.0,46.4,23081,0.278663,18.4,15.4,1.4,0.6,14.9,81.5,56039.0,-39.369048,-17.661017,-24.018868,-38.691358,-37.574074,
2825,56.0,WY,Wyoming,Uinta County,56041.0,0.010157,0,63401.0,43.1,20299,-0.614926,28.8,14.1,0.5,0.7,9.2,87.4,56041.0,-1.125000,14.352941,,12.537037,-16.435185,
2826,56.0,WY,Wyoming,Washakie County,56043.0,-0.007825,0,55190.0,36.0,7885,-0.640377,22.7,21.7,0.8,0.5,14.1,82.4,56043.0,-2.767442,-12.000000,,,-20.960526,


We lose about 150 counties in the merge because the Google Mobility dataset only contains 2800+ counties, but we consider that acceptable for this analysis.
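
To see exactly which counties drop out, one option is a left merge with `indicator=True`, which adds a `_merge` column marking rows that found no match. This is a sketch on made-up toy frames (hypothetical values), not a change to the merge used in this notebook:

```python
import pandas as pd

# toy stand-ins for the covariate table and the averaged mobility table
covariates = pd.DataFrame({
    'county_code': [1001.0, 1003.0, 1005.0],
    'SIP?': [1, 1, 0],
})
mobility = pd.DataFrame({
    'census_fips_code': [1001.0, 1005.0],
    'avg_workplaces_percent_change_6_1': [-22.5, -15.3],
})

# a left merge keeps every covariate row; _merge records whether it matched
checked = covariates.merge(mobility, left_on='county_code',
                           right_on='census_fips_code',
                           how='left', indicator=True)

# rows found only in the covariate table are the ones an inner merge drops
dropped = checked.loc[checked['_merge'] == 'left_only', 'county_code'].tolist()
print(dropped)
```

Inspecting the dropped counties (for instance their population or `% Rural`) would show whether the loss is concentrated in a particular kind of county.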

In [63]:
num_of_SIP_counties = merged_data['SIP?'].sum()
print(f'num of SIP counties = {num_of_SIP_counties}')

num of SIP counties = 2436


In [64]:
num_of_no_SIP_counties = merged_data[merged_data['SIP?'] == 0].shape[0]
print(f'num of no SIP counties = {num_of_no_SIP_counties}')

num of no SIP counties = 392


In [65]:
merged_data.to_csv('final_data_google_and_safe_graph_up_to_6_1.csv', index = False)