# google-covid-19-mobility-data-process-usa

In this notebook I will refine the process from `google-covid-19-mobility-data-process-v1` in order to get the highest resolution data possible for a map of the USA.

In [92]:
import pandas as pd
import requests
import simplejson as json
import numpy as np

---

## Load a reduced CSV containing just the United States entries

In [7]:
usaDf = pd.read_csv("./output-data/usa.csv", low_memory=False)

In [8]:
usaDf.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-15,6.0,2.0,15.0,3.0,2.0,-1.0
1,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-16,7.0,1.0,16.0,2.0,0.0,-1.0
2,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-17,6.0,0.0,28.0,-9.0,-24.0,5.0
3,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-18,0.0,-1.0,6.0,1.0,0.0,1.0
4,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-19,2.0,0.0,8.0,1.0,1.0,0.0


---

**`sub_region_1` holds the states**

In [9]:
usaDf["sub_region_1"].unique()

array([nan, 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

---

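**`sub_region_2` holds county names: 1,719 unique values, including `nan`**

County names repeat across states, so this figure is lower than the number of counties in the data.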

In [10]:
usaDf["sub_region_2"].unique()

array([nan, 'Autauga County', 'Baldwin County', ..., 'Uinta County',
       'Washakie County', 'Weston County'], dtype=object)

In [11]:
len(usaDf["sub_region_2"].unique())

1719
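
`len(... .unique())` counts `nan` as a value, and it counts a county name only once even when several states share it. A minimal sketch on made-up rows (not taken from `usa.csv`) showing how the three possible counts differ:

```python
import numpy as np
import pandas as pd

# Toy rows, not from the real CSV: two states share a county name
toy = pd.DataFrame({
    "sub_region_1": ["Alabama", "Alabama", "Georgia", np.nan],
    "sub_region_2": ["Washington County", "Baldwin County", "Washington County", np.nan],
})

# unique() keeps nan as one of the values
with_nan = len(toy["sub_region_2"].unique())

# nunique() ignores nan, but still merges same-named counties
names_only = toy["sub_region_2"].nunique()

# Distinct (state, county) pairs are the closest match to "number of counties"
pairs = toy.dropna(subset=["sub_region_2"])[["sub_region_1", "sub_region_2"]]
state_county_pairs = len(pairs.drop_duplicates())

print(with_nan, names_only, state_county_pairs)
```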

---

**`metro_area` is not used**

In [13]:
usaDf["metro_area"].unique()

array([nan])

---

**`iso_3166_2_code` is used and relates to States.**

In [14]:
usaDf["iso_3166_2_code"].unique()

array([nan, 'US-AL', 'US-AK', 'US-AZ', 'US-AR', 'US-CA', 'US-CO', 'US-CT',
       'US-DE', 'US-DC', 'US-FL', 'US-GA', 'US-HI', 'US-ID', 'US-IL',
       'US-IN', 'US-IA', 'US-KS', 'US-KY', 'US-LA', 'US-ME', 'US-MD',
       'US-MA', 'US-MI', 'US-MN', 'US-MS', 'US-MO', 'US-MT', 'US-NE',
       'US-NV', 'US-NH', 'US-NJ', 'US-NM', 'US-NY', 'US-NC', 'US-ND',
       'US-OH', 'US-OK', 'US-OR', 'US-PA', 'US-RI', 'US-SC', 'US-SD',
       'US-TN', 'US-TX', 'US-UT', 'US-VT', 'US-VA', 'US-WA', 'US-WV',
       'US-WI', 'US-WY'], dtype=object)

---

**`census_fips_code` is present and could be used as an alternative to `place_id`**

In [15]:
usaDf["census_fips_code"].unique()

array([   nan,  1001.,  1003., ..., 56041., 56043., 56045.])
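
The codes are read as floats because the column contains `nan`, which drops the leading zero from codes such as `01001`. If the FIPS code were used as a join key, it would probably need converting to a 5-character string first. A sketch of one way to do that, using toy values in the same form as the output above:

```python
import numpy as np
import pandas as pd

# Toy values in the same float form as the notebook output
fips = pd.Series([np.nan, 1001.0, 1003.0, 56045.0])

# Nullable integer first (keeps missing values), then a zero-padded string
fips_str = fips.astype("Int64").astype("string").str.zfill(5)

print(fips_str.tolist())
```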

---

## Create a new data frame containing just the rows with `sub_region_2`

In [27]:
len(usaDf["place_id"].unique())

2889

In [25]:
usaSubRegion2Df = usaDf[usaDf["sub_region_2"].notnull()]

In [28]:
len(usaSubRegion2Df["place_id"].unique())

2837

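We now have 2,837 unique `place_id`s instead of 2,889. Not a great reduction: the 52 that dropped out are consistent with the national entry plus the 50 states and the District of Columbia, which have no `sub_region_2`.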

---

## Get the lat long coordinates for each unique `place_id`

In [29]:
uniquePlaceIdsDf = usaSubRegion2Df[["place_id"]].drop_duplicates()

Access the Google Maps Places API (Place Details endpoint) to get coordinates for each `place_id`.

In [30]:
with open('./secrets/googleapikey.txt', 'r') as f:
    key = f.read()

In [31]:
def get_lat_long(place_id):
    """Return (lat, lng) for a place_id, or (None, None) if the response has no location."""
    params = {
        "place_id": place_id,
        "key": key.strip(),  # drop the trailing newline read from the key file
        "fields": "geometry",
    }
    # timeout stops one stalled request from hanging the whole run
    response = requests.get(
        "https://maps.googleapis.com/maps/api/place/details/json",
        params=params,
        timeout=10,
    )
    # Walk result -> geometry -> location, tolerating any missing level
    location = response.json().get("result", {}).get("geometry", {}).get("location")
    if location is None:
        return None, None
    return location["lat"], location["lng"]
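
`get_lat_long` returns `(None, None)` whenever a key is missing from the response, so a rejected request (bad key, exhausted quota) looks the same as a place without geometry. Place Details responses carry a top-level `status` field that can tell the two apart. The sketch below uses a hypothetical helper, `parse_location`, which is not part of the original notebook, and runs on hand-written payloads rather than live API calls:

```python
def parse_location(response_json):
    """Return (lat, lng), (None, None) when the place is not found, or raise on API errors.

    Hypothetical helper for illustration only.
    """
    status = response_json.get("status")
    if status in ("ZERO_RESULTS", "NOT_FOUND"):
        return None, None
    if status != "OK":
        # e.g. REQUEST_DENIED or OVER_QUERY_LIMIT: fail loudly instead of storing None
        raise RuntimeError(f"Place Details request failed with status {status!r}")
    location = response_json.get("result", {}).get("geometry", {}).get("location")
    if location is None:
        return None, None
    return location["lat"], location["lng"]


# Hand-written sample payloads shaped like the API response
ok = {"status": "OK", "result": {"geometry": {"location": {"lat": 32.5, "lng": -86.5}}}}
missing = {"status": "NOT_FOUND"}
denied = {"status": "REQUEST_DENIED"}

print(parse_location(ok), parse_location(missing))
```

Raising on unexpected statuses means a quota problem halfway through the 2,837 lookups would stop the run instead of silently filling the rest of the column with `None`.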

In [32]:
uniquePlaceIdsDf.loc[:, "lat"], uniquePlaceIdsDf.loc[:, "lng"] = zip(*uniquePlaceIdsDf['place_id'].map(get_lat_long))

In [33]:
uniquePlaceIdsDf.head()

Unnamed: 0,place_id,lat,lng
924,ChIJg9z7ewWPjogRA_8QrB0va7o,32.579182,-86.499655
1380,ChIJMRERvxBnmogR0K_9O08dxDw,30.601074,-87.776333
1842,ChIJtUJfgR1tjYgRAjFO7yq2nuI,31.81729,-85.354965
2279,ChIJEdvIA8szj4gRU_XpIL_1cSw,32.95628,-87.142289
2716,ChIJdcUJhYeXiYgRERpotPlZFfs,34.014515,-86.499655


I'll save these coordinates as a CSV file for later use.

In [34]:
uniquePlaceIdsDf.to_csv("./output-data/usa-sub-region-2-lat-lng.csv", index=False)
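
Since every lookup is a separate API request, it may be worth reading this CSV back on later runs instead of querying again. A rough sketch of that pattern; `load_or_fetch_coordinates` and `fake_fetch` are hypothetical names, and the stand-in fetch function replaces the real API call so the example runs offline:

```python
import os
import tempfile

import pandas as pd


def load_or_fetch_coordinates(csv_path, place_ids, fetch):
    """Read cached coordinates if csv_path exists, else call fetch per place_id and save."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    coords = [fetch(place_id) for place_id in place_ids]
    df = pd.DataFrame({
        "place_id": list(place_ids),
        "lat": [lat for lat, _ in coords],
        "lng": [lng for _, lng in coords],
    })
    df.to_csv(csv_path, index=False)
    return df


# Stand-in for get_lat_long that records how often it is called
calls = []

def fake_fetch(place_id):
    calls.append(place_id)
    return 1.0, 2.0


cache_path = os.path.join(tempfile.mkdtemp(), "coords.csv")
first = load_or_fetch_coordinates(cache_path, ["a", "b"], fake_fetch)
second = load_or_fetch_coordinates(cache_path, ["a", "b"], fake_fetch)  # served from the CSV

print(len(calls))
```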

In [35]:
len(uniquePlaceIdsDf)

2837

---

## Merge the coordinates with the original `sub_region_2` data frame

In [36]:
usaSubRegion2MergeDf = pd.merge(usaSubRegion2Df, uniquePlaceIdsDf, on='place_id', how='outer')
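
The merge relies on `uniquePlaceIdsDf` holding exactly one row per `place_id`. `pd.merge` can check that assumption through its `validate` argument, and `indicator=True` shows whether any row failed to find a match. A small sketch on toy frames (the values are illustrative):

```python
import pandas as pd

# Toy stand-ins for the daily readings and the coordinate lookup
readings = pd.DataFrame({"place_id": ["a", "a", "b"], "parks": [1.0, 2.0, 3.0]})
coords = pd.DataFrame({"place_id": ["a", "b"], "lat": [32.5, 30.6], "lng": [-86.5, -87.8]})

merged = pd.merge(
    readings,
    coords,
    on="place_id",
    how="left",
    validate="many_to_one",  # raises MergeError if coords has duplicate place_ids
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

print(merged["_merge"].value_counts())
```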

In [37]:
usaSubRegion2MergeDf.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline,lat,lng
0,US,United States,Alabama,Autauga County,,,1001.0,ChIJg9z7ewWPjogRA_8QrB0va7o,2020-02-15,5.0,7.0,,,-4.0,,32.579182,-86.499655
1,US,United States,Alabama,Autauga County,,,1001.0,ChIJg9z7ewWPjogRA_8QrB0va7o,2020-02-16,0.0,1.0,-23.0,,-4.0,,32.579182,-86.499655
2,US,United States,Alabama,Autauga County,,,1001.0,ChIJg9z7ewWPjogRA_8QrB0va7o,2020-02-17,8.0,0.0,,,-27.0,5.0,32.579182,-86.499655
3,US,United States,Alabama,Autauga County,,,1001.0,ChIJg9z7ewWPjogRA_8QrB0va7o,2020-02-18,-2.0,0.0,,,2.0,0.0,32.579182,-86.499655
4,US,United States,Alabama,Autauga County,,,1001.0,ChIJg9z7ewWPjogRA_8QrB0va7o,2020-02-19,-2.0,0.0,,,2.0,0.0,32.579182,-86.499655


---

## Calculate 7-day rolling averages for each location

In [48]:
def add_rolling_average(df):
    # Output column -> source column for each mobility category
    columns = {
        "retail-average": "retail_and_recreation_percent_change_from_baseline",
        "grocery-average": "grocery_and_pharmacy_percent_change_from_baseline",
        "parks-average": "parks_percent_change_from_baseline",
        "transit-average": "transit_stations_percent_change_from_baseline",
        "workplace-average": "workplaces_percent_change_from_baseline",
        "residential-average": "residential_percent_change_from_baseline",
    }
    # Centered 7-day window: each average covers the day plus 3 days either side
    for average_col, source_col in columns.items():
        df.loc[:, average_col] = df[source_col].rolling(window=7, center=True).mean()

    return df

In [49]:
usaSubRegion2AverageDf = usaSubRegion2MergeDf.groupby("place_id").apply(add_rolling_average)
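
With `window=7, center=True` and the default `min_periods`, a day only gets an average when all seven values in its window are present. The first and last three days of each location are therefore `NaN`, as is any day within three days of a missing reading. A small sketch on a toy series showing the effect, with `min_periods` as one possible way to relax it (the threshold of 4 is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

# Ten consecutive toy readings: 0.0, 1.0, ..., 9.0
values = pd.Series(np.arange(10, dtype=float))

# Default behaviour: all 7 values in the window are required
strict = values.rolling(window=7, center=True).mean()

# Relaxed: an average is produced when at least 4 values are available
relaxed = values.rolling(window=7, center=True, min_periods=4).mean()

print(strict.isna().sum(), relaxed.isna().sum())
```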

In [54]:
usaSubRegion2AverageDf[1000000:1000005]

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,...,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline,lat,lng,retail-average,grocery-average,parks-average,transit-average,workplace-average,residential-average
1000000,US,United States,Texas,Taylor County,,,48441.0,ChIJNfnP60m-VoYRN7ziFE_S9ic,2021-02-09,-12.0,...,-19.0,6.0,32.245509,-99.812494,-19.714286,-6.428571,,4.428571,-22.0,9.142857
1000001,US,United States,Texas,Taylor County,,,48441.0,ChIJNfnP60m-VoYRN7ziFE_S9ic,2021-02-10,-23.0,...,-22.0,10.0,32.245509,-99.812494,-23.857143,-7.857143,,1.571429,-23.857143,10.285714
1000002,US,United States,Texas,Taylor County,,,48441.0,ChIJNfnP60m-VoYRN7ziFE_S9ic,2021-02-11,-38.0,...,-46.0,21.0,32.245509,-99.812494,-32.142857,-14.428571,,-5.142857,-29.571429,12.428571
1000003,US,United States,Texas,Taylor County,,,48441.0,ChIJNfnP60m-VoYRN7ziFE_S9ic,2021-02-12,-24.0,...,-27.0,14.0,32.245509,-99.812494,-39.142857,-17.142857,,-12.857143,-38.571429,15.142857
1000004,US,United States,Texas,Taylor County,,,48441.0,ChIJNfnP60m-VoYRN7ziFE_S9ic,2021-02-13,-39.0,...,-24.0,12.0,32.245509,-99.812494,-47.0,-17.428571,,-24.142857,-47.142857,16.857143


---

## Round the average figures to 1 decimal place for a smaller final file size

In [59]:
usaSubRegion2RoundedDf = usaSubRegion2AverageDf.round({
    'retail-average': 1,
    'grocery-average': 1,
    'parks-average': 1,
    'transit-average': 1,
    'workplace-average': 1,
    'residential-average': 1
})

---

## Remove any NaN `place_id`s

In [61]:
usaSubRegion2NotNaDf = usaSubRegion2RoundedDf[usaSubRegion2RoundedDf["place_id"].notna()]

---

## Convert the data into a Python dictionary so it can be exported as JSON

In [109]:
def create_list_for_json(df):
    outputList = []
    listOfPlaceIds = df["place_id"].drop_duplicates().to_list()
    groupByPlaceId = df.groupby("place_id")
    
    for place_id in listOfPlaceIds:
        thisDf = groupByPlaceId.get_group(place_id)
        parksList = thisDf["parks-average"].to_list()
        
        # Some of the parks columns contain all NaNs, we'll skip these
        if np.isnan(parksList).all():
            continue
        
        myDict = {}
        myDict['lng'] = thisDf.iloc[0]["lng"]
        myDict["lat"] =  thisDf.iloc[0]["lat"]

        # parks_percent_change_from_baseline
        myDict["parks"] = parksList

        outputList.append(myDict)
        
    return outputList

In [110]:
usaSubRegion2List = create_list_for_json(usaSubRegion2NotNaDf)

In [111]:
len(usaSubRegion2List)

666

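Also get a list of the dates for reference. The dates are taken from a single `place_id` (Taylor County, Texas, shown above), on the assumption that every location covers the same date range.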

In [112]:
dateList = usaSubRegion2NotNaDf[usaSubRegion2NotNaDf["place_id"] == "ChIJNfnP60m-VoYRN7ziFE_S9ic"]["date"].to_list()
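
Using one location's dates for every entry only works if all locations share the same date range. A quick sanity check, sketched on toy rows, is to compare the first date, last date and row count per `place_id`. This is a necessary condition rather than a full comparison of every date:

```python
import pandas as pd

# Toy rows: place "b" is missing one day
toy = pd.DataFrame({
    "place_id": ["a", "a", "a", "b", "b"],
    "date": ["2020-02-15", "2020-02-16", "2020-02-17", "2020-02-15", "2020-02-17"],
})

# One summary row per place: first date, last date, number of rows
summary = toy.groupby("place_id")["date"].agg(["min", "max", "count"])

# If every place had the same coverage, only one distinct summary row would remain
all_same = len(summary.drop_duplicates()) == 1

print(all_same)
```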

Add the data and the dates to a new dictionary for export

In [113]:
exportDf = {}

In [114]:
exportDf["data"] = usaSubRegion2List

In [115]:
exportDf["dates"] = dateList

In [117]:
with open("./public/data/usa-parks.json", "w") as outfile: 
    json.dump(exportDf, outfile, ignore_nan=True)
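
`ignore_nan=True` is a `simplejson` option that writes `NaN` values as `null`. The standard library `json` module has no equivalent and would emit a bare `NaN`, which strict JSON parsers reject. If `simplejson` were not available, one alternative would be to replace the values before dumping. A sketch with a hypothetical helper, `nan_to_none`, on a made-up payload of the same shape:

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None in nested dicts and lists."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    return value


# Made-up payload with the same structure as the export dictionary
export = {
    "data": [{"lat": 32.5, "lng": -86.5, "parks": [float("nan"), 4.4]}],
    "dates": ["2020-02-15", "2020-02-16"],
}

# allow_nan=False makes json.dumps raise if any NaN slipped through
text = json.dumps(nan_to_none(export), allow_nan=False)

print(text)
```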