# google-covid-19-mobility-data-process-europe

In this notebook I will refine the process from `google-covid-19-mobility-data-process-v1` in order to get the highest resolution data possible for a map of Europe.

In [1]:
import pandas as pd
import requests
import simplejson as json
import numpy as np

---

## Load a reduced CSV containing just the European entries

In [2]:
europeDf = pd.read_csv("./output-data/europe.csv")

In [3]:
europeDf.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,AT,Austria,,,,,,ChIJfyqdJZsHbUcRr8Hk3XvUEhA,2020-02-15,9.0,1.0,42.0,13.0,0.0,-2.0
1,AT,Austria,,,,,,ChIJfyqdJZsHbUcRr8Hk3XvUEhA,2020-02-16,15.0,21.0,42.0,12.0,1.0,-2.0
2,AT,Austria,,,,,,ChIJfyqdJZsHbUcRr8Hk3XvUEhA,2020-02-17,9.0,5.0,35.0,3.0,-4.0,0.0
3,AT,Austria,,,,,,ChIJfyqdJZsHbUcRr8Hk3XvUEhA,2020-02-18,8.0,5.0,40.0,2.0,-4.0,0.0
4,AT,Austria,,,,,,ChIJfyqdJZsHbUcRr8Hk3XvUEhA,2020-02-19,4.0,2.0,10.0,-1.0,-5.0,1.0


---

**`sub_region_1` is regions**

In [6]:
europeDf["sub_region_1"].unique()

array([nan, 'Burgenland', 'Carinthia', 'Lower Austria', 'Salzburg',
       'Styria', 'Tyrol', 'Upper Austria', 'Vienna', 'Vorarlberg',
       'Brussels', 'Flanders', 'Wallonia', 'Blagoevgrad Province',
       'Burgas', 'Dobrich Province', 'Gabrovo', 'Haskovo Province',
       'Jambol', 'Kardzhali Province', 'Kyustendil Province', 'Lovec',
       'Montana Province', 'Pazardzhik', 'Pernik', 'Pleven Province',
       'Plovdiv Province', 'Razgrad', 'Ruse', 'Shumen Province',
       'Silistra', 'Sliven Province', 'Smoljan', 'Sofia City Province',
       'Sofia Province', 'Stara Zagora', 'Targovishte Province', 'Varna',
       'Veliko Tarnovo Province', 'Vidin', 'Vraca',
       'Central Bohemian Region', 'Hradec Králové Region',
       'Karlovy Vary Region', 'Liberec Region',
       'Moravian-Silesian Region', 'Olomouc Region', 'Pardubice Region',
       'Plzeň Region', 'Prague', 'South Bohemian Region',
       'South Moravian Region', 'Ústí nad Labem Region',
       'Vysočina Region', 'Zlín

---

**`sub_region_2` is counties, of which there are 2,673 unique values (including `nan`)**

In [7]:
europeDf["sub_region_2"].unique()

array([nan, 'Eisenstadt', 'Eisenstadt-Umgebung District', ...,
       'Turčianske Teplice District', 'Tvrdošín District',
       'Žilina District'], dtype=object)

In [9]:
len(europeDf["sub_region_2"].unique())

2673
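
The array above starts with `nan`, so `len(...unique())` counts the missing value as one entry. A minimal sketch on made-up data (the series below is a toy stand-in, not the real column) shows the difference from `nunique()`, which skips `NaN` by default:

```python
import numpy as np
import pandas as pd

# Toy stand-in for europeDf["sub_region_2"]; the values are illustrative only.
sub_region_2 = pd.Series(
    [np.nan, "Eisenstadt", "Eisenstadt", np.nan, "Žilina District"]
)

# unique() keeps NaN as one of the distinct values.
count_with_nan = len(sub_region_2.unique())

# nunique() drops NaN unless dropna=False is passed.
count_without_nan = sub_region_2.nunique()

print(count_with_nan, count_without_nan)
```

If only named counties should be counted, `nunique()` is probably the more direct call.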

---

**`metro_area` is not used**

In [11]:
europeDf["metro_area"].unique()

array([nan])

---

**`iso_3166_2_code` is used and relates to regions.**

In [12]:
europeDf["iso_3166_2_code"].unique()

array([nan, 'AT-1', 'AT-2', ..., 'SK-TC', 'SK-TA', 'SK-ZI'], dtype=object)

---

**`census_fips_code` is not used.**

In [13]:
europeDf["census_fips_code"].unique()

array([nan])

---

## Create a new data frame containing just the rows with `sub_region_2`

In [4]:
len(europeDf["place_id"].unique())

3304

In [5]:
europeSubRegion2Df = europeDf[europeDf["sub_region_2"].notnull()]

In [6]:
len(europeSubRegion2Df["place_id"].unique())

2647

We now have 2,647 unique `place_id`s instead of 3,304. Not a great reduction...

---

## Get the lat long coordinates for each unique `place_id`

In [18]:
uniquePlaceIdsDf = europeSubRegion2Df[["place_id"]].drop_duplicates()

Query the Google Maps Place Details API to get coordinates for each `place_id`.

In [19]:
with open('./secrets/googleapikey.txt', 'r') as f:
    key = f.read()

In [20]:
def get_lat_long(place_id):
    # Request only the geometry field for this place_id.
    url = "https://maps.googleapis.com/maps/api/place/details/json"
    params = {
        "place_id": str(place_id),
        "key": key.rstrip("\n"),
        "fields": "geometry",
    }

    response_json = json.loads(requests.get(url, params=params).text)

    # Return (None, None) if any level of the expected structure is missing.
    location = response_json.get("result", {}).get("geometry", {}).get("location")
    if location is None:
        return None, None

    return location["lat"], location["lng"]
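
The nested-key lookup inside `get_lat_long` can be exercised without calling the API. The sketch below pulls that step into a helper, `extract_lat_lng` (a hypothetical name, not part of this notebook), and runs it on hand-written dictionaries that only mirror the `result` / `geometry` / `location` shape the function above expects:

```python
def extract_lat_lng(response_json):
    """Return (lat, lng) from a Place Details style dict, or (None, None)."""
    # Each .get() falls back to an empty dict, so a missing level
    # ends the chain with None instead of raising KeyError.
    location = (
        response_json.get("result", {})
        .get("geometry", {})
        .get("location")
    )
    if not location:
        return None, None
    return location.get("lat"), location.get("lng")


# Hand-written samples for illustration; not real API responses.
sample_ok = {"result": {"geometry": {"location": {"lat": 47.8464, "lng": 16.528}}}}
sample_missing = {"status": "NOT_FOUND"}

print(extract_lat_lng(sample_ok))
print(extract_lat_lng(sample_missing))
```

Keeping the parsing separate from the HTTP request makes it possible to check the fallback behaviour without spending API quota.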

In [22]:
uniquePlaceIdsDf.loc[:, "lat"], uniquePlaceIdsDf.loc[:, "lng"] = zip(*uniquePlaceIdsDf['place_id'].map(get_lat_long))

In [23]:
uniquePlaceIdsDf.head()

Unnamed: 0,place_id,lat,lng
924,ChIJrziXHO43bEcR306wHEnyAlE,47.8464,16.528
1298,ChIJPaH4E0lPbEcR0jlSWowdN2I,47.8808,16.6721
1737,ChIJCUxcuQnpbkcRG0c2Jl5j0a0,47.0593,16.3245
2153,ChIJuRM1csjjbkcRdLPeHyNbg6o,46.9371,16.1296
2490,ChIJ3_3OS7wxbEcRGTfCAPgqt5A,47.7362,16.3966


I'll save these coordinates as a CSV file for later use.

In [24]:
uniquePlaceIdsDf.to_csv("./output-data/europe-sub-region-2-lat-lng.csv", index=False)

In [25]:
len(uniquePlaceIdsDf)

2647

---

## Merge the coordinates with the original `sub_region_2` data frame

In [7]:
uniquePlaceIdsDf = pd.read_csv("./output-data/europe-sub-region-2-lat-lng.csv")

In [8]:
europeSubRegion2MergeDf = pd.merge(europeSubRegion2Df, uniquePlaceIdsDf, on='place_id', how='outer')

In [9]:
europeSubRegion2MergeDf.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline,lat,lng
0,AT,Austria,Burgenland,Eisenstadt,,,,ChIJrziXHO43bEcR306wHEnyAlE,2020-02-15,-4.0,-9.0,,,,,47.84637,16.52796
1,AT,Austria,Burgenland,Eisenstadt,,,,ChIJrziXHO43bEcR306wHEnyAlE,2020-02-17,3.0,4.0,,5.0,7.0,,47.84637,16.52796
2,AT,Austria,Burgenland,Eisenstadt,,,,ChIJrziXHO43bEcR306wHEnyAlE,2020-02-18,0.0,2.0,,-4.0,7.0,,47.84637,16.52796
3,AT,Austria,Burgenland,Eisenstadt,,,,ChIJrziXHO43bEcR306wHEnyAlE,2020-02-19,-7.0,7.0,,-3.0,5.0,,47.84637,16.52796
4,AT,Austria,Burgenland,Eisenstadt,,,,ChIJrziXHO43bEcR306wHEnyAlE,2020-02-20,0.0,3.0,,4.0,5.0,,47.84637,16.52796
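
For reference, here is a small sketch of how `how='outer'` differs from `how='left'` in `pd.merge`. The two frames are toy data, not the real files. With an outer join, a `place_id` that appears only in the coordinates frame still produces a row, with every mobility column empty. Whether that ever happens with the real data is not checked here.

```python
import pandas as pd

# Toy mobility rows: two places, one of them with two dates.
mobility = pd.DataFrame({
    "place_id": ["a", "a", "b"],
    "date": ["2020-02-15", "2020-02-16", "2020-02-15"],
    "parks_percent_change_from_baseline": [42.0, 42.0, 35.0],
})

# Toy coordinates: includes a place ("c") with no mobility rows.
coords = pd.DataFrame({
    "place_id": ["a", "b", "c"],
    "lat": [47.8, 47.9, 47.1],
    "lng": [16.5, 16.7, 16.3],
})

outer = pd.merge(mobility, coords, on="place_id", how="outer")

# validate="many_to_one" raises if coords has duplicate place_ids.
left = pd.merge(mobility, coords, on="place_id", how="left",
                validate="many_to_one")

print(len(outer), len(left))
```

A left join keeps exactly the mobility rows, which may be the safer default when the coordinates are only a lookup table.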


---

## Calculate 7 day rolling averages for each location

In [10]:
def add_rolling_average(df):
    df.loc[:, "retail-average"] = df["retail_and_recreation_percent_change_from_baseline"].rolling(window=7, center=True).mean()
    df.loc[:, "grocery-average"] = df["grocery_and_pharmacy_percent_change_from_baseline"].rolling(window=7, center=True).mean()
    df.loc[:, "parks-average"] = df["parks_percent_change_from_baseline"].rolling(window=7, center=True).mean()
    df.loc[:, "transit-average"] = df["transit_stations_percent_change_from_baseline"].rolling(window=7, center=True).mean()
    df.loc[:, "workplace-average"] = df["workplaces_percent_change_from_baseline"].rolling(window=7, center=True).mean()
    df.loc[:, "residential-average"] = df["residential_percent_change_from_baseline"].rolling(window=7, center=True).mean()
    
    return df

In [11]:
europeSubRegion2AverageDf = europeSubRegion2MergeDf.groupby("place_id").apply(add_rolling_average)

In [12]:
europeSubRegion2AverageDf[1000000:1000005]

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,...,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline,lat,lng,retail-average,grocery-average,parks-average,transit-average,workplace-average,residential-average
1000000,SE,Sweden,Stockholm County,Salem Municipality,,,,ChIJ8eJRdycNX0YRfvG06GYpON0,2020-04-27,,...,-35.0,,59.195579,17.757501,,,,-36.714286,,
1000001,SE,Sweden,Stockholm County,Salem Municipality,,,,ChIJ8eJRdycNX0YRfvG06GYpON0,2020-04-28,,...,-32.0,,59.195579,17.757501,,,,-41.714286,,
1000002,SE,Sweden,Stockholm County,Salem Municipality,,,,ChIJ8eJRdycNX0YRfvG06GYpON0,2020-04-29,,...,-29.0,,59.195579,17.757501,,,,-40.428571,,
1000003,SE,Sweden,Stockholm County,Salem Municipality,,,,ChIJ8eJRdycNX0YRfvG06GYpON0,2020-04-30,,...,-38.0,,59.195579,17.757501,,,,-42.0,,
1000004,SE,Sweden,Stockholm County,Salem Municipality,,,,ChIJ8eJRdycNX0YRfvG06GYpON0,2020-05-01,,...,-84.0,,59.195579,17.757501,,,,-41.285714,,
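
The many `NaN` averages follow from how a centred window works: `rolling(window=7, center=True)` needs three rows on each side, and by default all seven values must be present. The window also counts rows, not calendar days; the merged output earlier skips 2020-02-16 for Eisenstadt, so seven rows can span more than a week. A minimal sketch on toy data, using `groupby(...).transform` as one possible alternative to `apply`:

```python
import pandas as pd

# Toy data: two places with seven consecutive days each.
dates = list(pd.date_range("2020-02-15", periods=7).strftime("%Y-%m-%d"))
toy = pd.DataFrame({
    "place_id": ["a"] * 7 + ["b"] * 7,
    "date": dates * 2,
    "parks_percent_change_from_baseline":
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] + [10.0] * 7,
})

# Sort so each group's rows are in date order before rolling.
toy = toy.sort_values(["place_id", "date"])

# transform returns a result aligned with the original rows,
# computed separately within each place_id.
toy["parks-average"] = (
    toy.groupby("place_id")["parks_percent_change_from_baseline"]
    .transform(lambda s: s.rolling(window=7, center=True).mean())
)

print(toy[["place_id", "date", "parks-average"]])
```

With exactly seven rows per place, only the middle row of each group gets a value; the three rows on either side stay `NaN`.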


---

## Round the average figures to 1 decimal place for a smaller final file size

In [13]:
europeSubRegion2RoundedDf = europeSubRegion2AverageDf.round({
    'retail-average': 1,
    'grocery-average': 1,
    'parks-average': 1,
    'transit-average': 1,
    'workplace-average': 1,
    'residential-average': 1
})

---

## Remove any NaN `place_id`s

In [14]:
europeSubRegion2NotNaDf = europeSubRegion2RoundedDf[europeSubRegion2RoundedDf["place_id"].notna()]

---

## Convert the data into a list of Python dictionaries so it can be exported as JSON

In [15]:
def create_list_for_json(df):
    outputList = []
    listOfPlaceIds = df["place_id"].drop_duplicates().to_list()
    groupByPlaceId = df.groupby("place_id")
    
    for place_id in listOfPlaceIds:
        thisDf = groupByPlaceId.get_group(place_id)
        parksList = thisDf["parks-average"].to_list()
        
        # Some of the parks columns contain all NaNs, we'll skip these
        if np.isnan(parksList).all():
            continue
        
        myDict = {}
        myDict['lng'] = thisDf.iloc[0]["lng"]
        myDict["lat"] =  thisDf.iloc[0]["lat"]

        # parks_percent_change_from_baseline
        myDict["parks"] = thisDf.set_index("date")["parks-average"].to_dict()

        outputList.append(myDict)
        
    return outputList

In [16]:
europeSubRegion2List = create_list_for_json(europeSubRegion2NotNaDf)

In [17]:
len(europeSubRegion2List)

991

In [19]:
with open("./public/data/europe-parks.json", "w") as outfile: 
    json.dump(europeSubRegion2List, outfile, ignore_nan=True)
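
The `ignore_nan=True` flag is specific to `simplejson`; it writes `NaN` values as `null`. If only the standard library `json` module were available, one possible approach would be to replace `NaN` with `None` before dumping. The helper name `nan_to_none` and the sample record below are assumptions for illustration:

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None so json.dumps emits null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    return value


# Illustrative record in the same shape as the entries built above.
record = {
    "lng": 16.528,
    "lat": 47.8464,
    "parks": {"2020-02-15": float("nan"), "2020-02-18": 31.4},
}

text = json.dumps(nan_to_none([record]))
print(text)
```

Without this step the standard `json.dumps` writes a bare `NaN`, which is not valid JSON and is rejected by strict parsers such as JavaScript's `JSON.parse`.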

---

## Missing `sub_region_2` for many countries in Europe

Many countries, such as Ireland and Germany, are missing `sub_region_2` rows.

In [11]:
europeDf[europeDf["country_region"] == "Ireland"]["sub_region_1"].unique()

array([nan, 'County Carlow', 'County Cavan', 'County Clare',
       'County Cork', 'County Donegal', 'County Dublin', 'County Galway',
       'County Kerry', 'County Kildare', 'County Kilkenny',
       'County Laois', 'County Leitrim', 'County Limerick',
       'County Longford', 'County Louth', 'County Mayo', 'County Meath',
       'County Monaghan', 'County Offaly', 'County Roscommon',
       'County Sligo', 'County Tipperary', 'County Waterford',
       'County Westmeath', 'County Wexford', 'County Wicklow'],
      dtype=object)

In [12]:
europeDf[europeDf["country_region"] == "Ireland"]["sub_region_2"].unique()

array([nan], dtype=object)
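
Rather than checking countries one at a time, the same question can be asked of every country at once. This is a sketch on toy data (the frame below is made up to mimic the relevant columns):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the three relevant columns of europeDf.
toy = pd.DataFrame({
    "country_region": ["Austria", "Austria", "Ireland", "Ireland"],
    "sub_region_1": ["Burgenland", "Burgenland", "County Cork", "County Dublin"],
    "sub_region_2": ["Eisenstadt", np.nan, np.nan, np.nan],
})

# For each country: does any row have a non-null sub_region_2?
has_sub_region_2 = (
    toy["sub_region_2"].notna().groupby(toy["country_region"]).any()
)

# Countries where no row has sub_region_2 data.
missing = has_sub_region_2[~has_sub_region_2].index.tolist()
print(missing)
```

Run against `europeDf`, the resulting list should name every country that can only be mapped at `sub_region_1` resolution.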