## This Notebook - Goals - FOR EDINA 

**What?:** <br>
- Introduction/tutorial to <code>folium</code>, a library for plotting data on interactive Leaflet maps
- Visualization of current Covid-19 data

**Who?:** <br>
- Academics in geosciences
- Geophysical Data Science course
- Users interested in geospatial data analysis
 
**Why?:** <br>
- Tutorial/guide for academics and students on how to use folium and how to process Covid-19 data

**Noteable features to exploit:** <br>
- Use of pre-installed libraries 

**How?:** <br>
- Step-by-step data processing using real-time global Covid-19 data
- Clear visualisations - concise explanations
- Effective use of core libraries
<hr>

# Visualising Covid-19 data using Folium 
<code>folium</code> makes it easy to visualize data that’s been manipulated in Python on an interactive Leaflet map. It enables both the binding of data to a map for choropleth visualizations and the passing of rich vector/raster/HTML visualizations as markers on the map. The library has a number of built-in tilesets from <code>OpenStreetMap</code>, <code>Stamen</code> and <code>Mapbox</code>, and supports custom tilesets with Mapbox or Cloudmade API keys. It supports image, video, GeoJSON and TopoJSON overlays. It also has several plugins that enable interactive tools for the maps.

This notebook is a tutorial on creating a global choropleth map of the confirmed cases of Covid-19 for each country. It starts with a step-by-step guide to accessing, processing and cleaning up the data before plotting it. It then works towards a more informative visualization in three steps: a linear segmented colormap, a logarithmic colormap, and a logarithmic colormap with hover tools showing the absolute values for each country.

**Notebook contents:**
- Importing the necessary libraries
- <a href='#pre'>Pre-processing of open source Covid-19 data</a>
- Choropleth maps using Covid-19 data
    - <a href='#lin'>Linear colormap</a>
    - <a href='#log'>Logarithmic colormap</a>
    - <a href='#hov'>Logarithmic colormap with hover tools showing absolute values</a>

In [None]:
# Import general libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime
import requests
import json
from IPython.display import display

# Import folium package and modules
import folium
from folium import plugins

# Import widgets library
import ipywidgets as widgets
from ipywidgets import interactive


# Hide warning messages
import warnings
warnings.filterwarnings('ignore')

<a id='pre'></a>
## Pre-processing Covid-19 data

### Getting the data
Open source global data on the number of confirmed Covid-19 cases per country can be accessed through the following GitHub repository: https://github.com/CSSEGISandData/COVID-19.git <br>
The entire repository can be cloned into your Noteable home directory. As the repository is very large, cloning can take a few minutes and may need more than one attempt to succeed.

In [None]:
# Clone github repository of open source Covid-19 data - sometimes takes a few tries and takes a few minutes as it is quite a large repo
!git clone https://github.com/CSSEGISandData/COVID-19.git

### Cleaning up the data
After successfully cloning the repo into your home directory, the data used can be found in <code>COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv</code>. If you have a look inside the csv file, you can see that most of the data is available per country. However, some countries have their data split into states or provinces. These rows need to be summed to give the overall number of cases within that country.
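
The summing step described above can also be expressed as a single group-by. The following is only a sketch on made-up data: the country names and numbers are invented, and the notebook itself uses an explicit loop over the affected countries instead.

```python
import pandas as pd

# Toy stand-in for the CSSE table: two provinces of one country plus a
# country reported as a single row (all names and values are made up)
toy = pd.DataFrame({
    "Province/State": ["Alpha", "Beta", None],
    "Country/Region": ["Examplia", "Examplia", "Samplestan"],
    "7/21/20": [10, 5, 7],
    "7/22/20": [12, 6, 9],
})

# Drop the non-numeric province column, then sum every date column per country.
# min_count=1 keeps a group that contains only NaN as NaN instead of 0.
per_country = (
    toy.drop(columns=["Province/State"])
       .groupby("Country/Region", as_index=False)
       .sum(min_count=1)
)
print(per_country)
```

Here the two rows of the invented country collapse into one row holding the column-wise totals, while the single-row country is left unchanged.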

Another issue is that the data has no vector data associated with the countries, i.e. no polygons are defined for the choropleth map to plot. The vector geometry for the countries is therefore taken from a json file in the folium GitHub repository: https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/world-countries.json. As some of the country names differ between the Covid-19 csv data and the vector geometry json file, they need to be changed to match. Greenland is a special case: because it is an autonomous territory within the Kingdom of Denmark, the csv file lists it with Greenland as the province and Denmark as the country, but it has its own polygon geometry in the json file. Its country name therefore needs to be changed to Greenland to match the json file.
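
One way to find the names that need changing is a set difference between the two name lists. The sketch below uses small hand-written samples. In the notebook the first set would come from the <code>Country/Region</code> column and the second from the <code>name</code> property of each feature in the json file.

```python
# Hypothetical samples of the two name lists
csv_names = {"US", "Czechia", "France", "Korea, South"}
geojson_names = {"United States of America", "Czech Republic", "France", "South Korea"}

# Names that appear in the csv but have no polygon under the same name
unmatched = sorted(csv_names - geojson_names)
print(unmatched)  # ['Czechia', 'Korea, South', 'US']
```

Every name in the result either needs renaming or has no geometry in the json file at all, so the list still has to be checked by hand.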

Once all the duplicate countries have been fixed, several columns in the csv file become redundant (the Latitude, Longitude and Province/State columns), so these can be dropped to make the data cleaner. All '0' values are changed to 'NaN', as they would otherwise cause problems when the data is plotted on a logarithmic scale (log(0) is undefined).
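
A small illustration of why the zeros are replaced, using made-up counts: NumPy returns negative infinity for <code>np.log(0)</code>, whereas a NaN simply stays NaN and can be drawn with a separate "no data" colour.

```python
import numpy as np

counts = np.array([0.0, 1.0, 100.0])  # made-up case counts

# Taking the log directly turns the zero into -inf
# (NumPy would also emit a divide-by-zero warning, silenced here)
with np.errstate(divide="ignore"):
    raw_log = np.log(counts)

# Masking zeros as NaN first keeps the result free of infinities
masked = np.where(counts == 0, np.nan, counts)
safe_log = np.log(masked)

print(raw_log)
print(safe_log)
```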

To be able to plot the data with a hover tooltip, the Covid-19 data needs to be added to the json file. So first, the csv data needs to be converted to json as well, and then the right keys and values need to be selected from both json files.

What needs to be done:
- need vector geometry from additional json file
- match country names in csv and json files
- sum number of cases of countries split into states or provinces into one country wide statistic
- fix Greenland's country name
- drop duplicate countries - keep country-wide data
- change all '0' values to 'NaN'
- drop Latitude, Longitude and State/Province columns
- convert csv file into json and add Covid-19 data to vector geometry json file

In [None]:
# Load timeseries data of daily cumulative confirmed infections per country
total_cases_g = pd.read_csv('COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv', na_values=0, delimiter=',')

# Match country names in dataframe (e.g. 'US') to country names in json file (e.g. 'United States of America')
name_map = {
    'US': 'United States of America',
    'Tanzania': 'United Republic of Tanzania',
    'Congo (Brazzaville)': 'Republic of the Congo',
    'Congo (Kinshasa)': 'Democratic Republic of the Congo',
    'Czechia': 'Czech Republic',
    'Korea, South': 'South Korea',
    'Taiwan*': 'Taiwan',
    'Serbia': 'Republic of Serbia',
    'North Macedonia': 'Macedonia',
    'Guinea-Bissau': 'Guinea Bissau',
    "Cote d'Ivoire": 'Ivory Coast',
}
total_cases_g['Country/Region'] = total_cases_g['Country/Region'].replace(name_map)

# Drop columns of coordinates - using geojson file to plot the polygons of the countries so this is not necessary
total_cases_g = total_cases_g.drop(columns=['Lat', 'Long'])

In [None]:
# Show duplicated countries
total_cases_g[total_cases_g.duplicated(subset='Country/Region')]

In [None]:
# Combine states/provinces of the duplicated countries to one country
countries = ['Australia', 'Canada', 'United Kingdom', 'France', 'Netherlands', 'China'] # Duplicated countries
total_cases = total_cases_g
for country in countries:
    rows = total_cases_g[total_cases_g['Country/Region'] == country] # All rows belonging to this country
    series = rows.sum(numeric_only=True) # Sum up the numeric (date) columns only
    df = series.to_frame().transpose() # Turn series into a one-row dataframe
    df['Country/Region'] = country # Set country name
    df['Province/State'] = 'NaN' # Placeholder - this column is dropped later
    total_cases = pd.concat([total_cases, df], ignore_index=True, sort = True) # Combine original dataframe and new dataframe of fixed duplicates

# Fix naming issue with Greenland - can't do what we did with the others as it exists in geojson file
greenland = total_cases[total_cases['Province/State'] == 'Greenland'].copy() # Copy of the Greenland row (avoids modifying a view)
greenland['Country/Region'] = 'Greenland' # Rename country
greenland['Province/State'] = 'NaN' # Rename Province/State
total_cases = pd.concat([total_cases, greenland], ignore_index=True) # Combine original dataframe and greenland dataframe

# Replace 0 with NaN to make sure np.log used on values later on is defined
total_cases.replace(0, np.nan, inplace=True)

# Drop duplicate countries and keep the last (combined data) entry
total_cases.drop_duplicates(subset='Country/Region', keep='last', inplace=True)

# Delete 'Province/State' columns as it is not needed anymore
total_cases.drop(columns=['Province/State'], inplace=True)

# Sort dataframe alphabetically by country
total_cases.sort_values('Country/Region', inplace = True)

# Show final edited dataframe
total_cases

In [None]:
# Set up the world countries data URL
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'

# Load it into the notebook as a json object (raise an error if the download failed)
res = requests.get(country_shapes)
res.raise_for_status()
data = res.json()

In [None]:
# Set index of covid-dataframe to the country
total_cases_ind = total_cases.set_index('Country/Region')

# Turn dataframe into a json file with a structure of  {column -> {index -> value}}
result = total_cases_ind.to_json(orient='columns')
parsed = json.loads(result)

# Select any date to show all the 'country : data' pairs
parsed["7/22/20"]

<a id='lin'></a>
## Linear colormap
Once all the data is processed, actually plotting it is quite straightforward using <code>folium.Choropleth()</code>. The arguments needed are <code>geo_data</code> (the json file with the country geometry) and <code>data</code> (the cleaned dataframe), with the data columns specified in <code>columns</code>. If no bins are specified for the colormap scale, the data is plotted on a linear scale with equal bin widths.

This, however, might not be the most informative way to plot the data. Because of a few very large values, the range of the colormap scale is so wide that almost all countries fall into the lowest bin and the shading on the map shows little difference between them. So there might be better ways to visualize the data.
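
The effect of a few very large values on equal-width bins can be checked numerically. The sketch below uses invented case counts and six bins chosen purely for illustration. It is not taken from the Covid-19 data.

```python
import numpy as np

# Made-up, strongly skewed case counts: many small values and one very large one
cases = np.array([120, 450, 900, 2_300, 5_000, 8_000, 15_000, 4_000_000])

# Six equal-width bins between the minimum and the maximum
edges = np.linspace(cases.min(), cases.max(), 7)
counts, _ = np.histogram(cases, bins=edges)

print(counts)  # [7 0 0 0 0 1]
```

Seven of the eight values share the first bin and would therefore get the same colour on the map.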

In [None]:
# Define function to enable using Date Picker widget
def mapping(date):
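    # Change format of date picked by widget to match the column names of the dataframe (e.g. '7/22/20')
    # Built manually because the "%-m/%-d" strftime codes are not supported on every platform
    date1 = f"{date.month}/{date.day}/{date:%y}"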
    # Create basemap layer
    m = folium.Map()
    # Create choropleth layer
    folium.Choropleth(geo_data=country_shapes, # Geometry for countries
                      name='choropleth COVID-19', # Name of layer
                      data=total_cases, columns=['Country/Region', date1], # Covid-19 data per country
                      key_on='feature.properties.name', # Matching Covid-19 data to the geometry by the country name
                      legend_name='Total confirmed cases of Covid-19 until '+ str(date1), # Colorbar label
                      highlight=True, # Highlighting country polygons when hovering over it
                      fill_color='YlOrRd', # Colormap
                      nan_fill_color='white', # Color for NaN values
                     ).add_to(m) # Add choropleth layer to basemap
    # Add control panel to choose which layer to show on the map
    folium.LayerControl().add_to(m) 
    # Show map
    display(m) 
    return date1

# Create Date Picker widget
date = widgets.DatePicker(description='Pick a Date', value = datetime.date(2020, 7, 22), disabled=False)

# Make widget interactive by connecting it to the function 
interactive(mapping, date=date)

<a id='log'></a>
## Logarithmic colormap
If you have a look at the data, you can see that a few countries, usually ones with large populations, have huge numbers, but most are relatively small in comparison. One way around this is to plot the data on a logarithmic scale. This makes the choropleth map more representative, as you can see the differences in the number of cases between countries with smaller populations. This, however, means the ticks on the colorbar go from 1 to 15, which isn't very informative about the absolute values for the countries. So the next step is to add hover tools which show the absolute number of cases for each country when you hover over it.
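
Because <code>np.log</code> is the natural logarithm, a tick value on the colorbar can be turned back into an approximate case count with the exponential function. The tick values below are illustrative only.

```python
import math

# Natural-log tick values as they might appear on the colorbar (illustrative)
log_ticks = [1, 5, 10, 15]

# Undo the logarithm to recover approximate absolute case counts
absolute = [round(math.exp(t)) for t in log_ticks]

print(absolute)  # [3, 148, 22026, 3269017]
```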

*Note: An interesting exercise would be to create a colormap where outlying values are coloured differently so that the colormap range is more representative of the number of cases most countries have. Another interesting way to visualize the data would be to obtain data on the population and visualize the percentage of the number of confirmed Covid-19 cases with respect to the country's population. A third option would be to process the data further and obtain the number of new daily confirmed Covid-19 cases instead of plotting the cumulative value. All of these options should lessen the effect of the outliers and influence of the country's population.*
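
As a starting point for two of the exercises in the note above, the sketch below derives daily new cases and cases per 100,000 inhabitants from cumulative counts. All countries, counts and population figures are invented, and the wide layout (one column per date) is assumed to match the cleaned dataframe.

```python
import pandas as pd

# Made-up cumulative counts, one row per country and one column per date
cumulative = pd.DataFrame(
    {"7/20/20": [100, 10], "7/21/20": [130, 10], "7/22/20": [190, 14]},
    index=["Examplia", "Samplestan"],
)

# Daily new cases: difference between neighbouring date columns
# (the first date has no predecessor, so it becomes NaN)
daily_new = cumulative.diff(axis=1)

# Cases per 100,000 inhabitants, using invented population figures
population = pd.Series({"Examplia": 1_000_000, "Samplestan": 50_000})
per_100k = cumulative.div(population, axis=0) * 100_000

print(daily_new)
print(per_100k)
```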

In [None]:
# Define function to enable using Date Picker widget
def mapping(date):
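    # Change format of date picked by widget to match the column names of the dataframe (e.g. '7/22/20')
    # Built manually because the "%-m/%-d" strftime codes are not supported on every platform
    date1 = f"{date.month}/{date.day}/{date:%y}"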
    # Create basemap layer
    m = folium.Map()
    # Change absolute values into a logarithmic scale
    total_cases['day'] = np.log(total_cases[date1])
    # Create choropleth layer
    folium.Choropleth(geo_data=country_shapes, # Geometry for countries
                      name='choropleth COVID-19', # Name of layer
                      data=total_cases, columns=['Country/Region', 'day'], # Covid-19 data per country
                      key_on='feature.properties.name', # Matching Covid-19 data to the geometry by the country name
                      legend_name='Log of total confirmed cases of Covid-19 until '+ str(date1), # Colorbar label
                      highlight=True, # Highlighting country polygons when hovering over it
                      fill_color='YlOrRd', # Colormap
                      nan_fill_color='white', # Color for NaN values
                     ).add_to(m)# Add choropleth layer to basemap
    # Add control panel to choose which layer to show on the map
    folium.LayerControl().add_to(m)
    # Show map
    display(m)
    return date1

# Create Date Picker widget
date = widgets.DatePicker(description='Pick a Date', value = datetime.date(2020, 7, 22), disabled=False)

# Make widget interactive by connecting it to the function 
interactive(mapping, date=date)

<a id='hov'></a>
## Logarithmic colormap with hover tools
As you have seen above, the color scheme looks reasonable, but obtaining the absolute values requires some calculation as the colorbar is given on a logarithmic scale. To avoid unnecessary calculations and make the plot easy to interpret, a hover tool can be added to display <code>key : value</code> pairs from the geometry json file. Therefore, the data containing the number of confirmed Covid-19 cases has to be added to the geometry json file, which also requires converting the csv data to json.

This way, the colors are representative and the absolute values are also easily accessed.
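
The idea of attaching the values to the geometry can be sketched on a tiny, made-up GeoJSON-like structure. The placeholder text <code>'no data'</code> is an assumption made here so that every feature carries the property. The notebook's own code only updates countries that appear in the Covid-19 data.

```python
# Toy stand-ins: a minimal GeoJSON-like structure and one day's 'country: value' pairs
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "Examplia"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Samplestan"}, "geometry": None},
    ],
}
cases_on_day = {"Examplia": 190.0}

# Look up each country's value by name and store it next to the geometry
for feature in geo["features"]:
    props = feature["properties"]
    props["number of confirmed cases"] = cases_on_day.get(props["name"], "no data")

print([f["properties"] for f in geo["features"]])
```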

In [None]:
# Define function to enable using Date Picker widget
def mapping(date):
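    # Change format of date picked by widget to match the column names of the dataframe (e.g. '7/22/20')
    # Built manually because the "%-m/%-d" strftime codes are not supported on every platform
    date1 = f"{date.month}/{date.day}/{date:%y}"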
    
    # Add Covid-19 data of chosen date to the right dictionary in the json file
    for feature in data['features']: # For each country in the geometry json file
        name = feature['properties']['name'] # Country name in the geometry json file
        if name in parsed[date1]: # If the country also appears in the covid json data
            feature['properties']['number of confirmed cases'] = parsed[date1][name] # Add new key: value pair to dictionary with covid-19 data
                
    # Create basemap layer
    m = folium.Map()
    # Change absolute values into a logarithmic scale
    total_cases['day'] = np.log(total_cases[date1])
    # Create choropleth layer
    choropleth = folium.Choropleth(geo_data=data, # Geometry for countries
                                   name='choropleth COVID-19', # Name of layer
                                   data=total_cases, columns=['Country/Region', 'day'], # Covid-19 data per country
                                   key_on='feature.properties.name', # Matching Covid-19 data to the geometry by the country name
                                   legend_name='Log of total confirmed cases of Covid-19 until '+ str(date1), # Colorbar label
                                   highlight=True, # Highlighting country polygons when hovering over it
                                   fill_color='YlOrRd', # Colormap
                                   nan_fill_color='white', # Color for NaN values
                                  ).add_to(m) # Add choropleth layer to basemap
    # Add hover tool showing name of the country and number of confirmed Covid-19 cases until the chosen date
    choropleth.geojson.add_child(folium.features.GeoJsonTooltip(['name', 'number of confirmed cases']))
    # Add control panel to choose which layer to show on the map
    folium.LayerControl().add_to(m)
    # Show map
    display(m)
    return date1

# Create Date Picker widget
date = widgets.DatePicker(description='Pick a Date', value = datetime.date(2020, 7, 22), disabled=False)

# Make widget interactive by connecting it to the function 
interactive(mapping, date=date)