![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)
 
<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&urlpath=notebooks/curriculum-notebooks/Science/CovidSunburst/covid-19-cases-sunburst.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

## Sunburst Visualization of COVID-19 Cases Data

In this notebook we will visualize the number of COVID-19 cases around the world, grouped by continent and country, using a [sunburst chart](https://plotly.com/python/sunburst-charts/). Along the way we will also talk about [data cleaning](https://en.wikipedia.org/wiki/Data_cleansing).

Let's get [data](https://github.com/CSSEGISandData/COVID-19) provided by [Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)](https://systems.jhu.edu/).

Click on the code cell below, then click the `▶Run` button to download and preview the data. You can also change the date, in `mm-dd-yyyy` format, to another date that has data [available](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#daily-reports-csse_covid_19_daily_reports).

In [None]:
date = '08-01-2022'

import pandas as pd
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
df = pd.read_csv(url)
df
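
If you change the date, it has to match the `mm-dd-yyyy` pattern used in the file names. As a minimal sketch, the hypothetical helper `daily_report_url` below (it is not part of the original notebook) uses Python's `datetime` module to check the format before building the URL. It only checks the format, not whether a report actually exists for that date.

```python
from datetime import datetime

BASE_URL = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/')

def daily_report_url(date):
    """Hypothetical helper: validate an mm-dd-yyyy date and build the report URL."""
    # strptime raises ValueError if the string doesn't match mm-dd-yyyy
    parsed = datetime.strptime(date, '%m-%d-%Y')
    # strftime writes the date back with a zero-padded month and day
    return BASE_URL + parsed.strftime('%m-%d-%Y') + '.csv'

print(daily_report_url('08-01-2022'))
```

The returned string could be passed to `pd.read_csv` in place of `url`.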

## Data Cleaning

In the dataframe we see a lot of missing values ([NaN, or "Not a Number"](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)). We'll need to deal with some of those before we can create a visualization.

The first step will be to see which rows don't have a `Lat` ([latitude](https://en.wikipedia.org/wiki/Latitude)) value.

In [None]:
df[df['Lat'].isna()]['Country_Region'].unique()
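
To see what `isna()` is doing in that cell, here is a small sketch using a made-up dataframe called `toy`. Its rows and numbers are invented for illustration and are not taken from the JHU CSSE data. `isna()` gives `True` for each missing value, and that column of `True`/`False` values is then used to pick out rows.

```python
import pandas as pd

# Made-up rows for illustration only, not values from the JHU CSSE data
toy = pd.DataFrame({
    'Country_Region': ['Canada', 'Diamond Princess', 'Canada', 'France'],
    'Lat': [53.9, None, None, 46.2],
})

missing_lat = toy['Lat'].isna()  # True where Lat is missing (NaN)
countries_missing = toy.loc[missing_lat, 'Country_Region'].unique()

print(missing_lat.tolist())
print(list(countries_missing))
```

Calling `unique()` at the end means each country name is listed only once, even if several of its rows are missing a latitude.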

A couple of those are cruise ships, [Diamond Princess](https://en.wikipedia.org/wiki/Diamond_Princess_(ship)) and [MS Zaandam](https://en.wikipedia.org/wiki/MS_Zaandam), so we'll remove those rows.

In [None]:
df.drop(df[(df['Country_Region'] == 'Diamond Princess') | (df['Country_Region'] == 'MS Zaandam')].index, inplace=True)
df[df['Lat'].isna()]['Country_Region'].unique()
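
Another way to remove rows by name, sketched here on a made-up dataframe called `toy` (invented values, not the real data), is to list the names once and use `isin()`. The `~` flips `True` and `False`, so we keep only the rows that are *not* in the list. This is an alternative to the `drop` call above, not what the notebook itself uses.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Country_Region': ['Canada', 'Diamond Princess', 'MS Zaandam', 'France'],
    'Confirmed': [10, 20, 30, 40],
})

ships = ['Diamond Princess', 'MS Zaandam']
# isin() is True for rows whose Country_Region is in the list; ~ inverts it
kept = toy[~toy['Country_Region'].isin(ships)]

print(kept['Country_Region'].tolist())
```

With a longer list of names to remove, this avoids chaining many `==` comparisons together with `|`.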

The remaining rows that are missing latitude values all belong to countries. Let's take a look at `Canada`.

In [None]:
df[df['Country_Region'] == 'Canada']

Some of the "provinces" listed for Canada are actually cruise ships or groups of travellers, so let's drop those rows and create a [pie chart](https://plotly.com/python/pie-charts/) of just the Canadian data.

In [None]:
canada = df[df['Country_Region'] == 'Canada'].copy()
canada.drop(canada[canada['Province_State'].str.contains('Princess')].index, inplace=True)
canada.drop(canada[canada['Province_State'].str.contains('Travellers')].index, inplace=True)

import plotly.express as px
px.pie(canada, values='Confirmed', names='Province_State', title='Confirmed Cases in Canada')

We don't have state or provincial data for every country, but let's create a sunburst chart for the countries in our data set that do include a `Province_State` value.

In [None]:
px.sunburst(df[df['Province_State'].notna()], path=['Country_Region','Province_State'], values='Confirmed')

The sunburst chart we generated is interactive. Try clicking on different countries to expand and contract the chart.

For countries without a `Province_State` value, let's just copy the `Country_Region` value so we can add them to our chart.

In [None]:
df['Province_State'] = df['Province_State'].fillna(df['Country_Region'])  # assign the filled column back to the dataframe
px.sunburst(df, path=['Country_Region','Province_State'], values='Confirmed')
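
Here is a minimal sketch of how filling one column from another works, again using a made-up dataframe called `toy` (invented rows, not the real data). `fillna` only touches the missing entries, and it takes the replacement from the same row of the other column.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Country_Region': ['Canada', 'Iceland'],
    'Province_State': ['Alberta', None],
})

# Rows that already have a Province_State are left unchanged
toy['Province_State'] = toy['Province_State'].fillna(toy['Country_Region'])

print(toy['Province_State'].tolist())
```

After this step every row has a `Province_State` value, so no country is left out of the chart.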

To add one more level to our sunburst chart, let's group the data by continent using the [pycountry-convert](https://pypi.org/project/pycountry-convert/) code library.

The `country_to_continent` function will `try` to find the continent for a given country, but if there are any issues it will just output `Error`.

In [None]:
import pycountry_convert as pc
def country_to_continent(country):
    try:
        # country name -> two-letter country code -> continent code -> continent name
        country_code = pc.country_name_to_country_alpha2(country, cn_name_format='default')
        continent_code = pc.country_alpha2_to_continent_code(country_code)
        continent_name = pc.convert_continent_code_to_continent_name(continent_code)
    except Exception:
        # the library didn't recognize the name or code, so flag it to fix later
        continent_name = 'Error'
    return continent_name

df['Continent'] = [country_to_continent(country) for country in df['Country_Region']]
px.sunburst(df, path=['Continent','Country_Region','Province_State'], values='Confirmed')
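
The `try` and `except` pattern can be easier to follow without the library. In this sketch a small made-up dictionary, `TOY_CONTINENTS`, stands in for pycountry-convert, and `toy_country_to_continent` is a hypothetical function written only for this example. Looking up a name that isn't in the dictionary raises a `KeyError`, which we catch and turn into `Error`.

```python
# Made-up lookup table standing in for the pycountry-convert library
TOY_CONTINENTS = {'Canada': 'North America', 'Japan': 'Asia'}

def toy_country_to_continent(country):
    try:
        # A dictionary lookup raises KeyError for names it doesn't contain
        continent_name = TOY_CONTINENTS[country]
    except KeyError:
        continent_name = 'Error'
    return continent_name

countries = ['Canada', 'US', 'Japan']
results = [toy_country_to_continent(c) for c in countries]
print(results)
```

Returning a placeholder instead of stopping with an error lets the rest of the rows be processed, and the placeholder makes the problem rows easy to find afterwards.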

Of course we don't actually want a continent named `Error`, so let's fix those values.

The code library didn't recognize `US` (it would have worked with `United States`), but we know that `US` is in North America.

In [None]:
df.loc[df['Country_Region']=='US','Continent'] = 'North America'
df['Country_Region'] = df['Country_Region'].replace('US','United States')  # Replace US with United States
px.sunburst(df, path=['Continent','Country_Region','Province_State'], values='Confirmed')

Let's check out the rest of the continent errors.

In [None]:
df[df['Continent']=='Error']['Country_Region']

We can drop the `Summer Olympics 2020` and `Winter Olympics 2022` rows, since they are events rather than countries.

In [None]:
df.drop(df[df['Country_Region'].str.contains('Olympics')].index, inplace=True)
df[df['Continent']=='Error']['Country_Region']

Now let's set the continents for the rest of these.

In [None]:
continents = {
    'Antarctica':'Antarctica',
    'Burma':'Asia',
    'Congo':'Africa',
    "Cote d'Ivoire":'Africa',
    'Holy See':'Europe',
    'Korea, North':'Asia',
    'Korea, South':'Asia',
    'Kosovo':'Europe',
    'Taiwan*':'Asia',
    'Timor-Leste':'Asia',
    'West Bank and Gaza':'Asia',
    }
for cr, continent in continents.items():
    df.loc[df['Country_Region']==cr, 'Continent'] = continent
px.sunburst(df, path=['Continent','Country_Region','Province_State'], values='Confirmed')
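
As an alternative to the loop, here is a sketch that applies a dictionary of fixes in one step with `map()`. It uses a made-up dataframe called `toy` and a shortened dictionary called `fixes`, both invented for illustration. `map()` leaves a missing value wherever a country isn't in the dictionary, so `fillna()` puts the existing continent back for those rows.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Country_Region': ['Burma', 'Kosovo', 'France'],
    'Continent': ['Error', 'Error', 'Europe'],
})

fixes = {'Burma': 'Asia', 'Kosovo': 'Europe'}

# Countries not in the dictionary become NaN, then get their old value back
toy['Continent'] = toy['Country_Region'].map(fixes).fillna(toy['Continent'])

print(toy['Continent'].tolist())
```

Both approaches give the same result. The loop may be easier to read, while `map()` can be more convenient when the dictionary is long.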

Some of those slices are very small, so let's increase the size of the chart to see more detail. We should also add a title to our chart.

In [None]:
size = 1000
title = 'Confirmed COVID-19 Cases as of '+date
px.sunburst(df, path=['Continent','Country_Region','Province_State'], values='Confirmed', title=title, width=size, height=size)

## Conclusion

In this notebook we cleaned and visualized [COVID-19 data](https://github.com/CSSEGISandData/COVID-19) provided by [Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)](https://systems.jhu.edu/).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)