![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Finteresting-problems&branch=main&subPath=notebooks/covid-cases-per-capita.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# COVID-19 Cases Per Capita

While there are a number of well-designed dashboards and visualization tools for COVID-19 data, such as [Bing](https://bing.com/covid) and [The World Bank](http://datatopics.worldbank.org/universal-health-coverage/coronavirus/), we are going to try building something ourselves in a Jupyter notebook.

This Jupyter notebook uses [COVID-19 statistics from Johns Hopkins University CSSE](https://github.com/CSSEGISandData/COVID-19); you can also see [their dashboard](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6). Population statistics are from [Gapminder](http://gapm.io/dpop).

### Licence and Disclaimer

COVID-19 data sets are copyright 2020 [Johns Hopkins University](https://systems.jhu.edu) (available for educational and academic research purposes). The population data is free to use from [Gapminder](https://www.gapminder.org) under a [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/). This notebook also carries a [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/).

This notebook should not be considered medical or policy-making advice. Always follow the directives and orders of your public health authority.

## Getting Started

First, `▶Run` the next cell to import a data set. Once the data set has been downloaded and imported into a [DataFrame](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm), it will be displayed.

You can change the date, but make sure you use the format `'MM-DD-YYYY'` as they do in the CSSE data set. Files are updated once a day around midnight [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time).

In [None]:
date = '01-01-2021'

import pandas as pd
csv_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
covid_stats = pd.read_csv(csv_url)
pop_csv_url = 'https://docs.google.com/spreadsheets/d/18Ep3s1S0cvlT1ovQG9KdipLEoQ1Ktz5LtTTQpDcWbX0/export?gid=1668956939&format=csv'
pop_df = pd.read_csv(pop_csv_url)
current_population = pop_df[pop_df['time']==2020]
print('Data successfully imported')
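
If you would rather build the date string than type it by hand, the following is a minimal sketch using Python's standard `datetime` module. The helper name `daily_report_url` is hypothetical and not part of this notebook; the URL prefix is the one used in the cell above. The `date` class is imported under the alias `Date` so that it does not overwrite the notebook's `date` variable.

```python
from datetime import date as Date

BASE_URL = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/')

def daily_report_url(day):
    """Return the CSSE daily report URL for a datetime.date (hypothetical helper)."""
    # '%m-%d-%Y' zero-pads the month and day, which gives the 'MM-DD-YYYY' format
    return BASE_URL + day.strftime('%m-%d-%Y') + '.csv'

example_url = daily_report_url(Date(2021, 1, 1))
print(example_url)
```

The resulting string can be passed to `pd.read_csv` in the same way as `csv_url`.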

### Cleaning the Data

Since these two data sets don't use the same names for all of the countries, we'll replace some names in the `covid_stats` data set before we combine them. Some of these names have political implications, but we will use the names found in the Gapminder data set.

In [None]:
countries_to_rename = {'US':'United States','Korea, South':'South Korea','Burma':'Myanmar','Laos':'Lao','Cabo Verde':'Cape Verde',
                       'Congo (Kinshasa)':'Congo, Dem. Rep.','Congo (Brazzaville)':'Congo, Rep.','Eswatini':'Swaziland','West Bank and Gaza':'Palestine',
                       'Czechia':'Czech Republic','Kyrgyzstan':'Kyrgyz Republic','North Macedonia':'Macedonia, FYR','Slovakia':'Slovak Republic',
                       'Saint Kitts and Nevis':'St. Kitts and Nevis','Saint Lucia':'St. Lucia','Saint Vincent and the Grenadines':'St. Vincent and the Grenadines'}
for key in countries_to_rename:
    covid_stats.replace(key,countries_to_rename[key],inplace=True)
print('Data successfully cleaned')
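
To see what a rename like this does, here is a small sketch on made-up data. The names `toy_stats` and `toy_renames` and the case counts are illustrative assumptions, not values from the CSSE data set. It uses `Series.replace` with a dictionary, which is one possible alternative to replacing one key at a time.

```python
import pandas as pd

# Toy stand-in for the covid_stats DataFrame (values are made up for illustration)
toy_stats = pd.DataFrame({'Country_Region': ['US', 'Burma', 'Canada'],
                          'Confirmed': [10, 20, 30]})
toy_renames = {'US': 'United States', 'Burma': 'Myanmar'}

# Replace only in the country column; names not in the dictionary are left alone
toy_stats['Country_Region'] = toy_stats['Country_Region'].replace(toy_renames)
print(toy_stats)
```

Limiting the replacement to one column avoids accidentally changing matching values elsewhere in the DataFrame.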

### Combining the Data

Run the next cell to create a [DataFrame](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm) from the two sets of downloaded data.

In [None]:
# If you prefer specific countries, put a # in front of the next line and remove the six ' marks around the next list
country_list = current_population['name'].unique()
'''
country_list = ['Italy', 'Spain', 'Germany', 'France', 
                'Israel', 'United States', 'United Kingdom',
                'South Korea', 'Singapore', 'Australia',
                'Canada', 'China', 'Argentina', 'Russia', 'India']
'''

# Collect one dictionary per country, then build the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []

for country in country_list:
    confirmed = covid_stats[covid_stats['Country_Region']==country]['Confirmed'].sum()
    population = current_population[current_population['name']==country]['population'].values[0]
    percent = confirmed*100.0/population
    rows.append({'Country':country,'Population':population,'Confirmed':confirmed,'Percent':percent})

df = pd.DataFrame(rows, columns=['Country', 'Population', 'Confirmed', 'Percent'])

df.sort_values('Confirmed',ascending=False).head()
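
The same per-country table can also be built without a loop. The sketch below is one possible approach on made-up data, using `groupby` to add up the rows for each country and `merge` to join the totals to the population table. The names `toy_cases`, `toy_pop`, `totals`, and `merged` and all of the numbers are illustrative assumptions; only the column names match the data sets used above.

```python
import pandas as pd

# Made-up case counts: Canada appears twice, as it would with one row per province
toy_cases = pd.DataFrame({'Country_Region': ['Canada', 'Canada', 'Italy'],
                          'Confirmed': [100, 50, 300]})
# Made-up populations
toy_pop = pd.DataFrame({'name': ['Canada', 'Italy'],
                        'population': [1000, 6000]})

# Add up the rows for each country, then join on the country name
totals = toy_cases.groupby('Country_Region', as_index=False)['Confirmed'].sum()
merged = toy_pop.merge(totals, left_on='name', right_on='Country_Region', how='left')

# Countries with no matching case rows get 0 instead of a missing value
merged['Confirmed'] = merged['Confirmed'].fillna(0)
merged['Percent'] = merged['Confirmed'] * 100.0 / merged['population']
print(merged[['name', 'population', 'Confirmed', 'Percent']])
```

A left merge keeps every country in the population table, which mirrors how the loop above reports zero cases for countries that do not appear in the case data.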

### Adding World Data

In [None]:
world_population = current_population['population'].sum()
world_confirmed_cases = covid_stats['Confirmed'].sum()
world_percent = world_confirmed_cases*100.0/world_population
world_values = {'Country':'World','Population':world_population,'Confirmed':world_confirmed_cases,'Percent':world_percent}
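# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the row
df = pd.concat([df, pd.DataFrame([world_values])], ignore_index=True)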
df.sort_values('Confirmed',ascending=False).head()

### Adding More Data

You can also edit and then run the next cell to add other data to the DataFrame. Remove the `'''` marks to enable the code to run.

Each time you run it, this code will add a row.

In [None]:
'''
place = 'Edmonton'
population = 1461182
confirmed = 111
percent = confirmed*100.0/population
new_row = {'Country':place,'Population':population,'Confirmed':confirmed,'Percent':percent}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
df.tail()
'''
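
As a worked check of the percent calculation, the sketch below applies the same formula to the example numbers in the cell above. The function names are hypothetical. The cases per 100,000 people figure is an addition here, not something the notebook computes, but it is a common way to express small per capita rates.

```python
def percent_of_population(confirmed, population):
    """Confirmed cases as a percentage of the population."""
    return confirmed * 100.0 / population

def cases_per_100k(confirmed, population):
    """Confirmed cases per 100,000 people (an alternative per capita scale)."""
    return confirmed * 100000.0 / population

# Example values taken from the cell above
edmonton_percent = percent_of_population(111, 1461182)
edmonton_per_100k = cases_per_100k(111, 1461182)
print(round(edmonton_percent, 4), round(edmonton_per_100k, 1))
```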

### Sorting Data

You can also sort by percent.

In [None]:
df.sort_values('Percent',ascending=False).head(20)

### Specific Countries

To see a DataFrame of specific countries, edit and run the next cell.

In [None]:
#df[df['Country']=='Canada']
list_of_countries = ['Canada', 'China', 'Italy']
df[df['Country'].isin(list_of_countries)]

## Names of Countries

As mentioned, these two data sets don't always use the same country/region names. To see the country/region names in each original data set, run the following cell.

In [None]:
csv_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
covid_stats = pd.read_csv(csv_url)

pop_csv_url = 'https://docs.google.com/spreadsheets/d/18Ep3s1S0cvlT1ovQG9KdipLEoQ1Ktz5LtTTQpDcWbX0/export?gid=1668956939&format=csv'
pop_df = pd.read_csv(pop_csv_url)
current_population = pop_df[pop_df['time']==2020]

from_covid_stats = covid_stats['Country_Region'].unique()
from_gapminder = current_population['name'].unique()
print('From covid_stats:')
print(from_covid_stats)
print('')
print('From Gapminder:')
print(from_gapminder)
print('')
print('Not found in Gapminder:')
for c in from_covid_stats:
    if c not in from_gapminder:
        print(c)
print('')
print('Not found in covid_stats:')
for x in from_gapminder:
    if x not in from_covid_stats:
        print(x)
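
The two loops above amount to a set difference in each direction. Here is a minimal sketch with made-up name lists, assuming a rename dictionary like `countries_to_rename`. The variable names are hypothetical. It shows which names would still be unmatched after renaming.

```python
# Made-up name lists for illustration
toy_covid_names = ['US', 'Canada', 'Burma', 'Diamond Princess']
toy_gapminder_names = ['United States', 'Canada', 'Myanmar']
toy_renames = {'US': 'United States', 'Burma': 'Myanmar'}

# Apply the renames; names without an entry in the dictionary are kept as they are
renamed = {toy_renames.get(name, name) for name in toy_covid_names}

# Names that still have no match in the Gapminder list
still_missing = sorted(renamed - set(toy_gapminder_names))
print(still_missing)
```

Any name that shows up in `still_missing` either needs another entry in the rename dictionary or has no population figure in the second data set.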

## Graphing Data

### Confirmed Cases Bar Graph

In [None]:
import plotly.express as px
plot_data = df[df['Country']!='World'].sort_values('Confirmed',ascending=False).head(20)
px.bar(plot_data, x='Country', y='Confirmed',title='COVID Cases')

In [None]:
plot_data

### Time Series

We'll download some time series data, then create charts by setting the index, dropping some columns, and `T`ransposing rows and columns.

In [None]:
countries = ['Canada', 'Mexico']

time_series_confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
tsc = pd.read_csv(time_series_confirmed_url)

time_series_recovered_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
tsr = pd.read_csv(time_series_recovered_url)

time_series_deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
tsd = pd.read_csv(time_series_deaths_url)

def by_country(ts):
    # Some countries (e.g. Canada) have one row per province, so add up the rows for each country
    return ts.set_index('Country/Region').drop(columns=['Province/State','Lat','Long']).groupby(level=0).sum().T

tsc2 = by_country(tsc)
px.line(tsc2, y=countries, title='Confirmed Cases').show()
tsr2 = by_country(tsr)
px.line(tsr2, y=countries, title='Recovered Cases').show()
tsd2 = by_country(tsd)
px.line(tsd2, y=countries, title='Deaths').show()
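
The reshaping described above can be tried on a small made-up table. The `toy_ts` values below are illustrative assumptions; only the column layout follows the CSSE time series files. One thing to watch for is that some countries appear once per province, so transposing alone would give several columns with the same name. This sketch sums them per country, which is one possible way to handle that.

```python
import pandas as pd

# Made-up table in the same layout as the CSSE time series files
toy_ts = pd.DataFrame({
    'Province/State': ['Alberta', 'Ontario', None],
    'Country/Region': ['Canada', 'Canada', 'Mexico'],
    'Lat': [53.9, 51.3, 23.6],
    'Long': [-116.6, -85.3, -102.6],
    '1/22/20': [0, 1, 0],
    '1/23/20': [2, 3, 4],
})

by_country_toy = (toy_ts.set_index('Country/Region')                  # countries become the index
                  .drop(columns=['Province/State', 'Lat', 'Long'])    # keep only the date columns
                  .groupby(level=0).sum()                             # one row per country
                  .T)                                                 # dates as rows, countries as columns
print(by_country_toy)
```

After the transpose each country is a single column, so a list such as `countries` can be passed as `y` to a line chart.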

**Hopefully you have found this an interesting introduction to data science using online COVID-19 data.**

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)