![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fpresentations&branch=master&subPath=data-science-with-covid-instructor.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Introduction to Data Science with COVID-19 Data

While there are a number of well-designed dashboards and visualization tools for COVID-19 data, such as [Bing](https://bing.com/covid) and [The World Bank](http://datatopics.worldbank.org/universal-health-coverage/coronavirus/), we are going to try building something ourselves in a Jupyter notebook.

This Jupyter notebook uses [COVID-19 statistics from Johns Hopkins University CSSE](https://github.com/CSSEGISandData/COVID-19). You can also explore [their dashboard](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6).

### Licence and Disclaimer

COVID-19 data sets are copyright 2020 [Johns Hopkins University](https://systems.jhu.edu) (available for educational and academic research purposes). The population data is free to use from [Gapminder](https://www.gapminder.org) under a [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/). This notebook also carries a [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/).

This notebook should not be considered medical or policy-making advice. Always follow the directives and orders of your public health authority.

## Getting Started

First, `▶Run` the next cell to import a data set. Once the data set has been downloaded and imported into a [DataFrame](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm), it will be displayed.

You can change the date, but make sure you use the format `'MM-DD-YYYY'` as they do in the CSSE data set. Files are updated once a day around midnight [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time).
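
If you would rather build the date string from a calendar date than type it by hand, a minimal sketch could look like the following. The helper name `csse_date_string` is hypothetical (it is not part of the CSSE data set or of pandas); it only applies the `'MM-DD-YYYY'` pattern described above.

```python
from datetime import date, timedelta

def csse_date_string(day):
    """Format a date as MM-DD-YYYY, the pattern described above (hypothetical helper)."""
    return day.strftime('%m-%d-%Y')

# Example: the report for 7 April 2020, and the one for the day before
report_day = date(2020, 4, 7)
print(csse_date_string(report_day))
print(csse_date_string(report_day - timedelta(days=1)))
```

The result can be assigned to `date` in the next cell in place of the hand-typed string.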

In [None]:
date = '04-07-2020'

import pandas as pd

csv_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
covid_stats = pd.read_csv(csv_url)
covid_stats

## Data Cleaning

`Run` the next cell to clean up the data. We'll add up values for each country and create a new dataframe.

In [None]:
# If you prefer specific countries, put a # in front of the next line and remove the six ' marks around the next list
country_list = covid_stats['Country_Region'].unique()
'''
country_list = ['Italy', 'Spain', 'Germany', 'France', 
                'Israel', 'US', 'United Kingdom',
                'Singapore', 'Australia', 'Canada',
                'China', 'Argentina', 'Russia', 'India']
'''

# Build one summary row per country, then create the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0, so we collect the rows in a list instead)
rows = []
for country in country_list:
    country_stats = covid_stats[covid_stats['Country_Region']==country]
    rows.append({'Country':country,
                 'Confirmed':country_stats['Confirmed'].sum(),
                 'Recovered':country_stats['Recovered'].sum(),
                 'Deaths':country_stats['Deaths'].sum()})
df = pd.DataFrame(rows, columns=['Country', 'Confirmed', 'Recovered', 'Deaths'])

df.sort_values('Confirmed',ascending=False)
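
The loop above adds up the rows one country at a time. As an alternative sketch, pandas `groupby` can compute the same kind of per-country totals in one step. The example uses a small made-up table (`toy_stats`) rather than the real download, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for covid_stats: two rows for Canada (one per province), one for Italy
toy_stats = pd.DataFrame({
    'Country_Region': ['Canada', 'Canada', 'Italy'],
    'Confirmed': [100, 50, 300],
    'Recovered': [10, 5, 30],
    'Deaths': [1, 2, 3],
})

# Group the rows by country and add up the numeric columns
toy_totals = (toy_stats
              .groupby('Country_Region', as_index=False)[['Confirmed', 'Recovered', 'Deaths']]
              .sum()
              .rename(columns={'Country_Region': 'Country'}))
print(toy_totals)
```

Unlike the loop, this version always covers every country in the table, so it suits the "all countries" case better than a hand-picked list.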

## Graphing Data

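We will use the `Plotly Express` library to create a bar graph of our data set. In the first example, `sort_values('Confirmed')` sorts in ascending order, so `.head(20)` selects the 20 countries with the fewest confirmed cases.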

```python
import plotly.express as px
px.bar(df.sort_values('Confirmed').head(20), x='Country', y='Confirmed', title='COVID Cases')
```

To graph the 20 countries with the most confirmed cases instead, add `ascending=False`:

```python
import plotly.express as px
px.bar(df.sort_values('Confirmed',ascending=False).head(20), x='Country', y='Confirmed', title='COVID Cases')
```

In [None]:
import plotly.express as px
px.bar(df.sort_values('Confirmed',ascending=False).head(20), x='Country', y='Confirmed', title='COVID Cases')

### Renaming Countries

The country/state naming for this data set mostly follows the [WHO list of member states](https://www.who.int/choice/demography/by_country/en/), but we can `Run` the next cell to rename some of them.

In [None]:
df.replace('US','United States',inplace=True)  # exact match only; regex=True would also change any value that merely contains 'US'

countries_to_rename = {'Korea, South':'South Korea','Burma':'Myanmar','Laos':'Lao','Cabo Verde':'Cape Verde',
                       'Congo (Kinshasa)':'Congo, Dem. Rep.','Congo (Brazzaville)':'Congo, Rep.','Eswatini':'Swaziland','West Bank and Gaza':'Palestine',
                       'Czechia':'Czech Republic','Kyrgyzstan':'Kyrgyz Republic','North Macedonia':'Macedonia, FYR','Slovakia':'Slovak Republic',
                       'Saint Kitts and Nevis':'St. Kitts and Nevis','Saint Lucia':'St. Lucia','Saint Vincent and the Grenadines':'St. Vincent and the Grenadines'}
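for key in countries_to_rename:
    covid_stats.replace(key,countries_to_rename[key],inplace=True)
    df.replace(key,countries_to_rename[key],inplace=True)  # keep the per-country summary in step with the raw data
print('Countries renamed in covid_stats and df')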
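
pandas `replace` can match whole cell values or, with `regex=True`, patterns inside them. The sketch below shows the difference on a made-up Series; the value `'BRUSSELS'` is invented purely to show what a pattern match can do to a longer string.

```python
import pandas as pd

# 'BRUSSELS' is a made-up value that happens to contain the letters 'US'
toy_names = pd.Series(['US', 'Burma', 'BRUSSELS'])

# Exact match: only cells equal to a dictionary key are changed
exact = toy_names.replace({'US': 'United States', 'Burma': 'Myanmar'})

# Regex match: every occurrence of the pattern 'US' inside a cell is changed
pattern = toy_names.replace('US', 'United States', regex=True)

print(exact.tolist())
print(pattern.tolist())
```

For renaming countries, the exact-match form is usually the safer choice because it cannot alter names that only contain the search text.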

### Listing Countries

To see a list of the countries in your data set, use

```python
df['Country'].unique()
```

or

```python
covid_stats['Country_Region'].unique()
```

In [None]:
df['Country'].unique()

## Adding World Data

We can also add up all of the values in the data set to get worldwide totals.

In [None]:
confirmed = covid_stats['Confirmed'].sum()
recovered = covid_stats['Recovered'].sum()
deaths = covid_stats['Deaths'].sum()
world_values = {'Country':'World','Confirmed':confirmed,'Recovered':recovered,'Deaths':deaths}
df = pd.concat([df, pd.DataFrame(world_values, index=[0])], ignore_index=True)
df.tail()

## Sorting Data

`Run` the next cell to sort the data by a particular column. `ascending=False` puts the largest values first (the default, `ascending=True`, puts the smallest first), and `.head(16)` shows just the first 16 rows.

In [None]:
df.sort_values('Confirmed', ascending=False).head(16)
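
To see how the sort direction changes the result, here is a small sketch on a made-up table (`toy_df`, with placeholder country names). It also shows `nlargest`, a pandas shortcut that may be handy when you only want the top few rows.

```python
import pandas as pd

toy_df = pd.DataFrame({'Country': ['A', 'B', 'C', 'D'],
                       'Confirmed': [20, 5, 300, 80]})

smallest_first = toy_df.sort_values('Confirmed')                  # default: ascending=True
largest_first = toy_df.sort_values('Confirmed', ascending=False)  # largest values at the top
top_two = toy_df.nlargest(2, 'Confirmed')                         # the two largest rows

print(largest_first.head(2))
```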

## Selecting Specific Countries

To see a DataFrame of specific countries, edit and run the next cell.

In [None]:
#df[df['Country']=='Canada']
list_of_countries = ['Canada', 'China', 'Italy']
df[df['Country'].isin(list_of_countries)]

## Adding Population Data

We'll use population data from [Gapminder](https://gapminder.org).

In [None]:
pop_sheet_id = '18Ep3s1S0cvlT1ovQG9KdipLEoQ1Ktz5LtTTQpDcWbX0'
pop_gid = '1668956939'
population_csv_url = 'https://docs.google.com/spreadsheets/d/'+pop_sheet_id+'/export?gid='+pop_gid+'&format=csv'
population_data = pd.read_csv(population_csv_url)
population = population_data[population_data['time']==2019]
population

In [None]:
# Set the index as country name for both dataframes so we can join them together
cp = population.set_index('name')
cs = df.set_index('Country')
new_df = cs.join(cp)
new_df
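
A join only lines up rows whose names match exactly, so a country spelled differently in the two tables ends up with missing population data. One way to spot such mismatches, sketched here on made-up tables with rounded placeholder populations, is a left `merge` with `indicator=True`.

```python
import pandas as pd

# Toy tables: the same country is spelled differently in each one
toy_cases = pd.DataFrame({'Country': ['Canada', 'Korea, South'],
                          'Confirmed': [100, 200]})
toy_population = pd.DataFrame({'name': ['Canada', 'South Korea'],
                               'population': [37000000, 51000000]})  # placeholder values

# A left merge keeps every row of toy_cases; indicator=True adds a '_merge' column
checked = toy_cases.merge(toy_population, left_on='Country', right_on='name',
                          how='left', indicator=True)

# Rows marked 'left_only' found no partner in the population table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Any names listed this way are candidates for the renaming dictionary used earlier in the notebook.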

In [None]:
# Drop some columns we don't need, rename for consistency
new_df.drop(columns=['geo','time'],inplace=True)
new_df.rename(columns={'population':'Population'},inplace=True)
new_df

In [None]:
# Move the country names from the index back into a regular column
new_df.reset_index(inplace=True)
new_df

In [None]:
# Drop any "not available" data
new_df = new_df.dropna()
new_df

In [None]:
# Calculate values for a new column
new_df['Confirmed Percent'] = new_df['Confirmed']/new_df['Population']*100
new_df
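
Percentages of a whole population tend to be very small numbers, so case counts are often also expressed per 100,000 people. The sketch below works through both calculations on a made-up table; the column name `'Confirmed per 100k'` is an illustrative choice, not something defined elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({'Country': ['Big', 'Small'],
                    'Confirmed': [50000, 500],
                    'Population': [100000000, 100000]})

# Same formula as above: share of the population, as a percentage
toy['Confirmed Percent'] = toy['Confirmed'] / toy['Population'] * 100

# Same ratio scaled to "cases per 100,000 people"
toy['Confirmed per 100k'] = toy['Confirmed'] / toy['Population'] * 100000
print(toy)
```

Note that the country with far fewer cases in total has the higher rate once population is taken into account.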

In [None]:
# Make a graph
y_values = 'Confirmed Percent'
px.bar(new_df.sort_values(y_values,ascending=False).head(20), x='Country', y=y_values, title=y_values+' of Population')

## Next Steps

We hope this was an interesting introduction to data science using online COVID-19 data.

If you would like to see time series or geographical data, here are some examples.

In [None]:
time_series_confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
tsc = pd.read_csv(time_series_confirmed_url)

time_series_deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
tsd = pd.read_csv(time_series_deaths_url)

time_series_recovered_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
tsr = pd.read_csv(time_series_recovered_url)

In [None]:
# Create a time series graph by setting the index, dropping some columns, and transposing rows and columns
px.scatter(tsr.set_index('Country/Region').drop(columns=['Province/State','Lat','Long']).T, y='Canada').show()
px.scatter(tsr.set_index('Country/Region').drop(columns=['Province/State','Lat','Long']).T, y=['Canada','Mexico']).show()
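
The time series files hold running totals, so each value includes all earlier days. As a minimal sketch, assuming a wide table with one column per date like the one transposed above, `diff` turns cumulative counts into day-to-day changes. The numbers and date labels here are made up.

```python
import pandas as pd

# Toy cumulative counts in a wide layout: one row per country, one column per date
toy_ts = pd.DataFrame({'Country/Region': ['Canada', 'Mexico'],
                       '4/5/20': [10, 4],
                       '4/6/20': [15, 4],
                       '4/7/20': [25, 9]})

# After transposing, dates are rows and countries are columns
cumulative = toy_ts.set_index('Country/Region').T

# Subtract the previous day from each day; the first day has nothing to subtract, so drop it
daily_new = cumulative.diff().dropna()
print(daily_new)
```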

In [None]:
# Plot geospatial data
import plotly.express as px
px.scatter_geo(covid_stats, lat='Lat', lon='Long_', size='Confirmed', hover_name='Combined_Key')

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)