![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fpresentations&branch=master&subPath=data-science-with-covid-student.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Introduction to Data Science with COVID-19 Data

While there are a number of well-designed dashboards and visualization tools for COVID-19 data, such as [Bing](https://bing.com/covid) and [The World Bank](http://datatopics.worldbank.org/universal-health-coverage/coronavirus/), we are going to try building something ourselves in a Jupyter notebook.

This Jupyter notebook uses [COVID-19 statistics from Johns Hopkins University CSSE](https://github.com/CSSEGISandData/COVID-19). You can also see [their dashboard](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6).

## Licence and Disclaimer

COVID-19 data sets are copyright 2020 [Johns Hopkins University](https://systems.jhu.edu) (available for educational and academic research purposes). The population data is free to use from [Gapminder](https://www.gapminder.org) under a [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/). This notebook also carries a [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/).

This notebook should not be considered medical or policy-making advice. Always follow the directives and orders of your public health authority.

## Getting Started

First, `▶Run` the next cell to import a data set. Once the data set has been downloaded and imported into a [DataFrame](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm), it will be displayed.

You can change the date, but make sure you use the format `'MM-DD-YYYY'` as they do in the CSSE data set. Files are updated once a day around midnight [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time).

In [1]:
date = '04-08-2020'

import pandas as pd

csv_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
covid_stats = pd.read_csv(csv_url)
covid_stats

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-08 22:51:58,34.223334,-82.461707,5,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-08 22:51:58,30.295065,-92.414197,86,2,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-08 22:51:58,37.767072,-75.632346,11,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-08 22:51:58,43.452658,-116.241552,438,3,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-08 22:51:58,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"
...,...,...,...,...,...,...,...,...,...,...,...,...
2878,,,,Vietnam,2020-04-08 22:51:39,14.058324,108.277199,251,0,126,125,Vietnam
2879,,,,West Bank and Gaza,2020-04-08 22:51:39,31.952200,35.233200,263,1,44,218,West Bank and Gaza
2880,,,,Western Sahara,2020-04-08 22:51:39,24.215500,-12.885800,4,0,0,4,",,Western Sahara"
2881,,,,Zambia,2020-04-08 22:51:39,-13.133897,27.849332,39,1,7,31,Zambia
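
If you would rather not type the `'MM-DD-YYYY'` string by hand, the file address can be built from a Python date. This is only a sketch: the helper name `daily_report_url` is made up for this example, and it simply formats the date and joins it to the same folder address used in the cell above.

```python
from datetime import date

# Same folder as the csv_url in the notebook cell
BASE_URL = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/')

def daily_report_url(day):
    """Return the daily report address for a datetime.date (hypothetical helper)."""
    # %m-%d-%Y gives zero-padded month and day, then a four-digit year
    return BASE_URL + day.strftime('%m-%d-%Y') + '.csv'

example_url = daily_report_url(date(2020, 4, 8))
print(example_url)
```

The resulting string can be passed to `pd.read_csv` in the same way as `csv_url`.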


## Data Cleaning

`Run` the next cell to clean up the data. We'll add up values for each country and create a new dataframe.

In [2]:
# If you prefer specific countries, put a # in front of the next line and remove the six ' marks around the next list
country_list = covid_stats['Country_Region'].unique()
'''
country_list = ['Italy', 'Spain', 'Germany', 'France', 
                'Israel', 'US', 'United Kingdom',
                'Singapore', 'Australia', 'Canada',
                'China', 'Argentina', 'Russia', 'India']
'''

# Collect one row of totals per country, then build the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for country in country_list:
    country_rows = covid_stats[covid_stats['Country_Region']==country]
    rows.append({'Country':country,
                 'Confirmed':country_rows['Confirmed'].sum(),
                 'Recovered':country_rows['Recovered'].sum(),
                 'Deaths':country_rows['Deaths'].sum()})
df = pd.DataFrame(rows, columns=['Country', 'Confirmed', 'Recovered', 'Deaths'])

df.sort_values('Confirmed',ascending=False)

Unnamed: 0,Country,Confirmed,Recovered,Deaths
0,US,429052,23559,14695
158,Spain,148220,48021,14792
87,Italy,139422,26491,17669
7,France,113959,21452,10887
68,Germany,113296,46300,2349
...,...,...,...,...
181,Western Sahara,4,0,0
35,Burundi,3,0,0
132,Papua New Guinea,2,0,0
157,South Sudan,2,0,0


In [5]:
df.sort_values('Confirmed',ascending=False).tail(20)

Unnamed: 0,Country,Confirmed,Recovered,Deaths
105,MS Zaandam,9,0,2
122,Nepal,9,1,0
39,Central African Republic,8,0,0
144,Saint Vincent and the Grenadines,8,1,0
107,Malawi,8,0,1
24,Belize,8,0,1
77,Holy See,8,2,0
151,Sierra Leone,7,0,0
36,Cabo Verde,7,1,1
29,Botswana,6,0,1
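
The loop above adds up rows one country at a time. pandas can do the same kind of totalling with `groupby`, which is sketched here on a tiny hand-typed sample (four rows copied from the 04-08-2020 table) so that it runs without a download. The names `sample` and `totals` are just for this example, and note that `groupby` lists countries alphabetically rather than in their original order.

```python
import pandas as pd

# Four rows copied from the daily report shown earlier
sample = pd.DataFrame({
    'Country_Region': ['US', 'US', 'Vietnam', 'Zambia'],
    'Confirmed': [5, 86, 251, 39],
    'Deaths': [0, 2, 0, 1],
    'Recovered': [0, 0, 126, 7],
})

# Group rows by country, add up the three columns, and tidy the result
totals = (sample.groupby('Country_Region')[['Confirmed', 'Recovered', 'Deaths']]
                .sum()
                .reset_index()
                .rename(columns={'Country_Region': 'Country'}))
print(totals.sort_values('Confirmed', ascending=False))
```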


## Graphing Data

We will use the `cufflinks` library to create a graph of our data set.

```python
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed').iplot(kind='bar',x='Country',y='Confirmed')
```

Another option:

```python
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed',ascending=False).head(20).iplot(kind='bar',x='Country',y='Confirmed',title='COVID Cases')
```

In [7]:
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed',ascending=False).head(20).iplot(kind='bar',x='Country',y='Confirmed',title='COVID Cases')

### Renaming Countries

The country/state naming for this data set mostly follows the [WHO list of member states](https://www.who.int/choice/demography/by_country/en/), but we can `Run` the next cell to rename some of them.

In [8]:
df.replace('US','United States',regex=True,inplace=True)
df.replace('Korea, South','South Korea',regex=True,inplace=True)

In [9]:
df

Unnamed: 0,Country,Confirmed,Recovered,Deaths
0,United States,429052,23559,14695
1,Canada,19141,4154,407
2,United Kingdom,61474,345,7111
3,China,82809,77567,3337
4,Netherlands,20682,272,2255
...,...,...,...,...
179,Vietnam,251,126,0
180,West Bank and Gaza,263,44,1
181,Western Sahara,4,0,0
182,Zambia,39,7,1
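
One thing to be aware of: with `regex=True`, `replace` changes any text that contains the pattern, not only whole values. The sketch below compares that with an exact-match dictionary. The label `'US Virgin Islands'` is a made-up entry added only to show the difference; it does not appear in the table above.

```python
import pandas as pd

names = pd.Series(['US', 'Korea, South', 'US Virgin Islands'])

# Exact matching: only values equal to a dictionary key are replaced
exact = names.replace({'US': 'United States', 'Korea, South': 'South Korea'})

# Pattern matching: every occurrence of 'US' inside a value is replaced
pattern = names.replace('US', 'United States', regex=True)

print(exact.tolist())
print(pattern.tolist())
```

For the country names in this data set both approaches give the same result, because no other name contains a capitalized `US`.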


### Listing Countries

To see a list of the countries in your data set, use

```python
df['Country'].unique()
```

or

```python
covid_stats['Country_Region'].unique()
```

In [11]:
df['Country'].unique().tolist()

['United States',
 'Canada',
 'United Kingdom',
 'China',
 'Netherlands',
 'Australia',
 'Denmark',
 'France',
 'Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Central African Republic',
 'Chad',
 'Chile',
 'Colombia',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czechia',
 'Diamond Princess',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Fiji',
 'Finland',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guatemala'

## Adding World Data

We can also add up all of the values in the data set to get worldwide totals.

In [12]:
confirmed = covid_stats['Confirmed'].sum()
recovered = covid_stats['Recovered'].sum()
deaths = covid_stats['Deaths'].sum()
world_values = {'Country':'World','Confirmed':confirmed,'Recovered':recovered,'Deaths':deaths}
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df, pd.DataFrame([world_values])], ignore_index=True)
df.tail()

Unnamed: 0,Country,Confirmed,Recovered,Deaths
180,West Bank and Gaza,263,44,1
181,Western Sahara,4,0,0
182,Zambia,39,7,1
183,Zimbabwe,11,0,3
184,World,1511104,328661,88338
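
The three separate sums can also be computed in a single call by selecting all three columns at once. This is a sketch on four rows copied from the output above, so the totals it prints are for the sample only, not for the world. The names `countries`, `totals`, and `with_total` are just for this example.

```python
import pandas as pd

# Four rows copied from the table above
countries = pd.DataFrame({
    'Country': ['West Bank and Gaza', 'Western Sahara', 'Zambia', 'Zimbabwe'],
    'Confirmed': [263, 4, 39, 11],
    'Recovered': [44, 0, 7, 0],
    'Deaths': [1, 0, 1, 3],
})

# Summing several columns at once returns a Series with one total per column
totals = countries[['Confirmed', 'Recovered', 'Deaths']].sum()

# Turn the totals into a one-row DataFrame and attach it to the end
total_row = pd.DataFrame([{'Country': 'Total of sample', **totals.to_dict()}])
with_total = pd.concat([countries, total_row], ignore_index=True)
print(with_total.tail())
```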


## Sorting Data

`Run` the next cell to sort the data by a particular column using the code:

```python
df.sort_values('Confirmed', ascending=False).head(16)
```

`ascending=False` sorts from largest to smallest. It is optional (the default is `True`, which sorts from smallest to largest), and `.head(16)` shows just the first 16 rows.
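
Here is a small sketch of both sort directions, using five rows copied from the 04-08-2020 output so it runs on its own. The variable names are only for this example.

```python
import pandas as pd

sample = pd.DataFrame({
    'Country': ['Germany', 'US', 'France', 'Spain', 'Italy'],
    'Confirmed': [113296, 429052, 113959, 148220, 139422],
    'Deaths': [2349, 14695, 10887, 14792, 17669],
})

# Largest confirmed counts first, keeping the top three rows
largest_first = sample.sort_values('Confirmed', ascending=False).head(3)

# Default order is ascending, so the smallest death counts come first
smallest_first = sample.sort_values('Deaths').head(3)

print(largest_first)
print(smallest_first)
```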

## Selecting Specific Countries

To see a DataFrame of specific countries, use either of the following two options:

```python
df[df['Country']=='Canada']
```

or

```python
list_of_countries = ['Canada', 'China', 'Italy']
df[df['Country'].isin(list_of_countries)]
```

In [16]:
list_of_countries = ['Canada', 'China', 'Italy']
df[df['Country'].isin(list_of_countries)].iplot(kind='bar',x='Country',y='Recovered',title='COVID Recoveries')
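
The same kind of filter can be flipped or combined with other conditions. This sketch uses four rows copied from the table above. `~` means "not", and `&` means "and" (each condition needs its own parentheses). The names `sample`, `others`, and `large` are only for this example.

```python
import pandas as pd

sample = pd.DataFrame({
    'Country': ['United States', 'Canada', 'China', 'Italy'],
    'Confirmed': [429052, 19141, 82809, 139422],
    'Recovered': [23559, 4154, 77567, 26491],
})
list_of_countries = ['Canada', 'China', 'Italy']

# Rows whose country is in the list
selected = sample[sample['Country'].isin(list_of_countries)]

# Rows whose country is NOT in the list
others = sample[~sample['Country'].isin(list_of_countries)]

# Rows that meet two numeric conditions at once
large = sample[(sample['Confirmed'] > 50000) & (sample['Recovered'] > 25000)]

print(selected, others, large, sep='\n\n')
```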

## Adding Population Data

We'll use population data from [Gapminder](https://gapminder.org).

In [17]:
pop_sheet_id = '18Ep3s1S0cvlT1ovQG9KdipLEoQ1Ktz5LtTTQpDcWbX0'
pop_gid = '1668956939'
population_csv_url = 'https://docs.google.com/spreadsheets/d/'+pop_sheet_id+'/export?gid='+pop_gid+'&format=csv'
population_data = pd.read_csv(population_csv_url)
population = population_data[population_data['time']==2019]
population

Unnamed: 0,geo,name,time,population
219,afg,Afghanistan,2019,37209007
520,alb,Albania,2019,2938428
821,dza,Algeria,2019,42679018
1122,and,Andorra,2019,77072
1423,ago,Angola,2019,31787566
...,...,...,...,...
58011,vnm,Vietnam,2019,97429061
58312,yem,Yemen,2019,29579986
58613,zmb,Zambia,2019,18137369
58914,zwe,Zimbabwe,2019,17297495


In [20]:
cp = population.set_index('name')
cs = df.set_index('Country')

In [22]:
covid_combined = cs.join(cp)

In [26]:
covid_combined.drop(columns='time',inplace=True)

In [29]:
covid_combined.dropna(inplace=True)

In [31]:
covid_combined['Confirmed Percentage'] = covid_combined['Confirmed']/covid_combined['population']*100

In [32]:
covid_combined

Country,Confirmed,Recovered,Deaths,geo,population,Confirmed Percentage
United States,429052,23559,14695,usa,3.290931e+08,0.130374
Canada,19141,4154,407,can,3.727981e+07,0.051344
United Kingdom,61474,345,7111,gbr,6.695902e+07,0.091808
China,82809,77567,3337,chn,1.420062e+09,0.005831
Netherlands,20682,272,2255,nld,1.713291e+07,0.120715
...,...,...,...,...,...,...
Uzbekistan,545,30,3,uzb,3.280737e+07,0.001661
Venezuela,167,65,9,ven,3.277987e+07,0.000509
Vietnam,251,126,0,vnm,9.742906e+07,0.000258
Zambia,39,7,1,zmb,1.813737e+07,0.000215


In [39]:
covid_combined.sort_values('Confirmed Percentage',ascending=False).iplot(kind='bar',y='Confirmed Percentage')
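
Because `dropna` silently removes any row without a population value, it can be worth looking at which names failed to match before dropping them. The sketch below repeats the join steps on three case rows and two population rows copied from the tables above. The names `cases`, `people`, and `unmatched` are only for this example.

```python
import pandas as pd

# Case totals, indexed by country name
cases = pd.DataFrame({
    'Country': ['Vietnam', 'Zambia', 'World'],
    'Confirmed': [251, 39, 1511104],
}).set_index('Country')

# 2019 populations, indexed by name
people = pd.DataFrame({
    'name': ['Vietnam', 'Zambia'],
    'population': [97429061, 18137369],
}).set_index('name')

# join matches rows by index label; unmatched rows get NaN for population
combined = cases.join(people)
unmatched = combined[combined['population'].isna()].index.tolist()
print('No population found for:', unmatched)

# Drop the unmatched rows, then compute confirmed cases as a percentage
matched = combined.dropna().copy()
matched['Confirmed Percentage'] = matched['Confirmed'] / matched['population'] * 100
print(matched)
```

Names that show up in `unmatched` are candidates for the renaming step described earlier.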

## Next Steps

We hope this has been an interesting introduction to data science using online COVID-19 data.

If you would like to see time series or geographical data, here are some examples.

In [40]:
time_series_confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
tsc = pd.read_csv(time_series_confirmed_url)

time_series_deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
tsd = pd.read_csv(time_series_deaths_url)

time_series_recovered_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
tsr = pd.read_csv(time_series_recovered_url)

In [42]:
tsr

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/30/20,3/31/20,4/1/20,4/2/20,4/3/20,4/4/20,4/5/20,4/6/20,4/7/20,4/8/20
0,,Afghanistan,33.00000,65.000000,0,0,0,0,0,0,...,2,5,5,10,10,10,15,18,18,29
1,,Albania,41.15330,20.168300,0,0,0,0,0,0,...,44,52,67,76,89,99,104,116,131,154
2,,Algeria,28.03390,1.659600,0,0,0,0,0,0,...,37,46,61,61,62,90,90,90,113,237
3,,Andorra,42.50630,1.521800,0,0,0,0,0,0,...,10,10,10,10,16,21,26,31,39,52
4,,Angola,-11.20270,17.873900,0,0,0,0,0,0,...,0,1,1,1,1,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
244,Falkland Islands (Malvinas),United Kingdom,-51.79630,-59.523600,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
245,Saint Pierre and Miquelon,France,46.88520,-56.315900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
246,,South Sudan,6.87700,31.307000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
247,,Western Sahara,24.21550,-12.885800,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
# Create a time series graph by setting the index, dropping some columns, and transposing rows and columns
tsr.set_index('Country/Region').drop(columns=['Province/State','Lat','Long']).T.iplot()
#tsr.set_index('Country/Region').drop(columns=['Province/State','Lat','Long']).T.iplot(y=['Canada', 'Mexico'])
#tsr.set_index('Country/Region').drop(columns=['Province/State','Lat','Long']).T.iplot(y='Canada')

In [44]:
# Plot geospatial data
import plotly.express as px
fig = px.scatter_geo(covid_stats, lat='Lat', lon='Long_', size='Confirmed', hover_name='Combined_Key')
fig.show()

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)