![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

[![Open in Callysto](https://github.com/callysto/curriculum-notebooks/blob/open-in-callysto-button/open-in-callysto-button.png?raw=true)](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fpresentations&branch=master&subPath=data-science-with-covid-instructor.ipynb&depth=1)

# Introduction to Data Science with COVID-19 Data

This Jupyter notebook uses [COVID-19 statistics from Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19).

First, `▶Run` the next cell to import the data. Once the data set has been downloaded and imported into a [DataFrame](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm), it will be displayed.

You can change the date, but make sure you use the format `'MM-DD-YYYY'`, which matches the file names of the daily reports in the CSSE data set.

In [1]:
date = '04-06-2020'

import pandas as pd

csv_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
covid_stats = pd.read_csv(csv_url)
covid_stats

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-06 23:22:15,34.223334,-82.461707,6,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-06 23:22:15,30.295065,-92.414197,79,2,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-06 23:22:15,37.767072,-75.632346,11,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-06 23:22:15,43.452658,-116.241552,402,3,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-06 23:22:15,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"
...,...,...,...,...,...,...,...,...,...,...,...,...
2804,,,,West Bank and Gaza,2020-04-06 23:21:55,31.952200,35.233200,254,1,24,229,West Bank and Gaza
2805,,,,Western Sahara,2020-04-06 23:21:55,24.215500,-12.885800,4,0,0,4,",,Western Sahara"
2806,,,,Zambia,2020-04-06 23:21:55,-13.133897,27.849332,39,1,5,33,Zambia
2807,,,,Zimbabwe,2020-04-06 23:21:55,-19.015438,29.154857,10,1,0,9,Zimbabwe
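
If you would rather not type the date string by hand, the following sketch builds it with Python's standard `datetime` module. The helper name `csse_date_string` and the `BASE_URL` constant are made up for this illustration; the URL itself is the one used in the cell above.

```python
from datetime import date

# Same base URL as in the import cell above
BASE_URL = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/')

def csse_date_string(d):
    """Format a date as MM-DD-YYYY (hypothetical helper for this example)."""
    return d.strftime('%m-%d-%Y')

report_date = csse_date_string(date(2020, 4, 6))
csv_url = BASE_URL + report_date + '.csv'
print(report_date)  # 04-06-2020
```

Passing `csv_url` to `pd.read_csv` would then load the report for that day, assuming a file exists for the chosen date.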


## Data Cleaning

`Run` the next cell to clean up the data. We'll add up the values for each country and create a new DataFrame.

In [2]:
# If you prefer specific countries, put a # in front of the next line and remove the six ' marks around the next list
country_list = covid_stats['Country_Region'].unique()
'''
country_list = ['Italy', 'Spain', 'Germany', 'France', 
                'Israel', 'US', 'United Kingdom',
                'Singapore', 'Australia', 'Canada',
                'China', 'Argentina', 'Russia', 'India']
'''

rows = []  # one dictionary of totals per country

for country in country_list:
    country_rows = covid_stats[covid_stats['Country_Region']==country]  # all rows for this country
    confirmed = country_rows['Confirmed'].sum()
    recovered = country_rows['Recovered'].sum()
    deaths = country_rows['Deaths'].sum()
    rows.append({'Country':country,'Confirmed':confirmed,'Recovered':recovered,'Deaths':deaths})

# Build the DataFrame once from the collected rows (DataFrame.append is not available in pandas 2.0 and later)
df = pd.DataFrame(rows, columns=['Country', 'Confirmed', 'Recovered', 'Deaths'])

df.sort_values('Confirmed',ascending=False)

Unnamed: 0,Country,Confirmed,Recovered,Deaths
0,US,366614,19581,10783
158,Spain,136675,40437,13341
87,Italy,132547,22837,16523
68,Germany,103374,28700,1810
7,France,98963,17428,8926
...,...,...,...,...
146,Sao Tome and Principe,4,0,0
35,Burundi,3,0,0
132,Papua New Guinea,2,0,0
157,South Sudan,1,0,0
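
Adding up values per country can also be done with pandas' `groupby`. The sketch below uses a small made-up DataFrame named `sample` as a stand-in for `covid_stats` so that it runs on its own; the numbers are for illustration only.

```python
import pandas as pd

# Toy stand-in for covid_stats (values are invented for illustration)
sample = pd.DataFrame({
    'Country_Region': ['US', 'US', 'Canada', 'Canada', 'Italy'],
    'Confirmed': [6, 79, 10, 5, 100],
    'Recovered': [0, 0, 2, 1, 20],
    'Deaths': [0, 2, 1, 0, 10],
})

# Group the rows by country, sum the three numeric columns, and rename the key column
totals = (sample
          .groupby('Country_Region', as_index=False)[['Confirmed', 'Recovered', 'Deaths']]
          .sum()
          .rename(columns={'Country_Region': 'Country'}))

print(totals.sort_values('Confirmed', ascending=False))
```

Note that `groupby` orders the result alphabetically by country, so the row index labels would differ from those produced by the loop in the cell above.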


## Add World Data

We can also add up all of the values in the data set to get worldwide totals.

In [3]:
confirmed = covid_stats['Confirmed'].sum()
recovered = covid_stats['Recovered'].sum()
deaths = covid_stats['Deaths'].sum()
world_values = {'Country':'World','Confirmed':confirmed,'Recovered':recovered,'Deaths':deaths}
df = pd.concat([df, pd.DataFrame([world_values])], ignore_index=True)
df.tail()

Unnamed: 0,Country,Confirmed,Recovered,Deaths
180,West Bank and Gaza,254,24,1
181,Western Sahara,4,0,0
182,Zambia,39,5,1
183,Zimbabwe,10,0,1
184,World,1345048,276515,74565


## Sorting Data

`Run` the next cell to sort the data by a particular column. The `ascending=False` argument is optional (the default is `True`, which sorts from smallest to largest), and `.head(16)` shows just the first 16 rows.

In [4]:
df.sort_values('Confirmed', ascending=False).head(16)

Unnamed: 0,Country,Confirmed,Recovered,Deaths
184,World,1345048,276515,74565
0,US,366614,19581,10783
158,Spain,136675,40437,13341
87,Italy,132547,22837,16523
68,Germany,103374,28700,1810
7,France,98963,17428,8926
3,China,82665,77310,3335
83,Iran,60500,24236,3739
2,United Kingdom,52279,287,5385
172,Turkey,30217,1326,649
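
To illustrate how the column name, `ascending`, and `.head()` interact, here is a small sketch on a made-up DataFrame named `sample` (the country labels and numbers are invented).

```python
import pandas as pd

# Toy data, invented for illustration
sample = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Confirmed': [50, 200, 10, 120],
    'Deaths': [5, 8, 1, 20],
})

# Largest number of deaths first, keeping only the top two rows
by_deaths = sample.sort_values('Deaths', ascending=False).head(2)

# Default order (ascending=True): smallest number of confirmed cases first
by_confirmed = sample.sort_values('Confirmed')

print(by_deaths)
print(by_confirmed)
```

Sorting returns a new DataFrame and leaves `sample` unchanged, which is why the sorted result in the cell above is displayed without being assigned back to `df`.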


## Selecting Specific Countries

To see a DataFrame of specific countries, edit and run the next cell.

In [5]:
#df[df['Country']=='Canada']
list_of_countries = ['Canada', 'China', 'Italy']
df[df['Country'].isin(list_of_countries)]

Unnamed: 0,Country,Confirmed,Recovered,Deaths
1,Canada,16563,3256,339
3,China,82665,77310,3335
87,Italy,132547,22837,16523
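
Rows can be selected by any condition, not only by country name. The sketch below rebuilds the three rows shown above in a DataFrame named `sample` (a name chosen for this example) and filters them in two ways; the threshold of 50000 is arbitrary.

```python
import pandas as pd

# The three rows from the output above
sample = pd.DataFrame({
    'Country': ['Canada', 'China', 'Italy'],
    'Confirmed': [16563, 82665, 132547],
    'Recovered': [3256, 77310, 22837],
    'Deaths': [339, 3335, 16523],
})

# Select by name: keep rows whose Country appears in the list
by_name = sample[sample['Country'].isin(['Canada', 'Italy'])]

# Select by value: keep rows with more than 50000 confirmed cases
by_value = sample[sample['Confirmed'] > 50000]

print(by_name)
print(by_value)
```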


## Graphing Data

We will use the `cufflinks` library to create a graph of our data set.

```python
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed').iplot(kind='bar',x='Country',y='Confirmed')
```

Another option:

```python
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed',ascending=False).head(20).iplot(kind='bar',x='Country',y='Confirmed',title='COVID Cases')
```

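To exclude the `World` row, you can `.drop(184)`, where 184 is that row's index label in the table above (it may differ for other dates or country lists).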
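
As an alternative that does not depend on the index label, the `World` row can be filtered out by name. This is a minimal sketch on a made-up DataFrame named `sample`; the plotting call is left as a comment because it needs `cufflinks` to be installed.

```python
import pandas as pd

# Toy data including a World total row (values taken from the tables above)
sample = pd.DataFrame({
    'Country': ['Canada', 'Italy', 'World'],
    'Confirmed': [16563, 132547, 1345048],
})

# Keep every row whose Country is not 'World'
without_world = sample[sample['Country'] != 'World']
print(without_world)

# The filtered DataFrame could then be plotted the same way as before, for example:
# without_world.sort_values('Confirmed').iplot(kind='bar', x='Country', y='Confirmed')
```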

In [6]:
import cufflinks as cf
cf.go_offline()
df.sort_values('Confirmed').iplot(kind='bar',x='Country',y='Confirmed')

**Hopefully that's an interesting introduction to data science using online COVID-19 data.**

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)