# Tracking the progress of the COVID-19 Pandemic
> "In this article we show how to track the spread of the COVID-19 virus"
- toc: false
- branch: master
- badges: true
- comments: false
- categories: [charts,visualization,data]

As the COVID-19 pandemic gathers steam, it is important that we keep up to date with the latest information from reputable and trusted sources. This helps us make the best decisions to remain safe, mitigate the effects, and eventually return society to normal. One way that data scientists and data analysts can help is by analyzing the best available data sources. This article analyzes some important sources of COVID-19 data.

![](../images/coronavirus.jfif)

## Data from Johns Hopkins Center for Systems Science and Engineering
One of the most popular sources of information on COVID-19 is the **Center for Systems Science and Engineering** (CSSE) at **Johns Hopkins University**. They maintain an interactive dashboard with up-to-date statistics on the global spread. The data is normally less than a day old and is fairly comprehensive across countries, with resolution down to the province/state/region level for some of the major countries. The CSSE also maintains a GitHub repository where the data is stored in CSV files. These CSV files can be used in data projects, either by downloading them directly or by cloning the GitHub repository.

## Kaggle COVID Data Analysis
The [Novel Corona Virus Dataset](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset) is a dataset hosted on Kaggle. It is made available to data scientists to spur understanding of the pandemic and in particular to help in predicting the path that the epidemic will take. It is derived from the CSSE data, but is a day or two behind.

## Other sources for COVID-19 data

### Canada Public Health
The official Canadian government website for COVID-19 updates is located [here](https://www.canada.ca/en/public-health/services/diseases/coronavirus-disease-covid-19.html).

### Ontario
The Ontario government publishes its COVID-19 updates at https://www.ontario.ca/page/2019-novel-coronavirus

## Install Libraries
We will use the following libraries and will need to install them using pip.

In [1]:
#! pip install altair pendulum folium

In [1]:
import pandas as pd
import altair as alt
import pendulum
from ipywidgets import HTML
pd.options.display.max_rows = 80

### Download the data
We will download the data directly from the CSSE GitHub repository.

In [2]:
# No trailing slash here: the f-strings below add the separator
CSSE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series'
CONFIRMED_URL = f'{CSSE_URL}/time_series_19-covid-Confirmed.csv'
DEATHS_URL = f'{CSSE_URL}/time_series_19-covid-Deaths.csv'
RECOVERED_URL = f'{CSSE_URL}/time_series_19-covid-Recovered.csv'

confirmed_wide = pd.read_csv(CONFIRMED_URL)
deaths_wide = pd.read_csv(DEATHS_URL)
recovered_wide = pd.read_csv(RECOVERED_URL)

## When was the data last updated?
The raw data is in wide format, and the data for each new day is appended as a new column on the right. We can use the last column header to show when the data was last updated.

In [3]:
latest_date = pendulum.instance( pd.to_datetime(confirmed_wide.columns[-1]))
pd.DataFrame({'Date Updated': [latest_date.to_day_datetime_string()], 
              'When': [latest_date.diff_for_humans()]}, 
             index=[''])

| Date Updated | When |
|---|---|
| Tue, Mar 17, 2020 12:00 AM | 1 day ago |
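
The same idea can be sketched with only the standard library. This is a minimal illustration, not the notebook's method: it assumes the date columns are labelled like `3/17/20` (month/day/two-digit year), and `days_behind` is a hypothetical helper name.

```python
from datetime import datetime

def days_behind(column_label, now):
    """Return how many whole days old a date-labelled column is.

    Assumes labels such as '3/17/20' (month/day/two-digit year).
    """
    latest = datetime.strptime(column_label, '%m/%d/%y')
    return (now - latest).days

# A fixed "now" keeps the example reproducible.
age = days_behind('3/17/20', datetime(2020, 3, 18, 9, 30))
print(age)
```

In the notebook, `pendulum` adds the human-readable "1 day ago" phrasing on top of this arithmetic.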


## Data Processing

In [4]:
dates = confirmed_wide.columns[4:]
ID_VARS = ['Province/State', 'Country/Region', 'Lat', 'Long']
CASE_COLS = ['Confirmed', 'Active', 'Deaths', 'Recovered']

# Convert from the wide format to long format
confirmed_long = confirmed_wide.melt(id_vars=ID_VARS, value_vars=dates, var_name='Date', value_name='Confirmed')
deaths_long = deaths_wide.melt(id_vars=ID_VARS, value_vars=dates, var_name='Date', value_name='Deaths')
recovered_long = recovered_wide.melt(id_vars=ID_VARS, value_vars=dates, var_name='Date', value_name='Recovered')
covid_full = pd.concat([confirmed_long, deaths_long['Deaths'], recovered_long['Recovered']], axis=1, sort=False)

covid_full.Date = pd.to_datetime(covid_full.Date).dt.normalize()
covid_full = covid_full[['Date', 'Province/State','Country/Region','Lat','Long','Confirmed','Deaths','Recovered']]
covid_full = covid_full.rename(columns={'Province/State': 'Province', 'Country/Region': 'Country'})
covid_full.Country = covid_full.Country.replace('Mainland/China', 'China').replace('Korea, South', 'South Korea')
covid_full = covid_full.set_index('Date')

# Add an Active column
covid_full['Active'] = covid_full['Confirmed'] - covid_full['Deaths'] - covid_full['Recovered']

# Remove rows whose Province contains a comma to avoid double counting;
# rows with no Province (NaN) are kept
covid_full = covid_full[~covid_full.Province.str.contains(',', na=False)]

covid_full.tail()

| Date | Province | Country | Lat | Long | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|---|---|
| 2020-03-17 | Cayman Islands | United Kingdom | 19.3133 | -81.2546 | 1 | 1 | 0 | 0 |
| 2020-03-17 | Reunion | France | -21.1351 | 55.2471 | 9 | 0 | 0 | 9 |
| 2020-03-17 | | Barbados | 13.1939 | -59.5432 | 2 | 0 | 0 | 2 |
| 2020-03-17 | | Montenegro | 42.5 | 19.3 | 2 | 0 | 0 | 2 |
| 2020-03-17 | | The Gambia | 13.4667 | -16.6 | 1 | 0 | 0 | 1 |
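
To see what the wide-to-long conversion and the comma filter do, here is a small sketch on made-up numbers. The frames, the `to_long` helper and every `_toy` name are hypothetical and exist only for illustration; the real CSSE files have more columns and many more rows.

```python
import pandas as pd

# Made-up wide-format frames: one row per location, one column per date.
confirmed_toy = pd.DataFrame({
    'Province/State': ['Ontario', 'King County, WA', None],
    'Country/Region': ['Canada', 'US', 'Italy'],
    '3/16/20': [10, 5, 100],
    '3/17/20': [15, 7, 130],
})
deaths_toy = confirmed_toy.assign(**{'3/16/20': [0, 1, 10], '3/17/20': [1, 1, 12]})

ID_VARS_TOY = ['Province/State', 'Country/Region']

def to_long(wide, value_name):
    """Melt the date columns into (Date, value) rows."""
    return wide.melt(id_vars=ID_VARS_TOY, var_name='Date', value_name=value_name)

long_all = to_long(confirmed_toy, 'Confirmed')
# Both frames share the same row order, so the Deaths column lines up by position.
long_all['Deaths'] = to_long(deaths_toy, 'Deaths')['Deaths']
long_all['Date'] = pd.to_datetime(long_all['Date'], format='%m/%d/%y')

# Drop rows whose province contains a comma; rows with no province are kept.
has_comma = long_all['Province/State'].str.contains(',', na=False)
long_clean = long_all[~has_comma]

print(long_clean)
```

Each location contributes one row per date after melting, so three locations and two dates give six rows, of which the two comma-containing rows are then dropped. Note that combining the frames by position only works because the source files list locations in the same order.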


## Latest Data

In [5]:
latest_date = covid_full.index.max()
latest_data = covid_full.loc[latest_date]

In [6]:
covid_full.loc[latest_date][CASE_COLS].sum().to_frame().T

| Confirmed | Active | Deaths | Recovered |
|---|---|---|---|
| 197145 | 108400 | 7905 | 80840 |


### Latest By Country

In [7]:
covid_full.loc[latest_date].groupby('Country')[CASE_COLS]\
        .sum().sort_values(['Confirmed', 'Active'], ascending=[False, False])

| Country | Confirmed | Active | Deaths | Recovered |
|---|---|---|---|---|
| China | 81058 | 9030 | 3230 | 68798 |
| Italy | 31506 | 26062 | 2503 | 2941 |
| Iran | 16169 | 9792 | 988 | 5389 |
| Spain | 11748 | 10187 | 533 | 1028 |
| Germany | 9257 | 9166 | 24 | 67 |
| South Korea | 8320 | 6832 | 81 | 1407 |
| France | 7699 | 7539 | 148 | 12 |
| US | 6421 | 6296 | 108 | 17 |
| Switzerland | 2700 | 2669 | 27 | 4 |
| United Kingdom | 1960 | 1851 | 56 | 53 |


### Canada Wide

In [8]:
def country_data(country):
    """Latest case counts for one country, grouped by province, largest first."""
    return covid_full[covid_full.Country == country].loc[latest_date].groupby('Province')[CASE_COLS] \
            .sum().sort_values(['Confirmed', 'Active'], ascending=[False, False])

In [9]:
canada_data = country_data('Canada')
canada_data

| Province | Confirmed | Active | Deaths | Recovered |
|---|---|---|---|---|
| Ontario | 185 | 179 | 1 | 5 |
| British Columbia | 103 | 95 | 4 | 4 |
| Alberta | 74 | 74 | 0 | 0 |
| Quebec | 74 | 74 | 0 | 0 |
| Grand Princess | 8 | 8 | 0 | 0 |
| Manitoba | 8 | 8 | 0 | 0 |
| New Brunswick | 8 | 8 | 0 | 0 |
| Nova Scotia | 7 | 7 | 0 | 0 |
| Saskatchewan | 7 | 7 | 0 | 0 |
| Newfoundland and Labrador | 3 | 3 | 0 | 0 |
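
The filter, group and two-key sort used by `country_data` can be tried on a toy frame. This is only a sketch: the numbers are made up, and `province_totals` and `toy_cases` are hypothetical names used for illustration.

```python
import pandas as pd

# Made-up latest-day rows; two provinces tie on Confirmed.
toy_cases = pd.DataFrame({
    'Country': ['Canada', 'Canada', 'Canada', 'US'],
    'Province': ['Alberta', 'Ontario', 'Quebec', 'Washington'],
    'Confirmed': [74, 185, 74, 900],
    'Active': [70, 179, 74, 850],
})

def province_totals(frame, country):
    """Sum cases per province for one country, largest first."""
    rows = frame[frame['Country'] == country]
    return (rows.groupby('Province')[['Confirmed', 'Active']]
                .sum()
                .sort_values(['Confirmed', 'Active'], ascending=[False, False]))

canada_toy = province_totals(toy_cases, 'Canada')
print(canada_toy)
```

Sorting on `['Confirmed', 'Active']` means provinces with equal confirmed counts are ordered by their active counts, so in this toy data Quebec is listed ahead of Alberta.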


### USA

In [10]:
country_data('US');

## World Map

In [11]:
import folium
m = folium.Map(location=[0, 0], tiles='cartodbpositron', min_zoom=1, max_zoom=4, zoom_start=1)

# Draw one circle per location, sized by the number of confirmed cases
for _, row in latest_data.iterrows():
    folium.Circle(
        location=[row['Lat'], row['Long']],
        color='crimson',
        tooltip=(f"<li><b>Country : {row['Country']}</b></li>"
                 f"<li><b>Province : {row['Province']}</b></li>"
                 f"<li><b>Confirmed : {row['Confirmed']}</b></li>"
                 f"<li><b>Deaths : {row['Deaths']}</b></li>"
                 f"<li><b>Recovered : {row['Recovered']}</b></li>"),
        radius=int(row['Confirmed'])**1.2).add_to(m)

# Display the map
m



![](../images/worldmap.png)

## Case Data over Time

We will use Altair to chart the case data over time. Since we want charts for individual countries and provinces, we will write a function that takes the country and province as optional parameters.

In [12]:
def cases_over_time(country=None, province=None):
    title='Covid cases over time'
    base_data = covid_full
    
    if province:
        base_data = base_data[base_data.Province==province]
        title = f'{title} {province}'
        
    if country:
        base_data = base_data[base_data.Country==country]
        title = f'{title} {country}'
        

        
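    # Select columns with a list; indexing a groupby with a bare tuple fails on newer pandas
    case_cols = ['Recovered', 'Deaths', 'Active']
    # Total each case type per day, then reshape to one row per (Date, Case)
    case_data = base_data.groupby('Date')[case_cols] \
            .sum().reset_index().melt(id_vars="Date", value_vars=case_cols,
                     var_name='Case', value_name='Count')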
    
    ## The chart
    chart = alt.Chart(case_data).mark_area().encode(
            x='Date:T',
            y='Count:Q',
            color='Case:N'
        ).properties(
            title=title,
            width=600
            ).configure_axis(
                grid=False
            )
    return chart

cases_over_time()
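
The reshaping step inside `cases_over_time` is easier to follow in isolation. The sketch below runs it on a made-up frame (`toy_history` and its numbers are hypothetical) and stops before the Altair call, so it needs only pandas.

```python
import pandas as pd

# Made-up history: two provinces over two days.
toy_history = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-16', '2020-03-16', '2020-03-17', '2020-03-17']),
    'Province': ['Ontario', 'Quebec', 'Ontario', 'Quebec'],
    'Recovered': [5, 0, 5, 0],
    'Deaths': [1, 0, 1, 0],
    'Active': [160, 50, 179, 74],
})

case_cols = ['Recovered', 'Deaths', 'Active']

# Total each case type per day, then melt to one row per (Date, Case) pair.
toy_case_data = (
    toy_history.groupby('Date')[case_cols]
    .sum()
    .reset_index()
    .melt(id_vars='Date', value_vars=case_cols, var_name='Case', value_name='Count')
)

print(toy_case_data)
```

The long `Date`/`Case`/`Count` layout is what lets the chart encode `Case` as the colour channel, so each case type becomes one band of the stacked area chart.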

### Cases in Canada

In [13]:
cases_over_time('Canada')

### Cases in Ontario, Canada

In [14]:
cases_over_time('Canada', 'Ontario')

### Cases in the USA

In [15]:
cases_over_time('US')

### Cases in Italy

In [16]:
cases_over_time('Italy')

### Cases in the United Kingdom

In [17]:
cases_over_time('United Kingdom')