# Visualizations of Covid-19 data

This notebook creates some basic visualizations of Covid-19 data. A roadmap will be added later.

### Loading up-to-date data

We load current data on the number of confirmed cases and fatalities per country from the Johns Hopkins Center for Systems Science and Engineering (JHU/CSSE).

In [1]:
import pandas as pd
import plotly.express as px

In [2]:
def load_data(file_name, column_name):
    """
    @arg file_name: URL of the CSV with the current Covid-19 data, either confirmed cases or fatalities
    @arg column_name: name for the values in the date columns, for example cumulative confirmed cases or cumulative fatalities
    
    @return: a Pandas DataFrame with the cumulative number of confirmed cases or fatalities per Province/State, Country/Region and date.
             Dates are in the datetime64[ns] format
                 
    Based on https://towardsdatascience.com/visualise-covid-19-case-data-using-python-dash-and-plotly-e58feb34f70f
    """
    data = pd.read_csv(file_name) \
             .drop(['Lat', 'Long'], axis=1) \
             .melt(id_vars=['Province/State', 'Country/Region'], var_name='date', value_name = column_name) \
             .astype({'date':'datetime64[ns]', column_name : 'Int64'})
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    data['Province/State'] = data['Province/State'].fillna('<all>')
    return data
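
To make the reshaping step in `load_data` easier to follow, here is a minimal sketch of the same wide-to-long transformation on a toy table. The table and its values are made up for illustration, and parsing the dates with `pd.to_datetime` and an explicit format is an assumption of this sketch rather than what the function above does.

```python
import pandas as pd

# Toy wide-format table in the shape of the JHU/CSSE files; the values are made up.
wide = pd.DataFrame({
    'Province/State': [None, 'Hubei'],
    'Country/Region': ['Italy', 'China'],
    '1/22/20': [0, 10],
    '1/23/20': [1, 12],
})

# One row per (Province/State, Country/Region, date) instead of one column per date
long_df = wide.melt(id_vars=['Province/State', 'Country/Region'],
                    var_name='date', value_name='cum_confirmed')

# Parse the m/d/yy column labels into real dates (explicit format is an assumption here)
long_df['date'] = pd.to_datetime(long_df['date'], format='%m/%d/%y')

# Countries without a subdivision get the '<all>' placeholder
long_df['Province/State'] = long_df['Province/State'].fillna('<all>')

print(long_df)
```

Each date column of the wide table becomes a block of rows in the long table, which is the layout that `groupby` and `px.line` work with most naturally.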

In [3]:
# We load in the data hosted by the Johns Hopkins Center for Systems Science and Engineering (JHU/CSSE)

base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series'
confirmed_cases_url = base_url + "/time_series_covid19_confirmed_global.csv"
number_deaths_url = base_url + "/time_series_covid19_deaths_global.csv"

df_c_orig = pd.read_csv(confirmed_cases_url)

In [4]:
# Display the original, unprocessed data
df_c_orig.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20,3/30/20,3/31/20,4/1/20,4/2/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,74,84,94,110,110,120,170,174,237,273
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,123,146,174,186,197,212,223,243,259,277
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,264,302,367,409,454,511,584,716,847,986
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,164,188,224,267,308,334,370,376,390,428
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,3,3,4,4,5,7,7,7,8,8


In [5]:
# Display the processed data
df_c = load_data(confirmed_cases_url, 'cum_confirmed')
df_c.head()

Unnamed: 0,Province/State,Country/Region,date,cum_confirmed
0,<all>,Afghanistan,2020-01-22,0
1,<all>,Albania,2020-01-22,0
2,<all>,Algeria,2020-01-22,0
3,<all>,Andorra,2020-01-22,0
4,<all>,Angola,2020-01-22,0


In [6]:
# Countries can be subdivided into Provinces/States. Here, we look at the numbers per country, aggregating over Provinces/States:
df_c_country = df_c.groupby(['Country/Region','date'], as_index = False)['cum_confirmed'].sum()
df_c_country.head(10)

Unnamed: 0,Country/Region,date,cum_confirmed
0,Afghanistan,2020-01-22,0
1,Afghanistan,2020-01-23,0
2,Afghanistan,2020-01-24,0
3,Afghanistan,2020-01-25,0
4,Afghanistan,2020-01-26,0
5,Afghanistan,2020-01-27,0
6,Afghanistan,2020-01-28,0
7,Afghanistan,2020-01-29,0
8,Afghanistan,2020-01-30,0
9,Afghanistan,2020-01-31,0


In [47]:
px.line(df_c_country, x='date', y='cum_confirmed', color='Country/Region', title = 'Cumulative Number Of Confirmed Cases')

In [48]:
# Let's start with the following countries:
selected_countries = ['China', 'US', 'Italy', 'Spain', 
                      'Germany', 'France', 'Iran', 'United Kingdom', 
                      'Switzerland', 'Belgium', 'Netherlands', 
                      'Turkey', 'Taiwan*', 'Korea, South', 
                      'Thailand', 'Indonesia']
# Hong Kong would be nice too. Rapid community response because of SARS earlier.

df_c_sel = df_c_country[df_c_country['Country/Region'].isin(selected_countries)]
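
A minimal sketch of the same `isin` selection on toy data follows. The explicit `.copy()` is an addition of this sketch, not part of the cell above. It makes the selection an independent DataFrame, so that new columns can be added to it without ambiguity about whether the parent is affected.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    'Country/Region': ['X', 'Y', 'Z', 'X'],
    'cum_confirmed': [1, 2, 3, 4],
})

keep = ['X', 'Z']

# Boolean mask: True for rows whose country is in the list
subset = toy[toy['Country/Region'].isin(keep)].copy()

# Adding a column to the copy leaves the original table untouched
subset['doubled'] = subset['cum_confirmed'] * 2
print(subset)
```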

In [49]:
px.line(df_c_sel, x='date', y='cum_confirmed', color='Country/Region', title = 'Cumulative Number Of Confirmed Cases')

In [51]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_c_sel = df_c_sel.assign(daily_confirmed=df_c_sel.groupby('Country/Region')['cum_confirmed'].diff())
px.line(df_c_sel.dropna(), x='date', y='daily_confirmed', color='Country/Region', title = "Daily Number Of Confirmed Cases")



TODO: Plot these in population densities?
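
The daily numbers plotted above are differences of the cumulative counts within each country. A minimal sketch on hypothetical data shows what the per-group `diff` produces:

```python
import pandas as pd

# Hypothetical cumulative counts for two countries on three consecutive days
toy = pd.DataFrame({
    'Country/Region': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'cum_confirmed': [10, 15, 25, 0, 2, 2],
})

# diff() is computed per country, so the first day of each country has no
# predecessor and becomes NaN instead of borrowing the previous country's value
toy = toy.assign(daily_confirmed=toy.groupby('Country/Region')['cum_confirmed'].diff())
print(toy)
```

The `NaN` on each country's first day is why the plotting cell calls `dropna()` before drawing the lines.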

### Align time series on similar number of first cases
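
As a minimal sketch of the idea, assuming hypothetical data: keep only the days on which a country has at least a given number of cases, and count days from the first such day in each country. Using `transform('min')` to find that first day is one possible approach chosen for this sketch.

```python
import pandas as pd

# Hypothetical cumulative counts; country Y reaches the threshold one day later than X
toy = pd.DataFrame({
    'Country/Region': ['X'] * 4 + ['Y'] * 4,
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'] * 2),
    'cum_confirmed': [5, 20, 40, 80, 0, 0, 25, 30],
})

threshold = 20

# Keep only days with at least `threshold` cases
aligned = toy[toy['cum_confirmed'] >= threshold].copy()

# First such day per country, broadcast back to every row of that country
first_day = aligned.groupby('Country/Region')['date'].transform('min')

# Whole days elapsed since that first day
aligned['days_since_num'] = (aligned['date'] - first_day).dt.days
print(aligned)
```

Both countries now start at day 0, so their curves can be compared from the same starting point regardless of the calendar date.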


In [44]:
def align_time_series(df, start_num):
    """
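    @arg df        : a DataFrame that contains the columns 'Country/Region', 'date', 'cum_confirmed' with corresponding datatypes
    @arg start_num : The minimum number of cases to align the time series on. Includes this number.
    
    @return        : the rows of df with at least start_num cases, with an extra column 'days_since_num' holding the
                     number of days since the country first reached start_num cases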
    """
    
    # Copy the filtered rows so that adding a column does not modify a slice of df
    df_filtered = df[df['cum_confirmed'] >= start_num].copy()
    # Day 0 is the first date on which each country has at least start_num cases
    first_day = df_filtered.groupby('Country/Region')['date'].transform('min')
    df_filtered['days_since_num'] = (df_filtered['date'] - first_day).dt.days
    return df_filtered

In [46]:
# Aligns the time series on the first day that the number of confirmed cases reaches or exceeds 'start_num_to_align'

start_num_to_align = 20 

df_filtered = align_time_series(df_c_sel, start_num_to_align)
px.line(df_filtered, x='days_since_num', y='cum_confirmed', color='Country/Region', title = f"Cumulative Cases since the first day with {start_num_to_align} Cases")

