# 02806 Final project 
> An analysis and visualization of the novel Covid-19 virus

- toc: true 
- badges: true
- author: Georgios Zefkilis & Yucheng Ren
- comments: false
- categories: [data_analysis, visualization]

> Tip: This page is generated from a Jupyter notebook. Some of the code is hidden, and some of it can be shown by clicking the `Show Code` button. To see the complete notebook, click the `view on github` button above. Some of our visualizations are generated with [Flourish](https://app.flourish.studio/); above each such plot we provide the code used to prepare its data.

# Introduction

Some introduction here

# Data Analysis

In this section, we present the data used in the project together with some summary statistics.

## Preprocessing

We use four datasets in this project, covering nCovid-19, EBOLA, H1N1 and SARS. The first step is to clean the data, extract the columns we need, and rename them so that all four datasets share the same column names.

Below are the first rows of each dataset.

In [116]:
# hide
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

In [58]:
# hide
path = 'final_data/'
cov19 = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv', parse_dates=['Date'])
h1n1 = pd.read_csv(path + "H1N12009.csv", encoding='latin1') # 7288, (Country, Cases, Deaths, Update time)
ebola = pd.read_csv(path + "ebola_2014_2016_clean.csv") # 24850 (Country, Date, Confirmed cases, Death cases)
sars = pd.read_csv(path + "sars_2003_clean.csv")

In [59]:
#collapse-hide

# cov19 preprocess
cov19 = cov19[['Date', 'Country', 'Confirmed', 'Deaths', 'Recovered']]
cov19['Country'] = cov19['Country'].replace('Mainland China', 'China')
cov19.columns = ['Date', 'Country', 'Cases', 'Deaths', 'Recovered']
cov19 = cov19.groupby(['Date', 'Country'])[['Cases', 'Deaths', 'Recovered']]
cov19 = cov19.sum().reset_index()

# ebola preprocess
ebola = ebola[['Date', 'Country', 'No. of confirmed, probable and suspected cases',
                     'No. of confirmed, probable and suspected deaths']]
ebola.columns = ['Date', 'Country', 'Cases', 'Deaths']
ebola = ebola.groupby(['Date', 'Country'])[['Cases', 'Deaths']]
ebola = ebola.sum().reset_index()
ebola['Cases'] = ebola['Cases'].fillna(0)
ebola['Deaths'] = ebola['Deaths'].fillna(0)
ebola['Cases'] = ebola['Cases'].astype('int')
ebola['Deaths'] = ebola['Deaths'].astype('int')

# h1n1 preprocess
h1n1 = h1n1[['Update Time', 'Country', 'Cases', 'Deaths']]
h1n1.columns = ['Date', 'Country', 'Cases', 'Deaths']
h1n1 = h1n1.groupby(['Date', 'Country'])[['Cases', 'Deaths']]
h1n1 = h1n1.sum().reset_index()

# sars preprocess

sars = sars[['Date', 'Country', 'Cumulative number of case(s)', 
                   'Number of deaths', 'Number recovered']]
sars.columns = ['Date', 'Country', 'Cases', 'Deaths', 'Recovered']
sars = sars.groupby(['Date', 'Country'])[['Cases', 'Deaths', 'Recovered']]
sars = sars.sum().reset_index()

In [62]:
# hide
dataset = {'nCovid-19': cov19, 'H1N1': h1n1, 'EBOLA': ebola, 'SARS': sars}
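
The four preprocessing blocks above repeat the same three steps: select columns, rename them, and aggregate per date and country. As a minimal sketch, those steps could be wrapped in one helper. The names `harmonize`, `column_map` and `value_cols` are hypothetical and not part of the notebook, and the toy table uses made-up numbers.

```python
import pandas as pd

def harmonize(df, column_map, value_cols):
    """Select, rename and aggregate one raw table into the shared schema.

    `harmonize`, `column_map` and `value_cols` are hypothetical names
    introduced for this sketch only.
    """
    out = df[list(column_map)].rename(columns=column_map)
    # One row per (Date, Country); missing counts are skipped by sum().
    out = out.groupby(['Date', 'Country'], as_index=False)[value_cols].sum()
    out[value_cols] = out[value_cols].fillna(0).astype(int)
    return out

# Toy stand-in for a raw table (made-up numbers, not project data).
raw = pd.DataFrame({
    'Update Time': ['d1', 'd1', 'd2'],
    'Country': ['A', 'A', 'B'],
    'Cases': [1, 2, 5],
    'Deaths': [0.0, None, 1.0],
})
tidy = harmonize(
    raw,
    {'Update Time': 'Date', 'Country': 'Country',
     'Cases': 'Cases', 'Deaths': 'Deaths'},
    ['Cases', 'Deaths'],
)
print(tidy)
```

Each dataset would then need only its own column mapping, which keeps the shared schema in one place.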

**Novel Covid-19**

In [36]:
# hide_input
cov19.head()

|   | Date | Country | Cases | Deaths | Recovered |
|---|---|---|---|---|---|
| 0 | 2020-01-22 | Afghanistan | 0 | 0 | 0 |
| 1 | 2020-01-22 | Albania | 0 | 0 | 0 |
| 2 | 2020-01-22 | Algeria | 0 | 0 | 0 |
| 3 | 2020-01-22 | Andorra | 0 | 0 | 0 |
| 4 | 2020-01-22 | Angola | 0 | 0 | 0 |


**EBOLA**

In [37]:
# hide_input
ebola.head()

|   | Date | Country | Cases | Deaths |
|---|---|---|---|---|
| 0 | 2014-08-29 | Guinea | 648 | 430 |
| 1 | 2014-08-29 | Liberia | 1378 | 694 |
| 2 | 2014-08-29 | Nigeria | 19 | 7 |
| 3 | 2014-08-29 | Sierra Leone | 1026 | 422 |
| 4 | 2014-09-05 | Guinea | 812 | 517 |


**H1N1**

In [66]:
# hide_input
h1n1.head()

|   | Date | Country | Cases | Deaths |
|---|---|---|---|---|
| 0 | 5/23/2009 8:00 | Argentina | 1 | 0.0 |
| 1 | 5/23/2009 8:00 | Australia | 12 | 0.0 |
| 2 | 5/23/2009 8:00 | Austria | 1 | 0.0 |
| 3 | 5/23/2009 8:00 | Belgium | 7 | 0.0 |
| 4 | 5/23/2009 8:00 | Brazil | 8 | 0.0 |


**SARS**

In [39]:
# hide_input
sars.head()

|   | Date | Country | Cases | Deaths | Recovered |
|---|---|---|---|---|---|
| 0 | 2003-03-17 | Canada | 8 | 2 | 0 |
| 1 | 2003-03-17 | Germany | 1 | 0 | 0 |
| 2 | 2003-03-17 | Hong Kong SAR, China | 95 | 1 | 0 |
| 3 | 2003-03-17 | Singapore | 20 | 0 | 0 |
| 4 | 2003-03-17 | Switzerland | 2 | 0 | 0 |


## Basic Data Analysis

Size of each dataset, measured as the total number of cells (rows × columns) reported by `DataFrame.size`:

In [63]:
# hide_input
for name, data in dataset.items():
    print(name + ' with size of ', data.size)

nCovid-19 with size of  76775
H1N1 with size of  7288
EBOLA with size of  9516
SARS with size of  12685
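
Since `DataFrame.size` counts cells rather than rows, the numbers above depend on how many columns each dataset keeps. For instance, the H1N1 figure of 7288 over its four columns would correspond to 1822 rows. A small sketch on a toy frame (made-up values) shows how the two measures relate:

```python
import pandas as pd

# Toy frame: 3 rows x 2 columns (made-up values).
toy = pd.DataFrame({'Cases': [1, 2, 3], 'Deaths': [0, 0, 1]})

n_cells = toy.size          # rows * columns
n_rows, n_cols = toy.shape  # (rows, columns)

print(n_cells, n_rows, n_cols)
```

Using `len(data)` or `data.shape[0]` in the loop above would report row counts, which are easier to compare across datasets with different numbers of columns.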


In [64]:
# collapse-hide

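def averagePerDay(disease, attr='Cases'):
    # Largest value of `attr` found in any single (Date, Country) row,
    # divided by the number of distinct report dates in the dataset.
    allCases = disease[attr].max()
    days = len(disease.Date.unique())
    return allCases * 1.0 / days

def getTimeSpan(disease):
    # Returns (first date, last date, number of distinct report dates).
    return (disease.Date.min(), disease.Date.max(), len(disease.Date.unique()))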

print("Time span for our data:")
for key, value in dataset.items():
    start, end, span = getTimeSpan(value)
    print(key, 'starts from ', start, ' ends at ', end, ' duration ', span, 'days')

print()
print("Average confirmed cases per day for each disease")
for key, value in dataset.items():
    cases = averagePerDay(value)
    print(key, 'average confirmed', cases)
print()
print("Average death cases per day for each disease")
for key, value in dataset.items():
    cases = averagePerDay(value, 'Deaths')
    print(key, 'average death', cases)

Time span for our data:
nCovid-19 starts from  2020-01-22 00:00:00  ends at  2020-04-13 00:00:00  duration  83 days
H1N1 starts from  5/23/2009 8:00  ends at  7/6/2009 9:00  duration  22 days
EBOLA starts from  2014-08-29  ends at  2016-03-23  duration  259 days
SARS starts from  2003-03-17  ends at  2003-07-11  duration  96 days

Average confirmed cases per day for each disease
nCovid-19 average confirmed 6995.409638554217
H1N1 average confirmed 4296.0
EBOLA average confirmed 54.52509652509652
SARS average confirmed 55.510416666666664

Average death cases per day for each disease
nCovid-19 average death 283.48192771084337
H1N1 average death 19.5
EBOLA average death 18.583011583011583
SARS average death 3.625
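
The H1N1 dates are printed above in their raw `M/D/YYYY H:MM` form, which suggests that this column is still text. In that case `min()` and `max()` compare strings character by character, and the reported "duration" counts distinct timestamps rather than calendar days. The sketch below, on made-up timestamps, shows the difference after parsing. The format string is an assumption based on the sample rows shown earlier, and the October value is hypothetical.

```python
import pandas as pd

# Toy timestamps in the raw text form; the last one is hypothetical.
raw = pd.Series(['5/23/2009 8:00', '6/1/2009 9:00',
                 '7/6/2009 9:00', '10/2/2009 9:00'])

# As text, '10/2/2009 9:00' sorts before '5/23/2009 8:00'.
text_min = raw.min()

# Parsed, the ordering is chronological (format string is an assumption).
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')
days = parsed.dt.normalize()                      # drop the time of day
span_days = (days.max() - days.min()).days + 1    # inclusive calendar span
n_reports = days.nunique()                        # distinct reporting days

print(text_min, parsed.min(), span_days, n_reports)
```

Applying the same conversion to `h1n1['Date']` before grouping would also make the per-date aggregation sort chronologically.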


# Comparison with Other Infectious Viruses

The three other viruses we chose to compare with Covid-19 are H1N1, SARS and EBOLA.

## Lethal Ability

In [47]:
# hide
c_dbd = cov19.groupby('Date')[['Cases', 'Deaths', 'Recovered']].sum().reset_index()
h_dbd = h1n1.groupby('Date')[['Cases', 'Deaths']].sum().reset_index()
e_dbd = ebola.groupby('Date')[['Cases', 'Deaths']].sum().reset_index()
s_dbd = sars.groupby('Date')[['Cases', 'Deaths', 'Recovered']].sum().reset_index()

In [49]:
# collapse-hide
# data for line chart
conDeaths = pd.concat([c_dbd['Deaths'], h_dbd['Deaths'], e_dbd['Deaths'], s_dbd['Deaths']],
                      axis=1, keys=['Covid-19', 'H1N1', 'EBOLA', 'SARS'])
conDeaths.to_csv('compDeath.csv', index=True)

<div class="flourish-embed" data-src="story/261608" data-url="https://flo.uri.sh/story/261608/embed"><script src="https://public.flourish.studio/resources/embed.js"></script></div>
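
A note on how the line-chart data is aligned: `pd.concat(..., axis=1)` matches rows by index, and after `reset_index()` each per-date table is numbered 0, 1, 2, and so on. Row *i* of `compDeath.csv` therefore holds the *i*-th report of each outbreak, not a shared calendar date, and shorter outbreaks are padded with missing values. A minimal sketch with made-up totals:

```python
import pandas as pd

# Made-up death totals for two outbreaks with different numbers of reports.
a = pd.Series([1, 3, 6], name='Deaths')
b = pd.Series([2, 4], name='Deaths')

aligned = pd.concat([a, b], axis=1, keys=['Outbreak A', 'Outbreak B'])
aligned.index.name = 'Report'   # hypothetical label for the positional index
print(aligned)
```

The third row of `Outbreak B` is `NaN` because that series has only two reports.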

## Lethal Rate

TODO

## Overall Comparison

In [51]:
# collapse-hide
# data for bubble chart
header = 'Virus, No. of countries, Time Duration, Confirmed, Deaths, Recovered'
# total confirmed, total deaths, total recovered, time duration, No. of countries
with open('buble.csv', 'w') as f:
    f.write(header + '\n')
    for name, data in dataset.items():
        line = [name]
        line.append(str(len(data.Country.unique())))
        line.append(str(len(data.Date.unique())))
        if hasattr(data, 'Cases'):
            line.append(str(data.Cases.sum()))
        else:
            line.append(str(data.Confirmed.sum()))
        line.append(str(data.Deaths.sum()))
        if hasattr(data, 'Recovered'):
            line.append(str(data.Recovered.sum()))
        else:
            line.append('')
        f.write(','.join(line) + '\n')

<div class="flourish-embed flourish-scatter" data-src="visualisation/1913459" data-url="https://flo.uri.sh/visualisation/1913459/embed"><script src="https://public.flourish.studio/resources/embed.js"></script></div>
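
One caveat about the totals written above: if the count columns are cumulative, as the SARS source column `Cumulative number of case(s)` suggests, then summing over every date counts earlier cases repeatedly. A hedged alternative is to take each country's latest report and sum those. The sketch below uses made-up numbers, and `summarize` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up cumulative counts for two countries over three dates.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'] * 2),
    'Country': ['A'] * 3 + ['B'] * 3,
    'Cases': [1, 4, 9, 0, 2, 5],
    'Deaths': [0, 1, 2, 0, 0, 1],
})

def summarize(name, df):
    """Hypothetical helper: one summary row per outbreak."""
    # Latest report per country; needs dates that sort chronologically.
    latest = df.sort_values('Date').groupby('Country').tail(1)
    return {
        'Virus': name,
        'No. of countries': df['Country'].nunique(),
        'Time Duration': df['Date'].nunique(),
        'Confirmed': int(latest['Cases'].sum()),
        'Deaths': int(latest['Deaths'].sum()),
    }

summary = pd.DataFrame([summarize('Toy virus', toy)])
naive_total = int(toy['Cases'].sum())   # sums every cumulative value
print(summary.to_csv(index=False))
print('naive sum of Cases:', naive_total)
```

Building the rows as dictionaries and writing them with `to_csv` also avoids assembling CSV lines by hand.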

# Covid-19 Analysis

In [57]:
# collapse-hide
# data for bar race chart
countries = cov19.Country.unique()
length = len(cov19.loc[cov19.Country == 'China'].Cases)
header = ['Day' + str(i) for i in range(1, length+1)]
header.insert(0, 'Country')
with open('COV 19.csv', 'w') as f:
    f.write(','.join(header) + '\n')
    for country in countries:
        line = [str(i) for i in cov19.loc[cov19.Country == country].Cases]
        line.insert(0, country)
        f.write(','.join(line) + '\n')

<div class="flourish-embed flourish-bar-chart-race" data-src="visualisation/1908357" data-url="https://flo.uri.sh/visualisation/1908357/embed"><script src="https://public.flourish.studio/resources/embed.js"></script></div>
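
The wide "one row per country, one column per day" layout used for the bar chart race can also be produced with `DataFrame.pivot`. This is a sketch on made-up data with the same column names as `cov19`, not the code used for the chart above.

```python
import pandas as pd

# Made-up long-format data with the same columns as `cov19`.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-22', '2020-01-23'] * 2),
    'Country': ['A', 'A', 'B', 'B'],
    'Cases': [0, 1, 2, 5],
})

# Rows become countries, columns become dates in chronological order.
wide = toy.pivot(index='Country', columns='Date', values='Cases')
wide.columns = ['Day' + str(i) for i in range(1, wide.shape[1] + 1)]

csv_text = wide.to_csv()
print(csv_text)
```

Because `to_csv` quotes fields when needed, a country name that contains a comma (as `Hong Kong SAR, China` does in the SARS table) would stay in a single column, whereas a manual `','.join` would split it.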

## Distribution

TODO

In [92]:
import altair as alt
from vega_datasets import data

source = data.stocks()

highlight = alt.selection(type='single', on='mouseover',
                          fields=['symbol'], nearest=True)

base = alt.Chart(source).encode(
    x='date:T',
    y='price:Q',
    color='symbol:N'
)

points = base.mark_circle().encode(
    opacity=alt.value(0)
).add_selection(
    highlight
).properties(
    width=600
)

lines = base.mark_line().encode(
    size=alt.condition(~highlight, alt.value(1), alt.value(3))
)

points + lines

In [95]:
source

|   | symbol | date | price |
|---|---|---|---|
| 0 | MSFT | 2000-01-01 | 39.81 |
| 1 | MSFT | 2000-02-01 | 36.35 |
| 2 | MSFT | 2000-03-01 | 43.22 |
| 3 | MSFT | 2000-04-01 | 28.37 |
| 4 | MSFT | 2000-05-01 | 25.45 |
| ... | ... | ... | ... |
| 555 | AAPL | 2009-11-01 | 199.91 |
| 556 | AAPL | 2009-12-01 | 210.73 |
| 557 | AAPL | 2010-01-01 | 192.06 |
| 558 | AAPL | 2010-02-01 | 204.62 |
