# Introduction to Pandas, Numpy, Plotly using COVID-19 data

This notebook uses data from the [2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19).

In [1]:
import datetime
print(f"Last executed {datetime.datetime.now()}")

Last executed 2020-04-04 16:31:19.389751


In [2]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

## Importing data

In [4]:
def geturl(var):
    # The repository renamed its time-series files in March 2020; the old
    # "time_series_19-covid-{var}.csv" paths now return HTTP 404.
    return f"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_{var.lower()}_global.csv"

df = pd.read_csv(geturl("Confirmed")) # also try "Deaths" and "Recovered"

In [None]:
df

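Choose below which data to consider by running one of the next two cells: the first selects a place by `Province/State`, the second by `Country/Region`. Each cell takes the first matching row of the table and keeps only the columns from the fifth onwards (i.e. it ignores the first four, which do not contain case counts).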
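
To make the row and column selection concrete, here is a minimal sketch on a hypothetical miniature table. The name `toy` and all the numbers in it are invented for illustration; only the two column names `Province/State` and `Country/Region` come from the cells in this notebook.

```python
import pandas as pd

# Hypothetical miniature table (made-up values): four descriptive columns
# followed by one column per date.
toy = pd.DataFrame({
    "Province/State": ["Beijing", None],
    "Country/Region": ["China", "Italy"],
    "Lat": [40.18, 43.0],
    "Long": [116.41, 12.0],
    "1/22/20": [14, 0],
    "1/23/20": [22, 0],
    "1/24/20": [36, 2],
})

# The boolean mask keeps only the rows for the chosen place.
mask = toy["Country/Region"] == "Italy"

# .iloc[0, 4:] takes the first matching row and skips the first four columns.
toy_cases = toy[mask].iloc[0, 4:]
print(toy_cases)
```

The result is a Series whose index holds the date labels and whose values hold the counts, which is why the next cells can split it into `dates` and `cases`.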

In [None]:
place = "Beijing"
cases = df[df.loc[:,"Province/State"] == place].iloc[0,4:]

In [None]:
place = "Italy"
cases = df[df.loc[:,"Country/Region"] == place].iloc[0,4:]

In [None]:
dates = cases.index 
cases = cases.values.astype(int)

In [None]:
dates # Note: this is just a sequence of strings
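
Since the dates are plain strings, they can be converted to real datetimes when needed. The sketch below assumes the labels are written as month/day/two-digit year (for example `1/22/20`); the sample labels are hypothetical, so check the actual column names before relying on this format.

```python
import pandas as pd

# Hypothetical sample of date labels in month/day/two-digit-year form.
sample_dates = pd.Index(["1/22/20", "1/23/20", "2/1/20"])

# An explicit format removes any ambiguity between month-first and day-first.
parsed = pd.to_datetime(sample_dates, format="%m/%d/%y")
print(parsed)
```

With datetimes on the x axis, Plotly can treat the axis as a time axis rather than as a list of category labels.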

In [None]:
cases

In [None]:
for date,case in zip(dates,cases):
    print(f"{date:8s} {case}")

In [None]:
cases.shape

## [Data wrangling](https://en.wikipedia.org/wiki/Data_wrangling)

For each day, compute the number of new cases with respect to the previous day.

In [None]:
newcases = np.zeros(cases.shape, dtype=int)
for i in range(1,len(cases)):
    newcases[i] = cases[i] - cases[i-1]
newcases

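The same result can be obtained without an explicit loop by using `np.diff` (see the [documentation of `np.diff`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html)). Passing `prepend=cases[0]` keeps the output the same length as `cases` and makes the first entry zero, matching the loop above.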

In [None]:
newcases = np.diff(cases, prepend=cases[0])
newcases
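
As a quick check that the loop and `np.diff` agree, the sketch below runs both on a small array of made-up cumulative counts (`toy_cases` is hypothetical, not real data).

```python
import numpy as np

# Made-up cumulative counts, for illustration only.
toy_cases = np.array([2, 2, 5, 9, 9, 14])

# Explicit loop: the first day is left at zero.
loop_new = np.zeros(toy_cases.shape, dtype=int)
for i in range(1, len(toy_cases)):
    loop_new[i] = toy_cases[i] - toy_cases[i - 1]

# Vectorized: prepending the first value makes the first difference zero.
vec_new = np.diff(toy_cases, prepend=toy_cases[0])

print(vec_new)  # [0 0 3 4 0 5]
print(np.array_equal(loop_new, vec_new))  # True
```

Adding the first value back to the running sum of the differences recovers the original array: `toy_cases[0] + np.cumsum(vec_new)`.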

## Visualization

In [None]:
traces = [
    go.Scatter(x=dates, y=cases, name="total number of cases"),
    go.Bar(x=dates, y=newcases, name="new cases for the day")
]
layout = go.Layout(title=f"COVID-19 cases in {place}",
                   xaxis=dict(title="date"),
                   yaxis={'title':'number of people'})
fig = go.Figure(data=traces, layout=layout)
fig

# Exercises

## 1
Repeat the analysis above for all of Mainland China, i.e. by summing over every row whose `Country/Region` matches (check the exact label used in the data).
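
As a hint, one possible approach is to select every row of the country and sum the date columns. The sketch below uses a hypothetical miniature table with invented numbers; the label `"Mainland China"` is taken from the exercise text and may differ from the one in the downloaded data.

```python
import pandas as pd

# Hypothetical miniature table with made-up counts.
toy = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["Mainland China", "Mainland China", "Italy"],
    "Lat": [30.97, 40.18, 43.0],
    "Long": [112.27, 116.41, 12.0],
    "1/22/20": [444, 14, 0],
    "1/23/20": [444, 22, 0],
    "1/24/20": [549, 36, 2],
})

# Keep all rows of the country, drop the first four columns,
# then add up the provinces date by date.
rows = toy[toy["Country/Region"] == "Mainland China"]
country_cases = rows.iloc[:, 4:].sum(axis=0)
print(country_cases)
```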

## 2
Add the number of recovered people and the number of deaths to the plots above.

## 3
For all places with at least 100 confirmed cases, compute the mortality rate, i.e. #deaths / #confirmed.  Rank the different places from the place where the virus proves most deadly to least, and visualize this information as a horizontal bar graph.  Also see [an interesting note on mortality rate](https://www.worldometers.info/coronavirus/coronavirus-death-rate/).
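
As a hint for the filtering and ranking steps, here is a minimal sketch on made-up totals. The place names `A` to `D`, the numbers, and the names `totals`, `eligible` and `ranked` are all hypothetical; building the real totals from the downloaded tables is left as part of the exercise.

```python
import pandas as pd

# Made-up latest totals per place, for illustration only.
totals = pd.DataFrame(
    {"confirmed": [1000, 250, 80, 400], "deaths": [50, 5, 8, 40]},
    index=["A", "B", "C", "D"],
)

# Keep places with at least 100 confirmed cases,
# then compute the ratio deaths / confirmed.
eligible = totals[totals["confirmed"] >= 100].copy()
eligible["mortality"] = eligible["deaths"] / eligible["confirmed"]

# Rank from the highest ratio to the lowest.
ranked = eligible.sort_values("mortality", ascending=False)
print(ranked)
```

For the plot, `go.Bar` accepts `orientation="h"`, in which case the rates go on `x` and the place names on `y`.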

# Some solutions
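Note the improvements over the code above: the date labels are converted to datetimes, and the three series are combined into a single DataFrame.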

In [None]:
def getall(var):
    """ Returns a series for the variable in `var` (valid options: "Confirmed", "Deaths", "Recovered").
    The series reports the sum of cases over all places """
    df = pd.read_csv(geturl(var))
    s = df.iloc[:,4:].sum(axis=0)
    s.index = pd.to_datetime(s.index)
    return s

df = pd.concat(dict(confirmed=getall("Confirmed"),
                    dead=getall("Deaths"),
                    recovered=getall("Recovered")),
               axis=1)

df["sick"] = df["confirmed"] - df["dead"] - df["recovered"]
df["newconfirmed"] = df["confirmed"].diff()
df
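
One difference from the `np.diff` cell earlier is worth noting: `Series.diff()` has no previous value for the first row, so it leaves `NaN` there and the column becomes floating point. The sketch below shows this on a small made-up series and one possible way to get integer counts back; filling with zero is an assumption about what the first day should show, not something the data dictates.

```python
import pandas as pd

# Made-up cumulative counts on consecutive days.
s = pd.Series(
    [2, 2, 5, 9],
    index=pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"]),
)

d = s.diff()                         # first entry is NaN, dtype is float
d_filled = d.fillna(0).astype(int)   # assumption: treat the first day as zero new cases

print(d)
print(d_filled)
```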

In [None]:
traces = [
    go.Scatter(x=df.index, y=df["confirmed"], name="confirmed", line=dict(color='black', width=2)),
    go.Scatter(x=df.index, y=df["dead"], name="dead", line=dict(color='gray', width=2)),
    go.Scatter(x=df.index, y=df["sick"], name="sick", line=dict(color='green', width=4)),
    go.Bar(x=df.index, y=df["newconfirmed"], name="newconfirmed", marker_color="red")
]
layout = go.Layout(title="COVID-19 cases",
                   xaxis=dict(title="date"),
                   yaxis=dict(title='number of people'))
fig = go.Figure(data=traces, layout=layout)
fig