# $R_0$ Analysis

One of the key quantities of interest for any infectious pathogen is the _basic reproduction number_, denoted $R_0$.

This is the expected number of new cases caused by each case in a fully susceptible population.

Most estimates place this number for 2019-nCoV (the novel coronavirus) between 2 and 3. For example, Zhang et al. ([link](https://www.ncbi.nlm.nih.gov/pubmed/32097725)), who analyzed the outbreak on the Diamond Princess cruise ship, say:

> The Maximum-Likelihood (ML) value of R0 was 2.28 for COVID-19 outbreak at the early stage on the ship. The median with 95% confidence interval (CI) of R0 values was 2.28 (2.06-2.52) estimated by the bootstrap resampling method.

In this notebook, we try to reproduce these findings within Canada. That is, we run a hypothesis test: 

$$ H_0: R_0 \in [2.06, 2.52] $$ 

$$ H_A: R_0 < 2.06 \text{ or } R_0 > 2.52 $$ 

**Note:** This is known as a [composite hypothesis](https://en.wikipedia.org/wiki/P-value#For_composite_hypothesis), because we're testing against a set of values (in this case an interval), rather than a single number. 

As far as we can tell, the procedure is to pick the least favorable point in the null interval (the one that yields the largest p-value), and then run a standard point-null hypothesis test against that value.
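As a rough sketch of that idea (not necessarily the exact test run later in this notebook): for an interval null, the least favorable point is the point of the interval closest to the estimate, so the statistic is measured from the nearest endpoint. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration.

```julia
# Sketch: test statistic for the interval null H_0: R_0 in [lo, hi].
# `estimate` and `stderr_est` stand in for a regression coefficient and its
# standard error; both are assumptions of this example, not notebook outputs.

# Closest point of the null interval to the estimate (the least favorable point).
least_favorable(estimate, lo, hi) = clamp(estimate, lo, hi)

# Distance from the estimate to that point, in standard-error units.
# It is zero whenever the estimate already lies inside [lo, hi].
function interval_statistic(estimate, stderr_est, lo, hi)
    r_star = least_favorable(estimate, lo, hi)
    return (estimate - r_star) / stderr_est
end

# Hypothetical numbers, for illustration only
z_above = interval_statistic(3.0, 0.2, 2.06, 2.52)   # measured from the upper endpoint 2.52
z_inside = interval_statistic(2.3, 0.2, 2.06, 2.52)  # estimate lies inside the interval
```

An estimate inside the interval gives a statistic of zero, so the null can only be rejected when the estimate falls far enough outside $[2.06, 2.52]$ relative to its standard error.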

### Setup

In [1]:
# Load dependencies
using Pkg
pkg"activate .."
using JSON, Statistics, Plots, Dates, LsqFit, TimeSeries
using StatsModels, GLM, HypothesisTests

[32m[1m Activating[22m[39m environment at `~/classes/326/ECON-326-Project/Project.toml`


In [2]:
# Load data
data = JSON.parsefile("../data/covid_data.json");

### Data Filtering

There is a potential structural-break problem: if we run our regression against the entire timespan (from Jan. 22 to April 5), we ignore all the changes in the public health environment that have taken place, and a single $R_0$ is unlikely to describe the whole period.

For example, we would expect that the observed $\hat{R}_0$ in Canada is lower in recent weeks, as Canada has aggressively enforced social distancing.

So, we'll **remove all data** dated after the WHO pandemic declaration on **Mar. 11, 2020**.

In [6]:
# Unpack data
pandemic_day = Date("2020-03-11")
canada = data["Canada"]

# Keep only observations up to and including the pandemic declaration
early = [point for point in canada if Date(point["date"]) <= pandemic_day]

confirmed = [point["confirmed"] for point in early]
recovered = [point["recovered"] for point in early]
deaths = [point["deaths"] for point in early]
dates = Date.([point["date"] for point in early]);

### Regression Model

There is another difficulty, which is that there is not a uniform incubation period for the virus. 

However, the median incubation period is roughly 5 days (see [here](https://annals.org/aim/fullarticle/2762808/incubation-period-coronavirus-disease-2019-covid-19-from-publicly-reported), _Annals of Internal Medicine_), so we will use that as an approximation. 

Our guess for the true spread process is then:

$$ x_{t + 5} = R_0 x_t + \epsilon $$

If Canada were a closed system (i.e., there were no immigration), and the incubation period were exactly and uniformly 5 days, with each case causing exactly $R_0$ others, then $\epsilon = 0$.

Note that this implies (approximately) exponential spread. Treating the 5-day difference as a derivative:

$$\begin{align}
x_{t + 5} - x_t &= (R_0 - 1)x_t  \\ 
\frac{d x}{d t} &\approx \frac{1}{5} (R_0 - 1) x
\end{align}$$

This is a straightforward ODE with solution: 

$$ x(t) = \alpha e^{\frac{1}{5} (R_0 - 1) t} $$ 

with $\alpha$ an integration constant (the number of cases at $t = 0$).
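A quick numerical check on this step, as a sketch: the recursion $x_{t+5} = R_0 x_t$ (with $\epsilon = 0$) has the exact solution $x_t = \alpha R_0^{t/5}$, i.e. a daily growth rate of $\ln(R_0)/5$, whereas the ODE above gives $(R_0 - 1)/5$. The two agree only when $R_0$ is close to 1. The value 2.28 is the Zhang et al. point estimate quoted earlier; the variable names are ours.

```julia
# Compare the daily growth rate implied by the ODE approximation with the
# exact rate of the 5-day recursion x_{t+5} = R_0 * x_t.
R0 = 2.28      # point estimate quoted from Zhang et al.
period = 5     # assumed incubation period in days

rate_ode = (R0 - 1) / period      # from dx/dt = (R_0 - 1) x / 5
rate_exact = log(R0) / period     # from x_t = alpha * R_0^(t / 5)

# Doubling times in days under each rate
doubling_ode = log(2) / rate_ode
doubling_exact = log(2) / rate_exact

println("ODE rate:   ", round(rate_ode, digits = 3), " per day")
println("Exact rate: ", round(rate_exact, digits = 3), " per day")
```

For $R_0$ in the range considered here, the ODE rate is noticeably larger than the exact one, so the continuous-time formula is best read as a qualitative statement that growth is exponential.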

### Linear Regression and Assumptions

So, the exact regression we are running is simply: 

$$ y_i = \beta_1 x_i + \beta_0 + u_i $$

where the data are grouped as $(x_i, y_i) = (x_t, x_{t+5})$, so the slope $\beta_1$ plays the role of $R_0$ and $u_i$ is the error term.
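To make the grouping concrete, here is a minimal sketch that builds the lagged pairs and computes the closed-form OLS coefficients. The `cases` vector is synthetic (not the notebook's data), constructed so that $x_{t+5} = 2 x_t$ exactly; the variable names are ours.

```julia
using Statistics

# Synthetic cumulative counts (NOT real data), built so that x_{t+5} = 2 * x_t
cases = [10, 11, 12, 14, 16, 20, 22, 24, 28, 32, 40, 44, 48, 56, 64]
lag = 5

# Pair each observation with the one `lag` days later
x = cases[1:end-lag]      # x_t
y = cases[1+lag:end]      # x_{t+5}

# Closed-form OLS for y = beta_1 * x + beta_0
beta_1 = cov(x, y) / var(x)
beta_0 = mean(y) - beta_1 * mean(x)

println("slope = ", beta_1, ", intercept = ", beta_0)
```

Because the synthetic series doubles every 5 days by construction, the fitted slope recovers that factor and the intercept is numerically zero. With real data, the same pairing could be applied to the `confirmed` vector.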

Before we begin, recall the assumptions: 

1. $0 < \mathbb{E}X^4 < \infty$ and $0 < \mathbb{E}Y^4 < \infty$: This is true. We plotted the distributions earlier, and nothing looked truly extreme. Also, Canada has a finite population, and the case counts are small relative to it.

2. $\mathbb{E}[u_i | X_i] = 0$: This one is shaky. As $X_i$ increases, for example, $Y_i$ becomes less predictable (the 5-day incubation time was a rough estimate, and one that's liable to break down as more and more people get the virus.) But there's no easy way around it.

3. $(X_i, Y_i)$ are i.i.d. draws from their joint distribution: Certainly there wasn't any systematic sampling done (and the future 5 days away was unknowable when each $X_i$ was "sampled"). But perhaps the cutoff rule we used (there are only two time periods of interest: before Mar. 11, and after) is too coarse. 

   The key concern is serial dependence: the data point $(X_t, X_{t+5})$ is correlated with $(X_{t+5}, X_{t+10})$, and because the pairs overlap, they share observations. But short of using more sophisticated statistical techniques and multiple independent data sets, we can't see an obvious remedy.
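One simple way to gauge the serial dependence mentioned in the third assumption is the lag-1 autocorrelation of the regression residuals. The sketch below is a diagnostic only, not a fix; `lag_autocor` is a helper we define here, and the `resid` values are synthetic rather than taken from the notebook's regression.

```julia
using Statistics

# Sample autocorrelation of a series at lag k (simple correlation of the
# series with a shifted copy of itself; assumes length(v) > k + 1).
lag_autocor(v, k) = cor(v[1:end-k], v[1+k:end])

# Synthetic residuals (NOT from the notebook's regression): a slowly drifting
# sequence, which is the pattern serial dependence tends to produce.
resid = [-3.0, -2.5, -1.0, -0.5, 0.5, 1.0, 2.0, 2.5, 1.5, 0.5, -0.5, -1.5]

rho_1 = lag_autocor(resid, 1)
println("lag-1 autocorrelation: ", round(rho_1, digits = 2))
```

A value far from zero would suggest that the usual OLS standard errors are unreliable for this data.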

### Regression and Testing