# Assignment 2: Identifying Unreported Cases of COVID-19

This assignment was adapted from [this paper](https://arxiv.org/abs/2006.02127), titled _Data-driven Identification of Number of Unreported Cases for COVID-19: Bounds and Limitations_. Feel free to read the paper if you like, but you will not need to know anything in the paper that is not explained here.

In [None]:
import pandas as pd
import numpy as np
import torch
import matplotlib.pyplot as plt


## Goal

The goal of this assignment is to come up with a reasonable, data-driven estimate of how many people have COVID-19. You will have access to data on the number of _reported_ cases, but the reported numbers are, of course, underestimates because some infected people go unreported.

It seems reasonable to assume that the number of infected people is proportional to the number reported:

$$
\text{Number infected} = k \times \text{Number Reported}, \qquad k > 1
$$

Therefore, we just need to determine how many infected people a single reported case represents.

## Model

Recall that epidemiological models can have compartments representing groups of people with transition rates or probabilities between the compartments. In this assignment, we will have an _S_ (susceptible), an _I_ (infected), and an _R_ (reported) compartment. There's also a compartment for immune / isolated people who are, naturally, not connected to any other compartment. Let's call the total number of people in the population $N$.

|Variable|Meaning|
|--------|-------|
|$S_t$|No. susceptible people at time $t$|
|$I_t$|No. infected people at time $t$|
|$R_t$|No. reported people at time $t$|
|$N$|Total people in population|

|Transition|Associated Rate|
|----------|:---------------:|
|$S \to I$|Infection rate $\beta$|
|$I \to R$|Reporting probability $\gamma$|


At time $t$, the number of infected people $I_t$ changes in the following way (this is just part of the model, so we will not justify it too much). We'll measure $t$ in days, so $I_{t-7}$ is the value a week before $t$.

$$
\Delta I_t = \frac{S_t}{N}\beta \left( I_{t} - I_{t-7} \right).
$$

But we won't have access to the number of infected people, so we need to rewrite this in terms of reported numbers only. I will wave my mathematical wand and tell you that

\begin{equation}
\Delta R_t = \left(1 - \frac{R_t}{\gamma(1-\rho)N} \right)\beta \left( R_{t} - R_{t-7} \right). \tag{1}
\end{equation}

In the equation above, $\rho$ is the fraction of people who are immune / isolated from COVID and therefore play no part in its spread. We have no information about who is immune or isolated, so the best we can do is absorb the constant and learn $\bar\gamma = \gamma(1-\rho)$ from the data. Since $0 \leq \rho \leq 1$, the empirical value we find can never exceed the true reporting probability: $\bar\gamma \leq \gamma$.

A derivation is included in section 3.1 of the paper if you're curious.
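
To get a feel for equation (1), the sketch below evaluates its right-hand side for a single day using plain Python floats. Every number here (`beta`, `gamma_bar`, `N`, and the case counts) is made up for illustration and is not fitted to any data, and `delta_R` is a hypothetical helper name rather than part of the interface you are asked to implement.

```python
def delta_R(R_t, R_t_minus_7, beta, gamma_bar, N):
    """Right-hand side of equation (1) for one day, with gamma_bar = gamma * (1 - rho)."""
    # This factor shrinks to 0 as R_t approaches gamma_bar * N.
    saturation = 1.0 - R_t / (gamma_bar * N)
    return saturation * beta * (R_t - R_t_minus_7)

# Hypothetical values, chosen only to make the arithmetic easy to follow.
N = 1_000_000
beta = 0.2
gamma_bar = 0.1  # so gamma_bar * N = 100,000

early = delta_R(R_t=1_000, R_t_minus_7=400, beta=beta, gamma_bar=gamma_bar, N=N)
late = delta_R(R_t=99_000, R_t_minus_7=98_400, beta=beta, gamma_bar=gamma_bar, N=N)
print(early, late)
```

With the same weekly increase of 600 reports, the predicted daily change drops from roughly 118.8 to roughly 1.2 as $R_t$ approaches $\bar\gamma N$, which is how the model makes the curve level off.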

## Theory into practice

Now it's your turn to start implementing these ideas! There are only a few big steps:

1. Write code to do the computations according to the model equations.
2. Use the case reports to find the parameters that best describe the data.
3. Visualize / interpret our results.

In [None]:
def model(Xt, Rt, beta, gamma, N):
    """
    Xt: an array of the values R_t - R_(t-7), one for every time t
    Rt: the number of reported cases at each time t
    beta: infection rate
    gamma: reporting probability
    N: total number of people in the population
    
    Return Delta R_t given the data and the model parameters.
    In other words, implement equation (1).
    """
    ### YOUR CODE HERE
    pass

def diff_with_delay(x, J):
    """
    Returns the differences in x over a window of size J.
    x is an array representing data sampled at regular intervals.
    Therefore if J = 1, you would return [ x[1] - x[0], x[2] - x[1], ... ]
    And if J = 2, you would return [ x[2] - x[0], x[3] - x[1], ... ]
    """
    ### YOUR CODE HERE
    pass


def least_squares_error(Xt, Rt, beta, gamma, N):
    preds       = model(Xt, Rt, beta, gamma, N)
    delta_Rt    = (Rt[1:] - Rt[:-1])[6:]
    ### YOUR CODE HERE
    # Take a look at https://pytorch.org/docs/stable/nn.html to find a suitable loss function.
    # We want the sum of the squared differences between our predictions and the given values, perhaps multiplied by a constant.
    # Set lossFn to a loss class from torch.nn. Do not instantiate or call it here; that happens on the line below.
    lossFn = None
    
    loss = lossFn(reduction='sum')( preds, delta_Rt )
    return loss
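
As a reference for what the loss should measure, here is a minimal NumPy sketch of a sum of squared differences. The arrays are made-up numbers and `sum_squared_error` is a hypothetical helper. The assignment itself expects a loss from `torch.nn`, so that gradients can flow back to the parameters.

```python
import numpy as np

def sum_squared_error(preds, targets):
    """Sum over all entries of (prediction - target)^2."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sum((preds - targets) ** 2))

# Made-up predictions and observations for three days.
example_preds = [10.0, 12.0, 15.0]
example_targets = [11.0, 12.0, 13.0]
print(sum_squared_error(example_preds, example_targets))  # (-1)^2 + 0^2 + 2^2 = 5.0
```

Note that the two arrays must have the same length, which is why `delta_Rt` is trimmed in `least_squares_error`.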

In [None]:
# No need to touch this cell. We are providing this for you.
# Retrieved from https://stackoverflow.com/questions/40443020/matlabs-smooth-implementation-n-point-moving-average-in-numpy-python
def smooth(a,WSZ):
    # a: NumPy 1-D array containing the data to be smoothed
    # WSZ: smoothing window size, which must be an odd number,
    # as in the original MATLAB implementation
    out0 = np.convolve(a,np.ones(WSZ,dtype=int),'valid')/WSZ    
    r = np.arange(1,WSZ-1,2)
    start = np.cumsum(a[:WSZ-1])[::2]/r
    stop = (np.cumsum(a[:-WSZ:-1])[::2]/r)[::-1]
    return np.concatenate((  start , out0, stop  ))
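
If the vectorized code above is hard to read, the loop below is a sketch of the behaviour it is intended to mirror: a centered moving average whose window shrinks symmetrically near the two ends, so the output has the same length as the input. `centered_average` is a hypothetical name used only for this illustration, and the toy data is made up.

```python
import numpy as np

def centered_average(a, window):
    """Centered moving average whose window shrinks symmetrically near the edges."""
    a = np.asarray(a, dtype=float)
    half = (window - 1) // 2
    out = np.empty_like(a)
    for i in range(len(a)):
        h = min(half, i, len(a) - 1 - i)  # largest symmetric half-width that fits
        out[i] = a[i - h : i + h + 1].mean()
    return out

toy = [1.0, 2.0, 4.0, 8.0, 16.0]
smoothed_toy = centered_average(toy, 3)
print(smoothed_toy)
```

The first and last entries are left unchanged (window of size 1), while interior entries are averaged with their neighbours.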

We will now consider the cases in New York State. The code below loads the time series data and puts the New York cases in the `data` variable. $N$ is the population of New York.

In [None]:
df = pd.read_csv("./time_series_covid19_confirmed_US.csv")
data = df[ df['Province_State'] == "New York" ].sum() 

### YOUR CODE HERE
# What does N represent again? (We discussed this at the top of the notebook)
# Use the U.S. Census Bureau rather than the first number Google shows.
N = None

# J is the window we will use. Leave as 7.
J = 7

In [None]:
# data.keys() contains a lot of fields from the reports. 
# Select a slice of data.keys() so that it starts on the date 2/25/20 and ends on the date 7/6/20.

### YOUR CODE HERE
start = None
end = None

date_keys = data.keys()[start:end]
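
One way to think about the slice is sketched below on a toy list of labels. The labels are hypothetical (a few metadata fields followed by dates written like `1/23/20`), so check the real output of `data.keys()` before relying on this pattern.

```python
# Hypothetical column labels: a few metadata fields followed by daily dates.
keys = ["UID", "Admin2", "Province_State", "1/22/20", "1/23/20", "1/24/20", "1/25/20"]

first = keys.index("1/23/20")      # position of the first date we want
last = keys.index("1/24/20") + 1   # slices exclude the end index, so add 1

print(keys[first:last])  # ['1/23/20', '1/24/20']
```

The main pitfall is the end of the slice: without the `+ 1`, the final date would be left out.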

In [None]:
# This cell just sets up variables useful later, but you don't need to do anything here.
total_time_interval = [1, len(date_keys)]
raw_total_cases = data[date_keys].to_numpy(dtype=np.float32)

print(data[date_keys])

Rt = smooth(data[start:end].to_numpy(dtype=np.float32), 7)

training_interval = [ total_time_interval[1] - len(Rt) , total_time_interval[1] -1 ]
model_interval = [training_interval[0] + 6, training_interval[1]]
Xt = diff_with_delay(Rt, J)

In [None]:
beta = torch.tensor(np.random.rand() * 10 , requires_grad=True)
gamma = torch.tensor(np.random.rand(),  requires_grad=True)

Rt = torch.tensor(Rt, dtype=torch.double)
Xt = torch.tensor(Xt, dtype=torch.double)

In [None]:
# This cell will iteratively adjust the model parameters to minimize model error.

# What two variables do we want to optimize over? Hint: what controls the future predictions?
### YOUR CODE HERE
params = [None, None]

optimizer = torch.optim.Adam( params , lr=0.01 )

steps = 4000
for t in range(steps):
    loss = least_squares_error(Xt, Rt, beta, gamma, N)
    if t % 100 == 0:
        print("After {} steps, the model's error = {}".format(t, loss.item()))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
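
The loop above follows the usual pattern of "compute the loss, compute its gradient, nudge the parameters". As a stripped-down sketch of that idea, the example below fits a single slope with plain gradient descent in pure Python. This is a simpler technique than the Adam optimizer used in the notebook, and the data, step size, and step count are all hypothetical.

```python
# Toy data generated from y = 3 * x, so the best slope is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0    # initial guess for the slope
lr = 0.01  # step size (hypothetical value)

for step in range(500):
    # Loss is sum((w*x - y)^2); its derivative with respect to w is sum(2*(w*x - y)*x).
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad

print(w)
```

In the notebook, `loss.backward()` plays the role of the hand-written derivative, and `optimizer.step()` plays the role of the update `w -= lr * grad`.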

In [None]:
print(beta, gamma)

In [None]:
plt.plot(np.diff(raw_total_cases), label = 'New Cases (Raw Data)')
plt.plot(np.arange(*training_interval), np.diff(Rt), label = 'Training Data')
preds = model(Xt, Rt, beta, gamma, N).detach().numpy()
plt.plot(np.arange(*model_interval), preds, label = 'Model')
plt.ylabel("Number of Cases")
plt.xlabel("Days from 2/25/20")
plt.legend()
plt.show()

# Questions / Exercises

1. The plot above (hopefully) shows reasonable results from the model. How confident are you in the predictions that this model will make? How would you convince someone that this model's predictions are good? (What conclusion did we reach about comparing models during Lecture 2?)
2. What steps would you take to train a model and make a plot for the cases in Berkeley? Go ahead and try that. It should only require a few changes to this notebook.
3. What assumption are we making about the values of $\beta$ and $\gamma$ over time?

Put your answers in a markdown cell below.

This week's paper is [Training Classifiers with Natural Language Explanations](https://arxiv.org/pdf/1805.03818.pdf). Give it a read and answer a few simple questions in the form.

---

Question answers here