# NYC data & social distancing fits (SIR)

This notebook compares SIR model fits of the NYC data with and without social distancing measures.

**TL;DR:** 

1. The data is described best if social distancing measures are in place.
2. Precise knowledge about new admissions is more important than knowledge about new infections, assuming a 15% relative uncertainty in new admissions and a 50% or larger uncertainty in new known cases.
3. To reliably run social distancing fits, a "bend" in the data (deviation from exponential growth) should be visible.
4. Once the bend is visible, predictions are consistent and fit precision seems limited by data. Predictions about a week ahead seem feasible; beyond that, the extrapolation uncertainty becomes large.

# Init

## Imports

In [None]:
from os import environ
from datetime import date, timedelta

from pandas import DataFrame, date_range, Series
from numpy import array, linspace, log, sqrt, exp, arange

import plotly.graph_objects as go
from plotly.subplots import make_subplots

from gvar import gvar, mean, sdev
from lsqfit import nonlinear_fit

from models import sir_step, one_minus_logistic_fcn, FitFcn
from utils.prepare_df import prepare_case_hosp_death, COMMIT_HASH_LAST
from utils.plotting import COLUMN_NAME_MAP, plot_fits, summarize_fit, plot_fit_range

print(COMMIT_HASH_LAST)

## Data

In [None]:
BIN_SIZE = 2
# See also the NYC-data-preparation notebook for an analysis of this choice

chd_df = prepare_case_hosp_death(
    COMMIT_HASH_LAST,  # Specify NYC repo commit hash to ensure same data
    bin_day_range=BIN_SIZE,  # How many days should be grouped as one
    drop_days_end=3,  # Drop rows whose date was within 3 days of reporting
).loc[
    "2020-03-08":"2020-04-01"
]  # Only consider given time range
chd_df.tail()

# Fitting

# Prepare fit

To fit the SIR model, I distinguish between two sets of parameters:

1. Fit data stored in the `XX` and `YY` variables. This data does not change after being specified once.
2. Fit parameters stored in a prior which specifies the initial belief. These parameters are optimized while fitting.

In [None]:
## This data is considered fixed in the fit
XX = {
    "initial_susceptible": int(8.6e6),  # https://en.wikipedia.org/wiki/New_York_City
    "initial_hospitalized": chd_df.hospitalized_cummulative.iloc[0],
    "initial_recovered": 0,
    "n_iter": chd_df.shape[0] + 1,  # Because of the bin size, one iteration = BIN_SIZE days
    "bin_size": BIN_SIZE,  # To convert units later on
}
print(XX)

For this fit to work, the `YY` variable must also quantify uncertainty.
I believe that daily new hospital admissions (`hospitalized_new`) are more accurate than the count of daily new infections (`infected_new`); I have no estimate of how accurate the latter is at all. Thus, I weight `hospitalized_new` more strongly than `infected_new`.
In this context, the $\chi^2$ per d.o.f. is not necessarily meaningful; the uncertainties rather quantify relative weights, i.e., how much you trust individual data points.

In [None]:
infected_new = chd_df["infected_new"].values
hospitalized_new = chd_df["hospitalized_new"].values

# This assumes that there is a 50% uncertainty in the number of infected people
## And at least 300 (if the number is small, to not emphasize early measurements too much)
delta_infected_new = [max(300, infected * 0.5) for infected in infected_new]

# This assumes that there is a 15% uncertainty in the number of hospitalized people (no minimum applied)
delta_hospitalized_new = [hospitalized * 0.15 for hospitalized in hospitalized_new]

YY = gvar(
    [infected_new, hospitalized_new], [delta_infected_new, delta_hospitalized_new]
).T
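
To make the relative weighting concrete, the sketch below computes the $\chi^2$ contribution of a single data point for both uncertainty choices. It is a plain-Python illustration and not part of the fit; `chi2_term` and the example numbers are hypothetical.

```python
def chi2_term(observed, predicted, rel_err, min_err=0.0):
    """Chi^2 contribution of one data point with a relative uncertainty.

    The uncertainty is max(min_err, rel_err * observed), mirroring how
    delta_infected_new and delta_hospitalized_new are built above.
    """
    sigma = max(min_err, rel_err * observed)
    return ((observed - predicted) / sigma) ** 2


# Hypothetical data point: the model misses the observed value by 20%
observed, predicted = 1000.0, 800.0

# 50% uncertainty with a floor of 300 (as for infected_new)
chi2_infected = chi2_term(observed, predicted, rel_err=0.5, min_err=300)
# 15% uncertainty without a floor (as for hospitalized_new)
chi2_hospitalized = chi2_term(observed, predicted, rel_err=0.15)

print(f"infected_new term:     {chi2_infected:.2f}")
print(f"hospitalized_new term: {chi2_hospitalized:.2f}")
print(f"ratio:                 {chi2_hospitalized / chi2_infected:.1f}")
```

For the same relative miss, the `hospitalized_new` term is larger by a factor of $(0.5 / 0.15)^2 \approx 11$, which is why the admissions data pull the fit much more strongly.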

# Fits

**About doubling time and social distancing**

If one assumes, in the initial phase of the COVID-19 spread, $S(t) \approx S_0$, one has

$$
    \dot I(t) = (\beta_I S_0 - \gamma_I) I(t)
    \quad \Rightarrow \quad
    I(t) = I_0 \exp\{(\beta_I S_0 - \gamma_I)t\}
$$

The initial doubling time $t_2$ then follows from

$$
    I(t_2) = 2 I_0 = I_0 \exp\{(\beta_I S_0 - \gamma_I)t_2\} 
    \quad \Rightarrow \quad 
    t_2 = \frac{\ln(2)}{\beta_I S_0 - \gamma_I}
    \quad \Rightarrow \quad 
    \beta_I =  \frac{1}{S_0} \left[ \frac{\ln(2)}{t_2} +  \gamma_I \right]
$$
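
The relation between $t_2$ and $\beta_I$ can be checked numerically. The following is a minimal sketch; the helper names are hypothetical, and it assumes $\gamma_I = 1 / \text{recovery days}$, which is not stated explicitly in this notebook.

```python
from math import exp, log


def beta_from_doubling_time(doubling_time, gamma, s0):
    """beta_I = (ln(2) / t_2 + gamma_I) / S_0."""
    return (log(2) / doubling_time + gamma) / s0


def doubling_time_from_beta(beta, gamma, s0):
    """t_2 = ln(2) / (beta_I * S_0 - gamma_I); requires beta_I * S_0 > gamma_I."""
    growth_rate = beta * s0 - gamma
    if growth_rate <= 0:
        raise ValueError("No exponential growth: beta * s0 <= gamma")
    return log(2) / growth_rate


# Example values: population from XX, central values of the priors used in the fits
s0 = 8.6e6  # initial susceptible population
gamma = 1 / 14  # ASSUMPTION: recovery rate = 1 / recovery days
t2 = 5.0  # initial doubling time in days

beta = beta_from_doubling_time(t2, gamma, s0)

# Early-phase solution I(t) = I_0 exp((beta * S_0 - gamma) * t):
# after t2 days the number of infected should have doubled
growth_factor = exp((beta * s0 - gamma) * t2)
print(f"beta_I = {beta:.3e}, growth after t2 days = {growth_factor:.3f}")
```

Parametrizing the fit by the doubling time instead of $\beta_I$ directly keeps the prior in units (days) that are easier to reason about.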

In later fits, social distancing policies are introduced.
The rate $\beta_I$ then becomes a function of time, which is implemented by a logistic function with three unknown coefficients

$$
    \beta_I(t; R, t_0, \Delta t) = \beta_I \left[ 1 - \frac{R}{1+\exp\{-(t-t_0)\Delta t\}}\right]
$$
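
To illustrate how a time-dependent rate produces a "bend", the following minimal sketch iterates a discrete SIR model with and without the logistic suppression. It is not the `sir_step` or `one_minus_logistic_fcn` implementation from `models`; the function names, the update rule, and all numbers are assumptions for illustration, and `steepness` stands for whatever factor multiplies $(t - t_0)$ in the exponent.

```python
from math import exp, log


def beta_with_distancing(t, beta0, ratio, t0, steepness):
    """beta0 * [1 - ratio / (1 + exp(-(t - t0) * steepness))]."""
    return beta0 * (1 - ratio / (1 + exp(-(t - t0) * steepness)))


def run_sir(s0, i0, beta_fcn, gamma, n_days):
    """Iterate a simple discrete SIR model in one-day steps.

    Returns the list of new infections per day.
    """
    s, i, r = float(s0), float(i0), 0.0
    infected_new = []
    for day in range(n_days):
        new_infections = beta_fcn(day) * s * i
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        infected_new.append(new_infections)
    return infected_new


# Illustrative numbers only (not fit results)
s0, i0 = 8.6e6, 1.0e4
gamma = 1 / 14  # ASSUMPTION: recovery rate = 1 / recovery days
beta0 = (log(2) / 5 + gamma) / s0  # continuous-time doubling time of 5 days

no_distancing = run_sir(s0, i0, lambda t: beta0, gamma, n_days=24)
with_distancing = run_sir(
    s0,
    i0,
    lambda t: beta_with_distancing(t, beta0, ratio=0.7, t0=14, steepness=0.5),
    gamma,
    n_days=24,
)

print("day  no distancing  with distancing")
for day in range(0, 24, 4):
    print(f"{day:3d}  {no_distancing[day]:13.0f}  {with_distancing[day]:15.0f}")
```

In this toy setup, the curve with distancing first tracks the exponential one and then flattens and falls off. At $t = t_0$ the rate has dropped to $\beta_I (1 - R/2)$, and for late times it approaches $\beta_I (1 - R)$.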



## Regular SIR, no social distancing policies

In [None]:
# Describe the fit function
fcn = FitFcn(
    sir_step,  # run regular SIR model
    columns=[
        "infected_new",
        "hospitalized_new",
    ],  # return only data for these two columns
    as_array=True,  # return array not df
    drop_rows=[0],  # drop first row (since new values are NaN in this row)
)

# And the prior estimates of fit parameters
## gvars are Gaussian random numbers described by their mean and standard deviation
prior = {
    # 5(2) days is the estimated initial doubling time
    "inital_doubling_time": gvar(5, 2),
    # Days until one is recovered
    "recovery_days_i": gvar(14, 5),
    # Wild guess of how many were initially infected
    "initial_infected": gvar(1.0e4, 2.0e4),
    # Rate of infected people becoming hospitalized
    "hospitalization_rate": gvar(0.3, 0.5),
}

# Run the fit
fit = nonlinear_fit(data=(XX, YY), fcn=fcn, prior=prior)

# And present it
summarize_fit(fit)
fig = plot_fits(fit, x=chd_df.index[1:])
fig.show()

## Regular SIR, with social distancing policies

In [None]:
# Fit function now with beta_i_fcn
fcn_w_distancing = FitFcn(
    sir_step,
    columns=["infected_new", "hospitalized_new"],
    beta_i_fcn=one_minus_logistic_fcn,
    as_array=True,
    drop_rows=[0],
)

# Copy prior from before but add social distancing parameters
prior_w_distancing = prior.copy()
# Maximal reduction due to social distancing: R
prior_w_distancing["ratio"] = gvar(0.7, 0.3)
# How many days to go from ~ R to 0.5 R: Delta t
prior_w_distancing["social_distance_halfing_days"] = gvar(14, 7)
# After how many days the measures hit 0.5 R: t0
prior_w_distancing["social_distance_delay"] = gvar(14, 7)

# Run the fit
fit_w_distancing = nonlinear_fit(
    data=(XX, YY), fcn=fcn_w_distancing, prior=prior_w_distancing
)

# Summarize the fit
summarize_fit(fit_w_distancing)
fig = plot_fits(fit_w_distancing, x=chd_df.index[1:])
fig.show()

## Regular SIR, with social distancing policies, larger infected uncertainty

It is likely that the number of known cases (infections) is significantly underestimated.
How does the fit change if we inflate the uncertainty of `infected_new` to 300% of its mean value?

In [None]:
infected_new = chd_df["infected_new"].values
hospitalized_new = chd_df["hospitalized_new"].values

# This assumes that there is a 300% uncertainty in the number of infected people
## And at least 300 (if the number is small, to not emphasize early measurements too much)
delta_infected_new = [max(300, infected * 3) for infected in infected_new]

# This assumes that there is a 15% uncertainty in the number of hospitalized people (no minimum applied)
delta_hospitalized_new = [hospitalized * 0.15 for hospitalized in hospitalized_new]

YY_larger_I = gvar(
    [infected_new, hospitalized_new], [delta_infected_new, delta_hospitalized_new]
).T

In [None]:
# Run the fit
fit_w_distancing_larger_I = nonlinear_fit(
    data=(XX, YY_larger_I), fcn=fcn_w_distancing, prior=prior_w_distancing
)

# Summarize the fit
summarize_fit(fit_w_distancing_larger_I)
fig = plot_fits(fit_w_distancing_larger_I, x=chd_df.index[1:])
fig.show()

In [None]:
DataFrame(
    data=[
        fit_w_distancing.p,
        fit_w_distancing_larger_I.p,
        fit_w_distancing.p - fit_w_distancing_larger_I.p,
    ],
    index=["50% infected new uncertainty", "300% infected new uncertainty", "diff"],
).T

In other words, new admissions dictate the outcome of the fit.

## Predictivity check: Regular SIR, with social distancing policies

The final question is: how far can we predict? Or alternatively, if we leave out time slices, how many do we need to consistently describe the data?

In [None]:
NT = YY.shape[0]

fits = {}

for nt in range(NT // 3, NT + 1):
    yy_cut = YY[:nt].copy()
    xx = XX.copy()
    xx["n_iter"] = nt + 1
    fits[nt] = nonlinear_fit(
        data=(xx, yy_cut), fcn=fcn_w_distancing, prior=prior_w_distancing
    )

fig = plot_fit_range(fits, y_max=5000, x=chd_df.index[1:], col_wrap=3)

fig.show()
DataFrame(
    data=[fit.p for fit in fits.values()],
    index=Series([nt * XX["bin_size"] for nt in fits], name="fitted days"),
)

Conclusion: Fits seem to be consistent once a "bend" is visible in the new hospitalizations curve.
E.g., for fitted days $\geq 14$, the fit starts to affect the social distancing function parameters.
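
The leave-out scheme of the loop above can be spelled out as a small bookkeeping sketch. `cut_schedule` is a hypothetical helper, and the total of 12 bins is an assumed example value, not read from the data.

```python
def cut_schedule(n_total, bin_size):
    """List which share of the data each truncated fit sees.

    Mirrors the loop `for nt in range(NT // 3, NT + 1)`: every fit uses the
    first `nt` bins and leaves the remaining ones for comparison.
    """
    schedule = []
    for nt in range(n_total // 3, n_total + 1):
        schedule.append(
            {
                "bins_fitted": nt,
                "days_fitted": nt * bin_size,
                "days_held_out": (n_total - nt) * bin_size,
                "n_iter": nt + 1,  # one extra step since the first row is dropped
            }
        )
    return schedule


# ASSUMPTION: 12 bins of 2 days each, for illustration only
schedule = cut_schedule(n_total=12, bin_size=2)
for row in schedule:
    print(row)
```

Comparing the fitted curve against the held-out days of each row is what shows how far the extrapolation can be trusted.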