# This notebook displays how to run fits

This notebook shows how to run SIR fits that include social distancing parameters for the [NYC data](https://github.com/nychealth/coronavirus-data).

# Init

## Imports

In [None]:
# Preparing the data
from datetime import datetime
from pandas import DataFrame, Series

# Fitting
from gvar import gvar, mean, sdev
from lsqfit import nonlinear_fit

# Plotting
import plotly.graph_objects as go

# Local module functions
from models import sir_step, one_minus_logistic_fcn, FitFcn
from utils.prepare_df import prepare_case_hosp_death
from utils.plotting import COLUMN_NAME_MAP, plot_fits, summarize_fit

## Data

These cells load the NYC data. It is important to create a `DataFrame` with the columns `hospitalized_new` and `infected_new` (new admissions per day and new positive cases per day) and to set its index to the corresponding `date`s.
You can also load your own data frame if you want to fit your own data.

In [None]:
BIN_SIZE = 1
COMMIT_HASH = "master"

chd_df_all = prepare_case_hosp_death(
    COMMIT_HASH,  # Specify NYC repo commit hash to ensure same data
    bin_day_range=BIN_SIZE,  # How many days should be grouped as one
    # Drop rows whose date is within 5 days of the report date (delay in reporting)
    drop_days_end=5,
)

# This is the cut for fitting, after this, results are predictions
cut = datetime(2020, 4, 5, 12)

chd_df = chd_df_all.loc["2020-03-01":cut]
chd_df_extension = chd_df_all.loc[cut:]
chd_df.tail()

In [None]:
fig = go.Figure(layout_title="New hospitalizations in NYC")
fig.add_trace(
    go.Scatter(
        x=chd_df_all.index,
        y=chd_df_all.hospitalized_new,
        name="data",
        mode="markers",
        line_color="#1f77b4",
    )
)
fig.add_trace(
    go.Scatter(
        x=[cut, cut],
        y=[0, 3500],
        name="Fit range end",
        line_color="black",
        mode="lines",
    )
)
fig.show()

# Fitting

## Prepare data

The Bayesian fit requires probability distributions for the input variables.
For simplicity, this notebook considers Gaussian distributions only.

Gaussian random variables are implemented by the `gvar` module.

```
yy = gvar(yy_mean, yy_sdev)
```

Presumably, daily new admissions (`hospitalized_new`) are more accurate than counts of daily new infections (`infected_new`).
Thus this notebook emphasizes `hospitalized_new` over `infected_new` by setting the relative uncertainty of new hospitalizations to 10% (see also the NYC-data-preparation notebook) and that of `infected_new` to 50%.
To avoid giving too much weight to the small counts at early times, each error is also bounded from below by a minimal value.
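
The effect of the minimal error can be illustrated with a few made-up counts (the numbers below are illustrative, not NYC data). With a 10% relative error and a floor of 50, small early counts receive a larger relative uncertainty and therefore pull less on the fit. `floored_errors` is a hypothetical helper written for this sketch.

```python
def floored_errors(values, rel_err, min_err):
    """Return max(min_err, rel_err * value) for each value."""
    return [max(min_err, rel_err * value) for value in values]


# Made-up daily admissions, for illustration only
counts = [20, 200, 2000]
errors = floored_errors(counts, rel_err=0.1, min_err=50)

for count, err in zip(counts, errors):
    print(f"count={count:5d}  error={err:7.1f}  relative={err / count:.2f}")
```

Only the largest count is governed by the relative error; the two smaller ones fall back to the floor.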

In [None]:
# Don't fit the first entry (the `_new` columns only start at the second entry)
infected_new = chd_df["infected_new"].values[1:].copy()
hospitalized_new = chd_df["hospitalized_new"].values[1:].copy()

# This relative error is a guess; it matters less because it is 5 times larger than that of hospitalized_new
REL_ERR = {"infected_new": 0.5}
err_inf_new = [max(250, inf * REL_ERR["infected_new"]) for inf in infected_new]

# Inferred from binning daily admissions
REL_ERR["hospitalized_new"] = 0.1
err_hosp_new = [
    max(50, hosp * REL_ERR["hospitalized_new"]) for hosp in hospitalized_new
]

YY = gvar([infected_new, hospitalized_new], [err_inf_new, err_hosp_new]).T

## Prepare model

This section prepares the simulation models.
The `FitFcn` wraps a model to simplify the fit function call, e.g.,

```
fcn = FitFcn(sir_step, **fit_fcn_kwargs)
yy = fcn(xx, pp)
```
where `xx` are fixed parameters and `pp` are parameters to be fitted.

Arguments:

* `sir_fc`: The function which executes an SIR step.
* `beta_i_fcn`: Function which generates a schedule for the growth rate.
* `columns`: Function call returns only the specified columns.
* `as_array`: If true, the function call returns an array; otherwise a `DataFrame`.
* `drop_rows`: Rows to drop from the function call result. Set to `[0]` to exclude the `NaN`s in the first row of the `_new` columns.
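
To make the `(xx, pp)` calling convention concrete, here is a minimal sketch of a discrete-time SIR simulation wrapped in a callable. `toy_sir_step` and `ToyFitFcn` are hypothetical stand-ins written for this illustration. They are not the `sir_step` and `FitFcn` from `models`, whose internals and parameter names may differ; the keys `n_days`, `beta` and `gamma` in particular are assumptions of this sketch.

```python
def toy_sir_step(state, beta, gamma):
    """Advance (S, I, R) by one day and also return the new infections."""
    s, i, r = state
    n = s + i + r
    new_infected = beta * s * i / n
    new_recovered = gamma * i
    new_state = (s - new_infected, i + new_infected - new_recovered, r + new_recovered)
    return new_state, new_infected


class ToyFitFcn:
    """Hypothetical wrapper: `xx` holds fixed inputs, `pp` holds fitted parameters."""

    def __init__(self, step_fcn):
        self.step_fcn = step_fcn

    def __call__(self, xx, pp):
        state = (xx["initial_susceptible"], pp["initial_infected"], xx["initial_recovered"])
        new_cases = []
        for _ in range(xx["n_days"]):
            state, new_infected = self.step_fcn(state, pp["beta"], pp["gamma"])
            new_cases.append(new_infected)
        return new_cases


toy_fcn = ToyFitFcn(toy_sir_step)
toy_xx = {"initial_susceptible": 8.6e6, "initial_recovered": 0, "n_days": 5}
toy_pp = {"initial_infected": 1.0e4, "beta": 0.3, "gamma": 1 / 14}
toy_yy = toy_fcn(toy_xx, toy_pp)
print(toy_yy)
```

Separating `xx` from `pp` lets the fitter vary only the entries of `pp` while the simulation setup stays untouched.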

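Social distancing measures are implemented as
$$
    \beta_I(t) = \beta_I(0) \left[ 1 - f(t) \right]
    \,,\qquad
    f(t) = R \left(1 + \exp \left\{-\frac{t-t_0}{\Delta t}\right\}\right)^{-1}
$$
where $R$ is the maximal reduction (`ratio`), $t_0$ is the delay after which half of that reduction is reached (`social_distance_delay`), and $\Delta t$ sets how quickly the measures take effect (`social_distance_halfing_days`).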
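
As a numerical illustration of this schedule, the sketch below evaluates the factor $1 - f(t)$ that multiplies $\beta_I(0)$, under the assumption that $f$ rises from 0 to $R$ over time, i.e. $f(t) = R / (1 + e^{-(t - t_0)/\Delta t})$. `relative_beta` is a hypothetical helper, not `one_minus_logistic_fcn` from `models`, and the parameter values are simply the prior means used in this notebook.

```python
from math import exp


def relative_beta(t, ratio, t0, delta_t):
    """Return 1 - f(t) with f(t) = ratio / (1 + exp(-(t - t0) / delta_t))."""
    return 1 - ratio / (1 + exp(-(t - t0) / delta_t))


for day in [0, 5, 10, 20, 40]:
    print(day, round(relative_beta(day, ratio=0.7, t0=5, delta_t=5), 3))
```

At `t = t0` the growth rate is reduced by half of `ratio`, and at late times the reduction approaches the full `ratio`.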

The cell below prepares the SIR model.

In [None]:
fit_fcn_kwargs = {
    "columns": ["infected_new", "hospitalized_new",],
    "beta_i_fcn": one_minus_logistic_fcn,
    "as_array": True,
    "drop_rows": [0],
}


fcn = FitFcn(sir_step, **fit_fcn_kwargs)

Specify the model and data meta parameters that are held fixed (e.g., initial conditions).

In [None]:
XX = {
    # the t-values
    "date": chd_df.index,
    ## Population: https://en.wikipedia.org/wiki/New_York_City
    "initial_susceptible": 8.6e6,
    # Initial hospitalizations: Select from data
    "initial_hospitalized": chd_df.hospitalized_cummulative.iloc[0],
    # Assume nobody is recovered
    "initial_recovered": 0,
    # Hospital capacity. Only used by SIHR
    ## https://www.bloomberg.com/graphics/2020-new-york-coronavirus-outbreak-how-many-hospital-beds/
    "capacity": 23000,
    # Length of stay for hospitalized persons taken to be the same as H recovery days in SIHR
    "length_of_stay": 14
}

Specify the priors of the model parameters that are fitted.

In [None]:
# And the prior estimates of fit parameters
## gvars are Gaussian random numbers described by their mean and standard deviation
prior = {
    # Time after which infections double (at the beginning of the simulation)
    "inital_doubling_time": gvar(3, 2),
    # Days until an infected person is recovered
    "recovery_days_i": gvar(14, 3),
    # Initial infections, wild guess since the number is uncertain
    "initial_infected": gvar(1.0e4, 2.0e4),
    # Maximal reduction due to social distancing (logistic function R)
    "ratio": gvar(0.7, 0.2),
    # How many days it takes to go from ~ R/4 to R/2 (logistic function Delta t)
    "social_distance_halfing_days": gvar(5, 4),
    # After how many days distancing measures reach 0.5 * ratio (logistic function t0)
    "social_distance_delay": gvar(5, 4),
    # Fraction of newly infected persons who become hospitalized
    "hospitalization_rate": gvar(0.2, 0.1)
}
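
For orientation, a doubling time can be translated into a daily growth rate. The sketch below assumes the common relation $g = 2^{1/T_d} - 1$ and, for an SIR model early in an outbreak (almost everyone susceptible), $\beta \approx g + \gamma$ with $\gamma = 1/\text{recovery days}$. Whether `sir_step` uses exactly this conversion is an assumption and is not shown in this notebook. The inputs are the prior means from the cell above, and `growth_rate_from_doubling_time` is a hypothetical helper.

```python
def growth_rate_from_doubling_time(doubling_time):
    """Daily growth rate g such that (1 + g) ** doubling_time == 2."""
    return 2 ** (1 / doubling_time) - 1


doubling_time = 3   # days, prior mean of "inital_doubling_time"
recovery_days = 14  # days, prior mean of "recovery_days_i"

growth = growth_rate_from_doubling_time(doubling_time)
gamma = 1 / recovery_days
beta = growth + gamma  # assumed early-outbreak relation, see text

print(f"growth rate per day: {growth:.3f}")
print(f"beta: {beta:.3f}, implied beta / gamma: {beta / gamma:.2f}")
```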

## Run fit

In [None]:
fit_sir = nonlinear_fit(data=(XX, YY), fcn=fcn, prior=prior)

How to interpret the summary below:

**Least Square Fit:**

1. The Q value should be close to 1
2. The logGBF (log Gaussian Bayes Factor) is a relative measure of how likely a model is given the data. It promotes models that describe the data and penalizes additional parameters. Higher is better.
3. The `chi2/dof` is not necessarily meaningful because we do not have proper `y` errors. If values are below one, this suggests actual `y` errors are overestimated.
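
To see why overestimated errors push `chi2/dof` below one, recall that $\chi^2 = \sum_i (y_i - f_i)^2 / \sigma_i^2$. The sketch below uses made-up numbers and, as a simplification, ignores both the prior terms of a Bayesian fit and the subtraction of fitted parameters from the degrees of freedom. It only illustrates how the value scales with the assumed errors.

```python
def chi2(data, model, errors):
    """Sum of squared residuals in units of the assumed errors."""
    return sum(((y - f) / err) ** 2 for y, f, err in zip(data, model, errors))


# Made-up data, model values and errors, for illustration only
data = [105.0, 190.0, 310.0, 395.0]
model = [100.0, 200.0, 300.0, 400.0]
errors = [10.0, 10.0, 10.0, 10.0]
dof = len(data)  # simplification: no fitted parameters subtracted

base = chi2(data, model, errors) / dof
inflated = chi2(data, model, [2 * err for err in errors]) / dof

print(f"chi2/dof with assumed errors: {base:.3f}")
print(f"chi2/dof with doubled errors: {inflated:.3f}")
```

Doubling every error divides `chi2/dof` by four without changing the quality of the description, which is why the value is hard to interpret when the `y` errors are only guessed.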

**Parameters:**

1. This section compares how much posterior values differ from prior values (in brackets).
Stars indicate that posterior values were shifted by more than a standard deviation and suggest that the prior is affecting the fit.
2. Posterior values which are almost the same as prior values suggest that the data is not constraining these parameters. One should be cautious about such values if there is no prior interpretation.
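
Both observations can be quantified as the shift of the posterior mean in units of the prior standard deviation, together with the ratio of posterior to prior width. The numbers below are invented for illustration and are not results of this fit. `compare` is a hypothetical helper.

```python
def compare(prior, posterior):
    """Return (shift in prior sdevs, posterior sdev / prior sdev)."""
    (prior_mean, prior_sdev), (post_mean, post_sdev) = prior, posterior
    return (post_mean - prior_mean) / prior_sdev, post_sdev / prior_sdev


# (mean, sdev) pairs with invented values
examples = {
    "shifted": ((3.0, 2.0), (6.0, 0.5)),
    "unconstrained": ((14.0, 3.0), (14.1, 2.9)),
}

for name, (prior_pair, posterior_pair) in examples.items():
    shift, width_ratio = compare(prior_pair, posterior_pair)
    print(f"{name:14s} shift={shift:+.2f} sdev  width ratio={width_ratio:.2f}")
```

A shift above one standard deviation corresponds to the starred case, while a width ratio close to one indicates that the data barely constrained the parameter.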

**Error budget:**

This section summarizes how much the uncertainty of each posterior value is affected by input values (prior and data).
Relatively large uncertainties coming from the data suggest that this parameter is important to describe results.
In contrast, if the uncertainty is dominated by the prior, the parameter is not important to describe the data.
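
A one-parameter toy case shows how such a budget arises. Assume a Gaussian prior with width $\sigma_p$ and a single Gaussian measurement of the same quantity with width $\sigma_d$. The posterior mean is a weighted average of the two, so the posterior variance splits into a part propagated from the prior and a part propagated from the data. This is a sketch of the idea under these assumptions, not the computation that `summarize_fit` performs; `error_budget` is a hypothetical helper.

```python
from math import sqrt


def error_budget(prior_sdev, data_sdev):
    """Split the posterior sdev of a 1D Gaussian update into prior and data parts."""
    post_var = 1 / (1 / prior_sdev**2 + 1 / data_sdev**2)
    # Weights of the prior mean and of the measurement in the posterior mean
    w_prior = post_var / prior_sdev**2
    w_data = post_var / data_sdev**2
    from_prior = w_prior * prior_sdev  # sdev propagated from the prior
    from_data = w_data * data_sdev  # sdev propagated from the data
    return sqrt(post_var), from_prior, from_data


for data_sdev in [0.5, 10.0]:
    total, from_prior, from_data = error_budget(prior_sdev=2.0, data_sdev=data_sdev)
    print(
        f"data_sdev={data_sdev:5.1f}  total={total:.3f}  "
        f"prior={from_prior:.3f}  data={from_data:.3f}"
    )
```

With a precise measurement the data part dominates the total; with an imprecise one the posterior uncertainty is inherited mostly from the prior.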

In [None]:
print(summarize_fit(fit_sir))

## Results

In [None]:
plot_kwargs = dict(extend_days=31, plot_residuals=False, plot_infections=True)

# Plot fits
fig = plot_fits(
    fit_sir, fit_name="SIR", plot_data=True, **plot_kwargs
)

# And append unfitted data
for icol, col in enumerate(["infected_new", "hospitalized_new"]):
    fig.add_trace(
        go.Scatter(
            x=chd_df_extension.index,
            y=chd_df_extension[col],
            error_y_array=chd_df_extension[col] * REL_ERR[col],
            name="Not fitted data",
            line_color="red",
            showlegend=icol == 0,
        ),
        col=icol + 1,
        row=1,
    )

fig.show()

In [None]:
result = DataFrame(
    data=[{**prior}, fit_sir.p],
    index=["Prior", "Posterior SIR"],
).T
result["diff"] = result["Posterior SIR"] - result["Prior"]
result