# Data preparation

NYC has just published a continuously updating dataset for coronavirus testing, hospitalizations and death rates at

https://github.com/nychealth/coronavirus-data

I intend to test the SIR model, its parameter assumptions, and their consequences. This notebook documents the choices made when preparing the data. In general, I want to discuss temporal correlations in the data.

An assumption made in the CHIME SIR model is that the number of hospitalized people $H$ at time $t$ is proportional to the number of infected people $I$: $I(t) / H(t) = \text{const}$. This allows us to infer the number of infected people from the number of hospitalized people, which is observable.
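As a minimal sketch of how the constant-ratio assumption is used, the snippet below infers infections from hospitalizations. The counts and the rate of 0.2 are hypothetical values chosen for illustration; they are not taken from the NYC dataset.

```python
import pandas as pd

# Hypothetical hospitalization counts, for illustration only.
hospitalized = pd.Series(
    [10, 20, 40],
    index=pd.date_range("2020-03-11", periods=3, freq="D"),
    name="hospitalized",
)

# Assumed constant ratio H(t) / I(t); this value is an assumption.
hospitalization_rate = 0.2

# Under the constant-ratio assumption, I(t) = H(t) / rate.
infected_estimate = hospitalized / hospitalization_rate
print(infected_estimate)
```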

Because of daily fluctuations, I also analyze whether binning days improves the analysis:
$$
    I_b(t_0) = \sum_{n=0}^{N_b - 1} I(t_0 + \delta_t n) \, ,
$$
where $\delta_t$ is one day and $N_b$ is the size of the bin in days.
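The binning can be sketched with pandas as follows. This is an illustration on made-up daily counts; the helper name `bin_counts` is hypothetical, and the sketch is not necessarily how `prepare_case_hosp_death` implements binning.

```python
import pandas as pd


def bin_counts(daily: pd.Series, n_days: int) -> pd.Series:
    """Sum daily counts into consecutive, non-overlapping bins of n_days."""
    return daily.resample(f"{n_days}D").sum()


# Made-up daily counts, for illustration only.
daily = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.date_range("2020-03-01", periods=6, freq="D"),
)

binned = bin_counts(daily, 3)  # one value per 3-day bin
per_day = binned / 3  # average count per day within each bin
print(binned)
```

Dividing the binned sum by the bin size gives a per-day average, which keeps curves for different bin sizes on the same scale.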

**TL;DR:**

1. With a bin size of 3 days, results seem to be stable.
2. The data suggest that there is no measurable delay between a confirmed infection and hospitalization. This may reflect a mathematical adjustment of the data or the way NYC implements testing. After March 11, the ratio of new hospitalizations to new infections seems reliable at $\sim 20\%$, which is consistent with the probability of severe symptoms.
3. There seems to be a strong kink in the data around April 7. It might be safer to drop this point from the analysis until the data are updated.
4. The uncertainty in the number of new hospitalizations is estimated to be ~15% due to temporal effects.

## Imports

In [None]:
from datetime import timedelta

import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from utils.prepare_df import prepare_case_hosp_death, COMMIT_HASH_LAST

print(COMMIT_HASH_LAST)

In [None]:
chd_df = prepare_case_hosp_death(COMMIT_HASH_LAST, bin_day_range=None, drop_days_end=3)
chd_df.head()

## Binning

Below, the counts per day and the cumulative counts are visualized for different binning choices. Binning the data in intervals of 2-4 days seems to give the smoothest and most consistent curves.

In [None]:
dfs = []
for bin_day_range in range(1, 8, 1):
    chd_df = prepare_case_hosp_death(
        COMMIT_HASH_LAST, bin_day_range=bin_day_range, drop_days_end=3
    )
    for col in chd_df.columns:
        if "new" in col:
            chd_df[f"{col}_per_day"] = chd_df[col] / bin_day_range

    stacked = (
        chd_df.stack().reset_index().rename(columns={"level_1": "kind", 0: "value"})
    )
    stacked["bin_day_range"] = bin_day_range
    dfs.append(stacked)

dfs = pd.concat(dfs, ignore_index=True)
kkind = (
    dfs.kind.str.extractall(r"(?P<kind>[a-z]+)_(?P<agg>\w+)")
    .reset_index()
    .drop(columns=["match", "level_0"])
)
dfs = dfs.drop(columns="kind").join(kkind)

In [None]:
fig = px.scatter(
    dfs.query("agg == 'new_per_day'"),
    x="date",
    y="value",
    facet_col="bin_day_range",
    labels={"bin_day_range": "binned days"},
    facet_row="kind",
    log_y=True,
    title="Counts per day",
)
fig.update_yaxes(matches=None)
fig.show()

In [None]:
fig = px.scatter(
    dfs.query("agg == 'cummulative' and bin_day_range < 5"),
    x="date",
    y="value",
    facet_col="bin_day_range",
    labels={"bin_day_range": "binned days"},
    facet_row="kind",
    log_y=True,
    title="Cumulative counts",
)
fig.update_yaxes(matches=None)
fig.show()

## Hospitalization vs Infection (delay in time)

The ratio below shows what fraction of the new infected cases per day is hospitalized. It is displayed for different bin sizes and for different time shifts $\Delta_t$ between infection and hospitalization:
$$
    R(t) = \frac{H(t + \Delta_t)}{I(t)}
$$
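A minimal sketch of this shifted ratio on made-up series is shown below. The helper name `shifted_ratio` and all values are hypothetical. Note that in pandas a negative `shift` moves later values to earlier dates, which is what aligns $H(t + \Delta_t)$ with $I(t)$.

```python
import pandas as pd

dates = pd.date_range("2020-03-01", periods=5, freq="D")

# Made-up counts: hospitalizations follow infections one day later at 20%.
infected = pd.Series([10.0, 20.0, 40.0, 80.0, 160.0], index=dates)
hospitalized = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0], index=dates)


def shifted_ratio(h: pd.Series, i: pd.Series, steps: int) -> pd.Series:
    """Return H(t + steps) / I(t), dropping dates without a partner."""
    # shift(-steps) places the value from `steps` rows later at row t.
    return (h.shift(-steps) / i).dropna()


ratio = shifted_ratio(hospitalized, infected, 1)
print(ratio)
```

With these made-up numbers the ratio is flat for a shift of one day; a ratio that stays roughly constant over time would be the signature of the correct delay.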



In [None]:
shift_dfs = []
tmp = dfs.set_index(["kind", "agg", "bin_day_range", "date"]).sort_index()

for shift in range(0, 8):
    for bbin in dfs.bin_day_range.unique():
        # Only consider bins of 2-5 days and total shifts of at most 8 days.
        if bbin > 5 or bbin < 2 or shift * bbin > 8:
            continue

        # A negative shift aligns H(t + dt) with I(t), matching R(t) above.
        h = tmp.loc[("hospitalized", "new_per_day", bbin)].shift(-shift)
        i = tmp.loc[("infected", "new_per_day", bbin)]
        df = (h / i).dropna().reset_index()

        # Skip combinations with too few remaining points.
        if df.shape[0] < 5:
            continue

        df["bin"] = bbin
        df["shift"] = shift
        df["shifted_days"] = shift * bbin
        df["label"] = f"{bbin}, {shift * bbin} [days]"
        shift_dfs.append(df)

shift_dfs = pd.concat(shift_dfs, ignore_index=True)

In [None]:
fig = px.scatter(
    shift_dfs,
    x="date",
    y="value",
    facet_col="bin",
    facet_row="shifted_days",
    # facet_col_wrap=4,
    labels={"shifted_days": "dt", "bin": "binned days"},
    # log_y=True,
    title=r"$H(t + \Delta t)/I(t)$",
    height=1200,
    range_y=(0.01, 1),
    log_y=True,
)
# fig.update_yaxes(matches=None)
fig.show()

## Estimating uncertainties

For the later analysis, a rough quantification of uncertainties is relevant.
I believe the most accurate data is the number of daily admissions / new hospitalizations.
However, the data will deviate from ideal model behavior due to temporal delays.
As the analysis above shows, binning days smooths the curves.
Below, the spread in new admissions is computed across different bin sizes (using linear interpolation to map binned values back onto a daily grid).
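The idea can be sketched on made-up numbers: binned admissions are interpolated back onto a daily grid, divided by the bin size, and compared with the daily series through the relative spread std/mean. All values are hypothetical and only illustrate the procedure.

```python
import pandas as pd

# Made-up admissions summed over 2-day bins.
binned = pd.Series(
    [20.0, 40.0, 60.0],
    index=pd.to_datetime(["2020-03-01", "2020-03-03", "2020-03-05"]),
)

# Upsample to daily values by linear interpolation, then convert to per-day counts.
daily_from_bins = binned.resample("D").interpolate(method="linear") / 2

# Made-up unbinned daily admissions on the same dates.
daily = pd.Series([10.0, 14.0, 20.0, 26.0, 30.0], index=daily_from_bins.index)

# Relative spread between the two estimates for each day, in percent.
both = pd.concat([daily, daily_from_bins], axis=1)
relative_spread = both.std(axis=1) / both.mean(axis=1) * 100
print(relative_spread)
```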

In [None]:
bin_sample_dfs = []

for bin_size in range(1, 4):
    timeshift = (bin_size - 1) // 2 * timedelta(days=1)
    tmp = dfs.query(
        "agg == 'new' and kind == 'hospitalized' and bin_day_range == @bin_size"
    ).set_index("date")[["value"]]

    if bin_size > 1:
        tmp.index += timeshift
        tmp = (
            tmp.resample("D").interpolate(method="linear").ffill()
        ) / bin_size

    tmp["bin_size"] = bin_size
    bin_sample_dfs.append(tmp)

bin_sampled_df = pd.concat(bin_sample_dfs)

hospitalized_deviations = (
    bin_sampled_df.reset_index().groupby("date").agg(["mean", "std"])["value"]
)
hospitalized_deviations["std/mean"] = (
    hospitalized_deviations["std"] / hospitalized_deviations["mean"] * 100
)


fig = px.scatter(
    bin_sampled_df.reset_index(),
    x="date",
    y="value",
    symbol="bin_size",
)
fig.show()
fig = px.scatter(hospitalized_deviations.reset_index(), x="date", y="std/mean")
fig.show()

The last two points are outliers. Excluding them gives the following average relative error.

In [None]:
print(
    "Avg ratio of std/mean: {0:1.1f}%".format(
        hospitalized_deviations["std/mean"].iloc[:-2].mean()
    )
)
print(
    "STD ratio of std/mean: {0:1.1f}%".format(
        hospitalized_deviations["std/mean"].iloc[:-2].std()
    )
)