## Introduction

Stan solution to [nested_model_example.ipynb](nested_model_example.ipynb).


In [1]:
# set up Python
import re
import json
import inspect
import logging
import numpy as np
import pandas as pd
from IPython.display import Markdown, display
import plotnine
from plotnine import *

from nested_model_fns import (
    define_Stan_model_with_forecast_period,
    solve_forecast_by_Stan,
    plot_forecast,
    extract_sframe_result,
    plot_model_quality,
    plot_model_quality_by_prefix,
)

# quiet down Stan
logger = logging.getLogger("cmdstanpy")
logger.addHandler(logging.NullHandler())

# set plot size
plotnine.options.figure_size = (16, 8)

In [None]:
with open("generating_params.json", "r") as file:
    generating_params = json.load(file)
modeling_lags = generating_params["generating_lags"]
b_z = generating_params["b_z"]
b_x = generating_params["b_x"]

generating_params

In [3]:
d_train = pd.read_csv("d_train.csv")
d_test = pd.read_csv("d_test.csv")

## Solving again with the Bayesian "big hammer"


We now try a Bayesian model with the correct generative structure, using the [Stan](https://mc-stan.org/users/interfaces/cmdstan) software package.


In [4]:
# define a Stan model for both transient external regressors and future predictions
stan_model_with_forecast_i, stan_model_with_forecast_src_i = (
    define_Stan_model_with_forecast_period(
        application_lags=modeling_lags,
        n_transient_external_regressors=len(b_x),
        n_durable_external_regressors=len(b_z),
    )
)

In [None]:
# show the model specification
print(stan_model_with_forecast_src_i)

`y` and `y_auto` are supposed to be non-negative (a constraint we have chosen to *not* enforce, as it degraded results, probably by damaging sampling paths).

Please keep in mind a distributional statement such as <code>y ~ normal(f(y_auto, x), &sigma;)</code> is actually modeling the residual <code>(y - f(y_auto, x))</code> as being distributed <code>normal(0, &sigma;)</code>. So the above model-block statements are distributional assumptions about *residuals*, as the intended mean is an input to these statements. Thus we are specifying a normal distribution for residuals, not a normal distribution for expected values or predictions. I feel the normal approximation for `y`'s residual is not that bad. A similar statement can be made for `y_auto`'s residuals.
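As a quick check of the residual reading, here is a minimal sketch in plain Python. The numbers and the helper `normal_logpdf` are hypothetical and are not part of the Stan model; the point is only that the log density of `y` around its predicted mean equals the log density of the residual around zero.

```python
import math


def normal_logpdf(value: float, mean: float, sigma: float) -> float:
    """Log density of normal(mean, sigma) evaluated at value."""
    z = (value - mean) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


# hypothetical numbers, for illustration only
y_observed = 12.0
predicted_mean = 10.5  # stands in for f(y_auto, x)
sigma = 2.0

# log density of the observation around the predicted mean
lp_observation = normal_logpdf(y_observed, predicted_mean, sigma)
# log density of the residual around zero
lp_residual = normal_logpdf(y_observed - predicted_mean, 0.0, sigma)

print(lp_observation, lp_residual)
```

The two printed values agree, which is why the choice of `normal` is a statement about how far `y` may stray from the modeled mean, and not a claim that `y` itself is bell shaped.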

All in all: specifying systems to Stan is a compromise between respecting problem structure and preserving the ability to sample effectively. The specification tends to require some compromise and experimentation. In my opinion, it isn't quite the case that "Bayes' Law names only one legitimate inferential network and we can then use that one!" One is going to have to specify an approximate system. It becomes the user's responsibility to design for a (hopefully) high-utility tradeoff between fidelity and realizability.

We obviously will not know all the priors. So we hope the problem is somewhat insensitive to them and just set them to reasonable distributions. It is possible to over-worry about priors, and somewhat freeing to just think of them as [regularizations](https://en.wikipedia.org/wiki/Regularization_(mathematics)) or biases for the values in question to be small. Also, it sometimes makes sense to fight symmetries or degeneracies in the specification by adding "complementarity" constraints such as `b_auto_0 * b_imp_0 ~ normal(0, 0.1)`. This is not a distributional claim we believe, but a trick: saying we expect the product to be small enforces that only one of the two values is expected to be non-negligible. Again, think of the model's distributional claims as "criticisms" of candidate solutions, without worrying whether they are prior (before seeing data) or posterior (after seeing data) opinions. Also, don't be profligate with these exotic checks: they break convexity of the function we are optimizing and can make sampling harder.
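To make the complementarity trick concrete, here is a small sketch in plain Python. The helper `complementarity_log_density` is hypothetical (it is not a function from Stan or from `nested_model_fns`); it computes, up to an additive constant, the log density that a statement like `b_auto_0 * b_imp_0 ~ normal(0, 0.1)` contributes.

```python
import math


def complementarity_log_density(
    b_auto_0: float, b_imp_0: float, scale: float = 0.1
) -> float:
    """Log density added by `b_auto_0 * b_imp_0 ~ normal(0, scale)`, dropping constants."""
    product = b_auto_0 * b_imp_0
    return -0.5 * (product / scale) ** 2


# two settings where only one coefficient is active: no penalty
lp_only_auto = complementarity_log_density(1.0, 0.0)
lp_only_imp = complementarity_log_density(0.0, 1.0)
# the midpoint of those two settings: both coefficients active, so penalized
lp_midpoint = complementarity_log_density(0.5, 0.5)

print(lp_only_auto, lp_only_imp, lp_midpoint)
```

Both single-coefficient settings are unpenalized, while the point halfway between them is penalized. A convex penalty could not behave this way, which is one way to see why such terms can make the sampling geometry harder.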


What Stan generates is thousands of possible trajectories, each a joint sample of the parameters and of the past and future hidden state.


In [None]:
# sample from Stan model solutions
forecast_soln_i = solve_forecast_by_Stan(
    model=stan_model_with_forecast_i,
    d_train=d_train,
    d_apply=d_test,
    durable_external_regressors=["z_0"],
    transient_external_regressors=["x_0"],
)

forecast_soln_i  # see https://mc-stan.org/docs/cmdstan-guide/stansummary.html

In [None]:
lp__q_90 = forecast_soln_i['lp__'].quantile(q=0.9)
(
    ggplot(
        data=forecast_soln_i,
        mapping=aes(x='lp__')
    )
    + geom_density(fill="darkgrey", alpha=0.5)
    + geom_vline(xintercept=lp__q_90, linetype="dashed")
    + ggtitle("distribution of pseudo log-likelihood")
)

In [8]:
# a trick I like: limit down to the more plausible samples/trajectories
forecast_soln_i = (
    forecast_soln_i.loc[
        forecast_soln_i['lp__'] >= lp__q_90,
        :
    ].reset_index(drop=True, inplace=False)
)

We can look at a summary of the parameter estimates.


In [None]:
# summarize parameter estimates
soln_params_i = forecast_soln_i.loc[
    :, [c for c in forecast_soln_i if c.startswith("b_")]
].median()

soln_params_i

In [None]:
# show original generative parameters
generating_params

Notice we recover `b_x_dur ~ b_z` and `b_x_imp ~ b_x` pretty well. These effect inferences can be used for planning and policy.

In [None]:
answer_frame = pd.DataFrame(
    {
        "b_x_dur[0]": [generating_params["b_z"][0]],
        "b_x_imp[0]": [generating_params["b_x"][0]],
    }
)
parameter_plt_frame = forecast_soln_i.loc[:, [c for c in answer_frame.columns]].melt()
(
    ggplot()
    + geom_density(
        data=parameter_plt_frame,
        mapping=aes(x="value", fill="variable", color="variable"),
        linetype="",
        alpha=0.5,
    )
    + geom_vline(
        data=answer_frame.melt(),
        mapping=aes(xintercept="value", color="variable", fill="variable"),
        linetype="dashed",
        size=1,
    )
    + facet_wrap("variable", scales="free")
    + scale_color_brewer(type="qualitative", palette="Dark2")
    + scale_fill_brewer(type="qualitative", palette="Dark2")
    + ggtitle("Stan inferred parameter distributions\n(generating values dashed lines)")
)

And we can plot both the forecasts, and *estimated* quantile bands around the estimated forecasts.


In [None]:
# plot inference over time
plt_i, s_frame_i = plot_forecast(
    forecast_soln_i,
    d_test,
    model_name="Stan correct externals model",
    external_regressors=["z_0", "x_0"],
)
plt_i.show()

In [None]:
forecast_soln_i

In [14]:
plotting_quantiles = [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]
plotting_colors = {
    "0.05": "#66c2a4",
    "0.1": "#2ca25f",
    "0.25": "#006d2c",
    "0.5": "#005824",
    "0.75": "#006d2c",
    "0.9": "#2ca25f",
    "0.95": "#66c2a4",
}
ribbon_pairs = [("0.05", "0.95"), ("0.1", "0.9"), ("0.25", "0.75")]
sf_frame = forecast_soln_i.loc[
    :, [c.startswith("y[") for c in forecast_soln_i.columns]
].reset_index(drop=True, inplace=False)
sf_frame["trajectory_id"] = range(sf_frame.shape[0])
sf_frame = sf_frame.melt(
    id_vars=["trajectory_id"], var_name="time_tick", value_name="y"
)
sf_frame["time_tick"] = [
    int(c.replace("y[", "").replace("]", "")) for c in sf_frame["time_tick"]
]
sf_frame = (
    sf_frame.loc[:, ["time_tick", "y"]]
    .groupby(["time_tick"])
    .quantile(plotting_quantiles)
    .reset_index(drop=False)
)
sf_frame.rename(columns={"level_1": "quantile"}, inplace=True)
sf_frame["quantile"] = [str(v) for v in sf_frame["quantile"]]
sf_p = sf_frame.pivot(index="time_tick", columns="quantile")
sf_p.columns = [c[1] for c in sf_p.columns]
sf_p = sf_p.reset_index(drop=False, inplace=False)

In [None]:
plt = ggplot()
plt = (
    plt
    + geom_point(
        data=d_train,
        mapping=aes(x="time_tick", y="y"),
        size=2,
    )
    + geom_step(
        data=sf_frame.loc[sf_frame["quantile"] == "0.5", :],
        mapping=aes(x="time_tick", y="y"),
        direction="mid",
        color="#005824",
    )
)
for r_min, r_max in ribbon_pairs:
    plt = plt + geom_ribbon(
        data=sf_p,
        mapping=aes(x="time_tick", ymin=r_min, ymax=r_max),
        fill=plotting_colors[r_min],
        alpha=0.4,
        linetype="",
    )
plt = plt + xlim(900, 1000) + ggtitle("data leading to a projection into the future")
plt

In [None]:
plt = ggplot()
plt = (
    plt
    + geom_point(
        data=d_test,
        mapping=aes(x="time_tick", y="y"),
        size=2,
    )
    + geom_point(
        data=d_train,
        mapping=aes(x="time_tick", y="y"),
        size=2,
    )
    + geom_step(
        data=sf_frame.loc[sf_frame["quantile"] == "0.5", :],
        mapping=aes(x="time_tick", y="y"),
        direction="mid",
        color="#005824",
    )
)
for r_min, r_max in ribbon_pairs:
    plt = plt + geom_ribbon(
        data=sf_p,
        mapping=aes(x="time_tick", ymin=r_min, ymax=r_max),
        fill=plotting_colors[r_min],
        alpha=0.4,
        linetype="",
    )
plt = (
    plt
    + xlim(900, 1000)
    + ggtitle(
        "data leading to a projection into the future, matched to held out future"
    )
)
plt

In [None]:
# plot quality of fit as a scatter plot
d_test["Stan (correct externals structure) prediction"] = extract_sframe_result(
    s_frame_i
)
plot_model_quality(
    d_test=d_test,
    result_name="Stan (correct externals structure) prediction",
    external_regressors=["z_0", "x_0"],
)

In [None]:
# plot quality as a function of how far out we are predicting
plot_model_quality_by_prefix(
    s_frame=s_frame_i,
    d_test=d_test,
    result_name="Stan (correct externals structure) prediction",
)

The model identifies 3 very valuable things:

  * Estimates of the model parameters: 
     * `b_auto_0`
     * `b_auto[0]`
     * `b_auto[1]`
     * `b_x_dur[0]`
     * `b_x_imp[0]`
  * Projections or applications of the model for future `time_tick`s 950 through 999.
  * Good inferences of the most recent unobserved states `y_auto[948]` and `y_auto[949]` in the training period.

The third item provides a much more useful estimate of the hidden state (based on evaluation of trajectories through the entire training period) than the simple single point estimate `y_auto[i] ~ y[i] - b_x_imp[0] * x_0[i]`. One can evolve estimates forward from these inferences, and that is not always the case for the simple expected value estimates.

The issue with the simple (or naive) estimates is that they are single-value point estimates, not necessarily compatible with *any* of the sampled trajectories. Plugging in the naive estimates often does not allow one to evolve the prediction trajectories forward in a sensible manner. The detailed estimates from the Stan sampler do allow such forward evolution of estimates (either inside the Stan sampler as shown, or as a simple external procedure).
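A minimal sketch of what such an external procedure could look like, using NumPy. Everything here is an assumption for illustration: the draws are synthetic (not Stan output), and the recursion (lags 1 and 2, the durable regressor entering the hidden state, the transient regressor added only to the observation) is a guess at the structure; the authoritative version is the printed Stan source.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_draws, n_future = 500, 10

# hypothetical posterior draws, one row per sampled trajectory (not real Stan output)
b_auto_0 = rng.normal(1.0, 0.05, size=n_draws)
b_auto = rng.normal([0.5, 0.3], 0.02, size=(n_draws, 2))
b_x_dur = rng.normal(2.0, 0.1, size=n_draws)
b_x_imp = rng.normal(5.0, 0.1, size=n_draws)
# each draw carries its own estimate of the last two hidden states
y_auto_prev2 = rng.normal(9.0, 0.5, size=n_draws)  # plays the role of y_auto[948]
y_auto_prev1 = rng.normal(10.0, 0.5, size=n_draws)  # plays the role of y_auto[949]

# known future schedule of the external regressors (hypothetical)
z_0 = np.zeros(n_future)
x_0 = np.zeros(n_future)
x_0[3] = 1.0

y_future = np.zeros((n_draws, n_future))
for t in range(n_future):
    # assumed recursion: lags 1 and 2, durable regressor enters the hidden state
    y_auto_t = (
        b_auto_0
        + b_auto[:, 0] * y_auto_prev1
        + b_auto[:, 1] * y_auto_prev2
        + b_x_dur * z_0[t]
    )
    # transient regressor only shifts the observation
    y_future[:, t] = y_auto_t + b_x_imp * x_0[t]
    y_auto_prev2, y_auto_prev1 = y_auto_prev1, y_auto_t

# summarize across trajectories into quantile bands
bands = np.quantile(y_future, [0.05, 0.5, 0.95], axis=0)
print(bands.round(2))
```

The key design choice is that each trajectory is evolved with its own parameters and its own hidden state, and only then are quantiles taken across trajectories. Evolving a single vector of medians forward would discard the dependence between parameters and states that the sampler worked to estimate.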


In [19]:
# get distribution of breakdown of predictions
history_frame = (
    forecast_soln_i
        .loc[:, [c for c in forecast_soln_i.columns if c.startswith('y[') or c.startswith('y_auto[')]]
        .reset_index(drop=True, inplace=False)
)
idx_max = np.max([int(c.replace('y[', '').replace(']', '')) for c in history_frame.columns if c.startswith('y[')])
new_cols = {}
for i in range(idx_max + 1):
    new_cols[f'y_transient[{i}]'] = history_frame[f'y[{i}]'] - history_frame[f'y_auto[{i}]']
history_frame = pd.concat([history_frame, pd.DataFrame(new_cols)], axis=1)
history_frame['trajectory_id'] = range(history_frame.shape[0])
history_frame = history_frame.melt(id_vars=['trajectory_id'])
history_frame['time_tick'] = [int(re.sub(r'^.*\[', '', v).replace(']', '')) for v in history_frame['variable']]
history_frame['variable'] = [re.sub(r'\[.*\]', '', v) for v in history_frame['variable']]
history_plot = (
    history_frame
        .loc[:, ['variable', 'value', 'time_tick']]
        .groupby(['variable', 'time_tick'])
        .median()
        .reset_index(drop=False, inplace=False)
)

In [None]:
idx0 = d_train.shape[0] - 2 * d_test.shape[0]
d_actuals_train = d_train.loc[:, ['time_tick', 'y', 'ext_regressors']].reset_index(drop=True, inplace=False)
d_actuals_train['variable'] = 'y'
d_actuals_test = d_test.loc[:, ['time_tick', 'y', 'ext_regressors']].reset_index(drop=True, inplace=False)
d_actuals_test['variable'] = 'y'
(
    ggplot(
        data=history_plot.loc[history_plot['time_tick'] >= idx0, :],
        mapping=aes(x='time_tick', y='value')
    )
    + annotate(
        "rect",
        xmin=-np.inf, 
        xmax=d_train.shape[0], 
        ymin=-np.inf, 
        ymax=np.inf,
        alpha=0.5,
        fill='#e0d7c6',
    )
    + facet_wrap('variable', ncol=1, scales='free_y')
    + geom_step(direction="mid", size=1)
    + geom_vline(xintercept=d_train.shape[0], alpha=0.5, linetype='dashed')
    + geom_point(
        data=d_actuals_train.loc[d_actuals_train['time_tick'] >= idx0, :],
        mapping=aes(x='time_tick', y='y', color='ext_regressors', shape='ext_regressors'),
        size=2,
    )
    + geom_point(
        data=d_actuals_test.loc[d_actuals_test['time_tick'] >= idx0, :],
        mapping=aes(x='time_tick', y='y', color='ext_regressors', shape='ext_regressors'),
        size=1,
        alpha=0.5,
    )
    + ggtitle('past and future visits decomposed into sub-populations\n(left side training, right side forecast)')
)

The calling sequence to do this in Stan from Python is as follows. One can also call Stan from the command line or from R.

In [None]:
display(Markdown(f"```python\n{inspect.getsource(solve_forecast_by_Stan)}\n```"))

## Bayesian method without external regressors


Let's get back to overall methodology.

As a lower bound on model quality, we can try a Bayesian (Stan) model without external regressors.


In [22]:
# define a Stan model without external predictors
stan_model_with_forecast_0, stan_model_with_forecast_src_0 = (
    define_Stan_model_with_forecast_period(
        application_lags=modeling_lags,
        n_transient_external_regressors=0,
        n_durable_external_regressors=0,
    )
)

In [None]:
# show the model specification
print(stan_model_with_forecast_src_0)

In [24]:
# sample from Stan model solutions
forecast_soln_0 = solve_forecast_by_Stan(
    model=stan_model_with_forecast_0,
    d_train=d_train,
    d_apply=d_test,
)

In [None]:
# summarize parameter estimates
forecast_soln_0.loc[:, [c for c in forecast_soln_0 if c.startswith("b_")]].median()

In [None]:
# show original generative parameters
generating_params

In [None]:
# plot fit over time
plt_0, s_frame_0 = plot_forecast(
    forecast_soln_0,
    d_test,
    model_name="Stan no externals",
    external_regressors=["z_0", "x_0"],
)
plt_0.show()

Notice the predictions are again "middle of the road" estimates.


In [None]:
# plot quality of fit as a scatter plot
d_test["Stan (no externals model) prediction"] = extract_sframe_result(s_frame_0)
plot_model_quality(
    d_test=d_test,
    result_name="Stan (no externals model) prediction",
    external_regressors=["z_0", "x_0"],
)

## Conclusion


And that concludes our note on modeling in the presence of external regressors. The main point is: one has to specify the structure of the regressors. Do they cause durable effects (such as marketing efforts) or do they cause transient effects (such as one-off sales events)? Also: we would like such specifications to be in terms familiar to domain experts, and not deep in ARMAX or transfer function terminology.
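The durable versus transient distinction can be illustrated with a toy simulation in plain Python. The structure and coefficients below are hypothetical and much simpler than the model fit above: a durable pulse is added to the hidden autoregressive state, while a transient pulse is added only to the observed value.

```python
def simulate(
    durable_pulse: float,
    transient_pulse: float,
    n_ticks: int = 8,
    pulse_tick: int = 2,
) -> list:
    """Simulate a toy series with one external pulse at pulse_tick.

    Assumed structure (hypothetical coefficients):
      y_auto[t] = 1.0 + 0.5 * y_auto[t-1] + durable_pulse[t]
      y[t]      = y_auto[t] + transient_pulse[t]
    """
    y_auto_prev = 2.0  # steady state of the recursion with no pulses
    y = []
    for t in range(n_ticks):
        at_pulse = 1.0 if t == pulse_tick else 0.0
        y_auto_t = 1.0 + 0.5 * y_auto_prev + durable_pulse * at_pulse
        y.append(y_auto_t + transient_pulse * at_pulse)
        y_auto_prev = y_auto_t
    return y


durable = simulate(durable_pulse=4.0, transient_pulse=0.0)
transient = simulate(durable_pulse=0.0, transient_pulse=4.0)
print("durable:  ", durable)    # [2.0, 2.0, 6.0, 4.0, 3.0, 2.5, 2.25, 2.125]
print("transient:", transient)  # [2.0, 2.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

Both series show the same jump at the pulse, but only the durable one carries the effect into later ticks. A model that assigns a regressor to the wrong role will therefore fit the pulse itself and mispredict what follows.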
