This notebook is a work in progress. Right now it only describes the default priors for the fixed effects in a regression with a Normal outcome. Eventually it will cover all of the GLMM cases and include some illustrative figures.

# Fixed effects only, Normal response distribution (LMs)

Consider a regression equation (row indices omitted for simplicity):

$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + e$,

where $e \sim \text{Normal}(0,\sigma^2)$. Our goal is to devise a set of default priors for the regression coefficients that--in the absence of any additional information supplied by the user--will still allow one to obtain reasonable parameter estimates in general. One obviously unsatisfactory solution would be to just use, say, standard Normal distributions for all the priors, i.e.:

$\beta_j \sim \text{Normal}(0, 1)$.

This ignores the fact that the observed variables may all be on wildly different scales. So for some predictors the prior may be extremely narrow or “informative,” shrinking estimates strongly toward 0, while for other predictors the prior may be extremely wide or “vague,” leaving the estimates essentially unchanged. 

One remedy to this differential informativeness issue could be to set the standard deviation of the prior to a very large value, so that the prior is likely to be wide relative to almost any predictor variables. For example,

$\beta_j \sim \text{Normal}(0, 10^{10})$.

This is better, although in principle it suffers from the same problem--that is, a user could still conceivably use variables for which even this prior is implausibly narrow (although in practice it would be unlikely). A different but related worry is that different scalings of the same variable in the same dataset--for example, due to changing the units of measurement--will lead to the prior having different levels of informativeness. This seems undesirable because scaling and shifting the variables has no meaningful consequence for traditional test statistics and standardized effect sizes (with some obvious exceptions pertaining to intercepts and models with interaction terms).
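The unit-of-measurement concern can be illustrated with a small simulation. This is only a sketch: the data, the variable names, and the helper `ols_slope` are made up for the illustration and are not part of Bambi.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x_m = rng.normal(1.7, 0.1, size=n)            # hypothetical predictor, in metres
y = 2.0 + 3.0 * x_m + rng.normal(0, 0.5, size=n)

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

slope_m = ols_slope(x_m, y)             # slope per metre
slope_cm = ols_slope(x_m * 100.0, y)    # same data, predictor in centimetres

# The slope shrinks by exactly the rescaling factor, so any fixed-width prior
# is 100 times wider relative to the centimetre slope than to the metre slope.
print(slope_m, slope_cm, slope_m / slope_cm)
```

The fit is identical in both cases; only the prior's informativeness relative to the coefficient changes.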

### Fixed slopes

The approach we take for the default priors on fixed slopes in Bambi is to set the prior indirectly: we define a prior on the corresponding partial correlation and then work out what scale this implies for the prior on the raw regression coefficient.

One can transform the multiple regression coefficient for the predictor $X_j$ into its corresponding partial correlation $\rho^p_{YX_j}$ (i.e., the partial correlation between the outcome and $X_j$, controlling for all the other predictors) using the identity:

$\rho^p_{YX_j} = \beta_j \sqrt{\frac{(1-R^2_{X_jX_{-j}})\text{var}(X_j)}{(1-R^2_{YX_{-j}})\text{var}(Y)}}$,

where $\beta_j$ is the slope for $X_j$, $R^2_{X_jX_{-j}}$ is the coefficient of determination $R^2$ from a regression of $X_j$ on all the other predictors (ignoring $Y$), and $R^2_{YX_{-j}}$ is the $R^2$ from the regression of $Y$ on all the predictors other than $X_j$.
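The identity can be checked numerically. The sketch below uses simulated data and a hypothetical helper `ols_residuals` (neither comes from Bambi). It compares the value given by the identity with the partial correlation computed directly as the correlation between two sets of residuals.

```python
import numpy as np

def ols_residuals(target, predictors):
    """Residuals from an OLS regression of target on predictors plus an intercept."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))
X[:, 1] += 0.5 * X[:, 0]                      # make the predictors correlated
y = 1.0 + X @ np.array([0.8, -0.4, 0.2]) + rng.normal(size=n)

j = 0                                         # predictor of interest
x_j = X[:, j]
others = np.delete(X, j, axis=1)

# Slope for X_j from the full multiple regression (index 0 is the intercept).
full_design = np.column_stack([np.ones(n), X])
beta_j = np.linalg.lstsq(full_design, y, rcond=None)[0][j + 1]

# R^2 of X_j on the other predictors, and of Y on the other predictors.
res_x = ols_residuals(x_j, others)
res_y = ols_residuals(y, others)
r2_x = 1.0 - res_x.var() / x_j.var()
r2_y = 1.0 - res_y.var() / y.var()

rho_identity = beta_j * np.sqrt((1 - r2_x) * x_j.var() / ((1 - r2_y) * y.var()))
rho_direct = np.corrcoef(res_y, res_x)[0, 1]  # correlation of the two residual sets
print(rho_identity, rho_direct)
```

The two values agree up to floating-point error, provided the same variance convention (here NumPy's default `ddof=0`) is used throughout.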

Now we define some prior distribution for $\rho^p_{YX_j}$ that has mean zero and standard deviation $\sigma_\rho$. This implies that

$\begin{aligned} \text{var}(\beta_j) &= \text{var}\left(\rho^p_{YX_j}\sqrt{\frac{(1-R^2_{YX_{-j}})\text{var}(Y)}{(1-R^2_{X_jX_{-j}})\text{var}(X_j)}}\right) \\ &= \frac{(1-R^2_{YX_{-j}})\text{var}(Y)}{(1-R^2_{X_jX_{-j}})\text{var}(X_j)}\sigma^2_\rho \end{aligned}$.
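As a rough sketch of how this formula could be turned into code, the function below computes the implied prior standard deviation for one slope. The names `r_squared` and `slope_prior_sd` are hypothetical and the data are simulated; this is not Bambi's actual implementation.

```python
import numpy as np

def r_squared(target, predictors):
    """R^2 from an OLS regression of target on predictors plus an intercept."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var()

def slope_prior_sd(y, X, j, sigma_rho=np.sqrt(1 / 3)):
    """Implied prior SD of the slope for column j, given an SD on the partial correlation."""
    x_j = X[:, j]
    others = np.delete(X, j, axis=1)
    r2_x = r_squared(x_j, others)   # X_j regressed on the other predictors
    r2_y = r_squared(y, others)     # Y regressed on the other predictors
    scale = (1 - r2_y) * y.var() / ((1 - r2_x) * x_j.var())
    return sigma_rho * np.sqrt(scale)

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))
y = 0.5 + X @ np.array([1.0, -2.0]) + rng.normal(size=n)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 100.0           # e.g. metres -> centimetres

sd_original = slope_prior_sd(y, X, j=0)
sd_rescaled = slope_prior_sd(y, X_rescaled, j=0)
print(sd_original, sd_rescaled)
```

Rescaling the predictor by 100 shrinks the implied prior standard deviation by the same factor, so the prior stays equally informative whatever units are chosen.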

One can tune the width or informativeness of this prior by setting different values of $\sigma_\rho$, corresponding to different standard deviations of the distribution of plausible partial correlations. Our default prior is a Normal distribution with zero mean and a standard deviation that follows this scheme, with $\sigma_\rho = \sqrt{1/3} \approx .577$, which is the standard deviation of a flat prior on the interval $[-1, 1]$. We allow users to specify their priors in terms of $\sigma_\rho$, or in terms of four labels:

 - “narrow” meaning $\sigma_\rho=.2$,
 - “medium” meaning $\sigma_\rho=.4$,
 - “wide” meaning $\sigma_\rho=\sqrt{1/3} \approx .577$ (i.e., the default), or
 - “superwide” meaning $\sigma_\rho=.8$.
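The mapping from labels to $\sigma_\rho$ values could be written as a small lookup. The names `WIDTH_LABELS` and `resolve_sigma_rho` are hypothetical, chosen only for illustration. The bounds check on numeric input is an added assumption, based on the fact that the standard deviation of a partial correlation cannot exceed 1.

```python
import math

# Label -> sigma_rho, using the values listed above.
WIDTH_LABELS = {
    "narrow": 0.2,
    "medium": 0.4,
    "wide": math.sqrt(1 / 3),   # the default
    "superwide": 0.8,
}

def resolve_sigma_rho(width="wide"):
    """Return sigma_rho for a label, or validate a numeric sigma_rho."""
    if isinstance(width, str):
        if width not in WIDTH_LABELS:
            raise ValueError(f"unknown width label: {width!r}")
        return WIDTH_LABELS[width]
    value = float(width)
    if not 0.0 < value <= 1.0:
        raise ValueError("sigma_rho must lie in (0, 1]")
    return value

print(resolve_sigma_rho(), resolve_sigma_rho("narrow"), resolve_sigma_rho(0.3))
```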

Note that the maximum possible standard deviation of a distribution of partial correlations is 1, which would be a distribution with half of the values at $\rho^p_{YX_j}=-1$ and the other half at $\rho^p_{YX_j}=1$. Viewed from this partial correlation perspective, it seems hard to theoretically justify anything wider than our “wide” default, since this would correspond to something that is wider than a flat prior on the partial correlation scale (although we note that, purely practically speaking, there are often no discernible problems in using such a wider prior).
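Both standard deviations mentioned here follow from standard variance formulas. For a flat prior, $\rho \sim \text{Uniform}(a, b)$ with $a=-1$ and $b=1$:

$\text{var}(\rho) = \frac{(b-a)^2}{12} = \frac{4}{12} = \frac{1}{3}$, so $\sigma_\rho = \sqrt{1/3} \approx .577$.

For the two-point distribution with probability $1/2$ at each of $-1$ and $1$:

$\text{var}(\rho) = \text{E}[\rho^2] - \text{E}[\rho]^2 = 1 - 0 = 1$.

Since $\rho^2 \le 1$ for any value in $[-1, 1]$, no distribution on that interval can have a variance larger than 1.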

### The intercept / constant term

The default prior for the intercept $\beta_0$ must follow a different scheme, since partial correlations with the constant term are undefined. We first note that in ordinary least squares (OLS) regression $\beta_0 = \bar{Y} - \beta_1\bar{X}_1 - \beta_2\bar{X}_2 - \dots$. So we can set the mean of the prior on $\beta_0$ to

$\text{E}[\beta_0] = \bar{Y} - \text{E}[\beta_1]\bar{X}_1 - \text{E}[\beta_2]\bar{X}_2 - \dots$.

In practice, the priors on the slopes will typically be set to have zero mean, so the mean of the prior on $\beta_0$ will typically reduce to $\bar{Y}$.

Now for the variance of $\beta_0$ we have (treating $\bar{Y}$ and the predictor means as fixed constants, and assuming independence of the slope priors):

$\text{var}(\beta_0) = \bar{X}_1^2\text{var}(\beta_1) + \bar{X}_2^2\text{var}(\beta_2) + \dots$.

In other words, once we have defined the priors on the slopes, we can combine this with the means of the predictors to find the implied variance of $\beta_0$. Our default prior for the intercept is a Normal distribution following this scheme, and it assumes that all of the slopes were set to have “wide” priors (as defined above), regardless of what the user actually selected for the width of the slope priors.
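The two formulas for the intercept prior translate directly into a few lines of code. This is a minimal sketch with a hypothetical function name, `intercept_prior`, and toy numbers. It takes the slope prior means and standard deviations as given, and it assumes independent slope priors as in the text.

```python
import numpy as np

def intercept_prior(y, X, slope_means, slope_sds):
    """Mean and SD of the implied Normal prior on the intercept."""
    x_bar = X.mean(axis=0)                       # predictor means
    slope_means = np.asarray(slope_means, dtype=float)
    slope_sds = np.asarray(slope_sds, dtype=float)
    mean = y.mean() - np.dot(slope_means, x_bar)
    variance = np.sum(x_bar**2 * slope_sds**2)   # independence of slope priors assumed
    return mean, np.sqrt(variance)

# Toy data: predictor means are 3 and 4, and the mean of y is 2.
y = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

prior_mean, prior_sd = intercept_prior(y, X, slope_means=[0.0, 0.0], slope_sds=[1.0, 2.0])
print(prior_mean, prior_sd)
```

With zero-mean slope priors the prior mean reduces to $\bar{Y}$. In this toy example the prior variance is $3^2 \cdot 1^2 + 4^2 \cdot 2^2 = 73$.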

### Residual standard deviation ($\sigma$)

We know that necessarily $0 \le \sigma \le \text{SD}(Y)$. Ideally we would have a prior with support bounded on $[0,\text{SD}(Y)]$ and negatively skewed, where the lower bound $\sigma = 0$ corresponds to a coefficient of determination $R^2=1$ and the upper bound $\sigma = \text{SD}(Y)$ corresponds to $R^2=0$. Because there is no such ready-made distribution in the backend packages underlying Bambi--and because it will tend to make little difference in practice--we simply use $\sigma \sim \text{Uniform}(0,\text{SD}(Y))$.
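The bounds of this prior are easy to compute from the data. The sketch below uses simulated data with made-up variable names. It computes the upper bound $\text{SD}(Y)$ and checks that the residual standard deviation from an OLS fit falls inside $[0, \text{SD}(Y)]$. Both standard deviations use the same divisor (NumPy's default `ddof=0`), which is an assumption of this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 250
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([0.7, -0.3]) + rng.normal(scale=1.5, size=n)

# Bounds of the Uniform prior on the residual standard deviation.
lower, upper = 0.0, y.std()

# Residual SD from an OLS fit with an intercept.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef
sigma_hat = resid.std()

r2 = 1.0 - resid.var() / y.var()
print(lower, sigma_hat, upper, r2)
```

A perfect fit would push `sigma_hat` to 0 ($R^2 = 1$), while predictors with no linear relationship to the outcome leave it at the upper bound ($R^2 = 0$).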

### Models without a constant term

# Fixed and random effects, Normal response distribution (LMMs)

# Non-Normal response distribution (GLMs and GLMMs)