### Imports

In [1]:
import numpy as np
import pandas as pd
import arviz as az
from cmdstanpy import CmdStanModel

## Regression Models

Stan supports regression models from simple linear regressions to multilevel generalized linear models.

### Linear regression

The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as:
$$
y_n = \alpha + \beta x_n + \epsilon_n
\quad\text{where}\quad
\epsilon_n \sim \operatorname{normal}(0,\sigma).
$$

This is equivalent to the following sampling involving the residual,
$$
y_n - (\alpha + \beta x_n) \sim \operatorname{normal}(0,\sigma),
$$
and reducing still further, to
$$
y_n \sim \operatorname{normal}(\alpha + \beta x_n, \, \sigma).
$$

In [2]:
linear_code_1 = '''
data {
    int<lower=0> N;
    vector[N] x;
    vector[N] y;
}

parameters {
    real alpha;
    real beta;
    real<lower=0> sigma;
}

model {
    y ~ normal(alpha + beta * x, sigma);
}
'''

stan_file = './stan_models/linear_code_1.stan'

with open(stan_file, 'w') as f:
    print(linear_code_1, file=f)
    
linear_1_model = CmdStanModel(stan_file=stan_file, force_compile=True, cpp_options={'STAN_THREADS':'true'})

11:14:40 - cmdstanpy - INFO - compiling stan file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/linear_code_1.stan to exe file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/linear_code_1
11:14:49 - cmdstanpy - INFO - compiled model executable: /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/linear_code_1


There are `N` observations and for each observation, $n \in 1{:}N$, we have predictor `x[n]` and outcome `y[n]`.  The intercept and slope parameters are `alpha` and `beta`. The model assumes a normally distributed noise term with scale `sigma`. This model has improper priors for the two regression coefficients.
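
To try the model out, it helps to have data whose generating values are known. The following is a minimal sketch that simulates from the regression equation above using only NumPy; the names `alpha_true`, `beta_true`, `sigma_true`, and `linear_data` are hypothetical and chosen for illustration.

```python
import numpy as np

# Hypothetical "true" values, used only to simulate illustrative data.
alpha_true, beta_true, sigma_true = 1.5, -0.7, 0.5

rng = np.random.default_rng(seed=1234)
N = 200
x = rng.normal(size=N)
# y_n = alpha + beta * x_n + epsilon_n, with epsilon_n ~ normal(0, sigma)
y = alpha_true + beta_true * x + rng.normal(scale=sigma_true, size=N)

# Dictionary whose keys match the names declared in the Stan data block.
linear_data = {"N": N, "x": x.tolist(), "y": y.tolist()}

# Least-squares estimates as a rough sanity check on the simulation.
# np.polyfit returns coefficients from highest degree to lowest.
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
```

If CmdStan is installed, a call such as `linear_1_model.sample(data=linear_data)` should draw from the posterior; with improper flat priors and this much data, the posterior means would be expected to land close to the least-squares estimates.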

#### Matrix notation and vectorization

The distribution statement in the previous model is vectorized, with

```stan
y ~ normal(alpha + beta * x, sigma);
```

providing the same model as the unvectorized version,

```stan
for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], sigma);
}
```

In addition to being more concise, the vectorized form is much faster.

In general, Stan allows the arguments to distributions such as `normal` to be vectors. If any of the other arguments are vectors or arrays, they have to be the same size. If any of the other arguments is a scalar, it is reused (broadcast) for each vector entry.
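
The equivalence of the looped and vectorized forms, and the broadcasting of a scalar `sigma`, can be mimicked outside of Stan. The sketch below uses a hand-written `normal_lpdf` helper (a hypothetical stand-in, not Stan's implementation) and illustrative data values. It demonstrates only that the two forms give the same log density, not the speed difference.

```python
import numpy as np

def normal_lpdf(y, mu, sigma):
    """Sum of normal log densities; mu and sigma may be scalars or arrays."""
    y, mu, sigma = np.broadcast_arrays(y, mu, sigma)
    z = (y - mu) / sigma
    return float(np.sum(-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)))

x = np.array([0.1, -1.2, 0.7, 2.0])
y = np.array([1.0, 0.3, 1.4, 2.9])
alpha, beta, sigma = 0.5, 1.1, 0.8

# Looped form: one scalar density evaluation per observation.
looped = sum(normal_lpdf(y[n], alpha + beta * x[n], sigma) for n in range(len(y)))

# Vectorized form: the scalar sigma is broadcast across all entries.
vectorized = normal_lpdf(y, alpha + beta * x, sigma)
```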

The other reason this works is that Stan's arithmetic operators are overloaded to perform matrix arithmetic on matrices.  In this case, because `x` is of type `vector` and `beta` of type `real`, the expression `beta * x` is of type `vector`. Because Stan supports vectorization, a regression model with more than one predictor can be written directly using matrix notation.

In [3]:
linear_code_2 = '''
data {
    int<lower=0> N;         // number of data items
    int<lower=0> K;         // number of predictors
    matrix[N, K] x;         // predictor matrix
    vector[N] y;            // outcome vector
}

parameters {
    real alpha;             // intercept
    vector[K] beta;         // coefficients for predictors
    real<lower=0> sigma;    // error scale
}

model {
    y ~ normal(x * beta + alpha, sigma); // data model
}
'''

stan_file = './stan_models/linear_code_2.stan'

with open(stan_file, 'w') as f:
    print(linear_code_2, file=f)
    
linear_2_model = CmdStanModel(stan_file=stan_file, force_compile=True)

11:14:49 - cmdstanpy - INFO - compiling stan file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/linear_code_2.stan to exe file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/linear_code_2
11:14:58 - cmdstanpy - INFO - compiled model executable: /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/linear_code_2


The constraint `lower=0` in the declaration of `sigma` constrains the value to be greater than or equal to 0.  With no prior in the model block, the effect is an improper prior on non-negative real numbers.  Although a more informative prior may be added, improper priors are acceptable as long as they lead to proper posteriors.

In the model above, `x` is an $N \times K$ matrix of predictors and `beta` a $K$-vector of coefficients, so `x * beta` is an $N$-vector of predictions, one for each of the $N$ data items. These
predictions line up with the outcomes in the $N$-vector `y`, so the entire model may be written using matrix arithmetic as shown.  It would be possible to include a column of ones in the data matrix `x` to
remove the `alpha` parameter.

The distribution statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.

```stan
model {
  for (n in 1:N) {
    y[n] ~ normal(x[n] * beta + alpha, sigma);
  }
}
```

With Stan's matrix indexing scheme, `x[n]` picks out row `n` of the matrix `x`;  because `beta` is a column vector, the product `x[n] * beta` is a scalar of type `real`.

##### Intercepts as inputs

In the model formulation

```stan
y ~ normal(x * beta, sigma);
```

there is no longer an intercept coefficient `alpha`.  Instead, we have assumed that the first column of the input matrix `x` is a column of 1 values.  This way, `beta[1]` plays the role of the intercept.  If the intercept gets a different prior than the slope terms, then it would be clearer to break it out.  The explicit form with the intercept variable singled out is also slightly more efficient because there is one fewer multiplication; it should not make much of a difference to speed, though, so the choice should be based on clarity.
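
The two ways of handling the intercept give the same linear predictor, which is easy to check numerically. The sketch below uses made-up values for `x`, `alpha`, and `beta`; the names `x_ones` and `beta_ones` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, K = 5, 2
x = rng.normal(size=(N, K))
alpha = 0.3
beta = np.array([1.0, -2.0])

# Explicit intercept: eta = x * beta + alpha
eta_explicit = x @ beta + alpha

# Intercept as input: prepend a column of ones so that the first
# coefficient plays the role of alpha.
x_ones = np.column_stack([np.ones(N), x])
beta_ones = np.concatenate([[alpha], beta])
eta_implicit = x_ones @ beta_ones
```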

### The QR reparameterization

In the previous example, the linear predictor can be written as $\eta = x \beta$, where $\eta$ is an $N$-vector of predictions, $x$ is an $N \times K$ matrix, and $\beta$ is a $K$-vector of coefficients. Presuming $N \geq K$, we can exploit the fact that any design matrix $x$ can be decomposed using the thin QR decomposition into an orthogonal matrix $Q$ and an upper-triangular matrix $R$, i.e., $x = Q R$.

The functions `qr_thin_Q` and `qr_thin_R` implement the thin QR decomposition, which is to be preferred to the fat QR decomposition that would be obtained by using `qr_Q` and `qr_R`, as the latter would more easily run out of memory (see the Stan Functions Reference for more information on the `qr_thin_Q` and `qr_thin_R` functions). In practice, it is best to write $x = Q^\ast R^\ast$ where $Q^\ast = Q \sqrt{N - 1}$ and $R^\ast = \frac{1}{\sqrt{N - 1}} R$. Thus, we can equivalently write $\eta = x \beta = Q R \beta = Q^\ast R^\ast \beta$. If we let $\theta = R^\ast \beta$, then we have $\eta = Q^\ast \theta$ and $\beta = (R^\ast)^{-1} \theta$. In that case, the previous Stan program becomes

In [4]:
qr_reparam = '''
data {
    int<lower=0> N;         // number of data items
    int<lower=0> K;         // number of predictors
    matrix[N, K] x;         // predictor matrix
    vector[N] y;            // outcome vector
}

transformed data {
    matrix[N, K] Q_ast;
    matrix[K, K] R_ast;
    matrix[K, K] R_ast_inverse;
    // thin and scale the QR decomposition
    Q_ast = qr_thin_Q(x) * sqrt(N - 1);
    R_ast = qr_thin_R(x) / sqrt(N - 1);
    R_ast_inverse = inverse(R_ast);
}

parameters {
    real alpha;             // intercept
    vector[K] theta;        // coefficients for Q_ast
    real<lower=0> sigma;    // error scale    
}

model {
    y ~ normal(Q_ast * theta + alpha, sigma); // data model
}

generated quantities {
    vector[K] beta;
    beta = R_ast_inverse * theta; // coefficients on x
}
'''

stan_file = './stan_models/qr_reparam.stan'

with open(stan_file, 'w') as f:
    print(qr_reparam, file=f)
    
qr_reparam_model = CmdStanModel(stan_file=stan_file, force_compile=True)

11:14:58 - cmdstanpy - INFO - compiling stan file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/qr_reparam.stan to exe file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/qr_reparam
11:15:10 - cmdstanpy - INFO - compiled model executable: /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/qr_reparam


Since this Stan program generates equivalent predictions for $y$ and the same posterior distribution for $\alpha$, $\beta$, and $\sigma$ as the previous Stan program, many wonder why the version with this QR reparameterization performs so much better in practice, often both in terms of wall time and in terms of effective sample size. The reasoning is threefold:

1. The columns of $Q^\ast$ are orthogonal whereas the columns of $x$ generally are not. Thus, it is easier for a Markov Chain to move around in $\theta$-space than in $\beta$-space.
2. The columns of $Q^\ast$ have the same scale whereas the columns of $x$ generally do not. Thus, a Hamiltonian Monte Carlo algorithm can move around the parameter space with a smaller number of larger steps.
3. Since the covariance matrix for the columns of $Q^\ast$ is an identity matrix, $\theta$ typically has a reasonable scale if the units of $y$ are also reasonable. This also helps HMC move efficiently without compromising numerical accuracy.

Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever $K > 1$ and you do not have an informative prior on the location of $\beta$. It can also be worthwhile to subtract the mean from each column of $x$ before obtaining the QR decomposition, which does not affect the posterior distribution of $\theta$ or $\beta$ but does affect $\alpha$ and
allows you to interpret $\alpha$ as the expectation of $y$ in a linear model.
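
The algebra behind the reparameterization can be checked numerically. The following sketch uses `numpy.linalg.qr` in place of Stan's `qr_thin_Q` and `qr_thin_R` (the two may differ in the signs of corresponding columns and rows) and a made-up predictor matrix with correlated, differently scaled columns.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
N, K = 50, 3
# Deliberately correlated, differently scaled predictors (illustrative data).
z = rng.normal(size=(N, K))
x = np.column_stack([z[:, 0], 10 * (z[:, 0] + 0.1 * z[:, 1]), 0.01 * z[:, 2]])

# Thin QR: Q is N x K with orthonormal columns, R is K x K upper triangular.
Q, R = np.linalg.qr(x, mode="reduced")
Q_ast = Q * np.sqrt(N - 1)
R_ast = R / np.sqrt(N - 1)

beta = np.array([0.5, -1.0, 2.0])
theta = R_ast @ beta                       # theta = R* beta
beta_back = np.linalg.solve(R_ast, theta)  # beta = (R*)^{-1} theta

# Both parameterizations give the same linear predictor.
eta_x = x @ beta
eta_q = Q_ast @ theta
```

Here `Q_ast.T @ Q_ast / (N - 1)` is the identity matrix up to rounding, which is the sense in which the columns of $Q^\ast$ are orthogonal and share a common scale.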

### Robust noise models

The standard approach to linear regression is to model the noise term $\epsilon$ as having a normal distribution. From Stan's perspective, there is nothing special about normally distributed noise. For instance, robust regression can be accommodated by giving the noise term a Student-$t$ distribution. To code this in Stan, the distribution statement is changed to the following.

```stan
data {
  // ...
  real<lower=0> nu;
}
// ...
model {
  y ~ student_t(nu, alpha + beta * x, sigma);
}
```

The degrees of freedom constant `nu` is specified as data.
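
One way to see why the Student-$t$ noise model is more robust is to compare how strongly each density penalizes a large residual. The sketch below hand-codes both log densities with the standard library; the choice of `nu = 4` and a residual of six scale units is purely illustrative.

```python
import math

def normal_lpdf(y, mu, sigma):
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def student_t_lpdf(y, nu, mu, sigma):
    z = (y - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

# Drop in log density when moving from the center to a residual of 6 sigma.
penalty_normal = normal_lpdf(0.0, 0.0, 1.0) - normal_lpdf(6.0, 0.0, 1.0)
penalty_student = student_t_lpdf(0.0, 4, 0.0, 1.0) - student_t_lpdf(6.0, 4, 0.0, 1.0)
```

The normal penalty grows quadratically in the residual while the Student-$t$ penalty grows only logarithmically, so a single outlier pulls much less on `alpha` and `beta` under the Student-$t$ model.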

### Logistic and probit regression

For binary outcomes, either of the closely related logistic or probit regression models may be used.  These generalized linear models differ only in the inverse link function they use to map linear predictions in $(-\infty,\infty)$ to probability values in $(0,1)$.  Their respective inverse link functions, the logistic function and the standard normal cumulative distribution function, are both sigmoid functions (i.e., they are both *S*-shaped).

A logistic regression model with one predictor and an intercept is coded as follows.

In [5]:
logistic_regression = '''
data {
    int<lower=0> N;
    vector[N] x;
    array[N] int<lower=0, upper=1> y;
}
parameters {
    real alpha;
    real beta;
}
model {
    y ~ bernoulli_logit(alpha + beta * x);
}
'''

stan_file = './stan_models/logistic_regression.stan'

with open(stan_file, 'w') as f:
    print(logistic_regression, file=f)

logistic_regression_model = CmdStanModel(stan_file=stan_file, force_compile=True, cpp_options={'STAN_THREADS':'true'})

11:15:10 - cmdstanpy - INFO - compiling stan file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/logistic_regression.stan to exe file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/logistic_regression
11:15:18 - cmdstanpy - INFO - compiled model executable: /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/logistic_regression


The noise parameter is built into the Bernoulli formulation here rather than specified directly.

Logistic regression is a kind of generalized linear model with binary outcomes and the log odds (logit) link function, defined by

$$
\operatorname{logit}(v) = \log \left( \frac{v}{1-v} \right).
$$

The inverse of the link function appears in the model:
$$
\operatorname{logit}^{-1}(u) = \texttt{inv}\mathtt{\_}\texttt{logit}(u) = \frac{1}{1 + \exp(-u)}.
$$

The model formulation above uses the logit-parameterized version of the Bernoulli distribution, which is defined by
$$
\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha \right)
=
\texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).
$$

The formulation is also vectorized in the sense that `alpha` and `beta` are scalars and `x` is a vector, so that `alpha + beta * x` is a vector. The vectorized formulation is equivalent to the less efficient version

```stan
for (n in 1:N) {
  y[n] ~ bernoulli_logit(alpha + beta * x[n]);
}
```

Expanding out the Bernoulli logit, the model is equivalent to the more explicit, but less efficient and less arithmetically stable


```stan
for (n in 1:N) {
  y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}
```
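
The stability difference can be illustrated in plain Python. This is a sketch of the general idea of working on the logit scale, not Stan's actual implementation; the helper names are hypothetical.

```python
import math

def log1p_exp(u):
    """Stable evaluation of log(1 + exp(u))."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def bernoulli_logit_lpmf(y, eta):
    """log bernoulli(y | inv_logit(eta)), computed on the logit scale."""
    return -log1p_exp(-eta) if y == 1 else -log1p_exp(eta)

def bernoulli_naive_lpmf(y, eta):
    """Same quantity by way of an explicit probability; loses precision."""
    p = 1.0 / (1.0 + math.exp(-eta))
    q = p if y == 1 else 1.0 - p
    return math.log(q) if q > 0.0 else float("-inf")

# With a large linear predictor, inv_logit rounds to exactly 1.0, so the
# naive version assigns log probability -inf to y = 0.
stable = bernoulli_logit_lpmf(0, 40.0)
naive = bernoulli_naive_lpmf(0, 40.0)
```

For moderate values of the linear predictor the two versions agree to within rounding error; they diverge only once the probability saturates in floating point.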

Other link functions may be used in the same way. For example, probit regression uses the cumulative normal distribution function, which is typically written as

$$
\Phi(x) = \int_{-\infty}^x \textsf{normal}\left(y \mid 0,1 \right) \,\textrm{d}y.
$$

The cumulative standard normal distribution function $\Phi$ is implemented in Stan as the function `Phi`. The probit regression model may be coded in Stan by replacing the logistic model's distribution
statement with the following.

```stan
y[n] ~ bernoulli(Phi(alpha + beta * x[n]));
```

A fast approximation to the cumulative standard normal distribution function $\Phi$ is implemented in Stan as the function `Phi_approx`. (The `Phi_approx` function is a rescaled version of the inverse logit function, so while the scale is roughly the same as $\Phi$, the tails do not match.) The approximate probit regression model may be coded with the following.

```stan
y[n] ~ bernoulli(Phi_approx(alpha + beta * x[n]));
```
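
To get a feel for how close the approximation is, the two functions can be compared on a grid. The sketch below assumes the form documented in the Stan Functions Reference, $\texttt{Phi\_approx}(x) = \operatorname{logit}^{-1}(0.07056\,x^3 + 1.5976\,x)$; the coefficients should be checked against that reference before relying on them.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def Phi_approx(x):
    """Logistic approximation; coefficients assumed from the Stan docs."""
    return inv_logit(0.07056 * x**3 + 1.5976 * x)

# Largest absolute discrepancy on a grid from -4 to 4 in steps of 0.1.
grid = [i / 10 for i in range(-40, 41)]
max_abs_err = max(abs(Phi(x) - Phi_approx(x)) for x in grid)
```

The absolute error is small over this range, but the ratio of the two functions drifts far from one in the extreme tails, which is the mismatch the parenthetical remark above refers to.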

### Multi-logit regression

Multiple outcome forms of logistic regression can be coded directly in Stan.  For instance, suppose there are $K$ possible outcomes for each output variable $y_n$. Also suppose that there is a $D$-dimensional vector $x_n$ of predictors for $y_n$.  The multi-logit model with $\textsf{normal}(0,5)$ priors on the coefficients is coded as follows.

In [6]:
mutli_logit_reg = '''
data {
    int K;
    int N;
    int D;
    array[N] int y;
    matrix[N,D] x;
}
parameters {
    matrix[D, K] beta;
}
model {
    matrix[N, K] x_beta = x * beta;
    
    to_vector(beta) ~ normal(0, 5);
    
    for (n in 1:N) {
        y[n] ~ categorical_logit(x_beta[n]');
    }
}
'''

stan_file = './stan_models/mutli_logi_reg.stan'

with open(stan_file, 'w') as f:
    print(mutli_logit_reg, file=f)
    
mutli_logit_reg_model = CmdStanModel(stan_file=stan_file, force_compile=True, cpp_options={'STAN_THREADS':'true'})

11:15:18 - cmdstanpy - INFO - compiling stan file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/mutli_logi_reg.stan to exe file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/mutli_logi_reg
11:15:28 - cmdstanpy - INFO - compiled model executable: /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/mutli_logi_reg


where `x_beta[n]'` is the transpose of `x_beta[n]`. The prior on `beta` is coded in vectorized form. As of Stan 2.18, the categorical-logit distribution is not vectorized for parameter arguments, so the loop is required. The matrix multiplication is pulled out to define a local variable for all of the predictors for efficiency. Like the Bernoulli-logit, the categorical-logit distribution applies softmax internally to convert an arbitrary vector to a simplex,

$$
\texttt{categorical}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right)
=
\texttt{categorical}\left(y \mid \texttt{softmax}(\alpha)\right),
$$

where

$$
\texttt{softmax}(u) = \exp(u) / \operatorname{sum}\left(\exp(u)\right).
$$

The categorical distribution with log-odds (logit) scaled parameters used above is equivalent to writing

```stan
y[n] ~ categorical(softmax((x[n] * beta)'));
```
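
The relationship between the two forms can be sketched with NumPy. The helper functions below are hypothetical stand-ins for Stan's built-ins, written with the usual subtract-the-maximum trick for numerical stability; outcomes are 1-based as in Stan.

```python
import numpy as np

def softmax(u):
    """exp(u) / sum(exp(u)), shifted by max(u) to avoid overflow."""
    e = np.exp(u - np.max(u))
    return e / e.sum()

def categorical_logit_lpmf(y, u):
    """log categorical(y | softmax(u)) for a 1-based outcome y."""
    shifted = u - np.max(u)
    return float(shifted[y - 1] - np.log(np.sum(np.exp(shifted))))

u = np.array([1.0, -0.5, 2.0])   # illustrative linear predictors, K = 3
probs = softmax(u)
lp = categorical_logit_lpmf(3, u)
```

Working on the log scale avoids forming the simplex explicitly, which is part of why the logit-parameterized form tends to be preferable.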

### Constraints on data declarations

The data block in the above model is defined without constraints on sizes `K`, `N`, and `D` or on the outcome array `y`. Constraints on data declarations provide error checking at the point data are read (or transformed data are defined), which is before sampling begins. Constraints on data declarations also make the model author's intentions more explicit, which can help with readability. The above model's declarations could be tightened to

```stan
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;
```

These constraints arise because the number of categories, `K`, must be at least two in order for a categorical model to be useful. The number of data items, `N`, can be zero, but not negative; unlike R, Stan's for-loops always move forward, so that a loop extent of `1:N` when `N` is equal to zero ensures the loop's body will not be executed.  The number of predictors, `D`, must be at least one in order for `x[n] * beta` to produce an appropriate argument for `softmax()`.  The categorical outcomes `y[n]` must be between `1` and `K` in order for the discrete sampling to be well defined.

Constraints on data declarations are optional. Constraints on parameters declared in the `parameters` block, on the other hand, are *not* optional---they are required to ensure support for all parameter values satisfying their constraints. Constraints on transformed data, transformed parameters, and generated quantities are also optional.
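
When the data are prepared in Python, the same checks can be mirrored before the data ever reach Stan, which is handy for catching 0-based category labels early. The function below is a hypothetical helper written for this illustration; it is not part of CmdStanPy.

```python
def check_multi_logit_data(K, N, D, y, x):
    """Raise ValueError if the data violate the tightened declarations."""
    if K < 2:
        raise ValueError("K must be at least 2")
    if N < 0:
        raise ValueError("N must be non-negative")
    if D < 1:
        raise ValueError("D must be at least 1")
    if len(y) != N or len(x) != N:
        raise ValueError("y and x must each have N entries")
    if any(len(row) != D for row in x):
        raise ValueError("each row of x must have D entries")
    if any(not (1 <= y_n <= K) for y_n in y):
        raise ValueError("each y[n] must be in 1..K")
    return True

# Valid data: two observations, three categories, two predictors.
ok = check_multi_logit_data(K=3, N=2, D=2, y=[1, 3], x=[[0.1, 0.2], [0.3, 0.4]])
```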

### Identifiability

Because softmax is invariant under adding a constant to each component of its input, the model is typically only identified if there is a suitable prior on the coefficients.

An alternative is to use $(K-1)$-vectors by fixing one of them to be zero. The partially known parameters section discusses how to mix constants and parameters in a vector. In the multi-logit case, the parameter block would be redefined to use $(K - 1)$-vectors

```stan
parameters {
  matrix[D, K - 1] beta_raw;
}
```

and then these are transformed to parameters to use in the model. First, a transformed data block is added before the parameters block to define a vector of zero values,

```stan
transformed data {
  vector[D] zeros = rep_vector(0, D);
}
```

which can then be appended to `beta_raw` to produce the coefficient matrix `beta`,

```stan
transformed parameters {
  matrix[D, K] beta = append_col(beta_raw, zeros);
}
```

The `rep_vector(0, D)` call creates a column vector of size `D` with all entries set to zero. The derived matrix `beta` is then defined to be the result of appending the vector `zeros` as a new column at the end of `beta_raw`;  the vector `zeros` is defined as transformed data so that it doesn't need to be constructed from scratch each time it is used.

This is not the same model as using $K$-vectors as parameters, because now the prior only applies to $(K-1)$-vectors. In practice, this will cause the maximum likelihood solutions to be different and also the posteriors to be slightly different when taking priors centered around zero, as is typical for regression coefficients.
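
The invariance that makes pinning possible can be verified numerically: subtracting the last column of `beta` from every column leaves the category probabilities unchanged while setting the last column to zero. The values below are randomly generated for illustration and `beta_pinned` is a hypothetical name.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

rng = np.random.default_rng(seed=7)
D, K = 2, 4
beta = rng.normal(size=(D, K))      # unconstrained D x K coefficients
x_n = rng.normal(size=D)            # one row of predictors

# Subtracting the last column from every column shifts x_n @ beta by the
# same constant in every category, which softmax ignores.
beta_pinned = beta - beta[:, [K - 1]]

probs_full = softmax(x_n @ beta)
probs_pinned = softmax(x_n @ beta_pinned)
```

The likelihood is the same in both cases; only the prior, which now acts on the remaining $K - 1$ columns, distinguishes the two models.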

### Parameterizing centered vectors

When there are varying effects in a regression, the resulting likelihood is not identified unless further steps are taken. For example, we might have a global intercept $\alpha$ and then a varying effect $\beta_k$ for age group $k$ to make a linear predictor $\alpha + \beta_k$.  With this predictor, we can add a constant to $\alpha$ and subtract from each $\beta_k$ and get exactly the same likelihood.

The traditional approach to identifying such a model is to pin the first varying effect to zero, i.e., $\beta_1 = 0$.  With one of the varying effects fixed, you can no longer add a constant to all of them and the model's likelihood is identified. In addition to the difficulty in specifying such a model in Stan, it is awkward to formulate priors because the other coefficients are all interpreted relative to $\beta_1$.

In a Bayesian setting, a proper prior on each of the $\beta_k$ is enough to identify the model.  Unfortunately, this can lead to inefficiency during sampling because the model is still only weakly identified through the prior; there is a very simple example of the difference in the discussion of collinearity in the collinearity section.

An alternative identification strategy that allows a symmetric prior is to enforce a sum-to-zero constraint on the varying effects, i.e., $\sum_{k=1}^K \beta_k = 0.$

A parameter vector constrained to sum to zero may also be used to identify a multi-logit regression parameter vector, or may be used for ability or difficulty parameters (but not both) in an IRT model.

### Built-in sum-to-zero vector

As of Stan 2.36, there is a built-in `sum_to_zero_vector` type, which can be used as follows.

```stan
parameters {
  sum_to_zero_vector[K] beta;
  // ...
}
```

This produces a vector of size `K` such that `sum(beta) = 0`.  The unconstrained representation requires only `K - 1` values because the last is determined by the first `K - 1`.

Placing a prior on `beta` in this parameterization, for example,

```stan
  beta ~ normal(0, 1);
```

leads to a subtly different posterior than what you would get with the same prior on an unconstrained size-`K` vector. As explained below, the variance is reduced.

The sum-to-zero constraint can be implemented naively by setting the last element to the negative sum of the first elements, i.e., $\beta_K = -\sum_{k=1}^{K-1} \beta_k.$ But that leads to high correlation among the $\beta_k$.

The transform used in Stan eliminates these correlations by constructing an orthogonal basis and applying it to the sum-to-zero constraint. The *Stan Reference Manual* provides the details in the chapter on transforms.  Although any orthogonal basis can be used, Stan uses the inverse isometric log transform because it is convenient to describe and the transform simplifies to efficient scalar operations rather than more expensive matrix operations.

#### Marginal distribution of sum-to-zero components

On the Stan forums, Aaron Goodman provided the following code to produce a prior with standard normal marginals on the components of `beta`,

```stan
model {
  beta ~ normal(0, inv(sqrt(1 - inv(K))));
  // ...
}
```

The scale component can be multiplied by `sigma` to produce a `normal(0, sigma)` prior marginally.

To generate distributions with marginals other than standard normal, the resulting `beta` may be scaled by some factor `sigma` and translated to some new location `mu`.
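
The origin of the factor $1 / \sqrt{1 - 1/K}$ can be checked with a little linear algebra. The sketch below builds an arbitrary orthonormal basis for the sum-to-zero subspace from a QR decomposition (an assumption made for illustration; it is not necessarily the basis Stan uses) and computes the implied marginal standard deviations.

```python
import numpy as np

K = 5

# Orthonormal basis V (K x (K-1)) for vectors that sum to zero, taken from
# a QR decomposition whose first column is proportional to a vector of ones.
A = np.column_stack([np.ones(K), np.eye(K)[:, : K - 1]])
Q, _ = np.linalg.qr(A)
V = Q[:, 1:]

# If z ~ normal(0, s) independently in K - 1 dimensions, then beta = V z
# sums to zero and has covariance s^2 * V V^T = s^2 * (I - 11^T / K).
cov_unit = V @ V.T
s = 1.0 / np.sqrt(1.0 - 1.0 / K)   # the scale used in the prior above
marginal_sd = np.sqrt(np.diag(s**2 * cov_unit))
```

With scale 1 each component would have marginal variance $1 - 1/K$, which is the variance reduction mentioned earlier; inflating the scale by $1 / \sqrt{1 - 1/K}$ restores unit marginal standard deviations.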

#### Soft centering

Adding a prior such as $\beta \sim \textsf{normal}(0,\epsilon)$ for a small $\epsilon$ will provide a kind of soft centering of a parameter vector $\beta$ by preferring, all else being equal, that $\sum_{k=1}^K \beta_k = 0$.  This approach is only guaranteed to roughly center if $\beta$ and the elementwise addition $\beta + c$ for a scalar constant $c$ produce the same likelihood (perhaps by another vector $\alpha$ being transformed to $\alpha - c$, as in the IRT models). This is another way of achieving a symmetric prior, though it requires choosing an $\epsilon$.  If $\epsilon$ is too large, there won't be a strong enough centering effect and if it is too small, it will add high curvature to the target density and impede sampling.

### Ordered logistic and probit regression

Ordered regression for an outcome $y_n \in \{ 1, \dotsc, K \}$ with predictors $x_n \in \mathbb{R}^D$ is determined by a single coefficient vector $\beta \in \mathbb{R}^D$ along with a sequence of cutpoints $c \in \mathbb{R}^{K-1}$ sorted so that $c_d < c_{d+1}$. The discrete output is $k$ if the linear predictor $x_n \beta$ falls between $c_{k-1}$ and $c_k$, assuming $c_0 = -\infty$ and $c_K = \infty$.  The noise term is fixed by the form of regression, with examples for ordered logistic and ordered probit models.

#### Ordered logistic regression

The ordered logistic model can be coded in Stan using the `ordered` data type for the cutpoints and the built-in `ordered_logistic` distribution.

In [7]:
ordered_logistic_reg ='''
data {
    int<lower=2> K;
    int<lower=0> N;
    int<lower=1> D;
    array[N] int<lower=1, upper=K> y;
    array[N] row_vector[D] x;
}
parameters {
    vector[D] beta;
    ordered[K - 1] c;
}
model {
    for (n in 1:N) {
        y[n] ~ ordered_logistic(x[n] * beta, c);
    }
}
'''

stan_file = './stan_models/ordered_logistic_reg.stan'

with open(stan_file, 'w') as f:
    print(ordered_logistic_reg, file=f)

ordered_logistic_reg_model = CmdStanModel(stan_file=stan_file, force_compile=True, cpp_options={'STAN_THREADS':'true'})

11:15:28 - cmdstanpy - INFO - compiling stan file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/ordered_logistic_reg.stan to exe file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/ordered_logistic_reg
11:15:36 - cmdstanpy - INFO - compiled model executable: /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/ordered_logistic_reg


The vector of cutpoints `c` is declared as `ordered[K - 1]`, which guarantees that `c[k]` is less than `c[k + 1]`.

If the cutpoints were assigned independent priors, the constraint effectively truncates the joint prior to support over points that satisfy the ordering constraint. Luckily, Stan does not need to compute the effect of the constraint on the normalizing term because the probability is needed only up to a proportion.
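
To make the role of the cutpoints concrete, the category probabilities can be computed by hand. This sketch assumes the parameterization documented for `ordered_logistic` in the Stan Functions Reference, in which $\Pr[y > k] = \operatorname{logit}^{-1}(\eta - c_k)$; the function name and the example values are hypothetical.

```python
import math

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def ordered_logistic_probs(eta, c):
    """Category probabilities for linear predictor eta and sorted cutpoints c."""
    K = len(c) + 1
    # P(y > k) = inv_logit(eta - c[k]) for k = 1..K-1, padded with 1 and 0
    # for the conventions c_0 = -infinity and c_K = +infinity.
    exceed = [1.0] + [inv_logit(eta - c_k) for c_k in c] + [0.0]
    return [exceed[k] - exceed[k + 1] for k in range(K)]

probs = ordered_logistic_probs(eta=0.4, c=[-1.0, 0.0, 1.5])
```

Because the cutpoints are ordered, the exceedance probabilities are decreasing and every difference is positive, so the result is a valid simplex.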

#### Ordered probit

An ordered probit model could be coded in exactly the same way by swapping the cumulative logistic (`inv_logit`) for the cumulative normal (`Phi`).

In [8]:
ordered_probit = '''
data {
    int<lower=2> K;
    int<lower=0> N;
    int<lower=1> D;
    array[N] int<lower=1, upper=K> y;
    array[N] row_vector[D] x;
}
parameters {
    vector[D] beta;
    ordered[K - 1] c;
}
model {
    vector[K] theta;
    for (n in 1:N) {
        real eta;
        eta = x[n] * beta;
        theta[1] = 1 - Phi(eta - c[1]);
        for (k in 2:(K - 1)) {
            theta[k] = Phi(eta - c[k - 1]) - Phi(eta - c[k]);
        }
        theta[K] = Phi(eta - c[K - 1]);
        y[n] ~ categorical(theta);
    }
}
'''

stan_file = './stan_models/ordered_probit.stan'

with open(stan_file, 'w') as f:
    print(ordered_probit, file=f)

ordered_probit_model = CmdStanModel(stan_file=stan_file, force_compile=True, cpp_options={'STAN_THREADS':'true'})

11:15:36 - cmdstanpy - INFO - compiling stan file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/ordered_probit.stan to exe file /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/ordered_probit
11:15:46 - cmdstanpy - INFO - compiled model executable: /Users/rehabnaeem/Documents/Coding-Projects/bayesian-analysis/references/Stan-Modelling/stan_models/ordered_probit


The logistic model could also be coded this way by replacing `Phi` with `inv_logit`, though the built-in encoding based on the softmax transform is more efficient and more numerically stable. A small efficiency gain could be achieved by computing the values `Phi(eta - c[k])` once and storing them for re-use.
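
The suggested caching can be sketched in Python, mirroring the structure of the Stan model block above. The function name is hypothetical and `Phi` is implemented with the standard library's error function rather than Stan's `Phi`.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(eta, c):
    """Category probabilities, evaluating Phi(eta - c[k]) once per cutpoint."""
    K = len(c) + 1
    cached = [Phi(eta - c_k) for c_k in c]   # K - 1 evaluations of Phi
    theta = [0.0] * K
    theta[0] = 1.0 - cached[0]
    for k in range(1, K - 1):
        theta[k] = cached[k - 1] - cached[k]
    theta[K - 1] = cached[K - 2]
    return theta

theta = ordered_probit_probs(eta=0.4, c=[-1.0, 0.0, 1.5])
```

The uncached loop evaluates `Phi` roughly twice per interior category, so for each observation the cached version cuts the number of evaluations from about `2 * (K - 1)` to `K - 1`.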

### Hierarchical regression