<a href="https://fabiandablander.com/r/Spike-and-Slab.html">Reference</a>

## Samples generation

We will now generate samples from a linear model

$$
\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
$$

where $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$ and some of the coefficients are exactly zero ($\beta_i=0$), so that we can test our sparsity-inducing Bayesian model.

In [21]:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(1234)
n_samples = 10

# X shaped like (n_samples,n_features)
X = rng.random(size=(n_samples,5))*10

# beta shaped like (n_features)
beta = np.array([
    1.,
    -2.,
    0.,
    0.5,
    0.
    ])

sigma = 1

# y shaped like (n_samples)
y = X @ beta + rng.randn((n_samples))*sigma

# Gibbs Sampling the Posterior distribution for a linear model with Spike and Slab prior

Usual likelihood model for linear regression:

$$
P(y|\beta,x) \sim \mathcal{N}(\langle \beta , x \rangle,\sigma^2)
$$


## Spike and Slab prior

We want to assign a nonzero probability to the coefficient $\beta$ being exactly zero: that would be the case if its prior distribution were a delta function $\delta (\beta)$. So we construct a prior that is a "mixture" of such a delta function and some other prior distribution that allows the parameter to span a whole range of values:

<img src="./img/spikeslab.png" width=300>

The way this is usually implemented is by building a hierarchical model in which we "modulate" the prior based on an additional random variable $Z \sim Ber(\theta)$, that is $Z=1$ with probability $\theta$ and $Z=0$ with probability $1-\theta$. This additional variable regulates the prior distribution for $\beta$:

$$
P(Z_i) \sim Ber(\theta_i) \\
P(\beta_i|Z_i=0) \sim \delta(\beta_i) \\
P(\beta_i|Z_i=1) \sim P_{slab}(\beta_i)
$$

or more conveniently

$$
P(\beta_i|Z_i) = (1-Z_i)\delta(\beta_i) + Z_i P_{slab}(\beta_i)
$$

We subscript the variables with $i$ because they refer to the individual coefficients of the linear regression model. Since $Z_i=0$ forces $\beta_i$ to be exactly zero, we can write the likelihood as:

$$
P(y | \boldsymbol{z} , \boldsymbol{x}, \boldsymbol{\beta}) \sim \mathcal{N} (\langle \boldsymbol{z} \circ \boldsymbol{\beta},\boldsymbol{x} \rangle,\sigma^2)
$$

where $\boldsymbol{z}=(z_0,z_1,\dots)$ is the vector of Bernoulli indicators and $\circ$ denotes the element-wise product. We also take the slab distribution to be a Gaussian centered at the origin, much like the prior underlying ridge regression:

$$
P_{slab} = \mathcal{N} (0,\sigma^2 \tau^2)
$$

The factor $\sigma^2$ is included so that the prior on $\beta_i$ scales with the noise level of the outcome.
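
To make the prior concrete, here is a minimal sketch that draws coefficients from a spike-and-slab prior with a Gaussian slab and evaluates the masked Gaussian log-likelihood. The values of `theta`, `sigma2` and `tau2` are fixed by hand purely for illustration (in the model they have their own priors), and the function names and `_demo` variables are hypothetical, not part of the original notebook.

```python
import numpy as np

def sample_spike_slab_prior(n_features, theta, sigma2, tau2, rng):
    """Draw (z, beta) from the spike-and-slab prior with a Gaussian slab."""
    # z_i ~ Ber(theta): 1 selects the slab, 0 selects the spike at zero
    z = rng.binomial(1, theta, size=n_features)
    # Slab draws from N(0, sigma2 * tau2); multiplying by z zeroes the spike entries
    slab = rng.normal(0.0, np.sqrt(sigma2 * tau2), size=n_features)
    return z, z * slab

def log_likelihood(y, X, beta, z, sigma2):
    """Gaussian log-likelihood with mean X @ (z * beta) and variance sigma2."""
    resid = y - X @ (z * beta)
    n = y.shape[0]
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

# Illustrative values only (assumptions, not inferred quantities)
rng_demo = np.random.RandomState(0)
z_demo, beta_demo = sample_spike_slab_prior(
    5, theta=0.5, sigma2=1.0, tau2=4.0, rng=rng_demo
)
```

Every entry of `beta_demo` with `z_demo == 0` is exactly zero, which is the sparsity the prior is meant to encode.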

We assume the following conjugate priors for the parameters:

$$
\theta \sim Beta(a,b) \\
\sigma^2 \sim InverseGamma(\alpha_1,\alpha_2) \\
\tau^2 \sim InverseGamma(1/2,s^2/2)
$$

It is useful to recap the relations between the random variables in a DAG, in order to properly express the conditional probabilities in our hierarchical model:

<img src="./img/DAG.png" width=400>

This helps us factorize the joint probability distribution via $d$-separation. We do not subscript $\sigma^2$ and $\tau^2$ because we assume the same slab prior for every coefficient and the same noise variance for every observation $y$:

$$
P(y,\beta_i,z_i,\theta_i,\tau^2,\sigma^2) = P(y|\beta_i,\sigma^2)P(\sigma^2)P(\beta_i|z_i,\tau^2)P(z_i|\theta_i)P(\theta_i)P(\tau^2)
$$

As for the hyperparameters in our priors (the gray bubbles in the graph), we fix $a=b=1$ for the Beta distribution, $\alpha_1=\alpha_2=0.01$ for the $\sigma^2$ distribution and $s=1/2$ for the $\tau^2$ distribution.

The full conditional posterior of each parameter can again be simplified with the aid of the DAG shown above:

$$
P(\theta_i | y,\beta_i,z_i,\tau^2,\sigma^2) = P(\theta_i|z_i) \\
P(\tau^2 | y,\beta_i,z_i,\theta_i,\sigma^2) = P(\tau^2 | \beta_i,z_i) \\
P(\sigma^2 | y,\beta_i,z_i,\theta_i,\tau^2) = P(\sigma^2 | y,\beta_i) \\
P(z_i | y,\beta_i,\theta_i,\tau^2,\sigma^2) = P(z_i | \beta_i,\theta_i,\tau^2) \\
P(\beta_i|y,\theta_i,z_i,\tau^2,\sigma^2) = P(\beta_i | y, z_i,\tau^2,\sigma^2)
$$

These conditional probabilities are needed to perform Gibbs sampling of the posterior distribution for all of these parameters. Let's derive their expressions.

### $P(\theta_i|z_i)$

Using Bayes' theorem:

$$
P(\theta_i|z_i) = \frac{P(z_i|\theta_i)P(\theta_i)}{\int P(z_i|\theta_i)P(\theta_i) d\theta_i}
$$

Now recall that $z_i \sim Ber(\theta_i)$, that is $P(z_i|\theta_i) = \theta_i^{z_i}(1-\theta_i)^{1-z_i}$, and that we chose the Beta distribution with parameters $a,b$ as the prior for $\theta_i$:

$$
P(\theta_i|z_i) = \frac{\theta_i^{z_i}(1-\theta_i)^{1-z_i}\frac{1}{B(a,b)}\theta_i^{a-1}(1-\theta_i)^{b-1} }{ \int \theta_i^{z_i}(1-\theta_i)^{1-z_i}\frac{1}{B(a,b)}\theta_i^{a-1}(1-\theta_i)^{b-1} d\theta_i}
$$

The constant factors $1/B(a,b)$ cancel between numerator and denominator, leaving a normalized Beta density:

$$
P(\theta_i|z_i) = \frac{\theta_i^{(a+z_i)-1}(1-\theta_i)^{(b+1 - z_i)-1}}{\int \theta_i^{(a+z_i)-1}(1-\theta_i)^{(b+1 - z_i)-1}d\theta_i}
$$

so

$$
P(\theta_i|z_i) \sim Beta(a+z_i,b+1-z_i)
$$
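
As a minimal sketch of this Gibbs step, the update can be written with NumPy's Beta sampler. The function name `sample_theta` and the demo indicator vector are hypothetical; the defaults $a=b=1$ follow the hyperparameters fixed above.

```python
import numpy as np

def sample_theta(z, a=1.0, b=1.0, rng=None):
    """Draw theta_i ~ Beta(a + z_i, b + 1 - z_i) for each indicator z_i."""
    rng = np.random.RandomState() if rng is None else rng
    z = np.asarray(z)
    # rng.beta broadcasts over array-valued parameters, one draw per z_i
    return rng.beta(a + z, b + 1 - z)

# Illustrative indicator vector (an assumption, not sampled from the model)
theta_demo = sample_theta([1, 0, 1, 1, 0], rng=np.random.RandomState(0))
```

With $a=b=1$, an indicator $z_i=1$ gives a $Beta(2,1)$ conditional (mean $2/3$) and $z_i=0$ gives $Beta(1,2)$ (mean $1/3$), so a single indicator only mildly shifts $\theta_i$ away from the uniform prior.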

### $P(\tau^2|\beta_i,z_i)$

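This conditional depends on $z_i$ because the prior $P(\beta_i|\tau^2,z_i)$ depends on both $\tau^2$ and $z_i$. It also depends on $\sigma^2$ through the slab variance $\sigma^2\tau^2$; we leave that dependence implicit in the notation.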

$$
P(\tau^2|\beta_i,z_i)= \frac{P(\beta_i|\tau^2,z_i)P(\tau^2)P(z_i)}{\int P(\beta_i|\tau^2,z_i)P(\tau^2)P(z_i) d\tau^2} = \frac{P(\beta_i|\tau^2,z_i)P(\tau^2)}{\int P(\beta_i|\tau^2,z_i)P(\tau^2) d\tau^2}
$$

Here and below we denote the normalization constant by $Z$ (not to be confused with the indicator variable introduced earlier). The prior on $\beta_i$ depends on $z_i$, so we treat the two cases separately.

### $z_i = 1$

$$
P(\tau^2 | \beta_i,z_i=1) = \frac{1}{Z}P(\beta_i|\tau^2,z_i=1)P(\tau^2) = \frac{1}{Z}(2\pi\sigma^2\tau^2)^{-1/2}\exp \left [ -\frac{\beta_i^2}{2\sigma^2\tau^2} \right ] \frac{(s^2/2)^{1/2}}{\Gamma (1/2)}(\tau^2)^{-1/2 - 1}\exp \left [ -\frac{s^2/2}{\tau^2} \right ]
$$

Absorbing every term that does not depend on $\tau^2$ into the normalization constant, this is again an Inverse Gamma distribution. This is expected, since the Inverse Gamma is the conjugate prior for the variance of a Gaussian with known mean:

$$
P(\tau^2|\beta_i, z_i=1) \sim InverseGamma(\frac{1}{2} + \frac{1}{2},\frac{s^2}{2}+\frac{\beta_i^2}{2\sigma^2})
$$

### $z_i = 0$

In this case, $\beta_i=0$ always and we simply sample from the prior:

$$
P(\tau^2|\beta_i, z_i=0) \sim InverseGamma(\frac{1}{2},\frac{s^2}{2})
$$


Both cases can be summarized in a single expression:

$$
P(\tau^2|\beta_i,z_i) \sim InverseGamma(\frac{1}{2} + z_i \frac{1}{2},\frac{s^2}{2}+z_i\frac{\beta_i^2}{2\sigma^2})
$$
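
The summarized conditional can be sampled with NumPy alone, using the fact that the reciprocal of a Gamma variate is Inverse Gamma distributed. This is a sketch under the per-coefficient formula above; the function name `sample_tau2` is hypothetical, and the default $s=1/2$ follows the hyperparameter fixed earlier.

```python
import numpy as np

def sample_tau2(beta_i, z_i, sigma2, s=0.5, rng=None):
    """Draw tau^2 ~ InverseGamma(1/2 + z_i/2, s^2/2 + z_i * beta_i^2 / (2 * sigma2))."""
    rng = np.random.RandomState() if rng is None else rng
    shape = 0.5 + 0.5 * z_i
    scale = 0.5 * s**2 + z_i * beta_i**2 / (2.0 * sigma2)
    # If G ~ Gamma(shape, rate=scale), then 1/G ~ InverseGamma(shape, scale).
    # NumPy parametrizes the Gamma by its scale, which is 1/rate.
    return 1.0 / rng.gamma(shape, 1.0 / scale)

# z_i = 0 falls back to the prior InverseGamma(1/2, s^2/2)
tau2_demo = sample_tau2(beta_i=0.0, z_i=0, sigma2=1.0, rng=np.random.RandomState(0))
```

Multiplying the data-dependent terms by `z_i` means the same function covers both cases: when the coefficient sits in the spike, $\beta_i$ carries no information about $\tau^2$ and the draw comes from the prior.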


### $P(\sigma^2|y,\beta_i)$

We have

$$
P(\sigma^2|y,\beta_i) = \frac{1}{Z}P(y|\sigma^2,\beta_i)P(\sigma^2)
$$


