# Logistic regression with a spike-and-slab prior using PyMC3

Previously we tried the BoomSpikeSlab and LogitBVs R packages to fit a spike-and-slab model for logistic regression, but neither worked the way we wanted (FIXME: add link). Now I'm looking at some customized options. According to [this post](https://www.kaggle.com/melondonkey/bayesian-spike-and-slab-in-pymc3), `stan` cannot handle the spike-and-slab model because the prior involves a discrete inclusion indicator, and `stan` does not sample discrete parameters. The post implements a `PyMC3`-based sampler that looks neat enough, so I'm trying to use it for our problem here.

Indeed, a discrete prior might not be optimal for `PyMC3` either, as pointed out [in this notebook](https://www.kaggle.com/derekpowll/bayesian-lr-w-cauchy-prior-in-pymc3). I think we can also try a mixture of a spiky normal and a slab normal; at least every parameter will be continuous there.
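
To make the continuous alternative concrete, here is a minimal sketch of the log density of a two-component normal mixture: a narrow "spike" centered at zero plus a wider "slab". The spike standard deviation `spike_sd=0.01` and the function names are assumptions chosen for illustration only; the slab weight, mean, and standard deviation reuse the values that appear in the model specification in this notebook.

```python
import math

def normal_logpdf(x, mu, sd):
    """Log density of N(mu, sd^2) evaluated at x."""
    z = (x - mu) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def log_mixture_density(b, w_slab=0.043, spike_sd=0.01, slab_mu=0.77, slab_sd=0.84):
    """Log density of (1 - w_slab) * N(0, spike_sd^2) + w_slab * N(slab_mu, slab_sd^2).

    spike_sd is a hypothetical value; it controls how close to zero the
    "excluded" coefficients are forced to be.
    """
    log_spike = math.log1p(-w_slab) + normal_logpdf(b, 0.0, spike_sd)
    log_slab = math.log(w_slab) + normal_logpdf(b, slab_mu, slab_sd)
    # log-sum-exp keeps the result finite when one component underflows to zero
    m = max(log_spike, log_slab)
    return m + math.log(math.exp(log_spike - m) + math.exp(log_slab - m))

# Near zero the spike dominates; away from zero only the slab contributes.
for b in (0.0, 0.05, 0.77):
    print(b, log_mixture_density(b))
```

Because this density is smooth in `b`, a gradient-based sampler could in principle handle every coefficient without a separate discrete step, at the cost of no coefficient being exactly zero.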

## Software required

```
pip install pymc3 -U
conda install mkl-service
```

## Data

In [1]:
X_file = 'deletion.genes.block30.for_simu.sample.genes.block_79_137.gz'
y_file = 'deletion.genes.block30.for_simu.sample.y'

In [5]:
import numpy as np
y = np.loadtxt(y_file, dtype=int)
y.shape

(13412,)

In [6]:
X = np.loadtxt(X_file, dtype=float)
X.shape

(13412, 59)

The data has about 13K samples and 59 features, for the CNV problem.

## Model specification

This section specifies **how the data is generated**, that is, the spike-and-slab prior for the logistic model: $$b\sim (1-\pi_0) \delta_0 + \pi_0 N(\mu, \sigma^2)$$ where $\pi_0$ is the prior inclusion probability (the `pi0` argument in the code below) and, from the `varbvs` analysis, $\pi_0 = 0.043, \mu = 0.77, \sigma = 0.84$.
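
As a quick illustration of what this prior generates, the sketch below draws one coefficient vector from it, following the parameterization in `get_model` where `pi0` is the probability that a variable is included. The helper name `draw_from_prior` and the seed are hypothetical; only the three hyperparameter values come from this notebook.

```python
import numpy as np

def draw_from_prior(n_features, pi0=0.043, mu=0.77, sigma=0.84, seed=0):
    """Draw one coefficient vector from the spike-and-slab prior."""
    rng = np.random.default_rng(seed)
    xi = rng.binomial(1, pi0, size=n_features)     # inclusion indicators, P(xi = 1) = pi0
    beta = rng.normal(mu, sigma, size=n_features)  # slab draws for every variable
    return xi * beta                               # exact zeros wherever xi == 0

b = draw_from_prior(59)
print(np.count_nonzero(b), "non-zero coefficients out of", b.size)
```

Repeating the draw with different seeds gives a feel for how sparse the prior is before any data is seen.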

For centered data the intercept can be interpreted as the baseline log odds, so I'm giving it a normal prior with mean 0 and standard deviation 1.5. This roughly covers baseline odds down to 0.05 (log odds about -3, two standard deviations below the mean), which is reasonable for a not-so-rare disease.
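
The arithmetic behind this choice can be checked directly. This is only a worked example of the numbers quoted above; the variable names are illustrative.

```python
import math

baseline_odds = 0.05
log_odds = math.log(baseline_odds)              # about -3.0
baseline_prob = baseline_odds / (1 + baseline_odds)  # about 0.048

prior_mean, prior_sd = 0.0, 1.5
z = (log_odds - prior_mean) / prior_sd          # about -2.0 standard deviations

# Prior mass within |z| standard deviations of the mean (two-sided), roughly 0.95
coverage = math.erf(abs(z) / math.sqrt(2))

print(log_odds, baseline_prob, z, coverage)
```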

**Question: how should we handle the intercept? How did `varbvs` handle the intercept?**

In [7]:
import pymc3 as pm
import numpy as np
import theano.tensor as tt
from scipy.special import expit

def get_model(y, X, pi0=0.043, mu=0.77, sigma=0.84, mu_intercept=0, sigma_intercept=1.5):
    invlogit = lambda x: 1/(1 + tt.exp(-x))
    model = pm.Model()
    with model:
        xi = pm.Bernoulli('xi', pi0, shape=X.shape[1])  # inclusion indicator for each variable, P(xi = 1) = pi0
        alpha = pm.Normal('alpha', mu=mu_intercept, sd=sigma_intercept)  # intercept
        beta = pm.Normal('beta', mu=mu, sd=sigma, shape=X.shape[1])  # slab prior for the non-zero coefficients
        eta = pm.math.dot(X, xi * beta)  # linear predictor; xi zeroes out the excluded variables
        y_obs = pm.Bernoulli('y_obs', invlogit(eta + alpha), observed=y)  # data likelihood
    return model

In [8]:
model = get_model(y,X)

## Sampling

I still need to read more of the `PyMC3` documentation to do proper sampling diagnostics, but in general it helps to run multiple chains.
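
One reason multiple chains help is that they allow a between-chain versus within-chain variance comparison. The sketch below implements the classic (non-split) Gelman-Rubin statistic for a single scalar parameter; it is the textbook formula, written here as an assumption-laden illustration, and is not necessarily the exact variant that `PyMC3` reports. The function name `gelman_rubin` is hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for one scalar parameter.

    chains: array-like of shape (n_chains, n_draws), one row per chain.
    Values close to 1 suggest the chains agree with each other.
    """
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    between = n_draws * chain_means.var(ddof=1)       # B: between-chain variance
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))            # four chains exploring the same distribution
stuck = mixed + np.arange(4)[:, None]         # four chains centered at different values
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

For a trace from this notebook, one would reshape the draws of a single parameter, for example one column of `trace['beta']`, into one row per chain before calling the function.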

In [9]:
# 10 chains with 2000 posterior draws each (20000 in total), requesting 20 cores on my 40-core machine.
# Takes about 50 min on my desktop.
with model:
    trace = pm.sample(2000, random_seed = 999, cores = 20, progressbar = True, chains = 10)

Multiprocess sampling (10 chains in 20 jobs)
CompoundStep
>BinaryGibbsMetropolis: [xi]
>NUTS: [beta, alpha]
Sampling 10 chains: 100%|██████████| 25000/25000 [43:20<00:00,  6.37draws/s] 
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 4 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 4 divergences after tuning. Increase `target_accept` or reparameterize.


## Results

This summarizes the posterior samples into the posterior inclusion probability (PIP), $\tilde{b}$, and $\tilde{\mu}$ (the posterior mean of $b$ given inclusion, $\xi=1$).

In [11]:
import pandas as pd
xi, beta = trace['xi'], trace['beta']  # each has shape (draws, variables)
results = pd.DataFrame({'inclusion_probability': xi.mean(axis=0),
                       'beta': beta.mean(axis=0),
                       'beta_given_inclusion': (xi * beta).sum(axis=0) / xi.sum(axis=0)
                       })
results.sort_values('inclusion_probability', ascending = False)

| variable | inclusion_probability | beta | beta_given_inclusion |
|---|---|---|---|
| 42 | 0.0315 | 0.757005 | 0.452559 |
| 57 | 0.0291 | 0.755969 | 0.331873 |
| 58 | 0.02855 | 0.764855 | 0.325449 |
| 32 | 0.0283 | 0.757791 | 0.38902 |
| 43 | 0.0282 | 0.759151 | 0.502015 |
| 23 | 0.0281 | 0.757029 | 0.399929 |
| 30 | 0.02805 | 0.751557 | 0.407373 |
| 27 | 0.02655 | 0.767675 | 0.411278 |
| 25 | 0.02625 | 0.758376 | 0.361729 |
| 29 | 0.02535 | 0.764275 | 0.390959 |


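But the true causal variables are 14 and 31, and neither appears among the top PIPs. Every PIP is also below the prior inclusion probability of 0.043. Apparently this needs more work.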

## Some sanity checks

1. Does the posterior predictive mean $\tilde{y}$ roughly equal the data mean?
2. What is the posterior expected number of non-zero variables?

In [13]:
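estimate = trace['beta'] * trace['xi']  # per-draw effects, zero where a variable is excluded
# Average the fitted probabilities over posterior draws (axis 1) for each sample
y_hat = expit(trace['alpha'] + np.dot(X, estimate.T)).mean(axis=1)
# Mean fitted probability, and the posterior expected number of included variables (sum of PIPs)
print(np.mean(y_hat), np.sum(results.inclusion_probability))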

0.5000008683342422 1.13855


So the posterior expects about 1 variable to be involved, whereas the prior expects $0.043 \times 59 \approx 2.5$. **Need to check this with simulated truth; also should run `varbvs` on this and compare**.
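
The comparison between prior and posterior counts can be made slightly more explicit. Under the prior, the number of included variables is binomial with 59 trials (the number of columns of `X`) and success probability 0.043. The sketch below is a rough comparison in units of the prior standard deviation, not a formal test, and the variable names are illustrative.

```python
import math

n_features = 59        # columns of X, from X.shape
pi0 = 0.043            # prior inclusion probability used in get_model
posterior_count = 1.13855   # sum of PIPs printed by the sanity check

prior_mean = n_features * pi0                        # about 2.54
prior_sd = math.sqrt(n_features * pi0 * (1 - pi0))   # about 1.56

# How far the posterior expected count sits below the prior expectation,
# measured in prior standard deviations (about 0.9)
shortfall_in_sd = (prior_mean - posterior_count) / prior_sd

print(prior_mean, prior_sd, shortfall_in_sd)
```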