# varbvs: A Software Toolkit for Fast Variable Selection in Genome-wide Association Studies and Other Large-scale Regression Applications
Peter Carbonetto, Xiang Zhou, Matthew Stephens

Variational Bayesian variable selection (varbvs):
1. Is a software toolkit for analyzing large-scale data sets using Bayesian variable selection methods.
2. Builds on Bayesian models for variable selection in regression and on variational approximation techniques.

## Introduction
Bayesian variable selection (BVS) applications:
- mapping of complex disease and trait loci
- enrichment analysis
- estimation of the proportion of variance in phenotypes explained by available genotypes
- fine-mapping

BVS is not widely used for GWAS. Reasons include:
- the methods are difficult to use
- appropriate priors are hard to specify
- posterior probabilities are hard to compute efficiently

Aims in developing varbvs software:
- make BVS methods accessible to practitioners
- provide an alternative to commonly used toolkits for penalized sparse regression

Advantages of BVS (over penalized sparse regression):
- computes the probability that each variable is included in the regression model, the posterior inclusion probability (PIP)
- needs neither cross-validation nor false positive rate estimates to determine significance
- allows model comparison through approximate computation of Bayes factors

Limitations of BVS:
- computing exact posterior probabilities is intractable except in very small data sets
- the choice of priors requires considerable expertise in Bayesian data analysis

varbvs addresses these two limitations by:
1. implementing fast posterior computation using variational approximation techniques
2. providing suitable default priors while also allowing extensive prior customization

## Example illustrating features of varbvs
The example introduces an exchangeable prior (a spike-and-slab prior) on the coefficients.
- The complexity of the model is controlled by the prior, which is determined by two parameters:
    - $\pi$: the prior probability that a variable is included in the regression model, specified through the prior log-odds $\log_{10}(\frac{\pi}{1-\pi})$ (`logodds`)
    - $\sigma_a^2$: the prior variance of the regression coefficients (`sa`)


- The example computes results for different settings of `logodds` and keeps `sa` constant for simplicity.
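
To make the base-10 log-odds scale concrete, here is a minimal Python sketch that converts between `logodds` and the prior inclusion probability $\pi$. The function names and the candidate values are illustrative assumptions; they are not part of varbvs.

```python
import math

def logodds10_to_prob(logodds):
    """Convert base-10 prior log-odds to the prior inclusion probability pi."""
    odds = 10.0 ** logodds
    return odds / (1.0 + odds)

def prob_to_logodds10(pi):
    """Convert a prior inclusion probability pi to base-10 log-odds."""
    return math.log10(pi / (1.0 - pi))

# Hypothetical candidate settings of the prior log-odds (sa would be held fixed).
for logodds in [-3.0, -2.0, -1.0]:
    print(logodds, round(logodds10_to_prob(logodds), 5))
```

A log-odds of 0 corresponds to $\pi = 0.5$, and each decrease of one unit makes inclusion ten times less likely in odds terms.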

## Bayesian variable selection, and the varbvs R interface
### Regression model
When Y is continuous: assume a basic linear regression by setting `family = "gaussian"`

$Y=\sum^m_{i=1}Z_i\mu_i + \sum_{i=1}^pX_i\beta_i+\epsilon$, where the $Z_i$ are covariates, the $X_i$ are the candidate variables, and $\epsilon \sim N(0,\sigma^2)$

When Y is binary: assume an additive model for the log-odds of $Y = 1$ by setting `family = "binomial"`

$\log{\frac{\Pr(Y=1)}{\Pr(Y=0)}} = \sum_{i=1}^mZ_i\mu_i + \sum_{i=1}^p X_i\beta_i$
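
The two model formulas share the same additive predictor and differ only in how it maps to Y. The following Python sketch shows this under stated assumptions: the function names are hypothetical, and the inputs are plain lists rather than the matrices varbvs takes.

```python
import math

def linear_predictor(z, mu, x, beta):
    """Additive predictor: sum_i Z_i * mu_i + sum_i X_i * beta_i."""
    covariate_part = sum(zi * mi for zi, mi in zip(z, mu))
    variable_part = sum(xi * bi for xi, bi in zip(x, beta))
    return covariate_part + variable_part

def prob_y1(z, mu, x, beta):
    """Binary outcome: invert the log-odds to get Pr(Y = 1)."""
    eta = linear_predictor(z, mu, x, beta)
    return 1.0 / (1.0 + math.exp(-eta))

# Continuous outcome: the predictor is the mean of Y (noise epsilon is added on top).
print(linear_predictor([1.0], [2.0], [1.0, 0.0], [3.0, 5.0]))
# Binary outcome: the same predictor is interpreted as a log-odds.
print(prob_y1([1.0], [0.0], [1.0, 0.0], [0.0, 0.0]))
```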

### Variable selection prior
- Adopt a "spike-and-slab" prior to frame the variable selection problem
    - with probability $\pi$, coefficient $\beta_i$ is drawn from the "slab"
    - slab distribution: normal density with zero mean and variance $\sigma^2\sigma_a^2$
    - with probability $1 - \pi$, $\beta_i$ equals zero (the "spike")
    - a small value of $\pi$ encourages "sparse" regression models, in which only a small proportion of the variables in X help predict Y
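
Drawing coefficients from the prior shows how $\pi$ controls sparsity. This is a minimal Python sketch; the function name, the seed, and the parameter values are assumptions chosen for illustration.

```python
import random

def sample_spike_and_slab(p, pi, sigma2, sa2, rng):
    """Draw p coefficients: slab N(0, sigma2 * sa2) with probability pi, else exactly 0."""
    slab_sd = (sigma2 * sa2) ** 0.5
    return [rng.gauss(0.0, slab_sd) if rng.random() < pi else 0.0
            for _ in range(p)]

rng = random.Random(1)  # fixed seed so the draw is reproducible
beta = sample_spike_and_slab(p=1000, pi=0.01, sigma2=1.0, sa2=1.0, rng=rng)
nonzero = sum(1 for b in beta if b != 0.0)
print("nonzero coefficients:", nonzero, "of", len(beta))
```

With a small $\pi$, most draws land on the spike, so only a handful of the coefficients are nonzero.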

### Hyperparameters in varbvs argument settings
Hyperparameters:
- `sa`: the prior variance of the regression coefficients
- `sigma`: the residual variance for linear regression
- `logodds`: the prior log-odds of inclusion, $\log_{10}\{\frac{\pi}{1-\pi}\}$

varbvs offers several ways to specify the hyperparameters:
- non-exchangeable prior $\boldsymbol{\pi}$: set the input `logodds` to a matrix in which the rows correspond to variables and the columns correspond to hyperparameter settings
- fit one or more of the hyperparameters to the data: set `update.sigma = TRUE` and/or `update.sa = TRUE` in the varbvs arguments
- if `sigma` and/or `sa` are not provided, varbvs will use the default
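
The matrix layout for a non-exchangeable prior can be sketched as follows. This Python example only illustrates the shape (rows are variables, columns are hyperparameter settings); the helper name, the idea of boosting a subset of variables, and all values are hypothetical.

```python
def build_logodds_matrix(p, enriched, base_settings, boost):
    """Return a p x len(base_settings) matrix (list of rows) of base-10 prior log-odds.

    Variables listed in `enriched` get `boost` added to every setting,
    so they are more likely a priori to be included.
    """
    enriched = set(enriched)
    return [[s + boost if i in enriched else s for s in base_settings]
            for i in range(p)]

# Four variables, two candidate settings; variable 1 is favored a priori.
logodds = build_logodds_matrix(p=4, enriched=[1], base_settings=[-3.0, -2.0], boost=1.0)
for row in logodds:
    print(row)
```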

### Fast posterior computation
Computing posterior probabilities is often an intractable, high-dimensional integration problem
- solution in varbvs: introduce a class of approximating distributions, then optimize a criterion (the Kullback-Leibler divergence) to find the distribution within this class that best matches the posterior.

The algorithm for fitting the variational approximation consists of an inner loop and an outer loop
- outer loop: iterates over the hyperparameter settings
- inner loop: cycles through co-ordinate ascent updates to tighten the lower bound on the marginal likelihood given a setting of the hyperparameters
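
The loop structure can be sketched in Python for the linear regression case. This is not the package's implementation. It assumes a fully factorized approximation in which each coefficient has an inclusion probability `alpha[j]` and a conditional mean `mu[j]`, it keeps `sigma` and `sa` fixed, and it runs a fixed number of sweeps instead of monitoring the lower bound. All function names are hypothetical.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fit_inner_loop(X, y, sigma, sa, logodds10, sweeps=50):
    """Inner loop: coordinate ascent updates for one hyperparameter setting."""
    n, p = len(X), len(X[0])
    cols = [[X[r][j] for r in range(n)] for j in range(p)]
    d = [sum(v * v for v in c) for c in cols]               # x_j' x_j
    xy = [sum(v * t for v, t in zip(c, y)) for c in cols]   # x_j' y
    logodds = logodds10 * math.log(10.0)                    # base 10 -> natural log
    s = [sa * sigma / (sa * dj + 1.0) for dj in d]          # conditional variances
    alpha, mu = [0.5] * p, [0.0] * p
    Xr = [0.0] * n                                          # X times (alpha * mu)
    for _ in range(sweeps):
        for j in range(p):
            r = alpha[j] * mu[j]
            xXr = sum(v * t for v, t in zip(cols[j], Xr))
            mu[j] = s[j] / sigma * (xy[j] + d[j] * r - xXr)
            alpha[j] = sigmoid(logodds + (math.log(s[j] / (sa * sigma))
                                          + mu[j] ** 2 / s[j]) / 2.0)
            delta = alpha[j] * mu[j] - r
            Xr = [t + delta * v for t, v in zip(Xr, cols[j])]
    return alpha, mu

def fit_outer_loop(X, y, sigma, sa, logodds10_grid):
    """Outer loop: repeat the inner loop for each hyperparameter setting."""
    return [fit_inner_loop(X, y, sigma, sa, lo) for lo in logodds10_grid]

# Toy data: three orthogonal columns, y depends only on the first one.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.0, 3.0, -3.0, -3.0]
grid = [-2.0, -1.0, 0.0]
for lo, (alpha, mu) in zip(grid, fit_outer_loop(X, y, 1.0, 1.0, grid)):
    print(lo, [round(a, 3) for a in alpha])
```

In this toy example the first variable receives an inclusion probability near 1 for every setting, while the other two stay low and rise as the prior log-odds increases.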

### Averaging over the hyperparameter settings
Because several hyperparameter settings can be supplied, varbvs also implements a Bayesian model averaging strategy: results are averaged over the hyperparameter settings, with each setting weighted by its posterior probability.
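
The averaging step can be sketched in Python as follows. The sketch assumes that each setting comes with an unnormalized log-weight (for example, an approximate log marginal likelihood under a uniform prior over settings); the function names and the numbers are illustrative only.

```python
import math

def normalize_log_weights(logw):
    """Turn unnormalized log-weights into weights that sum to 1 (log-sum-exp trick)."""
    c = max(logw)
    w = [math.exp(v - c) for v in logw]
    total = sum(w)
    return [v / total for v in w]

def averaged_pip(logw, alpha_by_setting):
    """Average inclusion probabilities over settings.

    alpha_by_setting[k][i] is the inclusion probability of variable i under setting k.
    """
    w = normalize_log_weights(logw)
    p = len(alpha_by_setting[0])
    return [sum(w[k] * alpha_by_setting[k][i] for k in range(len(w)))
            for i in range(p)]

# Two hypothetical settings with equal support, two variables.
print(averaged_pip([-10.0, -10.0], [[1.0, 0.2], [0.8, 0.0]]))
```

Subtracting the maximum log-weight before exponentiating avoids underflow when the log-weights are large and negative.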

### varbvs function
```
varbvs(X, Z, y, family,                              # Data.
sigma, sa, logodds,                                  # Hyperparameters.
alpha, mu, eta,                                      # Variational parameters.
update.sigma, update.sa, optimize.eta,               # Optimization and model
initialize.params, nr, sa0, n0, tol, maxiter,        # fitting settings.
verbose)                                             # Other settings.
```

## Example: mapping Crohn’s disease risk loci
Large data set:
- 4,686 samples: 1,748 Crohn’s disease cases and 2,938 controls
- 442,001 SNPs
- binary outcome

varbvs results:
- the fitted regression model is very sparse: only 8 out of the 442,001 candidate variables are included in the model with probability 0.5 or greater
- Top 9 variables by inclusion probability:

```
   index   variable  prob PVE  coef*  Pr(coef.>0.95)
1  71850 rs10210302 1.000  NA -0.313 [-0.397,-0.236]
2  10067 rs11805303 1.000  NA  0.291 [+0.207,+0.377]
3 140044 rs17234657 1.000  NA  0.370 [+0.255,+0.484]
4 381590 rs17221417 1.000  NA  0.279 [+0.192,+0.371]
5 402183  rs2542151 0.992  NA  0.290 [+0.186,+0.392]
6 271787 rs10995271 0.987  NA  0.236 [+0.151,+0.323]
7 278438  rs7095491 0.969  NA  0.222 [+0.141,+0.303]
8 168677  rs9469220 0.586  NA -0.194 [-0.269,-0.118]
9  22989 rs12035082 0.485  NA  0.195 [+0.111,+0.277]
```
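
The sparsity statement can be checked directly against the listed probabilities. This short Python sketch copies the `prob` column from the table and counts the variables at or above the 0.5 threshold; the variable names in the code are only for illustration.

```python
# Inclusion probabilities copied from the table of top variables.
pips = {
    "rs10210302": 1.000, "rs11805303": 1.000, "rs17234657": 1.000,
    "rs17221417": 1.000, "rs2542151": 0.992, "rs10995271": 0.987,
    "rs7095491": 0.969, "rs9469220": 0.586, "rs12035082": 0.485,
}

# Variables included in the model with probability 0.5 or greater.
selected = [snp for snp, prob in pips.items() if prob >= 0.5]
print(len(selected), "SNPs with PIP >= 0.5")
```

The ninth SNP, rs12035082, falls just below the threshold, which is why the count is 8 even though 9 variables are listed.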

## Summary and discussion
Benefits of BVS
- accounts for uncertainty in the hyperparameters
- fast computation when the coefficients are assumed to be conditionally independent

Disadvantages
- requires careful selection of an additional set of priors for the hyperparameters
- does not offer complete flexibility: other types of prior, such as the g-prior, are not included
- when variables are completely correlated, each is expected to have an equal PIP, but varbvs may get this wrong by assigning a PIP of 1 to one variable and 0 to the others.
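
The last limitation can be reproduced on a toy problem. The Python sketch below is not the package's code. It assumes a fully factorized approximation for linear regression with fixed hyperparameters, updated one coordinate at a time, and it feeds in two identical columns. The function names and data are hypothetical.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def coordinate_ascent_pips(cols, y, sigma, sa, logodds10, sweeps=100):
    """Return approximate PIPs for the columns in `cols` (a list of column vectors)."""
    n, p = len(y), len(cols)
    d = [sum(v * v for v in c) for c in cols]
    xy = [sum(v * t for v, t in zip(c, y)) for c in cols]
    logodds = logodds10 * math.log(10.0)  # base 10 -> natural log
    s = [sa * sigma / (sa * dj + 1.0) for dj in d]
    alpha, mu, Xr = [0.5] * p, [0.0] * p, [0.0] * n
    for _ in range(sweeps):
        for j in range(p):
            r = alpha[j] * mu[j]
            xXr = sum(v * t for v, t in zip(cols[j], Xr))
            mu[j] = s[j] / sigma * (xy[j] + d[j] * r - xXr)
            alpha[j] = sigmoid(logodds + (math.log(s[j] / (sa * sigma))
                                          + mu[j] ** 2 / s[j]) / 2.0)
            delta = alpha[j] * mu[j] - r
            Xr = [t + delta * v for t, v in zip(Xr, cols[j])]
    return alpha

# Two perfectly correlated (identical) variables; y depends on their shared signal.
x = [1.0, 1.0, -1.0, -1.0]
y = [3.0, 3.0, -3.0, -3.0]
pips = coordinate_ascent_pips([x, list(x)], y, sigma=1.0, sa=1.0, logodds10=-1.0)
print([round(a, 3) for a in pips])
```

Although the two columns are interchangeable, the variable updated first absorbs the signal and ends with a PIP near 1, while its duplicate is left with a small PIP.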