# Transcriptome-Wide Association Study (TWAS) using High-dimensional Regression Methods

This notebook demonstrates how to perform Transcriptome-Wide Association Studies (TWAS) using weights derived from various high-dimensional regression methods, including SuSiE, mvSuSiE and mr.mash. TWAS integrates eQTL data with GWAS summary statistics to identify gene-trait associations. The key idea is that if genetic variants affect both gene expression (eQTL effects) and a trait (GWAS effects), we can aggregate the GWAS effects using weights learned from expression data to test for a gene-trait association.

For details on high-dimensional regression methods used to obtain weights, please refer to:
- `finemapping.ipynb`
- `multivariate_finemapping.ipynb`
- `mr_mash.ipynb`

Although we simulate expression data in 5 tissues to demonstrate the capabilities of the multivariate methods, for simplicity this TWAS exercise uses only the weights for the first tissue when performing the TWAS test.

In [None]:
library(susieR)
library(mvsusieR)
library(mr.mash.alpha)
set.seed(1)

## TWAS Test Definition

The goal of this exercise is to demonstrate how to compute TWAS (Transcriptome-Wide Association Study) test statistics using weights learned from various high-dimensional regression methods. Before we proceed with model fitting, we first introduce the TWAS test statistic.

The TWAS test combines eQTL weights with GWAS z-scores to compute a gene-level association statistic. The test statistic is computed as:

$$Z_{\text{TWAS}} = \frac{w^T z}{\sqrt{w^T R w}}$$

where $w$ is the vector of weights, $z$ is the vector of GWAS z-scores, and $R$ is the LD matrix. This is implemented in the following function:

In [None]:
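twas_z <- function(weights, z, R = NULL, X = NULL) {
  if (length(weights) != length(z)) {
    stop("Weights and z-scores must have the same length.")
  }
  # The LD matrix can be supplied directly or estimated from genotypes X
  if (is.null(R)) {
    if (is.null(X)) stop("Either R or X must be provided.")
    R <- cor(X)
  }
  # Numerator w'z and squared denominator w'Rw of the TWAS statistic
  stat <- drop(crossprod(weights, z))
  denom <- drop(t(weights) %*% R %*% weights)
  zscore <- stat / sqrt(denom)
  # Two-sided p-value: z^2 follows a chi-square with 1 df under the null
  pval <- pchisq(zscore^2, 1, lower.tail = FALSE)
  return(list(z = zscore, pval = pval))
}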
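
As a sanity check on the denominator: under the null hypothesis the GWAS z-scores follow approximately $N(0, R)$, so $w^T z$ has variance $w^T R w$ and the ratio should be standard normal. The sketch below checks this by simulation using base R only. The toy LD matrix (an AR(1) pattern with correlation 0.6), the weights, and all variable names are illustrative assumptions and not part of this notebook's data.

```r
# Toy check that w'z / sqrt(w'Rw) is ~ N(0, 1) when z ~ N(0, R).
# All inputs here are made up for illustration.
set.seed(1)
p <- 20
R_toy <- 0.6^abs(outer(1:p, 1:p, "-"))     # AR(1)-style LD matrix
w_toy <- c(0.5, -0.3, 0.2, rep(0, p - 3))  # sparse eQTL weights

# Draw null z-scores with covariance R_toy via the Cholesky factor
n_rep <- 5000
U <- chol(R_toy)                                    # R_toy = t(U) %*% U
z_null <- matrix(rnorm(n_rep * p), n_rep, p) %*% U  # each row ~ N(0, R_toy)

denom <- sqrt(drop(t(w_toy) %*% R_toy %*% w_toy))
z_twas_null <- drop(z_null %*% w_toy) / denom

# Mean should be near 0 and variance near 1
c(mean = mean(z_twas_null), var = var(z_twas_null))

# The chi-square p-value used above equals the two-sided normal p-value
z_obs <- 2.5
same_pval <- all.equal(pchisq(z_obs^2, 1, lower.tail = FALSE),
                       2 * pnorm(-abs(z_obs)))
same_pval
```

If the denominator were omitted or computed with the wrong LD matrix, the variance printed above would drift away from 1 and the resulting p-values would be miscalibrated.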

## Simulate molecular trait data

We'll use the same simulation setup as in the mr.mash tutorial, simulating expression of a gene across 5 tissues:

In [None]:
dat <- simulate_mr_mash_data(n = 300, p = 500, p_causal = 3, r = 5, pve = 0.25, V_cor = 0.25)
# Split into training and test sets
ntest <- 50
Ytrain <- dat$Y[-(1:ntest),]
Xtrain <- dat$X[-(1:ntest),]
Ytest <- dat$Y[1:ntest,]
Xtest <- dat$X[1:ntest,]
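
For readers who want to see what a simulation of this general shape involves, the following base-R sketch generates independent predictors, a few causal variants shared across conditions, and noise scaled so that the genetic component explains a target proportion of variance. It is an illustration under simplified assumptions (independent normal predictors, identical effects across conditions, uncorrelated residuals) and is not the implementation of `simulate_mr_mash_data`. The names `sim_toy`, `toy`, and `pve_hat` are hypothetical.

```r
# Hypothetical helper: a simplified stand-in, NOT mr.mash.alpha's simulator.
sim_toy <- function(n = 300, p = 500, p_causal = 3, r = 5, pve = 0.25) {
  X <- matrix(rnorm(n * p), n, p)
  B <- matrix(0, p, r)
  causal <- sample(p, p_causal)
  B[causal, ] <- rnorm(p_causal)   # recycled: same effect in every condition
  G <- X %*% B                     # genetic component of expression
  # Residual sd per condition so that var(G) / (var(G) + var(E)) = pve
  sigma <- sqrt(apply(G, 2, var) * (1 - pve) / pve)
  E <- sweep(matrix(rnorm(n * r), n, r), 2, sigma, "*")
  list(X = X, Y = G + E, B = B, causal = causal)
}

set.seed(1)
toy <- sim_toy()
dim(toy$X)
dim(toy$Y)

# Empirical proportion of variance explained per condition (target: 0.25)
pve_hat <- apply(toy$X %*% toy$B, 2, var) / apply(toy$Y, 2, var)
round(pve_hat, 2)
```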

## Obtain weights from different methods

First, let's fit univariate SuSiE using the first tissue:

In [None]:
fit_susie <- susie(Xtrain, Ytrain[,1], L=10)

Next, we'll fit mr.mash, which will also provide residual variance estimates:

In [None]:
# Fit mr.mash
S0 <- compute_canonical_covs(r=5, singletons=TRUE, hetgrid=seq(0,1,0.25))
univ_sumstats <- compute_univariate_sumstats(Xtrain, Ytrain, standardize=TRUE)
scaling_grid <- autoselect.mixsd(univ_sumstats, mult=sqrt(2))^2
S0 <- expand_covs(S0, scaling_grid)
fit_mrmash <- mr.mash(X=Xtrain, Y=Ytrain, S0=S0)

Finally, we'll fit mvSuSiE using uniform mixture weights and the canonical covariance matrices in `S0`, after removing the first component of `S0`, which is the "null" effect. For the residual variance, we can either use the estimate from the mr.mash fit (as shown here) or directly use the covariance of Y. For the prior specification, more sophisticated approaches exist, in particular the multivariate adaptive shrinkage (mash) model (Urbut et al. 2019, Nature Genetics), which can learn complex patterns of effect sharing across tissues.

In [None]:
# Get uniform weights
n_comp <- length(S0[-1])
w0 <- rep(1/n_comp, n_comp)

# Create prior
prior <- create_mixture_prior(list(matrices = S0[-1],
                                  weights = w0),
                              null_weight = 0)

# Fit mvSuSiE using mr.mash residual variance
fit_mvsusie <- mvsusie(Xtrain, Ytrain, standardize = TRUE,
                       prior_variance = prior,
                       residual_variance = fit_mrmash$V,
                       estimate_prior_variance = TRUE)

## Extract and compare weights

Let's visualize the weights (coefficients) from different methods for the first tissue:

In [None]:
par(mfrow=c(1,3))
plot(dat$B[,1], coef(fit_susie)[-1], main="SuSiE weights", 
     xlab="True", ylab="Estimated")
abline(0,1,col='red',lty=2)

plot(dat$B[,1], coef(fit_mvsusie)[,1][-1], 
     main="mvSuSiE weights", xlab="True", ylab="Estimated")
abline(0,1,col='red',lty=2)

plot(dat$B[,1], fit_mrmash$mu1[,1], 
     main="mr.mash weights", xlab="True", ylab="Estimated")
abline(0,1,col='red',lty=2)

All three methods appear to perform similarly, each correctly capturing two of the three simulated effects.

## Simulate GWAS z-scores

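We'll simulate GWAS z-scores that reflect the LD structure (R). Trait effects are set to be nonzero only at the variants that affect gene expression in the first tissue, with magnitudes drawn from a normal distribution (sd = 3). The z-scores are then obtained by propagating these effects through the LD matrix as $z = R b$, which gives the expected z-scores without sampling noise: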

In [None]:
# Compute LD matrix
R <- cor(Xtrain)
# Simulate z-scores
set.seed(1)
true_effects <- sign(dat$B[,1]) * rnorm(length(dat$B[,1]), sd=3)
z <- R %*% true_effects
plot(z)
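
In real GWAS summary statistics the z-scores also carry sampling noise with covariance $R$, so that approximately $z \sim N(Rb, R)$, whereas `R %*% true_effects` gives only the mean. A hedged sketch of the noisy version on a toy LD matrix follows; all names and values below are illustrative assumptions.

```r
set.seed(1)
p <- 50
R_toy <- 0.8^abs(outer(1:p, 1:p, "-"))   # toy AR(1) LD matrix
b_toy <- numeric(p)
b_toy[c(10, 30)] <- c(4, -3)             # two causal variants (z-score scale)

# Expected z-scores: causal effects spread to neighbours through LD
z_mean <- drop(R_toy %*% b_toy)

# One N(0, R_toy) draw: t(U) %*% e has covariance t(U) %*% U = R_toy
noise <- drop(crossprod(chol(R_toy), rnorm(p)))
z_noisy <- z_mean + noise

# Compare the noise-free and noisy z-scores around the first causal variant
round(cbind(z_mean, z_noisy)[8:12, ], 2)
```

Because the noise shares the LD structure of the signal, neighbouring variants move together, which is why the TWAS denominator must use the same $R$.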

## Perform TWAS

Let's perform TWAS using weights from each method focusing on the first tissue:

In [None]:
# Compute TWAS results using different weights
twas_susie <- twas_z(coef(fit_susie)[-1], z, R)
twas_mvsusie <- twas_z(coef(fit_mvsusie)[,1][-1], z, R)
twas_mrmash <- twas_z(fit_mrmash$mu1[,1], z, R)

# Display results
results <- data.frame(
  Method = c("SuSiE", "mvSuSiE", "mr.mash"),
  Z_score = c(twas_susie$z, twas_mvsusie$z, twas_mrmash$z),
  P_value = c(twas_susie$pval, twas_mvsusie$pval, twas_mrmash$pval)
)
print(results)

## Computing TWAS weights using additional regression methods

The `pecotmr` package (https://github.com/cumc/pecotmr/) provides a unified interface for computing TWAS weights using various high-dimensional regression methods. Here's how to use them:


In [None]:
library(pecotmr)

In [None]:
# Create a character vector of weight methods to use
# Each method corresponds to a `*_weights()` function in pecotmr
w_methods <- c("susie_weights", "enet_weights", "lasso_weights", "mrash_weights", 
            "bayes_n_weights", "bayes_l_weights", "bayes_a_weights", 
            "bayes_c_weights", "bayes_r_weights")

# Compute weights using all methods
# Only using the first tissue for demonstration
weights <- twas_weights(Xtrain, Ytrain[,1], 
                       weight_methods = w_methods)

Visualize the weights from the different methods against the true effects:

In [None]:
par(mfrow = c(3,3), mar = c(4,4,3,1))
for (method in w_methods) {
  plot(dat$B[,1], weights[[method]], 
       main = method,
       xlab = "True effects", 
       ylab = "Estimated weights",
       pch = 20)
  abline(0, 1, col = 'red', lty = 2)
}

The performance of these methods can be assessed using cross-validation, which is one way to select the best model for the TWAS test:

In [None]:
weights_cv <- twas_weights_cv(Xtrain, Ytrain[,1], 
                       weight_methods = w_methods, fold = 5)

In [None]:
weights_cv$performance
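
To make explicit what a cross-validated performance measure captures, here is a minimal base-R sketch of 5-fold cross-validation for a single weight method. It is not pecotmr's implementation: the "method" is a hypothetical ridge regression with a fixed penalty (`ridge_weights`, `lambda`), the helper `cv_r2` is likewise hypothetical, and the metric is simply the squared correlation between held-out predictions and observed expression.

```r
# Hypothetical weight method: ridge regression with a fixed penalty.
ridge_weights <- function(X, y, lambda = 10) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  drop(solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, yc)))
}

# K-fold CV: fit on K-1 folds, predict the held-out fold, pool predictions.
cv_r2 <- function(X, y, weight_fun, fold = 5) {
  fold_id <- sample(rep(seq_len(fold), length.out = nrow(X)))
  pred <- numeric(nrow(X))
  for (k in seq_len(fold)) {
    test <- fold_id == k
    w <- weight_fun(X[!test, , drop = FALSE], y[!test])
    # Center held-out X with training means so no test information leaks in
    mu_x <- colMeans(X[!test, , drop = FALSE])
    pred[test] <- sweep(X[test, , drop = FALSE], 2, mu_x) %*% w + mean(y[!test])
  }
  cor(pred, y)^2
}

set.seed(1)
X_toy <- matrix(rnorm(200 * 30), 200, 30)
y_toy <- drop(X_toy[, 1:3] %*% c(1, -1, 0.5)) + rnorm(200)
r2_toy <- cv_r2(X_toy, y_toy, ridge_weights)
r2_toy
```

Running the same loop over several weight functions and comparing the resulting values gives a simple basis for choosing which weights to carry into the TWAS test.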

Based on cross-validation performance, the best model for this dataset is SuSiE, followed by BayesC. This is not surprising considering how the data were simulated (why?).

Finally, we demonstrate the TWAS test using the weights from each method:

In [None]:
twas_results <- data.frame(
  Method = w_methods,
  Z_score = NA,
  P_value = NA
)

for (i in seq_along(w_methods)) {
  method <- w_methods[i]
  res <- twas_z(weights[[method]], z, R)
  twas_results$Z_score[i] <- res$z
  twas_results$P_value[i] <- res$pval
}

print(twas_results)