# MASH analysis pipeline with data-driven prior matrices

This notebook is a pipeline written in SoS that runs `flashr + mashr` for the multivariate analysis described in Urbut et al (2019). It was last applied to GTEx V8 eQTL data, but it can be used as is for similar multivariate analyses of other association studies.

*Version: 2021.02.28 by Gao Wang and Yuxin Zou*

In [1]:
%revisions -s

| Revision | Author | Date | Message |
|---|---|---|---|
| 2570aa6 | Gao Wang | 2019-10-03 | Update and improve data preprocessing pipelines |
| 3a02d2d | Gao Wang | 2019-07-11 | Update mashr version |
| 3de6678 | Gao Wang | 2019-02-05 | Add corshrink pipeline implementation |
| 80e89a8 | Gao Wang | 2019-01-28 | Add a note on HPC job submission |
| 982a1b4 | Gao Wang | 2019-01-28 | Implement new $\hat{V}$ estimation method (with Yuxin Zou) |
| 515e869 | Gao Wang | 2018-11-22 | Add a prompt for empty input posterior |
| ba9b20e | Gao Wang | 2018-11-22 | Add posterior calculation for input 'strong' set |
| 4949470 | Gao Wang | 2018-11-22 | Fix mashr null correlation estimate interface |
| 6145b33 | Gao Wang | 2018-11-21 | Add --optmethod to configure convex optimization method to use |


## Data overview

`fastqtl` summary statistics data were obtained from dbGaP (data on CRI at UChicago Genetic Medicine). The data cover 49 tissues. [more description to come]

## Preparing MASH input

Using an established workflow (which takes 33 hours to run on a cluster system configured by `midway2.yml`; see `fastqtl_to_mash.ipynb` for a note on the computing environment),

```
INPUT_DIR=/project/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_all_associations
JOB_OPT="-c midway2.yml -q midway2"
sos run workflows/fastqtl_to_mash.ipynb --data-list $INPUT_DIR/FastQTLSumStats.list --common-suffix ".allpairs.txt" $JOB_OPT
```

The command above produces the "mashable" data set in the same format [as described here](https://stephenslab.github.io/gtexresults/gtexdata.html).

### Some data integrity check

1. Check that the number of groups (genes) is unchanged after HDF5 data conversion:

```
$ zcat Whole_Blood.allpairs.txt.gz | cut -f1 | sort -u | wc -l
20316
$ h5ls Whole_Blood.allpairs.txt.h5 | wc -l
20315
```

The counts agree for the Whole Blood sample: the original data has a header line, hence one more line than the H5 version. The other files should be fine too, since the pipeline reported success for all of them.
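
The same comparison can be scripted. Below is a minimal Python sketch, assuming the first tab-separated column holds the gene ID and the file starts with one header line (as the `zcat | cut -f1` check above suggests). `count_groups` is a hypothetical helper, not part of the pipeline.

```python
import gzip


def count_groups(path, has_header=True):
    """Count distinct values in the first tab-separated column of a gzipped file."""
    groups = set()
    with gzip.open(path, "rt") as handle:
        if has_header:
            next(handle, None)  # skip the header line, if any
        for line in handle:
            groups.add(line.rstrip("\n").split("\t", 1)[0])
    return len(groups)
```

With the header skipped, the count is directly comparable to the `h5ls` line count, for example `count_groups("Whole_Blood.allpairs.txt.gz")`.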

### Data & job summary

The command above took 33 hours on UChicago RCC `midway2`. 

```
[MW] cat FastQTLSumStats.log
39832 out of 39832 groups merged!
```

So we have a total of 39832 genes (union of 49 tissues).

```
[MW] cat FastQTLSumStats.portable.log
15636 out of 39832 groups extracted!
```

Of these, 15636 groups have no missing data in any tissue. These groups are used to train the MASH model.
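
To illustrate the difference between the two numbers, here is a toy sketch that separates the union of groups from the groups present in every tissue. It assumes that "missing data" simply means a gene is absent from at least one tissue; the actual criterion is defined in `fastqtl_to_mash.ipynb`. `split_groups` and the toy gene names are hypothetical.

```python
def split_groups(groups_by_tissue):
    """Return (union, complete): all groups seen, and groups present in every tissue."""
    tissue_sets = [set(groups) for groups in groups_by_tissue.values()]
    union = set().union(*tissue_sets)
    complete = set.intersection(*tissue_sets) if tissue_sets else set()
    return union, complete


toy = {
    "Liver": ["g1", "g2", "g3"],
    "Lung": ["g2", "g3"],
    "Whole_Blood": ["g2", "g3", "g4"],
}
union, complete = split_groups(toy)
print(len(union), len(complete))  # 4 2
```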

The "mashable" data file is `FastQTLSumStats.mash.rds`, a 124Mb serialized R file.

## Multivariate adaptive shrinkage (MASH) analysis of eQTL data

Below is a "blackbox" implementation of the `mashr` eQTL workflow: blackbox in the sense that you can run this pipeline as an executable, without thinking too much about it, if your problem fits our GTEx analysis scheme. When read as a notebook, however, it is also a good source of information for developing your own `mashr` analysis procedures.

Since the bioRxiv submission of Urbut 2017 we have improved the implementation of the MASH algorithm and released a new R package, [`mashr`](https://github.com/stephenslab/mashr). Major improvements compared to Urbut 2017 are:

1. Faster computation of likelihood and posterior quantities via matrix algebra tricks and a C++ implementation.
2. Faster computation of MASH mixture via convex optimization.
3. Replace `SFA` with `FLASH`, a new sparse factor analysis method to generate prior covariance candidates.
4. Improve estimate of residual variance $\hat{V}$.

At this point, the input data have already been converted from the original eQTL summary statistics to a format convenient for analysis in MASH, as a result of running the data conversion pipeline in `fastqtl_to_mash.ipynb`.

Example command:


```bash
JOB_OPT="-j 8"
#JOB_OPT="-c midway2.yml -q midway2"
sos run workflows/mashr_flashr_workflow.ipynb mash $JOB_OPT # --data ... --cwd ... --vhat ...
```

**FIXME: add comments on submitting jobs to HPC. Here we use the UChicago RCC cluster, but other users can similarly configure their computing system to run the pipeline on HPC.**

### Global parameter settings

In [2]:
[global]
parameter: cwd = path('./mashr_flashr_workflow_output')
# Input summary statistics data
parameter: data = path("fastqtl_to_mash_output/FastQTLSumStats.mash.rds")
# Prefix of output files. If not specified, it will derive it from data.
# If it is specified, for example, `--output-prefix AnalysisResults`
# It will save output files as `{cwd}/AnalysisResults*`.
parameter: output_prefix = ''
# Exchangeable effect (EE) or exchangeable z-scores (EZ)
parameter: effect_model = 'EZ'
# Identifier of $\hat{V}$ estimate file
# Options are "identity", "simple", "mle", "vhat_corshrink_xcondition", "vhat_simple_specific"
parameter: vhat = 'simple'
parameter: mixture_components = ['flash', 'flash_nonneg', 'pca',"canonical"]
parameter: container = str
data = data.absolute()
cwd = cwd.absolute()
if len(output_prefix) == 0:
    output_prefix = f"{data:bn}"
prior_data = file_target(f"{cwd:a}/{output_prefix}.{effect_model}.prior.rds")
vhat_data = file_target(f"{cwd:a}/{output_prefix}.{effect_model}.V_{vhat}.rds")
mash_model = file_target(f"{cwd:a}/{output_prefix}.{effect_model}.V_{vhat}.mash_model.rds")

def sort_uniq(seq):
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]
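
As a rough illustration of how these settings determine output file names, the sketch below re-creates the naming logic in plain Python. It assumes the SoS format specifier `bn` yields the base name with its last extension removed; `derive_outputs` is a hypothetical helper, not part of the workflow.

```python
from pathlib import Path


def sort_uniq(seq):
    """Drop duplicates while keeping first-seen order (same helper as above)."""
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]


def derive_outputs(data, cwd, effect_model="EZ", vhat="simple", output_prefix=""):
    """Return the prior, V-hat and model file paths for the given settings."""
    data, cwd = Path(data), Path(cwd)
    prefix = output_prefix or data.stem  # Path.stem drops only the last extension
    stem = f"{prefix}.{effect_model}"
    return {
        "prior_data": cwd / f"{stem}.prior.rds",
        "vhat_data": cwd / f"{stem}.V_{vhat}.rds",
        "mash_model": cwd / f"{stem}.V_{vhat}.mash_model.rds",
    }


outputs = derive_outputs(
    "fastqtl_to_mash_output/FastQTLSumStats.mash.rds",
    "mashr_flashr_workflow_output",
)
for name, path in outputs.items():
    print(name, path.name)
print(sort_uniq(["ENSG2", "ENSG1", "ENSG2"]))  # ['ENSG2', 'ENSG1']
```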

### Command interface

In [1]:
sos run mashr_flashr_workflow.ipynb -h

usage: sos run mashr_flashr_workflow.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  flash
  flash_nonneg
  pca
  vhat_identity
  vhat_simple
  vhat_mle
  vhat_corshrink_xcondition
  vhat_simple_specific
  prior
  mash
  posterior

Global Workflow Options:
  --cwd mashr_flashr_workflow_output (as path)
  --data fastqtl_to_mash_output/FastQTLSumStats.mash.rds (as path)
                        Input summary statistics data
  --output-prefix ''
                        Prefix of output files. If not specified, it will derive
                        it from data. If it is specified, for example,
                        `--output-prefix AnalysisResults` It will save out

## Factor analyses

In [None]:
# Perform FLASH analysis (time estimate: 20min)
[flash]
input: data
output: f"{cwd}/{output_prefix}.flash.rds"
task: trunk_workers = 1, walltime = '2h', trunk_size = 1, mem = '8G', cores = 2, tags = f'{_output:bn}'
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    dat = readRDS(${_input:r})
    dat = mashr::mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, zero_Bhat_Shat_reset = 1E3)
    res = mashr::cov_flash(dat, factors="default", remove_singleton=${"TRUE" if "canonical" in mixture_components else "FALSE"}, output_model="${_output:n}.model.rds")
    saveRDS(res, ${_output:r})

In [None]:
# Perform FLASH analysis with non-negative factor constraint (time estimate: 20min)
[flash_nonneg]
input: data
output: f"{cwd}/{output_prefix}.flash_nonneg.rds"
task: trunk_workers = 1, walltime = '2h', trunk_size = 1, mem = '8G', cores = 2, tags = f'{_output:bn}'
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    dat = readRDS(${_input:r})
    dat = mashr::mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, zero_Bhat_Shat_reset = 1E3)
    res = mashr::cov_flash(dat, factors="nonneg", remove_singleton=${"TRUE" if "canonical" in mixture_components else "FALSE"}, output_model="${_output:n}.model.rds")
    saveRDS(res, ${_output:r})

In [None]:
[pca]
# Number of components in PCA analysis for prior
# set to 3 as in mash paper
parameter: npc = 3
input: data
output: f"{cwd}/{output_prefix}.pca.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 2, tags = f'{_output:bn}'
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    dat = readRDS(${_input:r})
    dat = mashr::mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, zero_Bhat_Shat_reset = 1E3)
    res = mashr::cov_pca(dat, ${npc})
    saveRDS(res, ${_output:r})

In [None]:
[canonical]
input: data
output: f"{cwd}/{output_prefix}.canonical.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 2, tags = f'{_output:bn}'
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    library("mashr")
    dat = readRDS(${_input:r})
    dat = mashr::mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, zero_Bhat_Shat_reset = 1E3)
    res = mashr::cov_canonical(dat)
    saveRDS(res, ${_output:r})
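
For orientation, the canonical covariance families can be written out by hand. The sketch below is plain Python and reflects my understanding of the default `cov_canonical` families (identity, one singleton per condition, equal effects, and simple heterogeneity with off-diagonal correlations 0.25, 0.5 and 0.75). The function name and dictionary keys are hypothetical, so consult the `mashr` documentation for the authoritative definitions.

```python
def canonical_covariances(n_conditions, het_correlations=(0.25, 0.5, 0.75)):
    """Build toy versions of the canonical covariance matrices as nested lists."""
    r = range(n_conditions)
    mats = {}
    # Identity: independent effects of equal size in every condition.
    mats["identity"] = [[1.0 if i == j else 0.0 for j in r] for i in r]
    # Singletons: an effect in exactly one condition.
    for k in r:
        mats[f"singleton_{k + 1}"] = [
            [1.0 if i == j == k else 0.0 for j in r] for i in r
        ]
    # Equal effects: the same effect in all conditions.
    mats["equal_effects"] = [[1.0 for _ in r] for _ in r]
    # Simple heterogeneity: unit variances with a common correlation.
    for idx, rho in enumerate(het_correlations, start=1):
        mats[f"simple_het_{idx}"] = [[1.0 if i == j else rho for j in r] for i in r]
    return mats


mats = canonical_covariances(3)
print(sorted(mats))
```

Under these assumptions, 49 conditions give 1 + 49 + 1 + 3 = 54 canonical matrices.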

### Estimate residual variance

FIXME: add some narratives here explaining what we do in each method.

In [6]:
# V estimate: "identity" method
[vhat_identity]
input: data
output: f'{vhat_data:nn}.V_identity.rds'
task: trunk_workers = 1, walltime = '2h', trunk_size = 1, mem = '8G', cores = 2, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    dat = readRDS(${_input:r})
    saveRDS(diag(ncol(dat$random.b)), ${_output:r})

In [7]:
# V estimate: "simple" method (using null z-scores)
[vhat_simple]
depends: R_library("mashr")
input: data
output: f'{vhat_data:nn}.V_simple.rds'
task: trunk_workers = 1, walltime = '2h', trunk_size = 1, mem = '8G', cores = 2, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(mashr)
    dat = readRDS(${_input:r})
    vhat = estimate_null_correlation_simple(mash_set_data(dat$random.b, Shat=dat$random.s, alpha=${1 if effect_model == 'EZ' else 0}, zero_Bhat_Shat_reset = 1E3))
    saveRDS(vhat, ${_output:r})

In [8]:
# V estimate: "mle" method
[vhat_mle]
# number of samples to use
parameter: n_subset = 6000
# maximum number of iterations
parameter: max_iter = 6
depends: R_library("mashr")
input: data, prior_data
output: f'{vhat_data:nn}.V_mle.rds'
task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(mashr)
    dat = readRDS(${_input[0]:r})
    # choose random subset
    set.seed(1)
    random.subset = sample(1:nrow(dat$random.b), min(${n_subset}, nrow(dat$random.b)))
    random.subset = mash_set_data(dat$random.b[random.subset,], dat$random.s[random.subset,], alpha=${1 if effect_model == 'EZ' else 0}, zero_Bhat_Shat_reset = 1E3)
    # estimate V mle
    vhat = mash_estimate_corr_em(random.subset, readRDS(${_input[1]:r}), max_iter = ${max_iter})
    vhat = vhat$V
    saveRDS(vhat, ${_output:r})

In [None]:
# Estimate each V separately via corshrink
[vhat_corshrink_xcondition_1]
# Utility script
parameter: util_script = path('/project/mstephens/gtex/scripts/SumstatQuery.R')
# List of genes to analyze
parameter: gene_list = path()

fail_if(not gene_list.is_file(), msg = 'Please specify valid path for --gene-list')
fail_if(not util_script.is_file() and len(str(util_script)), msg = 'Please specify valid path for --util-script')
genes = sort_uniq([x.strip().strip('"') for x in open(f'{gene_list:a}').readlines() if not x.strip().startswith('#')])


depends: R_library("CorShrink")
input: data, for_each = 'genes'
output: f'{vhat_data:nn}/{vhat_data:bnn}_V_corshrink_{_genes}.rds'
task: trunk_workers = 1, walltime = '3m', trunk_size = 500, mem = '3G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    source(${util_script:r})
    CorShrink_sum = function(gene, database, z_thresh = 2){
      print(gene)
      dat <- GetSS(gene, database)
      z = dat$"z-score"
      max_absz = apply(abs(z), 1, max)
      nullish = which(max_absz < z_thresh)
      # if (length(nullish) < ncol(z)) {
        # stop("not enough null data to estimate null correlation")
      # }
      if (length(nullish) <= 1){
        mat = diag(ncol(z))
      } else {
        nullish_z = z[nullish, ]  
        mat = as.matrix(CorShrink::CorShrinkData(nullish_z, ash.control = list(mixcompdist = "halfuniform"))$cor)
      }
      return(mat)
    }
    V = CorShrink_sum("${_genes}", ${data:r})
    saveRDS(V, ${_output:r})

In [None]:
# Estimate each V separately via "simple" method
[vhat_simple_specific_1]
# Utility script
parameter: util_script = path('/project/mstephens/gtex/scripts/SumstatQuery.R')
# List of genes to analyze
parameter: gene_list = path()

fail_if(not gene_list.is_file(), msg = 'Please specify valid path for --gene-list')
fail_if(not util_script.is_file() and len(str(util_script)), msg = 'Please specify valid path for --util-script')
genes = sort_uniq([x.strip().strip('"') for x in open(f'{gene_list:a}').readlines() if not x.strip().startswith('#')])

depends: R_library("Matrix")
input: data, for_each = 'genes'
output: f'{vhat_data:nn}/{vhat_data:bnn}_V_simple_{_genes}.rds'

task: trunk_workers = 1, walltime = '1m', trunk_size = 500, mem = '3G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    source(${util_script:r})
    simple_V = function(gene, database, z_thresh = 2){
      print(gene)
      dat <- GetSS(gene, database)
      z = dat$"z-score"
      max_absz = apply(abs(z), 1, max)
      nullish = which(max_absz < z_thresh)
      # if (length(nullish) < ncol(z)) {
        # stop("not enough null data to estimate null correlation")
      # }
      if (length(nullish) <= 1){
        mat = diag(ncol(z))
      } else {
        nullish_z = z[nullish, ]
        mat = as.matrix(Matrix::nearPD(as.matrix(cov(nullish_z)), conv.tol=1e-06, doSym = TRUE, corr=TRUE)$mat)
      }
      return(mat)
    }
    V = simple_V("${_genes}", ${data:r})
    saveRDS(V, ${_output:r})
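
The per-gene "simple" estimate above boils down to thresholding followed by a correlation. A minimal pure-Python sketch of that logic follows, assuming `z` is a non-empty list of rows (SNPs) by conditions. It omits the `Matrix::nearPD` projection, and `null_correlation` is a hypothetical name.

```python
import math


def null_correlation(z, z_thresh=2.0):
    """Correlation of z-scores across conditions, using only 'nullish' rows."""
    n_cond = len(z[0])
    identity = [[1.0 if i == j else 0.0 for j in range(n_cond)] for i in range(n_cond)]
    # Keep rows whose largest |z| is below the threshold.
    nullish = [row for row in z if max(abs(v) for v in row) < z_thresh]
    if len(nullish) <= 1:
        return identity  # too little null data: fall back to the identity
    n = len(nullish)
    means = [sum(col) / n for col in zip(*nullish)]
    centered = [[v - m for v, m in zip(row, means)] for row in nullish]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(n_cond)]
        for i in range(n_cond)
    ]
    sd = [math.sqrt(cov[i][i]) for i in range(n_cond)]
    if any(s == 0 for s in sd):
        return identity  # a constant column has no defined correlation
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n_cond)] for i in range(n_cond)]
```

The identity fallback mirrors the `length(nullish) <= 1` branch in the R code; the extra guard for constant columns is my own addition to keep the sketch free of division by zero.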

In [None]:
# Consolidate Vhat into one file
[vhat_corshrink_xcondition_2, vhat_simple_specific_2]
depends: R_library("parallel")
# List of genes to analyze
parameter: gene_list = path()

fail_if(not gene_list.is_file(), msg = 'Please specify valid path for --gene-list')
genes = paths([x.strip().strip('"') for x in open(f'{gene_list:a}').readlines() if not x.strip().startswith('#')])


input: group_by = 'all'
output: f"{vhat_data:nn}.V_{step_name.rsplit('_',1)[0]}.rds"

task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(parallel)
    files = sapply(c(${genes:r,}), function(g) paste0(c(${_input[0]:adr}), '/', g, '.rds'), USE.NAMES=FALSE)
    V = mclapply(files, function(i){ readRDS(i) }, mc.cores = 1)
    R = dim(V[[1]])[1]
    L = length(V)
    V.array = array(as.numeric(unlist(V)), dim=c(R, R, L))
    saveRDS(V.array, ${_output:ar})

### Compute MASH priors 

The main references are our `mashr` vignettes: [this one for the mashr eQTL outline](https://stephenslab.github.io/mashr/articles/eQTL_outline.html) and [this one for using FLASH priors](https://github.com/stephenslab/mashr/blob/master/vignettes/flash_mash.Rmd).

The outcome of this workflow is found under the `./mashr_flashr_workflow_output` folder (configurable via `--cwd`). File names have pattern `*.mash_model.rds`. They can be used to compute posteriors for an input list of gene-SNP pairs (see next section).

In [5]:
# Compute data-driven / canonical prior matrices (time estimate: 2h ~ 12h for a mixture of ~30 49-by-49 matrices)
[prior]
depends: R_library("mashr")
# if vhat method is `mle` it should use V_simple to analyze the data to provide a rough estimate, then later be refined via `mle`.
input: [data, vhat_data if vhat != "mle" else f'{vhat_data:nn}.V_simple.rds'] + [f"{cwd}/{output_prefix}.{m}.rds" for m in mixture_components]
output: prior_data

task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 4, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(mashr)
    rds_files = c(${_input:r,})
    dat = readRDS(rds_files[1])
    vhat = readRDS(rds_files[2])
    mash_data = mash_set_data(dat$strong.b, Shat=dat$strong.s, V=vhat, alpha=${1 if effect_model == 'EZ' else 0}, zero_Bhat_Shat_reset = 1E3)
    # setup prior
    U = list(XtX = t(mash_data$Bhat) %*% mash_data$Bhat / nrow(mash_data$Bhat))
    for (f in rds_files[3:length(rds_files)]) U = c(U, readRDS(f))
    U.ed = cov_ed(mash_data, U, logfile=${_output:nr})
    # Canonical matrices
    U.can = cov_canonical(mash_data)
    saveRDS(c(U.ed, U.can), ${_output:r})
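
The `XtX` entry in the prior list is the empirical second-moment matrix of the strong effects, `t(Bhat) %*% Bhat / nrow(Bhat)`. A tiny pure-Python sketch of that formula is shown below; `second_moment` is a hypothetical name and the toy matrix is made up.

```python
def second_moment(bhat):
    """Return t(B) %*% B / nrow(B) for a matrix given as a list of rows."""
    n_rows = len(bhat)
    n_cond = len(bhat[0])
    return [
        [sum(row[i] * row[j] for row in bhat) / n_rows for j in range(n_cond)]
        for i in range(n_cond)
    ]


toy_bhat = [[1.0, 2.0], [3.0, 4.0]]
print(second_moment(toy_bhat))  # [[5.0, 7.0], [7.0, 10.0]]
```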

## `mashr` mixture model fitting

In [9]:
# Fit MASH mixture model (time estimate: <15min for 70K by 49 matrix)
[mash_1]
depends: R_library("mashr")
input: data, vhat_data, prior_data
output: mash_model

task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(mashr)
    dat = readRDS(${_input[0]:r})
    vhat = readRDS(${_input[1]:r})
    U = readRDS(${_input[2]:r})
    mash_data = mash_set_data(dat$random.b, Shat=dat$random.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
    saveRDS(mash(mash_data, Ulist = U, outputlevel = 1), ${_output:r})

### Optional posterior computations

This step additionally provides posteriors for the "strong" set in the MASH input data.

In [10]:
# Compute posterior for the "strong" set of data as in Urbut et al 2017.
# This is optional because most of the time we want to apply the 
# MASH model learned on much larger data-set.
[mash_2]
# default to True; use --no-compute-posterior to disable this
parameter: compute_posterior = True
# input Vhat file for the batch of posterior data
skip_if(not compute_posterior)
depends: R_library("mashr")
input: data, vhat_data, mash_model
output: f"{cwd:a}/{output_prefix}.{effect_model}.posterior.rds"

task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(mashr)
    dat = readRDS(${_input[0]:r})
    vhat = readRDS(${_input[1]:r})
    mash_data = mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
    mash_model = readRDS(${_input[2]:ar})
    saveRDS(mash_compute_posterior_matrices(mash_model, mash_data), ${_output:r})

## Compute MASH posteriors

In the GTEx V6 paper we assumed one eQTL per gene and applied the model learned above to those SNPs. Under that assumption, the input data for posterior calculation are the `dat$strong.*` matrices.
It is a fairly straightforward procedure, as shown in [this vignette](https://stephenslab.github.io/mashr/articles/eQTL_outline.html).

It is often more interesting, however, to apply MASH to a given list of eQTLs, e.g. those from fine-mapping results. In the GTEx V8 analysis we obtain such gene-SNP pairs from DAP-G fine-mapping analysis. See [this notebook](https://stephenslab.github.io/gtex-eqtls/analysis/Independent_eQTL_Results.html) for how the input data are prepared. The workflow below takes a number of input chunks (each chunk is a list of matrices `dat$Bhat` and `dat$Shat`)
and computes posteriors for each chunk. It is therefore suited to computing posteriors in parallel for all gene-SNP pairs, provided the input data are supplied in chunks.


```
JOB_OPT="-c midway2.yml -q midway2"
DATA_DIR=/project/compbio/GTEx_eQTL/independent_eQTL
sos run workflows/mashr_flashr_workflow.ipynb posterior \
    $JOB_OPT \
    --posterior-input $DATA_DIR/DAPG_pip_gt_0.01-AllTissues/DAPG_pip_gt_0.01-AllTissues.*.rds \
                      $DATA_DIR/ConditionalAnalysis_AllTissues/ConditionalAnalysis_AllTissues.*.rds
```

In [11]:
# Apply posterior calculations
[posterior_1]
parameter: analysis_units = path
regions = [x.replace("\"","").strip().split() for x in open(analysis_units).readlines() if x.strip() and not x.strip().startswith('#')]
parameter: mash_model = path(f"{cwd:a}/{output_prefix}.{effect_model}.V_{vhat}.mash_model.rds")
parameter: posterior_input = [path(x[0]) for x in regions]
parameter: posterior_vhat_files = paths()
# eg, if data is saved in R list as data$strong, then
# when you specify `--data-table-name strong` it will read the data as
# readRDS('{_input:r}')$strong
parameter: data_table_name = ''
parameter: bhat_table_name = 'bhat'
parameter: shat_table_name = 'sbhat'
mash_model = f"{mash_model:a}"
## Conditions can be excluded if the need arises, given as 1-based column indices. If there is nothing to exclude, keep the default 0.
parameter: exclude_condition = ["0"]

skip_if(len(posterior_input) == 0, msg = "No posterior input data to compute on. Please specify it using --posterior-input.")
fail_if(len(posterior_vhat_files) > 1 and len(posterior_vhat_files) != len(posterior_input), msg = "length of --posterior-input and --posterior-vhat-files do not agree.")
for p in posterior_input:
    fail_if(not p.is_file(), msg = f'Cannot find posterior input file ``{p}``')

depends: R_library("mashr"), mash_model
input: posterior_input, group_by = 1
output: f"{cwd}/{_input:bn}.posterior.rds"
task: trunk_workers = 1, walltime = '20h', trunk_size = 1, mem = '20G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
    library(mashr)
    data = readRDS("${_input}")${('$' + data_table_name) if data_table_name else ''}
    if(c(${",".join(exclude_condition)})[1] > 0 ){
      message(paste("Excluding condition ${exclude_condition} from the analysis"))
      data$bhat = data$bhat[,-c(${",".join(exclude_condition)})]
      data$sbhat = data$sbhat[,-c(${",".join(exclude_condition)})]
      data$Z = data$Z[,-c(${",".join(exclude_condition)})]
    }
  
    vhat = readRDS("${vhat_data if len(posterior_vhat_files) == 0 else posterior_vhat_files[_index]}")
    mash_data = mash_set_data(data$${bhat_table_name}, Shat=data$${shat_table_name}, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
    mash_output = mash_compute_posterior_matrices(readRDS("${mash_model}"), mash_data)
    mash_output$snps = data$snps
    saveRDS(mash_output, ${_output:r})
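
The `exclude_condition` handling in the R code above uses 1-based column indices and only drops columns when the first entry is positive. The plain-Python sketch below mimics that behaviour on a matrix stored as a list of rows; `drop_conditions` is a hypothetical helper for illustration only.

```python
def drop_conditions(matrix, exclude):
    """Drop 1-based columns listed in `exclude`; a leading 0 means 'keep all'."""
    indices = [int(x) for x in exclude]
    if not indices or indices[0] <= 0:
        return [list(row) for row in matrix]
    drop = {i - 1 for i in indices}  # convert to 0-based positions
    return [[v for j, v in enumerate(row) if j not in drop] for row in matrix]


bhat = [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4]]
print(drop_conditions(bhat, ["1", "3"]))  # [[0.2, 0.4], [1.2, 1.4]]
print(drop_conditions(bhat, ["0"]) == bhat)  # True
```

Note that the same columns must be dropped from `bhat`, `sbhat` and `Z`, and that the $\hat{V}$ matrix used in the step has to match the remaining conditions.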

In [None]:
[posterior_2]
input: group_by = "all"
output:f"{cwd}/mash_output_list"
python: expand = "$[ ]", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
    import pandas as pd
    pd.DataFrame({"#mash_result" :  [$[_input:ar,]] }).to_csv("$[_output]", index = False, header = False, sep = "\t")

### Posterior results

1. The `[posterior]` step produces a number of serialized R objects `*.batch_*.posterior.rds` (which can be loaded into R via `readRDS()`). I split the data into batches to take advantage of computing on multiple cluster nodes. It should be self-explanatory, but please let me know otherwise.
2. Other posterior-related files are:
    1. `*.batch_*.yaml`: gene-SNP pairs of interest, identified elsewhere (e.g. fine-mapping analysis).
    2. `*.batch_*.rds`: the corresponding univariate summary statistics for the gene-SNP pairs in `*.batch_*.yaml`, extracted to serve as input to the `[posterior]` step.
    3. `*.batch_*.stdout`: documents SNPs found in fine-mapping results but not in the original `fastqtl` output.