# Mashing GTEx V8 release

Here I run the latest `flashr + mashr` pipeline on the latest GTEx release. 

In [1]:
%revisions -s

## Data overview

`fastqtl` summary statistics were obtained from dbGaP (via UChicago Genetic Medicine). They cover 49 tissues. [more description to come]

### Some `bash` variables

```
input_dir=/project/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_all_associations
```

## Preparing MASH input

Using an established workflow (which takes more than a day to run on a cluster system configured by `data/fe961153.localhost.yml`),

```
sos run analysis/fastqtl_to_mash.ipynb --data-list $input_dir/FastQTLSumStats.list -c data/fe961153.localhost.yml
```

I obtained the "mashable" data-set in the same format [as described here](https://stephenslab.github.io/gtexresults/gtexdata.html).

### Data integrity checks

1. Check that the HDF5 conversion preserves the number of groups (genes):

```
$ zcat Whole_Blood.allpairs.txt.gz | cut -f1 | sort -u | wc -l
20316
$ h5ls Whole_Blood.allpairs.txt.h5 | wc -l
20315
```

The counts agree for the Whole Blood sample: the original file has a header line, so it has one more line than the H5 version. Since the pipeline reported success for all other files, the remaining conversions should be fine as well.
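
The same comparison can be scripted so that the header line is handled explicitly. The following is a minimal sketch, not part of the workflow: the function names `count_unique_genes` and `groups_match` are hypothetical, and it assumes the gene identifier is the first tab-separated column and that the first line of the file is a header. The HDF5 group count is passed in as a number (for example the `h5ls | wc -l` result).

```python
import gzip


def count_unique_genes(path):
    """Count distinct first-column values in a gzipped table, skipping the header."""
    genes = set()
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header row (assumed to be the first line)
        for line in handle:
            line = line.rstrip("\n")
            if line:
                genes.add(line.split("\t", 1)[0])
    return len(genes)


def groups_match(path, n_hdf5_groups):
    """True if the text file and the HDF5 file contain the same number of groups."""
    return count_unique_genes(path) == n_hdf5_groups
```

For the Whole Blood file above, `groups_match("Whole_Blood.allpairs.txt.gz", 20315)` would be the scripted equivalent of the manual check.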

### Data & job summary

The command above took 33 hours on UChicago RCC `midway2`. 

```
[MW] cat FastQTLSumStats.log
39832 out of 39832 groups merged!
```

So we have a total of 39832 genes (union of 49 tissues).

```
[MW] cat FastQTLSumStats.portable.log
15636 out of 39832 groups extracted!
```

We have 15636 groups without missing data in any tissue. These will be used to train the MASH model.
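
The relation between the two counts can be illustrated with plain sets. This is only a sketch under a simplifying assumption: a group counts as "without missing data" when the gene appears in every tissue. The actual workflow may apply a finer-grained check. The function name `summarize_groups` and the toy gene and tissue names are hypothetical.

```python
def summarize_groups(genes_by_tissue):
    """Return (size of union, size of intersection) of per-tissue gene sets."""
    tissue_sets = [set(genes) for genes in genes_by_tissue.values()]
    if not tissue_sets:
        return 0, 0
    merged = set().union(*tissue_sets)          # genes seen in at least one tissue
    complete = set.intersection(*tissue_sets)   # genes seen in every tissue
    return len(merged), len(complete)


# Toy input: three tissues, four genes in total
toy = {
    "Whole_Blood": ["g1", "g2", "g3"],
    "Liver": ["g2", "g3", "g4"],
    "Lung": ["g2", "g3"],
}
n_merged, n_complete = summarize_groups(toy)
print(f"{n_complete} out of {n_merged} groups extracted!")
# prints "2 out of 4 groups extracted!"
```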

The "mashable" data file is `FastQTLSumStats.mash.rds`, a 124Mb serialized R file.

## Running the `flashr+mashr` pipeline to train MASH model

The main references are two `mashr` vignettes: [the mashr eQTL outline](https://stephenslab.github.io/mashr/articles/eQTL_outline.html) and [the one on using a FLASH prior](https://github.com/stephenslab/mashr/blob/master/vignettes/flash_mash.Rmd).
The latter was written recently, specifically for this effort, and will likely change in future versions.

```
sos run analysis/20180920_FLASH_MASH_V8.ipynb mashr
```

The pipeline code is shown below:

In [2]:
[global]
parameter: cwd = path('./mashr_flashr_workflow_output')
parameter: data = path("data/FastQTLSumStats.mash.rds")
parameter: vhat = 1
parameter: alpha = 1
parameter: mosek_license = file_target("~/.mosek.lic")
flash_data = file_target(f"{cwd:a}/{data:bn}.flash.rds")
fail_if(not mosek_license.is_file(), msg = f'Please put a valid copy (NOT a symbolic link!) of MOSEK license to: \n``{mosek_license}``')
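
To make the output naming in the `[global]` section concrete, here is a small sketch using `pathlib` instead of SoS. It assumes that the SoS format specifier `:a` yields an absolute path and that `:bn` yields the base name with its last extension removed; the function name `flash_output_path` is hypothetical.

```python
from pathlib import Path


def flash_output_path(cwd, data):
    """Mimic f"{cwd:a}/{data:bn}.flash.rds" under the assumptions stated above."""
    stem = Path(data).stem  # drops only the last extension, e.g. ".rds"
    return Path(cwd).resolve() / f"{stem}.flash.rds"


out = flash_output_path("./mashr_flashr_workflow_output",
                        "data/FastQTLSumStats.mash.rds")
print(out.name)
# prints "FastQTLSumStats.mash.flash.rds"
```

With the default parameters, the FLASH result would therefore land inside the `mashr_flashr_workflow_output` folder rather than next to the input data.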

In [1]:
[flash: provides = flash_data]
# Perform FLASH analysis (time estimate: 20min)
depends: R_library("flashr@stephenslab/flashr"), R_library('mixSQP@youngseok-kim/mixSQP')
input: f"{data:a}"
output: flash_data
R: expand = "${ }", workdir = cwd
    library(flashr)
    library(mixSQP)

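    ## Initialize with flashr's udv_si, then flip the signs of u and v if the
    ## negative entries of v outweigh the positive ones
    ## (cf. the "+uniform" prior on factors below)
    my_init_fn <- function(Y, K = 1) {
      ret = flashr:::udv_si(Y, K)
      pos_sum = sum(ret$v[ret$v > 0])
      neg_sum = -sum(ret$v[ret$v < 0])
      if (neg_sum > pos_sum) {
        return(list(u = -ret$u, d = ret$d, v = -ret$v))
      } else {
        return(ret)
      }
    }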

    flash_pipeline = function(data, ...) {
      ## current state of the art
      ## suggested by Jason Willwerscheid
      ## cf: discussion section of
      ## https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHnn2.html
      ebnm_fn = "ebnm_ash"
      ebnm_param = list(l = list(mixcompdist = "normal",
                               optmethod = "mixSQP"),
                        f = list(mixcompdist = "+uniform",
                               optmethod = "mixSQP"))
      ##
      fl_g <- flashr:::flash_greedy_workhorse(data,
                    var_type = "constant",
                    ebnm_fn = ebnm_fn,
                    ebnm_param = ebnm_param,
                    init_fn = "my_init_fn",
                    stopping_rule = "factors",
                    tol = 1e-3,
                    verbose_output = "odF")
      fl_b <- flashr:::flash_backfit_workhorse(data,
                    f_init = fl_g,
                    var_type = "constant",
                    ebnm_fn = ebnm_fn,
                    ebnm_param = ebnm_param,
                    stopping_rule = "factors",
                    tol = 1e-3,
                    verbose_output = "odF")
      return(fl_b)
    }

    cov_flash = function(data, subset = NULL, mode = 'EZ', non_canonical = FALSE, save_model = NULL) {
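      if(is.null(subset)) subset = 1:mashr:::n_effects(data)
      ## FIXME: is this reasonable?
      if (mode == 'EZ') b = data$Bhat / data$Shat
      else b = data$Bhat
      ## Restrict to the requested effects (all effects by default)
      b = b[subset, , drop = FALSE]
      ## Center each condition (column) at zero
      b.center = apply(b, 2, function(x) x - mean(x))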
      ## Only keep factors with at least two values greater than 1 / sqrt(n)
      find_nonunique_effects <- function(fl) {
        thresh <- 1/sqrt(ncol(fl$fitted_values))
        vals_above_avg <- colSums(fl$ldf$f > thresh)
        nonuniq_effects <- which(vals_above_avg > 1)
        return(fl$ldf$f[, nonuniq_effects, drop = FALSE])
      }

      fmodel = flash_pipeline(b.center)
      if (non_canonical)
          flash_f = find_nonunique_effects(fmodel)
      else 
          flash_f = fmodel$ldf$f
      ## row.names(flash_f) = colnames(b)
      if (!is.null(save_model)) saveRDS(list(model=fmodel, factors=flash_f), save_model)
      U.flash = c(mashr::cov_from_factors(t(as.matrix(flash_f)), "FLASH"),
          list("tFLASH" = t(fmodel$fitted_values) %*% fmodel$fitted_values / nrow(b)))
      return(U.flash)
    }
    ##
    ##
    dat = readRDS(${_input:r})
    dat = mashr::mash_set_data(as.matrix(dat$strong.b), as.matrix(dat$strong.s))
    res = cov_flash(dat, save_model = "${_output:n}.model.rds")
    saveRDS(res, ${_output:r})
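
The covariance construction at the end of `cov_flash` can be sketched in a few lines of plain Python. This is an illustration only, not the `mashr` implementation: it assumes that each factor contributes the rank-1 matrix given by its outer product with itself, and that `tFLASH` is the second-moment matrix of the fitted values divided by the number of rows, as in the R code above. The function names are hypothetical.

```python
import math


def nonunique_factors(factors):
    """Keep factors with more than one entry above 1 / sqrt(n_conditions).

    `factors` is a list of factor vectors, each of length n_conditions.
    """
    kept = []
    for f in factors:
        thresh = 1 / math.sqrt(len(f))
        if sum(1 for x in f if x > thresh) > 1:
            kept.append(f)
    return kept


def rank1_cov(f):
    """Outer product of a factor with itself: a rank-1 covariance matrix."""
    return [[a * b for b in f] for a in f]


def second_moment(rows):
    """t(X) %*% X / nrow(X) for a matrix given as a list of rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(p)]
            for i in range(p)]


# Toy factors over three conditions; the threshold is 1/sqrt(3), about 0.577
factors = [[0.9, 0.1, 0.1],   # loads on one condition only: dropped
           [0.6, 0.6, 0.5]]   # two entries above the threshold: kept
shared = nonunique_factors(factors)
covs = [rank1_cov(f) for f in shared]
```

In the toy example only the second factor survives the filter, so `covs` holds a single 3 x 3 matrix. The filter is only applied in the workflow when `non_canonical = TRUE`; with the default, every factor is turned into a covariance matrix.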