# Multivariate EBNM-based prior for M&M

Here we prepare a mixture prior for the simulation benchmark, based on a multivariate Empirical Bayes Normal Means (EBNM) model. Previously we used Extreme Deconvolution for this task.

## Approach

Here is the analysis plan:

1. Simulate data under my phenotypic models (the latest DSC benchmark setting) and generate summary statistics for them: `bhat` and `sbhat`.
2. For each data-set, take the strongest SNP as the strong set.
3. Also select 1 "null" SNP from each data-set.
4. Then run your Vhat estimation to get Vhat first, and then run Yunqi / Peter's ED.
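
A minimal base-R sketch of the strong / null selection in the plan, on toy data. The dimensions, the planted effect at SNP 7 and all variable names here are made up for illustration; only the selection rules (largest |z| for the strong SNP, max |z| < 2 for null candidates) come from the workflow in this notebook.

```r
set.seed(1)
n_snp <- 50
n_trait <- 3

# Hypothetical toy summary statistics: rows are SNPs, columns are traits
sbhat <- matrix(1, n_snp, n_trait)
bhat <- matrix(rnorm(n_snp * n_trait), n_snp, n_trait)
bhat[7, 2] <- 8  # plant one strong effect (SNP 7, trait 2)

z <- abs(bhat / sbhat)

# Strong SNP: the row holding the largest |z| over all SNP-trait pairs
strong_idx <- arrayInd(which.max(z), dim(z))[1]

# Null candidates: SNPs whose largest |z| across traits is below 2
null_ids <- which(apply(z, 1, max) < 2)

# Index into null_ids; sample(null_ids, 1) would draw from 1:null_ids
# if only a single candidate were left
null_idx <- null_ids[sample.int(length(null_ids), 1)]

strong <- list(bhat = bhat[strong_idx, , drop = FALSE],
               sbhat = sbhat[strong_idx, , drop = FALSE])
null <- list(bhat = bhat[null_idx, , drop = FALSE],
             sbhat = sbhat[null_idx, , drop = FALSE])
```

Keeping `drop = FALSE` means each selection stays a 1-row matrix, so results from many data-sets can be stacked with `rbind`.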

In UKB bloodcells, we have about 600 regions, each containing between 1000 and 5000 SNPs. We simulate 600 datasets using the priors created in [this notebook](https://zouyuxin.github.io/mmbr-rss-dsc/create_simulation_priors.html).

We simulate under identity residual variance for all regions with the `artificial_mixture_ukb` prior, and under the variance of bloodcells for all regions with the `ukb_bloodcells_mixture` prior.

## Workflow

In [None]:
[global]
parameter: cwd = path('/project2/mstephens/yuxin/mvarbvs/dsc/mnm_prototype/output/ukb_rss')
parameter: model = 'artificial_mixture_ukb' # 'ukb_bloodcells_mixture'
# handle N = per_chunk data-set in one job
parameter: per_chunk = 1000
import glob

In [1]:
%cd /project2/mstephens/yuxin/mvarbvs/dsc/mnm_prototype/output/ukb_rss

/project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats

### Get top SNP and random close to null SNP per region

In [31]:
# extract data from summary stats
[extract_1]
parameter: seed = 999
parameter: n_null = 1
input: glob.glob(f'{cwd}/{model}/*.rds'), group_by = per_chunk
output: f"{cwd}/{model}/cache/{model}_{_index+1}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
    set.seed(${seed})
    matxMax <- function(mtx) {
      return(arrayInd(which.max(mtx), dim(mtx)))
    }
    remove_rownames = function(x) {
        for (name in names(x)) rownames(x[[name]]) = NULL
        return(x)
    }
    extract_one_data = function(infile, n_null) {
        dat = readRDS(infile)$sumstats
        if (is.null(dat)) return(NULL)
        z = abs(dat$bhat/dat$sbhat)
        max_idx = matxMax(z)
        strong = list(bhat = dat$bhat[max_idx[1],,drop=F], sbhat = dat$sbhat[max_idx[1],,drop=F])
        
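        # z already holds absolute values
        null.id = which(apply(z, 1, max) < 2)
        # index into null.id: sample(null.id, ...) draws from 1:null.id when there is a single candidate
        null_idx = null.id[sample.int(length(null.id), n_null, replace = F)]
        null = list(bhat = dat$bhat[null_idx,,drop=F], sbhat = dat$sbhat[null_idx,,drop=F])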
        return(list(null = remove_rownames(null),  strong = remove_rownames(strong)))
    }
    merge_data = function(res, one_data) {
      if (length(res) == 0) {
          return(one_data)
      } else if (is.null(one_data)) {
          return(res)
      } else {
          for (d in names(one_data)) {
              for (s in names(one_data[[d]])) {
                  res[[d]][[s]] = rbind(res[[d]][[s]], one_data[[d]][[s]])
              }
          }
          return(res)
      }
    }
    res = list()
    for (f in c(${_input:r,})) {
      res = merge_data(res, extract_one_data(f, ${n_null}))
    }
    saveRDS(res, ${_output:r})
  
[extract_2]
input: group_by = "all"
output: f"{cwd}/{model}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
    merge_data = function(res, one_data) {
      if (length(res) == 0) {
          return(one_data)
      } else {
          for (d in names(one_data)) {
              for (s in names(one_data[[d]])) {
                  res[[d]][[s]] = rbind(res[[d]][[s]], one_data[[d]][[s]])
              }
          }
          return(res)
      }
    }
    dat = list()
    for (f in c(${_input:r,})) {
      dat = merge_data(dat, readRDS(f))
    }
    saveRDS(
          list(null.z = dat$null$bhat/dat$null$sbhat,
           strong.z = dat$strong$bhat/dat$strong$sbhat),
          ${_output:r})

To run it:

```
for m in artificial_mixture_ukb ukb_bloodcells_mixture; do 
    sos run analysis/20201221_ukb_Prepare_ED_prior.ipynb extract --model $m -c midway2.yml -q midway2
done
```

## Null Correlation

In [None]:
[nullcor]
input: f"{cwd}/{model}.rds"
output: f"{cwd}/{model}.nullzcor.rds"
R: expand = "${ }"
    dat = readRDS(${_input:r})
    nullzcor = cor(dat$null.z)
    saveRDS(nullzcor, ${_output:r})

To run it:

```
for m in artificial_mixture_ukb ukb_bloodcells_mixture; do 
    sos run analysis/20201221_ukb_Prepare_ED_prior.ipynb nullcor --model $m -c midway2.yml -q midway2
done
```
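
As a sanity check on the idea behind `nullcor`: when the null z-scores share a residual correlation `V`, their sample correlation recovers it. This is a toy sketch only. The matrix `V`, the number of traits and the sample size are assumptions chosen for illustration, not values from the benchmark.

```r
set.seed(2)
n_null <- 5000

# Hypothetical residual correlation among 3 traits
V <- matrix(c(1.0, 0.5, 0.2,
              0.5, 1.0, 0.3,
              0.2, 0.3, 1.0), 3, 3)

# If E has iid N(0, 1) entries, the rows of E %*% chol(V) have covariance V
null_z <- matrix(rnorm(n_null * 3), n_null, 3) %*% chol(V)

# Same estimator as in the nullcor step
V_hat <- cor(null_z)
max_err <- max(abs(V_hat - V))
```

With one null SNP per region the workflow has only about 600 rows to estimate from, so the estimate there will be noisier than in this toy run.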

## FLASH mixture

In [None]:
[flash]
input: f"{cwd}/{model}.rds"
output: f"{cwd}/{model}.flash.rds"
R: expand = "${ }"
    library(flashr)
    dat = readRDS(${_input:r})
    f.d = flash_set_data(as.matrix(dat$strong.z))
    f = flashr::flash(f.d, greedy=TRUE, backfit = T)
    U.flash = c(mashr:::cov_from_factors(t(as.matrix(f$ldf$f)), "FLASH"),
                list("tFLASH" = t(f$fitted_values) %*% f$fitted_values / nrow(f$fitted_values)))
    saveRDS(U.flash, ${_output:r})

To run it:

```
for m in artificial_mixture_ukb ukb_bloodcells_mixture; do 
    sos run analysis/20201221_ukb_Prepare_ED_prior.ipynb flash --model $m -c midway2.yml -q midway2
done
```
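
To make the two kinds of FLASH-derived matrices concrete, here is a base-R sketch on a made-up factor matrix. It assumes `mashr:::cov_from_factors` builds one rank-1 matrix `f f'` per factor (row) and names them `<name>_<k>`; I have not re-checked that against the mashr source. The `factors` and `loadings` values are hypothetical.

```r
# Hypothetical factor matrix: 2 factors (rows) over 3 traits (columns)
factors <- rbind(c(1, 1, 0),
                 c(0, 0.5, 1))

# One rank-1 covariance f f' per factor
U_rank1 <- lapply(seq_len(nrow(factors)),
                  function(k) outer(factors[k, ], factors[k, ]))
names(U_rank1) <- paste0("FLASH_", seq_len(nrow(factors)))

# Hypothetical loadings: 4 samples (rows) on the 2 factors (columns)
loadings <- matrix(c( 2, 0,
                      1, 1,
                      0, 3,
                     -1, 2), ncol = 2, byrow = TRUE)

# Fitted values (samples x traits) and the dense "tFLASH"-style matrix
fitted <- loadings %*% factors
U_dense <- crossprod(fitted) / nrow(fitted)
```

The rank-1 matrices each describe a single sharing pattern across traits, while the dense matrix pools all factors, so its rank is at most the number of factors.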

## Run extreme deconvolution using `udr` or `mashr`

In [None]:
[edudr_1, edmash_1]
depends: R_library("udr")
parameter: npc = 3
input: f"{cwd}/{model}.rds", f"{cwd}/{model}.flash.rds"
output: f"{cwd}/{model}.FL_PC{npc}.rds"
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
    dat = readRDS(${_input[0]:r})
    vhat = cor(dat$null.z)
    # FLASH matrices
    U.flash = readRDS(${_input[1]:r})
    # SVD matrices
    # use the npc parameter rather than a hard-coded 3
    res.svd = svd(dat$strong.z, nv=${npc}, nu=${npc})
    f = res.svd$v
    rownames(f) = colnames(dat$strong.z)
    U.pca = mashr:::cov_from_factors(t(f), "PCA")
    # nrow keeps d a matrix even when npc = 1
    d = diag(res.svd$d[1:${npc}], nrow = ${npc})
    U.pca = c(U.pca, list("tPCA"= f %*% d^2 %*% t(f)/nrow(dat$strong.z)))
    # Empirical cov matrix
    Ulist = c(U.flash, U.pca, list("XX" = t(as.matrix(dat$strong.z)) %*% as.matrix(dat$strong.z) / nrow(dat$strong.z)))
    saveRDS(list(data = dat$strong.z, Ulist = Ulist, S = cor(dat$null.z)), ${_output:r})

In [None]:
[edudr_2]
output: f"{_input:n}.EDudr.rds"
task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 14, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
    library(udr) # udr commit 5265079 with changes to set lower bound on the eigenvalues
    dat = readRDS(${_input:r})
    # Denoised data-driven matrices
    f0 = ud_init(X = as.matrix(dat$data), V = dat$S, U_scaled = list(), U_unconstrained = dat$Ulist, n_rank1=0)
    res = ud_fit(f0, control = list(unconstrained.update = "ed", resid.update = 'none', maxiter=5000),
    verbose=FALSE)
    # format to input for simulation with DSC (current pipeline)
    saveRDS(list(U=res$U, w=res$w, loglik=res$loglik), ${_output:r}) 

In [None]:
[edmash_2]
output: f"{_input:n}.EDmash.rds"
task: trunk_workers = 1, walltime = '36h', trunk_size = 1, mem = '4G', cores = 14, tags = f'{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout"
    dat = readRDS(${_input:r})
    mashdata = mashr::mash_set_data(as.matrix(dat$data), Shat=1, V = dat$S)
    # Denoised data-driven matrices
    res = mashr:::bovy_wrapper(mashdata, dat$Ulist, logfile=${_output:nr}, tol = 1e-06)
    # format to input for simulation with DSC (current pipeline)
    saveRDS(list(U=res$Ulist, w=res$pi, loglik=scan("${_output:nn}.EDmash_loglike.log")), ${_output:r})

To run it:

```
sos run analysis/20201221_ukb_Prepare_ED_prior.ipynb edudr --model artificial_mixture_ukb -c midway2.yml -q midway2
sos run analysis/20201221_ukb_Prepare_ED_prior.ipynb edudr --model ukb_bloodcells_mixture -c midway2.yml -q midway2
sos run analysis/20201221_ukb_Prepare_ED_prior.ipynb edmash --model artificial_mixture_ukb -c midway2.yml -q midway2
sos run analysis/20201221_ukb_Prepare_ED_prior.ipynb edmash --model ukb_bloodcells_mixture -c midway2.yml -q midway2
```
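
For reference, both ED steps fit the mixture model `z_j ~ sum_k w_k N(0, U_k + V)` to the strong z-scores, with `V` the null correlation. Below is a base-R sketch of that log-likelihood, which can be handy for comparing two fitted priors on the same data. The helper names `ldmvnorm0` and `mixture_loglik` and the toy inputs are hypothetical, and I have not checked here that the saved `U` is a plain list of matrices in every version of `udr`.

```r
# Log density of N(0, Sigma) for each row of X, via a Cholesky factor
ldmvnorm0 <- function(X, Sigma) {
  R <- chol(Sigma)                            # Sigma = R'R
  y <- backsolve(R, t(X), transpose = TRUE)   # solves R'y = x for each row x
  -0.5 * colSums(y^2) - sum(log(diag(R))) - 0.5 * ncol(X) * log(2 * pi)
}

# Log-likelihood of rows of X under sum_k w[k] * N(0, Ulist[[k]] + V)
mixture_loglik <- function(X, Ulist, w, V) {
  X <- as.matrix(X)
  # n x K matrix of log(w_k) + log N(x_j; 0, U_k + V)
  lp <- do.call(cbind, lapply(seq_along(Ulist), function(k) {
    log(w[k]) + ldmvnorm0(X, Ulist[[k]] + V)
  }))
  m <- apply(lp, 1, max)                      # log-sum-exp for stability
  sum(m + log(rowSums(exp(lp - m))))
}

# Toy usage with made-up inputs
set.seed(3)
X <- matrix(rnorm(8), 4, 2)
V <- diag(2)
Ulist <- list(null = matrix(0, 2, 2), shared = matrix(1, 2, 2))
w <- c(0.7, 0.3)
ll <- mixture_loglik(X, Ulist, w, V)
```

Adding `V` to every component is what lets the prior matrices `U_k` describe effect sharing only, separate from the correlation induced by sample overlap.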