# Identify and extract interesting data-sets for vignettes

In [1]:
%revisions -s -n 10

| Revision | Author | Date | Message |
|---|---|---|---|
| 12fe363 | Gao Wang | 2018-06-27 | Add finemap 95% config filter |
| d557425 | Gao Wang | 2018-06-27 | Update documentation |


In [2]:
%cd ~/GIT/github/mvarbvs/dsc/susie_comparison/fit_susie

/home/gaow/Documents/GIT/github/mvarbvs/dsc/susie_comparison/fit_susie

In [None]:
[global]
parameter: outdir = path('./susie_comparison')
parameter: name = '20180710'

## A non-trivial show case example for fine-mapping vignette
Here I'd like to pick examples with enough power for susie to detect all 3 simulated signals, with a non-trivial yet reasonable number of iterations, set sizes and purity levels, to illustrate how susie works. In other words, this is meant to pick good susie showcases, not edge cases.

In [None]:
[A]
num_causal = 3
input: glob.glob(f'liter_data_*_summarize_ld_1_lm_less_{num_causal}_fit_susie_*.rds')[:200], group_by = 'single' 
R: expand = '${ }'
  res = readRDS(${_input:r})
  for (r in 1:2) {
    cs = susieR::susie_get_CS(res$posterior$alpha[[r]])$cs[1:${num_causal}]
    if (length(cs[[${num_causal}]]) < 20) {
      print(${_input:r})
      print(r)
      print(res$posterior$niter[r])
      print(cs)
      print("========")
    }
  }

I ended up choosing the first data-set in `liter_data_65_summarize_ld_1_lm_less_3_fit_susie_7.rds`, which seems interesting:

```
[[1]]
[1] 773 777

[[2]]
 [1] 360 361 362 365 368 372 373 374 379 381 383 384 386 387 388 389 391 392 396
[20] 397 398 399 400 401 403 404 405 407 408 415

[[3]]
[1] 653
```

It takes 6 iterations to complete.
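The screening rule applied in step `[A]` can be sketched in Python as follows. This is only an illustration, not the workflow's own code: `is_showcase` is a hypothetical helper, and unlike the R loop it assumes that a fit must report at least `num_causal` credible sets to qualify. The credible sets are the ones printed above.

```python
def is_showcase(cs, num_causal=3, max_size=20):
    """Keep a fit when its num_causal-th credible set has fewer than max_size variables."""
    # Assumption of this sketch: fits with fewer than num_causal sets are rejected.
    if len(cs) < num_causal:
        return False
    return len(cs[num_causal - 1]) < max_size


# Credible sets reported for the chosen data-set
cs = [
    [773, 777],
    [360, 361, 362, 365, 368, 372, 373, 374, 379, 381, 383, 384, 386, 387, 388,
     389, 391, 392, 396, 397, 398, 399, 400, 401, 403, 404, 405, 407, 408, 415],
    [653],
]

print([len(s) for s in cs])
print(is_showcase(cs))
```

Note that the rule only constrains the size of the third set, so the 30-variable second set does not prevent this data-set from being selected.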

In [2]:
dat = readRDS('liter_data_65_summarize_ld_1_lm_less_3_fit_susie_7.rds')
names(dat)

From `dat$DSC_DEBUG$script` I load the dataset of interest:

In [3]:
input <- dscrutils::load_inputs(c('../lm_less/liter_data_65_summarize_ld_1_lm_less_3.pkl'), dscrutils::read_dsc)

In [4]:
names(input)

Now I compute summary statistics and save them.

In [5]:
library(abind)
mm_regression = function(X, Y, Z=NULL) {
  if (!is.null(Z)) {
    Z = as.matrix(Z)
  }
  # Regress each column of Y on every column of X, one variable at a time
  reg = lapply(seq_len(ncol(Y)), function (i) simplify2array(susieR:::univariate_regression(X, Y[,i], Z)))
  # Stack the per-response results along a new leading dimension
  reg = do.call(abind, c(reg, list(along=0)))
  # return array: out[1,,] is betahat, out[2,,] is shat
  return(aperm(reg, c(3,2,1)))
}
sumstats = mm_regression(as.matrix(input$data$X), as.matrix(input$data$Y))
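For readers unfamiliar with these summary statistics, here is a minimal Python sketch of what a single-variable regression produces for one column of `X` and one column of `Y`: an effect estimate (`betahat`) and its standard error (`shat`). It assumes an ordinary least-squares fit with an intercept and no covariates `Z`; `simple_regression` and the toy numbers are made up for illustration and are not the `susieR` implementation.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y on x with an intercept; returns (betahat, shat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    betahat = sxy / sxx
    intercept = mean_y - betahat * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom
    rss = sum((yi - intercept - betahat * xi) ** 2 for xi, yi in zip(x, y))
    shat = math.sqrt(rss / (n - 2) / sxx)
    return betahat, shat


# Toy genotype dosages and phenotype values (illustrative only)
x = [0, 1, 2, 0, 1, 2]
y = [0.1, 0.9, 2.2, -0.2, 1.1, 1.9]
betahat, shat = simple_regression(x, y)
z = betahat / shat  # univariate z-score for this variable
print(betahat, shat, z)
```

The z-score `betahat / shat` is the per-variable statistic referred to when looking for the "top z-score".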

Since the data are private, I may have to remove column and row names from the data matrices:

In [13]:
names(input$data)

Okay, after checking the data details, there is nothing confidential to hide. We should be good.

In [11]:
saveRDS(list(data=input$data, sumstats = sumstats), '~/GIT/software/susieR/inst/data/N3finemapping.rds')

## A reasonably "difficult" case

We hope to show that SuSiE can deal with a reasonably difficult case: in a 2-CS setting, the top z-score results from the contribution of both CS. That is, a SNP is in weak LD with both of the 2 CS, so it shows the strongest z-score in univariate analysis but has a weak PIP and is not in any CS. We need information on:

- top z-score (summary stats)
- true effect (0 or 1)
- CS and PIP
- ideally 2 CS; true effects in the CS but the top z-score SNP not in any CS
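The criteria in this list can be combined into a single check, sketched below in Python. Everything here is an assumption made for illustration: `is_difficult_case` is a hypothetical helper, the PIP threshold `pip_cutoff` is arbitrary, indices are 0-based, and the toy values are invented rather than taken from any DSC result.

```python
def is_difficult_case(z, true_effect, pip, cs, n_cs=2, pip_cutoff=0.1):
    """True when the top z-score variable has a weak PIP and sits outside
    all credible sets, while every credible set holds a true effect."""
    # Index of the variable with the largest absolute z-score
    top = max(range(len(z)), key=lambda j: abs(z[j]))
    top_in_cs = any(top in s for s in cs)
    each_cs_has_signal = all(any(true_effect[j] == 1 for j in s) for s in cs)
    return (len(cs) == n_cs and not top_in_cs
            and pip[top] < pip_cutoff and each_cs_has_signal)


# Invented toy example: variable 2 has the top z-score but is not causal
z = [1.0, 4.0, 6.5, 4.2, 0.5]
true_effect = [0, 1, 0, 1, 0]
pip = [0.01, 0.90, 0.05, 0.85, 0.01]
cs = [[1], [3]]
print(is_difficult_case(z, true_effect, pip, cs))
```

A check like this could be applied to each fit to shortlist candidates, which would then still need to be inspected by eye.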

In [3]:
[B_1]
target = "liter_data.dataset lm_less.pve lm_less.n_signal get_sumstats fit_susie.estimate_residual_variance fit_susie.prior_var fit_susie plot_susie"
output: f'{outdir}/tutorial_{name}/result.RDS'
R: expand = '${ }'
    out = dscrutils::dscquery(${outdir:br}, target = "${target}", load.pkl = TRUE)
    saveRDS(out, ${_output:r})

Workflow can only be executed with magic %run or %sosrun.

In [None]:
[B_2]
pve = [0.2]
n = [3]
est_res = ['TRUE']
prior = [0.2]
combos = len(pve) * len(n) * len(est_res) * len(prior)
output_files = [f'{_input:d}/{x+1}.rds' for x in range(combos)]
input: for_each = ['pve', 'n', 'est_res', 'prior'], concurrent = True
output: output_files[_index]
R: expand = '${ }'

    get_combined = function(sub, dirname, ld_col) {
        out_files = sub[,c("fit_susie.output.file", "plot_susie.output.file")]
        combined = list(purity = NULL, lfsr = NULL, size = NULL, 
                        captures = NULL, total_captures = NULL, pip = NULL)
        for (i in 1:nrow(out_files)) {
            fit = readRDS(paste0(dirname, out_files[i,1], '.rds'))$posterior
            purity = readRDS(paste0(dirname, out_files[i,2], '.rds'))
            L = sub[i,"lm_less.n_signal"]
            for (r in 1:2) {
                #
                if (is.null(combined$purity)) combined$purity = purity$purity[[paste0('V',r)]][,ld_col]
                else combined$purity = cbind(combined$purity, purity$purity[[paste0('V',r)]][,ld_col])
                #
                if (is.null(combined$size)) combined$size = fit$n_in_CI[,r]
                else combined$size = cbind(combined$size, fit$n_in_CI[,r])
                #
                if (is.null(combined$lfsr)) combined$lfsr = fit$lfsr[,r]
                else combined$lfsr = cbind(combined$lfsr, fit$lfsr[,r])
                #
                if (is.null(combined$captures)) combined$captures = rowSums(purity$signal[[paste0('V',r)]])
                else combined$captures = cbind(combined$captures, rowSums(purity$signal[[paste0('V',r)]]))
                #
                is_pure = which(purity$purity[[paste0('V',r)]][,ld_col] > ${ld_cutoff})
                alpha = fit$alpha[[r]][is_pure,,drop=FALSE]
                if (dim(alpha)[1] == 0) {
                  pip = t(rep(0, dim(alpha)[2]))
                } else {
                  pip = t(1 - apply(1 - alpha, 2, prod))
                }
                if (is.null(combined$pip)) combined$pip = pip
                else combined$pip = cbind(combined$pip, pip)            
                #
                detected = apply(t(purity$signal[[paste0('V',r)]][is_pure,,drop=FALSE]), 1, sum)
                if (length(detected) < L) {
                  detected = c(detected, rep(0, L - length(detected)))
                }
                if (is.null(combined$total_captures)) combined$total_captures = detected
                else combined$total_captures = combined$total_captures + detected
            }
        }
        return(combined)
    }
    out = readRDS(${_input:r})
    sub = out[which(out$lm_less.pve == ${_pve} & out$lm_less.n_signal == ${_n} & out$fit_susie.estimate_residual_variance == ${_est_res} & out$fit_susie.prior_var == ${_prior}),]
    combined = get_combined(sub, "${outdir}/", ${ld_col})
    write(paste(${_pve}, ${_n}, ${_prior}, ${_est_res}, "${_output:n}.png", sep=','), file='${_output:n}.log')
    saveRDS(combined, ${_output:r})
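The PIP calculation inside `get_combined` keeps only the rows of `alpha` whose purity exceeds `ld_cutoff` and then computes, for each variable `j`, one minus the product over kept rows `l` of `1 - alpha[l][j]`. A minimal Python sketch of that step follows; `pip_from_alpha`, the cutoff of 0.5, and the toy matrix are illustrative assumptions (the value of `ld_cutoff` is not shown in this notebook).

```python
import math


def pip_from_alpha(alpha, purity, cutoff):
    """PIP per variable from the rows of alpha whose purity exceeds cutoff."""
    n_vars = len(alpha[0])
    kept = [row for row, pur in zip(alpha, purity) if pur > cutoff]
    # With no rows kept, the empty product is 1, so every PIP is 0,
    # which mirrors the special case handled in the R code.
    return [1 - math.prod(1 - row[j] for row in kept) for j in range(n_vars)]


# Toy example: 3 single-effect rows over 3 variables
alpha = [
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
purity = [0.95, 0.80, 0.10]  # the third row fails the purity filter
pip = pip_from_alpha(alpha, purity, cutoff=0.5)
print(pip)
```

Because the third row is filtered out, the third variable gets no credit from it even though that row puts half of its weight there.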