# M&M ASH benchmark VI

This is a continuation of Part V, where total PVE is set to 0.1 and each region is assumed to have 1 or 2 causal variables. This part adds an evaluation of lfsr per condition.

The most important difference from previous simulations is that here I mix and match: data simulated under one prior assumption are analyzed with each of the different priors. I expect to observe that:

1. The "oracle" prior (the one matching the simulation scenario) is mostly better than the other priors, in all scenarios.
2. The mixture prior generally performs well in all scenarios; it is robust to simulation assumptions.
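
The mix-and-match design can be pictured as a full grid of simulation scenarios crossed with analysis priors. The sketch below is only illustrative: the scenario and prior labels are taken from the result tables later in this notebook, and pairing the `mixture01` scenario with the `mixture_1` prior as its "oracle" is my assumption.

```r
# Labels as they appear in the `pattern` and `method` columns of the results
patterns <- c("high_het", "identity", "low_het", "mid_het",
              "mixture01", "shared", "singleton")
methods  <- c("high_het", "identity", "low_het", "mid_het",
              "mixture_1", "shared", "singleton")

# Every simulation scenario is analyzed with every prior
design <- expand.grid(pattern = patterns, method = methods,
                      stringsAsFactors = FALSE)

# Assumed oracle pairing: same label, except mixture01 <-> mixture_1
oracle_for <- setNames(methods, patterns)
design$oracle <- design$method == unname(oracle_for[design$pattern])

nrow(design)        # number of pattern-by-prior combinations
sum(design$oracle)  # number of oracle combinations (one per scenario)
```

Rows with `oracle == TRUE` are the matched analyses; all other rows are analyses under a mis-specified prior.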

## Conclusion

1. The expected observations above are both true, with some interesting exceptions:
    - The "oracle" mixture prior is not better than the mixture prior applied to some other scenarios. Overfitting of the mixture prior?
    - The singleton oracle performs badly.
2. Power table: model mis-specification results in overlaps, but there is no overlap issue with the mixture model.
3. Overlaps in singleton results are prevalent, as expected.
4. The mixture prior has great FDR control on CS.
5. The mixture prior has good lfsr control on effect estimates.

The benchmark was executed on UChicago midway:

```
./finemap.dsc --host mnm_R5.yml --R 5 -c 12
```

This executes the `default` pipeline in the `finemap.dsc` file, as of today (2019.02.04).

In [1]:
%cd ~/GIT/github/mnm-twas/dsc

/home/gaow/Documents/GIT/github/mnm-twas/dsc

In [2]:
start_time <- Sys.time()
library('dscrutils')
out = dscquery('finemap_output', "sharing_pattern mnm.eff_mode susie_scores.total susie_scores.valid susie_scores.size susie_scores.purity susie_scores.top susie_scores.n_causal susie_scores.included_causal susie_scores.overlap susie_scores.false_pos_cond_discoveries susie_scores.false_neg_cond_discoveries susie_scores.true_cond_discoveries", omit.file.columns = TRUE, verbose = FALSE)
end_time <- Sys.time()

In [3]:
end_time - start_time

Time difference of 13.34753 mins

In [4]:
head(out)

DSC,sharing_pattern,mnm,mnm.eff_mode,susie_scores.total,susie_scores.valid,susie_scores.size,susie_scores.purity,susie_scores.top,susie_scores.n_causal,susie_scores.included_causal,susie_scores.overlap,susie_scores.false_pos_cond_discoveries,susie_scores.false_neg_cond_discoveries,susie_scores.true_cond_discoveries
1,identity,mnm_identity,identity,2,1,16,0.9314858,0,1,1,0,3,2,5
1,identity,mnm_identity,identity,1,1,1,1.0,1,1,1,0,0,0,5
1,identity,mnm_identity,identity,2,2,5,0.9823716,0,2,2,0,0,0,10
1,identity,mnm_identity,identity,2,2,12,0.9753366,2,2,2,0,0,0,10
1,identity,mnm_identity,identity,3,3,9,0.9706318,1,3,3,0,0,0,15
1,identity,mnm_identity,identity,1,1,4,0.9939019,1,1,1,0,0,0,5


In [5]:
dim(out)

In [6]:
saveRDS(out, '../data/finemap_output.query_result.rds')

In [7]:
res = out[,c(2,4,5,6,7,8,9,10,11,12,13,14,15)]
colnames(res) = c('pattern', 'method', 'total', 'valid', 'size', 'purity', 'top_hit', 'total_true', 'total_true_included', 'overlap', 'false_positive_cross_cond', 'false_negative_cross_cond', 'true_positive_cross_cond')
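
Selecting columns by position works as long as the query keeps returning columns in the same order. A name-based alternative is less sensitive to that. It is sketched here on a toy data frame: the `toy` object and its values are made up for illustration, only the column names come from the query above, and only four of the columns are shown.

```r
# Map from query column name to the short name used in this notebook
rename_map <- c(
  "sharing_pattern"    = "pattern",
  "mnm.eff_mode"       = "method",
  "susie_scores.total" = "total",
  "susie_scores.valid" = "valid"
)

# Toy stand-in for `out`; values are invented
toy <- data.frame(DSC = 1, sharing_pattern = "identity", mnm = "mnm_identity",
                  mnm.eff_mode = "identity", susie_scores.total = 2,
                  susie_scores.valid = 1, stringsAsFactors = FALSE)

toy_res <- toy[, names(rename_map)]      # select by name, not position
colnames(toy_res) <- unname(rename_map)  # apply the short names
toy_res
```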

### Purity of CS

In [8]:
purity = aggregate(purity~pattern + method, res, mean)
purity

pattern,method,purity
high_het,high_het,0.9827047
identity,high_het,0.9842457
low_het,high_het,0.9841991
mid_het,high_het,0.9833802
mixture01,high_het,0.9365145
shared,high_het,0.9824573
singleton,high_het,0.8627706
high_het,identity,0.9823101
identity,identity,0.9841695
low_het,identity,0.9835865


In [9]:
aggregate(purity~method, purity, mean)

method,purity
high_het,0.9594674
identity,0.95855
low_het,0.9549486
mid_het,0.9581144
mixture_1,0.9598995
shared,0.8177715
singleton,0.8883366


### Size of CS

In [10]:
size = aggregate(size~pattern+method, res, median)
size

pattern,method,size
high_het,high_het,3.0
identity,high_het,3.5
low_het,high_het,3.5
mid_het,high_het,4.0
mixture01,high_het,5.0
shared,high_het,4.0
singleton,high_het,6.0
high_het,identity,3.0
identity,identity,3.5
low_het,identity,3.5


In [11]:
aggregate(size~method, size, mean)

method,size
high_het,4.142857
identity,4.142857
low_het,4.071429
mid_het,4.071429
mixture_1,3.964286
shared,3.714286
singleton,9.714286


### Power of CS

In [12]:
total_true_included = aggregate(total_true_included ~ pattern + method, res, sum)
total_true = aggregate(total_true ~ pattern + method, res, sum)
overlap = aggregate(overlap ~ pattern + method, res, mean)
power = merge(total_true_included, total_true, by = c("pattern", "method"))
power = merge(power, overlap,  by = c("pattern", "method"))
power$power = power$total_true_included/power$total_true
power = power[order(power$method),]
power

row,pattern,method,total_true_included,total_true,overlap,power
1,high_het,high_het,792,874,0.084,0.9061785
8,identity,high_het,793,856,0.024,0.9264019
15,low_het,high_het,786,857,0.078,0.9171529
22,mid_het,high_het,809,881,0.0,0.9182747
29,mixture01,high_het,706,849,0.212,0.8315665
36,shared,high_het,789,854,0.256,0.9238876
43,singleton,high_het,548,816,0.0,0.6715686
2,high_het,identity,791,874,0.138,0.9050343
9,identity,identity,794,856,0.024,0.9275701
16,low_het,identity,789,857,0.116,0.9206534


In [13]:
aggregate(power~method, power, mean)

method,power
high_het,0.8707187
identity,0.8708701
low_het,0.8658901
mid_het,0.8684465
mixture_1,0.8623861
shared,0.7389402
singleton,0.8202593

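Note that `aggregate(power ~ method, power, mean)` averages the per-scenario power values, which weights every scenario equally; it is not the same as pooling counts across scenarios. As a small check, the sketch below re-enters the seven `high_het` rows shown in the power table above and computes both quantities. The pooled version is only for comparison and is not used elsewhere in this notebook.

```r
# Counts for method == "high_het", copied from the power table above
hh <- data.frame(
  pattern = c("high_het", "identity", "low_het", "mid_het",
              "mixture01", "shared", "singleton"),
  total_true_included = c(792, 793, 786, 809, 706, 789, 548),
  total_true          = c(874, 856, 857, 881, 849, 854, 816)
)

# Mean of per-scenario power: what the aggregate() call reports
mean_of_ratios <- mean(hh$total_true_included / hh$total_true)

# Pooled power: all causal variables counted together
pooled <- sum(hh$total_true_included) / sum(hh$total_true)

round(c(mean_of_ratios = mean_of_ratios, pooled = pooled), 4)
```

The first value reproduces the 0.8707187 reported for `high_het`; the pooled value is slightly higher (about 0.872) because the low-power `singleton` scenario contributes fewer causal variables.
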

### FDR of CS

In [14]:
valid = aggregate(valid ~ pattern + method, res, sum)
total = aggregate(total ~ pattern + method, res, sum)
fdr = merge(valid, total, by = c("pattern", "method"))
fdr$fdr = (fdr$total - fdr$valid)/fdr$total
fdr = fdr[order(fdr$method),]
fdr

row,pattern,method,valid,total,fdr
1,high_het,high_het,787,868,0.09331797
8,identity,high_het,792,859,0.07799767
15,low_het,high_het,782,843,0.07236062
22,mid_het,high_het,800,875,0.08571429
29,mixture01,high_het,698,759,0.08036891
36,shared,high_het,788,856,0.07943925
43,singleton,high_het,540,603,0.10447761
2,high_het,identity,787,868,0.09331797
9,identity,identity,793,859,0.07683353
16,low_het,identity,786,844,0.06872038


In [15]:
aggregate(fdr~method, fdr, mean)

method,fdr
high_het,0.0848109
identity,0.08416001
low_het,0.07104787
mid_het,0.08282189
mixture_1,0.05729344
shared,0.04568834
singleton,0.06507271


### Power of per-signal, per-condition estimates

We compute lfsr on a per-signal, per-condition basis. An effect is called a signal in a condition if its lfsr is smaller than 0.05.
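
As an illustration of this rule, the sketch below thresholds a toy lfsr matrix at 0.05 and tallies true positives, false positives and false negatives against known effects. All numbers are invented, and the actual tallying in the `susie_scores` module may differ in detail (for example, in how sign errors or signals outside a CS are handled); this is only one plausible reading of the rule.

```r
# Toy example: 2 signals (rows) by 5 conditions (columns); values are invented
lfsr <- matrix(c(0.01,  0.02, 0.30, 0.04, 0.60,
                 0.001, 0.20, 0.03, 0.01, 0.02),
               nrow = 2, byrow = TRUE)
true_effect <- matrix(c(1.0, 0.8, 0.5, 0.0, 0.0,
                        0.7, 0.0, 0.9, 0.6, 0.4),
                      nrow = 2, byrow = TRUE)

called  <- lfsr < 0.05        # declared a signal in that condition
nonzero <- true_effect != 0   # truly has an effect in that condition

tp <- sum(called & nonzero)   # correctly called
fp <- sum(called & !nonzero)  # called, but the true effect is zero
fn <- sum(!called & nonzero)  # missed

# Same ratios as used for the power and FDR tables in this notebook
c(power = tp / (tp + fn), fdr = fp / (tp + fp))
```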

In [16]:
tp = aggregate(true_positive_cross_cond ~ pattern + method, res, sum)
fn = aggregate(false_negative_cross_cond ~ pattern + method, res, sum)
power = merge(tp, fn, by = c("pattern", "method"))

In [17]:
power$power = power$true_positive_cross_cond/(power$true_positive_cross_cond + power$false_negative_cross_cond)
power = power[order(power$method),]
power

row,pattern,method,true_positive_cross_cond,false_negative_cross_cond,power
1,high_het,high_het,3901,105,0.9737893
8,identity,high_het,3938,74,0.9815553
15,low_het,high_het,3895,64,0.9838343
22,mid_het,high_het,3967,99,0.9756517
29,mixture01,high_het,2890,51,0.982659
36,shared,high_het,3927,84,0.9790576
43,singleton,high_het,540,18,0.9677419
2,high_het,identity,3908,95,0.9762678
9,identity,identity,3941,79,0.9803483
16,low_het,identity,3906,71,0.9821473


In [18]:
aggregate(power~method, power, mean)

method,power
high_het,0.9777556
identity,0.976267
low_het,0.9854748
mid_het,0.9792159
mixture_1,0.9952224
shared,0.9234336
singleton,0.3064775


### FDR of per-signal, per-condition estimates


In [19]:
tp = aggregate(true_positive_cross_cond ~ pattern + method, res, sum)
fp = aggregate(false_positive_cross_cond ~ pattern + method, res, sum)
fdr = merge(tp, fp, by = c("pattern", "method"))
fdr$fdr = fdr$false_positive_cross_cond/(fdr$true_positive_cross_cond + fdr$false_positive_cross_cond)
fdr = fdr[order(fdr$method),]
fdr

row,pattern,method,true_positive_cross_cond,false_positive_cross_cond,fdr
1,high_het,high_het,3901,334,0.07886659
8,identity,high_het,3938,283,0.06704572
15,low_het,high_het,3895,256,0.06167189
22,mid_het,high_het,3967,309,0.0722638
29,mixture01,high_het,2890,275,0.08688784
36,shared,high_het,3927,269,0.06410867
43,singleton,high_het,540,329,0.37859609
2,high_het,identity,3908,337,0.07938751
9,identity,identity,3941,275,0.0652277
16,low_het,identity,3906,243,0.05856833


## Performance of effect size estimates

Open question: should this be the total number of true discoveries over the total number of signals to detect?