# M&M benchmark XI

This benchmark improves on the [previous one](https://gaow.github.io/mvarbvs/analysis/20191108_MNM_Benchmark.html) in the following respects.

1. Use both small ($R=5$) and large ($R=45$) simulations to check whether merely increasing the number of conditions degrades the results.
2. Simulate an even simpler setting: 1 effect, and 1 grid point for the effect covariance, so that the prior is no longer a mixture.
3. Analyze with L = 1, L = 2 and L = 10.
4. Assess CS overlap at both the variable and the CS level.
5. Evaluate both the EE and EZ models when computing BF for multivariate analysis.
6. Add an oracle residual covariance method.
7. For the residual variance, use the correlation from the FLASH method and scale it by the actual variance of $Y$.
8. Turn on ELBO computation and add a score to check for convergence, although for now we know that ELBO computation with missing data can be problematic. Note that this ELBO evaluation is only relevant when there is no missing data. For cases without missing data I still use convergence in PIP.
9. Increase the number of replicates to 500.
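
The rescaling in item 7 can be sketched as follows. This is only an illustration under assumptions: the correlation matrix `C` below is a made-up stand-in for the FLASH estimate, and `scale_cor_to_cov` is a hypothetical helper that is not part of the DSC code.

```r
# Hypothetical helper: turn a correlation matrix C into a covariance matrix
# whose diagonal matches the empirical variance of each column of Y.
scale_cor_to_cov = function(C, Y) {
  s = apply(Y, 2, sd, na.rm = TRUE)  # per-condition standard deviation
  C * outer(s, s)                    # V[i, j] = C[i, j] * s[i] * s[j]
}

set.seed(1)
Y = matrix(rnorm(200 * 3), 200, 3)   # toy phenotype matrix with R = 3 conditions
C = matrix(0.3, 3, 3); diag(C) = 1   # stand-in for an estimated correlation matrix
V = scale_cor_to_cov(C, Y)
```

By construction `cov2cor(V)` recovers `C` and `diag(V)` equals the column variances of `Y`, so the correlation structure and the scale are kept separate.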

## Conclusion

1. In the simple setting (a small number of effects, small enough L, and no missing data), increasing $R=5$ to $R=45$ does not seem to produce false positives for the EZ model.
    - As expected from the simulated scenario, increasing the number of conditions helps power because effects are shared in magnitude.
2. The EE model has high FDR when the model is mis-specified (L > 1)!
3. Overlapping CS occur for the EZ model and get worse as $R$ increases, even for `L=2` and a lower-power setting than in the previous simulation.
    - Using an oracle covariance does not seem to help.
4. FLASH-based covariance hurts power for the EZ model.
5. The EE model sometimes fails the ELBO convergence check even though the ELBO keeps increasing.

## Next steps for this investigation

1. Missing data: when ELBO computation is not involved and the residual covariance is simply diagonal, handling missing data should be really easy, so it is not clear how to check for that "bug".
2. What causes the FDR problem in the EE model?

The corresponding DSC code is from `c5d75a5`, and the benchmark can be reproduced as follows:

```
./finemap.dsc --host dsc_mnm.yml -o mnm_20191116
```

In [1]:
%cd ~/GIT/mvarbvs/dsc_mnm

/project2/mstephens/gaow/mvarbvs/dsc_mnm

In [5]:
out = dscrutils::dscquery('mnm_20191116', targets = c('simulate.n_traits', 'mnm.resid_method', 'mnm.missing_Y', 'mnm.alpha', 'mnm.L', 'susie_scores', 'susie_scores.total', 'susie_scores.valid', 'susie_scores.size', 'susie_scores.purity', 'susie_scores.top', 'susie_scores.n_causal', 'susie_scores.included_causal', 'susie_scores.overlap_var', 'susie_scores.overlap_cs','susie_scores.false_pos_cond_discoveries', 'susie_scores.false_neg_cond_discoveries', 'susie_scores.true_cond_discoveries', 'susie_scores.converged'),
                                          module.output.files = "susie_scores", verbose = FALSE)

In [6]:
head(out)

| DSC | simulate.n_traits | mnm.resid_method | mnm.missing_Y | mnm.alpha | mnm.L | susie_scores.total | susie_scores.valid | susie_scores.size | susie_scores.purity | susie_scores.top | susie_scores.n_causal | susie_scores.included_causal | susie_scores.overlap_var | susie_scores.overlap_cs | susie_scores.false_pos_cond_discoveries | susie_scores.false_neg_cond_discoveries | susie_scores.true_cond_discoveries | susie_scores.converged | susie_scores.output.file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | oracle | True | 0 | 1 | 1 | 1 | 2 | 0.9852849 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | True | susie_scores/full_data_1_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1 |
| 1 | 5 | oracle | True | 0 | 1 | 1 | 1 | 1 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | True | susie_scores/full_data_2_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1 |
| 1 | 5 | oracle | True | 0 | 1 | 1 | 1 | 1 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | True | susie_scores/full_data_3_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1 |
| 1 | 5 | oracle | True | 0 | 1 | 1 | 1 | 43 | 0.9899983 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | True | susie_scores/full_data_4_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1 |
| 1 | 5 | oracle | True | 0 | 1 | 1 | 1 | 23 | 0.9863327 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | True | susie_scores/full_data_5_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1 |
| 1 | 5 | oracle | True | 0 | 1 | 1 | 1 | 9 | 0.9216036 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | True | susie_scores/full_data_6_high_het_1_oracle_generator_1_mnm_high_het_1_susie_scores_1 |


In [7]:
dim(out)

In [8]:
saveRDS(out, '../data/finemap_output.20191116.rds')

In [6]:
res = out[,-1]
colnames(res) = c('n_traits', 'resid_method', 'missing', 'EZ_model', 'L', 'total', 'valid', 'size', 'purity', 'top_hit', 'total_true', 'total_true_included', 'overlap_var', 'overlap_cs', 'false_positive_cross_cond', 'false_negative_cross_cond', 'true_positive_cross_cond', 'elbo_converged', 'filename')

### Purity of CS

Purity is higher for $R=45$ simply because of higher power, since there is no FDR issue in this simulation.
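
For reference, the purity of a CS is usually defined in SuSiE as the smallest absolute pairwise correlation among the variables it contains. A minimal sketch of that definition, assuming a genotype matrix `X` and a vector of column indices `cs` (these names and the helper `cs_purity` are illustrative and not taken from the benchmark code):

```r
# Hypothetical helper: minimum absolute pairwise correlation within a CS.
cs_purity = function(X, cs) {
  if (length(cs) == 1) return(1)       # a single-variable CS is perfectly pure
  R = abs(cor(X[, cs, drop = FALSE]))  # absolute correlations among CS members
  min(R[upper.tri(R)])                 # smallest off-diagonal entry
}

set.seed(1)
x1 = rnorm(100)
X = cbind(x1, x1 + rnorm(100, sd = 0.1), rnorm(100))  # toy data
cs_purity(X, c(1, 2))  # high: columns 1 and 2 are nearly collinear
cs_purity(X, c(1, 3))  # low: columns 1 and 3 are independent draws
```

This is consistent with the `head(out)` rows above, where every CS of size 1 has purity 1.0.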

In [8]:
purity = aggregate(purity~n_traits + resid_method + missing + EZ_model + L, res, mean)
purity = purity[which(purity$missing==FALSE),-3]
purity = purity[order(purity$n_traits),]
purity

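| | n_traits | resid_method | EZ_model | L | purity |
|---|---|---|---|---|---|
| 1 | 5 | diag | 0 | 1 | 0.995564192 |
| 3 | 5 | flash | 0 | 1 | 0.993308992 |
| 5 | 5 | oracle | 0 | 1 | 0.996260685 |
| 13 | 5 | diag | 1 | 1 | 0.898551466 |
| 15 | 5 | flash | 1 | 1 | 0.493463519 |
| 17 | 5 | oracle | 1 | 1 | 0.91837814 |
| 25 | 5 | diag | 0 | 2 | 0.99549806 |
| 27 | 5 | flash | 0 | 2 | 0.993060009 |
| 29 | 5 | oracle | 0 | 2 | 0.995250958 |
| 37 | 5 | diag | 1 | 2 | 0.859395104 |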


### Power of CS

Focus on $L = 2$ to evaluate CS overlap. In this case there are still overlaps between CS, but not as many as with $L=10$. The overlap gets worse as $R$ increases.

In [14]:
total_true_included = aggregate(total_true_included ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
total_true = aggregate(total_true ~  n_traits + resid_method + missing + EZ_model + L, res, sum)
cs_overlap = aggregate(overlap_cs ~  n_traits + resid_method + missing + EZ_model + L, res, sum)
snp_overlap = aggregate(overlap_var ~  n_traits + resid_method + missing + EZ_model + L, res, sum)
power = merge(total_true_included, total_true, by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
power = merge(power, cs_overlap,  by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
power = merge(power, snp_overlap,  by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
power$power = round(power$total_true_included/power$total_true,3)
power$overlap_cs = round(power$overlap_cs, 3)
power$overlap_var = round(power$overlap_var, 3)
power = power[which(power$missing==FALSE),-3]
# Sort by EZ_model, then L, then n_traits (same result as three successive stable sorts)
power = power[order(power$EZ_model, power$L, power$n_traits),]
#power = power[order(power$missing),]
power

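| | n_traits | resid_method | EZ_model | L | total_true_included | total_true | overlap_cs | overlap_var | power |
|---|---|---|---|---|---|---|---|---|---|
| 37 | 5 | diag | 0 | 1 | 495 | 500 | 0 | 0 | 0.99 |
| 49 | 5 | flash | 0 | 1 | 499 | 500 | 0 | 0 | 0.998 |
| 61 | 5 | oracle | 0 | 1 | 492 | 500 | 0 | 0 | 0.984 |
| 1 | 45 | diag | 0 | 1 | 498 | 500 | 0 | 0 | 0.996 |
| 13 | 45 | flash | 0 | 1 | 499 | 500 | 0 | 0 | 0.998 |
| 25 | 45 | oracle | 0 | 1 | 498 | 500 | 0 | 0 | 0.996 |
| 39 | 5 | diag | 0 | 2 | 495 | 500 | 0 | 0 | 0.99 |
| 51 | 5 | flash | 0 | 2 | 499 | 500 | 0 | 0 | 0.998 |
| 63 | 5 | oracle | 0 | 2 | 492 | 500 | 0 | 0 | 0.984 |
| 3 | 45 | diag | 0 | 2 | 498 | 500 | 0 | 0 | 0.996 |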
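
The repeated `aggregate()` and `merge()` calls in the cell above could also be collapsed into one call by putting `cbind(...)` on the left-hand side of the formula. The sketch below uses a small made-up data frame `toy` in place of `res`, so its numbers are purely illustrative.

```r
# Toy stand-in for `res`; the values are made up for illustration.
toy = data.frame(
  n_traits            = c(5, 5, 45, 45),
  L                   = c(2, 2, 2, 2),
  total_true_included = c(1, 0, 1, 1),
  total_true          = c(1, 1, 1, 1),
  overlap_cs          = c(0, 1, 0, 2)
)

# One aggregate() call sums several columns per group, so no merge() is needed.
agg = aggregate(cbind(total_true_included, total_true, overlap_cs) ~ n_traits + L,
                toy, sum)
agg$power = round(agg$total_true_included / agg$total_true, 3)
agg
```

One caveat: with the formula interface the default `na.action` drops any row that has an `NA` in any of the listed columns, so the combined call can differ from separate per-column calls when missing values fall in different rows.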


### FDR of CS without missing data

In [20]:
valid = aggregate(valid ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
total = aggregate(total ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
fdr = merge(valid, total, by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
fdr$fdr = round((fdr$total - fdr$valid)/fdr$total,3)
fdr = fdr[which(fdr$missing==FALSE),-3]
fdr = fdr[order(fdr$n_traits),]
fdr

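| | n_traits | resid_method | EZ_model | L | valid | total | fdr |
|---|---|---|---|---|---|---|---|
| 37 | 5 | diag | 0 | 1 | 495 | 500 | 0.01 |
| 38 | 5 | diag | 0 | 10 | 495 | 568 | 0.129 |
| 39 | 5 | diag | 0 | 2 | 495 | 545 | 0.092 |
| 40 | 5 | diag | 1 | 1 | 495 | 495 | 0.0 |
| 41 | 5 | diag | 1 | 10 | 23 | 23 | 0.0 |
| 42 | 5 | diag | 1 | 2 | 492 | 492 | 0.0 |
| 49 | 5 | flash | 0 | 1 | 499 | 500 | 0.002 |
| 50 | 5 | flash | 0 | 10 | 499 | 577 | 0.135 |
| 51 | 5 | flash | 0 | 2 | 499 | 543 | 0.081 |
| 52 | 5 | flash | 1 | 1 | 291 | 291 | 0.0 |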


### FDR of CS with missing data

In [21]:
valid = aggregate(valid ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
total = aggregate(total ~ n_traits + resid_method + missing + EZ_model + L, res, sum)
fdr = merge(valid, total, by = c( 'n_traits' , 'resid_method' , 'missing' , 'EZ_model', 'L'))
fdr$fdr = round((fdr$total - fdr$valid)/fdr$total,3)
fdr = fdr[which(fdr$missing==TRUE),-3]
fdr = fdr[order(fdr$n_traits),]
fdr

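| | n_traits | resid_method | EZ_model | L | valid | total | fdr |
|---|---|---|---|---|---|---|---|
| 43 | 5 | diag | 0 | 1 | 467 | 485 | 0.037 |
| 44 | 5 | diag | 0 | 10 | 484 | 748 | 0.353 |
| 45 | 5 | diag | 0 | 2 | 480 | 561 | 0.144 |
| 46 | 5 | diag | 1 | 1 | 179 | 184 | 0.027 |
| 47 | 5 | diag | 1 | 10 | 0 | 16 | 1.0 |
| 48 | 5 | diag | 1 | 2 | 111 | 118 | 0.059 |
| 55 | 5 | flash | 0 | 1 | 467 | 485 | 0.037 |
| 56 | 5 | flash | 0 | 10 | 484 | 748 | 0.353 |
| 57 | 5 | flash | 0 | 2 | 480 | 561 | 0.144 |
| 58 | 5 | flash | 1 | 1 | 170 | 175 | 0.029 |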


## Convergence

Convergence is assessed by ELBO. In principle all runs should converge by ELBO. If a run is flagged as not converged, it means the ELBO is not non-decreasing.

Only $L>1$ is relevant here. Without missing data the runs do converge with respect to ELBO.

In [18]:
elbo_converged = aggregate(elbo_converged~n_traits + resid_method + missing +  EZ_model + L, res, mean)
#elbo_converged = elbo_converged[which(elbo_converged$missing==FALSE),-3]
elbo_converged = elbo_converged[which(elbo_converged$L!=1),]
elbo_converged = elbo_converged[order(elbo_converged$n_traits),]
elbo_converged

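| | n_traits | resid_method | missing | EZ_model | L | elbo_converged |
|---|---|---|---|---|---|---|
| 25 | 5 | diag | False | 0 | 2 | 0.994 |
| 27 | 5 | flash | False | 0 | 2 | 1.0 |
| 29 | 5 | oracle | False | 0 | 2 | 0.998 |
| 31 | 5 | diag | True | 0 | 2 | 1.0 |
| 33 | 5 | flash | True | 0 | 2 | 1.0 |
| 35 | 5 | oracle | True | 0 | 2 | 1.0 |
| 37 | 5 | diag | False | 1 | 2 | 1.0 |
| 39 | 5 | flash | False | 1 | 2 | 1.0 |
| 41 | 5 | oracle | False | 1 | 2 | 1.0 |
| 43 | 5 | diag | True | 1 | 2 | 1.0 |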


The convergence issue for the EE model: the ELBO is still increasing, but the model did not converge within 100 iterations.
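
The two failure modes (an ELBO that decreases versus an ELBO that is still rising when the iteration cap is reached) can be told apart from the ELBO trace alone. A minimal sketch, assuming the trace is available as a numeric vector; the helper `elbo_status` and the tolerance `tol = 1e-3` are assumptions for illustration, not the score used in this benchmark.

```r
# Hypothetical helper: classify an ELBO trace (one value per iteration).
elbo_status = function(elbo, tol = 1e-3) {
  stopifnot(length(elbo) >= 2)
  d = diff(elbo)                                    # change between iterations
  if (any(d < -tol)) return("decreased")            # monotonicity is violated
  if (tail(d, 1) > tol) return("still_increasing")  # stopped before a plateau
  "converged"
}

elbo_status(c(-120, -110, -105, -104.9999))  # plateau reached
elbo_status(c(-120, -110, -105, -101))       # still climbing at the last iteration
elbo_status(c(-120, -110, -112, -111))       # ELBO went down at some step
```

Under this classification, the EE runs described above would fall in the `"still_increasing"` category rather than `"decreased"`.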