# M&M benchmark VIII

This benchmark uses the latest GTEx V8 genotype data and evaluates the pipeline in the presence of missing data.

1. The number of conditions is increased to $R=45$.
2. Missing data in expression are simulated according to the missingness pattern in the actual expression data across tissues; the `flashier::flash` method was used to compute the covariance of the response, which is then used as the residual covariance.
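
As a rough illustration of the second point, imposing an observed missingness pattern on simulated responses can be sketched as below. This is a toy sketch under assumptions, not the actual DSC simulation module: the matrices `Y_obs` and `Y_sim` and their dimensions are made up for illustration.

```r
set.seed(1)
# Toy "observed" expression matrix (samples x tissues) with missing entries.
Y_obs = matrix(rnorm(12), nrow = 4)
Y_obs[c(2, 7, 12)] = NA
# Simulated complete responses of the same dimension (hypothetical).
Y_sim = matrix(rnorm(12), nrow = 4)
# Impose the observed missingness pattern on the simulated data.
Y_sim[is.na(Y_obs)] = NA
Y_sim
```

The simulated matrix then has `NA` in exactly the positions where the observed data are missing.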

## Conclusion

Our pipeline with missing data has a high false positive rate even though the simulated residual correlation matrix is diagonal. This is not an issue with FLASH, because in this case FLASH gives a covariance estimate almost identical to simply using a diagonal matrix (I have compared some results manually).

## Next steps for this investigation

1. Figure out the problem (hopefully a bug) with missing data handling in `mmbr`.
2. Add a diagnostic function to compute between-CS correlation.
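
One possible shape for the diagnostic in step 2 is sketched below. The function `between_cs_correlation` is hypothetical and not part of `mmbr`; it assumes a genotype matrix `X` (samples by variables) and a list `cs` of column-index vectors, and it summarizes each pair of CS by the maximum absolute correlation between their members, which is only one of several reasonable choices.

```r
# Hypothetical helper (not an mmbr function): for every pair of CS, report the
# maximum absolute correlation between any variable in one and any in the other.
between_cs_correlation = function(X, cs) {
  k = length(cs)
  out = matrix(NA_real_, k, k)
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      r = abs(cor(X[, cs[[i]], drop = FALSE], X[, cs[[j]], drop = FALSE]))
      out[i, j] = max(r)
    }
  }
  out
}

# Toy data: column 4 nearly duplicates column 1, so CS 1 and CS 2 are correlated.
set.seed(1)
X = matrix(rnorm(100 * 6), 100, 6)
X[, 4] = X[, 1] + rnorm(100, sd = 0.1)
cs = list(c(1, 2), c(4, 5), c(6))
m = between_cs_correlation(X, cs)
round(m, 2)
```

A large off-diagonal entry would flag two CS that may be capturing the same signal.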

The benchmark is now under `dsc_mnm`, running on UChicago RCC midway:

```
./finemap.dsc --host mnm_dsc.yaml
```

This executes the `default` pipeline in the `finemap.dsc` file as of today (2019.11.08).

In [1]:
%cd ~/GIT/github/mnm-gtex-v8/dsc

/home/gaow/GIT/github/mnm-gtex-v8/dsc

In [9]:
start_time <- Sys.time()
out = dscrutils::dscquery('finemap_output', targets = c('simulate', 'mnm.missing_Y', 'susie_scores.total', 'susie_scores.valid', 'susie_scores.size', 'susie_scores.purity', 'susie_scores.top', 'susie_scores.n_causal', 'susie_scores.included_causal', 'susie_scores.overlap', 'susie_scores.false_pos_cond_discoveries', 'susie_scores.false_neg_cond_discoveries', 'susie_scores.true_cond_discoveries'), verbose = FALSE)
end_time <- Sys.time()

In [10]:
end_time - start_time

Time difference of 1.560561 secs

In [11]:
head(out)

DSC,simulate,mnm.missing_Y,susie_scores.total,susie_scores.valid,susie_scores.size,susie_scores.purity,susie_scores.top,susie_scores.n_causal,susie_scores.included_causal,susie_scores.overlap,susie_scores.false_pos_cond_discoveries,susie_scores.false_neg_cond_discoveries,susie_scores.true_cond_discoveries
<int>,<chr>,<lgl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<int>,<int>,<int>
1,mid_het,True,2,2,12.5,0.8793637,1,2,2,0,0,90,0
1,mid_het,True,3,3,15.0,0.9567131,1,3,2,15,0,50,85
1,mid_het,True,5,5,18.0,0.8402476,0,1,1,171,0,221,4
1,mid_het,True,7,7,153.0,0.815227,0,2,2,1506,0,262,53
1,mid_het,True,3,1,4.0,0.9407363,1,1,1,0,32,87,16
1,mid_het,True,9,0,1.0,1.0,0,1,0,0,375,30,0


In [12]:
dim(out)

In [13]:
saveRDS(out, '../data/finemap_output.20191108.rds')

In [17]:
res = out[,-1]
colnames(res) = c('pattern', 'missing', 'total', 'valid', 'size', 'purity', 'top_hit', 'total_true', 'total_true_included', 'overlap', 'false_positive_cross_cond', 'false_negative_cross_cond', 'true_positive_cross_cond')

### Purity of CS

In [18]:
purity = aggregate(purity~pattern + missing, res, mean)
purity

pattern,missing,purity
<chr>,<lgl>,<dbl>
mid_het,False,0.7191187
mid_het,True,0.8773249


### Size of CS

In [20]:
size = aggregate(size~pattern+missing, res, median)
size

pattern,missing,size
<chr>,<lgl>,<dbl>
mid_het,False,9
mid_het,True,9


### Power of CS

**Notice here that many CS overlap -- this is not what was observed with $R=5$.**

In [24]:
total_true_included = aggregate(total_true_included ~ pattern + missing, res, sum)
total_true = aggregate(total_true ~ pattern + missing, res, sum)
overlap = aggregate(overlap ~ pattern + missing, res, mean)
power = merge(total_true_included, total_true, by = c("pattern", "missing"))
power = merge(power, overlap,  by = c("pattern", "missing"))
power$power = power$total_true_included/power$total_true
power = power[order(power$missing),]
power

pattern,missing,total_true_included,total_true,overlap,power
<chr>,<lgl>,<int>,<int>,<dbl>,<dbl>
mid_het,False,128,173,67.25,0.7398844
mid_het,True,134,173,148.22,0.7745665


### FDR of CS

**The high FDR with missing data explains the seemingly high power, and is consistent with the observation that CS are "purer" with missing data.**

In [31]:
valid = aggregate(valid ~ pattern + missing, res, sum)
total = aggregate(total ~ pattern + missing, res, sum)
fdr = merge(valid, total, by = c("pattern", "missing"))
fdr$fdr = (fdr$total - fdr$valid)/fdr$total
fdr = fdr[order(fdr$missing),]
fdr

pattern,missing,valid,total,fdr
<chr>,<lgl>,<dbl>,<dbl>,<dbl>
mid_het,False,185,185,0.0
mid_het,True,224,298,0.2483221


### Power for per signal per condition estimates

We compute lfsr on a per-signal, per-condition basis. We call an effect a signal in a condition if its lfsr is smaller than 0.05.
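
The thresholding rule can be sketched on toy inputs as follows. This is an illustrative sketch only: `count_cond_discoveries`, the `lfsr` matrix, and the `is_nonzero` truth matrix are hypothetical, and the `susie_scores` module may compute its counts differently.

```r
# Hypothetical helper: count per-condition discoveries at an lfsr threshold.
# lfsr and is_nonzero are signals x conditions matrices.
count_cond_discoveries = function(lfsr, is_nonzero, threshold = 0.05) {
  called = lfsr < threshold
  c(true_positive  = sum(called & is_nonzero),
    false_positive = sum(called & !is_nonzero),
    false_negative = sum(!called & is_nonzero))
}

# Toy example: 2 signals x 3 conditions (matrices are filled column by column).
lfsr = matrix(c(0.01, 0.20, 0.03, 0.60, 0.04, 0.90), nrow = 2)
is_nonzero = matrix(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), nrow = 2)
counts = count_cond_discoveries(lfsr, is_nonzero)
counts
```

In this toy example the first signal is called in all three conditions (two correctly, one falsely) and the second signal is missed in the one condition where it has a true effect.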

In [27]:
tp = aggregate(true_positive_cross_cond ~ pattern + missing, res, sum)
fn = aggregate(false_negative_cross_cond ~ pattern + missing, res, sum)
power = merge(tp, fn, by = c("pattern", "missing"))

In [29]:
power$power = power$true_positive_cross_cond/(power$true_positive_cross_cond + power$false_negative_cross_cond)
power = power[order(power$missing),]
power

pattern,missing,true_positive_cross_cond,false_negative_cross_cond,power
<chr>,<lgl>,<int>,<int>,<dbl>
mid_het,False,4811,3514,0.5778979
mid_het,True,3670,8138,0.3108062


### FDR for per signal per condition estimates


In [30]:
tp = aggregate(true_positive_cross_cond ~ pattern + missing, res, sum)
fp = aggregate(false_positive_cross_cond ~ pattern + missing, res, sum)
fdr = merge(tp, fp, by = c("pattern", "missing"))
fdr$fdr = fdr$false_positive_cross_cond/(fdr$true_positive_cross_cond + fdr$false_positive_cross_cond)
fdr = fdr[order(fdr$missing),]
fdr

pattern,missing,true_positive_cross_cond,false_positive_cross_cond,fdr
<chr>,<lgl>,<int>,<int>,<dbl>
mid_het,False,4811,0,0.0
mid_het,True,3670,1602,0.3038695
