# Meetings

## Project meeting 20170615
With Matthew on simulation results

* Analyze with ASH and see what to exploit
* Try simulation with very big $\sigma$ and observe the distribution of estimates

Diagnostics:
* Distribution of effects
* Use averaged CDF: see ASH paper
* Simulate scenarios where there are qualitative differences and see if mr-ash can capture them
* Look at estimate of $\sigma^2$

Comparison with other methods:

* Simulate from a two-component normal mixture and compare with GEMMA
* Compare with rss-ash

## Project meeting 20170518

With Matthew on issues with GTEx V7 preprocessing.

Questions:
* Normalization: why qnorm? should I standardize it? should I do per-tissue qnorm then standardize them?
* PEER: should I do per-tissue or altogether? If altogether should I add tissue as a covariate? 
   * Variable number of PEER with GTEx V6 -- what's the deal with V7 now?
* Analysis: should I remove PEER + covariates first for each gene and then run mr-ash?
* Any additional concerns with imputed genotypes?

Feedback:
* We do quantile normalization because it is the most robust, and because our model assumes normality.
  * Potential issue with RPKM: if some genes are extremely highly expressed they will affect other genes' RPKM.
* There is no good answer to whether or not we should adopt an alternative strategy. We should stick to one defensible strategy, and the easiest to defend is the [official GTEx guideline](https://gtexportal.org/home/documentationPage#staticTextAnalysisMethods).
* We can use multiple regression to remove covariate effects first and work with the residuals. The potential issue is that we may over-correct, but the gain in power is well worth the possible loss. Just make sure we keep P << N so that the change in degrees of freedom is not that big. The official GTEx procedure has some guidelines on how many PEER factors to use.
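
A minimal sketch of the two preprocessing steps discussed above, on toy data. It assumes that "qnorm" means a per-gene rank-based transform to standard normal quantiles, and that covariates (e.g. PEER factors) are removed by ordinary least squares; the function names, sample sizes and simulated data are all hypothetical, and this is not the official GTEx pipeline.

```python
from statistics import NormalDist
import numpy as np

def inverse_normal_transform(x):
    """Map values to standard normal quantiles by rank (ties are not handled)."""
    ranks = np.argsort(np.argsort(x)) + 1          # ranks 1..n
    probs = ranks / (len(x) + 1.0)                 # strictly inside (0, 1)
    return np.array([NormalDist().inv_cdf(p) for p in probs])

def residualize(y, covariates):
    """Remove covariate effects from y by least squares (intercept included)."""
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

rng = np.random.default_rng(0)
n_samples, n_covariates = 200, 15                  # keep n_covariates << n_samples
covariates = rng.normal(size=(n_samples, n_covariates))
# Skewed toy expression for one gene, partly driven by the first covariate.
expression = np.exp(rng.normal(size=n_samples) + covariates[:, 0])

normalized = inverse_normal_transform(expression)
residual = residualize(normalized, covariates)     # input to the per-gene analysis
```

Each covariate removed costs one residual degree of freedom, which is why the P << N condition matters.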


## Project meeting 20170502
With Matthew and Wei. We discussed mostly the fine-mapping step of m&m, and some mr-ash related issues. First and foremost, we made it clear that the focus of m&m should be restricted to eQTL fine mapping for now, because it is an important problem that has not been answered in the multivariate framework we propose.

Although we are interested in fine mapping eventually, it is perhaps premature to make a very concrete plan until we see the data analysis outcome from Step 2, the mash step (this comment was made in response to my initial request to finalize the fine mapping MCMC algorithm; instead we brainstormed on what can possibly be done to perform fine mapping).

### How to summarize fine mapping results?

Sparse vs non-sparse result: 

* If we adopt a non-sparse model it is not a good idea to evaluate whether $\beta=0$; we may want to look at correctness of the sign instead.
* Results under a sparse model are potentially computationally easy; for an example see the Wen 2016 DAP paper. Or we can even look at a one-eQTL-per-gene model, or a two-eQTL-per-gene model (thus choosing 2 out of ~2000 cis SNPs), at the fine-mapping step.
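
For scale, taking the rough figure of 2000 cis SNPs per gene at face value, the number of candidate pairs in the two-eQTL model is

$$\binom{2000}{2} = \frac{2000 \times 1999}{2} = 1{,}999{,}000,$$

that is, about two million configurations per gene, compared with 2000 under the one-eQTL model.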

What should be the output of fine mapping?
* One version is to output each SNP's LD with the eQTL
* We can also output the 95% CI for the LD estimate with eQTL and see if it covers complete LD
  * But which one is the eQTL?
* Or we can output a set of SNPs that includes the eQTL with 95% confidence (see Eskin)
* We want to show, for a given SNP, what the nearest eQTL is: what is the distance of this SNP to the nearest eQTL? What is the highest LD between this SNP and an eQTL?
* Given MCMC result, how do we summarize it? 
  * What should we do if we want to use the results for data integration?
  * We may want a mode, not a flat PIP from MCMC
  * We can take samples from the posterior, then for each sample perform the integration analysis, and assess the variance / sensitivity of the results (how robust the conclusion is). Then somehow combine the results. This is similar to the idea of multiple imputation.
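
One of the output options above, the 95% set, can be sketched as follows. This assumes a single eQTL per gene and a vector of per-SNP posterior probabilities that sums to 1; the function name and the toy probabilities are hypothetical, and this is not necessarily the construction used in the Eskin work mentioned above.

```python
import numpy as np

def credible_set(posterior, coverage=0.95):
    """Indices of the smallest SNP set whose posterior mass reaches `coverage`.

    posterior[i] is the posterior probability that SNP i is the single eQTL.
    """
    posterior = np.asarray(posterior, dtype=float)
    order = np.argsort(-posterior)                 # most probable SNP first
    cumulative = np.cumsum(posterior[order])
    # First position at which the running total reaches the target coverage.
    n_keep = int(np.searchsorted(cumulative, coverage)) + 1
    return order[:n_keep]

posterior = np.array([0.02, 0.55, 0.30, 0.08, 0.03, 0.02])   # toy values
cs = credible_set(posterior)
mass = posterior[cs].sum()                         # posterior mass captured by the set
```

A small set indicates a well-localized signal; a large set indicates that LD leaves many SNPs indistinguishable.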

What can we learn directly from Step 2 the mash step?
* If we go a bit further and compute the lfsr and posterior mean of effects, we can get the "mode" candidate for the eQTL, i.e., we can then perform fine mapping around the mode.

It is also perhaps too early to form any meaningful vision of what to do with fine mapping before looking into the data:
* How uncertain are the results we'll end up getting? Maybe LD is not a big issue in many of the new eQTLs we identify, which by itself would be exciting results.

### Where do we stop for mr-ash application
We want to start the mr-ash paper by saying that there is great interest in introducing sparsity in regression, and recently there is a method called ash that introduces it in a "smart" way, and how it is relatively straightforward to use the ash idea in the context of regression.

The strengths of mr-ash are its computational efficiency (VEM) and its flexibility (ASH compared to spike-and-slab), but there are disadvantages (PIP is too concentrated, see Carbonetto 2012).

In the data application we can show how different the distribution of eQTL effect sizes is across genes. We can focus on a single tissue, fit a separate mr-ash for each gene, and comment on interesting patterns that emerge; or we can use meta-analysis across multiple tissues on genes of interest, if there is not enough power from single tissue analysis.

## Project meeting 20170417
Meeting with Matthew. We went through the m&m procedure on Overleaf, revisited issue 8 on GitHub and talked about next steps on data analysis + simulations. The discussion led to minor changes in the Overleaf write-up.

Additionally we decided that the next step should be getting *Step 1* done, i.e., mr-ash on GTEx data. We will start with analyzing GTEx V6 and verify against the mash result, then move on to V7 data. *Step 1* would be an interesting application by itself as it is some form of univariate fine-mapping.


## Project meeting 20170330
Meeting with Matthew and Wei, to revive the project by looking at what we have and what remains to be done.

### Tentative agenda

#### Connected work

* varbvs
* rss
* ash
* mash
* mrash
* BMASS

### Questions

* How can m&m ash generalize all these works?
  * We have to think carefully about what to incorporate in the generalized framework, and how to incorporate it
  * In particular how can we combine mrash / rss + mash?
* Do we start from summary statistics or full data?
* Do we need MCMC in addition to VEM?

Implementation-wise, shall we not write any code until we finalize how the generalized framework is formulated? We should think "modularly" and contribute directly to other modules whenever possible, then build m&m ash from these modules.

### What to do with `mrash` as a standalone work?
If we start from full data then finalizing `mrash` is a natural first step. It is then just a discussion of whether to create a separate package or to make it part of varbvs.

### Minutes
The meeting outlined the approach we take towards a modularized m&m. Most items on the agenda have been covered. See [this document](Modular_MNMASH.html) for details.

## Project meeting 20161103
Meeting with Matthew. We started with a recap of the motivation of the project, then discussed the M&M ASH model with practical considerations.

### Motivation
The M&M ASH model is motivated by what we noticed in the MASH project: the effect of a SNP (eQTL) can be positive in one tissue yet negative in another. This bothers us. We suspect this type of observation is most likely due to negative LD between two causal SNPs, each having a positive effect in a different tissue; under the one-eQTL-per-gene assumption made in MASH, we would then observe opposite effects. So if we assume SNPs are independent in association analysis, the $\hat{\beta}$ we obtain is confounded by LD with all other SNPs.
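
The suspected mechanism can be illustrated with a toy simulation. With standardized genotypes, the marginal slope for SNP 1 in a tissue where only SNP 2 is causal has expectation $r \beta_2$, where $r$ is the correlation (LD) between the two SNPs, so a negative $r$ flips the sign. The sketch below uses Gaussian stand-ins for genotypes; the LD value, effect sizes and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
r = -0.6                                   # assumed LD (correlation) between the two SNPs

# Two standardized "genotypes" with correlation r (Gaussian stand-ins for SNPs).
x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)

# SNP 1 acts only in tissue A, SNP 2 only in tissue B; both effects are positive.
y_a = 0.5 * x1 + rng.normal(size=n)
y_b = 0.5 * x2 + rng.normal(size=n)

def marginal_effect(x, y):
    """Single-SNP regression slope, ignoring all other SNPs."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

effect_a = marginal_effect(x1, y_a)        # expectation 0.5
effect_b = marginal_effect(x1, y_b)        # expectation r * 0.5 = -0.3
```

Looking at SNP 1 alone, one would conclude it has opposite effects in the two tissues, even though neither causal SNP has a negative effect.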

Let's consider univariate association analysis for a moment. Because of LD, $g(.)$, the distribution of $\beta$ we estimate via univariate methods, would have long tails. In other words, $g(.)$ is inflated by LD with other SNPs. Estimates of $g(.)$ from multiple regression with an ASH prior via variational EM (currently called MVASH) will not have this problem. However, when we want to make inference on the effect size $\beta$, there will be an identifiability issue with MVASH: VEM can reach a local optimum, and the SNP to which it assigns an effect may not be the SNP that in fact has an effect. The solution to this problem is to use MCMC for fine mapping on regions selected via VEM. A hybrid approach is to estimate hyper-parameters via VEM and use MCMC to sample from the posterior.

Now to solve the same issue in the context of multivariate regression, we propose the M&M ASH model, which applies multiple regression using ASH prior on multiple responses. David Gerard has derived a VEM procedure for the M&M model. Assumptions in David's derivations are: 

* The residual variance of genes (after regressing out eQTL effects) has a low rank + diagonal structure
* There will be missing data in the response matrix
* The mixture proportion can be estimated per test, or be estimated jointly for all tests

### M&M ASH with diagonal residual covariance structure

Matthew suggests we make this model simpler and make sure it works. For starters we should ignore correlation among tissues. That is, we assume the residual variance is a diagonal matrix. Here are a few reasons why we should start with diagonal, and why at least as a first pass we should not make a non-diagonal assumption in M&M ASH:

* We are not sure yet if correlated residuals will cause a problem for our inference, unless we can show it empirically: we should find real data examples where correlation between tissues is due to correlation between genes, not due to similarity of tissues. This would raise a red flag that we should model such correlations.
* Even if the problem is confirmed we should use the MASH model to show we can solve it, before incorporating the solution into M&M ASH. As the MASH model is simpler, it will give us an assessment from real data quickly, and we can then decide if it is worth pursuing the fix in M&M ASH.
* To do it in MASH we should assume this residual correlation is the same as the tissues' correlations (the eQTL effect is relatively small) and estimate the 44 by 44 covariance matrix directly from expression data. This is not a trivial problem; many methods get estimates that shrink the structure to diagonal. But sparse factor analysis methods can be a good technique to do this, as shown by Wei's work. We then plug this estimate into the MASH model.
  * The advantage of this approach (over making inference jointly, as David has done for M&M ASH) is that it is modular and we can choose a good method (such as SFA, FLASH) for this step of inference. The method may be biased (ignoring the impact of eQTL) but has lower variance.
* The problem with this approach is that if eQTLs induce correlation we will wrongly believe there is residual covariance when in fact there is not. That is, after removing the effect of eQTLs the residual covariance is diagonal. This observation would favor the joint approach over the modular approach. To assess if this is a problem, we can choose genes with a large covariance matrix, remove the effect of the top eQTL, and then see if the residual covariance matrix still retains correlations or is mostly diagonal.

### Next steps
We should start with the simplest version (that residual covariance is diagonal) and make it work. The hard part is computation. Using summary data whenever possible may help with computation. Additionally, in updating mixture components we can use noisier estimates, that is, estimates from randomly sampled $\hat{\beta}$ instead of the full 20K genes $\times$ 1000 SNPs $\times$ 50 conditions data points. We will have our next meeting (David and Gao with Matthew) after we get this simple version to work in practice.
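
The subsampling idea can be sketched in the simplest univariate setting: an EM update for the weights of a zero-mean normal mixture with a fixed grid of component standard deviations, run on a random subset of the $\hat{\beta}$ values. This is only an illustration of the principle under assumed settings (grid, standard error, sample sizes and function names are all hypothetical); it is not the M&M ASH update itself.

```python
import numpy as np

def normal_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def em_mixture_weights(betahat, se, grid_sd, n_iter=200):
    """EM for the weights of a zero-mean normal mixture with fixed component SDs."""
    # Marginal likelihood of each observation under each component:
    # betahat_i ~ N(0, grid_sd_k^2 + se^2).
    marginal_sd = np.sqrt(grid_sd[None, :] ** 2 + se ** 2)
    lik = normal_pdf(betahat[:, None], marginal_sd)
    pi = np.full(len(grid_sd), 1.0 / len(grid_sd))
    for _ in range(n_iter):
        resp = lik * pi
        resp /= resp.sum(axis=1, keepdims=True)    # E-step: responsibilities
        pi = resp.mean(axis=0)                     # M-step: average responsibility
    return pi

rng = np.random.default_rng(2)
n_total, se = 200_000, 1.0
grid_sd = np.array([0.0, 2.0, 6.0])                # hypothetical grid; 0 is the point mass
true_pi = np.array([0.8, 0.15, 0.05])
component = rng.choice(len(grid_sd), size=n_total, p=true_pi)
betahat = rng.normal(size=n_total) * np.sqrt(grid_sd[component] ** 2 + se ** 2)

# Estimate the weights from a 10% random subsample instead of all data points.
subsample = rng.choice(n_total, size=20_000, replace=False)
pi_sub = em_mixture_weights(betahat[subsample], se, grid_sd)
```

The subsample estimate is noisier than the full-data estimate, but each EM iteration touches only a tenth of the data points.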

## Project meeting 20160921
### Tentative schedule
Status
* The `m&m ash` [model](http://www.bioinformatics.org/labnotes/mnmash/mnmash-model.html) and [implementation](https://github.com/gaow/mnmashr).
* Implementation is not working on data due to [data structure design](https://github.com/gaow/mnmashr/blob/master/src/mnmash.hpp#L41)
  * When J is 40, P 2000, K 50 and L 20, the S and SI matrices will be of size 3.2G * 2 = 6.4G. Looping over such data is very slow.
* Correctness of implementation not tested

Next steps
* Get implementation working
* Run simulations
* Put into OmicsBMA framework and do real data analysis

### Minutes
* We should conceptually distinguish a model using original data from one using summary data, though in the VB algorithm they are very similar.
* Currently the model assumes $\Sigma_{J \times J}$ is known. This is not easy to estimate because it would involve a non-trivial multiple regression. We should try to model $\Sigma_{J \times J}$ as an unknown diagonal matrix and estimate it in the VB framework. For starters, write up the J = 1 case.