# Annotation-enhanced genetic fine-mapping

## Aim

This simulation study compares fine-mapping with and without the use of annotations, in terms of:

1. Improvements in fine-mapping resolution: with the use of annotations we expect to provide smaller sets of candidate SNPs than without them.
2. Improvements in power: the top signal from each candidate fine-mapping cluster is more likely to be the true causal signal when annotations are used.
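The two criteria above can be made concrete with a small sketch. It assumes a per-SNP table with columns `snp`, `snp_prob` and `cluster` (the layout returned by `extract_dap_output` in `fit_dap.py`), treats `cluster == -1` as "not assigned to any cluster", and uses a hypothetical helper name `summarize_clusters`. The actual `evaluate.R` may compute these quantities differently.

```python
import pandas as pd

def summarize_clusters(snp_table, causal_snps):
    """Per-cluster size and whether the top-PIP SNP is a true causal SNP.

    snp_table: DataFrame with columns snp, snp_prob, cluster (assumed layout).
    causal_snps: iterable of SNP IDs that are truly causal in the simulation.
    """
    causal = set(causal_snps)
    rows = []
    # Keep only SNPs assigned to a signal cluster (assumption: -1 means unassigned)
    assigned = snp_table[snp_table["cluster"] > 0]
    for cluster_id, members in assigned.groupby("cluster"):
        # The "top signal" is the member SNP with the largest PIP
        top_snp = members.loc[members["snp_prob"].idxmax(), "snp"]
        rows.append({
            "cluster": cluster_id,
            "cluster_size": len(members),        # resolution: smaller is better
            "is_top_true": top_snp in causal,    # power: top signal is causal
        })
    return pd.DataFrame(rows)

# Toy input: two clusters, with s2 and s4 as the true causal SNPs
toy = pd.DataFrame({
    "snp": ["s1", "s2", "s3", "s4", "s5"],
    "snp_prob": [0.6, 0.3, 0.05, 0.7, 0.01],
    "cluster": [1, 1, 1, 2, -1],
})
summary = summarize_clusters(toy, causal_snps=["s2", "s4"])
print(summary)
```

Comparing `cluster_size` and the fraction of `is_top_true` between the annotation-aware and annotation-free runs addresses the two aims respectively.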

## Simulation scheme

### Genotype data

We use real genotypes from the GTEx project, from ~600 individuals of European ancestry. We keep common variants (MAF > 1%) and create genomic regions (fine-mapping units) each containing 1,000 SNPs, thus retaining the realistic LD pattern between SNPs.
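A minimal sketch of this preprocessing step, assuming genotypes are stored as an individuals-by-SNPs matrix of 0/1/2 dosages. The function names `maf` and `make_units` are hypothetical, and the toy matrix is drawn independently per SNP, so unlike the GTEx data it carries no LD.

```python
import numpy as np

def maf(genotypes):
    """Minor allele frequency per SNP for an (individuals x SNPs) 0/1/2 matrix."""
    alt_freq = genotypes.mean(axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)

def make_units(genotypes, unit_size=1000, maf_cutoff=0.01):
    """Drop rare variants, then cut the remaining SNPs into contiguous units."""
    keep = maf(genotypes) > maf_cutoff
    common = genotypes[:, keep]
    n_units = common.shape[1] // unit_size   # trailing SNPs that do not fill a unit are dropped
    return [common[:, i * unit_size:(i + 1) * unit_size] for i in range(n_units)]

# Toy data: 600 individuals, 50 SNPs, units of 10 SNPs to keep the example small
rng = np.random.default_rng(0)
freqs = rng.uniform(0.0, 0.5, size=50)
toy_genotypes = rng.binomial(2, freqs, size=(600, 50))
units = make_units(toy_genotypes, unit_size=10)
print(len(units), units[0].shape)
```

Cutting contiguous blocks, rather than sampling SNPs at random, is what preserves the local LD structure within each unit.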

### Phenotype data

We assume each analysis unit contains 1, 2, or 3 causal variants. The genomic positions of causal variants are simulated to be associated with genomic annotations. From a previous enrichment analysis of 5 annotations we estimated the enrichment of GWAS signals in these regions, with odds ratios ranging from 3.70 to 6.02 (mean 4.74). The 5 annotations physically cover a total of 13.36% of the genome. For simplicity we create, for each analysis unit, 5 non-overlapping consecutive regions whose total length constitutes 13.36% of the unit. Let $p_1$ and $p_0$ denote the causal probability of SNPs inside and outside these regions, respectively. They satisfy

\begin{align}
\gamma & = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} \\
L & = [qp_1 + (1-q)p_0] \times N
\end{align}

where $\gamma$ is the mean odds ratio ($\gamma = 4.74$), $L$ is the number of causal variants in the region ($L = 1, 2, 3$), $N$ is the number of SNPs in the region ($N = 1000$), and $q$ is the proportion of the region covered by annotations ($q = 0.1336$).
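Given $\gamma$, $L$, $N$ and $q$, the two equations determine $p_1$ and $p_0$. The sketch below is one way to solve them, not necessarily how `get_prior` in `sim_utils.R` does it. Substituting $p_1 = \gamma p_0 / (1 + (\gamma - 1) p_0)$ into the second equation gives a quadratic in $p_0$, and the code takes its root in $(0, 1)$. The function name `solve_causal_probs` is hypothetical.

```python
import math

def solve_causal_probs(gamma, L, N, q):
    """Solve the odds-ratio and expected-count equations for (p1, p0).

    With m = L / N, eliminating p1 gives a * p0**2 + b * p0 - m = 0, where
      a = (1 - q) * (gamma - 1)
      b = q * gamma + (1 - q) - m * (gamma - 1)
    """
    m = L / N
    a = (1 - q) * (gamma - 1)
    b = q * gamma + (1 - q) - m * (gamma - 1)
    # Root in (0, 1), written in the cancellation-free form of the quadratic
    # formula; it reduces to p0 = m when gamma == 1 (no enrichment).
    p0 = 2 * m / (b + math.sqrt(b * b + 4 * a * m))
    p1 = gamma * p0 / (1 + (gamma - 1) * p0)
    return p1, p0

for n_causal in (1, 2, 3):
    p1, p0 = solve_causal_probs(gamma=4.74, L=n_causal, N=1000, q=0.1336)
    print(n_causal, p1, p0)
```

Plugging the returned values back into both equations is a quick sanity check that the odds ratio and the expected number of causal variants are reproduced.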

To simulate phenotypes we assume each causal variant has effect size $\beta_j \sim N(0, \sigma^2)$. We simulate quantitative phenotypes, although in practice fine-mapping studies use summary statistics from GWAS of either case-control data or quantitative traits. Specifically, we generate phenotypes from a linear model $y = X\beta + E$, $E \sim N(0, \sigma_a^2)$. The relative magnitudes of $\sigma^2$ and $\sigma_a^2$ are determined by the proportion of variance explained (PVE) by all genetic effects. We set PVE to 0.1.
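The linear model above can be sketched as follows. This assumes the residual variance is chosen as $\sigma_a^2 = \mathrm{Var}(X\beta)(1 - \mathrm{PVE})/\mathrm{PVE}$ so that the realized genetic component explains exactly the target PVE. The function name `simulate_phenotype` and its arguments are hypothetical, and the `simulate` function in `sim_utils.py` (including its `amplitude` parameter) may work differently.

```python
import numpy as np

def simulate_phenotype(X, causal_idx, pve=0.1, effect_sd=1.0, rng=None):
    """Simulate y = X beta + E with a fixed proportion of variance explained."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    beta = np.zeros(p)
    # Effects are non-zero only at the causal SNPs: beta_j ~ N(0, effect_sd^2)
    beta[causal_idx] = rng.normal(0.0, effect_sd, size=len(causal_idx))
    genetic = X @ beta
    # Assumption: pick the residual variance that matches the target PVE
    residual_var = genetic.var() * (1.0 - pve) / pve
    y = genetic + rng.normal(0.0, np.sqrt(residual_var), size=n)
    return y, beta, residual_var

# Toy genotypes: 600 individuals, 20 independent SNPs, 2 causal variants
rng = np.random.default_rng(1)
X_toy = rng.binomial(2, 0.3, size=(600, 20)).astype(float)
y, beta, residual_var = simulate_phenotype(X_toy, causal_idx=[3, 11], pve=0.1, rng=rng)
print(np.flatnonzero(beta), residual_var)
```

Under this construction only the ratio of genetic to residual variance matters, so `effect_sd` rescales `y` without changing the PVE.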

## Analysis

We use DAP-G to analyze the data. We run and compare two versions of DAP-G: one that uses the "oracle" prior from the enrichment-based simulation, and one that uses uniform priors.
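The difference between the two priors can be illustrated with per-SNP prior vectors. This is a sketch under assumptions: the annotated blocks are evenly spaced, the uniform prior is taken to be `expected_causal / N` for every SNP, and the values of `p1` and `p0` are rounded, illustrative numbers of roughly the magnitude the simulation equations give for $L = 1$. All function names are hypothetical, and how priors are actually passed to `dap-g` is not shown here.

```python
import numpy as np

def annotation_mask(N, q, chunks):
    """Mark `chunks` non-overlapping blocks that together cover about q * N SNPs."""
    block_len = int(round(q * N / chunks))
    spacing = N // chunks                 # assumption: blocks are evenly spaced
    mask = np.zeros(N, dtype=bool)
    for i in range(chunks):
        start = i * spacing
        mask[start:start + block_len] = True
    return mask

def oracle_prior(mask, p1, p0):
    """Prior used for simulation: p1 inside annotated blocks, p0 outside."""
    return np.where(mask, p1, p0)

def uniform_prior(N, expected_causal=1.0):
    """Annotation-free prior: the same value for every SNP (assumed form)."""
    return np.full(N, expected_causal / N)

mask = annotation_mask(N=1000, q=0.1336, chunks=5)
# Illustrative, rounded values only
oracle = oracle_prior(mask, p1=0.00316, p0=0.00067)
flat = uniform_prior(N=1000)
print(mask.sum(), oracle.sum(), flat.sum())
```

Both vectors imply about one causal variant per unit in expectation; they differ only in how that prior mass is distributed across SNPs.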

## DSC benchmark

The simulation study is implemented in the DSC framework. The input data are simply matrices of genotypes.

### `zzz.dsc`

In [2]:
%save -f modules/zzz.dsc
%include modules/simulate_prior
%include modules/simulate_y
%include modules/fit
%include modules/evaluate

### `master.dsc`

In [1]:
%save -f master.dsc
#!/usr/bin/env dsc
%include modules/zzz

DSC:
    define:
        fit: dap, dapa
    run: simulate_prior * simulate_y * fit * evaluate
    exec_path: modules
    global:
        data_file: gtex-manifest.txt

### `simulate_prior.dsc`

In [9]:
%save -f modules/simulate_prior.dsc

simulate_prior: sim_utils.R + R(data = readRDS(dataset);
                           X = get_loci(data$X, N);
                           prior = get_prior(data$X, L, chunks, g, q))
  dataset: Shell{head -150 ${data_file}}
  N: 1000
  L: 1, 2, 3
  chunks: 5
  g: 
  q: 0.1336
  $X: X
  $prior: prior

### `simulate_y.dsc`

In [3]:
%save -f modules/simulate_y.dsc
simulate_y: sim_utils.py + Python(data = simulate(X, prior, pve, amplitude))
  X: $X
  prior: $prior
  pve: 0.1
  amplitude: 0.6
  $y: data['y']
  $coef: data['coef']

### `fit.dsc`

In [4]:
%save -f modules/fit.dsc

dap: fit_dap.py + Python(posterior = dap_batch(X, y, cache, args))
  X: $X
  y: $y
  args: "-ld_control 0.20 --all"
  cache: file(DAP)
  $posterior: posterior

dapa(dap): fit_dap.py + Python(posterior = dapa_batch(X, y, prior, cache, args))
  prior: $prior

### `evaluate.dsc`

In [13]:
%save -f modules/evaluate.dsc

evaluate: evaluate.R + R(e = evaluate(coef, posterior))
  coef: $coef
  posterior: $posterior
  $num_recovered: e$num_recovered
  $num_missed: e$num_missed
  $is_top_true: e$is_top_true
  $cluster_size: e$cluster_size

## Simulation

### `sim_utils.R`


In [14]:
%save -f modules/sim_utils.R
get_center <- function(k,n) {
  ## For given number k, get the range k surrounding n/2
  ## but have to make sure it does not go over the bounds
  if (is.null(k)) {
      return(1:n)
  }
  start = floor(n/2 - k/2)
  end = ceiling(n/2 + k/2)
  if (start<1) start = 1
  if (end>n) end = n
  return(start:end)
}

## Fine-mapping with DAP

### `fit_dap.py`

DAP version 1 was published as Wen et al. 2016 AJHG. William has since polished the software as `dap-g`, with another manuscript that describes the improved algorithm and support for summary statistics. This benchmark uses DAP version 2. Below is an example of the output that I parse and save.

```
Posterior expected model size: 0.500 (sd = 0.500)
LogNC = -0.30685 ( Log10NC = -0.133 )
Posterior inclusion probability

((1))              7492 6.68581e-05       0.000 1
((2))              7490 6.68581e-05       0.000 1
... 7 lines
((8))              7491 6.68046e-05       0.000 2
((9))              7483 6.68046e-05       0.000 2
((10))             7485 6.68046e-05       0.000 2
... 13 lines
((20))             7459 6.68046e-05       0.000 2
((21))             7482 6.67422e-05       0.000 -1
((22))             7489 6.67422e-05       0.000 -1
... other lines until below ...

Independent association signal clusters

     cluster         member_snp      cluster_pip      average_r2
       {1}              7            4.680e-04          0.951                 0.951   0.037
       {2}             13            8.685e-04          0.623                 0.037   0.623

```

In [22]:
%save -f modules/fit_dap.py
import subprocess
import pandas as pd
import numpy as np

def write_dap_full(x,y,prefix,r):
    names = np.array([('geno', i+1, f'group{r}') for i in range(x.shape[1])])
    with open(f'{prefix}.data', 'w') as f:
        print(*(['pheno', 'pheno', f'group{r}'] + list(np.array(y).ravel())), file=f)
        np.savetxt(f, np.hstack((names, x.T)), fmt = '%s', delimiter = ' ')
#     grid = '''         
#         0.0000  0.1000
#         0.0000  0.2000
#         0.0000  0.4000
#         0.0000  0.8000
#         0.0000  1.6000
#         '''
#     grid = '\n'.join([x.strip() for x in grid.strip().split('\n')])
#     with open(f'{prefix}.grid', 'w') as f:
#         print(grid, file=f)
        
def run_dap_full(prefix, args):
    cmd = ['dap-g', '-d', f'{prefix}.data', '-o', f'{prefix}.result', '--output_all'] + ' '.join(args).split()
    subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()    
           
def write_dap_ss(z,prefix):
    '''z-score version of dap input is the same as FINEMAP'''
    ids = np.array([str(i+1) for i in range(z.shape[0])])
    with open(f'{prefix}.z', 'w') as f:
        np.savetxt(f,  np.vstack((ids, z)).T, fmt = '%s', delimiter = ' ')

def run_dap_z(ld, prefix, args):
    cmd = ['dap-g', '-d_z', f'{prefix}.z', '-d_ld', ld, '-o', f'{prefix}.result', '--all'] + ' '.join(args).split()
    subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()    
    
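def extract_dap_output(prefix):
    '''Parse `{prefix}.result` into per-SNP PIPs and per-cluster summaries'''
    # read via a context manager so the file handle is closed promptly
    with open(f'{prefix}.result') as f:
        out = [x.strip().split() for x in f]
    pips = []
    clusters = []
    still_pip = True
    for line in out:
        if len(line) == 0:
            continue
        # the header of the cluster table marks the end of the per-SNP section
        if len(line) > 2 and line[2] == 'cluster_pip':
            still_pip = False
            continue
        # in the per-SNP section only rows starting with "((rank))" carry data
        if still_pip and (not line[0].startswith('((')):
            continue
        if still_pip:
            # columns: rank, SNP ID, PIP, log10 BF, cluster ID
            pips.append([line[1], float(line[2]), float(line[3]), int(line[4])])
        else:
            # columns: cluster ID, number of member SNPs, cluster PIP, average r2, ...
            clusters.append([len(clusters) + 1, float(line[2]), float(line[3])])
    pips = pd.DataFrame(pips, columns = ['snp', 'snp_prob', 'snp_log10bf', 'cluster'])
    clusters = pd.DataFrame(clusters, columns = ['cluster', 'cluster_prob', 'cluster_avg_r2'])
    # attach the comma-separated member SNP IDs to each cluster
    clusters = pd.merge(clusters, pips.groupby(['cluster'])['snp'].apply(','.join).reset_index(), on = 'cluster')
    return {'snp': pips, 'set': clusters}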

def dap_single(x, y, prefix, r, args):
    write_dap_full(x,y,prefix,r)
    run_dap_full(prefix,args)
    return extract_dap_output(prefix)

def dap_single_z(z, ld, prefix, args):
    write_dap_ss(z,prefix)
    run_dap_z(ld,prefix,args)
    return extract_dap_output(prefix)

def dap_batch(X, Y, prefix, *args):
    return dict([(r, dap_single(X, Y[:,r], f'{prefix}_condition_{r+1}', r+1, args)) for r in range(Y.shape[1])])

def dap_batch_z(z, ld, prefix, *args):
    return dict([(r, dap_single_z(z[:,r], ld, f'{prefix}_condition_{r+1}', args)) for r in range(z.shape[1])])