# Annotation enhanced genetic fine-mapping

## Aim

This simulation study compares fine-mapping with and without the use of annotations, in terms of:

1. Improvements in power: the top signal from each candidate fine-mapping cluster is more likely to be the true causal signal when annotations are used.
    - **power**: measures how many times the simulated signals ("signals" hereafter) are captured by fine-mapping clusters ("clusters" hereafter).
    - **false discovery proportion**: measures how many clusters do not capture any signal.
2. Improvements in fine-mapping resolution: with the use of annotations we expect to provide smaller sets of candidate SNPs than without them.
    - **top hit rate**: measures how many times the top association from each cluster is in fact a signal.
    - **size**: median size of clusters, smaller size means higher resolution.
    - **purity**: average $r^2$ within clusters, higher $r^2$ means higher resolution.
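
As a concrete illustration of the five metrics above, here is a minimal sketch on a toy example. The signal ids, cluster memberships and $r^2$ values are hypothetical and chosen only to make the definitions explicit; they are not results from this study.

```python
from statistics import median

# Hypothetical toy inputs: ids of simulated signals, and clusters reported by
# a fine-mapping run (member SNPs, top-PIP SNP, average r^2 within cluster).
signals = {"3", "40", "77"}
clusters = [
    {"snps": ["1", "2", "3"], "top": "3", "avg_r2": 0.95},
    {"snps": ["40", "41"], "top": "41", "avg_r2": 0.80},
    {"snps": ["90", "91", "92", "93"], "top": "90", "avg_r2": 0.60},
]

detected = set().union(*(c["snps"] for c in clusters))

# power: fraction of signals captured by some cluster
power = len(signals & detected) / len(signals)
# false discovery proportion: fraction of clusters capturing no signal
fdp = sum(1 for c in clusters if not signals & set(c["snps"])) / len(clusters)
# top hit rate: fraction of clusters whose top SNP is a signal
top_hit_rate = sum(1 for c in clusters if c["top"] in signals) / len(clusters)
# size: median number of SNPs per cluster
size = median(len(c["snps"]) for c in clusters)
# purity: average of the per-cluster average r^2
purity = sum(c["avg_r2"] for c in clusters) / len(clusters)

print(power, fdp, top_hit_rate, size, purity)
```

In this toy case two of the three signals are captured, one of the three clusters is a false discovery, and only the first cluster has a true signal as its top hit.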

## Simulation scheme

The general idea is to simulate summary statistics enriched in genomic annotations -- because nowadays most studies work with summary statistics. We need these quantities:

1. GWAS sample size, MAF and LD matrix, which can be estimated from real genotype data
    - We estimate these from GTEx project genotype data of ~600 European samples. We use a sample size of 80,000 to reflect the scale of available SCZ data.
2. PVE and polygenicity information from the literature for the phenotype of interest
    - For SCZ the total PVE is 0.8; for the 1 million variants typically analyzed, the proportion of causal variants is 0.006
3. Enrichment of possibly causal variants in different genomic annotations of relevance
    - We evaluate 5 (or 7?) ATAC-seq based annotations; the enrichment test was performed by Min Qiao
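
The first item can be sketched as follows: given a genotype matrix coded 0/1/2, the allele frequencies and the LD matrix are simple column summaries. The genotype matrix below is simulated and purely hypothetical; in the study these quantities come from GTEx genotypes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical genotype matrix: 600 samples x 5 SNPs, coded 0/1/2,
# drawn independently per SNP (so true LD between SNPs is zero here).
freqs = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
X = rng.binomial(2, freqs, size=(600, 5))

allele_freq = X.mean(axis=0) / 2                 # frequency of the coded allele
maf = np.minimum(allele_freq, 1 - allele_freq)   # minor allele frequency
R = np.corrcoef(X, rowvar=False)                 # LD matrix (pairwise correlation r)

print(np.round(maf, 3))
print(R.shape)
```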

### Simulation of expected z-scores

We generate data from a standard univariate linear regression model,

\begin{align}
y=X\beta + E, E \sim N(0, \sigma_a^2).
\end{align}

Percentage of variance explained, assuming additive polygenic effects:

\begin{align}
PVE & = \frac{var(X\beta)}{var(y)} \\
& = \sum_{j\in C} \frac{var(X_j\beta_j)}{var(y)} \\
& = \sum_{j\in C} PVE_j,
\end{align} 

where $C$ is the causal set, whose effects are assumed independent, and for SCZ

\begin{align}
PVE_j \approx \frac{0.8}{1~\text{million} \times 0.006} = 1.33\times 10^{-4}.
\end{align}

(The DSC configuration of `simulate_z` uses `pve: 1.34E-4`.)

Assume $y$ is scaled to have unit variance. Then $PVE_j = var(X_j\beta_j)$. For causal SNP $j$, 

\begin{align}
var(X_j\beta_j) &= \beta_j^2var(X_j) \\
&= 2f_j(1-f_j)\beta_j^2
\end{align} 

where $f_j$ is the causal allele frequency. The effect size is then 

\begin{align}
\beta_j &= (-1)^{\mathbb{1}[f_j>0.5]}\sqrt{\frac{PVE_j}{2f_j(1-f_j)}}
\end{align}

that is, under the conventional genotype coding of 0 for the major and 1 for the minor allele, the effect size $\beta_j$ is positive if the causal allele is the minor allele ($f_j < 0.5$).

The effect size estimate follows the distribution $\hat{\beta}_j \sim N(\beta_j, \sigma_j^2)$, where the variance of the BLUE of $\beta_j$ is

\begin{align}
\sigma_j^2 &= \frac{var(y)}{Nvar({X_j})} \\
&= \frac{1}{Nvar({X_j})} \\
&= \frac{1}{2Nf_j(1-f_j)}
\end{align}

The sample size $N$ for SCZ GWAS studies is set to 80,000 here. Hence the expected z-score summary statistic for causal SNP $j$ is

\begin{align}
z_j &= \frac{\beta_j}{\sigma_j} \\
&= (-1)^{\mathbb{1}[f_j>0.5]} \sqrt{N \times PVE_j}
\end{align}

Expected z-scores are zero for non-causal SNPs $j \not\in C$.
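
The derivation can be checked numerically. The sketch below uses hypothetical allele frequencies and the sign convention stated in the text (positive effect when the coded allele is the minor allele); it confirms that the frequency terms cancel, so the magnitude of the expected z-score depends only on $N$ and $PVE_j$.

```python
import numpy as np

N = 80_000                           # GWAS sample size used in this study
pve_j = 0.8 / (1e6 * 0.006)          # per-SNP PVE, about 1.33e-4
f = np.array([0.05, 0.2, 0.5, 0.8])  # hypothetical coded-allele frequencies

sign = np.where(f > 0.5, -1.0, 1.0)                # +1 if coded allele is minor
beta = sign * np.sqrt(pve_j / (2 * f * (1 - f)))   # effect size
sigma = np.sqrt(1.0 / (2 * N * f * (1 - f)))       # standard error of the estimate
z = beta / sigma                                   # expected z-score

print(np.round(z, 3))
```

With these inputs every causal SNP has $|z_j| = \sqrt{N \times PVE_j} \approx 3.27$, regardless of allele frequency.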

### Simulation of observed z-scores

In GWAS, observed z-scores $\hat{z}_j$ vary randomly about the expected z-scores $z_j$ derived above, with variance 1 and correlation between SNPs $j_1$ and $j_2$ equal to $cor(X_{j_1}, X_{j_2})$ (see, for example, the Methods section of the CAVIAR paper for a justification). We therefore simulate the observed z-scores from a multivariate normal distribution, $$\hat{Z} \sim MVN(Z, R)$$ where $R$, the matrix of correlations between SNPs in the simulated genomic region, is the LD matrix.
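
A minimal sketch of this sampling step, taking the LD matrix itself as the covariance (which is what "variance 1 and correlation $cor(X_{j_1}, X_{j_2})$" implies). The 3-SNP LD matrix and the expected z-scores are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-SNP region: SNPs 0 and 1 in strong LD, SNP 2 nearly independent
R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
z_expected = np.array([3.27, 0.0, 0.0])   # only SNP 0 is causal

# Draw many replicates of observed z-scores around the expected values
z_observed = rng.multivariate_normal(z_expected, R, size=10_000)

# Empirical mean and correlation should be close to z_expected and R
emp_mean = z_observed.mean(axis=0)
emp_cor = np.corrcoef(z_observed, rowvar=False)
print(np.round(emp_mean, 2))
print(np.round(emp_cor, 2))
```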

### Simulation of causal signals

For simplicity we create ~2,000 arbitrary chunks of genomic regions (fine-mapping analysis units), each containing 500 SNPs. That is, ~2,000 simulations with different underlying LD structure. Each chunk should contain multiple LD blocks. Under this setting the average number of causal SNPs within each region is 3 ($500 \times 0.006$). The exact positions at which these causal variants occur are associated with genomic annotations. From a previous enrichment analysis of 5 annotations we estimate the enrichment of GWAS signals in these regions, with odds ratios ranging from 3.70 to 6.02 and a mean of 4.74. The 5 annotations physically cover a total of 13.36% of the genome.

Let $p_1$ and $p_0$ denote causal probability of SNPs inside and outside annotation regions, 

\begin{align}
\gamma & = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} \\
L & = [qp_1 + (1-q)p_0] \times M
\end{align}

where $\gamma$ is the mean odds ratio (we try $\gamma = 3, 4.5, 6$), $L$ is the number of causal variants in the region ($L = 3$), $M$ is the length of the region ($M=500$), and $q$ is the proportion of the region overlapping with functional annotations ($q=0.1336$).

For simplicity we create for each analysis unit 5 non-overlapping consecutive regions whose total length constitutes 13.36% of the length of the unit.
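
One possible layout of the annotated regions, mirroring what `get_prior` in `sim_region.R` does: the unit is divided into 5 equal bins and the first SNPs of each bin are marked as annotated. This is a sketch; the variable names are illustrative, and because SNP counts must be integers, the 13.36 SNPs per block are truncated to 13.

```python
import numpy as np

M, chunks, q = 500, 5, 0.1336

per_chunk_len = int(M * q / chunks)   # 13.36 truncated to 13 SNPs per block
bin_len = M // chunks                 # 100 SNPs per bin

# Boolean mask of annotated SNPs: the first 13 SNPs of each 100-SNP bin
annotated = np.zeros(M, dtype=bool)
for i in range(chunks):
    start = i * bin_len
    annotated[start:start + per_chunk_len] = True

print(annotated.sum(), annotated.mean())
```

After truncation 65 of the 500 SNPs (13%) are annotated, slightly below the target of 13.36%.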

## Fine-mapping method

We use DAP-G to analyze the data. We run and compare two versions of DAP-G: one that uses the "oracle" prior from the enrichment-based simulation, and one that uses uniform priors.

## DSC benchmark

The simulation study is implemented in the DSC framework. Input data are just matrices of genotypes.

To run the benchmark, first click the "restart kernel and run all" button in the SoS notebook. Then run the exported scripts with 8 CPU threads:

```
dsc master.dsc -c 8
```

### `zzz.dsc`

In [1]:
%save -f modules/zzz.dsc
%include modules/simulate_region
%include modules/simulate_z
%include modules/fit
%include modules/evaluate

### `master.dsc`

In [2]:
%save -f master.dsc
#!/usr/bin/env dsc
%include modules/zzz

DSC:
    define:
        fit: dap, dapa
    run: simulate_region * simulate_z * fit * evaluate
    exec_path: modules
    global:
        data_file: gtex-manifest.txt
        n_units: 150
    output: xh_grant

### `simulate_region.dsc`

In [3]:
%save -f modules/simulate_region.dsc

simulate_region: sim_region.R + R(data = readRDS(dataset);
                                  X = get_loci(data$X, M);
                                  R = lapply(1:length(X), function(i) round(cor(X[[i]]),4));
                                  eff_sign = get_sign(X);
                                  prior = get_prior(M, chunks, g, q);
                                  lapply(1:length(R), function(i) write.table(R[[i]],paste0(ld_file, '.', i),quote=F,col.names=F,row.names=F)))                           
  dataset: Shell{head -${n_units} ${data_file}}
  M: 500
  chunks: 5
  g: 3, 4.5, 6
  q: 0.1336
  $eff_sign: eff_sign
  $R: R
  $prior: prior$prior
  $annotation: prior$annotation
  $ld_file: file(ld)

### `simulate_z.dsc`

In [13]:
%save -f modules/simulate_z.dsc
simulate_z: sim_z.py + Python(z, z_true, L = simulate(R, N, pve, eff_sign, n_signal, prior))
  R: $R
  eff_sign: $eff_sign
  prior: $prior
  N: 80000
  n_signal: 3
  pve: 1.34E-4
  $z: z
  $z_true: z_true
  $L: L

### `fit.dsc`

In [5]:
%save -f modules/fit.dsc

dap: fit_dap.py + Python(posterior = dap_batch_z(z, ld, cache, None, args))
  ld: $ld_file
  z: $z
  args: "-ld_control 0.20 --all"
  cache: file(DAP)
  $posterior: posterior

dapa(dap): fit_dap.py + Python(posterior = dap_batch_z(z, ld, cache, prior, args))
  prior: $prior

### `evaluate.dsc`

In [6]:
%save -f modules/evaluate.dsc

evaluate: evaluate_dap.py
  z_true: $z_true
  posterior: $posterior
  $is_recovered: is_recovered
  $is_cs_true: is_cs_true
  $is_top_true: is_top_true
  $size: size
  $purity: purity

## Code for Simulation

### `sim_region.R`

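To simulate priors for every SNP we need to determine $p_0$ and $p_1$. From the equations above, assuming $L=1$ (so that the per-SNP priors sum to one over the region and can be used as multinomial probabilities in `sim_z.py`), we derive: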

\begin{align}
p_1 = \frac{\gamma p_0}{1 - p_0 + \gamma p_0} \\
M(q-1-\gamma q + \gamma)p_0^2 - (Mq - M - Mq\gamma + \gamma - 1)p_0 - 1= 0
\end{align}

We can solve this numerically in R, e.g.:

In [7]:
g = 4.5
N = 500
q = 0.1336
foo = function(x) N * (q-1-g*q+g) * x^2 - (N*q-N-N*q*g+g-1) * x - 1
p0 = uniroot(foo, lower=0, upper=1, tol = .Machine$double.eps^0.8)$root
p1 = g * p0 / (1-p0+g*p0)
print(c(p0,p1))

[1] 0.001365430 0.006115208


In [8]:
# verify it:
print(p1/(1-p1) / (p0 / (1-p0)))
print(N * q * p1 + N * (1-q) * p0)

[1] 4.5
[1] 1


These probabilities recover the odds ratio, so we should be good. I'll use this in my simulation code below.
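
Since the equation for $p_0$ is a quadratic, it also has a closed-form solution. The Python sketch below is a cross-check of the numerical root above, not part of the benchmark; `solve_priors` and its optional argument `L` (a generalization to an expected count other than one) are illustrative names, and $\gamma > 1$ is assumed.

```python
import math

def solve_priors(gamma, M, q, L=1.0):
    """Closed-form p0 and p1 for odds ratio gamma (> 1), region length M,
    annotated fraction q and expected number of causal SNPs L."""
    # Quadratic a*p0^2 + b*p0 + c = 0 obtained by substituting
    # p1 = gamma*p0 / (1 - p0 + gamma*p0) into M*(q*p1 + (1-q)*p0) = L
    a = M * (1 - q) * (gamma - 1)
    b = M * q * gamma + M * (1 - q) - L * (gamma - 1)
    c = -L
    p0 = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)   # positive root
    p1 = gamma * p0 / (1 - p0 + gamma * p0)
    return p0, p1

p0, p1 = solve_priors(4.5, 500, 0.1336)
print(p0, p1)
```

For $\gamma = 4.5$, $M = 500$ and $q = 0.1336$ this agrees with the `uniroot` result printed above.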

In [9]:
%save -f modules/sim_region.R
get_loci = function(X, N) {
    # split the columns of X into non-overlapping segments of N SNPs each
    segs = floor(ncol(X) / N)
    lapply(1:segs, function(i) X[, ((i-1)*N + 1):(i*N)])
}
get_prior = function(N, chunks, g, q) {
    foo = function(x) N * (q-1-g*q+g) * x^2 - (N*q-N-N*q*g+g-1) * x - 1
    p0 = uniroot(foo, lower=0, upper=1, tol = .Machine$double.eps^0.8)$root
    p1 = g * p0 / (1-p0+g*p0)
    per_chunk_len = N * q / chunks
    n_bins = floor(N/chunks)
    annotated = unlist(lapply(1:chunks, function(i) ((i-1) * n_bins + 1):((i-1) * n_bins + per_chunk_len)))
    prior = rep(p0, N)
    prior[annotated] = p1                          
    list(prior=prior, annotation=annotated)                          
}
get_sign = function(X) {
    lapply(1:length(X), function(i) apply(X[[i]], 2, function(x) (-1)^as.integer((mean(x)/2) > 0.5)))
}
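
For readers more familiar with Python, the splitting and sign logic can be sketched as below, assuming the chunks are meant to be non-overlapping windows of `M` SNPs. The function names and the tiny genotype matrix are hypothetical and not part of the benchmark.

```python
import numpy as np

def split_windows(X, M):
    """Split the columns of X into floor(ncol / M) non-overlapping windows."""
    n_windows = X.shape[1] // M
    return [X[:, i * M:(i + 1) * M] for i in range(n_windows)]

def coded_allele_sign(X):
    """+1 where the coded allele frequency is at most 0.5, -1 otherwise."""
    freq = X.mean(axis=0) / 2
    return np.where(freq > 0.5, -1, 1)

# Hypothetical 2 x 12 genotype matrix coded 0/1/2
X = np.arange(24).reshape(2, 12) % 3
windows = split_windows(X, 5)       # two windows; the last 2 columns are dropped
signs = [coded_allele_sign(w) for w in windows]
print(len(windows), windows[0].shape, signs[0])
```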

### `sim_z.py`

In [10]:
%save -f modules/sim_z.py
import numpy as np

def sim_gwas_z(R, N, pve, eff_sign, n_signal, prior):
    np.random.seed(int(np.sum(np.array(R))))
    # get expected z-score assuming all causal
    z_true = np.sqrt(N * pve) * eff_sign
    # sparsify expected z-score to allow for n_signal causal
    # NOTE: the multinomial draw may pick the same SNP more than once if some
    # prior is very high, so it is not guaranteed to give exactly n_signal
    # non-zeros; counts are converted to a 0/1 indicator so that a repeated
    # draw never multiplies the effect
    z_true = z_true * (np.random.multinomial(n_signal, prior) > 0)
    # get observed z-scores; the covariance is the LD (correlation) matrix R
    z = np.random.multivariate_normal(z_true, np.array(R))
    return z, z_true

def simulate(R, N, pve, eff_sign, n_signal, prior):
    z_true = {k:[] for k in R}
    z = {k:[] for k in R}
    for k in R:
        z[k], z_true[k] = sim_gwas_z(R[k], N, pve, eff_sign[k], n_signal, prior)
    return z, z_true, {k: sum(z_true[k]!=0) for k in z_true}
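
The caveat noted in the comment of `sim_gwas_z` (a multinomial draw can hit the same SNP twice) could be avoided by sampling positions without replacement. The following is a sketch of that alternative, not the method used in the benchmark; `sample_causal` is a hypothetical helper, and weighted sampling without replacement does not give exactly the same distribution as the multinomial draw.

```python
import numpy as np

def sample_causal(prior, n_signal, rng):
    """Draw n_signal distinct causal positions, weighted by the prior."""
    p = np.asarray(prior, dtype=float)
    p = p / p.sum()                      # normalize to a probability vector
    idx = rng.choice(len(p), size=n_signal, replace=False, p=p)
    indicator = np.zeros(len(p), dtype=int)
    indicator[idx] = 1
    return indicator

rng = np.random.default_rng(2)
# Hypothetical prior: 13 annotated SNPs with higher causal probability
prior = np.full(500, 0.001365)
prior[:13] = 0.006115
indicator = sample_causal(prior, 3, rng)
print(indicator.sum())
```

The indicator always has exactly `n_signal` ones, so it can multiply the expected z-scores directly.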

## Fine-mapping with DAP

### `fit_dap.py`

Below is example output for DAP-G:

```
Posterior expected model size: 0.500 (sd = 0.500)
LogNC = -0.30685 ( Log10NC = -0.133 )
Posterior inclusion probability

((1))              7492 6.68581e-05       0.000 1
((2))              7490 6.68581e-05       0.000 1
... 7 lines
((8))              7491 6.68046e-05       0.000 2
((9))              7483 6.68046e-05       0.000 2
((10))             7485 6.68046e-05       0.000 2
... 13 lines
((20))             7459 6.68046e-05       0.000 2
((21))             7482 6.67422e-05       0.000 -1
((22))             7489 6.67422e-05       0.000 -1
... other lines until below ...

Independent association signal clusters

     cluster         member_snp      cluster_pip      average_r2
       {1}              7            4.680e-04          0.951                 0.951   0.037
       {2}             13            8.685e-04          0.623                 0.037   0.623

```
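
To make the layout of the cluster table concrete, here is a small stand-alone sketch that parses the cluster rows of the sample output above. It only illustrates the column positions; the dictionary keys are illustrative names, and the parser actually used in the benchmark is `extract_dap_output` in `fit_dap.py`.

```python
import re

sample = """\
Independent association signal clusters

     cluster         member_snp      cluster_pip      average_r2
       {1}              7            4.680e-04          0.951                 0.951   0.037
       {2}             13            8.685e-04          0.623                 0.037   0.623
"""

rows = []
for line in sample.splitlines():
    fields = line.split()
    # cluster rows start with an id in braces, e.g. {1}
    if fields and re.fullmatch(r"\{\d+\}", fields[0]):
        rows.append({
            "cluster": int(fields[0].strip("{}")),
            "n_snp": int(fields[1]),
            "cluster_pip": float(fields[2]),
            "average_r2": float(fields[3]),
        })

print(rows)
```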

In [11]:
%save -f modules/fit_dap.py
import subprocess
import pandas as pd
import numpy as np

def write_dap_z(z, prefix):
    '''z-score version of dap input is the same as FINEMAP'''
    ids = np.array([str(i+1) for i in range(z.shape[0])])
    with open(f'{prefix}.z', 'w') as f:
        np.savetxt(f,  np.vstack((ids, z)).T, fmt = '%s', delimiter = ' ')

def write_prior(prior, prefix):
    with open(f'{prefix}.prior', 'w') as f:
        f.write('\n'.join([f'{i+1}\t{p}' for i, p in enumerate(prior)]))

def run_dap_z(ld, r, prior, prefix, args):
    cmd = ['dap-g', '-d_z', f'{prefix}.z', '-d_ld', f'{ld}.{r}', '-o', f'{prefix}.result', '--output_all'] + ' '.join(args).split()
    if prior is not None:
        cmd.extend(['-p', f'{prefix}.prior'])
    subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()    
    
def extract_dap_output(prefix):
    with open(f'{prefix}.result') as f:
        out = [x.strip().split() for x in f]
    pips = []
    clusters = []
    still_pip = True
    for line in out:
        if len(line) == 0:
            continue
        if len(line) > 2 and line[2] == 'cluster_pip':
            still_pip = False
            continue
        if still_pip and (not line[0].startswith('((')):
            continue
        if still_pip:
            pips.append([line[1], float(line[2]), float(line[3]), int(line[4])])
        else:
            clusters.append([len(clusters) + 1, float(line[2]), float(line[3])])
    pips = pd.DataFrame(pips, columns = ['snp', 'snp_prob', 'snp_log10bf', 'cluster'])
    clusters = pd.DataFrame(clusters, columns = ['cluster', 'cluster_prob', 'cluster_avg_r2'])
    clusters = pd.merge(clusters, pips.groupby(['cluster'])['snp'].apply(','.join).reset_index(), on = 'cluster')
    return {'snp': pips, 'set': clusters}

def dap_single_z(z, ld, prefix, r, prior, args):
    write_dap_z(z,prefix)
    if prior is not None:
        write_prior(prior, prefix)
    run_dap_z(ld, r, prior, prefix, args)
    return extract_dap_output(prefix)


def dap_batch_z(Z, ld, prefix, prior, *args):
    return dict([(k, dap_single_z(Z[k], ld, f'{prefix}_condition_{k}', k, prior, args)) for k in Z])

## Evaluate results

### `evaluate_dap.py`

In [12]:
%save -f modules/evaluate_dap.py
def dap_summary(cluster, snp, coef):
    signal_expected = [f'{idx+1}' for idx, i in enumerate(coef) if i != 0]
    cluster = cluster.loc[cluster['cluster_prob'] > 0.95]
    if cluster.shape[0] == 0:
        return ["failed"] * 5
    # to return: purity is the average r^2 within each cluster
    purity = cluster['cluster_avg_r2'].tolist()
    cluster = [x.split(',') for x in cluster['snp']]
    signal_detected = sum(cluster, [])
    # to return
    size = [len(c) for c in cluster]
    snp = [snp.loc[snp['snp'].isin(c)] for c in cluster]
    top_snp = [s.loc[s['snp_prob'] == max(s['snp_prob'])]['snp'].tolist()[0] for s in snp]
    # to return
    is_top_true = [1 if x in signal_expected else 0 for x in top_snp]
    # to return
    is_recovered = [1 if x in signal_detected else 0 for x in signal_expected]
    # to return
    is_cs_true = [1 if len(set(x).intersection(signal_expected)) else 0 for x in cluster]
    return is_recovered, is_cs_true, is_top_true, size, purity

is_recovered = dict()
is_cs_true = dict()
is_top_true = dict()
size = dict()
purity = dict()
for k in posterior:
    # z_true is the module input holding the expected z-scores (non-zero at causal SNPs)
    is_recovered[k], is_cs_true[k], is_top_true[k], size[k], purity[k] = dap_summary(posterior[k]['set'], posterior[k]['snp'], z_true[k])
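
The per-region outputs still need to be pooled to obtain the power and false discovery proportion defined in the Aim section. One possible way is sketched below, on hypothetical inputs in the format returned by `dap_summary`; `pooled_mean` is an illustrative helper. Note that this sketch skips regions marked `"failed"`, which would overstate power if those regions contain signals, so a real analysis may prefer to count them as misses.

```python
import numpy as np

# Hypothetical per-region outputs in the format returned by dap_summary
is_recovered = {"1": [1, 0, 1], "2": "failed", "3": [1, 1, 1]}
is_cs_true = {"1": [1, 1], "2": "failed", "3": [1, 0, 1]}

def pooled_mean(results):
    """Average all 0/1 entries across regions, skipping failed regions."""
    values = [v for r in results.values() if r != "failed" for v in r]
    return float(np.mean(values)) if values else float("nan")

power = pooled_mean(is_recovered)    # fraction of signals captured
fdp = 1 - pooled_mean(is_cs_true)    # fraction of clusters with no signal
print(power, fdp)
```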