# Analysis of toy data via DAP

This notebook explores applying the [DAP](https://github.com/xqwen/dap) software directly to our CNV data, using a toy example. It largely follows [this tutorial](https://github.com/xqwen/dap/wiki/Case-study:-multi-SNP-fine-mapping).

In [1]:
# ! cd ~/GIT/software; git clone https://github.com/xqwen/dap

Here is Python code to prepare DAP input data from a matrix in which the first column is the response and the remaining columns are regressors. For our toy data, the first column is disease status and the remaining columns indicate whether each gene harbors a CNV.
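
For illustration, the expected layout can be sketched with a small made-up data frame. The gene column names (`gene1` to `gene4`) and all values below are hypothetical and are not taken from the toy data file; only the `phenotype` column name matches what the conversion code in this notebook expects.

```python
import pandas as pd

# Hypothetical toy input: one row per sample.
# First column is the response, the rest are 0/1 CNV indicators per gene.
toy = pd.DataFrame({
    "phenotype": [1, 0, 1, 0, 1, 0],  # disease status
    "gene1": [1, 0, 1, 0, 0, 0],
    "gene2": [0, 0, 0, 1, 0, 0],
    "gene3": [1, 0, 1, 0, 1, 0],
    "gene4": [0, 1, 0, 0, 0, 0],
})

# Everything except the response column is treated as a regressor.
regressors = [c for c in toy.columns if c != "phenotype"]
print(toy.shape)
print(regressors)
```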

In [7]:
import feather
import pandas as pd
fn = "data/toy_4genes_n_1370.feather"
fout = "data/toy_4genes_n_1370.dap"

dat = feather.read_dataframe(fn)

def run_dap(df, fout, prefix = None, exec_path = None):
    '''Convert a pandas dataframe to DAP input files and run DAP:
        - phenotype / genotype file (.dat)
        - prior file (.prior): equal prior weight for every regressor
        - grid file (of effect size, .grid): omega^2 + phi^2 is what we care
            about. Let's set it to 1 1; 2 2; 3 3 and 4 4 for now, as we only
            have one Y
    '''
    import os
    if prefix is None:
        import time
        prefix = "/tmp/F" + str(time.time())
    if exec_path is None:
        exec_path = 'dap/dap'
    chrom = 'chr6'
    pos = 100000
    dat = [['pheno', 'trait', 'chicago'] + [str(x) for x in df['phenotype']]]
    prior = []
    grid = [(1,1),(2,2),(3,3),(4,4)]
    for idx, item in enumerate(df.columns.values):
        if item == 'phenotype':
            continue
        dat.append(['geno', '{}.{}'.format(chrom, pos + idx), 'chicago'] + [str(x) for x in df[item]])
        prior.append(['{}.{}'.format(chrom, pos + idx), str(1/(df.shape[1] - 1))])
    with open(prefix + '.dat', 'w') as f:
        f.write('\n'.join([' '.join(x) for x in dat]))
    with open(prefix + '.prior', 'w') as f:
        f.write('\n'.join([' '.join(x) for x in prior]))
    with open(prefix + '.grid', 'w') as f:
        f.write('\n'.join([' '.join(map(str, x)) for x in grid]))
    os.system("{0} -d {1}.dat -g {1}.grid -t 8 -it 0.05 -prior {1}.prior > {2}".format(exec_path, prefix, fout))
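
To see what the three input files contain, the formatting logic above can be sketched without writing files or calling DAP. The helper name `dap_input_lines` and the three-sample data frame are hypothetical and exist only for this illustration; the line layout mirrors `run_dap` as written above.

```python
import pandas as pd

def dap_input_lines(df, chrom="chr6", pos=100000, group="chicago"):
    """Return the lines of the .dat, .prior and .grid files as lists of strings.

    Hypothetical helper that mirrors the formatting in run_dap; it does not
    write any file and does not run DAP.
    """
    n_regressors = df.shape[1] - 1
    # First .dat line: the phenotype.
    dat = [" ".join(["pheno", "trait", group] + [str(x) for x in df["phenotype"]])]
    prior = []
    for idx, col in enumerate(df.columns):
        if col == "phenotype":
            continue
        name = "{}.{}".format(chrom, pos + idx)
        # One "geno" line per regressor, named by a pseudo-position.
        dat.append(" ".join(["geno", name, group] + [str(x) for x in df[col]]))
        # Equal prior weight for every regressor.
        prior.append("{} {}".format(name, 1 / n_regressors))
    grid = ["{} {}".format(a, a) for a in (1, 2, 3, 4)]
    return dat, prior, grid

# Made-up example with two regressors and three samples.
toy = pd.DataFrame({"phenotype": [1, 0, 1], "gene1": [1, 0, 0], "gene2": [0, 0, 1]})
dat_lines, prior_lines, grid_lines = dap_input_lines(toy)
print("\n".join(dat_lines))
print("\n".join(prior_lines))
print("\n".join(grid_lines))
```

Because `phenotype` is the first column, the first regressor gets index 1 and is therefore named `chr6.100001`.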

To run DAP:

In [8]:
run_dap(dat, fout)

%preview -n data/toy_4genes_n_1370.dap

    1   7.2437e-01    2      7.465   [chr6.100001] [chr6.100003]
    2   2.1527e-01    1      6.938   [chr6.100001]

Posterior expected model size: 1.664 (sd = 0.586)

LogNC = 17.51028 ( Log10NC = 7.605 )

Posterior inclusion probability

    1 chr6.100001   9.39636e-01      7.914
    2 chr6.100003   7.24368e-01     -1.392

The first section ranks the high-probability association models, one model per line. Take the top-ranked model:

    1   7.2437e-01    2      7.465   [chr6.100001] [chr6.100003]

The posterior probability of the association model containing [chr6.100001] and [chr6.100003] is $0.72437$, and the model includes 2 variables.

The unnormalized posterior score, $\log_{10}(\text{prior}) + \log_{10}(\text{BF})$, is $7.465$.

The last section of the output gives the posterior inclusion probabilities (PIP) for the top-ranked genes. For example, the PIP for gene1 ([chr6.100001]) is $0.9396$.
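
As a consistency check, the PIPs can be approximately recovered from the model ranking, assuming the usual definition of a PIP as the sum of the posterior probabilities of all models that contain the variable. The sketch below uses only the two models listed in the output above, so it ignores any posterior mass on models that are not shown; the variable names `models`, `pip` and `expected_size` are introduced here for illustration.

```python
from collections import defaultdict

# Top-ranked models copied from the DAP output above:
# (posterior probability, variables in the model)
models = [
    (7.2437e-01, ["chr6.100001", "chr6.100003"]),
    (2.1527e-01, ["chr6.100001"]),
]

# PIP of a variable: sum of posteriors over the listed models that include it.
pip = defaultdict(float)
for prob, variables in models:
    for v in variables:
        pip[v] += prob

# Posterior expected model size: posterior-weighted number of variables.
expected_size = sum(prob * len(variables) for prob, variables in models)

for v, p in sorted(pip.items()):
    print(v, round(p, 4))
print("expected model size:", round(expected_size, 3))
```

The sums agree with the reported values to the printed precision: $0.72437 + 0.21527 = 0.93964$ for [chr6.100001], $0.72437$ for [chr6.100003], and $2 \times 0.72437 + 1 \times 0.21527 = 1.664$ for the expected model size.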