# Simulation of exome-wide CNV data for case-control samples

## Simulation outline
For a given type of CNV (e.g. duplication) we first simulate all the variants across the exome,
then assign these simulated CNVs to genes. For case samples we expect an enrichment of CNVs in causal
genes, and this enrichment is quantified by the relative risk of a CNV's contribution to disease
when it is present in a causal gene vs. a non-causal gene.

## Parameters
We define a causal CNV as a CNV in a causal gene, and denote it as CCNV. Likewise, NCNV denotes a non-causal CNV.

| Parameter | Description | Value|
|:-----:|:-----:|:-----:|
|$\lambda$|Average number of CNVs per exome|100 for dup, 20 for del (*FIXME: reference required*)|
|$p_L$|Parameter of the geometric distribution that models CNV length|To be fitted from data|
|$\gamma$|Odds ratio of CCNV|To be drawn from Gamma(5, 1)|
|$p_0$|Pr(case &#124; NCNV)|To be calculated below|
|$p_1$|Pr(case &#124; CCNV)|To be calculated from $p_0$ and $\gamma$|
|$N_0, N_1$|Case / control sample sizes|Various numbers|
|$q$|Proportion of causal genes|(*FIXME: reference required*)|
|$k$|Prevalence of the disease|(*FIXME: reference required*)|

## Simulation steps
The steps below simulate exome data for a single individual (case or control) and one type of CNV.
They are repeated to simulate all samples and all types of CNV.

### Step 1: simulate number of CNV per exome
The number of CNVs in the exome is drawn from Pois($\lambda$).
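
As a minimal sketch of this step, assuming NumPy is available: the `LAMBDA` dictionary and the function `simulate_cnv_count` are hypothetical names introduced here, and the rates are the placeholder values from the parameter table, which still need a reference.

```python
import numpy as np

# Placeholder rates from the parameter table (reference still required).
LAMBDA = {"dup": 100, "del": 20}

def simulate_cnv_count(cnv_type, rng):
    """Draw the number of CNVs of one type in one exome from Pois(lambda)."""
    return int(rng.poisson(LAMBDA[cnv_type]))

rng = np.random.default_rng(seed=1)
n_dup = simulate_cnv_count("dup", rng)
n_del = simulate_cnv_count("del", rng)
print(n_dup, n_del)
```

Passing the generator `rng` explicitly keeps the draws reproducible when the step is repeated across samples.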

### Step 2: get the length of each of these CNV simulated

First, an empirical distribution of CNV length is obtained from real data.

Then the parameter $p_L$ of the geometric distribution is fitted to this empirical distribution.

Finally, the length of each CNV is drawn from the fitted geometric distribution.
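
The three steps above could look roughly as follows. This is a sketch under assumptions: the fit uses the maximum-likelihood estimate $p_L = 1/\bar{L}$ for a geometric distribution on $\{1, 2, \dots\}$, which is one possible fitting method; the `observed` lengths are made-up numbers standing in for the real empirical data; and `fit_geometric` and `simulate_lengths` are hypothetical helper names.

```python
import numpy as np

def fit_geometric(lengths):
    """MLE of p_L for a geometric distribution on {1, 2, ...}: 1 / mean length."""
    lengths = np.asarray(lengths, dtype=float)
    if lengths.size == 0 or np.any(lengths < 1):
        raise ValueError("lengths must be non-empty and all >= 1")
    return 1.0 / lengths.mean()

def simulate_lengths(n_cnv, p_l, rng):
    """Draw n_cnv CNV lengths from Geometric(p_l); every value is >= 1."""
    return rng.geometric(p_l, size=n_cnv)

# Made-up lengths, NOT real data; replace with the empirical distribution.
observed = [1200, 5400, 800, 15000, 2300]
p_l = fit_geometric(observed)

rng = np.random.default_rng(seed=2)
lengths = simulate_lengths(10, p_l, rng)
print(p_l, lengths)
```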

### Step 3: get gene label for each CNV

First, put all genes into 2 groups: causal genes and non-causal genes. This can be random for starters, or be based on annotation from real data for refined simulations.

Then, for each CNV, calculate the probability that it belongs to the causal / non-causal gene group given the disease status of its sample, i.e.:

$$Pr(CCNV|\text{case}) = \frac{p_1 \times q}{k}$$
$$Pr(NCNV|\text{case}) = 1 - Pr(CCNV|\text{case})$$

Given $k$ and $q$, it remains to calculate $p_1$. By the definition of $k$ and the law of total probability:
$$k=p_0 \times (1 - q) + p_1 \times q$$
and by the definition of the odds ratio:
$$\gamma = \frac{\frac{p_1}{1-p_1}}{\frac{p_0}{1-p_0}}$$

After drawing $\gamma$ from the Gamma distribution, the odds-ratio equation gives $p_1$ as a function of $\gamma$ and $p_0$, namely $p_1 = \frac{\gamma p_0}{1 - p_0 + \gamma p_0}$. Substituting this into the first equation leaves $p_0$ as the only unknown, so given $k$, $q$ and $\gamma$ we can solve for $p_0$ and then calculate $p_1$. This yields Pr(CCNV|case) and Pr(NCNV|case). For controls, Pr(CCNV|ctrl) = Pr(CCNV) = $q$ and Pr(NCNV|ctrl) = Pr(NCNV) = $1-q$.
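
A possible way to solve the two equations above numerically is sketched here. It assumes a simple bisection on $p_0$, which works because the implied prevalence increases monotonically in $p_0$ for $\gamma > 0$; a closed-form quadratic solution would do equally well. The values of `k` and `q` are placeholders, not taken from any reference, and `p1_from_p0` and `solve_p0` are hypothetical helper names.

```python
import random

def p1_from_p0(p0, gamma):
    """Invert the odds-ratio definition: p1 = gamma*p0 / (1 - p0 + gamma*p0)."""
    return gamma * p0 / (1.0 - p0 + gamma * p0)

def solve_p0(k, q, gamma, tol=1e-12):
    """Bisection on p0 in [0, 1] so that p0*(1-q) + p1*q matches the prevalence k."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * (1.0 - q) + p1_from_p0(mid, gamma) * q < k:
            lo = mid   # implied prevalence too low: p0 must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Placeholder values (assumptions): prevalence and proportion of causal genes.
k, q = 0.01, 0.05

random.seed(3)
gamma = random.gammavariate(5, 1)   # odds ratio drawn from Gamma(5, 1)

p0 = solve_p0(k, q, gamma)
p1 = p1_from_p0(p0, gamma)
pr_ccnv_case = p1 * q / k           # Pr(CCNV | case)
pr_ncnv_case = 1.0 - pr_ccnv_case   # Pr(NCNV | case)
print(gamma, p0, p1, pr_ccnv_case)
```

When $\gamma = 1$ the solver returns $p_0 = p_1 = k$, so Pr(CCNV|case) reduces to $q$, which is a quick sanity check.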

Finally, we can use a Bernoulli random number generator to determine whether a CNV falls into the causal gene group, given the known status of its sample. Once the group is determined, we randomly assign the CNV to a gene in that group.
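
The Bernoulli draw and the gene assignment could be sketched as below. The function `assign_gene` is a hypothetical helper, the gene within a group is chosen uniformly (an assumption, since no weighting is specified here), and the split of gene names into causal and non-causal is arbitrary and for illustration only.

```python
import random

def assign_gene(is_case, pr_ccnv_case, q, causal_genes, noncausal_genes, rng):
    """Pick the gene group with a Bernoulli draw, then a gene uniformly within it."""
    # Cases use Pr(CCNV | case); controls use Pr(CCNV | ctrl) = q.
    pr_causal = pr_ccnv_case if is_case else q
    is_causal = rng.random() < pr_causal
    group = causal_genes if is_causal else noncausal_genes
    return rng.choice(group), is_causal

# Arbitrary grouping for illustration only; NOT a statement about real causal genes.
causal_genes = ["ABCA4", "ACADM"]
noncausal_genes = ["A3GALT2", "AADACL3", "AADACL4", "ABCB10"]

rng = random.Random(4)
gene, is_causal = assign_gene(True, 0.2, 0.05, causal_genes, noncausal_genes, rng)
print(gene, is_causal)
```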

### Step 4: get position of a CNV
From Step 2 we know the length of the CNV, and from Step 3 the gene the CNV belongs to. Now we should place the CNV in the genome such that it has reasonable overlap with this gene. The start position of the CNV should therefore lie in the interval $(T_s - L, T_e)$, where $T_s$ and $T_e$ are the start and end positions of the gene and $L$ is the length of the CNV.
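
One way to draw such a start position is sketched below. It assumes integer coordinates with half-open intervals `[start, end)`, a uniform draw over the open interval $(T_s - L, T_e)$, and clamping of the lower bound at 0 so the CNV never starts before the chromosome begins; `place_cnv` is a hypothetical helper, and the example length of 5000 is made up.

```python
import random

def place_cnv(tx_start, tx_end, length, rng):
    """Return (start, end) of a CNV of the given length overlapping [tx_start, tx_end)."""
    lo = max(0, tx_start - length + 1)   # smallest start that still overlaps the gene
    hi = tx_end - 1                      # largest start that still overlaps the gene
    start = rng.randint(lo, hi)          # randint is inclusive on both ends
    return start, start + length

rng = random.Random(5)
# Gene span of A3GALT2 as printed in the gene table output.
cnv_start, cnv_end = place_cnv(33772366, 33786699, 5000, rng)
print(cnv_start, cnv_end)
```

Every start drawn this way gives at least one position of overlap with the gene; requiring a larger minimum overlap would only change `lo` and `hi`.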

In [21]:
import random
import pandas as pd
from pandasql import sqldf

In [41]:
# Load refGene transcripts; keep transcript name, chromosome, start, end and gene symbol
ref_gene = pd.read_table("data/refGene.txt.gz", compression="gzip", sep="\t",
                         header=None, usecols=[1, 2, 4, 5, 12],
                         names=["tx_name", "chrom", "tx_start", "tx_end", "gene_name"])
# Drop transcripts that share identical coordinates
#gene_df = ref_gene.drop_duplicates(subset=["chrom", "tx_start", "tx_end", "gene_name"])
gene_df = ref_gene.drop_duplicates(subset=["chrom", "tx_start", "tx_end"])
# Collapse transcripts to one row per gene, spanning its outermost coordinates
query = '''
select chrom, gene_name, min(tx_start), max(tx_end)
from gene_df
group by chrom, gene_name
'''
gene_table = sqldf(query)
print(gene_table)

      chrom     gene_name  min(tx_start)  max(tx_end)
0      chr1       A3GALT2       33772366     33786699
1      chr1       AADACL3       12776117     12788726
2      chr1       AADACL4       12704565     12727097
3      chr1         ABCA4       94458393     94586705
4      chr1        ABCB10      229652328    229694442
5      chr1         ABCD3       94883932     94984219
6      chr1          ABL2      179068461    179198819
7      chr1         ACADM       76190031     76229363
8      chr1         ACAP3        1227763      1243269
9      chr1         ACBD3      226332379    226374423
10     chr1         ACBD6      180257351    180472022
11     chr1         ACKR1      159173802    159176290
12     chr1        ACOT11       55013806     55100417
13     chr1         ACOT7        6324331      6453826
14     chr1          ACP6      147119167    147142665
15     chr1         ACTA1      229566992    229569843
16     chr1      ACTG1P20       27650364     27653016
17     chr1       ACTG1P4   