# Simulation process, analysis and result

## Introduction and aims
1. Copy number variants (CNVs) are large genomic insertion or deletion events, a type of structural variation in an organism’s chromosomes. CNV length varies widely, from about 50 base pairs to kilobases or even megabases.
2. The challenge is that CNVs often span multiple genes, and it is difficult to distinguish susceptible or causal gene(s) from other genes in the same CNV event.
3. The goal of this project is to develop a statistical framework that leverages genome-wide CNVs for mapping susceptibility genes.

## Simulation steps
- Step 1: Obtain gene blocks: each gene block contains at least 30 genes/exons. A boundary gene must not overlap a CNV event in any individual. If the 30th gene overlaps a CNV event in at least one individual, we move on to the next gene and check it against the same criterion, continuing until we reach the nearest gene that satisfies it.
- Step 2: Simulate samples (X matrix) for deletions: sample each block from non-repeated individuals and merge the sampled blocks into one simulated individual. Repeat this process $100,000$ times to obtain $100,000$ simulated individuals before simulating the phenotype.
- Step 3: Simulate phenotype (y matrix): set the penetrance/prevalence to $0.05$. Since prevalence $p \approx \frac{e^{\beta_0}}{1+e^{\beta_0}}$, we have $\beta_0 \approx \log \frac{p}{1-p}$. The odds ratio (OR) follows $e^{\text{Normal}(\mu,\sigma)}$ or a Gamma distribution, and $\beta_j = \log(\text{OR}) \sim \text{Normal}(\mu,\sigma)$. Then use Bernoulli($\pi$) draws to set about $95\%$ of the $\beta_j$'s to 0.
- Step 4: Simulate y: $\text{logit}(y_i)=X_i\boldsymbol{\beta}+\beta_0$, i.e. $y_i=\frac{e^{X_i\boldsymbol{\beta}+\beta_0}}{1+e^{X_i\boldsymbol{\beta}+\beta_0}}$ ($0<y_i<1$). A larger $y_i$ means a higher probability that individual $i$ is assigned as a case. Then draw from Bernoulli($y_i$), which yields 0 or 1, to classify individual $i$ as either a case (1) or a control (0). Finally, select all 1's (about $5\%$) as cases and randomly select an equal number of 0's as controls.
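
The four steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (`find_blocks`, `resample_blocks`, `simulate_phenotype`) are hypothetical, `X` is assumed to be a binary individuals-by-genes deletion matrix, the boundary gene is assumed to close its block, "non-repeated individuals" is read as using distinct donors across the blocks of one simulated individual, and $\sigma$ is treated as a standard deviation.

```python
import numpy as np

def find_blocks(X, min_genes=30):
    """Step 1: split gene columns into blocks, returned as (start, end), end exclusive."""
    n_genes = X.shape[1]
    cnv_free = ~X.astype(bool).any(axis=0)   # True if no individual has a CNV at this gene
    blocks, start = [], 0
    while start < n_genes:
        end = start + min_genes - 1          # index of the 30th gene in the block
        while end < n_genes and not cnv_free[end]:
            end += 1                         # try the next gene as the boundary
        end = min(end, n_genes - 1)          # assumption: last block takes the remainder
        blocks.append((start, end + 1))
        start = end + 1
    return blocks

def resample_blocks(X, blocks, n_sim, rng):
    """Step 2: build each simulated individual by stitching blocks from real donors."""
    n = X.shape[0]
    sim = np.empty((n_sim, X.shape[1]), dtype=X.dtype)
    for i in range(n_sim):
        # assumption: donors are distinct within one simulated individual when possible
        donors = rng.choice(n, size=len(blocks), replace=len(blocks) > n)
        for (s, e), d in zip(blocks, donors):
            sim[i, s:e] = X[d, s:e]
    return sim

def simulate_phenotype(X, prevalence=0.05, pi=0.043, mu=0.77, sigma=0.84, rng=None):
    """Steps 3-4: sparse log odds ratios, logistic risk, Bernoulli case/control labels."""
    if rng is None:
        rng = np.random.default_rng()
    m = X.shape[1]
    beta0 = np.log(prevalence / (1 - prevalence))   # baseline log odds
    beta = rng.normal(mu, sigma, size=m)            # log(OR) for every gene
    beta[rng.random(m) >= pi] = 0.0                 # keep an effect with probability pi
    prob = 1.0 / (1.0 + np.exp(-(X @ beta + beta0)))
    y = rng.binomial(1, prob)                       # 1 = case, 0 = control
    cases = np.flatnonzero(y == 1)
    pool = np.flatnonzero(y == 0)
    controls = rng.choice(pool, size=min(len(cases), len(pool)), replace=False)
    return beta, y, cases, controls

# Toy usage with made-up data (not the real CNV matrix).
rng = np.random.default_rng(1)
X_real = (rng.random((200, 120)) < 0.01).astype(int)
blocks = find_blocks(X_real)
X_sim = resample_blocks(X_real, blocks, n_sim=1000, rng=rng)
beta, y, cases, controls = simulate_phenotype(X_sim, rng=rng)
```

The loop in `resample_blocks` favors readability over speed; for $100,000$ simulated individuals a vectorized donor draw would likely be preferable.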

## Simulation parameters
1. $\beta_j \sim \text{Normal}(0.77, 0.84)$, where $\mu = 0.77$ and $\sigma = 0.84$ are calculated by `varbvs` over the whole genome.
2. penetrance = $0.05$
3. $\pi = 0.043$
4. Simulated sample size: $200,000$
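
A quick worked check of these settings may help when reading the results. It is a sketch that assumes $\sigma$ denotes a standard deviation and that $\pi$ applies independently to each of the $23,343$ genes listed under Analysis:

$$
\beta_0 \approx \log\frac{0.05}{0.95} \approx -2.94, \qquad
\text{median OR of a causal gene} = e^{\mu} = e^{0.77} \approx 2.16, \qquad
\mathbb{E}\big[\#\{j : \beta_j \neq 0\}\big] = \pi \times 23{,}343 \approx 1{,}004 .
$$

With $\pi = 0.043$, about $95.7\%$ of the $\beta_j$'s are expected to be 0, in line with the roughly $95\%$ stated in Step 3.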

## Analysis
1. Divide the samples-by-genes matrix ($23,856$ selected simulated samples $\times$ $23,343$ genes) into blocks.
2. Methods 
    1. Fisher's exact test: generate a $2 \times 2$ table for each block.
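
A per-block Fisher's exact test could look like the sketch below. How the $2 \times 2$ table is built is an assumption here: rows are cases versus controls, and columns are carriers (at least one deletion anywhere in the block) versus non-carriers. The helper names `block_table` and `fisher_exact_two_sided` are hypothetical. The two-sided p-value sums the hypergeometric probabilities of all tables that are no more likely than the observed one; `scipy.stats.fisher_exact` provides the same test if SciPy is available.

```python
from math import comb

import numpy as np

def block_table(X_block, y):
    """Build (a, b, c, d): case carriers, case non-carriers, control carriers, control non-carriers."""
    carrier = X_block.astype(bool).any(axis=1)   # assumption: any deletion in the block counts
    is_case = np.asarray(y) == 1
    a = int(np.sum(is_case & carrier))
    b = int(np.sum(is_case & ~carrier))
    c = int(np.sum(~is_case & carrier))
    d = int(np.sum(~is_case & ~carrier))
    return a, b, c, d

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c             # row totals and first column total
    denom = comb(r1 + r2, c1)

    def pmf(k):
        # hypergeometric probability of k carriers among the cases, margins fixed
        return comb(r1, k) * comb(r2, c1 - k) / denom

    lo, hi = max(0, c1 - r2), min(r1, c1)        # feasible range for the top-left cell
    p_obs = pmf(a)
    # small relative tolerance guards against floating-point ties
    p = sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Toy usage: 4 cases and 4 controls, one block of 2 genes (made-up data).
X_block = np.array([[1, 0], [0, 1], [1, 1], [0, 0],
                    [0, 0], [0, 0], [0, 0], [1, 0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
table = block_table(X_block, y)
p_value = fisher_exact_two_sided(*table)
```

Running this once per block yields one p-value per block, which would then need a multiple-testing correction across blocks.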

## Results
