## Introduction (Use dot for numbering)
1. Copy number variations (CNVs): large genomic insertion or deletion events;
2. Length of CNVs: range from 50 base pairs to kilobases or even megabases;
3. Previous CNV research: (1) detection of CNVs associated with disease risks, (2) identification of gene sets with CNV burden;
4. Challenge: CNVs often span multiple genes (1-30 or more), so it is unclear which gene within a CNV event confers the genetic susceptibility.

Figure 5

## Motivation and Aim
1. Genome-wide association studies (GWAS) have limitations: (1) risk alleles are usually at low frequencies and difficult to detect, (2) most loci have small effect sizes and are unlikely to be deleterious mutations, (3) the causal variants or genes in GWAS loci are often unclear;
2. Schizophrenia (SCZ) symptoms: (1) early-onset, usually late adolescence and early adulthood, (2) cognitive impairment/dysfunction, a core feature of SCZ, (3) difficulty in social activities, (4) potential to cause disability;
3. Inspired by statistical fine-mapping of causal variants in linkage-disequilibrium blocks from GWAS, we aim to develop a new approach that exploits large-scale genome-wide CNV data in case-control studies to map genes for psychiatric disorders;
4. The approach can be integrated with other gene-level datasets, e.g. results from exome-sequencing studies.

## Data available
1. Swedish schizophrenia population-based case-control exome sequencing CNV data from dbGaP;
2. Schizophrenia case-control CNV data from the International Schizophrenia Consortium (ISC) study;
3. hg19 refGene from UCSC genome annotation database.

## Approach and algorithm
Suppose the genome is divided into disjoint blocks such that no CNV is shared between blocks (i.e. CNVs may overlap only within a block). Within a block $R$, we may have multiple, possibly overlapping, CNV events, and we assume there is at least one causal gene in $R$. To infer the CNV-gene configuration from case-control data, we leverage the statistical machinery of Bayesian regression.
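
One way to picture the block construction is as interval merging: CNVs that overlap, directly or through a chain of other CNVs, end up in the same block, and a new block starts wherever there is a gap. The sketch below is only an illustration under simplifying assumptions (a single chromosome, inclusive integer coordinates, CNVs only and no gene annotation); the function name `merge_into_blocks` and the toy coordinates are hypothetical and not taken from the study.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive, on one chromosome


def merge_into_blocks(cnvs: List[Interval]) -> List[List[Interval]]:
    """Group CNVs so that CNVs in different blocks never overlap."""
    blocks: List[List[Interval]] = []
    block_end = None  # right-most coordinate covered by the current block
    for start, end in sorted(cnvs):
        if block_end is None or start > block_end:
            # Gap before this CNV: it opens a new block.
            blocks.append([(start, end)])
            block_end = end
        else:
            # Overlaps the current block: extend it.
            blocks[-1].append((start, end))
            block_end = max(block_end, end)
    return blocks


# Toy CNV calls (hypothetical coordinates).
cnvs = [(100, 500), (400, 900), (2000, 2500), (850, 1200), (2400, 2600)]
blocks = merge_into_blocks(cnvs)
```

Sorting by start position means each CNV only needs to be compared with the running end of the current block, so the grouping takes a single pass after the sort.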

Assume a spike-and-slab mixture prior for $\beta_j$ and a logistic regression model for the phenotype:
$$\beta_{j} \sim (1 - \pi_{j})\delta_0 + \pi_{j}g(\cdot)$$
where $g(\cdot)$ is the density of $N(\mu,\sigma^2)$, and
$$\text{logit P}(y_i = 1) =\log\big[\frac{\text{P}(y_i = 1)}{1 - \text{P}(y_i = 1)} \big] = \beta_0 + \sum_{j=1}^m \beta_j d_{ij}$$

$\pi_{j}$: prior inclusion probability of the $j$-th gene in a CNV-gene block

$y_i$: the phenotypic status of sample $i$

$\beta_0$: intercept (baseline log-odds)

$\beta_j$: effect size of the $j$-th gene

$d_{ij}$: the overlap status of the $j$-th gene with a CNV event in sample $i$

$\mu, \sigma$: prior mean and standard deviation of the slab, where $\mu \neq 0$
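
To make the notation concrete, the following sketch builds one row of the overlap indicators $d_{ij}$ from interval coordinates and evaluates the logistic model for that sample. It is a minimal illustration, not the study's pipeline: the helper names (`overlaps`, `overlap_row`, `case_probability`), the inclusive-coordinate convention, and all numeric values are assumptions chosen for the example.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_row(sample_cnvs: Sequence[Interval],
                genes: Sequence[Interval]) -> List[int]:
    """d_ij for one sample: 1 if gene j is hit by any of the sample's CNVs."""
    return [int(any(overlaps(c, g) for c in sample_cnvs)) for g in genes]


def case_probability(d_row: Sequence[int], effects: Sequence[float],
                     intercept: float) -> float:
    """Inverse logit of intercept + sum_j effect_j * d_ij."""
    eta = intercept + sum(b * d for b, d in zip(effects, d_row))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical block with three genes and one sample carrying a single CNV.
genes = [(100, 200), (300, 400), (500, 600)]
d = overlap_row([(150, 350)], genes)          # the CNV covers genes 1 and 2
effects = [0.0, 0.98, 0.0]                    # only gene 2 has an effect
intercept = math.log(0.05 / 0.95)             # baseline risk of 0.05
p_carrier = case_probability(d, effects, intercept)
p_baseline = case_probability([0, 0, 0], effects, intercept)
```

Because the CNV covers genes 1 and 2 together, the data alone cannot say which of the two drives the increased risk; this is the ambiguity the prior and the other samples in the block are meant to resolve.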

## Simulation processes
1. Obtain CNV-gene blocks: simulate CNV-gene blocks containing at least a certain number of genes/exons (no overlapping CNVs between blocks). 
2. Obtain genome-wide CNV-gene pattern (X matrix): randomly sample each block and merge them together as a simulated individual. Repeat this process $100,000$ times.
3. Obtain spike-and-slab prior: set penetrance/prevalence ($p$) to $0.05$, so that $p \approx \frac{e^{\beta_0}}{1+e^{\beta_0}}$ and $\beta_0 \approx \log \frac{p}{1-p}$. Set $\pi_j = 0.05$, so that about $95\%$ of the $\beta_j$ are set to 0. The odds ratio (OR) for the $j$-th gene is $\exp(\beta_j)$.
4. Obtain phenotype $y$: first calculate the case probability $p_i=\frac{e^{x_i\boldsymbol{\beta}+\beta_0}}{1+e^{x_i\boldsymbol{\beta}+\beta_0}}$ for each individual $i$, then draw $y_i \sim \text{Bernoulli}(p_i)$ to label each individual as a case ($1$) or a control ($0$). Select all cases (about $5\%$) and randomly select an equal number of controls.
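
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the simulation code behind the results: each block is represented by a small pool of hypothetical 0/1 overlap patterns, the slab parameters are borrowed from the `varbvs` estimates reported in this document purely as placeholder values, and the function name `simulate_case_control`, the seed, and the sample size are all assumptions.

```python
import math
import random


def simulate_case_control(block_patterns, n_individuals=5000, pi=0.05,
                          mu=0.777, sigma=0.844, prevalence=0.05, seed=7):
    """Simulate effects, genotypes and a balanced case-control sample."""
    rng = random.Random(seed)
    n_genes = sum(len(patterns[0]) for patterns in block_patterns)

    # Spike-and-slab draw: beta_j is 0 with probability 1 - pi,
    # otherwise it is drawn from the N(mu, sigma^2) slab.
    beta = [rng.gauss(mu, sigma) if rng.random() < pi else 0.0
            for _ in range(n_genes)]
    beta0 = math.log(prevalence / (1.0 - prevalence))

    cases, controls = [], []
    for _ in range(n_individuals):
        # One pattern per block, concatenated into a genome-wide row of X.
        x = [d for patterns in block_patterns for d in rng.choice(patterns)]
        eta = beta0 + sum(b * d for b, d in zip(beta, x))
        p_i = 1.0 / (1.0 + math.exp(-eta))
        # Bernoulli(p_i) draw decides case vs control.
        (cases if rng.random() < p_i else controls).append(x)

    # Keep all cases and an equal-sized random subset of controls.
    controls = rng.sample(controls, min(len(cases), len(controls)))
    return beta, cases, controls


# Two hypothetical blocks (3 genes and 2 genes); the all-zero pattern is
# repeated so that most simulated individuals carry no CNV in a block.
block_patterns = [
    [(0, 0, 0)] * 8 + [(1, 1, 0), (0, 1, 1)],
    [(0, 0)] * 9 + [(1, 1)],
]
beta, cases, controls = simulate_case_control(block_patterns)
```

With only five genes and $\pi_j = 0.05$, most runs of this toy example will have no causal gene at all; a realistic run would use many more blocks, as in step 2.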

## Simulation results
1. Use the R package `varbvs` to obtain prior parameters for the MCMC method: $\pi = 0.0438$, $\mu = 0.777$ and $\sigma = 0.844$;
2. Map susceptibility genes within one CNV-gene block using the R packages `susieR` (SuSiE) and `varbvs` to obtain posterior inclusion probabilities (PIPs) and, potentially, credible sets (CS);
3. Perform Bayesian logistic regression using the Python package `PyMC3`.
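
For the MCMC approach, a PIP can be estimated as the fraction of posterior draws in which a gene is included in the model. The sketch below assumes the sampler exposes a binary inclusion indicator per gene and per draw, which is an assumption about how the model is parameterized rather than something stated here; the function name and the toy draws are hypothetical.

```python
from typing import List, Sequence


def posterior_inclusion_probabilities(
        draws: Sequence[Sequence[int]]) -> List[float]:
    """Monte Carlo PIP: mean of the 0/1 inclusion indicator over draws."""
    n_draws = len(draws)
    n_genes = len(draws[0])
    return [sum(draw[j] for draw in draws) / n_draws for j in range(n_genes)]


# Four hypothetical posterior draws for a block of three genes.
indicator_draws = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
]
pips = posterior_inclusion_probabilities(indicator_draws)
```

In practice the estimate is only as good as the chain: with a few thousand draws, differences in the second or third decimal place between genes should not be over-interpreted.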

        Block sharing CNV-gene pattern: 
        Block not sharing CNV-gene pattern: 
        
        
- Block example 1 (5 genes, 1 positive effect gene): 

Pattern??? 2by2???

|gene index|simulated effect|SuSiE|varbvs|PyMC3|
|:---:|:---:|:---:|:---:|:---:|
|1|0|0.0442|0.0913|0.0290|
|2|0|0.0442|0.0913|0.0370|
|3|0.9806|**0.8173**|**0.6076**|**0.4680**|
|4|0|0.0501|0.0804|0.0255|
|5|0|0.0442|0.0913|0.0310|

- Block example 2 (14 genes, 2 positive effect genes):

|gene index|simulated effect|SuSiE|varbvs|PyMC3|
|:---:|:---:|:---:|:---:|:---:|
|1|0|0.0414|0.0500|0.0385|
|2|0|0.0414|0.0500|0.0365|
|3|0.60|0.2900|0.2861|0.3040|
|4|0|0.0570|0.0617|0.0580|
|5|0|0.0570|0.0617|0.0505|
|6|0|0.0570|0.0617|0.0650|
|7|0|0.0570|0.0617|0.0495|
|8|0|0.0570|0.0617|0.0475|
|9|0|0.0570|0.0617|0.0640|
|10|0|0.0570|0.0617|0.0635|
|11|0|0.0570|0.0617|0.0585|
|12|0|0.0570|0.0617|0.0485|
|13|0|0.0570|0.0617|0.0580|
|14|1.07|0.0570|0.0617|0.0580|

- Block example 3 (12 genes, 3 positive effect genes):

|gene index|simulated effect|SuSiE|varbvs|PyMC3|
|:---:|:---:|:---:|:---:|:---:|
|1|0|8.8818e-16|0.0435|0.0240|
|2|0|3.3307e-16|0.0330|0.0195|
|3|0|3.3307e-16|0.0330|0.0110|
|4|0|3.3307e-16|0.0330|0.0150|
|5|1.2865|**0.4999**|1|0.5460|
|6|0.5374|**0.4999**|0.0192|0.5185|
|7|0|7.9936e-15|0.0338|0.0185|
|8|0|7.9936e-15|0.0338|0.0230|
|9|0|7.9936e-15|0.0338|0.0300|
|10|0|5.2757e-13|0.0614|0.1270|
|11|0|5.2757e-13|0.0614|0.1325|
|12|0.9833|5.2757e-13|0.0614|0.1080|

## Real data