## Integrated fine-mapping of non-coding disease variants with functional genomics data

### Background and Significance
Genome-wide association studies (GWAS) have identified a large number of common disease risk variants over the past ten years. Most of these variants are located in non-coding regions of the human genome, which are likely to contain regulatory elements or sequences, such as introns, enhancers and promoters (1-3). The functional interpretation of most of these non-coding variants remains unclear, and even identifying the causal variants is still challenging.

By default, chromatin is tightly coiled, which limits its accessibility. As a result, gene expression only happens when chromatin is in an "open" state. A genomic technique that has become widely used in recent years, the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), profiles sequences in open chromatin and can therefore map transcription factor binding sites and nucleosome positions (4, 5). ATAC-seq takes advantage of next-generation sequencing (NGS) technology: the number of reads mapped to a sequence mainly reflects how open the corresponding chromatin region is, at single-nucleotide resolution.

Because open chromatin regions (OCRs) often overlap with regulatory sequences, they can help detect functional non-coding risk variants for neuropsychiatric disorders such as schizophrenia, for example in induced pluripotent stem cell (iPSC)-derived neurons from adults (6, 7).

Another promising approach to this challenge is to focus on allele-specific open chromatin (ASoC) variants, which are characterized by allelic imbalance in sequencing reads at heterozygous single nucleotide polymorphism (SNP) sites. ASoC variants can be mapped by comparing the chromatin accessibility of the two alleles of a SNP in the same individual. The main advantage of ASoC mapping over another commonly used approach, expression quantitative trait locus (eQTL) mapping, lies in its direct identification of putative functional disease variants (8).
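As a rough illustration of how allelic imbalance at a single heterozygous SNP might be tested, the sketch below applies an exact two-sided binomial test to reference and alternate read counts. The function name, the null proportion of 0.5, and the read counts are assumptions made for illustration. The actual ASoC calling procedure is not specified in this proposal and may differ, for example by correcting for reference mapping bias.

```python
from math import comb


def allelic_imbalance_p(ref_reads: int, alt_reads: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one SNP.

    Under the null, each read comes from the reference allele with
    probability p_null (0.5 is an assumption: no mapping bias).
    Intended for modest read depths; very deep coverage would need a
    log-space implementation.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p_null**k * (1 - p_null) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical read counts at two heterozygous SNPs.
p_balanced = allelic_imbalance_p(10, 10)   # equal reads on both alleles
p_imbalanced = allelic_imbalance_p(18, 2)  # strong skew toward one allele
print(p_balanced, p_imbalanced)
```

In practice such p-values would be computed for all heterozygous SNPs with sufficient coverage and then adjusted for multiple testing.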

### Aim
Schizophrenia is a mental disorder that typically develops in late adolescence or early adulthood. It can be severe because of its associated cognitive impairment, difficulty in social activities, and potential to cause disability (9). Children may also develop schizophrenia (10). Its prevalence in the United States ranges between 0.25% and 0.64% (11).

The goal of this proposal is threefold: (i) to explore the enrichment of GWAS signals for schizophrenia, and for other neurodevelopmental disorders and common phenotypes as comparisons, in experimentally obtained OCRs and ASoC SNPs; (ii) to precisely identify causal variants among the ASoC variants associated with schizophrenia; and (iii) to investigate the mechanisms of the non-coding regulatory elements involved.

### Data
1. A set of 20 iPSC lines used for open chromatin mapping was reprogrammed by collaborators. The cell lines were then differentiated into neural progenitor cells (NPC), and subsequently into glutamatergic (iN-Glut), GABAergic (iN-GA), and dopaminergic (iN-DN) neurons. ATAC-seq is used to call OCR peaks (FDR<0.05) for each cell type. The median OCR peak length across all 5 cell types is 335 base pairs (bp), and the OCR peaks together cover about 4% of the human genome.

2. ASoC SNPs are obtained by testing heterozygous SNPs for allelic imbalance in ATAC-seq reads in iN-Glut neurons and NPCs.

3. Schizophrenia GWAS risk variants are obtained from the Psychiatric Genomics Consortium (PGC).

### Methods
#### Statistical model
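The statistical model that combines functional annotations with genetic fine-mapping is

$$\beta_{l_j} \sim (1 - \pi_{l_j})\,\delta_0 + \pi_{l_j}\,g(\cdot)$$
$$\log\left[\frac{\pi_{l_j}}{1 - \pi_{l_j}}\right] = \alpha_0 + \sum_{k=1}^m \alpha_k d_{l_j k}$$
where $\beta_{l_j}$ is the effect size of SNP $j$ in locus $l$; $\delta_0$ is a point mass at zero; $g(\cdot)$ is the effect-size distribution of causal variants; $\pi_{l_j}$ is the prior inclusion probability of SNP $j$ in locus $l$; $d_{l_j k}$ is the value of annotation $k$ for that SNP; and $\alpha_k$ is the enrichment of annotation $k$ in causal variants, on the log-odds scale.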
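To make the logistic link concrete, the following minimal sketch computes a prior inclusion probability from a SNP's annotations. The function name, the two example annotations, and all coefficient values are hypothetical and chosen only for illustration; in the actual analysis the $\alpha$ coefficients would be estimated from data.

```python
from math import exp


def prior_inclusion_probability(alpha0, alphas, annotations):
    """Return pi = sigmoid(alpha0 + sum_k alpha_k * d_k) for one SNP."""
    if len(alphas) != len(annotations):
        raise ValueError("need one enrichment coefficient per annotation")
    linear = alpha0 + sum(a * d for a, d in zip(alphas, annotations))
    return 1.0 / (1.0 + exp(-linear))


# Hypothetical values: baseline log-odds and enrichments for two binary
# annotations (say, "is an ASoC SNP" and "lies in an OCR").
alpha0 = -4.0
alphas = [2.0, 1.0]

pi_none = prior_inclusion_probability(alpha0, alphas, [0, 0])
pi_both = prior_inclusion_probability(alpha0, alphas, [1, 1])
print(round(pi_none, 4), round(pi_both, 4))
```

With positive enrichment coefficients, an annotated SNP receives a higher prior than an unannotated one, which is how the annotations inform fine-mapping.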

#### Numerical studies
To assess the advantage of this model over the conventional fine-mapping model, in which the prior $\pi_j$ is uniform across SNPs, we will perform numerical studies by simulating GWAS summary statistics and functional annotations under pre-specified fold-enrichment settings. We will fit the proposed model and compare it with a version that does not incorporate the enrichment information, using several metrics: 1) type I error and power to detect an association, 2) the size of fine-mapped credible sets (CS), 3) the average correlation between variants in each detected CS, and 4) the posterior inclusion probability (PIP) of the "true" causal variants.
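One simple way to generate data of this kind is sketched below. It is a deliberately simplified simulation under stated assumptions: a single binary annotation, independent SNPs (no LD), and z-scores drawn from a unit-variance normal whose mean is a fixed non-centrality value for causal SNPs. All names and parameter values are hypothetical, and a realistic study would simulate correlated genotypes.

```python
import random
from math import exp


def simulate_locus(n_snps, annot_freq, alpha0, alpha1, ncp=5.0, seed=0):
    """Simulate annotations, causal status and z-scores for independent SNPs.

    alpha1 is the log-odds enrichment of the annotation in causal SNPs;
    ncp is the mean z-score of a causal SNP (an illustrative assumption).
    """
    rng = random.Random(seed)
    annotations, causal, z_scores = [], [], []
    for _ in range(n_snps):
        d = 1 if rng.random() < annot_freq else 0
        pi = 1.0 / (1.0 + exp(-(alpha0 + alpha1 * d)))
        c = 1 if rng.random() < pi else 0
        annotations.append(d)
        causal.append(c)
        z_scores.append(rng.gauss(ncp * c, 1.0))
    return annotations, causal, z_scores


def causal_rate(annotations, causal, value):
    """Fraction of causal SNPs among SNPs whose annotation equals `value`."""
    selected = [c for d, c in zip(annotations, causal) if d == value]
    return sum(selected) / len(selected) if selected else float("nan")


annotations, causal, z_scores = simulate_locus(20000, 0.1, alpha0=-4.0, alpha1=2.0)
rate_in = causal_rate(annotations, causal, 1)
rate_out = causal_rate(annotations, causal, 0)
print(rate_in, rate_out)
```

Comparing `rate_in` with `rate_out` gives a quick sanity check that the simulated data carry the intended enrichment before any model is fitted.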

### OCR and ASoC informed GWAS fine-mapping for psychiatric disorders
To test whether schizophrenia-associated GWAS signals (SNPs) are enriched in particular functional genomic annotations, we will perform SNP-based enrichment analyses with TORUS (12), a tool based on a Bayesian hierarchical model. The annotations include 2 types of ASoC SNPs, ATAC-seq peaks called in 5 cell types, and commonly used functional regions such as promoters, coding regions and introns. We hypothesize that ASoC variants are even more enriched for functional disease variants than OCRs.

Ripke et al. identified 108 genome-wide significant schizophrenia loci in 2014 (13). We will employ SuSiE (14), a recently developed Bayesian variable selection and genetic fine-mapping software package, to estimate the probability that each SNP in these 108 loci is causal, as measured by its PIP. The analysis will incorporate external linkage disequilibrium (LD) information from the 1000 Genomes Project and, for each SNP, a prior inclusion probability that is a function of its genomic features.
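To illustrate how annotation-informed priors can change fine-mapping results, the sketch below computes PIPs and a credible set under a much simpler model than SuSiE: it assumes exactly one causal SNP per locus and uses Wakefield-style approximate Bayes factors computed from z-scores. This is a stand-in for illustration, not SuSiE's algorithm. The `shrinkage` parameter (the ratio of prior effect variance to total variance), the z-scores, and the prior values are all hypothetical.

```python
from math import exp, log


def single_effect_pips(z_scores, priors, shrinkage=0.5):
    """PIPs assuming exactly one causal SNP in the locus.

    log BF_j = 0.5 * log(1 - r) + 0.5 * r * z_j^2, with r = shrinkage.
    Priors need not sum to one; they are normalized implicitly.
    """
    log_w = [
        0.5 * log(1.0 - shrinkage) + 0.5 * shrinkage * z * z + log(p)
        for z, p in zip(z_scores, priors)
    ]
    top = max(log_w)  # subtract the maximum for numerical stability
    weights = [exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


def credible_set(pips, coverage=0.95):
    """Smallest set of SNP indices whose PIPs sum to at least `coverage`."""
    order = sorted(range(len(pips)), key=lambda j: pips[j], reverse=True)
    chosen, mass = [], 0.0
    for j in order:
        chosen.append(j)
        mass += pips[j]
        if mass >= coverage:
            break
    return chosen


z = [4.0, 4.0, 1.0]  # two SNPs in tight LD share the same signal
uniform = single_effect_pips(z, [1 / 3, 1 / 3, 1 / 3])
informed = single_effect_pips(z, [0.05, 0.40, 0.05])  # SNP 1 is annotated
print(uniform, informed, credible_set(informed))
```

With uniform priors the two equally associated SNPs cannot be distinguished, whereas the informed prior shifts most of the posterior mass to the annotated SNP.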

### References
1. M. J. Gandal et al., Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder. Science (2018).
2. P. Rajarajan et al., Neuron-specific signatures in the chromosomal connectome associated with schizophrenia risk. Science (2018).
3. M. Li et al., Integrative functional genomic analysis of human brain development and neuropsychiatric risks. Science (2018).
4. J. D. Buenrostro et al., Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods (2013).
5. J. D. Buenrostro et al., ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols in Molecular Biology (2015).
6. J. F. Degner et al., DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390-394 (2012).
7. R. E. Thurman et al., The accessible chromatin landscape of the human genome. Nature 489, 75-82 (2012).
8. S. Zhang et al., Landscape of allele-specific open chromatin in human iPSC-differentiated neurons and its implication for mental disorders. European Neuropsychopharmacology 29 (2019).
9. J. van Os et al., Schizophrenia. Lancet 374 (2009).
10. American Psychiatric Association. Schizophrenia. 295.90 (F20.9). American Psychiatric Publishing (2013).
11. E. Q. Wu et al., Annual prevalence of diagnosed schizophrenia in the USA: a claims data analysis approach. Psychol Med (2006).
12. X. Wen et al., Molecular QTL discovery incorporating genomic annotations using Bayesian false discovery rate control. Ann Appl Stat 10 (2016).
13. Schizophrenia Working Group of the Psychiatric Genomics Consortium, Biological insights from 108 schizophrenia-associated genetic loci. Nature 511 (2014).
14. G. Wang et al., A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv (2018).

## A statistical method to identify causal genes in copy number variation data

### Background and Significance
For over a decade, genome-wide association studies (GWAS) have been the main strategy for uncovering the genetic architecture of complex diseases and traits, and have detected tens of thousands of loci associated with complex traits, including some psychiatric disorders (1). GWAS typically focus on single nucleotide polymorphisms (SNPs). Despite their success in a wide range of disease applications, GWAS have major limitations: 1) risk alleles are usually at low frequencies and thus difficult to detect, especially in early-onset cognitive development disorders such as schizophrenia (SCZ) and autism (2); 2) most loci have small effect sizes and are unlikely to be deleterious mutations (3); 3) the causal variants or genes in GWAS loci are often unclear. As a result, SNPs explain only a limited share of heritability.

Copy number variations (CNVs), on the other hand, are believed to play a critical role in complex disease etiology (4). CNVs are large genomic insertion or deletion events, and are a type of structural variation of an organism's chromosomes. The length of CNVs varies enormously, ranging from 50 base pairs to kilobases or even megabases. Computational methods for calling CNVs from sequencing data have been developed, creating great opportunities for CNV analysis following DNA sequencing (5).

Existing methods for analyzing CNV data usually focus on either detecting associations between CNVs and disease risk, or identifying gene sets with a CNV burden (6, 7), for example, gene sets for which CNV deletions affecting the set are more common in cases than in controls.

### Aim
Previous studies have suggested that CNVs are an important source of genetic variation affecting neurodevelopmental disorders. It is estimated that 10% of individuals with developmental delay carry at least one CNV larger than 500 kilobases (6). However, because CNVs often span multiple genes, distinguishing the susceptibility or causal gene(s) from other genes in the same CNV event is a difficult problem. Results of CNV association analyses are thus difficult to interpret, which hinders the use of CNV data to unravel the genetics of complex diseases. The goal of this proposal is to develop a statistical framework that leverages genome-wide CNVs to map susceptibility genes of psychiatric diseases, and to link these findings with those from single-nucleotide variant (SNV)-based studies. The advantage of our method is that it is highly integrative: it can use enrichment results to re-rank genes and avoids a fixed cutoff.

### Methods
To address this challenge, we will develop a new approach that exploits large-scale genome-wide CNV data from case-control studies to map genes. It is inspired by statistical fine-mapping of causal variants in linkage-disequilibrium blocks from GWAS. Unlike existing approaches that directly test for CNV associations, our method seeks to identify the true susceptibility genes within CNV events in a rigorous statistical framework. Genome-wide CNV data are first clustered into disjoint analysis blocks, i.e. no CNV spans two blocks. For genes within a block, we test for disease association while accounting for the correlations among genes induced by CNVs in the same block. We accomplish this by extending a recently developed Bayesian variable selection method, SuSiE (7). Our method thus selects, among multiple correlated genes, a small number of putative risk genes that best explain the CNV-phenotype data. Furthermore, we leverage knowledge of known biological pathways to set the prior probabilities of genes in CNV events, which increases power.

Our model estimates posterior probabilities for all putative risk genes as well as 95% credible sets (i.e. sets of genes that cover the risk genes with high probability). Using this new approach, we will perform gene-level analysis of several case-control CNV datasets in SCZ. Because our method reports the statistical confidence of each gene, it can be integrated with other gene-level datasets, e.g. results from exome-sequencing studies. This provides a powerful strategy for integrating data from independent sources.
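The block-construction step described above can be sketched as an interval-merging pass over CNV coordinates. The code below is a minimal illustration under stated assumptions: CNVs are given as hypothetical `(chromosome, start, end)` tuples with closed coordinates, and two CNVs belong to the same block whenever they overlap directly or through a chain of overlapping CNVs. The function name and the example coordinates are made up.

```python
from collections import defaultdict


def cluster_cnvs(cnvs):
    """Group CNVs into disjoint blocks so that no CNV spans two blocks.

    cnvs: list of (chrom, start, end) tuples.
    Returns a list of dicts with the block span and member CNV indices.
    """
    by_chrom = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(cnvs):
        by_chrom[chrom].append((start, end, idx))

    blocks = []
    for chrom in sorted(by_chrom):
        current = None
        for start, end, idx in sorted(by_chrom[chrom]):
            if current is not None and start <= current["end"]:
                # Overlaps the open block: extend it and record the member.
                current["end"] = max(current["end"], end)
                current["members"].append(idx)
            else:
                # No overlap: start a new block on this chromosome.
                current = {"chrom": chrom, "start": start, "end": end, "members": [idx]}
                blocks.append(current)
    return blocks


# Hypothetical CNV calls pooled across cases and controls.
example_cnvs = [
    ("chr1", 100, 500),
    ("chr1", 400, 900),
    ("chr1", 1500, 2000),
    ("chr2", 100, 300),
    ("chr1", 850, 1000),
]
blocks = cluster_cnvs(example_cnvs)
for block in blocks:
    print(block)
```

Genes falling inside each resulting block would then be analyzed jointly, since the CNVs in that block induce correlation among them.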

To infer the CNV configuration $B(Z)$ from case-control data, we leverage the statistical machinery of Bayesian regression. Specifically, let $\beta_j$ be the effect size of the $j$-th gene and let $Z_j$ indicate whether gene $j$ is a risk gene.

$$\beta_j \mid Z_j = 0 \sim \delta_0, \qquad \beta_j \mid Z_j = 1 \sim N(\mu,\sigma^2), \qquad \text{logit}\, P(y_i = 1) = \beta_0 + \rho w_i + \sum_j \beta_j x_{ij}$$
where $\delta_0$ is a point mass at zero, $y_i$ is the case-control status of subject $i$, $w_i$ denotes possible covariates (such as the total number of CNVs in subject $i$) and $\rho$ their effects. Inference on this model yields the posterior support for each CNV configuration $B(Z)$.
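The following sketch evaluates the logistic part of this model for a single subject, to show how genes with $Z_j = 0$ drop out of the risk. It makes one assumption not stated in the text: $x_{ij}$ is taken to be a 0/1 indicator of whether gene $j$ is affected by a CNV in subject $i$. The function names and all numeric values are hypothetical.

```python
import random
from math import exp


def sample_effects(z, mu, sigma, rng):
    """Draw gene effects from the spike-and-slab prior: 0 if Z_j = 0, else N(mu, sigma^2)."""
    return [rng.gauss(mu, sigma) if z_j == 1 else 0.0 for z_j in z]


def case_probability(beta0, rho, w_i, betas, x_i):
    """P(y_i = 1) from logit P = beta0 + rho * w_i + sum_j beta_j * x_ij."""
    eta = beta0 + rho * w_i + sum(b * x for b, x in zip(betas, x_i))
    return 1.0 / (1.0 + exp(-eta))


# Three genes in one block; only the second is a risk gene (Z = [0, 1, 0]).
z = [0, 1, 0]
betas = [0.0, 1.2, 0.0]  # fixed illustrative effects consistent with Z
sampled = sample_effects(z, mu=1.0, sigma=0.5, rng=random.Random(1))

# Assumed encoding: x_ij = 1 if a CNV in subject i affects gene j.
p_hits_risk_gene = case_probability(-2.0, 0.1, 3, betas, [1, 1, 0])
p_misses_risk_gene = case_probability(-2.0, 0.1, 3, betas, [1, 0, 0])
print(p_hits_risk_gene, p_misses_risk_gene)
```

Two subjects with the same CNV count differ in risk only through the risk gene, which is the signal the method would use to separate causal genes from passengers in the same CNV.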

### Integrated association analysis of CNV and SNP data for causal gene mapping in psychiatric disorders

We will apply the method to individual-level schizophrenia CNV data and non-individual-level case-control CNV data:
1. Swedish Schizophrenia Population-Based Case-control Exome Sequencing CNV data from dbGaP.
2. Schizophrenia case-control CNV data from the International Schizophrenia Consortium (ISC) study.
3. Individual-level genotype and phenotype datasets specific to autism, containing CNV data, from dbGaP.

### References
1. J. Gratten et al., Large-scale genomics unveils the genetic architecture of psychiatric disorders. Nature Neuroscience 17 (2014).
2. B. Devlin et al., Genetic architecture in autism spectrum disorder. Current Opinion in Genetics & Development 22 (2012).
3. J. Hardy et al., Genomewide association studies and human disease. The New England Journal of Medicine 360 (2009).
4. C. Lowther et al., Genomic Disorders in Psychiatry-What Does the Clinician Need to Know? Current Psychiatry Reports 19 (2017).
5. C. Alkan et al., Genome structural variation discovery and genotyping. Nature Reviews Genetics 12 (2011).
6. S. Girirajan et al., Human copy number variation and complex genetic disease. Annual Review of Genetics 45 (2011).
7. G. Wang et al., A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv (2018).