Using regulatory genomics data to interpret the function of disease variants and prioritise genes from expression studies

Introduction

Discovering and bringing new drugs to the market is a long, expensive and inefficient process [1, 2]. The majority of drug discovery programmes fail for efficacy reasons [3], with up to 40% of these failures due to lack of a clear link between the target and the disease under investigation [4]. Target selection, the first step in drug discovery programmes, is thus a critical decision point. It has previously been shown that therapeutic targets with a genetic link to the disease under investigation are more likely to progress through the drug discovery pipeline, suggesting that genetics can be used as a tool to prioritise and validate drug targets in early discovery [5, 6].

One of the biggest challenges in translating findings from genome-wide association studies (GWASs) to therapies is that the great majority of single nucleotide polymorphisms (SNPs) associated with disease are found in non-coding regions of the genome, and therefore cannot be easily linked to a target gene [7]. Many of these SNPs could be regulatory variants, affecting the expression of nearby or distal genes by interfering with the transcriptional process [8].

The most established way to map disease-associated regulatory variants to target genes is to use expression quantitative trait loci (eQTLs) [9], variants that affect the expression of specific genes. The GTEx consortium profiled eQTLs across 44 human tissues by performing a large-scale mapping of genome-wide correlations between genetic variants and gene expression [10]. However, depending on the power of the study, it might not be possible to detect all existing regulatory variants as eQTLs. An alternative is to use information on the location of promoters and distal enhancers across the genome and link these regulatory elements to their target genes. Large, multi-centre initiatives such as ENCODE [11], Roadmap Epigenomics [12] and BLUEPRINT [13, 14] mapped regulatory elements in the genome by profiling a number of chromatin features, including DNase hypersensitive sites (DHSs), several types of histone marks and binding of chromatin-associated proteins in a large number of cells and tissues. Similarly, the FANTOM consortium used cap analysis of gene expression (CAGE) to identify promoters and enhancers across hundreds of cells and tissues [15].

However, knowing that a certain stretch of DNA is an enhancer does not reveal its target gene(s). One way to infer links between enhancers and promoters in silico is to identify significant correlations across a large panel of cell types, an approach that has been used for distal and promoter DHSs [16] as well as for CAGE-defined promoters and enhancers [17]. Experimental methods to assay interactions between regulatory elements also exist. Chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) [18, 19] couples chromatin immunoprecipitation with DNA ligation to identify DNA regions that interact through the binding of a specific protein. Promoter capture Hi-C [20, 21] extends chromatin conformation capture by using "baits" to enrich for promoter interactions and increase resolution.
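The correlation-based strategy described above can be illustrated with a minimal sketch. It assumes that each enhancer and promoter has a signal measured across the same panel of cell types, and it links a pair when their Pearson correlation exceeds a fixed cut-off. The element names, the toy values and the `min_r` threshold are hypothetical, and the fixed cut-off is only a stand-in for the significance testing used in the published analyses [16, 17].

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no information about co-variation
    return cov / sqrt(var_x * var_y)

def link_enhancers(enhancers, promoters, min_r=0.8):
    """Return (enhancer, gene, r) for every pair whose correlation is >= min_r.

    min_r is an illustrative cut-off, not a value taken from the cited studies.
    """
    links = []
    for enhancer_id, enhancer_signal in enhancers.items():
        for gene, promoter_signal in promoters.items():
            r = pearson(enhancer_signal, promoter_signal)
            if r >= min_r:
                links.append((enhancer_id, gene, round(r, 3)))
    return links

# Toy signals across five hypothetical cell types (same order in every list).
enhancers = {
    "enh1": [1.0, 5.0, 9.0, 2.0, 7.0],
    "enh2": [4.0, 4.5, 3.5, 4.2, 4.1],
}
promoters = {
    "GENE_A": [2.0, 10.5, 18.0, 4.5, 13.0],
    "GENE_B": [8.0, 1.0, 3.0, 9.0, 2.0],
}

links = link_enhancers(enhancers, promoters)
print(links)
```

In this toy panel only `enh1` and `GENE_A` vary together across cell types, so they form the single reported link. A real analysis would additionally restrict candidate pairs by genomic distance and correct for multiple testing.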

Overall, linking genetic variants to their candidate target genes is not straightforward, not only because of the complexity of the human genome and transcriptional regulation, but also because of the variety of data types and approaches that can be used. To address this problem, we developed STOPGAP, a database of disease variants mapped to their most likely target gene(s) using several different types of regulatory genomic data [22]. The database is currently undergoing a major overhaul and will eventually be superseded by POSTGAP. A valid and recent alternative is INFERNO [23], although it relies only on eQTL data for target gene assignment. These resources implement some or all of the approaches that will be reviewed in the workflow and constitute good entry points for identifying the most likely target gene(s) of regulatory SNPs. However, as they tend to hide much of the complexity involved in the process, we will not use them and will rely on the original datasets instead.

In this workflow, we will explore how regulatory genomic data can be used to connect the genetic and transcriptional layers by providing a framework for the discovery of novel therapeutic targets. We will use eQTL data from GTEx [10], FANTOM5 correlations between promoters and enhancers [17] and promoter capture Hi-C data [21] to annotate significant GWAS variants to putative target genes and to prioritise genes obtained from a differential expression analysis (Figure 1).
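As a rough illustration of the logic outlined above, the sketch below maps SNPs to genes using two kinds of evidence and then keeps only genes that are also differentially expressed. It is not the workflow's implementation: the SNP identifiers, coordinates, gene names and source labels are all hypothetical toy values, eQTL evidence is assumed to arrive as SNP-to-gene pairs, and FANTOM5 or capture Hi-C evidence is assumed to arrive as genomic intervals already linked to a gene.

```python
def annotate_snps(snps, eqtls, regions):
    """Collect (rsid, gene, source) evidence for each SNP.

    snps:    list of (rsid, chrom, pos)
    eqtls:   dict mapping rsid -> list of genes (eQTL-style evidence)
    regions: list of (chrom, start, end, gene, source) intervals linked to a gene
    """
    hits = []
    for rsid, chrom, pos in snps:
        for gene in eqtls.get(rsid, []):
            hits.append((rsid, gene, "eqtl"))
        for r_chrom, start, end, gene, source in regions:
            # Closed-interval overlap test on the same chromosome.
            if chrom == r_chrom and start <= pos <= end:
                hits.append((rsid, gene, source))
    return hits

def prioritise(hits, de_genes):
    """Rank differentially expressed genes by number of supporting sources."""
    support = {}
    for _rsid, gene, source in hits:
        if gene in de_genes:
            support.setdefault(gene, set()).add(source)
    ranked = [(gene, sorted(sources)) for gene, sources in support.items()]
    # Most supported genes first, ties broken alphabetically.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked

# Toy inputs; every identifier and coordinate here is made up.
snps = [("rs_toy1", "chr1", 1500), ("rs_toy2", "chr1", 5500), ("rs_toy3", "chr2", 900)]
eqtls = {"rs_toy1": ["GENE_A"]}
regions = [
    ("chr1", 1000, 2000, "GENE_A", "enhancer_correlation"),
    ("chr1", 5000, 6000, "GENE_B", "capture_hic"),
    ("chr2", 100, 500, "GENE_C", "capture_hic"),
]
de_genes = {"GENE_A", "GENE_C"}

hits = annotate_snps(snps, eqtls, regions)
ranked = prioritise(hits, de_genes)
print(ranked)
```

Here `GENE_A` is retained because it is differentially expressed and supported by two independent sources, `GENE_B` is dropped because it is not differentially expressed, and `GENE_C` is dropped because no SNP falls within its linked region.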

Using regulatory genomics data to interpret the function of disease variants and prioritise genes from expression studies

Introduction

Workflow

Install required packages

Gene expression data and differential gene expression analysis

Accessing GWAS data

Annotation of coding and proximal SNPs to target genes

Use of regulatory genomic data to map intronic and intergenic SNPs to target genes

eQTL data

FANTOM5 data

Promoter Capture Hi-C data

Functional analysis of prioritised hits

Conclusions

Abbreviations

Data and software availability

Competing interests

Grant information

References
