Skip to content

bioc/denyranges

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Deny ranges

Genomic ranges of deny (aka problematic, exclusion, blacklisted) regions that should be avoided when working with genomic data. For human, mouse, and selected other organisms.

Lifecycle: experimental

Basics

denyranges is an experimental data package, to be submitted as an AnnotatiohHub Bioconductor package.

denyranges contains genomic coordinates of problematic genomic regions. See Amemiya, Haley M., Anshul Kundaje, and Alan P. Boyle. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome.” Scientific Reports, December 2019. Boyle-Lab/Blacklist/

TL;DR - For human hg38 genome assembly, Anshul recommends ENCFF356LFX.

BED files of exclusion regions are available on the ENCODE project website. Human (hg19, hg38) and mouse (mm9, mm10) exclusion regions are available. However, exclusion lists generated by multiple labs often create uncertainty what to use. The purpose of this package is to provide a unified place for informed retrieval of exclusion regions.

Naming convention: <genome assembly>.<lab>.<original file name>, e.g., hg19.Birney.wgEncodeDacMapabilityConsensusExcludable.

Download all data from the Google Drive folder

See inst/scripts/make-data.R how to create the denyranges GRanges objects.

Object Number of regions Assembly Lab Number of columns Source
ce10.Kundaje.ce10-blacklist.rds 122 ce10 Anshul Kundaje, Stanford 3 ce10-C.elegans
dm3.Kundaje.dm3-blacklist.rds 492 dm3 Anshul Kundaje, Stanford 3 dm3-D.melanogaster
hg19.Bernstein.Mint_Blacklist_hg19.rds 9035 hg19 Bradley Bernstein, Broad 6 ENCFF200UUD
hg19.Birney.wgEncodeDacMapabilityConsensusExcludable.rds 411 hg19 Ewan Birney, EBI 6 ENCFF001TDO
hg19.Crawford.wgEncodeDukeMapabilityRegionsExcludable.rds 1649 hg19 Gregory Crawford, Duke 6 ENCFF001THR
hg19.Wold.hg19mitoblack.rds 295 hg19 Barbara Wold, Caltech 3 ENCFF055QTV
hg19.Yeo.eCLIP_blacklistregions.hg19.rds 57 hg19 Gene Yeo, UCSD 6 ENCFF039QTN
hg38.Bernstein.Mint_Blacklist_GRCh38.rds 12052 hg38 Bradley Bernstein, Broad 6 ENCFF023CZC
hg38.Kundaje.GRCh38.blacklist.rds 38 hg38 Anshul Kundaje, Stanford 3 ENCFF356LFX
hg38.Kundaje.GRCh38_unified_blacklist.rds 910 hg38 Anshul Kundaje, Stanford 3 ENCFF419RSJ
hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds 401 hg38 Tim Reddy, Duke 6 ENCFF220FIN
hg38.Wold.hg38mitoblack.rds 299 hg38 Barbara Wold, Caltech 3 ENCFF940NTE
hg38.Yeo.eCLIP_blacklistregions.hg38liftover.bed.fixed.rds 56 hg38 Gene Yeo, UCSD 6 ENCFF269URO
mm10.Hardison.blacklist.full.rds 7865 mm10 Ross Hardison, PennState 3 ENCFF790DJT
mm10.Hardison.psublacklist.mm10.rds 5552 mm10 Ross Hardison, PennState 3 ENCFF226BDM
mm10.Kundaje.anshul.blacklist.mm10.rds 3010 mm10 Anshul Kundaje, Stanford 3 ENCFF999QPV
mm10.Kundaje.mm10.blacklist.rds 164 mm10 Anshul Kundaje, Stanford 3 ENCFF547MET
mm10.Wold.mm10mitoblack.rds 123 mm10 Barbara Wold, Caltech 3 ENCFF759PJK
mm9.Wold.mm9mitoblack.rds 123 mm9 Barbara Wold, Caltech 3 ENCFF299EZH

Install denyranges

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("mdozmorov/denyranges")

## Check that you have a valid Bioconductor installation
# BiocManager::valid()

Use denyranges

# hg38 denyranges coordinates
download.file(url = "https://drive.google.com/uc?export=download&id=1Q00zluvfLGyUHlyOBgQ2ECd9QHCLpnXk", destfile = "hg38.Kundaje.GRCh38_unified_blacklist.rds")
denyGR.hg38.Kundaje.1 <- readRDS(file = "hg38.Kundaje.GRCh38_unified_blacklist.rds")
denyGR.hg38.Kundaje.1
> denyGR.hg38.Kundaje.1
GRanges object with 910 ranges and 0 metadata columns:
        seqnames            ranges strand
           <Rle>         <IRanges>  <Rle>
    [1]     chr1     628903-635104      *
    [2]     chr1   5850087-5850571      *
    [3]     chr1   8909610-8910014      *
    [4]     chr1   9574580-9574997      *
    [5]     chr1 32043823-32044203      *
    ...      ...               ...    ...
  [906]     chrY 11290797-11334278      *
  [907]     chrY 11493053-11592850      *
  [908]     chrY 11671014-11671046      *
  [909]     chrY 11721528-11749472      *
  [910]     chrY 56694632-56889743      *
  -------
  seqinfo: 24 sequences from hg38 genome

We can load other deny regions for the hg38 genome assembly and compare them.

download.file(url = "https://drive.google.com/uc?export=download&id=1xYPG6bufZb-VrlmOBQ5DaHlEy3gztZrR", destfile = "hg38.Bernstein.Mint_Blacklist_GRCh38.rds")
denyGR.hg38.Bernstein <- readRDS(file = "hg38.Bernstein.Mint_Blacklist_GRCh38.rds")

download.file(url = "https://drive.google.com/uc?export=download&id=1AoHK_6m_i_Qhw0jDO5BUwxf8NXo58x35", destfile = "hg38.Kundaje.GRCh38.blacklist.rds")
denyGR.hg38.Kundaje.2 <- readRDS(file = "hg38.Kundaje.GRCh38.blacklist.rds")

download.file(url = "https://drive.google.com/uc?export=download&id=1A7Ih8mxibhuWnB64-kAR8Mf63SFFZZmv", destfile = "hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds")
denyGR.hg38.Reddy <- readRDS(file = "hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds")

download.file(url = "https://drive.google.com/uc?export=download&id=1f2SBVNOlcXKpemM7N64-zsYJWnMbY97J", destfile = "hg38.Wold.hg38mitoblack.rds")
denyGR.hg38.Wold <- readRDS(file = "hg38.Wold.hg38mitoblack.rds")

download.file(url = "https://drive.google.com/uc?export=download&id=1yc60OgoNdAAub1MdHApvOoZB7p9G3YEB", destfile = "hg38.Yeo.eCLIP_blacklistregions.hg38liftover.bed.fixed.rds")
denyGR.hg38.Yeo <- readRDS(file = "hg38.Yeo.eCLIP_blacklistregions.hg38liftover.bed.fixed.rds")

Compare the number of deny regions.

library(ggplot2)
mtx_to_plot <- data.frame(Count = c(length(denyGR.hg38.Bernstein), 
                                    length(denyGR.hg38.Kundaje.1), 
                                    length(denyGR.hg38.Kundaje.2), 
                                    length(denyGR.hg38.Reddy), 
                                    length(denyGR.hg38.Wold), 
                                    length(denyGR.hg38.Yeo)),
                          Source = c("Bernstein.Mint_Blacklist_GRCh38", 
                                     "Kundaje.GRCh38_unified_blacklist", 
                                     "Kundaje.GRCh38.blacklist", 
                                     "Reddy.wgEncodeDacMapabilityConsensusExcludable", 
                                     "Wold.hg38mitoblack", 
                                     "Yeo.eCLIP_blacklistregions.hg38liftover.bed"))
# Order Source by the number of regions
mtx_to_plot$Source <- factor(mtx_to_plot$Source, levels = mtx_to_plot$Source[order(mtx_to_plot$Count)])

ggplot(mtx_to_plot, aes(x = Source, y = Count, fill = Source)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_bw() + theme(legend.position = "none")
ggsave("denyranges_hg38_count.png", width = 5.5, height = 2)

Compare the width of deny regions. log2 scale because of heavy right tail distributions.

library(ggridges)
mtx_to_plot <- data.frame(Width = c(width(denyGR.hg38.Bernstein), 
                                    width(denyGR.hg38.Kundaje.1), 
                                    width(denyGR.hg38.Kundaje.2), 
                                    width(denyGR.hg38.Reddy), 
                                    width(denyGR.hg38.Wold), 
                                    width(denyGR.hg38.Yeo)),
                          Source = c(rep("Bernstein.Mint_Blacklist_GRCh38", length(denyGR.hg38.Bernstein)),
                                     rep("Kundaje.GRCh38_unified_blacklist", length(denyGR.hg38.Kundaje.1)),
                                     rep("Kundaje.GRCh38.blacklist", length(denyGR.hg38.Kundaje.2)),
                                     rep("Reddy.wgEncodeDacMapabilityConsensusExcludable", length(denyGR.hg38.Reddy)),
                                     rep("Wold.hg38mitoblack", length(denyGR.hg38.Wold)),
                                     rep("Yeo.eCLIP_blacklistregions.hg38liftover.bed", length(denyGR.hg38.Yeo))))

ggplot(mtx_to_plot, aes(x = log2(Width), y = Source, fill = Source)) +
  geom_density_ridges() +
  theme_bw() + theme(legend.position = "none")
ggsave("denyranges_hg38_width.png", width = 5.5, height = 2)

We can investigate the total width of each set of deny ranges.

mtx_to_plot <- data.frame(TotalWidth = c(sum(width(denyGR.hg38.Bernstein)), 
                                         sum(width(denyGR.hg38.Kundaje.1)), 
                                         sum(width(denyGR.hg38.Kundaje.2)), 
                                         sum(width(denyGR.hg38.Reddy)), 
                                         sum(width(denyGR.hg38.Wold)), 
                                         sum(width(denyGR.hg38.Yeo))), 
                          Source = c("Bernstein.Mint_Blacklist_GRCh38", 
                                     "Kundaje.GRCh38_unified_blacklist", 
                                     "Kundaje.GRCh38.blacklist", 
                                     "Reddy.wgEncodeDacMapabilityConsensusExcludable", 
                                     "Wold.hg38mitoblack", 
                                     "Yeo.eCLIP_blacklistregions.hg38liftover"))
ggplot(mtx_to_plot, aes(x = TotalWidth, y = Source, fill = Source)) + 
  geom_bar(stat="identity") + scale_x_log10() + scale_y_discrete(label=abbreviate) +
  xlab("log10 total width")
ggsave("denyranges_hg38_sumwidth.png", width = 6.5, height = 2)

We can compare Jaccard overlap between those sets of deny regions.

library(pheatmap)
library(stringr)
# Jaccard calculations
jaccard <- function(gr_a, gr_b) {
  intersects <- GenomicRanges::intersect(gr_a, gr_b, ignore.strand = TRUE)
  intersection <- sum(width(intersects))
  union <- sum(width(GenomicRanges::union(gr_a, gr_b, ignore.strand = TRUE)))
  DataFrame(intersection, union, 
            jaccard = intersection/union,
             n_intersections = length(intersects))
}
# List and names of all deny regions
all_denyGR_list <- list(denyGR.hg38.Bernstein, 
                        denyGR.hg38.Kundaje.1, 
                        denyGR.hg38.Kundaje.2,
                        denyGR.hg38.Reddy,
                        denyGR.hg38.Wold,
                        denyGR.hg38.Yeo)
all_denyGR_name <- c("Bernstein.Mint_Blacklist_GRCh38", 
                     "Kundaje.GRCh38_unified_blacklist", 
                     "Kundaje.GRCh38.blacklist", 
                     "Reddy.wgEncodeDacMapabilityConsensusExcludable", 
                     "Wold.hg38mitoblack", 
                     "Yeo.eCLIP_blacklistregions.hg38liftover")
# Correlation matrix, empty
mtx_to_plot <- matrix(data = 0, nrow = length(all_denyGR_list), ncol = length(all_denyGR_list))
# Fill it in
for (i in 1:length(all_denyGR_list)) {
  for (j in 1:length(all_denyGR_list)) {
    # If diagonal, set to zero
    if (i == j) mtx_to_plot[i, j] <- 0
    # Process only one half, the other is symmetric
    if (i > j) {
      mtx_to_plot[i, j] <- mtx_to_plot[j, i] <- jaccard(all_denyGR_list[[i]], all_denyGR_list[[j]])[["jaccard"]]
    }
  }
}
# Trim row/colnames
rownames(mtx_to_plot) <- colnames(mtx_to_plot) <- str_trunc(all_denyGR_name, width = 25) 
# Save the plot
png("denyranges_hg38_jaccard.png", width = 1000, height = 900, res = 200)
pheatmap(data.matrix(mtx_to_plot))
dev.off()

Note that some deny ranges objects contain six columns, implying there may be some interesting metadata. Let’s explore one.

mcols(denyGR.hg38.Reddy)
mtx_to_plot <- as.data.frame(table(mcols(denyGR.hg38.Reddy)[["name"]]))
colnames(mtx_to_plot) <- c("Type", "Number")
mtx_to_plot <- mtx_to_plot[order(mtx_to_plot$Number), ]
mtx_to_plot$Type <- factor(mtx_to_plot$Type, levels = mtx_to_plot$Type)
ggplot(mtx_to_plot, aes(x = Number, y = Type, fill = Type)) +
  geom_bar(stat="identity") +
  theme_bw() + theme(legend.position = "none")
ggsave("denyranges_hg38_Reddy_metadata.png", width = 5, height = 2.5)

One may decide to combine the deny ranges from all labs, although from previous results we may decide to follow Anshul’s advice advice about ENCFF356LFX and use the denyGR.hg38.Kundaje.1 object.

denyGR.hg38.all <- reduce(c(denyGR.hg38.Bernstein, denyGR.hg38.Kundaje.1, denyGR.hg38.Kundaje.2, denyGR.hg38.Reddy, denyGR.hg38.Wold, denyGR.hg38.Yeo))
# Keep only standard chromosomes
denyGR.hg38.all <- keepStandardChromosomes(denyGR.hg38.all, pruning.mode = "coarse")
print(length(denyGR.hg38.all))
# [1] 13239
summary(width(denyGR.hg38.all))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#       5    1778    2306    8153    2859 5407757 

Centromeres, telomeres, etc.

Besides the ENCODE-produced deny regions, we may want to exclude centromeres, telomeres, and other gap locations. The “Gap Locations” track for Homo Sapiens is available for the GRcH37/hg19 genome assembly as a UCSC ‘gap’ table. It can be retrieved from AnnotationHub, but lacks the metadata columns needed to decide the type of gaps.

suppressPackageStartupMessages(library(AnnotationHub))
ah <- AnnotationHub()
# Search for the gap track
# ahData <- query(ah, c("gap", "Homo sapiens", "hg19"))
# ahData[ctcfData$title == "Gap"]
gaps <- ahData[["AH6444"]]
> gaps
UCSC track 'gap'
UCSCData object with 457 ranges and 0 metadata columns:
               seqnames              ranges strand
                  <Rle>           <IRanges>  <Rle>
    [1]            chr1 124535435-142535434      *
    [2]            chr1 121535435-124535434      *
    [3]            chr1     3845269-3995268      *
    [4]            chr1   13219913-13319912      *
    [5]            chr1   17125659-17175658      *
    ...             ...                 ...    ...
  [453]  chr6_ssto_hap7     4639807-4661556      *
  [454]  chr6_ssto_hap7     4723510-4774367      *
  [455]  chr6_ssto_hap7     4783799-4836049      *
  [456] chr17_ctg5_hap1     1256795-1306794      *
  [457] chr17_ctg5_hap1     1588969-1638968      *
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

The UCSC ‘gap’ table provides better granularity about the types of gaps available. E.g., for human, hg19, we have the following types and the number of gaps.

Those objects are provided as individual GRanges.

Naming convention: <genome assembly>.UCSC.<gap type>, e.g., hg19.UCSC.gap_centromere.

Download all data from the Google Drive folder

Object Number of regions Assembly Lab Number of columns Source
hg19.UCSC.centromere.rds 24 hg19 UCSC 9 hg19 gap table
hg19.UCSC.clone.rds 207 hg19 UCSC 9 hg19 gap table
hg19.UCSC.contig.rds 163 hg19 UCSC 9 hg19 gap table
hg19.UCSC.heterochromatin.rds 12 hg19 UCSC 9 hg19 gap table
hg19.UCSC.scaffold.rds 40 hg19 UCSC 9 hg19 gap table
hg19.UCSC.short_arm.rds 5 hg19 UCSC 9 hg19 gap table
hg19.UCSC.telomere.rds 46 hg19 UCSC 9 hg19 gap table
hg38.UCSC.contig.rds 285 hg38 UCSC 9 hg38 gap table
hg38.UCSC.heterochromatin.rds 11 hg38 UCSC 9 hg38 gap table
hg38.UCSC.scaffold.rds 478 hg38 UCSC 9 hg38 gap table
hg38.UCSC.short_arm.rds 5 hg38 UCSC 9 hg38 gap table
hg38.UCSC.telomere.rds 48 hg38 UCSC 9 hg38 gap table
mm10.UCSC.centromere.rds 20 mm10 UCSC 9 mm10 gap table
mm10.UCSC.clone.rds 114 mm10 UCSC 9 mm10 gap table
mm10.UCSC.contig.rds 104 mm10 UCSC 9 mm10 gap table
mm10.UCSC.fragment.rds 1 mm10 UCSC 9 mm10 gap table
mm10.UCSC.other.rds 384 mm10 UCSC 9 mm10 gap table
mm10.UCSC.short_arm.rds 21 mm10 UCSC 9 mm10 gap table
mm10.UCSC.telomere.rds 42 mm10 UCSC 9 mm10 gap table
mm9.UCSC.centromere.rds 21 mm9 UCSC 9 mm9 gap table
mm9.UCSC.contig.rds 281 mm9 UCSC 9 mm9 gap table
mm9.UCSC.fragment.rds 709 mm9 UCSC 9 mm9 gap table

We can similarly load any gap type object.

download.file(url = "https://drive.google.com/uc?export=download&id=1JwkhZi3BHFqOueHv-tOOJtroAnFoyFWF", destfile = "hg19.UCSC.centromere.rds")
gapsGR_hg19_centromete <- readRDS(file = "hg19.UCSC.centromere.rds")
> gapsGR_hg19_centromete
GRanges object with 24 ranges and 6 metadata columns:
      seqnames              ranges strand |       bin        ix           n      size        type      bridge
         <Rle>           <IRanges>  <Rle> | <numeric> <numeric> <character> <numeric> <character> <character>
    2     chr1 121535434-124535434      * |        23      1270           N   3000000  centromere          no
  184    chr21   11288129-14288129      * |        10        22           N   3000000  centromere          no
  199    chr22   13000000-16000000      * |        10         3           N   3000000  centromere          no
  206    chr19   24681782-27681782      * |         1       410           N   3000000  centromere          no
  224     chrY   10104553-13104553      * |        10       105           N   3000000  centromere          no
  ...      ...                 ...    ... .       ...       ...         ...       ...         ...         ...
  439     chr6   58830166-61830166      * |        16       628           N   3000000  centromere          no
  453     chr5   46405641-49405641      * |        14       452           N   3000000  centromere          no
  460     chr4   49660117-52660117      * |         1       447           N   3000000  centromere          no
  476     chr3   90504854-93504854      * |         2       784           N   3000000  centromere          no
  481     chr2   92326171-95326171      * |        20       770           N   3000000  centromere          no
  -------
  seqinfo: 24 sequences from hg19 genome

Centromeres for the hg38 genome assembly

Note that the UCSC ‘gap’ table for the hg38 human genome assembly does not contain genomic coordinates for the “centromere” gap type. These can be obtained from the rCGH package as follows:

suppressPackageStartupMessages(library(rCGH))
suppressPackageStartupMessages(library(GenomicRanges))
# hg38 # data.frame
# Adjust chromosome names
hg38$chrom[hg38$chrom == 23] <- "X"
hg38$chrom[hg38$chrom == 24] <- "Y"
hg38$chrom <- paste0("chr", hg38$chrom)
# Make GRanges object
hg38.UCSC.centromere <- makeGRangesFromDataFrame(hg38, seqnames.field = "chrom", start.field = "centromerStart", end.field = "centromerEnd")
# Assign seqinfo data
seqlengths(hg38.UCSC.centromere) <- hg38$length
genome(hg38.UCSC.centromere)     <- "hg38"
# Resulting object
hg38.UCSC.centromere
#> GRanges object with 24 ranges and 0 metadata columns:
#>        seqnames              ranges strand
#>           <Rle>           <IRanges>  <Rle>
#>    [1]     chr1 121535434-124535434      *
#>    [2]     chr2   92326171-95326171      *
#>    [3]     chr3   90504854-93504854      *
#>    [4]     chr4   49660117-52660117      *
#>    [5]     chr5   46405641-49405641      *
#>    ...      ...                 ...    ...
#>   [20]    chr20   26369569-29369569      *
#>   [21]    chr21   11288129-14288129      *
#>   [22]    chr22   13000000-16000000      *
#>   [23]     chrX   58632012-61632012      *
#>   [24]     chrY   10104553-13104553      *
#>   -------
#>   seqinfo: 24 sequences from hg38 genome

The rCGH package also contains data for the hg19 and hg18 genomes. The hg19 centromere data is equivalent to the hg19.UCSC.centromere object provided in our denyranges package.

Citation

Below is the citation output from using citation('denyranges') in R. Please run this yourself to check for any updates on how to cite denyranges.

print(citation("denyranges"), bibtex = TRUE)
#> 
#> Dozmorov MG, Davis E, Mu W, Lee S, Triche T, Phanstiel D, Love M
#> (2021). _denyranges_.
#> https://github.com/mdozmorov/denyranges/denyranges - R package version
#> 0.99.1, <URL: https://github.com/mdozmorov/denyranges>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {denyranges},
#>     author = {Mikhail G. Dozmorov and Eric Davis and Wancen Mu and Stuart Lee and Tim Triche and Douglas Phanstiel and Michael Love},
#>     year = {2021},
#>     url = {https://github.com/mdozmorov/denyranges},
#>     note = {https://github.com/mdozmorov/denyranges/denyranges - R package version 0.99.1},
#>   }

Code of Conduct

Please note that the denyranges project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

This package was developed using biocthis.

About

This is a read-only mirror of the git repos at https://bioconductor.org

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages