Skip to content

Sweep-line algorithm for genomic features. Detect overlaps on large files w/ minimal memory.

Notifications You must be signed in to change notification settings

endrebak/chrom_sweep

 
 

Repository files navigation

Motivation

Files of genomic features continue to grow in size. Traditional methods of detecting overlaps (e.g., R-trees, interval trees, binary search) among genomic intervals require substantial memory for very large datasets. However, if datasets are sorted by chromosome and start position, one can, in principle, adapt principles from sort-merge and sweep-line algorithms to detect overlaps "on the fly" by marching from "left to right" across a chromosome and catching overlaps as you encounter features.

This is the principle behind chromsweep. Corner cases arise when handling changes in chromosomes in the two files. This version now handles these cases, save for the caveats laid out below in the "issues" section.

Previous work

Joel Richardson published a similar algorithm called "fjoin". This approach required that the intervals be sorted by position, irrespective of chromosome. Thus, it did not report results in chrom order and required files to be sorted in an atypical way for most datasets. The following implementation has several advantages:

  1. It works with files, streams, and FIFOs
  2. It works with BED, GFF, VCF.
  3. It works with compressed files.
  4. In principle, it works with BAM; once this is ported to BEDTools it will.
  5. Typical annotation and dataset files are already sorted in exactly the manner required.

Issues

  1. Currently, it only reports the _count_ of overlaps between the Query file (A file in BEDTools parlance) and the Database file (B file).

Requirements

  1. chromsweep depends upon pybedtools (https://github.com/daler/pybedtools), which is a very powerful Python library for parsing and manipulating genomic features in BED/GFF/VCF format.
  2. Files must be sorted by chrom, then start position. For BED files, the following will work:
$ sort -k1,1 -k2,2n data.bed > data.sorted.bed

Alternatively, the BEDTools sortBed can be used to sort BED/GFF/VCF files in this manner.

Future Work

  1. N-files.
  2. Paralellism.

Example

$  ./chrom_sweep.py
Usage:
chrom_sweep.py [QUERY] [DATABASE]

$ zcat rmsk.bed.gz | head
chr1    10000   10468   (CCCTAA)n       1504    +
chr1    10468   11447   TAR1    3612    -
chr1    11503   11675   L1MC    437     -
chr1    11677   11780   MER5B   239     -
chr1    15264   15355   MIR3    318     -
chr1    16712   16749   (TGG)n  203     +
chr1    18906   19048   L2a     239     +
chr1    19947   20405   L3      652     +
chr1    20530   20679   Plat_L3 270     +
chr1    21948   22075   MLT1K   254     +

$ zcat knownGene.bed.gz | head
chr1    11873   14409
chr1    11873   14409
chr1    11873   14409
chr1    14362   16765
chr1    14362   19759
chr1    14362   19759
chr1    14362   19759
chr1    14362   24901
chr1    14362   29370
chr1    14362   29370

$  ./chrom_sweep.py knownGene.bed.gz rmsk.bed.gz | head
chr1    11873   14409   0
chr1    11873   14409   0
chr1    11873   14409   0
chr1    14362   16765   2
chr1    14362   19759   3
chr1    14362   19759   3
chr1    14362   19759   3
chr1    14362   24901   10
chr1    14362   29370   17
chr1    14362   29370   17

$ ./chrom_sweep.py knownGene.bed.gz rmsk.bed.gz > chrom_sweep.out

$ intersectBed -a knownGene.bed.gz -b rmsk.bed.gz -c > bedtools.out

$ diff chrom_sweep.out bedtools.out

About

Sweep-line algorithm for genomic features. Detect overlaps on large files w/ minimal memory.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%