Several changes to how binning works in interval_index_file.
First, the maximum has been increased (to 4096*1024*1024; that is bigger than the 32-bit integer we use to actually store the positions can hold, so MAX is actually 2**31). Second, the number of levels of bins to use is now determined from the max size passed when creating the index. Previously, even for small intervals (e.g. contigs), the smallest number of bins one could use had to include space for all the high-level bins (512 + 64 + 8 + 1 bins!). Now a small contig (under 128KB) has exactly one bin. Not only does this save a ton of space, it also makes finding the bin for an interval much faster. The version number has been incremented; files with version < 2 will always use the old binning scheme.

Impact: it is now tractable to index species such as:

possum: extremely large chromosomes were not supported before; now an additional bin will be added when creating indexes on large regions.

platypus: highly fragmented assembly with many small contigs; it was taking a prohibitively long time to index and wasting a ton of space. It can now be indexed rapidly and compactly, e.g.:

2906252445 2007-07-11 11:02 ornAna1.maf
36021451 2007-07-11 13:22 ornAna1.maf.index
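The level selection described in this message can be sketched roughly as follows. This is an illustration only, not the code from interval_index_file: the function names are hypothetical, the 128KB finest bin size comes from the message, and the factor of 8 between levels is an assumption inferred from the "512 + 64 + 8 + 1" bin counts.

```python
FIRST_SHIFT = 17  # 2**17 bytes = 128KB, the finest bin size named in the message
NEXT_SHIFT = 3    # assumption: each coarser level covers 8x more (512, 64, 8, 1)


def levels_for_max_size(max_size):
    """Number of bin levels needed so that one top-level bin spans max_size."""
    levels = 1
    span = 1 << FIRST_SHIFT
    while span < max_size:
        span <<= NEXT_SHIFT
        levels += 1
    return levels


def total_bins(levels):
    """Total bins across all levels: 1 + 8 + 64 + ... (one term per level)."""
    return sum(1 << (NEXT_SHIFT * depth) for depth in range(levels))


def level_offsets(levels):
    """Index of the first bin at each level, finest level first."""
    offsets = []
    offset = 0
    for depth in range(levels):  # depth 0 is the single coarsest bin
        offsets.append(offset)
        offset += 1 << (NEXT_SHIFT * depth)
    return offsets[::-1]


def bin_for_range(start, end, levels):
    """Smallest bin that fully contains the half-open range [start, end)."""
    start_bin = start >> FIRST_SHIFT
    # max() keeps an empty range (end == start) inside a single bin
    end_bin = max(end - 1, start) >> FIRST_SHIFT
    for offset in level_offsets(levels):
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("range does not fit in %d levels" % levels)
```

Under these assumptions a 100KB contig needs one level and therefore one bin, a 512MB sequence needs five levels (4681 bins in total), and a sequence of 2**31 bases needs a sixth level.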
Showing 3 changed files with 100 additions and 15 deletions.
@@ -0,0 +1,53 @@
import interval_index_file
from interval_index_file import Indexes
from tempfile import mktemp
import random

def test():
    ix = Indexes()
    chrs = []
    for i in range( 5 ):
        intervals = []
        name = "seq%d" % i
        max = random.randint( 0, interval_index_file.MAX )
        print name, "size", max
        for i in range( 500 ):
            start = random.randint( 0, max )
            end = random.randint( 0, max )
            if end < start:
                end, start = start, end
            ix.add( name, start, end, i )
            intervals.append( ( start, end, i ) )
        chrs.append( intervals )
    fname = mktemp()
    f = open( fname, "w" )
    ix.write( f )
    f.close()
    del ix

    ix = Indexes( fname )
    for i in range( 5 ):
        intervals = chrs[i]
        name = "seq%d" % i
        for i in range( 100 ):
            start = random.randint( 0, max )
            end = random.randint( 0, max )
            if end < start:
                end, start = start, end
            query_intervals = set()
            for ( s, e, i ) in intervals:
                if e > start and s < end:
                    query_intervals.add( ( s, e, i ) )
            result = ix.find( name, start, end )
            for inter in result:
                assert inter in query_intervals
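The test in this diff only asserts that every interval returned by the index is a true overlap; it does not check that every true overlap is returned. A stricter comparison could look something like the sketch below. The helper names are hypothetical, and it assumes query results are `(start, end, value)` tuples, as the membership check in the test suggests.

```python
def overlapping(intervals, start, end):
    """Brute-force reference: every (s, e, value) that overlaps [start, end)."""
    return set((s, e, v) for (s, e, v) in intervals if e > start and s < end)


def check_query(found, intervals, start, end):
    """Compare an index's answer with the brute-force answer in both directions.

    Returns (missing, unexpected): overlaps the index failed to report, and
    reported intervals that do not actually overlap the query.
    """
    expected = overlapping(intervals, start, end)
    found = set(found)
    return expected - found, found - expected
```

In a test like the one in this diff, `found` would be the result of `ix.find( name, start, end )`, and both returned sets would be expected to be empty.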