Speed up construction of universe from IntervalSets #21

haydenm · 2019-01-07T16:21:24Z

set_cover.approx_multiuniverse() constructs the universes by taking
the union over all input sets. When use_intervalsets is True,
this was previously slow. (use_intervalsets is True for most runs because
set_cover_filter sets it to True.) It worked by creating an empty
IntervalSet, and then repeatedly taking the union of the universe's
IntervalSet with another input set (also represented as an IntervalSet).
In the worst-case, taking the union takes O(n) time where n is the
current size of the universe. As a result, constructing the universe
took, in the worst-case, O(n^2) time. For particularly long genomes,
where n is large, this could take hours in practice.

This fix instead initializes each universe by simply constructing
a list of the input intervals for it, and then initializing an IntervalSet
from the list. Initializing the IntervalSet already merges overlapping
intervals (effectively taking the union of all of them) in O(N log N)
time, where N is the number of input intervals. Since the input sets
are constructed from the universe, N is O(n), where n is defined above.
So, now, constructing the universe takes, in the worst-case, O(n log n)
time. In practice, this seems to provide a noticeable speed up as well.

Add check that creating a new IntervalSet from a list of 2 intervals is equivalent to taking the union of those 2 intervals.

set_cover.approx_multiuniverse() constructs the universes by taking the union over all input sets. When use_intervalsets is True, this was previously slow. (use_intervalsets is True for most runs because set_cover_filter sets it to True.) It worked by creating an empty IntervalSet, and then repeatedly taking the union of the universe's IntervalSet with another input set (also represented as an IntervalSet). In the worst-case, taking the union takes O(n) time where n is the current size of the universe. As a result, constructing the universe took, in the worst-case, O(n^2) time. For particularly long genomes, where n is large, this could take hours in practice. This fix instead initializes each universe by simply constructing a list of the input intervals for it, and then initializing an IntervalSet from the list. Initializing the IntervalSet already merges overlapping intervals (effectively taking the union of all of them) in O(N log N) time, where N is the number of input intervals. Since the input sets are constructed from the universe, N is O(n), where n is defined above. So, now, constructing the universe takes, in the worst-case, O(n log n) time. In practice, this seems to provide a noticeable speed up as well.

coveralls · 2019-01-07T16:51:44Z

Coverage increased (+0.008%) to 95.006% when pulling 8c98893 on speed-sc-setup into 745499f on master.

haydenm added 2 commits January 7, 2019 10:57

Expand test for IntervalSet initialization

05b381b

Add check that creating a new IntervalSet from a list of 2 intervals is equivalent to taking the union of those 2 intervals.

haydenm merged commit 90d7b9b into master Jan 7, 2019

haydenm deleted the speed-sc-setup branch January 7, 2019 17:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up construction of universe from IntervalSets #21

Speed up construction of universe from IntervalSets #21

Uh oh!

haydenm commented Jan 7, 2019

Uh oh!

coveralls commented Jan 7, 2019 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Speed up construction of universe from IntervalSets #21

Speed up construction of universe from IntervalSets #21

Uh oh!

Conversation

haydenm commented Jan 7, 2019

Uh oh!

coveralls commented Jan 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coveralls commented Jan 7, 2019 •

edited

Loading