Skip to content

Conversation

@haydenm
Copy link
Collaborator

@haydenm haydenm commented Jan 7, 2019

set_cover.approx_multiuniverse() constructs the universes by taking
the union over all input sets. When use_intervalsets is True,
this was previously slow. (use_intervalsets is True for most runs because
set_cover_filter sets it to True.) It worked by creating an empty
IntervalSet, and then repeatedly taking the union of the universe's
IntervalSet with another input set (also represented as an IntervalSet).
In the worst-case, taking the union takes O(n) time where n is the
current size of the universe. As a result, constructing the universe
took, in the worst-case, O(n^2) time. For particularly long genomes,
where n is large, this could take hours in practice.

This fix instead initializes each universe by simply constructing
a list of the input intervals for it, and then initializing an IntervalSet
from the list. Initializing the IntervalSet already merges overlapping
intervals (effectively taking the union of all of them) in O(N log N)
time, where N is the number of input intervals. Since the input sets
are constructed from the universe, N is O(n), where n is defined above.
So, now, constructing the universe takes, in the worst-case, O(n log n)
time. In practice, this seems to provide a noticeable speed up as well.

Add check that creating a new IntervalSet from a list of
2 intervals is equivalent to taking the union of those
2 intervals.
set_cover.approx_multiuniverse() constructs the universes by taking
the union over all input sets. When use_intervalsets is True,
this was previously slow. (use_intervalsets is True for most runs because
set_cover_filter sets it to True.) It worked by creating an empty
IntervalSet, and then repeatedly taking the union of the universe's
IntervalSet with another input set (also represented as an IntervalSet).
In the worst-case, taking the union takes O(n) time where n is the
current size of the universe. As a result, constructing the universe
took, in the worst-case, O(n^2) time. For particularly long genomes,
where n is large, this could take hours in practice.

This fix instead initializes each universe by simply constructing
a list of the input intervals for it, and then initializing an IntervalSet
from the list. Initializing the IntervalSet already merges overlapping
intervals (effectively taking the union of all of them) in O(N log N)
time, where N is the number of input intervals. Since the input sets
are constructed from the universe, N is O(n), where n is defined above.
So, now, constructing the universe takes, in the worst-case, O(n log n)
time. In practice, this seems to provide a noticeable speed up as well.
@coveralls
Copy link

coveralls commented Jan 7, 2019

Coverage Status

Coverage increased (+0.008%) to 95.006% when pulling 8c98893 on speed-sc-setup into 745499f on master.

@haydenm haydenm merged commit 90d7b9b into master Jan 7, 2019
@haydenm haydenm deleted the speed-sc-setup branch January 7, 2019 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants