Speed up construction of universe from IntervalSets #21
`set_cover.approx_multiuniverse()` constructs the universes by taking
the union over all input sets. When `use_intervalsets` is True, this was
previously slow. (`use_intervalsets` is True for most runs because
`set_cover_filter` sets it to True.) It worked by creating an empty
`IntervalSet`, and then repeatedly taking the union of the universe's
IntervalSet with another input set (also represented as an IntervalSet).
In the worst case, taking the union takes O(n) time, where n is the
current size of the universe. As a result, constructing the universe
took, in the worst case, O(n^2) time. For particularly long genomes,
where n is large, this could take hours in practice.
This fix instead initializes each universe by simply constructing
a list of the input intervals for it, and then initializing an IntervalSet
from the list. Initializing the IntervalSet already merges overlapping
intervals (effectively taking the union of all of them) in O(N log N)
time, where N is the number of input intervals. Since the input sets
are constructed from the universe, N is O(n), where n is defined above.
So, now, constructing the universe takes, in the worst case, O(n log n)
time. In practice, this seems to provide a noticeable speedup as well.
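
The difference between the two construction patterns can be sketched roughly as follows. This is a minimal illustration, not the project's code: `merge_intervals`, `universe_by_repeated_union`, and `universe_from_list` are hypothetical helpers, intervals are assumed to be half-open `(start, end)` tuples, and abutting intervals are assumed to merge. For simplicity the "old" pattern here re-sorts on every union, so its exact cost differs from the O(n) union described above; the point is only that it reprocesses the whole universe once per input set, while the "new" pattern sorts and merges a single time.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # assumed half-open: [start, end)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Sort intervals, then merge any that overlap or touch: O(N log N)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the last merged interval, so extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def universe_by_repeated_union(input_sets):
    """Old pattern: start empty and union in one input set at a time."""
    universe: List[Interval] = []
    for s in input_sets:
        # Each union reprocesses the whole universe built so far
        universe = merge_intervals(universe + list(s))
    return universe


def universe_from_list(input_sets):
    """New pattern: collect every interval first, then merge once."""
    all_intervals = [iv for s in input_sets for iv in s]
    return merge_intervals(all_intervals)


input_sets = [[(0, 5), (10, 15)], [(3, 8)], [(20, 25), (14, 18)]]
old_universe = universe_by_repeated_union(input_sets)
new_universe = universe_from_list(input_sets)
```

Both functions yield the same merged universe for the same inputs; only the number of passes over the growing universe differs.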