Skip to content

MSQ: Add limitHint to global-sort shuffles.#16911

Merged
gianm merged 4 commits intoapache:masterfrom
gianm:msq-limit-hint
Sep 3, 2024
Merged

MSQ: Add limitHint to global-sort shuffles.#16911
gianm merged 4 commits intoapache:masterfrom
gianm:msq-limit-hint

Conversation

@gianm
Copy link
Contributor

@gianm gianm commented Aug 16, 2024

This allows pushing down limits into the SuperSorter.

This allows pushing down limits into the SuperSorter.
@gianm gianm added Performance Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Aug 16, 2024
* and also requires spotbugs exclusions (see spotbugs-exclude.xml).
*/
@SuppressWarnings({"EqualsAndHashcode", "EqualsHashCode"})
public class LimitHintJsonIncludeFilter

Check failure

Code scanning / CodeQL

Inconsistent equals and hashCode

Class LimitHintJsonIncludeFilter overrides [equals](1) but not hashCode.
Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of pushing this limit hint into the InputChannelsImpl#openSorted (perhaps in a future patch)? This will also reduce the amount of data a single worker would have to read per partition.

} else if (rowLimit == 0 && activeProcessors == 0) {
// We had a row limit, and got it all the way down to zero.
// Generate empty output channels for any partitions that we haven't written yet.
superSorterProgressTracker.markTriviallyComplete();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there was a row limit initially and we brought it down to zero, should it be called trivially completed? Trivially completed means that there wasn't any data to begin with.

Copy link
Contributor Author

@gianm gianm Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, you're right. I changed this to instead call addMergedBatchesForLevel to "fill in" the empty channels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can similar optimisation be applied for ScanQueryKit as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I just added it. It's only there for the case where the scan query requires a sort. In theory it could be there for the non-sort case too, but in this patch only the sorting shuffle spec has a limit hint. I think we could add limitHint to other specs in the future.

@gianm
Copy link
Contributor Author

gianm commented Aug 20, 2024

WDYT of pushing this limit hint into the InputChannelsImpl#openSorted (perhaps in a future patch)? This will also reduce the amount of data a single worker would have to read per partition.

I do think that'd make sense as a followup.

*
* Implementations may also ignore this hint completely, or may apply a limit that is somewhat higher than this hint.
*/
default long limitHint()
Copy link
Contributor

@LakshSingla LakshSingla Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if there's any merit in separating the limit hint into a POJO, and having both limit and offset baked into it. Something like DefaultLimitSpec without the columns. It's up to the reader to calculate the combined limit and use that, instead of baking that knowledge into a long.

I was thinking of use cases when we want to pushdown this limit into other portions of MSQ's stack and want to distinguish between rows [0..offset) (thrown away) and rows [offset, offset + limit) (kept)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think offset can be pushed down? Limit can be pushed down when sorting because if some row is in the top N globally, it must also be in the first N rows of whichever partition it appears in.

But with offset, if for example you have LIMIT N OFFSET M, we can't push down OFFSET M (i.e. skip M rows). It is possible that some of the first M rows in partition A still need to appear in the final resultset, perhaps because some of them are greater than any of the first M rows in the globally-sorted result. So the best we can do is push down LIMIT N + M.

Copy link
Contributor

@LakshSingla LakshSingla Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, how does this work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's this API: https://fasterxml.github.io/jackson-annotations/javadoc/2.9/com/fasterxml/jackson/annotation/JsonInclude.Include.html#CUSTOM

Value that indicates that separate filter Object (specified by JsonInclude.valueFilter() for value itself, and/or JsonInclude.contentFilter() for contents of structured types) is to be used for determining inclusion criteria. Filter object's equals() method is called with value to serialize; if it returns true value is excluded (that is, filtered out); if false value is included.

It's kind of goofy, but it's the only tool Jackson provides us for keeping the serialized JSON clean other than "include non-null", "include non-default", and "include non-empty".

@gianm gianm merged commit 786c959 into apache:master Sep 3, 2024
@gianm gianm deleted the msq-limit-hint branch September 3, 2024 16:05
gianm added a commit to gianm/druid that referenced this pull request Sep 3, 2024
abhishekrb19 pushed a commit that referenced this pull request Sep 3, 2024
edgar2020 pushed a commit to edgar2020/druid that referenced this pull request Sep 5, 2024
* MSQ: Add limitHint to global-sort shuffles.

This allows pushing down limits into the SuperSorter.

* Test fixes.

* Add limitSpec to ScanQueryKit. Fix SuperSorter tracking.
edgar2020 pushed a commit to edgar2020/druid that referenced this pull request Sep 5, 2024
gianm added a commit to gianm/druid that referenced this pull request Sep 17, 2024
Previously, the processor used "remainingChannels" to track the number of
non-null entries of currentFrame. Now, "remainingChannels" tracks the
number of channels that are unfinished.

The difference is subtle. In the previous code, when an input channel
was blocked upon exiting nextFrame(), the "currentFrames" entry would be
null, and therefore the "remainingChannels" variable would be decremented.
After the next await and call to populateCurrentFramesAndTournamentTree(),
"remainingChannels" would be incremented if the channel had become
unblocked after awaiting.

This means that finished(), which returned true if remainingChannels was
zero, would not be reliable if called between nextFrame() and the
next await + populateCurrentFramesAndTournamentTree().

This patch changes things such that finished() is always reliable. This
fixes a regression introduced in PR apache#16911, which added a call to
finished() that was, at that time, unsafe.
gianm added a commit that referenced this pull request Sep 17, 2024
Previously, the processor used "remainingChannels" to track the number of
non-null entries of currentFrame. Now, "remainingChannels" tracks the
number of channels that are unfinished.

The difference is subtle. In the previous code, when an input channel
was blocked upon exiting nextFrame(), the "currentFrames" entry would be
null, and therefore the "remainingChannels" variable would be decremented.
After the next await and call to populateCurrentFramesAndTournamentTree(),
"remainingChannels" would be incremented if the channel had become
unblocked after awaiting.

This means that finished(), which returned true if remainingChannels was
zero, would not be reliable if called between nextFrame() and the
next await + populateCurrentFramesAndTournamentTree().

This patch changes things such that finished() is always reliable. This
fixes a regression introduced in PR #16911, which added a call to
finished() that was, at that time, unsafe.
pranavbhole pushed a commit to pranavbhole/druid that referenced this pull request Sep 17, 2024
Previously, the processor used "remainingChannels" to track the number of
non-null entries of currentFrame. Now, "remainingChannels" tracks the
number of channels that are unfinished.

The difference is subtle. In the previous code, when an input channel
was blocked upon exiting nextFrame(), the "currentFrames" entry would be
null, and therefore the "remainingChannels" variable would be decremented.
After the next await and call to populateCurrentFramesAndTournamentTree(),
"remainingChannels" would be incremented if the channel had become
unblocked after awaiting.

This means that finished(), which returned true if remainingChannels was
zero, would not be reliable if called between nextFrame() and the
next await + populateCurrentFramesAndTournamentTree().

This patch changes things such that finished() is always reliable. This
fixes a regression introduced in PR apache#16911, which added a call to
finished() that was, at that time, unsafe.
kfaraz pushed a commit to kfaraz/druid that referenced this pull request Sep 30, 2024
Previously, the processor used "remainingChannels" to track the number of
non-null entries of currentFrame. Now, "remainingChannels" tracks the
number of channels that are unfinished.

The difference is subtle. In the previous code, when an input channel
was blocked upon exiting nextFrame(), the "currentFrames" entry would be
null, and therefore the "remainingChannels" variable would be decremented.
After the next await and call to populateCurrentFramesAndTournamentTree(),
"remainingChannels" would be incremented if the channel had become
unblocked after awaiting.

This means that finished(), which returned true if remainingChannels was
zero, would not be reliable if called between nextFrame() and the
next await + populateCurrentFramesAndTournamentTree().

This patch changes things such that finished() is always reliable. This
fixes a regression introduced in PR apache#16911, which added a call to
finished() that was, at that time, unsafe.
kfaraz added a commit that referenced this pull request Sep 30, 2024
…17194)

Previously, the processor used "remainingChannels" to track the number of
non-null entries of currentFrame. Now, "remainingChannels" tracks the
number of channels that are unfinished.

The difference is subtle. In the previous code, when an input channel
was blocked upon exiting nextFrame(), the "currentFrames" entry would be
null, and therefore the "remainingChannels" variable would be decremented.
After the next await and call to populateCurrentFramesAndTournamentTree(),
"remainingChannels" would be incremented if the channel had become
unblocked after awaiting.

This means that finished(), which returned true if remainingChannels was
zero, would not be reliable if called between nextFrame() and the
next await + populateCurrentFramesAndTournamentTree().

This patch changes things such that finished() is always reliable. This
fixes a regression introduced in PR #16911, which added a call to
finished() that was, at that time, unsafe.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants