Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make WindowFrames more specific #16741

Merged
merged 14 commits into from
Jul 25, 2024

Conversation

kgyrtkirk
Copy link
Member

@kgyrtkirk kgyrtkirk commented Jul 16, 2024

this patch changes the WindowFrame internals / representation a bit; introduces a dedicated frametype for rows, groups so that later if we decide to add better ranges support it will less likely need a bigger refactor

it also changes how its represented in the native query:

-      frame: { peerType: "ROWS", lowUnbounded: true, lowOffset: 0, uppUnbounded: true, uppOffset: 0 }
+      frame: { type: rows }
-       frame: { peerType: "ROWS", lowUnbounded: false, lowOffset: 0, uppUnbounded: false, uppOffset: 2 }
+      frame: { type: rows, lowerOffset: 0, upperOffset: 2 }
-      frame: { peerType: "RANGE", lowUnbounded: true, lowOffset: 0, uppUnbounded: false, uppOffset: 0, orderBy: [{ column: l1, direction: ASC }] }
+      frame: { type: group, upperOffset: 0, orderBy: [{ column: l1, direction: ASC }] }

this patch changes the WindowFrame internals / representation a bit;
introduces a dedicated frametype for rows, groups and unbounded.
@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jul 16, 2024
@JsonSubTypes.Type(name = "rows", value = WindowFrame.Rows.class),
@JsonSubTypes.Type(name = "groups", value = WindowFrame.Groups.class),
})
public interface WindowFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently as we just support only UNBOUNDED and CURRENT ROW for RANGE (which would be same for GROUPS), so this looks good to use groups to represent RANGE queries as well. But later on, if we want to support both, shouldnt we have something to distinguish them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we will have that; later when we add support for real RANGE stuff; we will add a new JsonSubType for that -

this makes it clear the right now we don't support generic RANGE stuff :)

this.orderBy = ImmutableList.copyOf(orderBy);
}

public List<String> getOrderByColNames()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to have getOrderByColumns as well to read the data along with direction

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we are not using the "direction" right now during processing ; and that's because it must be pre-ordered...

shouldn't we ask for a list of strings instead ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, even I checked it we dont use the direction of cols when we are at this level

@@ -371,7 +370,7 @@ public void testUnboundedWindowedAggregation()
FramedOnHeapAggregatable agger = FramedOnHeapAggregatable.fromRAC(rac);

final RowsAndColumns results = agger.aggregateAll(
new WindowFrame(WindowFrame.PeerType.ROWS, true, 0, true, 0, null),
WindowFrame.rows(null, null),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference between WindowFrame.rows(null, 0) and WindowFrame.unbounded()? Is it just the rows frame unwraps as OffsetFrame? can we use them interchangeably?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WindowFrame.unbounded() is the same as WindowFrame.rows(null,null) ; during compilation we will do unbounded in those cases in Windowing

changeds this call to unbounded()

private static Iterable<AggInterval> buildUnboundedIteratorFor(AppendableRowsAndColumns rac)
{
int[] groupBoundaries = new int[] {0, rac.numRows()};
return new GroupIteratorForWindowFrame(WindowFrame.rows(0, 0), groupBoundaries);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldnt this be WindowFrame.rows(null, null) for better readability? Technically anything would work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes; it could be anything as long as current row is inside :D
changed it to WindowFrame.rows(null, null) :D

public class WindowFrame
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
@JsonSubTypes(value = {
@JsonSubTypes.Type(name = "unbounded", value = WindowFrame.Unbounded.class),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really the right abstraction? "unbounded" isn't really a frame type, rather a frame start/end thing?

Why not just make a singleton instance for an 'unbounded' Rows.class? (assuming this is meant to be ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING or whatever)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is more like just syntactic sugar for native queries; yes it could also be {type: rows } - that's kinda the same - unbounded is just more explicit if its present in the native query.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed unbounded - it made things simpler as well!

Comment on lines -59 to -60
@JsonProperty("uppUnbounded") boolean upperUnbounded,
@JsonProperty("uppOffset") int upperOffset,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this going to cause compatibility issues during rolling upgrade because on OffsetFrame these are replaced by upperOffset? I forget exactly if/where these get sent over the wire (msq maybe?). Does OffsetFrame needs to accept these old property names too just to be safe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's kinda the main reason it's not yet rolled out - and its not on the wire; so we could still change it...


public class WindowFrame
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this going to cause issues during rolling upgrade? old json used peerType to pick rows/groups, but is probably uppercase from enum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could not cause such issues - as it have not worked correctly previously

Comment on lines 54 to 57
static Groups groups(
final Integer lowerOffset,
final Integer upperOffset,
final List<String> orderByColumns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could all fit on one line (also not sure why style bot isn't flagging the ')' not being on a newline...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

joined it back to occupy a single line;
unfortunately I can't set that policy in my formatter

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
@JsonSubTypes(value = {
@JsonSubTypes.Type(name = "rows", value = WindowFrame.Rows.class),
@JsonSubTypes.Type(name = "groups", value = WindowFrame.Groups.class),
Copy link
Member

@clintropolis clintropolis Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I have another question now that i re-read the PR description, why did RANGE change to groups? Is this because only group by queries are supported currently? Or was that restriction lifted? I naively would expect the syntax present in the query the user wrote to match what appears here, but that doesn't seem to be the case with this change. Is this confusing? How will this make a future refactor easier?

Copy link
Member

@clintropolis clintropolis Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, do we even support using groups in SQL syntax? it seems like maybe we don't and there are no tests, is this also confusing? I'm confused.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to cover all aspects of your questions :)

why did RANGE change to groups

for a RANGE if both endpoints are UNBOUNDED or CURRENT ROW there is no difference between RANGE and GROUPS (we support only these right now)

Is this because only group by queries are supported currently?

it has no connection to that - groups is a frame evaluation mode; which groups rows with the same value together
I always find a different doc about this...today I've found this ; but its also in the standard.

I naively would expect the syntax present in the query the user wrote to match what appears here, but that doesn't seem to be the case with this change. Is this confusing?

I don't feel like its confusing - as the resulting plan may be different from what the user supplied already: filters and join conditions may have changed; predicates could be pushed and window frame specs may change

  • I don't know if this would be optimized; but consider that the user gives: SUM() OVER (PARTITION BY X ORDER BY Y,Z RANGE BETWEEN UNBOUNDED PRECEEDING AND UNBOUNDED PROCEEDING)
  • that's a fully unbounded window...do we need to do the ORDER? if not...then it could be executed as SUM() OVER (PARTITION BY X)

How will this make a future refactor easier?

We will be able to add range later along with the supporting algo enhancements

also, do we even support using groups in SQL syntax?

Although the SQL standard has it - the Calcite layer doesn't accept GROUPS today - only RANGE and ROWS are allowed.
With this change we will naturally expose the support of groups in the native layer - and native api users may even go beyond and use all features of groups.
When the Calcite support will arrive for GROUPS we will already have everything prepared to enable full support for it.

it seems like maybe we don't and there are no tests, is this also confusing? I'm confused.

As we recognize edge cases of RANGE as GROUPS - the sqlTest plans contain rows and groups.
For all RANGE frames we can't reliably execute correctly there is an exception explaining it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, i understand GROUPS is part of the standard, it just seems really strange that internally we "support" it but don't support it in SQL, and externally we support RANGE but dont support it internally. If we are doing this just for edge cases, then we should really add lots of javadocs to explain what is going on so it isn't so confusing maybe?

@kgyrtkirk kgyrtkirk merged commit 7e3fab5 into apache:master Jul 25, 2024
88 checks passed
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024
Changes the WindowFrame internals / representation a bit; introduces dedicated frametypes for rows and groups which corresponds to the implemented processing methods
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants