New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34457][SQL] DataSource V2: Add default null ordering to SortDirection #31580
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @aokolnychyi . This is Apache Spark community, not Iceberg community.
To give more context, could you file a JIRA issue first and use the ID?
Oops, I created the JIRA and used it for the branch name but forgot to add it to the description. Fixed now. |
Thank you for update, @aokolnychyi ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @aokolnychyi .
Kubernetes integration test starting |
I prefer to see the relevant code in same PR (or even another PR on top of this PR) unless the code diff is huge so we have to break down. At least I'd like to ensure we have JIRA issue to address the rest for sately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@HeartSaVioR, this change is self-contained. At least, I don't have a PR that depends on it. I think I overlooked this part when I introduced this class. As far as I know, the SQL standard does not define the default null ordering for a sort direction. That's why the default value is different among query engines. Since this class is part of our public expression API, I am afraid to let data sources to interpret this. That's why I tried to match the behavior in Catalyst |
OK from what I understand from your comment is, this is informative one to let data source developers to "indicate" the default null ordering for ASC and DESC, (If we think the default null ordering is not clear for ASC and DESC so we have to have a way to indicate, what about including the fact to toString as well? Or, having javadoc to each element?) |
/** | ||
* A sort direction used in sorting expressions. | ||
* <p> | ||
* Each direction has a default null ordering that is implied if no null ordering is specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing me as someone could specify the null ordering explicitly but in reality we don't leverage the default null ordering in any way. (Otherwise we should have the relevant code to leverage that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind. The doc is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (non-binding)
Kubernetes integration test status success |
So I basically understand what this PR is trying to do and I'm appreciated about the details. Nice! I'd just like to check what more approaches we can do to clarify the default null ordering to data source developers easier to indicate. Do we expect them to look into the code directly, or expect them to call |
(I'm a bit afraid that we are going to have two source-of-truths as data source developers would tend to believe classes in connector package and Spark maintainers would tend to believe classes in catalyst. I wouldn't claim we should fix it now though, as I guess there would be technical reasons to do so.) |
@HeartSaVioR, I think Spark should always pass an explicit null order to data sources. For example, |
But I am also open to any suggestions. |
First of all, I'm sorry I think I'm giving ignorant inputs. Sorry for joining as novice. Where my concern came from is, we add a new method and use a new method nowhere here, which I can't imagine how it affects the behavior. If that is just to provide the information we lack in connector class (assuming data source devs don't know about the details on catalyst behaviors), that's a nice addition and thanks for doing it. Just I'd also like to think out loud to ask a question are there more ways we can give information here. If it's not just to provide the information but also affects the behavior or the method is somehow be leveraged in any way, please shed a light how, especially how this PR helps to enable below
without adding any code in catalyst to call new method. If you have plans to leverage the new functionality, please share the plan and ideally but optionally file JIRA issues so that anyone doesn't misunderstand we're adding unused functionality. |
Test build #135206 has finished for PR 31580 at commit
|
/** | ||
* Returns the default null ordering to use if no null ordering is specified explicitly. | ||
*/ | ||
public NullOrdering defaultNullOrdering() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, because the catalyst SortDirection
also has a default null ordering.
Shall we further follow catalyst and add an overload of Expressions.sort
which doesn't take NullOrdering
and infer null ordering from SortDirection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just 2 cents, I'd feel more natural to change the null ordering "optional" in SortOrder. It'd be more intuitive on the behavior we want, "if null ordering is not presented, default null ordering of SortDirection is taken", given the fact we now add default null ordering to SortDirection. Callers of SortOrder has to resolve the null ordering but then we no longer delegate the implementations to do that instead.
Each implementation can still simply call the method to get the default null ordering and provide it so no strong voice. Just would like to think out loud what'd be the clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to introduce this, I think it makes sense to leverage it in Expressions$sort
as @cloud-fan said. That's a good use case.
@HeartSaVioR, could you give an example, please? I am not sure I fully got it. Do you mean to make nullOrdering
in SortOrder
optional instead of exposing a default null ordering in SortDirection
? If so, I usually prefer to use the simplest types in the API. Using Optional
will make the API harder to use, in my view. In addition, we probably want to handle SortDirection
in the same way as it is also optional. Not a strong opinion, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the current implementation of SortOrder is following:
/**
* Represents a sort order in the public expression API.
*
* @since 3.2.0
*/
@Experimental
public interface SortOrder extends Expression {
/**
* Returns the sort expression.
*/
Expression expression();
/**
* Returns the sort direction.
*/
SortDirection direction();
/**
* Returns the null ordering.
*/
NullOrdering nullOrdering();
}
which we force to implement nullOrdering()
which destroys the purpose of having default null ordering in SortDirection.
Given SortDirection will have default null ordering, we can provide "default implementation" of nullOrdering()
which would be simply return direction().defaultNullOrdering();
and add some explanation to javadoc to override this method if they don't want to follow the default null ordering of SortDirection.
(I actually thought about Optional or null-like, and I found by myself not prefer it. Given we're able to leverage default method in interface, I don't see the reason not to use it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said as "2 cents", no strong opinion on this (implementors just need to add the line by themselves, which doesn't seem to be critical) and I'll leave the decision to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeartSaVioR SortOrder
is not an API we ask users to implement. The users should call Expressions.sort
to create an instance of SortOrder
.
That's why I suggested adding an overload of Expressions.sort
, as that makes null ordering optional to the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK. I was misunderstood about the usage on SortOrder. If the users (I assume this includes data source implementation) are discouraged to implement this, no problem at all and please disregard my comments.
I understand the catalyst module is considered as private API, but given connector package is now playing as public API (at least some of APIs are public API data source implementations should leverage), it'd be nice if we could explicitly separate the public vs private APIs.
Sorry, I was distracted. Let me take a look at recent comments. |
I've created SPARK-34586 for the use case I mentioned above. Plus, the example @cloud-fan mentioned is another place. I should update this PR to handle that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from my side. Will leave this as it is to see this reflected on @cloud-fan suggestion.
Gentle ping, @aokolnychyi . |
Sorry for the delay, adding an overloaded version of |
Ready for another round. |
Kubernetes integration test starting |
Kubernetes integration test status success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for update.
+1, LGTM. I confirmed that the last commit addresses @cloud-fan 's request
Shall we further follow catalyst and add an overload of
Expressions.sort
which doesn't takeNullOrdering
and infer null ordering fromSortDirection
?
@cloud-fan . Do you want to do the final sign-off?
thanks, merging to master! |
Thank you, @aokolnychyi and all! |
Thanks for the review, @cloud-fan @dongjoon-hyun @HeartSaVioR @rdblue @viirya @sunchao! |
What changes were proposed in this pull request?
This PR adds a default null ordering to public
SortDirection
to match the Catalyst behavior.Why are the changes needed?
The SQL standard does not define the default null ordering for a sort direction. That's why it is up to a query engine to assign one. We need to standardize this in our public connector expressions to avoid ambiguity. That's why I propose to match the behavior in our Catalyst expressions.
Does this PR introduce any user-facing change?
Yes, it affects unreleased connector expression API.
How was this patch tested?
Existing tests.