Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34457][SQL] DataSource V2: Add default null ordering to SortDirection #31580

Closed
wants to merge 2 commits into from

Conversation

aokolnychyi
Copy link
Contributor

What changes were proposed in this pull request?

This PR adds a default null ordering to public SortDirection to match the Catalyst behavior.

Why are the changes needed?

The SQL standard does not define the default null ordering for a sort direction. That's why it is up to a query engine to assign one. We need to standardize this in our public connector expressions to avoid ambiguity. That's why I propose to match the behavior in our Catalyst expressions.

Does this PR introduce any user-facing change?

Yes, it affects unreleased connector expression API.

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label Feb 17, 2021
@aokolnychyi
Copy link
Contributor Author

aokolnychyi commented Feb 17, 2021

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @aokolnychyi . This is Apache Spark community, not Iceberg community.
To give more context, could you file a JIRA issue first and use the ID?

@aokolnychyi aokolnychyi changed the title DataSource V2: Add default null ordering to SortDirection [SPARK-34457][SQL] DataSource V2: Add default null ordering to SortDirection Feb 17, 2021
@aokolnychyi
Copy link
Contributor Author

Oops, I created the JIRA and used it for the branch name but forgot to add it to the description. Fixed now.

@dongjoon-hyun
Copy link
Member

Thank you for update, @aokolnychyi !

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @aokolnychyi .

@SparkQA
Copy link

SparkQA commented Feb 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39787/

@HeartSaVioR
Copy link
Contributor

I prefer to see the relevant code in same PR (or even another PR on top of this PR) unless the code diff is huge so we have to break down. At least I'd like to ensure we have JIRA issue to address the rest for sately.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@aokolnychyi
Copy link
Contributor Author

@HeartSaVioR, this change is self-contained. At least, I don't have a PR that depends on it. I think I overlooked this part when I introduced this class. As far as I know, the SQL standard does not define the default null ordering for a sort direction. That's why the default value is different among query engines. Since this class is part of our public expression API, I am afraid to let data sources to interpret this. That's why I tried to match the behavior in Catalyst SortDirection, which also exposes this method. But it is just a safety measure, there is no requirement to do it.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Feb 17, 2021

OK from what I understand from your comment is, this is informative one to let data source developers to "indicate" the default null ordering for ASC and DESC, not to provide the way to define/override to their own need, right?

(If we think the default null ordering is not clear for ASC and DESC so we have to have a way to indicate, what about including the fact to toString as well? Or, having javadoc to each element?)

/**
* A sort direction used in sorting expressions.
* <p>
* Each direction has a default null ordering that is implied if no null ordering is specified
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing me as someone could specify the null ordering explicitly but in reality we don't leverage the default null ordering in any way. (Otherwise we should have the relevant code to leverage that.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. The doc is correct.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (non-binding)

@SparkQA
Copy link

SparkQA commented Feb 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39787/

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Feb 17, 2021

So I basically understand what this PR is trying to do and I'm appreciated about the details. Nice!

I'd just like to check what more approaches we can do to clarify the default null ordering to data source developers easier to indicate. Do we expect them to look into the code directly, or expect them to call defaultNullOrdering() before using it, or just rely on javadoc? If we suppose the data source developers are lazy and just want to look javadoc at most, is the new addition exposed to the javadoc? (It's just to confirm, I don't know the answer unless creating javadoc.)

@HeartSaVioR
Copy link
Contributor

(I'm a bit afraid that we are going to have two source-of-truths as data source developers would tend to believe classes in connector package and Spark maintainers would tend to believe classes in catalyst. I wouldn't claim we should fix it now though, as I guess there would be technical reasons to do so.)

@aokolnychyi
Copy link
Contributor Author

@HeartSaVioR, I think Spark should always pass an explicit null order to data sources. For example, SortOrder in the public expression API that is supposed to be passed to data sources always has a null ordering. Where this PR can be handy is if we want to support defining distribution or ordering in SQL statements (e.g. table creation). Then if the user just says ASC, we will parse it SortDirection.ASC and take the default null ordering.

@aokolnychyi
Copy link
Contributor Author

But I am also open to any suggestions.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Feb 17, 2021

First of all, I'm sorry I think I'm giving ignorant inputs. Sorry for joining as novice.

Where my concern came from is, we add a new method and use a new method nowhere here, which I can't imagine how it affects the behavior. If that is just to provide the information we lack in connector class (assuming data source devs don't know about the details on catalyst behaviors), that's a nice addition and thanks for doing it. Just I'd also like to think out loud to ask a question are there more ways we can give information here.

If it's not just to provide the information but also affects the behavior or the method is somehow be leveraged in any way, please shed a light how, especially how this PR helps to enable below

Where this PR can be handy is if we want to support defining distribution or ordering in SQL statements (e.g. table creation). Then if the user just says ASC, we will parse it SortDirection.ASC and take the default null ordering.

without adding any code in catalyst to call new method.

If you have plans to leverage the new functionality, please share the plan and ideally but optionally file JIRA issues so that anyone doesn't misunderstand we're adding unused functionality.

@SparkQA
Copy link

SparkQA commented Feb 18, 2021

Test build #135206 has finished for PR 31580 at commit ef22297.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Returns the default null ordering to use if no null ordering is specified explicitly.
*/
public NullOrdering defaultNullOrdering() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense, because the catalyst SortDirection also has a default null ordering.

Shall we further follow catalyst and add an overload of Expressions.sort which doesn't take NullOrdering and infer null ordering from SortDirection?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just 2 cents, I'd feel more natural to change the null ordering "optional" in SortOrder. It'd be more intuitive on the behavior we want, "if null ordering is not presented, default null ordering of SortDirection is taken", given the fact we now add default null ordering to SortDirection. Callers of SortOrder has to resolve the null ordering but then we no longer delegate the implementations to do that instead.

Each implementation can still simply call the method to get the default null ordering and provide it so no strong voice. Just would like to think out loud what'd be the clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to introduce this, I think it makes sense to leverage it in Expressions$sort as @cloud-fan said. That's a good use case.

@HeartSaVioR, could you give an example, please? I am not sure I fully got it. Do you mean to make nullOrdering in SortOrder optional instead of exposing a default null ordering in SortDirection? If so, I usually prefer to use the simplest types in the API. Using Optional will make the API harder to use, in my view. In addition, we probably want to handle SortDirection in the same way as it is also optional. Not a strong opinion, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the current implementation of SortOrder is following:

/**
 * Represents a sort order in the public expression API.
 *
 * @since 3.2.0
 */
@Experimental
public interface SortOrder extends Expression {
  /**
   * Returns the sort expression.
   */
  Expression expression();

  /**
   * Returns the sort direction.
   */
  SortDirection direction();

  /**
   * Returns the null ordering.
   */
  NullOrdering nullOrdering();
}

which we force to implement nullOrdering() which destroys the purpose of having default null ordering in SortDirection.

Given SortDirection will have default null ordering, we can provide "default implementation" of nullOrdering() which would be simply return direction().defaultNullOrdering(); and add some explanation to javadoc to override this method if they don't want to follow the default null ordering of SortDirection.

(I actually thought about Optional or null-like, and I found by myself not prefer it. Given we're able to leverage default method in interface, I don't see the reason not to use it.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said as "2 cents", no strong opinion on this (implementors just need to add the line by themselves, which doesn't seem to be critical) and I'll leave the decision to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeartSaVioR SortOrder is not an API we ask users to implement. The users should call Expressions.sort to create an instance of SortOrder.

That's why I suggested adding an overload of Expressions.sort, as that makes null ordering optional to the users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK. I was misunderstood about the usage on SortOrder. If the users (I assume this includes data source implementation) are discouraged to implement this, no problem at all and please disregard my comments.

I understand the catalyst module is considered as private API, but given connector package is now playing as public API (at least some of APIs are public API data source implementations should leverage), it'd be nice if we could explicitly separate the public vs private APIs.

@aokolnychyi
Copy link
Contributor Author

Sorry, I was distracted. Let me take a look at recent comments.

@aokolnychyi
Copy link
Contributor Author

If you have plans to leverage the new functionality, please share the plan and ideally but optionally file JIRA issues so that anyone doesn't misunderstand we're adding unused functionality.

I've created SPARK-34586 for the use case I mentioned above. Plus, the example @cloud-fan mentioned is another place. I should update this PR to handle that.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from my side. Will leave this as it is to see this reflected on @cloud-fan suggestion.

@dongjoon-hyun
Copy link
Member

Gentle ping, @aokolnychyi .

@aokolnychyi
Copy link
Contributor Author

Sorry for the delay, adding an overloaded version of sort to Expressions right now.

@aokolnychyi
Copy link
Contributor Author

Ready for another round.

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40535/

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40535/

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for update.

+1, LGTM. I confirmed that the last commit addresses @cloud-fan 's request

Shall we further follow catalyst and add an overload of Expressions.sort which doesn't take NullOrdering and infer null ordering from SortDirection?

@cloud-fan . Do you want to do the final sign-off?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 7226379 Mar 11, 2021
@dongjoon-hyun
Copy link
Member

Thank you, @aokolnychyi and all!

@aokolnychyi
Copy link
Contributor Author

Thanks for the review, @cloud-fan @dongjoon-hyun @HeartSaVioR @rdblue @viirya @sunchao!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
8 participants