[SPARK-40938][CONNECT] Support Alias for every type of Relation #38415

Closed

amaliujia
Contributor

What changes were proposed in this pull request?

Previously, the Connect server could check the alias only for Read and Project. In the Spark DataFrame API, however, any DataFrame can be chained with as(alias: String), so every Relation/LogicalPlan can have an alias. This PR refactors the code to support that.

Why are the changes needed?

Improve API coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests (UT).

@amaliujia
Contributor Author

R: @cloud-fan

@@ -47,6 +47,13 @@ message Relation {

Unknown unknown = 999;
}
// Optional. Every relation might have an alias.
Contributor

I don't have a strong opinion, and I'm not sure which one is better. We could also follow Catalyst and add a new plan, SubqueryAlias(child, alias).

Contributor Author

Given that this is already a message, and that we want to match the DataFrame API, I think it makes more sense to have a SubqueryAlias in the proto. A call to as(alias) in the client will then just create this SubqueryAlias, which is the same as the existing DataFrame implementation.

I have updated this PR accordingly.
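The wrapping approach described above can be sketched in a few lines of Python. This is only a toy model, not the Spark Connect client or its generated proto classes: `Read`, `DataFrame`, and the `alias` method are hypothetical stand-ins (`as` is a reserved word in Python), and only the `SubqueryAlias` name and its `input`/`alias` fields come from this PR.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Read:
    """Hypothetical leaf relation that reads a named table."""
    table: str


@dataclass(frozen=True)
class SubqueryAlias:
    """Toy plan node mirroring the SubqueryAlias(input, alias) message."""
    input: "Relation"
    alias: str


# Any relation type can appear as the input of a SubqueryAlias.
Relation = Union[Read, SubqueryAlias]


@dataclass(frozen=True)
class DataFrame:
    """Toy client-side DataFrame that only carries a plan."""
    plan: Relation

    def alias(self, name: str) -> "DataFrame":
        # Wrap the current plan in a new node instead of setting an
        # alias field on it, so every relation is aliased the same way.
        return DataFrame(SubqueryAlias(input=self.plan, alias=name))


df = DataFrame(Read("people")).alias("p")
```

Because the alias lives in its own node, adding a new relation type would not require adding an alias field to it.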

Alias alias = 200;

// Relation alias.
message Alias {
Contributor

why can't we use string directly?

Contributor Author

@amaliujia amaliujia Oct 28, 2022

Do you think we need to care about a client calling xx.as("")? Should the client side reject this? If so, we can just use a string.

It always comes down to whether we need to know if a field is set, not set, or holds its default value.

Contributor

oh, I thought the default value of string is null...

Contributor Author

default is empty string...

I really don't like the way of proto for this..
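The set-versus-default question in this thread can be illustrated without protobuf at all. The sketch below is a plain-Python model, not generated proto code: a bare string field defaults to the empty string, so "not set" and "set to the empty string" look identical, while a message-typed field can be absent. The class names and the `name` field inside the wrapper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AliasMessage:
    """Stand-in for a wrapper message around the alias string."""
    name: str = ""  # hypothetical field; strings default to ""


@dataclass
class RelationWithString:
    """Models `string alias`: there is no way to record 'unset'."""
    alias: str = ""


@dataclass
class RelationWithMessage:
    """Models `Alias alias`: None plays the role of 'field not set'."""
    alias: Optional[AliasMessage] = None


def has_alias(rel: RelationWithMessage) -> bool:
    # Presence is tracked separately from the wrapped string value.
    return rel.alias is not None


unset = RelationWithString()
empty = RelationWithString(alias="")
# With a bare string, these two cases cannot be told apart.
indistinguishable = unset == empty

wrapped_unset = RelationWithMessage()
wrapped_empty = RelationWithMessage(alias=AliasMessage(name=""))
```

With the wrapper, `has_alias(wrapped_unset)` is false and `has_alias(wrapped_empty)` is true, which is the distinction a bare string cannot express.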

@AmplabJenkins

Can one of the admins verify this patch?

message SubqueryAlias {
// Required. The input relation.
Relation input = 1;
// Required. The alias.
Contributor

How can we check whether it's present or not, given that the default value is an empty string?

Contributor Author

Given our discussion on the protocol, this is a required field, so we ask clients to always set it. The server side only fetches the value in this field, no matter what it is ("" or not).

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c09c779 Nov 1, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022

Closes apache#38415 from amaliujia/every_relation_has_alias.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>