[SPARK-42720][PS][SQL] Uses expression for distributed-sequence default index instead of plan #40456

Closed
wants to merge 4 commits into from

HyukjinKwon
Member

What changes were proposed in this pull request?

This PR replaces DataFrame.withSequenceColumn with DataFrame.select(distributed_sequence_column, col("*")) internally, because the operation essentially attaches a column and should therefore be treated as a scalar expression at the logical level rather than as a separate plan node.

This is used to generate the unique index only for pandas API on Spark.
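
For readers unfamiliar with the distributed-sequence default index, below is a conceptual sketch in plain Python; it is not Spark code and not the implementation in this PR. It assumes the index is assigned by counting rows per partition and then offsetting each partition by the number of rows before it. The function name `attach_distributed_sequence`, the `__index__` column name, and the list-of-lists model of partitions are all hypothetical, chosen only for illustration.

```python
from itertools import accumulate
from typing import Dict, List

Row = Dict[str, object]


def attach_distributed_sequence(
    partitions: List[List[Row]], index_name: str = "__index__"
) -> List[List[Row]]:
    """Prepend a consecutive 0-based index column to every row.

    Hypothetical model: each inner list stands in for one partition.
    """
    # Pass 1: count the rows in each partition.
    counts = [len(part) for part in partitions]
    # The starting offset of a partition is the total row count of all
    # partitions that come before it.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: attach offset + local position as the first column,
    # mirroring select(distributed_sequence_column, col("*")).
    return [
        [{index_name: offset + i, **row} for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]
```

The index is placed first and the remaining columns are passed through unchanged, which is the shape the new select-based form expresses: one extra column alongside col("*").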

Why are the changes needed?

For better code readability and a cleaner definition of the Spark Connect protobuf message; see also #40270.

Does this PR introduce any user-facing change?

No, it's an internal change only.

How was this patch tested?

Existing test cases in pandas API on Spark verify this change.

@github-actions github-actions bot added the SQL label Mar 16, 2023
@HyukjinKwon
Member Author

cc @cloud-fan @hvanhovell FYI

@HyukjinKwon HyukjinKwon force-pushed the SPARK-42720 branch 3 times, most recently from e98b674 to dc74cdf Compare March 16, 2023 12:10
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

BTW, according to the JIRA, this is only for Apache Spark 3.5?

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon HyukjinKwon deleted the SPARK-42720 branch January 15, 2024 00:52
5 participants