[SPARK-42720][PS][SQL] Uses expression for distributed-sequence default index instead of plan #40456

HyukjinKwon · 2023-03-16T09:19:48Z

What changes were proposed in this pull request?

This PR internally replaces DataFrame.withSequenceColumn with DataFrame.select(distributed_sequence_column, col("*")), because the operation essentially attaches a column and should therefore be treated as a scalar expression at the logical level.

The column is used only by the pandas API on Spark, to generate a unique index.
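One way to picture "expression instead of plan" is the toy model below. It is a plain-Python sketch, not Spark code: every class and function name in it (Column, SequenceId, Scan, AttachSequence, Project, extract_sequence_id, and the `__seq_id__` column name) is a hypothetical stand-in invented for illustration. The caller writes a placeholder expression inside an ordinary projection, and a rewrite rule later swaps the placeholder for a column reference while inserting the plan node that actually produces the column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Reference to an existing column by name."""
    name: str


@dataclass(frozen=True)
class SequenceId:
    """Hypothetical placeholder expression meaning 'a sequence ID goes here'."""


@dataclass(frozen=True)
class Scan:
    """Leaf plan node that produces the named columns."""
    columns: tuple


@dataclass(frozen=True)
class AttachSequence:
    """Hypothetical plan node that adds a sequence column to its child's output."""
    column: str
    child: object


@dataclass(frozen=True)
class Project:
    """Plan node that evaluates a tuple of expressions over its child."""
    exprs: tuple
    child: object


def extract_sequence_id(plan, column="__seq_id__"):
    """Toy rewrite rule (an assumption, not Spark's analyzer rule).

    If a Project contains a SequenceId placeholder, replace the placeholder
    with a reference to `column` and wrap the child in AttachSequence, the
    node that is responsible for producing that column.
    """
    if not isinstance(plan, Project):
        return plan
    if not any(isinstance(e, SequenceId) for e in plan.exprs):
        return plan
    new_exprs = tuple(
        Column(column) if isinstance(e, SequenceId) else e for e in plan.exprs
    )
    return Project(new_exprs, AttachSequence(column, plan.child))


# The caller only writes a projection; the rule introduces the plan node.
before = Project((SequenceId(), Column("a"), Column("b")), Scan(("a", "b")))
after = extract_sequence_id(before)
```

In this sketch the user-facing surface is just an expression inside a select-like projection, which is the shape the description above argues for; the dedicated plan node still exists, but only as an internal detail introduced by the rewrite.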

Why are the changes needed?

For better code readability and a cleaner definition of the Spark Connect protobuf message; see also #40270.

Does this PR introduce any user-facing change?

No, this is an internal change only.

How was this patch tested?

Existing test cases in pandas API on Spark verify this change.

HyukjinKwon · 2023-03-16T09:20:05Z

cc @cloud-fan @hvanhovell FYI

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DistributedSequenceID.scala

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2023-03-17T03:37:04Z

BTW, according to the JIRA, this is only for Apache Spark 3.5?

...yst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ExtractDistributedSequenceID.scala

HyukjinKwon · 2023-03-20T12:15:52Z

Merged to master.

[SPARK-42720][SQL] Refactor the withSequenceColumn

9e690c7

github-actions bot added the SQL label Mar 16, 2023

HyukjinKwon mentioned this pull request Mar 16, 2023

[WIP][SPARK-42662][CONNECT][PYTHON][PS] Support withSequenceColumn as PySpark DataFrame internal function. #40270

Closed

zhengruifeng reviewed Mar 16, 2023

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DistributedSequenceID.scala

HyukjinKwon force-pushed the SPARK-42720 branch 3 times, most recently from e98b674 to dc74cdf, March 16, 2023 12:10

fix

1a812b0

HyukjinKwon force-pushed the SPARK-42720 branch from dc74cdf to 1a812b0, March 16, 2023 12:13

dongjoon-hyun approved these changes Mar 17, 2023

cloud-fan reviewed Mar 17, 2023

...yst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ExtractDistributedSequenceID.scala Outdated

Address a comment

a7d8c3c

cloud-fan reviewed Mar 17, 2023

...yst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ExtractDistributedSequenceID.scala

cloud-fan reviewed Mar 17, 2023

...yst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ExtractDistributedSequenceID.scala Outdated

Add a resolve check

cc149aa

cloud-fan approved these changes Mar 20, 2023

HyukjinKwon closed this in 551cda9 Mar 20, 2023

HyukjinKwon deleted the SPARK-42720 branch January 15, 2024 00:52
