[SPARK-40178][SQL][COONECT] Support coalesce hints with ease for PySpark and R #42255

advancedxy · 2023-08-01T06:09:32Z

What changes were proposed in this pull request?

Refactor UnresolvedHint to accept only Expressions as parameters.
ResolveHints now parses a StringLiteral as an UnresolvedAttribute, which allows users to specify column names as strings directly in hint parameters.
The hint method in Dataset now treats all of its parameters as Columns or Literals; any other value is rejected. The method signature is kept for compatibility and ease of use, and the behavior matches how the hint method is handled in the Connect module.
Connect: PySpark Connect now accepts Column as a hint parameter.
PySpark: allows Column as a hint parameter and tightens the type check on input parameters: for list input, only a list of primitive values is now allowed.
SparkR: allows Column as a hint parameter, with a corresponding test.
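
The tightened PySpark check described above can be pictured with a small sketch. This is not the code from the PR: `Column` here is a stand-in class so the example runs without Spark, and `check_hint_parameters` is a hypothetical name chosen for illustration. It only models the stated rule that a parameter may be a primitive, a Column, or a list made up entirely of primitives.

```python
class Column:
    """Stand-in for pyspark.sql.Column (assumption: lets the sketch run without Spark)."""

    def __init__(self, name):
        self.name = name


# Primitive types assumed to be acceptable as literal hint parameters.
PRIMITIVES = (str, int, float)


def check_hint_parameters(*parameters):
    """Hypothetical helper: return the parameters unchanged, or raise TypeError."""
    for p in parameters:
        # A bare primitive or a Column is accepted as-is.
        if isinstance(p, (Column, *PRIMITIVES)):
            continue
        # A list is accepted only when every element is a primitive.
        if isinstance(p, list) and all(isinstance(e, PRIMITIVES) for e in p):
            continue
        raise TypeError(f"unsupported hint parameter: {p!r}")
    return parameters


# A list of primitives passes; a list containing a Column does not.
check_hint_parameters("a", 10, Column("b"), [1, 2, 3])
try:
    check_hint_parameters([Column("a")])
except TypeError as err:
    print("rejected:", err)
```

Under this model a Column is passed as its own parameter rather than wrapped in a list, which is the shape the description implies for column references.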

Why are the changes needed?

This is a rework of #37616. Before this commit, there was no way for users to directly specify hint info that includes column info in PySpark's hint method. In other words, a rebalance hint that requires column refs was not possible before this PR.

Does this PR introduce any user-facing change?

Yes. PySpark and SparkR users may now specify rebalance and repartition hints with ease.
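
As a rough illustration of the user-facing call, the sketch below wraps `hint` in a hypothetical helper, `rebalance_by`, that puts an optional partition count before the column parameters, mirroring the `REBALANCE(3, c)` form of the SQL hint. The helper and the `RecordingFrame` stand-in are assumptions made so the example runs without a Spark session. With a real PySpark DataFrame on a build that includes this change, the columns could be given as names or as Column objects.

```python
def rebalance_by(df, *columns, num_partitions=None):
    """Hypothetical helper: apply a rebalance hint with optional partition count."""
    params = list(columns)
    if num_partitions is not None:
        # The partition count, when given, goes before the columns.
        params.insert(0, num_partitions)
    return df.hint("rebalance", *params)


class RecordingFrame:
    """Stand-in for a DataFrame that only records hint calls (assumption)."""

    def __init__(self):
        self.calls = []

    def hint(self, name, *parameters):
        self.calls.append((name, parameters))
        return self


frame = RecordingFrame()
rebalance_by(frame, "a", num_partitions=10)
print(frame.calls)
```

On a real DataFrame the equivalent direct call would be along the lines of `df.hint("rebalance", 10, "a")`.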

How was this patch tested?

Added UTs.

python/pyspark/sql/tests/test_dataframe.py

advancedxy · 2023-08-01T06:11:48Z

@HyukjinKwon @ulysses-you would you mind taking a look at this when you have time?

LuciferYang · 2023-08-01T06:38:47Z

also cc @zhengruifeng

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

zhengruifeng · 2023-08-03T06:13:56Z

also cc @cloud-fan

advancedxy · 2023-08-06T07:16:21Z

Gently ping @HyukjinKwon @cloud-fan

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

advancedxy · 2023-08-10T17:05:46Z

@cloud-fan @zhengruifeng @HyukjinKwon @ulysses-you please take another look, since this PR now touches various parts of Spark.

advancedxy · 2023-08-14T02:30:35Z

Gently ping @cloud-fan @HyukjinKwon

advancedxy · 2023-08-17T08:56:36Z

Gently ping @cloud-fan @HyukjinKwon @zhengruifeng again.

cloud-fan · 2023-08-18T01:46:23Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameHintSuite.scala

        df.logicalPlan
      )
    )

    check(
-      df.hint("hint1", Seq(1, 2, 3), Seq($"a", $"b", $"c")),
-      UnresolvedHint("hint1", Seq(Seq(1, 2, 3), Seq($"a", $"b", $"c")),
+      df.hint("hint1", Array(1, 2, 3), array($"a", $"b", $"c")),


Is this a breaking change? So Seq(1, 2, 3) doesn't work in df.hint anymore?

Yeah. After this PR, we will reject the Seq(1,2,3) input as it cannot be treated as a literal.

The main reason I didn't transform Scala's Seq to a Java Array is that I believe we should align the semantics between Spark Connect and this DataFrame API. Spark Connect's hint method also treats its input as a literal, which means Seq(1,2,3) doesn't work there either.

If backward compatibility is important, I think both Connect and this API should treat Seq as Array. But if we are targeting 4.0, I think we may have the chance to introduce some breaking changes.

It's better to avoid breaking change unless it needs a lot of effort.

Is it only for Seq[Int]? Maybe we can special-case it.

> It's better to avoid breaking change unless it needs a lot of effort.

I do agree that we should avoid breaking changes unless necessary.

However, if we normalize the input to the hint method, for example by requiring it to be a Column or literal, we will introduce breaking changes. We can special-case Seq (not just Seq[Int]) and convert it to Array, but since the hint method accepts any type of input, we will still potentially break other inputs.

Also, I didn't see any hint in the code that accepts a Seq as input. Are you aware of any such hints existing in the wild?

Oh I missed it. It's a custom hint hint1. I think we are fine as long as the builtin hints are not broken.

cloud-fan

SQL part LGTM. @HyukjinKwon can you help review the Python and R part?

HyukjinKwon · 2023-08-21T00:28:00Z

Merged to master.

Closes apache#42255 from advancedxy/SPARK-40178.

Lead-authored-by: Xianjin <xianjin@apache.org>
Co-authored-by: Xianjin YE <xianjin@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

[SPARK-40178][SQL] Support string parameters in hint method

5df1346

github-actions bot added SQL PYTHON labels Aug 1, 2023

advancedxy commented Aug 1, 2023

python/pyspark/sql/tests/test_dataframe.py (outdated, resolved)

HyukjinKwon reviewed Aug 1, 2023

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (outdated, resolved)

fix errors in Python and R

a71c3a1

github-actions bot added the R label Aug 2, 2023

cloud-fan reviewed Aug 7, 2023

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (resolved)

advancedxy marked this pull request as draft August 9, 2023 12:40

tmp: code sync

61e6acd

github-actions bot added the CONNECT label Aug 9, 2023

advancedxy added 4 commits August 10, 2023 09:29

stash

e7d334b

allow column as input in Python and R

2afe12d

fix

a1a2ed5

fix R again

5f0e07e

advancedxy changed the title ~~[SPARK-40178][SQL] Support string parameters in hint method~~ [SPARK-40178][SQL] support coalesce hints with ease for PySpark and R Aug 10, 2023

advancedxy changed the title ~~[SPARK-40178][SQL] support coalesce hints with ease for PySpark and R~~ [SPARK-40178][SQL][COONECT] support coalesce hints with ease for PySpark and R Aug 10, 2023

refine

1383efa

advancedxy marked this pull request as ready for review August 10, 2023 17:01

cloud-fan reviewed Aug 18, 2023

cloud-fan approved these changes Aug 18, 2023

HyukjinKwon approved these changes Aug 21, 2023

HyukjinKwon changed the title ~~[SPARK-40178][SQL][COONECT] support coalesce hints with ease for PySpark and R~~ [SPARK-40178][SQL][COONECT] Support coalesce hints with ease for PySpark and R Aug 21, 2023

HyukjinKwon closed this in 3d5a7d9 Aug 21, 2023

advancedxy deleted the SPARK-40178 branch August 22, 2023 04:59
