[SPARK-40178][SQL][COONECT] Support coalesce hints with ease for PySpark and R #42255

Closed
wants to merge 8 commits into from

Conversation

advancedxy
Contributor

@advancedxy advancedxy commented Aug 1, 2023

What changes were proposed in this pull request?

  1. Refactor UnresolvedHint to accept Expressions only as parameters
  2. ResolveHints now parses a StringLiteral as an UnresolvedAttribute, which allows users to pass column names as plain strings in hint parameters.
  3. The hint method in Dataset now treats all its parameters as Columns or Literals; all other values are rejected. The method signature is kept for compatibility and ease of use. This also matches how the hint method is handled in the Connect module.
  4. Connect: PySpark Connect now accepts Column as a hint parameter.
  5. PySpark: allows Column as a hint parameter and tightens the input type check: for list input, only a list of primitive values is now allowed.
  6. SparkR: allows Column as a hint parameter, with a corresponding test.
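
The tightened check described in item 5 can be pictured with a small pure-Python sketch. This is not the PySpark implementation: `Column` here is a hypothetical stand-in for `pyspark.sql.Column`, and `check_hint_parameters` is an illustrative name. It only shows the rule stated above, namely that primitives and Columns are accepted at the top level while lists may hold primitive values only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for pyspark.sql.Column; it only carries a name."""
    name: str


# Types treated as "primitive" in this sketch (an assumption, not Spark's list).
PRIMITIVES = (str, int, float)


def check_hint_parameters(*parameters):
    """Return the parameters unchanged if they pass the check, else raise TypeError."""
    for p in parameters:
        if isinstance(p, list):
            # Lists may only hold primitive values, not Columns or nested lists.
            bad = [e for e in p if not isinstance(e, PRIMITIVES)]
            if bad:
                raise TypeError(f"list parameters must hold only str, int or float, got {bad!r}")
        elif not isinstance(p, PRIMITIVES + (Column,)):
            # Anything that is neither a primitive, a Column, nor a list is rejected.
            raise TypeError(f"unsupported hint parameter: {p!r}")
    return parameters


# Accepted: a partition count, a Column, and a column name given as a string.
check_hint_parameters(10, Column("a"), "b")
```

Under these assumptions, a call such as `check_hint_parameters([Column("a")])` raises `TypeError`, because a list element is not a primitive value.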

Why are the changes needed?

This is a rework of #37616. Before this commit, there was no way for users to directly specify hint information that includes column information in PySpark's hint method. In other words, a rebalance hint that requires column references was not possible before this PR.
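
As a rough illustration of how a string parameter can stand in for a column reference, the following toy model rewrites string literals into attribute references and leaves other parameters alone. The `Literal` and `UnresolvedAttribute` classes here are simplified Python stand-ins named after the Catalyst classes mentioned in this PR, and `normalize_partitioning_params` is a hypothetical helper; the real resolution logic lives in Spark's Scala code and may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Simplified stand-in for a Catalyst literal expression."""
    value: object


@dataclass(frozen=True)
class UnresolvedAttribute:
    """Simplified stand-in for an unresolved column reference."""
    name: str


def normalize_partitioning_params(params):
    """Turn string literals into column references; keep everything else as-is."""
    return [
        UnresolvedAttribute(p.value)
        if isinstance(p, Literal) and isinstance(p.value, str)
        else p
        for p in params
    ]


# A hint written with a partition count and a column name given as a string.
params = normalize_partitioning_params([Literal(10), Literal("a")])
```

In this sketch the integer literal is left untouched, so it can still be read as a partition count, while the string `"a"` becomes a reference to column `a`.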

Does this PR introduce any user-facing change?

Yes. PySpark and SparkR users can now specify rebalance and repartition hints with ease.

How was this patch tested?

Added UTs.

@advancedxy
Contributor Author

advancedxy commented Aug 1, 2023

@HyukjinKwon @ulysses-you would you mind taking a look at this when you have time?

@LuciferYang
Contributor

also cc @zhengruifeng

@github-actions github-actions bot added the R label Aug 2, 2023
@zhengruifeng
Contributor

also cc @cloud-fan

@advancedxy
Contributor Author

Gently ping @HyukjinKwon @cloud-fan

@advancedxy advancedxy marked this pull request as draft August 9, 2023 12:40
@advancedxy advancedxy changed the title [SPARK-40178][SQL] Support string parameters in hint method [SPARK-40178][SQL] support coalesce hints with ease for PySpark and R Aug 10, 2023
@advancedxy advancedxy changed the title [SPARK-40178][SQL] support coalesce hints with ease for PySpark and R [SPARK-40178][SQL][COONECT] support coalesce hints with ease for PySpark and R Aug 10, 2023
@advancedxy advancedxy marked this pull request as ready for review August 10, 2023 17:01
@advancedxy
Contributor Author

@cloud-fan @zhengruifeng @HyukjinKwon @ulysses-you please take another look, since this PR now touches various parts of Spark.

@advancedxy
Contributor Author

Gently ping @cloud-fan @HyukjinKwon

@advancedxy
Contributor Author

Gently ping @cloud-fan @HyukjinKwon @zhengruifeng again.

df.logicalPlan
)
)

check(
df.hint("hint1", Seq(1, 2, 3), Seq($"a", $"b", $"c")),
UnresolvedHint("hint1", Seq(Seq(1, 2, 3), Seq($"a", $"b", $"c")),
df.hint("hint1", Array(1, 2, 3), array($"a", $"b", $"c")),
Contributor

Is this a breaking change? So Seq(1, 2, 3) doesn't work in df.hint anymore?

Contributor Author

@advancedxy advancedxy Aug 18, 2023

Yeah. After this PR, we will reject the Seq(1,2,3) input as it cannot be treated as a literal.

The main reason I didn't transform Scala's Seq to Java's Array is that I believe we should align the semantics between Spark Connect and this DataFrame API. Spark Connect's hint method also treats its input as a literal, which means Seq(1,2,3) doesn't work there either.

If backward compatibility is important, I think both Connect and this API should treat Seq as Array. But if we are targeting 4.0, I think we may have the chance to introduce some breaking changes.

Contributor

It's better to avoid breaking change unless it needs a lot of effort.

Is it only for Seq[Int]? Maybe we can special-case it.

Contributor Author

> It's better to avoid breaking change unless it needs a lot of effort.

I do agree that we should avoid breaking changes unless necessary.

However, if we are going to normalize the input to the hint method, for example by requiring it to be a column or literal, we will introduce breaking changes. We can special-case Seq (not just Seq[Int]) and convert it to Array; however, since the hint method accepts any type of input, we may still break other inputs.

Also, I didn't see any hint that accepts a Seq as input in the code. Are you aware of any such hints existing in the wild?

Contributor

Oh, I missed it. It's a custom hint, hint1. I think we are fine as long as the builtin hints are not broken.

Contributor

@cloud-fan cloud-fan left a comment

SQL part LGTM. @HyukjinKwon can you help review the Python and R part?

@HyukjinKwon HyukjinKwon changed the title [SPARK-40178][SQL][COONECT] support coalesce hints with ease for PySpark and R [SPARK-40178][SQL][COONECT] Support coalesce hints with ease for PySpark and R Aug 21, 2023
@HyukjinKwon
Member

Merged to master.

@advancedxy advancedxy deleted the SPARK-40178 branch August 22, 2023 04:59
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
…ark and R

### What changes were proposed in this pull request?
1. Refactor `UnresolvedHint` to accept Expressions only as parameters
2. ResolveHints now parses a StringLiteral as an UnresolvedAttribute, which allows users to pass column names as plain strings in hint parameters.
3. The `hint` method in Dataset now treats all its parameters as `Column`s or `Literal`s; all other values are rejected. The method signature is kept for compatibility and ease of use. This also matches how the hint method is handled in the Connect module.
4. Connect: PySpark Connect now accepts `Column` as a hint parameter.
5. PySpark: allows `Column` as a hint parameter and tightens the input type check: for list input, only a list of primitive values is now allowed.
6. SparkR: allows `Column` as a hint parameter, with a corresponding test.

### Why are the changes needed?
This is a rework of apache#37616. Before this commit, there was no way for users to directly specify hint information that includes column information in PySpark's hint method. In other words, a `rebalance` hint that requires column references was not possible before this PR.

### Does this PR introduce _any_ user-facing change?
Yes. PySpark and SparkR users can now specify rebalance and repartition hints with ease.

### How was this patch tested?
Added UTs.

Closes apache#42255 from advancedxy/SPARK-40178.

Lead-authored-by: Xianjin <xianjin@apache.org>
Co-authored-by: Xianjin YE <xianjin@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024