-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-40178][SQL][COONECT] Support coalesce hints with ease for PySpark and R #42255
Conversation
@HyukjinKwon @ulysses-you would you mind to please a look at this when you have time. |
also cc @zhengruifeng |
also cc @cloud-fan |
Gently ping @HyukjinKwon @cloud-fan |
@cloud-fan @zhengruifeng @HyukjinKwon @ulysses-you please take a look again since this PR now touches various parts of spark. |
Gently ping @cloud-fan @HyukjinKwon |
Gently ping @cloud-fan @HyukjinKwon @zhengruifeng again. |
df.logicalPlan | ||
) | ||
) | ||
|
||
check( | ||
df.hint("hint1", Seq(1, 2, 3), Seq($"a", $"b", $"c")), | ||
UnresolvedHint("hint1", Seq(Seq(1, 2, 3), Seq($"a", $"b", $"c")), | ||
df.hint("hint1", Array(1, 2, 3), array($"a", $"b", $"c")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a breaking change? so Seq(1, 2, 3)
doesn't work in df.hint
anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. After this PR, we will reject the Seq(1,2,3)
input as it cannot be treated as a literal.
The main reason that I didn't transform Scala's Seq
to Java's Array
is that we believe should align the semantics between Spark Connect and this Dataframe's API. Spark Connect's hint method also treats input as literal, which means Seq(1,2,3)
doesn't work too.
If backward compatibility is important, I think both connect and this API should all treat Seq as Array. But if we are targeting 4.0, I think we may have the chance to introduce som breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to avoid breaking change unless it needs a lot of effort.
Is it only for Seq[Int]
? Maybe we can special-case it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to avoid breaking change unless it needs a lot of effort.
I do agree that we should avoid breaking change unless necessary.
However if we are going to normalize the input to the hint method, such as requiring it to be a column/literal, we will bring breaking changes. We can special-case for Seq
(not just Seq[Int]
) to Array, however since the hint accept any type of input, we will break other inputs potentially.
Also, I didn't see any hint accept a Seq
as input in the code, are you aware of such hints exists in the wild?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I missed it. It's a custom hint hint1
. I think we are fine as long as the builtin hints are not broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL part LGTM. @HyukjinKwon can you help review the Python and R part?
Merged to master. |
…ark and R ### What changes were proposed in this pull request? 1. Refactor `UnresolvedHint` to accept Expressions only as parameters 2. ResolveHints now parses StringLiteral as UnresolvedAttribute, which would allow users to specify string in parameters directly 3. `hint` method in Dataset now treats all its parameters as `Column`s or `Literal`s, all other values would be rejected. The method signature is kept for better compatibility and ease of use. It also matches how hint method is handled in the Connect module. 4. Connect: PySpark Connect now accepts `Column` as hint's parameters. 5. PySpark: allows `Column` as hint's parameters and tighten the input parameters type check: for list input, only list of primitive values is now allowed 6. SparkR: allows `Column` as hint's parameters and corresponding test. ### Why are the changes needed? This is a rework of apache#37616. Before this commit, there's no way for users to directly specify hint info that include column info in PySpark's hint method. In other ways, `rebalance` hint that requires column refs is not possible before this PR. ### Does this PR introduce _any_ user-facing change? Yes. PySpark and Spark for R uses may specify rebalance and repartition hint with ease. ### How was this patch tested? Added UTs. Closes apache#42255 from advancedxy/SPARK-40178. Lead-authored-by: Xianjin <xianjin@apache.org> Co-authored-by: Xianjin YE <xianjin@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ark and R ### What changes were proposed in this pull request? 1. Refactor `UnresolvedHint` to accept Expressions only as parameters 2. ResolveHints now parses StringLiteral as UnresolvedAttribute, which would allow users to specify string in parameters directly 3. `hint` method in Dataset now treats all its parameters as `Column`s or `Literal`s, all other values would be rejected. The method signature is kept for better compatibility and ease of use. It also matches how hint method is handled in the Connect module. 4. Connect: PySpark Connect now accepts `Column` as hint's parameters. 5. PySpark: allows `Column` as hint's parameters and tighten the input parameters type check: for list input, only list of primitive values is now allowed 6. SparkR: allows `Column` as hint's parameters and corresponding test. ### Why are the changes needed? This is a rework of apache#37616. Before this commit, there's no way for users to directly specify hint info that include column info in PySpark's hint method. In other ways, `rebalance` hint that requires column refs is not possible before this PR. ### Does this PR introduce _any_ user-facing change? Yes. PySpark and Spark for R uses may specify rebalance and repartition hint with ease. ### How was this patch tested? Added UTs. Closes apache#42255 from advancedxy/SPARK-40178. Lead-authored-by: Xianjin <xianjin@apache.org> Co-authored-by: Xianjin YE <xianjin@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
UnresolvedHint
to accept Expressions only as parametershint
method in Dataset now treats all its parameters asColumn
s orLiteral
s, all other values would be rejected. The method signature is kept for better compatibility and ease of use. It also matches how hint method is handled in the Connect module.Column
as hint's parameters.Column
as hint's parameters and tighten the input parameters type check: for list input, only list of primitive values is now allowedColumn
as hint's parameters and corresponding test.Why are the changes needed?
This is a rework of #37616. Before this commit, there's no way for users to directly specify hint info that include column info in PySpark's hint method. In other ways,
rebalance
hint that requires column refs is not possible before this PR.Does this PR introduce any user-facing change?
Yes. PySpark and Spark for R uses may specify rebalance and repartition hint with ease.
How was this patch tested?
Added UTs.