[SPARK-42557][CONNECT] Add Broadcast to functions #40275
Conversation
```scala
 * @since 3.4.0
 */
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  df.hint("broadcast")
```
is this enough? :)
Yes.
```scala
val left = spark.range(100).select(col("id"), rand(10).as("a"))
val right = spark.range(100).select(col("id"), rand(12).as("a"))
val joined =
  left.join(broadcast(right), left("id") === right("id")).select(left("id"), right("a"))
```
can you check if we actually broadcast the right table?
Looks good! Can you add a check that it is actually broadcasting?
```scala
val joined =
  left.join(broadcast(right), left("id") === right("id")).select(left("id"), right("a"))
assert(joined.schema.catalogString === "struct<id:bigint,a:double>")
testCapturedStdOut(joined.explain(), "BroadcastHashJoin")
```
For later: we should have a better way to get the plan.
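Until then, the test gets at the plan by capturing whatever `explain()` prints and scanning the text. A Spark-free sketch of such a capture helper (hypothetical; not the suite's actual `testCapturedStdOut` implementation) could look like:

```scala
import java.io.ByteArrayOutputStream

// Hypothetical helper: run a thunk, capture everything it prints to
// stdout, and return the captured text as a string.
def capturedStdOut(thunk: => Unit): String = {
  val buffer = new ByteArrayOutputStream()
  Console.withOut(buffer) { thunk }
  buffer.toString
}

// In the real test the thunk would be joined.explain(); here we fake
// a physical plan so the sketch runs without a Spark session.
val out = capturedStdOut {
  println("== Physical Plan ==")
  println("BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight")
}
assert(out.contains("BroadcastHashJoin"))
```

The obvious downside, as noted above, is that string-matching on `explain()` output is brittle compared to inspecting the plan directly.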
Yeah.
```scala
left.join(broadcast(right), left("id") === right("id")).select(left("id"), right("a"))
assert(joined.schema.catalogString === "struct<id:bigint,a:double>")
testCapturedStdOut(joined.explain(), "BroadcastHashJoin")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
```
Prefer to use `try ... finally ...` when you are resetting confs. That way failing tests do not start influencing others.
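A runnable sketch of the suggested pattern, with a plain mutable map standing in for `spark.conf` so it needs no Spark session (the map is an assumption; the real code would call `spark.conf.get`/`spark.conf.set`):

```scala
import scala.collection.mutable

// Stand-in for spark.conf (assumption: real code uses a live session).
val conf = mutable.Map("spark.sql.autoBroadcastJoinThreshold" -> "-1")

val previous = conf.get("spark.sql.autoBroadcastJoinThreshold")
conf("spark.sql.autoBroadcastJoinThreshold") = "10MB"
try {
  // ... assertions that depend on the 10MB threshold would run here ...
} finally {
  // Restore the old value even if an assertion failed, so a failing
  // test does not leak its conf change into later tests.
  previous match {
    case Some(v) => conf("spark.sql.autoBroadcastJoinThreshold") = v
    case None    => conf.remove("spark.sql.autoBroadcastJoinThreshold")
  }
}
```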
Should we copy SQLHelper to the client module later? withSQLConf and other functions are more useful.
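For reference, a hedged sketch of what a client-side `withSQLConf` could look like, again with a mutable map standing in for `spark.conf` (the name matches the SQLHelper API, but this body is an illustration, not the real implementation):

```scala
import scala.collection.mutable

// Stand-in for spark.conf so the sketch runs without a Spark session.
val conf = mutable.Map.empty[String, String]

// Set each (key, value) pair for the duration of body, then restore the
// previous values (or remove keys that were unset) even if body throws.
def withSQLConf[A](pairs: (String, String)*)(body: => A): A = {
  val previous = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try body
  finally previous.foreach {
    case (k, Some(v)) => conf(k) = v
    case (k, None)    => conf.remove(k)
  }
}

conf("spark.sql.autoBroadcastJoinThreshold") = "-1"
val seen = withSQLConf("spark.sql.autoBroadcastJoinThreshold" -> "10MB") {
  conf("spark.sql.autoBroadcastJoinThreshold")
}
// Inside the block the override is visible; afterwards the old value is back.
```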
@LuciferYang It's a good idea. Let's do it later.
LGTM
+1, LGTM
@LuciferYang I want to support the similar
@LuciferYang Thank you.
Merging.
### What changes were proposed in this pull request?
Currently, Connect's functions are missing the `broadcast` API. This PR adds this API to Connect's functions.

### Why are the changes needed?
Add the `broadcast` function to Connect's `functions.scala`.

### Does this PR introduce _any_ user-facing change?
No. New feature.

### How was this patch tested?
New test cases.

Closes #40275 from beliefer/SPARK-42557.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 94b50e0)
Signed-off-by: Herman van Hovell <herman@databricks.com>
@hvanhovell @LuciferYang Thank you.
…xclude` rule from mima check

### What changes were proposed in this pull request?
#40275 has implemented `functions#broadcast`, so this PR removes the corresponding `ProblemFilters.exclude` rule from `CheckConnectJvmClientCompatibility`.

### Why are the changes needed?
Remove the unnecessary `ProblemFilters.exclude` rule.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual check: `dev/connect-jvm-client-mima-check` passed.

Closes #40463 from LuciferYang/SPARK-42557-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>