Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDAP-16709 implement manual broadcasts #12213

Merged
merged 1 commit into from
May 28, 2020

Conversation

albertshau
Copy link
Contributor

Honor the broadcast flag set in the JoinDefinition when joining
multiple DataFrames. Added a small tweak to the join logic to
join all non-broadcasted datasets first in order to ensure that
both sides of the join are not broadcast, and to reduce the amount
of data that is being shuffled in non-broadcast intermediate joins.

@albertshau
Copy link
Contributor Author

@albertshau albertshau force-pushed the feature_release/CDAP-16709-manual-broadcast branch from 54dfff9 to e5a66dc Compare May 27, 2020 21:54
@@ -95,6 +95,9 @@ public void initialize() throws Exception {

SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.speculation", "false");
// turn off auto-broadcast by default until we better understand the implications and can set this to a
// value that we are confident is safe.
sparkConf.set("spark.sql.autoBroadcastJoinThreshold", "-1");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any constant class we can use for these config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that I know of. If there was, it could potentially change across different Spark versions too.

@albertshau albertshau force-pushed the feature_release/CDAP-16709-manual-broadcast branch from e5a66dc to 9064e40 Compare May 28, 2020 20:28
Honor the broadcast flag set in the JoinDefinition when joining
multiple DataFrames. Added a small tweak to the join logic to
join all non-broadcasted datasets first in order to ensure that
both sides of the join are not broadcast, and to reduce the amount
of data that is being shuffled in non-broadcast intermediate joins.
@albertshau albertshau force-pushed the feature_release/CDAP-16709-manual-broadcast branch from 9064e40 to 65d9edc Compare May 28, 2020 21:17
@albertshau albertshau merged commit 4807776 into release/6.1 May 28, 2020
@albertshau albertshau deleted the feature_release/CDAP-16709-manual-broadcast branch May 28, 2020 23:56
@albertshau albertshau added the 6.1 label Aug 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants