
CDAP-16709 batch spark auto-join implementation #12201

Merged

Conversation

albertshau
Contributor

Implemented auto join for batch spark pipelines.

Added a join method to SparkCollection that takes in the list of
other SparkCollections that it should be joined to.
RDDCollection converts RDDs into Datasets and uses the Dataset
join method to implement the join. This allows Spark to broadcast
small datasets automatically, and to use sort merge join instead
of shuffle hash join, which has better memory characteristics.

As part of this, added a separate RDDCollection implementation for
Spark1 and Spark2, since the Spark API for joins is not compatible.
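
A rough sketch of the described approach (Spark 2 flavor; not the PR's actual code), assuming each input arrives as a JavaRDD<Row> with a known schema and a single-column join key on each side:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.StructType;

class DatasetJoinSketch {
  // Convert both sides to Datasets and let Spark choose the physical join:
  // broadcast for small inputs, sort merge join otherwise.
  static Dataset<Row> join(SQLContext sqlContext,
                           JavaRDD<Row> leftRdd, StructType leftSchema,
                           JavaRDD<Row> rightRdd, StructType rightSchema,
                           String leftKey, String rightKey, boolean rightRequired) {
    Dataset<Row> left = sqlContext.createDataFrame(leftRdd, leftSchema);
    Dataset<Row> right = sqlContext.createDataFrame(rightRdd, rightSchema);
    Column condition = left.col(leftKey).equalTo(right.col(rightKey));
    // inner join when the other side is required, left outer when it is optional
    return left.join(right, condition, rightRequired ? "inner" : "leftouter");
  }
}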

@albertshau albertshau force-pushed the feature_release/CDAP-16709-implement-auto-join branch from cb8e631 to 57d102b Compare May 21, 2020 21:47
@@ -89,7 +90,7 @@ public PipelinePhasePreparer(PluginContext pluginContext, Metrics metrics, Macro
boolean isConnectorSink =
Constants.Connector.PLUGIN_TYPE.equals(pluginType) && phase.getSinks().contains(stageName);

SubmitterPlugin submitterPlugin;
SubmitterPlugin submitterPlugin = null;
Contributor

So the submitterPlugin can be null after the following if checks? Is that being handled already? I don't think the logic can handle this value being null. It would probably be better to have an else that throws if the joiner plugin is not one of the supported classes.
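
Something along these lines, for example (the supported-type check and helper are placeholders, not the PR's actual branches):

SubmitterPlugin submitterPlugin;
if (isSupportedPluginType(pluginType)) {                // hypothetical check
  submitterPlugin = createSubmitterPlugin(stageName);   // hypothetical helper
} else {
  // fail fast instead of leaving submitterPlugin null
  throw new IllegalStateException(
    String.format("Stage '%s' has unsupported plugin type '%s'", stageName, pluginType));
}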


// join required stages first to cut down the data as much as possible
List<JoinStage> requiredStages = joinDefinition.getStages().stream()
.filter(s -> s.isRequired())
Contributor

Use a method reference: .filter(JoinStage::isRequired)

.collect(Collectors.toList());
List<JoinStage> orderedStages = new ArrayList<>(requiredStages.size() + optionalStages.size());
orderedStages.addAll(requiredStages);
orderedStages.addAll(optionalStages);
Contributor

This list can be obtained by sorting the list directly.

List<JoinStage> orderedStages = new ArrayList<>(joinDefinition.getStages());
orderedStages.sort((s1, s2) -> s1.isRequired() ? (s2.isRequired() ? 0 : -1) : (s2.isRequired() ? 1 : 0));
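
An equivalent alternative (a sketch using java.util.Comparator; required stages sort first because false orders before true):

List<JoinStage> orderedStages = new ArrayList<>(joinDefinition.getStages());
orderedStages.sort(Comparator.comparing((JoinStage s) -> !s.isRequired()));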

}
JoinCondition.OnKeys onKeys = (JoinCondition.OnKeys) condition;
Map<String, List<String>> stageKeys = onKeys.getKeys().stream()
.collect(Collectors.toMap(j -> j.getStageName(), j -> j.getFields()));
Contributor

Use method references instead of lambdas: toMap(JoinKey::getStageName, JoinKey::getFields)
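
Applied to the snippet above, that would read:

Map<String, List<String>> stageKeys = onKeys.getKeys().stream()
  .collect(Collectors.toMap(JoinKey::getStageName, JoinKey::getFields));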

JoinCondition.OnKeys onKeys = (JoinCondition.OnKeys) condition;
Map<String, List<String>> stageKeys = onKeys.getKeys().stream()
.collect(Collectors.toMap(j -> j.getStageName(), j -> j.getFields()));

Contributor

nit - extra line

@@ -378,6 +356,123 @@ public void runPipeline(PipelinePhase pipelinePhase, String sourcePluginType,
}
}

private SparkCollection<Object> handleAutoJoin(AutoJoiner autoJoiner, AutoJoinerContext autoJoinerContext,
Contributor

This method and the handleJoin method contain non-trivial logic; it would be better to add some javadoc describing the logic.
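
For example, something along these lines (the wording is only a suggestion, not the PR's text):

/**
 * Handles a stage that implements AutoJoiner: asks the plugin for its JoinDefinition,
 * orders the stages to join, and performs the join on the corresponding SparkCollections,
 * joining required stages before optional ones.
 */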

List<JoinStage> optionalStages = joinDefinition.getStages().stream()
.filter(s -> !s.isRequired())
.collect(Collectors.toList());
List<JoinStage> orderedStages = new ArrayList<>(requiredStages.size() + optionalStages.size());
Contributor

Can you explain a bit why we want ordered stages? From the logic below, we are basically going through the list one by one and then determining the join type based on whether the left or right side is required. Why do we want to loop through the required stages first?

Contributor Author

There's a comment above, but I can expand on it. If there is an outer join and an inner join, it is more performant to do the inner join first because it will generate less intermediate data.
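
As a hypothetical illustration with Datasets a and b from required stages, c from an optional stage, and "id" as the join key:

// inner join the required inputs first so the data shrinks early
Dataset<Row> reduced = a.join(b, a.col("id").equalTo(b.col("id")), "inner");
// the left outer join against the optional input then runs on the smaller result
Dataset<Row> result = reduced.join(c, a.col("id").equalTo(c.col("id")), "leftouter");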

Contributor Author
@albertshau albertshau May 26, 2020

Though now that I think about it more, it may be better not to do this and just follow the order provided by the plugin. We can add this back in if it helps once we experiment a bit more with uneven joins.

private final SparkCollection<?> data;
private final Schema schema;
private final List<String> key;
private final boolean broadcast;
Contributor

Is this used anywhere? The getter doesn't seem to get called. Also, it would be good to add a comment on what this means.

Contributor Author

Not currently; I was going to implement it later.

*
* @param <T> type of object in the collection
*/
public class RDDCollection<T> extends BaseRDDCollection<T> {
Contributor

I feel it would be good to add a comment on why we need different implementations for Spark 1 and Spark 2. It sometimes takes me a while to understand the difference. Some javadoc would help explain the compatibility and implementation differences between Spark 1 and Spark 2.


@yaojiefeng yaojiefeng left a comment

One minor comment, LGTM

import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

import javax.annotation.Nullable;


/**
* Implementation of {@link SparkCollection} that is backed by a JavaRDD.
* Implementation of {@link SparkCollection} that is backed by a JavaRDD. Spark1 and Spark2 implementations need to be
* separate because DataFrames are not compatible between Spark1 and Spark2. In Spark2, DataFrame is a just a
Contributor

is a just -> is just

@albertshau albertshau force-pushed the feature_release/CDAP-16709-implement-auto-join branch from b7f16f4 to e31279c Compare May 27, 2020 00:47
@albertshau albertshau force-pushed the feature_release/CDAP-16709-implement-auto-join branch from e31279c to d50f0ea Compare May 27, 2020 03:14
@albertshau albertshau merged commit 3e1901c into release/6.1 May 27, 2020
@albertshau albertshau deleted the feature_release/CDAP-16709-implement-auto-join branch May 27, 2020 16:21
@albertshau albertshau added the 6.1 label Aug 21, 2020