
Support data preprocessing in Spark framework #7299

Open
jackjlli wants to merge 1 commit into master from support-spark-preprocessing

Conversation

@jackjlli (Member) commented Aug 13, 2021

Description

Currently the data preprocessing Hadoop job lives in the pinot-hadoop module, and there is no equivalent Spark job. In order to reuse the common data preprocessing logic (e.g. support for data in AVRO and ORC formats) across both the MR and Spark frameworks, the shared code is moved from pinot-hadoop to the pinot-ingestion-common module.

This PR:

  • adds data preprocessing support for Spark.
  • refactors some code from pinot-hadoop into pinot-ingestion-common so that it can be reused by both the MR and Spark jobs.
  • supports both AVRO and ORC formats in both the MR and Spark jobs.

Upgrade Notes

Does this PR prevent a zero-downtime upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

  • Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Documentation

@codecov-commenter commented Aug 13, 2021

Codecov Report

Merging #7299 (cf0d18c) into master (ce2c367) will decrease coverage by 31.08%.
The diff coverage is 0.00%.

❗ Current head cf0d18c differs from pull request most recent head 475542c. Consider uploading reports for the commit 475542c to get more accurate results
Impacted file tree graph

@@              Coverage Diff              @@
##             master    #7299       +/-   ##
=============================================
- Coverage     70.39%   39.30%   -31.09%     
+ Complexity     3299       92     -3207     
=============================================
  Files          1508     1526       +18     
  Lines         74754    75359      +605     
  Branches      10846    10951      +105     
=============================================
- Hits          52621    29618    -23003     
- Misses        18508    43482    +24974     
+ Partials       3625     2259     -1366     
Flag          Coverage Δ
integration1  30.72% <ø> (?)
integration2  29.15% <ø> (-0.03%) ⬇️
unittests1    ?
unittests2    14.37% <0.00%> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
.../pinot/ingestion/jobs/SegmentPreprocessingJob.java 0.00% <0.00%> (ø)
...estion/preprocess/AvroDataPreprocessingHelper.java 0.00% <0.00%> (ø)
.../ingestion/preprocess/DataPreprocessingHelper.java 0.00% <0.00%> (ø)
...ion/preprocess/DataPreprocessingHelperFactory.java 0.00% <ø> (ø)
...gestion/preprocess/OrcDataPreprocessingHelper.java 0.00% <0.00%> (ø)
...reprocess/mappers/AvroDataPreprocessingMapper.java 0.00% <ø> (ø)
...preprocess/mappers/OrcDataPreprocessingMapper.java 0.00% <ø> (ø)
...preprocess/mappers/SegmentPreprocessingMapper.java 0.00% <ø> (ø)
...partitioners/AvroDataPreprocessingPartitioner.java 0.00% <ø> (ø)
...on/preprocess/partitioners/GenericPartitioner.java 0.00% <ø> (ø)
... and 899 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce2c367...475542c. Read the comment docs.

@kishoreg (Member)

is this needed only for v0_deprecated/spark? Can you please add more to the description on what is the change and why it's needed.

@jackjlli (Member, Author)

is this needed only for v0_deprecated/spark? Can you please add more to the description on what is the change and why it's needed.

I've updated the description of the PR. Basically, this PR adds a data preprocessing Spark job to Pinot. Since some logic can be reused by both the MR and Spark jobs, I refactored some code from the pinot-hadoop module into the pinot-ingestion-common module. At LinkedIn, this data preprocessing job has already been used along with the PinotBuildAndPushJob from the v0_deprecated module.

@jackjlli force-pushed the support-spark-preprocessing branch from 3bdd50c to aa0792c on August 18, 2021 17:47
// Repartition and sort within partitions.
Comparator<Object> comparator = new SparkDataPreprocessingComparator();
JavaPairRDD<Object, Row> partitionedSortedPairRDD =
    pairRDD.repartitionAndSortWithinPartitions(sparkPartitioner, comparator);
Contributor

Sometimes using coalesce() (then sortWithinPartitions()) can be more efficient than repartition(). Is it feasible to allow coalesce()?

Member, Author

Good question! The coalesce() method uses its default partition function, which is not murmur2, and there is no way to specify a custom partition function for it. That's why repartitionAndSortWithinPartitions is used here: it lets us plug in our own custom partition function.
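To illustrate the reply above: repartitionAndSortWithinPartitions accepts a custom Spark Partitioner, which is what lets the job keep murmur2-based partition assignment consistent with the MR job, while coalesce() offers no such hook. The following is a hedged, plain-Java sketch of what the partition-id computation behind such a partitioner might look like; the class and method names are hypothetical, and the murmur2 implementation shown is the widely used Kafka-style variant, assumed here for illustration rather than taken from the PR:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps a partition-column value to a partition id with a
// murmur2-style hash, mirroring how a custom Spark Partitioner could keep the
// Spark job's partition layout consistent with the MR job's.
class Murmur2PartitionFunction {

  // Kafka-style murmur2 hash over a byte array.
  static int murmur2(byte[] data) {
    int length = data.length;
    int seed = 0x9747b28c;
    final int m = 0x5bd1e995;
    final int r = 24;
    int h = seed ^ length;
    int length4 = length / 4;
    for (int i = 0; i < length4; i++) {
      final int i4 = i * 4;
      int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
          + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
      k *= m;
      k ^= k >>> r;
      k *= m;
      h *= m;
      h ^= k;
    }
    // Handle the trailing 1-3 bytes (intentional switch fall-through).
    switch (length % 4) {
      case 3:
        h ^= (data[(length & ~3) + 2] & 0xff) << 16;
      case 2:
        h ^= (data[(length & ~3) + 1] & 0xff) << 8;
      case 1:
        h ^= data[length & ~3] & 0xff;
        h *= m;
    }
    h ^= h >>> 13;
    h *= m;
    h ^= h >>> 15;
    return h;
  }

  // Deterministically maps a key to the range [0, numPartitions).
  static int partition(String key, int numPartitions) {
    byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
    return (murmur2(bytes) & 0x7fffffff) % numPartitions;
  }
}
```

A custom Partitioner's getPartition(key) would delegate to a function like this; with coalesce() the records would instead land wherever the default partitioning puts them, which may disagree with the MR job's murmur2 layout.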


@Override
public int getPartition(Object key) {
  SparkDataPreprocessingJobKey jobKey = (SparkDataPreprocessingJobKey) key;
Contributor

Why not just call generatePartitionId() in this method?

Member, Author

The key of the KV pair is used not only for partitioning but also for sorting. It's better to precompute the partition index once, in order to minimize the footprint of the key, since the partition column can be of any data type (e.g. string).
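The reply above describes a composite key that carries a precomputed partition index alongside the sort value, so the partitioner never re-hashes a potentially large partition-column value. As a hedged sketch (the field names, the long sort value, and the comparator are assumptions for illustration; the PR's actual SparkDataPreprocessingJobKey may differ), such a key might look like:

```java
import java.util.Comparator;

// Hypothetical sketch of a composite key for repartitionAndSortWithinPartitions:
// the partition id is computed once up front (e.g. via murmur2 on the partition
// column) and cached, keeping the key small even when the partition column is a
// string. The PR's actual SparkDataPreprocessingJobKey may differ.
class SparkDataPreprocessingJobKey {
  private final int _partitionId;         // precomputed partition index
  private final long _sortedColumnValue;  // value used for in-partition sorting

  SparkDataPreprocessingJobKey(int partitionId, long sortedColumnValue) {
    _partitionId = partitionId;
    _sortedColumnValue = sortedColumnValue;
  }

  int getPartitionId() {
    // The partitioner's getPartition(key) can simply return this cached id
    // instead of re-hashing the partition-column value per record.
    return _partitionId;
  }

  // Orders keys by partition first, then by the sorted column within a partition.
  static final Comparator<Object> COMPARATOR = (a, b) -> {
    SparkDataPreprocessingJobKey ka = (SparkDataPreprocessingJobKey) a;
    SparkDataPreprocessingJobKey kb = (SparkDataPreprocessingJobKey) b;
    int cmp = Integer.compare(ka._partitionId, kb._partitionId);
    return cmp != 0 ? cmp : Long.compare(ka._sortedColumnValue, kb._sortedColumnValue);
  };
}
```

With a key shaped like this, getPartition can stay a cheap field read while the comparator still has everything it needs for the sort.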

@xiangfu0 (Contributor)

Shall we consider using the pinot-ingestion-spark module?

@jackjlli (Member, Author)

Shall we consider using the pinot-ingestion-spark module?

Good question! I also thought about putting it in the pinot-ingestion-spark module, but this data preprocessing Spark job uses common code that is also used by the preprocessing MR job in the pinot-hadoop module. That's why I put it under the same 'v0_deprecated/' directory for now. The core logic is simple and can easily be moved to the pinot-ingestion-spark module later on.

@jackjlli force-pushed the support-spark-preprocessing branch 2 times, most recently from f9204f7 to 475542c on August 23, 2021 22:37
@jackjlli force-pushed the support-spark-preprocessing branch from 475542c to e2e92fc on November 17, 2021 00:22

Labels: none yet

5 participants