Support data preprocessing in Spark framework#7299
Conversation
Codecov Report
```
@@             Coverage Diff              @@
##            master    #7299      +/-   ##
=============================================
- Coverage     70.39%   39.30%    -31.09%
+ Complexity     3299       92      -3207
=============================================
  Files          1508     1526        +18
  Lines        74754    75359       +605
  Branches     10846    10951       +105
=============================================
- Hits         52621    29618     -23003
- Misses       18508    43482     +24974
+ Partials      3625     2259      -1366
```
Is this needed only for v0_deprecated/spark? Can you please add more detail to the description on what the change is and why it's needed?
I've updated the PR description. Basically, this PR adds the data preprocessing Spark job to Pinot. Since some of the logic can be reused by both the MR and Spark jobs, I refactored the common code from the pinot-hadoop module into the pinot-ingestion-common module. At LinkedIn, this data preprocessing job is already used along with the PinotBuildAndPushJob from the v0_deprecated module.
```java
// Repartition and sort within partitions.
Comparator<Object> comparator = new SparkDataPreprocessingComparator();
JavaPairRDD<Object, Row> partitionedSortedPairRDD =
    pairRDD.repartitionAndSortWithinPartitions(sparkPartitioner, comparator);
```
Sometimes using coalesce() (followed by sortWithinPartitions()) can be more efficient than repartition(). Is it feasible to allow coalesce()?
Good question! coalesce() uses Spark's default partition function, which is not murmur2, and there is no way to plug in a custom one. That's why repartitionAndSortWithinPartitions is used here: it lets us supply our own partition function.
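To make the distinction concrete, here is a framework-free sketch of what a pluggable partitioner buys you. The class and method names below are illustrative (not Pinot's or Spark's actual code); the point is that repartitionAndSortWithinPartitions accepts a partitioner object whose hash function the caller chooses (e.g. murmur2, to stay consistent with Pinot's segment partitioning), whereas coalesce() cannot take one.

```java
import java.util.function.ToIntFunction;

// Hypothetical sketch: a partitioner that applies a caller-supplied hash
// function, mirroring the role of org.apache.spark.Partitioner.
class CustomHashPartitioner {
  private final int numPartitions;
  private final ToIntFunction<Object> hashFn; // e.g. a murmur2 implementation

  CustomHashPartitioner(int numPartitions, ToIntFunction<Object> hashFn) {
    this.numPartitions = numPartitions;
    this.hashFn = hashFn;
  }

  int numPartitions() {
    return numPartitions;
  }

  // Map a key to a partition index using the supplied hash; floorMod keeps
  // the index non-negative even for negative hash values.
  int getPartition(Object key) {
    return Math.floorMod(hashFn.applyAsInt(key), numPartitions);
  }
}
```

With coalesce(), by contrast, records are merged into fewer partitions without re-hashing, so there is no hook for a custom function like the one above.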
```java
@Override
public int getPartition(Object key) {
  SparkDataPreprocessingJobKey jobKey = (SparkDataPreprocessingJobKey) key;
```
Why not just call generatePartitionId() in this method?
The key of the KV pair is used not only for partitioning but also for sorting. It's better to generate the partition index up front in order to minimize the footprint of the key, since the partition column can be of any data type (e.g. string).
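A minimal sketch of that design, with hypothetical names (not Pinot's actual classes): the partition id is computed once when the key is built, so the shuffle carries a small int per record regardless of the partition column's type, and getPartition() is a cheap field lookup rather than a repeated call to the hash/id generator.

```java
// Hypothetical key: carries a precomputed partition id for routing plus the
// sorted-column value consulted only by the in-partition sort comparator.
class PreprocessingKey {
  private final int partitionId;          // generated once, upstream
  private final Object sortedColumnValue; // any type, e.g. String

  PreprocessingKey(int partitionId, Object sortedColumnValue) {
    this.partitionId = partitionId;
    this.sortedColumnValue = sortedColumnValue;
  }

  int getPartitionId() {
    return partitionId;
  }

  Object getSortedColumnValue() {
    return sortedColumnValue;
  }
}

class PreprocessingPartitioner {
  // No hashing here: the id was computed when the key was built, which is
  // why getPartition() does not need to call an id generator again.
  int getPartition(Object key) {
    return ((PreprocessingKey) key).getPartitionId();
  }
}
```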
Shall we consider using the pinot-ingestion-spark module?
Good question! I also thought about putting it in the pinot-ingestion-spark module, but this Spark data preprocessing job shares common code with the preprocessing MR job in the pinot-hadoop module. That's why I put it under the same 'v0_deprecated/' directory for now. The core logic is simple and can easily be moved to the pinot-ingestion-spark module later on.
Description
Currently the data preprocessing Hadoop job lives in the pinot-hadoop module, and there is no data preprocessing Spark job. In order to reuse some of the preprocessing logic (e.g. support for data in AVRO and ORC formats) across both the MR and Spark frameworks, the common code is moved from pinot-hadoop to the pinot-ingestion-common module.
This PR:
Upgrade Notes
Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion) (tag as backward-incompat, and complete the section below on Release Notes)
Does this PR fix a zero-downtime upgrade introduced earlier? (tag as backward-incompat, and complete the section below on Release Notes)
Does this PR otherwise need attention when creating release notes? (tag as release-notes and complete the section on Release Notes)
Release Notes
Documentation