[HUDI-5394] Fix tests for RowCustomColumnsSortPartitioner#8741
[HUDI-5394] Fix tests for RowCustomColumnsSortPartitioner#8741xushiyan merged 1 commit intoapache:masterfrom
Conversation
| boolean populateMetaFields) { | ||
| Dataset<Row> records1 = generateTestRecords(); | ||
| Dataset<Row> records2 = generateTestRecords(); | ||
| Dataset<Row> records = generateTestRecords(); |
There was a problem hiding this comment.
not sure why the existing test case runs the same logic twice with records1 and records2. @boneanxs any thoughts?
There was a problem hiding this comment.
testCustomColumnSortPartitionerWithRows was copied from testBulkInsertInternalPartitioner. And I looked org.apache.hudi.execution.bulkinsert.TestBulkInsertInternalPartitioner#testBulkInsertInternalPartitioner:177, it actually generates two records sets with different union times:
JavaRDD<HoodieRecord> records1 = generateTestRecordsForBulkInsert(jsc);
JavaRDD<HoodieRecord> records2 = generateTripleTestRecordsForBulkInsert(jsc);So I think this should be a mistake, and I think union it twice should be enough(Here different union times for different partitions?)
There was a problem hiding this comment.
There is no change for any codes in the write path, so why the tests run successfully for Spark 3.1 or 2.4 ?
There was a problem hiding this comment.
the test only passes spark 2.4 which is an coincident. The existing test logic asserts 2 rdd partitions after re-partition by the partitioner. with spark 2.4's sort and coalesce, it gives 2 and passes the test as a local partitioner. The correct expectation is the partitioner is doing global sort and the resulting num partition should be 2 or less, which is what spark 3 gives us.
| records2, true, false, true, generateExpectedPartitionNumRecords(records2), Option.of(comparator), true); | ||
| records, true, true, true, generateExpectedPartitionNumRecords(records), Option.of(comparator), true); |
There was a problem hiding this comment.
the existing test treated the partitioner as non-global and hence failed the test scenario under spark 3.2
boneanxs
left a comment
There was a problem hiding this comment.
Sorry for the mistake. +1 for this
| boolean populateMetaFields) { | ||
| Dataset<Row> records1 = generateTestRecords(); | ||
| Dataset<Row> records2 = generateTestRecords(); | ||
| Dataset<Row> records = generateTestRecords(); |
There was a problem hiding this comment.
testCustomColumnSortPartitionerWithRows was copied from testBulkInsertInternalPartitioner. And I looked org.apache.hudi.execution.bulkinsert.TestBulkInsertInternalPartitioner#testBulkInsertInternalPartitioner:177, it actually generates two records sets with different union times:
JavaRDD<HoodieRecord> records1 = generateTestRecordsForBulkInsert(jsc);
JavaRDD<HoodieRecord> records2 = generateTripleTestRecordsForBulkInsert(jsc);So I think this should be a mistake, and I think union it twice should be enough(Here different union times for different partitions?)
Change Logs
Fix test to use
RowCustomColumnsSortPartitioneras global sort partitioner.This blocks #8445
Impact
NA
Risk level
Low.
Documentation Update
NA
Contributor's checklist