[HUDI-5023] Switching default Write Executor type to SIMPLE#7476
[HUDI-5023] Switching default Write Executor type to SIMPLE#7476alexeykudinkin merged 12 commits intoapache:masterfrom
SIMPLE#7476Conversation
|
@hudi-bot run azure |
SIMPLESIMPLE
|
@hudi-bot run azure |
| public Builder withWriteBufferSize(int size) { | ||
| writeConfig.setValue(WRITE_DISRUPTOR_BUFFER_SIZE, String.valueOf(size)); | ||
| writeConfig.setValue(WRITE_EXECUTOR_DISRUPTOR_BUFFER_SIZE, String.valueOf(size)); | ||
| return this; | ||
| } |
There was a problem hiding this comment.
nit: If this is not intended to use as a public API, could you remove this and only keep Builder#withWriteExecutorDisruptorWriteBufferSize which has the same functionality?
| } | ||
|
|
||
| @VisibleForTesting | ||
| public HoodieWriteConfig build(boolean shouldValidate) { |
There was a problem hiding this comment.
should this method use the default scope/visibility instead of public?
There was a problem hiding this comment.
Problem is that it's being used in tests in different packages
There was a problem hiding this comment.
OK. We should think about how to hide this, so it is not used by users accidentally.
| private final String instantTime = HoodieActiveTimeline.createNewInstantTime(); | ||
|
|
||
|
|
||
| private final HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder() |
There was a problem hiding this comment.
nit: (not required in this PR) the tests using different types of write executors should be generalized in a base class, like TestWriteMarkersBase, instead of duplicating the code. This can be refactored later on.
There was a problem hiding this comment.
Agreed. Created a task to follow-up HUDI-5622
|
|
||
| try { | ||
| exec = new SimpleHoodieExecutor(hoodieRecords.iterator(), consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| exec = new SimpleExecutor(hoodieRecords.iterator(), consumer, Function.identity()); |
There was a problem hiding this comment.
Still use getTransformer instead of Function.identity() for the tests?
There was a problem hiding this comment.
We actually don't need to clone in that case
|
@alexeykudinkin are we claiming that there is no downside to this? esp on large file size? does this still provide same performance. Love to understand benchmarks |
4c401fc to
5935539
Compare
5935539 to
9956563
Compare
| .withDocumentation("Size of in-memory buffer used for parallelizing network reads and lake storage writes."); | ||
|
|
||
| public static final ConfigProperty<String> WRITE_DISRUPTOR_BUFFER_SIZE = ConfigProperty | ||
| public static final ConfigProperty<String> WRITE_EXECUTOR_DISRUPTOR_BUFFER_SIZE = ConfigProperty |
There was a problem hiding this comment.
(Let's take this in a separate PR) What is the unit of this config (B, KB, or MB)? We should mention the unit in the config naming.
There was a problem hiding this comment.
Discussed offline: will follow-up w/ separate PR addressing this
| } | ||
|
|
||
| @VisibleForTesting | ||
| public HoodieWriteConfig build(boolean shouldValidate) { |
There was a problem hiding this comment.
OK. We should think about how to hide this, so it is not used by users accidentally.
| exec = new DisruptorExecutor(hoodieWriteConfig.getDisruptorWriteBufferSize(), hoodieRecords.iterator(), consumer, | ||
| getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA), Option.of(WaitStrategyFactory.DEFAULT_STRATEGY), getPreExecuteRunnable()); | ||
| exec = new DisruptorExecutor<>(writeConfig.getWriteExecutorDisruptorWriteBufferSize(), hoodieRecords.iterator(), consumer, | ||
| Function.identity(), WaitStrategyFactory.DEFAULT_STRATEGY, getPreExecuteRunnable()); |
There was a problem hiding this comment.
Should this also use getTransformer(HoodieTestDataGenerator.AVRO_SCHEMA, writeConfig)?
There was a problem hiding this comment.
There's no reason to over-complicate this test -- it's goal is to test the Executor not how it's being used
| getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA), Option.of(WaitStrategyFactory.DEFAULT_STRATEGY), getPreExecuteRunnable()); | ||
| DisruptorExecutor<HoodieRecord, HoodieRecord, Integer> | ||
| executor = new DisruptorExecutor<>(1024, hoodieRecords.iterator(), consumer, | ||
| Function.identity(), WaitStrategyFactory.DEFAULT_STRATEGY, getPreExecuteRunnable()); |
|
|
||
| try { | ||
| exec = new SimpleHoodieExecutor(hoodieRecords.iterator(), consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| exec = new SimpleExecutor<>(hoodieRecords.iterator(), consumer, Function.identity()); |
There was a problem hiding this comment.
Similar here for getTransformer. The production code passes the transfer in by calling getTransformer so let's follow the same in the test code.
|
|
||
| try { | ||
| exec = new SimpleHoodieExecutor(hoodieRecords.iterator(), consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| exec = new SimpleExecutor<>(hoodieRecords.iterator(), consumer, Function.identity()); |
| SimpleHoodieExecutor<HoodieRecord, Tuple2<HoodieRecord, Option<IndexedRecord>>, Integer> exec = | ||
| new SimpleHoodieExecutor(iterator, consumer, getCloningTransformer(HoodieTestDataGenerator.AVRO_SCHEMA)); | ||
| SimpleExecutor<HoodieRecord, HoodieRecord, Integer> exec = | ||
| new SimpleExecutor<>(iterator, consumer, Function.identity()); |
|
@alexeykudinkin let's make sure we cover different scenarios (including the different base file sizes) in the benchmarking, and that there is no regression in common use cases before landing this PR. |
This change switches default Write Executor to be SIMPLE ie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one). This should considerably trim down on Runtime (compared to BIMQ) Compute wasted (compared to BIMQ, Disruptor) Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
…#7476) This change switches default Write Executor to be SIMPLE ie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one). This should considerably trim down on Runtime (compared to BIMQ) Compute wasted (compared to BIMQ, Disruptor) Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
…#7476) This change switches default Write Executor to be SIMPLE ie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one). This should considerably trim down on Runtime (compared to BIMQ) Compute wasted (compared to BIMQ, Disruptor) Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
Change Logs
This change switches default Write Executor to be
SIMPLEie one bypassing reliance on any kind of Queue (either BoundedInMemory or Disruptor's one).This should considerably trim down on
Since it eliminates unnecessary intermediary "staging" of the records in the queue (for ex, in Spark such in-memory enqueueing occurs at the ingress points, ie shuffling), and allows to handle records writing in one pass (even avoiding making copies of the records in the future)
Impact
Users w/ upsert-/insert-heavy payloads should see considerable boost in writing performance.
Risk level (write none, low medium or high below)
Low
Documentation Update
N/A
Contributor's checklist