[HUDI-4329] Add separate control for Flink compaction operation sync/async mode #5991
chenshzh wants to merge 1 commit into apache:master
@danny0405 If it's convenient, please also take a look at this. It has been open for some time. Thanks!
this.conf = conf;
this.asyncCompaction = OptionsResolver.needsAsyncCompaction(conf);
this.asyncCompactionOperation = OptionsResolver.needsAsyncCompactionOperation(conf);
}
Why switch to a new option key? What is the purpose here?
We already support sync compaction for bounded sources. Shouldn't async compaction be fine for streaming sources?
Normally we use compaction.async.enabled to turn on compaction. But then we cannot make the compaction run in sync mode, because the option is already true.
if (asyncCompaction) {
// executes the compaction task asynchronously so that checkpoint barrier propagation is not blocked.
executor.execute(
() -> doCompaction(instantTime, compactionOperation, collector, reloadWriteConfig()),
(errMsg, t) -> collector.collect(new CompactionCommitEvent(instantTime, compactionOperation.getFileId(), taskID)),
"Execute compaction for instant %s from task %d", instantTime, taskID);
} else {
// executes the compaction task synchronously for batch mode.
LOG.info("Execute compaction for instant {} from task {}", instantTime, taskID);
doCompaction(instantTime, compactionOperation, collector, writeClient.getConfig());
}
support sync compaction for bounded source
We will use sync compaction mode for unbounded sources in some scenarios. And actually, the sync compaction for bounded sources seems weird: it uses compaction.async.enabled = true to turn on compaction, and then switches it to false for sync mode.
// compaction
if (OptionsResolver.needsAsyncCompaction(conf)) { // here FlinkOptions.COMPACTION_ASYNC_ENABLED decides that we need compaction
// use synchronous compaction for bounded source.
if (context.isBounded()) {
conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false); // we come here because it is true, and it's so weird to turn it false, actually we just want the operation to be executed sync.
}
return Pipelines.compact(conf, pipeline);
} else {
return Pipelines.clean(conf, pipeline);
}
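To illustrate the point above, here is a minimal, self-contained sketch and not the actual Hudi code. It assumes a plain `Map<String, Boolean>` as a stand-in for the Flink `Configuration`, strings as stand-ins for the pipelines, and a default of `true` for both options; the class and method names are hypothetical. The two option keys are the ones discussed in this PR.

```java
import java.util.HashMap;
import java.util.Map;

public class CompactionPipelineSketch {
    // Option keys discussed in this PR; defaults of `true` are an assumption of this sketch.
    static final String COMPACTION_ASYNC_ENABLED = "compaction.async.enabled";
    static final String COMPACTION_OPERATION_ASYNC_ENABLED = "compaction.operation.async.enabled";

    /** Mirrors the existing logic: one flag decides both "compact at all" and "run async". */
    public static String currentPlan(Map<String, Boolean> conf, boolean bounded) {
        if (!conf.getOrDefault(COMPACTION_ASYNC_ENABLED, true)) {
            // Turning the flag off removes the compact operators entirely.
            return "clean";
        }
        if (bounded) {
            // The flag that enabled compaction is flipped to request sync execution.
            conf.put(COMPACTION_ASYNC_ENABLED, false);
        }
        return conf.getOrDefault(COMPACTION_ASYNC_ENABLED, true) ? "compact(async)" : "compact(sync)";
    }

    /** Sketch of the separation: one flag enables compaction, another picks the execution mode. */
    public static String proposedPlan(Map<String, Boolean> conf, boolean bounded) {
        if (!conf.getOrDefault(COMPACTION_ASYNC_ENABLED, true)) {
            return "clean";
        }
        boolean asyncOperation = !bounded && conf.getOrDefault(COMPACTION_OPERATION_ASYNC_ENABLED, true);
        return asyncOperation ? "compact(async)" : "compact(sync)";
    }

    public static void main(String[] args) {
        Map<String, Boolean> wantsSync = new HashMap<>();
        wantsSync.put(COMPACTION_ASYNC_ENABLED, false);
        // An unbounded source that asks for "not async" loses compaction altogether.
        System.out.println(currentPlan(wantsSync, false));

        Map<String, Boolean> separate = new HashMap<>();
        separate.put(COMPACTION_OPERATION_ASYNC_ENABLED, false);
        // With a separate switch the same source keeps compaction and runs it synchronously.
        System.out.println(proposedPlan(separate, false));
    }
}
```

In this sketch the bounded case no longer needs to mutate the option that enabled compaction in the first place.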
@danny0405 could you please check whether there are any more problems here?
And actually, the sync compaction for bounded sources seems weird: it uses compaction.async.enabled = true to turn on compaction, and then switches it to false for sync mode.
I agree the logic here is a little weird; we can refactor it, though.
Sync compaction for bounded sources is meaningful, especially when people do batch ingestion into a MOR table and do not want a separate compaction job running again.
keep sync compaction as an option for unbounded source.
And for Hudi as a whole, you might also agree that users should be given a sync compaction option for unbounded sources, regardless of the scenarios mentioned above?
@danny0405 do we have any other questions here?
and at the same time we use an async thread to execute the compaction and collect the resulting compaction messages, so the output collector becomes thread unsafe
The output collector is thread unsafe here; let's fix it to be thread safe then ~
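For illustration only, one way a collector could be guarded is a locking wrapper. The sketch below is an assumption, not the Hudi or Flink implementation: `Collector` here is a hypothetical one-method stand-in for Flink's interface, and `SynchronizedCollector` is a made-up name. It only helps if every caller goes through the same wrapper; emissions made by the task thread itself are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

public class SynchronizedCollectorSketch {
    /** Hypothetical stand-in for Flink's Collector interface. */
    interface Collector<T> {
        void collect(T element);
    }

    /** Serializes all collect() calls on a single lock so the delegate sees one caller at a time. */
    static final class SynchronizedCollector<T> implements Collector<T> {
        private final Collector<T> delegate;
        private final Object lock = new Object();

        SynchronizedCollector(Collector<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public void collect(T element) {
            synchronized (lock) {
                delegate.collect(element);
            }
        }
    }

    /** Emits from several threads into a non-thread-safe list through the wrapper. */
    public static int collectConcurrently(int threads, int perThread) {
        List<Integer> sink = new ArrayList<>(); // deliberately not thread safe
        Collector<Integer> safe = new SynchronizedCollector<Integer>(element -> sink.add(element));
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    safe.collect(i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sink.size();
    }

    public static void main(String[] args) {
        System.out.println(collectConcurrently(4, 1000));
    }
}
```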
OK, but that will be a separate issue, I think.
What about this PR? We have already discussed at length the need to keep a sync compaction option for unbounded sources.
The watermark conflicts issue has been addressed in #8379, while I still deem this PR as valid. The suggestions:
- rename compaction.async.enabled to compaction.enabled
- store the isBounded option in the conf to switch to sync compaction for bounded source.
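A rough sketch of how the two suggestions could fit together, under stated assumptions: the `Map<String, Boolean>` stands in for the Flink `Configuration`, `compaction.enabled` is the rename proposed above with an assumed default of `true`, and the key `sketch.source.bounded` as well as all method names are hypothetical placeholders rather than real Hudi options.

```java
import java.util.HashMap;
import java.util.Map;

public class BoundedAwareCompactionSketch {
    // Proposed rename from the review; the default of `true` is an assumption.
    static final String COMPACTION_ENABLED = "compaction.enabled";
    // Hypothetical key used only in this sketch to remember the source boundedness.
    static final String SOURCE_BOUNDED = "sketch.source.bounded";

    /** Called once while building the pipeline, where boundedness is known. */
    public static void recordBoundedness(Map<String, Boolean> conf, boolean bounded) {
        conf.put(SOURCE_BOUNDED, bounded);
    }

    /** Decides only whether compact operators are added to the pipeline. */
    public static boolean needsCompaction(Map<String, Boolean> conf) {
        return conf.getOrDefault(COMPACTION_ENABLED, true);
    }

    /** Decides only how the compaction runs; the enabling option is never touched. */
    public static boolean compactsSynchronously(Map<String, Boolean> conf) {
        return conf.getOrDefault(SOURCE_BOUNDED, false);
    }

    public static void main(String[] args) {
        Map<String, Boolean> conf = new HashMap<>();
        recordBoundedness(conf, true);
        System.out.println("compaction: " + needsCompaction(conf)
            + ", sync: " + compactsSynchronously(conf));
    }
}
```

Note that this variant, as sketched, would still not give unbounded sources a way to opt into sync compaction, which is the case the PR author asks about.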
@danny0405 is this still needed for Hudi Flink?
Yeah, let's move it to 1.0 release.
Change Logs
Add separate control for Flink compaction operation sync/async mode.
Details in https://issues.apache.org/jira/projects/HUDI/issues/HUDI-4329
Problem Review
The sync/async mode of the compaction operation in CompactionFunction is currently controlled by FlinkOptions#COMPACTION_ASYNC_ENABLED.
In practice, however, it cannot be switched to sync mode: when that option is set to false, the resulting pipeline includes only the clean operator and not the compact operators.
Improvement
Add another separate control switch for compaction operation sync/async mode.
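A minimal sketch of what such a switch controls at the operator level, assuming a single-thread `ExecutorService` in place of Hudi's own executor and a made-up class `CompactionDispatchSketch`; the error callback that emits a failed commit event in the real function is not modeled here.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CompactionDispatchSketch {
    // Null means "run the compaction operation synchronously on the caller thread".
    private final ExecutorService executor;
    // Records "<instant>@<thread name>" for every executed compaction.
    private final List<String> executed = new CopyOnWriteArrayList<>();

    public CompactionDispatchSketch(boolean asyncOperation) {
        this.executor = asyncOperation ? Executors.newSingleThreadExecutor() : null;
    }

    /** Dispatches one compaction either to the executor or inline, depending on the switch. */
    public void compact(String instantTime) {
        if (executor != null) {
            // Async mode: the caller returns immediately and is not blocked by the compaction.
            executor.execute(() -> doCompaction(instantTime));
        } else {
            // Sync mode: the caller waits until the compaction has finished.
            doCompaction(instantTime);
        }
    }

    private void doCompaction(String instantTime) {
        executed.add(instantTime + "@" + Thread.currentThread().getName());
    }

    /** Waits for pending async work; a no-op in sync mode. */
    public void close() {
        if (executor == null) {
            return;
        }
        executor.shutdown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public List<String> executed() {
        return executed;
    }

    public static void main(String[] args) {
        CompactionDispatchSketch sync = new CompactionDispatchSketch(false);
        sync.compact("001");
        System.out.println(sync.executed());

        CompactionDispatchSketch async = new CompactionDispatchSketch(true);
        async.compact("002");
        async.close();
        System.out.println(async.executed());
    }
}
```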
Impact
Adds a sync compaction switch for MoR compaction.
Risk level (write none, low medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Add compaction.operation.async.enabled: used to switch the compaction operation between sync and async mode.
The original compaction.async.enabled: used to turn on the compaction process.
Contributor's checklist