[HUDI-7503] Compaction execution should fail if another active writer is already executing the same plan#18012
kbuci wants to merge 14 commits into apache:master
Conversation
nsivabalan
left a comment
can we create a follow up to enable this for clustering as well.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
Future<?> future1 = executors.submit(() -> {
  try {
    // Wait for both writers to be ready
    cyclicBarrier.await(60, TimeUnit.SECONDS);
Can we use a CountDownLatch to be deterministic and not add a 60 sec wait?
I removed the 60 sec wait, but my understanding is that this API is deterministic as well
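For reference, a minimal sketch (not the PR's actual test code; the client calls are hypothetical placeholders) of releasing two writer threads with a CountDownLatch so that both attempt the same compaction plan at once, without a timed wait:

// A minimal sketch, not the PR's test: a CountDownLatch releases both writers
// at the same time deterministically; the compact() calls are placeholders.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentCompactionLatchExample {
  public static void main(String[] args) throws Exception {
    ExecutorService executors = Executors.newFixedThreadPool(2);
    CountDownLatch startLatch = new CountDownLatch(1);

    Future<?> future1 = executors.submit(() -> {
      try {
        startLatch.await();              // block until released
        // client1.compact(instantTime); // hypothetical: first writer should win
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    Future<?> future2 = executors.submit(() -> {
      try {
        startLatch.await();              // block until released
        // client2.compact(instantTime); // hypothetical: should fail on the active heartbeat
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    startLatch.countDown();              // release both writers deterministically
    future1.get();
    future2.get();
    executors.shutdown();
  }
}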
client1.close();
client2.close();
FileIOUtils.deleteDirectory(new File(basePath));
}
Can we also test the case where writer1 started compaction execution and failed mid-way, and after some time writer2 starts and is able to roll back and execute the failed compaction (after the heartbeat expired)?
Sure, added another test case.
@nsivabalan While checking failing UTs, I noticed a complication. In some places the client calls compact without committing the result. Because of this, for now I am thinking that we can just have this heartbeat guard only if auto-complete/commit is enabled?
The issue, though, is that this adds some code complexity, since there seem to be multiple APIs that users can directly call for committing a compaction.
What are your thoughts?
I feel we should keep it simple: start the heartbeat at the beginning of compact(), but let's ensure we have a try/catch block within compact(), so that within the catch block we can stop the heartbeat on failures.
That's what I originally tried. But the issue I saw is that there could be a case like #18012 (comment) where a user calls compact without committing, and then calls compact again. So in that scenario we would need to stop the heartbeat of that first compact call before exiting, even though auto-commit wasn't enabled.
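To make the idea being debated concrete, here is a minimal sketch of the "stop the heartbeat in a catch block" pattern, assuming hypothetical HeartbeatClient and CompactionRunner interfaces rather than Hudi's real classes; it is not the PR's actual implementation:

// A minimal sketch of stopping the heartbeat on failure inside compact().
// HeartbeatClient and CompactionRunner are illustrative, not Hudi's API.
public class HeartbeatGuardedCompaction {

  interface HeartbeatClient {
    void start(String instantTime);
    void stop(String instantTime);
  }

  interface CompactionRunner {
    void execute(String instantTime);
  }

  private final HeartbeatClient heartbeatClient;
  private final CompactionRunner compactionRunner;

  HeartbeatGuardedCompaction(HeartbeatClient heartbeatClient, CompactionRunner compactionRunner) {
    this.heartbeatClient = heartbeatClient;
    this.compactionRunner = compactionRunner;
  }

  void compact(String compactionInstantTime) {
    // Start heartbeating for this compaction instant before executing the plan.
    heartbeatClient.start(compactionInstantTime);
    try {
      compactionRunner.execute(compactionInstantTime);
    } catch (RuntimeException e) {
      // On failure, stop the heartbeat so a later writer can retry sooner,
      // subject to the auto-commit caveat discussed in this thread.
      heartbeatClient.stop(compactionInstantTime);
      throw e;
    }
  }
}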
@nsivabalan As discussed, updated PR under the assumption that:
Describe the issue this Pull Request addresses
Updated compact to start a heartbeat (within a transaction) before attempting to execute a plan.
If multiple writers attempt to execute the same compact/logcompact plan at the same time, only one of them will proceed; the rest will fail with an exception (upon seeing a heartbeat has already been started) and will abort.
Summary and Changelog
Without this change, if multiple jobs are launched at the same time that target executing the same compact plan (due to a non-HUDI-related user-side configuration/orchestration issue), then one job can execute the compact plan and create an inflight/commit instant file while the other jobs can roll it back (and delete inflight instant files or data files). This can leave the dataset in a corrupted state.
With this change, we are updating the compact API (which executes an existing compaction plan) to first take the table lock (by starting a transaction), start a heartbeat, and release the table lock (end the transaction). If a heartbeat already exists, an exception is thrown instead. The heartbeat is closed when the compaction plan execution succeeds. This means that if compaction execution fails but a Spark task/driver is still lingering and writing/deleting files, the heartbeat won't be immediately cleaned up. If multiple concurrent jobs attempt to execute a compaction plan, then all jobs except the first one will fail (until the heartbeat of the first one expires). Note that the table lock mentioned above is needed to ensure that multiple writers don't independently check that the heartbeat is inactive and both start it.
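A minimal sketch of the guard described above, assuming hypothetical TransactionManager and HeartbeatClient abstractions rather than Hudi's real classes: the heartbeat check-and-start happens under the table lock, so two writers cannot both observe "no heartbeat" and proceed.

// Illustrative only; names and interfaces are assumptions, not Hudi's API.
public class CompactionExecutionGuard {

  interface TransactionManager {
    void beginTransaction();   // acquire the table lock
    void endTransaction();     // release the table lock
  }

  interface HeartbeatClient {
    boolean isHeartbeatActive(String instantTime);
    void start(String instantTime);
    void stop(String instantTime);
  }

  private final TransactionManager txnManager;
  private final HeartbeatClient heartbeatClient;

  CompactionExecutionGuard(TransactionManager txnManager, HeartbeatClient heartbeatClient) {
    this.txnManager = txnManager;
    this.heartbeatClient = heartbeatClient;
  }

  void guardAndExecute(String compactionInstantTime, Runnable executePlan) {
    txnManager.beginTransaction();
    try {
      // Under the table lock: fail fast if another active writer already
      // started a heartbeat for this compaction instant.
      if (heartbeatClient.isHeartbeatActive(compactionInstantTime)) {
        throw new IllegalStateException(
            "Another writer is already executing the compaction plan at " + compactionInstantTime);
      }
      heartbeatClient.start(compactionInstantTime);
    } finally {
      txnManager.endTransaction();
    }

    executePlan.run();
    // Only on success: close the heartbeat. On failure it is left to expire,
    // since lingering Spark tasks may still be writing/deleting files.
    heartbeatClient.stop(compactionInstantTime);
  }
}

Because the check and the start of the heartbeat happen atomically under the lock, a second writer that arrives while the heartbeat is active fails fast instead of racing the first writer.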
Impact
Risk Level
Documentation Update
None
Contributor's checklist