[FLINK-39315][mysql] Unregister listeners of BinaryLogClient to prevent snapshot reader hang during backfill#4334
Conversation
…nt snapshot reader hang during backfill
There was a problem hiding this comment.
Pull request overview
Fixes a MySQL CDC snapshot backfill hang when reusing a BinaryLogClient across multiple snapshot splits by ensuring listeners registered during streaming execution are cleaned up deterministically.
Changes:
- Unregister
BinaryLogClientevent and lifecycle listeners after eachMySqlStreamingChangeEventSource.execute()run to prevent cross-split interference. - Add a stress-style test adjustment to generate more binlog activity during sequential split backfills.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
.../SnapshotSplitReaderTest.java |
Extends an existing test to generate enough binlog events to reproduce/guard against the hang scenario. |
.../MySqlStreamingChangeEventSource.java |
Tracks registered listeners and unregisters them in cleanup to avoid listener accumulation with a reused BinaryLogClient. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Only when we reach the first BEGIN event will we start to skip events ... | ||
| skipEvent = false; | ||
|
|
||
| Throwable executionError = null; |
There was a problem hiding this comment.
If an exception occurs in the main try and any unregister* call throws in finally, the cleanup exception will replace the original failure, making the root cause harder to diagnose. Consider using the executionError variable to preserve the primary exception: record the original throwable, then wrap unregister* in a try/catch and attach cleanup failures via addSuppressed (or only throw cleanup failures when there was no primary error).
There was a problem hiding this comment.
Useless, deleted.
| client.unregisterEventListener(eventListener); | ||
| client.unregisterEventListener(metricsEventListener); | ||
| client.unregisterLifecycleListener(lifecycleListener); | ||
| if (logEventListener != null) { | ||
| client.unregisterEventListener(logEventListener); |
There was a problem hiding this comment.
If an exception occurs in the main try and any unregister* call throws in finally, the cleanup exception will replace the original failure, making the root cause harder to diagnose. Consider using the executionError variable to preserve the primary exception: record the original throwable, then wrap unregister* in a try/catch and attach cleanup failures via addSuppressed (or only throw cleanup failures when there was no primary error).
| client.unregisterEventListener(eventListener); | |
| client.unregisterEventListener(metricsEventListener); | |
| client.unregisterLifecycleListener(lifecycleListener); | |
| if (logEventListener != null) { | |
| client.unregisterEventListener(logEventListener); | |
| try { | |
| client.unregisterEventListener(eventListener); | |
| } catch (Exception e) { | |
| LOGGER.info("Exception while unregistering event listener", e); | |
| } | |
| try { | |
| client.unregisterEventListener(metricsEventListener); | |
| } catch (Exception e) { | |
| LOGGER.info("Exception while unregistering metrics event listener", e); | |
| } | |
| try { | |
| client.unregisterLifecycleListener(lifecycleListener); | |
| } catch (Exception e) { | |
| LOGGER.info("Exception while unregistering lifecycle listener", e); | |
| } | |
| if (logEventListener != null) { | |
| try { | |
| client.unregisterEventListener(logEventListener); | |
| } catch (Exception e) { | |
| LOGGER.info("Exception while unregistering log event listener", e); | |
| } |
There was a problem hiding this comment.
Thanks for the review. I agree that if unregister*() throws in finally , it can mask the original exception and make the root cause harder to diagnose.
After reconsideration, I moved the listener unregistration to the end of the normal execution path instead of the finally block. The reason is that the problematic case we want to avoid is cross-split reuse when the execution finishes normally; if an exception happens and we exit early, the task will fail and the BinaryLogClient will be recreated on recovery, so the listener accumulation issue should not be hit in that path.
| // To verify that FLINK-39315 is fixed, generate sufficient binlog events, | ||
| // so that the MySqlBinlogSplitReadTask runs long enough to exercise the | ||
| // context-running checks in binlog reading backfill phase. | ||
| for (int i = 0; i < 1000; i++) { | ||
| mySqlConnection.execute( | ||
| "UPDATE " | ||
| + tableId | ||
| + " SET address = 'Beijing' WHERE id = 106"); | ||
| mySqlConnection.commit(); | ||
| mySqlConnection.execute( | ||
| "UPDATE " | ||
| + tableId | ||
| + " SET address = 'Shanghai' WHERE id = 106"); | ||
| mySqlConnection.commit(); | ||
| } |
There was a problem hiding this comment.
This adds 2000 updates with 2000 commits, which can significantly slow down the unit test and increase flakiness on slower CI environments. Consider batching commits (e.g., commit every N updates or run the loop in a single transaction), reducing iterations to the minimum that still reproduces the issue, or deriving the iteration count from a constant/system property so CI can tune it if needed.
There was a problem hiding this comment.
I reduced the loop to 100 loops, which is still sufficient to reproduce the issue when the fix is commented out. This should make the unit test significantly faster and less flaky on CI.
This closes FLINK-39315.
What is the purpose of the change
This PR fixes a MySQL CDC source hang in the snapshot backfill phase when processing multiple snapshot splits sequentially while reusing the same
BinaryLogClient.SnapshotSplitReader.pollWithBuffer()waits for theBINLOG_ENDwatermark to finish a split. However,BinaryLogClientwas reused across split executions andMySqlStreamingChangeEventSource.execute()registered multiple event/lifecycle listeners on each execution without unregistering them. As a result, listeners from previous splits could still receive binlog events during later splits and:ChangeEventSourceContextprematurely (causing the current split’s backfill to exit early), and/orBINLOG_ENDvia a staleSignalEventDispatcherinto a stale queue (so the currentpollWithBuffer()never seesBINLOG_END).This could leave the queue empty while the backfill thread has already stopped, causing the reader to hang indefinitely.
Brief change log
BinaryLogClientevent and lifecycle listeners inMySqlStreamingChangeEventSource.execute()after each execution to avoid listener accumulation and cross-split interference.SnapshotSplitReaderTest(based ontestMultipleSplitsWithBackfill) to validate multiple snapshot splits with a forced backfill phase can finish and produce the expected output.Verifying this change
This change is verified by unit tests:
SnapshotSplitReaderTest#testMultipleSplitsWithBackfillDoes this pull request potentially affect one of the following parts:
Documentation