[regression-test](streaming-job) add cdc operational cases for offset modes and pg slot lifecycle#63514
[regression-test](streaming-job) add cdc operational cases for offset modes and pg slot lifecycle#63514JNSimba wants to merge 7 commits into
Conversation
… modes and pg slot lifecycle
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found a blocking reliability issue in two of the new PostgreSQL CDC regression tests. The tests drop user-provided logical replication slots immediately after DROP JOB while the CDC task/replication connection may still be winding down; PostgreSQL rejects pg_drop_replication_slot on an active slot, so these cases can fail or leak cleanup state even when the product behavior is correct.
Critical checkpoint conclusions:
- Goal/test proof: The PR adds CDC operational regression coverage and mostly targets the stated scenarios, but two tests need synchronization before slot cleanup to be reliable proof.
- Scope/focus: Changes are regression-only and focused on streaming CDC scenarios.
- Concurrency/lifecycle: The affected tests interact with asynchronous job cancellation and an external PG replication slot lifecycle; cleanup currently assumes DROP JOB closes the replication connection synchronously.
- Configuration/compatibility/protocol: No new configs, storage format, or FE-BE protocol changes in the PR files.
- Parallel paths: The same active-slot cleanup pattern appears in both new user-slot PG tests and should be fixed in both.
- Test coverage/results: Coverage is useful, but the cleanup race can make the tests flaky and leave external resources behind.
- Observability/performance/transactions: No additional concern found for these regression-only changes.
User focus: No additional user-provided review focus was present.
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review summary:
No new blocking issues found in the current head. The existing inline comments about pg_drop_replication_slot racing with active replication connections are already-known review context; the latest commit now polls for inactive slots before manual cleanup in the two affected tests, so I did not duplicate those comments.
Critical checkpoint conclusions:
- Goal/test proof: The PR adds regression coverage for MySQL earliest offset, PostgreSQL resource validation, mid-snapshot drop cleanup, FE restart offset durability, and PostgreSQL slot LSN advancement; the added/updated tests directly exercise those goals.
- Scope/focus: Changes are limited to CDC regression tests and expected output.
- Concurrency/lifecycle: The tests intentionally exercise async CDC task cancellation, FE restart, pause/resume, and external PostgreSQL slot/publication lifecycle. Current cleanup polling is consistent with those asynchronous paths.
- Configuration/compatibility/protocol: No new configs, storage formats, or FE-BE protocol changes.
- Parallel paths: Both MySQL and PostgreSQL offset/lifecycle paths receive targeted coverage where applicable.
- Error handling: New negative PostgreSQL validator cases use
test { ... exception ... }and check actionable error substrings. - Test coverage/results: Added
.outfor the deterministic MySQL query output and uses ordered query output. I did not run the external Docker/JDBC regression suites in this runner. - Observability/performance: No production observability or hot-path code changes.
- Transaction/persistence/data correctness: The FE restart case covers replayed CDC offset state; no production transaction code is changed.
- User focus: No additional user-provided review focus was specified.
|
/review |
|
run buildall |
There was a problem hiding this comment.
I found blocking issues in the CDC reader lifecycle change and one regression-test correctness gap. Critical checkpoints: the PR goal is clear and mostly covered by new regression tests, but the current synchronization change does not safely accomplish cancel/drop cleanup for stream/binlog readers; the change is focused, but the monitor is held across blocking poll calls; concurrency is involved and close can now be serialized behind a blocked poll in both PG/JDBC and MySQL paths; no special lifecycle/static-init/config/storage-format compatibility issues found; parallel PG/MySQL reader paths were both updated but both carry the same close-path risk; tests cover the intended CDC invariants, with one LSN conversion gap called out inline; no additional user-provided focus points were specified.
There was a problem hiding this comment.
Pull request overview
This PR adds new regression coverage for CDC operational invariants (offset modes, Postgres slot/publication lifecycle, FE-restart offset persistence) and hardens cdc_client source-reader shutdown behavior to avoid concurrency failures during mid-snapshot cancellation.
Changes:
- Added new MySQL/PG streaming-job regression suites for “earliest/latest/specific(JSON)” offset behavior, FE restart offset persistence, and PG slot/publication lifecycle invariants.
- Extended the existing PG publication test to cover fail-fast validation paths for missing slot/publication and incomplete publications.
- Updated
cdc_clientMySQL + generic JDBC incremental readers to use thread-safe collections and synchronized lifecycle methods to avoid concurrent-modification during cancellation/close.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_special_offset_restart_fe.groovy | New test: JSON LSN offset survives FE restart and streaming continues |
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_slot_lsn_advance.groovy | New test: confirmed_flush_lsn advances, freezes on PAUSE, resumes after RESUME |
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_publication.groovy | Adds validator fail-fast negative cases for missing slot/pub/required tables |
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_drop_during_snapshot.groovy | New test: DROP JOB mid-snapshot cleans up auto-managed slot/publication |
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_mysql_job_offset_earliest.groovy | New test: MySQL offset=earliest replays historical binlog-only (no snapshot) |
| regression-test/data/job_p0/streaming_job/cdc/test_streaming_mysql_job_offset_earliest.out | Expected output for the new MySQL earliest-offset regression test |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/reader/mysql/MySqlSourceReader.java | Concurrency hardening: thread-safe collections + synchronized lifecycle/close changes |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/reader/JdbcIncrementalSourceReader.java | Concurrency hardening for the generic JDBC incremental reader used by PG path |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…r for PG LSN offset
|
/review |
There was a problem hiding this comment.
Reviewed the PR against the Doris code-review checklist. The new regression coverage targets the stated CDC operational invariants, and the tests cover several important positive and negative paths. However, the current reader cleanup fix still leaves a snapshot-poll concurrency race: close()/finishSplitRecords() can clear and null activePollFutures while the monitor-free pollRecords() path is still using it. This affects both the JDBC/PG and MySQL reader implementations and can turn DROP/cancel during snapshot polling into a task-side NullPointerException or other cleanup-time failure instead of deterministic cancellation.
Critical checkpoint conclusions: Goal/test coverage: the PR adds tests for CDC offset and PG slot/publication lifecycle, but the code fix does not fully satisfy concurrent DROP/cancel safety. Scope/focus: changes are mostly focused on CDC reader lifecycle and regression coverage. Concurrency: blocking stream/binlog monitor concerns and executor leaks were already raised in existing threads; this review adds a separate activePollFutures mutation race in snapshot polling. Lifecycle/resource cleanup: slot/publication cleanup is tested, but reader cleanup still has unsafe concurrent state mutation. Compatibility/config/storage formats: no incompatible format or protocol change found. Parallel code paths: JDBC/PG and MySQL both carry the same race and both need the same fix. Observability: existing logs are sufficient for the reviewed paths. User focus: no additional user-provided review focus was supplied.
…Completion to avoid close NPE race
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found a remaining concurrency issue in the CDC reader cleanup path. The existing review threads already cover the activePollFutures lifetime race, executor shutdown leak, PG LSN overflow, and user-slot cleanup waits; I did not duplicate those.
Critical checkpoint conclusions:
- Goal/test coverage: the PR adds CDC operational regression coverage and attempts to make reader cleanup safe during cancellation, but the stream/binlog close-vs-poll path is still unsafe.
- Focus/scope: changes are mostly focused on CDC reader cleanup and CDC regression tests.
- Concurrency: blocking snapshot polling no longer holds the reader monitor, but making pollRecords monitor-free leaves stream/binlog readers exposed to concurrent nulling/closing by close()/finishSplitRecords(); this is blocking.
- Lifecycle/resources: existing review context already notes pollExecutor shutdown is not handled; no duplicate inline comment was added here.
- Compatibility/configuration/persistence: no new incompatible format/config persistence concern found in this pass.
- Parallel paths: the same stream/binlog race exists in both PostgreSQL/JDBC and MySQL reader implementations, so both paths need the same treatment.
- Tests: new tests cover important PG/MySQL CDC invariants, but they do not prove close racing with an in-flight stream/binlog poll remains safe.
- Observability/performance: no additional blocking issue found beyond the concurrency problem.
- User focus: no additional user-provided review focus was supplied.
…y to avoid close NPE race
|
/review |
|
run buildall |
There was a problem hiding this comment.
Review result: request changes. I did not add new inline comments because the blocking issue I can still confirm is already covered by existing inline threads and duplicating it would violate the review context instructions.
Confirmed unresolved issue: the per-reader pollExecutor lifecycle remains unsafe. JdbcIncrementalSourceReader and MySqlSourceReader still create a fixed thread pool in initialize(), but close()/finishSplitRecords() only cancel active futures/readers and never shut the executor down. Since Env creates a SourceReader per job and /api/close removes the JobContext after reader.close(), repeated job create/drop leaves those executor threads and their queues reachable until process exit. This matches the existing inline threads on JdbcIncrementalSourceReader and MySqlSourceReader and should be fixed before merge.
Critical checkpoint conclusions:
- Goal and tests: the PR adds CDC operational regression coverage and fixes reader cleanup races. The new tests cover important PG/MySQL CDC invariants, but the reader lifecycle goal is incomplete while pollExecutor is leaked.
- Scope: the changes are mostly focused on CDC reader cleanup and regression tests.
- Concurrency: this PR directly changes close/poll concurrency. Existing stream/binlog local snapshots reduce null-field races, but executor lifecycle remains a concurrency/lifecycle resource leak.
- Lifecycle: per-job readers are removed from Env on close, so their private executor must be explicitly shut down; current code does not do that.
- Config compatibility: no new config item requiring dynamic behavior was found.
- Incompatible formats/protocols: none found.
- Parallel paths: MySQL and JDBC/PG paths are mirrored; the executor leak applies to both.
- Tests: added regression tests are relevant. Existing slot cleanup/LSN overflow review threads appear addressed in the latest code.
- Observability: no additional observability requirement found beyond existing logs.
- Transaction/persistence/data correctness: no new transaction visibility or persistence issue found in the reviewed diff.
- Performance/resource use: the unresolved executor leak can accumulate threads/memory across repeated CDC job churn.
User focus: no additional user-provided review focus was specified.
Summary
Add 5 regression cases for CDC operational invariants:
`drop_during_snapshot` surfaced a real cdc_client concurrency bug: HTTP `/api/close` raced the async-write task thread on `JdbcIncrementalSourceReader` snapshot-reader lists (CME), making `PostgresSourceReader.close` abort before dropping the auto-managed slot/publication. Fix: thread-safe collections (CopyOnWriteArrayList / ConcurrentHashMap.newKeySet) + serialize reader state mutations (`prepareSnapshotSplits` / `prepareStreamSplit` / `finishSplitRecords` / `close`) on the reader monitor; `pollRecords` stays monitor-free so a blocking poll never fences `close` out. Mirrored in `MySqlSourceReader`.
Also fixed two existing issues in the new cases:
Test plan