branch-4.1: [Improve](Streamingjob) support only snapshot sync for mysql and pg #61389 #61615

Open
github-actions[bot] wants to merge 1 commit into branch-4.1 from auto-pick-61389-branch-4.1
@github-actions

Cherry-picked from #61389

…61389)

### What problem does this PR solve?

#### Background
StreamingJob currently supports two offset modes:
- `initial`: full snapshot + continuous incremental replication
- `latest`: incremental replication only (no snapshot)
                  
There is no way to perform a one-time full sync and then stop. This is
needed for data migration scenarios where only a point-in-time full
copy is required, without ongoing replication.

#### Usage

Set `"offset" = "snapshot"` when creating a StreamingJob:

```sql
CREATE JOB mysql_db_sync
ON STREAMING
FROM MYSQL (
    ...
    "user" = "root",
    "password" = "",
    "database" = "db",
    "include_tables" = "user_info,student",
    "offset" = "snapshot"
)
TO DATABASE target_test_db (
)
```

The job will perform a full table snapshot and automatically transition
to FINISHED once all data is synced. No binlog/WAL subscription is
established.

#### Design

The implementation centers on a `hasReachedEnd()` signal in `SourceOffsetProvider`:

- FE: `JdbcSourceOffsetProvider.hasReachedEnd()` returns true when the job is in snapshot-only mode and all snapshot splits have been consumed (`finishedSplits` non-empty, `remainingSplits` empty). `StreamingInsertJob.onStreamTaskSuccess()` checks `hasReachedEnd()` before creating the next task; if it returns true, the job is marked FINISHED.
- BE (cdc_client): `snapshot` maps to `StartupOptions.snapshot()` for both the MySQL and PostgreSQL connectors. The chunk-split path is reused from `initial` mode.
- Crash recovery: if FE crashes before persisting FINISHED, the job resumes via PAUSED→PENDING. `handlePendingState()` calls `replayOffsetProviderIfNeed()` and then checks `hasReachedEnd()`; if all splits are already finished, the job transitions directly to FINISHED without creating any new task.
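
As a rough illustration of the end-of-snapshot check and the recovery path described in this section, the following is a minimal, self-contained sketch. It does not reproduce the actual Doris classes: `SnapshotOnlySketch`, `OffsetProvider`, `Job`, `markNextSplitFinished()`, and `tasksCreated` are hypothetical stand-ins, and splits are modeled as plain strings. Only the condition itself (snapshot-only mode, `finishedSplits` non-empty, `remainingSplits` empty) is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

class SnapshotOnlySketch {
    enum JobStatus { PENDING, RUNNING, PAUSED, FINISHED }

    /** Hypothetical stand-in for the offset provider; models split bookkeeping only. */
    static class OffsetProvider {
        final boolean snapshotOnly;
        final List<String> remainingSplits = new ArrayList<>();
        final List<String> finishedSplits = new ArrayList<>();

        OffsetProvider(boolean snapshotOnly, List<String> splits) {
            this.snapshotOnly = snapshotOnly;
            this.remainingSplits.addAll(splits);
        }

        /** Moves one split from remaining to finished (assumed helper, not a Doris API). */
        void markNextSplitFinished() {
            if (!remainingSplits.isEmpty()) {
                finishedSplits.add(remainingSplits.remove(0));
            }
        }

        /** True only in snapshot-only mode, after at least one split finished and none remain. */
        boolean hasReachedEnd() {
            return snapshotOnly && !finishedSplits.isEmpty() && remainingSplits.isEmpty();
        }
    }

    /** Hypothetical stand-in for the streaming job's state handling. */
    static class Job {
        JobStatus status;
        final OffsetProvider provider;
        int tasksCreated = 0;

        Job(OffsetProvider provider, JobStatus initialStatus) {
            this.provider = provider;
            this.status = initialStatus;
        }

        /** After a task succeeds: finish the job if the snapshot is done, else schedule more work. */
        void onTaskSuccess() {
            provider.markNextSplitFinished();
            if (provider.hasReachedEnd()) {
                status = JobStatus.FINISHED;
            } else {
                tasksCreated++;
            }
        }

        /** Recovery path: the provider state is assumed to have been replayed already. */
        void handlePendingState() {
            if (provider.hasReachedEnd()) {
                status = JobStatus.FINISHED; // no new task is created
                return;
            }
            tasksCreated++;
            status = JobStatus.RUNNING;
        }
    }

    public static void main(String[] args) {
        Job job = new Job(new OffsetProvider(true, List.of("split-1", "split-2")), JobStatus.RUNNING);
        job.onTaskSuccess();
        job.onTaskSuccess();
        System.out.println(job.status); // prints FINISHED
    }
}
```

Requiring `finishedSplits` to be non-empty keeps a job that has not yet produced any splits from being treated as complete. In this sketch a provider that is not in snapshot-only mode never reports the end, so it always moves on to further tasks.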

#### Testing

Added regression tests for both MySQL and PostgreSQL:
- test_streaming_mysql_job_snapshot.groovy
- test_streaming_postgres_job_snapshot.groovy

Both tests verify:
1. All existing data is synced correctly after the job finishes
2. Job status transitions to FINISHED
github-actions bot requested a review from yiguolei as a code owner on March 23, 2026, 06:24
Thearas commented Mar 23, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

dataroaring reopened this on Mar 23, 2026
Thearas commented Mar 23, 2026

run buildall
