[Improve](streaming job) support custom table name mapping for CDC streaming job #61317
JNSimba wants to merge 3 commits into apache:master
/review
Pull request overview
Adds support for streaming CDC jobs (Postgres) to map upstream source table names to different Doris target table names via per-table config (table.<src>.target_table), including schema-change DDL routing and regression coverage.
Changes:
- Introduce per-table config key constants and validation for `table.<tableName>.<suffix>` (adds `target_table` suffix).
- Update FE table auto-creation to create Doris tables using mapped target names while keeping CDC monitoring based on source table names.
- Update CDC client to route stream-load writes and schema-change DDLs to mapped target tables; add regression tests for mapping + multi-source merge.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_table_mapping.groovy | New regression suite covering table name mapping and two-source-to-one-target merge. |
| regression-test/data/job_p0/streaming_job/cdc/test_streaming_postgres_job_table_mapping.out | Expected results for the new mapping regression suite. |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java | Adds helper to parse all table.<src>.target_table mappings from config. |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/PostgresDebeziumJsonDeserializer.java | Route schema-change DDLs to the mapped Doris target table. |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/DebeziumJsonDeserializer.java | Cache parsed source→target mappings and provide resolveTargetTable(). |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/service/PipelineCoordinator.java | Route stream-load writes to mapped target table names. |
| fe/fe-core/src/main/java/org/apache/doris/job/util/StreamingJobUtils.java | Generate CREATE TABLE commands keyed by source table name and create Doris tables using mapped target names. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java | Use the new source→CreateTableCommand mapping and ensure CDC monitors source tables. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/DataSourceConfigValidator.java | Validate per-table config key format and allowlisted suffixes. |
| fe/fe-common/src/main/java/org/apache/doris/job/cdc/DataSourceConfigKeys.java | Adds per-table key constants (table, exclude_columns, target_table). |
Code Review Summary
This PR adds support for mapping upstream (PostgreSQL) table names to custom downstream (Doris) table names in CDC streaming jobs. The design is generally sound — the mapping is applied consistently across FE table creation, CDC client DML writes, and schema change DDL generation.
Critical Checkpoints
Goal achievement: The code accomplishes the stated goal. Table name mapping is applied in all three necessary places: (1) FE CreateTableCommand uses the mapped target name, (2) PipelineCoordinator routes DML records to the mapped Doris table, (3) PostgresDebeziumJsonDeserializer generates DDLs with the mapped name. Regression tests cover both basic mapping and multi-table merge scenarios.
Modification focus: The change is focused and touches only the necessary files. Good.
Concurrency: targetTableMappingsCache in DebeziumJsonDeserializer is a plain HashMap written once in init() and only read afterwards. This is safe in practice but could be made more robust with Collections.unmodifiableMap(). Low risk, not blocking.
Lifecycle management: No special lifecycle concerns.
Configuration items: New config key format table.<src>.target_table is validated in DataSourceConfigValidator. Dynamic changes not applicable (job creation time only).
Incompatible changes: The return type of generateCreateTableCmds changed from List<CreateTableCommand> to LinkedHashMap<String, CreateTableCommand>. This has only one caller (createTableIfNotExists), so no compatibility concern.
Parallel code paths: MySqlDebeziumJsonDeserializer exists as a parallel path. Its DML write routing goes through the same PipelineCoordinator code (correctly mapped). Its handleSchemaChangeEvent() is a TODO stub that returns empty, so no mapping needed yet. When implemented, it will need to use resolveTargetTable() — the infrastructure is already in the base class. Acceptable.
Test coverage: Good. Two regression test cases cover basic mapping (INSERT/UPDATE/DELETE) and multi-table merge (two PG tables → one Doris table). Tests use ORDER BY for deterministic output. Tables are dropped before use, not after. .out file appears auto-generated.
Observability: No new critical paths requiring additional logging. Existing log messages correctly use source table identifiers.
Persistence/transactions: Not applicable — no EditLog or transaction modifications.
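The concurrency checkpoint above suggests wrapping the mapping cache in `Collections.unmodifiableMap()`. A minimal sketch of that idea follows. The class `TargetTableMappingCache` and its methods are hypothetical and are not taken from the PR; the real cache lives in `DebeziumJsonDeserializer`, whose code is not reproduced here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder class (assumption); it only illustrates the "write once, read only" pattern.
final class TargetTableMappingCache {
    // Read-only view over a private copy, so any later write attempt fails fast.
    private final Map<String, String> mappings;

    TargetTableMappingCache(Map<String, String> parsedMappings) {
        // Copy first so the caller cannot mutate the cache through its own reference.
        this.mappings = Collections.unmodifiableMap(new HashMap<>(parsedMappings));
    }

    Map<String, String> mappings() {
        return mappings;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        parsed.put("orders", "ods_orders");
        TargetTableMappingCache cache = new TargetTableMappingCache(parsed);
        System.out.println(cache.mappings());
    }
}
```

Because the field is `final` and the view rejects writes, readers on other threads never observe a partially populated or later-modified map, assuming the holder object itself is safely published.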
Issues Found
- [CRITICAL] Compilation error in `ConfigUtil.java`: Missing `import java.util.HashMap` and `import org.apache.doris.job.cdc.DataSourceConfigKeys`. The new `parseAllTargetTableMappings` method uses `HashMap` and `DataSourceConfigKeys` but neither is imported. This file will not compile.
- [Minor] Dead code: `TABLE_EXCLUDE_COLUMNS_SUFFIX` is declared in `DataSourceConfigKeys` but never referenced anywhere in the codebase (not in the validator's `ALLOW_TABLE_LEVEL_SUFFIXES`, not in any consumer). If this is a placeholder for a future feature, it should be removed from this PR and added when actually needed to avoid confusion.
- [Minor] Validator rejects table names containing dots: The `DataSourceConfigValidator` splits on `.` and requires exactly 3 parts. A table name containing a dot (e.g., `my.table`) would produce more parts and be rejected. PostgreSQL allows dots in quoted identifiers. Consider documenting this limitation or using `indexOf`/`lastIndexOf` instead of `split`.
    throw new IllegalArgumentException("Malformed per-table config key: '" + key
            + "'. Expected format: table.<tableName>.<suffix>");
    }
    String suffix = parts[parts.length - 1];
[Minor] Table names with dots will be rejected: split("\\.", -1) with parts.length != 3 means that a table name containing a dot (e.g., table.my.dotted.table.target_table) will produce more than 3 parts and fail validation. PostgreSQL allows dots in quoted identifiers.
Consider using indexOf/lastIndexOf instead:
    int firstDot = key.indexOf('.', TABLE_LEVEL_PREFIX.length());
    int lastDot = key.lastIndexOf('.');
    if (firstDot == -1 || firstDot != lastDot - ???) { ... }

Or at minimum, document this limitation (no dots in source table names).
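One way the `lastIndexOf` idea could be completed is sketched below. This is not the PR's validator: the class `PerTableKeyParser`, the `"table."` prefix value, and the single-entry allowlist are assumptions inferred from the key format `table.<tableName>.<suffix>`. It also assumes suffixes never contain a dot, so the last dot always separates table name from suffix.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical standalone parser (assumption); constant values are inferred from the key format.
final class PerTableKeyParser {
    static final String TABLE_LEVEL_PREFIX = "table.";
    static final Set<String> ALLOWED_SUFFIXES = Collections.singleton("target_table");

    // Returns {tableName, suffix}; the table name may itself contain dots.
    static String[] parse(String key) {
        if (!key.startsWith(TABLE_LEVEL_PREFIX)) {
            throw new IllegalArgumentException("Not a per-table config key: '" + key + "'");
        }
        int nameStart = TABLE_LEVEL_PREFIX.length();
        int lastDot = key.lastIndexOf('.');
        // lastDot <= nameStart covers both a missing table name ("table.target_table")
        // and an empty one ("table..target_table"); the last check rejects an empty suffix.
        if (lastDot <= nameStart || lastDot == key.length() - 1) {
            throw new IllegalArgumentException("Malformed per-table config key: '" + key
                    + "'. Expected format: table.<tableName>.<suffix>");
        }
        String tableName = key.substring(nameStart, lastDot);
        String suffix = key.substring(lastDot + 1);
        if (!ALLOWED_SUFFIXES.contains(suffix)) {
            throw new IllegalArgumentException("Unsupported per-table suffix in key: '" + key + "'");
        }
        return new String[] {tableName, suffix};
    }

    public static void main(String[] args) {
        String[] parsed = parse("table.my.table.target_table");
        System.out.println(parsed[0] + " -> " + parsed[1]);
    }
}
```

With this approach a key such as `table.my.table.target_table` yields the table name `my.table`, while `table.exclude_columns` is still rejected as malformed.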
run buildall
TPC-H: Total hot run time: 26847 ms
TPC-DS: Total hot run time: 169055 ms
/review
Pull request overview
Adds support for mapping upstream CDC source table names to different Doris target table names for streaming jobs (e.g., Postgres), and validates/loads data/DDL against the mapped targets.
Changes:
- Introduce `table.<src>.target_table` per-table config and plumb it through FE table creation and BE/cdc_client stream-load routing.
- Update Postgres Debezium schema-change DDL generation to apply to the mapped Doris table name.
- Add regression coverage for 1:1 mapping and multi-source → single-target merge.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_table_mapping.groovy | New regression test validating table name mapping + merge behavior |
| regression-test/data/job_p0/streaming_job/cdc/test_streaming_postgres_job_table_mapping.out | Expected outputs for the new regression test |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java | Add parsing helper for all target_table mappings from config |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/DebeziumJsonDeserializer.java | Cache mapping parsed during deserializer init and expose resolver |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/PostgresDebeziumJsonDeserializer.java | Emit schema-change DDL against mapped Doris table name |
| fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/service/PipelineCoordinator.java | Route stream-load writes to mapped Doris table name |
| fe/fe-core/src/main/java/org/apache/doris/job/util/StreamingJobUtils.java | Create Doris tables using mapped target table name; return source→command mapping |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java | Use source table names for CDC monitoring/splitting while creating mapped target tables |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/DataSourceConfigValidator.java | Add validation for per-table config key format and per-table suffix allowlist |
| fe/fe-common/src/main/java/org/apache/doris/job/cdc/DataSourceConfigKeys.java | Define per-table config prefix/suffix constants incl. target_table |
Code Review Summary
This PR adds support for mapping upstream (PostgreSQL) table names to custom downstream (Doris) table names in CDC streaming jobs. The overall design is clean and the mapping is correctly applied across FE (table creation, CDC monitoring), cdc_client DML routing (PipelineCoordinator), and DDL generation (PostgresDebeziumJsonDeserializer). Tests cover both basic mapping and multi-table merge scenarios.
Critical Checkpoint Conclusions
1. Goal accomplishment: The PR achieves its stated goal — source-to-target table name mapping works for table creation, DML routing, and schema change DDL generation. Tests prove basic and multi-table merge scenarios.
2. Modification scope: Focused and minimal. Changes are spread across the correct layers (FE config/validation, FE job utils, cdc_client coordinator/deserializer/util).
3. Concurrency: No new concurrency concerns. targetTableMappingsCache in DebeziumJsonDeserializer is populated once in init() and read-only afterward. targetTableMappings in PipelineCoordinator.writeRecords() is local to the method.
4. Lifecycle management: No special lifecycle concerns. The mapping cache is a simple HashMap with no circular references.
5. Configuration items: New config key format table.<srcTable>.target_table is properly validated in DataSourceConfigValidator. No dynamic change concerns (config is set at job creation time).
6. Incompatible changes: Return type of generateCreateTableCmds changed from List<CreateTableCommand> to LinkedHashMap<String, CreateTableCommand>. This is an internal API — no external compatibility concern.
7. Parallel code paths: MySQL deserializer (MySqlDebeziumJsonDeserializer) has a stub handleSchemaChangeEvent() — no mapping needed there yet. When MySQL DDL handling is implemented, it must use resolveTargetTable(). The DML routing path in PipelineCoordinator already handles both MySQL and Postgres uniformly. No missing parallel path updates.
8. Test coverage: Good coverage with basic mapping (snapshot + incremental INSERT/UPDATE/DELETE) and multi-table merge. However, negative test cases for the FE validator are missing (see inline comment). Schema change with mapping is also untested.
9. Observability: No new critical paths requiring additional metrics or logging.
10. Data correctness: See inline comments regarding multi-table merge with schema-incompatible source tables.
Issues Found
See inline comments for details. Summary:
- [Medium] Dead constant: `TABLE_EXCLUDE_COLUMNS_SUFFIX` is defined but not used or permitted by the validator.
- [Medium] Multi-table merge schema safety: When two source tables with different schemas map to the same Doris target, only the first source table's schema is used. The second is silently skipped. This could cause runtime insert failures.
- [Low] Parsing inconsistency: FE validator uses `split(".")` requiring exactly 3 parts, while cdc_client uses prefix/suffix matching. They behave differently for table names containing dots (theoretical concern for PG/MySQL).
- [Low] Missing negative tests: No test for malformed config keys or `exclude_columns` suffix rejection.
    // reject malformed keys like "table.exclude_columns" (missing tableName)
    String[] parts = key.split("\\.", -1);
    if (parts.length != 3 || parts[1].isEmpty()) {
        throw new IllegalArgumentException("Malformed per-table config key: '" + key
[Low] The split("\\.", -1) + parts.length != 3 approach assumes table names never contain dots. While this is typically true for PG/MySQL, note that ConfigUtil.parseAllTargetTableMappings on the cdc_client side uses a prefix/suffix substring approach that correctly handles dots in table names. The two modules use inconsistent parsing strategies for the same key format.
This is a minor issue in practice (PG/MySQL identifiers rarely contain dots), but worth noting as a design inconsistency. Consider aligning both sides to use the same parsing strategy — the prefix/suffix approach is strictly more correct.
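For comparison, the prefix/suffix strategy described above can be sketched as follows. The method name `parseAllTargetTableMappings` comes from the review, but the body, the enclosing class `TargetTableMappingParser`, and the constant values are assumptions for illustration rather than the code in `ConfigUtil.java`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical re-implementation (assumption); not the PR's ConfigUtil.
final class TargetTableMappingParser {
    static final String TABLE_PREFIX = "table.";
    static final String TARGET_TABLE_SUFFIX = ".target_table";

    // Collects source -> target pairs from keys shaped like table.<src>.target_table.
    static Map<String, String> parseAllTargetTableMappings(Map<String, String> config) {
        Map<String, String> mappings = new HashMap<>();
        int minLength = TABLE_PREFIX.length() + TARGET_TABLE_SUFFIX.length();
        for (Map.Entry<String, String> entry : config.entrySet()) {
            String key = entry.getKey();
            // The length check skips keys such as "table.target_table", where prefix and suffix overlap.
            if (!key.startsWith(TABLE_PREFIX) || !key.endsWith(TARGET_TABLE_SUFFIX)
                    || key.length() <= minLength) {
                continue;
            }
            // Everything between prefix and suffix is the source table name, dots included.
            String source = key.substring(TABLE_PREFIX.length(),
                    key.length() - TARGET_TABLE_SUFFIX.length());
            String target = entry.getValue();
            if (target != null && !target.trim().isEmpty()) {
                mappings.put(source, target.trim());
            }
        }
        return mappings;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("table.my.table.target_table", "merged");
        System.out.println(parseAllTargetTableMappings(config));
    }
}
```

Because the source name is taken as a substring rather than a split segment, dotted names survive intact, which is the behavioral difference from the FE validator that the comment points out.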
    createtblCmds.add(createtblCmd);
    // Key: source (PG/MySQL) table name; Value: command that creates the Doris target table
    createtblCmds.put(table, createtblCmd);
    }
[Medium] Multi-table merge schema safety: When two source tables map to the same Doris target table, this LinkedHashMap will contain two entries with different keys but both CreateTableCommand values targeting the same Doris table name. In createTableIfNotExists(), the second entry's CreateTableCommand is silently skipped (because the table already exists after the first entry creates it).
This means the Doris target table is created with only the first source table's schema. If the two source tables have different columns, data from the second source may fail at stream-load time.
Consider either:
- Validating that all source tables mapping to the same target have compatible schemas (at least same column names and compatible types), or
- Documenting this limitation clearly, or
- Merging column definitions (union of columns) when building the `CreateTableCommand` for shared targets.
The current test works because both PG source tables have identical schemas (id int, name varchar(200)).
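The first option (validating shared targets) might look roughly like the sketch below. Everything in it is hypothetical: the class `SharedTargetSchemaCheck`, the representation of a schema as a column-name-to-type map, and the simplification that "compatible" means identical column names and identical type strings. The actual FE code works with `CreateTableCommand` objects, which are not modeled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical check (assumption); schemas are modeled as column name -> type string.
final class SharedTargetSchemaCheck {
    // sourceToTarget: source table -> Doris target table.
    // sourceColumns: source table -> its columns.
    static void validate(Map<String, String> sourceToTarget,
                         Map<String, Map<String, String>> sourceColumns) {
        // Remembers the first source seen for each target; later sources are compared against it.
        Map<String, String> firstSourceByTarget = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : sourceToTarget.entrySet()) {
            String source = entry.getKey();
            String target = entry.getValue();
            String first = firstSourceByTarget.putIfAbsent(target, source);
            if (first == null) {
                continue; // first source for this target defines the schema
            }
            // Map.equals ignores column order, so only names and types must match.
            if (!Objects.equals(sourceColumns.get(first), sourceColumns.get(source))) {
                throw new IllegalArgumentException("Source tables '" + first + "' and '" + source
                        + "' map to target '" + target + "' but have different schemas");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("pg_a", "merged");
        Map<String, Map<String, String>> columns = new LinkedHashMap<>();
        columns.put("pg_a", new LinkedHashMap<>());
        validate(mapping, columns);
        System.out.println("schemas compatible");
    }
}
```

Failing at job creation time in this way would surface the mismatch before any stream load is attempted, at the cost of rejecting merges that a looser type-compatibility rule might allow.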
run buildall
TPC-H: Total hot run time: 26620 ms
TPC-DS: Total hot run time: 166923 ms
What problem does this PR solve?
Summary
Add support for mapping upstream (PostgreSQL) table names to custom downstream (Doris) table names
in CDC streaming jobs. Without this feature, the Doris target table must have the same name as the
upstream source table.
New configuration
Key format: `"table.<srcTable>.target_table" = "<dstTable>"` in the `FROM` clause properties.
When not configured, behavior is unchanged (target table name = source table name).
Key design decisions
distinguish source names (for CDC monitoring) from target names (for DDL) — this fixes a bug
where the CDC split assigner would look up the Doris target table name in PostgreSQL
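The distinction between source names and target names can be illustrated with a small sketch. The class `SourceTargetPlan` and its methods are hypothetical; the PR itself keys a `LinkedHashMap<String, CreateTableCommand>` by source table name, and the plain string map here only stands in for that structure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration (assumption); strings stand in for CreateTableCommand values.
final class SourceTargetPlan {
    // Key: source table name (what CDC monitors). Value: Doris target table name (what DDL creates).
    static Map<String, String> plan(List<String> sourceTables, Map<String, String> properties) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (String source : sourceTables) {
            String target = properties.get("table." + source + ".target_table");
            // Unconfigured tables keep their source name, so existing jobs behave as before.
            plan.put(source, (target == null || target.trim().isEmpty()) ? source : target.trim());
        }
        return plan;
    }

    // The CDC side must only ever see names that exist upstream.
    static List<String> tablesToMonitor(Map<String, String> plan) {
        return new ArrayList<>(plan.keySet());
    }

    // The Doris side creates each distinct target once, even when several sources share it.
    static Set<String> tablesToCreate(Map<String, String> plan) {
        return new LinkedHashSet<>(plan.values());
    }

    public static void main(String[] args) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("table.pg_a.target_table", "merged");
        Map<String, String> plan = plan(List.of("pg_a", "pg_b"), properties);
        System.out.println(tablesToMonitor(plan) + " / " + tablesToCreate(plan));
    }
}
```

Keeping both names in one ordered map means neither side has to re-derive the other, which is how passing a Doris target name to the upstream lookup is avoided.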
Test plan
created with target name, not source name)
incremental)
Release note
None