Skip to content

[Improve](Streamingjob) support exclude_columns for Postgres streaming job #61267

Open
JNSimba wants to merge 9 commits intoapache:masterfrom
JNSimba:pg-col-filter-new
Open

[Improve](Streamingjob) support exclude_columns for Postgres streaming job #61267
JNSimba wants to merge 9 commits intoapache:masterfrom
JNSimba:pg-col-filter-new

Conversation

@JNSimba
Copy link
Member

@JNSimba JNSimba commented Mar 12, 2026

What problem does this PR solve?

Add column-level filtering support for PostgreSQL CDC streaming jobs via the
table.<tableName>.exclude_columns property. Users can specify a comma-separated
list of columns to exclude from synchronization.

Syntax example:

CREATE JOB my_job
  ON STREAMING
  FROM POSTGRES (
    ...
    "include_tables" = "my_table",
    "table.my_table.exclude_columns" = "secret,internal_col"
  )
  TO DATABASE my_db (...)

Changes

FE (validation & table creation)

  • DataSourceConfigKeys: add TABLE and TABLE_EXCLUDE_COLUMNS_SUFFIX constants
  • DataSourceConfigValidator: recognize table..exclude_columns as a valid
    per-table config key (using suffix allowlist)
  • StreamingJobUtils.generateCreateTableCmds(): parse excluded columns, validate
    they exist in the upstream PG table and are not PK columns, then exclude them
    from the Doris CREATE TABLE statement

cdc_client (DML filtering & schema change handling)

  • ConfigUtil: add parseExcludeColumns(config, tableName) utility
  • DebeziumJsonDeserializer: skip excluded fields when building INSERT/UPDATE/DELETE rows
  • PostgresDebeziumJsonDeserializer: skip DROP/ADD DDL for excluded columns during
    schema change detection, so the Doris table is never modified for columns it
    was never meant to have

Behavior

Scenario Behavior
Snapshot / incremental DML Excluded column values are not written to Doris
PG DROP excluded column DDL skipped; stored schema updated; sync continues
PG ADD excluded column back DDL skipped; sync continues; Doris never gains the column
Exclude non-existent column CREATE JOB fails with clear error
Exclude PK column CREATE JOB fails with clear error

Tests

  • test_streaming_postgres_job_col_filter.groovy: covers validation errors,
    snapshot filtering, incremental DML filtering, DROP excluded column, re-ADD
    excluded column; uses Awaitility polling instead of fixed sleeps

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba JNSimba requested a review from Copilot March 12, 2026 09:37
@JNSimba
Copy link
Member Author

JNSimba commented Mar 12, 2026

/review

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds column-level filtering for PostgreSQL CDC streaming jobs via per-table property table.<tableName>.exclude_columns, ensuring excluded columns are omitted during Doris table creation, DML ingestion, and Postgres schema-change handling.

Changes:

  • FE: validate/parse table.<tableName>.exclude_columns, reject non-existent columns and PK columns, and omit excluded columns from generated CREATE TABLE.
  • cdc_client: parse exclude_columns and filter excluded fields from Debezium DML rows; skip ADD/DROP DDL for excluded columns during PG schema change detection.
  • Regression: add a Postgres streaming job suite covering validation, snapshot/incremental DML filtering, and DROP/ADD excluded-column scenarios.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_col_filter.groovy New regression suite validating exclude-column behavior end-to-end
regression-test/data/job_p0/streaming_job/cdc/test_streaming_postgres_job_col_filter.out Expected outputs for the new regression suite
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java Adds parsing utility for table.<table>.exclude_columns
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/DebeziumJsonDeserializer.java Filters excluded columns out of DML JSON rows
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/PostgresDebeziumJsonDeserializer.java Skips schema-change DDL for excluded columns
fe/fe-core/src/main/java/org/apache/doris/job/util/StreamingJobUtils.java Applies exclude-column validation and omission during Doris table creation
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/DataSourceConfigValidator.java Allows per-table config keys by suffix for source validation
fe/fe-common/src/main/java/org/apache/doris/job/cdc/DataSourceConfigKeys.java Introduces constants for per-table config key construction

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR adds column-level filtering support (exclude_columns) for PostgreSQL CDC streaming jobs. The feature is well-structured across FE validation, CDC client DML filtering, and schema change handling. The regression test is comprehensive, covering snapshot, incremental DML, DROP and re-ADD of excluded columns.

Critical Checkpoint Conclusions

Goal & Correctness: The PR accomplishes its stated goal. FE validates excluded columns exist and are not PK columns, the Doris CREATE TABLE omits them, the CDC client skips excluded columns during DML deserialization, and schema change DDLs are suppressed for excluded columns. Tests prove the key scenarios.

Modification focus: The change is focused and well-scoped to the exclude_columns feature.

Concurrency: No new concurrency concerns; the deserializer operates single-threaded per task, and the config map is read-only.

Lifecycle / Static init: No lifecycle issues or SIOF risks introduced.

Configuration items added: Two new constants (TABLE, TABLE_EXCLUDE_COLUMNS_SUFFIX) and the per-table config key pattern table.<name>.exclude_columns. See issues below regarding validation gaps.

Incompatible changes: None. This is purely additive.

Parallel code paths: The MySQL CDC deserializer (MySqlDebeziumJsonDeserializer) does not have DDL handling yet (it's a stub), so no parallel path needs updating now. The base class DebeziumJsonDeserializer already applies DML filtering for all connectors.

Test coverage: Good end-to-end coverage: validation errors, snapshot filtering, incremental DML (INSERT/UPDATE/DELETE), DROP excluded column, re-ADD excluded column. Uses Awaitility polling instead of fixed sleeps.

Observability: Good INFO-level logging for schema change skip decisions.

Performance: See inline comment about per-record re-parsing.

Issues Found

See inline comments for details:

  1. [Performance] ConfigUtil.parseExcludeColumns() is called on every DML record in the hot path, re-parsing the comma-separated string and creating a new HashSet each time. Should be cached per table name in the deserializer.

  2. [Validation Gap] DataSourceConfigValidator.validateSource() accepts malformed keys like table.exclude_columns (missing table name segment) because it only checks the suffix after the last dot. Should verify at least 3 dot-separated segments.

  3. [Incorrect Javadoc] ConfigUtil.parseExcludeColumns() Javadoc says @return lower-cased column name set but the implementation does NOT lowercase column names.

  4. [Code Duplication] parseExcludeColumns is duplicated identically in StreamingJobUtils (FE) and ConfigUtil (CDC client). Both modules depend on fe-common where DataSourceConfigKeys lives -- this method should be consolidated there.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run buildall

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 3.33% (1/30) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run buildall

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 27553 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b66469cf60206e57c5110ccd872ae8e884d8d8e0, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17622	4483	4285	4285
q2	q3	10654	761	503	503
q4	4670	366	245	245
q5	7555	1197	1020	1020
q6	178	172	146	146
q7	785	832	666	666
q8	9287	1420	1319	1319
q9	4812	4723	4672	4672
q10	6326	1894	1629	1629
q11	473	264	229	229
q12	732	573	471	471
q13	18051	2920	2184	2184
q14	233	230	226	226
q15	944	800	820	800
q16	761	723	710	710
q17	711	846	424	424
q18	5931	5361	5312	5312
q19	1141	975	597	597
q20	502	521	386	386
q21	4522	1984	1480	1480
q22	380	313	249	249
Total cold run time: 96270 ms
Total hot run time: 27553 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4680	4632	4563	4563
q2	q3	3945	4390	3909	3909
q4	879	1183	769	769
q5	4032	4373	4392	4373
q6	186	182	178	178
q7	1790	1658	1523	1523
q8	2506	2676	2525	2525
q9	7608	7491	7591	7491
q10	3734	3985	3602	3602
q11	504	432	409	409
q12	478	595	461	461
q13	2827	3191	2336	2336
q14	279	292	285	285
q15	846	883	788	788
q16	712	759	708	708
q17	1176	1473	1518	1473
q18	7229	6885	6782	6782
q19	846	840	902	840
q20	2081	2173	1984	1984
q21	3932	3444	3402	3402
q22	464	451	404	404
Total cold run time: 50734 ms
Total hot run time: 48805 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 153522 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b66469cf60206e57c5110ccd872ae8e884d8d8e0, data reload: false

query5	4326	653	534	534
query6	333	227	217	217
query7	4201	466	271	271
query8	345	266	230	230
query9	8672	2776	2756	2756
query10	523	389	338	338
query11	7445	5894	5606	5606
query12	179	133	150	133
query13	1250	478	340	340
query14	5678	3785	3590	3590
query14_1	2798	2791	2818	2791
query15	201	192	174	174
query16	982	466	454	454
query17	1080	705	595	595
query18	2428	438	335	335
query19	222	207	180	180
query20	136	132	133	132
query21	227	141	119	119
query22	4867	4798	4949	4798
query23	15896	15648	15363	15363
query23_1	15371	16288	15903	15903
query24	8187	1794	1290	1290
query24_1	1278	1275	1341	1275
query25	611	539	459	459
query26	1288	280	158	158
query27	3014	514	305	305
query28	5167	1957	1984	1957
query29	911	654	570	570
query30	357	258	233	233
query31	1453	1380	1264	1264
query32	95	83	85	83
query33	607	353	372	353
query34	1093	968	635	635
query35	655	677	615	615
query36	1137	1109	1012	1012
query37	144	95	89	89
query38	2984	2936	2931	2931
query39	955	877	847	847
query39_1	833	845	833	833
query40	235	174	141	141
query41	71	65	63	63
query42	314	343	296	296
query43	240	267	223	223
query44	
query45	197	189	182	182
query46	872	986	610	610
query47	2144	2140	2077	2077
query48	321	319	234	234
query49	629	484	391	391
query50	702	281	221	221
query51	4195	4150	4136	4136
query52	296	295	283	283
query53	297	337	283	283
query54	299	271	256	256
query55	96	92	91	91
query56	332	319	305	305
query57	1372	1319	1286	1286
query58	308	280	273	273
query59	1356	1450	1329	1329
query60	344	329	327	327
query61	153	146	140	140
query62	635	588	527	527
query63	308	280	277	277
query64	5051	1318	993	993
query65	
query66	1463	448	354	354
query67	16418	16526	16328	16328
query68	
query69	399	311	288	288
query70	1014	981	973	973
query71	346	312	306	306
query72	2759	2783	2429	2429
query73	541	560	324	324
query74	10010	9905	9733	9733
query75	2848	2756	2487	2487
query76	2295	1047	674	674
query77	370	392	318	318
query78	11137	11368	10650	10650
query79	1163	828	593	593
query80	689	634	555	555
query81	496	281	247	247
query82	1330	148	118	118
query83	372	261	241	241
query84	254	128	99	99
query85	838	510	441	441
query86	376	337	302	302
query87	3165	3135	2995	2995
query88	3560	2668	2670	2668
query89	438	381	345	345
query90	1956	172	171	171
query91	166	164	140	140
query92	77	75	70	70
query93	889	819	499	499
query94	451	329	294	294
query95	587	341	325	325
query96	639	515	232	232
query97	2463	2501	2430	2430
query98	240	228	224	224
query99	1035	987	873	873
Total cold run time: 234283 ms
Total hot run time: 153522 ms

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 3.33% (1/30) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run external

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 26872 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6f587f016a7761d90d114163fe9421a0ca2d4b79, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17655	4412	4303	4303
q2	q3	10658	798	522	522
q4	4668	374	256	256
q5	7538	1188	1021	1021
q6	180	173	145	145
q7	784	858	669	669
q8	9473	1430	1367	1367
q9	4860	4763	4705	4705
q10	6335	1903	1687	1687
q11	481	273	257	257
q12	758	564	464	464
q13	18079	2943	2163	2163
q14	231	230	223	223
q15	q16	744	744	672	672
q17	707	797	471	471
q18	5843	5304	5241	5241
q19	1119	963	612	612
q20	535	487	390	390
q21	4782	1831	1433	1433
q22	534	361	271	271
Total cold run time: 95964 ms
Total hot run time: 26872 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4712	4531	4572	4531
q2	q3	3877	4401	3887	3887
q4	903	1210	814	814
q5	4101	4431	4323	4323
q6	178	180	136	136
q7	1735	1670	1539	1539
q8	2503	2691	2608	2608
q9	7745	7384	7437	7384
q10	3752	3954	3721	3721
q11	544	463	428	428
q12	512	589	471	471
q13	2867	3159	2281	2281
q14	288	300	274	274
q15	q16	713	776	722	722
q17	1148	1366	1358	1358
q18	7184	6749	6619	6619
q19	882	882	917	882
q20	2077	2147	2005	2005
q21	3939	3561	3340	3340
q22	458	506	387	387
Total cold run time: 50118 ms
Total hot run time: 47710 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 168537 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6f587f016a7761d90d114163fe9421a0ca2d4b79, data reload: false

query5	4321	648	540	540
query6	330	247	214	214
query7	4209	471	269	269
query8	354	255	243	243
query9	8707	2790	2755	2755
query10	507	377	340	340
query11	7015	5104	4884	4884
query12	184	127	129	127
query13	1282	469	363	363
query14	5748	3759	3462	3462
query14_1	2847	2823	2869	2823
query15	214	194	173	173
query16	973	491	461	461
query17	1114	725	622	622
query18	2457	457	356	356
query19	228	211	201	201
query20	134	132	130	130
query21	216	132	112	112
query22	13251	14144	14947	14144
query23	16130	15789	15604	15604
query23_1	15639	15666	16285	15666
query24	7248	1630	1221	1221
query24_1	1226	1236	1225	1225
query25	587	505	498	498
query26	1232	266	152	152
query27	2756	498	299	299
query28	4939	1861	1825	1825
query29	812	564	485	485
query30	311	243	200	200
query31	1043	956	870	870
query32	85	71	74	71
query33	521	347	281	281
query34	906	897	553	553
query35	640	706	616	616
query36	1128	1134	1031	1031
query37	141	101	87	87
query38	3031	2911	2863	2863
query39	858	832	810	810
query39_1	791	786	801	786
query40	225	154	135	135
query41	64	59	60	59
query42	256	256	254	254
query43	237	264	230	230
query44	
query45	233	195	180	180
query46	901	974	599	599
query47	2137	2160	2036	2036
query48	326	332	234	234
query49	631	458	375	375
query50	683	285	212	212
query51	4101	4049	4064	4049
query52	265	267	253	253
query53	288	339	279	279
query54	295	271	258	258
query55	100	99	83	83
query56	317	319	296	296
query57	1929	1727	1693	1693
query58	284	276	276	276
query59	2814	2958	2733	2733
query60	340	342	320	320
query61	150	144	139	139
query62	615	590	536	536
query63	320	284	276	276
query64	5022	1256	977	977
query65	
query66	1462	462	349	349
query67	24295	24256	24203	24203
query68	
query69	386	312	281	281
query70	895	910	957	910
query71	350	309	300	300
query72	2800	2617	2360	2360
query73	544	558	324	324
query74	9604	9613	9374	9374
query75	2863	2769	2456	2456
query76	2278	1041	698	698
query77	369	380	308	308
query78	10877	11155	10463	10463
query79	2182	726	567	567
query80	1681	635	549	549
query81	549	265	226	226
query82	1002	152	113	113
query83	335	269	248	248
query84	250	113	93	93
query85	903	499	433	433
query86	410	298	331	298
query87	3175	3213	3011	3011
query88	3607	2678	2661	2661
query89	438	376	349	349
query90	2040	182	176	176
query91	168	159	136	136
query92	75	72	74	72
query93	1024	868	495	495
query94	626	327	285	285
query95	579	388	307	307
query96	645	524	232	232
query97	2458	2498	2398	2398
query98	241	227	221	221
query99	1004	1000	924	924
Total cold run time: 251048 ms
Total hot run time: 168537 ms

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run p0

@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run cloud_p0

1 similar comment
@JNSimba
Copy link
Member Author

JNSimba commented Mar 13, 2026

run cloud_p0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants