[SPARK-56870][SDP] Implement SCD1 Batch Processor; Extend Microbatch with CDC Metadata by AnishMahto · Pull Request #55970 · apache/spark

AnishMahto · 2026-05-19T02:32:53Z

Approved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7

This is a stacked PR. Review incremental diff here: AnishMahto/spark@SPARK-56856-SCD1-microbatch-deduplication...SPARK-56870-extend-microbatch-with-cdc-metadata

Preamble:

The SCD type 1 flow is a foreachBatch streaming query on an input change-data-feed, and is responsible for reconciling the incoming change data onto some target table that follows SCD1 replication semantics.

SCD1 flows also maintain an "auxiliary" table that tracks state for events that arrived early or out of order. Each microbatch must reconcile against this auxiliary table as well, and update its state appropriately for future microbatches.

Extend Microbatch with CDC Metadata:

After deduplication, every incoming row can be classified as either a delete event or an upsert event (the two are mutually exclusive), and there is at most one row per key.

If we identify a row as a delete event, we record its sequencing value as its deleteSequence. If we identify a row as an upsert event, we record its sequencing value as its upsertSequence. That is, deleteSequence/upsertSequence encode both the sequencing for the row and the row classification (delete or upsert).

We need to persist this encoded information now, because in future stages we may drop the columns that deleteCondition needed to do the classification in the first place, depending on which columns were selected by ChangeArgs.columnSelection.

Where is the CDC Metadata stored?

Within the microbatch, we append a _cdc_metadata struct column that stores the deleteSequence and upsertSequence.
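As a rough illustration of the encoding described above, the following plain-Python sketch tags each deduplicated row with a `_cdc_metadata` value in which exactly one of `deleteSequence` / `upsertSequence` is set. It is not the PR's Spark implementation: the helper name `with_cdc_metadata`, the dict-based row model, and the sample `op` / `seq` columns are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]


def with_cdc_metadata(
    row: Row,
    sequence_col: str,
    delete_condition: Optional[Callable[[Row], bool]] = None,
) -> Row:
    """Return a copy of `row` with a `_cdc_metadata` entry appended.

    Hypothetical helper: rows are modeled as plain dicts, and the delete
    condition as a Python predicate rather than a Spark Column expression.
    """
    seq = row[sequence_col]
    # With no delete condition configured, every row is treated as an upsert.
    is_delete = delete_condition is not None and bool(delete_condition(row))
    metadata = {
        # Exactly one of the two fields is populated, so the struct encodes
        # both the sequencing value and the delete/upsert classification.
        "deleteSequence": seq if is_delete else None,
        "upsertSequence": None if is_delete else seq,
    }
    return {**row, "_cdc_metadata": metadata}


# Sample deduplicated microbatch (at most one row per key `id`).
batch = [
    {"id": 1, "op": "DELETE", "seq": 7},
    {"id": 2, "op": "UPDATE", "seq": 9},
]
tagged = [
    with_cdc_metadata(r, "seq", lambda r: r["op"] == "DELETE") for r in batch
]
```

Because the classification is captured in the metadata itself, later stages no longer need the `op` column (or whatever columns the delete condition read) to tell deletes from upserts.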

This _cdc_metadata column will eventually also land in the persisted target and auxiliary tables, which are the artifacts of an AutoCDC flow. This column represents operational metadata that the AutoCDC flow has tagged a row with, and is necessary for out-of-order correctness of the SCD decomposition.

Users will not be able to opt out of persisting this column in the target table using ChangeArgs.columnSelection, as it is necessary for correctness. The column will not have a stable public contract, and users should make no assumptions about its contents.
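The "no opt-out" rule can be pictured with a small plain-Python sketch. How ChangeArgs.columnSelection is actually represented is not shown in this description, so modeling it as a list of column names, and the helper name `apply_column_selection`, are assumptions for illustration only.

```python
from typing import Any, Dict, List

CDC_METADATA_COL = "_cdc_metadata"


def apply_column_selection(row: Dict[str, Any], selected: List[str]) -> Dict[str, Any]:
    """Project `row` onto the selected columns, always keeping the metadata.

    Hypothetical helper: the user selection is modeled as a list of names.
    """
    projected = {name: row[name] for name in selected if name != CDC_METADATA_COL}
    # The metadata column is carried through regardless of the selection,
    # since out-of-order correctness depends on it.
    projected[CDC_METADATA_COL] = row[CDC_METADATA_COL]
    return projected


row = {
    "id": 1,
    "name": "a",
    "op": "DELETE",
    "_cdc_metadata": {"deleteSequence": 7, "upsertSequence": None},
}
# The user drops `op`, the column the delete condition originally read.
selected_row = apply_column_selection(row, ["id", "name"])
```

Here `op` is gone from the projected row, but the delete classification survives inside `_cdc_metadata`.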

szehon-ho

Review of the incremental diff on top of #55969 (extend microbatch with CDC metadata). Overall this looks good to merge with minor nits.

What looks good

The delete/upsert encoding in _cdc_metadata matches the SPIP story: mutually exclusive deleteSequence / upsertSequence, persisted before columnSelection can drop deleteCondition columns.
resolvedSequencingType at processor construction is the right split (flow setup vs per-microbatch work); the Int→Long cast test and incompatible cast test are valuable.
Reserved-column conflict uses conf.resolver and CaseSensitivityLabels — consistent with session case sensitivity.
constructCdcMetadataCol driven off cdcMetadataColSchema with ordered fields is clean; companion constants keep tests readable.
AUTOCDC_RESERVED_COLUMN_NAME_CONFLICT / SQLSTATE 42710 is appropriate.
Test coverage for classification, no delete condition, column ordering, cast success/failure, and reserved-name conflict is solid.
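To make the reserved-column point concrete, here is a plain-Python sketch of a name-conflict check that honors case sensitivity. It only mimics the idea of a resolver (a name-equality function chosen by a case-sensitivity setting); the function names, the exception class, and the message format are hypothetical, and only the error condition name and SQLSTATE are taken from the review above.

```python
from typing import Callable, List

RESERVED_COLUMN = "_cdc_metadata"


class ReservedColumnNameConflict(ValueError):
    """Hypothetical stand-in for AUTOCDC_RESERVED_COLUMN_NAME_CONFLICT."""


def make_resolver(case_sensitive: bool) -> Callable[[str, str], bool]:
    # A resolver is modeled here as a function deciding whether two column
    # names refer to the same column under the session's case sensitivity.
    if case_sensitive:
        return lambda a, b: a == b
    return lambda a, b: a.lower() == b.lower()


def find_reserved_conflicts(columns: List[str], case_sensitive: bool) -> List[str]:
    resolver = make_resolver(case_sensitive)
    return [c for c in columns if resolver(c, RESERVED_COLUMN)]


def validate_no_reserved_conflict(columns: List[str], case_sensitive: bool) -> None:
    conflicts = find_reserved_conflicts(columns, case_sensitive)
    if conflicts:
        # Message format is illustrative, not the real error template.
        raise ReservedColumnNameConflict(
            f"[AUTOCDC_RESERVED_COLUMN_NAME_CONFLICT] SQLSTATE 42710: {conflicts}"
        )
```

Under this sketch, an input column named `_CDC_METADATA` conflicts only when the check is case-insensitive, which is the behavior a resolver-based comparison is meant to capture.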

Incremental diff is focused and stacks cleanly on #55836 + #55969.

AnishMahto added 10 commits May 12, 2026 21:02

Introduce ChangeArgs

8b08cbe

linting

202f3a5

reorder error condition

4ac75e7

PR feedback

11606c5

linting

d1a38e6

PR feedback

bbe5335

buff error message and revert to case class

95ca0e1

test UnqualifiedColumnName('col')

481ca9f

minor test buff

0126659

address PR feedbak

ac15be5

AnishMahto force-pushed the SPARK-56870-extend-microbatch-with-cdc-metadata branch from 552e33c to 9a566ff on May 19, 2026 17:10

AnishMahto changed the title from "[SPARK-56870][SDP] Extend Microbatch with CDC Metadata" to "[SPARK-56870][SDP] SCD1 Extend Microbatch with CDC Metadata" on May 19, 2026

AnishMahto changed the title from "[SPARK-56870][SDP] SCD1 Extend Microbatch with CDC Metadata" to "[SPARK-56870][SDP] Implement SCD1 Batch Processor; Extend Microbatch with CDC Metadata" on May 19, 2026

AnishMahto added 7 commits May 19, 2026 18:10

PR feedback

436ff0a

Implement deduplicateMicrobatch

875f0b1

indenting cleanup

08ea9f4

schema comment

cf3ec82

casing

21d4ffe

linting

2ff07f4

PR feedback

76d775d

szehon-ho reviewed May 19, 2026


AnishMahto added 8 commits May 19, 2026 21:49

use reserved __spark_autocdc* prefix

8790a2d

Add deduplicate test when row contains nested columns

5c0c0f8

validation

1a640d1

buff scaladoc

88e9c1d

use spark resolver

fd631ad

lingint

2a886af

rebase conflict

e8df6f0

PR feedback

f9c2aed

AnishMahto force-pushed the SPARK-56870-extend-microbatch-with-cdc-metadata branch from 9a566ff to f9c2aed on May 19, 2026 23:06

AnishMahto wants to merge 25 commits into apache:master from AnishMahto:SPARK-56870-extend-microbatch-with-cdc-metadata
