[SPARK-55949][SQL][CONNECT] Add DataFrame API and Spark Connect support for CDC queries #54739
gengliangwang wants to merge 12 commits into apache:master
Conversation
...nect/common/src/main/scala/org/apache/spark/sql/connect/ConnectClientUnsupportedErrors.scala
```scala
/** @inheritdoc */
def changes(tableName: String): DataFrame = {
  assertNoSpecifiedSchema("changes")
```
Should we check if source is null?
Let's get #54738 in first.
…C queries

Add `DataFrameReader.changes()` and `DataStreamReader.changes()` APIs:
- API definitions in sql/api (DataFrameReader, DataStreamReader)
- Classic implementation using ChangelogInfoUtils + RelationChanges
- Spark Connect: ReadChanges proto message, client serialization, server-side planner deserialization
- Remove ConnectClientUnsupportedErrors.tableCDC() (no longer needed)
- DataFrame API tests in ChangelogResolutionSuite and ChangelogEndToEndSuite
- Spark Connect PlanGenerationTestSuite golden files
cc @johanl-db as well
```scala
 * }}}
 *
 * @param tableName
 *   is either a qualified or unqualified name that designates a table.
```
Minor: can we shorten it to fit on one line?
Every `@param` in these files (and the project) uses the multi-line format:

```
@param name
  description
```

This is enforced by scalafmt 3.8.6 with `docstrings.style = Asterisk`. This style automatically breaks `@param` tags so the description starts on the next line with a 2-space indent, regardless of line length.
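For reference, a minimal `.scalafmt.conf` fragment matching the setup described above (only `version` and `docstrings.style` are taken from the comment; a real config would carry additional keys alongside them):

```hocon
# .scalafmt.conf -- Scaladoc formatting relevant to the @param style
version = "3.8.6"
docstrings.style = Asterisk  # align leading '*'; @param description moves to the next line
```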
```json
      "endingversion": "5"
    }
  }
}
```
(no newline at end of file)
Do we need an empty line at the end?
There's no trailing empty line — the file ends with } (no trailing newline), which is consistent with all other JSON files in the query-tests/queries/ directory.
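As a quick way to check the "no trailing newline" property, a small self-contained Scala sketch (the file name and JSON content are illustrative, not the actual golden file):

```scala
import java.nio.file.Files

// Write a golden-style JSON file without a trailing newline, then inspect its last byte.
val tmp = Files.createTempFile("golden", ".json")
Files.write(tmp, "{\n  \"endingversion\": \"5\"\n}".getBytes("UTF-8"))

val bytes = Files.readAllBytes(tmp)
// Git renders "No newline at end of file" exactly when the last byte is not '\n'.
assert(bytes.nonEmpty && bytes.last != '\n')
Files.delete(tmp)
```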
```json
      "computeupdates": "true"
    }
  }
}
```
(no newline at end of file)
Same as above — no trailing empty line, consistent with the other JSON files.
```scala
sparkSession.newDataFrame { builder =>
  val changesBuilder = builder.getRelationChangesBuilder
    .setUnparsedIdentifier(tableName)
  extraOptions.foreach { case (k, v) =>
```
Should we use putAllOptions here to be consistent with table method above?
Good catch. Updated to use putAllOptions to be consistent with the table method.
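To illustrate the review point, a hedged sketch of the two option-copying styles, using plain `java.util` maps as a stand-in for the generated `RelationChanges` builder (protobuf-java map fields typically generate both per-entry `putOptions(k, v)` and bulk `putAllOptions(javaMap)` methods):

```scala
import java.util.{HashMap => JHashMap}
import scala.jdk.CollectionConverters._

val extraOptions = Map("startingVersion" -> "1", "endingVersion" -> "5")

// Per-entry style, as in the original diff (one put call per pair):
val perEntry = new JHashMap[String, String]()
extraOptions.foreach { case (k, v) => perEntry.put(k, v) }

// Bulk style, analogous to putAllOptions(extraOptions.asJava):
val bulk = new JHashMap[String, String]()
bulk.putAll(extraOptions.asJava)

// Both produce the same map; the bulk call is simply the more uniform idiom.
assert(perEntry == bulk)
```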
```scala
val changesBuilder = builder.getRelationChangesBuilder
  .setUnparsedIdentifier(tableName)
  .setIsStreaming(true)
sourceBuilder.getOptionsMap.forEach { (k, v) =>
```
Same question about putAllOptions?
```scala
val tableName = rel.getUnparsedIdentifier
val options = new CaseInsensitiveStringMap(rel.getOptionsMap)
val changelogInfo = ChangelogInfoUtils.fromOptions(
  options, session.sessionState.conf.sessionLocalTimeZone)
```
Optional: I would define a temp var for the time zone and put `UnresolvedRelation` on one line:

```scala
val timeZone = session.sessionState.conf.sessionLocalTimeZone
val changelogInfo = ChangelogInfoUtils.fromOptions(options, timeZone)
val ident = parser.parseMultipartIdentifier(tableName)
val relation = UnresolvedRelation(ident, options, isStreaming = rel.getIsStreaming)
```
Done. Extracted timeZone var and put UnresolvedRelation on one line.
sql/core/src/test/scala/org/apache/spark/sql/connector/ChangelogResolutionSuite.scala
@aokolnychyi @HyukjinKwon @viirya @zhengruifeng Thanks for the review. Merging to master.
…rt for CDC queries

### What changes were proposed in this pull request?

This is PR 2/2 of the CDC framework. It adds the DataFrame API (`changes()`) and Spark Connect support. Depends on apache#54738.

**DataFrame API — new methods on `DataFrameReader` and `DataStreamReader`:**

```scala
// Batch
spark.read
  .option("startingVersion", "1")
  .option("endingVersion", "5")
  .option("deduplicationMode", "netChanges")
  .changes("catalog.table")

// Streaming
spark.readStream
  .option("startingVersion", "1")
  .changes("catalog.table")
```

- API definitions in `sql/api` (`DataFrameReader.changes()`, `DataStreamReader.changes()`). Currently only supported for DSv2 tables whose catalog implements `TableCatalog.loadChangelog()`.
- Classic implementation in `sql/core`: parses options via `ChangelogInfoUtils.fromOptions()` using the session timezone, creates `UnresolvedRelation` + `RelationChanges`.

**Spark Connect support:**

- Proto (`relations.proto`): new `RelationChanges` message (field 46) with `unparsed_identifier`, `options` map, and `is_streaming` flag.
- Client (`DataFrameReader.scala`, `DataStreamReader.scala`): serializes `changes()` calls into the `RelationChanges` proto.
- Server (`SparkConnectPlanner.scala`): `transformRelationChanges()` deserializes the proto, calls `ChangelogInfoUtils.fromOptions()` with the session timezone, and constructs `UnresolvedRelation` + `RelationChanges`.

### Why are the changes needed?

Users need a programmatic API to query CDC data, complementing the SQL `CHANGES` clause added in apache#54738. The DataFrame API provides type-safe, composable access to change data with the standard option-based configuration pattern. Spark Connect support ensures feature parity for remote Spark sessions.

### Does this PR introduce _any_ user-facing change?

Yes. New DataFrame API methods:

```scala
// Before: spark.read.changes("table") throws SparkUnsupportedOperationException

// After:
val changes = spark.read
  .option("startingVersion", "1")
  .changes("my_catalog.my_table")
changes.filter("_change_type = 'insert'").show()
```

Both `DataFrameReader.changes()` and `DataStreamReader.changes()` are now functional in classic and Spark Connect sessions.

### How was this patch tested?

- **`ChangelogResolutionSuite`** — 5 tests: `changes()` resolution, catalog without CDC support, schema rejection on `readStream`, streaming resolution to `StreamingRelationV2`, streaming catalog error.
- **`ChangelogEndToEndSuite`** — 19 end-to-end tests using `InMemoryChangelogCatalog`, each pairing the DataFrame API with the equivalent SQL syntax:
  - Basic retrieval (3 tests): data retrieval, open-ended version range, empty results with schema validation.
  - Projection/filter/aggregation (4 tests): CDC metadata column selection, projection with filter, aggregation on change types, schema verification.
  - Version range filtering (1 test): inclusive range with boundary rows.
  - Bound inclusivity (4 tests): default inclusive bounds (with explicit `INCLUSIVE` keyword), `startingBoundInclusive=false` (SQL: `EXCLUSIVE`), `endingBoundInclusive=false`, both bounds exclusive.
  - CDC options (3 tests): default `dropCarryovers` mode, `deduplicationMode=none`, `netChanges` with `computeUpdates`.
  - Timestamp range (1 test): `startingTimestamp`/`endingTimestamp` (SQL: `FROM TIMESTAMP ... TO TIMESTAMP ...`).
  - Error cases (1 test): user-specified schema rejection.
  - Streaming (4 tests): basic retrieval, `startingVersion` filtering, projection with filter, CDC options pass-through — all with both DataFrame API and SQL (`SELECT * FROM STREAM table CHANGES ...`) variants.
  - Streaming error cases (1 test): schema rejection.
- **`PlanGenerationTestSuite`** — 3 golden files: `read_changes`, `read_changes_with_options`, `streaming_changes_API_with_options`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

Closes apache#54739 from gengliangwang/cdc-dataframe.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?

Add a `changes()` method to PySpark `DataFrameReader` and `DataStreamReader` for both classic and Spark Connect modes.

**Classic PySpark:**

- `DataFrameReader.changes(tableName)` — delegates to `self._jreader.changes(tableName)`
- `DataStreamReader.changes(tableName)` — delegates to `self._jreader.changes(tableName)` with type checking

**Spark Connect PySpark:**

- New `RelationChanges` plan class in `plan.py` that serializes to the `RelationChanges` protobuf message
- `DataFrameReader.changes(tableName)` — creates a `RelationChanges` plan (batch)
- `DataStreamReader.changes(tableName)` — creates a `RelationChanges` plan with `is_streaming=True`

### Why are the changes needed?

To expose the CDC `changes()` API added in #54739 to Python users.

### Does this PR introduce _any_ user-facing change?

Yes. PySpark users can now use:

```python
# Batch
df = spark.read.option("startingVersion", "1").changes("my_table")

# Streaming
df = spark.readStream.option("startingVersion", "1").changes("my_table")
```

### How was this patch tested?

7 plan generation tests in `test_connect_plan.py` covering:
- Batch read with version/timestamp options
- No-options and multi-part table names
- Proto `oneof` discriminator verification
- Streaming via direct plan and via `DataStreamReader`
- `print()` debug output

### Was this patch authored or co-authored using generative AI tooling?

Yes.

Closes #54746 from gengliangwang/cdc-pyspark.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>