Commit
[SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility

### What changes were proposed in this pull request?

This PR proposes to fix the incorrect value schema in streaming deduplication. The operator stores an empty row having a single column with null (using NullType), but the value schema is specified as all columns, which leads to incorrect behavior in the state store schema compatibility checker.

This PR proposes to set the schema of the value as `StructType(Array(StructField("__dummy__", NullType)))` to fit the empty row. With this change, streaming queries that create their checkpoint after this fix work smoothly.

To avoid breaking existing streaming queries that have the incorrect value schema, this PR proposes to disable the check for the value schema on streaming deduplication. Disabling the value check was already in place for the format validation (we have two different checkers for state store), but it was missing for the state store schema compatibility check. To avoid adding more config, this PR leverages the existing config that "format validation" is using.

### Why are the changes needed?

This is a bug fix. Suppose the streaming query below:

```
// df has the columns `a`, `b`, `c`
val df = spark.readStream.format("...").load()
val query = df.dropDuplicates("a").writeStream.format("...").start()
```

While the query is running, df can produce a different set of columns (e.g. `a`, `b`, `c`, `d`) from the same source due to schema evolution. Since we only deduplicate the rows on column `a`, the change of schema should not matter for streaming deduplication, but before this fix the state store schema checker throws an error saying "value schema is not compatible".

### Does this PR introduce _any_ user-facing change?

No, this is basically a bug fix which end users wouldn't notice unless they encountered the bug.

### How was this patch tested?

New tests.

Closes #37041 from HeartSaVioR/SPARK-39650.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit fe53603)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
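
The reasoning in the commit message can be illustrated with a toy model. The sketch below is plain Scala and does not use Spark: `Field`, `StateSchema`, `isCompatible`, and `keyOf` are hypothetical stand-ins introduced here, and the exact-equality check is a deliberate simplification of whatever the real state store schema compatibility checker does. It only shows why recording all input columns as the value schema fails once a column `d` appears, why a fixed `__dummy__` value schema does not, and why skipping the value check keeps old checkpoints usable.

```scala
// Toy model only: every name here is a hypothetical stand-in, not a Spark class or API.
object DedupSchemaSketch {
  final case class Field(name: String, dataType: String)
  final case class StateSchema(key: Seq[Field], value: Seq[Field])

  // Simplified stand-in for a compatibility check: schemas must match exactly.
  // The value schema is compared only when checkValue is true.
  def isCompatible(stored: StateSchema, current: StateSchema, checkValue: Boolean): Boolean =
    stored.key == current.key && (!checkValue || stored.value == current.value)

  // Key schema for deduplication: only the columns being deduplicated on.
  def keyOf(input: Seq[Field], dedupCols: Set[String]): Seq[Field] =
    input.filter(f => dedupCols.contains(f.name))

  // Value schema after the fix: a single null-typed placeholder column.
  val dummyValue: Seq[Field] = Seq(Field("__dummy__", "null"))

  def main(args: Array[String]): Unit = {
    val before = Seq(Field("a", "int"), Field("b", "string"), Field("c", "string"))
    val after = before :+ Field("d", "string") // schema evolution adds column d
    val cols = Set("a")

    // Before the fix: the value schema was recorded as all input columns.
    val oldStored = StateSchema(keyOf(before, cols), before)
    val oldCurrent = StateSchema(keyOf(after, cols), after)
    println(isCompatible(oldStored, oldCurrent, checkValue = true)) // false

    // After the fix: the value schema no longer depends on the input columns.
    val newStored = StateSchema(keyOf(before, cols), dummyValue)
    val newCurrent = StateSchema(keyOf(after, cols), dummyValue)
    println(isCompatible(newStored, newCurrent, checkValue = true)) // true

    // Backward compatibility: an old checkpoint passes when the value check is skipped.
    println(isCompatible(oldStored, newCurrent, checkValue = false)) // true
  }
}
```

In this model the key schema is identical before and after the schema change because it contains only column `a`, which is why the value schema is the only part that can cause a mismatch.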
1 parent 1387af7, commit 9adfc3a
Showing 37 changed files with 152 additions and 22 deletions.
Binary file added
BIN
+12 Bytes
...rces/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/commits/.0.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...rces/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/commits/.1.crc
Binary file not shown.
2 changes: 2 additions & 0 deletions
...resources/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/commits/0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
v1 | ||
{"nextBatchWatermarkMs":0} |
2 changes: 2 additions & 0 deletions
...resources/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/commits/1
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
v1 | ||
{"nextBatchWatermarkMs":0} |
1 change: 1 addition & 0 deletions
.../resources/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/metadata
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"id":"33e8de33-00b8-4b60-8246-df2f433257ff"} |
Binary file added
BIN
+16 Bytes
...rces/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/offsets/.0.crc
Binary file not shown.
Binary file added
BIN
+16 Bytes
...rces/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/offsets/.1.crc
Binary file not shown.
3 changes: 3 additions & 0 deletions
...resources/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/offsets/0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
v1 | ||
{"batchWatermarkMs":0,"batchTimestampMs":1656644489789,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.stateStore.compression.codec":"lz4","spark.sql.streaming.stateStore.rocksdb.formatVersion":"5","spark.sql.streaming.statefulOperator.useStrictDistribution":"true","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"5"}} | ||
0 |
3 changes: 3 additions & 0 deletions
...resources/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/offsets/1
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
v1 | ||
{"batchWatermarkMs":0,"batchTimestampMs":1656644492462,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.stateStore.compression.codec":"lz4","spark.sql.streaming.stateStore.rocksdb.formatVersion":"5","spark.sql.streaming.statefulOperator.useStrictDistribution":"true","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"5"}} | ||
1 |
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/0/.1.delta.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/0/.2.delta.crc
Binary file not shown.
Binary file added
BIN
+77 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/0/1.delta
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/0/2.delta
Binary file not shown.
Binary file added
BIN
+12 Bytes
...treaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/0/_metadata/.schema.crc
Binary file not shown.
Binary file added
BIN
+254 Bytes
...red-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/0/_metadata/schema
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/1/.1.delta.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/1/.2.delta.crc
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/1/1.delta
Binary file not shown.
Binary file added
BIN
+77 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/1/2.delta
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/2/.1.delta.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/2/.2.delta.crc
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/2/1.delta
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/2/2.delta
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/3/.1.delta.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/3/.2.delta.crc
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/3/1.delta
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/3/2.delta
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/4/.1.delta.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...uctured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/4/.2.delta.crc
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/4/1.delta
Binary file not shown.
Binary file added
BIN
+46 Bytes
...s/structured-streaming/checkpoint-version-3.3.0-streaming-deduplication/state/0/4/2.delta
Binary file not shown.