[SPARK-46906][SS] Add a check for stateful operator change for streaming #44927

jingz-db · 2024-01-29T05:16:35Z

What changes were proposed in this pull request?

Currently, users get a misleading error, org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible, if they restart a query from the same checkpoint location after changing its stateful operator. This PR catches such cases and throws a new error with an informative message.

After physical planning and before the execution phase, we read the state metadata using the current operator id to fetch the operator name recorded for the committed batch under the same operator id. If the operator names do not match, we throw the new error.
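
A minimal sketch of the per-operator check described above, written against plain Scala collections rather than Spark's internal classes. `OperatorMismatch`, `OperatorNameCheck`, and `committedOperators` are hypothetical names used only for illustration; the PR itself reads the committed names from the state metadata in the checkpoint.

```scala
// Hypothetical error type for illustration; not the error class added by the PR.
final case class OperatorMismatch(opId: Long, current: String, committed: String)
  extends RuntimeException(
    s"Operator id $opId: current batch uses '$current', state metadata has '$committed'")

object OperatorNameCheck {
  // Compares the operator planned for this run with the one committed under the same id.
  // An id that is absent from the metadata passes, since there is nothing to compare against.
  def checkOperator(
      opId: Long,
      currentName: String,
      committedOperators: Map[Long, String]): Unit = {
    committedOperators.get(opId).foreach { committedName =>
      if (committedName != currentName) {
        throw OperatorMismatch(opId, currentName, committedName)
      }
    }
  }
}
```

The operator names passed in are whatever the caller supplies; the sketch only compares them for equality.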

Why are the changes needed?

The current error message is misleading to users. We should provide a message that guides them to the real root cause of the error.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

chaoqin-li1123 · 2024-01-30T00:34:12Z

common/utils/src/main/resources/error/error-classes.json

@@ -3317,6 +3317,12 @@
    ],
    "sqlState" : "XXKST"
  },
+  "STREAMING_STATEFUL_OPERATOR_NOT_MATCH_IN_STATE_METADATA" : {
+    "message" : [
+      "Streaming stateful operator name does not match with the operator in state metadata with the same operator id (id: <operatorId>). Stateful Operator name for current batch: <currentOperatorName>; Operator name in the state metadata: <stateMetadataOperatorName>."


Can we explain to the customer why this occurs? For example: "changing stateful operator of existing streaming query is not allowed."

chaoqin-li1123

LGTM

HeartSaVioR · 2024-01-31T04:09:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

@@ -184,6 +185,41 @@ class IncrementalExecution(
    }
  }

+  /**


Maybe better to compare at a higher level: construct a map of opId -> operatorName for both the physical plan and the state metadata, and compare the two.

With this approach we can also give a better error message for all cases: 1) addition of new stateful operator(s), 2) removal of existing stateful operator(s), 3) replacement of existing stateful operator(s). Case 3) can also happen if someone tries to replace dropDuplicates with dropDuplicatesWithinWatermark after encountering some issue.

Worth noting that we consider state metadata as the source of truth in this PR, so if there is no state metadata prior to this, it would be the same as letting Spark create a new state metadata file and comparing against it (which cannot perform any real check, though). So the check can also be done after executing the physical planning rules, maybe at the end of state.apply().
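
The map comparison suggested here could look roughly like the following. This is a sketch under assumptions: `OperatorMapDiff` and its `Change` types are hypothetical, and the maps are plain `Map[Long, String]` values rather than anything built from Spark's plan or metadata readers.

```scala
object OperatorMapDiff {
  // The three kinds of change listed in the review comment.
  sealed trait Change
  final case class Added(opId: Long, name: String) extends Change
  final case class Removed(opId: Long, name: String) extends Change
  final case class Replaced(opId: Long, oldName: String, newName: String) extends Change

  // Diffs two opId -> operatorName maps; `inMetadata` is treated as the source of truth.
  def diff(inMetadata: Map[Long, String], inPlan: Map[Long, String]): Seq[Change] = {
    val ids = (inMetadata.keySet ++ inPlan.keySet).toSeq.sorted
    ids.flatMap { id =>
      val change: Option[Change] = (inMetadata.get(id), inPlan.get(id)) match {
        case (None, Some(n)) => Some(Added(id, n))
        case (Some(o), None) => Some(Removed(id, o))
        case (Some(o), Some(n)) if o != n => Some(Replaced(id, o, n))
        case _ => None
      }
      change
    }
  }
}
```

An empty result means the two maps agree; any non-empty result can be turned into a single error that names every mismatching operator id.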

Thanks for the review Jungtaek! I also like the idea of adding a map.

So the check can be also done after executing physical planning rules, maybe at the end of state.apply()

I tried to add the check after WriteStatefulOperatorMetadataRule, but that fails to detect an operator added after restart (because the additional operator has already been written to the metadata). So I keep the check before WriteStatefulOperatorMetadataRule and skip it if the metadata is empty.
It is also worth noting that if we do not perform the check and fail the query before writing to the metadata, incorrect info will be written to the state metadata.
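
The ordering argument can be illustrated with a small in-memory sketch. `MetadataStore` and `validateThenWrite` are hypothetical stand-ins, not Spark APIs; the point is only that validation reads what was committed earlier, runs before the write, and is skipped when nothing has been committed yet.

```scala
import scala.collection.mutable

object CheckBeforeWrite {
  // Hypothetical in-memory stand-in for the state metadata stored in the checkpoint.
  final class MetadataStore {
    private val ops = mutable.Map.empty[Long, String]
    def read(): Map[Long, String] = ops.toMap
    def write(planOps: Map[Long, String]): Unit = { ops.clear(); ops ++= planOps }
  }

  // Validates against what was committed earlier, then records the current plan.
  // On mismatch it returns Left with a message and writes nothing.
  def validateThenWrite(
      store: MetadataStore,
      planOps: Map[Long, String]): Either[String, Unit] = {
    val committed = store.read()
    if (committed.nonEmpty && committed != planOps) {
      Left(s"stateful operators changed: metadata=$committed, current=$planOps")
    } else {
      store.write(planOps)
      Right(())
    }
  }
}
```

If the write happened first, a newly added operator would already be present in `committed` and the comparison would pass, which is the failure mode described in the comment above.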

HeartSaVioR

Another pass.

HeartSaVioR · 2024-02-14T03:49:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

@@ -387,8 +433,29 @@ class IncrementalExecution(
      rulesToCompose.reduceLeft { (ruleA, ruleB) => ruleA orElse ruleB }
    }

+    private def checkOperatorValidWithMetadata(): Unit = {


Shall we inline all the logic, e.g. building opMapInMetadata and opMapInPhysicalPlan, here? I don't see these fields used anywhere else. Let's scope fields and methods as narrowly as possible.

That said, you don't need to use a rule to build opMapInPhysicalPlan. Let's just use foreach to traverse the plan and build opMapInPhysicalPlan.
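
As an illustration of building the map with a plain traversal instead of a rule, here is a sketch over a toy plan tree. `Node`, `Stateless`, and `Stateful` are hypothetical types standing in for Spark's physical plan nodes; the `foreach` defined here is a simple pre-order walk written for this example.

```scala
import scala.collection.mutable

object PlanTraversal {
  // Toy plan tree; hypothetical stand-ins for physical plan nodes.
  sealed trait Node {
    def children: Seq[Node]
    // Pre-order traversal: visit this node, then each child subtree.
    def foreach(f: Node => Unit): Unit = {
      f(this)
      children.foreach(_.foreach(f))
    }
  }
  final case class Stateless(name: String, children: Seq[Node] = Nil) extends Node
  final case class Stateful(opId: Long, name: String, children: Seq[Node] = Nil) extends Node

  // Collects opId -> operatorName for every stateful node in the tree.
  def statefulOperators(plan: Node): Map[Long, String] = {
    val ops = mutable.Map.empty[Long, String]
    plan.foreach {
      case Stateful(id, name, _) => ops += (id -> name)
      case _ => ()
    }
    ops.toMap
  }
}
```

Because the map is a local value inside one method, nothing outside the check needs to know about it, which matches the narrower scoping requested above.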

HeartSaVioR · 2024-02-14T03:51:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

+       val opInCurBatch = opMapInPhysicalPlan.getOrElse(opId, "not found")
+       if (opInMetadata != opInCurBatch) {
+         throw QueryExecutionErrors.statefulOperatorNotMatchInStateMetadataError(
+           opMapInMetadata.values.toSeq,


Shall we print out the association between opId and opName in the error message? It may be hard to understand what is mismatching with only the opNames.
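
One way to render the association, sketched with a hypothetical helper `OperatorMapFormat.render`. The per-entry shape follows the final error message quoted later in this thread; the `", "` separator between entries is an assumption, since that message only shows a single entry.

```scala
object OperatorMapFormat {
  // Renders an opId -> opName map as "[(OperatorId: 0 -> OperatorName: dedupe)]".
  // Entries are sorted by id so the output is stable.
  def render(ops: Map[Long, String]): String =
    ops.toSeq
      .sortBy(_._1)
      .map { case (id, name) => s"(OperatorId: $id -> OperatorName: $name)" }
      .mkString("[", ", ", "]")
}
```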

sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala

HeartSaVioR · 2024-02-14T03:52:55Z

docs/sql-error-conditions.md

+
+[SQLSTATE: 42K03](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation)
+
+Streaming stateful operator name does not match with the operator in state metadata. This likely to happen when user changes stateful operator of existing streaming query.


nit: when user adds/removes/changes

HeartSaVioR · 2024-02-14T04:01:04Z

...c/test/scala/org/apache/spark/sql/execution/streaming/state/OperatorStateMetadataSuite.scala

+        StopStream
+      )
+
+      def checkOpChangeError(OpsInMetadataSeq: Seq[String],


nit: if you are using more than two lines to define the method, params should start on the second line of the definition. (In other words, all params should appear at the same indentation.)

https://github.com/databricks/scala-style-guide?tab=readme-ov-file#indent

Also, param names should start with a lowercase letter.
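
For reference, a small sketch of the layout being asked for. The method `describeChange` and its parameters are hypothetical and are not the test helper from the diff, whose full signature is not shown here.

```scala
object StyleExample {
  // Parameters start on the second line, share one indentation level (4 spaces),
  // and begin with a lowercase letter.
  def describeChange(
      opsInMetadata: Seq[String],
      opsInCurBatch: Seq[String]): String = {
    opsInMetadata.mkString(", ") + " -> " + opsInCurBatch.mkString(", ")
  }
}
```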

Do we have any automation tool for checking this other than ./dev/scalafmt? This command is listed on the Spark developer tools wiki, but it is actually quite messy: it touches all existing files rather than formatting only my code change.

Maybe the community has to run the formatter once for the whole codebase. I'm not sure scalafmt can handle all the styles, though. It is still good to familiarize yourself with the Databricks Scala style guide; it doesn't contain only styles that automation can handle.

Got it! Thanks Jungtaek!

HeartSaVioR · 2024-02-14T04:36:44Z

...c/test/scala/org/apache/spark/sql/execution/streaming/state/OperatorStateMetadataSuite.scala

+        }
+      )
+
+      // remove operator


Please split this into a separate test case if it is fully isolated from the other checks.

Btw, this actually brings up food for thought. Do we disallow a stateful query from becoming stateless? E.g. could you simply test the removal of the stateful operator with checkpointDir rather than spinning up another checkpoint?

It's OK if we have been supporting the case (although undocumented) and we keep supporting the case. If not, we could just test the case using checkpointDir.

Do we disallow stateful query to be stateless?

We didn't allow it even before adding the operator check. Streaming throws an error with the message "state path not found".

E.g. could you simply test the removal of stateful operator with checkpointDir rather than spinning up another checkpoint?

Done. Restarting a stateful query as a stateless query now triggers an error with the message:

[STREAMING_STATEFUL_OPERATOR_NOT_MATCH_IN_STATE_METADATA] Streaming stateful operator name does not match with the operator in state metadata. This likely to happen when user adds/removes/changes stateful operator of existing streaming query. Stateful operators in the metadata: [(OperatorId: 0 -> OperatorName: dedupe)]; Stateful operators in current batch: []. SQLSTATE: 42K03

jingz-db · 2024-02-14T19:54:34Z

Thanks Jungtaek for your thorough code review!

HeartSaVioR

+1 with super-nit (assume you'll make a fix). Thanks for the work!

HeartSaVioR · 2024-02-14T21:33:34Z

...c/test/scala/org/apache/spark/sql/execution/streaming/state/OperatorStateMetadataSuite.scala

+        testStream(restartStream, Update)(
+          StartStream(checkpointLocation = checkpointDir.toString),
+          AddData(inputData, 3),
+          ExpectFailure[SparkRuntimeException] { t => {


super nit: the extra {} is unnecessary after { t =>. Multiple lines are allowed after =>.
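
The same point in isolation, as a sketch that is unrelated to the test code in the diff: both forms below compile and behave identically, but the second drops the redundant inner braces.

```scala
object LambdaBraces {
  // Redundant: a second pair of braces wraps the lambda body.
  val verbose: Seq[Int] = Seq(1, 2, 3).map { x => {
    val doubled = x * 2
    doubled + 1
  }}

  // Preferred: several statements may follow `=>` directly.
  val concise: Seq[Int] = Seq(1, 2, 3).map { x =>
    val doubled = x * 2
    doubled + 1
  }
}
```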

HeartSaVioR

+1 pending CI.

HeartSaVioR · 2024-02-15T04:21:50Z

@jingz-db Mind retriggering GA? You can either manually do this in your fork or simply push an empty commit to do this automatically. Thanks!

HeartSaVioR · 2024-02-20T23:55:11Z

Thanks! Merging to master.

init impl

eec170c

github-actions bot added SQL STRUCTURED STREAMING labels Jan 29, 2024

jingz-db marked this pull request as ready for review January 29, 2024 05:17

add error class doc

d2d79a4

github-actions bot added the DOCS label Jan 29, 2024

jingz-db changed the title ~~[SS] Add a check for stateful operator change for streaming~~ [SPARK-46906][SS] Add a check for stateful operator change for streaming Jan 29, 2024

chaoqin-li1123 reviewed Jan 30, 2024

chaoqin-li1123 approved these changes Jan 30, 2024

modify error message

98adaaa

HeartSaVioR reviewed Jan 31, 2024

jingz-db added 2 commits February 12, 2024 22:23

sub error clss, not working

89f67ad

add opId to name map

37ef7fd

jingz-db requested a review from HeartSaVioR February 13, 2024 22:16

HeartSaVioR reviewed Feb 14, 2024

jingz-db force-pushed the operator-check branch from 1ae65c4 to 72dfeb9 Compare February 14, 2024 19:24

resolve comments

7ecb493

jingz-db force-pushed the operator-check branch from 72dfeb9 to 7ecb493 Compare February 14, 2024 19:35

jingz-db added 3 commits February 14, 2024 11:39

scala style

5241dbb

error doc

20121cd

scala style

e427824

jingz-db requested a review from HeartSaVioR February 14, 2024 19:53

HeartSaVioR approved these changes Feb 14, 2024

final nits fixed

0618e9e

jingz-db requested a review from HeartSaVioR February 14, 2024 21:57

HeartSaVioR approved these changes Feb 14, 2024

retrigger CI

a6619cb

HeartSaVioR closed this in 8ede494 Feb 20, 2024
