[SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources #29696

tdas · 2020-09-09T18:00:11Z

What changes were proposed in this pull request?

Make MicroBatchExecution explicitly call getBatch when the start and end offsets are the same.

Why are the changes needed?

Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call source.getBatch() on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when

The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts).
- These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches.
The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call source.getBatch when attempting to re-execute the incomplete no-data-batch.

This occurs because no-data-batches has the same and end offsets, and when a batch is executed, if the start and end offset is same then calling source.getBatch is skipped as it is assumed the generated plan will be empty. This only affects V1 data sources which rely on this invariant to detect in the source whether the query is being started from scratch or restarted.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit test with a mock v1 source that fails without the fix.

…with some stateful queries + no-data-batches + V1 sources Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same. Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when - The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts). - These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches. - The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch. This occurs because no-data-batches has the same and end offsets, and when a batch is executed, if the start and end offset is same then calling `source.getBatch` is skipped as it is assumed the generated plan will be empty. This only affects V1 data sources like Delta and Autoloader which rely on this invariant to detect in the source whether the query is being started from scratch or restarted. No New unit test with a mock v1 source that fails without the fix. Closes apache#29651 from tdas/SPARK-32794. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

tdas · 2020-09-09T18:01:15Z

sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecutionSuite.scala

+import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
+import org.apache.spark.sql.catalyst.plans.logical.Range
+import org.apache.spark.sql.connector.read.streaming
+import org.apache.spark.sql.connector.read.streaming.SparkDataStream
 import org.apache.spark.sql.functions.{count, window}


@zsxwing conflict was in the imports. master uses this new function import org.apache.spark.sql.functions.timestamp_seconds which does not exist in 3.0

SparkQA · 2020-09-09T18:06:30Z

Test build #128460 has finished for PR 29696 at commit 1272f2f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

…with some stateful queries + no-data-batches + V1 sources ### What changes were proposed in this pull request? Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same. ### Why are the changes needed? Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when - The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts). - These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches. - The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch. This occurs because no-data-batches has the same and end offsets, and when a batch is executed, if the start and end offset is same then calling `source.getBatch` is skipped as it is assumed the generated plan will be empty. This only affects V1 data sources which rely on this invariant to detect in the source whether the query is being started from scratch or restarted. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit test with a mock v1 source that fails without the fix. Closes #29696 from tdas/SPARK-32794-3.0. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

…with some stateful queries + no-data-batches + V1 sources Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same. Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when - The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts). - These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches. - The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch. This occurs because no-data-batches has the same and end offsets, and when a batch is executed, if the start and end offset is same then calling `source.getBatch` is skipped as it is assumed the generated plan will be empty. This only affects V1 data sources which rely on this invariant to detect in the source whether the query is being started from scratch or restarted. No New unit test with a mock v1 source that fails without the fix. Closes apache#29696 from tdas/SPARK-32794-3.0. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

SparkQA · 2020-09-09T23:25:02Z

Test build #128465 has finished for PR 29696 at commit c1f51a1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…with some stateful queries + no-data-batches + V1 sources ### What changes were proposed in this pull request? Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same. ### Why are the changes needed? Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when - The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts). - These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches. - The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch. This occurs because no-data-batches has the same and end offsets, and when a batch is executed, if the start and end offset is same then calling `source.getBatch` is skipped as it is assumed the generated plan will be empty. This only affects V1 data sources which rely on this invariant to detect in the source whether the query is being started from scratch or restarted. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit test with a mock v1 source that fails without the fix. Closes apache#29696 from tdas/SPARK-32794-3.0. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

probot-autolabeler bot added SQL STRUCTURED STREAMING labels Sep 9, 2020

tdas commented Sep 9, 2020

View reviewed changes

Fix style

c1f51a1

zsxwing approved these changes Sep 9, 2020

View reviewed changes

tdas closed this Sep 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources #29696

[SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources #29696

tdas commented Sep 9, 2020 •

edited

Loading

tdas Sep 9, 2020 •

edited

Loading

SparkQA commented Sep 9, 2020

SparkQA commented Sep 9, 2020

[SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources #29696

[SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources #29696

Conversation

tdas commented Sep 9, 2020 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

tdas Sep 9, 2020 • edited Loading

Choose a reason for hiding this comment

SparkQA commented Sep 9, 2020

SparkQA commented Sep 9, 2020

tdas commented Sep 9, 2020 •

edited

Loading

tdas Sep 9, 2020 •

edited

Loading