[SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits by tdas · Pull Request #27373 · apache/spark

tdas · 2020-01-28T09:27:29Z

What changes were proposed in this pull request?

This PR fixes two bugs related to streaming limits.

Bug 1 (SPARK-30658): A limit before a streaming aggregate (i.e. df.limit(5).groupBy().count()) in complete mode was not being planned as a stateful streaming limit. The planner rule replaced a logical limit with a stateful streaming limit plan only if the query was in append mode. As a result, instead of allowing at most 5 rows across all batches, the planned streaming query allowed 5 rows in every batch, producing incorrect results.

Solution: Change the planner rule to plan the logical limit with a streaming limit plan even when the query is in complete mode if the logical limit has no stateful operator before it.
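
To illustrate Bug 1, here is a minimal sketch in plain Scala collections (not Spark code) of the difference between a limit that is re-applied in every micro-batch and a limit that remembers how many rows it has already emitted. The names StreamingLimitSketch, perBatchLimit and statefulLimit are hypothetical, and the mutable counter only stands in for the count a real stateful operator would keep in its state store.

```scala
// Toy model: each inner Seq is one micro-batch of input rows.
// All names here are illustrative assumptions, not Spark APIs.
object StreamingLimitSketch {

  // Non-stateful limit: the limit starts from scratch in every batch,
  // so up to `limit` rows can pass through per batch.
  def perBatchLimit[A](batches: Seq[Seq[A]], limit: Int): Seq[Seq[A]] =
    batches.map(_.take(limit))

  // Stateful limit: tracks how many rows earlier batches already emitted,
  // so at most `limit` rows pass through across all batches.
  def statefulLimit[A](batches: Seq[Seq[A]], limit: Int): Seq[Seq[A]] = {
    var emitted = 0 // stands in for the row count kept in the state store
    batches.map { batch =>
      val out = batch.take(math.max(limit - emitted, 0))
      emitted += out.size
      out
    }
  }
}
```

With three batches of three rows each and a limit of 5, perBatchLimit lets all 9 rows through, while statefulLimit emits 3 rows, then 2, then none. A downstream count over the first result would therefore report 9 instead of 5.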

Bug 2 (SPARK-30657): LocalLimitExec does not fully consume the iterator of the child plan. So if there is a limit after a stateful operator like streaming dedup in append mode (e.g. df.dropDuplicates().limit(5)), the state changes of the streaming dedup may not be committed (most stateful ops commit state changes only after the generated iterator is fully consumed).

Solution: Change the planner rule to always use a new StreamingLocalLimitExec, which always fully consumes the iterator. This is the safest thing to do. However, it introduces a performance regression, as consuming the iterator is extra work. To minimize the performance impact, add a post-planner optimization rule that replaces StreamingLocalLimitExec with LocalLimitExec when there is no stateful operator before the limit that could be affected by it.
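
To illustrate Bug 2, here is a minimal sketch with plain Scala iterators (not the actual StreamingLocalLimitExec code). It shows why a limit that stops pulling rows early can leave an upstream commit undone, and how draining the child avoids that. CommitOnCompletion, plainLimit and consumingLimit are hypothetical names, and the committed flag is only an assumed stand-in for a state store commit.

```scala
// All names here are illustrative assumptions, not Spark APIs.
object LimitConsumptionSketch {

  // Toy stand-in for a stateful operator: it "commits" only once its
  // output iterator has been observed to be exhausted.
  final class CommitOnCompletion[A](rows: Iterator[A]) extends Iterator[A] {
    var committed = false
    def hasNext: Boolean = {
      val more = rows.hasNext
      if (!more) committed = true
      more
    }
    def next(): A = rows.next()
  }

  // Like a plain local limit: stops pulling from the child after `limit` rows.
  def plainLimit[A](child: Iterator[A], limit: Int): List[A] =
    child.take(limit).toList

  // Like the consuming variant described above: keeps the first `limit` rows,
  // then drains whatever the child still has so it reaches its end.
  def consumingLimit[A](child: Iterator[A], limit: Int): List[A] = {
    val out = List.newBuilder[A]
    var kept = 0
    while (child.hasNext) {
      val row = child.next()
      if (kept < limit) { out += row; kept += 1 }
    }
    out.result()
  }
}
```

Both functions return the same rows for the same input. The difference is the side effect: when the child has more rows than the limit, only consumingLimit drives it to exhaustion and so triggers the commit. The extra draining loop is also the source of the performance cost mentioned above.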

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated incorrect unit tests and added new ones

SparkQA · 2020-01-28T13:19:41Z

Test build #117479 has finished for PR 27373 at commit 9cc0353.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-28T14:54:36Z

Test build #117481 has finished for PR 27373 at commit 06b5f05.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-01-30T02:16:35Z

@tdas is this good to go before the code freeze?

zsxwing · 2020-01-30T17:49:35Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala

  }

-  test("streaming limit in complete mode") {
+  test("streaming limit before agg in complete mode") {


nit: could you add the jira id to the test name for this and the rest of tests? We recently added the following rule:

Also, you should consider writing a JIRA ID in the tests when your pull request targets to fix a specific issue. In practice, usually it is added when a JIRA type is a bug or a PR adds a couple of tests to an existing test class. See the examples below. Scala: test("SPARK-12345: a short description of the test") {

aah right. thanks!

zsxwing · 2020-01-30T17:54:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

+      case ReturnAnswer(Limit(IntegerLiteral(limit), child)) if generatesStreamingAppends(child) =>
+        StreamingGlobalLimitExec(limit, StreamingLocalLimitExec(limit, planLater(child))) :: Nil
+
+      case Limit(IntegerLiteral(limit), child) if generatesStreamingAppends (child) =>


super nit: extra space between generatesStreamingAppends and (child)

i am surprised that the style checker did not catch this

zsxwing

LGTM except some nits.

tdas · 2020-01-30T22:16:11Z

@srowen this is good to go

zsxwing · 2020-01-31T01:13:27Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala

  }

-  test("streaming limit in complete mode") {
+  test("streaming limit before agg in complete mode (SPARK-30658)") {


@tdas this is not the right style. It should be test("SPARK-12345: a short description of the test"). Could you also fix other test names?

SparkQA · 2020-01-31T02:52:54Z

Test build #117593 has finished for PR 27373 at commit 469ad23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-31T08:05:02Z

Test build #117617 has finished for PR 27373 at commit b127525.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2020-01-31T08:10:42Z

jenkins retest this please

SparkQA · 2020-01-31T12:15:24Z

Test build #117636 has finished for PR 27373 at commit b127525.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2020-01-31T12:16:59Z

jenkins retest this please.

tdas · 2020-01-31T12:17:27Z

The last failure was in an unrelated Python test. Nonetheless, kicking off another round of tests to be sure.

SparkQA · 2020-01-31T16:45:27Z

Test build #117654 has finished for PR 27373 at commit b127525.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing

LGTM. There is one test missing jira id. I will fix it when merging the PR.

zsxwing · 2020-01-31T17:24:16Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala

        false))
  }

+  test("streaming limit should not apply on limits on state subplans") {


nit: this is missing the jira id.

zsxwing · 2020-01-31T17:44:39Z

@tdas I merged this to master. Could you also submit a PR for branch-2.4?

tdas · 2020-01-31T22:10:25Z

Thank you for merging. Should this be merged to branch 2.4? This is a slightly scary change deep in the incremental execution stuff.

tdas added 2 commits January 28, 2020 01:19

Fixed bug

9cc0353

improved

06b5f05

tdas requested a review from zsxwing January 28, 2020 10:59

dongjoon-hyun added the STRUCTURED STREAMING label Jan 29, 2020

dongjoon-hyun changed the title ~~[SPARK-30657] [SPARK-30658] [SS] Fixed two bugs in streaming limits~~ [SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits Jan 29, 2020

zsxwing reviewed Jan 30, 2020

addressed comments

469ad23

zsxwing reviewed Jan 31, 2020

fixed test names

b127525

zsxwing approved these changes Jan 31, 2020

asfgit closed this in 481e521 Jan 31, 2020

5 participants
