[SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits#27373

Closed
tdas wants to merge 4 commits into apache:master from tdas:SPARK-30657

Conversation

tdas (Contributor) commented Jan 28, 2020

What changes were proposed in this pull request?

This PR fixes two bugs related to streaming limits.

Bug 1 (SPARK-30658): A limit before a streaming aggregate (e.g. df.limit(5).groupBy().count()) in complete mode was not planned as a stateful streaming limit, because the planner rule planned a logical limit as a stateful streaming limit only when the query was in append mode. As a result, instead of allowing at most 5 rows across all batches, the planned streaming query allowed 5 rows in every batch, producing incorrect results.
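
To make the difference concrete, here is a minimal plain-Scala sketch with no Spark dependency. The names `LimitAcrossBatches`, `perBatchLimit` and `statefulLimit` are hypothetical, and micro-batches are modeled as plain sequences; it only illustrates why a limit applied per batch lets more than 5 rows reach the aggregate.

```scala
object LimitAcrossBatches {
  // Stateless limit: every micro-batch is truncated independently,
  // so the total across batches can exceed `limit`.
  def perBatchLimit(batches: Seq[Seq[Int]], limit: Int): Seq[Seq[Int]] =
    batches.map(_.take(limit))

  // Stateful limit: remembers how many rows earlier batches already emitted
  // and only lets the remainder through.
  def statefulLimit(batches: Seq[Seq[Int]], limit: Int): Seq[Seq[Int]] = {
    var emitted = 0 // toy stand-in for state kept across batches
    batches.map { batch =>
      val out = batch.take(math.max(limit - emitted, 0))
      emitted += out.size
      out
    }
  }
}
```

With two batches of 4 rows each and a limit of 5, the per-batch version passes 8 rows downstream, while the stateful version passes 5.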

Solution: Change the planner rule to plan the logical limit as a streaming limit even when the query is in complete mode, provided the logical limit has no stateful operator before it.
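
The decision described in this solution can be sketched over a toy plan tree. Everything here (`Plan`, `OutputMode`, `ToyPlanner`, `needsStreamingLimit`) is hypothetical and is not the Spark planner code; the sketch also assumes a streaming aggregate is the only stateful operator that matters for the check.

```scala
sealed trait OutputMode
case object Append extends OutputMode
case object Complete extends OutputMode

// Toy logical plan: just enough node types to express the two cases.
sealed trait Plan
case object StreamingSource extends Plan
final case class StreamingAgg(child: Plan) extends Plan
final case class Limit(n: Int, child: Plan) extends Plan

object ToyPlanner {
  // True if a streaming aggregate appears anywhere in the subtree.
  def hasStreamingAgg(p: Plan): Boolean = p match {
    case StreamingSource => false
    case StreamingAgg(_) => true
    case Limit(_, child) => hasStreamingAgg(child)
  }

  // A limit gets the stateful streaming plan in append mode, or in any
  // mode when its child contains no streaming aggregate.
  def needsStreamingLimit(mode: OutputMode, limitChild: Plan): Boolean =
    mode == Append || !hasStreamingAgg(limitChild)
}
```

Under these assumptions, the limit in df.limit(5).groupBy().count() sits directly on the source, so it qualifies for the stateful plan even in complete mode, whereas a limit placed after the aggregate in complete mode does not.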

Bug 2 (SPARK-30657): LocalLimitExec does not fully consume the iterator of the child plan. So if there is a limit after a stateful operator such as streaming deduplication in append mode (e.g. df.dropDuplicates().limit(5)), the state changes of the deduplication operator may not be committed, because most stateful operators commit state changes only after the generated iterator is fully consumed.
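
The failure mode can be reproduced with ordinary Scala iterators. In this sketch, `CommitOnExhaustion` is a hypothetical stand-in for a stateful operator's output iterator, which is assumed to "commit" only when it reports that it has no more rows; it is not a Spark class.

```scala
// Toy stand-in for a stateful operator's output: the commit happens
// only once the iterator has been drained.
final class CommitOnExhaustion[A](underlying: Iterator[A]) extends Iterator[A] {
  var committed = false

  def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) committed = true // "commit state" when fully consumed
    more
  }

  def next(): A = underlying.next()
}
```

Taking 5 rows from a 10-row `CommitOnExhaustion` with `take(5).toList` returns the rows but leaves `committed` false, because `take` stops pulling from the child once it has enough rows.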

Solution: Change the planner rule to always use a new StreamingLocalLimitExec which always fully consumes the iterator. This is the safest thing to do. However, this will introduce a performance regression as consuming the iterator is extra work. To minimize this performance impact, add an additional post-planner optimization rule to replace StreamingLocalLimitExec with LocalLimitExec when there is no stateful operator before the limit that could be affected by it.
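
A limit that keeps draining its child can be sketched as follows. `FullyConsumingLimit` is a hypothetical helper in plain Scala, not the StreamingLocalLimitExec implementation; it only shows the idea of emitting at most `limit` rows while still pulling every row from the child.

```scala
object FullyConsumingLimit {
  // Emits at most `limit` elements, but keeps reading `iter` until it is
  // exhausted, provided the caller drains the returned iterator.
  def apply[A](iter: Iterator[A], limit: Int): Iterator[A] = {
    var seen = 0
    iter.filter { _ =>
      seen += 1
      seen <= limit // rows past the limit are read and then dropped
    }
  }
}
```

The extra reads past the limit are the additional work the description refers to as a performance regression, which is why the cheaper non-consuming limit is preferable when no stateful operator depends on full consumption.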

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated incorrect unit tests and added new ones

@tdas tdas requested a review from zsxwing January 28, 2020 10:59

SparkQA commented Jan 28, 2020

Test build #117479 has finished for PR 27373 at commit 9cc0353.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 28, 2020

Test build #117481 has finished for PR 27373 at commit 06b5f05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun changed the title from "[SPARK-30657] [SPARK-30658] [SS] Fixed two bugs in streaming limits" to "[SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits" on Jan 29, 2020
srowen (Member) commented Jan 30, 2020

@tdas is this good to go before the code freeze?

}

- test("streaming limit in complete mode") {
+ test("streaming limit before agg in complete mode") {
zsxwing (Member) commented Jan 30, 2020

nit: could you add the JIRA ID to the test name for this and the rest of the tests? We recently added the following rule:

Also, you should consider writing a JIRA ID in the tests when your pull request targets to fix a specific issue. In practice, usually it is added when a JIRA type is a bug or a PR adds a couple of tests to an existing test class. See the examples below (Scala):

test("SPARK-12345: a short description of the test") {

tdas (Contributor, Author) commented

aah right. thanks!

case ReturnAnswer(Limit(IntegerLiteral(limit), child)) if generatesStreamingAppends(child) =>
StreamingGlobalLimitExec(limit, StreamingLocalLimitExec(limit, planLater(child))) :: Nil

case Limit(IntegerLiteral(limit), child) if generatesStreamingAppends (child) =>
Member commented

super nit: extra space between generatesStreamingAppends and (child)

tdas (Contributor, Author) commented

i am surprised that the style checker did not catch this

zsxwing (Member) left a comment

LGTM except some nits.

tdas (Contributor, Author) commented Jan 30, 2020

@srowen this is good to go

}

- test("streaming limit in complete mode") {
+ test("streaming limit before agg in complete mode (SPARK-30658)") {
Member commented

@tdas this is not the right style. It should be test("SPARK-12345: a short description of the test"). Could you also fix other test names?

SparkQA commented Jan 31, 2020

Test build #117593 has finished for PR 27373 at commit 469ad23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 31, 2020

Test build #117617 has finished for PR 27373 at commit b127525.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas (Contributor, Author) commented Jan 31, 2020

jenkins retest this please

SparkQA commented Jan 31, 2020

Test build #117636 has finished for PR 27373 at commit b127525.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas (Contributor, Author) commented Jan 31, 2020

jenkins retest this please.

tdas (Contributor, Author) commented Jan 31, 2020

The last failure was in an unrelated Python test. Nonetheless, kicking off another round of tests to be sure.

SparkQA commented Jan 31, 2020

Test build #117654 has finished for PR 27373 at commit b127525.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zsxwing (Member) left a comment

LGTM. There is one test missing the JIRA ID. I will fix it when merging the PR.

false))
}

test("streaming limit should not apply on limits on state subplans") {
Member commented

nit: this is missing the jira id.

@asfgit asfgit closed this in 481e521 Jan 31, 2020
zsxwing (Member) commented Jan 31, 2020

@tdas I merged this to master. Could you also submit a PR for branch-2.4?

tdas (Contributor, Author) commented Jan 31, 2020

Thank you for merging. Should this be merged to branch 2.4? This is a slightly scary change deep in the incremental execution stuff.
