[SPARK-24063][SS] Add maximum epoch queue threshold for ContinuousExecution #23156

gaborgsomogyi · 2018-11-27T13:53:44Z

What changes were proposed in this pull request?

Continuous processing waits on epochs which are not yet complete (for example, when one partition is not making progress) and stores pending items in queues. These queues are unbounded and can easily consume all available memory. In this PR I've added the spark.sql.streaming.continuous.epochBacklogQueueSize configuration option to make them bounded. If the related threshold is reached, the query stops with an IllegalStateException.
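The fail-fast bounding described above can be illustrated with a minimal sketch. It is not the code from this PR: `BoundedEpochBacklog`, `enqueue`, `commit` and `maxSize` are hypothetical names, and checking the limit before insertion is an assumption made here to keep the example simple.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for the coordinator's bookkeeping;
// this is NOT Spark's EpochCoordinator.
final class BoundedEpochBacklog(maxSize: Int) {
  // Epochs that are still pending and therefore have to be remembered.
  private val pendingEpochs = mutable.HashSet.empty[Long]

  /** Records a pending epoch, failing fast once the backlog is full. */
  def enqueue(epoch: Long): Unit = {
    if (pendingEpochs.size >= maxSize) {
      throw new IllegalStateException(
        s"Epoch backlog reached its maximum size of $maxSize")
    }
    pendingEpochs += epoch
  }

  /** Drops an epoch from the backlog once it has been committed. */
  def commit(epoch: Long): Unit = pendingEpochs -= epoch

  def size: Int = pendingEpochs.size
}

object BoundedEpochBacklogDemo {
  def main(args: Array[String]): Unit = {
    val backlog = new BoundedEpochBacklog(maxSize = 2)
    backlog.enqueue(1L)
    backlog.enqueue(2L)
    try {
      backlog.enqueue(3L) // the third pending epoch crosses the threshold
    } catch {
      case e: IllegalStateException =>
        println(s"Query would stop: ${e.getMessage}")
    }
  }
}
```

The point of the design is that memory use is capped by `maxSize`, at the cost of stopping the query instead of slowing it down.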

How was this patch tested?

Existing + additional unit tests.

SparkQA · 2018-11-27T17:27:41Z

Test build #99328 has finished for PR 23156 at commit 72733c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gaborgsomogyi · 2018-11-27T17:33:19Z

cc @jose-torres @HeartSaVioR

arunmahadevan · 2018-12-05T22:07:30Z

Rather than controlling the queue sizes it would be better to limit the max epoch backlog and fail the query once that threshold is reached. There already seems to be a patch that attempted to address this: #21392

gaborgsomogyi · 2018-12-06T08:56:07Z

@arunmahadevan I don't fully understand your comment:

> Rather than controlling the queue sizes it would be better to limit the max epoch backlog and fail the query once that threshold is reached.

I've written the following in the PR description:

> If the related threshold is reached then the query will stop with IllegalStateException.

AFAIK max epoch backlog == epochsWaitingToBeCommitted, which is a queue,
but that's not the only unbounded part of EpochCoordinator (please see the additional unit tests).
As a result I've limited partitionOffsets and partitionCommits as well.

arunmahadevan · 2018-12-06T17:15:36Z

@gaborgsomogyi what I meant was rather than exposing a config to control the internal queue sizes, we could have a higher level config like the max pending epochs. This would act as a back pressure mechanism to stop further processing until the pending epochs are committed. I assume this would also put a limit on the three queues.

gaborgsomogyi · 2018-12-06T17:26:32Z

@arunmahadevan As I understand it, this is more about renaming the config than changing what the PR does. Have I understood correctly?

Stopping the query instead of applying backpressure was already agreed on in other PRs; please check them. If the backlog reaches 10k items there is no way back.

SparkQA · 2018-12-07T13:30:52Z

Test build #99822 has finished for PR 23156 at commit b0c5056.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2018-12-10T10:32:13Z

I'd avoid jumping into anything regarding continuous mode until the overall design of continuous mode (including aggregation and join) is clear and stabilized.

gaborgsomogyi · 2018-12-10T10:39:35Z

I thought this part was not affected. Who leads it? I'm asking because I haven't seen progress anywhere.

HeartSaVioR · 2018-12-10T11:09:36Z

I think @jose-torres previously led the feature.

gaborgsomogyi · 2018-12-10T11:13:38Z

Ah, ok. This solution was agreed with him on #20936.

gaborgsomogyi · 2018-12-10T11:19:33Z

BTW, I'm coming back to your cleanup PR, but it takes some time to switch context :)

HeartSaVioR · 2018-12-10T11:20:44Z

@gaborgsomogyi No problem :) When you get some more time, please take a look at my other PRs as well. Some of them are shorter.

SparkQA · 2019-01-22T23:20:56Z

Test build #101550 has finished for PR 23156 at commit b0c5056.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2019-01-24T21:15:26Z

Test build #101640 has finished for PR 23156 at commit 357f834.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gaborgsomogyi · 2019-01-24T21:21:07Z

ping @jose-torres

jose-torres

A few comments, but I agree with the general strategy.

...src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala

sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala

gaborgsomogyi · 2019-01-25T21:45:01Z

Yay @jose-torres, congratulations on becoming a committer! 🙂
I'll address the suggestions next week.

SparkQA · 2019-01-28T19:02:33Z

Test build #101760 has finished for PR 23156 at commit f6bc301.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gaborgsomogyi · 2019-02-04T09:15:29Z

ping @jose-torres

gaborgsomogyi · 2019-02-10T18:34:51Z

cc @dongjoon-hyun

dongjoon-hyun · 2019-02-10T19:30:31Z

Thank you for pinging me, @gaborgsomogyi. Retest this please.

SparkQA · 2019-02-10T23:46:06Z

Test build #102158 has finished for PR 23156 at commit f6bc301.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gaborgsomogyi · 2019-02-19T09:52:29Z

cc @vanzin

gaborgsomogyi · 2019-02-21T17:58:40Z

This has been sitting here for a long time and I think I've resolved all the comments. Can someone pick this up?

...src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/EpochCoordinatorSuite.scala

gaborgsomogyi · 2019-02-22T14:57:09Z

@vanzin thanks for your time!

SparkQA · 2019-02-22T18:36:17Z

Test build #102657 has finished for PR 23156 at commit 96152da.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2019-02-22T18:47:01Z

Test build #102662 has finished for PR 23156 at commit 43e61ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-22T19:29:56Z

Test build #102664 has finished for PR 23156 at commit 9324c90.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-25T14:17:59Z

Test build #102746 has finished for PR 23156 at commit 8a9b67f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jose-torres · 2019-02-25T22:25:35Z

(The PR looks fine to me modulo @vanzin 's review comments - sorry I dropped it for so long.)

vanzin

Just minor things. I'm not a big fan of checking exact exception messages in tests, but in this case it seems ok.

...re/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala

sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/EpochCoordinatorSuite.scala

SparkQA · 2019-02-27T14:08:09Z

Test build #102820 has finished for PR 23156 at commit d67db64.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-02-27T17:52:26Z

Merging to master.

zsxwing

My 2 cents: when a query fails due to the queue capacity, it's already falling behind. What the user will probably do is just restart the query, and it will take time to bring the query back, which will make the situation worse.

A better way to solve this problem is to make continuous processing support backpressure.

zsxwing · 2019-03-18T16:51:23Z

...src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala

+  def stopInNewThread(error: Throwable): Unit = {
+    if (failure.compareAndSet(null, error)) {
+      logError(s"Query $prettyIdString received exception $error")
+      stopInNewThread()


Looks like there is a race here. The query stop may happen before the continuous-execution thread checks failure, and the query will then just stop without any exception, as if someone had stopped it manually.
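The ordering concern can be illustrated with a minimal sketch. It does not reproduce the race and it is not Spark's ContinuousExecution: `MiniQuery`, `stopRequested` and `run` are hypothetical names, and a polled flag is assumed here in place of thread interruption. What it shows is that the first error is recorded with `compareAndSet` before the stop is requested, and that the error only surfaces if the execution thread consults `failure` on the path by which it leaves its loop.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical miniature of a query; names loosely mirror the diff above,
// but this is NOT Spark's ContinuousExecution.
final class MiniQuery {
  private val failure = new AtomicReference[Throwable](null)
  @volatile private var stopRequested = false

  /** Plain stop, as a user would request it; records no failure. */
  def stop(): Unit = {
    stopRequested = true
  }

  /** First error wins; later errors are ignored by compareAndSet. */
  def stopInNewThread(error: Throwable): Unit = {
    if (failure.compareAndSet(null, error)) {
      val stopper = new Thread(() => stop())
      stopper.start()
    }
  }

  /** Runs until a stop is requested, then reports the recorded failure, if any. */
  def run(): Option[Throwable] = {
    while (!stopRequested) Thread.sleep(5)
    // If this check were skipped on some exit path, a failed query would be
    // indistinguishable from one that was stopped manually.
    Option(failure.get())
  }
}

object MiniQueryDemo {
  def main(args: Array[String]): Unit = {
    val query = new MiniQuery
    query.stopInNewThread(new IllegalStateException("epoch backlog too large"))
    query.stopInNewThread(new RuntimeException("ignored, a failure is already recorded"))
    println(query.run().map(_.getMessage))
  }
}
```

Because `failure` is written before `stopRequested`, any thread that observes the stop in this sketch also observes the error; the question raised in the review is whether every exit path of the real execution thread performs that check.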

…cution

## What changes were proposed in this pull request?

Continuous processing is waiting on epochs which are not yet complete (for example one partition is not making progress) and stores pending items in queues. These queues are unbounded and can consume up all the memory easily. In this PR I've added `spark.sql.streaming.continuous.epochBacklogQueueSize` configuration possibility to make them bounded. If the related threshold reached then the query will stop with `IllegalStateException`.

## How was this patch tested?

Existing + additional unit tests.

Closes apache#23156 from gaborgsomogyi/SPARK-24063.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-24063][SS] Add maximum epoch queue threshold for ContinuousExe…

72733c5

…cution.

Merge branch 'master' into SPARK-24063

b0c5056

Merge branch 'master' into SPARK-24063

357f834

jose-torres reviewed Jan 25, 2019

withSQLConf used and test commented

f6bc301

vanzin reviewed Feb 21, 2019

gaborgsomogyi added 3 commits February 22, 2019 14:44

Review fixes:

96152da

* CONSTANT.key used in tests * Removed newline hell

Merge branch 'master' into SPARK-24063

43e61ef

AtomicReference used for failure reference

9324c90

Made config public

8a9b67f

vanzin reviewed Feb 26, 2019

gaborgsomogyi added 2 commits February 27, 2019 10:43

Review fix

41656f1

Review fix

d67db64

vanzin closed this in c4bbfd1 Feb 27, 2019

zsxwing reviewed Mar 18, 2019

attilapiros mentioned this pull request Apr 4, 2019

[SPARK-27355][SS] Make query execution more sensitive to epoch message late or lost #24283

Closed
