
[SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue #45367

Closed
wants to merge 11 commits

Conversation

TakawaAkirayo
Contributor

@TakawaAkirayo TakawaAkirayo commented Mar 4, 2024

What changes were proposed in this pull request?

Add a config spark.scheduler.listenerbus.exitTimeout (default 0: wait until the dispatch thread exits).
Before this PR: on stop, the event queue waits for the remaining events to drain completely.
After this PR: the user can control this behavior (wait for a complete drain or not) via the Spark config.
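A minimal sketch of the intended stop-time behavior (illustrative only, assuming a join-with-timeout approach; names such as stopDispatch and exitTimeoutMs are placeholders, not the actual AsyncEventQueue code):

```scala
// Illustrative sketch: bound the wait for the dispatch thread on stop.
// exitTimeoutMs stands in for spark.scheduler.listenerbus.exitTimeout.
def stopDispatch(dispatchThread: Thread, exitTimeoutMs: Long): Unit = {
  if (exitTimeoutMs <= 0) {
    // Default (0): block until the dispatch thread has drained all remaining events.
    dispatchThread.join()
  } else {
    // Bounded wait: give the dispatch thread at most exitTimeoutMs to finish,
    // then proceed with shutdown even if events are still queued.
    dispatchThread.join(exitTimeoutMs)
  }
}
```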

Why are the changes needed?

#### Problem statement

SparkContext.stop() hangs for a long time in LiveEventBus.stop() when listeners are slow.

#### User scenarios

We have a centralized service with multiple instances that regularly executes users' scheduled tasks.
For each user task within a service instance, the process is as follows (a code sketch follows the list):

1. Create a Spark session directly within the service process, with an account defined in the task.
2. Instantiate listeners by class name and register them with the SparkContext. The JARs containing the listener classes are uploaded to the service by the user.
3. Prepare resources.
4. Run the user logic (Spark SQL).
5. Stop the Spark session by invoking SparkSession.stop().
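A sketch of these steps (illustrative only; the listener class name com.example.UserListener is hypothetical):

```scala
import org.apache.spark.scheduler.SparkListener
import org.apache.spark.sql.SparkSession

// Step 1: create a Spark session directly within the service process.
val spark = SparkSession.builder()
  .appName("scheduled-user-task")
  .getOrCreate()

// Step 2: instantiate a user-provided listener by class name (from a
// user-uploaded JAR) and register it with the SparkContext.
val listener = Class.forName("com.example.UserListener")
  .getDeclaredConstructor().newInstance().asInstanceOf[SparkListener]
spark.sparkContext.addSparkListener(listener)

// Steps 3-4: prepare resources and run the user's Spark SQL.
spark.sql("SELECT 1").collect()

// Step 5: stop the Spark session.
spark.stop()
```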

In step 5, SparkSession.stop() waits for the LiveEventBus to stop, which requires the remaining events to be completely drained by each listener.

Since the listeners are implemented by users, we cannot prevent heavy work inside a listener on each event. There are cases where a single heavy job has over 30,000 tasks, and it can take 30 minutes for the listeners to process all the remaining events, because a listener may take a coarse-grained global lock and update its internal status in a remote database on every event (a hypothetical listener of this shape is sketched below).

This kind of delay affects other user tasks in the queue. Therefore, from the server-side perspective, we need a guarantee that the stop operation finishes quickly.
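A hypothetical listener of the shape described above (not part of this PR), showing why draining tens of thousands of task-end events can take this long:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical user listener: every event takes a coarse-grained lock and
// performs a slow remote update, so draining 30,000+ events adds up to
// many minutes of work on stop.
class SlowAuditListener extends SparkListener {
  private val globalLock = new Object

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = globalLock.synchronized {
    // Placeholder for updating internal status in a remote database.
    Thread.sleep(50)
  }
}
```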

Does this PR introduce any user-facing change?

Adds the config spark.scheduler.listenerbus.exitTimeout.
The default is 0, which waits for the events to drain completely. If set to a positive integer, the LiveEventBus will wait at least that duration (in ms) before it stops, irrespective of whether the events have been drained or not.
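For example, a service could set the config when building the session (assuming the config key from the description above; 60000 is an arbitrary value):

```scala
import org.apache.spark.sql.SparkSession

// With a bounded exit timeout, SparkSession.stop() can return after roughly
// this duration even if slow listeners have not drained every remaining event.
val spark = SparkSession.builder()
  .appName("bounded-listenerbus-stop")
  .config("spark.scheduler.listenerbus.exitTimeout", "60000")
  .getOrCreate()
```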

How was this patch tested?

By unit tests, and by verifying the feature in our production environment.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Mar 4, 2024
@LuciferYang
Contributor

@TakawaAkirayo Doesn't master have this issue? The PR should be submitted to master first, then backported to other branches if needed.

@TakawaAkirayo
Contributor Author

@LuciferYang The master branch has this issue too; this PR is already submitted against the master branch. I added [3.5] to the title because I thought I had to specify a target version for the PR. So what's the next action? Please advise, thanks!

@LuciferYang
Contributor

LuciferYang commented Mar 4, 2024

I added [3.5] to the title because I thought I had to specify a target version for the PR

@TakawaAkirayo But judging from the issue type in JIRA, this is an improvement rather than a bug fix. However, the Apache Spark community has a backporting policy that allows bug fixes only.


If you really want to backport this to branch-3.5 when it is merged, please change the issue type to BUG and provide some background about the bug.

@TakawaAkirayo TakawaAkirayo changed the title [SPARK-47253][CORE][3.5] Allow LiveEventBus to stop without the completely draining of event queue [SPARK-47253][CORE] Allow LiveEventBus to stop without the completely draining of event queue Mar 4, 2024
@TakawaAkirayo
Contributor Author

TakawaAkirayo commented Mar 4, 2024

@LuciferYang I believe it's an improvement rather than a bug. If it's not backported to version 3.5, will this change be included in the next release? I'm okay with that for now, because I can change the code in our internal Spark version and use it. Once this change is available in a public version, I can switch to the public Spark binary.

Updated the version to 4.0.0 in JIRA and the start version to 4.0.0 in the config.
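For context, a Spark config entry with a 4.0.0 start version is declared roughly like this inside org.apache.spark.internal.config (a sketch under that assumption; the constant name, key, doc text, and default of the merged entry may differ):

```scala
import java.util.concurrent.TimeUnit

// Sketch only; ConfigBuilder is Spark's internal builder for config entries.
private[spark] val LISTENER_BUS_EXIT_TIMEOUT =
  ConfigBuilder("spark.scheduler.listenerbus.exitTimeout")
    .doc("Time in milliseconds to wait for the dispatch thread to exit when " +
      "stopping the listener bus. 0 means waiting until all remaining events " +
      "are drained.")
    .version("4.0.0")
    .timeConf(TimeUnit.MILLISECONDS)
    .checkValue(_ >= 0, "The exit timeout must be non-negative.")
    .createWithDefault(0L)
```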

 // in that case.
-if (Thread.currentThread() != dispatchThread) {
+if (waitForEventDispatchExit() && Thread.currentThread() != dispatchThread) {
Contributor
Shall we remove waitForEventDispatchExit() and inline conf.get(LISTENER_BUS_EVENT_QUEUE_WAIT_FOR_EVENT_DISPATCH_EXIT_ON_STOP) ?

Contributor Author
@beliefer Thanks for the advice! I think using waitForEventDispatchExit is more readable here; it explains what it does. With LISTENER_BUS_EVENT_QUEUE_WAIT_FOR_EVENT_DISPATCH_EXIT_ON_STOP, you have to jump to where this conf is defined to figure out its meaning.

Contributor
The name LISTENER_BUS_EVENT_QUEUE_WAIT_FOR_EVENT_DISPATCH_EXIT_ON_STOP has better readability.
It's also not worth extracting a method that is used only once.

Contributor Author
@beliefer Sure, already committed; please have a look.

@TakawaAkirayo
Contributor Author

Hi @LuciferYang @beliefer, just a follow-up on this: what are the next action items? Do any other reviewers need to be involved?

TakawaAkirayo and others added 5 commits April 12, 2024 17:41
Co-authored-by: Mridul Muralidharan <1591700+mridulm@users.noreply.github.com>

// let the event drain now
drainWait.acquire()
assert(drained)
Contributor
Are these two lines testing any behavior ?
It is a consequence of onJobEnd concluding, right ?

Contributor Author
@mridulm Yes, this is less relevant to the main change; it's just a check that the dispatch thread should be unblocked. I have removed those two lines.

Contributor
@mridulm mridulm left a comment
Looks good to me

@mridulm
Contributor

mridulm commented Apr 13, 2024

The test failures are unrelated to this PR.

@mridulm mridulm closed this in b53c6f9 Apr 13, 2024
@mridulm
Contributor

mridulm commented Apr 13, 2024

I have updated the description, and merged to master.
Thanks for fixing this @TakawaAkirayo !
Thanks for the review @beliefer and @LuciferYang :-)

@TakawaAkirayo
Contributor Author

@mridulm @beliefer @LuciferYang Thanks for your review and guidance to improve the PR :-)

Member
@dongjoon-hyun dongjoon-hyun left a comment
+1, late LGTM. Thank you all.
