[MINOR] Reference-count shared Spark context to prevent concurrent shutdown #2516
Merged
Codecov Report

Coverage Diff (main vs. #2516)
Metric        main      #2516     +/-
Coverage      71.56%    71.57%
Complexity    49125     49123     -2
Files         1575      1575
Lines         189784    189814    +30
Branches      37232     37239     +7
Hits          135823    135853    +30
Misses        43470     43467     -3
Partials      10491     10494     +3
4c40609 to 530be01
The singleton JVM-wide SparkContext in SparkExecutionContext was stopped unconditionally at the end of every DMLScript.execute() run. When two DML script executions run concurrently in the same JVM (e.g. surefire parallel tests with parallel=classes and threadCount=2), one execution finishing its script would call close() and stop the context while the other execution still had an in-flight Spark job. The wedged job never receives a completion event from the now-stopped DAGScheduler, so it hangs until the 1000s test watchdog fires, surfacing as "Job N cancelled because SparkContext was shut down" and intermittently blowing the suite's CI time budget.

Reference-count the active users of the shared context and separate counting from teardown:
- enterSparkExecution()/exitSparkExecution() maintain a static count of active DML executions (DMLScript.execute registers after creating a SparkExecutionContext and releases in the finally block).
- close() never changes the count; it only stops the context when no registered execution remains.

This way an unpaired close() (a caller that borrows the shared context but never registered, e.g. tests or parfor children) can no longer tear down a context another execution is still using, and a finishing execution cannot stop the context mid-job. Single-run behavior is unchanged (enter -> 1, exit -> 0, close -> stop).

When the context is force-discarded outside close() (resetSparkContextStatic(), and the re-init path in getSparkContextStatic() when the previous context was stopped), the count is reset to zero so a stale registration cannot skip a future legitimate stop. The keep-alive branch logs at debug level for diagnosability. The invariant covers only the DMLScript.execute() path that owns the context; MLContext manages an externally-owned context separately and is unaffected.
SparkContextReferenceCountTest covers two scenarios that both fail on the old unconditional-stop code and pass now: a finishing execution keeps the context alive while another is active, and an unpaired close() does not stop a context another execution still uses.
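The counting scheme described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the SystemDS code: FakeSharedContext, SharedContextGuard, and all method names are hypothetical stand-ins, and the real change lives in SparkExecutionContext and DMLScript.execute(). The sketch assumes a single lock guards both the count and the context reference, so that checking the count and stopping the context happen as one step.

```java
// Minimal sketch of the counting scheme; every name here is a hypothetical
// stand-in and not a SystemDS class.
final class FakeSharedContext {
    private volatile boolean stopped;
    void stop() { stopped = true; }
    boolean isStopped() { return stopped; }
}

final class SharedContextGuard {
    private static final Object LOCK = new Object();
    private static FakeSharedContext context;
    private static int activeExecutions;

    private SharedContextGuard() {}

    // Lazily creates the shared context. If the previous one was stopped
    // elsewhere, discard it and clear the count so no stale registration
    // survives into the new context's lifetime.
    static FakeSharedContext getContext() {
        synchronized (LOCK) {
            if (context != null && context.isStopped()) {
                context = null;
                activeExecutions = 0;
            }
            if (context == null)
                context = new FakeSharedContext();
            return context;
        }
    }

    // Register one active execution.
    static void enterExecution() {
        synchronized (LOCK) { activeExecutions++; }
    }

    // Release one registration; never drops below zero.
    static void exitExecution() {
        synchronized (LOCK) { if (activeExecutions > 0) activeExecutions--; }
    }

    // Never changes the count. Stops the context only when no registered
    // execution remains; returns true if teardown was allowed.
    static boolean close() {
        synchronized (LOCK) {
            if (activeExecutions > 0)
                return false; // keep-alive branch
            if (context != null) { context.stop(); context = null; }
            return true;
        }
    }

    // Force-discard outside close(): stop the context and zero the count.
    static void reset() {
        synchronized (LOCK) {
            if (context != null) context.stop();
            context = null;
            activeExecutions = 0;
        }
    }

    public static void main(String[] args) {
        FakeSharedContext ctx = getContext();
        enterExecution();                 // execution A
        enterExecution();                 // execution B
        exitExecution();                  // A finishes first
        System.out.println("close while B active: " + close());
        exitExecution();                  // B finishes
        System.out.println("close after B exits: " + close());
        System.out.println("context stopped: " + ctx.isStopped());
    }
}
```

In this sketch both scenarios from the test description reduce to the same check: close() returns without stopping anything as long as at least one registration is outstanding, whether the caller was a finishing execution or one that never registered.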
530be01 to e51b6eb
The JVM-wide singleton SparkContext in SparkExecutionContext was stopped unconditionally at the end of every DMLScript.execute(). With surefire parallel tests (parallel=classes, threadCount=2), one finishing execution would stop the context while a concurrent execution still had an in-flight Spark job.

Change
Make close() reference-counted. SparkExecutionContext.enterSparkExecution() is called from DMLScript.execute() when the execution context is a SparkExecutionContext, and close() only stops the shared context once the active-execution count reaches zero.

Notes
- close() is always called in the finally block.
- Only the DMLScript.execute() path that owns the context is covered; MLContext manages an externally-owned context separately and is unaffected.
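To illustrate the register / run / release-in-finally shape under real concurrency, here is a small self-contained sketch with two overlapping executions. It is an assumption-laden illustration, not the PR's code: ExecutionLifecycleSketch and its methods are hypothetical, the "context" is reduced to a boolean flag, and release and the stop decision are folded into one finally block for brevity.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in; the shared context is modeled as a boolean flag.
final class ExecutionLifecycleSketch {
    private static final Object LOCK = new Object();
    private static int active;
    private static boolean stopped;

    private ExecutionLifecycleSketch() {}

    static boolean isStopped() { synchronized (LOCK) { return stopped; } }

    static void reset() { synchronized (LOCK) { active = 0; stopped = false; } }

    // Register, run the script, then release in finally so that a failing
    // script cannot leak its registration.
    static void execute(Runnable script) {
        synchronized (LOCK) { active++; }
        try {
            script.run();
        } finally {
            synchronized (LOCK) {
                active--;                        // release this execution
                if (active == 0) stopped = true; // stop only when nobody remains
            }
        }
    }

    // Runs a slow and a fast execution that overlap. Returns whether the
    // context was still alive when the slow one checked it after the fast
    // one had already finished.
    static boolean aliveWhileOtherRuns() {
        reset();
        CountDownLatch slowStarted = new CountDownLatch(1);
        CountDownLatch fastDone = new CountDownLatch(1);
        boolean[] alive = new boolean[1];
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<?> slow = pool.submit(() -> execute(() -> {
                slowStarted.countDown();
                await(fastDone);
                alive[0] = !isStopped();
            }));
            Future<?> fast = pool.submit(() -> {
                await(slowStarted);
                execute(() -> { });
                fastDone.countDown();
            });
            fast.get(5, TimeUnit.SECONDS);
            slow.get(5, TimeUnit.SECONDS);
            return alive[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The latches force the interleaving that caused the original hang: the fast execution starts and finishes entirely inside the slow one. With an unconditional stop in the finally block the slow execution would observe a stopped context; with the count it does not, and the context is stopped only once the slow execution releases as well.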