[SPARK-19267][SS] Fix a race condition when stopping StateStore #16627
zsxwing wants to merge 4 commits into apache:master from zsxwing:SPARK-19267
There is another potential issue here: if a SparkContext is created after the `SparkEnv.get == null` check, the following stop may cancel a new, valid task. However, I don't think that will happen in practice, so I have not fixed it.
Test build #71545 has finished for PR 16627 at commit
Test build #71624 has finished for PR 16627 at commit
tdas left a comment:
I had a more surgical fix in mind when I suggested putting the multiple volatile objects into a class so that we can replace the whole instance at once. Here is what I had in mind.
https://github.com/apache/spark/compare/master...tdas:state-store-fix?expand=1
What do you think?
Why do these need to be synchronized by an external lock?
This locks the state store while maintenance is going on. Since it uses the same lock as the external lock (`this`), the task using the store will block on the maintenance task.
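One common way to avoid that kind of blocking is to hold the lock only long enough to copy the set of providers, then run the slow work outside it. The following is a minimal sketch under assumptions, not the code from this PR: `ProviderRegistry` and `register` are hypothetical names, and each provider is reduced to a `() => Unit` maintenance callback.

```scala
import scala.collection.mutable

// Hypothetical registry; stands in for the map of loaded providers.
class ProviderRegistry {
  private val loadedProviders = mutable.HashMap[String, () => Unit]()

  // Short critical section: only the map update is guarded.
  def register(id: String, maintain: () => Unit): Unit =
    loadedProviders.synchronized { loadedProviders.put(id, maintain) }

  def doMaintenance(): Unit = {
    // Copy the entries while holding the lock...
    val snapshot = loadedProviders.synchronized { loadedProviders.toSeq }
    // ...then run the (possibly slow) maintenance without holding it,
    // so threads calling register() are not blocked in the meantime.
    snapshot.foreach { case (_, maintain) => maintain() }
  }
}
```

The trade-off is that a provider registered while maintenance is running is only picked up on the next round.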
@tdas could you take another look? I fixed some minor issues in your patch.
@volatile private var maintenanceTask: ScheduledFuture[_] = null
@volatile private var _coordRef: StateStoreCoordinatorRef = null
class MaintenanceTask(periodMs: Long, task: => Unit, onError: => Unit) {
You should document the properties of this class: that it automatically cancels the periodic task if there is an exception, and what `onError` is for.
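To make the requested properties concrete, here is a minimal sketch of a class with the same constructor signature. It is an illustration under assumptions, not the merged code: the executor setup, the daemon thread factory, and the `stop`/`isRunning` bodies are guesses at one reasonable shape. It relies on the documented behaviour of `ScheduledExecutorService.scheduleAtFixedRate`, which suppresses all later runs once one execution throws.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, ThreadFactory, TimeUnit}
import scala.util.control.NonFatal

// Sketch only. Runs `task` every `periodMs` milliseconds on a private thread.
// If `task` throws, `onError` is invoked once and the periodic task ends.
class MaintenanceTask(periodMs: Long, task: => Unit, onError: => Unit) {
  // Assumption: a daemon thread, so a forgotten task cannot keep the JVM alive.
  private val executor = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "maintenance-task-sketch")
      t.setDaemon(true)
      t
    }
  })

  private val runnable = new Runnable {
    override def run(): Unit = {
      try task
      catch {
        case NonFatal(e) =>
          onError   // let the owner clean up, e.g. unload its providers
          throw e   // rethrowing makes the executor cancel future runs
      }
    }
  }

  private val future: ScheduledFuture[_] =
    executor.scheduleAtFixedRate(runnable, periodMs, periodMs, TimeUnit.MILLISECONDS)

  // Cancel without interrupting a run that is already in progress.
  def stop(): Unit = {
    future.cancel(false)
    executor.shutdown()
  }

  // False once the task has been cancelled or has failed.
  def isRunning: Boolean = !future.isDone
}
```

Because each instance owns its own executor and future, a stale instance that fails can only cancel itself, not a newer task created after a restart.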
private def startMaintenanceIfNeeded(): Unit = loadedProviders.synchronized {
val env = SparkEnv.get
if (maintenanceTask == null && env != null) {
if (env != null && (maintenanceTask == null || !maintenanceTask.isRunning)) {
Can you replace this with the method isMaintenanceRunning?
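A sketch of what that refactoring could look like, with the null-and-running check factored into `isMaintenanceRunning`. Everything except the two method names taken from this thread is hypothetical: `Stoppable`, `MaintenanceGuard`, `envAvailable` (standing in for the `SparkEnv.get != null` check), and `start` are placeholders so the example stands alone.

```scala
// Hypothetical stand-in for the maintenance task type.
trait Stoppable {
  def isRunning: Boolean
  def stop(): Unit
}

// Hypothetical holder; `envAvailable` plays the role of `SparkEnv.get != null`.
class MaintenanceGuard(envAvailable: () => Boolean, start: () => Stoppable) {
  private val lock = new Object
  private var task: Stoppable = null // guarded by `lock`

  // Single place that answers "is there a live maintenance task?"
  def isMaintenanceRunning: Boolean = lock.synchronized {
    task != null && task.isRunning
  }

  // Starts a task only when the environment exists and no live task is present.
  // A task that has died is replaced by a fresh one.
  def startMaintenanceIfNeeded(): Unit = lock.synchronized {
    if (envAvailable() && !isMaintenanceRunning) {
      task = start()
    }
  }

  def stop(): Unit = lock.synchronized {
    if (task != null) {
      task.stop()
      task = null
    }
  }
}
```

Calling `isMaintenanceRunning` from inside `startMaintenanceIfNeeded` is safe here because JVM monitors are reentrant.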
Test build #71739 has finished for PR 16627 at commit
LGTM, pending tests.
Test build #71744 has finished for PR 16627 at commit
## What changes were proposed in this pull request?

There is a race condition when stopping StateStore which makes `StateStoreSuite.maintenance` flaky. `StateStore.stop` doesn't wait for the running task to finish, and an out-of-date task may fail `doMaintenance` and cancel the new task. Here is a reproducer: zsxwing@dde1b5b

This PR adds MaintenanceTask to eliminate the race condition.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16627 from zsxwing/SPARK-19267.

(cherry picked from commit ea31f92)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>