fix issue with CuratorLoadQueuePeon shutting down executors it does not own #8140

Merged 3 commits into apache:master on Jul 24, 2019

clintropolis
Member

Fixes #8137.

Description

#7088 implemented parallel loading for CuratorLoadQueuePeon, but the stop method of each peon incorrectly shuts down the peon executor and the callback executor, both of which are shared by all peons. As a result, the coordinator operates correctly until a server disappears for any reason. Stopping that server's peon terminates the shared executors, and later load or drop requests on the remaining peons fail with an exception of the form:

2019-07-23T19:17:27,993 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Caught exception, ignoring so that schedule keeps going.: {class=org.apache.druid.server.coordinator.DruidCoordinator, exceptionType=class java.util.concurrent.RejectedExecutionException, exceptionMessage=Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7d63afe1 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@576bd596[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 72]}
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7d63afe1 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@576bd596[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 72]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[?:1.8.0_192]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_192]
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326) ~[?:1.8.0_192]
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) ~[?:1.8.0_192]
	at java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632) ~[?:1.8.0_192]
	at org.apache.druid.server.coordinator.CuratorLoadQueuePeon.dropSegment(CuratorLoadQueuePeon.java:194) ~[classes/:?]
	at org.apache.druid.server.coordinator.helper.DruidCoordinatorCleanupUnneeded.run(DruidCoordinatorCleanupUnneeded.java:62) ~[classes/:?]
	at org.apache.druid.server.coordinator.DruidCoordinator$CoordinatorRunnable.run(DruidCoordinator.java:667) [classes/:?]
	at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:559) [classes/:?]
	at org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:552) [classes/:?]
	at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:92) [classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_192]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_192]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_192]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
2019-07-23T19:17:27,994 INFO [Coordinator-Exec--0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2019-07-23T19:17:27.994Z","service":"druid/coordinator","host":"localhost:8081","version":"","severity":"component-failure","description":"Caught exception, ignoring so that schedule keeps going.","data":{"class":"org.apache.druid.server.coordinator.DruidCoordinator","exceptionType":"java.util.concurrent.RejectedExecutionException","exceptionMessage":"Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7d63afe1 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@576bd596[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 72]","exceptionStackTrace":"java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7d63afe1 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@576bd596[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 72]\n\tat java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)\n\tat java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632)\n\tat org.apache.druid.server.coordinator.CuratorLoadQueuePeon.dropSegment(CuratorLoadQueuePeon.java:194)\n\tat org.apache.druid.server.coordinator.helper.DruidCoordinatorCleanupUnneeded.run(DruidCoordinatorCleanupUnneeded.java:62)\n\tat org.apache.druid.server.coordinator.DruidCoordinator$CoordinatorRunnable.run(DruidCoordinator.java:667)\n\tat org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:559)\n\tat 
org.apache.druid.server.coordinator.DruidCoordinator$2.call(DruidCoordinator.java:552)\n\tat org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:92)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"}}]

as described in #8137.

This PR fixes the issue by no longer shutting down the shared executors in a peon's stop method.
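
The failure mode can be reproduced outside Druid with a small sketch. The Peon class below is a hypothetical stand-in, not CuratorLoadQueuePeon itself, and the shutsDownExecutorOnStop flag exists only to contrast the old and new behaviour of stop. It assumes a plain java.util.concurrent executor with the default rejection policy.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class SharedExecutorDemo {
    /** Hypothetical stand-in for a load queue peon; not Druid's actual class. */
    static class Peon {
        private final ExecutorService sharedExecutor;
        private final boolean shutsDownExecutorOnStop;

        Peon(ExecutorService sharedExecutor, boolean shutsDownExecutorOnStop) {
            this.sharedExecutor = sharedExecutor;
            this.shutsDownExecutorOnStop = shutsDownExecutorOnStop;
        }

        void dropSegment() {
            // submit() throws RejectedExecutionException once the executor is shut down.
            Runnable task = () -> { };
            sharedExecutor.submit(task);
        }

        void stop() {
            if (shutsDownExecutorOnStop) {
                // The bug: this executor is shared, so every other peon loses it too.
                sharedExecutor.shutdown();
            }
        }
    }

    /** Returns true if a surviving peon can still submit work after another peon stops. */
    static boolean survivorStillWorks(boolean shutsDownExecutorOnStop) {
        ExecutorService shared = Executors.newSingleThreadExecutor();
        try {
            Peon departed = new Peon(shared, shutsDownExecutorOnStop);
            Peon survivor = new Peon(shared, shutsDownExecutorOnStop);
            departed.stop(); // simulates a server disappearing
            try {
                survivor.dropSegment();
                return true;
            } catch (RejectedExecutionException e) {
                return false;
            }
        } finally {
            // The creator of the executor, not a peon, is responsible for shutting it down.
            shared.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("old stop(): survivor works = " + survivorStillWorks(true));
        System.out.println("new stop(): survivor works = " + survivorStillWorks(false));
    }
}
```

With the old behaviour the surviving peon gets a RejectedExecutionException, matching the stack trace above. With the new behaviour it keeps working.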


This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a laptop test Druid cluster.

Contributor

@jihoonson jihoonson left a comment

Thanks for the quick fix!

@@ -335,8 +335,6 @@ public void stop()

queuedSize.set(0L);
failedAssignCount.set(0);
processingExecutor.shutdown();
Contributor

Should we shut these down somewhere? (Maybe tie LoadQueueTaskMaster to lifecycle and shutdown the execs there?)

Member Author

That seems reasonable, 👍

Member Author

Used the factory passed into the method in CliCoordinator to create the executors, which itself appears to be under lifecycle.

Member Author

Ah, that won't work...

Member Author

Ah, that won't work...

Fixed to use ExecutorServices.manageLifecycle to tie the executors to the main service lifecycle.
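
Roughly, the idea of tying an executor to a lifecycle looks like the sketch below. SimpleLifecycle and manage are hypothetical names used for illustration only. This is not Druid's Lifecycle class or the actual implementation of ExecutorServices.manageLifecycle, just a minimal model of the pattern: the component that creates the executor registers a stop hook, so the executor is shut down once, when the service stops, and not by any individual peon.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LifecycleExecutorDemo {
    /** Hypothetical minimal lifecycle: runs registered stop hooks in reverse order. */
    static class SimpleLifecycle {
        private final Deque<Runnable> stopHooks = new ArrayDeque<>();

        void addStopHook(Runnable hook) {
            stopHooks.push(hook);
        }

        void stop() {
            while (!stopHooks.isEmpty()) {
                stopHooks.pop().run();
            }
        }
    }

    /** Registers the executor for shutdown when the lifecycle stops, then returns it. */
    static <T extends ExecutorService> T manage(SimpleLifecycle lifecycle, T executor) {
        lifecycle.addStopHook(executor::shutdownNow);
        return executor;
    }

    /** Returns {isShutdown before lifecycle stop, isShutdown after lifecycle stop}. */
    static boolean[] demo() {
        SimpleLifecycle lifecycle = new SimpleLifecycle();
        ExecutorService executor = manage(lifecycle, Executors.newSingleThreadExecutor());
        boolean shutdownBeforeStop = executor.isShutdown();
        lifecycle.stop(); // the only place the shared executor is shut down
        boolean shutdownAfterStop = executor.isShutdown();
        return new boolean[] {shutdownBeforeStop, shutdownAfterStop};
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("shut down before lifecycle stop: " + result[0]);
        System.out.println("shut down after lifecycle stop: " + result[1]);
    }
}
```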

@jon-wei jon-wei merged commit 0695e48 into apache:master Jul 24, 2019
@clintropolis clintropolis deleted the fix-curator-load-queue-peon branch July 24, 2019 18:39
gianm pushed a commit to implydata/druid-public that referenced this pull request Jul 24, 2019
…ot own (apache#8140)

* fix issue with CuratorLoadQueuePeon shutting down executors it does not own

* use lifecycled executors

* maybe this
clintropolis added a commit to clintropolis/druid that referenced this pull request Jul 24, 2019
@clintropolis clintropolis added this to the 0.15.1 milestone Jul 24, 2019
fjy pushed a commit that referenced this pull request Jul 25, 2019
…ot own (#8140) (#8151)
gvsmirnov pushed a commit to Plumbr/druid that referenced this pull request Aug 22, 2019
…ot own (apache#8140) (apache#8151)