Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix forking task runner task shutdown to be more graceful #8085

Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Jul 15, 2019

Description

This PR modifies ForkingTaskRunner.shutdown to behave in the same manner as ForkingTaskRunner.stop, which closes the output stream to the forked task instead of calling destroy directly, which retains the log output on that stream allowing it to be fully written before terminating the task. I am unable to definitively determine if this means the task is not waiting for lifecycle stop to complete, or if it's strictly a logging issue, as I have collected conflicting observations. This appears to be strictly a logging issue.

Background

While looking into #8075 and #7962, noticed that the task logs stop abruptly when ForkingTaskRunner.shutdown is called, with an exception appearing in the middle-manager logs:

2019-07-13T05:13:10,009 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_wikiticker_2019-07-13T05:13:09.860Z output to: /var/folders/v0/zc_jt5n12d39kx7dn6xmg73w0000gn/T/persistent/task/index_wikiticker_2019-07-13T05:13:09.860Z/log
2019-07-13T05:13:17,271 INFO [qtp345893819-70] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_wikiticker_2019-07-13T05:13:09.860Z] because: [shut down request via HTTP endpoint]
2019-07-13T05:13:17,271 INFO [qtp345893819-70] org.apache.druid.indexing.overlord.ForkingTaskRunner - Killing process for task: index_wikiticker_2019-07-13T05:13:09.860Z
2019-07-13T05:13:17,283 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: log/index_wikiticker_2019-07-13T05:13:09.860Z.log
2019-07-13T05:13:17,284 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Exception caught during execution
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[?:1.8.0_192]
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:291) ~[?:1.8.0_192]
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_192]
    at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_192]
    at com.google.common.io.ByteStreams.copy(ByteStreams.java:175) ~[guava-16.0.1.jar:?]
    at org.apache.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:436) [classes/:?]
    at org.apache.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:221) [classes/:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
2019-07-13T05:13:17,289 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: /var/folders/v0/zc_jt5n12d39kx7dn6xmg73w0000gn/T/persistent/task/index_wikiticker_2019-07-13T05:13:09.860Z
2019-07-13T05:13:17,339 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Job's finished. Completed [index_wikiticker_2019-07-13T05:13:09.860Z] with status [FAILED]

and the task log output just ends, without the normal lifecycle stopping messaging:

...
2019-07-13T05:13:17,257 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Hydrant[FireHydrant{, queryable=wikiticker_2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z_2019-07-13T05:13:09.862Z, count=0}] hasn't persisted yet, persisting. Segment[wikiticker_2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z_2019-07-13T05:13:09.862Z]
2019-07-13T05:13:17,257 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Submitting persist runnable for dataSource[wikiticker]
2019-07-13T05:13:17,260 INFO [wikiticker-incremental-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Segment[wikiticker_2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z_2019-07-13T05:13:09.862Z], persisting Hydrant[FireHydrant{, queryable=wikiticker_2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z_2019-07-13T05:13:09.862Z, count=0}]

However, terminating the entire middle-manager does retain these logs for the tasks.

middle-manager log during entire middle-manager shutdown:

...
2019-07-13T05:15:32,931 INFO [forking-task-runner-1] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_wikiticker_2019-07-13T05:15:32.875Z output to: /var/folders/v0/zc_jt5n12d39kx7dn6xmg73w0000gn/T/persistent/task/index_wikiticker_2019-07-13T05:15:32.875Z/log
2019-07-13T05:15:40,412 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module] running shutdown hook
2019-07-13T05:15:40,414 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2019-07-13T05:15:40,415 INFO [Thread-49] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannouncing [DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middlemanager', host='localhost', bindOnHost=false, port=-1, plaintextPort=8091, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='MIDDLE_MANAGER', services={workerNodeService=WorkerNodeService{ip='clintbook', capacity=7, version='0'}}}].
2019-07-13T05:15:40,415 INFO [Thread-49] org.apache.druid.curator.announcement.Announcer - unannouncing [/druid/internal-discovery/MIDDLE_MANAGER/localhost:8091]
2019-07-13T05:15:40,450 INFO [Thread-49] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced [DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middlemanager', host='localhost', bindOnHost=false, port=-1, plaintextPort=8091, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='MIDDLE_MANAGER', services={workerNodeService=WorkerNodeService{ip='clintbook', capacity=7, version='0'}}}].
2019-07-13T05:15:40,450 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.announcement.Announcer.stop()] on object[org.apache.druid.curator.announcement.Announcer@ebda593].
2019-07-13T05:15:40,450 INFO [Thread-49] org.apache.druid.curator.announcement.Announcer - Stopping announcer
2019-07-13T05:15:40,452 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2019-07-13T05:15:40,452 INFO [Thread-49] org.apache.druid.server.initialization.jetty.JettyServerModule - Stopping Jetty Server...
2019-07-13T05:15:40,464 INFO [Thread-49] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@da28d03{HTTP/1.1,[http/1.1]}{0.0.0.0:8091}
2019-07-13T05:15:40,465 INFO [Thread-49] org.eclipse.jetty.server.session - node0 Stopped scavenging
2019-07-13T05:15:40,469 INFO [Thread-49] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@33eb6758{/,null,UNAVAILABLE}
2019-07-13T05:15:40,477 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2019-07-13T05:15:40,477 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.worker.WorkerTaskMonitor.stop() throws java.lang.Exception] on object[org.apache.druid.indexing.worker.WorkerTaskMonitor@12fcc71f].
2019-07-13T05:15:40,477 INFO [Thread-49] org.apache.druid.indexing.overlord.ForkingTaskRunner - Unregistered listener [WorkerTaskManager]
2019-07-13T05:15:40,477 INFO [Thread-49] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_wikiticker_2019-07-13T05:15:32.875Z].
2019-07-13T05:15:40,478 INFO [Thread-49] org.apache.druid.indexing.overlord.ForkingTaskRunner - Waiting up to 300,000ms for shutdown.
2019-07-13T05:15:40,697 INFO [forking-task-runner-1-[index_wikiticker_2019-07-13T05:15:32.875Z]] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[130] for task: index_wikiticker_2019-07-13T05:15:32.875Z
2019-07-13T05:15:40,699 INFO [forking-task-runner-1] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: log/index_wikiticker_2019-07-13T05:15:32.875Z.log
2019-07-13T05:15:40,699 INFO [forking-task-runner-1] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_wikiticker_2019-07-13T05:15:32.875Z] status changed to [FAILED].
2019-07-13T05:15:40,700 ERROR [forking-task-runner-1] com.google.common.util.concurrent.ExecutionList - RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@4b6e74ad with executor com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService@115e899b
java.util.concurrent.RejectedExecutionException: Task org.apache.druid.indexing.worker.WorkerTaskManager$$Lambda$48/474419945@7ade2f5a rejected from java.util.concurrent.ThreadPoolExecutor@69048cc1[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 5]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[?:1.8.0_192]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_192]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_192]
	at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) ~[?:1.8.0_192]
	at org.apache.druid.indexing.worker.WorkerTaskManager.submitNoticeToExec(WorkerTaskManager.java:180) ~[classes/:?]
	at org.apache.druid.indexing.worker.WorkerTaskManager.access$000(WorkerTaskManager.java:77) ~[classes/:?]
	at org.apache.druid.indexing.worker.WorkerTaskManager$2.onSuccess(WorkerTaskManager.java:244) ~[classes/:?]
	at org.apache.druid.indexing.worker.WorkerTaskManager$2.onSuccess(WorkerTaskManager.java:240) ~[classes/:?]
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1181) ~[guava-16.0.1.jar:?]
	at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:?]
	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) [guava-16.0.1.jar:?]
	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) [guava-16.0.1.jar:?]
	at com.google.common.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91) [guava-16.0.1.jar:?]
	at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:384) [?:1.8.0_192]
	at java.util.concurrent.FutureTask.set(FutureTask.java:233) [?:1.8.0_192]
	at java.util.concurrent.FutureTask.run(FutureTask.java:274) [?:1.8.0_192]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
2019-07-13T05:15:40,701 INFO [Thread-49] org.apache.druid.indexing.overlord.ForkingTaskRunner - Finished stopping in 224ms.
2019-07-13T05:15:40,701 INFO [Thread-49] org.apache.druid.indexing.worker.WorkerTaskManager - Stopped WorkerTaskManager.
2019-07-13T05:15:40,701 INFO [Thread-49] org.apache.druid.indexing.worker.WorkerTaskMonitor - Stopped WorkerTaskMonitor.
2019-07-13T05:15:40,701 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.discovery.DruidLeaderClient.stop()] on object[org.apache.druid.discovery.DruidLeaderClient@1e7ab390].
2019-07-13T05:15:40,701 INFO [Thread-49] org.apache.druid.discovery.DruidLeaderClient - Stopped.
2019-07-13T05:15:40,701 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[org.apache.druid.curator.discovery.ServerDiscoverySelector@22a6e998].
2019-07-13T05:15:40,704 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider.stop()] on object[org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider@7e74a380].
2019-07-13T05:15:40,704 INFO [Thread-49] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopping
2019-07-13T05:15:40,704 INFO [Thread-49] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopped
2019-07-13T05:15:40,704 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.http.client.NettyHttpClient.stop()] on object[org.apache.druid.java.util.http.client.NettyHttpClient@5bc28f40].
2019-07-13T05:15:40,730 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.worker.WorkerCuratorCoordinator.stop()] on object[org.apache.druid.indexing.worker.WorkerCuratorCoordinator@290d10ef].
2019-07-13T05:15:40,730 INFO [Thread-49] org.apache.druid.indexing.worker.WorkerCuratorCoordinator - Stopping WorkerCuratorCoordinator for worker[localhost:8091]
2019-07-13T05:15:40,730 INFO [Thread-49] org.apache.druid.curator.announcement.Announcer - Stopping announcer
2019-07-13T05:15:40,731 INFO [Thread-49] org.apache.druid.curator.announcement.Announcer - unannouncing [/druid/indexer/announcements/localhost:8091]
2019-07-13T05:15:40,737 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.metrics.MonitorScheduler.stop()] on object[org.apache.druid.java.util.metrics.MonitorScheduler@33899f7a].
2019-07-13T05:15:40,737 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[ServiceEmitter{serviceDimensions={service=druid/middlemanager, host=localhost:8091, version=}, emitter=LoggingEmitter{log=Logger{name=[org.apache.druid.java.util.emitter.core.LoggingEmitter], class[class org.apache.logging.slf4j.Log4jLogger]}, level=INFO}}].
2019-07-13T05:15:40,737 INFO [Thread-49] org.apache.druid.java.util.emitter.core.LoggingEmitter - Close: started [false]
2019-07-13T05:15:40,737 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.emitter.core.LoggingEmitter.close()] on object[LoggingEmitter{log=Logger{name=[org.apache.druid.java.util.emitter.core.LoggingEmitter], class[class org.apache.logging.slf4j.Log4jLogger]}, level=INFO}].
2019-07-13T05:15:40,738 INFO [Thread-49] org.apache.druid.curator.CuratorModule - Stopping Curator
2019-07-13T05:15:40,739 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2019-07-13T05:15:40,746 INFO [Thread-49] org.apache.zookeeper.ZooKeeper - Session: 0x100126ca33300c6 closed
2019-07-13T05:15:40,747 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
2019-07-13T05:15:40,746 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x100126ca33300c6
2019-07-13T05:15:40,747 INFO [Thread-49] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@360bc645].

and the task logs have their own lifecycle stop messaging:

2019-07-13T05:15:40,422 INFO [wikiticker-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Completed metric column[count] in 1 millis.
2019-07-13T05:15:40,423 INFO [wikiticker-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Completed metric column[deleted] in 1 millis.
2019-07-13T05:15:40,425 INFO [wikiticker-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Completed metric column[delta] in 1 millis.
2019-07-13T05:15:40,429 INFO [wikiticker-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Completed metric column[user_unique] in 4 millis.
2019-07-13T05:15:40,429 INFO [wikiticker-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Completed metric columns in 11 millis.
2019-07-13T05:15:40,443 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2019-07-13T05:15:40,443 INFO [Thread-110] org.apache.druid.server.initialization.jetty.JettyServerModule - Stopping Jetty Server...
2019-07-13T05:15:40,455 INFO [Thread-110] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@12192604{HTTP/1.1,[http/1.1]}{0.0.0.0:8100}
2019-07-13T05:15:40,456 INFO [Thread-110] org.eclipse.jetty.server.session - node0 Stopped scavenging
2019-07-13T05:15:40,458 INFO [wikiticker-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[channel] inverted with cardinality[37] in 28 millis.
2019-07-13T05:15:40,459 INFO [Thread-110] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@703fa45{/,null,UNAVAILABLE}
2019-07-13T05:15:40,468 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2019-07-13T05:15:40,468 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.server.listener.announcer.ListenerResourceAnnouncer.stop()] on object[org.apache.druid.query.lookup.LookupResourceListenerAnnouncer@3fbe503c].
2019-07-13T05:15:40,469 INFO [Thread-110] org.apache.druid.server.listener.announcer.ListenerResourceAnnouncer - Unannouncing start time on [/druid/listeners/lookups/__default/http:localhost:8100]
2019-07-13T05:15:40,469 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[org.apache.druid.indexing.worker.executor.ExecutorLifecycle@62b57479].
2019-07-13T05:15:40,470 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop()] on object[org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner@627ff1b8].
2019-07-13T05:15:40,471 INFO [Thread-110] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_wikiticker_2019-07-13T05:15:32.875Z].
2019-07-13T05:15:40,471 INFO [Thread-110] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_wikiticker_2019-07-13T05:15:32.875Z] status changed to [FAILED].
2019-07-13T05:15:40,473 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.client.cache.CaffeineCache.close()] on object[org.apache.druid.client.cache.CaffeineCache@1877f3de].
2019-07-13T05:15:40,473 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down...
2019-07-13T05:15:40,475 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.discovery.DruidLeaderClient.stop()] on object[org.apache.druid.discovery.DruidLeaderClient@526e8108].
2019-07-13T05:15:40,476 INFO [Thread-110] org.apache.druid.discovery.DruidLeaderClient - Stopped.
2019-07-13T05:15:40,476 INFO [wikiticker-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[cityName] inverted with cardinality[54] in 7 millis.
2019-07-13T05:15:40,476 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[org.apache.druid.curator.discovery.ServerDiscoverySelector@2ec99035].
2019-07-13T05:15:40,478 INFO [parent-monitor-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown.
2019-07-13T05:15:40,479 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.metrics.MonitorScheduler.stop()] on object[org.apache.druid.java.util.metrics.MonitorScheduler@5f366587].
2019-07-13T05:15:40,479 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.query.lookup.LookupReferencesManager.stop()] on object[org.apache.druid.query.lookup.LookupReferencesManager@32120956].
2019-07-13T05:15:40,479 INFO [Thread-110] org.apache.druid.query.lookup.LookupReferencesManager - LookupExtractorFactoryContainerProvider is stopping.
2019-07-13T05:15:40,479 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited, Lookup notices are not handled anymore.
2019-07-13T05:15:40,479 INFO [Thread-110] org.apache.druid.query.lookup.LookupReferencesManager - LookupExtractorFactoryContainerProvider is stopped.
2019-07-13T05:15:40,479 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.discovery.DruidLeaderClient.stop()] on object[org.apache.druid.discovery.DruidLeaderClient@546394ed].
2019-07-13T05:15:40,479 INFO [Thread-110] org.apache.druid.discovery.DruidLeaderClient - Stopped.
2019-07-13T05:15:40,479 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[org.apache.druid.curator.discovery.ServerDiscoverySelector@438c9aa7].
2019-07-13T05:15:40,480 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider.stop()] on object[org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider@33188612].
2019-07-13T05:15:40,480 INFO [Thread-110] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopping
2019-07-13T05:15:40,480 INFO [Thread-110] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopped
2019-07-13T05:15:40,480 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.http.client.NettyHttpClient.stop()] on object[org.apache.druid.java.util.http.client.NettyHttpClient@3b42121d].
2019-07-13T05:15:40,509 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[ServiceEmitter{serviceDimensions={service=druid/peon, host=localhost:8100, version=}, emitter=LoggingEmitter{log=Logger{name=[org.apache.druid.java.util.emitter.core.LoggingEmitter], class[class org.apache.logging.slf4j.Log4jLogger]}, level=DEBUG}}].
2019-07-13T05:15:40,510 INFO [Thread-110] org.apache.druid.curator.CuratorModule - Stopping Curator
2019-07-13T05:15:40,511 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2019-07-13T05:15:40,531 INFO [Thread-110] org.apache.zookeeper.ZooKeeper - Session: 0x100126ca33300c8 closed
2019-07-13T05:15:40,531 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.emitter.core.LoggingEmitter.close()] on object[LoggingEmitter{log=Logger{name=[org.apache.druid.java.util.emitter.core.LoggingEmitter], class[class org.apache.logging.slf4j.Log4jLogger]}, level=DEBUG}].
2019-07-13T05:15:40,531 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
2019-07-13T05:15:40,531 INFO [Thread-110] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@4116f66a].
2019-07-13T05:15:40,531 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x100126ca33300c8

After modifying ForkingTaskRunner.shutdown to use the same mechanism to stop the tasks as ForkingTaskRunner.stop, the IOException is no longer there, and the task logs retain their shutdown messaging.

middle-manager log:

2019-07-13T05:22:41,865 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_wikiticker_2019-07-13T05:22:41.728Z] location changed to [TaskLocation{host='localhost', port=8100, tlsPort=-1}].
2019-07-13T05:22:41,867 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_wikiticker_2019-07-13T05:22:41.728Z] status changed to [RUNNING].
2019-07-13T05:22:41,868 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_wikiticker_2019-07-13T05:22:41.728Z output to: /var/folders/v0/zc_jt5n12d39kx7dn6xmg73w0000gn/T/persistent/task/index_wikiticker_2019-07-13T05:22:41.728Z/log
2019-07-13T05:22:49,258 INFO [qtp345893819-70] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_wikiticker_2019-07-13T05:22:41.728Z] because: [shut down request via HTTP endpoint]
2019-07-13T05:22:49,258 INFO [qtp345893819-70] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_wikiticker_2019-07-13T05:22:41.728Z].
2019-07-13T05:22:49,347 INFO [qtp345893819-66] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_wikiticker_2019-07-13T05:22:41.728Z] because: [shut down request via HTTP endpoint]
2019-07-13T05:22:49,347 INFO [qtp345893819-66] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_wikiticker_2019-07-13T05:22:41.728Z].
2019-07-13T05:22:49,547 INFO [forking-task-runner-0-[index_wikiticker_2019-07-13T05:22:41.728Z]] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_wikiticker_2019-07-13T05:22:41.728Z
2019-07-13T05:22:49,558 INFO [forking-task-runner-0] org.apache.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: log/index_wikiticker_2019-07-13T05:22:41.728Z.log
2019-07-13T05:22:49,558 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_wikiticker_2019-07-13T05:22:41.728Z] status changed to [FAILED].
2019-07-13T05:22:49,559 INFO [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: /var/folders/v0/zc_jt5n12d39kx7dn6xmg73w0000gn/T/persistent/task/index_wikiticker_2019-07-13T05:22:41.728Z
2019-07-13T05:22:49,591 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Job's finished. Completed [index_wikiticker_2019-07-13T05:22:41.728Z] with status [FAILED]
2019-07-13T05:23:20,800 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-07-13T05:23:20.792Z","service":"druid/middlemanager","host":"localhost:8091","version":"","metric":"jetty/numOpenConnections","value":2}]
2019-07-13T05:23:21,252 ERROR [WorkerTaskManager-CompletedTasksCleaner] org.apache.druid.curator.discovery.ServerDiscoverySelector - No server instance found for [druid/overlord]
2019-07-13T05:23:21,382 INFO [WorkerTaskManager-CompletedTasksCleaner] org.apache.druid.indexing.worker.WorkerTaskManager - Deleting completed task[index_wikiticker_2019-07-13T05:22:41.728Z] information, overlord task status[FAILED]

and the task log has the same lifecycle stop messaging that appears when the entire middle-manager is terminated, and the tasks still retain the 'failed' status.


This PR has:

  • been self-reviewed.
  • been tested in a test Druid cluster.

@@ -714,6 +701,21 @@ private void saveRunningTasks()
}
}

private void shutdownTaskProcess(ForkingTaskRunnerWorkItem taskInfo)
Copy link
Contributor

@jihoonson jihoonson Jul 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add a comment like "this method triggers lifecycle.stop()"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sort of seems like a misleading comment, since it doesn't call that directly, it's just closing the stream of the forked task and destroying it if that encounters an exception. Also since I am not entirely sure that lifecycle stop wasn't being called previously...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I agree it probably needs some sort of javadoc/comment, i'll see what I can come up with

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added javadoc

@clintropolis
Copy link
Member Author

Removed "fixes" from description because this isn't really a complete fix for that issue.

@jon-wei
Copy link
Contributor

jon-wei commented Jul 15, 2019

I am unable to definitively determine if this means the task is not waiting for lifecycle stop to complete, or if it's strictly a logging issue, as I have collected conflicting observations.

I did a test on HadoopindexTask where I put the following code into stopGracefully, and killed a task (without this PR applied):

    try {
      FileOutputStream outputStream = new FileOutputStream("/tmp/hadoop_test_indicator");
      byte[] strToBytes = StringUtils.toUtf8("HELLO WORLD\n");
      outputStream.write(strToBytes);
      outputStream.close();
    }
    catch (Exception e) {
      log.error(e, "a");
    }

The file did get written with the contents, so this does seem to be a logging-only issue.

@clintropolis
Copy link
Member Author

Thanks for review 👍

@clintropolis clintropolis merged commit cb82d72 into apache:master Jul 23, 2019
@clintropolis clintropolis deleted the forking-task-runner-shutdown-log-fix branch July 23, 2019 18:47
clintropolis added a commit to clintropolis/druid that referenced this pull request Jul 24, 2019
* fix forking task runner shutdown to be more graceful

* javadoc
@clintropolis clintropolis added this to the 0.15.1 milestone Jul 24, 2019
gianm pushed a commit to implydata/druid-public that referenced this pull request Jul 24, 2019
* fix forking task runner shutdown to be more graceful

* javadoc
clintropolis added a commit that referenced this pull request Jul 29, 2019
)

* fix forking task runner shutdown to be more graceful

* javadoc
gvsmirnov pushed a commit to Plumbr/druid that referenced this pull request Aug 22, 2019
* fix forking task runner shutdown to be more graceful

* javadoc
seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants