I am currently testing Druid 0.19.0; I'd like to upgrade the two clusters that we run in production from 0.12.3 to 0.19.0. So far all my tests look good (really nice job!), but I noticed a strange behavior when an indexing task kicks off a MapReduce job that fails for some reason. The following logs repeat over and over, and eventually the middle manager stops:
2020-07-24T06:44:34,700 INFO org.apache.druid.indexer.JobHelper: Deleting path[/tmp/druid-indexing/test_webrequest_new_hadoop/2020-07-24T064311.660Z_5d3b1f48fbc244be97c7dbe42c49bef5]
2020-07-24T06:44:34,999 INFO org.apache.druid.indexing.worker.executor.ExecutorLifecycle: Task completed with status: {
"id" : "index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z",
"status" : "FAILED",
"duration" : 74867,
"errorMsg" : "{\"attempt_1594733405064_2112_m_000009_1\":\"Error: org.apache.druid.java.util.common.RE: Failure on ro...",
"location" : {
"host" : null,
"port" : -1,
"tlsPort" : -1
}
}
2020-07-24T06:44:35,008 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-24T06:44:35,011 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [SERVER]
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@36f40d72{HTTP/1.1,[http/1.1]}{0.0.0.0:8200}
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.session: node0 Stopped scavenging
2020-07-24T06:44:35,019 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@3bc18fec{/,null,UNAVAILABLE}
2020-07-24T06:44:35,021 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [NORMAL]
2020-07-24T06:44:35,022 INFO org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner: Starting graceful shutdown of task[index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z].
[..]
2020-07-24T06:44:39,457 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2020-07-24T06:44:39,464 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to analytics1029-eqiad-wmnet
2020-07-24T06:44:39,466 WARN org.apache.hadoop.ipc.Client: Failed to connect to server: analytics1029.eqiad.wmnet/10.64.36.129:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_252]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_252]
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1381) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1345) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) ~[hadoop-common-2.8.5.jar:?]
at com.sun.proxy.$Proxy318.getApplicationReport(Unknown Source) ~[?:?]
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:228) ~[hadoop-yarn-common-2.8.5.jar:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) ~[hadoop-common-2.8.5.jar:?]
at com.sun.proxy.$Proxy319.getApplicationReport(Unknown Source) ~[?:?]
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:480) ~[hadoop-yarn-client-2.8.5.jar:?]
at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:314) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:155) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:324) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:429) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:617) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:207) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
at org.apache.hadoop.mapreduce.tools.CLI.getJob(CLI.java:547) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:304) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) ~[hadoop-common-2.8.5.jar:?]
at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopKillMRJobIdProcessingRunner.runTask(HadoopIndexTask.java:768) ~[druid-indexing-service-0.19.0.jar:0.19.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
at org.apache.druid.indexing.common.task.HadoopIndexTask.killHadoopJob(HadoopIndexTask.java:492) ~[druid-indexing-service-0.19.0.jar:0.19.0]
at org.apache.druid.indexing.common.task.HadoopIndexTask.lambda$runInternal$0(HadoopIndexTask.java:311) ~[druid-indexing-service-0.19.0.jar:0.19.0]
at org.apache.druid.indexing.common.task.TaskResourceCleaner.clean(TaskResourceCleaner.java:50) [druid-indexing-service-0.19.0.jar:0.19.0]
at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.stopGracefully(AbstractBatchIndexTask.java:132) [druid-indexing-service-0.19.0.jar:0.19.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop(SingleTaskBackgroundRunner.java:186) [druid-indexing-service-0.19.0.jar:0.19.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.cli.CliPeon.run(CliPeon.java:306) [druid-services-0.19.0.jar:0.19.0]
at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.19.0.jar:0.19.0]
The Kerberos config works well for all other Hadoop-related tasks, such as adding segments and fetching from HDFS. And the above error does not happen if the indexing succeeds.
Affected Version
0.19.0 (upgrading from 0.12.3)
Important note: we still run CDH 5.16 as our Hadoop distribution. It works well with -Dhadoop.mapreduce.job.classloader=true, without rebuilding Druid with different dependencies.
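(For anyone reproducing this setup: the same classloader isolation can also be enabled per task through jobProperties in the Hadoop tuningConfig, rather than as a global JVM flag. A minimal sketch of the relevant fragment of an index_hadoop task spec; all values other than the property name are illustrative:)

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.classloader": "true"
    }
  }
}
```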
This generally happens when the kerberos ticket that is being used to authenticate with the Hadoop cluster has expired. If the kerberos credentials are configured in Druid using druid.hadoop.security.kerberos.principal and druid.hadoop.security.kerberos.keytab, Druid takes care of refreshing the ticket before expiry. If not, you might be relying on an external service to renew the ticket cache and it might be worth checking that out to see if the renewal is working as expected.
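For reference, a minimal sketch of the two properties mentioned above as they would appear in the service's runtime.properties (the principal, realm, and keytab path here are placeholders for your own environment):

```properties
# With these set, Druid itself refreshes the Kerberos ticket before expiry,
# instead of relying on an external renewal service (e.g. a cron'd kinit).
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.keytab
```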
This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.