
Kerberos GSS errors reported when an indexation task map-reduce job fails #10209

Closed
elukey opened this issue Jul 24, 2020 · 5 comments

Comments


elukey commented Jul 24, 2020

Hi everybody,

I am currently testing Druid 0.19.0, since I'd like to upgrade the two clusters that we run in production from 0.12.3 to 0.19.0. So far all my tests have gone well (really nice job!), but I noticed odd behavior when an indexing task kicks off a map-reduce job that fails for some reason: I get the following logs repeated over and over, and eventually the MiddleManager stops:

2020-07-24T06:44:34,700 INFO org.apache.druid.indexer.JobHelper: Deleting path[/tmp/druid-indexing/test_webrequest_new_hadoop/2020-07-24T064311.660Z_5d3b1f48fbc244be97c7dbe42c49bef5]
2020-07-24T06:44:34,999 INFO org.apache.druid.indexing.worker.executor.ExecutorLifecycle: Task completed with status: {
  "id" : "index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z",
  "status" : "FAILED",
  "duration" : 74867,
  "errorMsg" : "{\"attempt_1594733405064_2112_m_000009_1\":\"Error: org.apache.druid.java.util.common.RE: Failure on ro...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2020-07-24T06:44:35,008 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-24T06:44:35,011 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [SERVER]
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@36f40d72{HTTP/1.1,[http/1.1]}{0.0.0.0:8200}
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.session: node0 Stopped scavenging
2020-07-24T06:44:35,019 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@3bc18fec{/,null,UNAVAILABLE}
2020-07-24T06:44:35,021 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [NORMAL]
2020-07-24T06:44:35,022 INFO org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner: Starting graceful shutdown of task[index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z].

[..]

2020-07-24T06:44:39,457 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2020-07-24T06:44:39,464 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to analytics1029-eqiad-wmnet
2020-07-24T06:44:39,466 WARN org.apache.hadoop.ipc.Client: Failed to connect to server: analytics1029.eqiad.wmnet/10.64.36.129:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_252]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_252]
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1381) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1345) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy318.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:228) ~[hadoop-yarn-common-2.8.5.jar:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy319.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:480) ~[hadoop-yarn-client-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:314) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:155) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:324) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:429) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:617) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:207) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.getJob(CLI.java:547) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:304) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopKillMRJobIdProcessingRunner.runTask(HadoopIndexTask.java:768) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.killHadoopJob(HadoopIndexTask.java:492) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.lambda$runInternal$0(HadoopIndexTask.java:311) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.TaskResourceCleaner.clean(TaskResourceCleaner.java:50) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.stopGracefully(AbstractBatchIndexTask.java:132) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop(SingleTaskBackgroundRunner.java:186) [druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.CliPeon.run(CliPeon.java:306) [druid-services-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.19.0.jar:0.19.0]

The Kerberos config works well for all other Hadoop-related tasks, like adding segments or fetching from HDFS, and the above error does not happen if the indexing job succeeds.

Affected Version

0.19.0 (upgrading from 0.12.3)
Important note: we still run CDH 5.16 Hadoop. It works well with -Dhadoop.mapreduce.job.classloader=true, without rebuilding Druid with different dependencies.
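
For reference, one way to pass that flag is via the MiddleManager's peon JVM options; the snippet below is only an illustrative sketch (the heap size is a placeholder, and this may not be exactly how the flag is set in our deployment):

# middleManager runtime.properties (illustrative; heap size is a placeholder)
druid.indexer.runner.javaOpts=-server -Xmx2g -Dhadoop.mapreduce.job.classloader=true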


a2l007 commented Jul 29, 2020

This generally happens when the Kerberos ticket that is being used to authenticate with the Hadoop cluster has expired. If the Kerberos credentials are configured in Druid using druid.hadoop.security.kerberos.principal and druid.hadoop.security.kerberos.keytab, Druid takes care of refreshing the ticket before expiry. If not, you might be relying on an external service to renew the ticket cache, and it is worth checking whether that renewal is working as expected.
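
For reference, a minimal runtime.properties sketch with those two properties (the principal name and keytab path below are placeholders, not values taken from this issue):

# Common or middleManager runtime.properties; Druid renews the ticket itself
# when these are set (principal and keytab path are hypothetical examples).
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.ORG
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.keytab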


sydt2014 commented Aug 2, 2020

This is a bug!

@dkhwangbo

@sydt2014 Hi! I have the same problem. Did you find any solution?


github-actions bot commented Sep 6, 2023

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label Sep 6, 2023

github-actions bot commented Oct 4, 2023

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

github-actions bot closed this as not planned Oct 4, 2023