
Kerberos GSS errors reported when an indexation task map-reduce job fails #10209

Closed
elukey opened this issue Jul 24, 2020 · 5 comments

Comments


elukey commented Jul 24, 2020

Hi everybody,

I am currently testing Druid 0.19.0, since I'd like to upgrade the two clusters that we run in production from 0.12.3 to 0.19.0. So far all my tests have gone well (really nice job!), but I noticed odd behavior when an indexing task kicks off a map-reduce job that fails for some reason: I get the following logs repeated over and over, and eventually the MiddleManager stops:

2020-07-24T06:44:34,700 INFO org.apache.druid.indexer.JobHelper: Deleting path[/tmp/druid-indexing/test_webrequest_new_hadoop/2020-07-24T064311.660Z_5d3b1f48fbc244be97c7dbe42c49bef5]
2020-07-24T06:44:34,999 INFO org.apache.druid.indexing.worker.executor.ExecutorLifecycle: Task completed with status: {
  "id" : "index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z",
  "status" : "FAILED",
  "duration" : 74867,
  "errorMsg" : "{\"attempt_1594733405064_2112_m_000009_1\":\"Error: org.apache.druid.java.util.common.RE: Failure on ro...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2020-07-24T06:44:35,008 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-24T06:44:35,011 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [SERVER]
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@36f40d72{HTTP/1.1,[http/1.1]}{0.0.0.0:8200}
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.session: node0 Stopped scavenging
2020-07-24T06:44:35,019 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@3bc18fec{/,null,UNAVAILABLE}
2020-07-24T06:44:35,021 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [NORMAL]
2020-07-24T06:44:35,022 INFO org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner: Starting graceful shutdown of task[index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z].

[..]

2020-07-24T06:44:39,457 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2020-07-24T06:44:39,464 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to analytics1029-eqiad-wmnet
2020-07-24T06:44:39,466 WARN org.apache.hadoop.ipc.Client: Failed to connect to server: analytics1029.eqiad.wmnet/10.64.36.129:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_252]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_252]
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1381) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1345) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy318.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:228) ~[hadoop-yarn-common-2.8.5.jar:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy319.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:480) ~[hadoop-yarn-client-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:314) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:155) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:324) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:429) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:617) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:207) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.getJob(CLI.java:547) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:304) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopKillMRJobIdProcessingRunner.runTask(HadoopIndexTask.java:768) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.killHadoopJob(HadoopIndexTask.java:492) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.lambda$runInternal$0(HadoopIndexTask.java:311) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.TaskResourceCleaner.clean(TaskResourceCleaner.java:50) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.stopGracefully(AbstractBatchIndexTask.java:132) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop(SingleTaskBackgroundRunner.java:186) [druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.CliPeon.run(CliPeon.java:306) [druid-services-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.19.0.jar:0.19.0]

The Kerberos config works well for all other Hadoop-related tasks, like adding segments or fetching from HDFS, and the above error does not happen if the indexing job succeeds.

Affected Version

0.19.0 (upgrading from 0.12.3)
Important note: we still run CDH 5.16 Hadoop. It works well with -Dhadoop.mapreduce.job.classloader=true, without rebuilding Druid with different dependencies.
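
For reference, one way to pass that flag is via the MiddleManager's peon JVM options; the snippet below is only an illustrative sketch (the heap size is a placeholder, and this may not be exactly how the flag is set in our deployment):

# middleManager runtime.properties (illustrative; heap size is a placeholder)
druid.indexer.runner.javaOpts=-server -Xmx2g -Dhadoop.mapreduce.job.classloader=true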


a2l007 commented Jul 29, 2020

This generally happens when the Kerberos ticket that is being used to authenticate with the Hadoop cluster has expired. If the Kerberos credentials are configured in Druid using druid.hadoop.security.kerberos.principal and druid.hadoop.security.kerberos.keytab, Druid takes care of refreshing the ticket before expiry. If not, you might be relying on an external service to renew the ticket cache, and it is worth checking whether that renewal is working as expected.
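
For reference, a minimal runtime.properties sketch with those two properties (the principal name and keytab path below are placeholders, not values taken from this issue):

# Common or middleManager runtime.properties; Druid renews the ticket itself
# when these are set (principal and keytab path are hypothetical examples).
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.ORG
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.keytab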


sydt2014 commented Aug 2, 2020

This is a bug!

@dkhwangbo

@sydt2014 Hi! I have the same problem. Did you find any solution?


github-actions bot commented Sep 6, 2023

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label Sep 6, 2023

github-actions bot commented Oct 4, 2023

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

github-actions bot closed this as not planned Oct 4, 2023