
Conversation

@harishreedharan
Contributor

…tempting to create new delegation tokens if a new SparkContext is created within the same application.

Since Hadoop gives precedence to existing delegation tokens, we must log in as a different user, obtain new tokens, and replace the old ones in the current user's credentials cache; otherwise we would be unable to get new ones.

/cc @tedyu @tgravescs
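
A minimal sketch of that mechanism using stock Hadoop security APIs (the helper and its parameters are hypothetical, not code from this PR): log in from the keytab to obtain a UGI that carries no delegation tokens, fetch fresh tokens as that user, then merge them into the current user's credential cache so they overwrite the stale entries.

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Hypothetical helper, not part of this PR.
def refreshDelegationTokens(
    principal: String,
    keytabPath: String,
    renewer: String,
    conf: Configuration): Unit = {
  // A fresh keytab login carries no delegation tokens, so Hadoop will
  // authenticate with the Kerberos TGT instead of the stale tokens.
  val keytabUgi =
    UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath)
  val newCreds = keytabUgi.doAs(new PrivilegedExceptionAction[Credentials] {
    override def run(): Credentials = {
      val creds = new Credentials()
      // Request brand-new HDFS delegation tokens from the NameNode.
      FileSystem.get(conf).addDelegationTokens(renewer, creds)
      creds
    }
  })
  // Merging into the current user's credentials overwrites tokens with
  // the same alias, so subsequent RPCs pick up the fresh ones.
  UserGroupInformation.getCurrentUser.addCredentials(newCreds)
}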

@harishreedharan
Contributor Author

Jenkins, test this please

@SparkQA

SparkQA commented Nov 21, 2015

Test build #46458 has finished for PR 9875 at commit 70f610f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2015

Test build #46454 has finished for PR 9875 at commit 70f610f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@harishreedharan
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 21, 2015

Test build #46466 has finished for PR 9875 at commit 70f610f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

@harishreedharan I thought this issue existed even without reusing the JVM (going by other JIRAs and PRs that were filed)? For instance, if you just have a long-running process and had specified a keytab to use. I thought they had said it wasn't re-logging in because of the token it was acquiring.

@harishreedharan
Contributor Author

Yes, that is correct, but this happens even without the tokens expiring.

What do you think about doing the re-login in YarnClientSchedulerBackend? This is messy with client mode, but I don't think we have another option.

@harishreedharan
Contributor Author

Also, to be clear: it would work fine in cluster mode. In client mode, #7394 should have taken care of the long-running app issue (though there was one case where, I think, @SaintBacchus mentioned that the EventLoggingListener was no longer able to write to HDFS even though the tokens were getting updated).

So in either case, I am wondering whether in client mode we should simply re-login using the keytab and not bother with tokens on the driver app at all (so the AM would log in, update tokens, etc., while the client app always just logs in). Do you think that makes sense, @tgravescs?
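
A rough sketch of that "client app always just logs in" idea (the helper name and re-login cadence are illustrative assumptions, not this PR's code):

import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.security.UserGroupInformation

// Hypothetical sketch: the client-mode driver relies purely on the
// keytab and never touches delegation tokens itself; the AM keeps
// handling token renewal as before.
def keepDriverLoggedIn(principal: String, keytabPath: String): Unit = {
  // Replaces the static login user with a fresh keytab-based login, so
  // driver-side HDFS access (e.g. the EventLoggingListener) uses the
  // Kerberos TGT rather than tokens that may have been invalidated.
  UserGroupInformation.loginUserFromKeytab(principal, keytabPath)

  // Periodically re-login so the TGT never lapses on a long-running app.
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      // No-op while the TGT is still fresh, so calling this often is cheap.
      UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
    }
  }, 1, 1, TimeUnit.HOURS)
}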

@steveloughran
Contributor

Does the situation where the client can only have tokens need to be covered? Users may not have keytabs, only kinit-granted tickets, and they still have the right to submit work with a lifespan <= the ticket life.

@tgravescs
Contributor

If we are going to include the fix for the issue @SaintBacchus mentioned, then I think the right thing is to log in only from the keytab or get the tokens (not both). If a keytab is supplied, always use that and don't bother with tokens on the driver; otherwise, get the tokens.
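
Sketched as a hedged example (the option plumbing and names are illustrative, not Spark's actual configuration handling), that rule might look like this; the tokens-only branch also covers the kinit-ticket case raised above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Hypothetical decision logic, not Spark's actual code path.
def setupDriverCredentials(
    principalAndKeytab: Option[(String, String)],
    renewer: String,
    conf: Configuration): Unit = {
  principalAndKeytab match {
    case Some((principal, keytab)) =>
      // Keytab supplied: always log in from it and never fetch
      // delegation tokens on the driver.
      UserGroupInformation.loginUserFromKeytab(principal, keytab)
    case None =>
      // No keytab: fetch delegation tokens for the current (kinit'ed)
      // user, covering the tickets-only case mentioned above.
      val creds = new Credentials()
      FileSystem.get(conf).addDelegationTokens(renewer, creds)
      UserGroupInformation.getCurrentUser.addCredentials(creds)
  }
}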

@djdean

djdean commented Nov 30, 2015

After applying the provided patch, things still do not work. I've been doing some debugging and have found some additional information. When it works, two tokens are created, with the renewal interval being set for the first one using the getTokenRenewalInterval(stagingDirPath) function in Client.scala. The second time around (after stopping and restarting the context), however, it prints a message saying one token was created, but no renewal interval is set. Finally, it dies saying the token can't be found in the cache. The relevant output is below (IPs/hostnames removed):

---------------Successful run--------------
15/11/30 14:18:57 INFO yarn.Client: Credentials file set to: credentials-372be24e-9614-48d4-9f51-4cf275c51f46
15/11/30 14:18:57 INFO yarn.YarnSparkHadoopUtil: delegTokenRenewer: hadoop
15/11/30 14:18:57 INFO yarn.YarnSparkHadoopUtil: getting token for namenode: hdfs://HOSTNAME:9000/user/hadoop/.sparkStaging/application_1446695132208_0114
15/11/30 14:18:57 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 142 for hadoop on xxx.xxx.xxx.xxx
15/11/30 14:18:57 INFO yarn.Client: Renewal Interval set to 86400400
15/11/30 14:18:57 INFO yarn.Client: Preparing resources for our AM container
15/11/30 14:18:57 INFO yarn.YarnSparkHadoopUtil: delegTokenRenewer: rm/HOSTNAME
15/11/30 14:18:57 INFO yarn.YarnSparkHadoopUtil: getting token for namenode: hdfs://HOSTNAME:9000/user/hadoop/.sparkStaging/application_1446695132208_0114
15/11/30 14:18:57 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 143 for hadoop on xxx.xxx.xxx.xxx
15/11/30 14:18:58 INFO yarn.YarnSparkHadoopUtil: Hive class not found java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
15/11/30 14:18:58 INFO yarn.Client: To enable the AM to login from keytab, credentials are being copied over to the AM via the YARN Secure Distributed Cache.
15/11/30 14:18:58 INFO yarn.Client: Uploading resource file:/etc/security/keytabs/hadoop.keytab -> /user/hadoop/.sparkStaging/application_1446695132208_0114/hadoop.keytab
--------End successful run--------------
--------Failed run------------
15/11/30 14:19:46 INFO yarn.Client: Credentials file set to: credentials-b91660b6-a7c4-49f1-b869-ded70fec1641
15/11/30 14:19:46 INFO yarn.Client: Preparing resources for our AM container
15/11/30 14:19:46 INFO yarn.YarnSparkHadoopUtil: Called with conf: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml
15/11/30 14:19:46 INFO yarn.YarnSparkHadoopUtil: delegTokenRenewer: rm/HOSTNAME
15/11/30 14:19:46 INFO yarn.YarnSparkHadoopUtil: getting token for namenode: hdfs://HOSTNAME:9000/user/hadoop/.sparkStaging/application_1446695132208_0115
15/11/30 14:19:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 144 for hadoop on xxx.xxx.xxx.xxx
15/11/30 14:19:46 INFO yarn.YarnSparkHadoopUtil: Hive class not found java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
15/11/30 14:19:46 INFO yarn.Client: To enable the AM to login from keytab, credentials are being copied over to the AM via the YARN Secure Distributed Cache.
15/11/30 14:19:46 INFO yarn.Client: Uploading resource file:/etc/security/keytabs/hadoop.keytab -> hdfs://HOSTNAME:9000/user/hadoop/.sparkStaging/application_1446695132208_0115/hadoop.keytab
15/11/30 14:19:46 INFO yarn.Client: Uploading resource file:/var/tmp/spark-1.6.0-SNAPSHOT-bin-patch-8/lib/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar -> hdfs://HOSTNAME/user/hadoop/.sparkStaging/application_1446695132208_0115/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar
15/11/30 14:19:58 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 143 for hadoop) can't be found in cache
15/11/30 14:19:58 WARN hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-695293104_13] for 30 seconds. Will retry shortly ...
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 143 for hadoop) can't be found in cache

@SparkQA

SparkQA commented May 12, 2016

Test build #58467 has finished for PR 9875 at commit 70f610f.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

// If this JVM hosted a yarn-client mode driver before, the credentials of the current user
// now have delegation tokens, which means Hadoop security code will look at those and not the
// keytab login. So we must re-login and get new tokens.
if (reusedJVM && loginFromKeytab && !isClusterMode) {

Hey @harishreedharan, do you plan on updating this patch?

If yes, I'm wondering why not do this in all cases, not just when a new context is created. The same code should work in both scenarios, right?

If not, we should probably close the PR.

@vanzin
Contributor

vanzin commented Jul 12, 2016

ping @harishreedharan, please close the PR if you don't intend to work on it.
