Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32905][Core][Yarn] ApplicationMaster fails to receive UpdateDelegationTokens message #29777

Closed
wants to merge 1 commit into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Sep 17, 2020

What changes were proposed in this pull request?

With a long-running application in kerberized mode, the AMEndpiont handles UpdateDelegationTokens message wrong, which is an OneWayMessage that should be handled in the receive function.

20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them.
20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Why are the changes needed?

bugfix, without a proper token refresher, the long-running apps are going to fail potentially in kerberized cluster

Does this PR introduce any user-facing change?

no

How was this patch tested?

Passing jenkins

and verify manually

I am running the sub-module kyuubi-spark-sql-engine of https://github.com/yaooqinn/kyuubi

The simplest way to reproduce the bug and verify this fix is to follow these steps

1 build the kyuubi-spark-sql-engine module

mvn clean package -pl :kyuubi-spark-sql-engine

2. config the spark with Kerberos settings towards your secured cluster

3. start it in the background

nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &

4. check the AM log and see

"Updating delegation tokens ..." for SUCCESS

"Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE

@yaooqinn
Copy link
Member Author

cc @cloud-fan @maropu @dongjoon-hyun thanks~

@SparkQA
Copy link

SparkQA commented Sep 17, 2020

Test build #128788 has finished for PR 29777 at commit 312acaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, the changes make sense. its a bit sad this was broken for that long and it would be really nice to have a test for this but understand that is hard here.
Can you detail what tests you ran?

@mridulm
Copy link
Contributor

mridulm commented Sep 17, 2020

Thanks for fixing this @yaooqinn !
Oh my, this is indeed a nasty bug @tgravescs ... I am wondering which other message is mishandled like this - will try to do an audit later this week; I am guessing we dont have an automated way to do it ?

@yaooqinn
Copy link
Member Author

yaooqinn commented Sep 18, 2020

thanks, @tgravescs, and @mridulm

I am running the sub-module kyuubi-spark-sql-engine of https://github.com/yaooqinn/kyuubi

The simplest way to reproduce the bug and verify this fix is to follow these steps

1 build the kyuubi-spark-sql-engine module

mvn clean package -pl :kyuubi-spark-sql-engine

2. config the spark with Kerberos settings towards your secured cluster

3. start it in the background

nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &

4. check the AM log and see

"Updating delegation tokens ..." for SUCCESS

"Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE

@cloud-fan
Copy link
Contributor

The fix is pretty straightforward. It will be better if we have some UTs for the Kerberos support inside Spark, instead of relying on the manual end-to-end test. But we shouldn't block this fix.

Merging to master/3.0, thanks!

@cloud-fan cloud-fan closed this in 9e9d4b6 Sep 18, 2020
cloud-fan pushed a commit that referenced this pull request Sep 18, 2020
…legationTokens message

### What changes were proposed in this pull request?

With a long-running application in kerberized mode, the AMEndpiont handles `UpdateDelegationTokens` message wrong, which is an OneWayMessage that should be handled in the `receive` function.

```java
20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them.
20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

### Why are the changes needed?

bugfix, without a proper token refresher, the long-running apps are going to fail potentially in kerberized cluster

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing jenkins

and verify manually

I am running the sub-module `kyuubi-spark-sql-engine` of https://github.com/yaooqinn/kyuubi

The simplest way to reproduce the bug and verify this fix is to follow these steps

#### 1 build the `kyuubi-spark-sql-engine` module
```
mvn clean package -pl :kyuubi-spark-sql-engine
```
#### 2. config the spark with Kerberos settings towards your secured cluster

#### 3. start it in the background
```
nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &
```

#### 4. check the AM log and see

"Updating delegation tokens ..." for SUCCESS

"Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE

Closes #29777 from yaooqinn/SPARK-32905.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9e9d4b6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@tgravescs
Copy link
Contributor

@mridulm I don't think we have any current automated way to audit the message types, would definitely be nice, most things should be caught with unit tests, its just delegation token handling is especially hard to test.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…legationTokens message

### What changes were proposed in this pull request?

With a long-running application in kerberized mode, the AMEndpiont handles `UpdateDelegationTokens` message wrong, which is an OneWayMessage that should be handled in the `receive` function.

```java
20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them.
20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

### Why are the changes needed?

bugfix, without a proper token refresher, the long-running apps are going to fail potentially in kerberized cluster

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing jenkins

and verify manually

I am running the sub-module `kyuubi-spark-sql-engine` of https://github.com/yaooqinn/kyuubi

The simplest way to reproduce the bug and verify this fix is to follow these steps

#### 1 build the `kyuubi-spark-sql-engine` module
```
mvn clean package -pl :kyuubi-spark-sql-engine
```
#### 2. config the spark with Kerberos settings towards your secured cluster

#### 3. start it in the background
```
nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &
```

#### 4. check the AM log and see

"Updating delegation tokens ..." for SUCCESS

"Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE

Closes apache#29777 from yaooqinn/SPARK-32905.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9e9d4b6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants