[SPARK-28305][YARN] Request GetExecutorLossReason to use a smaller timeout parameter #25078
cxzl25 wants to merge 1 commit into apache:master
Conversation
|
AM LOG: 19/07/08 16:56:48 [dispatcher-event-loop-0] INFO YarnAllocator: add executor 951 to pendingLossReasonRequests for get the loss reason
Driver LOG: 19/07/08 16:58:48,476 [rpc-server-3-3] ERROR TransportChannelHandler: Connection to /xx.xx.xx.xx:19398 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong. |
|
Interesting, this makes sense, but I'm wondering whether the same happens in other places as well. I assume that if you configure spark.rpc.askTimeout to be < the network timeout, that also works around the issue? Or if you set io.connectionTimeout to be larger? |
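For reference, a minimal sketch of the workaround being discussed here, assuming one simply lowers spark.rpc.askTimeout below spark.network.timeout; the values are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

// Illustrative only: make asks fail (and become recoverable) well before the
// transport layer's idle timeout declares the connection dead.
val conf = new SparkConf()
  .set("spark.network.timeout", "120s") // idle timeout used by the transport layer
  .set("spark.rpc.askTimeout", "30s")   // the ask gives up first, so recovery logic can run
```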
|
In yarn-client mode, the driver closes the AM connection, causing the entire job to exit and causing unnecessary failures. Adjusting the parameters should work. Parameter priority
I ran into a problem with an abnormal exit in yarn-client mode before, and found and fixed an issue in #23989 (closing the connection uses ask + recover). |
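For context, a self-contained sketch of the "ask + recover" pattern mentioned above. This is not the code from #23989; askExecutorLossReason is a stand-in that models an ask which fails with a timeout, as in the logs earlier in this thread:

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
import scala.util.control.NonFatal

object AskRecoverSketch {
  // Models an ask that fails (here with a timeout).
  def askExecutorLossReason(executorId: String): Future[String] =
    Future.failed(new TimeoutException(s"ask for executor $executorId timed out"))

  def main(args: Array[String]): Unit = {
    // recover turns the failed ask into a fallback value instead of an unhandled
    // failure, so the caller can retry later rather than tearing everything down.
    val reason: Future[String] =
      askExecutorLossReason("951").recover {
        case NonFatal(e) => s"loss reason unavailable yet: ${e.getMessage}"
      }
    println(Await.result(reason, 10.seconds))
  }
}
```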
|
Right, so to clarify: I think you are saying that, yes, if you set askTimeout separately it also fixes the problem, correct? So my concern is that you are fixing this in a single location; I think there are other places that could have the same issue, although the question is whether those are recoverable or not. So I'm wondering if you can just set the ask timeout globally to a different value, or if that causes other issues? |
|
Yes, I used the following configuration to test successfully in the test environment, but not on a large scale in the production environment. After the driver closes the AM connection, the AM does not get a chance to reconnect, and the SparkContext also stops: finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS). Or should I take the minimum of the three configuration items (see the sketch after this comment)? Later I discovered that this may also be a race condition.
I'm not sure if my example is too extreme, but it does happen in our environment. |
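A purely illustrative sketch of the "minimum of the three configuration items" idea from the comment above. The three keys are the ones discussed in this thread; whether these are the right three, and the spark.rpc.io.connectionTimeout key in particular, are assumptions of this sketch, not something the patch decides:

```scala
import org.apache.spark.SparkConf

// Take the smallest of the timeouts discussed above and use that for the ask,
// so the ask always gives up before the connection is declared dead.
val conf = new SparkConf()
val effectiveAskTimeoutMs = Seq(
  conf.getTimeAsMs("spark.rpc.askTimeout", "120s"),
  conf.getTimeAsMs("spark.network.timeout", "120s"),
  conf.getTimeAsMs("spark.rpc.io.connectionTimeout", "120s") // assumed key for io.connectionTimeout
).min
println(s"Effective ask timeout: $effectiveAskTimeoutMs ms")
```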
|
Can one of the admins verify this patch? |
|
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
What changes were proposed in this pull request?
Request GetExecutorLossReason to use a smaller timeout parameter.
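A rough sketch of the idea, assuming a dedicated, smaller timeout for this particular ask. The names SmallerLossReasonTimeoutSketch and lossReasonAskTimeout are hypothetical, and this is not the actual patch:

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.duration._
import scala.concurrent.{Await, Promise}

object SmallerLossReasonTimeoutSketch {
  def main(args: Array[String]): Unit = {
    // Models an AM that cannot answer GetExecutorLossReason until the NM expiry
    // (10 minutes) elapses: the reply simply never arrives in time.
    val pendingReply = Promise[String]().future

    // Hypothetical dedicated timeout, smaller than spark.network.timeout (120s),
    // so the ask fails and can be handled before the idle connection is closed.
    val lossReasonAskTimeout = 5.seconds

    val reason =
      try Await.result(pendingReply, lossReasonAskTimeout)
      catch {
        case _: TimeoutException =>
          "loss reason still pending; handled without killing the connection"
      }
    println(reason)
  }
}
```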
In some cases, such as when the NM machine crashes or shuts down, the driver asks the AM for GetExecutorLossReason, but the AM's getCompletedContainersStatuses cannot get the failure information for the container, because the YARN NM detection timeout is 10 minutes, controlled by the parameter yarn.resourcemanager.rm.container-allocation.expiry-interval-ms.
So the AM has to wait up to 10 minutes to get the cause of the container failure.
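For reference, a small sketch showing where the 10 minutes comes from, reading the RM-side expiry interval named above. It assumes the Hadoop YARN client libraries are on the classpath and uses the usual default of 600000 ms; it is not part of the patch:

```scala
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Reads the RM container-allocation expiry interval mentioned above; until this
// interval passes, getCompletedContainersStatuses has no failure information.
val yarnConf = new YarnConfiguration()
val expiryMs = yarnConf.getLong(
  "yarn.resourcemanager.rm.container-allocation.expiry-interval-ms", 600000L)
println(s"Container allocation expiry interval: $expiryMs ms")
```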
Although the driver's ask fails, it will call recover. However, due to the 2-minute timeout (spark.network.timeout) configured by IdleStateHandler, the connection between the driver and the AM is closed, the AM exits, the application finishes, and the driver exits, causing the job to fail.
How was this patch tested?