SPARK-41415: SASL Request Retries #38959
Can one of the admins verify this patch?
It is not clear to me why we need the protocol change rather than simply creating a new socket connection. The main scenario I can think of where this approach could help, compared to a new socket, is when the delays come from netty repeatedly being unable to process the connect events in time. I am also wondering whether simply bumping up the thread pool size might help. I can see why it might not, but are there any results from an experiment?
@mridulm I updated the PR so that it no longer has protocol or server-side changes. We now create a new connection every time a SASL retry is triggered. I confirmed this on our cluster by throwing simulated exceptions to trigger SASL retries.
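To make the "new connection per SASL retry" behaviour concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `SaslRetrySketch`, `Connection`, and `connectWithRetries` are hypothetical names, and the standard `TimeoutException` stands in for whatever exception the real client raises on a SASL timeout.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from the Spark code base.
class SaslRetrySketch {

  /** Stand-in for a client connection; a real one would wrap a socket. */
  interface Connection extends AutoCloseable {
    void saslHandshake() throws TimeoutException;
    @Override void close();
  }

  /**
   * Tries the handshake up to maxAttempts times, opening a fresh
   * connection for every attempt. Returns the connection that succeeded,
   * or rethrows the last timeout once the attempts are exhausted.
   */
  static Connection connectWithRetries(Supplier<Connection> factory, int maxAttempts)
      throws TimeoutException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    TimeoutException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      Connection conn = factory.get();  // new connection on every attempt
      try {
        conn.saslHandshake();
        return conn;
      } catch (TimeoutException e) {
        last = e;
        conn.close();                   // discard the connection that timed out
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] opened = {0};
    // Simulated factory: the first two connections time out, the third succeeds.
    Supplier<Connection> factory = () -> {
      final int id = ++opened[0];
      return new Connection() {
        public void saslHandshake() throws TimeoutException {
          if (id < 3) {
            throw new TimeoutException("simulated SASL timeout on connection " + id);
          }
        }
        public void close() {}
      };
    };
    connectWithRetries(factory, 5);
    System.out.println("connections opened: " + opened[0]);
  }
}
```

With the simulated factory above, two handshakes time out and the third succeeds, so three connections are opened in total. No state from a timed-out connection is reused, which is the property discussed in this thread.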
Re-running the CI to see whether the linter errors are just transient. They seem unrelated to this PR, since it contains no Python-related changes.
common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java (two outdated review threads, resolved)
...on/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockTransferListener.java (outdated review thread, resolved)
.../network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java (outdated review thread, resolved)
...ork-shuffle/src/test/java/org/apache/spark/network/shuffle/RetryingBlockTransferorSuite.java (outdated review thread, resolved)
core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala (outdated review thread, resolved)
common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java (outdated review thread, resolved)
I think the metric change is incomplete. I don't see the corresponding change in the REST API or in many other places.
...on/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockFetchingListener.java (outdated review thread, resolved)
core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala (outdated review thread, resolved)
Agree, this should be WIP.
@mridulm this should be good to review now.
Can you fix the test failure, @akpatnam25?
Also, please update to the latest master.
.../network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java (two outdated review threads, resolved)
I missed the imports. Once the build succeeds, I will merge it.
Sounds good, thanks for following up even on the weekend, @mridulm!
If we want to backport to other branches, we might want to create new PRs.
Merged to master.
+1, late LGTM. Thank you all.
In addition, I'm +1 for backporting, @mridulm.
Thanks @dongjoon-hyun!
### What changes were proposed in this pull request?
Add the ability to retry SASL requests. Will add it as a metric too soon to track SASL retries.

### Why are the changes needed?
We are seeing increased SASL timeouts internally, and this issue would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Add the ability to retry SASL requests. Will add it as a metric too soon to track SASL retries.

### Why are the changes needed?
We are seeing increased SASL timeouts internally, and this issue would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes #38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes #39644 from akpatnam25/SPARK-41415-backport-3.3.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Add the ability to retry SASL requests. Will add it as a metric too soon to track SASL retries.

### Why are the changes needed?
We are seeing increased SASL timeouts internally, and this issue would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes apache#39645 from akpatnam25/SPARK-41415-backport-3.2.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 1a26c7b)
What changes were proposed in this pull request?
Add the ability to retry SASL requests. A metric to track SASL retries will be added soon.
Why are the changes needed?
We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures decrease significantly.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.
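A check in the spirit of the testing note above could look like the following sketch. It is an illustration only, not the unit test added in this PR: `RetryDecisionSketch`, `SaslTimeout`, and the opt-in flag are hypothetical, and treating `IOException` as always retryable is an assumption made for the example.

```java
import java.io.IOException;

// Hypothetical sketch of a retry decision; not the logic from the Spark code base.
class RetryDecisionSketch {

  /** Hypothetical marker for a SASL handshake timeout. */
  static class SaslTimeout extends RuntimeException {
    SaslTimeout(String msg) { super(msg); }
  }

  private final int maxRetries;
  private final boolean saslRetriesEnabled;  // assumed opt-in switch
  private int retryCount = 0;

  RetryDecisionSketch(int maxRetries, boolean saslRetriesEnabled) {
    this.maxRetries = maxRetries;
    this.saslRetriesEnabled = saslRetriesEnabled;
  }

  /** Returns true and consumes one retry if the failure is retryable and budget remains. */
  boolean shouldRetry(Throwable t) {
    boolean saslTimeout = saslRetriesEnabled && t instanceof SaslTimeout;
    boolean ioFailure = t instanceof IOException;
    if ((saslTimeout || ioFailure) && retryCount < maxRetries) {
      retryCount++;
      return true;
    }
    return false;
  }

  int retryCount() { return retryCount; }

  public static void main(String[] args) {
    RetryDecisionSketch enabled = new RetryDecisionSketch(2, true);
    // Simulated SASL timeouts trigger retries until the budget of 2 is used up.
    System.out.println(enabled.shouldRetry(new SaslTimeout("simulated")));  // true
    System.out.println(enabled.shouldRetry(new SaslTimeout("simulated")));  // true
    System.out.println(enabled.shouldRetry(new SaslTimeout("simulated")));  // false

    RetryDecisionSketch disabled = new RetryDecisionSketch(2, false);
    // With the switch off, a SASL timeout is not retried.
    System.out.println(disabled.shouldRetry(new SaslTimeout("simulated")));  // false
  }
}
```

Throwing a simulated exception and asserting on the retry count mirrors how the thread above describes confirming that SASL retries fire on a cluster.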