SPARK-41415: SASL Request Retries #38959

akpatnam25 · 2022-12-07T05:57:37Z

What changes were proposed in this pull request?

Add the ability to retry SASL requests. A metric to track SASL retries will be added soon.

Why are the changes needed?

We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures decrease significantly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests, and tested on a cluster to ensure the retries are being triggered correctly.

akpatnam25 · 2022-12-07T05:58:57Z

cc @mridulm @otterc @zhouyejoe

AmplabJenkins · 2022-12-08T18:26:43Z

Can one of the admins verify this patch?

akpatnam25 · 2022-12-13T07:01:57Z

@otterc @mridulm removing the WIP tag from this PR. This should be good to review now.

mridulm · 2022-12-13T07:43:32Z

It is not clear to me why we need the protocol change. Why not simply create a new socket connection?

The main scenario I can think of where this approach could help (in comparison to a new socket) would be when the delays are due to netty repeatedly being unable to process the connect events on time.
I don't think that is what is causing the issues being seen (if I understand correctly, it is due to the child event group not being able to get to the request before the SASL timeout, because other tasks are occupying it).

Also wondering if simply bumping up the thread pool size might help? I can see why it might not, but are there any results from an experiment?
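
The scenario described in this comment (a worker thread occupied by other tasks, so the SASL exchange is not serviced before its timeout) can be illustrated with a small, self-contained Java sketch. It uses a plain `ExecutorService` as a stand-in for a Netty event loop group. The class name, method name, and timings are hypothetical and are not taken from Spark or from this PR.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EventLoopStarvationSketch {

  /** Sleeps for the given time, restoring the interrupt flag if interrupted. */
  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /**
   * Submits one long-running task followed by a trivial "handshake" task to a
   * fixed-size pool and reports whether the handshake finished within timeoutMs.
   * The pool is only an analogy for an event loop group (assumption).
   */
  public static boolean handshakeCompletesInTime(int poolSize, long busyMs, long timeoutMs) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      // A task that occupies one worker thread for busyMs.
      pool.submit(() -> sleepQuietly(busyMs));
      // The handshake itself is cheap, but it may have to wait for a free thread.
      Future<String> handshake = pool.submit(() -> "authenticated");
      try {
        handshake.get(timeoutMs, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false; // the handshake was still queued when the timeout fired
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      } catch (ExecutionException e) {
        throw new IllegalStateException(e);
      }
    } finally {
      pool.shutdownNow(); // interrupt the busy task so the sketch exits promptly
    }
  }

  public static void main(String[] args) {
    System.out.println("1 thread:  " + handshakeCompletesInTime(1, 1000L, 200L));
    System.out.println("2 threads: " + handshakeCompletesInTime(2, 1000L, 200L));
  }
}
```

In this toy setup a larger pool lets the handshake through, which is the intuition behind the thread pool question. It says nothing about whether that would hold on a real cluster, where more threads can simply be occupied by more work.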

akpatnam25 · 2023-01-04T23:39:46Z

@mridulm updated the PR to not have protocol/server side changes. In this case, we are creating a new connection every time the SASL retry is triggered. Confirmed that this is the case by throwing some simulated exceptions to trigger SASL retries on our cluster.
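
As a rough illustration of the client-side approach described in this comment (no protocol or server changes, and a new connection for every retry), here is a minimal Java sketch. It is not the code in this PR: `SaslTimeoutException`, `Connection`, `fetchWithSaslRetry`, and the retry parameters are hypothetical names introduced only for this example, and the simulated failure mirrors the "throw some simulated exceptions" test described above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SaslRetrySketch {

  /** Hypothetical stand-in for a SASL handshake timeout; not a Spark class. */
  public static class SaslTimeoutException extends RuntimeException {
    public SaslTimeoutException(String message) {
      super(message);
    }
  }

  /** Hypothetical connection; each instance represents one fresh socket. */
  public interface Connection {
    String authenticateAndFetch(String blockId);
  }

  /**
   * Tries the request up to maxRetries + 1 times. Every attempt asks the
   * factory for a new connection instead of reusing the one that timed out.
   * Only SASL timeouts are retried; any other exception propagates at once.
   */
  public static String fetchWithSaslRetry(
      Supplier<Connection> connectionFactory, String blockId, int maxRetries, long retryWaitMs) {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    SaslTimeoutException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      Connection connection = connectionFactory.get(); // new connection per attempt
      try {
        return connection.authenticateAndFetch(blockId);
      } catch (SaslTimeoutException e) {
        last = e;
        if (attempt < maxRetries) {
          try {
            Thread.sleep(retryWaitMs);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
          }
        }
      }
    }
    throw last; // all attempts timed out
  }

  public static void main(String[] args) {
    AtomicInteger connectionsOpened = new AtomicInteger();
    // Simulated failure: the first two connections time out during SASL.
    Supplier<Connection> factory = () -> {
      int n = connectionsOpened.incrementAndGet();
      return blockId -> {
        if (n < 3) {
          throw new SaslTimeoutException("simulated SASL timeout on connection " + n);
        }
        return "data for " + blockId;
      };
    };
    String result = fetchWithSaslRetry(factory, "shuffle_0_0_0", 3, 10L);
    System.out.println(result + " (connections opened: " + connectionsOpened.get() + ")");
  }
}
```

Counting connections in the factory is one simple way to confirm that each retry really opens a new connection rather than reusing the failed one.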

akpatnam25 · 2023-01-05T18:56:49Z

Re-running the CI to see if the linter errors are just transient. They seem unrelated to this PR, as it contains no Python-related changes.

common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java

...on/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockTransferListener.java

.../network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java

...ork-shuffle/src/test/java/org/apache/spark/network/shuffle/RetryingBlockTransferorSuite.java

core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala

common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java

otterc

I think the metric change is not complete. I don't see the change in the REST API and many other places.

...on/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockFetchingListener.java

core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala

mridulm · 2023-01-07T05:47:54Z

> I think the metric change is not complete. I don't see the change in the REST API and many other places.

Agree, this should be WIP.

akpatnam25 · 2023-01-12T19:44:51Z

@mridulm should be good to review now

mridulm · 2023-01-14T08:43:26Z

Can you fix the test failure, @akpatnam25? testBlockTransferFailureAfterSasl appears to be broken. Thx

mridulm · 2023-01-14T20:16:59Z

Also, please update to latest master

.../network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java

mridulm · 2023-01-15T00:48:33Z

Missed out on the imports. Once the build succeeds, I will merge it.
Based on current test progress, it looks like all the tests are working fine (they are just taking time in the slow SQL tests).

akpatnam25 · 2023-01-15T00:58:38Z

sounds good, thanks for following up even on the weekend @mridulm!

mridulm · 2023-01-15T06:05:38Z

If we want to backport to other branches, we might want to create new PRs.
@dongjoon-hyun, thoughts on getting this added to 3.3 and 3.2?

mridulm · 2023-01-15T06:06:05Z

Merged to master.
Thanks for working on this @akpatnam25 !
Thanks for the reviews @otterc, @rmcyang :-)

dongjoon-hyun

+1, late LGTM. Thank you all.
In addition, I'm +1 for backporting, @mridulm.

mridulm · 2023-01-15T08:21:35Z

Thanks @dongjoon-hyun!
Can you create backport PRs for the 3.3 and 3.2 branches as well, @akpatnam25? Thanks.

Add the ability to retry SASL requests. Will add it as a metric too soon to track SASL retries.

We are seeing increased SASL timeouts internally, and this issue would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

No

Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

### What changes were proposed in this pull request?
Add the ability to retry SASL requests. Will add it as a metric too soon to track SASL retries.

### Why are the changes needed?
We are seeing increased SASL timeouts internally, and this issue would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes #38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes #39644 from akpatnam25/SPARK-41415-backport-3.3.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Add the ability to retry SASL requests. Will add it as a metric too soon to track SASL retries.

We are seeing increased SASL timeouts internally, and this issue would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

No

Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes #38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes #39645 from akpatnam25/SPARK-41415-backport-3.2.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Add the ability to retry SASL requests. Will add it as a metric too soon to track SASL retries.

We are seeing increased SASL timeouts internally, and this issue would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease.

No

Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes apache#39645 from akpatnam25/SPARK-41415-backport-3.2.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 1a26c7b)

SPARK-41415: SASL Request Retries

4a48e49

github-actions bot added the CORE label Dec 7, 2022

fixed unit test and added metric definition

f3e2ea9

Aravind Patnam added 6 commits December 8, 2022 22:18

fixed more unit tests

b87420d

fixed some spaces

1816d0d

fixed one last unit test and linter changes

6302cb1

cleaned up some stuff

d0487fd

added timing code

ae50262

fix some more stuff

dc163f2

akpatnam25 changed the title ~~[WIP] SPARK-41415: SASL Request Retries~~ SPARK-41415: SASL Request Retries Dec 13, 2022

Aravind Patnam added 2 commits January 4, 2023 15:34

remove protocol/server side changes

d6bf891

fix some minor issues

d5ed4a3

mridulm reviewed Jan 6, 2023

otterc reviewed Jan 6, 2023

common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java

otterc reviewed Jan 6, 2023

...on/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockFetchingListener.java

core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala

Aravind Patnam added 6 commits January 11, 2023 13:46

revert SASL metrics changes

a41bfc9

fix minor issues

035c665

addressed all comments

129375d

remove some unneccessary code

6b2a3c6

remove some unneccessary code

ce71f69

fix SASL exception throwing

273b7ac

fix linter error

b91c2d4

fix linter

5623b04

Aravind Patnam added 2 commits January 14, 2023 14:29

fixed the test issues in retrying block transferor

207b3b6

Merge branch 'master' into SPARK-41415

b162d74

mridulm approved these changes Jan 15, 2023

.../network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java

.../network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java

change import order

3bebbb6

change import ordering to be alphabetical

6fc2379

mridulm closed this in 2878cd8 Jan 15, 2023

dongjoon-hyun reviewed Jan 15, 2023

akpatnam25 deleted the SPARK-41415 branch January 17, 2023 18:03

akpatnam25 restored the SPARK-41415 branch January 17, 2023 23:13

akpatnam25 mentioned this pull request Jan 17, 2023

SPARK-41415/SPARK-42090 Backport to 3.2 #39632

Closed

akpatnam25 mentioned this pull request Jan 17, 2023

SPARK-41415/SPARK-42090 Backport to 3.3 #39634

Closed

akpatnam25 mentioned this pull request Jan 18, 2023

[SPARK-41415][3.3] SASL Request Retries #39644

Closed

akpatnam25 mentioned this pull request Jan 18, 2023

[SPARK-41415][3.2] SASL Request Retries #39645

Closed
