
SPARK-41415: SASL Request Retries #38959

Closed
wants to merge 29 commits

Conversation


@akpatnam25 akpatnam25 commented Dec 7, 2022

What changes were proposed in this pull request?

Add the ability to retry SASL requests. A metric tracking SASL retries will also be added soon.

Why are the changes needed?

We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and failures have decreased significantly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.
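The retry behavior described above can be sketched as follows. This is a minimal illustration, not Spark's actual implementation; the names `sasl_with_retries`, `SaslTimeout`, and `connect` are assumptions made for the sketch. The key point matches the later discussion: each retry attempt dials a fresh connection rather than reusing the one whose request timed out.

```python
class SaslTimeout(Exception):
    """Stands in for a SASL request timing out (illustrative, not Spark's class)."""

def sasl_with_retries(connect, max_retries=3):
    """Retry the SASL handshake, dialing a *new* connection per attempt.

    `connect` is a caller-supplied factory that opens a fresh connection,
    performs the handshake, and returns a session on success.
    """
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return connect()            # each call opens a brand-new connection
        except SaslTimeout as e:
            last_error = e              # remember the failure and try again
    raise last_error                    # all attempts exhausted

calls = 0
def flaky_connect():
    """Simulated handshake: the first two attempts time out, the third succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise SaslTimeout("SASL request timed out")
    return "authenticated"

print(sasl_with_retries(flaky_connect), "after", calls, "attempts")
```

Running the sketch prints `authenticated after 3 attempts`: the transient timeouts are absorbed by the retry loop instead of failing the transfer outright.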

@github-actions github-actions bot added the CORE label Dec 7, 2022
@akpatnam25
Author

cc @mridulm @otterc @zhouyejoe

@AmplabJenkins

Can one of the admins verify this patch?

@akpatnam25 akpatnam25 changed the title [WIP] SPARK-41415: SASL Request Retries SPARK-41415: SASL Request Retries Dec 13, 2022
@akpatnam25
Author

@otterc @mridulm Removing the WIP tag from this PR; it should be good to review now.

@mridulm
Contributor

mridulm commented Dec 13, 2022

It is not clear to me why we need the protocol change, and why not simply create a new socket connection?

The main scenario I can think of where this approach could help, in comparison to a new socket, would be when the delays are due to netty repeatedly being unable to process the connect events on time.
I don't think that is what is causing the issues being seen (if I understand correctly, it is the child event group being unable to get to the request before the SASL timeout because other tasks are occupying it).

Also wondering if simply bumping up the thread pool size might help? I can see why it might not, but are there any results from an experiment?
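For intuition, the scenario described here (a SASL request stuck behind long-running work on an undersized pool, so it misses its timeout) can be simulated with a toy thread pool. This is an illustration only, not Spark or netty code; all names and timings are made up for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def serve_sasl():
    # Handling the SASL request itself is cheap.
    return "challenge-response"

def busy_task():
    # Long-running shuffle work occupying a worker thread.
    time.sleep(0.5)

def sasl_answered_in_time(workers):
    """Submit a busy task, then a SASL request; report whether the SASL
    request completes within a simulated 0.2s timeout."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.submit(busy_task)
        fut = pool.submit(serve_sasl)
        try:
            fut.result(timeout=0.2)
            return True
        except TimeoutError:
            return False

print("1 worker :", sasl_answered_in_time(1))
print("2 workers:", sasl_answered_in_time(2))
```

With a single worker the SASL request queues behind the busy task and misses its deadline; with a spare thread it is serviced immediately. That is the intuition behind both the retry approach and the thread-pool-size experiment suggested above.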

@akpatnam25
Author

@mridulm updated the PR to remove the protocol/server-side changes. In this case, we create a new connection every time a SASL retry is triggered. Confirmed this is the case by throwing simulated exceptions on our cluster to trigger SASL retries.

@akpatnam25
Author

akpatnam25 commented Jan 5, 2023

Re-running the CI to see if the linter errors are transient. They appear unrelated to this PR, as it contains no Python-related changes.

Contributor

@otterc otterc left a comment


I think the metric change is not complete. I don't see the change in the REST API and many other places.

@mridulm
Contributor

mridulm commented Jan 7, 2023

I think the metric change is not complete. I don't see the change in the REST API and many other places.

Agree, this should be WIP.

@akpatnam25
Author

@mridulm should be good to review now

@mridulm
Contributor

mridulm commented Jan 14, 2023

Can you fix the test failure @akpatnam25? testBlockTransferFailureAfterSasl appears to be broken. Thx

@mridulm
Contributor

mridulm commented Jan 14, 2023

Also, please update to latest master

@mridulm
Contributor

mridulm commented Jan 15, 2023

Missed out on the imports; once the build succeeds, I will merge it.
Based on the current test progress, it looks like all the tests are passing (just taking time in the slow SQL tests).

@akpatnam25
Author

Sounds good, thanks for following up even on the weekend @mridulm!

@mridulm mridulm closed this in 2878cd8 Jan 15, 2023
@mridulm
Contributor

mridulm commented Jan 15, 2023

If we want to backport to other branches, we might want to create new PRs.
@dongjoon-hyun, thoughts on getting this added to 3.3 and 3.2?

@mridulm
Contributor

mridulm commented Jan 15, 2023

Merged to master.
Thanks for working on this @akpatnam25 !
Thanks for the reviews @otterc, @rmcyang :-)

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, late LGTM. Thank you all.
In addition, I'm +1 for backporting, @mridulm .

@mridulm
Contributor

mridulm commented Jan 15, 2023

Thanks @dongjoon-hyun !
Can you create PRs for the 3.3 and 3.2 branches as well by backporting this @akpatnam25? Thanks.

@akpatnam25 akpatnam25 deleted the SPARK-41415 branch January 17, 2023 18:03
@akpatnam25 akpatnam25 restored the SPARK-41415 branch January 17, 2023 23:13
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 17, 2023

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 17, 2023

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 18, 2023

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
akpatnam25 pushed a commit to akpatnam25/spark that referenced this pull request Jan 18, 2023

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
mridulm pushed a commit that referenced this pull request Jan 21, 2023

Closes #38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes #39644 from akpatnam25/SPARK-41415-backport-3.3.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
mridulm pushed a commit that referenced this pull request Jan 21, 2023

Closes #38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes #39645 from akpatnam25/SPARK-41415-backport-3.2.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023

Closes apache#38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Closes apache#39645 from akpatnam25/SPARK-41415-backport-3.2.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 1a26c7b)