SPARK-41415/SPARK-42090 Backport to 3.2 #39632

akpatnam25

What changes were proposed in this pull request?

Add the ability to retry SASL requests. A metric to track SASL retries will be added soon.

Why are the changes needed?

We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures decrease significantly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests, and tested on a cluster to ensure the retries are being triggered correctly.

Closes #38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

What changes were proposed in this pull request?

This PR introduces a SASL retry count in RetryingBlockTransferor.

Why are the changes needed?

Previously, a boolean variable, saslTimeoutSeen, was used. However, a boolean cannot handle the following sequence of failures:

  1. SaslTimeoutException
  2. IOException
  3. SaslTimeoutException
  4. IOException

Even though the IOException at step 2 is retried (which increments retryCount), retryCount would be cleared at step 4, so that retry is no longer counted.
Since the intention of saslTimeoutSeen is to undo the increments caused by retrying SaslTimeoutException, we should keep a counter of SaslTimeoutException retries and subtract its value from retryCount.
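
The difference between the two bookkeeping schemes can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in RetryingBlockTransferor: the class names, the `SaslTimeout` stand-in exception, and the `onRetriedFailure` method are all hypothetical, the flag-based variant is only my reading of the description above, and the max-retries check is left out so that only the counting is visible.

```java
import java.io.IOException;

class SaslRetryCountSketch {
    /** Hypothetical stand-in for the SASL timeout exception type. */
    static class SaslTimeout extends RuntimeException {}

    /** Flag-based bookkeeping (assumed shape of the earlier approach). */
    static class FlagBased {
        int retryCount = 0;
        boolean saslTimeoutSeen = false;

        void onRetriedFailure(Throwable e) {
            boolean isSaslTimeout = e instanceof SaslTimeout;
            // A non-SASL failure after a SASL timeout clears the whole count.
            if (!isSaslTimeout && saslTimeoutSeen) {
                retryCount = 0;
                saslTimeoutSeen = false;
            }
            retryCount++;
            if (isSaslTimeout) {
                saslTimeoutSeen = true;
            }
        }
    }

    /** Counter-based bookkeeping: subtract only the SASL retries. */
    static class CounterBased {
        int retryCount = 0;
        int saslRetryCount = 0;

        void onRetriedFailure(Throwable e) {
            boolean isSaslTimeout = e instanceof SaslTimeout;
            // A non-SASL failure undoes just the increments caused by SASL retries.
            if (!isSaslTimeout && saslRetryCount > 0) {
                retryCount -= saslRetryCount;
                saslRetryCount = 0;
            }
            retryCount++;
            if (isSaslTimeout) {
                saslRetryCount++;
            }
        }
    }

    /** The four-step failure sequence from the description. */
    static Throwable[] scenario() {
        return new Throwable[] {
            new SaslTimeout(), new IOException(), new SaslTimeout(), new IOException()
        };
    }

    static int runFlagBased() {
        FlagBased state = new FlagBased();
        for (Throwable e : scenario()) {
            state.onRetriedFailure(e);
        }
        return state.retryCount;
    }

    static int runCounterBased() {
        CounterBased state = new CounterBased();
        for (Throwable e : scenario()) {
            state.onRetriedFailure(e);
        }
        return state.retryCount;
    }

    public static void main(String[] args) {
        System.out.println("flag-based retryCount:    " + runFlagBased());    // 1
        System.out.println("counter-based retryCount: " + runCounterBased()); // 2
    }
}
```

In this sketch the flag-based variant ends with a retryCount of 1 because the IOException retry from step 2 is wiped at step 4, while the counter-based variant ends with 2, one for each IOException retry.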

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new test is added, courtesy of Mridul.

Closes #39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Aravind Patnam and others added 2 commits January 17, 2023 15:41
@github-actions github-actions bot added the CORE label Jan 17, 2023
@akpatnam25 akpatnam25 changed the title from "Spark 41415 spark 42090 backport" to "SPARK-41415/SPARK-42090 Backport to 3.2" Jan 17, 2023
akpatnam25 commented Jan 17, 2023

@mridulm @dongjoon-hyun @tedyu backport into 3.2

akpatnam25 commented Jan 17, 2023

++ CC @otterc

@akpatnam25

@dongjoon-hyun oops tagged the wrong person :)

@akpatnam25 akpatnam25 closed this Jan 18, 2023