[SPARK-21040][CORE] Speculate tasks which are running on decommission executors #28619
Closed
prakharjain09 wants to merge 11 commits into apache:master from prakharjain09:SPARK-21040-speculate-decommission-exec-tasks
Changes from all commits (11 commits)
775cacb Speculate tasks which are running on decommission executors based on … (prakharjain09)
7521adf add test case (prakharjain09)
55dc94f address review comments (prakharjain09)
dae9cfe remove unnecessary test (prakharjain09)
f5a7313 empty commit to trigger build (prakharjain09)
1cae338 Merge remote-tracking branch 'origin/master' into SPARK-21040-specula… (prakharjain09)
795ede6 Merge remote-tracking branch 'origin/master' into SPARK-21040-specula… (prakharjain09)
61f850d use same notation in comments (prakharjain09)
43ba62e Merge remote-tracking branch 'origin/master' into SPARK-21040-specula… (prakharjain09)
4affa58 Merge remote-tracking branch 'origin/master' into SPARK-21040-specula… (prakharjain09)
d87b311 doc fixes (prakharjain09)
Conversations
So the timeout is decided by the cloud vendors? What does this config specify?
@cloud-fan This config can be set by users based on their setups. If they are using AWS spot nodes, the timeout can be set to around 120 seconds; if they are using fixed-duration 6-hour spot blocks (say they decommission executors at 5:45), the timeout can be set to 15 minutes, and so on.
If the user doesn't set this timeout, behavior is unchanged: tasks running on decommissioned executors won't get any special treatment with respect to speculation.
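As a sketch of how such a setup might look in practice (the exact config key added by this PR is my assumption here and should be checked against the diff; only `spark.speculation` is a long-standing Spark setting):

```
# spark-defaults.conf (illustrative; key names below are assumptions)
spark.speculation                          true
# Expected time between the decommission notice and executor loss,
# e.g. roughly 120s for AWS spot instances:
spark.executor.decommission.killInterval   120s
```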
Is it possible for Spark to get this timeout value from the cluster manager, so that users don't need to set it manually? cc @holdenk
@cloud-fan As per my understanding, worker decommissioning is currently triggered by the SIGPWR signal, not by a message from the YARN/Kubernetes cluster manager, so getting this timeout from the cluster manager might not be possible. We might be able to do this once Spark's worker decommissioning logic starts being triggered via communication from YARN etc. in the future. cc @holdenk
I believe there are some situations where we can learn the length of time from the cluster manager or from Spark itself, but not all. I think having a configurable default for folks who know their cloud-provider environment makes sense.
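The heuristic discussed in this thread can be sketched as follows. This is a hedged illustration of the idea only, not the PR's actual implementation; the function and parameter names are hypothetical:

```python
def should_speculate(decommission_start_ms: int,
                     timeout_ms: int,
                     estimated_finish_ms: int) -> bool:
    """Decide whether a task running on a decommissioning executor
    should get a speculative copy on another executor.

    The executor is expected to be lost at decommission_start_ms + timeout_ms;
    if the task's estimated finish time falls after that point, it is unlikely
    to complete on this executor, so a speculative copy is worthwhile.
    Names here are illustrative, not Spark's actual API.
    """
    executor_lost_at_ms = decommission_start_ms + timeout_ms
    return estimated_finish_ms > executor_lost_at_ms

# Example: AWS spot node with a ~120s termination notice.
# A task expected to need 300s more will not finish in time, so speculate it;
# a task expected to need only 60s more is left alone.
print(should_speculate(0, 120_000, 300_000))  # True
print(should_speculate(0, 120_000, 60_000))   # False
```

Note that when no timeout is configured, the check is simply skipped, matching the "no special treatment" behavior described above.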