Fix: Handle SIGTERM in kubeflow pytorch elastic training plugin #2064
Why are the changes needed?
We recently observed many pytorch elastic training jobs failing with the following "user error":

This `SignalException` is raised by pytorch when the elastic launch agent process (which launches the worker processes within a Pod) receives SIGTERM or a similar signal. This in turn happens, for instance, when one pod of a multi-pod distributed training job fails and the Kubeflow training operator deletes the other pods. There haven't been any changes in the pytorch elastic training plugin that would explain this sudden change of behaviour, and we also didn't upgrade torch.
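As a rough, self-contained sketch of the mechanism: the elastic agent installs a signal handler that turns SIGTERM into a Python exception so the launch can be torn down cleanly. The class below is a stand-in for `torch.distributed.elastic.multiprocessing.api.SignalException`, not the actual torch code:

```python
import os
import signal

# Stand-in for torch.distributed.elastic.multiprocessing.api.SignalException
# (redefined here only to keep the sketch self-contained).
class SignalException(Exception):
    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval

def _terminate_handler(signum, frame):
    # Turn the asynchronous SIGTERM into a synchronous Python exception,
    # which is roughly what the elastic launch agent does on shutdown.
    raise SignalException(f"Process {os.getpid()} got signal: {signum}", sigval=signum)

signal.signal(signal.SIGTERM, _terminate_handler)

caught = None
try:
    # Simulate the training operator terminating this process.
    os.kill(os.getpid(), signal.SIGTERM)
except SignalException as exc:
    caught = exc.sigval

print(caught == signal.SIGTERM)
```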
I noticed in Google Cloud Logging that we started to see these `SignalException`s on the day we upgraded from `flytekit==1.9.0` to `flytekit==1.10.1`. I traced the cause down to this commit, in which the following change was made to the task pods' entrypoint/command `pyflyte-fast-execute`:

With this change, it makes sense that starting with `flytekit==1.9.1`, the pytorch elastic training agent process in the elastic plugin starts to receive SIGTERM signals, causing it to raise the `SignalException`.
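The diff itself is not reproduced here, but the general mechanism is easy to illustrate: once the task command runs as a child of a wrapper entrypoint that forwards SIGTERM, the process inside the pod starts seeing the signal. A minimal, purely illustrative sketch of such forwarding (not flytekit's actual entrypoint code):

```python
import os
import signal
import subprocess
import sys
import time

def spawn_with_forwarding(cmd):
    """Spawn cmd as a child and forward SIGTERM from this process to it
    (illustrative wrapper, not flytekit's actual entrypoint)."""
    proc = subprocess.Popen(cmd)
    signal.signal(signal.SIGTERM, lambda signum, frame: proc.send_signal(signum))
    return proc

# Child process that reacts to SIGTERM by exiting with a distinctive code.
child = spawn_with_forwarding([
    sys.executable, "-c",
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda s, f: sys.exit(43))\n"
    "time.sleep(30)",
])

time.sleep(0.5)                        # give the child time to install its handler
os.kill(os.getpid(), signal.SIGTERM)   # simulate Kubernetes terminating the pod
exit_code = child.wait(timeout=10)     # the forwarded SIGTERM reached the child
print(exit_code)
```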
What changes were proposed in this pull request?
The `SignalException` raised by the pytorch elastic launch agent process happens in the user scope, causing it to be shown in the UI as a user error:

This is confusing for users, as it suggests that the `SignalException` was the root cause of the failure while it actually is a side effect of the shutdown. Treating the `SignalException` as a user error also influences the retry behaviour in an unintended way, as this error is not recoverable. Because of this, we need to catch and ignore this error.
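The shape of the fix can be sketched as follows. Both classes are stand-ins to keep the example self-contained: the real `SignalException` comes from `torch.distributed.elastic.multiprocessing.api`, and `IgnoreOutputs` mirrors flytekit's exception for aborting a task without reporting a user error; `run_elastic_task` is a hypothetical wrapper, not the plugin's actual function:

```python
import signal

# Stand-in for torch.distributed.elastic.multiprocessing.api.SignalException.
class SignalException(Exception):
    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval

# Stand-in mirroring flytekit's IgnoreOutputs exception.
class IgnoreOutputs(Exception):
    pass

def run_elastic_task(launch_agent):
    """Hypothetical wrapper around the elastic launch call: translate the
    shutdown-induced SignalException instead of surfacing it as a user error."""
    try:
        return launch_agent()
    except SignalException as exc:
        # The agent was told to shut down (e.g. the training operator deleted
        # the pod); this is a side effect of the failure, not its root cause.
        raise IgnoreOutputs(f"Elastic agent received signal {exc.sigval}") from exc

def doomed_agent():
    # Simulate the launch agent process being terminated mid-training.
    raise SignalException("got SIGTERM", sigval=signal.SIGTERM)

translated = False
try:
    run_elastic_task(doomed_agent)
except IgnoreOutputs:
    translated = True
print(translated)
```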