Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Handle SIGTERM in kubeflow pytorch elastic training plugin #2064

Merged
merged 1 commit into from
Dec 21, 2023

Conversation

fg91
Copy link
Member

@fg91 fg91 commented Dec 21, 2023

Why are the changes needed?

We recently observed many pytorch elastic training jobs failing with the following "user error":

      File ".../flytekitplugins/kfpytorch/task.py", line 391, in _execute
        out = elastic_launch(
      ...
      File ".../torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
        raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)

Message:

    Process 107 got signal: 15

User error.

This SignalException is raised by pytorch when the elastic launch agent process (which launches the worker processes within a Pod) receives SIGTERM etc. This in turn happens for instance when a multi pod distributed training job fails in one pod and the Kubeflow training operator deletes the other pods.


There haven't been any changes in the pytorch elastic training plugin that would explain this sudden change of behaviour and we also didn't upgrade torch.

I noticed in google cloud logging, that we started to see these SignalExceptions on the day we upgraded from flytekit==1.9.0 to flytekit==1.10.1.
I traced the cause down to this commit in which the following change was made to the task pods' entrypoint/command pyflyte-fast-execute:

@_pass_through.command("pyflyte-fast-execute")
def fast_execute_task_cmd(...):

   # Before
    p = subprocess.run(cmd, check=False)
    exit(p.returncode)

    # After
    p = subprocess.Popen(cmd)

    def handle_sigterm(signum, frame):
        logger.info(f"passing signum {signum} [frame={frame}] to subprocess")
        p.send_signal(signum)

    signal.signal(signal.SIGTERM, handle_sigterm)
    returncode = p.wait()
    exit(returncode)

With this change, it makes sense that starting with flytekit==1.9.1, the pytorch elastic training agent process in the elastic plugin starts to receive SIGTERM signals, causing it to raise the SignalException.

What changes were proposed in this pull request?

The SignalException raised by the pytorch elastic launch agent process happens in the user scope, causing it to be shown in the UI as a user error:

Screenshot 2023-12-21 at 14 10 47

This is confusing for users as it suggests that the SignalException was the root cause of the failure while it actually is a side-effect of the shutdown. Treating the SignalException as a user error also influences the retry behaviour in an unintended way as this error is not recoverable.

Because of this, we need to catch and ignore this error.


  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
@fg91 fg91 marked this pull request as ready for review December 21, 2023 13:22
@fg91 fg91 self-assigned this Dec 21, 2023
@fg91 fg91 added the bug Something isn't working label Dec 21, 2023
Copy link

codecov bot commented Dec 21, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (fccf7c7) 86.05% compared to head (853e267) 86.05%.

Files Patch % Lines
...tekit-kf-pytorch/flytekitplugins/kfpytorch/task.py 25.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2064      +/-   ##
==========================================
- Coverage   86.05%   86.05%   -0.01%     
==========================================
  Files         313      313              
  Lines       23344    23348       +4     
  Branches     3489     3489              
==========================================
+ Hits        20088    20091       +3     
- Misses       2644     2646       +2     
+ Partials      612      611       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@eapolinario eapolinario merged commit 8af01f2 into master Dec 21, 2023
77 of 79 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants