Fix: Handle SIGTERM in kubeflow pytorch elastic training plugin #2064
Why are the changes needed?
We recently observed many pytorch elastic training jobs failing with the following "user error":

This `SignalException` is raised by pytorch when the elastic launch agent process (which launches the worker processes within a Pod) receives SIGTERM or a similar signal. This in turn happens, for instance, when one pod of a multi-pod distributed training job fails and the Kubeflow training operator deletes the other pods. There haven't been any changes in the pytorch elastic training plugin that would explain this sudden change of behaviour, and we also didn't upgrade torch.
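As a rough, self-contained sketch of the mechanism: the elastic agent installs a signal handler that turns SIGTERM into a Python exception so the launch can be torn down cleanly. The class below is a stand-in for `torch.distributed.elastic.multiprocessing.api.SignalException`, not the actual torch code:

```python
import os
import signal

# Stand-in for torch.distributed.elastic.multiprocessing.api.SignalException
# (redefined here only to keep the sketch self-contained).
class SignalException(Exception):
    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval

def _terminate_handler(signum, frame):
    # Turn the asynchronous SIGTERM into a synchronous Python exception,
    # which is roughly what the elastic launch agent does on shutdown.
    raise SignalException(f"Process {os.getpid()} got signal: {signum}", sigval=signum)

signal.signal(signal.SIGTERM, _terminate_handler)

caught = None
try:
    # Simulate the training operator terminating this process.
    os.kill(os.getpid(), signal.SIGTERM)
except SignalException as exc:
    caught = exc.sigval

print(caught == signal.SIGTERM)
```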
I noticed in Google Cloud Logging that we started to see these `SignalException`s on the day we upgraded from `flytekit==1.9.0` to `flytekit==1.10.1`. I traced the cause down to this commit, in which the following change was made to the task pods' entrypoint/command `pyflyte-fast-execute`:

With this change, it makes sense that starting with `flytekit==1.9.1`, the pytorch elastic training agent process in the elastic plugin starts to receive SIGTERM signals, causing it to raise the `SignalException`.
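The diff itself is not reproduced here, but the general mechanism is easy to illustrate: once the task command runs as a child of a wrapper entrypoint that forwards SIGTERM, the process inside the pod starts seeing the signal. A minimal, purely illustrative sketch of such forwarding (not flytekit's actual entrypoint code):

```python
import os
import signal
import subprocess
import sys
import time

def spawn_with_forwarding(cmd):
    """Spawn cmd as a child and forward SIGTERM from this process to it
    (illustrative wrapper, not flytekit's actual entrypoint)."""
    proc = subprocess.Popen(cmd)
    signal.signal(signal.SIGTERM, lambda signum, frame: proc.send_signal(signum))
    return proc

# Child process that reacts to SIGTERM by exiting with a distinctive code.
child = spawn_with_forwarding([
    sys.executable, "-c",
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda s, f: sys.exit(43))\n"
    "time.sleep(30)",
])

time.sleep(0.5)                        # give the child time to install its handler
os.kill(os.getpid(), signal.SIGTERM)   # simulate Kubernetes terminating the pod
exit_code = child.wait(timeout=10)     # the forwarded SIGTERM reached the child
print(exit_code)
```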
What changes were proposed in this pull request?
The `SignalException` raised by the pytorch elastic launch agent process happens in the user scope, causing it to be shown in the UI as a user error:

This is confusing for users, as it suggests that the `SignalException` was the root cause of the failure while it actually is a side effect of the shutdown. Treating the `SignalException` as a user error also influences the retry behaviour in an unintended way, as this error is not recoverable. Because of this, we need to catch and ignore this error.
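The shape of the fix can be sketched as follows. Both classes are stand-ins to keep the example self-contained: the real `SignalException` comes from `torch.distributed.elastic.multiprocessing.api`, and `IgnoreOutputs` mirrors flytekit's exception for aborting a task without reporting a user error; `run_elastic_task` is a hypothetical wrapper, not the plugin's actual function:

```python
import signal

# Stand-in for torch.distributed.elastic.multiprocessing.api.SignalException.
class SignalException(Exception):
    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval

# Stand-in mirroring flytekit's IgnoreOutputs exception.
class IgnoreOutputs(Exception):
    pass

def run_elastic_task(launch_agent):
    """Hypothetical wrapper around the elastic launch call: translate the
    shutdown-induced SignalException instead of surfacing it as a user error."""
    try:
        return launch_agent()
    except SignalException as exc:
        # The agent was told to shut down (e.g. the training operator deleted
        # the pod); this is a side effect of the failure, not its root cause.
        raise IgnoreOutputs(f"Elastic agent received signal {exc.sigval}") from exc

def doomed_agent():
    # Simulate the launch agent process being terminated mid-training.
    raise SignalException("got SIGTERM", sigval=signal.SIGTERM)

translated = False
try:
    run_elastic_task(doomed_agent)
except IgnoreOutputs:
    translated = True
print(translated)
```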