Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Don't add master replica log link when doing elastic pytorch training #356

Merged
merged 2 commits into from
Jun 7, 2023

Conversation

fg91
Copy link
Member

@fg91 fg91 commented Jun 7, 2023

TL;DR

When doing torch elastic training, there is no so-called master replica in the resulting PytorchJob as opposed to when doing non-elastic pytorch distributed training.

Flyteplugins, however, still generates a log link for the non-existing master replica in case of elastic training. This PR fixes this.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

I built a propeller image and tested that the correct log links are shown both for elastic and the original non-elastic pytorch tasks.

Complete description

When doing "normal" non-elastic training, a flyte task looks like this:

@task(
    task_config=PyTorch(
        num_workers=3,
    )
)
def train():

The pytorch job that is created from this task definition looks like this:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      ...
    Worker:
      replicas: 3
      ...

Notice that there is a so-called "master" replica and multiple workers.

In the Flyte console, a link to the master replica and to the 3 worker replicas logs is shown.

When using the new elastic training task (torchrun) ...

@task(
    task_config=Elastic(
        nnodes=3,
        nproc_per_node=1,
    ),
)
def train():

... the resulting pytorch job looks like this:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  elasticPolicy:
    ...
  pytorchReplicaSpecs:
    Worker:
      replicas: 3
      ...

Notice that there is no-more "master" replica.

Even though there is no "master" replica, currently the Flyte console still shows a log link for the master replica that doesn't exist.

This PR fixes this.

Tracking Issue

NA

Follow-up issue

NA

@fg91 fg91 force-pushed the fabio/fix/elastic-log-links branch from 984c99c to a4fbf94 Compare June 7, 2023 14:21
@codecov
Copy link

codecov bot commented Jun 7, 2023

Codecov Report

Merging #356 (173acea) into master (76a80ec) will increase coverage by 1.44%.
The diff coverage is 77.83%.

❗ Current head 173acea differs from pull request most recent head 63b4fcd. Consider uploading reports for the commit 63b4fcd to get more accurate results

@@            Coverage Diff             @@
##           master     #356      +/-   ##
==========================================
+ Coverage   62.76%   64.20%   +1.44%     
==========================================
  Files         148      148              
  Lines       12444    10289    -2155     
==========================================
- Hits         7810     6606    -1204     
+ Misses       4038     3072     -966     
- Partials      596      611      +15     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
go/tasks/plugins/webapi/agent/config.go 100.00% <ø> (ø)
.../plugins/k8s/kfoperators/common/common_operator.go 65.94% <34.88%> (-5.99%) ⬇️
go/tasks/plugins/k8s/dask/dask.go 85.77% <57.14%> (+1.85%) ⬆️
go/tasks/plugins/webapi/agent/plugin.go 68.22% <66.66%> (ø)
go/tasks/plugins/k8s/kfoperators/mpi/mpi.go 75.00% <81.48%> (+0.80%) ⬆️
go/tasks/plugins/k8s/ray/ray.go 77.48% <84.09%> (-0.09%) ⬇️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go 77.77% <84.33%> (+3.02%) ⬆️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 78.08% <88.37%> (-1.45%) ⬇️
...tasks/pluginmachinery/flytek8s/container_helper.go 86.36% <88.88%> (+0.36%) ⬆️
go/tasks/pluginmachinery/flytek8s/pod_helper.go 69.81% <100.00%> (-3.21%) ⬇️
... and 1 more

... and 121 files with indirect coverage changes

fg91 added 2 commits June 7, 2023 16:11
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
@fg91 fg91 force-pushed the fabio/fix/elastic-log-links branch from 0709b32 to 63b4fcd Compare June 7, 2023 16:11
@fg91 fg91 marked this pull request as ready for review June 7, 2023 16:12
@fg91 fg91 requested a review from pingsutw June 7, 2023 16:12
@fg91 fg91 merged commit de1d1d3 into master Jun 7, 2023
7 checks passed
@fg91 fg91 deleted the fabio/fix/elastic-log-links branch June 7, 2023 18:15
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
…#356)

* Don't add master log link when doing elastic pytorch training

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>

* Lint

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>

---------

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants