
Fix deletion of elastic task resource requests #379

Merged
hamersaw merged 1 commit into master from fg91/fix/empty-elastic-task-resources on Aug 8, 2023

Conversation

fg91
Member

@fg91 fg91 commented Jul 19, 2023

TL;DR

Currently (as of the 1.8 release), the pods of the following elastic task will not have any resource requests:

@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=2,
    ),
    requests=Resources(cpu="1", mem="1Gi", gpu="1"),
)
def train():
    ...

This PR fixes this.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

The kubeflow plugin was recently refactored in flytekit (flyteorg/flytekit#1627) as well as in flyteplugins (#345).

After the refactoring, users can set e.g. different resources for master and worker replicas of PyTorch tasks:

@task(
    task_config=PyTorch(
        worker=Worker(
            requests=Resources(cpu="2", mem="2Gi"),
            limits=Resources(cpu="4", mem="2Gi"),
        ),
        master=...,
    ),
    requests=Resources(cpu="1", mem="1Gi", gpu="1"),
)
def train():
    ...

If the user overrides one of the resource requests in the task_config, in the backend we override the resources configured on the task level here.

Now, in a @task(task_config=PyTorch()..) task, when the user doesn't override the resources, we send task_models.Resources(requests=[], limits=[]) (which is different from None/nil, see here) to the backend as part of the worker spec. We go into the outer if statement but not into the nested one within it. As a consequence, in this case we don't overwrite the resource request set in @task(requests=Resources()). This makes sense.

However, the Elastic task config in flytekitplugins currently doesn't allow overriding the resource requests since elastic pytorch training jobs don't differentiate between master and worker replicas. There are only workers, which means we can simply use the normal @task(requests=Resources()) to specify their resources.

This, however, means that in this case we send None/nil to the backend as part of the worker spec's resource request. Here we go into the else case and simply delete what the user set via @task(requests=Resources()).

This seems wrong, and this PR changes it.
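
To make the three cases concrete, here is a minimal, self-contained sketch of the behaviour described above. This is not the actual flyteplugins code; the Resources type, the overrideResources helper, and the string-based resource entries are simplified stand-ins for illustration only:

// Simplified illustration of the override behaviour described in this PR.
// Not the real flyteplugins implementation: types and names are assumptions.
package main

import "fmt"

// Resources is a stand-in for the replica-level resource override message.
type Resources struct {
	Requests []string
	Limits   []string
}

// overrideResources mimics the backend logic discussed above: the container
// starts out with the resources from @task(requests=...), and the replica
// spec may carry nil, empty, or non-empty override resources.
func overrideResources(container *Resources, override *Resources) {
	if override != nil {
		if len(override.Requests) > 0 || len(override.Limits) > 0 {
			// Case "list": explicit replica-level overrides win.
			*container = *override
		}
		// Case "empty list": keep what the task already set.
	} else {
		// Case "nil": the pre-fix behaviour wipes the task-level resources.
		// This is the branch elastic tasks fall into and what this PR removes.
		*container = Resources{}
	}
}

func main() {
	taskLevel := Resources{Requests: []string{"cpu=1", "mem=1Gi", "gpu=1"}}

	elastic := taskLevel // the elastic plugin sends no override at all (nil)
	overrideResources(&elastic, nil)
	fmt.Println(elastic) // {[] []} -- requests are gone (the bug)

	pytorch := taskLevel // the pytorch plugin sends an empty, non-nil override
	overrideResources(&pytorch, &Resources{})
	fmt.Println(pytorch) // {[cpu=1 mem=1Gi gpu=1] []} -- requests preserved
}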

Tracking Issue

NA

fixes https://github.com/flyteorg/flyte/issues/

Follow-up issue

NA

			}
			podSpec.Containers[idx].Resources = *resources
		}
	} else {
Member Author

This is the problematic line.

@@ -720,7 +720,7 @@ func TestBuildResourcePytorchV1WithZeroWorker(t *testing.T) {
assert.Error(t, err)
}

-func TestParasElasticConfig(t *testing.T) {
+func TestParseElasticConfig(t *testing.T) {
Member Author

Scope creep: fix typo

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
@fg91 fg91 force-pushed the fg91/fix/empty-elastic-task-resources branch from 5f9842a to 331a60c on July 19, 2023 14:03
@fg91 fg91 requested a review from hamersaw July 19, 2023 14:04
@fg91 fg91 self-assigned this Jul 19, 2023
@fg91 fg91 added the bug Something isn't working label Jul 19, 2023
@fg91
Member Author

fg91 commented Jul 19, 2023

@yubofredwang I somehow can't tag you as a reviewer, but could you please sanity check this? Thanks

@codecov

codecov bot commented Jul 19, 2023

Codecov Report

Merging #379 (5f9842a) into master (13ef7d6) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

❗ Current head 5f9842a differs from the pull request's most recent head 331a60c. Consider uploading reports for commit 331a60c to get more accurate results.

@@            Coverage Diff             @@
##           master     #379      +/-   ##
==========================================
- Coverage   62.96%   62.94%   -0.03%     
==========================================
  Files         154      154              
  Lines       13019    13016       -3     
==========================================
- Hits         8198     8193       -5     
- Misses       4208     4210       +2     
  Partials      613      613              
Flag Coverage Δ
unittests 62.94% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
.../plugins/k8s/kfoperators/common/common_operator.go 64.32% <0.00%> (-0.57%) ⬇️

... and 1 file with indirect coverage changes

@hamersaw
Contributor

@fg91 just want to make sure I understand the logic here. Basically, for the pytorch, MPI, and tensorflow plugins there are three states the resources for each replica can be in:

  • nil: this means that they were not provided and we need to set them to empty lists
  • empty list: we do nothing, this essentially means they were not set in flytekit
  • list: if provided then we override

This introduces issues because in the elastic plugin the resources will always be nil because they are not set. So this PR is changing the handling of nil resources. My concern is whether there is a good reason that nil resources are set to empty lists rather than leaving them as is.

Setting the resources for each replica in flytekit to empty lists for the elastic plugin would achieve the same result right?

@fg91
Member Author

fg91 commented Jul 25, 2023

@fg91 just want to make sure I understand the logic here. Basically, for the pytorch, MPI, and tensorflow plugins there are three states the resources for each replica can be in:

  • nil: this means that they were not provided and we need to set them to empty lists
  • empty list: we do nothing, this essentially means they were not set in flytekit
  • list: if provided then we override

Your understanding is correct that there are these three possible states for resources for each replica and that this is the current behaviour.

Setting the resources for each replica in flytekit to empty lists for the elastic plugin would achieve the same result right?

It would, yes. I first considered doing this, but I personally find the current logic here odd:

Is there a scenario where we would ever want to end up with a pod without any resource requests despite having set them in the task decorator's requests arg? I could imagine that one sets resources in the task decorator and then chooses lower/higher resources for one specific replica. But no resources at all?

From the perspective of the elastic plugin, I find it unintuitive that, in order for the resources set in @task(requests=...) not to be deleted, I have to include an empty list in the worker spec's resources, and that if I completely omit the resources field there (which is the default), the ones I specified in @task(requests=...) are deleted.

Considering the function OverrideContainerSpec, which is passed a pod spec with certain resources and override resources, I find it unintuitive that when I pass empty resources, the original ones in the pod spec are preserved, but if I pass resources=nil, the resources in the pod spec are set to empty resources.

This is why I personally think we should address this in the backend and not in flytekit, but of course I'm happy to discuss!

I'd be interested in your thoughts on this @yubofredwang

@yubofredwang
Contributor

The reason for the code logic is that I assume that when the user does not specify resources at all, they want to use the default resources the user set for that namespace; see the link here.

How about having an extra check like this:

if resources != nil {
  ...
} else if podSpec.Containers[idx].Resources == nil {
  podSpec.Containers[idx].Resources = v1.ResourceRequirements{}
}

Does this fix this bug?
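
A note on the snippet above: in the Kubernetes API, Container.Resources is a v1.ResourceRequirements struct rather than a pointer, so it cannot be compared to nil directly; the closest compilable equivalent is to test whether its Requests/Limits maps are empty. The following is only an illustrative sketch of that suggestion; the applyOverride helper and its simplified signature are assumptions, not the code merged in this PR:

// Illustrative, compilable version of the suggested extra check.
// The helper name and simplified signature are assumptions for this sketch.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func applyOverride(podSpec *v1.PodSpec, idx int, resources *v1.ResourceRequirements) {
	if resources != nil {
		// Existing handling: apply the explicit override
		// (details elided in the original suggestion).
		podSpec.Containers[idx].Resources = *resources
	} else if len(podSpec.Containers[idx].Resources.Requests) == 0 &&
		len(podSpec.Containers[idx].Resources.Limits) == 0 {
		// Only reset to an empty spec when the task set nothing either.
		podSpec.Containers[idx].Resources = v1.ResourceRequirements{}
	}
}

func main() {
	// A container that already carries a task-level request from @task(requests=...).
	podSpec := &v1.PodSpec{Containers: []v1.Container{{
		Name: "trainer",
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("1")},
		},
	}}}

	// nil override (the elastic case): the existing request is preserved.
	applyOverride(podSpec, 0, nil)
	fmt.Println(podSpec.Containers[0].Resources.Requests)
}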

@fg91
Member Author

fg91 commented Jul 27, 2023

The reason for the code logic is that I assume that when the user does not specify resources at all, they want to use the default resources the user set for that namespace; see the link here.

When not specifying any resources at all in a task, e.g.

@task
def train() -> str:
    print("Trained")

the respective pod should have the task defaults configured on the Flyte platform side, not defaults configured e.g. via a kind: LimitRange.

In our case, the task above runs with the platform defaults we configured in the Helm values:

    resources:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi

I would expect the same resources for this task:

@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=2,
    )
)
def train() -> str:
    print("Trained")

When using Flyte 1.8.0 (Helm chart, flytekit, flytekitplugins-kfpytorch), however, the task doesn't get the default resources configured on the platform side:

resources: {}

In my opinion this is not the expected behaviour.

@hamersaw hamersaw merged commit c528bb8 into master Aug 8, 2023
5 of 7 checks passed
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>