[BUG] Timed out task in subworkflow does not fail workflow #4087

Closed
bstadlbauer opened this issue Sep 27, 2023 · 5 comments
Labels: backlogged, bug, exo

@bstadlbauer (Member) commented Sep 27, 2023

Describe the bug

In the following example, a task in a subworkflow fails due to a timeout. When that task fails, it goes into an Unknown state in the dashboard, while the top-level workflow keeps running instead of failing as expected.

I've tested this in a dev sandbox with version v1.9.1.

import time
from datetime import timedelta

from flytekit import task, workflow, Resources


@task(
    timeout=timedelta(seconds=1),
    # Just making sure that things run on my local machine; not required to reproduce
    limits=Resources(cpu="0.5", mem="500Mi"),
)
def demo_task():
    time.sleep(10)


@workflow
def subworkflow():
    demo_task()


@workflow
def demo_workflow():
    subworkflow()
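
Assuming the snippet above is saved as workflows.py (the file name is only illustrative), it can be run against the sandbox with something like:

pyflyte run --remote workflows.py demo_workflow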

(screenshot of the execution in the Flyte console omitted)

Expected behavior

The workflow should fail because the task timed out.

Additional context to reproduce

Here's the output of flytectl for that execution:
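
For reference, node-level details like the ones below can presumably be fetched with something along these lines (project, domain, and execution name taken from the execution itself):

flytectl get execution -p dask-testing -d development f43853b3ae7cc4a469ee --details -o yaml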

- node_exec:
    closure:
      createdAt: "2023-09-27T08:14:27.730147Z"
      outputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/start-node/data/0/outputs.pb
      phase: SUCCEEDED
      updatedAt: "2023-09-27T08:14:27.730373Z"
    id:
      executionId:
        domain: development
        name: f43853b3ae7cc4a469ee
        project: dask-testing
      nodeId: start-node
    inputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/start-node/data/inputs.pb
    metadata:
      specNodeId: start-node
- child_nodes:
  - node_exec:
      closure:
        createdAt: "2023-09-27T08:14:30.079744Z"
        phase: SKIPPED
        updatedAt: "2023-09-27T08:14:30.079926Z"
      id:
        executionId:
          domain: development
          name: f43853b3ae7cc4a469ee
          project: dask-testing
        nodeId: n0-0-end-node
      inputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/n0/data/0/end-node/inputs.pb
      metadata:
        retryGroup: "0"
        specNodeId: end-node
  - node_exec:
      closure:
        createdAt: "2023-09-27T08:14:27.912820Z"
        startedAt: "2023-09-27T08:14:28.073327Z"
        updatedAt: "2023-09-27T08:14:29.993783Z"
      id:
        executionId:
          domain: development
          name: f43853b3ae7cc4a469ee
          project: dask-testing
        nodeId: n0-0-n0
      inputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/n0/data/0/n0/inputs.pb
      metadata:
        retryGroup: "0"
        specNodeId: n0
    task_execs:
    - closure:
        createdAt: "2023-09-27T08:14:28.048292Z"
        duration: 1.026488s
        error:
          code: Task Aborted
          message: node timed out
        eventVersion: 1
        logs:
        - messageFormat: JSON
          name: Kubernetes Logs (User)
          uri: http://localhost:30080/kubernetes-dashboard/#/log/dask-testing-development/f43853b3ae7cc4a469ee-n0-0-n0-0/pod?namespace=dask-testing-development
        metadata:
          generatedName: f43853b3ae7cc4a469ee-n0-0-n0-0
          pluginIdentifier: container
        phase: ABORTED
        reason: '[ContainersNotReady|ContainerCreating]: containers with unready status:
          [f43853b3ae7cc4a469ee-n0-0-n0-0]|'
        reasons:
        - message: task submitted to K8s
          occurredAt: "2023-09-27T08:14:28.048292Z"
        - message: Scheduling
          occurredAt: "2023-09-27T08:14:28.166021Z"
        - message: '[ContainersNotReady|ContainerCreating]: containers with unready
            status: [f43853b3ae7cc4a469ee-n0-0-n0-0]|'
          occurredAt: "2023-09-27T08:14:28Z"
        startedAt: "2023-09-27T08:14:29Z"
        taskType: python-task
        updatedAt: "2023-09-27T08:14:30.026488Z"
      id:
        nodeExecutionId:
          executionId:
            domain: development
            name: f43853b3ae7cc4a469ee
            project: dask-testing
          nodeId: n0-0-n0
        taskId:
          domain: development
          name: flyte_dev_setup.workflows.demo_task
          project: dask-testing
          resourceType: TASK
          version: pytest-c9761586
      inputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/n0/data/0/n0/inputs.pb
  - node_exec:
      closure:
        createdAt: "2023-09-27T08:14:27.858971Z"
        outputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/n0/data/0/start-node/0/outputs.pb
        phase: SUCCEEDED
        updatedAt: "2023-09-27T08:14:27.859360Z"
      id:
        executionId:
          domain: development
          name: f43853b3ae7cc4a469ee
          project: dask-testing
        nodeId: n0-0-start-node
      inputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/n0/data/0/start-node/inputs.pb
      metadata:
        retryGroup: "0"
        specNodeId: start-node
  node_exec:
    closure:
      createdAt: "2023-09-27T08:14:27.768020Z"
      phase: RUNNING
      startedAt: "2023-09-27T08:14:27.877543Z"
      updatedAt: "2023-09-27T08:14:27.877613Z"
    id:
      executionId:
        domain: development
        name: f43853b3ae7cc4a469ee
        project: dask-testing
      nodeId: n0
    inputUri: s3://my-s3-bucket/metadata/propeller/dask-testing-development-f43853b3ae7cc4a469ee/n0/data/inputs.pb
    metadata:
      isParentNode: true
      specNodeId: n0

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
bstadlbauer added the bug and untriaged labels on Sep 27, 2023
@bstadlbauer (Member Author)

Debugging this further, it seems not quite deterministic. Adding retries=1 to the @task does make it deterministic, though.

It does seem like the subworkflow does not terminate when demo_task goes into an Unknown state, triggered either by the pod not being ready (as the timeout is short) or by the task being retried.

@bstadlbauer (Member Author)

The following example fails consistently; note, though, that I don't think it's the retry per se that causes the failure:

import time
from datetime import timedelta

from flytekit import task, workflow


@task(
    timeout=timedelta(seconds=10),
    retries=1,
)
def demo_task():
    time.sleep(30)


@workflow
def subworkflow():
    demo_task()


@workflow
def demo_workflow():
    subworkflow()

@hamersaw (Contributor)

For some context in debugging:

  • Here is where propeller determines whether a node has timed out. Basically, if, after checking the node state, it is not terminal (i.e. succeeded, failed, etc.), then we check whether it has timed out.
  • On the next round of evaluation, propeller should find the node in the TimingOut phase and abort the node.
  • After that, propeller handles cleanup and aborts the workflow as expected. I suspect there is an issue with the subworkflow status on child failures. Here is a more recent issue that is similar to this one (but, I suspect, not the same). Basically, I think propeller is doing the correct things (i.e. aborting subtasks / cleanup) but is not correctly sending events to admin to convey state in some corner cases; a rough sketch of this flow follows below. You should be able to find where this happens by searching for nCtx.EventsRecorder().RecordNodeEvent in the executor.go file.
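
For illustration, a rough Python-style sketch of the flow described above; this is not propeller's actual Go implementation, and the phase names, Node fields, and helper functions are assumptions made purely to restate the three steps:

# Not propeller's actual Go code: a Python-style sketch of the node-evaluation
# flow described above. Phase names, Node fields, and helpers are illustrative
# assumptions only.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TERMINAL_PHASES = {"SUCCEEDED", "FAILED", "ABORTED", "SKIPPED"}


@dataclass
class Node:
    phase: str
    timeout: timedelta
    started_at: Optional[datetime] = None


def abort_node(node: Node) -> None:
    # Placeholder for propeller aborting the underlying task / pod.
    print(f"aborting node currently in phase {node.phase}")


def record_node_event(node: Node, phase: str) -> None:
    # Placeholder for the event sent to admin, i.e. the step suspected to be
    # skipped in some corner cases (cf. nCtx.EventsRecorder().RecordNodeEvent).
    print(f"recording node event: {phase}")


def evaluate_node(node: Node, now: datetime) -> str:
    # Terminal nodes need no further handling.
    if node.phase in TERMINAL_PHASES:
        return node.phase

    # A node already found to be timing out is aborted on this round; a node
    # event should also be recorded so admin reflects the failure.
    if node.phase == "TIMING_OUT":
        abort_node(node)
        record_node_event(node, "ABORTED")
        node.phase = "ABORTED"
        return node.phase

    # Otherwise the node is still active: check whether it exceeded its timeout.
    if node.started_at is not None and now - node.started_at > node.timeout:
        node.phase = "TIMING_OUT"
    return node.phase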

eapolinario removed the untriaged label on Sep 29, 2023
hamersaw added the exo and backlogged labels on Nov 8, 2023
pvditt self-assigned this on Jan 22, 2024
@pvditt (Contributor) commented Jan 24, 2024

@bstadlbauer this issue should be closed by #4766. Please feel free to re-open and tag me if you notice any further issues.

pvditt closed this as completed on Jan 24, 2024
@bstadlbauer (Member Author)

OK, great! We will definitely try this and report back if anything is still broken.
