Enforce execution_timeout in deferrable KubernetesPodOperator #67229

Open
paultmathew wants to merge 2 commits into apache:main from paultmathew:fix/67227-kpo-deferrable-execution-timeout

@paultmathew
Contributor

Why + What

KubernetesPodOperator(deferrable=True) does not enforce execution_timeout. Once the operator defers, the synchronous execute() returns and the signal.alarm-based timeout context wrapping it exits cleanly — there is no further execution_timeout enforcement for the lifetime of the deferral. Pods continue running well past execution_timeout, bounded only by active_deadline_seconds (which defaults to ~1h or whatever the operator passed).
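
To make the gap concrete, here is a minimal sketch of an alarm-based timeout guard. It is not Airflow's task runner code: `alarm_timeout`, `run_with_timeout`, `TaskTimeout` and the local `TaskDeferred` class are hypothetical stand-ins, and the sketch assumes a Unix main thread. The point is only that the guard is disarmed as soon as control leaves the `with` block, which is exactly what happens on deferral.

```python
import signal
from contextlib import contextmanager


class TaskTimeout(Exception):
    """Raised by the alarm handler when the time budget is exhausted."""


class TaskDeferred(Exception):
    """Hypothetical stand-in for the signal an operator raises when it defers."""


@contextmanager
def alarm_timeout(seconds: int):
    """Arm SIGALRM on entry and always disarm it on exit (Unix, main thread only)."""

    def _on_alarm(signum, frame):
        raise TaskTimeout(f"timed out after {seconds}s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)


def run_with_timeout(execute, timeout_seconds: int) -> str:
    """Run execute() under the guard and report how it ended."""
    try:
        with alarm_timeout(timeout_seconds):
            execute()
    except TaskDeferred:
        # The guard has already been torn down here; nothing watches the clock
        # while the work continues elsewhere.
        return "deferred"
    return "finished"


def deferring_execute():
    """Simulates an operator that hands off to a trigger immediately."""
    raise TaskDeferred()
```

Calling `run_with_timeout(deferring_execute, 5)` returns `"deferred"` with no alarm left pending, so any enforcement after that point has to live somewhere other than the guard.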

The framework gap is acknowledged by # TODO: handle timeout in case of deferral at task-sdk/.../task_runner.py:1782.

This PR fixes the symptom for KubernetesPodOperator, mirroring the pattern already merged for AirbyteTriggerSyncOperator (PR #64051) and DbtCloudRunJobOperator (PR #66449).

Approach

  1. Operator (pod.py): in invoke_defer_method, translate execution_timeout into an absolute deadline anchored on ti.start_date:

    • execution_deadline = ti.start_date.timestamp() + execution_timeout.total_seconds()
    • Pass execution_deadline to KubernetesPodTrigger.
    • Pass timeout=remaining (timedelta) to self.defer() so the framework's trigger_timeout also bounds the trigger lifetime as a backstop.
    • Anchoring on ti.start_date keeps the deadline stable across re-deferrals (e.g. logging_interval re-entries), since Airflow preserves the original start_date when a task resumes from defer.
    • Re-pass context from trigger_reentry to invoke_defer_method so the deadline is recomputed correctly on each re-defer.
  2. Trigger (pod.py): at the top of _wait_for_container_completion, check time.time() >= execution_deadline and emit a status="timeout" event when the deadline is crossed. The operator's existing trigger_reentry terminal-event path already handles status in ("error", "failed", "timeout", "success") — the operator fails the task and _clean() runs on_finish_action (default: delete pod).
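
The deadline arithmetic in step 1 can be sketched as below. This is an illustration of the formula quoted above, not the code in the PR: `compute_execution_deadline` is a hypothetical helper name, and `start_date` stands in for `ti.start_date`.

```python
from __future__ import annotations

import datetime


def compute_execution_deadline(
    start_date: datetime.datetime,
    execution_timeout: datetime.timedelta | None,
) -> float | None:
    """Return an absolute POSIX-timestamp deadline, or None when no timeout is set."""
    if execution_timeout is None:
        return None
    # Anchored on the task start, not on "now", so every re-deferral
    # recomputes the same deadline.
    return start_date.timestamp() + execution_timeout.total_seconds()
```

Because the result depends only on the task start and the configured timeout, calling it again on a later re-entry yields the same value, which is the stability property the approach relies on.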

Impact

  • Existing behaviour preserved when execution_timeout is unset: such tasks see no behaviour change (execution_deadline=None, defer.timeout=None). For deferred KPO tasks that do set it, execution_timeout was previously a no-op and is now enforced.
  • No public API changes: the new execution_deadline parameter on KubernetesPodTrigger is keyword-only with a None default. Trigger serialization adds the field but defaults preserve back-compat for existing serialized triggers (the trigger's __init__ accepts the kwarg as optional).
  • Pod cleanup: the existing on_finish_action path handles pod deletion (default delete_pod) when the operator fails on a timeout event. _clean() already special-cases event["status"] == "timeout" to skip await_pod_completion (the pod may hang on ErrImagePull/ContainerCreating).
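
The serialization back-compat point in the second bullet can be illustrated with a toy trigger. `PodTriggerSketch` and its classpath string are hypothetical and only mimic the (classpath, kwargs) shape that triggers serialize to; the real KubernetesPodTrigger takes many more arguments.

```python
from __future__ import annotations


class PodTriggerSketch:
    """Toy trigger with an optional, keyword-only execution_deadline."""

    def __init__(
        self,
        pod_name: str,
        *,
        poll_interval: float = 2.0,
        execution_deadline: float | None = None,
    ):
        self.pod_name = pod_name
        self.poll_interval = poll_interval
        self.execution_deadline = execution_deadline

    def serialize(self) -> tuple[str, dict]:
        """Return a (classpath, kwargs) pair that can rebuild this trigger."""
        return (
            "example.PodTriggerSketch",  # hypothetical classpath
            {
                "pod_name": self.pod_name,
                "poll_interval": self.poll_interval,
                "execution_deadline": self.execution_deadline,
            },
        )


# Kwargs as an older version would have stored them, without the new field.
legacy_kwargs = {"pod_name": "demo-pod", "poll_interval": 2.0}
legacy_trigger = PodTriggerSketch(**legacy_kwargs)

# A trigger created after the change round-trips the deadline.
_, new_kwargs = PodTriggerSketch("demo-pod", execution_deadline=1234.5).serialize()
rebuilt_trigger = PodTriggerSketch(**new_kwargs)
```

Under these assumptions, a trigger rebuilt from kwargs that predate the field simply gets `execution_deadline=None`, so no deadline is enforced for it.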

Tests

  • Trigger (tests/unit/cncf/kubernetes/triggers/test_pod.py):
    • Updated test_serialize to include the new execution_deadline key.
    • Added test_serialize_with_execution_deadline — round-trips a non-None deadline.
    • Added test_run_loop_emits_timeout_event_when_execution_deadline_reached — past-deadline → first iteration emits status="timeout" event.
    • Added test_run_loop_does_not_emit_timeout_when_execution_deadline_not_reached — far-future deadline → trigger keeps polling normally.
  • Operator (tests/unit/cncf/kubernetes/operators/test_pod.py):
    • Added test_invoke_defer_method_passes_execution_deadline_when_execution_timeout_set — operator with execution_timeout=300s passes a deadline ≈ ti.start_date + 300s to the trigger; defer.timeout is set.
    • Added test_invoke_defer_method_passes_no_deadline_when_execution_timeout_not_set — operator without execution_timeout passes None (no enforcement, no behaviour change).

Backwards Compatibility

No public API changes. The new execution_deadline parameter on KubernetesPodTrigger is optional with default None. Behaviour change: deferred KPO tasks that set execution_timeout now fail at the configured timeout instead of running past it; this is the documented contract.

Closes

Closes: #67227

@boring-cyborg boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels May 20, 2026
@paultmathew paultmathew force-pushed the fix/67227-kpo-deferrable-execution-timeout branch from 4eb5810 to fb5e3fb Compare May 20, 2026 14:18
@paultmathew paultmathew marked this pull request as ready for review May 20, 2026 15:27
@paultmathew paultmathew force-pushed the fix/67227-kpo-deferrable-execution-timeout branch from 5151d22 to fb5e3fb Compare May 20, 2026 15:57
Contributor

@jscheffl jscheffl left a comment

Thanks for the extension; looks good to me, except for some comments.

last_log_time=last_log_time,
logging_interval=logging_interval,
trigger_kwargs=trigger_kwargs,
execution_deadline=execution_deadline,
Contributor

This adds a coupling between the AWS and CNCF-K8s providers, which are packaged into different distributions. If we keep it like this, the AWS provider would gain a required dependency on the next K8s provider version being available. This dependency would need to be added to pyproject.toml with a # use next version marker.

Contributor Author

Switched EksPodTrigger to forward base-trigger kwargs through **kwargs (commit 8dfa1af) rather than listing each parent parameter explicitly. This mirrors GKEStartPodTrigger where *args, **kwargs get forwarded to super().__init__ directly.

The explicit kwarg list was readable as documentation of the supported surface. I think **kwargs + the KubernetesPodTrigger docstring is a reasonable substitute — but if you'd rather keep the explicit list and add a # use next version marker in providers/amazon/pyproject.toml, happy to flip back. Let me know.
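
For readers weighing the two options, a rough sketch of the `**kwargs` forwarding pattern follows. `BasePodTriggerSketch` and `EksStyleTriggerSketch` are hypothetical classes, not the real KubernetesPodTrigger or EksPodTrigger signatures.

```python
from __future__ import annotations


class BasePodTriggerSketch:
    """Stands in for the base trigger that owns execution_deadline."""

    def __init__(self, pod_name: str, *, execution_deadline: float | None = None):
        self.pod_name = pod_name
        self.execution_deadline = execution_deadline


class EksStyleTriggerSketch(BasePodTriggerSketch):
    """Subclass that names only its own parameter and forwards the rest."""

    def __init__(self, cluster_name: str, **kwargs):
        # The subclass never mentions execution_deadline, so its own source
        # does not depend on which base-class release introduced it.
        super().__init__(**kwargs)
        self.cluster_name = cluster_name


forwarded = EksStyleTriggerSketch(
    cluster_name="demo-cluster",
    pod_name="demo-pod",
    execution_deadline=1_700_000_300.0,
)
without_deadline = EksStyleTriggerSketch(cluster_name="demo-cluster", pod_name="demo-pod")
```

The trade-off is the one described above: the subclass signature no longer documents the supported surface, and a kwarg the installed base class does not know still fails, just at call time inside `super().__init__`.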

Comment thread providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py Outdated
Comment thread providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/pod.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR enforces execution_timeout for KubernetesPodOperator(deferrable=True) by translating the timeout into an absolute deadline, passing it to KubernetesPodTrigger, and adding trigger-side logic to emit a terminal timeout event when the deadline is exceeded (with accompanying unit tests). It also updates the EKS-specific trigger subclass to forward the new parameter.

Changes:

  • Add execution_deadline plumbing from KubernetesPodOperator.invoke_defer_method() to KubernetesPodTrigger and pass a timeout= to defer() based on remaining budget.
  • Add trigger-side deadline enforcement that emits a status="timeout" event once the deadline is crossed.
  • Extend/adjust unit tests for trigger serialization and timeout behavior, plus operator deferral plumbing.
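
The trigger-side enforcement in the second bullet might look roughly like the loop below. This is a sketch under assumptions, not the trigger's actual run loop: `wait_with_deadline` and `make_poller` are hypothetical names, and the returned dict only imitates the payload of a timeout event.

```python
import asyncio
import time


async def wait_with_deadline(poll_state, execution_deadline=None, poll_interval=0.01, clock=time.time):
    """Poll until a terminal state, or report a timeout once the deadline passes."""
    while True:
        # Check the deadline first so an already-expired deadline fires on
        # the very first iteration, before any polling happens.
        if execution_deadline is not None and clock() >= execution_deadline:
            return {"status": "timeout", "message": "execution_timeout exceeded"}
        state = await poll_state()
        if state in ("success", "failed"):
            return {"status": state}
        await asyncio.sleep(poll_interval)


def make_poller(states):
    """Return an async callable that yields the given states in order, repeating the last."""
    remaining = list(states)

    async def poll_state():
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return poll_state
```

With a deadline in the past, `asyncio.run(wait_with_deadline(make_poller(["running"]), execution_deadline=0.0))` returns the timeout payload without waiting; with a far-future deadline or `None`, the loop keeps polling until the pod reaches a terminal state.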

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py | Compute an absolute execution deadline from ti.start_date and execution_timeout, pass it to the trigger, and set defer(timeout=…). |
| providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/pod.py | Add execution_deadline to trigger init/serialization and emit a timeout TriggerEvent when the deadline is exceeded. |
| providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_pod.py | Add tests asserting the operator passes execution_deadline (or None) into the trigger and sets defer.timeout appropriately. |
| providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_pod.py | Update serialization expectations and add trigger run-loop tests for deadline timeout vs. continued polling. |
| providers/amazon/src/airflow/providers/amazon/aws/triggers/eks.py | Forward the new execution_deadline parameter through EksPodTrigger to the base Kubernetes trigger. |

Comment on lines +923 to +925
# ``trigger_timeout``).
remaining = execution_deadline - time.time()
defer_timeout = datetime.timedelta(seconds=max(0.0, remaining))
Contributor Author

Fixed by clamping defer_timeout to a 60-second minimum buffer:

remaining = execution_deadline - time.time()
defer_timeout = max(
    datetime.timedelta(seconds=remaining),
    datetime.timedelta(seconds=60),
)

Rationale for the 60s buffer (vs the alternatives you suggested):

  • Don't set timeout when remaining <= 0: works, but loses the framework backstop entirely. If the trigger hangs (bug, network partition, etc.) the task stays deferred forever.
  • Fail immediately when remaining <= 0: cleanest in theory, but invasive — would need to raise an exception from invoke_defer_method and route through cleanup. Bigger refactor than the bug warrants.
  • 60s minimum buffer (chosen): the trigger's first-iteration deadline check (top of run()) fires within ~poll_interval seconds (default 2s) and emits the operator-handled status="timeout" event. The 60s framework backstop only fires if the trigger is actually hung. Best of both worlds.

Added test test_invoke_defer_method_clamps_defer_timeout_to_minimum_buffer_when_deadline_close that uses time_machine.travel(ti_start + 600s) to put the deadline 300s in the past and asserts defer.timeout == timedelta(seconds=60).
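
The scenario in that test can be reproduced in isolation with a small helper. `clamp_defer_timeout` and `MINIMUM_DEFER_TIMEOUT` are hypothetical names used only for this sketch; the 60-second floor and the numbers come from the description above, and a fixed timestamp replaces `time_machine`.

```python
import datetime

# 60-second floor taken from the reply above; the constant name is illustrative.
MINIMUM_DEFER_TIMEOUT = datetime.timedelta(seconds=60)


def clamp_defer_timeout(
    execution_deadline: float,
    now: float,
    minimum: datetime.timedelta = MINIMUM_DEFER_TIMEOUT,
) -> datetime.timedelta:
    """Return the remaining budget as a timedelta, never below the minimum."""
    remaining = datetime.timedelta(seconds=execution_deadline - now)
    # A negative or tiny remainder is replaced by the floor so the framework
    # backstop stays armed while the trigger reports the timeout itself.
    return max(remaining, minimum)


ti_start = 1_700_000_000.0      # arbitrary fixed task start for the example
deadline = ti_start + 300.0     # execution_timeout of 300s
late = clamp_defer_timeout(deadline, now=ti_start + 600.0)   # deadline 300s in the past
early = clamp_defer_timeout(deadline, now=ti_start)          # full budget left
```

Here `late` comes out as the 60-second floor and `early` as the full 300 seconds, matching the two cases the clamp is meant to separate.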

Open to revisiting if you'd prefer a different minimum (or one of the alternative approaches).

Comment thread providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_pod.py Outdated
Comment thread providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_pod.py Outdated
@paultmathew paultmathew force-pushed the fix/67227-kpo-deferrable-execution-timeout branch from fb5e3fb to 8dfa1af Compare May 20, 2026 23:45
@paultmathew
Contributor Author

@jscheffl Thanks for the review. I pushed a change and addressed the comments.

@paultmathew paultmathew force-pushed the fix/67227-kpo-deferrable-execution-timeout branch 2 times, most recently from 30e6f9a to 448ace1 Compare May 21, 2026 01:23
paultmathew and others added 2 commits May 22, 2026 12:47
Closes: apache#67227
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@paultmathew paultmathew force-pushed the fix/67227-kpo-deferrable-execution-timeout branch from 2da1d91 to 6eec8cf Compare May 22, 2026 18:08

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Development

Successfully merging this pull request may close these issues.

KubernetesPodOperator does not enforce execution_timeout semantics in Deferrable mode

3 participants