Add crash recovery ability to SparkSubmitOperator against Kubernetes #68067
Open
amoghrajesh wants to merge 2 commits into
potiuk (Member) requested changes on Jun 5, 2026:
Good direction wiring the K8s branch into `execute_resumable`. A few things to fix. CI is red on real issues, not flakes:

- `get_job_status` split logic: `namespace, pod_name = external_id.split(":", 1)` happens before the `len(parts) != 2` check, so that validation is unreachable: a malformed `external_id` raises the unpack error, not your `ValueError`. It also splits/unpacks three times. Suggest: `parts = external_id.split(":", 1)` → validate → `namespace, pod_name = parts`, once.
- MyPy: line 327 `return cached` returns the `JsonValue` from `task_store.get(...)`, but the method is `-> str`. Needs a `cast("str", cached)` (or coerce) to satisfy the contract.
- Failing test: `test_k8s_execute_persists_pod_id_to_task_store_when_reconnect_on_retry` fails on all four Compat runs (`persisted_before_poll` is empty). Looks like the pod id isn't being persisted before `poll_until_complete`, or the test's expectation is off; worth tracing which.
- Mind adding a short description to the PR body (what/why)? Helps the changelog and reviewers.

Drafted-by: Claude Code (Opus 4.8); reviewed by @potiuk before posting
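
The single-split suggestion in the review could look roughly like the sketch below. `parse_external_id` is a hypothetical helper name used only for illustration, not code from the PR; the `"{namespace}:{pod_name}"` format and the `ValueError` are the only parts taken from the discussion.

```python
def parse_external_id(external_id: str) -> tuple[str, str]:
    """Split "{namespace}:{pod_name}" exactly once, validating before unpacking."""
    parts = external_id.split(":", 1)
    # Validate first, so a malformed ID raises this explicit ValueError
    # instead of the generic "not enough values to unpack" error.
    if len(parts) != 2:
        raise ValueError(
            f"Malformed external_id {external_id!r}; expected 'namespace:pod_name'"
        )
    namespace, pod_name = parts
    return namespace, pod_name
```

Because the split uses `maxsplit=1`, any further colons stay inside the pod-name half rather than causing an unpack failure.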
Was generative AI tooling used to co-author this PR?
What is being solved?
This is part of the resumability story for Spark across deploy modes: #67118 (standalone Spark) and #67473 (YARN, in flight). Right now, in Kubernetes mode, if the Airflow worker dies while the Spark job is running, Airflow loses track of the driver pod entirely and the retry submits a fresh job, wasting work already done or causing conflicts when the Spark job is not idempotent.
Current behaviour

With `track_driver_via_k8s_api=True` (added in #67715), the `spark-submit` JVM is released after pod creation and the operator polls via the Python K8s client. However, the driver pod name is never persisted, so a worker crash still causes the retry to submit a fresh job. That PR was just a building block toward resumability.

Proposed change
Wires the `SparkSubmitOperator` K8s path into `ResumableJobMixin` to benefit from resumability.

Flow:

- `execute()` detects `_should_track_driver_via_k8s_api()` and routes to `execute_resumable` (when `reconnect_on_retry=True`, the default) or a plain submit-and-poll (when `reconnect_on_retry=False`).
- `submit_job()` injects `spark.kubernetes.submission.waitAppCompletion=false`, calls `hook.submit()`, captures the driver pod name from the submit log, encodes it as `"{namespace}:{pod_name}"`, and returns it. The mixin writes this to `task_store` before polling begins; this acts as the crash recovery anchor.
- `get_job_status()` checks a `k8s_driver_status` cache key in `task_store` first (to handle the case where the pod was garbage-collected after success, without a live K8s API call), then queries the live pod phase via `kube_client.read_namespaced_pod`.
- `is_job_active()` / `is_job_succeeded()` map raw pod phases to the mixin semantics.
- `poll_until_complete()` sets `hook._kubernetes_driver_pod` from the external ID, delegates to the existing `_poll_k8s_driver_via_api()` loop (which handles transient API errors, consecutive unknown phases, and pod cleanup on success), then writes `"Succeeded"` to `task_store` under `k8s_driver_status` before the pod is deleted.

Diagram attached for reference:

On retry, the mixin reads the saved pod ID, calls `get_job_status`, and either reconnects to the running pod or resubmits fresh based on the pod phase. Phases handled: `Running`, `Pending`, `Succeeded`, `Failed`, `Unknown`, `NotFound` (pod GC'd).

Why only `"Succeeded"` is cached

The operator does not delete failed pods; they remain queryable. So `Failed` never needs caching. If a failed pod is also GC'd before the retry, `NotFound` → resubmit is the correct behaviour anyway. `"Succeeded"` is cached specifically because the operator deletes the driver pod on success; without the cache, a succeeded-then-GC'd pod would be indistinguishable from a failed-then-GC'd one and would trigger a spurious resubmit.

Why `get_job_status` checks `task_store` before the K8s API

On a retry, `execute_resumable` calls `get_job_status` to decide whether to reconnect or resubmit. For K8s, this means querying the live pod phase, but pods are ephemeral. The K8s API keeps no record of a pod once it is deleted (by the operator on success, by the K8s TTL controller, or by cluster admins); at that point the phase information is gone permanently. There is no equivalent of YARN's application history server.

This means a live K8s API query for a completed and deleted pod returns `404 NotFound`, which the mixin would otherwise treat as a terminal failure and resubmit. That would be wrong if the job already succeeded.

The `k8s_driver_status` key written at the end of `poll_until_complete` is what bridges the gap. By checking `task_store` first and only falling through to the live API when no cached status is present, `get_job_status` reports a succeeded job correctly whether or not the pod still exists.
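
The cache-first lookup described above can be sketched as follows. This is an illustration, not the PR's code: `read_live_phase` is a hypothetical callable standing in for the `kube_client.read_namespaced_pod` query, `PodNotFound` is a hypothetical exception standing in for the API's 404, and `task_store` is modelled as a plain `dict`. Treating only `Running` and `Pending` as active is an assumption about how the phases map to the mixin semantics.

```python
from typing import Callable

STATUS_KEY = "k8s_driver_status"


class PodNotFound(Exception):
    """Hypothetical stand-in for a 404 response from the K8s API."""


def get_job_status(
    task_store: dict,
    external_id: str,
    read_live_phase: Callable[[str, str], str],
) -> str:
    """Return the cached terminal status if present, else the live pod phase."""
    cached = task_store.get(STATUS_KEY)
    if cached is not None:
        # The pod may already be deleted; the cache is the only record left.
        return str(cached)
    parts = external_id.split(":", 1)
    if len(parts) != 2:
        raise ValueError(f"Malformed external_id {external_id!r}")
    namespace, pod_name = parts
    try:
        return read_live_phase(namespace, pod_name)
    except PodNotFound:
        return "NotFound"


def is_job_active(status: str) -> bool:
    # Assumption: only these two phases count as "still running".
    return status in {"Running", "Pending"}


def is_job_succeeded(status: str) -> bool:
    return status == "Succeeded"
```

With `"Succeeded"` in the store, the function never touches the live reader, so a deleted pod is still reported as succeeded; with an empty store and a missing pod it reports `"NotFound"`, which is the resubmit case.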
Things of note

- `spark.kubernetes.driver.deleteOnTermination=false` is not injected in this PR. K8s will GC the driver pod after it exits normally. The `k8s_driver_status` cache covers the succeeded+GC'd case. For crash recovery, there is a small window: if the worker crashes after `_poll_k8s_driver_via_api` completes but before `task_store.set(k8s_driver_status, "Succeeded")` is written, the next retry will see `NotFound` and resubmit fresh rather than recognising the job already succeeded. This is a known limitation shared with the standalone crash recovery path.
- [...] `submit_job` returns `None` and the mixin falls back to a fresh submit on retry. A warning is logged.
- [...] `Succeeded` until all sidecars exit. Set `execution_timeout` as a hard bound; this is being tracked in "For spark operator with `track_driver_via_k8s_api`, detect driver completion by container status rather than pod phase" #67934.

Testing
Running Airflow on breeze against a `kind` cluster

Breeze is configured to talk to the kind cluster.

Airflow connection defined:

    airflow connections add spark_default \
        --conn-type spark \
        --conn-host "k8s://${K8S_SERVER}" \
        --conn-extra '{"deploy-mode": "cluster", "namespace": "spark"}'

DAG:
Test 1: Success State
Logs before these changes:
successlogs.txt
Logs after changes:
success_state_logs.txt
Pod deleted:

Test 2: Crash Recovery (kill worker mid run)
Test 2a: Kill worker first and wait for driver pod to complete (success state, no reconnect but avoids duplicate submission)
Pod is up
Worker down:

Spark Driver completed

Fired the worker back up and this is what we get
Test 2b: Kill worker first and resume worker midway (reconnect abilities)
Same steps as above
But resume worker when spark driver is still running
Poll continues till completion: