
Add crash recovery ability to SparkSubmitOperator against Kubernetes#68067

Open
amoghrajesh wants to merge 2 commits into
apache:mainfrom
astronomer:aip-103-spark-on-k8s-crash-recovery

Conversation

@amoghrajesh


Was generative AI tooling used to co-author this PR?
  • Yes: claude sonnet

What is being solved?

This is part of the resumability story for Spark across deploy modes: #67118 (standalone Spark) and #67473 (YARN, in flight). Right now, in Kubernetes mode, if the Airflow worker dies while the Spark job is running, Airflow loses track of the driver pod entirely and the retry submits a fresh job, wasting work already done or causing conflicts when the Spark job is not idempotent.

Current behaviour

With track_driver_via_k8s_api=True (added in #67715), the spark-submit JVM is released after pod creation and the operator polls via the Python K8s client. However, the driver pod name is never persisted, so a worker crash still causes the retry to submit a fresh job. That PR was just a building block to attempt resumability.

Proposed change

Wires the SparkSubmitOperator K8s path into ResumableJobMixin to benefit from resumability.

Flow:

  1. execute() detects _should_track_driver_via_k8s_api() and routes to execute_resumable (when reconnect_on_retry=True, the default) or a plain submit-and-poll (when reconnect_on_retry=False).
  2. submit_job() injects spark.kubernetes.submission.waitAppCompletion=false, calls hook.submit(), captures the driver pod name from the submit log, encodes it as "{namespace}:{pod_name}", and returns it. The mixin writes this to task_store before polling begins; it acts as the crash recovery anchor.
  3. get_job_status() checks a k8s_driver_status cache key in task_store first (to handle the case where the pod was garbage-collected after success, without a live K8s API call), then queries the live pod phase via kube_client.read_namespaced_pod.
  4. is_job_active() / is_job_succeeded() map raw pod phases to the mixin semantics.
  5. poll_until_complete() sets hook._kubernetes_driver_pod from the external ID, delegates to the existing _poll_k8s_driver_via_api() loop (which handles transient API errors, consecutive unknown phases, and pod cleanup on success), then writes "Succeeded" to task_store under k8s_driver_status before the pod is deleted.
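
The ordering in step 2 (persist the external ID before polling starts) can be sketched as follows. This is a minimal illustration, not the PR's code: `task_store` is modelled as a plain dict, and `EXTERNAL_ID_KEY`, `encode_external_id`, and `submit_then_poll` are hypothetical names; only the `"{namespace}:{pod_name}"` encoding comes from the description above.

```python
from typing import Callable, Dict, Optional

# Hypothetical key name; the real key used by the mixin is not shown in this PR text.
EXTERNAL_ID_KEY = "external_id"


def encode_external_id(namespace: str, pod_name: str) -> str:
    """Encode the driver pod location as "{namespace}:{pod_name}"."""
    return f"{namespace}:{pod_name}"


def submit_then_poll(
    task_store: Dict[str, str],
    submit: Callable[[], Optional[str]],
    poll: Callable[[Optional[str]], None],
) -> Optional[str]:
    """Submit, persist the external ID, and only then start polling."""
    external_id = submit()
    if external_id is not None:
        # Written before polling so that a worker crash during the poll
        # still leaves a recovery anchor for the retry.
        task_store[EXTERNAL_ID_KEY] = external_id
    poll(external_id)
    return external_id
```

If the worker dies anywhere inside `poll`, the store already holds the pod ID, which is what the retry needs to decide between reconnecting and resubmitting.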

Diagram attached for reference:
K8s Job Submission and-2026-06-05-084419

On retry, the mixin reads the saved pod ID, calls get_job_status, and either reconnects to the running pod or resubmits fresh based on the pod phase:

| Pod phase on retry | Active? | Succeeded? | Mixin action |
|---|---|---|---|
| Running | Yes | | Reconnect and continue polling |
| Pending | Yes | | Reconnect and continue polling |
| Succeeded | No | Yes | Return result, skip resubmit |
| Failed | No | No | Resubmit fresh |
| Unknown | No | No | Resubmit fresh |
| NotFound (pod GC'd) | No | No | Resubmit fresh |
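
The table can be read as a small decision function. The sketch below is an illustration under assumptions: the `is_job_active` and `is_job_succeeded` names follow the description, but their signatures here, plus `action_on_retry` and its return strings, are hypothetical.

```python
ACTIVE_PHASES = frozenset({"Running", "Pending"})
SUCCEEDED_PHASES = frozenset({"Succeeded"})


def is_job_active(phase: str) -> bool:
    """A job is active while its driver pod is Running or Pending."""
    return phase in ACTIVE_PHASES


def is_job_succeeded(phase: str) -> bool:
    """Only the Succeeded phase counts as success."""
    return phase in SUCCEEDED_PHASES


def action_on_retry(phase: str) -> str:
    """Map a pod phase seen on retry to the action in the table (hypothetical labels)."""
    if is_job_active(phase):
        return "reconnect"
    if is_job_succeeded(phase):
        return "return_result"
    # Failed, Unknown, NotFound, and anything unexpected all resubmit.
    return "resubmit"
```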

Why only "Succeeded" is cached

The operator does not delete failed pods — they remain queryable. So Failed never needs caching. If a failed pod is also GC'd before the retry, NotFound → resubmit is the correct behaviour anyway. "Succeeded" is cached specifically because the operator deletes the driver pod on success; without the cache, a succeeded-then-GC'd pod would be indistinguishable from a failed-then-GC'd one and would trigger a spurious resubmit.

Why get_job_status checks task_store before the K8s API

On a retry, execute_resumable calls get_job_status to decide whether to reconnect or
resubmit. For K8s, this means querying the live pod phase, but pods are ephemeral. The K8s
API keeps no record of deleted pods: once a pod is deleted (by the operator on
success, by the K8s TTL controller, or by cluster admins), that phase information is gone
permanently. There is no equivalent of YARN's application history server.

This means a live K8s API query for a deleted pod always returns 404 NotFound, which the
mixin would otherwise treat as a terminal failure and resubmit. That would be wrong if the job
already succeeded.

The k8s_driver_status key written at the end of poll_until_complete is what bridges the
gap. By checking task_store first and only falling through to the live API when no cached
status is present, get_job_status correctly reports the job's terminal outcome regardless of
whether the pod still exists.
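
A minimal sketch of the cache-first lookup, assuming a dict-like `task_store` and an injected `read_pod_phase` callable standing in for the live `kube_client.read_namespaced_pod` call. `PodNotFound` is a hypothetical stand-in for the client's 404 error, and the function signature is not the PR's.

```python
from typing import Any, Callable, Mapping


class PodNotFound(Exception):
    """Hypothetical stand-in for a 404 response from the K8s API."""


def get_job_status(
    task_store: Mapping[str, Any],
    external_id: str,
    read_pod_phase: Callable[[str, str], str],
) -> str:
    """Return the cached terminal status if present, else the live pod phase."""
    cached = task_store.get("k8s_driver_status")
    if cached is not None:
        # The pod may already be deleted; the cache is the only record left.
        return str(cached)

    # partition() never raises; a malformed ID simply yields an empty pod name.
    namespace, _, pod_name = external_id.partition(":")
    try:
        return read_pod_phase(namespace, pod_name)
    except PodNotFound:
        return "NotFound"
```

With the cache populated, the live API is never consulted, so a succeeded and since-deleted pod still reports "Succeeded" instead of "NotFound".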

Things of note

  • spark.kubernetes.driver.deleteOnTermination=false is not injected in this PR. K8s will GC the driver pod after it exits normally. The k8s_driver_status cache covers the succeeded+GC'd case. For crash recovery, there is a small window: if the worker crashes after _poll_k8s_driver_via_api completes but before task_store.set(k8s_driver_status, "Succeeded") is written, the next retry will see NotFound and resubmit fresh rather than recognising the job already succeeded. This is a known limitation shared with the standalone crash recovery path.
  • If the driver pod name is not captured from the submit log (e.g. submit output is suppressed), submit_job returns None and the mixin falls back to a fresh submit on retry. A warning is logged.
  • If your driver namespace has sidecar injection enabled (e.g. Istio), the pod phase may not advance to Succeeded until all sidecars exit. Set execution_timeout as a hard bound. This is tracked in #67934 (For spark operator with track_driver_via_k8s_api, detect driver completion by container status rather than pod phase).
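
To illustrate the sidecar point in the last bullet, here is a rough sketch of what container-status-based detection could look like. It is not part of this PR and not the design tracked in #67934. It assumes the pod is available as a JSON-style dict using the Kubernetes field names `status.containerStatuses[].state.terminated.exitCode`, and that the driver container is named `spark-kubernetes-driver`; the function name is hypothetical.

```python
from typing import Any, Mapping, Optional


def driver_exit_code(
    pod: Mapping[str, Any],
    container_name: str = "spark-kubernetes-driver",  # assumed driver container name
) -> Optional[int]:
    """Return the driver container's exit code, or None if it has not terminated."""
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for container_status in statuses:
        if container_status.get("name") != container_name:
            continue  # skip sidecars such as an injected proxy
        terminated = (container_status.get("state") or {}).get("terminated")
        if terminated is not None:
            return terminated.get("exitCode")
    return None
```

A pod whose phase is still Running because a sidecar has not exited would still yield the driver's exit code here, which is the distinction the bullet describes.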

Testing

  • Running Airflow on breeze against a kind cluster

  • Breeze is configured to talk to the kind cluster

  • Airflow connection defined:

airflow connections add spark_default \
    --conn-type spark \
    --conn-host "k8s://${K8S_SERVER}" \
    --conn-extra '{"deploy-mode": "cluster", "namespace": "spark"}'
  • Airflow workers have access to the k8s cluster

DAG:

import datetime

from airflow.sdk import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_k8s_crash_recovery_repro",
    start_date=datetime.datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    SparkSubmitOperator(
        task_id="submit_long_running_job",
        conn_id="spark_default",
        application="local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar",
        java_class="org.apache.spark.examples.SparkPi",
        application_args=["100000"],
        conf={
            "spark.kubernetes.container.image": "apache/spark:3.5.3",
            "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
            "spark.driver.extraJavaOptions": "-Djavax.net.ssl.trustAll=true",
            "spark.executor.extraJavaOptions": "-Djavax.net.ssl.trustAll=true",
        },
        retries=1,
        retry_delay=datetime.timedelta(seconds=5),
        track_driver_via_k8s_api=True,
    )

Test 1: Success State

image

Logs before these changes:
successlogs.txt

Logs after changes:
success_state_logs.txt

Pod deleted:
image

Test 2: Crash Recovery (kill worker mid run)

Test 2a: Kill worker first and wait for driver pod to complete (success state, no reconnect but avoids duplicate submission)

Pod is up

image

Worker down:
image

image

Spark Driver completed
image

Fired the worker back up and this is what we get

image

Test 2b: Kill worker first and resume worker midway (reconnect abilities)

Same steps as above

But resume worker when spark driver is still running

image image

Poll continues till completion:

image


@potiuk potiuk left a comment


Good direction wiring the K8s branch into execute_resumable. A few things to fix — CI is red on real issues, not flakes:

  • get_job_status split logic: namespace, pod_name = external_id.split(":", 1) happens before the len(parts) != 2 check, so that validation is unreachable — a malformed external_id raises the unpack error, not your ValueError. And it splits/unpacks three times. Suggest: parts = external_id.split(":", 1) → validate → namespace, pod_name = parts, once.
  • MyPy: line 327 return cached returns the JsonValue from task_store.get(...), but the method is -> str. Needs a cast("str", cached) (or coerce) to satisfy the contract.
  • Failing test: test_k8s_execute_persists_pod_id_to_task_store_when_reconnect_on_retry fails on all four Compat runs (persisted_before_poll is empty). Looks like the pod id isn't being persisted before poll_until_complete, or the test's expectation is off — worth tracing which.
  • Mind adding a short description to the PR body (what/why)? Helps the changelog and reviewers.
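
The first two suggestions can be sketched as below. This illustrates the reviewer's proposal rather than the PR's actual code: `parse_external_id` and `cached_status` are hypothetical helper names, and `task_store` is modelled as a plain mapping.

```python
from typing import Any, Mapping, Optional, Tuple, cast


def parse_external_id(external_id: str) -> Tuple[str, str]:
    """Split once, validate, then unpack, so a malformed ID raises ValueError."""
    parts = external_id.split(":", 1)
    if len(parts) != 2:
        raise ValueError(
            f"Expected external_id as 'namespace:pod_name', got {external_id!r}"
        )
    namespace, pod_name = parts
    return namespace, pod_name


def cached_status(task_store: Mapping[str, Any]) -> Optional[str]:
    """Return the cached driver status typed as str for the type checker."""
    cached = task_store.get("k8s_driver_status")
    if cached is None:
        return None
    # cast() only informs the type checker; it performs no runtime conversion.
    return cast("str", cached)
```

Because the length check now runs before the unpack, the validation branch is reachable and the error message is the intended one.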

Drafted-by: Claude Code (Opus 4.8); reviewed by @potiuk before posting
