
Enhance ResumableJobMixin.get_job_status with context for better job status tracking#68009

Merged
amoghrajesh merged 2 commits into apache:main from astronomer:aip-103-add-ctx-to-get-job-status
Jun 4, 2026
Conversation

@amoghrajesh
Contributor

@amoghrajesh amoghrajesh commented Jun 4, 2026


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

What?

ResumableJobMixin.get_job_status had no access to context, making it impossible to handle backends where the remote resource carrying job state can disappear after completion. I ran into this when running Spark on Kubernetes: driver pods are garbage collected after a run, and without context an implementation cannot distinguish "SUCCEEDED but pod gone" from "FAILED but pod gone". The only workaround was deleteOnTermination=false, which accumulates pods indefinitely.

The same gap will also appear, in another form, when a job is tracked by external IDs (e.g. EMR (cluster_id, step_id), Glue (job_name, run_id)).

Current behaviour

get_job_status(self, external_id) receives no context, so it has no access to task_store and no way to cache a terminal status before the remote resource disappears.

Proposed change

Add context: Context as a second parameter to get_job_status throughout the interface.
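
To illustrate why the extra parameter helps, here is a minimal sketch of an implementer that caches a terminal status before the pod disappears. It is not the code from this PR: `InMemoryTaskStore`, `PodBackedJob`, `read_pod_phase`, the `"task_store"` context key, the cache key format, and the `"UNKNOWN"` fallback are all hypothetical names chosen for the example, and the real `Context` and task store interfaces may look different.

```python
from __future__ import annotations

from typing import Any, Callable

TERMINAL = {"SUCCEEDED", "FAILED"}


class InMemoryTaskStore:
    """Hypothetical stand-in for a task_store reachable through context."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class PodBackedJob:
    """Illustrative implementer; not the real Spark or Kubernetes operator."""

    def __init__(self, read_pod_phase: Callable[[str], str | None]) -> None:
        # read_pod_phase returns None once the driver pod has been garbage collected.
        self._read_pod_phase = read_pod_phase

    def get_job_status(self, external_id: str, context: dict[str, Any]) -> str:
        store = context["task_store"]  # assumed key, for illustration only
        key = f"job_status:{external_id}"

        # A previously cached terminal status wins, even if the pod is gone.
        cached = store.get(key)
        if cached is not None:
            return cached

        phase = self._read_pod_phase(external_id)
        if phase is None:
            # Pod gone and nothing cached: the outcome cannot be recovered.
            return "UNKNOWN"
        if phase in TERMINAL:
            store.set(key, phase)
        return phase


# Usage: the status survives deletion of the pod that carried it.
pods = {"driver-1": "RUNNING"}
job = PodBackedJob(pods.get)
ctx = {"task_store": InMemoryTaskStore()}

first = job.get_job_status("driver-1", ctx)   # "RUNNING", nothing cached yet
pods["driver-1"] = "SUCCEEDED"
second = job.get_job_status("driver-1", ctx)  # "SUCCEEDED", now cached
del pods["driver-1"]                          # simulate garbage collection
third = job.get_job_status("driver-1", ctx)   # "SUCCEEDED", served from the cache
```

Without `context`, the third call would have nothing to fall back on, which is the "SUCCEEDED but pod gone" versus "FAILED but pod gone" ambiguity described above.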



@uranusjr
Member

uranusjr commented Jun 4, 2026

I think this needs a backcompat shim so an old spark provider can work with new task-sdk and vice versa.
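
For reference, one possible shape for such a shim on the caller side is sketched below. It is not taken from this PR or from the Task SDK: `call_get_job_status` is a hypothetical helper that inspects the implementer's signature and passes `context` only when the method can accept it.

```python
import inspect
from typing import Any


def call_get_job_status(job: Any, external_id: str, context: Any) -> Any:
    """Call job.get_job_status, passing context only if the method accepts it."""
    # inspect.signature on a bound method already excludes `self`.
    params = inspect.signature(job.get_job_status).parameters
    accepts_context = "context" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_context:
        return job.get_job_status(external_id, context=context)
    # Older implementation: fall back to the original one-argument call.
    return job.get_job_status(external_id)
```

This only covers a newer caller talking to an older implementer. For the opposite direction, an implementer would presumably need to give `context` a default value so that an older caller can still invoke it with `external_id` alone.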

@amoghrajesh
Contributor Author

amoghrajesh commented Jun 4, 2026

I don't think there has been a provider release yet since the previous iteration of the Task SDK changes (added in #67118)?

@ashb
Member

ashb commented Jun 4, 2026

Provider release is happening today I think. Speak to @potiuk. We could possibly exclude Spark from that release wave?

Member

@ashb ashb left a comment

The change looks okay, though the Spark operator doesn't seem to use it yet. Follow-on PR to make it use this?

@amoghrajesh
Contributor Author

@ashb Spark isn't going to use it, and not every operator needs to. This is mainly for the K8s use case I mentioned in the PR description above.

@amoghrajesh
Contributor Author

Error seems to be unrelated:

  ERROR: all download attempts failed for https://public.dhe.ibm.com/ibmdl/export/pub/software/websphere/messaging/mqdev/redist/9.4.0.0-IBM-MQC-Redist-LinuxX64.tar.gz; last error: <urlopen error [Errno 110] Connection timed out>
  Downloading https://public.dhe.ibm.com/ibmdl/export/pub/software/websphere/messaging/mqdev/redist/9.4.0.0-IBM-MQC-Redist-LinuxX64.tar.gz
  Primary download failed (URLError: <urlopen error [Errno 110] Connection timed out>); trying 4 fallback IP(s)
    Retrying with public.dhe.ibm.com -> 170.225.126.18
    170.225.126.18 failed: URLError: <urlopen error [Errno 110] Connection timed out>
    Retrying with public.dhe.ibm.com -> 129.35.224.1
    129.35.224.1 failed: URLError: <urlopen error [Errno 110] Connection timed out>
    Retrying with public.dhe.ibm.com -> 129.124.168.5
    129.124.168.5 failed: URLError: <urlopen error [Errno 110] Connection timed out>
    Retrying with public.dhe.ibm.com -> 9.133.44.11
    9.133.44.11 failed: URLError: <urlopen error [Errno 110] Connection timed out>
  Pre-extras install failed for ibm.mq

Merging this one in

@amoghrajesh amoghrajesh merged commit 6dbe76a into apache:main Jun 4, 2026
109 of 110 checks passed
@amoghrajesh amoghrajesh deleted the aip-103-add-ctx-to-get-job-status branch June 4, 2026 12:21
@github-project-automation github-project-automation Bot moved this from In progress to Done in AIP-103: Task State Management Jun 4, 2026