
Fix batchoperator deferrable xcom links #64745

Draft
kakatur wants to merge 3 commits into apache:main from kakatur:fix-batchoperator-deferrable-xcom-links

Conversation

kakatur commented Apr 5, 2026


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

kakatur added 3 commits April 4, 2026 13:46

When BatchOperator runs with deferrable=True, operator links
(job definition, job queue, CloudWatch logs) were not being
persisted as XCom values, causing ERROR-level logs when the UI
tried to render them.

This change:
- Extracts link persistence logic into _persist_links() method
- Calls _persist_links() from execute_complete() for deferrable tasks
- Refactors monitor_job() to use the same method (DRY principle)

Fixes XCom not found errors for:
- batch_job_definition
- batch_job_queue
- cloudwatch_events

Added _persist_links(context) call in execute() method before deferring.
This ensures operator links (batch_job_definition, batch_job_queue,
cloudwatch_events) are persisted as XCom values immediately after job
submission, making them available in the UI while the task is deferred.

- Persist placeholder None value for cloudwatch_events XCom when logs
  aren't available at submission time
- Change log level from WARNING to INFO for missing CloudWatch logs
- Prevents 'XCom not found' warning in UI while task is deferred
- CloudWatch link will be populated when job completes and logs exist


boring-cyborg bot commented Apr 5, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Copilot AI left a comment

Pull request overview

This PR aims to ensure the AWS Batch operator’s extra UI links (job definition/queue/logs) are available for deferrable runs by persisting the related XComs before/around deferral and completion.

Changes:

  • Persist operator extra-link XComs before deferring in execute() and again in execute_complete().
  • Refactor link persistence logic into a new _persist_links() helper.
  • Adjust CloudWatch log-link persistence and logging behavior.

Comment on lines +225 to 230
# Persist operator links before deferring so they're available in the UI
self._persist_links(context)

job = self.hook.get_job_description(self.job_id)
job_status = job.get("status")
if job_status == self.hook.SUCCESS_STATE:

Copilot AI Apr 6, 2026

In the deferrable path this adds multiple describe_jobs calls in quick succession: _persist_links() calls get_job_description() and then get_job_all_awslogs_info() (which calls get_job_description() again), and execute() then calls get_job_description() again for job_status. This increases AWS API usage and can hit Batch API throttling. Consider refactoring to reuse a single job description (e.g., have _persist_links accept/return job_desc) and/or avoid fetching CloudWatch log info before deferring (it is often unavailable immediately).
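A minimal sketch of the "fetch once, reuse" refactor described above, not the provider's actual code. `FakeBatchHook`, `persist_links`, and `execute_deferrable` are hypothetical stand-ins, the XCom values are simplified to raw ARNs, and a plain dict plays the role of XCom storage. The only point illustrated is that link persistence and the status check can share one job description.

```python
from typing import Any


class FakeBatchHook:
    """Hypothetical stand-in for the Batch hook; counts describe calls."""

    def __init__(self) -> None:
        self.describe_calls = 0

    def get_job_description(self, job_id: str) -> dict[str, Any]:
        self.describe_calls += 1
        return {
            "jobId": job_id,
            "status": "RUNNABLE",
            "jobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/demo:1",
            "jobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/demo",
        }


def persist_links(job_desc: dict[str, Any], xcom: dict[str, Any]) -> None:
    """Record link data from an already-fetched description (no API call here)."""
    xcom["batch_job_definition"] = job_desc["jobDefinition"]
    xcom["batch_job_queue"] = job_desc["jobQueue"]


def execute_deferrable(hook: FakeBatchHook, job_id: str, xcom: dict[str, Any]) -> str:
    """Describe the job once, then reuse the result for links and status."""
    job_desc = hook.get_job_description(job_id)
    persist_links(job_desc, xcom)
    return job_desc["status"]


hook = FakeBatchHook()
xcom: dict[str, Any] = {}
status = execute_deferrable(hook, "job-123", xcom)
```

In this sketch the CloudWatch lookup is left out of the pre-deferral path entirely, which matches the second option mentioned in the comment.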

Comment on lines +263 to +264
# Persist operator links for UI
self._persist_links(context)

Copilot AI Apr 6, 2026

execute_complete() now unconditionally calls _persist_links(context), which performs AWS API calls and assumes the context contains a task instance. In this repo's unit tests execute_complete is invoked with an empty dict (context={}), and _persist_links can raise (e.g., when it hits the CloudWatch placeholder context["task_instance"].xcom_push). Either guard _persist_links when required context keys are missing, or update the tests/mocks to pass a real TI context and stub the hook calls.

Suggested change
- # Persist operator links for UI
- self._persist_links(context)
+ # Persist operator links for UI when task instance context is available.
+ if context and context.get("task_instance") is not None:
+     self._persist_links(context)
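For the other option mentioned (updating the tests rather than only guarding), a rough sketch of how both paths could be exercised. `DemoOperator` is a hypothetical stand-in, not the real BatchOperator, and the pushed key and value are placeholders; only `unittest.mock.MagicMock` is a real API here.

```python
from unittest.mock import MagicMock


class DemoOperator:
    """Hypothetical operator showing only the guard around link persistence."""

    def __init__(self) -> None:
        self.persist_calls = 0

    def _persist_links(self, context) -> None:
        self.persist_calls += 1
        context["task_instance"].xcom_push(key="batch_job_queue", value="demo-queue")

    def execute_complete(self, context, event=None):
        # Skip link persistence when no task instance is available.
        if context and context.get("task_instance") is not None:
            self._persist_links(context)
        return event["job_id"] if event else None


# Path 1: empty context, as the comment says the existing unit tests use.
op_empty = DemoOperator()
result_empty = op_empty.execute_complete({}, {"job_id": "job-123"})

# Path 2: a mocked task instance, so the XCom push can be asserted on.
ti = MagicMock()
op_with_ti = DemoOperator()
result_with_ti = op_with_ti.execute_complete({"task_instance": ti}, {"job_id": "job-123"})
```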

Comment on lines +398 to +404
else:
    # Persist placeholder to prevent "XCom not found" warnings
    # CloudWatch logs will be updated when job completes
    context["task_instance"].xcom_push(
        key="cloudwatch_events",
        value=None,
    )

Copilot AI Apr 6, 2026

This manually pushes an XCom placeholder even when do_xcom_push=False, uses a hard-coded key string, and directly indexes context["task_instance"] (KeyError if not present). It also isn’t needed for BaseAwsLink.get_link() since missing/falsey XCom values already render as no link. Prefer removing the placeholder entirely, or at least respect operator.do_xcom_push and use CloudWatchEventsLink.key plus the same context["ti"] convention used by BaseAwsLink.persist().

Suggested change
- else:
-     # Persist placeholder to prevent "XCom not found" warnings
-     # CloudWatch logs will be updated when job completes
-     context["task_instance"].xcom_push(
-         key="cloudwatch_events",
-         value=None,
-     )
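If some form of push is kept instead of removing the branch, the "at least" alternative from the comment might look roughly like this. `DemoEventsLink` and `persist_log_link` are hypothetical names; the sketch only assumes, per the comment, that the link class exposes a `key` attribute and that the task instance is reachable as `context["ti"]`.

```python
from unittest.mock import MagicMock


class DemoEventsLink:
    """Hypothetical link class; only the `key` attribute matters here."""

    key = "cloudwatch_events"


def persist_log_link(context, do_xcom_push: bool, log_info) -> bool:
    """Push log-link data only when XCom pushing is enabled and data exists."""
    if not do_xcom_push or not log_info:
        return False  # no placeholder: a missing value simply renders no link
    context["ti"].xcom_push(key=DemoEventsLink.key, value=log_info)
    return True


ti_disabled = MagicMock()
pushed_disabled = persist_log_link({"ti": ti_disabled}, False, {"stream": "demo"})

ti_no_logs = MagicMock()
pushed_no_logs = persist_log_link({"ti": ti_no_logs}, True, None)

ti_enabled = MagicMock()
pushed_enabled = persist_log_link({"ti": ti_enabled}, True, {"stream": "demo"})
```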

Comment on lines +416 to +441
# Persist operator links
self._persist_links(context)

if self.awslogs_enabled:
    if self.waiters:
        self.waiters.wait_for_job(self.job_id, get_batch_log_fetcher=self._get_batch_log_fetcher)
    else:
        self.hook.wait_for_job(self.job_id, get_batch_log_fetcher=self._get_batch_log_fetcher)
else:
    if self.waiters:
        self.waiters.wait_for_job(self.job_id)
    else:
        self.hook.wait_for_job(self.job_id)

# Log all CloudWatch log stream links for user reference
try:
    awslogs = self.hook.get_job_all_awslogs_info(self.job_id)
    if awslogs:
        self.log.info(
            "AWS Batch job (%s) CloudWatch Events details found. Links to logs:", self.job_id
        )
        link_builder = CloudWatchEventsLink()
        for log in awslogs:
            self.log.info(link_builder.format_link(**log))
except AirflowException as ae:
    self.log.warning("Cannot determine where to find the AWS logs for this Batch job: %s", ae)

Copilot AI Apr 6, 2026

monitor_job() calls _persist_links() (which already calls get_job_all_awslogs_info()), then after waiting it calls get_job_all_awslogs_info() again just to log stream links. This adds extra describe_jobs calls and will break existing expectations in unit tests that get_job_description is called exactly twice during execute() (see test_execute_without_failures). Consider fetching/logging/persisting CloudWatch log info only once (e.g., do it after waiting, like the previous implementation, or have _persist_links optionally skip CloudWatch work when called pre-wait).
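One possible shape for the "only once, after waiting" option, as a sketch under assumptions. `CountingHook` and this `monitor_job` are hypothetical stand-ins, the dict keys in the fake log info are illustrative rather than the hook's real return shape, and a list collects what would otherwise be log lines.

```python
from typing import Any


class CountingHook:
    """Hypothetical stand-in for the Batch hook; counts log-info lookups."""

    def __init__(self) -> None:
        self.log_info_calls = 0

    def wait_for_job(self, job_id: str) -> None:
        """Pretend to block until the job finishes."""

    def get_job_all_awslogs_info(self, job_id: str) -> list[dict[str, str]]:
        self.log_info_calls += 1
        return [{"awslogs_group": "/aws/batch/job", "awslogs_stream_name": "demo/default/abc"}]


def monitor_job(hook: CountingHook, job_id: str, xcom: dict[str, Any], log_lines: list[str]) -> None:
    # Static links need no CloudWatch lookup, so they can be persisted pre-wait.
    xcom["batch_job_queue"] = "demo-queue"
    hook.wait_for_job(job_id)
    # Single lookup after waiting, reused for both persisting and logging.
    awslogs = hook.get_job_all_awslogs_info(job_id)
    if awslogs:
        xcom["cloudwatch_events"] = awslogs[0]
        for entry in awslogs:
            log_lines.append(f"{entry['awslogs_group']}:{entry['awslogs_stream_name']}")


hook = CountingHook()
xcom: dict[str, Any] = {}
log_lines: list[str] = []
monitor_job(hook, "job-123", xcom, log_lines)
```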

Comment on lines +383 to +384
# CloudWatch logs may not be available immediately after job submission
self.log.info("CloudWatch logs not yet available for Batch job: %s", ae)

Copilot AI Apr 6, 2026

The except AirflowException branch assumes the failure to fetch log info is transient (“not yet available”) and logs at INFO. get_job_all_awslogs_info() can raise AirflowException for non-transient cases too (e.g. unsupported job type), so this message/level is misleading and can hide real configuration issues. Consider logging a more neutral message (or warning) and differentiating transient “stream not created yet” (which returns []) from actual exceptions.

Suggested change
- # CloudWatch logs may not be available immediately after job submission
- self.log.info("CloudWatch logs not yet available for Batch job: %s", ae)
+ self.log.warning(
+     "Unable to retrieve CloudWatch log information for AWS Batch job (%s): %s",
+     self.job_id,
+     ae,
+ )
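To illustrate the distinction between the two cases, a small sketch. `LogLookupError` stands in for AirflowException, and `lookup_log_info` plus the three fetch functions are hypothetical. An empty list is treated as "stream not created yet" and logged at INFO, while an exception is logged at WARNING.

```python
import logging


class LogLookupError(Exception):
    """Stand-in for AirflowException in this sketch."""


def lookup_log_info(fetch, job_id: str, log: logging.Logger):
    """Return log info, [] when not yet available, or None on a real failure."""
    try:
        awslogs = fetch(job_id)
    except LogLookupError as err:
        log.warning(
            "Unable to retrieve CloudWatch log information for AWS Batch job (%s): %s",
            job_id,
            err,
        )
        return None
    if not awslogs:
        log.info("CloudWatch log stream not created yet for AWS Batch job (%s)", job_id)
    return awslogs


def fetch_ok(job_id):
    return [{"awslogs_stream_name": "demo/default/abc"}]


def fetch_empty(job_id):
    return []


def fetch_unsupported(job_id):
    raise LogLookupError("unsupported job type")


demo_log = logging.getLogger("batch_link_demo")
found = lookup_log_info(fetch_ok, "job-123", demo_log)
pending = lookup_log_info(fetch_empty, "job-123", demo_log)
failed = lookup_log_info(fetch_unsupported, "job-123", demo_log)
```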

eladkal requested review from ferruzzi and vincbeck April 7, 2026 05:43
o-nikolas marked this pull request as draft April 8, 2026 00:58
@o-nikolas

Converting this to draft since there is unaddressed Copilot feedback and failing tests. Change this to ready for review once all feedback is addressed and the tests are green 🙂

Labels

area:providers provider:amazon AWS/Amazon - related issues

3 participants