Resolve Beam Dataflow job id by name after launcher returns #67711

Open
evgeniy-b wants to merge 1 commit into apache:main from evgeniy-b:fix-beam-dataflow-job-id-resolve-by-name

@evgeniy-b
Contributor

Resolve dataflow_job_id on BeamRun{Python,Java,Go}PipelineOperator by looking it up via the Dataflow API after the Beam launcher subprocess returns, instead of relying on the Beam SDK stdout regex (JOB_ID_PATTERN). When the expected line is missing or formatted differently, the regex silently leaves the id as None, which breaks deferred polling, on_kill, and downstream XCom consumers.

Adds DataflowHook.fetch_job_id_by_name alongside the existing name-based lookups (is_job_dataflow_running, cancel_job, get_job): lists active jobs whose name starts with the configured dataflow_job_name and returns the id when exactly one match is found. Lookup failures are logged and swallowed.
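
For illustration, the "exactly one match" rule described above can be sketched over plain job dicts. This is a minimal sketch and not the code in this PR: `resolve_job_id_by_name` and `ACTIVE_STATES` are hypothetical names, the jobs are assumed to be dicts shaped like Dataflow API responses, and the set of states treated as active is an assumption.

```python
from __future__ import annotations

# Assumption: which Dataflow states count as "active"; the PR may use a different set.
ACTIVE_STATES = {"JOB_STATE_PENDING", "JOB_STATE_QUEUED", "JOB_STATE_RUNNING"}


def resolve_job_id_by_name(jobs: list[dict], job_name: str) -> str | None:
    """Return the id of the single active job whose name starts with job_name, else None."""
    prefix = job_name.lower()
    matches = [
        job
        for job in jobs
        if job["name"].startswith(prefix) and job.get("currentState") in ACTIVE_STATES
    ]
    # Zero matches or several matches are both treated as "cannot resolve".
    if len(matches) != 1:
        return None
    return matches[0]["id"]


jobs = [
    {"id": "id-1", "name": "wordcount-a1b2c3d4", "currentState": "JOB_STATE_RUNNING"},
    {"id": "id-2", "name": "other-job", "currentState": "JOB_STATE_RUNNING"},
]
second_wordcount = {"id": "id-3", "name": "wordcount-ffff0000", "currentState": "JOB_STATE_PENDING"}

unique_id = resolve_job_id_by_name(jobs, "wordcount")
ambiguous_id = resolve_job_id_by_name(jobs + [second_wordcount], "wordcount")
```

With a single active match the id is returned; as soon as a second active job shares the prefix, the lookup gives up and returns None rather than guessing.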


Was generative AI tooling used to co-author this PR?
  • Yes (Claude Code)

Generated-by: Claude Code following the guidelines



@evgeniy-b evgeniy-b requested a review from shahar1 as a code owner May 29, 2026 12:01
@boring-cyborg boring-cyborg Bot added area:providers provider:apache-beam provider:google Google (including GCP) related issues labels May 29, 2026
`process_line_and_extract_dataflow_job_id_callback` in
`airflow.providers.google.cloud.hooks.dataflow` extracts the Dataflow
job id from the Beam SDK's stdout via `JOB_ID_PATTERN`. When the line is
missing or formatted differently, `dataflow_job_id` stays `None` and any
downstream call that requires it (deferred polling, on_kill, xcom
consumers) fails.

Drop the stdout scrape from
`BeamRunPythonPipelineOperator.execute_on_dataflow`,
`BeamRunJavaPipelineOperator.execute_on_dataflow`, and
`BeamRunGoPipelineOperator.execute_on_dataflow`, and look the job id up
once via the Dataflow API after the Beam launcher subprocess returns.
Add `DataflowHook.fetch_job_id_by_name` alongside the other name-based
lookups (`is_job_dataflow_running`, `cancel_job`, `get_job`): list
active jobs whose name starts with the configured `dataflow_job_name`
and return the id when exactly one match is found. Lookup failures are
logged and swallowed.
@evgeniy-b evgeniy-b force-pushed the fix-beam-dataflow-job-id-resolve-by-name branch from 18b7035 to ebf385d Compare May 29, 2026 12:40
Contributor

@MaksYermak MaksYermak left a comment

@evgeniy-b could you run the system tests for Dataflow and provide screenshots from the Airflow UI showing that they passed successfully?

Comment on lines +1148 to +1149
    if len(jobs) != 1:
        return None
Contributor

@MaksYermak MaksYermak May 29, 2026

@evgeniy-b as I understand it, when users run two or more jobs with the same name in parallel, or when a job with this name already exists in Dataflow, this code returns None as the job ID. Please correct me if I am wrong.

With the current callback-based logic, the code parses the Apache Beam logs for the job ID and, once it has it, starts waiting in deferrable or non-deferrable mode. This means we always have a unique job ID.

The new logic looks like a breaking change to me, because it returns None as the job ID when Dataflow has two or more jobs with the same name. That is a likely scenario for most of our users: Dataflow does not allow finished jobs to be removed, only archived, and our _fetch_all_jobs method does not filter jobs by state (finished or running), so it returns every job with the same name.

Contributor Author

@evgeniy-b evgeniy-b May 29, 2026

Let me explain how I arrived here. On an Airflow cluster I maintain, I noticed Python Beam jobs running with deferrable=False, so I switched that flag to True to avoid wasting worker resources. The next day the jobs failed while transitioning to async triggers because their stdout didn't contain the job ID. In sync mode, a missing job ID doesn't prevent the task from succeeding:

_DataflowJobsController.wait_for_done polls self._refresh_jobs():

def wait_for_done(self) -> None:
    """Wait for result of submitted job."""
    self.log.info("Start waiting for done.")
    self._refresh_jobs()
    while self._jobs and not all(
        self.job_reached_terminal_state(job, self._wait_until_finished, self._expected_terminal_state)
        for job in self._jobs
    ):
        self.log.info("Waiting for done. Sleep %s s", self._poll_sleep)
        time.sleep(self._poll_sleep)
        self._refresh_jobs()

_refresh_jobs calls self._get_current_jobs():

def _refresh_jobs(self) -> None:
    """
    Get all jobs by name.
    :return: jobs
    """
    self._jobs = self._get_current_jobs()

_get_current_jobs, when no _job_id is set, calls self._fetch_jobs_by_prefix_name(self._job_name.lower()):

def _get_current_jobs(self) -> list[dict]:
    """
    Get list of jobs that start with job name or id.
    :return: list of jobs including id's
    """
    if not self._multiple_jobs and self._job_id:
        return [self.fetch_job_by_id(self._job_id)]
    if self._jobs:
        return [self.fetch_job_by_id(job["id"]) for job in self._jobs]
    if self._job_name:
        jobs = self._fetch_jobs_by_prefix_name(self._job_name.lower())

_fetch_jobs_by_prefix_name calls self._fetch_all_jobs() and returns every prefix-matched job (archived and running, with no terminal-state filter):

def _fetch_jobs_by_prefix_name(self, prefix_name: str) -> list[dict]:
    jobs = self._fetch_all_jobs()
    jobs = [job for job in jobs if job["name"].startswith(prefix_name)]
    return jobs

So today's sync path already silently picks up every prefix-matched job whenever the regex misses.
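
The effect can be reproduced with a stripped-down model of that lookup. This is a rough sketch under assumptions: jobs are plain dicts shaped like Dataflow API responses, and `current_jobs` is a hypothetical simplification of `_get_current_jobs`, not the hook's actual code.

```python
from __future__ import annotations


def current_jobs(all_jobs: list[dict], job_name: str, job_id: str | None = None) -> list[dict]:
    """Hypothetical simplification: track one job by id, else every prefix-matched job."""
    if job_id:
        return [job for job in all_jobs if job["id"] == job_id]
    prefix = job_name.lower()
    # No state filter here: finished jobs sharing the prefix are tracked too.
    return [job for job in all_jobs if job["name"].startswith(prefix)]


all_jobs = [
    {"id": "old-run", "name": "wordcount", "currentState": "JOB_STATE_DONE"},
    {"id": "new-run", "name": "wordcount", "currentState": "JOB_STATE_RUNNING"},
]

# With an id, only the submitted job is tracked.
tracked_with_id = current_jobs(all_jobs, "wordcount", job_id="new-run")
# Without an id, the earlier finished run with the same name is tracked as well.
tracked_without_id = current_jobs(all_jobs, "WordCount")
```

In this model the task still finishes once every tracked job is terminal, which is why the missing id goes unnoticed in sync mode.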

With the default append_job_name=True, the job name is unique, so the job ID will be retrieved.
But you are right, it is a regression: for jobs that do not have unique names but do print their IDs to the console, the job ID will now be missing.

I guess an alternative would be to replicate the sync mode's behavior in the async path, which currently fails without a job_id. However, that means XCom and the link to the job would stay broken.

Contributor Author

I think the stdout-based job ID detection should be restored. While it is awkward in principle, it is the only way (?) to reliably get the ID when job names are not unique. Name-based ID detection can then be used as a fallback, but only when append_job_name=True. And if the trigger receives an empty job ID, it should fall back to polling the status of all jobs that match the name and are not in a terminal state.
@MaksYermak what's your take on this?
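
To make the proposed order concrete, here is a rough sketch of the three steps. Everything in it is an assumption for discussion: `resolve_job_id`, `jobs_to_poll`, `STDOUT_JOB_ID` and `TERMINAL_STATES` are hypothetical names, the regex is a simplified stand-in rather than the provider's JOB_ID_PATTERN, and the terminal-state set is abbreviated.

```python
from __future__ import annotations

import re

# Simplified illustrative pattern; NOT the provider's actual JOB_ID_PATTERN.
STDOUT_JOB_ID = re.compile(r"Submitted job: (?P<job_id>\S+)")
# Assumption: abbreviated set of terminal Dataflow states.
TERMINAL_STATES = {"JOB_STATE_DONE", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def _live_matches(jobs: list[dict], job_name: str) -> list[dict]:
    """Jobs whose name starts with job_name and that are not in a terminal state."""
    prefix = job_name.lower()
    return [
        job
        for job in jobs
        if job["name"].startswith(prefix) and job.get("currentState") not in TERMINAL_STATES
    ]


def resolve_job_id(
    stdout_lines: list[str], jobs: list[dict], job_name: str, append_job_name: bool
) -> str | None:
    # Step 1: prefer the id printed by the launcher.
    for line in stdout_lines:
        match = STDOUT_JOB_ID.search(line)
        if match:
            return match.group("job_id")
    # Step 2: name-based fallback, only when names are expected to be unique.
    if not append_job_name:
        return None
    matches = _live_matches(jobs, job_name)
    return matches[0]["id"] if len(matches) == 1 else None


def jobs_to_poll(jobs: list[dict], job_name: str, job_id: str | None) -> list[dict]:
    # Step 3: with no id, the trigger polls every live job matching the name.
    if job_id:
        return [job for job in jobs if job["id"] == job_id]
    return _live_matches(jobs, job_name)


jobs = [
    {"id": "id-1", "name": "wordcount-aaaa", "currentState": "JOB_STATE_DONE"},
    {"id": "id-2", "name": "wordcount-bbbb", "currentState": "JOB_STATE_RUNNING"},
]
from_stdout = resolve_job_id(["Submitted job: 2026-05-29_abc"], jobs, "wordcount", False)
from_name = resolve_job_id([], jobs, "wordcount", True)
unresolved = resolve_job_id([], jobs, "wordcount", False)
polled = jobs_to_poll(jobs, "wordcount", unresolved)
```

The stdout id wins whenever it is present, so jobs with non-unique names that print their id keep working; the name lookup only fills the gap when the name is expected to be unique.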

@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label Jun 1, 2026