
Lock in Databricks workflow depends_on parent-key behavior (#47614) #66681

Open
Vamsi-klu wants to merge 3 commits into apache:main from Vamsi-klu:fix-issue-47614-databricks-depends-on

Conversation

Contributor

@Vamsi-klu Vamsi-klu commented May 11, 2026

The runtime fix for issue #47614 shipped in #48492 (commit 9dcce2f).
Closes: #47614
The GitHub issue stayed open because:

  1. The existing unit test (test_convert_to_databricks_workflow_task) passed
    relevant_upstreams = [MagicMock(task_id="upstream_task")] — a list of
    mock objects, not strings — so the task_id in relevant_upstreams filter
    was silently always-False and the depends_on branch was never exercised.
    The fix could regress without anyone noticing.
  2. A few small follow-on quality issues in the same area were worth tightening.
  3. The issue was not linked to the merged PR, so it wasn't auto-closed.
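The silent always-False filter from point 1 can be reproduced in isolation. The snippet below is a minimal sketch, not the provider's code: the variable names are illustrative, and only the membership check mirrors the task_id in relevant_upstreams filter described above.

```python
from unittest.mock import MagicMock

# Upstream task ids as the operator sees them: plain strings.
upstream_task_ids = {"upstream_task"}

# What the old unit test passed: a list of mock objects.
mock_upstreams = [MagicMock(task_id="upstream_task")]
# What the task group passes at runtime: a list of task-id strings.
string_upstreams = ["upstream_task"]

# A str never compares equal to a MagicMock, so nothing survives the filter.
deps_with_mocks = [t for t in sorted(upstream_task_ids) if t in mock_upstreams]
# With strings, the membership check matches and the branch is exercised.
deps_with_strings = [t for t in sorted(upstream_task_ids) if t in string_upstreams]

print(deps_with_mocks)
print(deps_with_strings)
```

With the mock list the filter yields nothing, so any assertion on depends_on in that test was checking an empty result.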

This PR is therefore a lock-in / regression / hardening change, not a new fix.

Changes

  • databricks.py: corrected relevant_upstreams annotation from
    list[BaseOperator] to list[str] on both DatabricksTaskBaseOperator
    and DatabricksNotebookOperator overrides.
  • databricks_workflow.py: changed self.relevant_upstreams = [task_id]
    to self.relevant_upstreams: list[str] = []. Behavior is unchanged
    (the launch task's raw, unprefixed task_id was previously filtered out
    only by an accidental prefix mismatch); the new initializer makes the
    intent explicit.
  • test_databricks_workflow.py: new TestWorkflowDependsOn class with 7
    end-to-end tests that build a real DAG + DatabricksWorkflowTaskGroup
    with real DatabricksNotebookOperator tasks and assert on the
    tasks[*]['depends_on'] payload returned by create_workflow_json().
    Covers default keys, custom parent key, >100-char parent key, diamond,
    fan-out, root tasks (launch task is filtered out), and external
    upstream filtering.
  • test_databricks.py: fixed the existing
    test_convert_to_databricks_workflow_task to pass strings instead of
    mocks and to assert the correct parent-keyed depends_on. Added
    test_generate_databricks_task_key_requires_task_dict_when_task_id_passed.
  • changelog.rst: Bug Fixes entry under 7.14.0.
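To illustrate why the databricks_workflow.py initializer change is behavior-neutral, here is a small sketch under the assumption that child tasks see the launch task under a group-prefixed id (the "<group>.launch" shape). The group id and variable names are hypothetical.

```python
group_id = "wf"              # hypothetical task group id
launch_task_id = "launch"    # raw, unprefixed id stored by the old initializer

# Assumption: children reference the launch task by its prefixed id.
child_upstream_task_ids = {f"{group_id}.{launch_task_id}"}

old_relevant_upstreams = [launch_task_id]   # previous initializer
new_relevant_upstreams: list[str] = []      # new initializer

old_matches = [t for t in child_upstream_task_ids if t in old_relevant_upstreams]
new_matches = [t for t in child_upstream_task_ids if t in new_relevant_upstreams]

# Both are empty: the bare "launch" entry could never match "wf.launch".
print(old_matches, new_matches)
```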

closes: #47614


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines

Contributor Author

Vamsi-klu commented May 11, 2026

@eladkal @jscheffl @shahar1 — could one of you take a look when you have a moment? This PR is regression coverage / type-hint cleanup for the depends_on bug whose runtime fix you reviewed in #48492. The actual behavior is unchanged; the goal is to lock it in with end-to-end tests so the same bug can't silently come back, and to close out #47614.

Comment thread providers/databricks/docs/changelog.rst Outdated
Contributor

eladkal commented May 11, 2026

cc @moomindani for review

@eladkal eladkal requested a review from pankajkoti May 13, 2026 04:47
Comment thread providers/databricks/docs/changelog.rst Outdated
Contributor

@moomindani moomindani left a comment


The diff itself looks good to me — the type annotation is the right one, the relevant_upstreams = [task_id] → [] initializer is genuinely dead code being cleaned up (the bare "launch" could never match the child tasks' prefixed "<group>.launch" upstreams, so the original list was always semantically empty), the existing MagicMock-based test fix actually exercises the depends_on branch now, and TestWorkflowDependsOn covers default/custom/oversize key, diamond, fan-out, root-task filtering, and external-upstream filtering. Changelog revert in 025e8a4 is clean.

A couple of things before I'd be comfortable approving:

  1. CI is red. The remaining failure (Compat 3.0.6:P3.10) is gh run download artifact infra, not your code, but we shouldn't merge on red — could you re-trigger it (or push an empty commit) so we get a clean build?

  2. Real-workspace validation. The runtime fix has been live since #48492 in April 2025, so the user-visible behavior change here is small, but this PR does touch production code (relevant_upstreams = [task_id] → []). Have you (or anyone) actually run a DatabricksWorkflowTaskGroup with a parent/child task pair against a Databricks workspace and confirmed the rendered Jobs API JSON has depends_on: [{task_key: <parent_key>}]? The unit tests build real DAG objects but assert on the in-process create_workflow_json() payload, not what arrives at Databricks. An airflow dags test run + a screenshot of the resulting Databricks job graph would close the loop (similar pattern to what I used in #66613).

Once CI is green and there's some evidence the wire payload is right, LGTM.

@Vamsi-klu Vamsi-klu force-pushed the fix-issue-47614-databricks-depends-on branch from 025e8a4 to c231337 Compare May 19, 2026 05:55
Contributor Author

Vamsi-klu commented May 19, 2026

@moomindani thanks for the review. Two things addressed:

(A) CI: the failing provider distributions tests / Compat 3.0.6:P3.10 shard was an artifact-infrastructure failure: gh run download couldn't find the CI-image stash ci-image-save-v3-linux_amd64-3.10-66681_merge. All other compat jobs (2.11.1 / 3.1.8 / 3.2.1), DB tests, mypy, static checks, and integration tests were green. I've rebased onto current main (the branch was 129 commits behind) and pushed with --force-with-lease, which re-runs the full matrix from scratch.

(B) Wire-payload validation: I added a new TestWorkflowDependsOnWirePayload class in providers/databricks/tests/unit/databricks/operators/test_databricks_workflow.py with two tests that drive _CreateDatabricksWorkflowOperator._create_or_reset_job end to end and assert on what the hook receives:

  • test_create_job_payload_carries_parent_depends_on: captures launch_task._hook.create_job.call_args.args[0] (the new-job path) and asserts tasks[task_b]["depends_on"] == [{"task_key": <md5(task_a)>}].
  • test_reset_job_payload_carries_parent_depends_on: captures launch_task._hook.reset_job.call_args.args[1] (the existing-job path, job_id=42) and asserts the same shape.

Both tests build a real DAG + DatabricksWorkflowTaskGroup populated with real DatabricksNotebookOperators (no operator mocks), so the captured job_spec is exactly what would be POSTed to /api/2.1/jobs/create and /api/2.1/jobs/reset. The existing TestWorkflowDependsOn class still covers the in-process create_workflow_json payload; this new class closes the loop at the wire boundary and lives in CI forever.
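For readers unfamiliar with the call_args pattern, the capture step can be sketched with a bare MagicMock standing in for the hook. This is an illustration only: the job spec and task keys are made up, and the real tests reach the hook through the operator rather than calling it directly.

```python
from unittest.mock import MagicMock

# Stand-in for the Databricks hook; no network calls are made.
hook = MagicMock()

# Hypothetical job spec with one parent ("a") and one child ("b").
job_spec = {
    "tasks": [
        {"task_key": "a"},
        {"task_key": "b", "depends_on": [{"task_key": "a"}]},
    ]
}

# New-job path: the spec is the first positional argument.
hook.create_job(job_spec)
# Existing-job path: job_id first, then the spec.
hook.reset_job(42, job_spec)

created_spec = hook.create_job.call_args.args[0]
reset_job_id = hook.reset_job.call_args.args[0]
reset_spec = hook.reset_job.call_args.args[1]

child = next(t for t in created_spec["tasks"] if t["task_key"] == "b")
print(child["depends_on"])
```

Asserting on call_args checks the object handed to the hook, which is the last point under the test's control before serialization.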

I don't have a Databricks workspace handy for the live airflow dags test + job-graph screenshot, but the call-args assertion verifies the same JSON the Jobs API would receive. The runtime fix has also been live since #48492 (April 2025) without regressions reported on the depends_on path. Happy to add the screenshot too if you'd still like it.

Contributor

@moomindani moomindani left a comment


Real-workspace validation

Drove the PR branch end-to-end through airflow dags test against a Databricks workspace with a minimal DatabricksWorkflowTaskGroup (task_a >> task_b, both DatabricksNotebookOperator), then queried the created Databricks job back via 2.2/jobs/list. The Jobs API payload is correctly parent-keyed:

{
  "tasks": [
    {
      "task_key": "d13f300f3f3b28c910b3710198314589"
      // md5("pr66681_realenv__wf.task_a"); no depends_on
    },
    {
      "task_key": "dcf56d6d083da88df804ce506620a7a2",  // md5("pr66681_realenv__wf.task_b")
      "depends_on": [
        {"task_key": "d13f300f3f3b28c910b3710198314589"}  // parent task_a's key, not its own
      ]
    }
  ]
}

Closes the wire-payload loop on top of the unit tests.
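A generic check for the original bug class (a task listing its own key in depends_on) can be run against any tasks array returned by the Jobs API. The helpers below are a hypothetical sketch, not part of the provider. The sample payload reuses the two task keys shown above.

```python
from __future__ import annotations


def self_referencing_tasks(tasks: list[dict]) -> list[str]:
    """Return task_keys whose depends_on contains the task's own key."""
    return [
        t["task_key"]
        for t in tasks
        if any(d["task_key"] == t["task_key"] for d in t.get("depends_on", []))
    ]


def dangling_dependencies(tasks: list[dict]) -> list[tuple[str, str]]:
    """Return (task_key, missing_key) pairs that point outside the job."""
    known = {t["task_key"] for t in tasks}
    return [
        (t["task_key"], d["task_key"])
        for t in tasks
        for d in t.get("depends_on", [])
        if d["task_key"] not in known
    ]


A_KEY = "d13f300f3f3b28c910b3710198314589"
B_KEY = "dcf56d6d083da88df804ce506620a7a2"

good_tasks = [
    {"task_key": A_KEY},
    {"task_key": B_KEY, "depends_on": [{"task_key": A_KEY}]},
]
# Shape of the original bug: the child depends on its own key.
buggy_tasks = [
    {"task_key": A_KEY},
    {"task_key": B_KEY, "depends_on": [{"task_key": B_KEY}]},
]

print(self_referencing_tasks(good_tasks), self_referencing_tasks(buggy_tasks))
```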

Reproduction

export AIRFLOW_CONN_DATABRICKS_DEFAULT='{"conn_type":"databricks","host":"https://<workspace>","password":"<token>"}'
export PR66681_NOTEBOOK_PATH=/Users/<you>/airflow-pr66681-noop  # any existing notebook in the workspace
airflow dags test pr66681_realenv

Then verify the resulting Databricks job via:

curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://<workspace>/api/2.2/jobs/list?name=pr66681_realenv.wf&limit=1&expand_tasks=true" | jq '.jobs[0].settings.tasks'

dags/dag_pr66681_realenv.py:

"""Real-environment validation DAG for PR #66681 (GH-47614)."""
from __future__ import annotations

import hashlib
import os
from datetime import datetime

from airflow.providers.databricks.hooks.databricks import DatabricksHook
from airflow.providers.databricks.operators.databricks import DatabricksNotebookOperator
from airflow.providers.databricks.operators.databricks_workflow import DatabricksWorkflowTaskGroup
from airflow.sdk import DAG, task

NOTEBOOK_PATH = os.environ["PR66681_NOTEBOOK_PATH"]
DAG_ID = "pr66681_realenv"
GROUP_ID = "wf"
CONN_ID = "databricks_default"


def _hook() -> DatabricksHook:
    return DatabricksHook(databricks_conn_id=CONN_ID)


def _expected_task_key(task_id_in_group: str) -> str:
    full_task_id = f"{GROUP_ID}.{task_id_in_group}"
    return hashlib.md5(f"{DAG_ID}__{full_task_id}".encode()).hexdigest()


with DAG(dag_id=DAG_ID, start_date=datetime(2026, 1, 1), schedule=None, catchup=False) as dag:
    with DatabricksWorkflowTaskGroup(
        group_id=GROUP_ID,
        databricks_conn_id=CONN_ID,
        job_clusters=[
            {
                "job_cluster_key": "shared",
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 0,
                    "spark_conf": {"spark.master": "local[*]"},
                    "custom_tags": {"ResourceClass": "SingleNode"},
                },
            }
        ],
    ) as wf:
        task_a = DatabricksNotebookOperator(
            task_id="task_a", notebook_path=NOTEBOOK_PATH, source="WORKSPACE", job_cluster_key="shared"
        )
        task_b = DatabricksNotebookOperator(
            task_id="task_b", notebook_path=NOTEBOOK_PATH, source="WORKSPACE", job_cluster_key="shared"
        )
        task_a >> task_b

    @task(task_id="verify_depends_on", trigger_rule="all_done")
    def verify_depends_on() -> None:
        jobs = _hook()._do_api_call(
            ("GET", "2.2/jobs/list"),
            {"name": f"{DAG_ID}.{GROUP_ID}", "limit": 1, "expand_tasks": True},
        )
        job = jobs["jobs"][0]
        tasks_by_key = {t["task_key"]: t for t in job["settings"]["tasks"]}
        a_key = _expected_task_key("task_a")
        b_key = _expected_task_key("task_b")
        assert tasks_by_key[a_key].get("depends_on", []) == []
        assert tasks_by_key[b_key]["depends_on"] == [{"task_key": a_key}]
        _hook()._do_api_call(("POST", "2.2/jobs/delete"), {"job_id": job["job_id"]})

    wf >> verify_depends_on()

Approving — note I'm not a committer, so this is only a non-binding LGTM; a committer still needs to sign off before merge.

 def _convert_to_databricks_workflow_task(
     self,
-    relevant_upstreams: list[BaseOperator],
+    relevant_upstreams: list[str],
Contributor


Can you clarify why this change is needed?

Contributor Author

@Vamsi-klu Vamsi-klu May 21, 2026


Good catch. This is correcting the annotation to match the runtime value rather than changing behavior.

relevant_upstreams is populated from task.task_id in DatabricksWorkflowTaskGroup.__exit__, so it is a list of task-id strings. _convert_to_databricks_workflow_task() then compares those strings with self.upstream_task_ids:

for task_id in self.upstream_task_ids
if task_id in relevant_upstreams

The old list[BaseOperator] annotation was misleading: passing operators there would make that membership check fail because it compares str task IDs to operator objects. The actual BaseOperator instances are still carried separately in task_dict, which is used when resolving the parent task’s Databricks task key.

So this change is mainly a type-hint cleanup that also makes the tests reflect the real call shape.
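As a rough sketch of that call shape (not the provider's actual implementation): FakeOperator, task_key, and build_depends_on below are hypothetical stand-ins, and the key format md5("<dag_id>__<task_id>") is taken from the validation DAG earlier in this thread.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class FakeOperator:
    """Hypothetical stand-in for an operator; not a BaseOperator."""

    dag_id: str
    task_id: str
    upstream_task_ids: frozenset[str] = frozenset()


def task_key(op: FakeOperator) -> str:
    # Assumed default key format: md5 of "<dag_id>__<task_id>".
    return hashlib.md5(f"{op.dag_id}__{op.task_id}".encode()).hexdigest()


def build_depends_on(
    op: FakeOperator,
    relevant_upstreams: list[str],
    task_dict: dict[str, FakeOperator],
) -> list[dict[str, str]]:
    # Filter by string task ids, then resolve each parent's key via task_dict.
    return [
        {"task_key": task_key(task_dict[task_id])}
        for task_id in sorted(op.upstream_task_ids)
        if task_id in relevant_upstreams
    ]


task_a = FakeOperator("demo_dag", "wf.task_a")
task_b = FakeOperator("demo_dag", "wf.task_b", frozenset({"wf.task_a", "external"}))
task_dict = {t.task_id: t for t in (task_a, task_b)}

# "external" is upstream of task_b but outside the group, so it is dropped.
depends_on = build_depends_on(task_b, ["wf.task_a", "wf.task_b"], task_dict)
print(depends_on)
```

The strings decide which upstreams are in scope, and the operator objects in task_dict are consulted only to compute the parent's key.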

Vamsi-klu added 3 commits May 21, 2026 08:44
The runtime fix for issue apache#47614 shipped in PR apache#48492; this PR adds
end-to-end regression coverage so the bug cannot silently regress, plus
small type-hint and constructor-clarity follow-ups in the same area.

Tests build a real DAG + DatabricksWorkflowTaskGroup with
DatabricksNotebookOperator tasks and assert depends_on payloads for the
default-key, custom-key, >100-char-key, diamond, fan-out, root-task, and
external-upstream cases. Also fixes the existing
test_convert_to_databricks_workflow_task to pass strings (not mocks) so
the depends_on branch is actually exercised, and adds a one-line check
that _generate_databricks_task_key raises when called with a parent
task_id but no task_dict.

closes: apache#47614

Provider changelogs are regenerated from git log by the release
manager and should not be edited by hand.

Existing TestWorkflowDependsOn coverage verified the in-process
create_workflow_json() output. The new TestWorkflowDependsOnWirePayload
class drives _CreateDatabricksWorkflowOperator._create_or_reset_job end
to end and asserts on the spec that DatabricksHook.create_job /
DatabricksHook.reset_job actually receive, i.e. the payload that lands
in /api/2.1/jobs/create and /api/2.1/jobs/reset.

Both branches (no existing job -> create_job, existing job -> reset_job)
are exercised; both assert tasks[child].depends_on == [{task_key:
md5(parent)}] and tasks[parent].depends_on == [].
@Vamsi-klu Vamsi-klu force-pushed the fix-issue-47614-databricks-depends-on branch from c231337 to 684bfee Compare May 21, 2026 15:45


Development

Successfully merging this pull request may close these issues.

Databricks Operator set it owns task_key in depends_on instead of parent task key

3 participants