Databricks: Fail fast for non-serializable retry_args in deferrable operators and triggers #64960

Open
kosiew wants to merge 8 commits into apache:main from kosiew:callable-deserialization-01-64609

Conversation

@kosiew kosiew commented Apr 9, 2026


Summary

Validate databricks_retry_args / retry_args for deferrable Databricks operators, sensors, and triggers to ensure they are serialization-safe before crossing the trigger boundary. Non-serializable values (e.g., Tenacity strategy callables) now raise a clear ValueError at initialization/execution time.


Motivation / Problem

Deferrable Databricks execution forwards retry configuration through the trigger serialization boundary. Non-serializable objects (such as tenacity.wait_incrementing(...) or tenacity.stop_after_attempt(...)) cannot be serialized by Airflow’s serde layer and fail at runtime in the triggerer, making debugging difficult.
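The failure mode can be reproduced with stdlib JSON serialization alone. In the sketch below, a plain callable class stands in for a Tenacity strategy object (so the example runs without tenacity installed); any non-primitive value in the retry-args dict fails the same way at the serialization boundary:

```python
import json

# Stand-in for a non-serializable Tenacity strategy such as
# tenacity.wait_incrementing(1, 1, 3); any callable value in the
# retry-args dict fails serialization the same way.
class WaitStrategy:
    def __call__(self, retry_state):
        return 1.0

serializable_args = {"max_retries": 3, "delay": 5}
unserializable_args = {"wait": WaitStrategy()}

json.dumps(serializable_args)  # succeeds: plain JSON primitives

try:
    json.dumps(unserializable_args)
except TypeError as err:
    print(f"not serializable: {err}")
```

Before this PR, an equivalent failure surfaced only in the triggerer process, far from the DAG code that passed the bad value.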


What this PR does

  • Introduces validate_deferrable_databricks_retry_args utility to assert JSON/serde-serializability of retry args.

  • Invokes validation in:

    • DatabricksExecutionTrigger
    • DatabricksSQLStatementExecutionTrigger
  • Ensures deferrable operator/sensor paths fail fast when invalid retry args are provided.

  • Adds comprehensive unit tests covering:

    • Operators rejecting non-serializable retry args in deferrable mode
    • Sensors rejecting non-serializable retry args in deferrable mode
    • Trigger initialization rejecting invalid retry args

Behavior change

Before:

  • Invalid retry args fail later in triggerer serialization with unclear errors.

After:

  • Immediate ValueError with actionable message:

    • "Use JSON-serializable values, remove callable retry strategies, or disable deferrable mode."

Example of unsupported config

from tenacity import wait_incrementing

DatabricksSubmitRunOperator(
    task_id="example",
    deferrable=True,
    databricks_retry_args={"wait": wait_incrementing(1, 1, 3)},  # ❌ now rejected
)

Example of supported config

DatabricksSubmitRunOperator(
    task_id="example",
    deferrable=True,
    databricks_retry_args={"max_retries": 3, "delay": 5},  # ✅ JSON-serializable
)

Implementation details

  • New module: providers/databricks/utils/retry.py
  • Uses Airflow serde (airflow.sdk.serde.serialize) to validate compatibility
  • Catches AttributeError, RecursionError, TypeError and rethrows as ValueError

Tests

  • Added shared test utilities for invalid retry args

  • Parametrized tests using Tenacity objects:

    • wait_incrementing
    • stop_after_attempt
  • Coverage includes:

    • Operator execution (deferrable)
    • Sensor execution (deferrable)
    • Trigger initialization

Backward compatibility

  • No impact for non-deferrable usage
  • Only affects misconfigured deferrable retry args
  • Valid JSON-compatible retry configurations remain unaffected

Documentation

  • No user-facing docs required (error message is self-explanatory)

Checklist

  • Tests added/updated
  • Backward compatibility considered
  • Clear error messaging

Was generative AI tooling used to co-author this PR?

  • Yes: Codex, GitHub Copilot, ChatGPT

kosiew added 5 commits April 9, 2026 19:06

  • Implement a shared validation guard to reject non-serializable databricks_retry_args before deferrable Databricks tasks cross the trigger boundary. Enforce this check for deferrable operators and the SQL sensor in databricks.py. Add regression tests to cover failure modes for both in test_databricks.py.
  • Move validation logic to retry.py for better cohesion. Enforce validation in both trigger constructors within databricks.py. Add direct trigger regression tests in test_databricks.py and update sensor test setup to maintain deferrable branch coverage.
  • Enhance operators, sensors, and triggers tests to cover two unsupported Tenacity shapes. Tests are now parametrized for {"wait": wait_incrementing(...)} and {"stop": stop_after_attempt(...)} scenarios.
  • Extract shared invalid retry-arg test data and pytest.raises assertion into _retry_test_utils.py. Remove duplicated UNSUPPORTED_RETRY_ARGS definitions from operator, sensor, and trigger test files. Simplify setup in operator and sensor negative tests with local helpers for the running deferrable path. Combine two trigger-construction negative tests into one shared parametrized test in test_databricks.py.
  • Require owner explicitly in retry.py's private helper. Define an UNSUPPORTED_RETRY_ARGS constant in _retry_test_utils.py and update operator, sensor, and trigger tests to parametrize directly from it in test_databricks.py.

boring-cyborg bot commented Apr 9, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it is a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@kosiew kosiew marked this pull request as ready for review April 9, 2026 14:08
@kaxil kaxil requested a review from Copilot April 10, 2026 19:55
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds fail-fast validation for Databricks deferrable retry arguments to ensure they’re serialization-safe before crossing the trigger boundary, improving debuggability when non-serializable Tenacity strategies are passed.

Changes:

  • Introduces a validate_deferrable_databricks_retry_args helper that checks Airflow serde-serializability of retry args.
  • Validates retry_args during Databricks trigger initialization to prevent triggerer-side serialization failures.
  • Adds unit tests (operators/sensors/triggers) and shared test utilities for unsupported retry args.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Summary per file:

  • providers/databricks/src/airflow/providers/databricks/utils/retry.py: Adds serde-based validation helper and standardized error message.
  • providers/databricks/src/airflow/providers/databricks/triggers/databricks.py: Calls validation in trigger constructors to fail fast.
  • providers/databricks/tests/unit/databricks/_retry_test_utils.py: Adds shared invalid retry args and assertion helper for tests.
  • providers/databricks/tests/unit/databricks/triggers/test_databricks.py: Adds parametrized tests ensuring triggers reject non-serializable retry args.
  • providers/databricks/tests/unit/databricks/operators/test_databricks.py: Adds deferrable operator test to reject unsupported retry args early.
  • providers/databricks/tests/unit/databricks/sensors/test_databricks.py: Adds deferrable sensor test to reject unsupported retry args early.

Comment on lines +29 to +35

try:
    serde_serialize(retry_args)
except (AttributeError, RecursionError, TypeError) as err:
    raise ValueError(
        f"{owner} does not support non-serializable databricks_retry_args when deferrable=True. "
        "Use JSON-serializable values, remove callable retry strategies, or disable deferrable mode."
    ) from err

Copilot AI Apr 10, 2026


airflow.sdk.serde.serialize can also fail with ValueError (implementation-dependent), which currently bypasses the intended “clear ValueError” message and may surface a less actionable error. Consider including ValueError in the caught exceptions (or catching a broader serde-specific base exception if available) and re-raising with the standardized message to ensure consistent fail-fast behavior.

Author


Thanks, good catch. I agree the helper should normalize serializer failures into the Databricks-specific message so users get the same fail-fast guidance regardless of which serde exception is raised.

Comment on lines +32 to +35

raise ValueError(
    f"{owner} does not support non-serializable databricks_retry_args when deferrable=True. "
    "Use JSON-serializable values, remove callable retry strategies, or disable deferrable mode."
) from err

Copilot AI Apr 10, 2026


The validation is invoked for trigger constructor argument retry_args, but the error message only references databricks_retry_args, which can be confusing when failures occur in trigger init paths. Consider updating the message to mention both (retry_args / databricks_retry_args) or accepting a param_name argument so the error can accurately name the failing parameter depending on call site.

Author


Agreed. The same validation helper is used from both the operator-facing databricks_retry_args path and the trigger-facing retry_args path, so the message should name both to avoid confusion.

    caller: str = "DatabricksExecutionTrigger",
) -> None:
    super().__init__()
    validate_deferrable_databricks_retry_args(retry_args, owner=self.__class__.__name__)

Copilot AI Apr 10, 2026


The trigger constructor already accepts a caller argument (likely used to report the originating component), but the new validation uses self.__class__.__name__ for owner. If caller is meant to carry more precise context (e.g., operator vs trigger), consider passing owner=caller to keep error attribution consistent with existing patterns.

Suggested change

- validate_deferrable_databricks_retry_args(retry_args, owner=self.__class__.__name__)
+ validate_deferrable_databricks_retry_args(retry_args, owner=caller)

Author


That makes sense. caller is already part of the trigger API and gives better attribution when the trigger is constructed on behalf of an operator or sensor.

{
    "statement_id": STATEMENT_ID,
    "databricks_conn_id": DEFAULT_CONN_ID,
    "end_time": time.time() + 60,

Copilot AI Apr 10, 2026


Using time.time() in test data makes the case slightly non-deterministic and harder to reason about, especially since end_time is not relevant to the behavior under test (retry args validation). Consider replacing it with a fixed constant (e.g., end_time=1234567890.0) to keep the test fully deterministic.

Suggested change

- "end_time": time.time() + 60,
+ "end_time": 1234567890.0,

Author


Agreed. The timestamp is incidental to retry validation, so a fixed constant is clearer.

Comment on lines +19 to +24
from typing import Any

from airflow.sdk.serde import serialize as serde_serialize


def validate_deferrable_databricks_retry_args(retry_args: dict[Any, Any] | None, *, owner: str) -> None:

Copilot AI Apr 10, 2026


The signature dict[Any, Any] suggests arbitrary key types are acceptable, but the function’s intent is “JSON/serde-serializable” retry configuration, which typically implies string keys (JSON object keys). Consider narrowing the type to Mapping[str, Any] | None (or dict[str, Any] | None) to better document the expected API contract and help callers catch issues earlier via typing.

Suggested change

- from typing import Any
+ from typing import Any, Mapping
  from airflow.sdk.serde import serialize as serde_serialize
- def validate_deferrable_databricks_retry_args(retry_args: dict[Any, Any] | None, *, owner: str) -> None:
+ def validate_deferrable_databricks_retry_args(
+     retry_args: Mapping[str, Any] | None, *, owner: str
+ ) -> None:

Author


Agreed. A mapping type better documents that the helper only reads the retry configuration, and str keys better reflect the public retry-args contract.

kosiew added 3 commits April 11, 2026 12:06

  • Refactor retry.py to catch ValueErrors and clarify retry_args/databricks_retry_args messages. Adjust validation in databricks.py to use owner=caller. Update tests in operators, sensors, and triggers for Databricks. Fix test-helper import to follow repo style.
  • Replace SDK serde import with stdlib JSON serialization in retry.py. Update the validation call to use json.dumps() instead of serde_serialize() to improve simplicity and reduce dependencies.
  • Implement tests for the retry validation function in the Databricks provider. Handle cases for `None` and valid JSON-serializable primitive retry configurations, while ensuring unsupported Tenacity retry arguments are rejected.