Add Databricks query tags for DatabricksSqlOperator and DatabricksCopyIntoOperator #66886

Open
Shaan-alpha wants to merge 4 commits into apache:main from Shaan-alpha:feature/databricks-query-tags

Conversation

@Shaan-alpha

Description

This PR addresses issue #66839 by injecting Airflow context metadata (dag_id, task_id, and run_id) into the session_configuration parameter as query_tags during the execution of DatabricksSqlOperator and DatabricksCopyIntoOperator.

By automatically populating query tags, it significantly enhances observability on the Databricks side, allowing administrators and developers to trace specific queries executed in Databricks SQL directly back to the Airflow task run that spawned them.

Key Changes:

  • Added a _format_query_tags(context) helper to extract and safely escape the metadata strings before formatting them into a comma-separated query tag string.
  • Modified the execute() methods of DatabricksSqlOperator and DatabricksCopyIntoOperator to correctly merge the generated tags into the existing session_configuration (preserving any user-defined custom query tags).
  • Updated existing unit tests and added test_query_tags_injection for both operators in test_databricks_sql.py and test_databricks_copy.py to ensure injection and non-regression of user tags.
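
The merge described in the second bullet can be sketched roughly as follows. This is a minimal illustration, assuming session_configuration is a plain dict and query_tags holds a comma-separated string; `merge_query_tags` and the `timezone` key are hypothetical names chosen for the example, not the helper or configuration used in this PR.

```python
from __future__ import annotations


def merge_query_tags(
    session_config: dict[str, str] | None, airflow_tags: str
) -> dict[str, str]:
    """Return a copy of session_config with airflow_tags appended to query_tags.

    Hypothetical helper for illustration; the PR's actual merge logic may differ.
    """
    merged = dict(session_config or {})  # copy so the caller's dict is not mutated
    existing = merged.get("query_tags", "")
    # Keep user-defined tags first, then append the Airflow-generated ones.
    parts = [part for part in (existing, airflow_tags) if part]
    if parts:
        merged["query_tags"] = ",".join(parts)
    return merged


# "timezone" stands in for any unrelated key that must survive the merge.
user_config = {"query_tags": "team:data", "timezone": "UTC"}
print(merge_query_tags(user_config, "airflow_dag_id:etl,airflow_task_id:load"))
```

Returning a copy rather than mutating the input keeps the user-supplied dict unchanged, which is one way to make the "preserving user tags" behaviour easy to assert in tests.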

Related Issues

Testing

  • Added unit tests mimicking operator execution and asserting proper manipulation of the hook's session_config parameter.

@boring-cyborg

boring-cyborg Bot commented May 13, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@Shaan-alpha
Author

Refactored query-tag injection into shared helper utilities to reduce duplicated logic and improve maintainability across operators.

Shaan-alpha force-pushed the feature/databricks-query-tags branch 3 times, most recently from aaacdcb to e072ad1 on May 13, 2026 22:43
Contributor

@SameerMesiah97 SameerMesiah97 left a comment

I have left a few comments.

if "run_id" in context and context["run_id"]:
    tags.append(f"airflow_run_id:{_escape_query_tag_value(context['run_id'])}")

return ",".join(tags)
Contributor

I think this could be made clearer and more explicit via a mapping-driven approach. Please see the example below for guidance:

_QUERY_TAG_FIELDS = {
    "airflow_dag_id": ("dag", "dag_id"),
    "airflow_task_id": ("task", "task_id"),
    "airflow_run_id": ("run_id", None),
}


def _format_query_tags(context: Context) -> str:
    tags = []

    for tag_name, (context_key, attr) in _QUERY_TAG_FIELDS.items():
        value = context.get(context_key)

        if attr:
            value = getattr(value, attr, None)

        if value:
            tags.append(f"{tag_name}:{_escape_query_tag_value(value)}")

    return ",".join(tags)

Also, you could do the same for _escape_query_tag_value:

_QUERY_TAG_ESCAPE_SEQUENCES = {
    "\\": "\\\\",
    ",": "\\,",
    ":": "\\:",
}


def _escape_query_tag_value(value: str) -> str:
    """Escape Databricks query-tag separator characters in a tag value."""
    escaped = str(value)

    for char, replacement in _QUERY_TAG_ESCAPE_SEQUENCES.items():
        escaped = escaped.replace(char, replacement)

    return escaped
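
One detail the escape mapping appears to rely on is ordering: Python dicts preserve insertion order, so listing the backslash entry first means backslashes are escaped before any new ones are introduced for commas and colons. The sketch below illustrates this with a hypothetical `escape_in_order` function (not part of the PR) that takes the order explicitly.

```python
def escape_in_order(value: str, order: list[str]) -> str:
    """Prefix each character in `order` with a backslash, one pass per character."""
    for char in order:
        value = value.replace(char, "\\" + char)
    return value


# Backslash first: the comma gains exactly one escaping backslash.
print(escape_in_order("a,b", ["\\", ",", ":"]))  # a\,b

# Backslash last: the backslash added for the comma is escaped again.
print(escape_in_order("a,b", [",", ":", "\\"]))  # a\\,b
```

If the mapping were ever reordered, values containing commas or colons would be double-escaped, so a test that pins the expected output for such values would likely catch the regression.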

Author

Thanks for the suggestion @SameerMesiah97! I agree the mapping-driven approach is cleaner and more maintainable. I've gone ahead and refactored both _format_query_tags and _escape_query_tag_value to use the _QUERY_TAG_FIELDS and _QUERY_TAG_ESCAPE_SEQUENCES mappings exactly as you suggested — this is included in commit 732df03 ("Refactor Databricks query tag helper utilities"). Please take another look and let me know if you'd like any further tweaks.


def execute(self, context: Context) -> Any:
    _inject_query_tags(self.get_db_hook(), context)
    return super().execute(context)
Contributor

Do we want this behavior configurable (operator/provider-level opt-out)? Since this mutates session_configuration automatically, some users may prefer explicit control over injected warehouse metadata.

Author

Good point — making this configurable makes sense so users retain explicit control over session_configuration. I'll add an inject_query_tags: bool = True parameter to both DatabricksSqlOperator and DatabricksCopyIntoOperator (defaulting to True to preserve the observability benefit, while allowing easy opt-out). I'll also document the new parameter in the operator docstrings. Will push the update shortly.
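
A rough sketch of how such an opt-out flag might behave, using stand-in classes rather than the real Airflow operators and hook. `FakeHook` and `SketchSqlOperator` are hypothetical, and escaping is omitted for brevity; only the `inject_query_tags` parameter name and its default come from the comment above.

```python
from __future__ import annotations

from typing import Any


class FakeHook:
    """Stand-in for the Databricks SQL hook; it only holds session_config."""

    def __init__(self, session_config: dict[str, str] | None = None) -> None:
        self.session_config = session_config


class SketchSqlOperator:
    """Hypothetical operator illustrating the proposed inject_query_tags flag."""

    def __init__(self, hook: FakeHook, inject_query_tags: bool = True) -> None:
        self.hook = hook
        self.inject_query_tags = inject_query_tags

    def execute(self, context: dict[str, Any]) -> None:
        if not self.inject_query_tags:
            return  # opt-out: leave session_config exactly as the user set it
        config = dict(self.hook.session_config or {})
        tag = f"airflow_run_id:{context['run_id']}"  # escaping omitted in this sketch
        existing = config.get("query_tags")
        config["query_tags"] = f"{existing},{tag}" if existing else tag
        self.hook.session_config = config


opted_out = FakeHook({"query_tags": "team:data"})
SketchSqlOperator(opted_out, inject_query_tags=False).execute({"run_id": "manual_1"})
print(opted_out.session_config)  # {'query_tags': 'team:data'}

default = FakeHook({"query_tags": "team:data"})
SketchSqlOperator(default).execute({"run_id": "manual_1"})
print(default.session_config)  # {'query_tags': 'team:data,airflow_run_id:manual_1'}
```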

}
assert result.job_facets == {"sql": SQLJobFacet(query=op._sql)}


Contributor

I think this test should be moved above the OpenLineage tests as it is closer to core execution logic. Also, the tests only cover the happy path. Could we add coverage for empty/partial context, empty existing query_tags, and values containing commas/colons/backslashes?

The escaping path is probably the most important one here since _escape_query_tag_value() is not really exercised end-to-end right now. It would also be good to verify that unrelated session_configuration values are preserved after the merge.

Author

Thanks, that ordering makes sense. I'll move test_query_tags_injection above the OpenLineage tests so it sits closer to the core execution logic. I'll also expand coverage to include: empty/partial context (missing dag/task/run_id), empty existing query_tags, values containing commas/colons/backslashes (to exercise _escape_query_tag_value end-to-end), and an assertion that unrelated keys in session_configuration are preserved after the merge. Will include in the next push.
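
The escape-character cases listed above could be laid out as a small table of inputs and expected outputs. The sketch below is only an assumption about how such a test might look: `escape_tag_value` restates the mapping from the review suggestion under a hypothetical name, and the sample values are invented for illustration.

```python
# Backslash comes first, mirroring the order in the review suggestion.
_ESCAPES = {"\\": "\\\\", ",": "\\,", ":": "\\:"}


def escape_tag_value(value: str) -> str:
    """Escape query-tag separator characters (illustrative restatement)."""
    for char, replacement in _ESCAPES.items():
        value = value.replace(char, replacement)
    return value


# (raw value, expected escaped value)
CASES = [
    ("plain_dag", "plain_dag"),  # nothing to escape
    ("a,b", "a\\,b"),  # comma
    ("scheduled__2026-05-13T00:00:00", "scheduled__2026-05-13T00\\:00\\:00"),  # colons
    ("path\\to", "path\\\\to"),  # backslash
    ("", ""),  # empty value
]

for raw, expected in CASES:
    assert escape_tag_value(raw) == expected, (raw, expected)
```

In a pytest suite the same table would fit naturally into a parametrized test, with one case per row.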

assert bucket == "my-bucket"
assert object_name == "path/to/file.parquet"


Contributor

I would move this test after test_exec_write_gcs_parquet_output. Also, the above comment regarding incomplete test coverage applies here too.

Author

Will do — I'll move the new test to sit right after test_exec_write_gcs_parquet_output and mirror the same expanded coverage (empty/partial context, empty existing query_tags, escape-character values, and preservation of unrelated session_configuration keys). Thanks for the review!

Shaan-alpha added a commit to Shaan-alpha/airflow that referenced this pull request May 14, 2026
Address review feedback on apache#66886:

- Add `inject_query_tags: bool = True` parameter to DatabricksSqlOperator
  and DatabricksCopyIntoOperator, allowing users to opt out of the
  automatic session_configuration mutation while preserving the default
  observability benefit. Documented in both operator docstrings.

- Move query-tag tests above the OpenLineage block (test_databricks_copy)
  and after test_exec_write_gcs_parquet_output (test_databricks_sql) so
  they sit closer to core execution logic.

- Expand coverage in both test files: empty/partial context, empty
  existing query_tags, end-to-end exercise of escape sequences for
  commas/colons/backslashes, preservation of unrelated
  session_configuration keys, fallback to connection extras when
  session_config is None, and verification of the opt-out path.
@Shaan-alpha
Author

@SameerMesiah97 thanks again for the review — pushed 38d58f8 addressing the remaining three points:

  • Opt-out control: added inject_query_tags: bool = True to both DatabricksSqlOperator and DatabricksCopyIntoOperator, documented in the operator docstrings. Defaults to True to preserve the observability benefit, but users can pass inject_query_tags=False to retain full control over session_configuration.
  • Test placement: moved the query-tag tests above the OpenLineage block in test_databricks_copy.py and to right after test_exec_write_gcs_parquet_output in test_databricks_sql.py, so they sit closer to the core execution logic.
  • Expanded test coverage (mirrored across both files):
    • empty / partial Airflow context (missing dag / task / run_id)
    • empty existing query_tags
    • escape-character values (,, :, \) — exercises _escape_query_tag_value end-to-end
    • preservation of unrelated session_configuration keys after the merge
    • fallback to databricks_conn.extra_dejson["session_configuration"] when hook.session_config is None
    • opt-out path: inject_query_tags=False leaves session_config untouched

Comment 1 (mapping-driven helpers) was already addressed in 732df03. Ready for another look whenever you have a moment.
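
The fallback case in the coverage list above can be pictured with a small sketch. It assumes the connection extras are available as a dict with an optional "session_configuration" entry; `resolve_session_config` is a hypothetical name and may not match how the PR structures this logic.

```python
from __future__ import annotations

from typing import Any


def resolve_session_config(
    hook_session_config: dict[str, str] | None,
    conn_extra: dict[str, Any],
) -> dict[str, str]:
    """Pick the base session configuration before tags are merged in.

    Hypothetical helper: prefer the hook's explicit session_config and fall
    back to the connection extras only when the hook value is None.
    """
    if hook_session_config is not None:
        return dict(hook_session_config)
    return dict(conn_extra.get("session_configuration") or {})


extras = {"session_configuration": {"query_tags": "team:data"}}
print(resolve_session_config(None, extras))  # falls back to the extras
print(resolve_session_config({}, extras))  # an explicit empty dict wins
```

Treating only None (and not an empty dict) as "unset" is a design choice in this sketch; the tests in the PR would be the place to pin down which behaviour is intended.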

@SameerMesiah97
Contributor

Let's wait for a maintainer to trigger CI.

Shaan-alpha force-pushed the feature/databricks-query-tags branch from 38d58f8 to 8768dca on May 15, 2026 00:39
Shaan-alpha added a commit to Shaan-alpha/airflow that referenced this pull request May 15, 2026
@Shaan-alpha
Author

Follow-up improvements planned based on review and maintainability considerations:

  • Move duplicated test helpers (_make_context and _run_with_mocked_hook) into shared test utilities/fixtures to reduce duplication across Databricks operator tests.
  • Keep query tag injection scoped to execution lifecycle while preserving existing session configuration semantics.
  • Add a small usage/example snippet for inject_query_tags behavior in docs/docstrings if needed.

This PR intentionally preserves backward compatibility by:

  • preserving existing query_tags
  • preserving unrelated session_configuration keys
  • supporting opt-out via inject_query_tags=False
  • handling escaping for Databricks query tag separators.

Shaan-alpha and others added 4 commits May 16, 2026 22:48
…QL operators

This PR injects the Airflow dag_id, task_id, and run_id into the session_configuration parameter of DatabricksSqlOperator and DatabricksCopyIntoOperator. This enhances observability on the Databricks side. The tags are safely escaped and user-defined query tags are preserved.

Closes apache#66839

Refactor query tag formatting and escaping into mapping-driven helper utilities for maintainability and extensibility.

Address review feedback on apache#66886:

- Add `inject_query_tags: bool = True` parameter to DatabricksSqlOperator
  and DatabricksCopyIntoOperator, allowing users to opt out of the
  automatic session_configuration mutation while preserving the default
  observability benefit. Documented in both operator docstrings.

- Move query-tag tests above the OpenLineage block (test_databricks_copy)
  and after test_exec_write_gcs_parquet_output (test_databricks_sql) so
  they sit closer to core execution logic.

- Expand coverage in both test files: empty/partial context, empty
  existing query_tags, end-to-end exercise of escape sequences for
  commas/colons/backslashes, preservation of unrelated
  session_configuration keys, fallback to connection extras when
  session_config is None, and verification of the opt-out path.
Shaan-alpha force-pushed the feature/databricks-query-tags branch from d39b2fd to d425a49 on May 16, 2026 17:18

Development

Successfully merging this pull request may close these issues.

Improve Databricks operators with query tags

2 participants