
AIP-99: Add DataFusionToolset #62850

Merged
gopidesupavan merged 6 commits into apache:main from gopidesupavan:add-object-storage-support-for-tools
Mar 5, 2026


@gopidesupavan gopidesupavan commented Mar 4, 2026

Summary

Add DataFusionToolset, which accepts datasource_configs and enables LLM agents to query files on object stores (S3, local filesystem, Iceberg) through Apache DataFusion.

from __future__ import annotations

from airflow.providers.common.ai.operators.agent import AgentOperator
from airflow.providers.common.compat.sdk import dag, task
from airflow.providers.common.ai.toolsets.datafusion import DataFusionToolset
from airflow.providers.common.sql.config import DataSourceConfig, FormatType

datasource_config_users = DataSourceConfig(
    conn_id="",
    uri="file:///opt/airflow/users_data/",
    table_name="users",
    format=FormatType.CSV,
)

datasource_config_order = DataSourceConfig(
    conn_id="",
    uri="file:///opt/airflow/orders/",
    table_name="orders",
    format=FormatType.CSV,
)


@dag
def example_agent_operator_sql():
    agent = AgentOperator(
        task_id="analyst",
        # Adjacent string literals are concatenated, so keep a trailing space.
        prompt="1. What are the top 5 users by age? "
               "2. Show the customer details, who has more providers",
        llm_conn_id="pydantic_ai_default",
        system_prompt=(
            "You are a SQL analyst. Use the available tools to explore "
            "the schema and answer the question with data."
        ),
        model_id="google-gla:gemini-2.5-pro",
        toolsets=[
            DataFusionToolset(
                datasource_configs=[datasource_config_users, datasource_config_order]
            )
        ],
    )

    @task
    def show_agent_response(agent_response):
        print(agent_response)

    agent >> show_agent_response(agent.output)

example_agent_operator_sql()


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

@gopidesupavan
Member Author

Should we create a new toolset for object stores, e.g. ObjectStoreSQLToolSet(SQLToolset), so that the current SQLToolset stays much cleaner?

@gopidesupavan gopidesupavan marked this pull request as draft March 4, 2026 13:42
@kaxil
Member

kaxil commented Mar 4, 2026

Should we create a new toolset for object stores, e.g. ObjectStoreSQLToolSet(SQLToolset), so that the current SQLToolset stays much cleaner?

yes, separate please


@gopidesupavan gopidesupavan force-pushed the add-object-storage-support-for-tools branch from 2c3c1e6 to 312c826 Compare March 4, 2026 15:05
@gopidesupavan gopidesupavan marked this pull request as ready for review March 4, 2026 15:06
@gopidesupavan gopidesupavan changed the title Add objectstorage support to SQLToolset via DataFusion Add DataFusionToolset Mar 4, 2026
@gopidesupavan gopidesupavan changed the title Add DataFusionToolset AIP-99: Add DataFusionToolset Mar 4, 2026
Member

@kaxil kaxil left a comment

See inline comments below.

@gopidesupavan gopidesupavan force-pushed the add-object-storage-support-for-tools branch from ee6fc35 to 77320ad Compare March 4, 2026 21:42
@gopidesupavan
Member Author

Ooo, my bad, old debugging habits...

Contributor

Copilot AI left a comment

Pull request overview

Adds a new DataFusionToolset to the Common AI provider, enabling pydantic-ai agents to discover tables, inspect schemas, and run SQL queries against object-store-backed datasets via Apache DataFusion (through DataFusionEngine from providers-common-sql).

Changes:

  • Introduces DataFusionToolset with list_tables, get_schema, and query tools, plus lazy DataFusionEngine initialization.
  • Adds unit tests validating tool registration, tool behavior, and engine lazy creation/caching.
  • Extends toolset documentation to include DataFusionToolset and its security posture.
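
The lazy-initialization and tool-dispatch pattern described in the list above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `FakeEngine`, `LazyToolsetSketch`, `engine_factory`, and `call_tool` are hypothetical names, the engine is an in-memory stand-in rather than DataFusionEngine, and the `query` tool here returns the first rows of a table instead of executing SQL.

```python
from __future__ import annotations

from typing import Any, Callable


class FakeEngine:
    """Hypothetical stand-in for DataFusionEngine; keeps tables in memory."""

    def __init__(self, tables: dict[str, list[dict[str, Any]]]) -> None:
        self.tables = tables


class LazyToolsetSketch:
    """Illustrative only: three tools backed by a lazily created engine."""

    def __init__(self, engine_factory: Callable[[], FakeEngine]) -> None:
        self._engine_factory = engine_factory
        self._engine: FakeEngine | None = None  # nothing is built yet

    @property
    def engine(self) -> FakeEngine:
        # Build the engine on first access, then reuse the cached instance.
        if self._engine is None:
            self._engine = self._engine_factory()
        return self._engine

    def list_tables(self) -> list[str]:
        return sorted(self.engine.tables)

    def get_schema(self, table_name: str) -> dict[str, str]:
        # Infer column types from the first row; empty tables yield {}.
        rows = self.engine.tables[table_name]
        if not rows:
            return {}
        return {col: type(val).__name__ for col, val in rows[0].items()}

    def query(self, table_name: str, limit: int = 5) -> list[dict[str, Any]]:
        # The real tool accepts SQL; this sketch only returns leading rows.
        return self.engine.tables[table_name][:limit]

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        # Dispatch by tool name; unknown names are rejected explicitly.
        tools = {
            "list_tables": self.list_tables,
            "get_schema": self.get_schema,
            "query": self.query,
        }
        if name not in tools:
            raise ValueError(f"Unknown tool: {name}")
        return tools[name](**kwargs)
```

Deferring engine creation this way means constructing the toolset at DAG-parse time stays cheap, and the (potentially expensive) engine setup happens only when an agent actually calls a tool.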

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

Files and descriptions:

  • providers/common/ai/src/airflow/providers/common/ai/toolsets/datafusion.py: Implements the new DataFusion-backed toolset and tool call dispatch.
  • providers/common/ai/tests/unit/common/ai/toolsets/test_datafusion.py: Adds unit coverage for initialization, tool behavior, errors, and engine resolution.
  • providers/common/ai/docs/toolsets.rst: Documents DataFusionToolset, its parameters, and security defaults.
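
A test for the lazy creation and caching behavior mentioned for test_datafusion.py might look roughly like the sketch below. It does not reproduce the tests in this PR: `CachedEngineHolder` and `get_engine` are hypothetical names standing in for whatever the toolset actually exposes, and the engine factory is a `MagicMock` rather than a real DataFusionEngine.

```python
from unittest.mock import MagicMock


class CachedEngineHolder:
    """Hypothetical minimal holder with lazy, cached engine creation."""

    def __init__(self, factory):
        self._factory = factory
        self._engine = None

    def get_engine(self):
        # Create the engine on first use and cache it afterwards.
        if self._engine is None:
            self._engine = self._factory()
        return self._engine


def test_engine_created_lazily_and_cached():
    factory = MagicMock(name="engine_factory")
    holder = CachedEngineHolder(factory)

    # Constructing the holder alone must not build the engine.
    factory.assert_not_called()

    first = holder.get_engine()
    second = holder.get_engine()

    # The factory ran exactly once, and the same object is reused.
    factory.assert_called_once_with()
    assert first is second
```

Using a mock factory keeps the test independent of DataFusion itself, so it checks only the caching contract.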
Comments suppressed due to low confidence (1)

providers/common/ai/docs/toolsets.rst:31

  • The intro now says “Three toolsets are included”, but the bullet list still only includes HookToolset and SQLToolset. Please add a DataFusionToolset bullet and update the nearby wording (e.g. “Both implement…”) so the section stays consistent.
Three toolsets are included:

- :class:`~airflow.providers.common.ai.toolsets.hook.HookToolset` — generic
  adapter for any Airflow Hook.
- :class:`~airflow.providers.common.ai.toolsets.sql.SQLToolset` — curated


@gopidesupavan gopidesupavan merged commit 01f62b9 into apache:main Mar 5, 2026
125 of 127 checks passed
@gopidesupavan gopidesupavan deleted the add-object-storage-support-for-tools branch March 5, 2026 06:28
1Ninad pushed a commit to 1Ninad/airflow that referenced this pull request Mar 6, 2026
* Add objectstorage support to SQLToolset via DataFusion

* Add DataFusionToolset

* Update tests

* Resolve comments

* Resolve comments

* Resolve comments