Skip to content

ElasticsearchSQLHook: add Polars DataFrame support via custom SQL reader#66220

Draft
SameerMesiah97 wants to merge 1 commit into
apache:mainfrom
SameerMesiah97:ElasticSearchSQLHook-Get-Polars-DF
Draft

ElasticsearchSQLHook: add Polars DataFrame support via custom SQL reader#66220
SameerMesiah97 wants to merge 1 commit into
apache:mainfrom
SameerMesiah97:ElasticSearchSQLHook-Get-Polars-DF

Conversation

@SameerMesiah97
Copy link
Copy Markdown
Contributor

Description

This PR is a follow-up to #50454 which adds support for returning query results as a Polars DataFrame in ElasticsearchSQLHook by implementing _get_polars_df.

Instead of relying on polars.read_database, which requires DB-API compatibility, this implementation introduces a custom reader that executes Elasticsearch SQL queries using cursor-based pagination and converts the results into a Polars DataFrame.

Rationale

ElasticsearchSQLCursor is not compatible with the execution model expected by polars.read_database, which prevents native Polars support via the existing DB-API abstraction.

This change resolves the outstanding TODO identified in #50454 by implementing a dedicated reader that interacts directly with the Elasticsearch SQL API. This avoids the need to adapt the custom cursor to Polars’ internal executor interface, which would introduce unnecessary complexity and tighter coupling.

Tests

Added unit tests verifying that:

  • Elasticsearch SQL responses are correctly converted into Polars DataFrames with expected schema and data, including empty result sets.
  • Pagination using cursors is handled correctly, including aggregation of multi-page results and support for max_rows truncation.
  • Cursor cleanup is performed using the last non-null cursor and is skipped when no cursor is returned by Elasticsearch.
  • The hook correctly delegates to the custom reader when df_type="polars" is requested.

Documentation

Updated the _get_polars_df docstring to describe the use of a custom reader and explain why polars.read_database is not used.

Backwards Compatibility

This change adds support for df_type="polars" in ElasticsearchSQLHook and does not modify existing behavior for other data frame types. No breaking changes are introduced.

Copy link
Copy Markdown
Contributor

@justinpakzad justinpakzad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should polars be added as an optional dependency in the pyproject.toml? Currently it's only in the dev dependencies, so I think this would fail with an import error if a user passes df_type="polars". Might be worth adding a guarded import with an error message pointing users to the right install extra.

…Hook. Use cursor-based pagination and add unit tests for pagination,

max_rows handling, and cursor cleanup.
@SameerMesiah97 SameerMesiah97 force-pushed the ElasticSearchSQLHook-Get-Polars-DF branch from 452e963 to 07a45b5 Compare May 4, 2026 16:27
@SameerMesiah97
Copy link
Copy Markdown
Contributor Author

Should polars be added as an optional dependency in the pyproject.toml? Currently it's only in the dev dependencies, so I think this would fail with an import error if a user passes df_type="polars". Might be worth adding a guarded import with an error message pointing users to the right install extra.

polars has been added as an optional dependency in pyproject.toml. A guard has also been added to direct users to install the dependency as well..

@eladkal
Copy link
Copy Markdown
Contributor

eladkal commented May 4, 2026

cc @guan404ming for review as the one who added polars support for most of the providers

@eladkal eladkal requested a review from guan404ming May 4, 2026 16:48
@eladkal
Copy link
Copy Markdown
Contributor

eladkal commented May 4, 2026

Rationale

ElasticsearchSQLCursor is not compatible with the execution model expected by polars.read_database, which prevents native Polars support via the existing DB-API abstraction.

This change resolves the outstanding TODO identified in #50454 by implementing a dedicated reader that interacts directly with the Elasticsearch SQL API. This avoids the need to adapt the custom cursor to Polars’ internal executor interface, which would introduce unnecessary complexity and tighter coupling.

To be specific
#50454 (comment)
is the issue we have with elastic and polars

@SameerMesiah97
Copy link
Copy Markdown
Contributor Author

cc @guan404ming for review as the one who added polars support for most of the providers

@guan404ming

Would be interested in your thoughts on this issue as well:

#66290

@justinpakzad
Copy link
Copy Markdown
Contributor

polars has been added as an optional dependency in pyproject.toml. A guard has also been added to direct users to install the dependency as well..

Just revisiting this now and I noticed that the base DbApiHook also has guarded imports for both pandas and polars, but the error messages point users to install via the common-sql package (e.g.,apache-airflow-providers-common-sql[polars]). Since the pandas path would fall through to the base hook's exception (because _get_pandas_df isn't overridden here), maybe it actually makes sense to match that same pattern in this override for consistency. That way both df types point users to the same install path. A maintainer would know better but thought it was worth mentioning.

@SameerMesiah97
Copy link
Copy Markdown
Contributor Author

polars has been added as an optional dependency in pyproject.toml. A guard has also been added to direct users to install the dependency as well..

Just revisiting this now and I noticed that the base DbApiHook also has guarded imports for both pandas and polars, but the error messages point users to install via the common-sql package (e.g.,apache-airflow-providers-common-sql[polars]). Since the pandas path would fall through to the base hook's exception (because _get_pandas_df isn't overridden here), maybe it actually makes sense to match that same pattern in this override for consistency. That way both df types point users to the same install path. A maintainer would know better but thought it was worth mentioning.

Were you able to get get_pandas_df to work with the ElasticsearchSQLHook? Based on a few sanity checks, it seems that that the Elasticsearch library is not fully DP-API compliant and this makes get_pandas_df non-functional as database semantics (for e.g. rollbacks) are not handled by the current implementation. I am actually considering another PR which is essentially doing this what PR is doing but for Pandas instead. That PR would override _get_pandas_df and include the same AirflowOptionalProviderFeatureException which should make it symmetrical for both pandas and polars.

@justinpakzad
Copy link
Copy Markdown
Contributor

Were you able to get get_pandas_df to work with the ElasticsearchSQLHook? Based on a few sanity checks, it seems that that the Elasticsearch library is not fully DP-API compliant and this makes get_pandas_df non-functional as database semantics (for e.g. rollbacks) are not handled by the current implementation. I am actually considering another PR which is essentially doing this what PR is doing but for Pandas instead. That PR would override _get_pandas_df and include the same AirflowOptionalProviderFeatureException which should make it symmetrical for both pandas and polars.

Good shout. I didn't actually test the get_pandas_df with the Elastic hook, I was just going based on what I had initially observed. The additional PR make sense and that would address the misalignment between the optional provider messages (as you mentioned). Nice one.

@SameerMesiah97
Copy link
Copy Markdown
Contributor Author

Were you able to get get_pandas_df to work with the ElasticsearchSQLHook? Based on a few sanity checks, it seems that that the Elasticsearch library is not fully DP-API compliant and this makes get_pandas_df non-functional as database semantics (for e.g. rollbacks) are not handled by the current implementation. I am actually considering another PR which is essentially doing this what PR is doing but for Pandas instead. That PR would override _get_pandas_df and include the same AirflowOptionalProviderFeatureException which should make it symmetrical for both pandas and polars.

Good shout. I didn't actually test the get_pandas_df with the Elastic hook, I was just going based on what I had initially observed. The additional PR make sense and that would address the misalignment between the optional provider messages (as you mentioned). Nice one.

Actually, that is a false alarm. The root cause was a version mismatch between the elasticsearch client used by airflow and the cluster.

@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label May 11, 2026
@SameerMesiah97
Copy link
Copy Markdown
Contributor Author

Requesting review for this.

@SameerMesiah97 SameerMesiah97 marked this pull request as draft May 20, 2026 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants