test(python): add datafusion-python compatibility tests #1614

andygrove wants to merge 5 commits into apache:main
Conversation
Ballista's `BallistaSessionContext` / `DistributedDataFrame` rely on metaclass introspection of datafusion's `SessionContext` / `DataFrame`:

- methods whose return annotation is the literal string `'DataFrame'` are re-wrapped to return `DistributedDataFrame`
- a hardcoded `EXECUTION_METHODS` list is re-wrapped to route execution through the Ballista cluster

Both mechanisms can break silently if datafusion changes annotation style or renames methods, leaving queries to quietly run locally instead of on the cluster. Add tests that exercise each wrapping path so drift surfaces as a test failure rather than incorrect behavior:

- 3 metaclass smoke tests confirm wrapping happened on `DataFrame` and `SessionContext`, and that `EXECUTION_METHODS` names still exist on `datafusion.DataFrame`
- 8 per-method round-trip tests, one per name in `EXECUTION_METHODS`, with `pytest.importorskip` for pandas/polars optional deps

Tests pass against datafusion 51.
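The annotation-driven wrapping described above can be illustrated with a minimal sketch. The classes below are toys standing in for datafusion's `DataFrame` and Ballista's metaclass (whose real name and internals live in `extension.py`, not here): any method on the parent whose return annotation is the literal string `'DataFrame'` gets re-wrapped so it returns the distributed subclass.

```python
import functools
import inspect


class DataFrame:
    """Toy stand-in for datafusion.DataFrame."""

    def select(self, *cols) -> 'DataFrame':  # forward-reference string annotation
        return DataFrame()

    def count(self) -> int:  # not re-wrapped: annotation is not 'DataFrame'
        return 0


class RedefiningMeta(type):
    """Sketch of the re-wrapping metaclass (simplified, hypothetical)."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Walk the parent's __dict__ (not getattr) so only methods defined
        # directly on DataFrame are inspected.
        for attr, fn in DataFrame.__dict__.items():
            if inspect.isfunction(fn) and fn.__annotations__.get("return") == "DataFrame":
                def wrap(f):
                    @functools.wraps(f)
                    def method(self, *args, **kwargs):
                        f(self, *args, **kwargs)  # run the parent implementation
                        return cls()              # re-wrap as the distributed subclass
                    return method
                setattr(cls, attr, wrap(fn))
        return cls


class DistributedDataFrame(DataFrame, metaclass=RedefiningMeta):
    pass


# select carried the 'DataFrame' annotation, so it now returns the subclass;
# count did not, so it is left untouched.
assert type(DistributedDataFrame().select("a")) is DistributedDataFrame
assert DistributedDataFrame().count() == 0
```

This is exactly why the tests matter: if a future datafusion release annotates `select` with the class object instead of the string, the `== "DataFrame"` check silently stops matching.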
…ally

Add pandas and polars to the dev dependency group so the `to_pandas` and `to_polars` compatibility tests exercise real conversions instead of skipping when the optional libraries are missing. CI pipelines that run `uv sync --dev` will now install both, ensuring drift in either path is caught.
- Module-scope `ctx` fixture registers the test table once instead of spinning up a fresh cluster per test (file runtime ~16s -> ~2s).
- Drop redundant `pathlib.Path()` wrap and `os.path.getsize`; use `out.glob` and `p.stat().st_size`; remove now-unused `os` import.
- Extract the `'DataFrame'` return-annotation literal into a module constant and document why `__dict__` rather than `getattr` is used.
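The pathlib cleanup in the second bullet amounts to the following pattern (a sketch with a throwaway temp directory, not the actual test file):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    out = Path(d)  # already a Path: no need to re-wrap with pathlib.Path(out)
    (out / "part-0.csv").write_text("a,b\n1,2\n")

    # Path.glob + Path.stat().st_size replace os.listdir + os.path.getsize,
    # so the os import can be dropped entirely.
    sizes = {p.name: p.stat().st_size for p in out.glob("*.csv")}

assert set(sizes) == {"part-0.csv"}
assert sizes["part-0.csv"] > 0
```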
`df.write_*` may have been changed a lot if i remember
The PyO3-bound `DataFrame.write_csv` and `DataFrame.write_parquet_with_options`
require `write_options` to be passed even though their Python signatures
declare a `None` default, so calls to `BallistaSessionContext.sql(...).write_csv(...)`
or `.write_parquet_with_options(...)` currently fail with:

    TypeError: DataFrame.write_csv() missing 1 required positional
    argument: 'write_options'

Pass `None` explicitly. The commented-out `raw_write_options` block in
`write_parquet_with_options` was attempting to thread an unsupplied
parameter and is removed.
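The failure mode described in this commit can be reproduced in miniature. The toy function below is not the real binding: it just mimics a PyO3 layer whose advertised signature shows a default while the underlying implementation still demands the argument, so the caller must pass `None` explicitly.

```python
# Toy stand-in for the PyO3-bound method: its *text signature* may advertise
# write_options=None, but the implementation treats it as required.
def write_csv(path, with_header, write_options):
    return (path, with_header, write_options)


# What the old wrapper did -- omit the argument and trust the default:
try:
    write_csv("out.csv", True)
except TypeError as e:
    print("TypeError:", e)  # missing 1 required positional argument

# The fix applied in extension.py -- pass None explicitly:
result = write_csv("out.csv", True, None)
assert result == ("out.csv", True, None)
```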
These three methods are explicitly defined on `DistributedDataFrame` and bypass the metaclass, routing through `_to_internal_df()` into the Rust-side `_internal_ballista` bindings. None of them had test coverage before.

- `test_write_csv_round_trip`: writes a small DataFrame to CSV and reads it back with pyarrow, verifying row count and column values.
- `test_write_parquet_round_trip`: same shape for the default Parquet path.
- `test_write_parquet_with_options_round_trip`: constructs a non-default `ParquetWriterOptions` (snappy compression, custom batch / row-group sizes, `statistics_enabled='chunk'`) so the ~20 attributes shovelled through `extension.py:173-194` are actually exercised. Asserts the compression attribute propagated to the written file.
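The round-trip shape these tests share can be sketched with the stdlib `csv` module standing in for the real DataFrame/pyarrow machinery: write a small table, read it back, and assert row count and column values survived.

```python
import csv
import tempfile
from pathlib import Path

# A tiny table, as the write tests build before calling write_csv / write_parquet.
rows = [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "out.csv"

    # Write phase (the real tests call the DistributedDataFrame write method here).
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(rows)

    # Read-back phase (the real tests use pyarrow) and the round-trip assertions.
    with path.open(newline="") as f:
        back = list(csv.DictReader(f))

assert len(back) == len(rows)
assert back == rows
```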
Thanks, I added tests for all the write methods and fixed one bug that was discovered 🎉
Should we merge this with #1590?
```diff
- df.write_csv(path, with_header)
+ # The PyO3-bound DataFrame.write_csv requires write_options to be
+ # passed even though its Python signature shows a None default.
+ df.write_csv(path, with_header, None)
```
as with #1590 we're missing `write_options: DataFrameWriteOptions`
I looked into this and it looks complex and beyond my current knowledge of how these things work 😞

My plan was to get tests passing on the main branch first, then rebase the DF 52 upgrade and make sure there are no regressions
makes sense. we could add
lets merge both of them, i can follow up to fix them over the weekend
Which issue does this PR close?
Closes #.
Rationale for this change
Ballista's Python bindings extend `datafusion-python` heavily through subclassing and metaclass introspection (see `python/python/ballista/extension.py`):

- `RedefiningDataFrameMeta` walks the parent `DataFrame.__dict__` and re-wraps every method whose return annotation is the literal string `"DataFrame"` so it returns `DistributedDataFrame` instead.
- `RedefiningSessionContextMeta` does the same for `SessionContext`.
- `EXECUTION_METHODS = ["collect", "collect_partitioned", "show", "count", "to_arrow_table", "to_pandas", "to_polars", "write_json"]` is wrapped to route execution through the Ballista cluster.
- `DistributedDataFrame.write_csv`, `write_parquet`, and `write_parquet_with_options` are explicitly defined and call into `_internal_ballista` Rust bindings, bypassing the metaclass.

If a future `datafusion-python` release changes annotation style (e.g. switches from forward-reference strings to real class objects), renames methods, or alters signatures, the wrapping silently stops happening or breaks at runtime. Today only `collect()` was exercised under Ballista in `test_context.py`. The other seven `EXECUTION_METHODS` plus all three explicit write methods were entirely uncovered.

While adding the new tests, two real signature-drift bugs surfaced in `extension.py`: `write_csv` and `write_parquet_with_options` were not passing the `write_options` argument that the underlying PyO3-bound `DataFrame` requires, so any caller of `ctx.sql(...).write_csv(...)` got a `TypeError`. These are also fixed in this PR.

What changes are included in this PR?

- `python/python/ballista/extension.py` bug fixes — pass `None` for `write_options` in the underlying `DataFrame.write_csv` and `DataFrame.write_parquet_with_options` calls, matching the existing pattern used elsewhere.
- New file `python/python/tests/test_datafusion_compat.py` with 14 tests in four groups:
  - Metaclass smoke tests (3) — fail loudly if introspection no longer matches:
    - `test_distributed_dataframe_wraps_dataframe_returning_methods` — confirms representative `DataFrame` methods (`select`, `filter`, `with_column`, `aggregate`) carry the string `"DataFrame"` return annotation and are re-wrapped on `DistributedDataFrame`.
    - `test_ballista_session_context_wraps_dataframe_returning_methods` — same check for `sql` / `read_csv` / `read_parquet` on `BallistaSessionContext`.
    - `test_execution_methods_are_present_on_dataframe` — every name in `EXECUTION_METHODS` still exists on `datafusion.DataFrame`.
  - Per-`EXECUTION_METHODS` round-trip (8) — one test per name in `EXECUTION_METHODS` (`collect`, `collect_partitioned`, `show`, `count`, `to_arrow_table`, `to_pandas`, `to_polars`, `write_json`). Builds a small `DistributedDataFrame` and calls each, asserting return shape and content.
  - Write-method round-trip (3) — covers the explicit, non-metaclass write methods on `DistributedDataFrame`. Each writes to `tmp_path` and reads back with pyarrow to verify row count and column values:
    - `test_write_csv_round_trip`
    - `test_write_parquet_round_trip`
    - `test_write_parquet_with_options_round_trip` — uses non-default `ParquetWriterOptions` (snappy compression, custom batch / row-group sizes, `statistics_enabled='chunk'`) so the ~20 attributes shovelled through `extension.py:173-194` are actually exercised. Asserts the compression attribute propagated to the written file.
- Dev dependency additions — `pandas>=2.0.0` and `polars>=1.0.0` in `[dependency-groups].dev` so the `to_pandas` / `to_polars` tests run unconditionally in CI rather than skipping when those libraries are absent. `uv.lock` is regenerated accordingly.

Are there any user-facing changes?

Yes — `BallistaSessionContext.sql(...).write_csv(...)` and `.write_parquet_with_options(...)` no longer raise `TypeError` from a missing `write_options` argument. No API shape changes.
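For reference, the dev-dependency addition described above could look roughly like this in `pyproject.toml` (a sketch — the group's existing entries are assumed, not shown):

```toml
[dependency-groups]
dev = [
    # ...existing dev dependencies...
    "pandas>=2.0.0",
    "polars>=1.0.0",
]
```

With these in the `dev` group, `uv sync --dev` installs both, so the `to_pandas` / `to_polars` tests no longer skip in CI.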