
test(python): add datafusion-python compatibility tests#1614

Draft
andygrove wants to merge 5 commits into apache:main from andygrove:py-datafusion-compat-tests

Conversation

@andygrove
Member

@andygrove andygrove commented Apr 28, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

Ballista's Python bindings extend datafusion-python heavily through subclassing and metaclass introspection (see python/python/ballista/extension.py):

  • RedefiningDataFrameMeta walks the parent DataFrame.__dict__ and re-wraps every method whose return annotation is the literal string "DataFrame" so it returns DistributedDataFrame instead.
  • RedefiningSessionContextMeta does the same for SessionContext.
  • A hardcoded EXECUTION_METHODS = ["collect", "collect_partitioned", "show", "count", "to_arrow_table", "to_pandas", "to_polars", "write_json"] is wrapped to route execution through the Ballista cluster.
  • DistributedDataFrame.write_csv, write_parquet, and write_parquet_with_options are explicitly defined and call into _internal_ballista Rust bindings, bypassing the metaclass.

If a future datafusion-python release changes annotation style (e.g. switches from forward-reference strings to real class objects), renames methods, or alters signatures, the wrapping silently stops happening or breaks at runtime. Today only collect() was exercised under Ballista in test_context.py. The other seven EXECUTION_METHODS plus all three explicit write methods were entirely uncovered.
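The annotation-driven re-wrapping described above can be sketched in miniature. This is a simplified stand-in, not the real implementation (which lives in python/python/ballista/extension.py); `Parent`, `Distributed`, and `RedefiningMeta` are hypothetical names, and only the core pattern — scanning the base class `__dict__` for methods whose return annotation is a literal string — is reproduced:

```python
# Minimal sketch of the metaclass re-wrapping pattern (hypothetical names).
import functools


class Parent:
    def select(self, *cols) -> "Parent":  # forward-reference string annotation
        return Parent()

    def count(self) -> int:  # real annotation object, so NOT re-wrapped
        return 0


def _wrap(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Re-wrap the plain result in the distributed subclass.
        return Distributed() if isinstance(result, Parent) else result
    return wrapper


class RedefiningMeta(type):
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        for base in bases:
            for attr, value in base.__dict__.items():
                # Only methods annotated as returning the literal
                # string "Parent" get re-wrapped on the subclass.
                if callable(value) and value.__annotations__.get("return") == "Parent":
                    setattr(cls, attr, _wrap(value))
        return cls


class Distributed(Parent, metaclass=RedefiningMeta):
    pass
```

The fragility is visible here: if `select` were annotated `-> Parent` (a class object) instead of `-> "Parent"` (a string), the comparison would silently stop matching and `Distributed().select(...)` would quietly return a plain `Parent` — exactly the drift these tests guard against.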

While adding the new tests, two real signature-drift bugs surfaced in extension.py: write_csv and write_parquet_with_options were not passing the write_options argument that the underlying PyO3-bound DataFrame requires, so any caller of ctx.sql(...).write_csv(...) got a TypeError. These are also fixed in this PR.

What changes are included in this PR?

python/python/ballista/extension.py bug fixes — pass None for write_options in the underlying DataFrame.write_csv and DataFrame.write_parquet_with_options calls, matching the existing pattern used elsewhere.
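The shape of the fix can be sketched with a stand-in for the bound class. `_FakeBoundDataFrame` and `DistributedDataFrameSketch` below are hypothetical; the real code routes through `_to_internal_df()` into the `_internal_ballista` bindings:

```python
class _FakeBoundDataFrame:
    """Stand-in for the PyO3-bound DataFrame: write_options is a required
    positional argument even though docs/stubs may show a None default."""

    def write_csv(self, path, with_header, write_options):
        return (path, with_header, write_options)


class DistributedDataFrameSketch:
    def __init__(self, df):
        self._df = df  # the underlying bound DataFrame

    def write_csv(self, path, with_header=False):
        # The fix: pass None for write_options explicitly. Omitting it
        # raised: TypeError: DataFrame.write_csv() missing 1 required
        # positional argument: 'write_options'
        return self._df.write_csv(path, with_header, None)


result = DistributedDataFrameSketch(_FakeBoundDataFrame()).write_csv("/tmp/out.csv", True)
```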

New file python/python/tests/test_datafusion_compat.py with 14 tests in four groups:

  1. Metaclass smoke tests (3) — fail loudly if introspection no longer matches:

    • test_distributed_dataframe_wraps_dataframe_returning_methods — confirms representative DataFrame methods (select, filter, with_column, aggregate) carry the string "DataFrame" return annotation and are re-wrapped on DistributedDataFrame.
    • test_ballista_session_context_wraps_dataframe_returning_methods — same check for sql / read_csv / read_parquet on BallistaSessionContext.
    • test_execution_methods_are_present_on_dataframe — every name in EXECUTION_METHODS still exists on datafusion.DataFrame.
  2. Per-EXECUTION_METHODS round-trip (8) — one test per name in EXECUTION_METHODS (collect, collect_partitioned, show, count, to_arrow_table, to_pandas, to_polars, write_json). Builds a small DistributedDataFrame and calls each, asserting return shape and content.

  3. Write-method round-trip (3) — covers the explicit, non-metaclass write methods on DistributedDataFrame. Each writes to tmp_path and reads back with pyarrow to verify row count and column values:

    • test_write_csv_round_trip
    • test_write_parquet_round_trip
    • test_write_parquet_with_options_round_trip — uses non-default ParquetWriterOptions (snappy compression, custom batch / row-group sizes, statistics_enabled='chunk') so the ~20 attributes shovelled through extension.py:173-194 are actually exercised. Asserts the compression attribute propagated to the written file.
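The round-trip pattern those tests follow can be sketched without a cluster. The real tests write via DistributedDataFrame and read back with pyarrow; here the standard-library csv module stands in on both sides so the sketch is dependency-free:

```python
# Sketch of the write/read round-trip assertion pattern (no Ballista here;
# plain csv stands in for DistributedDataFrame.write_csv and pyarrow).
import csv
import os
import tempfile

rows = [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}, {"id": "3", "name": "c"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")

    # Write phase: in the real test this is df.write_csv(path, ...).
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(rows)

    # Read-back phase: verify row count and column values survived.
    with open(path, newline="") as f:
        read_back = list(csv.DictReader(f))

assert len(read_back) == 3
assert [r["name"] for r in read_back] == ["a", "b", "c"]
```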

Dev dependency additions: pandas>=2.0.0 and polars>=1.0.0 in [dependency-groups].dev so the to_pandas / to_polars tests run unconditionally in CI rather than skipping when those libraries are absent. uv.lock is regenerated accordingly.

Are there any user-facing changes?

Yes — BallistaSessionContext.sql(...).write_csv(...) and .write_parquet_with_options(...) no longer raise TypeError from a missing write_options argument. No API shape changes.

Ballista's BallistaSessionContext / DistributedDataFrame rely on metaclass
introspection of datafusion's SessionContext / DataFrame:

- methods whose return annotation is the literal string 'DataFrame' are
  re-wrapped to return DistributedDataFrame
- a hardcoded EXECUTION_METHODS list is re-wrapped to route execution
  through the Ballista cluster

Both mechanisms can break silently if datafusion changes annotation style
or renames methods, leaving queries to quietly run locally instead of on
the cluster. Add tests that exercise each wrapping path so drift surfaces
as a test failure rather than incorrect behavior:

- 3 metaclass smoke tests confirm wrapping happened on DataFrame and
  SessionContext, and that EXECUTION_METHODS names still exist on
  datafusion.DataFrame
- 8 per-method round-trip tests, one per name in EXECUTION_METHODS,
  with pytest.importorskip for pandas/polars optional deps

Tests pass against datafusion 51.

Add pandas and polars to the dev dependency group so the to_pandas and
to_polars compatibility tests exercise real conversions instead of
skipping when the optional libraries are missing. CI pipelines that run
'uv sync --dev' will now install both, ensuring drift in either path is
caught.
- Module-scope ctx fixture registers the test table once instead of
  spinning up a fresh cluster per test (file runtime ~16s -> ~2s).
- Drop redundant pathlib.Path() wrap and os.path.getsize, use
  out.glob and p.stat().st_size; remove now-unused os import.
- Extract the 'DataFrame' return-annotation literal into a module
  constant and document why __dict__ rather than getattr is used.
@milenkovicm
Contributor

df.write_* may have been changed a lot if i remember

The PyO3-bound DataFrame.write_csv and DataFrame.write_parquet_with_options
require write_options to be passed even though their Python signatures
declare a None default, so calls to BallistaSessionContext.sql(...).write_csv(...)
or .write_parquet_with_options(...) currently fail with:

    TypeError: DataFrame.write_csv() missing 1 required positional
    argument: 'write_options'

Pass None explicitly. The commented-out raw_write_options block in
write_parquet_with_options was attempting to thread an unsupplied
parameter and is removed.
These three methods are explicitly defined on DistributedDataFrame and
bypass the metaclass, routing through _to_internal_df() into the Rust-side
_internal_ballista bindings. None of them had test coverage before.

- test_write_csv_round_trip: writes a small DataFrame to CSV and reads
  it back with pyarrow, verifying row count and column values.
- test_write_parquet_round_trip: same shape for the default Parquet path.
- test_write_parquet_with_options_round_trip: constructs a non-default
  ParquetWriterOptions (snappy compression, custom batch / row-group
  sizes, statistics_enabled='chunk') so the ~20 attributes shovelled
  through extension.py:173-194 are actually exercised. Asserts the
  compression attribute propagated to the written file.
@andygrove
Member Author

df.write_* may have been changed a lot if i remember

Thanks, I added tests for all the write methods and fixed one bug that was discovered 🎉

@milenkovicm
Contributor

Should we merge this with #1590?

    df.write_csv(path, with_header)
    # The PyO3-bound DataFrame.write_csv requires write_options to be
    # passed even though its Python signature shows a None default.
    df.write_csv(path, with_header, None)
Contributor


as with #1590 we're missing write_options: DataFrameWriteOptions

Member Author


I looked into this, and it looks complex and beyond my current knowledge of how these things work 😞

@andygrove
Member Author

Should we merge this with #1590?

My plan was to get tests passing on the main branch first, then rebase the DF 52 upgrade and make sure there are no regressions.

@milenkovicm
Contributor

Makes sense.

We could add write_options: DataFrameWriteOptions to the signatures but internally ignore it for now. I left a few comments in #1590.

@milenkovicm
Contributor

Let's merge both of them; I can follow up to fix them over the weekend.

