test(python): add datafusion-python compatibility tests #1614

andygrove wants to merge 5 commits into apache:main
Conversation
Ballista's `BallistaSessionContext` / `DistributedDataFrame` rely on metaclass introspection of datafusion's `SessionContext` / `DataFrame`:

- methods whose return annotation is the literal string `'DataFrame'` are re-wrapped to return `DistributedDataFrame`
- a hardcoded `EXECUTION_METHODS` list is re-wrapped to route execution through the Ballista cluster

Both mechanisms can break silently if datafusion changes annotation style or renames methods, leaving queries to quietly run locally instead of on the cluster. Add tests that exercise each wrapping path so drift surfaces as a test failure rather than incorrect behavior:

- 3 metaclass smoke tests confirm wrapping happened on `DataFrame` and `SessionContext`, and that `EXECUTION_METHODS` names still exist on `datafusion.DataFrame`
- 8 per-method round-trip tests, one per name in `EXECUTION_METHODS`, with `pytest.importorskip` for pandas/polars optional deps

Tests pass against datafusion 51.
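The annotation-driven wrapping described above can be illustrated with a minimal sketch. The classes below are toys standing in for datafusion's `DataFrame` and Ballista's metaclass (whose real name and internals live in `extension.py`, not here): any method on the parent whose return annotation is the literal string `'DataFrame'` gets re-wrapped so it returns the distributed subclass.

```python
import functools
import inspect


class DataFrame:
    """Toy stand-in for datafusion.DataFrame."""

    def select(self, *cols) -> 'DataFrame':  # forward-reference string annotation
        return DataFrame()

    def count(self) -> int:  # not re-wrapped: annotation is not 'DataFrame'
        return 0


class RedefiningMeta(type):
    """Sketch of the re-wrapping metaclass (simplified, hypothetical)."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Walk the parent's __dict__ (not getattr) so only methods defined
        # directly on DataFrame are inspected.
        for attr, fn in DataFrame.__dict__.items():
            if inspect.isfunction(fn) and fn.__annotations__.get("return") == "DataFrame":
                def wrap(f):
                    @functools.wraps(f)
                    def method(self, *args, **kwargs):
                        f(self, *args, **kwargs)  # run the parent implementation
                        return cls()              # re-wrap as the distributed subclass
                    return method
                setattr(cls, attr, wrap(fn))
        return cls


class DistributedDataFrame(DataFrame, metaclass=RedefiningMeta):
    pass


# select carried the 'DataFrame' annotation, so it now returns the subclass;
# count did not, so it is left untouched.
assert type(DistributedDataFrame().select("a")) is DistributedDataFrame
assert DistributedDataFrame().count() == 0
```

This is exactly why the tests matter: if a future datafusion release annotates `select` with the class object instead of the string, the `== "DataFrame"` check silently stops matching.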
…ally

Add pandas and polars to the dev dependency group so the `to_pandas` and `to_polars` compatibility tests exercise real conversions instead of skipping when the optional libraries are missing. CI pipelines that run `uv sync --dev` will now install both, ensuring drift in either path is caught.
- Module-scope `ctx` fixture registers the test table once instead of spinning up a fresh cluster per test (file runtime ~16s -> ~2s).
- Drop redundant `pathlib.Path()` wrap and `os.path.getsize`; use `out.glob` and `p.stat().st_size`; remove now-unused `os` import.
- Extract the `'DataFrame'` return-annotation literal into a module constant and document why `__dict__` rather than `getattr` is used.
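The pathlib cleanup in the second bullet amounts to the following pattern (a sketch with a throwaway temp directory, not the actual test file):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    out = Path(d)  # already a Path: no need to re-wrap with pathlib.Path(out)
    (out / "part-0.csv").write_text("a,b\n1,2\n")

    # Path.glob + Path.stat().st_size replace os.listdir + os.path.getsize,
    # so the os import can be dropped entirely.
    sizes = {p.name: p.stat().st_size for p in out.glob("*.csv")}

assert set(sizes) == {"part-0.csv"}
assert sizes["part-0.csv"] > 0
```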
`df.write_*` may have been changed a lot if i remember
The PyO3-bound `DataFrame.write_csv` and `DataFrame.write_parquet_with_options`
require `write_options` to be passed even though their Python signatures
declare a `None` default, so calls to `BallistaSessionContext.sql(...).write_csv(...)`
or `.write_parquet_with_options(...)` currently fail with:

    TypeError: DataFrame.write_csv() missing 1 required positional
    argument: 'write_options'

Pass `None` explicitly. The commented-out `raw_write_options` block in
`write_parquet_with_options` was attempting to thread an unsupplied
parameter and is removed.
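The failure mode described in this commit can be reproduced in miniature. The toy function below is not the real binding: it just mimics a PyO3 layer whose advertised signature shows a default while the underlying implementation still demands the argument, so the caller must pass `None` explicitly.

```python
# Toy stand-in for the PyO3-bound method: its *text signature* may advertise
# write_options=None, but the implementation treats it as required.
def write_csv(path, with_header, write_options):
    return (path, with_header, write_options)


# What the old wrapper did -- omit the argument and trust the default:
try:
    write_csv("out.csv", True)
except TypeError as e:
    print("TypeError:", e)  # missing 1 required positional argument

# The fix applied in extension.py -- pass None explicitly:
result = write_csv("out.csv", True, None)
assert result == ("out.csv", True, None)
```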
These three methods are explicitly defined on `DistributedDataFrame` and bypass the metaclass, routing through `_to_internal_df()` into the Rust-side `_internal_ballista` bindings. None of them had test coverage before.

- `test_write_csv_round_trip`: writes a small DataFrame to CSV and reads it back with pyarrow, verifying row count and column values.
- `test_write_parquet_round_trip`: same shape for the default Parquet path.
- `test_write_parquet_with_options_round_trip`: constructs a non-default `ParquetWriterOptions` (snappy compression, custom batch / row-group sizes, `statistics_enabled='chunk'`) so the ~20 attributes shovelled through `extension.py:173-194` are actually exercised. Asserts the compression attribute propagated to the written file.
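The round-trip shape these tests share can be sketched with the stdlib `csv` module standing in for the real DataFrame/pyarrow machinery: write a small table, read it back, and assert row count and column values survived.

```python
import csv
import tempfile
from pathlib import Path

# A tiny table, as the write tests build before calling write_csv / write_parquet.
rows = [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "out.csv"

    # Write phase (the real tests call the DistributedDataFrame write method here).
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(rows)

    # Read-back phase (the real tests use pyarrow) and the round-trip assertions.
    with path.open(newline="") as f:
        back = list(csv.DictReader(f))

assert len(back) == len(rows)
assert back == rows
```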
Thanks, I added tests for all the write methods and fixed one bug that was discovered 🎉
Should we merge this with #1590?
```diff
- df.write_csv(path, with_header)
+ # The PyO3-bound DataFrame.write_csv requires write_options to be
+ # passed even though its Python signature shows a None default.
+ df.write_csv(path, with_header, None)
```
as with #1590 we're missing `write_options: DataFrameWriteOptions`
I looked into this and it looks complex and beyond my current knowledge of how these things work 😞

My plan was to get tests passing on the main branch first, then rebase the DF 52 upgrade and make sure there are no regressions
makes sense. we could add
lets merge both of them, i can follow up to fix them over the weekend
Which issue does this PR close?
Closes #.
Rationale for this change
Ballista's Python bindings extend `datafusion-python` heavily through subclassing and metaclass introspection (see `python/python/ballista/extension.py`):

- `RedefiningDataFrameMeta` walks the parent `DataFrame.__dict__` and re-wraps every method whose return annotation is the literal string `"DataFrame"` so it returns `DistributedDataFrame` instead.
- `RedefiningSessionContextMeta` does the same for `SessionContext`.
- `EXECUTION_METHODS = ["collect", "collect_partitioned", "show", "count", "to_arrow_table", "to_pandas", "to_polars", "write_json"]` is wrapped to route execution through the Ballista cluster.
- `DistributedDataFrame.write_csv`, `write_parquet`, and `write_parquet_with_options` are explicitly defined and call into `_internal_ballista` Rust bindings, bypassing the metaclass.

If a future `datafusion-python` release changes annotation style (e.g. switches from forward-reference strings to real class objects), renames methods, or alters signatures, the wrapping silently stops happening or breaks at runtime. Today only `collect()` was exercised under Ballista in `test_context.py`. The other seven `EXECUTION_METHODS` plus all three explicit write methods were entirely uncovered.

While adding the new tests, two real signature-drift bugs surfaced in `extension.py`: `write_csv` and `write_parquet_with_options` were not passing the `write_options` argument that the underlying PyO3-bound `DataFrame` requires, so any caller of `ctx.sql(...).write_csv(...)` got a `TypeError`. These are also fixed in this PR.

What changes are included in this PR?

- `python/python/ballista/extension.py` bug fixes — pass `None` for `write_options` in the underlying `DataFrame.write_csv` and `DataFrame.write_parquet_with_options` calls, matching the existing pattern used elsewhere.
- New file `python/python/tests/test_datafusion_compat.py` with 14 tests in four groups:
  - Metaclass smoke tests (3) — fail loudly if introspection no longer matches:
    - `test_distributed_dataframe_wraps_dataframe_returning_methods` — confirms representative `DataFrame` methods (`select`, `filter`, `with_column`, `aggregate`) carry the string `"DataFrame"` return annotation and are re-wrapped on `DistributedDataFrame`.
    - `test_ballista_session_context_wraps_dataframe_returning_methods` — same check for `sql` / `read_csv` / `read_parquet` on `BallistaSessionContext`.
    - `test_execution_methods_are_present_on_dataframe` — every name in `EXECUTION_METHODS` still exists on `datafusion.DataFrame`.
  - Per-`EXECUTION_METHODS` round-trip (8) — one test per name in `EXECUTION_METHODS` (`collect`, `collect_partitioned`, `show`, `count`, `to_arrow_table`, `to_pandas`, `to_polars`, `write_json`). Builds a small `DistributedDataFrame` and calls each, asserting return shape and content.
  - Write-method round-trip (3) — covers the explicit, non-metaclass write methods on `DistributedDataFrame`. Each writes to `tmp_path` and reads back with pyarrow to verify row count and column values:
    - `test_write_csv_round_trip`
    - `test_write_parquet_round_trip`
    - `test_write_parquet_with_options_round_trip` — uses non-default `ParquetWriterOptions` (snappy compression, custom batch / row-group sizes, `statistics_enabled='chunk'`) so the ~20 attributes shovelled through `extension.py:173-194` are actually exercised. Asserts the compression attribute propagated to the written file.
- Dev dependency additions — `pandas>=2.0.0` and `polars>=1.0.0` in `[dependency-groups].dev` so the `to_pandas` / `to_polars` tests run unconditionally in CI rather than skipping when those libraries are absent. `uv.lock` is regenerated accordingly.

Are there any user-facing changes?

Yes — `BallistaSessionContext.sql(...).write_csv(...)` and `.write_parquet_with_options(...)` no longer raise `TypeError` from a missing `write_options` argument. No API shape changes.
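For reference, the dev-dependency addition described above could look roughly like this in `pyproject.toml` (a sketch — the group's existing entries are assumed, not shown):

```toml
[dependency-groups]
dev = [
    # ...existing dev dependencies...
    "pandas>=2.0.0",
    "polars>=1.0.0",
]
```

With these in the `dev` group, `uv sync --dev` installs both, so the `to_pandas` / `to_polars` tests no longer skip in CI.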