Summary
Since #18916 introduced FFI_PhysicalExpr to pass physical expressions across the FFI boundary without going through the protobuf physical codec (avoiding the codec / TaskContext circular dependency described in the PR), physical expressions originating in one shared library and consumed in another arrive as ForeignPhysicalExpr opaque wrappers. Any code in datafusion core that relies on Any::downcast_ref::<T>() / as_any().is::<T>() to identify concrete physical expression types silently mis-classifies these expressions, because the wrapper's TypeId is ForeignPhysicalExpr, not the wrapped concrete type.
The most visible symptom today: simplify_const_expr_immediate (in datafusion/physical-expr/src/simplifier/const_evaluator.rs) checks expr.as_any().is::<Column>() to short-circuit. For a foreign-wrapped Column, that check is false. The function then walks the (empty) children list, vacuously concludes "all children are literals," and attempts evaluate(dummy_batch) on what is effectively a column reference. The result is either a spurious literal substitution (replacing the column with a wrong / null scalar) or a runtime error that is swallowed by the Err(_) => Transformed::no arm. Either way, predicate semantics are broken downstream of the simplifier. This affects row-group pruning and open() paths in Parquet scans for any TableProvider consumed via FFI (concrete repro: datafusion-python with a third-party Rust TableProvider crate).
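The vacuous-truth failure described above can be reproduced in a few lines of plain Rust. This is a minimal sketch, not DataFusion code: the Expr trait, the Foreign wrapper, and would_try_const_eval are hypothetical stand-ins that only mirror the control flow described here.

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical stand-in for a physical expression trait (not DataFusion's API).
trait Expr {
    fn as_any(&self) -> &dyn Any;
    fn children(&self) -> Vec<Arc<dyn Expr>>;
    fn describe(&self) -> String;
}

struct Column {
    name: String,
    index: usize,
}

struct Literal(i64);

// Opaque wrapper: forwards behavior to the wrapped expression, but reports
// its own TypeId and exposes no children.
struct Foreign {
    inner: Arc<dyn Expr>,
}

impl Expr for Column {
    fn as_any(&self) -> &dyn Any { self }
    fn children(&self) -> Vec<Arc<dyn Expr>> { Vec::new() }
    fn describe(&self) -> String { format!("{}@{}", self.name, self.index) }
}

impl Expr for Literal {
    fn as_any(&self) -> &dyn Any { self }
    fn children(&self) -> Vec<Arc<dyn Expr>> { Vec::new() }
    fn describe(&self) -> String { self.0.to_string() }
}

impl Expr for Foreign {
    fn as_any(&self) -> &dyn Any { self }
    fn children(&self) -> Vec<Arc<dyn Expr>> { Vec::new() }
    fn describe(&self) -> String { self.inner.describe() }
}

// Mirrors the described logic: short-circuit on Column, otherwise treat
// "all children are literals" as permission to evaluate on a dummy batch.
// Other early exits of the real function are deliberately omitted.
fn would_try_const_eval(expr: &Arc<dyn Expr>) -> bool {
    if expr.as_any().is::<Column>() {
        return false;
    }
    // `all` over an empty iterator is true, so any unrecognized leaf passes.
    expr.children().iter().all(|c| c.as_any().is::<Literal>())
}
```

In this model a local Column is skipped, while the same Column behind Foreign passes the check, even though describe() still forwards to the wrapped column.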
Reproduction
Minimal repro provided by jwimberl. Query SELECT * FROM dummy_table WHERE a < 5 against a parquet-backed TableProvider whose scan() builds its predicate via state.create_physical_expr(...) where state is the FFI-foreign Session. The predicate arrives at row-group filtering as ForeignPhysicalExpr(Column { name: "a", index: 0 }) and is silently rewritten by the simplifier.
A user-side workaround is to bypass state.create_physical_expr and build the PhysicalExpr locally.
Building it locally returns a local-typed Column / Literal / BinaryExpr, so the simplifier downcasts succeed. This works for typical filter pushdown because the relevant ExecutionProps fields (var providers, query start time, alias generator) are rarely used in pushdown predicates, and ScalarUDF instances ride inside the logical Expr rather than being looked up via a registry. It is brittle, because it ignores anything the foreign session would have provided via ExecutionProps, and it does not solve the underlying class of bug.
Root cause
The downcast pattern as_any().is::<T>() / as_any().downcast_ref::<T>() assumes producer and consumer share a compilation unit. Across the datafusion-ffi boundary they do not: each shared library compiles its own copy of every core type, so TypeId::of::<Column>() in the consumer is not guaranteed to match TypeId::of::<Column>() in the producer, and in any case the consumer only sees the ForeignPhysicalExpr wrapper. FFI_PhysicalExpr (introduced by #18916) was designed to carry behavior across the boundary via a vtable; identity (the TypeId) is intentionally not carried. Core code that uses TypeId to dispatch (the simplifier, pruning, parts of the optimizer, stats extraction) therefore mis-classifies any FFI-wrapped expression as "not a Column," "not a Literal," etc., and falls through to whatever branch handles "unknown node." In simplify_const_expr_immediate that branch evaluates the expression against a dummy batch, which is incorrect for a column reference and silently corrupts the predicate.
Alternatives considered
Codec round-trip
Serialize the PhysicalExpr via PhysicalExtensionCodec on the producer side and deserialize it on the consumer side. Rejected by #18916 because PhysicalExtensionCodec::try_decode requires a FunctionRegistry / TaskContext, which itself comes from the session being crossed — the round-trip dependency the PR was created to break. Any new proposal needs to avoid reintroducing this.
name()-based dispatch
Some prior discussions around analogous downcast-by-type fragility at the ExecutionPlan layer have proposed dispatching on ExecutionPlan::name() (the display name string) instead of Any::downcast_ref. It works mechanically — a foreign-wrapped FilterExec reports "FilterExec" through its vtable — but it is brittle:
name() is user-settable. Any third-party ExecutionPlan can return "FilterExec", and dispatch will treat it as the core FilterExec.
It conflates display with identity. name() is intended for EXPLAIN output, not type discrimination. Coupling them locks in display strings as de facto type tags and discourages renaming for readability.
It does not generalize to PhysicalExpr, which has no equivalent stable display contract for every node.
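The first objection above can be illustrated with a small sketch. The Plan trait and both structs are hypothetical and model only the display name and the Any hook; they are not DataFusion's ExecutionPlan.

```rust
use std::any::Any;

// Hypothetical minimal plan trait.
trait Plan {
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}

// Stands in for the core operator.
struct FilterExec;

// Unrelated third-party operator that happens to reuse the display name.
struct ThirdPartyExec;

impl Plan for FilterExec {
    fn name(&self) -> &str { "FilterExec" }
    fn as_any(&self) -> &dyn Any { self }
}

impl Plan for ThirdPartyExec {
    fn name(&self) -> &str { "FilterExec" }
    fn as_any(&self) -> &dyn Any { self }
}

// Name-based dispatch: any implementor can claim the name.
fn is_filter_by_name(plan: &dyn Plan) -> bool {
    plan.name() == "FilterExec"
}

// Type-based dispatch: only the concrete local type matches.
fn is_filter_by_type(plan: &dyn Plan) -> bool {
    plan.as_any().is::<FilterExec>()
}
```

Under these assumptions, is_filter_by_name accepts ThirdPartyExec while is_filter_by_type rejects it, which is the spoofing risk the list describes.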
Adding identity methods to core traits (is_column(), node_kind(), accessors)
Considered and rejected. The downcast pattern is widespread enough across the simplifier, pruning, and optimizer that providing a parallel method-based API in PhysicalExpr (and a corresponding one in ExecutionPlan) would mean a sustained boilerplate tax on every implementor — in core and downstream — for the benefit of an FFI code path that is not the common case. The cost falls on the wrong population.
Proposed resolution: tiered reconstruction inside datafusion-ffi
The proposal is a two-tier model at the FFI boundary, with no changes required to the PhysicalExpr or ExecutionPlan traits in core:
Tier 1 — Known structural built-ins are reconstructed as local-typed instances on the consumer side. At wrap time on the producer side, datafusion-ffi attempts Any::downcast_ref against a closed set of well-known core types and, on match, transports the minimal field data needed to rebuild:
Column → (name, index)
Literal → ScalarValue (+ optional FieldMetadata)
BinaryExpr → (op, left, right) with recursive reconstruction of children
(extend as concrete downcast call sites in core demand: IsNullExpr, NotExpr, CastExpr, InListExpr, LikeExpr, etc.)
On the consumer side, FFI rebuilds these as the consumer's local Column / Literal / ... so as_any().is::<Column>() and similar checks throughout core continue to work.
Tier 2 — Unknown or third-party types stay opaque. Wrapped via the existing ForeignPhysicalExpr vtable. Callers must not rely on TypeId for these.
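The two tiers can be sketched as an encode step on the producer side and a decode step on the consumer side. This is an illustrative model under stated assumptions: the Wire enum, encode, decode, and the toy Expr trait are hypothetical names, the boundary is simulated inside one process, and Literal carries a plain i64 rather than a ScalarValue.

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical stand-in for a physical expression trait.
trait Expr {
    fn as_any(&self) -> &dyn Any;
}

#[derive(Debug, PartialEq)]
struct Column { name: String, index: usize }

#[derive(Debug, PartialEq)]
struct Literal(i64);

struct BinaryExpr { op: char, left: Arc<dyn Expr>, right: Arc<dyn Expr> }

// A third-party expression the FFI layer knows nothing about.
struct CustomExpr;

// Tier 2 wrapper: opaque to TypeId checks on the consumer side.
struct Foreign {
    #[allow(dead_code)] // a real wrapper would forward calls through a vtable
    inner: Arc<dyn Expr>,
}

impl Expr for Column { fn as_any(&self) -> &dyn Any { self } }
impl Expr for Literal { fn as_any(&self) -> &dyn Any { self } }
impl Expr for BinaryExpr { fn as_any(&self) -> &dyn Any { self } }
impl Expr for CustomExpr { fn as_any(&self) -> &dyn Any { self } }
impl Expr for Foreign { fn as_any(&self) -> &dyn Any { self } }

// Hypothetical wire representation: field data for Tier 1, a handle for Tier 2.
enum Wire {
    Column { name: String, index: usize },
    Literal(i64),
    Binary { op: char, left: Box<Wire>, right: Box<Wire> },
    Opaque(Arc<dyn Expr>),
}

// Producer side: downcast against a closed set of known types.
fn encode(expr: &Arc<dyn Expr>) -> Wire {
    let any = expr.as_any();
    if let Some(c) = any.downcast_ref::<Column>() {
        Wire::Column { name: c.name.clone(), index: c.index }
    } else if let Some(l) = any.downcast_ref::<Literal>() {
        Wire::Literal(l.0)
    } else if let Some(b) = any.downcast_ref::<BinaryExpr>() {
        Wire::Binary {
            op: b.op,
            left: Box::new(encode(&b.left)),
            right: Box::new(encode(&b.right)),
        }
    } else {
        Wire::Opaque(Arc::clone(expr))
    }
}

// Consumer side: rebuild local types for Tier 1, wrap everything else.
fn decode(wire: Wire) -> Arc<dyn Expr> {
    match wire {
        Wire::Column { name, index } => Arc::new(Column { name, index }),
        Wire::Literal(v) => Arc::new(Literal(v)),
        Wire::Binary { op, left, right } => Arc::new(BinaryExpr {
            op,
            left: decode(*left),
            right: decode(*right),
        }),
        Wire::Opaque(inner) => Arc::new(Foreign { inner }),
    }
}
```

After decode(encode(..)), a predicate such as a < 5 downcasts to the consumer's own BinaryExpr, Column, and Literal, while CustomExpr comes back as Foreign.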
Rationale
No core trait pollution. Adding is_column() / node_kind() / accessor methods to PhysicalExpr (or ExecutionPlan) for FFI's sake would impose ongoing boilerplate on every implementor for the benefit of an exotic code path. FFI is not the common case in DataFusion; the boundary should bear its own cost.
No codec / TaskContext dependency. Field-level copying of well-known shapes is independent of PhysicalExtensionCodec and the FunctionRegistry, avoiding the circular dependency that #18916 was designed to break.
Fixes the bug class, not one site. Every as_any().is::<Column>() / downcast_ref::<Literal>() in core (simplifier, pruning, optimizer rules, stats extraction, partition pruning) starts working again for FFI-wrapped expressions, with no per-call-site migration.
Not name-based. Identity is established by TypeId on the producer side at wrap time, where producer and core are in the same compilation unit. Consumers receive concrete local types they own.
Stable surface. The shapes of Column { name, index } and Literal { value, metadata } are de facto stable public API. Future shape changes are explicit FFI-version events, analogous to Arrow C Data Interface evolution.
Extending to ExecutionPlan
The same pattern applies and addresses related downcast-by-type fragility observed in plan-level code. Initial Tier 1 candidates are "structural" plans: FilterExec, ProjectionExec, CoalesceBatchesExec, RepartitionExec, SortExec, LimitExec, UnionExec, EmptyExec, PlaceholderRowExec. Source plans (DataSourceExec, ParquetExec, custom sources) remain Tier 2 — they are rarely downcast by optimizer rules in a way that requires concrete identity.
PlanProperties (schema, partitioning, equivalence) should be recomputed on the consumer side after reconstruction rather than transported, since equivalence groups are themselves physical expressions.
A documentable rule
If this pattern is adopted, it is worth codifying in the datafusion-ffi README:
Across the FFI boundary, structural types (data carriers, no dyn Trait extensibility) round-trip by value and survive TypeId checks. Behavioral types (extensible via dyn Trait) round-trip by vtable and do not survive TypeId checks. Code in core that downcasts a structural type is fine; code that downcasts a behavioral type is a latent FFI bug.
This rule also makes core auditable: a TypeId check against a structural type is safe; against a behavioral or extension type, it is a bug.
Trade-offs
Cons of the proposed approach:
Boilerplate moves to datafusion-ffi. Adding a new built-in physical expression or execution plan requires updating the FFI encode / decode tables.
datafusion-ffi becomes coupled to the internal field shape of the well-known core types. Acceptable, but a real coupling.
ScalarFunction cannot be fully Tier 1 without addressing UDF identity across the boundary (lookup is by name and needs a registry on the consumer). Recommend leaving ScalarFunction in Tier 2 for now.
Recursive reconstruction of composite expressions costs O(n) FFI calls per filter at scan-plan time. Amortizes well; pushdown happens once per query plan.
Recommended path
Land a narrow defensive fix in simplify_const_expr_immediate so that an expression with zero children is never treated as "all children literal" and never falls through to evaluate(dummy_batch). This is a small, isolated patch that prevents incorrect predicate rewriting today regardless of FFI. Suggested diff:
let children = expr.children();
if children.is_empty() || !children.iter().all(|c| c.as_any().is::<Literal>()) {
    return Ok(Transformed::no(expr));
}
Implement Tier 1 reconstruction in datafusion-ffi starting with Column and Literal, then BinaryExpr (recursive). These three cover the great majority of pushdown-filter shapes and the great majority of TypeId call sites in the simplifier / pruning code paths.
Extend Tier 1 to additional physical expressions as concrete downcast bugs surface (CastExpr, IsNullExpr, NotExpr, InListExpr, etc.).
Apply the same pattern to a small ExecutionPlan Tier 1 (FilterExec, ProjectionExec) once expression-level reconstruction is stable. Drive that work from concrete optimizer / simplifier bugs rather than speculatively covering every plan.
Document the structural-vs-behavioral rule in datafusion-ffi to set caller expectations and prevent name()-based dispatch hacks proliferating.
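The effect of the defensive guard in the first step can be checked on a toy tree. The Node enum and both foldable functions are hypothetical and model only the children check, not evaluation against a batch.

```rust
// Hypothetical toy expression tree (not DataFusion's PhysicalExpr).
enum Node {
    Literal,
    // Leaf whose concrete type the consumer cannot identify,
    // for example a foreign-wrapped column.
    OpaqueLeaf,
    Add(Box<Node>, Box<Node>),
}

impl Node {
    fn children(&self) -> Vec<&Node> {
        match self {
            Node::Literal | Node::OpaqueLeaf => Vec::new(),
            Node::Add(l, r) => vec![l.as_ref(), r.as_ref()],
        }
    }

    fn is_literal(&self) -> bool {
        matches!(self, Node::Literal)
    }
}

// Without the guard: a leaf has no children, so `all` is vacuously true.
fn foldable_unguarded(node: &Node) -> bool {
    node.children().iter().all(|c| c.is_literal())
}

// With the guard: zero children never counts as "all children literal".
fn foldable_guarded(node: &Node) -> bool {
    let children = node.children();
    !children.is_empty() && children.iter().all(|c| c.is_literal())
}
```

In this model the unguarded check accepts an opaque leaf, while the guarded check rejects it and still accepts a node whose children are all literals.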
Related
#18916 (FFI_PhysicalExpr introduction; the source of the foreign-wrapper behavior)
(TaskContext circular-dep context)