Skip to content

Stats/interval propagation should degrade to unbounded intervals on internal errors, not fail the query #22028

@adriangb

Description

@adriangb

Background

Interval and statistics propagation in DataFusion is a best-effort optimization: it refines bounds on expressions to enable better pruning, filter pushdown, and join planning, but it is never required for query correctness. If propagation can't compute precise bounds for an expression, the optimizer should fall back to an unbounded interval (or unknown distribution) and continue planning.

Today, several propagation paths return Result<Interval> and ?-propagate errors all the way out to the user as Internal error: .... A single arithmetic combination that hasn't been wired through the type-coercion plumbing — or any other unexpected failure inside Interval arithmetic — turns into a hard query failure rather than degraded statistics.

A concrete recent example: #22027 fixed five assert_eq_or_internal_err! calls in Interval::mul/div/intersect/union/contains that were firing for Decimal128(38, 10) / Decimal128(20, 0) and similar mismatched-type cases. The underlying bug was real (missing type coercion), but the way it surfaced — as an internal error from a plain SQL numeric / bigint query — was disproportionate to the actual problem (we couldn't refine bounds for one expression).

Reproducer (current behavior, prior to #22027)

CREATE TABLE t(c1 BIGINT) AS VALUES (1::bigint), (2::bigint), (10::bigint);

SELECT
  c1,
  CASE WHEN c1 = 0 THEN 100.0
       ELSE ROUND((1.0 - (c1::numeric / c1)) * 100, 2)
  END AS rate
FROM t
WHERE (1.0 - (c1::numeric / c1)) * 100 < 95.0;

fails in datafusion-cli with:

Internal error: Assertion failed: ... Intervals must have the same data type for division

After #22027 this specific path returns rows. But any future operator/type pair that isn't fully coerced will reintroduce the same shape of failure — that's what this issue is about preventing structurally.

Proposal

Treat interval/stats propagation as advisory: if any error escapes Interval arithmetic during propagation, log a debug warning and substitute an unbounded interval (or Distribution::new_generic with an unknown range), then continue optimizing.

Natural seams

The two main entry points where errors should be caught and converted into "no information":

  1. BinaryExpr::evaluate_bounds in datafusion/physical-expr/src/expressions/binary.rs (around line 411). Currently:

    fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval> {
        let left_interval = children[0];
        let right_interval = children[1];
        apply_operator(&self.op, left_interval, right_interval)
    }

    Wrap the apply_operator call so that on Err, we log and return Interval::make_unbounded(&self.data_type(...)).

  2. apply_operator callers in statistics.rs (datafusion/expr-common/src/statistics.rs lines 738 and 763). Both compute a range_evaluation from operand ranges; on error they should degrade to a Bernoulli(unknown) (line 738) or unbounded Distribution::new_generic (line 763) rather than failing the entire stats computation.

propagate_constraints in BinaryExpr and the CP solver paths in cp_solver.rs (update_ranges, propagate_arithmetic, etc.) likely deserve the same treatment — failure to propagate should yield "no refinement" (return Ok(Some(vec![])) or skip the refinement), not bubble out.

Considerations

  • Visibility: silently swallowing internal errors masks real bugs in interval arithmetic. Mitigations:
    • Always log at warn! or debug! with the operator and operand types.
    • Gate the fallback on DataFusionError::Internal specifically — let External, IoError, ResourcesExhausted, etc. continue to propagate.
    • Add a session config like optimizer.fail_on_interval_error (default false) so tests and CI can opt into the strict behavior and catch regressions.
  • Test coverage: existing tests rely on errors propagating in some cases (e.g. CP-solver edge cases). Each call site needs a careful audit to make sure "no information" is the right fallback for that specific function's contract.
  • Scope: this is a behavioral change, not a bug fix. Worth landing on its own with explicit reviewer sign-off, separate from fix: coerce operand types in Interval mul/div/intersect/union/contains #22027.

Alternatives

  • Make every Interval method infallible by always returning an unbounded interval on internal failure — pushes the policy down into the data structure but loses signal for legitimately fallible cases like cast errors during coercion.
  • Status quo + fix bugs as they're reported. This is what we've been doing; the cost is that each new operator/type combination that wasn't anticipated turns into a user-visible regression (cf. fix: coerce operand types in Interval mul/div/intersect/union/contains #22027).

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions