Semi/anti join column stats not scaled with estimated row count #22743

@neilconway

Describe the bug

estimate_join_cardinality (datafusion/physical-plan/src/joins/utils.rs) estimates a reduced row count for semi/anti joins but returns the preserved input's column statistics unchanged. The per-column stats then describe the full input rather than the emitted subset:

  • null_count, distinct_count, and byte_size become inconsistent with the output num_rows (e.g. null_count > num_rows).
  • sum_value still reflects the full input.
  • Values marked Exact stay Exact, even though statistics for an estimated subset can only be estimates.
  • Join-key columns just copy the input null count, but null keys never match — a semi join drops them all, an anti join keeps them all.
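
One possible shape for the adjustment, as a rough sketch only. The types and function names below (Stat, ColStats, JoinKind, scale, adjust_semi_anti) are hypothetical stand-ins invented for illustration, not DataFusion's actual API. The sketch assumes a single join-key column and uniform scaling by the estimated selectivity, which is a crude model for distinct_count in particular. sum_value is left out for brevity.

```rust
/// Hypothetical stand-in for an exact/inexact/absent statistic.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stat {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Stat {
    /// The underlying value, if any, ignoring exactness.
    fn value(self) -> Option<usize> {
        match self {
            Stat::Exact(v) | Stat::Inexact(v) => Some(v),
            Stat::Absent => None,
        }
    }
}

/// Hypothetical, simplified per-column statistics.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColStats {
    null_count: Stat,
    distinct_count: Stat,
    byte_size: Stat,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinKind {
    Semi,
    Anti,
}

/// Scale a statistic by out_rows / in_rows, demote it to Inexact,
/// and optionally cap it (e.g. at the output row count).
fn scale(stat: Stat, in_rows: usize, out_rows: usize, cap: Option<usize>) -> Stat {
    match stat.value() {
        None => Stat::Absent,
        Some(_) if in_rows == 0 => Stat::Inexact(0),
        Some(v) => {
            // Widen to u128 so the multiplication cannot overflow.
            let scaled = ((v as u128 * out_rows as u128) / in_rows as u128) as usize;
            Stat::Inexact(cap.map_or(scaled, |c| scaled.min(c)))
        }
    }
}

/// Adjust the preserved side's column stats for a semi/anti join whose
/// row count was estimated to shrink from `in_rows` to `out_rows`.
fn adjust_semi_anti(
    col: ColStats,
    is_join_key: bool,
    kind: JoinKind,
    in_rows: usize,
    out_rows: usize,
) -> ColStats {
    let null_count = match (is_join_key, kind) {
        // Null keys never match, so a semi join emits none of them.
        (true, JoinKind::Semi) => Stat::Exact(0),
        // An anti join keeps every null-key row; cap at the estimated
        // row count so the result stays consistent with num_rows.
        (true, JoinKind::Anti) => match col.null_count.value() {
            Some(v) => Stat::Inexact(v.min(out_rows)),
            None => Stat::Absent,
        },
        // Non-key columns: assume nulls are spread uniformly.
        (false, _) => scale(col.null_count, in_rows, out_rows, Some(out_rows)),
    };
    ColStats {
        null_count,
        distinct_count: scale(col.distinct_count, in_rows, out_rows, Some(out_rows)),
        // byte_size is not a row count, so it is scaled but not capped.
        byte_size: scale(col.byte_size, in_rows, out_rows, None),
    }
}
```

Under these assumptions, a non-key column with an exact null count of 500 in a 1000-row input would come out with an inexact null count of 50 when the join is estimated to emit 100 rows, so null_count can no longer exceed num_rows.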

A similar issue affects joins in general, but it is more complex; I will file a separate ticket for it.

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Labels

bug (Something isn't working)
