Skip to content

Support custom build-side hash join accumulators #18936

@peasee

Description

@peasee

Is your feature request related to a problem or challenge?

I have a scenario where I'm joining some large datasets together, and the returned values for the dynamic filters from the hash join are very disparate - resulting in more data being scanned than is necessary from the right side of the join.

For example, imagine a query involving TPCH tables orders and lineitem. orders is the left side of the join and returns 2 rows - one row with o_orderkey=1 and the second with o_orderkey=5900000. These are simplified to a range filter like l_orderkey >= 1 AND l_orderkey <= 5900000 - resulting in nearly the entirety of the lineitem table being scanned when only 2 rows would've sufficed.

Describe the solution you'd like

I would like CollectLeftAccumulator to be a trait, so that I can override it using a physical plan optimizer by replacing the HashJoinExec with a new HashJoinExec containing my custom left side accumulator.

The default accumulator would remain the min-max style accumulator that currently exists.

Describe alternatives you've considered

There could be an argument to be made that we should consider alternative bounds accumulation techniques by default, like clustering, etc. It is my opinion though that making the left side accumulator a trait would make it simpler to test and implement future changes like these to add varying styles of accumulators.

Additional context

I have already implemented this in a fork: spiceai#116

My proposal would basically be upstreaming this PR into DataFusion trunk.

My custom left-side accumulator at the moment simply constructs a IN expression of the identified keys (e.g. IN (1, 5900000)). I am aware the pruning predicate builder does not support IN expressions with more than 20 elements, I am using my own custom pruner which supports larger IN expressions so this is not a problem for me.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions