partition index dynamic filters #20331
gene-bordegaray wants to merge 11 commits into apache:main from …02/dyn_filter_partition_indexed

Conversation
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(b@0, a@0)]
- RepartitionExec: partitioning=Hash([b@0], 1), input_partitions=1
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(c@1, d@0)]
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[b, c, y], file_type=test, pushdown_supported=true
no dynamic filter because it's the build side of a build side... took me a second 😂
// its own filter.
predicate = predicate
    .map(|p| snapshot_physical_expr_for_partition(p, partition_index))
    .transpose()?;
decided to only do this in the parquet opener; if we did it for all files (by default) it would just do nothing, since predicates aren't passed to other openers. This does mean that users will have to implement this for their own data sources.
Given this is a large PR, I didn't want to include logic for a fallback, and doing nothing seemed out of place; could still reconsider if others have an opinion.
"hash-repartitioned"
} else {
    "file-grouped"
}
I didn't love this, maybe could have a helper to map and make it cleaner
yeah I think it's a good idea
LiaCastaneda
left a comment
This makes sense to me and will be very helpful for use cases where we want to avoid repartitioning data. My only concern is that API users would need to align the probe and build side partitions, but this seems like a reasonable tradeoff. Let’s see what other contributors think. (This is a partial review; I will finish later today or early next week.) But so far it's looking good to me :)
// One side starts with multiple partitions while target is 1. EnforceDistribution inserts a
// hash repartition on the left child. The partitioning schemes are now misaligned:
// - Left: hash-repartitioned (repartitioned=true)
// - Right: file-grouped (repartitioned=false)
// This is a correctness bug, so we expect an error.
Can we have the other way around as well? Having a Partitioned join where the left preserves file partitioning and the right has a RepartitionExec.
let optimized = ensure_distribution_helper_transform_up(join, 1)?;
assert_plan!(optimized, @r"
HashJoinExec: mode=Partitioned, join_type=Inner, on=[(a@0, a@1)]
DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet
Would it make sense to display whether DataSourceExec is preserving partitioning? Something like preserve_partitioning=[bool]? This may be useful for users to know why there is no RepartitionExec in the plan even though the mode is Partitioned.
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(a@0, b@0)]
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, x], file_type=test, pushdown_supported=true
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(c@1, d@0)]
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[b, c, y], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ b@0 >= aa AND b@0 <= ab AND b@0 IN (SET) ([aa, ab]) ]
If this was a Partitioned Join why was there no CASE dynamic filter before? 🤔
Because there is only one partition, so it only applied one filter
💯 thank you for the reviews
LiaCastaneda
left a comment
👍 I think I'm done with my review, overall looks good, just some minor comments.
insta::assert_snapshot!(
    OptimizationTest::new(Arc::clone(&plan), FilterPushdown::new_post_optimization(), true),
    @r"
super nit: would it be enough to assert on the plan after execution only in these kinds of tests? this file is becoming increasingly large
Yes, I agree we only need to assert on the plan after execution in these kinds of tests
"hash-repartitioned"
} else {
    "file-grouped"
}
yeah I think it's a good idea
We need to make sure that in the future it’s easy to revert or migrate users away from index-based routing to their custom Partitioning implementation. Since this does not introduce a new API, I don’t think it should be a problem. This was previously a bug, and with this PR dynamic filtering works, but it’s something to keep in mind.
NGA-TRAN
left a comment
The approach looks great, Gene. Nice work!
I do have some suggestions on comments and test data to make things clearer for reviewers and future maintenance
/// │ mode=Partitioned │
/// │┌───────┐┌───────┐┌───────┐│
/// ││ Hash ││ Hash ││ Hash ││
/// ││Table 1││Table 2││Table 2││
Suggested change:
- /// ││Table 1││Table 2││Table 2││
+ /// ││Table 1││Table 2││Table 3││
)
}

fn first_hash_join_and_direct_hash_repartition_children(
Does this comment sound correct? Fix it as you see fit
Suggested change:
- fn first_hash_join_and_direct_hash_repartition_children(
+ // Traversing down the plan and returning the first hash join with direct repartition children
+ fn first_hash_join_and_direct_hash_repartition_children(
None
}

fn hash_repartition_on_column(
The purpose of this function is to add a RepartitionExec on top of the input plan. How about renaming it to:
Suggested change:
- fn hash_repartition_on_column(
+ // Add RepartitionExec for the given input
+ fn add_repartition(
config.optimizer.enable_round_robin_repartition = false;
config.optimizer.repartition_file_scans = false;
config.optimizer.repartition_file_min_size = 1024;
config.optimizer.prefer_existing_sort = false;
It will be clearer if you add comments explaining why you need these settings and for which tests.
// Creates a DynamicFilterPhysicalExpr with per-partition bounds:
// - Partition 0: a >= 1 AND a <= 3 (matches all rows)
// - Partition 1: a >= 10 AND a <= 20 (excludes all rows via row group stats)
// - Partition 2: a >= 2 AND a <= 4 (matches some rows)
So Partitions 0 and 2 overlap? Why do you need this unit test? Haven't you already thrown an error if this happens in the query plan?
For a unit test we can make it whatever, but I suggest we have a test that makes sense to us. Test non-overlapping partitions
/// - `Some(Some(expr))`: use the partition-local filter.
/// - `Some(None)`: the build partition is known empty, so return `false`.
/// - `None` (out-of-range): return `true` (fail-open) to avoid incorrect pruning if
///   partition alignment/count assumptions are violated by a source.
Can you add a comment here describing what the returned value means? Something like this:
- Ok(Expr) : dynamic filter expression will be used for the given partition
- Ok(false): will filter everything on the probe side because the build side is empty
- Ok(true): will not filter anything from the probe side and return as-is
Right, usually when users decide to do these custom partitions, they must have a mechanism to enforce them. Thus, I do not think we need to worry about this at the dynamic filtering stage. We only need to provide a way to use dynamic filtering correctly, which is the purpose of this PR.
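The tri-state lookup described in the quoted doc comment can be sketched in plain Rust. This is an illustrative stand-in only: `String` plays the role of `Arc<dyn PhysicalExpr>`, and `filter_for_partition` is a hypothetical name, not DataFusion's actual API:

```rust
/// Stand-in for the per-partition filter table: index = partition number.
/// `Some(expr)` = partition-local filter; `None` = build partition known empty.
type PartitionedFilters = Vec<Option<String>>;

/// Resolve the filter for `partition`:
/// - in range with a filter -> use the partition-local expression
/// - in range but empty     -> "false" (prune everything on the probe side)
/// - out of range           -> "true" (fail open: filter nothing)
fn filter_for_partition(filters: &PartitionedFilters, partition: usize) -> String {
    match filters.get(partition) {
        Some(Some(expr)) => expr.clone(),
        Some(None) => "false".to_string(),
        None => "true".to_string(),
    }
}

fn main() {
    let filters: PartitionedFilters = vec![Some("a >= 1 AND a <= 3".to_string()), None];
    println!("{}", filter_for_partition(&filters, 0)); // partition-local filter
    println!("{}", filter_for_partition(&filters, 1)); // empty build partition
    println!("{}", filter_for_partition(&filters, 9)); // misaligned -> fail open
}
```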
/// Per-partition filter expressions indexed by partition number.
type PartitionedFilters = Vec<Option<Arc<dyn PhysicalExpr>>>;
Just one more question -- how would evaluation be done for PartitionedFilters?
My understanding is that each partition would need to first access its corresponding PhysicalExpr and then call evaluate() right? However, the evaluate() trait of PhysicalExpr has no partition number in the args, so evaluate() can't directly integrate PartitionedFilters.
The current evaluate() function remains the same and evaluates inner.expr, which, when we preserve file partitioning, holds nothing (just lit(true) placeholder).
Snapshotting happens before evaluation
The full path in chronological order is:
- ParquetOpener::open() is called
- snapshot_physical_expr_for_partition(predicate, partition_index) is called -> important to note that we pass the index
- snapshot_physical_expr_for_partition replaces the DynamicFilterPhysicalExpr with the filter for the partition at that index (this is a physical expr)
- evaluate() is called, which uses the snapshot expr (not the DynamicFilterPhysicalExpr); we don't need to know the partition parameter because it was already dealt with earlier
For the lit(true) concern: if has_partitioned_filters() returns false during snapshotting, we fall back to lit(true), which means we won't filter anything. But this is acceptable behavior, because it's better to do this than to error.
This shouldn't happen, though, because we wait until the build side is complete before we snapshot, so it should always resolve to true.
Maybe to be safe it would be good to add a debug statement if has_partitioned_filters() returns false.
Lmk if this makes sense 🙂
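The chronological flow above can be sketched as a toy model. Assumptions: `(i64, i64)` range bounds stand in for the snapshotted `PhysicalExpr`, and every name here is illustrative rather than DataFusion's actual API:

```rust
/// Per-partition [min, max] bounds published by the build side.
type PartitionBounds = Vec<Option<(i64, i64)>>;

/// Steps 2-3: the opener snapshots the filter for its own partition index.
/// (Simplified: the real code distinguishes an empty build partition from an
/// out-of-range index; here both yield `None` and we fail open.)
fn snapshot_for_partition(bounds: &PartitionBounds, idx: usize) -> Option<(i64, i64)> {
    bounds.get(idx).copied().flatten()
}

/// Step 4: evaluate() runs against the snapshot. No partition index is needed
/// here, because it was already bound during snapshotting.
fn evaluate(snapshot: Option<(i64, i64)>, value: i64) -> bool {
    match snapshot {
        Some((lo, hi)) => lo <= value && value <= hi,
        None => true, // lit(true)-style fallback: keep the row rather than error
    }
}

fn main() {
    let bounds: PartitionBounds = vec![Some((1, 3)), Some((10, 20))];
    let snap = snapshot_for_partition(&bounds, 0); // partition 0's opener
    println!("row with a=2 kept: {}", evaluate(snap, 2));
}
```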
Which issue does this PR close?
Closes #20195
Rationale for this change
Dynamic filter pushdown was completely disabled when preserve_file_partitions is on, due to a correctness bug.

The Problem

When preserve_file_partitions is enabled, DataFusion treats file groups as pre-partitioned data. Existing dynamic filtering used hash-based routing, which is incompatible with the value-based partitioning that file groups are kept in:

Example:

For this reason it was disabled; this PR re-enables it via PartitionIndex routing for dynamic filters.

What changes are included in this PR?
Partition-Indexed dynamic filtering
New routing mode that uses direct partition-to-partition mapping:
Example:
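A minimal sketch of the difference between the two routing modes (illustrative names only, not DataFusion's API):

```rust
/// CASE/hash routing: a probe row consults the filter of whichever partition
/// its join-key hash lands in.
fn hash_route(key_hash: u64, n_partitions: usize) -> usize {
    (key_hash % n_partitions as u64) as usize
}

/// PartitionIndex routing: probe partition i reads the filter built by build
/// partition i directly. This is only safe when the build and probe sides'
/// file groups are aligned one-to-one.
fn index_route(probe_partition: usize) -> usize {
    probe_partition
}

fn main() {
    // With index routing, partition 2's scan uses partition 2's filter;
    // no hashing of row values is involved:
    println!("index routing: partition 2 -> filter {}", index_route(2));
    println!("hash routing: hash=7, 3 partitions -> filter {}", hash_route(7, 3));
}
```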
Alignment Detection
Detects compatible partitioning to enable safe optimization:
match (left.repartitioned, right.repartitioned) {
If there is a RepartitionExec in the path leading from the DataSourceExec to either the build or probe side of a Partitioned Hash Join, we fall back to CaseHash.

The reason is that RepartitionExec uses hash(value) % N to distribute rows, breaking the value-based partition alignment. When hash-partitioned, partition 0 no longer contains 'A' exclusively, breaking the partition index assumptions. With hash partitioning, use:
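The misalignment can be demonstrated with a toy example. Assumption: std's `DefaultHasher` stands in for DataFusion's actual join-key hash; the point is only that a hashed key's destination partition is unrelated to the file group it came from:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// What RepartitionExec conceptually does: route each row by hash(value) % N.
fn hash_partition(key: &str, n: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() % n as u64) as usize
}

fn main() {
    // A file-grouped layout keeps each key in one known partition by
    // construction (say 'A' -> 0, 'B' -> 1, 'C' -> 2). After a hash
    // repartition the same keys land wherever the hash sends them, so
    // partition 0 no longer holds 'A' exclusively:
    for key in ["A", "B", "C"] {
        println!("{key} -> partition {}", hash_partition(key, 3));
    }
}
```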
Are these changes tested?
sqllogictests: test_files/preserve_file_partitioning.slt
Integration tests: datafusion/core/tests/physical_optimizer/filter_pushdown.rs
Unit tests: in affected files
Are there any user-facing changes?
Yes, a new error message can appear if partitioned hash joins are not aligned properly, and the dynamic filtering display for partition index is a bit different than CASE routing.
cc: @NGA-TRAN @LiaCastaneda @adriangb @gabotechs