Refactor InListExpr to support structs by re-using existing hashing infrastructure #18449

adriangb · 2025-11-02T18:53:27Z

Background

This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in #17171.

A "target state" is tracked in #18393.
There is a series of PRs to get us to this target state in smaller more reviewable changes that are still valuable on their own:

Refactor create_hashes to accept array references #18448
(This PR): Refactor InListExpr to support structs by re-using existing hashing infrastructure #18449 (depends on Refactor create_hashes to accept array references #18448)
Refactor state management in HashJoinExec and use CASE expressions for more precise filters #18451

Changes in this PR

Enhance InListExpr to efficiently store homogeneous lists as arrays and avoid a conversion to Vec
by adding an internal InListStorage enum with Array and Exprs variants
Re-use existing hashing and comparison utilities to support Struct arrays and other complex types
Add public function in_list_from_array(expr, list_array, negated) for creating InList from arrays

Although the diff looks large most of it is actually tests and docs. I think the actual code change is a negative LOC change, or at least negative complexity (eliminates a trait, a macro, matching on data types).

adriangb · 2025-11-02T18:54:08Z

datafusion/proto/src/physical_plan/to_proto.rs

+                    // TODO: serialize the inner ArrayRef directly to avoid materialization into literals
+                    // by extending the protobuf definition to support both representations and adding a public
+                    // accessor method to InListExpr to get the inner ArrayRef


I'll create a followup issue once we merge this

adriangb · 2025-11-02T18:54:38Z

datafusion/sqllogictest/test_files/array.slt

 05)--------ProjectionExec: expr=[]
 06)----------CoalesceBatchesExec: target_batch_size=8192
-07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([7f4b18de3cfeb9b4ac78c381ee2ad278, a, b, c])
+07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN (SET) ([7f4b18de3cfeb9b4ac78c381ee2ad278, a, b, c])


This is because we now support Utf8View for building the sets 😄

adriangb · 2025-11-02T18:56:22Z

datafusion/physical-expr/src/expressions/in_list.rs

+                let random_state = RandomState::with_seed(0);
+                let mut hashes_buf = vec![0u64; array.len()];
+                let Ok(hashes_buf) = create_hashes_from_arrays(
+                    &[array.as_ref()],
+                    &random_state,
+                    &mut hashes_buf,
+                ) else {
+                    unreachable!("Failed to create hashes for InList array. This shouldn't happen because make_set should have succeeded earlier.");
+                };
+                hashes_buf.hash(state);


We could pre-compute and store a hash: u64 which would be both more performant when Hash is called and avoid this error, but it would add more complexity and some overhead when building the InListExpr

## Background This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in apache#17171. A "target state" is tracked in apache#18393. There is a series of PRs to get us to this target state in smaller more reviewable changes that are still valuable on their own: - (This PR): apache#18448 - apache#18449 (depends on apache#18448) - apache#18451 ## Changes in this PR Change create_hashes and related functions to work with &dyn Array references instead of requiring ArrayRef (Arc-wrapped arrays). This avoids unnecessary Arc::clone() calls and enables calls that only have an &dyn Array to use the hashing utilities. - Add create_hashes_from_arrays(&[&dyn Array]) function - Refactor hash_dictionary, hash_list_array, hash_fixed_list_array to use references instead of cloning - Extract hash_single_array() helper for common logic --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

adriangb · 2025-11-07T19:10:52Z

datafusion/physical-expr/src/utils/guarantee.rs

    /// supported. Returns None otherwise. See [`LiteralGuarantee::analyze`] to
    /// create these structures from an predicate (boolean expression).
-    fn new<'a>(
+    fn new(


I think it's worth discussing in this review how far we propagate the changes.

In particular, InListExpr will now have two operations modes:

Was built with an ArrayRef or was able to build an ArrayRef from a homogeneously typed Vec<Arc<dyn PhysicalExpr>> which are all literals.

Was built with a Vec<Arc<dyn PhysicalExpr>> which are not literals or homogeneously typed.

If we restrict LiteralGuarantee to only operate on the first cases, I think we could lift out a lot of computation: instead of transforming ArrayRef -> Vec<Arc<dyn PhysicalExpr>> -> Vec<ScalarValue> -> HashSet<ScalarValue> which then gets fed into bloom filters which are per-column and don't really support heterogenous ScalarValues we could re-use the already deduplicated ArraySet that InListExpr has internally or something. The ultimate thing to do, but that would require even more work and changes, would be to make PruningPredicate::contains accept an enum ArrayOrScalars { Array(ArrayRef), Scalars(Vec<ScalarValue>) } so that we can push down and iterate over the Arc'ed ArrayRef the whole way down. I think we could make this backwards compatible.

I think that change is worth it, but it requires a bit more coordination (with arrow-rs) and a bigger change.

The end result would be that:

When you create an InListExpr operates in mode (1) we are able to push down into bloom filters with no data copies at all.

When the InListExpr operates in mode (2) we'd bail on the pushdown early (e.g. list() -> Option<ArrayRef>) and avoid building the HashSet<ScalarValue>, etc. that won't be used.

Wdyt @alamb ?

Okay I've looked into this and it is entirely possible, I think we should do it.
Basically the status quo currently is that we always try to build an ArrayHashSet which is only possible if we can convert the Vec<ScalarValue> into an ArrayRef.

At that point the only reason to store the Vec<SclarValue> is to later pass it into PruningPredicate -> bloom filters and LiteralGuarantee. If we can refactor those two to also handle an ArrayRef we could probably avoid a lot of cloning and make things more efficient by using arrays. I don't even think we need to support Vec<ScalarValue> at all: the only reason to have that is if you could not build a homogeneously typed array, and if that is the case you probably can't do any sort of pushdown into a bloom filter. E.g. select variant_get(col, 'abc') in (1, 2.0, 'c') might make sense and work but I don't think we could ever push that down into a bloom filter. So InListExpr needs to operate on both but I don't think the pruning machinery does.

So anyway I think I may try to reduce this change to only be about using create_hashes and ignore any inefficiencies as a TODO for a followup issue. At the end of the day if we can make HashJoinExec faster even if that's with some inefficiencies I think that's okay and we can improve more later.

I'll record a preview of some of the changes I had made to explore this (by no means ready) just for future reference: https://github.com/pydantic/datafusion/compare/refactor-in-list...pydantic:datafusion:use-array-in-pruning?expand=1

adriangb · 2025-11-09T07:01:03Z

datafusion/physical-expr/src/expressions/in_list.rs

-pub trait Set: Send + Sync {
-    fn contains(&self, v: &dyn Array, negated: bool) -> Result<BooleanArray>;
-    fn has_nulls(&self) -> bool;
-}


We get rid of the Set trait. The only implementer was ArraySet

adriangb · 2025-11-09T07:03:07Z

datafusion/physical-expr/src/expressions/in_list.rs

-        array => Arc::new(ArraySet::new(array, make_hash_set(array))),
-        DataType::Boolean => {
-            let array = as_boolean_array(array)?;
-            Arc::new(ArraySet::new(array, make_hash_set(array)))
-        },


We get rid of this type matching logic

adriangb · 2025-11-09T07:03:32Z

datafusion/physical-expr/src/expressions/in_list.rs

-trait IsEqual: HashValue {
-    fn is_equal(&self, other: &Self) -> bool;
-}
-
-impl<T: IsEqual + ?Sized> IsEqual for &T {
-    fn is_equal(&self, other: &Self) -> bool {
-        T::is_equal(self, other)
-    }
-}


We get rid of these custom equality / hash traits

alamb · 2025-11-09T16:26:11Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing refactor-in-list (f412ead) to 1d8bc9b diff
BENCH_NAME=in_list
BENCH_COMMAND=cargo bench --bench in_list
BENCH_FILTER=
BENCH_BRANCH_NAME=refactor-in-list
Results will be posted here when complete

alamb

Thanks @adriangb

I looked through the code and the basic idea makes a lot of sense to me 👍

I kicked off some benchmarks to see what impact, if any, this change has on performance. Assuming it is the same or better, I think it would be good to merge

I do suggest adding some slt level logic for struct IN lists if we don't already have some, but I don't think it is necessary

alamb · 2025-11-09T16:34:15Z

datafusion/physical-expr/src/expressions/in_list.rs

-                        false => Some(negated),
-                    }
-                })
+        let mut hashes_buf = vec![0u64; v.len()];


As a follow on PR we could potentially look into reusing this hashes_buf -- aka rather than reallocating each invocations of contains instead make a field (probably needs to be a Mutex or something) that is a Vec

alamb · 2025-11-09T16:36:56Z

datafusion/physical-expr/src/expressions/in_list.rs

-                })
+        let mut hashes_buf = vec![0u64; v.len()];
+        create_hashes([v], &self.state, &mut hashes_buf)?;
+        let cmp = make_comparator(v, in_array, SortOptions::default())?;


the comparator is some dynamic function -- the overhead of using the dynamic dispatch in the critical path may be substantial).

If it turns out to be too slow, we can potentially create specializations for comparisons (aka make a speicalized hash set for the different physical array types, and fall back to the dynamic comparator)

alamb · 2025-11-09T16:38:12Z

datafusion/physical-expr/src/expressions/in_list.rs

+///
+/// The `list` field will be empty when using this constructor, as the array is stored
+/// directly in the static filter.
+pub fn in_list_from_array(


I wonder if it would be more discoverable if this was a method on InList rather than a free function

Something like

impl InLIst fn new_from_array( expr: Arc<dyn PhysicalExpr>, array: ArrayRef, negated: bool, ) -> Result<Self> { ... }

Yeah I agree, I was just following the existing patterns

alamb · 2025-11-09T16:38:37Z

datafusion/physical-expr/src/expressions/in_list.rs

    }
+
+    #[test]
+    fn in_list_struct() -> Result<()> {


Can we also please add some .slt level tests for IN on a set?

alamb · 2025-11-09T16:45:35Z

🤖: Benchmark completed

Details

group                                       main                                   refactor-in-list
-----                                       ----                                   ----------------
in_list_f32 (1024, 0) IN (1, 0)             1.00      4.3±0.01µs        ? ?/sec    1.14      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (10, 0)            1.00      4.3±0.01µs        ? ?/sec    1.14      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (100, 0)           1.00      4.2±0.03µs        ? ?/sec    1.14      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (3, 0)             1.00      4.2±0.01µs        ? ?/sec    1.15      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (1, 0)           1.00      5.5±0.06µs        ? ?/sec    1.36      7.5±0.03µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (10, 0)          1.00      5.5±0.06µs        ? ?/sec    1.40      7.7±0.07µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (100, 0)         1.00      5.6±0.03µs        ? ?/sec    1.23      6.9±0.06µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (3, 0)           1.00      5.5±0.06µs        ? ?/sec    1.38      7.6±0.03µs        ? ?/sec
in_list_i32 (1024, 0) IN (1, 0)             1.00      4.2±0.01µs        ? ?/sec    1.16      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0) IN (10, 0)            1.00      4.2±0.01µs        ? ?/sec    1.16      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0) IN (100, 0)           1.00      4.2±0.01µs        ? ?/sec    1.16      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0) IN (3, 0)             1.00      4.2±0.01µs        ? ?/sec    1.16      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (1, 0)           1.00      5.7±0.02µs        ? ?/sec    1.31      7.5±0.04µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (10, 0)          1.00      5.6±0.01µs        ? ?/sec    1.29      7.2±0.04µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (100, 0)         1.00      5.5±0.03µs        ? ?/sec    1.22      6.7±0.04µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (3, 0)           1.00      5.7±0.04µs        ? ?/sec    1.31      7.5±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (1, 0)        1.09      5.5±0.02µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (10, 0)       1.09      5.5±0.01µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (100, 0)      1.05      5.5±0.05µs        ? ?/sec    1.00      5.3±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (3, 0)        1.13      5.7±0.06µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (1, 0)      1.00      7.6±0.04µs        ? ?/sec    1.05      8.0±0.07µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (10, 0)     1.00      7.8±0.04µs        ? ?/sec    1.01      7.8±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (100, 0)    1.03      8.0±0.09µs        ? ?/sec    1.00      7.8±0.07µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (3, 0)      1.00      7.7±0.04µs        ? ?/sec    1.06      8.1±0.06µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (1, 0)        1.10      5.6±0.19µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (10, 0)       1.09      5.5±0.01µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (100, 0)      1.09      5.5±0.01µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (3, 0)        1.04      5.5±0.01µs        ? ?/sec    1.00      5.3±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (1, 0)      1.00      7.9±0.03µs        ? ?/sec    1.01      8.0±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (10, 0)     1.00      7.9±0.10µs        ? ?/sec    1.00      7.9±0.06µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (100, 0)    1.00      7.8±0.03µs        ? ?/sec    1.00      7.8±0.05µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (3, 0)      1.02      8.2±0.08µs        ? ?/sec    1.00      8.0±0.05µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (1, 0)         1.12      5.7±0.05µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (10, 0)        1.09      5.5±0.01µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (100, 0)       1.09      5.5±0.01µs        ? ?/sec    1.00      5.0±0.02µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (3, 0)         1.09      5.5±0.02µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (1, 0)       1.00      7.6±0.02µs        ? ?/sec    1.03      7.8±0.08µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (10, 0)      1.00      7.8±0.05µs        ? ?/sec    1.00      7.8±0.03µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (100, 0)     1.00      7.7±0.03µs        ? ?/sec    1.00      7.7±0.06µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (3, 0)       1.04      8.0±0.04µs        ? ?/sec    1.00      7.7±0.05µs        ? ?/sec

adriangb · 2025-11-09T17:51:44Z

It looks like there are indeed some regressions. I propose we do two things:

Add a create_hashes_unbuffered(…) -> &[u64] that uses a thread local to re-use the buffer. I think this will be helpful in other contexts as well.
Create a make_typed_comparator that returns an enum that is typed for non-recursive types and delegates to a fallback dynamically typed variant for recursive types. I’ll implement it here for now but make a note that it would be good to upstream into arrow. When it is up streamed into arrow we can re-implement the current version in terms of their new version and deprecate the current function.

I think that will get us the broader type support and code re-use while avoiding any slowdown. Once we do the upstreaming into arrow it won’t even be any more code than it is now (a bit more code in arrow but not even that much). And we should be able to do it all in one PR here

adriangb · 2025-11-14T23:28:56Z

I removed the enum comparator, benchmarks showed it was slower than the dynamic dispatch version. The thread local hashing / buffer re-use seems to be a big win though.

Although this is +1.7k LOC ~1.5k of those are new tests / docstrings on existing functions. The actual change is closer to ~500LOC, and that includes the new hash_utils.rs stuff that will be used in other places as well.

@alamb could you kick off benchmarks again? If they look good are we good to merge this?

alamb · 2025-11-17T14:09:14Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing refactor-in-list (a5afb96) to 0cfc1fe diff
BENCH_NAME=in_list
BENCH_COMMAND=cargo bench --bench in_list
BENCH_FILTER=
BENCH_BRANCH_NAME=refactor-in-list
Results will be posted here when complete

alamb · 2025-11-17T14:28:46Z

🤖: Benchmark completed

Details

group                                       main                                   refactor-in-list
-----                                       ----                                   ----------------
in_list_f32 (1024, 0) IN (1, 0)             1.01      5.0±0.03µs        ? ?/sec    1.00      4.9±0.03µs        ? ?/sec
in_list_f32 (1024, 0) IN (10, 0)            1.02      5.0±0.02µs        ? ?/sec    1.00      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (100, 0)           1.02      5.0±0.02µs        ? ?/sec    1.00      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (3, 0)             1.02      5.0±0.03µs        ? ?/sec    1.00      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (1, 0)           1.00      5.8±0.02µs        ? ?/sec    1.25      7.2±0.04µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (10, 0)          1.00      5.6±0.02µs        ? ?/sec    1.29      7.3±0.05µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (100, 0)         1.00      5.5±0.03µs        ? ?/sec    1.32      7.3±0.03µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (3, 0)           1.00      5.8±0.02µs        ? ?/sec    1.25      7.2±0.04µs        ? ?/sec
in_list_i32 (1024, 0) IN (1, 0)             1.00      4.3±0.01µs        ? ?/sec    1.13      4.9±0.04µs        ? ?/sec
in_list_i32 (1024, 0) IN (10, 0)            1.00      4.2±0.06µs        ? ?/sec    1.17      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0) IN (100, 0)           1.00      4.2±0.01µs        ? ?/sec    1.17      4.9±0.05µs        ? ?/sec
in_list_i32 (1024, 0) IN (3, 0)             1.00      4.2±0.01µs        ? ?/sec    1.17      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (1, 0)           1.00      5.8±0.03µs        ? ?/sec    1.22      7.1±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (10, 0)          1.00      5.6±0.02µs        ? ?/sec    1.23      6.9±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (100, 0)         1.00      5.6±0.02µs        ? ?/sec    1.23      6.9±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (3, 0)           1.00      5.6±0.02µs        ? ?/sec    1.26      7.1±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (1, 0)        1.25      5.5±0.02µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (10, 0)       1.25      5.5±0.01µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (100, 0)      1.25      5.5±0.02µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (3, 0)        1.27      5.6±0.01µs        ? ?/sec    1.00      4.4±0.00µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (1, 0)      1.23      7.7±0.03µs        ? ?/sec    1.00      6.2±0.05µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (10, 0)     1.25      7.8±0.02µs        ? ?/sec    1.00      6.3±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (100, 0)    1.27      7.9±0.08µs        ? ?/sec    1.00      6.2±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (3, 0)      1.23      7.8±0.03µs        ? ?/sec    1.00      6.3±0.05µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (1, 0)        1.25      5.5±0.01µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (10, 0)       1.25      5.5±0.01µs        ? ?/sec    1.00      4.4±0.04µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (100, 0)      1.25      5.5±0.04µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (3, 0)        1.26      5.8±0.06µs        ? ?/sec    1.00      4.6±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (1, 0)      1.29      8.1±0.03µs        ? ?/sec    1.00      6.3±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (10, 0)     1.26      7.8±0.02µs        ? ?/sec    1.00      6.2±0.02µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (100, 0)    1.25      7.8±0.07µs        ? ?/sec    1.00      6.3±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (3, 0)      1.26      8.0±0.04µs        ? ?/sec    1.00      6.4±0.03µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (1, 0)         1.27      5.7±0.01µs        ? ?/sec    1.00      4.5±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (10, 0)        1.25      5.5±0.01µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (100, 0)       1.25      5.5±0.05µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (3, 0)         1.24      5.7±0.01µs        ? ?/sec    1.00      4.6±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (1, 0)       1.24      7.7±0.02µs        ? ?/sec    1.00      6.2±0.04µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (10, 0)      1.23      7.7±0.02µs        ? ?/sec    1.00      6.2±0.04µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (100, 0)     1.24      7.8±0.19µs        ? ?/sec    1.00      6.3±0.04µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (3, 0)       1.23      7.9±0.03µs        ? ?/sec    1.00      6.4±0.05µs        ? ?/sec

alamb · 2025-11-17T14:28:50Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing refactor-in-list (a5afb96) to 0cfc1fe diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb

Thank you @adriangb and @davidhewitt

As long as this PR shows no performance regressions I think it is good to go

I am actually pretty surprised we can get away using the comparator compared to using eq which is typically much faster. I have an idea of how we can potentially have the best of both worlds (full type support as well as faster native implementation).

Let me see if I can bang out something

alamb · 2025-11-14T21:13:22Z

datafusion/sqllogictest/test_files/expr.slt

 NULL

+# ========================================================================
+# Comprehensive IN LIST tests with NULL handling


I always why @claude and similar bots always insist the code is "comprehensive" 😆

alamb · 2025-11-17T14:14:26Z

datafusion/common/src/hash_utils.rs

    random_state: &RandomState,
-    hashes_buffer: &'a mut Vec<u64>,
-) -> Result<&'a mut Vec<u64>>
+    hashes_buffer: &'a mut [u64],


I double checked that the code already asserts that the hashes_buffer and arrays are the same length (aka doesn't actually use the fact this is a Vec to grow the allocation)

alamb · 2025-11-17T14:16:13Z

datafusion/common/src/hash_utils.rs

+    }
+
+    #[test]
+    fn test_with_hashes_reentrancy() {


Can you please add a test / verify the truncate / shrink to fit behavior ? I think that is probably important

alamb · 2025-11-17T14:20:04Z

datafusion/sqllogictest/test_files/tpch/plans/q19.slt.part

 04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]
 05)--------CoalesceBatchesExec: target_batch_size=8192
-06)----------HashJoinExec: mode=Partitioned, join_type=Inner, on=[(l_partkey@0, p_partkey@0)], filter=p_brand@1 = Brand#12 AND p_container@3 IN ([SM CASE, SM BOX, SM PACK, SM PKG]) AND l_quantity@0 >= Some(100),15,2 AND l_quantity@0 <= Some(1100),15,2 AND p_size@2 <= 5 OR p_brand@1 = Brand#23 AND p_container@3 IN ([MED BAG, MED BOX, MED PKG, MED PACK]) AND l_quantity@0 >= Some(1000),15,2 AND l_quantity@0 <= Some(2000),15,2 AND p_size@2 <= 10 OR p_brand@1 = Brand#34 AND p_container@3 IN ([LG CASE, LG BOX, LG PACK, LG PKG]) AND l_quantity@0 >= Some(2000),15,2 AND l_quantity@0 <= Some(3000),15,2 AND p_size@2 <= 15, projection=[l_extendedprice@2, l_discount@3]
+06)----------HashJoinExec: mode=Partitioned, join_type=Inner, on=[(l_partkey@0, p_partkey@0)], filter=p_brand@1 = Brand#12 AND p_container@3 IN (SET) ([SM CASE, SM BOX, SM PACK, SM PKG]) AND l_quantity@0 >= Some(100),15,2 AND l_quantity@0 <= Some(1100),15,2 AND p_size@2 <= 5 OR p_brand@1 = Brand#23 AND p_container@3 IN (SET) ([MED BAG, MED BOX, MED PKG, MED PACK]) AND l_quantity@0 >= Some(1000),15,2 AND l_quantity@0 <= Some(2000),15,2 AND p_size@2 <= 10 OR p_brand@1 = Brand#34 AND p_container@3 IN (SET) ([LG CASE, LG BOX, LG PACK, LG PKG]) AND l_quantity@0 >= Some(2000),15,2 AND l_quantity@0 <= Some(3000),15,2 AND p_size@2 <= 15, projection=[l_extendedprice@2, l_discount@3]


Why did these queries start being able to use the pre-calculated set? Is it because InList didn't have a special case for Utf8View before?

Yep exactly

alamb · 2025-11-17T14:22:00Z

datafusion/physical-expr/src/expressions/in_list.rs

                        .map
                        .raw_entry()
-                        .from_hash(hash, |idx| in_array.value(*idx).is_equal(&v))
+                        .from_hash(hash, |idx| cmp(i, *idx).is_eq())


If we ever need to make this faster, we could potentially add specializations for different primitive types, and still fall back to the dynamic comparator

I tried that. It seemed slower once the enum got large enough.

I mean specialize the entire thing (including hash table) - so that you pay the dispatch once (either at InLIstExpr creation time or maybe once per batch), rather than on each row

Another problem with the dyn comparator approach is that it prevents inlining/vectorization

Here is one way to specialize the hashset:

WIP: Hack out specialized Int32 static filter pydantic/datafusion#45
(I haven't fully worked out the generics yet)

alamb · 2025-11-17T14:26:07Z

datafusion/physical-expr/src/expressions/in_list.rs

+                            // SQL three-valued logic: null IN (...) is always null
+                            // The code below would handle this correctly but this is a faster path
+                            return Ok(ColumnarValue::Array(Arc::new(
+                                BooleanArray::from(vec![None; num_rows]),


This can probably be made faster by bypassing the Vec entirely -- perhaps via https://docs.rs/arrow/latest/arrow/array/struct.BooleanBufferBuilder.html

not necessary, I am just pointing it out

alamb · 2025-11-17T14:28:37Z

datafusion/physical-expr/src/expressions/in_list.rs

+                                    BooleanArray::from(vec![None; num_rows])
+                                } else {
+                                    // Convert scalar to 1-element array
+                                    let array = scalar.to_array()?;


I am really surprised using this comparator does not cause a performance_regression compared to using eq

alamb · 2025-11-17T14:35:29Z

datafusion/physical-expr/src/expressions/in_list.rs

 use hashbrown::hash_map::RawEntryMut;

+/// Static filter for InList that stores the array and hash set for O(1) lookups
+#[derive(Debug, Clone)]


What is the reason to pull StaticFilter out from ArrayHashSet? It took me a little bit to grok that the fields in ArrayHashSet refer to StaticFilter

In other words, why not something like

/// Static filter for InList that stores the array and hash set for O(1) lookups #[derive(Debug, Clone)] struct StaticFilter { in_array: ArrayRef, state: RandomState, /// Used to provide a lookup from value to in list index /// /// Note: usize::hash is not used, instead the raw entry /// API is used to store entries w.r.t their value map: HashMap<usize, (), ()>, }

Update -- I tried this and it seems to work great

Consolidate StaticFilter and ArrayHashSet pydantic/datafusion#44

alamb · 2025-11-17T14:49:46Z

Looks like in_list is still slower for i32 - #18449 (comment)

I have an idea of how to fix this (use eq rather than cmp)

in_list_f32 (1024, 0) IN (1, 0)             1.01      5.0±0.03µs        ? ?/sec    1.00      4.9±0.03µs        ? ?/sec
in_list_f32 (1024, 0) IN (10, 0)            1.02      5.0±0.02µs        ? ?/sec    1.00      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (100, 0)           1.02      5.0±0.02µs        ? ?/sec    1.00      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (3, 0)             1.02      5.0±0.03µs        ? ?/sec    1.00      4.9±0.01µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (1, 0)           1.00      5.8±0.02µs        ? ?/sec    1.25      7.2±0.04µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (10, 0)          1.00      5.6±0.02µs        ? ?/sec    1.29      7.3±0.05µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (100, 0)         1.00      5.5±0.03µs        ? ?/sec    1.32      7.3±0.03µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (3, 0)           1.00      5.8±0.02µs        ? ?/sec    1.25      7.2±0.04µs        ? ?/sec
in_list_i32 (1024, 0) IN (1, 0)             1.00      4.3±0.01µs        ? ?/sec    1.13      4.9±0.04µs        ? ?/sec
in_list_i32 (1024, 0) IN (10, 0)            1.00      4.2±0.06µs        ? ?/sec    1.17      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0) IN (100, 0)           1.00      4.2±0.01µs        ? ?/sec    1.17      4.9±0.05µs        ? ?/sec
in_list_i32 (1024, 0) IN (3, 0)             1.00      4.2±0.01µs        ? ?/sec    1.17      4.9±0.01µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (1, 0)           1.00      5.8±0.03µs        ? ?/sec    1.22      7.1±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (10, 0)          1.00      5.6±0.02µs        ? ?/sec    1.23      6.9±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (100, 0)         1.00      5.6±0.02µs        ? ?/sec    1.23      6.9±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (3, 0)           1.00      5.6±0.02µs        ? ?/sec    1.26      7.1±0.03µs        ? ?/sec

alamb · 2025-11-17T15:27:18Z

🤖: Benchmark completed

Details

Comparing HEAD and refactor-in-list
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ refactor-in-list ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2766.30 ms │       2629.46 ms │     no change │
│ QQuery 1     │  1268.45 ms │       1177.20 ms │ +1.08x faster │
│ QQuery 2     │  2471.28 ms │       2270.86 ms │ +1.09x faster │
│ QQuery 3     │  1144.13 ms │       1174.55 ms │     no change │
│ QQuery 4     │  2329.83 ms │       2293.78 ms │     no change │
│ QQuery 5     │ 28751.61 ms │      28171.43 ms │     no change │
│ QQuery 6     │  4257.52 ms │       4234.04 ms │     no change │
│ QQuery 7     │  3536.81 ms │       3464.20 ms │     no change │
└──────────────┴─────────────┴──────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)               │ 46525.92ms │
│ Total Time (refactor-in-list)   │ 45415.52ms │
│ Average Time (HEAD)             │  5815.74ms │
│ Average Time (refactor-in-list) │  5676.94ms │
│ Queries Faster                  │          2 │
│ Queries Slower                  │          0 │
│ Queries with No Change          │          6 │
│ Queries with Failure            │          0 │
└─────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ refactor-in-list ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.23 ms │          2.25 ms │     no change │
│ QQuery 1     │    49.03 ms │         48.26 ms │     no change │
│ QQuery 2     │   134.60 ms │        138.54 ms │     no change │
│ QQuery 3     │   165.46 ms │        165.79 ms │     no change │
│ QQuery 4     │  1160.92 ms │       1087.40 ms │ +1.07x faster │
│ QQuery 5     │  1520.41 ms │       1495.83 ms │     no change │
│ QQuery 6     │     2.12 ms │          2.22 ms │     no change │
│ QQuery 7     │    55.95 ms │         54.46 ms │     no change │
│ QQuery 8     │  1532.22 ms │       1460.72 ms │     no change │
│ QQuery 9     │  1963.05 ms │       1881.90 ms │     no change │
│ QQuery 10    │   382.36 ms │        370.78 ms │     no change │
│ QQuery 11    │   442.59 ms │        416.38 ms │ +1.06x faster │
│ QQuery 12    │  1445.70 ms │       1367.13 ms │ +1.06x faster │
│ QQuery 13    │  2158.36 ms │       2138.72 ms │     no change │
│ QQuery 14    │  1333.73 ms │       1275.81 ms │     no change │
│ QQuery 15    │  1282.35 ms │       1250.69 ms │     no change │
│ QQuery 16    │  2737.67 ms │       2676.60 ms │     no change │
│ QQuery 17    │  2702.85 ms │       2658.52 ms │     no change │
│ QQuery 18    │  4974.38 ms │       4951.36 ms │     no change │
│ QQuery 19    │   126.49 ms │        127.48 ms │     no change │
│ QQuery 20    │  1989.58 ms │       1994.69 ms │     no change │
│ QQuery 21    │  2310.22 ms │       2323.70 ms │     no change │
│ QQuery 22    │  3952.57 ms │       3912.50 ms │     no change │
│ QQuery 23    │ 13102.42 ms │      12808.44 ms │     no change │
│ QQuery 24    │   208.71 ms │        213.30 ms │     no change │
│ QQuery 25    │   479.41 ms │        465.40 ms │     no change │
│ QQuery 26    │   224.71 ms │        211.83 ms │ +1.06x faster │
│ QQuery 27    │  2826.26 ms │       2904.18 ms │     no change │
│ QQuery 28    │ 23603.63 ms │      23409.01 ms │     no change │
│ QQuery 29    │   998.95 ms │        996.68 ms │     no change │
│ QQuery 30    │  1363.07 ms │       1328.10 ms │     no change │
│ QQuery 31    │  1368.77 ms │       1391.62 ms │     no change │
│ QQuery 32    │  4716.65 ms │       4955.70 ms │  1.05x slower │
│ QQuery 33    │  5899.01 ms │       5757.21 ms │     no change │
│ QQuery 34    │  6144.53 ms │       6163.35 ms │     no change │
│ QQuery 35    │  2106.23 ms │       2105.76 ms │     no change │
│ QQuery 36    │   122.66 ms │        122.12 ms │     no change │
│ QQuery 37    │    52.87 ms │         53.31 ms │     no change │
│ QQuery 38    │   123.47 ms │        121.92 ms │     no change │
│ QQuery 39    │   197.02 ms │        197.71 ms │     no change │
│ QQuery 40    │    44.09 ms │         39.98 ms │ +1.10x faster │
│ QQuery 41    │    39.85 ms │         39.30 ms │     no change │
│ QQuery 42    │    31.80 ms │         33.55 ms │  1.06x slower │
└──────────────┴─────────────┴──────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)               │ 96078.96ms │
│ Total Time (refactor-in-list)   │ 95120.19ms │
│ Average Time (HEAD)             │  2234.39ms │
│ Average Time (refactor-in-list) │  2212.10ms │
│ Queries Faster                  │          5 │
│ Queries Slower                  │          2 │
│ Queries with No Change          │         36 │
│ Queries with Failure            │          0 │
└─────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ refactor-in-list ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 127.65 ms │        141.22 ms │  1.11x slower │
│ QQuery 2     │  28.65 ms │         28.65 ms │     no change │
│ QQuery 3     │  34.62 ms │         38.53 ms │  1.11x slower │
│ QQuery 4     │  29.26 ms │         29.30 ms │     no change │
│ QQuery 5     │  88.85 ms │         87.01 ms │     no change │
│ QQuery 6     │  19.70 ms │         19.47 ms │     no change │
│ QQuery 7     │ 226.03 ms │        217.56 ms │     no change │
│ QQuery 8     │  32.19 ms │         35.82 ms │  1.11x slower │
│ QQuery 9     │ 104.91 ms │         96.64 ms │ +1.09x faster │
│ QQuery 10    │  65.77 ms │         64.33 ms │     no change │
│ QQuery 11    │  18.06 ms │         19.35 ms │  1.07x slower │
│ QQuery 12    │  51.43 ms │         52.71 ms │     no change │
│ QQuery 13    │  48.17 ms │         47.95 ms │     no change │
│ QQuery 14    │  14.37 ms │         13.97 ms │     no change │
│ QQuery 15    │  24.90 ms │         24.99 ms │     no change │
│ QQuery 16    │  25.34 ms │         24.71 ms │     no change │
│ QQuery 17    │ 158.48 ms │        150.56 ms │     no change │
│ QQuery 18    │ 275.48 ms │        275.95 ms │     no change │
│ QQuery 19    │  39.09 ms │         38.24 ms │     no change │
│ QQuery 20    │  50.06 ms │         49.31 ms │     no change │
│ QQuery 21    │ 325.04 ms │        310.59 ms │     no change │
│ QQuery 22    │  20.94 ms │         17.77 ms │ +1.18x faster │
└──────────────┴───────────┴──────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary               ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)               │ 1808.98ms │
│ Total Time (refactor-in-list)   │ 1784.63ms │
│ Average Time (HEAD)             │   82.23ms │
│ Average Time (refactor-in-list) │   81.12ms │
│ Queries Faster                  │         2 │
│ Queries Slower                  │         4 │
│ Queries with No Change          │        16 │
│ Queries with Failure            │         0 │
└─────────────────────────────────┴───────────┘

alamb · 2025-11-17T18:41:58Z

If we merge this PR in as written, I think we should file a ticket to follow up restoring the performance

…nfrastructure Co-authored-by: David Hewitt <mail@davidhewitt.dev>

* Consolidate StaticFilter and ArrayHashSet * Fix docs

adriangb · 2025-11-18T09:31:54Z

I'm surprised that doing dynamic dispatch once per batch we evaluate as opposed to twice per batch we evaluate makes that much of a difference. What would make sense that makes a difference to me is doing it once per element vs. once per batch. But I guess that's what benchmarks say!

That does leave me with a question... could we squeeze out even more performance if we specialize for ~ all scalar types? It wouldn't be that hard to write a macro and have AI do the copy pasta of implementing it for all of the types... I'll open a follow up ticket.

adriangb · 2025-11-18T09:32:14Z

Also thank you for your help getting this across the line @alamb! I'm excited to continue the work.

alamb · 2025-11-19T16:01:56Z

That does leave me with a question... could we squeeze out even more performance if we specialize for ~ all scalar types? It wouldn't be that hard to write a macro and have AI do the copy pasta of implementing it for all of the types... I'll open a follow up ticket.

Yes this is what I think we should do

alamb · 2025-11-19T16:02:37Z

datafusion/physical-expr/src/expressions/in_list.rs

+}

-    ArrayHashSet { state, map }
+struct Int32StaticFilter {


Oh yeah, we should totally do the same thing here for the other types. I'll file a ticket to track that

alamb · 2025-11-19T18:33:05Z

Filed Restore IN_LIST performance -- Implement specialized StaticFilters for different data types #18824

…for more precise filters (#18451) ## Background This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in #17171. A "target state" is tracked in #18393. There is a series of PRs to get us to this target state in smaller more reviewable changes that are still valuable on their own: - #18448 - #18449 (depends on #18448) - (This PR): #18451 ## Changes in this PR This PR refactors state management in HashJoinExec to make filter pushdown more efficient and prepare for pushing down membership tests. - Refactor internal data structures to clean up state management and make usage more idiomatic (use `Option` instead of comparing integers, etc.) - Uses CASE expressions to evaluate pushed-down filters selectively by partition Example: `CASE hash_repartition % N WHEN partition_id THEN condition ELSE false END` --------- Co-authored-by: Lía Adriana <lia.castaneda@datadoghq.com>

github-actions bot added physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate physical-plan Changes to the physical-plan crate labels Nov 2, 2025

adriangb commented Nov 2, 2025

View reviewed changes

adriangb force-pushed the refactor-in-list branch from 4d4b797 to 9a0f6be Compare November 2, 2025 20:48

adriangb mentioned this pull request Nov 2, 2025

Push down InList or hash table references from HashJoinExec depending on the size of the build side #18393

Open

adriangb force-pushed the refactor-in-list branch from 9a0f6be to f1f3b66 Compare November 2, 2025 21:13

This was referenced Nov 3, 2025

Push down entire hash table from HashJoinExec into scans #17171

Open

Refactor create_hashes to accept array references #18448

Merged

Refactor state management in HashJoinExec and use CASE expressions for more precise filters #18451

Merged

alamb mentioned this pull request Nov 4, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-03 #18486

Closed

53 tasks

adriangb force-pushed the refactor-in-list branch from f1f3b66 to d1b9d05 Compare November 7, 2025 18:25

github-actions bot removed common Related to common crate physical-plan Changes to the physical-plan crate labels Nov 7, 2025

adriangb commented Nov 7, 2025

View reviewed changes

github-actions bot removed the proto Related to proto crate label Nov 9, 2025

adriangb changed the title ~~Refactor InListExpr to store arrays and support structs~~ Refactor InListExpr to support structs by re-using existing hashing infrastructure Nov 9, 2025

adriangb commented Nov 9, 2025

View reviewed changes

adriangb force-pushed the refactor-in-list branch from ab74641 to f412ead Compare November 9, 2025 15:15

alamb reviewed Nov 9, 2025

View reviewed changes

github-actions bot added the common Related to common crate label Nov 10, 2025

adriangb force-pushed the refactor-in-list branch from 846476b to a5afb96 Compare November 15, 2025 07:02

alamb mentioned this pull request Nov 17, 2025

Consolidate StaticFilter and ArrayHashSet pydantic/datafusion#44

Merged

alamb approved these changes Nov 17, 2025

View reviewed changes

alamb mentioned this pull request Nov 17, 2025

WIP: Hack out specialized Int32 static filter pydantic/datafusion#45

Closed

adriangb mentioned this pull request Nov 17, 2025

Hash UnionArrays #18718

Merged

adriangb and others added 5 commits November 18, 2025 17:03

Refactor InListExpr to support structs by re-using existing hashing i…

8f83f9c

…nfrastructure Co-authored-by: David Hewitt <mail@davidhewitt.dev>

remove enum comparator

cd9e2f5

use const thread local

896820e

Consolidate StaticFilter and ArrayHashSet (#44)

621cfe5

* Consolidate StaticFilter and ArrayHashSet * Fix docs

fix rebase

8a2ee06

adriangb force-pushed the refactor-in-list branch from 2f5e435 to 8a2ee06 Compare November 18, 2025 09:19

Add specialized sets for primitive types

06a4763

adriangb added this pull request to the merge queue Nov 18, 2025

Merged via the queue into apache:main with commit 486c5d8 Nov 18, 2025
32 checks passed

adriangb deleted the refactor-in-list branch November 18, 2025 23:37

alamb reviewed Nov 19, 2025

View reviewed changes

alamb mentioned this pull request Nov 19, 2025

Restore IN_LIST performance -- Implement specialized StaticFilters for different data types #18824

Open

Refactor InListExpr to support structs by re-using existing hashing infrastructure #18449

Refactor InListExpr to support structs by re-using existing hashing infrastructure #18449

Uh oh!

Conversation

adriangb commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Changes in this PR

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Nov 9, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Nov 9, 2025

Uh oh!

adriangb commented Nov 9, 2025

Uh oh!

adriangb commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Nov 17, 2025

Uh oh!

alamb commented Nov 17, 2025

Uh oh!

alamb commented Nov 17, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adriangb commented Nov 2, 2025 •

edited

Loading

adriangb commented Nov 14, 2025 •

edited

Loading

alamb commented Nov 17, 2025 •

edited

Loading