
[SPARK-56548][CORE] Replace modulo with bitmask in BloomFilter hot paths #55423

Closed
LuciferYang wants to merge 8 commits into apache:master from LuciferYang:sketch-bloomfilter-bitmask-opt

Conversation


@LuciferYang (Contributor) commented Apr 20, 2026

What changes were proposed in this pull request?

Replace combinedHash % bitSize with combinedHash & (bitSize - 1) in the BloomFilter put/mightContain hot path. The bitmask trick requires bitSize to be a power of two, so BitArray's public constructor now rounds the word count up to the next power of two.

Concretely:

  1. BitArray constructor — rounds numWords up to the next power of two via roundUpToPowerOfTwo. A precomputed bitMask field (bitSize - 1 for power-of-two, -1L otherwise) lets indexFor(long hash) pick the fast path (hash & bitMask) or fall back to modulo.

  2. BloomFilterImpl / BloomFilterImplV2 — call bits.indexFor(combinedHash) instead of inline % bitSize. The existing combinedHash < 0 ? ~combinedHash : combinedHash sign-normalization stays at the call site so that old readers still compute the same bit indices.

  3. Legacy deserialization — BitArray(long[]) (used by readFrom) does not re-round the word count, so deserialized filters keep their original size and fall back to modulo via bitMask == -1L.

Overflow guard: word counts above 2^30 are left unrounded so that doubling cannot overflow int.
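
To make the mechanics concrete, here is a minimal Java sketch of the scheme described above. The names (roundUpToPowerOfTwo, bitMask, indexFor) follow this PR description, but the code is illustrative only and differs from the actual BitArray in org.apache.spark.util.sketch:

```java
// Minimal sketch of the rounding + bitmask scheme, not the real Spark source.
final class BitArraySketch {
  private final long[] data;
  private final long bitSize;
  private final long bitMask;   // bitSize - 1 when bitSize is a power of two, else -1L

  // Construction path for new filters: round the word count up so the
  // fast bitmask path applies.
  BitArraySketch(long numBits) {
    int numWords = (int) Math.ceil(numBits / 64.0);
    if (numWords <= (1 << 30)) {          // overflow guard: doubling stays within int
      numWords = roundUpToPowerOfTwo(numWords);
    }
    this.data = new long[numWords];
    this.bitSize = (long) numWords * 64;
    this.bitMask = Long.bitCount(bitSize) == 1 ? bitSize - 1 : -1L;
  }

  // Legacy deserialization path (readFrom): keep the serialized word count as-is;
  // non-power-of-two sizes fall back to modulo via bitMask == -1L.
  BitArraySketch(long[] words) {
    this.data = words;
    this.bitSize = (long) words.length * 64;
    this.bitMask = Long.bitCount(bitSize) == 1 ? bitSize - 1 : -1L;
  }

  static int roundUpToPowerOfTwo(int n) {
    return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
  }

  // Callers pass an already sign-normalized (non-negative) combined hash.
  long indexFor(long hash) {
    return bitMask != -1L ? (hash & bitMask) : (hash % bitSize);
  }
}
```

A single indexFor covers both paths, so freshly built filters take the AND while legacy-deserialized, non-power-of-two filters keep their original modulo behavior.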

Why are the changes needed?

On x86-64 a 64-bit integer division takes ~20-35 cycles; a bitwise AND takes one. With the default FPP the filter uses 7 hash functions, so every put or mightContain call pays for 7 modulos. That cost adds up on the probe side of runtime bloom filter joins, where mightContain runs once per row.

SparkBloomFilterBenchmark on GHA (AMD EPYC 7763, Linux 6.17, JDK 17, ns/row):

| Workload | V1 before → after | V1 Δ | V2 before → after | V2 Δ |
|---|---|---|---|---|
| Put — 10K | 46.6 → 31.3 | −33% | 52.9 → 34.8 | −34% |
| Put — 100K | 49.1 → 37.7 | −23% | 57.0 → 40.1 | −30% |
| Put — 1M | 54.7 → 46.2 | −16% | 63.1 → 48.6 | −23% |
| MightContain — 10K, 50% hit | 28.1 → 19.6 | −30% | 29.8 → 22.6 | −24% |
| MightContain — 100K, 50% hit | 30.8 → 23.5 | −24% | 34.3 → 26.8 | −22% |
| MightContain — 1M, 50% hit | 35.4 → 30.3 | −14% | 39.1 → 33.8 | −14% |

JDK 21 and JDK 25 results (included in the PR) show the same pattern. Gains are larger on smaller filters where modulo dominates per-item cost, and taper off at 1M items where cache misses take over — still 13-23% there.

Does this PR introduce any user-facing change?

Yes — BloomFilter.bitSize() (public abstract in o.a.s.util.sketch.BloomFilter) may now return a value larger than the numBits passed to BloomFilter.create(expectedNumItems, numBits), because the underlying word count is rounded up to the next power of two.

Example: BloomFilter.create(1000, 320) used to give bitSize() == 320 (5 words × 64), now gives bitSize() == 512 (8 words × 64, since 5 rounds up to 8).
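
For illustration, a caller can observe the change directly through the public API; the 512 result assumes the rounding behavior described above:

```java
import org.apache.spark.util.sketch.BloomFilter;

public class BitSizeExample {
  public static void main(String[] args) {
    // expectedNumItems = 1000, numBits = 320 (5 words of 64 bits)
    BloomFilter filter = BloomFilter.create(1000, 320);
    // Before this PR: 320. After: 512, because 5 words round up to 8.
    System.out.println(filter.bitSize());
  }
}
```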

What this means in practice:

  • FPP — slightly better (lower) than requested, because there are more bits for the same number of insertions. Correctness is preserved.
  • numHashFunctions — still computed from the caller-supplied numBits, not the actual bitSize. Slightly sub-optimal for the rounded-up array, but the filter stays correct and the difference is marginal.
  • Cross-version mergeInPlace / intersectInPlace — isCompatible() requires bitSize() equality, so a filter built by the new code (512 bits) cannot merge with one built by the old code (320 bits) using the same parameters. Filters built by the same Spark version are always compatible. This only matters if filters are exchanged across Spark versions at runtime, which is not a typical use case.
  • Serialization — format is unchanged. writeTo stores the actual (rounded) word count; readFrom restores it verbatim without re-rounding. Old Spark can read new filters and vice versa, since for non-negative hashes on a power-of-two size, hash & (bitSize - 1) and hash % bitSize produce the same result (see the quick check after this list).
  • Memory — worst case ~2× for small filters (e.g., 3 words → 4). Negligible at typical filter sizes.
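
The serialization bullet leans on a simple identity. The following throwaway check (not part of the PR) confirms that, for non-negative hashes and a power-of-two size, the mask and the modulo pick the same bit:

```java
public class MaskVsModulo {
  public static void main(String[] args) {
    long bitSize = 512;           // any power of two works
    long mask = bitSize - 1;
    for (long hash = 0; hash < 1_000_000; hash++) {
      // Holds for every non-negative hash, which is why filters written by the
      // new bitmask code can be read by the old modulo-based code and vice versa.
      if ((hash & mask) != (hash % bitSize)) {
        throw new AssertionError("mismatch at hash = " + hash);
      }
    }
    System.out.println("mask and modulo agree on all tested hashes");
  }
}
```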

How was this patch tested?

  • Five new cases in BitArraySuite: roundUpToPowerOfTwo edge cases (including overflow guard at 2^30 + 1 and Integer.MAX_VALUE), fast-path vs. fallback indexFor, and a serialize-deserialize round-trip for a legacy non-power-of-2 array.
  • Updated bitSize assertions in DataFrameStatSuite, JavaDataFrameSuite, and ClientDataFrameStatSuite to expect the rounded-up value.
  • SparkBloomFilterBenchmark re-run on GHA for JDK 17 / 21 / 25; updated result files included.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

@LuciferYang marked this pull request as draft April 20, 2026 07:35
BitArray now rounds numWords up to the next power of 2 for bitmask
optimization, so requesting 64*5=320 bits allocates 8 words (512 bits).
Update test expectations accordingly.

```diff
 BloomFilter filter3 = df.stat().bloomFilter("id", 1000, 64 * 5);
-Assertions.assertEquals(64 * 5, filter3.bitSize());
+Assertions.assertEquals(64 * 8, filter3.bitSize());
```
LuciferYang (Contributor Author) commented:

This is a breaking change to a public API, so I need to reconsider the feasibility of this PR.
