@github-actions
Contributor

Cherry-picked from #57960

```sql
with wscs as (
    select
        ws_sold_date_sk sold_date_sk,
        ws_ext_sales_price sales_price
    from
        web_sales
)
select
    sum(d_week_seq)
from
    wscs
    inner join date_dim on d_date_sk = sold_date_sk;
```
before:
- ProbeWhenSearchHashTableTime: 4sec480ms

after:
- ProbeWhenSearchHashTableTime: 759.414ms

For this query, probe time drops by roughly 5.9x (4480 ms to 759.414 ms).

Problem Summary:

This pull request introduces a new "direct mapping" optimization for
hash join operations, primarily targeting joins on numeric key columns.
The changes add a new hash table context type, logic to detect when
direct mapping is possible, and update relevant operators and methods to
utilize this optimization. This aims to improve hash join performance by
reducing hashing overhead in cases where key values are densely
distributed within a limited range.
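
To make the idea concrete, here is a minimal sketch of direct mapping, assuming unsigned 64-bit keys and a known `[min_key, max_key]` range. The names `DirectMapper`, `build_first`, and `probe` are hypothetical and are not taken from the Doris code; the convention of reserving bucket 0 for nulls and out-of-range keys is likewise an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical illustration of direct mapping; not the Doris implementation.
// Assumption: bucket 0 is reserved for "no match" (nulls, out-of-range keys).
struct DirectMapper {
    uint64_t min_key;
    uint64_t max_key;

    // One bucket per value in [min_key, max_key], plus the reserved bucket 0.
    // Assumes the range was already checked to be small (no overflow).
    uint64_t bucket_count() const { return max_key - min_key + 2; }

    // No hash function: the bucket is just the key's offset from min_key.
    uint32_t bucket_of(std::optional<uint64_t> key) const {
        if (!key || *key < min_key || *key > max_key) return 0;
        return static_cast<uint32_t>(*key - min_key + 1);
    }
};

// Build side: store, per bucket, the 1-based index of a build row with that
// key (0 means "empty"). Duplicate keys overwrite each other here; a real
// join would chain them.
std::vector<uint32_t> build_first(const DirectMapper& m,
                                  const std::vector<uint64_t>& build_keys) {
    std::vector<uint32_t> first(m.bucket_count(), 0);
    for (uint32_t row = 0; row < build_keys.size(); ++row) {
        uint32_t b = m.bucket_of(build_keys[row]);
        assert(b != 0 && "build keys must lie inside [min_key, max_key]");
        first[b] = row + 1;
    }
    return first;
}

// Probe side: return the 0-based matching build row, if any.
std::optional<uint32_t> probe(const DirectMapper& m,
                              const std::vector<uint32_t>& first,
                              std::optional<uint64_t> key) {
    uint32_t b = m.bucket_of(key);
    if (b == 0 || first[b] == 0) return std::nullopt;
    return first[b] - 1;
}
```

In this sketch a probe costs one subtraction and one array read, which is the kind of saving the profile numbers above suggest. The price is a bucket array sized by the key range rather than by the row count.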

The most important changes are:

**Direct Mapping Optimization:**

* Added the `MethodOneNumberDirect` class template to enable direct
mapping for numeric key columns, bypassing hash computation when key
values fall within a small range. This includes logic for bucket
assignment and handling nulls and out-of-range values.
(`be/src/vec/common/hash_table/hash_map_context.h`)
* Introduced `DirectPrimaryTypeHashTableContext` type aliases and
extended `HashTableVariants` to support direct mapping contexts for
`UInt8`, `UInt16`, `UInt32`, and `UInt64` types.
(`be/src/pipeline/common/join_utils.h`)
[[1]](diffhunk://#diff-66cf4052118abf5abbef2e0d9193df3c35a46f70db35853c5884d56d4118a963L41-R55)
[[2]](diffhunk://#diff-66cf4052118abf5abbef2e0d9193df3c35a46f70db35853c5884d56d4118a963R64-R67)
* Implemented logic to detect when direct mapping is feasible based on
the min/max key values and to convert hash table variants to the direct
mapping context when possible. (`be/src/pipeline/common/join_utils.h`)
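
The feasibility check described in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the logic in `join_utils.h`: the function name `direct_mapping_range`, the `KeyRange` struct, and the `kMaxDirectRange` threshold are all hypothetical, and the real cutoff used by the PR is not stated in this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type: the key range to use for direct mapping.
struct KeyRange {
    uint64_t min_key;
    uint64_t max_key;
};

// Assumed threshold for illustration only; the actual limit is not given
// in the PR description.
constexpr uint64_t kMaxDirectRange = 1ULL << 23;

// Returns the key range if direct mapping looks feasible, else nullopt
// (the caller would then keep the ordinary hashed context).
std::optional<KeyRange> direct_mapping_range(
        const std::vector<uint64_t>& build_keys,
        uint64_t max_range = kMaxDirectRange) {
    assert(max_range > 0);
    if (build_keys.empty()) return std::nullopt;

    auto [lo, hi] = std::minmax_element(build_keys.begin(), build_keys.end());

    // *hi >= *lo, so the subtraction cannot wrap around.
    if (*hi - *lo >= max_range) return std::nullopt;
    return KeyRange{*lo, *hi};
}
```

Comparing `max - min` against the limit, rather than computing `max - min + 1` first, avoids an overflow when the keys span the full 64-bit range.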

**Operator and Method Updates:**

* Updated `HashJoinBuildSinkLocalState` and related methods to
initialize hash tables with direct mapping when appropriate, passing key
column information and supporting shared hash table scenarios.
(`be/src/pipeline/exec/hashjoin_build_sink.cpp`,
`be/src/pipeline/exec/hashjoin_build_sink.h`)
[[1]](diffhunk://#diff-95f4d643dcceaebd86699edcee6c1bc3b920a4bffb3ea3162316666d18ddbc2aR392-R393)
[[2]](diffhunk://#diff-95f4d643dcceaebd86699edcee6c1bc3b920a4bffb3ea3162316666d18ddbc2aL451-R443)
[[3]](diffhunk://#diff-95f4d643dcceaebd86699edcee6c1bc3b920a4bffb3ea3162316666d18ddbc2aL468-R474)
[[4]](diffhunk://#diff-0732e01c1a3f38997ada381c43aff98286e86ca7519db5469a6e4dcdec5bce44L53-R53)
* Modified memory reservation and build logic to correctly account for
direct mapping ranges and bucket sizes, ensuring accurate resource
estimation and allocation.
(`be/src/pipeline/exec/hashjoin_build_sink.cpp`,
`be/src/pipeline/exec/hashjoin_build_sink.h`,
`be/src/pipeline/exec/partitioned_hash_join_probe_operator.cpp`)
[[1]](diffhunk://#diff-95f4d643dcceaebd86699edcee6c1bc3b920a4bffb3ea3162316666d18ddbc2aL140-R139)
[[2]](diffhunk://#diff-e553b1e5eec6ee94db556ed5ae6d2f1fc3eba9d1ca58a23d0c90f38521bba96bL806-R806)
[[3]](diffhunk://#diff-0732e01c1a3f38997ada381c43aff98286e86ca7519db5469a6e4dcdec5bce44L207-R208)
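
Why memory reservation has to change can be seen from a rough sizing sketch. Both layouts below are assumptions made for illustration (4-byte bucket entries, a power-of-two bucket count for the hashed table, one reserved bucket for the direct table); the function names are hypothetical and the real estimation code may differ.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two >= n, for 1 <= n <= 2^63.
inline uint64_t next_power_of_two(uint64_t n) {
    assert(n >= 1 && n <= (1ULL << 63));
    uint64_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Assumed hashed layout: bucket count is a power of two derived from the
// number of build rows.
inline uint64_t hashed_bucket_bytes(uint64_t build_rows) {
    return next_power_of_two(build_rows + 1) * sizeof(uint32_t);
}

// Assumed direct layout: one bucket per value in [min_key, max_key], plus
// one reserved bucket. The size depends on the key range, not the row count.
inline uint64_t direct_bucket_bytes(uint64_t min_key, uint64_t max_key) {
    assert(min_key <= max_key);
    return (max_key - min_key + 2) * sizeof(uint32_t);
}
```

Under these assumptions, 1000 build rows need 4096 bytes of hashed buckets regardless of key values, while direct mapping needs 24 bytes if the keys span `[100, 104]` and 400004 bytes if they span `[0, 99999]`. An estimate based only on row count would therefore be wrong in either direction.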

**Codebase Consistency and Robustness:**

* Extended template instantiations and visitor logic to handle the new
direct mapping context, ensuring correct behavior in probe and build
phases. (`be/src/pipeline/exec/join/process_hash_table_probe_impl.h`,
`be/src/pipeline/exec/hashjoin_build_sink.cpp`)
[[1]](diffhunk://#diff-3110bab7d558f46b88ae1958b09ac369a92cac4bff98b280b2cf83d2d7aecbf4R794-R797)
[[2]](diffhunk://#diff-95f4d643dcceaebd86699edcee6c1bc3b920a4bffb3ea3162316666d18ddbc2aL187-R186)
* Added error handling for hash table type mismatches in shared hash
table scenarios. (`be/src/pipeline/exec/hashjoin_build_sink.cpp`)
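
A type-mismatch check of the kind mentioned in the last bullet might look like the sketch below, assuming the hash table contexts live in a `std::variant`. The types `HashedContext`, `DirectContext`, and `TableVariant` and the function `check_same_table_type` are hypothetical stand-ins, not the actual `HashTableVariants` definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical stand-ins for the real hash table contexts.
struct HashedContext {};
struct DirectContext {
    uint64_t min_key;
    uint64_t max_key;
};
using TableVariant = std::variant<std::monostate, HashedContext, DirectContext>;

// When one instance builds the table and others share it, every instance
// must agree on the active alternative. Returns an empty string on success,
// otherwise an error message.
std::string check_same_table_type(const TableVariant& builder,
                                  const TableVariant& sharer) {
    if (builder.index() != sharer.index()) {
        return "hash table type mismatch: builder index " +
               std::to_string(builder.index()) + ", sharer index " +
               std::to_string(sharer.index());
    }
    return {};
}
```

The check matters more once direct mapping exists, because the builder may switch to the direct context after inspecting its keys while a sharer still expects the hashed one.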

Together, these changes speed up hash joins whose numeric keys fall
within a dense, limited range by replacing hash computation with direct
indexing, and keep memory estimates consistent with the new layout.
@github-actions github-actions bot requested a review from yiguolei as a code owner November 24, 2025 11:38
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Nov 24, 2025
@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 77.63% (118/152) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 52.68% (18156/34467) |
| Line Coverage | 38.03% (165133/434272) |
| Region Coverage | 33.06% (128231/387856) |
| Branch Coverage | 33.85% (55246/163232) |

@yiguolei yiguolei merged commit 751b22a into branch-4.0 Nov 25, 2025
24 of 27 checks passed
@github-actions github-actions bot deleted the auto-pick-57960-branch-4.0 branch November 25, 2025 01:23
5 participants