branch-4.0: [Improvement](join) add direct mapping opt for join #57960 #58309
Merged
```sql
with wscs as (
select
ws_sold_date_sk sold_date_sk,
ws_ext_sales_price sales_price
from
web_sales
)
select
sum(d_week_seq)
from
wscs
inner join date_dim on d_date_sk = sold_date_sk;
```
Before:
- ProbeWhenSearchHashTableTime: 4sec480ms
After (about 5.9x lower for this counter):
- ProbeWhenSearchHashTableTime: 759.414ms
Problem Summary:
This pull request introduces a "direct mapping" optimization for
hash join operations, primarily targeting joins on numeric key columns.
It adds a new hash table context type and logic to detect when direct
mapping is possible, and updates the relevant operators and methods to
use the optimization. The goal is to improve hash join performance by
reducing hashing overhead when key values are densely distributed
within a limited range.
The most important changes are:
**Direct Mapping Optimization:**
* Added the `MethodOneNumberDirect` class template to enable direct
mapping for numeric key columns, bypassing hash computation when key
values fall within a small range. This includes logic for bucket
assignment and handling nulls and out-of-range values.
(`be/src/vec/common/hash_table/hash_map_context.h`)
* Introduced `DirectPrimaryTypeHashTableContext` type aliases and
extended `HashTableVariants` to support direct mapping contexts for
`UInt8`, `UInt16`, `UInt32`, and `UInt64` types.
(`be/src/pipeline/common/join_utils.h`)
* Implemented logic to detect when direct mapping is feasible based on
the min/max key values and to convert hash table variants to the direct
mapping context when possible. (`be/src/pipeline/common/join_utils.h`)
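The idea behind the bullets above can be sketched in a few lines. This is a minimal illustration and not the code from the PR: the function names, the `kMaxDirectRange` threshold, and the choice of reserving bucket 0 for nulls and out-of-range keys are all assumptions made for the example.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical threshold: the PR description does not state the real limit.
constexpr uint64_t kMaxDirectRange = 1ULL << 23;

// Direct mapping is only worthwhile when the build-side key range is small.
inline bool can_use_direct_mapping(uint64_t min_key, uint64_t max_key) {
    return max_key >= min_key && (max_key - min_key) < kMaxDirectRange;
}

// Assumption: bucket 0 is a reserved "never matches" slot, so nulls and
// out-of-range probe keys land there. In-range keys map to key - min_key + 1
// with no hash computation at all.
inline uint32_t direct_bucket(std::optional<uint64_t> key, uint64_t min_key,
                              uint64_t max_key) {
    if (!key || *key < min_key || *key > max_key) {
        return 0;
    }
    return static_cast<uint32_t>(*key - min_key + 1);
}

// One bucket per possible key value, plus the reserved slot.
inline uint32_t direct_bucket_count(uint64_t min_key, uint64_t max_key) {
    return static_cast<uint32_t>(max_key - min_key + 2);
}
```

With a build side such as `date_dim`, where `d_date_sk` values are typically consecutive, every probe key resolves to its bucket with one subtraction and one bounds check.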
**Operator and Method Updates:**
* Updated `HashJoinBuildSinkLocalState` and related methods to
initialize hash tables with direct mapping when appropriate, passing key
column information and supporting shared hash table scenarios.
(`be/src/pipeline/exec/hashjoin_build_sink.cpp`,
`be/src/pipeline/exec/hashjoin_build_sink.h`)
* Modified memory reservation and build logic to correctly account for
direct mapping ranges and bucket sizes, ensuring accurate resource
estimation and allocation.
(`be/src/pipeline/exec/hashjoin_build_sink.cpp`,
`be/src/pipeline/exec/hashjoin_build_sink.h`,
`be/src/pipeline/exec/partitioned_hash_join_probe_operator.cpp`)
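To see why memory reservation has to change, consider a rough estimate of a directly mapped table. The sketch below assumes a layout with one `uint32_t` head slot per bucket plus one `uint32_t` chain link per build row; that layout, the reserved extra slots, and the function name are assumptions for illustration, not the accounting used in the PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical estimate: with direct mapping the bucket array is sized by
// the key range (max_key - min_key), not by the number of build rows.
inline size_t estimate_direct_table_bytes(uint64_t min_key, uint64_t max_key,
                                          size_t build_rows) {
    // One slot per possible key value, plus one reserved slot (assumption).
    const size_t buckets = static_cast<size_t>(max_key - min_key) + 2;
    const size_t head_bytes = buckets * sizeof(uint32_t);
    // One chain link per build row, plus a sentinel row (assumption).
    const size_t link_bytes = (build_rows + 1) * sizeof(uint32_t);
    return head_bytes + link_bytes;
}
```

The point is that a sparse build side with a wide key range can need more bucket memory than a hashed table would, which is why the range has to be checked before switching contexts.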
**Codebase Consistency and Robustness:**
* Extended template instantiations and visitor logic to handle the new
direct mapping context, ensuring correct behavior in probe and build
phases. (`be/src/pipeline/exec/join/process_hash_table_probe_impl.h`,
`be/src/pipeline/exec/hashjoin_build_sink.cpp`)
* Added error handling for hash table type mismatches in shared hash
table scenarios. (`be/src/pipeline/exec/hashjoin_build_sink.cpp`)
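The type mismatch check for shared hash tables can be pictured as comparing the active alternative of two variants. This is a hedged sketch: `HashCtxU32`, `DirectCtxU32`, `Variants`, and `check_shared_table_type` are hypothetical stand-ins, and the real code reports errors through its own status type rather than a string.

```cpp
#include <string>
#include <variant>

struct HashCtxU32 {};   // hypothetical stand-in for a regular hashed context
struct DirectCtxU32 {}; // hypothetical stand-in for a direct mapping context
using Variants = std::variant<std::monostate, HashCtxU32, DirectCtxU32>;

// Returns an empty string when both sides hold the same context type,
// otherwise a message describing the mismatch.
inline std::string check_shared_table_type(const Variants& builder,
                                           const Variants& consumer) {
    if (builder.index() != consumer.index()) {
        return "hash table type mismatch: builder index " +
               std::to_string(builder.index()) + ", consumer index " +
               std::to_string(consumer.index());
    }
    return {};
}
```

A check like this matters because the instance that builds the shared table may switch to a direct mapping context after seeing the min/max keys, while an instance that only consumes it was initialized without that information.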
These changes collectively enable more efficient hash join execution for
suitable key types and ranges, improving query performance and resource
utilization.
yiguolei
approved these changes
Nov 25, 2025
Cherry-picked from #57960