Implement Parallel-aware Hash Left Anti Semi (Not-In) Join #149
Merged: my-ship-it merged 1 commit into cloudberrydb:main from avamingli:implement_parallel_aware_lasj_hashjoin on Oct 10, 2023
yjhjstz reviewed on Aug 31, 2023
avamingli force-pushed the implement_parallel_aware_lasj_hashjoin branch from e040b6e to 0078600 on August 31, 2023 10:01
yjhjstz reviewed on Sep 1, 2023
my-ship-it reviewed on Oct 8, 2023
my-ship-it previously approved these changes on Oct 8, 2023
For parallel-aware hash join, we need to synchronize between parallel workers to return the right results when there are NULL values.

If we are LASJ and we found a NULL value ourselves, or sibling processes have found NULL values, quit and tell the siblings to quit if possible.

It is safe to read and set phs_lasj_has_null without a lock here and elsewhere, because it is a boolean and we do not need the most recent value from the CPU or memory cache. We should also avoid adding more locks to the HashJoin implementation.

If we miss it here while another process sets it at the same time, just continue; we may see it at the next hash batch. If we miss it across all batches, we will learn it when PHJ_BUILD_HASHING_INNER ends, with the help of build_barrier. If we never participated in building the hash table, check it when the hash table creation job is finished.

explain(costs off) select c1 from ao1 where c1 not in(select c2 from ao2);
                              QUERY PLAN
----------------------------------------------------------------------
 Gather Motion 12:1 (slice1; segments: 12)
   -> Parallel Hash Left Anti Semi (Not-In) Join
         Hash Cond: (ao1.c1 = ao2.c2)
         -> Parallel Seq Scan on ao1
         -> Parallel Hash
               -> Parallel Broadcast Motion 12:12 (slice2; segments:12)
                     -> Parallel Seq Scan on ao2
 Optimizer: Postgres query optimizer
(8 rows)

Authored-by: Zhang Mingli avamingli@gmail.com
avamingli force-pushed the implement_parallel_aware_lasj_hashjoin branch from 0078600 to ec90764 on October 8, 2023 08:21
yjhjstz reviewed on Oct 9, 2023
yjhjstz approved these changes on Oct 10, 2023
my-ship-it approved these changes on Oct 10, 2023
LGTM
For parallel-aware hash join, we need to synchronize between parallel
workers to return the right results when there are NULL values.
If we are LASJ and we found a NULL value ourselves, or sibling processes
have found NULL values, quit and tell the siblings to quit if possible.
It is safe to read and set phs_lasj_has_null without a lock here and
elsewhere, because it is a boolean and we do not need the most
recent value from the CPU or memory cache. We should also avoid adding
more locks to the HashJoin implementation.
If we miss it here while another process sets it at the same time, just
continue; we may see it at the next hash batch.
If we miss it across all batches, we will learn it when
PHJ_BUILD_HASHING_INNER ends, with the help of build_barrier.
If we never participated in building the hash table, check it when the
hash table creation job is finished.
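The flag protocol described above can be sketched in C. This is an illustrative sketch only, not the code from this patch: `SharedHashState`, `InnerDatum`, `build_batch`, and `lasj_result_is_empty` are hypothetical names, the `lasj_has_null` field merely stands in for `phs_lasj_has_null`, and C11 relaxed atomics are used as one assumed way to express "no lock, and a stale read is acceptable".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared parallel hash state. */
typedef struct {
    atomic_bool lasj_has_null;   /* plays the role of phs_lasj_has_null */
} SharedHashState;

/* One inner-side value; is_null marks a SQL NULL. */
typedef struct {
    int  value;
    bool is_null;
} InnerDatum;

/*
 * Process one batch of inner rows on behalf of one worker.
 * Returns the number of rows consumed.
 *
 * The flag is read once at the start of the batch without a lock. If a
 * sibling sets it just after this read, the worker finishes the batch
 * and notices the flag at the start of its next batch.
 */
static size_t
build_batch(SharedHashState *shared, const InnerDatum *rows, size_t nrows)
{
    size_t i;

    assert(shared != NULL);

    /* A sibling already found a NULL: nothing left to do. */
    if (atomic_load_explicit(&shared->lasj_has_null, memory_order_relaxed))
        return 0;

    for (i = 0; i < nrows; i++)
    {
        if (rows[i].is_null)
        {
            /* Tell siblings to quit, then quit ourselves. */
            atomic_store_explicit(&shared->lasj_has_null, true,
                                  memory_order_relaxed);
            return i + 1;
        }
        /* A real build would insert rows[i] into the hash table here. */
    }
    return nrows;
}

/*
 * Final check. It is meant to run only after every worker has passed a
 * barrier that ends the build phase (the role build_barrier plays in the
 * description), so any store made during the build is visible by then.
 */
static bool
lasj_result_is_empty(SharedHashState *shared)
{
    return atomic_load_explicit(&shared->lasj_has_null, memory_order_relaxed);
}
```

In this sketch a worker that misses the flag during one batch simply consumes zero rows on its next call, and a worker that never built anything still learns the outcome from the final check.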
Performance:
Special case: the NOT IN subselect has a NULL value.
Table ao2 has 1 billion rows in seg files 0-3 and a NULL value in seg file 4; a plan with 4 workers is launched.
DDL & DML:
In a concurrent session, while the 1 billion rows are being inserted and just before that transaction commits, insert a NULL value into ao2.
Time: 309224.911 ms for the non-parallel plan versus 192.844 ms for the parallel-aware plan, about 1600x faster.
The NOT IN subselect has no NULL values.
DDL & DML
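The two cases differ because of how SQL evaluates NOT IN over a set that contains NULL: no outer row can qualify, so the result is known to be empty as soon as one NULL is seen. The sketch below models only that rule in plain C. It is not code from this patch; `NullableInt` and `count_not_in` are hypothetical names, and outer values are assumed to be non-NULL for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>

/* A nullable integer; is_null marks a SQL NULL. */
typedef struct {
    int  value;
    bool is_null;
} NullableInt;

/*
 * Count the outer rows that satisfy "outer NOT IN (inner ...)" under
 * SQL three-valued logic. Outer values are assumed non-NULL.
 */
static size_t
count_not_in(const int *outer, size_t nouter,
             const NullableInt *inner, size_t ninner)
{
    size_t i, j, kept = 0;

    /*
     * If the inner side contains any NULL, "x NOT IN (...)" is either
     * false (x matches a value) or NULL (x matches nothing), never true.
     * The answer is empty without looking at the outer side at all.
     */
    for (j = 0; j < ninner; j++)
        if (inner[j].is_null)
            return 0;

    /* No NULLs: an ordinary anti join, keep rows with no match. */
    for (i = 0; i < nouter; i++)
    {
        bool found = false;

        for (j = 0; j < ninner; j++)
        {
            if (inner[j].value == outer[i])
            {
                found = true;
                break;
            }
        }
        if (!found)
            kept++;
    }
    return kept;
}
```

With outer values 1, 2, 3, an inner set of {2} keeps two rows, while an inner set of {2, NULL} keeps none. That early-empty property is presumably what allows a worker that meets the NULL to end the join long before the remaining inner rows are scanned.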
closes: #ISSUE_Number
Change logs
Describe your change clearly, including what problem is being solved or what feature is being added.
If it breaks backward or forward compatibility, please clarify.
Why are the changes needed?
Describe why the changes are necessary.
Does this PR introduce any user-facing change?
If yes, please clarify the previous behavior and the change this PR proposes.
How was this patch tested?
Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.
Contributor's Checklist
Here are some reminders and checklists before/when submitting your pull request, please check them:
make installcheck
make -C src/test installcheck-cbdb-parallel