Implement Parallel-aware Hash Left Anti Semi (Not-In) Join #149

avamingli · 2023-08-15T04:25:05Z

For parallel-aware hash join, we need to sync between parallel
workers to tell the right results when there are NULL values.

If we are LASJ and found NULL value by ourself or sibling processes
had found NULL values, quit and tell siblings to quit if possible.
It's safe to fetch and set phs_lasj_has_null without lock here and at
other places. As it's a boolean and we don't need to have the most
recent value from CPU or Mem cache. And we should avoid more locks in
HashJion Impl.
If we miss it here and some others set it at the same time, just
bypass and we may get it at the next Hash batch.
If we missed it across all batches, we will know it when
PHJ_BUILD_HASHING_INNER ends with the help of build_barrier.
If we never participated in building hash table, check it when hash
table creation job is finished.

gpadmin=# explain(costs off) select c1 from ao1 where c1 not in(select c2 from ao2);
                                QUERY PLAN
---------------------------------------------------------------------------
 Gather Motion 12:1  (slice1; segments: 12)
   ->  Parallel Hash Left Anti Semi (Not-In) Join
         Hash Cond: (ao1.c1 = ao2.c2)
         ->  Parallel Seq Scan on ao1
         ->  Parallel Hash
               ->  Parallel Broadcast Motion 12:12  (slice2; segments: 12)
                     ->  Parallel Seq Scan on ao2
 Optimizer: Postgres query optimizer
(8 rows)
gpadmin=# set enable_parallel=off;
SET
Time: 1.020 ms
gpadmin=# explain(costs off) select c1 from ao1 where c1 not in(select c2 from ao2);
                          QUERY PLAN
---------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Left Anti Semi (Not-In) Join
         Hash Cond: (ao1.c1 = ao2.c2)
         ->  Seq Scan on ao1
         ->  Hash
               ->  Broadcast Motion 3:3  (slice2; segments: 3)
                     ->  Seq Scan on ao2
 Optimizer: Postgres query optimizer
(8 rows)

Performance:

A special case NOT IN subslect has null value:

Table ao2 has 1 billion rows in seg file 0-3 and with a NULL value in seg file 4, launch a 4-workers plan.
DDL & DML:

create table ao1(c1 int, c2 int) using ao_row with(parallel_workers=4) distributed by (c1);
create table ao2(c1 int, c2 int) using ao_row with(parallel_workers=4) distributed by (c1);
insert into ao1 select i, i+1 from generate_series(1, 100000) I;
analyze ao1;
Begin;
set gp_appendonly_insert_files = 3;
insert into ao2 select i, i+1 from generate_series(1, 1000000000) I;
Commit;

Concurrent session during insertion 1 billion rows, just before transaction commit, insert a NULL value into ao2;

insert into ao2 values(1, 2), (3, NULL), (5, 6);

gpadmin=# select count(*) from ao2;
   count
------------
 1000000003
(1 row)

gpadmin=# select c1 from ao1 where c1 not in(select c2 from ao2);
 c1
----
(0 rows)

Time: 309224.911 ms (05:09.225)
set enable_parallel = on;
gpadmin=# select c1 from ao1 where c1 not in(select c2 from ao2);
 c1
----
(0 rows)

Time: 192.844 ms

Time: non-parallel plan 309224.911 ms to parallel-aware plan 192.844 ms, 1600x faster.

NOT IN subselect has no null values.

select count(*) from t2 where c1 not in (select c2 from t1);

parallel workers	avg duration(s)	1st	2nd	3rd
0	41.504	41.792	41.446	41.275
2	27.757	28.637	27.099	27.534
4	24.990	25.130	24.482	25.360
6	24.056	24.489	23.721	23.958

DDL & DML

create table t1(c1 int, c2 int);
create table t2(c1 int, c2 int);
insert into t1 select i, i+1 from generate_series(1, 40000000) i;
insert into t2 select i, i+1 from generate_series(1, 40000000) i;

closes: #ISSUE_Number

Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it has some breaking backward or forward compatibility, please clary.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
Sign the Contributor License Agreement as prompted for your first-time contribution.
List your communication in the GitHub Issues or Discussions (if has or needed).
Document changes.
Add tests for the change
Pass make installcheck
Pass make -C src/test installcheck-cbdb-parallel
Feel free to @cloudberrydb/dev team for review and approval when your PR is ready🥳

src/backend/executor/nodeHash.c

For parallel-aware hash join, we need to sync between parallel workers to tell the right results when there are NULL values. If we are LASJ and found NULL value by ourself or sibling processes had found NULL values, quit and tell siblings to quit if possible. It's safe to fetch and set phs_lasj_has_null without lock here and at other places. As it's a boolean and we don't need to have the most recent value from CPU or Mem cache. And we should avoid more locks in HashJion Impl. If we miss it here and some others set it at the same time, just bypass and we may get it at the next Hash batch. If we missed it across all batches, we will know it when PHJ_BUILD_HASHING_INNER ends with the help of build_barrier. If we never participated in building hash table, check it when hash table creation job is finished. explain(costs off) select c1 from ao1 where c1 not in(select c2 from ao2); QUERY PLAN ---------------------------------------------------------------------- Gather Motion 12:1 (slice1; segments: 12) -> Parallel Hash Left Anti Semi (Not-In) Join Hash Cond: (ao1.c1 = ao2.c2) -> Parallel Seq Scan on ao1 -> Parallel Hash -> Parallel Broadcast Motion 12:12 (slice2; segments:12) -> Parallel Seq Scan on ao2 Optimizer: Postgres query optimizer (8 rows) Authored-by: Zhang Mingli avamingli@gmail.com

src/include/executor/hashjoin.h

my-ship-it

LGTM

avamingli self-assigned this Aug 15, 2023

avamingli requested a review from my-ship-it August 15, 2023 04:25

my-ship-it requested a review from yjhjstz August 31, 2023 01:15

yjhjstz reviewed Aug 31, 2023

View reviewed changes

src/backend/executor/nodeHash.c Show resolved Hide resolved

avamingli force-pushed the implement_parallel_aware_lasj_hashjoin branch from e040b6e to 0078600 Compare August 31, 2023 10:01

yjhjstz reviewed Sep 1, 2023

View reviewed changes

src/backend/executor/nodeHash.c Show resolved Hide resolved

my-ship-it reviewed Oct 8, 2023

View reviewed changes

src/backend/executor/nodeHash.c Show resolved Hide resolved

my-ship-it previously approved these changes Oct 8, 2023

View reviewed changes

avamingli dismissed my-ship-it’s stale review via ec90764 October 8, 2023 08:21

avamingli force-pushed the implement_parallel_aware_lasj_hashjoin branch from 0078600 to ec90764 Compare October 8, 2023 08:21

yjhjstz reviewed Oct 9, 2023

View reviewed changes

src/include/executor/hashjoin.h Show resolved Hide resolved

yjhjstz approved these changes Oct 10, 2023

View reviewed changes

my-ship-it approved these changes Oct 10, 2023

View reviewed changes

my-ship-it merged commit eebe75a into cloudberrydb:main Oct 10, 2023
6 checks passed

avamingli mentioned this pull request Oct 13, 2023

Enable Parallel-aware Hash Left Anti Semi (Not-In) Join #145

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Parallel-aware Hash Left Anti Semi (Not-In) Join #149

Implement Parallel-aware Hash Left Anti Semi (Not-In) Join #149

avamingli commented Aug 15, 2023 •

edited

my-ship-it left a comment

Implement Parallel-aware Hash Left Anti Semi (Not-In) Join #149

Implement Parallel-aware Hash Left Anti Semi (Not-In) Join #149

Conversation

avamingli commented Aug 15, 2023 • edited

Performance:

A special case NOT IN subslect has null value:

NOT IN subselect has no null values.

Change logs

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Contributor's Checklist

my-ship-it left a comment

Choose a reason for hiding this comment

avamingli commented Aug 15, 2023 •

edited