
[Bug] flaky incorrect results of hashjoin_spill in parallel mode on x86_64 #259

Closed · avamingli opened this issue Oct 24, 2023 · 16 comments · Fixed by #260
Labels: type: Bug (Something isn't working)

@avamingli (Collaborator) commented Oct 24, 2023

Cloudberry Database version

main branch 951be2d

What happened

diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out
--- /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out	2023-10-23 15:54:14.579403168 +0800
+++ /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out	2023-10-23 15:54:14.581402176 +0800
@@ -68,7 +67,7 @@
 select avg(i3) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 499.0423391378211935
 (1 row)
 
 select * from hashjoin_spill.is_workfile_created('explain (analyze, verbose) SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2');

An earlier occurrence:

diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out
--- /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out	2023-08-24 08:08:57.766770920 +0000
+++ /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out	2023-08-24 08:08:57.768770926 +0000
@@ -43,7 +50,7 @@
 select avg(i3) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 499.2977834544051515
 (1 row)

What you think should happen instead

No response

How to reproduce

We get this kind of results diff occasionally on several CI runs, e.g. https://github.com/cloudberrydb/cloudberrydb/actions/runs/5960899566/job/16169472506?pr=142
and on internal CI: https://code.hashdata.xyz/cloudberry/cbdb/-/jobs/366882

Operating System

x86_64

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!


avamingli added the type: Bug label on Oct 24, 2023
@avamingli (Collaborator Author) commented:

I couldn't reproduce it on my x86_64 machine; a 1000-iteration loop showed no failures:

for i in {1..1000}; do
  PGOPTIONS='-c optimizer=off -c enable_parallel=true -c min_parallel_table_scan_size=0 -c min_parallel_index_scan_size=0 -c force_parallel_mode=1 -c gp_appendonly_insert_files=0' \
  ./pg_regress --inputdir=. --bindir='/home/gpadmin/install/gp9/bin' --load-extension=gp_inject_fault --dlpath=. --max-connections=20 --max-concurrent-tests=24 --init-file=./init_file \
  workfile/hashjoin_spill | grep -i 'workfile/hashjoin_spill' | tee -a debug.log
done
cat debug.log | grep -i ok | wc -l
1000

@avamingli (Collaborator Author) commented Oct 27, 2023

We got the wrong results again: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6661394625/job/18104487842?pr=265

--- /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out	2023-10-27 01:05:56.552220732 +0000
+++ /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out	2023-10-27 01:05:56.554220688 +0000
@@ -43,7 +50,7 @@
 select avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 499.4107171506497674
 (1 row)

PR #260 doesn't fix this.

@Ray-Eldath (Contributor) commented:

First attempt at reproducing succeeded, with the following diff:

--- /home/gpadmin/cbdb/src/test/regress/expected/workfile/hashjoin_spill.out    2023-10-27 11:44:25.385210186 +0800
+++ /home/gpadmin/cbdb/src/test/regress/results/workfile/hashjoin_spill.out 2023-10-27 11:44:25.385210186 +0800
@@ -43,7 +49,7 @@
 select avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg
 ----------------------
- 499.5000000000000000
+ 498.7976339559297647
 (1 row)
 
 select * from hashjoin_spill.is_workfile_created('explain (analyze, verbose) SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2');

@avamingli (Collaborator Author) commented:

Got 499.4107171506497674 again.

@avamingli (Collaborator Author) commented Oct 30, 2023

@Ray-Eldath and I have been debugging this for a while. Under high load we reproduce it in about 2–3% of iterations of a 200-iteration loop, but we still can't reproduce it deterministically.

Further experiments show it is related to `SET statement_mem=1024;` combined with parallel hash join.
When the query fails, it returns less data:

 count |   sum    |         avg
-------+----------+----------------------
 37409 | 18716789 | 500.3285038359753001
(1 row)
 count |   sum    |         avg
-------+----------+----------------------
 37398 | 18676962 | 499.4107171506497674
(1 row)
 count |   sum    |         avg
-------+----------+----------------------
 37398 | 18676962 | 499.4107171506497674
(1 row)

The correct results:

 count |   sum    |         avg
-------+----------+----------------------
 45000 | 22477500 | 499.5000000000000000
(1 row)
| failed times (200 loops) | set statement_mem = 1024 | parallel-aware |
| --- | --- | --- |
| 0 | false | true |
| 0 | true | false |
| 0 | false | false |
| 4 | true | true |

The reason may be that CBDB's own memory limit (workfile) logic doesn't account for parallel-aware hash join, since that logic dates back to the initial GPDB commit. We may need to re-implement the workfile feature for parallel-aware hash join (see the sketch below for the batching rule involved).
@my-ship-it @gfphoenix78
Needs more digging.
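
For context on why `statement_mem=1024` (in kB) matters here: below is a minimal, self-contained sketch, my paraphrase of the batching rule in upstream PostgreSQL's `ExecChooseHashTableSize()`. The byte counts are illustrative assumptions, not measurements from this test. It shows how a tiny per-operator budget forces the hash join into multiple batches, i.e. into workfiles, which is the precondition for the lost-batch symptom:

```c
#include <stdio.h>

/*
 * Paraphrase of upstream PostgreSQL's batching rule: keep doubling the
 * number of batches until one batch's share of the inner side fits in
 * the memory budget.  nbatch stays a power of two.
 */
static int
choose_nbatch(double inner_rel_bytes, double space_allowed_bytes)
{
    int nbatch = 1;

    while (inner_rel_bytes / nbatch > space_allowed_bytes)
        nbatch <<= 1;
    return nbatch;
}

int
main(void)
{
    /*
     * 45000 inner rows at an assumed ~48 bytes/tuple against a 1 MB
     * budget: this prints nbatch = 4, i.e. the join must spill.
     */
    printf("nbatch = %d\n", choose_nbatch(45000.0 * 48.0, 1024.0 * 1024.0));
    return 0;
}
```

With any nbatch > 1, both join sides are partitioned into shared batch files, and every participating worker is expected to run the outer-partitioning phase, which is where the later comments point.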

@avamingli (Collaborator Author) commented Oct 30, 2023

Update: I added some logging to print the shared hash table's total rows. A wrong result can appear even when the total row count is correct:
15060 + 15010 + 14930 = 45000.

gpadmin=# select count(i3), avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=661224)
INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=661222)
INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=661221)
INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=661225)
INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=661223)
INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=661220)
 count |         avg
-------+----------------------
 37409 | 500.3285038359753001
(1 row)

gpadmin=# explain(costs off) select count(i3), avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
                               QUERY PLAN
------------------------------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 6:1  (slice1; segments: 6)
         ->  Partial Aggregate
               ->  Parallel Hash Left Join
                     Hash Cond: (t2.i2 = t1.i1)
                     ->  Redistribute Motion 6:6  (slice2; segments: 6)
                           Hash Key: t2.i2
                           Hash Module: 3
                           ->  Parallel Seq Scan on test_hj_spill t2
                     ->  Parallel Hash
                           ->  Parallel Seq Scan on test_hj_spill t1
 Optimizer: Postgres query optimizer
(12 rows)

Then I modified the SQL to an inner join, and the issue could not be reproduced in a 200-iteration loop.

@avamingli (Collaborator Author) commented:

> Then I modified the SQL to an inner join, and the issue could not be reproduced in a 200-iteration loop.

Negative: the inner join also gets wrong results with this plan:

gpadmin=# explain select count(i3), avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1  JOIN test_hj_spill AS t2 ON t1.i2=t2.i1) foo;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=684.26..684.27 rows=1 width=40)
   ->  Gather Motion 6:1  (slice1; segments: 6)  (cost=684.13..684.21 rows=6 width=40)
         ->  Partial Aggregate  (cost=684.13..684.14 rows=1 width=40)
               ->  Parallel Hash Join  (cost=196.75..627.88 rows=7500 width=4)
                     Hash Cond: (t1.i2 = t2.i1)
                     ->  Redistribute Motion 6:6  (slice2; segments: 6)  (cost=0.00..253.00 rows=7500 width=8)
                           Hash Key: t1.i2
                           Hash Module: 3
                           ->  Parallel Seq Scan on test_hj_spill t1  (cost=0.00..103.00 rows=7500 width=8)
                     ->  Parallel Hash  (cost=103.00..103.00 rows=7500 width=4)
                           ->  Parallel Seq Scan on test_hj_spill t2  (cost=0.00..103.00 rows=7500 width=4)
 Optimizer: Postgres query optimizer
(12 rows)

avamingli self-assigned this Oct 31, 2023
@avamingli (Collaborator Author) commented:

Update:
I have identified that the issue is on the outer side of the parallel-aware join; some batches are lost whenever the result is incorrect.
For example, a process on seg0 loses part of its data.

psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890950)
psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890949)
psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890948)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890951)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890952)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890953)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg0 slice1 127.0.1.1:9002 pid=890948)
psql:p.sql:3: INFO:  parallel outer batch has total 7420 rows.  (seg0 slice1 127.0.1.1:9002 pid=890949)
psql:p.sql:3: INFO:  parallel outer batch has total 7603 rows.  (seg2 slice1 127.0.1.1:9004 pid=890952)
psql:p.sql:3: INFO:  parallel outer batch has total 7329 rows.  (seg2 slice1 127.0.1.1:9004 pid=890953)
psql:p.sql:3: INFO:  parallel outer batch has total 7470 rows.  (seg1 slice1 127.0.1.1:9003 pid=890950)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg1 slice1 127.0.1.1:9003 pid=890951)
 count |   sum    |         avg
-------+----------+----------------------
 45000 | 22477500 | 499.5000000000000000
(1 row)

All six processes logged their parallel outer batch.

psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890919)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890921)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890922)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890923)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890920)
psql:p.sql:3: INFO:  parallel outer batch has total 7420 rows.  (seg0 slice1 127.0.1.1:9002 pid=890919)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg1 slice1 127.0.1.1:9003 pid=890921)
psql:p.sql:3: INFO:  parallel outer batch has total 7329 rows.  (seg2 slice1 127.0.1.1:9004 pid=890923)
psql:p.sql:3: INFO:  parallel outer batch has total 7603 rows.  (seg2 slice1 127.0.1.1:9004 pid=890922)
psql:p.sql:3: INFO:  parallel outer batch has total 7470 rows.  (seg1 slice1 127.0.1.1:9003 pid=890920)
psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890918)
 count |   sum    |         avg
-------+----------+----------------------
 37409 | 18716789 | 500.3285038359753001
(1 row)

Only five processes logged a parallel outer batch; the one from seg0 (pid 890918) is missing.

@avamingli (Collaborator Author) commented:

Adding more logging shows that when wrong results happen, one worker on seg1 did not participate in PHJ_BUILD_HASHING_OUTER (see the sketch after the log excerpt below).

psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel build_barrier total participants: 1.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel build_barrier participants begin, nbatch: 2  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel outer batch begin fetch rows.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=938112)
psql:p.sql:3: INFO:  parallel build_barrier total participants: 1.  (seg1 slice1 127.0.1.1:9003 pid=938112)
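
For readers following the logs: this is the upstream PostgreSQL 14 build-barrier logic being probed, paraphrased from `ExecHashJoinImpl()` in `nodeHashjoin.c`. It is a fragment, not standalone code; `parallel_state` and `node` stand for the surrounding executor state, and whether CBDB's parallel mode matches this exactly is an assumption on my part:

```c
/* Paraphrased from PostgreSQL 14 nodeHashjoin.c, ExecHashJoinImpl(). */
Barrier *build_barrier = &parallel_state->build_barrier;

Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
       BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
{
    /*
     * On time: partition this worker's share of the outer side into
     * shared batch files, then wait for the other participants.
     */
    ExecParallelHashJoinPartitionOuter(node);
    BarrierArriveAndWait(build_barrier, WAIT_EVENT_HASH_BUILD_HASH_OUTER);
}
/*
 * A worker arriving here in PHJ_BUILD_DONE skips outer partitioning
 * entirely.  Upstream that is harmless, since any worker can scan any
 * outer tuple.  But if, as the Redistribute Motion in the plans above
 * suggests, each CBDB worker is the only reader of its own share of
 * the outer side, then a late worker's outer tuples are never written
 * to any batch, consistent with the missing "parallel outer batch"
 * log from one seg1 worker above.
 */
```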

@avamingli (Collaborator Author) commented:

Another failed case: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6729213443/job/18290602808?pr=278

 select avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 498.7976339559297647
 (1 row)

@avamingli (Collaborator Author) commented:

> Under high load we reproduce it in about 2–3% of iterations of a 200-iteration loop.

A 1000-iteration loop test shows no incorrect results with #284; we could keep this issue open for several months to check the CI results.

@avamingli (Collaborator Author) commented:

> Under high load we reproduce it in about 2–3% of iterations of a 200-iteration loop.

> A 1000-iteration loop test shows no incorrect results with #284; we could keep this issue open for several months to check the CI results.

Comparison (500-loop test):

| commit | succeeded to run SQL | wrong results |
| --- | --- | --- |
| without PR #284 | 470 | 30 |
| with PR #284 | 474 | 0 |

@avamingli (Collaborator Author) commented:

#284 has been merged, let's keep this issue open for a while to see if it's really resolved.

@avamingli (Collaborator Author) commented:

> #284 has been merged, let's keep this issue open for a while to see if it's really resolved.

We haven't hit this issue in the three months since #284 was merged; shall we close this? @my-ship-it

@avamingli (Collaborator Author) commented:

The issue hasn't occurred since #284; closing.
