
[Bug] flaky incorrect results of hashjoin_spill in parallel mode on x86_64 #259

Closed · avamingli opened this issue Oct 24, 2023 · 16 comments · Fixed by #260
Labels: type: Bug (Something isn't working)

@avamingli (Collaborator) commented Oct 24, 2023

Cloudberry Database version

main branch 951be2d

What happened

diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out
--- /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out	2023-10-23 15:54:14.579403168 +0800
+++ /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out	2023-10-23 15:54:14.581402176 +0800
@@ -68,7 +67,7 @@
 select avg(i3) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 499.0423391378211935
 (1 row)
 
 select * from hashjoin_spill.is_workfile_created('explain (analyze, verbose) SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2');

An earlier occurrence:

diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out
--- /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out	2023-08-24 08:08:57.766770920 +0000
+++ /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out	2023-08-24 08:08:57.768770926 +0000
@@ -43,7 +50,7 @@
 select avg(i3) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 499.2977834544051515
 (1 row)

What you think should happen instead

No response

How to reproduce

We get this kind of results diff occasionally on several CI runs, e.g. https://github.com/cloudberrydb/cloudberrydb/actions/runs/5960899566/job/16169472506?pr=142
and on internal CI: https://code.hashdata.xyz/cloudberry/cbdb/-/jobs/366882

Operating System

x86_64

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!


avamingli added the type: Bug label on Oct 24, 2023
@avamingli (Collaborator Author) commented:

I couldn't reproduce it on my x86_64 machine; a 1000-iteration loop showed no failures:

for i in {1..1000}; do
  PGOPTIONS='-c optimizer=off -c enable_parallel=true -c min_parallel_table_scan_size=0 -c min_parallel_index_scan_size=0 -c force_parallel_mode=1 -c gp_appendonly_insert_files=0' \
  ./pg_regress --inputdir=. --bindir='/home/gpadmin/install/gp9/bin' --load-extension=gp_inject_fault --dlpath=. --max-connections=20 --max-concurrent-tests=24 --init-file=./init_file \
  workfile/hashjoin_spill | grep -i 'workfile/hashjoin_spill' | tee -a debug.log
done
cat debug.log | grep -i ok | wc -l
1000

@avamingli (Collaborator Author) commented Oct 27, 2023

We got the wrong results again: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6661394625/job/18104487842?pr=265

--- /code/gpdb_src/src/test/regress/expected/workfile/hashjoin_spill.out	2023-10-27 01:05:56.552220732 +0000
+++ /code/gpdb_src/src/test/regress/results/workfile/hashjoin_spill.out	2023-10-27 01:05:56.554220688 +0000
@@ -43,7 +50,7 @@
 select avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 499.4107171506497674
 (1 row)

PR #260 doesn't fix this.

@Ray-Eldath (Contributor) commented:

First attempt at reproducing succeeded, with the following diff:

--- /home/gpadmin/cbdb/src/test/regress/expected/workfile/hashjoin_spill.out    2023-10-27 11:44:25.385210186 +0800
+++ /home/gpadmin/cbdb/src/test/regress/results/workfile/hashjoin_spill.out 2023-10-27 11:44:25.385210186 +0800
@@ -43,7 +49,7 @@
 select avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg
 ----------------------
- 499.5000000000000000
+ 498.7976339559297647
 (1 row)
 
 select * from hashjoin_spill.is_workfile_created('explain (analyze, verbose) SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2');

@avamingli (Collaborator Author) commented:

Got 499.4107171506497674 again.

@avamingli (Collaborator Author) commented Oct 30, 2023

@Ray-Eldath and I have been debugging this for a while. Under high load we reproduce it in about 2–3% of iterations of a 200-iteration loop, but we still can't reproduce it deterministically.

Further experiments show it is related to `SET statement_mem=1024;` combined with parallel hash join.
When the query fails, it returns less data:

 count |   sum    |         avg
-------+----------+----------------------
 37409 | 18716789 | 500.3285038359753001
(1 row)
 count |   sum    |         avg
-------+----------+----------------------
 37398 | 18676962 | 499.4107171506497674
(1 row)
 count |   sum    |         avg
-------+----------+----------------------
 37398 | 18676962 | 499.4107171506497674
(1 row)

The correct results:

 count |   sum    |         avg
-------+----------+----------------------
 45000 | 22477500 | 499.5000000000000000
(1 row)
| failed times (200 loops) | set statement_mem = 1024 | parallel-aware |
| --- | --- | --- |
| 0 | false | true |
| 0 | true | false |
| 0 | false | false |
| 4 | true | true |

The reason may be that CBDB's own memory limit (workfile) logic doesn't account for parallel-aware hash join, since that logic dates back to the initial GPDB commit. We may need to re-implement the workfile feature for parallel-aware hash join (see the sketch below for the batching rule involved).
@my-ship-it @gfphoenix78
Needs more digging.
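
For context on why `statement_mem=1024` (in kB) matters here: below is a minimal, self-contained sketch, my paraphrase of the batching rule in upstream PostgreSQL's `ExecChooseHashTableSize()`. The byte counts are illustrative assumptions, not measurements from this test. It shows how a tiny per-operator budget forces the hash join into multiple batches, i.e. into workfiles, which is the precondition for the lost-batch symptom:

```c
#include <stdio.h>

/*
 * Paraphrase of upstream PostgreSQL's batching rule: keep doubling the
 * number of batches until one batch's share of the inner side fits in
 * the memory budget.  nbatch stays a power of two.
 */
static int
choose_nbatch(double inner_rel_bytes, double space_allowed_bytes)
{
    int nbatch = 1;

    while (inner_rel_bytes / nbatch > space_allowed_bytes)
        nbatch <<= 1;
    return nbatch;
}

int
main(void)
{
    /*
     * 45000 inner rows at an assumed ~48 bytes/tuple against a 1 MB
     * budget: this prints nbatch = 4, i.e. the join must spill.
     */
    printf("nbatch = %d\n", choose_nbatch(45000.0 * 48.0, 1024.0 * 1024.0));
    return 0;
}
```

With any nbatch > 1, both join sides are partitioned into shared batch files, and every participating worker is expected to run the outer-partitioning phase, which is where the later comments point.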

@avamingli (Collaborator Author) commented Oct 30, 2023

Update: I added some logging to print the shared hash table's total rows. A wrong result can appear even when the total row count is correct:
15060 + 15010 + 14930 = 45000.

gpadmin=# select count(i3), avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=661224)
INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=661222)
INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=661221)
INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=661225)
INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=661223)
INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=661220)
 count |         avg
-------+----------------------
 37409 | 500.3285038359753001
(1 row)

gpadmin=# explain(costs off) select count(i3), avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
                               QUERY PLAN
------------------------------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 6:1  (slice1; segments: 6)
         ->  Partial Aggregate
               ->  Parallel Hash Left Join
                     Hash Cond: (t2.i2 = t1.i1)
                     ->  Redistribute Motion 6:6  (slice2; segments: 6)
                           Hash Key: t2.i2
                           Hash Module: 3
                           ->  Parallel Seq Scan on test_hj_spill t2
                     ->  Parallel Hash
                           ->  Parallel Seq Scan on test_hj_spill t1
 Optimizer: Postgres query optimizer
(12 rows)

Then I modified the SQL to an inner join, and the issue could not be reproduced in a 200-iteration loop.

@avamingli (Collaborator Author) commented:

> Then I modified the SQL to an inner join, and the issue could not be reproduced in a 200-iteration loop.

Negative: the inner join also gets wrong results with this plan:

gpadmin=# explain select count(i3), avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1  JOIN test_hj_spill AS t2 ON t1.i2=t2.i1) foo;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=684.26..684.27 rows=1 width=40)
   ->  Gather Motion 6:1  (slice1; segments: 6)  (cost=684.13..684.21 rows=6 width=40)
         ->  Partial Aggregate  (cost=684.13..684.14 rows=1 width=40)
               ->  Parallel Hash Join  (cost=196.75..627.88 rows=7500 width=4)
                     Hash Cond: (t1.i2 = t2.i1)
                     ->  Redistribute Motion 6:6  (slice2; segments: 6)  (cost=0.00..253.00 rows=7500 width=8)
                           Hash Key: t1.i2
                           Hash Module: 3
                           ->  Parallel Seq Scan on test_hj_spill t1  (cost=0.00..103.00 rows=7500 width=8)
                     ->  Parallel Hash  (cost=103.00..103.00 rows=7500 width=4)
                           ->  Parallel Seq Scan on test_hj_spill t2  (cost=0.00..103.00 rows=7500 width=4)
 Optimizer: Postgres query optimizer
(12 rows)

avamingli self-assigned this Oct 31, 2023
@avamingli (Collaborator Author) commented:

Update:
I have identified that the issue is on the outer side of the parallel-aware join; some batches are lost whenever the result is incorrect.
For example, a process on seg0 loses part of its data.

psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890950)
psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890949)
psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890948)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890951)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890952)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890953)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg0 slice1 127.0.1.1:9002 pid=890948)
psql:p.sql:3: INFO:  parallel outer batch has total 7420 rows.  (seg0 slice1 127.0.1.1:9002 pid=890949)
psql:p.sql:3: INFO:  parallel outer batch has total 7603 rows.  (seg2 slice1 127.0.1.1:9004 pid=890952)
psql:p.sql:3: INFO:  parallel outer batch has total 7329 rows.  (seg2 slice1 127.0.1.1:9004 pid=890953)
psql:p.sql:3: INFO:  parallel outer batch has total 7470 rows.  (seg1 slice1 127.0.1.1:9003 pid=890950)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg1 slice1 127.0.1.1:9003 pid=890951)
 count |   sum    |         avg
-------+----------+----------------------
 45000 | 22477500 | 499.5000000000000000
(1 row)

All six processes logged their parallel outer batch.

psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890919)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890921)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890922)
psql:p.sql:3: INFO:  shared hash table has 14930 rows.  (seg2 slice1 127.0.1.1:9004 pid=890923)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=890920)
psql:p.sql:3: INFO:  parallel outer batch has total 7420 rows.  (seg0 slice1 127.0.1.1:9002 pid=890919)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg1 slice1 127.0.1.1:9003 pid=890921)
psql:p.sql:3: INFO:  parallel outer batch has total 7329 rows.  (seg2 slice1 127.0.1.1:9004 pid=890923)
psql:p.sql:3: INFO:  parallel outer batch has total 7603 rows.  (seg2 slice1 127.0.1.1:9004 pid=890922)
psql:p.sql:3: INFO:  parallel outer batch has total 7470 rows.  (seg1 slice1 127.0.1.1:9003 pid=890920)
psql:p.sql:3: INFO:  shared hash table has 15010 rows.  (seg0 slice1 127.0.1.1:9002 pid=890918)
 count |   sum    |         avg
-------+----------+----------------------
 37409 | 18716789 | 500.3285038359753001
(1 row)

Only five processes logged a parallel outer batch; the one from seg0 (pid 890918) is missing.

@avamingli (Collaborator Author) commented:

Adding more logging shows that when wrong results happen, one worker on seg1 did not participate in PHJ_BUILD_HASHING_OUTER (see the sketch after the log excerpt below).

psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel build_barrier total participants: 1.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel build_barrier participants begin, nbatch: 2  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel outer batch begin fetch rows.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  parallel outer batch has total 7592 rows.  (seg1 slice1 127.0.1.1:9003 pid=938113)
psql:p.sql:3: INFO:  shared hash table has 15060 rows.  (seg1 slice1 127.0.1.1:9003 pid=938112)
psql:p.sql:3: INFO:  parallel build_barrier total participants: 1.  (seg1 slice1 127.0.1.1:9003 pid=938112)
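
For readers following the logs: this is the upstream PostgreSQL 14 build-barrier logic being probed, paraphrased from `ExecHashJoinImpl()` in `nodeHashjoin.c`. It is a fragment, not standalone code; `parallel_state` and `node` stand for the surrounding executor state, and whether CBDB's parallel mode matches this exactly is an assumption on my part:

```c
/* Paraphrased from PostgreSQL 14 nodeHashjoin.c, ExecHashJoinImpl(). */
Barrier *build_barrier = &parallel_state->build_barrier;

Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
       BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
{
    /*
     * On time: partition this worker's share of the outer side into
     * shared batch files, then wait for the other participants.
     */
    ExecParallelHashJoinPartitionOuter(node);
    BarrierArriveAndWait(build_barrier, WAIT_EVENT_HASH_BUILD_HASH_OUTER);
}
/*
 * A worker arriving here in PHJ_BUILD_DONE skips outer partitioning
 * entirely.  Upstream that is harmless, since any worker can scan any
 * outer tuple.  But if, as the Redistribute Motion in the plans above
 * suggests, each CBDB worker is the only reader of its own share of
 * the outer side, then a late worker's outer tuples are never written
 * to any batch, consistent with the missing "parallel outer batch"
 * log from one seg1 worker above.
 */
```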

@avamingli (Collaborator Author) commented:

Another failed case: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6729213443/job/18290602808?pr=278

 select avg(i3::numeric) from (SELECT t1.* FROM test_hj_spill AS t1 RIGHT JOIN test_hj_spill AS t2 ON t1.i1=t2.i2) foo;
          avg          
 ----------------------
- 499.5000000000000000
+ 498.7976339559297647
 (1 row)

@avamingli (Collaborator Author) commented:

> Under high load we reproduce it in about 2–3% of iterations of a 200-iteration loop.

A 1000-iteration loop test shows no incorrect results with #284; we could keep this issue open for several months to check the CI results.

@avamingli (Collaborator Author) commented:

> Under high load we reproduce it in about 2–3% of iterations of a 200-iteration loop.

> A 1000-iteration loop test shows no incorrect results with #284; we could keep this issue open for several months to check the CI results.

Comparison (500-loop test):

| commit | succeeded to run SQL | wrong results |
| --- | --- | --- |
| without PR #284 | 470 | 30 |
| with PR #284 | 474 | 0 |

@avamingli (Collaborator Author) commented:

#284 has been merged, let's keep this issue open for a while to see if it's really resolved.

@avamingli (Collaborator Author) commented:

> #284 has been merged, let's keep this issue open for a while to see if it's really resolved.

We haven't hit this issue in the three months since #284 was merged; shall we close this? @my-ship-it

@avamingli (Collaborator Author) commented:

The issue hasn't occurred since #284; closing.
