[Bug] unstable pg_upgrade failed #262

avamingli · 2023-10-25T01:42:34Z

Cloudberry Database version

No response

What happened

We have been suffering from this for a long time.

pg_upgrade failed
psql: error: connection to server on socket "/tmp/.s.PGSQL.17432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?
======================================================================

20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Starting gpstop with args: -a
20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Gathering information and validating the environment...
Error: 4:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[ERROR]:-gpstop error: postmaster.pid file does not exist.  is Cloudberry instance already stopped?
/code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok

The target cluster was not shut down cleanly.
Failure, exiting

ERROR: Failure encountered in upgrading qd node
real        0m0.050s
user        0m0.019s
sys        0m0.030s
/code/gpdb_src/src/bin/pg_upgrade /code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
make[1]: *** [Makefile:78: check] Error 1
make: *** [GNUmakefile:194: installcheck-world-src/bin/pg_upgrade-recurse] Error 2
make: Target 'installcheck-world' not remade because of errors.
20231024:08:35:54:031540 gpstart:ip-10-0-1-232:gpadmin-[INFO]:-CoordinatorStart pg_ctl cmd is env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start
[6642](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6643)20231024:08:45:54:031540 gpstart:ip-10-0-1-232:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
[6643](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6644) Command was: 'env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start'
[6644](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6645)rc=1, stdout='waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
[6645](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6646)', stderr='pg_ctl: server did not start in time
------------------------

It seems gpstart times out after the binaries are switched from gpdb5 to gpdb6.
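As a quick cross-check of the timeout theory, the two gpstart timestamps in the log above (pg_ctl launched at 20231024:08:35:54, failure reported at 20231024:08:45:54) can be subtracted. A minimal Python sketch, assuming the leading YYYYMMDD:HH:MM:SS part of each log prefix is the wall-clock time; the helper name is made up for illustration:

```python
from datetime import datetime


def parse_gp_timestamp(prefix: str) -> datetime:
    """Parse the leading YYYYMMDD:HH:MM:SS part of a gpstart/gpstop log prefix.

    Anything after the seconds field (e.g. ":031540") is ignored.
    """
    return datetime.strptime(prefix[:17], "%Y%m%d:%H:%M:%S")


# Both prefixes are copied from the gpstart lines in the log above.
started = parse_gp_timestamp("20231024:08:35:54:031540")
failed = parse_gp_timestamp("20231024:08:45:54:031540")

elapsed = int((failed - started).total_seconds())
print(elapsed)  # 600
```

The 600 seconds match the -t 600 passed to pg_ctl, so the server did not report ready at any point in the whole wait window.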

What you think should happen instead

No response

How to reproduce

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260

Operating System

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260

Anything else

No response

Are you willing to submit PR?

Yes, I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct.

avamingli · 2023-10-25T01:57:49Z

AFAIK, increasing the timeout would not resolve this issue. The -t 600 comes from gpstart's parameter in the CBDB CI; if we change it, all components are affected.

avamingli · 2023-10-27T01:15:10Z

Update: we have increased CI resources to try to fix it.

Ray-Eldath · 2023-11-06T02:40:05Z

increasing ci resources doesn't help. https://github.com/cloudberrydb/cloudberrydb/actions/runs/6744120555/job/18339332696

avamingli · 2023-11-06T02:46:03Z

another failure: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6751856892/job/18383806659?pr=284

Ray-Eldath · 2023-11-06T02:49:01Z

increasing ci resources doesn't help. https://github.com/cloudberrydb/cloudberrydb/actions/runs/6744120555/job/18339332696

db internal log can be downloaded at https://github.com/cloudberrydb/cloudberrydb/suites/17879327237/artifacts/1027090491

lss602726449 · 2023-11-06T09:44:05Z

The problem Ray-Eldath mentioned is not the same as the one in this issue. His problem is discussed below.
The standby QE is not yet ready for connections when the QD sends the request.
For the test, maybe we should sleep for a while. A better solution may be for the QD's FTS to wait until the standby QE is ready.

2023-11-03 15:06:33.031038 UTC,,,p31860,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8477,
2023-11-03 15:06:33.034261 UTC,"gpadmin","isolation2test",p32307,th-841484160,"10.0.2.31","47820",2023-11-03 15:06:33 UTC,0,con266,,seg1,,,,,"FATAL","57P03","the database system is not accepting connections","Hot standby mode is disabled.",,,,,,0,,"postmaster.c",2747,
2023-11-03 15:06:33.034283 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,,0,,"postmaster.c",3556,
2023-11-03 15:06:33.034293 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,0,,"postmaster.c",3558,
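The "wait until ready" idea above, as opposed to a fixed sleep, could be sketched as a bounded polling loop. This is only an illustration in Python, not how the test harness or FTS is implemented: the function names are hypothetical, and using the standard PostgreSQL `pg_isready` utility as the readiness check (and it reporting a Cloudberry segment's state the way the test needs) is an assumption.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout: float = 30.0,
                     interval: float = 0.5) -> bool:
    """Call probe() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def pg_isready_probe(host: str, port: int) -> Callable[[], bool]:
    """Build a probe that runs pg_isready; exit code 0 means 'accepting connections'."""
    def probe() -> bool:
        result = subprocess.run(
            ["pg_isready", "-h", host, "-p", str(port)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe
```

A caller would use something like `wait_until_ready(pg_isready_probe("localhost", port), timeout=60)` with the segment's port, and treat a `False` result as a test failure instead of sending the request to a segment that is still rejecting connections.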

Ray-Eldath · 2023-11-08T10:35:48Z

another failure: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6794967552/job/18472551915?pr=290

avamingli · 2023-11-10T04:56:48Z

pg_upgrade failed once again https://github.com/cloudberrydb/cloudberrydb/actions/runs/6820307820/job/18549229016?pr=294

wenchaozhang-123 · 2023-11-10T09:30:36Z

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6822019838/job/18553628928

avamingli · 2023-11-14T04:36:50Z

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6858612287/job/18649896679?pr=298

yjhjstz · 2023-11-17T02:46:44Z

@smartyhero please try setting MAX_CONNECTIONS = 5 or 10 in workflows/release.yml to control resources.

smartyhero · 2023-11-20T02:58:27Z

Okay, I'll take care of it

smartyhero · 2023-11-20T03:36:27Z

PR has been created: #308

Ray-Eldath · 2023-11-21T03:46:23Z

two unstable tests (this one and #301), both caused by an occupied port on the vm, have not recurred since the vm image was rebuilt yesterday. this is kinda strange because the only change during that rebuild was adding tmux as a new package...

i'll keep rerunning #306 to see whether this resurfaces.

...and it failed in no time :-( https://github.com/cloudberrydb/cloudberrydb/actions/runs/6938950626/job/18875704817?pr=306
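If the occupied-port theory needs checking on the runner, one option is to probe the port before gpstart is invoked. A minimal Python sketch, assuming the port of interest is the coordinator port 17432 from the log in the issue description; the function name is hypothetical, and this only detects a process listening on TCP, not a stale Unix socket or lock file under /tmp:

```python
import socket


def port_is_listening(port: int, host: str = "127.0.0.1",
                      timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Logging `port_is_listening(17432)` right before the cluster start would show whether a leftover process from an earlier step still holds the port when the failure happens.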

avamingli added the type: Bug Something isn't working label Oct 25, 2023

Ray-Eldath assigned lss602726449 Nov 6, 2023

avamingli mentioned this issue Nov 7, 2023

Revert "Fix explain bad indent when showing operatorMem." #289

Merged

avamingli mentioned this issue Nov 10, 2023

Use pg_class instead of gp_segment_configuration to test Entry. #294

Merged

9 tasks

Ray-Eldath mentioned this issue Nov 20, 2023

[Bug] Cluster down during regress/createdb #301

Open

2 tasks

Ray-Eldath assigned Ray-Eldath and unassigned lss602726449 Nov 21, 2023
