[Bug] unstable pg_upgrade failed #262

Open
1 of 2 tasks
avamingli opened this issue Oct 25, 2023 · 14 comments

@avamingli
Collaborator

avamingli commented Oct 25, 2023

Cloudberry Database version

No response

What happened

We have been suffering from this for a long time.

pg_upgrade failed
psql: error: connection to server on socket "/tmp/.s.PGSQL.17432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?
======================================================================

20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Starting gpstop with args: -a
20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Gathering information and validating the environment...
Error: 4:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[ERROR]:-gpstop error: postmaster.pid file does not exist.  is Cloudberry instance already stopped?
/code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok

The target cluster was not shut down cleanly.
Failure, exiting

ERROR: Failure encountered in upgrading qd node
real        0m0.050s
user        0m0.019s
sys        0m0.030s
/code/gpdb_src/src/bin/pg_upgrade /code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
make[1]: *** [Makefile:78: check] Error 1
make: *** [GNUmakefile:194: installcheck-world-src/bin/pg_upgrade-recurse] Error 2
make: Target 'installcheck-world' not remade because of errors.
20231024:08:35:54:031540 gpstart:ip-10-0-1-232:gpadmin-[INFO]:-CoordinatorStart pg_ctl cmd is env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start
[6642](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6643)20231024:08:45:54:031540 gpstart:ip-10-0-1-232:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
[6643](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6644) Command was: 'env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start'
[6644](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6645)rc=1, stdout='waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
[6645](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6646)', stderr='pg_ctl: server did not start in time
------------------------

It seems gpstart times out after the binaries are switched from gpdb5 to gpdb6.
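As a quick sanity check on the timeout reading, the two gpstart timestamps in the log above can be subtracted. This is only a sketch: it assumes the leading log fields are `YYYYMMDD:HH:MM:SS:<pid>`, and the helper name `parse_gp_timestamp` is made up for illustration.

```python
from datetime import datetime


def parse_gp_timestamp(line: str) -> datetime:
    """Parse the leading YYYYMMDD:HH:MM:SS part of a gpstart/gpstop log line.

    Assumption: the fifth colon-separated field is the process id, so only
    the first four fields are used.
    """
    stamp = ":".join(line.split(":", 4)[:4])
    return datetime.strptime(stamp, "%Y%m%d:%H:%M:%S")


# Prefixes copied from the two gpstart lines in the log excerpt above.
started = parse_gp_timestamp("20231024:08:35:54:031540 gpstart:ip-10-0-1-232:gpadmin-[INFO]:-CoordinatorStart")
failed = parse_gp_timestamp("20231024:08:45:54:031540 gpstart:ip-10-0-1-232:gpadmin-[CRITICAL]:-Error occurred")

elapsed_seconds = (failed - started).total_seconds()
print(elapsed_seconds)
```

The gap between the two lines matches the `-t 600` passed to pg_ctl, which is consistent with pg_ctl waiting out its full timeout rather than the server crashing early.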

What you think should happen instead

No response

How to reproduce

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260

Operating System

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

@avamingli avamingli added the type: Bug Something isn't working label Oct 25, 2023
@avamingli
Collaborator Author

AFAIK, increasing the timeout would not resolve this issue, and the -t 600 comes from gpstart's parameters in the CBDB CI. If we change it, all components are affected.

@avamingli
Collaborator Author

Update: we have increased CI resources to try to fix it.

@Ray-Eldath
Contributor

Ray-Eldath commented Nov 6, 2023

@avamingli
Collaborator Author

@Ray-Eldath
Contributor

@lss602726449
Contributor

lss602726449 commented Nov 6, 2023

The problem Ray-Eldath mentioned is not the same problem as this one. His problem is discussed below.
The standby QE is not ready to accept connections when the QD sends the request.
For the test, maybe we should sleep for a while. A better solution may be for the QD's FTS to wait until the standby QE is ready.

2023-11-03 15:06:33.031038 UTC,,,p31860,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8477,
2023-11-03 15:06:33.034261 UTC,"gpadmin","isolation2test",p32307,th-841484160,"10.0.2.31","47820",2023-11-03 15:06:33 UTC,0,con266,,seg1,,,,,"FATAL","57P03","the database system is not accepting connections","Hot standby mode is disabled.",,,,,,0,,"postmaster.c",2747,
2023-11-03 15:06:33.034283 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,,0,,"postmaster.c",3556,
2023-11-03 15:06:33.034293 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,0,,"postmaster.c",3558,
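To illustrate the difference between a fixed sleep and waiting for readiness, here is a minimal polling sketch. It is not the FTS implementation: `wait_until_ready` and its `probe` callback are hypothetical names, and in a real test the probe would be whatever connection attempt currently fails with "the database system is not accepting connections".

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Call probe() until it returns True or `timeout` seconds have passed.

    Returns True as soon as the probe succeeds, False if the deadline is
    reached first. The probe is always tried at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in probe for illustration: fails twice, then succeeds, similar to a
# segment that accepts connections a few milliseconds after the first request.
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout=5.0, interval=0.0)
print(ready)
```

Compared with a fixed sleep, this returns as soon as the segment is reachable and still fails loudly if it never comes up.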

@Ray-Eldath
Contributor

@avamingli
Collaborator Author

@wenchaozhang-123
Contributor

@avamingli
Collaborator Author

@yjhjstz
Collaborator

yjhjstz commented Nov 17, 2023

@smartyhero please try setting MAX_CONNECTIONS to 5 or 10 in workflows/release.yml to limit resource usage.

@smartyhero
Contributor

Okay, I'll take care of it

@smartyhero
Contributor

PR has been created: #308

@Ray-Eldath
Contributor

Ray-Eldath commented Nov 21, 2023

Two unstable tests (this one and #301), both caused by an occupied port on the VM, have not recurred since the VM image was rebuilt yesterday. This is kind of strange, because the only change in that rebuild was adding tmux as a new package...

I'll keep rerunning in #306 to see whether this resurfaces.


...and it failed in no time :-( https://github.com/cloudberrydb/cloudberrydb/actions/runs/6938950626/job/18875704817?pr=306
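If the occupied-port theory is worth chasing, a pre-flight check before gpstart could at least log whether the coordinator port (17432 in the log above) is already taken. A rough sketch, assuming a plain TCP bind test on the loopback address is representative; `port_is_free` is a made-up helper, it does not look at the Unix socket file under /tmp, and the result can go stale between the check and the actual start.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else already holds the port.
            return False
        return True


# Demonstration: hold an ephemeral port ourselves, then probe it.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
holder.listen(1)
busy_port = holder.getsockname()[1]
busy_while_held = not port_is_free(busy_port)
holder.close()
print(busy_while_held)
```

In CI this would be called as `port_is_free(17432)` right before the start step, and a False result would point at a leftover process from an earlier job.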
