ADBDEV-5187: Fix case pg_rewind_fail_missing_xlog #902

Merged Apr 14, 2024 (1 commit)

Conversation

Member

@RekGRpth RekGRpth commented Mar 25, 2024

Fix case pg_rewind_fail_missing_xlog

When the fault `checkpoint_after_redo_calculated` or
`checkpoint_control_file_updated` is injected into a newly promoted
mirror node, the connection fails with the fatal log 'mirror is being
promoted'. This happens when the connection state is MIRROR_READY but
the role changes to primary during promotion: the connection is
declined to avoid confusion.

Wait until the segment promotion has finished and the segment accepts
connections before injecting the fault.

Co-authored-by: Xing Guo <higuoxing@gmail.com>

This is a backport of commit 01d9c59.
The connectSeg function has been adapted for the old Python.
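
For readers who have not seen the test helper, the "wait until the segment accepts connections" step can be pictured as a simple polling loop. The sketch below is only an illustration, not the `connectSeg` body from the commit (which is not shown in this conversation). The name `wait_for_connection`, the `try_connect` callable, and the `interval_s`, `clock` and `sleep` parameters are all hypothetical. It avoids newer syntax so that it would also run on an old Python.

```python
import time


def wait_for_connection(try_connect, timeout_s, interval_s=1.0,
                        clock=time.time, sleep=time.sleep):
    """Poll try_connect() until it succeeds or timeout_s seconds elapse.

    try_connect is a hypothetical callable: it returns True once the
    segment accepts the connection, and raises (or returns False) while
    the segment still refuses it.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if try_connect():
                return True
        except Exception:
            # For example FATAL 'mirror is being promoted': the segment
            # is not ready yet, so fall through and retry.
            pass
        if clock() >= deadline:
            # Same message as the one visible in the test output below.
            raise Exception("wait connection timeout")
        sleep(interval_s)
```

In the test itself the helper is called as `select connectSeg(600,port,hostname) ...`; the first argument is presumably the timeout, which would correspond to `timeout_s` in this sketch.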

@BenderArenadata

Allure report https://allure.adsw.io/launch/67378

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1201318

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1201319

@BenderArenadata

Failed job Regression tests with Postgres on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1201310

@BenderArenadata

Failed job Regression tests with ORCA on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1201312

@BenderArenadata

Failed job Regression tests with Postgres on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1201311

@BenderArenadata

Failed job Regression tests with ORCA on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1201313

@BenderArenadata

Allure report https://allure.adsw.io/launch/67991

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1228565

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1228566

@BenderArenadata

Allure report https://allure.adsw.io/launch/68038

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1229441

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1229442

@RekGRpth RekGRpth marked this pull request as ready for review April 1, 2024 12:16
@BenderArenadata

Allure report https://allure.adsw.io/launch/68102

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1231287

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1231288

@BenderArenadata

Allure report https://allure.adsw.io/launch/68323

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1241065

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1241066

@whitehawk

I hit this failure about once in ~20 iterations of the test (run locally). Did you encounter this issue?

@@ -488,10 +2019,11 @@
 
 -- Wait for the segment promotion finished and accept the connection
 3: select connectSeg(600,port,hostname) from gp_segment_configuration where content = 1 and role = 'p';
- connectseg 
-------------
- t          
-(1 row)
+ERROR:  Exception: wait connection timeout (plpy_elog.c:114)
+CONTEXT:  Traceback (most recent call last):
+  PL/Python function "connectseg", line 14, in <module>
+    raise Exception("wait connection timeout")
+PL/Python function "connectseg"
 -- Reset faults and confirm FTS configuration
 3: SELECT gp_inject_fault('wal_sender_loop', 'reset', dbid) FROM gp_segment_configuration WHERE role='p' AND content = 1;
  gp_inject_fault

@whitehawk

There are a couple of issues in the description (carried over from the original commit); it would be nice if you fixed them:

> connection failed error occurs due to the ....

Is a comma missing in "connection failed error occurs"?

> but the role change to primary during promoting

"role change" -> "role changes"?

@RekGRpth
Member Author

RekGRpth commented Apr 5, 2024

> Did you encounter this issue?

No

> There are a couple of issues in the description (carried over from the original commit); it would be nice if you fixed them

changed

@BenderArenadata

Allure report https://allure.adsw.io/launch/68746

@whitehawk

> > Did you encounter this issue?
>
> No

Thus I think it is better not to block on this. But we'll need to monitor the CI tests for a while to ensure that it isn't reproduced there.

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1268021

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1268023

@andr-sokolov
Member

@RekGRpth, Fix Co-authored-by

@RekGRpth
Member Author

RekGRpth commented Apr 11, 2024

> Co-authored-by

fixed

@BenderArenadata

Allure report https://allure.adsw.io/launch/68833

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1272128

@BenderArenadata

Failed job Regression tests with ORCA on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1274526

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1275457

@Stolb27 Stolb27 merged commit fd88a67 into adb-6.x-dev Apr 14, 2024
5 checks passed
@Stolb27 Stolb27 deleted the ADBDEV-5187 branch April 14, 2024 10:52
@Stolb27 Stolb27 mentioned this pull request May 24, 2024
7 participants