Log monitor health changes as events. #703

Merged
2 commits merged on May 20, 2021

DimCitus
Collaborator

@DimCitus DimCitus commented May 7, 2021

Fixes #700
Fixed #701

@DimCitus DimCitus added enhancement New feature or request Size: S Effort Estimate: Small labels May 7, 2021
@DimCitus DimCitus added this to the Sprint 2021 W18 W19 milestone May 7, 2021
@DimCitus DimCitus requested a review from JelteF May 7, 2021 11:20
@DimCitus DimCitus self-assigned this May 7, 2021
@JelteF
Contributor

JelteF commented May 19, 2021

When I try this locally using make cluster, the monitor's Postgres instance crashes.

I ran make cluster like this:

make cluster -j20 TMUX_LAYOUT=tiled NODES=3

The output looks like this once the beginning of the setup completes:

17:40:17 3789 INFO  New state for node 3 "node3" (localhost:5503): wait_standby ➜ catchingup
17:40:17 3789 INFO  Setting goal state of node 2 "node2" (localhost:5502) to catchingup after node 1 "node1" (localhost:5501) converged to wait_primary.
17:40:17 3789 INFO  New state for node 2 "node2" (localhost:5502): wait_standby ➜ catchingup
17:40:19 3789 WARN  WARNING:  terminating connection because of crash of another server process
17:40:19 3789 WARN  DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
17:40:19 3789 WARN  HINT:  In a moment you should be able to reconnect to the database and repeat your command.
17:40:19 3789 ERROR Failed to LISTEN "state": server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
....

@DimCitus DimCitus force-pushed the fix/monitor-health-check-events branch from b9a99ce to 52a3f58 Compare May 19, 2021 16:23
@DimCitus
Collaborator Author

As per our private messages, it appears that the problem is:

TRAP: FailedAssertion("!(spiStatus == 9)", File: "health_check_metadata.c", Line: 213)
2021-05-19 18:00:22.878 CEST [4611] LOG:  background worker "pg_auto_failover monitor worker" (PID 4631) was terminated by signal 6: Aborted
2021-05-19 18:00:22.878 CEST [4611] DETAIL:  Failed process was running: UPDATE pgautofailover.node   SET health = 1, healthchecktime = now()  WHERE nodeid = 1    AND nodehost = 'localhost' AND nodeport = 5501  RETURNING node.*

I have updated the PR now:

modified   src/monitor/health_check_metadata.c
@@ -210,7 +210,7 @@ SetNodeHealthState(int nodeId,
 		pgstat_report_activity(STATE_RUNNING, query.data);
 
 		spiStatus = SPI_execute(query.data, false, 0);
-		Assert(spiStatus == SPI_OK_UPDATE);
+		Assert(spiStatus == SPI_OK_UPDATE_RETURNING);
 
 		if (healthState != previousHealthState)
 		{
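
For context on why the one-line change above matters: the TRAP shows the assertion comparing spiStatus against 9 (SPI_OK_UPDATE), while the failing query ends in RETURNING node.*, and SPI_execute reports SPI_OK_UPDATE_RETURNING for such a query. Below is a minimal standalone sketch of that distinction. It assumes the constant values mirror PostgreSQL's executor/spi.h (9 is confirmed by the TRAP message; 13 for SPI_OK_UPDATE_RETURNING is recalled from that header and should be checked against it). The helpers expected_update_status and update_status_is_expected are hypothetical and are not part of pg_auto_failover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Assumption: values mirrored from PostgreSQL's executor/spi.h so that this
 * sketch compiles without the server headers. Real extension code should
 * include executor/spi.h and use its definitions instead.
 */
#define SPI_OK_UPDATE           9
#define SPI_OK_UPDATE_RETURNING 13

/* The two statuses are distinct, so asserting on the wrong one fails. */
static_assert(SPI_OK_UPDATE != SPI_OK_UPDATE_RETURNING,
			  "UPDATE and UPDATE ... RETURNING report different statuses");

/*
 * Hypothetical helper: status a successful UPDATE is expected to report.
 * The RETURNING detection is a naive substring match, good enough for a
 * sketch but not a SQL parser.
 */
int
expected_update_status(const char *query)
{
	return strstr(query, "RETURNING") != NULL
		   ? SPI_OK_UPDATE_RETURNING
		   : SPI_OK_UPDATE;
}

/* Hypothetical helper: does the observed status match the query shape? */
bool
update_status_is_expected(const char *query, int spiStatus)
{
	return spiStatus == expected_update_status(query);
}
```

With the query from the log, update_status_is_expected(query, SPI_OK_UPDATE) is false, which is the condition the original Assert tripped on. Note that PostgreSQL's Assert macro is only active in builds configured with --enable-cassert, so a mismatch like this typically surfaces in development builds rather than in production ones.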

@DimCitus DimCitus merged commit d945ca1 into master May 20, 2021
@DimCitus DimCitus deleted the fix/monitor-health-check-events branch May 20, 2021 08:44