Log monitor health changes as events. #703

DimCitus · 2021-05-07T11:20:51Z

Fixes #700
Fixes #701

JelteF · 2021-05-19T15:45:00Z

When trying this locally using make cluster, the monitor's Postgres crashes for me.

I ran make cluster like this:

make cluster -j20 TMUX_LAYOUT=tiled NODES=3

The output looks like this once the beginning of the setup completes:

17:40:17 3789 INFO  New state for node 3 "node3" (localhost:5503): wait_standby ➜ catchingup
17:40:17 3789 INFO  Setting goal state of node 2 "node2" (localhost:5502) to catchingup after node 1 "node1" (localhost:5501) converged to wait_primary.
17:40:17 3789 INFO  New state for node 2 "node2" (localhost:5502): wait_standby ➜ catchingup
17:40:19 3789 WARN  WARNING:  terminating connection because of crash of another server process
17:40:19 3789 WARN  DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
17:40:19 3789 WARN  HINT:  In a moment you should be able to reconnect to the database and repeat your command.
17:40:19 3789 ERROR Failed to LISTEN "state": server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
....

DimCitus · 2021-05-19T16:24:20Z

As per our private messages, it appears that the problem is:

TRAP: FailedAssertion("!(spiStatus == 9)", File: "health_check_metadata.c", Line: 213)
2021-05-19 18:00:22.878 CEST [4611] LOG:  background worker "pg_auto_failover monitor worker" (PID 4631) was terminated by signal 6: Aborted
2021-05-19 18:00:22.878 CEST [4611] DETAIL:  Failed process was running: UPDATE pgautofailover.node   SET health = 1, healthchecktime = now()  WHERE nodeid = 1    AND nodehost = 'localhost' AND nodeport = 5501  RETURNING node.*

I have updated the PR now:

modified   src/monitor/health_check_metadata.c
@@ -210,7 +210,7 @@ SetNodeHealthState(int nodeId,
 		pgstat_report_activity(STATE_RUNNING, query.data);
 
 		spiStatus = SPI_execute(query.data, false, 0);
-		Assert(spiStatus == SPI_OK_UPDATE);
+		Assert(spiStatus == SPI_OK_UPDATE_RETURNING);
 
 		if (healthState != previousHealthState)
 		{
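
For context on the one-line fix: the failing query in the log ends with RETURNING node.*, and SPI_execute reports SPI_OK_UPDATE_RETURNING for such a statement rather than SPI_OK_UPDATE, which is why the old assertion (printed by TRAP with the expanded value 9) fired. Below is a minimal standalone sketch of that distinction. It does not include PostgreSQL headers: the two constants are copied from executor/spi.h, and every sketch_* name is hypothetical and not part of pg_auto_failover or of the SPI API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Values copied from PostgreSQL's executor/spi.h so that this sketch
 * compiles on its own. The 9 matches the "spiStatus == 9" TRAP message.
 */
enum
{
	SKETCH_SPI_OK_UPDATE = 9,
	SKETCH_SPI_OK_UPDATE_RETURNING = 13
};

/* Hypothetical helper: symbolic name for the two codes discussed here. */
const char *
sketch_spi_status_name(int code)
{
	switch (code)
	{
		case SKETCH_SPI_OK_UPDATE:
			return "SPI_OK_UPDATE";

		case SKETCH_SPI_OK_UPDATE_RETURNING:
			return "SPI_OK_UPDATE_RETURNING";

		default:
			return "unknown";
	}
}

/*
 * Hypothetical helper: status to expect from an UPDATE statement.
 * The substring test is deliberately naive (case-sensitive, and it does
 * not parse SQL); it only illustrates why the assertion had to change.
 */
int
sketch_expected_update_status(const char *query)
{
	if (strstr(query, "RETURNING") != NULL)
	{
		return SKETCH_SPI_OK_UPDATE_RETURNING;
	}
	return SKETCH_SPI_OK_UPDATE;
}

/* Usage example: replays the shape of the query from the log above. */
void
sketch_demo(void)
{
	const char *query =
		"UPDATE pgautofailover.node SET health = 1, healthchecktime = now() "
		"WHERE nodeid = 1 RETURNING node.*";
	int expected = sketch_expected_update_status(query);

	assert(expected == SKETCH_SPI_OK_UPDATE_RETURNING);
	assert(expected != SKETCH_SPI_OK_UPDATE);
}
```

Note that Assert() in the extension is only active in assert-enabled builds, which is presumably why the crash showed up in a local development build.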

DimCitus added enhancement New feature or request Size: S Effort Estimate: Small labels May 7, 2021

DimCitus added this to the Sprint 2021 W18 W19 milestone May 7, 2021

DimCitus requested a review from JelteF May 7, 2021 11:20

DimCitus self-assigned this May 7, 2021

DimCitus modified the milestones: Sprint 2021 W18 W19, Sprint 2021 W20 W21 May 17, 2021

DimCitus added 2 commits May 19, 2021 18:23

Log monitor health changes as events.

7e0cc57

Per review.

52a3f58

DimCitus force-pushed the fix/monitor-health-check-events branch from b9a99ce to 52a3f58 Compare May 19, 2021 16:23

JelteF approved these changes May 20, 2021

DimCitus merged commit d945ca1 into master May 20, 2021

DimCitus deleted the fix/monitor-health-check-events branch May 20, 2021 08:44
