Skip to content

Latest commit

 

History

History
89 lines (68 loc) · 3.83 KB

manual_failovers.md

File metadata and controls

89 lines (68 loc) · 3.83 KB

Manual failover

While automatic failures are already baked-in, there may be times where a manually issued failover is necessary. The steps to perform a manual failover are listed below:

Note: The promotion candidate must reside within your PRIMARY_REGION.

1. Connect to the Machine you wish to promote

fly ssh console -s -a <app-name>

2. Confirm the member is healthy

# Switch to the postgres user and move to the home directory.
su postgres
cd ~

# Verify member is healthy
repmgr node check

Node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2":
	Server role: OK (node is standby)
	Replication lag: OK (0 seconds)
	WAL archiving: OK (0 pending archive ready files)
	Upstream connection: OK (node "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) is attached to expected upstream node "fdaa:0:2e26:a7b:c850:86cf:b175:2" (ID: 1177616922))
	Downstream servers: OK (this node has no downstream nodes)
	Replication slots: OK (node has no physical replication slots)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "/data/postgresql")

2. Stop the Machine running as primary

Open up a separate terminal and stop the Machine running primary.

# Identify the primary
fly status --app <app-name>

ID            	STATE  	ROLE   	REGION	HEALTH CHECKS     	IMAGE                                           	CREATED             	UPDATED
6e8226ec711087	started	replica	lax   	3 total, 3 passing	davissp14/postgres-flex:recovery-fix-00 (custom)	2023-02-15T20:20:51Z	2023-02-15T20:21:10Z
6e82931b729087	started	primary	lax   	3 total, 3 passing	davissp14/postgres-flex:recovery-fix-00 (custom)	2023-02-15T20:19:58Z	2023-02-15T20:20:18Z
9185957f411683	started	replica	lax   	3 total, 3 passing	davissp14/postgres-flex:recovery-fix-00 (custom)	2023-02-15T20:20:24Z	2023-02-15T20:20:45Z


fly machines stop 6e82931b729087 --app <app-name>

3. Run the standby promotion command Go back to the first terminal you opened that's connected to your promotion candidate.

# Issue a dry-run to ensure our candidate is eligible for promotion.
repmgr standby promote --siblings-follow --dry-run

INFO: node is a standby
INFO: no active primary server found in this replication cluster
INFO: all sibling nodes are reachable via SSH
INFO: 1 walsenders required, 10 available
INFO: 1 replication slots required, 10 available
INFO: node will be promoted using the "pg_promote()" function
INFO: prerequisites for executing STANDBY PROMOTE are met

WARNING: It's important that you specify --siblings-follow, otherwise any other standbys will not be reconfigured to follow the new primary.

If everything looks good, go ahead and re-run the command without the --dry-run argument.

repmgr standby promote --siblings-follow

NOTICE: promoting standby to primary
DETAIL: promoting server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "fdaa:0:2e26:a7b:7d17:9b36:6e4b:2" (ID: 1375486377) was successfully promoted to primary
INFO: executing notification command for event "standby_promote"
DETAIL: command is:
  /usr/local/bin/event_handler -node-id 1375486377 -event standby_promote -success 1 -details "server \"fdaa:0:2e26:a7b:7d17:9b36:6e4b:2\" (ID: 1375486377) was successfully promoted to primary" -new-node-id ''
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes

4. Start the Machine that was previously operating as Primary

fly machines start 6e82931b729087 --app <app-name>

The primary will come back up and recognizing that it's no longer the true primary and will rejoin the cluster as a standby.