Fencing a failed primary #49

Merged · 27 commits · Jan 30, 2023
Conversation

@davissp14 (Contributor) commented Jan 28, 2023

This PR aims to ensure that a booting primary that has diverged from the cluster is properly isolated.

Note: There's still some cleanup to do, but the core logic is there.

What this doesn't solve

Split-brain scenarios resulting from a network partition.

How do we verify the real primary?

We start by evaluating the cluster state, checking each registered standby for connectivity and asking it who its primary is.

The "clusters state" is represented across a few different dimensions:

Total members
Number of registered members, including the primary.

Total active members
Number of members that are responsive. This includes the primary we are evaluating, so this will never be less than one.

Total inactive members
Number of registered members that are non-responsive.

Conflict map
The conflict map is a map[string]int that tracks the conflicting primaries reported by our standbys and the number of times each one was referenced.

As an example, say we have a 3-member cluster and both standbys indicate that their registered primary does not match us. This will be recorded as:

map[string]int{
  "fdaa:0:2e26:a7b:8c31:bf37:488c:2": 2
}

The real primary is resolvable so long as a majority of members can agree on who it is, with quorum defined as total_members / 2 + 1.

There is one exception to note here: even when self meets quorum, we will still fence ourselves if a conflict is found. This protects against a possible race condition where we come up during an active failover.
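To make the decision concrete, here is a minimal sketch of the diagnosis described above. The function name, signature, and error value are illustrative only and are not necessarily what this PR implements.

import "errors"

// ErrZombieDiscovered signals that this member could not be confirmed as the
// real primary and must be fenced.
var ErrZombieDiscovered = errors.New("zombie discovered")

// zombieDiagnosis evaluates the dimensions described above. hostname is the
// booting primary, totalMembers and totalActive are the registered/responsive
// counts (both include self), and conflictMap counts how often each
// conflicting primary was reported by the standbys. It returns the resolved
// primary, or ErrZombieDiscovered when this member must be fenced.
func zombieDiagnosis(hostname string, totalMembers, totalActive int, conflictMap map[string]int) (string, error) {
	quorum := totalMembers/2 + 1

	totalConflicts := 0
	for _, count := range conflictMap {
		totalConflicts += count
	}

	// Any conflict fences this member, even if it still meets quorum itself,
	// to protect against racing an active failover.
	if totalConflicts > 0 {
		// The real primary is still resolvable if a majority agrees on it.
		for candidate, count := range conflictMap {
			if count >= quorum {
				return candidate, ErrZombieDiscovered
			}
		}
		// No candidate reached quorum: fenced and not automatically recoverable.
		return "", ErrZombieDiscovered
	}

	// No conflicts: we are the real primary as long as enough members are
	// reachable to form a quorum.
	if totalActive >= quorum {
		return hostname, nil
	}

	return "", ErrZombieDiscovered
}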

Tests can be found here: https://github.com/fly-apps/postgres-flex/pull/49/files#diff-3d71960ff7855f775cb257a74643d67d2636b354c9d485d10c2ded2426a7f362

What if the real primary can't be resolved or doesn't match the booting primary?

In both of these instances the primary member will be fenced.

If the real primary is resolvable
The cluster will be made read-only, PGBouncer will be reconfigured to target the "real" primary, and that primary's IP address will be written to a zombie.lock file. The PGBouncer reconfiguration ensures that any connections hitting this member are routed to the real primary in order to minimize interruptions. Once that's complete, we panic to force a full member restart. When the member restarts, we read the IP address from the zombie.lock file and use it to attempt to rejoin the cluster we diverged from. If we are successful, the zombie.lock is cleared and we boot as a standby.

Note: We will not attempt to rejoin a cluster if the resolved primary resides in a region that differs from the PRIMARY_REGION environment variable set on self. The PRIMARY_REGION will need to be updated before a rejoin will be attempted.

If the real primary is NOT resolvable
The cluster will be made read-only, PGBouncer will remain disabled, and a zombie.lock file will be created without a value. When the member reboots, we read the zombie.lock file and see that it's empty, which indicates that we've entered a failure mode that can't be recovered automatically. This could be an issue where previously deleted members were not properly unregistered, or the primary's state has diverged to the point where its registered members have been cycled out.

How to address these situations will be a part of future documentation.
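For illustration, the boot-time handling of zombie.lock described above might look roughly like the sketch below. The file path matches the one used by writeZombieLock later in the diff, but handleZombieLock and rejoinCluster are hypothetical names, and the PRIMARY_REGION check is omitted for brevity.

import (
	"context"
	"fmt"
	"os"
	"strings"
)

// rejoinCluster stands in for the actual repmgr-based rejoin logic.
func rejoinCluster(ctx context.Context, primaryIP string) error { return nil }

// handleZombieLock is a sketch of the boot path: read the lock, rejoin the
// resolved primary if one was recorded, otherwise require manual intervention.
func handleZombieLock(ctx context.Context) error {
	data, err := os.ReadFile("/data/zombie.lock")
	if os.IsNotExist(err) {
		return nil // we were never fenced; boot normally
	}
	if err != nil {
		return fmt.Errorf("failed to read zombie.lock: %w", err)
	}

	primaryIP := strings.TrimSpace(string(data))
	if primaryIP == "" {
		// The real primary was never resolved; this can't be recovered automatically.
		return fmt.Errorf("zombie.lock is empty; manual intervention required")
	}

	// Attempt to rejoin the cluster we diverged from, then clear the lock and
	// boot as a standby.
	if err := rejoinCluster(ctx, primaryIP); err != nil {
		return fmt.Errorf("failed to rejoin cluster via %s: %w", primaryIP, err)
	}
	if err := os.Remove("/data/zombie.lock"); err != nil {
		return fmt.Errorf("failed to clear zombie.lock: %w", err)
	}

	return nil
}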

@davissp14 marked this pull request as ready for review January 28, 2023 20:12
@davissp14 changed the title from "Fence failed primary" to "Fencing a failed primary" Jan 28, 2023
Comment on lines 332 to 345
	if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
		return fmt.Errorf("failed to reconfigure pgbouncer: %s", err)
	}
}
// Create a zombie.lock file containing the resolved primary.
// This will be an empty string if we are unable to resolve the real primary.
if err := writeZombieLock(primary); err != nil {
	return fmt.Errorf("failed to set zombie lock: %s", err)
}

fmt.Println("Setting all existing tables to read-only")
if err := admin.SetReadOnly(ctx, conn); err != nil {
	return fmt.Errorf("failed to set read-only: %s", err)
}
Member

I think we shouldn't return if we can't reconfigure pgbouncer here, since we should still set existing tables to readonly
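Concretely, the suggestion amounts to logging the failure and falling through, e.g. (sketch only):

// Log the PGBouncer failure instead of returning, so the read-only
// fencing below still runs.
if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
	fmt.Printf("failed to reconfigure pgbouncer: %s\n", err)
}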

Contributor Author

Yeah, I'll need to re-evaluate the returns a bit more. Good call.

pkg/flypg/node.go (resolved)
"testing"
)

func TestZombieDiagnosis(t *testing.T) {
Member

This is excellent. This makes it easier for me to continue to threaten to make real integration tests :)

Comment on lines +291 to +296
mConn, err := repmgr.NewRemoteConnection(ctx, standby.Hostname)
if err != nil {
	fmt.Printf("failed to connect to %s", standby.Hostname)
	totalInactive++
	continue
}


Should these mConn connections be closed? It doesn't look like they're used after passing to PrimaryMember().

Contributor Author

Yes they should!
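For example (sketch; this assumes the connection exposes a pgx-style Close(ctx), and closes explicitly rather than deferring inside the loop so connections aren't held open until the surrounding function returns):

mConn, err := repmgr.NewRemoteConnection(ctx, standby.Hostname)
if err != nil {
	fmt.Printf("failed to connect to %s", standby.Hostname)
	totalInactive++
	continue
}
// ... query the standby's registered primary via mConn ...
mConn.Close(ctx)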

Comment on lines 315 to 351
if err != nil {
	if errors.Is(err, ErrZombieDiscovered) {
		fmt.Println("Unable to confirm we are the real primary!")
		fmt.Printf("Registered members: %d, Active member(s): %d, Inactive member(s): %d, Conflicts detected: %d\n",
			totalMembers,
			totalActive,
			totalInactive,
			totalConflicts,
		)

		fmt.Println("Identifying ourself as a Zombie")

		// If primary is non-empty we were able to build a consensus on who the
		// real primary is, and we should be recoverable on reboot.
		if primary != "" {
			fmt.Printf("Majority of members agree that %s is the real primary\n", primary)
			fmt.Println("Reconfiguring PGBouncer to point to the real primary")
			if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
				return fmt.Errorf("failed to reconfigure pgbouncer: %s", err)
			}
		}
		// Create a zombie.lock file containing the resolved primary.
		// This will be an empty string if we are unable to resolve the real primary.
		if err := writeZombieLock(primary); err != nil {
			return fmt.Errorf("failed to set zombie lock: %s", err)
		}

		fmt.Println("Setting all existing tables to read-only")
		if err := admin.SetReadOnly(ctx, conn); err != nil {
			return fmt.Errorf("failed to set read-only: %s", err)
		}

		return fmt.Errorf("zombie primary detected. Use `fly machines restart <machine-id>` to rejoin the cluster or consider removing this node")
	}

	return fmt.Errorf("failed to run zombie diagnosis: %s", err)
}


You can reduce nesting a bit by moving the errors.Is() outside the outer if:

if errors.Is(err, ErrZombieDiscovered) {
	...
} else if err != nil {
	...
}

@@ -322,6 +446,11 @@ func (n *Node) configure(store *state.Store) error {
fmt.Println(err.Error())
}

// Clear target and wait for primary resolution
if err := n.PGBouncer.ConfigurePrimary(ctx, "", false); err != nil {
fmt.Println(err.Error())


Does this not need to be bubbled up?

Comment on lines 17 to 23
func writeZombieLock(hostname string) error {
	if err := ioutil.WriteFile("/data/zombie.lock", []byte(hostname), 0644); err != nil {
		return err
	}

	return nil
}


Minor nit. ioutil is deprecated. You can use io.WriteFile() instead.

Contributor Author

Oh, right. Are you talking about os.WriteFile?
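With os.WriteFile (the non-deprecated replacement for ioutil.WriteFile), the helper becomes something like this sketch:

import "os"

// writeZombieLock persists the resolved primary (or an empty string) so it
// can be read back on the next boot.
func writeZombieLock(hostname string) error {
	return os.WriteFile("/data/zombie.lock", []byte(hostname), 0644)
}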

Comment on lines 217 to 218
t.Logf("test case: %d failed. Wasn't expecting to be a Zombie", i)
t.Fail()


These can be combined into a t.Fatalf()
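i.e. something like:

t.Fatalf("test case: %d failed. Wasn't expecting to be a Zombie", i)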

},
}

for i, c := range tests.Cases {


I'm not personally a fan of table tests for anything but really simple tests. I find they get really complicated as you start adding edge cases. I'd probably rework these into subtests using t.Run().

Although, if you want to keep the table tests, you can still wrap each of these in a t.Run() and then you can filter on them with go test -run=XXX
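A sketch of the suggested wrapping (the Name field is hypothetical):

for _, c := range tests.Cases {
	c := c // capture the range variable (needed before Go 1.22)
	t.Run(c.Name, func(t *testing.T) {
		// existing per-case assertions move in here; a single case can then be
		// selected with: go test -run 'TestZombieDiagnosis/<case name>'
	})
}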

@@ -322,6 +445,11 @@ func (n *Node) configure(store *state.Store) error {
fmt.Println(err.Error())
}
Member

here too I suppose

@davissp14 (Contributor Author) commented Jan 30, 2023

This seems to work well overall; however, I was able to discover a race condition that can happen if the failed primary comes back up in the middle of a failover.

Example: 3-member setup

Node A: Primary
Node B: Standby
Node C: Standby

Node A - Fails
Node B - Is elected Primary
Node A - Comes back to life before Node C is reconfigured.
Node A - Is able to meet quorum and continues being primary.
Node C - Is reconfigured to follow Node B.

This being the case, I think we need to take a slightly more conservative approach and fence the primary in the event any primary conflicts exist, even when quorum is met.
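In code terms, that just means checking for conflicts before the self-quorum check, e.g. (sketch only, not the committed change):

// Fence on any reported conflict, even if we could still reach quorum
// ourselves; only an agreeing majority makes the real primary resolvable.
if totalConflicts > 0 {
	for candidate, count := range conflictMap {
		if count >= quorum {
			return candidate, ErrZombieDiscovered // resolvable: rejoin on reboot
		}
	}
	return "", ErrZombieDiscovered // unresolvable: manual intervention
}
if totalActive >= quorum {
	return hostname, nil
}
return "", ErrZombieDiscovered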

UPDATE
This has been resolved here:
104e878
