Fencing a failed primary #49

Merged · 27 commits · Jan 30, 2023
Conversation

@davissp14 (Contributor) commented Jan 28, 2023

This PR aims to ensure that a booting primary that has diverged from the cluster is properly isolated.

Note: There's still some cleanup to do, but the core logic is there.

What this doesn't solve

Split-brain scenarios resulting from a network partition.

How do we verify the real primary?

We start by evaluating the cluster state, checking each registered standby for connectivity and asking it who its primary is.

The "clusters state" is represented across a few different dimensions:

Total members
Number of registered members, including the primary.

Total active members
Number of members that are responsive. This includes the primary we are evaluating, so this will never be less than one.

Total inactive members
Number of registered members that are non-responsive.

Conflict map
The conflict map is a map[string]int that tracks the conflicting primaries reported by our standbys and the number of times each one was referenced.

As an example, say we have a 3-member cluster and both standbys indicate that their registered primary does not match us. This will be recorded as:

map[string]int{
  "fdaa:0:2e26:a7b:8c31:bf37:488c:2": 2
}

The real primary is resolvable so long as a majority of members can agree on who it is, with quorum defined as total_members / 2 + 1.

There is one exception to note here: even when self meets quorum, we will still fence ourselves if a conflict is found. This protects against a possible race condition where we come up during an active failover.
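To make the decision concrete, here is a minimal sketch of the diagnosis described above. The function name, signature, and error value are illustrative only and are not necessarily what this PR implements.

import "errors"

// ErrZombieDiscovered signals that this member could not be confirmed as the
// real primary and must be fenced.
var ErrZombieDiscovered = errors.New("zombie discovered")

// zombieDiagnosis evaluates the dimensions described above. hostname is the
// booting primary, totalMembers and totalActive are the registered/responsive
// counts (both include self), and conflictMap counts how often each
// conflicting primary was reported by the standbys. It returns the resolved
// primary, or ErrZombieDiscovered when this member must be fenced.
func zombieDiagnosis(hostname string, totalMembers, totalActive int, conflictMap map[string]int) (string, error) {
	quorum := totalMembers/2 + 1

	totalConflicts := 0
	for _, count := range conflictMap {
		totalConflicts += count
	}

	// Any conflict fences this member, even if it still meets quorum itself,
	// to protect against racing an active failover.
	if totalConflicts > 0 {
		// The real primary is still resolvable if a majority agrees on it.
		for candidate, count := range conflictMap {
			if count >= quorum {
				return candidate, ErrZombieDiscovered
			}
		}
		// No candidate reached quorum: fenced and not automatically recoverable.
		return "", ErrZombieDiscovered
	}

	// No conflicts: we are the real primary as long as enough members are
	// reachable to form a quorum.
	if totalActive >= quorum {
		return hostname, nil
	}

	return "", ErrZombieDiscovered
}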

Tests can be found here: https://github.com/fly-apps/postgres-flex/pull/49/files#diff-3d71960ff7855f775cb257a74643d67d2636b354c9d485d10c2ded2426a7f362

What if the real primary can't be resolved or doesn't match the booting primary?

In both of these instances the primary member will be fenced.

If the real primary is resolvable
The cluster will be made read-only, PGBouncer will be reconfigured to target the "real" primary, and that primary's IP address will be written to a zombie.lock file. The PGBouncer reconfiguration ensures that any connections hitting this member are routed to the real primary in order to minimize interruptions. Once that's complete, we panic to force a full member restart. When the member restarts, we read the IP address from the zombie.lock file and use it to attempt to rejoin the cluster we diverged from. If we are successful, the zombie.lock is cleared and we boot as a standby.

Note: We will not attempt to rejoin a cluster if the resolved primary resides in a region that differs from the PRIMARY_REGION environment variable set on self. The PRIMARY_REGION will need to be updated before a rejoin will be attempted.

If the real primary is NOT resolvable
The cluster will be made read-only, PGBouncer will remain disabled, and a zombie.lock file will be created without a value. When the member reboots, we read the zombie.lock file and see that it's empty, which indicates that we've entered a failure mode that can't be recovered automatically. This could be an issue where previously deleted members were not properly unregistered, or the primary's state has diverged to the point where its registered members have been cycled out.

How to address these situations will be a part of future documentation.
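For illustration, the boot-time handling of zombie.lock described above might look roughly like the sketch below. The file path matches the one used by writeZombieLock later in the diff, but handleZombieLock and rejoinCluster are hypothetical names, and the PRIMARY_REGION check is omitted for brevity.

import (
	"context"
	"fmt"
	"os"
	"strings"
)

// rejoinCluster stands in for the actual repmgr-based rejoin logic.
func rejoinCluster(ctx context.Context, primaryIP string) error { return nil }

// handleZombieLock is a sketch of the boot path: read the lock, rejoin the
// resolved primary if one was recorded, otherwise require manual intervention.
func handleZombieLock(ctx context.Context) error {
	data, err := os.ReadFile("/data/zombie.lock")
	if os.IsNotExist(err) {
		return nil // we were never fenced; boot normally
	}
	if err != nil {
		return fmt.Errorf("failed to read zombie.lock: %w", err)
	}

	primaryIP := strings.TrimSpace(string(data))
	if primaryIP == "" {
		// The real primary was never resolved; this can't be recovered automatically.
		return fmt.Errorf("zombie.lock is empty; manual intervention required")
	}

	// Attempt to rejoin the cluster we diverged from, then clear the lock and
	// boot as a standby.
	if err := rejoinCluster(ctx, primaryIP); err != nil {
		return fmt.Errorf("failed to rejoin cluster via %s: %w", primaryIP, err)
	}
	if err := os.Remove("/data/zombie.lock"); err != nil {
		return fmt.Errorf("failed to clear zombie.lock: %w", err)
	}

	return nil
}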

@davissp14 marked this pull request as ready for review January 28, 2023 20:12
@davissp14 changed the title from "Fence failed primary" to "Fencing a failed primary" Jan 28, 2023
Comment on lines 332 to 345
	if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
		return fmt.Errorf("failed to reconfigure pgbouncer: %s", err)
	}
}
// Create a zombie.lock file containing the resolved primary.
// This will be an empty string if we are unable to resolve the real primary.
if err := writeZombieLock(primary); err != nil {
	return fmt.Errorf("failed to set zombie lock: %s", err)
}

fmt.Println("Setting all existing tables to read-only")
if err := admin.SetReadOnly(ctx, conn); err != nil {
	return fmt.Errorf("failed to set read-only: %s", err)
}
Member

I think we shouldn't return if we can't reconfigure pgbouncer here, since we should still set existing tables to readonly
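Concretely, the suggestion amounts to logging the failure and falling through, e.g. (sketch only):

// Log the PGBouncer failure instead of returning, so the read-only
// fencing below still runs.
if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
	fmt.Printf("failed to reconfigure pgbouncer: %s\n", err)
}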

Contributor Author

Yeah, I'll need to re-evaluate the returns a bit more. Good call.

pkg/flypg/node.go (resolved)
"testing"
)

func TestZombieDiagnosis(t *testing.T) {
Member

This is excellent. This makes it easier for me to continue to threaten to make real integration tests :)

Comment on lines +291 to +296
mConn, err := repmgr.NewRemoteConnection(ctx, standby.Hostname)
if err != nil {
	fmt.Printf("failed to connect to %s", standby.Hostname)
	totalInactive++
	continue
}


Should these mConn connections be closed? It doesn't look like they're used after passing to PrimaryMember().

Contributor Author

Yes they should!
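For example (sketch; this assumes the connection exposes a pgx-style Close(ctx), and closes explicitly rather than deferring inside the loop so connections aren't held open until the surrounding function returns):

mConn, err := repmgr.NewRemoteConnection(ctx, standby.Hostname)
if err != nil {
	fmt.Printf("failed to connect to %s", standby.Hostname)
	totalInactive++
	continue
}
// ... query the standby's registered primary via mConn ...
mConn.Close(ctx)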

Comment on lines 315 to 351
if err != nil {
	if errors.Is(err, ErrZombieDiscovered) {
		fmt.Println("Unable to confirm we are the real primary!")
		fmt.Printf("Registered members: %d, Active member(s): %d, Inactive member(s): %d, Conflicts detected: %d\n",
			totalMembers,
			totalActive,
			totalInactive,
			totalConflicts,
		)

		fmt.Println("Identifying ourself as a Zombie")

		// If primary is non-empty we were able to build a consensus on who the
		// real primary is, and we should be recoverable on reboot.
		if primary != "" {
			fmt.Printf("Majority of members agree that %s is the real primary\n", primary)
			fmt.Println("Reconfiguring PGBouncer to point to the real primary")
			if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
				return fmt.Errorf("failed to reconfigure pgbouncer: %s", err)
			}
		}
		// Create a zombie.lock file containing the resolved primary.
		// This will be an empty string if we are unable to resolve the real primary.
		if err := writeZombieLock(primary); err != nil {
			return fmt.Errorf("failed to set zombie lock: %s", err)
		}

		fmt.Println("Setting all existing tables to read-only")
		if err := admin.SetReadOnly(ctx, conn); err != nil {
			return fmt.Errorf("failed to set read-only: %s", err)
		}

		return fmt.Errorf("zombie primary detected. Use `fly machines restart <machine-id>` to rejoin the cluster or consider removing this node")
	}

	return fmt.Errorf("failed to run zombie diagnosis: %s", err)
}


You can reduce nesting a bit by moving the errors.Is() outside the outer if:

if errors.Is(err, ErrZombieDiscovered) {
	...
} else if err != nil {
	...
}

@@ -322,6 +446,11 @@ func (n *Node) configure(store *state.Store) error {
fmt.Println(err.Error())
}

// Clear target and wait for primary resolution
if err := n.PGBouncer.ConfigurePrimary(ctx, "", false); err != nil {
fmt.Println(err.Error())


Does this not need to be bubbled up?

Comment on lines 17 to 23
func writeZombieLock(hostname string) error {
	if err := ioutil.WriteFile("/data/zombie.lock", []byte(hostname), 0644); err != nil {
		return err
	}

	return nil
}


Minor nit. ioutil is deprecated. You can use io.WriteFile() instead.

Contributor Author

Oh, right. Are you talking about os.WriteFile?
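With os.WriteFile (the non-deprecated replacement for ioutil.WriteFile), the helper becomes something like this sketch:

import "os"

// writeZombieLock persists the resolved primary (or an empty string) so it
// can be read back on the next boot.
func writeZombieLock(hostname string) error {
	return os.WriteFile("/data/zombie.lock", []byte(hostname), 0644)
}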

Comment on lines 217 to 218
t.Logf("test case: %d failed. Wasn't expecting to be a Zombie", i)
t.Fail()


These can be combined into a t.Fatalf()
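i.e. something like:

t.Fatalf("test case: %d failed. Wasn't expecting to be a Zombie", i)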

},
}

for i, c := range tests.Cases {


I'm not personally a fan of table tests for anything but really simple tests. I find they get really complicated as you start adding edge cases. I'd probably rework these into subtests using t.Run().

Although, if you want to keep the table tests, you can still wrap each of these in a t.Run() and then you can filter on them with go test -run=XXX
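A sketch of the suggested wrapping (the Name field is hypothetical):

for _, c := range tests.Cases {
	c := c // capture the range variable (needed before Go 1.22)
	t.Run(c.Name, func(t *testing.T) {
		// existing per-case assertions move in here; a single case can then be
		// selected with: go test -run 'TestZombieDiagnosis/<case name>'
	})
}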

@@ -322,6 +445,11 @@ func (n *Node) configure(store *state.Store) error {
fmt.Println(err.Error())
}
Member

here too I suppose

@davissp14 (Contributor Author) commented Jan 30, 2023

This seems to work well overall; however, I was able to discover a race condition that can happen if the failed primary comes back up in the middle of a failover.

Example: 3-member setup

Node A: Primary
Node B: Standby
Node C: Standby

Node A - Fails
Node B - Is elected Primary
Node A - Comes back to life before Node C is reconfigured.
Node A - Is able to meet quorum and continues being primary.
Node C - Is reconfigured to follow Node B.

This being the case, I think we need to take a slightly more conservative approach and fence the primary in the event any primary conflicts exist, even when quorum is met.
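In code terms, that just means checking for conflicts before the self-quorum check, e.g. (sketch only, not the committed change):

// Fence on any reported conflict, even if we could still reach quorum
// ourselves; only an agreeing majority makes the real primary resolvable.
if totalConflicts > 0 {
	for candidate, count := range conflictMap {
		if count >= quorum {
			return candidate, ErrZombieDiscovered // resolvable: rejoin on reboot
		}
	}
	return "", ErrZombieDiscovered // unresolvable: manual intervention
}
if totalActive >= quorum {
	return hostname, nil
}
return "", ErrZombieDiscovered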

UPDATE
This has been resolved here:
104e878
