Fix P-chain validator set lookup race condition #2672

StephenButtolph · 2024-01-28T00:34:47Z

Why this should be merged

The P-chain p2p network currently fetches the primary network validator set without first acquiring the P-chain context lock. This can produce racy (incorrect) validator set calculations, which are then cached. The result is persistently incorrect validator set lookups at the P-chain heights where the race occurred.
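To illustrate why a single racy read can turn into a persistent error, here is a minimal, deterministic sketch. It is not the avalanchego code: `chainState`, `weightCache`, and `simulate` are hypothetical names, a single `weight` number stands in for the validator set, and channels force the reader to run in the middle of a two-step update.

```go
package main

import (
	"fmt"
	"sync"
)

// chainState is a hypothetical stand-in for P-chain state. Writers update
// height and weight together while holding mu.
type chainState struct {
	mu     sync.Mutex
	height uint64
	weight uint64
}

// weightCache memoizes the weight observed at each height.
type weightCache struct {
	byHeight map[uint64]uint64
}

// lookup returns the weight for the current height, caching it on a miss.
// With useLock=false it reads the state without holding s.mu.
func (c *weightCache) lookup(s *chainState, useLock bool) uint64 {
	if useLock {
		s.mu.Lock()
		defer s.mu.Unlock()
	}
	h := s.height
	if w, ok := c.byHeight[h]; ok {
		return w
	}
	w := s.weight
	c.byHeight[h] = w
	return w
}

// simulate moves the state from (height 0, weight 100) to (height 1, weight 200)
// in two steps, runs one lookup while the update is in flight, and returns what
// a later, correctly locked lookup observes at height 1.
func simulate(useLock bool) uint64 {
	s := &chainState{height: 0, weight: 100}
	c := &weightCache{byHeight: make(map[uint64]uint64)}
	midUpdate := make(chan struct{})
	resume := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		s.mu.Lock()
		defer s.mu.Unlock()
		s.height = 1
		close(midUpdate) // the update is now half applied
		<-resume
		s.weight = 200
	}()

	<-midUpdate
	if useLock {
		readDone := make(chan struct{})
		go func() {
			defer close(readDone)
			c.lookup(s, true) // blocks until the writer unlocks
		}()
		close(resume)
		<-readDone
	} else {
		c.lookup(s, false) // pairs height 1 with the old weight
		close(resume)
	}
	<-done
	return c.lookup(s, true)
}

func main() {
	fmt.Println(simulate(false), simulate(true)) // prints "100 200"
}
```

In the unlocked run, the later lookup takes the lock correctly and still returns the stale value, because the bad entry is already cached for that height.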

How this works

Wraps the validators.State passed into the p2p network with the context lock.
Exports a number of utility functions from the p2p library to facilitate testing.
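The wrapping step can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: the `State` interface here is a hypothetical two-method stand-in for `validators.State` (the real interface has more methods and takes a context), and `countingLocker` and `fixedState` exist only to make the behavior observable.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a simplified, hypothetical stand-in for validators.State.
type State interface {
	GetCurrentHeight() (uint64, error)
	GetValidatorWeights(height uint64) (map[string]uint64, error)
}

// lockedState forwards every call to the wrapped State while holding lock.
type lockedState struct {
	lock sync.Locker
	s    State
}

// NewLockedState follows the shape of the call in the diff: a lock plus the
// state it guards. The body is an assumption about how such a wrapper works.
func NewLockedState(lock sync.Locker, s State) State {
	return &lockedState{lock: lock, s: s}
}

func (l *lockedState) GetCurrentHeight() (uint64, error) {
	l.lock.Lock()
	defer l.lock.Unlock()
	return l.s.GetCurrentHeight()
}

func (l *lockedState) GetValidatorWeights(height uint64) (map[string]uint64, error) {
	l.lock.Lock()
	defer l.lock.Unlock()
	return l.s.GetValidatorWeights(height)
}

// countingLocker counts acquisitions so the example can show the lock is used.
type countingLocker struct {
	mu    sync.Mutex
	locks int
}

func (c *countingLocker) Lock()   { c.mu.Lock(); c.locks++ }
func (c *countingLocker) Unlock() { c.mu.Unlock() }

// fixedState is a toy State that returns constant data.
type fixedState struct{}

func (fixedState) GetCurrentHeight() (uint64, error) { return 7, nil }

func (fixedState) GetValidatorWeights(uint64) (map[string]uint64, error) {
	return map[string]uint64{"nodeA": 100}, nil
}

func main() {
	lock := &countingLocker{}
	state := NewLockedState(lock, fixedState{})

	h, _ := state.GetCurrentHeight()
	w, _ := state.GetValidatorWeights(h)
	fmt.Println(h, w["nodeA"], lock.locks) // prints "7 100 2"
}
```

Because the wrapper satisfies the same interface as the state it wraps, the caller (here, the p2p network) needs no changes and cannot forget to take the lock.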

How this was tested

New unit test (fails on master)
Ran a node on Fuji

StephenButtolph · 2024-01-28T00:37:11Z

vms/platformvm/vm.go

+		validators.NewLockedState(
+			&chainCtx.Lock,
+			validatorManager,
+		),


This is the actual fix. chainCtx.ValidatorState was replaced with validatorManager because chainCtx.ValidatorState is actually equal to vm here and I found that very confusing. Additionally, this made testing much better because we don't need to rely on mocking the ValidatorState.

StephenButtolph · 2024-01-28T00:37:33Z

vms/platformvm/vm_regression_test.go

@@ -2218,6 +2225,61 @@ func TestSubnetValidatorSetAfterPrimaryNetworkValidatorRemoval(t *testing.T) {
 	require.NoError(err)
 }

+func TestValidatorSetRaceCondition(t *testing.T) {


I couldn't find a cleaner way to write this test... Very open to suggestions.

StephenButtolph · 2024-01-28T00:38:45Z

network/p2p/gossip/message.go

+func MarshalAppRequest(filter, salt []byte) ([]byte, error) {
+	request := &sdk.PullGossipRequest{
+		Filter: filter,
+		Salt:   salt,
+	}
+	return proto.Marshal(request)
+}
+
+func ParseAppRequest(bytes []byte) (*bloom.ReadFilter, ids.ID, error) {
+	request := &sdk.PullGossipRequest{}
+	if err := proto.Unmarshal(bytes, request); err != nil {
+		return nil, ids.Empty, err
+	}
+
+	salt, err := ids.ToID(request.Salt)
+	if err != nil {
+		return nil, ids.Empty, err
+	}
+
+	filter, err := bloom.Parse(request.Filter)
+	return filter, salt, err
+}
+
+func MarshalAppResponse(gossip [][]byte) ([]byte, error) {
+	return proto.Marshal(&sdk.PullGossipResponse{
+		Gossip: gossip,
+	})
+}
+
+func ParseAppResponse(bytes []byte) ([][]byte, error) {
+	response := &sdk.PullGossipResponse{}
+	err := proto.Unmarshal(bytes, response)
+	return response.Gossip, err
+}
+
+func MarshalAppGossip(gossip [][]byte) ([]byte, error) {
+	return proto.Marshal(&sdk.PushGossip{
+		Gossip: gossip,
+	})
+}
+
+func ParseAppGossip(bytes []byte) ([][]byte, error) {
+	msg := &sdk.PushGossip{}
+	err := proto.Unmarshal(bytes, msg)
+	return msg.Gossip, err
+}


Technically we don't need to export all of these utilities... But I felt like it made the code easier to understand... If we'd prefer to just directly marshal the proto messages I'll revert these. @joshua-kim thoughts?

Strongly prefer this

This is better. It is also good because we frequently rewrite this in tests as well.

StephenButtolph · 2024-01-28T00:39:40Z

network/p2p/router.go

+	handlerStr := strconv.FormatUint(handlerID, 10)
+


We could reduce the diff here, but I don't think there is any good reason to do this with the lock held.
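A minimal sketch of the pattern this comment describes, where work that touches no shared state is done before the lock is taken. The `router` type and `addHandler` method below are hypothetical and much smaller than the real router.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// router is a hypothetical, pared-down handler registry.
type router struct {
	lock     sync.RWMutex
	handlers map[uint64]string // handlerID -> label
}

func newRouter() *router {
	return &router{handlers: make(map[uint64]string)}
}

// addHandler registers handlerID and fails if it is already registered.
func (r *router) addHandler(handlerID uint64) error {
	// Formatting the ID reads no shared state, so it happens before locking.
	handlerStr := strconv.FormatUint(handlerID, 10)

	r.lock.Lock()
	defer r.lock.Unlock()

	if _, ok := r.handlers[handlerID]; ok {
		return fmt.Errorf("handler %s already registered", handlerStr)
	}
	r.handlers[handlerID] = handlerStr
	return nil
}

func main() {
	r := newRouter()
	fmt.Println(r.addHandler(42)) // first registration succeeds
	fmt.Println(r.addHandler(42)) // duplicate is rejected
}
```

Keeping the critical section limited to the map access shortens the time other goroutines wait on the lock.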

StephenButtolph · 2024-01-28T00:40:51Z

vms/platformvm/vm_test.go

@@ -277,13 +277,14 @@ func defaultVM(t *testing.T, fork activeFork) (*VM, database.Database, *mutableS
 		return nil
 	}

+	dynamicConfigBytes := []byte(`{"network":{"max-validator-set-staleness":0}}`)


There might be other configs that would be good to default to. But this disables the optimization so that we re-calculate the validator set aggressively.
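As a rough illustration of how this JSON override could be decoded, the sketch below parses it with `encoding/json`. The struct names, the field type, and the default of 10 are assumptions for the example and are not taken from the diff; only the JSON keys come from the test config above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// networkConfig and dynamicConfig are hypothetical types that mirror the
// nesting of the JSON used in the test.
type networkConfig struct {
	MaxValidatorSetStaleness uint64 `json:"max-validator-set-staleness"`
}

type dynamicConfig struct {
	Network networkConfig `json:"network"`
}

// parseConfig starts from an assumed non-zero default so that an explicit 0
// in the JSON is visible as an override.
func parseConfig(b []byte) (dynamicConfig, error) {
	cfg := dynamicConfig{
		Network: networkConfig{MaxValidatorSetStaleness: 10}, // hypothetical default
	}
	err := json.Unmarshal(b, &cfg)
	return cfg, err
}

func main() {
	dynamicConfigBytes := []byte(`{"network":{"max-validator-set-staleness":0}}`)

	cfg, err := parseConfig(dynamicConfigBytes)
	if err != nil {
		panic(err)
	}
	// A staleness of 0 means no stale validator set is tolerated.
	fmt.Println(cfg.Network.MaxValidatorSetStaleness)
}
```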

patrick-ogrady · 2024-01-28T01:02:32Z

vms/platformvm/vm.go

@@ -205,7 +205,10 @@ func (vm *VM) Initialize(
 		chainCtx.Log,
 		chainCtx.NodeID,
 		chainCtx.SubnetID,
-		chainCtx.ValidatorState,
+		validators.NewLockedState(


Are there other VMs that provide the raw state here?

Probably makes sense to double-check C/X-chain initialization.

This is a P-chain specific bug. In the C-chain and the X-chain the ctx.ValidatorState is already locked.

Co-authored-by: Stephen Buttolph <stephen@avalabs.org>

Fix P-chain validator set lookup race condition

0a00887

StephenButtolph added the bug and incident response labels Jan 28, 2024

StephenButtolph added this to the v1.10.19 milestone Jan 28, 2024

StephenButtolph self-assigned this Jan 28, 2024

StephenButtolph requested review from abi87, danlaine, dhrubabasu and joshua-kim as code owners January 28, 2024 00:34

StephenButtolph commented Jan 28, 2024

github-advanced-security bot found potential problems Jan 28, 2024 (network/p2p/client.go, dismissed)

StephenButtolph commented Jan 28, 2024

patrick-ogrady reviewed Jan 28, 2024

patrick-ogrady approved these changes Jan 28, 2024

patrick-ogrady reviewed Jan 28, 2024 (vms/platformvm/vm_regression_test.go, outdated, resolved)

darioush approved these changes Jan 28, 2024

darioush and others added 2 commits January 28, 2024 10:06

Fix P-chain validator set lookup race condition (#2673)

78ac72c

Co-authored-by: Stephen Buttolph <stephen@avalabs.org>

Merge branch 'master' into fix-validator-set-lookup-race

40b29d2

StephenButtolph enabled auto-merge January 28, 2024 15:09

StephenButtolph added this pull request to the merge queue Jan 28, 2024

Merged via the queue into master with commit 68980eb Jan 28, 2024
17 checks passed

StephenButtolph deleted the fix-validator-set-lookup-race branch January 28, 2024 15:38
