
tla-plus: add spec for store liveness fabric #123523

Merged

Conversation

nvanbenschoten
Member

This commit adds a new TLA+ specification (expressed in PlusCal) called StoreLiveness. StoreLiveness is a specification for the Store Liveness fabric, which will sit below Raft and power the leader-lease mechanism.

The central safety property of Store Liveness is Durable Support. A node in the Store Liveness fabric can receive support from another node with an associated "end time" and be confident that this support will not be withdrawn until that supporting node's clock exceeds the end time. This property allows us to build lease disjointness, as we can be sure that a future lease will have a start time that is greater than the end time of the previous lease's corresponding store liveness support.
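As a rough sketch (the spec contains the exact definitions of the helper operators), the invariant takes this shape:

  DurableSupport ==
    \A i \in Nodes:
      \A j \in Nodes \ {i}:
        ForSelfByEpochValid(i, j) =>
          (SupportUpheld(i, j) \/ ForSelfByEpochSupportExpired(i, j))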

Epic: None
Release note: None

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@miraradeva miraradeva left a comment


Looks great! Had a lot of fun running it myself and tweaking the algorithm.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @sumeerbhola)


docs/tla-plus/StoreLiveness/StoreLiveness.cfg line 5 at r1 (raw file):

CONSTANT
  Nodes = {n1, n2}

If the durability property holds for two nodes, it should hold for any larger system, between any two nodes pair-wise. So it doesn't seem necessary to check this for more than two nodes, right?


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 118 at r1 (raw file):

macro restart()
begin
  current_epoch[self] := current_epoch[self];

Aren't we supposed to increment the epoch upon restart?


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 160 at r1 (raw file):

            ack := TRUE;
          elsif by_self_for[self][msg.from].epoch = msg.epoch then
            if by_self_for[self][msg.from].end_time < msg.end_time then

It is possible that by_self_for[self][msg.from].end_time > msg.end_time if the message was an old heartbeat for the same epoch, right? So, we don't want to regress the end time. Might be worth a comment.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 171 at r1 (raw file):

        
      elsif msg.type = MsgHeartbeatResp then
        ReceiveHeartbeatResp:

This behavior differs from the algorithm in the doc. Presumably because we might receive a MsgHeartbeatResp that corresponds to an older MsgHeartbeat? Is this just an artifact of TLA+ not supporting RPC?


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 173 at r1 (raw file):

        ReceiveHeartbeatResp:
          if msg.ack then
            if for_self_by[self][msg.from].epoch < msg.epoch then

I don't think this case is possible. We can't receive an ACK heartbeat response with an epoch for which we didn't explicitly request support. If we already requested support for it, it will be present in for_self_by[self][msg.from].epoch.

I confirmed by running the model check with this being the ACK if block:

assert for_self_by[self][msg.from].epoch >= msg.epoch;
if for_self_by[self][msg.from].epoch = msg.epoch then
    if for_self_by[self][msg.from].end_time < msg.end_time then
        for_self_by[self][msg.from].end_time := msg.end_time;
    end if;
end if;

Though I'm not sure the epoch equality check is even necessary (the model check succeeds without it). Can we get an ACK from an older epoch with a higher end time? Maybe only if we decrease the value of HeartbeatInterval midway through the execution.

Collaborator

@sumeerbhola sumeerbhola left a comment


Flushing some comments from yesterday since I got interrupted. I will finish looking today.

Reviewed 1 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


docs/tla-plus/StoreLiveness/StoreLiveness.cfg line 10 at r1 (raw file):

  HeartbeatInterval = 2

  MsgHeartbeat     = MsgHeartbeat

pluscal question: what does this mean? Is this a model value https://learntla.com/core/constants.html#model-values?


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 46 at r1 (raw file):

  EpochSupportExpired(map, i, j, supporter_time) == map[i][j].end_time < supporter_time
  \* Is i's support from j (according to i's for_self_by map) expired (according to j's clock)?
  ForSelfByEpochSupportExpired(i, j) == EpochSupportExpired(for_self_by, i, j, clocks[j])

So this can't be computed in a real system, since it is using the for_self_by state at i and the clock at j, yes? Presumably this is defined for an invariant.

I was expecting there would also be a function that is evaluatable at i, which uses for_self_by state at i and the clock at i, for i to realize it has lost support.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 54 at r1 (raw file):

  \* Can i withdraw support for j?
  CanInvlidateBySelfFor(i, j) == BySelfForEpochValid(i, j) /\ BySelfForEpochSupportExpired(i, j)

nit: invalidate


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 64 at r1 (raw file):

      \A j \in Nodes \ {i}:
        ForSelfByEpochValid(i, j) => 
          (SupportUpheld(i, j) \/ ForSelfByEpochSupportExpired(i, j))

makes sense. This also covers the case where i unilaterally increments its for_self_by[i][j].epoch before the expiry time (though I forget whether there was a need for that in the latest design iteration).


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 71 at r1 (raw file):

      \A j \in Nodes \ {i}:
        current_epoch[i] >= for_self_by[i][j].epoch
end define;

we should consider adding an invariant to ensure that the withdrawal epoch does not run away from the original epoch at a node.

by_self_for[i][j].epoch > for_self_by[j][i].epoch /\ by_self_for[i][j].end_time = 0 => by_self_for[i][j].epoch - for_self_by[j][i].epoch <= 1


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 82 at r1 (raw file):

end macro

macro send_heartbeat(to)

I suppose we don't need to model the HLC time causality for safety purposes -- for this scheme it helps only for liveness.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 118 at r1 (raw file):

macro restart()
begin
  current_epoch[self] := current_epoch[self];

was this meant to increment? Currently looks like a noop.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, and @nvanbenschoten)


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 173 at r1 (raw file):

Can we get an ACK from an older epoch with a higher end time? Maybe only if we decrease the value of HeartbeatInterval midway through the execution.

The earlier design iterations allowed for that. And I think the latest one needs to as well because, as you mention, we can change the HeartbeatInterval, and the node could restart and forget that it had support for an earlier epoch that ended at a later time.

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, and @sumeerbhola)


docs/tla-plus/StoreLiveness/StoreLiveness.cfg line 5 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

If the durability property holds for two nodes, it should hold for any larger system, between any two nodes pair-wise. So it doesn't seem necessary to check this for more than two nodes, right?

Yes, that's correct. In fact, we probably don't even need to heartbeat in two directions in this spec.


docs/tla-plus/StoreLiveness/StoreLiveness.cfg line 10 at r1 (raw file):
Yes, it is a model value. I'm using these here to serve as enum variants. It's strange to me that these need to be declared in the .cfg file, but the model checker requires it; otherwise it fails with:

    The constant parameter MsgHeartbeat is not assigned a value by the configuration file.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 46 at r1 (raw file):
Right, this cannot be computed in a real system and is only used to define the invariant.

I was expecting there would also be a function that is evaluatable at i, which uses for_self_by state at i and the clock at i, for i to realize it has lost support.

Such a function could exist, but it's not needed now that a node never needs to invalidate its own support at the store liveness level.
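
If we did need it, it would presumably just be the same helper evaluated against i's own clock (hypothetical name, reusing the EpochSupportExpired operator quoted above):

  ForSelfByEpochSupportExpiredLocally(i, j) == EpochSupportExpired(for_self_by, i, j, clocks[i])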


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 54 at r1 (raw file):

Previously, sumeerbhola wrote…

nit: invalidate

Done.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 64 at r1 (raw file):

Previously, sumeerbhola wrote…

makes sense. This also covers the case where i unilaterally increments its for_self_by[i][j].epoch before the expiry time (though I forget whether there was a need for that in the latest design iteration).

There is only a need for that in the current design after a restart.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 71 at r1 (raw file):

Previously, sumeerbhola wrote…

we should consider adding an invariant to ensure that the withdrawal epoch does not run away from the original epoch at a node.

by_self_for[i][j].epoch > for_self_by[j][i].epoch /\ by_self_for[i][j].end_time = 0 => by_self_for[i][j].epoch - for_self_by[j][i].epoch <= 1

Good idea, done.
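
For reference, rendered as a named invariant in the style of the define block, it would look something like this (name and exact form are hypothetical; the committed version may differ):

  SupportProvidedLeadsSupportAssumed ==
    \A i \in Nodes:
      \A j \in Nodes \ {i}:
        (by_self_for[i][j].epoch > for_self_by[j][i].epoch /\ by_self_for[i][j].end_time = 0) =>
          by_self_for[i][j].epoch - for_self_by[j][i].epoch <= 1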


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 118 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Aren't we supposed to increment the epoch upon restart?

Yes, I must have broken this when trying to get invariants to fire. Fixed.
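
For context, a minimal sketch of the intended behavior (not the actual diff):

  macro restart()
  begin
    \* Move to a new epoch so that support gathered before the restart is never reused.
    current_epoch[self] := current_epoch[self] + 1;
  end macro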


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 118 at r1 (raw file):

Previously, sumeerbhola wrote…

was this meant to increment? Currently looks like a noop.

Yes, fixed.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 160 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

It is possible that by_self_for[self][msg.from].end_time > msg.end_time if the message was an old heartbeat for the same epoch, right? So, we don't want to regress the end time. Might be worth a comment.

Correct, we allow out-of-order delivery, so we don't want to regress the end_time. I've added a comment and confirmed that the DurableSupport invariant and the new SupportProvidedLeadsSupportAssumedInvariant both fail without this protection.
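
Roughly, the guarded update now reads as follows (paraphrased; the assignment and comment wording are approximations of the actual change):

  elsif by_self_for[self][msg.from].epoch = msg.epoch then
    \* Messages may be delivered out of order, so an old heartbeat for the same
    \* epoch can carry a smaller end_time. Never regress the supported end_time.
    if by_self_for[self][msg.from].end_time < msg.end_time then
      by_self_for[self][msg.from].end_time := msg.end_time;
    end if;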


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 171 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

This behavior differs from the algorithm in the doc. Presumably because we might receive a MsgHeartbeatResp that corresponds to an older MsgHeartbeat? Is this just an artifact of TLA+ not supporting RPC?

By "supporting RPC", are you referring to preventing out-of-order delivery of requests and responses? I don't think we want our implementation to assume ordered delivery. Even with an RPC framework that provides ordered delivery, systems need to be careful for ambiguous delivery when the RPC stream breaks and then the potential for out-of-order delivery after a new stream is created.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 173 at r1 (raw file):

Previously, sumeerbhola wrote…

Can we get an ACK from an older epoch with a higher end time? Maybe only if we decrease the value of HeartbeatInterval midway through the execution.

The earlier design iterations allowed for that. And I think the latest one needs to as well because, as you mention, we can change the HeartbeatInterval, and the node could restart and forget that it had support for an earlier epoch that ended at a later time.

Good point. I added the assertion and simplified this logic for now. I was thinking that there may be cases where we have requested support for a new epoch, but have not yet updated for_self_by. That's not true though, given how the spec is written.


We could make the heartbeat protocol more flexible to allow for such an ACK, to avoid two RTTs when support has been withdrawn. Doing so would also remove the need for NACKs: instead of heartbeating with a specific epoch, we could heartbeat with a minimum epoch. The receiver could then take the maximum of the epoch in the heartbeat and its own local epoch to determine which epoch to support.

I think we probably don't want to do this though, as it means that a delayed message after a node has crashed and other nodes have withdrawn support for it could allow the dead node to temporarily regain support.
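
To make the alternative concrete, the receiver side would look roughly like this (hypothetical; min_epoch is not a field in the current spec):

  \* Heartbeat carries a minimum acceptable epoch rather than an exact one.
  if msg.min_epoch > by_self_for[self][msg.from].epoch then
    \* Jump forward to the requested epoch instead of NACKing.
    by_self_for[self][msg.from].epoch := msg.min_epoch;
  end if;
  \* ... then extend end_time as usual and always ACK.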

Contributor

@miraradeva miraradeva left a comment


Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @sumeerbhola)


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 171 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

By "supporting RPC", are you referring to preventing out-of-order delivery of requests and responses? I don't think we want our implementation to assume ordered delivery. Even with an RPC framework that provides ordered delivery, systems need to be careful for ambiguous delivery when the RPC stream breaks and then the potential for out-of-order delivery after a new stream is created.

Yeah, that makes sense. It would be nice to be able to assume matching requests and responses. E.g. if I heartbeat epoch 5, epoch 6, and epoch 7, when I receive 3 responses (say, ACK, NACK, ACK), I would know that the first ACK was for epoch 7, the NACK was for epoch 5, and the other ACK was for epoch 6. Even though the responses are received in a different order than the requests were sent, we know which response corresponds to which request. Does our RPC framework guarantee this? I think message passing in TLA+, and in general, doesn't.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 173 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Good point. I added the assertion and simplified this logic for now. I was thinking that there may be cases where we have requested support for a new epoch, but have not yet updated for_self_by. That's not true though, given how the spec is written.


We could make the heartbeat protocol more flexible to allow for such an ACK, to avoid two RTTs when support has been withdrawn. Doing so would also remove the need for NACKs: instead of heartbeating with a specific epoch, we could heartbeat with a minimum epoch. The receiver could then take the maximum of the epoch in the heartbeat and its own local epoch to determine which epoch to support.

I think we probably don't want to do this though, as it means that a delayed message after a node has crashed and other nodes have withdrawn support for it could allow the dead node to temporarily regain support.

If we're adding significant logic to handle changes in HeartbeatInterval, do we want to model increasing/decreasing the HeartbeatInterval here? I'm not sure how much state-space blowup it will cause.

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, and @sumeerbhola)


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 171 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Yeah, that makes sense. It would be nice to be able to assume matching requests and responses. E.g. if I heartbeat epoch 5, epoch 6, and epoch 7, when I receive 3 responses (say, ACK, NACK, ACK), I would know that the first ACK was for epoch 7, the NACK was for epoch 5, and the other ACK was for epoch 6. Even though the responses are received in a different order than the requests were sent, we know which response corresponds to which request. Does our RPC framework guarantee this? I think message passing in TLA+, and in general, doesn't.

gRPC exposes a unary request/response interface which we were planning to use for store liveness, which does make this mapping clear. That could allow us to omit some information from responses (e.g. the epoch being heartbeated, maybe even the end_time).

However, we may not want to do this even if we can. A request/response interface can look single-threaded and hide the need to handle out-of-order or dropped messages on the request path or the response path. This can lead to a more fragile, harder-to-test implementation.

As a point of comparison, the raft library does not assume request/response pairs, which is mostly a good thing.


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 173 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

If we're adding significant logic to handle changes in HeartbeatInterval, do we want to model increasing/decreasing the HeartbeatInterval here? I'm not sure how much state-space blowup it will cause.

We can try. I think it's a good idea.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/storeLivenessTLA branch from 80843bf to 64ed3b6 Compare May 8, 2024 20:56
Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, and @sumeerbhola)


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 173 at r1 (raw file):

If we're adding significant logic to handle changes in HeartbeatInterval

Added. This required me to remove the by_self_for[self][msg.from].end_time < msg.end_time assertion in ReceiveHeartbeat. I added a comment describing why end_time regressions across epochs are safe. Want to convince yourself that you agree?
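
For the curious, the new step is roughly of this shape (a hypothetical rendering; it assumes the interval becomes a variable, say heartbeat_interval, with a bound like MaxHeartbeatInterval, rather than the HeartbeatInterval constant):

  \* Nondeterministically grow or shrink the heartbeat interval within bounds.
  either
    await heartbeat_interval < MaxHeartbeatInterval;
    heartbeat_interval := heartbeat_interval + 1;
  or
    await heartbeat_interval > 1;
    heartbeat_interval := heartbeat_interval - 1;
  end either;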

I'm not sure how much state-space blowup it will cause.

I dropped the MaxClock down from 5 to 4 to help combat this. Model checking is still reasonable at this point.

[Attached screenshots of model checking results: Screenshot 2024-05-08 at 4.54.24 PM.png, Screenshot 2024-05-08 at 4.55.13 PM.png]

Contributor

@miraradeva miraradeva left a comment


:lgtm:

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @sumeerbhola)


docs/tla-plus/StoreLiveness/StoreLiveness.tla line 173 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we're adding significant logic to handle changes in HeartbeatInterval

Added. This required me to remove the by_self_for[self][msg.from].end_time < msg.end_time assertion in ReceiveHeartbeat. I added a comment describing why end_time regressions across epochs are safe. Want to convince yourself that you agree?

I'm not sure how much state-space blowup it will cause.

I dropped the MaxClock down from 5 to 4 to help combat this. Model checking is still reasonable at this point.

[Attached screenshots of model checking results: Screenshot 2024-05-08 at 4.54.24 PM.png, Screenshot 2024-05-08 at 4.55.13 PM.png]

Looks great, thanks!

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/storeLivenessTLA branch from 64ed3b6 to ef95359 Compare May 8, 2024 21:08
@nvanbenschoten
Member Author

@miraradeva I'll merge this for now. We can iterate on it in other PRs if we want to augment store liveness with state like the current_epoch_max_requested field (and the corresponding invariants) that we had talked about.

bors r=miraradeva

@craig
Contributor

craig bot commented May 20, 2024

👎 Rejected by code reviews

@nvanbenschoten
Member Author

bors r=miraradeva

@craig craig bot merged commit ceb19c5 into cockroachdb:master May 20, 2024
22 checks passed