
Proper handling of incompatible zedtokens #1723

Open
wants to merge 3 commits into main from incompatible-zedtokens
Conversation

josephschorr
Member

@josephschorr josephschorr commented Jan 29, 2024

NOTE: ZedTokens are a bit longer now as a result of this change, but should still be well within the previously defined 1024-character limit

Fixes #1541

@github-actions github-actions bot added area/api v1 Affects the v1 API area/CLI Affects the command line area/datastore Affects the storage system area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) labels Jan 29, 2024
@josephschorr josephschorr force-pushed the incompatible-zedtokens branch 3 times, most recently from d3b259f to 64dda15 on January 29, 2024 22:30
@josephschorr josephschorr marked this pull request as ready for review January 29, 2024 22:38
@josephschorr josephschorr requested review from vroldanbet and a team as code owners January 29, 2024 22:38
@josephschorr
Member Author

Rebased

@josephschorr josephschorr force-pushed the incompatible-zedtokens branch 2 times, most recently from 52895fe to 9bb170d on March 13, 2024 21:43
@josephschorr
Member Author

Rebased

@josephschorr josephschorr force-pushed the incompatible-zedtokens branch 2 times, most recently from ba65181 to 160e514 on April 22, 2024 20:20
@josephschorr
Member Author

Updated

Contributor

@vroldanbet vroldanbet left a comment

Some early feedback; I'll continue tomorrow. Please describe in the PR body what problems you are trying to solve and the design choices you made to arrive at this solution. It helps folks reviewing the PR to have the right context 🙏🏻

}

cds.uniqueID.Store(&uniqueID)
Contributor

I don't think this matters too much here because the value converges, but this can race. Ideally we do a CAS operation.

I'd prefer to see this implemented with sync.Once instead of an atomic pointer. The ID does not change for the lifetime of the server.

Member Author

sync.Once won't work because if we get an error, we need to rerun this code the next time it's called.

Contributor

@vroldanbet vroldanbet Jul 10, 2024

Fair. Can we at least use a CAS operation to store it? If it fails, it probably means some other goroutine already set it.

Another point in favor of the atomic value is that it does not rely on a mutex. sync.Once relies on one while it's not yet initialized, but once it is, the overhead of the mutex disappears.

Member Author

Why even do that, though? The ID is guaranteed by the datastore to be the same, so there's no need to compare.

Comment on lines +546 to +549
// UniqueID returns a unique identifier for the datastore. This identifier
// must be stable across restarts of the datastore if the datastore is
// persistent.
UniqueID(context.Context) (string, error)
Contributor

Did you consider alternatives to using the UniqueID as a stable identifier for zedtokens?

I assume (it's not present in the PR body) that the goal is making sure zedtokens from one datastore type are not used in another datastore type. But what happens if you evolve the zedtoken implementation from one version to another for the same datastore and want to force existing tokens to be dropped? The uniqueID wouldn't help, would it?
Shouldn't we store a zedtoken versioning parameter like datastoreType+zedtokenVersion?

Member Author

@josephschorr josephschorr Jul 9, 2024

Yes, but then it doesn't meet the goal, which is to prevent zedtokens from being reused not just across datastore types but across instances of the datastore as well.

@josephschorr
Member Author

> Some early feedback; I'll continue tomorrow. Please describe in the PR body what problems you are trying to solve and the design choices you made to arrive at this solution. It helps folks reviewing the PR to have the right context 🙏🏻

I added the fixes; it was on the commit but not the PR

Contributor

@vroldanbet vroldanbet left a comment

We are transferring UUIDs in every API call for no reason other than to prevent an eventual migration within the same datastore type. That's extremely wasteful.

It's still within the zedtoken spec limit, but it is a very high price to pay for an incredibly rare event (a legitimate one, don't get me wrong). We are storing UUIDs in the datastore as strings, which is the least compact way of storing them and transferring them over the wire. And we are forcing everybody to see their data transfer increase significantly (and latency!) to prevent a scenario that the system will not be subjected to in, I'll dare to say, practically 100% of its lifespan.

We need to rethink this. I get that the zedtoken is a stateless token and that it needs to self-contain this information, but I think we can get 99% of the way there by encoding the datastore type as a small attribute in the token. I understand migrating between different instances of the same datastore type can be problematic, but no one asked for this, and we could find alternative ways to handle it (e.g. a flag that forces SpiceDB to ignore the requested consistency and fall back to full_consistency all the time, or minimize_latency if the customer can take the hit of ignoring the new-enemy problem during the migration). I'd argue that by adding such a flag we could avoid having to add the datastore type entirely.

@@ -55,27 +56,47 @@ func RevisionFromContext(ctx context.Context) (datastore.Revision, *v1.ZedToken,
	handle := c.(*revisionHandle)
	rev := handle.revision
	if rev != nil {
-		return rev, zedtoken.MustNewFromRevision(rev), nil
+		ds := datastoremw.FromContext(ctx)
Contributor

This middleware already had a dependency on the datastore middleware, but we never declared it via the middleware framework. Please update the middleware chain to state that this middleware has such a dependency.

Member Author

Are you asking me to add a comment stating that, or to add something somewhere? (sorry for forgetting)

	// Calculate a revision as we see fit
	databaseRev, err := ds.OptimizedRevision(ctx)
	if err != nil {
		return datastore.NoRevision, false, err
	}

	if requested != nil {
-		requestedRev, err := zedtoken.DecodeRevision(requested, ds)
+		requestedRev, status, err := zedtoken.DecodeRevision(requested, ds)
Contributor

It does not feel great to return a new status return value when this is something that could be surfaced via an error condition the call site can check for. I understand the benefit of the added return value, though: it forces the call site to make a conscious choice about whether to handle it.

Member Author

It's explicitly not an error, though; it only becomes an error if that is the setting on the CLI. I also like that callers must handle it, as you said.

if status == zedtoken.StatusMismatchedDatastoreID {
switch option {
case TreatMismatchingTokensAsFullConsistency:
log.Warn().Str("zedtoken", requested.Token).Msg("ZedToken specified references an older datastore and SpiceDB is configured to treat this as a full consistency request")
Contributor

I don't think it's older, just a different instance.

Suggested change
log.Warn().Str("zedtoken", requested.Token).Msg("ZedToken specified references an older datastore and SpiceDB is configured to treat this as a full consistency request")
log.Warn().Str("zedtoken", requested.Token).Msg("ZedToken specified references a different datastore instance and SpiceDB is configured to treat this as a full consistency request")

return headRev, false, nil

case TreatMismatchingTokensAsMinLatency:
log.Warn().Str("zedtoken", requested.Token).Msg("ZedToken specified references an older datastore and SpiceDB is configured to treat this as a min latency request")
Contributor

Suggested change
log.Warn().Str("zedtoken", requested.Token).Msg("ZedToken specified references an older datastore and SpiceDB is configured to treat this as a min latency request")
log.Warn().Str("zedtoken", requested.Token).Msg("ZedToken specified references a different datastore instance and SpiceDB is configured to treat this as a min latency request")

return databaseRev, false, nil

case TreatMismatchingTokensAsError:
log.Error().Str("zedtoken", requested.Token).Msg("ZedToken specified references an older datastore and SpiceDB is configured to raise an error in this scenario")
Contributor

I don't think we need to log in this case, since it will be surfaced via the gRPC error anyway.

Suggested change
log.Error().Str("zedtoken", requested.Token).Msg("ZedToken specified references an older datastore and SpiceDB is configured to raise an error in this scenario")

if err != nil {
return errInvalidZedToken
}

if status == zedtoken.StatusMismatchedDatastoreID {
log.Error().Str("zedtoken", consistency.GetAtExactSnapshot().Token).Msg("ZedToken specified references an older datastore but at-exact-snapshot was requested")
Contributor

I don't think we need to log this; it will surface via the gRPC error in the logs.

Suggested change
log.Error().Str("zedtoken", consistency.GetAtExactSnapshot().Token).Msg("ZedToken specified references an older datastore but at-exact-snapshot was requested")

Member Author

Changed to a warning

Comment on lines 79 to 94
TreatMismatchingTokensAsFullConsistency MismatchingTokenOption = iota

TreatMismatchingTokensAsMinLatency

TreatMismatchingTokensAsError
Contributor

please document

return &v1.WriteRelationshipsResponse{
WrittenAt: zedtoken.MustNewFromRevision(revision),
Contributor

Hmm, I'm realizing we use MustNewFromRevision everywhere, even though it panics in the case of a malformed proto encoding. I guess it worked because the middleware prevented any incorrect revision from making it through.

Member Author

Yeah, we could move away from the Must approach, but it shouldn't fail in "normal" operation

@@ -97,9 +97,14 @@ func (ss *schemaServer) ReadSchema(ctx context.Context, _ *v1.ReadSchemaRequest)
DispatchCount: dispatchCount,
})

zedToken, err := zedtoken.NewFromRevision(ctx, headRevision, ds)
Contributor

At this point it feels like NewFromRevision should be a method on the Datastore interface, but I get the convenience of not relying on the datastore, especially for tests.

Member Author

Yes and no; it uses the datastore's unique ID, but it's also not strictly tied to the datastore, so I prefer this approach.

if err != nil {
return status.Errorf(codes.InvalidArgument, "failed to decode start revision: %s", err)
}

if tokenStatus == zedtoken.StatusMismatchedDatastoreID {
Contributor

I think context is missing on why we are choosing to fail here. I would recommend adding a comment to clarify, since the implications are not immediately clear.

I'm trying to think through what type of scenario we are protecting against by failing here, but also what potential use cases we could be hampering by doing so.

  • Clearly a zedtoken for a different datastore type should fail, as those can fail in unexpected ways, even though we could potentially find ways to make some of them interoperate.
  • Same datastore type, but different instances: we can't guarantee that different instances observe the same transactions at the same timestamp.

In the case of Watch, reusing a zedtoken across different instances could lead to missed events.

The only scenario I can think of that could be impacted by this is dual-writing to a new SpiceDB instance and doing an eventual cut-over (e.g. a deep schema refactor). API calls with at_least_as_fresh can recover by momentarily switching to minimize_latency until the tokens are refreshed (although this could take very long, as it depends on resource changes! the system could be running in minimize for a long time). A cut-over to the Watch API will simply not work, regardless of the changes in this PR, as it can only work if both instances represent the same state of the world.

Long story short, I think this is the right call, and I wanted to think out loud and leave a trace of the why.

Member Author

Added a comment

@josephschorr
Member Author

> We are transferring UUIDs in every API call for no reason other than to prevent an eventual migration within the same datastore type. That's extremely wasteful.

Is it really? We could reduce the size of the ID to just a small prefix if we wish to save some bytes.

> It's still within the zedtoken spec limit, but it is a very high price to pay for an incredibly rare event (a legitimate one, don't get me wrong). We are storing UUIDs in the datastore as strings, which is the least compact way of storing them and transferring them over the wire. And we are forcing everybody to see their data transfer increase significantly (and latency!) to prevent a scenario that the system will not be subjected to in, I'll dare to say, practically 100% of its lifespan.

Except users have already been subject to it: this change was spurred by two different reports, one where a user moved between datastore types, and another where they moved between datastores of the same type (accidentally).

> We need to rethink this. I get that the zedtoken is a stateless token and that it needs to self-contain this information, but I think we can get 99% of the way there by encoding the datastore type as a small attribute in the token. I understand migrating between different instances of the same datastore type can be problematic, but no one asked for this,

It has happened before, sadly.

> and we could find alternative ways to handle it (e.g. a flag that forces SpiceDB to ignore the requested consistency and fall back to full_consistency all the time, or minimize_latency if the customer can take the hit of ignoring the new-enemy problem during the migration). I'd argue that by adding such a flag we could avoid having to add the datastore type entirely.

This PR does add that flag, but we also need a means of ensuring that zedtokens are not used across instances of a datastore. The fact that we allow it now is, technically speaking, a small but real risk.

@josephschorr
Member Author

Updated to store only an 8-byte prefix of the datastore ID. Since datastores are rarely replaced, this should be fine for detecting changes.

an older datastore is used

All ZedTokens are now minted with the datastore's unique ID included
in the ZedToken and that ID is checked when the ZedToken is decoded.

In scenarios where the datastore ID does not match, either an error is
raised (watch, at_exact_snapshot) or configurable behavior is used
(at_least_as_fresh)

Fixes authzed#1541
…token

Results in smaller tokens, but given that datastore IDs are randomly generated, this should still minimize the chances of a conflict
Development

Successfully merging this pull request may close these issues.

Improvement: Prevent ZedToken's from being used cross-datastore