-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util/must: add runtime assertion API #106508
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looked at the high level design and UX and everything checks out as far as I'm concerned. I would be happy to see this reviewed by TE (or if TE are too swamped I can review on their behalf, just let me know) and get merged soon.
pkg/kv/kvserver/replica_raft.go
Outdated
if ctx.Done() != nil { | ||
return handleRaftReadyStats{}, errors.AssertionFailedf( | ||
"handleRaftReadyRaftMuLocked cannot be called with a cancellable context") | ||
if err := must.Zero(ctx, ctx.Done(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does must.Nil
work here, to match the previous ctx.Done() != nil
check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately no. must.Nil
only works with pointers, since it's generic over *T
. I haven't found a way to also include reference types (slices, maps, channels, funcs, etc) using generics, nor interfaces. We could use reflection, but that would be too expensive here.
However, for reference types and interfaces the zero value is nil
, so must.Zero()
here is exactly equivalent to == nil
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need a "nilable" constraint in the upstream package :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could enumerate these things manually, to have a constraint that matches all the nilable things. Haven't researched whether the syntax allows for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went down that path when implementing Len()
, trying to get it to work both with slices, maps, and strings. I sort of got there, but it confused type inference, so we'd have to manually instantiate the generic function at all call sites which gets really annoying.
Might still be possible, but I didn't immediately find a way to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think must.Zero is a surprising developer UX to test nil interface references.
Would it be possible to introduce an alias must.NilInterface or NilReference and recommend its use for that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we'd need one for each reference type, so about 6 in total (with interfaces), plus the NotNil variants, so 12. I've been hoping to find a way to finagle generics into doing it instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about 6 in total (with interfaces)
I was thinking about focusing on just interfaces, which is the most common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interfaces are also tricky, because we'd have to handle typed nils -- I'm not sure we can without reflection. But there's always the must.True(ctx, foo == nil)
escape hatch too. I agree though, we should try to do something better here.
f302de7
to
bf16d46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea! Overall LGTM - I left some context into types of assertions we use in the optimizer in case that's helpful in steering the API, but I'm not sure if the patterns are widespread, so feel free to ignore.
Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)
pkg/util/log/logcrash/crash_reporting.go
line 441 at r8 (raw file):
} must.MaybeSendReport = func(ctx context.Context, err error) { maybeSendCrashReport(ctx, err, ReportTypeAssertionFailure)
maybeSendCrashReport
always using global settings, which is discouraged when we have access to non-global settings:
cockroach/pkg/util/log/logcrash/crash_reporting.go
Lines 110 to 111 in 2675c7c
// Should be used only when strictly necessary; use ReportPanic whenever we have | |
// access to the settings. |
Should the package provide versions of each function that have settings arguments?
pkg/util/must/must.go
line 44 at r8 (raw file):
// different node (thus stack traces can be insufficient by themselves). // // Some assertions may be too expensive to always run. For example, we may want
We have a set of assertions in the optimizer that run for all test builds, but not in release builds. Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.
Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.
Translating one of these panics to use must
would look something like:
-if col == nil {
- panic(errors.AssertionFailedf("expected to find column %d in scope", colID))
-}
+if err := must.NotNil(ctx, col, "expected to find column %d in scope", colID); err != nil {
+ panic(err)
+}
They seem ergonomically similar to me, but the latter has added complexity of either fatalling in must.NotNil
or returning and error to panic. Should we provide versions of these assertions that always panic? Again, I'm not convinced this is a use case that must
should cater to, but thought I'd point it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)
pkg/util/log/logcrash/crash_reporting.go
line 441 at r8 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
maybeSendCrashReport
always using global settings, which is discouraged when we have access to non-global settings:cockroach/pkg/util/log/logcrash/crash_reporting.go
Lines 110 to 111 in 2675c7c
// Should be used only when strictly necessary; use ReportPanic whenever we have // access to the settings. Should the package provide versions of each function that have settings arguments?
I generally agree, but I think it's important for an API like this to reduce friction as much as possible, and having to plumb through settings is exactly the kind of friction that might discourage people from using it.
I'm feeling kind of okay about this because log.Fatalf
does exactly the same thing, for exactly the same reason. I think the main argument for plumbing through settings here is testability, and I think that's an ok tradeoff to make? Does plumbing through settings give us any other benefit?
pkg/util/must/must.go
line 44 at r8 (raw file):
Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.
How about something like must.TestOnly(func() { ... })
, which is similar to must.Expensive()
but gated on CrdbTestBuild
?
Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.
Right, we could add a helper for this, something like:
must.PanicIf(must.NotNil(ctx, col, "expected to find column %d in scope", colID))
I believe this is marginally better, because in non-release builds the log.Fatalf
is guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)
pkg/util/log/logcrash/crash_reporting.go
line 441 at r8 (raw file):
Does plumbing through settings give us any other benefit?
I'm trying to figure that out, but now confused. It looks like only the cli
package sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.
Did you test that Sentry reports are created?
pkg/util/must/must.go
line 44 at r8 (raw file):
How about something like
must.TestOnly(func() { ... })
, which is similar tomust.Expensive()
but gated onCrdbTestBuild
?
SGTM!
I believe this is marginally better, because in non-release builds the
log.Fatalf
is guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.
👍 We actually rely a bit on the distinction between an internal error that is caught (looks like ERROR: internal error ...
) versus a node-crashing panic. The latter causes much more problems for customers so we urgently prioritize fixing those. But we can figure out a workaround for this as we translate our assertions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thanks for setting this up, Erik!
I'd be happy for us to adopt an API like this throughout the codebase. Should we also mark CrdbTestBuild
as deprecated in the code? It'd be nice if we eventually removed that build tag and replaced its current uses with calls to must
and must.Expensive
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)
pkg/util/must/must.go
line 44 at r8 (raw file):
How about something like
must.TestOnly(func() { ... })
, which is similar tomust.Expensive()
but gated onCrdbTestBuild
?
Maybe I misunderstand, but how is TestOnly
different from Expensive
? AFAICT, both CrdbTestBuild
and Invariants
are set to true based on the same crdb_test
condition.
Looks like |
Ah right, I misread the build tags, thank you. I'm still wondering to what extent we can unify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on a first pass... Several things I am not a fan of,
- allowing assertion failure to be non-fatal is a can of worms (see my earlier comment [1])
Expensive
is an unnecessary addition--compiling assertions into a release build is by design an expensive choice- syntax sugar like
Equal
,NotEqual
,Less
, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo
In my mind, an ideal design would expose nothing more than must.True
and must.False
under the semantics that if you compiled with crdb_test
, and the invoked predicate evaluates to false
, then you crash. The reason for such (extreme) minimalism is I shouldn't have to think how an assertion is evaluated, much less which predicate to use to assert it. That often leads to errors, in my experience. An assertion mechanism should be as simple as the meaning of the logical predicate, P(x)
.
Reviewed 1 of 5 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)
pkg/util/must/must.go
line 61 at r13 (raw file):
// } // // Example: double-stopping a component. In release builds, we can simply ignore
Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still wondering to what extent we can unify
TestOnly
(if we decide to add it) andExpensive
.
Yeah, I sort of feel the same way. The idea would be to set up an additional nightly run of expensive/invariant tests, which would run the same set of tests that are covered by CrdbTestBuild
, so it isn't clear to me what value we get from gating them on CrdbTestBuild
instead of Expensive
. They'll still exercise exactly the same tests, with the same frequency. But I don't feel very strongly about this, if SQL have a strong preference for continuing to use CrdbTestBuild
then that's ok by me.
allowing assertion failure to be non-fatal is a can of worms
We regularly see fatal assertions cause hour-long outages, which may require shipping a custom binary to disable the assertion (which again takes hours). Fatal assertions in production are too big of a risk, and makes people scared of using assertions. They are also largely unnecessary, because most failures are scoped to a single request/transaction/range rather than an entire node, so killing the node causes a ton of collateral damage. We do want to know when they happen in the wild though, because they indicate uncaught bugs and gaps in test coverage.
Expensive
is an unnecessary addition--compiling assertions into a release build is by design an expensive choice
There is a big difference between a boolean conditional and something like MVCC stats assertions or Pebble's invariant checks -- the former is negligible, the latter will reduce end-to-end performance by at least one order of magnitude, and is entirely unsuitable to enable in non-release builds, but provides valuable coverage.
syntax sugar like
Equal
,NotEqual
,Less
, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo
The same could be said for Testify -- it's equivalent to t.Fatal
, and yet people clearly prefer it. If it makes people write more tests, then I'm all for it.
I think one of the main values for these helpers is in always including argument values in assertion failures, so that people don't have to do this by hand over and over -- or worse yet, forget it and we get a useless failure report. But one could also imagine other conveniences, like printing diffs when reporting struct inequality, or less trivial assertion logic -- I've kept it pretty bare-bones initially.
None of this is essential, but the point of a convenience API is to be, well, convenient, which encourages wider use.
In my mind, an ideal design would expose nothing more than
must.True
andmust.False
under the semantics that if you compiled withcrdb_test
, and the invoked predicate evaluates tofalse
, then you crash.
What would we do in a release build when encountering the condition that would have caused the compiled-out assertion to fire? Ignore it? With most existing assertions this would currently result in correctness violations. I think there's a valuable middle ground between "always fatal" and "always ignore it". Much of the motivation here is precisely to get coverage in real production environments, with minimal disruption, because we continue to hit issues that aren't caught by our tests.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)
pkg/util/log/logcrash/crash_reporting.go
line 441 at r8 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Does plumbing through settings give us any other benefit?
I'm trying to figure that out, but now confused. It looks like only the
cli
package sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.Did you test that Sentry reports are created?
The cli
package is the entry point of the cockroach
binary, so it's hooked up in the binary, which is what we care about here.
pkg/util/must/must.go
line 61 at r13 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.
We can ignore it when it's safe to do so. The caller is responsible for judging what the appropriate action is in release builds, just like they're responsible for judging what to do with any other error condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Raft assertions will need some more scrutiny, so I've pulled them out of this PR. Instead, I added a handful of simpler but representative examples.
I'll write up a bunch of follow-up issues for the remaining work tomorrow.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @knz, @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)
5ca7f29
to
3d2f8e8
Compare
I'm going to merge this, for any further concerns I suggest opening an issue and we can discuss there. bors r+ |
Needed to pull in `constraints.Ordered`. Epic: none Release note: None
Epic: none Release note: None
This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure. Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via `log.Fatalf`, but this should be the exception rather than the norm. It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds. This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings: * `log.Fatalf`: kills the node even in release builds, which can cause severe disruption over often minor issues. * `errors.AssertionFailedf`: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds. * `logcrash.ReportOrPanic`: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available. * `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters. * `util.RaceEnabled`: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector. For more details and examples, see the `must` package documentation. Epic: none Release note: None
Epic: none Release note: None
Canceled. |
bors r+ |
This PR was included in a batch that was canceled, it will be automatically retried |
Build succeeded: |
108272: util/must: revert API r=erikgrinaker a=erikgrinaker This patch reverts #106508, since `@RaduBerinde` [pointed out](#107790 (review)) a performance flaw where it will often incur an allocation on the happy path due to interface boxing of the format args. Other options are considered in #108169. We'll revisit runtime assertions with a different API that avoids this cost on the happy path. Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
For details and usage examples, see the package documentation.
This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.
Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via
log.Fatalf
, but this should be the exception rather than the norm.It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.
This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:
log.Fatalf
: kills the node even in release builds, which can cause severe disruption over often minor issues.errors.AssertionFailedf
: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds.logcrash.ReportOrPanic
: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.buildutil.CrdbTestBuild
: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.util.RaceEnabled
: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.For more details and examples, see the
must
package documentation.Resolves #94986.
Epic: none
Release note: None