Eagerly create instruments to fail fast on initialization #141
Conversation
Force-pushed from def75ec to f25d0d5
It seems bad to just silently ignore errors when creating the instruments. What exactly does otel.Handle(error) do? Is that really the best way to proceed?
It seems like it would be better to create the instruments on construction of the interceptor instead of deferring to first use. That way it's unlikely for the application to be deployed with a bug that basically means no metrics are getting published. In general, for predictable operations and observability, it is better for the application to fail during startup/initialization than to proceed without any metrics. And this also seems consistent with the best practices you linked:
The API or SDK MAY fail fast and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.
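The fail-fast construction argued for above can be sketched as follows. This is a simplified stand-in, not the repo's actual API: `newInstruments`, `Interceptor`, and `NewInterceptor` here are hypothetical, and a boolean flag substitutes for a real meter configuration and for the real OpenTelemetry instrument-creation calls.

```go
package main

import (
	"errors"
	"fmt"
)

// instruments stands in for the set of metric instruments the
// interceptor records to (request durations, message sizes, etc.).
type instruments struct {
	duration string // placeholder for a real histogram instrument
}

// newInstruments creates every instrument up front, so that a bad
// meter configuration surfaces here rather than on the first RPC.
func newInstruments(validConfig bool) (*instruments, error) {
	if !validConfig {
		return nil, errors.New("failed to create rpc.server.duration instrument")
	}
	return &instruments{duration: "rpc.server.duration"}, nil
}

// Interceptor is a hypothetical stand-in for the repo's interceptor type.
type Interceptor struct {
	inst *instruments
}

// NewInterceptor fails fast: a misconfigured meter prevents
// construction instead of silently dropping metrics at run time.
func NewInterceptor(validConfig bool) (*Interceptor, error) {
	inst, err := newInstruments(validConfig)
	if err != nil {
		return nil, fmt.Errorf("creating instruments: %w", err)
	}
	return &Interceptor{inst: inst}, nil
}

func main() {
	if _, err := NewInterceptor(false); err != nil {
		fmt.Println("startup aborted:", err)
	}
}
```

The point of the sketch is where the error surfaces: at construction, before the process is wired into any RPC traffic.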
@jhump Yep, I think it would be better to error on creation, but that requires an API change. This fixes the
But what kind of errors? Surely not initialization errors. The signature of the error handler, even if you install your own, isn't useful. "Monitoring it" isn't helpful because when you observe an error, there will be no metrics from this process at all, and there's not much the error handler could actually do about it -- perhaps stop/kill the current process, which seems bad. And it's doubly-bad that this happens after the process has already started taking actual RPC traffic, which means even if you monitor it, since it's post-facto, your operational data is already missing data.
Then let's make the API change. To me, the current behavior is a serious issue and the only proper fix is the API change. What's in this branch may be an improvement, but IMO not a sufficient one and still far from a proper solution. This repo has not yet reached a v1 (all releases have been marked as "pre-release") so, especially given the nature of the issue (that a failure to initialize metrics will fail all RPCs), the API change is definitely warranted and acceptable.
Yes, see the linked example from grpc: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/5adc27110c6f8edff55e07c668aeb140a166dcfd/instrumentation/google.golang.org/grpc/otelgrpc/config.go#L75-L80 Will update the API. This means having to initialize both client and server instruments, though I'm not sure what the downsides of that are.
Oh, okay. We could do the same thing then. That's better because they are creating the instruments during initialization instead of deferring to first use. That is strictly better than what's in this branch because it means that the error handler could prevent the server from starting. Although, TBH, I think an API change is even better. I'd think the otel error handler would be better for runtime errors (like a trace collector that encounters I/O errors writing data, or a metrics handler that fails to push to a statsd gateway, etc), not for initialization errors. 🤷 But even if we didn't make the API change, we should at least eagerly initialize the instruments so that the error is reported during startup instead of when serving actual traffic.
Sadly, a shortcoming of the connect interceptor API is that it tries to use the same interface for both. I don't actually think it's much to worry about, because under the hood most metrics libraries don't really do anything with the instruments until they get an observation, since most of the in-memory storage needs at least one actual label value (like RPC method name) to really initialize.
Changed the API to return an error on init.
It is of course not to ensure requests fail. The previous behavior was certainly the worst thing we could do, operationally speaking. As far as the point not being to stop the server: sure, that's not the point, but it certainly needs to be an option.
The best way to preserve uptime is often to not roll out a new version of a service if it has errors in its config. The best way to prevent rollout of a bad version is to prevent the server from starting up: sane deployment systems will pause a rollout if too many new containers are failing to start. And doing so early preserves uptime, by causing the misconfigured server to fail before it was actually accepting any traffic.
LGTM.
I left a comment just in case you'd prefer to explore another alternative that does not change the API.
Change in API on creating Interceptor: instruments are now created eagerly, and any errors fail at initialization, so the error is returned as part of construction.

Before, on metrics instrumentation errors we would fail the request without calling otel.Handle(err). This goes against the recommended behaviour documented here, which favours uptime over metric loss.
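From the caller's side, the described API change looks roughly like the sketch below. The constructor name, signature, and options are illustrative stand-ins, not the repo's exact API: construction errors are handled once at startup, while otel.Handle remains for run-time telemetry errors.

```go
package main

import (
	"errors"
	"log"
)

// Interceptor and NewInterceptor are illustrative stand-ins for the
// changed API: instruments are created eagerly inside the constructor,
// so configuration problems surface as a returned error.
type Interceptor struct{}

func NewInterceptor(opts ...string) (*Interceptor, error) {
	for _, o := range opts {
		if o == "bad-config" { // placeholder for a real misconfiguration
			return nil, errors.New("failed to create instruments")
		}
	}
	return &Interceptor{}, nil
}

func main() {
	// Fail fast: a misconfigured interceptor stops the process
	// before it ever accepts RPC traffic.
	interceptor, err := NewInterceptor()
	if err != nil {
		log.Fatalf("initializing telemetry: %v", err)
	}
	_ = interceptor // would be passed to the RPC client/server options
}
```

This matches the best-practice quote earlier in the thread: fail on initialization if at all, never later at run time.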