
Eagerly create instruments to fail fast on initialization #141

Merged: 11 commits into main from ed/instruments (Nov 8, 2023)

Conversation

@emcfarlane (Contributor) commented Oct 27, 2023

This changes the API for creating an Interceptor. Instruments are now created eagerly, and any errors fail at initialization. We therefore return the error as part of:

func NewInterceptor(opts ...Option) (*Interceptor, error)

Previously, on metric instrumentation errors we would fail the request without calling otel.Handle(err). This goes against the recommended behaviour documented here, which favours uptime over metric loss.
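For illustration, a minimal sketch of caller-side usage under the new signature (the import paths and the log.Fatalf handling are assumptions for the example, not part of this change):

```go
package main

import (
	"log"

	"connectrpc.com/connect"
	"connectrpc.com/otelconnect"
)

func main() {
	// Instrument-creation errors now surface here, at startup,
	// instead of failing individual RPCs later at run time.
	interceptor, err := otelconnect.NewInterceptor()
	if err != nil {
		log.Fatalf("otelconnect: failed to create interceptor: %v", err)
	}
	// Pass the interceptor to Connect handlers and clients as usual.
	_ = connect.WithInterceptors(interceptor)
}
```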

@emcfarlane self-assigned this Oct 27, 2023
Base automatically changed from ed/protos to main on November 1, 2023 at 20:27
@emcfarlane requested a review from jhump on November 6, 2023 at 20:04
@jhump (Member) left a comment

It seems bad to just silently ignore errors when creating the instruments. What exactly does otel.Handle(error) do? Is that really the best way to proceed?

It seems like it would be better to create the instruments on construction of the interceptor instead of deferring to first use. That way it's unlikely for the application to be deployed with a bug that basically means no metrics are getting published. In general, for predictable operations and observability, it is better for the application to fail during startup/initialization than to proceed without any metrics. And this also seems consistent with the best practices you linked:

The API or SDK MAY fail fast and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.

streaming.go: review thread (outdated, resolved)
interceptor.go: review thread (outdated, resolved)
@emcfarlane (Contributor, Author) commented:

Could remove the once and return an error from NewInterceptor.

@jhump yep, I think it would be better to error on creation, but that requires an API change. otel.Handle(err) is how errors are handled in other packages, so it would be required to monitor it. I can improve the docs to clarify?

This fixes the "MUST NOT cause the application to fail later at run time" requirement and defers solving the initialization issue.

@jhump (Member) commented Nov 7, 2023

otel.Handle(err) is how errors are handled in other packages so would be required to monitor it.

But what kind of errors? Surely not initialization errors. The signature of the error handler, even if you install your own, isn't useful. "Monitoring it" isn't helpful because when you observe an error, there will be no metrics from this process at all, and there's not much the error handler could actually do about it -- perhaps stop/kill the current process, which seems bad. And it's doubly bad that this happens after the process has already started taking actual RPC traffic: even if you monitor it, the report is post facto and your operational data is already incomplete.
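For context, a sketch of what installing a global handler looks like with the standard go.opentelemetry.io/otel API (assumed here; the logging body is illustrative), which shows why its error-only signature is so limited:

```go
import (
	"log"

	"go.opentelemetry.io/otel"
)

func init() {
	// The handler only receives the error value, so about all it can
	// do is log, count, or abort the process.
	otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
		log.Printf("otel error: %v", err)
	}))
}
```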

I think it would be better to error on creation but requires an API change

Then let's make the API change. To me, the current behavior is a serious issue and the only proper fix is the API change. What's in this branch may be an improvement, but IMO not a sufficient one and still far from a proper solution. This repo has not yet reached a v1 (all releases have been marked as "pre-release") so, especially given the nature of the issue (that a failure to initialize metrics will fail all RPCs), the API change is definitely warranted and acceptable.

@emcfarlane (Contributor, Author) commented:

Surely not initialization errors.

Yes, see the linked example from grpc: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/5adc27110c6f8edff55e07c668aeb140a166dcfd/instrumentation/google.golang.org/grpc/otelgrpc/config.go#L75-L80
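Roughly the shape of that pattern, paraphrased rather than copied from otelgrpc (the instrument name and the helper function are illustrative):

```go
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

func newInstruments(meter metric.Meter) {
	// Instruments are created during configuration; on error the global
	// otel handler is notified and execution continues, so a bad metrics
	// setup never fails an RPC (the returned instrument still works as a
	// no-op in that case).
	duration, err := meter.Float64Histogram(
		"rpc.server.duration",
		metric.WithUnit("ms"),
	)
	if err != nil {
		otel.Handle(err)
	}
	_ = duration
}
```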

Will update the API. This means having to initialize both client and server instruments; not sure what the downsides of that are.

@jhump (Member) commented Nov 7, 2023

Yes, see the linked example from grpc:

Oh, okay. We could do the same thing then. That's better because they are creating the instruments during initialization instead of deferring to first use. That is strictly better than what's in this branch because it means that the error handler could prevent the server from starting.

Although, TBH, I think an API change is even better. I'd think the otel error handler would be better for runtime errors (like a trace collector that encounters I/O errors writing data, or a metrics handler that fails to push to a statsd gateway, etc), not for initialization errors. 🤷

But even if we didn't make the API change, we should at least eagerly initialize the instruments so that the error is reported during startup instead of when serving actual traffic.

This means having to initialize both client and server instruments

Sadly, this is a shortcoming of the connect interceptor API: it tries to use the same interface for both. I don't actually think it's much to worry about, because under the hood most metrics libraries don't really do anything with the instruments until they get an observation, since the in-memory storage typically needs at least one actual label value (like an RPC method name) to really initialize.

@emcfarlane (Contributor, Author) commented Nov 7, 2023

Changed the API to return an error on init.

@emcfarlane changed the title from "Init instruments without failing requests" to "Eagerly create instruments to fail fast on initialization" on Nov 7, 2023
interceptor.go: review thread (resolved)
@jhump (Member) commented Nov 7, 2023

It's not to stop the server from starting or ensure all requests fail.

It is of course not to ensure requests fail. The previous behavior was certainly the worst thing we could do, operationally speaking.

As far as the point not being to stop the server: sure, that's not the point, but it certainly needs to be an option.

Errors in config are meant to drop metrics in favour of uptime.

The best way to preserve uptime is often to not roll out a new version of a service if it has errors in its config. The best way to prevent roll-out of a bad version is to prevent the server from starting up. Sane deployment systems will pause a roll-out if too many new containers fail to start. And doing so early preserves uptime by causing the misconfigured server to fail before it actually accepts any traffic.

@jhump (Member) left a comment

LGTM.

I left a comment just in case you'd prefer to explore another alternative that does not change the API.

@jhump merged commit 00e7991 into main on Nov 8, 2023
6 checks passed
@jhump deleted the ed/instruments branch on November 8, 2023 at 16:37