
Eagerly create instruments to fail fast on initialization #141

Merged: 11 commits into main from ed/instruments (Nov 8, 2023)

Conversation

@emcfarlane (Contributor) commented Oct 27, 2023

This changes the API for creating an Interceptor. Instruments are now created eagerly, and any errors fail at initialization. We therefore return the error as part of:

func NewInterceptor(opts ...Option) (*Interceptor, error)

Previously, on metric instrumentation errors we would fail the request without calling otel.Handle(err). This goes against the recommended behaviour documented here, which favours uptime over metric loss.
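For illustration, a minimal sketch of caller-side usage under the new signature (the import paths and the log.Fatalf handling are assumptions for the example, not part of this change):

```go
package main

import (
	"log"

	"connectrpc.com/connect"
	"connectrpc.com/otelconnect"
)

func main() {
	// Instrument-creation errors now surface here, at startup,
	// instead of failing individual RPCs later at run time.
	interceptor, err := otelconnect.NewInterceptor()
	if err != nil {
		log.Fatalf("otelconnect: failed to create interceptor: %v", err)
	}
	// Pass the interceptor to Connect handlers and clients as usual.
	_ = connect.WithInterceptors(interceptor)
}
```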

@emcfarlane self-assigned this Oct 27, 2023
Base automatically changed from ed/protos to main on November 1, 2023 at 20:27
@emcfarlane requested a review from jhump on November 6, 2023 at 20:04
@jhump (Member) left a comment

It seems bad to just silently ignore errors when creating the instruments. What exactly does otel.Handle(error) do? Is that really the best way to proceed?

It seems like it would be better to create the instruments on construction of the interceptor instead of deferring to first use. That way it's unlikely for the application to be deployed with a bug that basically means no metrics are getting published. In general, for predictable operations and observability, it is better for the application to fail during startup/initialization than to proceed without any metrics. And this also seems consistent with the best practices you linked:

The API or SDK MAY fail fast and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.

streaming.go: review thread (outdated, resolved)
interceptor.go: review thread (outdated, resolved)
@emcfarlane (Contributor, Author) commented:

Could remove the once and return an error from NewInterceptor.

@jhump yep, I think it would be better to error on creation, but that requires an API change. otel.Handle(err) is how errors are handled in other packages, so it would be required to monitor it. I can improve the docs to clarify?

This fixes the "MUST NOT cause the application to fail later at run time" requirement and defers solving the initialization issue.

@jhump (Member) commented Nov 7, 2023

otel.Handle(err) is how errors are handled in other packages so would be required to monitor it.

But what kind of errors? Surely not initialization errors. The signature of the error handler, even if you install your own, isn't useful. "Monitoring it" isn't helpful because when you observe an error, there will be no metrics from this process at all, and there's not much the error handler could actually do about it -- perhaps stop/kill the current process, which seems bad. And it's doubly bad that this happens after the process has already started taking actual RPC traffic: even if you monitor it, the report is post facto and your operational data is already incomplete.
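For context, a sketch of what installing a global handler looks like with the standard go.opentelemetry.io/otel API (assumed here; the logging body is illustrative), which shows why its error-only signature is so limited:

```go
import (
	"log"

	"go.opentelemetry.io/otel"
)

func init() {
	// The handler only receives the error value, so about all it can
	// do is log, count, or abort the process.
	otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
		log.Printf("otel error: %v", err)
	}))
}
```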

I think it would be better to error on creation but requires an API change

Then let's make the API change. To me, the current behavior is a serious issue and the only proper fix is the API change. What's in this branch may be an improvement, but IMO not a sufficient one and still far from a proper solution. This repo has not yet reached a v1 (all releases have been marked as "pre-release") so, especially given the nature of the issue (that a failure to initialize metrics will fail all RPCs), the API change is definitely warranted and acceptable.

@emcfarlane (Contributor, Author) commented:

Surely not initialization errors.

Yes, see the linked example from grpc: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/5adc27110c6f8edff55e07c668aeb140a166dcfd/instrumentation/google.golang.org/grpc/otelgrpc/config.go#L75-L80
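Roughly the shape of that pattern, paraphrased rather than copied from otelgrpc (the instrument name and the helper function are illustrative):

```go
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

func newInstruments(meter metric.Meter) {
	// Instruments are created during configuration; on error the global
	// otel handler is notified and execution continues, so a bad metrics
	// setup never fails an RPC (the returned instrument still works as a
	// no-op in that case).
	duration, err := meter.Float64Histogram(
		"rpc.server.duration",
		metric.WithUnit("ms"),
	)
	if err != nil {
		otel.Handle(err)
	}
	_ = duration
}
```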

Will update the API. This means having to initialize both client and server instruments; not sure what the downsides of that are.

@jhump (Member) commented Nov 7, 2023

Yes, see the linked example from grpc:

Oh, okay. We could do the same thing then. That's better because they are creating the instruments during initialization instead of deferring to first use. That is strictly better than what's in this branch because it means that the error handler could prevent the server from starting.

Although, TBH, I think an API change is even better. I'd think the otel error handler would be better for runtime errors (like a trace collector that encounters I/O errors writing data, or a metrics handler that fails to push to a statsd gateway, etc), not for initialization errors. 🤷

But even if we didn't make the API change, we should at least eagerly initialize the instruments so that the error is reported during startup instead of when serving actual traffic.

This means having to initialize both client and server instruments

Sadly, this is a shortcoming of the connect interceptor API: it tries to use the same interface for both. I don't actually think it's much to worry about, because under the hood most metrics libraries don't really do anything with the instruments until they get an observation, since the in-memory storage typically needs at least one actual label value (like an RPC method name) to really initialize.

@emcfarlane (Contributor, Author) commented Nov 7, 2023

Changed the API to return an error on init.

@emcfarlane changed the title from "Init instruments without failing requests" to "Eagerly create instruments to fail fast on initialization" on Nov 7, 2023
interceptor.go: review thread (resolved)
@jhump (Member) commented Nov 7, 2023

It's not to stop the server from starting or ensure all requests fail.

It is of course not to ensure requests fail. The previous behavior was certainly the worst thing we could do, operationally speaking.

As far as the point not being to stop the server: sure, that's not the point, but it certainly needs to be an option.

Errors in config are meant to drop metrics in favour of uptime.

The best way to preserve uptime is often to not roll out a new version of a service if it has errors in its config. The best way to prevent roll-out of a bad version is to prevent the server from starting up. Sane deployment systems will pause a roll-out if too many new containers fail to start. And doing so early preserves uptime by causing the misconfigured server to fail before it actually accepts any traffic.

@jhump (Member) left a comment

LGTM.

I left a comment just in case you'd prefer to explore another alternative that does not change the API.

@jhump merged commit 00e7991 into main on Nov 8, 2023
6 checks passed
@jhump deleted the ed/instruments branch on November 8, 2023 at 16:37