Added support for sending traces to OpenTelemetry #1230

Merged
merged 1 commit into devel on Jun 23, 2022

Conversation

oxtoacart (Contributor) commented Jun 22, 2022:

This PR adds support for sending ops data to an OpenTelemetry endpoint as traces. The sample rate can be configured through the global config, and all ops that are fully sampled by Borda are also fully sampled by OpenTelemetry.

@@ -427,3 +431,9 @@ replica:
  staticpeeraddrs: []
  metadatabaseurls: *AllReplicaBaseUrls
  replicaserviceendpoint: *GlobalReplicaRust

otel:
  samplerate: 0.0001
Contributor Author:

This sample rate matches Borda's current sample rate of 0.01%.

Contributor:

I think it's easier to specify the sample rate as an integer, where the sampled ratio is 1/samplerate (this also lets us re-weight without losing precision in either direction).

Contributor Author:

Sorry, I'm not sure I follow. What's the problem with using decimal rates?

Contributor Author:

Well, by modeling this after Honeycomb's DeterministicSampler, I ended up switching to integer rates anyway :)
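
For context, a minimal sketch of what a Honeycomb-style deterministic sampler can look like with integer rates (illustrative only, not the exact code in this PR; the type and field names are hypothetical). It hashes the trace ID and keeps roughly 1 in sampleRate traces, so every process makes the same keep/drop decision for a given trace:

package otel

import (
	"crypto/sha1"
	"encoding/binary"
	"math"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// deterministicSampler keeps roughly 1 in sampleRate traces, deciding
// deterministically from a hash of the trace ID (in the spirit of
// Honeycomb's DeterministicSampler). sampleRate must be >= 1.
type deterministicSampler struct {
	sampleRate uint32
}

func (s deterministicSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	psc := trace.SpanContextFromContext(p.ParentContext)
	sum := sha1.Sum(p.TraceID[:])
	decision := sdktrace.Drop
	if binary.BigEndian.Uint32(sum[:4]) < math.MaxUint32/s.sampleRate {
		decision = sdktrace.RecordAndSample
	}
	return sdktrace.SamplingResult{Decision: decision, Tracestate: psc.TraceState()}
}

func (s deterministicSampler) Description() string {
	return "deterministicSampler"
}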

otel/otel.go Outdated

func init() {
	// Fully sample all borda.FullyReportedOps
	fullySampledOps = make(map[string]bool, len(borda.FullyReportedOps))
Contributor:

I feel like we might want to make these configurable if they're not already? I worry this could otherwise hammer Honeycomb.

Contributor Author:

Will do.

Contributor:

This is also the only place in the otel code where we reference borda. Probably worth splitting it out. I can't find where this is defined at a glance; can you link it?

One note: this could muck with our re-weighting of sampled traces.

Contributor Author:

No longer an issue

resource :=
	resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("flashlight"),
Contributor:

We need attribute.Key("SampleRate").Int64(1/cfg.SampleRate) (with some type fudging) so that Honeycomb re-weights counts.

Contributor (@lincolnmantracer) commented Jun 22, 2022:

This also appears to be wrong in github.com/getlantern/telemetry. (cc @myleshorton)

Contributor Author:

I'm handling this in our sampler on a span-by-span basis.
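
A rough sketch of how that per-span handling can look (hypothetical helper name, not the PR's actual code): when the sampler decides to keep a span, it attaches the effective integer rate as a SampleRate attribute so Honeycomb re-weights counts:

package otel

import (
	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// sampledResult builds a "keep this span" decision that carries the
// effective sample rate, so downstream tooling (e.g. Honeycomb) can
// re-weight counts per span.
func sampledResult(psc trace.SpanContext, sampleRate int64) sdktrace.SamplingResult {
	return sdktrace.SamplingResult{
		Decision:   sdktrace.RecordAndSample,
		Attributes: []attribute.KeyValue{attribute.Int64("SampleRate", sampleRate)},
		Tracestate: psc.TraceState(),
	}
}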

samplerate: 0.0001
endpoint: api.honeycomb.io:443
headers:
  x-honeycomb-team: GKSJeT1vfWEWgGzPoxiKLE
Contributor:

Is there a way to not do this?

Contributor:

Oh I guess that would be metrics.iantem.io next.

ops.EnableOpenTelemetry("flashlight")

stopperMx.Lock()
if stopper != nil {
Contributor:

Why is it necessary for this code to be thread safe?

Contributor Author:

Because the stopper can be called through the public function Stop(), which can theoretically run concurrently with new calls to Configure(). Configure() is called every time our global config changes, so it happens multiple times during the running of the program. Stop() is really only called when the app shuts down, but in theory that could happen concurrently with an ongoing Configure().
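
Roughly, the pattern under discussion looks like this (a sketch with illustrative names, not the PR's exact code): Configure swaps in a new stopper under a mutex, and Stop takes the same mutex, so a shutdown racing a config reload is safe:

package otel

import "sync"

var (
	stopperMx sync.Mutex
	stopper   func()
)

// replaceStopper is called from Configure on each global config reload: it
// stops the previous trace pipeline (if any) in the background and records
// the stopper for the new one.
func replaceStopper(newStopper func()) {
	stopperMx.Lock()
	defer stopperMx.Unlock()
	if stopper != nil {
		go stopper()
	}
	stopper = newStopper
}

// Stop flushes and shuts down the current pipeline; it may run concurrently
// with Configure, hence the mutex.
func Stop() {
	stopperMx.Lock()
	defer stopperMx.Unlock()
	if stopper != nil {
		stopper()
		stopper = nil
	}
}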

Contributor:

Gotcha, I understand the model now. I am a bit concerned that otel has global state (otel.SetTracerProvider) and that we don't understand how thread-safe it is (it appears to be, but that's also not in the contract). Do we already have locked global state in flashlight somewhere? If so, we could just put a TracerProvider in there, and it might simplify this code.

Contributor:

We could also fully rely on otel's thread-safety and do something like:

oldTp := otel.GetTracerProvider()
otel.SetTracerProvider(newTp)

if oldTp != nil {
    oldTp.Shutdown()
}

Shutdown appears to call exporter.Shutdown. (At least the documentation says that it does.)

Contributor Author:

I like this, but sadly GetTracerProvider() returns an interface that doesn't expose Shutdown(). I can cast to the concrete type, but if someone set a different implementation of TracerProvider then we get some weirdness. I think I'll leave as-is.

Contributor:

Why would they do that? That's super annoying. What about

type Shutdowner interface {
	Shutdown()
}

if shutdowner, ok := oldTp.(Shutdowner); ok {
	shutdowner.Shutdown()
}

Contributor:

On the other hand, we call SetTracerProvider a few lines above, so it doesn't seem a stretch to try to cast to that type.
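
For illustration, the cast being suggested might look like the following (a sketch; it assumes the provider currently registered is the SDK's *sdktrace.TracerProvider that this same code set earlier, and the function name is hypothetical):

package otel

import (
	"context"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// shutdownCurrentProvider casts the globally registered provider back to the
// SDK type and shuts it down; if someone registered a different
// implementation, it silently does nothing.
func shutdownCurrentProvider(ctx context.Context) error {
	if tp, ok := otel.GetTracerProvider().(*sdktrace.TracerProvider); ok {
		return tp.Shutdown(ctx)
	}
	return nil
}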

Contributor:

I'll leave this here: open-telemetry/opentelemetry-go#23

	go stopper()
}
stopper = func() {
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
Contributor:

Hanging for a minute when a user is trying to close the app (?) seems bad. And the otel exporter should be much faster than that.

Contributor Author:

It's faster than that under good network conditions, but in a censored environment, potentially going over domain fronting (not yet implemented), it can take a moment. The alternative is to lose potentially valuable telemetry data.
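
As a sketch of the trade-off being discussed (illustrative names; the one-minute bound mirrors the diff above): the stopper caps how long shutdown waits for the exporter to flush, so a slow or domain-fronted network delays exit by at most that bound rather than hanging indefinitely:

package otel

import (
	"context"
	"log"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newStopper returns a func that flushes and shuts down the given provider,
// waiting at most one minute so app shutdown can't hang on a slow network.
func newStopper(tp *sdktrace.TracerProvider) func() {
	return func() {
		ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
		defer cancel()
		if err := tp.Shutdown(ctx); err != nil {
			log.Printf("error shutting down tracer provider: %v", err)
		}
	}
}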

@oxtoacart force-pushed the ox/otel branch 2 times, most recently from 0a9f002 to e75c54a on June 22, 2022 12:00
headers:
  x-honeycomb-team: GKSJeT1vfWEWgGzPoxiKLE
samplerate: 0.0001
opsamplerates:
Contributor:

What is the rationale for treating these ops differently?

Contributor Author:

It varies by op, but let's take something like client_started as an example. We may want to use that as a signifier that the client ran at all and use it to derive our user numbers. This event happens only once per startup, so if we sample it at a low rate, we'll completely miss a bunch of clients. If we're interested in looking for clients of a specific app version in a small country, we could miss them entirely just because they didn't make the cut for getting sampled.

Contributor Author:

More generally, for low-frequency ops, sampling at low rates risks missing them completely, so it's desirable to sample some things at higher rates.

otel/otel.go Outdated
psc := trace.SpanContextFromContext(p.ParentContext)

sampleRateMx.RLock()
defer sampleRateMx.RUnlock()
Contributor:

I'm not seeing what this protects against. If it's just opSampleRates (isn't that defined at startup?) then we could make a copy of the map in the opAwareSampler.

Contributor:

Oh, I think I was missing that we reload the global config sometimes. I would release the lock right after we read the sample rate.

This is maybe a stupid question, but could we just restart flashlight when it changes instead?

Contributor Author:

Flashlight is constantly handling client connections, so restarting the whole process is really disruptive.

Contributor:

Actually, Configure gets called when we reload the config, which creates a new opAwareSampler, right? So we can actually copy the op sample rates into the struct and not worry about locking here. I think it's worth it since this will be called concurrently.

Contributor Author:

I like using the copy; I've implemented that.
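
The copy approach might look roughly like this (hypothetical names, not the merged code): each Configure builds a new sampler with its own snapshot of the per-op rates, so ShouldSample reads them without any locking:

package otel

// opAwareSampler holds an immutable snapshot of per-op sample rates. A new
// instance is built on every Configure, so no lock is needed when sampling.
type opAwareSampler struct {
	defaultSampleRate int
	opSampleRates     map[string]int
}

// newOpAwareSampler copies the configured rates so later config reloads
// can't race with reads from ShouldSample.
func newOpAwareSampler(defaultSampleRate int, opSampleRates map[string]int) *opAwareSampler {
	copied := make(map[string]int, len(opSampleRates))
	for op, rate := range opSampleRates {
		copied[op] = rate
	}
	return &opAwareSampler{defaultSampleRate: defaultSampleRate, opSampleRates: copied}
}

// sampleRateFor returns the configured rate for an op, falling back to the
// default rate for ops without an override.
func (s *opAwareSampler) sampleRateFor(op string) int {
	if rate, ok := s.opSampleRates[op]; ok {
		return rate
	}
	return s.defaultSampleRate
}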

@oxtoacart force-pushed the ox/otel branch 2 times, most recently from 58ef0c3 to 5e28c90 on June 23, 2022 13:01
@oxtoacart (Contributor Author):

The CI failures don't seem to be specific to this PR, so I'm going to merge.

@oxtoacart merged commit e8d8bb2 into devel on Jun 23, 2022
@oxtoacart deleted the ox/otel branch on June 23, 2022 13:56