Telemetry stats #515
Conversation
Only some small things.
Force-pushed from 7071dbd to 892605f.
    ## Collected metrics

    ### spicedb_telemetry_info (Gauge)
As I was reviewing I kept coming back to this doc to make sure it matched the implementation; maybe we should throw something on the backlog to automate it.
    	Interval = time.Hour
    )

    func writeTimeSeries(ctx context.Context, client *http.Client, endpoint string, ts []*prompb.TimeSeries) error {
Discussed offline, but I tried to find a spec for the remote write API and failed. The closest I got was https://github.com/prometheus/prometheus/blob/main/storage/remote/client.go#L191
Maybe we could use that implementation, or at least link back to it, so that if there are changes between remote write API versions we know where to look.
Is there a reason we wouldn't want to use this actual client?
internal/telemetry/reporter.go (outdated)
    	Msg("telemetry reporter scheduled")

    // Fire off the first at start-up.
    if err := discoverAndWriteMetrics(ctx, endpoint); err != nil {
should there be jitter here so a large cluster starting up doesn't thunderously herd the write endpoint?
pkg/cmd/server/server.go (outdated)
    @@ -369,6 +383,8 @@ func (c *completedServerConfig) Run(ctx context.Context) error {
    	g.Go(c.dashboardServer.ListenAndServe)
    	g.Go(stopOnCancel(c.dashboardServer.Close))

    	g.Go(func() error { return telemetry.ReportForever(ctx, c.telemetryEndpoint) })
I don't have a strong opinion here, but I like the model that the grpc/http servers follow, where if they're disabled the objects get replaced with "noop" servers that just log and stop.
internal/telemetry/reporter.go (outdated)
    	Err(err).
    	Str("endpoint", endpoint).
    	Msg("failed to push telemetry metric")
    nextPush = backoffInterval.NextBackOff()
what happens if the backoff interval becomes longer than the initial interval?
I don't understand the question.
The normal interval is 1hr; can the exponential backoff grow until the retry delay is 2hrs? Don't we want a base frequency of 1hr regardless of backoff?
internal/telemetry/reporter.go (outdated)
    	})
    }

    switch *fam.Type {
I think you can just use https://pkg.go.dev/github.com/prometheus/common/expfmt#ExtractSamples to avoid this, but I didn't test it
Looked into it and this looks usable. We'd iterate over those samples and create the prompb.
internal/telemetry/reporter.go (outdated)
    	"github.com/golang/snappy"
    	dto "github.com/prometheus/client_model/go"
    	"github.com/prometheus/common/model"
    	"github.com/prometheus/prometheus/prompb"
We can drop importing all of prometheus if we use this: `go.buf.build/protocolbuffers/go/prometheus/prometheus`
LGTM
Force-pushed from 8e10a0b to 667f2be.
LGTM
just had one thing we might want to think about for the future
    return func(ctx context.Context) error {
    	// Smear the startup delay out over 10% of the reporting interval
    	startupDelay := time.Duration(rand.Int63n(int64(interval.Seconds()/10))) * time.Second
Since that could be up to ~6m after startup (10% of the 1h interval), maybe we should attempt to write on shutdown if we haven't written at all yet and the server has at least successfully started up. And actually, maybe we want separate telemetry on startup error rates.
I realize this is a fairly involved change for an uncommon edge case; maybe we just make an issue and do it later.