perf: emit killswitch metrics when needed #71608
base: master
Conversation
Historically we emitted a low sample rate for these rather than turning them off entirely. Is that an option we can use here? (Obviously won't work with sentry metrics, but it should work with the underlying datadog/statsd path.)
@mitsuhiko We can pass a sample rate, but if I understood this correctly, unless we have an incident, we don't need these metrics to be emitted. What is the purpose of this metric outside of an incident?
```diff
@@ -279,7 +279,7 @@ def killswitch_matches_context(killswitch_name: str, context: Context, emit_metr
     option_value = options.get(killswitch_name)
     rv = _value_matches(killswitch_name, option_value, context)

-    if emit_metrics:
+    if emit_metrics and options.get("system.emit-kill-switch-metrics"):
```
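To make the effect of the gate concrete, here is a runnable sketch of the control flow. The `OPTIONS` dict, the `EMITTED` list, the killswitch name in the usage example, and the simplified matching logic are all stand-ins, not the real Sentry internals:

```python
# Minimal, self-contained sketch of the gating in this diff.
OPTIONS = {"system.emit-kill-switch-metrics": False}
EMITTED = []


def _value_matches(killswitch_name, option_value, context):
    # Simplified: the killswitch "matches" when the context appears in
    # the configured entries (the real matching logic is richer).
    return bool(option_value) and context in option_value


def killswitch_matches_context(killswitch_name, context, emit_metrics=True):
    option_value = OPTIONS.get(killswitch_name)
    rv = _value_matches(killswitch_name, option_value, context)
    # The change under review: skip the metric call entirely on hot paths
    # unless the global option is flipped on (e.g. during an incident).
    if emit_metrics and OPTIONS.get("system.emit-kill-switch-metrics"):
        EMITTED.append(("killswitch.matched", killswitch_name, rv))
    return rv
```

With the option off (the default), the hot path never touches the metrics client; flipping the option via the options automator restores emission without a deploy.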
I didn't see any callers that set `emit_metrics=False`. If you wanted to make a smaller change, you could have the option drive the sampling rate of the metric instead of dropping the metric entirely.
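A sketch of that smaller change, assuming a statsd-style client and an invented option name (`system.kill-switch-metrics-sample-rate` is hypothetical, not an existing Sentry option):

```python
import random

OPTIONS = {"system.kill-switch-metrics-sample-rate": 0.01}
EMITTED = []


def incr(name, tags=None, sample_rate=1.0):
    # Stand-in for a statsd-style client: emit with probability `sample_rate`.
    if random.random() < sample_rate:
        EMITTED.append((name, tags))


def emit_killswitch_metric(killswitch_name, rv):
    rate = OPTIONS.get("system.kill-switch-metrics-sample-rate", 1.0)
    # A rate of 0 silences the metric entirely; a small rate keeps a
    # trickle of visibility at a fraction of the hot-path cost, and the
    # rate can be raised via the option during an incident.
    if rate > 0:
        incr(
            "killswitch.matched",
            tags={"killswitch": killswitch_name, "rv": rv},
            sample_rate=rate,
        )
```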
Yes, but what is the purpose of having this metric in the first place? If the killswitch is not enabled (I assume it is not enabled in 99% of cases), this metric will behave similarly to `@metrics.wraps("ingest_consumer.process_event")`.
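For readers outside the codebase: the comparison assumes `metrics.wraps` behaves like a per-call timing decorator. A minimal stand-in (not the real Sentry helper) illustrates why that comparison matters on a hot path:

```python
import functools
import time

TIMINGS = []


def wraps(key):
    # Stand-in for a `metrics.wraps`-style decorator: it records one
    # timing per call, unconditionally -- which is the comment's point
    # about hot-path metrics that fire whether or not anything
    # interesting happened.
    def decorator(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                TIMINGS.append((key, time.monotonic() - start))
        return inner
    return decorator


@wraps("ingest_consumer.process_event")
def process_event(message):
    return message
```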
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master   #71608     +/-   ##
=========================================
  Coverage   77.90%   77.90%
=========================================
  Files        6552     6552
  Lines      291857   291862       +5
  Branches    50438    50438
=========================================
+ Hits       227364   227378      +14
+ Misses      58241    58232       -9
  Partials     6252     6252
```
In the general case it would be a better option to leverage sampling than to disable some metrics in order to improve performance. While it is true that we need a lot of metrics only during incidents, we generally learn that we need them once the issue has already happened. We have a number of sampling mechanisms; some are static, some are smarter:

On the other hand, there may be a case for disabling killswitch metrics in this specific instance, as they are rarely useful for detecting an incident, and dropping them would reduce our Datadog bill.
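As one hypothetical example of a "smarter" mechanism than a fixed probability (this is an illustration, not an actual Sentry class): cap emissions per time window, so bursts during an incident still surface while steady-state hot-path cost stays bounded.

```python
import time


class RateLimitedEmitter:
    # Hypothetical sampler: allow at most `max_per_window` emissions per
    # time window instead of sampling with a fixed probability.
    def __init__(self, max_per_window, window_seconds=1.0, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self._window_start = clock()
        self._count = 0
        self.emitted = []

    def incr(self, name):
        now = self.clock()
        if now - self._window_start >= self.window_seconds:
            # New window: reset the emission budget.
            self._window_start = now
            self._count = 0
        if self._count < self.max_per_window:
            self._count += 1
            self.emitted.append(name)
```

The trade-off versus probabilistic sampling is that counts are no longer proportional to traffic, so this shape suits "is anything happening at all" signals better than rate estimation.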
I don’t object to this PR, but it also adds complexity to something that is already too complex. We discussed this within the team briefly yesterday, and we agree that killswitches and metrics are one contributing factor to death by a thousand papercuts. Emitting a metric has a small but non-negligible cost (<0.1ms), and it adds up if you emit a ton of them for no good reason. Now, specifically for metrics and killswitches, I would ideally prefer an API like the following:
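The comment's actual API example was not captured in this thread. Purely as an illustration of a started/finished shape consistent with the two metrics the discussion refers to, with every name here invented:

```python
EMITTED = []


def incr(name, tags=None):
    # Stand-in for a metrics client's counter.
    EMITTED.append((name, tags))


class KillswitchRun:
    # Hypothetical API: one "started" metric per pipeline run, and
    # "finished" only when no killswitch tripped. Because killswitches
    # early-return, started minus finished approximates how much traffic
    # the killswitches dropped.
    def __init__(self, consumer):
        self.consumer = consumer
        incr("killswitch.run.started", {"consumer": consumer})

    def finished(self):
        incr("killswitch.run.finished", {"consumer": self.consumer})


def process(message, killswitch_tripped):
    run = KillswitchRun("ingest_consumer")
    if killswitch_tripped:
        return None  # early return: no "finished" metric is emitted
    run.finished()
    return message
```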
As we assume killswitches are mutually exclusive (failing one, you early-return and will never run into another), those two metrics can cover these insights / percentages:

Unfortunately, accounting for processes being killed means that we might see a lot of variance in these metrics due to time delays between "started" and "finished".
This pull request has gone three weeks without activity. In another week, I will close it. But! If you comment or otherwise update it, I will reset the clock, and if you add the label, I will leave it alone.

"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
Warning: This is rather a controversial pull-request.

We currently emit metrics for each `killswitch` detection regardless of the outcome of the kill-switch operation. This check is called in a lot of different places, mostly on hot paths where performance is critical. I'm recommending using the options automator to emit these metrics only when needed, rather than the default, which is always.

For example, `ingest_consumer.process_event` calls this function, which gets executed 4.5 million times in a 5-minute time span.

Regardless of whether this pull-request gets merged, we should talk about reducing metrics on hot paths such as `ingest_consumer.process_event`.

cc @getsentry/ops @getsentry/ingest
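Putting the two numbers from this thread together (the <0.1 ms per-emission cost cited in review and the 4.5 million calls per 5-minute window from the description) gives a rough upper bound on the overhead:

```python
# Back-of-envelope estimate; both inputs come from this thread.
calls_per_window = 4_500_000      # killswitch checks per 5-minute window
per_emit_seconds = 0.0001         # <0.1 ms per metric emission (upper bound)
window_seconds = 5 * 60

total_seconds = calls_per_window * per_emit_seconds
cores_equivalent = total_seconds / window_seconds
# ~450 CPU-seconds per 5-minute window, i.e. roughly 1.5 CPU cores of
# continuous overhead spent just emitting this one metric.
```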