perf: emit killswitch metrics when needed #71608

Open
wants to merge 1 commit into base: master
Conversation

@anonrig (Member) commented May 28, 2024

Warning: This is a rather controversial pull request.

We currently emit metrics for every killswitch check, regardless of the outcome of the killswitch operation. This code is called in a lot of different places, mostly on hot paths where performance is critical.

I'm recommending using the options automator to emit these metrics only when needed, rather than the current default of always emitting them.

For example, ingest_consumer.process_event calls this function, and it gets executed 4.5 million times in a 5-minute time span.

Regardless of whether this pull request gets merged, we should talk about reducing metrics on hot paths such as ingest_consumer.process_event.
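
For reference, here is a minimal sketch of how such a gating option could be registered in src/sentry/options/defaults.py; the default value and flag shown here are illustrative assumptions, not necessarily what this PR registers:

```python
# Sketch only: the default and flags are illustrative assumptions.
from sentry.options import FLAG_AUTOMATOR_MODIFIABLE, register

register(
    "system.emit-kill-switch-metrics",
    default=True,
    flags=FLAG_AUTOMATOR_MODIFIABLE,
)
```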

cc @getsentry/ops @getsentry/ingest

@github-actions bot added the Scope: Backend label May 28, 2024
@mitsuhiko (Member)

Historically we emitted these at a low sample rate rather than turning them off entirely. Is that an option we can use here? (It obviously won't work with Sentry metrics, but it should work with the underlying Datadog/statsd path.)

@anonrig (Member, Author) commented May 28, 2024

> Historically we emitted these at a low sample rate rather than turning them off entirely. Is that an option we can use here? (It obviously won't work with Sentry metrics, but it should work with the underlying Datadog/statsd path.)

@mitsuhiko We can pass a sample rate, but if I understand this correctly, unless we have an incident, we don't need these metrics to be emitted. What is the purpose of this metric outside of an incident?

```diff
@@ -279,7 +279,7 @@ def killswitch_matches_context(killswitch_name: str, context: Context, emit_metr
     option_value = options.get(killswitch_name)
     rv = _value_matches(killswitch_name, option_value, context)
 
-    if emit_metrics:
+    if emit_metrics and options.get("system.emit-kill-switch-metrics"):
```
Member

I didn't see any callers that set emit_metrics=False. If you wanted to make a smaller change, you could have the option drive the sampling rate of the metric instead of dropping the metric entirely.
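
To illustrate that alternative, here is a rough sketch of what the gated metric call could look like inside killswitch_matches_context; the option name "system.kill-switch-metrics-sample-rate", the metric name, and the tags are hypothetical, and only the sample_rate keyword of sentry.utils.metrics.incr is relied on:

```python
# Sketch only: option name, metric name, and tags are hypothetical.
if emit_metrics:
    metrics.incr(
        "killswitches.run",
        tags={"killswitch": killswitch_name, "decision": "blocked" if rv else "allowed"},
        sample_rate=options.get("system.kill-switch-metrics-sample-rate"),
    )
```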

Member Author

Yes, but what is the purpose of having this metric in the first place? If the killswitch is not enabled (which I assume is the case 99% of the time), this metric behaves much like @metrics.wraps("ingest_consumer.process_event").

codecov bot commented May 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.90%. Comparing base (cf74135) to head (4128a3c).
Report is 1081 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #71608   +/-   ##
=======================================
  Coverage   77.90%   77.90%           
=======================================
  Files        6552     6552           
  Lines      291857   291862    +5     
  Branches    50438    50438           
=======================================
+ Hits       227364   227378   +14     
+ Misses      58241    58232    -9     
  Partials     6252     6252           
Files Coverage Δ
src/sentry/options/defaults.py 100.00% <100.00%> (ø)

... and 9 files with indirect coverage changes

@fpacifici (Contributor)

In the general case, it would be better to leverage sampling than to disable metrics in order to improve performance. While it is true that we need many metrics only during incidents, we generally only know we need them once the issue has already happened.

We have a number of sampling mechanisms; some are static, some are smarter:

  • static. We rely on a setting to define the default sampling rate. The code does not know whether it is running in a high-throughput environment, but we do know that when we run a pod: each deployment in k8s can have different environment variables, so we can set the env var differently for high-throughput consumers.
  • adaptive. When you record counters, you do not need to send a metric every time you observe an event. See what Arroyo does for Kafka consumers, where the throughput can be extreme: https://github.com/getsentry/arroyo/blob/main/arroyo/processing/processor.py#L90-L122. It buffers metrics and periodically sends a single call to Datadog; recording three counters that count 1 is the same as recording one counter with value 3, if they are close enough in time (a minimal sketch of this buffering pattern follows below).
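
A self-contained sketch of that buffering pattern (not Arroyo's actual classes): increments are accumulated locally and flushed to the underlying backend at most once per interval, so many observations collapse into a single statsd/Datadog call:

```python
import time
from collections import defaultdict


class BufferedCounters:
    """Accumulate counter increments and flush them periodically (sketch only)."""

    def __init__(self, backend, flush_interval: float = 1.0) -> None:
        self.backend = backend              # anything exposing incr(name, value)
        self.flush_interval = flush_interval
        self.counters = defaultdict(int)
        self.last_flush = time.monotonic()

    def incr(self, name: str, value: int = 1) -> None:
        # Three incr(1) calls close together are equivalent to one incr(3).
        self.counters[name] += value
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self) -> None:
        for name, value in self.counters.items():
            self.backend.incr(name, value)
        self.counters.clear()
        self.last_flush = time.monotonic()
```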

On the other hand, there may be a case for disabling killswitch metrics in this specific instance, as they are rarely useful for detecting an incident, and dropping them would reduce our Datadog bill.
So I am not against this PR; it is fine for this case. I would not generalize it, though, and would rely on sampling in the general case.

@Swatinem (Member)

I don’t object to this PR, but it does add complexity to something that is already too complex.

We briefly discussed this within the team yesterday, and we agree that killswitches and metrics are one contributing factor to death by a thousand papercuts. Emitting a metric has a small but non-negligible cost (<0.1 ms), and it adds up if you emit a ton of them for no good reason.
The situation with killswitches in particular is that, depending on the task, we chain 3 or more of them, and we sometimes check the same killswitch in multiple tasks, sometimes for good reason.

Now, specifically for metrics and killswitches, I would ideally prefer an API like the following:

  • When accepting a message/task, emit a started metric (optional in cases where the outside system [Kafka, Celery] already provides compatible metrics).
  • When a message/task finishes processing, emit a completed metric.
  • When catching a killswitch (they should ideally throw an error that unwinds the stack), emit a killswitched metric instead, tagged with the specific killswitch name (see the sketch after this list).
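
A rough sketch of what that could look like; KillswitchTripped, run_task, and the metric names are hypothetical, not an existing Sentry API:

```python
# Sketch only: the exception, the wrapper, and the metric names are hypothetical.
from sentry.utils import metrics


class KillswitchTripped(Exception):
    def __init__(self, killswitch_name: str) -> None:
        super().__init__(killswitch_name)
        self.killswitch_name = killswitch_name


def run_task(name, handler, *args, **kwargs):
    metrics.incr("task.started", tags={"task": name})
    try:
        result = handler(*args, **kwargs)
    except KillswitchTripped as exc:
        # The killswitch unwound the stack: emit "killswitched" instead of "completed".
        metrics.incr("task.killswitched", tags={"task": name, "killswitch": exc.killswitch_name})
        return None
    metrics.incr("task.completed", tags={"task": name})
    return result
```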

As we assume killswitches are mutually exclusive (once one trips, you return early and never run into another), those two metrics let us cover these insights/percentages:

  • task successful
  • task was killswitched
  • task was lost (due to the process dying or whatever); this is the difference between "started" and "finished"

Accounting for processes being killed, in particular, unfortunately means that we might see a lot of variance in these metrics due to the time delay between "started" and "finished".

getsantry bot commented Jun 20, 2024

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Jun 20, 2024
Labels: Scope: Backend, Stale

5 participants