perf: emit killswitch metrics when needed #71608

Open
wants to merge 1 commit into base: master
Conversation

@anonrig (Member) commented May 28, 2024

Warning: This is a rather controversial pull request.

We currently emit metrics for every killswitch check, regardless of the outcome of the killswitch operation. This code is called in a lot of different places, mostly on hot paths where performance is critical.

I'm recommending using the options automator to emit these metrics only when needed, rather than the current default of always emitting them.

For example, ingest_consumer.process_event calls this function, and it gets executed 4.5 million times in a 5-minute time span.

Regardless of whether this pull request gets merged, we should talk about reducing metrics on hot paths such as ingest_consumer.process_event.
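
For reference, here is a minimal sketch of how such a gating option could be registered in src/sentry/options/defaults.py; the default value and flag shown here are illustrative assumptions, not necessarily what this PR registers:

```python
# Sketch only: the default and flags are illustrative assumptions.
from sentry.options import FLAG_AUTOMATOR_MODIFIABLE, register

register(
    "system.emit-kill-switch-metrics",
    default=True,
    flags=FLAG_AUTOMATOR_MODIFIABLE,
)
```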

cc @getsentry/ops @getsentry/ingest

@github-actions bot added the Scope: Backend label May 28, 2024
@mitsuhiko (Member)

Historically we emitted these at a low sample rate rather than turning them off entirely. Is that an option we can use here? (It obviously won't work with Sentry metrics, but it should work with the underlying Datadog/statsd path.)

@anonrig (Member, Author) commented May 28, 2024

> Historically we emitted these at a low sample rate rather than turning them off entirely. Is that an option we can use here? (It obviously won't work with Sentry metrics, but it should work with the underlying Datadog/statsd path.)

@mitsuhiko We can pass a sample rate, but if I understand this correctly, unless we have an incident, we don't need these metrics to be emitted. What is the purpose of this metric outside of an incident?

```diff
@@ -279,7 +279,7 @@ def killswitch_matches_context(killswitch_name: str, context: Context, emit_metr
     option_value = options.get(killswitch_name)
     rv = _value_matches(killswitch_name, option_value, context)
 
-    if emit_metrics:
+    if emit_metrics and options.get("system.emit-kill-switch-metrics"):
```
Member

I didn't see any callers that set emit_metrics=False. If you wanted to make a smaller change, you could have the option drive the sampling rate of the metric instead of dropping the metric entirely.
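
To illustrate that alternative, here is a rough sketch of what the gated metric call could look like inside killswitch_matches_context; the option name "system.kill-switch-metrics-sample-rate", the metric name, and the tags are hypothetical, and only the sample_rate keyword of sentry.utils.metrics.incr is relied on:

```python
# Sketch only: option name, metric name, and tags are hypothetical.
if emit_metrics:
    metrics.incr(
        "killswitches.run",
        tags={"killswitch": killswitch_name, "decision": "blocked" if rv else "allowed"},
        sample_rate=options.get("system.kill-switch-metrics-sample-rate"),
    )
```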

Member Author

Yes, but what is the purpose of having this metric in the first place? If the killswitch is not enabled (which I assume is the case 99% of the time), this metric behaves much like @metrics.wraps("ingest_consumer.process_event").

codecov bot commented May 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.90%. Comparing base (cf74135) to head (4128a3c).
Report is 1081 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #71608   +/-   ##
=======================================
  Coverage   77.90%   77.90%           
=======================================
  Files        6552     6552           
  Lines      291857   291862    +5     
  Branches    50438    50438           
=======================================
+ Hits       227364   227378   +14     
+ Misses      58241    58232    -9     
  Partials     6252     6252           
Files Coverage Δ
src/sentry/options/defaults.py 100.00% <100.00%> (ø)

... and 9 files with indirect coverage changes

@fpacifici (Contributor)

In the general case, it would be better to leverage sampling than to disable metrics in order to improve performance. While it is true that we need many metrics only during incidents, we generally only know we need them once the issue has already happened.

We have a number of sampling mechanisms; some are static, some are smarter:

  • static. We rely on a setting to define the default sampling rate. The code does not know whether it is running in a high-throughput environment, but we do know that when we run a pod: each deployment in k8s can have different environment variables, so we can set the env var differently for high-throughput consumers.
  • adaptive. When you record counters, you do not need to send a metric every time you observe an event. See what Arroyo does for Kafka consumers, where the throughput can be extreme: https://github.com/getsentry/arroyo/blob/main/arroyo/processing/processor.py#L90-L122. It buffers metrics and periodically sends a single call to Datadog; recording three counters that count 1 is the same as recording one counter with value 3, if they are close enough in time (a minimal sketch of this buffering pattern follows below).
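
A self-contained sketch of that buffering pattern (not Arroyo's actual classes): increments are accumulated locally and flushed to the underlying backend at most once per interval, so many observations collapse into a single statsd/Datadog call:

```python
import time
from collections import defaultdict


class BufferedCounters:
    """Accumulate counter increments and flush them periodically (sketch only)."""

    def __init__(self, backend, flush_interval: float = 1.0) -> None:
        self.backend = backend              # anything exposing incr(name, value)
        self.flush_interval = flush_interval
        self.counters = defaultdict(int)
        self.last_flush = time.monotonic()

    def incr(self, name: str, value: int = 1) -> None:
        # Three incr(1) calls close together are equivalent to one incr(3).
        self.counters[name] += value
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self) -> None:
        for name, value in self.counters.items():
            self.backend.incr(name, value)
        self.counters.clear()
        self.last_flush = time.monotonic()
```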

On the other hand, there may be a case for disabling killswitch metrics in this specific instance, as they are rarely useful for detecting an incident, and dropping them would reduce our Datadog bill.
So I am not against this PR; it is fine for this case. I would not generalize it, though, and would rely on sampling in the general case.

@Swatinem (Member)

I don’t object to this PR, but it does add complexity to something that is already too complex.

We briefly discussed this within the team yesterday, and we agree that killswitches and metrics are one contributing factor to death by a thousand papercuts. Emitting a metric has a small but non-negligible cost (<0.1 ms), and it adds up if you emit a ton of them for no good reason.
The situation with killswitches in particular is that, depending on the task, we chain 3 or more of them, and we sometimes check the same killswitch in multiple tasks, sometimes for good reason.

Now, specifically for metrics and killswitches, I would ideally prefer an API like the following:

  • When accepting a message/task, emit a started metric (optional in cases where the outside system [Kafka, Celery] already provides compatible metrics).
  • When a message/task finishes processing, emit a completed metric.
  • When catching a killswitch (they should ideally throw an error that unwinds the stack), emit a killswitched metric instead, tagged with the specific killswitch name (see the sketch after this list).
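
A rough sketch of what that could look like; KillswitchTripped, run_task, and the metric names are hypothetical, not an existing Sentry API:

```python
# Sketch only: the exception, the wrapper, and the metric names are hypothetical.
from sentry.utils import metrics


class KillswitchTripped(Exception):
    def __init__(self, killswitch_name: str) -> None:
        super().__init__(killswitch_name)
        self.killswitch_name = killswitch_name


def run_task(name, handler, *args, **kwargs):
    metrics.incr("task.started", tags={"task": name})
    try:
        result = handler(*args, **kwargs)
    except KillswitchTripped as exc:
        # The killswitch unwound the stack: emit "killswitched" instead of "completed".
        metrics.incr("task.killswitched", tags={"task": name, "killswitch": exc.killswitch_name})
        return None
    metrics.incr("task.completed", tags={"task": name})
    return result
```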

As we assume killswitches are mutually exclusive (once one trips, you return early and never run into another), those two metrics let us cover these insights/percentages:

  • task successful
  • task was killswitched
  • task was lost (due to the process dying or whatever); this is the difference between "started" and "finished"

Accounting for processes being killed, in particular, unfortunately means that we might see a lot of variance in these metrics due to the time delay between "started" and "finished".

getsantry bot commented Jun 20, 2024

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Jun 20, 2024
Labels: Scope: Backend, Stale

5 participants