
fix(workflow_engine): Add a cache for Workflows to reduce DB load#106925

Merged
saponifi3d merged 35 commits into master from jcallender/aci/cache-workflows
Feb 10, 2026

Conversation

@saponifi3d
Contributor

Description

We select workflows from the DB very frequently. This has added substantial load to our DB, even though the query is very fast / efficient.

This PR introduces a caching layer for this high frequency db query.

.distinct()
)

cache.set(cache_key, workflows, timeout=CACHE_TTL)
Member

I'm curious what our observability of caching is here.
I know in traces one type of cache (django? is this django cache or only sometimes?) doesn't show up, and that's been a bit of a pain for debugging.

Also, it'd be nice if we could have counters for hit/miss so we can brag about how many queries we're avoiding.

Contributor Author
@saponifi3d saponifi3d Jan 29, 2026

yeah, i kinda purposefully was avoiding obs / counters thus far. 😅

did you have any specific obs in mind? i'm thinking a metric for cache hit / miss / invalidation.

🤔 maybe debug logs for cache miss and when we invalidate? (thinking a stack trace might be handy with signals. could at least see which models are causing invalidations etc)

This method uses a read-through cache, and returns which workflows to evaluate.
"""
env_id = environment.id if environment is not None else None
cache_key = processing_workflow_cache_key(detector.id, env_id)
Member

if you like barely justified abstractions, we have a CacheAccess[T] thing.
The idea is that you define a subclass like

class _ProcessingWorkflowCacheAccess(CacheAccess[set[Workflow]]):
    def __init__(self, ..., ttl=DEFAULT_TTL) -> None:
        # verify params, save key
    def key(self) -> str:
        return self._key

...
cache_access = _ProcessingWorkflowCacheAccess(detector, environment)

workflows = cache_access.get()
..

cache_access.set(workflows)

Not game changing, but this came after we used the wrong key in one place and had some wrong type assumptions about cached values, so it seemed worth trying an abstraction that ensures consistent key use and type safety.

(it doesn't have delete, but it should).

Contributor Author
@saponifi3d saponifi3d Jan 29, 2026

👍 -- i like it. i was thinking of something similar tbh 🤣 i always fear text based keys.

from sentry.workflow_engine.models import Detector


@receiver(post_save, sender=Detector)
Contributor

Q: why did we end up going with post_save signals on detector? Is it because of the lack of SOPA?

Contributor Author

we ended up needing to use receivers, because we aren't modifying these models only through the API / validators as initially thought (😢). to make this easier for now (and to make sure we invalidate at any point needed), we decided to use the receivers.

if we end up pulling workflow_engine into its own full service, then we can simplify to only using the API validators. (that would help perf and simplify the logic flows)

from sentry.workflow_engine.models import DetectorWorkflow


@receiver(post_migrate, sender=DetectorWorkflow)
Contributor

Q: why are these signals on post_migrate and pre_save for invalidation?

Contributor Author

if we run a migration that affects these models, then we need to clear the cache. it likely mutated relationships, and pre_save is not triggered by a migration.

other pro-tip: these receivers run on every test run, so it seems like i might not want to have this after all, just from a CI slowdown perspective. 🤔

Comment on lines 26 to 27
if detector_id is None:
detector_id = "*"
Contributor

Q: This part is prob still in progress, but I'm confused on why we're putting a wildcard here, since it doesn't look like we ever set one in the initial cache population? Also we should prob add a comment or separate out the function so that both detector_id + env_id being None = clear everything.

I guess is it possible only one of the params would be None, and if so when would that happen?

Contributor Author

These wildcards allow us to invalidate the cache in different ways / at different times.

When we modify a detector, we need to invalidate all the environments for that detector id.

When we run a migration that affects the relationship between detectors and workflows, we need to invalidate the entire cache (we don't know which models were affected, but need to invalidate the cache because those models could be wrong)

Then in all the other cases, we know which specific cache is affected and we target it specifically. If you want to see all the cases, they're in sentry/workflow_engine/models/signals/.

Contributor Author

All that said, wildcard invalidation like this is super sus; it's currently not implemented and doesn't work here (lol, i just add a * to the key).

so i'm also trying out a few other approaches right now:

  • an approach that we could keep a list of all these keys in another bit of redis then look them up -- but then we have to manage the cache with something else and then 😵 , caches on caches.
  • also investigating if we could make a namespace for the cache, and a namespace for the detector_id. if we can do that, then we could just say like workflow_cache.clear(f"workflow-cache:{detector_id}") kind of thing.

try:
# This lookup trade-off is okay, because we rarely update these relationships
# Most cases are delete / create new DetectorWorkflow relationships.
old_instance = DetectorWorkflow.objects.get(pk=instance.pk)
Contributor Author

note to self for tomorrow: make sure to include the env_id in the .get lookup here.

Member

still relevant?

Contributor Author

nah, not any more -- surprised it didn't go away. Since this is using the primary key to do the lookup, we shouldn't need any additional filtering.

…he integration test because the caches are properly invalidated now.
@saponifi3d saponifi3d marked this pull request as ready for review February 6, 2026 20:08
@saponifi3d
Contributor Author

saponifi3d commented Feb 6, 2026

Let me know if this is too much to review at once, i couldn't decide if it'd be easier with all the context together or to split the caching and invalidation stuff into separate PRs.

Also, planning on making some higher level abstractions, but need to get another example in to think through. Seems like we could make an ABC like WorkflowEngineCache or something that could encapsulate the metrics, cache access, read-through, etc.

Member
@kcons kcons left a comment

I want to look again at the invalidation path, but all seems reasonable to me, and we have tests and such so no need to block on it.


Args:
detector_id: Detector ID to invalidate (required)
env_id: {int|None} - The environment the workflow is triggered on, if not set,
Member

This is a slightly hard kind of interface to express in python.
I might suggest a class AllEnvs: pass magic value over a string const, but I'm not sure enough it's better to suggest the change.

Contributor Author

🤔 yeah, agree. the env_id interface was rough -- hence the default bit i tried to introduce as a way to get around the fact that None has a different meaning here.

Maybe we should extract it out into its own type in workflow_engine/types.py, or maybe even in the Environment model? then reuse that type everywhere we discuss Environment?
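The `AllEnvs` magic-value suggestion above could look roughly like this; the class, constant, and helper are hypothetical names for illustration, not anything in the PR:

```python
from typing import Union


class AllEnvs:
    """Marker type: invalidate across every environment for a detector."""


ALL_ENVS = AllEnvs()

# An env id, None (the "no environment" entry), or the all-envs marker --
# three cases mypy can now tell apart, unlike an overloaded None.
EnvSelector = Union[int, None, AllEnvs]


def keys_to_invalidate(
    detector_id: int, env: EnvSelector, known_env_ids: set[int]
) -> set[str]:
    if isinstance(env, AllEnvs):
        ids: set[object] = {None, *known_env_ids}
    else:
        ids = {env}
    return {f"processing-workflows:{detector_id}:{e}" for e in ids}
```

The payoff is that `invalidate(detector_id, None)` and "invalidate everything" can no longer be confused at a call site.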

try:
# This lookup trade-off is okay, because we rarely update these relationships
# Most cases are delete / create new DetectorWorkflow relationships.
old_instance = DetectorWorkflow.objects.get(pk=instance.pk)
Member

still relevant?

CACHE_TTL = 60 # TODO - Increase TTL once we confirm everything
METRIC_PREFIX = "workflow_engine.cache.processing_workflow"

DEFAULT_VALUE: Literal["default"] = "default"
Member

probably worth a one-liner.

Contributor
@Christinarlong Christinarlong left a comment

dizzying qs

Comment on lines +181 to +182
global_by_detector: dict[int, set[Workflow]] = {d_id: set() for d_id in detector_ids}
env_by_detector: dict[int, set[Workflow]] = {d_id: set() for d_id in detector_ids}
Contributor

nit: the convention is to not have inline typing unless needed since mypy can generally infer the type w/o

Contributor Author
@saponifi3d saponifi3d Feb 10, 2026

🤔 not sure i agree with this one tbh, while having inline types might be a little slower for mypy, it also gives things like type completion and knowledge that all the values will be a workflow -- fwiw, mypy did not correctly infer these types.
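A small self-contained illustration of the point: when a comprehension only ever produces empty sets, there is nothing for mypy to infer the element type from, so the inline annotation is what documents and checks it. `Workflow` here is a stand-in class:

```python
class Workflow:  # stand-in for sentry.workflow_engine.models.Workflow
    pass


detector_ids = [1, 2, 3]

# Without the annotation, mypy infers something like dict[int, set[Never]]
# for the empty-set comprehension, so later .add(workflow) calls aren't
# checked against Workflow and editors get no completion on the values.
global_by_detector: dict[int, set[Workflow]] = {d_id: set() for d_id in detector_ids}
global_by_detector[1].add(Workflow())
```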

env_result = _check_caches_for_detectors(detectors, env_id)
workflows |= env_result.cached_workflows

missed_detector_ids = set(global_result.missed_detector_ids)
Contributor

should we be making missed_detector_ids a set in the dataclass?

Contributor Author

planning to refactor this pretty heavily tbh, now that we have a couple examples of these caches we can make an abstraction to handle all this. (said abstraction exists on another branch of mine)

workflows = _get_associated_workflows(event_detectors.detectors, environment)

if workflows:
metrics_incr("process_workflows", len(workflows))
Contributor

what's the purpose of this metric?

Contributor Author

It's used to track how many workflows are being processed. metrics_incr includes a data context to grab things like detector type etc. and automatically decorate it too. So in the end we could use this to filter and see all the workflows being evaluated for metric issues, for example. (jfyi, this is an existing metric, just moved to a shared part of the code)

@saponifi3d saponifi3d merged commit eb48846 into master Feb 10, 2026
88 checks passed
@saponifi3d saponifi3d deleted the jcallender/aci/cache-workflows branch February 10, 2026 19:28
jaydgoss pushed a commit that referenced this pull request Feb 12, 2026
…06925)


Labels

Scope: Backend

3 participants