
fix(workflow_engine): Add a cache for Workflows to reduce DB load#106925

Merged
saponifi3d merged 35 commits into master from jcallender/aci/cache-workflows
Feb 10, 2026

Conversation

@saponifi3d
Contributor

Description

We select workflows from the DB very frequently. This has added substantial load to our DB, even though the query is very fast / efficient.

This PR introduces a caching layer for this high frequency db query.

.distinct()
)

cache.set(cache_key, workflows, timeout=CACHE_TTL)
Member

I'm curious what our observability of caching is here.
I know in traces one type of cache (django? is this django cache or only sometimes?) doesn't show up, and that's been a bit of a pain for debugging.

Also, it'd be nice if we could have counters for hit/miss so we can brag about how many queries we're avoiding.

Contributor Author
@saponifi3d saponifi3d Jan 29, 2026

yeah, i kinda purposefully was avoiding obs / counters thus far. 😅

did you have any specific obs in mind? i'm thinking a metric for cache hit / miss / invalidation.

🤔 maybe debug logs for cache miss and when we invalidate? (thinking a stack trace might be handy with signals. could at least see which models are causing invalidations etc)

This method uses a read-through cache, and returns which workflows to evaluate.
"""
env_id = environment.id if environment is not None else None
cache_key = processing_workflow_cache_key(detector.id, env_id)
Member

if you like barely justified abstractions, we have a CacheAccess[T] thing.
The idea is that you define a subclass like

class _ProcessingWorkflowCacheAccess(CacheAccess[set[Workflow]]):
    def __init__(self, ..., ttl=DEFAULT_TTL) -> None:
        # verify params, save key
    def key(self) -> str:
        return self._key

...
cache_access = _ProcessingWorkflowCacheAccess(detector, environment)

workflows = cache_access.get()
..

cache_access.set(workflows)

Not game changing, but this came after we used the wrong key in one place and had some wrong type assumptions about cached values, so it seemed worth trying an abstraction that ensures consistent key use and type safety.

(it doesn't have delete, but it should).

Contributor Author
@saponifi3d saponifi3d Jan 29, 2026

👍 -- i like it. i was thinking of something similar tbh 🤣 i always fear text based keys.

from sentry.workflow_engine.models import Detector


@receiver(post_save, sender=Detector)
Contributor

Q: why did we end up going with post_save signals on detector? Is it because of the lack of SOPA?

Contributor Author

we ended up needing to use receivers, because we aren't modifying these models only through the API / validators as initially thought (😢). to make this easier for now (and to make sure we invalidate at any point needed), we decided to use the receivers.

if we end up pulling workflow_engine into its own full service, then we can simplify to only using the API validators. (that would help perf and simplify the logic flows)

from sentry.workflow_engine.models import DetectorWorkflow


@receiver(post_migrate, sender=DetectorWorkflow)
Contributor

Q: why are these signals on post_migrate and pre_save for invalidation?

Contributor Author

if we run a migration that affects these models, then we need to clear the cache. it likely mutated relationships, and pre_save is not triggered by a migration.

other pro-tip: these receivers run on every test run, so it seems like i might not want to have this after all, just from a CI slowdown perspective. 🤔

Comment on lines 26 to 27
if detector_id is None:
detector_id = "*"
Contributor

Q: This part is prob still in progress, but I'm confused on why we're putting a wildcard here, since it doesn't look like we ever set one in the initial cache population? Also we should prob add a comment or separate out the function so that both detector_id + env_id being None = clear everything.

I guess is it possible only one of the params would be None, and if so when would that happen?

Contributor Author

These wildcards allow us to invalidate the cache in different ways / at different times.

When we modify a detector, we need to invalidate all the environments for that detector id.

When we run a migration that affects the relationship between detectors and workflows, we need to invalidate the entire cache (we don't know which models were affected, but need to invalidate the cache because those models could be wrong)

Then in all the other cases, we know which specific cache is affected and we target it specifically. If you want to see all the cases, they're in sentry/workflow_engine/models/signals/.

Contributor Author

All that said, wildcard invalidation like this is super sus; it's currently not implemented and doesn't work here (lol, i just add a * to the key).

so i'm also trying out a few other approaches right now:

  • an approach that we could keep a list of all these keys in another bit of redis then look them up -- but then we have to manage the cache with something else and then 😵 , caches on caches.
  • also investigating if we could make a namespace for the cache, and a namespace for the detector_id. if we can do that, then we could just say like workflow_cache.clear(f"workflow-cache:{detector_id}") kind of thing.

try:
# This lookup trade-off is okay, because we rarely update these relationships
# Most cases are delete / create new DetectorWorkflow relationships.
old_instance = DetectorWorkflow.objects.get(pk=instance.pk)
Contributor Author

note to self for tomorrow: make sure to include the env_id in the .get lookup here.

Member

still relevant?

Contributor Author

nah, not any more -- surprised it didn't go away. Since this is using the primary key to do the lookup, we shouldn't need any additional filtering.

…he integration test because the caches are properly invalidated now.
@saponifi3d saponifi3d marked this pull request as ready for review February 6, 2026 20:08
@saponifi3d
Contributor Author

saponifi3d commented Feb 6, 2026

Let me know if this is too much to review at once, i couldn't decide if it'd be easier with all the context together or to split the caching and invalidation stuff into separate PRs.

Also, planning on making some higher level abstractions, but need to get another example in to think through. Seems like we could make an ABC like WorkflowEngineCache or something that could encapsulate the metrics, cache access, read-through, etc.

Member
@kcons kcons left a comment

I want to look again at the invalidation path, but all seems reasonable to me, and we have tests and such so no need to block on it.


Args:
detector_id: Detector ID to invalidate (required)
env_id: {int|None} - The environment the workflow is triggered on, if not set,
Member

This is a slightly hard kind of interface to express in python.
I might suggest a class AllEnvs: pass magic value over a string const, but I'm not sure enough it's better to suggest the change.

Contributor Author

🤔 yeah, agree. the env_id interface was rough -- hence the default bit i tried to introduce as a way to get around the fact that None has a different meaning here.

Maybe we should extract it out into its own type in workflow_engine/types.py, or maybe even in the Environment model? then reuse that type everywhere we discuss Environment?
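The `AllEnvs` magic-value suggestion above could look roughly like this; the class, constant, and helper are hypothetical names for illustration, not anything in the PR:

```python
from typing import Union


class AllEnvs:
    """Marker type: invalidate across every environment for a detector."""


ALL_ENVS = AllEnvs()

# An env id, None (the "no environment" entry), or the all-envs marker --
# three cases mypy can now tell apart, unlike an overloaded None.
EnvSelector = Union[int, None, AllEnvs]


def keys_to_invalidate(
    detector_id: int, env: EnvSelector, known_env_ids: set[int]
) -> set[str]:
    if isinstance(env, AllEnvs):
        ids: set[object] = {None, *known_env_ids}
    else:
        ids = {env}
    return {f"processing-workflows:{detector_id}:{e}" for e in ids}
```

The payoff is that `invalidate(detector_id, None)` and "invalidate everything" can no longer be confused at a call site.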

try:
# This lookup trade-off is okay, because we rarely update these relationships
# Most cases are delete / create new DetectorWorkflow relationships.
old_instance = DetectorWorkflow.objects.get(pk=instance.pk)
Member

still relevant?

CACHE_TTL = 60 # TODO - Increase TTL once we confirm everything
METRIC_PREFIX = "workflow_engine.cache.processing_workflow"

DEFAULT_VALUE: Literal["default"] = "default"
Member

probably worth a one-liner.

Contributor
@Christinarlong Christinarlong left a comment

dizzying qs

Comment on lines +181 to +182
global_by_detector: dict[int, set[Workflow]] = {d_id: set() for d_id in detector_ids}
env_by_detector: dict[int, set[Workflow]] = {d_id: set() for d_id in detector_ids}
Contributor

nit: the convention is to not have inline typing unless needed since mypy can generally infer the type w/o

Contributor Author
@saponifi3d saponifi3d Feb 10, 2026

🤔 not sure i agree with this one tbh, while having inline types might be a little slower for mypy, it also gives things like type completion and knowledge that all the values will be a workflow -- fwiw, mypy did not correctly infer these types.
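A small self-contained illustration of the point: when a comprehension only ever produces empty sets, there is nothing for mypy to infer the element type from, so the inline annotation is what documents and checks it. `Workflow` here is a stand-in class:

```python
class Workflow:  # stand-in for sentry.workflow_engine.models.Workflow
    pass


detector_ids = [1, 2, 3]

# Without the annotation, mypy infers something like dict[int, set[Never]]
# for the empty-set comprehension, so later .add(workflow) calls aren't
# checked against Workflow and editors get no completion on the values.
global_by_detector: dict[int, set[Workflow]] = {d_id: set() for d_id in detector_ids}
global_by_detector[1].add(Workflow())
```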

env_result = _check_caches_for_detectors(detectors, env_id)
workflows |= env_result.cached_workflows

missed_detector_ids = set(global_result.missed_detector_ids)
Contributor

should we be making missed_detector_ids a set in the dataclass?

Contributor Author

planning to refactor this pretty heavily tbh, now that we have a couple examples of these caches we can make an abstraction to handle all this. (said abstraction exists on another branch of mine)

workflows = _get_associated_workflows(event_detectors.detectors, environment)

if workflows:
metrics_incr("process_workflows", len(workflows))
Contributor

what's the purpose of this metric?

Contributor Author

It's used to track how many workflows are being processed. metrics_incr includes a data context to grab things like detector type etc. and automatically decorate it too. So in the end we could use this to filter and see all the workflows being evaluated for metric issues, for example. (jfyi, this is an existing metric, just moved to a shared part of the code)

@saponifi3d saponifi3d merged commit eb48846 into master Feb 10, 2026
88 checks passed
@saponifi3d saponifi3d deleted the jcallender/aci/cache-workflows branch February 10, 2026 19:28
jaydgoss pushed a commit that referenced this pull request Feb 12, 2026
…06925)


Labels

Scope: Backend

3 participants