feat(replays): Add memory-efficient project and project option query caches#101606
Conversation
❌ 2 Tests Failed:
View the top 2 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
…n/replays-project-query-cache
There was a problem hiding this comment.
I think these two tests are failing with db access errors because options.get("replay.consumer.enable_new_query_caching_system") in ProcessReplayRecordingStrategyFactory.create_with_partitions() isn't being correctly patched by the existing options_get mock:
tests/sentry/replays/integration/consumers/test_recording.py::test_recording_consumer_invalid_message
tests/sentry/replays/integration/consumers/test_recording.py::test_recording_consumer
You can patch options.get() in the consumer fixture to make sure the tests pass.
Approving b/c I patched it locally to return True and the tests passed!
|
|
||
| # We're intentionally manually looking up the options. We're avoided the project-options local | ||
| # cache which exist on the preferred interface methods. | ||
| options = ProjectOption.objects.filter( |
There was a problem hiding this comment.
This will return false for both options if project doesn't exist. _has_replays_lookup() will raise before we get here, but might be good to add the same raise DropEvent() behavior here to make the expectation that the project exists explicit.
There was a problem hiding this comment.
That's intentional. I'm not sure if an absence of the project-options implies the absence of the project and I don't want to query to find out (since another motivation with this PR is to reduce the amount of times PG bouncer rejects our queries and it would generally be bad for throughput).
In practice, Relay drops events from deleted projects so what we would be catching here are poorly timed deletions. Raising here would prevent an unnecessary publish to the issues platform but its not a big deal if we occasionally push bad data. They validate it regardless.
|
Thank you @srest2021 for the notes on patching the fixture! |
Issues attributed to commits in this pull requestThis pull request was merged and Sentry observed the following issues: |
get_from_cachecaches results in a thread-local. This means eight copies of the same project model (and project-option model) for our current configuration. Additionallyget_from_cacheis very coarse fetching fields we don't need. By caching just the boolean values we care about we can minimize the footprint of each query on our overall memory usage. Since we need to cache a lot of projects and project-options this is important to maintaining stable memory usage.Total memory should be reduced by
O(8n)wherenis the delta between the size of a project model and a boolean and the size of the project-option model and a tuple of booleans.A word on the
AutoCache.AutoCacheis safe but not logically atomic. We defer to last writer wins and potentially duplicate work. This could be improved but we don't expect the results of thefnargument to produce an effect or be non-deterministic. At least for our current case. However, it might be wise to implement better locking behavior inAutoCache.__getitem__so we don't unnecessarily compute a project or project-option query multiple times. This is easy enough to do but I think we've done enough already for this pull so we can address this in a follow-up!A
contextobject is now being passed around the consumer. This is better for testing than using globals. There are more effects that could be moved out of the processing logic and into the context object. This would make our consumer significantly easier to test and require fewer mocks.