fix(cache): Ensure per-thread pymemcache clients in ReconnectingMemcache (#111545)

Merged: gricha merged 2 commits into master from gricha/fix-memcache-thread-safety on Mar 25, 2026
Conversation

@gricha (Member) commented Mar 25, 2026

Django's CacheHandler stores connections in a contextvar (asgiref.local._CVar). When contextvars.copy_context() copies these across threads — as ContextPropagatingThreadPoolExecutor does — multiple threads share a single pymemcache.HashClient instance. HashClient is not thread-safe: _retry_dead() races on a shared _dead_clients dict, causing KeyError crashes when a twemproxy node flaps.
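The check-then-delete race is easy to reproduce in isolation. The sketch below is not pymemcache's actual `_retry_dead()` code, just a minimal stand-in for the same pattern, with a barrier forcing the unlucky interleaving deterministically:

```python
import threading

# Hypothetical stand-in for HashClient._dead_clients: a plain dict
# mutated by multiple threads without a lock.
dead_clients = {"node1": "client"}
errors = []
barrier = threading.Barrier(2)

def retry_dead():
    # Both threads pass the membership check before either deletes.
    if "node1" in dead_clients:
        barrier.wait()  # force the racy interleaving deterministically
        try:
            del dead_clients["node1"]
        except KeyError as exc:
            errors.append(exc)

threads = [threading.Thread(target=retry_dead) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one thread wins the delete; the other crashes with KeyError.
assert len(errors) == 1
```

In the real client the interleaving is only hit when a node flaps at the wrong moment, which is why the crash appeared intermittently rather than on every request.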

This was surfaced by the ThreadPoolExecutor → ContextPropagatingThreadPoolExecutor migration in #111464 (reverted in 7e509b5), which caused consumer worker threads to inherit the parent's cache connection instead of creating their own. The same race also affects web workers (SENTRY-5MFX, SENTRY-5E8T) via granian's thread pool.

The fix stores the backend in a contextvar with a thread ID check. When a copied context is accessed from a different thread, a new client is created. This preserves contextvar semantics (async-safe) while restoring per-thread isolation for HashClient.
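A sketch of that approach, with illustrative names rather than Sentry's actual code: the client lives in a `ContextVar` next to the thread ID that created it, and a mismatch on access triggers a rebuild.

```python
import contextvars
import threading
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class _BackendState:
    thread_id: int
    client: Any

class PerThreadBackend:
    """Illustrative wrapper: contextvar storage plus a thread ID check."""

    def __init__(self, client_factory: Callable[[], Any]) -> None:
        self._client_factory = client_factory
        self._backend_var: contextvars.ContextVar = contextvars.ContextVar(
            "memcache_backend", default=None
        )

    @property
    def _cache(self) -> Any:
        tid = threading.get_ident()
        state = self._backend_var.get()
        if state is None or state.thread_id != tid:
            # First access in this context, or a copied context accessed
            # from a different thread: build a fresh, thread-private client.
            state = _BackendState(thread_id=tid, client=self._client_factory())
            self._backend_var.set(state)
        return state.client
```

Repeated access within one thread reuses the stored client; a context copied into another thread fails the thread ID check and gets its own.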

Fixes SENTRY-5H8T

Agent transcript: https://claudescope.sentry.dev/share/DUqlzydq70wt3Np4p2CKKgDFZAM251ooNsKQg4RIXiY

@gricha gricha requested a review from a team as a code owner March 25, 2026 18:41
@github-actions github-actions bot added the "Scope: Backend" label (automatically applied to PRs that change backend components) on Mar 25, 2026
@gricha gricha requested a review from markstory March 25, 2026 18:43
@cursor cursor bot (Contributor) left a comment
Cursor Bugbot has reviewed your changes and found 1 potential issue.


created_at=time.time(),
client=client,
)
self._backend_var.set(state)

Per-task client creation instead of per-thread reuse

Medium Severity

With ContextPropagatingThreadPoolExecutor, each submit() call creates a fresh context copy via copy_context(). When the worker thread accesses _cache, it detects state.thread_id != current_tid, creates a new HashClient, and stores it in the ephemeral copied context. When the next task runs on the same worker thread, it gets a new context copy (again with the parent's state), detects the thread mismatch again, and creates yet another client. The previous client is never explicitly closed — it relies on GC to clean up socket connections. This yields per-task client creation rather than the intended per-thread reuse, causing significant connection churn under load.

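The churn Bugbot describes can be demonstrated without pymemcache. A minimal sketch, assuming only that the executor wraps each task in `contextvars.copy_context()` the way ContextPropagatingThreadPoolExecutor does: the worker's `var.set()` lands in a per-task context copy, so every task on the same thread rebuilds the state.

```python
import contextvars
import threading
from concurrent.futures import ThreadPoolExecutor

var: contextvars.ContextVar = contextvars.ContextVar("client", default=None)
creations: list = []

def task() -> None:
    if var.get() is None:  # the copied context never carries worker state
        creations.append(threading.get_ident())
        var.set(object())  # stored only in this task's ephemeral context copy

with ThreadPoolExecutor(max_workers=1) as executor:
    for _ in range(3):
        # One fresh copy of the parent context per submit(), as a
        # context-propagating executor would do.
        ctx = contextvars.copy_context()
        executor.submit(ctx.run, task).result()

# Three tasks on one worker thread, yet the "client" was rebuilt each time.
assert len(creations) == 3
assert len(set(creations)) == 1
```

A thread-keyed cache outside the contextvar (for example a `threading.local`, or a dict keyed by thread ID) would restore per-thread reuse while keeping the per-thread isolation this PR introduces.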

@markstory markstory (Member) left a comment
Makes sense to me. While the cache adapter is stored in a thread-local, copying the context made what used to be thread-isolated values shared, which introduced a thread race. Another layer of thread-locality that checks thread identity will solve that.

gricha added 2 commits March 25, 2026 12:39
Django's CacheHandler stores connections in a contextvar. When
contextvars are copied across threads (e.g. via copy_context() in
ContextPropagatingThreadPoolExecutor), multiple threads end up sharing
a single pymemcache HashClient instance. HashClient is not thread-safe
— _retry_dead() races on a shared dict, causing KeyError crashes when
a twemproxy node flaps.

Fix by storing the backend in a contextvar with a thread ID check.
When a copied context is accessed from a different thread, a fresh
client is created instead of reusing the parent's.

Fixes SENTRY-5H8T

Agent transcript: https://claudescope.sentry.dev/share/8M_nAJwoRwoO5SzOBr1QIGwraQS8Z4M31i0MtYYHnrc
@gricha gricha force-pushed the gricha/fix-memcache-thread-safety branch from 9be43b6 to 16befd9 on March 25, 2026 19:39
@gricha gricha merged commit 3df942b into master Mar 25, 2026
63 of 64 checks passed
@gricha gricha deleted the gricha/fix-memcache-thread-safety branch March 25, 2026 20:35
gricha added a commit that referenced this pull request Mar 25, 2026
…PropagatingThreadPoolExecutor

Re-applies the monitor consumer portion of #111464 (reverted in 7e509b5).
The original revert was due to a pymemcache HashClient thread-safety
issue triggered by contextvar-based cache connection sharing, which is
fixed by #111545.
gricha added a commit that referenced this pull request Mar 25, 2026
…tor consumer (#111568)

Re-applies the monitor consumer portion of #111464 (reverted in
7e509b5).

The revert was caused by a `pymemcache.HashClient._retry_dead()` race
condition — Django's `CacheHandler` stores connections in a contextvar,
and `copy_context()` caused all worker threads to share a single
non-thread-safe `HashClient`. This is fixed by #111545, which gives each
thread its own client.

Depends on #111545.


Agent transcript:
https://claudescope.sentry.dev/share/bXndRR4KT3FDez_xnzf1nwcoicOkmF93CYhAl-bcoKY
gricha added a commit that referenced this pull request Mar 26, 2026
…le S016 (#111569)

Re-applies the non-monitor-consumer portion of #111464 (reverted in
7e509b5): occurrence consumer, result consumer, and S016 lint rule
re-enablement.

Depends on #111545 and #111568.


Agent transcript:
https://claudescope.sentry.dev/share/Jgc7K9L3nJgpb6s4imUcdeHJ3iJNAmxOcsHq09hQW4M