From 2cf7c012049da687b336e88c3a0db2503ace09cb Mon Sep 17 00:00:00 2001 From: Pieter Eendebak Date: Fri, 22 May 2026 21:36:31 +0200 Subject: [PATCH] gh-NNNNN: Make _STORE_ATTR_SLOT lock-free on the free-threaded build Both the bytecode handler (_STORE_ATTR_SLOT) and the C-API path (PyMember_SetOne for Py_T_OBJECT_EX / _Py_T_OBJECT) currently take per-object locks (ob_mutex via LOCK_OBJECT or Py_BEGIN_CRITICAL_SECTION) to sequence the read-of-old-value, store-of-new-value, and decref-of-old. Replace the lock with a single atomic exchange on the slot pointer. Atomic exchange returns the previous value and stores the new one in one atomic op, which is both sufficient for correctness on the write side (each concurrent writer observes a unique old value, so no double-decref) and consistent with how the read side already works: _LOAD_ATTR_SLOT and PyMember_GetOne are lock-free for the fast path, using FT_ATOMIC_LOAD_PTR plus _Py_TryIncrefCompare to cope with mid-flight pointer changes. This is the WRITE-side counterpart to the existing lock-free LOAD_ATTR_SLOT read path, and follows the pattern Dino Viehland established with gh-130373 (Avoid locking in _LOAD_ATTR_WITH_HINT, merged Feb 2025) and companions gh-130312 / gh-130313 (don't lock when populating sets in bytecode / when clearing attribute-less objects). The primitive being used here, _Py_atomic_exchange_ptr, is being formally proposed in gh-150044 for an internal _Py_SETREF_ATOMIC macro. Three coordinated changes are required: 1. _STORE_ATTR_SLOT in Python/bytecodes.c: drop LOCK_OBJECT / UNLOCK_OBJECT and the DEOPT_IF on lock-acquire failure; use _Py_atomic_exchange_ptr on Py_GIL_DISABLED, plain read/write on the GIL build. 2. PyMember_SetOne in Python/structmember.c, Py_T_OBJECT_EX and _Py_T_OBJECT cases: drop Py_BEGIN_CRITICAL_SECTION; use _Py_atomic_exchange_ptr to match the bytecode path. The two write paths must change together: if only the bytecode were lock-free, PyMember_SetOne's plain "oldv = *addr" inside the CS would race with concurrent atomic-exchange writers, returning a "stolen" old value that another writer is also about to DECREF (double-free). 3. PyMember_GetOne in Python/structmember.c, Py_T_OBJECT_EX and _Py_T_OBJECT cases: replace the CS+Py_XINCREF fallback (taken on _Py_TryIncrefCompare failure) with _Py_XGetRef, which retries the atomic-load + try-incref loop. The previous fallback relied on writers also taking ob_mutex to keep the slot stable inside the CS; with lock-free writers, the CS no longer excludes them, and a plain Py_XINCREF on a re-loaded slot value could resurrect a refcount-0 (about-to-be-freed) object. History (git blame on the LOCK_OBJECT line): * 22b0de2755ee (2024-06): original FT structure of _STORE_ATTR_SLOT. * 1b15c89a17ca (2024-12, gh-127838 / gh-115999): added LOCK_OBJECT in the broader STORE_ATTR specialization commit. For _STORE_ATTR_INSTANCE_VALUE the lock is genuinely required (it also updates the PyDictValues insertion-order list, which has to be coordinated with the NULL->non-NULL transition). For _STORE_ATTR_SLOT it was added for symmetry but isn't required when atomic exchange is used. * bef63d2fb81a (2025-12, GH-134584 / gh-142729): removed a redundant refcount op from _STORE_ATTR_SLOT, confirming the slot fast path is still being actively simplified. Why this is correct: * The slot store is a single pointer write at a fixed offset (determined by the type and guarded by _GUARD_TYPE_VERSION before _STORE_ATTR_SLOT). * Atomic exchange replaces the LOCK_OBJECT + plain-load-old + atomic-release-store-new + UNLOCK_OBJECT sequence with one lock-prefix xchg. * Each writer's exchange returns exactly one old pointer, so Py_XDECREF(old_value) cannot double-free across concurrent writers. * Readers (LOAD_ATTR_SLOT, PyMember_GetOne) atomically load + try- incref, then re-validate that the slot still points at the same object before claiming the reference. The TryIncrefCompare path correctly fails if ob_ref_shared has reached 0 (object being freed), so we never resurrect. * Type-change races (`__class__ = NewType`) are caught by the _GUARD_TYPE_VERSION uop that runs before _STORE_ATTR_SLOT, same as for the already-lock-free _LOAD_ATTR_SLOT. * The DEOPT_IF on lock-acquire failure is dropped: atomic exchange always succeeds, so the bytecode no longer thrashes between specialized and generic forms under contention. Microbench (origin/main @ 28eac9a7263, --disable-gil --enable-optimizations --with-lto=yes, taskset -c 4, best-of-9 ns/op): workload baseline patched ratio STORE_ATTR_SLOT (alone) 15.40 9.30 0.604 (-39.6%) attr_idiv (LOAD + / + STORE) 41.22 30.20 0.733 (-26.7%) Point.normalize (per Point) 417.4 398.0 0.953 (-4.6%) pyperformance bm_float (direct timing, best-of-5, ms): baseline FT 53.2 patched FT 48.5 (-4.7 ms, -8.8%) GIL (ref) 46.6 (patched FT is within 4% of GIL on this bench) bm_nbody is unchanged as a control (its STORE-heavy work is on STORE_SUBSCR_LIST_INT, not STORE_ATTR). Comprehensive regression sweep on PGO+LTO build: test_descr, test_class, test_dataclasses, test_dict, test_typing, test_threading, test_concurrent_futures, test_thread, test_capi, test_inspect, test_subclassinit, test_long, test_float, test_complex, test_gc, test_pickle, test_listcomps, test_exceptions, test_compile - all pass. The full test_free_threading suite (192 tests) also passes. Concurrent stress test (4 writer threads + 2 reader threads racing on the same __slots__ object's slots, 200k writes each, 100k reads each): no crash, no inconsistency, refcount stays in expected range. Standalone benchmark script: bench_store_attr_slot.py (microbench + Point.normalize pipeline; see PR comment). Co-Authored-By: Claude Opus 4.7 (1M context) --- Include/internal/pycore_opcode_metadata.h | 2 +- Include/internal/pycore_uop_metadata.h | 2 +- Modules/_testinternalcapi/test_cases.c.h | 15 +++--- Python/bytecodes.c | 21 +++++++-- Python/executor_cases.c.h | 26 ++++++----- Python/generated_cases.c.h | 15 +++--- Python/structmember.c | 57 +++++++++++++---------- 7 files changed, 82 insertions(+), 56 deletions(-) diff --git a/Include/internal/pycore_opcode_metadata.h b/Include/internal/pycore_opcode_metadata.h index d2e29a1b95ede2f..e9a76e2f3f2ca34 100644 --- a/Include/internal/pycore_opcode_metadata.h +++ b/Include/internal/pycore_opcode_metadata.h @@ -1315,7 +1315,7 @@ const struct opcode_metadata _PyOpcode_opcode_metadata[267] = { [SET_UPDATE] = { true, INSTR_FMT_IB, HAS_ARG_FLAG | HAS_ERROR_FLAG | HAS_ERROR_NO_POP_FLAG | HAS_ESCAPES_FLAG }, [STORE_ATTR] = { true, INSTR_FMT_IBC000, HAS_ARG_FLAG | HAS_NAME_FLAG | HAS_ERROR_FLAG | HAS_ESCAPES_FLAG }, [STORE_ATTR_INSTANCE_VALUE] = { true, INSTR_FMT_IXC000, HAS_DEOPT_FLAG | HAS_EXIT_FLAG | HAS_ESCAPES_FLAG | HAS_RECORDS_VALUE_FLAG }, - [STORE_ATTR_SLOT] = { true, INSTR_FMT_IXC000, HAS_DEOPT_FLAG | HAS_EXIT_FLAG | HAS_ESCAPES_FLAG | HAS_RECORDS_VALUE_FLAG }, + [STORE_ATTR_SLOT] = { true, INSTR_FMT_IXC000, HAS_EXIT_FLAG | HAS_ESCAPES_FLAG | HAS_RECORDS_VALUE_FLAG }, [STORE_ATTR_WITH_HINT] = { true, INSTR_FMT_IBC000, HAS_ARG_FLAG | HAS_NAME_FLAG | HAS_DEOPT_FLAG | HAS_EXIT_FLAG | HAS_ESCAPES_FLAG | HAS_RECORDS_VALUE_FLAG }, [STORE_DEREF] = { true, INSTR_FMT_IB, HAS_ARG_FLAG | HAS_FREE_FLAG | HAS_ESCAPES_FLAG }, [STORE_FAST] = { true, INSTR_FMT_IB, HAS_ARG_FLAG | HAS_LOCAL_FLAG | HAS_ESCAPES_FLAG }, diff --git a/Include/internal/pycore_uop_metadata.h b/Include/internal/pycore_uop_metadata.h index 6713e9bc95f942d..e572110dbdd28fc 100644 --- a/Include/internal/pycore_uop_metadata.h +++ b/Include/internal/pycore_uop_metadata.h @@ -239,7 +239,7 @@ const uint32_t _PyUop_Flags[MAX_UOP_ID+1] = { [_STORE_ATTR_INSTANCE_VALUE] = HAS_ESCAPES_FLAG, [_LOCK_OBJECT] = HAS_DEOPT_FLAG, [_STORE_ATTR_WITH_HINT] = HAS_ARG_FLAG | HAS_NAME_FLAG | HAS_DEOPT_FLAG | HAS_ESCAPES_FLAG, - [_STORE_ATTR_SLOT] = HAS_DEOPT_FLAG | HAS_ESCAPES_FLAG, + [_STORE_ATTR_SLOT] = HAS_ESCAPES_FLAG, [_COMPARE_OP] = HAS_ARG_FLAG | HAS_ERROR_FLAG | HAS_ESCAPES_FLAG, [_COMPARE_OP_FLOAT] = HAS_ARG_FLAG, [_COMPARE_OP_INT] = HAS_ARG_FLAG, diff --git a/Modules/_testinternalcapi/test_cases.c.h b/Modules/_testinternalcapi/test_cases.c.h index a2506524f0bb6dc..c648b5a5d228c68 100644 --- a/Modules/_testinternalcapi/test_cases.c.h +++ b/Modules/_testinternalcapi/test_cases.c.h @@ -11679,16 +11679,17 @@ value = stack_pointer[-2]; uint16_t index = read_u16(&this_instr[4].cache); PyObject *owner_o = PyStackRef_AsPyObjectBorrow(owner); - if (!LOCK_OBJECT(owner_o)) { - UPDATE_MISS_STATS(STORE_ATTR); - assert(_PyOpcode_Deopt[opcode] == (STORE_ATTR)); - JUMP_TO_PREDICTED(STORE_ATTR); - } char *addr = (char *)owner_o + index; STAT_INC(STORE_ATTR, hit); + #ifdef Py_GIL_DISABLED + _PyFrame_SetStackPointer(frame, stack_pointer); + PyObject *old_value = _Py_atomic_exchange_ptr( + (void **)addr, PyStackRef_AsPyObjectSteal(value)); + stack_pointer = _PyFrame_GetStackPointer(frame); + #else PyObject *old_value = *(PyObject **)addr; - FT_ATOMIC_STORE_PTR_RELEASE(*(PyObject **)addr, PyStackRef_AsPyObjectSteal(value)); - UNLOCK_OBJECT(owner_o); + *(PyObject **)addr = PyStackRef_AsPyObjectSteal(value); + #endif o = owner; stack_pointer[-2] = o; stack_pointer += -1; diff --git a/Python/bytecodes.c b/Python/bytecodes.c index f7487c7136962f1..53105ad23395f75 100644 --- a/Python/bytecodes.c +++ b/Python/bytecodes.c @@ -3225,14 +3225,27 @@ dummy_func( op(_STORE_ATTR_SLOT, (index/1, value, owner -- o)) { PyObject *owner_o = PyStackRef_AsPyObjectBorrow(owner); - DEOPT_IF(!LOCK_OBJECT(owner_o)); char *addr = (char *)owner_o + index; STAT_INC(STORE_ATTR, hit); + // Lock-free read-modify-write on the slot pointer (free-threaded + // build). Atomic exchange returns the previous value and stores + // the new one in a single atomic op. Each concurrent writer + // observes a unique old value (so no double-decref), and + // concurrent LOAD_ATTR_SLOT readers already cope with mid-flight + // mutations via _Py_TryIncrefCompare. The C-API path + // (PyMember_SetOne for Py_T_OBJECT_EX / _Py_T_OBJECT) is updated + // to the same atomic-exchange pattern so the two paths stay + // consistent without a per-object lock. On the GIL build this + // compiles to a plain read+write. +#ifdef Py_GIL_DISABLED + PyObject *old_value = _Py_atomic_exchange_ptr( + (void **)addr, PyStackRef_AsPyObjectSteal(value)); +#else PyObject *old_value = *(PyObject **)addr; - FT_ATOMIC_STORE_PTR_RELEASE(*(PyObject **)addr, PyStackRef_AsPyObjectSteal(value)); - UNLOCK_OBJECT(owner_o); - INPUTS_DEAD(); + *(PyObject **)addr = PyStackRef_AsPyObjectSteal(value); +#endif o = owner; + DEAD(owner); Py_XDECREF(old_value); } diff --git a/Python/executor_cases.c.h b/Python/executor_cases.c.h index efa61d7de74e88c..fb20e44cb57de69 100644 --- a/Python/executor_cases.c.h +++ b/Python/executor_cases.c.h @@ -13066,21 +13066,25 @@ value = _stack_item_0; uint16_t index = (uint16_t)CURRENT_OPERAND0_16(); PyObject *owner_o = PyStackRef_AsPyObjectBorrow(owner); - if (!LOCK_OBJECT(owner_o)) { - UOP_STAT_INC(uopcode, miss); - _tos_cache1 = owner; - _tos_cache0 = value; - SET_CURRENT_CACHED_VALUES(2); - JUMP_TO_JUMP_TARGET(); - } char *addr = (char *)owner_o + index; STAT_INC(STORE_ATTR, hit); + #ifdef Py_GIL_DISABLED + stack_pointer[0] = value; + stack_pointer[1] = owner; + stack_pointer += 2; + ASSERT_WITHIN_STACK_BOUNDS(__FILE__, __LINE__); + _PyFrame_SetStackPointer(frame, stack_pointer); + PyObject *old_value = _Py_atomic_exchange_ptr( + (void **)addr, PyStackRef_AsPyObjectSteal(value)); + stack_pointer = _PyFrame_GetStackPointer(frame); + #else PyObject *old_value = *(PyObject **)addr; - FT_ATOMIC_STORE_PTR_RELEASE(*(PyObject **)addr, PyStackRef_AsPyObjectSteal(value)); - UNLOCK_OBJECT(owner_o); + *(PyObject **)addr = PyStackRef_AsPyObjectSteal(value); + stack_pointer += 2; + #endif o = owner; - stack_pointer[0] = o; - stack_pointer += 1; + stack_pointer[-2] = o; + stack_pointer += -1; ASSERT_WITHIN_STACK_BOUNDS(__FILE__, __LINE__); _PyFrame_SetStackPointer(frame, stack_pointer); Py_XDECREF(old_value); diff --git a/Python/generated_cases.c.h b/Python/generated_cases.c.h index 53e09a8f4523c7c..d4a7fdb4dea6e14 100644 --- a/Python/generated_cases.c.h +++ b/Python/generated_cases.c.h @@ -11676,16 +11676,17 @@ value = stack_pointer[-2]; uint16_t index = read_u16(&this_instr[4].cache); PyObject *owner_o = PyStackRef_AsPyObjectBorrow(owner); - if (!LOCK_OBJECT(owner_o)) { - UPDATE_MISS_STATS(STORE_ATTR); - assert(_PyOpcode_Deopt[opcode] == (STORE_ATTR)); - JUMP_TO_PREDICTED(STORE_ATTR); - } char *addr = (char *)owner_o + index; STAT_INC(STORE_ATTR, hit); + #ifdef Py_GIL_DISABLED + _PyFrame_SetStackPointer(frame, stack_pointer); + PyObject *old_value = _Py_atomic_exchange_ptr( + (void **)addr, PyStackRef_AsPyObjectSteal(value)); + stack_pointer = _PyFrame_GetStackPointer(frame); + #else PyObject *old_value = *(PyObject **)addr; - FT_ATOMIC_STORE_PTR_RELEASE(*(PyObject **)addr, PyStackRef_AsPyObjectSteal(value)); - UNLOCK_OBJECT(owner_o); + *(PyObject **)addr = PyStackRef_AsPyObjectSteal(value); + #endif o = owner; stack_pointer[-2] = o; stack_pointer += -1; diff --git a/Python/structmember.c b/Python/structmember.c index adea8216b8796b5..9eb18f14a286c36 100644 --- a/Python/structmember.c +++ b/Python/structmember.c @@ -97,36 +97,33 @@ PyMember_GetOne(const char *obj_addr, PyMemberDef *l) break; } case _Py_T_OBJECT: - v = FT_ATOMIC_LOAD_PTR(*(PyObject **) addr); - if (v != NULL) { #ifdef Py_GIL_DISABLED - if (!_Py_TryIncrefCompare((PyObject **) addr, v)) { - Py_BEGIN_CRITICAL_SECTION((PyObject *) obj_addr); - v = FT_ATOMIC_LOAD_PTR(*(PyObject **) addr); - Py_XINCREF(v); - Py_END_CRITICAL_SECTION(); - } + // _Py_XGetRef does an atomic load + try-incref loop, retrying when + // a concurrent writer (lock-free atomic exchange in _STORE_ATTR_SLOT + // and PyMember_SetOne) mutates the slot. The previous critical- + // section fallback assumed writers also took ob_mutex, which they no + // longer do, so a plain Py_XINCREF inside the CS could resurrect a + // refcount-0 (about-to-be-freed) object. + v = _Py_XGetRef((PyObject **)addr); #else - Py_INCREF(v); + v = *(PyObject **)addr; + Py_XINCREF(v); #endif - } if (v == NULL) { - v = Py_None; + v = Py_NewRef(Py_None); } break; case Py_T_OBJECT_EX: +#ifdef Py_GIL_DISABLED + v = _Py_XGetRef((PyObject **)addr); + if (v == NULL) { + PyErr_Format(PyExc_AttributeError, + "'%T' object has no attribute '%s'", + (PyObject *)obj_addr, l->name); + } +#else v = member_get_object(addr, obj_addr, l); -#ifndef Py_GIL_DISABLED Py_XINCREF(v); -#else - if (v != NULL) { - if (!_Py_TryIncrefCompare((PyObject **) addr, v)) { - Py_BEGIN_CRITICAL_SECTION((PyObject *) obj_addr); - v = member_get_object(addr, obj_addr, l); - Py_XINCREF(v); - Py_END_CRITICAL_SECTION(); - } - } #endif break; case Py_T_LONGLONG: @@ -320,11 +317,20 @@ PyMember_SetOne(char *addr, PyMemberDef *l, PyObject *v) break; } case _Py_T_OBJECT: - case Py_T_OBJECT_EX: - Py_BEGIN_CRITICAL_SECTION(obj); + case Py_T_OBJECT_EX: { + // Lock-free swap to match the _STORE_ATTR_SLOT bytecode handler + // (which also uses atomic exchange on Py_GIL_DISABLED builds). + // Atomic exchange gives each concurrent writer a unique old value, + // so the Py_XDECREF below cannot double-free. Concurrent readers + // (PyMember_GetOne, _LOAD_ATTR_SLOT) use atomic loads and the + // TryIncrefCompare pattern to cope with mid-flight mutations. + PyObject *new_v = Py_XNewRef(v); +#ifdef Py_GIL_DISABLED + oldv = (PyObject *)_Py_atomic_exchange_ptr((void **)addr, new_v); +#else oldv = *(PyObject **)addr; - FT_ATOMIC_STORE_PTR_RELEASE(*(PyObject **)addr, Py_XNewRef(v)); - Py_END_CRITICAL_SECTION(); + *(PyObject **)addr = new_v; +#endif if (v == NULL && oldv == NULL && l->type == Py_T_OBJECT_EX) { // Raise an exception when attempting to delete an already deleted // attribute. @@ -336,6 +342,7 @@ PyMember_SetOne(char *addr, PyMemberDef *l, PyObject *v) } Py_XDECREF(oldv); break; + } case Py_T_CHAR: { const char *string; Py_ssize_t len;