bug: sl clone segfaults in always-on sampling profiler (deferred PyCodeObject deref race, not covered by a82fe9f9804) #1320

@JamBalaya56562

🐛 Bug Report

Sapling version: 0.2.20260522-084851 (Homebrew bottle, commit 1e764c94)
OS: Amazon Linux 2023 (Linux 6.12.88, x86_64)
Install method: brew install sapling (Linuxbrew, Python 3.12.13_2 from Homebrew)
Mode: git-backed repos (sl clone <github URL>)


Summary

sl clone <github URL> reliably crashes with SIGSEGV (exit 139, or exit 255 via chg) after the pull completes but before the working copy is updated. The pull-completed metadata is left behind in .sl/, but no files are checked out.

The crash is in the always-on sampling profiler background thread, specifically in sapling_cext_evalframe_resolve_code_object, called from profiler_loop (not from the signal handler).

The root cause is closely related to the segfault that the existing eden/scm/lib/sampling-profiler/munmap-segv-example/ (commit 09fa42845cb) was added to reproduce, but the fix a82fe9f9804 "backtrace-python: avoid unsafe (segfault) interp frame read" only addresses the signal-handler-side frame read. The profiler-loop-side deferred dereference of the PyCodeObject* (captured from the C stack during the signal) is still racy, and PyCodeObject->co_filename / ->co_name can be read after the object is freed.

Reproduction

100% reproducible on this machine:

$ sl clone https://github.com/Marukome0743/raspi-signage.git
From https://github.com/Marukome0743/raspi-signage
 * [new ref]         1b6ae7c2a0e26ac30efdbc39b109d510bff93a82 -> remote/main
chg: abort: cannot communicate (errno = 32, Broken pipe)
$ echo $?
255

# Without chg, the segfault surfaces directly:
$ CHGDISABLE=1 sl clone https://github.com/Marukome0743/raspi-signage.git
From https://github.com/Marukome0743/raspi-signage
 * [new ref]         1b6ae7c2a0e26ac30efdbc39b109d510bff93a82 -> remote/main
Segmentation fault (core dumped)
$ echo $?
139
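
For reference, the two exit codes above decode differently: a shell reports a process killed by signal N as status 128 + N, so 139 corresponds to signal 11 (SIGSEGV on Linux), whereas 255 appears to be an ordinary error status from chg rather than a signal. A minimal C sketch of that arithmetic follows; `shell_status_to_signal` and `MAX_LINUX_SIGNAL` are hypothetical names chosen for illustration, and the 1..64 signal range is an assumption about Linux.

```c
#include <assert.h>
#include <signal.h>

/* Assumption: Linux numbers its signals 1..64. */
#define MAX_LINUX_SIGNAL 64

/* Returns the signal number encoded in a shell exit status (128 + N),
 * or 0 if the status is an ordinary exit code. */
int shell_status_to_signal(int status)
{
    assert(status >= 0 && status <= 255); /* exit statuses are one byte */
    int sig = status - 128;
    return (sig >= 1 && sig <= MAX_LINUX_SIGNAL) ? sig : 0;
}
```

With this helper, 139 maps to SIGSEGV, and 255 maps to 0 because 127 is outside the assumed signal range.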

Reproduction matrix across multiple repos

Each run on a fresh empty target dir.

| Repo | sl clone exit | files in WC | with profiling.always-on-enabled=False |
|---|---|---|---|
| octocat/Hello-World | 0 | 1 | 0 (1 file) |
| Marukome0743/raspi-signage (155 files) | 255 / 139 (segv) | 0 | 0 (155 files) |
| facebook/sapling (~11k files) | 255 / 139 (segv) | 0 | 0 (10946 files) |

Hello-World works because its update step finishes before the profiler's 1s sampling interval can fire on a problematic frame. Larger repos almost always trip the sampler.

Workaround

Set in ~/.config/sapling/sapling.conf (or pass via --config):

[profiling]
always-on-enabled = False

HGPROF=noop does not help — it only disables the Python-level profiler, not the Rust sampling profiler.

Crash location

Crashing thread is the pfc[worker/...] sampling-profiler worker. The crash is at sapling_cext_evalframe_resolve_code_object+14 with rdi = 0x8000000000000002, i.e., the first argument, PyCodeObject* code, is a non-canonical x86-64 address (bit 63 is set but bits 62 through 47 are clear), so any access through it faults.
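
The non-canonical claim can be checked with a few lines of arithmetic. The sketch below assumes the usual x86-64 rule that bits 63 down to (va_bits - 1) must be all zeros or all ones, with va_bits = 48 for 4-level paging or 57 for 5-level paging. `is_canonical` is a hypothetical helper written for this report, not a Sapling function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is canonical on a CPU implementing va_bits of virtual
 * address: bits 63 .. va_bits-1 must be all zeros or all ones. */
bool is_canonical(uint64_t addr, unsigned va_bits)
{
    assert(va_bits >= 2 && va_bits <= 64);
    uint64_t upper = addr >> (va_bits - 1);          /* bits 63..va_bits-1 */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1); /* same width, all set */
    return upper == 0 || upper == all_ones;
}
```

Under this rule, 0x8000000000000002 is rejected for both va_bits = 48 and va_bits = 57, so the value in rdi cannot be a valid pointer regardless of paging mode.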

Full backtrace of the crashing thread
Program terminated with signal SIGSEGV, Segmentation fault.
#0  sapling_cext_evalframe_resolve_code_object ()
#1  <backtrace_python::PythonSupplementalFrameResolver as
       backtrace_ext::SupplementalFrameResolver>::resolve_supplemental_info ()
#2  backtrace_ext::Frame::resolve ()
#3  sampling_profiler::frame_handler::profiler_loop ()
#4  std::sys::backtrace::__rust_begin_short_backtrace ()
#5  core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#6  <std::sys::thread::unix::Thread>::new::thread_start ()
#7  start_thread () from libc.so.6
#8  clone3 () from libc.so.6

Register state at crash (selected):

  • rip = sapling_cext_evalframe_resolve_code_object+14
  • rdi = 0x8000000000000002 (first arg PyCodeObject* code, non-canonical)

The garbage code pointer was captured from the C stack on the Python thread by the signal handler — read at sp + OFFSET_SP_CODE — and then dereferenced later on the profiler thread (no longer holding the Python thread paused, no GIL). At that point the captured value can be stale (frame slot not yet initialized when SIGPROF fired, or already reused) or point to a PyCodeObject that has since been freed.
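
One way to observe, without faulting, that a captured value like this is not a readable address is to let the kernel perform the copy. The sketch below is an illustration only, assuming Linux and glibc's process_vm_readv wrapper; `try_read_own_memory` is a hypothetical helper, not part of Sapling. It only detects unmapped or invalid addresses. It says nothing about an address that is still mapped but whose object has been freed or reused.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Copies len bytes from addr in this process into out without ever
 * dereferencing addr in user space. Returns false if the kernel could
 * not read the whole range (for example, EFAULT on an unmapped or
 * non-canonical address). */
bool try_read_own_memory(uintptr_t addr, void *out, size_t len)
{
    assert(out != NULL || len == 0);
    struct iovec local = { .iov_base = out, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)addr, .iov_len = len };
    ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
    return n == (ssize_t)len;
}
```

Called with the address of a live local variable, this should succeed and copy the value. Called with 0x8000000000000002, it should return false instead of raising SIGSEGV.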

Relevant source paths and captured offsets

Code paths involved:

  • eden/scm/lib/backtrace-python/src/lib.rs:149: read_stack(OFFSET_SP_CODE) captures the code value from the signal handler.
  • eden/scm/lib/backtrace-python/src/lib.rs:114-133: resolve_supplemental_info calls evalframe_sys::resolve_code_object(code, ...) from the profiler loop thread.
  • eden/scm/lib/backtrace-python/evalframe-sys/src/evalframe.c:203-224: sapling_cext_evalframe_resolve_code_object dereferences code->co_filename / ->co_name. Its safety comment requires the owning Python thread to be paused, but at the call site above that is no longer true.

Runtime offsets captured on this build via bindings.backtrace:

OFFSET_IP:         91
OFFSET_SP_FRAME:   16
OFFSET_SP_CODE:     0
OFFSET_SP_LINE_NO: 40
SUPPORTED_INFO.os_arch:     True
SUPPORTED_INFO.c_evalframe: True

OFFSET_SP_CODE = 0 is notable — the profiler reads code directly from *sp. A signal that fires before Sapling_PyEvalFrame has written code to its stack slot (or after the function has returned and the slot is reused) yields whatever is on the stack at that point.

Why the existing fix is not enough

  • 09fa42845cb (sampling-profiler: add an example to reproduce segfault, Apr 23, 2026) added the munmap-segv-example/ reproducer.
  • a82fe9f9804 (backtrace-python: avoid unsafe (segfault) interp frame read, Apr 30, 2026) replaced the unsafe interp-frame read in maybe_extract_supplemental_info with values stored on the Sapling_PyEvalFrame stack frame.

Both are in this release (1e764c94). That fix was on the signal-handler side — the frame pointer is no longer dereferenced from the signal handler. But the profiler-loop side still dereferences the captured PyCodeObject* to read co_filename / co_name, and that read is what crashes here.

Suggested directions

Roughly in increasing invasiveness:

  1. Hold a strong reference to the captured PyCodeObject* for the duration the profiler may dereference it. (Tricky — INCREF/DECREF need the GIL.)
  2. Defer the dereference to a point that holds the GIL — e.g., extract the co_filename / co_name UTF-8 strings on the Python thread before resuming it after the signal, instead of from the profiler-loop thread.
  3. Validate the pointer before dereferencing (sigsetjmp around the read, process_vm_readv on the process's own pid, or a read through /proc/self/mem). Easier, but does not cover the "pointer now points to a different live object" case.
  4. Skip capturing code at all from the signal handler — capture line_no plus a probe-time marker only, and synthesize the function-name string later by walking the current Python stack from the profiler thread under the GIL.
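
As a rough illustration of direction 2, the toy model below copies the two strings into the sample while the owning thread is still paused, so the profiler thread later reads only bytes the sample owns. Everything here is hypothetical: `mock_code` stands in for the two PyCodeObject fields, and `sample`, `copy_bounded`, `capture_sample` and `format_sample` are names invented for this sketch. A real implementation would also need a way to obtain UTF-8 bytes from the PyUnicode objects without allocating, which this model does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the PyCodeObject fields the profiler needs. */
struct mock_code {
    const char *co_filename;
    const char *co_name;
};

/* Hypothetical sample record that owns copies of its strings. */
struct sample {
    char filename[256];
    char name[128];
    int line_no;
};

/* Bounded copy that always NUL-terminates and never allocates. */
void copy_bounded(char *dst, size_t cap, const char *src)
{
    if (cap == 0)
        return;
    size_t i = 0;
    for (; i + 1 < cap && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
}

/* Capture side: runs while the owning thread is paused, so code is alive. */
void capture_sample(struct sample *out, const struct mock_code *code, int line_no)
{
    copy_bounded(out->filename, sizeof out->filename, code->co_filename);
    copy_bounded(out->name, sizeof out->name, code->co_name);
    out->line_no = line_no;
}

/* Profiler-thread side: touches only memory owned by the sample. */
int format_sample(const struct sample *s, char *buf, size_t cap)
{
    assert(s != NULL && buf != NULL);
    return snprintf(buf, cap, "%s:%d (%s)", s->filename, s->line_no, s->name);
}
```

The trade-off is a larger per-sample record and more work on the capture path in exchange for removing the deferred dereference entirely.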

Build / install info
brew info sapling:
  sapling: stable 0.2.20260522-084851 (bottled), HEAD
  /home/linuxbrew/.linuxbrew/Cellar/sapling/0.2.20260522-084851

Dependencies (runtime):
  gh, libssh2, node, openssl@3, python@3.12 (3.12.13_2), zlib-ng-compat,
  bzip2, curl, gcc, glibc (2.39)

System:
  Amazon Linux 2023, Linux 6.12.88-119.157.amzn2023.x86_64
  System glibc: 2.34
  /home/linuxbrew/.linuxbrew loader resolves libpython3.12.so to the Homebrew
  glibc 2.39 build at runtime (verified via LD_DEBUG=libs).

I did not try to verify against a make oss source build: make oss is reported broken on Python 3.12+ (#1032, #1141) and this environment only has Python 3.14 available as the system default. However, the crashing code is unconditionally compiled in regardless of build flavor, and the default [profiling] always-on-enabled = True is set in eden/scm/lib/config/loader/src/builtin_static/production.rs, so the bug is not Homebrew-specific.

Happy to provide a core dump, additional traces, or test patches if useful.
