[WIP][feature](be) Support print stack API#64093

Draft
JoverZhang wants to merge 4 commits into
apache:masterfrom
JoverZhang:be-print-stack-1

Conversation

@JoverZhang
Contributor

What problem does this PR solve?

Issue Number: close #62497

Problem Summary:

The research project is at https://github.com/JoverZhang/Doris-BE-Print-Stack-API-Research (in progress)

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

mira and others added 4 commits June 3, 2026 19:28
UPSTREAM WORKAROUND, not part of the stack-collection design.

be/src/storage/segment/segment_writer.h declares _is_mow() and
_is_mow_with_cluster_key() as plain (non-inline) member functions; the
definitions in segment_writer.cpp were marked inline. Under -O0 (ASAN_UT)
the compiler emits the symbols anyway, so external callers (the test
subclass in be/test/storage/segment/test_segment_writer.h) link fine.
Under -O3 (RELEASE) the compiler elides the standalone symbol and the
test binary fails to link with:

  ld.lld: error: undefined symbol: doris::segment_v2::SegmentWriter::_is_mow()
  ld.lld: error: undefined symbol: doris::segment_v2::SegmentWriter::_is_mow_with_cluster_key()

The fix is to remove the inline keyword from the two out-of-line
definitions. No semantic change in any build mode. Carry this on
phase2/common so RELEASE and TSAN UT builds can link.

Co-Authored-By: jover <joverzh@gmail.com>
…dler

Layer 1 (startup):
- print_stack_globals.h: kServiceSignal=SIGRTMIN+6, kMaxSignalFrames=1024,
  kPipeReadTimeoutMs=100, StackCaptureSlot, atomic state externs.
- print_stack_init.cpp: define state, install sigaction(SA_SIGINFO|SA_RESTART)
  on kServiceSignal, open notification pipe with O_CLOEXEC. Called at end of
  doris::init_signals().

Layer 2 + 5 (action):
- print_stack_action.{h,cpp}: PrintStackAction parses optional thread_id,
  invokes collect_print_stack, serializes the CK-shape JSON
  {threads:[{thread_id, thread_name, trace:[{dso, dso_offset}]}]} with
  dso_offset as a hex string.
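
A minimal sketch of the response shape described above, assuming hand-rolled string building. The `Frame` and `ThreadTrace` structs and the `hex_offset`, `json_escape`, and `to_json` helpers are hypothetical names for illustration and are not taken from print_stack_action.cpp, which may well use a JSON library instead:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical in-memory form of one resolved frame and one thread row.
struct Frame {
    std::string dso;
    std::uint64_t dso_offset;
};
struct ThreadTrace {
    std::int64_t thread_id;
    std::string thread_name;
    std::vector<Frame> trace;
};

// dso_offset is emitted as a hex *string*, e.g. "0x1a2b", not as a JSON number.
std::string hex_offset(std::uint64_t v) {
    char buf[2 + 16 + 1];
    std::snprintf(buf, sizeof(buf), "0x%llx", static_cast<unsigned long long>(v));
    return buf;
}

// Escape quotes, backslashes and control characters so names stay valid JSON.
std::string json_escape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
            out += static_cast<char>(c);
        } else if (c < 0x20) {
            char buf[7];
            std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Build {threads:[{thread_id, thread_name, trace:[{dso, dso_offset}]}]}.
std::string to_json(const std::vector<ThreadTrace>& threads) {
    std::string out = "{\"threads\":[";
    for (std::size_t i = 0; i < threads.size(); ++i) {
        const ThreadTrace& t = threads[i];
        if (i != 0) out += ',';
        out += "{\"thread_id\":" + std::to_string(t.thread_id);
        out += ",\"thread_name\":\"" + json_escape(t.thread_name) + "\"";
        out += ",\"trace\":[";
        for (std::size_t j = 0; j < t.trace.size(); ++j) {
            if (j != 0) out += ',';
            out += "{\"dso\":\"" + json_escape(t.trace[j].dso) + "\"";
            out += ",\"dso_offset\":\"" + hex_offset(t.trace[j].dso_offset) + "\"}";
        }
        out += "]}";
    }
    out += "]}";
    return out;
}
```

Keeping dso_offset as a string avoids precision loss in JSON consumers that parse numbers as doubles.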

Layer 3 (coordinator+handler):
- print_stack_signal_handler.cpp: signal-safe; rejects foreign pids, checks
  sequence, single-writer latch CAS, calls capture_into_slot,
  publishes data_ready_num, writes the sequence to the notification pipe.
- print_stack.cpp: single-dump mutex; list_target_thread_ids,
  read_thread_names, is_signal_blocked mirror CK's getFilteredThreadIds /
  getFilteredThreadNames / isSignalBlocked (line citations in the helper
  comments); rt_tgsigqueueinfo wrapper; bounded poll on pipe with EINTR
  drain; pc_to_frame via SymbolIndex.
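
The coordinator/handler round trip above can be sketched as follows for Linux. This is an illustration under stated assumptions, not the PR's code: the `sketch` namespace, the `g_*` variables, and `init` / `request_capture` are hypothetical names, the stack capture itself is elided, and the EINTR retry simply restarts the full timeout where a production version would track the remaining time:

```cpp
#include <atomic>
#include <cerrno>
#include <csignal>
#include <cstring>
#include <fcntl.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

namespace sketch {

const int kServiceSignal = SIGRTMIN + 6;  // SIGRTMIN is a runtime value in glibc
constexpr int kPipeReadTimeoutMs = 100;

int g_pipe[2] = {-1, -1};          // notification pipe: [0] read end, [1] write end
std::atomic<int> g_expected_seq{0};
std::atomic<int> g_latch{0};       // single-writer latch, 0 means free
std::atomic<int> g_data_ready{0};

// Handler: only async-signal-safe calls (getpid, write) and lock-free atomics.
void handler(int, siginfo_t* info, void*) {
    const int saved_errno = errno;
    const int seq = info->si_value.sival_int;
    int free_latch = 0;
    if (info->si_pid == getpid() &&                        // reject foreign pids
        seq == g_expected_seq.load(std::memory_order_acquire) &&
        g_latch.compare_exchange_strong(free_latch, seq)) {
        // capture_into_slot(...) would run here in the real handler.
        g_data_ready.store(seq, std::memory_order_release);
        ssize_t ignored = write(g_pipe[1], &seq, sizeof(seq));
        (void)ignored;
    }
    errno = saved_errno;
}

bool init() {
    if (pipe2(g_pipe, O_CLOEXEC) != 0) return false;
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(kServiceSignal, &sa, nullptr) == 0;
}

// Signal one thread of this process and wait (bounded) for its notification.
bool request_capture(pid_t tid, int seq) {
    g_latch.store(0, std::memory_order_relaxed);
    g_expected_seq.store(seq, std::memory_order_release);

    siginfo_t info;
    std::memset(&info, 0, sizeof(info));
    info.si_signo = kServiceSignal;
    info.si_code = SI_QUEUE;
    info.si_pid = getpid();
    info.si_uid = getuid();
    info.si_value.sival_int = seq;
    if (syscall(SYS_rt_tgsigqueueinfo, getpid(), tid, kServiceSignal, &info) != 0) {
        return false;
    }

    struct pollfd pfd = {g_pipe[0], POLLIN, 0};
    int rc;
    do {
        rc = poll(&pfd, 1, kPipeReadTimeoutMs);  // poll is not restarted by SA_RESTART
    } while (rc < 0 && errno == EINTR);
    if (rc <= 0) return false;                   // timeout or error

    int got = 0;
    if (read(g_pipe[0], &got, sizeof(got)) != static_cast<ssize_t>(sizeof(got))) return false;
    return got == seq && g_data_ready.load(std::memory_order_acquire) == seq;
}

}  // namespace sketch
```

Carrying the sequence number in `si_value` lets the handler discard stale or late signals without touching any lock.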

Layer 3d (variant seam):
- print_stack_capture.h: declares capture_into_slot. Variant ships .cpp.

Route:
- http_service.cpp: register PrintStackAction on GET /api/print_stack.

Spec: docs/architecture.md.
Reference: ClickHouse-v26.3.10.62-lts/src/Storages/System/StorageSystemStackTrace.cpp.

Co-Authored-By: jover <joverzh@gmail.com>
PrintStackActionTest fixture:
- SetUpTestCase: EvHttpServer(0), register PrintStackAction on
  /api/print_stack, start, record real port, call print_stack_init(),
  start one parked marker thread.
- TearDownTestCase: stop marker, delete server + action.

Cases (mirror CK system.stack_trace regressions):
- ContractJsonShape: GET /api/print_stack; asserts the required keys:
  top-level threads, per row (thread_id int64, thread_name string,
  trace array), and per frame (dso string, dso_offset string).
- ThreadIdSelector: self-tid returns one row; absent tid returns
  empty threads array.
- BestEffortFrameObserved: 100-attempt loop against the marker tid;
  break when one frame's dso contains 'doris_be_test' and dso_offset
  != "0x0". Asserts at least one success.

Spec: docs/phase2-test-plan.md, docs/architecture.md.

Co-Authored-By: jover <joverzh@gmail.com>
Signal-safe implementation of the variant capture hook:
- Extract RIP/RBP from ucontext_t.uc_mcontext.gregs.
- Seed pcs[0] = RIP, then walk *(rbp+8) for return addresses and
  *(rbp) for the next frame pointer.
- Bounds: 8-aligned RBP; monotonically growing toward higher
  addresses (equality allowed only on the first iteration where
  rbp == first_rbp; strict growth required on each subsequent step);
  within a 16 MiB span of the first RBP; with mincore() verifying
  both probed words live on a resident page.
- Stops at kMaxSignalFrames or when bounds fail.
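
The walk and its bounds can be sketched as a plain function, assuming x86-64 frame pointers. `walk_frame_pointers`, `ProbeFn`, `always_readable`, and `kMaxStackSpan` are hypothetical names; the memory probe is passed in as a callback so the sketch can be exercised on a synthetic stack, and the extra stop on a zero return address is an assumption of this sketch rather than something stated above:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxSignalFrames = 1024;
constexpr std::uintptr_t kMaxStackSpan = 16u * 1024 * 1024;  // 16 MiB

// Hypothetical probe: returns true if [addr, addr + len) is safe to read.
using ProbeFn = bool (*)(std::uintptr_t addr, std::size_t len);

// Stand-in probe for tests; a real handler would verify the pages first.
inline bool always_readable(std::uintptr_t, std::size_t) { return true; }

// Fill pcs[] with RIP followed by return addresses found via the RBP chain.
// Returns the number of entries written. No allocation, no locks.
std::size_t walk_frame_pointers(std::uintptr_t rip, std::uintptr_t first_rbp,
                                std::uintptr_t* pcs, std::size_t max_frames,
                                ProbeFn readable) {
    if (max_frames == 0) return 0;
    std::size_t n = 0;
    pcs[n++] = rip;  // seed with the interrupted instruction pointer

    std::uintptr_t rbp = first_rbp;
    std::uintptr_t prev = first_rbp;
    bool first = true;
    while (n < max_frames) {
        if (rbp == 0 || (rbp & 7u) != 0) break;                // must be 8-aligned
        if (first ? (rbp != prev) : (rbp <= prev)) break;      // strict growth after step one
        if (rbp - first_rbp > kMaxStackSpan) break;            // stay within 16 MiB
        if (!readable(rbp, 2 * sizeof(std::uintptr_t))) break; // both words must be readable

        const std::uintptr_t* words = reinterpret_cast<const std::uintptr_t*>(rbp);
        const std::uintptr_t next_rbp = words[0];  // *(rbp): caller's frame pointer
        const std::uintptr_t ret = words[1];       // *(rbp + 8): return address
        if (ret == 0) break;
        pcs[n++] = ret;

        prev = rbp;
        rbp = next_rbp;
        first = false;
    }
    return n;
}
```

The strict-growth check is what guarantees termination on a corrupted or cyclic frame-pointer chain, independent of the kMaxSignalFrames cap.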

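mincore() is a Linux/BSD system call rather than a POSIX interface, so
it is not on the POSIX async-signal-safe list; the handler relies on it
being a thin system-call wrapper on Linux. Probing both words before
each dereference is what keeps the walk from raising SIGSEGV on a bad
frame pointer. The handler uses no allocations, no /proc, no locks, and
no Doris TLS.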
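
A minimal sketch of such a probe on Linux, assuming the check is "every page covering the two words is mapped and reported resident". `words_resident` is a hypothetical name; the PR's helper may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Returns true if every page covering [addr, addr + len) is mapped and
// reported resident by mincore(). Uses only system calls and stack storage.
bool words_resident(std::uintptr_t addr, std::size_t len) {
    const long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0 || len == 0) return false;
    const std::uintptr_t page = static_cast<std::uintptr_t>(page_size);

    const std::uintptr_t end = addr + len;
    if (end < addr) return false;                      // address wrap-around
    const std::uintptr_t start = addr & ~(page - 1);   // mincore needs a page-aligned start
    const std::size_t span = end - start;
    const std::size_t pages = (span + page - 1) / page;

    unsigned char vec[2];                              // two words span at most two pages
    if (pages > sizeof(vec)) return false;

    // mincore fails with ENOMEM when any part of the range is unmapped.
    if (mincore(reinterpret_cast<void*>(start), span, vec) != 0) return false;
    for (std::size_t i = 0; i < pages; ++i) {
        if ((vec[i] & 1u) == 0) return false;          // low bit set means resident
    }
    return true;
}
```

The check is inherently a snapshot: another thread could unmap the page between the probe and the read, so it narrows the fault window rather than closing it.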

Spec: docs/architecture.md "Layer 3d".

Co-Authored-By: jover <joverzh@gmail.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Development

Successfully merging this pull request may close these issues.

[Enhancement] Support print stack API
