[fix](Reliability) Fix Doris query service fails after the file handles on the BE node are used up #62393

Merged
yiguolei merged 7 commits into apache:master from LuGuangming:master_filehandles
Apr 16, 2026

Conversation

@LuGuangming
Contributor

What problem does this PR solve?

Issue Number: close #62392

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 11, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yiguolei
Contributor

/review

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved (Indicates a PR has been approved by one committer.) and reviewed labels on Apr 15, 2026
@github-actions
Contributor

PR approved by anyone and no changes requested.

@yiguolei
Contributor

run buildall

Contributor

@github-actions github-actions bot left a comment

Findings:

  1. be/src/common/metrics/system_metrics.cpp: _file_handle_deplenish_counter is never cleared after a successful /proc/stat open, so the new logic counts lifetime EMFILE incidents instead of consecutive failures. Three isolated handle-exhaustion spikes across a long period will still trip the fatal exit, which does not match the code comment or the intended safeguard.
  2. be/src/common/config.cpp + be/src/common/metrics/system_metrics.cpp: the new mutable config file_handles_deplenish_frequency_times has no validator. If it is set to 0 or a negative value, the comparison counter >= config becomes true on the first fopen failure, even for non-EMFILE errors, so a bad config update can kill BE for an unrelated transient /proc/stat problem.
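
The "consecutive failures" semantics described in finding 1 can be sketched as follows. This is a minimal illustration, not the code merged in the PR: the class name `ConsecutiveFailureTracker` and its methods are hypothetical, and the choice to leave the streak unchanged on a non-EMFILE failure is an assumption (resetting it would also be defensible).

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical helper (not from the PR): counts consecutive EMFILE failures
// and clears the streak as soon as one open succeeds.
class ConsecutiveFailureTracker {
public:
    explicit ConsecutiveFailureTracker(int32_t threshold) : _threshold(threshold) {
        // A threshold below 1 would trip on the very first failure.
        assert(threshold >= 1);
    }

    // Record one attempt to open a file such as /proc/stat.
    // `ok` is whether the open succeeded; `err` is the errno value captured
    // right after a failed open. Returns true when the streak of EMFILE
    // failures has reached the threshold.
    bool record(bool ok, int err) {
        if (ok) {
            _count = 0;  // success ends the streak
            return false;
        }
        if (err != EMFILE) {
            return false;  // unrelated error: never counts toward the threshold
        }
        ++_count;
        return _count >= _threshold;
    }

    int32_t count() const { return _count; }

private:
    int32_t _threshold;
    int32_t _count = 0;
};
```

With a threshold of 3, two EMFILE failures followed by one success leave the count at 0, so three further EMFILE failures in a row are needed before `record` returns true.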

Critical checkpoint conclusions:

  • Goal of the task: Partially achieved. The PR tries to fail fast when BE cannot open /proc/stat because file handles are exhausted, but the current logic can also terminate after non-consecutive incidents and after unsafe config changes. No test proves the intended behavior.
  • Modification size/focus: Yes. The change is small and focused on BE system metrics/config.
  • Concurrency: No new lock-order issue found. SystemMetrics::update() runs through the metric hook path under registry/entity locks, and this PR does not add another concurrent access path.
  • Special lifecycle/static init: No special lifecycle issue found beyond the existing metric hook registration/deregistration.
  • Configuration items added: Yes. The new config is read dynamically, but it needs validation/documentation because unsafe values can change runtime behavior catastrophically.
  • Incompatible changes: None found.
  • Parallel code paths: No corresponding parallel path appears to need the same change.
  • Special conditional checks: The new threshold check is not aligned with its own "consecutive failures" comment because success does not reset the counter.
  • Test coverage: Insufficient. There is no unit/regression coverage for repeated EMFILE failures, success-reset behavior, or invalid threshold values.
  • Observability: The added warning/fatal logs are enough once the logic is corrected.
  • Transaction/persistence/data-write/FE-BE variable passing: Not applicable for this PR.
  • Performance: No meaningful concern from this small addition.
  • Other issues: None beyond the blocking correctness/configuration problems above.

Overall opinion: Request changes until the counter semantics and config validation are fixed.
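
One way to address the config-validation concern is to reject thresholds below 1 and fall back to a known-safe value at the comparison site. The sketch below is standalone and hypothetical: the function names are invented for illustration, and how such a check would be wired into the BE config system is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical validator: a failure threshold only makes sense if it is >= 1.
// With 0 or a negative value, `counter >= threshold` is true on the first failure.
constexpr bool is_valid_failure_threshold(int32_t value) {
    return value >= 1;
}

// Hypothetical guard for the comparison site: use the configured value when it
// is valid, otherwise a fallback that the caller guarantees to be valid.
inline int32_t effective_failure_threshold(int32_t configured, int32_t fallback) {
    assert(is_valid_failure_threshold(fallback));
    return is_valid_failure_threshold(configured) ? configured : fallback;
}
```

Under these assumptions, a runtime update that sets the threshold to 0 or -1 is ignored and the fallback stays in effect, so a bad value cannot by itself trigger the fatal path.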

Comment thread be/src/common/metrics/system_metrics.cpp
Comment thread be/src/common/config.cpp
@yiguolei yiguolei added the dev/4.1.x and usercase (Important user case type label) labels on Apr 15, 2026
@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/11) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 73.61% (27391/37210) |
| Line Coverage | 57.29% (295922/516522) |
| Region Coverage | 54.55% (246714/452244) |
| Branch Coverage | 56.17% (106840/190208) |

@yiguolei
Contributor

/review

Contributor

@github-actions github-actions bot left a comment

Requesting changes because the new fail-fast logic can terminate a recovered BE after non-consecutive transient EMFILE failures.

Finding

  • be/src/common/metrics/system_metrics.cpp: _file_handle_deplenish_counter is never reset after a successful fopen("/proc/stat"). After any three intermittent EMFILE failures over the lifetime of the process, the next failure triggers LOG(FATAL)/exit(-1) even if all intervening metric updates succeeded. This does not satisfy the code comment's "consecutive failures" intent and can create false-positive BE exits.
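
The difference between lifetime counting and consecutive counting can be made concrete with a small simulation over a sequence of metric updates. This is an illustrative sketch only: `first_trip` is a hypothetical function, and the `>=` comparison is an assumption about the threshold check, which may differ from the exact condition in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simulation: emfile_failed[i] is true when metric update i
// failed with EMFILE. Returns the index of the update at which the failure
// count first reaches `threshold`, or -1 if it never does.
int first_trip(const std::vector<bool>& emfile_failed, int threshold, bool reset_on_success) {
    assert(threshold >= 1);
    int count = 0;
    for (std::size_t i = 0; i < emfile_failed.size(); ++i) {
        if (emfile_failed[i]) {
            ++count;
            if (count >= threshold) {
                return static_cast<int>(i);
            }
        } else if (reset_on_success) {
            count = 0;  // a successful update ends the streak
        }
    }
    return -1;
}
```

For the sequence fail, ok, ok, fail, ok, fail with a threshold of 3, the lifetime policy (`reset_on_success = false`) trips at the last update, while the consecutive policy (`reset_on_success = true`) never trips. Three failures in a row trip both policies at the third update.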

Critical Checkpoints

  • Goal of current task: Partially met. The PR tries to fail fast when file handles are exhausted, but the current implementation can crash on non-consecutive transient failures. No test demonstrates the intended behavior.
  • Modification size/focus: Yes. The diff is small and focused.
  • Concurrency: No new lock/order changes observed. In the default configuration this hook runs from the metrics calculator thread.
  • Lifecycle/static initialization: No special lifecycle or SIOF concerns introduced.
  • Configuration items: Yes. The new mutable config is read live and would be observed without restart, but its semantics are not validated by tests.
  • Compatibility/incompatible changes: None.
  • Functionally parallel code paths: SystemMetrics has other /proc readers, but only _update_cpu_metrics() participates in the new fatal policy.
  • Special conditional checks: errno == 24 (EMFILE on Linux, "too many open files") is a targeted condition, but the reset logic is incomplete.
  • Test coverage: Missing for the new counter reset / threshold behavior.
  • Observability: Warning and fatal logs are present; no additional metrics were added.
  • Transaction/persistence/data writes: Not applicable.
  • FE-BE variable passing: Not applicable.
  • Performance: Negligible impact.
  • Other issues: No additional blocking issues found beyond the correctness bug above.
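
The `errno == 24` check mentioned above can also be written against the symbolic constant `EMFILE`, with errno captured immediately after the failed open so that later calls cannot overwrite it. The sketch below is a hypothetical standalone helper (`probe_open` and `OpenResult` are invented names), not the code in `system_metrics.cpp`.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>

// Hypothetical classification of an open attempt.
enum class OpenResult { kOk, kHandlesExhausted, kOtherError };

// Tries to open `path` read-only and classifies the outcome.
OpenResult probe_open(const char* path) {
    assert(path != nullptr);
    errno = 0;
    std::FILE* fp = std::fopen(path, "r");
    if (fp != nullptr) {
        std::fclose(fp);
        return OpenResult::kOk;
    }
    // Capture errno before any other call (logging included) can change it.
    const int saved_errno = errno;
    return saved_errno == EMFILE ? OpenResult::kHandlesExhausted
                                 : OpenResult::kOtherError;
}
```

Only `kHandlesExhausted` would count toward a fail-fast threshold; a missing or unreadable file is reported as `kOtherError` and can be handled as an ordinary transient problem.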

Comment thread be/src/common/metrics/system_metrics.cpp
@yiguolei
Contributor

skip check_coverage

@yiguolei yiguolei merged commit 4eaa942 into apache:master Apr 16, 2026
30 of 31 checks passed
github-actions bot pushed a commit that referenced this pull request Apr 16, 2026
yiguolei pushed a commit that referenced this pull request Apr 16, 2026
… file handles on the BE node are used up #62393 (#62540)

Cherry-picked from #62393

Co-authored-by: Guangming Lu <71873108+LuGuangming@users.noreply.github.com>
Labels

  • approved (Indicates a PR has been approved by one committer.)
  • dev/4.1.1-merged
  • reviewed
  • usercase (Important user case type label)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Doris query services fail continuously after the BE node file handles is used up

4 participants