[fix](Reliability)Fix Doris query service fails after the file handles on the BE node are used up#62393
Conversation
…the BE node are used up
…the BE node are used up
…the BE node are used up
…s on BE node are used up
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
…s on the BE node are used up
…s on the BE node are used up
…s on the BE node are used up
|
/review |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
There was a problem hiding this comment.
Findings:
be/src/common/metrics/system_metrics.cpp:_file_handle_deplenish_counteris never cleared after a successful/proc/statopen, so the new logic counts lifetimeEMFILEincidents instead of consecutive failures. Three isolated handle-exhaustion spikes across a long period will still trip the fatal exit, which does not match the code comment or the intended safeguard.be/src/common/config.cpp+be/src/common/metrics/system_metrics.cpp: the new mutable configfile_handles_deplenish_frequency_timeshas no validator. If it is set to0or a negative value, the comparisoncounter >= configbecomes true on the firstfopenfailure, even for non-EMFILEerrors, so a bad config update can kill BE for an unrelated transient/proc/statproblem.
Critical checkpoint conclusions:
- Goal of the task: Partially achieved. The PR tries to fail fast when BE cannot open
/proc/statbecause file handles are exhausted, but the current logic can also terminate after non-consecutive incidents and after unsafe config changes. No test proves the intended behavior. - Modification size/focus: Yes. The change is small and focused on BE system metrics/config.
- Concurrency: No new lock-order issue found.
SystemMetrics::update()runs through the metric hook path under registry/entity locks, and this PR does not add another concurrent access path. - Special lifecycle/static init: No special lifecycle issue found beyond the existing metric hook registration/deregistration.
- Configuration items added: Yes. The new config is read dynamically, but it needs validation/documentation because unsafe values can change runtime behavior catastrophically.
- Incompatible changes: None found.
- Parallel code paths: No corresponding parallel path appears to need the same change.
- Special conditional checks: The new threshold check is not aligned with its own "consecutive failures" comment because success does not reset the counter.
- Test coverage: Insufficient. There is no unit/regression coverage for repeated
EMFILEfailures, success-reset behavior, or invalid threshold values. - Observability: The added warning/fatal logs are enough once the logic is corrected.
- Transaction/persistence/data-write/FE-BE variable passing: Not applicable for this PR.
- Performance: No meaningful concern from this small addition.
- Other issues: None beyond the blocking correctness/configuration problems above.
Overall opinion: Request changes until the counter semantics and config validation are fixed.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Requesting changes because the new fail-fast logic can terminate a recovered BE after non-consecutive transient EMFILE failures.
Finding
be/src/common/metrics/system_metrics.cpp:_file_handle_deplenish_counteris never reset after a successfulfopen("/proc/stat"). After any three intermittentEMFILEfailures over the lifetime of the process, the next failure triggersLOG(FATAL)/exit(-1)even if all intervening metric updates succeeded. This does not satisfy the code comment's "consecutive failures" intent and can create false-positive BE exits.
Critical Checkpoints
- Goal of current task: Partially met. The PR tries to fail fast when file handles are exhausted, but the current implementation can crash on non-consecutive transient failures. No test demonstrates the intended behavior.
- Modification size/focus: Yes. The diff is small and focused.
- Concurrency: No new lock/order changes observed. In the default configuration this hook runs from the metrics calculator thread.
- Lifecycle/static initialization: No special lifecycle or SIOF concerns introduced.
- Configuration items: Yes. The new mutable config is read live and would be observed without restart, but its semantics are not validated by tests.
- Compatibility/incompatible changes: None.
- Functionally parallel code paths:
SystemMetricshas other/procreaders, but only_update_cpu_metrics()participates in the new fatal policy. - Special conditional checks:
errno == 24is a targeted condition, but the reset logic is incomplete. - Test coverage: Missing for the new counter reset / threshold behavior.
- Observability: Warning and fatal logs are present; no additional metrics were added.
- Transaction/persistence/data writes: Not applicable.
- FE-BE variable passing: Not applicable.
- Performance: Negligible impact.
- Other issues: No additional blocking issues found beyond the correctness bug above.
|
skip check_coverage |
…s on the BE node are used up (#62393)
What problem does this PR solve?
Issue Number: close #62392
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)