fs/inode: add dynamic FD backtrace threshold control#18767
fs/inode: add dynamic FD backtrace threshold control#18767xiaoqizhan wants to merge 0 commit intoapache:masterfrom
Conversation
| struct fd fl_prefds[CONFIG_NFILE_DESCRIPTORS_PER_BLOCK]; | ||
|
|
||
| #if CONFIG_FS_BACKTRACE > 0 && defined(CONFIG_FS_BACKTRACE_DYNAMIC) | ||
| atomic_t fl_open_count; /* Current open file descriptor count */ |
There was a problem hiding this comment.
once FS_BACKTRACE is enabled, fd always reserve the space for backtrace. what's benefit to skip saving the backtrace?
There was a problem hiding this comment.
Even after the FS_BACKTRACE macro is enabled, space will still be reserved for backtrace operations. My intention here is not to save memory, but to reduce performance overhead, since capturing backtrace incurs significant performance costs—especially in scenarios involving frequent opening and closing of file.
There was a problem hiding this comment.
this is debugging feature, why do you care about the performance but make the code more complex. I would suggest you optimize your code to reduce the frequency of open/close, or switch to frame base backtrace which is fast then backtrace through unwind table.
There was a problem hiding this comment.
What I mentioned earlier about frequently opening and closing FDs is just one scenario—it may not apply to my specific application.
Using FP-based stack backtracking can improve the quality of stack backtraces, but it still incurs performance overhead.
Essentially, fdleak focuses on leaked FDs rather than normal FD allocations. The FDs opened by a task at the beginning are usually not leaked; some tasks may open 50 FDs by default, while others may open only 10.
The patch I am adding focuses more on detecting leaked FDs, and each task can configure its own threshold for leaked FDs. This kind of operation was not supported in the original implementation.
There was a problem hiding this comment.
have you measure how much(percent) improvement can be got from this patch? I am afraid that this patch can't get the visable(measurable) performance improvement since open is a very slow fs operation.
There was a problem hiding this comment.
if you really want to reduce the cost, I would suggest to add flag(check how TCB_FLAG_HEAP_CHECK done) in task_group_s::tg_flags to disable/enable backtrace for the whole process, since it's more simple and general.
There was a problem hiding this comment.
I agree that using a tg_flags switch is simpler and perfectly eliminates overhead when disabled. However, the main motivation for the dynamic threshold is 'Out-of-the-box fault capturing' for unexpected/random FD leaks.
If we use a manual flag like TCB_FLAG_HEAP_CHECK, developers have to know a leak might happen beforehand and explicitly enable it. For long-running systems where an FD leak might occur randomly after days of running, a manual switch is hard to use because we usually don't have it enabled until the system has already crashed due to FD exhaustion.
With the dynamic threshold (e.g., 60), the overhead is negligible for normal execution, but acts as a 'safety net'. If a task unexpectedly goes crazy and opens 1000 FDs, we automatically capture the backtrace for the leaky ones (from 61 to 1000) without any manual intervention, which is invaluable for post-mortem debugging (e.g., viewing in procfs or coredump).
There was a problem hiding this comment.
Hi @xiaoxiang781216 , gentle ping on this.
Did my explanation about the "out-of-the-box fault capturing" for unexpected random leaks make sense?
That being said, if the community strongly prefers keeping the fast path (open/dup) absolutely clean and strictly zero-overhead (as you mentioned, similar to the TCB_FLAG_HEAP_CHECK
design), I am completely open to dropping the dynamic threshold and changing the implementation to use a manual tg_flags switch instead.
Please let me know which approach you think aligns better with NuttX's design philosophy for this feature, and I'll update the PR accordingly. Thanks!
There was a problem hiding this comment.
I like the simple approach since it isn't good to make the code base become complex for improving the performance for a debugging feature.
There was a problem hiding this comment.
OK,As suggested, I will resubmit the patch.
When CONFIG_FS_BACKTRACE is enabled, collecting a stack trace for every new file descriptor adds avoidable overhead to tasks that use only a small number of FDs.
Add a dynamic threshold option so backtrace capture is enabled only after the per-task FD count reaches the configured limit.
Note: Please adhere to Contributing Guidelines.
Summary
This patch introduces a dynamic backtrace threshold control mechanism for file descriptors.
performance overhead on tasks that only allocate a small number of file descriptors and never exceed typical limits or experience FD leaks.
open file descriptors per task (
fl_open_count) and only begins filling the backtrace array (f_backtrace) once the per-task file descriptor count reaches a configurable threshold(
CONFIG_FS_BACKTRACE_DEFAULT_THRESHOLD).resource leaks when they actually occur.
Impact
Testing
Host Machine: Ubuntu Linux (x86_64)
Target: Verified on generic simulators and local boards (e.g.
sim:nsh)fdlist).
5. Tested
fdlist_copy(task creation) to ensure backtrace thresholds and states are correctly propagated/initialized for child tasks.