-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
processes: do not post metrics from processes that are not running. #4121
base: main
Are you sure you want to change the base?
Conversation
Zombie (exited, but non-reaped) processes can still keep some system resources, until they are reaped by their parent (or init if parent has disappeared). And their presence is also an indication of some potential system issue, so skipping them should IMHO be at least a user-controllable option. |
I do agree with @eero-t , could you please add a switch to skip zombies and make it not-skip by default? That way, it would also be fully backwards-compatible. |
If one sees large number of zombies in process list, that's the issue that should be fixed, not what collectd reports. I also think that this change would be constantly spamming collectd log with warnings. All logging that could happen constantly, should be rate-limited, as filling disk can have much more dire consequences (lost data, even boot fails). |
I replaced WARNING with DEBUG. How about that? |
The PR itself looks fine now, but I just wonder in what case you see so many zombies that it has become an issue... What process it is, what is its parent, and why that is not reaping its zombie children? (Feels like I'm asking character motivations a horror movie...) |
The specific issue I'm facing is unrelated to zombie processes. My main concern is that I have configured a significant number of processes to monitor in the configuration file (without certainty of their runtime). However, even though certain processes specified in the configuration file have not been run since the start of the collectd service until its stop, the 'processes' plugin code still creates 'procstat' structures for these unexecuted processes and submits meaningless metrics with values of 0. |
Your check returns if process has no threads. AFAIK that means that they're zombies (exited, but not reaped by their parent). See e.g. the plugin source you're modifying: https://github.com/collectd/collectd/blob/main/src/processes.c#L1390 What state they are in, if not zombies (GNU (And if they really are not zombies, on which OS that is?) |
The processes with 'num_proc = 0' include both zombie processes and processes that have never been executed. My pull request skips submitting metrics for these two types of processes. Check it out here: https://github.com/collectd/collectd/blob/main/src/processes.c#L613 |
Ah, this was about processes that do not exist at all, not just ones that are not currently running. Now it makes more sense, I did not realize plugin to report metrics although there were no matching processes (I've never used it myself). In some situation that behavior is desirable, in others not, so it's good to have an option for it. |
Yeah. For those processes that do not exist at all, the processes plugin resets all metrics to 0 and submits them. This leads to unnecessary disk and network wastage. |
Your new commit adds NVML / CUDA calls. That stuff does not belong to this PR, please file a separate PR for it. |
Oh sorry, I forgot about this PR. I realize that I didn't switch to the correct branch and it has been a while since the pull request was opened. |
@zsfz Octo promoted me to Trusted Contributors a while ago. If you rebase this to latest |
Signed-off-by: sujankhadka <sujankhadka@didiglobal.com> Signed-off-by: tiozhang <zyhtheonly@yeah.net> Signed-off-by: luffysong <zsfz_one@163.com>
|
ChangeLog: processes plugin: do not post metrics from processes that are not running.
Currently, processes plugin will emit metrics even if the a process listed in ProcessMatch is not running. This is unnecessary and causes wasteful network/disk activity.
For process not running all the metrics are 0 but they still submit data. This will lead to a waste of resources.
If a process listed on ProcessMatch runs after collectd is started, those process will start sending metrics.
Please help review. Thanks a lot!