
dellhw-exporter causes a large amount of zombie processes #85

Open
deepankersharmaa opened this issue Oct 18, 2023 · 10 comments
@deepankersharmaa

Hi,

I have observed a large number of omreport and omcliproxy processes being spawned but never exiting or being terminated.

Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: omreport invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: CPU: 110 PID: 1319442 Comm: omreport Kdump: loaded Not tainted 4.18.0-372.36.1.el8_6.mr3789_221121_2132.x86_64 #1
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: Tasks state (memory values in pages):
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12750] 0 12750 35965 615 167936 0 -1000 conmon
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12777] 0 12777 179450 4466 196608 0 999 dellhw_exporter
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12844] 0 12844 49366 1786 180224 0 999 dsm_sa_eventmgr
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12845] 0 12845 84507 2358 217088 0 999 dsm_sa_snmpd
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 12851] 0 12851 587311 10325 581632 0 999 dsm_sa_datamgrd
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [ 14970] 0 14970 152261 5917 393216 0 999 dsm_sa_datamgrd
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314763] 0 1314763 2926 650 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314764] 0 1314764 2926 637 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314765] 0 1314765 2926 650 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314766] 0 1314766 2926 663 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314767] 0 1314767 2926 638 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314768] 0 1314768 2926 637 61440 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314769] 0 1314769 2926 664 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314771] 0 1314771 2926 644 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314774] 0 1314774 2926 637 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314776] 0 1314776 2926 627 69632 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314778] 0 1314778 2926 653 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314780] 0 1314780 9239 1179 114688 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314781] 0 1314781 2926 627 73728 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314783] 0 1314783 2926 650 61440 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314784] 0 1314784 9239 1199 114688 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314785] 0 1314785 9239 1190 118784 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314787] 0 1314787 2926 663 65536 0 999 omreport
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314788] 0 1314788 9239 1205 122880 0 999 omcliproxy
Sep 26 16:36:59 devenv01-worker003.devenv01.nfvi.localdomain kernel: [1314790] 0 1314790 9239 1145 114688 0 999 omcliproxy

I have not pasted all of the output here as it would be redundant, but roughly 850 similar lines followed the excerpt above.
These processes were most likely started inside the dellhw_exporter Pod. Judging by the Pod's name,
I assume it is an agent-like application for monitoring Dell hardware, since dellhw_exporter wraps the omreport command to collect data from the machine.

Regarding omreport and omcliproxy, I would like to confirm the following:

  • It seems abnormal for over 800 of these processes to be running; is that correct? (A quick way to count them is shown below this list.)
  • Is there any report of the dellhw_exporter Pod being in an abnormal state due to the oom-killer (for example, process proliferation like in this case)?
  • With some monitoring agent applications there is a scenario where processes proliferate, for example when they read every file under /proc (including OS information), causing a sharp increase in system load. Are these processes performing any work that could put a heavy load on the system when their number increases rapidly?
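For context, the counts above were gathered with generic ps invocations like the ones below; this is standard tooling, not something provided by the exporter:

# total number of omreport/omcliproxy processes
ps -eo pid,ppid,stat,comm | awk '$4 ~ /^(omreport|omcliproxy)$/' | wc -l
# only the ones in zombie (Z) state, with their parent PIDs
ps -eo pid,ppid,stat,comm | awk '$4 ~ /^(omreport|omcliproxy)$/ && $3 ~ /Z/'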

@galexrt
Owner

galexrt commented Oct 19, 2023

It seems abnormal for over 800 of these processes to be running; is that correct?

It depends on the hardware, resources given to the exporter, and other factors, though 800 seems high.
The processes that the dellhw_exporter starts are expected to exit either when they complete or when the commands time out.

Is there any report of the dellhw_exporter Pod being in an abnormal state due to the oom-killer (for example, process proliferation like in this case)?

There are no known issues with the dellhw_exporter failing to close processes or being OOM-killed when it is given the right amount of resources.

With some monitoring agent applications there is a scenario where processes proliferate

The processes are not meant to stick around; how often the exporter calls the commands to fetch the (latest) info for the metrics depends on the exporter configuration and similar factors.
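To illustrate the timeout behaviour mentioned above: each collector run conceptually wraps an omreport call under a deadline. A rough shell equivalent (the 30-second value and the chosen subcommand are only illustrative, not the exporter's actual defaults) would be:

# kill the command if it has not completed within 30 seconds
timeout --signal=KILL 30 omreport chassis temps

If a wrapped child is killed but its exit status is never reaped by the parent, it lingers as a zombie, which would match the process list in the original report.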

Can you provide the logs of the dellhw_exporter?

@deepankersharmaa
Author

Hi,

Thanks for your quick response and support.

Please find attached below the dellhw-exporter container log from the time the problem occurred.

Regards,
Deepankar

@deepankersharmaa
Author

Hi,

Thanks for your quick response and support.
Do we have any update regarding this?

Regards,
Deepankar

@galexrt
Owner

galexrt commented Nov 13, 2023

The logs show that some omreport command processes are being terminated/taking too long.

  • What scrape interval are you running?
  • What config/flags are you running the exporter with?

@deepankersharmaa
Author

Hi @galexrt, thanks for your reply.

I am running the exporter with the basic command below; there are no specific config/flags, and the scrape interval is 60 seconds.
podman run --name pf-dell-exporter -d --privileged -p 9137:9137 {{exporter_image}}
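For reference, since the resources given to the exporter were mentioned earlier, podman could also cap the container's memory and PID count; the limits below are only illustrative, not values we currently set:

podman run --name pf-dell-exporter -d --privileged --memory=512m --pids-limit=200 -p 9137:9137 {{exporter_image}}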

Regards,
Deepankar

@deepankersharmaa
Author

Hi @galexrt

Any idea about this?

@galexrt
Owner

galexrt commented Nov 24, 2023

@deepankersharmaa The logs indicate that omreport is taking a long time to respond.
Did you look into the Dell OMSA services on the machine to see if there is anything in their logs? Is the issue happening on a single server or on multiple servers?
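For reference, on a standard OMSA installation the service state and recent log entries can usually be checked with something like the following (the script path and log file may differ depending on the OMSA version and distribution):

# check whether the OMSA services are running
/opt/dell/srvadmin/sbin/srvadmin-services.sh status
# look for recent OMSA messages in the system log
grep -E 'dsm_sa|Server Administrator' /var/log/messages | tail -n 50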

@adidiborg
Contributor

@galexrt, we are also facing similar issues. It looks like it happens randomly on multiple servers.

@galexrt
Owner

galexrt commented Feb 2, 2024

As written before, without logs from the system's OMSA services that contain any hints, this is hard to diagnose.

I don't have access to a Dell server at the moment, so I would appreciate any logs or outputs from OMSA that I can dig into.
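If it helps, the OMSA hardware and alert logs can typically be dumped with omreport itself (exact subcommand availability may vary by OMSA version):

# OMSA/ESM hardware log and alert log
omreport system esmlog
omreport system alertlog
# OMSA version information
omreport about details=true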
