add process misreads process status #451

Closed
deftdawg opened this issue May 24, 2019 · 11 comments

deftdawg commented May 24, 2019

For several processes on my RHEL 7.3 server, goss seems to misrecord the statuses of processes...

As a test, I pipe every process in ps into a goss add process command...

for p in $(ps -ef | grep -v '\[' | cut -b49- | cut -d" " -f1 | cut -d: -f1 | rev | cut -d/ -f1 | rev |sort -u); do 
  ../goss a process ${p}
done

Apart from the fluff (CMD from ps header, and ps, cut, rev, sort used in the command above), one would expect all these processes to be actual running processes. Yet, goss reports several of them as NOT running.

If I take just systemd processes as an example:

ps -ef | grep systemd
root         1     0  0 May21 ?        00:00:51 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
root       898     1  0 May21 ?        00:01:01 /usr/lib/systemd/systemd-journald
root       932     1  0 May21 ?        00:00:00 /usr/lib/systemd/systemd-udevd
dbus      2353     1  0 May21 ?        00:00:23 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root      2383     1  0 May21 ?        00:00:08 /usr/lib/systemd/systemd-logind

We see four systemd processes (plus dbus-daemon, which matches the grep only through its arguments), all running since May 21. If we examine the goss.yaml file, we see this:

grep -A1 systemd goss.yaml
  systemd:            
    running: true     
  systemd-journald:   
    running: false    
  systemd-logind:     
    running: true     
  systemd-udevd:      
    running: true

3 of the 4 are listed as running, but even though systemd-journald is running, it is reported as false...

Whatever is happening is consistent, in the sense that validation will pass confirming "systemd-journald" is not running when goss re-executes.

There are a few other processes (excluding ps, cut, rev, sort from the command that fed goss) from a vendor product on my system that similarly report as running false when they are actually running...

Any idea what is happening to cause this strange behavior?

[This is using v0.3.7 AMD64]
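For reference, the ps pipeline above derives names from the full command line, while the rest of this thread establishes that the check matches on the command name (`comm`). A minimal sketch, assuming a Linux `/proc` filesystem, that collects the `comm` values directly; the function name `list_comm_names` and the `proc_root` parameter are made up for illustration and are not part of goss. Unlike the `grep -v '\['` filter above, this also includes kernel threads.

```python
from pathlib import Path


def list_comm_names(proc_root="/proc"):
    """Return the sorted, unique comm values found under proc_root.

    Hypothetical helper, not part of goss. Assumes the Linux layout
    <proc_root>/<pid>/comm.
    """
    names = set()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-PID entries such as "self" or "sys"
        try:
            text = (entry / "comm").read_text(errors="replace")
        except OSError:
            continue  # the process exited or the file is unreadable
        names.add(text.rstrip("\n"))
    return sorted(names)
```

Each returned name could then be passed to `goss a process` in place of the names produced by the `cut`/`rev` pipeline.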

aelsabbahy added the bug label Jun 4, 2019
@aelsabbahy
Member

I was able to reproduce this; I'll have to investigate further.

h44z commented Jun 11, 2020

Any news on this issue?

@aelsabbahy
Member

I assume you've hit this bug.

Can you tell me which process is experiencing this issue?

Also, can you provide the output of ps -p <PID> -o comm,cmd?

h44z commented Jun 12, 2020

Yes, I have this bug in one of my Docker images.

Here is the output of ps:

root@3f4893efc885:/app# ps -p 1 -o comm,cmd
COMMAND         CMD
lightdata_endpo /app/lightdata_endpoint

And this is the goss YAML I use for the check:

process:
  lightdata_endpoint:
    running: true

I see that COMMAND is truncated. Maybe that is the problem?

@aelsabbahy
Member

Yeah, it's based on the command name. The documentation was updated sometime after this ticket was created to clarify that.

Does goss a process work just fine with the output of ps -p 1 -o comm?

h44z commented Jun 12, 2020

Yes, if I use the name that ps -p 1 -o comm shows, it works. So the problem is that ps truncates the command name after 15 characters. It would be good to mention that in the process documentation as well.
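To make the 15-character limit concrete, here is a rough sketch of the name one would usually expect in `comm` for a given executable path. It assumes Linux, where `comm` holds at most 15 bytes (the kernel's `TASK_COMM_LEN` is 16 including the terminating NUL) and normally starts out as the basename of the executed file. `expected_comm` is a hypothetical helper, not something goss provides, and a process can rename itself, so treat the result as a guess to verify rather than a guarantee.

```python
import os

# Visible bytes in comm on Linux: TASK_COMM_LEN (16) minus the NUL terminator.
TASK_COMM_MAX = 15


def expected_comm(executable_path):
    """Guess the comm value for an executable: its basename, cut to 15 bytes.

    Hypothetical helper for illustration only. The truncation is done on
    bytes, so a multi-byte character split at the limit is dropped.
    """
    base = os.path.basename(executable_path)
    return base.encode()[:TASK_COMM_MAX].decode(errors="ignore")


print(expected_comm("/app/lightdata_endpoint"))            # lightdata_endpo
print(expected_comm("/usr/lib/systemd/systemd-journald"))  # systemd-journal
```

Both printed names match what ps reports in this thread, and those are the names the process check needs in goss.yaml.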

@aelsabbahy
Member

It can be mentioned in the documentation, but it's a little more nuanced than just truncated process name.

In most cases comm will be a truncated basename of the first argument to command/args but it's not something that's enforced and isn't always the case.

argv is an array of pointers to strings passed to the new program as
its command-line arguments. By convention, the first of these
strings (i.e., argv[0]) should contain the filename associated with
the file being executed. The argv array must be terminated by a NULL
pointer. (Thus, in the new program, argv[argc] will be NULL.)

https://www.man7.org/linux/man-pages/man2/execve.2.html

contrived example:

import os
os.execlp("sleep", "foo", "30")

Running that Python code, here's the output of ps -p 415 -o pid,comm,command:

$ ps -p 415 -o pid,comm,command
  PID COMMAND         COMMAND
  415 sleep           foo 30
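The same comparison can be made without ps by reading `/proc` directly. The sketch below assumes the Linux layout, where `/proc/<pid>/comm` holds the command name and `/proc/<pid>/cmdline` holds the NUL-separated argument vector. `comm_and_argv` and its `proc_root` parameter are illustrative names only, not part of goss.

```python
from pathlib import Path


def comm_and_argv(pid, proc_root="/proc"):
    """Return (comm, argv) for a PID, read from the proc filesystem.

    Hypothetical helper for illustration only. argv is an empty list for
    processes with an empty cmdline, such as kernel threads.
    """
    base = Path(proc_root) / str(pid)
    comm = (base / "comm").read_text(errors="replace").rstrip("\n")
    raw = (base / "cmdline").read_bytes()
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator after the last argument
    argv = [arg.decode(errors="replace") for arg in raw.split(b"\0")] if raw else []
    return comm, argv
```

For the process started by the Python snippet above, this would be expected to return `("sleep", ["foo", "30"])`, matching the ps output: the command name follows the executed file, while `argv[0]` is whatever the caller passed.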

Real-world example, Firefox:

COMMAND         COMMAND
firefox         /usr/lib64/firefox/firefox
Web Content     /usr/lib64/firefox/firefox -contentproc -childID 1 -isForBrowser -prefsLen 1 -prefMapSize 173662 -schedulerPrefs 0001,2 -parentBuildID 20181121183750 -greomni /usr/lib64/firefox/omni.ja -appomni /usr/lib64/firefox/browser/omni.ja -appdir /usr/lib64/firefox/browser 5005 true tab
WebExtensions   /usr/lib64/firefox/firefox -contentproc -childID 2 -isForBrowser -prefsLen 4111 -prefMapSize 173662 -schedulerPrefs 0001,2 -parentBuildID 20181121183750 -greomni /usr/lib64/firefox/omni.ja -appomni /usr/lib64/firefox/browser/omni.ja -appdir /usr/lib64/firefox/browser 5005 true tab

That all said, I'm open to any suggestions or recommendations on how to word the process documentation better. It currently reads as follows:

NOTE: This check is inspecting the name of the binary, not the name of the process. For example, a process with the name nginx: master process /usr/sbin/nginx would be checked with the process nginx. To discover the binary of a pid run ps -p <PID> -o comm.

@aelsabbahy
Member

Looks like switching to gopsutil will fix this. That's scheduled for v0.4.0 of goss since it's a breaking change. See #597.

aelsabbahy added the v4 label Jul 17, 2020
aelsabbahy added this to the v0.4.0 milestone Jul 17, 2020
aelsabbahy mentioned this issue Nov 13, 2020
@aelsabbahy
Member

Keeping this open. There are some challenges with gopsutil, so this is not fixed yet. I had to remove gopsutil from the v4 release.

@aelsabbahy
Member

gopsutil was not released as part of 4 due to it being too magical and difficult to reproduce as an end user.

That said, the documentation has been updated to recommend: cat -E /proc/634/comm for getting the process name.

For example:

$ cat -E /proc/634/comm
systemd-journal$
$ goss a process systemd-journal
Adding Process to './goss.yaml':

systemd-journal:
  running: true

I don't believe this is still a bug as the documentation has been updated to reflect expectations.

There are still some enhancements I'd like to make to the process detection (e.g. duplicate processes with the same name), but I consider those feature enhancements.

Anyway, let me know if you disagree. If not, I'll close out this issue in a week or two.

@aelsabbahy
Member

Closing. Feel free to comment here or open a new issue if this is still a problem.
