Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common: Improving message sent to user when getting signals #21000

Merged
merged 2 commits into from Apr 23, 2018

Conversation

Projects
None yet
5 participants
@ErwanAliasr1
Copy link
Contributor

commented Mar 22, 2018

As per https://tracker.ceph.com/issues/23320, if we get an abort() or pthread_kill() we could end up in a situation were the info reported to the user is very misleading since we try to get the program name while pid = 0.
In this particular case, we spent time trying to understand the actual value of the structure and why it was reported as 0/Unknown.

The fact this is because the kernel is the emitter in such case. So this PR is about making the sender being explicted when its kernel and offer a way to specify a proper message regarding si_code value, unless print the structure for debugging purpose.

@ErwanAliasr1 ErwanAliasr1 requested a review from tchaikov Mar 23, 2018

@ErwanAliasr1 ErwanAliasr1 force-pushed the evelu-23320 branch 3 times, most recently from d66005a to e6c5b7e Mar 23, 2018

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2018

Rebased.

@ErwanAliasr1 ErwanAliasr1 requested a review from liewegas Mar 27, 2018

@tchaikov
Copy link
Contributor

left a comment

please avoid using notations like #23320, it will collide with a PR with the same id in future. probably you could reference the bug using the URL of the tracker ticket like other commits?

<< " UID: " << siginfo->si_uid
<< dendl;
handlers[signum]->handler(signum);
ostringstream message;

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 27, 2018

Contributor

no need to collect the error messages in message, would be easier to print them directly to derr

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 27, 2018

Author Contributor

That was my first idea but figured that derr expects a dendl for each line making this single line output generated in multi line of code impossible. Am I wrong ?

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 30, 2018

Contributor

no, you are right. i just thought it's overkill to print them in a single line. and if you insist, you could use *_dout .

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 30, 2018

Author Contributor

but isn't dout to be on stdout while I need stderr ?

message << "received signal: " << sig_str(signum);
switch (siginfo->si_code) {
case SI_USER:
message << " from " << get_name_by_pid(siginfo->si_pid);

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 27, 2018

Contributor

i'd suggest update get_name_by_pid(), and return "kernel" if pid is 0. please note, Ceph is portable and runs on FreeBSD also.

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 27, 2018

Author Contributor

That's exactly the aim of the first patch of this PR.
I updated it to return "Kernel" instead of "Linux Kernel".

if (siginfo->si_pid) {
message << " (PID: " << siginfo->si_pid << ")";
} else {
message << " ( Could be generated by pthread_kill(), raise(), abort(), alarm() )";

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 27, 2018

Contributor

it does not help from user's perspective. probably we can just print out the PID even it's zero.

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 27, 2018

Author Contributor

I put that as it took me some time to figure that this signal could be generated by the code itself and no only by a ctrl+c or explicit kill.

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 30, 2018

Contributor

well, how do we expect user to interpret this ?

( Could be generated by pthread_kill(), raise(), abort(), alarm() )

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 30, 2018

Author Contributor

It's not really for users but much more for advanced users or support.
I was in a support case where this code got triggered and we though for a while that the signal was sent by an external process as the error message was saying so.
It took sometime before considering the signal was not sent by an external tool but self-triggered by ceph itself.
If I had that message saying the signal could be triggered by ceph itself then I could have being focused on the real cause and not tracking my system & other processes.
That's why adding this information could save time for the next people facing this situation.

message << " UID: " << siginfo->si_uid;
break;
default:
/* As we have an unhandled signal, let's report the structure to help debugging */

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 27, 2018

Contributor

i don't think the signal is unhandled.

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 27, 2018

Author Contributor

Right. Updated this way.

message << ", si_pid : " << siginfo->si_pid;
message << ", si_uid : " << siginfo->si_uid;
message << ", si_addr" << siginfo->si_addr;
message << ", si_status" << siginfo->si_status;

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 27, 2018

Contributor

i'd suggest not print these variables unless they will be helpful for postmortem debugging.

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 27, 2018

Author Contributor

That's exactly the purpose of that part. When I had to debug a issue related to this code, I had to print all that structure as fields depends on the signaled as per : https://www.mkssoftware.com/docs/man5/siginfo_t.5.asp#Signal_Codes

This comment has been minimized.

Copy link
@tchaikov

tchaikov Mar 30, 2018

Contributor

@ErwanAliasr1 i don't really understand. if your intention is to debug an issue related to this code, i guess you've finished debugging, no?

This comment has been minimized.

Copy link
@ErwanAliasr1

ErwanAliasr1 Mar 30, 2018

Author Contributor

Looks like I'm not clear enough :)
During my debugging sessions, If I had that information being printed, I would have been in a better position to understand what type of signal we got & its meaning.
So my debbugging sessions was to print that information to sort out the case we were in, meaning, checkouting the tag, adding this patch, rebuilding & trick the versionning to let this binary being accepted by the other libs.
That's a lot of energy for very few information.
This code isn't supposed to be printed unless we are in a weird case. If so, printing will save a lot of time for the support/developper.

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2018

@tchaikov Is it better this way ?

@ErwanAliasr1 ErwanAliasr1 force-pushed the evelu-23320 branch from aea3714 to 54dfdd7 Apr 3, 2018

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Apr 3, 2018

Rebased against master

@ErwanAliasr1 ErwanAliasr1 force-pushed the evelu-23320 branch from 54dfdd7 to 8c6d742 Apr 4, 2018

@smithfarm
Copy link
Contributor

left a comment

As @tchaikov already requested, please change the following line in the commit message:

Fixes: #23320

to the correct form:

Fixes: http://tracker.ceph.com/issues/23320

(When the commit message is displayed in github, github will display the string "#23320" as a hyperlink to "#23320" once such a pull request comes to exist, and that will be bogus.)

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Apr 6, 2018

Oh sorry @smithfarm @tchaikov ... I missed that one.

Fixed & rebased.

@ErwanAliasr1 ErwanAliasr1 force-pushed the evelu-23320 branch from 8c6d742 to a500fbb Apr 6, 2018

Fixes line was fixed

@smithfarm

This comment has been minimized.

Copy link
Contributor

commented Apr 6, 2018

@ErwanAliasr1 thanks!

@ErwanAliasr1 ErwanAliasr1 force-pushed the evelu-23320 branch from a500fbb to ede0bca Apr 10, 2018

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2018

Rebased against master.
@tchaikov is it ok for you then ?

@liewegas liewegas changed the title Improving message sent to user when getting signals common: Improving message sent to user when getting signals Apr 12, 2018

@liewegas liewegas added the needs-qa label Apr 12, 2018

@liewegas liewegas added this to the mimic milestone Apr 12, 2018

@ErwanAliasr1 ErwanAliasr1 force-pushed the evelu-23320 branch from ede0bca to afbd772 Apr 12, 2018

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Apr 12, 2018

Rebased on master to stay up to date

ErwanAliasr1 added some commits Mar 22, 2018

signal_handler: Reports pid 0 as Kernel
When receiving a signal, get_name_by_pid() tries to find what was the process name of the sender.

As per bug #23320, we have case where pid is 0 leading to the following confusing message :

    signal: Interrupt from PID: 0 task name: <unknown> UID: 0

This commit is about returning an explicit "Kernel" as a task name if pid is 0

Signed-off-by: Erwan Velu <erwan@redhat.com>
signal_handler: Implementing specific messages per context
As per bug #23320, regarding the signal we receive, it could be interesting providing a specific message.

If we get a SI_USER we almost print the actual message except we don't print the PID which have no meaning when set to 0.

In all other cases, we do print the full structure to help debuggers understanding the received signal.

This patch is offering a better way to get specific messages based on various si_code values.
It could be expanded later based on debugging feedbacks.

Fixes: http://tracker.ceph.com/issues/23320
Signed-off-by: Erwan Velu <erwan@redhat.com>

@ErwanAliasr1 ErwanAliasr1 force-pushed the evelu-23320 branch from afbd772 to b1037f9 Apr 19, 2018

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Apr 19, 2018

Rebased against master.

@ErwanAliasr1

This comment has been minimized.

Copy link
Contributor Author

commented Apr 19, 2018

Can we make this PR being merged ?

@liewegas liewegas merged commit f9cd5b1 into master Apr 23, 2018

4 of 5 checks passed

make check (arm64) make check failed
Details
Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details

@liewegas liewegas deleted the evelu-23320 branch Apr 23, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.