prevent rare crash when two processes "share" blackbox file #309

jnpkrn · 2018-01-22T14:29:21Z

Take it as a preview, I could not compile the master readily, need to
look at that more.

jfriesse · 2018-01-22T15:04:36Z

Ugh. So ... I would like to ask also chrissie to review that, but from my side it's nack. Super hard code (compared to PR #308) with probable bonus of getting few more startup lines in blackbox. Startup lines are actually not even very interesting

chrissie-c · 2018-01-22T15:20:50Z

I agree with Honza, it's a lot of code to protect something we actually don't care about.

jnpkrn · 2018-01-22T15:42:58Z

You call standard safe signal-based synchronization hard? Really?
Sure, unsafe is always an option in high availability.

jnpkrn · 2018-01-22T15:44:37Z

+ I was super verbose about the error condtions, can be cut as well, if you prefer.

jfriesse · 2018-01-22T16:16:17Z

@jnpkrn I believe you didn't fully understood what I've tried to say you. So let me retry it again (ignoring first part of the comment #309 (comment) because this is just pure trolling).

Patch itself is about adding synchronization between parent/child. What I'm trying to explain you is, that this is not needed and it's a lot of the code for getting log messages which are not needed.

But the case you are trying to solve Because there's a danger that now-released main PID will get picked by new corosync process (e.g. just with -v switch) and the blackbox simply cannot happen because the logging is not called between corosync_tty_detach and re-enabling the blackbox (with new pid).

jnpkrn · 2018-01-22T17:26:53Z

How you come to the conclusion:

I am trolling (that's a nasty assumption to start with, especially since it's
not true, being sarcastic about so shortcut denial comments without
technical merit is whole a lot of different story)
the messages in question are not needed
(apparently they were at some point, otherwise blackbox would not
be enabled this early)
I am talking about messages between cosorync_tty_detach and re-enabling
the blackbox
?

I am not a corosync maintainer, so I am not to decide, though I will be happy to
adapt the proposed patch to fit the needs. Apparently my aim is to have a good
technical solution without avoidable trade-offs. If not taken, fine.

I am going to add fitting libqb counterpart, so at least some clients can leverage
this scheme.

jfriesse · 2018-01-23T09:16:46Z

@jnpkrn

Experience. Good if you were not.
Blackbox is to some extend useful for short live messages which are not logged into other places (stderr/syslog/logfile). Such messages are not created during startup.
So what have you been talking about?

Please start with fixing libqb, then we can discuss how to fix Corosync.

From [1]: """ It was discovered that corosync exposes itself for a self-crash under rare circumstance whereby corosync executable is run when there is already a daemon instance around (does not apply to corosync serving without any backgrounding, i.e. launched with "-f" switch). Such a circumstance can be provoked unattendedly by the third party, incl. "corosync -v" probe triggered internally by pcs (since 9e19af58 ~ 0.9.145), which is what makes the root cause analysis of such inflicted crash somewhat difficult to guess & analyze (the other reason may be rather runaway core dump if produced at all due to fencing coming, based on the few observed cases). The problem comes from the fact that corosync is arranged such that the logging is set up very early, even before the main control flow of the program starts. And part of this early enabling is also starting "blackbox" recording, which uses mmap'd file stored in /dev/shm that, moreover, only varies on PID that is part of the file name -- and when corosync performs the fork so as to detach itself from the environment that started it, such PID is free to be reused. And against all odds, when that happens with this fresh new corosync process, it happily mangles the file underneath the former daemon one, leading to crashes indicated by SIGBUS, rarely also SIGFPE. """ The solution is to restart blackbox logging, either explicitly (for unpatched libqb) or, preferably, implicitly with what patched libqb is supposed to support. [1] http://oss.clusterlabs.org/pipermail/users/2018-January/007169.html Signed-off-by: Jan Pokorný <jpokorny@redhat.com>

jnpkrn · 2018-01-26T23:47:56Z

Sure, here is the preliminary version (incl. test!):
ClusterLabs/libqb#293

That's the change this proposed patch counts with already.

jfriesse · 2018-01-30T13:09:07Z

I've decided to push #308 in favor of this PR simply because #308 is simpler, not dependent on LibQB change and also brings possibility to turn off blackbox completely.

Thank you for the patch and HUGE kudos for finding the main reason of SIGBUS,
Honza

jnpkrn · 2018-01-30T13:51:45Z

Really sad to see technical continuity ditched for proclaimed simplicity.

jnpkrn force-pushed the fix-blackbox-on-fork branch from eddf68f to d58e3a3 Compare January 26, 2018 23:45

jfriesse closed this Jan 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prevent rare crash when two processes "share" blackbox file #309

prevent rare crash when two processes "share" blackbox file #309

jnpkrn commented Jan 22, 2018

jfriesse commented Jan 22, 2018 •

edited

chrissie-c commented Jan 22, 2018

jnpkrn commented Jan 22, 2018

jnpkrn commented Jan 22, 2018

jfriesse commented Jan 22, 2018

jnpkrn commented Jan 22, 2018 •

edited

jfriesse commented Jan 23, 2018

jnpkrn commented Jan 26, 2018 •

edited

jfriesse commented Jan 30, 2018 •

edited

jnpkrn commented Jan 30, 2018

prevent rare crash when two processes "share" blackbox file #309

prevent rare crash when two processes "share" blackbox file #309

Conversation

jnpkrn commented Jan 22, 2018

jfriesse commented Jan 22, 2018 • edited

chrissie-c commented Jan 22, 2018

jnpkrn commented Jan 22, 2018

jnpkrn commented Jan 22, 2018

jfriesse commented Jan 22, 2018

jnpkrn commented Jan 22, 2018 • edited

jfriesse commented Jan 23, 2018

jnpkrn commented Jan 26, 2018 • edited

jfriesse commented Jan 30, 2018 • edited

jnpkrn commented Jan 30, 2018

jfriesse commented Jan 22, 2018 •

edited

jnpkrn commented Jan 22, 2018 •

edited

jnpkrn commented Jan 26, 2018 •

edited

jfriesse commented Jan 30, 2018 •

edited