Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent rare crash when two processes "share" blackbox file #309

Closed
wants to merge 1 commit into from

Conversation

jnpkrn
Copy link
Contributor

@jnpkrn jnpkrn commented Jan 22, 2018

Take it as a preview, I could not compile the master readily, need to
look at that more.

@jfriesse
Copy link
Member

jfriesse commented Jan 22, 2018

Ugh. So ... I would like to ask also chrissie to review that, but from my side it's nack. Super hard code (compared to PR #308) with probable bonus of getting few more startup lines in blackbox. Startup lines are actually not even very interesting

@chrissie-c
Copy link
Contributor

I agree with Honza, it's a lot of code to protect something we actually don't care about.

@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 22, 2018

You call standard safe signal-based synchronization hard? Really?
Sure, unsafe is always an option in high availability.

@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 22, 2018

+ I was super verbose about the error condtions, can be cut as well, if you prefer.

@jfriesse
Copy link
Member

@jnpkrn I believe you didn't fully understood what I've tried to say you. So let me retry it again (ignoring first part of the comment #309 (comment) because this is just pure trolling).

Patch itself is about adding synchronization between parent/child. What I'm trying to explain you is, that this is not needed and it's a lot of the code for getting log messages which are not needed.

But the case you are trying to solve Because there's a danger that now-released main PID will get picked by new corosync process (e.g. just with -v switch) and the blackbox simply cannot happen because the logging is not called between corosync_tty_detach and re-enabling the blackbox (with new pid).

@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 22, 2018

How you come to the conclusion:

  • I am trolling (that's a nasty assumption to start with, especially since it's
    not true, being sarcastic about so shortcut denial comments without
    technical merit is whole a lot of different story)
  • the messages in question are not needed
    (apparently they were at some point, otherwise blackbox would not
    be enabled this early)
  • I am talking about messages between cosorync_tty_detach and re-enabling
    the blackbox
    ?

I am not a corosync maintainer, so I am not to decide, though I will be happy to
adapt the proposed patch to fit the needs. Apparently my aim is to have a good
technical solution without avoidable trade-offs. If not taken, fine.

I am going to add fitting libqb counterpart, so at least some clients can leverage
this scheme.

@jfriesse
Copy link
Member

@jnpkrn

  • Experience. Good if you were not.
  • Blackbox is to some extend useful for short live messages which are not logged into other places (stderr/syslog/logfile). Such messages are not created during startup.
  • So what have you been talking about?

Please start with fixing libqb, then we can discuss how to fix Corosync.

From [1]:

"""
It was discovered that corosync exposes itself for a self-crash
under rare circumstance whereby corosync executable is run when there
is already a daemon instance around (does not apply to corosync serving
without any backgrounding, i.e. launched with "-f" switch).

Such a circumstance can be provoked unattendedly by the third party,
incl. "corosync -v" probe triggered internally by pcs (since 9e19af58
~ 0.9.145), which is what makes the root cause analysis of such
inflicted crash somewhat difficult to guess & analyze (the other
reason may be rather runaway core dump if produced at all due to
fencing coming, based on the few observed cases).

The problem comes from the fact that corosync is arranged such that
the logging is set up very early, even before the main control flow
of the program starts.  And part of this early enabling is also
starting "blackbox" recording, which uses mmap'd file stored in
/dev/shm that, moreover, only varies on PID that is part of the file
name -- and when corosync performs the fork so as to detach itself
from the environment that started it, such PID is free to be reused.
And against all odds, when that happens with this fresh new corosync
process, it happily mangles the file underneath the former daemon one,
leading to crashes indicated by SIGBUS, rarely also SIGFPE.
"""

The solution is to restart blackbox logging, either explicitly (for
unpatched libqb) or, preferably, implicitly with what patched libqb
is supposed to support.

[1] http://oss.clusterlabs.org/pipermail/users/2018-January/007169.html

Signed-off-by: Jan Pokorný <jpokorny@redhat.com>
@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 26, 2018

Sure, here is the preliminary version (incl. test!):
ClusterLabs/libqb#293

That's the change this proposed patch counts with already.

@jfriesse
Copy link
Member

jfriesse commented Jan 30, 2018

I've decided to push #308 in favor of this PR simply because #308 is simpler, not dependent on LibQB change and also brings possibility to turn off blackbox completely.

Thank you for the patch and HUGE kudos for finding the main reason of SIGBUS,
Honza

@jfriesse jfriesse closed this Jan 30, 2018
@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 30, 2018

Really sad to see technical continuity ditched for proclaimed simplicity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants