prevent rare crash when two processes "share" blackbox file #309
Ugh. So... I would also like to ask chrissie to review this, but from my side it's a nack. It is super hard code (compared to PR #308), with the probable bonus of getting a few more startup lines in the blackbox. Startup lines are actually not even very interesting.
I agree with Honza, it's a lot of code to protect something we actually don't care about.
You call standard safe signal-based synchronization hard? Really?
+ I was super verbose about the error conditions; that can be cut as well, if you prefer.
@jnpkrn I believe you didn't fully understand what I tried to tell you, so let me try again (ignoring the first part of the comment #309 (comment), because that is just pure trolling). The patch itself is about adding synchronization between parent and child. What I'm trying to explain is that this is not needed, and it's a lot of code for getting log messages which are not needed. But the case you are trying to solve Because there's a danger that now-released main PID will get picked by new corosync process (e.g. just with -v switch) and the blackbox simply cannot happen because the logging is not called between
How do you come to the conclusion:
I am not a corosync maintainer, so I am not to decide, though I will be happy to I am going to add fitting libqb counterpart, so at least some clients can leverage
Please start with fixing libqb, then we can discuss how to fix Corosync.
From [1]:

"""
It was discovered that corosync exposes itself for a self-crash under rare circumstance whereby corosync executable is run when there is already a daemon instance around (does not apply to corosync serving without any backgrounding, i.e. launched with "-f" switch). Such a circumstance can be provoked unattendedly by the third party, incl. "corosync -v" probe triggered internally by pcs (since 9e19af58 ~ 0.9.145), which is what makes the root cause analysis of such inflicted crash somewhat difficult to guess & analyze (the other reason may be rather runaway core dump if produced at all due to fencing coming, based on the few observed cases).

The problem comes from the fact that corosync is arranged such that the logging is set up very early, even before the main control flow of the program starts. And part of this early enabling is also starting "blackbox" recording, which uses mmap'd file stored in /dev/shm that, moreover, only varies on PID that is part of the file name -- and when corosync performs the fork so as to detach itself from the environment that started it, such PID is free to be reused. And against all odds, when that happens with this fresh new corosync process, it happily mangles the file underneath the former daemon one, leading to crashes indicated by SIGBUS, rarely also SIGFPE.
"""

The solution is to restart blackbox logging, either explicitly (for unpatched libqb) or, preferably, implicitly with what patched libqb is supposed to support.

[1] http://oss.clusterlabs.org/pipermail/users/2018-January/007169.html

Signed-off-by: Jan Pokorný <jpokorny@redhat.com>
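To make the failure mode in the commit message concrete, here is a minimal sketch in plain POSIX C. It is not corosync or libqb code: the path prefix `/dev/shm/example-blackbox-` and all three helper names are hypothetical, chosen only to illustrate a file name that "only varies on PID" and why a name computed before `fork()` has to be recomputed (i.e. the recording restarted) in the child.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical naming scheme: the PID is the only varying part of the name. */
static int blackbox_path(char *buf, size_t len, pid_t pid)
{
    return snprintf(buf, len, "/dev/shm/example-blackbox-%ld", (long)pid);
}

/* Returns 1 when two processes would end up mapping the very same file,
 * which is what happens once a released PID gets reused. */
static int blackbox_paths_collide(pid_t a, pid_t b)
{
    char path_a[64], path_b[64];

    blackbox_path(path_a, sizeof(path_a), a);
    blackbox_path(path_b, sizeof(path_b), b);
    return strcmp(path_a, path_b) == 0;
}

/* Returns 1 when a path computed earlier (e.g. by the parent, before
 * fork()) no longer matches the calling process's own PID. */
static int blackbox_path_is_stale(const char *earlier_path)
{
    char current[64];

    blackbox_path(current, sizeof(current), getpid());
    return strcmp(current, earlier_path) != 0;
}
```

In this sketch, a daemonized child that inherited its parent's path would see `blackbox_path_is_stale()` return 1, while a later process that happens to receive the parent's old PID would compute exactly that inherited path. Re-deriving the name in the child removes the overlap.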
eddf68f to d58e3a3
Sure, here is the preliminary version (incl. test!): That's the change this proposed patch already counts on.
Really sad to see technical continuity ditched for proclaimed simplicity.
Take it as a preview; I could not readily compile master and need to look at that more.