Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

corosync daemon at 100% cpu usage #225

Closed
NicolasMa opened this issue Jun 27, 2017 · 18 comments

Comments

Projects
None yet
2 participants
@NicolasMa
Copy link
Contributor

commented Jun 27, 2017

Hi all,
me and my company had an issue with the corosync daemon, where it was eating all the cpu and wasn't replying to request anymore.
Having reproduced the problem, we found that it came from a bad file descriptor used by the totem protocol handler. While the reason we got an invalid file descriptor is still unknown, we tried to workaround this issue by adding a check just after the calls to recvmsg in order to return an error if the file descriptor is invalid.
Here is the path:

diff --git a/exec/totemudp.c b/exec/totemudp.c
index 31d05704..771e925d 100644
--- a/exec/totemudp.c
+++ b/exec/totemudp.c
@@ -487,6 +487,8 @@ static int net_deliver_fn (
 
        bytes_received = recvmsg (fd, &msg_recv, MSG_NOSIGNAL | MSG_DONTWAIT);
        if (bytes_received == -1) {
+               if ( (errno == EBADF) || (errno == ECONNRESET) )
+                       return (-1);
                return (0);
        } else {
                instance->stats_recv += bytes_received;
diff --git a/exec/totemudpu.c b/exec/totemudpu.c
index 037f82b4..c4613ff0 100644
--- a/exec/totemudpu.c
+++ b/exec/totemudpu.c
@@ -474,6 +474,8 @@ static int net_deliver_fn (
 
        bytes_received = recvmsg (fd, &msg_recv, MSG_NOSIGNAL | MSG_DONTWAIT);
        if (bytes_received == -1) {
+               if ( (errno == EBADF) || (errno == ECONNRESET) )
+                       return (-1);
                return (0);
        } else {
                instance->stats_recv += bytes_received;

Please have a look and reports us if this looks reasonable to you.

@jfriesse

This comment has been minimized.

Copy link
Member

commented Jun 28, 2017

@NicolasMa Thanks for report. I have a question - problem is EBADF or ECONNRESET? Because EBADF really shouldn't happen and if it is happening it has to be solved elsewhere in the code.

Do you have reproducer? What version of libqb/corosync are you using?

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2017

The problem we observed was due to a EBADF, but I decided to also check for the ECONNRESET error to avoid future problems.
We were only able to reproduce the issue once, and did not get it anymore since this patch was applied in our test environnement.
The problem happens on the version 1.4.8 of corosync, but since this code doesn't seems to have changed since I though it was still usefull to report this.
Also note that I agree on the fact that solving the reason why we get a EBADF would be far better, but until we found where this comes from we should handle this so that it will no longer result in a not-responding daemon.

Regards,
Nicolas.

@jfriesse

This comment has been minimized.

Copy link
Member

commented Jun 29, 2017

@NicolasMa Yep, you are right, since 1.4.8 code in totemudp flatiron branch didn't change. I also really appreciate your patch. What I'm not sure is what exactly it solves? Because it's true corosync with patch stops using 100% CPU but it also stops receiving any message effectively making it dead without letting user know.

So as it's patch today is no go. I can however imagine have assert because EBADF really shouldn't happen. Econnreset is really not needed to catch because it should happen only for connected transfers, what is not udp case.

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Aug 18, 2017

Hi,
back to this problem. Recently, we discovered a possible cause of the problem. In fact we saw that when corosync is eating lot of cpu, it isn't listening to the right socket anymore, but listen to an ephemeral port on the adress "*" instead.
Also in the logs, we find this:

Aug 18 14:53:39:839 corosync [TOTEM ] Unable to bind the socket to send multicast packets: Address already in use (48)

Having a look at the code (in totemudp_build_sockets_ip), it seems that bind errors aren't really managed since the method just leave without releasing any ressource it allocated and the calling code just ignore the returned error_code.
As a quick fix I tried to make corosync leave in that case (either by calling corosync_shutdown_request directly or by sending a SIGTERM signal), but I ended having a crash instead.
Here is the backtrace I get in both cases:
(gdb) bt
#0 thr_kill () at thr_kill.S:3
#1 0x0000000801359616 in __raise (s=) at /usr/home/build/fw-trunk/FreeBSD/tmp/usr/src/lib/libc/gen/raise.c:51
#2 0x0000000801082b4a in handle_signal (actp=, sig=, info=, ucp=)
at /usr/home/build/fw-trunk/FreeBSD/tmp/usr/src/lib/libthr/thread/thr_sig.c:245
#3 0x000000080108222c in thr_sighandler (sig=, info=, _ucp=)
at /usr/home/build/fw-trunk/FreeBSD/tmp/usr/src/lib/libthr/thread/thr_sig.c:188
#4
#5 semop () at semop.S:3
#6 0x0000000800c6dab0 in ipc_sem_wait () at coroipc_ipc.h:291
#7 0x0000000800c6d9a2 in pthread_ipc_consumer (conn=0x801e11080) at coroipcs.c:628
#8 0x000000080107d855 in thread_start (curthread=0x801c09000) at /usr/home/build/fw-trunk/FreeBSD/tmp/usr/src/lib/libthr/thread/thr_create.c:288
#9 0x00007fffdfdbf000 in ?? ()
#10 0x0000000000000000 in ?? ()

Any idea on how I can leave the process properly when this happen?

@jfriesse

This comment has been minimized.

Copy link
Member

commented Aug 21, 2017

@NicolasMa Correct shutdown of corosync 1.x is very hard to do so it's really no surprise it crashed.

Were you able to find out bind call which resulted in error? If so, you can try to simply call exit, what will (with configured power fencing) result in fencing node. If this happens, can you please share function which has to be checked?

Thanks,
Honza

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Aug 21, 2017

The bind who fails is in totemudp_build_sockets_ip, this one:
res = bind (sockets->mcast_send, (struct sockaddr *)&sockaddr, addrlen);
Note that the remarks is applicable to every call to bind in totemudp_build_sockets_ip, since they all does the same in case of error.
Calling exit here is not a good solution for us, because we have other processes which could get stuck in ipc_sem_wait when corosync doesn't exit properly (right now _POSIX_THREAD_PROCESS_SHARED is set to 0 for us, but it may change in the future though).

@jfriesse

This comment has been minimized.

Copy link
Member

commented Aug 21, 2017

@NicolasMa Ok, so question is why it fails. Socket has reuse option set so it shouldn't fail if address is not used by any other processes. Also it should be called only on startup, not afterwards (if you don't force it to do rebind - like ifdown).

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Aug 21, 2017

I could confirm that this happen only when the link was lost and has to be rebind. (The reason why it has been lost is not related to corosync).
I also saw that the SO_REUSEADDR is set on the socket, but seems it isn't enough and that we still have a race here.
So for now i'll use a retry-policy to limit the problem, and do an abort if bind still return an error after several calls.

@jfriesse

This comment has been minimized.

Copy link
Member

commented Aug 31, 2017

@NicolasMa Were you able to test method you've described in last sentence? Does it work for you?

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Sep 1, 2017

Hi,
sorry, i forgot to shre the results...
And yes, it seems it works since I wasn't able to reproduce the problem after the fix.
Here is the patch:
corosync-retry-on-bind-errors.patch.txt

Nicolas.

@NicolasMa NicolasMa closed this Sep 1, 2017

@jfriesse

This comment has been minimized.

Copy link
Member

commented Sep 1, 2017

@NicolasMa Sounds good. Were you able (with the patch) ever get to abort() state?

Generally patch looks good. I would recommend use poll(NULL, 0, timeout) rather than usleep because usleep is deprecated and interaction with sleep/alarm/... is undefined. Also retries should be probably zeroed before every loop and max_retries would be better to be defined using #define at the beginning of the file.

With this changes I would be more than happy to merge your patch. Can you please open new PR (or reopen this one) with suggested changes?

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Sep 1, 2017

Were you able (with the patch) ever get to abort() state?

No, not once.

I would recommend use poll(NULL, 0, timeout) rather than usleep because usleep is deprecated and interaction with sleep/alarm/... is undefined.

OK, thanks for the tip.

Also retries should be probably zeroed before every loop and max_retries would be better to be defined using #define at the beginning of the file.

Yes, I just didn't took time to refine this.

With this changes I would be more than happy to merge your patch. Can you please open new PR (or reopen this one) with suggested changes?

Sure.

@NicolasMa NicolasMa reopened this Sep 1, 2017

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Sep 1, 2017

Updated version of the patch.
Note that I wasn't able to completely test it since our test environnement isn't available right now.
corosync-retry-on-bind-errors.patch.txt

@jfriesse

This comment has been minimized.

Copy link
Member

commented Sep 11, 2017

@NicolasMa Patch looks almost good, but I was unable to merge it simply because patch failed. What version is your patch based on? Can you please try to base your patch on top of current needle/master?

Few more comments:

  • It may be better to replace abort(); by log and exit
  • retries should be set to zero before while cycle, so there should be two set operations for each file
@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Sep 15, 2017

Ok, i've rebased my patch against master, so it should apply without problems now.
I also took your remarks into consideration.
corosync-retry-on-bind-errors-v2.patch.txt

@jfriesse

This comment has been minimized.

Copy link
Member

commented Sep 15, 2017

@NicolasMa Nice patch so ACK and I would like to merge. Sadly I was unable to find your email (and full name) to give you the credit. Can you please send me that info?

If you do not want to share these information I can commit as myself and just mention NicolasMa in the commit message but it's really not preferred.

@NicolasMa

This comment has been minimized.

Copy link
Contributor Author

commented Sep 15, 2017

No problem.
My real name is Masse Nicolas.
for my email, you can mention nicolas.masse@stormshield.eu.

Thanks.

@jfriesse

This comment has been minimized.

Copy link
Member

commented Sep 19, 2017

@NicolasMa Thank you for the information. I've merged your patch to master 5b38aa7 and needle 8bbd6c9.

Thank you again for the patch.

@jfriesse jfriesse closed this Sep 19, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.