
Server termination issue #8

Closed
puffetto opened this issue Apr 2, 2022 · 10 comments

Comments

puffetto (Contributor) commented Apr 2, 2022

Hello,

I noticed a little termination issue.

Currently the only way to terminate tkrzw_server is through a signal; however, this is what happens if I start a plain server, perform no other action, wait a couple of seconds, and then hit Ctrl-C:

```
blackye@antani:~/tkrzw-rpc ./tkrzw_server
2022/04/02 14:19:47 [INFO] ======== Starting the process 78688 as a command ========
2022/04/02 14:19:47 [INFO] Version: rpc_pkg=0.9.9, rpc_lib=1.16.0, core_pkg=1.0.24, core_lib=1.67.0
2022/04/02 14:19:47 [INFO] Opening a database: #dbm=tiny
2022/04/02 14:19:47 [INFO] Building the sync server: address=0.0.0.0:1978, id=1
^C2022/04/02 14:19:49 [INFO] Shutting down by signal: 2
[mutex.cc : 1368] RAW: Potential Mutex deadlock:
[mutex.cc : 1380] RAW: Acquiring absl::Mutex 0x801cac1b8 while holding 0x801cac1b8; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1381] RAW: Cycle:
[mutex.cc : 1395] RAW: mutex@0x801cac1b8 stack:
[mutex.cc : 1400] RAW: dying due to potential deadlock
Abort trap (core dumped)
blackye@antani:~/tkrzw-rpc
```

This happens:

  • Always on FreeBSD
  • Sometimes on macOS
  • No idea about other platforms

Digging a bit into the code, it seems there are actually two separate issues:

  1. void ShutdownServer(int signum) is installed as a signal handler, and it calls server->Shutdown(deadline). This is explicitly forbidden by the gRPC documentation and known to cause undefined behaviour (see the discussion in "absl: Potential Mutex deadlock - when using signal handler" grpc/grpc#24884, where the gRPC maintainers closed this exact issue as "not a bug").
  2. The signal handlers, as designed, might be triggered in different threads (besides being delivered, at best, to a random one), so ShutdownServer might be called in parallel on different threads, and this alone is more than enough for a deadlock.

While point 2 is trivial, point 1 is not so trivial to fix in a portable way while sticking to C++17. The solution is to have the signal handler report in some way to a "waiting" thread that it is time to quit; to my knowledge this "report" could be done in several ways:
a) By means of a native C++ semaphore, if we were on C++20, but this is C++17.
b) By means of a POSIX semaphore, but this means a named system-level resource with all that it implies (need to clean up, uncomfortable situation if the server crashes without cleaning up, ...).
c) By means of a volatile/atomic global variable, but as we do not have wait/notify in C++17, the "killer" thread must poll it periodically, say once every second: two context switches every second and up to a one-second delay before quitting (a sketch of this approach follows this list).
d) By means of a mutex, but the signal handler MUST run in the same thread context that locked the mutex initially (and in that case it can safely release the mutex), while the "waiting killer" thread must be a separate one. Technically clean, but not a simple solution.
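For reference, a minimal sketch of what c) could look like; everything here (OnSignal, g_quit, the commented-out server calls) is hypothetical, not the actual tkrzw_server code:

```cpp
// Approach c), sketched: a lock-free atomic flag set by the handler and
// polled by a dedicated "killer" thread. g_quit, OnSignal and the
// commented-out server calls are hypothetical names.
#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

std::atomic<bool> g_quit{false};  // safe in a handler only if lock-free

extern "C" void OnSignal(int) {
  g_quit.store(true);  // the only thing the handler does
}

int main() {
  std::signal(SIGINT, OnSignal);
  std::signal(SIGTERM, OnSignal);
  std::thread killer([] {
    while (!g_quit.load()) {
      std::this_thread::sleep_for(std::chrono::seconds(1));  // the poll
    }
    // server->Shutdown(deadline);  // runs in a normal thread, not a handler
  });
  // server->Wait();
  killer.join();
  return 0;
}
```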

I can dedicate half an hour to sort this out and write the code, but I'd like to know in advance which solution among b), c) and d) is preferred.
If I were the maintainer I would go for d); in this case, yes, we have TWO "idle" threads dedicated to this, but they are really idle (they will stay sleeping until the signal arrives).

Let me know,

A.

puffetto (Contributor, Author) commented Apr 2, 2022

OK, after thinking about it, there is one further approach:
e) Have one thread take all the signals and sleep forever in sigwaitinfo(); then that same thread can do anything outside interrupt context (or detach a temporary thread to do the job). A sketch follows.
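Something along these lines, as a minimal sketch; the names are hypothetical and the server calls are only indicated in comments (I use sigwait() rather than sigwaitinfo() in the sketch, for portability):

```cpp
// Approach e), sketched: block the signals in every thread, then park one
// thread in sigwait() and let it do the shutdown in normal thread context.
#include <csignal>
#include <thread>

int main() {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGINT);
  sigaddset(&set, SIGTERM);
  // Block before spawning any thread, so every thread inherits the mask
  // and no signal is ever delivered asynchronously to a random thread.
  pthread_sigmask(SIG_BLOCK, &set, nullptr);

  // ... build and start the gRPC server here ...

  std::thread waiter([&set] {
    int signum = 0;
    sigwait(&set, &signum);  // sleeps here until SIGINT/SIGTERM arrives
    // Ordinary thread context: server->Shutdown(deadline) is safe here.
  });
  // server->Wait();
  waiter.join();
  return 0;
}
```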
I will try to find half an hour to code this in the next few days.
A.

estraier (Owner) commented Apr 3, 2022 via email

estraier (Owner) commented Apr 3, 2022

Now I'm writing code that uses a dedicated thread to handle signals, as suggested in the gRPC forum.
Please wait for it.

puffetto (Contributor, Author) commented Apr 3, 2022 via email

estraier (Owner) commented Apr 3, 2022

I submitted the change. Please check it in your environment.

puffetto (Contributor, Author) commented Apr 3, 2022

Thanks,
I did a very quick check and it seems OK; I will be more thorough tomorrow and will close the issue if everything is confirmed to be OK (now I have to take care of my son).
Should you need to run any tests on FreeBSD, just drop a public ssh key to andrea at cocito.eu and I'll happily give you an account on one of my machines.
Cheers,
A.

puffetto (Contributor, Author) commented Apr 3, 2022

Hi,
everything seems OK.
I took a look at the code and can't see anything strange (I did not know you already had some thread/signal handling code in the main tkrzw library); I also ran some stress tests here and everything looks OK, so I am closing this issue.
Thanks a lot and, once again, should you ever need a FreeBSD test environment, just drop me a note!
A.

puffetto closed this as completed Apr 3, 2022
estraier (Owner) commented Apr 4, 2022

Thanks for checking the change.

One of the main objectives of Tkrzw over Kyoto Cabinet is to improve concurrency.
And gRPC is a "spartan" library, which requires a deep dive into threading techniques to utilize it fully.
Furthermore, Tkrzw-RPC supports monitoring operations for real-time data processing.
https://dbmx.net/tkrzw-rpc/#tips_compex_monitor
Thus, I had to come up with a lot of threading gimmicks. :)

As for a user account on FreeBSD, I don't need one currently. I'll ask for one if other issues come up.

orzel commented Apr 9, 2022

> I submitted the change. Please check it in your environment.

I'm lost. Where is this code? Can we see it?

estraier (Owner) commented

The change is this: e9f1c44
Though the change had been submitted as of your reply, I had not yet released it as a package.
I released the package 0.9.9 just now.
