Crashes with read/takeMVar after withServer #22

Closed

OlivierNicole opened this issue Jul 17, 2017 · 19 comments
@OlivierNicole
Contributor

Consider the following 50-line program: https://gist.github.com/OlivierNicole/945da49f0ab6e2e7d5cdbd53eaeaa44a

When I comment out the MVar-related lines, it works fine. But with them left in, it crashes about once in every two runs with a silent segmentation fault (no message at all), and roughly once every 50 runs with the following backtrace: https://gist.github.com/OlivierNicole/2ea96e27ecff16de5df535d4710d36ad (`working` is the name of my executable).

Have you ever seen this kind of issue? How can a simple readMVar affect my code like this? It doesn't happen with tryReadMVar.
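The only semantic difference I can see is that readMVar blocks (interruptibly) until the MVar is filled, while tryReadMVar returns immediately, so the blocking read changes how my thread interleaves with the server teardown. A tiny standalone illustration of that difference:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar, tryReadMVar)

main :: IO ()
main = do
  mv <- newEmptyMVar
  r  <- tryReadMVar mv          -- returns Nothing right away: the MVar is empty
  print (r :: Maybe ())
  _  <- forkIO $ threadDelay 100000 >> putMVar mv ()
  readMVar mv                   -- blocks (interruptibly) until the putMVar above fires
  putStrLn "readMVar returned"
```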

Thanks.

@OlivierNicole
Contributor Author

To be precise: this happens with the latest revision on master, fe55845.

@intractable
Contributor

intractable commented Jul 18, 2017

Hi @OlivierNicole --

Unfortunately, I'm not able to reproduce this on my machine (MacOS), even after playing around with mvar placements / delays and ensuring that stopEndpoint wouldn't be called before the newEndpoint async blocks on wait loopThreadId. If you're using a Linux system, @ixmatus might be able to investigate.

BTW, we may have a terminology mismatch around the term "endpoint": it sounds like you are using "endpoint" to mean a "server", whereas we usually use "endpoint" to refer to a particular RPC connection point (i.e., a "server" plus some RPC path). So we talk about unary endpoints or bidi-streaming endpoints, and these map to RPC function signatures in the .proto files.

With that out of the way, I'm curious what you're trying to do. Have you tried using any of the code generated from an RPC service definition in a .proto? That generated code contains a record type with a field for each RPC endpoint defined in the .proto. Looking at how those endpoints are supplied and used might provide some insight for your particular use case, but I'm not sure. You could also consider calling serverLoop with your endpoint handlers provided directly, and see whether that gets you what you need.

Does the test suite work for you? Some of the code generation tests demonstrate sample uses of the higher level APIs.

@OlivierNicole
Contributor Author

Hi @intractable,

I'm puzzled that you can't reproduce it. Have you tried to run it many times?

We now conjecture that the bug occurs when the thread running withServer is killed after dispatchLoop has been started. In that case, the exception handler installed by bracket stops the server and cleans up its resources while the dispatchLoop thread is still using them, hence the segfaults and "double free" errors. If that is true, however, this comment becomes rather puzzling, since it implies that the thread running serverLoop should never be killed.
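To make the conjecture concrete, here is a self-contained sketch of the pattern in plain Haskell, with an IORef standing in for the gRPC resources (nothing here is grpc-haskell API): killing the outer thread runs bracket's cleanup, but the thread forked inside the bracket keeps using the "freed" resource, which with real C resources would be a use-after-free.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Exception (bracket)
import Control.Monad (forever, unless)
import Data.IORef (newIORef, readIORef, writeIORef)

main :: IO ()
main = do
  outer <- forkIO $
    bracket (newIORef True)                   -- "start the server"
            (\res -> writeIORef res False)    -- "free it" when this thread is killed
            (\res -> do
               -- analogue of dispatchLoop: forked inside the bracket,
               -- but not killed when the outer thread is
               _ <- forkIO $ forever $ do
                      ok <- readIORef res
                      unless ok $
                        putStrLn "inner loop still using a freed resource!"
                      threadDelay 10000
               forever (threadDelay 100000))
  threadDelay 50000
  killThread outer   -- runs the bracket cleanup while the inner loop is still alive
  threadDelay 50000  -- give the inner loop time to demonstrate the problem
```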

To answer your question: I am trying to implement "endpoints" that can send and receive messages, built on top of gRPC requests. The reason "my" endpoints do not map exactly onto gRPC endpoints is that I want to be able to create new endpoints at any point in my program, whereas request handlers must be registered before server startup. So I am forced to have exactly one Server per "endpoint".

The higher-level API works very well; I simply haven't been able to make it fit my (rather awkward) use case.

@intractable
Contributor

intractable commented Jul 18, 2017

You're right, it does crash for me on subsequent executions. Sorry, I'm a little embarrassed that I didn't try that; I was running out the door =).

I'll dig into this a bit further as soon as I'm able. It does seem like something is amiss with serverLoop: it cannot be killed gracefully in a way that ensures the dynamic endpoint no longer answers any calls, so I'd start by looking there. I got a modified version of your code "working" -- in that it no longer crashed on the MVar use -- but alas, stopEndpoint did not, in fact, stop the endpoint from answering subsequent requests, which means the underlying dispatchLoop was still executing.

I suspect it has something to do with the comment you referred to -- it is certainly the case that serverLoop has so far been used without any intention of it being killed (the benchmarking code is an exception, but I don't think it actually does a complete teardown, so that is likely a bug too), so your usage has probably deviated from what we've tested.

Thanks for raising the issue. It's possible we may have to provide some different sync semantics for your use case, or just rejigger things a bit.

@crclark might have some ideas as well, but I'll dig into this a bit more soon (hopefully in the next couple of days).

@crclark
Collaborator

crclark commented Jul 19, 2017

Sorry for the delay responding; I currently can't get grpc-haskell to build on my machine, thanks to a mysterious linker issue when building with stack. And I can't build with Nix because the official Nix installer script crashes with a missing file. 😖

This behavior is very odd. If someone could build grpc-haskell with the Debug flag and post the debug output for a crashing run of the program, I might be able to spot something suspicious in it. There's also the possibility that this bug is in the underlying grpc library, in which case we might not be able to do much about it.

@Gabriella439
Contributor

@crclark: What error do you get with the official Nix installer script?

@crclark
Collaborator

crclark commented Jul 19, 2017

@Gabriel439 My mistake; I had only removed /nix/store and not /nix like I thought I had. It's building now.

@crclark
Collaborator

crclark commented Jul 19, 2017

Finally got the library building again and I was able to reproduce the problem on Linux, too. Here's the debug log output. It looks like the problem is that something isn't shutting down gracefully when blocked on pluck.

[ThreadId 12]: startServer: server CQ: CompletionQueue 0x00007f1eb00028e0
creating endpoint...
endpoint created.
stopping server...
[ThreadId 12]: stopServer: calling shutdownNotify.
[ThreadId 13]: serverRequestCall(U): got pluck permission, registering call for tag=Tag {unTag = 0x8000000000000000}
[ThreadId 13]: pluck: blocking on grpc_completion_queue_pluck for tag=Tag {unTag = 0x8000000000000000}
[ThreadId 12]: called serverShutdownAndNotify; plucking.
[ThreadId 13]: pluck finished: Event {eventCompletionType = OpComplete, eventSuccess = False, eventTag = Tag {unTag = 0x8000000000000000}}
[ThreadId 12]: pluck: called with tag=Tag {unTag = 0x0000000000000000},mwait=Just 30
[ThreadId 13]: serverRequestCall(U): pluck' finished: Left GRPCIOTimeout
[ThreadId 12]: pluck: blocking on grpc_completion_queue_pluck for tag=Tag {unTag = 0x0000000000000000}
[ThreadId 13]: withServerCallAsync: call error: GRPCIOTimeout
[ThreadId 12]: pluck finished: Event {eventCompletionType = OpComplete, eventSuccess = True, eventTag = Tag {unTag = 0x0000000000000000}}
[ThreadId 13]: serverRequestCall(U): got pluck permission, registering call for tag=Tag {unTag = 0x8000000000000001}
[ThreadId 12]: shutdownNotify: got shutdown eventRight ()
[ThreadId 12]: stopServer: cancelling all calls.
[ThreadId 13]: pluck: blocking on grpc_completion_queue_pluck for tag=Tag {unTag = 0x8000000000000001}
[ThreadId 12]: Server shutdown: killing threads: fromList []
[ThreadId 13]: pluck finished: Event {eventCompletionType = OpComplete, eventSuccess = False, eventTag = Tag {unTag = 0x8000000000000001}}
[ThreadId 13]: serverRequestCall(U): pluck' finished: Left GRPCIOTimeout
[ThreadId 12]: Server shutdown: waiting until all threads are dead.
[ThreadId 13]: withServerCallAsync: call error: GRPCIOTimeout
[ThreadId 12]: Server shutdown: All forks cleaned up.
[ThreadId 12]: stopServer: call grpc_server_destroy.
[ThreadId 13]: serverRequestCall(U): got pluck permission, registering call for tag=Tag {unTag = 0x8000000000000002}
[ThreadId 13]: pluck: blocking on grpc_completion_queue_pluck for tag=Tag {unTag = 0x8000000000000002}
[ThreadId 12]: stopServer: shutting down CQ.
[1]    26127 segmentation fault (core dumped) 

@crclark
Collaborator

crclark commented Jul 19, 2017

Some interesting valgrind output. It looks like we're calling grpc_completion_queue_next on the CQ after it has been destroyed.

==28710== Thread 15:
==28710== Invalid read of size 8
==28710==    at 0x4EC4570: gpr_ref (in /usr/local/lib/libgrpc.so.3.0.0-dev)
==28710==    by 0x4E85ACC: grpc_completion_queue_next (in /usr/local/lib/libgrpc.so.3.0.0-dev)
==28710==    by 0x427B02: grpc_completion_queue_next_ (in /home/connorc/Dropbox/haskell/gRPC-haskell/.stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/hellos-server/hellos-server)
==28710==    by 0x459738: grpczmhaskellzm0zi0zi0zi0zmGhc5xnDkU9JKVyeoFknY2U_NetworkziGRPCziUnsafe_zdwgrpcCompletionQueueNext_info (Unsafe.chs:152)
==28710==  Address 0x73e6cf8 is 56 bytes inside a block of size 336 free'd
==28710==    at 0x4C2D11B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28710==    by 0x4E719B3: grpc_exec_ctx_flush (in /usr/local/lib/libgrpc.so.3.0.0-dev)
==28710==    by 0x4E864C0: grpc_completion_queue_shutdown (in /usr/local/lib/libgrpc.so.3.0.0-dev)
==28710==    by 0x477AF5: grpczmhaskellzm0zi0zi0zi0zmGhc5xnDkU9JKVyeoFknY2U_NetworkziGRPCziLowLevelziCompletionQueueziInternal_zdwshutdownCompletionQueue_info (Internal.hs:153)
==28710==  Block was alloc'd at
==28710==    at 0x4C2DF05: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28710==    by 0x4EC24AE: gpr_zalloc (in /usr/local/lib/libgrpc.so.3.0.0-dev)
==28710==    by 0x4E85669: grpc_completion_queue_create (in /usr/local/lib/libgrpc.so.3.0.0-dev)
==28710==    by 0x46DB6E: grpczmhaskellzm0zi0zi0zi0zmGhc5xnDkU9JKVyeoFknY2U_NetworkziGRPCziLowLevelziCompletionQueue_zdwcreateCompletionQueue_info (CompletionQueue.hs:64)

@crclark
Collaborator

crclark commented Jul 19, 2017

Looks like that grpc_completion_queue_next was called by drainLoop.

@intractable
Contributor

Ah, I think I see what's going on. In @OlivierNicole's example there's an extra layer of async-ing compared to inlining serverLoop directly, but that's a bit of a red herring. The point is that the dispatchLoop logic continues to run during and after server shutdown, and the interleaving will sometimes lead to the segfault. Even when a segfault does not occur, if you delay termination of the main thread after supposedly stopping the server, you can see continual call errors emitted by dispatchLoop, indicating that it is still running.

I think the genuine bug is that serverLoop can be canceled without the inner thread running dispatchLoop also being killed.

If that's the case, I think the fix is just CE.onException (wait tid) (cancel tid) instead of merely wait tid in serverLoop, to ensure that the spawned thread is killed when the wait is interrupted. I believe that fix alone is sufficient, but I need to do a bit more testing. If it works, I'll submit a patch shortly.
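To spell that out, here's a minimal standalone sketch of the change; dispatchLoop below is just a stand-in loop, not the real one:

```haskell
import qualified Control.Exception as CE
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (async, wait, cancel)
import Control.Monad (forever)

-- Stand-in for the real dispatchLoop, which blocks on gRPC completion queues.
dispatchLoop :: IO ()
dispatchLoop = forever (threadDelay 100000)

serverLoop :: IO ()
serverLoop = do
  tid <- async dispatchLoop
  -- was:  wait tid
  -- now:  if the wait is interrupted (e.g. this thread is killed during
  --       withServer teardown), also cancel the dispatch thread instead of
  --       leaving it running against freed server resources
  wait tid `CE.onException` cancel tid

main :: IO ()
main = serverLoop
```

(Equivalently, withAsync dispatchLoop wait would guarantee the cancellation by construction.)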

@OlivierNicole, if the fix is valid, I'm also going to hand-write a very simple test of the dynamic-endpoint stand-up I think you're trying to do, to ensure that the "endpoints" answer requests while they're live and don't once they aren't (the bug above means the responder threads may still answer even after the server is down, but the fix should address that). One thing that came to mind about your approach, though: by standing up a whole server per dynamic endpoint, you occupy a server port for each one, clients need to know about that port mapping, and if you have a lot of dynamic endpoints for some reason, I imagine that might be a pain.

I think a better option for your scenario might be to change (or, more likely, provide an additional API similar to) the API for serverLoop/dispatchLoop so that you could have a single server instance that would incorporate your dynamic routes during endpoint dispatch.
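Something like the following shape, just to sketch the idea -- the names here are made up for illustration and are not existing grpc-haskell API: keep the handlers in a mutable map keyed by method path and look each handler up at dispatch time, so "creating an endpoint" becomes inserting into the map while a single server keeps running.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as M
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef')

-- Placeholder payload types; the real thing would use grpc-haskell's
-- request/response representations.
type MethodPath = BS.ByteString
type Handler    = BS.ByteString -> IO BS.ByteString

newtype Routes = Routes (IORef (M.Map MethodPath Handler))

newRoutes :: IO Routes
newRoutes = Routes <$> newIORef M.empty

-- "Creating an endpoint" is now just inserting a handler, at any time.
addRoute :: Routes -> MethodPath -> Handler -> IO ()
addRoute (Routes ref) p h = atomicModifyIORef' ref (\m -> (M.insert p h m, ()))

removeRoute :: Routes -> MethodPath -> IO ()
removeRoute (Routes ref) p = atomicModifyIORef' ref (\m -> (M.delete p m, ()))

-- What a dispatchLoop-like loop would do for each incoming call: look the
-- handler up at dispatch time rather than in a fixed, startup-time list.
dispatchOne :: Routes -> MethodPath -> BS.ByteString -> IO (Maybe BS.ByteString)
dispatchOne (Routes ref) p payload = do
  m <- readIORef ref
  traverse ($ payload) (M.lookup p m)

main :: IO ()
main = do
  rs <- newRoutes
  addRoute rs "/demo.Echo/Say" pure                -- trivial echo handler
  print =<< dispatchOne rs "/demo.Echo/Say" "hi"   -- Just "hi"
  removeRoute rs "/demo.Echo/Say"
  print =<< dispatchOne rs "/demo.Echo/Say" "hi"   -- Nothing
```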

@intractable
Contributor

Scratch that, I'm still seeing intermittent crashes once debugging spew is turned off. I'll keep digging =).

@intractable
Contributor

Looks like the remaining failure occurs, at least some of the time, while we're blocked on the foreign call for pluck, as Connor indicated earlier. I think we may need to allow the pluck to yield. Experimenting.

@intractable
Contributor

Okay, so I think the blocked-on-foreign-call situation (the first failure kind described below) can be addressed separately, but there are still some crashes during teardown that warrant investigation.

@crclark, if you want to dig into this, I think debugging https://gist.github.com/intractable/88e6a960485d01c6347efce8adeb84ce first might be simpler. This is about as far as I got for now but I'll continue investigating as soon as I can. These all smell like a teardown race condition to me.

Unfortunately, with -fdebug and high GRPC_TRACE settings, the crashes occur infrequently so many runs are needed to witness the bad behavior. You should get reliable crashes without verbose logging, though =).

That said, I'm consistently seeing four kinds of failures across multiple runs.

1. Deadlock when Network.GRPC.LowLevel.CompletionQueue.Unregistered.serverRequestCall is blocked on a foreign call for pluck', which causes cancel from the main thread to block. No real surprise here, I don't think, and I think this can be mitigated by having that pluck' yield periodically, so let's not focus on this one just yet, but it'd be good for you to reproduce it. There's a remark indicating where inserting a delay will more reliably trigger the deadlock, because it gives the dispatchLoop more time to reach the foreign call. Here's a sample trace (with the delay added, and a traceIO call in serverRequestCall to indicate when it's about to block):
creating endpoint...
D0719 12:43:57.802312000 123145307131904 ev_posix.c:107]               Using polling engine: poll
I0719 12:43:57.802932000 123145307131904 init.c:228]                   grpc_init(void)
I0719 12:43:57.802956000 123145307131904 server.c:1017]                grpc_server_create(0x7fb832c05540, 0x0)
I0719 12:43:57.802983000 123145307131904 server_chttp2.c:47]           grpc_server_add_insecure_http2_port(server=0x7fb832c05f50, addr=localhost:50051)
I0719 12:43:57.804218000 123145307131904 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:43:57.804231000 123145307131904 server.c:1003]                grpc_server_register_completion_queue(server=0x7fb832c05f50, cq=0x7fb832e028f0, reserved=0x0)
I0719 12:43:57.804241000 123145307131904 server.c:1086]                grpc_server_start(server=0x7fb832c05f50)
I0719 12:43:57.804839000 123145307131904 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:43:57.804862000 123145307131904 metadata_array.c:42]          grpc_metadata_array_init(array=0x7fb832e01f20)
I0719 12:43:57.849114000 123145307131904 call_details.c:44]            grpc_call_details_init(cd=0x7fb832d00450)
I0719 12:43:57.849149000 123145307131904 server.c:1427]                grpc_server_request_call(server=0x7fb832c05f50, call=0x7fb832e028b0, details=0x7fb832d00450, initial_metadata=0x7fb832e01f20, cq_bound_to_call=0x7fb832e02a90, cq_for_notification=0x7fb832e028f0, tag=0x8000000000000000)
serverRequestCall(U): about to block on pluck'
I0719 12:43:57.849202000 123145307131904 completion_queue.c:528]       grpc_completion_queue_pluck(cc=0x7fb832e028f0, tag=0x8000000000000000, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 0 }, reserved=0x0)
stopping server...
2. Debug malloc (modified after freed)
D0719 12:49:36.935414000 123145310887936 ev_posix.c:107]               Using polling engine: poll
I0719 12:49:36.936010000 123145310887936 init.c:228]                   grpc_init(void)
I0719 12:49:36.936047000 123145310887936 server.c:1017]                grpc_server_create(0x7fbcba2002a0, 0x0)
I0719 12:49:36.936080000 123145310887936 server_chttp2.c:47]           grpc_server_add_insecure_http2_port(server=0x7fbcba2007d0, addr=localhost:50051)
D0719 12:49:36.937279000 123145310887936 combiner.c:145]               C:0x7fbcb9500a80 create
I0719 12:49:36.937389000 123145310887936 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:49:36.937403000 123145310887936 server.c:1003]                grpc_server_register_completion_queue(server=0x7fbcba2007d0, cq=0x7fbcb9501000, reserved=0x0)
I0719 12:49:36.937412000 123145310887936 server.c:1086]                grpc_server_start(server=0x7fbcba2007d0)
I0719 12:49:36.937988000 123145310887936 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:49:36.965183000 123145310887936 metadata_array.c:42]          grpc_metadata_array_init(array=0x7fbcba200330)
I0719 12:49:36.965201000 123145310887936 call_details.c:44]            grpc_call_details_init(cd=0x7fbcba200d30)
I0719 12:49:36.965218000 123145310887936 server.c:1427]                grpc_server_request_call(server=0x7fbcba2007d0, call=0x7fbcba2002a0, details=0x7fbcba200d30, initial_metadata=0x7fbcba200330, cq_bound_to_call=0x7fbcba200b90, cq_for_notification=0x7fbcb9501000, tag=0x8000000000000000)
stopping server...
I0719 12:49:36.965266000 123145311424512 call_details.c:51]            grpc_call_details_destroy(cd=0x7fbcba200d30)
I0719 12:49:36.965281000 123145311424512 metadata_array.c:47]          grpc_metadata_array_destroy(array=0x7fbcba200330)
I0719 12:49:36.965317000 123145311424512 server.c:1250]                grpc_server_shutdown_and_notify(server=0x7fbcba2007d0, cq=0x7fbcb9501000, tag=0x0)
I0719 12:49:36.965335000 123145311424512 completion_queue.c:224]       grpc_cq_end_op(exec_ctx=0x7000008acda0, cc=0x7fbcb9501000, tag=0x8000000000000000, error={"created":"@1500486576.965324000","description":"Server Shutdown","file":"src/core/lib/surface/server.c","file_line":1281}, done=0x10dadb1b0, done_arg=0x10e589f90, storage=0x10e589fc0)
E0719 12:49:36.990588000 123145311424512 completion_queue.c:226]       Operation failed: tag=0x8000000000000000, error={"created":"@1500486576.965324000","description":"Server Shutdown","file":"src/core/lib/surface/server.c","file_line":1281}
D0719 12:49:36.990765000 123145311424512 combiner.c:160]               C:0x7fbcb9500a80 really_destroy old_state=1
D0719 12:49:36.990771000 123145311424512 combiner.c:150]               C:0x7fbcb9500a80 really_destroy
I0719 12:49:36.990784000 123145311424512 completion_queue.c:224]       grpc_cq_end_op(exec_ctx=0x7000008acda0, cc=0x7fbcb9501000, tag=0x0, error="No Error", done=0x10dadb470, done_arg=0x7fbcba2007d0, storage=0x7fbcb9600940)
I0719 12:49:36.990808000 123145311424512 completion_queue.c:528]       grpc_completion_queue_pluck(cc=0x7fbcb9501000, tag=0x0, deadline=gpr_timespec { tv_sec: 30, tv_nsec: 55464708, clock_type: 0 }, reserved=0x0)
I0719 12:49:37.017706000 123145311424512 completion_queue.c:633]       RETURN_EVENT[0x7fbcb9501000]: OP_COMPLETE: tag:0x0 OK
I0719 12:49:37.017741000 123145311424512 server.c:1305]                grpc_server_cancel_all_calls(server=0x7fbcba2007d0)
I0719 12:49:37.017751000 123145311424512 server.c:1320]                grpc_server_destroy(server=0x7fbcba2007d0)
I0719 12:49:37.017760000 123145311424512 completion_queue.c:648]       grpc_completion_queue_shutdown(cc=0x7fbcb9501000)
I0719 12:49:37.017824000 123145311424512 completion_queue.c:360]       grpc_completion_queue_next(cc=0x7fbcb9501000, deadline=gpr_timespec { tv_sec: 1, tv_nsec: 82484918, clock_type: 0 }, reserved=0x0)
Bug(3230,0x7000008b1000) malloc: *** error for object 0x7fbcba200330: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6
3. Segfault
D0719 12:50:32.518402000 123145307131904 ev_posix.c:107]               Using polling engine: poll
I0719 12:50:32.519012000 123145307131904 init.c:228]                   grpc_init(void)
I0719 12:50:32.519034000 123145307131904 server.c:1017]                grpc_server_create(0x7f9121701230, 0x0)
I0719 12:50:32.519056000 123145307131904 server_chttp2.c:47]           grpc_server_add_insecure_http2_port(server=0x7f9121701a20, addr=localhost:50051)
D0719 12:50:32.520242000 123145307131904 combiner.c:145]               C:0x7f9121503340 create
I0719 12:50:32.520352000 123145307131904 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:50:32.520363000 123145307131904 server.c:1003]                grpc_server_register_completion_queue(server=0x7f9121701a20, cq=0x7f91215038b0, reserved=0x0)
I0719 12:50:32.520372000 123145307131904 server.c:1086]                grpc_server_start(server=0x7f9121701a20)
I0719 12:50:32.520955000 123145307131904 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:50:32.553449000 123145307131904 metadata_array.c:42]          grpc_metadata_array_init(array=0x7f9122b001c0)
I0719 12:50:32.553472000 123145307131904 call_details.c:44]            grpc_call_details_init(cd=0x7f9122b00220)
I0719 12:50:32.553502000 123145307131904 server.c:1427]                grpc_server_request_call(server=0x7f9121701a20, call=0x7f9122b001a0, details=0x7f9122b00220, initial_metadata=0x7f9122b001c0, cq_bound_to_call=0x7f9122b00000, cq_for_notification=0x7f91215038b0, tag=0x8000000000000000)
stopping server...
I0719 12:50:32.553556000 123145307131904 call_details.c:51]            grpc_call_details_destroy(cd=0x7f9122b00220)
I0719 12:50:32.553570000 123145307131904 metadata_array.c:47]          grpc_metadata_array_destroy(array=0x7f9122b001c0)
I0719 12:50:32.553582000 123145307131904 server.c:1250]                grpc_server_shutdown_and_notify(server=0x7f9121701a20, cq=0x7f91215038b0, tag=0x0)
I0719 12:50:32.553610000 123145307131904 completion_queue.c:224]       grpc_cq_end_op(exec_ctx=0x700000494da0, cc=0x7f91215038b0, tag=0x8000000000000000, error={"created":"@1500486632.553589000","description":"Server Shutdown","file":"src/core/lib/surface/server.c","file_line":1281}, done=0x10665b1b0, done_arg=0x10710ff90, storage=0x10710ffc0)
E0719 12:50:32.582387000 123145307131904 completion_queue.c:226]       Operation failed: tag=0x8000000000000000, error={"created":"@1500486632.553589000","description":"Server Shutdown","file":"src/core/lib/surface/server.c","file_line":1281}
D0719 12:50:32.582546000 123145307131904 combiner.c:160]               C:0x7f9121503340 really_destroy old_state=1
D0719 12:50:32.582552000 123145307131904 combiner.c:150]               C:0x7f9121503340 really_destroy
I0719 12:50:32.582564000 123145307131904 completion_queue.c:224]       grpc_cq_end_op(exec_ctx=0x700000494da0, cc=0x7f91215038b0, tag=0x0, error="No Error", done=0x10665b470, done_arg=0x7f9121701a20, storage=0x7f9121502de0)
I0719 12:50:32.582588000 123145307131904 completion_queue.c:528]       grpc_completion_queue_pluck(cc=0x7f91215038b0, tag=0x0, deadline=gpr_timespec { tv_sec: 30, tv_nsec: 64248566, clock_type: 0 }, reserved=0x0)
I0719 12:50:32.609891000 123145307131904 completion_queue.c:633]       RETURN_EVENT[0x7f91215038b0]: OP_COMPLETE: tag:0x0 OK
I0719 12:50:32.609918000 123145307131904 server.c:1305]                grpc_server_cancel_all_calls(server=0x7f9121701a20)
I0719 12:50:32.609928000 123145307131904 server.c:1320]                grpc_server_destroy(server=0x7f9121701a20)
I0719 12:50:32.609937000 123145307131904 completion_queue.c:648]       grpc_completion_queue_shutdown(cc=0x7f91215038b0)
I0719 12:50:32.610000000 123145307131904 completion_queue.c:360]       grpc_completion_queue_next(cc=0x7f91215038b0, deadline=gpr_timespec { tv_sec: 1, tv_nsec: 91664117, clock_type: 0 }, reserved=0x0)
I0719 12:50:32.610013000 123145307131904 completion_queue.c:445]       RETURN_EVENT[0x7f91215038b0]: OP_COMPLETE: tag:0x8000000000000000 ERROR
I0719 12:50:32.610019000 123145307131904 completion_queue.c:360]       grpc_completion_queue_next(cc=0x7f91215038b0, deadline=gpr_timespec { tv_sec: 1, tv_nsec: 91685514, clock_type: 0 }, reserved=0x0)
Segmentation fault: 11
4. grpc assertion failure
D0719 12:54:39.974505000 123145307131904 ev_posix.c:107]               Using polling engine: poll
I0719 12:54:39.975153000 123145307131904 init.c:228]                   grpc_init(void)
I0719 12:54:39.975174000 123145307131904 server.c:1017]                grpc_server_create(0x7f8310d01360, 0x0)
I0719 12:54:39.975195000 123145307131904 server_chttp2.c:47]           grpc_server_add_insecure_http2_port(server=0x7f8310d018e0, addr=localhost:50051)
D0719 12:54:39.976619000 123145307131904 combiner.c:145]               C:0x7f8310d022e0 create
I0719 12:54:39.976732000 123145307131904 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:54:39.976742000 123145307131904 server.c:1003]                grpc_server_register_completion_queue(server=0x7f8310d018e0, cq=0x7f8310d027c0, reserved=0x0)
I0719 12:54:39.976752000 123145307131904 server.c:1086]                grpc_server_start(server=0x7f8310d018e0)
I0719 12:54:39.977332000 123145307131904 completion_queue.c:119]       grpc_completion_queue_create(reserved=0x0)
I0719 12:54:40.005708000 123145307131904 metadata_array.c:42]          grpc_metadata_array_init(array=0x7f8310d01ca0)
I0719 12:54:40.005731000 123145307131904 call_details.c:44]            grpc_call_details_init(cd=0x7f8310d01d00)
I0719 12:54:40.005755000 123145307131904 server.c:1427]                grpc_server_request_call(server=0x7f8310d018e0, call=0x7f8310d01360, details=0x7f8310d01d00, initial_metadata=0x7f8310d01ca0, cq_bound_to_call=0x7f8310d02960, cq_for_notification=0x7f8310d027c0, tag=0x8000000000000000)
stopping server...
I0719 12:54:40.005787000 123145307131904 call_details.c:51]            grpc_call_details_destroy(cd=0x7f8310d01d00)
I0719 12:54:40.005797000 123145307131904 metadata_array.c:47]          grpc_metadata_array_destroy(array=0x7f8310d01ca0)
I0719 12:54:40.005805000 123145307131904 server.c:1250]                grpc_server_shutdown_and_notify(server=0x7f8310d018e0, cq=0x7f8310d027c0, tag=0x0)
I0719 12:54:40.005822000 123145307131904 completion_queue.c:224]       grpc_cq_end_op(exec_ctx=0x700000494da0, cc=0x7f8310d027c0, tag=0x8000000000000000, error={"created":"@0.005811000","description":"Server Shutdown","file":"src/core/lib/surface/server.c","file_line":1281}, done=0x107b321b0, done_arg=0x1085eaf90, storage=0x1085eafc0)
E0719 12:54:40.032408000 123145307131904 completion_queue.c:226]       Operation failed: tag=0x8000000000000000, error={"created":"@0.005811000","description":"Server Shutdown","file":"src/core/lib/surface/server.c","file_line":1281}
E0719 12:54:40.032430000 123145307131904 sync.c:118]                   assertion failed: prior > 0
Abort trap: 6

Let me know if you're able to reproduce these (at your leisure). I'll try to resume the analysis as soon as I'm able.

@crclark
Collaborator

crclark commented Jul 20, 2017

Hmm, in your new gist, @intractable, I am only able to reproduce case 1. I guess for that issue we could call pluck with a timeout and just keep retrying until we get the tag we want. Cases 2, 3, and 4 all look like the same problem I saw in the Valgrind output I pasted above: namely, the docs for how to shut down a completion queue seem to be misleading. I wonder if this is a regression in 1.2.0; I think I would have noticed it if it had been a problem before... I'll keep playing with it, but I'm not sure how quickly I can make progress.

@intractable
Contributor

@crclark interesting that you can't repro the other issues, but they are very timing-sensitive, so perhaps it's not surprising. For case 1, yes, I think we can just retry the pluck call so that it yields at regular, brief intervals.

I'll try to investigate the CQ shutdown issue as well, as time permits. I'm also not sure how soon I'll be able to address it. Thanks!

@OlivierNicole Until these issues are fixed, you might consider modifying serverLoop/dispatchLoop so that you can specify a dynamic collection of routes and handlers, as mentioned earlier, and just run a single server (w/ dynamic routes) via serverLoop.

@OlivierNicole
Contributor Author

> @OlivierNicole Until these issues are fixed, you might consider modifying serverLoop/dispatchLoop so that you can specify a dynamic collection of routes and handlers, as mentioned earlier, and just run a single server (w/ dynamic routes) via serverLoop.

@intractable Thanks, I will probably end up doing that in the next few days. Being able to kill the server thread as well would be nice, though, maybe using grpcShutdownAndNotify?

OlivierNicole pushed a commit to tweag/gRPC-haskell that referenced this issue Aug 4, 2017
Instead of making a blocking foreign call without a timeout, we set a
timeout of one second. This way, the thread returns in Haskell code (and
thus is interruptible) at least every second. This is useful in order to
be able to kill `Server`s, see
awakesecurity#22.
@OlivierNicole
Contributor Author

FWIW I edited my fork to make pluck yield every second, making Servers actually stoppable: tweag@c29085c
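The shape of the change, sketched here with an MVar standing in for the foreign pluck call -- the actual patch passes the one-second deadline to the C pluck call itself, since an outstanding foreign call can't be interrupted from Haskell:

```haskell
{-# LANGUAGE LambdaCase #-}
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import System.Timeout (timeout)

-- Retry-with-deadline pattern: each attempt blocks for at most one second,
-- so control returns to Haskell (and the thread can receive exceptions such
-- as killThread) at least once per second instead of blocking forever.
pluckWithYield :: MVar a -> IO a
pluckWithYield mv = go
  where
    oneSecond = 1000000  -- microseconds
    go = timeout oneSecond (takeMVar mv) >>= \case
           Just x  -> pure x
           Nothing -> go   -- deadline expired: loop and try again

main :: IO ()
main = do
  mv <- newEmptyMVar
  _  <- forkIO $ threadDelay 2500000 >> putMVar mv "done after ~2.5s"
  pluckWithYield mv >>= putStrLn
```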

OlivierNicole pushed a commit to tweag/gRPC-haskell that referenced this issue Aug 18, 2017
Instead of making a blocking foreign call without a timeout, we set a
timeout of one second. This way, the thread returns in Haskell code (and
thus is interruptible) at least every second. This is useful in order to
be able to kill `Server`s, see
awakesecurity#22.
intractable pushed a commit that referenced this issue Aug 21, 2017
Instead of making a blocking foreign call without a timeout, we set a
timeout of one second. This way, the thread returns in Haskell code (and
thus is interruptible) at least every second. This is useful in order to
be able to kill `Server`s, see
#22.
@intractable
Contributor

Closing this as it is addressed by #30. Thanks again for the fix.
