Crashes with read/takeMVar after withServer #22
To clarify: this happens with the current latest revision on master: fe55845.
Hi @OlivierNicole -- unfortunately, I'm not able to reproduce this on my machine (macOS), even after playing around with MVar placements and delays.

It's possible that we have a terminology mismatch for the term "endpoint", BTW: it seems like you are saying "endpoint" to denote a "server", whereas we usually say "endpoint" to refer to a particular RPC connection point (i.e., a "server" plus some RPC path). So I can talk about unary endpoints or bidirectional-streaming endpoints, and these map to RPC function signatures in the higher-level API.

With that out of the way, I'm curious about what you're trying to do. Have you tried using any of the generated code from an RPC service definition in a .proto file? Does the test suite work for you? Some of the code generation tests demonstrate sample uses of the higher-level APIs.
Hi @intractable, I'm puzzled that you can't reproduce it. Have you tried running it many times? We now conjecture that the bug appears when the thread containing the server is torn down.

To answer your question: I am trying to implement "endpoints" that can receive and send messages, and I am trying to do that on top of gRPC requests. The reason "my" endpoints do not map exactly to gRPC endpoints is that I want to be able to create new endpoints at any time in my program, whereas request handlers need to be registered before server startup. So I am forced to have exactly one registered handler and dispatch incoming messages to my own endpoints myself.

The higher-level API works very well; I have simply not been able to make it fit my (rather awkward) use case.
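A minimal sketch of the dispatching pattern described here, using only standard concurrency primitives (every name below, e.g. `Endpoint`, `Registry`, `dispatch`, is illustrative and not part of gRPC-haskell):

```haskell
-- Illustrative sketch: one registered gRPC handler dispatching to
-- dynamically created "endpoints". Names are made up, not library API.
import Control.Concurrent.Chan (Chan, newChan, writeChan)
import Control.Concurrent.MVar (MVar, modifyMVar_, newMVar, readMVar)
import Data.ByteString (ByteString)
import qualified Data.Map.Strict as Map

-- An "endpoint" is just a mailbox some thread is reading from.
type Endpoint = Chan ByteString

-- A registry of endpoints, keyed by some identifier in the request.
type Registry = MVar (Map.Map ByteString Endpoint)

newRegistry :: IO Registry
newRegistry = newMVar Map.empty

-- Create a new endpoint at any time, even after the server is up.
newEndpoint :: Registry -> ByteString -> IO Endpoint
newEndpoint reg name = do
  ep <- newChan
  modifyMVar_ reg (pure . Map.insert name ep)
  pure ep

-- The single registered handler routes each payload to the endpoint
-- named in the request (the routing scheme is invented here).
dispatch :: Registry -> ByteString -> ByteString -> IO ()
dispatch reg name payload = do
  eps <- readMVar reg
  case Map.lookup name eps of
    Just ep -> writeChan ep payload
    Nothing -> pure ()  -- unknown endpoint; a real server would report an error
```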
You're right, it does crash for me after subsequent executions. Sorry, I'm a little embarrassed that I didn't try that, but I was running out the door =). I'll dig into this a bit further as soon as I'm able. It does seem like there's something amiss here, and I suspect it has something to do with the comment you referred to.

Thanks for raising the issue. It's possible we may have to provide some different sync semantics for your use case, or just rejigger things a bit. @crclark might have some ideas as well, but I'll dig into this a bit more soon (hopefully in the next couple of days).
Sorry for the delay responding; I currently can't get the library to build on my machine. This behavior is very odd. If someone could build it and investigate in the meantime, that would help.
@crclark: What error do you get with the official Nix installer script?
@Gabriel439 My mistake; I had only removed part of my previous Nix installation.
Finally got the library building again, and I was able to reproduce the problem on Linux, too. Here's the debug log output. It looks like the problem is that something isn't shutting down gracefully when blocked on a foreign call.
Some interesting Valgrind output:
Looks like that points at the completion queue shutdown.
Ah, I think I see what's going on. In @OlivierNicole's example, there's another layer of indirection around the server lifetime, and I think the genuine bug is in how the server is torn down. If that's the case, I think it's just a small fix. @OlivierNicole, could you check whether the fix works for you?

If the fix is valid, I'm also going to hand-write a very simple test of the dynamic endpoint stand-up I think you're trying to do, to ensure that the "endpoints" answer requests when they're live and can't when they aren't (the above bug will make it so that the responder threads may still answer even after the server is down, but the above fix should address that).

One thing that came to mind about your approach, however, is that by standing up a whole server at a time as a dynamic endpoint, you're going to be occupying that server port, clients will need to know about that port mapping, and if you have a lot of dynamic endpoints for some reason, I imagine that might be a pain. I think a better option for your scenario might be to change (or, more likely, provide an additional API similar to) the API for method registration, so that handlers could be registered against an already-running server.
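Something like the following shape is what that suggestion might look like; these names and signatures are hypothetical and do not exist in gRPC-haskell:

```haskell
-- Hypothetical API sketch only: none of these names or signatures
-- exist in gRPC-haskell; they illustrate the idea of registering
-- handlers against an already-running server.
data Server            -- stand-in for the library's server handle
newtype MethodName = MethodName String
data RegisteredMethod  -- token allowing later unregistration

-- Register a handler on a live server instead of at startup.
registerMethod :: Server -> MethodName -> IO RegisteredMethod
registerMethod _server _name = error "sketch only; not implemented"

-- Tear a dynamic endpoint back down.
unregisterMethod :: Server -> RegisteredMethod -> IO ()
unregisterMethod _server _method = error "sketch only; not implemented"
```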
Scratch that, I'm still seeing intermittent crashes once the debugging spew is turned off. I'll keep digging =).
Looks like the remaining failure occurs at least as often when we're blocked on the foreign call.
Okay, so I think the blocked-on-foreign-call situation (the first failure kind) can be addressed separately, but there are still some crashes during teardown that warrant investigation. @crclark, if you want to dig into this, I think debugging https://gist.github.com/intractable/88e6a960485d01c6347efce8adeb84ce first might be simpler.

This is about as far as I got for now, but I'll continue investigating as soon as I can. These all smell like a teardown race condition to me. Unfortunately, with debugging output enabled, the timing shifts enough that the failures are harder to trigger. That said, I'm consistently seeing four kinds of failures across multiple runs.
Let me know if you're able to reproduce these (at your leisure). I'll try to resume the analysis as soon as I'm able.
Hmm, in your new gist, @intractable, I am only able to reproduce case 1. I guess for that issue we could call `pluck` with a timeout and just keep trying until we get the tag we want. Cases 2, 3, and 4 all look to be the same problem I saw in that Valgrind output I pasted above. Namely, the docs for how to shut down a completion queue seem to be misleading. I wonder if this is a regression in 1.2.0; I think I would have noticed this if it had been a problem before... I'll keep playing with it, but I'm not sure how quickly I can go.
@crclark, interesting that you can't repro the other issues, but they are very timing-sensitive, so perhaps it's not surprising. For case 1, yes, I think we can just retry the `pluck` call with a timeout, as sketched below.

I'll try to investigate the CQ shutdown issue as well, as I'm able. Also not sure how soon I'll be able to address it. Thanks!

@OlivierNicole Until these issues are fixed, you might consider modifying your local copy of the library as a workaround.
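A sketch of that retry loop; here `pluckWithTimeout` is a stand-in for the real completion-queue call, assumed to block for at most a short deadline and return `Nothing` if the deadline expires:

```haskell
-- Retry-with-timeout sketch for case 1. The action passed in stands
-- in for the real completion-queue pluck with a deadline.
retryPluck :: IO (Maybe event) -> IO event
retryPluck pluckWithTimeout = go
  where
    go = do
      mEvent <- pluckWithTimeout
      case mEvent of
        Just ev -> pure ev
        -- On timeout we are back in Haskell code, so the thread can
        -- receive asynchronous exceptions before trying again.
        Nothing -> go
```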
@intractable Thanks, I will probably end up doing that in the next few days. Being able to also kill the server thread would be nice, though, maybe using an asynchronous exception to cancel it.
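One way to cancel a server thread, sketched with the `async` package; `runServer` is a placeholder action, and the cancellation only lands if the thread is interruptible, which is what the commit below addresses:

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (cancel, withAsync)

-- Placeholder for whatever starts the server and blocks.
runServer :: IO ()
runServer = threadDelay maxBound

main :: IO ()
main = withAsync runServer $ \serverThread -> do
  threadDelay 5000000    -- let the server run for five seconds
  -- cancel sends an async exception; it is only delivered once the
  -- thread is back in Haskell code rather than a blocking foreign call.
  cancel serverThread
```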
Instead of making a blocking foreign call without a timeout, we set a timeout of one second. This way, the thread returns to Haskell code (and thus is interruptible) at least every second. This is useful in order to be able to kill `Server`s, see awakesecurity#22.
FWIW, I edited my fork to make that change (see the commit message above).
Closing this as it is addressed by #30. Thanks again for the fix.
In the following 50-line code: https://gist.github.com/OlivierNicole/945da49f0ab6e2e7d5cdbd53eaeaa44a

When I comment out the `MVar`-related lines, it works fine. But when I leave them in, it crashes about once in every two runs with a silent segmentation fault, and about once every 50 executions with the following backtrace: https://gist.github.com/OlivierNicole/2ea96e27ecff16de5df535d4710d36ad (`working` is the name of my executable).

Have you ever seen this kind of issue? How can a simple `readMVar` affect my code like this? It doesn't happen with `tryReadMVar`. Thanks.
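For readers who don't follow the gist link, the rough shape of the failing program is something like this; it is a reconstruction from the description above, not the gist's actual code, and `withServer` here is a local placeholder rather than the library's real bracket:

```haskell
-- Reconstruction of the described pattern, not the gist's code.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)

-- Placeholder standing in for the gRPC-haskell server bracket.
withServer :: (() -> IO a) -> IO a
withServer body = body ()

main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- forkIO $ withServer $ \_server -> do
    -- ... accept and answer one request ...
    putMVar done ()
  -- Blocking on the MVar here is where the crashes were observed.
  readMVar done
```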