semi-catatonic synapse #34
Comments
we've recently observed this problem in our environment as well; it's my top priority to investigate ATM
Well, I guess I'm glad to know it's not just me. If I can provide any additional info for debugging, please ping me on freenode or gchat.
a way to reproduce this would be AWESOME
Ugh, I wish. I have yet to have it happen in <10 hours of process uptime. :(
a stack trace from a stuck synapse process, courtesy of @nelgau: https://gist.github.com/nelgau/3a910d5da1ac09abfd22
Worth noting that this indicates the process gets stuck when trying to exit. We will investigate this further and let you know what we find.
Found another one in the catatonic state; attached with gdb to get a stacktrace. If I'm reading this correctly it does indeed look like it's trying to shut down?

(gdb) bt
So, reading through the code, we don't really have any shutdown handling to speak of. What's more confusing is that it shuts down cleanly so much of the time. I'm going to try to add shutdown handling modeled after nerve's and see if this helps.
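A minimal sketch of that kind of flag-based shutdown handling, assuming a simple run loop; the `watchers` collection and its `stop!` method are hypothetical placeholders, not actual synapse or nerve APIs:

```ruby
# Flag-based shutdown handling for a simple run loop. `watchers` and `stop!`
# are hypothetical placeholders for the per-service watcher objects.
shutting_down = false
watchers = []  # would hold the service watcher objects in a real process

# Keep the signal handlers tiny: just flip a flag; the main loop reacts to it
# outside of signal-handler context.
%w[TERM INT].each do |sig|
  Signal.trap(sig) { shutting_down = true }
end

until shutting_down
  # ... normal discovery / haproxy-reconfigure work would happen here ...
  sleep 1
end

# Orderly teardown once the loop exits: stop each watcher so its ZK session
# closes and its threads join, then fall off the end and exit normally.
watchers.each(&:stop!)
```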
FWIW, after deploying 0.8.0 across an environment with ~45 synapse instances, 10 hours after starting each process I was able to SIGTERM all of them and see them exit correctly. It's not conclusive, but it's at least positive enough that I'll turn off my cronned midnight SIGKILL and see if we're still in good shape after 48-72 hours.
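A rough sketch of that kind of SIGTERM-and-verify check, purely illustrative and not part of synapse; the `pgrep -f synapse` pattern and the 30-second grace period are assumptions:

```ruby
# Purely illustrative: SIGTERM every synapse process, wait, then report any
# that are still alive (i.e. went catatonic instead of exiting).
GRACE_SECONDS = 30

pids = `pgrep -f synapse`.split.map(&:to_i)
pids.each { |pid| Process.kill('TERM', pid) }

sleep GRACE_SECONDS

stuck = pids.select do |pid|
  begin
    Process.kill(0, pid)  # signal 0 only checks whether the process still exists
    true
  rescue Errno::ESRCH
    false
  end
end

if stuck.empty?
  puts "all #{pids.size} synapse processes exited cleanly"
else
  puts "still running after SIGTERM: #{stuck.join(', ')}"
end
```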
Okay, now that we have proper cleanup, the underlying problem is a little clearer:

2014-01-17 17:15:48.525411500 I, [2014-01-17T17:15:48.525289 #6877] INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service klima-server
2014-01-17 17:16:18.880718500 fatal: Not a git repository (or any of the parent directories): .git
2014-01-17 17:16:19.203039500 fatal: Not a git repository (or any of the parent directories): .git

Obviously it's not great if a ZK server times out, but should that really be a fatal error for synapse? We're passing in multiple ZK servers, and a single timeout in one watcher should, it seems to me, be cause at worst for restarting that watcher and/or falling through to the next ZK server.
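A rough sketch of that restart-or-fall-through idea, assuming the zk gem and a simple one-shot read rather than a persistent watch; rescuing StandardError is deliberately broad, and a real fix would rescue only the gem's specific timeout/connection-loss exceptions. The server list and path layout are hypothetical.

```ruby
require 'zk'

# Hypothetical server list; in synapse these would come from configuration.
ZK_SERVERS = %w[zk1.example.com:2181 zk2.example.com:2181 zk3.example.com:2181]

# Try each ZK server in turn; a failure on one server logs a warning and
# falls through to the next instead of being fatal to the whole process.
def discover_backends(service_path)
  ZK_SERVERS.each do |server|
    zk = nil
    begin
      zk = ZK.new(server)
      return zk.children(service_path)  # one-shot read of the registered backends
    rescue StandardError => e
      warn "ZK error on #{server} (#{e.class}: #{e.message}); trying next server"
    ensure
      zk.close! if zk
    end
  end
  raise "all ZK servers failed while discovering #{service_path}"
end

# Example use (hypothetical path layout):
# backends = discover_backends('/services/klima-server')
```

The point of the sketch is just that losing one server restarts discovery against the next one instead of taking the whole process down.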
I've noticed this now once or twice, and it's a little disturbing. A synapse process with multiple services defined eventually goes catatonic after a certain amount (as yet unquantified) of churn in the zookeeper backends. The symptoms are as follows:
Compare to the currently-live list in ZK: