
semi-catatonic synapse #34

Closed
memory opened this issue Dec 30, 2013 · 11 comments · Fixed by #36

memory commented Dec 30, 2013

I've noticed this now once or twice, and it's a little disturbing. A synapse process with multiple services defined eventually goes catatonic after a certain amount (as yet unquantified) of churn in the zookeeper backends. The symptoms are as follows:

  • the synapse process ignores SIGTERM, and can only be terminated with SIGKILL.
  • haproxy stops being updated despite nerve on the churned servers correctly reporting status to zookeeper. For example:
    [screenshot]

Compare to the currently-live list in ZK:

ls /nerve/services/qa1/ministry-worker/default
time = 0 msec
/nerve/services/qa1/ministry-worker/default: rc = 0
        i-e1e53b9b_ministry-worker
        i-d400ebfa_ministry-worker
        i-e3e53b99_ministry-worker
        i-dde53ba7_ministry-worker
        i-d500ebfb_ministry-worker
        i-d900ebf7_ministry-worker
        i-d800ebf6_ministry-worker
        i-de00ebf0_ministry-worker
        i-db00ebf5_ministry-worker
        i-ebe53b91_ministry-worker
        i-df00ebf1_ministry-worker
        i-da00ebf4_ministry-worker
  • the output log, however, is still being updated:
2013-12-26 01:05:39.679608500 D, [2013-12-26T01:05:39.679352 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:05:39 +0000
2013-12-26 01:06:42.752187500 D, [2013-12-26T01:06:42.752094 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:06:42 +0000
2013-12-26 01:07:45.824888500 D, [2013-12-26T01:07:45.824624 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:07:45 +0000
2013-12-26 01:08:48.807369500 D, [2013-12-26T01:08:48.807255 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:08:48 +0000
2013-12-26 01:09:51.676951500 D, [2013-12-26T01:09:51.676688 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:09:51 +0000
2013-12-26 01:10:54.518216500 D, [2013-12-26T01:10:54.517993 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:10:54 +0000
2013-12-26 01:11:57.621308500 D, [2013-12-26T01:11:57.621061 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:11:57 +0000
2013-12-26 01:13:02.798093500 D, [2013-12-26T01:13:02.797806 #21871] DEBUG -- : synapse: still running at 2013-12-26 01:13:02 +0000
  • nothing much seems to be happening syscall-wise:
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
[...and so on...]
  • if I SIGKILL synapse and restart it, it immediately comes up with a correct list of backends:

[screenshot]

igor47 (Collaborator) commented Jan 6, 2014

we've recently observed this problem in our environment as well; it's my top priority to investigate ATM

memory commented Jan 6, 2014

Well, I guess I'm glad to know it's not just me. If I can provide any additional info for debugging, please ping me on freenode or gchat.

igor47 (Collaborator) commented Jan 6, 2014

a way to reproduce this would be AWESOME


memory commented Jan 6, 2014

Ugh, I wish. I have yet to have it happen in <10 hours of process uptime. :(

igor47 (Collaborator) commented Jan 8, 2014

a stack trace from a stuck synapse process, courtesy of @nelgau: https://gist.github.com/nelgau/3a910d5da1ac09abfd22

schleyfox (Contributor) commented

Worth noting that this indicates the process gets stuck when trying to exit. We will investigate this further and let you know what we find.

memory commented Jan 9, 2014

Found another process in the catatonic state and attached gdb to get a stack trace; if I'm reading this correctly, it does indeed look like it's trying to shut down:

(gdb) bt
#0 0x00007f363f7e9f67 in sched_yield () from /lib/libc.so.6
#1 0x00007f3640724045 in gvl_yield (vm=0x1176010, th=<optimized out>) at thread_pthread.c:123
#2 0x00007f3640724b60 in rb_thread_schedule_limits () at thread.c:1025
#3 rb_thread_schedule () at thread.c:1035
#4 0x00007f3640724d77 in rb_thread_terminate_all () at thread.c:375
#5 0x00007f3640604516 in ruby_cleanup (ex=6) at eval.c:140
#6 0x00007f36406046d5 in ruby_run_node (n=0x1562808) at eval.c:244
#7 0x0000000000400a0b in main (argc=4, argv=0x7fff2342b518) at main.c:38

igor47 (Collaborator) commented Jan 9, 2014

so, reading through the code, we don't really have any shutdown handling to speak of. what's more confusing is that it shuts down cleanly so much of the time. i'm going to try to add shutdown handling modeled after nerve's and see if this helps
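
For reference, the trap-plus-main-loop pattern nerve uses can be sketched roughly like this (illustrative names only — `Reactor`, `watchers`, and `stop` are not Synapse's actual classes or API). The key point is that the signal handler only records the signal; the actual teardown happens in the main thread, since trap context in Ruby cannot safely take locks:

```ruby
# Minimal sketch of main-loop signal handling. The trap handler just sets
# a flag (doing real cleanup inside a trap handler risks deadlock in Ruby);
# the main loop notices the flag and stops each watcher itself.
class Reactor
  def initialize(watchers)
    @watchers = watchers
    @exiting = false
  end

  def run
    %w[INT TERM].each do |sig|
      Signal.trap(sig) { @exiting = true } # record the signal; no cleanup here
    end

    until @exiting
      sleep 0.1 # placeholder for the real work loop
    end

    # Clean shutdown happens in the main thread, outside trap context
    @watchers.each(&:stop)
  end
end
```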

memory commented Jan 16, 2014

FWIW, after deploying 0.8.0 across an environment with ~45 synapse instances, 10 hours after starting each process I was able to SIGTERM all of them and see them exit correctly. It's not conclusive, but it's at least positive enough that I'll turn off my cronned midnight SIGKILL and see if we're still in good shape after 48-72 hours.

memory commented Jan 17, 2014

Okay, now that we have proper cleanup, the underlying problem is a little clearer:

2014-01-17 17:15:48.525411500 I, [2014-01-17T17:15:48.525289 #6877] INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service klima-server
2014-01-17 17:16:18.529763500 E, [2014-01-17T17:16:18.529633 #6877] ERROR -- Synapse::Synapse: synapse: encountered unexpected exception #<Zookeeper::Exceptions::ContinuationTimeoutError: response for meth: :get_children, args: [0, "/nerve/services/staging/klima-server/default", nil, #<Zookeeper::Callbacks::WatcherCallback:0x00000002732038 @completed=false, @proc=#<Proc:0x00000002731cc8@/export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zookeeper-1.4.7/lib/zookeeper/callbacks.rb:24>>], not received within 30 seconds> in main thread
2014-01-17 17:16:18.529873500 W, [2014-01-17T17:16:18.529827 #6877] WARN -- Synapse::Synapse: synapse: exiting; sending stop signal to all watchers
2014-01-17 17:16:18.529951500 W, [2014-01-17T17:16:18.529907 #6877] WARN -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher exiting
2014-01-17 17:16:18.533320500 I, [2014-01-17T17:16:18.531645 #6877] INFO -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher cleaned up successfully
2014-01-17 17:16:18.533323500 W, [2014-01-17T17:16:18.531713 #6877] WARN -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher exiting
2014-01-17 17:16:18.533324500 I, [2014-01-17T17:16:18.531747 #6877] INFO -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher cleaned up successfully
[... the "zookeeper watcher exiting" / "cleaned up successfully" pair repeats for each remaining watcher ...]
2014-01-17 17:16:18.533515500 /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zookeeper-1.4.7/lib/zookeeper/continuation.rb:117:in `block in value': response for meth: :get_children, args: [0, "/nerve/services/staging/klima-server/default", nil, #<Zookeeper::Callbacks::WatcherCallback:0x00000002732038 @completed=false, @proc=#<Proc:0x00000002731cc8@/export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zookeeper-1.4.7/lib/zookeeper/callbacks.rb:24>>], not received within 30 seconds (Zookeeper::Exceptions::ContinuationTimeoutError)
2014-01-17 17:16:18.533608500   from /export/disk0/wb/bin/ruby-env/clemens/lib/ruby/1.9.1/monitor.rb:211:in `mon_synchronize'
2014-01-17 17:16:18.533631500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zookeeper-1.4.7/lib/zookeeper/continuation.rb:107:in `value'
2014-01-17 17:16:18.533649500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zookeeper-1.4.7/ext/c_zookeeper.rb:224:in `submit_and_block'
2014-01-17 17:16:18.533666500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zookeeper-1.4.7/ext/c_zookeeper.rb:40:in `get_children'
2014-01-17 17:16:18.533683500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zookeeper-1.4.7/lib/zookeeper/client_methods.rb:69:in `get_children'
2014-01-17 17:16:18.533699500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/client/base.rb:1055:in `call_and_check_rc'
2014-01-17 17:16:18.533716500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/client/threaded.rb:584:in `call_and_check_rc'
2014-01-17 17:16:18.533733500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/client/base.rb:717:in `block in children'
2014-01-17 17:16:18.533750500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/event_handler.rb:296:in `block in setup_watcher!'
2014-01-17 17:16:18.533766500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/event_handler.rb:314:in `block in synchronize'
2014-01-17 17:16:18.533783500   from /export/disk0/wb/bin/ruby-env/clemens/lib/ruby/1.9.1/monitor.rb:211:in `mon_synchronize'
2014-01-17 17:16:18.533799500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/event_handler.rb:314:in `synchronize'
2014-01-17 17:16:18.533815500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/event_handler.rb:283:in `setup_watcher!'
2014-01-17 17:16:18.533832500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/client/base.rb:1100:in `setup_watcher!'
2014-01-17 17:16:18.533848500   from /export/disk0/wb/bin/synapse/vendor/bundle/ruby/1.9.1/gems/zk-1.9.2/lib/zk/client/base.rb:716:in `children'
2014-01-17 17:16:18.533865500   from /export/disk0/wb/bin/synapse/lib/synapse/service_watcher/zookeeper.rb:58:in `discover'
2014-01-17 17:16:18.533881500   from /export/disk0/wb/bin/synapse/lib/synapse/service_watcher/zookeeper.rb:105:in `block in watcher_callback'
2014-01-17 17:16:18.533898500   from /export/disk0/wb/bin/synapse/lib/synapse/service_watcher/zookeeper.rb:17:in `call'
2014-01-17 17:16:18.533914500   from /export/disk0/wb/bin/synapse/lib/synapse/service_watcher/zookeeper.rb:17:in `start'
2014-01-17 17:16:18.533930500   from /export/disk0/wb/bin/synapse/lib/synapse.rb:33:in `block in run'
2014-01-17 17:16:18.533947500   from /export/disk0/wb/bin/synapse/lib/synapse.rb:33:in `map'
2014-01-17 17:16:18.533963500   from /export/disk0/wb/bin/synapse/lib/synapse.rb:33:in `run'
2014-01-17 17:16:18.533979500   from bin/synapse:60:in `<main>'
2014-01-17 17:16:18.880718500 fatal: Not a git repository (or any of the parent directories): .git
2014-01-17 17:16:19.203039500 fatal: Not a git repository (or any of the parent directories): .git

Obviously it's not great if a ZK server times out, but should that really be a fatal error for synapse? We're passing in multiple ZK servers, and a single timeout in one watcher should, it seems to me, at worst be cause for restarting that watcher and/or falling through to the next ZK server.
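
For what it's worth, non-fatal handling could look roughly like this (a sketch with illustrative names, not the actual fix; `discover_with_retry` is not Synapse's real API, and `ContinuationTimeoutError` is defined locally as a stand-in so the snippet runs without the zookeeper gem):

```ruby
# Sketch: retry discovery on a continuation timeout instead of letting the
# exception escape to the main thread and kill the whole process. In real
# code you would rescue Zookeeper::Exceptions::ContinuationTimeoutError.
ContinuationTimeoutError = Class.new(StandardError)

def discover_with_retry(max_attempts: 3, &discover)
  attempts = 0
  begin
    attempts += 1
    discover.call # in synapse this would be the zk.children(path, ...) call
  rescue ContinuationTimeoutError
    retry if attempts < max_attempts
    # Give up on this watcher only: the caller can tear it down and recreate
    # the ZK connection, possibly against the next server in the list.
    :watcher_needs_restart
  end
end
```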

igor47 (Collaborator) commented Jan 17, 2014

@memory agreed, but this is a separate issue; i opened a new one here
