Gracefully handle Riak disconnects #519

Merged
merged 1 commit on Apr 12, 2013

Conversation

reiddraper
Contributor

Previously, Riak CS would not start if it was unable to connect
to Riak. Furthermore, it would eventually crash if it was disconnected from Riak.
This was because the `poolboy` workers would die and `poolboy` would
not try to restart them, which eventually crashed the `poolboy` supervisor
and bubbled up to crash `riak_cs_sup`. The main
change here is that we pass the `auto_reconnect` option into
`riakc_pb_socket`, which allows it to return from `start_link` even if
it can't connect to Riak. Attempts to use `riakc_pb_socket` while it's in
this state will return `{error, disconnected}`. I've tried to handle as
many of the cases where this might happen as I could, and return a 503 to the end
user. There are cases where we might be far enough into the response
(headers already sent over the wire) where, for now, I just Let It Crash.
Originally, I had thought about adding a 'health' process that would
monitor Riak availability, for the purpose of informing the deployer
via log messages. However, for now, I think the log messages when Riak
gets disconnected are clear enough that it should be obvious. The sudden
spike in 503 responses should also be a pretty good indicator. Other smaller
changes are detailed below.

  • Allow `riak_cs_storage_d` to start even if it can't immediately connect to
    Riak
  • Refactor `riak_cs_wm_common:forbidden/2` to handle a disconnected Riak
    when retrieving the user record
Fixes #262
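
For reviewers who haven't used the client option before, here is a minimal sketch (not code from this branch; the bucket and key are made up) of what `auto_reconnect` plus the new error handling look like:

```erlang
%% Minimal sketch, not code from this PR. With auto_reconnect, start_link
%% succeeds even if Riak is down, and later requests return
%% {error, disconnected} instead of killing the caller.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087,
                                       [{auto_reconnect, true}]),
case riakc_pb_socket:get(Pid, <<"example-bucket">>, <<"example-key">>) of
    {ok, Obj} ->
        {ok, Obj};
    {error, disconnected} ->
        %% Riak is unreachable; the webmachine resource maps this to a 503
        {error, disconnected};
    {error, Other} ->
        {error, Other}
end.
```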


Testing!

Here are some ideas for testing, beyond the usual bits:

  • Start Riak CS when Riak is not running
  • Start Riak CS when it's configured to talk to the wrong Riak port
  • Kill/stop Riak while Riak CS is running and verify that requests start seeing 503 errors (I think I eliminated all the cases where they'd see a 403, so it's a bug if you see one); a quick smoke check follows this list.
  • Verify that things start working again when Riak comes back to life
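
As a quick way to exercise the third bullet, here is an assumption-laden smoke check from the Erlang shell. It assumes Riak CS is listening on 127.0.0.1:8080 (the default) and that an anonymous request comes back 503 rather than 403 while Riak is down:

```erlang
%% Illustrative smoke check only: issue an HTTP GET against Riak CS and print
%% the status code. Expect 503 while Riak is stopped, something else once it
%% comes back.
inets:start(),
{ok, {{_HttpVsn, StatusCode, _Reason}, _Headers, _Body}} =
    httpc:request(get, {"http://127.0.0.1:8080/", []}, [], []),
io:format("HTTP status: ~p~n", [StatusCode]).
```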

@kellymclaughlin
Contributor

Even with Riak running when I start CS, I still see a set of disconnected messages. Is that expected?

10:40:57:riak-cs(feature/handle-riak-disconnect) $ ./bin/riak-cs console
config is OK
Exec: /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../erts-5.9.3/bin/erlexec -boot /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../releases/1.3.0/riak-cs             -embedded -config /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../etc/app.config             -pa /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../lib/basho-patches             -args_file /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/../etc/vm.args -- console
Root: /Users/kelly/basho/repos/riak_cs-review/rel/riak-cs/bin/..
Erlang R15B03 (erts-5.9.3) [source] [64-bit] [smp:8:8] [async-threads:64] [kernel-poll:true] [dtrace]


=INFO REPORT==== 5-Apr-2013::10:40:59 ===
    alarm_handler: {set,{{disk_almost_full,"118435569"},[]}}
opening log file: "./log/access.log.2013_04_05_16"
10:40:59.595 [info] Application lager started on node 'riak-cs@127.0.0.1'
10:40:59.596 [info] Application poolboy started on node 'riak-cs@127.0.0.1'
10:40:59.615 [info] Application folsom started on node 'riak-cs@127.0.0.1'
10:40:59.696 [warning] No storage schedule defined. Calculation must be triggered manually.
10:40:59.697 [warning] Unable to verify moss.storage bucket settings (disconnected).
10:40:59.697 [info] Unable to verify bucket properties for storage daemon. Riak currently disconnected. Retrying in 5 seconds.
10:41:00.717 [warning] Unable to verify riak-cs-gc bucket settings (disconnected).
10:41:00.792 [info] Application riak_cs started on node 'riak-cs@127.0.0.1'
Eshell V5.9.3  (abort with ^G)
(riak-cs@127.0.0.1)1> 10:41:04.698 [warning] No storage schedule defined. Calculation must be triggered manually.
10:41:04.699 [info] Storage daemon successfully read bucket properties, continuing

@kellymclaughlin
Contributor

  • Eunit and eqc tests pass
  • Client and int tests pass
  • Dialyzer output clean
  • riak_test tests pass

@kellymclaughlin
Contributor

Do you think having indefinite 5-second retries is OK, or should we consider a limited number of checks at 5 seconds and then check less frequently? Just wondering if it becomes too spammy when the Riak node is down for an extended time. I do not feel certain either way; just thinking out loud.

@kellymclaughlin
Contributor

Should we try to handle this in a nicer way? I wonder if a disconnect is worth crashing the block server and the gc daemon processes.

(riak-cs@127.0.0.1)1> 12:00:29.783 [info] alarm_handler: {set,{system_memory_high_watermark,[]}}
12:15:08.358 [error] gen_server <0.6935.0> terminated with reason: no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222
12:15:08.360 [warning] Unable to verify riak-cs-gc bucket settings (disconnected).
12:15:08.366 [error] CRASH REPORT Process <0.6935.0> with 3 neighbours exited with reason: no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222 in gen_server:terminate/6 line 747
12:15:08.368 [error] Supervisor riak_cs_sup had child riak_cs_gc_d started with riak_cs_gc_d:start_link() at <0.287.0> exit with reason no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222 in context child_terminated
12:15:08.368 [error] Supervisor riak_cs_delete_fsm_sup had child undefined started with {riak_cs_delete_fsm,start_link,undefined} at <0.6934.0> exit with reason no case clause matching {error,disconnected} in riak_cs_block_server:handle_cast/2 line 222 in context child_terminated

@reiddraper
Contributor Author

Even with Riak running when I start CS, I still see a set of disconnected messages. Is that expected?

Ah, nice catch. I see it connects fine 5 seconds later, but I'll need to look into why that happens. Maybe if you start the riakc_pb_socket with auto_reconnect there is a race condition between starting the proc and sending requests to it.
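
If it is that race, one way to paper over it would be to poll the connection state before issuing the first request. This is purely an illustrative sketch; wait_for_connection/2 is a made-up helper, not something in this branch:

```erlang
%% Illustrative helper only: poll riakc_pb_socket:is_connected/1 until the
%% socket process reports a live connection, giving up after Retries tries.
wait_for_connection(_Pid, 0) ->
    {error, disconnected};
wait_for_connection(Pid, Retries) when Retries > 0 ->
    case riakc_pb_socket:is_connected(Pid) of
        true ->
            ok;
        {false, _Failures} ->
            timer:sleep(100),
            wait_for_connection(Pid, Retries - 1)
    end.
```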

Do you think having indefinite 5-second retries is OK, or should we consider a limited number of checks at 5 seconds and then check less frequently? Just wondering if it becomes too spammy when the Riak node is down for an extended time. I do not feel certain either way; just thinking out loud.

Yeah, I could go either way on that. It is worth pointing out that this only happens once in the lifetime of the storage calculation (at least that was my read on things), so if we've already checked the bucket properties and then Riak gets disconnected, we won't keep polling, since we've already got that far. Though if the daemon crashes, they will have to be read again.

Should we try to handle this in a nicer way? I wonder if a disconnect is worth crashing the block server and the gc daemon processes.

Yeah -- these might be worth a mumble-discussion, as there are varying levels of handling we could do.
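
As one concrete example of the lighter-touch end of that spectrum, a clause like the following would match the error tuple and reply instead of letting the case clause fail. The message shape and state record here are made up for illustration; this is not the actual riak_cs_block_server code:

```erlang
%% Hypothetical shape, not the real riak_cs_block_server clauses: reply with
%% the error instead of crashing on a missing case clause while Riak is away.
handle_cast({get_block, From, Bucket, Key}, State = #state{riak_pid = RiakPid}) ->
    case riakc_pb_socket:get(RiakPid, Bucket, Key) of
        {ok, Object} ->
            From ! {block, {ok, Object}};
        {error, disconnected} ->
            From ! {block, {error, disconnected}};
        {error, Reason} ->
            From ! {block, {error, Reason}}
    end,
    {noreply, State}.
```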

@reiddraper
Contributor Author

re: seeing disconnect warnings on startup, it must be a race. I only see it with the storage daemon and gc daemon, maybe 50% of the time. The client tries to account for this race, but is either broken or the race is somewhere else.

@ghost assigned kellymclaughlin Apr 9, 2013
@reiddraper
Contributor Author

Ok, figured out the disconnect issue. We do successfully try to connect before we receive the first `set_bucket` request; however, some of these connection requests fail (they get retried later). The connection error they get is `{error, econnreset}`. This is because the default `pb_backlog` in Riak is 64, but we're trying 128 concurrent connections from Riak CS at startup time (because of poolboy's 'prepopulate' feature). One way to 'fix' this is to set the `pb_backlog` on Riak to 128 (though this won't be enough if you start more than one CS node at once). The good news is that with connection retries, the situation works itself out, you just get some error messages along the way...
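
For anyone reproducing this, the workaround mentioned above is a one-line change to Riak's app.config. It's shown here under the riak_api section as an assumption; where pb_backlog lives depends on the Riak version:

```erlang
%% Riak app.config excerpt (workaround only, not part of this PR): raise the
%% protocol buffers listen backlog above the 128 connections a starting
%% Riak CS node opens through poolboy's prepopulate.
{riak_api, [
    {pb_backlog, 128}
]}
```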

@kellymclaughlin
Contributor

Do you think having indefinite 5-second retries is OK, or should we consider a limited number of checks at 5 seconds and then check less frequently? Just wondering if it becomes too spammy when the Riak node is down for an extended time. I do not feel certain either way; just thinking out loud.

Yeah, I could go either way on that. It is worth pointing out that this only happens once in the lifetime of the storage calculation (at least that was my read on things), so if we've already checked the bucket properties and then Riak gets disconnected, we won't keep polling, since we've already got that far. Though if the daemon crashes, they will have to be read again.

Ok, then I'd say let's just leave this as-is for now.

@kellymclaughlin
Contributor

Ok, figured out the disconnect issue. We do successfully try to connect before we receive the first `set_bucket` request; however, some of these connection requests fail (they get retried later). The connection error they get is `{error, econnreset}`. This is because the default `pb_backlog` in Riak is 64, but we're trying 128 concurrent connections from Riak CS at startup time (because of poolboy's 'prepopulate' feature). One way to 'fix' this is to set the `pb_backlog` on Riak to 128 (though this won't be enough if you start more than one CS node at once). The good news is that with connection retries, the situation works itself out, you just get some error messages along the way...

Another option here might be to try and get poolboy changed to allow a pool to be populated in a more staggered fashion. Seems like a generally useful feature for a worker pool library imo.

@reiddraper
Contributor Author

Another option here might be to try and get poolboy changed to allow a pool to be populated in a more staggered fashion. Seems like a generally useful feature for a worker pool library imo.

Yeah @andrewjstone and I talked briefly about that earlier today. I too think it'd be useful, though I'm not sure how difficult it might be. I have a couple other [1] [2] poolboy issues I've been musing about too, so maybe we could throw the staggered start stuff in there too?

@reiddraper
Contributor Author

Alright, so recapping the outstanding issues raised during review:

  • 'More gracefully' handle disconnects in the block-server (still some open questions there)
  • Eliminate the start-up 'syn flood' where the gc and storage daemons briefly see a disconnect once on startup

Do we want to address either of these in this PR?

@kellymclaughlin
Contributor

Do we want to address either of these in this PR?

I would say maybe to the former and no to the latter question.
