rexi_server dies without being restarted #1571
Comments
Of note here is that the rexi_server processes weren't restarted. The system_limit error was just the trigger in this case (and an unrelated bug that hasn't been tracked down yet).
It's set to a maximum of 1 restart per second: https://github.com/apache/couchdb/blob/master/src/rexi/src/rexi_server_sup.erl#L29. Perhaps increase the number of restarts and the period (say, 10 in 10 seconds)? But I wonder if we'd also want a cool-down time, in case the system limit or other overload scenario is temporary and would clear out after a few seconds.
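For reference, a minimal sketch of what loosening that intensity might look like; the module name is made up and the child list is left empty since the per-node servers are added dynamically, so this is illustrative rather than the actual rexi_server_sup code:

```erlang
%% Sketch only: supervisor flags are {Strategy, MaxRestarts, MaxSeconds}.
%% rexi_server_sup currently allows 1 restart in 1 second; a more
%% permissive setting would be e.g. 10 restarts within 10 seconds.
-module(intensity_sketch_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = {one_for_one, 10, 10},
    %% no static children in this sketch; the real rexi_server_sup has
    %% its per-node servers added dynamically
    ChildSpecs = [],
    {ok, {SupFlags, ChildSpecs}}.
```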
I was thinking more that the failure should have spidered up to the top rexi supervisor rather than giving up. Also of note: the rexi_buffer processes were all alive; I should have mentioned that earlier as well. The half-alive part of the issue is, I think, the bigger bug.
Hrm, that actually doesn't look right from the logs. There aren't enough rexi_server exits to trigger that threshold. However, there is this log message on a few nodes:
which seems odd.
In https://github.com/apache/couchdb/blob/master/src/rexi/src/rexi_sup.erl#L23 I noticed it's a one_for_one strategy, so if one of the top-level children dies and restarts (like, say, rexi_server_sup did in the log), nothing will restart all the rexi servers once rexi_server_sup respawns.
So maybe we'd want a `rest_for_one` strategy. In this case, if a child dies, the children after it in the list are restarted as well.
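To illustrate the semantics, here is a toy, self-contained supervisor (nothing here is the actual rexi code; the module and worker names are made up): killing the first child restarts the second as well, while killing the second restarts only itself.

```erlang
%% Toy demo of rest_for_one: children restart in declaration order, and a
%% crash restarts the crashed child plus every child listed after it.
-module(rest_for_one_demo).
-behaviour(supervisor).
-export([start_link/0, start_worker/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Minimal worker: a linked, registered process that just waits.
start_worker(Name) ->
    Pid = spawn_link(fun() ->
        register(Name, self()),
        receive stop -> ok end
    end),
    {ok, Pid}.

init([]) ->
    SupFlags = {rest_for_one, 3, 10},
    ChildSpecs = [
        {first,  {?MODULE, start_worker, [first_worker]},
         permanent, 100, worker, [?MODULE]},
        {second, {?MODULE, start_worker, [second_worker]},
         permanent, 100, worker, [?MODULE]}
    ],
    {ok, {SupFlags, ChildSpecs}}.
```

In a shell, after `rest_for_one_demo:start_link().`, running `exit(whereis(first_worker), kill).` restarts both workers, while killing `second_worker` restarts only the second.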
I think that'd work because the rexi_server_mon instances don't store state, so there's no ordering concern (i.e., would things get out of whack if only the rexi_server_mon for rexi_server_sup died?). Although maybe it's just me being tired, but I'm also kind of leaning towards one_for_all and just restarting the whole shebang if anything dies there.
Also not sure whether we need to tweak the restart intensity or not. 3 in 10s seems fine at this level, and it wouldn't have been triggered by the incident anyway.
For restart intensity, I was thinking of:
Previously, as described in issue apache#1571, the `rexi_server_sup` supervisor could die and restart. After it restarted, `rexi_server_mon` would not respawn the rexi servers because it wouldn't notice that `rexi_server_sup` had gone away and come back. That would leave the cluster in a disabled state. To fix the issue, switch the restart strategy to `rest_for_one`. With this strategy, if a child dies, all the children after it in the list are restarted. For example, if `rexi_server` dies, all the children are restarted. If `rexi_server_sup` dies, `rexi_server_mon` is restarted, and on restart `rexi_server_mon` properly spawns all the rexi servers. The same holds for the buffers: if `rexi_buffer_sup` dies, `rexi_buffer_mon` is restarted and on restart it spawns the buffers as expected. Fixes: apache#1571
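A way one might sanity-check that behaviour from a remote shell on a node (a sketch, not part of the change; it assumes the supervisors are locally registered under the names above, and the recovery it expects is the one described in the commit message):

```erlang
%% Sketch: simulate the incident and confirm the per-node rexi servers
%% come back.

%% Count the per-node rexi servers before the crash.
Before = length(supervisor:which_children(rexi_server_sup)).

%% Simulate rexi_server_sup dying, as in the incident.
exit(whereis(rexi_server_sup), kill).

%% Give rexi_sup a moment to restart rexi_server_sup and rexi_server_mon.
timer:sleep(1000).

%% With rest_for_one, rexi_server_mon is restarted too and respawns the
%% per-node servers, so this should match the count from before. With the
%% old one_for_one strategy, rexi_server_sup came back empty.
After = length(supervisor:which_children(rexi_server_sup)).
{Before, After}.
```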
Turns out that rexi_servers can die in such a way that they're not restarted. This can (and has!) left a cluster without the ability to issue RPC calls, effectively rendering the cluster useless.
A slightly redacted log showing this happening due to hitting the process limit:
2018-08-18T21:00:05.106860Z db3.clustername <0.19934.2> - gen_server 'rexi_server_dbcore@db1.clustername.cloudant.net' terminated with reason: system_limit at erlang:spawn_opt/1 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:71) <= gen_server:try_dispatch/4(line:593) <= gen_server:handle_msg/5(line:659) <= proc_lib:init_p_do_apply/3(line:237)#012 state: {st,6946959,7078032,{[],[]},0,0}
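For context on the termination itself, here is a rough, hypothetical paraphrase of the failure mode in that stack trace (this is not the actual rexi_server code; the module, message, and state shapes are made up): a cast handler that spawn_monitors a worker per request crashes once the VM process limit is hit, because `erlang:spawn_monitor` raises `error:system_limit` and the error escapes `handle_cast/2`.

```erlang
%% Hypothetical sketch, not the actual rexi_server code. The cast handler
%% spawn_monitors one worker per request; once the VM's process limit is
%% reached, spawn_monitor/1 raises error:system_limit, which escapes
%% handle_cast/2 and terminates the gen_server, matching the stack trace
%% in the log above.
-module(spawny_server).
-behaviour(gen_server).
-export([start_link/0, doit/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Made-up API: run Fun in a monitored worker process.
doit(Fun) ->
    gen_server:cast(?MODULE, {doit, Fun}).

init([]) ->
    {ok, #{workers => #{}}}.

handle_call(_Msg, _From, St) ->
    {reply, ok, St}.

handle_cast({doit, Fun}, #{workers := Workers} = St) ->
    %% Raises error:system_limit when no more processes can be spawned;
    %% nothing catches it, so the whole server dies.
    {Pid, Ref} = spawn_monitor(Fun),
    {noreply, St#{workers := Workers#{Ref => Pid}}};
handle_cast(_Msg, St) ->
    {noreply, St}.

handle_info({'DOWN', Ref, process, _Pid, _Reason}, #{workers := Workers} = St) ->
    {noreply, St#{workers := maps:remove(Ref, Workers)}};
handle_info(_Msg, St) ->
    {noreply, St}.
```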