
rexi_server dies without being restarted #1571

Closed
davisp opened this Issue Aug 21, 2018 · 8 comments

davisp commented Aug 21, 2018

It turns out that rexi_server processes can die in such a way that they're not restarted. This can (and has!) left a cluster without the ability to issue RPC calls, effectively rendering the cluster useless.

A slightly redacted log showing it happen due to hitting the process limit:

2018-08-18T21:00:05.106860Z db3.clustername <0.19934.2> - gen_server 'rexi_server_dbcore@db1.clustername.cloudant.net' terminated with reason: system_limit at erlang:spawn_opt/1 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:71) <= gen_server:try_dispatch/4(line:593) <= gen_server:handle_msg/5(line:659) <= proc_lib:init_p_do_apply/3(line:237)#012 state: {st,6946959,7078032,{[],[]},0,0}
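
For context: a system_limit error from erlang:spawn_opt/1 means the VM's process table is full. The headroom on a node can be checked with standard BIFs (a quick sketch; any numbers returned are node-specific):

    %% Standard BIFs available on any Erlang node.
    erlang:system_info(process_count).   %% processes currently alive
    erlang:system_info(process_limit).   %% hard cap, set at VM startup via `erl +P N`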

davisp commented Aug 21, 2018

Of note here is that the rexi_server processes weren't restarted. The system_limit error was just the trigger in this case (and is itself an unrelated, as-yet-unfound bug).

nickva commented Aug 21, 2018

It's set to a maximum of 1 restart per second:

https://github.com/apache/couchdb/blob/master/src/rexi/src/rexi_server_sup.erl#L29

Perhaps increase the number of restarts and the period (say, 10 in 10 seconds)? But I wonder if we'd also want a cool-down time, in case the system limit or other overload scenario is temporary and would clear up over a few seconds.
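
A minimal sketch of that tweak, assuming rexi_server_sup keeps its current shape where the actual servers are attached dynamically by rexi_server_mon (so the static child list stays empty):

    init([]) ->
        %% was {one_for_one, 1, 1}: now tolerate up to 10 restarts
        %% within 10 seconds before this supervisor gives up
        {ok, {{one_for_one, 10, 10}, []}}.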

davisp commented Aug 21, 2018

I was thinking more that the failure should have propagated up to the top rexi supervisor rather than giving up.

Also of note: the rexi_buffer processes were all still alive. I should have mentioned that earlier as well. The half-alive state is, I think, the bigger bug here.

davisp commented Aug 21, 2018

Hrm, that actually doesn't look right given the logs. There aren't enough rexi_server exits to trigger that threshold. However, there is this log message on a few nodes:

2018-08-18T21:00:12.332403Z db2.clustername <0.360.0> - Supervisor rexi_sup had child rexi_server_sup started with rexi_server_sup:start_link(rexi_server_sup) at <0.362.0> exit with reason shutdown in context child_terminated

which seems odd.

nickva commented Aug 21, 2018

In https://github.com/apache/couchdb/blob/master/src/rexi/src/rexi_sup.erl#L23 I noticed it's a one_for_one strategy, so if any of the top-level children dies and restarts (like, say, rexi_server_sup did in the log), nothing will restart the rexi servers when it respawns.

rexi_server_mon only checks if it needs to restart servers on cluster membership changes and on startup: https://github.com/apache/couchdb/blob/master/src/rexi/src/rexi_server_mon.erl#L56-L63
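
For reference, that startup-and-membership-only behavior is roughly the following pattern (a sketch, not the actual module; net_kernel:monitor_nodes/1 as the membership source and start_servers/0 as a reconciliation helper are assumptions here):

    init([]) ->
        ok = net_kernel:monitor_nodes(true), %% assumed membership source
        ok = start_servers(),                %% hypothetical helper: reconcile at startup
        {ok, nil}.

    %% Reconciliation also runs on membership changes, but a restart of
    %% rexi_server_sup alone produces no nodeup/nodedown message, so its
    %% dynamically added children stay gone.
    handle_info({nodeup, _Node}, St) ->
        ok = start_servers(),
        {noreply, St};
    handle_info({nodedown, _Node}, St) ->
        ok = start_servers(),
        {noreply, St}.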

So maybe we'd want a rest_for_one strategy in rexi_sup (in addition to increasing the restart intensity in rexi_server_sup?). Something like:

init([]) ->
    %% rest_for_one: if a child dies, restart it and every child started
    %% after it in the list. Order matters: each monitor is listed after
    %% the supervisor whose children it repopulates.
    {ok, {{rest_for_one, 3, 10}, [
        {
            rexi_server,
            {rexi_server, start_link, [rexi_server]},
            permanent,
            100,
            worker,
            [rexi_server]
        },
        {
            rexi_server_sup,
            {rexi_server_sup, start_link, [rexi_server_sup]},
            permanent,
            100,
            supervisor,
            [rexi_server_sup]
        },
        {
            %% on restart, repopulates the servers under rexi_server_sup
            rexi_server_mon,
            {rexi_server_mon, start_link, [rexi_server]},
            permanent,
            100,
            worker,
            [rexi_server_mon]
        },
        {
            rexi_buffer_sup,
            {rexi_server_sup, start_link, [rexi_buffer_sup]},
            permanent,
            100,
            supervisor,
            [rexi_server_sup]
        },
        {
            %% on restart, repopulates the buffers under rexi_buffer_sup
            rexi_buffer_mon,
            {rexi_server_mon, start_link, [rexi_buffer]},
            permanent,
            100,
            worker,
            [rexi_server_mon]
        }
    ]}}.

In this case, if rexi_server_sup dies, it will take down rexi_server_mon (and, unfortunately, the buffers as well), but on restart it will bring back rexi_server_mon, which should repopulate the servers in rexi_server_sup.

davisp commented Aug 21, 2018

I think that'd work because the rexi_server_mon instances don't store state, so there's no ordering concern (i.e., things wouldn't get out of whack if only the rexi_server_mon for rexi_server_sup died). Although maybe it's just me being tired, but I'm also kind of leaning towards one_for_all and just restarting the whole shebang if anything dies there.
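
If the one_for_all route were taken instead, the change to the spec above would just be the strategy tuple (a sketch; ChildSpecs stands for the same child list shown earlier):

    init([]) ->
        %% one_for_all: any child dying restarts every child under rexi_sup
        {ok, {{one_for_all, 3, 10}, ChildSpecs}}.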

davisp commented Aug 21, 2018

Also, I'm not sure whether we need to tweak the restart intensity. 3 restarts in 10 seconds seems fine at this level, and it wouldn't have been triggered by this incident anyway.

nickva commented Aug 21, 2018

I think rexi_server_mon can restart by itself. On restart it will check rexi_server_sup and start or stop servers as needed, then wait on the monitor for cluster membership changes and do the same when anything happens there.

For restart intensity I was thinking of the {one_for_one, 1, 1} in https://github.com/apache/couchdb/blob/master/src/rexi/src/rexi_server_sup.erl#L29 but now I'm not sure anymore. Just switching the restart strategy is the better thing to try first.

nickva added a commit to cloudant/couchdb that referenced this issue Aug 21, 2018

Switch rexi_sup restart strategy to rest_for_one
Previously, as described in issue apache#1571, the `rexi_server_sup` supervisor could die and restart. After it restarted, `rexi_server_mon` would not respawn rexi servers, as it wouldn't notice that `rexi_server_sup` had gone away and come back. That would leave the cluster in a disabled state.

To fix the issue, switch the restart strategy to `rest_for_one`. With this strategy, if a child dies, the supervisor restarts it and all the children after it in the list. For example, if `rexi_server` dies, all the children are restarted. If `rexi_server_sup` dies, it will restart `rexi_server_mon`, and on restart `rexi_server_mon` will properly spawn all the rexi servers.

The same holds for the buffers: if `rexi_buffer_sup` dies, `rexi_buffer_mon` is restarted and spawns the buffers as expected.

Fixes: apache#1571

nickva closed this in #1572 Aug 22, 2018

nickva added a commit that referenced this issue Aug 22, 2018

Switch rexi_sup restart strategy to rest_for_one