Skip to content
This repository

Chicken & egg problem with vnode proxy process start & registration? #155

Closed
slfritchie opened this Issue · 2 comments

3 participants

Scott Lystig Fritchie Joseph Blomstedt Jon Meredith
Scott Lystig Fritchie
Owner

I have this theory that there's a chicken & egg problem with the vnode proxy process startup.

  1. Riak is started.
  2. Vnode operations arrive at the riak_core_vnode_master
  3. For each operation, forwarding via a registered name is attempted. However, the vnode proxy process hasn't registered its name yet, so gen_fsm:send_event/2 fails with a badarg error.
  4. The riak_kv app finishes starting.

However, because step 3 fails frequently enough, the riak_core app's supervisor's maximum limit is hit before the riak_kv app has started.

I dunno if that theory is correct or not, but here's the log to show the sequence of events.

2012-03-22 03:49:25.096 [info] <0.7.0> Application lager started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.295 [info] <0.7.0> Application riak_core started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.299 [info] <0.7.0> Application riak_control started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.300 [info] <0.7.0> Application basho_metrics started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.304 [info] <0.7.0> Application cluster_info started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.458 [info] <0.1283.0>@riak_core:wait_for_application:396 Waiting for application riak_pipe to start (0 seconds).
2012-03-22 03:49:25.461 [info] <0.7.0> Application riak_pipe started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.498 [info] <0.7.0> Application inets started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.502 [info] <0.7.0> Application mochiweb started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.520 [info] <0.7.0> Application erlang_js started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.530 [info] <0.7.0> Application luke started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.559 [info] <0.1283.0>@riak_core:wait_for_application:390 Wait complete for application riak_pipe (0 seconds)
2012-03-22 03:49:25.606 [error] <0.1347.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.608 [error] <0.1347.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.731 [info] <0.7.0> Application bitcask started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:25.761 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1347.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.765 [error] <0.1415.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.766 [error] <0.1415.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.767 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1415.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.769 [error] <0.1416.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.772 [error] <0.1416.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.773 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1416.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.774 [error] <0.1447.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.775 [error] <0.1447.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.776 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1447.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.778 [error] <0.1455.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.780 [error] <0.1455.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.786 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1455.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.807 [error] <0.1464.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.808 [error] <0.1464.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.810 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1464.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.811 [error] <0.1467.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.812 [error] <0.1467.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.813 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1467.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.815 [error] <0.1470.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.817 [error] <0.1470.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.818 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1470.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.819 [error] <0.1485.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.821 [error] <0.1485.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.822 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1485.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.825 [error] <0.1494.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.826 [error] <0.1494.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.827 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1494.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.832 [error] <0.1520.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.836 [error] <0.1520.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.837 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1520.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.838 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1520.0> exit with reason reached_max_restart_intensity in context shutdown
2012-03-22 03:49:25.996 [info] <0.2710.0>@riak_core:wait_for_application:396 Waiting for application riak_kv to start (0 seconds).
2012-03-22 03:49:26.003 [info] <0.7.0> Application riak_kv started on node 'riak@host-2150-riak.example.com'
2012-03-22 03:49:26.008 [info] <0.7.0> Application riak_kv exited with reason: shutdown
Joseph Blomstedt
Owner

The issue here appears to be that we are having k/v requests arrive at the node before the k/v service has been marked as up. Riak ensures the vnode proxy process have started and are registered before marking a given service as up -- double checked the code this morning and tested a few times to confirm.

In the future, we should see about fixing these cases where requests are sent before the service is marked as up. However, we can solve this problem independently of that issue.

The above crash should only occur in the {legacy_vnode_routing, true} case. In legacy mode, requests are sent through the vnode master which then routes the requests through the proxy processes. I'll change riak_core_vnode_master to delay handling messages until after the relevant service has completely started.

In the {legacy_vnode_routing, false} case, nodes directly send requests to the proxy processes, avoiding the master entirely. At worst, if the request comes in before k/v has started, the request will fail. There will not be a cascading supervisor failure.

Joseph Blomstedt jtuple closed this
Joseph Blomstedt
Owner

Fix for issue has been merged in. Will be available in Riak 1.1.2.

Heinz N. Gies Licenser referenced this issue from a commit in Licenser/riak_test_core
Joseph Blomstedt Add test for issue basho/riak_core#155 2a89366
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.