Add support to vnode master to wait until related service has started #161

Merged
merged 2 commits into from Apr 4, 2012

2 participants

@jtuple

This patch changes riak_core_vnode_master so that the vnode master will wait until the related service has started before handling requests. This is necessary in legacy_vnode_routing mode because the routing proxies may not be available until the service has fully started. To actually enable waiting, vnode services must change how they spawn the vnode master, passing in the name of the related service. A separate pull-request makes this change to riak_kv.

This change (and the related riak_kv change) is designed to address issue #155.

@jtuple

A reproducible test has been added to riak_test as well. It requires changes introduced in the first commit of this pull-request in order to run. Those changes allow Riak to have a delayed startup, which the tests uses to mock the vnode proxy startup process to introduce additional delay that makes the race condition we're fixing reproducible.

To verify changes. Setup and run the gh_riak_core_155 riak_test.

Running against basho/riak_core@67aaf78 and all other deps at 1.1 and the test should fail.

Running against 1.1 using this branch for riak_core and gh-riak-core-155-kv for riak_kv, and the test should pass.

Example expected failure output:

$ ./riak_test rtdev.config gh_riak_core_155
14:13:14.596 [info] Application lager started on node 'riak_test@127.0.0.1'
14:13:14.599 [info] Riak path: "/tmp/rt"
14:13:14.600 [info] Running: /tmp/rt/dev/dev1/bin/riak stop
14:13:14.828 [info] Resetting nodes to fresh state
14:13:14.828 [debug] Running: git --git-dir="/tmp/rt/dev/.git" --work-tree="/tmp/rt/dev" reset HEAD --hard
14:13:14.858 [debug] Running: git --git-dir="/tmp/rt/dev/.git" --work-tree="/tmp/rt/dev" clean -fd
14:13:14.879 [info] Running: /tmp/rt/dev/dev1/bin/riak start
14:13:16.626 [debug] Supervisor inet_gethost_native_sup started undefined at pid <0.60.0>
14:13:16.627 [debug] Supervisor kernel_safe_sup started inet_gethost_native:start_link() at pid <0.59.0>
14:13:16.640 [info] Deployed nodes: ['dev1@127.0.0.1']
14:13:16.642 [info] Cluster built: ['dev1@127.0.0.1']
14:13:16.648 [info] Adding delayed start to app.config
14:13:16.649 [info] Running: /tmp/rt/dev/dev1/bin/riak stop
14:13:18.889 [info] Running: /tmp/rt/dev/dev1/bin/riak start
14:13:21.975 [info] Running: /tmp/rt/dev/dev1/bin/riak stop
14:13:24.207 [info] Running: /tmp/rt/dev/dev1/bin/riak start
14:13:25.382 [info] Installed mocks to delay riak_kv proxy startup
14:13:25.382 [info] Issuing 10000 gets against 'dev1@127.0.0.1'
14:13:27.295 [info] Verifying 'dev1@127.0.0.1' has not crashed
escript: exception error: {assertEqual_failed,[{module,gh_riak_core_155},
                                      {line,35},
                                      {expression,"net_adm : ping ( Node )"},
                                      {expected,pong},
                                      {value,pang}]}
  in function  gh_riak_core_155:'-gh_riak_core_155/0-fun-0-'/2
  in call from gh_riak_core_155:'-gh_riak_core_155/0-lc$^0/1-0-'/2
  in call from gh_riak_core_155:gh_riak_core_155/0
  in call from riak_test:main/1
  in call from escript:run/2
  in call from escript:start/1
  in call from init:start_it/1
  in call from init:start_em/1

Example expected success output:

$ ./riak_test rtdev.config gh_riak_core_155
14:14:12.895 [info] Application lager started on node 'riak_test@127.0.0.1'
14:14:12.897 [info] Riak path: "/tmp/rt"
14:14:12.898 [info] Running: /tmp/rt/dev/dev1/bin/riak stop
14:14:13.119 [info] Resetting nodes to fresh state
14:14:13.119 [debug] Running: git --git-dir="/tmp/rt/dev/.git" --work-tree="/tmp/rt/dev" reset HEAD --hard
14:14:13.145 [debug] Running: git --git-dir="/tmp/rt/dev/.git" --work-tree="/tmp/rt/dev" clean -fd
14:14:13.160 [info] Running: /tmp/rt/dev/dev1/bin/riak start
14:14:14.906 [debug] Supervisor inet_gethost_native_sup started undefined at pid <0.60.0>
14:14:14.907 [debug] Supervisor kernel_safe_sup started inet_gethost_native:start_link() at pid <0.59.0>
14:14:14.920 [info] Deployed nodes: ['dev1@127.0.0.1']
14:14:14.922 [info] Cluster built: ['dev1@127.0.0.1']
14:14:14.930 [info] Adding delayed start to app.config
14:14:14.930 [info] Running: /tmp/rt/dev/dev1/bin/riak stop
14:14:17.161 [info] Running: /tmp/rt/dev/dev1/bin/riak start
14:14:20.270 [info] Running: /tmp/rt/dev/dev1/bin/riak stop
14:14:22.503 [info] Running: /tmp/rt/dev/dev1/bin/riak start
14:14:23.667 [info] Installed mocks to delay riak_kv proxy startup
14:14:23.667 [info] Issuing 10000 gets against 'dev1@127.0.0.1'
14:14:25.590 [info] Verifying 'dev1@127.0.0.1' has not crashed
14:14:35.601 [info] Test passed
@jonmeredith

Should there be a pr for search too?

@jtuple

It's unclear to me if the search and pipe code actually ever send requests to a node before the service is marked as up. I always figured it was a get/put FSM issue, where we check the service when computing preflists, but then don't recheck when sending messages. Ie. the node could restart in the middle and have the kv service currently down and proxies not yet started.

But, for safety sake, we can certainly add the one-line change to search, pipe, etc.

@jonmeredith

+1 merge

@jtuple jtuple was assigned Apr 3, 2012
@jtuple jtuple merged commit 3119736 into 1.1 Apr 4, 2012
@seancribbs seancribbs deleted the gh155-requests-before-proxy branch Apr 1, 2015
@jtuple jtuple was unassigned by ooshlablu Apr 25, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment