Handle incompatible records between the 1.3 and 1.4 release #120

Closed
wants to merge 15 commits

2 participants

@cmeiklejohn

I've confirmed a problem with a rolling upgrade between the 1.3 and 1.4 versions of Riak.

The root of the problem is that the node hosting Riak Control initiates an RPC request to all nodes and pattern matches the result against a record. The format of this record changed between the 1.3 and 1.4 releases without providing a new name for the record, so the pattern match fails.

This can occur when either:

1.) Riak Control hosted on a 1.3 node makes an RPC request to a 1.4 node.
2.) Riak Control hosted on a 1.4 node makes an RPC request to a 1.3 node.

Both of these scenarios trigger a badmatch exception in the riak_control_session module. This causes the Riak supervisor to restart the riak_control application, which then continues to crash while polling for state until the node shuts down after hitting the maximum restart threshold.
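For illustration, here is a minimal sketch of the failure mode; the record fields and function names are simplified stand-ins, not the actual riak_control_session code.

%% Minimal sketch of the failure mode (illustrative names only).
-record(member_info, {node, status, reachable, vnodes,
                      handoffs, ring_pct, pending_pct}).

get_member_info(Node) ->
    %% A 1.4 node replies with a tuple that is still tagged
    %% 'member_info' but carries extra fields, so it no longer matches
    %% the locally compiled record pattern and the session process
    %% crashes (no matching clause).
    case rpc:call(Node, riak_control_session, get_my_info, []) of
        #member_info{} = Info ->
            {ok, Info};
        {badrpc, Reason} ->
            {error, Reason}
    end.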

This patch provides a few things:

  • Use macros for record definitions moving forward (see the sketch after this list).
  • Make the 1.3 record the definitive member_info record, and the proposed-1.4.1 record the definitive member_info_v2 record.
  • When receiving an incompatible 1.4 record, display the node as incompatible.
  • Suppress display of node-specific information when an incompatible node is returned.
  • Fix broken low-watermark handling in the node resource.
  • Add a capability to ensure that we don't send 1.4.1+ records to 1.2.x and 1.3.x nodes.
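Roughly, the record and macro approach looks like the sketch below; the field names are illustrative and do not necessarily match the actual header.

%% Shared header: one macro per record version so every module refers
%% to the definitions the same way.
-define(MEMBER_INFO, #member_info).
-define(MEMBER_INFO_V2, #member_info_v2).

-record(member_info, {node, status, reachable, vnodes,
                      handoffs, ring_pct, pending_pct}).
-record(member_info_v2, {node, status, reachable, vnodes,
                         handoffs, ring_pct, pending_pct,
                         mem_total, mem_used, mem_erlang}).

%% Anything that matches neither record came from an incompatible node
%% and is flagged instead of crashing the session process.
classify(?MEMBER_INFO{} = Info)    -> {v1, Info};
classify(?MEMBER_INFO_V2{} = Info) -> {v2, Info};
classify(_Other)                   -> incompatible.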

Updates

  • Capability negotiation could not be used reliably: race conditions when adding a node with a greater capability than the existing cluster would trigger node failures.
  • Instead, the RPC call was renamed, with the original call triggering a badrpc error that is handled (see the sketch after this list).
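A sketch of the renamed-RPC fallback; the function names here are illustrative, not necessarily the ones used in the patch.

%% New nodes export the member information under a new RPC name.
%% Calling it on an old node fails with {badrpc, _}, which tells us the
%% peer only understands the legacy record format, so we fall back to
%% the original call.
get_remote_info(Node) ->
    case rpc:call(Node, riak_control_session, get_my_info_v2, []) of
        {badrpc, _Reason} ->
            rpc:call(Node, riak_control_session, get_my_info, []);
        InfoV2 ->
            InfoV2
    end.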

Rolling Upgrades

  • With this patch, both Riak 1.3 (see riak_control#121) and Riak 1.4 (see riak_control#111, riak_control#120) will unfortunately throw webmachine errors on the nodes resource, either because of badarith handling during the start of memsup or when the node is in an incompatible state. This is unfortunate, but the patch at least addresses the major cluster crash.
  • I've tested branches 1.2, 1.3 and 1.4 against a mixed cluster with 1.4.1 nodes and couldn't trigger the crash scenario any longer; however, the webmachine issues above still occur.

I would appreciate someone testing these upgrade scenarios as well to ensure total coverage.

I've left the commits broken out for the reviewer, but will rebase before merge.

@jonmeredith @russelldb @seancribbs @jgnewman @bsparrow435 @rzezeski

@seancribbs seancribbs and 1 other commented on an outdated diff Jul 30, 2013
src/riak_control_session.erl
ring_pct = PctRing,
- pending_pct = PctPending}
+ pending_pct = PctPending};
+ _ ->
@seancribbs
seancribbs added a line comment Jul 30, 2013

What's wrong with matching against #member_info{} in this case? It seems the main difference is that the new record has additional fields.

A capability might also help, allowing new nodes to send the original record to old nodes.

@cmeiklejohn
cmeiklejohn added a line comment Jul 30, 2013

We can't match #member_info{} because we can't guarantee a specific format for it. I've added a capability for handling legacy record formats, but we still don't have a solution for 1.4.0/1.4.1 compatibility.
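For reference, registering and querying a capability with riak_core_capability looks roughly like this; the capability name is hypothetical, and the capability approach was later dropped (see the Updates section above) because of the negotiation race condition.

%% Advertised at application start: the record versions we support, in
%% order of preference, with v0 assumed for legacy nodes that don't
%% participate in negotiation.
register_capability() ->
    riak_core_capability:register({riak_control, member_info_version},
                                  [v2, v0], v0).

%% Queried before sending a record to a peer.
record_version() ->
    riak_core_capability:get({riak_control, member_info_version}, v0).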

@rzezeski

This patch ensures that a proposed-1.4.1 node can successfully run Riak Control, and talk to 1.3 and 1.4 nodes, but an earlier 1.3 node will still exhibit errors talking to a 1.4+ node.

Does this mean that a mixed cluster where the Riak Control node is not updated first will cause the Riak Control node to continually go down because of max restarts?

@cmeiklejohn

@rzezeski

I've added a capability that will prevent 1.4.1 nodes from sending the newer record to 1.2.x and 1.3.x nodes; however, there's still a situation where a 1.4.0 node is running Control and asks for information from a 1.4.1 node.

cmeiklejohn added some commits Jul 30, 2013
@cmeiklejohn cmeiklejohn Ensure start_link returns supervisor_ret. 0efa0af
@cmeiklejohn cmeiklejohn Pattern match against legacy record. 52f1508
@cmeiklejohn cmeiklejohn Assume v0 record when the capability is accessed before registered. 6941f07
@cmeiklejohn cmeiklejohn Handle mixed RPC calls.
Handle race condition by triggering a badrpc exception, which has
backwards compatibility with <= 1.3.

Craft custom messages for 1.4.0 with the bad record format.
bf21e44
@cmeiklejohn cmeiklejohn Use the member_info macro. ca589c6
@cmeiklejohn cmeiklejohn Don't attempt 1.4.0 detection.
Capability negotiation has a race condition, so
it can't be relied on to determine 1.4.0 as it 
will look like a 1.4.0 momentarily if the node
is actually a 1.3.x node.
18444f6
@cmeiklejohn cmeiklejohn Fix ordering of negotiation. 8db02b7
@cmeiklejohn cmeiklejohn Remove capability.
Capability negotiation is problematic because of race
conditions in negotiation.
e037a9f
@rzezeski

I'm working on a riak test to verify this patch (so we don't have to test this by hand in the future). I've verified the crash upgrading from 1.2.1 to a recent master of Riak; the observed error is pasted below. I figure 1.2.1 -> current proves the same issue as 1.3 or 1.4 -> current. Now I'm going to test the patch while upgrading from the 1.3/1.4 versions.

2013-07-31 13:42:23.682 [error] <0.647.0> Supervisor riak_control_sup had child riak_control_session started with riak_control_session:start_link() at <0.648.0> exit with reason no case clause matching {member_info,'dev3@127.0.0.1',undefined,true,[],[],undefined,undefined,32996760000,4792588000,18248904,undefined,undefined} in riak_control_session:get_member_info/2 line 239 in context child_terminated
2013-07-31 13:42:23.683 [error] <0.3537.0> CRASH REPORT Process <0.3537.0> with 0 neighbours exited with reason: no case clause matching {member_info,'dev3@127.0.0.1',undefined,true,[],[],undefined,undefined,32997180000,4795068000,18249096,undefined,undefined} in riak_control_session:get_member_info/2 line 239 in gen_server:init_it/6 line 328
@rzezeski rzezeski referenced this pull request in basho/riak_test Jul 31, 2013
Merged

Add upgrade test for Riak Control #342

@rzezeski

+1 to merge. The riak test I created fails without the patch and passes with it. So at least we know Riak Control should no longer crash a node in a mixed-cluster environment.

@cmeiklejohn

Rebased to #122.

@cmeiklejohn

Rather, rebased to #123.

@rzezeski rzezeski added a commit to basho/riak_test that referenced this pull request Jan 26, 2014
@rzezeski rzezeski Add upgrade test for Riak Control
Some issues have been found when running a mixed cluster with Riak
Control enabled.  This tests the typical upgrade scenario of a
customer and verifies that Riak Control at least doesn't crash any of
the nodes while in a mixed cluster state.

See basho/riak_control#120
9e05b74
@seancribbs seancribbs deleted the csm-1.4-bad-record branch Apr 1, 2015