cluster crash recovery #10

Closed
dirtysalt opened this Issue May 6, 2014 · 1 comment

Contributor

dirtysalt commented May 6, 2014

Copied from t848.

The typical scenario is a data center outage. After power is restored, the cluster nodes should be promoted to a primary component (PC) automatically.

The designed process is:

  1. Recover the last known PC from disk and start all nodes in non-prim.
  2. Every time a node's membership changes, check whether the non-prim view's membership equals the last known PC's membership. If it does, promote the view to primary component.
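The membership check in step 2 can be sketched as below. This is only an illustration of the rule in Python, not the Galera implementation; the names `should_promote` and `first_promoting_view` are made up for this example, and node identities are represented as plain strings.

```python
from typing import Iterable, List, Optional


def should_promote(non_prim_members: Iterable[str],
                   last_pc_members: Iterable[str]) -> bool:
    """Return True when the current non-prim view has exactly the
    same members as the last known primary component."""
    return set(non_prim_members) == set(last_pc_members)


def first_promoting_view(view_changes: List[Iterable[str]],
                         last_pc_members: Iterable[str]) -> Optional[int]:
    """Re-run the check on every membership change and return the index
    of the first view that qualifies, or None if none does."""
    for index, members in enumerate(view_changes):
        if should_promote(members, last_pc_members):
            return index
    return None
```

With a last known PC of `{"n1", "n2", "n3"}`, the views `{"n1"}` and `{"n1", "n2"}` stay non-prim, and only the view containing all three nodes qualifies for promotion.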

dirtysalt added a commit that referenced this issue May 7, 2014

Refs #10: handle inconsistent restored views.
Put the restored view into the state message. Bootstrap only if the membership is equal and the restored view id seqno is the highest among all nodes' restored view id seqnos.
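One possible reading of this bootstrap rule, sketched in Python. The `RestoredView` type and `can_bootstrap` function are hypothetical names for illustration, and the tie handling (all views sharing the highest seqno must agree with the current membership) is an assumption, not something the commit message states.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class RestoredView:
    seqno: int                 # seqno part of the restored view id
    members: FrozenSet[str]    # membership stored with the restored view


def can_bootstrap(current_members: Iterable[str],
                  restored: Dict[str, RestoredView]) -> bool:
    """restored maps node name -> the restored view it reported in its
    state message. Bootstrap only if the view(s) with the highest seqno
    have the same membership as the current view."""
    if not restored:
        return False
    current = frozenset(current_members)
    highest = max(view.seqno for view in restored.values())
    newest = [view for view in restored.values() if view.seqno == highest]
    # Assumption: every view with the highest seqno must match.
    return all(view.members == current for view in newest)
```

Under this reading, a node that restored an older, smaller view cannot trigger a bootstrap on its own as long as some other node reports a restored view with a higher seqno and a different membership.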

@dirtysalt dirtysalt added this to the 3.6 milestone May 15, 2014

@dirtysalt dirtysalt self-assigned this May 15, 2014

@temeo temeo assigned temeo and unassigned dirtysalt May 15, 2014

@temeo temeo modified the milestone: 3.6 May 15, 2014

@temeo temeo assigned dirtysalt and unassigned temeo May 15, 2014

dirtysalt added a commit that referenced this issue May 19, 2014

Refs #10: fix unit test. Previous unit tests can leave an obsolete gvwstate.dat file behind, and since pc.recovery is 1 by default, some unit test cases may hang and fail with a timeout error. The fix is to start the first node with the bootstrap option in those unit test cases.

dirtysalt added a commit that referenced this issue May 20, 2014

Refs #10: fix unit test. Set pc.recovery=0 to avoid reading an obsolete gvwstate.dat left by other unit test cases.

dirtysalt added a commit that referenced this issue May 22, 2014

dirtysalt added a commit that referenced this issue Jun 1, 2014

Refs #10: prevent the gvwstate.dat file from interfering with unit tests
1. move vst.write_file from pc_proto.cpp to pc.cpp
2. add pc.save_prim to control whether the primary component is saved.

dirtysalt added a commit that referenced this issue Jun 1, 2014

dirtysalt added a commit that referenced this issue Jun 3, 2014

Refs #10: remove the gvwstate.dat file if pc.recovery=0 to avoid reading a stale state file.

This can happen in the following scenario:
1. The node starts up with pc.recovery=1 and has a state file.
2. The node then starts up with pc.recovery=0, so the state file is now stale.
3. The node then starts up with pc.recovery=1 again, and the stale file is used.
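The three restarts above can be replayed with a small Python sketch. It is a toy model only: `start_node` and `third_start_uses_stale_file` are hypothetical helpers, the file content is a placeholder rather than the real gvwstate.dat format, and the sketch assumes the file is written only while pc.recovery=1.

```python
import os
import tempfile

STATE_FILE = "gvwstate.dat"


def start_node(data_dir: str, pc_recovery: bool,
               remove_when_disabled: bool) -> bool:
    """Start a toy node; return True if it restored a view from disk."""
    path = os.path.join(data_dir, STATE_FILE)
    if not pc_recovery:
        # The fix: drop the file so it cannot be picked up later.
        if remove_when_disabled and os.path.exists(path):
            os.remove(path)
        return False
    restored = os.path.exists(path)
    # With recovery enabled the node persists its view (placeholder text).
    with open(path, "w") as handle:
        handle.write("placeholder view state\n")
    return restored


def third_start_uses_stale_file(remove_when_disabled: bool) -> bool:
    """Replay: pc.recovery=1, then 0, then 1 again."""
    with tempfile.TemporaryDirectory() as data_dir:
        start_node(data_dir, True, remove_when_disabled)
        start_node(data_dir, False, remove_when_disabled)
        return start_node(data_dir, True, remove_when_disabled)
```

In this model, `third_start_uses_stale_file(False)` reports that the file written during the first start is still picked up on the third start, while `third_start_uses_stale_file(True)` finds no file to restore.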

dirtysalt added a commit that referenced this issue Jun 3, 2014

dirtysalt added a commit that referenced this issue Jun 8, 2014

Refs #10: clear rst_view when the PC is formed.
Otherwise it may prevent PC remerge. See case #47#issuecomment-45083066

dirtysalt added a commit that referenced this issue Jun 9, 2014

Refs #10: do not clear rst_view after sending the install message. Otherwise, if a network partition happens while the install message is being sent, the restored view is lost.

dirtysalt added a commit that referenced this issue Jun 9, 2014

temeo added a commit that referenced this issue Jun 9, 2014

refs #10 - merged gh10 to 3.x
Squashed commit of the following:

commit 84c4983
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Mon Jun 9 18:10:01 2014 +0800

    Refs #10: set restored view type to NON_PRIM; this should solve the PC remerge problem

commit 8fd683e
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Mon Jun 9 09:16:42 2014 +0800

    Refs #10: do not clear rst_view after sending the install message. Otherwise, if a network partition happens while the install message is being sent, the restored view is lost

commit 7e5ddba
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Sun Jun 8 13:21:29 2014 +0800

    Refs #10: clear rst_view when the PC is formed.
    Otherwise it may prevent PC remerge. See case #47#issuecomment-45083066

commit 9bcd01f
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Tue Jun 3 17:28:05 2014 +0800

    Refs #10: remove state file on graceful shutdown

commit 53ae454
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Tue Jun 3 15:05:58 2014 +0800

    Refs #10: fix typo

commit 0afd00e
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Tue Jun 3 10:30:53 2014 +0800

    Refs #10: remove the gvwstate.dat file if pc.recovery=0 to avoid reading a stale state file.
    This can happen in the following scenario:
    1. The node starts up with pc.recovery=1 and has a state file.
    2. The node then starts up with pc.recovery=0, so the state file is now stale.
    3. The node then starts up with pc.recovery=1 again, and the stale file is used.

commit 6003b0e
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Mon Jun 2 07:19:54 2014 +0800

    Refs #10: remove pc.save_prim and let pc.recovery control whether the state file is written

commit 30a66da
Author: dirtysalt <dirtysalt1987@gmail.com>
Date:   Sun Jun 1 23:02:56 2014 +0800

    Refs #10: prevent the gvwstate.dat file from interfering with unit tests
    1. move vst.write_file from pc_proto.cpp to pc.cpp
    2. add pc.save_prim to control whether the primary component is saved.

@temeo temeo closed this Jun 10, 2014

Contributor

dirtysalt commented Aug 25, 2014

Here is an example of why the gvwstate.dat file is removed on graceful shutdown. Without the removal, take two nodes, both running:

  1. Stop the first node.
  2. Stop the second node. It writes its last seen PC as node2 only.
  3. Start the first node and bootstrap the PC.
  4. Start the second node. Now you have two PCs.
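The four steps can be replayed in a toy Python model. Everything here is an assumption for illustration: the `Node` class, the in-memory `saved_pc` field standing in for gvwstate.dat, and the premise that the restarted second node first sees a view containing only itself before it discovers the first node.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass
class Node:
    name: str
    remove_state_on_shutdown: bool
    saved_pc: Optional[FrozenSet[str]] = None  # stands in for gvwstate.dat

    def observe_pc(self, members: Iterable[str]) -> None:
        """Record the primary component this node currently belongs to."""
        self.saved_pc = frozenset(members)

    def graceful_stop(self) -> None:
        if self.remove_state_on_shutdown:
            self.saved_pc = None

    def promotes_itself_on_start(self) -> bool:
        """Assumed first view after restart: only this node."""
        return self.saved_pc == frozenset({self.name})


def count_primary_components(remove_state_on_shutdown: bool) -> int:
    node1 = Node("node1", remove_state_on_shutdown)
    node2 = Node("node2", remove_state_on_shutdown)
    for node in (node1, node2):          # both running in one PC
        node.observe_pc({"node1", "node2"})
    node1.graceful_stop()                # step 1
    node2.observe_pc({"node2"})          # node2 is now the whole PC
    node2.graceful_stop()                # step 2
    primary_components = 1               # step 3: node1 bootstraps a PC
    if node2.promotes_itself_on_start(): # step 4
        primary_components += 1
    return primary_components
```

In this model, keeping the state file across a graceful stop yields two PCs, and removing it leaves node2 with nothing to restore, so only the bootstrapped PC exists.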

@dirtysalt dirtysalt removed their assignment Dec 10, 2014

philip-galera added a commit that referenced this issue Mar 13, 2016

Merge pull request #10 from codership/gh386
Gh386 - adding contributors agreement