
brain split happened when network interrupts between dc #825

Closed

accumulatepig opened this issue Jan 10, 2019 · 34 comments
@accumulatepig

accumulatepig commented Jan 10, 2019

weed version: 1.15

  1. weed master: deployed across 3 DCs, 3 nodes in dc1 + 2 in dc2 + 2 in dc3
  2. volume servers: 3 nodes in dc1 + 3 in dc2 + 3 in dc3
  3. the network between dc3 and dc2 was interrupted, and between dc3 and dc1 as well, but the network between dc1 and dc2 was fine, so dc3 became an isolated network island.
  4. the original leader was in dc1, but when the network issue happened, a second leader was elected in dc3 and dc3's volume servers connected to the new dc3 leader, forming two clusters: dc1 and dc2 in one cluster, dc3 in another.
  5. when the network issue between dc3 and dc1/dc2 was resolved, there were still two leaders in the whole cluster until dc3's leader was restarted.
@chrislusf
Collaborator

what's the command line to start the 7 master servers?

@accumulatepig
Author

The 7 masters start with a command like this:
weed -log_dir=/logs -v=2 master -defaultReplication=100 -ip=node_ip -port=9333 -mdir=/eaweedfs/metadata -volumeSizeLimitMB=10000 -peers=dc1_ip1:9333,dc1_ip2:9333,dc1_ip3:9333,dc2_ip1:9333,dc2_ip2:9333,dc3_ip1:9333,dc3_ip2:9333

@chrislusf
Collaborator

What kind of network interruption? It'll be good if this can be reproduced. Basically the isolated dc3 elected one leader from its 2 masters, even though the quorum is 4.
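For reference, the quorum arithmetic behind that statement, as a tiny Go sketch (not SeaweedFS code): with 7 masters a Raft leader needs a strict majority of 4 votes, so dc3's 2 isolated masters should never be able to elect one on their own.

package main

import "fmt"

// quorum is the minimum number of votes a Raft candidate needs in a cluster
// of n voting members: a strict majority.
func quorum(n int) int { return n/2 + 1 }

func main() {
	fmt.Println(quorum(7)) // 4 — dc3's 2 isolated masters are well below this
}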

@accumulatepig
Author

dc3 could not reach the other DCs because of the network provider's line interruption, but dc3's internal network was fine. There are 7 nodes in total; the isolated dc3 elected one leader from its 2 masters, which does not reach half of the 7 nodes. I think it can be reproduced by using iptables to interrupt the network between dc3 and the other DCs.

@chrislusf
Collaborator

It'll be good to see the actual logs, if you still have them.

@accumulatepig
Author

accumulatepig commented Jan 10, 2019

Around 00:42 a network failure happened, and it went back to normal an hour later. The second leader seems to have been elected when the network recovered after an hour.

  1. dc3 elected master(the second master:dc3_masterIP1):
    I0110 00:42:54 10646 master_server.go:96] event: &{typ:leaderChange source:0xc0000d85a0 value: prevValue:dc1_masterIP1:9333}
    I0110 00:45:03 10646 node.go:223] topo adds child DC3
    I0110 00:45:03 10646 node.go:223] topo:DC3 adds child rack01
    I0110 00:45:03 10646 node.go:223] topo:DC3:rack01 adds child dc3_volumer1:9080
    I0110 00:45:03 10646 master_grpc_server.go:66] added volume server dc3_volumer1:9080
    I0110 00:45:03 10646 node.go:223] topo:DC3 adds child rack02
    I0110 00:45:03 10646 node.go:223] topo:DC3:rack02 adds child dc3_volumer2:9080
    I0110 00:45:03 10646 master_grpc_server.go:66] added volume server dc3_volumer2:9080
    I0110 00:45:03 10646 node.go:223] topo:DC3 adds child rack03
    I0110 00:45:03 10646 node.go:223] topo:DC3:rack03 adds child dc3_volumer3:9080
    I0110 00:45:03 10646 master_grpc_server.go:66] added volume server dc3_volumer3:9080
    I0110 01:49:28 10646 master_server.go:96] event: &{typ:leaderChange source:0xc0000d85a0 value::dc3_masterIP1:9333 prevValue:}
    I0110 01:49:28 10646 master_server.go:98] [ dc3_masterIP1:9333 ] dc3_masterIP1:9333 becomes leader.
    I0110 01:49:44 10646 master_grpc_server.go:142] + client filerdc3_filerIP1:53304
    I0110 01:49:44 10646 master_grpc_server.go:142] + client filerdc3_filerIP2:40560
    I0110 01:49:44 10646 master_grpc_server.go:142] + client filerdc3_filerIP3:43422
    I0110 01:58:06 10646 topology_vacuum.go:119] Start vacuum on demand with threshold: 0.300000
    I0110 01:58:06 10646 topology_vacuum.go:151] check vacuum on collection: volume:104
    I0110 01:58:06 10646 topology_vacuum.go:151] check vacuum on collection: volume:219
    ......

  2. dc3 another master node:
    I0110 00:42:54 2805 master_server.go:96] event: &{typ:leaderChange source:0xc0000d4c60 value: prevValue:dc1_masterIP1:9333}
    I0110 01:49:39 2805 master_server.go:96] event: &{typ:leaderChange source:0xc0000d4c60 value: prevValue:dc3_masterIP1:9333}

  3. dc1 original master(dc1_masterIP1):
    I0110 00:43:15 843 master_grpc_server.go:168] - client filerdc3_filerIP1:58366: rpc error: code = Canceled desc = context canceled
    I0110 00:43:15 843 master_grpc_server.go:168] - client filerfilerdc3_filerIP:44388: rpc error: code = Canceled desc = context canceled
    I0110 00:43:15 843 master_grpc_server.go:152] - client filerdc3_filerIP1:58366
    I0110 00:43:15 843 master_grpc_server.go:168] - client filerfilerdc3_filerIP3:44508: rpc error: code = Canceled desc = context canceled
    I0110 00:43:15 843 master_grpc_server.go:152] - client filerfilerdc3_filerIP:44388
    I0110 00:43:15 843 master_grpc_server.go:152] - client filerfilerdc3_filerIP3:44508
    I0110 00:43:17 843 master_grpc_server.go:22] unregister disconnected volume server dc3_volumer1:9080
    I0110 00:43:17 843 master_grpc_server.go:22] unregister disconnected volume server dc3_volumer2:9080
    I0110 00:43:17 843 topology_event_handling.go:56] Removing Volume 92 from the dead volume server dc3_volumer1:9080
    I0110 00:43:17 843 topology_event_handling.go:56] Removing Volume 210 from the dead volume server dc3_volumer2:9080
    I0110 00:43:17 843 volume_layout.go:241] Volume 92 has 1 replica, less than required 2
    I0110 00:43:17 843 topology_event_handling.go:56] Removing Volume 197 from the dead volume server dc3_volumer1:9080
    I0110 00:43:17 843 volume_layout.go:241] Volume 197 has 1 replica, less than required 2
    I0110 00:43:17 843 volume_layout.go:217] Volume 197 becomes unwritable
    I0110 00:43:17 843 topology_event_handling.go:56] Removing Volume 41 from the dead volume server dc3_volumer1:9080
    I0110 00:43:17 843 master_grpc_server.go:22] unregister disconnected volume server dc3_volumer3:9080
    I0110 00:43:17 843 topology_event_handling.go:56] Removing Volume 34 from the dead volume server dc3_volumer3:9080
    ......

  4. dc1 another master node:
    I0110 01:49:27 14676 master_server.go:96] event: &{typ:leaderChange source:0xc0000c0b40 value: prevValue:dc1_masterIP1:9333}
    I0110 01:49:39 14676 master_server.go:96] event: &{typ:leaderChange source:0xc0000c0b40 value: prevValue:dc3_masterIP1:9333}
    I0110 01:49:39 14676 master_server.go:96] event: &{typ:leaderChange source:0xc0000c0b40 value:dc1_masterIP2:9333 prevValue:}
    I0110 01:49:39 14676 master_server.go:98] [ dc1_masterIP2:9333 ] dc1_masterIP2:9333 becomes leader.

  5. dc1 third master node:
    I0110 01:49:27 97334 master_server.go:96] event: &{typ:leaderChange source:0xc0004f0360 value: prevValue:dc1_masterIP1:9333}
    I0110 01:49:39 97334 master_server.go:96] event: &{typ:leaderChange source:0xc0004f0360 value: prevValue:dc3_masterIP1:9333}

@PapaYofen
Contributor

PapaYofen commented Jan 14, 2019

Sounds like a critical bug that is easy to reproduce.
Maybe raft is used incorrectly?
I am walking through the raft code and the weed code to find possible solutions.

@ingardm
Contributor

ingardm commented Jan 16, 2019

Maybe using this tried and true raft implementation would be a solution? https://github.com/etcd-io/etcd/tree/master/raft

@chrislusf
Collaborator

chrislusf commented Jan 16, 2019 via email

@PapaYofen
Contributor

PapaYofen commented Jan 16, 2019

Reproduce as below:
There are 3 nodes:

  • master-0 and volume-0 on node-0
  • master-1 and volume-1 on node-1
  • master-2 and volume-2 on node-2

Now master-0 is the leader, and master-1 and master-2 are followers. Then execute the following steps:

  1. on node-0, update iptables to drop data from node-1 and node-2
  2. master-0 is still leader; meanwhile, master-1 and master-2 also elect a new leader, say master-1
  3. curl master-0:9333/dir/assign will succeed and create volumes 1-7
  4. meanwhile, curl master-1:9333/dir/assign will also succeed and create volumes 1-7
  5. successfully upload one file to master-0, say the fid is 1,23dfdfd
  6. update iptables on node-0 to accept data from node-1 and node-2 again
  7. after step 6, all volumes 1-7 will be unwritable, and when looking up fid 1,23dfdfd the lookup will sometimes return NotFound because the master randomly picks a replica URL that is not volume-0's

I have also tested removing the go keyword in topology.go#L94; it solves the problem above, but brings these side effects:

  1. /dir/assign takes more time (each volume id allocation takes about 500ms, the heartbeat interval of appendEntries) than before on the majority side of the split (node-1 and node-2)
  2. /dir/assign takes about 10m (maybe more) to get the "raft.Server: Not current leader" error on the other side of the split (node-0)

Updated:
Tested again; side effect 2 still hangs for over 80m.
As the raft code shows, /dir/assign will hang forever unless a timeout is set...
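To illustrate the point about hanging forever unless a timeout is set, here is a minimal, generic Go sketch (helper names assumed, not actual SeaweedFS or goraft code) of bounding a blocking raft commit with a timeout, so a handler like /dir/assign can fail fast on the minority side:

package main

import (
	"errors"
	"fmt"
	"time"
)

var errRaftTimeout = errors.New("raft: commit not confirmed before timeout")

// doWithTimeout runs a blocking raft command in a goroutine and gives up
// after the given timeout instead of waiting forever for a commit that can
// never happen on a partitioned minority.
func doWithTimeout(do func() error, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- do() }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errRaftTimeout
	}
}

func main() {
	// Simulate a command whose commit never returns because quorum is lost.
	err := doWithTimeout(func() error {
		time.Sleep(time.Hour) // stand-in for an AppendEntries that is never acknowledged
		return nil
	}, 500*time.Millisecond)
	fmt.Println(err) // raft: commit not confirmed before timeout
}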


@chrislusf
Collaborator

Thanks! This seems to be a timeout issue where node-0 needs to actively detect stream gRPC disconnections.

chrislusf added a commit that referenced this issue Jan 18, 2019
possible fix for #825
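One general way for a gRPC server (here the master on node-0) to actively detect dead client streams is server-side keepalive. The following is only a hedged sketch of that idea; it is not necessarily what the commit referenced above does:

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newMasterGrpcServer builds a gRPC server that pings idle streams and drops
// them when the peer stops answering, so dead volume-server/filer streams are
// noticed without waiting for TCP to give up.
func newMasterGrpcServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    10 * time.Second, // ping a client if its stream has been idle this long
			Timeout: 3 * time.Second,  // close the stream if the ping is not acknowledged in time
		}),
	)
}

func main() {
	_ = newMasterGrpcServer()
}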
@chrislusf
Collaborator

@PapaYofen please check whether this 1d103e3 fixes the problem.

@PapaYofen
Contributor

PapaYofen commented Jan 18, 2019

@PapaYofen please check whether this 1d103e3 fixes the problem.

It only reduces the append-entry request time to followers, but cannot solve the split-brain problem.
Valid solutions may be:

  1. remove the go keyword in go t.RaftServer.Do(NewMaxVolumeIdCommand(next))
  2. do not wait for the heartbeat to append entries to followers; instead, trigger the append-entry process immediately
  3. step the leader down to follower if the number of active followers does not reach the quorum size, like etcd raft (a rough sketch of this idea is shown after this comment)

Options 2 and 3 need changes to the raft code, which I do not understand thoroughly; maybe etcd raft is a better choice since it already solves this split-brain issue, and it also implements the preVote algorithm to reduce disturbance when the minority partition reconnects to the majority.
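As an illustration only, here is a self-contained Go sketch of the idea in option 3: the leader periodically checks how many followers have recently acknowledged heartbeats and steps down when it can no longer see a quorum. The names are invented; this is not goraft or etcd code.

package main

import (
	"fmt"
	"time"
)

// peerState records when a follower last acknowledged an AppendEntries heartbeat.
type peerState struct {
	lastAck time.Time
}

// leaderShouldStepDown reports whether a leader, given the ack times of its
// followers, has lost contact with a quorum of the cluster (itself included)
// and should therefore revert to follower/candidate.
func leaderShouldStepDown(peers map[string]peerState, electionTimeout time.Duration) bool {
	active := 1 // the leader always counts itself
	now := time.Now()
	for _, p := range peers {
		if now.Sub(p.lastAck) < electionTimeout {
			active++
		}
	}
	quorum := (len(peers)+1)/2 + 1 // strict majority of the whole cluster
	return active < quorum
}

func main() {
	// 7-master cluster as in this issue: the leader can only reach one follower.
	stale := time.Now().Add(-time.Minute)
	peers := map[string]peerState{
		"m2": {lastAck: time.Now()},
		"m3": {lastAck: stale}, "m4": {lastAck: stale}, "m5": {lastAck: stale},
		"m6": {lastAck: stale}, "m7": {lastAck: stale},
	}
	fmt.Println(leaderShouldStepDown(peers, 2*time.Second)) // true: only 2 of 7 reachable, quorum is 4
}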

@chrislusf
Collaborator

I think it is fixed. Let me know if otherwise.

@accumulatepig
Author

accumulatepig commented Jan 22, 2019

@chrislusf @PapaYofen
version: 1.23 (the split-brain issue still seems to exist)
3 DCs: DC1 DC2 DC3
3 master nodes: m1 m2 m3
3 volume nodes: v1 v2 v3
1 filer node: f1
current leader: m1

master m3 log: now block dc3 to make it an isolated network environment; v3 disconnects from m1 and connects to m3, while v1 and v2 stay connected to m1. Are m1 and m3 both leaders?
I0122 17:00:25 37332 master_server.go:68] Volume Size Limit is 10000 MB
I0122 17:00:25 37332 master.go:80] Start Seaweed Master 1.23 at 0.0.0.0:9333
I0122 17:00:25 37332 master.go:109] Start Seaweed Master 1.23 grpc server at 0.0.0.0:19333
I0122 17:00:25 37332 raft_server.go:47] Starting RaftServer with v3:9333
I0122 17:00:27 37332 raft_server.go:88] current cluster leader:
I0122 19:41:22 37332 master_server.go:96] event: &{typ:leaderChange source:0xc0000c66c0 value: prevValue:m1:9333}
I0122 19:42:16 37332 node.go:223] topo adds child DC3
I0122 19:42:16 37332 node.go:223] topo:DC3 adds child rack01
I0122 19:42:16 37332 node.go:223] topo:DC3:rack01 adds child v3:9080
I0122 19:42:16 37332 master_grpc_server.go:67] added volume server v3:9080

master m3 log: dc3's network is back to normal, and m3 becomes the leader of the whole cluster
I0122 19:46:20 37332 master_server.go:96] event: &{typ:leaderChange source:0xc0000c66c0 value:m3:9333 prevValue:}
I0122 19:46:20 37332 master_server.go:98] [ m3:9333 ] m3:9333 becomes leader.
I0122 19:46:20 37332 master_grpc_server.go:143] + client filerf1:8309
I0122 19:46:35 37332 node.go:223] topo adds child DC2
I0122 19:46:35 37332 node.go:223] topo:DC2 adds child rack01
I0122 19:46:35 37332 node.go:223] topo:DC2:rack01 adds child v2:9080
I0122 19:46:35 37332 master_grpc_server.go:67] added volume server v2:9080
I0122 19:46:35 37332 node.go:223] topo adds child DC1
I0122 19:46:35 37332 node.go:223] topo:DC1 adds child rack01
I0122 19:46:35 37332 node.go:223] topo:DC1:rack01 adds child v1:9080
I0122 19:46:35 37332 master_grpc_server.go:67] added volume server v1:9080

filer f1 netstat: the filer is connected to both the previous leader and the current leader
#netstat -anp|grep 9333
tcp 0 0 f1:8309 m3:19333 ESTABLISHED 40087/./weed
tcp 0 0 f1:45036 m1:19333 ESTABLISHED 40087/./weed

chrislusf added a commit that referenced this issue Jan 22, 2019
@chrislusf
Collaborator

Added an idle timeout. Not sure how to make this unit-testable. Please help to test this.

@accumulatepig
Author

  1. I tried 3f56b12, but it had no effect; the filer still connects to both the previous leader and the current leader when a new leader is elected.
  2. when the isolated network environment is formed, the isolated DC's volume servers disconnect from the outside leader and connect to the isolated DC's master node.

@chrislusf
Collaborator

what's your command to do this?
"on node-0 , update iptables to drop data from the node-1 and node-2"

@PapaYofen
Contributor

sudo iptables -I INPUT -s node-1-ip -j DROP

sudo iptables -I INPUT -s node-2-ip -j DROP

@PapaYofen
Contributor

PapaYofen added a commit to PapaYofen/raft that referenced this issue Feb 11, 2019
@accumulatepig
Author

PapaYofen/raft@ecfccfc
+
https://github.com/PapaYofen/seaweedfs/blob/5737965eaa0e65073f7f33738fef0c4557de0bb1/weed/topology/topology.go#L94

maybe these two commits can fix the split-brain problem

I tried, but there were still two leaders.
Steps:
master nodes: node1 node2 node3

  1. node1 is the current leader
  2. block the network from node1 to node2 and node3, but node2 and node3 communicate normally
  3. node2 also becomes a leader node

@PapaYofen
Contributor

PapaYofen commented Feb 16, 2019

Please let me know the detailed reproduction steps.
Also, the latest commits of seaweedfs and raft can be used to test again.

@accumulatepig
Author

Please let me know the detailed reproduction steps.
Also, the latest commits of seaweedfs and raft can be used to test again.

@PapaYofen @chrislusf the latest commits seem to work fine: the leader is elected from the partition with more than half of the nodes, the leader in the partition with fewer than half of the nodes becomes a candidate, and there is always exactly one leader node.

@PapaYofen
Contributor

If Chris approves this solution, I will open a pull request.

@accumulatepig
Author

accumulatepig commented Feb 22, 2019

If Chris approves this solution, I will open a pull request.

@chrislusf what do you think about this? We are looking forward to the pull request.

@chrislusf
Collaborator

chrislusf commented Feb 22, 2019

@accumulatepig I am not sure which exact versions of the seaweedfs and raft code you have tested.

Besides, I have changed the raft code from http to grpc recently, so the change does not apply any more.

@PapaYofen
Contributor

PapaYofen commented Feb 22, 2019

@chrislusf

About raft:

  1. the tested raft commit is PapaYofen/raft@478bfd9, based on seaweedfs/raft@5f7ddd8
  2. I can cherry-pick the tested commit onto the latest raft.

About seaweedfs:
To fix the split-brain problem, it mainly needs to change func NextVolumeId from

func (t *Topology) NextVolumeId() storage.VolumeId {
	vid := t.GetMaxVolumeId()
	next := vid.Next()
	go t.RaftServer.Do(NewMaxVolumeIdCommand(next))
	return next
}

to

func (t *Topology) NextVolumeId() (storage.VolumeId, error) {
	vid := t.GetMaxVolumeId()
	next := vid.Next()
	if _, err := t.RaftServer.Do(NewMaxVolumeIdCommand(next)); err != nil {
		return 0, err
	}
	return next, nil
}
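As a follow-up illustration (the types and names below are stand-ins, not the real SeaweedFS ones): with the synchronous, error-returning version, a caller such as the assign path can refuse to hand out a volume id when the Raft commit fails, instead of silently diverging on the minority side of a partition.

package main

import (
	"errors"
	"fmt"
)

type VolumeId uint32

// raftServer is a stand-in for the subset of the raft API used here.
type raftServer interface {
	Do(cmd interface{}) (interface{}, error)
}

type topology struct {
	raft  raftServer
	maxId VolumeId
}

// NextVolumeId mirrors the synchronous shape above: the id is only returned
// once the raft command has been committed.
func (t *topology) NextVolumeId() (VolumeId, error) {
	next := t.maxId + 1
	if _, err := t.raft.Do(next); err != nil {
		return 0, err
	}
	t.maxId = next
	return next, nil
}

// minorityRaft simulates a master that has lost quorum: every commit fails.
type minorityRaft struct{}

func (minorityRaft) Do(interface{}) (interface{}, error) {
	return nil, errors.New("raft.Server: Not current leader")
}

func main() {
	t := &topology{raft: minorityRaft{}}
	if _, err := t.NextVolumeId(); err != nil {
		fmt.Println("assign refused:", err) // fail fast instead of allocating divergent ids
	}
}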

@chrislusf
Collaborator

This commit 7a493bb may be the real fix. Setting up the test env is too much work, I have not checked it yet.

@PapaYofen
Contributor

This commit 7a493bb may be the real fix. Setting up the test env is too much work, I have not checked it yet.

I don't think this commit fixes the split-brain issue, because raft does not support automatically changing role from leader to candidate after a split.
@accumulatepig have you tested this commit before?

@accumulatepig
Author

accumulatepig commented Feb 22, 2019

This commit 7a493bb may be the real fix. Setting up the test env is too much work, I have not checked it yet.

I don't think this commit fixes the split-brain issue, because raft does not support automatically changing role from leader to candidate after a split.
@accumulatepig have you tested this commit before?

@chrislusf yes, I tried commit 7a493bb, but it doesn't work. As @PapaYofen said, the leader does not change its role from leader to candidate, and a second leader is elected when the split brain happens.
@PapaYofen your latest commit works fine.

@chrislusf
Collaborator

@accumulatepig thanks for testing and confirming
@PapaYofen thanks for your contribution. Please send PR to both repos.

PapaYofen added a commit to PapaYofen/seaweedfs that referenced this issue Feb 25, 2019
@PapaYofen
Contributor

The raft PR has also been sent.

PapaYofen added a commit to PapaYofen/raft that referenced this issue Feb 25, 2019
chrislusf added a commit that referenced this issue Feb 25, 2019
@accumulatepig
Author

@chrislusf @PapaYofen
I tried weed 1.25 and it fixes the split-brain issue, but I may have found another issue in a special circumstance:
DC1: 3 master nodes, 3 volume nodes
DC2: 2 master nodes, 3 volume nodes
DC3: 2 master nodes, 3 volume nodes
The leader was in DC3 at first. When only DC3's master nodes cannot reach the master nodes in DC2 and DC1, a new leader is elected in DC1 and DC3's leader becomes a candidate, but all volume nodes still connect to DC3's previous leader and do not switch to the current leader in DC1. This scenario only occurs when the network problem is between the master nodes; if the whole of DC3 (masters, volume servers, and filer) has a network problem with DC1 and DC2, everything works well.
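A purely hypothetical sketch of the kind of behavior this points at (all names invented; not the actual SeaweedFS heartbeat code): the volume server's heartbeat loop could follow a leadership change when the master it talks to reports that it is no longer the leader, instead of staying attached to the pre-leader.

package main

import "fmt"

// heartbeatReply is a stand-in for the master's heartbeat response.
type heartbeatReply struct {
	isLeader bool
	leader   string // the leader address as this master currently sees it
}

// sendHeartbeat is a stub: the old dc3 leader answers that it stepped down
// and points at the new leader in dc1.
func sendHeartbeat(master string) heartbeatReply {
	if master == "m3:9333" {
		return heartbeatReply{isLeader: false, leader: "m1:9333"}
	}
	return heartbeatReply{isLeader: true, leader: master}
}

func main() {
	master := "m3:9333"
	for i := 0; i < 3; i++ {
		reply := sendHeartbeat(master)
		if !reply.isLeader && reply.leader != "" && reply.leader != master {
			fmt.Println("master", master, "is no longer leader, switching to", reply.leader)
			master = reply.leader // follow the redirect instead of sticking to the pre-leader
			continue
		}
		fmt.Println("heartbeat ok to", master)
	}
}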

@PapaYofen
Contributor

Thanks for your test, I have found the root cause.
