Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic: restore failure: database 'test.db' has 1 leader connections #5

Closed
freeekanayaka opened this issue Jul 13, 2017 · 0 comments · Fixed by #6
Closed

panic: restore failure: database 'test.db' has 1 leader connections #5

freeekanayaka opened this issue Jul 13, 2017 · 0 comments · Fixed by #6

Comments

@freeekanayaka
Copy link
Contributor

This crash happens occasionally in the demo program. It happens when a follower has been down long enough to fall behind so much with applied raft logs that the leader no longer has the individual logs that the follower would need to catch up (the amount of trailing logs that a nodes keep is set by the raft.Raft.TrailingLogs parameter). In this case the leader will send to the follower its last snapshot, instead of the individual missing logs, and the follower it restore them using replication.FSM.restoreDatabase(). However restoreDatabase() checks for open leader connection and finds one because the demo code in the follower has already reached the tx, err := db.Begin() line in the insert loop, and is blocked on it.

To reproduce easily, apply a patch like:

modified   raft.go
@@ -39,9 +39,9 @@ func newRaft(config *Config, join string, fsm *replication.FSM, logger *log.Logg
 		MaxAppendEntries:           64,
 		ShutdownOnRemove:           true,
 		DisableBootstrapAfterElect: true,
-		TrailingLogs:               256,
+		TrailingLogs:               10,
 		SnapshotInterval:           500 * time.Millisecond,
-		SnapshotThreshold:          64,
+		SnapshotThreshold:          5,
 		LeaderLeaseTimeout:         config.LeaderLeaseTimeout,
 		EnableSingleNode:           enableSingleNode,
 		Logger:                     logger,

and just stop a follower long enough that it falls behind at least 10 logs. When it restarts, it will crash with the panic reported in the issue title.

Sample output:

10.204.119.242:9980: 2017/07/12 21:32:02 [INFO] demo: start
10.204.119.242:9980: 2017/07/12 21:32:02 [DEBUG] dqlite: restore snapshot
10.204.119.242:9980: 2017/07/12 21:32:02 [DEBUG] dqlite: snapshot database size 385024
10.204.119.242:9980: 2017/07/12 21:32:02 [DEBUG] dqlite: snapshot wal size 391432
10.204.119.242:9980: 2017/07/12 21:32:02 [DEBUG] dqlite: snapshot txid 
10.204.119.242:9980: 2017/07/12 21:32:02 [ERR] sqlite: recovered 95 frames from WAL file
/root/data/test.db-wal (283)
10.204.119.242:9980: 2017/07/12 21:32:02 [INFO] raft: Restored from snapshot
40-2432-1499894688444
10.204.119.242:9980: 2017/07/12 21:32:02 [INFO] raft: Node at 10.204.119.242:9980 [Follower]
entering Follower state (Leader: "")
10.204.119.242:9980: 2017/07/12 21:32:04 [WARN] raft: Heartbeat timeout from "" reached,
starting election
10.204.119.242:9980: 2017/07/12 21:32:04 [INFO] raft: Node at 10.204.119.242:9980
[Candidate] entering Candidate state
10.204.119.242:9980: 2017/07/12 21:32:04 [ERR] raft: Failed to make RequestVote RPC to
10.204.119.99:9980: dialing failed: dial tcp 10.204.119.99:9980: getsockopt: connection
refused
10.204.119.242:9980: 2017/07/12 21:32:04 [ERR] raft: Failed to make RequestVote RPC to
10.204.119.128:9980: dialing failed: dial tcp 10.204.119.128:9980: getsockopt: connection
refused
10.204.119.242:9980: 2017/07/12 21:32:04 [DEBUG] raft: Votes needed: 2
10.204.119.242:9980: 2017/07/12 21:32:04 [DEBUG] raft: Vote granted from
10.204.119.242:9980. Tally: 1
10.204.119.242:9980: 2017/07/12 21:32:06 [WARN] raft: Election timeout reached, restarting
election
10.204.119.242:9980: 2017/07/12 21:32:06 [INFO] raft: Node at 10.204.119.242:9980
[Candidate] entering Candidate state
10.204.119.242:9980: 2017/07/12 21:32:06 [ERR] raft: Failed to make RequestVote RPC to
10.204.119.128:9980: dialing failed: dial tcp 10.204.119.128:9980: getsockopt: connection
refused
10.204.119.242:9980: 2017/07/12 21:32:06 [DEBUG] raft: Votes needed: 2
10.204.119.242:9980: 2017/07/12 21:32:06 [DEBUG] raft: Vote granted from
10.204.119.242:9980. Tally: 1
10.204.119.242:9980: 2017/07/12 21:32:06 [DEBUG] raft-net: 10.204.119.242:9980 accepted
connection from: 10.204.119.99:54206
10.204.119.242:9980: 2017/07/12 21:32:06 [INFO] raft: Node at 10.204.119.242:9980 [Follower]
entering Follower state (Leader: "")
10.204.119.242:9980: 2017/07/12 21:32:06 [WARN] raft: Failed to get previous log: 3218 log
not found (last: 2451)
10.204.119.242:9980: 2017/07/12 21:32:06 [INFO] snapshot: Creating new snapshot at
/root/data/snapshots/42-3215-1499895126707.tmp
10.204.119.242:9980: 2017/07/12 21:32:06 [INFO] snapshot: reaping snapshot
/root/data/snapshots/40-2368-1499894658052
10.204.119.242:9980: 2017/07/12 21:32:06 [INFO] raft: Copied 1068172 bytes to local snapshot
10.204.119.242:9980: 2017/07/12 21:32:06 [DEBUG] dqlite: restore snapshot
10.204.119.242:9980: 2017/07/12 21:32:06 [DEBUG] dqlite: snapshot database size 532480
10.204.119.242:9980: 2017/07/12 21:32:06 [DEBUG] dqlite: snapshot wal size 535632
panic: restore failure: database 'test.db' has 1 leader connections

goroutine 28 [running]:
github.com/dqlite/dqlite/replication.(*FSM).restoreDatabase(0xc42010a900, 0xc9c3a0,
0xc42010d950, 0x0, 0x0, 0x0)
    /root/go/src/github.com/dqlite/dqlite/replication/fsm.go:394 +0x1146
github.com/dqlite/dqlite/replication.(*FSM).Restore(0xc42010a900, 0xc9c3a0, 0xc42010d950,
0xc420123e00, 0xc9c3a0)
    /root/go/src/github.com/dqlite/dqlite/replication/fsm.go:307 +0x84
github.com/hashicorp/raft.(*Raft).runFSM(0xc42013c800)
    /root/go/src/github.com/hashicorp/raft/raft.go:552 +0x383
github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/raft.runFSM)-fm()
    /root/go/src/github.com/hashicorp/raft/raft.go:259 +0x2a
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc42013c800, 0xc42010ce00)
    /root/go/src/github.com/hashicorp/raft/state.go:142 +0x53
created by github.com/hashicorp/raft.(*raftState).goFunc
    /root/go/src/github.com/hashicorp/raft/state.go:143 +0x66
freeekanayaka added a commit that referenced this issue Jul 13, 2017
This avoids the issue of a leader connection being open too early,
before leadership is actually acquired, and fixes #5.
freeekanayaka added a commit that referenced this issue Jul 13, 2017
This avoids the issue of a leader connection being open too early,
before leadership is actually acquired, and fixes #5.
mjeanson added a commit to mjeanson/dqlite that referenced this issue Aug 3, 2020
On an up to date Ubuntu 20.04 with raft 0.9.24, libco v20 and sqlite
3.32.3+replication4 the tests fail with errors like:

==96452==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x607000000fd8 in thread T0
    #0 0x7f82cfea37cf in __interceptor_free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf)
    canonical#1 0x564796ec2065 in mem_fault_free test/lib/heap.c:57
    canonical#2 0x7f82cfc8370e in sqlite3_free sqlite3.c:28007
    canonical#3 0x7f82cfc8370e in sqlite3_free sqlite3.c:27999
    canonical#4 0x564796dcd89c in applyCb src/replication.c:116
    canonical#5 0x7f82cfc483ad in convertFailApply src/convert.c:59
    canonical#6 0x7f82cfc483ad in convertClearLeader src/convert.c:95
    canonical#7 0x7f82cfc483ad in convertClear src/convert.c:124
    canonical#8 0x7f82cfc4858e in convertToUnavailable src/convert.c:218
    canonical#9 0x7f82cfc4a8a4 in raft_close src/raft.c:97
    canonical#10 0x564796dec401 in exec__tear_down test/unit/test_conn.c:302
    canonical#11 0x564796ec5247 in munit_test_runner_exec test/lib/munit.c:1203
    canonical#12 0x564796ec6b2c in munit_test_runner_run_test_with_params test/lib/munit.c:1357
    canonical#13 0x564796ec7c3c in munit_test_runner_run_test test/lib/munit.c:1585
    canonical#14 0x564796ec8505 in munit_test_runner_run_suite test/lib/munit.c:1678
    canonical#15 0x564796ec85a5 in munit_test_runner_run_suite test/lib/munit.c:1687
    canonical#16 0x564796ec85a5 in munit_test_runner_run_suite test/lib/munit.c:1687
    canonical#17 0x564796ec9c43 in munit_test_runner_run test/lib/munit.c:1697
    canonical#18 0x564796ec9c43 in munit_suite_main_custom test/lib/munit.c:2027
    canonical#19 0x564796ecaf77 in munit_suite_main test/lib/munit.c:2055
    canonical#20 0x564796db2d92 in main test/unit/main.c:3
    canonical#21 0x7f82cfa060b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
    canonical#22 0x564796db30dd in _start (/home/mjeanson/Git/lxc/dqlite/unit-test.test+0xbd0dd)

0x607000000fd8 is located 8 bytes to the left of 72-byte region [0x607000000fe0,0x607000001028)
allocated by thread T0 here:
    #0 0x7f82cfea3bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)

SUMMARY: AddressSanitizer: bad-free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf) in __interceptor_free

Signed-off-by: Michael Jeanson <mjeanson@debian.org>
mjeanson added a commit to mjeanson/dqlite that referenced this issue Aug 3, 2020
On an up to date Ubuntu 20.04 with raft 0.9.24, libco v20 and sqlite
3.32.3+replication4 the tests fail with errors like:

==96452==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x607000000fd8 in thread T0
    #0 0x7f82cfea37cf in __interceptor_free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf)
    canonical#1 0x564796ec2065 in mem_fault_free test/lib/heap.c:57
    canonical#2 0x7f82cfc8370e in sqlite3_free sqlite3.c:28007
    canonical#3 0x7f82cfc8370e in sqlite3_free sqlite3.c:27999
    canonical#4 0x564796dcd89c in applyCb src/replication.c:116
    canonical#5 0x7f82cfc483ad in convertFailApply src/convert.c:59
    canonical#6 0x7f82cfc483ad in convertClearLeader src/convert.c:95
    canonical#7 0x7f82cfc483ad in convertClear src/convert.c:124
    canonical#8 0x7f82cfc4858e in convertToUnavailable src/convert.c:218
    canonical#9 0x7f82cfc4a8a4 in raft_close src/raft.c:97
    canonical#10 0x564796dec401 in exec__tear_down test/unit/test_conn.c:302
    canonical#11 0x564796ec5247 in munit_test_runner_exec test/lib/munit.c:1203
    canonical#12 0x564796ec6b2c in munit_test_runner_run_test_with_params test/lib/munit.c:1357
    canonical#13 0x564796ec7c3c in munit_test_runner_run_test test/lib/munit.c:1585
    canonical#14 0x564796ec8505 in munit_test_runner_run_suite test/lib/munit.c:1678
    canonical#15 0x564796ec85a5 in munit_test_runner_run_suite test/lib/munit.c:1687
    canonical#16 0x564796ec85a5 in munit_test_runner_run_suite test/lib/munit.c:1687
    canonical#17 0x564796ec9c43 in munit_test_runner_run test/lib/munit.c:1697
    canonical#18 0x564796ec9c43 in munit_suite_main_custom test/lib/munit.c:2027
    canonical#19 0x564796ecaf77 in munit_suite_main test/lib/munit.c:2055
    canonical#20 0x564796db2d92 in main test/unit/main.c:3
    canonical#21 0x7f82cfa060b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
    canonical#22 0x564796db30dd in _start (/home/mjeanson/Git/lxc/dqlite/unit-test.test+0xbd0dd)

0x607000000fd8 is located 8 bytes to the left of 72-byte region [0x607000000fe0,0x607000001028)
allocated by thread T0 here:
    #0 0x7f82cfea3bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)

SUMMARY: AddressSanitizer: bad-free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf) in __interceptor_free

Signed-off-by: Michael Jeanson <mjeanson@debian.org>
freeekanayaka pushed a commit that referenced this issue Aug 3, 2020
On an up to date Ubuntu 20.04 with raft 0.9.24, libco v20 and sqlite
3.32.3+replication4 the tests fail with errors like:

==96452==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x607000000fd8 in thread T0
    #0 0x7f82cfea37cf in __interceptor_free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf)
    #1 0x564796ec2065 in mem_fault_free test/lib/heap.c:57
    #2 0x7f82cfc8370e in sqlite3_free sqlite3.c:28007
    #3 0x7f82cfc8370e in sqlite3_free sqlite3.c:27999
    #4 0x564796dcd89c in applyCb src/replication.c:116
    #5 0x7f82cfc483ad in convertFailApply src/convert.c:59
    #6 0x7f82cfc483ad in convertClearLeader src/convert.c:95
    #7 0x7f82cfc483ad in convertClear src/convert.c:124
    #8 0x7f82cfc4858e in convertToUnavailable src/convert.c:218
    #9 0x7f82cfc4a8a4 in raft_close src/raft.c:97
    #10 0x564796dec401 in exec__tear_down test/unit/test_conn.c:302
    #11 0x564796ec5247 in munit_test_runner_exec test/lib/munit.c:1203
    #12 0x564796ec6b2c in munit_test_runner_run_test_with_params test/lib/munit.c:1357
    #13 0x564796ec7c3c in munit_test_runner_run_test test/lib/munit.c:1585
    #14 0x564796ec8505 in munit_test_runner_run_suite test/lib/munit.c:1678
    #15 0x564796ec85a5 in munit_test_runner_run_suite test/lib/munit.c:1687
    #16 0x564796ec85a5 in munit_test_runner_run_suite test/lib/munit.c:1687
    #17 0x564796ec9c43 in munit_test_runner_run test/lib/munit.c:1697
    #18 0x564796ec9c43 in munit_suite_main_custom test/lib/munit.c:2027
    #19 0x564796ecaf77 in munit_suite_main test/lib/munit.c:2055
    #20 0x564796db2d92 in main test/unit/main.c:3
    #21 0x7f82cfa060b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
    #22 0x564796db30dd in _start (/home/mjeanson/Git/lxc/dqlite/unit-test.test+0xbd0dd)

0x607000000fd8 is located 8 bytes to the left of 72-byte region [0x607000000fe0,0x607000001028)
allocated by thread T0 here:
    #0 0x7f82cfea3bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)

SUMMARY: AddressSanitizer: bad-free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf) in __interceptor_free

Signed-off-by: Michael Jeanson <mjeanson@debian.org>
mjeanson added a commit to mjeanson/dqlite that referenced this issue Aug 3, 2020
==199893==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x619000004178 in thread T0
    #0 0x7fd51bdab7cf in __interceptor_free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf)
    canonical#1 0x56243f445145 in mem_fault_free test/lib/heap.c:57
    canonical#2 0x7fd51bb8b70e in sqlite3_free sqlite3.c:28007
    canonical#3 0x7fd51bb8b70e in sqlite3_free sqlite3.c:27999
    canonical#4 0x56243f43a95e in test_vfs_pollAfterWriteTransaction test/integration/test_vfs.c:238
    canonical#5 0x56243f44a1a9 in munit_test_runner_exec test/lib/munit.c:1195
    canonical#6 0x56243f44b198 in munit_test_runner_run_test_with_params test/lib/munit.c:1357
    canonical#7 0x56243f44d81c in munit_test_runner_run_test test/lib/munit.c:1585
    canonical#8 0x56243f44e917 in munit_test_runner_run_suite test/lib/munit.c:1678
    canonical#9 0x56243f44eaf1 in munit_test_runner_run_suite test/lib/munit.c:1687
    canonical#10 0x56243f44ecb8 in munit_test_runner_run test/lib/munit.c:1697
    canonical#11 0x56243f45253d in munit_suite_main_custom test/lib/munit.c:2027
    canonical#12 0x56243f452bac in munit_suite_main test/lib/munit.c:2055
    canonical#13 0x56243f444f0d in main test/integration/main.c:3
    canonical#14 0x7fd51b88b0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
    canonical#15 0x56243f4322fd in _start (/home/mjeanson/Git/lxc/dqlite/integration-test+0x142fd)

0x619000004178 is located 8 bytes to the left of 1024-byte region [0x619000004180,0x619000004580)
allocated by thread T0 here:
    #0 0x7fd51bdabdc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
    canonical#1 0x56243f4471d6 in munit_malloc_ex test/lib/munit.c:280
    canonical#2 0x56243f43a39f in test_vfs_pollAfterWriteTransaction test/integration/test_vfs.c:230
    canonical#3 0x56243f44a1a9 in munit_test_runner_exec test/lib/munit.c:1195
    canonical#4 0x56243f44b198 in munit_test_runner_run_test_with_params test/lib/munit.c:1357
    canonical#5 0x56243f44d81c in munit_test_runner_run_test test/lib/munit.c:1585
    canonical#6 0x56243f44e917 in munit_test_runner_run_suite test/lib/munit.c:1678
    canonical#7 0x56243f44eaf1 in munit_test_runner_run_suite test/lib/munit.c:1687
    canonical#8 0x56243f44ecb8 in munit_test_runner_run test/lib/munit.c:1697
    canonical#9 0x56243f45253d in munit_suite_main_custom test/lib/munit.c:2027
    canonical#10 0x56243f452bac in munit_suite_main test/lib/munit.c:2055
    canonical#11 0x56243f444f0d in main test/integration/main.c:3
    canonical#12 0x7fd51b88b0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

SUMMARY: AddressSanitizer: bad-free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf) in __interceptor_free

Signed-off-by: Michael Jeanson <mjeanson@debian.org>
freeekanayaka pushed a commit that referenced this issue Aug 4, 2020
==199893==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x619000004178 in thread T0
    #0 0x7fd51bdab7cf in __interceptor_free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf)
    #1 0x56243f445145 in mem_fault_free test/lib/heap.c:57
    #2 0x7fd51bb8b70e in sqlite3_free sqlite3.c:28007
    #3 0x7fd51bb8b70e in sqlite3_free sqlite3.c:27999
    #4 0x56243f43a95e in test_vfs_pollAfterWriteTransaction test/integration/test_vfs.c:238
    #5 0x56243f44a1a9 in munit_test_runner_exec test/lib/munit.c:1195
    #6 0x56243f44b198 in munit_test_runner_run_test_with_params test/lib/munit.c:1357
    #7 0x56243f44d81c in munit_test_runner_run_test test/lib/munit.c:1585
    #8 0x56243f44e917 in munit_test_runner_run_suite test/lib/munit.c:1678
    #9 0x56243f44eaf1 in munit_test_runner_run_suite test/lib/munit.c:1687
    #10 0x56243f44ecb8 in munit_test_runner_run test/lib/munit.c:1697
    #11 0x56243f45253d in munit_suite_main_custom test/lib/munit.c:2027
    #12 0x56243f452bac in munit_suite_main test/lib/munit.c:2055
    #13 0x56243f444f0d in main test/integration/main.c:3
    #14 0x7fd51b88b0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
    #15 0x56243f4322fd in _start (/home/mjeanson/Git/lxc/dqlite/integration-test+0x142fd)

0x619000004178 is located 8 bytes to the left of 1024-byte region [0x619000004180,0x619000004580)
allocated by thread T0 here:
    #0 0x7fd51bdabdc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
    #1 0x56243f4471d6 in munit_malloc_ex test/lib/munit.c:280
    #2 0x56243f43a39f in test_vfs_pollAfterWriteTransaction test/integration/test_vfs.c:230
    #3 0x56243f44a1a9 in munit_test_runner_exec test/lib/munit.c:1195
    #4 0x56243f44b198 in munit_test_runner_run_test_with_params test/lib/munit.c:1357
    #5 0x56243f44d81c in munit_test_runner_run_test test/lib/munit.c:1585
    #6 0x56243f44e917 in munit_test_runner_run_suite test/lib/munit.c:1678
    #7 0x56243f44eaf1 in munit_test_runner_run_suite test/lib/munit.c:1687
    #8 0x56243f44ecb8 in munit_test_runner_run test/lib/munit.c:1697
    #9 0x56243f45253d in munit_suite_main_custom test/lib/munit.c:2027
    #10 0x56243f452bac in munit_suite_main test/lib/munit.c:2055
    #11 0x56243f444f0d in main test/integration/main.c:3
    #12 0x7fd51b88b0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

SUMMARY: AddressSanitizer: bad-free (/lib/x86_64-linux-gnu/libasan.so.5+0x10d7cf) in __interceptor_free

Signed-off-by: Michael Jeanson <mjeanson@debian.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant