Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTID seqno counter gets desynchronized after CC #185

Closed
ayurchen opened this issue Nov 20, 2014 · 2 comments
Closed

GTID seqno counter gets desynchronized after CC #185

ayurchen opened this issue Nov 20, 2014 · 2 comments
Assignees
Milestone

Comments

@ayurchen
Copy link
Member

Running stopcont.sh test with induced action fragmentation (to catch #172) leads to

2014-11-20 20:40:07 32018 [Note] WSREP: Assign initial position for certification: 16468, protocol version: 3
2014-11-20 20:40:07 32018 [Note] WSREP: Service thread queue flushed.
mysqld: galera/src/replicator_smm.cpp:1664: wsrep_status_t galera::ReplicatorSMM::cert(galera::TrxHandle*): Assertion `trx->last_seen_seqno() < trx->global_seqno()' failed.

apparently only on slaves

(gdb) bt
#0  0x00007f36bccd7d27 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f36bccd9418 in __GI_abort () at abort.c:89
#2  0x00007f36bccd0bd6 in __assert_fail_base (fmt=0x7f36bce21748 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f36bc196280 "trx->last_seen_seqno() < trx->global_seqno()", 
    file=file@entry=0x7f36bc19507e "galera/src/replicator_smm.cpp", line=line@entry=1664, 
    function=function@entry=0x7f36bc198ce0 <galera::ReplicatorSMM::cert(galera::TrxHandle*)::__PRETTY_FUNCTION__> "wsrep_status_t galera::ReplicatorSMM::cert(galera::TrxHandle*)") at assert.c:92
#3  0x00007f36bccd0c82 in __GI___assert_fail (assertion=0x7f36bc196280 "trx->last_seen_seqno() < trx->global_seqno()", file=0x7f36bc19507e "galera/src/replicator_smm.cpp", line=1664, 
    function=0x7f36bc198ce0 <galera::ReplicatorSMM::cert(galera::TrxHandle*)::__PRETTY_FUNCTION__> "wsrep_status_t galera::ReplicatorSMM::cert(galera::TrxHandle*)") at assert.c:101
#4  0x00007f36bc14c512 in galera::ReplicatorSMM::cert (this=0x36dc260, trx=0x7f366403c8f0) at galera/src/replicator_smm.cpp:1664
#5  0x00007f36bc149930 in galera::ReplicatorSMM::cert_and_catch (this=0x36dc260, trx=0x7f366403c8f0) at galera/src/replicator_smm.cpp:1774
#6  0x00007f36bc147287 in galera::ReplicatorSMM::process_trx (this=0x36dc260, recv_ctx=0x7f36540009a0, trx=0x7f366403c8f0) at galera/src/replicator_smm.cpp:1215
#7  0x00007f36bc122a1d in galera::GcsActionSource::dispatch (this=0x36dc8e8, recv_ctx=0x7f36540009a0, act=..., exit_loop=@0x7f368b4b53d9: false) at galera/src/gcs_action_source.cpp:116
#8  0x00007f36bc12306b in galera::GcsActionSource::process (this=0x36dc8e8, recv_ctx=0x7f36540009a0, exit_loop=@0x7f368b4b53d9: false) at galera/src/gcs_action_source.cpp:180
#9  0x00007f36bc1422a1 in galera::ReplicatorSMM::async_recv (this=0x36dc260, recv_ctx=0x7f36540009a0) at galera/src/replicator_smm.cpp:354
#10 0x00007f36bc15f8a9 in galera_recv (gh=0x366aab0, recv_ctx=0x7f36540009a0) at galera/src/wsrep_provider.cpp:231
#11 0x00000000006526b2 in wsrep_replication_process (thd=0x7f36540009a0) at /home/ayurchen/codership/mysql-git/git/sql/wsrep_thd.cc:309
#12 0x000000000062e8ee in start_wsrep_THD (arg=0x6525eb <wsrep_replication_process(THD*)>) at /home/ayurchen/codership/mysql-git/git/sql/mysqld.cc:5376
#13 0x00007f36bd89a0a5 in start_thread (arg=0x7f368b4b6700) at pthread_create.c:309
#14 0x00007f36bcd9b84d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 4
#4  0x00007f36bc14c512 in galera::ReplicatorSMM::cert (this=0x36dc260, trx=0x7f366403c8f0) at galera/src/replicator_smm.cpp:1664
1664        assert(trx->last_seen_seqno() < trx->global_seqno());
(gdb) p trx->last_seen_seqno_
$1 = 16470
(gdb) p trx->global_seqno_
$2 = 16470
(gdb) p trx->depends_seqno_
$3 = -1
(gdb) p trx->local_
$4 = false
@ayurchen ayurchen added the bug label Nov 20, 2014
@ayurchen ayurchen self-assigned this Nov 20, 2014
@ayurchen ayurchen added this to the 3.8 milestone Nov 20, 2014
@ayurchen
Copy link
Member Author

After adding some instrumentation it is now clear that the global seqno counter may get desynchronized between the nodes on configuration change, which essentially means a loss of action.

Master log (cleaned up, showing transactions with checksums as unique IDs):

2014-11-25 15:03:57 15993 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 141,
    members    = 2/3 (joined/total),
    act_id     = 32781,
    last_appl. = 32769,
    protocols  = 0/6/3 (gcs/repl/appl),
    group UUID = efa0b303-749b-11e4-bbc5-e75aae28cf6d
2014-11-25 15:03:57 15993 [Note] WSREP: Assign initial position for certification: 32781, protocol version: 3
2014-11-25 15:03:57 15993 [Note] WSREP: seqno_g: 32785, last_seen: 32781, checksum: 4961483846040335784
2014-11-25 15:03:57 15993 [Note] WSREP: seqno_g: 32786, last_seen: 32781, checksum: 9057784438746836913
2014-11-25 15:03:57 15993 [Note] WSREP: seqno_g: 32788, last_seen: 32786, checksum: 5502232762086329104
2014-11-25 15:03:58 15993 [Note] WSREP: seqno_g: 32790, last_seen: 32789, checksum: 13960164137662801834

Slave 1 (joiner):

2014-11-25 15:03:57 15846 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 141,
    members    = 2/3 (joined/total),
    act_id     = 32781,
    last_appl. = 32538,
    protocols  = 0/6/3 (gcs/repl/appl),
    group UUID = efa0b303-749b-11e4-bbc5-e75aae28cf6d
2014-11-25 15:03:57 15846 [Note] WSREP: State transfer required: 
    Group state: efa0b303-749b-11e4-bbc5-e75aae28cf6d:32781
    Local state: efa0b303-749b-11e4-bbc5-e75aae28cf6d:32545
2014-11-25 15:03:57 15846 [Note] WSREP: Assign initial position for certification: 32781, protocol version: 3
2014-11-25 15:03:57 15846 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 32786)
2014-11-25 15:03:57 15846 [Note] WSREP: Receiving IST: 236 writesets, seqnos 32545-32781
2014-11-25 15:04:00 15846 [Note] WSREP: IST received: efa0b303-749b-11e4-bbc5-e75aae28cf6d:32781
2014-11-25 15:04:00 15846 [ERROR] WSREP: seqno_g: 32789, last_seen: 32789, checksum: 13960164137662801834
mysqld: galera/src/trx_handle.hpp:281: void galera::TrxHandle::set_received(const void*, wsrep_seqno_t, wsrep_seqno_t): Assertion `last_seen_seqno_ < seqno_g' failed.

Slave 2 (donor):

2014-11-25 15:03:57 15984 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 141,
    members    = 2/3 (joined/total),
    act_id     = 32781,
    last_appl. = 32769,
    protocols  = 0/6/3 (gcs/repl/appl),
    group UUID = efa0b303-749b-11e4-bbc5-e75aae28cf6d
2014-11-25 15:03:57 15984 [Note] WSREP: Assign initial position for certification: 32781, protocol version: 3
2014-11-25 15:03:57 15984 [Note] WSREP: seqno_g: 32782, last_seen: 32781, checksum: 12592798987634715592
2014-11-25 15:03:57 15984 [Note] WSREP: seqno_g: 32784, last_seen: 32781, checksum: 16365591989214027386
2014-11-25 15:03:57 15984 [Note] WSREP: seqno_g: 32786, last_seen: 32781, checksum: 59242005923083167
2014-11-25 15:03:57 15984 [Note] WSREP: seqno_g: 32783, last_seen: 32781, checksum: 17126624519908990623
2014-11-25 15:03:57 15984 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 32786)
2014-11-25 15:03:57 15984 [Note] WSREP: async IST sender starting to serve tcp://10.21.32.1:4567 sending 32546-32781
2014-11-25 15:03:57 15984 [Note] WSREP: 1.0 (home1): State transfer to 0.0 (home0) complete.
2014-11-25 15:03:57 15984 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 32786)
2014-11-25 15:03:57 15984 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 32786)
2014-11-25 15:03:57 15984 [Note] WSREP: seqno_g: 32788, last_seen: 32787, checksum: 4903477773928788545
2014-11-25 15:03:58 15984 [ERROR] WSREP: seqno_g: 32789, last_seen: 32789, checksum: 13960164137662801834
mysqld: galera/src/trx_handle.hpp:281: void galera::TrxHandle::set_received(const void*, wsrep_seqno_t, wsrep_seqno_t): Assertion `last_seen_seqno_ < seqno_g' failed.

it can be seen that at least master and slave2 see different transactions under global seqnos 32786 and 32788. And transaction with a checksum of 13960164137662801834 is seen as 32789 on slaves and 32790 on master.

@ayurchen ayurchen changed the title Assertion last_seen_seqno_ < global_seqno_ fails GTID seqno counter gets desynchronized after CC Nov 25, 2014
@ayurchen
Copy link
Member Author

An even better view is here:
node1:

2014-11-25 19:21:13 17162 [Note] WSREP: New cluster view: global state: df9215d6-74c1-11e4-a7c8-7b5177b72ce7:865, view# 24: Primary, number of nodes: 3, my index: 0, protocol version 3
2014-11-25 19:21:13 17162 [Note] WSREP: SA: seqno_g: 866, last_seen: 865, size: 3293, checksum: 0x519e2a1cbcdd71e1
2014-11-25 19:21:13 17162 [Note] WSREP: M: seqno_g: 867, last_seen: 865, size: 5149, checksum: 0x9bb95be1c4d59ad7
2014-11-25 19:21:15 17162 [Note] WSREP: SA: seqno_g: 868, last_seen: 866, size: 2463, checksum: 0x5f735378f600bfbc
2014-11-25 19:21:16 17162 [Note] WSREP: M: seqno_g: 869, last_seen: 868, size: 3471, checksum: 0x986a4433d138ec2

node2 (joiner):

2014-11-25 19:21:13 17310 [Note] WSREP: New cluster view: global state: df9215d6-74c1-11e4-a7c8-7b5177b72ce7:865, view# 24: Primary, number of nodes: 3, my index: 1, protocol version 3
2014-11-25 19:21:15 17310 [Note] WSREP: IST received: df9215d6-74c1-11e4-a7c8-7b5177b72ce7:865
2014-11-25 19:21:15 17310 [Note] WSREP: SA: seqno_g: 866, last_seen: 865, size: 3293, checksum: 0x519e2a1cbcdd71e1
2014-11-25 19:21:15 17310 [Note] WSREP: SA: seqno_g: 867, last_seen: 866, size: 2463, checksum: 0x5f735378f600bfbc
2014-11-25 19:21:16 17310 [ERROR] WSREP: S: seqno_g: 868, last_seen: 868, checksum: 0x986a4433d138ec2
mysqld: galera/src/trx_handle.hpp:281: void galera::TrxHandle::set_received(const void*, wsrep_seqno_t, wsrep_seqno_t): Assertion `last_seen_seqno_ < seqno_g' failed.

node3 (donor):

2014-11-25 19:21:13 17318 [Note] WSREP: New cluster view: global state: df9215d6-74c1-11e4-a7c8-7b5177b72ce7:865, view# 24: Primary, number of nodes: 3, my index: 2, protocol version 3
2014-11-25 19:21:13 17318 [Note] WSREP: M: seqno_g: 866, last_seen: 865, size: 3293, checksum: 0x519e2a1cbcdd71e1
2014-11-25 19:21:15 17318 [Note] WSREP: M: seqno_g: 867, last_seen: 866, size: 2463, checksum: 0x5f735378f600bfbc
2014-11-25 19:21:16 17318 [ERROR] WSREP: S: seqno_g: 868, last_seen: 868, checksum: 0x986a4433d138ec2
mysqld: galera/src/trx_handle.hpp:281: void galera::TrxHandle::set_received(const void*, wsrep_seqno_t, wsrep_seqno_t): Assertion `last_seen_seqno_ < seqno_g' failed.

Notably, again donor and joiner aborted. As can be seen node1 replicated action 867 which only it received. Curiously it was not the first action to be delivered (or sent) in new configuration. node3 was the first to replicate action 866 which was delivered by all nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant