Commit ff91c80

yoshinorim authored and jtolmer committed
Crash safe slave and master in RocksDB
Summary: This diff has the following changes.

- Changed the system tables that manage replication state from InnoDB to RocksDB. I also converted CHAR types to VARCHAR types, since VARCHAR is more space efficient in RocksDB.

- Fixed an issue where the Xid binlog event was not written at RocksDB commit. Without the Xid event, the mysql.slave_*info tables are not updated on slaves, so crash safe slave didn't work. This was fixed by defining an empty rocksdb_prepare() function as the handler::prepare method. Since two storage engines now support prepare() (InnoDB and RocksDB, though it's fake in RocksDB), it was also necessary to adjust the "total_ha_2pc" variable in MySQL, which checks conditions in order to avoid errors/crashes (total_ha_2pc should be binlog + 2, for InnoDB and RocksDB). With this fix, the slave_*info tables are updated on slaves, which is required for crash safe slave to work (but is not enough when GTID is enabled).

- Wrote binlog state (binlog name, binlog pos and optionally GTID) in RocksDB at transaction commit. This is treated as a single Put (part of the WriteBatch) and is written together with the rest of the transaction at commit. Since binlog is enabled on both master and slaves, it is written on both. This means that on transaction commit one additional Put (row) is written on the master, and two Puts are written on a slave (one for this, the other for slave_*info). To write and read binlog state, I added a utility class Binlog_info_manager and refactored some existing functions. To store/read binlog state in RocksDB, I allocated a system index id BINLOG_INFO_INDEX_NUMBER (0xfffffff0), which is not likely to be used (unless 0xfffffff0 indexes are created). There is currently no check on the maximum index number, but one will definitely be needed. Note that binlog state is written directly to RocksDB, not through the MySQL data dictionary like slave_relay_log_info. This is for performance reasons: opening/closing an additional MySQL table is expensive, and writing directly to RocksDB avoids that overhead.

- Added a rocksdb_recover() function as the handler::recover function. It reads the binlog state stored in RocksDB and prints the last committed binlog filename, position and GTID. This makes it possible to recover both a crashed slave and a crashed master. The printed format is aligned with Facebook's enhancement for InnoDB (22e4a82).

- Changed the variable names innodb_binlog_file and innodb_binlog_pos in binlog.h to engine_binlog_file and engine_binlog_pos, because they are now set from both RocksDB and InnoDB.

- Added a binlog state check to rocksdb_recover() and innobase_xa_recover(). Both set binlog info from RocksDB/InnoDB (in RocksDB->InnoDB ordering). Without any check, the binlog state fetched in rocksdb_recover() is overwritten in innobase_xa_recover(). When only the RocksDB storage engine is used, this means the binlog state is overwritten with a very old (wrong) position. To fix this issue, I added a utility function is_binlog_advanced() in handler.h and handler.cc. rocksdb_recover() and innobase_xa_recover() check the binlog state and, if a state is already set, overwrite it only when the new state is more advanced. This prevents overwriting with an older binlog state.

- Added test cases to cover a WAL crash, and modified the existing crash safe master/slave test cases to cover RocksDB. They are now stored in the rocksdb_rpl test suite. Since the RocksDB storage engine now does nothing at prepare(), roll forward recovery (committing transactions that are written as binlog Xid events) doesn't happen. This is a different behavior from InnoDB, so I adjusted the result files for RocksDB.

@update-submodule: rocksdb

Test Plan: mtr --repeat=3 --suite=rocksdb and --suite=rocksdb_rpl

Reviewers: hermanlee4, jonahcohen, santoshb, maykov
Subscribers: jtolmer, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D31365
Differential Revision: https://reviews.facebook.net/D48045
Differential Revision: https://reviews.facebook.net/D49221
Differential Revision: https://reviews.facebook.net/D49557
Differential Revision: https://reviews.facebook.net/D51837
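
The commit message describes the binlog state as one extra Put in the transaction's WriteBatch, stored under the reserved index id BINLOG_INFO_INDEX_NUMBER (0xfffffff0). A minimal Python model of that idea follows. It is not the patch's C++: the key and value byte layouts, the toy WriteBatch class, and the function names are all assumptions made for illustration, and only the index id and the three stored fields come from the message.

```python
import struct

# Reserved system index id named in the commit message.
BINLOG_INFO_INDEX_NUMBER = 0xFFFFFFF0


def binlog_info_key() -> bytes:
    """Key of the binlog state row: the index id as 4 big-endian bytes (assumed layout)."""
    return struct.pack(">I", BINLOG_INFO_INDEX_NUMBER)


def pack_binlog_info(binlog_name: str, binlog_pos: int, gtid: str = "") -> bytes:
    """Hypothetical value layout: length-prefixed name, 4-byte position, length-prefixed GTID."""
    name = binlog_name.encode("utf-8")
    gtid_bytes = gtid.encode("utf-8")
    return (
        struct.pack(">H", len(name)) + name
        + struct.pack(">I", binlog_pos)
        + struct.pack(">H", len(gtid_bytes)) + gtid_bytes
    )


def unpack_binlog_info(value: bytes):
    """Inverse of pack_binlog_info; returns (binlog_name, binlog_pos, gtid)."""
    (name_len,) = struct.unpack_from(">H", value, 0)
    offset = 2
    name = value[offset:offset + name_len].decode("utf-8")
    offset += name_len
    (pos,) = struct.unpack_from(">I", value, offset)
    offset += 4
    (gtid_len,) = struct.unpack_from(">H", value, offset)
    offset += 2
    gtid = value[offset:offset + gtid_len].decode("utf-8")
    return name, pos, gtid


class WriteBatch:
    """Toy stand-in for a RocksDB WriteBatch: buffered puts that are applied together."""

    def __init__(self):
        self.puts = []

    def put(self, key: bytes, value: bytes) -> None:
        self.puts.append((key, value))


def commit(store: dict, batch: WriteBatch, binlog_name: str, binlog_pos: int, gtid: str = "") -> None:
    """Add the binlog state Put to the batch, then apply the whole batch in one step."""
    batch.put(binlog_info_key(), pack_binlog_info(binlog_name, binlog_pos, gtid))
    store.update(batch.puts)
```

Because the state row travels in the same batch as the transaction's own rows, either both become durable or neither does, which is what lets recovery trust the stored position.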
1 parent 40e4418 commit ff91c80

39 files changed: +1258 -50 lines changed
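
The is_binlog_advanced() check described in the commit message can be modeled as a comparison of (binlog file, position) pairs. The Python sketch below is an illustration, not the code from handler.cc: it assumes binlog file names end in a numeric sequence suffix such as `binlog.000007`, and the EngineBinlogState class and its offer() method are hypothetical names.

```python
def is_binlog_advanced(prev_file: str, prev_pos: int, new_file: str, new_pos: int) -> bool:
    """Return True if (new_file, new_pos) is strictly ahead of (prev_file, prev_pos).

    Assumption: file names carry a numeric suffix after the last dot,
    e.g. "binlog.000007", and that suffix orders the files.
    """
    if not prev_file:
        # Nothing recorded yet, so any state counts as an advance.
        return True
    prev_seq = int(prev_file.rsplit(".", 1)[1])
    new_seq = int(new_file.rsplit(".", 1)[1])
    return (new_seq, new_pos) > (prev_seq, prev_pos)


class EngineBinlogState:
    """Hypothetical holder for engine_binlog_file / engine_binlog_pos."""

    def __init__(self):
        self.file = ""
        self.pos = 0

    def offer(self, file: str, pos: int) -> None:
        """Called once per engine during recovery; keeps only the most advanced state."""
        if is_binlog_advanced(self.file, self.pos, file, pos):
            self.file, self.pos = file, pos
```

With this guard, the order in which the engines report their state no longer matters: a stale InnoDB position offered after a newer RocksDB one is simply ignored.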

mysql-test/extra/rpl_tests/rpl_gtid_crash_safe.inc

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 connection master;
 -- let $uuid = `select @@server_uuid;`
-create table t1(a int, PRIMARY KEY(a)) ENGINE=INNODB;
+-- eval create table t1(a int, PRIMARY KEY(a)) ENGINE=$engine
 insert into t1 values(1);
 insert into t1 values(2);
 sync_slave_with_master;

mysql-test/extra/rpl_tests/rpl_parallel_load_innodb.test

Lines changed: 31 additions & 4 deletions
@@ -6,7 +6,12 @@
 # load volume parameter
 #
 
-if ($let == '')
+if ($storage_engine == '')
+{
+let $storage_engine='innodb';
+}
+
+if ($iter == '')
 {
 let $iter=20;
 }
@@ -68,8 +73,8 @@ while($i)
 eval drop database if exists test$i1;
 eval create database test$i1;
 eval use test$i1;
-create table ti_nk (a int, b int, c text) engine=innodb;
-create table ti_wk (a int auto_increment primary key, b int, c text) engine=innodb;
+eval create table ti_nk (a int, b int, c varchar(36) primary key) engine=$storage_engine default charset=latin1 collate=latin1_bin;
+eval create table ti_wk (a int auto_increment primary key, b int, c text) engine=$storage_engine;
 let $l1= $init_rows;
 while($l1)
 {
@@ -79,7 +84,7 @@ while($i)
 
 # this table is special - just for timing. It's more special on test0 db
 # where it contains master timing of the load as well.
-create table benchmark (state text) engine=innodb; # timestamp keep on the slave side
+eval create table benchmark (state varchar(100) primary key) engine=$storage_engine default charset=latin1 collate=latin1_bin; # timestamp keep on the slave side
 
 dec $i;
 }
@@ -278,11 +283,33 @@ let $slave_pid_file=`SELECT @@pid_file;`;
 
 
 let $wait_condition= SELECT count(*)+sleep(1) = 5 FROM test0.benchmark;
+--let slave_data_dir= query_get_value(SELECT @@DATADIR, @@DATADIR, 1)
+
 --disable_query_log
 while ($num_crashes)
 {
 SELECT sleep(2 * rand());
 exec kill -9 `head -1 $slave_pid_file`;
+
+if ($storage_engine == 'rocksdb')
+{
+--write_file $MYSQL_TMP_DIR/truncate_tail_wal.sh
+#!/bin/bash
+F=`ls -t $slave_data_dir/.rocksdb/*.log | head -n 1`
+SIZE=`stat -c %s $F`
+NEW_SIZE=`expr $SIZE - 100`
+truncate -s $NEW_SIZE $F
+rc=$?
+if [[ $rc != 0 ]]; then
+exit 1
+fi
+exit 0
+EOF
+--chmod 0755 $MYSQL_TMP_DIR/truncate_tail_wal.sh
+--exec $MYSQL_TMP_DIR/truncate_tail_wal.sh
+--remove_file $MYSQL_TMP_DIR/truncate_tail_wal.sh
+}
+
 -- source include/rpl_start_server.inc
 -- source include/start_slave.inc
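
The truncate_tail_wal.sh helper added in the hunk above simulates a torn WAL tail by cutting 100 bytes off the most recently modified `*.log` file. The same steps can be sketched in Python, shown here only to make the logic explicit. The function name and the clamp to zero for files shorter than the cut are additions that the shell script does not have.

```python
import os
from pathlib import Path


def truncate_newest_log(rocksdb_dir: str, nbytes: int = 100):
    """Cut `nbytes` off the end of the most recently modified *.log file.

    Mirrors `ls -t ... | head -n 1` followed by `truncate -s`. Unlike the
    shell script, the new size is clamped at zero for very small files.
    Returns the path that was truncated and its new size.
    """
    logs = sorted(
        Path(rocksdb_dir).glob("*.log"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if not logs:
        raise FileNotFoundError(f"no *.log files in {rocksdb_dir}")
    newest = logs[0]
    new_size = max(newest.stat().st_size - nbytes, 0)
    os.truncate(newest, new_size)
    return newest, new_size
```

Dropping the tail of the newest log loses the last committed batch or batches, which is the situation the rocksdb_rpl result files then check recovery against.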

mysql-test/r/rocksdb.result

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 select * from information_schema.engines where engine = 'rocksdb';
 ENGINE SUPPORT COMMENT TRANSACTIONS XA SAVEPOINTS
-ROCKSDB YES RocksDB storage engine YES NO NO
+ROCKSDB YES RocksDB storage engine YES YES NO
 drop table if exists t0,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10;
 drop table if exists t11,t12,t13,t14,t15,t16,t17,t18,t19,t20;
 drop table if exists t21,t22,t23,t24,t25,t26,t27,t28,t29;
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[row]
+binlog-format=row
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+include/master-slave.inc
+Warnings:
+Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
+Note #### Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
+[connection master]
+drop table if exists x;
+select @@binlog_format;
+@@binlog_format
+ROW
+create table x (id int primary key, value int, value2 int, index(value)) engine=rocksdb;
+insert into x values (1,1,1);
+insert into x values (2,1,1);
+insert into x values (3,1,1);
+insert into x values (4,1,1);
+insert into x values (5,1,1);
+select @@global.gtid_executed;
+@@global.gtid_executed
+
+
+--- slave state before crash ---
+select * from x;
+id value value2
+1 1 1
+2 1 1
+3 1 1
+4 1 1
+5 1 1
+select @@global.gtid_executed;
+@@global.gtid_executed
+
+select * from mysql.slave_gtid_info;
+Id Database_name Last_gtid
+include/rpl_start_server.inc [server_number=2]
+
+--- slave state after crash recovery, slave stop, one WAL entry missing ---
+select * from x;
+id value value2
+1 1 1
+2 1 1
+3 1 1
+4 1 1
+select @@global.gtid_executed;
+@@global.gtid_executed
+
+select * from mysql.slave_gtid_info;
+Id Database_name Last_gtid
+
+--- slave state after restart, slave start ---
+include/start_slave.inc
+select * from x;
+id value value2
+1 1 1
+2 1 1
+3 1 1
+4 1 1
+5 1 1
+select @@global.gtid_executed;
+@@global.gtid_executed
+
+select * from mysql.slave_gtid_info;
+Id Database_name Last_gtid
+insert into x values (6,1,1);
+select * from x;
+id value value2
+1 1 1
+2 1 1
+3 1 1
+4 1 1
+5 1 1
+6 1 1
+select @@global.gtid_executed;
+@@global.gtid_executed
+
+select * from mysql.slave_gtid_info;
+Id Database_name Last_gtid
+drop table x;
+include/rpl_end.inc
+Binlog Info Found
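
The result file above shows the slave holding rows 1-5 before the crash, rows 1-4 after recovery with one WAL entry missing, and rows 1-5 again once replication restarts. A toy Python model of why this works follows. It assumes each applied event is logged as one atomic entry holding both the row and the replication position reached, so losing the tail entry loses both together. The class and function names are hypothetical and the positions are made up.

```python
class Slave:
    """Toy slave whose WAL entries each hold (row_id, position reached) atomically."""

    def __init__(self):
        self.wal = []

    def apply(self, row_id: int, pos_after: int) -> None:
        # Row change and replication position are written as one entry.
        self.wal.append((row_id, pos_after))

    def crash_losing_tail(self, n: int = 1) -> None:
        # Simulate a crash in which the last n WAL entries never became durable.
        del self.wal[max(len(self.wal) - n, 0):]

    def recover(self):
        # Rebuild table contents and the resume position from the surviving WAL.
        rows = [row_id for row_id, _ in self.wal]
        pos = self.wal[-1][1] if self.wal else 0
        return rows, pos


def replicate(master_events, slave: Slave, start_pos: int) -> None:
    """Apply every master event that ends after start_pos."""
    for pos_after, row_id in master_events:
        if pos_after > start_pos:
            slave.apply(row_id, pos_after)
```

In this model a slave that loses its last entry resumes from the position of the fourth event and fetches the fifth again, so no row is skipped or applied twice.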
