
ALTER TABLE ... ENGINE=ROCKSDB uses too much memory #692

Open

spetrunia opened this issue Aug 24, 2017 · 10 comments · 5 participants

@spetrunia (Collaborator) commented Aug 24, 2017

This is a known property, but I am filing it as a bug because it gives a bad user experience.

A trivial example: let's create a non-MyRocks table

create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table one_k(a int);
insert into one_k select A.a + B.a * 10 + C.a * 100 from ten A, ten B, ten C;
create table t1 (pk int primary key, col1 char(100)) engine=myisam;
insert into t1 select A.a + 1000*B.a, 'filler-data' from one_k A, one_k B;

now, suppose one is considering migrating to MyRocks. Their first likely step might be (*):

alter table t1 engine=rocksdb;

Unfortunately, when the table is sufficiently big, this statement causes the server to consume all available memory and then be killed, either by an uncaught std::bad_alloc or by the OOM killer.
This is because MyRocks tries to write the whole table contents as one big transaction.

MyRocks actually already has ways to break big bulk-load operations into smaller chunks. The problem is that they are not enabled for this statement, which gives a bad user experience.

(*) - this bug is actually inspired by a real-world attempt: https://jira.mariadb.org/browse/MDEV-13609 (if you're not a MariaDB member you won't see much in that MDEV as crash details are interspersed with user data and so were made private)

A paste from the debugger proving that ALTER TABLE ... ENGINE=... indeed accumulated the whole table's contents in memory:

(gdb) p rep_.size()
  $9 = 112000013
(gdb) wher
  #0  rocksdb::WriteBatch::Clear (this=0x7fffd800ce40) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/rocksdb/db/write_batch.cc:186
  #1  0x00000000019c4fd2 in rocksdb::WriteBatchWithIndex::Rep::Clear (this=0x7fffd800ce40) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/rocksdb/utilities/write_batch_with_index/write_batch_with_index.cc:494
  #2  0x00000000019c6132 in rocksdb::WriteBatchWithIndex::Clear (this=0x7fffd800ccc0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/rocksdb/utilities/write_batch_with_index/write_batch_with_index.cc:711
  #3  0x0000000001a5da68 in rocksdb::TransactionBaseImpl::Clear (this=0x7fffd800cc50) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/rocksdb/utilities/transactions/transaction_base.cc:42
  #4  0x00000000019a1f5b in rocksdb::TransactionImpl::Clear (this=0x7fffd800cc50) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/rocksdb/utilities/transactions/transaction_impl.cc:101
  #5  0x00000000019a2c46 in rocksdb::TransactionImpl::Commit (this=0x7fffd800cc50) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/rocksdb/utilities/transactions/transaction_impl.cc:253
  #6  0x00000000016691f5 in myrocks::Rdb_transaction_impl::commit_no_binlog (this=0x7fffd800caf0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/rocksdb/ha_rocksdb.cc:2081
  #7  0x0000000001668446 in myrocks::Rdb_transaction::commit (this=0x7fffd800caf0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/rocksdb/ha_rocksdb.cc:1812
  #8  0x000000000163f54f in myrocks::rocksdb_commit (hton=0x29fb0e0, thd=0x2acab60, commit_tx=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/rocksdb/ha_rocksdb.cc:2829
  #9  0x0000000000d83fa7 in ha_commit_low (thd=0x2acab60, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:1603
  #10 0x0000000001238640 in MYSQL_BIN_LOG::commit (this=0x2744000 <mysql_bin_log>, thd=0x2acab60, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7228
  #11 0x0000000000d83cda in ha_commit_trans (thd=0x2acab60, all=false, async=false, ignore_global_read_lock=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:1542
  #12 0x0000000000d8ba5e in ha_enable_transaction (thd=0x2acab60, on=true) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:5046
  #13 0x000000000101568a in mysql_trans_commit_alter_copy_data (thd=0x2acab60) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_table.cc:9406
  #14 0x000000000101639a in copy_data_between_tables (from=0x7fffd8101370, to=0x7fffd8037270, create=..., ignore=false, order_num=0, order=0x0, copied=0x7ffff47159b0, deleted=0x7ffff47159b8, keys_onoff=Alter_info::LEAVE_AS_IS, alter_ctx=0x7ffff47161d0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_table.cc:9641
  #15 0x000000000101488d in mysql_alter_table (thd=0x2acab60, new_db=0x7fffd8005ca8 "test", new_name=0x0, create_info=0x7ffff4717240, table_list=0x7fffd8005740, alter_info=0x7ffff4717160, order_num=0, order=0x0, ignore=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_table.cc:9099
  #16 0x0000000001176d82 in Sql_cmd_alter_table::execute (this=0x7fffd8005cb8, thd=0x2acab60) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_alter.cc:335
  #17 0x0000000000f8c734 in mysql_execute_command (thd=0x2acab60, statement_start_time=0x7ffff4718aa8, post_parse=0x7ffff4718c00) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:6098
  #18 0x0000000000f90913 in mysql_parse (thd=0x2acab60, rawbuf=0x7fffd8005650 "alter table t1 engine=rocksdb", length=29, parser_state=0x7ffff47194b0, last_timer=0x7ffff4718c00, async_commit=0x7ffff4718bcc "") at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:7770
  #19 0x0000000000f80daa in dispatch_command (command=COM_QUERY, thd=0x2acab60, packet=0x2ad6cd1 "alter table t1 engine=rocksdb", packet_length=29, srv_session=0x0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:1862
  #20 0x0000000000f7f19a in do_command (thd=0x2acab60) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:1207
  #21 0x0000000000f3dcaa in do_handle_one_connection (thd_arg=0x2acab60) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_connect.cc:1142
  #22 0x0000000000f3d6cc in handle_one_connection (arg=0x2acab60) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_connect.cc:963

@spetrunia added the myrocks label Aug 24, 2017

@yoshinorim (Contributor) commented Aug 24, 2017

We have plans to implicitly use bulk loading mode on rebuilding tables. Meanwhile, could your customer use the following instead?

set session rocksdb_bulk_load=1;
alter table t1 engine=rocksdb;
set session rocksdb_bulk_load=0;

@georgelorchpercona (Contributor) commented Sep 8, 2017

Peter Zaitsev already ran into this same OOM situation on large transactions with out-of-the-box settings. This is not likely to pass as acceptable behavior. If there is an upper limit on the size of a transaction, reaching it should produce a SQL error, not a server assertion.

As far as bulk-load behavior goes, TokuDB implemented a fairly simple idea: if a handler bulk insert is starting and the table is empty, assume it is a bulk load and activate the bulk loader. TokuDB also allows out-of-order key insertions. It can handle these because it 'stages' all of the inserts first and sorts the data, then finalizes and actually performs the bulk operation once all inserts have been accepted and sorted. The downside is the performance and storage cost of staging and sorting the data, but the end result is still a ~15% performance improvement over normal insertions. So maybe there is an idea in there that could benefit MyRocks/RocksDB.

@yoshinorim (Contributor) commented Sep 8, 2017

@georgelorchpercona We have a session variable rocksdb_max_row_locks. If the number of row locks exceeds the variable, the transaction gets a SQL error. The current default is 1B row locks, which is far too high to return an error before hitting OOM. We are discussing internally whether to reduce the default, but have not done so yet. Maybe we'll reduce it to around 1M.
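A minimal sketch of this workaround (the variable name is from the comment above; the 1M value and the idea that the oversized ALTER then fails with a SQL error rather than OOM follow this thread's discussion, but the exact failure behavior is an assumption):

```sql
-- Cap the number of row locks a single transaction may take.
-- The default is 1,000,000,000; lowering it makes an oversized
-- transaction fail with a SQL error instead of exhausting memory.
SET SESSION rocksdb_max_row_locks = 1000000;

-- With the cap in place, a table copy touching more than 1M rows
-- should error out early rather than triggering std::bad_alloc / OOM.
ALTER TABLE t1 ENGINE=ROCKSDB;
```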

Implicit bulk load on DDL has been discussed as well. We can implement it since quite a lot of people have been hitting the OOM on DDL.

The idea about out-of-order key insertions is interesting. Thanks for your suggestion!

@yoshinorim (Contributor) commented Sep 8, 2017

Longer term, we plan to support long-running transactions in RocksDB, which will spill intermediate data to storage to prevent OOM.

@georgelorchpercona (Contributor) commented Sep 8, 2017

Thanks for the tip on rocksdb_max_row_locks, @yoshinorim! That is a perfect short-term solution that we can implement ourselves. We have a similar issue with the rocksdb_max_open_files default and are reducing it to sane levels and documenting how to tune it properly.

The whole point is that a basic out-of-the-box installation should not fail or assert on simple things. If there are limitations, we should return an error to the client when they are hit, not assert the instance. I realize this is a different mindset from early development and prototyping, when you don't yet know what the 'correct' values are and want a very in-your-face failure to let you know you hit some limit.

Thanks again for the info! I will pass this on to our team and start looking at changing our defaults for these and ensure we have them documented.

@lth (Contributor) commented Sep 8, 2017

There's also rocksdb_write_batch_max_bytes which offers similar functionality.
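A similarly hedged sketch using this variable (the 256 MB value is an arbitrary illustration, not a recommendation from this thread):

```sql
-- Bound the size of the in-memory write batch; when a transaction's
-- batch grows past this many bytes, it fails with an error rather
-- than accumulating the whole table copy in memory.
SET SESSION rocksdb_write_batch_max_bytes = 268435456;  -- 256 MB

ALTER TABLE t1 ENGINE=ROCKSDB;
```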

@asandryh commented Sep 23, 2017

@georgelorchpercona We have recently added support for out-of-order key insertion in bulk load mode. Setting rocksdb_bulk_load_allow_unsorted along with rocksdb_bulk_load will store all rows in temporary SST files, sort them, and write them to the final SST files in order. I've updated the wiki as well: https://github.com/facebook/mysql-5.6/wiki/Data-Loading.
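Combining this with the bulk-load workaround from earlier in the thread might look like the following sketch (whether ALTER itself honors unsorted bulk load, as opposed to INSERT ... SELECT loading, is an assumption; see the linked wiki page for the supported usage):

```sql
-- Enable bulk load and allow keys to arrive out of order; rows are
-- staged in temporary SST files, sorted, then written to final SSTs.
SET SESSION rocksdb_bulk_load_allow_unsorted = 1;
SET SESSION rocksdb_bulk_load = 1;

ALTER TABLE t1 ENGINE=ROCKSDB;

-- Switching bulk load off finalizes the load.
SET SESSION rocksdb_bulk_load = 0;
SET SESSION rocksdb_bulk_load_allow_unsorted = 0;
```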

@georgelorchpercona (Contributor) commented Jan 4, 2018

Bumping this. Our CTO Vadim hit this today immediately while testing a release candidate build. We know about https://github.com/facebook/mysql-5.6/wiki/Migrating-from-InnoDB-to-RocksDB and https://github.com/facebook/mysql-5.6/wiki/data-loading and the memory limitations, but this makes for a terrible user experience: a 'new' user wanting to try out MyRocks immediately hits this, then has to go and figure out what happened. Users typically do not run right out and read the docs before firing things off, other than to follow the fastest path from 'A' to 'B'.

So we now have Sergei Petrunia, Peter Zaitsev, and Vadim Tkachenko, all fairly skilled people with MySQL, hitting (or reporting) this out-of-the box, and all asking, "can't we just not crash and issue some error early on by default?"

Please consider some way to provide a reasonable default transaction-size threshold that errors back to the client, and try to get the RocksDB folks to fix it (maybe mmap the needed memory from a temp file or something). Also, please consider this one of the highest-priority issues we are currently facing as we attempt to release MyRocks as GA in Percona Server.

@yoshinorim (Contributor) commented Jan 4, 2018

We're going to reduce the default rocksdb_max_row_locks from 1B to 1M soon. Longer term, we'll automatically use bulk loading mode on ALTER.

@yoshinorim (Contributor) commented Jan 5, 2018

Committed e2c6868, which reduced the default rocksdb_max_row_locks to 1M.
