MyRocks: Bulk load requires an insane number of file descriptors #347
Comments
Please try the following.
Using his my.cnf and building from today's HEAD. Not sure whether the bug reporter used the same source:
2016/10/21-11:22:44.089431 7f26046ab940 Options.target_file_size_base: 67108864
Defaults changed in facebook/rocksdb@2feafa3
@RickPizzi Could you paste "SHOW ENGINE ROCKSDB STATUS" output?
Please note that I have used two different configurations for bulk load and for normal use (rocksdb_max_open_files=-1). Two notes here:
I have just checked: the 2.3 TB InnoDB dataset is occupying 2.0 TB in RocksDB, and the server currently has 34,160 file descriptors open. I expected some more space saving here. By the way, some of the InnoDB tables are compressed in InnoDB but I removed compression when importing into RocksDB, while some others have blobs with compressed content (typically zlib) and those were imported unchanged. Since I am going to retry the bulk load, if you want me to try a different bulk load setting, please tell me what to try. Output of SHOW ENGINE (with the above settings in effect) follows. Thank you!

mysql> show engine rocksdb status\G
** Level 0 read latency histogram (micros):
 Percentiles: P50: 3.50 P75: 8.75 P99: 679.00 P99.9: 679.00 P99.99: 679.00
 [ 0, 1 )      2  40.000%  40.000% ########
** Level 1 read latency histogram (micros):
 Percentiles: P50: 4.05 P75: 4.62 P99: 7.73 P99.9: 23.51 P99.99: 38.84
 [ 0, 1 )     11   0.089%   0.089%
** Level 2 read latency histogram (micros):
 Percentiles: P50: 3.97 P75: 5.76 P99: 97.66 P99.9: 13722.82 P99.99: 35868.87
 [ 0, 1 )    588   0.805%   0.805%
** Level 3 read latency histogram (micros):
 Percentiles: P50: 3.68 P75: 4.87 P99: 126.74 P99.9: 21697.54 P99.99: 66238.90
 [ 0, 1 )   1224   0.689%   0.689%
** Level 4 read latency histogram (micros):
 Percentiles: P50: 6.23 P75: 33.68 P99: 5398.18 P99.9: 33839.59 P99.99: 77159.11
 [ 0, 1 )  11180   0.382%   0.382%
** Level 5 read latency histogram (micros):
 Percentiles: P50: 5.75 P75: 33.94 P99: 15313.78 P99.9: 46624.96 P99.99: 91418.96
 [ 0, 1 )  70777   3.775%   3.775% #
*************************** 2. row ***************************
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
L0    1/0   0.30     0.2   0.0      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0         0         0.000    0     0
*************************** 3. row ***************************
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
L0    1/0   64.21    0.2   0.0      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0         0         0.000    0     0
*************************** 4. row ***************************
The default column family shows that you have 34,094 SST files and the total size is 1,995,553 MB.
The average SST file size is 58 MB. Since the instance is 2 TB, having tens of thousands of open SST files is not uncommon. Just allocate large enough file descriptor limits (/etc/security/limits.conf etc.). If you don't like having that many open files, consider increasing target_file_size_base (my previous 32 MB suggestion was not appropriate; the default is now 64 MB).
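For what it's worth, a minimal sketch of both knobs; the account name and limit values below are illustrative assumptions, not settings taken from this thread:

# /etc/security/limits.conf - raise the open-file limit for the account running mysqld
# (mysqld's own open_files_limit may also need to be raised to match)
mysql soft nofile 100000
mysql hard nofile 100000

# my.cnf - larger SST files mean fewer files (and fewer open descriptors) for the same data;
# 128m is only an example (the current default is 64m), appended to the existing CF options string
rocksdb_default_cf_options=write_buffer_size=128m;max_write_buffer_number=16;target_file_size_base=128m

Assuming file sizes track the target, doubling target_file_size_base from 64 MB to 128 MB should roughly halve the ~34,000 SST files reported above for the same amount of data.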
--- for OOM

For the OOM issue, do you know what RSS was for mysqld when it was killed? Which malloc do you use for mysqld - tcmalloc, jemalloc, or glibc? In the past I have had lousy results with glibc malloc; almost all of my tests use jemalloc with mysqld. One result from my past testing showed that RSS was twice as large with glibc malloc -> http://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html I run days-long tests using linkbench as the workload, and with rocksdb_block_cache_size=35G the mysqld RSS was about 41G. The other potential cause of excessive memory use is a large number of column families, because each CF gets its own memtables, i.e. write_buffer_size * max_write_buffer_number bytes of RAM. But I don't think you are doing that.

--- for compression

Are you comparing space used immediately after the load? If the tables will then be subject to random updates, InnoDB usually grows much faster than RocksDB, as it suffers more from fragmentation. If RocksDB is loaded in key order, there is an optimization for leveled compaction: the memtable is flushed to write L0 files, and then the L0 files are pushed down the tree, so compaction doesn't occur beyond L0. If L0 files are not compressed, then uncompressed files are pushed all the way down the tree. Looking at the table schema you pasted for deeplink, the table has a PK and 2 secondary indexes. If loads are done into that table in PK order, I am not sure whether this optimization occurs. Looking at the compaction IO stats you pasted, the interesting data is:

From the "Sum" line, Write(GB)=27.0 and Moved(GB)=1495.6. If you are using a special my.cnf during the bulk load, then you can configure compression for L0, L1, L2: change kNoCompression to kZlibCompression or kSnappyCompression in the my.cnf used during the bulk load.
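If it helps, here is a hedged sketch of how that could look in the my.cnf used only for the bulk load; the line below just extends the CF options string already posted in this thread and picks kZlibCompression, one of the two codecs named above:

# bulk-load my.cnf (sketch): same CF options as before, with zlib compression enabled;
# a single entry in compression_per_level should apply to all levels
rocksdb_default_cf_options=write_buffer_size=128m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=256;level0_stop_writes_trigger=256;max_write_buffer_number=16;memtable=vector:1024;compression_per_level=kZlibCompression

kSnappyCompression would trade some of the space savings for faster compression during the load.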
Maybe I missed this, but the my.cnf you pasted at the start for the bulk load doesn't enable compression. Yoshinori, can you suggest a my.cnf?
One more thing: I have a test that doesn't get compression as requested. Not sure whether you hit this - #350
OK, I guess (since I just started playing with MyRocks) that my bulk-load config was not appropriate for the operation, as pointed out by @mdcallag. In fact, I didn't look at all the options carefully; I just took the suggested config from the wiki and started loading, to see how it performed. Can you guys recommend a my.cnf for bulk load with maximum compression? The idea here is to switch this stats cluster from InnoDB to MyRocks: the workload is almost INSERT-only, so it looks like a good candidate for LSM storage, and the master already gets quite a lot of write traffic that we expect to grow further in the coming months.

Also, on file descriptors: thanks @yoshinorim for confirming that this is normal behaviour for MyRocks with large datasets; sure, I can live with that, but it seemed odd at first :-) I will use your suggestion about target_file_size_base.

Re. the OOM, here is what mysqld had at the time it kicked in:

Thanks
UPDATE - I have given it another try, after following Mark's and Yoshinori's suggestions above, and results are MUCH better!
Relevant config plus rocksdb status below; if there are other tweaks you want to suggest to further improve compression, please let me know, as I will repeat it once again with fresh data anyway. Thanks!
Please re-open if you see any other issues.
Hi,
While trying to load (using myloader) a dump of a large database (830+ tables), we see that MyRocks is eating an insane number of file descriptors. We were unable to complete the load with a 25,000 open files limit! We are now retrying with 10x that value to see whether it runs to completion; however, this just doesn't seem right. Maybe a file descriptor leak?
Here's the relevant config.
Thanks,
Rick
rocksdb
skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
binlog_format = ROW
rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_total_wal_size=4G
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_skip_unique_check=1
rocksdb_commit_in_the_middle=1
rocksdb_write_disable_wal=1
rocksdb_max_background_flushes=40
rocksdb_max_background_compactions=40
rocksdb_default_cf_options=write_buffer_size=128m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=256;level0_stop_writes_trigger=256;max_write_buffer_number=16;memtable=vector:1024
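One hedged back-of-the-envelope note on memory, since an OOM was mentioned earlier (my arithmetic, not a figure posted in the thread): with the CF options above, memtables alone can grow to

write_buffer_size * max_write_buffer_number = 128 MB * 16 = 2 GB per column family

in addition to the 16 GB block cache, which is worth keeping in mind when sizing the box for the bulk load.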