MyRocks advantages over InnoDB

InnoDB is general purpose, very robust, and the most widely used open source relational database in the world. InnoDB works great on most workloads, but there are some use cases where MyRocks can outperform. MyRocks does not intend to beat InnoDB for all workload types. In this page, I describe MyRocks advantages compared to InnoDB.

MyRocks requires less storage space than InnoDB

If you run databases on disks, space doesn’t matter much. But if you run on flash, reducing space is very important because per GB cost is much higher than HDD.

InnoDB has some issues when you try to reduce space. First, InnoDB uses B+Tree index. B+Tree indexes get fragmented if not inserted sequentially. It is not uncommon that fragmentation increases 30~50% additional space.

Second issue is that InnoDB compression has space overhead. InnoDB compresses per page basis. Uncompressed page size is 16KB by default. After compression, page size is aligned to 4KB unit prior to MySQL 5.7, or OS/device sector size unit after MySQL 5.7 (if using Punch-Hole compression). On modern Flash storage device, sector size is 4KB. This means InnoDB compresses to 25%, 50% or 100% (and 75% in 5.7) only. For example, even though InnoDB compression could compress data from 16KB to 5KB (68.75% reduction), it actually uses 8KB, so compression efficiency deteriorates from 68.75% to 50%.

Fig 1: Space Amplification in InnoDB

On the other hand, RocksDB is LSM database, and is very well designed for reducing space.

RocksDB compresses by page, which is same as InnoDB, but after compressed page is not aligned to OS sector. If the page could be compressed to 30%, it exactly uses 30% space, not aligned to 50%.

RocksDB stores data into "SST" (Sorted String Table) files. RocksDB has block (can be configured by rocksdb_block_size, 4KB by default), but block is not aligned to OS sector. SST file is aligned to OS sector size, but SST size is much larger than page size (2MB by default), so alignment overhead is negligible.

As a side effect, single block read I/O request may end up reading two consecutive blocks. If you carefully look at iostat r/s and rMB/s, you'll notice I/O unit size is slightly bigger than block size. Hopefully this won't cause performance problem on both HDD and Flash.

Fig 2: Compression in MyRocks

In addition to negligible alignment overhead, RocksDB implements two major features to reduce space.

Prefix Key Encoding

Index entries are sorted in RocksDB (like many other databases). When using multi-column indexes, very often first N bytes are the identical to previous index entry. Then RocksDB optimizes space by not storing duplicate prefix byte stream.

Prefix key encoding

Fig 3: Prefix key encoding in RocksDB

Zero-filling row metadata

Both InnoDB and MyRocks has some space overhead per row. InnoDB stores 6 byte transaction id and 7 byte rollback pointer per row. They’re stored in primary key but not in secondary indexes. They are not compressed, even if using compressed format. So the total overhead is 13 bytes per row, regardless of compression settings.

MyRocks stores 7 byte sequence id and 1 byte operation type per row. They’re stored on both primary and secondary indexes. If you have one primary key and two secondary keys, total overhead will be 24 bytes per row. This sounds huge, but in practice, space overhead is a lot smaller. RocksDB does special optimization to zero-fill sequence ids when writing SST files in the bottom-most level, which has 90% of data. Zero-filled sequence ids use very little space after compression.

Zero filling row metadata

Fig 4: Zero-filling row metadata

Write Amplification is much smaller than InnoDB

On pure flash, reducing write volume (write amplification) is important because flash burns out if writing too much data. Reducing write volume also helps to improve overall throughput on flash.

InnoDB adopts "update in place" architecture. Even though updating just 1 record, an entire page where the row belongs becomes dirty, and the dirty page has to be written back to storage.

On typical OLTP systems, modification unit (row) size is much smaller than I/O unit (page) size. This makes write amplification very high.

Fig 5: Write Amplification in InnoDB

On the other hand, RocksDB (LSM) database uses "Append Only" model. Edits are written to WAL (Write Ahead Log), then periodically merged with SST files.

Write amplification in RocksDB Fig 6: Write Amplification in MyRocks