
MyRocks data dictionary format

Alex Yang edited this page Mar 17, 2017 · 7 revisions

MyRocks keeps internal metadata, such as the mapping between index ids and column family ids, in what we call the data dictionary. All data dictionary entries are stored in a dedicated RocksDB column family named __system__, which we call the System Column Family (System CF). The System CF is separate from the column families used by applications. For debugging purposes, MyRocks provides information_schema tables that print data dictionary entries.

Here are some concepts that help in understanding the MyRocks data dictionary.

  • Column Family ID: The ID of a column family in RocksDB. Each MyRocks index belongs to exactly one column family, and multiple indexes can share a column family, so there is a 1:N mapping between column families and indexes. The column family name can be specified in the index COMMENT clause.
  • Index ID: An internal, auto-generated id inside MyRocks. A new index id is assigned whenever a new index is created. Index ids are incremented sequentially and never reused across different indexes, which means you cannot create more than 2^32 indexes in total within the same MyRocks instance.
  • Global Index ID: Column Family ID + Index ID.
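To make the "Column Family ID + Index ID" composition concrete, here is a minimal sketch of how an 8-byte global index id could be built and split. The helper names are hypothetical, and the layout (two unsigned 32-bit integers, big-endian, CF id first) is an assumption; this page only states that the global index id is 8 bytes and that index ids fit in 32 bits.

```python
import struct


def pack_global_index_id(cf_id, index_id):
    """Concatenate a 4-byte CF id and a 4-byte index id (8 bytes total)."""
    # Assumption: both halves are unsigned, big-endian 32-bit integers.
    return struct.pack(">II", cf_id, index_id)


def unpack_global_index_id(raw):
    """Split an 8-byte global index id back into (cf_id, index_id)."""
    return struct.unpack(">II", raw)


gid = pack_global_index_id(cf_id=2, index_id=260)
print(gid.hex())  # 0000000200000104
```

Note that "+" here means concatenation of the two ids, not arithmetic addition.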

MyRocks Data Dictionary Models

  1. Table Name => internal index id mappings

key: Rdb_key_def::DDL_ENTRY_INDEX_START_NUMBER(0x1) + dbname.tablename

value: version + {global_index_id}*n_indexes_of_the_table

This entry is updated whenever an index definition changes, that is, when a table or index is added or dropped. Version is the internal data dictionary version (currently hard-coded as 0x1) and takes 2 bytes. The dictionary id (the numeric prefix of the key, 0x1 here) takes 4 bytes, and each global index id takes 8 bytes.
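The key and value layout for a table entry can be sketched as follows. This is an illustration only: the function names are hypothetical, and big-endian byte order and UTF-8 encoding of dbname.tablename are assumptions not stated on this page. The field widths (4-byte dictionary id, 2-byte version, 8-byte global index ids) come from the description above.

```python
import struct

DDL_ENTRY_INDEX_START_NUMBER = 0x1
DDL_ENTRY_VERSION = 0x1  # "currently hard coded as 0x1"


def encode_table_entry(db, table, global_index_ids):
    """Build (key, value) for one table; ids are (cf_id, index_id) pairs."""
    key = struct.pack(">I", DDL_ENTRY_INDEX_START_NUMBER)
    key += f"{db}.{table}".encode("utf-8")  # assumption: UTF-8 name bytes
    value = struct.pack(">H", DDL_ENTRY_VERSION)
    for cf_id, index_id in global_index_ids:
        value += struct.pack(">II", cf_id, index_id)  # 8 bytes per index
    return key, value


def decode_table_value(value):
    """Return (version, [(cf_id, index_id), ...]) from an encoded value."""
    (version,) = struct.unpack_from(">H", value, 0)
    ids = [struct.unpack_from(">II", value, off)
           for off in range(2, len(value), 8)]
    return version, ids


key, value = encode_table_entry("test", "t1", [(0, 256), (0, 257)])
print(len(value))  # 18: 2-byte version + two 8-byte global index ids
```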

  2. Index information

key: Rdb_key_def::INDEX_INFO(0x2) + global_index_id

value: version, index_type, key_value_format_version

A row is inserted when a new index is created, and the matching row is removed when the index is dropped. index_type is 1 byte and is currently used to distinguish the primary key from secondary keys. key_value_format_version is 2 bytes; it is increased whenever the key/value format changes, which makes it easier to keep compatibility.
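As a rough sketch of this record, the value can be modeled as a fixed 5-byte structure. The function names and the numeric index_type values are hypothetical; the 2-byte version width is assumed to match the table entry above, and big-endian byte order is also an assumption.

```python
import struct

INDEX_INFO = 0x2
# version (2 bytes), index_type (1 byte), key_value_format_version (2 bytes)
INDEX_INFO_VALUE = struct.Struct(">HBH")

# Hypothetical values used only for this illustration.
INDEX_TYPE_PRIMARY = 1
INDEX_TYPE_SECONDARY = 2


def encode_index_info(cf_id, index_id, version, index_type, kv_version):
    """Build (key, value) for one index information row."""
    key = struct.pack(">III", INDEX_INFO, cf_id, index_id)
    value = INDEX_INFO_VALUE.pack(version, index_type, kv_version)
    return key, value


def decode_index_info(value):
    """Return (version, index_type, key_value_format_version)."""
    return INDEX_INFO_VALUE.unpack(value)


key, value = encode_index_info(0, 256, 1, INDEX_TYPE_PRIMARY, 11)
print(len(key), len(value))  # 12 5
```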

  3. CF id => CF flags

key: Rdb_key_def::CF_DEFINITION(0x3) + cf_id

value: version, {is_reverse_cf, is_auto_cf, is_per_partition_cf}

A row is inserted when a new column family is created, and the matching row is removed when the column family is dropped. cf_flags is 4 bytes in total; currently only three bits are used.
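The three flags fit into a 4-byte bit field, which could be encoded and decoded along these lines. The bit positions below are an assumption made for illustration (this page names the flags but not their bit values), as are the helper names, the 2-byte version width, and big-endian byte order.

```python
import struct

CF_DEFINITION = 0x3

# Assumed bit positions; only the flag names come from the text above.
IS_REVERSE_CF = 1 << 0
IS_AUTO_CF = 1 << 1
IS_PER_PARTITION_CF = 1 << 2


def encode_cf_definition(cf_id, version, flags):
    """Build (key, value): 4-byte prefix + cf_id, 2-byte version + 4-byte flags."""
    key = struct.pack(">II", CF_DEFINITION, cf_id)
    value = struct.pack(">HI", version, flags)
    return key, value


def decode_cf_flags(value):
    """Expand the packed flags into named booleans."""
    version, flags = struct.unpack(">HI", value)
    return {
        "version": version,
        "is_reverse_cf": bool(flags & IS_REVERSE_CF),
        "is_auto_cf": bool(flags & IS_AUTO_CF),
        "is_per_partition_cf": bool(flags & IS_PER_PARTITION_CF),
    }


_, value = encode_cf_definition(3, 1, IS_REVERSE_CF | IS_PER_PARTITION_CF)
print(decode_cf_flags(value))
```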

  4. Binlog entry (updated at commit)

key: Rdb_key_def::BINLOG_INFO_INDEX_NUMBER (0x4)

value: version, {binlog_name,binlog_pos,binlog_gtid}

There is at most one record for this entry, and it is updated at transaction commit (binlog commit). If the binary log is disabled, the entry is not updated. Binlog name and binlog gtid are each prefixed with a 2-byte length and are not null-terminated. Binlog pos is 4 bytes.
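The length-prefixed encoding can be illustrated with a small round-trip sketch. The field order follows the order listed in the value above; the function names, the 2-byte version width, big-endian byte order, and the sample binlog name and gtid are assumptions for illustration.

```python
import struct


def encode_binlog_info(version, name, pos, gtid):
    """Pack version, length-prefixed name, 4-byte pos, length-prefixed gtid."""
    name_b = name.encode("utf-8")
    gtid_b = gtid.encode("utf-8")
    return (struct.pack(">HH", version, len(name_b)) + name_b
            + struct.pack(">I", pos)
            + struct.pack(">H", len(gtid_b)) + gtid_b)


def decode_binlog_info(value):
    """Inverse of encode_binlog_info; strings carry no null terminator."""
    version, name_len = struct.unpack_from(">HH", value, 0)
    off = 4
    name = value[off:off + name_len].decode("utf-8")
    off += name_len
    (pos,) = struct.unpack_from(">I", value, off)
    off += 4
    (gtid_len,) = struct.unpack_from(">H", value, off)
    off += 2
    gtid = value[off:off + gtid_len].decode("utf-8")
    return version, name, pos, gtid


raw = encode_binlog_info(1, "binlog.000003", 1234, "uuid:1-5")
print(decode_binlog_info(raw))  # (1, 'binlog.000003', 1234, 'uuid:1-5')
```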

  5. Ongoing drop index entry

key: Rdb_key_def::DDL_DROP_INDEX_ONGOING(0x5) + global_index_id

value: version

This data dictionary entry was introduced to support the "Fast drop/truncate table" feature in MyRocks. When a table (and its indexes) is dropped, MyRocks adds the target indexes to this data dictionary, so the client gets a reply very quickly without waiting for the drop to complete. In the background, MyRocks schedules a compaction filter and periodically checks the rows; once all rows associated with an index have been removed, it deletes the index id from this data dictionary.

  6. Index Statistics

key: Rdb_key_def::INDEX_STATISTICS(0x6) + global_index_id

value: version, {materialized PropertiesCollector::IndexStats}

This data dictionary entry is added, updated, or deleted whenever the index statistics change.

  7. Current maximum index id

key: Rdb_key_def::CURRENT_MAX_INDEX_ID(0x7)

value: version, current max index id

This data dictionary entry is updated whenever a new index is created.
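A minimal sketch of how a sequential, never-reused index id could be allocated from this entry is shown below. A plain dict stands in for the System CF, and the function name, the starting value of 0, the 2-byte version, and the 4-byte big-endian id are all assumptions for illustration rather than the actual MyRocks implementation.

```python
import struct

CURRENT_MAX_INDEX_ID_KEY = struct.pack(">I", 0x7)
MAX_INDEX_ID_VALUE = struct.Struct(">HI")  # version, current max index id


def allocate_index_id(store, version=1):
    """Read the current maximum, increment it, and persist the new maximum."""
    raw = store.get(CURRENT_MAX_INDEX_ID_KEY)
    current = MAX_INDEX_ID_VALUE.unpack(raw)[1] if raw is not None else 0
    new_id = current + 1
    if new_id > 0xFFFFFFFF:
        raise OverflowError("index id space (2^32) exhausted")
    store[CURRENT_MAX_INDEX_ID_KEY] = MAX_INDEX_ID_VALUE.pack(version, new_id)
    return new_id


system_cf = {}  # stand-in for the System CF
first = allocate_index_id(system_cf)
second = allocate_index_id(system_cf)
print(first, second)  # 1 2
```

Because the stored maximum only ever grows, an id released by a dropped index is never handed out again, which matches the "never reused" property described earlier.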

  8. Ongoing create index entry

key: Rdb_key_def::DDL_CREATE_INDEX_ONGOING(0x8) + global_index_id

value: version

This data dictionary entry was introduced to support "Fast secondary index creation" in MyRocks. The entry is written while an index is being created and removed once index creation is complete. Its primary use is during crash recovery: on startup, if any partially created index is found, its entries are removed from RocksDB.
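The startup recovery step can be sketched as a scan over the entries with the 0x8 prefix. This is a simplified model, not the MyRocks code: a dict stands in for the System CF, the function names are hypothetical, and drop_index_rows is a hypothetical callback representing whatever removes the index's rows from RocksDB.

```python
import struct

DDL_CREATE_INDEX_ONGOING = struct.pack(">I", 0x8)


def find_partial_indexes(store):
    """Return (cf_id, index_id) for every index whose creation never finished."""
    partial = []
    for key in sorted(store):
        # 4-byte prefix + 8-byte global index id (assumed big-endian)
        if len(key) == 12 and key.startswith(DDL_CREATE_INDEX_ONGOING):
            partial.append(struct.unpack_from(">II", key, 4))
    return partial


def recover(store, drop_index_rows):
    """Remove the rows of partially created indexes, then clear their markers."""
    cleaned = find_partial_indexes(store)
    for cf_id, index_id in cleaned:
        drop_index_rows(cf_id, index_id)  # hypothetical callback
        del store[DDL_CREATE_INDEX_ONGOING + struct.pack(">II", cf_id, index_id)]
    return cleaned
```

A caller would run recover once at startup, before serving queries, so that no half-built index is ever visible.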


Data dictionary operations are atomic inside RocksDB. For example, creating a table with two indexes requires three Put calls, and MyRocks issues them in a single WriteBatch.
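The three-Put example can be modeled with a toy batch, shown here as a sketch rather than the real RocksDB API. WriteBatch and ToyDB below are simplified stand-ins written for this illustration, and the encodings reuse the assumed big-endian layouts: one Put for the table entry (0x1) and one Put per index for the index information (0x2).

```python
import struct


class WriteBatch:
    """Toy stand-in for a RocksDB write batch: collects Puts, applies none."""

    def __init__(self):
        self.ops = []

    def put(self, key, value):
        self.ops.append((key, value))


class ToyDB:
    """Toy key-value store that applies a batch all at once."""

    def __init__(self):
        self.data = {}

    def write(self, batch):
        staged = dict(self.data)
        for key, value in batch.ops:
            staged[key] = value
        self.data = staged  # readers see either none or all of the batch


def create_table(db, full_name, index_ids):
    """Queue one table entry plus one index info row per index, then commit."""
    batch = WriteBatch()
    table_value = struct.pack(">H", 1) + b"".join(
        struct.pack(">II", cf, idx) for cf, idx in index_ids)
    batch.put(struct.pack(">I", 0x1) + full_name.encode("utf-8"), table_value)
    for cf, idx in index_ids:
        batch.put(struct.pack(">III", 0x2, cf, idx), struct.pack(">HBH", 1, 1, 1))
    db.write(batch)
    return len(batch.ops)


db = ToyDB()
print(create_table(db, "test.t1", [(0, 256), (0, 257)]))  # 3
```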
