
Data storage

Enrico Olivelli edited this page Dec 31, 2018 · 7 revisions

From Key-Value to Tables

HerdDB's internal architecture handles a table as a set of key/value pairs; in Java terms it is a very large Map<byte[],byte[]>. Usually, however, you access HerdDB through SQL syntax and the concept of a Table. The system maps each row from the columnar format (in Java, something like a Map<String,Object>) to the key/value format (byte[], byte[]), tearing apart the 'primary key' part (one or more columns) from the 'value' part (the remaining columns).
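This mapping can be sketched as follows. The class and helper names here are hypothetical, chosen for illustration only, and the per-column serialization is simplified to plain UTF-8 strings; it is not HerdDB's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and helpers are hypothetical, not HerdDB's API.
public class RowMapping {

    // Split a columnar row into (key, value) by tearing out the primary key columns.
    static byte[][] toKeyValue(Map<String, Object> row, List<String> primaryKey) throws IOException {
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            byte[] serialized = e.getValue().toString().getBytes(StandardCharsets.UTF_8);
            if (primaryKey.contains(e.getKey())) {
                key.write(serialized);   // key = concatenation of PK column values
            } else {
                value.write(serialized); // value = the remaining columns
            }
        }
        return new byte[][] { key.toByteArray(), value.toByteArray() };
    }

    public static void main(String[] args) throws IOException {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", "42");
        row.put("name", "alice");
        byte[][] kv = toKeyValue(row, List.of("id"));
        System.out.println(new String(kv[0], StandardCharsets.UTF_8)); // 42
        System.out.println(new String(kv[1], StandardCharsets.UTF_8)); // alice
    }
}
```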

HerdDB keeps local SoftReferences to the deserialized version of rows as a cache, in order to speed up subsequent queries on the same data.

Transaction log

The core of HerdDB is the transaction log, which is built upon Apache BookKeeper when replication is enabled. Every mutation on data is written to the log and then applied to the in-memory version of the row. Replicas continuously read the log; this way each replica applies the same changes to its data and stays in sync with the current leader. There is no limit to the number of replicas. A "sequence number" is assigned to each write to the log, and this is the key used to define snapshots, which can then be transferred consistently over the network.
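The role of the sequence number can be illustrated with a toy replay loop. The LogEntry and SequenceNumber types here are illustrative assumptions, not HerdDB's real classes; the (ledgerId, offset) pair is chosen to mirror the fields that appear in the checkpoint metadata described later on this page.

```java
import java.util.List;

// Hypothetical sketch of log replay; not HerdDB's actual classes.
public class LogReplay {

    record SequenceNumber(long ledgerId, long offset) implements Comparable<SequenceNumber> {
        public int compareTo(SequenceNumber o) {
            int c = Long.compare(ledgerId, o.ledgerId);
            return c != 0 ? c : Long.compare(offset, o.offset);
        }
    }

    record LogEntry(SequenceNumber seq, String mutation) {}

    // A replica applies every entry strictly after its last applied sequence number,
    // so replaying the same log always yields the same state on every replica.
    static int applyAfter(List<LogEntry> log, SequenceNumber lastApplied) {
        int applied = 0;
        for (LogEntry e : log) {
            if (e.seq().compareTo(lastApplied) > 0) {
                applied++; // here a real replica would apply the mutation to the row
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
            new LogEntry(new SequenceNumber(1, 0), "INSERT k1"),
            new LogEntry(new SequenceNumber(1, 1), "UPDATE k1"),
            new LogEntry(new SequenceNumber(2, 0), "DELETE k1"));
        // A snapshot taken at (1,1) needs only the entries after that point.
        System.out.println(applyAfter(log, new SequenceNumber(1, 1))); // 1
    }
}
```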

Data pages and snapshots

At any given time data is stored partly in a memory buffer and partly on disk. No data is lost if the JVM crashes and the buffer is wiped out, because the log is the source of truth and the database can be recovered from it together with a snapshot.

When a row is stored on disk it is assigned to a "data page"; at the first mutation the record is detached from its data page and the page is marked as 'dirty'. At checkpoint time all the 'dirty pages' are dismissed, and every surviving record on them is used to build new pages, together with new or updated records. Records modified/inserted/deleted within the scope of a transaction are never written to disk and are not present in the main buffer; this way both the disk and main memory always hold a consistent, committed snapshot of the data. Each transaction uses a local temporary buffer to store its data.
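A minimal sketch of the dirty-page rule, with all names hypothetical: the first mutation detaches the record from its page and marks the page dirty, while the untouched records stay on the old page until the next checkpoint rebuilds it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the dirty-page rule; names are hypothetical.
public class DirtyPages {

    final Map<String, Long> recordToPage = new HashMap<>(); // record key -> pageId
    final Set<Long> dirtyPages = new HashSet<>();

    // First mutation of a record detaches it from its page and marks that page dirty.
    void mutate(String key) {
        Long pageId = recordToPage.remove(key);
        if (pageId != null) {
            dirtyPages.add(pageId);
        }
    }

    public static void main(String[] args) {
        DirtyPages dp = new DirtyPages();
        dp.recordToPage.put("k1", 7L);
        dp.recordToPage.put("k2", 7L);
        dp.mutate("k1");
        // Page 7 is now dirty; "k2" still lives on it and will be rewritten
        // to a fresh page at the next checkpoint.
        System.out.println(dp.dirtyPages);            // [7]
        System.out.println(dp.recordToPage.keySet()); // [k2]
    }
}
```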

During the checkpoint the list of the ids of the "active pages" is written to disk, together with the current log sequence number. Old pages are not deleted immediately, because they can be part of an older valid snapshot and may be needed to perform a full table scan at the "age" (log position) of that snapshot.

Local Tablespace Table Metadata file

For each Tablespace there is a file which keeps the actual schema of the tablespace at checkpoint time. The name of the file is tables.LEDGERID.OFFSET.tablesmetadata.

At each checkpoint this structure is written to the file:

  • tableSpaceName: string
  • ledgerId: long
  • offset: long
  • numTables: int
  • for each table: -- table: Table

Table schema serialized format

The schema of a table is serialized as follows:

  • tableSpaceName: string
  • tableName: string
  • auto_increment: byte (>0 auto_increment enabled)
  • maxSerialPosition: varint
  • numPrimaryKeyColumns: byte
  • for each primary key column: -- columnName: string
  • flags: varint
  • numberOfColumns: varint
  • for each column: -- column: Column metadata
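A toy writer for this layout might look as follows, assuming the string and varint encodings described under "Values serialization" below. This is an illustration of the field order only, not HerdDB's actual serializer, and the sample values are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Toy writer for the schema layout above; not HerdDB's actual serializer.
public class TableSchemaWriter {

    // Variable-length integer: 7 data bits per byte, high bit = "more bytes follow".
    static void writeVInt(DataOutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    // Strings: varint length of the UTF-8 encoding, then the raw bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, data.length);
        out.write(data);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeString(out, "default");   // tableSpaceName
        writeString(out, "customers"); // tableName
        out.writeByte(1);              // auto_increment enabled (>0)
        writeVInt(out, 2);             // maxSerialPosition
        out.writeByte(1);              // numPrimaryKeyColumns
        writeString(out, "id");        // primary key column name
        writeVInt(out, 0);             // flags
        writeVInt(out, 2);             // numberOfColumns
        // ... followed by one Column metadata block per column
        System.out.println(buffer.size()); // 26
    }
}
```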

Column metadata serialized format

Metadata about a column is stored as follows:

  • columnName: string
  • columnType: varint
  • serialPosition: varint
  • flags: varint

Data page format

Each data page on disk is stored as follows:

  • flags: varint (fixed value '0')
  • the number of records: integer
  • the list of key/values: -- key: byte array -- value: byte array
  • md5checksum: byte array
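The page layout can be sketched with a toy writer like the one below. Whether the md5checksum covers the entire preceding body is an assumption of this sketch, and the helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy writer for the data page layout above; not HerdDB's actual code.
public class DataPageWriter {

    static void writeVInt(DataOutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    // Byte arrays: varint length, then the raw bytes.
    static void writeArray(DataOutputStream out, byte[] data) throws IOException {
        writeVInt(out, data.length);
        out.write(data);
    }

    static byte[] writePage(Map<byte[], byte[]> records) throws IOException, NoSuchAlgorithmException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(body);
        writeVInt(out, 0);            // flags, fixed value '0'
        out.writeInt(records.size()); // number of records
        for (Map.Entry<byte[], byte[]> e : records.entrySet()) {
            writeArray(out, e.getKey());
            writeArray(out, e.getValue());
        }
        // Assumption of this sketch: the checksum covers everything written so far.
        byte[] checksum = MessageDigest.getInstance("MD5").digest(body.toByteArray());
        writeArray(out, checksum);    // md5checksum trailer
        return body.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Map<byte[], byte[]> records = new LinkedHashMap<>();
        records.put("k1".getBytes(StandardCharsets.UTF_8), "v1".getBytes(StandardCharsets.UTF_8));
        System.out.println(writePage(records).length); // 28
    }
}
```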

Table checkpoints file

At each checkpoint the "checkpoints" file for each table is updated by appending the current status of the table. For each checkpoint we have this sequence:

  • marker: int (TABLE_STATUS_MARKER, fixed magic value '1233')
  • tableName: string
  • ledgerId: long
  • offset: long
  • nextPage: long
  • nextPrimaryKeyValue: byte array (contains the serialized version of the next generated primary key value, format depends on the column type)
  • numberOfActivePages: varint
  • for each page: -- pageId: varlong

Values serialization

These are the common rules used for serialization:

  • strings are encoded in UTF-8 and are stored by writing the length (as a varint) of the encoded string before the actual byte[] data
  • byte arrays are encoded by writing the raw data, prepended by a varint which represents the length of the array
  • integers are always signed and take 4 bytes
  • longs are always signed and take 8 bytes
  • timestamps are encoded by storing the number of millis since the epoch, as a long, and so take 8 bytes
  • varints are encoded using the usual variable-length integer format
  • varlongs are encoded using the usual variable-length integer format, extended to support 64-bit longs
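For illustration, a varlong encoder/decoder in the usual 7-bits-per-byte, little-endian-group format might look like this. It is a sketch of the general technique and is not claimed to be byte-compatible with HerdDB's files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Sketch of the usual variable-length integer format: 7 data bits per byte,
// high bit set on every byte except the last.
public class VarInt {

    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // more bytes follow
            v >>>= 7;
        }
        out.write((int) v);                       // final byte, high bit clear
    }

    static long readVLong(ByteArrayInputStream in) {
        long result = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, 300);                // fits in two bytes instead of eight
        byte[] encoded = out.toByteArray();
        System.out.println(encoded.length);  // 2
        ByteArrayInputStream in = new ByteArrayInputStream(encoded);
        System.out.println(readVLong(in));   // 300
    }
}
```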

Record serialization

For each logical record, that is a Map<String,Object>, the corresponding (byte[], byte[]) representation is:

  • key: the concatenation of the serialized values of the primary key columns
  • value: the sequence of values which are not in the primary key

For each Column in the table we keep an internal ID (serialPosition), which is assigned when the column is added to the table. These ids are never recycled, even in case of a drop/add of the same column "name".

For each field in the value:

  • a varint which represents the ID of the column (see the serialPosition field of the Column definition)
  • the serialized value for the column


  • null values in the primary key are not allowed
  • null values in the value part are simply not written
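Putting these rules together, a toy serializer for the value part could look like the following. The names are hypothetical and, for brevity, only string columns are handled.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the value layout: (serialPosition varint, serialized value)
// pairs, with null columns simply skipped. Not HerdDB's actual code.
public class RecordValueWriter {

    static void writeVInt(DataOutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    // columnIds maps column name -> serialPosition.
    static byte[] serializeValue(Map<String, String> row, Map<String, Integer> columnIds)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (e.getValue() == null) {
                continue; // null values are not written at all
            }
            writeVInt(out, columnIds.get(e.getKey())); // column ID (serialPosition)
            byte[] data = e.getValue().getBytes(StandardCharsets.UTF_8);
            writeVInt(out, data.length);               // string: varint length + UTF-8
            out.write(data);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> columnIds = Map.of("name", 1, "city", 2);
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "alice");
        row.put("city", null); // skipped: a reader treats the missing ID as null
        System.out.println(serializeValue(row, columnIds).length); // 7
    }
}
```

Because each field is prefixed with its column ID, a reader can decode rows written under an older schema: unknown IDs can be skipped and absent IDs read as null, which is consistent with serialPosition values never being recycled.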

Durability and storage options

Since 0.8.0 you can configure the system in order to tune durability guarantees and how data is written to the OS.

If you are using local storage mode (without BookKeeper) you can configure the txlog with these options:

  • acknowledge writes only after fdatasync (default, this is what you expect from a database)
  • do not perform fdatasync on txlog (okay for tests and for bulk loads)
  • perform fdatasync periodically (is this really useful? Sometimes)

You can also decide to use Linux O_DIRECT mode to write to disk; in some cases this leads to better performance (even 100x on Linux 4.x kernels), but sometimes things get worse. You have to try and benchmark your system.

If you are using BookKeeper, writes are always guaranteed to be persisted durably; there is no way to skip the fsync.

Regarding data page and index page writes you can choose to use O_DIRECT mode or normal fdatasync mode.

Please refer to the default configuration file for the list of all knobs.
