
Q: What durability guarantees does TuplDB offer?

TuplDB follows a copy-on-write design, with periodic checkpoints. This ensures that the database will not become corrupt due to an unexpected crash or power failure. All changes made up to the last checkpoint are guaranteed to be durable, as well as everything recovered from the redo log. The DurabilityMode controls how changes are written to the redo log.
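
For example, a stronger durability mode can be selected when the database is opened. This is only a minimal sketch; the file path is hypothetical:

    // All classes are in the org.cojen.tupl package.
    DatabaseConfig config = new DatabaseConfig()
        .baseFilePath("/var/lib/myapp/mydb")   // hypothetical location
        // SYNC forces each commit to the redo log before returning; weaker
        // modes trade durability of the most recent commits for speed.
        .durabilityMode(DurabilityMode.SYNC);

    Database db = Database.open(config);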

Q: How ACID are TuplDB transactions?

What is the maximum supported isolation level?

Isolation levels are selected with the LockMode enum. Note the lack of serializable or snapshot isolation.
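
For example, the lock mode can be changed per transaction. This sketch assumes that db is an open Database and ix is an Index:

    byte[] key = "some-key".getBytes();

    Transaction txn = db.newTransaction();
    try {
        // Weaken isolation: read locks are released as soon as the load completes.
        txn.lockMode(LockMode.READ_COMMITTED);
        byte[] value = ix.load(txn, key);
        txn.commit();
    } finally {
        txn.reset();
    }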

Are writes atomic and durable, i.e. what happens if the JVM is shut down during a commit (power loss, process killed, ...)? Will this corrupt the store?

Neither a JVM failure, an OS crash, nor a power failure can corrupt the store. If a weak DurabilityMode is used, then recent transactions might roll back, but atomicity isn't compromised.

Is TuplDB using pessimistic locking, or is there some MVCC algorithm in place?

TuplDB only supports pessimistic locking. It's possible to implement read-only snapshots without major changes, but fully optimistic transactions would be difficult. Some of the behavior of optimistic transactions can be approximated by setting the transaction lock timeout to zero, however.
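
For example, a zero timeout causes lock acquisition to fail immediately instead of blocking, which roughly substitutes for the non-blocking behavior of optimistic schemes. This sketch assumes an active transaction txn, an Index ix, and a key:

    // Fail fast when a lock is contended, instead of waiting for it.
    txn.lockTimeout(0, TimeUnit.MILLISECONDS);   // java.util.concurrent.TimeUnit

    try {
        byte[] value = ix.load(txn, key);
    } catch (LockFailureException e) {
        // Another transaction holds the lock; back off, retry, or abort.
    }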

Q: What are the maximum key and value sizes?

Keys can be as large as 2GiB, and values can be as large as 1EiB (with the default page size of 4096 bytes). Very large keys aren't recommended, since they're not optimized. Keys smaller than half the page size are optimized, which is about 2000 bytes when using the default page size. Applications which require large keys might benefit from a larger page size, which can be as large as 65536 bytes.
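
For example, a larger page size is chosen through the configuration, and it only applies when the database is first created. A minimal sketch:

    DatabaseConfig config = new DatabaseConfig()
        .pageSize(65536);   // default is 4096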

Q: Does TuplDB require that the entire store fit into RAM?

No. Pages are evicted as required, selecting the least recently used ones first. Always choose a large cache size for best performance.

Q: Is the initial startup time of a TuplDB instance dependent on the size of the database?

TuplDB startup time is influenced by redo log recovery time and the cache size. If a lot of changes were written to the redo log before a checkpoint was issued, then all those operations need to be replayed when the database is opened again. This cost can be reduced by enabling a cache primer, which pre-fetches pages before replaying the redo log. When a very large cache (>10GiB) is configured, it can take a few seconds for the pages to be allocated by the operating system itself. Installing an EventListener reveals how long each step takes.
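
For example, cache priming is enabled through the configuration; an EventListener can be registered through the configuration as well. A minimal sketch:

    DatabaseConfig config = new DatabaseConfig()
        // Capture the set of cached pages and pre-fetch them when reopening.
        .cachePriming(true);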

Q: How can I perform non-transactional writes?

Pass Transaction.BOGUS to any parameter that accepts a Transaction instance. Passing null is not the same — null specifies an auto-commit transaction, which generates redo log entries. Non-transactional writes can be safely performed without risking database corruption, but checkpoints are required for the writes to become durable. By default, checkpoints are performed automatically every 1 to 60 seconds.
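
For example, a sketch of a single non-transactional store followed by an explicit checkpoint, assuming db is an open Database and ix is an Index:

    byte[] key = "some-key".getBytes();
    byte[] value = "some-value".getBytes();

    // No redo log entries are generated for this store.
    ix.store(Transaction.BOGUS, key, value);

    // Make the change durable without waiting for an automatic checkpoint.
    db.checkpoint();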

Q: How can I perform bulk record insertion into an index?

Records should be ordered by key, and then the findNearby method should be used. The following sketch assumes the ordered records are supplied by an iterator of key/value pairs:

    // Sketch: db is an open Database, ix is the target Index, and records
    // iterates over the key/value pairs in key order.
    void bulkInsert(Database db, Index ix,
                    Iterator<Map.Entry<byte[], byte[]>> records)
        throws IOException
    {
        // For best performance, inserts are non-transactional.
        Cursor fill = ix.newCursor(Transaction.BOGUS);
        try {
            while (records.hasNext()) {
                Map.Entry<byte[], byte[]> record = records.next();
                fill.findNearby(record.getKey());
                fill.store(record.getValue());
            }
        } finally {
            fill.reset();
        }
        // Ensure that all non-transactional changes are committed.
        db.checkpoint();
    }

By using the findNearby method, the search cost to find the next ordered key is effectively nothing. If the records to insert are only mostly ordered, findNearby still works properly, but not optimally. For randomly ordered records, the regular find method should be used instead, but it will perform much more slowly than with ordered records.

Consider using the Sorter if the records aren't already ordered.

Q: Why does TuplDB not suffer from garbage collection issues?

Generational garbage collectors work best at collecting objects which have a very short lifetime. Objects which reach the old generation should be retained as long as possible, to avoid expensive full collections. When the old generation is collected, the cost can be reduced if the number of objects visited is minimized. TuplDB exploits this behavior by recycling objects as much as possible, all of which reside in the old generation. The design favors multipurpose objects over specialization, which makes recycling more effective.

TuplDB's cache is primarily backed by a set of Nodes pointing to off-heap memory, where each unit of memory is a fixed-size page. Nodes are recycled for use by b-trees, undo logs, free lists, and large values. Off-heap memory is allocated at startup by a single call to mmap (anonymous), although additional pages are allocated using malloc if the cache is configured to grow over time.

All the data structures for managing Nodes rely exclusively on long-lived objects. An "intrusive" design is used, in which the Nodes themselves participate in the collections. They manage their own hash collision chains and linked list references. The use of standard Java collections would cause internal entry objects to be created, which in turn causes more work for the garbage collector.
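
As a simplified illustration of the intrusive pattern (not TuplDB's actual Node class), the links live directly inside the node, so no wrapper entry objects are ever allocated:

    // Simplified sketch; field names are illustrative only.
    final class Node {
        // Hash collision chain: the node itself acts as the hash table entry.
        Node hashNext;

        // Doubly linked usage list for least-recently-used eviction.
        Node lessUsed;
        Node moreUsed;

        // Reference to the off-heap page that this node manages.
        long pageAddress;
    }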

Q: Can a very large cache still cause garbage collection issues?

Although TuplDB's design minimizes garbage collection issues, it cannot fully prevent them.

Tips for improving performance with very large caches:

When using the large pages option, ZGC requires special attention to ensure it works effectively.

Q: When does the cache size increase?

When the configured minimum cache size is less than the maximum size, the cache will grow only to prevent eviction of dirty pages. If only read operations are performed on the database, then the cache typically stays at the minimum size. Dirty pages accumulate as changes are made in between checkpoints, and so the cache will grow more with a high change rate. For non-durable databases, the cache always grows as the database grows.

When only the minimum cache size is configured, the maximum size is set to match. This is the preferred setting in most cases, because it offers more predictable behavior. Configuring only the maximum size is suitable for non-durable databases, or for running unit tests against temporary databases.
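
For example (the sizes are illustrative):

    // Preferred: only the minimum is set, so the maximum matches it and the
    // cache size is fixed.
    DatabaseConfig config = new DatabaseConfig()
        .minCacheSize(10_000_000_000L);

    // Alternatively, let the cache grow between a minimum and a maximum:
    // config.minCacheSize(1_000_000_000L).maxCacheSize(10_000_000_000L);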

Q: Why is write performance sometimes erratic?

Write performance can dip when automatic checkpoints run, which forcibly flush all dirty pages to the storage device. This can be verified by adding an EventListener. The bottleneck is often the file system, which might not be optimized for randomly ordered writes. A simple workaround is to stripe the data file, and this can significantly improve write concurrency.
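
For example, the data file can be split across several files, ideally placed on separate directories or devices. This is only a sketch, and the paths are hypothetical:

    DatabaseConfig config = new DatabaseConfig()
        .baseFilePath("/var/lib/myapp/mydb")
        .dataFiles(new File("/data0/mydb.stripe"),   // java.io.File
                   new File("/data1/mydb.stripe"),
                   new File("/data2/mydb.stripe"),
                   new File("/data3/mydb.stripe"));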

This technique is most effective on Windows/NTFS. Older versions of Linux/EXT4 also benefited from striping, but newer versions perform well without it. In some cases, striping can even hurt performance. For best throughput and performance consistency, consider using a block device directly and bypassing the file system.

Linux/XFS tends to perform more consistently than EXT4, although write stalls up to a minute were observed in one test. From dmesg: "possible memory allocation deadlock size 313808 in kmem_alloc (mode:0x2400240)". Throughput appears to be about 15% better when using EXT4. And BTRFS tends to perform quite poorly, with frequent stalls that last for several seconds. Striping the data file doesn't improve BTRFS performance.

Linux/F2FS appears to be the best file system to use, but only when the data file is striped. In one test, striping into eight separate files yielded a 2.2x throughput improvement over EXT4. Amazingly, F2FS outperformed a raw block device by about 1.2x. By preallocating space to match the block device size, writes appear more sequential to the SSD, and performance against a raw block device then matches that of F2FS.

Unfortunately, F2FS only tends to perform well on a freshly formatted file system. Sometimes writes to the database file hang, even when the file system is new. In long-haul tests, write performance plummets, down to a rate much lower than EXT4. EXT4 offers the best overall performance over a wide range of workloads, and it's likely the best choice in most circumstances.

EXT4 isn't perfect, and it might behave badly with NAND flash SSDs (the most common type). The EXT4 allocation strategy spreads writes around the entire drive, fully defeating the SSD's ability to pre-erase blocks for future writes. Frequent trimming might be essential. See also: Block allocation strategies of various filesystems