
BACKGROUND

Nim is a systems programming language. Part of systems culture is efficiency, which often entails saving answers rather than re-computing them. A simple key-value store is often adequate for that. Consequently, some variation of a file-based KV store is already present in 5 of my current repos. To answer the why-so-many question this begs, here's a quick list:

Module        Notes
ndup/setFile  small, RH, NoSatellite
suggest       ~tightly integrated
adix/lptabz   RH, save, load, mmapRO
pack          strict append only
thes          mostly pedagogical

Of these, only lptabz supports deletes, but not against "live files". All of the above cases are just "embedded use" within a wider package. The DBM family also has many members.

String orientation makes FTab "weakly typed" in the prog.lang. sense. This is typical in the key-value/blob store space. lptabz above comes closest to an escape, but really only works on files with value types. Other approaches for strongly typed data are at https://github.com/c-blake/nio and https://github.com/Vindaar/nimhdf5 and probably elsewhere.

BASICS

So, to fill a gap in some Nim use cases, enter this 6th variant - FTab, a file table. The goal here is the simplest possible, reasonably safe and efficient, native Nim persistent KV store that supports deletes & space limits with unsorted string keys & values of fixed (total) size.

This is implemented with 2 files - a ground truth data file & an index. { FTab is most similar to (and began life as a hard fork of) the indexed log Pack. } With no deletion, a data file is much like a log - back-to-back (kLen, key, vLen, val) records after a tiny header. The index is an open-addressed hash array with high-latency-friendly linear probing. Keys can optionally be embedded (with their own bound) in the index for efficient iteration (with the trade-off that a bigger index adds more chance of 2-disk-latency lookups).
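To make the data file layout concrete, here is a minimal record-walk sketch. It is illustrative only, not the real on-disk format: the 64-byte header and 4-byte native-endian length prefixes are assumptions, and any deleted-record bookkeeping the real file may carry is ignored.

```nim
import std/memfiles

proc walkRecords(path: string; hdrSize = 64) =
  var mf = memfiles.open(path)              # read-only memory map
  defer: mf.close
  let base = cast[int](mf.mem)
  var off = hdrSize                         # skip the (assumed) tiny header
  while off + 8 <= mf.size:                 # room left for kLen & vLen fields
    let kLen = cast[ptr uint32](base + off)[].int
    var key = newString(kLen)
    if kLen > 0: copyMem(key[0].addr, cast[pointer](base + off + 4), kLen)
    let vLen = cast[ptr uint32](base + off + 4 + kLen)[].int
    # value bytes live at base + off + 8 + kLen, vLen of them
    echo "key=", key, " vLen=", vLen
    off += 8 + kLen + vLen                  # next (kLen, key, vLen, val) record
```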

The allocator is kept simple here with a constraint of fixed (K+V total) size records (in both over-time & over-objects senses) and single-writer access. Such constraints can be relaxed, perhaps suboptimally, on top/outside of ftab. { E.g., for diverse value sizes, users can set up many power-of-2 FTabs, naming keys to "route" lookups. For dynamic value sizes, users can del old & put new. A server front-end/locking can mediate concurrent access at some expense. } One reason I called it simply FTab is so a more full featured wrapper could be FileTab.
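A minimal sketch of that "route by size" idea follows; nothing here is the actual ftab API - the size classes and file naming scheme are made up for illustration.

```nim
# Keep one fixed-record-size FTab per power-of-2 size class and pick the
# class from the total key+value size.
proc sizeClass(kvBytes: int): int =
  ## Smallest power of 2 that holds kvBytes, e.g. 97 -> 128.
  result = 1
  while result < kvBytes: result = result shl 1

proc tablePath(dir: string; kvBytes: int): string =
  dir & "/ftab" & $sizeClass(kvBytes)

when isMainModule:
  assert sizeClass(20 + 77) == 128          # 20B key + 77B value -> 128B class
  echo tablePath("/tmp/kv", 20 + 77)        # /tmp/kv/ftab128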

DETAILS

Space Limits

A space checking calculation might|might not include the transient space for building a new index. It is tricky to say which is best - limiting enduring space use or enforcing an electric fence never to cross. Multiple readers can also keep old index space live if they do not refresh. Only the OS may know if such space is live; only client code knows if there are multiple readers. In light of all this, for now transient space is not included in the limit calculation. Things should work ok if you expand the space limit from open to open. Shrinking it is likely bad, but may work if you do not shrink below how big the files have already grown.
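As a concrete illustration of that policy (the numbers, names, and growth step are made up; this is not ftab code):

```nim
# Enduring space (data file + current index) is charged against the limit;
# the transient copy made while a new index is built is not.
proc fitsLimit(dataBytes, indexBytes, limitBytes: int): bool =
  dataBytes + indexBytes <= limitBytes

when isMainModule:
  # Doubling a 32 MiB index briefly needs ~96 MiB of index space on disk,
  # but only the enduring 64 MiB result is counted, not both copies.
  assert fitsLimit(1 shl 30, 64 shl 20, (1 shl 30) + (96 shl 20))
```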

Need To Flush

This code uses memory mapped writes. Stable storage can drain dirty pages more slowly than RAM can absorb them. So, client code adding data quickly (on the relevant scale) should periodically wait synchronously for a flush to finish. This gives the OS a chance to un-dirty enough memory pages. This is a necessary byproduct of the OS taking ownership of flushing dirty memory pages to disk, which adds a lot of resilience to process crashes (but not OS crashes).
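A sketch of that discipline using std/memfiles and its synchronous flush call; the byte-poking stands in for real record additions and the batch size is arbitrary.

```nim
import std/memfiles

proc bulkLoad(path: string; n: int; flushEvery = 1 shl 20) =
  var mf = memfiles.open(path, fmReadWrite, newFileSize = n)
  defer: mf.close
  let p = cast[ptr UncheckedArray[byte]](mf.mem)
  for i in 0 ..< n:
    p[i] = byte(i and 0xFF)                 # stand-in for adding a record
    if (i + 1) mod flushEvery == 0:
      mf.flush()                            # wait so the OS can un-dirty pages
```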

Concurrent Safety

On its own, this is one-writer/multi-reader safe only in limited ways. New tables are always renamed into place atomically and there is a serial number to make parallel reader refresh calls fast (to get a new view on the index and data file as it grows). It is still possible for the writer to delete & replace a record with new data after a reader-refresh, but before a reader finishes access.

Another whole-file serial number, saved by a reader before lookup and verified unchanged after copying out the data, could suffice for point queries. Even with it, a reader iterating over all records just on the data file could still see partial updates and only know from a changed serial that "something changed somewhere during a very long scan". Adding per key/per record checksums to the end of each record would suffice both for scans and point queries (future work).
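In seqlock terms, that point-query scheme would look roughly like this; readSerial and copyRecord are hypothetical stand-ins, not ftab procs.

```nim
# Retry the copy whenever the whole-file serial changed underneath it.
proc readSerial(): uint64 = 0'u64           # stand-in: would read the on-file serial
proc copyRecord(key: string): string = ""   # stand-in: would copy the value bytes out

proc stableGet(key: string): string =
  while true:
    let before = readSerial()
    result = copyRecord(key)                # may race with the single writer
    if readSerial() == before:              # serial unchanged => copy was consistent
      return
    # otherwise the writer updated something mid-copy; just retry
```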

Crash Recovery

Crash recovery is also future work, related to adding checksums. Recovery can consist of rebuilding the index, excluding all mismatching records. For right now, it is likely best not to think of it as very crash recoverable.
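A sketch of that rebuild pass, under the future-work assumption of a per-record checksum; the record source and the "index" here are plain in-memory stand-ins, not the on-disk structures.

```nim
import std/[tables, hashes]

type RawRec = tuple[off: int; key, val: string; stored: Hash]

proc rebuildIndex(recs: seq[RawRec]): Table[string, int] =
  for r in recs:
    if hash(r.key & r.val) == r.stored:     # checksum verifies: keep the record
      result[r.key] = r.off
    # mismatching (e.g. torn-by-crash) records are simply excluded
```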

IO Tuning

The index is very small at present, making point queries more|less 1 major fault in any likely worst case. The trade-off is that if keys are much smaller than a 4096B VM page, an FTab.keys iteration does 4096B/keySize times more IO than it must. This can be quite a lot more. For example, for 1 TiB of data in 64 KiB records with 32B keys, the IO is 16 Mi records * 4096 B = 64 GiB instead of only 0.5 GiB. So, besides checksums for both crash recovery and concurrent safety, another item of future work is optional direct embedding of nicely bounded whole keys (and their checksums) in the index file. This needs to be a create-time option since users know best what workload of key scans vs. point queries may happen. In fact, it probably makes sense to do this before checksums.
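Spelled out, just re-deriving the numbers above:

```nim
const
  dataBytes = 1 shl 40                  # 1 TiB of data
  recBytes  = 64 shl 10                 # 64 KiB records
  pageBytes = 4096                      # VM page
  keyBytes  = 32
  nRecs     = dataBytes div recBytes    # 16 Mi records
  scanIO    = nRecs * pageBytes         # 1 page fault per key => 64 GiB
  keyIO     = nRecs * keyBytes          # key bytes alone => 0.5 GiB
static: doAssert scanIO == 64 shl 30 and keyIO == 512 shl 20
```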
