I Am Legend:
Items are sorted by priority (highest on top).
o a pending TODO item (for the current release)
. a pending TODO item (for future releases)
x a finished TODO item
-----------------------------------------------------------------------------
This Branch Is About Integrating The hamsterdb2 Functionality!!!!!
-----------------------------------------------------------------------------
The big headline is:
As a user I want to run many Transactions in parallel with high performance.
I'm using multiple threads b/c my CPU has multiple cores, and expect hamsterdb
to scale with the number of cores.
==============================================================================
high-level plan for 2.1.9 ..................................................
x new file system structure
x release sparsemap library w/ documentation and sample/tests
x update the webpage (free for non-profits, cheap for startups, add crc32)
x exception safe code
x lots of refactoring
o update file format (compress record IDs, UpfrontIndex, ...)
. delta updates
x PRO:
x fix bug as reported from Ole (mail from 26.)
x also fix segfault during recovery
x create new unittest from his sample
x PRO: use a bitmap index for recnos
- nah...
x but move the sparsemap library to github, as a separate project
x also write an accompanying blog post or documentation
x PRO: make the release
x PRO: webpage changes
x add CRC32 to pro documentation, feature matrix
x change pricing
-------------------------------------------------------------
x faq: add documentation about library size: strip helps
x refactoring: split files
x root0: add pragmas from btree_impl_default.h, undef min/max
x create a template for the header files
x document the source code structure
x for each file, perform the following changes:
x #34: header guards/macros should not clash with reserved identifiers
(no leading/trailing underscores)
x adapt to template.h, template.cc
x make sure that the layers are respected (do not include n+1!)
x classes without private or protected members should become structs
x 0root/root.h - very first header file (only one!)
x 1base/error.h, pack*, util, mutex, version, abi, pickle
x 1globals/
x 1errorinducer/
x 1mem/
x 1os/
x 1rb/
x 2page/
x 2device/
x 2protoserde/
x 2protobuf/
x 3changeset/
x 3cache/
x 3blob_manager/
x 3journal/
x 3page_manager/
x 3btree/btree*
x 4db/...
x 4txn/...
x 4env/...
x 4cursor/...
x 5server/...
x 5hamsterdb/hamsterdb.cc, hola.cc
x also cleanup unittests
x java still has license.java - remove!
x also check dotnet!
x refactoring: split unittests/cursor.cpp - it does not compile on smaller VMs
x refactoring: #37 - hamzilla.c/running should be sig_atomic_t
-> transform to c++ code
x can we add a generic macro like this:
#define ham_make_key(PTR, SIZE) { SIZE, PTR, 0, 0}
which can be used as an initializer:
ham_key_t key = ham_make_key("hello", 5);
x same for records
x use in samples
x document in header file
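    A minimal sketch of how both initializer macros could look (the key variant
    matches the existing ham_key_t field order; the record variant assumes the
    remaining ham_record_t fields can simply be zero-initialized):

        #define ham_make_key(PTR, SIZE)    { SIZE, PTR, 0, 0 }
        #define ham_make_record(PTR, SIZE) { SIZE, PTR, 0 }

        ham_key_t key = ham_make_key("hello", 5);
        ham_record_t record = ham_make_record("world", 5);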
x refactoring: get_linear_search_threshold returns size_t, but is
then used as (signed) int; some return 0, others -1
right now the code works but is cheesy. return (int)0 to disable.
x ham_bench/config.h: use iostream instead of printf to avoid compiler
warnings on 32bit linux
x review exception safety
http://exceptionsafecode.com/
- consistently use smart_pointers to manage allocated resources!
use boost::scoped_ptr (from <boost/scoped_ptr.hpp>)
- introduce on_scope_exit macro
can use BOOST_SCOPE_EXIT:
http://www.boost.org/doc/libs/1_56_0/libs/scope_exit/doc/html/scope_exit/tutorial.html
- consistently force use of RAII
- every destructor has to clean up but MUST NOT THROW!
- support std::swap
http://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Non-throwing_swap
- manage state like a resource (see the sketch after this list)
Each state is moved to a separate struct (ClassNameState). The class
itself can return the struct as "const", and it has (in best case)
a single method which performs write-access.
Try to make the State-struct exception safe, whenever members
are retrieved or overwritten.
- use "critical line" for strong guarantees and defer commits till
success is guaranteed
-> this relates to the use of std::swap
- every state has to be restored in case of an exception
- document every rewritten class with an annotation ("@exception_safe yes")
-> define rules
@exception_safe [yes|no|unknown|nothrow]
@thread_safe [no|unknown|atomic|blocking]
x document the rules (coding guidelines)
x document the annotations
x add annotations to *each* file
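    A minimal sketch of the "state as a resource" + non-throwing swap pattern
    from the list above (FileState/File are illustrative names, not actual
    hamsterdb classes):

        #include <string>
        #include <vector>

        struct FileState {                    // all mutable state lives here
          std::string filename;
          std::vector<char> buffer;
        };

        inline void swap(FileState &lhs, FileState &rhs) { // non-throwing swap
          lhs.filename.swap(rhs.filename);
          lhs.buffer.swap(rhs.buffer);
        }

        class File {
         public:
          // @exception_safe yes
          const FileState &state() const { return m_state; }

          // strong guarantee: prepare the new state on the side, then commit
          // after the "critical line" with a swap that cannot throw
          void set_buffer(const std::vector<char> &data) {
            FileState tmp(m_state);           // may throw; m_state untouched
            tmp.buffer = data;                // may throw; m_state untouched
            swap(m_state, tmp);               // ---- critical line: commit ----
          }

         private:
          FileState m_state;
        };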
x refactoring: split os.h in file.h and socket.h
and upgrade File to "strong" exception safety
x refactoring: upgrade "device" to strong exception safety
x refactoring: upgrade "page" to strong exception safety
x see TODOs in page.cc
x refactoring: improve layering
x improve layer separation of device (remove dependency to env)
x improve layer separation of page (remove dependency to env)
x can we move cache to layer 2 or 3?
x refactoring: clean up the close() code of the LocalEnvironment (and also
of the LocalDatabase). distinguish two cases:
- a close() called by ham_env_close/ham_db_close
performs logic that can fail; does not need to be exception safe
- a close() called in the destructor
do not perform any logic besides cleaning up; must NOT THROW!
actually there shouldn't be a need to do anything if scoped_ptr
is used consistently
x use smart_ptr for all members of the LocalDatabase
(m_txn_index, m_btree_index)
x then make sure that the tests run again
x ... and for the RemoteEnvironment (if required) - not required
x ... and for the LocalEnvironment
(m_header, m_blob_manager, m_page_manager, m_journal)
x see mail from Michael Möllney: journal files are not truncated if
temporary transactions are used
./ham_bench --use-transactions=1 --key=uint64 --distribution=ascending --recsize=10241024
x fix the bug
x check Michael's mail - is an enum large enough for big values?
x investigate why this was not discovered with the recovery.pl script
x add larger records
x check error inducer when flushing the changeset, not when
filling it! this is critical. right now many of the INDUCE
invocations are ignored!
x still not reproducible - wtf? make sure that the inducer in
journal.h is triggered - it is
x also, the bugfix was reverted to make sure that it's covered
by the tests
x what are the possible states during recovery?
invariant: a changeset flushes all committed transactions.
in other words: after a changeset was flushed, only active
transactions remain.
x last changeset is valid, gets applied, no newer transactions
x add a test
x last changeset is valid, gets applied, but newer transactions must be
recreated
x add a test
x last changeset is valid, but after it's applied an existing txn
is committed:
t1 = txn_begin
insert(t1)
for (i : [0..64))
t2 = txn_begin
insert(t2)
txn_commit(t2)
# changeset flush happens here
insert(t1)
txn_commit(t1)
-> changeset is never flushed! why?? because txn's can only be
flushed if the oldest txn is committed. not the case here!
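    For reference, the same scenario expressed with the public API (error
    handling omitted; env/db are assumed to be created with
    HAM_ENABLE_TRANSACTIONS and recovery enabled):

        ham_txn_t *t1, *t2;
        ham_key_t key = {0};
        ham_record_t rec = {0};
        uint64_t k = 0;
        key.data = &k;
        key.size = sizeof(k);

        ham_txn_begin(&t1, env, 0, 0, 0);
        k = 1; ham_db_insert(db, t1, &key, &rec, 0);
        for (int i = 0; i < 64; i++) {
          ham_txn_begin(&t2, env, 0, 0, 0);
          k = 1000 + i; ham_db_insert(db, t2, &key, &rec, 0);
          ham_txn_commit(t2, 0);
          /* a changeset flush can be triggered here, but t1 is still
             active, therefore the flush is deferred */
        }
        k = 2; ham_db_insert(db, t1, &key, &rec, 0);
        ham_txn_commit(t1, 0);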
x last changeset is invalid (right now the journal is re-applied
from the beginning, but this would cause duplicate keys to be
added twice) -> need to scan the file from the beginning and
get the header of the last changeset (incl. lsn)
x add a test (just truncate a few bytes)
x also run manual tests as described in the original bug report
x what if the recovery recreates pending transactions and then flushes
them, and there's a problem during flush? these flushes are not covered
by the journal and errors cannot be recovered from.
x add a test
x remove the journal trailer, it's no longer required
x remove the journal header structure, it is not required (but always
add a magic at the beginning of an entry)
x do we really need to store the lsn in the page? - yes, it helps a bit
x verify documentation in journal.h
x make the threshold parameter configurable
x a test currently fails:
--seed=1411263229 --use-cursors --duplicate=first --use-berkeleydb
../testfiles/2/ext_061.tst
x ZINT32: move code changes to a branch of hamsterdb-pro
x ZINT32: implement a NULL-compressor; it needs to keep track of
the "compressed" size and whether a split is required
x persist the number of compressed values, then use it to resize the
ByteArray
x check_integrity checks compressed size (+ overhead!) vs full_range_size
x then run tests against berkeleydb
x refactoring: the btree "slot" should be plain integer instead of ham_u32_t
x ZINT32: add varbyte compression
x make sure that the code runs "as is"
x decode-function needs number of encoded keys (do not persist it in
the compressor!)
x but maybe the compressor needs to keep track of the used storage,
esp. after decoding
x move check_integrity into the compressor class (parameter is node_count,
range_size)
-> calculate the "real" size
x the split is currently based on "simulated" results; if a split is
required then a "real" compression should be performed to figure out
how much space is left (if any)
x requires a change in the interface
(requires_split(size_t available_size))
x fix the tests
x right now files compressed with varbyte are as large as uncompressed
files - but why? is it because the keyrange currently cannot shrink?
-> yes; high initial capacity and no shrinking
x check if some changes need to be ported back to the apl version
x refactoring: it would be nicer if the ham_parameter_t lists are passed to
the Environment implementation and do not have to be parsed in hamsterdb.cc
x rebase to v2
x 2config/env_config.h: a struct with the run-time configuration of
the environment
x created in hamsterdb.cc/ham_env_create/ham_env_open and passed
to all objects that are created
x a State struct (no methods!) for
- the ham_parameter_t list
- the parameters from ham_env_open and ham_env_create
(without any dependencies!)
- log directory
- remote address
- remote timeout
- encryption settings
- page size
x gets initialized with data from the header page
x and with the parameters and the parameter list
x can only be retrieved as const ref for the whole structure
x 2config/db_config.h: a struct with the run-time configuration of
the database
x proceed as described above
- db name
- db flags
x when running ./ham_bench (without any parameters) then the debug version
has a maximum insert latency > 1 sec (sometimes > 2 sec).
-> caused by purging caches; whenever the cache is purged, at least
20 pages (PageManager::kPurgeAtLeast) are purged
-> a quick test without limit improved performance a lot, but there was
one outlier with 1.2 seconds
x run perftest without that limit
x or try other limits (2, 5, 10) -> does not improve
x at least fix the comment/variable name
x refactoring: once more check if we can unify the PaxLayout and
the DefaultLayout (or move more code to the base class!)
- no, not required. Stripping the library will reduce the size.
x refactoring: can we get rid of ham_record_t::_intflags and
ham_record_t::_rid? at least deprecate them, then remove them with
the next ABI change
-> yes, just remove them. they are no longer used.
x refactoring: improve hola performance
x improve performance for hola-sum (profile!)
http://sites.utexas.edu/jdm4372/2010/11/09/optimizing-amd-opteron-memory-bandwidth-part-3-single-thread-read-only/
s1 = a1 + b1
s2 = a2 + b2
s3 = a3 + b3
s4 = a4 + b4
return s1 + s2 + s3 + s4
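    A scalar sketch of that 4-accumulator unrolling (hypothetical helper, not
    the actual hola code):

        #include <cstddef>
        #include <cstdint>

        static uint64_t sum_u32(const uint32_t *p, size_t n) {
          uint64_t s1 = 0, s2 = 0, s3 = 0, s4 = 0;
          size_t i = 0;
          for (; i + 4 <= n; i += 4) {  // four independent accumulators let
            s1 += p[i + 0];             // the loads/adds overlap
            s2 += p[i + 1];
            s3 += p[i + 2];
            s4 += p[i + 3];
          }
          for (; i < n; i++)            // remainder
            s1 += p[i];
          return s1 + s2 + s3 + s4;
        }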
x run leveldb benchmark for comparison
x refactoring: rewrite capacity calculations (again...)
x rename copy_to() to move_to() -- no!
x get_full_record_size() and get_full_key_size() should return size_t,
not double
x do not persist capacity, only the range size
x if a KeyList requires the capacity then it can store it internally
x Instead of calculating an initial capacity, simply divide the
ranges in the same ratio as the initial key/record sizes; i.e.
if key size is 5, record size is 1, then the KeyList should be 5
times larger than the RecordList.
x exception is the PAX layout - no need to do any changes
x remove KeyList::get_range_size() (or move the function to
BaseKeyList, but don't persist it)
x remove RecordList::get_range_size() (or move the function to
BaseRecordList, but don't persist it)
x for compressed lists it's not possible to efficiently shrink (reduce
capacity) because it does not know its final size. Right now there's
a hack: Zint32KeyList returns a key size of 1 byte (instead of 4), and
therefore mostly grows and rarely shrinks. But that's hackish.
-> instead of calculating the required capacity rather ask for the
minimum capacity for the current set of keys
-> then add the K:R ratio as described above
x require_split should return the number of bytes required; then adjust
the K:R ratio
x also make sure that at least 1 item fits into the adjusted space
x increase the capacity of the UpfrontIndex only when required
(currently it is always increased whenever a new range is assigned)
x maybe pass a "capacity hint"? take it from the other list if
it kHasSequentialKeys
x if not then use a sane default value
(additional space / item size, but at least 1)
x what about shrinking? if capacity >> node_count then reduce
the capacity
x List:requires_split should no longer vacuumize
x lists should rearrange themselves internally for the new key
-> pass capacity + required data to vacuumize()
x still no success? then call rearrange() in the caller
x unittests are currently failing
x test for filesize regressions against 2.1.8 w/ ham_bench
x pax keys, pax records - ok
x varlen keys, pax records - ok
x pax keys, duplicates - ok
x varlen keys, duplicates - ok
x fix ./ham_bench --stop-ops=500000 --seed=1 --metrics=all --duplicate=last
x it throws assertions in 2.1.8, and HEAD fails as well
x add to monster test
x re-enable journal.cpp:1296
x fix performance regressions
x things to check to improve performance
x initial capacity could be taken from the node_count of previous
pages (save when splitting)
x rearrange requires sorting and many memmoves - very expensive
x vacuumize should not adjust the capacity; this should be
a separate operation. reason: vacuumize is called very
often i.e. before splits/merges, and should not change the
capacity then
x reducing the capacity means that rearrange is not required,
a memmove is enough
current stats:
2.1.8 821.9m 31.7m BtreeIndex::insert
2.1.9 826.1m 31.5m BtreeIndex::insert
x move UpfrontIndex into its own file
x the following tests fail
x --seed=12345 --stop-ops=50000 --recsize-fixed=1
x fix the crash
x file size increased by 100%!!
x --seed=12345 --stop-ops=50000 --recsize-fixed=0
x fix the crash
x file size increased by 100%!!
x --seed=1380279291 --key=binary --keysize-fixed --duplicate=last --erase-pct=25 --find-pct=40
x --seed=1380279291 --key=binary --keysize-fixed --duplicate=last
x --seed=1380279291 --key=binary --keysize=128 --keysize-fixed --duplicate=last --erase-pct=25 --find-pct=40
x --use-berkeleydb=true --reopen=true --use-cursors --duplicate=first ../testfiles/1/41.tst
x currently, requires_split() is called twice for
leaf nodes: in btree_insert.cc (where it should not be called), and in
btree_node_proxy.h.
However, if the call in btree_insert is removed, some tests fail, i.e.
--use-berkeleydb=true --fullcheck=find ../testfiles/2/ext_050.tst
x run monster tests
x increase file format version
x run performance tests
x see mail from Michael Moellney: cursor is not advancing properly when
it's at the end of the database
x reproduce w/ a sample
The current unittest has a different behaviour; it fails when looking up
keys with the cursor because the (approx.) key was not copied back
x fix
x refactoring: BtreeNodeImpl::print should use a stream object, like the
KeyLists and RecordLists, and avoid the printfs
x refactoring: clean up/review BaseKeyList and its members
x move check_integrity and vacuumize to Base
x is erase_data really required? -> yes
x rename to erase_extkey, move to Base class
x refactoring: rename erase_slot to erase
x refactoring: get_key_size() should return a size_t instead of ham_u32_t
x device: use madvise(address, length, MADV_DONTNEED); when freeing mmapped
pages. There is no equivalent on Win32.
x but run performance tests!
x refactoring: use "int duplicate_index" instead of ham_u32_t
x refactoring: use "int" for record counters
x refactoring: make device stateless and exception safe
x File must allow copying (= move w/o close)
x virtual abstract base class
x get rid of dependency to env, only to EnvironmentConfiguration
x get rid of m_flags, it's not required (only to disable mmap)
x move class state to a struct
x modify state with an atomic swap
x if all Os-calls are atomic then thread-safety should be "strong"!
x refactoring: make page stateless and exception safe
x refactoring: make KeyList::copy_to() parameters int instead of ham_u32_t
x refactoring: does BaseKeyList::check_integrity() really require the
|quick| parameter?
x also rename |ham_u32_t count| to |size_t node_count|
x erlang: 1 test is failing
x when the file format is changed then...
x internal node layout: store page IDs div pagesize to get smaller
integers (will be compressed better)
x improve node layout compression (when the file format changes)
x flags are not required at all if fixed length record size > 8
(and record size is too large for inline records)
-> set m_flags to null if they don't exist -> avoids a new member
x UpfrontIndex: use bit packing to compress slot/offset pairs?
-> no, would be difficult to gain something; big efforts
x compress the flags for the DefaultRecordList (4 bit per item)
-> are 4 bit sufficient if compression is enabled? - yes!
-> even 2 would be sufficient. But erase and insert would require
one memcopy and O(N) shifts; will hurt performance
-> look for a better solution for PRO
x a few tests are failing
./ham_bench --use-berkeleydb=true --reopen=true --key=binary --erase-pct=15 --find-pct=15 --recsize-fixed=0 --recsize=0 --cache=104857600 --duplicate=last --stop-ops=500000 --seed=1413888685
x refactoring: switch to stdint-supplied types instead of own definitions
x see here for msvc: http://msinttypes.googlecode.com/svn/trunk/stdint.h
o windows cleanups
x install VS2008 on the new harddisk
x install VS2010 on the new harddisk
x a few projects are not built (hamzilla_debug_x64, server_dll_x64...)
or end up in wrong directories
o The C# PInvoke for ham_cursor_find is missing the Record structure
o also accept flags in ham_db_find and ham_cursor_find for approx.
matching, and copy the key!
o this needs a unittest
o C# unittests sporadically fail b/c a handle is closed multiple times
- seems that the java API solves these problems more elegantly
o also look at Sandeep's mail regarding a crash
o Database: store all Cursors in a list, then close them before
closing the Database
o Environment: store all Transactions in a list, then close them before
closing the Environment
o automate the build process
o ... and the packaging
------------- hamsterdb-erlang 2.1.9 ---------------------------------
o upgrade to hamsterdb-2.1.9
o improve erlang integration
o quickcheck-ci.com
o look into Thomas' mail
o QuickCheck: create a new property for testing duplicates; the key is
always the same. The number of duplicate keys is tracked and
periodically checked with the API. A cursor can be used to remove a
specific duplicate, or to fetch a specific duplicate.
------------- hamsterdb pro 2.1.9 ------------------------------------
x PRO: review AES usage in read_page; if the page is mmapped
then decryption is not applied - no, it's ok. encryption
disables mmap
o PRO: rebase to vanilla 2.1.9
o PRO: review SIMD code; the problem is that it's never called, because the
search threshold is always smaller than the minimum block size
o PRO: look for a better compression for DefaultRecordList, i.e.
msb is 1 -> 7 bit record ID (should be sufficient)
msb is 0 -> 5 bit flags for size (2 bits unused), inline data (max 8 bytes)
o Use small index which stores offset + bits for each group
o Each group is a GroupedVarInt w/ 4 bits per entry; a 64bit
number can then hold 16 ints
o if flags == 2#11111 then the number is a 64bit blob id
o otherwise the flags specify the length of the inline data
o avoid ripple effects by growing/splitting the block
o Use this index for binary search
o Can we release an early version of zint32?
------------- hamsterdb 2.1.10 ---------------------------------------
o "make dist" should write the current git commit and the build date to
a file; this should be printed by ham_bench and the tools
o ham_bench: print this prior to the monster tests/performance tests
o get rid of HAM_API_REVISION, move HAM_VERSION_* to the header file, or
move version.h to include/ham
o also generate the ChangeLog from the git commits?
o documentation rewrite
o use github wiki
o remove old cruft
o then create the initial repository:
https://github.com/blog/699-making-github-more-open-git-backed-wikis
/introduction
/evaluate & benchmark
/faq (merge w/ performance)
/tutorial
... (existing stuff)
/pro
/c, c++
overview (description, status)
installation
usage (compiling, linking)
api functions (one file per function)
samples
/java
overview (description, status)
installation
usage (compiling, linking)
api functions
pro functions
samples
/dotnet
overview (description, status)
installation
usage (compiling, linking)
api functions
pro functions
samples
/erlang
overview (description, status)
installation
usage (compiling, linking)
api functions
pro functions
samples
/python
overview (description, status)
installation
usage (compiling, linking)
api functions
pro functions
samples
o have the hamsterdb/documentation directory "link" to the git project
o grow allocated memory buffers exponentially
https://blog.mozilla.org/nnethercote/2014/11/04/please-grow-your-buffers-exponentially/comment-page-1/#comment-23272
this only affects the ByteArray, as far as I can see
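  A minimal sketch of geometric growth for a ByteArray-like buffer (names are
  illustrative):

      #include <cstdlib>
      #include <new>

      struct GrowableBuffer {
        char *data;
        size_t capacity;

        void reserve(size_t required) {
          if (required <= capacity)
            return;
          size_t new_capacity = capacity ? capacity : 32;
          while (new_capacity < required)
            new_capacity += new_capacity / 2;  // grow by ~1.5x, not to |required|
          char *p = (char *)::realloc(data, new_capacity);
          if (!p)
            throw std::bad_alloc();
          data = p;
          capacity = new_capacity;
        }
      };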
o refactoring: make journal stateless and exception safe
o refactoring: make changeset stateless and exception safe
o refactoring: make BlobManager stateless and exception safe
o refactoring: make PageManager stateless and exception safe
x #42: approx. matching: cursor returns wrong key when using transactions
x create a unittest to reproduce this
x the fix should be that ham_cursor_find calls ham_db_find(..., cursor)
x cherry-pick the following commits on the master branch
1447ba4eb217532e8fb49c4a84a0dc3b982a3ffe
6a8dd20ec9bd2ec718d1136db7667e0e58911003
x Thomas Fähnle: create new parameter for fadvise configuration
x implement for file-based (w and w/o memory mapping) devices
x need to check for the function in configure.ac
x return flag with ham_env_get_parameter (not documented)
x create a unittest to make sure that the parameter is not persisted
x create a new parameter for ham_bench
x A few things to refactor in the btree
x the BtreeAction classes do not need a Transaction pointer as an argument!
x actually NO part of the Btree should EVER access a Transaction...
x btree_visitor.h has a structure called ScanVisitor, and BtreeVisitor is
in btree_index.h. Can both be combined?
x Introduce new option for record numbers (HAM_RECORD_NUMBER32,
HAM_RECORD_NUMBER64), deprecate the old one. Use 32bit whenever possible!
x introduce a new flag; the old one uses 64bit. all tests must work
x implement (local)
x implement (server)
x implement (remote)
x verify with ham_dump/ham_info
x add unittests
x re-use existing tests
x also for remote!
x check for overflows!
x add to ham_bench
x fix documentation
x samples
x header file
x implement Cursor::get_record_size for remote access
o Zint32: add compression based on bit-packing
-> https://github.com/lemire/simdcomp/blob/master/include/simdintegratedbitpacking.h
x add parameter to header file
x support in ham_bench
x update IndexFactory for the new KeyList
x move block/index related functionality to a base class BlockKeyList
(w/ template parameter for the block structure)
x integrate Daniel's sources as a library
x add a unittest to compress/decompress
x implement the compression
x don't use index->used_size, it's not required
x don't use index->block_size; it's always static
x use delta-encoding to have better compression
x fill in the implementation
x add unittests (similar to zint32)
x add monster tests (similar to zint32/varbyte)
x don't calculate bitwidth if index->bits is already maxxed out (32)
x there is no vacuumize prior to a split. The vacuumize is cancelled
immediately because it is always "internal".
x don't split blocks in the middle if the new key is appended
x don't split blocks in the middle if the new key is prepended
x test with bulk-erase and deltas of size 1, no records
-> no delete stability!
x erase: only calculate bits of the new delta, not of the whole block
o fix the remaining TODOs
x improve vacuumize_impl() performance by allocating multiple
blocks at once
o add functions for simdcomp library to deal with underfilled blocks
. improve copy_to() performance by allocating multiple blocks at once
o add more unittests for delete stability
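    A scalar sketch of the delta-encoding/bit-width step that the bit-packing
    KeyList needs (the actual code is based on the simdcomp library referenced
    above):

        #include <cstdint>

        // |keys| is a sorted block; writes n-1 deltas and returns the number
        // of bits that the largest delta requires
        static uint32_t delta_bits(const uint32_t *keys, int n, uint32_t *deltas) {
          uint32_t mask = 0;
          for (int i = 1; i < n; i++) {
            deltas[i - 1] = keys[i] - keys[i - 1];  // sorted -> no underflow
            mask |= deltas[i - 1];
          }
          uint32_t bits = 0;
          while (mask) {
            bits++;
            mask >>= 1;
          }
          return bits;
        }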
x issue43: fix segfault
it seems that a node is removed which is not in the rb-tree. the node
used to be there, but got dropped when another node was removed.
x Remove requirement for "delete stability" when erasing keys
x Introduce generic BtreeUpdateAction which handles merges and splits
x when erasing a key: also expect splits; in case of a split the
BtreeEraseAction has to retry the operation - either with the old
or the new node!
x hamsterdb-erlang: add support for HAM_RECORD_NUMBER32
o Concurrency: move "purge cache" to background
1) page list is not modified while it is traversed/modified
- the purger will traverse the list only from back to front
- the list's operations are only atomic for back-to-front
traversals, not the reverse
- the purger must not modify the list; it will only unmap/free the
page buffer!
2) it must be an atomic operation to lock the page as "in use" and to
check this lock!
3) PageManager: when fetching a page that is currently flushed to disk:
wait till flush is finished, then return the page
- Alternatively, reorg the hash table. All buckets are in serial memory.
This requires a few allocations, but reduces cache misses because
there's no "jumping" from page to page. Such a bucket can be replaced
atomically with CAS (hazard pointer!) and appended w/o locking. And
it avoids frequent calls to remove_from_list/insert_to_list to push
the used pages to the front. As a consequence, a new purge mechanism
is required to identify "old" pages. These pages then could even be
sorted before they are flushed. (a rough sketch of such a collection
follows at the end of this item)
o Create a background thread
o Join thread when closing the Environment
o create a non-blocking queue for passing data to the thread
o add a flag to disable background operations (why?)
o Get rid of the page lists. Have a generic "PageCollection" class
with atomic updates. Use this in the Changeset and the Cache buckets.
o use a DynamicArray<std::pair<uint64_t, Page *>> for memory
o whenever the array grows:
1) grow into a copy
2) perform "compactions" if too many elements were deleted
3) perform CAS with the old pointer
o assert that there is only 1 writer at a time (i.e. use a mutex)
o when inserting: set the pointer, then increment the counter
o when deleting: simply set to zero
o never use a lock when reading
o use hazard pointers; auto-delete them after 1 second
(pass them to the thread for management)
o have a Facade for the thread. If threading is enabled then store the
message in the message-queue; otherwise perform the request while
blocking
o but don't add the same request twice if it's still in the queue
o get rid of the Page-List (use a generic visitor for the Cache)
o purger thread periodically walks over the list, purges
o can be woken up by the main thread
o if a page is deleted or closed: let purger thread clean up the pages
o page must not be purged while it's "in use"
o mark page as "in use" during purge
o run performance tests
o and run the leveldb benchmarks
o extend monster tests
o use valgrind to check for race conditions etc
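  A rough sketch of the copy-on-write "PageCollection" idea described above
  (reclamation of old snapshots, i.e. the hazard pointers, is left out; all
  names are illustrative):

      #include <atomic>
      #include <cstdint>
      #include <mutex>
      #include <utility>
      #include <vector>

      class Page;

      class PageCollection {
        typedef std::vector<std::pair<uint64_t, Page *> > Snapshot;

       public:
        PageCollection() : m_current(new Snapshot()) {}

        Page *get(uint64_t address) const {       // readers never lock
          Snapshot *snap = m_current.load(std::memory_order_acquire);
          for (size_t i = 0; i < snap->size(); i++)
            if ((*snap)[i].first == address)
              return (*snap)[i].second;
          return 0;
        }

        void put(uint64_t address, Page *page) {  // only one writer at a time
          std::lock_guard<std::mutex> lock(m_mutex);
          Snapshot *old_snap = m_current.load(std::memory_order_relaxed);
          Snapshot *new_snap = new Snapshot(*old_snap);   // 1) grow into a copy
          new_snap->push_back(std::make_pair(address, page));
          m_current.store(new_snap, std::memory_order_release); // 2) publish
          // old_snap must be retired via hazard pointers before deleting it
        }

       private:
        std::mutex m_mutex;
        std::atomic<Snapshot *> m_current;
      };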
o refactoring: make EnvironmentHeader stateless and exception safe
-> will it ever be used outside of the EnvironmentConfiguration setting?
-> can we remove it?
o refactoring: make AES Encryption stateless and exception safe
o refactoring: make Btree Actions stateless and exception safe
o LocalEnvironment::open creates a "fakepage" to find out the page_size;
just assume the default page size of 16kb and read the first page. Then
discard if the real page_size is a different one
o deal with databases < 16kb?
o refactoring: improve the code separation of cursor and txn_cursor, i.e.
in db_local::cursor_get_record_size (wtf - is_null(0)?)
etc
-> clean up the whole Cursor stuff
o refactoring: clean up the whole Database code; it's too much and too
complex; there must be an absolutely clear border between Btree and
Transaction related code.
o reserve file size upon creation, then map the whole range
o needs new parameter HAM_PARAM_ALLOCATE_STORAGE_BYTES
o support in ham_bench
o monster tests and performance tests
o also for java
o also for dotnet
o also for python
o also for erlang
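  How the proposed parameter might be used (HAM_PARAM_ALLOCATE_STORAGE_BYTES
  does not exist yet; it is only the suggestion above):

      ham_parameter_t params[] = {
        { HAM_PARAM_ALLOCATE_STORAGE_BYTES, 2ull * 1024 * 1024 * 1024 }, /* 2 GB */
        { 0, 0 }
      };
      ham_env_t *env;
      ham_env_create(&env, "test.db", HAM_ENABLE_TRANSACTIONS, 0644, &params[0]);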
o bulk updates/batched updates
- give users API to allocate memory for keys/records
- if user says that data is already sorted then do not re-sort, but
fail hard if the sort order is violated
- add those delta updates to the txn-trees or to the btree node,
simply transfer ownership of the memory
- can contain inserts or deletes
o the Update structure, which describes a single update, can also
be used internally to pass all parameters to the Btree actions;
they would then be completely stateless
o also for remote, transfer in a single message
o also for .NET
o also for java
o also for erlang
o also for python
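  A hypothetical sketch of what such a batched-update entry point could look
  like; none of these names exist in the current API:

      typedef struct {
        int type;                /* kInsert or kErase */
        ham_key_t key;           /* ownership of key.data moves to hamsterdb */
        ham_record_t record;     /* only used for inserts */
      } ham_bulk_update_t;

      /* |updates| must be sorted by key if the "sorted" flag is set; hamsterdb
         fails hard if the sort order is violated */
      ham_status_t ham_db_bulk_update(ham_db_t *db, ham_txn_t *txn,
                      ham_bulk_update_t *updates, size_t count,
                      uint32_t flags);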
o If mmap is enabled then keys and records don't have to be copied as long
as they point into the mapped area!
- If a key is in a mapped page (w/o extended key) then simply return a
pointer
- If a record is in a mapped page then simply return a
pointer
o migrate to libuv 1.0, it has a stable API
http://docs.libuv.org/en/latest/migration_010_100.html
o also for windows!
o More things to refactor in the btree
o PageManager::fetch_page should be available in a const version
(fetch_const_page) which sets a flag in the page ("kIsImmutable").
the NodeProxy must test this flag whenever it modifies a page!
(debug only)
o EraseAction uses duplicate_index + 1, InsertAction uses duplicate_index
-> use a common behaviour/indexing
o EraseAction line 71: if the node is empty then it should be merged and
moved to the freelist!
o when splitting and HAM_HINT_APPEND is set, the new page is appended.
do the same for prepend!
o refactor LocalDatabase::cursor_insert: when done, the duplicate index is
updated. In most cases (DUPLICATE_LAST), a duplicate table has to be built,
but it's unclear whether the duplicate index is ever required. Better
use a logical index ("kLastDuplicate") and lazily calculate the actual
position when it's required.
o refactoring: db_local.cc has so many TODOs!
o refactoring: improve the code separation of cursor, btree_cursor
and txn_cursor, i.e. in db_local::cursor_get_record_size (wtf - is_null(0)?)
etc
o the whole cursor state is messed up. there should be 3 states:
- nil
- coupled to btree
- coupled to txn
and nothing else!
o there should be a get_key() and a get_record() method; the caller should
not get access to txn_cursor/btree_cursor etc
o cursor.cc has so many TODOs!
o review if the Cursor class has access to any shared resource; what
about the cursor list??
o continue with concurrency...
next stage: flush changesets in background. this should work whenever
recovery is enabled (with and without transactions)!
o improve the webpage documentation
o document the various btree formats on the webpage, with images
o variable length keys (w/ extended keys)
o POD keys
o default records
o inline records
o fixed-length records
o duplicates (w/ overflow tables)
o PRO: compressed keys
o PRO: compressed records
o internal nodes should use 32bit page IDs!
o look into dynomite, spark, mongodb and mariadb integration
https://github.com/Netflix/dynomite
-> like MapReduce, Spark requires an InputFormat class; templates are
available here:
https://github.com/hypertable/hypertable/blob/master/src/java/MapReduce/org/hypertable/hadoop/mapreduce/InputFormat.java
https://github.com/hypertable/hypertable/blob/master/src/java/MapReduce/org/hypertable/hadoop/mapred/TextTableInputFormat.java (streaming)
-> would be an interesting market with requirements for column store
storage
-> what does the integration look like?
-> has TPC-benchmarks and other benchmarks ready
o write down everything i find, collect information and estimate
efforts/gains (will do the same for mysql, R and mongodb later); this
task is not about writing code!
o come up with priorities, estimates
o dynomite: requires C interface and C marshalling code. rewrite C++
marshalling routines to C
o dynomite: send a patch to support more backends
o dynomite: send a patch to separate existing backends into their own file
o First steps towards concurrency; target is having one lock per db
o create coding rules which help for thread-safety
o make members const whenever possible
o non-const members need to lock or be atomic
o if a class is deliberately NOT threadsafe (i.e. ByteArray) then
document as annotation, and create a special profile build
which verifies that all non-const members are accessed synchronized
// rough guard sketch; needs <atomic> and <stdexcept>
class SingleThreadedGuard {
 public:
  SingleThreadedGuard(std::atomic<int> &counter)
    : m_counter(counter) {
    if (++m_counter != 1)           // another thread is already inside
      throw std::runtime_error("concurrent access detected");
  }
  ~SingleThreadedGuard() {
    --m_counter;
    // destructors must not throw (see the rules above); report/abort
    // here instead of throwing
  }
 private:
  std::atomic<int> &m_counter;
};
o can we also utilize valgrind macros?
o need to compile this with -O3, release builds!
o perform this for layers 1 - 3
o and also for layer 4 - make sure that the code separation is clear
-> especially between database and transactions
o define the locking behaviour of all classes/logical modules
o verify this behaviour in the code
o define which public API function locks what
o split the big lock and create one lock per database
o perform excessive testing
o Next steps towards concurrency: flush caches asynchronously,
use non-blocking writes (with libuv)
o or have a separate thread perform purges, flushes with libuv
o ... and prefetches (when opening a database with HAM_PREFETCH_INDEX)
o Next steps towards concurrency; target is having one writer per db, but
multiple readers
o move flushing of transactions into background
o move cache purging into background
o background operations should be disabled with flags
o perform excessive testing
. refactoring: get rid of inheritance for env
- move implementation to a struct, w/o state
- struct will be kept separately (if there is any)
- base class will be abstract
- the env will become a "hub" which manages members (i.e. page_manager,
device) and forwards the calls to those members
- implementation class uses a template parameter and forwards calls
to the struct; the state is a parameter which is only updated
if the operation succeeds
- if this pattern works then also apply it for Database,
Transaction, Cursor
o PRO: find an efficient compression method for InternalRecordList
-> also, this should play well with the compression of the DefaultRecordList
x The page IDs should be divided by the page size (makes the numbers smaller) - done
o PRO: use compression also for duplicate records
o PRO: prefix compression for variable-length keys
use an indirection for the prefixes and suffixes; store each
part in a slot. the keys themselves have then fixed length (2 slot id's)
==> supports efficient binary search!
==> is efficient for random read/writes AND linear scans
however, it's very complex to figure out how to optimally split the
strings into prefixes and suffixes
==> prefixes and suffixes can be stored as extended keys if they become
too large
see indexcompression2009.pdf - Efficient index compression in DB2 LUW
o look for more research papers
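  A rough sketch of the "two slot ids per key" layout (all names made up); the
  point is that the key array itself stays fixed-length and therefore
  binary-searchable:

      #include <cstdint>
      #include <string>
      #include <vector>

      struct SlottedKey {
        uint16_t prefix_slot;           // index into the prefix table
        uint16_t suffix_slot;           // index into the suffix table
      };

      struct StringTable {              // prefixes/suffixes; large entries
        std::vector<std::string> slots; // could be moved to extended keys
      };

      static int compare(const StringTable &prefixes, const StringTable &suffixes,
                         const SlottedKey &lhs, const std::string &rhs) {
        // simple version: materialize the key; a real implementation would
        // compare the prefix first and only touch the suffix on a tie
        std::string full = prefixes.slots[lhs.prefix_slot]
                              + suffixes.slots[lhs.suffix_slot];
        return full.compare(rhs);
      }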
. release-v2.pl: valgrind takes sooo long. should we use clang's
AddressSanitizer instead? or both?
. hola - next steps
o support java api
o support .net api
o support erlang api
o lua-functions as callbacks - then remote marshalling will work
o PRO: compile callbacks with clang remotely
o add remote support where it makes sense (only for PRO?)
. architecture for a new webpage
o pick an awesome design
i.e. similar to http://foundationdb.com, http://laravel.com,
http://rethinkdb.com, http://www.orientechnologies.com
o use make/m4/markdown to generate static pages:
https://github.com/datagrok/m4-bakery
https://developer.github.com/v3/markdown/
o come up with the full site structure/contents
http://sidekiq.org/pro/
o include full documentation, one page per API
o ... and for all languages
o keep the documentation in the source tree, not in -www?
o documentation comments are hosted on disqus
o blog comments are hosted on disqus, articles are also written in markup
o static pages will be served with nginx, using url rewrites
o dynamic pages are created with a micro-framework (silex?)
o Makefile can "scp -r" everything to the servers (staging or production)
. client area with (low priority)
o authentication
o collection of files
o analytics (who downloads what and when?)
. admin area with (even lower priority)
o authentication
o customer database
o implementing business processes
o sending out release emails
o importing new releases
o etc
. hola: use sum-prefix-trees to precalculate partial sums/results?
they could be stored in a btree, and used to dynamically recalculate
requested values
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
o QuickCheck: automatically test the recovery feature by invoking "crashes"
. use cache-oblivious b-tree layout
-> http://supertech.csail.mit.edu/cacheObliviousBTree.html
o see roadmap document for more information
o this feature is *per database*
o calculate number of reqd pages based on estimated keys from the user
(but what about the blobs??)
o make sure that this is not reverted when "reduce file size" feature
(above) is enabled
o the new pages are not managed by the freelist! therefore the freelist
will not need any modifications
o after resize: mmap the whole file area. this is actually important because
mmap is much faster than r/w; but when the database is created, the
original mapping already exists. therefore we might have to handle
more than one mapping in the file
o PageManager: when allocating a new page then use the distribution
function to fetch a page from the reserved storage
. try to batch allocations; when new pages are required then don't just
allocate one but multiple pages (if the database is big enough)
-> could create a second memory mapping for the next chunk
o delta updates managed in the BtreeNode
the operations are attached to the node, but as soon as the node
is accessed for a lookup or a scan, or immediately before the node
is flushed, the deltas are merged into the node. So far this does not
sound too efficient, but the bulk updates in combination with the
key compression (i.e. for prefix compression) will benefit a lot.
Also, they are really required for concurrency and to allow multiple
writers in parallel. (a rough sketch of the DeltaUpdate idea follows
at the end of this item)
x perform a benchmark/profiling: random inserts vs. ascending inserts;
the difference should be caused by memcopy/memmove (is it?)
x PAX
-> absolutely worth the effort, about 60% are spent in memmove
x Default
-> also worth the effort, about 15% are spent in memmove
o need a flag to disable DeltaUpdates
o add flag to ham_bench
o rename TransactionOperation to DeltaUpdate, decouple code from txn
o totally transparent to the caller, handled in the proxy
o only add deltas to leaf nodes; internal nodes have too many read
operations and would anyway require immediate flushes
o DeltaUpdate objects from a txn-flush should directly go down to
the node (detach from txn, attach to node)
o should the insert + set_record operations be combined into a
single call? this has the additional advantage that the overhead
of calling set_record will disappear
o merge delta updates when reading and flushing
o however, for simple lookup calls (no duplicates) the DUs can
be traversed, too
o requires_split() must take delta updates into account
o make the merge algorithm as efficient as possible
o sort deltas by key
o first execute all 'erase' either against other deltas or against
the node
o then merge the remaining inserts
o this needs lots of tests
o now run tests: should every update be stored as a DeltaUpdate? If not
then disable them by default, unless Transactions are used (and unless
bulk updates are used)
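  A very rough sketch of the data structure (names are made up; the real
  design would have to reuse or replace TransactionOperation):

      #include <algorithm>
      #include <cstdint>
      #include <vector>

      struct DeltaUpdate {
        enum Kind { kInsert, kErase };
        Kind kind;
        uint64_t key;                 // simplified to a fixed-size key
        uint64_t record_id;
      };

      struct DeltaList {              // attached to a leaf node
        std::vector<DeltaUpdate> deltas;

        static bool by_key(const DeltaUpdate &a, const DeltaUpdate &b) {
          return a.key < b.key;
        }

        // called immediately before the node is flushed, scanned or used
        // for a lookup
        template<typename Node>
        void merge_into(Node &node) {
          std::stable_sort(deltas.begin(), deltas.end(), by_key);
          for (size_t i = 0; i < deltas.size(); i++)   // 1) erases first
            if (deltas[i].kind == DeltaUpdate::kErase)
              node.erase(deltas[i].key);
          for (size_t i = 0; i < deltas.size(); i++)   // 2) remaining inserts
            if (deltas[i].kind == DeltaUpdate::kInsert)
              node.insert(deltas[i].key, deltas[i].record_id);
          deltas.clear();
        }
      };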
o the bucket for concurrency TODOs
these are the requirements:
- PRO: inserts are fully concurrent (APL: only one insert at a time)
- transactions are flushed in background
- dirty pages are flushed in background
- reading pages is synchronous, writing pages is asynchronous
-> needs a very fast locking scheme
- SSDs support many async. writes in parallel - be prepared!
http://meetingcpp.com/tl_files/2013/talks/scaling_with_cpp11_edouard_alligand.pdf
- remove db_get_last_error?
- how to deal with temporary pointers (key->data, record->data)?
store them in tls? or always allocate new pointers and let the caller
delete them?
o an intermediate stage would be to have concurrent reads but exclusive
writes (one writer per database, many readers) - makes sense?
o create coding rules
o synchronization either in lowest or highest level of the callstack
o when to use what kind of synchronization
o which modules are single-threaded/synchronized, which are atomic,
which are concurrent?
o how can we verify the correctness of the implementation?
- i.e. by using static analysis or dynamic analysis, together
with annotated code/custom tools? even if these tools do not
exist then a system of annotations/naming conventions can help
writing/maintaining the code
o mutex implementations should have built-in monitoring, i.e.
number of locks/unlocks, wait times, contention etc
o need read/write mutex, fast latches
o come up with a list of all functions, define which locking operation
is required; then review the code and make sure this will work
o the environment configuration
o the database configuration
o the transaction tree handling
o the page manager, the device and the cache
o the btree
o the btree nodes (i.e. extkeycache, compressor)
o parallel lookups (using the same memory arena)
o reduce the linked lists - they are hard to update with atomic
operations
o page
o transaction and dependent objects
o ...
o separate SMOs from the actual operation (#2)
-> check the literature
http://pdf.aminer.org/000/409/763/b_trees_with_relaxed_balance.pdf
o move SMO operations to "the janitor" (btree_janitor.h)
o the global environment-lock should go because it's expensive; rather
increment an atomic latch, and refuse to close/erase the database as
long as the latch is > 0
o PRO: hot backups (vacuumizes to a different file)
really only for PRO?
http://sqlite.org/c3ref/backup_finish.html
- make sure that all transactions are closed
- perform ham_env_flush
- then copy the file
- if compaction is enabled: copies keys w/ iterator
(later: performs bulk updates)
--> think this through; how to deal with delta updates? -> merge them
what if only a few databases should be backed up?
what if I want to back up in a logical format (i.e. csv)?
o "hola" - olap functions that operate directly on the btree data
-> see wiki
-> see java8 stream API:
http://download.java.net/jdk8/docs/api/java/util/stream/Stream.html
-> see supersonic:
https://code.google.com/p/supersonic/
-> see fast bitmap indices
http://code.google.com/p/lemurbitmapindex/
o create a design
o operations on compressed data (COUNT(), MIN(), MAX(), ...)?
o use async operations or futures/promises
o deprecate ham_db_get_key_count() (tutorial, documentation)
- bloom filter -> PRO
- concurrency -> PRO
. clean up approx. matching
o ONLY for cursors
o Flags: HAM_FIND_LT_MATCH | HAM_FIND_GT_MATCH | HAM_FIND_EQ_MATCH (default)
o lookup: the cursor is coupled to the key, even if the lookup fails
then perform a lookup:
found_key == requested_key:
HAM_FIND_EQ_MATCH: ok
HAM_FIND_LT_MATCH: return move_prev()
HAM_FIND_GT_MATCH: return move_next()
found_key < requested_key:
HAM_FIND_LT_MATCH: ok
HAM_FIND_GT_MATCH: return move_next()
HAM_FIND_EQ_MATCH: key not found
found_key > requested_key:
HAM_FIND_GT_MATCH: ok
HAM_FIND_LT_MATCH: return move_prev()
HAM_FIND_EQ_MATCH: key not found
o must work with transactions
o do not store key flags; the caller has to compare the key
o remove ham_key_set_intflags, ham_key_get_intflags, key->_flags (?)
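  A sketch of the decision table above as code (HAM_FIND_EQ_MATCH is the
  proposed default flag and does not exist yet; |cmp| is the comparison of
  found_key against requested_key):

      static ham_status_t finish_approx_find(ham_cursor_t *cursor, int cmp,
                                             uint32_t flags) {
        if (cmp == 0) {                          /* found_key == requested_key */
          if (flags & HAM_FIND_LT_MATCH)
            return ham_cursor_move(cursor, 0, 0, HAM_CURSOR_PREVIOUS);
          if (flags & HAM_FIND_GT_MATCH)
            return ham_cursor_move(cursor, 0, 0, HAM_CURSOR_NEXT);
          return HAM_SUCCESS;                    /* HAM_FIND_EQ_MATCH */
        }
        if (cmp < 0) {                           /* found_key < requested_key */
          if (flags & HAM_FIND_LT_MATCH)
            return HAM_SUCCESS;
          if (flags & HAM_FIND_GT_MATCH)
            return ham_cursor_move(cursor, 0, 0, HAM_CURSOR_NEXT);
          return HAM_KEY_NOT_FOUND;
        }
        /* found_key > requested_key */
        if (flags & HAM_FIND_GT_MATCH)
          return HAM_SUCCESS;
        if (flags & HAM_FIND_LT_MATCH)
          return ham_cursor_move(cursor, 0, 0, HAM_CURSOR_PREVIOUS);
        return HAM_KEY_NOT_FOUND;
      }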
. win32: need a release-v2.pl which fully automates the various release steps
o delete all generated protobuf files
o build for msvc 2008
o run unittests for debug and release
o run samples
o delete all generated protobuf files
o build for msvc 2010
o run unittests for debug and release
o run samples
o build release package
. also remove locking from C# and Java APIs
------------------- idea soup ---------------------------------------------
. PRO: should we have a separate "recsize == 0" RecordList for duplicates?
they could only store the duplicate count (but should be able to deal
with duplicates that are > 256!)
-> requires grouped varints
o asynchronous prefetching of pages
-> see posix_fadvice, libprefetch
o when recovering, give users the choice if active transactions should be
aborted (default behavior) or re-created
o needs a function to enumerate them
o A new transactional mode: read-only transactions can run "in the past" - only
on committed transactions. therefore they avoid conflicts and will always
succeed.
o need a function to get the txn of a conflict (same as in v2)
ham_status_t ham_txn_get_conflicting_txn(ham_txn_t *txn, ham_txn_t **other);
or: should the txn id be returned instead? otherwise there is a race
condition if another thread commits/aborts "other"
o also add to c++ API
o add documentation (header file)
o add documentation (wiki)
. new test case for cursors
insert (1, a)
insert (1, b) (duplicate of 1)
move (last) (-> 1, b)
insert (1, c)
move (last) (-> 1, c)? is the dupecache updated correctly?
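  The same steps with the public API (assumes |cursor| belongs to a database
  that supports duplicate keys; error checks omitted):

      ham_key_t key = {0};
      ham_record_t rec = {0};
      int k = 1;
      key.data = &k;
      key.size = sizeof(k);

      rec.data = (void *)"a"; rec.size = 1;
      ham_cursor_insert(cursor, &key, &rec, 0);
      rec.data = (void *)"b"; rec.size = 1;
      ham_cursor_insert(cursor, &key, &rec, HAM_DUPLICATE);  /* duplicate of 1 */
      ham_cursor_move(cursor, &key, &rec, HAM_CURSOR_LAST);  /* -> (1, "b") */
      rec.data = (void *)"c"; rec.size = 1;
      ham_cursor_insert(cursor, &key, &rec, HAM_DUPLICATE);
      ham_cursor_move(cursor, &key, &rec, HAM_CURSOR_LAST);  /* expect (1, "c");
                                              is the dupecache updated correctly? */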
. there are a couple of areas where a btree cursor is uncoupled, just to
retrieve the key and to couple the txn-key. that's not efficient
db.c:__btree_cursor_points_to
db.c:__compare_cursors
txn_cursor.c:cursor_sync
txn_cursor.c:cursor_overwrite
o move to a separate function
o try to optimize
. add tests to verify that the cursor is not modified if an operation fails!
(in cursor.cpp:LongTxnCursorTest are some wrapper functions to move or
insert the cursor; that's a good starting point)