mv nonblocking write2

Matthew Von-Maszewski edited this page Oct 20, 2016 · 11 revisions

Status

  • merged to master -
  • code complete - October 6, 2016
  • development started - December 2, 2015

History / Context

eleveldb is Basho's Erlang-to-leveldb translation layer. The layer initially made its leveldb API calls while remaining on Erlang threads. It turned out that using Erlang threads within leveldb could result in single and multiple collapses of the Erlang schedulers. Erlang does not recognize scheduled tasks that "take a while", such as native disk operations. In the first quarter of 2013, Basho installed a specialized worker pool, called hot threads, within eleveldb. eleveldb took each Erlang operation request and created an asynchronous task that executed on a thread from the worker pool. The Erlang scheduler thread was released to perform other tasks while waiting for an asynchronous message from the independent eleveldb thread.

The thread swapping solution solved the problem of Erlang scheduler collapses, but it increased the time required to perform any requested operation. This branch looks for opportunities when a write operation is highly unlikely to be delayed or, if delayed, not for a significant period. In those cases the branch bypasses the thread swap, because the write is known not to receive intentional delays from the write throttle code and/or from rotation of the current write buffer.

There is an mv-nonblocking-write2 branch in both the eleveldb repository and the leveldb repository. This wiki entry discusses both.

Branch description

The basic concept is that each write operation within eleveldb now starts with a call to the new leveldb API RequestNonBlockTicket(). RequestNonBlockTicket() looks at the state of the write throttle and the write buffer. If the throttle is inactive and the write buffer is not "too close to full", the function increments a counter and returns "true". It returns "false" if either condition is not met. The counter prevents the underlying code from applying the write throttle and/or rotating the write buffer until all "ticketed" write calls have occurred. The write calls carry the ticket within the WriteOptions structure member non_blocking.
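
The ticket idea can be sketched in a few lines of C++. This is a minimal illustration, not the branch's code: the class name TicketGate, its setters, and the 75% "too close to full" threshold are all assumptions made for the example, and the sketch ignores the race between checking the state and incrementing the counter that real code would have to consider.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the state that RequestNonBlockTicket() consults.
class TicketGate {
 public:
  explicit TicketGate(std::size_t buffer_limit) : buffer_limit_(buffer_limit) {}

  // Grant a ticket only when the throttle is inactive AND the write buffer
  // is not too close to full. Either failing condition refuses the ticket.
  bool RequestTicket() {
    if (throttle_usec_.load(std::memory_order_acquire) != 0) return false;
    const std::size_t used = buffer_bytes_.load(std::memory_order_acquire);
    // Assumed threshold: "too close to full" means above 75% of the limit.
    if (used * 4 > buffer_limit_ * 3) return false;
    tickets_.fetch_add(1, std::memory_order_acq_rel);
    return true;
  }

  // Called once a ticketed write has gone through.
  void ReleaseTicket() {
    const int previous = tickets_.fetch_sub(1, std::memory_order_acq_rel);
    assert(previous > 0);  // releasing without a ticket is a logic error
    (void)previous;
  }

  // Throttling and buffer rotation must wait while tickets are outstanding.
  bool MayBlock() const {
    return tickets_.load(std::memory_order_acquire) == 0;
  }

  void SetThrottleUsec(unsigned usec) {
    throttle_usec_.store(usec, std::memory_order_release);
  }
  void SetBufferBytes(std::size_t bytes) {
    buffer_bytes_.store(bytes, std::memory_order_release);
  }

 private:
  const std::size_t buffer_limit_;
  std::atomic<unsigned> throttle_usec_{0};
  std::atomic<std::size_t> buffer_bytes_{0};
  std::atomic<int> tickets_{0};
};
```

A caller would pair each successful RequestTicket() with exactly one ReleaseTicket(); while any ticket is outstanding, MayBlock() stays false.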

eleveldb/c_src/eleveldb.cc

The async_write function previously created an asynchronous task for every Erlang write request. The function posted the task, a WriteTask, to the worker pool. This branch adds a call to a new leveldb API, RequestNonBlockTicket(). The API returns true if it is safe for the write operation to call leveldb directly without using the worker pool (without swapping threads). async_write marks direct calls to leveldb by setting the WriteOptions.non_blocking flag to true. This lets leveldb account for each time RequestNonBlockTicket() returned true.

async_write() now has a stricter return / result definition. async_write either returns the "CallerRef" generated by eleveldb.erl, or it returns an atom. The atom will be either "ok" or an error code. An "ok" indicates that the write operation completed without a thread swap. The "CallerRef" indicates that a thread swap occurred and the caller should expect an asynchronous message (which could be a completion ok or an error code). A direct return of an error code from async_write implies that the data was neither written nor scheduled for an asynchronous write.
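
The three possible outcomes can be modeled as follows. This is a hedged sketch of the decision flow only: DispatchWrite, AsyncWriteResult, the callback parameters, and the error names "write_failed" and "submit_failed" are hypothetical and do not come from eleveldb.cc, which works with Erlang NIF terms rather than C++ structs.

```cpp
#include <cassert>
#include <functional>
#include <string>

enum class ResultKind { kOk, kError, kCallerRef };

// Hypothetical model of what async_write hands back to Erlang.
struct AsyncWriteResult {
  ResultKind kind;
  std::string detail;  // "ok", an error name, or the caller reference
};

AsyncWriteResult DispatchWrite(
    const std::string& caller_ref,
    const std::function<bool()>& request_ticket,
    const std::function<bool(bool non_blocking)>& direct_write,
    const std::function<bool()>& post_to_pool) {
  assert(!caller_ref.empty());  // the reference must identify the caller

  if (request_ticket()) {
    // Ticket granted: write on the calling thread with non_blocking set,
    // so no asynchronous message will follow.
    if (direct_write(true)) return {ResultKind::kOk, "ok"};
    return {ResultKind::kError, "write_failed"};  // assumed error name
  }

  // No ticket: swap threads. The caller reference tells Erlang to wait
  // for an asynchronous completion message.
  if (post_to_pool()) return {ResultKind::kCallerRef, caller_ref};

  // Neither written nor scheduled.
  return {ResultKind::kError, "submit_failed"};  // assumed error name
}
```

The design point the sketch tries to capture is that the return value alone tells the caller whether a message is still on its way: only the caller-reference case implies one.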

eleveldb/src/eleveldb.erl

The functions write(), async_write(), and async_put() are updated to process return codes and/or return messages detailed above.

leveldb/db/db_impl.cc / .h

The DBImpl object carries three new member variables: non_blocking_tickets_, last_penalty_, and est_mem_usage_. non_blocking_tickets_ tracks how many calls to RequestNonBlockTicket() received a "true" return and must therefore clear before blocking may continue within DBImpl::Write(). last_penalty_ tracks the most recent penalty decision within VersionSet::PickCompaction(). The penalty is most likely the value that would apply to the next write operation. The value is cached within DBImpl so that it is available to RequestNonBlockTicket() without a mutex. Similarly, est_mem_usage_ holds the most recent size of the active write buffer. Again, this variable is a cache that avoids taking a mutex.
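
The caching pattern described here, where a value maintained under a mutex is mirrored into a variable that can be read without one, might look like the sketch below. The class CachedWriteState and its methods are invented for illustration; only the names last_penalty_ and est_mem_usage_ come from the text, and their real types and update sites are not shown in this page.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

// Hypothetical illustration of a mutex-protected value with a lock-free mirror.
class CachedWriteState {
 public:
  // Writer side: runs with the mutex held, then publishes a copy.
  void UpdatePenalty(int penalty) {
    std::lock_guard<std::mutex> lock(mutex_);
    penalty_ = penalty;  // authoritative value
    last_penalty_.store(penalty, std::memory_order_release);  // cached copy
  }

  void AddToBuffer(std::size_t bytes) {
    std::lock_guard<std::mutex> lock(mutex_);
    mem_usage_ += bytes;  // authoritative value
    est_mem_usage_.store(mem_usage_, std::memory_order_release);
  }

  // Reader side: no mutex, so the answer may be slightly stale.
  int last_penalty() const {
    return last_penalty_.load(std::memory_order_acquire);
  }
  std::size_t est_mem_usage() const {
    return est_mem_usage_.load(std::memory_order_acquire);
  }

 private:
  std::mutex mutex_;
  int penalty_ = 0;
  std::size_t mem_usage_ = 0;
  std::atomic<int> last_penalty_{0};
  std::atomic<std::size_t> est_mem_usage_{0};
};
```

The trade-off is staleness: a reader can see a value that is one update behind, which is acceptable when the value is only an estimate used to decide whether a thread swap is worthwhile.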

DBImpl::Write() checks the non_blocking_tickets_ variable to decide when potentially blocking actions should or should not take place. The WriteOptions.non_blocking flag is currently ignored in that decision, for two reasons. First, a user could unknowingly set the non_blocking flag without calling RequestNonBlockTicket() and mess up the internal logic. Second, any incoming write that did not have the flag set could potentially block and thereby implicitly block a subsequent write that believes it is non-blocking. WriteOptions.non_blocking is used when the logic decides whether or not to decrement non_blocking_tickets_.
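
The separation between the counter (which drives the blocking decision) and the flag (which retires a ticket) can be sketched like this. WriteSketch, WriteOptionsSketch, and WriteTrace are hypothetical names, and the guard that refuses to decrement below zero is an assumption added to keep the example safe against a flag set without a ticket; the page does not say how DBImpl::Write() handles that case.

```cpp
#include <atomic>

struct WriteOptionsSketch {
  bool non_blocking = false;  // mirrors the WriteOptions member named in the text
};

struct WriteTrace {
  bool allowed_to_block;  // what the blocking decision was for this write
};

WriteTrace WriteSketch(const WriteOptionsSketch& options,
                       std::atomic<int>& tickets) {
  // The decision looks at the counter only. The flag is deliberately ignored
  // here, so an unflagged write cannot stall a ticketed write queued behind it.
  const bool may_block = tickets.load(std::memory_order_acquire) == 0;

  // ... the write itself would happen here ...

  if (options.non_blocking) {
    // The flag matters only for retiring a ticket. Assumed guard: never go
    // below zero, in case the flag was set without a ticket being granted.
    int current = tickets.load(std::memory_order_acquire);
    while (current > 0 &&
           !tickets.compare_exchange_weak(current, current - 1,
                                          std::memory_order_acq_rel)) {
      // compare_exchange_weak reloads `current` on failure; retry.
    }
  }
  return {may_block};
}
```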

DBImpl::RequestNonBlockTicket() is a new leveldb API. eleveldb calls it to determine whether a thread swap is necessary. It returns "true" when it is safe to not swap threads. It returns "false" when a write operation might block and therefore a thread swap is warranted.

DBImpl::MakeRoomForWrite() previously used an unterminated "while" loop to process write buffer states. The loop will now terminate if a non-blocking request is granted to an upcoming write operation.
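
As a rough picture of the new loop exit, consider the sketch below. It is not the leveldb implementation: BufferState, RoomOutcome, and MakeRoomSketch are hypothetical, rotation is reduced to resetting a counter, and the real function handles more states (such as waiting on compactions) than are shown here.

```cpp
#include <atomic>
#include <cstddef>

struct BufferState {
  std::size_t used;   // bytes in the active write buffer
  std::size_t limit;  // configured write buffer size
  int rotations;      // how many times the buffer was swapped out
};

enum class RoomOutcome { kHadRoom, kRotated, kDeferredForTickets };

RoomOutcome MakeRoomSketch(BufferState& buffer,
                           const std::atomic<int>& tickets) {
  bool rotated = false;
  while (true) {
    if (buffer.used <= buffer.limit) {
      // Normal exit: the active buffer can take the write.
      return rotated ? RoomOutcome::kRotated : RoomOutcome::kHadRoom;
    }
    if (tickets.load(std::memory_order_acquire) > 0) {
      // New exit: a ticket was granted to an upcoming write, so leave the
      // loop instead of rotating the buffer or waiting.
      return RoomOutcome::kDeferredForTickets;
    }
    // Stand-in for rotating to a fresh write buffer.
    buffer.used = 0;
    ++buffer.rotations;
    rotated = true;
  }
}
```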

leveldb/db/version_set.cc & .h

VersionSet::PickCompaction() now copies the recently computed write_penalty_ to the DBImpl structure. This allows RequestNonBlockTicket() easy access to that value.

WriteThrottleUsec() had a legacy parameter, active_compaction, that it no longer uses. The parameter is now gone. The change is unrelated to this branch. Just a code cleanup.

leveldb/include/leveldb/db.h

Add API declaration for new RequestNonBlockTicket().

leveldb/include/leveldb/options.h

Add non_blocking flag to WriteOptions structure.

leveldb/util/throttle.cc & .h

Misc. updates to use memory fence operations to update globals that have visibility outside the throttle functions.
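
For readers unfamiliar with the technique, publishing globals with memory fences can be illustrated with standard C++ atomics. This sketch does not reproduce throttle.cc: the variable names, the two-value layout, and the use of std::atomic_thread_fence are assumptions chosen for the example, and the actual code may use different primitives.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical globals standing in for throttle state visible to other code.
std::atomic<std::uint64_t> g_throttle_usec{0};
std::atomic<std::uint64_t> g_unadjusted_usec{0};
std::atomic<bool> g_throttle_published{false};

// Writer: store the values, then fence, then raise the flag. The release
// fence keeps the value stores from being reordered after the flag store.
void PublishThrottle(std::uint64_t usec, std::uint64_t unadjusted) {
  g_throttle_usec.store(usec, std::memory_order_relaxed);
  g_unadjusted_usec.store(unadjusted, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release);
  g_throttle_published.store(true, std::memory_order_relaxed);
}

// Reader: check the flag, then fence, then read the values. The acquire
// fence guarantees the reads see everything stored before the flag was set.
bool ReadThrottle(std::uint64_t* usec, std::uint64_t* unadjusted) {
  if (!g_throttle_published.load(std::memory_order_relaxed)) return false;
  std::atomic_thread_fence(std::memory_order_acquire);
  *usec = g_throttle_usec.load(std::memory_order_relaxed);
  *unadjusted = g_unadjusted_usec.load(std::memory_order_relaxed);
  return true;
}
```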
