Shared memory interface #23

Open
dw opened this Issue Sep 7, 2013 · 0 comments

Projects

None yet

1 participant

@dw
Owner
dw commented Sep 7, 2013

The goal is to exploit database engines like LMDB (via py-lmdb) by transparently extending their zero-copy behaviour up into user code. Since LMDB never copies during reads, range scans on cached datasets can be extremely fast, however in the current version of Acid most of this efficiency is lost on immediate key decoding, string slicing (partly due to #11 and #37), and immediate value decoding.

py-lmdb already supports exposing zero-copy via buffer() objects, but the resulting buffers must be managed carefully by the user, e.g. they cannot outlive the transaction they were created in. Other databases like LevelDB also export (ptr, length) pairs that are guaranteed to survive at least until the next iterator mutation.

It's therefore desirable to imagine an interface where Acid could receive the database engine's buffer, and safely use it as the backing store for its Key type exactly until the moment before it becomes unavailable, without needing careful management by the user. The plain buffers interface doesn't allow this, it only permits locking, which would prevent any change to the LMDB transaction or LevelDB iterator while any sink object exists. The plain interface also doesn't support any kind of notification, so at best a transaction would be forced to remain alive for as long as any Key instances existed.

The plan is to design a protocol for orthogonal Python objects to communicate with each other at the C level, to learn how to share read-only buffers in a way that is transparent to the (Python) programmer. The programmer need not care where an object came from, or that within a transaction its storage does not exist, or that during transaction finalization its storage was copied to a private heap allocation (or for LevelDB, that the copy happened because the user kept the object around during a subsequent iteration).

If implemented correctly, creating a large number of temporary Key instances (e.g. for hash join, ticket #29) should involve 0-1 allocations per key (assuming a freelist exists), and zero copies. Combined with the planned shared-memory aware value type (#41), it should also allow large range scans with minimal overhead.


Implement KEY_SHARED/KEY_COPIED modes by teaching ext/key.c how to accept a source buffer object and source object, and how to communicate with the source object to get a lease on the buffer. Reuse the new interface for ext/structlib.c (ticket #41) when that module exists.

Current idea is to extend from_raw like KeyList.from_raw(prefix, buf, source=None). If source is present, buf must be a buffer instance. key.c will ask source for __memsource__ class attribute. The attribute must be present, and reference a PyCapsule containing a struct. Struct will look like:

struct MemSource {
    // Changes when ABI changes.
    unsigned long magic;
    // Notify `sink` when `source` buffers expire.
    int (*listen)(PyObject *source, PyObject *sink);
    // Cancel notification of `sink` when `source` buffers expire.
    int (*cancel)(PyObject *source, PyObject *sink);
};

listen() will internally ask KeyList for __memsink__ attribute, which must be present, and reference a PyCapsule containing a struct. Struct will look like:

struct MemSink {
    // Changes when ABI changes.
    unsigned long magic;
    // PyObject offset where SinkList is stored.
    Py_ssize_t list_offset;
    // Invoked when source buffer is about to become invalid.
    void (*invalidate)(PyObject *source, PyObject *sink);
};

SinkList is simply:

struct SinkList {
    // Borrowed reference to previous sink in list, or NULL to indicate
    // head of list (in which case borrowed reference is stored directly in
    // source object).
    PyObject *prev;
    // Borrowed reference to next sink in the list.
    PyObject *next;
};

The idea is that with 2 (potentially interned) hash lookups, source object knows how to maintain a linked list of sink objects directly inside the PyObjects themselves, avoiding separate heap allocations for a separate container and one weakref instance per sink.

Exposing the protocol directly to Python code doesn't seem very useful at all, so keeping everything buried as PyCapsules doesn't feel so bad.

When source object (the database iterator or transaction) is about to abort or commit, it walks the list like:

PyObject *cur = self->list_head;
while(cur) {
    PyObject *capsule = PyObject_GetAttrString(Py_TYPE(cur), "__memsink__");
    struct MemSink *sink = PyCapsule_GetPointer(capsule);
    Py_DECREF(capsule);

    if(sink == NULL || (sink->magic != MEMSINK_MAGIC)) {
        // Should never happen, unless user messes with type object since
        // listen() was called. All that can be done is to crash, since
        // remainder of notifier list can't be found without __memsink__
        // access, which will cause a crash or corruption anyway.
        abort();
    }

    // Tell the object's implementation the buffer is going away.
    sink->invalidate(self, cur);

    // Clear out its linked list node, and proceed to next sink.
    struct SinkList *list = (struct SinkList *) (((char *) cur) + sink->list_offset);
    list->prev = NULL;
    cur = list->next;
    list->next = NULL;
}

This approach is obviously horribly fragile, so not ready to commit to it just yet. The major benefit is that sink objects need only include a header file in their source tree, and all linked list logic and suchlike is constrained to the source object (py-lmdb Transaction in this case).

Other issues with the protocol above are that it does not allow for multiple sources to exist for a single sink (although that is not needed in my use case), and that no verification occurs that source really provided buf. Not sure how useful such a verification would be.

Another issue is that there is potential "jankiness" when the user builds up a huge list of shared-memory objects within a transaction, and then commits or aborts the transaction. In that case, expiry will immediately cause a large number of heap allocations/memory copies, potentially leading to a MemoryError or crash far away from the code that caused it. It seems easy to avoid this, though, and there doesn't seem to be any alternative design that prevents it.


An alternative approach would be to combine __memsource__ and the buffer object into a single custom type, but then PyArg_ParseTuple() and friends would need to call PyTypeObject.tp_as_buffer and create a temporary buffer object when not using the interface. It's also repugnant on the basis that it introduces Yet Another buffer type, in addition to buffer and memoryview.

It also doesn't solve the fact that consumers must also still implement the protocol.

@dw dw added a commit that referenced this issue Sep 7, 2013
@dw Draft bits of memsink.h (ticket #23). e864807
@dw dw added a commit that referenced this issue Sep 8, 2013
@dw First "sane" memsink.h (ticket #23) 1a15119
@dw dw added a commit that referenced this issue Sep 11, 2013
@dw Add source= to Key.from_raw() (ticket #23). aea17bc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment