Detect cases where unused chunk space can be written to #313
A couple of meta comments first:
I tried to run some simple tests and it broke pretty quickly, unfortunately.
Thanks for the feedback; having an initial impression is useful here.
One of the major issues I encountered here was figuring out where writes need to be made in memory versus to a file. If we decide to take the lazy approach discussed above, I don't think it will work to just add the append method, because we would need to solve all sorts of edge cases where writes/appends/reads are interspersed. But that brings us to your next point:
Understandable, and you're probably right about this. Let me get back to you about how we can move forward before we start considering the other implications of this approach.
This PR adds an `.append()` method to allow `InMemoryDataset` objects to append data to any unused space in the last chunk. If no free space exists, the data is written as usual into a new chunk. Closes #295.
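A minimal usage sketch of the new `.append()` method, assuming the usual versioned-hdf5 `stage_version` workflow; the file name, dataset name, chunk size, and data below are illustrative only:

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

with h5py.File("data.h5", "w") as f:
    vf = VersionedHDF5File(f)

    # First version: a chunked dataset whose last chunk is only partially full.
    with vf.stage_version("r0") as sv:
        sv.create_dataset("values", data=np.arange(7), chunks=(10,))

    # Second version: append into the unused space of the last chunk.
    # With this PR, the three new elements should reuse the free slots in
    # that chunk instead of allocating a new raw chunk.
    with vf.stage_version("r1") as sv:
        sv["values"].append(np.arange(7, 10))
```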
Changes
- Replaces the `data_dict` system that previously kept track of the slices in an `InMemoryDataset`. Previously the `data_dict` was a member which lived on the low-level `InMemoryDatasetID` object, and contained mappings between slices in the virtual dataset and slices in the raw dataset. After careful consideration, it doesn't seem necessary to keep this data on the lower-level object; we can keep it on the `InMemoryDataset` instead, significantly reducing the conceptual complexity of the data model. Previously, when the user called `InMemoryDataset.resize` or `InMemoryDataset.__setitem__`, the `data_dict` mapping would keep track of slices as they changed; when exiting the `stage_version` context manager for a version, this `data_dict` was then used both to write the new data and to build the slices of `raw_data` referenced by the virtual dataset. This system has been replaced with a more explicit scheme which defers all computation until execution exits the `stage_version` context manager. With the new scheme, the user can manipulate data in three ways:
  - `InMemoryDataset.__setitem__`
  - `InMemoryDataset.resize`
  - `InMemoryDataset.append`

  In all cases, when the user calls one of these methods, the `InMemoryDataset` keeps track of the manipulation done by the user. When the `stage_version` context is exited, all computations are resolved at that point. We should therefore expect some of the computation time that would otherwise be spent inside the context manager to be reduced, with a corresponding increase when the context manager is exited (see the deferred-operation sketch after this list).
- `InMemoryDataset` now keeps track of the last element written to its corresponding `raw_data`, which is needed for `append` operations.
- `backend.create_virtual_dataset` can now accept either `Slice` objects or `Tuple` objects in the values of the `slices` parameter.
- `SetOperation`, `ResizeOperation`, and `AppendOperation`: helper classes which are used to keep track of the deferred operations on `InMemoryDataset` objects.
- `AppendChunk`: a class to help with deferring writes when the user calls `.append`. A similar scheme would simplify the code for `__setitem__` and `resize` calls, but I've left that refactor for another PR.
- `partition`: a method which partitions an `index.Tuple` or `np.ndarray` object into chunks of a requested size (see the sketch after this list). I realize that there's already an `ndindex` function which does this, but the syntax is arcane. Parts of this PR took a long time to comprehend, so I'm trying to reduce the maintenance burden going forward.
- New helpers in `slicetools.py` (see the sketch after this list):
  - `to_slice_tuple`: convert a `Tuple` of arbitrary `ndindex` types to a `Tuple` of `Slice` types, so that `obj.args[0].start` etc. doesn't fail
  - `to_raw_index`: convert a relative (virtual) index to an index in a raw chunk
  - `get_vchunk_overlap`: helper to get the overlap of an arbitrary virtual chunk and an index into a virtual dataset
  - `get_shape`: helper which gets the size of an index along each dimension
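As a rough illustration of the deferred-computation scheme described in the first bullet above, the sketch below records each user action as a lightweight object and replays them when the staged version is resolved. The class names mirror `SetOperation`, `ResizeOperation`, and `AppendOperation` from this PR, but the bodies and the `DeferredDataset`/`resolve` names are hypothetical simplifications, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any

import numpy as np


@dataclass
class SetOperation:
    index: Any            # the index passed to __setitem__
    value: np.ndarray


@dataclass
class ResizeOperation:
    shape: tuple


@dataclass
class AppendOperation:
    value: np.ndarray


@dataclass
class DeferredDataset:
    """Hypothetical stand-in for the bookkeeping InMemoryDataset now does."""

    operations: list = field(default_factory=list)

    def __setitem__(self, index, value):
        # Nothing touches the file yet; the operation is only recorded.
        self.operations.append(SetOperation(index, np.asarray(value)))

    def resize(self, shape):
        self.operations.append(ResizeOperation(tuple(shape)))

    def append(self, value):
        self.operations.append(AppendOperation(np.asarray(value)))

    def resolve(self):
        # Conceptually what happens when the stage_version context exits:
        # the recorded operations are replayed, in order, against the file.
        for op in self.operations:
            ...  # write slices, resize the dataset, or fill the last chunk
```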
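The `partition` helper splits an index or array into chunk-sized pieces. As a rough, hypothetical illustration of the idea (not the PR's actual signature, which works on `index.Tuple`/`np.ndarray` objects in arbitrary dimensions), partitioning a 1-D extent into fixed-size slices might look like this:

```python
from ndindex import Slice


def partition_1d(length: int, chunk_size: int) -> list[Slice]:
    """Split the range [0, length) into consecutive chunk-sized slices.

    Illustrative helper only; the last slice may be shorter than chunk_size.
    """
    return [
        Slice(start, min(start + chunk_size, length))
        for start in range(0, length, chunk_size)
    ]


# Example: length 25 with chunk size 10 yields three pieces, the last of
# which covers only the partially filled final chunk.
print(partition_1d(25, 10))
```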
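For the `slicetools.py` helpers, the intent of `to_slice_tuple` is to normalize every element of an ndindex `Tuple` to a `Slice` so that attributes like `.start` are always available. A hedged sketch of that idea, which handles only integers and slices (an assumption about the scope, not the PR's actual code):

```python
from ndindex import Integer, Slice, Tuple


def to_slice_tuple(idx: Tuple) -> Tuple:
    """Convert every element of an ndindex Tuple to a Slice.

    Illustrative sketch only: integers become length-1 slices; anything
    else is assumed to already be a Slice.
    """
    converted = []
    for arg in idx.args:
        if isinstance(arg, Integer):
            converted.append(Slice(arg.raw, arg.raw + 1))
        else:
            converted.append(arg)
    return Tuple(*converted)


# Example: Tuple(3, Slice(0, 5)) -> Tuple(Slice(3, 4), Slice(0, 5))
print(to_slice_tuple(Tuple(3, Slice(0, 5))))
```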