
Detect cases where unused chunk space can be written to #313

Conversation

peytondmurray (Collaborator) commented on Feb 22, 2024

This PR adds an .append() method to allow InMemoryDataset objects to append data to any unused space in the last chunk. If no free space exists, data is written as usual into a new chunk.

Closes #295.
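As a quick usage illustration (a minimal sketch; the file name, dataset name, and version names are just examples), an append that fits into the unused tail of the last chunk reuses that chunk instead of allocating a new one:

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

# Stage an initial version whose data only partially fills one chunk.
with h5py.File("data.h5", "w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r0") as sv:
        sv.create_dataset("values", data=np.arange(5), chunks=(10,))

# The appended data fits in the 5 unused elements of the last chunk,
# so no new raw chunk should be needed.
with h5py.File("data.h5", "r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r1") as sv:
        sv["values"].append(np.arange(5, 8))
```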

Changes

  • Refactored the data_dict system that previously kept track of the slices in an InMemoryDataset. The data_dict used to be a member of the low-level InMemoryDatasetID object and contained mappings between slices in the virtual dataset and slices in the raw dataset. After careful consideration, it doesn't seem necessary to keep this data on the lower-level object; we can keep it on the InMemoryDataset instead, significantly reducing the conceptual complexity of the data model.

Previously, when the user called InMemoryDataset.resize or InMemoryDataset.__setitem__, the data_dict mapping tracked slices as they changed; when exiting the stage_version context manager for a version, this data_dict was then used to both

  1. Write new data to the underlying raw_data referenced by the virtual dataset
  2. Provide the virtual slices that made up the new version of the dataset

This system has been replaced with a more explicit scheme which defers all computation until execution exits the stage_version context manager. With the new scheme, the user can manipulate data in three ways:

  1. InMemoryDataset.__setitem__
  2. InMemoryDataset.resize
  3. InMemoryDataset.append

In all cases, when the user calls one of these methods the InMemoryDataset records the manipulation; all computations are resolved when the stage_version context is exited. We should therefore expect some of the computation time that would otherwise be spent inside the context manager to shift to the moment the context manager exits.
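Conceptually, the deferred scheme looks something like the sketch below (heavily simplified, with made-up signatures: the real SetOperation/ResizeOperation/AppendOperation live in backend.py and their apply() takes the file, dataset name, version, and the current slices/shape rather than an in-memory array):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class AppendOperation:
    """Record of a deferred append; nothing is written when it is created."""

    values: np.ndarray

    def apply(self, data: np.ndarray) -> np.ndarray:
        # Resolve the append against the state left by all earlier operations.
        return np.concatenate([data, self.values])


@dataclass
class DeferredDataset:
    """Toy stand-in for InMemoryDataset's operation queue."""

    data: np.ndarray
    _operations: list = field(default_factory=list)

    def append(self, values) -> None:
        # Calling append only records the operation.
        self._operations.append(AppendOperation(np.asarray(values)))

    def commit(self) -> np.ndarray:
        # On exiting stage_version, the recorded operations are applied in
        # order and then cleared, mirroring write_dataset_operations().
        for op in self._operations:
            self.data = op.apply(self.data)
        self._operations.clear()
        return self.data
```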

  • InMemoryDataset now keeps track of the last element written to its corresponding raw_data, which is needed for append operations
  • backend.create_virtual_dataset can now accept either Slice objects or Tuple objects in the values of the slices parameter
  • Added SetOperation, ResizeOperation, and AppendOperation helper classes which are used to keep track of the deferred operations on InMemoryDataset objects
  • Added an AppendChunk class to help with deferring writes when the user calls .append. A similar scheme would simplify the code for __setitem__ and resize calls, but I've left that refactor for another PR.
  • Added a partition method which partitions an ndindex.Tuple or np.ndarray object into chunks of a requested size (see the sketch after this list). I realize that there's already an ndindex function which does this, but the syntax is arcane. Parts of this PR took a long time to comprehend, so I'm trying to reduce the maintenance burden going forward.
  • Added a few functions in slicetools.py:
    • to_slice_tuple: convert Tuple of arbitrary ndindex types to Tuple of Slice types, so that obj.args[0].start etc doesn't fail
    • to_raw_index: convert a relative (virtual) index to an index in a raw chunk
    • get_vchunk_overlap: helper to get the overlap of an arbitrary virtual chunk and an index into a virtual dataset
    • get_shape: helper which gets the size of an index along each dimension
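For reference, partition and get_shape behave roughly like the sketch below (hypothetical, 1-D-only signatures shown for illustration; the real helpers operate on general ndindex types):

```python
from ndindex import Slice, Tuple


def get_shape(index: Tuple) -> tuple[int, ...]:
    # Size of the index along each dimension, assuming each arg is a
    # fully-specified Slice with step 1.
    return tuple(s.stop - s.start for s in index.args)


def partition(index: Slice, chunk_size: int) -> list[Slice]:
    # Split a 1-D Slice into consecutive pieces of at most chunk_size elements.
    return [
        Slice(start, min(start + chunk_size, index.stop))
        for start in range(index.start, index.stop, chunk_size)
    ]
```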

peytondmurray marked this pull request as draft on February 22, 2024 19:21
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from f3378e6 to 1b2ff24 on March 5, 2024 23:42
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from 6d67cb1 to 6c8c1f0 on March 29, 2024 22:49
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch 3 times, most recently from 9d70fb7 to 883bbcf, on April 22, 2024 22:59
peytondmurray marked this pull request as ready for review on April 26, 2024 05:14
peytondmurray changed the title from "[WIP] Detect cases where unused chunk space can be written to" to "Detect cases where unused chunk space can be written to" on April 30, 2024
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from ba0a977 to 6a8a8ec on April 30, 2024 22:19
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from 6a8a8ec to 6457935 on May 8, 2024 16:55
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from f18c041 to 43cb300 on May 9, 2024 00:41
ArvidJB (Collaborator) commented on May 22, 2024

A couple of meta comments first:

  1. Is it possible to break this huge change into multiple smaller changes? We could start with only appends.
  2. I am not a huge fan of breaking things up into WriteOperation objects and delaying their application until commit. I think that introduces a lot of extra complexity in order to handle all combinations.
    1. Let's definitely add a lot more tests for various combinations like append after insert after append after update, etc.
    2. What's the memory overhead for handling these operations lazily versus applying them directly? If I stack up lots of operations, wouldn't that become very expensive?
  3. How hard would it be to handle multi-dimensional dataset appends as well?

I tried to run some simple tests and it broke pretty quickly, unfortunately:

In [9]: with h5py.File('data.h5', 'w') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     with vf.stage_version('r0') as sv:
   ...:         sv.create_dataset('values', data=np.arange(5), chunks=(10,))
   ...:

In [10]: with h5py.File('data.h5', 'r+') as f:
    ...:     vf = VersionedHDF5File(f)
    ...:     with vf.stage_version('r1') as sv:
    ...:         sv['values'].append(np.arange(5, 7))
    ...:         sv['values'][3:6] = np.arange(-3, -6, -1)
    ...:         sv['values'].append(np.arange(7, 8))
    ...:
Entering debugger.  Use 'n' to step, 'c' to continue running, 'q' to quit Python completely.
> /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py(1463)get_updated_chunks()
   1461                 overlap_data[relative_overlap.raw] = arr[arr_overlap.raw]
   1462                 breakpoint()
-> 1463                 new_chunks[vchunk] = overlap_data
   1464             elif isinstance(rchunk, WriteChunk):
   1465                 raise NotImplementedError

ipdb> c
Continuing execution.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[10], line 3
      1 with h5py.File('data.h5', 'r+') as f:
      2     vf = VersionedHDF5File(f)
----> 3     with vf.stage_version('r1') as sv:
      4         sv['values'].append(np.arange(5, 7))
      5         sv['values'][3:6] = np.arange(-3, -6, -1)

File /opt/python/python-3.11/lib64/python3.11/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    142 if typ is None:
    143     try:
--> 144         next(self.gen)
    145     except StopIteration:
    146         return False

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/api.py:307, in VersionedHDF5File.stage_version(self, version_name, prev_version, make_current, timestamp)
    305     yield group
    306     group.close()
--> 307     commit_version(
    308         group,
    309         group.datasets(),
    310         make_current=make_current,
    311         chunks=group.chunks,
    312         compression=group.compression,
    313         compression_opts=group.compression_opts,
    314         timestamp=timestamp,
    315     )
    317     self._log_version_diff_stats(old_current, self.current_version)
    319 except:

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/versions.py:151, in commit_version(version_group, datasets, make_current, chunks, compression, compression_opts, timestamp)
    148             data_copy.attrs[k] = v
    149         continue
--> 151     slices, shape = write_dataset_operations(f, version_name, name, data)
    153 elif isinstance(data, dict):
    154     if chunks[name] is not None:

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:1071, in write_dataset_operations(f, version_name, name, dataset)
   1046 def write_dataset_operations(
   1047     f: File,
   1048     version_name: str,
   1049     name: str,
   1050     dataset: "InMemoryDataset",
   1051 ) -> tuple[Dict[Tuple, Tuple], tuple[int, ...]]:
   1052     """Write any pending dataset operations to the file before clearing them.
   1053
   1054     Parameters
   (...)
   1069         shape of the new virtual dataset
   1070     """
-> 1071     chunks, shape = write_operations(f, version_name, name, dataset._operations)
   1072     result = write_dataset_chunks(f, name, chunks, shape)
   1073     dataset._operations.clear()

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:1112, in write_operations(f, version_name, name, operations)
   1109 shape = get_previous_version_shape(f, version_name, name)
   1111 for operation in operations:
-> 1112     slices, shape = operation.apply(f, name, version_name, slices, shape)
   1114 sorted_slices = dict(sorted(list(slices.items()), key=lambda s: s[0].args[0].start))
   1116 return sorted_slices, shape

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:1012, in AppendOperation.apply(self, f, name, version, slices, shape)
   1009 vchunk = split.get_new_last_vchunk(slices)
   1011 # Get the indices to write the new data into
-> 1012 append_slice = split.get_append_rchunk_slice(slices)
   1014 # Remove the last chunk of the virtual dataset; we are
   1015 # replacing it with the chunk containing the appended data
   1016 del slices[last_virtual_slice]

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:208, in SplitResult.get_append_rchunk_slice(self, slices)
    194 """Get the slice into the raw data where the new data will be appended into.
    195
    196 Parameters
   (...)
    205     Slice of the raw dataset where the data is to be appended
    206 """
    207 last_rchunk = list(slices.values())[-1]
--> 208 dim0 = last_rchunk.args[0]
    209 return Tuple(
    210     Slice(dim0.stop, dim0.stop + self.arr_to_append.shape[0]),
    211     *last_rchunk.args[1:],
    212 )

AttributeError: 'numpy.ndarray' object has no attribute 'args'

peytondmurray (Collaborator, Author) commented on May 22, 2024

Thanks for the feedback; having an initial impression is useful here.

Is it possible to break this huge change into multiple smaller changes? We could start with only appends.

One of the major issues I encountered here was figuring out where writes need to be made in memory vs. to a file. If we decide to take the lazy approach described above, I don't think it will work to just add the append method, because we will need to solve all sorts of edge cases where writes/appends/reads are interspersed. But that brings us to your next point:

I am not a huge fan of breaking things up into WriteOperation objects and delaying their application until commit. I think that introduces a lot of extra complexity in order to handle all combinations.

Understandable, and you're probably right about this. Let me get back to you about how we can move forward before we start considering the other implications of this approach.
