
Detect cases where unused chunk space can be written to #313

Conversation

peytondmurray (Collaborator) commented on Feb 22, 2024

This PR adds an .append() method to allow InMemoryDataset objects to append data to any unused space in the last chunk. If no free space exists, data is written as usual into a new chunk.

Closes #295.
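As a quick usage illustration (a minimal sketch; the file name, dataset name, and version names are just examples), an append that fits into the unused tail of the last chunk reuses that chunk instead of allocating a new one:

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

# Stage an initial version whose data only partially fills one chunk.
with h5py.File("data.h5", "w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r0") as sv:
        sv.create_dataset("values", data=np.arange(5), chunks=(10,))

# The appended data fits in the 5 unused elements of the last chunk,
# so no new raw chunk should be needed.
with h5py.File("data.h5", "r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r1") as sv:
        sv["values"].append(np.arange(5, 8))
```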

Changes

  • Refactored the data_dict system that previously kept track of the slices in an InMemoryDataset. The data_dict used to be a member of the low-level InMemoryDatasetID object and contained mappings between slices in the virtual dataset and slices in the raw dataset. After careful consideration, it doesn't seem necessary to keep this data on the lower-level object; we can keep it on the InMemoryDataset instead, significantly reducing the conceptual complexity of the data model.

Previously, when the user called InMemoryDataset.resize or InMemoryDataset.__setitem__, the data_dict mapping tracked slices as they changed; when exiting the stage_version context manager for a version, this data_dict was then used to both

  1. Write new data to the underlying raw_data referenced by the virtual dataset
  2. Provide the virtual slices that made up the new version of the dataset

This system has been replaced with a more explicit scheme which defers all computation until execution exits the stage_version context manager. With the new scheme, the user can manipulate data in three ways:

  1. InMemoryDataset.__setitem__
  2. InMemoryDataset.resize
  3. InMemoryDataset.append

In all cases, when the user calls one of these methods the InMemoryDataset records the manipulation; all computations are resolved when the stage_version context is exited. We should therefore expect some of the computation time that would otherwise be spent inside the context manager to shift to the moment the context manager exits.
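Conceptually, the deferred scheme looks something like the sketch below (heavily simplified, with made-up signatures: the real SetOperation/ResizeOperation/AppendOperation live in backend.py and their apply() takes the file, dataset name, version, and the current slices/shape rather than an in-memory array):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class AppendOperation:
    """Record of a deferred append; nothing is written when it is created."""

    values: np.ndarray

    def apply(self, data: np.ndarray) -> np.ndarray:
        # Resolve the append against the state left by all earlier operations.
        return np.concatenate([data, self.values])


@dataclass
class DeferredDataset:
    """Toy stand-in for InMemoryDataset's operation queue."""

    data: np.ndarray
    _operations: list = field(default_factory=list)

    def append(self, values) -> None:
        # Calling append only records the operation.
        self._operations.append(AppendOperation(np.asarray(values)))

    def commit(self) -> np.ndarray:
        # On exiting stage_version, the recorded operations are applied in
        # order and then cleared, mirroring write_dataset_operations().
        for op in self._operations:
            self.data = op.apply(self.data)
        self._operations.clear()
        return self.data
```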

  • InMemoryDataset now keeps track of the last element written to its corresponding raw_data, which is needed for append operations
  • backend.create_virtual_dataset can now accept either Slice objects or Tuple objects in the values of the slices parameter
  • Added SetOperation, ResizeOperation, and AppendOperation helper classes which are used to keep track of the deferred operations on InMemoryDataset objects
  • Added an AppendChunk class to help with deferring writes when the user calls .append. A similar scheme would simplify the code for __setitem__ and resize calls, but I've left that refactor for another PR.
  • Added a partition method which partitions an ndindex.Tuple or np.ndarray object into chunks of a requested size (see the sketch after this list). I realize that there's already an ndindex function which does this, but the syntax is arcane. Parts of this PR took a long time to comprehend, so I'm trying to reduce the maintenance burden going forward.
  • Added a few functions in slicetools.py:
    • to_slice_tuple: convert Tuple of arbitrary ndindex types to Tuple of Slice types, so that obj.args[0].start etc doesn't fail
    • to_raw_index: convert a relative (virtual) index to an index in a raw chunk
    • get_vchunk_overlap: helper to get the overlap of an arbitrary virtual chunk and an index into a virtual dataset
    • get_shape: helper which gets the size of an index along each dimension
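For reference, partition and get_shape behave roughly like the sketch below (hypothetical, 1-D-only signatures shown for illustration; the real helpers operate on general ndindex types):

```python
from ndindex import Slice, Tuple


def get_shape(index: Tuple) -> tuple[int, ...]:
    # Size of the index along each dimension, assuming each arg is a
    # fully-specified Slice with step 1.
    return tuple(s.stop - s.start for s in index.args)


def partition(index: Slice, chunk_size: int) -> list[Slice]:
    # Split a 1-D Slice into consecutive pieces of at most chunk_size elements.
    return [
        Slice(start, min(start + chunk_size, index.stop))
        for start in range(index.start, index.stop, chunk_size)
    ]
```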

peytondmurray marked this pull request as draft on February 22, 2024 19:21
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from f3378e6 to 1b2ff24 on March 5, 2024 23:42
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from 6d67cb1 to 6c8c1f0 on March 29, 2024 22:49
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch 3 times, most recently from 9d70fb7 to 883bbcf, on April 22, 2024 22:59
peytondmurray marked this pull request as ready for review on April 26, 2024 05:14
peytondmurray changed the title from "[WIP] Detect cases where unused chunk space can be written to" to "Detect cases where unused chunk space can be written to" on April 30, 2024
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from ba0a977 to 6a8a8ec on April 30, 2024 22:19
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from 6a8a8ec to 6457935 on May 8, 2024 16:55
peytondmurray force-pushed the detect-appends-and-reuse-chunks branch from f18c041 to 43cb300 on May 9, 2024 00:41
ArvidJB (Collaborator) commented on May 22, 2024

A couple of meta comments first:

  1. Is it possible to break this huge change into multiple smaller changes? We could start with only appends.
  2. I am not a huge fan of breaking things up into WriteOperation objects and delaying their application until commit. I think that introduces a lot of extra complexity in order to handle all combinations.
    1. Let's definitely add a lot more tests for various combinations like append after insert after append after update, etc.
    2. What's the memory overhead for handling these operations lazily versus applying them directly? If I stack up lots of operations, wouldn't that become very expensive?
  3. How hard would it be to handle multi-dimensional dataset appends as well?

I tried to run some simple tests and it broke pretty quickly, unfortunately:

In [9]: with h5py.File('data.h5', 'w') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     with vf.stage_version('r0') as sv:
   ...:         sv.create_dataset('values', data=np.arange(5), chunks=(10,))
   ...:

In [10]: with h5py.File('data.h5', 'r+') as f:
    ...:     vf = VersionedHDF5File(f)
    ...:     with vf.stage_version('r1') as sv:
    ...:         sv['values'].append(np.arange(5, 7))
    ...:         sv['values'][3:6] = np.arange(-3, -6, -1)
    ...:         sv['values'].append(np.arange(7, 8))
    ...:
Entering debugger.  Use 'n' to step, 'c' to continue running, 'q' to quit Python completely.
> /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py(1463)get_updated_chunks()
   1461                 overlap_data[relative_overlap.raw] = arr[arr_overlap.raw]
   1462                 breakpoint()
-> 1463                 new_chunks[vchunk] = overlap_data
   1464             elif isinstance(rchunk, WriteChunk):
   1465                 raise NotImplementedError

ipdb> c
Continuing execution.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[10], line 3
      1 with h5py.File('data.h5', 'r+') as f:
      2     vf = VersionedHDF5File(f)
----> 3     with vf.stage_version('r1') as sv:
      4         sv['values'].append(np.arange(5, 7))
      5         sv['values'][3:6] = np.arange(-3, -6, -1)

File /opt/python/python-3.11/lib64/python3.11/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    142 if typ is None:
    143     try:
--> 144         next(self.gen)
    145     except StopIteration:
    146         return False

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/api.py:307, in VersionedHDF5File.stage_version(self, version_name, prev_version, make_current, timestamp)
    305     yield group
    306     group.close()
--> 307     commit_version(
    308         group,
    309         group.datasets(),
    310         make_current=make_current,
    311         chunks=group.chunks,
    312         compression=group.compression,
    313         compression_opts=group.compression_opts,
    314         timestamp=timestamp,
    315     )
    317     self._log_version_diff_stats(old_current, self.current_version)
    319 except:

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/versions.py:151, in commit_version(version_group, datasets, make_current, chunks, compression, compression_opts, timestamp)
    148             data_copy.attrs[k] = v
    149         continue
--> 151     slices, shape = write_dataset_operations(f, version_name, name, data)
    153 elif isinstance(data, dict):
    154     if chunks[name] is not None:

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:1071, in write_dataset_operations(f, version_name, name, dataset)
   1046 def write_dataset_operations(
   1047     f: File,
   1048     version_name: str,
   1049     name: str,
   1050     dataset: "InMemoryDataset",
   1051 ) -> tuple[Dict[Tuple, Tuple], tuple[int, ...]]:
   1052     """Write any pending dataset operations to the file before clearing them.
   1053
   1054     Parameters
   (...)
   1069         shape of the new virtual dataset
   1070     """
-> 1071     chunks, shape = write_operations(f, version_name, name, dataset._operations)
   1072     result = write_dataset_chunks(f, name, chunks, shape)
   1073     dataset._operations.clear()

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:1112, in write_operations(f, version_name, name, operations)
   1109 shape = get_previous_version_shape(f, version_name, name)
   1111 for operation in operations:
-> 1112     slices, shape = operation.apply(f, name, version_name, slices, shape)
   1114 sorted_slices = dict(sorted(list(slices.items()), key=lambda s: s[0].args[0].start))
   1116 return sorted_slices, shape

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:1012, in AppendOperation.apply(self, f, name, version, slices, shape)
   1009 vchunk = split.get_new_last_vchunk(slices)
   1011 # Get the indices to write the new data into
-> 1012 append_slice = split.get_append_rchunk_slice(slices)
   1014 # Remove the last chunk of the virtual dataset; we are
   1015 # replacing it with the chunk containing the appended data
   1016 del slices[last_virtual_slice]

File /codemill/bessen/opensource/versioned-hdf5/versioned_hdf5/backend.py:208, in SplitResult.get_append_rchunk_slice(self, slices)
    194 """Get the slice into the raw data where the new data will be appended into.
    195
    196 Parameters
   (...)
    205     Slice of the raw dataset where the data is to be appended
    206 """
    207 last_rchunk = list(slices.values())[-1]
--> 208 dim0 = last_rchunk.args[0]
    209 return Tuple(
    210     Slice(dim0.stop, dim0.stop + self.arr_to_append.shape[0]),
    211     *last_rchunk.args[1:],
    212 )

AttributeError: 'numpy.ndarray' object has no attribute 'args'

peytondmurray (Collaborator, Author) commented on May 22, 2024

Thanks for the feedback; having an initial impression is useful here.

Is it possible to break this huge change into multiple smaller changes? We could start with only appends.

One of the major issues I encountered here was figuring out where writes need to be made in memory vs. to a file. If we decide to take the lazy approach described above, I don't think it will work to just add the append method, because we will need to solve all sorts of edge cases where writes/appends/reads are interspersed. But that brings us to your next point:

I am not a huge fan of breaking things up into WriteOperation objects and delaying their application until commit. I think that introduces a lot of extra complexity in order to handle all combinations.

Understandable, and you're probably right about this. Let me get back to you about how we can move forward before we start considering the other implications of this approach.
