Add append method for InMemoryDataset #327

Closed
wants to merge 11 commits

Conversation

peytondmurray
Collaborator

@peytondmurray commented Jun 1, 2024

Following up from #313, this PR takes a different approach to implementing an append method:

  • Calls to append are executed right away, rather than lazily as proposed in #313 (Detect cases where unused chunk space can be written to). This is the same eager scheme the rest of the code base currently uses.
  • Implementation is considerably more compact, touching much less code.

I still needed to add an AppendData object because write_dataset_chunks currently only writes data into new chunks at commit time. AppendData instead stores the data to append, the raw indices where that data should be written, and the corresponding virtual slice it belongs to.
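
Conceptually, an AppendData record just bundles those three pieces of information together until commit time. A minimal sketch (field names here are illustrative, not the actual implementation):

from dataclasses import dataclass

import numpy as np
from ndindex import Tuple


@dataclass
class AppendData:
    """Data staged by append(), to be written into the raw dataset at commit time."""

    array: np.ndarray      # the values being appended
    raw_index: Tuple       # raw-dataset indices where the values should be written
    virtual_index: Tuple   # the virtual slice of the dataset these values belong to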

@peytondmurray force-pushed the add-append-api branch 3 times, most recently from 7d547d3 to 8f67094 on June 5, 2024 15:44
@peytondmurray
Collaborator Author

Just adding tests from this point on.

@ArvidJB

This comment was marked as resolved.

@ArvidJB
Collaborator

ArvidJB commented Jun 7, 2024

Sorry, the failure above is unrelated to the changes in this PR. git bisect points to 7662655. I will open a separate issue.

@ArvidJB
Collaborator

ArvidJB commented Jun 13, 2024

I was wondering where we track which part of a chunk is reused in the commit, so I tried this out and got the following error:

In [2]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values',
   ...:                 data=np.array([[1, 1, 1, 1, 1, 1],
   ...:                                [2, 2, 2, 2, 2, 2],
   ...:                                [3, 3, 3, 3, 3, 3],
   ...:                                [4, 4, 4, 4, 4, 4]]),
   ...:                 chunks=(3, 3), maxshape=(None, None))
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        with vf.stage_version("r1") as sv:
   ...:             sv['values'].append(np.array([[5, 5, 5, -5, -5, -5]]))
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        cv = vf[vf.current_version]
   ...:        print(cv['values'][:])
   ...:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 14
     12    vf = VersionedHDF5File(f)
     13    with vf.stage_version("r1") as sv:
---> 14         sv['values'].append(np.array([[5, 5, 5, -5, -5, -5]]))
     15 with h5py.File(d / 'data.h5', 'r') as f:
     16    vf = VersionedHDF5File(f)

File /codemill/bessen/ndindex_venv/lib64/python3.11/site-packages/versioned_hdf5/wrappers.py:1012, in InMemoryDataset.append(self, arr)
   1007 else:
   1008     # The existing data must always exist in the old data dict
   1009     chunk_extant_vindex = Tuple(
   1010         Slice(chunk.args[0].start, old_shape[0]), *other_dims
   1011     ).expand(self.shape)
-> 1012     assert chunk_extant_vindex in old_data_dict
   1014     # In cases where __setitem__ is called and the InMemoryDataset hasn't yet been
   1015     # committed, values in the data_dict contain np.ndarray objects instead of slices.
   1016     # Handle this by just appending the data here to the chunk to be written.
   1017     if isinstance(old_data_dict[chunk_extant_vindex], np.ndarray):

AssertionError:

@peytondmurray
Collaborator Author

Looks like there was an indexing issue when appending multidimensional datasets. I would have thought this would be caught by test_multidim_random_axes, but for some reason it wasn't.

The issue was that, in the process of appending data to a dataset, we need to get the extant data from the dataset's data_dict. The calculation that searches the data_dict for the part overlapping each chunk of the newly-resized dataset was using the dimensions of the entire dataset rather than the dimensions of the chunk. Correcting this made the rest of the logic work as intended.
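
To illustrate the idea with a standalone sketch (using ndindex directly; this is not the PR's actual code): when walking the chunks of the resized dataset, the extant portion of each chunk has to be clipped to that chunk's own bounds along the append axis, not to the bounds of the whole dataset.

from ndindex import ChunkSize, Slice, Tuple

old_shape = (4, 6)   # shape before the append
new_shape = (5, 6)   # shape after appending one row
chunks = ChunkSize((3, 3))

for chunk in chunks.indices(new_shape):
    # `chunk` is a Tuple of Slices selecting one chunk of the *resized* dataset.
    first = chunk.args[0]
    # The extant part of this chunk ends at the chunk's own stop or the old
    # dataset's edge, whichever comes first -- not at new_shape[0].
    extant_stop = min(first.stop, old_shape[0])
    if extant_stop > first.start:
        extant_vindex = Tuple(Slice(first.start, extant_stop), *chunk.args[1:])
        print(chunk, "->", extant_vindex)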

@peytondmurray force-pushed the add-append-api branch 2 times, most recently from cce87eb to b331391 on June 19, 2024 22:06
@ArvidJB
Collaborator

ArvidJB commented Jun 21, 2024

I updated to the latest changes; the code snippet I posted no longer fails, but it now stores incorrect values:

In [2]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values',
   ...:                 data=np.array([[1, 1, 1, 1, 1, 1],
   ...:                                [2, 2, 2, 2, 2, 2],
   ...:                                [3, 3, 3, 3, 3, 3],
   ...:                                [4, 4, 4, 4, 4, 4]]),
   ...:                 chunks=(3, 3), maxshape=(None, None))
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        with vf.stage_version("r1") as sv:
   ...:             sv['values'].append(np.array([[5, 5, 5, -5, -5, -5]]))
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        cv = vf[vf.current_version]
   ...:        print(cv['values'][:])
   ...:
[[ 1  1  1  1  1  1]
 [ 2  2  2  2  2  2]
 [ 3  3  3  3  3  3]
 [ 4  4  4  4  4  4]
 [-5 -5 -5 -5 -5 -5]]

@peytondmurray
Collaborator Author

For certain multidimensional dataset shapes, an append could be split into subchunks that targeted the same raw chunk. That's what happened in your example above: the erroneous data you saw came from one append subchunk overwriting the raw data needed by another.

I've added a new branch so that only one append is allowed to target each individual raw chunk.
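
Roughly, the guard works like this (a hypothetical, self-contained sketch, not the actual diff): the first append subchunk to reuse a given raw chunk may do so, and any later subchunk landing on the same raw chunk is redirected to freshly allocated space.

pending_raw_chunks: set[int] = set()
next_free_chunk = 1000  # pretend index of the next unallocated raw chunk

def raw_target_for_append(candidate: int) -> int:
    """Pick the raw chunk an append subchunk should write to.

    Only one append may reuse a given raw chunk per commit; later subchunks
    that target the same raw chunk get a newly allocated chunk instead.
    """
    global next_free_chunk
    if candidate in pending_raw_chunks:
        new_chunk = next_free_chunk
        next_free_chunk += 1
        pending_raw_chunks.add(new_chunk)
        return new_chunk
    pending_raw_chunks.add(candidate)
    return candidate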

@ArvidJB
Collaborator

ArvidJB commented Jun 22, 2024

Here's some more corrupted data (with the latest changes):

In [6]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values', data=np.arange(3), chunks=(5,), maxshape=(None,))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r1") as sv:
   ...:             values = sv['values']
   ...:             values.append(np.array([1, 2]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         cv = vf[vf.current_version]
   ...:         print(cv['values'][:])
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r2") as sv:
   ...:             values = sv['values']
   ...:             values.resize((8,))
   ...:             values[5:8] = np.arange(3)
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r3") as sv:
   ...:              values = sv['values']
   ...:              values.append(np.array([3, 4]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         cv = vf[vf.current_version]
   ...:         print(cv['values'][:])
   ...:
[0 1 2 1 2]
[0 1 2 3 4 0 1 2 3 4]

@ArvidJB
Collaborator

ArvidJB commented Jun 24, 2024

This also silently corrupts older versions:

In [2]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values', data=np.arange(3), chunks=(5,), maxshape=(None,))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r1") as sv:
   ...:             values = sv['values']
   ...:             values.append(np.array([1, 2]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         cv = vf[vf.current_version]
   ...:         print(cv['values'][:])
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r2") as sv:
   ...:             values = sv['values']
   ...:             values.resize((3,))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r3") as sv:
   ...:              values = sv['values']
   ...:              values.append(np.array([3, 4]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         # get older version, should not have changes
   ...:         v1 = vf['r1']
   ...:         print(v1['values'][:])
   ...:
[0 1 2 1 2]
[0 1 2 3 4]

@peytondmurray
Collaborator Author

Yep, I started adding a check for this earlier today and just finished it. The check ensures that no chunk in a previous version points to the space in the raw dataset we are targeting for the append; if that space is already referenced, we write to a new chunk instead.
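
In essence the check is an overlap test against the raw space referenced by every committed version. A simplified 1-D sketch (not the actual code):

def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if the half-open raw ranges [a0, a1) and [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def can_append_in_place(target: tuple[int, int], prior_raw_ranges) -> bool:
    """Reusing the unused tail of a raw chunk is only safe when no chunk of
    any previously committed version references that raw space."""
    return not any(overlaps(target, r) for r in prior_raw_ranges)

# Example: a previous version references raw rows [0, 5); appending in place
# into rows [3, 5) would overwrite data that version still points at.
print(can_append_in_place((3, 5), [(0, 5)]))  # False -> write to a new chunk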

@peytondmurray
Collaborator Author

Closing for now.
