Add append method for InMemoryDataset #327

Closed
wants to merge 11 commits

Conversation

peytondmurray
Collaborator

@peytondmurray commented Jun 1, 2024

Following up from #313, this PR takes a different approach to implementing an append method:

  • Calls to append are executed right away, rather than lazily as proposed in #313 (Detect cases where unused chunk space can be written to). This is the same eager scheme the rest of the code base currently uses.
  • Implementation is considerably more compact, touching much less code.

I still needed to add an AppendData object because write_dataset_chunks currently only writes data into new chunks at commit time. AppendData instead stores the data to append, the raw indices where that data should be written, and the corresponding virtual slice it belongs to.
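
Conceptually, an AppendData record just bundles those three pieces of information together until commit time. A minimal sketch (field names here are illustrative, not the actual implementation):

from dataclasses import dataclass

import numpy as np
from ndindex import Tuple


@dataclass
class AppendData:
    """Data staged by append(), to be written into the raw dataset at commit time."""

    array: np.ndarray      # the values being appended
    raw_index: Tuple       # raw-dataset indices where the values should be written
    virtual_index: Tuple   # the virtual slice of the dataset these values belong to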

@peytondmurray force-pushed the add-append-api branch 3 times, most recently from 7d547d3 to 8f67094 on June 5, 2024 15:44
@peytondmurray
Collaborator Author

Just adding tests from this point on.

@ArvidJB

This comment was marked as resolved.

@ArvidJB
Collaborator

ArvidJB commented Jun 7, 2024

Sorry, the failure above is unrelated to the changes in this PR. git bisect points to 7662655. I will open a separate issue.

@ArvidJB
Collaborator

ArvidJB commented Jun 13, 2024

I was wondering where we track which part of a chunk is reused in the commit, so I tried this out and got the following error:

In [2]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values',
   ...:                 data=np.array([[1, 1, 1, 1, 1, 1],
   ...:                                [2, 2, 2, 2, 2, 2],
   ...:                                [3, 3, 3, 3, 3, 3],
   ...:                                [4, 4, 4, 4, 4, 4]]),
   ...:                 chunks=(3, 3), maxshape=(None, None))
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        with vf.stage_version("r1") as sv:
   ...:             sv['values'].append(np.array([[5, 5, 5, -5, -5, -5]]))
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        cv = vf[vf.current_version]
   ...:        print(cv['values'][:])
   ...:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 14
     12    vf = VersionedHDF5File(f)
     13    with vf.stage_version("r1") as sv:
---> 14         sv['values'].append(np.array([[5, 5, 5, -5, -5, -5]]))
     15 with h5py.File(d / 'data.h5', 'r') as f:
     16    vf = VersionedHDF5File(f)

File /codemill/bessen/ndindex_venv/lib64/python3.11/site-packages/versioned_hdf5/wrappers.py:1012, in InMemoryDataset.append(self, arr)
   1007 else:
   1008     # The existing data must always exist in the old data dict
   1009     chunk_extant_vindex = Tuple(
   1010         Slice(chunk.args[0].start, old_shape[0]), *other_dims
   1011     ).expand(self.shape)
-> 1012     assert chunk_extant_vindex in old_data_dict
   1014     # In cases where __setitem__ is called and the InMemoryDataset hasn't yet been
   1015     # committed, values in the data_dict contain np.ndarray objects instead of slices.
   1016     # Handle this by just appending the data here to the chunk to be written.
   1017     if isinstance(old_data_dict[chunk_extant_vindex], np.ndarray):

AssertionError:

@peytondmurray
Collaborator Author

Looks like there was an indexing issue when appending multidimensional datasets. I would have thought this would be caught by test_multidim_random_axes, but for some reason it wasn't.

The issue was that, in the process of appending data to a dataset, we need to get the extant data from the dataset's data_dict. The calculation that searches the data_dict for the part overlapping each chunk of the newly-resized dataset was using the dimensions of the entire dataset rather than the dimensions of the chunk. Correcting this made the rest of the logic work as intended.
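
To illustrate the idea with a standalone sketch (using ndindex directly; this is not the PR's actual code): when walking the chunks of the resized dataset, the extant portion of each chunk has to be clipped to that chunk's own bounds along the append axis, not to the bounds of the whole dataset.

from ndindex import ChunkSize, Slice, Tuple

old_shape = (4, 6)   # shape before the append
new_shape = (5, 6)   # shape after appending one row
chunks = ChunkSize((3, 3))

for chunk in chunks.indices(new_shape):
    # `chunk` is a Tuple of Slices selecting one chunk of the *resized* dataset.
    first = chunk.args[0]
    # The extant part of this chunk ends at the chunk's own stop or the old
    # dataset's edge, whichever comes first -- not at new_shape[0].
    extant_stop = min(first.stop, old_shape[0])
    if extant_stop > first.start:
        extant_vindex = Tuple(Slice(first.start, extant_stop), *chunk.args[1:])
        print(chunk, "->", extant_vindex)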

@peytondmurray force-pushed the add-append-api branch 2 times, most recently from cce87eb to b331391 on June 19, 2024 22:06
@ArvidJB
Collaborator

ArvidJB commented Jun 21, 2024

I updated to the latest changes; the code snippet I posted no longer fails, but it now stores incorrect values:

In [2]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values',
   ...:                 data=np.array([[1, 1, 1, 1, 1, 1],
   ...:                                [2, 2, 2, 2, 2, 2],
   ...:                                [3, 3, 3, 3, 3, 3],
   ...:                                [4, 4, 4, 4, 4, 4]]),
   ...:                 chunks=(3, 3), maxshape=(None, None))
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        with vf.stage_version("r1") as sv:
   ...:             sv['values'].append(np.array([[5, 5, 5, -5, -5, -5]]))
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:        vf = VersionedHDF5File(f)
   ...:        cv = vf[vf.current_version]
   ...:        print(cv['values'][:])
   ...:
[[ 1  1  1  1  1  1]
 [ 2  2  2  2  2  2]
 [ 3  3  3  3  3  3]
 [ 4  4  4  4  4  4]
 [-5 -5 -5 -5 -5 -5]]

@peytondmurray
Collaborator Author

For certain multidimensional dataset shapes, an append could be split into subchunks that targeted the same raw chunk. That's what happened in your example above: the erroneous data you saw came from one append subchunk overwriting the raw data needed by another.

I've added a new branch so that only one append is allowed to target each individual raw chunk.
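
Roughly, the guard works like this (a hypothetical, self-contained sketch, not the actual diff): the first append subchunk to reuse a given raw chunk may do so, and any later subchunk landing on the same raw chunk is redirected to freshly allocated space.

pending_raw_chunks: set[int] = set()
next_free_chunk = 1000  # pretend index of the next unallocated raw chunk

def raw_target_for_append(candidate: int) -> int:
    """Pick the raw chunk an append subchunk should write to.

    Only one append may reuse a given raw chunk per commit; later subchunks
    that target the same raw chunk get a newly allocated chunk instead.
    """
    global next_free_chunk
    if candidate in pending_raw_chunks:
        new_chunk = next_free_chunk
        next_free_chunk += 1
        pending_raw_chunks.add(new_chunk)
        return new_chunk
    pending_raw_chunks.add(candidate)
    return candidate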

@ArvidJB
Collaborator

ArvidJB commented Jun 22, 2024

Here's some more corrupted data (with the latest changes):

In [6]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values', data=np.arange(3), chunks=(5,), maxshape=(None,))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r1") as sv:
   ...:             values = sv['values']
   ...:             values.append(np.array([1, 2]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         cv = vf[vf.current_version]
   ...:         print(cv['values'][:])
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r2") as sv:
   ...:             values = sv['values']
   ...:             values.resize((8,))
   ...:             values[5:8] = np.arange(3)
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r3") as sv:
   ...:              values = sv['values']
   ...:              values.append(np.array([3, 4]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         cv = vf[vf.current_version]
   ...:         print(cv['values'][:])
   ...:
[0 1 2 1 2]
[0 1 2 3 4 0 1 2 3 4]

@ArvidJB
Collaborator

ArvidJB commented Jun 24, 2024

This also silently corrupts older versions:

In [2]: with TempDirCtx(DIR_cluster_tmp()) as d:
   ...:     with h5py.File(d / 'data.h5', 'w') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r0") as sv:
   ...:             sv.create_dataset('values', data=np.arange(3), chunks=(5,), maxshape=(None,))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r1") as sv:
   ...:             values = sv['values']
   ...:             values.append(np.array([1, 2]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         cv = vf[vf.current_version]
   ...:         print(cv['values'][:])
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r2") as sv:
   ...:             values = sv['values']
   ...:             values.resize((3,))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r+') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         with vf.stage_version("r3") as sv:
   ...:              values = sv['values']
   ...:              values.append(np.array([3, 4]))
   ...:
   ...:     with h5py.File(d / 'data.h5', 'r') as f:
   ...:         vf = VersionedHDF5File(f)
   ...:         # get older version, should not have changes
   ...:         v1 = vf['r1']
   ...:         print(v1['values'][:])
   ...:
[0 1 2 1 2]
[0 1 2 3 4]

@peytondmurray
Collaborator Author

Yep, I started adding a check for this earlier today and just finished it. The check ensures that no chunk in a previous version points to the space in the raw dataset we are targeting for the append; if that space is already referenced, we write to a new chunk instead.
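
In essence the check is an overlap test against the raw space referenced by every committed version. A simplified 1-D sketch (not the actual code):

def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if the half-open raw ranges [a0, a1) and [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def can_append_in_place(target: tuple[int, int], prior_raw_ranges) -> bool:
    """Reusing the unused tail of a raw chunk is only safe when no chunk of
    any previously committed version references that raw space."""
    return not any(overlaps(target, r) for r in prior_raw_ranges)

# Example: a previous version references raw rows [0, 5); appending in place
# into rows [3, 5) would overwrite data that version still points at.
print(can_append_in_place((3, 5), [(0, 5)]))  # False -> write to a new chunk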

@peytondmurray
Collaborator Author

Closing for now.
