Slow writing with many small chunks #167

Open
ArvidJB opened this issue Mar 8, 2021 · 4 comments
@ArvidJB (Collaborator) commented Mar 8, 2021

Unfortunately, in our use case we often end up with suboptimal chunk sizes. Unversioned h5py handles these without issue, but with versioned_hdf5 writing turns out to be quite slow:

import time

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

dt = np.dtype('double')
d0 = 2
d1 = 15220
d2 = 2
chunks = (600, 2, 4)
with h5py.File('foo.h5', 'w') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version('0') as sv:
        sv.create_dataset('bar', shape=(d0, d1, d2), maxshape=(None, None, None),
                          chunks=chunks, dtype=dt,
                          data=np.full((d0, d1, d2), 0, dtype=dt))

i = 1  # version counter; in our real code this write runs in a loop over versions
start = time.time()
with h5py.File('foo.h5', 'r+') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version(str(i)) as sv:
        i2 = np.random.choice(d1, 30, replace=False)
        i2 = np.sort(i2)
        sv['bar'][:, i2, :] = np.full((d0, len(i2), d2), i, dtype=dt)
end = time.time()
print('writing: {}'.format(end - start))

This takes around 9 seconds for me to write 120 numbers.

A little bit of profiling points to two things:

  1. The call to as_subchunks in InMemoryDataset.__setitem__
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   593                                               @with_phil
   594                                               @profile
   595                                               def __setitem__(self, args, val):
...
   700        78   24219378.0 310504.8     99.0          for c in self.chunks.as_subchunks(idx, self.shape):
...

where it ends up calling _fallback because there is no case for IntegerArray. Could we not use the same code path as for Integer?
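
For illustration, here is a minimal one-dimensional sketch of the fast path an IntegerArray index could take. The function and its name are hypothetical (this is not ndindex's actual API); it only shows that a sorted integer array can be grouped by chunk directly, instead of falling back to testing every chunk:

import numpy as np

def subchunks_1d(idx, chunk_size):
    """Group sorted integer indices idx by the chunk they fall into.

    Yields (chunk_number, indices_within_chunk) pairs, visiting only
    the chunks that actually contain an index.
    """
    chunk_of = idx // chunk_size  # chunk number of each index
    for c in np.unique(chunk_of):
        sub = idx[chunk_of == c] - c * chunk_size  # offsets within chunk c
        yield int(c), sub

# Example: 30 random indices into an axis of length 15220, chunked by 600
idx = np.sort(np.random.choice(15220, 30, replace=False))
for c, sub in subchunks_1d(idx, 600):
    print(c, sub)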

  2. The other slow spot is this loop in create_virtual_dataset:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   170                                           @profile
   171                                           def create_virtual_dataset(f, version_name, name, shape, slices, attrs=None, fillvalue=None):
....
   192     26127      50638.0      1.9      0.2          for c, s in slices.items():
   193     26124    1592688.0     61.0      6.6              if c.isempty():
   194                                                           continue
   195                                                       # idx = Tuple(s, *Tuple(*[slice(0, i) for i in shape[1:]]).as_subindex(Tuple(*c.args[1:])).args)
   196     26124    5472288.0    209.5     22.8              S = [Slice(0, shape[i], 1).as_subindex(c.args[i]) for i in range(1, len(shape))]
   197     26123    1725495.0     66.1      7.2              idx = Tuple(s, *S)
   198                                                       # assert c.newshape(shape) == vs[idx.raw].shape, (c, shape, s)
   199     26123   12892876.0    493.5     53.8              layout[c.raw] = vs[idx.raw]
...

Is it possible to speed this up? In this example we only change a very small subset of the data. If we could keep track of the changes, we could probably copy the old virtual dataset and modify it appropriately?
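
For readers unfamiliar with what this loop does: it rebuilds the entire h5py virtual-dataset layout, one mapping per chunk, for every new version. Here is a minimal standalone illustration of that pattern (the file and dataset names are made up for the example; this is not versioned-hdf5's internal code):

import h5py
import numpy as np

with h5py.File('vds_demo.h5', 'w') as f:
    raw = f.create_dataset('raw', data=np.arange(20.0))  # backing storage

    layout = h5py.VirtualLayout(shape=(20,), dtype='f8')
    src = h5py.VirtualSource(raw)
    for c in range(4):  # 4 chunks of 5 elements each
        # One mapping per chunk; versioned-hdf5 does this once per chunk
        # per version, which is why the loop above dominates the profile.
        layout[c * 5:(c + 1) * 5] = src[c * 5:(c + 1) * 5]

    f.create_virtual_dataset('bar_v1', layout, fillvalue=np.nan)
    print(f['bar_v1'][:5])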

@asmeurer (Collaborator) commented Mar 8, 2021

> where it ends up calling _fallback because there is no case for IntegerArray. Could we not use the same code path as for Integer?

Right, this is a case that I forgot to implement. It should be straightforward to fix, and doing so should speed this up quite a bit.

> Is it possible to speed this up? In this example we only change a very small subset of the data. If we could keep track of the changes, we could probably copy the old virtual dataset and modify it appropriately?

A good chunk of the time is spent in the h5py virtual dataset code (the layout[c.raw] = vs[idx.raw] line). I haven't looked at this too closely before; I can investigate more and see if it can be improved. The issue is that in the h5py code, every call to vs[idx] calls deepcopy on the vs object. It seems likely that this could be done more efficiently.
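
For reference, the per-call overhead of vs[idx] is easy to measure in isolation with a quick micro-benchmark (a sketch; the source file does not even need to exist, since VirtualSource only records the mapping):

import timeit
import h5py

src = h5py.VirtualSource('raw.h5', 'raw', shape=(15220,), dtype='f8')
# Each indexing call constructs a new VirtualSource via deepcopy
print(timeit.timeit(lambda: src[0:600], number=10000))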

asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Mar 22, 2021
@ilanschnell (Contributor) commented

As versioned HDF5 relies heavily on ndindex, my plan is to take Python's C implementation of the slice object and make it hashable and subclassable, so that ndindex can use it instead.
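
For context, the limitation being worked around is easy to demonstrate (checked on CPython 3.10; slice objects only became hashable in Python 3.12, and they remain non-subclassable):

s = slice(0, 10, 1)
try:
    hash(s)
except TypeError as e:
    print(e)  # unhashable type: 'slice'

try:
    class MySlice(slice):  # rejected at class-creation time
        pass
except TypeError as e:
    print(e)  # type 'slice' is not an acceptable base type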

@ericdatakelly ericdatakelly assigned ilanschnell and unassigned asmeurer Nov 5, 2021
@ArvidJB (Collaborator, Author) commented Nov 5, 2021

Is this (hashable slice object) for the construction of data_dict? In particular the construction of slice_map seems like it would benefit from this.
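
For illustration, this is the kind of mapping in question: because built-in slices are not hashable (see above), ndindex's wrapper objects are currently used as dictionary keys. A toy slice_map with invented values:

from ndindex import Slice

# Map a chunk of the virtual dataset to its slice into the raw dataset
slice_map = {Slice(0, 600): Slice(1200, 1800)}
print(slice_map[Slice(0, 600)])  # the mapped raw-dataset slice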

@ilanschnell (Contributor) commented

The new object could be used in versioned HDF5 directly, but after creating it I want to focus first on using it within ndindex.
