Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move read_hdf to Blockwise #7625

Merged
merged 7 commits into from
Jun 10, 2021
Merged

Move read_hdf to Blockwise #7625

merged 7 commits into from
Jun 10, 2021

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented May 3, 2021

Superceeds #7284 (since that PR is now quite stale)

Follows same approach as #7415 and #7615 to use Blockwise for Dask-Dataframe's read_hdf. A slight refactor of the original logic was required.

@jrbourbeau
Copy link
Member

Thanks @rjzamora! @ian-r-rose could you take a look at this when you get a moment?

lock=None,
mode="a",
):
class HDFFunctionWrapper:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love to be able to use a Protocol here. Some day...

dask/dataframe/io/hdf.py Outdated Show resolved Hide resolved
@@ -123,6 +125,16 @@ def test_to_hdf_multiple_nodes():
out = dd.read_hdf(fn, "/data*")
assert_eq(df16, out)

# Test getitem optimization
with tmpfile("h5") as fn:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

dask/dataframe/io/hdf.py Outdated Show resolved Hide resolved
dask/dataframe/io/hdf.py Outdated Show resolved Hide resolved
Comment on lines +462 to +465
if division and global_divisions:
global_divisions = global_divisions[:-1] + division
elif division:
global_divisions = division
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, this is tricky. Can we make the divisions global to begin with?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is just a refactor, most of the actual logic is just a copy of what already existed (iow, this is not my code, and I haven't thought through it carefully). With that said, I can try to revist this later and simplify the logic a bit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can appreciate not wanting to mess with that logic too much

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delayed response here @rjzamora. Would you mind taking care of the merge conflict here and then we can merge this in

@rjzamora
Copy link
Member Author

rjzamora commented Jun 9, 2021

Apologies for the delayed response here @rjzamora. Would you mind taking care of the merge conflict here and then we can merge this in

No worries! Conflict should be resolved. I will also fix any problems if a fresh ci run turns up any failures.

@jsignell
Copy link
Member

I'm going to merge this since it's passing CI :)

@jsignell jsignell merged commit b2bc454 into dask:main Jun 10, 2021
@rjzamora rjzamora deleted the blockwise-io-hdf branch June 10, 2021 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants