Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use dd.from_map #24

Merged
merged 5 commits into from
Jun 26, 2023
Merged

Use dd.from_map #24

merged 5 commits into from
Jun 26, 2023

Conversation

jrbourbeau
Copy link
Member

This gives us more efficient graph optimizations when reading in data

Closes #23

Comment on lines +51 to +54
meta = make_meta(self.schema.empty_table().to_pandas())
if self.columns:
meta = meta[self.columns]
self.meta = meta
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is purely cosmetic. Since self._make_meta_from_schema was only used in one place I just moved it into __init__ as an attribute

@codecov-commenter
Copy link

codecov-commenter commented Jun 9, 2023

Codecov Report

Merging #24 (740d8f0) into main (8ccc092) will decrease coverage by 0.19%.
The diff coverage is 100.00%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main      #24      +/-   ##
==========================================
- Coverage   93.04%   92.85%   -0.19%     
==========================================
  Files           2        2              
  Lines         115      112       -3     
==========================================
- Hits          107      104       -3     
  Misses          8        8              
Impacted Files Coverage Δ
dask_deltatable/core.py 92.52% <100.00%> (-0.21%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Collaborator

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Maybe @rjzamora has interest in reviewing here too?

# Create Blockwise layer
label = "read-delta-table-"
output_name = f"{label}{tokenize(self.fs_token, **kwargs)}"
layer = DataFrameIOLayer(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my general suggestion would be to use from_map instead of interacting directly with DataFrameIOLayer like this, but I may be missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora. Looks like internally from_map is a wrapper for DataFrameIOLayer. The intent of this PR is to make sure we get nice blockwise optimization here. I've updated to use from_map. Let me know if that looks okay to you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_map is a wrapper for DataFrameIOLayer

Yes exactly. All public IO functions (e.g. read_parquet and read_csv) use from_map for this purpose. from_map can be thought of as DataFrameIOLayer's "public" API

dask_deltatable/core.py Outdated Show resolved Hide resolved
Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>
@jrbourbeau jrbourbeau changed the title Use DataFrameIOLayer Use dd.from_map Jun 13, 2023
@jrbourbeau jrbourbeau mentioned this pull request Jun 13, 2023
@jrbourbeau
Copy link
Member Author

xref #25 for the Python 3.7 CI build failures

@jrbourbeau
Copy link
Member Author

@rjzamora how does this look to you?

Copy link
Contributor

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Note that it wouldn't be difficult to add column projection in a follow-up (since we are using from_map).

@jrbourbeau jrbourbeau merged commit ba3801c into dask-contrib:main Jun 26, 2023
13 checks passed
@jrbourbeau
Copy link
Member Author

Thanks for reviewing @rjzamora

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid using delayed when making dask dataframe
4 participants