
Avoid reading _metadata on every worker #6017

Merged (8 commits) on Apr 22, 2020
Conversation

@rjzamora (Member) commented Mar 16, 2020:

Partially addresses #5842

For FastParquetEngine, we are currently re-reading the "_metadata" file for every partition in many cases. This PR will avoid doing so whenever possible.
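In outline, the idea is roughly the following (a sketch with hypothetical names and paths, not the actual dask code): parse the metadata once on the client and ship the parsed `pf` object through the task graph, so no worker re-opens "_metadata".

```python
import fastparquet
from dask.base import tokenize

# Parse the dataset's global metadata once, on the client.
pf = fastparquet.ParquetFile("mydata/_metadata")  # hypothetical path

def read_piece(pf, rg_index):
    # Hypothetical task body: read one row group using the already-parsed
    # pf shipped in the graph, instead of re-opening "_metadata".
    ...

name = "read-parquet-" + tokenize(pf)
dsk = {(name, i): (read_piece, pf, i) for i in range(len(pf.row_groups))}
```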

  • Tests added / passed
  • Passes black dask / flake8 dask

@rjzamora (Member, Author) commented:

@ig248 - Note that this PR addresses some of your suggestions from #5842 (although the current changes seem most helpful for non-partitioned datasets, since the "_metadata" file should no longer be re-read at all).

@mrocklin (Member) commented:

cc @martindurant if you have a moment to review

@martindurant (Member) commented:

Is this still WIP, or are you ready for me to have a look, @rjzamora ?

@rjzamora rjzamora changed the title [WIP] Avoid reading _metadata on every worker Avoid reading _metadata on every worker Mar 17, 2020
@rjzamora (Member, Author) commented Mar 17, 2020:

Thanks @martindurant! I am not planning to add anything here today, so a review at your convenience is certainly appreciated :)

@martindurant (Member) left a review:

Glad to see it passing! Do we have benchmarks?

Unfortunately, covering every case means regaining some of the complexity that the original refactor alleviated :|

```diff
@@ -344,23 +347,58 @@ def read_metadata(
    # Create `parts`
    # This is a list of row-group-descriptor dicts, or file-paths
    # if we have a list of files and gather_statistics=False
    base_path = (base_path or "") + fs.sep
```
@martindurant (Member):

If we're doing direct path manipulations, we should maybe start an HTTP server and test against it.
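For reference, a minimal sketch (not dask's actual test code) of what such a fixture could look like: serve a local directory over HTTP in a background thread so path handling can be exercised end to end. `serve_directory` is a hypothetical helper built on the standard library.

```python
import http.server
import socketserver
import threading
from functools import partial

def serve_directory(path):
    # Serve `path` over HTTP on a random free port in a daemon thread.
    handler = partial(http.server.SimpleHTTPRequestHandler, directory=path)
    server = socketserver.TCPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return f"http://{host}:{port}", server

# url, server = serve_directory("dataset/")
# ... run path-handling tests against `url` ...
# server.shutdown()
```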

```python
# a "_metadata" file for the worker to read.
# Therefore, we need to pass the pf object in
# the task graph
pf_deps = pf
```
@martindurant (Member):
There was some code to strip down the pf instance to avoid thrift serialisation costs. Is that still happening?

```python
for i, piece in enumerate(parts):
    if pf and not fast_metadata:
        for col in piece.columns:
            col.meta_data.statistics = None
```
@martindurant (Member):
I think this is what I was referring to above, making the pf instance smaller

@rjzamora (Member, Author):
Ah - Right! That should certainly stay in.
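For context, a rough illustration of the saving (hypothetical path, and it assumes the `pf` object pickles cleanly): stripping per-column statistics shrinks the thrift metadata that gets serialized into every task.

```python
import pickle

import fastparquet

pf = fastparquet.ParquetFile("mydata/_metadata")  # hypothetical path
before = len(pickle.dumps(pf))  # assumes pf is picklable

# Strip per-column statistics, as in the diff below, to make pf smaller.
for rg in pf.row_groups:
    for col in rg.columns:
        col.meta_data.statistics = None
        col.meta_data.encoding_stats = None

after = len(pickle.dumps(pf))
print(f"pickled metadata: {before} -> {after} bytes")
```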

```python
            col.meta_data.statistics = None
            col.meta_data.encoding_stats = None
    piece_item = i if pf else piece
    if partitions and fast_metadata:
```
@martindurant (Member):
The three conditions here are a bit hard to parse. Of course, the situation is complicated. Perhaps up front, we should enumerate the cases and label the branches as required:

  • a single file
  • a directory of files with a _metadata
  • a directory of files without _metadata
    • stats requested
    • stats unnecessary
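One possible shape for that labeling, as a sketch with hypothetical flag names rather than the PR's actual code:

```python
def classify_dataset(single_file, has_metadata_file, gather_statistics):
    # Label the cases enumerated above so each branch is self-describing.
    if single_file:
        return "single file"  # metadata comes from the one file's footer
    if has_metadata_file:
        return "directory with _metadata"  # parse it once, reuse everywhere
    if gather_statistics:
        return "directory without _metadata, stats requested"  # read all footers
    return "directory without _metadata, stats unnecessary"  # just list files
```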

@martindurant (Member) commented:

The failure is a RuntimeWarning on py38 in test_cov (array?), perhaps from a newly compiled numpy. In other words, it is unrelated to this PR.

@rjzamora (Member, Author) commented:

> Glad to see it passing! Do we have benchmarks?

Just a note: I will revisit this soonish, but I did not see significant performance improvements from these changes when I last checked. For the partitioned dataset case, we no longer spend much time in _determine_pf_parts, but we still spend a lot of time creating pf for each partition.
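For anyone reproducing this, a rough sketch (hypothetical dataset path) of profiling graph construction, which is where the metadata handling happens:

```python
import cProfile

import dask.dataframe as dd  # used inside the profiled statement

# read_parquet builds the task graph, so "_metadata" parsing and
# per-partition pf handling show up in its cumulative times.
cProfile.run('dd.read_parquet("mydata/", engine="fastparquet")', sort="cumtime")
```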

@martindurant (Member) commented:

Can you please merge from master? I believe things should pass now.

@TomAugspurger (Member) commented:

@martindurant did you want to give this another look, or was #6017 (comment) saying this was good?
