Avoid reading _metadata on every worker #6017
Conversation
@ig248 - Note that this PR addresses some of your suggestions from #5842 (although the current changes seem most helpful for non-partitioned datasets, since the …
cc @martindurant if you have a moment to review
Is this still WIP, or are you ready for me to have a look, @rjzamora ?
Thanks @martindurant! I am not planning to add anything here today, so a review at your convenience is certainly appreciated :)
Glad to see it passing! Do we have benchmarks?
Unfortunately, covering every case means regaining some of the complexity that the original refactor alleviated :|
@@ -344,23 +347,58 @@ def read_metadata(
    # Create `parts`
    # This is a list of row-group-descriptor dicts, or file-paths
    # if we have a list of files and gather_statistics=False
    base_path = (base_path or "") + fs.sep
If we're doing direct path manipulations, we should maybe start a HTTP server and test against it
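To make the concern concrete, here is a minimal sketch (not dask code; the function names are hypothetical) of why direct string concatenation of paths deserves test coverage: appending a separator blindly can double it when the base path already ends with one, whereas a normalizing join does not.

```python
import posixpath

def join_naive(base_path, sep, piece):
    # Mirrors the `(base_path or "") + fs.sep` pattern from the diff above.
    return (base_path or "") + sep + piece

def join_normalized(base_path, piece):
    # posixpath.join collapses the trailing-separator case cleanly.
    return posixpath.join(base_path or "", piece)

naive = join_naive("s3://bucket/data/", "/", "part.0.parquet")
safe = join_normalized("s3://bucket/data/", "part.0.parquet")
```

Here `naive` ends up with a doubled `//` between `data` and the file name, while `safe` does not; an HTTP(S) filesystem is exactly the kind of backend where such differences can matter.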
    # a "_metadata" file for the worker to read.
    # Therefore, we need to pass the pf object in
    # the task graph
    pf_deps = pf
There was some code to strip down the pf instance to avoid thrift serialisation costs. Is that still happening?
for i, piece in enumerate(partsin):
    if pf and not fast_metadata:
        for col in piece.columns:
            col.meta_data.statistics = None
I think this is what I was referring to above, making the pf instance smaller
Ah - Right! That should certainly stay in.
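For readers outside the thread, a self-contained sketch of why this stripping stays in: nulling out per-column statistics before the metadata object is serialized into the task graph shrinks the payload shipped to every worker. The classes below are illustrative stand-ins, not fastparquet's real thrift objects.

```python
import pickle

class MetaData:
    # Stand-in for a thrift column-chunk metadata struct.
    def __init__(self):
        self.statistics = {"min": 0, "max": 100, "null_count": 0}
        self.encoding_stats = [("PLAIN", 4096)] * 4

class ColumnChunk:
    def __init__(self):
        self.meta_data = MetaData()

class RowGroup:
    def __init__(self, n_cols=8):
        self.columns = [ColumnChunk() for _ in range(n_cols)]

def strip_statistics(row_groups):
    # Drop per-column stats so the object embedded in the task graph
    # stays small -- the thrift-serialization cost raised above.
    for rg in row_groups:
        for col in rg.columns:
            col.meta_data.statistics = None
            col.meta_data.encoding_stats = None
    return row_groups

row_groups = [RowGroup() for _ in range(20)]
size_before = len(pickle.dumps(row_groups))
strip_statistics(row_groups)
size_after = len(pickle.dumps(row_groups))
```

The pickled size drops measurably even for this toy metadata; on a real dataset with many row groups and columns the savings compound.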
            col.meta_data.statistics = None
            col.meta_data.encoding_stats = None
        piece_item = i if pf else piece
        if partitions and fast_metadata:
The three conditions here are a bit hard to parse. Of course, the situation is complicated. Perhaps up front, we should enumerate the cases and label the branches as required:
- a single file
- a directory of files with a _metadata
- a directory of files without _metadata
- stats requested
- stats unnecessary
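One way to act on this suggestion (a hypothetical helper, not code from the PR) is to name the cases once, up front, and let the branches dispatch on the labels instead of raw boolean combinations:

```python
def classify_read_case(paths, has_metadata_file, gather_statistics):
    # Hypothetical sketch: enumerate the layouts the reviewer lists
    # so each branch below can be labeled by case rather than by a
    # hard-to-parse conjunction of flags.
    if len(paths) == 1:
        layout = "single-file"
    elif has_metadata_file:
        layout = "directory-with-_metadata"
    else:
        layout = "directory-without-_metadata"
    stats = "stats-requested" if gather_statistics else "stats-unnecessary"
    return layout, stats
```

Each `(layout, stats)` pair then maps to exactly one branch, which makes the conditionals self-documenting.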
Failure is a RuntimeWarning on py38 in test_cov (array?) - perhaps a new compiler version of numpy? In other words, unrelated to this PR.
Just a note: I will revisit this soonish, but I did not see significant performance improvements from these changes when I last checked. For the partitioned dataset case, we no longer spend much time in …
Can you please merge from master? I believe things should pass now.
@martindurant did you want to give this another look, or was #6017 (comment) saying this was good?
Partially addresses #5842

For FastParquetEngine, we are currently re-reading the "_metadata" file for every partition in many cases. This PR will avoid doing so whenever possible.

- black dask / flake8 dask