
Refactor read_metadata in fastparquet engine #8092

Merged
merged 60 commits into dask:main on Oct 19, 2021

Conversation

rjzamora
Member

This is a follow-up to #8072, and corresponds to the "short-term" FastParquetEngine component of the plan discussed in #8058 .

Note that #8072 should be merged first.

@pentschev
Member

GPU issues should be fixed now that rapidsai/cudf#9118 is in. Rerunning tests.

@rjzamora rjzamora marked this pull request as ready for review October 1, 2021 22:24
@rjzamora rjzamora marked this pull request as draft October 1, 2021 22:24
@rjzamora rjzamora marked this pull request as ready for review October 4, 2021 13:21
@martindurant
Member

Is this ready for review?

@rjzamora
Member Author

rjzamora commented Oct 7, 2021

Is this ready for review?

Yes - I will take another look through it now, but you should feel free to review whenever you can find the time :D

@rjzamora rjzamora mentioned this pull request Oct 7, 2021
@martindurant
Member

martindurant left a comment

I think I've gone through most of it, and don't see much awry. There is a lot, though! Probably you can push ahead and note some for follow-ups, such as profiling whether the remaining work in the client is expensive or not.

dask/dataframe/io/parquet/fastparquet.py (comment thread resolved)
dask/dataframe/io/parquet/fastparquet.py (outdated; comment thread resolved)
# Find all files if we are not using a _metadata file
if ignore_metadata_file or not _metadata_exists:
    # For now, we need to discover every file under paths[0]
    paths, base, fns = _sort_and_analyze_paths(fs.find(base), fs)
@martindurant
Member

All these path operations should be profiled; I wouldn't be surprised if they add up to a lot.

@rjzamora
Member Author

Agree - We are not really "adding" anything new in this PR, but there are probably ways to reduce fs overhead.
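As a concrete starting point for that profiling, the listing call behind this code path can be timed directly. This is a minimal sketch assuming an fsspec filesystem; the local filesystem and dataset path are placeholders, not code from this PR:

import time

import fsspec

fs = fsspec.filesystem("file")      # stand-in; would be "s3", "gcs", etc. in practice
base = "/path/to/dataset"           # hypothetical dataset root

start = time.perf_counter()
all_files = fs.find(base)           # recursively list every file under base
elapsed = time.perf_counter() - start
print(f"fs.find listed {len(all_files)} files in {elapsed:.3f}s")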

    gather_statistics = True
else:
    # Use 0th file
    # Note that "_common_metadata" can cause issues for
@martindurant
Member

I don't understand this comment, I thought we had covered this case.

@rjzamora
Member Author

We used to use the _common_metadata file to generate the ParquetFile object, but the code was not working for partitioned data (since the file name is used). This small change allows us to handle partitioned data when _metadata is missing.
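A minimal sketch of that idea, assuming fastparquet's ParquetFile constructor and an fsspec filesystem; the paths below are placeholders, not the engine's actual code:

import fsspec
from fastparquet import ParquetFile

fs = fsspec.filesystem("file")                       # stand-in filesystem
base, fns = "/path/to/dataset", ["part.0.parquet"]   # hypothetical listing results

# Build the ParquetFile from the 0th data file rather than _common_metadata,
# so the real file name is available for inferring partition columns.
pf = ParquetFile(fs.sep.join([base, fns[0]]), open_with=fs.open)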

_metadata_exists = "_metadata" in fns
if _metadata_exists and ignore_metadata_file:
    fns.remove("_metadata")
    paths = [fs.sep.join([base, fn]) for fn in fns]
@martindurant
Member

We just extracted the base from the filenames and then here we add them back in again.

@rjzamora
Member Author

Right - We keep track of base and use it elsewhere after this. We are not actually "adding them back in." We are only executing this last line to remove _metadata from paths. Would you prefer paths = [p for p in paths if not p.endswith("_metadata")]?
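To make the two options concrete, here is a toy comparison; the filesystem stand-in and file names are made up for illustration:

class _FS:              # stand-in for the fsspec filesystem object
    sep = "/"

fs, base = _FS(), "bucket/dataset"
fns = ["_metadata", "part.0.parquet", "part.1.parquet"]

# What the code above effectively does: drop the entry from fns, then re-join.
kept = [fn for fn in fns if fn != "_metadata"]
paths = [fs.sep.join([base, fn]) for fn in kept]

# The suggested one-liner filters the already-joined paths instead.
paths_alt = [fs.sep.join([base, fn]) for fn in fns]
paths_alt = [p for p in paths_alt if not p.endswith("_metadata")]

assert paths == paths_alt   # both drop bucket/dataset/_metadata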

if getattr(dtypes.get(ind), "numpy_dtype", None):
    # index does not support masked types
    dtypes[ind] = dtypes[ind].numpy_dtype
meta = _meta_from_dtypes(all_columns, dtypes, index_cols, column_index_names)
@martindurant
Member

Note that fastparquet can do this too, via the preallocate stuff (just make one row's worth). It may well end up with a closer representation.
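For context, the _meta_from_dtypes call above boils down to building a zero-row DataFrame that carries the expected column and index dtypes. A rough sketch under that assumption (hypothetical dtypes; not the actual dask or fastparquet implementation):

import pandas as pd

dtypes = {"x": "int64", "y": "float64"}   # hypothetical column dtypes
index_name, index_dtype = "idx", "int64"  # hypothetical index

meta = pd.DataFrame({c: pd.Series([], dtype=d) for c, d in dtypes.items()})
meta.index = pd.Index([], dtype=index_dtype, name=index_name)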

if filters:
    # Filters may require us to gather statistics
    if gather_statistics is False and pf.info.get("partitions", None):
        warnings.warn(
@martindurant
Member

It feels like it should be an error if we pass a column we can't actually filter on.

@rjzamora
Member Author

This code is not added in this PR, but I agree an error may make more sense than a warning. However, since there is at least a warning, can we leave the "fix" for a follow-up?
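For reference, the follow-up being discussed could look roughly like the sketch below; the function and argument names are hypothetical, not code from this PR:

import warnings

def _check_filter_columns(filter_columns, filterable_columns, strict=False):
    # Columns referenced by filters but lacking statistics/partition info.
    missing = set(filter_columns) - set(filterable_columns)
    if missing:
        msg = f"Cannot filter on columns without statistics: {sorted(missing)}"
        if strict:
            raise ValueError(msg)   # the stricter behavior suggested above
        warnings.warn(msg)          # current behavior: warn and continue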

    common_kwargs,
)

dataset_info_kwargs = {
@martindurant
Member

Mostly copied from dataset_info?

@rjzamora
Member Author

Right - We don't want to pass all of dataset_info to each task in the metadata-processing graph, so we specify the required elements here.
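A sketch of that packing step; the key names below are examples, not necessarily the exact fields used in the PR:

dataset_info = {
    "fs": "<filesystem>",
    "paths": ["part.0.parquet", "part.1.parquet"],
    "categories": None,
    "index_cols": [],
    "parquet_file": "<ParquetFile>",   # larger objects we do not want in every task
}

# Only ship the fields each metadata task actually needs.
required = ("fs", "paths", "categories", "index_cols")
dataset_info_kwargs = {k: dataset_info[k] for k in required}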

    or metadata_task_size > len(paths)
):
    # Use original `ParquetFile` object to construct plan,
    # since it is based on a global _metadata file
@martindurant
Member

This comment is outdated? Basically the "old" method.

@rjzamora
Member Author

I can clarify the comment, but it is not really outdated (just incomplete). This code path means we have a _metadata file or the metadata_task_size setting has caused parallel metadata processing to be disabled.
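Paraphrased as a standalone predicate (names simplified; not the engine's exact variables), the branch being described is roughly:

def use_serial_metadata_plan(have_global_metadata, parallel_disabled, metadata_task_size, paths):
    # Fall back to the single-ParquetFile ("old") plan when a global _metadata
    # file exists, when parallel metadata processing is turned off, or when
    # there are too few files for splitting the work to pay off.
    return (
        have_global_metadata
        or parallel_disabled
        or metadata_task_size > len(paths)
    )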

):

    # Collect necessary information from dataset_info
    fs = dataset_info_kwargs["fs"]
@martindurant
Member

Is there any way to get around this repeat bundling and unbundling? Seems like unnecessary operations and unnecessary code.

@rjzamora
Member Author

Yes, probably. To be honest, 90% of this PR is just moving existing code around (very little "new" logic). One of the only "new" changes is the general usage of a dataset_info dict. I was expecting a bit of pushback on this. However, I decided that explicitly packing and unpacking a dictionary this way makes it much easier to avoid breaking changes to read_metadata.

My preference is to follow this approach for now (especially since the pyarrow version uses the same logic), and only make the API more rigid once we are more confident that we are passing along exactly what we need.
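To illustrate the trade-off, the dict-based interface lets later stages grow new fields without touching any function signatures; a sketch with illustrative names only:

def plan_metadata_tasks(dataset_info_kwargs):
    # Unpack only what this stage needs; extra keys added upstream later
    # are simply ignored, so they are not breaking changes here.
    fs = dataset_info_kwargs["fs"]
    paths = dataset_info_kwargs["paths"]
    return [(fs, path) for path in paths]

# A rigid signature, by contrast, forces an API change for every new field:
# def plan_metadata_tasks(fs, paths, categories, index_cols, ...): ...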

@rjzamora
Member Author

@martindurant - Thank you for the thorough review here! It was very useful. I do want to make a general note that many of your comments/suggestions are focused on code that was simply moved in this PR (not actually added). I am certainly happy that you have pointed out possible problems and improvements within some of the relocated logic. Are you okay with me targeting most of these problems/improvements in follow-up PRs (to keep the changes here as minimal as possible)?

@martindurant
Member

Are you okay with me targeting most of these problems/improvements in follow-up PRs

Certainly! Basically, the diff version was too hard for me to follow, so I read through the complete code, in the parts that seemed relevant. Things that might have been suboptimal and continue being exactly the same amount of suboptimal don't worry me too much :)

@rjzamora
Member Author

Basically, the diff version was too hard for me to follow

I don't think this PR is even possible to review from the diff :) So, I really appreciate that you took the time to look through everything!

@rjzamora
Member Author

I feel that this PR should be merged within the next day or so. Please feel free to speak up if you feel that any of the issues that were discussed in code review (or others that were not discussed) need to be addressed in this particular PR (cc @martindurant @jrbourbeau)

@martindurant
Member

I don't intend to review further, so we can merge and iterate as usual.

@rjzamora rjzamora merged commit 07ee3b8 into dask:main Oct 19, 2021
@rjzamora rjzamora deleted the read-metadata-refactor-fastparquet branch October 19, 2021 22:16