DOC: defaults in read_parquet parameters #7567

Merged
jsignell merged 2 commits into dask:main from raybellwaves:doc-parquet-defaults
Apr 26, 2021

Conversation

@raybellwaves
Member

Added in defaults and tidied up the filters example

filtering is only performed at the partition level, i.e., to prevent the
loading of some row-groups and/or files.
filters : Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]], default None
List of filters to apply, like ``[[('col1', '==', 0), ...], ...]``.
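To make the docstring above concrete, here is a minimal sketch of how a nested filter list can be read as an OR of AND groups and applied at the partition level using min/max statistics. The helper names (`may_contain`, `keep_partition`) and the `stats` dict layout are hypothetical, chosen for illustration; this is not dask's implementation, and raising on an unrecognized operator is a choice made here, not a description of dask's behavior.

```python
# Illustration only: hypothetical helpers, not dask's actual code.

def may_contain(stats, col, op, val):
    """Could a partition with these (min, max) statistics hold a matching row?"""
    lo, hi = stats[col]
    if op == "==":
        return lo <= val <= hi
    if op == "!=":
        # Only skippable when every value in the partition equals `val`.
        return not (lo == hi == val)
    if op == "<":
        return lo < val
    if op == "<=":
        return lo <= val
    if op == ">":
        return hi > val
    if op == ">=":
        return hi >= val
    raise ValueError(f"unsupported operator in this sketch: {op!r}")


def keep_partition(stats, filters):
    """Decide whether a whole partition is loaded or skipped."""
    # A flat list of tuples is treated as a single AND group.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # Outer list: OR across groups. Inner list: AND within a group.
    return any(
        all(may_contain(stats, col, op, val) for col, op, val in group)
        for group in filters
    )


# One partition holds only x == 0, the other only x == 1.
parts = [{"x": (0, 0)}, {"x": (1, 1)}]
kept = [p for p in parts if keep_partition(p, [[("x", "==", 0)]])]
print(kept)
```

Note that a partition whose statistics are `{"x": (0, 5)}` would be kept in full for `('x', '==', 3)`, including rows where `x` is not 3, which is what "filtering is only performed at the partition level" implies.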
Member

Does it really need ==? Seems like it works fine with just the = locally.

Member Author

Good Q. I believe so. Here's some code to test (taken from https://sites.google.com/view/raybellwaves/blog/read_parquetcar-parquet-enginevroom)

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [0, 1]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.to_parquet("dd_df.parquet", engine="pyarrow")
dd.read_parquet("dd_df.parquet", engine="pyarrow-dataset", filters=[[('x', '=', 0)]]).compute() # returns empty
dd.read_parquet("dd_df.parquet", engine="pyarrow-dataset", filters=[[('x', '==', 0)]]).compute() # returns filtered data

Screenshot from 2021-04-26 15-39-08

Member

Oh huh. I guess the one I was testing on was filtered at the file-level, and it seemed to work the same. Thanks for writing this out!

@jsignell merged commit e406faf into dask:main on Apr 26, 2021