Fix conversion from Categorical to pa.dictionary in read_parquet #10285

Merged (5 commits) on May 16, 2023

phofl
Collaborator

@phofl phofl commented May 15, 2023

  • Tests added / passed
  • Passes pre-commit run --all-files

Came across this when working on the blog post.

Member

@jrbourbeau jrbourbeau left a comment

Thanks @phofl -- one suggestion, but otherwise this looks good to go

Comment on lines 4917 to 4924
expected = pd.DataFrame(
    {
        "a": pd.Series(
            ["x", "y"], dtype=pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string()))
        ),
        "b": pd.Series([1, 2], dtype="int64[pyarrow]"),
    }
)
Member

Instead of hardcoding this result, testing that dask and pandas return the same result is probably a slightly better approach

expected = pd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow")

Collaborator Author

good point, thx

Member

@jrbourbeau jrbourbeau left a comment

Thanks @phofl. Will merge after CI is done 👍

df.to_parquet(outdir, engine="pyarrow")
ddf = dd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow")
pdf = pd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow")
# Set sort_results=False because of pandas bug
Member

Is there an upstream issue for this?

Collaborator Author

My pr was already merged, will update tomorrow when nightlies are available

Collaborator Author

@phofl phofl May 15, 2023

Member

Nice

@jrbourbeau
Member

Ah, actually it looks like my suggestion of using pd.read_parquet(...) is now causing a legit test failure:

TypeError: read_table() got an unexpected keyword argument 'dtype_backend'

I still think using pd.read_parquet(...) is a good approach; we probably just need a more restrictive pandas version check.
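
A minimal sketch of what such a stricter version gate could look like. It assumes that `dtype_backend` first became a `pd.read_parquet` keyword in pandas 2.0 and that older releases forward unknown keywords to pyarrow's `read_table`, which would explain the TypeError above. The names `MIN_PANDAS_FOR_DTYPE_BACKEND`, `release_tuple` and `supports_dtype_backend` are hypothetical and are not taken from the dask test suite.

```python
import re

# Assumption: pd.read_parquet accepts dtype_backend starting with pandas 2.0.
MIN_PANDAS_FOR_DTYPE_BACKEND = (2, 0)


def release_tuple(version):
    """Return the leading numeric components of a version string as ints.

    Pre-release and dev suffixes (e.g. "rc1", ".dev0+123") are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def supports_dtype_backend(pandas_version):
    """True if this pandas version is assumed to accept dtype_backend."""
    # Compare only (major, minor) against the assumed minimum.
    return release_tuple(pandas_version)[:2] >= MIN_PANDAS_FOR_DTYPE_BACKEND
```

In a test module this could drive something like `pytest.mark.skipif(not supports_dtype_backend(pd.__version__), reason="dtype_backend requires a newer pandas")`, so the pd.read_parquet(...) comparison only runs where the keyword is available.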

@jrbourbeau jrbourbeau merged commit 0b6ddd3 into dask:main May 16, 2023
27 of 28 checks passed
@phofl phofl deleted the read_parquet_cat branch May 16, 2023 16:37