-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix conversion from Categorical
to pa.dictionary
in read_parquet
#10285
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @phofl -- one suggestion, but otherwise this looks good to go
expected = pd.DataFrame( | ||
{ | ||
"a": pd.Series( | ||
["x", "y"], dtype=pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string())) | ||
), | ||
"b": pd.Series([1, 2], dtype="int64[pyarrow]"), | ||
} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of hardcoding this result, testing that dask
and pandas
return the same result is probably a slightly better approach
expected = pd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, thx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @phofl. Will merge after CI is done 👍
df.to_parquet(outdir, engine="pyarrow") | ||
ddf = dd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow") | ||
pdf = pd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow") | ||
# Set sort_results=False because of pandas bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an upstream issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My pr was already merged, will update tomorrow when nightlies are available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
Ah, actually it looks like my suggestion of using TypeError: read_table() got an unexpected keyword argument 'dtype_backend' I still think using |
pre-commit run --all-files
Came across this when working on the blog post.