Fix conversion from `Categorical` to `pa.dictionary` in `read_parquet` #10285

phofl · 2023-05-15T08:48:16Z

Tests added / passed
Passes pre-commit run --all-files

Came across this when working on the blog post.

jrbourbeau

Thanks @phofl -- one suggestion, but otherwise this looks good to go

jrbourbeau · 2023-05-15T19:25:41Z

dask/dataframe/io/tests/test_parquet.py

+    expected = pd.DataFrame(
+        {
+            "a": pd.Series(
+                ["x", "y"], dtype=pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string()))
+            ),
+            "b": pd.Series([1, 2], dtype="int64[pyarrow]"),
+        }
+    )


Instead of hardcoding this result, testing that dask and pandas return the same result is probably a slightly better approach

expected = pd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow")

good point, thx

jrbourbeau

Thanks @phofl. Will merge after CI is done 👍

jrbourbeau · 2023-05-15T19:58:24Z

dask/dataframe/io/tests/test_parquet.py

+    df.to_parquet(outdir, engine="pyarrow")
+    ddf = dd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow")
+    pdf = pd.read_parquet(outdir, engine="pyarrow", dtype_backend="pyarrow")
+    # Set sort_results=False because of pandas bug


Is there an upstream issue for this?

My PR was already merged, will update tomorrow when nightlies are available

pandas-dev/pandas#53232

jrbourbeau · 2023-05-15T22:16:54Z

Ah, actually it looks like my suggestion of using pd.read_parquet(...) is now causing a legit test failure

TypeError: read_table() got an unexpected keyword argument 'dtype_backend'

I still think using pd.read_parquet(...) is a good approach, we probably just need a more restrictive pandas version check.
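For illustration, a more restrictive version check could look roughly like the sketch below. It assumes that `pd.read_parquet` only accepts `dtype_backend` from pandas 2.0 onward (older versions appear to forward unknown keywords to pyarrow's `read_table`, which would explain the `TypeError` above). The helper names `version_tuple` and `supports_dtype_backend` and the `MIN_PANDAS_FOR_DTYPE_BACKEND` constant are hypothetical, not anything defined in dask or pandas, and the exact cutoff the test ends up using may differ.

```python
import re

# Assumption: dtype_backend is accepted by pd.read_parquet from pandas 2.0 onward.
MIN_PANDAS_FOR_DTYPE_BACKEND = (2, 0)


def version_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric components of a version string.

    Suffixes such as ".dev0+123.gabc" or "rc1" are ignored, so a nightly
    or pre-release build is treated like the release it leads up to.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def supports_dtype_backend(pandas_version: str) -> bool:
    """Hypothetical helper: True if this pandas version should accept dtype_backend."""
    return version_tuple(pandas_version) >= MIN_PANDAS_FOR_DTYPE_BACKEND
```

In the test module this could feed a `pytest.mark.skipif(not supports_dtype_backend(pd.__version__), reason=...)` marker, so that the `pd.read_parquet(..., dtype_backend="pyarrow")` comparison only runs where the keyword exists.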

Fix conversion from Categorical to pa.dictionary in ``read_pa…

96c5d1e

…rquet``

github-actions bot added dataframe io labels May 15, 2023

phofl added the upstream label May 15, 2023

phofl closed this May 15, 2023

phofl reopened this May 15, 2023

phofl requested a review from jrbourbeau May 15, 2023 15:22

jrbourbeau reviewed May 15, 2023

Fix

d3f3be9

jrbourbeau approved these changes May 15, 2023

phofl added 3 commits May 16, 2023 16:26

Update test_parquet.py

f7615bb

Merge remote-tracking branch 'upstream/main' into read_parquet_cat

ae37409

Adjust test

15e9cc3

jrbourbeau merged commit 0b6ddd3 into dask:main May 16, 2023
27 of 28 checks passed

phofl deleted the read_parquet_cat branch May 16, 2023 16:37
