Cast error on roundtrip of categorical column to parquet and back #32869

Open
asfimport opened this issue Sep 6, 2022 · 0 comments
Writing a table to parquet and then reading it back fails if:

  1. One of the columns is a dictionary (it came from a pandas Categorical), and

  2. The table's schema is passed to read_table.

The read fails on an attempt to cast int64 to dictionary (full stack trace below).

This seems related to ARROW-11157. But even if the categorical type is lost when reading from parquet, the reader should not fail when it is given the original schema.

Minimal example of failing code:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.dataset as ds
    a = [1,2,3,4,1,2,3,4,1,2,3,4]
    b = ["a" for i in a]
    c = [i for i in range(len(a))]
    df = pd.DataFrame({"a":a, "b":b, "c":c})
    df['a'] = df['a'].astype('category')
    print("df dtypes:\n", df.dtypes)
    t = pa.Table.from_pandas(df, preserve_index=True)
    s = t.schema
    ds.write_dataset(t, format='parquet', base_dir='./test')
    df2 = pq.read_table('./test', schema=s).to_pandas()
    print("df2 dtypes:\n", df2.dtypes)

Which gives:

    df dtypes:
     a    category
    b      object
    c       int64
    dtype: object
    Traceback (most recent call last):
      File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
        df2 = pq.read_table('./test', schema=s).to_pandas()
      File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
        return dataset.read(columns=columns, use_threads=use_threads,
      File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
        table = self._dataset.to_table(
        table = self._dataset.to_table(
      File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
      File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
      File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
    pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary

Reporter: Yishai Beeri

Note: This issue was originally created as ARROW-17625. Please see the migration documentation for further details.
