[python] crashed when filter pyarrow.dataset on category field #8446

@lf-shaw

I want to filter a dataset on a (pandas) category field, but Python crashes. Sample code follows:

import pandas as pd

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# B is category
df = pd.DataFrame({'A': range(4), 'B': list('bccd')})
df['B'] = df['B'].astype('category')

# save to parquet file
table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

# read with dataset
dataset = ds.dataset('test.parquet')

# it's ok
dataset.to_table().to_pandas()

# it's ok
dataset.to_table(filter=ds.field('A') > 2).to_pandas()

# it crashed
dataset.to_table(filter=ds.field('B') == 'b').to_pandas()

The crash message:

ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: dictionary<values=string, indices=int32, ordered=0> vs string

I know that in C++ an arrow::DictionaryArray cannot be compared with a string directly. But is there any way to filter on a category field from Python?

Thanks for your attention, and thanks for this brilliant library.
