Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort #25254

Closed
asfimport opened this issue Jun 16, 2020 · 1 comment
Assignees
Milestone

Comments

@asfimport
Copy link

If you have filter on a column where the physical and dataset schema differs, scanning aborts (as right now the dataset schema, if specified, gets used for implicit casts, but then the expression might have a different type as the actual physical column):

Small parquet file with one int32 column:

df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
df.to_parquet("test_filter_schema.parquet", engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
fragment = list(dataset.get_fragments())[0]

and then reading in a fragment with a filter on that column, without and with specifying a dataset/read schema:

In [48]: fragment.to_table(filter=ds.field("col") > 2).to_pandas() 
Out[48]: 
   col
0    3
1    4

In [49]: fragment.to_table(filter=ds.field("col") > 2, schema=pa.schema([("col", pa.int64())])).to_pandas()                                                                                                        
../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: int64 vs int32
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.100(+0xee2f86)[0x7f6b56490f86]
...
Aborted (core dumped)

Now this int32->int64 type change is something we don't support yet (in the schema evolution/normalization, when scanning a dataset), but it also shouldn't abort but raise a normal error about type mismatch.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Ben Kietzman / @bkietz

PRs and other links:

Note: This issue was originally created as ARROW-9146. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request 7526
#7526

@asfimport asfimport added this to the 1.0.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants