[C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort #25254

asfimport · 2020-06-16T13:56:27Z

If you have filter on a column where the physical and dataset schema differs, scanning aborts (as right now the dataset schema, if specified, gets used for implicit casts, but then the expression might have a different type as the actual physical column):

Small parquet file with one int32 column:

df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
df.to_parquet("test_filter_schema.parquet", engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
fragment = list(dataset.get_fragments())[0]

and then reading in a fragment with a filter on that column, without and with specifying a dataset/read schema:

In [48]: fragment.to_table(filter=ds.field("col") > 2).to_pandas() 
Out[48]: 
   col
0    3
1    4

In [49]: fragment.to_table(filter=ds.field("col") > 2, schema=pa.schema([("col", pa.int64())])).to_pandas()                                                                                                        
../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: int64 vs int32
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.100(+0xee2f86)[0x7f6b56490f86]
...
Aborted (core dumped)

Now this int32->int64 type change is something we don't support yet (in the schema evolution/normalization, when scanning a dataset), but it also shouldn't abort but raise a normal error about type mismatch.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Ben Kietzman / @bkietz

PRs and other links:

GitHub Pull Request #7526

_{Note: This issue was originally created as ARROW-9146. Please see the migration documentation for further details.}

asfimport · 2020-06-26T19:50:50Z

Francois Saint-Jacques / @fsaintjacques:
Issue resolved by pull request 7526
#7526

asfimport closed this as completed Jun 26, 2020

asfimport assigned bkietz Jan 10, 2023

asfimport added this to the 1.0.0 milestone Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort #25254

[C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort #25254

asfimport commented Jun 16, 2020

asfimport commented Jun 26, 2020

[C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort #25254

[C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort #25254

Comments

asfimport commented Jun 16, 2020

PRs and other links:

asfimport commented Jun 26, 2020