
Dataset filtering from disk broken for duration type #37111

Closed
mattaubury opened this issue Aug 10, 2023 · 10 comments · Fixed by #37793

Comments

@mattaubury

Describe the bug, including details regarding any error messages, version, and platform.

Using pyarrow-12.0.1 on RHEL8 Intel.

Dataset filters work when the table is read into memory, but break when the dataset references a Parquet file on disk. The following code demonstrates the issue:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

value_type = pa.duration("s")

# Create a file with a single row
table = pa.table([pa.array([1], value_type)], ["t"])
filename = "test.parquet"
pq.write_table(table, filename, version="2.6")

# Filter the row in-memory using dataset, THIS WORKS
table2 = pq.read_table(filename)
dataset = ds.dataset(table2)
x = dataset.to_table(filter=ds.field("t") == pa.scalar(1, type=value_type))

# Filter the row using dataset from disk, THIS BREAKS
dataset = ds.dataset(filename, format="parquet")
y = dataset.to_table(filter=ds.field("t") == pa.scalar(1, type=value_type))

assert x == y

The assert is not reached; instead we get this error:

Traceback (most recent call last):
  File "break_filter.py", line 19, in <module>
    y = dataset.to_table(filter=ds.field("t") == pa.scalar(1, type=value_type))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (duration[us], int8)

Superficially, it appears as though the duration scalar is being prematurely unboxed.

The issue is not present with other value_types (such as int64, timestamp, date32), but a potentially related issue occurs with value_type = pa.time32("s"):

pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (time32[ms], time32[s])
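
For reference, a minimal variant of the script above that triggers this case (the test_time32.parquet filename is illustrative, not from the original report):

value_type = pa.time32("s")
table = pa.table([pa.array([1], value_type)], ["t"])
pq.write_table(table, "test_time32.parquet", version="2.6")
dataset = ds.dataset("test_time32.parquet", format="parquet")

# Raises: Function 'equal' has no kernel matching input types (time32[ms], time32[s])
dataset.to_table(filter=ds.field("t") == pa.scalar(1, type=value_type))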

Component(s)

Python

@felipecrv
Contributor

Note that it's complaining about a comparison between milliseconds and seconds:

Function 'equal' has no kernel matching input types (time32[ms], time32[s])

This is IMO a good thing, because you don't want to compare values with different time units. You might want to round the ms to s, or convert the s to ms and look for exact millisecond-precise matches. It all depends on your application and the meaning of your data.
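
A minimal sketch of the unit alignment described here, reusing the dataset from the repro (our addition; whether it sidesteps this particular bug is untested):

import pyarrow as pa
import pyarrow.dataset as ds

# Cast the field down to seconds before comparing; casting the scalar up
# to milliseconds and matching at millisecond precision would also work.
expr = ds.field("t").cast(pa.time32("s")) == pa.scalar(1, type=pa.time32("s"))
result = dataset.to_table(filter=expr)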

@mattaubury
Author

@felipecrv I agree... except that nowhere in the code I presented is the type time32[ms] mentioned. It has been generated internally by some transformation, which appears to be a bug.

@mapleFU
Member

mapleFU commented Sep 14, 2023

Can you use a cast or set the schema (like #37071?) as a workaround? A sketch of the schema approach is below.
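
A minimal sketch of the schema workaround, assuming the test.parquet file from the repro (our addition, untested against this bug):

import pyarrow as pa
import pyarrow.dataset as ds

# Pass the intended Arrow schema explicitly so the dataset does not
# re-infer the types from the Parquet physical schema.
schema = pa.schema([("t", pa.duration("s"))])
dataset = ds.dataset("test.parquet", format="parquet", schema=schema)
y = dataset.to_table(filter=ds.field("t") == pa.scalar(1, type=pa.duration("s")))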

Also, I found a weird thing:

>>> dataset = ds.dataset(filename, format="parquet")
>>> y = dataset.to_table(filter=ds.field("t") == pa.scalar(1, type=value_type))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (duration[s], int8)
>>> y = dataset.to_table(filter=ds.field("t") == pa.scalar(1, type=value_type))

The first time it raises the exception, but afterwards it works 🤔 Can you retry as a workaround for now (sketched below)? I'll try to see why.
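
A minimal retry sketch of that workaround, assuming filename and value_type from the repro:

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(filename, format="parquet")
expr = ds.field("t") == pa.scalar(1, type=value_type)
try:
    y = dataset.to_table(filter=expr)
except pa.ArrowNotImplementedError:
    # The failure was only observed on the first scan above, so retry once.
    y = dataset.to_table(filter=expr)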

@mapleFU
Member

mapleFU commented Sep 14, 2023

using TestDurationParquetIO = TestParquetIO<::arrow::DurationType>;

TEST_F(TestDurationParquetIO, Roundtrip) {
  std::vector<bool> is_valid = {true, true, false, true};
  std::vector<int64_t> values = {1, 2, 3, 4};

  std::shared_ptr<Array> int_array, duration_arr;
  ::arrow::ArrayFromVector<::arrow::Int64Type, int64_t>(::arrow::int64(), is_valid,
                                                        values, &int_array);
  ::arrow::ArrayFromVector<::arrow::DurationType, int64_t>(
      ::arrow::duration(TimeUnit::NANO), is_valid, values, &duration_arr);

  // When the original Arrow schema isn't stored, a Duration array comes
  // back as int64 (how it is stored in Parquet)
  this->RoundTripSingleColumn(duration_arr, int_array, default_arrow_writer_properties());

  // When the original Arrow schema is stored, the Duration array type is preserved
  const auto arrow_properties =
      ::parquet::ArrowWriterProperties::Builder().store_schema()->build();
  this->RoundTripSingleColumn(duration_arr, duration_arr, arrow_properties);
}

Duration is stored as int64 in Parquet, and when the value is small enough it is deduced as int8; that's why we get int8 in the error. I'll see how I can fix that.
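
The physical storage is easy to confirm; a minimal inspection sketch, assuming the test.parquet file from the repro:

import pyarrow.parquet as pq

meta = pq.ParquetFile("test.parquet").metadata
col = meta.row_group(0).column(0)
print(col.physical_type)                        # INT64
print(col.statistics.min, col.statistics.max)   # plain integers, not durations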

@mapleFU
Member

mapleFU commented Sep 14, 2023

I've found the root cause. It's in the implementation of ParquetFileFragment::TestRowGroups and the Parquet schema handling, but I'm not sure how to actually fix it yet... (mapleFU@17a4922 is a tiny patch about it. The problem is a schema mismatch between Arrow and Parquet in parquet::arrow::SchemaManifest.)

@mapleFU
Member

mapleFU commented Sep 15, 2023

I've submitted a fix here: #37734. You can give it a try.

@mattaubury
Author

I've submitted a fix here: #37734. You can give it a try.

Awesome! Does this also fix the case where value_type = pa.time32("s")?

@mapleFU
Member

mapleFU commented Sep 15, 2023

Hmm, I'm not sure; let me check tonight. Currently I'm working toward fixing the duration type bug, so I'm not sure the value_type = pa.time32("s") case will be fixed. If the cause is different, I may fix it in another patch.

For now you can use a cast as a workaround (as sketched above), since the merged patch will only be released in 14.0.0 or even 15.0, which could be a long wait.

@mapleFU
Member

mapleFU commented Sep 15, 2023

@mattaubury I think they may have different root causes. Let me explain.

  1. The Parquet dataset tries to extract expressions from the Parquet statistics. The root cause of the duration failure is that, while extracting the statistics, it expects a duration but gets an integer.
  2. When extracting an expression, it tries to test the expression; with value_type = pa.time32("s") I guess it might still hit the problem. Maybe others can advise on this.

@mapleFU
Member

mapleFU commented Sep 20, 2023

@mattaubury

  1. The duration problem will be fixed by GH-37111: [C++][Parquet] Dataset: Fixing Schema Cast #37793
  2. As for timestamps, that's [C++][Parquet] Dataset cannot filter timestamp with time_unit SECOND #37799. I'd like to fix it in another patch and will try my best to land it before the 14.0 release.

cc @bkietz

bkietz pushed a commit that referenced this issue Sep 20, 2023
### Rationale for this change

Parquet and Arrow have two schemas:
1. Parquet has a SchemaElement [1], which is language- and implementation-independent. Parquet Arrow extracts this schema and decodes it.
2. Parquet Arrow also stores the Arrow schema and possibly a `field_id` in `key_value_metadata` [2] when `store_schema` is enabled. When deserializing, it first parses the `SchemaElement` [1], and if a self-defined `key_value_metadata` exists, it uses that to override [1].

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L356
[2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1033

The bug arises because the dataset does not use the `key_value_metadata` from the `Metadata` when it parses the `SchemaManifest`.

For duration, when `store_schema` is enabled, the column is stored with the `Int64` physical type and an `::arrow::Duration` entry is added to `key_value_metadata`. There is no `equal(Duration, i64)` kernel, so the not-implemented error is raised.
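
This is observable from Python; a small sketch (our addition), assuming the test.parquet file from the repro and that pyarrow's writer stores the schema by default:

import pyarrow.parquet as pq

# The serialized Arrow schema lives in the file's key-value metadata
# under the standard "ARROW:schema" key.
kv = pq.ParquetFile("test.parquet").metadata.metadata
print(b"ARROW:schema" in kv)           # True
print(pq.read_schema("test.parquet"))  # t: duration[s], recovered via [2]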

### What changes are included in this PR?

Pass the `key_value_metadata` through when building the `SchemaManifest`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

bugfix

* Closes: #37111

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
bkietz added this to the 14.0.0 milestone Sep 20, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024