-
Notifications
You must be signed in to change notification settings - Fork 2k
Closed
Bug
Copy link
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
A multi-column INNER JOIN with dictionary-encoded string keys fails at runtime when scanning Parquet tables with datafusion.execution.parquet.pushdown_filters = true.
The failure is:
Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))
This appears related to hash join dynamic filter (InList) pushdown over dictionary columns.
To Reproduce
Environment:
datafusion-cli 52.2.0
Steps:
- Save and run the attached SQL script with
datafusion-clifrom a clean directory:
datafusion-cli -q -f repro_datafusion_cli_multi_column_dictionary_join.sql- Script contents:
-- Required to reproduce the failure path.
SET datafusion.execution.parquet.pushdown_filters = true;
CREATE TABLE h2o AS
SELECT
to_timestamp_nanos(time_ns) AS time,
arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
temp
FROM (
VALUES
(200, 'CA', 'LA', 90.0),
(250, 'MA', 'Boston', 72.4),
(100, 'MA', 'Boston', 70.4),
(350, 'CA', 'LA', 90.0)
) AS t(time_ns, state, city, temp);
CREATE TABLE o2 AS
SELECT
to_timestamp_nanos(time_ns) AS time,
arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
temp,
reading
FROM (
VALUES
(250, 'MA', 'Boston', 53.4, 51.0),
(100, 'MA', 'Boston', 50.4, 50.0)
) AS t(time_ns, state, city, temp, reading);
CREATE EXTERNAL TABLE h2o_parquet_tbl STORED AS PARQUET LOCATION 'h2o_parquet';
CREATE EXTERNAL TABLE o2_parquet_tbl STORED AS PARQUET LOCATION 'o2_parquet';
SELECT h2o_parquet_tbl.temp AS h2o_temp, o2_parquet_tbl.temp AS o2_temp, o2_parquet_tbl.reading
FROM h2o_parquet_tbl
INNER JOIN o2_parquet_tbl ON h2o_parquet_tbl.time = o2_parquet_tbl.time
AND h2o_parquet_tbl.state = o2_parquet_tbl.state
AND h2o_parquet_tbl.city = o2_parquet_tbl.city
WHERE h2o_parquet_tbl.time >= '1970-01-01T00:00:00.000000050Z'
AND h2o_parquet_tbl.time <= '1970-01-01T00:00:00.000000300Z';- Observed output:
+-------+
| count |
+-------+
| 4 |
+-------+
+-------+
| count |
+-------+
| 2 |
+-------+
Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))
Expected behavior
The query should succeed and return (which it does if run directly from the input 0or if SET datafusion.execution.parquet.pushdown_filters = false is removed
+----------+---------+---------+
| h2o_temp | o2_temp | reading |
+----------+---------+---------+
| 70.4 | 50.4 | 50.0 |
| 72.4 | 53.4 | 51.0 |
+----------+---------+---------+
Additional context
- If
SET datafusion.execution.parquet.pushdown_filters = false, the query succeeds and returns the expected 2 rows.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working